Overview
Stats.CDNLogsSanitizer is a .NET Framework 4.7.2 console executable designed to be run on-demand by operators. It reads raw CDN access-log blobs (GZip-compressed, delimited text) from a source Azure Blob Storage container, passes each log line through a configurable pipeline of sanitizers, and writes the cleaned result to a destination Azure Blob Storage container.
The primary motivation is privacy compliance: raw CDN logs captured from services such as Azure CDN or China-region (Mooncake) CDN contain the original client IP address (c-ip field). Before those logs are forwarded to analytics pipelines or long-term storage, the IP addresses must be obfuscated so that no personally identifiable data is retained.
The README notes this tool is intended to be executed manually, not as a scheduled background job. It does extend JsonConfigurationJob and can be launched via the standard NuGet Jobs runner, but it has no automatic scheduling.
Role in the NuGetGallery Ecosystem
- CDN Log Collection: Upstream jobs (e.g., Stats.CollectAzureCdnLogs) collect raw CDN logs into Azure Blob Storage. This tool is the privacy-scrubbing step before those blobs are consumed by downstream analytics.
- Stats Pipeline: Sits between raw CDN ingestion and the stats import/aggregation jobs. Sanitized blobs can then be safely processed by Stats.ImportAzureCdnStatistics and related tools.
- Mooncake / China CDN: The example configuration targets core.chinacloudapi.cn, indicating the tool was created specifically to handle China-region CDN logs that required separate sanitization before analysis.
- Shared Infrastructure: Reuses Stats.AzureCdnLogs.Common for blob source/destination abstractions and lease management, keeping I/O concerns out of the sanitization logic.
Key Files and Classes
| File | Class / Interface | Purpose |
|---|---|---|
| Program.cs | Program | Entry point; instantiates Job and delegates to JobRunner.RunOnce. |
| Job.cs | Job | Extends JsonConfigurationJob; resolves configuration, builds AzureStatsLogSource, AzureStatsLogDestination, and the sanitizer list, then wires up Processor. |
| JobConfiguration.cs | JobConfiguration | POCO bound from the Initialization JSON section; holds all connection strings, container names, header definition, and tuning parameters. |
| Processor.cs | Processor | Core orchestrator; fetches blobs in batches, acquires leases, opens GZip streams, calls ProcessStream, then cleans up and releases leases. |
| LogHeaderMetadata.cs | LogHeaderMetadata | Parses the configurable log header string and delimiter; provides GetIndex(fieldName) for column lookup. |
| Utils.cs | Utils (extension methods) | GetFirstIndex string-array extension used by LogHeaderMetadata to locate a named column. |
| Sanitizers/ISanitizer.cs | ISanitizer | Single-method interface (SanitizeLogLine(ref string)) that all sanitizers implement. |
| Sanitizers/ClientIPSanitizer.cs | ClientIPSanitizer | The only built-in sanitizer; locates the c-ip column and replaces the raw IP with a value produced by NuGetGallery.Auditing.Obfuscator.ObfuscateIp. |
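
The ISanitizer contract and the c-ip column lookup described in the table can be illustrated with a minimal sketch. This is not the project's code: HeaderIndex is a simplified stand-in for LogHeaderMetadata, ClientIpSanitizerSketch and MaskLastOctet are hypothetical names, the header fields other than c-ip are invented for the demo, and the placeholder obfuscator does not reproduce what Obfuscator.ObfuscateIp actually does. The sketch also splits lines with a plain string.Split, whereas the real tool relies on ExtensionsUtils.GetSegmentsFromCSV.

```csharp
using System;

// Single-method contract, as described in the table above.
public interface ISanitizer
{
    void SanitizeLogLine(ref string line);
}

// Simplified stand-in for LogHeaderMetadata (hypothetical): splits the
// configured header once and answers column-index lookups by field name.
public sealed class HeaderIndex
{
    private readonly string[] _fields;
    public char Delimiter { get; }

    public HeaderIndex(string header, char delimiter)
    {
        Delimiter = delimiter;
        _fields = header.Split(delimiter);
    }

    // Returns the zero-based column index, or null when the field is absent.
    public int? GetIndex(string fieldName)
    {
        int i = Array.FindIndex(
            _fields,
            f => string.Equals(f, fieldName, StringComparison.OrdinalIgnoreCase));
        return i >= 0 ? (int?)i : null;
    }
}

// Hypothetical sanitizer: rewrites only the c-ip column of each line.
public sealed class ClientIpSanitizerSketch : ISanitizer
{
    private readonly int? _ipColumn;
    private readonly char _delimiter;
    private readonly Func<string, string> _obfuscate;

    public ClientIpSanitizerSketch(HeaderIndex header, Func<string, string> obfuscate)
    {
        _ipColumn = header.GetIndex("c-ip");
        _delimiter = header.Delimiter;
        _obfuscate = obfuscate;
    }

    public void SanitizeLogLine(ref string line)
    {
        if (_ipColumn == null || string.IsNullOrEmpty(line)) return;

        string[] segments = line.Split(_delimiter);
        if (_ipColumn.Value >= segments.Length) return; // short line: leave untouched

        segments[_ipColumn.Value] = _obfuscate(segments[_ipColumn.Value]);
        line = string.Join(_delimiter.ToString(), segments);
    }
}

public static class Demo
{
    // Placeholder only: zeroes the last IPv4 octet. NOT the real Obfuscator.ObfuscateIp.
    public static string MaskLastOctet(string ip)
    {
        int lastDot = ip.LastIndexOf('.');
        return lastDot < 0 ? ip : ip.Substring(0, lastDot) + ".0";
    }

    public static void Main()
    {
        var header = new HeaderIndex("timestamp c-ip cs-method", ' ');
        ISanitizer sanitizer = new ClientIpSanitizerSketch(header, MaskLastOctet);

        string line = "1700000000 203.0.113.42 GET";
        sanitizer.SanitizeLogLine(ref line);
        Console.WriteLine(line); // prints "1700000000 203.0.113.0 GET"
    }
}
```

Passing the obfuscation function in as a delegate keeps the sketch independent of NuGetGallery.Auditing; in the real project the sanitizer calls Obfuscator.ObfuscateIp directly.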
Dependencies
Internal Project References
| Project | Role |
|---|---|
| NuGet.Jobs.Common | Provides JsonConfigurationJob, JobRunner, Key Vault integration, and DI bootstrapping. |
| Stats.AzureCdnLogs.Common | Provides AzureStatsLogSource, AzureStatsLogDestination, AzureBlobLeaseManager, ILogSource, ILogDestination, ContentType, and ExtensionsUtils.GetSegmentsFromCSV. |
NuGet / Framework Dependencies (implicit via project refs)
| Package | Usage |
|---|---|
| Azure.Storage.Blobs | BlobServiceClient used directly in Job.ValidateAzureCloudStorageAccount. |
| Microsoft.Extensions.Logging | ILogger<T> injected into Job, Processor, and Azure log infrastructure. |
| Microsoft.Extensions.Options | IOptionsSnapshot<JobConfiguration> for typed config binding. |
| Microsoft.Extensions.Configuration | IConfigurationRoot wired through JsonConfigurationJob. |
| Autofac | IoC container used by the JsonConfigurationJob base class (empty override in this project). |
| NuGetGallery.Auditing | Obfuscator.ObfuscateIp called by ClientIPSanitizer to hash/anonymize IP addresses. |
Configuration Reference
Notable Patterns and Implementation Details
- Header passthrough: Processor.ProcessStream treats the first line of every log file as a header and writes it to the destination unchanged. All subsequent lines are passed through the sanitizer pipeline, so the column structure of the log is always preserved.
- Sanitizer pipeline is a simple foreach: Each ISanitizer receives the same ref string and mutates it in place. Order matters if multiple sanitizers target overlapping fields. Currently only ClientIPSanitizer is registered; adding more sanitizers requires modifying Job.InitializeJobConfiguration.
- Parallelism is batch-scoped: MaxBlobsToProcess controls how many blobs are fetched per iteration and processed concurrently via Task.WhenAll. The loop continues until a batch returns fewer blobs than MaxBlobsToProcess, signalling that the source is exhausted, or until the cancellation token fires.
- Default timeout is 4 days: DefaultExecutionTimeoutInSeconds = 345600 (345600 / 86400 seconds per day = 4 days). This unusually high value suggests the tool was designed to run unattended over very large historical log archives.
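
The header passthrough, the foreach pipeline, and the batch-scoped loop can be sketched together as follows. This is an illustration under stated assumptions, not the project's Processor: ProcessorSketch, RunBatchesAsync, the two callback parameters, and UpperCaseSanitizer are hypothetical, and the sketch works on in-memory text instead of leased, GZip-compressed blobs.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public interface ISanitizer
{
    void SanitizeLogLine(ref string line);
}

public static class ProcessorSketch
{
    // Writes the first line (the header) unchanged, then runs every later
    // line through each sanitizer in registration order.
    public static void ProcessStream(
        TextReader source, TextWriter destination, IReadOnlyList<ISanitizer> sanitizers)
    {
        string line = source.ReadLine();
        if (line == null) return;        // empty input: nothing to write
        destination.WriteLine(line);     // header passthrough

        while ((line = source.ReadLine()) != null)
        {
            foreach (ISanitizer sanitizer in sanitizers)
            {
                sanitizer.SanitizeLogLine(ref line); // each sanitizer mutates the same string
            }
            destination.WriteLine(line);
        }
    }

    // Fetches up to maxBlobsToProcess items per iteration, processes them
    // concurrently, and stops once a batch comes back short or cancellation
    // is requested. Both callbacks are hypothetical stand-ins.
    public static async Task<int> RunBatchesAsync(
        Func<int, Task<IReadOnlyList<string>>> fetchBatchAsync,
        Func<string, Task> processBlobAsync,
        int maxBlobsToProcess,
        CancellationToken token)
    {
        int processed = 0;
        int lastBatchSize;
        do
        {
            IReadOnlyList<string> batch = await fetchBatchAsync(maxBlobsToProcess);
            await Task.WhenAll(batch.Select(processBlobAsync));
            processed += batch.Count;
            lastBatchSize = batch.Count;
        }
        while (lastBatchSize == maxBlobsToProcess && !token.IsCancellationRequested);
        return processed;
    }
}

// Toy sanitizer used only to make the header/body difference visible.
public sealed class UpperCaseSanitizer : ISanitizer
{
    public void SanitizeLogLine(ref string line) { line = line.ToUpperInvariant(); }
}

public static class Demo
{
    public static void Main()
    {
        var output = new StringWriter();
        ProcessorSketch.ProcessStream(
            new StringReader("method path\nget /a\n"),
            output,
            new ISanitizer[] { new UpperCaseSanitizer() });
        Console.Write(output.ToString()); // header stays lowercase, body line is upper-cased

        var pending = new Queue<string>(new[] { "a", "b", "c", "d", "e" });
        int total = ProcessorSketch.RunBatchesAsync(
            max =>
            {
                var batch = new List<string>();
                while (batch.Count < max && pending.Count > 0) batch.Add(pending.Dequeue());
                return Task.FromResult<IReadOnlyList<string>>(batch);
            },
            blob => Task.CompletedTask,
            2,
            CancellationToken.None).GetAwaiter().GetResult();
        Console.WriteLine(total); // 5 items arrive as batches of 2, 2, 1
    }
}
```

Note that the stop condition depends on the fetch callback never returning an item twice; the sketch guarantees that with a queue, and how the real source avoids re-fetching processed blobs is outside what this section describes.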