
Overview

Stats.CDNLogsSanitizer is a .NET Framework 4.7.2 console executable designed to be run on demand by operators. It reads raw CDN access-log blobs (GZip-compressed, delimited text) from a source Azure Blob Storage container, passes each log line through a configurable pipeline of sanitizers, and writes the cleaned result to a destination Azure Blob Storage container. The primary motivation is privacy compliance: raw CDN logs captured from services such as Azure CDN or China-region (Mooncake) CDN contain the original client IP address (the c-ip field). Before those logs are forwarded to analytics pipelines or long-term storage, the IP addresses must be obfuscated so that no personally identifiable data is retained.
The README notes this tool is intended to be executed manually, not as a scheduled background job. It does extend JsonConfigurationJob and can be launched via the standard NuGet Jobs runner, but it has no automatic scheduling.

Role in the NuGetGallery Ecosystem

CDN Log Collection

Upstream jobs (e.g., Stats.CollectAzureCdnLogs) collect raw CDN logs into Azure Blob Storage. This tool is the privacy-scrubbing step before those blobs are consumed by downstream analytics.

Stats Pipeline

Sits between raw CDN ingestion and the stats import/aggregation jobs. Sanitized blobs can then be safely processed by Stats.ImportAzureCdnStatistics and related tools.

Mooncake / China CDN

The example configuration targets core.chinacloudapi.cn, indicating the tool was created specifically to handle China-region CDN logs that required separate sanitization before analysis.

Shared Infrastructure

Reuses Stats.AzureCdnLogs.Common for blob source/destination abstractions and lease management, keeping I/O concerns out of the sanitization logic.

Key Files and Classes

File | Class / Interface | Purpose
Program.cs | Program | Entry point; instantiates Job and delegates to JobRunner.RunOnce.
Job.cs | Job | Extends JsonConfigurationJob; resolves configuration, builds AzureStatsLogSource, AzureStatsLogDestination, and the sanitizer list, then wires up Processor.
JobConfiguration.cs | JobConfiguration | POCO bound from the Initialization JSON section; holds all connection strings, container names, header definition, and tuning parameters.
Processor.cs | Processor | Core orchestrator; fetches blobs in batches, acquires leases, opens GZip streams, calls ProcessStream, then cleans/releases leases.
LogHeaderMetadata.cs | LogHeaderMetadata | Parses the configurable log header string and delimiter; provides GetIndex(fieldName) for column lookup.
Utils.cs | Utils (extension methods) | GetFirstIndex string-array extension used by LogHeaderMetadata to locate a named column.
Sanitizers/ISanitizer.cs | ISanitizer | Single-method interface (SanitizeLogLine(ref string)) that all sanitizers implement.
Sanitizers/ClientIPSanitizer.cs | ClientIPSanitizer | The only built-in sanitizer; locates the c-ip column and replaces the raw IP with a value produced by NuGetGallery.Auditing.Obfuscator.ObfuscateIp.
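
The column lookup that LogHeaderMetadata is described as providing can be illustrated with a minimal sketch. HeaderIndexSketch is a hypothetical stand-in, not the project's class; trimming the field names, matching them case-insensitively, and returning null for a missing column are all assumptions made for this example.

```csharp
using System;

// Hypothetical stand-in for LogHeaderMetadata; the real class may behave differently.
public sealed class HeaderIndexSketch
{
    private readonly string[] _columns;

    public HeaderIndexSketch(string header, char delimiter)
    {
        if (header == null) throw new ArgumentNullException(nameof(header));

        // Assumption: names are trimmed, because the example header
        // separates fields with ", " rather than a bare ",".
        string[] parts = header.Split(delimiter);
        _columns = new string[parts.Length];
        for (int i = 0; i < parts.Length; i++)
        {
            _columns[i] = parts[i].Trim();
        }
    }

    // Returns the zero-based position of the named column, or null when absent.
    public int? GetIndex(string fieldName)
    {
        int index = Array.FindIndex(
            _columns,
            c => string.Equals(c, fieldName, StringComparison.OrdinalIgnoreCase));
        return index >= 0 ? (int?)index : null;
    }
}
```

With the header "c-ip, timestamp, cs-method" and the delimiter ',', GetIndex("timestamp") returns 1 in this sketch. Without the Trim call the second column would be named " timestamp" and the lookup would miss it.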

Dependencies

Internal Project References

Project | Role
NuGet.Jobs.Common | Provides JsonConfigurationJob, JobRunner, Key Vault integration, and DI bootstrapping.
Stats.AzureCdnLogs.Common | Provides AzureStatsLogSource, AzureStatsLogDestination, AzureBlobLeaseManager, ILogSource, ILogDestination, ContentType, and ExtensionsUtils.GetSegmentsFromCSV.

NuGet / Framework Dependencies (implicit via project refs)

Package | Usage
Azure.Storage.Blobs | BlobServiceClient used directly in Job.ValidateAzureCloudStorageAccount.
Microsoft.Extensions.Logging | ILogger<T> injected into Job, Processor, and Azure log infrastructure.
Microsoft.Extensions.Options | IOptionsSnapshot<JobConfiguration> for typed config binding.
Microsoft.Extensions.Configuration | IConfigurationRoot wired through JsonConfigurationJob.
Autofac | IoC container used by JsonConfigurationJob base class (empty override in this project).
NuGetGallery.Auditing | Obfuscator.ObfuscateIp called by ClientIPSanitizer to hash/anonymize IP addresses.

Configuration Reference

{
  "Initialization": {
    "AzureAccountConnectionStringSource": "<source storage connection string>",
    "AzureAccountConnectionStringDestination": "<dest storage connection string>",
    "AzureContainerNameSource": "<source container>",
    "AzureContainerNameDestination": "<dest container>",
    "BlobPrefix": "",
    "LogHeader": "c-ip, timestamp, cs-method, cs-uri-stem, http-ver, sc-status, sc-bytes, c-referer, c-user-agent, rs-duration(ms), hit-miss, s-ip",
    "LogHeaderDelimiter": ",",
    "ExecutionTimeoutInSeconds": 345600,
    "MaxBlobsToProcess": 4
  }
}
Launch command:
Stats.CDNLogsSanitizer.exe \
  -Configuration "Settings\chinadev.json" \
  -InstrumentationKey "<app-insights-key>" \
  -verbose true

Notable Patterns and Implementation Details

Header passthrough: Processor.ProcessStream treats the first line of every log file as a header and writes it to the destination unchanged. All subsequent lines are passed through the sanitizer pipeline. This means the column structure of the log is always preserved.
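
A minimal sketch of the header-passthrough behaviour, using only System.IO.Compression. HeaderPassthroughSketch and its Process method are hypothetical names, UTF-8 encoding is an assumption, and the sanitize delegate stands in for the real sanitizer pipeline. This is not the code of Processor.ProcessStream.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class HeaderPassthroughSketch
{
    // Copies a GZip-compressed log from input to output. The first line is
    // written unchanged; every later line goes through the sanitize delegate.
    public static void Process(Stream input, Stream output, Func<string, string> sanitize)
    {
        // leaveOpen: true keeps the caller's streams usable after this method returns.
        using (var gzipIn = new GZipStream(input, CompressionMode.Decompress, leaveOpen: true))
        using (var reader = new StreamReader(gzipIn, Encoding.UTF8))
        using (var gzipOut = new GZipStream(output, CompressionMode.Compress, leaveOpen: true))
        using (var writer = new StreamWriter(gzipOut, new UTF8Encoding(false)))
        {
            bool isHeader = true;
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                writer.WriteLine(isHeader ? line : sanitize(line));
                isHeader = false;
            }
        }
    }
}
```

Note that WriteLine emits the platform's line terminator, so a sketch like this may change line endings even though the column structure is preserved.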
Sanitizer pipeline is a simple foreach: Each ISanitizer receives the same line by reference (ref string) and replaces it with its sanitized version, so every sanitizer sees the output of the one before it. Order matters if multiple sanitizers target overlapping fields. Currently only ClientIPSanitizer is registered; adding more sanitizers requires modifying Job.InitializeJobConfiguration.
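
The pipeline shape can be sketched as follows. ISanitizer is re-declared locally so the example is self-contained, and its void return type is an assumption. ColumnMaskSanitizer and SanitizerPipelineSketch are hypothetical, the mask delegate is a placeholder rather than Obfuscator.ObfuscateIp, and the plain Split call is a simplification that ignores any quoting rules the project's GetSegmentsFromCSV helper may handle.

```csharp
using System;
using System.Collections.Generic;

// Local re-declaration mirroring the single-method shape; return type assumed.
public interface ISanitizer
{
    void SanitizeLogLine(ref string line);
}

// Hypothetical sanitizer that rewrites one column of a delimited line.
public sealed class ColumnMaskSanitizer : ISanitizer
{
    private readonly int _index;
    private readonly char _delimiter;
    private readonly Func<string, string> _mask;

    public ColumnMaskSanitizer(int index, char delimiter, Func<string, string> mask)
    {
        _index = index;
        _delimiter = delimiter;
        _mask = mask;
    }

    public void SanitizeLogLine(ref string line)
    {
        if (string.IsNullOrEmpty(line)) return;

        string[] fields = line.Split(_delimiter);
        if (_index < 0 || _index >= fields.Length) return; // short line: leave untouched

        fields[_index] = _mask(fields[_index]);
        // Strings are immutable, so the ref parameter is reassigned to a new string.
        line = string.Join(_delimiter.ToString(), fields);
    }
}

public static class SanitizerPipelineSketch
{
    // Each sanitizer sees the output of the previous one, so order matters.
    public static string Apply(string line, IEnumerable<ISanitizer> sanitizers)
    {
        foreach (ISanitizer sanitizer in sanitizers)
        {
            sanitizer.SanitizeLogLine(ref line);
        }
        return line;
    }
}
```

For example, a ColumnMaskSanitizer built with index 0, delimiter ',', and a mask that always returns "0.0.0.0" turns "203.0.113.7,GET" into "0.0.0.0,GET".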
SAS connection string normalization quirk: Job.cs contains the line:
connectionString.Replace("SharedAccessSignature=?", "SharedAccessSignature=")
This strips a leading ? that some SAS token generators prepend to the query string. If this character is absent the replace is a no-op, but it is a silent data fix that operators should be aware of when constructing connection strings.
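
A small illustration of the normalization. The wrapper class SasNormalizationSketch is hypothetical and the connection-string values in the usage note are made-up placeholders. Because .NET strings are immutable, Replace returns a new string, so the fix only takes effect where the return value is used.

```csharp
public static class SasNormalizationSketch
{
    // Removes a '?' that directly follows "SharedAccessSignature=".
    // Returns the input unchanged when that exact sequence is absent.
    public static string Normalize(string connectionString)
    {
        return connectionString.Replace("SharedAccessSignature=?", "SharedAccessSignature=");
    }
}
```

Given the placeholder value "BlobEndpoint=https://example.invalid/;SharedAccessSignature=?sv=1&sig=abc", Normalize returns "BlobEndpoint=https://example.invalid/;SharedAccessSignature=sv=1&sig=abc". A value that already lacks the ? comes back unchanged.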
Lease-per-blob, no retry logic: Processor.ProcessBlobAsync acquires a blob lease before processing. If the lease cannot be taken (blob already locked by another process), the blob is silently skipped for that run. There is no retry queue or dead-letter mechanism built into this tool.
Blob prefix filtering: Set BlobPrefix in configuration to target a specific date partition (e.g., 2024/01/15/) and avoid reprocessing the entire container on every run. Leave it empty to process all unprocessed blobs.
Parallelism is batch-scoped: MaxBlobsToProcess controls how many blobs are fetched per iteration and processed concurrently via Task.WhenAll. The loop continues until a batch returns fewer blobs than MaxBlobsToProcess, signalling the source is exhausted (or the cancellation token fires).
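
The batch-scoped loop can be sketched as below. BatchLoopSketch, fetchBatch, and processBlob are hypothetical names standing in for the log source and the per-blob work. The sketch assumes each fetch returns blobs that have not yet been processed, and it leaves out lease handling entirely.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class BatchLoopSketch
{
    // Fetches up to maxBlobsToProcess blobs per iteration, processes them
    // concurrently, and stops once a batch comes back short.
    public static async Task<int> RunAsync(
        Func<int, CancellationToken, Task<IReadOnlyList<string>>> fetchBatch,
        Func<string, CancellationToken, Task> processBlob,
        int maxBlobsToProcess,
        CancellationToken token)
    {
        int processed = 0;
        while (!token.IsCancellationRequested)
        {
            IReadOnlyList<string> batch = await fetchBatch(maxBlobsToProcess, token);

            // All blobs in the batch run concurrently; the next fetch waits for all of them.
            await Task.WhenAll(batch.Select(blob => processBlob(blob, token)));
            processed += batch.Count;

            // A short batch signals that the source is exhausted.
            if (batch.Count < maxBlobsToProcess) break;
        }
        return processed;
    }
}
```

One consequence of this design is that a single slow blob holds up the whole batch, because the next fetch does not start until Task.WhenAll completes.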
Default timeout is 4 days: DefaultExecutionTimeoutInSeconds = 345600 (345600 s / 86400 s per day = 4 days). This unusually high value suggests the tool was designed to run unattended over very large historical log archives without operator supervision.