Overview

Stats.PostProcessReports is a .NET Framework 4.7.2 console application that runs as a scheduled background job. Its sole responsibility is to take large, aggregated download statistics blobs produced upstream by a statistics pipeline (e.g., HDInsight/Spark jobs) and fan them out into thousands of individual per-package JSON files that the NuGet Gallery frontend can serve directly. Each line in an input JSON blob represents one package’s detailed popularity statistics. The job reads each line, extracts the PackageId, and writes a file named recentpopularitydetail_<packageid>.json to a publicly accessible Azure Blob Storage container.
The job is idempotent. It uses sentinel files (_SUCCESS, _WorkCopySucceeded, _JobSucceeded) and Azure Blob metadata to detect whether previous runs completed successfully, allowing it to skip redundant work when re-triggered.
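The skip logic can be pictured as a small decision function. This is a minimal sketch in Python rather than the job's C# code, and it encodes an assumed reading of the three markers: _SUCCESS means the upstream output is complete, _WorkCopySucceeded means the copy to work storage finished, and _JobSucceeded means the fan-out finished. The function name plan_run and its return values are hypothetical, and the sketch leaves out how the real job tells a new upstream batch from one it has already handled.

```python
def plan_run(source_names, work_names):
    """Decide what a run should do from the marker files that exist.

    Assumed semantics (not confirmed by the source):
      _SUCCESS            upstream finished writing the source blobs
      _WorkCopySucceeded  source blobs were fully copied to work storage
      _JobSucceeded       every work blob was fanned out
    """
    if "_SUCCESS" not in source_names:
        return "wait"               # upstream output is not complete yet
    if "_JobSucceeded" in work_names:
        return "skip"               # a previous run already finished
    if "_WorkCopySucceeded" in work_names:
        return "process"            # copy is done; only the fan-out remains
    return "copy-then-process"      # nothing usable from earlier runs


fresh = plan_run({"_SUCCESS", "blob.json"}, set())
resumed = plan_run({"_SUCCESS", "blob.json"}, {"blob.json", "_WorkCopySucceeded"})
```

Under these assumptions, re-triggering the job after a completed run returns "skip", which is what makes repeated invocations harmless.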

Role in System

This job sits in the statistics post-processing pipeline of the NuGetGallery ecosystem:
  1. An upstream process (HDInsight or similar) produces aggregated report blobs in a source container and writes a _SUCCESS marker when complete.
  2. Stats.PostProcessReports detects the marker, copies the source blobs to a work container for safe in-place processing, then fans out each JSON line to a destination container.
  3. The Gallery web app reads recentpopularitydetail_<id>.json from the destination container to display per-package download statistics on package detail pages.
Source Container (private)          Work Container (private)         Destination Container (public)
  detailed-reports/                   work-path/                       dest-path/
    _SUCCESS              ----copy---> blob.json  ---fan-out--->          recentpopularitydetail_foo.json
    blob.json                          _WorkCopySucceeded                 recentpopularitydetail_bar.json
                                       _JobSucceeded                      ...
The destination container is configured with enablePublicAccess: true, so the individual JSON files are publicly readable without authentication and can be fetched directly by a CDN or a browser.
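The fan-out step in the flow above can be sketched as follows. This is an illustration in Python, not the job's C# code. The function name fan_out, the dict standing in for the destination container, and lower-casing the package id are assumptions; only the PackageId property and the recentpopularitydetail_<packageid>.json naming come from the description above.

```python
import json


def fan_out(blob_text, destination):
    """Write one file per JSON line; return (files_created, lines_failed)."""
    files_created = 0
    lines_failed = 0
    for line in blob_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        try:
            package_id = json.loads(line)["PackageId"]
            # Assumption: the id is lower-cased when building the file name.
            name = f"recentpopularitydetail_{package_id.lower()}.json"
        except (ValueError, KeyError, TypeError, AttributeError):
            # Malformed JSON, missing PackageId, or a non-string id.
            lines_failed += 1
            continue
        destination[name] = line  # stand-in for a blob upload
        files_created += 1
    return files_created, lines_failed


destination = {}
blob = (
    '{"PackageId": "Foo", "Downloads": 10}\n'
    '{"PackageId": "Bar", "Downloads": 3}\n'
    'not json\n'
)
result = fan_out(blob, destination)
```

Here result is (2, 1): two files are written and the malformed third line is counted as failed instead of aborting the whole blob.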

Key Files and Classes

File                                Class / Interface                Purpose
Program.cs                          Program                          Entry point; delegates to JobRunner.Run(job, args) from NuGet.Jobs
Job.cs                              Job                              Extends JsonConfigurationJob; wires Autofac DI for three named IStorage instances and DetailedReportPostProcessor
IDetailedReportPostProcessor.cs     IDetailedReportPostProcessor     Single-method interface (CopyReportsAsync) isolating the processing contract
DetailedReportPostProcessor.cs      DetailedReportPostProcessor      Core logic: enumerates blobs, copies to work storage, fans out lines in parallel, writes sentinel files, stores metadata
PostProcessReportsConfiguration.cs  PostProcessReportsConfiguration  Strongly-typed configuration POCO bound from Configuration section; controls container names, paths, and parallelism degree
BlobStatistics.cs                   BlobStatistics                   Thread-safe per-blob counters for TotalLineCount, FilesCreated, and LinesFailed using Interlocked
TotalStats.cs                       TotalStats                       Simple mutable accumulator for job-level aggregate statistics
LineProcessingContext.cs            LineProcessingContext            Carries a single line’s number and raw JSON string through the ConcurrentBag processing queue
Telemetry/ITelemetryService.cs      ITelemetryService                Contract for emitting per-file and job-total metrics, plus source data age
Telemetry/TelemetryService.cs       TelemetryService                 Implements ITelemetryService via ITelemetryClient (Application Insights wrapper from NuGet.Services.Logging)

Dependencies

Internal Project References

Project                 Purpose
NuGet.Jobs.Common       Provides JsonConfigurationJob, JobRunner, StorageMsiConfiguration, ConfigureStorageMsi extension
NuGet.Services.Storage  Provides IStorage, AzureStorage, AzureStorageFactory, BlobServiceClientFactory, StorageListItem, StringStorageContent

Key NuGet / Framework Dependencies (resolved transitively)

Package                                   Usage
Autofac                                   DI container; keyed IStorage registrations via ContainerBuilder
Azure.Storage.Blobs                       Underlying Azure SDK for blob operations
Azure.Identity                            ManagedIdentityCredential, DefaultAzureCredential for MSI auth
Microsoft.Extensions.Configuration        IConfigurationRoot for JSON config loading
Microsoft.Extensions.DependencyInjection  IServiceCollection service registration
Microsoft.Extensions.Logging              Structured logging throughout
Microsoft.Extensions.Options              IOptionsSnapshot<T> for live config access
Newtonsoft.Json                           Deserializes each JSON line to extract PackageId
NuGet.Services.Logging                    ITelemetryClient / Application Insights integration

Configuration Reference

All settings live under the "Configuration" JSON section, bound to PostProcessReportsConfiguration:
Property                        Description
StorageAccount                  Azure Blob endpoint or connection string (used for all three containers)
SourceContainerName             Container holding upstream-generated report blobs
SourcePath                      Path prefix within the source container
DetailedReportDirectoryName     Appended to SourcePath to locate the detailed report directory
WorkContainerName               Private container used as a staging/work area
WorkPath                        Path prefix within the work container
DestinationContainerName        Publicly accessible container for individual per-package JSON files
DestinationPath                 Path prefix within the destination container
ReportWriteDegreeOfParallelism  Number of concurrent writer tasks per blob (default: 10)
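
To make the shape of the section concrete, the sketch below parses an example "Configuration" section and applies the documented default of 10 for ReportWriteDegreeOfParallelism. It is written in Python for illustration only. Every value in the example is a hypothetical placeholder, the helper load_settings is not part of the job, and treating all other properties as required is an assumption.

```python
import json

# Placeholder values for illustration; only the property names come from
# the configuration reference.
EXAMPLE_SETTINGS = """
{
  "Configuration": {
    "StorageAccount": "https://example.blob.core.windows.net/",
    "SourceContainerName": "source-container",
    "SourcePath": "source-path/",
    "DetailedReportDirectoryName": "detailed-reports",
    "WorkContainerName": "work-container",
    "WorkPath": "work-path/",
    "DestinationContainerName": "destination-container",
    "DestinationPath": "dest-path/"
  }
}
"""

REQUIRED = (
    "StorageAccount", "SourceContainerName", "SourcePath",
    "DetailedReportDirectoryName", "WorkContainerName", "WorkPath",
    "DestinationContainerName", "DestinationPath",
)


def load_settings(text):
    """Read the Configuration section and apply the documented default."""
    section = dict(json.loads(text)["Configuration"])
    missing = [name for name in REQUIRED if name not in section]
    if missing:
        # Assumption: everything except the parallelism degree is required.
        raise ValueError("missing settings: " + ", ".join(missing))
    section.setdefault("ReportWriteDegreeOfParallelism", 10)
    return section


settings = load_settings(EXAMPLE_SETTINGS)
```

Because the example omits ReportWriteDegreeOfParallelism, settings carries the default value of 10 for it.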

Notable Patterns and Implementation Details

Three-Storage Architecture

Three distinct IStorage instances (source, work, destination) are registered as Autofac keyed services and injected by parameter name into DetailedReportPostProcessor. This prevents accidental cross-container writes.

Parallel Fan-Out via ConcurrentBag

Lines read from each blob are placed into a ConcurrentBag<LineProcessingContext>. Multiple writer tasks (controlled by ReportWriteDegreeOfParallelism) drain the bag concurrently, each calling _destinationStorage.Save(...).
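
The drain pattern can be sketched in Python with threads standing in for the writer tasks. This is an illustration under assumptions, not the job's code: it assumes all lines are loaded before the writers start, it uses queue.SimpleQueue (which is first-in, first-out, whereas ConcurrentBag makes no ordering promise), and the names drain_in_parallel and save are hypothetical.

```python
import queue
import threading


def drain_in_parallel(lines, save, degree_of_parallelism=10):
    """Load every line into a shared collection, then let N workers empty it."""
    work = queue.SimpleQueue()
    for number, text in enumerate(lines, start=1):
        work.put((number, text))  # line number travels with the raw text

    def worker():
        while True:
            try:
                number, text = work.get_nowait()
            except queue.Empty:
                return  # nothing left to take, so this worker is done
            save(number, text)

    threads = [threading.Thread(target=worker) for _ in range(degree_of_parallelism)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()


saved = []
saved_lock = threading.Lock()


def record(number, text):
    # Stand-in for the destination write; the lock protects the shared list.
    with saved_lock:
        saved.append((number, text))


drain_in_parallel(["a", "b", "c", "d", "e"], record, degree_of_parallelism=3)
```

Each line is handled by exactly one worker, but the order in which lines are saved is not defined, so anything downstream should not depend on it.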

Blob Metadata as Resume Checkpoint

After processing each blob, the job writes TotalLines, LinesFailed, and FilesCreated as Azure Blob metadata. On the next run, if this metadata exists, the blob is skipped and totals are reconstructed from metadata alone.
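
A minimal sketch of this checkpoint, in Python and with a plain dict standing in for a blob and its metadata. The metadata key names are taken from the paragraph above; the function process_blob, the blob dict layout, and the process_lines callback are hypothetical. Azure Blob metadata values are strings, so the counters are stored as text and parsed back on resume.

```python
METADATA_KEYS = ("TotalLines", "LinesFailed", "FilesCreated")


def process_blob(blob, process_lines):
    """Return (stats, skipped); skip blobs whose metadata records a finished pass."""
    metadata = blob["metadata"]
    if all(key in metadata for key in METADATA_KEYS):
        # Already processed: rebuild the totals from metadata alone.
        return {key: int(metadata[key]) for key in METADATA_KEYS}, True
    stats = process_lines(blob["content"])
    # Record the counters only after processing, as the resume checkpoint.
    metadata.update({key: str(stats[key]) for key in METADATA_KEYS})
    return stats, False


calls = []


def fake_process(content):
    # Hypothetical processor: counts lines and reports no failures.
    calls.append(content)
    count = len(content.splitlines())
    return {"TotalLines": count, "LinesFailed": 0, "FilesCreated": count}


blob = {"content": "a\nb\n", "metadata": {}}
first = process_blob(blob, fake_process)
second = process_blob(blob, fake_process)
```

The second call returns the same totals without invoking the processor again, which mirrors how a re-triggered run can report accurate job totals while doing no work for finished blobs.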

Sentinel File Ordering

When cleaning up work storage, flag files (_WorkCopySucceeded, _JobSucceeded) are deleted before data blobs. The FlagFilesFirst sort helper ensures a failed mid-deletion run cannot leave the system in an inconsistent state where a sentinel exists but data does not.
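
The ordering can be illustrated with a stable sort that moves sentinel names to the front. This Python sketch is not the FlagFilesFirst helper itself: the function names, the work-path/ prefixes, and the blob names are hypothetical, and it assumes a sentinel is recognized by the last segment of its path.

```python
FLAG_FILES = {"_WorkCopySucceeded", "_JobSucceeded"}


def flag_files_first(names):
    """Order names so sentinel files come before data blobs."""
    # False sorts before True, and sorted() is stable, so the relative
    # order inside each group is preserved.
    return sorted(names, key=lambda name: name.rsplit("/", 1)[-1] not in FLAG_FILES)


def clean_work_storage(names, delete):
    """Delete sentinels first so an interrupted cleanup never leaves a
    sentinel behind without the data it vouches for."""
    for name in flag_files_first(names):
        delete(name)


listing = [
    "work-path/blob1.json",
    "work-path/_JobSucceeded",
    "work-path/blob2.json",
    "work-path/_WorkCopySucceeded",
]
ordered = flag_files_first(listing)
```

If deletion stops partway through, whatever remains is data without sentinels, which a later run treats as incomplete rather than as a finished job.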

Process-Wide Connection Limit

The job raises ServicePointManager.DefaultConnectionLimit to ReportWriteDegreeOfParallelism + 10 on startup (20 with the default of 10). This is a process-wide .NET Framework setting. That is expected for a standalone net472 console job, but it would also affect any other code in the same process if a shared-process hosting model is ever adopted.

MSI Authentication

The MSI authentication path contains a compile-time #if DEBUG branch: debug builds use DefaultAzureCredential for local development, and release builds use ManagedIdentityCredential. Both system-assigned and user-assigned (via ManagedIdentityClientId) identities are supported.

Deployment

The project packages via a .nuspec file targeting net472 binaries and includes PowerShell deployment scripts (PreDeploy.ps1, PostDeploy.ps1, Functions.ps1) plus nssm.exe for installing the job as a Windows Service. Deployment is orchestrated through Octopus Deploy, with service names and scripts parameterized via $OctopusParameters.