PackageHash

NuGet.Services.PackageHash is a cursor-driven background job (console executable) that crawls all Available packages in the NuGetGallery database and re-derives their SHA-512 hash from the actual .nupkg bytes stored in blob storage. Any discrepancy between the stored hash and the recomputed hash is logged as an error and appended to a local results.csv file for follow-up investigation.

Overview

Hash Verification

Downloads each .nupkg from blob storage and recomputes its SHA-512 digest, then compares it against the value stored in the Gallery database.

Cursor-Based Progress

Uses a durable file-backed cursor (one per bucket) so the job can be interrupted and resumed without re-processing already-validated packages.

Horizontal Sharding

Work is partitioned across N independent workers via a consistent-hash scheme on {id}/{version}, requiring no shared coordination layer.

Failure Reporting

Mismatched hashes are written to results.csv (appended, with a header row) in the working directory for offline analysis.

Role in the NuGetGallery Ecosystem

PackageHash sits in the integrity-verification tier of the gallery’s operational tooling. It is not invoked during normal package ingestion; instead it runs as a periodic or on-demand sweep job to detect silent corruption or tampering of package blobs after they have already been accepted. It depends on:
  • NuGet.Services.Cursor — provides the DurableCursor abstraction for tracking progress as a UTC timestamp persisted in a local JSON file.
  • Validation.Common.Job — provides ValidationJobBase (itself extending NuGet.Jobs.JobBase) which wires up Autofac DI, configuration binding, IFileDownloader, CryptographyService, and the structured-logging infrastructure shared by all Gallery background jobs.

Key Files and Classes

| File | Class / Interface | Purpose |
| --- | --- | --- |
| Program.cs | Program | Entry point; bootstraps Job via JobRunner.Run. |
| Job.cs | Job : ValidationJobBase | Parses --bucket-number / --bucket-count CLI args, registers all DI services, and delegates execution to IPackageHashProcessor. |
| PackageHashProcessor.cs | PackageHashProcessor | Outer loop: loads the cursor, queries the DB for a batch of packages newer than the cursor, trims the batch to avoid boundary races, partitions by bucket, and advances the cursor on success. |
| BatchProcessor.cs | BatchProcessor | Inner loop: builds a ConcurrentBag of (source, package) work items and fans them out across a configurable number of parallel tasks, collecting InvalidPackageHash failures. |
| PackageHashCalculator.cs | PackageHashCalculator | Downloads the .nupkg from a PackagesContainer URL and calls CryptographyService.GenerateHash to compute the actual SHA-512 digest. |
| ConsistentHash.cs | ConsistentHash (static) | Assigns a package to a zero-based bucket index by XOR-folding a SHA-256 hash of the lowercased id/version string, then taking modulo bucket count. |
| ResultRecorder.cs | ResultRecorder | Appends InvalidPackageHash records to results.csv with columns: Type, URL, ID, Version, ExpectedHash, ActualHash. |
| PackageHashConfiguration.cs | PackageHashConfiguration | POCO bound from the PackageHash config section; exposes BatchSize, DegreeOfParallelism, and a list of PackageSource objects. |
| PackageHash.cs | PackageHash | Value object pairing a PackageIdentity with its expected base64-encoded hash digest from the database. |
| InvalidPackageHash.cs | InvalidPackageHash | Failure record carrying the PackageSource, the expected PackageHash, and the actual recomputed hash string. |
| PackageSource.cs | PackageSource | Configuration model: a Url and a PackageSourceType enum value. |
| PackageSourceType.cs | PackageSourceType | Enum with a single value PackagesContainer (Azure Blob flat-container layout). |
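
The comparison that PackageHashCalculator and the processor perform together can be illustrated with a minimal sketch. It uses the framework's SHA512 class directly instead of CryptographyService.GenerateHash, and the class and method names (HashCheckSketch, ComputeSha512Base64, Matches) are hypothetical, chosen only for this example.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class HashCheckSketch
{
    // Computes the SHA-512 digest of the package bytes and base64-encodes it,
    // the same encoding the document describes for the stored hash.
    public static string ComputeSha512Base64(Stream nupkg)
    {
        using (var sha512 = SHA512.Create())
        {
            return Convert.ToBase64String(sha512.ComputeHash(nupkg));
        }
    }

    // Returns true when the recomputed digest equals the expected value.
    // Base64 is case-sensitive, so the comparison is ordinal.
    public static bool Matches(Stream nupkg, string expectedBase64)
    {
        return string.Equals(
            ComputeSha512Base64(nupkg), expectedBase64, StringComparison.Ordinal);
    }
}
```

A SHA-512 digest is 64 bytes, so its base64 form is always 88 characters long; a stored value of any other length can be treated as suspect before any download takes place.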

Dependencies

Internal Project References

| Project | Role |
| --- | --- |
| NuGet.Services.Cursor | DurableCursor: file-backed cursor for tracking the last-processed timestamp per bucket. |
| Validation.Common.Job | ValidationJobBase, IFileDownloader, CryptographyService, FileStorageFactory, shared job runner infrastructure. |

NuGet Packages (resolved transitively via project refs)

| Package | Usage |
| --- | --- |
| Autofac / Microsoft.Extensions.DependencyInjection | DI container wiring in Job.ConfigureJobServices. |
| Microsoft.Extensions.Logging | Structured logging throughout all classes. |
| Microsoft.Extensions.Options | IOptionsSnapshot<PackageHashConfiguration> for live config reads. |
| NuGet.Packaging.Core | PackageIdentity value type (id + version). |
| NuGet.Versioning | NuGetVersion.Parse for normalized version strings. |
| EntityFramework (via NuGetGallery) | IEntityRepository<Package> / DbContext used in PackageHashProcessor to query the database. |

The project targets net472 (full .NET Framework 4.7.2), not .NET Core or .NET 5+. This is consistent with other background jobs in the NuGetGallery solution that depend on Entity Framework 6 and the classic NuGet.Jobs runner.

Notable Patterns and Implementation Details

Cursor Boundary Safety

PackageHashProcessor trims any packages that share the batch’s maximum timestamp before advancing the cursor. This prevents a race condition where two packages with the same Created/LastEdited timestamp straddle a batch boundary and one would be silently skipped.
var maxTimestamp = fullBatch.Select(GetMaxTimestamp).Max();
var trimmedBatch = fullBatch
    .Where(x => GetMaxTimestamp(x) < maxTimestamp)
    .ToList();
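
To make the effect of the trim concrete, here is a self-contained sketch of the same two statements. The Row type is a hypothetical stand-in for the Gallery package entity, and the body of GetMaxTimestamp (the later of Created and LastEdited) is an assumption based on the description above, not the job's confirmed implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BatchTrimSketch
{
    // Hypothetical stand-in for a Gallery package row; not the real entity type.
    public sealed class Row
    {
        public string Id { get; set; }
        public DateTime Created { get; set; }
        public DateTime? LastEdited { get; set; }
    }

    // Assumption: the effective timestamp is the later of Created and LastEdited.
    public static DateTime GetMaxTimestamp(Row row)
    {
        return row.LastEdited.HasValue && row.LastEdited.Value > row.Created
            ? row.LastEdited.Value
            : row.Created;
    }

    // Drops every row that shares the batch's maximum timestamp.
    public static List<Row> Trim(IReadOnlyList<Row> fullBatch)
    {
        if (fullBatch.Count == 0)
        {
            return new List<Row>(); // Max() would throw on an empty sequence.
        }

        var maxTimestamp = fullBatch.Select(GetMaxTimestamp).Max();
        return fullBatch.Where(x => GetMaxTimestamp(x) < maxTimestamp).ToList();
    }
}
```

For a batch with effective timestamps 10:00, 10:05, and 10:05, Trim keeps only the 10:00 row; both 10:05 rows are left for a later query. If every row in a batch shares one timestamp, this sketch returns an empty list. How the real processor handles that case is not covered here.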

One-Hour Lookback Window

When the cursor value is within one hour of UtcNow, the processor substitutes UtcNow - 1h as the effective cursor value and skips advancing the cursor after the batch. This catches packages whose LastEdited timestamps arrive slightly out of order.
When the lookback window is applied, ProcessBatchAsync always returns null, which halts the do-while loop for the current job run, and the cursor is not moved forward. This is intentional: the job is expected to be re-invoked periodically rather than run continuously.
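
The lookback rule can be sketched as a small pure function. The names LookbackSketch and GetEffectiveCursor are hypothetical, the use of DateTimeOffset is an assumption, and whether the real comparison at exactly one hour is inclusive or exclusive is not stated in this document.

```csharp
using System;

public static class LookbackSketch
{
    public static readonly TimeSpan Lookback = TimeSpan.FromHours(1);

    // Returns the timestamp to query from. When the stored cursor is less than
    // one hour behind utcNow, the query starts at utcNow - 1h instead, and
    // lookbackApplied tells the caller not to advance the cursor afterwards.
    public static DateTimeOffset GetEffectiveCursor(
        DateTimeOffset cursor,
        DateTimeOffset utcNow,
        out bool lookbackApplied)
    {
        var threshold = utcNow - Lookback;
        lookbackApplied = cursor > threshold;
        return lookbackApplied ? threshold : cursor;
    }
}
```

Passing utcNow as a parameter rather than reading the clock inside the method keeps the rule easy to unit test.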

Consistent Hash Sharding

ConsistentHash.DetermineBucket hashes the lowercase {id}/{version} key with SHA-256, then XOR-folds the 32-byte result into a single int before taking % bucketCount. This gives a deterministic, stateless partition with no central coordinator.
ConsistentHash.DetermineBucket can return a negative bucket index when the XOR-folded integer is negative, because the % operator in C# takes the sign of the dividend. The current bucketIndex == bucketNumber - 1 comparison is a plain equality check and happens to be safe as written, but callers should keep this in mind if the returned index is ever used elsewhere, for example as an array index.
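
The sign behavior is easy to reproduce in isolation. The sketch below is an assumption-laden reconstruction of the scheme described above, not the actual ConsistentHash source: the fold width (4-byte words), UTF-8 encoding, and all names (BucketSketch, Fold, DetermineBucketRaw, NonNegativeMod) are illustrative.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class BucketSketch
{
    // XOR-folds a hash into one int by combining consecutive 4-byte words.
    public static int Fold(byte[] hash)
    {
        int folded = 0;
        for (int i = 0; i + 4 <= hash.Length; i += 4)
        {
            folded ^= BitConverter.ToInt32(hash, i);
        }
        return folded;
    }

    // Raw remainder: the result lies in (-bucketCount, bucketCount) and is
    // negative whenever the folded value is negative and not a multiple.
    public static int DetermineBucketRaw(string id, string version, int bucketCount)
    {
        var key = (id + "/" + version).ToLowerInvariant();
        using (var sha256 = SHA256.Create())
        {
            var hash = sha256.ComputeHash(Encoding.UTF8.GetBytes(key));
            return Fold(hash) % bucketCount;
        }
    }

    // One possible normalization into [0, modulus) for small positive moduli.
    public static int NonNegativeMod(int value, int modulus)
    {
        return ((value % modulus) + modulus) % modulus;
    }
}
```

In C#, -7 % 3 evaluates to -1, whereas NonNegativeMod(-7, 3) yields 2. Any code that reuses the bucket index beyond an equality check could apply a normalization of this kind, with the caveat that doing so changes which bucket a package maps to.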

Parallel Hash Verification

BatchProcessor uses a ConcurrentBag<Work> drained by DegreeOfParallelism concurrent Task workers rather than Parallel.ForEach or PLINQ, so .nupkg downloads run as non-blocking async I/O instead of tying up a thread per download.
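
The drain pattern can be sketched generically as follows. This is an illustration of the technique, not the BatchProcessor source; WorkerPoolSketch, ProcessAsync, and the verifyAsync delegate (which returns null for a matching hash and a failure object otherwise) are hypothetical.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class WorkerPoolSketch
{
    public static async Task<List<TFailure>> ProcessAsync<TWork, TFailure>(
        IEnumerable<TWork> items,
        int degreeOfParallelism,
        Func<TWork, Task<TFailure>> verifyAsync)
        where TFailure : class
    {
        var work = new ConcurrentBag<TWork>(items);
        var failures = new ConcurrentBag<TFailure>();

        // Start a fixed number of async workers; each takes items until the bag is empty.
        var workers = Enumerable.Range(0, degreeOfParallelism).Select(async _ =>
        {
            // Yield first so all workers get started before any begins draining.
            await Task.Yield();

            TWork item;
            while (work.TryTake(out item))
            {
                var failure = await verifyAsync(item).ConfigureAwait(false);
                if (failure != null)
                {
                    failures.Add(failure);
                }
            }
        }).ToList();

        await Task.WhenAll(workers).ConfigureAwait(false);
        return failures.ToList();
    }
}
```

Because each worker awaits its download, at most degreeOfParallelism requests are in flight at once, and no thread is blocked while waiting on the network. ConcurrentBag does not preserve order, so the returned failures arrive in no particular sequence.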

Result Recording

ResultRecorder writes directly to results.csv in the current working directory using FileMode.Append. The header row is only written when the file is new (position == 0). There is no rotation, size cap, or remote upload — the file is intended to be inspected or copied off the machine manually after a run.
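
A minimal sketch of the append-with-header behavior is shown below. The column names come from the table above; the class name ResultFileSketch, the method signature, and the absence of CSV quoting are simplifications made for this example and may differ from ResultRecorder.

```csharp
using System.IO;
using System.Text;

public static class ResultFileSketch
{
    public static void Append(
        string path,
        string type,
        string url,
        string id,
        string version,
        string expectedHash,
        string actualHash)
    {
        // FileMode.Append creates the file if needed and positions at the end,
        // so Position == 0 means the file is new or empty.
        using (var stream = new FileStream(path, FileMode.Append, FileAccess.Write))
        using (var writer = new StreamWriter(stream, new UTF8Encoding(false)))
        {
            if (stream.Position == 0)
            {
                writer.WriteLine("Type,URL,ID,Version,ExpectedHash,ActualHash");
            }

            // No quoting or escaping here: values containing commas would
            // need proper CSV handling in real code.
            writer.WriteLine(string.Join(",", type, url, id, version, expectedHash, actualHash));
        }
    }
}
```

Calling Append twice against a fresh path produces three lines: one header and two records. Running the job again later keeps appending to the same file without repeating the header.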
To run two workers in parallel against the same package corpus without overlap, launch two processes with matching --bucket-count 2 and --bucket-number 1 / --bucket-number 2 arguments. Each worker maintains its own cursor file named cursor_{n}_of_{total}.json in the working directory.