Skip to main content

Overview

Stats.Warehouse is a SQL Server Database Project (.sqlproj) targeting Azure SQL (SqlAzureV12). It defines the complete schema — tables, stored procedures, user-defined functions, table-valued types, and views — for the NuGet download-statistics data warehouse. The warehouse follows a classic star schema design: a central fact table (Fact_Download) is surrounded by dimension tables for date, time, package, client, platform, and operation. A parallel star for NuGet CLI tool distributions (Fact_Dist_Download) mirrors the same pattern for tracking tool releases. Raw IIS/CDN log files are aggregated by external jobs and loaded into this warehouse via table-valued parameters passed to the Ensure* and StoreLogFileAggregates stored procedures. Reporting stored procedures then query the warehouse to produce the download reports surfaced on nuget.org.
This project contains no application code. It is a pure SQL Server Data Tools (SSDT) database project. The compiled output is a .dacpac that is deployed to the warehouse SQL server. All logic lives in T-SQL.

Role in the System

Upstream: Stats Pipeline

External stats-collection jobs (e.g., Stats.ImportAzureCdnStatistics, Stats.AggregateCdnDownloadsInGallery) parse CDN log files and call the Ensure* procedures and StoreLogFileAggregates to populate the warehouse.

Downstream: Download Reports

Reporting jobs call the DownloadReport* and SelectTotalDownloadCounts* procedures to generate the JSON download-report blobs that nuget.org reads for package download badges and statistics pages.

Downstream: Gallery Sync

GetDirtyPackageIds and the Cursors table form a cursor-based change-detection pattern used by gallery-sync jobs to identify packages whose download counts have changed and need to be pushed back to the Gallery database.

Downstream: Dist Reports

View_Dist_ReleaseDate and Fact_Dist_Download track NuGet CLI tool distribution downloads, feeding release-date and version-alias metadata to tooling dashboards.

Schema: Tables

TableTypePurpose
Fact_DownloadFactCore download event fact table; FK’d to all dimensions
Fact_Dist_DownloadFactTool-distribution download facts (NuGet CLI etc.)
Fact_UserAgentFact/LookupDeduplicated raw user-agent strings
Fact_LogFileNameFact/LookupDeduplicated CDN log file names
Fact_IpAddressFact/LookupDeduplicated edge-server IP addresses
Fact_Package_PackageSetBridgeMany-to-many between packages and package sets
Dimension_PackageDimensionPackage ID + version, with persisted lowercase columns
Dimension_ClientDimensionClient name + version parts + computed ClientCategory
Dimension_DateDimensionFull date spine with fiscal calendar attributes
Dimension_TimeDimensionTime-of-day dimension (hour granularity)
Dimension_OperationDimensionDownload operation type (install, restore, etc.)
Dimension_PlatformDimensionOS/runtime platform
Dimension_ToolDimensionNuGet CLI tool ID, version, and filename
Dimension_PackageSetDimensionNamed groupings of packages
Agg_PackageDownloads_LogFileAggregatePer-log-file, per-date download count rollup
CursorsControlNamed datetime cursors for incremental processing

Schema: Programmability

Key Stored Procedures

ProcedurePurpose
EnsurePackageDimensionsExistUpserts package dimension rows via TVP; returns surrogate IDs
EnsureClientDimensionsExistUpserts client dimension rows via TVP
EnsurePlatformDimensionsExistUpserts platform dimension rows via TVP
EnsureToolDimensionsExistUpserts tool dimension rows via TVP
EnsureUserAgentFactsExistUpserts raw user-agent strings into Fact_UserAgent
EnsureLogFileNameFactsExistUpserts log file name strings into Fact_LogFileName
EnsureIpAddressFactsExistUpserts edge-server IPs into Fact_IpAddress
StoreLogFileAggregatesCursor-based upsert of per-log-file download aggregates
SelectAlreadyAggregatedLogFilesReturns log file names already present in Agg_PackageDownloads_LogFile
CheckLogFileHasPackageStatisticsChecks whether a log file contributed package download facts
CheckLogFileHasToolStatisticsChecks whether a log file contributed tool download facts
SelectTotalDownloadCountsReturns total download count across all packages (with hardcoded -21M correction)
SelectTotalDownloadCountsPerPackageVersionPer-package-version totals
SelectTotalDownloadCountsPerToolVersionPer-tool-version totals
GetDirtyPackageIdsCursor-based detection of packages with new downloads since last run
UpdateDirtyPackageIdCursorAdvances the GetDirtyPackageId cursor in Cursors table
GetTotalPackageDownloadsByDateDaily download totals, used by aggregate reporting
DownloadReportRecentPopularityTop 100 packages by downloads in last 42 days
DownloadReportRecentPopularityDetailPer-version breakdown for recent-popularity report
DownloadReportRecentPopularityDetailByPackagePackage-scoped version breakdown
DownloadReportRecentCommunityPopularityPopularity filtered to community (non-NuGet) clients
DownloadReportRecentCommunityPopularityDetailCommunity popularity with version detail
DownloadReportLast6MonthsMonthly download totals for last 6 months
DownloadReportLast6WeeksWeekly download totals for last 6 weeks
DownloadReportNuGetClientVersionDownloads grouped by NuGet client version
DownloadReportListInactivePackages with zero recent activity
CleanupFactIpAddressDeletes orphaned IP address facts
CleanupFactUserAgentDeletes orphaned user-agent facts
CleanupPlatformDimensionRemoves unused platform dimension rows
GenerateFixedDimensionsSeeds fixed/static dimension rows (operation types, etc.)
GetUnknownUserAgentsReturns user agents with no client dimension mapping
GetLinkedUserAgentsReturns user agents already linked to a client dimension
PatchClientDimensionForUserAgentManually assigns a client dimension to a user-agent string

Functions

FunctionPurpose
GetClientCategoryDispatches to type-detection functions; returns category string
IsNuGetClientReturns 1 if user-agent matches NuGet client patterns
IsBrowserReturns 1 if user-agent is a browser
IsCrawlerReturns 1 if user-agent is a web crawler/bot
IsMobileClientReturns 1 if user-agent is a mobile client
IsScriptClientReturns 1 if user-agent is a scripting client (PowerShell, curl, etc.)
IsUnknownClientReturns 1 if user-agent is unknown/test/noise
ParseCsvStringSplits a CSV string into a table of values

Views

ViewPurpose
View_Dist_ReleaseDateEarliest download date per tool version; computes IsPrerelease and VersionAlias
View_Fixed_Week_Dimension_DateFixed ISO-week mapping over Dimension_Date for weekly reporting

Notable Patterns and Implementation Details

Upsert via Table-Valued Parameters. All Ensure* procedures accept a TVP (table-valued parameter), attempt a bulk insert of new rows using EXCEPT to exclude already-existing ones, and return the full set of surrogate IDs. This is the standard pattern for idempotent dimension loading across the warehouse.
Cursor-based incremental processing. The Cursors table holds named DATETIME positions. GetDirtyPackageIds reads the GetDirtyPackageId cursor, finds facts added since that position, and returns changed package IDs. UpdateDirtyPackageIdCursor advances the position after processing. This decouples the warehouse write path from the gallery sync read path.
Hardcoded download-count correction in SelectTotalDownloadCounts. The procedure subtracts exactly 21,000,000 from the raw sum to correct for historical over-counting. This magic number is not documented within the SQL and must not be changed without understanding the original data-quality issue it compensates for.
StoreLogFileAggregates uses a T-SQL cursor (row-by-row). The procedure iterates the TVP with a DECLARE CURSOR loop rather than a set-based merge, which limits throughput at high log-file volumes. This is a known implementation quirk.
Dimension_Client.ClientCategory and ClientVersion are persisted computed columns. ClientCategory is computed by dbo.GetClientCategory() and persisted at write time, making it indexable and report-ready without runtime function calls. The version triple (Major.Minor.Patch) is similarly persisted as ClientVersion.
Download reports exclude Crawlers, Unknowns, and future NuGet major versions. The DownloadReport* procedures filter out ClientCategory IN ('Crawler', 'Unknown') and also exclude NuGet clients with Major > 10 — a forward-looking guard against unrecognized future client versions inflating counts.
Snapshot isolation is enabled (AllowSnapshotIsolation = True in the project properties). Most reporting procedures use (NOLOCK) hints throughout for non-blocking reads, accepting the trade-off of potentially reading uncommitted data in exchange for query concurrency on the heavily-written fact tables.

Dependencies

This is a pure SSDT database project. It has no NuGet package dependencies and no internal project references.
DependencyTypeNotes
Microsoft.Data.Tools.Schema.SqlTasks.targetsMSBuild / SSDTSSDT build targets; resolved from VS installation or SQLDBExtensionsRefPath env var
Microsoft.Data.Tools.Schema.Sql.SqlAzureV12DatabaseSchemaProviderSSDT DSPTargets Azure SQL (SQL Server 2022-compatible DDL)
System.NET AssemblyReferenced for AssemblyInfo.cs compilation only
The .sqlproj uses TargetFrameworkVersion v4.7.2 and ToolsVersion 4.0 purely as MSBuild scaffolding — there is no runtime .NET component. The project output is a .dacpac file.