Azure Search indexes
Subsystem: Search 🔎
The search subsystem depends heavily on Azure Search for storing package metadata and performing package queries. Within a single Azure Search resource, there can be multiple indexes. An index is simply a collection of documents with a common schema. For the NuGet search subsystem, two indexes are expected in each Azure Search resource:
- search-XXX: the “search” index, which contains documents for discovery queries
- hijack-XXX: the “hijack” index, which contains documents for metadata lookup queries
Search index
The search index is designed to fulfill queries for package discovery. This is likely the scenario you would think of first when you imagine how package search works. It’s optimized for searching package metadata fields by one or more keywords and has a scoring profile that returns the most relevant packages first.
This index has up to four documents per package ID. Each of the four ID-specific documents represents a different view of the available package versions. Two factors determine which versions are filtered in or out: whether to consider prerelease versions and whether to consider SemVer 2.0.0 versions. This may seem a little strange at first, so it’s best to consider an example. Consider a package BaseTestPackage.SearchFilters that has four versions:
- 1.1.0: stable, SemVer 1.0.0
- 1.2.0-beta: prerelease, SemVer 1.0.0
- 1.3.0+metadata: stable, SemVer 2.0.0 (due to build metadata)
- 1.4.0-delta.4: prerelease, SemVer 2.0.0 (due to a dot in the prerelease label)
For BaseTestPackage.SearchFilters, there will be four documents, each with a different set of versions included in the
document.
- Stable + SemVer 1.0.0: contains only 1.1.0
- Stable/Prerelease + SemVer 1.0.0: contains 1.1.0 and 1.2.0-beta
- Stable + SemVer 2.0.0: contains 1.1.0 and 1.3.0+metadata
- Stable/Prerelease + SemVer 2.0.0: contains all versions
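
The four views above can be reproduced with a small filtering sketch. This is only an illustration in Python, not the search service's actual code: the Version type, the VIEWS table, and the versions_in_view helper are hypothetical names, and the prerelease and SemVer 2.0.0 flags are entered by hand from the example rather than parsed from the version strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    """Hypothetical record; flags are copied by hand from the example above."""
    text: str
    prerelease: bool
    semver2: bool


VERSIONS = [
    Version("1.1.0", prerelease=False, semver2=False),
    Version("1.2.0-beta", prerelease=True, semver2=False),
    Version("1.3.0+metadata", prerelease=False, semver2=True),
    Version("1.4.0-delta.4", prerelease=True, semver2=True),
]

# Each view is described by (include prerelease, include SemVer 2.0.0).
VIEWS = {
    "Stable + SemVer 1.0.0": (False, False),
    "Stable/Prerelease + SemVer 1.0.0": (True, False),
    "Stable + SemVer 2.0.0": (False, True),
    "Stable/Prerelease + SemVer 2.0.0": (True, True),
}


def versions_in_view(versions, include_prerelease, include_semver2):
    """Return the version strings that belong in one view.

    A version is dropped only when it has a property the view excludes.
    """
    return [
        v.text
        for v in versions
        if (include_prerelease or not v.prerelease)
        and (include_semver2 or not v.semver2)
    ]


for name, flags in VIEWS.items():
    print(f"{name}: {versions_in_view(VERSIONS, *flags)}")
```

Each document is a superset of the more restrictive ones, which is why a package with only stable SemVer 1.0.0 versions would carry the same version list in all of its documents.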
Each document is identified (via the key property) by a unique string made up of three components: a sanitized lowercase ID, a base64 lowercase ID, and a search filter.
The sanitized lowercase ID is the lowercase package ID with all characters removed that are not acceptable in Azure Search
document keys, such as dots and non-ASCII word characters (for example, Chinese characters). This component of the document key is
included for readability only.
The base64 lowercase ID is the base64 encoding of the package ID’s bytes, encoded with UTF-8. This string is
guaranteed to be a 1:1 mapping with the lowercase package ID and is included for uniqueness. The
HttpServerUtility.UrlTokenEncode API is used for base64 encoding.
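
To make the base64 component concrete, here is a minimal Python sketch of the encoding step. It assumes the commonly documented behavior of HttpServerUtility.UrlTokenEncode: URL-safe base64 with the trailing "=" padding replaced by a single digit giving the padding count. The function names are hypothetical, and Python's str.lower() is used as a stand-in for the service's lowercasing, which may differ for some non-ASCII characters.

```python
import base64


def url_token_encode(data: bytes) -> str:
    """Approximation of .NET's HttpServerUtility.UrlTokenEncode (assumed behavior)."""
    if not data:
        return ""
    encoded = base64.urlsafe_b64encode(data).decode("ascii")
    stripped = encoded.rstrip("=")
    # The number of removed '=' characters is appended as a digit (0, 1, or 2).
    return stripped + str(len(encoded) - len(stripped))


def base64_lowercase_id(package_id: str) -> str:
    """Encode the UTF-8 bytes of the lowercase package ID (hypothetical helper)."""
    return url_token_encode(package_id.lower().encode("utf-8"))


print(base64_lowercase_id("BaseTestPackage.SearchFilters"))
# prints YmFzZXRlc3RwYWNrYWdlLnNlYXJjaGZpbHRlcnM1
```

Because the encoding is reversible, two different lowercase IDs can never produce the same component, while IDs that differ only in casing always produce the same one.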
The search filter has one of four values:
- Default: Stable + SemVer 1.0.0
- IncludePrerelease: Stable/Prerelease + SemVer 1.0.0
- IncludeSemVer2: Stable + SemVer 2.0.0
- IncludePrereleaseAndSemVer2: Stable/Prerelease + SemVer 2.0.0
For the package ID BaseTestPackage.SearchFilters, the Stable + SemVer 1.0.0 document key would be:
See the NuGet.Services.AzureSearch.SearchDocument.Full class and
its inherited members for a full list of the fields.
Unlisted package versions do not appear in the search index at all.
Hijack index
The hijack index is used by the gallery to fulfill specific metadata lookup operations. For example, if a customer is looking for metadata about all versions of the package ID Newtonsoft.Json, in certain cases the gallery
will query the search service for this metadata and the search service will use the hijack index to fetch the
data.
This index has one document for every version of every package ID, whether it is unlisted or not. The search service
uses this index to find all versions of a package via the ignoreFilter=true parameter, including:
- unlisted packages
- multiple versions of a single ID
Each document is identified (via the key property) by a unique string made up of two components: a sanitized ID/version and a base64 ID/version.
The sanitized ID/version is the string {lowercase package ID}/{lowercase, normalized version} with all characters removed
that are not acceptable in Azure Search document keys, such as dots and non-ASCII word characters (for example, Chinese
characters). This component of the document key is included for readability only.
The base64 ID/version is the base64 encoding of the previously mentioned concatenation of ID and version, encoded
with UTF-8. This string is guaranteed to be a 1:1 mapping with the lowercase package ID and version and is included
for uniqueness. The HttpServerUtility.UrlTokenEncode API is used for base64 encoding.
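
The 1:1 mapping can be illustrated with a round trip. The Python sketch below makes the same assumption about HttpServerUtility.UrlTokenEncode as stated in its comments (URL-safe base64, padding replaced by a count digit); the helper names are hypothetical, the version passed in is assumed to be already normalized, and the ID and version used are purely illustrative.

```python
import base64


def url_token_encode(data: bytes) -> str:
    """Assumed UrlTokenEncode behavior: URL-safe base64, padding count as last digit."""
    if not data:
        return ""
    encoded = base64.urlsafe_b64encode(data).decode("ascii")
    stripped = encoded.rstrip("=")
    return stripped + str(len(encoded) - len(stripped))


def url_token_decode(token: str) -> bytes:
    """Inverse of url_token_encode: restore the padding, then decode."""
    if not token:
        return b""
    padding = int(token[-1])
    return base64.urlsafe_b64decode(token[:-1] + "=" * padding)


def base64_id_version(package_id: str, normalized_version: str) -> str:
    """Encode '{lowercase ID}/{lowercase version}' as UTF-8 (hypothetical helper)."""
    raw = f"{package_id.lower()}/{normalized_version.lower()}"
    return url_token_encode(raw.encode("utf-8"))


token = base64_id_version("Newtonsoft.Json", "9.0.1")
# Decoding recovers the exact ID/version pair, so no two pairs can share a token.
assert url_token_decode(token).decode("utf-8") == "newtonsoft.json/9.0.1"
```

The sanitized component, by contrast, is lossy (different inputs can sanitize to the same text), which is why it is kept for readability only and uniqueness rests on the base64 component.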
For the package ID BaseTestPackage.SearchFilters and version 1.3.0+metadata, the document key would be:
See the NuGet.Services.AzureSearch.HijackDocument.Full class and
its inherited members for a full list of the fields.