Package metadata resolution¶
Overview¶
The package metadata resolution system retrieves metadata for packages from upstream repositories. This includes latest available versions, artifact hashes, and publish timestamps. Dependency-Track uses this data for latest version checks, component age policies, and integrity verification.
A singleton durable workflow orchestrates resolution and processes packages in controlled batches. ADR-015 documents the data model.
Responsible consumption of public infrastructure¶
Public registries like Maven Central, npm, and PyPI share a finite pool of bandwidth across all consumers. Sonatype has documented that 1% of IP addresses account for 83% of Maven Central's total bandwidth, and that registries have begun enforcing organization-level throttling in response, returning HTTP 429 errors to excessive consumers.
For a system like Dependency-Track, which may track hundreds of thousands of components across many projects, this has direct architectural implications. Making an HTTP request per component on every BOM upload or scheduled analysis cycle is not viable. Neither is spawning hundreds of concurrent requests against upstream registries. Both patterns would quickly exhaust rate limits, especially in larger deployments where many API server instances run concurrently.
The resolution system thus favors scheduled, controlled processing over ad-hoc lookups. The system identifies components needing metadata resolution in batches, resolves them sequentially per resolver, and persists results with enough provenance to avoid redundant upstream requests.
Resolvers also issue conditional HTTP requests when refreshing
data. A small shared cache stores each upstream response body together with its ETag and
Last-Modified validators. Refreshes after the in-process freshness window send If-None-Match
(or If-Modified-Since) and most often exchange a small 304 Not Modified rather than a full
metadata document. This cuts upstream bandwidth without changing how often the system contacts the
registry, and on registries that exempt 304 responses from rate limiting (notably the GitHub API)
it also conserves quota. Sonatype's post "Open is not costless: reclaiming sustainable
infrastructure" explains the motivation.
Data model¶
Two tables form a two-level hierarchy, both keyed by PURL:
- PACKAGE_METADATA: Keyed by PURL without version, qualifiers, or subpath (for example, pkg:maven/com.acme/acme-lib). Stores the latest available version and resolution provenance. One record per package.
- PACKAGE_ARTIFACT_METADATA: Keyed by full PURL including qualifiers and subpath (for example, pkg:maven/com.acme/acme-lib@1.2.3?type=jar). Stores artifact hashes, publish timestamp, and resolution provenance. One record per distinct artifact. Has a foreign key to PACKAGE_METADATA.
This separates package-level information (latest version) from artifact-level information (hashes, publish timestamp). The FK constraint enforces that artifact metadata cannot exist without corresponding package metadata, and the orchestration logic respects the resulting write-order dependency.
Refer to ADR-015 for the full rationale.
Persistence¶
Both tables use COALESCE-based upserts that preserve existing non-null values.
A temporal guard (WHERE "RESOLVED_AT" < EXCLUDED."RESOLVED_AT") prevents older results
from overwriting newer ones. Writes use PostgreSQL UNNEST to batch many rows
per statement, reducing round trips.
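The upsert semantics can be modeled in a few lines of Python. This is a minimal sketch of the behavior, not the actual SQL: the PackageRow type and its fields are hypothetical, and it assumes the COALESCE prefers the incoming value and falls back to the stored one when the incoming value is null.

```python
from dataclasses import dataclass, replace
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class PackageRow:
    # Hypothetical stand-in for a PACKAGE_METADATA row.
    purl: str
    latest_version: Optional[str]
    resolved_at: datetime


def upsert(existing: Optional[PackageRow], incoming: PackageRow) -> PackageRow:
    """Return the row that would be stored after the upsert."""
    if existing is None:
        return incoming  # no conflict: plain insert
    if not existing.resolved_at < incoming.resolved_at:
        return existing  # temporal guard: older results never overwrite newer ones
    # Assumed COALESCE order: take the incoming value, keep the stored one if it is null.
    latest = incoming.latest_version
    if latest is None:
        latest = existing.latest_version
    return replace(existing, latest_version=latest, resolved_at=incoming.resolved_at)
```

With this model, a newer result that lacks a latest version still advances RESOLVED_AT but keeps the previously stored version.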
PURL normalization¶
A raw component PURL is rarely the right key for fetching upstream metadata. PURLs carry
information that does not affect identity (such as repository URLs, checksums, or BOM-specific
qualifiers), and they often omit information that the ecosystem needs to fetch a specific
artifact. Each resolver factory implements normalize to map a PURL to the shape it
expects, returning null if the resolver does not support the type.
PURL normalization serves two purposes:
- It selects which resolver handles a given PURL. The first factory whose normalize returns a non-null result wins.
- It shapes the PURL into the exact key that gets persisted in PACKAGE_ARTIFACT_METADATA.
Artifact granularity is thus not controlled by the input PURL, but by the resolver.
Maven¶
A Maven coordinate is not fully identified by group, artifact, and version. The same
GAV can produce more than one artifact, distinguished by the type and classifier
qualifiers: type selects the packaging (jar, war, pom, aar),
and classifier selects a variant (sources, javadoc, linux-x86_64). Each has its own file on disk and its
own hash. The Maven resolver normalizes by keeping type (defaulting to jar, matching the
Maven convention) and classifier if present, and dropping everything else. The result is
that pkg:maven/com.acme/acme-lib@1.2.3 and pkg:maven/com.acme/acme-lib@1.2.3?classifier=sources
resolve to distinct PACKAGE_ARTIFACT_METADATA rows with distinct hashes.
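As an illustration, the Maven rule could be sketched as follows. The function name normalize_maven and the string-based parsing are assumptions made for this example. The sketch also strips any subpath and orders qualifiers by key, neither of which the text above specifies.

```python
from typing import Optional
from urllib.parse import parse_qsl, urlencode


def normalize_maven(purl: str) -> Optional[str]:
    """Keep type (default jar) and classifier, drop every other qualifier."""
    if not purl.startswith("pkg:maven/"):
        return None  # signals that this resolver does not support the PURL
    # Assumption: a subpath, if present, is not part of the artifact key.
    without_subpath = purl.split("#", 1)[0]
    coordinates, _, query = without_subpath.partition("?")
    qualifiers = dict(parse_qsl(query))
    kept = {"type": qualifiers.get("type", "jar")}
    if "classifier" in qualifiers:
        kept["classifier"] = qualifiers["classifier"]
    # Sort by key so equivalent inputs always produce the same string.
    return coordinates + "?" + urlencode(sorted(kept.items()))
```

For example, pkg:maven/com.acme/acme-lib@1.2.3 becomes pkg:maven/com.acme/acme-lib@1.2.3?type=jar, while the sources variant keeps its classifier and therefore maps to a different key.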
PyPI¶
A PyPI release is not a single file. A version typically ships one source distribution
(.tar.gz) plus a fan of built wheels, one per supported Python version, ABI, and platform
(cp311-cp311-manylinux_2_17_x86_64.whl, cp312-cp312-macosx_11_0_arm64.whl, and so on).
Each file has its own hash, and the PyPI JSON API returns hashes per file rather than per
release. Without a way to pick one, the resolver cannot define which hash to record.
To handle this, the PyPI resolver requires a file_name qualifier on
the input PURL and preserves it during normalization. SBOM tooling that emits PyPI components
needs to include file_name to let the resolver record hashes. PURLs without it are not
rejected outright, but the resolver cannot disambiguate which file applies.
Other ecosystems¶
Most other ecosystems publish one artifact per version (npm tarballs, RubyGems gems,
crates), so their resolvers drop all qualifiers during normalization and key on
type/namespace/name@version alone.
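The "first factory wins" selection and the two remaining normalization styles can be sketched together. All names here (normalize_pypi, normalize_npm, FACTORIES, select_resolver) are hypothetical, and the factory order is chosen only for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple
from urllib.parse import parse_qsl, urlencode

Normalizer = Callable[[str], Optional[str]]


def _split(purl: str) -> Tuple[str, Dict[str, str]]:
    """Split a PURL into its qualifier-free part and its qualifiers."""
    base, _, query = purl.split("#", 1)[0].partition("?")
    return base, dict(parse_qsl(query))


def normalize_pypi(purl: str) -> Optional[str]:
    if not purl.startswith("pkg:pypi/"):
        return None
    base, qualifiers = _split(purl)
    if "file_name" in qualifiers:
        # Preserve file_name so the resolver can pick the matching file.
        return base + "?" + urlencode({"file_name": qualifiers["file_name"]})
    return base  # accepted, but no specific file can be selected


def normalize_npm(purl: str) -> Optional[str]:
    if not purl.startswith("pkg:npm/"):
        return None
    return _split(purl)[0]  # one artifact per version: drop all qualifiers


FACTORIES: List[Tuple[str, Normalizer]] = [("pypi", normalize_pypi), ("npm", normalize_npm)]


def select_resolver(purl: str) -> Optional[Tuple[str, str]]:
    """Return (resolver name, normalized PURL) for the first supporting factory."""
    for name, normalize in FACTORIES:
        normalized = normalize(purl)
        if normalized is not None:
            return name, normalized
    return None
```

A PURL that no factory supports yields None, which corresponds to the "no matching resolver" case handled during candidate fetching.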
Workflow¶
Package metadata resolution is a singleton durable workflow.
Singleton constraint¶
The workflow uses a fixed instance ID (resolve-package-metadata). The durable execution engine enforces
that only a single execution of a given workflow instance in non-terminal state can exist at
any moment. Attempts to create a run while one is already active are silently deduplicated.
This guarantees that at most one resolution workflow is active across the entire cluster, regardless of how many API server instances are running. Concurrent resolution attempts are structurally impossible, which prevents redundant upstream requests and data races during persistence.
Triggers¶
Three situations trigger the workflow:
| Trigger | When |
|---|---|
| Scheduled | Configurable cron schedule |
| BOM upload | After importing a BOM that contains components |
| Manual analysis | When a user manually triggers vulnerability analysis for a project |
All triggers create a run with the same singleton instance ID. If a run is already active, the creation request is a no-op. This makes triggers cheap to invoke, as the singleton constraint handles deduplication.
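The deduplication can be pictured with a toy in-memory model. The WorkflowEngine class below is purely hypothetical and only mirrors the observable behavior described above; it is not the durable execution engine's API.

```python
import threading
from typing import Set


class WorkflowEngine:
    """Toy model: at most one non-terminal run per instance ID."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._active: Set[str] = set()

    def create_run(self, instance_id: str) -> bool:
        """Return True if a run was created, False if the request was deduplicated."""
        with self._lock:
            if instance_id in self._active:
                return False  # a run is already active: no-op
            self._active.add(instance_id)
            return True

    def complete(self, instance_id: str) -> None:
        """Mark the run as terminal so that a new one can be created."""
        with self._lock:
            self._active.discard(instance_id)
```

Under this model a BOM upload and a scheduled trigger that arrive while a run is active both return without effect, and the next trigger after completion starts a fresh run.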
Structure¶
sequenceDiagram
participant W as resolve-package-metadata<br/><<Workflow>>
participant F as fetch-candidates<br/><<Activity>>
participant R as resolve-purl-metadata<br/><<Activity>>
activate W
W ->> F: Fetch resolution candidates
activate F
F -->> W: Candidate groups (by resolver)
deactivate F
alt No candidates
Note over W: Complete
else Has candidates
par for each resolver
W ->> R: Resolve batch
activate R
R -->> W: Done (or failure)
deactivate R
end
Note over W: Continue-as-new
end
deactivate W
Candidate fetching¶
| Activity | Task Queue |
|---|---|
| fetch-package-metadata-resolution-candidates | default |
A PURL is eligible for resolution if:
- No PACKAGE_ARTIFACT_METADATA record exists for it, or
- no PACKAGE_METADATA record exists for it, or
- the corresponding PACKAGE_METADATA was last resolved over 24 hours ago.
The activity fetches candidates in batches of 250 and groups them by resolver. Each PURL maps
to the first resolver whose normalize method returns a non-null result. The activity groups PURLs with no
matching resolver under an empty resolver name. The resolve activity persists
empty results for these so they don't re-appear as candidates in later batches.
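The eligibility rule and the grouping step can be sketched as two small functions. The real check runs as a database query; the names is_candidate and group_by_resolver, and the representation of factories as (name, normalize) pairs, are assumptions for this example.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Callable, Dict, Iterable, List, Optional, Tuple

FRESHNESS = timedelta(hours=24)
Factory = Tuple[str, Callable[[str], Optional[str]]]


def is_candidate(artifact_exists: bool,
                 package_resolved_at: Optional[datetime],
                 now: datetime) -> bool:
    """Mirror the three eligibility conditions listed above."""
    if not artifact_exists:
        return True  # no PACKAGE_ARTIFACT_METADATA record
    if package_resolved_at is None:
        return True  # no PACKAGE_METADATA record
    return now - package_resolved_at > FRESHNESS


def group_by_resolver(purls: Iterable[str],
                      factories: List[Factory]) -> Dict[str, List[str]]:
    """Group PURLs by the first supporting resolver; '' means no resolver."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for purl in purls:
        name = next((n for n, normalize in factories if normalize(purl) is not None), "")
        groups[name].append(purl)
    return dict(groups)
```

The empty-name group is what lets the resolve activity record empty results for unsupported PURLs.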
Resolution¶
| Activity | Task Queue |
|---|---|
| resolve-purl-metadata | package-metadata-resolutions |
One activity per resolver processes its assigned PURLs sequentially. Different resolver activities run concurrently.
Before processing, the activity loads existing PACKAGE_ARTIFACT_METADATA rows for the batch and
passes each PURL's prior result to the resolver as a hint. Resolvers that can short-circuit on it
(for example, Maven for stable versions) skip the corresponding upstream fetch.
For each PURL, the activity:
- Normalizes the PURL via the resolver factory.
- Looks up configured repositories for the PURL type, ordered by resolution priority.
- Iterates repositories, respecting internal/external classification, invoking the resolver until one succeeds or none remain.
- Buffers results and flushes to the database in batches of 25.
If a resolver signals a retryable error (for example, HTTP 429), the activity flushes any buffered results and propagates the error. The durable execution engine then retries the activity with backoff. The activity catches non-retryable errors for individual PURLs and persists an empty result, preventing the PURL from becoming a candidate again immediately.
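The buffering and error handling can be sketched as a loop. RetryableError, resolve_batch, and the resolve and persist callables are hypothetical names; repository iteration and the prior-result hint are left out to keep the sketch short.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple

Result = Tuple[str, Optional[Any]]


class RetryableError(Exception):
    """Stands in for a retryable upstream failure such as HTTP 429."""


def resolve_batch(purls: Sequence[str],
                  resolve: Callable[[str], Optional[Any]],
                  persist: Callable[[List[Result]], None],
                  flush_size: int = 25) -> None:
    buffer: List[Result] = []

    def flush() -> None:
        if buffer:
            persist(list(buffer))
            buffer.clear()

    for purl in purls:
        try:
            result = resolve(purl)
        except RetryableError:
            flush()  # keep the work done so far, then let the engine retry
            raise
        except Exception:
            result = None  # non-retryable: persist an empty result
        buffer.append((purl, result))
        if len(buffer) >= flush_size:
            flush()
    flush()
```

Because results are flushed before the error propagates, a retried activity only has to redo the PURLs that were not yet persisted, provided the caller filters those out.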
Retry policy¶
| Parameter | Value |
|---|---|
| Initial delay | 5 s |
| Delay multiplier | 2x |
| Randomization factor | 0.3 |
| Max delay | 1 m |
| Max attempts | 3 |
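To make the table concrete, the following sketch computes the delays between attempts. It rests on two assumptions the table does not state: that "max attempts" includes the initial attempt, and that the randomization factor scales each delay by a uniform factor in the range 1 ± 0.3. The function name retry_delays is hypothetical.

```python
import random
from typing import Callable, List


def retry_delays(initial: float = 5.0,
                 multiplier: float = 2.0,
                 randomization: float = 0.3,
                 max_delay: float = 60.0,
                 max_attempts: int = 3,
                 rng: Callable[[], float] = random.random) -> List[float]:
    """Return the delays in seconds that precede each retry."""
    delays: List[float] = []
    base = initial
    for _ in range(max_attempts - 1):  # assumed: the first attempt is not delayed
        jitter = 1 + randomization * (2 * rng() - 1)  # assumed jitter formula
        delays.append(min(base * jitter, max_delay))
        base = min(base * multiplier, max_delay)
    return delays
```

Under these assumptions an activity waits roughly 5 s and then roughly 10 s before the workflow sees the failure.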
Continue-as-new¶
After processing a batch, the workflow uses continueAsNew to start a fresh run that
picks up the next batch. This prevents unbounded history growth: each run's event history
covers only one batch. Without this, a single run resolving thousands of packages would
accumulate a large history that degrades replay performance.
It also creates a natural checkpoint. If the process stops between batches,
the next run starts with a fresh candidate query, skipping already-resolved PURLs
based on the RESOLVED_AT timestamps in the database.
The cycle continues until no candidates remain, at which point the workflow completes normally.
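The cycle can be simulated with a plain loop in which each iteration stands for one workflow run. The names run_once and drive are hypothetical, resolver groups are processed sequentially for simplicity, and failure handling is omitted.

```python
from typing import Callable, Dict, List

Groups = Dict[str, List[str]]


def run_once(fetch_candidates: Callable[[], Groups],
             resolve_group: Callable[[str, List[str]], None]) -> bool:
    """Process one batch. Return True if the workflow should continue-as-new."""
    groups = fetch_candidates()
    if not groups:
        return False  # nothing left: complete normally
    for resolver, purls in groups.items():
        resolve_group(resolver, purls)
    return True


def drive(fetch_candidates: Callable[[], Groups],
          resolve_group: Callable[[str, List[str]], None]) -> int:
    """Chain runs until no candidates remain; return the number of runs."""
    runs = 1
    while run_once(fetch_candidates, resolve_group):
        runs += 1  # each iteration models a fresh run with its own history
    return runs
```

Every run starts from a new candidate query, so no state needs to be carried over between runs beyond what is already in the database.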
Concurrency control¶
Concurrency control operates at four levels:
| Level | Mechanism | Effect |
|---|---|---|
| Cluster | Singleton instance ID | At most one resolution workflow across the cluster |
| Engine | package-metadata-resolutions queue capacity | Limits pending resolve activities (default: 25) |
| Node | Activity worker max concurrency | Limits parallel activity execution per node |
| Resolver | Sequential processing within each activity | Each resolver processes one PURL at a time |
Refer to durable execution for how queue capacity and worker concurrency interact.
Resolver extension point¶
Resolvers are pluggable via the extensibility system. The API surface consists of two interfaces:
- PackageMetadataResolverFactory: Creates resolver instances. Declares whether the resolver needs a repository, normalizes PURLs (returning null to signal non-support), and exposes the extension name.
- PackageMetadataResolver: Given a normalized PURL and an optional repository, returns PackageMetadata or null.
A single resolver handles a given PURL (first match wins based on factory ordering). Within that resolver, a single repository provides the result (first success wins based on the configured resolution order).
Resolvers route their upstream fetches through a shared HTTP-resource cache that
handles ETag and Last-Modified revalidation transparently. While an entry stays fresh the
cache serves the body without contacting the upstream; once stale but still cached,
the cache sends validators and a 304 Not Modified replays the body, while a 200 OK replaces
the entry. The cache negatively caches 404 and 410 responses, so lookups for absent packages do
not trigger repeated requests within the freshness window. The cache uses a separate namespace per
resolver and is pluggable between in-memory and database providers via the existing cache provider mechanism.
The candidate query marks a PURL as eligible after 24 hours, but the resolver-side cache typically prevents that from translating to a full upstream transfer.
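The cache behavior described above can be sketched as a small state machine. This is an in-memory illustration only: HttpResourceCache, Entry, and the fetch callable with its (status, body, etag) return shape are hypothetical, and only ETag revalidation is shown (Last-Modified would work analogously).

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

# Hypothetical transport: (url, request headers) -> (status, body, etag)
Fetch = Callable[[str, Dict[str, str]], Tuple[int, Optional[bytes], Optional[str]]]


@dataclass
class Entry:
    body: Optional[bytes]  # None marks a negatively cached 404 or 410
    etag: Optional[str]
    stored_at: float


class HttpResourceCache:
    def __init__(self, fetch: Fetch, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Entry] = {}

    def get(self, url: str) -> Optional[bytes]:
        now = self._clock()
        entry = self._entries.get(url)
        if entry is not None and now - entry.stored_at < self._ttl:
            return entry.body  # fresh: no upstream contact at all
        headers: Dict[str, str] = {}
        if entry is not None and entry.etag is not None:
            headers["If-None-Match"] = entry.etag  # stale: revalidate
        status, body, etag = self._fetch(url, headers)
        if status == 304 and entry is not None:
            entry.stored_at = now  # unchanged upstream: replay the cached body
            return entry.body
        if status == 200:
            self._entries[url] = Entry(body, etag, now)
            return body
        if status in (404, 410):
            self._entries[url] = Entry(None, None, now)  # negative caching
            return None
        raise RuntimeError(f"unexpected upstream status {status}")
```

In this model a PURL that becomes eligible again after 24 hours costs one conditional request and, if nothing changed upstream, no body transfer.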
Maintenance¶
A scheduled task cleans up orphaned metadata:
- Deletes PACKAGE_ARTIFACT_METADATA rows where no COMPONENT with a matching PURL exists.
- Deletes PACKAGE_METADATA rows where no PACKAGE_ARTIFACT_METADATA references them.
The two-step cascade prevents unbounded table growth as components leave the portfolio. Distributed locking ensures only one node executes cleanup at a time.
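The cascade can be illustrated with plain collections in place of tables. The function name cleanup and the data shapes are hypothetical, and the sketch assumes that "matching" means an exact match on the stored artifact PURL.

```python
from typing import Dict, Set, Tuple


def cleanup(component_purls: Set[str],
            artifact_rows: Dict[str, str],
            package_rows: Set[str]) -> Tuple[Dict[str, str], Set[str]]:
    """Return the rows that survive cleanup.

    artifact_rows maps an artifact PURL to the package PURL it references.
    """
    # Step 1: drop artifact metadata that no component refers to anymore.
    kept_artifacts = {
        artifact: package
        for artifact, package in artifact_rows.items()
        if artifact in component_purls
    }
    # Step 2: drop package metadata that no remaining artifact references.
    referenced = set(kept_artifacts.values())
    kept_packages = {package for package in package_rows if package in referenced}
    return kept_artifacts, kept_packages
```

The order matters: running step 2 first would leave package rows behind whose last artifact is removed in the same pass.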
Resiliency¶
The durable execution engine handles crash recovery and retries transparently:
- If a node crashes mid-resolution, the workflow resumes from the last completed step on restart. PURLs the prior run flushed before the crash no longer match the candidate query, because their RESOLVED_AT timestamps fall within the 24-hour freshness window.
- When resolution fails for a specific resolver even after exhausting retries, the workflow catches the ActivityFailureException, logs the failure, and continues. The workflow persists results from other resolvers normally.
- On graceful shutdown, the activity checks for thread interruption before each PURL, flushes buffered results, and propagates the interruption.