# ADR-010: Vulnerability Datasource Extension Point
| Status | Date | Author(s) |
|---|---|---|
| Proposed | 2025-08-27 | @nscuro |
## Context
In the process of moving away from Kafka (see ADR-001), we're moving functionality that so far has lived in separate services back into the API server codebase.
One of those services is `mirror-service`. The service fetches vulnerability intelligence data from various sources, such as the NVD or GitHub Advisories, converts it to CycloneDX Bill of Vulnerabilities (BOV) documents, and publishes them to the `dtrack.vulnerability` Kafka topic, from where the API server consumes them and stores them in its database.
This design was advantageous because Kafka acted as a buffer, enabling the API server to consume new data at its own pace. However, `mirror-service` was one more moving part, which increased operational complexity. With the removal of Kafka, its design no longer makes sense, and a replacement is needed.
## Decision
Introduce a new `ExtensionPoint` for consuming arbitrary vulnerability intelligence data sources.

By making it an `ExtensionPoint`, we gain:

- Code that remains modular and easy to extend in the future.
- The ability for the community to (soon) plug in custom data sources, without requiring modifications to the core API server code base.
However, it also comes with the following constraints:

- Implementations cannot use any code of the API server's database layer, including the internal data model.
- Implementations cannot perform any database operations.

These constraints already applied to `mirror-service`. We can lean on its design by:
- Continuing to exchange vulnerability information in CycloneDX BOV format.
- Mimicking semantics of "lazily" consuming from a Kafka topic.
The latter can be achieved with a construct from the Java standard library: `Iterator`s. A data source then becomes an `Iterator` over BOVs, leading to the following interface definition:
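A plausible sketch of that definition, where `ExtensionPoint` and `Bom` are stand-ins declared locally for the plugin API's marker interface and the CycloneDX BOV document type (the real types live elsewhere):

```java
import java.util.Iterator;

// Stand-ins so the sketch is self-contained; in reality these come from
// the plugin API and the CycloneDX library respectively.
interface ExtensionPoint {}

record Bom(String vulnId) {}

// A vulnerability data source is nothing more than an Iterator over BOVs.
interface VulnDataSource extends ExtensionPoint, Iterator<Bom> {
}
```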
To consume BOVs from all `VulnDataSource`s:
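A minimal sketch of such a consumer loop. `Bov` is again a stand-in for the CycloneDX BOV type, and the list of data sources would in reality be resolved through the plugin manager:

```java
import java.util.Iterator;
import java.util.List;

// Stand-in types; the real ones come from the plugin API.
record Bov(String vulnId) {}

interface VulnDataSource extends Iterator<Bov> {}

class BovConsumer {

    // Drain every registered data source, one BOV at a time.
    static int consumeAll(List<VulnDataSource> dataSources) {
        int processed = 0;
        for (VulnDataSource dataSource : dataSources) {
            while (dataSource.hasNext()) {
                Bov bov = dataSource.next();
                // This is where the existing CycloneDX-to-internal-model
                // mapping logic of the API server would be invoked.
                processed++;
            }
        }
        return processed;
    }
}
```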
`Iterator`s offer pull-based semantics, similar to Kafka. This puts consumers in full control of how fast data is retrieved.
Because data continues to be exchanged in CycloneDX BOV format, existing mapping logic can be reused.
## Performance Considerations
It is desirable to keep as little data as possible in memory. Ideally, `VulnDataSource` implementations only keep the "current" BOV in memory, retrieving more data only when `hasNext` is called.
This ensures that consuming large quantities of data remains efficient.
However, it is up to implementations to decide whether to buffer additional BOVs. For example, when mirroring GitHub Advisories, fetching advisories one-by-one is likely to lead to rate limiting.
Similarly, consumers can decide to collect BOVs into batches before processing them:
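A sketch of such batch-wise consumption, with `Bov` as a stand-in type and an arbitrary batch size of 100:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Stand-in for the CycloneDX BOV type.
record Bov(String vulnId) {}

class BatchingConsumer {

    static final int BATCH_SIZE = 100; // arbitrary choice for this sketch

    // Collect BOVs into fixed-size batches so they can be processed
    // (e.g. persisted) in bulk rather than one at a time.
    static List<List<Bov>> consumeInBatches(Iterator<Bov> dataSource) {
        List<List<Bov>> batches = new ArrayList<>();
        List<Bov> batch = new ArrayList<>(BATCH_SIZE);
        while (dataSource.hasNext()) {
            batch.add(dataSource.next());
            if (batch.size() == BATCH_SIZE) {
                batches.add(List.copyOf(batch));
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            batches.add(List.copyOf(batch)); // flush the final partial batch
        }
        return batches;
    }
}
```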
## Interrupts
Because data sources can provide potentially large amounts of data, processing all of it can take a long time. It is important that this processing does not block graceful shutdown of the application.
Consumers of `VulnDataSource`s should thus be interrupt-aware. When their thread is interrupted, they must gracefully abort processing:
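A sketch of an interrupt-aware consumer loop, again with `Bov` as a stand-in type:

```java
import java.util.Iterator;

// Stand-in for the CycloneDX BOV type.
record Bov(String vulnId) {}

class InterruptAwareConsumer {

    // Checks the thread's interrupt flag before each pull, so a shutdown
    // request aborts processing between BOVs instead of blocking it.
    static int consume(Iterator<Bov> dataSource) {
        int processed = 0;
        while (!Thread.currentThread().isInterrupted() && dataSource.hasNext()) {
            dataSource.next();
            processed++;
        }
        return processed;
    }
}
```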
## Checkpointing
Some data sources support incremental retrieval, for example by allowing clients to request data that has changed since a given timestamp.
In order to leverage such mechanisms, `VulnDataSource` implementations require a means to store the "last seen" timestamp, or similar checkpoint data. A simple key-value storage suffices. We enable this by allowing extensions to modify runtime configuration values through the plugin API's `ConfigRegistry`:
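A hypothetical sketch of this: the method names on `ConfigRegistry`, the checkpoint key, and `CheckpointingDataSource` are assumptions for illustration, not the actual plugin API.

```java
import java.time.Instant;
import java.util.Optional;

// Assumed key-value shape of the registry; the real interface may differ.
interface ConfigRegistry {
    Optional<String> getRuntimeValue(String key);
    void setRuntimeValue(String key, String value);
}

class CheckpointingDataSource {

    private static final String CHECKPOINT_KEY = "last.modified.epoch.seconds";

    private final ConfigRegistry configRegistry;

    CheckpointingDataSource(ConfigRegistry configRegistry) {
        this.configRegistry = configRegistry;
    }

    void mirror() {
        // Resume from the last checkpoint, or from the epoch on first run.
        Instant since = configRegistry.getRuntimeValue(CHECKPOINT_KEY)
                .map(Long::parseLong)
                .map(Instant::ofEpochSecond)
                .orElse(Instant.EPOCH);

        // ... fetch only data modified after "since" from the remote source ...

        // Persist the new checkpoint once fetching succeeded.
        configRegistry.setRuntimeValue(
                CHECKPOINT_KEY, String.valueOf(Instant.now().getEpochSecond()));
    }
}
```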
This provides the following benefits:
- Checkpointing logic is entirely hidden from the consumer.
- Checkpoint data will (soon) be viewable and editable via the UI, allowing users to reset it if desired.
## Built-In Sources
We will continue to ship a small selection of data sources out-of-the-box:
- GitHub Advisories
- NVD
- OSV
Instead of maintaining code for them in the `apiserver` module, we'll create a new Maven module for each of them:
```
+ hyades-apiserver/
 \-+ plugin/
   |-+ api/
   |-+ vuln-data-source-github/
   |-+ vuln-data-source-nvd/
   \-+ vuln-data-source-osv/
```
Not only does this "physically" separate them from the core API server code, effectively preventing tight coupling, it also allows for more efficient build parallelization and caching.
## Considered Alternatives
It was considered to use `Stream`s instead of `Iterator`s, since the streaming API is more modern. This was discarded because, in order to create a `Stream` from a custom data source, the `Spliterator` interface must be implemented. `Spliterator`s are a lot more complex than `Iterator`s, without adding much value for what we're trying to achieve.
## Consequences
- Data source implementations must be ported from `mirror-service`.
- `mirror-service` must be decommissioned.