# ADR-010: Vulnerability Datasource Extension Point
| Status | Date | Author(s) |
|---|---|---|
| Proposed | 2025-08-27 | @nscuro |
## Context
In the process of moving away from Kafka (see ADR-001), we're moving functionality that so far has lived in separate services back into the API server codebase.
One of those services is `mirror-service`. The service fetches vulnerability intelligence data from various sources, such as the NVD or GitHub Advisories, converts it to CycloneDX Bill of Vulnerabilities (BOV) documents, and publishes them to the `dtrack.vulnerability` Kafka topic, from where the API server consumes them and stores them in its database.
This design was advantageous because Kafka acted as a buffer, enabling the API server to consume new data at its own pace. However, `mirror-service` was one more moving part, which increased operational complexity. With the removal of Kafka, its design no longer makes sense, and a replacement is needed.
## Decision
Introduce a new `ExtensionPoint` for consuming arbitrary vulnerability intelligence data sources.

By making it an `ExtensionPoint`, we gain:

- Code that remains modular and easy to extend in the future.
- The ability for the community to (soon) plug in custom data sources, without requiring modifications to the core API server code base.
However, it also comes with the following constraints:

- Implementations cannot use any code of the API server's database layer, including the internal data model.
- Implementations cannot perform any database operations.

These constraints already applied to `mirror-service`. We can lean on its design by:
- Continuing to exchange vulnerability information in CycloneDX BOV format.
- Mimicking semantics of "lazily" consuming from a Kafka topic.
The latter can be achieved with a construct from the Java standard library: `Iterator`s. A data source then becomes an `Iterator` over BOVs, leading to the following interface definition:
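A plausible sketch of that definition, where `ExtensionPoint` and `Bom` are stand-ins declared locally for the plugin API's marker interface and the CycloneDX BOV document type (the real types live elsewhere):

```java
import java.util.Iterator;

// Stand-ins so the sketch is self-contained; in reality these come from
// the plugin API and the CycloneDX library respectively.
interface ExtensionPoint {}

record Bom(String vulnId) {}

// A vulnerability data source is nothing more than an Iterator over BOVs.
interface VulnDataSource extends ExtensionPoint, Iterator<Bom> {
}
```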
To consume BOVs from all `VulnDataSource`s:
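A minimal sketch of such a consumer loop. `Bov` is again a stand-in for the CycloneDX BOV type, and the list of data sources would in reality be resolved through the plugin manager:

```java
import java.util.Iterator;
import java.util.List;

// Stand-in types; the real ones come from the plugin API.
record Bov(String vulnId) {}

interface VulnDataSource extends Iterator<Bov> {}

class BovConsumer {

    // Drain every registered data source, one BOV at a time.
    static int consumeAll(List<VulnDataSource> dataSources) {
        int processed = 0;
        for (VulnDataSource dataSource : dataSources) {
            while (dataSource.hasNext()) {
                Bov bov = dataSource.next();
                // This is where the existing CycloneDX-to-internal-model
                // mapping logic of the API server would be invoked.
                processed++;
            }
        }
        return processed;
    }
}
```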
`Iterator`s offer pull-based semantics, similar to Kafka. This puts consumers in full control of how fast data is retrieved.
Because data continues to be exchanged in CycloneDX BOV format, existing mapping logic can be reused.
## Performance Considerations
It is desirable to keep as little data as possible in memory. Ideally, `VulnDataSource` implementations only keep the "current" BOV in memory, retrieving more data only when `hasNext` is called.
This ensures that consuming large quantities of data remains efficient.
However, it is up to implementations to decide whether to buffer additional BOVs. For example, when mirroring GitHub Advisories, fetching advisories one-by-one is likely to lead to rate limiting.
Similarly, consumers can decide to collect BOVs into batches before processing them:
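A sketch of such batch-wise consumption, with `Bov` as a stand-in type and an arbitrary batch size of 100:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Stand-in for the CycloneDX BOV type.
record Bov(String vulnId) {}

class BatchingConsumer {

    static final int BATCH_SIZE = 100; // arbitrary choice for this sketch

    // Collect BOVs into fixed-size batches so they can be processed
    // (e.g. persisted) in bulk rather than one at a time.
    static List<List<Bov>> consumeInBatches(Iterator<Bov> dataSource) {
        List<List<Bov>> batches = new ArrayList<>();
        List<Bov> batch = new ArrayList<>(BATCH_SIZE);
        while (dataSource.hasNext()) {
            batch.add(dataSource.next());
            if (batch.size() == BATCH_SIZE) {
                batches.add(List.copyOf(batch));
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            batches.add(List.copyOf(batch)); // flush the final partial batch
        }
        return batches;
    }
}
```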
## Interrupts
Because data sources can provide potentially large amounts of data, processing all of it can take a long time. It is important that this processing does not block graceful shutdown of the application.
Consumers of `VulnDataSource`s should thus be interrupt-aware. When their thread is interrupted, they must gracefully abort processing:
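A sketch of an interrupt-aware consumer loop, again with `Bov` as a stand-in type:

```java
import java.util.Iterator;

// Stand-in for the CycloneDX BOV type.
record Bov(String vulnId) {}

class InterruptAwareConsumer {

    // Checks the thread's interrupt flag before each pull, so a shutdown
    // request aborts processing between BOVs instead of blocking it.
    static int consume(Iterator<Bov> dataSource) {
        int processed = 0;
        while (!Thread.currentThread().isInterrupted() && dataSource.hasNext()) {
            dataSource.next();
            processed++;
        }
        return processed;
    }
}
```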
## Checkpointing
Some data sources support incremental retrieval, for example by allowing clients to request data that has changed since a given timestamp.
In order to leverage such mechanisms, `VulnDataSource` implementations require a means to store the "last seen" timestamp, or similar checkpoint data. A simple key-value storage suffices. We enable this by allowing extensions to modify runtime configuration values through the plugin API's `ConfigRegistry`:
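A hypothetical sketch of this: the method names on `ConfigRegistry`, the checkpoint key, and `CheckpointingDataSource` are assumptions for illustration, not the actual plugin API.

```java
import java.time.Instant;
import java.util.Optional;

// Assumed key-value shape of the registry; the real interface may differ.
interface ConfigRegistry {
    Optional<String> getRuntimeValue(String key);
    void setRuntimeValue(String key, String value);
}

class CheckpointingDataSource {

    private static final String CHECKPOINT_KEY = "last.modified.epoch.seconds";

    private final ConfigRegistry configRegistry;

    CheckpointingDataSource(ConfigRegistry configRegistry) {
        this.configRegistry = configRegistry;
    }

    void mirror() {
        // Resume from the last checkpoint, or from the epoch on first run.
        Instant since = configRegistry.getRuntimeValue(CHECKPOINT_KEY)
                .map(Long::parseLong)
                .map(Instant::ofEpochSecond)
                .orElse(Instant.EPOCH);

        // ... fetch only data modified after "since" from the remote source ...

        // Persist the new checkpoint once fetching succeeded.
        configRegistry.setRuntimeValue(
                CHECKPOINT_KEY, String.valueOf(Instant.now().getEpochSecond()));
    }
}
```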
This provides the following benefits:
- Checkpointing logic is entirely hidden from the consumer.
- Checkpoint data will (soon) be viewable and editable via the UI, allowing users to reset it if desired.
## Built-In Sources
We will continue to ship a small selection of data sources out-of-the-box:
- GitHub Advisories
- NVD
- OSV
Instead of maintaining code for them in the `apiserver` module, we'll create a new Maven module for each of them:
```
+ hyades-apiserver/
 \-+ plugin/
   |-+ api/
   |-+ vuln-data-source-github/
   |-+ vuln-data-source-nvd/
   \-+ vuln-data-source-osv/
```
Not only does this "physically" separate them from the core API server code, effectively preventing tight coupling, it also allows for more efficient build parallelization and caching.
## Considered Alternatives
It was considered to use `Stream`s instead of `Iterator`s, since the streaming API is more modern. This was discarded because, in order to create a `Stream` from a custom data source, the `Spliterator` interface must be implemented. `Spliterator`s are a lot more complex than `Iterator`s, without adding much value for what we're trying to achieve.
## Consequences
- Data source implementations must be ported from `mirror-service`.
- `mirror-service` must be decommissioned.