# Introduction
This directory contains static knowledge that cdxgen uses at runtime. Some files are passive reference data. Others directly shape behavior, especially query packs, rule files, schemas, aliases, and component-tag metadata.
## Purpose of this directory
Treat `data/` as product behavior, not as a convenient dump of reference files. If a file here is stale, incomplete, or incorrectly sourced, it can change runtime output, validation behavior, or audit findings.
## Contribution policy
Direct pull requests that only hand-edit curated data in `data/` are not accepted. Start with an issue or a broader change proposal that explains:
1. the upstream source of truth
2. whether the file is upstream, derived, or hand-curated
3. how it should be refreshed
4. what tests or validation prove the update is safe
Prefer adding or improving automation under `contrib/` over one-off manual edits.
## Directory contents
| Filename | Purpose | Source | Curation / refresh path |
|---|---|---|---|
| `bom-1.4.schema.json` | CycloneDX 1.4 JSON schema for legacy compatibility validation | CycloneDX specification schema | upstream-derived compatibility copy; active feature work should target 1.5–1.7 |
| `bom-1.5.schema.json` | CycloneDX 1.5 JSON schema for validation | CycloneDX specification schema | upstream-derived |
| `bom-1.6.schema.json` | CycloneDX 1.6 JSON schema for validation | CycloneDX specification schema | upstream-derived |
| `bom-1.7.schema.json` | CycloneDX 1.7 JSON schema for validation | CycloneDX specification schema | upstream-derived |
| `cbomosdb-queries.json` | osquery queries for identifying SSL packages in OS contexts | project-maintained query pack | hand-curated with tests; should evolve with query-pack review |
| `component-tags.json` | tags extracted from component descriptions for classification | project-maintained derived dataset | partially curated; automation opportunities remain |
| `container-knowledge-index.json` | reference knowledge for container analysis | project-maintained derived dataset | partially curated; automation opportunities remain |
| `cosdb-queries.json` | osquery queries useful for identifying OS packages for C | project-maintained query pack | hand-curated with tests |
| `crypto-oid.json` | OID mapping reference used for crypto-aware output | standards and project-maintained mapping inputs | curated compatibility dataset |
| `cryptography-defs.json` | cryptography inventory definitions | project-maintained definitions | curated; should be kept aligned with analyzer and CBOM logic |
| `frameworks-list.json` | string fragments used to classify framework components | project-maintained heuristics | hand-curated; good candidate for future automation |
| `gtfobins-index.json` | GTFOBins reference data used for Linux container and runtime executable enrichment | GTFOBins project data plus project normalization | derived and normalized for cdxgen |
| `known-licenses.json` | hard-coded license corrections | project-maintained compatibility fixes | hand-curated escape hatch; prefer upstream/source fixes when possible |
| `lic-mapping.json` | fallback license-name to identifier mapping | project-maintained mapping | hand-curated compatibility layer |
| `lolbas-index.json` | LOLBAS reference data used for Windows runtime findings | LOLBAS project data plus project normalization | derived and normalized for cdxgen |
| `predictive-audit-allowlist.json` | allowlist data for audit behavior | project-maintained heuristics | curated; should be reviewed alongside audit targeting logic |
| `pypi-pkg-aliases.json` | Python package-name alias data | project-maintained alias mapping | hand-curated compatibility layer |
| `python-stdlib.json` | Python standard-library entries that can be filtered out | Python stdlib references plus project normalization | derived list; automation opportunities remain |
| `queries.json` | Linux osquery query pack for OBOM and runtime inventory | project-maintained query pack | hand-curated with tests |
| `queries-win.json` | Windows osquery query pack | project-maintained query pack | hand-curated with tests |
| `queries-darwin.json` | macOS osquery query pack | project-maintained query pack | hand-curated with tests |
| `rules/` | built-in BOM audit rule packs in YAML | project-maintained rule packs | hand-authored rules validated by tests; users can also supply their own rule packs |
| `spdx-licenses.json` | SPDX license identifiers | SPDX License List data | upstream-derived |
| `spdx-export.schema.json` | SPDX 3.0.1 schema used during export validation | project-derived export schema generated from SPDX model artifacts | derived artifact; no single upstream-published JSON schema exactly matches this export use case |
| `spdx.schema.json` | SPDX schema for validation | SPDX JSON schema inputs used by the project | upstream-derived compatibility copy |
| `vendor-alias.json` | vendor or group-name alias fixes | project-maintained alias mapping | hand-curated compatibility layer; should eventually be reduced as heuristics improve |
| `wrapdb-releases.json` | Meson WrapDB release data | Meson WrapDB | derived artifact; refresh automation still needs to be formalized and maintained |
## How this directory fits into the architecture
### ASCII view
```text
runtime code
   |
   +--> lib/cli/* -----------> alias files, framework lists, tag maps
   |
   +--> lib/stages/postgen/* -> rule packs, standards data, schemas
   |
   +--> lib/audit/* ---------> rules/, allowlists, scoring support data
   |
   +--> lib/validator/* -----> CycloneDX and SPDX schemas
   |
   +--> OBOM flows ----------> queries*.json, GTFOBins, LOLBAS, knowledge indexes
```
### Mermaid view
```mermaid
flowchart TD
A[data/] --> B[schemas]
A --> C[query packs]
A --> D[rule files]
A --> E[alias and mapping files]
A --> F[knowledge indexes]
B --> G[validator]
C --> H[OBOM and runtime inventory]
D --> I[audit engine]
E --> J[parsers and metadata helpers]
F --> K[container and runtime enrichment]
```
## Query-pack files
The three `queries*.json` files are platform-specific osquery packs, one each for Linux, Windows, and macOS. Each entry tells cdxgen what to ask osquery for when generating OS and runtime inventory.
### Query-pack shape
| Field | Required | Purpose |
|---|---|---|
| `query` | yes | SQL executed against osquery |
| `description` | yes | human-readable explanation of the collection intent |
| `purlType` | yes | package URL type used for derived components |
| `componentType` | no | CycloneDX component type when `library` is not appropriate |
| `name` | no | component-name override for result sets that do not naturally expose one |
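
The required fields in the table lend themselves to a small lint check. The sketch below is illustrative only and is not how cdxgen validates its packs: `checkQueryPackEntry` and the sample entry are hypothetical, and it assumes each entry is a plain object whose required fields are non-empty strings.

```javascript
// Hypothetical lint helper; not part of cdxgen.
const REQUIRED_QUERY_FIELDS = ["query", "description", "purlType"];

function checkQueryPackEntry(entry) {
  const problems = [];
  for (const field of REQUIRED_QUERY_FIELDS) {
    const value = entry[field];
    // Assumption: required fields are non-empty strings.
    if (typeof value !== "string" || value.trim() === "") {
      problems.push(`missing required field: ${field}`);
    }
  }
  return problems;
}

// Illustrative entry; the SQL and values are made up for this example.
const sampleEntry = {
  query: "SELECT name, version FROM deb_packages;",
  description: "Installed Debian packages",
  purlType: "deb",
};

console.log(checkQueryPackEntry(sampleEntry));
console.log(checkQueryPackEntry({ query: "SELECT 1;" }));
```

The optional `componentType` and `name` fields are deliberately left unchecked, since their absence is valid.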
### Example mental model
```text
queries.json entry
        |
        v
osquery runs SQL
        |
        v
rows come back
        |
        v
cdxgen maps rows into components using purlType and componentType
```
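
The last step of that flow can be sketched in a few lines. This is a simplified illustration under stated assumptions, not cdxgen's actual mapper: `rowsToComponents` is a hypothetical name, the `name` and `version` row keys are assumed, the entry-level `name` is assumed to take precedence over the row, and the purl is built without namespaces, qualifiers, or percent-encoding.

```javascript
// Hypothetical mapper; cdxgen's real mapping logic is more involved.
function rowsToComponents(entry, rows) {
  return rows
    .map((row) => {
      // Assumption: the entry-level `name` overrides whatever the row exposes.
      const name = entry.name || row.name;
      if (!name) return undefined; // skip rows that cannot be named
      const version = row.version || "";
      // Simplified purl: no namespace, qualifiers, or percent-encoding.
      const purl = `pkg:${entry.purlType}/${name}${version ? `@${version}` : ""}`;
      return {
        // `library` is used when the entry does not set `componentType`.
        type: entry.componentType || "library",
        name,
        version,
        purl,
        "bom-ref": purl,
      };
    })
    .filter(Boolean);
}

const debEntry = { query: "SELECT name, version FROM deb_packages;", description: "Debian packages", purlType: "deb" };
const osqueryRows = [
  { name: "openssl", version: "3.0.2" },
  { version: "1.0" }, // no name: dropped by the sketch
];

console.log(rowsToComponents(debEntry, osqueryRows));
```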
### Good query-pack hygiene
| Practice | Why it matters |
|---|---|
| keep descriptions specific | helps users understand collected categories |
| choose `componentType` carefully | affects how consumers interpret results |
| mirror cross-platform entries intentionally | reduces accidental platform drift |
| keep query scope safe and bounded | avoids expensive or unsafe collection |
## Rule files under `data/rules/`
Rule files are YAML packs consumed by the audit flow. Each file groups rules by a shared theme such as container risk, rootfs hardening, OBOM runtime posture, or AI agent governance.
### Rule evaluation flow
#### ASCII view
```text
input BOM
   |
   v
load YAML rule pack
   |
   v
for each rule
   |
   +--> evaluate JSONata condition against BOM
   +--> collect matching components
   +--> build location object
   +--> render message template
   +--> attach mitigation, evidence, ATT&CK, and standards metadata
   |
   v
audit findings
```
#### Mermaid view
```mermaid
flowchart TD
A[BOM input] --> B[load rule YAML]
B --> C[evaluate condition]
C --> D{matched components?}
D -->|no| E[no finding]
D -->|yes| F[build location and message]
F --> G[attach mitigation and evidence]
G --> H[emit finding]
```
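
The loop in both diagrams can be sketched as follows. To stay dependency-free, this example swaps JSONata for plain JavaScript functions: `condition`, `location`, and `message` are callbacks here, whereas real rules express them as JSONata and templates. The `evaluateRules` name, the finding shape, and the `DEMO-001` rule are all hypothetical.

```javascript
// Simplified stand-in for the audit loop; not cdxgen's implementation.
function evaluateRules(bom, rules) {
  const findings = [];
  for (const rule of rules) {
    // Step 1: select matching components (JSONata in real rules).
    const matches = rule.condition(bom) || [];
    for (const component of matches) {
      // Steps 2-4: build location, render message, attach metadata.
      findings.push({
        ruleId: rule.id,
        severity: rule.severity,
        location: rule.location(component),
        message: rule.message(component),
        mitigation: rule.mitigation,
      });
    }
  }
  return findings; // a rule with no matches contributes no finding
}

const demoBom = {
  components: [
    { type: "library", name: "left-pad", version: "1.3.0", "bom-ref": "pkg:npm/left-pad@1.3.0" },
    { type: "application", name: "demo-app", version: "0.1.0", "bom-ref": "pkg:npm/demo-app@0.1.0" },
  ],
};

const demoRules = [
  {
    id: "DEMO-001", // hypothetical id
    severity: "low",
    condition: (doc) => (doc.components || []).filter((c) => c.type === "library"),
    location: (c) => ({ bomRef: c["bom-ref"] }),
    message: (c) => `Library '${c.name}' matched the demo rule`,
    mitigation: "Review the component.",
  },
];

console.log(evaluateRules(demoBom, demoRules));
```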
## Rule schema in practice
Each rule is a YAML list item. These fields matter most.
| Field | Required | Purpose |
|---|---|---|
| `id` | yes | unique stable identifier such as `CTR-001` |
| `name` | yes | short title used in findings |
| `description` | yes | why the rule exists and what it detects |
| `severity` | yes | risk level such as `critical`, `high`, `medium`, `low`, `info` |
| `category` | yes | thematic category that usually aligns with the file grouping |
| `dry-run-support` | yes | whether the rule can work on dry-run style BOMs |
| `condition` | yes | JSONata expression that selects matching components |
| `location` | yes | JSONata expression that builds a location object for the match |
| `message` | yes | rendered finding text, including placeholders |
| `mitigation` | yes | remediation guidance shown with the finding |
| `evidence` | no | extra structured data carried with the finding |
| `attack` | no | MITRE ATT&CK mapping data |
| `standards` | no | mapping of standard names to reference identifiers; surfaced in audit annotations as `cdx:audit:standards:*` metadata |
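
A quick pre-review check for the required fields might look like the sketch below. It operates on rules that have already been parsed from YAML into objects. Everything here is an assumption for illustration: the `checkRulePack` helper is hypothetical, the severity list is treated as closed even though the table only says "such as", and the sample rule's values (including the format of `dry-run-support`) are made up.

```javascript
// Field list taken from the rule schema table; the checks themselves are a sketch.
const REQUIRED_RULE_FIELDS = [
  "id", "name", "description", "severity", "category",
  "dry-run-support", "condition", "location", "message", "mitigation",
];
// Assumption: only the severities named in the table are accepted.
const KNOWN_SEVERITIES = ["critical", "high", "medium", "low", "info"];

function checkRulePack(rules) {
  const problems = [];
  const seenIds = new Set();
  rules.forEach((rule, index) => {
    const label = rule.id || `rule #${index + 1}`;
    for (const field of REQUIRED_RULE_FIELDS) {
      const value = rule[field];
      if (value === undefined || value === null || value === "") {
        problems.push(`${label}: missing required field '${field}'`);
      }
    }
    if (rule.severity && !KNOWN_SEVERITIES.includes(rule.severity)) {
      problems.push(`${label}: unexpected severity '${rule.severity}'`);
    }
    if (rule.id) {
      // Ids are meant to be unique and stable across the pack.
      if (seenIds.has(rule.id)) problems.push(`${label}: duplicate id`);
      seenIds.add(rule.id);
    }
  });
  return problems;
}

// A parsed rule as it might look after YAML loading; all values are illustrative.
const demoRule = {
  id: "DEMO-001",
  name: "Demo rule",
  description: "Flags library components for demonstration",
  severity: "low",
  category: "demo",
  "dry-run-support": true, // value format assumed; check existing rule files
  condition: "components[type = 'library']",
  location: "{ 'bomRef': `bom-ref` }",
  message: "Library '{{ name }}' matched the demo rule",
  mitigation: "Review the component.",
};

console.log(checkRulePack([demoRule]));
console.log(checkRulePack([demoRule, { ...demoRule, severity: "urgent" }]));
```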
## Writing `condition` expressions
Conditions are written in JSONata and evaluated against the BOM document. In practice, most rules filter the `components` array.
```yaml
condition: |
  components[
    $prop($, 'cdx:some:property') = 'expected-value'
    and type = 'library'
  ]
```
### Helper functions commonly used in rules
| Function | Purpose |
|---|---|
| `$prop(component, name)` | fetches a CycloneDX property by name |
| `$nullSafeProp(component, name)` | null-safe property fetch for comparisons |
| `$listContains(list, value)` | checks list-like property text for a specific entry |
| `$firstNonEmpty(a, b, ...)` | returns the first non-empty value |
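
To make the helper semantics concrete, here are approximate plain-JavaScript equivalents. These are not cdxgen's implementations, which are registered with the JSONata evaluator and may differ in detail. The function names, the empty-string fallback in `nullSafeProp`, and the comma-separated format assumed by `listContains` are all assumptions for this sketch.

```javascript
// Approximate re-implementations for illustration only.
function prop(component, name) {
  // CycloneDX properties are an array of { name, value } pairs.
  const hit = (component.properties || []).find((p) => p.name === name);
  return hit ? hit.value : undefined;
}

function nullSafeProp(component, name) {
  // Assumption: a missing property compares as an empty string.
  const value = prop(component, name);
  return value === undefined || value === null ? "" : value;
}

function listContains(list, value) {
  // Assumption: list-like property text is comma-separated.
  return String(list || "")
    .split(",")
    .map((item) => item.trim())
    .includes(value);
}

function firstNonEmpty(...values) {
  return values.find((v) => v !== undefined && v !== null && v !== "");
}

const action = {
  type: "library",
  name: "checkout",
  properties: [{ name: "cdx:github:action:isShaPinned", value: "false" }],
};

// Plain-JS analogue of: components[$prop($, '...') = 'false' and type = 'library']
const unpinned = [action].filter(
  (c) => prop(c, "cdx:github:action:isShaPinned") === "false" && c.type === "library",
);
console.log(unpinned.map((c) => c.name));
```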
### Thinking about rule conditions
A good condition is usually:
1. specific enough to avoid noise
2. readable enough for reviewers to reason about
3. based on stable properties that cdxgen already emits consistently
## Message rendering
The `message` field supports template placeholders using double braces.
```yaml
message: "Package '{{ name }}' at version '{{ version }}' is affected"
```
Those expressions are evaluated in the context of the matched component. Keep messages clear and reviewer-friendly. The message should explain the risk without requiring the reader to decode the raw JSONata condition.
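
As a rough illustration of placeholder substitution, the sketch below replaces `{{ field }}` with the matching top-level field of a component. It is a deliberately narrow stand-in: `renderMessage` is a hypothetical helper, it handles only plain field names, and anything richer than that is left untouched rather than evaluated.

```javascript
// Minimal placeholder substitution; not cdxgen's renderer.
function renderMessage(template, component) {
  // Matches {{ name }} style placeholders containing a simple identifier.
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (_match, key) => {
    const value = component[key];
    // Assumption: unknown or empty fields render as an empty string.
    return value === undefined || value === null ? "" : String(value);
  });
}

const rendered = renderMessage(
  "Package '{{ name }}' at version '{{ version }}' is affected",
  { name: "left-pad", version: "1.3.0" },
);
console.log(rendered); // Package 'left-pad' at version '1.3.0' is affected
```

Trying a message against a real component this way is a quick check that the finding reads well on its own.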
## Authoring rules with the REPL
Use `cdxi` when you want a tight feedback loop while authoring or debugging a rule. A practical flow is:
```text
cdxi bom.json
.query components[type = 'library']
.query components[$prop($, 'cdx:github:action:isShaPinned') = 'false']
.auditfindings
.validate
```
Why this helps:
| REPL command | Use while authoring rules |
|---|---|
| `.query <jsonata>` | test the JSONata shape before copying it into a YAML rule |
| `.inspect <name-or-purl>` | inspect a concrete component when a condition is too broad or too narrow |
| `.auditfindings` | review existing annotations produced by `--bom-audit` or `cdx-audit` |
| `.validate` | quickly validate the loaded BOM before concluding the rule is wrong |
## Using custom rule packs
Users can maintain their own rule packs outside this repository and supply the directory at runtime.
```bash
# Apply custom rules during BOM generation
cdxgen --bom-audit --bom-audit-rules-dir ./my-rules -o bom.json
# Apply custom rules with the standalone audit command
cdx-audit --bom bom.json --direct-bom-audit --rules-dir ./my-rules
```
This is the preferred path for organization-specific policy rather than submitting narrowly scoped custom rules into `data/rules/`.
## Adding a new rule safely
Use this sequence.
1. choose the correct category file under `data/rules/`
2. draft the condition against a real BOM sample
3. keep the location object small and actionable
4. add mitigation text that tells the user what to do next
5. add or update tests in `lib/stages/postgen/auditBom.poku.js`
## Choosing between a rule, a query-pack entry, and a helper-data file
| If you need to add... | It probably belongs in... |
|---|---|
| a new risk detection idea over existing BOM fields | `data/rules/*.yaml` |
| a new host or runtime collection source | `queries*.json` |
| a new alias, mapping, or classifier list | another JSON file in `data/` |
| a new schema or validation artifact | `data/*schema*.json` |
## Automation and maintenance gaps
Some files in `data/` are still compatibility layers, hand-curated heuristics, or locally derived artifacts. The goal should be to reduce those hacks over time, not normalize them.
Current expectations:
1. open an issue before proposing a new refresh process or replacing a derived artifact
2. document the upstream source and whether the file is hand-curated, upstream, or locally derived
3. prefer a repeatable refresh script under `contrib/` where practical
4. keep tests close to any rule or query-pack change
`wrapdb-releases.json` remains a derived artifact whose refresh path still needs to be formalized and maintained like the rest of the curated datasets. Track gaps like this one as issue-first follow-up work rather than fixing them with silent one-off edits.
## Maintenance advice
This directory changes slowly, but small mistakes here can affect a lot of runtime behavior. Treat edits as code, not content.
| Habit | Why it helps |
|---|---|
| keep examples close to real emitted fields | avoids stale rules |
| review platform symmetry for query packs | avoids one-OS regressions |
| test new rules with realistic BOM fixtures | catches false positives early |
| document new files here | keeps contributors oriented |
| replace hacks with sourced or scripted refresh paths when possible | keeps long-term maintenance manageable |