@cyclonedx/cdxgen

# Introduction This directory contains static knowledge that cdxgen uses at runtime. Some files are passive reference data. Others directly shape behavior, especially query packs, rule files, schemas, aliases, and component-tag metadata. ## Purpose of this directory Treat `data/` as product behavior, not as a convenient dump of reference files. If a file here is stale, incomplete, or incorrectly sourced, it can change runtime output, validation behavior, or audit findings. ## Contribution policy Direct pull requests that only hand-edit curated data in `data/` are not accepted. Start with an issue or a broader change proposal that explains: 1. the upstream source of truth 2. whether the file is upstream, derived, or hand-curated 3. how it should be refreshed 4. what tests or validation prove the update is safe Prefer adding or improving automation under `contrib/` over one-off manual edits. ## Directory contents | Filename | Purpose | Source | Curation / refresh path | |---|---|---|---| | `bom-1.4.schema.json` | CycloneDX 1.4 JSON schema for legacy compatibility validation | CycloneDX specification schema | upstream-derived compatibility copy; active feature work should target 1.5–1.7 | | `bom-1.5.schema.json` | CycloneDX 1.5 JSON schema for validation | CycloneDX specification schema | upstream-derived | | `bom-1.6.schema.json` | CycloneDX 1.6 JSON schema for validation | CycloneDX specification schema | upstream-derived | | `bom-1.7.schema.json` | CycloneDX 1.7 JSON schema for validation | CycloneDX specification schema | upstream-derived | | `cbomosdb-queries.json` | osquery queries for identifying SSL packages in OS contexts | project-maintained query pack | hand-curated with tests; should evolve with query-pack review | | `component-tags.json` | tags extracted from component descriptions for classification | project-maintained derived dataset | partially curated; automation opportunities remain | | `container-knowledge-index.json` | reference knowledge for container analysis | project-maintained derived dataset | partially curated; automation opportunities remain | | `cosdb-queries.json` | osquery queries useful for identifying OS packages for C | project-maintained query pack | hand-curated with tests | | `crypto-oid.json` | OID mapping reference used for crypto-aware output | standards and project-maintained mapping inputs | curated compatibility dataset | | `cryptography-defs.json` | cryptography inventory definitions | project-maintained definitions | curated; should be kept aligned with analyzer and CBOM logic | | `frameworks-list.json` | string fragments used to classify framework components | project-maintained heuristics | hand-curated; good candidate for future automation | | `gtfobins-index.json` | GTFOBins reference data used for Linux container and runtime executable enrichment | GTFOBins project data plus project normalization | derived and normalized for cdxgen | | `known-licenses.json` | hard-coded license corrections | project-maintained compatibility fixes | hand-curated escape hatch; prefer upstream/source fixes when possible | | `lic-mapping.json` | fallback license-name to identifier mapping | project-maintained mapping | hand-curated compatibility layer | | `lolbas-index.json` | LOLBAS reference data used for Windows runtime findings | LOLBAS project data plus project normalization | derived and normalized for cdxgen | | `predictive-audit-allowlist.json` | allowlist data for audit behavior | project-maintained heuristics | curated; should be reviewed alongside audit targeting logic | | `pypi-pkg-aliases.json` | Python package-name alias data | project-maintained alias mapping | hand-curated compatibility layer | | `python-stdlib.json` | Python standard-library entries that can be filtered out | Python stdlib references plus project normalization | derived list; automation opportunities remain | | `queries.json` | Linux osquery query pack for OBOM and runtime inventory | project-maintained query pack | hand-curated with tests | | `queries-win.json` | Windows osquery query pack | project-maintained query pack | hand-curated with tests | | `queries-darwin.json` | macOS osquery query pack | project-maintained query pack | hand-curated with tests | | `rules/` | built-in BOM audit rule packs in YAML | project-maintained rule packs | hand-authored rules validated by tests; users can also supply their own rule packs | | `spdx-licenses.json` | SPDX license identifiers | SPDX License List data | upstream-derived | | `spdx-export.schema.json` | SPDX 3.0.1 schema used during export validation | project-derived export schema generated from SPDX model artifacts | derived artifact; there is not a single upstream-published JSON schema that exactly matches this export use case | | `spdx.schema.json` | SPDX schema for validation | SPDX JSON schema inputs used by the project | upstream-derived compatibility copy | | `vendor-alias.json` | vendor or group-name alias fixes | project-maintained alias mapping | hand-curated compatibility layer; should eventually be reduced as heuristics improve | | `wrapdb-releases.json` | Meson WrapDB release data | Meson WrapDB | derived artifact; refresh automation still needs to be formalized and maintained | ## How this directory fits into the architecture ### ASCII view ```text runtime code | +--> lib/cli/* -----------> alias files, framework lists, tag maps | +--> lib/stages/postgen/* -> rule packs, standards data, schemas | +--> lib/audit/* ---------> rules/, allowlists, scoring support data | +--> lib/validator/* -----> CycloneDX and SPDX schemas | +--> OBOM flows ----------> queries*.json, GTFOBins, LOLBAS, knowledge indexes ``` ### Mermaid view ```mermaid flowchart TD A[data/] --> B[schemas] A --> C[query packs] A --> D[rule files] A --> E[alias and mapping files] A --> F[knowledge indexes] B --> G[validator] C --> H[OBOM and runtime inventory] D --> I[audit engine] E --> J[parsers and metadata helpers] F --> K[container and runtime enrichment] ``` ## Query-pack files The three `queries*.json` files are platform-specific osquery packs. They describe what cdxgen should ask osquery for when generating OS and runtime inventory. ### Query-pack shape | Field | Required | Purpose | |---|---|---| | `query` | yes | SQL executed against osquery | | `description` | yes | human-readable explanation of the collection intent | | `purlType` | yes | package URL type used for derived components | | `componentType` | no | CycloneDX component type when `library` is not appropriate | | `name` | no | component-name override for result sets that do not naturally expose one | ### Example mental model ```text queries.json entry | v osquery runs SQL | v rows come back | v cdxgen maps rows into components using purlType and componentType ``` ### Good query-pack hygiene | Practice | Why it matters | |---|---| | keep descriptions specific | helps users understand collected categories | | choose `componentType` carefully | affects how consumers interpret results | | mirror cross-platform entries intentionally | reduces accidental platform drift | | keep query scope safe and bounded | avoids expensive or unsafe collection | ## Rule files under `data/rules/` Rule files are YAML packs consumed by the audit flow. Each file groups rules by a shared theme such as container risk, rootfs hardening, OBOM runtime posture, or AI agent governance. ### Rule evaluation flow #### ASCII view ```text input BOM | v load YAML rule pack | v for each rule | +--> evaluate JSONata condition against BOM +--> collect matching components +--> build location object +--> render message template +--> attach mitigation, evidence, ATT&CK, and standards metadata | v audit findings ``` #### Mermaid view ```mermaid flowchart TD A[BOM input] --> B[load rule YAML] B --> C[evaluate condition] C --> D{matched components?} D -->|no| E[no finding] D -->|yes| F[build location and message] F --> G[attach mitigation and evidence] G --> H[emit finding] ``` ## Rule schema in practice Each rule is a YAML list item. These fields matter most. | Field | Required | Purpose | |---|---|---| | `id` | yes | unique stable identifier such as `CTR-001` | | `name` | yes | short title used in findings | | `description` | yes | why the rule exists and what it detects | | `severity` | yes | risk level such as `critical`, `high`, `medium`, `low`, `info` | | `category` | yes | thematic category that usually aligns with the file grouping | | `dry-run-support` | yes | whether the rule can work on dry-run style BOMs | | `condition` | yes | JSONata expression that selects matching components | | `location` | yes | JSONata expression that builds a location object for the match | | `message` | yes | rendered finding text, including placeholders | | `mitigation` | yes | remediation guidance shown with the finding | | `evidence` | no | extra structured data carried with the finding | | `attack` | no | MITRE ATT&CK mapping data | | `standards` | no | mapping of standard names to reference identifiers; surfaced in audit annotations as `cdx:audit:standards:*` metadata | ## Writing `condition` expressions Conditions are written in JSONata and evaluated against the BOM document. In practice, most rules filter the `components` array. ```yaml condition: | components[ $prop($, 'cdx:some:property') = 'expected-value' and type = 'library' ] ``` ### Helper functions commonly used in rules | Function | Purpose | |---|---| | `$prop(component, name)` | fetches a CycloneDX property by name | | `$nullSafeProp(component, name)` | null-safe property fetch for comparisons | | `$listContains(list, value)` | checks list-like property text for a specific entry | | `$firstNonEmpty(a, b, ...)` | returns the first non-empty value | ### Thinking about rule conditions A good condition is usually: 1. specific enough to avoid noise 2. readable enough for reviewers to reason about 3. based on stable properties that cdxgen already emits consistently ## Message rendering The `message` field supports template placeholders using double braces. ```yaml message: "Package '{{ name }}' at version '{{ version }}' is affected" ``` Those expressions are evaluated in the context of the matched component. Keep messages clear and reviewer-friendly. The message should explain the risk without requiring the reader to decode the raw JSONata condition. ## Authoring rules with the REPL Use `cdxi` when you want a tight feedback loop while authoring or debugging a rule. A practical flow is: ```text cdxi bom.json .query components[type = 'library'] .query components[$prop($, 'cdx:github:action:isShaPinned') = 'false'] .auditfindings .validate ``` Why this helps: | REPL command | Use while authoring rules | |---|---| | `.query <jsonata>` | test the JSONata shape before copying it into a YAML rule | | `.inspect <name-or-purl>` | inspect a concrete component when a condition is too broad or too narrow | | `.auditfindings` | review existing annotations produced by `--bom-audit` or `cdx-audit` | | `.validate` | quickly validate the loaded BOM before concluding the rule is wrong | ## Using custom rule packs Users can maintain their own rule packs outside this repository and supply the directory at runtime. ```bash # Apply custom rules during BOM generation cdxgen --bom-audit --bom-audit-rules-dir ./my-rules -o bom.json # Apply custom rules with the standalone audit command cdx-audit --bom bom.json --direct-bom-audit --rules-dir ./my-rules ``` This is the preferred path for organization-specific policy rather than submitting narrowly scoped custom rules into `data/rules/`. ## Adding a new rule safely Use this sequence. 1. choose the correct category file under `data/rules/` 2. draft the condition against a real BOM sample 3. keep the location object small and actionable 4. add mitigation text that tells the user what to do next 5. add or update tests in `lib/stages/postgen/auditBom.poku.js` ## Choosing between a rule, a query-pack entry, and a helper-data file | If you need to add... | It probably belongs in... | |---|---| | a new risk detection idea over existing BOM fields | `data/rules/*.yaml` | | a new host or runtime collection source | `queries*.json` | | a new alias, mapping, or classifier list | another JSON file in `data/` | | a new schema or validation artifact | `data/*schema*.json` | ## Automation and maintenance gaps Some files in `data/` are still compatibility layers, hand-curated heuristics, or locally derived artifacts. The goal should be to reduce those hacks over time, not normalize them. Current expectations: 1. open an issue before proposing a new refresh process or replacing a derived artifact 2. document the upstream source and whether the file is hand-curated, upstream, or locally derived 3. prefer a repeatable refresh script under `contrib/` where practical 4. keep tests close to any rule or query-pack change `wrapdb-releases.json` remains a derived artifact, but its refresh path still needs to be formalized and maintained like the rest of the curated datasets. Those gaps should be tracked as issue-first follow-up work rather than solved with silent one-off edits. ## Maintenance advice This directory changes slowly, but small mistakes here can affect a lot of runtime behavior. Treat edits as code, not content. | Habit | Why it helps | |---|---| | keep examples close to real emitted fields | avoids stale rules | | review platform symmetry for query packs | avoids one-OS regressions | | test new rules with realistic BOM fixtures | catches false positives early | | document new files here | keeps contributors oriented | | replace hacks with sourced or scripted refresh paths when possible | keeps long-term maintenance manageable |