Provenance + licence per training dataset
Demonstrate that all training datasets used in AI/ML systems are catalogued with documented provenance and licensing terms that support lawful use and enable traceable lineage.
Description
What this control does
This control requires organizations to maintain a formal inventory of all training datasets used in AI/ML model development, documenting the origin (provenance) and licensing terms for each dataset. Provenance tracking includes source identification, data lineage, acquisition date, and custodial chain, while licensing documentation captures usage rights, redistribution constraints, and compliance obligations. This ensures legal defensibility, reproducibility, and risk transparency in AI systems by preventing unauthorized data use and enabling impact assessment when upstream datasets are compromised or recalled.
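As an illustration of what one inventory entry might hold, the sketch below models the provenance and licensing fields named above as a Python dataclass. The class name, field names, and the completeness rule are assumptions made for this example, not a prescribed schema; an empty lineage is treated as valid because an original dataset has no upstream parents.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class DatasetRecord:
    """One inventory entry. All field names here are illustrative assumptions."""
    dataset_id: str
    source: str                        # origin: vendor, URL, or internal system
    acquired_on: Optional[date]        # acquisition date, None if unknown
    custodian: str                     # team or person currently responsible
    lineage: list[str] = field(default_factory=list)      # ids of upstream datasets
    license_name: str = ""             # license identifier or contract reference
    commercial_use: Optional[bool] = None     # None means "not yet determined"
    redistribution: Optional[bool] = None     # None means "not yet determined"
    obligations: list[str] = field(default_factory=list)  # e.g. "attribution"

    def missing_fields(self) -> list[str]:
        """Return the names of required fields that are still empty."""
        required = {
            "source": self.source,
            "acquired_on": self.acquired_on,
            "custodian": self.custodian,
            "license_name": self.license_name,
            "commercial_use": self.commercial_use,
            "redistribution": self.redistribution,
        }
        # False is a valid answer for the boolean fields, so only None and ""
        # count as missing.
        return [name for name, value in required.items() if value in (None, "")]


# Hypothetical entry with provenance partly filled in and no licensing yet.
record = DatasetRecord(
    dataset_id="ds-001",
    source="https://example.org/corpus",
    acquired_on=None,
    custodian="data-eng",
)
print(record.missing_fields())
```

Keeping "not yet determined" (`None`) distinct from "not permitted" (`False`) lets a reviewer tell an incomplete entry apart from a restrictive license.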
Control objective
What auditing this proves
Demonstrate that all training datasets used in AI/ML systems are catalogued with documented provenance and licensing terms that support lawful use and enable traceable lineage.
Associated risks
Risks this control addresses
- Use of training data without proper licensing rights exposes the organization to intellectual property infringement claims and financial liability
- Datasets containing poisoned or malicious samples from unverified sources compromise model integrity and prediction reliability
- Inability to trace data provenance prevents effective response when upstream datasets are found to contain sensitive information or are subject to recall
- Datasets with missing or restrictive license terms may bar commercial deployment of the models trained on them, leaving the development investment unusable
- Lack of provenance documentation prevents reproducibility of model training, hindering debugging, auditing, and regulatory compliance verification
- Training on datasets with incompatible licenses (e.g., share-alike data such as CC BY-SA, or non-commercial data such as CC BY-NC, in proprietary commercial models) may create unintended licensing obligations on derived models
- Undocumented datasets from untrusted sources may contain biased, discriminatory, or legally protected content that exposes the organization to regulatory penalties
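The licensing risks above can be screened for mechanically before legal review. The following is a minimal sketch, assuming a hand-maintained policy table keyed by SPDX-style license identifiers; the table contents, the function name `license_findings`, and the finding messages are all hypothetical. Whether share-alike terms actually reach a trained model is a legal question, so the sketch only flags such cases for review.

```python
# Hypothetical policy table: license identifier -> terms relevant to training.
# The two flags reflect the published Creative Commons terms (NC forbids
# commercial use, SA requires share-alike); everything else is an assumption.
LICENSE_TERMS = {
    "CC0-1.0": {"commercial": True, "share_alike": False},
    "CC-BY-4.0": {"commercial": True, "share_alike": False},
    "CC-BY-SA-4.0": {"commercial": True, "share_alike": True},
    "CC-BY-NC-4.0": {"commercial": False, "share_alike": False},
}


def license_findings(licenses, commercial_deployment):
    """Return a list of human-readable findings for a model's dataset licenses."""
    findings = []
    for name in licenses:
        terms = LICENSE_TERMS.get(name)
        if terms is None:
            # Unknown licenses are never silently accepted.
            findings.append(f"{name}: not in policy table, needs legal review")
            continue
        if commercial_deployment and not terms["commercial"]:
            findings.append(
                f"{name}: non-commercial terms conflict with commercial deployment"
            )
        if terms["share_alike"]:
            findings.append(
                f"{name}: share-alike terms may extend to derived models, needs legal review"
            )
    return findings


# Example: one permissive, one non-commercial, and one unrecognized license.
for finding in license_findings(
    ["CC-BY-4.0", "CC-BY-NC-4.0", "Custom-EULA"], commercial_deployment=True
):
    print(finding)
```

A check like this does not replace legal review; it narrows that review to the datasets that need it.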
Testing procedure
How an auditor verifies this control
- Obtain the organization's training dataset inventory or catalog system used to track AI/ML datasets
- Select a representative sample of 8-12 datasets spanning multiple models, projects, and acquisition timeframes
- For each sampled dataset, verify the inventory contains documented provenance information including source, acquisition date, custodian, and lineage chain
- Review licensing documentation for each sampled dataset to confirm usage rights, redistribution terms, attribution requirements, and expiration dates are recorded
- Cross-reference inventory entries against model training logs or MLOps pipeline records to verify dataset-to-model traceability
- Interview data engineers and ML practitioners to confirm processes exist for validating provenance and licensing before datasets enter production training pipelines
- Examine controls preventing use of undocumented datasets, such as approved dataset repositories, automated ingestion checks, or access restrictions on training environments
- Perform a walkthrough for a specific deployed model: request its provenance documentation and verify end-to-end traceability from the model back to each constituent dataset and its license
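Several of the steps above (completeness of inventory entries, cross-referencing training logs, and the model-to-dataset walkthrough) can be supported by a small script. The sketch below assumes the inventory and training log have been exported to plain dictionaries; the identifiers, field names, and both function names are hypothetical and would need to be mapped to the organization's actual catalog and MLOps records.

```python
# Hypothetical inventory export: dataset id -> recorded provenance/license fields.
inventory = {
    "ds-001": {
        "source": "vendor-a",
        "acquired_on": "2023-04-01",
        "custodian": "data-eng",
        "license": "CC-BY-4.0",
    },
    "ds-002": {
        "source": "internal-crm",
        "acquired_on": "",          # never recorded
        "custodian": "ml-platform",
        "license": "",              # never recorded
    },
}

# Hypothetical training log export: model id -> dataset ids used in training.
training_log = {"model-x": ["ds-001", "ds-002", "ds-999"]}

# Fields this sketch treats as mandatory for every inventory entry.
REQUIRED = ("source", "acquired_on", "custodian", "license")


def trace_model(model_id, training_log, inventory):
    """Map each dataset used by a model to a list of documentation gaps."""
    report = {}
    for ds_id in training_log.get(model_id, []):
        entry = inventory.get(ds_id)
        if entry is None:
            # Used in training but absent from the inventory.
            report[ds_id] = ["not in inventory"]
            continue
        report[ds_id] = [f"missing {name}" for name in REQUIRED if not entry.get(name)]
    return report


def blocked_datasets(dataset_ids, inventory):
    """Return dataset ids an ingestion gate should reject as undocumented."""
    return [
        ds_id
        for ds_id in dataset_ids
        if ds_id not in inventory
        or any(not inventory[ds_id].get(name) for name in REQUIRED)
    ]


report = trace_model("model-x", training_log, inventory)
for ds_id, gaps in report.items():
    print(ds_id, "OK" if not gaps else "; ".join(gaps))
```

An empty gap list for every dataset is the evidence the walkthrough looks for. Any `not in inventory` result indicates a dataset that reached training without passing through the catalog.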
Where this control is tested