GOVERN-1.5 / MAP-1.2 / MEASURE-2.7 NIST AI Risk Management Framework (AI RMF 1.0)

Provenance + licence per training dataset

Demonstrate that all training datasets used in AI/ML systems are catalogued with documented provenance and licensing terms that support lawful use and enable traceable lineage.

Description

What this control does

This control requires organizations to maintain a formal inventory of all training datasets used in AI/ML model development, documenting the origin (provenance) and licensing terms for each dataset. Provenance tracking includes source identification, data lineage, acquisition date, and custodial chain, while licensing documentation captures usage rights, redistribution constraints, and compliance obligations. This ensures legal defensibility, reproducibility, and risk transparency in AI systems by preventing unauthorized data use and enabling impact assessment when upstream datasets are compromised or recalled.
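
As an illustration of what one inventory entry might hold, the sketch below models the provenance and licensing fields named above as a Python dataclass. The class name, field names, and the `missing_fields` helper are assumptions made for this example; the control does not prescribe a schema.

```python
from dataclasses import dataclass, field, fields
from datetime import date
from typing import List, Optional


@dataclass
class DatasetRecord:
    """One inventory entry. All field names are illustrative assumptions."""
    dataset_id: str
    source: str                        # where the data originated
    acquisition_date: Optional[date]   # when the organization obtained it
    custodian: str                     # who is currently responsible for it
    lineage: List[str] = field(default_factory=list)  # upstream dataset ids
    license_name: str = ""             # licence identifier or agreement name
    usage_rights: str = ""             # what the organization may do with it
    redistribution: str = ""           # redistribution constraints
    obligations: str = ""              # attribution or other compliance duties


def missing_fields(record: DatasetRecord) -> List[str]:
    """Return names of empty fields. An empty lineage is allowed (root dataset)."""
    return [
        f.name
        for f in fields(record)
        if f.name != "lineage" and not getattr(record, f.name)
    ]


record = DatasetRecord(
    dataset_id="ds-001",
    source="https://example.com/corpus",
    acquisition_date=date(2024, 1, 15),
    custodian="data-platform-team",
    license_name="CC-BY-4.0",
    usage_rights="commercial use permitted",
    redistribution="permitted with attribution",
)
print(missing_fields(record))  # ['obligations']
```

A completeness check like this only shows that a field was filled in, not that its content is accurate; the accuracy of the recorded source and licence still needs human review.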

Control objective

What auditing this proves

Demonstrate that all training datasets used in AI/ML systems are catalogued with documented provenance and licensing terms that support lawful use and enable traceable lineage.

Associated risks

Risks this control addresses

  • Use of training data without proper licensing rights exposes the organization to intellectual property infringement claims and financial liability
  • Datasets containing poisoned or malicious samples from unverified sources compromise model integrity and prediction reliability
  • Inability to trace data provenance prevents effective response when upstream datasets are found to contain sensitive information or are subject to recall
  • Unlicensed or improperly licensed datasets may prohibit commercial deployment of trained models, rendering development investment unusable
  • Lack of provenance documentation prevents reproducibility of model training, hindering debugging, auditing, and regulatory compliance verification
  • Training on datasets with incompatible licenses (e.g., data released under copyleft or share-alike terms such as GPL or CC BY-SA and used in proprietary models) may create unintended licensing obligations on derived models
  • Undocumented datasets from untrusted sources may contain biased, discriminatory, or legally protected content that exposes the organization to regulatory penalties
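
One way to surface the licensing risks above before training starts is a lookup against an approved-use policy. The sketch below is a minimal example under stated assumptions: the `LICENSE_POLICY` table, its entries, and the use categories are hypothetical placeholders that an organization's legal team would define, and nothing here is legal guidance.

```python
# Hypothetical policy table: the intended uses the organization has cleared
# for each licence identifier. Entries are placeholders, not legal advice.
LICENSE_POLICY = {
    "CC0-1.0": {"research", "internal", "commercial"},
    "CC-BY-4.0": {"research", "internal", "commercial"},
    "CC-BY-NC-4.0": {"research"},
}


def review_license(license_name: str, intended_use: str) -> str:
    """Return 'approved', 'blocked', or 'escalate' for a dataset licence."""
    allowed = LICENSE_POLICY.get(license_name)
    if allowed is None:
        # Unknown or undocumented licence: route to legal review
        return "escalate"
    return "approved" if intended_use in allowed else "blocked"


print(review_license("CC-BY-4.0", "commercial"))     # approved
print(review_license("CC-BY-NC-4.0", "commercial"))  # blocked
print(review_license("Custom-Vendor-EULA", "internal"))  # escalate
```

Treating unknown licences as "escalate" rather than "blocked" keeps the decision with a reviewer while still preventing silent use.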

Testing procedure

How an auditor verifies this control

  1. Obtain the organization's training dataset inventory or catalog system used to track AI/ML datasets
  2. Select a representative sample of 8-12 datasets spanning multiple models, projects, and acquisition timeframes
  3. For each sampled dataset, verify the inventory contains documented provenance information including source, acquisition date, custodian, and lineage chain
  4. Review licensing documentation for each sampled dataset to confirm usage rights, redistribution terms, attribution requirements, and expiration dates are recorded
  5. Cross-reference inventory entries against model training logs or MLOps pipeline records to verify dataset-to-model traceability
  6. Interview data engineers and ML practitioners to confirm processes exist for validating provenance and licensing before datasets enter production training pipelines
  7. Examine controls preventing use of undocumented datasets, such as approved dataset repositories, automated ingestion checks, or access restrictions on training environments
  8. Test a scenario by requesting provenance documentation for a specific deployed model to verify end-to-end traceability from model back to constituent datasets and their licenses
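
The cross-referencing in steps 5 and 8 can be sketched as a join between training-run records and the inventory. The example below assumes both systems can be exported as simple mappings; `TRAINING_RUNS`, `INVENTORY`, the identifiers, and `trace_model` are hypothetical names for illustration, not the format of any particular MLOps tool.

```python
from typing import Dict, List, Tuple

# Hypothetical exports: training-run records map a model to the dataset ids
# it was trained on; the inventory maps dataset ids to recorded metadata.
TRAINING_RUNS: Dict[str, List[str]] = {
    "fraud-model-v3": ["ds-001", "ds-007"],
}
INVENTORY: Dict[str, Dict[str, str]] = {
    "ds-001": {"source": "vendor feed", "license": "commercial agreement"},
}


def trace_model(
    model_id: str,
    training_runs: Dict[str, List[str]],
    inventory: Dict[str, Dict[str, str]],
) -> Tuple[Dict[str, Dict[str, str]], List[str]]:
    """Return (traced, untraced): inventory entries found for the model's
    datasets, and dataset ids that have no inventory entry."""
    traced: Dict[str, Dict[str, str]] = {}
    untraced: List[str] = []
    for dataset_id in training_runs.get(model_id, []):
        entry = inventory.get(dataset_id)
        if entry is None:
            untraced.append(dataset_id)
        else:
            traced[dataset_id] = entry
    return traced, untraced


traced, untraced = trace_model("fraud-model-v3", TRAINING_RUNS, INVENTORY)
print(sorted(traced))  # ['ds-001']
print(untraced)        # ['ds-007']
```

Any id in `untraced` is a dataset that reached training without a catalog entry, which is the gap step 7 is meant to prevent.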

Evidence required

Collect the training dataset inventory or catalog system export showing dataset identifiers, provenance fields, and licensing metadata. Obtain sample dataset records with full provenance documentation including source attribution, acquisition agreements, and license copies. Gather process documentation describing dataset intake procedures, approval workflows, and technical controls enforcing catalog registration before training use. Capture screenshots of MLOps or model registry systems showing dataset-to-model linkages.

Pass criteria

All sampled training datasets have documented provenance identifying source and lineage, valid licensing terms permitting organizational use, and traceable linkage to models trained with them.
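
The pass criteria can be read as three checks applied to every sampled dataset. The sketch below shows one possible way to tabulate the result; the record keys (`lineage_documented`, `license_permits_use`, `linked_models`) are assumed names for auditor-recorded findings, not fields defined by the control.

```python
from typing import Dict, List


def evaluate_sample(sample: List[dict]) -> Dict[str, List[str]]:
    """Map each failing dataset id to the pass criteria it does not meet."""
    failures: Dict[str, List[str]] = {}
    for rec in sample:
        reasons = []
        # Provenance: source identified and lineage documented
        if not rec.get("source") or not rec.get("lineage_documented"):
            reasons.append("provenance")
        # Licensing: valid terms permitting organizational use
        if not rec.get("license_permits_use"):
            reasons.append("licensing")
        # Traceability: linked to the models trained with the dataset
        if not rec.get("linked_models"):
            reasons.append("traceability")
        if reasons:
            failures[rec["dataset_id"]] = reasons
    return failures


# Hypothetical auditor findings for two sampled datasets
sample = [
    {"dataset_id": "ds-001", "source": "vendor feed", "lineage_documented": True,
     "license_permits_use": True, "linked_models": ["fraud-model-v3"]},
    {"dataset_id": "ds-002", "source": "web scrape", "lineage_documented": True,
     "license_permits_use": False, "linked_models": []},
]

failures = evaluate_sample(sample)
print("PASS" if not failures else f"FAIL: {failures}")
# FAIL: {'ds-002': ['licensing', 'traceability']}
```

Because the criteria apply to all sampled datasets, a single entry in `failures` means the control does not pass for the sample.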
