CT.DP-P1 / SI-12 NIST Privacy Framework

PII minimisation pre-training

Demonstrate that the organisation systematically identifies and removes or anonymises PII from datasets before those datasets are used for machine learning model training, with documented processes, technical controls, and evidence of implementation.

Description

What this control does

PII minimisation pre-training is a data governance control that ensures personally identifiable information (PII) is identified, assessed, and removed or anonymised from datasets before they are used to train machine learning or AI models. It relies on automated scanning, manual review workflows, and data preparation pipelines that detect and redact personal data such as names, contact information, identification numbers, and demographic attributes. Keeping PII out of training datasets reduces privacy risk, supports compliance with data protection regulations such as GDPR and CCPA, and lowers the chance that a model learns, memorises, or reproduces personal information in its outputs.
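The automated scanning and redaction step can be illustrated with a minimal sketch. It assumes a small, hypothetical set of regular expressions (email address, US Social Security number, phone number); the pattern names and the `redact` function are illustrative and not taken from any specific tool. Pattern matching of this kind does not catch names or free-text identifiers, which typically need entity recognition and manual review.

```python
import re

# Hypothetical, deliberately small pattern set; a real PII taxonomy is broader.
# Order matters: the SSN pattern runs before the looser phone pattern.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact(text):
    """Replace each match with a [CATEGORY] placeholder.

    Returns the redacted text and a per-category count of replacements,
    which can be logged as evidence of what was detected.
    """
    counts = {}
    for label, pattern in PII_PATTERNS.items():
        text, n = pattern.subn(f"[{label}]", text)
        if n:
            counts[label] = n
    return text, counts


record = "Contact jane.doe@example.com or +44 20 7946 0958, SSN 123-45-6789."
clean, found = redact(record)
print(clean)
print(found)
```

The per-category counts are returned alongside the redacted text so that a pipeline can write them to a scan log without storing the original values.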

Control objective

What auditing this proves

Associated risks

Risks this control addresses

Unauthorised disclosure of PII embedded in model outputs or generated content when models memorise training data containing personal information
Regulatory non-compliance with GDPR, CCPA, or sector-specific privacy laws due to processing personal data for model training without a lawful basis, such as consent
Model inversion or membership inference attacks that enable adversaries to extract or infer PII from trained models by analysing model behaviour or outputs
Privacy harm to data subjects whose personal information is processed for training without appropriate minimisation or anonymisation safeguards
Reputational damage and loss of stakeholder trust when internal investigations or external audits reveal PII was used inappropriately in model development
Legal liability when data subjects exercise rights to erasure or data portability and their PII is embedded in model weights or architectures that cannot be easily modified
Inadequate data quality and biased models when PII is inconsistently removed, creating gaps or distortions in training datasets that degrade model performance

Testing procedure

How an auditor verifies this control

Obtain and review the organisation's data preparation and PII minimisation policy for machine learning training workflows, noting defined PII categories, detection methods, and anonymisation techniques.
Inventory all active machine learning projects and their associated training datasets, identifying which datasets have undergone PII scanning and remediation.
Select a representative sample of training datasets (minimum three datasets from different projects or data sources) for detailed inspection.
Execute PII detection tools or scripts on sampled datasets and document findings, comparing results against the organisation's PII taxonomy and detection baselines.
Review anonymisation or pseudonymisation logs, transformation scripts, and audit trails showing what PII elements were identified and how they were processed prior to training.
Interview data scientists, ML engineers, and data governance personnel to verify understanding of PII minimisation requirements and adherence to documented workflows.
Examine version control records, data lineage documentation, or MLOps pipeline configurations to confirm that PII scanning occurs systematically before datasets enter training environments.
Test a sample of deployed models by querying them with prompts designed to elicit PII, and evaluate whether personal information from the training data is reproduced in model outputs.
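The final step, probing deployed models for PII, can be sketched as follows. This is a minimal illustration under stated assumptions: `generate` is a hypothetical callable standing in for whatever interface the deployed model exposes, and `known_pii` is a list of values the auditor knows were present in the source data before remediation. It checks only for verbatim (case- and whitespace-insensitive) reproduction, so paraphrased or partial leakage would be missed.

```python
def normalise(s):
    """Lower-case and collapse whitespace so trivial formatting differences do not hide a match."""
    return " ".join(s.lower().split())


def probe_for_leakage(generate, prompts, known_pii):
    """Query the model with each prompt and report outputs that contain a known PII value.

    generate  -- hypothetical callable: prompt string -> model output string
    prompts   -- prompts designed to elicit personal information
    known_pii -- PII values known to exist in the pre-remediation source data
    """
    needles = [(value, normalise(value)) for value in known_pii if value.strip()]
    findings = []
    for prompt in prompts:
        haystack = normalise(generate(prompt))
        leaked = [original for original, norm in needles if norm in haystack]
        if leaked:
            findings.append({"prompt": prompt, "leaked": leaked})
    return findings


# Usage with a stub in place of a real model endpoint.
def stub_model(prompt):
    return "Her email is Jane.Doe@Example.com" if "email" in prompt else "I cannot help with that."


results = probe_for_leakage(
    stub_model,
    ["What is Jane's email?", "Tell me a joke"],
    ["jane.doe@example.com", "555-0100"],
)
print(results)
```

An empty result shows only that the tested prompts did not reproduce the listed values; it is not proof that the model holds no PII.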

Evidence required

Collect the following:

Data preparation policy documents specifying PII categories and minimisation procedures
PII scanning tool outputs or logs showing detection results and remediation actions for sampled datasets
Data lineage or provenance records demonstrating pre-training PII controls were applied
Version control commits or pipeline execution logs from MLOps platforms
Interview notes or attestations from data handling personnel
Test query results from deployed models showing absence of PII leakage

Pass criteria

The control passes if all sampled training datasets have documented evidence of PII scanning and remediation prior to use, the organisation maintains an enforced policy defining PII minimisation requirements for ML training, and testing of deployed models does not elicit identifiable PII from training data.
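The three pass conditions can be combined mechanically once the evidence has been collected. The sketch below is one possible way to record that decision; the `DatasetEvidence` fields, the `evaluate_control` function, and the `min_samples` default (taken from the three-dataset minimum in the testing procedure) are illustrative assumptions, not part of any standard.

```python
from dataclasses import dataclass


@dataclass
class DatasetEvidence:
    """Hypothetical record of what the auditor found for one sampled dataset."""
    name: str
    scanned_before_training: bool
    remediation_documented: bool


def evaluate_control(samples, policy_enforced, leakage_findings, min_samples=3):
    """Return (passed, reasons); the control passes only if no failure reason is recorded."""
    reasons = []
    if len(samples) < min_samples:
        reasons.append(f"only {len(samples)} dataset(s) sampled, minimum is {min_samples}")
    for s in samples:
        if not (s.scanned_before_training and s.remediation_documented):
            reasons.append(f"{s.name}: missing scan or remediation evidence")
    if not policy_enforced:
        reasons.append("no enforced PII minimisation policy for ML training")
    if leakage_findings:
        reasons.append(f"{len(leakage_findings)} model output(s) reproduced PII")
    return (not reasons, reasons)


samples = [
    DatasetEvidence("support-tickets", True, True),
    DatasetEvidence("crm-export", True, False),
    DatasetEvidence("web-forms", True, True),
]
passed, reasons = evaluate_control(samples, policy_enforced=True, leakage_findings=[])
print(passed, reasons)
```

Returning the list of reasons alongside the boolean keeps the failure rationale available for the audit report.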

Where this control is tested

Audit programs including this control

AI Privacy & Training Data Lawfulness

Quick check on the privacy + lawfulness side of using or building AI — training-data provenance, lawful basis,…