PII minimisation pre-training
Description
What this control does
PII minimisation pre-training is a data governance control that ensures personally identifiable information (PII) is identified, assessed, and removed or anonymised from datasets before they are used to train machine learning or AI models. It combines automated scanning, manual review workflows, and data preparation pipelines that detect and redact sensitive personal data such as names, contact information, identification numbers, and demographic attributes. By keeping PII out of training datasets, organisations reduce privacy risk, support compliance with data protection regulations such as GDPR and CCPA, and prevent models from memorising personal information during training and reproducing it in their outputs at inference time.
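The automated scanning and redaction step can be illustrated with a minimal sketch. It assumes a purely regex-based detector; the category names, the patterns, and the `redact` helper are hypothetical and are not taken from any specific tool or from the organisation's own taxonomy.

```python
import re

# Hypothetical PII taxonomy: category -> regex. Order matters: the SSN
# pattern runs before the broader phone pattern, which would otherwise
# also match SSN-shaped strings.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact(text: str) -> tuple[str, dict[str, int]]:
    """Replace each detected PII span with a [CATEGORY] placeholder.

    Returns the redacted text and a per-category count that can be
    written to an audit log.
    """
    counts: dict[str, int] = {}
    for category, pattern in PII_PATTERNS.items():
        text, n = pattern.subn(f"[{category}]", text)
        if n:
            counts[category] = n
    return text, counts
```

Patterns like these only catch structured identifiers. Free-text PII such as names or demographic attributes generally needs a named-entity recogniser or the manual review workflow described above.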
Control objective
What auditing this proves
Demonstrate that the organisation systematically identifies and removes or anonymises PII from datasets before those datasets are used for machine learning model training, with documented processes, technical controls, and evidence of implementation.
Associated risks
Risks this control addresses
- Unauthorised disclosure of PII embedded in model outputs or generated content when models memorise training data containing personal information
- Regulatory non-compliance with GDPR, CCPA, or sector-specific privacy laws due to processing personal data for model training without a lawful basis, such as consent
- Model inversion or membership inference attacks that enable adversaries to extract or infer PII from trained models by analysing model behaviour or outputs
- Privacy harm to data subjects whose personal information is processed for training without appropriate minimisation or anonymisation safeguards
- Reputational damage and loss of stakeholder trust when internal investigations or external audits reveal PII was used inappropriately in model development
- Legal liability when data subjects exercise rights to erasure or data portability and their PII is embedded in model weights that cannot easily be modified
- Inadequate data quality and biased models when PII is inconsistently removed, creating gaps or distortions in training datasets that degrade model performance
Testing procedure
How an auditor verifies this control
- Obtain and review the organisation's data preparation and PII minimisation policy for machine learning training workflows, noting defined PII categories, detection methods, and anonymisation techniques.
- Inventory all active machine learning projects and their associated training datasets, identifying which datasets have undergone PII scanning and remediation.
- Select a representative sample of training datasets (a minimum of three, drawn from different projects or data sources) for detailed inspection.
- Execute PII detection tools or scripts on sampled datasets and document findings, comparing results against the organisation's PII taxonomy and detection baselines.
- Review anonymisation or pseudonymisation logs, transformation scripts, and audit trails showing what PII elements were identified and how they were processed prior to training.
- Interview data scientists, ML engineers, and data governance personnel to verify understanding of PII minimisation requirements and adherence to documented workflows.
- Examine version control records, data lineage documentation, or MLOps pipeline configurations to confirm that PII scanning occurs systematically before datasets enter training environments.
- Test a sample of deployed models by querying them with prompts designed to elicit PII, and evaluate whether personal information from the training data is reproduced in model outputs.
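The two hands-on steps above (running PII detection on sampled datasets and probing deployed models) can be sketched as follows. This is an illustration under stated assumptions, not a prescribed audit tool: `PII_PATTERNS`, `scan_dataset`, and `probe_model` are hypothetical names, the two patterns stand in for the organisation's detection baseline, and `generate` is any callable that sends a prompt to the model under test and returns its text output.

```python
import re
from collections import Counter
from typing import Callable, Iterable

# Hypothetical detection baseline; substitute the organisation's own taxonomy.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scan_dataset(records: Iterable[str]) -> Counter:
    """Count residual PII matches per category across sampled records."""
    findings: Counter = Counter()
    for record in records:
        for category, pattern in PII_PATTERNS.items():
            findings[category] += len(pattern.findall(record))
    return +findings  # unary plus drops categories with a zero count


def probe_model(
    generate: Callable[[str], str], prompts: Iterable[str]
) -> list[tuple[str, str, str]]:
    """Return (prompt, category, matched_text) for PII-like strings in outputs."""
    leaks = []
    for prompt in prompts:
        output = generate(prompt)
        for category, pattern in PII_PATTERNS.items():
            for match in pattern.finditer(output):
                leaks.append((prompt, category, match.group()))
    return leaks
```

A non-empty result from `scan_dataset` on a dataset recorded as remediated is a finding to compare against the anonymisation logs. A hit from `probe_model` is only a candidate: the auditor still needs to confirm whether the matched string corresponds to a real record in the training data or is a fabricated value.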
Where this control is tested