About this program
A quick check on the privacy and lawfulness aspects of using or building AI: training-data provenance, lawful basis, data subject access requests (DSARs) against models, and synthetic-data alternatives.
Risks addressed
- Critical: Training data scraped without a lawful basis
- High: Personal data memorised and leaked by the model
- High: Cannot satisfy a DSAR (erasure) against an embedded model
- Critical: Customer data fine-tuned into a model and made shareable
Controls (7)
- Critical: Lawful basis recorded for training / fine-tune data
How to test + evidence
Testing procedure: For every dataset used, the lawful basis is documented (for example consent, contract, or legitimate interest).
Evidence to collect: RoPA entry per dataset.
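The check above can be automated against a dataset inventory. The following is a minimal sketch, assuming a hypothetical `DatasetRecord` structure with `lawful_basis` and `ropa_entry_id` fields; the allow-list contains only the three bases named in the testing procedure and should be adjusted to whatever your organisation recognises.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumption: allow-list limited to the bases named in the testing procedure.
ACCEPTED_BASES = {"consent", "contract", "legitimate interest"}


@dataclass
class DatasetRecord:
    """Hypothetical inventory row for one training or fine-tune dataset."""
    name: str
    lawful_basis: Optional[str] = None
    ropa_entry_id: Optional[str] = None


def missing_lawful_basis(records: List[DatasetRecord]) -> List[str]:
    """Return names of datasets lacking a recognised basis or a RoPA entry."""
    failures = []
    for record in records:
        basis = (record.lawful_basis or "").strip().lower()
        if basis not in ACCEPTED_BASES or not record.ropa_entry_id:
            failures.append(record.name)
    return failures
```

An empty result means every dataset has both a recognised basis and a RoPA reference on file. It does not show that the recorded basis is legally sound, which still needs human review.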
- High: Provenance + licence per training dataset
How to test + evidence
Testing procedure: Source, licence and scraping legality are on file for every dataset.
Evidence to collect: Dataset register.
- High: PII minimisation pre-training
How to test + evidence
Testing procedure: PII filtering or hashing is applied before training, and filter coverage is documented.
Evidence to collect: Pipeline + filter coverage report.
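As an illustration of a pre-training filter with a coverage figure, here is a minimal sketch that handles email addresses only. The regex, the salt handling and the function names are assumptions made for this example; a real pipeline would cover more PII types and measure coverage against a hand-labelled sample rather than the filter's own pattern. Salted hashing is pseudonymisation, so the output may still count as personal data.

```python
import hashlib
import re

# Assumption: a simple email pattern; it will miss unusual address formats.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymise_emails(text, salt):
    """Replace each email address with a salted SHA-256 token."""
    def _token(match):
        value = (salt + match.group(0).lower()).encode("utf-8")
        return "<EMAIL:" + hashlib.sha256(value).hexdigest()[:12] + ">"
    return EMAIL_RE.sub(_token, text)


def filter_coverage(samples, salt):
    """Return (cleaned samples, share of email-bearing samples fully cleaned)."""
    cleaned = [pseudonymise_emails(s, salt) for s in samples]
    with_pii = [i for i, s in enumerate(samples) if EMAIL_RE.search(s)]
    if not with_pii:
        return cleaned, 1.0
    still_dirty = sum(1 for i in with_pii if EMAIL_RE.search(cleaned[i]))
    return cleaned, 1 - still_dirty / len(with_pii)
```

The coverage number here is self-referential, because it only checks for what the same regex can detect, so treat it as a smoke test rather than as the coverage report itself.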
- Critical: No customer data in shared / global models
How to test + evidence
Testing procedure: Customer fine-tunes are tenant-isolated, and isolation tests demonstrate no cross-customer leakage.
Evidence to collect: Architecture + isolation tests.
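One way an isolation test might look is sketched below. `AdapterRegistry` and `TenantIsolationError` are hypothetical names for this example, standing in for whatever component maps fine-tuned models or adapters to their owning tenant in your architecture.

```python
class TenantIsolationError(Exception):
    """Raised when a tenant requests an adapter it does not own."""


class AdapterRegistry:
    """Hypothetical in-memory map from fine-tuned adapter to owning tenant."""

    def __init__(self):
        self._owner = {}

    def register(self, adapter_id, tenant_id):
        self._owner[adapter_id] = tenant_id

    def resolve(self, adapter_id, requesting_tenant):
        owner = self._owner.get(adapter_id)
        # Same error for "unknown" and "not yours" so existence is not leaked.
        if owner is None or owner != requesting_tenant:
            raise TenantIsolationError("adapter not available to this tenant")
        return adapter_id


def cross_tenant_access_blocked(registry, adapter_id, other_tenant):
    """Isolation test helper: True if the other tenant is refused."""
    try:
        registry.resolve(adapter_id, other_tenant)
    except TenantIsolationError:
        return True
    return False
```

A passing check of this kind covers access control only. It says nothing about whether customer data was mixed into a shared base model during training, which needs to be evidenced from the architecture and the training pipeline.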
- High: DSAR / erasure process for AI in scope
How to test + evidence
Testing procedure: Erasure workflow accounts for embeddings, indices and cached outputs.
Evidence to collect: DSAR workflow doc.
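The erasure workflow described above can be sketched as a fan-out over every store that may hold a subject's data, followed by a verification count. `InMemoryStore` is a hypothetical stand-in for a vector store, search index or output cache; real systems would call their own deletion APIs. The sketch does not address data memorised in model weights.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for an embedding store, index or output cache."""
    name: str
    rows: dict = field(default_factory=dict)  # record_id -> subject_id

    def delete_subject(self, subject_id):
        doomed = [rid for rid, sid in self.rows.items() if sid == subject_id]
        for rid in doomed:
            del self.rows[rid]
        return len(doomed)

    def count_subject(self, subject_id):
        return sum(1 for sid in self.rows.values() if sid == subject_id)


def erase_subject(subject_id, stores):
    """Delete a subject from every store and return a per-store audit record."""
    report = {}
    for store in stores:
        deleted = store.delete_subject(subject_id)
        report[store.name] = {
            "deleted": deleted,
            "remaining": store.count_subject(subject_id),
        }
    return report
```

The returned report, with `remaining` at zero for every store, is the kind of record that could be attached to the DSAR ticket as evidence of completion.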
- Medium: Synthetic / anonymised data preferred where viable
How to test + evidence
Testing procedure: Synthetic data is used for testing and evaluation where it preserves utility.
Evidence to collect: Pipeline evidence.
- High: Output filters for memorised / sensitive content
How to test + evidence
Testing procedure: Guardrails block outputs containing PII / verbatim training data.
Evidence to collect: Guardrail config + test results.
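A guardrail of the kind tested above might combine a PII pattern check with a verbatim-overlap check against protected training text. This is a minimal sketch under stated assumptions: the email regex, the word n-gram approach and the default window of 8 words are illustrative choices for this example, not a description of any particular guardrail product.

```python
import re

# Assumption: email is the only PII type checked in this sketch.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def word_ngrams(text, n):
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def build_index(protected_texts, n=8):
    """Collect n-grams from training passages that must not be reproduced."""
    index = set()
    for passage in protected_texts:
        index |= word_ngrams(passage, n)
    return index


def check_output(output, index, n=8):
    """Return (allowed, reasons) for a candidate model output."""
    reasons = []
    if EMAIL_RE.search(output):
        reasons.append("pii:email")
    if word_ngrams(output, n) & index:
        reasons.append("verbatim_overlap")
    return (not reasons, reasons)
```

Logging the `reasons` list for each blocked output gives test results that can be collected alongside the guardrail configuration. Exact n-gram matching misses paraphrased or lightly edited reproductions, so it is a lower bound on what a production filter should catch.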