Pro audit program · v1.0

AI Privacy & Training Data Lawfulness

A quick check of the privacy and lawfulness aspects of using or building AI: training-data provenance, lawful basis, data subject access requests (DSARs) against models, and synthetic-data alternatives.

  • Target area: General
  • Framework: GDPR / EU AI Act
  • 7 controls in this program
  • Author: Cyentrix (Trusted Author)

Risks addressed

  • Critical: Training data scraped without a lawful basis
  • High: Personal data memorised and leaked by the model
  • High: Cannot satisfy a DSAR (erasure) against an embedded model
  • Critical: Customer data fine-tuned into a model and made shareable

Controls (7)

  1. Lawful basis recorded for training / fine-tune data

    Critical

    How to test + evidence

    Testing procedure: For every dataset used for training or fine-tuning, confirm that a lawful basis is documented (for example consent, contract or legitimate interests).

    Evidence to collect: Record of processing activities (RoPA) entry per dataset.
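
The check above can be partly automated if the RoPA is exported in a structured form. The following is a minimal sketch under that assumption; `RopaEntry`, `ACCEPTED_BASES`, `datasets_missing_basis` and the dataset names are hypothetical and only illustrate the idea of flagging datasets with no recognised lawful basis.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: only the three bases named in the testing procedure are accepted.
# Extend this set to match the bases your own RoPA actually uses.
ACCEPTED_BASES = {"consent", "contract", "legitimate interest"}


@dataclass
class RopaEntry:
    """Hypothetical, simplified RoPA row for one training dataset."""
    dataset: str
    lawful_basis: Optional[str]  # None means nothing was recorded


def datasets_missing_basis(entries):
    """Return the names of datasets whose lawful basis is absent or unrecognised."""
    missing = []
    for entry in entries:
        basis = (entry.lawful_basis or "").strip().lower()
        if basis not in ACCEPTED_BASES:
            missing.append(entry.dataset)
    return missing


# Illustrative data only.
entries = [
    RopaEntry("support-tickets-2023", "Contract"),
    RopaEntry("web-crawl-sample", None),
    RopaEntry("newsletter-signups", "consent"),
]
print(datasets_missing_basis(entries))
```

A non-empty result points the auditor at the datasets that need follow-up; it does not judge whether the recorded basis is actually valid for the processing.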

  2. Provenance + licence per training dataset

    High

    How to test + evidence

    Testing procedure: Confirm that the source, the licence and an assessment of scraping legality are on file for every dataset.

    Evidence to collect: Dataset register.

  3. PII minimisation pre-training

    High

    How to test + evidence

    Testing procedure: Confirm that PII filtering or hashing is applied before training and that filter coverage is documented.

    Evidence to collect: Pipeline + filter coverage report.
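
As an illustration of what "filtering / hashing" and "filter coverage" might look like in practice, here is a minimal sketch. It assumes a filter that only targets email addresses and a small hand-labelled sample of known PII strings; `pseudonymise_emails`, `filter_coverage`, the regex and the salt are all hypothetical. Note that salted hashing is pseudonymisation, not anonymisation, so the output is still personal data.

```python
import hashlib
import re

# Assumption: a deliberately simple email pattern; real pipelines need more PII types.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymise_emails(text, salt):
    """Replace each email address with a short salted SHA-256 token."""
    def _token(match):
        value = (salt + match.group(0).lower()).encode("utf-8")
        return "<email:" + hashlib.sha256(value).hexdigest()[:12] + ">"
    return EMAIL_RE.sub(_token, text)


def filter_coverage(samples, salt):
    """Share of hand-labelled PII strings that no longer appear after filtering.

    samples: list of (text, known_pii_strings) pairs labelled by a reviewer.
    """
    total = removed = 0
    for text, known_pii in samples:
        cleaned = pseudonymise_emails(text, salt)
        for item in known_pii:
            total += 1
            if item not in cleaned:
                removed += 1
    return removed / total if total else 1.0


# Illustrative labelled sample: two emails and one phone number.
samples = [
    ("Contact jane.doe@example.com for access", ["jane.doe@example.com"]),
    ("Call 555-0100 or mail bob@example.org", ["555-0100", "bob@example.org"]),
]
print(filter_coverage(samples, salt="demo-salt"))
```

In this sample the phone number survives the filter, so coverage is two out of three. Measuring against an independently labelled sample, instead of re-running the filter's own pattern, is what makes the coverage figure meaningful.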

  4. No customer data in shared / global models

    Critical

    How to test + evidence

    Testing procedure: Confirm that customer fine-tunes are tenant-isolated and that isolation tests show no cross-customer leakage.

    Evidence to collect: Architecture + isolation tests.
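
An isolation test of the kind listed as evidence could be sketched as follows. This assumes fine-tunes are stored as per-tenant adapters behind an access check; `AdapterRegistry`, `TenantIsolationError`, `find_cross_tenant_leaks` and the tenant names are hypothetical stand-ins, not a real serving API.

```python
class TenantIsolationError(PermissionError):
    """Raised when one tenant tries to reach another tenant's adapter."""


class AdapterRegistry:
    """Hypothetical store mapping each tenant to its own fine-tuned adapter."""

    def __init__(self):
        self._adapters = {}  # (tenant_id, adapter_name) -> weights reference

    def register(self, tenant_id, name, weights_ref):
        self._adapters[(tenant_id, name)] = weights_ref

    def load(self, caller_tenant, owner_tenant, name):
        # The control under test: a caller may only load its own adapters.
        if caller_tenant != owner_tenant:
            raise TenantIsolationError(
                f"{caller_tenant} may not load adapters owned by {owner_tenant}"
            )
        return self._adapters[(owner_tenant, name)]


def find_cross_tenant_leaks(registry, tenants, name):
    """Try every cross-tenant pair and return the (caller, owner) pairs that were not blocked."""
    leaks = []
    for caller in tenants:
        for owner in tenants:
            if caller == owner:
                continue
            try:
                registry.load(caller, owner, name)
            except TenantIsolationError:
                continue  # blocked, as required
            leaks.append((caller, owner))
    return leaks


registry = AdapterRegistry()
for tenant in ("acme", "globex"):
    registry.register(tenant, "support-bot", f"adapters/{tenant}/support-bot")
print(find_cross_tenant_leaks(registry, ["acme", "globex"], "support-bot"))
```

An empty list means every cross-tenant attempt was rejected. This only exercises the access path; it says nothing about whether customer data was blended into shared base weights, which needs a separate architecture review.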

  5. DSAR / erasure process for AI in scope

    High

    How to test + evidence

    Testing procedure: Confirm that the erasure workflow accounts for embeddings, indices and cached outputs.

    Evidence to collect: DSAR workflow document.
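
One way to picture an erasure workflow that reaches embeddings, indices and cached outputs is a fan-out over every store holding derived data, with an audit record per store. The sketch below assumes each store can be keyed by a data-subject ID; `Store`, `erase_subject` and the store names are hypothetical, and the sketch does not address information memorised in model weights.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """Stand-in for one system holding derived data, keyed by data-subject ID."""
    name: str
    records: dict = field(default_factory=dict)  # subject_id -> list of items

    def erase(self, subject_id):
        """Delete everything held for the subject and return how many items went."""
        return len(self.records.pop(subject_id, []))


def erase_subject(subject_id, stores):
    """Erase a subject from every store and return an audit record per store."""
    report = {}
    for store in stores:
        removed = store.erase(subject_id)
        remaining = len(store.records.get(subject_id, []))
        report[store.name] = {"removed": removed, "remaining": remaining}
    return report


# Illustrative stores mirroring the three places named in the testing procedure.
stores = [
    Store("embeddings", {"subj-1": ["vec-a", "vec-b"], "subj-2": ["vec-c"]}),
    Store("search_index", {"subj-1": ["doc-9"]}),
    Store("output_cache", {}),
]
report = erase_subject("subj-1", stores)
print(report)
```

Recording a result for every store, including those with nothing to remove, gives the auditor a trace showing that no store was skipped.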

  6. Synthetic / anonymised data preferred where viable

    Medium

    How to test + evidence

    Testing procedure: Confirm that synthetic data is used for testing and evaluation wherever it preserves utility.

    Evidence to collect: Pipeline evidence showing where synthetic data is used.
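
For test and evaluation fixtures, synthetic records can often be generated directly, without deriving them from real customers. The following is a minimal sketch under that assumption; the field names, name lists and `synthetic_customers` function are hypothetical, and the reserved `example.com` domain keeps the addresses from pointing at real mailboxes.

```python
import random

# Assumption: small fixed vocabularies are enough for test fixtures.
FIRST_NAMES = ["Alex", "Sam", "Priya", "Chen", "Maria"]
LAST_NAMES = ["Ngata", "Okafor", "Silva", "Kowalski", "Tanaka"]


def synthetic_customers(n, seed=0):
    """Generate n fake customer rows; the same seed always yields the same rows."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    rows = []
    for i in range(n):
        first = rng.choice(FIRST_NAMES)
        last = rng.choice(LAST_NAMES)
        rows.append({
            "customer_id": f"SYN-{i:05d}",
            "name": f"{first} {last}",
            "email": f"{first}.{last}.{i}@example.com".lower(),
            "monthly_spend": round(rng.uniform(5.0, 500.0), 2),
        })
    return rows


for row in synthetic_customers(3, seed=42):
    print(row)
```

Seeding makes test runs reproducible. Whether such data "preserves utility" still has to be judged per use case, for example by comparing evaluation results against a run on properly authorised real data.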

  7. Output filters for memorised / sensitive content

    High

    How to test + evidence

    Testing procedure: Confirm that guardrails block outputs containing PII or verbatim training data.

    Evidence to collect: Guardrail config + test results.
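
A guardrail of this kind can be sketched as a post-generation check with two rules: a PII pattern match, and a word n-gram overlap against a list of protected training passages. This is only an illustration under stated assumptions; `check_output`, `word_ngrams`, the email regex and the 8-word threshold are hypothetical choices, not a description of any particular guardrail product.

```python
import re

# Assumption: email is the only PII type checked here.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def word_ngrams(text, n):
    """Return the set of lower-cased word n-grams in text (empty if text is shorter than n words)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def check_output(output, protected_texts, n=8):
    """Return (allowed, reasons) for one model output.

    An output is blocked if it contains an email address or shares any
    n-word sequence with a protected training passage.
    """
    reasons = []
    if EMAIL_RE.search(output):
        reasons.append("pii:email")
    output_grams = word_ngrams(output, n)
    for passage in protected_texts:
        if output_grams & word_ngrams(passage, n):
            reasons.append("verbatim_overlap")
            break
    return (not reasons, reasons)


# Illustrative protected passage and three candidate outputs.
protected = ["the quick brown fox jumps over the lazy dog near the old river bank"]
print(check_output("The weather is fine today", protected))
print(check_output("Write to jo@example.com for details", protected))
print(check_output("As the story says, the quick brown fox jumps over the lazy dog near the bridge", protected))
```

The test results collected as evidence would then be the outcomes of running such checks over a set of prompts designed to elicit memorised content. Exact n-gram matching misses paraphrases and reformatted text, so it is a lower bound on leakage, not proof of its absence.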