Eval harness for safety + grounding
Demonstrate that AI models undergo continuous, automated evaluation against safety policies and factual grounding requirements before and after deployment, with failing outputs triggering remediation workflows.
Description
What this control does
An evaluation harness for safety and grounding validates that AI model outputs adhere to safety policies (preventing harmful, biased, or prohibited content) and remain grounded in authoritative source data rather than hallucinating facts. This control involves automated test suites that systematically prompt models with adversarial inputs, edge cases, and factual queries, then scoring outputs against safety rubrics and factual corpora. The harness runs continuously during model development, fine-tuning, and post-deployment to detect regression or drift in model behavior. Implementation typically includes red-team prompt libraries, ground-truth datasets, scoring algorithms, and automated gates that block unsafe model versions from production.
Control objective
What auditing this proves
Demonstrate that AI models undergo continuous, automated evaluation against safety policies and factual grounding requirements before and after deployment, with failing outputs triggering remediation workflows.
Associated risks
Risks this control addresses
- Deployment of models that generate harmful, discriminatory, or offensive content due to lack of pre-release safety testing
- Model hallucination producing factually incorrect information that damages user trust or leads to erroneous decisions
- Adversarial prompt injection bypassing safety guardrails because evaluation coverage omits edge cases or jailbreak techniques
- Model drift over time degrading safety or grounding performance without detection due to absence of continuous monitoring
- Unauthorized model updates reaching production without validation, introducing untested behavioral changes
- Inconsistent evaluation standards across model versions allowing regressions to pass quality gates
- Insufficient logging of evaluation failures preventing root-cause analysis and remediation of recurring safety issues
Testing procedure
How an auditor verifies this control
- Obtain the current evaluation harness configuration files, including test prompt libraries, safety rubrics, grounding datasets, and scoring thresholds
- Review the prompt library inventory to verify coverage of adversarial inputs (jailbreak attempts, bias elicitation), prohibited content categories, and factual queries spanning core use cases
- Select a sample of five recent model evaluation runs and retrieve the automated test execution logs, including pass/fail rates, output samples, and score distributions
- Trace one failed evaluation from detection through remediation, verifying that the failing model version was blocked from production and a corrective action was documented
- Interview the responsible AI team to confirm the frequency of harness execution (pre-commit, pre-release, post-deployment) and escalation procedures for failures
- Execute a live test by submitting three adversarial prompts from the red-team library to the production model and comparing outputs against expected safety behavior
- Review change control records for the last three model deployments to confirm evaluation harness approval gates were enforced
- Verify that evaluation metrics (safety violation rate, hallucination rate, grounding accuracy) are reported to governance stakeholders at least monthly
Where this control is tested