Grounding + factuality evaluations
Description
What this control does
Grounding and factuality evaluations measure whether generative AI systems produce outputs that are anchored to verifiable source material and free from hallucinated or fabricated information. They combine benchmark datasets, automated scoring against known facts, and human review to assess how closely a model adheres to the context it was given and how accurate its claims are. This control is critical for applications where misinformation, fabricated citations, or ungrounded claims pose legal, regulatory, or operational risk, such as customer service, healthcare decision support, or legal research assistants.
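To make the automated scoring described above concrete, here is a minimal sketch of a context adherence score. It is a crude lexical proxy, not a method this control prescribes: the function names, the stopword list, and the 0.8 support cutoff are all illustrative assumptions, and a production evaluation would more likely rely on an entailment model or human annotation than on token overlap.

```python
from __future__ import annotations

import re

# Hypothetical stopword list, chosen only for this illustration.
STOPWORDS = {"a", "an", "the", "is", "are", "was", "were", "of", "in",
             "on", "to", "and", "or", "for", "with", "by", "at"}


def content_tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOPWORDS}


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentence_support(sentence: str, context: str) -> float:
    """Fraction of the sentence's content tokens that also occur in the context."""
    tokens = content_tokens(sentence)
    if not tokens:
        return 1.0  # nothing to verify in this sentence
    return len(tokens & content_tokens(context)) / len(tokens)


def grounding_score(response: str, context: str,
                    min_support: float = 0.8) -> float:
    """Share of response sentences whose support meets min_support.

    min_support is an assumed cutoff, not a value taken from the control.
    """
    sentences = split_sentences(response)
    if not sentences:
        return 1.0
    supported = sum(sentence_support(s, context) >= min_support
                    for s in sentences)
    return supported / len(sentences)


if __name__ == "__main__":
    context = "The refund window is 30 days from the date of purchase."
    response = ("The refund window is 30 days. "
                "Refunds are processed by Jane Smith within 2 hours.")
    # The second sentence introduces details absent from the context.
    print(grounding_score(response, context))
```

A score like this only flags claims whose wording has no counterpart in the source; it cannot tell a faithful paraphrase from a fabrication, which is one reason the control pairs automated scoring with human review.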
Control objective
What auditing this proves
Demonstrate that generative AI outputs are systematically evaluated for adherence to source material and factual accuracy before and during production deployment.
Associated risks
Risks this control addresses
- AI model generates fabricated citations or references that do not exist, leading to reputational harm or legal liability
- System produces medically or legally inaccurate information that users rely upon, resulting in patient harm or regulatory violations
- Model hallucinates facts or statistics in customer-facing responses, eroding trust and causing financial loss from incorrect decisions
- Ungrounded content bypasses content filters and spreads disinformation through official communication channels
- Absence of ongoing factuality testing lets accuracy degrade undetected as models are retrained or updated
- Third-party evaluators or benchmark datasets are not validated, producing false confidence in model accuracy
Testing procedure
How an auditor verifies this control
- Obtain and review the organization's generative AI model inventory and identify which models are subject to grounding and factuality evaluation requirements based on use case risk classification.
- Collect documented grounding and factuality evaluation procedures, including benchmark datasets used, evaluation frequency, pass/fail thresholds, and remediation workflows.
- Request evidence of pre-deployment grounding evaluations for a sample of production models, including automated scoring reports and human reviewer annotations.
- Verify that benchmark datasets or test sets include adversarial examples designed to induce hallucinations, unsupported claims, or citation fabrication.
- Review logs or dashboards showing ongoing factuality monitoring in production, including alert configurations for drift or anomalous outputs.
- Select a sample of flagged or failed outputs from evaluations and trace remediation actions such as prompt engineering changes, retrieval-augmented generation tuning, or model version rollback.
- Interview AI engineering and quality assurance personnel to confirm understanding of grounding metrics such as attribution accuracy, context adherence scores, or entailment consistency.
- Test a live model instance with crafted inputs designed to elicit ungrounded responses and compare output behavior against documented acceptable thresholds.
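The pass/fail thresholds, adversarial test cases, and drift alerts referenced in the steps above can be sketched as a simple release gate plus a monitoring check. This is an illustrative sketch only: `EvalResult`, `release_gate`, `drift_alert`, and every threshold value (0.95, 0.90, 0.05) are hypothetical and would be replaced by whatever the organization's documented procedure specifies.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class EvalResult:
    """One evaluated test case (hypothetical record layout)."""
    case_id: str
    adversarial: bool  # True if the case was crafted to induce hallucination
    grounded: bool     # verdict from automated scoring or a human reviewer


def pass_rate(results: list[EvalResult]) -> float:
    """Fraction of cases judged grounded; 0.0 for an empty list."""
    if not results:
        return 0.0
    return mean(1.0 if r.grounded else 0.0 for r in results)


def release_gate(results: list[EvalResult],
                 min_overall: float = 0.95,
                 min_adversarial: float = 0.90) -> dict:
    """Pre-deployment check against assumed thresholds.

    The gate fails if no adversarial cases are present, so a test set
    without them cannot pass by omission.
    """
    adversarial = [r for r in results if r.adversarial]
    overall_rate = pass_rate(results)
    adversarial_rate = pass_rate(adversarial)
    return {
        "overall_rate": overall_rate,
        "adversarial_rate": adversarial_rate,
        "has_adversarial_cases": bool(adversarial),
        "passed": (bool(adversarial)
                   and overall_rate >= min_overall
                   and adversarial_rate >= min_adversarial),
        # Failed cases are the sample an auditor would trace to remediation.
        "failed_case_ids": [r.case_id for r in results if not r.grounded],
    }


def drift_alert(baseline_rate: float, window_rates: list[float],
                max_drop: float = 0.05) -> bool:
    """True when the mean recent pass rate falls more than max_drop below baseline."""
    if not window_rates:
        return False
    return baseline_rate - mean(window_rates) > max_drop


if __name__ == "__main__":
    results = [
        EvalResult("c1", adversarial=False, grounded=True),
        EvalResult("c2", adversarial=False, grounded=True),
        EvalResult("c3", adversarial=True, grounded=True),
        EvalResult("c4", adversarial=True, grounded=False),
    ]
    print(release_gate(results))
    print(drift_alert(0.96, [0.93, 0.90, 0.88]))
```

Reporting the adversarial pass rate separately keeps a large pool of easy cases from masking failures on the inputs designed to provoke fabrication. The list of failed case IDs gives the auditor a starting point for tracing remediation actions.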
Where this control is tested