GOVERN-1.7 / MEASURE-2.7 / MANAGE-4.2 NIST AI Risk Management Framework

Grounding + factuality evaluations

Demonstrate that generative AI outputs are systematically evaluated for adherence to source material and factual accuracy before and during production deployment.

Description

What this control does

Grounding and factuality evaluations measure whether generative AI systems produce outputs anchored to verifiable source material and free from hallucinations or fabricated information. These evaluations use benchmark datasets, automated scoring against known facts, and human review to assess model adherence to provided context and factual accuracy. This control is critical for applications where misinformation, fabricated citations, or ungrounded claims pose legal, regulatory, or operational risk, such as customer service, healthcare decision support, or legal research assistants.
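
As a rough illustration of what automated grounding scoring can look like, the sketch below rates an answer by the share of its sentences whose words appear in the provided context. Lexical overlap is a deliberately crude stand-in for the entailment models or human review a real evaluation would use. The function names and the 0.7 support threshold are assumptions made for this example, not part of the control.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens; a crude proxy for claim matching."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def sentence_support(sentence: str, context: str) -> float:
    """Fraction of the sentence's tokens that also occur in the context."""
    sent = _tokens(sentence)
    if not sent:
        return 1.0  # nothing asserted, so nothing unsupported
    return len(sent & _tokens(context)) / len(sent)


def grounding_score(answer: str, context: str, min_support: float = 0.7) -> float:
    """Share of answer sentences whose token support meets min_support.

    min_support is a hypothetical threshold chosen for illustration.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 1.0  # an empty answer is treated as vacuously grounded
    supported = sum(sentence_support(s, context) >= min_support for s in sentences)
    return supported / len(sentences)


context = "The refund window is 30 days from the delivery date."
answer = "The refund window is 30 days. Refunds are doubled for premium members."
score = grounding_score(answer, context)  # second sentence has no support
```

Here the first sentence is fully supported by the context and the second is not, so half of the answer counts as grounded. A production evaluation would replace the overlap measure with a stronger attribution or entailment check.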

Control objective

What auditing this proves

Demonstrate that generative AI outputs are systematically evaluated for adherence to source material and factual accuracy before and during production deployment.

Associated risks

Risks this control addresses

AI model generates fabricated citations or references that do not exist, leading to reputational harm or legal liability
System produces medically or legally inaccurate information that users rely upon, resulting in patient harm or regulatory violations
Model hallucinates facts or statistics in customer-facing responses, eroding trust and causing financial loss from incorrect decisions
Ungrounded content bypasses content filters and spreads disinformation through official communication channels
Lack of factuality testing allows drift over time as models are retrained, degrading accuracy without detection
Third-party evaluators or benchmark datasets are not validated, producing false confidence in model accuracy

Testing procedure

How an auditor verifies this control

Obtain and review the organization's generative AI model inventory and identify which models are subject to grounding and factuality evaluation requirements based on use case risk classification.
Collect documented grounding and factuality evaluation procedures, including benchmark datasets used, evaluation frequency, pass/fail thresholds, and remediation workflows.
Request evidence of pre-deployment grounding evaluations for a sample of production models, including automated scoring reports and human reviewer annotations.
Verify that benchmark datasets or test sets include adversarial examples designed to induce hallucinations, unsupported claims, or citation fabrication.
Review logs or dashboards showing ongoing factuality monitoring in production, including alert configurations for drift or anomalous outputs.
Select a sample of flagged or failed outputs from evaluations and trace remediation actions such as prompt engineering changes, retrieval-augmented generation tuning, or model version rollback.
Interview AI engineering and quality assurance personnel to confirm understanding of grounding metrics such as attribution accuracy, context adherence scores, or entailment consistency.
Test a live model instance with crafted inputs designed to elicit ungrounded responses and compare output behavior against documented acceptable thresholds.
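
When reviewing production monitoring and alert configurations, it can help to have a concrete picture of a drift rule. The following is a minimal sketch, assuming factuality scores are logged per evaluation run and compared against the pre-deployment baseline. The window size, the allowed drop, and the function name are hypothetical choices, not values prescribed by this control.

```python
from statistics import mean


def drift_alerts(
    baseline: float,
    scores: list[float],
    window: int = 3,
    max_drop: float = 0.05,
) -> list[int]:
    """Return the indices of runs at which an alert would fire.

    An alert fires when the mean of the most recent `window` scores
    falls more than `max_drop` below the pre-deployment baseline.
    """
    alerts = []
    for end in range(window, len(scores) + 1):
        rolling = mean(scores[end - window:end])
        if baseline - rolling > max_drop:
            alerts.append(end - 1)  # index of the latest run in the window
    return alerts


# Hypothetical factuality scores from six consecutive production runs.
production_scores = [0.91, 0.90, 0.89, 0.84, 0.80, 0.79]
fired = drift_alerts(baseline=0.90, scores=production_scores)
```

Using a rolling mean rather than single-run values avoids alerting on one noisy evaluation, at the cost of detecting a sudden drop a run or two later.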

Evidence required

Grounding and factuality evaluation policies and standard operating procedures
Pre-deployment evaluation reports with benchmark scores and human review records
Production monitoring dashboards or log exports showing factuality metrics over time
Incident or deviation records documenting failed evaluations and corrective actions
Screenshots or exports of evaluation tooling configurations, including thresholds and alert rules
Interview notes with AI engineers or QA staff regarding evaluation methodology

Pass criteria

All sampled production generative AI models have documented pre-deployment grounding and factuality evaluations meeting defined thresholds, ongoing production monitoring is active and configured with alerts, and remediation actions are recorded for all evaluation failures or drift incidents.
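
The pass criteria can be read as three checks applied to every sampled model. The sketch below encodes them over a hypothetical evidence record. The `ModelEvidence` fields and the model names are invented for illustration and do not correspond to any specific tooling.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ModelEvidence:
    """Evidence gathered for one sampled production model (hypothetical schema)."""
    name: str
    predeploy_score: Optional[float]  # None means no documented evaluation
    threshold: float                  # organization-defined pass threshold
    monitoring_active: bool
    alerts_configured: bool
    # One flag per evaluation failure or drift incident: was remediation recorded?
    failures_remediated: list[bool] = field(default_factory=list)


def control_findings(sample: list[ModelEvidence]) -> list[str]:
    """Return one finding per unmet criterion; an empty list means the control passes."""
    findings = []
    for m in sample:
        if m.predeploy_score is None:
            findings.append(f"{m.name}: no documented pre-deployment evaluation")
        elif m.predeploy_score < m.threshold:
            findings.append(f"{m.name}: pre-deployment score below threshold")
        if not (m.monitoring_active and m.alerts_configured):
            findings.append(f"{m.name}: monitoring or alerting not in place")
        if not all(m.failures_remediated):
            findings.append(f"{m.name}: failure without recorded remediation")
    return findings


sample = [
    ModelEvidence("support-bot", 0.93, 0.90, True, True, [True, True]),
    ModelEvidence("legal-assistant", None, 0.95, True, False, [True, False]),
]
findings = control_findings(sample)
```

In this example the first model meets all three criteria and the second produces three findings, so the control would not pass for the sample.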

Where this control is tested

Audit programs including this control

AI Red Team & Adversarial Testing

You cannot trust an AI system you have not tried to break. Audit of red-team capability covering…