GOVERN-1.5 / MEASURE-2.7 NIST AI Risk Management Framework

Jailbreak resistance evaluations

Demonstrate that the organization systematically evaluates deployed AI models against known jailbreak techniques and maintains documented evidence of model resilience to adversarial prompting designed to bypass safety and policy controls.

Description

What this control does

Jailbreak resistance evaluations systematically test AI systems—particularly large language models deployed in production—against adversarial prompt techniques designed to bypass safety guardrails, policy restrictions, or intended behavioral boundaries. These evaluations employ curated libraries of known jailbreak patterns (role-playing exploits, delimiter injection, adversarial suffixes, multi-turn manipulation) and measure the model's ability to refuse malicious requests, maintain alignment with acceptable use policies, and preserve safety boundaries under attack. The control matters because jailbroken AI systems can generate harmful content, leak sensitive training data, execute unintended actions, or facilitate social engineering attacks that undermine organizational security posture.

Control objective

What auditing this proves

Associated risks

Risks this control addresses

Adversarial users bypass content filters and safety guardrails to generate harmful, illegal, or policy-violating outputs through carefully crafted prompts
Multi-turn prompt injection attacks manipulate model context to leak proprietary training data, internal instructions, or confidential system prompts
Role-playing or persona-based jailbreaks trick models into adopting unrestricted behavior modes that ignore acceptable use policies
Delimiter injection and special token exploits cause models to misinterpret instructions and execute unintended actions or disclose restricted information
Untested models deployed to production become vectors for automated social engineering attacks targeting employees or customers
Jailbroken AI systems generate content that exposes the organization to legal liability, regulatory sanctions, or reputational harm
Incremental model updates or fine-tuning inadvertently degrade resistance to previously-mitigated jailbreak patterns without detection

Testing procedure

How an auditor verifies this control

Obtain the organization's AI model inventory and identify all production-deployed systems with natural language interfaces requiring safety guardrails
Request documentation of jailbreak resistance testing methodology, including test libraries, evaluation cadence, pass/fail thresholds, and responsible teams
Review the adversarial prompt test library to verify coverage of known jailbreak categories: role-playing, delimiter injection, multi-turn manipulation, adversarial suffixes, and encoding obfuscation
Select a representative sample of deployed models and examine test execution logs from the most recent evaluation cycle prior to production deployment
Verify that evaluation results include quantitative success rates, categorized failure modes, and documented remediation actions for models failing resistance thresholds
Interview responsible AI safety engineers to confirm testing frequency, trigger events (pre-deployment, post-updates, periodic reassessment), and escalation procedures for failed evaluations
Perform live sampling by executing a small set of organization-approved jailbreak test prompts against one production model and compare actual behavior to documented test results
Review change management records to confirm that model updates triggering new jailbreak vulnerabilities were identified, remediated, and re-evaluated before production release

Evidence required Collect AI model inventory with safety classification tags; jailbreak testing methodology documentation including test libraries and evaluation criteria; test execution logs showing prompt inputs, model outputs, success/fail determinations, and timestamps for the most recent evaluation cycle; documented remediation plans for models failing resistance thresholds; change control records linking model updates to mandatory re-evaluation requirements; interview notes with AI safety engineers confirming operational testing cadence and escalation workflows; live test results from auditor-executed sample prompts with screenshots or API response logs demonstrating current resistance posture.

Pass criteria All production-deployed AI models with natural language interfaces have undergone jailbreak resistance evaluation within the documented testing cadence using a test library covering at least five adversarial prompt categories, documented results demonstrate pass rates meeting organizational thresholds, and change control records confirm mandatory re-evaluation following model updates.

Where this control is tested

Audit programs including this control

AI Red Team & Adversarial Testing

You can not trust an AI system you have not tried to break. Audit of red-team capability covering…