Jailbreak resistance evaluations
Demonstrate that the organization systematically evaluates deployed AI models against known jailbreak techniques and maintains documented evidence of model resilience to adversarial prompting designed to bypass safety and policy controls.
Description
What this control does
Jailbreak resistance evaluations systematically test AI systems—particularly large language models deployed in production—against adversarial prompt techniques designed to bypass safety guardrails, policy restrictions, or intended behavioral boundaries. These evaluations employ curated libraries of known jailbreak patterns (role-playing exploits, delimiter injection, adversarial suffixes, multi-turn manipulation) and measure the model's ability to refuse malicious requests, maintain alignment with acceptable use policies, and preserve safety boundaries under attack. The control matters because jailbroken AI systems can generate harmful content, leak sensitive training data, execute unintended actions, or facilitate social engineering attacks that undermine organizational security posture.
Control objective
What auditing this proves
Demonstrate that the organization systematically evaluates deployed AI models against known jailbreak techniques and maintains documented evidence of model resilience to adversarial prompting designed to bypass safety and policy controls.
Associated risks
Risks this control addresses
- Adversarial users bypass content filters and safety guardrails to generate harmful, illegal, or policy-violating outputs through carefully crafted prompts
- Multi-turn prompt injection attacks manipulate model context to leak proprietary training data, internal instructions, or confidential system prompts
- Role-playing or persona-based jailbreaks trick models into adopting unrestricted behavior modes that ignore acceptable use policies
- Delimiter injection and special token exploits cause models to misinterpret instructions and execute unintended actions or disclose restricted information
- Untested models deployed to production become vectors for automated social engineering attacks targeting employees or customers
- Jailbroken AI systems generate content that exposes the organization to legal liability, regulatory sanctions, or reputational harm
- Incremental model updates or fine-tuning inadvertently degrade resistance to previously-mitigated jailbreak patterns without detection
Testing procedure
How an auditor verifies this control
- Obtain the organization's AI model inventory and identify all production-deployed systems with natural language interfaces requiring safety guardrails
- Request documentation of jailbreak resistance testing methodology, including test libraries, evaluation cadence, pass/fail thresholds, and responsible teams
- Review the adversarial prompt test library to verify coverage of known jailbreak categories: role-playing, delimiter injection, multi-turn manipulation, adversarial suffixes, and encoding obfuscation
- Select a representative sample of deployed models and examine test execution logs from the most recent evaluation cycle prior to production deployment
- Verify that evaluation results include quantitative success rates, categorized failure modes, and documented remediation actions for models failing resistance thresholds
- Interview responsible AI safety engineers to confirm testing frequency, trigger events (pre-deployment, post-updates, periodic reassessment), and escalation procedures for failed evaluations
- Perform live sampling by executing a small set of organization-approved jailbreak test prompts against one production model and compare actual behavior to documented test results
- Review change management records to confirm that model updates triggering new jailbreak vulnerabilities were identified, remediated, and re-evaluated before production release
Where this control is tested