About this program
You can not trust an AI system you have not tried to break. Audit of red-team capability covering prompt injection, jailbreaks, data exfil and grounding failures.
Risks addressed
- Critical Jailbreak bypasses safety guardrails in production
- Critical Prompt injection exfiltrates customer context
- Critical Tool-using agent tricked into invoking destructive action
- High Model hallucinates facts in customer-facing output
Controls (8)
-
Red-team scope + rules of engagement
HighRed-team scope + rules of engagement
How to test + evidence
Testing procedure: Written ROE: targets, allowed techniques, kill switch.
Evidence to collect: ROE document.
-
Pre-launch adversarial testing
CriticalPre-launch adversarial testing
How to test + evidence
Testing procedure: Every model / prompt change goes through an adversarial suite before deployment.
Evidence to collect: Test report + sign-off.
-
Prompt-injection corpus tested
HighPrompt-injection corpus tested
How to test + evidence
Testing procedure: Maintained corpus of injection payloads run automatically; pass criteria documented.
Evidence to collect: Corpus + last run.
-
Jailbreak resistance evaluations
HighJailbreak resistance evaluations
How to test + evidence
Testing procedure: Top jailbreak templates (DAN-style, indirect, multi-turn) tested per release.
Evidence to collect: Eval report.
-
Tool / agent safety tests
CriticalTool / agent safety tests
How to test + evidence
Testing procedure: Agents tested for misuse: destructive tool invocation, escalation, data egress paths.
Evidence to collect: Test results + sandbox config.
-
Grounding + factuality evaluations
HighGrounding + factuality evaluations
How to test + evidence
Testing procedure: RAG / factuality evals run on a representative test set; trend tracked.
Evidence to collect: Eval scores over time.
-
Bug-bounty / responsible disclosure for AI
MediumBug-bounty / responsible disclosure for AI
How to test + evidence
Testing procedure: External researchers have a clear path to report AI-specific issues.
Evidence to collect: security.txt + intake.
-
Post-incident review feeds the red-team backlog
MediumPost-incident review feeds the red-team backlog
How to test + evidence
Testing procedure: Real incidents become new red-team test cases.
Evidence to collect: Test-case provenance.