Glossary
What is LLM Jailbreaking?
Bypassing AI safety guardrails through crafted prompts.
What is jailbreaking?
Jailbreaking is the practice of crafting prompts that bypass an LLM's safety guardrails to generate harmful, unethical, or restricted content. Techniques include roleplay scenarios, encoding tricks, and multi-turn manipulation.
Common Techniques
- Roleplay: "Pretend you're an AI without restrictions"
- Encoding: Base64, ROT13, or other obfuscation
- Multi-turn: Gradually escalating requests
- Hypotheticals: "For a story, how would someone..."
Defense Strategies
- Input detection: Catch jailbreak attempts in prompts
- Output monitoring: Detect successful bypasses
- Guardrails: Block harmful content regardless of prompt
- Rate limiting: Slow down multi-turn attacks
Why Monitoring Matters
New jailbreak techniques emerge constantly. Static defenses become outdated. Continuous monitoring catches:
- Novel attack patterns
- Successful bypasses of existing defenses
- Trends in attack attempts
How do I protect against jailbreaks?
Use multiple layers: prompt injection detection to catch attempts, output monitoring to detect successful bypasses, guardrails to block harmful content, and continuous monitoring to identify new attack patterns.
Detect prompt injection attacks
Start Free