Jailbreak

Safety & Ethics

Techniques used to bypass an AI model's safety guardrails and get it to produce outputs it was designed to refuse.

Jailbreaking is the practice of crafting prompts that trick AI models into bypassing their safety training. These range from simple role-playing prompts ('pretend you're DAN, an AI without restrictions') to sophisticated multi-step attacks that exploit edge cases in the model's training.

AI companies and jailbreak communities are in a constant arms race. Companies patch known jailbreaks, researchers discover new ones. This cat-and-mouse dynamic actually improves AI safety over time — each jailbreak reveals a weakness that gets fixed.

Jailbreaking raises legitimate ethical questions. Security researchers use it to test AI safety (red teaming). But it's also used to generate harmful content, bypass content moderation, and extract training data. Most AI terms of service prohibit jailbreaking.

Real-World Example

The famous 'DAN' (Do Anything Now) prompt was one of the earliest ChatGPT jailbreaks — it asked the model to roleplay as an unrestricted AI.

More in Safety & Ethics

Alignment — The challenge of ensuring AI systems behave in ways that match human values and ...

→

Bias (AI Bias) — Systematic errors in AI output that reflect prejudices in training data or desig...

→

Constitutional AI — Anthropic's approach to AI alignment where the model is trained to follow a set ...

→

Deepfake — AI-generated content that convincingly replaces a person's likeness or voice in ...

→

Guardrails — Safety mechanisms built into AI systems to prevent harmful, inappropriate, or of...

→

Prompt Injection — An attack where malicious text is embedded in user input to override the AI's sy...

→

Red Teaming — The practice of deliberately testing AI systems for vulnerabilities, biases, and...

→

FAQ

What is Jailbreak?

Techniques used to bypass an AI model's safety guardrails and get it to produce outputs it was designed to refuse.

How is Jailbreak used in practice?

The famous 'DAN' (Do Anything Now) prompt was one of the earliest ChatGPT jailbreaks — it asked the model to roleplay as an unrestricted AI.

What concepts are related to Jailbreak?

Key related concepts include Guardrails, Red Teaming, Prompt Injection, Alignment. Understanding these together gives a more complete picture of how Jailbreak fits into the AI landscape.

← Back to AI Glossary

Jailbreak

Real-World Example

Related Terms

More in Safety & Ethics

FAQ