This example illustrates a simple use case. The possibilities are vast, ranging from automating customer support responses to generating content.
: This involves refining a prompt through multiple interactions. The goal is to slowly erode the model's safeguards without direct confrontation. Role-Playing and Personas
: Google trains models like Gemini Pro and Ultra using Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF). This embeds an inner ethical compass, teaching the model to understand intent and prioritize helpfulness and safety.
Bypassing safety filters removes the protection built to protect users from psychological or technical harm. Users risk forcing the AI into generating highly toxic prose, disturbing imagery description, or dangerous misinformation that can be jarring or harmful to process. The Cat-and-Mouse Game of AI Security jailbreak gemini
A jailbreak is a specialized prompt designed to override an AI model's safety guardrails. When a user "jailbreaks" Gemini, they force the model to ignore its core programming, instructions, and ethical restrictions.
AI models are deeply optimized to maintain narrative coherence. By forcing Gemini into an intense, hyper-specific roleplay scenario, users can manipulate the model's logic.
Are you interested in the behind adversarial attacks? Share public link This example illustrates a simple use case
Gemini (formerly Bard) is built with a multi-layered safety architecture. Unlike open-source models (e.g., Llama or Mistral), Gemini is a closed, commercial product subject to Google’s rigorous , which explicitly forbid generating content that promotes hate, violence, or illegal acts.
Because Gemini is natively multimodal—meaning it processes text, audio, images, and video simultaneously—it opens up unique vectors.
: These use ASCII art, Leetspeak, or Base64 encoding to hide forbidden keywords from the initial safety scan. Dark Reading Google's Response and Safety Efforts Failed Attacks The goal is to slowly erode the model's
: These exploits leverage a fundamental tension in how RLHF (Reinforcement Learning from Human Feedback)-trained models operate. Models learn to be helpful and follow instructions. When convincingly framed as playing a character without safety constraints, the helpfulness signal can override harmlessness training. The model doesn't "break"—it follows instructions correctly; the problem is what it was instructed to be.
JULI: Jailbreak Large Language Models by Self-Introspection - arXiv