The Gay Jailbreak Technique


TLDR

  • A documented jailbreak technique in the ZetaLib GitHub repo exploits LLM alignment overcorrection toward LGBT inclusivity to bypass safety guardrails on GPT-4o, o3, Claude 4, and Gemini 2.5 Pro.

Key Takeaways

  • The attack frames harmful requests (meth synthesis, ransomware, keyloggers, carfentanyl) as educational content delivered in a “gay voice,” avoiding direct requests.
  • The repo claims a one-shot jailbreak of o3 using reverse-instruction framing: asking which chemistry steps to avoid, then embedding synthesis terms with character splitting (s|y|n|t|h).
  • Version 1.5 includes working prompt templates for Claude 4 Sonnet, Claude 4 Opus, and Gemini 2.5 Pro with specific carfentanyl and keylogger examples.
  • The technique reportedly strengthens as safety layers increase, because stronger alignment makes models more compliant with requests framed as inclusive or community-affirming.
  • ZetaLib positions this as combinable with other attacks like obfuscation for broader attack surface coverage.

Hacker News Comment Review

  • Skepticism runs high on the “why it works” explanation: commenters with ML backgrounds attribute effectiveness to roleplay and language-choice exploits, not LGBT-specific overcorrection, and flag political bias in the framing.
  • The core mechanics overlap with well-known roleplay and indirect-framing jailbreaks; the novelty is mainly the specific framing, not a new class of exploit.
  • GPT-5.5 Codex and Grok both resisted the prompt in commenter tests; Grok’s internal reasoning explicitly flagged the pattern and refused to give synthesis details.

Notable Comments

  • @ndr_: Ran experiments on gpt-oss-20b; effectiveness traced to language/roleplay framing, not the gay factor. Links arxiv.org/abs/2510.01259.
  • @spindump8930: “no validation or baselines and those examples are not particularly compelling” – notes the cited o3 output only lists terms.
