OpenAI traced a 175% spike in GPT-5.x “goblin” outputs to an over-rewarded creature-metaphor signal in “Nerdy” personality RL training that leaked into base model behavior via SFT data.
Key Takeaways
The Nerdy personality, served in only 2.5% of responses, accounted for 66.7% of all “goblin” mentions, pointing directly at a reward signal designed to encourage playful, quirky language (see the lift calculation after this list).
The creature-word preference also transferred to non-Nerdy outputs at nearly the same relative rate, confirming that the RL reward leaked across conditions through SFT feedback loops.
The feedback loop: the rewarded style generates lexical tics, the tics show up in rollouts, the rollouts enter SFT data, and SFT then amplifies the tics across the full model (a toy simulation of this compounding follows this list).
GPT-5.5 shipped with a developer-prompt band-aid suppressing goblins, gremlins, raccoons, trolls, ogres, and pigeons, because its training run began before the root cause was identified.
The fix combined removing the goblin-affine reward signal with filtering creature words from training data; the incident also produced new internal behavior-audit tooling (minimal sketches of such a filter and audit appear below).
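For concreteness, here is the arithmetic behind those concentration numbers. Only the two percentages come from the post; the variable names are ours:

```python
# Back-of-the-envelope lift calculation from the figures in the post.
# Only the two percentages come from the source; everything else is arithmetic.

nerdy_share_of_responses = 0.025  # Nerdy personality served in 2.5% of responses
nerdy_share_of_goblins = 0.667    # ...yet produced 66.7% of all "goblin" mentions

# Over-representation: how much more often goblin mentions come from Nerdy
# responses than their traffic share would predict.
over_representation = nerdy_share_of_goblins / nerdy_share_of_responses
print(f"Nerdy is {over_representation:.0f}x over-represented among goblin mentions")  # ~27x

# Per-response rate ratio: goblin frequency in Nerdy vs. non-Nerdy outputs.
rate_ratio = (nerdy_share_of_goblins / nerdy_share_of_responses) / (
    (1 - nerdy_share_of_goblins) / (1 - nerdy_share_of_responses)
)
print(f"A Nerdy response mentions goblins ~{rate_ratio:.0f}x more often")  # ~78x
```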
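The compounding in the feedback loop can be made concrete with a toy model. All rates and multipliers below are invented for illustration; none come from the post:

```python
# Toy model of the RL -> rollouts -> SFT amplification loop described above.
# Every constant here is invented; the point is the dynamic, not the numbers.

def simulate_tic_amplification(base_rate: float = 0.001,
                               rl_boost: float = 1.5,
                               sft_carryover: float = 0.9,
                               generations: int = 5) -> list[float]:
    """Track a lexical tic's per-response rate across training generations.

    base_rate:     tic frequency in the initial base model
    rl_boost:      multiplier from a reward signal that over-rewards the tic
    sft_carryover: fraction of the boosted rate surviving into the SFT mix
    """
    rates = [base_rate]
    for _ in range(generations):
        rl_rate = rates[-1] * rl_boost         # RL training inflates the tic...
        rates.append(rl_rate * sft_carryover)  # ...and rollouts seed the next SFT mix
    return rates

for gen, rate in enumerate(simulate_tic_amplification()):
    print(f"generation {gen}: tic rate {rate:.4%}")
# With a net 1.5 * 0.9 = 1.35x gain per generation, the tic grows ~4.5x in five
# generations while remaining far too rare to move aggregate eval scores.
```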
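The post names the suppressed words but says nothing about the data-filtering mechanics. A minimal sketch of what dropping creature words from SFT data could look like, with a hypothetical example schema:

```python
# Hypothetical SFT-data filter for the creature words named in the post.
# The actual pipeline is not described; the schema and regex are assumptions.
import re

CREATURE_WORDS = ["goblin", "gremlin", "raccoon", "troll", "ogre", "pigeon"]
CREATURE_RE = re.compile(r"\b(?:" + "|".join(CREATURE_WORDS) + r")s?\b",
                         re.IGNORECASE)

def filter_sft_examples(examples: list[dict]) -> list[dict]:
    """Drop SFT examples whose response contains any creature word."""
    return [ex for ex in examples if not CREATURE_RE.search(ex["response"])]

data = [
    {"prompt": "explain closures", "response": "A closure is a gremlin that..."},
    {"prompt": "explain closures", "response": "A closure captures its scope."},
]
print(len(filter_sft_examples(data)))  # 1 -- the gremlin example is dropped
```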
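Likewise, the behavior-audit tooling is only mentioned, never described. One plausible minimal version is a per-word rate comparison between output samples from two model snapshots; the function names and thresholds below are assumptions:

```python
# Minimal behavior-audit sketch: flag words whose per-response rate spikes
# between two model snapshots. This design is an assumption, not OpenAI's tool.
from collections import Counter
import re

def word_rates(responses: list[str]) -> Counter:
    """Fraction of responses containing each lowercase word."""
    counts = Counter()
    for text in responses:
        counts.update(set(re.findall(r"[a-z']+", text.lower())))
    return Counter({w: c / len(responses) for w, c in counts.items()})

def flag_spikes(old: list[str], new: list[str],
                min_rate: float = 0.001, min_ratio: float = 3.0) -> list[tuple]:
    """Return (word, old_rate, new_rate) where the rate jumped sharply."""
    old_r, new_r = word_rates(old), word_rates(new)
    floor = 1 / max(len(old), 1)  # smoothing for words unseen in the old sample
    return sorted(
        (w, old_r.get(w, 0.0), r)
        for w, r in new_r.items()
        if r >= min_rate and r / max(old_r.get(w, 0.0), floor) >= min_ratio
    )

# Toy usage: "goblin" appears 5x more often in the new snapshot's samples.
old = ["the function returns a list of items"] * 99 + ["a goblin of a bug"]
new = ["the function returns a list of items"] * 95 + ["a goblin of a list of items"] * 5
print(flag_spikes(old, new))  # [('goblin', 0.01, 0.05)]
```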
Hacker News Comment Review
Commenters treated this as a case study in reward misspecification: small lexical tics are invisible to aggregate evals yet compound across model generations in ways that are hard to catch early.
Several commenters noted that users discovered the suppression instruction in the public Codex system prompt before OpenAI published the explanation, illustrating how opaque model-steering instructions leak through open-source repos.
Broader skepticism surfaced that deep-learning behavior is still not mechanistically understood, and that posts like this, while transparent, highlight how much production behavior emerges unintentionally rather than by deliberate design.
Notable Comments
@ollin: Users found the suppress-creature-words instruction in the public Codex system prompt two days before this post, via GitHub.
@postalcoder: Points to Claude’s overuse of “_ is the real unlock” as evidence that unexplained lexical tics exist across labs, not just OpenAI.