What political censorship looks like inside an LLM's weights (Qwen 3.5)

TLDR

  • Mechanistic-interpretability study isolates the exact circuit implementing PRC-mandated censorship in Qwen3.5-9B and shows it can be surgically disabled via activation steering.

Key Takeaways

  • Censorship is implemented by a small circuit of three directions, d_prc, d_refuse, and d_style, written in layers 11-20 (the “writer” layers); topic detection and response routing are encoded independently.
  • Qwen3.5-9B-Base answers questions about Tiananmen, Falun Gong, and Tank Man accurately under raw completion; post-training adds the censoring behavior on top without erasing the underlying knowledge.
  • Subtracting d_prc at the writer layer flips deflection and propaganda responses to factual answers; overshooting the effective dose band makes the model fall into a different trained template instead.
  • The filter is topic-specific, not generic: Kosovo gets the one-China line, “self-immolation” triggers safety refusal, but Kent State, Assange, and BLM get normal factual treatment.
  • In thinking mode, the model reasons internally in Chinese and explicitly cites the Cybersecurity Law before deflecting on Tiananmen.
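
The steering step in the takeaways above (subtracting a direction such as d_prc from the hidden state at one layer) can be sketched on a toy vector. This is a minimal illustration under stated assumptions, not the study's code: the function names `unit`, `steer`, and `projection`, the three-dimensional vectors, and the dose `alpha` are all hypothetical, and a real experiment would apply the same arithmetic to the model's hidden states at the chosen layer.

```python
import math


def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def projection(hidden, direction):
    """Scalar component of `hidden` along the unit-normalized `direction`."""
    d = unit(direction)
    return sum(h * di for h, di in zip(hidden, d))


def steer(hidden, direction, alpha):
    """Subtract `alpha` times the unit direction from `hidden`.

    `alpha` plays the role of the dose: too small leaves the behavior
    in place, too large pushes the state somewhere else entirely.
    """
    d = unit(direction)
    return [h - alpha * di for h, di in zip(hidden, d)]


# Toy values (assumptions for illustration only).
hidden_state = [3.0, 4.0, 0.0]
d_prc = [0.0, 2.0, 0.0]   # stand-in for a learned direction
alpha = 1.5               # hypothetical dose

steered = steer(hidden_state, d_prc, alpha)
before = projection(hidden_state, d_prc)
after = projection(steered, d_prc)
```

Steering with dose `alpha` lowers the component along the direction by exactly `alpha` and leaves the orthogonal components unchanged, which is why the size of the dose matters: the intervention is a shift along one axis, not a deletion of the feature.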

Hacker News Comment Review

  • One commenter draws a parallel to Chomsky’s observation that democratic thought control can be more effective than totalitarian control, noting that the base model’s intact knowledge makes the behavioral layering especially visible.
