LLMs Corrupt Your Documents When You Delegate

· ai ai-agents ·

TLDR

  • arXiv paper finds frontier LLMs corrupt ~25% of document content in long delegated workflows across 52 professional domains.

Key Takeaways

  • DELEGATE-52 benchmark tests 19 LLMs on sustained document editing across domains including coding, crystallography, and music notation.
  • Even top models (Gemini 2.5 Pro, Claude 3.7 Opus, GPT-4.5) hit ~25% corruption rates; weaker models fail worse.
  • Errors are sparse but severe and compound silently over long interactions – the core danger for agentic pipelines.
  • Agentic tool use does not improve DELEGATE-52 scores; larger documents, longer sessions, and distractor files all worsen degradation.
  • No current model is a reliable autonomous delegate for high-fidelity document work.
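The "sparse but compounding" failure mode above can be illustrated with a toy model. This is purely illustrative and not the paper's methodology: assume each delegated editing pass independently corrupts a fixed fraction of the document, and watch how little of it survives a long session.

```python
# Illustrative only: a simple independent-corruption model (NOT the
# paper's evaluation method) showing how small per-pass error rates
# compound over a long delegated workflow.

def surviving_fraction(p_per_pass: float, n_passes: int) -> float:
    """Fraction of content left untouched after n independent passes,
    if each pass corrupts a fraction p_per_pass of the remainder."""
    return (1.0 - p_per_pass) ** n_passes

# Even a modest 3% per-pass corruption rate erodes roughly a quarter
# of the document after ten rounds of delegated editing.
loss_after_10 = 1.0 - surviving_fraction(0.03, 10)
assert 0.25 < loss_after_10 < 0.27
```

Under this toy assumption, per-pass rates that look negligible in a single interaction produce document-scale damage over a session, which is consistent with the benchmark's finding that longer sessions worsen degradation.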

Hacker News Comment Review

  • Commenters recognize the compounding problem from practice: errors accumulate as context grows, matching the paper’s finding that degradation scales with interaction length.
  • The “semantic ablation” framing – repeated LLM passes eroding meaning – is seen as a real systemic risk beyond single-session use cases.

Notable Comments

  • @jonmoore: Flags the round-trip invertible-step evaluation method as strong; asks whether Python’s better results generalize to other languages or are benchmark artifacts.
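The round-trip idea the commenter praises can be sketched roughly as follows. This is a hypothetical minimal version, not the paper's actual protocol: apply a known invertible transformation to a document, ask the model to undo it, and treat any divergence from the original as corruption. Here the "model" is stubbed with ROT13, which is its own inverse.

```python
# Hypothetical sketch of a round-trip invertible-step check; the paper's
# exact protocol is not described in this summary. `model_invert` is a
# placeholder for an LLM call, not a real API.
import codecs

def rot13(text: str) -> str:
    # ROT13 is self-inverse, so it makes a trivially invertible step.
    return codecs.encode(text, "rot13")

def round_trip_ok(original: str, model_invert) -> bool:
    """Transform the document, ask the delegate to invert the step,
    and check for exact recovery; any divergence counts as corruption."""
    transformed = rot13(original)
    recovered = model_invert(transformed)
    return recovered == original

# With a perfect inverter the round trip succeeds exactly:
assert round_trip_ok("def main(): pass", rot13)
```

The appeal of this style of evaluation is that ground truth is free: because the step is invertible, the correct output is known byte-for-byte, so corruption can be measured without human grading.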
