Following the Text Gradient at Scale


TLDR

  • The paper introduces Feedback Descent, which replaces scalar RL rewards with accumulated natural-language evaluator feedback to guide LLM-based optimization across domains.

Key Takeaways

  • Standard RL compresses rich evaluator output into a single scalar reward, discarding causal failure signals – the paper calls this “sucking supervision through a straw.”
  • Feedback Descent alternates between two components: an evaluator that produces structured text feedback and an editor LLM that revises candidates using all prior feedback accumulated in-context (see the sketch after this list).
  • On molecular design (SMILES optimization), Feedback Descent matched or beat graph-based specialists and REINVENT, using 3.8x fewer docking calls than RL and reaching the 99.9th percentile of a 260k-compound database.
  • The same evaluator-editor loop transferred to SVG image optimization and prompt optimization (outperforming GRPO, competitive with GEPA) with no domain-specific architecture changes.
  • The approach frames text as a learning substrate complementary to weight updates, avoiding catastrophic forgetting because feedback accumulates as persistent context rather than as entangled parameter updates.
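
A minimal sketch of the loop, assuming two black-box callables: `evaluate` (e.g. a docking pipeline or LLM judge returning a textual critique plus a score) and `edit` (an LLM call that proposes a revision given the candidate and all accumulated feedback). The names, the `Feedback` record, and the fixed step budget are illustrative assumptions, not the paper's exact interface.

```python
from dataclasses import dataclass


@dataclass
class Feedback:
    candidate: str  # the candidate that was evaluated
    critique: str   # structured natural-language feedback from the evaluator
    score: float    # scalar kept only for selecting the best, not for learning


def feedback_descent(initial: str, evaluate, edit, steps: int = 20) -> str:
    """Alternate evaluator and editor, accumulating all feedback in context.

    evaluate(candidate) -> (critique: str, score: float) and
    edit(candidate, history) -> str are assumed black boxes; both
    names are illustrative placeholders, not the paper's API.
    """
    history: list[Feedback] = []  # persistent textual context, never discarded
    candidate, best, best_score = initial, initial, float("-inf")

    for _ in range(steps):
        critique, score = evaluate(candidate)  # rich text, not just a scalar
        history.append(Feedback(candidate, critique, score))
        if score > best_score:
            best, best_score = candidate, score
        # The editor sees every prior (candidate, critique) pair in-context,
        # so causal failure explanations persist instead of being compressed
        # into a single reward signal.
        candidate = edit(candidate, history)

    return best
```

The contrast with scalar RL lives in the `history` list: a policy-gradient method would compress each evaluation to a single number, while here the full critique text is carried forward in-context, which is also why nothing is catastrophically forgotten: no weights change, so earlier lessons cannot be overwritten.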

Hacker News Comment Review

  • One commenter notes this pattern closely resembles BindCraft for protein/drug design, which uses AF2 folding confidence and structural scoring as rich feedback – the key difference is that Feedback Descent adds explicit feedback accumulation across iterations.

Original | Discuss on HN