The paper introduces Feedback Descent, which replaces scalar RL rewards with accumulated natural-language evaluator feedback to guide LLM-based optimization across domains.
Key Takeaways
Standard RL compresses rich evaluator output into a single scalar reward, discarding causal failure signals – the paper calls this “sucking supervision through a straw.”
Feedback Descent alternates two components: an evaluator producing structured text feedback and an editor LLM revising candidates using accumulated prior feedback in-context (a minimal sketch of this loop follows the takeaways below).
On molecular design (SMILES optimization), Feedback Descent matched or beat graph-based specialists and REINVENT, using 3.8x fewer docking calls than the RL baseline and reaching the 99.9th percentile of a 260k-compound database.
The same evaluator-editor loop transferred to SVG image optimization and prompt optimization (outperforming GRPO, competitive with GEPA) with no domain-specific architecture changes.
The approach frames text as a complementary learning substrate to weight updates, avoiding catastrophic forgetting since feedback accumulates as persistent context rather than entangled parameters.
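To make the evaluator-editor loop concrete, here is a minimal sketch of how such a loop could be wired up. The function names (`evaluate`, `llm_edit`) and the history format are hypothetical stand-ins, not the paper's actual API; the point is only that the editor sees the full accumulated text feedback rather than a single scalar.

```python
def feedback_descent(seed_candidate, evaluate, llm_edit, steps=20):
    """Sketch of a Feedback Descent-style loop (assumed interfaces, not the paper's code).

    evaluate(candidate) -> (score, feedback): scalar score plus structured text feedback.
    llm_edit(current, history) -> new candidate: editor LLM revises using all prior feedback.
    """
    best, best_score = seed_candidate, float("-inf")
    history = []  # accumulated (candidate, feedback, score) records kept in-context
    candidate = seed_candidate

    for _ in range(steps):
        score, feedback = evaluate(candidate)  # rich text feedback, not just a reward
        history.append({"candidate": candidate, "feedback": feedback, "score": score})
        if score > best_score:
            best, best_score = candidate, score
        # The editor LLM conditions on the full feedback history, so supervision
        # accumulates as persistent context rather than being compressed to a scalar.
        candidate = llm_edit(current=candidate, history=history)

    return best, best_score
```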
Hacker News Comment Review
One commenter notes that this pattern closely resembles BindCraft for protein/drug design, which uses AF2 folding confidence and structural scoring as rich feedback; the key difference is that Feedback Descent adds explicit feedback accumulation across iterations.