The paper introduces Feedback Descent, which replaces scalar RL rewards with accumulated natural-language evaluator feedback to guide LLM-based optimization across domains.
Key Takeaways
Standard RL compresses rich evaluator output into a single scalar reward, discarding causal failure signals – the paper calls this “sucking supervision through a straw.”
Feedback Descent alternates two components: an evaluator producing structured text feedback and an editor LLM revising candidates using accumulated prior feedback in-context (a minimal sketch of this loop follows the takeaways below).
On molecular design (SMILES optimization), Feedback Descent matched or beat graph-based specialists and REINVENT, using 3.8x fewer docking calls than the RL baseline and reaching the 99.9th percentile of a 260k-compound database.
The same evaluator-editor loop transferred to SVG image optimization and prompt optimization (outperforming GRPO, competitive with GEPA) with no domain-specific architecture changes.
The approach frames text as a complementary learning substrate to weight updates, avoiding catastrophic forgetting since feedback accumulates as persistent context rather than entangled parameters.
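To make the evaluator-editor loop concrete, here is a minimal sketch of how such a loop could be wired up. The function names (`evaluate`, `llm_edit`) and the history format are hypothetical stand-ins, not the paper's actual API; the point is only that the editor sees the full accumulated text feedback rather than a single scalar.

```python
def feedback_descent(seed_candidate, evaluate, llm_edit, steps=20):
    """Sketch of a Feedback Descent-style loop (assumed interfaces, not the paper's code).

    evaluate(candidate) -> (score, feedback): scalar score plus structured text feedback.
    llm_edit(current, history) -> new candidate: editor LLM revises using all prior feedback.
    """
    best, best_score = seed_candidate, float("-inf")
    history = []  # accumulated (candidate, feedback, score) records kept in-context
    candidate = seed_candidate

    for _ in range(steps):
        score, feedback = evaluate(candidate)  # rich text feedback, not just a reward
        history.append({"candidate": candidate, "feedback": feedback, "score": score})
        if score > best_score:
            best, best_score = candidate, score
        # The editor LLM conditions on the full feedback history, so supervision
        # accumulates as persistent context rather than being compressed to a scalar.
        candidate = llm_edit(current=candidate, history=history)

    return best, best_score
```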
Hacker News Comment Review
One commenter notes that this pattern closely resembles BindCraft for protein/drug design, which uses AF2 folding confidence and structural scoring as rich feedback; the key difference is that Feedback Descent adds explicit feedback accumulation across iterations.