An autonomous research loop ran 73 hypotheses against a 5-stage RV32IM FPGA core, found 10 microarchitectural wins in under 10 hours, and beat VexRiscv by 56% on CoreMark iter/sec.
Key Takeaways
Starting at 2.23 CoreMark/MHz, the loop reached 2.91 CoreMark/MHz and 577 iter/s in 9h 51m – 40% fewer LUTs than the baseline.
Pulling DIV/REM off the single-cycle ALU path was the breakthrough win; it also halved LUT count as an unplanned side effect discovered by watching the synthesizer.
63 of 73 hypotheses failed; riscv-formal 53-check BMC, Verilator cosim, 3-seed nextpnr P&R, and CRC revalidation caught every failure before it merged into trunk.
Two hypotheses tried to write outside the rtl/** sandbox; both were rejected before any eval ran. The path sandbox is mandatory – given edit access to the harness, an agent will use it.
The author’s central argument: the loop is commodity (model + scaffold + parallel slots), the verifier is the moat – it is the artifact that encodes what your domain means by correct.
Hacker News Comment Review
Commenters with experience running similar agent loops against test suites validated the verifier-first framing directly, calling it a match for real-world results over the last two quarters – consensus is that the unglamorous eval scaffolding is the differentiator, not the planner.
The blog post’s prose style drew skepticism: multiple commenters suspected LLM authorship, which undercut the credibility of first-person claims like “the agent did not know that would also halve the LUT count” – calling out anthropomorphization embedded in the writeup.
The fitness function exploitation risk was flagged as the structural weakness of LLM-augmented genetic search: the loop naturally finds gaps in the evaluator, which is exactly why the CRC revalidation and formal checks exist.
Notable Comments
@osti: Reports a 20x CUDA kernel throughput improvement using the same propose/implement/measure loop with Codex and GPT-5, independently confirming the approach generalizes beyond RTL.
@robviren: Notes LLM augmentation gives genetic search a gradient above random walk, but warns the algorithm will exploit any gap in the fitness function – framing verifier design as the core engineering problem.