The benchmark pits a browser-use vision agent against auto-generated HTTP endpoints on the same admin panel: 551k tokens and 17 minutes versus 12k tokens and 20 seconds.
Key Takeaways
The vision path (Claude Sonnet + browser-use 0.12) required a 14-step UI walkthrough in the prompt just to complete the task; without it, the agent never paginated and missed 3 of the 4 pending reviews.
Step count is set by the interface, not the model. Better vision models cut error rate per screenshot but not screenshot count.
Token variance on the vision path was extreme: 407k to 751k input tokens across 3 runs, making cost estimation from a single trial unreliable.
The API path (same Sonnet, same app handlers) ran essentially identically on every trial: 8 tool calls, ~12k tokens, and ±27 tokens of variance across 5 runs (a sketch of the pattern follows these takeaways).
Claude Haiku finished the API path in under 8 seconds and under 10k tokens; it could not complete the vision path at all because of browser-use 0.12 structured-output schema failures.
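To make the API path concrete, the sketch below shows the general shape: the same review handlers the admin panel calls, wrapped as JSON endpoints and also described to the model as tools. The framework (FastAPI), the endpoint names, and the in-memory handler bodies are assumptions for illustration; the benchmark does not publish its generated code.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Stand-in data; in the benchmark these endpoints would call the same
# handlers the admin panel already uses.
_REVIEWS = [{"id": i, "status": "pending", "text": f"review {i}"} for i in range(1, 5)]


class Decision(BaseModel):
    review_id: int
    approve: bool
    note: str = ""


@app.get("/api/reviews/pending")
def list_pending_reviews(page: int = 1, page_size: int = 50):
    # One JSON page replaces the screenshot-and-scroll loop of the vision path.
    pending = [r for r in _REVIEWS if r["status"] == "pending"]
    start = (page - 1) * page_size
    return {"reviews": pending[start : start + page_size], "page": page}


@app.post("/api/reviews/decide")
def decide_review(decision: Decision):
    # Reuses one approve/reject code path so business rules stay in one place.
    for r in _REVIEWS:
        if r["id"] == decision.review_id:
            r["status"] = "approved" if decision.approve else "rejected"
            return {"ok": True, "review": r}
    return {"ok": False, "error": "review not found"}


# Each endpoint is also described to the model as a tool (Anthropic tool-use
# schema shown here); the agent then works in plain JSON, no screenshots.
LIST_PENDING_TOOL = {
    "name": "list_pending_reviews",
    "description": "Return pending reviews as JSON, one page at a time.",
    "input_schema": {
        "type": "object",
        "properties": {
            "page": {"type": "integer", "minimum": 1},
            "page_size": {"type": "integer", "minimum": 1, "maximum": 200},
        },
        "required": ["page"],
    },
}
```

The reason the step count collapses is visible in the first endpoint: one tool call returns a whole page of structured rows, where the vision path needs a screenshot and a scroll per view.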
Hacker News Comment Review
Commenters broadly agreed the comparison is valid but argued vision agents should be a last resort for apps you do not control, not a default for internal tooling where an API or CLI is buildable.
Several builders noted that authoring the 14-step walkthrough is real prompt-engineering work that never appears in token counts, making the stated 45x gap an undercount of the true cost.
A recurring thread explored hybrid approaches: one agent maps the UI surface in a test environment and emits a structured workflow, which a second agent then executes via CLI or accessibility APIs rather than repeated screenshots.
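One way to picture that hybrid is sketched below, assuming the mapping agent emits a plain JSON step list that a deterministic executor (or a second, cheaper agent) replays. The step vocabulary and the admin-cli command are hypothetical, not an existing tool.

```python
import json
from dataclasses import dataclass


@dataclass
class Step:
    action: str   # e.g. "http_get", "http_post", "cli"
    target: str   # URL path or command line
    params: dict


# Workflow emitted once, in a test environment, by the exploratory mapping agent.
WORKFLOW_JSON = """
[
  {"action": "http_get", "target": "/api/reviews/pending", "params": {"page": 1}},
  {"action": "cli", "target": "admin-cli approve-review", "params": {"id": 3}}
]
"""


def run(workflow: list[Step]) -> None:
    # Dry-run executor: replays the mapped steps deterministically with no
    # screenshots in the loop. A real executor would call requests /
    # subprocess.run here instead of printing.
    for step in workflow:
        if step.action.startswith("http_"):
            print(f"{step.action.upper()} {step.target} params={step.params}")
        elif step.action == "cli":
            argv = step.target.split() + [f"--{k}={v}" for k, v in step.params.items()]
            print("RUN", " ".join(argv))
        else:
            raise ValueError(f"unknown action: {step.action}")


if __name__ == "__main__":
    steps = [Step(**s) for s in json.loads(WORKFLOW_JSON)]
    run(steps)
```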
Notable Comments
@theptip: “If anything I am impressed that it’s only 50x worse” – argues that computer use against database state you already own is categorically the wrong tool choice.
@angry_octet: Notes the benchmark implicitly documents a checklist for hardening UIs against agents – randomized labels, dynamic element positions, scroll traps – all already standard in corporate SaaS.