Show HN: Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks

· ai ai-agents devtools · Source ↗

TLDR

  • Forge is a Python reliability layer for self-hosted LLM tool-calling that lifts an 8B model from 53% to 99% on agentic tasks via rescue parsing, retry nudges, and VRAM-aware context management.

Key Takeaways

  • Best self-hosted config: Ministral-3 8B Instruct Q8 on llama-server scores 86.5% across 26 eval scenarios, 76% on the hardest tier.
  • Four usage modes: WorkflowRunner, SlotWorker (multi-agent shared GPU slot), guardrails middleware, and a drop-in OpenAI-compatible proxy.
  • The proxy injects a synthetic respond tool so small models stay in tool-calling mode; the tool call is stripped before returning to the client.
  • Supports Ollama, llama-server, Llamafile, and Anthropic backends; llama-server recommended for top eval performance.
  • Published as a peer-reviewed paper: Zambelli, A. Forge: A Reliability Layer for Self-Hosted LLM Tool-Calling (doi:10.1145/3786335.3813193).

Hacker News Comment Review

  • Commenters initially unclear on what “guardrails” meant here; the author clarified it catches malformed tool calls mid-workflow and injects corrective nudges rather than blocking content.
  • Core mechanism confirmed by author: if a model sends malformed JSON to a tool (e.g., a hotel booking API), Forge intercepts, nudges the model, and retries – but will raise a max-iterations error if the model stays stuck.

Notable Comments

  • @tommica: Asked for a plain-language definition of guardrails in this context – a useful proxy for how non-obvious the framing is to first-time readers.

Original | Discuss on HN