An update on GitHub availability

TLDR

  • GitHub’s CTO calls two recent incidents unacceptable and outlines a plan to scale capacity 30X, driven by agentic development workflows that have accelerated since December 2025.

Key Takeaways

  • Original 10X capacity plan (started October 2025) was revised to 30X by February 2026 as agentic workflows drove exponential growth in repo creation, PRs, API calls, and large-monorepo workloads.
  • The April 23 merge queue incident produced incorrect squash merges whenever a merge group contained two or more PRs, affecting 230 repos and 2,092 pull requests; no data was lost, but default branches required manual repair.
  • The April 27 Elasticsearch overload (likely caused by a botnet) broke search-backed PR, issue, and project views; Git and APIs stayed up, but Elasticsearch had not yet been isolated and remained a single point of failure for those views.
  • Remediation stack: webhooks moved out of MySQL, user session cache redesigned, auth/authz DB load reduced, critical services isolated, Ruby monolith hot paths migrated to Go, multi-cloud path started.
  • Priority order is now explicit: availability, then capacity, then new features; merge queue optimizations are a specific investment for repos with thousands of PRs per day.
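
One common way to keep an overloaded backend such as a search cluster from taking down the views that depend on it is a circuit breaker with a degraded fallback. The following is a minimal sketch of that general pattern, not GitHub's implementation; the class name, parameters, and thresholds are all hypothetical and chosen only for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Hypothetical breaker: stops calling a failing dependency for a cool-down period."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before retrying
        self.clock = clock                # injectable clock, useful in tests
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_after:
            # Cool-down elapsed: allow a trial call (half-open state).
            # A single further failure will reopen the breaker.
            self.opened_at = None
            self.failures = self.max_failures - 1
            return False
        return True

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        # While open, skip the dependency entirely and serve the fallback.
        if self.is_open():
            return fallback()
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0  # any success resets the failure count
        return result
```

In this sketch, `primary` would be the search-backed query and `fallback` a degraded path (for example, a cached or database-backed listing), so a search outage degrades one view instead of failing the whole page.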

Hacker News Comment Review

  • Commenters widely reject the framing of “two recent incidents” – the community has tracked dozens of outages since January 2026, with public dashboards and heatmaps as evidence, making the post read as damage control rather than a full accounting.
  • The multi-cloud announcement reads to technical commenters as an implicit admission that Azure alone cannot deliver the reliability GitHub needs, which is notable given the earlier directive to prioritize Azure migration over feature work.
  • Skepticism about execution runs high: the priorities stated in this post mirror those stated six months ago, before the Azure-first push, and PR list counts in the UI are already visibly inconsistent across repositories.

Notable Comments

  • @bartread: documents “dozens and dozens” of outages since the start of the year and points to community-built heatmaps that reached the HN front page as evidence that the scope goes far beyond two incidents.
  • @mijoharas: reads the “path to multi cloud” as Microsoft implicitly conceding that Azure cannot meet GitHub’s reliability bar – “interesting to hear it from microsoft themselves.”
  • @BlackFingolfin: reports live data inconsistency – PR tab shows 78 open, list view shows 35 – confirmed across multiple colleagues and repos, suggesting Elasticsearch fallout is still ongoing.
