Linux 7.0 Broke PostgreSQL: The Preemption Regression Explained

· systems

TLDR

  • Linux 7.0 removed PREEMPT_NONE and defaulted to PREEMPT_LAZY, causing PostgreSQL spinlock contention to explode when page faults preempt lock holders on high-core-count servers.

Key Takeaways

  • On a 96-vCPU Graviton4 with 1,024 clients, PostgreSQL throughput dropped from 98,565 to 50,751 TPS after upgrading to Linux 7.0.
  • 55% of CPU time was burned in s_lock inside StrategyGetBuffer – a single global spinlock protecting the shared buffer pool.
  • PREEMPT_LAZY can preempt a backend inside the kernel page-fault handler while it holds the spinlock; every other backend spins the entire time, multiplying the wasted CPU by the waiter count.
  • A 120 GB shared_buffers pool backed by 4 KB pages means ~31 million potential page faults; switching to 2 MB huge pages drops that to ~61,000, and 1 GB pages to ~120.
  • PostgreSQL’s huge_pages config should be set to on, not the default try: try silently falls back to 4 KB pages without any error or warning.
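The page-count arithmetic above can be checked directly (a quick sketch, reading 120 GB as GiB; this is illustrative arithmetic, not PostgreSQL code):

```python
# Pages needed to map a 120 GiB shared_buffers pool at each page size.
GiB = 1024 ** 3
shared_buffers = 120 * GiB

for label, page_size in [("4 KB", 4 * 1024),
                         ("2 MB", 2 * 1024 ** 2),
                         ("1 GB", GiB)]:
    pages = shared_buffers // page_size
    print(f"{label} pages: {pages:,}")
# 4 KB pages: 31,457,280   (~31 million)
# 2 MB pages: 61,440       (~61,000)
# 1 GB pages: 120
```

Each page is a potential page fault on first touch, which is why the page size directly bounds how often a backend can be caught faulting while holding the spinlock.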
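A minimal configuration sketch for forcing huge pages rather than hoping for them. The page count is an assumption derived from the 120 GB pool above; a real deployment needs extra headroom for PostgreSQL's other shared memory segments:

```shell
# Reserve 2 MB huge pages at the OS level (hypothetical sizing:
# 120 GiB / 2 MiB = 61,440 pages, plus a little headroom).
sysctl -w vm.nr_hugepages=62000

# postgresql.conf: 'on' makes startup fail loudly if huge pages are
# unavailable, instead of the default 'try', which silently falls
# back to 4 KB pages.
#   shared_buffers = 120GB
#   huge_pages = on
```

With huge_pages = on, a misconfigured reservation shows up as a startup error instead of a silent performance regression.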

Hacker News Comment Review

  • Technical accuracy is disputed on two fronts: the article conflates kernel preemptibility (what PREEMPT_* controls) with userspace thread scheduling, and misstates the proposed rseq-based fix – commenters say it is about time-slice extension, not preemption-detect-and-retry.
  • Real-world scope is contested: the benchmark used an unusually large 120 GB shared_buffers config on a 96-vCPU machine; no commenter cited production breakage, and several argued a synthetic edge case should not drive a kernel revert.
  • The PREEMPT_LAZY design choice is criticized on principle: page faults are invisible and unplannable, unlike long-running syscalls, making them a poor boundary for lazy preemption decisions.

Notable Comments

  • @singron: The rseq proposal is time-slice extension, not “detect preemption and restart” – the article’s explanation of the fix is wrong.
  • @fulafel: PREEMPT_* options govern kernel-path preemptibility, not userspace; the article’s framing of what changed is confused.
  • @ozgrakkurt: “It is a crime that postgres isn’t able to allocate with 1GB huge pages by changing a config parameter in 2026” – 4 KB page metadata alone exceeds 500 MB at 128 GB RAM.
