You can beat the binary search


TLDR

  • Daniel Lemire’s SIMD Quad algorithm beats binary search on sorted uint16 arrays (1-4096 elements) by combining quaternary interpolation search with SIMD block comparison.

Key Takeaways

  • SIMD Quad divides arrays into 16-element blocks, uses quaternary interpolation to find the target block, then checks all 16 values in parallel via NEON or SSE2.
  • On Intel Emerald Rapids (warm cache), SIMD Quad is 2x+ faster than std::binary_search; on Apple M4 (cold cache), also 2x+ faster.
  • The SIMD component drives most of the gains; the quaternary (quad) split adds measurable benefit on Intel for large cold-cache arrays but has minimal effect on Apple M4.
  • Algorithm targets Roaring Bitmap’s uint16 container search (arrays up to 4096 elements), a concrete production use case.
  • Source code provided: ARM path uses vceqq_u16/vmaxvq_u16; x64 path uses _mm_cmpeq_epi16/_mm_movemask_epi8.

Hacker News Comment Review

  • Commenters noted that quaternary search is conceptually binary search unrolled by one level; the real win is memory-level parallelism on modern CPUs, which rewards fetching multiple pivot points simultaneously.
  • Several commenters extended the idea: cache-line-aligned SIMD (32 x 16-bit per 64-byte line with AVX-512) could push gains further; exponential search is another practical speedup when query keys are themselves sorted.
  • Discussion surfaced that binary search is optimal only when data distribution is unknown; interpolation or heuristic search can vastly outperform it when distribution priors exist.

Notable Comments

  • @drkrab: Proposes searching an entire 64-byte cache line (32 x 16-bit integers) with a single 512-bit SIMD op, potentially 32x faster than naive binary search.
  • @lalitmaganti: Reports 8x real-world speedup using exponential search when both array contents and query sequence are sorted.
