Daniel Lemire’s SIMD Quad algorithm beats binary search on sorted uint16 arrays (1-4096 elements) by combining quaternary interpolation search with SIMD block comparison.
Key Takeaways
SIMD Quad divides arrays into 16-element blocks, uses quaternary interpolation to find the target block, then checks all 16 values in parallel via NEON or SSE2.
On Intel Emerald Rapids with warm caches, SIMD Quad is more than 2x faster than std::binary_search; on Apple M4 with cold caches it is likewise more than 2x faster.
The SIMD component drives most of the gains; the quaternary (quad) split adds a measurable benefit on Intel for large cold-cache arrays but has minimal effect on Apple M4.
Algorithm targets Roaring Bitmap’s uint16 container search (arrays up to 4096 elements), a concrete production use case.
Commenters noted quaternary search is conceptually loop-unrolling binary search by one level; the real win is that modern CPUs have memory-level parallelism that rewards fetching multiple pivot points simultaneously.
Several commenters extended the idea: cache-line-aligned SIMD (32 x 16-bit per 64-byte line with AVX-512) could push gains further; exponential search is another practical speedup when query keys are themselves sorted.
Discussion surfaced that binary search is optimal only when data distribution is unknown; interpolation or heuristic search can vastly outperform it when distribution priors exist.
Notable Comments
@drkrab: Proposes searching an entire 64-byte cache line (32 x 16-bit integers) with a single 512-bit SIMD op, potentially 32x faster than naive binary search.
@lalitmaganti: Reports 8x real-world speedup using exponential search when both array contents and query sequence are sorted.