LamBench tests 120 pure lambda calculus problems in Lamb, a minimal λ-encoding language; top frontier models score roughly 88-92% while open-weight models collapse below 55%.
Key Takeaways
gpt-5.4 leads at 91.7% (110/120), followed closely by opus-4.6 at 90.0% and gpt-5.3-codex at 89.2%.
A sharp tier break exists below sonnet-4.6 (82.5%): deepseek-v4-pro hits only 53.3%, grok-4.20 45.8%, gemma-4-31b-it 18.3%.
The benchmark's language, Lamb, is a minimal lambda calculus; every problem must be solved using only λ-encodings of data structures such as Church numerals and fold-encoded lists (see the sketch after this list).
gpt-5.5 scores 78.3%, lower than gpt-5.4’s 91.7%, suggesting newer version numbers do not guarantee higher reasoning capability on formal tasks.
All 120 problems are graded on a single one-shot attempt: no scaffolding, no retries.
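The post does not show Lamb's concrete syntax, so as an illustrative sketch, here is what λ-encoded data looks like transliterated into Python lambdas: numbers become iterated function application, and lists become their own fold.

```python
# Illustrative sketch only: classic lambda-calculus encodings rendered as
# Python lambdas, since Lamb's actual syntax is not shown in the post.

# Church numerals: the number n is "apply f to x, n times".
ZERO = lambda f: lambda x: x
SUCC = lambda n: lambda f: lambda x: f(n(f)(x))
ADD  = lambda m: lambda n: lambda f: lambda x: m(f)(n(f)(x))

# Fold-encoded lists: a list *is* its own right fold.
NIL  = lambda c: lambda n: n
CONS = lambda h: lambda t: lambda c: lambda n: c(h)(t(c)(n))

def to_int(church):
    """Decode a Church numeral into a Python int, for inspection only."""
    return church(lambda k: k + 1)(0)

TWO   = SUCC(SUCC(ZERO))
THREE = SUCC(TWO)
assert to_int(ADD(TWO)(THREE)) == 5

# Summing a list is just folding it with ADD and ZERO:
xs = CONS(TWO)(CONS(THREE)(NIL))
assert to_int(xs(ADD)(ZERO)) == 5
```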
Hacker News Comment Review
The top-labs-vs-everyone-else gap is stark: frontier models cluster in the 88-92% range while Chinese and smaller open-weight models fall to 20-55%, undercutting the recurring "opus killer" marketing cycles.
LamBench allows a single attempt per problem; commenters flag this as methodologically weak for stochastic models: pass@k (at 5, 15, or 45 samples) would give a more honest picture of the score distribution.
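For context, the pass@k the commenters propose has a standard unbiased estimator, popularized by the HumanEval paper: given n sampled completions of which c pass, estimate the probability that at least one of k random draws passes. A minimal sketch, not part of LamBench's actual harness:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021, HumanEval):
    probability that at least one of k completions drawn without
    replacement from n samples (c of which passed) is correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 45 samples, 9 passing, scored at k = 5, 15, 45.
for k in (5, 15, 45):
    print(k, round(pass_at_k(45, 9, k), 3))
```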
The universal FFT failure has a structural explanation: Cooley-Tukey assumes O(1) indexed access into mutable arrays, but in pure lambda calculus a Church-encoded list lookup costs O(N), so the O(N log N) algorithm degrades to O(N² log N) or worse.
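To make that cost concrete, here is a hypothetical sketch of indexed access over a pair-encoded λ-list, again in Python lambdas: reaching element i forces i tail applications, where an array lookup would be constant time.

```python
# Hypothetical sketch: pair-encoded lists, where CONS is just a pair and
# TAIL peels exactly one cell. The Python loop in nth() stands in for what
# would itself be a Church-numeral iteration in pure lambda calculus.
PAIR = lambda a: lambda b: lambda s: s(a)(b)
HEAD = lambda p: p(lambda a: lambda b: a)
TAIL = lambda p: p(lambda a: lambda b: b)
NIL  = lambda s: s  # terminator; never inspected in this example

def nth(lst, i):
    """Element i costs i TAIL applications: O(N) per lookup, versus the
    O(1) array indexing that Cooley-Tukey's butterfly loop assumes."""
    for _ in range(i):
        lst = TAIL(lst)
    return HEAD(lst)

# Elements shown as plain ints for readability; in Lamb they would be
# λ-encoded as well.
xs = PAIR(10)(PAIR(20)(PAIR(30)(NIL)))
assert nth(xs, 2) == 30
```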
Notable Comments
@the_data_nerd: FFT fails because every Church-numeral index lookup is O(N); models have no structurally similar reference to imitate, since publicly available FFT code assumes mutable arrays.
@jakeinsdca: “codex 5.5 is worse than 5.4 but 10x faster” – highlights the speed-accuracy tradeoff hiding inside model versioning.