Boosting multimodal inference performance by >10% with a single Python dict
Modal profiled SGLang’s scheduler on a multimodal workload and shipped a one-dict fix that yielded 16% more throughput and 10% lower latency—merged in SGLang v0.5.10.
What Matters
- py-spy flamegraphs on a live SGLang process revealed `process_input_requests` consuming ~13% of total scheduler CPU time.
- The culprit: `_new_shared_cuda` called on every tensor crossing process boundaries via CUDA IPC, reopening the same pool handles repeatedly.
- A Python dict caching pool handles eliminates redundant PyTorch `StorageImpl` allocation, CUDA event recording, and GIL interaction per scheduler iteration.
- Benchmark on Qwen2.5-VL-3B-Instruct, single H100: throughput 22.2→25.7 req/s, TTFT mean 965→838 ms, TPOT mean 72→60 ms.
- Decode latency (TPOT −17.2%) improved despite the fix being in the prefill path—SGLang’s single-threaded scheduler means any slowdown anywhere delays all dispatches.
- Cache invalidation is unnecessary because pools are never reallocated; writes are rare, so the lock only guards writes.
- Enable via a flag when using CUDA IPC; applies to any multimodal model on SGLang, benefits scale with multimodal input volume. PR: sgl-project/sglang#21418.