Boosting multimodal inference performance by >10% with a single Python dict
Modal profiled SGLang’s scheduler on a multimodal workload and shipped a one-dict fix that yielded 16% more throughput and 10% lower latency—merged in SGLang v0.5.10.
What Matters
- py-spy flamegraphs on a live SGLang process revealed `process_input_requests` consuming ~13% of total scheduler CPU time.
- The culprit: `_new_shared_cuda` called on every tensor crossing process boundaries via CUDA IPC, reopening the same pool handles repeatedly.
- A Python dict caching pool handles eliminates redundant PyTorch `StorageImpl` allocation, CUDA event recording, and GIL interaction per scheduler iteration.
- Benchmark on Qwen2.5-VL-3B-Instruct, single H100: throughput 22.2→25.7 req/s, TTFT mean 965→838 ms, TPOT mean 72→60 ms.
- Decode latency (TPOT −17.2%) improved despite the fix being in the prefill path—SGLang’s single-threaded scheduler means any slowdown anywhere delays all dispatches.
- Cache invalidation is unnecessary because pools are never reallocated; writes are rare, so the lock only guards writes.
- Enable via a flag when using CUDA IPC; applies to any multimodal model on SGLang, benefits scale with multimodal input volume. PR: sgl-project/sglang#21418.