Boosting multimodal inference performance by >10% with a single Python dict


Modal profiled SGLang’s scheduler on a multimodal workload and shipped a one-dict fix that yielded 16% more throughput and 10% lower latency—merged in SGLang v0.5.10.

What Matters

  • py-spy flamegraphs on a live SGLang process revealed process_input_requests consuming ~13% of total scheduler CPU time (a profiling sketch follows this list).
  • The culprit: _new_shared_cuda called on every tensor crossing process boundaries via CUDA IPC, reopening the same pool handles repeatedly.
  • A Python dict caching pool handles eliminates the redundant PyTorch StorageImpl allocation, CUDA event recording, and GIL interaction on every scheduler iteration (a minimal sketch of the pattern follows this list).
  • Benchmark on Qwen2.5-VL-3B-Instruct, single H100: throughput 22.2→25.7 req/s, TTFT mean 965→838ms, TPOT mean 72→60ms.
  • Decode latency (TPOT −17.2%) improved despite the fix being in the prefill path—SGLang’s single-threaded scheduler means any slowdown anywhere delays all dispatches.
  • Cache invalidation is unnecessary because pools are never reallocated; writes are rare, so a lock guards only the write path and reads stay lock-free.
  • Enable via a flag when using CUDA IPC; the fix applies to any multimodal model on SGLang, and benefits scale with multimodal input volume. PR: sgl-project/sglang#21418.
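How the profile was captured, as a minimal sketch: attach py-spy to the running scheduler process and record a flamegraph. py-spy's record subcommand and its --pid/--output/--duration flags are real; the PID, duration, and output path below are illustrative, and in practice py-spy is usually invoked directly from a shell (it may also need ptrace permissions to attach).

```python
# Attach py-spy to the live SGLang scheduler process and record a flamegraph.
# The PID (12345) and the 60 s sampling window are placeholders.
import subprocess

subprocess.run(
    [
        "py-spy", "record",
        "--pid", "12345",                        # PID of the live scheduler process
        "--output", "scheduler_flamegraph.svg",  # flamegraph SVG to inspect
        "--duration", "60",                      # sample for 60 seconds, then exit
    ],
    check=True,  # raise if py-spy fails (e.g., insufficient ptrace permissions)
)
```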
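And the caching pattern itself, as a minimal sketch: a module-level dict memoizes opened pool handles, reads take no lock, and a lock guards only the rare first-time write (double-checked after acquiring). All names here are hypothetical, not SGLang's actual identifiers; the real change is in sgl-project/sglang#21418.

```python
import threading

_pool_handle_cache: dict[bytes, object] = {}  # IPC handle bytes -> opened pool
_cache_write_lock = threading.Lock()

def get_cuda_ipc_pool(handle: bytes):
    """Return the opened pool for an IPC handle, opening it at most once.

    Reads are lock-free: a plain dict lookup is atomic under the GIL, and
    because pools are never reallocated, no invalidation is ever needed.
    """
    pool = _pool_handle_cache.get(handle)
    if pool is None:
        with _cache_write_lock:                    # writes are rare; lock only here
            pool = _pool_handle_cache.get(handle)  # re-check after acquiring
            if pool is None:
                pool = _open_pool_from_handle(handle)
                _pool_handle_cache[handle] = pool
    return pool

def _open_pool_from_handle(handle: bytes):
    # Stand-in for the expensive path the cache avoids: in PyTorch this is
    # _new_shared_cuda, which allocates a StorageImpl and records CUDA events
    # on every call. Here we just return a fresh sentinel object per handle.
    return object()
```

The lock exists only so two threads can't open the same handle twice; since a handle maps to the same pool for the life of the process, the lock-free read path can never observe a stale entry.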
