The fastest Linux timestamps


TLDR

  • Bypass the vDSO by reading the TSC directly in C++ to cut timestamp overhead by 30-50%, at the cost of subtle correctness risks most codebases cannot afford.

Key Takeaways

  • Naive use of system_clock::now() + two steady_clock::now() calls costs 46-49 ns per span, consuming the entire latency budget for a sub-100 ns tracing client.
  • vDSO avoids kernel ring transitions but still does a full TSC read, cycle-to-nanosecond conversion (multiply + shift), and seqlock retry loop per call.
  • Replacing monotonic clock calls with raw rdtscp and caching the mult/shift multiplier from the vDSO data page drops elapsed-time cost significantly; no division instruction needed.
  • An invariant TSC (constant_tsc + nonstop_tsc in /proc/cpuinfo) is required: it runs at a constant rate across cores and C/P/T-states, making it safe as a cross-core clock source.
  • The data page layout changed in Linux 6.15, so any custom vDSO reader must be versioned against kernel headers.
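The hot-path technique above can be sketched as follows. This is a minimal illustration, not the article's implementation: since the vDSO data page layout is kernel-internal (and changed in 6.15), this sketch calibrates its own cycles-to-nanoseconds mult/shift pair against CLOCK_MONOTONIC at startup instead of reading the kernel's values. All names (`calibrate`, `elapsed_ns`, `kShift`) are hypothetical.

```cpp
#include <cstdint>
#include <ctime>
#include <x86intrin.h>  // __rdtscp (x86-only; assumes an invariant TSC)

static uint64_t g_mult;           // cycles -> ns multiplier, fixed-point
static constexpr int kShift = 32; // fixed-point fraction bits

static uint64_t now_ns_monotonic() {
    timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return uint64_t(ts.tv_sec) * 1'000'000'000 + ts.tv_nsec;
}

static inline uint64_t rdtscp_cycles() {
    unsigned aux;  // IA32_TSC_AUX (encodes the core); unused here
    return __rdtscp(&aux);
}

// One-time calibration over a short window: the division happens
// here, never on the hot path.
void calibrate() {
    uint64_t c0 = rdtscp_cycles(), n0 = now_ns_monotonic();
    timespec d{0, 10'000'000};  // ~10 ms sample window
    nanosleep(&d, nullptr);
    uint64_t c1 = rdtscp_cycles(), n1 = now_ns_monotonic();
    g_mult = uint64_t((__uint128_t(n1 - n0) << kShift) / (c1 - c0));
}

// Hot path: one TSC read per timestamp, then multiply + shift only.
inline uint64_t elapsed_ns(uint64_t start_cycles, uint64_t end_cycles) {
    return uint64_t((__uint128_t(end_cycles - start_cycles) * g_mult) >> kShift);
}
```

This mirrors the kernel's own `ns = (cycles * mult) >> shift` conversion; the trade-off is that a self-calibrated multiplier drifts relative to NTP-disciplined time, which is part of why the article calls the approach risky.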

Hacker News Comment Review

  • Commenters broadly agree raw TSC is faster but flag a real correctness trap: RDTSC alone does not respect x86 memory ordering, so a load-acquire/store-release pair across threads can still observe time going backwards even with invariant TSC.
  • A recurring theme is deferring the cycle-to-nanosecond conversion entirely: emit raw TSC values into the log stream alongside a single reference clock-adjustment entry, then convert offline during decoding, eliminating the multiply/shift cost at capture time.
  • Several commenters question the architectural premise: tracing 1-10 µs pipeline stages with OpenTelemetry spans may be the wrong tool entirely; a sampling profiler fits better when latency is that tight.
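The deferred-conversion scheme commenters describe can be sketched like this: the hot path records only the raw TSC value, a periodic sync record pairs one TSC reading with a wall-clock timestamp, and the decoder maps cycles to nanoseconds offline. The types and function names (`LogEntry`, `SyncRecord`, `decode_ns`) are illustrative, not from the article.

```cpp
#include <cstdint>
#include <vector>

struct LogEntry   { uint64_t tsc; uint32_t event_id; };
struct SyncRecord { uint64_t tsc; uint64_t wall_ns; };  // reference point

// Hot path: no multiply/shift at capture time, just the raw counter.
inline void log_event(std::vector<LogEntry>& log, uint64_t tsc, uint32_t id) {
    log.push_back({tsc, id});
}

// Offline decoder: given one sync record and the TSC frequency in Hz,
// convert a raw TSC value to wall-clock nanoseconds.
uint64_t decode_ns(const SyncRecord& sync, uint64_t tsc, double tsc_hz) {
    double cycles = double(tsc) - double(sync.tsc);
    return sync.wall_ns + uint64_t(cycles * 1e9 / tsc_hz);
}
```

Emitting a fresh sync record periodically (rather than once) also lets the decoder correct for clock-rate drift between capture and wall time, which a single capture-time conversion cannot.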

Notable Comments

  • @amluto: warns that thread 2 can observe an earlier timestamp than thread 1 after a load-acquire/store-release pair because RDTSC does not participate in x86 memory ordering.
  • @Veserv: proposes emitting clock-adjustment log entries instead of converting TSC to nanoseconds at capture time, pushing all conversion cost to the decoder.
  • @jeffbee: “I can beat this by not trying to wrap a trace span around something that only takes 100ns.”
