Bypass the vDSO by reading the TSC directly in C++ to cut timestamp overhead by 30-50%, at the cost of subtle correctness risks most codebases cannot afford.
Key Takeaways
Naive use of system_clock::now() + two steady_clock::now() calls costs 46-49 ns per span, consuming the entire latency budget for a sub-100 ns tracing client.
The vDSO avoids a kernel ring transition, but each call still performs a full TSC read, a cycle-to-nanosecond conversion (multiply + shift), and a seqlock retry loop.
Replacing monotonic clock calls with raw rdtscp and caching the mult/shift pair from the vDSO data page drops elapsed-time cost significantly; no division instruction is needed.
Invariant TSC (constant_tsc + nonstop_tsc in /proc/cpuinfo) is required: it runs at a constant rate across cores and C/P/T-states, making it safe as a cross-core clock source.
The data page layout changed in Linux 6.15, so any custom vDSO reader must be versioned against kernel headers.
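The multiply-and-shift conversion described above can be sketched as follows. This is a minimal illustration, not the article's code: TscScale, make_scale, cycles_to_ns, and read_tsc are hypothetical names, and the multiplier is derived from an assumed known TSC frequency rather than read from the vDSO data page, whose layout is kernel-version specific. It also assumes a 64-bit GCC or Clang target, since it relies on the unsigned __int128 extension.

```cpp
#include <cstdint>

#if defined(__x86_64__)
#include <x86intrin.h>  // __rdtscp intrinsic (GCC/Clang)
#endif

// Hypothetical holder for the cached scaling pair.
struct TscScale {
    std::uint64_t mult;   // fixed-point nanoseconds per cycle
    std::uint32_t shift;  // number of fractional bits in mult
};

// Compute mult once so that ns ~= (cycles * mult) >> shift.
// Assumption: tsc_hz is known (e.g. measured at startup); the article
// instead caches the pair the kernel publishes in the vDSO data page.
inline TscScale make_scale(std::uint64_t tsc_hz, std::uint32_t shift = 32) {
    const unsigned __int128 ns_per_sec = 1000000000ULL;
    // The only division happens here, once, outside the hot path.
    return TscScale{static_cast<std::uint64_t>((ns_per_sec << shift) / tsc_hz),
                    shift};
}

// Hot path: one widening multiply and one shift, no division.
inline std::uint64_t cycles_to_ns(std::uint64_t cycles, const TscScale& s) {
    const unsigned __int128 wide =
        static_cast<unsigned __int128>(cycles) * s.mult;
    return static_cast<std::uint64_t>(wide >> s.shift);
}

#if defined(__x86_64__)
// Raw counter read. rdtscp waits for earlier instructions to execute,
// but it is not a full serializing instruction.
inline std::uint64_t read_tsc() {
    unsigned int aux;  // receives IA32_TSC_AUX (typically the CPU id); unused
    return __rdtscp(&aux);
}
#endif
```

An elapsed-time measurement would then take two read_tsc() values around the span and pass their difference to cycles_to_ns with a scale computed once at startup. The 128-bit intermediate avoids overflow for large cycle deltas at the cost of a wider multiply.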
Hacker News Comment Review
Commenters broadly agree raw TSC is faster but flag a real correctness trap: RDTSC alone does not respect x86 memory ordering, so even with invariant TSC, two threads synchronized by a load-acquire/store-release pair can still observe time going backwards.
A recurring theme is deferring the cycle-to-nanosecond conversion entirely: emit raw TSC values into the log stream alongside a single reference clock-adjustment entry, then convert offline during decoding, eliminating the multiply/shift cost at capture time.
Several commenters question the architectural premise: tracing 1-10 µs pipeline stages with OpenTelemetry spans may be the wrong tool entirely; a sampling profiler fits better when latency is that tight.
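The deferred-conversion idea from the comments can be sketched on the decoder side as follows. This is an assumption-laden illustration rather than any commenter's implementation: ClockAnchor and decode_ns are hypothetical names, and the sketch uses two clock-adjustment entries so the decoder can derive the TSC rate by itself, whereas a single entry would also need the rate recorded in the log.

```cpp
#include <cstdint>

// Hypothetical clock-adjustment log entry: a raw TSC value paired with
// the reference-clock time (in nanoseconds) captured at the same moment.
struct ClockAnchor {
    std::uint64_t tsc;
    std::uint64_t wall_ns;
};

// Offline decode: map a raw TSC value from the log onto the reference
// clock by linear interpolation (or extrapolation) between two anchors.
// Precondition: a0.tsc != a1.tsc. Assumes a GCC/Clang 64-bit target for
// the __int128 extension, which keeps the intermediate product exact.
inline std::uint64_t decode_ns(std::uint64_t raw_tsc,
                               const ClockAnchor& a0,
                               const ClockAnchor& a1) {
    const __int128 d_tsc = static_cast<__int128>(a1.tsc) - a0.tsc;
    const __int128 d_ns = static_cast<__int128>(a1.wall_ns) - a0.wall_ns;
    const __int128 offset = static_cast<__int128>(raw_tsc) - a0.tsc;
    // Division is acceptable here: it runs in the decoder, not at capture.
    return static_cast<std::uint64_t>(
        static_cast<__int128>(a0.wall_ns) + offset * d_ns / d_tsc);
}
```

With this split, the capture path only stores the 64-bit counter value, and the multiply, shift, or division cost is paid once per entry when the log is read.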
Notable Comments
@amluto: warns that thread 2 can observe an earlier timestamp than thread 1 after a load-acquire/store-release pair because RDTSC does not participate in x86 memory ordering.
@Veserv: proposes emitting clock-adjustment log entries instead of converting TSC to nanoseconds at capture time, pushing all conversion cost to the decoder.
@jeffbee: “I can beat this by not trying to wrap a trace span around something that only takes 100ns.”