yapsnap transcribes any yt-dlp-supported video URL or local audio file to plaintext using a CPU-only streaming Zipformer transducer, no GPU or API required.
Key Takeaways
Uses Kroko English INT8 ONNX model (~80 MB, cached after first run) via sherpa-onnx; runs at several times realtime on a laptop CPU.
Default 1.5x atempo speedup (pitch-preserved) cuts transcription time by ~33%; adjustable with --speed.
Sentence-level [MM:SS] timestamps scale back to original-audio time even at elevated speed factors.
Single Python module with three deps: sherpa-onnx, numpy, yt-dlp. ffmpeg required on PATH for decode.
English-only by default; other sherpa-onnx streaming transducers loadable via --model or KROKO_MODEL env var.