Transcribing long audio has always been a trade-off: fast systems were inaccurate, accurate systems were slow. Insanely Fast Whisper — a community-built CLI tool — has fundamentally changed that equation. Using a combination of Flash Attention 2, intelligent batching, and optimizations from the Hugging Face Transformers ecosystem, it can transcribe 150 minutes (2.5 hours) of audio in under 98 seconds on an NVIDIA A100 80GB GPU.
The Performance Numbers Are Staggering
The project’s benchmarks tell the story clearly. On an NVIDIA A100 80GB:
- Whisper Large v3 (Transformers, fp32): ~31 minutes for 150 minutes of audio
- Whisper Large v3 (fp16 + batch size 24 + BetterTransformer): ~5 minutes
- Whisper Large v3 (fp16 + batch size 24 + Flash Attention 2): 1 minute 38 seconds
- Distil-Whisper Large v2 (fp16 + batch size 24 + Flash Attention 2): 1 minute 18 seconds
To put that in perspective: a transcription job that previously took over half an hour of computation can now be done during a coffee break.
How It Works
Flash Attention 2
The key optimization is Flash Attention 2, a hardware-aware attention algorithm developed by Tri Dao. Standard attention materializes the full attention matrix, giving O(N²) memory complexity in sequence length — meaning long transcriptions consume GPU memory at a quadratically accelerating rate. Flash Attention restructures the computation into on-chip tiles, bringing memory to O(N) while actually increasing compute utilization on modern GPU architectures like the A100 and H100.
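To see why quadratic memory bites, here is a back-of-envelope calculation. The numbers are my own arithmetic, not project benchmarks: I assume Whisper large's 20 attention heads and its encoder's 1,500-frame sequence for a 30-second window.

```python
# Standard attention materializes an N x N score matrix per head per layer;
# Flash Attention streams the same computation in tiles and never stores it.
def attn_score_bytes(seq_len, heads=20, bytes_per_el=2):  # fp16 = 2 bytes
    """Memory for one layer's attention score matrices under standard attention."""
    return seq_len * seq_len * heads * bytes_per_el

# Whisper encoder: 30 s of audio -> 1,500 frames -> ~90 MB per layer.
print(attn_score_bytes(1500) / 1e6, "MB")
# 4x the sequence length -> 16x the memory (~1,440 MB per layer).
print(attn_score_bytes(6000) / 1e6, "MB")
```

Flash Attention sidesteps this entirely: the score matrix is never written to GPU memory, so the O(N²) term disappears.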
Batch Processing with BetterTransformer
When Flash Attention 2 is not available, the tool can fall back to Hugging Face's BetterTransformer API, which optimizes the model's attention kernels for inference — the benchmarks above list it as an alternative configuration. In either case, the audio is split into 30-second chunks and processed with a batch size of 24 (24 chunks concurrently), keeping the GPU fed at maximum throughput rather than sitting idle between chunks.
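Under the hood, the CLI is essentially a wrapper around the Transformers speech-recognition pipeline. A minimal sketch of that setup follows — the arguments are from the Transformers pipeline API, but the exact internals of insanely-fast-whisper may differ:

```python
def build_asr_pipeline(device="cuda:0", use_flash=True):
    """Sketch of the Transformers ASR pipeline the CLI wraps
    (an assumed reconstruction, not the tool's exact code)."""
    import torch
    from transformers import pipeline

    return pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-large-v3",
        torch_dtype=torch.float16,  # fp16: half the memory, ~2x throughput
        device=device,              # "cuda:0" on NVIDIA, "mps" on Apple Silicon
        # Flash Attention 2 needs a compatible NVIDIA GPU plus the flash-attn
        # package; drop model_kwargs to use the default attention implementation.
        model_kwargs={"attn_implementation": "flash_attention_2"} if use_flash else {},
    )
```

Transcription then batches 30-second chunks at call time, e.g. `build_asr_pipeline()("audio.mp3", chunk_length_s=30, batch_size=24)`.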
fp16 Precision
Running in half-precision (fp16) floating point cuts memory usage and increases throughput by roughly 2x with negligible accuracy loss for transcription tasks — a well-established result in the ML community.
Using the CLI
Installation is a single command (requires pipx or pip):
pipx install insanely-fast-whisper==0.0.15 --force
Or with pip:
pip install insanely-fast-whisper --ignore-requires-python
Transcribe a file:
insanely-fast-whisper --file-name path/to/audio.mp3
Use Whisper Large v3 with Flash Attention:
insanely-fast-whisper --file-name audio.wav --flash True
Use the lighter Distil-Whisper model:
insanely-fast-whisper --model-name distil-whisper/large-v2 --file-name audio.wav
On macOS with Apple Silicon, add --device-id mps to use the GPU via Metal.
Originally a Benchmark Project
Insanely Fast Whisper started not as a product but as a benchmark showcase for Hugging Face Transformers, built to demonstrate exactly how much performance was left on the table by running Whisper in its default configuration. The CLI was added as the community began requesting a way to actually use those optimizations in practice — and it quickly grew into a production-grade tool used by podcasters, researchers, and enterprises alike.
Supports Apple Silicon (MPS)
Unlike many GPU-optimized tools, Insanely Fast Whisper works on both NVIDIA GPUs and Apple Silicon Macs via the Metal Performance Shaders (MPS) backend. While performance on a Mac M-series chip won’t match an A100, it opens the door to fast local transcription without cloud compute costs — a significant advantage for privacy-conscious users handling sensitive recordings.
Why It Matters
The bottleneck in audio intelligence has shifted. When transcribing 2.5 hours of audio took 31 minutes, large-scale transcription pipelines were expensive and slow. At under 98 seconds — or as little as 78 seconds with Distil-Whisper — the economics of processing large audio corpora change entirely. Podcasters can auto-transcribe entire seasons overnight. Legal firms can process thousands of hours of depositions before a deadline. Researchers can transcribe and search archival audio collections that would have been impractical to analyze manually. The speed barrier has been effectively eliminated.
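The speed-up and real-time factors fall straight out of the benchmark numbers above:

```python
# Rough speed-up arithmetic from the A100 benchmarks quoted earlier.
audio_seconds = 150 * 60        # 150 minutes of input audio
fp32_seconds = 31 * 60          # baseline: ~31 min (Transformers, fp32)
flash_seconds = 98              # fp16 + batch size 24 + Flash Attention 2

speedup = fp32_seconds / flash_seconds          # ~19x over the fp32 baseline
realtime_factor = audio_seconds / flash_seconds  # ~92x faster than real time

print(f"~{speedup:.0f}x faster than fp32, ~{realtime_factor:.0f}x real-time")
```

At roughly 92x real time, an hour of audio costs well under a minute of GPU time, which is what makes overnight batch transcription of entire archives practical.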