Transcribing long audio has always been a trade-off: fast systems were inaccurate, accurate systems were slow. Insanely Fast Whisper — a community-built CLI tool — has fundamentally changed that equation. Using a combination of Flash Attention 2, intelligent batching, and optimizations from the Hugging Face Transformers ecosystem, it can transcribe 150 minutes (2.5 hours) of audio in under 98 seconds on an NVIDIA A100 80GB GPU.
The Performance Numbers Are Staggering
The project’s benchmarks tell the story clearly. On an NVIDIA A100 80GB:
- Whisper Large v3 (Transformers, fp32): ~31 minutes for 150 minutes of audio
- Whisper Large v3 (fp16 + batch size 24 + BetterTransformer): ~5 minutes
- Whisper Large v3 (fp16 + batch size 24 + Flash Attention 2): 1 minute 38 seconds
- Distil-Whisper Large v2 (fp16 + batch size 24 + Flash Attention 2): 1 minute 18 seconds
To put that in perspective: a transcription job that previously took over half an hour of computation can now be done during a coffee break.
How It Works
Flash Attention 2
The key optimization is Flash Attention 2, a hardware-aware attention algorithm developed by Tri Dao. Standard attention materializes the full attention matrix, giving O(N²) memory complexity in sequence length — meaning long transcriptions consume GPU memory at a quadratically accelerating rate. Flash Attention restructures the computation into on-chip tiles, bringing memory to O(N) while actually increasing compute utilization on modern GPU architectures like the A100 and H100.
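To see why quadratic memory bites, here is a back-of-envelope calculation. The numbers are my own arithmetic, not project benchmarks: I assume Whisper large's 20 attention heads and its encoder's 1,500-frame sequence for a 30-second window.

```python
# Standard attention materializes an N x N score matrix per head per layer;
# Flash Attention streams the same computation in tiles and never stores it.
def attn_score_bytes(seq_len, heads=20, bytes_per_el=2):  # fp16 = 2 bytes
    """Memory for one layer's attention score matrices under standard attention."""
    return seq_len * seq_len * heads * bytes_per_el

# Whisper encoder: 30 s of audio -> 1,500 frames -> ~90 MB per layer.
print(attn_score_bytes(1500) / 1e6, "MB")
# 4x the sequence length -> 16x the memory (~1,440 MB per layer).
print(attn_score_bytes(6000) / 1e6, "MB")
```

Flash Attention sidesteps this entirely: the score matrix is never written to GPU memory, so the O(N²) term disappears.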
Batch Processing with BetterTransformer
When Flash Attention 2 is not available, the tool can fall back to Hugging Face's BetterTransformer API, which optimizes the model's attention kernels for inference — the benchmarks above list it as an alternative configuration. In either case, the audio is split into 30-second chunks and processed with a batch size of 24 (24 chunks concurrently), keeping the GPU fed at maximum throughput rather than sitting idle between chunks.
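Under the hood, the CLI is essentially a wrapper around the Transformers speech-recognition pipeline. A minimal sketch of that setup follows — the arguments are from the Transformers pipeline API, but the exact internals of insanely-fast-whisper may differ:

```python
def build_asr_pipeline(device="cuda:0", use_flash=True):
    """Sketch of the Transformers ASR pipeline the CLI wraps
    (an assumed reconstruction, not the tool's exact code)."""
    import torch
    from transformers import pipeline

    return pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-large-v3",
        torch_dtype=torch.float16,  # fp16: half the memory, ~2x throughput
        device=device,              # "cuda:0" on NVIDIA, "mps" on Apple Silicon
        # Flash Attention 2 needs a compatible NVIDIA GPU plus the flash-attn
        # package; drop model_kwargs to use the default attention implementation.
        model_kwargs={"attn_implementation": "flash_attention_2"} if use_flash else {},
    )
```

Transcription then batches 30-second chunks at call time, e.g. `build_asr_pipeline()("audio.mp3", chunk_length_s=30, batch_size=24)`.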
fp16 Precision
Running in half-precision (fp16) floating point cuts memory usage and increases throughput by roughly 2x with negligible accuracy loss for transcription tasks — a well-established result in the ML community.
Using the CLI
Installation is a single command (requires pipx or pip):
pipx install insanely-fast-whisper==0.0.15 --force
Or with pip:
pip install insanely-fast-whisper --ignore-requires-python
Transcribe a file:
insanely-fast-whisper --file-name path/to/audio.mp3
Use Whisper Large v3 with Flash Attention:
insanely-fast-whisper --file-name audio.wav --flash True
Use the lighter Distil-Whisper model:
insanely-fast-whisper --model-name distil-whisper/large-v2 --file-name audio.wav
On macOS with Apple Silicon, add --device-id mps to use the GPU via Metal.
Originally a Benchmark Project
Insanely Fast Whisper started not as a product but as a benchmark showcase for Hugging Face Transformers, built to demonstrate exactly how much performance was left on the table by running Whisper in its default configuration. The CLI was added as the community began requesting a way to actually use those optimizations in practice — and it quickly grew into a production-grade tool used by podcasters, researchers, and enterprises alike.
Supports Apple Silicon (MPS)
Unlike many GPU-optimized tools, Insanely Fast Whisper works on both NVIDIA GPUs and Apple Silicon Macs via the Metal Performance Shaders (MPS) backend. While performance on a Mac M-series chip won’t match an A100, it opens the door to fast local transcription without cloud compute costs — a significant advantage for privacy-conscious users handling sensitive recordings.
Why It Matters
The bottleneck in audio intelligence has shifted. When transcribing 2.5 hours of audio took 31 minutes, large-scale transcription pipelines were expensive and slow. At under 98 seconds — or as little as 78 seconds with Distil-Whisper — the economics of processing large audio corpora change entirely. Podcasters can auto-transcribe entire seasons overnight. Legal firms can process thousands of hours of depositions before a deadline. Researchers can transcribe and search archival audio collections that would have been impractical to analyze manually. The speed barrier has been effectively eliminated.
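The speed-up and real-time factors fall straight out of the benchmark numbers above:

```python
# Rough speed-up arithmetic from the A100 benchmarks quoted earlier.
audio_seconds = 150 * 60        # 150 minutes of input audio
fp32_seconds = 31 * 60          # baseline: ~31 min (Transformers, fp32)
flash_seconds = 98              # fp16 + batch size 24 + Flash Attention 2

speedup = fp32_seconds / flash_seconds          # ~19x over the fp32 baseline
realtime_factor = audio_seconds / flash_seconds  # ~92x faster than real time

print(f"~{speedup:.0f}x faster than fp32, ~{realtime_factor:.0f}x real-time")
```

At roughly 92x real time, an hour of audio costs well under a minute of GPU time, which is what makes overnight batch transcription of entire archives practical.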