AudioUtils

Best Audio Format for Speech-to-Text Transcription

Choose the right audio format for transcription services and speech-to-text engines. Covers WAV, MP3, FLAC, and bitrate recommendations for maximum accuracy.

Audio format affects transcription accuracy more than most people expect. The codec, bitrate, sample rate, channel count, and noise floor all influence how well speech-to-text engines can recover words from the waveform. This guide covers what every major transcription service actually wants, why, and how to prep audio for maximum accuracy.

The Headline Answer

For best accuracy: mono WAV at 16 kHz 16-bit, denoised, with -16 to -23 LUFS integrated loudness. For convenience: mono MP3 at 128 kbps is acceptable on every modern engine and saves 90% of the storage. Stereo, high sample rates, and lossless beyond 16-bit mono add cost without measurable accuracy gain on speech.

What Each Major Service Prefers

  • OpenAI Whisper (and whisper.cpp, Whisper API): WAV / MP3 / FLAC / OGG / M4A all work. The model resamples everything to 16 kHz mono internally before feature extraction. File size cap on the API: 25 MB. For long audio, chunk into 10-20 minute segments at 16 kHz mono MP3 128 kbps to stay under the cap.
  • Google Speech-to-Text v2: FLAC and LINEAR16 (raw PCM 16-bit) recommended for highest accuracy. MP3 and OGG accepted. Sample rate 8-48 kHz; the documented sweet spot is 16 kHz, with 8 kHz telephony audio handled by the phone-call models. Mono required for diarization.
  • Amazon Transcribe: WAV, MP3, MP4, FLAC, AMR, OGG, WebM, M4A. 8-48 kHz. Per-job file size cap 2 GB or 4 hours, whichever is lower. Recommends FLAC for medical / legal where accuracy matters.
  • Microsoft Azure Speech: WAV PCM 16-bit mono 16 kHz is the documented optimum. MP3, OGG Opus, FLAC, ALAW also accepted via the GStreamer pipeline.
  • Rev.ai: MP3, MP4, WAV, M4A, FLAC, OGG, WMA. Accuracy is essentially identical across formats above 96 kbps; below that, lossy artifacts cost a few percent WER.
  • Otter.ai: MP3, AAC, M4A, MP4, MOV, WAV. Internal pipeline resamples to 16 kHz mono. Web upload cap typically 1.9 GB / 4 hours per file.
  • AssemblyAI: All common formats. Recommends 16 kHz+ for best results; below 8 kHz quality degrades sharply.
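The Whisper API's 25 MB cap above makes chunk length a simple bitrate calculation. This sketch works out the longest chunk that fits under the cap at a given constant bitrate; the 10% headroom factor is an assumption, not anything the API documents:

```python
# Sketch: pick a chunk length that keeps constant-bitrate MP3 segments
# under the Whisper API's 25 MB cap. Pure arithmetic; the 10% headroom
# is our own safety margin, not an API requirement.

def max_chunk_minutes(bitrate_kbps: int, cap_mb: float = 25.0, headroom: float = 0.9) -> float:
    """Longest chunk (in minutes) that stays under cap_mb at the given bitrate."""
    bytes_per_second = bitrate_kbps * 1000 / 8          # kbps -> bytes/s
    cap_bytes = cap_mb * 1_000_000 * headroom           # leave 10% headroom
    return cap_bytes / bytes_per_second / 60

print(round(max_chunk_minutes(128), 1))  # ~23.4 min per chunk at 128 kbps
```

At 128 kbps mono that lands comfortably above the article's conservative 10-20 minute recommendation, which also leaves room for MP3 container overhead.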

Why 16 kHz Is the Sweet Spot

Human speech contains intelligible information up to about 8 kHz — sibilants ('s', 'sh', 'f') sit between 4 and 8 kHz, and consonants like 'th' depend on energy near 6 kHz. The Nyquist sampling theorem says you need a sample rate of at least 2x the highest frequency you want to capture, so 16 kHz (capturing up to 8 kHz) covers every phoneme that matters for transcription. Going higher (44.1 or 48 kHz) captures music-band frequencies that speech engines discard immediately during preprocessing. The engine downsamples internally regardless, so giving it 16 kHz directly is slightly more efficient and produces no accuracy loss.

Going lower is dangerous. 8 kHz audio (telephone bandwidth) cuts off at 4 kHz, losing the 4-8 kHz range where fricatives live. Word error rate on 8 kHz audio is typically 2-5x higher than on 16 kHz audio for the same speaker.
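Before uploading, it is worth confirming a WAV file already matches the 16 kHz mono 16-bit target. A minimal pre-flight check using only Python's stdlib wave module (the file name is a placeholder):

```python
# Pre-flight check with the stdlib wave module: is this WAV already
# 16 kHz mono 16-bit? The path passed in is whatever file you plan
# to upload.
import wave

def is_transcription_ready(path: str) -> bool:
    with wave.open(path, "rb") as w:
        return (w.getframerate() == 16000
                and w.getnchannels() == 1
                and w.getsampwidth() == 2)   # 2 bytes per sample = 16-bit
```

If the check fails, resample and downmix before uploading rather than letting the service do it blind.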

Mono vs Stereo: Mono Wins

For single-speaker content, mono is unambiguously better. Stereo doubles the file size, and most engines just sum to mono internally. Some services charge by 'audio minute' regardless of channel count, but a stereo file still uploads twice the bytes for the same content.

For multi-speaker content with a separate mic per speaker (interview where each lavalier records to its own track), you can preserve speaker separation by transcribing each channel as a separate mono file and merging the timestamps. This produces cleaner diarization than letting the engine guess from a stereo mix.
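Splitting a two-track interview recording into per-speaker files needs nothing beyond the stdlib. A sketch for 16-bit stereo WAV input, where samples are interleaved left-right (function and file names are illustrative):

```python
# Sketch: split a two-channel 16-bit WAV into one mono file per
# speaker, so each track can be transcribed separately for cleaner
# diarization. Assumes 16-bit PCM stereo input; interleaved frames
# are L0 R0 L1 R1 ... at 4 bytes per stereo frame.
import wave

def split_stereo(src: str, left_out: str, right_out: str) -> None:
    with wave.open(src, "rb") as w:
        assert w.getnchannels() == 2 and w.getsampwidth() == 2
        rate = w.getframerate()
        frames = w.readframes(w.getnframes())
    left = b"".join(frames[i:i + 2] for i in range(0, len(frames), 4))
    right = b"".join(frames[i + 2:i + 4] for i in range(0, len(frames), 4))
    for path, data in ((left_out, left), (right_out, right)):
        with wave.open(path, "wb") as out:
            out.setnchannels(1); out.setsampwidth(2); out.setframerate(rate)
            out.writeframes(data)
```

Each output file then goes to the engine as an ordinary mono upload, and the timestamps from the two transcripts can be merged afterwards.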

Lossy vs Lossless: Where It Actually Matters

For clean studio audio at 96 kbps MP3 or higher, no major engine shows measurable accuracy loss versus FLAC or WAV. Below 96 kbps, fricatives start to fall apart and accuracy drops by roughly 1-3 percentage points with each step down in bitrate.

The accuracy gap shows up in three specific cases:

1. Heavy-accent or non-native English speakers — every dB of detail matters; FLAC or WAV gives the model the cleanest signal.
2. Multiple overlapping speakers — lossy compression artifacts confuse diarization and word-boundary detection.
3. Low-volume background speech — quiet voices in meeting recordings; lossy encoding tends to throw away the quietest content.

For all three cases, transcribe from FLAC or WAV. For solo voiceover, podcast interviews mic'd cleanly, or call-center recordings, MP3 128 kbps mono is fine.

File Size Limits Across Services

  • Whisper API: 25 MB / file
  • Google Speech-to-Text sync: 60 seconds; async (batch): up to 480 minutes
  • Amazon Transcribe: 2 GB or 4 hours per job
  • Otter.ai: ~1.9 GB / 4 hours web upload
  • Rev.ai: 4 GB / 17 hours

For files near a service's cap, mono MP3 128 kbps gives the best size/accuracy ratio: a 2-hour recording is 110 MB, comfortably under any limit.
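The size estimate above is straight constant-bitrate arithmetic, sketched here so you can plug in your own duration and bitrate:

```python
# Back-of-envelope size for a constant-bitrate encode:
# size_bytes = bitrate (bits/s) * duration (s) / 8.

def encoded_size_mb(bitrate_kbps: int, minutes: float) -> float:
    """Approximate encoded size in decimal megabytes (ignores container overhead)."""
    return bitrate_kbps * 1000 * minutes * 60 / 8 / 1_000_000

print(round(encoded_size_mb(128, 120)))  # 2 hours at 128 kbps -> ~115 MB
```

That is 115 decimal MB, or about 110 binary MiB — the figure quoted above; either way it sits far below every cap in the list.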

Denoising Before Transcription

Denoising before transcription reliably improves accuracy on noisy source material. RNNoise (used in Krisp, NVIDIA Broadcast, and FFmpeg's 'arnndn' filter), iZotope RX, or Adobe Podcast Enhance reduce HVAC hum, keyboard clatter, and reverb. The accuracy improvement is largest on conference-room recordings, smallest on close-mic'd studio voice.

Do not over-denoise. Aggressive noise reduction creates artifacts that confuse the model. A gentle pass that drops the noise floor by 6-10 dB is usually optimal.
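A gentle RNNoise pass can be run through FFmpeg's arnndn filter. The sketch below only assembles the command rather than executing it; the file names and "model.rnnn" are placeholders, and actually running it requires an FFmpeg build with the arnndn filter plus an RNNoise model file:

```python
# Sketch: build an FFmpeg command for a denoise pass with the arnndn
# (RNNoise) filter. Assembled only, not executed; "model.rnnn" and the
# input/output names are placeholders you must supply.

def denoise_cmd(src: str, dst: str, model: str = "model.rnnn") -> list[str]:
    return ["ffmpeg", "-i", src, "-af", f"arnndn=m={model}", dst]

# e.g. subprocess.run(denoise_cmd("raw.wav", "clean.wav"), check=True)
```

Listen to the output before transcribing: if speech sounds underwater or robotic, the pass was too aggressive for the model to benefit.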

Loudness Targets

Aim for -16 to -23 LUFS integrated loudness with peaks below -1 dBTP. Audio that is too quiet sits close to the noise floor, so the engine amplifies noise along with speech; clipped audio destroys the consonants the model relies on. Use a normalization step in your DAW or FFmpeg's 'loudnorm' filter to hit the target before sending to transcription.

Recommended Conversion Workflow

1. Record at 48 kHz 24-bit WAV in your interface.
2. Edit, denoise, and normalize to -20 LUFS in your DAW.
3. Export mono 16 kHz 16-bit WAV for highest accuracy, or mono 16 kHz 128 kbps MP3 for a 90% size reduction with no measurable accuracy loss.
4. Use MP3 to WAV, AAC to WAV, or OGG to WAV on AudioUtils to convert any source format into the transcription-friendly target.
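Steps 2 and 3 can also be collapsed into a single FFmpeg invocation. As before, this sketch only builds the argument list; the file names are placeholders and FFmpeg must be installed to actually run it:

```python
# Sketch: one FFmpeg command covering normalize + downmix + resample.
# Assembled only, not executed; file names are placeholders.

def prep_for_transcription(src: str, dst: str) -> list[str]:
    return [
        "ffmpeg", "-i", src,
        "-af", "loudnorm=I=-20:TP=-1",   # integrated -20 LUFS, -1 dBTP ceiling
        "-ac", "1",                      # downmix to mono
        "-ar", "16000",                  # resample to 16 kHz
        "-sample_fmt", "s16",            # 16-bit PCM
        dst,
    ]

# e.g. subprocess.run(prep_for_transcription("interview.m4a", "interview_16k.wav"), check=True)
```

A dedicated DAW pass gives finer control, but for batch processing this single command hits the same targets.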

For the underlying bitrate concepts, see audio bitrate explained. For sample rate fundamentals, see sample rate explained. For broader format choices, see audio quality settings explained.