AudioUtils

Best Audio Format for Speech-to-Text Transcription

Choose the right audio format for transcription services and speech-to-text engines. Covers WAV, MP3, FLAC, and bitrate recommendations for maximum accuracy.

Audio format affects transcription accuracy more than most people expect. The codec, bitrate, sample rate, channel count, and noise floor all influence how well speech-to-text engines can recover words from the waveform. This guide covers what every major transcription service actually wants, why, and how to prep audio for maximum accuracy.

The Headline Answer

For best accuracy: mono WAV at 16 kHz 16-bit, denoised, with -16 to -23 LUFS integrated loudness. For convenience: mono MP3 at 128 kbps is acceptable on every modern engine and saves 90% of the storage. Stereo, high sample rates, and lossless beyond 16-bit mono add cost without measurable accuracy gain on speech.

What Each Major Service Prefers

  • OpenAI Whisper (and whisper.cpp, Whisper API): WAV / MP3 / FLAC / OGG / M4A all work. The model resamples everything to 16 kHz mono internally before feature extraction. File size cap on the API: 25 MB. For long audio, chunk into 10-20 minute segments at 16 kHz mono MP3 128 kbps to stay under the cap.
  • Google Speech-to-Text v2: FLAC and LINEAR16 (raw PCM 16-bit) recommended for highest accuracy. MP3 and OGG accepted. Sample rate 8-48 kHz; the documented sweet spot is 16 kHz, with 8 kHz telephony audio handled by the phone-call models. Mono required for diarization.
  • Amazon Transcribe: WAV, MP3, MP4, FLAC, AMR, OGG, WebM, M4A. 8-48 kHz. Per-job file size cap 2 GB or 4 hours, whichever is lower. Recommends FLAC for medical / legal where accuracy matters.
  • Microsoft Azure Speech: WAV PCM 16-bit mono 16 kHz is the documented optimum. MP3, OGG Opus, FLAC, ALAW also accepted via the GStreamer pipeline.
  • Rev.ai: MP3, MP4, WAV, M4A, FLAC, OGG, WMA. Accuracy is essentially identical across formats above 96 kbps; below that, lossy artifacts cost a few percent WER.
  • Otter.ai: MP3, AAC, M4A, MP4, MOV, WAV. Internal pipeline resamples to 16 kHz mono. Web upload cap typically 1.9 GB / 4 hours per file.
  • AssemblyAI: All common formats. Recommends 16 kHz+ for best results; below 8 kHz quality degrades sharply.
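The Whisper API's 25 MB cap above makes chunk length a simple bitrate calculation. This sketch works out the longest chunk that fits under the cap at a given constant bitrate; the 10% headroom factor is an assumption, not anything the API documents:

```python
# Sketch: pick a chunk length that keeps constant-bitrate MP3 segments
# under the Whisper API's 25 MB cap. Pure arithmetic; the 10% headroom
# is our own safety margin, not an API requirement.

def max_chunk_minutes(bitrate_kbps: int, cap_mb: float = 25.0, headroom: float = 0.9) -> float:
    """Longest chunk (in minutes) that stays under cap_mb at the given bitrate."""
    bytes_per_second = bitrate_kbps * 1000 / 8          # kbps -> bytes/s
    cap_bytes = cap_mb * 1_000_000 * headroom           # leave 10% headroom
    return cap_bytes / bytes_per_second / 60

print(round(max_chunk_minutes(128), 1))  # ~23.4 min per chunk at 128 kbps
```

At 128 kbps mono that lands comfortably above the article's conservative 10-20 minute recommendation, which also leaves room for MP3 container overhead.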

Why 16 kHz Is the Sweet Spot

Human speech contains intelligible information up to about 8 kHz — sibilants ('s', 'sh', 'f') sit between 4 and 8 kHz, and consonants like 'th' depend on energy near 6 kHz. The Nyquist sampling theorem says you need a sample rate of at least 2x the highest frequency you want to capture, so 16 kHz (capturing up to 8 kHz) covers every phoneme that matters for transcription. Going higher (44.1 or 48 kHz) captures music-band frequencies that speech engines discard immediately during preprocessing. The engine downsamples internally regardless, so giving it 16 kHz directly is slightly more efficient and produces no accuracy loss.

Going lower is dangerous. 8 kHz audio (telephone bandwidth) cuts off at 4 kHz, losing the 4-8 kHz range where fricatives live. Word error rate on 8 kHz audio is typically 2-5x higher than on 16 kHz audio for the same speaker.
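Before uploading, it is worth confirming a WAV file already matches the 16 kHz mono 16-bit target. A minimal pre-flight check using only Python's stdlib wave module (the file name is a placeholder):

```python
# Pre-flight check with the stdlib wave module: is this WAV already
# 16 kHz mono 16-bit? The path passed in is whatever file you plan
# to upload.
import wave

def is_transcription_ready(path: str) -> bool:
    with wave.open(path, "rb") as w:
        return (w.getframerate() == 16000
                and w.getnchannels() == 1
                and w.getsampwidth() == 2)   # 2 bytes per sample = 16-bit
```

If the check fails, resample and downmix before uploading rather than letting the service do it blind.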

Mono vs Stereo: Mono Wins

For single-speaker content, mono is unambiguously better. Stereo doubles the file size, and most engines just sum to mono internally. Some services charge by 'audio minute' regardless of channel count, but a stereo file still uploads twice the bytes for the same content.

For multi-speaker content with a separate mic per speaker (interview where each lavalier records to its own track), you can preserve speaker separation by transcribing each channel as a separate mono file and merging the timestamps. This produces cleaner diarization than letting the engine guess from a stereo mix.
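Splitting a two-track interview recording into per-speaker files needs nothing beyond the stdlib. A sketch for 16-bit stereo WAV input, where samples are interleaved left-right (function and file names are illustrative):

```python
# Sketch: split a two-channel 16-bit WAV into one mono file per
# speaker, so each track can be transcribed separately for cleaner
# diarization. Assumes 16-bit PCM stereo input; interleaved frames
# are L0 R0 L1 R1 ... at 4 bytes per stereo frame.
import wave

def split_stereo(src: str, left_out: str, right_out: str) -> None:
    with wave.open(src, "rb") as w:
        assert w.getnchannels() == 2 and w.getsampwidth() == 2
        rate = w.getframerate()
        frames = w.readframes(w.getnframes())
    left = b"".join(frames[i:i + 2] for i in range(0, len(frames), 4))
    right = b"".join(frames[i + 2:i + 4] for i in range(0, len(frames), 4))
    for path, data in ((left_out, left), (right_out, right)):
        with wave.open(path, "wb") as out:
            out.setnchannels(1); out.setsampwidth(2); out.setframerate(rate)
            out.writeframes(data)
```

Each output file then goes to the engine as an ordinary mono upload, and the timestamps from the two transcripts can be merged afterwards.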

Lossy vs Lossless: Where It Actually Matters

For clean studio audio at 96 kbps MP3 or higher, no major engine shows measurable accuracy loss versus FLAC or WAV. Below 96 kbps, fricatives start to fall apart and accuracy drops by roughly 1-3 percentage points with each step down in bitrate.

The accuracy gap shows up in three specific cases:

1. Heavy-accent or non-native English speakers — every dB of detail matters; FLAC or WAV gives the model the cleanest signal.
2. Multiple overlapping speakers — lossy compression artifacts confuse diarization and word-boundary detection.
3. Low-volume background speech — quiet voices in meeting recordings; lossy encoding tends to throw away the quietest content.

For all three cases, transcribe from FLAC or WAV. For solo voiceover, podcast interviews mic'd cleanly, or call-center recordings, MP3 128 kbps mono is fine.

File Size Limits Across Services

  • Whisper API: 25 MB / file
  • Google Speech-to-Text sync: 60 seconds; async (batch): up to 480 minutes
  • Amazon Transcribe: 2 GB or 4 hours per job
  • Otter.ai: ~1.9 GB / 4 hours web upload
  • Rev.ai: 4 GB / 17 hours

For files near a service's cap, mono MP3 128 kbps gives the best size/accuracy ratio: a 2-hour recording is 110 MB, comfortably under any limit.
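The size estimate above is straight constant-bitrate arithmetic, sketched here so you can plug in your own duration and bitrate:

```python
# Back-of-envelope size for a constant-bitrate encode:
# size_bytes = bitrate (bits/s) * duration (s) / 8.

def encoded_size_mb(bitrate_kbps: int, minutes: float) -> float:
    """Approximate encoded size in decimal megabytes (ignores container overhead)."""
    return bitrate_kbps * 1000 * minutes * 60 / 8 / 1_000_000

print(round(encoded_size_mb(128, 120)))  # 2 hours at 128 kbps -> ~115 MB
```

That is 115 decimal MB, or about 110 binary MiB — the figure quoted above; either way it sits far below every cap in the list.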

Denoising Before Transcription

Denoising before transcription reliably improves accuracy on noisy source material. RNNoise (used in Krisp, NVIDIA Broadcast, and FFmpeg's 'arnndn' filter), iZotope RX, or Adobe Podcast Enhance reduce HVAC hum, keyboard clatter, and reverb. The accuracy improvement is largest on conference-room recordings, smallest on close-mic'd studio voice.

Do not over-denoise. Aggressive noise reduction creates artifacts that confuse the model. A gentle pass that drops the noise floor by 6-10 dB is usually optimal.
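A gentle RNNoise pass can be run through FFmpeg's arnndn filter. The sketch below only assembles the command rather than executing it; the file names and "model.rnnn" are placeholders, and actually running it requires an FFmpeg build with the arnndn filter plus an RNNoise model file:

```python
# Sketch: build an FFmpeg command for a denoise pass with the arnndn
# (RNNoise) filter. Assembled only, not executed; "model.rnnn" and the
# input/output names are placeholders you must supply.

def denoise_cmd(src: str, dst: str, model: str = "model.rnnn") -> list[str]:
    return ["ffmpeg", "-i", src, "-af", f"arnndn=m={model}", dst]

# e.g. subprocess.run(denoise_cmd("raw.wav", "clean.wav"), check=True)
```

Listen to the output before transcribing: if speech sounds underwater or robotic, the pass was too aggressive for the model to benefit.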

Loudness Targets

Aim for -16 to -23 LUFS integrated loudness with peaks below -1 dBTP. Audio that is too quiet sits close to the noise floor, so the engine amplifies noise along with speech; clipped audio destroys the consonants the model relies on. Use a normalization step in your DAW or FFmpeg's 'loudnorm' filter to hit the target before sending to transcription.

Recommended Conversion Workflow

1. Record at 48 kHz 24-bit WAV in your interface.
2. Edit, denoise, and normalize to -20 LUFS in your DAW.
3. Export mono 16 kHz 16-bit WAV for highest accuracy, or mono 16 kHz 128 kbps MP3 for a 90% size reduction with no measurable accuracy loss.
4. Use MP3 to WAV, AAC to WAV, or OGG to WAV on AudioUtils to convert any source format into the transcription-friendly target.
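Steps 2 and 3 can also be collapsed into a single FFmpeg invocation. As before, this sketch only builds the argument list; the file names are placeholders and FFmpeg must be installed to actually run it:

```python
# Sketch: one FFmpeg command covering normalize + downmix + resample.
# Assembled only, not executed; file names are placeholders.

def prep_for_transcription(src: str, dst: str) -> list[str]:
    return [
        "ffmpeg", "-i", src,
        "-af", "loudnorm=I=-20:TP=-1",   # integrated -20 LUFS, -1 dBTP ceiling
        "-ac", "1",                      # downmix to mono
        "-ar", "16000",                  # resample to 16 kHz
        "-sample_fmt", "s16",            # 16-bit PCM
        dst,
    ]

# e.g. subprocess.run(prep_for_transcription("interview.m4a", "interview_16k.wav"), check=True)
```

A dedicated DAW pass gives finer control, but for batch processing this single command hits the same targets.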

For the underlying bitrate concepts, see audio bitrate explained. For sample rate fundamentals, see sample rate explained. For broader format choices, see audio quality settings explained.