How to transcribe podcast audio (WhisperX with speaker diarization)
Note: sometimes WhisperX is WAAYYYY too slow, so I often end up using https://github.com/ggerganov/whisper.cpp, which somehow runs much faster.
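If you go the whisper.cpp route, the quickstart is roughly this (a sketch based on its README, so the exact steps may have drifted; note it won't give you the pyannote-style speaker diarization that WhisperX does):

```bash
# sketch: whisper.cpp quickstart (check the repo README for current steps)
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make                                          # build the ./main binary
bash ./models/download-ggml-model.sh base.en  # fetch a small English model
# whisper.cpp expects 16-bit 16 kHz wav input
./main -m models/ggml-base.en.bin -f YOUR_AUDIO_FILE.wav
```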
I do a lot of podcast transcription work and needed it again today. The Hugging Face Spaces (like this one: https://huggingface.co/spaces/vumichien/whisper-speaker-diarization) always error out, so they aren't very useful.
This is the one that worked for me.
Note: if you run into a 'soundfile' backend is not available error, run
conda install -c conda-forge libsndfile
to fix it.
- make sure you have the .wav for your podcast audio. You can use QuickTime or Audacity to convert it (or see the ffmpeg one-liner after this list); this process doesn't work for mp3.
- install WhisperX:
pip3 install git+https://github.com/m-bain/whisperx.git
this will take a couple of minutes. meanwhile…
- Read https://github.com/m-bain/whisperX#voice-activity-detection-filtering--diarization. To enable VAD filtering and diarization, include your Hugging Face access token (you can generate one at https://huggingface.co/settings/tokens) after the --hf_token argument, and accept the user agreements for the following models: Segmentation, Voice Activity Detection (VAD), and Speaker Diarization. Make sure to accept them all in your Hugging Face account.
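If you'd rather convert on the command line than open QuickTime or Audacity, ffmpeg can do it too (a sketch, assuming you have ffmpeg installed; 16 kHz mono is a safe target for Whisper-family tools):

```bash
# sketch: mp3 -> 16 kHz mono wav before running whisperx
ffmpeg -i YOUR_AUDIO_FILE.mp3 -ar 16000 -ac 1 YOUR_AUDIO_FILE.wav
```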
whisperx YOUR_AUDIO_FILE.wav --hf_token YOUR_HF_TOKEN_HERE --vad_filter --diarize --min_speakers 3 --max_speakers 3 --language en
for 3 speakers in English. Remember: it must be a .wav file.
It transcribes at roughly real time (about 30 seconds to transcribe 30 seconds of audio), so be prepared for it to take about as long as your podcast runs.
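If you have a backlog of episodes, a simple shell loop will queue them all up (a sketch, assuming every .wav in the folder is a 3-speaker English episode like the example above):

```bash
# sketch: run whisperx over every .wav in the current folder with the same flags
for f in *.wav; do
  whisperx "$f" --hf_token YOUR_HF_TOKEN_HERE --vad_filter --diarize \
    --min_speakers 3 --max_speakers 3 --language en
done
```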