How to transcribe podcast audio (WhisperX with speaker diarization)
Note: sometimes WhisperX is WAAYYYY too slow, so I often end up using https://github.com/ggerganov/whisper.cpp, which somehow runs much faster.
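If you go the whisper.cpp route, the quickstart is roughly this (a sketch based on its README, so the exact steps may have drifted; note it won't give you the pyannote-style speaker diarization that WhisperX does):

```bash
# sketch: whisper.cpp quickstart (check the repo README for current steps)
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make                                          # build the ./main binary
bash ./models/download-ggml-model.sh base.en  # fetch a small English model
# whisper.cpp expects 16-bit 16 kHz wav input
./main -m models/ggml-base.en.bin -f YOUR_AUDIO_FILE.wav
```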
I do a lot of podcast transcription work and needed it again today. The Hugging Face Spaces (like this one: https://huggingface.co/spaces/vumichien/whisper-speaker-diarization) always error out, so they aren't very useful.
This is the one that worked for me.
Note: if you run into a 'soundfile' backend is not available error, run
conda install -c conda-forge libsndfile
to fix it.
- make sure you have the .wav for your podcast audio. You can use QuickTime or Audacity to convert it (or see the ffmpeg one-liner after this list); this process doesn't work for mp3.
- install WhisperX:
pip3 install git+https://github.com/m-bain/whisperx.git
this will take a couple of minutes. meanwhile…
- Read https://github.com/m-bain/whisperX#voice-activity-detection-filtering--diarization. To enable VAD filtering and diarization, include your Hugging Face access token (you can generate one at https://huggingface.co/settings/tokens) after the --hf_token argument, and accept the user agreements for the following models: Segmentation, Voice Activity Detection (VAD), and Speaker Diarization. Make sure to accept them all in your Hugging Face account.
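If you'd rather convert on the command line than open QuickTime or Audacity, ffmpeg can do it too (a sketch, assuming you have ffmpeg installed; 16 kHz mono is a safe target for Whisper-family tools):

```bash
# sketch: mp3 -> 16 kHz mono wav before running whisperx
ffmpeg -i YOUR_AUDIO_FILE.mp3 -ar 16000 -ac 1 YOUR_AUDIO_FILE.wav
```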
whisperx YOUR_AUDIO_FILE.wav --hf_token YOUR_HF_TOKEN_HERE --vad_filter --diarize --min_speakers 3 --max_speakers 3 --language en
for 3 speakers in English. Remember: it must be a .wav file.
It transcribes at roughly real time (about 30 seconds to transcribe 30 seconds of audio), so be prepared for it to take about as long as your podcast runs.
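If you have a backlog of episodes, a simple shell loop will queue them all up (a sketch, assuming every .wav in the folder is a 3-speaker English episode like the example above):

```bash
# sketch: run whisperx over every .wav in the current folder with the same flags
for f in *.wav; do
  whisperx "$f" --hf_token YOUR_HF_TOKEN_HERE --vad_filter --diarize \
    --min_speakers 3 --max_speakers 3 --language en
done
```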