Whisper automatic speech recognition sample

This example showcases inference of speech recognition Whisper models. The application deliberately has few configuration options, to encourage the reader to explore and modify the source code, for example, to change the inference device to GPU. The sample features ov::genai::WhisperPipeline and uses an audio file in WAV format as the input source.

Download and convert the model and tokenizers

The --upgrade-strategy eager option is needed to ensure optimum-intel is upgraded to the latest version.

Install the dependencies from ../../export-requirements.txt to convert a model.

pip install --upgrade-strategy eager -r ../../export-requirements.txt
optimum-cli export openvino --trust-remote-code --model openai/whisper-base whisper-base

If NPU is the inference device, an additional option --disable-stateful is required. See NPU with OpenVINO GenAI for details.
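For example, the export command above with that flag added would look like this:

optimum-cli export openvino --trust-remote-code --model openai/whisper-base --disable-stateful whisper-base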

Prepare audio file

Download example audio file: https://storage.openvinotoolkit.org/models_contrib/speech/2021.2/librispeech_s5/how_are_you_doing_today.wav

Alternatively, you can use the recorder.py script, which records 5 seconds of audio from the microphone.

To install the PyAudio dependency, follow the installation instructions.
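On most platforms PyAudio can typically be installed from PyPI (system prerequisites such as PortAudio may be required; refer to the installation instructions if this fails):

pip install PyAudio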

To run the script:

python recorder.py

Run the Whisper model

Install the dependencies from deployment-requirements.txt via pip install -r ../../deployment-requirements.txt and then run the sample:

python whisper_speech_recognition.py whisper-base how_are_you_doing_today.wav

Output:

 How are you doing today?
timestamps: [0, 2] text: How are you doing today?

See SUPPORTED_MODELS.md for the list of supported models.

Whisper pipeline usage

import openvino_genai
import librosa

def read_wav(filepath):
    raw_speech, samplerate = librosa.load(filepath, sr=16000)
    return raw_speech.tolist()

model_dir = "whisper-base"  # path to the converted model
pipe = openvino_genai.WhisperPipeline(model_dir, "CPU")  # "GPU" can be used as well
# Pipeline expects normalized audio with a sample rate of 16 kHz
raw_speech = read_wav('how_are_you_doing_today.wav')
result = pipe.generate(raw_speech)
print(result)
# How are you doing today?

Transcription

Whisper pipeline predicts the language of the source audio automatically.

raw_speech = read_wav('how_are_you_doing_today.wav')
result = pipe.generate(raw_speech)
# How are you doing today?

raw_speech = read_wav('fr_sample.wav')
result = pipe.generate(raw_speech)
# Il s'agit d'une entité très complexe qui consiste...

If the source audio language is known in advance, it can be specified as an argument to the generate method:

raw_speech = read_wav("how_are_you_doing_today.wav")
result = pipe.generate(raw_speech, language="<|en|>")
# How are you doing today?

raw_speech = read_wav("fr_sample.wav")
result = pipe.generate(raw_speech, language="<|fr|>")
# Il s'agit d'une entité très complexe qui consiste...

Translation

By default, Whisper performs the task of speech transcription, where the source audio language is the same as the target text language. To perform speech translation, where the target text is in English, set the task to "translate":

raw_speech = read_wav("fr_sample.wav")
result = pipe.generate(raw_speech, task="translate")
# It is a very complex entity that consists...

Timestamps prediction

The model can predict timestamps. For sentence-level timestamps, pass the return_timestamps argument:

raw_speech = read_wav("how_are_you_doing_today.wav")
result = pipe.generate(raw_speech, return_timestamps=True)

for chunk in result.chunks:
    print(f"timestamps: [{chunk.start_ts:.2f}, {chunk.end_ts:.2f}] text: {chunk.text}")
# timestamps: [0.00, 2.00] text: How are you doing today?

Long-form audio transcription

The Whisper model is designed to work on audio samples of up to 30 seconds in duration. The Whisper pipeline uses a sequential chunking algorithm to transcribe audio samples of arbitrary length. The sequential chunking algorithm uses a "sliding window", transcribing 30-second slices one after the other.
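In practice, a long recording can simply be passed to generate as usual, since the chunking happens inside the pipeline. A minimal sketch (the file name below is only a placeholder):

raw_speech = read_wav("long_recording.wav")  # placeholder file name
result = pipe.generate(raw_speech, return_timestamps=True)

for chunk in result.chunks:
    print(f"timestamps: [{chunk.start_ts:.2f}, {chunk.end_ts:.2f}] text: {chunk.text}")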

Initial prompt and hotwords

Whisper pipeline has initial_prompt and hotwords generate arguments:

  • initial_prompt: initial prompt tokens passed as a previous transcription (after the <|startofprev|> token) to the first processing window
  • hotwords: hotwords tokens passed as a previous transcription (after the <|startofprev|> token) to all processing windows

The Whisper model can use that context to better understand the speech and maintain a consistent writing style. However, prompts do not need to be genuine transcripts from prior audio segments. Such prompts can be used to steer the model to use particular spellings or styles:

result = pipe.generate(raw_speech)
# He has gone and gone for good answered Paul Icrom who...

result = pipe.generate(raw_speech, initial_prompt="Polychrome")
# He has gone and gone for good answered Polychrome who...
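
Similarly, hotwords can be passed via the generate argument named above to bias every processing window; a sketch:

result = pipe.generate(raw_speech, hotwords="Polychrome")
# biases the transcription toward the spelling "Polychrome" in all windows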

Troubleshooting

Empty or rubbish output

Example output:

----------------

To resolve this, ensure that the audio data has a 16 kHz sampling rate. You can use the provided recorder.py script to record audio, or use FFmpeg to convert the audio to the required format.
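For example, a typical FFmpeg invocation to resample an input file (name is a placeholder) to 16 kHz mono WAV:

ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav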