Speech Recognition DeepSpeech Python* Demo

This demo shows Automatic Speech Recognition (ASR) with a pretrained Mozilla* DeepSpeech 0.8.2 model.

It works with version 0.6.1 as well, and should also work with other models trained with Mozilla DeepSpeech 0.6.x/0.7.x/0.8.x/0.9.x with ASCII alphabets.

How It Works

The application accepts:

  • a Mozilla* DeepSpeech 0.8.2 neural network in Intermediate Representation (IR) format,
  • an n-gram language model file in KenLM quantized binary format, and
  • an audio file in PCM WAV 16 kHz mono format.
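Because the demo only accepts 16-bit, 16 kHz, mono PCM WAV input, it can help to check a file before running it. The following is a minimal sketch that uses only the Python standard library; the function name check_wav_format is hypothetical and is not part of the demo.

```python
import wave

EXPECTED_CHANNELS = 1       # mono
EXPECTED_RATE = 16000       # 16 kHz
EXPECTED_SAMPLE_WIDTH = 2   # bytes per sample, i.e. 16-bit


def check_wav_format(path):
    """Return a list of format problems; an empty list means the file looks usable.

    Hypothetical helper, not part of the demo. Note that wave.open() itself
    raises wave.Error for files that are not uncompressed PCM WAV.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != EXPECTED_CHANNELS:
            problems.append(f"expected 1 channel, got {wav.getnchannels()}")
        if wav.getframerate() != EXPECTED_RATE:
            problems.append(f"expected 16000 Hz, got {wav.getframerate()} Hz")
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH:
            problems.append(f"expected 16-bit samples, got {8 * wav.getsampwidth()}-bit")
    return problems
```

Calling check_wav_format("audio.wav") on a suitable file should return an empty list; any returned strings describe what needs to be converted first.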

The application has two modes:

  • Normal mode (default). Audio data is streamed in 10-second chunks into a streaming pipeline that computes audio features, runs the neural network to get per-frame character probabilities, and performs CTC decoding. After processing the whole file, the demo prints the decoded text and the time spent.
  • Simulated real-time mode. The app simulates speech recognition of a live recording by feeding audio data from the input file and displaying the current partial result in a scrolling line in the console output. Data is fed at real-time speed by introducing the necessary delays. Audio data is fed in 0.32-second chunks (the size is controlled by the --block-size option) into the same streaming pipeline. In this mode the pipeline provides an updated recognition result after each data chunk.
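The chunk sizes above translate into sample counts at 16 kHz: 10 s is 160000 samples and 0.32 s is 5120 samples. The sketch below illustrates the idea of splitting audio into blocks and pacing them at real-time speed. It is not the demo's actual code: the names block_size_samples, iter_blocks, and feed_in_real_time are hypothetical, and process_block stands in for the feature, network, and decoding pipeline.

```python
import time

SAMPLE_RATE = 16000  # Hz, the input format the demo expects


def block_size_samples(seconds, sample_rate=SAMPLE_RATE):
    """Convert a chunk duration in seconds to a number of audio samples."""
    return int(round(seconds * sample_rate))


def iter_blocks(samples, block_size):
    """Yield consecutive blocks of at most block_size samples."""
    for start in range(0, len(samples), block_size):
        yield samples[start:start + block_size]


def feed_in_real_time(samples, block_size, process_block,
                      sample_rate=SAMPLE_RATE,
                      sleep=time.sleep, clock=time.monotonic):
    """Feed blocks no faster than a live recording would produce them.

    A block is handed to process_block only after the moment at which its
    last sample would have been recorded.
    """
    start_time = clock()
    fed = 0
    for block in iter_blocks(samples, block_size):
        fed += len(block)
        delay = start_time + fed / sample_rate - clock()
        if delay > 0:
            sleep(delay)
        process_block(block)


print(block_size_samples(10))    # normal mode chunk, in samples
print(block_size_samples(0.32))  # simulated real-time chunk, in samples
```

The sleep and clock parameters exist only so the pacing logic can be exercised without waiting; in ordinary use the defaults apply.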

Preparing to Run

The list of models supported by the demo is in the <omz_dir>/demos/speech_recognition_deepspeech_demo/python/models.lst file. Pass this file to the Model Downloader and Converter to download the models and, if necessary, convert them to the OpenVINO Inference Engine format (*.xml + *.bin). Model Optimizer is a requirement for Model Downloader, so configure it first as described in its documentation.

An example of using the Model Downloader:

python3 <omz_dir>/tools/downloader/downloader.py --list models.lst

An example of using the Model Converter:

python3 <omz_dir>/tools/downloader/converter.py --list models.lst

Please pay attention to the model license, Mozilla Public License 2.0.

Prerequisites

The demo depends on the ctcdecode_numpy Python extension module, which implements CTC decoding in C++ for faster decoding. Please refer to Open Model Zoo demos for instructions on how to build the extension module and prepare the environment for running the demo. Alternatively, instead of using cmake, you can run python -m pip install . inside the ctcdecode-numpy directory to build and install ctcdecode-numpy.
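A quick way to confirm that the extension module is visible to the interpreter before launching the demo is to look it up without importing it. This is a minimal sketch using the standard library; is_module_available is a hypothetical helper, and only the module name ctcdecode_numpy comes from this document.

```python
import importlib.util


def is_module_available(name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(name) is not None


if is_module_available("ctcdecode_numpy"):
    print("ctcdecode_numpy is installed")
else:
    print("ctcdecode_numpy was not found; build or install it first")
```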

Supported Models

  • mozilla-deepspeech-0.6.1
  • mozilla-deepspeech-0.8.2

NOTE: Refer to the tables Intel's Pre-Trained Models Device Support and Public Pre-Trained Models Device Support for details on which devices support inference for each model.

Running Demo

Run the application with the -h option to see the help message. Here are the available command-line options:

usage: speech_recognition_deepspeech_demo.py [-h] -i FILENAME [-d DEVICE] -m
                                             FILENAME [-L FILENAME] -p NAME
                                             [-b N] [-c N] [--realtime]
                                             [--block-size BLOCK_SIZE]
                                             [--realtime-window REALTIME_WINDOW]

Speech recognition DeepSpeech demo

optional arguments:
  -h, --help            show this help message and exit
  -i FILENAME, --input FILENAME
                        Path to an audio file in WAV PCM 16 kHz mono format
  -d DEVICE, --device DEVICE
                        Optional. Specify the target device to infer on, for
                        example: CPU, GPU, HDDL, MYRIAD or HETERO. The
                        demo will look for a suitable IE plugin for this
                        device. (default is CPU)
  -m FILENAME, --model FILENAME
                        Path to an .xml file with a trained model (required)
  -L FILENAME, --lm FILENAME
                        path to language model file (optional)
  -p NAME, --profile NAME
                        Choose pre/post-processing profile: mds06x_en for
                        Mozilla DeepSpeech v0.6.x,
                        mds07x_en/mds08x_en/mds09x_en for Mozilla DeepSpeech
                        v0.7.x/v0.8.x/v0.9.x(English), other: filename of a
                        YAML file (required)
  -b N, --beam-width N  Beam width for beam search in CTC decoder (default
                        500)
  -c N, --max-candidates N
                        Show top N (or less) candidates (default 1)
  --realtime            Simulated real-time mode: slow down data feeding to
                        real time and show partial transcription during
                        recognition
  --block-size BLOCK_SIZE
                        Block size in audio samples for streaming into ASR
                        pipeline (defaults to samples in 10 sec for offline;
                        samples in 16 frame strides for online)
  --realtime-window REALTIME_WINDOW
                        In simulated real-time mode, show this many characters
                        on screen (default 79)

A typical command line for normal (offline) mode is:

pip install -r requirements.txt
source <openvino_dir>/bin/setupvars.sh
python3 speech_recognition_deepspeech_demo.py \
-p mds08x_en \
-m <path_to_model>/mozilla_deepspeech_0.8.2.xml \
-L <path_to_file>/deepspeech-0.8.2-models.kenlm \
-i <path_to_audio>/audio.wav

For version 0.6.1 it is:

python3 speech_recognition_deepspeech_demo.py \
-p mds06x_en \
-m <path_to_model>/mozilla_deepspeech_0.6.1.xml \
-L <path_to_file>/lm.binary \
-i <path_to_audio>/audio.wav

To run in simulated real-time mode, add the command-line option --realtime.
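When the demo is launched from another script, the command lines above can be assembled programmatically. The sketch below is one possible way to do that; build_demo_command is a hypothetical helper, the file paths are placeholders, and only the option names (-p, -m, -i, -L, -d, --realtime) are taken from the help text above.

```python
import shlex
import sys


def build_demo_command(model, audio, profile, lm=None, device=None,
                       realtime=False,
                       script="speech_recognition_deepspeech_demo.py"):
    """Build an argument list suitable for subprocess.run()."""
    cmd = [sys.executable, script, "-p", profile, "-m", model, "-i", audio]
    if lm is not None:
        cmd += ["-L", lm]       # optional language model file
    if device is not None:
        cmd += ["-d", device]   # the demo defaults to CPU when omitted
    if realtime:
        cmd.append("--realtime")
    return cmd


# Placeholder paths, for illustration only.
command = build_demo_command(
    model="models/mozilla_deepspeech_0.8.2.xml",
    audio="audio.wav",
    profile="mds08x_en",
    lm="models/deepspeech-0.8.2-models.kenlm",
    realtime=True,
)
print(" ".join(shlex.quote(part) for part in command))
```

The resulting list can be passed to subprocess.run(command, check=True) once the OpenVINO environment has been set up in the calling shell.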

NOTE: Only 16-bit, 16 kHz, mono-channel WAVE audio files are supported.

The optional (but highly recommended) language model files, deepspeech-0.8.2-models.kenlm and lm.binary, are part of the downloaded content of the corresponding models and are located in the Model Downloader output folder after the model is downloaded and converted. An example audio file is available at <openvino_dir>/deployment_tools/demo/how_are_you_doing.wav.

Demo Output

The application shows the time taken by the initialization and processing stages, and the decoded text for the audio file. In simulated real-time mode, the current partial recognition result is also shown while the app is running.

See Also