This demo provides a command-line interface for automatic speech recognition using OpenVINO™. Components used by this executable:
lspeech_s5_ext
model - Example pretrained LibriSpeech DNNspeech_library.dll
(.so
) - Open source speech recognition library that uses OpenVINO™ Inference Engine, Intel® Speech Feature Extraction and Intel® Speech Decoder librariesThe application transcribes speech from a given WAV file and outputs the text to the console.
The application requires two command-line parameters, which point to an audio file with speech to transcribe and a configuration file describing the resources to use for transcription.
-wave
- Path to input WAV to process. WAV file needs to be in the following format: RIFF WAVE PCM 16bit, 16kHz, 1 channel, with header.-c
, --config
- Path to configuration file with paths to resources and other parameters.Example usage:
The configuration file is an ASCII text file where:
Parameter | Description | Value used for demo |
---|---|---|
-fe:rt:numCeps | Number of MFCC cepstrums | 13 |
-fe:rt:contextLeft | Numbers of past frames that are stacked to form input vector for neural network inference | 5 |
-fe:rt:contextRight | Numbers of future frames that are stacked to form input vector for neural network inference | 5 |
-fe:rt:hpfBeta | High pass filter beta coefficient, where 0.0f means no filtering | 0.0f |
-fe:rt:inputDataType | Feature extraction input format description | INT16_16KHZ |
-fe:rt:cepstralLifter | Lifting factor | 22.0f |
-fe:rt:noDct | Flag: use DCT as final step or not | 0 |
-fe:rt:featureTransform | Kaldi feature transform file that normalizes stacked features for neural network inference | |
-dec:wfst:acousticModelFName | Full path to the acoustic model file, for example openvino_ir.xml | |
-dec:wfst:acousticScaleFactor | The acoustic log likelihood scaling factor | 0.1f |
-dec:wfst:beamWidth | Viterbi search beam width | 14.0f |
-dec:wfst:latticeWidth | Lattice beam width (extends the beam width) | 0.0f |
-dec:wfst:nbest | Number of n-best hypothesis to be generated | 1 |
-dec:wfst:confidenceAcousticScaleFactor | Scaling parameter to factor in acoustic scores in confidence computations | 1.0f |
-dec:wfst:confidenceLMScaleFactor | Scaling parameter to factor in language model in confidence computations | 1.0f |
-dec:wfst:hmmModelFName | Full path to HMM model | |
-dec:wfst:fsmFName | Full path to pronunciation model or full statically composed LM, if static composition is used | |
-dec:wfstotf:gramFsmFName | Full path to grammar model | |
-dec:wfst:outSymsFName | Full path to the output symbols (lexicon) filename | |
-dec:wfst:tokenBufferSize | Token pool size expressed in number of DWORDs | 150000 |
-dec:wfstotf:traceBackLogSize | Number of entries in traceback expressed as log2(N) | 19 |
-dec:wfstotf:minStableFrames | The time expressed in frames, after which the winning hypothesis is recognized as stable and the final result can be printed | 45 |
-dec:wfst:maxCumulativeTokenSize | Maximum fill rate of token buffer before token beam is adjusted to keep token buffer fill constant. Expressed as factor of buffer size (0.0, 1.0) | 0.2f |
-dec:wfst:maxTokenBufferFill | Active token count number triggering beam tightening expressed as factor of buffer size | 0.6f |
-dec:wfst:maxAvgTokenBufferFill | Average active token count number for utterance, which triggers beam tightening when exceeded. Expressed as factor of buffer size | 1.0f |
-dec:wfst:tokenBufferMinFill | Minimum fill rate of token buffer | 0.1f |
-dec:wfst:pruningTighteningDelta | Beam tightening value when token pool usage reaches the pool capacity | 1.0f |
-dec:wfst:pruningRelaxationDelta | Beam relaxation value when token pool is not meeting minimum fill ratio criterion | 0.5f |
-dec:wfst:useScoreTrendForEndpointing | Extend end pointing with acoustic feedback | 1 |
-dec:wfstotf:cacheLogSize | Number of entries in LM cache expressed as log2(N) | 16 |
-eng:output:format | Format of the speech recognition output | text |
-inference:contextLeft | IE: Additional stacking option, independent from feature extraction | 0 |
-inference:contextRight | IE: Additional stacking option, independent from feature extraction | 0 |
-inference:device | IE: Device used for neural computations | CPU |
-inference:numThreads | IE: Number of threads used by GNA in SW mode | 1 |
-inference:scaleFactor | IE: Scale factor used for static quantization | 3000.0 |
-inference:quantizationBits | IE: Quantization resolution in bits | 16 or 8 |
The resulting transcription for the sample audio file: