"mozilla-deepspeech-0.8.2" is a speech recognition neural network pre-trained by Mozilla based on DeepSpeech architecture (CTC decoder with beam search and n-gram language model) with changed neural network topology.
For details on the original DeepSpeech, see paper https://arxiv.org/abs/1412.5567.
For details on this model, see https://github.com/mozilla/DeepSpeech/releases/tag/v0.8.2.
## Specification

| Metric | Value |
|---|---|
| Type | Speech recognition |
| GFlops per audio frame | 0.0472 |
| GFlops per second of audio | 2.36 |
| MParams | 47.2 |
| Source framework | TensorFlow* |
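The two complexity figures in the table are consistent with each other: dividing GFlops per second of audio by GFlops per frame implies the model consumes about 50 feature frames per second of audio, i.e. a 20 ms frame step (this step size is inferred from the table, not stated in it):

```python
gflops_per_frame = 0.0472   # from the specification table
gflops_per_second = 2.36    # from the specification table

frames_per_second = gflops_per_second / gflops_per_frame   # ~50 frames/s
frame_step_ms = 1000 / frames_per_second                   # ~20 ms per frame
```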
## Accuracy

| Metric | Value | Parameters |
|---|---|---|
| WER @ LibriSpeech test-clean | 8.39% | with LM, beam_width = 32, Python CTC decoder, accuracy checker |
| WER @ LibriSpeech test-clean | 6.13% | with LM, beam_width = 500, C++ CTC decoder, accuracy checker |
| WER @ LibriSpeech test-clean | 6.15% | with LM, beam_width = 500, C++ CTC decoder, demo |
NB: beam_width = 32 is a low value for a CTC decoder; it was used to keep evaluation time reasonable with the Python CTC decoder in the Accuracy Checker. Increasing beam_width improves the WER metric but slows down decoding. The speech recognition demo uses a faster C++ CTC decoder module.
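WER (word error rate), used in the table above, is the word-level edit distance between the recognized text and the reference transcript, normalized by the number of reference words. As an illustration only (not the Accuracy Checker's implementation), a minimal WER computation might look like:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words
```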
## Input

### Original Model

1. Audio MFCC coefficients, name: `input_node`, shape: [1x16x19x26], format: [BxNxTxC], where:

   - B - batch size
   - N - number of audio frames in this section of audio, equal to `input_lengths` (see below)
   - T - context frames: the current frame plus its preceding and succeeding context (19 frames in total)
   - C - 26 MFCC coefficients per frame

   See `accuracy-check.yml` for all audio preprocessing and feature extraction parameters.

2. Number of audio frames in this section of audio, name: `input_lengths`, shape: [1].
3. LSTM cell state (c), name: `previous_state_c`, shape: [1x2048], format: [BxC].
4. LSTM hidden state (h), name: `previous_state_h`, shape: [1x2048], format: [BxC].

When splitting a long audio into chunks, the two state inputs must be fed with the corresponding state outputs from the previous chunk. Chunks must be processed in order, from earlier to later audio positions.
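The chunked processing described above can be sketched as follows. `run_network` is a hypothetical stand-in for one inference call on the model (a real implementation would feed `input_node`, `previous_state_c`, and `previous_state_h` and read back `logits` and the new states); it is given a dummy body here so only the state-passing pattern is shown:

```python
import numpy as np

def run_network(mfcc_chunk, state_c, state_h):
    """Hypothetical single inference call; outputs are dummies with the right shapes."""
    logits = np.zeros((mfcc_chunk.shape[1], 1, 29), dtype=np.float32)
    return logits, state_c + 1.0, state_h + 1.0  # pretend the LSTM states evolve

# One chunk = 16 audio frames, each with a 19-frame context window of 26 MFCCs.
chunks = [np.zeros((1, 16, 19, 26), dtype=np.float32) for _ in range(3)]

state_c = np.zeros((1, 2048), dtype=np.float32)  # zero states at the start of audio
state_h = np.zeros((1, 2048), dtype=np.float32)
all_logits = []
for chunk in chunks:  # chunks must be processed from earliest to latest
    logits, state_c, state_h = run_network(chunk, state_c, state_h)
    all_logits.append(logits)  # concatenate later and run the CTC decoder once
```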
### Converted Model

1. Audio MFCC coefficients, name: `input_node`, shape: [1x16x19x26], format: [BxNxTxC], where:

   - B - batch size
   - N - number of audio frames in this section of audio
   - T - context frames: the current frame plus its preceding and succeeding context (19 frames in total)
   - C - 26 MFCC coefficients per frame

   See `accuracy-check.yml` for all audio preprocessing and feature extraction parameters.

2. LSTM cell state (c), name: `previous_state_c`, shape: [1x2048], format: [BxC].
3. LSTM hidden state (h), name: `previous_state_h`, shape: [1x2048], format: [BxC].

When splitting a long audio into chunks, the two state inputs must be fed with the corresponding state outputs from the previous chunk. Chunks must be processed in order, from earlier to later audio positions.
## Output

### Original Model

1. Per-frame probabilities (after softmax) for every symbol in the alphabet, name: `logits`, shape: [16x1x29], format: [NxBxC].

   The per-frame probabilities are to be decoded with a CTC decoder. The alphabet is: 0 = space, 1...26 = "a" to "z", 27 = apostrophe, 28 = CTC blank symbol.

   NB: despite its name, `logits` contains probabilities after softmax.
2. LSTM cell state (c), name: `new_state_c`, shape: [1x2048], format: [BxC]. See Inputs.
3. LSTM hidden state (h), name: `new_state_h`, shape: [1x2048], format: [BxC]. See Inputs.

### Converted Model

1. Per-frame probabilities (after softmax) for every symbol in the alphabet, name: `logits`, shape: [16x1x29], format: [NxBxC].

   The per-frame probabilities are to be decoded with a CTC decoder. The alphabet is: 0 = space, 1...26 = "a" to "z", 27 = apostrophe, 28 = CTC blank symbol.

   NB: despite its name, `logits` contains probabilities after softmax.
2. LSTM cell state (c), name: `cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/BlockLSTM/TensorIterator.2` (corresponds to `new_state_c`), shape: [1x2048], format: [BxC]. See Inputs.
3. LSTM hidden state (h), name: `cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/BlockLSTM/TensorIterator.1` (corresponds to `new_state_h`), shape: [1x2048], format: [BxC]. See Inputs.

## Legal Information

The original model is distributed under the Mozilla Public License, Version 2.0. A copy of the license is provided in `MPL-2.0-Mozilla-Deepspeech.txt`.
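As an illustration of how the `logits` output maps to text, the sketch below applies greedy CTC decoding (argmax per frame, collapse repeats, drop blanks) using the alphabet described in the Output section. This is a simplification: the accuracy figures above were obtained with beam-search CTC decoders and a language model, not with greedy decoding.

```python
import numpy as np

ALPHABET = " abcdefghijklmnopqrstuvwxyz'"  # indices 0..27 per the Output section
BLANK = 28                                 # CTC blank symbol

def ctc_greedy_decode(probs: np.ndarray) -> str:
    """probs: [N, B, C] per-frame symbol probabilities, batch B = 1."""
    best = probs[:, 0, :].argmax(axis=-1)  # most likely symbol per frame
    out, prev = [], BLANK
    for sym in best:
        if sym != prev and sym != BLANK:   # collapse repeats, drop blanks
            out.append(ALPHABET[sym])
        prev = sym
    return "".join(out)

# Toy example: one-hot frames spelling "hi " with repeats and blanks.
frames = [8, 8, BLANK, 9, 9, BLANK, 0, BLANK]  # h, h, _, i, i, _, space, _
probs = np.zeros((len(frames), 1, 29), dtype=np.float32)
for t, s in enumerate(frames):
    probs[t, 0, s] = 1.0
print(ctc_greedy_decode(probs))  # → "hi " (with a trailing space)
```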