lexkoro/deepspeech-german

 
 

DeepSpeech German

This project is built upon the paper German End-to-end Speech Recognition based on DeepSpeech. The original code for the paper can be found here.

This project aims to develop a working speech-to-text module using Mozilla DeepSpeech that can be used in any audio processing pipeline. Mozilla DeepSpeech is a state-of-the-art open-source automatic speech recognition (ASR) toolkit. DeepSpeech uses a model trained with machine learning techniques based on Baidu's Deep Speech research paper. Project DeepSpeech uses Google's TensorFlow to make the implementation easier.


Datasets


Links:


Usage

General information

The file structure will look as follows:

my_deepspeech_folder
    checkpoints
    data_original
    data_prepared
    DeepSpeech
    deepspeech-german    <- This repository

Clone DeepSpeech and build container:

git clone https://github.com/mozilla/DeepSpeech.git
# or
git clone https://github.com/DanBmh/DeepSpeech.git
cd DeepSpeech && make Dockerfile.train && cd ..
docker build -f DeepSpeech/Dockerfile.train -t mozilla_deep_speech DeepSpeech/

Build and run our docker container:

docker build -t deep_speech_german deepspeech-german/
./deepspeech-german/run_container.sh

Download and prepare voice data

Download datasets (Run in docker container):

python3 deepspeech-german/preprocessing/download_data.py --tuda data_original/
python3 deepspeech-german/preprocessing/download_data.py --voxforge data_original/
python3 deepspeech-german/preprocessing/download_data.py --mailabs data_original/
python3 deepspeech-german/preprocessing/download_data.py --swc data_original/
python3 deepspeech-german/preprocessing/download_data.py --tatoeba data_original/
python3 deepspeech-german/preprocessing/download_data.py --common_voice data_original/
python3 deepspeech-german/preprocessing/download_data.py --zamia_speech data_original/

Download the CSS10 German dataset (requires a Kaggle account): LINK
Extract it and move it to the datasets directory (data_original/css_german/).
The files seem to be stored twice in the archive, so remove the duplicate folders.


Prepare the datasets (run in the Docker container); this may take some time:

# Prepare the datasets one by one to ensure everything is working:
python3 deepspeech-german/preprocessing/prepare_data.py --voxforge data_original/voxforge/ data_prepared/voxforge/
python3 deepspeech-german/preprocessing/prepare_data.py --tuda data_original/tuda/ data_prepared/tuda/
python3 deepspeech-german/preprocessing/prepare_data.py --common_voice data_original/common_voice/ data_prepared/common_voice/
python3 deepspeech-german/preprocessing/prepare_data.py --mailabs data_original/mailabs/ data_prepared/mailabs/
python3 deepspeech-german/preprocessing/prepare_data.py --swc data_original/swc/ data_prepared/swc/
python3 deepspeech-german/preprocessing/prepare_data.py --tatoeba data_original/tatoeba/ data_prepared/tatoeba/
python3 deepspeech-german/preprocessing/prepare_data.py --css_german data_original/css_german/ data_prepared/css_german/
python3 deepspeech-german/preprocessing/prepare_data.py --zamia_speech data_original/zamia_speech/ data_prepared/zamia_speech/
# To combine multiple datasets run the command as follows (not recommended):
python3 deepspeech-german/preprocessing/prepare_data.py --tuda data_original/tuda/ --voxforge data_original/voxforge/ data_prepared/tuda_voxforge/
# Or, much faster, only combine the existing train, dev, test and all csv files:
python3 deepspeech-german/preprocessing/combine_datasets.py data_prepared/ --tuda --voxforge
# Or to combine specific csv files:
python3 /DeepSpeech/deepspeech-german/preprocessing/combine_datasets.py "" --files_output /DeepSpeech/data_prepared/tvsmc/train_mix.csv --files "/DeepSpeech/data_prepared/tuda/train.csv /DeepSpeech/data_prepared/voxforge/train.csv /DeepSpeech/data_prepared/swc/all.csv /DeepSpeech/data_prepared/mailabs/all.csv /DeepSpeech/data_prepared/common_voice/train.csv"
# To shuffle and replace "äöü" characters and clean the files run (repeat for all 3 csv files):
python3 /DeepSpeech/deepspeech-german/preprocessing/dataset_operations.py /DeepSpeech/data_prepared/voxforge/train.csv /DeepSpeech/data_prepared/voxforge/train_azce.csv --replace --shuffle --clean --exclude
# To split tuda into the correct train, dev and test splits run: 
# (you will have to rename the [train/dev/test]_s.csv files before combining them with other datasets)
python3 deepspeech-german/preprocessing/split_dataset.py data_prepared/tuda/all.csv --tuda --file_appendix _s
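
To illustrate what the `--replace` and `--shuffle` steps do conceptually, here is a minimal sketch. It assumes the usual DeepSpeech CSV columns (`wav_filename`, `wav_filesize`, `transcript`) and a simple `ä→ae`, `ö→oe`, `ü→ue`, `ß→ss` mapping; the actual rules in `dataset_operations.py` may differ, and the function names are hypothetical.

```python
import csv
import io
import random

# Assumed replacement table; the repository's script may use other rules.
UMLAUT_MAP = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


def replace_umlauts(text):
    """Lowercase the text and spell out umlauts with a-z letters."""
    text = text.lower()
    for char, replacement in UMLAUT_MAP.items():
        text = text.replace(char, replacement)
    return text


def process_rows(rows, seed=None):
    """Return shuffled copies of the rows with umlaut-free transcripts."""
    processed = [dict(row, transcript=replace_umlauts(row["transcript"])) for row in rows]
    random.Random(seed).shuffle(processed)
    return processed


# Example with an in-memory CSV instead of a real train.csv:
sample = io.StringIO(
    "wav_filename,wav_filesize,transcript\n"
    "a.wav,1000,Grüße aus Köln\n"
    "b.wav,2000,schöne Bäume\n"
)
rows = process_rows(list(csv.DictReader(sample)), seed=0)
```

Passing a fixed `seed` makes the shuffle reproducible, which helps when comparing training runs.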

Preparation times on an Intel i7-8700K:

  • voxforge: a few seconds
  • tuda: a few minutes
  • mailabs: ~20min
  • common_voice: ~12h
  • swc: ~6h

Download and prepare noise data

You have to merge mozilla/DeepSpeech#2622 for testing with noise, or use my noiseaugmaster branch.
Run in container:

cd data_original/noise/
# Download freesound data:
wget "https://zenodo.org/record/3612637/files/FSDKaggle2019.audio_test.zip?download=1" -O FSDKaggle2019.audio_test.zip
wget "https://zenodo.org/record/3612637/files/FSDKaggle2019.audio_train_curated.zip?download=1" -O FSDKaggle2019.audio_train_curated.zip
wget "https://zenodo.org/record/3612637/files/FSDKaggle2019.audio_train_noisy.z01?download=1" -O FSDKaggle2019.audio_train_noisy.z01
wget "https://zenodo.org/record/3612637/files/FSDKaggle2019.audio_train_noisy.z02?download=1" -O FSDKaggle2019.audio_train_noisy.z02
wget "https://zenodo.org/record/3612637/files/FSDKaggle2019.audio_train_noisy.z03?download=1" -O FSDKaggle2019.audio_train_noisy.z03
wget "https://zenodo.org/record/3612637/files/FSDKaggle2019.audio_train_noisy.z04?download=1" -O FSDKaggle2019.audio_train_noisy.z04
wget "https://zenodo.org/record/3612637/files/FSDKaggle2019.audio_train_noisy.z05?download=1" -O FSDKaggle2019.audio_train_noisy.z05
wget "https://zenodo.org/record/3612637/files/FSDKaggle2019.audio_train_noisy.z06?download=1" -O FSDKaggle2019.audio_train_noisy.z06
wget "https://zenodo.org/record/3612637/files/FSDKaggle2019.audio_train_noisy.zip?download=1" -O FSDKaggle2019.audio_train_noisy.zip
# Merge the seven parts:
zip -s 0 FSDKaggle2019.audio_train_noisy.zip --out unsplit.zip
unzip FSDKaggle2019.audio_test.zip
unzip FSDKaggle2019.audio_train_curated.zip
unzip unsplit.zip
rm *.zip
rm *.z0*
# Download rnnoise data:
wget https://media.xiph.org/rnnoise/rnnoise_contributions.tar.gz
tar -xvzf rnnoise_contributions.tar.gz
rm rnnoise_contributions.tar.gz
# Download zamia noise
wget http://goofy.zamia.org/zamia-speech/corpora/noise.tar.xz
tar -xvf noise.tar.xz
mv noise/ zamia/
rm noise.tar.xz
# Normalize all the audio files (run with python2):
python /DeepSpeech/deepspeech-german/preprocessing/normalize_noise_audio.py --from_dir /DeepSpeech/data_original/noise/ --to_dir /DeepSpeech/data_prepared/noise/ --max_sec 45
# Create csv files:
python3 /DeepSpeech/deepspeech-german/preprocessing/noise_to_csv.py
python3 /DeepSpeech/deepspeech-german/preprocessing/split_dataset.py /DeepSpeech/data_prepared/noise/all.csv  --split "70|15|15"
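
The `--split "70|15|15"` argument describes a percentage split into train, dev and test parts. A minimal sketch of such a split follows; `split_rows` is a hypothetical function, and `split_dataset.py` may shuffle and round differently.

```python
import random


def split_rows(rows, split="70|15|15", seed=0):
    """Shuffle rows and split them into train/dev/test by percentage."""
    train_pct, dev_pct, test_pct = (int(p) for p in split.split("|"))
    if train_pct + dev_pct + test_pct != 100:
        raise ValueError("split percentages must sum to 100")
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = len(rows) * train_pct // 100
    n_dev = len(rows) * dev_pct // 100
    train = rows[:n_train]
    dev = rows[n_train:n_train + n_dev]
    test = rows[n_train + n_dev:]  # remainder, so no row is lost to rounding
    return train, dev, test
```

Giving the test part the remainder means every row ends up in exactly one part, even when the row count is not divisible by 100.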

Create the language model

Download and prepare the text corpora tuda, europarl+news:

cd data_original/texts/
wget "http://ltdata1.informatik.uni-hamburg.de/kaldi_tuda_de/German_sentences_8mil_filtered_maryfied.txt.gz" -O tuda_sentences.txt.gz
gzip -d tuda_sentences.txt.gz
wget "https://www.statmt.org/wmt13/training-monolingual-nc-v8.tgz" -O news-commentary.tgz
tar zxvf news-commentary.tgz && mv training/news-commentary-v8.de news-commentary-v8.de && rm news-commentary.tgz && rm -r training/
# If you have enough space you can also download the other years
wget "https://www.statmt.org/wmt13/training-monolingual-news-2012.tgz" -O news-2012.tgz
tar zxvf news-2012.tgz && mv training-monolingual/news.2012.de.shuffled news.2012.de && rm news-2012.tgz && rm -r training-monolingual/
wget "https://www.statmt.org/wmt13/training-monolingual-europarl-v7.tgz" -O europarl.tgz
tar zxvf europarl.tgz && mv training/europarl-v7.de europarl-v7.de && rm europarl.tgz && rm -r training/
# This needs a lot of memory for processing (~30 GB); to reduce it, you can skip some of the files
python3 /DeepSpeech/deepspeech-german/preprocessing/prepare_vocab.py /DeepSpeech/data_original/texts/ /DeepSpeech/data_prepared/clean_vocab_az.txt --replace_umlauts

Generate scorer (Run in docker container):

mkdir data_prepared/lm/
python3 /DeepSpeech/data/lm/generate_lm.py --input_txt /DeepSpeech/data_prepared/clean_vocab_az.txt --output_dir /DeepSpeech/data_prepared/lm/ --top_k 500000 --kenlm_bins /DeepSpeech/native_client/kenlm/build/bin/ --arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" --binary_a_bits 255 --binary_q_bits 8 --binary_type trie
python3 /DeepSpeech/data/lm/generate_package.py --alphabet /DeepSpeech/deepspeech-german/data/alphabet_az.txt --lm /DeepSpeech/data_prepared/lm/lm.binary --vocab /DeepSpeech/data_prepared/lm/vocab-500000.txt --package /DeepSpeech/data_prepared/lm/kenlm_az.scorer --default_alpha 0.75 --default_beta 1.85
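
The `--top_k 500000` option restricts the language model vocabulary to the most frequent words of the text corpus. The idea can be sketched as follows; this is an illustration with a hypothetical function name, not the code of `generate_lm.py`:

```python
from collections import Counter


def top_k_words(lines, k):
    """Return the k most frequent whitespace-separated words."""
    counter = Counter()
    for line in lines:
        counter.update(line.lower().split())
    return [word for word, _ in counter.most_common(k)]
```

Words outside this list cannot be proposed by the scorer, so a smaller `k` trades vocabulary coverage for a smaller model.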

Fix some issues

At first, training only worked with Voxforge. With the Tuda dataset I got this error:
"Invalid argument: Not enough time for target transition sequence"

To fix it you have to follow this solution:

# Add the parameter "ignore_longer_outputs_than_inputs=True" in DeepSpeech.py (~ line 231)
# Compute the CTC loss using TensorFlow's `ctc_loss`
total_loss = tfv1.nn.ctc_loss(labels=batch_y, inputs=logits, sequence_length=batch_seq_len, ignore_longer_outputs_than_inputs=True)
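
The error occurs when a transcript needs more CTC time steps than the audio provides. As an alternative to ignoring such samples during training, they can be located beforehand. The following is a rough sketch that assumes a feature step of 20 ms and approximates the number of frames by dividing the duration by that step; both the functions and the frame estimate are assumptions, not code from DeepSpeech.

```python
def min_ctc_steps(transcript):
    """Minimum number of time steps CTC needs for this transcript.

    One step per character, plus one blank between each pair of
    identical neighbouring characters.
    """
    repeats = sum(1 for a, b in zip(transcript, transcript[1:]) if a == b)
    return len(transcript) + repeats


def fits_ctc(duration_ms, transcript, step_ms=20):
    """Check whether the audio is long enough for the transcript."""
    available_steps = duration_ms // step_ms  # approximate frame count
    return available_steps >= min_ctc_steps(transcript)
```

For example, `"hallo"` needs six steps because of the doubled `l`, so a 100 ms clip is too short while a 120 ms clip is just long enough under these assumptions.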

Applying this fix leads to another error after some training steps:
"Invalid argument: WAV data chunk '[Some strange symbol here]"

Simply skip the affected batches in the training loop:

# Add another exception (tf.errors.InvalidArgumentError) in the training loop in DeepSpeech.py (~ line 602):
try:
    [...]
    session.run([train_op, global_step, loss, non_finite_files, step_summaries_op], feed_dict=feed_dict)
except tf.errors.OutOfRangeError:
    break
except tf.errors.InvalidArgumentError as e:
    print("Ignoring error:", e)
    continue

Add the parameter and the ignored exception in evaluate.py file too (~ lines 73 and 118).


To filter the files causing infinite loss:

# Below these lines (DeepSpeech.py ~ line 620):
problem_files = [f.decode('utf8') for f in problem_files[..., 0]]
log_error('The following files caused an infinite (or NaN) '
          'loss: {}'.format(','.join(problem_files)))
# Add the following to save the files to excluded_files.json and stop training
sys.path.append("/DeepSpeech/deepspeech-german/training/")
from filter_invalid_files import add_files_to_excluded
add_files_to_excluded(problem_files)
sys.exit(1)
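
For orientation, a helper of this kind could look like the sketch below. This is a hypothetical illustration under the assumption that the excluded files are stored as a plain JSON list; the real implementation is `add_files_to_excluded` in `filter_invalid_files.py`, which may work differently.

```python
import json
import os


def save_excluded_files(problem_files, json_path="excluded_files.json"):
    """Merge `problem_files` into a JSON list on disk and return the list."""
    excluded = []
    if os.path.exists(json_path):
        with open(json_path, "r", encoding="utf-8") as f:
            excluded = json.load(f)
    # Keep each path once and store them sorted for stable diffs.
    excluded = sorted(set(excluded) | set(problem_files))
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(excluded, f, indent=2)
    return excluded
```

Because the list is merged rather than overwritten, repeated training runs accumulate all problematic files in one place.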

Training

Download the pretrained DeepSpeech checkpoint:

wget https://github.com/mozilla/DeepSpeech/releases/download/v0.7.3/deepspeech-0.7.3-checkpoint.tar.gz -P checkpoints/
tar xvfz checkpoints/deepspeech-0.7.3-checkpoint.tar.gz -C checkpoints/
rm checkpoints/deepspeech-0.7.3-checkpoint.tar.gz

Adjust the parameters to your needs (Run in docker container):

# Delete old model files:
rm -rf /root/.local/share/deepspeech/summaries && rm -rf /root/.local/share/deepspeech/checkpoints
# Run training:
python3 DeepSpeech.py --train_files data_prepared/voxforge/train.csv --dev_files data_prepared/voxforge/dev.csv --test_files data_prepared/voxforge/test.csv \
--alphabet_config_path deepspeech-german/data/alphabet.txt --lm_trie_path data_prepared/trie --lm_binary_path data_prepared/lm.binary --test_batch_size 48 --train_batch_size 48 --dev_batch_size 48 \
--epochs 75 --learning_rate 0.0005 --dropout_rate 0.40 --export_dir deepspeech-german/models --use_allow_growth --use_cudnn_rnn 
# Or adjust the train.sh file and run a training using the english checkpoint:
/bin/bash /DeepSpeech/deepspeech-german/training/train.sh /DeepSpeech/checkpoints/voxforge/ /DeepSpeech/data_prepared/voxforge/train_azce.csv /DeepSpeech/data_prepared/voxforge/dev_azce.csv /DeepSpeech/data_prepared/voxforge/test_azce.csv 1 /DeepSpeech/checkpoints/deepspeech-0.6.0-checkpoint/
# Or to run a cycled training as described in the paper, run:
python3 /DeepSpeech/deepspeech-german/training/cycled_training.py /DeepSpeech/checkpoints/voxforge/ /DeepSpeech/data_prepared/ _azce --voxforge
# Run test only (The use_allow_growth flag fixes "cuDNN failed to initialize" error):
# Don't forget to add the noise augmentation flags if testing with noise
python3 /DeepSpeech/DeepSpeech.py --test_files /DeepSpeech/data_prepared/voxforge/test_azce.csv --checkpoint_dir /DeepSpeech/checkpoints/voxforge/ --scorer_path /DeepSpeech/data_prepared/lm/kenlm_az.scorer --alphabet_config_path /DeepSpeech/deepspeech-german/data/alphabet_az.txt --test_batch_size 36 --use_allow_growth

Training time for voxforge on 2x Nvidia 1080Ti using batch size of 48 is about 01:45min per epoch. Training until early stop took 22min for 10 epochs.

One epoch in tuda with a batch size of 12 on a single GPU needs about 1:15h. With both GPUs it takes about 26min. Ten cycled trainings with early stopping took about 15h.

One epoch in mailabs with batch size of 24/12/12 needs about 19min, testing about 21 min.

One epoch in swc with batch size of 12/12/12 needs about 1:08h, testing about 17 min.

One epoch with all datasets and batch size of 12 needs about 2:50h, testing about 1:30h. Training until early stop took 37h for 11 epochs.

One epoch with all datasets and only Tuda + CommonVoice as testset needs about 3:30h. Training for 55 epochs took 8d 6h, testing about 1h.


Results

Paper

Some results (word error rate, WER) from the paper German End-to-end Speech Recognition based on DeepSpeech:

  • Mozilla 79.7%
  • Voxforge 72.1%
  • Tuda-De 26.8%
  • Tuda-De+Mozilla 57.3%
  • Tuda-De+Voxforge 15.1%
  • Tuda-De+Voxforge+Mozilla 21.5%

To test their uploaded checkpoint, you have to add a file named best_dev_checkpoint next to the checkpoint files, with the following content:

model_checkpoint_path: "best_dev-22218"
all_model_checkpoint_paths: "best_dev-22218"
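
The file can also be written with a few lines of Python. A minimal sketch, where `write_best_dev_checkpoint` is a hypothetical helper and the checkpoint name is taken from the content shown above:

```python
from pathlib import Path


def write_best_dev_checkpoint(checkpoint_dir, name="best_dev-22218"):
    """Write the `best_dev_checkpoint` index file next to the checkpoint files."""
    content = (
        f'model_checkpoint_path: "{name}"\n'
        f'all_model_checkpoint_paths: "{name}"\n'
    )
    target = Path(checkpoint_dir) / "best_dev_checkpoint"
    target.write_text(content, encoding="utf-8")
    return target
```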

Switch the DeepSpeech repository back to tag v0.5.0 and build a new docker image. Don't forget to fix the above issue in evaluate.py again. Mount the checkpoint and data directories and run:

python3 /DeepSpeech/DeepSpeech.py --test_files /DeepSpeech/data_prepared/voxforge/test_azce.csv --checkpoint_dir /DeepSpeech/checkpoints/dsg05_models/checkpoints/ \
--alphabet_config_path /DeepSpeech/checkpoints/dsg05_models/alphabet.txt --lm_trie_path /DeepSpeech/checkpoints/dsg05_models/trie --lm_binary_path /DeepSpeech/checkpoints/dsg05_models/lm.binary --test_batch_size 48

| Dataset | Additional info | Losses | Result |
|---|---|---|---|
| Tuda + CommonVoice | used newer CommonVoice version, there may be overlaps between test and training data because of random splitting | Test: 105.747589 | WER: 0.683802, CER: 0.386331 |
| Tuda | correct tuda test split, there may be overlaps between test and training data because of random splitting | Test: 402.696991 | WER: 0.785655, CER: 0.428786 |
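
The WER (word error rate) and CER (character error rate) values in the result lists are edit-distance based metrics. The sketch below shows the per-sentence definition, assuming non-empty reference texts; DeepSpeech's own evaluation code aggregates over the whole test set, so its numbers are not reproduced exactly by averaging these per-sentence values.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (insert, delete, substitute)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    return levenshtein(reference, hypothesis) / len(reference)
```

With the reference `"das ist ein test"` and the hypothesis `"das ist test"`, one of four words is missing, which gives a WER of 0.25.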

This repo

Some results with an old code version (default dropout is 0.4, learning rate 0.0005):

| Dataset | Additional info | Result |
|---|---|---|
| Voxforge |  | WER: 0.676611, CER: 0.403916, loss: 82.185226 |
| Voxforge | with augmentation | WER: 0.624573, CER: 0.348618, loss: 74.403786 |
| Voxforge | without "äöü" | WER: 0.646702, CER: 0.364471, loss: 82.567413 |
| Voxforge | cleaned data, without "äöü" | WER: 0.634828, CER: 0.353037, loss: 81.905258 |
| Voxforge | above checkpoint, tested on not cleaned data | WER: 0.634556, CER: 0.352879, loss: 81.849220 |
| Voxforge | checkpoint from english deepspeech, without "äöü" | WER: 0.394064, CER: 0.190184, loss: 49.066357 |
| Voxforge | checkpoint from english deepspeech, with augmentation, without "äöü", dropout 0.25, learning rate 0.0001 | WER: 0.338685, CER: 0.150972, loss: 42.031754 |
| Voxforge | reduce learning rate on plateau, with noise and standard augmentation, checkpoint from english deepspeech, cleaned data, without "äöü", dropout 0.25, learning rate 0.0001, batch size 48 | WER: 0.320507, CER: 0.131948, loss: 39.923031 |
| Voxforge | above with learning rate 0.00001 | WER: 0.350903, CER: 0.147837, loss: 43.451263 |
| Voxforge | above with learning rate 0.001 | WER: 0.518670, CER: 0.252510, loss: 62.927200 |
| Tuda + Voxforge | without "äöü", checkpoint from english deepspeech, cleaned train and dev data | WER: 0.740130, CER: 0.462036, loss: 156.115921 |
| Tuda + Voxforge | first Tuda then Voxforge, without "äöü", cleaned train and dev data, dropout 0.25, learning rate 0.0001 | WER: 0.653841, CER: 0.384577, loss: 159.509476 |
| Tuda + Voxforge + SWC + Mailabs + CommonVoice | checkpoint from english deepspeech, with augmentation, without "äöü", cleaned data, dropout 0.25, learning rate 0.0001 | WER: 0.306061, CER: 0.151266, loss: 33.218510 |

Some results with an older code version:
(Default values: batch size 12, dropout 0.25, learning rate 0.0001, without "äöü", cleaned data, checkpoint from english deepspeech, early stopping, reduce learning rate on plateau, evaluation with scorer and top-500k words)

| Dataset | Additional info | Losses | Training epochs of best model | Result |
|---|---|---|---|---|
| Tuda + Voxforge + SWC + Mailabs + CommonVoice | test only with Tuda + CommonVoice others completely for training, language model with training transcriptions, with augmentation | Test: 29.363405, Validation: 23.509546 | 55 | WER: 0.190189, CER: 0.091737 |
| Tuda + Voxforge + SWC + Mailabs + CommonVoice | above checkpoint tested with 3-gram language model | Test: 29.363405 |  | WER: 0.199709, CER: 0.095318 |
| Tuda + Voxforge + SWC + Mailabs + CommonVoice | above checkpoint tested on Tuda only | Test: 87.074394 |  | WER: 0.378379, CER: 0.167380 |
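
The default values above mention early stopping and reducing the learning rate on a plateau. How the two interact can be sketched with a small simulation over per-epoch validation losses. The parameter names and values here are illustrative assumptions, not the settings used for the results in this README, and DeepSpeech's own logic may differ in detail.

```python
def simulate_training(dev_losses, learning_rate=0.0001,
                      plateau_epochs=2, plateau_reduction=0.1, es_epochs=4):
    """Walk through dev losses; return (best_epoch, last_epoch, final_lr)."""
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(dev_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
            continue
        epochs_without_improvement += 1
        if epochs_without_improvement >= es_epochs:
            return best_epoch, epoch, learning_rate  # early stop
        if epochs_without_improvement % plateau_epochs == 0:
            learning_rate *= plateau_reduction  # reduce on plateau
    return best_epoch, len(dev_losses) - 1, learning_rate
```

In this sketch the learning rate is reduced after every `plateau_epochs` epochs without improvement, and training stops once `es_epochs` such epochs have accumulated; the best epoch is the one whose checkpoint would be kept.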

Some results with the current code version:
(Default values: batch size 36, dropout 0.25, learning rate 0.0001, without "äöü", cleaned data, checkpoint from english deepspeech, early stopping, reduce learning rate on plateau, evaluation with scorer and top-500k words, data augmentation)

| Dataset | Additional info | Losses | Training epochs of best model | Result |
|---|---|---|---|---|
| Voxforge | training from scratch | Test: 79.124008, Validation: 81.982976 | 29 | WER: 0.603879, CER: 0.298139 |
| Voxforge |  | Test: 44.312195, Validation: 47.915317 | 21 | WER: 0.343973, CER: 0.140119 |
| Voxforge | without reduce learning rate on plateau | Test: 46.160049, Validation: 48.926518 | 13 | WER: 0.367125, CER: 0.163931 |
| Voxforge | dropped last layer | Test: 49.844028, Validation: 52.722362 | 21 | WER: 0.389327, CER: 0.170563 |
| Voxforge | 5 cycled training | Test: 42.973358 |  | WER: 0.353841, CER: 0.158554 |
| Tuda | training from scratch, correct train/dev/test splitting | Test: 149.653427, Validation: 137.645307 | 9 | WER: 0.606629, CER: 0.296630 |
| Tuda | correct train/dev/test splitting | Test: 103.179092, Validation: 132.243965 | 3 | WER: 0.436074, CER: 0.208135 |
| Tuda | dropped last layer, correct train/dev/test splitting | Test: 107.047821, Validation: 101.219325 | 6 | WER: 0.431361, CER: 0.195361 |
| Tuda | dropped last two layers, correct train/dev/test splitting | Test: 110.523621, Validation: 103.844562 | 5 | WER: 0.442421, CER: 0.204504 |
| Tuda | checkpoint from Voxforge with WER 0.344, correct train/dev/test splitting | Test: 100.846367, Validation: 95.410456 | 3 | WER: 0.416950, CER: 0.198177 |
| Tuda | 10 cycled training, checkpoint from Voxforge with WER 0.344, correct train/dev/test splitting | Test: 98.007607 |  | WER: 0.410520, CER: 0.194091 |
| Tuda | random dataset splitting, checkpoint from Voxforge with WER 0.344. Important Note: These results are not meaningful, because same transcriptions can occur in train and test set, only recorded with different microphones | Test: 23.322618, Validation: 23.094230 | 27 | WER: 0.090285, CER: 0.036212 |
| CommonVoice | checkpoint from Tuda with WER 0.417 | Test: 24.688297, Validation: 17.460029 | 35 | WER: 0.217124, CER: 0.085427 |
| CommonVoice | above tested with reduced testset where transcripts occurring in trainset were removed | Test: 33.376812 |  | WER: 0.211668, CER: 0.079157 |
| CommonVoice + GoogleWavenet | above tested with GoogleWavenet | Test: 17.653290 |  | WER: 0.035807, CER: 0.007342 |
| CommonVoice | checkpoint from Voxforge with WER 0.344 | Test: 23.460932, Validation: 16.641201 | 35 | WER: 0.215584, CER: 0.084932 |
| CommonVoice | dropped last layer | Test: 24.480028, Validation: 17.505738 | 36 | WER: 0.220435, CER: 0.086921 |
| Tuda + GoogleWavenet | added GoogleWavenet to train data, dev/test from Tuda, checkpoint from Voxforge with WER 0.344 | Test: 95.555939, Validation: 90.392490 | 3 | WER: 0.390291, CER: 0.178549 |
| Tuda + GoogleWavenet | GoogleWavenet as train data, dev/test from Tuda | Test: 346.486420, Validation: 326.615474 | 0 | WER: 0.865683, CER: 0.517528 |
| Tuda + GoogleWavenet | GoogleWavenet as train/dev data, test from Tuda | Test: 477.049591, Validation: 3.320163 | 23 | WER: 0.923973, CER: 0.601015 |
| Tuda + GoogleWavenet | above checkpoint tested with GoogleWavenet | Test: 3.406022 |  | WER: 0.012919, CER: 0.001724 |
| Tuda + GoogleWavenet | checkpoint from english deepspeech tested with Tuda | Test: 402.102661 |  | WER: 0.985554, CER: 0.752787 |
| Voxforge + GoogleWavenet | added all of GoogleWavenet to train data, dev/test from Voxforge | Test: 45.643063, Validation: 49.620488 | 28 | WER: 0.349552, CER: 0.143108 |
| CommonVoice + GoogleWavenet | added all of GoogleWavenet to train data, dev/test from CommonVoice | Test: 25.029057, Validation: 17.511973 | 35 | WER: 0.214689, CER: 0.084206 |
| CommonVoice + GoogleWavenet | above tested with reduced testset | Test: 34.191067 |  | WER: 0.213164, CER: 0.079121 |

Updated to DeepSpeech v0.7.3 and new english checkpoint:
(Testing with noise and speech overlay is done with older noiseaugmaster branch, which implemented this functionality)

| Dataset | Additional info | Losses | Training epochs of best model | Result |
|---|---|---|---|---|
| Voxforge |  | Test: 32.844025, Validation: 36.912005 | 14 | WER: 0.240091, CER: 0.087971 |
| Voxforge | without freq_and_time_masking augmentation | Test: 33.698494, Validation: 38.071722 | 10 | WER: 0.244600, CER: 0.094577 |
| Voxforge | using new audio augmentation options | Test: 29.280865, Validation: 33.294815 | 21 | WER: 0.220538, CER: 0.079463 |
| Voxforge | updated augmentations again | Test: 28.846869, Validation: 32.680268 | 16 | WER: 0.225360, CER: 0.083504 |
| Voxforge | test above with older noiseaugmaster branch | Test: 28.831675 |  | WER: 0.238961, CER: 0.081555 |
| Voxforge | test with speech overlay | Test: 89.661995 |  | WER: 0.570903, CER: 0.301745 |
| Voxforge | test with noise overlay | Test: 53.461609 |  | WER: 0.438126, CER: 0.213890 |
| Voxforge | test with speech and noise overlay | Test: 79.736122 |  | WER: 0.581259, CER: 0.310365 |
| Voxforge | second test with speech and noise to check random influence | Test: 81.241333 |  | WER: 0.595410, CER: 0.319077 |
| Voxforge | add speech overlay augmentation | Test: 28.843914, Validation: 32.341234 | 27 | WER: 0.222024, CER: 0.083036 |
| Voxforge | change snr=50:209m to snr=30:159 | Test: 28.502413, Validation: 32.236247 | 28 | WER: 0.226005, CER: 0.085475 |
| Voxforge | test above with older noiseaugmaster branch | Test: 28.488537 |  | WER: 0.239530, CER: 0.083855 |
| Voxforge | test with speech overlay | Test: 47.783081 |  | WER: 0.383612, CER: 0.175735 |
| Voxforge | test with noise overlay | Test: 51.682060 |  | WER: 0.428566, CER: 0.209789 |
| Voxforge | test with speech and noise overlay | Test: 60.275940 |  | WER: 0.487709, CER: 0.255167 |
| Voxforge | add noise overlay augmentation | Test: 27.940659, Validation: 31.988175 | 28 | WER: 0.219143, CER: 0.076050 |
| Voxforge | change snr=50:206 to snr=24:126 | Test: 26.588453, Validation: 31.151855 | 34 | WER: 0.206141, CER: 0.072018 |
| Voxforge | change to snr=18:9~6 | Test: 26.311581, Validation: 30.531299 | 30 | WER: 0.211865, CER: 0.074281 |
| Voxforge | test above with older noiseaugmaster branch | Test: 26.300938 |  | WER: 0.227466, CER: 0.073827 |
| Voxforge | test with speech overlay | Test: 76.401451 |  | WER: 0.499962, CER: 0.254203 |
| Voxforge | test with noise overlay | Test: 44.011471 |  | WER: 0.376783, CER: 0.165329 |
| Voxforge | test with speech and noise overlay | Test: 65.408264 |  | WER: 0.496168, CER: 0.246516 |
| Voxforge | speech and noise overlay | Test: 27.101889, Validation: 31.407527 | 44 | WER: 0.220243, CER: 0.082179 |
| Voxforge | test above with older noiseaugmaster branch | Test: 27.087360 |  | WER: 0.232094, CER: 0.080319 |
| Voxforge | test with speech overlay | Test: 46.012951 |  | WER: 0.362291, CER: 0.164134 |
| Voxforge | test with noise overlay | Test: 44.035809 |  | WER: 0.377276, CER: 0.171528 |
| Voxforge | test with speech and noise overlay | Test: 53.832214 |  | WER: 0.441768, CER: 0.218798 |
| Tuda + Voxforge + SWC + Mailabs + CommonVoice | test with Voxforge + Tuda + CommonVoice others completely for training, with noise and speech overlay | Test: 22.055849, Validation: 17.613633 | 46 | WER: 0.208809, CER: 0.087215 |
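
The noise and speech overlay tests mix a second signal into the speech at a given signal-to-noise ratio (SNR). The principle can be sketched in plain Python as follows. This is an illustration with hypothetical function names that assumes non-silent inputs, not the augmentation code of DeepSpeech or the noiseaugmaster branch.

```python
import math


def rms(samples):
    """Root mean square amplitude of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def overlay_noise(speech, noise, snr_db):
    """Add noise to speech so the mix has the requested SNR in dB."""
    # SNR(dB) = 20 * log10(rms_speech / rms_noise); solve for the noise gain.
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / rms(noise)
    # Repeat the noise if it is shorter than the speech.
    return [s + gain * noise[i % len(noise)] for i, s in enumerate(speech)]
```

A lower `snr_db` yields a louder overlay, which matches the trend in the list above: models trained with overlay augmentation lose less accuracy when tested with noise or speech overlay.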

Language Model and Checkpoints

Scorer with training transcriptions: Link
Checkpoints TVSMC training with 0.19 WER: Link
Graph model for above checkpoint: Link
