Home | Previous - Setting up your DeepSpeech training environment | Next - Testing and evaluating your trained model
Training a DeepSpeech model
Before we can train a model, we need to make the training data available to the Docker container. The training data was previously prepared in the instructions for formatting data. Copy or extract it to the directory you specified in your bind mount; this makes it available inside the Docker container.
$ cd deepspeech-data
$ ls -als cv-corpus-6.1-2020-12-11/
total 12
4 drwxr-xr-x 3 kathyreid kathyreid 4096 Feb 9 10:42 ./
4 drwxrwxr-x 7 kathyreid kathyreid 4096 Feb 9 10:43 ../
4 drwxr-xr-x 3 kathyreid kathyreid 4096 Feb 9 10:43 id/
We're now ready to begin training.
We're going to walk through some of the key parameters you can use with `DeepSpeech.py`.
python3 DeepSpeech.py \
--train_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/train.csv \
--dev_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/dev.csv \
--test_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv
Do not run this yet
The `--train_files`, `--dev_files` and `--test_files` options take a path to the relevant data, which was prepared in the section on data formatting.
As you are training your model, DeepSpeech will store checkpoints to disk. The checkpoint allows interruption to training, and to restart training from the checkpoint, saving hours of training time.
Because we have our training environment configured to use Docker, we must ensure that our checkpoint directories are stored in the directory used by the bind mount, so that they persist in the event of failure.
To specify the checkpoint directory, use the `--checkpoint_dir` parameter with `DeepSpeech.py`:
python3 DeepSpeech.py \
--train_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/train.csv \
--dev_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/dev.csv \
--test_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv \
--checkpoint_dir deepspeech-data/checkpoints
Do not run this yet
Checkpoints are stored as TensorFlow `tf.Variable` objects. This is a binary file format; that is, you won't be able to read it with a text editor. The checkpoint stores all the weights and biases of the current state of the neural network as training progresses.
Checkpoints are named by the total number of steps completed. For example, if you train for 100 epochs at 2000 steps per epoch, then the final checkpoint will be named `200000`.
~/deepspeech-data/checkpoints-true-id$ ls -als
total 1053716
4 drwxr-xr-x 2 root root 4096 Feb 24 14:17 ./
4 drwxrwxr-x 5 root root 4096 Feb 24 13:18 ../
174376 -rw-r--r-- 1 root root 178557296 Feb 24 14:11 best_dev-12774.data-00000-of-00001
4 -rw-r--r-- 1 root root 1469 Feb 24 14:11 best_dev-12774.index
1236 -rw-r--r-- 1 root root 1262944 Feb 24 14:11 best_dev-12774.meta
4 -rw-r--r-- 1 root root 85 Feb 24 14:11 best_dev_checkpoint
4 -rw-r--r-- 1 root root 247 Feb 24 14:17 checkpoint
4 -rw-r--r-- 1 root root 3888 Feb 24 13:18 flags.txt
174376 -rw-r--r-- 1 root root 178557296 Feb 24 14:09 train-12774.data-00000-of-00001
4 -rw-r--r-- 1 root root 1469 Feb 24 14:09 train-12774.index
1236 -rw-r--r-- 1 root root 1262938 Feb 24 14:09 train-12774.meta
174376 -rw-r--r-- 1 root root 178557296 Feb 24 14:13 train-14903.data-00000-of-00001
4 -rw-r--r-- 1 root root 1469 Feb 24 14:13 train-14903.index
1236 -rw-r--r-- 1 root root 1262938 Feb 24 14:13 train-14903.meta
174376 -rw-r--r-- 1 root root 178557296 Feb 24 14:17 train-17032.data-00000-of-00001
4 -rw-r--r-- 1 root root 1469 Feb 24 14:17 train-17032.index
1236 -rw-r--r-- 1 root root 1262938 Feb 24 14:17 train-17032.meta
174376 -rw-r--r-- 1 root root 178557296 Feb 24 14:01 train-19161.data-00000-of-00001
4 -rw-r--r-- 1 root root 1469 Feb 24 14:01 train-19161.index
1236 -rw-r--r-- 1 root root 1262938 Feb 24 14:01 train-19161.meta
174376 -rw-r--r-- 1 root root 178557296 Feb 24 14:05 train-21290.data-00000-of-00001
4 -rw-r--r-- 1 root root 1469 Feb 24 14:05 train-21290.index
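If you want to inspect what a checkpoint contains, TensorFlow provides a helper for listing the variables stored in it. A minimal sketch, assuming the TensorFlow 1.x installation used by DeepSpeech and a checkpoint prefix taken from the listing above:

```python
# Minimal sketch: list the tf.Variable entries stored in a checkpoint.
# The prefix is the checkpoint name without the .data/.index suffixes.
import tensorflow as tf

checkpoint_prefix = "deepspeech-data/checkpoints/best_dev-12774"

# Prints (variable_name, shape) for each weight and bias in the network.
for name, shape in tf.train.list_variables(checkpoint_prefix):
    print(name, shape)
```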
Checkpoints can consume a lot of disk space, so you may wish to configure how often a checkpoint is written to disk, and how many checkpoints are stored.

- `--checkpoint_secs` specifies the time interval for storing a checkpoint. The default is `600`, or every ten minutes. You may wish to increase this if you have limited disk space.
- `--max_to_keep` specifies how many checkpoints to keep. The default is `5`. You may wish to decrease this if you have limited disk space; with checkpoints of around 180MB each, as in the listing above, the defaults can consume close to a gigabyte.
In this example we will store a checkpoint every 30 minutes, and keep only 3 checkpoints.
python3 DeepSpeech.py \
--train_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/train.csv \
--dev_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/dev.csv \
--test_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv \
--checkpoint_dir deepspeech-data/checkpoints \
--checkpoint_secs 1800 \
--max_to_keep 3
Do not run this yet
In some cases, you may wish to load checkpoints from one location, but save checkpoints to another location - for example if you are doing fine tuning or transfer learning.

- `--load_checkpoint_dir` specifies the directory to load checkpoints from.
- `--save_checkpoint_dir` specifies the directory to save checkpoints to.
In this example we will load existing checkpoints from one directory, and save new checkpoints to another.
python3 DeepSpeech.py \
--train_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/train.csv \
--dev_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/dev.csv \
--test_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv \
--load_checkpoint_dir deepspeech-data/checkpoints-to-train-from \
--save_checkpoint_dir deepspeech-data/checkpoints-to-save-to
Do not run this yet
Again, because we have our training environment configured to use Docker, we must ensure that our trained model is stored in the directory used by the bind mount, so that it persists in the event of failure of the Docker container.
To specify where the trained model should be saved, use the `--export_dir` parameter with `DeepSpeech.py`:
python3 DeepSpeech.py \
--train_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/train.csv \
--dev_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/dev.csv \
--test_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv \
--checkpoint_dir deepspeech-data/checkpoints \
--export_dir deepspeech-data/exported-model
You can run this command to start training.

For a full list of parameters that can be passed to `DeepSpeech.py`, please consult the documentation.

`DeepSpeech.py` has many parameters - too many to cover in an introductory PlayBook. Here are some of the commonly used parameters that are useful to explore as you begin to train speech recognition models with DeepSpeech.
The `n_hidden` parameter

Neural networks work through a series of layers. Usually there is an input layer, which takes an input - in this case an audio recording - a series of hidden layers which identify features of the input, and an output layer, which makes a prediction - in this case a character.

On large datasets, you need wide hidden layers to arrive at an accurate trained model. With smaller datasets, often called toy corpora or toy datasets, the hidden layers do not need to be as wide.

If you are learning how to train using DeepSpeech, and are working with a small dataset, you will save time by reducing the value of `--n_hidden`. This reduces the width of the hidden layers in the neural network, which both reduces the amount of computing resources consumed during training and makes training a model much faster.

The `--n_hidden` parameter has a default value of `2048`.
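To get a feel for why this matters, consider that the weight matrix joining two consecutive hidden layers scales with the square of the layer width. A back-of-the-envelope sketch in plain Python (illustrative only, not DeepSpeech code):

```python
# Rough illustration of how layer width drives model size: a pair of
# consecutive fully connected layers of width n is joined by an n x n
# weight matrix, so the parameter count scales with n squared.
for n_hidden in (2048, 64):
    print(f"n_hidden={n_hidden}: ~{n_hidden * n_hidden:,} weights per hidden-to-hidden connection")

# n_hidden=2048: ~4,194,304 weights per hidden-to-hidden connection
# n_hidden=64: ~4,096 weights per hidden-to-hidden connection
```

With that in mind, an example of training with a reduced `--n_hidden` value would be: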
python3 DeepSpeech.py \
--train_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/train.csv \
--dev_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/dev.csv \
--test_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv \
--checkpoint_dir deepspeech-data/checkpoints \
--export_dir deepspeech-data/exported-model \
--n_hidden 64
In neural networks, the learning rate is the rate at which the neural network makes adjustments to the predictions it generates. The accuracy of predictions is measured using the loss; the lower the loss, the smaller the difference between the neural network's predictions and the actual known values. If training is effective, loss will reduce over time. A neural network that has a loss of `0` makes perfect predictions.

If the learning rate is too low, predictions will take a long time to align with the actual targets. If the learning rate is too high, predictions will overshoot the actual targets. The learning rate has to strike a balance between exploration and exploitation.
If loss is not reducing over time, then the training is said to have plateaued - that is, the adjustments to the predictions are not reducing loss. By adjusting the learning rate, and other parameters, we may escape the plateau and continue to decrease loss.
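To make the mechanics concrete, here is a minimal sketch of plateau-based learning rate reduction in plain Python. This illustrates the general technique, not DeepSpeech's actual implementation; the names mirror the flags described below:

```python
# Illustrative reduce-on-plateau logic (the general technique, not
# DeepSpeech's code). If the best loss has not improved during the last
# `plateau_epochs` epochs, scale the learning rate by `plateau_reduction`.
def reduce_lr_on_plateau(epoch_losses, lr, plateau_epochs=10, plateau_reduction=0.1):
    if len(epoch_losses) <= plateau_epochs:
        return lr
    best_before = min(epoch_losses[:-plateau_epochs])
    best_recent = min(epoch_losses[-plateau_epochs:])
    if best_recent >= best_before:  # no improvement: we have plateaued
        return lr * plateau_reduction
    return lr

# Loss stalls at 5.0 for ten epochs, so the learning rate is cut tenfold.
losses = [9.0, 7.0, 5.0] + [5.0] * 10
print(reduce_lr_on_plateau(losses, lr=0.001))  # ~0.0001
```

DeepSpeech exposes this behaviour through the following parameters: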
- The `--reduce_lr_on_plateau` parameter instructs `DeepSpeech.py` to automatically reduce the learning rate if a plateau is detected. By default, this is `false`.
- The `--plateau_epochs` parameter specifies the number of epochs of training during which there is no reduction in loss that should be considered a plateau. The default value is `10`.
- The `--plateau_reduction` parameter specifies a multiplicative factor that is applied to the current learning rate if a plateau is detected. This number must be less than `1`, otherwise it will increase the learning rate. The default value is `0.1`.
An example of training with these parameters would be:
python3 DeepSpeech.py \
--train_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/train.csv \
--dev_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/dev.csv \
--test_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv \
--checkpoint_dir deepspeech-data/checkpoints \
--export_dir deepspeech-data/exported-model \
--n_hidden 64 \
--reduce_lr_on_plateau true \
--plateau_epochs 8 \
--plateau_reduction 0.08
If training is not resulting in a reduction of loss over time, you can pass parameters to `DeepSpeech.py` that will stop training. This is called early stopping, and is useful if you are using cloud compute resources, or shared resources, and can't monitor the training continuously.

- The `--early_stop` parameter enables early stopping. It is set to `false` by default.
- The `--es_epochs` parameter takes an integer of the number of epochs with no improvement after which training will be stopped. It is set to `25` by default; that is, if this parameter is omitted but `--early_stop` is set to `true`, a value of `25` is used.
- The `--es_min_delta` parameter is the minimum change in loss per epoch that qualifies as an improvement. By default it is set to `0.05`.
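The interplay between these flags can be sketched in a few lines of plain Python. Again, this illustrates the general technique rather than DeepSpeech's implementation:

```python
# Illustrative early stopping check (the general technique, not
# DeepSpeech's code). Stop when no epoch in the last `es_epochs` has
# improved on the previous best loss by at least `es_min_delta`.
def should_stop(epoch_losses, es_epochs=25, es_min_delta=0.05):
    if len(epoch_losses) <= es_epochs:
        return False
    best_before = min(epoch_losses[:-es_epochs])
    best_recent = min(epoch_losses[-es_epochs:])
    return best_recent > best_before - es_min_delta

# Loss improved early, then flatlined for 25 epochs: time to stop.
losses = [10.0, 8.0, 6.0] + [5.99] * 25
print(should_stop(losses))  # True
```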
An example of training with these parameters would be:
python3 DeepSpeech.py \
--train_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/train.csv \
--dev_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/dev.csv \
--test_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv \
--checkpoint_dir deepspeech-data/checkpoints \
--export_dir deepspeech-data/exported-model \
--n_hidden 64 \
--reduce_lr_on_plateau true \
--plateau_epochs 8 \
--plateau_reduction 0.08 \
--early_stop true \
--es_epochs 10 \
--es_min_delta 0.06
In machine learning, one of the risks during training is overfitting. Overfitting is where training creates a model that does not generalize well; that is, it fits only the data on which it was trained, and new data is not recognized accurately during inference.

Dropout is a technique used to reduce overfitting. In dropout, nodes are randomly dropped from the neural network during training. This simulates the effect of more diverse data, and is a computationally cheap way of reducing overfitting and improving the generalizability of the model.
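As a concrete illustration, here is inverted dropout applied to a layer's activations in plain Python with NumPy. This shows the general technique, not DeepSpeech's internals:

```python
# Illustrative inverted dropout (the general technique, not DeepSpeech's
# internals). Each node is kept with probability (1 - dropout_rate), and
# surviving activations are scaled up so the expected output is unchanged.
import numpy as np

def dropout(activations, dropout_rate=0.3, rng=np.random.default_rng(0)):
    keep_prob = 1.0 - dropout_rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

layer_output = np.ones(10)
print(dropout(layer_output))  # roughly 3 of the 10 values are zeroed out
```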
Dropout can be set for any layer of a neural network. The parameter that has the most effect for DeepSpeech training is `--dropout_rate`, which controls the feedforward layers of the neural network. To see the full set of dropout parameters, consult the DeepSpeech documentation.

- The `--dropout_rate` parameter specifies what proportion of nodes should be dropped from the neural network during training. The default value is `0.05`. However, if you are training on less than thousands of hours of voice data, you will find a value of `0.3` to `0.4` works better to prevent overfitting.
An example of training with this parameter would be:
python3 DeepSpeech.py \
--train_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/train.csv \
--dev_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/dev.csv \
--test_files deepspeech-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv \
--checkpoint_dir deepspeech-data/checkpoints \
--export_dir deepspeech-data/exported-model \
--n_hidden 64 \
--reduce_lr_on_plateau true \
--plateau_epochs 8 \
--plateau_reduction 0.08 \
--early_stop true \
--es_epochs 10 \
--es_min_delta 0.06 \
--dropout_rate 0.3
In training, a step is one update of the gradient; that is, one attempt to find the lowest, or minimal, loss. The amount of processing done in one step depends on the batch size. By default, `DeepSpeech.py` has a batch size of `1`; that is, it processes one audio file in each step.

An epoch is one full cycle through the training data. That is, if you have 1000 files listed in your `train.tsv` file, then you will expect to process 1000 steps per epoch (assuming a batch size of `1`).

To find out how many steps to expect in each epoch, you can count the number of lines in your `train.tsv` file:
~/deepspeech-data/cv-corpus-6.1-2020-12-11/id$ wc -l train.tsv
2131 train.tsv
In this case there would be roughly `2131` steps per epoch (one of those lines is the column header, so the exact count is `2130`).
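If you increase the batch size, the number of steps per epoch shrinks proportionally. A quick sketch of the arithmetic, using the sample count from above:

```python
# Steps per epoch = training samples divided by batch size, rounded up
# because the final partial batch still counts as a step.
import math

num_samples = 2130  # lines in train.tsv minus the header row
for batch_size in (1, 8, 32):
    print(f"batch_size={batch_size}: {math.ceil(num_samples / batch_size)} steps per epoch")

# batch_size=1: 2130 steps per epoch
# batch_size=8: 267 steps per epoch
# batch_size=32: 67 steps per epoch
```

The relevant parameters are: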
- `--epochs` specifies how many epochs to train. It has a default of `75`, which would be appropriate for training tens to hundreds of hours of audio. If you have thousands of hours of audio, you may wish to increase the number of epochs to around 150-300.
- `--train_batch_size`, `--dev_batch_size` and `--test_batch_size` all specify the batch size per step. These all have a default value of `1`. Increasing the batch size increases the amount of memory required to process a step; you need to be aware of this before increasing the batch size.
Advanced training options are available, such as feature cache and augmentation. They are beyond the scope of this PlayBook, but you can read more about them in the DeepSpeech documentation.
For a full list of parameters that can be passed to `DeepSpeech.py`, please consult the DeepSpeech documentation.
In a separate terminal (i.e. not from the session where you have the Docker container open), run the command `nvtop`. You should see the `DeepSpeech.py` process consuming all available GPUs.
If you do not see the GPU(s) being heavily utilised, you may be training only on your CPUs and you should double check your environment.
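If you want to confirm from inside the container that TensorFlow can see the GPU(s) at all, you can list the local devices. A minimal check, assuming the TensorFlow 1.x installation that ships with the DeepSpeech training image:

```python
# Quick check that TensorFlow can see the GPU(s) inside the container.
from tensorflow.python.client import device_lib

devices = device_lib.list_local_devices()
print([d.name for d in devices if d.device_type == "GPU"])
# Expect entries like ['/device:GPU:0']; an empty list means CPU-only.
```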
`Failed to get convolution algorithm` error when training

You can safely skip this section if you have not encountered this error

There have been several reports of an error similar to the below when training is initiated. Anecdotal evidence suggests that the error is more likely to be encountered if you are training using an RTX-model GPU.

The error will look like this:
Epoch 0 | Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node tower_0/conv1d}}]]
[[concat/concat/_99]]
(1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node tower_0/conv1d}}]]
0 successful operations.
0 derived errors ignored.
To work around this error, you will need to set the `TF_FORCE_GPU_ALLOW_GROWTH` flag to `True`. This is done in the file `DeepSpeech/training/deepspeech_training/util/config.py`, which you should edit as below:
root@687a2e3516d7:/DeepSpeech/training/deepspeech_training/util# nano config.py

...
# Standard session configuration that'll be used for all new sessions.
c.session_config = tfv1.ConfigProto(allow_soft_placement=True,
                                    log_device_placement=FLAGS.log_placement,
                                    inter_op_parallelism_threads=FLAGS.inter_op_parallelism_threads,
                                    intra_op_parallelism_threads=FLAGS.intra_op_parallelism_threads,
                                    gpu_options=tfv1.GPUOptions(allow_growth=FLAGS.use_allow_growth))

# Set TF_FORCE_GPU_ALLOW_GROWTH to work around cuDNN error on RTX GPUs
c.session_config.gpu_options.allow_growth = True
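Alternatively, recent versions of TensorFlow honour an environment variable with the same name, so you may be able to apply the workaround without editing `config.py`. A minimal sketch; it must take effect before TensorFlow initialises the GPU:

```python
# Alternative workaround: set the environment variable before TensorFlow
# is imported and creates its first session.
import os
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
```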
Home | Previous - Setting up your DeepSpeech training environment | Next - Testing and evaluating your trained model