Created by Eleftherios Fanioudakis and Anastasios Vafeiadis
This repository contains the code for Task 1 (Speech Activity Detection) of the Fearless Steps Challenge. More details about the challenge can be found at Fearless Steps.
You can also check our paper, "Two-Dimensional Convolutional Recurrent Neural Networks for Speech Activity Detection", which was accepted at INTERSPEECH 2019.
Four system output possibilities are considered:
- True Positive (TP) – system correctly identifies start-stop times of speech segments compared to the reference (manual annotation),
- True Negative (TN) – system correctly identifies start-stop times of non-speech segments compared to the reference,
- False Positive (FP) – system incorrectly identifies speech in a segment where the reference identifies the segment as non-speech, and
- False Negative (FN) – system fails to identify speech in a segment where the reference identifies the segment as speech.
SAD error rates measure the amount of time that is misclassified by the system's segmentation of the test audio files. Missing, or failing to detect, actual speech is considered a more serious error than misidentifying its start and end times.
The following link explains the Decision Cost Function (DCF) metric, as well as the '.txt' output file format: Evaluation Plan. In particular, look at pages 14-16 and 25.
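As an illustration of how a DCF of this kind is computed from the four outcomes above, here is a minimal frame-based sketch in Python. The 0.75/0.25 weighting is an assumption reflecting the heavier penalty on missed speech; the exact weights and timing rules are defined in the Evaluation Plan.

```python
# Minimal sketch of a frame-based DCF computation.
# The miss/false-alarm weights are assumptions; see the Evaluation Plan for the official definition.
import numpy as np

def dcf(reference, hypothesis, miss_weight=0.75, fa_weight=0.25):
    """reference, hypothesis: arrays of 0/1 frame labels (1 = speech)."""
    reference = np.asarray(reference, dtype=bool)
    hypothesis = np.asarray(hypothesis, dtype=bool)

    speech_time = reference.sum()          # total annotated speech frames
    nonspeech_time = (~reference).sum()    # total annotated non-speech frames

    p_fn = float((reference & ~hypothesis).sum()) / max(speech_time, 1)     # missed speech (FN rate)
    p_fp = float((~reference & hypothesis).sum()) / max(nonspeech_time, 1)  # false alarms (FP rate)

    return miss_weight * p_fn + fa_weight * p_fp
```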
Library Prerequisites
- python_speech_features
- LibROSA
- tqdm
- Python 2.7 (The scripts can also run with Python 3.5 and above)
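Assuming the libraries are available under their usual PyPI names, they can be installed with:

```bash
pip install python_speech_features librosa tqdm
```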
This script processes the 30 min recordings for training and evaluation into 1 sec chunks (8000 samples at the 8 kHz sampling rate).
We treat this as a multi-label problem: although there are only two classes (0: non-speech and 1: speech), each 1 s wav file has 8000 labels, one per sample.
The script saves a NumPy array for each 1 sec file with a corresponding NumPy array for its labels.
You can run the script as `python extract_sad.py train` for the Train files and `python extract_sad.py test` for the Eval files.
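For reference, a minimal sketch of the kind of chunking the script performs. This is an illustrative approximation, not the actual extract_sad.py; the file name and the label array are placeholders.

```python
# Illustrative sketch: split an 8 kHz recording into 1 s chunks (8000 samples)
# with one 0/1 label per sample. Paths and the label source are placeholders.
import numpy as np
import librosa

SR = 8000          # sampling rate assumed from the 8000-samples-per-second chunks
CHUNK = SR         # 1 second

audio, _ = librosa.load('fs01_train_example.wav', sr=SR)  # hypothetical file name
labels = np.zeros(len(audio), dtype=np.uint8)             # per-sample 0/1 labels from the annotation

n_chunks = len(audio) // CHUNK
for i in range(n_chunks):
    chunk = audio[i * CHUNK:(i + 1) * CHUNK]
    chunk_labels = labels[i * CHUNK:(i + 1) * CHUNK]
    np.save('chunk_%05d.npy' % i, chunk)
    np.save('chunk_%05d_labels.npy' % i, chunk_labels)
```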
This script creates a 129x126 grayscale spectrogram image for each 1 s wav file produced by extract_sad.py. These spectrogram images are used as input to our 2D CRNN.
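One way to obtain a 129x126 spectrogram from an 8000-sample clip is sketched below. The STFT parameters (256-point FFT, hop of 64 samples) are an assumption that reproduces the stated shape with librosa's centred STFT, not necessarily the exact settings used in the repository.

```python
# Illustrative sketch: a 129x126 log-magnitude spectrogram from a 1 s, 8 kHz chunk.
# The n_fft/hop_length values are assumptions that reproduce the stated shape.
import numpy as np
import librosa

chunk = np.load('chunk_00000.npy')                        # 8000-sample chunk from the previous step
stft = librosa.stft(chunk, n_fft=256, hop_length=64)      # 129 frequency bins x 126 frames
spec = librosa.amplitude_to_db(np.abs(stft), ref=np.max)  # log scale for a grayscale image
print(spec.shape)  # (129, 126)
```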
The proposed algorithm was trained on an Intel Core i5-7600K (4 cores, 4 threads) clocked at 4.2 GHz. The GPU was an Nvidia GTX 1080 Ti Founders Edition with 11 GB of GDDR5X memory, 3584 CUDA cores and 11.34 TFLOPS of single-precision (FP32) performance. The PC had 32 GB of DDR4 RAM, and the entire algorithm was developed in Keras 2.2.4 with the TensorFlow 1.13.1 backend, CUDA 10.0 and cuDNN 7.5.
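For orientation, here is a minimal Keras sketch of a 2D CRNN with the input/output shapes described above (a 129x126x1 spectrogram in, 8000 per-sample speech probabilities out). The layer counts and sizes are illustrative assumptions, not the architecture from the paper.

```python
# Illustrative 2D CRNN sketch (Keras 2.x): spectrogram in, per-sample labels out.
# Layer counts and sizes are assumptions, not the published architecture.
from keras.models import Model
from keras.layers import (Input, Conv2D, MaxPooling2D, Permute, Reshape,
                          Bidirectional, GRU, Dense)

inp = Input(shape=(129, 126, 1))                                 # grayscale spectrogram
x = Conv2D(32, (3, 3), padding='same', activation='relu')(inp)
x = MaxPooling2D((2, 2))(x)                                      # -> (64, 63, 32)
x = Conv2D(64, (3, 3), padding='same', activation='relu')(x)
x = MaxPooling2D((2, 2))(x)                                      # -> (32, 31, 64)
x = Permute((2, 1, 3))(x)                                        # time axis first -> (31, 32, 64)
x = Reshape((31, 32 * 64))(x)                                    # -> (31, 2048)
x = Bidirectional(GRU(128))(x)                                   # recurrent summary over frames
out = Dense(8000, activation='sigmoid')(x)                       # one probability per audio sample

model = Model(inp, out)
model.compile(optimizer='adam', loss='binary_crossentropy')
model.summary()
```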