This is the code for the paper
Schulze-Forster, K., Doire, C., Richard, G., & Badeau, R. "Phoneme Level Lyrics Alignment and Text-Informed Singing Voice Separation." IEEE/ACM Transactions on Audio, Speech and Language Processing (2021).
doi: 10.1109/TASLP.2021.3091817
If you use parts of the code in your work, please cite the paper.
📄 Publicly available paper preprint
🔊 Audio examples
📝 MUSDB18 lyrics transcripts
💻 Lyrics alignment tool
Clone the repository to your machine:
```
git clone https://github.com/schufo/plla-tisvs.git
```
The project was implemented using Python 3.6 and a conda environment. You can create the environment with all dependencies using the following command:

```
conda env create -f environment.yml
```

Then activate the environment:

```
conda activate plla_tisvs
```
To prepare the TIMIT and MUSDB audio and text data, run the preprocessing scripts in the indicated order. Note that you must adapt the dataset paths in the scripts to your own setup, as sketched below.
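Before running the scripts, it may help to confirm that the dataset locations you entered actually exist. A minimal sanity check, assuming your local copies live under /data (both paths below are placeholders, not the repository's defaults):

```
# Placeholder paths: substitute the locations you set in the scripts.
ls /data/TIMIT
ls /data/MUSDB18
```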
Models for joint alignment and separation can be trained as follows:
```
python train.py --tag 'pre_trained_joint' --architecture 'InformedOpenUnmix3' --attention 'dtw' --dataset 'timit_music' --text-units 'cmu_phonemes' --epochs 66 --batch-size 16 --nb-channels 1 --nb-workers 4 --samplerate 16000 --nfft 512 --nhop 256 --weight-decay 0 --lr 0.001 --comment 'pre training on speech music mixtures'

python train.py --tag 'JOINT1' --architecture 'InformedOpenUnmix3' --wst-model 'pre_trained_joint' --attention 'dtw' --dataset 'musdb_lyrics' --text-units 'cmu_phonemes' --space-token-only --epochs 2000 --batch-size 16 --nb-channels 1 --nb-workers 4 --samplerate 16000 --nfft 512 --nhop 256 --weight-decay 0 --lr 0.001 --comment '...'

python train.py --tag 'JOINT2' --architecture 'InformedOpenUnmix3' --wst-model 'pre_trained_joint' --attention 'dtw' --dataset 'blended' --speech-examples 1000 --text-units 'cmu_phonemes' --space-token-only --epochs 2000 --batch-size 16 --nb-channels 1 --nb-workers 4 --samplerate 16000 --nfft 512 --nhop 256 --weight-decay 0 --lr 0.001 --comment 'like JOINT1 but added speech examples to training set'

python train.py --tag 'JOINT3' --architecture 'InformedOpenUnmix3' --wst-model 'pre_trained_joint' --attention 'dtw' --dataset 'blended' --speech-examples 1000 --text-units 'cmu_phonemes' --space-token-only --add-silence --epochs 2000 --batch-size 16 --nb-channels 1 --nb-workers 4 --samplerate 16000 --nfft 512 --nhop 256 --weight-decay 0 --lr 0.001 --comment 'like JOINT2 but added silence to singing voice examples'

python train.py --tag 'JOINT_SP' --architecture 'InformedOpenUnmix3' --attention 'dtw' --dataset 'timit_music' --text-units 'cmu_phonemes' --epochs 2000 --batch-size 16 --nb-channels 1 --nb-workers 4 --samplerate 16000 --nfft 512 --nhop 256 --weight-decay 0 --lr 0.001 --comment 'trained only on speech-music mixtures'
```
Before the separation models can be trained using aligned lyrics, the alignments need to be available. They can be obtained from the above models using the script save_alignment_paths.py. We also published the trained model JOINT3 as an alignment tool.
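A minimal usage sketch for this step, assuming save_alignment_paths.py selects a trained model via the same --tag convention as the other scripts in this repository (the flag is an assumption, not documented here; check the script's argument parser):

```
# Hypothetical invocation: compute and save alignments with the
# pre-trained JOINT3 model. The --tag flag is assumed to follow
# the same convention as train.py.
python save_alignment_paths.py --tag 'JOINT3'
```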
Separation models can then be trained as follows:
```
python train.py --tag 'SEQ' --architecture 'InformedOpenUnmix3NA2' --dataset 'musdb_lyrics' --text-units 'cmu_phonemes' --alignment-from 'JOINT3' --space-token-only --epochs 2000 --batch-size 16 --nb-channels 1 --nb-workers 4 --samplerate 16000 --nfft 512 --nhop 256 --weight-decay 0 --lr 0.001 --comment 'informed with aligned text'

python train.py --tag 'SEQ_BL1' --architecture 'InformedOpenUnmix3NA2' --dataset 'musdb_lyrics' --text-units 'ones' --alignment-from 'JOINT3' --fake-alignment --space-token-only --epochs 2000 --batch-size 16 --nb-channels 1 --nb-workers 4 --samplerate 16000 --nfft 512 --nhop 256 --weight-decay 0 --lr 0.001 --comment 'informed with a constant representation and a constant alignment'

python train.py --tag 'SEQ_BL2' --architecture 'InformedOpenUnmix3NA2' --dataset 'musdb_lyrics' --text-units 'voice_activity' --alignment-from 'JOINT3' --space-token-only --epochs 2000 --batch-size 16 --nb-channels 1 --nb-workers 4 --samplerate 16000 --nfft 512 --nhop 256 --weight-decay 0 --lr 0.001 --comment 'informed with voice activity information derived from aligned text'
```
Trained separation models can be evaluated as follows:

```
python eval_separation.py --tag 'SEQ' --testset 'musdb_lyrics' --test-snr 5

# the original mixture is used as input when no test SNR is specified
python eval_separation.py --tag 'SEQ' --testset 'musdb_lyrics'
```
To evaluate the alignments:

- Compute and save alignments:
```
python estimate_alignment.py --tag 'JOINT3' --testset 'Hansen'

python estimate_alignment.py --tag 'JOINT3' --testset 'NUS_acapella'

python estimate_alignment.py --tag 'JOINT3' --testset 'NUS' --snr 0
```
- Compute alignment evaluation scores using eval_alignment.py (see the sketch below)
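A minimal sketch of the scoring step, assuming eval_alignment.py mirrors the --tag/--testset interface of estimate_alignment.py above (an assumption; check the script's argument parser):

```
# Hypothetical invocation; flags are assumed to mirror estimate_alignment.py.
python eval_alignment.py --tag 'JOINT3' --testset 'Hansen'
```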
This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 765068.
Copyright 2021 Kilian Schulze-Forster of Télécom Paris, Institut Polytechnique de Paris. All rights reserved.