diff --git a/README.md b/README.md
index 7fd6aae..7bb2ee4 100644
--- a/README.md
+++ b/README.md
@@ -5,38 +5,39 @@ Language: **English** [简体中文](./cn_README.md) [한국어(outdated)](.
 Real-time end-to-end singing voice conversion system based on DDSP (Differentiable Digital Signal Processing).
-## (4.0 - test) New DDSP cascade diffusion model
-Data preparation, configuring the pre-trained encoder (hubert or contentvec ) , pitch extractor (RMVPE) and vocoder (nsf-hifigan) is the same as training a pure DDSP model.
+## (4.0 - Update) New DDSP cascade diffusion model
+Installing dependencies, data preparation, configuring the pre-trained encoder (hubert or contentvec ) , pitch extractor (RMVPE) and vocoder (nsf-hifigan) are the same as training a pure DDSP model (See chapter 1 ~ 3 below).
+
+We provide a pre-trained model here:
+https://huggingface.co/datasets/ms903/DDSP-SVC-4.0/resolve/main/pre-trained-model/model_0.pt (using 'contentvec768l12' encoder)
+
+Move the `model_0.pt` to the model export folder specified by the 'expdir' parameter in `diffusion-new.yaml`, and the program will automatically load the pre-trained model in that folder.
 (1) Preprocessing:
 ```bash
 python preprocess.py -c configs/diffusion-new.yaml
 ```
+
 (2) Train a cascade model (only train one model):
 ```bash
 python train_diff.py -c configs/diffusion-new.yaml
 ```
+Note: There is a temporary problem with fp16 training, but fp32 and bf16 are working normally,
+
 (3) Non-real-time inference:
 ```bash
-python main_diff.py -i -diff -o -k -id -diffid -speedup -method -kstep
+python main_diff.py -i -diff -o -k -id -speedup -method -kstep
 ```
-'kstep' needs to be less than or equal to `k_step_max` in the configuration file.
-
-## Future plan
-The idea of shallow diffusion proposed by this repository has received widespread attention from the SVC community, so we built a more elegant shallow diffusion project: [Diffusion-SVC](https://github.com/CNChTu/Diffusion-SVC).
-
-If you want to try the latest shallow diffusion models, you can choose to migrate to this repository, which may perform better on both real-time and non-real-time SVC.
-
-Also, the version 4.1 update of [SO-VITS-SVC](https://github.com/svc-develop-team/so-vits-svc) has heavily referenced our code. Now you can also use the shallow diffusion model in SO-VITS-SVC.
-
-The reason we created a new repository is that the DDSP section has been completely removed, and yes, shallow diffusion can actually be completely unrelated to DDSP. Unfortunately, DDSP as a technical idea is hardly competitive today, and it hardly produces state-of-the-art results.
-
-Of course, DDSP itself is not without room for improvement, so this repository will continue to update some interesting ideas, which will gradually diverge from the branch of Diffusion-SVC in SO-VITS-SVC.
+The 4.0 version model has a built-in DDSP model, so specifying an external DDSP model using `-ddsp` is unnecessary. The other options have the same meaning as the 3.0 version model, but 'kstep' needs to be less than or equal to `k_step_max` in the configuration file, it is recommended to keep it equal (the default is 100)
+(4) Real-time GUI:
+```bash
+# It's under testing.
+```
 ## (3.0 - Update) Shallow diffusion model (DDSP + Diff-SVC refactor version)
 ![Diagram](diagram.png)
-Data preparation, configuring the pre-trained encoder (hubert or contentvec ) , pitch extractor (RMVPE) and vocoder (nsf-hifigan) is the same as training a pure DDSP model.
+Installing dependencies, data preparation, configuring the pre-trained encoder (hubert or contentvec ) , pitch extractor (RMVPE) and vocoder (nsf-hifigan) are the same as training a pure DDSP model (See chapter 1 ~ 3 below).
 Because the diffusion model is more difficult to train, we provide some pre-trained models here:
@@ -87,11 +88,11 @@ DDSP-SVC is a new open source singing voice conversion project dedicated to the
 Compared with the famous [SO-VITS-SVC](https://github.com/svc-develop-team/so-vits-svc), its training and synthesis have much lower requirements for computer hardware, and the training time can be shortened by orders of magnitude, which is close to the training speed of [RVC](https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI).
-In addition, when performing real-time voice changing, the hardware resource consumption of this project is significantly lower than that of SO-VITS-SVC and RVC, and a lower delay can be achieved by tuning parameters on the same hardware configuration.
+In addition, when performing real-time voice changing, the hardware resource consumption of this project is significantly lower than that of SO-VITS-SVC,but probably slightly higher than the latest version of RVC.
-Although the original synthesis quality of DDSP is not ideal (the original output can be heard in tensorboard while training), after enhancing the sound quality with a pre-trained vocoder based enhancer (old version) or with a shallow diffusion model (new version) , for some data sets, it can achieve the synthesis quality no less than SOVITS-SVC and RVC. The demo outputs are in the `samples` folder, and the related model checkpoint can be downloaded from the release page.
+Although the original synthesis quality of DDSP is not ideal (the original output can be heard in tensorboard while training), after enhancing the sound quality with a pre-trained vocoder based enhancer (old version) or with a shallow diffusion model (new version) , for some datasets, it can achieve the synthesis quality no less than SOVITS-SVC and RVC.
-The old version models are still compatible, the following chapters are the instructions for the old version. Some operations of the new version are the same, see the previous chapter.
+The old version models are still compatible, the following chapters are the instructions for the old version. Some operations of the new version are the same, see the previous chapters.
 Disclaimer: Please make sure to only train DDSP-SVC models with **legally obtained authorized data**, and do not use these models and any audio they synthesize for illegal purposes. The author of this repository is not responsible for any infringement, fraud and other illegal acts caused by the use of these model checkpoints and audio.
@@ -105,6 +106,7 @@ pip install -r requirements.txt
 NOTE : I only test the code using python 3.8 (windows) + torch 1.9.1 + torchaudio 0.6.0, too new or too old dependencies may not work
 UPDATE: python 3.8 (windows) + cuda 11.8 + torch 2.0.0 + torchaudio 2.0.1 works, and training is faster.
+
 ## 2. Configuring the pretrained model
 - Feature Encoder (choose only one):
@@ -141,6 +143,8 @@ python preprocess.py -c configs/sins.yaml
 ```
 for a model of sinusoids additive synthesiser.
+For training the diffusion model, see section 3.0 or 4.0 above.
+
 You can modify the configuration file `config/.yaml` before preprocessing. The default configuration is suitable for training 44.1khz high sampling rate synthesiser with GTX-1660 graphics card.
 NOTE 1: Please keep the sampling rate of all audio clips consistent with the sampling rate in the yaml configuration file ! If it is not consistent, the program can be executed safely, but the resampling during the training process will be very slow.
@@ -149,9 +153,9 @@ NOTE 2: The total number of the audio clips for training dataset is recommended
 NOTE 3: The total number of the audio clips for validation dataset is recommended to be about 10, please don't put too many or it will be very slow to do the validation.
-NOTE 4: If your dataset is not very high quality, set 'f0_extractor' to 'crepe' in the config file. The crepe algorithm has the best noise immunity, but at the cost of greatly increasing the time required for data preprocessing.
+NOTE 4: If your dataset is not very high quality, set 'f0_extractor' to 'rmvpe' in the config file.
-UPDATE: Multi-speaker training is supported now. The 'n_spk' parameter in configuration file controls whether it is a multi-speaker model. If you want to train a **multi-speaker** model, audio folders need to be named with **positive integers not greater than 'n_spk'** to represent speaker ids, the directory structure is like below:
+NOTE 5: Multi-speaker training is supported now. The 'n_spk' parameter in configuration file controls whether it is a multi-speaker model. If you want to train a **multi-speaker** model, audio folders need to be named with **positive integers not greater than 'n_spk'** to represent speaker ids, the directory structure is like below:
 ```bash
 # training dataset
 # the 1st speaker
@@ -203,6 +207,7 @@ tensorboard --logdir=exp
 Test audio samples will be visible in TensorBoard after the first validation.
 NOTE: The test audio samples in Tensorboard are the original outputs of your DDSP-SVC model that is not enhanced by an enhancer. If you want to test the synthetic effect after using the enhancer (which may have higher quality) , please use the method described in the following chapter.
+
 ## 6. Non-real-time VC (**Recommend**)
 Enhance the output using the pretrained vocoder-based enhancer:
 ```bash
@@ -232,6 +237,7 @@ python gui.py
 The front-end uses technologies such as sliding window, cross-fading, SOLA-based splicing and contextual semantic reference, which can achieve sound quality close to non-real-time synthesis with low latency and resource occupation.
 Update: A splicing algorithm based on a phase vocoder is now added, but in most cases the SOLA algorithm already has high enough splicing sound quality, so it is turned off by default. If you are pursuing extreme low-latency real-time sound quality, you can consider turning it on and tuning the parameters carefully, and there is a possibility that the sound quality will be higher. However, a large number of tests have found that if the cross-fade time is longer than 0.1 seconds, the phase vocoder will cause a significant degradation in sound quality.
+
 ## 8. Acknowledgement
 * [ddsp](https://github.com/magenta/ddsp)
 * [pc-ddsp](https://github.com/yxlllc/pc-ddsp)
diff --git a/cn_README.md b/cn_README.md
index 702c970..48e690a 100644
--- a/cn_README.md
+++ b/cn_README.md
@@ -5,39 +5,40 @@ Language: [English](./README.md) **简体中文**
 基于 DDSP(可微分数字信号处理)的实时端到端歌声转换系统
-## (4.0 - 测试) 新的 DDSP 级联扩散模型
-数据准备,配置编码器(hubert 或者 contentvec) ,声码器 (nsf-hifigan) 与音高提取器 (RMVPE) 的环节与训练纯 DDSP 模型相同。
+## (4.0 升级) 新的 DDSP 级联扩散模型
+安装依赖,数据准备,配置编码器(hubert 或者 contentvec) ,声码器 (nsf-hifigan) 与音高提取器 (RMVPE) 的环节与训练纯 DDSP 模型相同 (见下面的章节)。
-(1) 预处理:
+我们提供了一个预训练模型:
+https://huggingface.co/datasets/ms903/DDSP-SVC-4.0/resolve/main/pre-trained-model/model_0.pt (使用 'contentvec768l12' 编码器)
+
+将名为`model_0.pt`的预训练模型, 放到`diffusion-new.yaml`里面 "expdir: exp/*****" 参数指定的模型导出文件夹内, 没有就新建一个, 程序会自动加载该文件夹下的预训练模型。
+
+(1)预处理:
 ```bash
 python preprocess.py -c configs/diffusion-new.yaml
 ```
-(2) 训练级联模型 (只训练一个模型):
+
+(2)训练级联模型 (只训练一个模型):
 ```bash
 python train_diff.py -c configs/diffusion-new.yaml
 ```
-(3) 非实时推理:
+注:fp16 训练暂时有问题,fp32 和 bf16 是可以正常训练的。
+
+(3)非实时推理:
 ```bash
-python main_diff.py -i -diff -o -k -id -diffid -speedup -method -kstep
+python main_diff.py -i -diff -o -k -id -speedup -method -kstep
 ```
-'kstep' 需要小于等于配置文件中的 `k_step_max`。
+4.0版本模型内置了 DDSP 模型,因此不需要使用 -ddsp 指定外部 DDSP 模型, 其他选项与3.0版本模型含义相同,但 kstep 需要小于等于配置文件中的 `k_step_max`,建议保持相等 (默认是 100)。
-## 未来计划
-
-本仓库提出的浅扩散的想法得到了 SVC 社区的广泛关注,因此我们构建了一个更优雅的浅扩散项目:[Diffusion-SVC](https://github.com/CNChTu/Diffusion-SVC).
-
-如果你想尝试最新的浅扩散模型,可以选择迁移至该仓库,它们在实时和非实时 SVC 上都可能有更好的表现.
-
-另外, [SO-VITS-SVC](https://github.com/svc-develop-team/so-vits-svc) 的 4.1 版本更新也很大程度上参考了我们的代码。现在你也可以在 SO-VITS-SVC 里使用浅扩散模型。
-
-之所以我们建立了一个新仓库,是因为 DDSP 的部分已经被完全被移除了,是的,浅扩散实际上可以完全和 DDSP 没关系。很不幸的是,作为一种技术理念的 DDSP 在今天几乎已经没有竞争力,它很难产生最先进的结果。
-
-当然,DDSP 本身也不是没有改进的空间,所以本仓库也会持续更新一些有趣的想法,会与 Diffusion-SVC 和 SO-VITS-SVC 的分支逐渐岔开。
+(4)实时 GUI :
+```bash
+# 正在测试中
+```
 ## (3.0 升级)浅扩散模型 (DDSP + Diff-SVC 重构版)
 ![Diagram](diagram.png)
-数据准备,配置编码器(hubert 或者 contentvec) ,声码器 (nsf-hifigan) 与音高提取器 (RMVPE) 的环节与训练纯 DDSP 模型相同。
+安装依赖,数据准备,配置编码器(hubert 或者 contentvec) ,声码器 (nsf-hifigan) 与音高提取器 (RMVPE) 的环节与训练纯 DDSP 模型相同 (见下面的章节)。
 因为扩散模型更难训练,我们提供了一些预训练模型:
@@ -87,11 +88,11 @@ DDSP-SVC 是一个新的开源歌声转换项目,致力于开发可以在个
 相比于著名的 [SO-VITS-SVC](https://github.com/svc-develop-team/so-vits-svc), 它训练和合成对电脑硬件的要求要低的多,并且训练时长有数量级的缩短,和 [RVC](https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI) 的训练速度接近。
-另外在进行实时变声时,本项目的硬件资源消耗显著低于 SO-VITS-SVC 和 RVC,在相同的硬件配置上经过调参可以达到更低的延迟。
+另外在进行实时变声时,本项目的硬件资源消耗显著低于 SO-VITS-SVC , 但可能略高于RVC 最新版本。
-虽然 DDSP 的原始合成质量不是很理想(训练时在 tensorboard 中可以听到原始输出),但在使用基于预训练声码器的增强器(老版本)或使用浅扩散模型(新版本)增强音质后,对于部分数据集可以达到不亚于 SOVITS-SVC 和 RVC 的合成质量。在`samples`文件夹中包含一个合成示例,相关模型检查点可以从仓库发布页面下载。
+虽然 DDSP 的原始合成质量不是很理想(训练时在 tensorboard 中可以听到原始输出),但在使用基于预训练声码器的增强器(老版本)或使用浅扩散模型(新版本)增强音质后,对于部分数据集可以达到不亚于 SOVITS-SVC 和 RVC 的合成质量。
-老版本的模型仍然兼容的,以下章节是老版本的使用说明。新版本部分操作是相同的,见上一章节。
+老版本的模型仍然是兼容的,以下章节是老版本的使用说明。新版本部分操作是相同的,见之前章节。
 免责声明:请确保仅使用**合法获得的授权数据**训练 DDSP-SVC 模型,不要将这些模型及其合成的任何音频用于非法目的。 本库作者不对因使用这些模型检查点和音频而造成的任何侵权,诈骗等违法行为负责。
@@ -191,14 +192,13 @@ data
 ```bash
 python preprocess.py -c configs/combsub.yaml
 ```
-
 2. 训练基于正弦波加法合成器的模型:
 ```bash
 python preprocess.py -c configs/sins.yaml
 ```
-
 3. 您可以在预处理之前修改配置文件 `config/.yaml`,默认配置适用于GTX-1660 显卡训练 44.1khz 高采样率合成器。
+4. 如果要训练扩散模型,见上述 3.0 或 4.0 章节
 ### 3. 备注:
 1. 请保持所有音频切片的采样率与 yaml 配置文件中的采样率一致!如果不一致,程序可以跑,但训练过程中的重新采样将非常缓慢。(可选:使用Adobe Audition™的响度匹配功能可以一次性完成重采样修改声道和响度匹配。)
@@ -207,12 +207,11 @@ python preprocess.py -c configs/sins.yaml
 3. 验证集的音频切片总数建议为 10 个左右,不要放太多,不然验证过程会很慢。
-4. 如果您的数据集质量不是很高,请在配置文件中将 'f0_extractor' 设为 'crepe'。crepe 算法的抗噪性最好,但代价是会极大增加数据预处理所需的时间。
+4. 如果您的数据集质量不是很高,请在配置文件中将 'f0_extractor' 设为 'rmvpe'.
 5. 配置文件中的 ‘n_spk’ 参数将控制是否训练多说话人模型。如果您要训练**多说话人**模型,为了对说话人进行编号,所有音频文件夹的名称必须是**不大于 ‘n_spk’ 的正整数**。
-## 4. 训练
-### 1. 不使用预训练数据进行训练:
+## 4. 训练
 ```bash
 # 以训练 combsub 模型为例
 python train.py -c configs/combsub.yaml
@@ -222,12 +221,7 @@ python train.py -c configs/combsub.yaml
 2. 可以随时中止训练,然后运行相同的命令来继续训练。
 3. 微调 (finetune):在中止训练后,重新预处理新数据集或更改训练参数(batchsize、lr等),然后运行相同的命令。
-### 2. 使用预训练数据(底模)进行训练:
-1. **使用预训练模型请修改配置文件中的 'n_spk' 参数为 '2' ,同时配置`train`目录结构为多人物目录,不论你是否训练多说话人模型。**
-2. **如果你要训练一个更多说话人的模型,就不要下载预训练模型了。**
-3. 欢迎PR训练的多人底模 (请使用授权同意开源的数据集进行训练)。
-4. 从[**这里**](https://github.com/yxlllc/DDSP-SVC/releases/download/2.0/opencpop+kiritan.zip)下载预训练模型,并将`model_300000.pt`解压到`.\exp\combsub-test\`中
-5. 同不使用预训练数据进行训练一样,启动训练。
+
 ## 5. 可视化
 ```bash
 # 使用tensorboard检查训练状态
@@ -236,6 +230,7 @@ tensorboard --logdir=exp
 第一次验证 (validation) 后,在 TensorBoard 中可以看到合成后的测试音频。
 注:TensorBoard 中的测试音频是 DDSP-SVC 模型的原始输出,并未通过增强器增强。 如果想测试模型使用增强器的合成效果(可能具有更高的合成质量),请使用下一章中描述的方法。
+
 ## 6. 非实时变声
 1. (**推荐**)使用预训练声码器增强 DDSP 的输出结果:
 ```bash
@@ -258,6 +253,7 @@ python main.py -h
 # 将1号说话人和2号说话人的音色按照0.5:0.5的比例混合
 python main.py -i -m -o -k -mix "{1:0.5, 2:0.5}" -e true -eak 0
 ```
+
 ## 7. 实时变声
 用以下命令启动简易操作界面:
 ```bash
@@ -266,6 +262,7 @@ python gui.py
 该前端使用了滑动窗口,交叉淡化,基于SOLA 的拼接和上下文语义参考等技术,在低延迟和资源占用的情况下可以达到接近非实时合成的音质。
 更新:现在加入了基于相位声码器的衔接算法,但是大多数情况下 SOLA 算法已经具有足够高的拼接音质,所以它默认是关闭状态。如果您追求极端的低延迟实时变声音质,可以考虑开启它并仔细调参,有概率音质更高。但大量测试发现,如果交叉淡化时长大于0.1秒,相位声码器反而会造成音质明显劣化。
+
 ## 8. 感谢
 * [ddsp](https://github.com/magenta/ddsp)
 * [pc-ddsp](https://github.com/yxlllc/pc-ddsp)
diff --git a/configs/diffusion-new.yaml b/configs/diffusion-new.yaml
index 6aed1af..728bb07 100644
--- a/configs/diffusion-new.yaml
+++ b/configs/diffusion-new.yaml
@@ -43,7 +43,7 @@ train:
   interval_log: 1
   interval_val: 2000
   interval_force_save: 10000
-  lr: 0.0002
+  lr: 0.00015
   decay_step: 50000
   gamma: 0.5
   weight_decay: 0
diff --git a/samples/source.wav b/samples/source.wav
deleted file mode 100644
index f4e4fbf..0000000
Binary files a/samples/source.wav and /dev/null differ
diff --git a/samples/svc-kiritan+12key.wav b/samples/svc-kiritan+12key.wav
deleted file mode 100644
index 7024fe4..0000000
Binary files a/samples/svc-kiritan+12key.wav and /dev/null differ
diff --git a/samples/svc-opencpop+12key.wav b/samples/svc-opencpop+12key.wav
deleted file mode 100644
index 604d683..0000000
Binary files a/samples/svc-opencpop+12key.wav and /dev/null differ
diff --git a/samples/svc-opencpop_kiritan_mix+12key.wav b/samples/svc-opencpop_kiritan_mix+12key.wav
deleted file mode 100644
index 2e56fcc..0000000
Binary files a/samples/svc-opencpop_kiritan_mix+12key.wav and /dev/null differ
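
Note on the `configs/diffusion-new.yaml` hunk above: it lowers `lr` under `train:` from 0.0002 to 0.00015 and leaves `decay_step: 50000` and `gamma: 0.5` unchanged. As a rough illustration of what that means over a training run, the sketch below assumes these keys drive a plain step-decay schedule, in which the rate is multiplied by `gamma` once every `decay_step` steps. The diff does not show the scheduler itself, so that behavior and the helper name `stepped_lr` are assumptions, not the repository's code.

```python
def stepped_lr(base_lr: float, step: int, decay_step: int = 50000, gamma: float = 0.5) -> float:
    """Learning rate after `step` optimizer steps.

    Assumes a step-decay schedule: the rate is multiplied by `gamma`
    once for every full `decay_step` steps completed. The defaults are
    the `decay_step` and `gamma` values shown in the config hunk.
    """
    return base_lr * gamma ** (step // decay_step)


if __name__ == "__main__":
    # Compare the old (0.0002) and new (0.00015) base rates at a few steps.
    for step in (0, 50000, 100000):
        old = stepped_lr(0.0002, step)
        new = stepped_lr(0.00015, step)
        print(f"step {step:>6}: old lr = {old:.3e}, new lr = {new:.3e}")
```

Under that assumption both schedules decay by the same factor, so the new rate stays at 75% of the old one at every step.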