
Development of a smartphone version of VOICEVOX #10

Open
Hiroshiba opened this issue Feb 9, 2022 · 32 comments

Comments

@Hiroshiba
Member

Hiroshiba commented Feb 9, 2022

I want to build a smartphone version of VOICEVOX.

Purpose

It should advance VOICEVOX's value of growing the user base and its mission of spreading speech-synthesis characters.

Background

To begin with, most people who make videos are probably high school and university students, since you need free time to make them.
Today's high school and university students do basically everything on their smartphones, and video production is no exception (hard as that is to imagine...).
There are few speech-synthesis apps that run on smartphones, and free ones in particular should be quite rare. That is the niche to target.
This space should be especially hard for companies to enter, because no matter how hard they try it won't be profitable.
The intent of this project is to step into this largely untouched territory.

Goal

For a start, a working demo app that can do TTS would be good enough.
How to move toward an actual release is something to figure out later.

Plan

Development is expected to be OSS-based, since I'd like to borrow the strength of many contributors.
Starting with iOS alone seems fine: the main users of Japanese TTS are in Japan, and iOS devices have relatively strong compute resources.
For the UI framework I'm considering React Native, because VOICEVOX is built with JS and I'd like to expand to multiple platforms.

Challenges

The biggest challenge is how to run inference for the speech-synthesis machine learning models.
Converting them to CoreML looks possible, so I'm investigating that for now.
From a quick look, building onnxruntime for smartphones also seems feasible, but I can hardly find any prior examples, so I suspect a rocky road ahead.

The second challenge is that openjtalk is required.
It's probably most efficient to start on this once this project's C++ TTS library is ready.

The third challenge is the UI. I'll do my best on the design.
For a start, it may be enough if only accent adjustment is possible.

Other

I'm thinking of starting on this myself once my hands are free, but I have many other tasks and haven't been able to get to it.
If you're interested, please leave a comment!

@HyodaKazuaki

The biggest challenge is how to run inference for the speech-synthesis machine learning models.
Converting them to CoreML looks possible, so I'm investigating that for now.
From a quick look, building onnxruntime for smartphones also seems feasible, but I can hardly find any prior examples, so I suspect a rocky road ahead.

Regarding this, there is documentation on building ONNX Runtime for iOS, so I'm sharing it here.
It also describes the build options for using CoreML.
https://onnxruntime.ai/docs/build/ios.html

The operations supported by CoreML are listed in the following document:
https://onnxruntime.ai/docs/execution-providers/CoreML-ExecutionProvider
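For context, execution-provider selection in onnxruntime happens at session creation. A minimal sketch using the Python API as a stand-in for the mobile C API, assuming a build of onnxruntime that includes the CoreML execution provider; "decode.onnx" is a placeholder path:

import onnxruntime as ort

# Prefer CoreML where available and fall back to the CPU provider.
sess = ort.InferenceSession(
    "decode.onnx",
    providers=["CoreMLExecutionProvider", "CPUExecutionProvider"],
)
print(sess.get_providers())  # shows which providers were actually enabled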

@Hiroshiba
Member Author

Thank you!
I had been looking at the same page, but I couldn't find any blog posts reporting a successful build, so I suspect getting onnxruntime to build may be a thorny path.

I hadn't looked at the operation list yet. I can't tell at a glance whether all of VOICEVOX's inference graph can be expressed... some ops may be missing.

@Hiroshiba
Member Author

That said, if we develop this as OSS, onnxruntime, which can load a model from a binary file in local storage, seems like a sounder choice than CoreML, which (apparently?) has no mechanism for distributing encrypted model files.
I'd like to go with onnxruntime!!
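As a sketch of why that matters for encrypted distribution (Python API as a stand-in; decrypt is a hypothetical app-side function): InferenceSession also accepts the serialized model as a bytes object, so a model could be decrypted in memory without ever being written to disk in plain form.

import onnxruntime as ort

with open("decode.onnx.enc", "rb") as f:
    encrypted = f.read()

model_bytes = decrypt(encrypted)  # hypothetical decryption step
# onnxruntime can build a session from in-memory bytes, not just a file path
sess = ort.InferenceSession(model_bytes, providers=["CPUExecutionProvider"])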

@HyodaKazuaki

I hadn't looked at the operation list yet. I can't tell at a glance whether all of VOICEVOX's inference graph can be expressed... some ops may be missing.

If the operations are the same as in the onnx files published in VOICEVOX/voicevox_core, we can check compatibility from those.
Let me take a look.
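For what it's worth, the set of operators a model uses can be listed with a few lines of Python (a sketch; the file names assume the models published in VOICEVOX/voicevox_core):

import onnx

for path in ["yukarin_s.onnx", "yukarin_sa.onnx", "decode.onnx"]:
    model = onnx.load(path)
    ops = sorted({node.op_type for node in model.graph.node})
    print(path, ops)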

which (apparently?) has no mechanism for distributing encrypted model files

On this point, CoreML does appear to support shipping encrypted models:
https://developer.apple.com/documentation/coreml/encrypting_a_model_in_your_app
https://qiita.com/kazuhiro4949/items/becb1850172d2e96281f

CoreML-format models can also be bundled with the app, so models could be distributed together with app releases:
https://developer.apple.com/documentation/coreml/integrating_a_core_ml_model_into_your_app?changes=latest_minor

There is probably no performance difference between CoreML and ONNX.
So if the development policy prioritizes keeping the differences from other platforms as small as possible, it would be better to keep using the ONNX models directly.

@HyodaKazuaki

I checked the operations in the three ONNX models yukarin_s.onnx, yukarin_sa.onnx, and decode.onnx to see whether the CoreML execution provider can be used.
The table below lists the operations used by the three models and their support status.

Quite a few operations are unsupported, so using the CoreML execution provider with ONNX looks difficult.

Operator Supported?
Add Yes
Cast Yes
Concat Yes
ConcatFromSequence No
ConstantOfShape No
Conv Yes
ConvTranspose No
Cos No
Div No
Equal No
Expand No
Gather No
GRU No
LeakyRelu No
Loop No
MatMul Yes
Mul No
Pow No
Range No
ReduceMean No
Relu Yes
Reshape Yes
ScatterND No
Shape No
Sigmoid Yes
Sin No
Slice No
Softmax No
SplitToSequence No
Sqrt No
Sub No
Tanh Yes
Transpose Yes
Unsqueeze No
Where No

For reference, here are notes on converting to CoreML itself.
Conversion from ONNX to CoreML is provided by Core ML Tools, but ONNX conversion is apparently being dropped in the next version.
Direct conversion from PyTorch is provided.
https://developer.apple.com/jp/documentation/coreml/converting_trained_models_to_core_ml/
https://coremltools.readme.io/docs/onnx-conversion
https://coremltools.readme.io/docs/pytorch-conversion
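A minimal sketch of that PyTorch-direct route, assuming a hypothetical module MyModel and a made-up example shape (the real models would need their actual inputs):

import torch
import coremltools as ct

model = MyModel().eval()              # hypothetical PyTorch module
example = torch.zeros(1, 45)
traced = torch.jit.trace(model, example)

# convert the traced module directly, bypassing ONNX entirely
mlmodel = ct.convert(traced, inputs=[ct.TensorType(name="x", shape=example.shape)])
mlmodel.save("model.mlmodel")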

@Hiroshiba
Member Author

If the operations are the same as in the onnx files published in VOICEVOX/voicevox_core

At least for now, they haven't changed!


Thank you for the support table!!!! It's very helpful!!
There are more unsupported operations than I expected...
(I wondered where cos and sin were used; it's the positional encoding...)
I also came to think that using CoreML through onnxruntime would be (quite) difficult.


Thanks as well for the CoreML information.
So there is a way to load from a local file!
In that case it seems we could develop this as OSS either way.
So maybe CoreML after all... hmm...

A couple of other options come to mind: running onnxruntime on the CPU without CoreML, or using the WebGL build of onnxruntime via a WebView.
Going through a WebView sounds painful in its own way, so I'm lukewarm on it,
but since iPhones reportedly have strong CPUs, CPU inference might turn out to be surprisingly fast.
I'd like to quickly test whether CPU inference is usable; is there an easy way to do that? 👀

@HyodaKazuaki

HyodaKazuaki commented Feb 12, 2022

since iPhones reportedly have strong CPUs, CPU inference might turn out to be surprisingly fast.
I'd like to quickly test whether CPU inference is usable; is there an easy way to do that? 👀

The switch to ONNX sped up CPU inference considerably, so iPhones and iPads might actually run it comfortably enough on the CPU.
(That said, some of the currently supported iPhones and iPads are old, so not all of them will run it comfortably.)
CocoaPods (a library manager for iOS and similar platforms) now carries onnxruntime (onnxruntime-mobile-c).
With that, it should be possible to check whether the ONNX models run and how fast they are.
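Before wiring up the CocoaPods build, a rough desktop baseline can be measured with the onnxruntime Python package. A sketch that reads input names and shapes from the model rather than hard-coding them; the dummy inputs are not semantically meaningful, and padding dynamic dimensions to 100 is an assumption:

import time
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("yukarin_s.onnx", providers=["CPUExecutionProvider"])

feeds = {}
for inp in sess.get_inputs():
    shape = [d if isinstance(d, int) else 100 for d in inp.shape]  # guess dynamic dims
    dtype = np.int64 if "int64" in inp.type else np.float32
    feeds[inp.name] = np.zeros(shape, dtype=dtype)

start = time.perf_counter()
sess.run(None, feeds)
print(f"{time.perf_counter() - start:.3f} s")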

@Hiroshiba
Member Author

Ohh, I see!! So it might be fairly easy to verify!!

@Hiroshiba
Member Author

Hiroshiba commented Jun 6, 2022

To find out how fast wasm can be, I wrote some code that runs inference on the onnx models using onnxruntime-web.
https://github.com/Hiroshiba/vv_check_web/tree/6809d140e526eeaa109d64d3483329f63ee71a51

Running inference on the CPU in a browser on a PC, generating about 5 seconds of audio took about 10 seconds.
Native generation finishes in under 1 second even on CPU, so this is roughly 10x slower. Probably not usable as-is.

onnxruntime-web also has a WebGL mode, but some things are unsupported and inference failed.
The error was TypeError: int64 is not supported.
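One workaround that is sometimes tried for this (not guaranteed to be sufficient here, since graph inputs, outputs, and Cast nodes would need the same treatment) is down-casting the model's int64 tensors to int32, which the WebGL backend can handle. A partial sketch covering only the initializers:

import numpy as np
import onnx
from onnx import TensorProto, numpy_helper

model = onnx.load("decode.onnx")  # placeholder path
for init in model.graph.initializer:
    if init.data_type == TensorProto.INT64:
        arr = numpy_helper.to_array(init).astype(np.int32)
        init.CopyFrom(numpy_helper.from_array(arr, init.name))
onnx.save(model, "decode_int32.onnx")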

I'd still like to find out how much faster WebGL would make it.
The code that builds the onnx models is here.

@Hiroshiba
Member Author

I ran a test with onnxruntime-web's threading enabled. (thx @yamachu !!!)
https://github.com/Hiroshiba/vv_check_web/tree/9adb272b576e3c125432459ee32fe6119658ac0f
The time shrank substantially, but generating 5 seconds of audio still took 3.4 seconds on a Core i7-11700, which still feels a bit slow.

I've also started investigating the WebGL route.
What's becoming clear is that some processing inside the pytorch model needs to change.
If you're interested, let's investigate together...!!!

@Patchethium

Besides CoreML, I suggest considering NCNN or tract for mobile deployment; they run as native code. Even though onnxruntime-web can make use of WebGL, wasm can still be pretty slow.

@Hiroshiba
Member Author

NCNN, good one!
Because of encryption, I would like to load the model from memory (not from a file), but I couldn't find in the documentation whether that is possible. ;->

@Patchethium

Check this tutorial; ncnn supports stripping readable information.

@Hiroshiba
Member Author

Great!!!
I will try converting it to an NCNN model.

@Patchethium

Great. BTW, if you're converting from PyTorch, I recommend giving ncnn's pnnx tool a try. It can convert a PyTorch module directly to ncnn without generating redundant ops the way ONNX does.

@Hiroshiba
Member Author

Hiroshiba commented Jun 12, 2022

I tried converting from onnx to ncnn, but there seem to be a lot of errors! ;->

(The converter printed the messages below repeatedly, several hundred lines in total; one occurrence of each is shown.)

Unsupported unsqueeze axes !
Unsupported squeeze axes !
Unsupported slice step !
Shape not supported yet!
Gather not supported yet!
  # axis=0
Gather not supported yet!
  # axis=2
ConstantOfShape not supported yet!
  # value 4
Equal not supported yet!
Where not supported yet!
Expand not supported yet!
Cast not supported yet!
  # to=1
Cast not supported yet!
  # to=7
Cast not supported yet!
  # to=9
Range not supported yet!
ScatterND not supported yet!
Unknown data type 0

@Hiroshiba
Member Author

Hiroshiba commented Jun 12, 2022

Great. BTW, if you're converting from PyTorch, I recommend giving ncnn's pnnx tool a try.

I didn't know such a tool existed!
It takes a bit of effort since it requires TorchScript, but I'd like to give it a try.
(It looks like I can get ncnn param and bin files out of it, but it doesn't say whether they will actually work on ncnn...)
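For reference, a minimal TorchScript export sketch, assuming a hypothetical module MyDecoder and made-up example inputs; pnnx then consumes the saved .pt file:

import torch

model = MyDecoder().eval()  # hypothetical module
example = (
    torch.zeros(100, 1),
    torch.zeros(100, 45),
    torch.tensor([0], dtype=torch.int64),
)

# trace records the operations executed for these example inputs
traced = torch.jit.trace(model, example)
traced.save("decode_script_cpu.pt")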

I see that pnnx was in a separate repository. I will try to use the exe distributed in the releases here.

@Patchethium

Check the second line of its README:

Note: The current implementation is in https://github.com/Tencent/ncnn/tree/master/tools/pnnx

Apparently they merged pnnx into ncnn's repo.

@Hiroshiba
Member Author

Oh, I know that one!
I didn't find the executable binary in ncnn/tools/pnnx, but I did find it in pnnx/pnnx.
Thanks!

@Hiroshiba
Member Author

I tried pnnx!
Execution stopped without any useful error messages.

The .pt file can be found here. hiho_decode_script_cpu.pt is the target to convert.

The shapes I'm passing in look right to me, [-1,1],[-1,45],[1]i64... This seems difficult.
https://github.com/Hiroshiba/yukarin_soso_connector/blob/b875c25a1f2e331c3647a26a692316a9e38d634e/yukarin_soso_connector/jit_forwarder/jit_forwarder.py#L255-L259

$ ./pnnx/pnnx.exe hiho_decode_script_cpu.pt inputshape=[100,1],[100,45],[1]i64 inputshape2=[200,1],[200,45],[1]i64

pnnxparam = hiho_decode_script_cpu.pnnx.param
pnnxbin = hiho_decode_script_cpu.pnnx.bin
pnnxpy = hiho_decode_script_cpu_pnnx.py
ncnnparam = hiho_decode_script_cpu.ncnn.param
ncnnbin = hiho_decode_script_cpu.ncnn.bin
ncnnpy = hiho_decode_script_cpu_ncnn.py
optlevel = 2
device = cpu
inputshape = [100,1]f32,[100,45]f32,[1]i64
inputshape2 = [200,1]f32,[200,45]f32,[1]i64
customop =
moduleop =
############# pass_level0
inline function is_tracing
inline function pad_sequence
inline function pad_sequence
inline function make_pad_mask
inline function make_non_pad_mask
inline module = espnet_pytorch_library.conformer.convolution.ConvolutionModule
inline module = espnet_pytorch_library.conformer.encoder.Encoder
inline module = espnet_pytorch_library.conformer.encoder_layer.EncoderLayer
inline module = espnet_pytorch_library.conformer.swish.Swish
inline module = espnet_pytorch_library.transformer.attention.RelPositionMultiHeadedAttention
inline module = espnet_pytorch_library.transformer.embedding.RelPositionalEncoding
inline module = espnet_pytorch_library.transformer.layer_norm.LayerNorm
inline module = espnet_pytorch_library.transformer.multi_layer_conv.MultiLayeredConv1d
inline module = espnet_pytorch_library.transformer.repeat.MultiSequential
inline module = hifi_gan.models.Generator
inline module = hifi_gan.models.ResBlock1
inline module = yukarin_soso_connector.jit_forwarder.jit_yukarin_sosoa.JitPostnet
inline module = yukarin_soso_connector.jit_forwarder.jit_yukarin_sosoa.JitYukarinSosoa
inline function is_tracing
inline function pad_sequence
inline function pad_sequence
inline function make_pad_mask
inline function make_non_pad_mask
inline module = espnet_pytorch_library.conformer.convolution.ConvolutionModule
inline module = espnet_pytorch_library.conformer.encoder.Encoder
inline module = espnet_pytorch_library.conformer.encoder_layer.EncoderLayer
inline module = espnet_pytorch_library.conformer.swish.Swish
inline module = espnet_pytorch_library.transformer.attention.RelPositionMultiHeadedAttention
inline module = espnet_pytorch_library.transformer.embedding.RelPositionalEncoding
inline module = espnet_pytorch_library.transformer.layer_norm.LayerNorm
inline module = espnet_pytorch_library.transformer.multi_layer_conv.MultiLayeredConv1d
inline module = espnet_pytorch_library.transformer.repeat.MultiSequential
inline module = hifi_gan.models.Generator
inline module = hifi_gan.models.ResBlock1
inline module = yukarin_soso_connector.jit_forwarder.jit_yukarin_sosoa.JitPostnet
inline module = yukarin_soso_connector.jit_forwarder.jit_yukarin_sosoa.JitYukarinSosoa
(Several hundred intermediate tensor names printed here; omitted.)
----------------

@Patchethium

Patchethium commented Jun 14, 2022

The error message is actually quite useful.
For the decoder,

terminate called after throwing an instance of 'c10::Error'
  what():  forward() Expected a value of type 'List[Tensor]' for argument 'f0_list' but instead found type 'Tensor'.

it says that forward() declares f0_list as List[Tensor], but the [-1,1] you pass in pnnx means a 2-D Tensor.
I think you can fix it by stacking the list of f0 into one Tensor, as sketched below. Do the same for the phoneme list.
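A sketch of that fix, assuming all sequences in the list have the same length (otherwise something like torch.nn.utils.rnn.pad_sequence would be needed first); the lists here are made-up stand-ins:

import torch

# hypothetical per-utterance inputs of equal length
f0_list = [torch.zeros(100, 1), torch.zeros(100, 1)]
phoneme_list = [torch.zeros(100, 45), torch.zeros(100, 45)]

# List[Tensor] -> one batched Tensor, which trace/pnnx can describe with a shape
f0 = torch.stack(f0_list)            # shape (2, 100, 1)
phoneme = torch.stack(phoneme_list)  # shape (2, 100, 45)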

I also tried yukarin_s and yukarin_sa and got this error from both:

RuntimeError: index out of range in self

at the forward call of

self.speaker_embedder

I think this could be fixed by specifying an example_input in the jit export with a speaker id no larger than the embedding size.
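A small illustration of the failure mode with a hypothetical 10-speaker embedding; the example input used for export has to stay below num_embeddings:

import torch

embedder = torch.nn.Embedding(num_embeddings=10, embedding_dim=32)

speaker_id = torch.tensor([0], dtype=torch.int64)  # valid: 0 <= id < 10
vec = embedder(speaker_id)                         # works

bad_id = torch.tensor([10], dtype=torch.int64)
# embedder(bad_id) would raise "index out of range in self"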

I'd like to fix these myself, but I don't have access to the original models, so ¯\_(ツ)_/¯

@Hiroshiba
Member Author

Hiroshiba commented Jun 16, 2022

You're right!
I ran the Ubuntu version and got the error!!!

I'd like to fix these myself, but I don't have access to the original models, so ¯\_(ツ)_/¯

I see!
The binary data of the models can be found here.
https://github.com/Hiroshiba/vv_core_inference/releases/tag/0.0.1

The network structure of the model can be found here.
https://github.com/Hiroshiba/yukarin_soso_connector

The conversion to TorchScript can be done with the following command.

python run_jit.py \
    --yukarin_s_model_dir "model/yukarin_s" \
    --yukarin_sa_model_dir "model/yukarin_sa" \
    --yukarin_sosoa_model_dir "model/yukarin_sosoa" \
    --hifigan_model_dir "model/hifigan" \
    --texts "hello" \
    --speaker_ids 0 1

@Hiroshiba
Member Author

Hiroshiba commented Jun 16, 2022

I've changed List[Tensor] to Tensor! Working on this branch:
https://github.com/Hiroshiba/yukarin_soso_connector/tree/to-ncnn

I ran the code above to get a new .pt file, and the level0 optimization pass went through 🎉.
Then I got a wonderful error in the level1 optimization. ;->

############# pass_level1
no attribute value
Segmentation fault

2022/06/24: I created the issue.

@Patchethium

Sorry, I haven't had time to check it out recently 🙇

I created the issue

I guess it's better this way; the maintainer of ncnn is actively involved in the community and will give far better solutions than I could. Nevertheless, I'll keep tracking this issue whenever I have the time.

@Hiroshiba
Member Author

Hiroshiba commented Jun 27, 2022

The ncnn binary for decode is ready!
https://github.com/Hiroshiba/vv_core_inference/releases/tag/ncnn

One constraint of going through pnnx to ncnn is that torch.jit.trace must be used; because of that, yukarin_sa's autoregression doesn't work, and sa hasn't been converted to ncnn yet.

@Patchethium

Have you tried it? Actually, I didn't see any issues with tracing an autoregressive model; see this tutorial.

@Hiroshiba
Member Author

Thanks for letting me know!
In that example, the autoregression code lives in GreedySearchDecoder, where torch.jit.script is used instead of trace.
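A toy illustration of the difference, with a made-up autoregressive module: torch.jit.script keeps the loop as control flow, while torch.jit.trace would unroll it for the particular example inputs and freeze the number of steps:

import torch

class ARDecoder(torch.nn.Module):
    def forward(self, x: torch.Tensor, steps: int) -> torch.Tensor:
        ys = []
        h = x
        for _ in range(steps):  # loop length depends on the input
            h = torch.tanh(h)
            ys.append(h)
        return torch.stack(ys)

scripted = torch.jit.script(ARDecoder())  # the loop survives as control flow
print(scripted(torch.zeros(4), 3).shape)  # torch.Size([3, 4])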

@Patchethium

Patchethium commented Feb 15, 2023

Sorry I wasn't around for a while; I went off to try other frameworks (ncnn, tvm, openvino, TNN, tract...) and ended up with Alibaba's MNN.

Like NCNN, which is (kind of) from Tencent, MNN is made by another big Chinese tech company, Alibaba, the one running AliExpress. That could be either an advantage or a disadvantage; fortunately it has English docs for non-Chinese speakers.

Anyway, I was able to convert the onnx models here to MNN format with little tweaking. predict-duration and predict-intonation work out of the box, and for the decoder I only needed to change an axes attribute. That's impressive compared with NCNN, which can't even run Unsqueeze.

Compile MNN Convert Tool
git clone https://github.com/alibaba/MNN.git
cd MNN
mkdir build
cd build
cmake .. -DMNN_BUILD_CONVERTER=ON
make -j4

# convert
./MNNConvert -f ONNX --modelFile predict-duration.onnx --MNNModel predict-duration.mnn --bizCode biz

# test the inference result
python ../tools/script/fastTestOnnx.py ./onnx/predict-duration.onnx
Modify the decoder
import onnx

model = onnx.load("decode-0.onnx")

# find the offending Unsqueeze node by name
node = next(n for n in model.graph.node if n.name == "Unsqueeze_481")

# replace its first attribute with axes=[0]
node.attribute.remove(node.attribute[0])
axes_attr = onnx.helper.make_attribute("axes", [0])
node.attribute.insert(0, axes_attr)

onnx.save(model, "./onnx/decode-0-modified.onnx")

I haven't written any deployment or inference code yet, since I don't have Android Studio or Xcode on my laptop.

Edit: n.op_type -> n.name

@Hiroshiba
Member Author

That's great!!!!!!!!!!!!!
I'm very curious whether it will work on a smartphone or not!!!!!

@Patchethium

It works; if you go to the docs' About page you'll see:

● iOS platform: static library size for armv7+arm64 platforms is about 5MB, size increase of linked executables is about 620KB, and metallib file is about 600KB.
● Android platform: core so size is about 400KB, OpenCL so is about 400KB, Vulkan so is about 400KB.

Originally it was made for mobile platforms, just like NCNN.

@sevenc-nanashi
Member

sevenc-nanashi commented Apr 20, 2023

Based on the Discord conversations of the past few days and what I learned from trying things myself, I put together a task list.

(Moved to VOICEVOX/voicevox_mobile#28)

@sevenc-nanashi
Member

sevenc-nanashi commented Apr 27, 2023

The new-design API looked like it could reduce the amount of engine code implemented in JS, so I updated the task list to use it.
