PytorchDcTts : A Machine Learning Model for Text-to-speech Synthesis



PytorchDcTts (Pytorch Deep Convolutional Text-to-Speech) is a machine learning model released in October 2017. It is capable of generating an audio file of a voice pronouncing a given input text.


Recurrent Neural Networks (RNNs) are commonly used for speech synthesis, but they take a long time to train. To address this problem, PytorchDcTts builds the speech synthesizer entirely from CNNs, which can be trained in about 15 hours on a typical gaming PC.
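The speed advantage comes from the fact that a convolution over a sequence touches every timestep in parallel, whereas an RNN must step through time one frame at a time. Below is a minimal numpy sketch of a causal dilated 1-D convolution, the building block this style of model uses in place of recurrence; the function name and shapes are illustrative, not taken from the PytorchDcTts code.

```python
import numpy as np

def causal_dilated_conv1d(x, w, dilation=1):
    """Causal dilated 1-D convolution over a sequence.

    x: (T, C_in) input sequence, w: (K, C_in, C_out) kernel.
    Output frame t only sees inputs at t, t-d, t-2d, ..., so the
    layer is safe for autoregressive decoding, yet every frame is
    computed in parallel -- unlike an RNN, which is sequential.
    """
    T, C_in = x.shape
    K, _, C_out = w.shape
    pad = (K - 1) * dilation              # left-pad to preserve causality
    xp = np.vstack([np.zeros((pad, C_in)), x])
    y = np.zeros((T, C_out))
    for k in range(K):                    # sum over the kernel taps
        y += xp[k * dilation : k * dilation + T] @ w[k]
    return y

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 4))              # 10 frames, 4 channels
w = rng.normal(size=(3, 4, 8))            # kernel size 3, 8 output channels
y = causal_dilated_conv1d(x, w, dilation=2)
print(y.shape)                            # (10, 8)
```

Because no output frame depends on a future input, such layers can be stacked and trained on whole utterances at once, which is what makes the roughly 15-hour training time feasible.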

Speech synthesis without deep learning relies on a complex pipeline of components such as a text analyzer, an F0 generator, a spectrum generator, a pause estimator, and a vocoder.

With deep learning, these components can be consolidated into a single end-to-end model, so the output can be computed directly from the input.

The model architecture of PytorchDcTts works as follows.


In the flow diagram above, the input text is vectorized by TextEnc. Attention then pairs the text with the mel spectrogram by computing alignment weights. AudioDec computes the mel spectrogram, and the SSRN (Spectrogram Super-resolution Network) is used to improve the audio quality.
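The pairing step can be sketched as dot-product attention between the text encoding and the mel frames decoded so far. The numpy sketch below uses my own shapes and names (`attend`, `K`, `V`, `Q`), not the actual PytorchDcTts code, to show how each mel frame gets a weighted mix of character encodings.

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attend(K, V, Q):
    """Dot-product attention between text and audio.

    K, V: (d, N) keys/values from the text encoder (N characters).
    Q:    (d, T) queries derived from the mel frames decoded so far.
    Returns the (d, T) context fed to the audio decoder and the
    (N, T) alignment matrix pairing characters with mel frames.
    """
    d = K.shape[0]
    A = softmax(K.T @ Q / np.sqrt(d), axis=0)  # each column sums to 1
    R = V @ A                                  # context vector per frame
    return R, A

rng = np.random.default_rng(1)
K = rng.normal(size=(16, 12))   # encodings of 12 input characters
V = rng.normal(size=(16, 12))
Q = rng.normal(size=(16, 20))   # 20 mel frames decoded so far
R, A = attend(K, V, Q)
print(R.shape, A.shape)         # (16, 20) (12, 20)
```

Plotting `A` gives exactly the kind of alignment picture shown in the example below: for well-trained speech, the bright weights trace a near-diagonal path from characters to frames.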

Below is an example of speech synthesis. From the top, we can see Attention, mel spectrogram, and linear STFT spectrogram.
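The two spectrogram views differ only in their frequency axis: the linear STFT keeps every FFT bin, while the mel spectrogram compresses those bins through triangular mel-scale filters. A self-contained numpy sketch (parameter values such as `n_fft=512` and 80 mel bands are illustrative, not the model's actual settings):

```python
import numpy as np

def stft_mag(x, n_fft=512, hop=128):
    """Magnitude of the linear STFT: frame, window, FFT."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1)).T  # (n_fft//2+1, T)

def mel_filterbank(sr, n_fft, n_mels=80):
    """Triangular filters spaced evenly on the mel scale."""
    hz2mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel2hz = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = mel2hz(np.linspace(0, hz2mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

sr = 22050
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)          # one second of a 440 Hz tone
S = stft_mag(tone)                          # linear STFT spectrogram
M = mel_filterbank(sr, 512) @ S             # 80-band mel spectrogram
print(S.shape[0], M.shape[0])               # 257 80
```

AudioDec predicts the coarse mel representation, and the SSRN's job is to recover the full-resolution linear spectrogram from it before waveform reconstruction.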


The LJ Speech Dataset was used for training. It consists of 13K pairs of text and the corresponding recorded speech, about 24 hours of audio in total.
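LJ Speech ships its transcripts as a pipe-delimited `metadata.csv` (clip id, raw transcript, normalized transcript). A minimal sketch of turning it into training pairs; the two rows here are illustrative stand-ins, not real dataset entries.

```python
import csv
import io

# Illustrative rows in the LJ Speech metadata.csv layout:
# clip id | raw transcript | normalized transcript
sample = (
    "LJ001-0001|Hello world.|Hello world.\n"
    "LJ001-0002|It costs $5.|It costs five dollars.\n"
)

def load_pairs(text):
    """Return (clip_id, normalized_text) pairs for training."""
    reader = csv.reader(io.StringIO(text), delimiter="|",
                        quoting=csv.QUOTE_NONE)
    return [(row[0], row[2]) for row in reader]

pairs = load_pairs(sample)
print(len(pairs), pairs[0][0])  # 2 LJ001-0001
```

The normalized column is the useful one for a character-level model like this, since numerals and symbols are already spelled out as words.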


You can use the following command to output a wav file from any English text.

$ python3 -i "Hello world" -s output.wav

Here is an example of the output speech of an input text introducing ailia SDK.


Trending AI/ML Article Identified & Digested via Granola by Ramsey Elbasheer; a Machine-Driven RSS Bot
