Friday, December 29, 2017

Alphabet’s Tacotron 2 Text-to-Speech Engine Sounds Nearly Indistinguishable From a Human

Alphabet's subsidiary DeepMind developed WaveNet, the neural network that has powered the Google Assistant's speech synthesis since October. It produces better, more realistic audio samples than the search giant's previous text-to-speech system, and, notably, it generates raw audio rather than splicing together recordings from voice actors. Now, researchers at Alphabet have developed a successor, Tacotron 2, which uses multiple neural networks to produce speech almost indistinguishable from a human's.
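To make the raw-audio point concrete, here is a minimal, purely illustrative Python sketch of sample-by-sample autoregressive generation, the approach WaveNet takes instead of concatenating recorded snippets. The predict_next_sample function is a hypothetical stand-in for a trained network, not any real model or API.

```python
import numpy as np

def predict_next_sample(history: np.ndarray) -> float:
    """Hypothetical stand-in for a trained WaveNet-style network that
    predicts the next audio sample from all previous samples."""
    # Trivial placeholder dynamics: a decaying echo of the last sample.
    return 0.5 * float(history[-1])

def generate_raw_audio(num_samples: int) -> np.ndarray:
    """Generate audio one sample at a time, each conditioned on the
    samples before it, rather than splicing pre-recorded fragments."""
    samples = np.zeros(num_samples)
    samples[0] = 0.1  # arbitrary seed value
    for t in range(1, num_samples):
        samples[t] = predict_next_sample(samples[:t])
    return samples

audio = generate_raw_audio(16000)  # one second at 16 kHz (assumed rate)
```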

Tacotron 2 consists of two deep neural networks. As the research paper published this month describes it, the first translates text into a spectrogram, a visual representation of audio frequencies over time. The second, DeepMind's WaveNet, interprets that spectrogram and generates the corresponding audio. The result is an end-to-end engine that can emphasize words, correctly pronounce names, pick up on typographic cues (e.g., stressing words that are italicized or capitalized), and alter its enunciation based on punctuation.
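As a rough sketch of how such a two-stage pipeline fits together, the hypothetical Python functions below stand in for the two trained networks. Neither function is a real API: the 80 mel channels match the spectrogram reported in the paper, while the frame count, hop length, and seed text here are assumptions for illustration.

```python
import numpy as np

# Hypothetical sketch of Tacotron 2's two-stage design. Both functions
# are stand-ins for trained networks, not real APIs.

def text_to_mel_spectrogram(text: str) -> np.ndarray:
    """Stage 1: a seq2seq network maps characters to a mel spectrogram."""
    num_frames = max(1, 10 * len(text))  # assumed rough frame count
    num_mel_channels = 80                # channel count reported in the paper
    return np.random.rand(num_frames, num_mel_channels)

def wavenet_vocoder(mel: np.ndarray) -> np.ndarray:
    """Stage 2: a WaveNet-style model, conditioned on the spectrogram,
    generates the raw waveform (random noise here as a placeholder)."""
    hop_length = 300  # assumed audio samples per spectrogram frame
    return np.random.randn(mel.shape[0] * hop_length)

def synthesize(text: str) -> np.ndarray:
    """End-to-end: text -> mel spectrogram -> raw audio samples."""
    return wavenet_vocoder(text_to_mel_spectrogram(text))

waveform = synthesize("Hello, world!")
```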

It's unclear whether Tacotron 2 will make its way to user-facing services like the Google Assistant, but it would be par for the course. Shortly after the publication of DeepMind's WaveNet research, Google rolled out machine learning-powered speech synthesis in multiple languages on Assistant-powered smartphones, speakers, and tablets.

There's only one catch: right now, the Tacotron 2 system is trained to mimic a single female voice. To generate new voices and speech patterns, Google would need to retrain the system.


from xda-developers http://ift.tt/2zMBl81
via IFTTT
