Brainy Voices: Innovative Voice Creation Based on Deep Learning by Acapela Group Research Lab

Discover how Acapela Group can create a synthetic version of any voice based on a few minutes of speech recordings


Jun 29, 2017, 09:16 ET

MONS, Belgium, June 29, 2017 /PRNewswire/ --

Neural networks have revolutionized artificial vision and automatic speech recognition. This machine learning revolution is now delivering on its promise as it enters the text-to-speech arena. Acapela Group is actively working on Deep Neural Networks (DNN), and we are enthusiastic and proud to present the first achievements of our research in this fascinating field, which creates new opportunities for voice interfaces.

Our R&D lab has developed Acapela DNN, an engine capable of creating a voice using a limited amount of existing or new speech recordings.

"Acapela DNN represents 'Acapela's ultimate talking machine', benefiting from our speech expertise and learning from our vast voice and language databases to model voice identities and reproduce speech, in many languages. This is much more than concatenating speech recordings from the studio like we used to do with unit selection. We are talking about creating a voice signal and persona from scratch and in many languages and it is happening now. We need only one week to release a new voice based on a few minutes of speech recordings," says Vincent Pagel, R&D and Linguistic Group manager, Acapela Group.

Synthetic voice creation has traditionally relied on rich audio material recorded by a professional voice actor, in a professional studio and under the supervision of a linguistic expert. Acapela can now create a voice from an average of 10 to 15 minutes of clean audio recordings and the associated text transcription of the audio samples.

Voices can be created from minutes or hours of speech recordings, depending on the targeted usage. In specific cases such as voice replacement for patients, Acapela DNN can work with a few minutes of speech. For professional usage, such as creating a voice for a video game or for a passenger information system, Acapela DNN will need more recordings. The more data is available, the more of the speaker's specific habits the DNN can learn, and the more closely the created voice matches the original.

The first results of voices created using this approach are impressive.

We have worked on voice recordings of well-known people. We have also created voices for individuals who can no longer speak clearly due to surgery or disease. They will be the first to speak with voices created with Acapela DNN.

Listen to voice samples:

The voice samples below were produced from only a few minutes of speech. From the speech recordings provided by the users, Acapela DNN defined a voice ID and, after training, produced a voice that is very close to the original.

John, US English

Sample - Original voice: http://bit.ly/2t3su03
Sample - voice created by Acapela DNN: http://bit.ly/2trp3mp

Stephen, US English

Sample - Original voice: http://bit.ly/2u0PXPB
Sample - voice created by Acapela DNN: http://bit.ly/2t3wMVg

Anonymous user, French

Sample - Original voice: http://bit.ly/2s8J3X5
Sample - voice created by Acapela DNN: http://bit.ly/2sp9ibe

Other ongoing experiments include voices for video games and robots. The possibilities for creating voices based on DNN are limitless. With this new approach, Acapela will push the boundaries of technology, allowing everyone to have a voice.

Material needed: average of 10-15 min of clean recordings + text transcription

Acapela DNN is trained offline on the many different voices in our catalogue. We feed it the text and acoustic databases we have for all of our voices. This means Acapela DNN knows a lot about human speech in general but does not yet know anything about a specific person's voice, and it needs to hear that voice for a while before it can reproduce it.

> 1st pass algorithm: 'Voice ID' parameters to define the digital signature (or sonority) of the vocal tract of the speaker.

> 2nd pass algorithm: Acapela DNN additional training to match the imprint of the voice with its fine-grained details (accents, speaking habits, etc.)

>> Creation of a new voice based on limited amount of audio data
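
The two passes above can be illustrated with a deliberately tiny sketch. This is not Acapela's algorithm: it assumes a toy "voice model" y = w * x + voice_id, where the hypothetical weight `shared_w` stands in for what was learned from many speakers and `voice_id` stands in for the speaker signature. Pass 1 fits only `voice_id` while the shared weight stays frozen; pass 2 continues training both parameters on the same small data set. All names and numbers are invented for illustration.

```python
# Toy illustration only: a "voice" is reduced to y = w * x + voice_id.

def mse(w, voice_id, data):
    """Mean squared error of the toy model on (input, target) pairs."""
    return sum((w * x + voice_id - y) ** 2 for x, y in data) / len(data)

def fit_voice_id(w, data, steps=200, lr=0.1):
    """Pass 1: learn only the speaker offset; the shared weight w is frozen."""
    voice_id = 0.0
    for _ in range(steps):
        grad = sum(2 * (w * x + voice_id - y) for x, y in data) / len(data)
        voice_id -= lr * grad
    return voice_id

def fine_tune(w, voice_id, data, steps=200, lr=0.5):
    """Pass 2: update both parameters to capture finer speaker detail."""
    for _ in range(steps):
        grad_w = sum(2 * (w * x + voice_id - y) * x for x, y in data) / len(data)
        grad_v = sum(2 * (w * x + voice_id - y) for x, y in data) / len(data)
        w -= lr * grad_w
        voice_id -= lr * grad_v
    return w, voice_id

shared_w = 1.0  # hypothetical weight "pre-trained" on many speakers
# Hypothetical samples from a new speaker who follows y = 1.2 * x + 0.5.
new_speaker = [(i / 10, 1.2 * (i / 10) + 0.5) for i in range(10)]

loss_before = mse(shared_w, 0.0, new_speaker)
voice_id = fit_voice_id(shared_w, new_speaker)
loss_pass1 = mse(shared_w, voice_id, new_speaker)
tuned_w, tuned_id = fine_tune(shared_w, voice_id, new_speaker)
loss_pass2 = mse(tuned_w, tuned_id, new_speaker)

print(loss_before, loss_pass1, loss_pass2)
```

In this sketch the error drops after each pass: the first pass gets close using a single speaker parameter, and the second pass removes the remaining mismatch by adapting the shared weight as well.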

About DNN:

A deep neural network (DNN) is an artificial neural network (ANN) with multiple hidden layers between the input and output layers. DNNs can model complex non-linear relationships. We use them in Text-to-Speech to learn the relationship between a set of input texts and their acoustic realizations by different speakers.
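
As a minimal sketch of what "multiple hidden layers" and "non-linear" mean in practice, the following forward pass stacks two hidden layers with a tanh activation between an input and an output layer. The layer sizes, the random weights, and the names `text_features` and `acoustic_frame` are assumptions chosen for illustration; they do not describe the Acapela DNN architecture, and no training is shown.

```python
import math
import random

def dense(inputs, weights, biases):
    """Affine layer: one weighted sum plus bias per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def init_layer(rng, n_in, n_out):
    """Random weights and zero biases for a layer mapping n_in -> n_out."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(features, layers):
    """Apply tanh after every hidden layer; keep the output layer linear."""
    activations = features
    for weights, biases in layers[:-1]:
        activations = [math.tanh(v) for v in dense(activations, weights, biases)]
    out_weights, out_biases = layers[-1]
    return dense(activations, out_weights, out_biases)

rng = random.Random(0)
layer_sizes = [4, 8, 8, 3]  # hypothetical: 4 inputs, two hidden layers of 8, 3 outputs
layers = [init_layer(rng, n_in, n_out)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

text_features = [1.0, 0.0, 0.5, -0.5]  # stand-in for encoded text
acoustic_frame = forward(text_features, layers)  # stand-in for acoustic parameters
print(acoustic_frame)
```

Without the tanh step, the stacked layers would collapse into a single linear mapping; the non-linearity is what lets deeper networks model complex relationships.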

Neural networks are a set of algorithms, modelled loosely after the human brain, that are designed to recognize patterns. They interpret sensory data through a kind of machine perception, labelling or clustering raw input. The patterns they recognize are numerical, contained in vectors, into which all real-world data, be it images, sound, text or time series, must be translated.
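
One common way to translate text into numerical vectors is one-hot encoding, sketched below. The alphabet and the function name `one_hot_encode` are assumptions for illustration; the release does not say how Acapela encodes its input text, and production systems typically use richer linguistic features.

```python
def one_hot_encode(text, alphabet):
    """Map each character to a vector with a single 1 at its alphabet position.

    Assumes every character of `text` appears in `alphabet`.
    """
    index = {ch: i for i, ch in enumerate(alphabet)}
    vectors = []
    for ch in text:
        vec = [0] * len(alphabet)
        vec[index[ch]] = 1
        vectors.append(vec)
    return vectors

alphabet = "abcdefghijklmnopqrstuvwxyz "  # hypothetical 27-symbol alphabet
vectors = one_hot_encode("hi", alphabet)
print(len(vectors), len(vectors[0]))
```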

About Acapela Group: http://www.acapela-group.com

Voice information is everywhere in our daily lives and enriches content and interfaces.

Acapela Group, a leading voice expert with 30 years of experience, invents speech solutions to give you the say. We create voices that read, inform, explain, present, guide, educate, tell stories, help to communicate, alarm, notify and entertain. Our text-to-speech solutions give the say to tiny toys or server farms, AI, screen readers or robots, cars & trains, smartphones, IoT and much more.

We innovate to give a voice to All. Our in-house speech technologies and solutions are designed to provide a smart and pleasant spoken audio result.

Lend an ear to more than 100 voices in 34 languages and accents, or create your own custom voice with Acapela's bespoke expertise.

SOURCE Acapela Group