The future of personal devices is widely considered to be voice-assisted. Amazon, Google, Apple and even Facebook are putting more emphasis on their AI platforms and building digital assistants such as Alexa, Google Assistant and Siri. While Alexa and Assistant have entered our homes, they suffer from a common pitfall: these digital assistants sound like machines, often reminding users of HAL 9000 from 2001: A Space Odyssey.
“Tell me a joke” is one of the first things people say to their voice assistant. While these digital assistants can do a number of things, people often converse with them by asking personal or emotional questions. These interactions have pushed big tech companies to make their voice assistants sound less robotic, less monotone and more human. Dutch AI startup DAISYS might have a head start with its breakthrough in voice technology.
Human-sounding voices by means of AI
To get rid of the mechanical sound, companies often tweak their Speech Synthesis Markup Language (SSML) tags, as Amazon does with Alexa. These tags allow Alexa to pause, whisper and bleep out expletives. They also help vary the speed, volume and pitch of its speech. DAISYS B.V. has developed a new way of creating human-sounding voices with the help of artificial intelligence (AI).
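As a rough illustration of the tag-based approach described above, the sketch below assembles an SSML string in Python. The tag names (`break`, `amazon:effect`, `say-as` with `interpret-as="expletive"`, and `prosody`) follow Amazon's published Alexa SSML reference; the helper function names and the sample sentences are hypothetical and exist only for this example. It shows the manual markup route, not how DAISYS generates its voices.

```python
from xml.sax.saxutils import escape


def pause(ms: int) -> str:
    """Insert a silent pause of the given length in milliseconds."""
    return f'<break time="{ms}ms"/>'


def whisper(text: str) -> str:
    """Wrap text in Alexa's whispered voice effect."""
    return f'<amazon:effect name="whispered">{escape(text)}</amazon:effect>'


def bleep(text: str) -> str:
    """Mark text as an expletive so the assistant bleeps it out."""
    return f'<say-as interpret-as="expletive">{escape(text)}</say-as>'


def prosody(text: str, rate: str = "medium", volume: str = "medium",
            pitch: str = "medium") -> str:
    """Vary the speed, volume and pitch of a stretch of speech."""
    return (f'<prosody rate="{rate}" volume="{volume}" pitch="{pitch}">'
            f'{escape(text)}</prosody>')


def speak(*parts: str) -> str:
    """Join already-marked-up fragments inside the root <speak> element."""
    return "<speak>" + "".join(parts) + "</speak>"


# Hypothetical sample utterance, invented for illustration.
ssml = speak(
    escape("Here is a secret."),
    pause(500),
    whisper("I am not a real human."),
    prosody("But I can speak slowly and loudly.", rate="slow", volume="loud"),
    bleep("darn"),
)
print(ssml)
```

Every variation in delivery has to be written into the markup by hand, which is the limitation that an approach with adjustable speech properties aims to avoid.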
With this new voice system, written text is narrated in a natural way, generating new, realistic-sounding voices that don’t exist yet. The technology also offers the ability to adjust speech properties such as speed and pitch in real time. This allows the voice to be customised further and sets it apart from voices created with deepfake technology.
The backbone of this technology is, of course, artificial intelligence and a model capable of processing speech data. The startup from Leiden has worked on its technology with a small international team of AI developers during the past year and a half. Joost Broekens, CTO and co-founder at DAISYS, explains that the startup had to make important adjustments to the existing basic technology during this process.
“The model we developed is significantly changed to be able to do what we do. I cannot tell you now how we did that,” Broekens says.
The startup also had to cleverly train its models by using the right balance of speech data from different speakers. “Because of this we’ve managed to generate new, naturally sounding voices that can be real-time adjusted by means of gender, pitch, power and speed,” Broekens explains in a company blog post.
The critical part seems to be the training phase of DAISYS’ AI model, which must have required a number of different speakers and constant variation in their pitch, power and speed. At inference time, the model appears to produce a human-sounding voice that is both more natural and more realistic than those of existing voice assistants.
“The amount of hours of training depends on the size of the dataset and whether or not we pre-train or fine-tune. Pre-training is in the order of days on multiple high-end GPUs, while fine-tuning is significantly faster, as we found our models are quite good at transfer learning with limited amounts of training (e.g. for a different speaking style or different emotion),” Broekens says. He adds: “Dataset size depends on the goal of the training. Base pre-training uses vast amounts of data from many different speakers (hundreds of hours of segmented audio in total). Fine-tuning uses much less data. We cannot say how much right now, as this is part of the innovation.”
DAISYS’ human-sounding speech technology currently supports English and Dutch. Broekens says the model has been designed in such a way that any “other language with similar phonetic or linguistic characteristics won’t be a problem in principle.” The startup also plans to license the technology once it has fine-tuned the speech model.
Why human sounds cannot be created with deepfake
Barnier Geerling, CEO of DAISYS, says that the new voice technology is suitable for all online and offline surroundings where the human voice is used. It can be used in traditional media, smart devices, games, robots, speech assistants and public announcement systems. “Credible voice technology is becoming ever more important. Nowadays, everything has a voice: your phone, your car, even your coffee machine. Imagine being able to adjust that voice to your preference. That future is now within reach,” Geerling said.
A voice sample posted by DAISYS, built using the new AI model, not only shows its natural tone but also highlights the accent and varying pitch of the voice. The voice profile is clearly different from the natural-sounding voices created using deepfake technology, which are based on the audio data of professional speakers.
Geerling explains that deepfakes cannot be used in text-to-speech for a number of reasons. One of them is that not everyone “wishes to lend out their voice without having control about what is being said with it”. Geerling says: “This technology makes it easier and faster to apply speech-steered technology. The market potential is enormous, think of audio-visual media using voice-overs, or ‘talking’ cars, robots, or appliances. For manufacturers, this means the possibility to integrate realistic speech in their products becomes much easier and more efficient.”