“My voice isn’t me, it sounds like a robot. I want it to be me,” eight-year-old Maeve Flack told speech scientist Dr Rupal Patel during their first meeting in 2014.
It was a frustration Dr Patel had heard many times before.
Maeve, who has cerebral palsy, was one of the hundreds of thousands of speech-impaired Americans using a wide range of assistive technologies to communicate, but with only a handful of impersonal digital voices to choose from.
Maeve was using a generic adult female voice called “Heather” on the device her family called her “talker”, but it didn’t reflect her age or personality—much in the same way astrophysicist Stephen Hawking’s iconic American-sounding “Perfect Paul” voice didn’t speak to his British heritage.
In a TED Talk broadcast earlier that year, Dr Patel described how her company VocaliD was using AI-powered technology to match speech-impaired people with healthy talkers and blend their vocal samples into bespoke synthetic voices, each as unique as a fingerprint.
“I decided to crowdsource voices to create a huge bank of voices from which we could create unique vocal identities,” explains Dr Patel. “The TED Talk really precipitated everything.”
Maeve had seen the talk and wanted to be an early adopter. Her sister Erin, who was nine at the time, would be her contributor. Maeve became one of seven “trailblazer” recipients, funded through an Indiegogo campaign, to receive a voice within the year; since then, VocaliD has delivered more than 250 one-of-a-kind voices to recipients in the US, UK, Ireland and Australia.
Dr Patel hopes she can create more voices for the tens of millions of people using augmentative and alternative communication (AAC) devices around the world as more people learn about VocaliD.
“Currently, there are 26,000 people on the platform and, in totality, I think we’ve banked 14 million sentences,” says Dr Patel.
No matter how little of an individual’s speech remains, they have a unique vocal identity: they can still modulate their source (the pitch, tempo and loudness of the vibrations they generate using their voice box) even if they can’t properly control their filter (the tongue, lips and mouth) to produce consonants and vowels.
Even a single “aahhhhhhh” contains enough vocal DNA to seed the personalisation process. By using software to combine a speech-impaired person’s vocalisation with the full set of sounds in English recorded by a healthy talker, Dr Patel can create new computerised voices for a wide range of AAC devices.
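The idea that a single sustained vowel carries measurable identity can be sketched in a few lines of Python. The example below is purely illustrative rather than VocaliD’s actual code, and the file name is a hypothetical stand-in: it estimates the fundamental frequency, the pitch of the “source”, from a short “aahh” recording using simple autocorrelation.

```python
# Illustrative sketch only (not VocaliD's code): estimate the fundamental
# frequency, i.e. the pitch of the vocal "source", from a short sustained
# vowel. The WAV file name is a hypothetical placeholder.
import numpy as np
from scipy.io import wavfile


def estimate_pitch_hz(samples: np.ndarray, sample_rate: int,
                      fmin: float = 60.0, fmax: float = 400.0) -> float:
    """Return a rough fundamental-frequency estimate in Hz via autocorrelation."""
    x = samples.astype(np.float64)
    x -= x.mean()                                    # remove DC offset
    frame = x[: int(0.05 * sample_rate)]             # analyse a 50 ms frame
    corr = np.correlate(frame, frame, mode="full")   # autocorrelation
    corr = corr[len(corr) // 2:]                     # keep non-negative lags
    lag_min = int(sample_rate / fmax)                # shortest plausible period
    lag_max = int(sample_rate / fmin)                # longest plausible period
    best_lag = lag_min + int(np.argmax(corr[lag_min:lag_max]))
    return sample_rate / best_lag


rate, audio = wavfile.read("aahh_sample.wav")        # hypothetical recording
if audio.ndim > 1:
    audio = audio.mean(axis=1)                       # mix stereo down to mono
print(f"Estimated pitch: {estimate_pitch_hz(audio, rate):.1f} Hz")
```

Pitch is only the “source” half of the story; in VocaliD’s approach, the “filter”, the consonants and vowels recorded by a healthy talker, is layered on top of source characteristics like this one.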
The first recordings were made in VocaliD’s Boston-based lab but in December 2014, Dr Patel launched the Human Voicebank online portal so anyone with a home computer, internet connection and microphone-headset could contribute their voice.
The first contributors had to record 3,500 sentences, which took around eight hours, and then VocaliD’s engineers got to work, spending another 20 to 30 hours parsing out the speech into a data set from which the digital voice could be generated.
The latest iteration of VocaliD’s software takes just two and a half hours, and far fewer engineering hours, to bank a voice, although only 60 to 90 minutes of good-quality speech from one person is actually needed, and the team can work with even less. VocaliD’s latest synthesis engine, which couples Google’s open-source WaveNet and Tacotron software with a proprietary vocoder, can combine recordings from multiple contributors with samples from a single recipient to create a complete voice in around 10 days.
Blending multiple contributors has several benefits: it enables the team to utilise incomplete data sets (many people don’t finish banking their voice fully) while giving more uniformity to the final voice. It also protects the vocal identity of each contributor, something that’s becoming more of a concern with the proliferation of voiceprint matching as a security protocol, like in telephone banking.
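Dr Patel doesn’t spell out the blending maths, but a common technique in modern neural text-to-speech gives a feel for it: each voice is summarised as a fixed-length speaker embedding, and a weighted average of several embeddings yields a composite identity. The sketch below is conceptual only; the embedding values are random placeholders and the weights are arbitrary assumptions, not VocaliD’s recipe.

```python
# Conceptual sketch of blending donor voices with a recipient's vocalisation
# by averaging speaker embeddings. Values are random placeholders; a real
# system would obtain them from a trained speaker encoder.
import numpy as np

EMBEDDING_DIM = 256                     # typical size for speaker vectors
rng = np.random.default_rng(seed=0)

contributors = rng.normal(size=(3, EMBEDDING_DIM))   # three donor voices
recipient = rng.normal(size=EMBEDDING_DIM)           # one recipient vocalisation

# Weight the recipient's own characteristics most heavily, then spread the
# remainder evenly across the donors (weights are illustrative assumptions).
donor_weights = np.array([0.2, 0.2, 0.2])
blended = 0.4 * recipient + donor_weights @ contributors

# Normalise, as speaker encoders typically do before the vector is used to
# condition a synthesiser.
blended /= np.linalg.norm(blended)
print("Blended identity (first 5 dims):", np.round(blended[:5], 3))
```

Because each contributor’s embedding is diluted in the average, the resulting voice cannot easily be matched back to any one donor, which is one way the privacy benefit described above can arise.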
Currently, VocaliD offers two products for individuals, both priced at US$1,499: BeSpoke voice, which combines vocalisations from a speech-impaired person and matching contributor(s), and Vocal Legacy, a digital voice synthesised from a single contributor who is about to lose their voice, whether to a progressive degenerative disease such as ALS or Parkinson’s or to invasive surgery, or who simply wants to preserve their voice for future generations.
When Lonnie Blanchard was told he had oral cancer and needed an immediate glossectomy to remove 98 per cent of his tongue, he spent the week before his surgery visiting and talking to loved ones and banking as much of his voice with VocaliD as he could. When Mr Blanchard received his AAC device following his surgery, it was programmed with a digital speech generator that sounded like him.
“He writes us just to tell us how he’s doing and how important it is to him that his family can still hear his voice,” says Dr Patel.
Since VocaliD started in 2014, the cost of making each voice has fallen from tens of thousands of dollars to a few hundred, not counting the initial R&D and considerable ongoing support costs. Dr Patel hopes to offer voices in other languages as the company secures more funding to serve low-income populations at scale, but spreading the word and determining pricing have been tricky.
VocaliD’s main source of funding has come from US Government grants, which enable for-profit companies to undertake research for use in commercial products as a means of generating jobs, but prohibit the team from spending on marketing, a huge stumbling block for a niche-market product dependent on social awareness.
“It’s a bootstrapping mechanism but it’s really the only way to start a company that had such a social mission to it because the market is small,” explains Dr Patel.
One idea to get contributors engaged was to set up a shareable interface with photos and personal blurbs from contributors and recipients, but this was problematic due to the vulnerable nature of the population receiving VocaliD voices. Now, the company simply notifies contributors whenever their voice has been deployed.
VocaliD also has to be sensitive about whether the person offering their voice is a suitable contributor for the recipient. Dr Patel has received three long letters from inmates at federal prisons who want to offer their voice to someone in need as an exercise in healing and reparation. Understandably, many potential recipients may not want the voice of a convicted felon.
“That’s another reason for keeping contributors and recipients anonymous. These days people are less tolerant of each other,” she adds.
Even though voices have been heavily subsidised, those who need them most don’t always have the ability to pay, and BeSpoke recipients and their families often delay committing to the investment. Many users’ ability to vocalise is not deteriorating (a person with cerebral palsy will still be able to record their unique sounds in a year or five years), and families are hopeful that their US health insurance, which currently covers the cost of a replacement device every five years, will soon include a provision for custom voices.
“But insurance won’t cover a voice until we have enough documented cases of people using it; it’s real chicken and egg,” says Dr Patel.
People’s voices also change over time, and some families weigh the cost of investing in a voice now only to have to buy another down the line as their child gets older. VocaliD categorises voices into four “sizes”: child, adolescent, adult and mature. The team can make some modifications to an existing synthetic voice as the user ages, but too much algorithmic tinkering introduces strange artefacts, so a new voice has to be built when the recipient moves from one size to another, such as from childhood to adolescence or from adolescence to adulthood.
Dr Patel says the team has learnt that recipient preferences — both the individual receiving the voice and whoever is paying for it — are often as important as accuracy when developing synthetic voices. She mentions the case of an autistic little girl for whom VocaliD made a high-pitched voice before they discovered she found high-pitched voices extremely distressing.
“The only sample we got from her vocalisations — because she has such a hard time making vocalisations — was high pitched squeals and so when we were making her voice, we went with what we knew about her,” explains Dr Patel. Privacy laws and customer wishes can make it difficult to ask the right questions and to store the data that would guide the production process, she adds.
“A lot of the time, we’re dealing with personal pain and as a therapist and academic in this space, I understand that. Obviously, the intention is to empower, but it’s tricky. Voice is just personal.”
As the number of things that talk has proliferated (Google, Siri, Alexa and the applications that run on them, for instance), VocaliD is also exploring applications for enterprise products and how those products can differentiate their vocal personalities from the competition.
“Humans have different voices for a reason; it’s so that we can know who’s talking,” says Dr Patel. “Brands think about their logos, think about their colours… They now have to think about voice in this voice-first era.”
Another application for VocaliD software is developing voice avatars for actors who may not be able to come into the studio to record more lines, whether for conventional marketing, real-time interactive experiences or over-the-top campaigns such as advertising distributed online, bypassing TV and other mainstream media.
Dr Patel has worked with influencers, whom she can’t name, to create synthetic voices for their virtual interactions with fans, such as the digital version of a rapper’s voice for a hype ad campaign for Stōk Coffee.
“We call it a Voice Dubb. It’s like a stunt double for a [talent]’s voice,” Dr Patel explains.
“The estimate right now is that there are 500 million things that talk or things and applications that talk. By 2020, it’s supposed to be 40 billion. This is a disrupter certainly, but not a displacer,” she adds.
A synthetic voice is still distinguishable from a human voice but the gap is closing all the time.
Dr Patel plays me a minute-long voice sample that contains human and synthesised speech. I listen to it twice; the fact that the melody sounds a little off in parts tips me off that it’s not a real person speaking, but I can’t call out with any degree of certainty which parts are spoken by a human and which are machine-generated.
This slight lack of authenticity in cadence, coupled with the delay between a computer processing a human question and generating a response—which Dr Patel says stands at about six seconds—means that spoofing attacks with AI-assisted synthetic voices aren’t yet viable but they certainly aren’t far off.
She does expect applications for synthetic speech generation to continue to grow from a niche market into the mainstream, in the same way in-ear buds started out as hearing aids but are now used by anyone listening to content on their phone or computer. She is now looking for equity funding to develop applications for synthetic voices that serve speech-impaired clients.
Maeve has already had several iterations of her voice as VocaliD’s software has improved. Dr Patel says that hardware is now the bottleneck for developments in digital voice: VocaliD can build more sophisticated voices, but users’ AAC devices lack the latest processing technology.
“We can make a better voice for Maeve, but she can’t use it because it takes a graphics processing unit to run it, and she doesn’t walk around with a GPU.”
Even so, Maeve and her family are happy that she is being heard.
“It’s great that Maeve can have this computerised voice that sounds like a little girl. A voice is so special and it’s truly her voice,” says Maeve’s mother Kara Flack.
“When someone is walking down the hall and hears Maeve talking, they know that it’s her voice. It’s priceless.”
This article first appeared in the July issue of A.