Speech-to-Text in Voice AI: Does it have to fit all?





“Why are deep learning technologists so overconfident?” ask Narayanan and Kapoor, before continuing: “the central dogma of deep learning is that the only thing you need to solve a new type of learning problem is to collect training examples.” I read this just as I was questioning common approaches in the voice AI industry. Recent advances in deep learning have created new opportunities for voice AI and improved speech recognition significantly. However, the industry still applies the same approach to every problem, even when a better, proven alternative exists.

Making voice files discoverable and searchable

Enterprises are looking for ways to solve their unstructured data problem with machine learning. Voice data is unstructured, so it is not discoverable or searchable until it has been processed. Most people in voice AI think converting voice to text is the only way to process it. Tim Olson says “the speech must first be converted to text” to be searched or sorted. However, he also acknowledges that this added step is tricky because of proper nouns and homophones, which are widely known challenges of automatic speech recognition. He says they are tackling these challenges by improving the machine learning models with more training data to minimize transcription errors. Unfortunately, he doesn’t explain why speech MUST be converted to text. So let’s ask:

Does speech have to be converted to text to make voice files searchable or is it just a dogma because text-indexing algorithms have been successful, so it’s the way we know how to index things?

I believe Clay Christensen’s simple question, “what’s the job to be done?”, can be applied to many problems, including this one. Olson’s answer is “helping people find audio news more effectively.” So the job to be done is not minimizing transcription errors. Just as Google indexes the web, we need a search engine that indexes audio news to make it searchable. The solution we are looking for is speech indexing: Speech-to-Index. Text is an intermediary “added” step that is simply assumed to be there. However, it’s possible to index audio news without that added step and help people find news more accurately and effectively.
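One way to picture indexing without a word transcript is an inverted index built over sub-word units. The sketch below is only an illustration under stated assumptions, not the method Olson or any vendor uses: it assumes a hypothetical acoustic front end has already turned each clip into a sequence of phoneme symbols, and the clip names and ARPAbet-style symbols are made up for the example. Proper nouns and homophones never have to be spelled, because the index matches sound sequences rather than words.

```python
from collections import defaultdict

def phoneme_ngrams(phonemes, n=3):
    """Return all overlapping n-grams of a phoneme sequence."""
    return [tuple(phonemes[i:i + n]) for i in range(len(phonemes) - n + 1)]

def build_index(recordings, n=3):
    """Map each phoneme n-gram to the set of clips that contain it."""
    index = defaultdict(set)
    for clip_id, phonemes in recordings.items():
        for gram in phoneme_ngrams(phonemes, n):
            index[gram].add(clip_id)
    return index

def search(index, query_phonemes, n=3):
    """Rank clips by how many query n-grams they contain (ties by clip id)."""
    hits = defaultdict(int)
    for gram in phoneme_ngrams(query_phonemes, n):
        for clip_id in index.get(gram, ()):
            hits[clip_id] += 1
    return sorted(hits, key=lambda clip_id: (-hits[clip_id], clip_id))

# Hypothetical output of an acoustic front end (illustrative symbols only).
recordings = {
    "clip_a": ["IH", "N", "T", "R", "AH", "S", "T", "R", "EY", "T", "S", "R", "AY", "Z"],
    "clip_b": ["R", "EY", "N", "T", "AH", "M", "AA", "R", "OW"],
}
index = build_index(recordings)

# The spoken query is also represented as phonemes, never as text.
print(search(index, ["R", "EY", "T", "S"]))
```

A real system would need to tolerate recognition errors in the phoneme stream and store time offsets; the toy version only shows that lookup can work on sound units directly.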

Detecting users’ intents

Differentiating homophones can be tough for both machines and humans. In most cases, humans can do so easily given the context. At a doctor’s office or a bank, a person can infer that it’s “calluses” not “calculus,” or “IBAN” not “I ban.” Alexa, however, doesn’t know where it is being used. Humans are naturally context-aware; machines are not. Yet we often want them to be context-aware, too. For example, we do not expect a voice assistant that improves productivity in a warehouse to have a conversation about the meaning of life. A banking application is not expected to understand requests for users’ favourite songs and play them. So why does the voice AI industry invest in adjusting speech-to-text models to tackle homophones when machines do not even need to recognize the alternatives? Let’s ask Christensen’s question again. What’s the job to be done? It is to enable users to get things done. If it’s a coffee maker, the user wants to get coffee. If it’s a banking application, the user wants to send money. The job is to convert speech to intent (i.e. understand) and then trigger an action accordingly. Text is, again, an added step that’s assumed to be necessary for understanding the intent, although the job can be done without text. The solution we are looking for is Speech-to-Intent, not speech-to-text-to-intent.
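To make the difference concrete, here is a minimal sketch of the decision step in a domain-specific Speech-to-Intent system. It assumes a hypothetical model that scores an utterance directly against a small, closed set of intents for one device; the intent names, the scores and the 0.6 threshold are invented for illustration. Because the label set contains only what a coffee maker can do, homophones from other domains are never candidates in the first place.

```python
import math

# Closed set of intents for a hypothetical coffee maker.
COFFEE_INTENTS = ("brew_espresso", "brew_latte", "stop")

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits.values())
    exps = {label: math.exp(score - peak) for label, score in logits.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}

def pick_intent(logits, threshold=0.6):
    """Return the most likely in-domain intent, or None if the model is unsure."""
    if set(logits) != set(COFFEE_INTENTS):
        raise ValueError("scores must cover exactly the domain's intents")
    probs = softmax(logits)
    best = max(probs, key=probs.get)
    return best if probs[best] >= threshold else None

# Assumed scores, as if produced from audio by a speech-to-intent model.
clear_request = {"brew_espresso": 2.0, "brew_latte": 0.1, "stop": -1.0}
unclear_request = {"brew_espresso": 1.0, "brew_latte": 0.9, "stop": 0.8}

print(pick_intent(clear_request))
print(pick_intent(unclear_request))
```

Returning None for a low-confidence utterance lets the device ask the user to repeat the request instead of guessing, which is usually acceptable when the set of possible actions is this small.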

In the end, Narayanan and Kapoor credit deep learning technologists by acknowledging that they have proven skeptics wrong several times. Deep learning technologists in voice AI should also get huge credit for their work. Speech-to-text is a phenomenal technology. Modern engines can recognize ten times more words than an average human being does, and they are getting closer to human-level accuracy. Speech-to-text has enabled various voice applications that wouldn’t have been possible otherwise. The company I work for also offers speech-to-text. It is the only option in some cases, such as dictation, call center transcription, subtitling or open-domain (generalist) voice assistants like Alexa. However, the fact that voice vendors can offer phenomenal speech-to-text doesn’t mean it is the best solution for every use case and every customer. For voice vendors, it’s time to ask the right questions and reframe the problem before getting obsessed with the technology and the solution. For voice buyers, it’s time to make sure that vendors offer the best solution for their needs, not the only available solution. One size doesn’t have to fit all, and it generally doesn’t.
