Big Tech Going Offline
On-device Speech Recognition in 2022
2022 has been the best year yet for on-device speech recognition, judging by the magnitude and frequency of the announcements.
Google announced Google Cloud Speech On-Device on October 20. OpenAI, the creator of the famous DALL-E, introduced Whisper in September. Apple added on-device live captions in May as part of its accessibility and privacy initiatives. Microsoft completed its $20 billion acquisition of Nuance in March. At Picovoice, we added 20 MB speech-to-text models that run anywhere to our on-device speech portfolio. In January, Amazon shared how on-device speech processing makes Alexa faster.
Why on-device speech recognition?
Processing voice data on the device (i.e., edge voice AI) has advantages over the cloud. Cloud computing is expensive at scale, carries security and privacy risks and has a significant carbon footprint. Bringing processing to the device instead of sending data to the cloud gives control back to enterprises and users. Edge voice AI offers privacy, cost-effectiveness and a better user experience: no network round-trip latency and no dependence on a connection.
OpenAI targets AI researchers with Whisper. However, many others, including reporters, got excited about it because of the privacy benefits of on-device speech recognition. Big tech is known for unethical uses of voice data, and independent voice API vendors such as Otter.ai are now following in its footsteps.
Highly accurate models offered by cloud API providers are expensive for both vendors and buyers. For example, providers rarely offer more than a couple of hours of free transcription, because training and running large speech models in the cloud is costly, and they pass those cloud costs on to their customers.
Although high-speed internet is widely available, connectivity, latency and outages are still significant problems. The inherent risks of cloud dependency hurt both user experience and productivity.
Then there is environmental sustainability. The carbon footprint of cloud computing has surpassed that of the airline industry, and training a single large AI model emits as much carbon as five cars over their lifetimes.
When you can have the same results and a better experience at a lower cost by consuming less energy, why wouldn’t you?
Until recently, end users, developers and enterprises have had to sacrifice privacy for convenience and cost and tolerate poor experiences, because the big-tech-dominated voice AI market didn't offer an alternative. There are two reasons. First, there's a conflict of interest. Second, building an alternative is not easy.
Keeping developers in the cloud helps big tech maintain its cloud oligopoly. On the technical side, the standard approach in speech recognition is to train large models to achieve high accuracy. Server farms can run large models with ease; most devices cannot. For example, Mitchell Clark reports that transcribing a 24-minute interview took 52 minutes with Whisper vs. 8 minutes with Otter.ai. For hobby projects, 52 minutes may not be a problem. For enterprise applications, it certainly is. Real-time factor is a common question from prospects evaluating speech-to-text engines.
It's not easy to optimize models that run efficiently across platforms; the Picovoice team learned that the hard way. It is also not easy to compete while being 6.5 times slower. As a result, ASR models run in the cloud. Until now…
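Real-time factor (RTF) is commonly defined as processing time divided by audio duration, so values below 1.0 mean faster than real time. The minimal Python sketch below applies that definition to the figures reported above (a 24-minute interview, 52 minutes with Whisper, 8 minutes with Otter.ai). The function name `real_time_factor` is hypothetical. The inputs come from a single reported test, so the results say nothing about other hardware, model sizes or audio.

```python
def real_time_factor(processing_minutes: float, audio_minutes: float) -> float:
    """Hypothetical helper: RTF = processing time / audio duration.

    RTF < 1.0 means the engine transcribes faster than real time.
    """
    if audio_minutes <= 0:
        raise ValueError("audio duration must be positive")
    return processing_minutes / audio_minutes


# Figures as reported in the article (single test, hardware unspecified).
audio = 24.0                                # interview length in minutes
whisper_rtf = real_time_factor(52.0, audio)  # 52 / 24
otter_rtf = real_time_factor(8.0, audio)     # 8 / 24

print(f"Whisper RTF: {whisper_rtf:.2f}")               # prints: Whisper RTF: 2.17
print(f"Otter.ai RTF: {otter_rtf:.2f}")                # prints: Otter.ai RTF: 0.33
print(f"Slowdown: {whisper_rtf / otter_rtf:.1f}x")     # prints: Slowdown: 6.5x
```

Whisper's RTF above 1.0 means it ran slower than real time in that test. Because both runs used the same recording, the ratio of the two RTFs equals the ratio of processing times, which is where the 6.5 times figure comes from.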
It is easier to optimize ASR models when you control the hardware. Thus, big tech is moving to the edge, but only for its own products and selected partners.
In March, Microsoft completed its acquisition of Nuance. Nuance is known for Dragon Speech Recognition, the well-known offline dictation software specialized for the healthcare and legal sectors. So far, however, Nuance has announced only cloud investments.
Putting "cloud" and "edge" or "on-device" in the same sentence may sound unconventional to many. Yet on-device processing is the way to give control back to enterprises and users and to stop harming the environment.