Speech recognition, also known as automatic speech recognition (ASR), speech-to-text (STT), or computer speech recognition, is the task of converting spoken language into text. ASR processes raw audio signals and transcribes them.
ASR falls under the family of “conversational AI” applications. Conversational AI is the use of natural language to communicate with machines. It typically consists of three subsystems:
- Automatic speech recognition (ASR): transcribes the audio into text.
- Natural language processing (NLP): extracts meaning from the transcribed text. NLP is concerned with computers processing and analyzing natural language, i.e., any language that has developed naturally rather than artificially (as computer programming languages have).
- Text-to-speech (TTS) or speech synthesis: converts text to human-like speech.
Each of the three subsystems integrates multiple neural networks to create a seamless experience for the end-user. Neural networks are a programming paradigm that enables a computer to learn from observational data using a unique architecture of small functional units (the neurons) which are wired together in a manner mimicking a rudimentary model of the human brain.
The most prominent voice-driven applications that have become embedded in our day-to-day lives are voice assistants such as Apple’s Siri, Amazon’s Alexa, Google Assistant, and Microsoft’s Cortana. These question-answering systems have improved at a tremendous pace in the last few years thanks to advances in deep learning and big data. As an example of the forward momentum, Google just announced Meena, a “2.6 billion parameter end-to-end trained neural conversational model” designed to handle wide-ranging conversations.
Our previous posts explored NLP techniques such as sentiment analysis and named entity recognition (NER).
Here, we explore the first subsystem of a conversational workflow: automatic speech recognition. We explain how it works, explore some use-cases, and see how you can apply it in your business.
How does speech recognition work?
At its heart, speech recognition is a 3-step process:
- Feature extraction: we first sample the continuous raw audio signal into a discrete one. The discrete signal needs to be converted into a form that is digestible by a machine learning algorithm, such as a spectrogram. The spectrogram input to the algorithm can be thought of as a descriptive numerical vector at each timestep. It is obtained by applying a mathematical operation called a short-time Fourier transform on the discrete audio signal.
- Acoustic modeling: this takes the spectrogram and predicts, for each time step, a probability distribution over the units in its vocabulary (typically characters, phonemes, or word pieces rather than whole words).
- Language modeling: this adds contextual information about the language and is used to correct mistakes made by the acoustic model. It tries to determine what was spoken by combining what the acoustic model thinks it heard with what is a likely next word (based on its knowledge of the language).
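As a rough illustration of the feature extraction step, the sketch below computes a magnitude spectrogram with NumPy by applying a short-time Fourier transform to a discrete signal. The frame length, hop size, Hann window, and the synthetic 440 Hz test tone are our own illustrative choices, not settings taken from any particular ASR toolkit.

```python
import numpy as np


def spectrogram(signal, frame_len=400, hop=160):
    """Magnitude spectrogram via a short-time Fourier transform.

    Returns an array of shape (num_frames, frame_len // 2 + 1):
    one row of frequency magnitudes per timestep.
    """
    window = np.hanning(frame_len)  # tapers each frame to reduce spectral leakage
    num_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop: i * hop + frame_len] * window
        for i in range(num_frames)
    ])
    # rfft keeps only the non-negative frequencies of a real-valued signal
    return np.abs(np.fft.rfft(frames, axis=1))


# Assumed input: one second of a 440 Hz tone sampled at 16 kHz
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 440.0 * t)

spec = spectrogram(signal)
print(spec.shape)
```

Each row of spec is the numerical vector describing one timestep, which is the kind of input the acoustic model consumes.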
The acoustic model and language model are types of neural networks.
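To make the interplay between the two models concrete, here is a minimal sketch of how a language model can overrule an acoustic model on similar-sounding candidates. All probabilities are invented for illustration, and the weighted sum of log-probabilities is only one simple way of combining the two scores; real decoders are considerably more involved.

```python
import math

# Hypothetical acoustic-model probabilities for two candidate transcripts.
# They sound alike, so the acoustic model only slightly prefers the wrong one.
acoustic_probs = {
    "please right me a letter": 0.55,
    "please write me a letter": 0.45,
}

# Hypothetical language-model probabilities for the same word sequences.
lm_probs = {
    "please right me a letter": 0.02,
    "please write me a letter": 0.98,
}


def rescore(acoustic, lm, lm_weight=0.5):
    """Combine the two models as a weighted sum of log-probabilities.

    lm_weight (an assumed tuning knob) controls how much the
    language model influences the final score.
    """
    return {
        text: math.log(acoustic[text]) + lm_weight * math.log(lm[text])
        for text in acoustic
    }


acoustic_only = max(acoustic_probs, key=acoustic_probs.get)
scores = rescore(acoustic_probs, lm_probs)
best = max(scores, key=scores.get)

print("acoustic model alone:", acoustic_only)
print("with language model: ", best)
```

With these made-up numbers, the acoustic model alone picks the "right" variant, while the combined score picks the "write" variant.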
ASR models are evaluated using the word error rate (WER). WER is the number of words the model substituted, deleted, or inserted relative to the reference transcription, divided by the number of words in that reference, and is usually expressed as a percentage. The lower the WER, the better the model.
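The WER calculation can be sketched in a few lines of Python. This is a minimal version that splits on whitespace and counts errors with a word-level edit distance; the function name and the example sentences are ours, and real evaluations usually normalize casing and punctuation first.

```python
def word_error_rate(reference, hypothesis):
    """Return the WER as a percentage of the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcription must contain at least one word")

    # Word-level edit distance, keeping only the previous row of the table
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr

    return 100.0 * prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over six reference words
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {wer:.1f}%")
```

Because insertions count as errors, a hypothesis that is much longer than the reference can score above 100%.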
Where does speech recognition struggle?
There are numerous challenges faced by ASR. Here are just a few:
- Background noise
- Different accents and dialects
- Low-quality microphone/recording equipment
- Listening to what people say is more than just hearing the words they speak; we engage all our senses during a conversation, reading a person’s facial expressions, body language, and inflections in their voice. This information is lost when processed as a raw audio signal.
- Multiple speakers
- Abbreviations and continuously evolving language.
- Overlapping speech
- Homophones: words like ‘there/their,’ ‘air/heir,’ and ‘right/write’ sound the same but have different meanings, so the audio alone cannot tell them apart.
The most reliable way forward, as is often the case in machine learning, is to generate more relevant labeled training data. By relevant, we mean training data that is more representative of the situations the model will encounter once deployed.
The more high-quality, relevant labeled data you train your model on, the better it becomes at handling noise, accents, and other variations in speech.
Speech recognition use cases
In line with its versatility, ASR has a wide range of use cases. With the advancements in conversational AI, it would not be surprising for speech to become the dominant user interface in the coming decade.
Some notable use cases include:
- Voice-driven intelligent virtual assistants (IVA): speaking is a more natural way to interact with an intelligent machine than typing text or pushing buttons. Smart virtual assistants have seen tremendous adoption due to this ease of interaction, with the global market forecast to grow at a compound annual rate of 19.8% through 2026.
- Home automation: with the move to smart homes and IoT devices, there is going to be rapid growth in voice-activated devices. In fact, 24% of US adults already own at least one smart speaker.
- Smart devices: from our phones, computers, and watches to our TVs, refrigerators, and in-car systems, the devices we use will be transformed this decade by conversational AI applications. Currently, 54% of the U.S. population report having used voice-command technology at some point, and that figure is sure to rise.
- Transcription: generating transcripts of conferences, meetings, and classroom lectures. Such tools increase inclusivity for people with hearing and seeing difficulties.
- Automated call centers to reduce the staffing costs of human customer representatives in multiple industries, from banking to healthcare
- Automatic captioning of videos: with some 500 hours’ worth of video content uploaded to YouTube every minute, manual captioning is not a feasible option. YouTube uses ASR to generate subtitles for its videos.
- Seamless translation of languages
- Hands-free computing
How can I use speech recognition?
If you think that your business or project could benefit from ASR, it’s pretty easy to start. Kaldi and Nvidia’s NeMo (standing for neural modules) are popular open-source toolkits for speech recognition.
But before you begin using one of these frameworks to build a model, you will need to produce a relevant labeled dataset to train the model.
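At its simplest, a labeled ASR dataset is a list of audio files paired with their transcripts. As a rough sketch, the snippet below serializes such pairs as one JSON object per line, a layout similar to the manifest files NeMo reads. The file paths, durations, and transcripts are made up, and you should check your toolkit's documentation for the exact fields it expects.

```python
import json

# Hypothetical labeled examples: (audio path, duration in seconds, transcript)
samples = [
    ("audio/call_0001.wav", 3.2, "i would like to check my balance"),
    ("audio/call_0002.wav", 2.7, "please transfer me to an agent"),
]


def to_manifest(samples):
    """Serialize labeled examples as JSON lines, one utterance per line."""
    lines = []
    for audio_filepath, duration, text in samples:
        entry = {
            "audio_filepath": audio_filepath,
            "duration": duration,
            "text": text,
        }
        lines.append(json.dumps(entry))
    return "\n".join(lines) + "\n"


manifest = to_manifest(samples)
print(manifest, end="")
```

Writing the returned string to a file gives you a single artifact that pairs every recording with its label, which is what the training step needs.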
With Canotic, you can provide us with your raw audio files, and we’ll transcribe the audio for you, returning a high-quality training dataset that you can use to train and tailor your ASR model.
If you’re interested in learning more, or have a specialized use case, reach out to us. You can also stay tuned to our blog, where we’re continuing to run a series of posts covering different aspects of NLP.