Voice and speech recognition have progressed by leaps and bounds in recent times. Humans depend heavily on mobile phones these days; according to one study, people spend an average of 3 hours and 15 minutes a day on their phones. That’s a lot of human dependence on a single piece of tech.
Even multinational conglomerates across the globe are coming to realise that smooth and efficient human-computer interaction is the need of the hour. They have identified voice recognition software as a much-needed tool to streamline tasks that are otherwise done conventionally.
It is estimated that by the year 2024, the global voice-based smart speaker market could be worth $30 billion! It is also expected that by the end of this year, almost half of all searches across the internet will be voice-based.
E-commerce is one domain that has seen major adoption of voice commands in place of traditional web searches. Voice-based shopping is expected to jump to $40 billion in 2022. These forecasts point to a seismic shift in consumer behavior.
So, while thumbing the screens of your phones or accessing the internet via virtual assistants, have you ever asked yourself - how does voice recognition work?
What is a Voice Recognition Software - How Does a Computer Even Know What I Am Saying?
This happens via voice recognition software!
Voice or speech recognition software enables you to feed data into a computer using your voice. More advanced versions of voice recognition software are capable of decoding the human voice and performing commands accordingly. So, as you speak into a voice recognition system, your voice is converted into text.
When conceived in 1952, the Automatic Speech Recognition (ASR) system was only capable of recognizing single digits. But today, this technology is used in almost every professional field including defense, medicine and healthcare, law, education, telecommunications, and personal computing.
Why is Speech Recognition So Difficult?
Short answer: Computers understand logic, not emotions, and they HATE noise.
Humans started to use language as a means of communication around 2 million years ago. But when it comes to teaching machines how to understand, analyze, and decode human speech, we are far from attaining perfection. So, what makes speech recognition difficult?
Human speech is not as easy as it seems. It has taken us millions of years of evolution to reach the stage where we can associate our thoughts with unique sounds, process them coherently so that the person we are talking to can get the message easily.
For the listener as well, the message must be received (via sound waves) as it was intended to be despite the background noise and other linguistic barriers. Humans have only mastered this art after millions of years of evolution.
A computer, no matter how fast and complex it may be, will struggle to understand and analyze the following aspects of speech:
- Suppression of noise: Humans can certainly separate useful parts of a speech from ambient noises and background jibber-jabber. A computer will take it as a part of the whole input.
- Speed of verbal communication: Humans are capable of understanding slow and fast speech, high- and low-pitched voices laced with emotions and expressions. Most ASR systems struggle to understand speech delivered at more than 200 words per minute.
- Accents and dialects: Even humans fail to understand dialects from certain parts of the globe. To expect a computer to understand unique dialects and accents is way too premature at this stage.
- Context of the speech: Humans can understand the context of a conversation from the simplest of prompts, but the ASR system requires direct and precise instructions. This is often time-consuming and tedious and defeats the whole purpose of instant commands.
Human to human conversation is full of expressions, anecdotes, and emotions. With computers, we have not yet hit the phase where we can code them to interact with users like other humans. It would be extremely interesting to see how engineers and scientists are able to induce something as natural and human as verbal communication into computers that run on direct commands and instructions.
The Million-Dollar Question: How Does Voice Recognition Work?
Voice recognition means making a computer understand human speech. It is done by converting the human voice into text using a microphone and speech recognition software. The basic speech recognition pipeline is outlined below:
1. Speech to text conversion
When sound waves are fed into the computer, they need to be sampled first. Sampling refers to breaking down the continuous voice signal into discrete, smaller samples, each a tiny fraction of a second long. These smaller samples could be fed directly to a Recurrent Neural Network (RNN), which forms the engine of a speech recognition model. But to get better, more accurate results, the sampled signals are pre-processed first.
2. Pre-processing of speech
Pre-processing is important as it decides the efficiency and performance of the speech recognition model. At a typical sampling rate of 16 kHz, each sample covers just 1/16000th of a second. The samples are then pre-processed by grouping them into frames, usually spanning intervals of 20-25 milliseconds. This whole process converts sound waves into numbers (bits) that a computer system can easily work with.
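The sampling and framing described above can be sketched in a few lines of Python. This is a toy illustration: the 16 kHz rate and 20 ms frame length are the figures mentioned in the text, and the sine wave is just a stand-in for real recorded speech.

```python
import math

SAMPLE_RATE = 16000          # samples per second -> each sample is 1/16000 s
FRAME_MS = 20                # group samples into 20 ms frames
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000   # 320 samples per frame

# Fake one second of "speech": a 440 Hz sine wave instead of a microphone.
signal = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE)
          for t in range(SAMPLE_RATE)]

# Break the continuous signal into discrete, non-overlapping frames.
frames = [signal[i:i + FRAME_LEN]
          for i in range(0, len(signal) - FRAME_LEN + 1, FRAME_LEN)]

print(len(frames))      # 50 frames in one second
print(len(frames[0]))   # 320 samples per frame
```

Real pipelines usually overlap adjacent frames and apply a window function before feature extraction, but the core idea of chopping a continuous signal into short, fixed-size chunks is the same.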
3. Recurrent Neural Network (RNN)
Inspired by the functioning of the human brain, scientists developed a family of algorithms capable of taking a huge set of data and processing it by drawing out patterns to produce output. These are called neural networks, as they try to replicate how the neurons in a human brain operate. They learn by example. Neural networks have proved extremely efficient at applying deep learning to recognize patterns in images, text, and speech.
Recurrent neural networks (RNNs) are neural networks with a memory that can influence future outcomes. An RNN reads each letter along with the likelihood of the next letter. For example, if a user says HEL, it is highly likely that LO will follow, not gibberish such as XYZ. The RNN saves its previous predictions in memory to make more accurate predictions of the spoken words to come.
Using an RNN over a traditional neural network is preferred because traditional networks assume that inputs and outputs are independent of each other. They do not use the memory of previously spoken words to predict the upcoming word, or part of a word, in a sentence. So an RNN not only enhances the efficiency of a speech recognition model but also gives better results.
4. RNN Algorithm
Following are the steps involved in the RNN algorithm:

a. The input state: xₜ → the input at time step t.

b. The hidden state: sₜ → the hidden memory. It stores the data of what took place in all the previous time steps. It is calculated as sₜ = f(U·xₜ + W·sₜ₋₁), where f is a non-linearity such as tanh.

c. The output state: oₜ → the output at step t. It is calculated exclusively based on the memory at time t, as oₜ = softmax(V·sₜ).
The RNN uses the same parameters (U, V, W) at every step. In other words, although different inputs are passed at different steps, the same task is being performed at each step, which limits the number of parameters to be learned. And even though there is an output at each time step, the task may not require all of them.
To make this easier to understand, consider an example where we have to predict the sentiment of a sentence. To do so, we won't concern ourselves with the output at each word, only with the final output. The same applies to the inputs as well; that is, we may not need an input at every time step.
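The forward pass described in this section can be sketched in plain Python. This is a toy illustration only: the weights U, V, W are made-up values rather than trained parameters, and the dimensions are kept tiny so every step is visible.

```python
import math

def matvec(M, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def softmax(v):
    # Turn raw scores into a probability distribution.
    exps = [math.exp(x) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

# Toy parameters (illustrative values, not trained):
U = [[0.5, -0.3], [0.1, 0.8]]   # input  -> hidden
W = [[0.2, 0.0], [0.0, 0.2]]    # hidden -> hidden (the "memory" path)
V = [[1.0, -1.0], [-1.0, 1.0]]  # hidden -> output

def rnn_forward(xs):
    s = [0.0, 0.0]                   # initial hidden state s_0
    outputs = []
    for x in xs:                     # same U, V, W reused at every step
        pre = [a + b for a, b in zip(matvec(U, x), matvec(W, s))]
        s = [math.tanh(p) for p in pre]        # s_t = tanh(U.x_t + W.s_{t-1})
        outputs.append(softmax(matvec(V, s)))  # o_t = softmax(V.s_t)
    return outputs

outs = rnn_forward([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(len(outs))   # 3: one output distribution per time step
```

Note how the hidden state `s` carries information forward between steps; that is the "memory" that distinguishes an RNN from a traditional feed-forward network.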
5. Training An RNN
So far, we know that in an RNN, the output at a certain time step depends not only on the current input but also on the computations of past steps. Consider an example where you have to calculate the gradient at t = 6. In order to do so, you will have to back-propagate 5 steps and sum up all the gradients. This is called Backpropagation Through Time (BPTT), and it is the algorithm we employ to train an RNN.
This method of training an RNN has one major drawback: it makes the network depend on steps that are far apart from each other. This problem of long-term dependency is solved by using RNN variants such as the LSTM.
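The idea of back-propagating through time and summing gradients can be illustrated with a deliberately simplified scalar linear RNN (real RNNs use vectors and non-linearities, and all values here are made up). The loop below walks backwards from the final step, accumulating one gradient contribution per past step, which is exactly the "back-propagate and sum up all the gradients" described above.

```python
# Scalar linear RNN: s_t = w * s_{t-1} + u * x_t, loss L = (s_T - y)^2.

def forward(w, u, xs):
    states = [0.0]                       # s_0
    for x in xs:
        states.append(w * states[-1] + u * x)
    return states

def bptt_grad_w(w, u, xs, y):
    states = forward(w, u, xs)
    T = len(xs)
    dL_dsT = 2.0 * (states[T] - y)       # dL/ds_T
    grad, chain = 0.0, 1.0               # chain = ds_T/ds_t
    for t in range(T, 0, -1):            # walk backwards through time
        grad += dL_dsT * chain * states[t - 1]   # local term: ds_t/dw = s_{t-1}
        chain *= w                       # each step further back multiplies by w
    return grad

xs, y, w, u = [1.0, 0.5, -0.2, 0.3, 1.0, 0.4], 2.0, 0.9, 0.7
analytic = bptt_grad_w(w, u, xs, y)

# Sanity check against a numerical (finite-difference) gradient.
eps = 1e-6
loss = lambda w_: (forward(w_, u, xs)[-1] - y) ** 2
numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)
print(abs(analytic - numeric) < 1e-4)    # True
```

The `chain *= w` line also hints at the long-term dependency problem: contributions from distant steps are multiplied by `w` repeatedly, so they shrink (or blow up) as the gap grows.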
6. LSTM
We know that RNNs cannot process very long sequences. To overcome this problem, scientists came up with Long Short-Term Memory, or LSTM. While the repeating module of a standard RNN contains a single layer, an LSTM module contains four interacting ones. LSTMs consist of a cell state that allows information to flow through it, and by applying gates, information can be added or removed.
LSTMs employ three types of gates: an input gate, an output gate, and a forget gate. Together, these three gates protect and control the cell state. Each gate uses a sigmoid function, which outputs a value between 0 and 1: an output of 1 means the gate passes all of the information given to it through, while an output of 0 means it passes nothing at all.
This is where LSTM is better than RNN as using cell states we can control the long term dependencies.
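A single LSTM step, with its three sigmoid gates and cell state, can be sketched as follows. This is a scalar toy with made-up weights, for illustration only; real LSTMs use weight matrices and learn these values during training.

```python
import math

def sigmoid(z):
    # Squashes to (0, 1): near 0 = "let nothing through",
    # near 1 = "let everything through".
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    # p holds one scalar weight/bias per connection (toy values).
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])   # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])   # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])   # output gate
    c_tilde = math.tanh(p["wc"] * x + p["uc"] * h_prev + p["bc"])  # candidate
    c = f * c_prev + i * c_tilde     # cell state: drop old info, admit new
    h = o * math.tanh(c)             # hidden state exposed to the next step
    return h, c

# Illustrative parameters: every weight and bias set to 0.5.
params = {k: 0.5 for k in
          ["wf", "uf", "bf", "wi", "ui", "bi", "wo", "uo", "bo", "wc", "uc", "bc"]}

h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.3]:
    h, c = lstm_step(x, h, c, params)
print(round(h, 4))
```

The key line is `c = f * c_prev + i * c_tilde`: because the cell state is updated by gated addition rather than repeated multiplication, information can survive across many steps, which is what lets the LSTM control long-term dependencies.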
Types of Speech Recognition Software
- Speaker-dependent voice recognition software: They are heavily dependent on the speaker as they need to learn and analyze the characteristics of the user’s voice. Once provided with enough data to recognize the voice and speech patterns, they can be used as highly efficient dictation software.
- Speaker-independent voice recognition software: They do not depend much on the speaker’s voice pattern, as they are trained to recognize anyone’s voice. Naturally, they are not as efficient as the speaker-dependent software and hence are more commonly found in telephone applications.
- Command and control voice recognition software: These systems are used to navigate and control devices using voice commands. Tasks such as starting the programs, browsing through websites and other functions can be easily accomplished.
- Discrete input voice recognition software: They aim for high accuracy of word identification. They do so by requiring a pause after each word is spoken. This limits their efficacy to around 60-80 words per minute.
- Continuous input voice recognition software: They are designed to analyze a continuous stream of words. Compared to other software, they work with a smaller vocabulary and hence find application mostly in medicine and healthcare.
- Natural speech input voice recognition software: They are capable of understanding words that are spoken fluently and can understand as high as 160 words per minute.
Innovative Uses of Voice Recognition
The smartphones we get today are equipped with virtual assistants such as Siri, Cortana, Alexa, etc. Even household equipment such as smart TVs, refrigerators, and washing machines can now be controlled by voice. As far as domestic usage of voice recognition is concerned, it has been a warmly welcomed advancement. Beyond this, there are many innovative uses of voice and speech recognition software in today’s world. Some of them are listed below:
- Forensics and crime analysis: Audio forensics deals with the analysis of voice clippings to solve a crime by using them as admissible evidence in a court of law. Researchers at the University of East Anglia have had some success in using visual speech recognition to reconstruct conversations captured on video with no sound.
- Virtual banking: Fintech was one of the earliest sectors to jump on the speech recognition bandwagon. It is estimated that in 2017, North American banks alone had invested over $20 billion to incorporate voice recognition into their apps. Payment gateways and UPIs also provide exclusive voice command features to facilitate transactions.
- Healthcare: One of the most overlooked aspects of the medical industry is reporting. Speech recognition has enabled medical professionals to keep meticulous records of procedures as they perform them. It is believed that the day is not far off when voice-controlled surgical instruments will be used to perform complex cardiac and brain surgeries.
- Home security: Gone are the days when keys and locks would be guarding our houses with all the precious belongings inside. A lot of home security systems have started to incorporate speech recognition to authenticate the personnel entering a building. This is considered even more secure and fail-proof than using fingerprint scans or electronic locks.
- Transcription: Journalists, lawyers, and bookkeepers have to maintain notes regularly. Voice recognition will not only give them a seamless way to dictate and store notes but also help them manage other aspects of their trade in the time saved.
Future of Voice Recognition Software: Where is It Headed?
Voice and speech recognition have already started to dominate our domestic lives. Smart devices such as Amazon’s Alexa and Google’s Home hub have made a significant impact on the lifestyle of the urban population. Just a couple of years ago, touchscreen devices seemed like the pinnacle of consumer electronics; now it is believed that the future is going to be hands-free.
Once the technical issues such as noise, dialects, and incorporation of more regional languages are sorted, the voice and speech recognition technology will surely change the way we interact with the world around us. Corporations are also getting more and more aware of the importance of speech recognition as an efficient way of documentation and record-keeping.
Voice governed internet searches are bound to affect the search engine dynamics. So, voice SEO is going to play a crucial role. Digital marketers will have to equally invest, if not more, in voice-based searches than they do in traditional SEO.
Given that the next decade is expected to be the decade of wearable tech, and with the incorporation of voice command systems, the way humans interact with computers is set to change massively. Voice and speech recognition are here to stay until something more natural and efficient comes along.