What is Speech Recognition? (With a Python Example)

By Marina Chatterjee Updated on Nov 5, 2024

What is Speech Recognition?

Speech recognition gives a computer the ability to recognize spoken language and convert it into text, which is a key step toward understanding natural language. We are complex creatures, and so is our language. We may be discussing a serious issue and then suddenly switch to something completely unrelated. Abrupt context switches like this make language non-linear. We follow them without effort, and computers are on their way to acquiring the same ability. Alibaba and Google have recently demonstrated applications of this kind that impressed audiences worldwide. That is the big picture, but have you ever wondered how to add speech recognition to a project you are working on? If so, let's learn some basic concepts of speech recognition and implement it using readily available Python packages.

Speech recognition examples:

When Amazon invests heavily in its voice assistant Alexa, it aims to reduce the friction between consumers and what they need. Natural language makes this possible. Imagine that while you are cooking, you realize you need a knife. You say, "Alexa, get me a cooking knife under Rs 500," and it takes care of the purchase for you. That is the future, which is why all major companies are investing in voice assistants that can understand the context and emotion of the consumer. Such assistants enable a new level of interactivity and accessibility.

In this guide, we will look at how speech recognition works, the options available for implementing it, and a short Python program that uses it.

Speech recognition can be approached in many ways. Early systems relied on simple template matching, and the field gradually moved toward analysing the frequency components of the signal. Today, it sits at the intersection of frequency analysis and deep learning. Deep learning matters because it lets a system take context into account, and context is what makes useful services possible. Hence the rush into deep learning.

Speech Recognition and Deep Learning:

A sound wave is a form of data that carries information such as phase, amplitude and signal-to-noise ratio. These parameters describe a one-dimensional signal: at each instant in time, the wave has a single value, its amplitude (the height of the wave). This is an analog signal, continuous in time, so to feed it to a computer we discretize it through a process called sampling, which records the amplitude at regular intervals.
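To make sampling concrete, here is a minimal sketch using only Python's standard library. It samples a pure 440 Hz sine tone (an assumed stand-in for real speech) at 16,000 samples per second, then rounds each sample to a signed 16-bit integer, which is the quantization step described in the next paragraph. The function names, tone frequency and sample rate are illustrative choices, not part of any speech library.

```python
import math


def sample_tone(freq_hz: float, sample_rate: int, duration_s: float) -> list[float]:
    """Sample a pure sine tone with amplitude in [-1, 1] at a fixed rate."""
    n_samples = round(sample_rate * duration_s)
    # The continuous wave sin(2*pi*f*t) is read only at t = n / sample_rate.
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate) for n in range(n_samples)]


def quantize_16bit(samples: list[float]) -> list[int]:
    """Map each sample in [-1, 1] to a signed 16-bit integer level."""
    max_level = 2**15 - 1  # 32767, the largest signed 16-bit value
    # Clip to [-1, 1] first so out-of-range values cannot overflow.
    return [round(max(-1.0, min(1.0, s)) * max_level) for s in samples]


# Illustrative values: a 440 Hz tone sampled at 16 kHz for 10 ms -> 160 samples.
samples = sample_tone(440.0, 16_000, 0.01)
pcm = quantize_16bit(samples)
print(len(samples), pcm[:5])
```

A higher sample rate captures faster changes in the wave, and more bits per sample give finer amplitude levels; both increase the amount of data the model has to process.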

The sampled values are then quantized: each one is rounded to one of a finite number of levels so that it can be stored in a fixed number of bits. This digital data is passed to a machine learning model, which predicts the output. Because speech is a sequence, and context matters in speech, we need models that capture the relationship between previous inputs and the current one. We therefore use sequence models, such as Hidden Markov Models, recurrent neural networks and Long Short-Term Memory (LSTM) networks, as building blocks for the classifier. What these models have in common is that information from earlier steps in the sequence is carried forward and influences how later inputs are processed, which gives them a form of memory.
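The "memory" these models share can be seen in a toy recurrent cell. The sketch below is a one-unit Elman-style recurrent network with hand-picked weights (assumed values for illustration; nothing is trained). It is not an HMM or an LSTM, only the simplest case of a hidden state carried from one step to the next.

```python
import math


def rnn_states(inputs: list[float], w_x: float = 0.8, w_h: float = 0.5, bias: float = 0.0) -> list[float]:
    """Run a one-unit recurrent cell: h_t = tanh(w_x * x_t + w_h * h_{t-1} + bias).

    The weights are arbitrary illustrative values, not learned ones.
    """
    h = 0.0  # the memory starts empty
    states = []
    for x in inputs:
        # The new state mixes the current input with the previous state.
        h = math.tanh(w_x * x + w_h * h + bias)
        states.append(h)
    return states


# Both sequences end with the same input (1.0) but have different histories.
seq_a = rnn_states([0.0, 0.0, 1.0])
seq_b = rnn_states([1.0, 1.0, 1.0])
print(seq_a[-1], seq_b[-1])
```

The two final states differ even though the last input is identical, because the earlier inputs are still reflected in the hidden state.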

For example, if the current input is "my name", the next word is very likely to be "is". The idea is that the company a word keeps defines its usage: near "name", words such as "is", "what" and "my" appear far more often than a word like "kitchen". Once we have the audio in a processable format, we feed it to a deep neural network model.
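The "company of a word" idea can be sketched with a bigram count, the simplest statistical language model: it predicts the next word from the previous word alone. The four-sentence corpus and the function names below are made up for illustration; real recognizers use far larger corpora and, as described above, neural sequence models rather than raw counts.

```python
from collections import Counter, defaultdict
from typing import Optional


def bigram_counts(sentences: list[str]) -> dict[str, Counter]:
    """Count how often each word is immediately followed by each other word."""
    follows: dict[str, Counter] = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            follows[prev][nxt] += 1
    return follows


def most_likely_next(follows: dict[str, Counter], word: str) -> Optional[str]:
    """Return the most frequent follower of `word`, or None if it was never followed."""
    counts = follows.get(word)  # .get does not create empty entries in a defaultdict
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical toy corpus.
corpus = [
    "my name is Asha",
    "what is my name",
    "my name is Ravi",
    "the knife is in the kitchen",
]
follows = bigram_counts(corpus)
print(most_likely_next(follows, "name"))  # → is
```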

What are the various available packages?

There are many speech recognition packages for Python. A few of them are:

Automatic-speech-recognition
Vocalist
Xy-speech
Google-cloud-speech
Watson-developer-cloud

The Google Cloud Speech and IBM Watson services include a free tier that can be used for experimentation. PyAudio is required only if you want to use microphone input (the Microphone class of the SpeechRecognition package used below).
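Since PyAudio is only needed for live microphone input, an existing recording can be transcribed without it. The sketch below assumes the SpeechRecognition package's `AudioFile` class and `Recognizer.record` method, which read WAV, AIFF or FLAC files; the function name `transcribe_file` is hypothetical, and the call requires an internet connection because it uses the same Google recognizer as the microphone example.

```python
def transcribe_file(path: str, language: str = "en-US") -> str:
    """Transcribe a WAV, AIFF or FLAC file with the SpeechRecognition package."""
    # Imported inside the function so this sketch can be defined even where the
    # package is missing; install it with `pip install SpeechRecognition`.
    import speech_recognition as sr

    r = sr.Recognizer()
    with sr.AudioFile(path) as source:  # reads the file directly, no PyAudio needed
        audio = r.record(source)  # load the whole file into an AudioData object
    # Sends the audio to Google's Web Speech API over the internet.
    return r.recognize_google(audio, language=language)
```

For example, `transcribe_file("recording.wav")` (a hypothetical file name) would return the transcript as a string, or raise the package's `UnknownValueError` or `RequestError` on failure.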

Code
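First, install the SpeechRecognition package (imported in Python as speech_recognition) using pip (pip install SpeechRecognition) or conda. Install PyAudio as well if you want to use microphone input.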

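```python
import speech_recognition as sr

# Capture audio from the default microphone (requires PyAudio)
r = sr.Recognizer()
with sr.Microphone() as source:
    # Calibrate the energy threshold to the room's background noise
    r.adjust_for_ambient_noise(source)
    print("Speak:")
    audio = r.listen(source)  # records until the phrase is followed by silence

# Send the audio to Google's Web Speech API and print the transcript
try:
    print("You said: " + r.recognize_google(audio))
except sr.UnknownValueError:
    # The speech could not be understood
    print("Could not understand audio")
except sr.RequestError as e:
    # The API could not be reached or returned an error
    print(f"Could not request results; {e}")
```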

