Speech Recognition in Python— The Complete Beginner’s Guide

Simple and hands-on walkthrough

Welcome to The Complete Beginner’s Guide to Speech Recognition in Python.

In this post, I will walk you through some hands-on exercises that will help you understand speech recognition and how machine learning is used in it. Speech recognition saves us time by letting us speak instead of type. It also lets us communicate with our devices without writing a single line of code, which makes technology more accessible and easier to use. Speech recognition is a great example of machine learning in real life.

Another nice example of speech recognition is the Google Meet web application. Did you know that you can turn on subtitles from the settings? When you do, a program in the background recognizes your speech and converts it to text in real time. It’s really impressive to see how fast it happens. Another cool feature of the Google Meet recognizer is that it also knows who is speaking. In this walkthrough, we will use Google’s Speech API. I can’t wait to show you how to build our own speech recognizer. Let’s get started!

Speech Recognition Libraries
Recognizer Class
Speech Recognition Functions
Audio Preprocessing
Bonus

Speech Recognition Libraries

CMU Sphinx by Carnegie Mellon University
Kaldi
SpeechRecognition
Wav2letter++ by Facebook

CMU Sphinx collects over 20 years of CMU research. Some advantages of this library: CMUSphinx tools are designed specifically for low-resource platforms, have a flexible design, and focus on practical application development rather than research.

Kaldi is a toolkit for speech recognition, intended for use by speech recognition researchers and professionals.

SpeechRecognition is a library for performing speech recognition, with support for several engines and APIs, online and offline.

wav2letter++ is a fast, open source speech processing toolkit from the Speech team at Facebook AI Research built to facilitate research in end-to-end models for speech recognition.

Of these libraries, we will be working with SpeechRecognition because of its low barrier to entry and its compatibility with many available speech recognition APIs. We can install the SpeechRecognition library by running the following line in our terminal window:

pip install SpeechRecognition
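To confirm that the install worked, here is a small sketch that uses only the standard library. It assumes Python 3.8 or newer (for importlib.metadata), and the helper name installed_version is made up for this example.

```python
from importlib import metadata


def installed_version(distribution_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if the pip command above succeeded, otherwise None
print(installed_version("SpeechRecognition"))
```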

Recognizer Class

The SpeechRecognition library has many classes, but we will focus on one called Recognizer. This is the class that will help us convert audio files into text. To access the Recognizer class, let’s first import the library.

import speech_recognition as sr

Now, let’s define a variable and assign it an instance of the Recognizer class.

recognizer = sr.Recognizer()

Now, let’s set the energy threshold to 300. You can think of the energy threshold as a loudness cutoff for the audio. Values below the threshold are considered silence, and values above the threshold are considered speech. Setting it sensibly improves speech recognition when working with an audio file.

recognizer.energy_threshold = 300

SpeechRecognition’s documentation recommends 300 as a threshold value, which works well with most audio files. Also keep in mind that the energy threshold can adjust automatically while the recognizer listens, as long as dynamic thresholds are enabled (the dynamic_energy_threshold setting).
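To make the idea of an energy threshold more concrete, here is a minimal sketch that does not use SpeechRecognition at all. It assumes 16-bit mono PCM samples and measures loudness as root-mean-square amplitude. The helper names (rms_energy, is_speech, tone) are made up for this illustration, and the library’s own calculation may differ in detail.

```python
import math
import struct


def rms_energy(pcm_bytes):
    """Root-mean-square amplitude of 16-bit little-endian mono PCM audio."""
    sample_count = len(pcm_bytes) // 2
    if sample_count == 0:
        return 0.0
    samples = struct.unpack(f"<{sample_count}h", pcm_bytes[: sample_count * 2])
    return math.sqrt(sum(s * s for s in samples) / sample_count)


def is_speech(pcm_bytes, energy_threshold=300):
    """Treat a chunk as speech when its energy exceeds the threshold."""
    return rms_energy(pcm_bytes) > energy_threshold


def tone(amplitude, sample_count=1600, frequency=440.0, sample_rate=16000):
    """Synthetic sine wave standing in for recorded audio."""
    values = [
        int(amplitude * math.sin(2 * math.pi * frequency * n / sample_rate))
        for n in range(sample_count)
    ]
    return struct.pack(f"<{sample_count}h", *values)


quiet_chunk = tone(amplitude=100)    # faint hum, energy well below 300
loud_chunk = tone(amplitude=10000)   # clear signal, energy well above 300

print(is_speech(quiet_chunk))  # False
print(is_speech(loud_chunk))   # True
```

Raising the threshold makes the recognizer ignore more of the quiet parts, and lowering it makes it more sensitive.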

Speech Recognition Functions

In this step, we will see our recognizer in action, but before we get it to work, let’s look at some cool functions of this instance. SpeechRecognition has built-in functions that make it work with many of the APIs out there:

recognize_bing()
recognize_google()
recognize_google_cloud()
recognize_wit()

Bing Recognizer function uses Microsoft’s cognitive services.

Google Recognizer function uses the free Google Web Speech API.

Google Cloud Recognizer function uses Google’s cloud speech API.

Wit Recognizer function uses the wit.ai platform.

We will use the Google Recognizer function, recognize_google(). It’s free and doesn’t require an API key. It has one drawback: it struggles with longer audio files. In my experience, I didn’t have any issues with audio files under 5 minutes, but I don’t recommend using this recognizer with long audio files. There are different techniques for working with longer audio files, which I am planning to cover in a different post.

Basic Example

import speech_recognition as sr

recognizer = sr.Recognizer()

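# Shape of the call only: audio_data must be an AudioData object, not a file path.
# my_audio is a placeholder; the next section shows how to build it.
recognizer.recognize_google(audio_data=my_audio, language="en-US")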

Audio Preprocessing

The previous example was deliberately simple. For better recognition, a preprocessing step is necessary. You can think of it like the data preprocessing we do before data analysis. There is a special class that we will use for this step called AudioFile.

AudioFile

import speech_recognition as sr

recognizer = sr.Recognizer()

audio_file = sr.AudioFile("my_audio.wav")

type(audio_file)

# This call fails: recognize_google() expects AudioData, not AudioFile
recognizer.recognize_google(audio_data=audio_file)

When we try to pass the audio_file variable to the recognize_google() function, it will not accept it. The reason is that the function accepts AudioData, but our current variable is of type AudioFile. To convert it to AudioData, we will use the Recognizer class’s built-in method called record.

Record Method

Here is how we do it:

with audio_file as source:
  audio_data = recognizer.record(source)

# Pass the AudioData object, not the AudioFile
print(recognizer.recognize_google(audio_data=audio_data))

type(audio_data)
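The difference between the two types can be pictured with the standard library alone. An AudioFile is only a handle to a file on disk, while AudioData holds the decoded audio in memory, exposed as frame_data, sample_rate, and sample_width. The sketch below uses Python’s wave module, not SpeechRecognition, to read those same three pieces from a WAV file. The names read_wav and demo.wav are made up for illustration, and the file is one second of generated silence.

```python
import os
import tempfile
import wave


def read_wav(path):
    """Return (frame_data, sample_rate, sample_width) for a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        frame_data = wav_file.readframes(wav_file.getnframes())
        return frame_data, wav_file.getframerate(), wav_file.getsampwidth()


# Write one second of silence so the sketch has a file to read
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)       # mono
    wav_file.setsampwidth(2)       # 16-bit samples
    wav_file.setframerate(16000)   # 16 kHz
    wav_file.writeframes(b"\x00\x00" * 16000)

frame_data, sample_rate, sample_width = read_wav(demo_path)
print(type(frame_data).__name__, sample_rate, sample_width, len(frame_data))
# bytes 16000 2 32000
```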

There are two parameters of the record method that we can also use.

Duration
Offset

By default, both of these parameters are None. In that default mode, the record method reads the audio data from the beginning of the file until there is no more audio. We can change this by giving them float values.

Duration: let’s say we want only the first 7 seconds of the whole audio file. We set the duration parameter to 7.
Offset: this skips a specified number of seconds at the start of an audio file. Let’s say we don’t want the first 3 seconds of the audio file. We set the offset parameter to 3.

Duration

with audio_file as source:
  audio_file_duration = recognizer.record(source, duration = 7.0)

Offset

with audio_file as source:
  audio_file_offset = recognizer.record(source, offset = 3.0)
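The two parameters can also be combined to walk through a file one window at a time. The sketch below is one way this could look, not a method from the library: wav_duration_seconds, window_bounds, and transcribe_in_windows are made-up helper names, the 30-second window is an arbitrary choice, and the recognizer and audio_file arguments are assumed to be a Recognizer and an AudioFile like the ones created earlier.

```python
import wave


def wav_duration_seconds(path):
    """Length of a WAV file in seconds, read from its header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def window_bounds(total_seconds, window_seconds):
    """Return (offset, duration) pairs that cover the file back to back."""
    bounds = []
    offset = 0.0
    while offset < total_seconds:
        duration = min(float(window_seconds), total_seconds - offset)
        bounds.append((offset, duration))
        offset += window_seconds
    return bounds


def transcribe_in_windows(recognizer, audio_file, total_seconds, window_seconds=30.0):
    """Record and recognize one window at a time using offset and duration."""
    texts = []
    for offset, duration in window_bounds(total_seconds, window_seconds):
        # Re-open the file for every window so the offset counts from the start
        with audio_file as source:
            piece = recognizer.record(source, offset=offset, duration=duration)
        texts.append(recognizer.recognize_google(piece))
    return texts


print(window_bounds(10.0, 4.0))  # [(0.0, 4.0), (4.0, 4.0), (8.0, 2.0)]
```

Keep in mind that a word spoken right at a window boundary may be cut in half, so the joined text can contain small errors at those points.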

Bonus

Speechless Audio

# Import the silent audio file
silent_audio_file = sr.AudioFile("silent_audio.wav")

# Convert the AudioFile to AudioData
with silent_audio_file as source:
  silent_audio = recognizer.record(source)

# Recognize the AudioData with show_all turned on
recognizer.recognize_google(silent_audio, show_all=True)
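With show_all turned on, recognize_google() returns the raw API response instead of a plain string, and it does not raise UnknownValueError when nothing is recognized. The sketch below shows one way to read that response. It assumes the response is either an empty list (no speech found) or a dictionary with an "alternative" list of candidates, which is the shape commonly seen from this endpoint but is not guaranteed. The helper name best_transcript and the sample dictionary are made up for illustration.

```python
def best_transcript(raw_result):
    """Pull the top transcript out of a show_all=True result, or None."""
    if not isinstance(raw_result, dict):
        return None  # e.g. an empty list when no speech was found
    alternatives = raw_result.get("alternative", [])
    if not alternatives:
        return None
    return alternatives[0].get("transcript")


# Hand-written sample in the assumed shape, not real API output
sample_result = {
    "alternative": [
        {"transcript": "hello world", "confidence": 0.92},
        {"transcript": "hello word"},
    ],
    "final": True,
}

print(best_transcript(sample_result))  # hello world
print(best_transcript([]))             # None
```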

Multiple Speakers

The process of telling apart the different speakers in a single audio file is known as speaker diarization. This is a really cool feature to have, but unfortunately it is not available in this library. One workaround is to have a separate audio file for each speaker and go through them with a for loop.

recognizer = sr.Recognizer()

# Multiple speakers on different files
speakers = [sr.AudioFile("speaker_0.wav"), sr.AudioFile("speaker_1.wav"), sr.AudioFile("speaker_2.wav")]

# Transcribe each speaker individually
for i, speaker in enumerate(speakers):
  with speaker as source:
    speaker_audio = recognizer.record(source)
  
  print(f"Text from speaker {i}:")
  print(recognizer.recognize_google(speaker_audio, language="en-US"))

Background Noise

To handle background noise, the Recognizer class has a built-in method called adjust_for_ambient_noise, which also takes a duration parameter. With this method, the recognizer listens to the audio for the specified number of seconds from the beginning of the audio and then adjusts the energy threshold so that the whole audio is more recognizable.

# Import audio file with background noise
noisy_audio_file = sr.AudioFile("noisy_support_call.wav")

# Adjust for ambient noise and record
with noisy_audio_file as source:
  recognizer.adjust_for_ambient_noise(source, duration=0.5)
  noisy_audio = recognizer.record(source)

# Recognize the audio
recognizer.recognize_google(noisy_audio)
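To picture what adjusting for ambient noise means, here is a simplified sketch that does not use SpeechRecognition. It measures the energy of the first half second of a recording and sets the threshold a fixed ratio above it. The 16-bit mono PCM format, the 16 kHz sample rate, the ratio of 1.5, and the helper names are all assumptions for this illustration, and the library’s real adjustment is more involved.

```python
import math
import struct

SAMPLE_RATE = 16000  # samples per second, assumed for this sketch


def rms_energy(pcm_bytes):
    """Root-mean-square amplitude of 16-bit little-endian mono PCM audio."""
    sample_count = len(pcm_bytes) // 2
    if sample_count == 0:
        return 0.0
    samples = struct.unpack(f"<{sample_count}h", pcm_bytes[: sample_count * 2])
    return math.sqrt(sum(s * s for s in samples) / sample_count)


def calibrated_threshold(pcm_bytes, duration=0.5, ratio=1.5):
    """Set the threshold a fixed ratio above the energy of the opening noise."""
    lead_in_bytes = int(SAMPLE_RATE * duration) * 2  # 2 bytes per sample
    return rms_energy(pcm_bytes[:lead_in_bytes]) * ratio


def tone(amplitude, seconds, frequency=440.0):
    """Synthetic sine wave standing in for recorded audio."""
    sample_count = int(SAMPLE_RATE * seconds)
    values = [
        int(amplitude * math.sin(2 * math.pi * frequency * n / SAMPLE_RATE))
        for n in range(sample_count)
    ]
    return struct.pack(f"<{sample_count}h", *values)


background = tone(amplitude=600, seconds=0.5)  # steady hum, louder than 300
speech = tone(amplitude=8000, seconds=1.0)     # stand-in for a voice
recording = background + speech

threshold = calibrated_threshold(recording)
print(rms_energy(background) > 300)        # True: the default would treat the hum as speech
print(rms_energy(background) > threshold)  # False: the calibrated value ignores it
print(rms_energy(speech) > threshold)      # True
```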

Video Demonstration

I recently started my journey on YouTube, where I post video demonstrations on the following topics: Machine Learning, Data Science, and Artificial Intelligence. Enjoy!

Thank you for reading this post. I hope you enjoyed it and learnt something new today. Feel free to contact me through my blog if you have any questions while implementing the code. I will be more than happy to help. Stay safe and happy coding!


5 responses to “Speech Recognition in Python— The Complete Beginner’s Guide”

Sabrina says:

November 4, 2020 at 3:38 pm

Thank you for this! I just built some speech recognition scripts using your tutorials! It’s super cool but I would really love to know how to handle larger audio files. We have to do some interviews for uni and couldn’t find any good transcribers and thought I’d try to build one.


- sonsuzdesign says:
  
  December 23, 2020 at 7:01 pm
  
  Hello Sabrina, I am happy that you found this project helpful. You are right, it doesn’t work great with larger audio files. After doing some research I found the splitting the audio method. You can find couple article on google. Thanks again!
  
  
RRTutors says:

January 19, 2021 at 10:57 pm

Great Article on Speech recognition concept. I followed it. I was more exiting on this.


Gianluca says:

February 6, 2021 at 7:31 am

Hi, i m Gianluca from Italy.
Thank s very very much for your awesome tutorial.
I am trying to use gTTS at the same time. It works .. but I can’t stop the voice with the voice commands because the sound of the speaker enters the microphone .. and the microphone is always busy.
Could you help me?


Building a Speech Emotion Recognizer using Python – Sonsuz Design says:

March 15, 2021 at 9:24 am

[…] Speech Recognition in Python – The Complete Beginner’s Guide […]
