Extract conversations easily from an audio recording with Python
In this article, we will learn how to use IBM’s Speech to Text API to recognize speech from an audio recording file. We are going to use the free version of the API, which may have some limitations, such as a limit on audio length. I will talk more about this later in the Step 4 – Speech Recognizer in Action section.
If you are reading this article, I am sure you have heard the term “artificial intelligence” and know how important it is. I would say that one of the nicest real-life uses of artificial intelligence is speech recognition.
Speech recognition lets us save time by speaking instead of typing, which makes our gadgets easier and more fun to use. It also lets us interact with these devices without writing a single line of code. Imagine if people had to know programming to give commands to Alexa or Siri. That would be insane. 😊
I can’t wait to show you the speech recognizer in action. Let’s get to work. Here are the steps that we will follow in this speech recognition project.
Table of Contents:
- Speech Recognition Cloud Services
- Step 1 – Library
- Step 2 – Import an Audio Clip
- Step 3 – Define the Recognizer
- Step 4 – Speech Recognizer in Action
- Final Step – Exporting the Result
Speech Recognition Cloud Services
Many big tech companies have their own speech recognition models. I will share some of them here to show you the big picture. These APIs run in the cloud and can be accessed from anywhere in the world, as long as there is an internet connection. Most of them are paid services, but they can be tested for free. For example, Microsoft offers a year of free access with an Azure cloud account.
Here are some of the most popular speech to text cloud services available:
Step 1 – Library
For this project, we need only one library: SpeechRecognition. It is free and open source, and it supports multiple speech recognition engines and APIs, such as Microsoft Azure Speech, Google Cloud Speech, and IBM Watson Speech to Text. In this project, we will test the IBM Watson Speech to Text API. Feel free to check out the source code and documentation of the SpeechRecognition package.
Let’s start by installing the package. We are going to use pip, Python’s package manager.
pip install SpeechRecognition
After the installation is complete, we can open our code editor (Jupyter Notebook works too) and import the library:
import speech_recognition as s_r
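The package is installed as SpeechRecognition but imported as speech_recognition, so if the import fails it helps to confirm which distribution pip actually installed. A minimal sketch using the standard library’s importlib.metadata; the helper name installed_version is hypothetical:

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "SpeechRecognition") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    # Print the version string, or a hint on how to install the package
    print(found if found else "Not installed. Run: pip install SpeechRecognition")
```

Knowing the exact version is also handy when comparing your code against the package’s documentation.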
Step 2 – Import an Audio Clip
I recorded a voice memo on my computer. It was in m4a format, but the recognizer doesn’t work with m4a files (AudioFile reads WAV, AIFF, and FLAC), so I had to convert it to WAV format.
audio_file = s_r.AudioFile('my_clip.wav')
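Before handing the converted file to the recognizer, it can be worth checking that it really is a PCM WAV file and how long it is, since the free API tier limits how much audio you can send. A small sketch with Python’s built-in wave module; the names describe_wav and WavInfo are hypothetical:

```python
import wave
from dataclasses import dataclass


@dataclass
class WavInfo:
    channels: int
    sample_width_bytes: int
    sample_rate: int
    duration_seconds: float


def describe_wav(path: str) -> WavInfo:
    """Open a WAV file with the standard library and report its basic format.

    wave.open raises wave.Error if the file is not a RIFF/WAV file, for example
    an .m4a file that was only renamed to .wav instead of being converted.
    """
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return WavInfo(
            channels=wav.getnchannels(),
            sample_width_bytes=wav.getsampwidth(),
            sample_rate=rate,
            duration_seconds=frames / rate,
        )
```

Calling describe_wav('my_clip.wav') should print the channel count, sample width, sample rate, and duration of the clip, or fail loudly if the conversion did not happen.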
Step 3 – Define the Recognizer
In this step, all we will do is define the speech recognizer. We have already imported the library; now we will create a new variable and assign a Recognizer instance to it.
rcgnzr = s_r.Recognizer()
Step 4 – Speech Recognizer in Action
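It’s showtime! We will run IBM’s Speech to Text on our audio file. Before running the recognizer, I will call two of its methods: adjust_for_ambient_noise and record. adjust_for_ambient_noise reads the first second of the audio (by default) and calibrates the recognizer’s energy threshold to the background noise level; record then reads the rest of the file into an AudioData object that we can send to the API. Note that adjust_for_ambient_noise does not filter noise out of the audio itself, and the second it reads is not included in what record returns, so if your clip starts with speech right away, you may want to skip that call.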
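with audio_file as source:
    # Calibrate the energy threshold on the first second of the file (that second is consumed)
    rcgnzr.adjust_for_ambient_noise(source)
    # Read the rest of the file into an AudioData object
    clean_audio = rcgnzr.record(source)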
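To see why ambient-noise calibration changes a setting rather than the recording, here is a simplified pure-Python model of it. It is modeled on the update rule I believe SpeechRecognition uses (starting threshold 300, damping 0.15, ratio 1.5), but treat those constants and the helper names rms_16bit and calibrate_energy_threshold as assumptions, not the library’s API. Samples are plain lists of signed 16-bit integers instead of raw bytes to keep the sketch short.

```python
import math
from typing import Iterable, List


def rms_16bit(samples: List[int]) -> float:
    """Root-mean-square energy of a chunk of signed 16-bit samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def calibrate_energy_threshold(
    chunks: Iterable[List[int]],
    sample_rate: int,
    chunk_size: int,
    duration: float = 1.0,
    threshold: float = 300.0,
    damping_base: float = 0.15,
    energy_ratio: float = 1.5,
) -> float:
    """Simplified model of ambient-noise calibration.

    Processes chunks until `duration` seconds have elapsed and nudges the
    threshold toward energy_ratio * (chunk energy). Only the returned number
    changes; the audio samples themselves are never modified.
    """
    seconds_per_chunk = chunk_size / sample_rate
    damping = damping_base ** seconds_per_chunk
    elapsed = 0.0
    for chunk in chunks:
        elapsed += seconds_per_chunk
        if elapsed > duration:
            break
        target = rms_16bit(chunk) * energy_ratio
        # Exponential moving average toward the target energy
        threshold = threshold * damping + target * (1 - damping)
    return threshold
```

With one second of silence the threshold drifts below its starting value, and with steady noise it moves toward 1.5 times the noise energy. Either way, the samples that record later reads are untouched.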
Perfect, now our audio is loaded and ready for the recognizer. Let’s go ahead and run IBM’s speech recognizer. (It took me a couple of hours to figure out how the IBM Speech to Text API integrates with the SpeechRecognition Python library.) Here is the shortest way to do it:
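# With an IBM Cloud API key, the username is the literal string "apikey" and the key goes in password
recognized_speech_ibm = rcgnzr.recognize_ibm(clean_audio, username="apikey", password="your API Key")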
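Why does username="apikey" work? As far as I can tell from the library’s source, recognize_ibm in the form used above sends the two values as HTTP Basic credentials, and IBM Cloud accepts an API key as the password with the literal username apikey. This sketch only builds that header with the standard library so you can see what gets sent; it makes no network call, and the helper name basic_auth_header is hypothetical.

```python
import base64


def basic_auth_header(api_key: str, username: str = "apikey") -> dict:
    """Build an HTTP Basic Authorization header from IBM-style credentials.

    The credentials are joined as "username:password" and base64-encoded.
    """
    token = base64.b64encode(f"{username}:{api_key}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}
```

Base64 is an encoding, not encryption, which is one more reason to keep the key out of shared notebooks.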
Note: IBM’s API doesn’t work without an API key, so we need to get one from the IBM Watson page. I created an account to test this Speech to Text model. The good thing about IBM’s model is that we can still transcribe up to 500 minutes of audio with the Lite account, which is more than enough for learning purposes.
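Once you have the key, it is safer not to paste it into the script itself, especially if you share your code or notebook. A minimal sketch that reads it from an environment variable instead; the variable name IBM_STT_API_KEY and the helper load_api_key are hypothetical choices:

```python
import os


def load_api_key(var_name: str = "IBM_STT_API_KEY") -> str:
    """Read the IBM Speech to Text API key from an environment variable.

    IBM_STT_API_KEY is a hypothetical name; any variable name works.
    Raises RuntimeError if the variable is missing or empty.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable to your IBM API key.")
    return key
```

The recognizer call then becomes rcgnzr.recognize_ibm(clean_audio, username="apikey", password=load_api_key()).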
Final Step – Exporting the Result
We are almost done. It’s time to check the result. In the previous step, our recognizer transcribed the speech in the audio file. Let’s see how well it worked; if we are satisfied with the result, we will export it to a text document.
To check the recognized speech, let’s print out the recognized_speech_ibm variable:
print(recognized_speech_ibm)
Looks good. It did a great job recognizing my audio recording, in which I was reading a paragraph from this article. If you are not satisfied with the result, there are many ways to preprocess the audio file to get better results. Here is a nice article with more detailed information about speech recognition and how to improve the recognizer’s accuracy.
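As one example of preprocessing (not necessarily a technique the article above covers), here is a sketch of peak normalization using only the standard library. It assumes a 16-bit PCM WAV file like the one converted in Step 2; the function name normalize_wav and the 0.9 target are arbitrary choices. It makes quiet recordings louder but does not remove background noise, so whether it helps the recognizer depends on your recording.

```python
import array
import sys
import wave


def normalize_wav(src: str, dst: str, target_peak: float = 0.9) -> float:
    """Scale a 16-bit PCM WAV so its loudest sample reaches target_peak of full scale.

    Returns the gain that was applied (1.0 for a silent file).
    """
    with wave.open(src, "rb") as wav_in:
        params = wav_in.getparams()
        if params.sampwidth != 2:
            raise ValueError("this sketch only handles 16-bit PCM audio")
        samples = array.array("h", wav_in.readframes(params.nframes))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    peak = max((abs(s) for s in samples), default=0)
    gain = 1.0 if peak == 0 else target_peak * 32767 / peak
    # Apply the gain and clamp every sample to the valid 16-bit range
    scaled = array.array("h", (max(-32768, min(32767, round(s * gain))) for s in samples))
    if sys.byteorder == "big":
        scaled.byteswap()
    with wave.open(dst, "wb") as wav_out:
        wav_out.setparams(params)
        wav_out.writeframes(scaled.tobytes())
    return gain
```

You could run normalize_wav('my_clip.wav', 'my_clip_normalized.wav') and pass the new file to s_r.AudioFile to compare the two transcripts.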
Now, I will export the recognized speech into a text document. We will see the message “ready!” in our terminal when the export is complete.
# Write the transcript to a UTF-8 text file
with open('recognized_speech.txt', mode='w', encoding='utf-8') as file:
    file.write("Recognized Speech:")
    file.write("\n")
    file.write(recognized_speech_ibm)
print("ready!")
Congrats! If you’re reading this paragraph, you have built a speech recognizer. I hope you enjoyed this hands-on tutorial and learned something new today. The best way to practice your programming skills is by making fun projects, and I’ve shared many more hands-on projects like this one. Feel free to reach out to me if you have any questions while implementing the program.