Building a Speech Emotion Recognizer using Python

Step-by-step guide to speech emotion recognition with MLP artificial neural network

In this article, I will show you how to recognize different emotions from pre-recorded audio recordings. We know that voice-controlled personal assistants such as Amazon Alexa, Apple Siri, and Google Assistant have become more powerful and are still evolving. We are starting to see them integrated into phones, laptops, kitchen gadgets, cars, and almost anything else we use daily. I think ease of use is the main reason this field is growing so impressively.

When I found out about the Speech Emotion Recognition project on Kaggle, which uses the RAVDESS Emotional speech audio dataset, I decided to work on it myself and then share it as a written tutorial. I think this is an exciting and fun project. As we use more voice-controlled gadgets, I believe emotion recognition will become part of these devices in the coming years. The artificial intelligence behind them will be smart enough to understand our emotions when we speak and to give more personalized responses.

For example, before getting on the road, I ask Siri to “play music from the Music app,” and it starts to play my usual broad mix. But imagine adding emotion recognition to that command. This way, it could play music that matches my mood. Many music apps already offer mood-based categories, so why not play the right mix with just a simple “play music” command?

I enjoy working on Speech Recognition related projects like this one. I have published a handful of articles connected to this topic. I will add them at the end of this article. Feel free to check them out if you want to develop yourself in this field. 

If you are ready, let’s get to work. Here is the structure that we will follow in this article.

Table of contents

  • Step 1 — Libraries
  • Step 2 — Understanding the RAVDESS Data
  • Step 3 — Extracting Features from Audio Recordings
  • Step 4 — Loading and Preparing the Data
  • Final Step — MLP Classifier Prediction Model
  • Conclusion

Step 1 — Libraries

First things first, let’s install the libraries that we will need. We can use pip, Python’s package installer. We can install multiple libraries in one line as follows:

pip install soundfile librosa numpy scikit-learn

After the installation is complete, we can go ahead and open a new notebook or editor. I use Jupyter Notebook for most of my projects, and it works great for machine learning and data science.

Then I will import the modules that we will need in this project. We can rename the libraries using “as” when importing them, which makes them easier to call and use.

import librosa as lb
import soundfile as sf
import numpy as np
import os, glob, pickle

from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

I will add the official documentation pages for each library for reference. They are always a great place to learn more about the modules and how to use them.


Step 2 — Understanding the RAVDESS Data

Let me start by sharing what RAVDESS stands for: the Ryerson Audio-Visual Database of Emotional Speech and Song. It is a large dataset with both audio and video recordings. The original size of this data is around 24 GB, but we will use a smaller portion of it rather than the whole dataset. This will help us stay focused, train our model faster, and keep things simple. The smaller portion of the dataset can be found here on Kaggle.

A little background information about the data

This portion of the RAVDESS contains 1440 files: 60 trials per actor x 24 actors = 1440. The recordings come from 24 professional actors: 12 female and 12 male. The speech emotions include calm, happy, sad, angry, fearful, surprise, and disgust expressions. You can learn more on the Kaggle page.

The file names follow a particular pattern. This pattern consists of 7 parts, divided as follows: Modality, Vocal channel, Emotion, Emotional intensity, Statement, Repetition, and Actor. Each part has its own sub-divisions. All this information is encoded in the name; you can find more about it on the Kaggle page.
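To make the pattern concrete, here is a minimal sketch of how a single file name breaks down. The name used below follows the example format from the dataset’s documentation, and we will rely on the same split("-") trick later when loading the data.

# Illustrative file name; every recording in the dataset follows the same pattern.
file_name = "03-01-06-01-02-01-12.wav"
parts = file_name.replace(".wav", "").split("-")
modality, vocal_channel, emotion_code, intensity, statement, repetition, actor = parts
print(emotion_code)  # '06', which the dataset maps to 'fearful'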

Here is a screenshot of the Actor_1 folder within the dataset:

[Screenshot of the Actor_1 folder (image by author)]

Emotion labels

Here are the labels of the emotion categories. We are going to create this dictionary to use when training the machine learning model. After the labels, we create a list of the emotions we want to focus on in this project. It is hard to make a prediction using all the emotions, because a recording can express more than one emotion at the same time, and that would affect our prediction scores. That’s why I chose three primary emotions: happy, sad, and angry. Feel free to try different emotions.

emotion_labels = {
  '01':'neutral',
  '02':'calm',
  '03':'happy',
  '04':'sad',
  '05':'angry',
  '06':'fearful',
  '07':'disgust',
  '08':'surprised'
}

focused_emotion_labels = ['happy', 'sad', 'angry']


Step 3 — Extracting Features from Audio Recordings

In this step, we are going to define a function. This function extracts features from an audio recording and returns them stacked horizontally into a single array using NumPy’s hstack method.

Audio files have many features, and some of them are MFCC, Chroma, and Mel. Here is a well-written article by Joel, published in the TDS publication. I liked the way the writer explained how each audio feature affects the training of the model.

def audio_features(file_title, mfcc, chroma, mel):
    with sf.SoundFile(file_title) as audio_recording:
        audio = audio_recording.read(dtype="float32")
        sample_rate = audio_recording.samplerate
        result = np.array([])              # feature vector we will build up
        if chroma:
            stft = np.abs(lb.stft(audio))  # chroma features are computed from the STFT
        if mfcc:
            mfccs = np.mean(lb.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40).T, axis=0)
            result = np.hstack((result, mfccs))
        if chroma:
            chroma_features = np.mean(lb.feature.chroma_stft(S=stft, sr=sample_rate).T, axis=0)
            result = np.hstack((result, chroma_features))
        if mel:
            mel_features = np.mean(lb.feature.melspectrogram(y=audio, sr=sample_rate).T, axis=0)
            result = np.hstack((result, mel_features))
        return result
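Before wiring this function into the full pipeline, you can sanity-check it on a single recording. The path below is only an example; point it at any .wav file from your copy of the dataset. With the defaults used here, the result should be a 180-dimensional vector (40 MFCC + 12 chroma + 128 mel values).

# Hypothetical path; replace it with any .wav file from your copy of the dataset.
sample_file = "data/Actor_01/03-01-03-01-01-01-01.wav"
features = audio_features(sample_file, mfcc=True, chroma=True, mel=True)
print(features.shape)  # expected: (180,)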

Step 4 — Loading and Preparing the Data

In this step, we are going to define a function to load our dataset. First, we load the data, and then we extract the features using the function defined in the previous step. As the features are extracted, we pair each one with its emotion label. You can think of the features as our input (x) and the labeled emotion as the output (y). This is a well-known machine learning approach known as supervised learning.

Then we split the labeled dataset using the train_test_split() function, a well-known splitting function from the Scikit-learn module. It divides the dataset into four chunks: training and testing sets for both the features and the labels. We can define how much of the dataset we want to use for training and how much for testing, and you can adjust these values to see how they affect the prediction. There is no one-size-fits-all rule; it usually depends on the dataset. A common choice is a test size of 0.25, which means 3/4 of the dataset is used for training and 1/4 for testing; in the code below we use 0.1.

def loading_audio_data():
    x = []
    y = []
    for file in glob.glob("data/Actor_*/*.wav"):
        file_path = os.path.basename(file)
        emotion = emotion_labels[file_path.split("-")[2]]  # third field encodes the emotion
        if emotion not in focused_emotion_labels:          # keep only happy, sad, and angry
            continue
        feature = audio_features(file, mfcc=True, chroma=True, mel=True)
        x.append(feature)
        y.append(emotion)

    final_dataset = train_test_split(np.array(x), y, test_size=0.1, random_state=9)
    return final_dataset

Final Step — MLP Classifier Prediction Model

We are almost done. This is the final step, where we will start calling the functions we defined earlier and recognizing emotions from speech audio recordings.

Loading and Splitting the Data

Let’s start by running the loading_audio_data() function. This function returns four lists, which is why we assign them to four different variables; the order matters. You should be familiar with this splitting pattern, especially if you work on machine learning projects.

X_train, X_test, y_train, y_test = loading_audio_data()
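If you want to confirm the split worked as expected, a quick look at the set sizes does not hurt. The exact counts depend on how many recordings of the three focused emotions are in your copy of the data.

# Quick sanity check; the exact counts depend on your copy of the dataset.
print(f"Training samples: {len(X_train)}, Test samples: {len(X_test)}")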

Defining the Model

MLPClassifier is a multi-layer perceptron classifier. It trains a neural network model by optimizing the log-loss function using limited-memory BFGS (LBFGS) or stochastic gradient descent.

Here is the official documentation page for the MLP Classifier model. You can learn more about its parameters and how they can affect the training process.

model = MLPClassifier(hidden_layer_sizes=(200,), learning_rate='adaptive', max_iter=400)
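One optional tweak that is not part of the original recipe: MLPs tend to train more reliably when the input features are standardized, so you may want to scale the data before fitting. Here is a minimal sketch using Scikit-learn’s StandardScaler.

# Optional extra step: standardize the features, since MLPs are sensitive to feature scale.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)  # reuse the statistics learned from the training set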

Fitting the Model

model.fit(X_train, y_train)

Model Prediction Accuracy Score 

After our model is fit, we can move to the prediction step. We assign the predicted values to a new variable called y_pred so that we can calculate the accuracy score of the prediction. The accuracy_score function checks how many of the predicted values match the original labels. Here is the code, and a screenshot of the result follows it.

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_true=y_test, y_pred=y_pred)
print("Accuracy of the Recognizer is: {:.1f}%".format(accuracy*100))
[Screenshot of the printed accuracy result (image by author)]
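If you would like a per-emotion breakdown rather than a single number, Scikit-learn’s classification_report prints precision, recall, and F1 scores for each of the three labels.

# Optional: per-emotion precision, recall, and F1 scores.
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))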

Conclusion

Our accuracy score is 79.3%, and that is pretty impressive. I usually get a similar score after fitting the model multiple times, and I think this is a satisfying score for an emotion recognition model trained on audio recordings. Thanks go to the developers behind these machine learning and artificial intelligence models.

Congrats! We have created a speech emotion recognizer using Python. As I mentioned earlier, this field is growing rapidly and becoming more and more a part of our daily lives. Projects like this will help you find new ideas to implement. I am glad if you learned something new today; working on hands-on programming projects is the best way to sharpen your coding skills. Feel free to reach out if you have any questions while implementing the code. I do my best to get back to everyone.


More Speech Recognition Related Machine Learning Projects

Building a Speech Recognizer in Python

Speech Recognition using IBM Speech-to-Text API

Speech Recognition in Python – The Complete Beginner’s Guide
