Tuesday, March 17, 2020

Emotion conveyed by speech signal

The method we use in this section involves a multilayer perceptron classifier (MLPClassifier) to build the model.
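As a rough sketch of this setup, scikit-learn's MLPClassifier can be instantiated as below; the hyperparameter values are illustrative assumptions, not the exact settings used in this project.

from sklearn.neural_network import MLPClassifier

# Illustrative hyperparameters; these should be tuned against validation accuracy
model = MLPClassifier(hidden_layer_sizes=(300,), alpha=0.01,
                      batch_size=256, learning_rate='adaptive',
                      max_iter=500)
# model.fit(x_train, y_train) once the features below have been extracted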

The following audio properties are considered:
MFCC (Mel-Frequency Cepstral Coefficients): represent the short-term power spectrum of a sound
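For example, MFCCs can be computed with LibROSA roughly as follows; the file name is a placeholder, and 40 coefficients is a common but arbitrary choice.

import librosa
import numpy as np

# Load a clip and compute 40 MFCCs; 'speech.wav' is a placeholder path
y, sr = librosa.load('speech.wav')
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)  # shape: (40, n_frames)
# Averaging over time gives one fixed-length feature vector per clip
mfcc_vector = np.mean(mfccs.T, axis=0)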

Flow Diagram

Datasets

We made use of two different datasets:

1. RAVDESS

This dataset includes around 1,500 audio files from 24 different actors (12 male and 12 female), each recording short clips in 8 different emotions,

i.e. 1 = neutral, 2 = calm, 3 = happy, 4 = sad, 5 = angry, 6 = fearful, 7 = disgust, 8 = surprised

Each audio file is named so that the 7th and 8th characters of the filename encode the emotion it represents, as shown in the sketch below.
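A minimal sketch of mapping such a filename to its label (the helper name is hypothetical; the emotion code is the third dash-separated field of a RAVDESS filename):

# Emotion codes used in RAVDESS filenames
emotions = {'01': 'neutral', '02': 'calm', '03': 'happy', '04': 'sad',
            '05': 'angry', '06': 'fearful', '07': 'disgust', '08': 'surprised'}

def emotion_from_filename(name):
    # e.g. '03-01-06-01-02-01-12.wav' -> code '06' -> 'fearful'
    return emotions[name.split('-')[2]]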

2. SAVEE 

This dataset contains around 500 audio files recorded by 4 different male actors.

Feature Extraction

The next step is to extract features from the audio files that will help our model distinguish between the emotions. For feature extraction we use the LibROSA library in Python, one of the standard libraries for audio analysis.


  • While extracting the features, every audio file is trimmed to 3 seconds so that all clips yield an equal number of features.
  • The sampling rate of each file is doubled (with the clip duration kept constant) to obtain more samples per clip, which helps classify the audio files when the dataset is small; see the sketch below.
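Putting both points together, a per-file extraction helper might look like this; the function name is hypothetical, and the doubled rate of 44,100 Hz assumes LibROSA's default of 22,050 Hz.

import librosa
import numpy as np

def extract_features(path):
    # Load 3 seconds of audio at double the default 22050 Hz rate
    y, sr = librosa.load(path, duration=3, sr=22050 * 2)
    # Average 40 MFCCs over time into one fixed-length vector
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
    return np.mean(mfccs.T, axis=0)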

Building Models

Since the project is a classification problem, a Convolutional Neural Network seems an obvious choice.
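As an illustration only, a small 1D CNN over the MFCC vectors could be sketched in Keras as below; the layer sizes, 40-dimensional input, and 8 output classes are assumptions, not the exact architecture from this project.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense, Dropout

# Each sample is a 40-dim MFCC vector reshaped to (40, 1)
model = Sequential([
    Conv1D(64, kernel_size=5, activation='relu', input_shape=(40, 1)),
    MaxPooling1D(pool_size=2),
    Dropout(0.2),
    Flatten(),
    Dense(64, activation='relu'),
    Dense(8, activation='softmax'),  # one unit per emotion class
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])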

The model with the highest validation accuracy on the test data reached a little more than 70%.

Predictions

After tuning the model, we tested it by predicting the emotions of the test data. For a model with the given accuracy, here is a sample of the actual vs. predicted values.
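A minimal sketch of that comparison, assuming the Keras model above and integer-coded test labels in y_test:

import numpy as np
import pandas as pd

# Predicted class = index of the largest softmax output
pred = np.argmax(model.predict(x_test), axis=1)
comparison = pd.DataFrame({'actual': y_test, 'predicted': pred})
print(comparison.head(10))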

