Tuesday, April 28, 2020

Detection of deception by speech signal



  • We expect to detect stress in speech by analyzing the change in microtremor frequency of the speaker’s voice.
  • For this purpose we have found a program that uses Empirical Mode Decomposition (EMD), a method shown to be effective for detecting stress in a person’s voice.

EMD Process

  • Empirical Mode Decomposition (EMD) decomposes the original signal into a finite number of intrinsic mode functions (IMFs).
  • IMFs are time-varying mono-component (single frequency) functions. The signal is decomposed into IMFs in such a manner that the highest frequency component of each event in the signal is captured by the first IMF.



An IMF should satisfy two conditions:

  1. The upper and lower envelopes have to be symmetric (i.e. their mean is zero at every point);
  2. The number of zero-crossings and the number of extrema must be equal or differ by at most one.
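As a quick illustration of condition 2, here is a minimal sketch (in Python with NumPy; the function name is ours) that counts the zero-crossings and extrema of a candidate IMF:

```python
import numpy as np

def imf_counts(c):
    """Count zero-crossings and local extrema of a candidate IMF c[k]."""
    # Sign changes between consecutive samples mark zero-crossings.
    zero_crossings = np.count_nonzero(np.diff(np.sign(c)))
    # Sign changes of the first difference mark local maxima/minima.
    extrema = np.count_nonzero(np.diff(np.sign(np.diff(c))))
    return zero_crossings, extrema

z, e = imf_counts(np.sin(np.linspace(0, 10 * np.pi, 1000)))
print(abs(z - e) <= 1)   # True: a pure sine satisfies condition 2
```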

Once the decomposition is finalized, a real-world signal x[k] can be written as:

x[k] = Σ_{i=1}^{N} c_i[k] + r[k]

Where:

c_i[k] = the set of IMFs (i = 1, …, N)

r[k] = trend within the data (also referred to as the last IMF or residual)
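As a minimal sketch of this decomposition, assuming the third-party PyEMD package (installed as EMD-signal), we can decompose a toy signal and confirm that the IMFs plus the residual reconstruct it:

```python
import numpy as np
from PyEMD import EMD   # assumed dependency: pip install EMD-signal

# Toy two-tone signal: a slow 10 Hz component plus a fast 80 Hz component.
fs = 1000
t = np.arange(0, 1, 1 / fs)
x = np.sin(2 * np.pi * 10 * t) + 0.5 * np.sin(2 * np.pi * 80 * t)

emd = EMD()
emd(x)                                      # run the sifting process
imfs, residue = emd.get_imfs_and_residue()  # c_i[k] and r[k]

# The first IMF captures the highest-frequency content; the sum of all
# IMFs plus the residual reproduces x[k] up to numerical error.
print(np.allclose(x, imfs.sum(axis=0) + residue))
```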

Detecting stress induced signals

  • The second-to-last IMF is taken as the microtremor component, unless the total number of IMFs is less than 3, in which case the last IMF is used.
  • If the tremor frequency lies in the range of 8–12 Hz, it is considered a normal (non-stress) response, while a frequency outside this range is considered a stress response (see the sketch below).
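A minimal sketch of this decision rule, assuming the IMFs come from an EMD run as above, and assuming the tremor frequency is estimated from the FFT magnitude peak of the selected IMF (the frequency-estimation step is our assumption; the source does not specify it):

```python
import numpy as np

def dominant_frequency(imf, fs):
    """Estimate an IMF's dominant frequency from its FFT magnitude peak."""
    spectrum = np.abs(np.fft.rfft(imf))
    freqs = np.fft.rfftfreq(len(imf), d=1 / fs)
    return freqs[np.argmax(spectrum)]

def is_stress_response(imfs, fs):
    # Second-to-last IMF holds the microtremor, unless fewer than 3 IMFs exist.
    tremor = imfs[-2] if len(imfs) >= 3 else imfs[-1]
    f = dominant_frequency(tremor, fs)
    # 8-12 Hz is the normal physiological tremor range; outside it -> stress.
    return not (8.0 <= f <= 12.0)
```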

Tuesday, March 17, 2020

Emotion conveyed by speech signal

The method we use for this section builds the model with a multilayer perceptron classifier (MLP Classifier).

The following audio properties are considered:
  • MFCC (Mel-Frequency Cepstral Coefficients): represents the short-term power spectrum of a sound
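For example, with LibROSA (introduced below) the MFCCs of a clip can be computed and averaged over time into one fixed-length feature vector; the file path and the choice of 40 coefficients here are illustrative:

```python
import numpy as np
import librosa

# Load a clip (path is illustrative) and compute 40 MFCCs per frame.
signal, sr = librosa.load("audio/sample.wav")
mfccs = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=40)

# Average over time to get one 40-dimensional feature vector per clip.
feature = np.mean(mfccs.T, axis=0)
```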

Flow Diagram

Datasets

We made use of two different datasets:

1. RAVDESS

This dataset includes around 1,500 audio files from 24 different actors (12 male and 12 female), each recording short clips in 8 different emotions,

i.e. 1 = neutral, 2 = calm, 3 = happy, 4 = sad, 5 = angry, 6 = fearful, 7 = disgust, 8 = surprised

Each audio file is named in such a way that the 7th and 8th characters encode the emotion it represents.
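For instance, in a RAVDESS filename such as 03-01-06-01-02-01-12.wav the third dash-separated field (characters 7–8) is the emotion code, which maps onto the list above:

```python
# Emotion codes from the RAVDESS naming convention.
EMOTIONS = {"01": "neutral", "02": "calm", "03": "happy", "04": "sad",
            "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised"}

def emotion_from_filename(name):
    """Read the emotion label out of a RAVDESS filename."""
    return EMOTIONS[name.split("-")[2]]

print(emotion_from_filename("03-01-06-01-02-01-12.wav"))   # fearful
```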

2. SAVEE 

 This dataset contains around 500 audio files recorded by 4 different male actors.

Feature Extraction

The next step involves extracting, from the audio files, the features that will help our model discriminate between the emotions. For feature extraction we make use of the LibROSA library in Python, one of the standard libraries for audio analysis.


  • While extracting the features, all the audio files are trimmed to 3 seconds so that each yields an equal number of features.
  • The sampling rate of each file is doubled (while the 3-second duration stays constant) to get more feature frames, which helps classify the audio files when the dataset is small; a minimal extraction sketch follows this list.
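The sketch below applies those two choices; the directory layout is illustrative, and 44,100 Hz is double LibROSA's 22,050 Hz default:

```python
import glob
import os
import numpy as np
import librosa

features, labels = [], []
for path in glob.glob("ravdess/*/*.wav"):       # illustrative dataset layout
    # Fixed 3-second window, loaded at double the default sampling rate.
    signal, sr = librosa.load(path, duration=3.0, sr=44100)
    mfccs = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=40)
    features.append(np.mean(mfccs.T, axis=0))
    labels.append(os.path.basename(path).split("-")[2])  # RAVDESS emotion code

X, y = np.array(features), np.array(labels)
```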

Building Models

Since the project is a classification problem, a Convolutional Neural Network seems an obvious choice.

The model that gave the maximum validation accuracy against the test data reached a little more than 70%.
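As a sketch of the MLP classifier mentioned above, using scikit-learn's MLPClassifier on the X and y arrays from the extraction step (all hyperparameters here are illustrative, not the tuned values):

```python
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# X, y: the averaged-MFCC features and labels built in the extraction sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = MLPClassifier(hidden_layer_sizes=(300,), alpha=0.01,
                      batch_size=256, max_iter=500)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```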

Predictions

After tuning the model, we tested it by predicting the emotions for the test data. For a model with the given accuracy, the following is a sample of the actual vs. predicted values.
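A sketch of that comparison, reusing the fitted model and test split from the sketch above (pandas is used only for display):

```python
import pandas as pd

# Compare actual vs. predicted emotions on the held-out test set.
y_pred = model.predict(X_test)
comparison = pd.DataFrame({"actual": y_test, "predicted": y_pred})
print(comparison.head(10))   # a sample of actual vs. predicted values
```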