Speech Emotion Detection using SVM, Decision Tree, Random Forest, MLP, CNN with different architectures
MIT License
As human beings speech is amongst the most natural way to express ourselves.As emotions play a vital role in communication, the detection and analysis of the same is of vital importance in today’s digital world of remote communication. Emotion detection is a challenging task, because emotions are subjective. There is no common consensus on how to measure or categorize them.
The models which were discussed in the repository are MLP,SVM,Decision Tree,CNN,Random Forest and neural networks of mlp and CNN with different architectures.
In this project, I use RAVDESS dataset to train.
2452 audio files, with 12 male speakers and 12 Female speakers, the lexical features (vocabulary) of the utterances are kept constant by speaking only 2 statements of equal lengths in 8 different emotions by all speakers. This dataset was chosen because it consists of speech and song files classified by 247 untrained Americans to eight different emotions at two intensity levels: Calm, Happy, Sad, Angry, Fearful, Disgust, and Surprise, along with a baseline of Neutral for each actor.
The main story in preprocessing audio files is to extract features from them.
Features supported:
In this project, code related to preprocessing the dataset is written in two functions.
load_data() is used to traverse every file in a directory and we extract features from them and we prepare input and output data for mapping and feed to machine learning algorithms. and finally, we split the dataset into 80% training and 20% testing.
def load_data(test_size=0.2):
X, y = [], []
try :
for file in glob.glob("/content/drive/My Drive/wav/Actor_*/*.wav"):
# get the base name of the audio file
basename = os.path.basename(file)
print(basename)
# get the emotion label
emotion = int2emotion[basename.split("-")[2]]
# we allow only AVAILABLE_EMOTIONS we set
if emotion not in AVAILABLE_EMOTIONS:
continue
# extract speech features
features = extract_feature(file, mfcc=True, chroma=True, mel=True)
# add to data
X.append(features)
y.append(emotion)
except :
pass
# split the data to training and testing and return it
return train_test_split(np.array(X), y, test_size=test_size, random_state=7)
Below is the code snippet to extract features from each file.
def extract_feature(file_name, **kwargs):
"""
Extract feature from audio file `file_name`
Features supported:
- MFCC (mfcc)
- Chroma (chroma)
- MEL Spectrogram Frequency (mel)
- Contrast (contrast)
- Tonnetz (tonnetz)
e.g:
`features = extract_feature(path, mel=True, mfcc=True)`
"""
mfcc = kwargs.get("mfcc")
chroma = kwargs.get("chroma")
mel = kwargs.get("mel")
contrast = kwargs.get("contrast")
tonnetz = kwargs.get("tonnetz")
with soundfile.SoundFile(file_name) as sound_file:
X = sound_file.read(dtype="float32")
sample_rate = sound_file.samplerate
if chroma or contrast:
stft = np.abs(librosa.stft(X))
result = np.array([])
if mfcc:
mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
result = np.hstack((result, mfccs))
if chroma:
chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T,axis=0)
result = np.hstack((result, chroma))
if mel:
mel = np.mean(librosa.feature.melspectrogram(X, sr=sample_rate).T,axis=0)
result = np.hstack((result, mel))
if contrast:
contrast = np.mean(librosa.feature.spectral_contrast(S=stft, sr=sample_rate).T,axis=0)
result = np.hstack((result, contrast))
if tonnetz:
tonnetz = np.mean(librosa.feature.tonnetz(y=librosa.effects.harmonic(X), sr=sample_rate).T,axis=0)
result = np.hstack((result, tonnetz))
return result
Let's drive further into the project ..
Performs different traditional algorithms such as -Decision Tree, SVM, Random forest .
Refer
Finds that these algorithms don’t give satisfactory results. So Deep Learning comes into action.
implements classical neural network architecture such as mlp Refer
Found that Deep learning algorithms like mlp tends to overfit to the data. So the preferred neural network is CNN which is a game changer in many fields and applications. Wants to perform some analysis to find the best CNN architecture for available dataset. Here CNN with different architectures is trained against the dataset and the accuracy is recorded. Here every architecture has same configuration and is trained to 500 epochs.
Refer
for better understanding about data and also for visualizing waveform and spectogram of audio files. Refer