Indonesian Journal of Electrical Engineering and Computer Science
Vol. 37, No. 3, March 2025, pp. 2044-2057
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v37.i3.pp2044-2057

BanSpEmo: a Bangla audio dataset for speech emotion recognition and its baseline evaluation

Babe Sultana 1,2, Md Gulzar Hussain 3,4, Mahmuda Rahman 1,5

1 Department of CSE, Faculty of Science and Engineering, Green University of Bangladesh, Dhaka, Bangladesh
2 Department of CSE, Faculty of Science and Engineering, United International University, Dhaka, Bangladesh
3 School of Software, Nanjing University of Information Science and Technology, Nanjing, Jiangsu, China
4 School of Computer Science and Artificial Intelligence, Changzhou University, Changzhou, Jiangsu, China
5 Department of ICT, Mohammadpur Preparatory School and College, Dhaka, Bangladesh

Article Info
Article history: Received Jun 11, 2024; Revised Sep 28, 2024; Accepted Oct 7, 2024
Keywords: Audio dataset; Bangla SER; Emotion classification; Machine learning; Speech emotion

ABSTRACT
Speech interfaces provide a natural and comfortable way for humans to communicate with machines. Recognizing emotions from acoustic signals is essential in audio and speech processing, and detecting emotion in speech is critical to the next generation of human-computer interaction (HCI). However, a lack of large-scale datasets has hampered the progress of relevant research. In this study, we prepare BANSpEmo, a much-needed Bangla speech emotion dataset consisting of 792 audio recordings totaling more than 1 hour and 23 minutes. The recordings feature 22 native speakers, and each speaker uttered two sets of sentences representing six emotions: disgust, happiness, anger, sadness, surprise, and fear. The dataset consists of 12 Bangla sentences, each expressed in these six emotions.
Furthermore, a series of experiments is carried out to assess the baseline performance of the support vector machine (SVM), logistic regression (LR), and multinomial Naive Bayes models on the BANSpEmo dataset presented in this study. The experiments found that SVM performed best on this dataset, with an accuracy of 87.18%.

This is an open access article under the CC BY-SA license.

Corresponding Author:
Md Gulzar Hussain
School of Software, Nanjing University of Information Science and Technology
Nanjing, Jiangsu, China
Email: gulzar.ace@gmail.com
Journal homepage: http://ijeecs.iaescore.com

1. INTRODUCTION
Speech is an essential and preferred way of communication for people. It is an important means of conveying emotions and plays a significant role in human-machine interactions. Speech emotion recognition (SER) research has received significant attention over the last few years due to its applications in remote patient monitoring systems, robotics, psychological assessment, and many more [1]. While tremendous progress has been achieved in SER for widely used languages like English and Mandarin, there is still a significant deficit in resources and research committed to less commonly studied languages. Bangla, spoken by about 250 million people globally [2], is one of the under-resourced languages in the field of SER. A significant number of studies have addressed emotion and sentiment analysis of Bangla text, such as analyzing basic emotions [3], sentiment analysis in Bangla-English code-mixed text [4], [5], and emotion classification [6]. These efforts have greatly enhanced understanding of and insight into the Bangla language in the textual domain, although speech has received far less attention.
Detecting emotions from speech is a growing research topic due to its importance in community, societal, and commercial domains. In the realm of speech recognition (SR) and natural language processing (NLP), an extensive range of speech corpora has been created for multiple languages. Although there has been much research on SER for various languages, such as English [7], [8], Urdu [9], Chinese [10], Italian [11], and others [12], [13], there have been just a few efforts at developing SER datasets for Bangla. Table 1 shows some previously developed speech emotion recognition datasets and their limitations.

Table 1. Comparison of some previous speech emotion datasets in Bangla and other languages
Article | Year | Language | Contributions | Limitations
Paper [14] | 2023 | Bangla | A speech emotion dataset named KBES is developed with 900 recordings. | Number of recordings is limited and gender balance is not considered.
Paper [9] | 2022 | Urdu | The first Urdu speech emotion dataset is developed with 2,500 recordings. | The disgust emotion is hard to distinguish, and accuracy for this emotion is low.
Paper [15] | 2022 | Bangla | A speech emotion dataset named SUBESCO is developed with 7000 recordings. | Purely neutral sentences are uttered with different emotions, which is difficult to express.
Paper [16] | 2022 | Bangla | A speech emotion dataset named BanglaSER is developed with 1467 recordings. | Number of sentences uttered and number of recordings are limited.
Paper [17] | 2022 | Bangla | A speech emotion dataset is developed with 452 recordings. | Number of recordings is limited, the annotation process is not explained, and gender balance is not considered.
Paper [7] | 2021 | English | A new speech emotion dataset named LSSED is developed. | Number of data is limited, only 820. Also, the total length is only about 20 minutes.
Paper [18] | 2021 | Bangla | A speech emotion dataset named ABEG is developed. | Dataset is not publicly available, has only 3 classes, and the annotation process is not clear.
Paper [13] | 2020 | Spanish, Portuguese, German, French | A speech emotion dataset named CMU-MOSEAS is developed with 40,000 multi-modal samples. | Data samples are not balanced across the 4 languages. Gender balance is also not considered.
Paper [8] | 2018 | English | A new speech emotion dataset named RAVDESS is developed with 7356 recordings. | Limited lexical variability due to the inclusion of only two statements.
Paper [12] | 2016 | English | A speech emotion dataset named EmoReact is developed with 1102 audio-visual clips. | Gender effects are not considered and the dataset is not gender balanced. Also, the number of clips is limited.
Paper [10] | 2006 | Mandarin | A new speech emotion dataset named MASC is developed with 25,636 recordings. | The dataset is not gender balanced, as it contains the recordings of 23 female and 45 male Chinese speakers. Also, the traditional speaker verification and identification systems are limited for the dataset.

From Table 1 it can be observed that there have been few efforts to create datasets for SER in the Bangla language. Dhar and Guha [17] created a dataset designated as ABEG. They employed three emotional states: angry, happy, and neutral. There was no further description of their dataset, and the data is not accessible to the public. A team of academicians prepared a small discrete corpus of 160 sentences to test the speech-emotion identification system they proposed [18]. This dataset has 20 individuals who represented emotions such as happy, angry, sad, and neutral, but no perceptual evaluation was available. Consequently, just three corpora for the task of emotion detection from speech in the Bangla language are publicly accessible now: SUBESCO [19], BanglaSER [16], and KBES [14].
Using these openly accessible Bangla language datasets, multiple studies have been published illustrating how to detect emotions in Bangla speech with machine learning and deep learning methods. Different ensemble learning approaches have been compared in multiple trials to show that they outperform typical machine learning techniques. The findings show that ensemble learning approaches can reach an accuracy of 84.37%, which is achieved by utilizing the bootstrap aggregation and voting method. Sultana et al. [15] used the SUBESCO and RAVDESS [8] datasets to undertake cross-lingual investigations involving cross-dataset training, multi-dataset training, and transfer learning in English and Bangla. The suggested model demonstrated state-of-the-art performance, with weighted accuracy (WA) of 86.9% for SUBESCO and 82.7% for RAVDESS. Hassan et al. [20] combine a one-dimensional convolutional neural network (CNN) and a long short-term memory (LSTM) framework to create a fully connected network for SER, and compare performance on these two datasets. Islam et al. [21] combine transformed features from three separate methodologies, namely chroma short-time Fourier transform, short-time Fourier transform (STFT), and mel-frequency cepstral coefficient (MFCC), and feed them into a 3D CNN block to extract the features. The outputs are then processed by a bidirectional LSTM layer to classify Bangla speech emotions. Sultana and Rahman [22] employed grid search with five-fold cross-validation to determine the best parameters for the support vector machine (SVM), random forest, and XGBoost algorithms. They found that choosing the most important features enabled machine learning models to achieve high levels of accuracy, equivalent to deep learning models. A recent study from Aziz et al. [23] presents a CNN-based approach for SER in Bengali, using MFCC features and data augmentation approaches. This method produced accuracies of 90% on the SUBESCO and 78% on the BanglaSER datasets.
Research on the detection of emotions in cross-linguistic speech has demonstrated that systems trained on a single-language dataset often perform poorly when evaluated on a corpus in a different language, yielding lower accuracy than monolingual recognition. This performance gap emphasizes the importance of language-specific datasets for accurate emotion recognition. Over the past few years, there has been extensive investigation into SER in various languages [24], [25]. Although a few natural speech corpora [26], [27] and verified recorded emotional speech corpora [14], [16], [19] have been published for the Bangla language, relevant linguistic resources for recognizing emotions are still inadequate. An SER system uses various approaches to classify and analyze audio files to find embedded emotions. The initial stage in its improvement is to generate a dataset for the targeted language, which is one of the main goals of this research. The following is a summary of this work's main contributions:
- This research introduces BanSpEmo, a needed and diversified Bangla dataset for emotion recognition from voice. It comprises 12 distinct sentences uttered by 22 native speakers to represent six desired emotions. The total duration is 1 hour and 23 minutes. This dataset enables more comprehensive simulations of real-world scenarios by increasing lexical and sentence variability, allowing machine learning techniques and deep neural networks to grasp the underlying patterns better.
- Using the BanSpEmo dataset, this study compares the performance of three well-known algorithms for Bangla voice emotion classification: logistic regression (LR), SVM, and multinomial Naive Bayes. This research also investigates these algorithms with a few well-known audio features to evaluate their efficiency in classifying emotions in Bangla speech.
- After analyzing the results, we report each algorithm's performance, providing useful information about its usability for SER tasks in the Bangla language.

A detailed description of the dataset is provided in section 2. The proposed research framework is explained in section 3. Section 4 presents the performance analysis of the machine learning algorithms applied within this framework and discusses the research findings. An overall discussion and insights into the future directions of this work are provided in section 5.

2. CORPUS DESCRIPTION
Speech is one of the modes of communication on various online platforms, such as Facebook and YouTube, where emotions are frequently conveyed. In this context, creating a speech dataset for the Bangla language is a significant contribution. The dataset we have prepared, available under the name BANSpEmo [28], constitutes the main portion of our research. As a low-resource language, Bangla has limited speech datasets. BANSpEmo marks the fourth audio dataset developed for SER in Bangla.
2.1. The experimental setup
The BANSpEmo dataset includes 792 voice recordings, capturing six fundamental emotional reactions across two sets of sentences, each with six sentences. The voices were recorded using a smartphone's recording application, a microphone, and a laptop. To create the dataset, we used our university's dedicated research lab, which was not entirely soundproof like a professional audio recording studio but was essentially noiseless. We took extra care to eliminate background noise, including human and other ambient sounds, and kept the recording environment as controlled as possible. With these precautions, we were able to record audio that was both consistent and clear enough for our study. Each recording, lasting 5-6 seconds on average per emotion, had noise removed using Audacity software. Additionally, we used WavePad Sound Editor software for further editing. The tools required for making the dataset are summarized below:
- Microphone: BOYA BY-BM3011 compact shotgun microphone.
- Sound editor: WavePad Sound Editor.
- Audio noise remover: Audacity software.

2.2. Corpus creation process
The speakers conveyed the emotional states naturally, ensuring that the recordings were not merely read aloud. The emotions represented are happiness, disgust, sadness, anger, surprise, and fear. This dataset focuses on data collected from individuals aged between 20 and 25. The corpus comprises voice recordings from 22 speakers, with an equal distribution of 11 males and 11 females. The duration of the recordings varies between 3 and 12 seconds, influenced by the length of the sentences and the time the speaker takes. Although male and female speakers are equal in number overall, this balance does not hold within each sentence set. With 6 sentences × 1 repetition × 6 emotions × 18 speakers (648 recordings) and 6 sentences × 1 repetition × 6 emotions × 4 speakers (144 recordings), the total number of recordings is 792 utterances. The complete audio dataset spans a total of 1 hr, 23 mins, and 12 secs. Table 2 presents a summary of the dataset.

Table 2. Dataset description table
Property | Value
Type of dataset | Performed, scripted
Type of file | Audio
Language | Bangla
Gender | Male and Female
Data format | Waveform Audio File Format (WAV)
Number of groups | 2
Number of sentences per group | 6
States of emotion | Happiness, Disgust, Sadness, Anger, Surprise, and Fear
Total number of statements | 12
Total number of audio recordings | 792

2.3. Details of sentences
We selected 12 sentences to ensure diversity, as these sentences are typically used to express different emotions. We trained our speakers to deliver each selected sentence with six emotions to prepare our audio dataset. A wide range of emotional expressions, namely happiness, sadness, anger, fear, surprise, and disgust, was carefully considered when crafting each sentence. By doing this, we hope to build a solid dataset that will be useful for a range of speech emotion recognition and affective computing applications. The speakers underwent comprehensive training to guarantee uniformity and precision in their emotive communication. The chosen Bangla texts and their English meanings are shown in Table 3. In Table 4, we present a comparison of existing freely accessible SER datasets in this domain alongside our work, BANSpEmo. Given the volume of audio recordings and the total duration, this collection ranks as the fourth-largest emotional speech database in the Bangla language. Despite its relatively small size compared to other datasets, and with participation levels being fairly typical, the key strength of this dataset is its broad range of sentence variations.
This lexical and sentence diversity enhances the ability to capture diverse emotional expressions in different forms in Bangla speech. Essentially, we chose a wide variety of sentences to explore how different expressions of the same emotion can appear in Bangla speech. To systematize
future training standards, we divided our BANSpEmo dataset into training and test sets. We first shuffled all samples and then allocated 20% to the test set, leaving 80% for the training set. We ensured that the distribution of each emotion class was consistent, or at the very least similar, across the training and test sets.

Table 3. The selected Bangla text and their English meaning
SL. | Bangla sentence | English meaning
1 | িকছ তথ সিঠক ভােব উপাপন করা দরকার, বার বার একই কের চেলেছ সংবাদ মাধম িল! | Some information needs to be conveyed appropriately, and the media is making similar mistakes repeatedly!
2 | আপনার ববহার েতা চমৎকার। েখর ভাষা অেনক র। | Your behavior is wonderful. Your words are also pleasant.
3 | এর পিরেিেত িশকেদর াথ সংি িশক সিমিতর মধ েথেক েকােনা ধরেনর িমকা পিরলিত না হওয়ায় আিম ভীষন ভােব উি। | In this regard, I am deeply concerned that no role has been observed from the teachers' associations regarding the interests of teachers.
4 | আমার একটা বাপার মাথায় ধের না, "ইিলশ বাঁচাও" োগান খিরত িমিডয়া েকন এবং িক কারেণ "ইিলেশর বাসান (নদী) বাঁচাও" োগান িনেয় মােত না? | Why the slogan "Save the habitat (river) of Hilsa" rather than "Save the Hilsa" is being avoided by the media baffles me.
5 | েদশ িক মধম আেয়র েদেশ পার হে নািক মেগর েকর েদ েশ পিরনত হে? | Is the country turning into a middle-income country or a country of chaos?
6 | আিম একমা সরকাির েকান কােজ আ েলর চাপ িদেত রািজ আিছ, িশিত বি আ েলর চাপ েদয় না। | I agree to have my fingerprints used for government purposes, but reasonable people might not.
7 | তেগা মেন কেতা েম ের! জীবেন একটা করিছ তােতই েল েড় েশষ। | You are bursting with love! I once tried to embrace it, but I got burned.
8 | আজেকর মাচ ভারতেক হারােত চাই টাইগার বাংলােদশ সাবাস সািকব আল হাসান। | To defeat India in today's match, we need the tiger of Bangladesh. Well done, Shakib Al Hasan!
9 | টাইটািনক জাহাজ েব েগেছ আর বাংলােদশ েব যােব | The Titanic has sunk, and Bangladesh will sink too.
10 | যিদ হয় তাহেল পরীা েনবার িক দরকার? সবাইেক গেড় াস িদেয় িদেব। | If the questions are incorrect, what's the sense of taking the exam? Simply give everyone the A+ grade.
11 | যিদ খায় পানতা ইিলশ তা িদেয় তার গালটা কর মািলশ। | He should be punished for making extravagant expenses during the price hike of hilsa.
12 | েয জািত পঁচা ভাত েখেয় বছর  কের, এরা উিত লাভ করেব িক কের! | A nation that starts the year by eating spoiled rice, how will they ever progress!

Table 4. A comparison between publicly available Bangla language SER corpora and BANSpEmo
Description | SUBESCO | BanglaSER | KBES | BANSpEmo
Audio clips | 7000 | 1467 | 900 | 792
Emotions | 7 | 5 | 9 | 6
Sentences | 10 | 3 | N/A | 12
Participants | 20 | 34 | 35 | 22
Trained actors | Yes | No | Yes | No
Sampling rate | 48 kHz | 44.1 kHz | 48 kHz | 44.1 kHz
Class balance | Yes | Yes | Yes | Yes
Gender balance | Yes | Yes | Yes | Yes

3. METHOD
Figure 1 presents our proposed system architecture. The collected raw data underwent a thorough cleaning and preprocessing stage, after which mel-frequency cepstral coefficients (MFCCs), spectrogram, zero crossing rate (ZCR), root-mean-square energy (RMSE), and chroma were used as feature extraction techniques. We applied three well-known machine learning algorithms, support vector machine, logistic regression, and multinomial Naive Bayes, to provide a comparative performance evaluation.

3.1. Data cleaning and preprocessing
To augment the dataset, each audio recording is divided into three segments. In the data preprocessing and cleaning phase, every recording is trimmed to remove portions where no voice is detected. These portions typically correspond to pauses or moments when the speaker takes a breath. Additionally, we standardized the sampling rate of each split audio segment to 44.1 kHz to ensure uniformity across all instances. Subsequently, features are extracted from each trimmed audio segment.
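The cleaning steps of section 3.1 can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions: the function names `trim_silence` and `split_into_segments`, the amplitude threshold, the frame length, and the order of operations (trimming before splitting) are hypothetical choices and are not taken from this study's implementation. Resampling to 44.1 kHz is omitted here, since it would normally be delegated to an audio library.

```python
def trim_silence(samples, threshold=0.02, frame_len=4):
    """Drop every frame whose peak absolute amplitude is below `threshold`.

    `threshold` and `frame_len` are illustrative values, not the paper's.
    """
    kept = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        if max(abs(s) for s in frame) >= threshold:
            kept.extend(frame)
    return kept


def split_into_segments(samples, n_segments=3):
    """Split a signal into `n_segments` contiguous, nearly equal parts."""
    size, extra = divmod(len(samples), n_segments)
    segments, start = [], 0
    for i in range(n_segments):
        # The first `extra` segments absorb one leftover sample each.
        end = start + size + (1 if i < extra else 0)
        segments.append(samples[start:end])
        start = end
    return segments


# Toy "recording": silence, speech, a short pause, then more speech.
clip = [0.0, 0.0, 0.0, 0.0,
        0.5, -0.4, 0.3, -0.2,
        0.0, 0.01, 0.0, 0.0,
        0.6, -0.5, 0.4, -0.3,
        0.2, -0.1, 0.0, 0.0]

voiced = trim_silence(clip)             # silent frames removed
segments = split_into_segments(voiced)  # three segments per recording
```

Each of the three segments would then be treated as a separate training instance, which is how splitting enlarges the dataset.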
Figure 1. The suggested system's flow architecture

3.2. Feature extraction
3.2.1. Mel-frequency cepstral coefficients
MFCCs are a set of coefficients that represent the short-term power spectrum of a voice signal and are broadly used in voice and audio processing. They form a condensed set of features, typically around 10 to 20 coefficients. They serve as valuable features for machine learning models because they succinctly capture the key attributes of an audio signal while also reducing its dimensionality. In our feature extraction process, we calculated 20 MFCCs using the 'librosa.feature.mfcc()' Python function. Figure 2 illustrates the MFCC features for "Happy Emotion".

Figure 2. Sample MFCCs visualization for happy emotion

3.2.2. Spectrogram
A spectrogram is a visual representation of how the frequency components of a signal evolve over time. It holds significant utility in signal processing, audio examination, and diverse scientific domains. Spectrograms depict the changes in signal frequencies across time, facilitating the examination and visualization of the evolving frequency characteristics of audio or other time-based signals. Figure 3 depicts a spectrogram visualization illustrating the signal's loudness over time across various frequencies for the "Happy Emotion".
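To make the idea of a spectrogram concrete, the sketch below computes a magnitude spectrogram with a naive short-time Fourier transform in pure Python. It is an illustration only: the function name `stft_magnitude`, the frame length, the hop size, and the toy 100 Hz tone sampled at 800 Hz are assumptions chosen so the result is easy to check by hand, and they do not reflect the settings used in this study. In practice a library routine would be far faster than this direct DFT.

```python
import cmath
import math


def hann(n):
    """Symmetric Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def stft_magnitude(signal, frame_len=64, hop=32):
    """Return one list of DFT magnitudes (bins 0..frame_len//2) per frame."""
    window = hann(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + i] * window[i] for i in range(frame_len)]
        mags = []
        for k in range(frame_len // 2 + 1):
            # Direct DFT of the windowed frame at frequency bin k.
            acc = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            mags.append(abs(acc))
        frames.append(mags)
    return frames


sr = 800  # toy sampling rate in Hz (assumption for illustration)
tone = [math.sin(2 * math.pi * 100 * t / sr) for t in range(400)]

spec = stft_magnitude(tone)  # rows are time frames, columns are frequency bins
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sr / 64  # convert the strongest bin back to Hz
```

For this pure tone, the strongest bin of every frame maps back to the tone's frequency; for speech, the energy spreads over many bins and shifts from frame to frame, which is the pattern a spectrogram plot makes visible.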
Figure 3. Sample spectrogram visualization for happy emotion

3.2.3. Zero crossing rate
In the domains of signal processing, audio analysis, and speech recognition, the ZCR is a frequently employed feature. It quantifies the rate at which a signal changes its polarity, that is, how often the signal's waveform crosses the zero-amplitude line within a specified timeframe. ZCR is formally defined as (1), where s is a signal of length T and the indicator function equals 1 when its argument is negative and 0 otherwise. Figure 4 illustrates the ZCR visualization for the "Happy Emotion", which portrays the rate at which the signal transitions either from negative to positive or from positive to negative.

$$ zcr = \frac{1}{T-1} \sum_{t=1}^{T-1} \mathbb{1}_{\mathbb{R}_{<0}}(s_t s_{t-1}) \tag{1} $$

Figure 4. Sample ZCR visualization for happy emotion

3.2.4. Root-mean-square energy
In signal processing and diverse other domains, RMSE is a mathematical metric employed to assess the energy level within a signal. It characterizes the amplitude or intensity of a signal within a defined time segment. The RMSE is defined as follows:

$$ RMSE = \sqrt{\frac{1}{N} \sum_{n} |x(n)|^{2}} \tag{2} $$

Here, RMSE is the root-mean-square energy, N is the number of samples in the time window, and x(n) represents the signal samples. In this context, N = 44,100 and x(n) = 204,800. The RMSE value provides insight into the signal's energy and amplitude within the specified time frame, and Figure 5 depicts the visualization for "Happy Emotion".

Figure 5. Sample RMSE visualization for happy emotion
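The ZCR and RMSE definitions in this section translate directly into code. The following is a minimal sketch that evaluates both over a single frame; the function names and the toy frame are illustrative assumptions rather than the implementation used in this study.

```python
import math


def zero_crossing_rate(s):
    """Fraction of adjacent sample pairs whose product is negative."""
    T = len(s)
    crossings = sum(1 for t in range(1, T) if s[t] * s[t - 1] < 0)
    return crossings / (T - 1)


def rms_energy(x):
    """Square root of the mean squared sample value."""
    return math.sqrt(sum(v * v for v in x) / len(x))


# A frame that flips sign at every step: 4 crossings over 4 sample pairs.
frame = [0.5, -0.5, 0.5, -0.5, 0.5]
zcr = zero_crossing_rate(frame)
rms = rms_energy(frame)
```

A rapidly alternating frame like this one gives the maximum ZCR of 1, while its RMS energy equals the constant amplitude of 0.5. Noisy or unvoiced speech tends toward higher ZCR, and louder speech toward higher RMS energy.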
3.2.5. Chroma
The chroma feature is a compact representation that conveys the tonal characteristics of a musical audio signal. Chroma features are derived from the chromagram representation of audio signals. These features encompass the chroma vector (depicting the intensity of each pitch class), chroma energy (the summation of squared chroma values), and chroma cross-correlation (which quantifies the similarity between chroma vectors). Figure 6 illustrates the chroma representation for "Happy Emotion".

Figure 6. Sample chroma visualization for happy emotion

3.3. Classifier model
Following the dataset cleaning, preprocessing, and feature extraction stages, we employed numerous machine learning algorithms. However, this research specifically focuses on three machine learning algorithms: SVM, LR, and multinomial Naive Bayes. These three algorithms were chosen due to their strong accuracy on this particular dataset.

3.3.1. Support vector machine
The SVM is a powerful and adaptable machine learning algorithm extensively used in binary and multi-class classification tasks as well as regression. SVM works by identifying the optimal hyperplane within the feature space, the one that maximizes the margin between distinct classes and thereby facilitates accurate classification. The decision rule of a multi-class linear SVM can be written as:

$$ f(x) = \arg\max_{c} \left( w_c \cdot x + b_c \right) \tag{3} $$

Here, f(x) is the decision function, c ranges over the different classes, w_c is the weight vector of class c, x is the input feature vector, and b_c is the bias term of class c. In this configuration, the class with the highest score from the decision function is chosen as the predicted class for an input sample. This technique is frequently used in multi-class SVM settings, where several binary classifiers are trained using one of two strategies: one-vs-one (OvO) or one-vs-rest (OvR). We employed the OvR technique in our implementation. The final class assignment is then determined from the outputs of each binary classifier's decision function.

3.3.2. Logistic regression
The popular machine learning method known as LR was initially created for binary classification. In this work, we adapted LR with the OvR technique to handle multi-class classification. With the help of this technique, LR can be used effectively in scenarios with more than two classes, providing an estimate of the probability that each class is the correct prediction. The OvR approach simplifies the adaptation of LR to multi-class tasks, making it a versatile solution for various classification problems. The class c classifier models the probability that an input z belongs to class c in the following way:

$$ P(y = c \mid z) = \frac{1}{1 + e^{-(w_c^{T} z + b_c)}} \tag{4} $$

Here, P(y = c | z) is the probability that the output y belongs to class c, w_c is the weight vector and b_c is the bias term for class c, and z is the vector of input features.

3.3.3. Multinomial Naive Bayes
Multinomial Naive Bayes is a probabilistic classification technique designed for discrete features and is widely applied to text classification problems. Building on Bayes' theorem, this method assumes that the features are conditionally independent given the class label. In a multi-class classification scenario, multinomial Naive Bayes uses Bayes' theorem to determine the probability that an instance belongs to each class. The probability of class x given the features f_1, f_2, ..., f_n can be written as:

$$ P(X = x \mid f_1, f_2, \ldots, f_n) \propto P(X = x) \prod_{i=1}^{n} P(f_i \mid X = x) \tag{5} $$

Here, X is the class variable, f_1, f_2, ..., f_n are the feature variables, P(X = x | f_1, f_2, ..., f_n) is the posterior probability of class x given the features, P(X = x) is the prior probability of class x, and P(f_i | X = x) is the conditional probability of feature f_i given class x.
These probabilities are computed from the training dataset during the training step. During the prediction step, the algorithm computes the posterior probabilities for every class and selects the class with the maximum probability as the predicted class for the given set of features. The algorithm's suitability for multi-class classification problems can be ascribed to its efficiency, its ability to handle high-dimensional data, and its simplicity.

4. PERFORMANCE EVALUATION
This section compares the outcomes of several machine learning algorithms and presents their performance analyses. It discusses the environmental setup, evaluation metrics analysis of different machine learning models, performance comparison with some previous works, confusion matrix analysis, and receiver operating characteristic (ROC) curve analysis.

4.1. Environmental setup
- Operating system: Windows 10 64 bit
- Processor: Intel(R) Core(TM) i5-4300M CPU @ 2.60 GHz
- RAM: 8 GB
- IDE: Google Colab
- Programming language: Python

4.2. Result analysis and discussion
Our goal in this section is to present a thorough analysis and discussion of the findings from the assessment metrics we used in our study. We have also looked at ROC curves, which serve as an effective and clear visual aid for illustrating the classifier's accuracy.

4.2.1. Evaluation metrics analysis
We employ a range of standard performance assessment metrics to evaluate and contrast the effectiveness of different classifiers. Our comparative analysis evaluates the relative performance of classifiers using accuracy, precision, recall, and F1 scores. Accuracy is computed by comparing the predicted label of each instance with its ground-truth label, but it has limitations because certain samples may introduce bias. Therefore, we also incorporate precision, recall, and F1 measures to provide a more comprehensive evaluation. We experimented with various machine learning algorithms and eventually selected three, SVM, LR, and multinomial Naive Bayes, due to their commendable performance in this context. Among these, SVM exhibited the highest accuracy at 87.18%, followed by LR at 84.45% and multinomial Naive Bayes at 82.77%. In Tables 5-7, we present the individual precision, recall, and F1 scores for all emotions considered in our
research. From these tables, it is evident that SVM attains the highest weighted average values for precision, recall, and F1 score, with 0.87, 0.87, and 0.86, respectively. In contrast, LR yields weighted average precision, recall, and F1 scores of 0.85, 0.84, and 0.84, respectively. For multinomial Naive Bayes, the corresponding values are 0.83, 0.83, and 0.82.

Table 5. Outcomes of SVM-based precision, recall, F1-score, and accuracy in six distinct categories (emotions)
Category | Precision | Recall | F1-score | Accuracy (%)
Anger | 0.92 | 0.93 | 0.92 | 87.18
Disgust | 0.82 | 0.88 | 0.85 |
Fear | 0.85 | 0.87 | 0.86 |
Happy | 0.88 | 0.93 | 0.90 |
Sad | 0.91 | 0.82 | 0.86 |
Surprised | 0.82 | 0.66 | 0.73 |

Table 6. Outcomes of LR-based precision, recall, F1-score, and accuracy in six distinct categories (emotions)
Category | Precision | Recall | F1-score | Accuracy (%)
Anger | 0.85 | 0.92 | 0.88 | 84.45
Disgust | 0.89 | 0.89 | 0.89 |
Fear | 0.85 | 0.87 | 0.86 |
Happy | 0.78 | 0.83 | 0.81 |
Sad | 0.82 | 0.81 | 0.81 |
Surprised | 0.89 | 0.60 | 0.71 |

Table 7. Outcomes of multinomial Naive Bayes-based precision, recall, F1-score, and accuracy in six distinct categories (emotions)
Category | Precision | Recall | F1-score | Accuracy (%)
Anger | 0.82 | 0.91 | 0.86 | 82.77
Disgust | 0.84 | 0.87 | 0.85 |
Fear | 0.81 | 0.87 | 0.84 |
Happy | 0.86 | 0.85 | 0.86 |
Sad | 0.80 | 0.78 | 0.79 |
Surprised | 0.82 | 0.51 | 0.63 |

4.2.2. Performance comparison with relevant Bangla datasets
In Table 8, we compare this work with previous studies that have focused on Bangla SER. Hassan et al. [20] primarily utilized two datasets: one in English, named RAVDESS, and another, SUBESCO, for Bangla. Their proposed model, which integrated a 1D CNN with a fully convolutional network (FCN) layer, achieved 98.30% accuracy on the RAVDESS dataset and 98.97% on the SUBESCO dataset.

Table 8. Comparison with related works that used Bangla speech emotion datasets
Article | Dataset | Feature extraction techniques | Classifier | Accuracy
Paper [20] | SUBESCO | MFCC, ZCR, Mel-Spectrogram, Root Mean Square, etc. | 1D CNN + FCN layers | 98.97%
Paper [15] | SUBESCO | CNN + TDF layer | DCTFB | 86.9%
Paper [21] | SUBESCO | MFCCs + STFT + Chroma STFT | 4CNN + TDF + Bi-LSTM | 89.57%
Paper [29] | KBES | MFCC, STFT, Chroma STFT | CNN TDF layer, Bi-LSTM, LSTM | 71.67%
Paper [30] | SUBESCO, BanglaSER | CNN | KNN, AdaBoost, Bi-LSTM | 90%
This work | BanSpEmo | MFCC, Spectrogram, ZCR, RMSE | SVM, LR, MNB | 87.18%

Using the SUBESCO dataset, paper [15] utilized CNN and TDF features with a DCTFB classifier and achieved an accuracy of 86.9%. Additionally, Islam et al. [21] used 3D CNN and bidirectional long short-term memory networks (Bi-LSTM) as models while working with the SUBESCO dataset. They achieved an accuracy