IAES International Journal of Artificial Intelligence (IJ-AI)
Vol. 14, No. 6, December 2025, pp. 5157-5171
ISSN: 2252-8938, DOI: 10.11591/ijai.v14.i6.pp5157-5171

Classifier model for lecturer evaluation by students using speech emotion recognition and deep learning approaches

Yesy Diah Rosita 1,2, Wahyu Andi Saputra 2
1 Center of Excellence for Human Centric Engineering, Institute of Sustainable Society, Telkom University, Bandung, Indonesia
2 Informatics Engineering Study Program, School of Computing, Telkom University, Purwokerto, Indonesia

Article Info
Article history:
Received Jul 31, 2024
Revised Sep 10, 2025
Accepted Oct 16, 2025

Keywords:
Bi-LSTM
Energy
Evaluation
Lecturer
MFCC
Student
Zero-crossing rate

ABSTRACT
Lecturers play a crucial role in higher education, with their teaching behavior directly impacting learning and teaching quality. Lecturer evaluation by students (LES) is a common method for assessing lecturer performance, though it often relies on subjective perceptions. As a more objective alternative, speech emotion recognition (SER) uses speech technology to analyze emotions in the speech of lecturers during classes. This study proposes using deep learning-based SER, including convolutional neural network (CNN) and bidirectional long short-term memory (Bi-LSTM), to evaluate teaching quality by analyzing displayed emotions. Removing silence from audio signals is crucial for enhancing feature analysis, such as energy, zero-crossing rate (ZCR), and mel-frequency cepstral coefficients (MFCC). This method removes inactive segments, emphasizing significant segments and improving accuracy in detecting voice and emotions. Results show that the 1D CNN model with Bi-LSTM, using MFCC with 13 coefficients, energy, and ZCR, performs excellently in emotion detection, achieving a validation accuracy of over 0.851 with an accuracy gap of 0.002. This small gap indicates good generalization and reduces the risk of overfitting, making teaching evaluations more objective and valuable for improving practices.

This is an open access article under the CC BY-SA license.

Corresponding Author:
Yesy Diah Rosita
Informatics Engineering Study Program, School of Computing, Telkom University
St. D.I. Panjaitan No. 128, Purwokerto, Banyumas, Central Java-53147, Indonesia
Email: yesydr@telkomuniversity.ac.id

1. INTRODUCTION
Lecturers play a crucial role in higher education, where their teaching behavior directly impacts the learning process and ultimately determines the quality of education provided. This role is vital because the quality of teaching affects students' learning experiences and their academic outcomes. To ensure that teaching standards remain high, many higher education institutions have implemented lecturer evaluation by students (LES) systems to assess lecturer performance during classes [1]. These evaluations typically cover aspects such as lecturer discipline, subject mastery, and interactions with students. LES is presented in the form of a questionnaire that students complete at the end of the semester. This questionnaire aims to provide an overview of the teaching quality delivered by lecturers, and the results of this evaluation impact the course grades listed on students' transcripts [2].
However, this method tends to be subjective because the assessment is based on each student's personal perception, which can be influenced by factors such as mood, personal experiences, or individual interactions with the lecturer. Consequently, the
results of LES may not fully reflect the objective quality of teaching and are often inadequate as a sole measure of lecturer performance. As an alternative for a more objective assessment of teaching quality, emotion analysis-based approaches can be utilized. One promising method is speech emotion recognition (SER), which leverages speech recognition technology to analyze emotions [3], [4] present in lecturers' speech during classes. SER relies on extracting features from audio speech signals to determine the types of emotions expressed by lecturers. This technology offers potential for a more objective evaluation, since the emotions captured in speech can provide deeper insights into the lecturer's mood and attitude while teaching. Previous research indicates that emotions can generally be categorized into three classes: positive, negative, and neutral [5]. Using SER in this context allows for a more holistic assessment of how lecturers display their emotions during teaching. By identifying feature extraction patterns and appropriate model configurations, SER can provide accurate data on the percentage of emotions expressed by lecturers throughout a class session. This paves the way for a more objective evaluation method that relies not only on students' subjective perceptions but also on empirical data generated from audio analysis.

In the context of technological development, the use of deep learning has become an increasingly popular approach in SER. Deep learning algorithms, particularly deep neural networks, can process and analyze feature data more effectively than conventional methods. Convolutional neural networks (CNNs) and long short-term memory (LSTM) networks have proven highly efficient in recognizing patterns in speech and emotion data [6], [7]. The application of these techniques in SER enables enhanced accuracy and improves the model's ability to understand more complex emotional contexts. The combination of SER and deep learning offers an innovative solution for lecturer evaluation. By integrating emotion analysis technology with deep learning algorithms, we can gain deeper insights into teaching quality and classroom atmosphere. This approach not only enhances accuracy in assessment but also provides more valuable data for continuous improvement in teaching practices.

2. METHOD
The objective of this study is to evaluate the performance of a deep learning model capable of classifying lecturer performance in delivering lecture material through SER. In this context, lecturers' emotions are classified into three classes: positive (happy and surprised), neutral, and negative (angry and sad). The methodology involves several stages: data collection, preprocessing, feature extraction, model creation, and performance evaluation.

2.1. Data collection
The data consists of speech samples in Indonesian, totaling 1,600 samples with a duration of 3-5 seconds: 491 positive (250 happy and 241 surprised), 619 negative (337 angry and 282 sad), and 400 neutral. The audio files are in .wav format and mono channel. Data was collected using a clip-on wireless microphone placed on the respondent's chest to ensure stable recording.
The equipment features include: up to 100 m wireless operating range, selectable mono/stereo output mode, a 3.5 mm headphone jack for real-time monitoring, a built-in omnidirectional microphone for 360° sound pickup, and compatibility with smartphones, tablets, cameras, recorders, and other audio/video recording devices. The same equipment was used to record lecturers during their presentations, with audio samples lasting approximately 30-60 seconds for emotion analysis. A total of 30 samples were collected, corresponding to the number of active lecturers in the School of Computing, Telkom University. To provide a clearer picture of the dataset composition, Table 1 summarizes the distribution of emotion classes.

Table 1. Emotion class distribution in the dataset
Emotion     Category    N
Happy       Positive    250
Surprised   Positive    241
Neutral     Neutral     400
Angry       Negative    337
Sad         Negative    282
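The grouping of the five recorded emotions into the three target classes in Table 1 can be expressed as a simple label map. The sketch below is illustrative only (the dictionary and helper name are assumptions, not taken from the paper) and reproduces the per-class counts reported above.

```python
# Minimal sketch (illustrative, not the authors' code): collapse the five
# recorded emotion labels from Table 1 into positive/neutral/negative.
CLASS_MAP = {
    "happy": "positive",
    "surprised": "positive",
    "neutral": "neutral",
    "angry": "negative",
    "sad": "negative",
}

def to_class(emotion_label: str) -> str:
    """Map a fine-grained emotion label to its three-class category."""
    return CLASS_MAP[emotion_label.lower()]

counts = {"happy": 250, "surprised": 241, "neutral": 400, "angry": 337, "sad": 282}
per_class = {}
for emotion, n in counts.items():
    per_class[to_class(emotion)] = per_class.get(to_class(emotion), 0) + n
print(per_class)  # {'positive': 491, 'neutral': 400, 'negative': 619}
```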
2.2. Preprocessing
This stage aims to obtain audio data with voice activity by applying a threshold of 0.001. Previous research often removed silence only from the beginning and end of speech data [8], but in this study, segments with values below the threshold are removed throughout the entire recording, including the beginning, middle, and end. Figure 1 provides a visual comparison between the original audio signal input and the signal after silence removal.

The silence removal technique was implemented using the Librosa library in Python, which is widely adopted for audio processing tasks due to its flexibility and ease of integration. In this study, an amplitude threshold of 0.001 was applied to distinguish between speech and non-speech segments. Segments with amplitude values below this threshold were considered silent and thus excluded from further analysis. The selection of the 0.001 threshold was not arbitrary: it was informed by prior research, which demonstrated that such a value effectively removes low-energy, non-informative segments while preserving the essential speech content necessary for reliable feature extraction and classification.

Figure 1(a) illustrates a portion of the audio signal with no evident voice activity. The amplitude remains consistently close to zero, clearly indicating the presence of silence as defined by the 0.001 threshold. This segment does not contribute meaningful acoustic features and is therefore identified for removal [9]. As a result, Figure 1(b) displays the modified signal after the silence has been removed, showing only the relevant speech segments retained for further processing. This preprocessing step is crucial in enhancing the quality of input data, reducing noise, and improving the performance of subsequent feature extraction and classification stages in SER systems.

Figure 1. The difference in signal: (a) without silence removal and (b) with silence removal

2.3. Feature extraction
The next stage involves feature extraction, which includes three types. First, mel-frequency cepstral coefficients (MFCC) [10] with varying numbers of coefficients (12 coefficients as in [11]-[13], 13 coefficients as in [14], and 40 coefficients as in [15]-[17]), combined with energy and zero-crossing rate (ZCR) [4], [14], [18]. Additionally, comparisons are made with combinations of MFCC coefficients, chroma [18]-[21], and mel-spectrogram [18], [21]. This results in 40 feature combinations for model development. These dynamic features enhance sensitivity to temporal changes in speech, which can signal emotional transitions. This stage reveals the characteristics of the voice from various perspectives and assesses the performance of each characteristic.
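As a concrete illustration of the preprocessing and feature-extraction steps above, the following is a minimal sketch using Librosa. The 0.001 amplitude threshold, the 13-coefficient MFCC setting, and the energy and ZCR features follow the text; the frame-based silence removal, function names, and remaining parameters are assumptions for illustration rather than the authors' exact implementation.

```python
import librosa
import numpy as np

def remove_silence(y, threshold=0.001, frame_length=2048):
    """Keep only non-overlapping frames whose peak amplitude reaches the threshold.

    Sub-threshold segments are dropped throughout the recording (beginning,
    middle, and end), as described in section 2.2.
    """
    n = (len(y) // frame_length) * frame_length
    frames = y[:n].reshape(-1, frame_length)
    voiced = frames[np.max(np.abs(frames), axis=1) >= threshold]
    return voiced.reshape(-1) if voiced.size else y

def extract_features(path, sr=22050, n_mfcc=13):
    """Frame-level MFCC, energy, and ZCR after silence removal (sketch)."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    y = remove_silence(y)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape (13, T)
    energy = librosa.feature.rms(y=y) ** 2                   # RMS-based energy, shape (1, T)
    zcr = librosa.feature.zero_crossing_rate(y)              # shape (1, T)
    return np.vstack([mfcc, energy, zcr])                    # shape (15, T)

features = extract_features("lecture_sample.wav")  # hypothetical file name
```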
The energy after silence removal is usually higher than the energy of the original signal, primarily because quiet or silent parts are removed, leaving only the louder, voice-containing sections. However, if only the silent parts are removed, the total energy may not change significantly, but the energy distribution per frame might. Similarly, with the ZCR feature, silence in the original signal may contain small fluctuations that cause zero-crossings. When silence is removed, these fluctuations disappear, resulting in a lower ZCR. After silence removal, the remaining parts may be more consistent or stable, meaning fewer rapid changes crossing zero, leading to a decrease in ZCR.

Like ZCR, silence in the original signal can also affect the spectral representation captured by MFCC. MFCC is a crucial feature in voice signal analysis used to capture rich spectral information. When silence is removed, MFCC analysis becomes more focused on the relevant parts of the voice, improving accuracy in recognizing voice patterns and emotions. By removing silence, we eliminate segments that do not carry important information, making the resulting MFCC more representative of the true characteristics of the voice. Visualization of MFCC before and after silence removal shows differences in spectral representation: the MFCC after silence removal is more stable and reflects clearer, more consistent voice patterns.

2.3.1. Energy
Energy is one of the most fundamental acoustic features in SER. It quantifies the overall strength or power of the speech signal in the time domain, reflecting how loudly or forcefully a person is speaking. Vocal intensity, captured by energy, often corresponds with emotional arousal and activation levels: for instance, high-arousal emotions like anger, joy, or fear tend to be expressed with greater energy, while low-arousal states like sadness or boredom result in quieter speech. Many studies in SER therefore incorporate energy as a reliable indicator of emotional expression, and frequently apply statistical functionals (e.g., mean, variance, and extremes) over energy contours to characterize emotion [19], [22], [23]. These temporal-energy patterns help classifiers distinguish between high-intensity emotional states and more subdued expressions, enhancing the overall robustness and performance of emotion detection systems.

Signal energy is a fundamental measure that quantifies the total power contained within an audio signal over time. It reflects how 'strong' or 'loud' the signal is, which is essential for tasks like voice activity detection and emotion analysis. In the discrete-time definition, x[n] denotes the amplitude of the signal at sample n, and N is the total number of samples in the frame.
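The energy definitions referred to in this passage appear to have been lost during text extraction; in standard notation (a reconstruction based on the surrounding description, not the paper's original equation) they can be written as:

E = \int_{-\infty}^{\infty} |x(t)|^2 \, dt \quad \text{(continuous time)}, \qquad E = \sum_{n=1}^{N} x[n]^2 \quad \text{(discrete time, per frame)}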
By squaring the amplitude, we ensure that both positive and negative values contribute positively to the total energy, thereby providing an accurate measure of signal strength. This discrete-time definition is widely used in audio processing due to its simplicity and computational efficiency. In practice, the integral is approximated by summing over finite-duration frames, as shown above, because real-world signals are finite; the continuous form is based on continuous-domain theory but is rarely used directly in digital signal processing due to discretization. Signal energy correlates with perceived loudness, though loudness perception is more complex and frequency-dependent. Converting energy to a logarithmic (decibel) scale allows audio engineers to handle very large variations in signal energy more conveniently, aligning more closely with human perception. Signal energy is a core feature in emotion recognition systems, since more intense vocal expressions (like anger or excitement) exhibit higher energy, whereas calmer speech (like sadness) tends to have lower energy. In feature extraction pipelines, energy is often used alongside MFCC and ZCR to provide a more holistic representation of the emotional content of speech.

Figure 2 visualizes the energy feature without silence removal (Figure 2(a)) and with silence removal (Figure 2(b)) by plotting a short-time energy contour directly beneath the raw waveform, with time on the x-axis and energy magnitude on the y-axis. This representation clearly highlights where speech segments occur: peaks correlate with voiced, high-intensity speech, while valleys indicate silence or quieter, low-arousal states like sadness or boredom. This visualization is especially useful for voice activity detection and emotion analysis: the temporal patterns, where energy spikes or dips, help in characterizing emotional states over time. Typically, a sliding window of 10-30 ms (e.g., 160-320 samples at 16 kHz) is used to balance time resolution and smoothing of rapid amplitude changes.
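A short-time energy contour of the kind plotted in Figure 2 can be computed roughly as follows. This is a sketch only; the 25 ms window and 10 ms hop are example values within the 10-30 ms range mentioned above, not the paper's exact settings.

```python
import numpy as np

def short_time_energy(y, sr=16000, win_ms=25, hop_ms=10):
    """Frame-wise energy contour: sum of squared samples per window (sketch)."""
    win = int(sr * win_ms / 1000)   # e.g., 400 samples at 16 kHz
    hop = int(sr * hop_ms / 1000)   # e.g., 160 samples at 16 kHz
    energy = []
    for start in range(0, len(y) - win + 1, hop):
        frame = y[start:start + win]
        energy.append(np.sum(frame ** 2))
    return np.array(energy)
```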
Figure 2. The difference in energy: (a) without silence removal and (b) with silence removal

2.3.2. Chroma
Chroma features are chosen because they capture harmonic and pitch-related information, attributes that go beyond what typical spectral features (like MFCC or ZCR/energy) represent. By encoding the distribution of energy across the twelve pitch classes, chroma features reveal tonal characteristics and musicality within speech, such as subtle pitch modulations, intonation patterns, and harmonic structure, that are often linked to the expression of emotions [24]. In fact, studies show that adding chroma to traditional feature sets consistently improves SER performance: for example, they contribute notably to emotion discrimination across datasets like RAVDESS and TESS, helping models distinguish emotional nuances that would be missed by MFCC alone. However, chroma's performance can vary depending on the dataset and emotional content, and it still requires further tuning, such as combining chroma with temporal or rhythmic context, to reach optimal accuracy in emotion classification tasks [24].

Chroma features represent the spectral energy distribution over the twelve musical pitch classes (C, C#/Db, ..., B), which aggregates octave-independent pitch information. They are particularly valuable in audio analysis tasks, such as emotion recognition in speech, because they capture harmonic and tonal characteristics while being invariant to timbre, instrumentation, and octave shifts. This makes them robust descriptors for capturing pitch-related variations in spoken utterances. Figure 3 shows the resulting chromagrams without silence removal (Figure 3(a)) and with silence removal (Figure 3(b)); each is a two-dimensional time-chroma matrix showing how the spectral content is distributed across pitch classes over time. This structure is highly effective at summarizing harmonic content, as notes with identical pitch class but different octaves contribute to the same bin, preserving musical color regardless of octave. Such octave invariance also ensures chroma features remain stable under pitch shifts or speaker variations. Chroma features are robust to changes in timbre and dynamic range since they focus on pitch-class patterns rather than exact spectral shapes. In emotion analysis, this helps capture intonational melodies and pitch modulations associated with affective speech, even amidst background noise or speaker variability. Augmentations like harmonic pitch class profiles (HPCP) further enhance robustness by tuning alignment and energy normalization across octaves.
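A chromagram like the one in Figure 3 can be obtained with Librosa along these lines; the function and parameters shown are standard Librosa usage assumed for illustration, not settings confirmed by the paper.

```python
import librosa

y, sr = librosa.load("lecture_sample.wav", sr=22050, mono=True)  # hypothetical file
# 12 x T matrix: energy distribution over the twelve pitch classes per frame
chroma = librosa.feature.chroma_stft(y=y, sr=sr, n_fft=2048, hop_length=512)
print(chroma.shape)  # (12, number_of_frames)
```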
Figure 3. The difference in chroma: (a) without silence removal and (b) with silence removal

2.3.3. Mel-frequency cepstral coefficients
MFCCs are chosen because they effectively encode the physical characteristics of sound signals by simulating human auditory perception: they apply a mel-scale filter bank that emphasizes frequencies in a way humans perceive, apply logarithmic compression to resemble loudness perception, and perform a discrete cosine transform to decorrelate filter outputs into compact coefficients. This structure enables MFCCs to extract phonetic content that is particularly valuable for emotion classification: prior work has shown that even a modest number of MFCC features [15], [10], [25] carries significant emotion discrimination power by capturing spectral variations tied to vocal tract dynamics. By forming a distinct feature map, these coefficients allow machine learning models to differentiate subtle emotional cues in speech, making MFCCs an effective choice for emotion recognition tasks.

Figure 4 visualizes the characteristics of an audio signal, typically represented as a heatmap: a time-series visualization of MFCC coefficients without silence removal (Figure 4(a)) and with silence removal (Figure 4(b)). The x-axis of the heatmap is time, while the y-axis shows the cepstral coefficient indices (e.g., MFCC 1-13). Each cell in the heatmap represents an amplitude value, darker or lighter depending on the color palette, corresponding to a specific time and coefficient index.

2.3.4. Mel-spectrogram
The mel-spectrogram is used because it provides a frequency representation on the mel scale with both time and frequency dimensions, which is suitable for processing with 2D convolutional kernels in deep learning models. By converting audio signals into image-like spectrograms, CNNs can effectively learn localized time-frequency
patterns, such as energy bursts, formant shifts, and pitch contours, that are strongly associated with different emotional states. Recent research demonstrates that feeding mel-spectrograms into CNN architectures enables models to autonomously extract salient emotional cues, leading to improved classification performance compared to conventional approaches [26].

A mel-spectrogram is a perceptually motivated time-frequency representation of audio, widely used in speech and emotion recognition. It aligns with how humans perceive sound by emphasizing lower frequencies and compressing higher bands. As a result, it produces an image-like matrix well suited for deep learning applications, especially CNNs. The log transformation compresses the dynamic range, mimicking human loudness perception, and adding a small constant ϵ prevents taking the log of zero; the result is a stable feature representation for deep learning.

Figure 5 shows heatmaps of the mel-spectrogram without silence removal (Figure 5(a)) and with silence removal (Figure 5(b)), with time on the x-axis and mel frequency (in Hz on the mel scale) on the y-axis, where color intensity represents magnitude in decibels (dB). Brighter bands on the heatmap indicate regions of high energy at specific frequencies and times, such as formant resonances or pitch harmonics, while darker areas show quieter portions. This image-like representation enables deep learning models, particularly 2D CNNs, to detect localized time-frequency patterns, such as energy bursts or frequency shifts, associated with emotional cues. The decibel (log amplitude) scale compresses the dynamic range visually, making both subtle and prominent audio features apparent.

Figure 4. The difference in MFCC: (a) without silence removal and (b) with silence removal
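The MFCC and log mel-spectrogram representations described in sections 2.3.3 and 2.3.4 can be computed with Librosa roughly as follows. The 13-coefficient MFCC setting and the small constant added before the logarithm follow the text; the remaining parameters (n_fft, hop length, number of mel bands) are common defaults assumed for illustration.

```python
import librosa
import numpy as np

y, sr = librosa.load("lecture_sample.wav", sr=22050, mono=True)  # hypothetical file

# 13 MFCC coefficients per frame, as in the best-performing configuration
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=2048, hop_length=512)

# Mel-scale power spectrogram, then log compression with a small epsilon
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128)
log_mel = np.log(mel + 1e-10)            # epsilon avoids log(0)
log_mel_db = librosa.power_to_db(mel)    # dB-scaled view, as plotted in Figure 5

print(mfcc.shape, log_mel.shape)  # (13, T) and (128, T)
```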
Figure 5. The difference in mel-spectrogram: (a) without silence removal and (b) with silence removal

The final mel-spectrogram matrix $\{\tilde{S}_m(t)\}$, $m = 1, \dots, M$, $t = 1, \dots, T$, serves as an efficient and perceptually aligned input to 2D convolutional kernels. It preserves time-frequency locality, enabling neural networks to detect emotional cues such as formant shifts, pitch contours, and energy bursts. By combining mel-scale filtering, log compression, and spectral smoothing, mel-spectrograms outperform linear spectrograms in capturing emotive vocal patterns, making them ideal for emotion recognition architectures.

2.3.5. Zero-crossing rate
This feature measures the smoothness of the audio signal by counting how frequently it changes sign, crossing from positive to zero to negative, or vice versa, within a given time frame [27]. Also known as the number of zero-axis crossings per unit time [4], ZCR effectively captures the noisiness or smoothness of the signal: noisy or unvoiced segments typically exhibit higher ZCR, while voiced and more periodic regions yield lower values. ZCR is clearly associated with spectral content: a higher ZCR indicates richer high-frequency components, while lower values align with more periodic, low-frequency sounds. ZCR is widely used in voice activity detection, voiced/unvoiced frame classification, and even as an excitation-level indicator in emotion recognition systems. Combined with features like energy and MFCCs, ZCR enhances spectral representations by providing insights into speech articulation dynamics and intensity fluctuations.

Figure 6 shows the ZCR visualization without silence removal (Figure 6(a)) and with silence removal (Figure 6(b)); each uses a dual-plot layout in which the top plot displays the raw audio waveform (amplitude versus time), while the bottom plot shows the short-time ZCR over the same time axis. Peaks in the ZCR curve correspond to rapid sign changes, common during unvoiced sounds or noisy segments, while troughs align with voiced regions where the waveform oscillates smoothly and crosses zero less often. This visual alignment allows researchers to immediately identify voiced/unvoiced segments and associate sudden fluctuations with phonetic or emotional cues. Because ZCR is calculated per frame (e.g., 10-30 ms windows), the contour's temporal resolution effectively highlights dynamic speech features critical for emotion detection and voice activity tasks.
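The frame-wise ZCR contour shown in Figure 6, together with a simple threshold-based voiced/unvoiced labeling of the kind discussed below, can be sketched as follows; the 0.1 threshold and frame settings are illustrative values, not parameters reported in the paper.

```python
import librosa

y, sr = librosa.load("lecture_sample.wav", sr=22050, mono=True)  # hypothetical file

# Frame-wise zero-crossing rate (fraction of sign changes per frame)
zcr = librosa.feature.zero_crossing_rate(y, frame_length=2048, hop_length=512)[0]

# Simple threshold-based voiced/unvoiced labeling of frames; the 0.1 value is
# an illustrative choice, not a threshold taken from the paper.
unvoiced = zcr > 0.1
print(f"{unvoiced.mean():.1%} of frames labeled unvoiced")
```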
Figure 6. The difference in ZCR: (a) without silence removal and (b) with silence removal

Furthermore, in Figure 6, the ZCR contour is often depicted alongside a horizontal threshold line that classifies frames as voiced or unvoiced. Frames whose ZCR exceeds the threshold are marked as unvoiced (typically shown in one color), while those below are considered voiced (shown in another). This threshold-based segmentation is validated by prior work demonstrating that unvoiced segments generally exhibit higher ZCR and lower energy than voiced segments, whose ZCR is low and energy is high. Such delineation enables automated voice activity detection and helps the model focus on emotionally rich voiced regions. Moreover, the sharp contrast in ZCR trends between voiced and unvoiced regions offers visual cues about changes in speech excitation: peaks in the ZCR curve often align with phonetic transitions or bursts, which can be critical indicators of emotional states or emphatic speech patterns.

2.4. Model architecture
Experiments were conducted with several types of models, including CNN-1D, LSTM, bidirectional long short-term memory (Bi-LSTM), a combination of CNN-1D and LSTM, and a combination of CNN-1D and Bi-LSTM. Each model was tested with the 8 feature extraction results from the previous stage, resulting in 40 model scenarios. The summary of model types and their respective layer compositions is presented in Table 2.

In this model, the frame size is derived from the audio's frame duration multiplied by the sample rate. At a standard sample rate of 22,050 Hz, using n_fft=2048 results in each frame spanning approximately
93 ms, as Librosa applies an FFT window of that size by default. Meanwhile, a hop length of 512 samples is employed, which leads to approximately 75% overlap between successive frames. In practical terms, this configuration yields analysis shifts of roughly 23 ms per frame, promoting smoother temporal transitions during feature extraction. The dataset is divided into 80% for training and 20% for testing (test_size=0.2). Training is performed for up to 50 epochs, with early stopping enabled via EarlyStopping(monitor='val_accuracy', patience=5) to halt training if validation accuracy does not improve over five consecutive epochs. We utilize the Adam optimizer, combined with the categorical cross-entropy loss function and an initial learning rate of 0.001. The model is trained using a batch size of 32. Activation functions include the rectified linear unit (ReLU) in the convolutional and dense hidden layers, and Softmax in the output layer for multi-class classification.

Table 2. The summary of model types
Architecture        Layers
CNN-1D              - Conv1D(filters=x1, kernel=3, ReLU)
                    - MaxPooling1D(pool=2)
                    - Conv1D(filters=x2, kernel=3, ReLU)
                    - MaxPooling1D(pool=2)
                    - Flatten
                    - Dense(units=x3, ReLU)
                    - Dense(units=3, Softmax)
LSTM                - LSTM(x1 units, tanh, return_sequences=True)
                    - LSTM(x2 units, tanh)
                    - Dense(x3 units, ReLU)
                    - Dense(3 units, Softmax)
CNN-1D + LSTM       - Conv1D(128, kernel=5, ReLU) + MaxPooling
                    - Conv1D(64, kernel=5, ReLU) + MaxPooling
                    - Dropout(0.3)
                    - LSTM(128, return_sequences=True)
                    - LSTM(64)
                    - Dense(32, ReLU)
                    - Dense(3, Softmax)
CNN-1D + Bi-LSTM    - Conv1D(32, kernel=3, ReLU, L2=1e-4), BatchNorm, MaxPool(2), Dropout(0.3)
                    - Conv1D(64, kernel=3, ReLU, L2=1e-4), BatchNorm, MaxPool(2), Dropout(0.3)
                    - Bi-LSTM(128, return_sequences=True, L2=1e-4), Dropout(0.3)
                    - Bi-LSTM(64, L2=1e-4), Dropout(0.3)
                    - Dense(output Softmax) with L1=1e-5, L2=1e-4 regularization

2.5. Experimental result
In this study, the dataset was partitioned into three subsets: 80% for training, 10% for validation, and 10% for testing. This stratified split not only ensures the model has ample data to learn the underlying patterns but also provides a robust framework for evaluation. The validation set is used during training to monitor overfitting, tune hyperparameters, and guide early stopping, while the test set remains unseen until the very end to offer an unbiased measure of generalization performance. Adopting this split ratio aligns with standard machine learning practices, where an 80/10/10 partition is widely recommended to maintain representative class distributions and avoid biased estimates. Moreover, stratified sampling was applied to preserve the proportional representation of each emotion class across all subsets, which prevents class imbalance from skewing model performance evaluations.

2.6. Implementation result
The process of evaluating lecturer performance through SER starts with analyzing all audio data obtained from lecture recordings. The audio data undergoes consistent preprocessing to ensure accurate and reliable results.
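Referring back to Table 2 and the training configuration in section 2.4, a minimal Keras sketch of the CNN-1D + Bi-LSTM model is given below. Layer sizes, dropout, regularization strengths, optimizer, learning rate, batch size, and early stopping follow Table 2 and the text; the input shape, the exact placement of the regularizers, and the prepared arrays (X_train, y_train, X_val, y_val) are assumptions for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers

def build_cnn_bilstm(n_timesteps, n_features, n_classes=3):
    """CNN-1D + Bi-LSTM following Table 2 (sketch; unstated details assumed)."""
    l2 = regularizers.l2(1e-4)
    return models.Sequential([
        layers.Input(shape=(n_timesteps, n_features)),
        layers.Conv1D(32, 3, activation="relu", kernel_regularizer=l2),
        layers.BatchNormalization(),
        layers.MaxPooling1D(2),
        layers.Dropout(0.3),
        layers.Conv1D(64, 3, activation="relu", kernel_regularizer=l2),
        layers.BatchNormalization(),
        layers.MaxPooling1D(2),
        layers.Dropout(0.3),
        layers.Bidirectional(layers.LSTM(128, return_sequences=True, kernel_regularizer=l2)),
        layers.Dropout(0.3),
        layers.Bidirectional(layers.LSTM(64, kernel_regularizer=l2)),
        layers.Dropout(0.3),
        layers.Dense(n_classes, activation="softmax",
                     kernel_regularizer=regularizers.l1_l2(l1=1e-5, l2=1e-4)),
    ])

# Training setup from the text: Adam with learning rate 0.001, categorical
# cross-entropy, up to 50 epochs, batch size 32, early stopping on validation
# accuracy with patience 5.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=5)
# model = build_cnn_bilstm(n_timesteps=..., n_features=15)
# model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
#               loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=50, batch_size=32, callbacks=[early_stop])
```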
SER typically follows a structured pipeline. It begins by capturing and cleaning the raw audio signal, removing noise and dividing it into short, overlapping frames, often using pre-emphasis, endpoint detection, and framing techniques. Next, for each frame, acoustic features are extracted; these often include hand-crafted descriptors like MFCCs, pitch, energy, ZCR, spectral coefficients, or more advanced formant and wavelet features. In modern systems, these handcrafted features may be enhanced with deep representations (e.g., embeddings from wav2vec or HuBERT), sometimes using multi-stream fusion architectures to capture complementary information. The resulting features are then fed into classifiers, ranging from traditional models