IAES International Journal of Artificial Intelligence (IJ-AI)
Vol. 14, No. 6, December 2025, pp. 5157∼5171
ISSN: 2252-8938, DOI: 10.11591/ijai.v14.i6.pp5157-5171
Classifier model for lecturer evaluation by students using speech emotion recognition and deep learning approaches
Yesy Diah Rosita 1,2, Wahyu Andi Saputra 2
1 Center of Excellence for Human Centric Engineering, Institute of Sustainable Society, Telkom University, Bandung, Indonesia
2 Informatics Engineering Study Program, School of Computing, Telkom University, Purwokerto, Indonesia
Article Info

Article history:
Received Jul 31, 2024
Revised Sep 10, 2025
Accepted Oct 16, 2025

Keywords:
Bi-LSTM
Energy
Evaluation
Lecturer
MFCC
Student
Zero-crossing rate
ABSTRACT

Lecturers play a crucial role in higher education, with their teaching behavior directly impacting learning and teaching quality. Lecturer evaluation by students (LES) is a common method for assessing lecturer performance, though it often relies on subjective perceptions. As a more objective alternative, speech emotion recognition (SER) uses speech technology to analyze emotions in the speech of lecturers during classes. This study proposes using deep learning-based SER, including convolutional neural network (CNN) and bidirectional long short-term memory (Bi-LSTM), to evaluate teaching quality by analyzing displayed emotions. Removing silence from audio signals is crucial for enhancing feature analysis, such as energy, zero-crossing rate (ZCR), and mel-frequency cepstral coefficients (MFCC). This method removes inactive segments, emphasizing significant segments and improving accuracy in detecting voice and emotions. Results show that the 1D CNN model with Bi-LSTM, using MFCC with 13 coefficients, energy, and ZCR, performs excellently in emotion detection, achieving a validation accuracy of over 0.851 with an accuracy gap of 0.002. This small gap indicates good generalization and reduces the risk of overfitting, making teaching evaluations more objective and valuable for improving practices.

This is an open access article under the CC BY-SA license.
Corresponding Author:
Yesy Diah Rosita
Informatics Engineering Study Program, School of Computing, Telkom University
St. D.I. Panjaitan No. 128, Purwokerto, Banyumas, Central Java-53147, Indonesia
Email: yesydr@telkomuniversity.ac.id
1. INTRODUCTION
Lecturers play a crucial role in higher education, where their teaching behavior directly impacts the learning process and ultimately determines the quality of education provided. This role is vital as the quality of teaching affects students' learning experiences and their academic outcomes. To ensure that teaching standards remain high, many higher education institutions have implemented lecturer evaluation by students (LES) systems to assess lecturer performance during classes [1]. These evaluations typically cover aspects such as lecturer discipline, subject mastery, and their interactions with students. LES is presented in the form of a questionnaire that students complete at the end of the semester. This questionnaire aims to provide an overview of the teaching quality delivered by lecturers, and the results of this evaluation impact the course grades listed on students' transcripts [2]. However, this method tends to be subjective because the assessment is based on each student's personal perception, which can be influenced by factors such as mood, personal experiences, or individual interactions with the lecturer. Consequently, the results of LES may not fully reflect the objective quality of teaching and are often inadequate as a sole measure of lecturer performance.
As an alternative for a more objective assessment of teaching quality, emotion analysis-based approaches can be utilized. One promising method is speech emotion recognition (SER), which leverages speech recognition technology to analyze the emotions [3], [4] present in lecturers' speech during classes. SER relies on extracting features from audio speech signals to determine the types of emotions expressed by lecturers. This technology offers potential for a more objective evaluation since the emotions captured in speech can provide deeper insights into the lecturer's mood and attitude while teaching. Previous research indicates that emotions can generally be categorized into three classes: positive, negative, and neutral [5]. Using SER in this context allows for a more holistic assessment of how lecturers display their emotions during teaching. By identifying feature extraction patterns and appropriate model configurations, SER can provide accurate data on the percentage of emotions expressed by lecturers throughout a class session. This paves the way for a more objective evaluation method that relies not only on students' subjective perceptions but also on empirical data generated from audio analysis.
In the context of technological development, the use of deep learning has become an increasingly popular approach in SER. Deep learning algorithms, particularly deep neural networks, can process and analyze feature data more effectively than conventional methods. Convolutional neural networks (CNNs) and long short-term memory (LSTM) networks have proven highly efficient in recognizing patterns in speech and emotion data [6], [7]. Applying these techniques in SER improves accuracy and strengthens the model's ability to understand more complex emotional contexts.
The combination of SER and deep learning offers an innovative solution for lecturer evaluation. By integrating emotion analysis technology with deep learning algorithms, we can gain deeper insights into teaching quality and classroom atmosphere. This approach not only enhances accuracy in assessment but also provides more valuable data for continuous improvement in teaching practices.
2. METHOD
The objective of this study is to evaluate the performance of a deep learning model capable of classifying lecturer performance in delivering lecture material through SER. In this context, lecturers' emotions are classified into three classes: positive (happy and surprised), neutral, and negative (angry and sad). The methodology involves several stages: data collection, preprocessing, feature extraction, model creation, and performance evaluation.
2.1. Data collection
The data consists of speech samples in Indonesian, totaling 1,600 samples with a duration of 3-5 seconds: 491 positive (250 happy and 241 surprised), 619 negative (337 angry and 282 sad), and 400 neutral. The audio files are in .wav format with a mono channel. Data was collected using a clip-on wireless microphone placed on the respondent's chest to ensure stable recording. The equipment features include: up to 100 m wireless operating range, selectable mono/stereo output mode, a 3.5 mm headphone jack for real-time monitoring, a built-in omnidirectional microphone for 360° sound pickup, and compatibility with smartphones, tablets, cameras, recorders, or other audio/video recording devices. The same equipment was used to record lecturers during their presentations, with audio samples lasting approximately 30-60 seconds for emotion analysis. A total of 30 samples were collected, corresponding to the number of active lecturers in the School of Computing, Telkom University. To provide a clearer picture of the dataset composition, Table 1 summarizes the distribution of emotion classes.
Table 1. Emotion class distribution in the dataset

Emotion     Category    N
Happy       Positive    250
Surprised   Positive    241
Neutral     Neutral     400
Angry       Negative    337
Sad         Negative    282
2.2. Preprocessing
This stage aims to obtain audio data with voice activity by applying a threshold of 0.001. Previous research often removed silence only from the beginning and end of speech data [8], but in this study, segments with values below the threshold are removed throughout the entire recording, including the beginning, middle, and end. Figure 1 provides a visual comparison between the original audio signal input and the signal after silence removal.

The silence removal technique was implemented using the Librosa library in Python, which is widely adopted for audio processing tasks due to its flexibility and ease of integration. In this study, an amplitude threshold of 0.001 was applied to distinguish between speech and non-speech segments. Segments with amplitude values below this threshold were considered silent and thus excluded from further analysis. The selection of the 0.001 threshold was not arbitrary. It was informed by prior research, which demonstrated that such a value effectively removes low-energy, non-informative segments while preserving the essential speech content necessary for reliable feature extraction and classification.

Figure 1(a) illustrates a portion of the audio signal where there is no evident voice activity. The amplitude remains consistently close to zero, clearly indicating the presence of silence, as defined by the 0.001 threshold. This segment does not contribute meaningful acoustic features and, therefore, is identified for removal [9]. As a result, Figure 1(b) displays the modified signal after the silence has been removed, showcasing only the relevant speech segments retained for further processing. This preprocessing step is crucial in enhancing the quality of input data, reducing noise, and improving the performance of subsequent feature extraction and classification stages in SER systems.
Figure 1. The difference in signal: (a) without silence removal and (b) with silence removal
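A minimal sketch of this thresholding with Librosa and NumPy is given below; the non-overlapping frame length and the file name are illustrative assumptions, not necessarily the exact implementation used in this study.

import numpy as np
import librosa

def remove_silence(path, threshold=0.001, frame_length=2048):
    # Load the recording as a mono signal at its native sample rate.
    y, sr = librosa.load(path, sr=None, mono=True)
    # Pad so the signal divides evenly into non-overlapping frames.
    pad = (-len(y)) % frame_length
    frames = np.pad(y, (0, pad)).reshape(-1, frame_length)
    # Keep only frames whose peak amplitude reaches the 0.001 threshold,
    # regardless of where they occur (beginning, middle, or end).
    keep = np.abs(frames).max(axis=1) >= threshold
    return frames[keep].reshape(-1), sr

# Example (illustrative file name): y_clean, sr = remove_silence("lecture_sample.wav")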
2.3. Feature extraction
The next stage involves feature extraction, which includes three types. First, mel-frequency cepstral coefficients (MFCC) [10] with varying numbers of coefficients (12 coefficients as in [11]–[13]; 13 coefficients as in [14]; 40 coefficients as in [15]–[17]) are combined with energy and zero-crossing rate (ZCR) [4], [14], [18]. Additionally, comparisons are made with combinations of MFCC coefficients, Chroma [18]–[21], and mel-spectrogram [18], [21]. This results in 40 feature combinations for model development. These dynamic features enhance sensitivity to temporal changes in speech, which can signal emotional transitions.
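As an illustration of how one such combination can be assembled, the sketch below builds a fixed-length vector from 13 MFCCs, short-time energy, and ZCR using Librosa; averaging each feature over frames is an assumed pooling choice for illustration, not necessarily the exact procedure used here.

import numpy as np
import librosa

def extract_features(y, sr, n_mfcc=13, frame_length=2048, hop_length=512):
    # 13 cepstral coefficients per frame.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=frame_length, hop_length=hop_length)
    # Short-time energy per frame: sum of squared amplitudes, recovered from RMS.
    rms = librosa.feature.rms(y=y, frame_length=frame_length, hop_length=hop_length)
    energy = (rms ** 2) * frame_length
    # Zero-crossing rate per frame.
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame_length,
                                             hop_length=hop_length)
    # Pool each stream over time and concatenate into one 15-dimensional vector.
    return np.concatenate([mfcc.mean(axis=1), energy.mean(axis=1), zcr.mean(axis=1)])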
This stage reveals the characteristics of the voice from various perspectives and assesses the performance of each characteristic. The energy after silence removal is usually higher compared to the original signal energy, primarily because quiet or silent parts are removed, leaving only the louder or voice-containing sections. However, if only the silent parts are removed, the total energy may not change significantly, but the energy distribution per frame might. Similarly, with the ZCR feature, silence in the original signal may contain small fluctuations that cause zero-crossings. When silence is removed, these fluctuations disappear, resulting in a lower ZCR. After silence removal, the remaining parts may be more consistent or stable, meaning fewer rapid changes crossing zero, leading to a decrease in ZCR.

Like ZCR, silence in the original signal can also affect the spectral representation captured by MFCC. MFCC is a crucial feature in voice signal analysis used to capture rich spectral information. When silence is removed, MFCC analysis becomes more focused on the relevant parts of the voice, improving accuracy in recognizing voice patterns and emotions. By removing silence, we eliminate segments that do not carry important information, making the resulting MFCC more representative of the true characteristics of the voice. Visualization of MFCC before and after silence removal shows differences in spectral representation: the MFCC after silence removal is more stable and reflects clearer, more consistent voice patterns.
2.3.1. Energy
Energy is one of the most fundamental acoustic features in SER. It quantifies the overall strength or power of the speech signal in the time domain, reflecting how loudly or forcefully a person is speaking. Vocal intensity, captured by energy, often corresponds with emotional arousal and activation levels: for instance, high-arousal emotions like anger, joy, or fear tend to be expressed with greater energy, while low-arousal states like sadness or boredom result in quieter speech. Many studies in SER therefore incorporate energy as a reliable indicator of emotional expression, and frequently apply statistical functionals (e.g., mean, variance, and extremes) over energy contours to characterize emotion [19], [22], [23]. These temporal-energy patterns help classifiers distinguish between high-intensity emotional states and more subdued expressions, enhancing the overall robustness and performance of emotion detection systems.
Signal energy is a fundamental measure that quantifies the total power contained within an audio signal over time. It reflects how 'strong' or 'loud' the signal is, which is essential for tasks like voice activity detection and emotion analysis. In its discrete form, the energy of a frame is computed from the amplitude of the signal at each sample and the total number of samples in the frame. By squaring the amplitude, both positive and negative values contribute positively to the total energy, thereby providing an accurate measure of signal strength. This discrete-time definition is widely used in audio processing due to its simplicity and computational efficiency.
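For concreteness, the standard forms of the discrete frame energy, its continuous-time counterpart, and the decibel conversion referred to in the following paragraphs can be written as:

E = \sum_{n=1}^{N} x[n]^{2}, \qquad
E_{\mathrm{cont}} = \int_{-\infty}^{\infty} x(t)^{2}\, dt, \qquad
E_{\mathrm{dB}} = 10 \log_{10} E,

where x[n] is the signal amplitude at sample n and N is the number of samples in the frame.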
In practice, the integral is approximated by summing over finite-duration frames, as shown above, because real-world signals are finite. The continuous form is grounded in continuous-domain theory but is rarely used directly in digital signal processing due to discretization.
Signal energy correlates with perceived loudness, though loudness perception is more complex and frequency-dependent. Converting energy to a logarithmic (decibel) scale allows audio engineers to handle very large variations in signal energy more conveniently, aligning more closely with human perception.
Signal energy is a core feature in emotion recognition systems since more intense vocal expressions (like anger or excitement) exhibit higher energy, whereas calmer speech (like sadness) tends to have lower energy. In feature extraction pipelines, energy is often used alongside MFCC and ZCR to provide a more holistic representation of the emotional content of speech.
Figure 2 visualizes the energy feature without silence removal (Figure 2(a)) and with silence removal (Figure 2(b)) by plotting a short-time energy contour directly beneath the raw waveform, with time on the x-axis and energy magnitude on the y-axis. This representation clearly highlights where speech segments occur: peaks correlate with voiced, high-intensity speech, while valleys indicate silence or quieter, low-arousal states like sadness or boredom. This visualization is especially useful for voice activity detection and emotion analysis: the temporal patterns of when energy spikes or dips help in characterizing emotional states over time. Typically, a sliding window of 10–30 ms (e.g., 160–320 samples at 16 kHz) is used to balance time resolution and smoothing of rapid amplitude changes.
Figure 2. The difference in energy: (a) without silence removal and (b) with silence removal
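A sketch of how such a contour can be computed and plotted beneath the waveform follows; the 25 ms window, hop size, and file name are illustrative assumptions.

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("lecture_sample.wav", sr=22050)
frame = int(0.025 * sr)          # ~25 ms analysis window
hop = frame // 2
# Short-time energy per frame, recovered from the RMS value.
energy = (librosa.feature.rms(y=y, frame_length=frame, hop_length=hop)[0] ** 2) * frame
t = librosa.frames_to_time(np.arange(len(energy)), sr=sr, hop_length=hop)

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
librosa.display.waveshow(y, sr=sr, ax=ax1)   # raw waveform on top
ax2.plot(t, energy)                          # energy contour beneath it
ax2.set(xlabel="Time (s)", ylabel="Energy")
plt.show()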
2.3.2. Chroma
Chroma features are chosen because they capture harmonic and pitch-related information: attributes that go beyond what typical spectral features (like MFCC or ZCR/energy) represent. By encoding the distribution of energy across the twelve pitch classes, chroma features reveal tonal characteristics and musicality within speech, such as subtle pitch modulations, intonation patterns, and harmonic structure, that are often linked to the expression of emotions [24]. In fact, studies show that adding chroma to traditional feature sets consistently improves SER performance: for example, they notably contribute to emotion discrimination across datasets like RAVDESS and TESS, helping models distinguish emotional nuances that would be missed by MFCC alone. However, chroma's performance can vary depending on the dataset and emotional content, and it still requires further tuning, like combining chroma with temporal or rhythmic context, to reach optimal accuracy in emotion classification tasks [24].
Chroma features represent the spectral energy distribution over the twelve musical pitch classes (C, C♯/D♭, ..., B), which aggregates octave-independent pitch information. They are particularly valuable in audio analysis tasks, such as emotion recognition in speech, because they capture harmonic and tonal characteristics while being invariant to timbre, instrumentation, and octave shifts. This makes them robust descriptors for capturing pitch-related variations in spoken utterances. Figure 3 demonstrates that the resulting chromagrams without silence removal (Figure 3(a)) and with silence removal (Figure 3(b)) are two-dimensional time–chroma matrices showing how the spectral content is distributed across pitch classes over time. This structure is highly effective at summarizing harmonic content, as notes with identical pitch class but different octaves contribute to the same bin, preserving musical color regardless of octave. Such octave invariance also ensures chroma features remain stable under pitch shifts or speaker variations.

Chroma features are robust to changes in timbre and dynamic range since they focus on pitch-class patterns rather than exact spectral shapes. In emotion analysis, this helps capture intonational melodies and pitch modulations associated with affective speech, even amidst background noise or speaker variability. Augmentations like harmonic pitch class profiles (HPCP) further enhance robustness by tuning alignment and energy normalization across octaves.
Figure 3. The difference in chroma: (a) without silence removal and (b) with silence removal
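A minimal sketch of computing a chromagram from the silence-removed signal with Librosa; the framing parameters and file name are illustrative.

import librosa

y, sr = librosa.load("lecture_sample.wav", sr=None)
# 12 pitch-class energies per frame; rows correspond to C, C#, ..., B.
chroma = librosa.feature.chroma_stft(y=y, sr=sr, n_fft=2048, hop_length=512)
print(chroma.shape)   # (12, number_of_frames)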
2.3.3. Mel-frequency cepstral coefficients
MFCCs are chosen because they effectively encode the physical characteristics of sound signals by simulating human auditory perception: they apply a mel-scale filter bank that emphasizes frequencies in a way humans perceive, apply logarithmic compression to resemble loudness perception, and perform a discrete cosine transform to decorrelate filter outputs into compact coefficients. This structure enables MFCCs to extract phonetic content that is particularly valuable for emotion classification: prior work has shown that even a modest number of MFCC features [15], [10], [25] carry significant emotion discrimination power by capturing spectral variations tied to vocal tract dynamics. By forming a distinct feature map, these coefficients allow machine learning models to differentiate subtle emotional cues in speech, making MFCCs an effective choice for emotion recognition tasks.
Figure 4 visualizes the characteristics of an audio signal, typically represented as a heatmap, i.e., a time-series visualization of MFCC coefficients without silence removal (Figure 4(a)) and with silence removal (Figure 4(b)). On the x-axis of the heatmap is time, while the y-axis shows the cepstral coefficient indices (e.g., MFCC 1–13). Each cell in the heatmap represents an amplitude value, darker or lighter depending on the color palette, corresponding to a specific time and coefficient index.
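The 13-coefficient configuration used in the best-performing model can be computed with Librosa as in the sketch below; the sample rate and framing parameters follow those given in section 2.4, while the file name is illustrative.

import librosa

y, sr = librosa.load("lecture_sample.wav", sr=22050)
# 13 cepstral coefficients per frame, using the same n_fft/hop_length
# as the rest of the pipeline.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=2048, hop_length=512)
print(mfcc.shape)   # (13, number_of_frames)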
2.3.4. Mel-spectrogram
The mel-spectrogram is used because it provides a frequency representation on the mel scale with both time and frequency dimensions, which is suitable for processing with 2D convolutional kernels in deep learning models. By converting audio signals into image-like spectrograms, CNNs can effectively learn localized time-frequency patterns, such as energy bursts, formant shifts, and pitch contours, that are strongly associated with different emotional states. Recent research demonstrates that feeding mel-spectrograms into CNN architectures enables models to autonomously extract salient emotional cues, leading to improved classification performance compared to conventional approaches [26].

A mel-spectrogram is a perceptually motivated time–frequency representation of audio, widely used in speech and emotion recognition. It aligns with how humans perceive sound by emphasizing lower frequencies and compressing higher bands. As a result, it produces an image-like matrix well suited for deep learning applications, especially CNNs. The log transformation compresses the dynamic range, mimicking human loudness perception. Adding a small constant ϵ prevents taking the log of zero. The result yields a stable feature representation for deep learning.

Figure 5 shows a heatmap of the mel-spectrogram without silence removal (Figure 5(a)) and with silence removal (Figure 5(b)), with time on the x-axis and mel frequency (in Hz on the mel scale) on the y-axis, where color intensity represents magnitude in decibels (dB). Brighter bands on the heatmap indicate regions of high energy at specific frequencies and times, such as formant resonances or pitch harmonics, while darker areas show quieter portions. This image-like representation enables deep learning models, particularly 2D CNNs, to detect localized time-frequency patterns, such as energy bursts or frequency shifts, associated with emotional cues. The decibel scale (log amplitude) ensures the dynamic range is visually compressed, making both subtle and prominent audio features apparent.
Figure 4. The difference in MFCC: (a) without silence removal and (b) with silence removal
Figure 5. The difference in mel-spectrogram: (a) without silence removal and (b) with silence removal
The final mel-spectrogram matrix {S̃_m(t)}, with mel bands m = 1, ..., M and frames t = 1, ..., T, serves as an efficient and perceptually aligned input to 2D convolutional kernels. It preserves time–frequency locality, enabling neural networks to detect emotional cues such as formant shifts, pitch contours, and energy bursts. By combining mel-scale filtering, log compression, and spectral smoothing, mel-spectrograms outperform linear spectrograms in capturing emotive vocal patterns, making them ideal for emotion recognition architectures.
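A sketch of the log mel-spectrogram computation described above, with a small constant ε guarding the logarithm; the number of mel bands and the file name are assumed values.

import numpy as np
import librosa

y, sr = librosa.load("lecture_sample.wav", sr=22050)
# Mel-scaled power spectrogram: rows are mel bands, columns are frames.
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128)
eps = 1e-10
log_S = np.log(S + eps)                     # log compression; eps prevents log(0)
S_db = librosa.power_to_db(S, ref=np.max)   # equivalent decibel form, often used for plotting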
2.3.5. Zero-crossing rate
This feature measures the smoothness of the audio signal by counting how frequently it changes sign, crossing from positive through zero to negative or vice versa, within a given time frame [27]. Also known as the number of zero-axis crossings per unit time [4], ZCR effectively captures the noisiness or smoothness of the signal: noisy or unvoiced segments typically exhibit higher ZCR, while voiced and more periodic regions yield lower values. ZCR has a clear association with spectral content: higher ZCR indicates richer high-frequency components, while lower values align with more periodic, low-frequency sounds. ZCR is widely used in voice activity detection, voiced/unvoiced frame classification, and even as an excitation level indicator in emotion recognition systems. Combined with features like energy and MFCCs, ZCR enhances spectral representations by providing insights into speech articulation dynamics and intensity fluctuations.
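A sketch of a frame-wise ZCR computation, both manually (counting sign changes) and with the Librosa helper; the window sizes and file name are illustrative.

import numpy as np
import librosa

def zcr_frame(frame):
    # Fraction of adjacent sample pairs whose signs differ (zero crossings).
    return np.mean(np.abs(np.diff(np.sign(frame))) > 0)

y, sr = librosa.load("lecture_sample.wav", sr=22050)
zcr = librosa.feature.zero_crossing_rate(y, frame_length=2048, hop_length=512)[0]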
Figure 6 shows the ZCR visualization without silence removal (Figure 6(a)) and with silence removal (Figure 6(b)), each using a dual-plot layout: the top plot displays the raw audio waveform (amplitude vs. time), while the bottom plot shows the short-time ZCR over the same time axis. Peaks in the ZCR curve correspond to rapid sign changes, common during unvoiced sounds or noisy segments, while troughs align with voiced regions where the waveform oscillates smoothly and crosses zero less often. This visual alignment allows researchers to immediately identify voiced/unvoiced segments and associate sudden fluctuations with phonetic or emotional cues. Because ZCR is calculated per frame (e.g., 10–30 ms windows), the contour's temporal resolution effectively highlights dynamic speech features critical for emotion detection and voice activity tasks.
Figure 6. The difference in ZCR: (a) without silence removal and (b) with silence removal
Furthermore, in Figure 6, the ZCR contour is often depicted alongside a horizontal threshold line that classifies frames as voiced or unvoiced. Frames whose ZCR exceeds the threshold are marked as unvoiced (typically shown in one color), while those below are considered voiced (shown in another). This threshold-based segmentation is validated by prior work demonstrating that unvoiced segments generally exhibit higher ZCR and lower energy compared to voiced segments, where ZCRs are low and energies are high. Such delineation enables automated voice activity detection and helps the model focus on emotionally rich voiced regions. Moreover, the sharp contrast in ZCR trends between voiced and unvoiced regions offers visual cues about changes in speech excitation: peaks in the ZCR curve often align with phonetic transitions or bursts, which can be critical indicators of emotional states or emphatic speech patterns.
2.4. Model architecture
Experiments were conducted with several types of models, including CNN-1D, LSTM, bidirectional long short-term memory (Bi-LSTM), combinations of CNN-1D and LSTM, and CNN-1D and Bi-LSTM. Each model was tested with the 8 feature extraction results from the previous stage, resulting in 40 model scenarios. The summary of model types and their respective layer compositions is presented in Table 2.
In this model, the frame size is derived from the audio's frame duration multiplied by the sample rate. At a standard sample rate of 22,050 Hz, using n_fft=2048 results in each frame spanning approximately 93 ms, as Librosa applies an FFT window of that size by default. Meanwhile, a hop length of 512 samples is employed, which leads to approximately 75% overlap between successive frames. In practical terms, this configuration yields analysis shifts of roughly 23 ms per frame, promoting smoother temporal transitions during feature extraction.
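These durations follow directly from dividing the window and hop sizes by the sample rate:

\frac{2048}{22050\ \text{Hz}} \approx 92.9\ \text{ms}, \qquad
\frac{512}{22050\ \text{Hz}} \approx 23.2\ \text{ms}, \qquad
1 - \frac{512}{2048} = 75\%\ \text{overlap}.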
The dataset is divided into 80% for training and 20% for testing (test_size=0.2). Training is performed for up to 50 epochs, with early stopping enabled via EarlyStopping(monitor='val_accuracy', patience=5) to halt training if validation accuracy does not improve over five consecutive epochs. We utilize the Adam optimizer, combined with the categorical cross-entropy loss function and an initial learning rate of 0.001. The model is trained using a batch size of 32. Activation functions include the rectified linear unit (ReLU) in the convolutional and dense hidden layers, and Softmax in the output layer for multi-class classification.
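A sketch of this training configuration in Keras; the model object and the training/validation arrays are assumed placeholders, corresponding to one of the architectures in Table 2 and the data split described in section 2.5.

from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam

early_stop = EarlyStopping(monitor="val_accuracy", patience=5)

model.compile(optimizer=Adam(learning_rate=0.001),
              loss="categorical_crossentropy",
              metrics=["accuracy"])

history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=50, batch_size=32,
                    callbacks=[early_stop])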
Table 2. The summary of model types

Architecture        Layers
CNN-1D              - Conv1D(filters=x1, kernel=3, ReLU)
                    - MaxPooling1D(pool=2)
                    - Conv1D(filters=x2, kernel=3, ReLU)
                    - MaxPooling1D(pool=2)
                    - Flatten
                    - Dense(units=x3, ReLU)
                    - Dense(units=3, Softmax)
LSTM                - LSTM(x1 units, tanh, return_sequences=True)
                    - LSTM(x2 units, tanh)
                    - Dense(x3 units, ReLU)
                    - Dense(3 units, Softmax)
CNN-1D + LSTM       - Conv1D(128, kernel=5, ReLU) + MaxPooling
                    - Conv1D(64, kernel=5, ReLU) + MaxPooling
                    - Dropout(0.3)
                    - LSTM(128, return_sequences=True)
                    - LSTM(64)
                    - Dense(32, ReLU)
                    - Dense(3, Softmax)
CNN-1D + Bi-LSTM    - Conv1D(32, kernel=3, ReLU, L2=1e-4), BatchNorm, MaxPool(2), Dropout(0.3)
                    - Conv1D(64, kernel=3, ReLU, L2=1e-4), BatchNorm, MaxPool(2), Dropout(0.3)
                    - Bi-LSTM(128, return_sequences=True, L2=1e-4), Dropout(0.3)
                    - Bi-LSTM(64, L2=1e-4), Dropout(0.3)
                    - Dense(output, Softmax) with L1=1e-5, L2=1e-4 regularization
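A sketch of the CNN-1D + Bi-LSTM row of Table 2 in Keras; the per-frame input dimension (here 15, i.e., 13 MFCCs plus energy and ZCR) is an assumption for illustration.

from tensorflow.keras import Sequential, regularizers
from tensorflow.keras.layers import (Conv1D, BatchNormalization, MaxPooling1D,
                                     Dropout, Bidirectional, LSTM, Dense)

l2 = regularizers.l2(1e-4)
model = Sequential([
    Conv1D(32, 3, activation="relu", kernel_regularizer=l2, input_shape=(None, 15)),
    BatchNormalization(), MaxPooling1D(2), Dropout(0.3),
    Conv1D(64, 3, activation="relu", kernel_regularizer=l2),
    BatchNormalization(), MaxPooling1D(2), Dropout(0.3),
    Bidirectional(LSTM(128, return_sequences=True, kernel_regularizer=l2)),
    Dropout(0.3),
    Bidirectional(LSTM(64, kernel_regularizer=l2)),
    Dropout(0.3),
    # Output layer: 3 emotion classes, with L1=1e-5 and L2=1e-4 regularization.
    Dense(3, activation="softmax",
          kernel_regularizer=regularizers.l1_l2(l1=1e-5, l2=1e-4)),
])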
2.5. Experimental result
In this study, the dataset was partitioned into three subsets: 80% for training, 10% for validation, and 10% for testing. This stratified split not only ensures the model has ample data to learn underlying patterns but also provides a robust framework for evaluation. The validation set is used during training to monitor overfitting, tune hyperparameters, and guide early stopping, while the test set remains unseen until the very end to offer an unbiased measure of generalization performance. Adopting this split ratio aligns with standard machine learning practices, where an 80/10/10 partition is widely recommended to maintain representative class distributions and avoid biased estimates. Moreover, stratified sampling was applied to preserve the proportional representation of each emotion class across all subsets, which prevents class imbalance from skewing model performance evaluations.
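A sketch of such a stratified 80/10/10 partition using scikit-learn, carried out as two successive splits; the feature matrix X, the label vector y, and the random seed are illustrative placeholders.

from sklearn.model_selection import train_test_split

# First hold out 20% of the data, stratifying on the emotion labels.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
# Split that 20% evenly into validation and test sets (10% of the total each).
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)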
2.6. Implementation result
The process of evaluating lecturer performance through SER starts with analyzing all audio data obtained from lecture recordings. The audio data undergoes consistent preprocessing to ensure accurate and reliable results. SER typically follows a structured pipeline. It begins by capturing and cleaning the raw audio signal, removing noise and dividing it into short, overlapping frames, often using pre-emphasis, endpoint detection, and framing techniques. Next, for each frame, acoustic features are extracted, which often include hand-crafted descriptors like MFCCs, pitch, energy, ZCR, spectral coefficients, or more advanced formant and wavelet features. In modern systems, these handcrafted features might be enhanced with deep representations (e.g., embeddings from wav2vec or HuBERT), sometimes using multi-stream fusion architectures to capture complementary information. The resulting features are then fed into classifiers, ranging from traditional models