Indonesian Journal of Electrical Engineering and Computer Science
Vol. 37, No. 3, March 2025, pp. 2044~2057
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v37.i3.pp2044-2057
BanSpEmo: a Bangla audio dataset for speech emotion recognition and its baseline evaluation
Babe Sultana (1,2), Md Gulzar Hussain (3,4), Mahmuda Rahman (1,5)
(1) Department of CSE, Faculty of Science and Engineering, Green University of Bangladesh, Dhaka, Bangladesh
(2) Department of CSE, Faculty of Science and Engineering, United International University, Dhaka, Bangladesh
(3) School of Software, Nanjing University of Information Science and Technology, Nanjing, Jiangsu, China
(4) School of Computer Science and Artificial Intelligence, Changzhou University, Changzhou, Jiangsu, China
(5) Department of ICT, Mohammadpur Preparatory School and College, Dhaka, Bangladesh
Article Info
Article history: Received Jun 11, 2024. Revised Sep 28, 2024. Accepted Oct 7, 2024.
Keywords: Audio dataset, Bangla SER, Emotion classification, Machine learning, Speech emotion
ABSTRACT
Speech interfaces provide a natural and comfortable way for humans to communicate with machines. Recognizing emotions from acoustic signals is essential in audio and speech processing. Detection of emotion in speech is critical to the next generation of human-computer interaction (HCI) fields. However, a lack of large-scale datasets has hampered the progress of relevant research. In this study, we prepare BANSpEmo, a demanding Bangla speech emotion dataset consisting of 792 audio recordings totaling more than 1 hour and 23 minutes. The recordings feature 22 native speakers, and each speaker uttered two sets of sentences representing six emotions: disgust, happiness, anger, sadness, surprise, and fear. The dataset consists of 12 Bangla sentences, each expressed in these six emotions. Furthermore, a series of experiments is carried out to assess the baseline performance of the support vector machine (SVM), logistic regression (LR), and multinomial Naive Bayes models on the BANSpEmo dataset presented in this study. The experiments found that SVM performed best on this dataset, with an accuracy of 87.18%.

This is an open access article under the CC BY-SA license.
Corresponding Author:
Md Gulzar Hussain
School of Software, Nanjing University of Information Science and Technology
Nanjing, Jiangsu, China
Email: gulzar.ace@gmail.com
1.
INTRODUCTION
Speech is an essential and preferred way of communication for people. It is an important means of conveying emotions and plays a significant role in human-machine interactions. Speech emotion recognition (SER) research has received significant attention over the last few years due to its application in remote patient monitoring systems, robotics, the psychological assessment of people, and many more [1]. While tremendous progress has been achieved in SER for widely used languages like English and Mandarin, there is still a significant deficit in resources and research committed to less commonly studied languages. Bangla, spoken by about 250 million people globally [2], is one of the under-resourced languages in the field of SER. A significant number of studies have been conducted on textual data in the Bangla language for emotion and sentiment analysis, such as analyzing basic emotions [3], sentiment analysis in Bangla-English code-mixed text [4], [5], and emotion classification [6]. These efforts have greatly enhanced research understanding of and insights into the Bangla language in the textual data domain.
Journal homepage: http://ijeecs.iaescore.com
Detection of emotions from speech is a growing research topic due to its importance in the community, society, and commercial domains. In the realm of speech recognition (SR) and natural language processing (NLP), an extensive range of speech corpora has been created for multiple languages. Although there has been a lot of research on SER for various languages, such as English [7], [8], Urdu [9], Chinese [10], Italian [11], and others [12], [13], there have been just a few efforts at developing SER datasets for Bangla. Table 1 shows some previously developed speech emotion recognition datasets and their limitations.
Table 1. Comparison of some previous speech emotion datasets in Bangla and other languages

| Article | Year | Language | Contributions | Limitations |
|---|---|---|---|---|
| Paper [14] | 2023 | Bangla | A speech emotion dataset named KBES is developed with 900 recordings. | Number of recordings is limited and gender balance is not considered. |
| Paper [9] | 2022 | Urdu | The first Urdu speech emotion dataset is developed with 2,500 recordings. | In the dataset, the disgust emotion is hard to distinguish, and the accuracy for the disgust emotion is low. |
| Paper [15] | 2022 | Bangla | A speech emotion dataset named SUBESCO is developed with 7000 recordings. | Purely neutral sentences are uttered with different emotions, which is difficult to express. |
| Paper [16] | 2022 | Bangla | A speech emotion dataset named BanglaSER is developed with 1467 recordings. | Number of sentences for uttering and the number of recordings are limited. |
| Paper [17] | 2022 | Bangla | A speech emotion dataset is developed with 452 recordings. | Number of recordings is limited, the annotation process is not explained, and gender balance is not considered. |
| Paper [7] | 2021 | English | A new speech emotion dataset named LSSED is developed. | Number of data is limited, only 820. Also, the total length is only about 20 minutes. |
| Paper [18] | 2021 | Bangla | A speech emotion dataset named ABEG is developed. | Dataset is not publicly available, has only 3 classes, and the annotation process is not clear. |
| Paper [13] | 2020 | Spanish, Portuguese, German, French | A speech emotion dataset named CMU-MOSEAS is developed with 40,000 multi-modal samples. | Data samples are not balanced for the 4 languages. Gender balance is also not considered. |
| Paper [8] | 2018 | English | A new speech emotion dataset named RAVDESS is developed with 7356 recordings. | Has limited lexical variability due to the inclusion of only two statements. |
| Paper [12] | 2016 | English | A speech emotion dataset named EmoReact is developed with 1102 audio-visual clips. | Gender effects are not considered and the dataset is not gender balanced. Also, the number of clips is limited. |
| Paper [10] | 2006 | Mandarin | A new speech emotion dataset named MASC is developed with 25,636 recordings. | The dataset is not gender balanced, as it contains the recordings of 23 female and 45 male Chinese speakers. Also, the traditional speaker verification and identification systems are limited for the dataset. |
From Table 1 it can be observed that there have been few efforts to create datasets for SER in the Bangla language. Dhar and Guha [17] created a dataset designated as ABEG. They employed three emotional states: angry, happy, and neutral. There was no further description of their dataset, and the data is not accessible to the public. A team of academicians prepared a small discrete corpus of 160 sentences to test the speech-emotion identification system they proposed [18]. This dataset has 20 individuals who represented emotions such as happy, angry, sad, and neutral, and no perceptual evaluation was available for it. At present, just three corpora for the task of emotion detection from speech in the Bangla language are publicly accessible: SUBESCO [19], BanglaSER [16], and KBES [14]. Using these openly accessible Bangla language datasets, multiple studies have been published illustrating how to detect emotions in Bangla speech employing machine learning and deep learning methods. Different ensemble learning approaches are compared in multiple trials to show that they outperform typical machine learning techniques. The research findings show that ensemble learning
approaches can reach an accuracy of 84.37%, which is achieved by utilizing the bootstrap aggregation and voting method. Sultana et al. [15] used the SUBESCO and RAVDESS [8] datasets to undertake cross-lingual investigations involving cross-dataset training, multi-dataset training, and transfer learning in English and Bangla. The suggested model demonstrated state-of-the-art performance, with a weighted accuracy (WA) of 86.9% for SUBESCO and 82.7% for RAVDESS. Hassan et al. [20] combine a one-dimensional convolutional neural network (CNN) and a long short-term memory (LSTM) framework to create a fully connected network for SER, comparing the performance on these two datasets. Islam et al. [21] combine transformed features from three separate methodologies (chroma short-time Fourier transform, short-time Fourier transform (STFT), and mel-frequency cepstral coefficient (MFCC)) and feed them into a three-dimensional CNN block to extract the features. The outputs are then processed by a bidirectional LSTM layer to classify Bangla speech emotions. In the article by Sultana and Rahman [22], the researchers employed the grid search method with five-fold cross-validation to determine the best parameters for the support vector machine (SVM), random forest, and XGBoost algorithms. They discovered that choosing the most important features enabled machine learning models to achieve high levels of accuracy, equivalent to deep learning models. A recent study by Aziz et al. [23] presents a CNN-based approach for SER in Bengali, using MFCC features and data augmentation approaches. This method produced accuracies of 90% on the SUBESCO and 78% on the BanglaSER datasets.
Research on the detection of emotions in cross-lingual speech has demonstrated that systems trained on a single-language dataset often perform poorly when evaluated on a corpus in a different language, yielding lower accuracy than monolingual recognition. This performance gap emphasizes the importance of language-specific datasets for accurate emotion recognition. Over the past few years, there has been extensive investigation into SER in various languages [24], [25]. Although a limited number of natural speech corpora [26], [27] and verified recorded emotional speech corpora [14], [16], [19] have been published for the Bangla language, relevant linguistic resources for recognizing emotions are still inadequate. The SER system uses various approaches to classify and analyze audio files to find embedded emotions. The initial stage in its improvement is to generate a dataset for the targeted language, which is one of the main goals of this research.
The following is a summary of this work's main contributions:
− This research introduces BanSpEmo, a needed diversified Bangla dataset for emotion recognition from voice. It comprises 12 distinct sentences uttered by 22 native speakers to represent six desired emotions. The total duration is 1 hour and 23 minutes.
− This dataset enables more comprehensive simulations of real-world scenarios by increasing the lexical and sentence variability, allowing machine learning techniques and deep neural networks to grasp their patterns better.
− Using the BanSpEmo dataset, this study compares the performance of three well-known algorithms, logistic regression (LR), SVM, and multinomial Naive Bayes, for Bangla voice emotion classification.
− This research also investigates these algorithms against a few well-known audio features to evaluate their efficiency in classifying emotions in Bangla speech. The results characterize each algorithm's performance and provide useful information about their usability for SER tasks in the Bangla language.
A detailed description of the dataset is provided in section 2. The proposed research framework is explained in section 3. Section 4 presents the performance analysis of the machine learning algorithms applied to this framework and discusses the research findings. An overall discussion and insights into the future directions of this work are provided in section 5.
2.
CORPUS
DESCRIPTION
Speech is one of the modes of communication on various online platforms, such as Facebook and YouTube, where emotions are frequently conveyed. In this context, creating a speech dataset for the Bangla language is a significant contribution. The dataset we have prepared, named BANSpEmo [28], is publicly available and constitutes the main part of our research. As a low-resource language, Bangla has limited speech datasets. BANSpEmo marks the fourth audio dataset developed for SER in Bangla.
2.1.
The
experimental
setup
The BANSpEmo dataset includes 792 voice recordings, capturing six fundamental emotional reactions across two sets of sentences, each with six sentences. The voices were recorded using a smartphone's recording application, a microphone, and a laptop. To create the dataset, we used our university's dedicated research lab, which was not entirely soundproof like a professional audio recording studio, but it was essentially noiseless. We took extra care to eliminate any background noise, including human or other ambient sounds, and kept the recording environment as controlled as possible by implementing stringent noise-reduction measures. With these precautions, we were able to record audio that was both consistent and clear enough for our study. Each recording, lasting 5-6 seconds on average per emotion, had noise removed using Audacity software. Additionally, we used WavePad Sound Editor software for further editing. The tools required for making the dataset are summarized below:
− Microphone: BOYA BY-BM3011 compact shotgun microphone.
− Sound editor: WavePad Sound Editor.
− Audio noise remover: Audacity software.
2.2. Corpus creation process
The speakers naturally conveyed the emotional states, ensuring that the recordings were not merely read aloud. The emotions represented are happiness, disgust, sadness, anger, surprise, and fear. This dataset focuses on data collected from individuals aged between 20 and 25. The corpus comprises voice recordings from 22 speakers, with an equal distribution of 11 males and 11 females. The duration of the recordings varies between 3 and 12 seconds, influenced by the length of the sentences and the time the speaker takes. While the numbers of male and female speakers are equal overall, neither sentence set individually reflects this balance. With 6 sentences × 1 repetition × 6 emotions × 18 speakers and 6 sentences × 1 repetition × 6 emotions × 4 speakers, the total number of recordings is 792 utterances. The complete audio dataset spans a total of 1 hr, 23 mins, and 12 secs. Table 2 presents a summary of the dataset.
Table 2. Dataset description table

| Property | Value |
|---|---|
| Type of dataset | Performed, scripted |
| Type of file | Audio |
| Language | Bangla |
| Gender | Male and female |
| Data format | Waveform Audio File Format (WAV) |
| Number of groups | 2 |
| Number of sentences per group | 6 |
| States of emotion | Happiness, disgust, sadness, anger, surprise, and fear |
| Total number of statements | 12 |
| Total number of audio tapes | 792 |
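The recording count and total duration reported for the corpus can be cross-checked with a few lines of arithmetic. The following is a minimal sketch that only uses the figures stated in this section; the variable names are illustrative and not part of the dataset release.

```python
# Figures taken from the corpus description in this section.
SENTENCES_PER_SET = 6
REPETITIONS = 1
EMOTIONS = 6
SPEAKERS_PER_SET = (18, 4)  # speakers who recorded each of the two sentence sets

# Recordings contributed by each sentence set, then the corpus total.
recordings_per_set = [
    SENTENCES_PER_SET * REPETITIONS * EMOTIONS * n_speakers
    for n_speakers in SPEAKERS_PER_SET
]
total_recordings = sum(recordings_per_set)

# Reported total duration: 1 hr, 23 mins, 12 secs.
total_seconds = 1 * 3600 + 23 * 60 + 12

# Mean duration per recording implied by the two totals above.
mean_duration_seconds = total_seconds / total_recordings

print(recordings_per_set, total_recordings, round(mean_duration_seconds, 2))
```

The two sets contribute 648 and 144 recordings, which sum to the stated 792, and the stated total duration corresponds to a mean of roughly 6.3 seconds per recording.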
2.3.
Details
of
sentences
We selected 12 sentences to ensure diversity, as these sentences are typically used to express different emotions. We trained our speakers to deliver each selected sentence with six emotions to prepare our audio dataset. A wide range of emotional expressions, such as happiness, sadness, anger, fear, surprise, and disgust, were carefully considered when crafting each sentence. By doing this, we hope to build a solid dataset that will be useful for a range of speech emotion recognition and affective computing applications. The speakers underwent comprehensive training to guarantee uniformity and precision in their emotive communication. The chosen Bangla sentences and their English meanings are shown in Table 3. Table 4 presents a comparison of existing freely accessible SER datasets in this domain alongside our work, BANSpEmo. Given the volume of audio recordings and the total duration, this collection ranks as the fourth-largest emotional speech database in the Bangla language. Despite its relatively small size compared to other datasets, and with participation levels being fairly typical, the key strength of this dataset is its broad range of sentence variations. This lexical and sentence diversity enhances the ability to capture diverse emotional expressions in different forms in Bangla speech. Essentially, we chose a wide variety of sentences to explore how different expressions of the same emotion can be introduced in Bangla speech. To systematize
future training standards, we divided our BANSpEmo dataset into training and test sets. Initially, we shuffled all samples and then allocated 20% to the test set, leaving 80% for the training set. We made sure that the distribution of each emotion class in the training and test sets was consistent or, at the very least, similar.
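The 80/20 split with matching class distributions described above can be sketched as a stratified shuffle split. This is an illustrative implementation using only the Python standard library, not the authors' code; the function name `stratified_split`, the fixed seed, and the assumption of one label per recording (132 per emotion) are hypothetical choices made for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so each class keeps about the same share in both sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # shuffle within the class before cutting
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    # Shuffle again so the classes are interleaved in each set.
    rng.shuffle(train_idx)
    rng.shuffle(test_idx)
    return train_idx, test_idx


# Hypothetical label list: 792 recordings, balanced over six emotions.
EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]
labels = [emotion for emotion in EMOTIONS for _ in range(132)]

train_idx, test_idx = stratified_split(labels)
test_counts = {e: sum(1 for i in test_idx if labels[i] == e) for e in EMOTIONS}
```

Splitting within each class first, rather than shuffling the whole pool and cutting once, is what guarantees the class proportions match; a plain shuffle would only match them approximately.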
Table 3. The selected Bangla text and their English meaning

| SL. | Bangla sentence | English meaning |
|---|---|---|
| 1 | কিছু তথ্য সঠিক ভাবে উপস্থাপন করা দরকার, বার বার একই ভুল করে চলেছে সংবাদ মাধ্যম গুলি! | Some information needs to be conveyed appropriately, and the media is making similar mistakes repeatedly! |
| 2 | আপনার ব্যবহার তো চমৎকার। মুখের ভাষা ও অনেক সুন্দর। | Your behavior is wonderful. Your words are also pleasant. |
| 3 | এর পরিপ্রেক্ষিতে শিক্ষকদের স্বার্থ সংশ্লিষ্ট শিক্ষক সমিতির মধ্য থেকে কোনো ধরনের ভূমিকা পরিলক্ষিত না হওয়ায় আমি ভীষন ভাবে উদ্বিগ্ন। | In this regard, no role has been observed from the teachers' associations regarding the interest of teachers, which has made me deeply concerned. |
| 4 | আমার একটা ব্যাপার মাথায় ধরে না, "ইলিশ বাঁচাও" স্লোগান মুখরিত মিডিয়া কেন এবং কি কারণে "ইলিশের বাসস্থান (নদী) বাঁচাও" স্লোগান নিয়ে মাতে না? | Why the slogan "Save the habitat (river) of Hilsa" rather than "Save the Hilsa" is being avoided by the media baffles me. |
| 5 | দেশ কি মধ্যম আয়ের দেশে রূপান্তর হচ্ছে নাকি মগের মুলুকের দেশে পরিনত হচ্ছে? | Is the country turning into a middle-income country or a country of chaos? |
| 6 | আমি একমাত্র সরকারি কোন কাজে আঙ্গুলের চাপ দিতে রাজি আছি, শিক্ষিত ব্যক্তি আঙ্গুলের চাপ দেয় না। | I agree to have my fingerprints used for government purposes, but reasonable people might not. |
| 7 | তগো মনে কতো প্রেম রে! জীবনে একটা করছি তাতেই জ্বলে পুড়ে শেষ। | You are bursting with love! I once tried to embrace it, but I got burned. |
| 8 | আজকের ম্যাচ ভারতকে হারাতে চাই টাইগার বাংলাদেশ সাবাস সাকিব আল হাসান। | To defeat India in today's match, we need the tiger of Bangladesh. Well done, Shakib Al Hasan! |
| 9 | টাইটানিক জাহাজ ডুবে গেছে আর বাংলাদেশ ও ডুবে যাবে। | The Titanic has sunk, and Bangladesh will sink too. |
| 10 | যদি প্রশ্ন ভুল হয় তাহলে পরীক্ষা নেবার কি দরকার? সবাইকে গড়ে প্লাস দিয়ে দিবে। | If the questions are incorrect, what's the sense of taking the exam? Simply give everyone the A+ grade. |
| 11 | যদি খায় পানতা ইলিশ জুতা দিয়ে তার গালটা কর মালিশ। | He should be punished for making extravagant expenses during the price hike of hilsa. |
| 12 | যে জাতি পঁচা ভাত খেয়ে বছর শুরু করে, এরা উন্নতি লাভ করবে কি করে! | A nation that starts the year by eating spoiled rice, how will they ever progress! |
Table 4. A comparison between publicly available Bangla language SER corpora and BANSpEmo

| Description | SUBESCO | BanglaSER | KBES | BANSpEmo |
|---|---|---|---|---|
| Audio clips | 7000 | 1467 | 900 | 792 |
| Emotions | 7 | 5 | 9 | 6 |
| Sentences | 10 | 3 | N/A | 12 |
| Participants | 20 | 34 | 35 | 22 |
| Trained actors | Yes | No | Yes | No |
| Sampling rate | 48 kHz | 44.1 kHz | 48 kHz | 44.1 kHz |
| Class equilibrium | Yes | Yes | Yes | Yes |
| Gender equilibrium | Yes | Yes | Yes | Yes |
3.
METHOD
Figure 1 presents our proposed system architecture. The collected raw data underwent a thorough cleaning and preprocessing stage, after which mel-frequency cepstral coefficients (MFCCs), spectrogram, zero crossing rate (ZCR), root-mean-square energy (RMSE), and chroma were used as feature extraction techniques. We applied three well-known machine learning algorithms, namely support vector machine, logistic regression, and multinomial Naive Bayes, to provide a comparative performance evaluation.
3.1. Data cleaning and preprocessing
To augment the dataset, each audio file is divided into three segments. In the data preprocessing and cleaning phase, every audio file undergoes trimming, in which we remove portions where no voice is detected. These portions typically correspond to pauses or moments when the speaker takes a breath. Additionally, we standardized the sampling rate of each split audio to 44.1 kHz to ensure uniformity across all instances. Subsequently, features are extracted from each trimmed audio.
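The trimming and segmentation steps can be illustrated with a small energy-threshold sketch. The paper does not state how silence was detected or in which order trimming and splitting were applied, so the frame length, the amplitude threshold, the trim-then-split order, and both function names below are assumptions made purely for illustration.

```python
def drop_silent_frames(samples, frame_len=4, threshold=0.05):
    """Keep only frames whose peak absolute amplitude reaches the threshold (assumed rule)."""
    voiced = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        if max(abs(s) for s in frame) >= threshold:
            voiced.extend(frame)
    return voiced


def split_into_segments(samples, n_segments=3):
    """Split a signal into n contiguous segments of near-equal length."""
    base, extra = divmod(len(samples), n_segments)
    segments, start = [], 0
    for i in range(n_segments):
        end = start + base + (1 if i < extra else 0)
        segments.append(samples[start:end])
        start = end
    return segments


# Toy signal: silence, speech, a pause, then more speech.
signal = [
    0.0, 0.01, 0.0, 0.0,    # leading silence
    0.4, -0.5, 0.3, -0.2,   # voiced
    0.0, 0.0, 0.02, 0.0,    # pause
    0.6, -0.4, 0.5, -0.3,   # voiced
    0.2, -0.1,              # voiced tail
]

voiced = drop_silent_frames(signal)
segments = split_into_segments(voiced)
```

In practice the frame length would cover tens of milliseconds of audio at 44.1 kHz rather than four samples; the tiny values here only keep the example easy to trace by hand.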
Figure 1. The suggested system's flow architecture
3.2. Feature extraction
3.2.1. Mel-frequency cepstral coefficients
MFCCs are a set of coefficients that represent the short-term power spectrum of a voice signal and are broadly used in voice and audio processing. They form a condensed set of features, typically around 10 to 20. They serve as valuable features for machine learning models due to their ability to succinctly capture the key attributes of an audio signal while also reducing its dimensionality. In our feature extraction process, we calculated 20 MFCCs using the 'librosa.feature.mfcc()' Python function. Figure 2 illustrates the MFCC features for the "Happy Emotion".
Figure 2. Sample MFCCs visualization for happy emotion
3.2.2. Spectrogram
A spectrogram is a visual representation of how the frequency components of a signal evolve over time. It holds significant utility in signal processing, audio examination, and diverse scientific domains. Spectrograms depict the changes in signal frequencies across time, facilitating the examination and visualization of the evolving frequency characteristics of audio or other time-based signals. Figure 3 depicts a spectrogram visualization illustrating the signal's loudness over time across various frequencies in a specific waveform, for the "Happy Emotion".
Figure 3. Sample spectrogram visualization for happy emotion
3.2.3. Zero crossing rate
In the domains of signal processing, audio analysis, and speech recognition, the ZCR is a frequently employed characteristic. It quantifies the rate at which a signal changes its polarity or crosses the zero amplitude line within a specified timeframe, essentially gauging how often a signal's waveform crosses the zero point. ZCR is formally defined as (1). Figure 4 illustrates the ZCR visualization for the "Happy Emotion", which portrays the rate at which the signal transitions either from negative to zero to positive or from positive to zero to negative.

$$ zcr = \frac{1}{T-1} \sum_{t=1}^{T-1} \mathbb{1}_{\mathbb{R}_{<0}}\left(s_t \, s_{t-1}\right) \tag{1} $$

Here, $s$ is a signal of length $T$, and $\mathbb{1}_{\mathbb{R}_{<0}}$ is the indicator function, which equals 1 when its argument is negative and 0 otherwise.

Figure 4. Sample ZCR visualization for happy emotion
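As a concrete reading of equation (1), the sketch below counts the adjacent sample pairs whose product is negative and divides by the number of pairs. It is a plain-Python illustration rather than the implementation used in the study; the function name and the toy signal are assumptions.

```python
def zero_crossing_rate(signal):
    """Fraction of adjacent sample pairs with opposite signs, as in equation (1)."""
    T = len(signal)
    if T < 2:
        raise ValueError("need at least two samples")
    # The indicator is 1 exactly when s_t * s_{t-1} is negative.
    crossings = sum(1 for t in range(1, T) if signal[t] * signal[t - 1] < 0)
    return crossings / (T - 1)


# Toy signal with three sign changes across five adjacent pairs.
example = [0.5, -0.2, -0.4, 0.3, 0.1, -0.6]
zcr = zero_crossing_rate(example)
```

Because the product test is strict, a sample that is exactly zero does not count as a crossing under this definition.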
3.2.4. Root-mean-square energy
In signal processing and diverse domains, RMSE is a mathematical metric employed to assess the energy level within a signal. It offers a means to characterize the amplitude or intensity of a signal within a defined time segment. The RMSE is defined as follows:

$$ RMSE = \sqrt{\frac{1}{N} \sum_{n} \left| x(n) \right|^{2}} \tag{2} $$

Here,
− RMSE is the root-mean-square energy.
− N is the number of samples in the time window.
− x(n) represents the signal samples.
In this context, N = 44,100 and x(n) = 204,800. The RMSE value provides insight into the signal's energy and amplitude within the specified time frame, and Figure 5 depicts the visualization for "Happy Emotion".
Figure 5. Sample RMSE visualization for happy emotion
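Equation (2) can be evaluated directly on a window of samples. The following sketch is illustrative only: the helper `framewise_rms`, its frame and hop lengths, and the toy signal are assumptions and do not reflect the windowing parameters used in the study.

```python
import math


def rms_energy(frame):
    """Root-mean-square energy of one window of samples, following equation (2)."""
    if not frame:
        raise ValueError("frame must not be empty")
    return math.sqrt(sum(abs(x) ** 2 for x in frame) / len(frame))


def framewise_rms(signal, frame_len, hop_len):
    """RMS energy of successive windows (assumed framing, for illustration)."""
    last_start = len(signal) - frame_len
    return [
        rms_energy(signal[start:start + frame_len])
        for start in range(0, last_start + 1, hop_len)
    ]


example = [3.0, -4.0, 3.0, -4.0]
energy = rms_energy(example)          # sqrt((9 + 16 + 9 + 16) / 4)
per_frame = framewise_rms(example, frame_len=2, hop_len=2)
```

Computing the value per window, rather than once for the whole recording, gives an energy contour over time of the kind that a visualization such as Figure 5 would plot.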
3.2.5. Chroma
The chroma feature is a compact representation that conveys the tonal characteristics of a musical audio signal. Chroma features are derived from the chromagram representation of audio signals. These features encompass the chroma vector (depicting the intensity of each pitch class), chroma energy (the summation of squared chroma values), and chroma cross-correlation (which quantifies the similarity between chroma vectors). Figure 6 illustrates the chroma representation for the "Happy Emotion".
Figure 6. Sample chroma visualization for happy emotion
3.3.
Classifier
model
Following the dataset cleaning, preprocessing, and feature extraction stages, we experimented with numerous machine learning algorithms. However, this research specifically focuses on three of them: SVM, LR, and multinomial Naive Bayes. These three algorithms were chosen because they achieved the best accuracy on this particular dataset.
3.3.1.
Support
vector
machine
The SVM stands out as a potent and adaptable machine learning algorithm extensively utilized in binary and multi-class classification tasks as well as regression. SVM functions by identifying the optimal hyperplane within the feature space, effectively maximizing the margin between distinct classes and facilitating accurate classification. One way to represent the decision function of a linear multi-class SVM is:

$$ f(x) = \arg\max_{c} \left( w_c \cdot x + b_c \right) \tag{3} $$

Here,
− $f(x)$ is the decision function.
− $c$ represents the different classes.
− $w_c$ is the weight vector of class $c$.
− $x$ is the input feature vector.
− $b_c$ is the bias term of class $c$.
In this configuration, the class with the highest score from the decision function is chosen as the predicted class for an input sample. This technique is frequently used in multi-class SVM settings, where different binary classifiers are trained using one of two strategies: one-vs-one (OvO) or one-vs-rest (OvR). We have employed the OvR technique in our implementation. The final class assignment is subsequently determined by taking into account the outputs of each binary classifier's decision function.
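The one-vs-rest decision rule in equation (3) amounts to scoring the input against every class and taking the arg max. The sketch below shows only this prediction step with hand-picked weights; the weights, biases, class names, and the function name `ovr_predict` are hypothetical, and training the per-class hyperplanes is outside its scope.

```python
def ovr_predict(x, weights, biases):
    """Return the class with the highest linear score w_c . x + b_c, plus all scores."""
    scores = {
        c: sum(w_i * x_i for w_i, x_i in zip(weights[c], x)) + biases[c]
        for c in weights
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical per-class weight vectors and biases for a 2-dimensional feature space.
weights = {
    "anger": [1.0, -0.5],
    "happiness": [-0.2, 0.8],
    "sadness": [-0.6, -0.4],
}
biases = {"anger": 0.0, "happiness": 0.1, "sadness": 0.2}

label, scores = ovr_predict([0.5, 1.0], weights, biases)
```

With these illustrative numbers the scores are 0.0, 0.8, and -0.5 for anger, happiness, and sadness respectively, so the rule selects happiness.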
3.3.2. Logistic regression
LR is a popular machine learning method that was initially created for binary classification. To handle the multi-class classification problem in this work, we extended LR with the OvR technique, which trains one binary classifier per class. In this way, LR can be used effectively in scenarios with more than two classes, providing useful information about the likelihood that each class is the correct prediction. The OvR approach simplifies the adaptation of LR to multi-class tasks, making it a versatile solution for various classification problems. The classifier for class $c$ models the probability that an input $z$ belongs to class $c$ in the following way:

$$ P(y = c \mid z) = \frac{1}{1 + e^{-\left( w_c^{T} z + b_c \right)}} \tag{4} $$

Here,
− $P(y = c \mid z)$ is the probability that the output $y$ belongs to class $c$.
− For class $c$, $w_c$ is the weight vector and $b_c$ is the bias term.
− $z$ is the vector of input features.
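Equation (4) can be applied class by class to obtain one-vs-rest probabilities. The following is a minimal sketch of the prediction step only, with made-up weights; the names `sigmoid` and `ovr_probabilities` and all numeric values are assumptions for illustration.

```python
import math


def sigmoid(a):
    """Logistic function 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))


def ovr_probabilities(z, weights, biases):
    """Per-class probability P(y = c | z) from equation (4), one binary model per class."""
    return {
        c: sigmoid(sum(w * v for w, v in zip(weights[c], z)) + biases[c])
        for c in weights
    }


# Hypothetical parameters for two of the six emotion classes.
weights = {"fear": [2.0, -1.0], "surprise": [-1.0, 1.0]}
biases = {"fear": 0.0, "surprise": 0.0}

probs = ovr_probabilities([1.0, 2.0], weights, biases)
predicted = max(probs, key=probs.get)
```

Each binary model is evaluated independently, so the one-vs-rest probabilities need not sum to one; the predicted class is simply the one with the largest value.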
3.3.3.
Multinomial
Naive
Bayes
Among the probabilistic classification techniques designed for discrete feature scenarios is the multinomial Naive Bayes, which is widely applied to text classification problems. Building on the foundations of Bayes' theorem, this method assumes that the features, given the class label, exhibit conditional independence. Multinomial Naive Bayes extends its usefulness to a multi-class classification scenario by using Bayes' theorem to determine the probability that an instance will be assigned to each class. The posterior probability of class $x$ given the features $f_1, f_2, \ldots, f_n$ can be written as:

$$ P(X = x \mid f_1, f_2, \ldots, f_n) \propto P(X = x) \prod_{i=1}^{n} P(f_i \mid X = x) \tag{5} $$

Here,
− $X$ is the class variable.
− $f_1, f_2, \ldots, f_n$ are the feature variables.
− $P(X = x \mid f_1, f_2, \ldots, f_n)$ is the posterior probability of class $x$ given the features.
− $P(X = x)$ is the prior probability of class $x$.
− $P(f_i \mid X = x)$ is the conditional probability of feature $f_i$ given class $x$.
These probabilities are computed using the training dataset during the training step. During the prediction step, the algorithm then computes the posterior probabilities for every class, identifying the class with the maximum probability as the predicted class for the specified set of features. The algorithm's suitability for multi-class classification problems can be ascribed to its effectiveness, its ease of handling high-dimensional data, and its simplicity.
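To make equation (5) concrete, the sketch below trains and applies a small multinomial Naive Bayes model on count-valued features, working in log space to avoid numerical underflow. It is an illustration under stated assumptions rather than the study's implementation: the Laplace smoothing parameter `alpha`, the function names, and the toy data are all hypothetical.

```python
import math


def train_mnb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log feature probabilities per class."""
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, log_likelihood = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Laplace smoothing (assumed) keeps unseen features from zeroing the product.
        counts = [sum(row[i] for row in rows) + alpha for i in range(n_features)]
        total = sum(counts)
        log_likelihood[c] = [math.log(k / total) for k in counts]
    return log_prior, log_likelihood


def predict_mnb(x, log_prior, log_likelihood):
    """Pick the class maximizing log P(X = c) + sum_i x_i * log P(f_i | X = c)."""
    scores = {
        c: log_prior[c] + sum(x_i * ll for x_i, ll in zip(x, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get), scores


# Toy non-negative count features for two classes.
X = [[3, 0, 1], [2, 1, 0], [0, 3, 2], [1, 2, 3]]
y = ["anger", "anger", "sadness", "sadness"]

log_prior, log_likelihood = train_mnb(X, y)
label_a, _ = predict_mnb([2, 0, 0], log_prior, log_likelihood)
label_b, _ = predict_mnb([0, 2, 2], log_prior, log_likelihood)
```

The multinomial model assumes non-negative feature values, so real-valued audio features that can be negative would need to be rescaled to a non-negative range first; how this was handled in the study is not stated.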
4. PERFORMANCE EVALUATION
This section compares the outcomes of several machine learning algorithms and presents their performance analyses. It discusses the environmental setup, evaluation metrics analysis of different machine learning models, performance comparison with some previous works, confusion matrix analysis, and receiver operating characteristic (ROC) curve analysis.
4.1. Environmental setup
− Operating system: Windows 10 64 bit
− Processor: Intel(R) Core(TM) i5-4300M CPU @ 2.60 GHz
− RAM: 8 GB
− IDE: Google Colab
− Programming language: Python
4.2.
Result
analysis
and
discussion
Our goal in this section is to present a thorough analysis and discussion of the findings from the assessment metrics we used in our study. We have also looked at ROC curves, which serve as an effective and clear visual aid for illustrating the classifier's accuracy.
4.2.1.
Evaluation
metrics
analysis
We employ a range of standard performance assessment metrics to evaluate and contrast the effectiveness of different classifiers. Our comparative analysis evaluates the relative performance of classifiers using accuracy, precision, recall, and F1 scores. Accuracy is computed by comparing the predicted label of each instance with the ground-truth label, but its limitations are acknowledged, as certain samples may introduce bias. Therefore, we also incorporate precision, recall, and F1 measures to provide a more comprehensive evaluation. We experimented with various machine learning algorithms and eventually selected three (SVM, LR, and multinomial Naive Bayes) due to their commendable performance in this context. Among these, SVM exhibited the highest accuracy at 87.18%, followed by LR at 84.45%, and multinomial Naive Bayes at 82.77%. In Tables 5-7, we present the individual precision, recall, and F1 scores for all emotions considered in our
research. From these tables, it is evident that SVM attains the highest weighted average values for precision, recall, and F1 score, with 0.87, 0.87, and 0.86, respectively. In contrast, LR yields weighted average precision, recall, and F1 scores of 0.85, 0.84, and 0.84, respectively. For multinomial Naive Bayes, the corresponding values are 0.83, 0.83, and 0.82.
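The per-class and weighted-average metrics discussed above follow directly from a confusion matrix. The sketch below shows one way to compute them; the two-class confusion matrix is hypothetical and is not taken from the study, while the final line recomputes an F1 score from a precision and recall pair reported for LR on the anger class (0.85 and 0.92).

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def per_class_metrics(confusion, labels):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    metrics = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        predicted = sum(row[k] for row in confusion)  # column sum
        actual = sum(confusion[k])                    # row sum (support)
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        metrics[label] = {"precision": p, "recall": r, "f1": f1_score(p, r), "support": actual}
    return metrics


def weighted_average(metrics, key):
    """Average of a metric over classes, weighted by class support."""
    total = sum(m["support"] for m in metrics.values())
    return sum(m[key] * m["support"] for m in metrics.values()) / total


# Hypothetical confusion matrix for two classes (illustration only).
labels = ["anger", "sad"]
confusion = [[9, 1], [3, 7]]

metrics = per_class_metrics(confusion, labels)
accuracy = sum(confusion[i][i] for i in range(len(labels))) / sum(map(sum, confusion))
weighted_recall = weighted_average(metrics, "recall")

# F1 implied by precision 0.85 and recall 0.92.
anger_f1_lr = f1_score(0.85, 0.92)
```

The support-weighted recall always equals the overall accuracy, which is consistent with the weighted recall values and accuracies reported for each classifier agreeing to two decimal places.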
Table 5. Outcomes of SVM-based precision, recall, F1-score, and accuracy in six distinct categories (emotions)

| Category | Precision | Recall | F1-score | Accuracy (%) |
|---|---|---|---|---|
| Anger | 0.92 | 0.93 | 0.92 | 87.18 |
| Disgust | 0.82 | 0.88 | 0.85 | |
| Fear | 0.85 | 0.87 | 0.86 | |
| Happy | 0.88 | 0.93 | 0.90 | |
| Sad | 0.91 | 0.82 | 0.86 | |
| Surprised | 0.82 | 0.66 | 0.73 | |

Table 6. Outcomes of LR-based precision, recall, F1-score, and accuracy in six distinct categories (emotions)

| Category | Precision | Recall | F1-score | Accuracy (%) |
|---|---|---|---|---|
| Anger | 0.85 | 0.92 | 0.88 | 84.45 |
| Disgust | 0.89 | 0.89 | 0.89 | |
| Fear | 0.85 | 0.87 | 0.86 | |
| Happy | 0.78 | 0.83 | 0.81 | |
| Sad | 0.82 | 0.81 | 0.81 | |
| Surprised | 0.89 | 0.60 | 0.71 | |

Table 7. Outcomes of multinomial Naive Bayes-based precision, recall, F1-score, and accuracy in six distinct categories (emotions)

| Category | Precision | Recall | F1-score | Accuracy (%) |
|---|---|---|---|---|
| Anger | 0.82 | 0.91 | 0.86 | 82.77 |
| Disgust | 0.84 | 0.87 | 0.85 | |
| Fear | 0.81 | 0.87 | 0.84 | |
| Happy | 0.86 | 0.85 | 0.86 | |
| Sad | 0.80 | 0.78 | 0.79 | |
| Surprised | 0.82 | 0.51 | 0.63 | |
4.2.2. Performance comparison with relevant Bangla datasets
Table 8 compares this work with previous studies that have focused on Bangla SER. Hassan et al. [20] primarily utilized two datasets: one in English, named RAVDESS, and another, SUBESCO, for Bangla. Their proposed model, which integrated a 1D CNN with a fully convolutional network (FCN) layer, achieved 98.30% accuracy on the RAVDESS dataset and 98.97% on the SUBESCO dataset.
Table 8. Comparison with related works that used Bangla speech emotion datasets

| Article | Dataset | Feature extraction techniques | Classifier | Accuracy |
|---|---|---|---|---|
| Paper [20] | SUBESCO | MFCC, ZCR, Mel-Spectrogram, Root Mean Square, etc. | 1D CNN + FCN layers | 98.97% |
| Paper [15] | SUBESCO | CNN + TDF layer | DCTFB | 86.9% |
| Paper [21] | SUBESCO | MFCCs + STFT + Chroma STFT | 4CNN + TDF + Bi-LSTM | 89.57% |
| Paper [29] | KBES | MFCC, STFT, Chroma STFT, CNN | TDF layer, Bi-LSTM, LSTM | 71.67% |
| Paper [30] | SUBESCO, BanglaSER | CNN | KNN, AdaBoost, Bi-LSTM | 90% |
| This work | BanSpEmo | MFCC, Spectrogram, ZCR, RMSE | SVM, LR, MNB | 87.18% |
Using the SUBESCO dataset, paper [15] utilized CNN and TDF features with a DCTFB classifier and achieved an accuracy of 86.9%. Additionally, Islam et al. [21] used 3D CNN and bidirectional long short-term memory (Bi-LSTM) networks as models while working with the SUBESCO dataset. They achieved an accuracy