Indonesian Journal of Electrical Engineering and Computer Science
Vol. 40, No. 2, November 2025, pp. 640∼653
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v40.i2.pp640-653
Laryngeal pathology detection using EMD-based voice acoustic features analysis and SVM-RBF
Sofiane Cherif¹, Abdelhad Kaddour¹, Abdelmoudjib Benkada², Said Karoui², Ouissem Chibani Bahi¹, Asmaa Bouzid Daho¹
¹Laboratory of Signals, Systems and Data (LSSD), Department of Electronic, Faculty of Electrical Engineering, University of Sciences and Technology of Oran Mohamed Boudiaf (USTO-MB), Oran, Algeria
²Laboratory of Intelligent Systems Research (LARESI), Department of Electronics, Faculty of Electrical Engineering, University of Sciences and Technology of Oran Mohamed Boudiaf (USTO-MB), Oran, Algeria
Article Info

Article history:
Received Sep 6, 2024
Revised Jul 22, 2025
Accepted Oct 14, 2025

Keywords:
Acoustic features
EMD
Laryngeal pathology
SVM
Voice analysis
ABSTRACT

Traditional techniques for detecting laryngeal pathologies, such as laryngoscopy and endoscopy, are costly and invasive. This study presents a novel approach for detecting laryngeal disorders using empirical mode decomposition (EMD)-based acoustic feature analysis and a support vector machine (SVM) with a radial basis function (RBF) kernel. The experiments were conducted using the Saarbrücken voice database (SVD). The voice signals were decomposed using EMD to extract the intrinsic mode functions (IMFs). The IMF with the highest energy value was selected as the most relevant. A set of acoustic features, including mel-frequency cepstral coefficients (MFCCs), linear predictive cepstral coefficients (LPCCs), pitch (fundamental frequency), higher-order statistics (HOSs), zero-crossing rate (ZCR), spectral centroid (SC), and spectral roll-off (SRO), is derived from the most relevant IMFs and fed into an SVM classifier to differentiate between healthy and pathological voices. Experimental results demonstrate the effectiveness of the proposed methodology, achieving a high classification accuracy of 94.5%, a sensitivity of 94.2%, a specificity of 95.3%, and an F1 score of 96.1%, outperforming conventional approaches. These results highlight the potential of EMD-based voice analysis as a non-invasive and reliable tool for early diagnosis of laryngeal disorders.

This is an open access article under the CC BY-SA license.
Corresponding Author:
Sofiane Cherif
Laboratory of Signals, Systems and Data (LSSD), Department of Electronic, Faculty of Electrical Engineering
University of Sciences and Technology of Oran Mohamed Boudiaf (USTO-MB)
P.O. Box 1505, El Mnaouar, 31000 Oran, Algeria
Email: sofiane.cherif@univ-usto.dz

Journal homepage: http://ijeecs.iaescore.com
1. INTRODUCTION
Speech production is a vital function of the vocal tract system, enabling the creation of speech sounds. Impaired voice production can significantly impact an individual's quality of life. Speech pathologists assess impairments affecting communication, language, and voice [1]. The human voice plays a crucial role in facilitating communication and social interaction. However, improper voice use can lead to various problems. Approximately 25% of the world's population suffers from voice disorders [2], which are often caused by conditions affecting the larynx and vocal cords, known as laryngeal pathologies [3]. Conventional diagnostic techniques, such as stroboscopy and laryngoscopy, are commonly used but can cause patients discomfort. Non-invasive methods, such as electroglottography (EGG) and self-assessment, offer alternatives but require
specialist expertise for accurate analysis [4], [5].
To address these challenges and enhance the accuracy of voice disorder detection, researchers have developed various models that extract vocal characteristics, such as mel-frequency cepstral coefficients (MFCCs) and linear predictive cepstral coefficients (LPCCs). These models utilize large voice databases, such as the Saarbrücken voice database (SVD), and employ advanced classification techniques, including support vector machines (SVMs), Gaussian mixture models (GMMs), and universal background model Gaussian mixture models (GMM-UBM). Advances in artificial intelligence and machine learning have significantly improved the efficiency of these classification algorithms, enabling more precise and non-invasive detection of laryngeal pathologies [6].
Various innovative approaches, particularly those leveraging deep learning techniques, have achieved significant advancements in voice disorder detection. Alhussein and Muhammad [7] developed a system for detecting speech disorders using deep learning techniques. They trained their model on the SVD dataset and evaluated it using the Massachusetts eye and ear infirmary voice disorders database (MEEI). The visual geometry group-16 (VGG16) and CaffeNet algorithms achieved 94.5% and 94.1% accuracy rates, respectively. Leveraging deep convolutional neural networks (CNNs) further improved the accuracy to 97.5%.
Hammami [8] proposed a technique that utilizes wavelet coefficients to classify vocal disorders. Their analysis was based on sustained vowel recordings of the sound /a/ from the SVD dataset. Through experiments with various GMMs, they found that incorporating the Teager energy operator and using 32 Gaussian mixtures yielded an accuracy of 96.66%. Conversely, when combining three feature vectors, the accuracy dropped to 92.22%.
Fang et al. [9] utilized a large set of features, including 430 basic acoustic features (BAFs), 84 cepstral coefficients based on the mel S-transform (MSCCs, mel S-transform cepstrum coefficients), and 12 chaotic features. Feature optimization was conducted using radar charts and the F-score, reducing the feature dimensionality from 526 to 96 dimensions for the NKI-CCRT corpus and 104 dimensions for the SVD corpus. These optimized features were fed into an SVM classifier to detect voice disorders. However, their approach achieved only 84.4% accuracy on the NKI-CCRT database and 78.7% on the SVD database.
Al-Dhief et al. [10] suggested a way to extract MFCC features from the SVD database and use them with the online sequential extreme learning machine (OS-ELM) classifier. The approach achieved a maximum accuracy of 91.17%, recall of 91%, F-measure of 87%, G-mean of 87.55%, and specificity of 97.67%.
Ribas et al. [11] developed a model based on deep neural networks (DNNs) to differentiate between healthy and pathological voices. The model achieved maximum accuracy rates of 80.71% for sentences and 82.8% for vowels (/a/, /i/, /u/). The authors utilized the automatic voice disorder detection (AVDD) system with self-supervised representations to extract distinctive auditory features. They incorporated a feedforward layer with a class-token transformer to consolidate temporal feature sequences. The researchers augmented the training dataset with out-of-scope data to address data availability concerns. Experimental results demonstrated a classification accuracy of 93.36%, representing significant improvements of 4.1% without data augmentation and 15.62% with data augmentation. Using self-supervised (SS) representations in AVDD resulted in an accuracy rate of 90% [11].
Lee [12] employed deep learning techniques to classify voice samples, specifically using feedforward neural networks (FNNs) and CNNs. Their study found that, utilizing the LPCCs, the CNN classifier achieved a maximum accuracy of 82.69% for the vowel /a/ in male subjects.
Ding et al. [13] utilized voice signal analysis to develop a method for the early diagnosis and treatment of voice disorders. They also introduced a novel computer-aided assessment approach for pathological voice classification (CS-PVC), specifically designed to distinguish between pathological and healthy voices in areas with significant discrepancies. The model achieved an identification accuracy of 81.6% on the SVD dataset and 82.2% on the self-built Shenzhen People's Hospital voice database (SZUPD).
Javanmardi et al. [14] conducted a comparative analysis of various data augmentation (DA) techniques for vocal pathology detection, evaluating three temporal methods (noise addition, pitch shifting, and time stretching), one time-frequency technique (SpecAugment), and two vocoder-based approaches (modifying the harmonic-to-noise ratio (HNR) and the glottal pulse length). The extracted features include static and dynamic MFCCs, the spectrogram, and the mel-spectrogram, which were then fed into machine learning models (SVM and random forest) and deep learning models (long short-term memory (LSTM) and CNN). The best performance, achieved with a 2D CNN, reached an accuracy of 80% on the SVD database [14].
Albadr et al. [15] improved the detection and classification of voice pathologies (VP) using a fast-learning network (FLN) classifier based on MFCC features. Their study comprised two phases: the first phase
analyzed vocal samples of sustained vowels (/a/, /i/, and /u/) along with spoken phrases. In contrast, the second phase focused on vocal samples from three common voice disorders (paralysis, polyps, and cysts) using the vowel /a/ spoken in a neutral tone. The experimental results achieved an accuracy of 84.64%, a precision of 97.39%, a recall of 86.05%, an F-measure of 86.80%, a G-mean of 86.81%, and a specificity of 88.24%.
According to the literature, traditional methods for identifying laryngeal pathologies rely on vocal signal analysis. However, they have several limitations, particularly the lack of proper pre-processing of voice datasets. Researchers often extract features directly and classify them using a limited number of samples, making it challenging to eliminate residual noise in the reconstructed signal. This leads to oscillations that distort mode decomposition. Additionally, these approaches hinder the systematic evaluation of extracted parameters. To address these issues, we propose a novel method, described in section 2, to improve the detection of laryngeal disorders from speech signals.
signals.
This
article
is
structured
as
follo
ws:
section
2
presents
the
proposed
frame
w
ork,
detailing
the
mate
rials
and
methodologies
used
in
this
study
,
encompassing
both
theoretical
and
practical
aspects.
Section
3
pro
vides
an
in-depth
discussion
of
the
results,
e
v
aluating
the
ef
fecti
v
eness
of
the
proposed
method
in
detecting
laryngeal
issues.
Finally
,
section
4
concludes
with
k
e
y
ndings
and
suggests
potential
directions
for
future
research
on
diagnosing
laryngeal
pathologies.
2. METHOD
Figure 1 presents the block diagram illustrating the proposed methodology for the accurate and unbiased diagnosis of laryngeal pathologies. This methodology consists of four key steps: silence removal, low-pass filtering, normalization, and empirical mode decomposition (EMD). This method decomposes the vocal signal into IMFs, representing its harmonic components. The most relevant IMFs are selected based on their maximum temporal energy and are framed into short segments (0.1-second duration with 0.01-second overlap) for analysis. Each frame is then multiplied by a Hamming window to minimize discontinuities at the beginning and end of the signal, thereby enhancing the accuracy of the frequency analysis. The Hamming window is the same length as the frame.
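The framing and windowing step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the signal is assumed to be a mono recording held in a plain Python list, and `frame_signal` and `hamming` are hypothetical helper names.

```python
import math

def hamming(n):
    # Hamming window of length n: w[i] = 0.54 - 0.46*cos(2*pi*i/(n-1))
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def frame_signal(x, fs, frame_dur=0.1, overlap_dur=0.01):
    # Cut x into frames of frame_dur seconds, with consecutive frames
    # sharing overlap_dur seconds, then apply a Hamming window of the
    # same length as the frame (as the method requires).
    frame_len = int(frame_dur * fs)
    hop = frame_len - int(overlap_dur * fs)
    w = hamming(frame_len)
    return [[s * wi for s, wi in zip(x[start:start + frame_len], w)]
            for start in range(0, len(x) - frame_len + 1, hop)]
```

Each returned frame is already windowed, so it can be passed directly to the feature-extraction stage.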
Figure 1. Block diagram illustrating the proposed methodology
Afterward, we extract seven features: pitch (fundamental frequency), spectral roll-off (SRO), spectral centroid (SC), zero-crossing rate (ZCR), higher-order statistics (HOSs), LPCCs, and MFCCs. Finally, each extracted feature serves as input for an SVM-RBF classifier, enhancing the accuracy of laryngeal pathology diagnosis. The originality of this study lies in integrating voice signal pre-processing and empirical mode decomposition to extract acoustic features. The main contributions of this study are as follows:
- Developing a non-invasive, low-cost method for the detection of laryngeal pathologies
- Experimental validation of the effectiveness of the proposed system using the SVD database
- Using more advanced voice signal pre-processing methods, including different feature extraction and classification algorithms, to make diagnosing laryngeal pathology much more reliable and accurate.
2.1. Database
This study utilized the SVD database, an online repository containing over 2,000 audio files featuring three distinct vowel sounds: /a/, /i/, and /u/. Each file has a duration ranging from 1 to 4 seconds and is sampled at a frequency of 50 kHz with a 16-bit resolution. For analysis, we selected vocal signals of the sustained neutral vowel /a/ from a group of 200 healthy males and 91 males with pathological conditions. The pathology subset includes recordings from four specific conditions: 50 cases of laryngitis, 19 cases of vocal cord cancer, 5 cases of Reinke's edema, and 17 cases of vocal cord polyps.
2.2. Vocal signals preprocessing
Before using vocal signals in speech-processing applications, performing pre-processing tasks such as zero-mean normalization, amplitude normalization, low-pass filtering, and silence removal is important. Subtracting the mean from a signal centers it around zero, making the average of all the signal samples equal to zero. This process is commonly used to prepare data for machine learning algorithms. The signal is then scaled by dividing each sample by the maximum absolute value. This ensures that the signal's peak is normalized to 1 if the peak is positive or -1 if the peak is negative. We applied a low-pass filter with a cutoff frequency of 1 kHz to isolate the relevant low-frequency components and remove unwanted high-frequency components. Silence removal refers to detecting and removing periods of silence in a signal while maintaining its timing. This method uses an energy threshold to identify silent periods. In this study, the threshold was set at 2% of the maximum energy level. Any segment with energy below this threshold was considered silent. Vocal signals primarily contain energy at lower frequencies, while non-vocal signals typically have higher frequencies [16].
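The normalization and energy-threshold silence removal described above can be sketched as follows. This is a simplified illustration: the low-pass filtering stage is omitted, the segment length `seg_len` is an arbitrary illustrative value, and `preprocess` is a hypothetical function name.

```python
def preprocess(x, seg_len=80, silence_ratio=0.02):
    n = len(x)
    # Zero-mean normalization: center the signal around zero.
    mu = sum(x) / n
    x = [s - mu for s in x]
    # Amplitude normalization: scale so the peak reaches +1 or -1.
    peak = max(abs(s) for s in x)
    x = [s / peak for s in x]
    # Silence removal: split into short segments and drop any segment
    # whose energy falls below 2% of the maximum segment energy.
    segs = [x[i:i + seg_len] for i in range(0, n, seg_len)]
    energies = [sum(s * s for s in seg) for seg in segs]
    thr = silence_ratio * max(energies)
    return [s for seg, e in zip(segs, energies) if e >= thr for s in seg]
```

Dropping whole low-energy segments (rather than individual samples) is what shortens the signal, as seen in Figure 2(c).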
As illustrated in Figure 2, we present the preprocessing steps applied to the vocal signal of speaker 563 from the SVD database to improve clarity. Figure 2(a) shows the voice signal 114-a-n.wav after the application of low-pass filtering and normalization. The signal is centered around zero, reflecting the attenuation of high-frequency components and the standardization of the amplitude scale. Figure 2(b) displays a 10,000-sample excerpt of the same signal, corresponding to a duration of 0.2 seconds, to facilitate visual observation. This excerpt allows for a more detailed analysis of the waveform of the preprocessed signal, enabling a localized examination of its acoustic content. Figure 2(c) illustrates the voice signal after silence removal (7,893 samples, corresponding to a duration of 0.1579 seconds). The reduced signal length highlights the effective elimination of silent segments.
Figure 2. Preprocessing of the voice signal: (a) low-pass filtered and normalized voice signal 114-a-n.wav, (b) 10,000-sample excerpt of the low-pass filtered and normalized voice signal, and (c) voice signal after silence removal (7,893 samples corresponding to 0.1579-second duration)
2.3. Empirical mode decomposition
Many researchers have used EMD to process vocal signals due to its excellent performance with this specific type of signal [17]-[20]. To detect the presence of voice in a non-stationary speech signal, we applied EMD to decompose it into a sequence of oscillatory patterns known as IMFs and a residual component, as shown in (1).
x(n) = r_k(n) + \sum_{i=1}^{k} IMF_i(n)    (1)
where x(n) is the digitized voice signal, n represents the sample index, k is the number of IMFs extracted, and r_k(n) is the residual.
We incorporated the stopping condition proposed by Huang et al. [17] for the sifting procedure. This criterion limits the standard deviation (SD) between two consecutive sifting results, typically between 0.2 and 0.3. For an IMF to be considered genuine, it must satisfy two criteria: the difference between the number of zero crossings and the number of extrema must not exceed one, and the average value of the envelope formed by the local maxima and minima must be zero.
Figure 3 illustrates the decomposition process as well as the criteria used to identify the most relevant IMFs, summarizing the key steps of our method. It highlights both the decomposition procedure and the steps used to extract acoustic information from the most significant components. The IMFs, shown in Figure 3(a), are obtained through an iterative sifting process, which involves the following steps:
i) Determine all extrema (local maxima and minima) of the signal x(t).
ii) Estimate the values of the minima and maxima using cubic spline interpolation, creating the lower envelope e_min(t) and the upper envelope e_max(t).
iii) Determine the envelope's mean by applying the following formula:

m_1(t) = \frac{e_{max}(t) + e_{min}(t)}{2}    (2)
iv) Calculate the IMF candidate as the difference between the x(t) and m_1(t) signals:

x(t) - m_1(t) = h_1(t)    (3)
v) If h_1(t) is an IMF, it is defined as the first IMF component of x(t). Otherwise, h_1(t) is treated as the original signal.
vi) Iterate the preceding steps, treating h_1(t) as the new x(t), and obtain h_11(t). If h_11(t) is an IMF, stop the process. Otherwise, continue iterating.
After the decomposition, we identified the IMF with the highest energy value as the most relevant IMF. The energy is calculated using (4).

E_k = \sum_{n=1}^{N} [IMF_k(n)]^2    (4)
where E_k is the energy of the k-th IMF, N is the length of the signal, and IMF_k(n) is the value of the k-th IMF at sample n.
The relevant IMF obtained (Figure 3(b)) is segmented into 0.1-second intervals and then multiplied by a Hamming window of the same length (Figure 3(c)) to extract acoustic features.
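The sifting steps i)-vi) and the energy-based mode selection of (4) can be sketched as follows. This is a simplified, illustrative implementation: it uses piecewise-linear envelopes in place of the cubic splines used in the paper, an aggregate SD ratio as the stopping criterion, and hypothetical function names (`sift`, `emd`, `most_relevant_imf`).

```python
def local_extrema(x):
    # Indices of strict local maxima and minima (step i).
    maxima, minima = [], []
    for i in range(1, len(x) - 1):
        if x[i] > x[i - 1] and x[i] > x[i + 1]:
            maxima.append(i)
        elif x[i] < x[i - 1] and x[i] < x[i + 1]:
            minima.append(i)
    return maxima, minima

def envelope(idx, vals, n):
    # Piecewise-linear envelope through (idx, vals), extended over [0, n).
    # (The paper uses cubic splines; linear keeps this sketch short.)
    if idx[0] != 0:
        idx, vals = [0] + idx, [vals[0]] + vals
    if idx[-1] != n - 1:
        idx, vals = idx + [n - 1], vals + [vals[-1]]
    out, j = [], 0
    for i in range(n):
        while idx[j + 1] < i:
            j += 1
        t = (i - idx[j]) / (idx[j + 1] - idx[j])
        out.append(vals[j] * (1 - t) + vals[j + 1] * t)
    return out

def sift(x, sd_stop=0.25, max_iter=30):
    # Steps ii)-vi): subtract the mean envelope until the SD criterion
    # (typically 0.2-0.3, after Huang et al. [17]) is satisfied.
    h = list(x)
    for _ in range(max_iter):
        maxima, minima = local_extrema(h)
        if len(maxima) < 2 or len(minima) < 2:
            break
        emax = envelope(maxima, [h[i] for i in maxima], len(h))
        emin = envelope(minima, [h[i] for i in minima], len(h))
        new = [hi - (a + b) / 2 for hi, a, b in zip(h, emax, emin)]
        sd = sum((a - b) ** 2 for a, b in zip(h, new)) / (sum(v * v for v in h) + 1e-12)
        h = new
        if sd < sd_stop:
            break
    return h

def emd(x, n_imfs=4):
    # Repeated sifting on the residual yields Eq. (1): x = r + sum of IMFs.
    imfs, r = [], list(x)
    for _ in range(n_imfs):
        maxima, minima = local_extrema(r)
        if len(maxima) < 2 or len(minima) < 2:
            break  # residual has become monotonic
        imf = sift(r)
        imfs.append(imf)
        r = [a - b for a, b in zip(r, imf)]
    return imfs, r

def most_relevant_imf(imfs):
    # Eq. (4): E_k = sum_n IMF_k(n)^2; keep the highest-energy mode.
    energies = [sum(v * v for v in imf) for imf in imfs]
    return imfs[energies.index(max(energies))]
```

The selected mode is then framed and windowed exactly as described above before feature extraction.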
2.4. Feature extraction
2.4.1. Mel-frequency cepstral coefficients
MFCCs are extensively utilized features in speech and audio processing. They represent the short-term power spectrum of an auditory input, emulating human speech perception. MFCCs are crucial for identifying vocal abnormalities in the vocal domain [21], [22]. Figure 4 illustrates the steps involved in computing MFCCs. The pre-emphasis step enhances high frequencies to balance the spectrum. The fast Fourier transform (FFT) then converts the time-domain signal into a frequency spectrum. Subsequently, a mel filter bank is applied
to map frequencies onto the mel scale, which aligns with human auditory perception. Finally, the amplitudes are converted to a logarithmic scale (similar to human perception) and subjected to a discrete cosine transform (DCT), extracting the most relevant MFCCs for classifying laryngeal diseases.
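The MFCC pipeline (pre-emphasis, spectrum, mel filter bank, log, DCT) can be sketched as follows. This is an illustrative single-frame implementation, not the authors' code: it uses a direct DFT instead of an FFT for brevity, and the filter count and pre-emphasis factor are common default values assumed here.

```python
import math

def mfcc(frame, fs, n_filters=20, n_coeffs=14, alpha=0.97):
    n = len(frame)
    # 1) Pre-emphasis: y[i] = x[i] - alpha*x[i-1] boosts high frequencies.
    y = [frame[0]] + [frame[i] - alpha * frame[i - 1] for i in range(1, n)]
    # 2) Magnitude spectrum (direct DFT here; an FFT is used in practice).
    spec = []
    for k in range(n // 2 + 1):
        re = sum(y[i] * math.cos(2 * math.pi * k * i / n) for i in range(n))
        im = sum(y[i] * math.sin(2 * math.pi * k * i / n) for i in range(n))
        spec.append(math.hypot(re, im))
    # 3) Triangular filters evenly spaced on the mel scale.
    mel = lambda f: 2595 * math.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    step = mel(fs / 2) / (n_filters + 1)
    bins = [int(imel(i * step) * n / fs) for i in range(n_filters + 2)]
    logbank = []
    for f in range(n_filters):
        lo, ce, hi = bins[f], bins[f + 1], bins[f + 2]
        e = 0.0
        for k in range(lo, hi):
            if k < ce and ce > lo:
                w = (k - lo) / (ce - lo)     # rising edge of the triangle
            elif hi > ce:
                w = (hi - k) / (hi - ce)     # falling edge
            else:
                w = 0.0
            e += w * spec[k]
        logbank.append(math.log(e + 1e-10))  # 4) logarithmic amplitudes
    # 5) DCT-II of the log energies; keep the first n_coeffs MFCCs.
    return [sum(logbank[j] * math.cos(math.pi * q * (j + 0.5) / n_filters)
                for j in range(n_filters)) for q in range(n_coeffs)]
```

In this study, 14 coefficients per frame are retained (see Table 1).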
Figure 3. Decomposition of the voice signal: (a) IMFs, (b) the relevant mode, and (c) the relevant mode multiplied by the Hamming window (0.1-second)
Figure 4. Steps to compute MFCCs
2.4.2. Linear predictive cepstral coefficients
LPCCs are an advanced signal processing technique used to estimate the source signal of vocal sounds. This method utilizes LPCCs (also referred to as CPLC) to perform a detailed analysis of the vocal signal. The primary goal of LPCCs is to model the signal's spectral envelope to extract its essential features. The vocal tract is an infinite impulse response (IIR) filter modeled through a recursive and graphical approach [23]. This modeling process is described in (5).

H(z) = \frac{G}{1 + \sum_{k=1}^{p} a_p(k) z^{-k}}    (5)
where p is the number of poles, G denotes the filter gain, and a_p(k) are the coefficients.
The extraction of LPCCs involves a series of sequential steps, as illustrated in Figure 5. First, the relevant signal segment, multiplied by a 0.1-second Hamming window, is modeled using a linear predictive model, which assumes that the current sample can be estimated as a linear combination of previous samples. The model coefficients are obtained by minimizing the prediction error. The autocorrelation function of the predicted signal is then computed to assess the similarity between different parts of the signal. Subsequently, the iterative Levinson-Durbin algorithm is employed to derive the linear prediction coefficients from the autocorrelation function. Finally, these coefficients are transformed into the cepstral domain by applying the discrete cosine transform (DCT) [24], [25].
Figure 5. Steps to compute LPCCs
2.4.3. Pitch
The fundamental frequency (F_0), often called pitch, is the frequency at which the vocal cords vibrate when producing voiced sounds. This frequency is a crucial indicator of laryngeal diseases. Several methods for calculating F_0 are described in the literature, including those based on autocorrelation, spectral analysis, and combinations of these techniques [20]. For our study, we chose the autocorrelation method, as defined by (6).
R[k] = \sum_{n=0}^{N-k-1} x[n] \cdot x[n+k]    (6)
where:
- R[k] represents the autocorrelation function at lag k,
- x[n] is the input signal at time n,
- k denotes the shift index (lag),
- N is the length of the signal.
The first peak (local maximum) in the autocorrelation function, after the peak at k = 0, corresponds to the fundamental period of the signal. The period T_0 is the distance between this peak and k = 0.
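The autocorrelation pitch estimate of (6) can be sketched as follows. This is a minimal illustration: the plausible pitch range (50-500 Hz) is an assumption added here to bound the lag search, and `pitch_autocorr` is a hypothetical function name.

```python
def pitch_autocorr(x, fs, fmin=50, fmax=500):
    # Eq. (6): R[k] = sum_n x[n]*x[n+k]. The lag of the strongest peak
    # after k = 0, searched over plausible pitch lags, gives the
    # fundamental period T0 in samples; F0 = fs / T0.
    kmin = max(1, int(fs / fmax))
    kmax = min(int(fs / fmin), len(x) - 1)
    best_k, best_r = kmin, float("-inf")
    for k in range(kmin, kmax + 1):
        r = sum(x[i] * x[i + k] for i in range(len(x) - k))
        if r > best_r:
            best_r, best_k = r, k
    return fs / best_k
```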
2.4.4. Higher-order statistics
Our work explicitly examined the HOS characteristics, focusing on the third-order moments (skewness) and fourth-order moments (kurtosis). One notable benefit of these HOS features is their compatibility with periodic and non-periodic signals. Skewness quantifies the lack of symmetry in a voice's probability distribution, whereas kurtosis measures the extent to which a distribution is flat and contains impulsive elements in a signal. These two statistics provide a valuable method for analyzing voice features and diagnosing laryngeal pathology, assessing data distribution, and identifying impulsive components. We compute the skewness and kurtosis using (7) and (8) in sequential order [26]-[28]:
\gamma_3 = \frac{\sum_{n=1}^{N} (x_n - \mu)^3}{(N-1)\sigma^3}    (7)

\gamma_4 = \frac{\sum_{n=1}^{N} (x_n - \mu)^4}{(N-1)\sigma^4}    (8)
where \gamma_3 and \gamma_4 denote the measures of skewness and kurtosis, respectively, N the number of samples, \mu the mean, and \sigma the SD.
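Equations (7) and (8) translate directly into code. The sketch below follows the paper's (N-1) normalization; `skewness_kurtosis` is a hypothetical helper name.

```python
def skewness_kurtosis(x):
    # Eqs. (7)-(8): third and fourth standardized moments with the
    # (N-1) normalization used in the paper.
    n = len(x)
    mu = sum(x) / n
    sd = (sum((v - mu) ** 2 for v in x) / (n - 1)) ** 0.5
    g3 = sum((v - mu) ** 3 for v in x) / ((n - 1) * sd ** 3)
    g4 = sum((v - mu) ** 4 for v in x) / ((n - 1) * sd ** 4)
    return g3, g4
```

A symmetric waveform such as a sine gives near-zero skewness, while impulsive disturbances in a pathological voice raise the kurtosis.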
2.4.5. Zero-crossing rate
The ZCR is a quantitative measure employed to assess the frequency characteristics of a signal. The term "sign change rate" refers to the frequency at which a signal changes its polarity within a specific time frame. More precisely, it counts the number of times the signal changes from positive to negative values (or vice versa) and then standardizes this tally by dividing it by the total duration of the frame. The following mathematical expression determines the zero-crossing rate:
Z_n = \frac{1}{w_l} \sum_{m=1}^{w_l} \left| \mathrm{sgn}[x_n(m)] - \mathrm{sgn}[x_n(m-1)] \right|    (9)
where the length of the frame is represented by w_l, the sample index within the frame by m, and the sign function by sgn:
\mathrm{sgn}[x_n(m)] = \begin{cases} 1 & \text{if } x_n(m) > 0, \\ 0 & \text{if } x_n(m) = 0, \\ -1 & \text{if } x_n(m) < 0. \end{cases}    (10)
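Equations (9) and (10) can be implemented as below. Note that under this definition each polarity change contributes |sgn - sgn| = 2 to the sum; `zcr` is a hypothetical helper name.

```python
def sgn(v):
    # Eq. (10): sign function.
    return 1 if v > 0 else (0 if v == 0 else -1)

def zcr(frame):
    # Eq. (9): sum of |sgn differences| normalized by the frame length w_l.
    wl = len(frame)
    return sum(abs(sgn(frame[m]) - sgn(frame[m - 1]))
               for m in range(1, wl)) / wl
```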
2.4.6. Spectral centroid
The spectral centroid is a crucial feature used to identify voice disorders. It represents the "center of gravity" of the spectrum and is computed using frequency and amplitude information derived from the Fourier transform [29], [30]. The spectral centroid indicates the frequency in Hertz (Hz) at which the spectral energy is balanced or evenly distributed. It is calculated as the weighted average of the frequencies contained in the signal, as expressed by (11).
\text{Spectral centroid} = \frac{\sum_{k=1}^{N} f_k \cdot S_k}{\sum_{k=1}^{N} S_k}    (11)
where N represents the number of spectral bins or frequencies, f_k is the frequency of the k-th spectral bin, and S_k denotes the amplitude of the k-th spectral bin.
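Equation (11) is a single weighted average over the magnitude spectrum. The sketch below assumes the spectrum covers 0 to fs/2 with linearly spaced bins; `spectral_centroid` is a hypothetical helper name.

```python
def spectral_centroid(spec, fs):
    # Eq. (11): amplitude-weighted mean of the bin frequencies f_k,
    # assuming bins linearly spaced from 0 to the Nyquist frequency fs/2.
    n = len(spec)
    freqs = [k * (fs / 2) / (n - 1) for k in range(n)]
    return sum(f * s for f, s in zip(freqs, spec)) / sum(spec)
```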
2.4.7. Spectral roll-off
The term "spectral roll-off" refers to a metric used to define a filter intended to decrease the amplitude of frequencies that fall outside a particular range. This technique is frequently used to reduce undesired frequencies in a transmission. It is a measure that identifies the frequency below which a specific percentage of the total energy in a spectrum is concentrated. The equation for SRO states that the spectral
energy accumulated up to the i-th bin is proportional to the total energy contained between the b_1 and b_2 bins, and it is typically expressed as follows [28]:
\text{Spectral roll-off}(i): \sum_{k=b_1}^{i} S_k = K \sum_{k=b_1}^{b_2} S_k    (12)
where S_k represents the spectral amplitude at the k-th frequency bin, b_1 and b_2 are the band edges over which the spectral spread is calculated, and K represents the percentage of total energy.
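Equation (12) can be evaluated by scanning the cumulative spectrum for the smallest bin that reaches the fraction K of the total. The sketch below assumes the full band (b_1 = 0, b_2 = N-1) and a common default of K = 0.85; `spectral_rolloff` is a hypothetical helper name.

```python
def spectral_rolloff(spec, fs, K=0.85):
    # Eq. (12): smallest bin i whose cumulative amplitude reaches the
    # fraction K of the total over the band (here the full band).
    total = sum(spec)
    acc = 0.0
    for i, s in enumerate(spec):
        acc += s
        if acc >= K * total:
            return i * (fs / 2) / (len(spec) - 1)  # bin index -> Hz
    return fs / 2
```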
2.5. Classification
Several techniques are available for classifying laryngeal disorders based on vocal signals, including CNNs, AlexNet, SVMs, random forests, K-nearest neighbors (KNN), decision trees, and deep neural networks (DNNs). Each algorithm offers distinct advantages, improving classification accuracy depending on the context and dataset [4], [31]. In our study, we selected an SVM with an RBF kernel. The SVM-RBF is a supervised learning model designed to construct an optimal hyperplane that separates data into two distinct classes. One of its key strengths lies in its deterministic nature, as it does not rely on probabilistic assumptions. Such an approach can lead to more consistent and interpretable results in specific applications. The SVM-RBF's goal is to find the hyperplane that maximizes the margin, that is, the distance between the hyperplane and the closest support vectors. This margin serves as a decision boundary that best differentiates the two classes. A wider margin typically improves the model's generalization capability, enabling it to more accurately classify new, unseen data. Additionally, the margin-based approach contributes to robustness by reducing the model's sensitivity to outliers and noise in the dataset [27].
To optimize the performance of the SVM-RBF model for our specific dataset, we conducted an exhaustive parameter search. In particular, we fine-tuned two crucial parameters: the kernel scale (γ) and the box constraint (C). The kernel scale regulates the impact of individual training samples on the configuration of the decision boundary, whereas the box constraint mediates the balance between optimizing the margin and reducing classification mistakes [32]. By carefully adjusting these parameters, we could regulate the complexity of the decision surface and enhance the model's effectiveness in classifying vocal signals associated with laryngeal disorders. The RBF kernel used in SVMs is mathematically defined as follows:
K(x_i, x_j) = e^{-\gamma \| x_i - x_j \|^2}    (13)
where:
- x_i and x_j are feature vectors in the input space,
- K(x_i, x_j) is the kernel function that computes the similarity between two data points x_i and x_j,
- \|x_i - x_j\| represents the Euclidean distance between the two data points x_i and x_j,
- \gamma is a parameter that controls the spread of the kernel.
A higher value of γ results in a narrower kernel, meaning that only points that are very close to each other will be considered similar. Conversely, a lower value of γ makes the kernel wider, considering more distant points as similar. The box constraint (C) is a regularization parameter that controls the trade-off between achieving a low training error and maintaining a simpler decision boundary. A higher value of C penalizes misclassifications more heavily, leading to a complex decision boundary that may overfit the data, whereas a lower C allows for more classification errors, promoting a simpler and more generalized model.
\min_{w, b, \epsilon} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{l} \epsilon_i    (14)

Subject to the constraints:

y_i(w^T x_i + b) \geq 1 - \epsilon_i, \quad \epsilon_i \geq 0, \quad i = 1, \ldots, l    (15)
where w denotes the normal vector defining the hyperplane, b represents the bias shifting the hyperplane, l is the total number of data points, \epsilon_i are the slack variables allowing for tolerance of classification errors, and y_i \in \{+1, -1\} is the class of the sample x_i.
We investigated the optimization parameters C = 2^k and \gamma = 2^m, where k and m are integers chosen within the range of -20 to 20. By fine-tuning these parameters, we aim to enhance classification performance while preserving a balance between accuracy and generalization.
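The RBF kernel of (13) and the exponential hyperparameter grid can be sketched as follows. This illustrates the search space only; scoring each (C, γ) pair by cross-validation and training the SVM itself are not shown, and the grid step is an assumption made here to keep the candidate list small.

```python
import math

def rbf_kernel(xi, xj, gamma):
    # Eq. (13): K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)
    d2 = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-gamma * d2)

def candidate_grid(lo=-20, hi=20, step=4):
    # Candidate hyperparameters C = 2^k, gamma = 2^m with k, m in [lo, hi];
    # each pair would be scored on the training split and the best kept.
    return [(2.0 ** k, 2.0 ** m)
            for k in range(lo, hi + 1, step)
            for m in range(lo, hi + 1, step)]
```

In practice a library SVM implementation would consume this grid; only the kernel and the search space follow directly from the text.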
We evaluated these automated classification and detection methods for laryngeal diseases using four key metrics: accuracy, sensitivity, specificity, and the F1 score. In this case, the algorithm classifies pathological samples as either pathological or healthy, accordingly labeling them as true positives (TP) or false negatives (FN). Conversely, healthy samples are classified as either healthy or pathological, corresponding to true negatives (TN) and false positives (FP). The following equations define these performance measures.
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}    (16)

\text{Sensitivity (Recall)} = \frac{TP}{TP + FN}    (17)

\text{Precision} = \frac{TP}{TP + FP}    (18)

\text{Specificity} = \frac{TN}{TN + FP}    (19)

\text{F1 Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}    (20)
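Equations (16)-(20) compute directly from the confusion-matrix counts; `evaluation_metrics` is a hypothetical helper name.

```python
def evaluation_metrics(tp, tn, fp, fn):
    # Eqs. (16)-(20): derived from the confusion-matrix counts.
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)           # sensitivity, Eq. (17)
    precision = tp / (tp + fp)        # Eq. (18)
    specificity = tn / (tn + fp)      # Eq. (19)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (20)
    return {"accuracy": accuracy, "sensitivity": recall,
            "precision": precision, "specificity": specificity, "f1": f1}
```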
3. RESULTS AND DISCUSSION
The proposed laryngeal disease detection and classification method was evaluated using the SVD database, described in section 2.1. In our experiments, 80% of the data was used for training, while 20% was reserved for testing and validation to evaluate the model's performance. Interpreting the confusion matrix is essential for evaluating the model's performance in accurately classifying the different categories (normal or pathological). This evaluation is guided by the metrics defined in section 2.5, which provide a quantitative classification performance assessment. Table 1 presents the metric values corresponding to each feature: MFCCs, LPCCs, HOSs, pitch, SRO, ZCR, and SC.
Table 1. Evaluation metrics table of different characterization parameters

Parameter   Accuracy (%)  Sensitivity (%)  Specificity (%)  F1 (%)  AUC (%)
14 MFCCs    94.5          94.2             95.3             96.1    94.5
14 LPCCs    85.8          88.7             78.1             88.7    85.5
HOSs        86.1          91.3             71.9             91.3    86.1
Pitch       86.6          87.9             83.5             87.9    89.1
SRO         86.1          86.9             83.9             86.9    86.1
ZCR         79.2          90.6             50.7             90.6    79.2
SC          86.0          93.2             68.2             93.2    86.0
The metrics presented in Table 1 provide valuable insights into the contribution of each acoustic feature to the classification of normal and pathological voices. Among all the parameters, MFCCs and LPCCs exhibit the highest performance across all evaluation metrics, indicating their strong discriminative power in detecting vocal pathologies. This result is consistent with previous studies, which highlight the efficiency of cepstral features in capturing relevant information from speech signals. HOSs also show promising results, suggesting that the voice signal's nonlinear characteristics contain useful diagnostic cues. Pitch and SRO demonstrate moderate classification performance, likely because they capture complementary aspects of vocal signal variability that may not be as robust across all samples.