International Journal of Advances in Applied Sciences (IJAAS)
Vol. 14, No. 3, September 2025, pp. 955∼965
ISSN: 2252-8814, DOI: 10.11591/ijaas.v14.i3.pp955-965
Pitch extraction using discrete cosine transform based power spectrum method in noisy speech
Humaira Sunzida1, Nargis Parvin2, Jafrin Akter Jeba1, Sulin Chi3, Md. Shiplu Ali1, Moinur Rahman1, Md. Saifur Rahman1

1Department of Information and Communication Technology, Faculty of Engineering, Comilla University, Cumilla, Bangladesh
2Department of Computer Science and Engineering, Bangladesh Army International University of Science and Technology, Cumilla, Bangladesh
3Department of Information Engineering, Otemon Gakuin University, Osaka, Japan
Article Info

Article history:
Received Jun 8, 2024
Revised Mar 9, 2025
Accepted Jun 8, 2025

Keywords:
Autocorrelation function
Cumulative power spectrum
Discrete cosine transform
Fundamental frequency
Pitch

ABSTRACT
The pitch period is a key component of many speech analysis research projects. In real-world applications, voice data is frequently gathered in noisy surroundings, so algorithms must be able to manage background noise well in order to estimate pitch accurately. Despite advancements, many state-of-the-art algorithms struggle to deliver adequate results when faced with low signal-to-noise ratios (SNRs) in processing noisy speech signals. This research proposes an effective concept specifically designed for speech processing applications, particularly in noisy conditions. To achieve this goal, we introduce a fundamental frequency extraction algorithm designed to tolerate non-stationary changes in the amplitude and frequency of the input signal. In order to improve the extraction accuracy, we also use a cumulative power spectrum (CPS) based on the discrete cosine transform (DCT) rather than the conventional power spectrum. We enhance the extraction accuracy of our method by utilizing shorter sub-frames of the input signal to mitigate the noise characteristics present in speech signals. According to the experimental results, our proposed technique demonstrates superior performance in noisy conditions compared to other existing state-of-the-art methods without utilizing any kind of post-processing techniques.

This is an open access article under the CC BY-SA license.
Corresponding Author:
Md. Saifur Rahman
Department of Information and Communication Technology, Faculty of Engineering, Comilla University
Kotbari, Cumilla, Bangladesh
Email: saifurice@cou.ac.bd
1. INTRODUCTION

Speech, the vocalized form of human communication, is defined as the movement of different speech organs to produce sounds. In other words, speech can be defined as a series of sounds arranged in a sequence. Sound is a symbolic representation of information that needs to be transmitted between people or between people and machines. The speech signal, represented acoustically as fluctuations in air pressure, conveys information between individuals or between individuals and machines. Speech may take the form of being voiced, unvoiced, or silent, reflecting different approaches to vocalization and sound generation. A voiced sound occurs when the speaker's vocal cords vibrate during sound production, while an unvoiced sound is produced without vocal cord vibration; when nothing comes out from the mouth, that part is considered silence. When a person speaks, their vocal cords vibrate, and the pitch is determined by how long it takes for the cords to open and close, known as the pitch period. This periodicity defines the fundamental frequency, which is
also represented as the pitch. In voiced sounds, the perceived pitch is determined by the apparent periodicity of vocal cord vibrations. Essentially, "pitch" in speech corresponds to the frequency of vocal cord vibrations during voiced sounds [1]. Pitch level correlates with the fundamental frequency: lower frequencies correspond to lower pitches, while higher frequencies indicate higher pitches [2]. Children and females are capable of reaching frequencies up to 500 Hz, while males typically have a lower fundamental frequency, around 60 Hz [3]. Pitch, or fundamental frequency (F0), is vital in speech production, reflecting the rate of vocal fold vibration and influencing intonation and emotion perception. Accurate pitch estimation is essential across multiple fields like speech processing and music, enabling tasks such as music analysis, speech prosody understanding, and telecommunications. Precision in pitch extraction significantly impacts the effectiveness of applications like music synthesis, speech processing, and voice modulation [4], [5].

Up till now, a variety of pitch recognition methods have been covered. Pitch detection algorithm (PDA) is the term used to describe these techniques, which were founded on various mathematical principles [6]. PDAs can be used in three different ways: in the frequency domain, in the time domain, or in a combination of the two [7]. Some pitch detection methods focus on identifying and timing specific features in the time domain. Pitch estimators in the time domain usually have three parts: a basic estimator, a post-processor for error correction, and a preprocessor for signal simplification. Within this domain, various techniques are used, such as the autocorrelation function (ACF) [8], the average magnitude difference function (AMDF) [9], the average squared mean difference function (ASMDF) [10], the weighted autocorrelation function (WAF) [11], and YIN [12]. The autocorrelation approach is the most often used method for figuring out a voice signal's pitch period. The correlation between the input signal and a time-delayed version of itself is indicated by the ACF. AMDF, known for showcasing low points at integral multiples of the pitch period, is often utilized for pitch estimation [13]. AMDF stands as an alternative approach to autocorrelation analysis, presenting a simplified version compared to the ACF. With AMDF, as opposed to the ACF, the delayed speech is subtracted from the original to create a difference signal, and the absolute magnitude is then determined at each delay value. The WAF method exploits the periodicity property shared with the ACF and AMDF; the WAF is characterized by employing the ACF as its numerator and the AMDF as its denominator. An algorithm called the YIN technique analyzes the traditional ACF [14].

In frequency domain techniques, various methods have been developed to analyze the frequency domain cepstrum coefficients or spectrum of periodic signals in order to extract pitch. The cepstrum (CEP) [15] method is one of the most well-known methods. This method relies on spectral characteristics. CEP is able to distinguish vocal tract features from periodic components. However, its performance is significantly compromised in a noisy environment, where the presence of noise has a pronounced impact on the log-amplitude spectrum. Enhancements to the cepstrum method are tackled in the modified cepstrum (MCEP) [16]. Features from both the windowless autocorrelation function (WLACF) and cepstral analysis are included in the cepstrum technique known as WLACF-CEP. WLACF reduces noise in the speech signal without compromising its periodicity. The pitch estimation filter with amplitude compression (PEFAC) utilizes summations of sub-harmonics in the log frequency domain; to improve its resilience to noise, PEFAC incorporates an amplitude compression technique [17]. Using both logarithmic and power functions, [18] reduces the effect of formants and utilizes the Radon transform to provide a novel method for estimating pitch in noisy speech conditions; it also incorporates the Viterbi algorithm for pitch pattern refinement. Mnasri et al. [19] proposed a method based on establishing a pragmatic relationship between the instantaneous frequency (Fi) and the fundamental frequency (F0). It determines whether speech areas are voiced or unvoiced and extracts the F0 contour by approximating it as a smoothed envelope of the remaining Fi values. To estimate pitch by comparing the temporal accumulations of clean and noisy speech samples, the topology-aware intra-operator parallelism strategy searching (TAPS) algorithm, as described in [20], trains a set of peak spectrum exemplars. To understand how noise affects the locations and amplitudes within the spectrum of clear speech, Chu and Alwan developed the statistical algorithm for F0 estimation (SAFE) model [21]. Pitch estimation is enhanced using self-supervised pitch estimation (SPICE), as stated in [22], by refining the acquired data and training on a constant-Q transform of the signals. To accommodate pitches with varying noise levels, DeepF0 [23] expands the network's receptive range. It has been demonstrated that HarmoF0 outperforms DeepF0 in pitch estimation by employing a range of dilated convolutions. On the other hand, BaNa [24] opts for the initial five amplitude spectral peaks from the speech signal's spectrum on average for both male
and female speakers. Existing methods often struggle with accuracy in noisy conditions, particularly when the signal-to-noise ratio (SNR) is low. In a novel approach, this study explores using the discrete cosine transform (DCT) [25] instead of the fast Fourier transform (FFT) [26], which proves effective in noisy signals but is susceptible to vocal tract effects, resulting in some inconsistencies. However, when the DCT was applied directly to the power spectrum, detection accuracy decreased. To mitigate the noise impact and improve accuracy, the study introduces a novel method combining the cumulative power spectrum (CPS) with DCT features. Instead of the conventional power spectrum, the proposed technique employs a CPS based on the DCT. The CPS emphasizes the shorter sub-frames, which is more effective in reducing the noise characteristics as well as mitigating the effect of the vocal tract. Therefore, the proposed approach outperforms traditional pitch extraction methods in noisy speech signals by effectively suppressing noise components, demonstrating superior efficacy in fundamental frequency extraction under noisy conditions.
2. PROPOSED METHOD

Assuming that y(n) represents a speech signal impacted by noise, as specified by (1),

y(n) = s(n) + w(n)    (1)
where w(n) is additive noise and s(n) is a clean speech signal. The block diagram of the CPS approach is displayed in Figure 1. The initial step involves dividing the noise-corrupted speech signal y(n) into frames.
Figure 1. Block diagram of DCT based CPS
In this approach, framing is accomplished by employing a rectangular window function. In our experiments, the input signal is partitioned into frames, each comprising 800 samples (equivalent to 50 ms). The framed signal y_f(n), where 0 ≤ n ≤ N − 1, is partitioned into three sub-frames using a time division approach. These sub-frames are expressed as (2)-(4).
y_{f,1}(n) = y_f(n),  0 ≤ n ≤ M − 1    (2)

y_{f,2}(n − D) = y_f(n),  D ≤ n ≤ D + M − 1    (3)

y_{f,3}(n − 2D) = y_f(n),  2D ≤ n ≤ 2D + M − 1    (4)
In this context, M represents an integer indicating the sub-frame length and D denotes the frame shift in samples; the goal is typically to set 2D + M − 1 equal to N. In section 3, the lengths of M and D are specified as 30 ms and 10 ms, respectively.
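To make this framing step concrete, the following minimal NumPy sketch splits one 50 ms analysis frame into the three overlapping sub-frames of (2)-(4); the 16 kHz sampling rate matches the KEELE setup in section 3.1, while the function and variable names are assumptions of the example rather than the authors' code.

```python
import numpy as np

def split_subframes(y_f, fs=16000, sub_ms=30, shift_ms=10):
    """Split one framed signal y_f(n) into three sub-frames, as in (2)-(4)."""
    M = int(fs * sub_ms / 1000)    # sub-frame length: 30 ms -> 480 samples
    D = int(fs * shift_ms / 1000)  # frame shift: 10 ms -> 160 samples
    # y_{f,j}(n) = y_f(n + (j - 1) D), 0 <= n <= M - 1, for j = 1, 2, 3
    return np.stack([y_f[j * D : j * D + M] for j in range(3)])

# usage: one 800-sample (50 ms) frame cut with a rectangular window
frame = np.random.randn(800)        # stand-in for a real noisy speech frame
subframes = split_subframes(frame)  # shape (3, 480); 2D + M matches N = 800
```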
The signal y_f(n), where 0 ≤ n ≤ N − 1, undergoes a frequency domain transformation through periodogram computation using the DCT. We examine the power spectrum based on y_f(n) to obtain information about the fundamental frequencies with regard to the DCT.
The DCT is a Fourier-related transform that uses only real values, much like the discrete Fourier transform (DFT) [27]. The DCT was favored over the DFT in the transformation of actual signals, like an acoustic signal. Different kinds of DCT and inverse discrete cosine transform (IDCT) pairings can be used for implementation purposes. The DFT changes a complicated signal within its intricate spectrum. On the other hand, half of the data is redundant and half of the computation is wasted if the signal is real, as it is in the majority of applications. The DCT tends to concentrate signal energy in a smaller number of coefficients compared to the DFT.
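As a quick, hedged illustration of this energy-compaction property, the sketch below compares how much of a real signal's energy the ten largest DCT coefficients hold versus the ten largest DFT coefficients; the random-walk test signal and the cutoff of ten are arbitrary choices for the example.

```python
import numpy as np
from scipy.fft import dct, rfft

rng = np.random.default_rng(0)
x = np.cumsum(rng.standard_normal(800))  # a smooth, highly correlated real signal
x -= x.mean()

X_dct = dct(x, type=2, norm='ortho')  # real DCT-II spectrum
X_dft = rfft(x)                       # complex DFT half-spectrum

def top_k_energy(coeffs, k=10):
    """Fraction of a transform's own energy held by its k largest coefficients."""
    e = np.abs(coeffs) ** 2
    return np.sort(e)[::-1][:k].sum() / e.sum()

# the DCT typically packs far more energy into a few low-order coefficients,
# since its implicit even extension avoids the DFT's boundary discontinuity
print(f"DCT top-10 energy fraction: {top_k_energy(X_dct):.3f}")
print(f"DFT top-10 energy fraction: {top_k_energy(X_dft):.3f}")
```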
The DFT provides a complex spectrum for a real signal, thereby wasting over half of the data. On the other hand, the DCT eliminates the need to compute redundant data by producing a true spectrum of real signals. The DCT gathers most of the signal's information and sends it to the signal's lower-order coefficients, resulting in a large reduction in processing costs [28]. The DCT avoids superfluous data and computation by producing a real spectrum of a real signal as a real transform. The DCT has a further benefit in that it requires a straightforward phase unwrapping procedure because it is a real function. Furthermore, as the DCT is derived from the DFT, all of the DFT's advantageous characteristics are retained, and a fast algorithm is available. Because the DCT is a fully real transform and doesn't require complex variables or arithmetic, it is computationally more efficient than the DFT. Taking into account the benefits of the DCT for actual signals, the DCT Y_f(k) of y_f(n) is chosen and derived as (5).
Y_f(k) = c_d(k) Σ_{n=1}^{N} y_f(n) cos[ π(2n − 1)(k − 1) / (2N) ]    (5)

In (5), k represents the frequency bin index, and the coefficient c_d(k) is given by c_d(k) = √(1/N) for k = 1 and c_d(k) = √(2/N) for 2 ≤ k ≤ N. Therefore, Y_f(k) is obtained.
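SciPy's orthonormal DCT-II computes the same transform as (5), up to its 0-based indexing, so the power spectrum of a (sub-)frame can be sketched as follows; the helper name is ours, and the zero-padding mirrors the DCT point counts listed in section 3.1.

```python
import numpy as np
from scipy.fft import dct

def dct_power_spectrum(x, n_points=2048):
    """DCT-based power spectrum of one (sub-)frame, following (5).

    n_points is 2048 for KEELE and 1024 for NTT in section 3.1.
    """
    x = np.pad(x, (0, max(0, n_points - len(x))))  # zero-pad to n_points
    Y = dct(x, type=2, norm='ortho')               # real-valued Y_f(k)
    return Y ** 2                                  # P_{y_f}(k)
```

Because Y_f(k) is real, squaring it directly yields the power spectrum with no complex arithmetic, which is the computational saving argued above.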
The fundamental frequency and higher harmonics are represented as sharper, higher-amplitude peaks in the DCT spectrum. The DCT's downsampled or compressed spectra allow the higher harmonics to be located at the fundamental frequency. The resultant spectrum, identified as the power spectrum of y_f(n), is denoted as P_{y_f}(k), where k corresponds to the frequency bin number associated with a discrete representation of w, represented by w_k. For each sub-frame y_{f,j}(n), where j = 1, 2, 3 and 0 ≤ n ≤ M − 1, the power spectra are computed as P_{y_f,1}(k), P_{y_f,2}(k), and P_{y_f,3}(k). The accumulations of these three power spectra are performed for each frequency bin as (6).
P̄_{y_f}(k) = Σ_{j=1}^{3} P_{y_f,j}(k)    (6)
The obtained power spectrum undergoes an IDCT. By identifying the maximum location in the resulting ACF, the fundamental frequency of y_f(n) is detected.
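Combining the sketches above, a hedged end-to-end version of the Figure 1 pipeline for one frame might read as follows; the 50-400 Hz search range is the one stated in section 3.2, while the function names and defaults are assumptions of the example.

```python
import numpy as np
from scipy.fft import idct

def cps_pitch(frame, fs=16000, n_points=2048, f_min=50.0, f_max=400.0):
    """Estimate F0 of one noisy frame with the DCT-based CPS method."""
    # (2)-(4): three shorter sub-frames to suppress noise characteristics
    subframes = split_subframes(frame, fs)
    # (5): DCT-based power spectrum of each sub-frame
    spectra = [dct_power_spectrum(sf, n_points) for sf in subframes]
    # (6): cumulative power spectrum, accumulated per frequency bin
    cps = np.sum(spectra, axis=0)
    # IDCT of the CPS yields an ACF-like sequence over lag
    acf = idct(cps, type=2, norm='ortho')
    # restrict the peak search to lags corresponding to 50-400 Hz
    lag_min, lag_max = int(fs / f_max), int(fs / f_min)
    lag = lag_min + np.argmax(acf[lag_min:lag_max + 1])
    return fs / lag  # estimated fundamental frequency in Hz

f0 = cps_pitch(frame)  # reuses split_subframes and dct_power_spectrum above
```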
Figure 2 displays the output waveforms of the noisy speech signal, the conventional ACF approach, the DCT-based ACF, and the DCT-based CPS.
Figure 2. Validation of CPS-DCT using output waveform
In Figure 2, the false peak represents the vocal tract effect, while the true peak indicates the fundamental frequency. The conventional ACF output waveform is notably impacted by the vocal tract effect, resulting in a false peak close to the true peak. The adoption of the DCT in place of the FFT within the ACF helps alleviate the vocal tract effect, whereas our proposed method plays a crucial role in achieving a smoother signal than the DCT-based ACF. It not only significantly reduces the vocal tract effect but also provides a more seamless waveform compared to the other methods. The results from the autocorrelation method applied to a voiced frame are illustrated in Figure 2. The waveforms in Figure 2 represent the effect of the FFT and the DCT in the ACF of the speech signal. These figures depict the outcome of speech delivered by a male speaker in the presence of white noise. We have already explored that, in the cross-correlation of noisy and clean speech, this component becomes zero. Hence, clean speech is significantly emphasized, and the ACF proves to be very effective in the case of a noisy signal. However, the ACF is considerably influenced by the vocal tract effect, leading to some unsmooth occurrences in the signal due to noise. The use of the DCT-based ACF can mitigate the vocal tract effect, yet some residual noise occurrences are still observable in the signal. Also, when we used the DCT in the ACF, the detection accuracy went down. In order to further diminish the impact of the noise characteristics and achieve better accuracy, we have introduced our proposed method, which combines the feature of the CPS with the DCT. On the other hand, Figure 3 represents the validation of our proposed idea by utilizing the harmonic characteristics. From Figure 3, we have observed that the DCT-based CPS (proposed) is more effective against noise characteristics than the FFT- and DCT-based power spectra. In the case of the FFT-based power spectrum, we have investigated that the harmonics are highly affected by noise, which is marked by a circle.
Figure 3. Validation of CPS-DCT using harmonic characteristics
3. RESULTS AND DISCUSSION

In this section, we assess the effectiveness of the CPS in identifying the fundamental frequency in the presence of noisy speech. Our assessment involves conducting experiments on speech signals to examine the performance of the cumulation-based approach. Ultimately, we present a comparative analysis of the outcomes achieved with our proposed method against those obtained from conventional pitch detection methods.
3.1. Experimental conditions

The proposed pitch detection method is implemented using speech signals obtained from the KEELE database [29] and the NTT database [30]. The KEELE database contains speech recordings from ten speakers, evenly divided between five males and five females. The collective duration of the speech signals extracted from the KEELE database, encompassing the speeches of all ten speakers, amounts to around 5.5 minutes. These speech signals were sampled at a frequency of 16 kHz. Eight utterances by Japanese speakers, each lasting ten seconds and with a 3.4 kHz band limitation and 10 kHz sampling rate, are available in the NTT database. This research introduces a novel idea that proves to be more suitable for speech processing applications, particularly in the accurate retrieval of pitch from speech signals under noisy conditions. To simulate noisy speech
samples, we blend clean speech recordings with noise collected from environments with high levels of background sound. To create the appropriate noisy voice samples, our method combines several forms of noise with the original speech signals. Four distinct noise categories, each with different SNR levels, were introduced into the initial signals to evaluate the algorithms' robustness to noise. These noise categories include white noise, babble noise, train noise, and high frequency (HF)-channel noise, all obtained from NOISEX-92 [31] and sampled at a frequency of 20 kHz. The noises were adjusted to a 16 kHz sampling frequency in order to match the KEELE database's signal properties and to a 10 kHz sampling frequency in order to match the NTT database's signal properties. The signal-to-noise ratio (SNR) was systematically varied at levels of 0, 5, 10, 15, and 20 dB for the assessment.
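A minimal sketch of this mixing step, assuming the standard power-ratio definition of SNR (the signal arrays here are random placeholders, not the actual KEELE/NTT recordings or NOISEX-92 files):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Add noise to clean speech at a target SNR in dB, as in (1)."""
    noise = noise[:len(clean)]        # match lengths
    p_clean = np.mean(clean ** 2)     # signal power
    p_noise = np.mean(noise ** 2)     # noise power
    # choose gain so that 10*log10(p_clean / (gain**2 * p_noise)) == snr_db
    gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + gain * noise

clean_speech = np.random.randn(16000)  # placeholders for real recordings
babble_noise = np.random.randn(16000)
noisy = {snr: mix_at_snr(clean_speech, babble_noise, snr)
         for snr in (0, 5, 10, 15, 20)}  # the five evaluation conditions
```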
The remaining experimental parameters for extracting the fundamental frequency were as follows:
– Frame length: except for PEFAC and BaNa, the frame length is 50 ms.
– Frame shift: 10 ms.
– Window type: rectangular, with the exception of BaNa and PEFAC.
– DCT (IDCT) points: 2048 points (KEELE) and 1024 points (NTT), when BaNa and PEFAC are not considered.
3.2. Evaluation criteria

Pitch estimation error is determined by measuring the difference between the reference and estimated fundamental frequencies. The accuracy of fundamental frequency detection is assessed following Rabiner's rule [31], utilizing the fundamental frequency detection error e(l):

e(l) = F_est(l) − F_true(l)    (7)
where l is the frame number, F_est(l) is the estimated fundamental frequency at the l-th frame from a noisy spoken signal, and F_true(l) is the true fundamental frequency at the l-th frame. If the absolute value of e(l) exceeds 10% of F_true(l) (i.e., |e(l)| > 10%), the frame falls under the category of gross pitch error (GPE), and the overall proportion of this error is computed over the uttered frames in the speech data. The error was designated as the fine pitch error (FPE) if |e(l)| ≤ 10% from the ground truth first harmonic frequency.
We specifically identified and evaluated the voiced portions in sentences concerning the fundamental frequency. Our analysis utilized a search range from f_min = 50 Hz to f_max = 400 Hz, corresponding to the fundamental frequency range commonly observed in most people.
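The two criteria can be sketched as follows, assuming per-frame arrays of estimated and reference F0 over the voiced frames; the 10% threshold is from the text, and treating FPE as the mean absolute error of the remaining frames is our reading of the definition.

```python
import numpy as np

def gpe_fpe(f_est, f_true):
    """Gross pitch error rate (%) and fine pitch error (Hz), following (7)."""
    voiced = f_true > 0                        # score voiced frames only
    e = f_est[voiced] - f_true[voiced]         # e(l) = F_est(l) - F_true(l)
    gross = np.abs(e) > 0.10 * f_true[voiced]  # |e(l)| > 10% of F_true(l)
    gpe = 100.0 * np.mean(gross)               # percentage of gross errors
    fpe = np.abs(e[~gross]).mean()             # mean |e(l)| of fine frames
    return gpe, fpe
```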
3.3. Results and performance comparison

In this section, we conduct a comparative analysis between our proposed method and conventional approaches, such as PEFAC, BaNa, and YIN, using distinct utterances from the KEELE and NTT databases. We evaluate performance under four types of noise: white noise, babble noise, HF channel noise, and train noise. Parameters like frame length, window function, and the number of DFT (IDFT) points specific to PEFAC and BaNa were adjusted, while other parameters remained consistent across methods. The Hamming window function was applied uniformly in PEFAC and BaNa. For BaNa, the frame duration was set to 60 ms, and 2^16 points were used for the DFT (IDFT). The source code of BaNa, tailored for this environment, was implemented (as described in [32]). PEFAC utilized a Hamming window with a duration of 90 ms for both the window function and frame length; its source code used 2^13 as the value for the DFT (IDFT) points. The implementation of PEFAC in this environment is well-suited for BaNa (as indicated in [17], [33]). Performance evaluation was conducted using the GPE and the FPE. The average GPE and FPE results obtained from the experimental outcomes of the proposed method, PEFAC, BaNa, and YIN were considered for utterances from both female and male speakers at various SNRs (0, 5, 10, 15, and 20 dB). Tables 1-8 present a comparison of GPE for the KEELE database and the NTT database, respectively, under various noise conditions, including white noise, babble noise, HF channel noise, and train noise. On the other hand, Tables 9-16 present a comparison of FPE for the KEELE database and the NTT database, respectively, under the above noise conditions. The GPE and FPE values of our proposed method are contrasted with those of PEFAC, BaNa, and YIN.
Table 1. Average GPE rate (%) for KEELE database for white noise
SNR [dB]   Proposed   PEFAC   BaNa    YIN
0          20.58      37.15   22.61   31.37
5          15.96      34.38   19.58   21.59
10         13.86      33.01   17.80   16.57
15         13.12      32.50   16.97   14.29
20         12.90      31.98   16.59   12.87

Table 2. Average GPE rate (%) for KEELE database for babble noise
SNR [dB]   Proposed   PEFAC   BaNa    YIN
0          35.18      49.01   40.54   36.89
5          22.88      41.86   29.48   23.68
10         16.57      37.41   22.84   16.64
15         13.09      34.98   19.69   13.16
20         11.87      33.39   17.70   12.14

Table 3. Average GPE rate (%) for KEELE database for train noise
SNR [dB]   Proposed   PEFAC   BaNa    YIN
0          33.44      43.17   29.08   34.38
5          22.81      38.99   23.11   22.76
10         16.98      35.59   20.04   16.36
15         14.50      33.40   18.31   13.42
20         13.49      32.25   17.36   12.16

Table 4. Average GPE rate (%) for KEELE database for HF-channel noise
SNR [dB]   Proposed   PEFAC   BaNa    YIN
0          24.70      40.13   22.64   31.55
5          17.64      36.86   19.82   21.01
10         14.79      34.37   17.90   16.06
15         13.45      32.98   17.31   13.76
20         13.04      32.11   16.57   12.79

Table 5. Average GPE rate (%) for NTT database for white noise
SNR [dB]   Proposed   PEFAC   BaNa    YIN
0          4.71       17.47   8.00    14.20
5          1.90       12.89   5.52    4.70
10         1.38       11.34   3.98    2.08
15         1.36       11.93   3.26    1.55
20         1.38       13.21   3.30    1.46

Table 6. Average GPE rate (%) for NTT database for babble noise
SNR [dB]   Proposed   PEFAC   BaNa    YIN
0          28.26      39.86   27.71   31.75
5          10.01      24.75   12.60   12.31
10         2.80       16.11   5.20    3.20
15         1.58       12.45   4.08    1.52
20         1.44       11.69   4.02    1.41

Table 7. Average GPE rate (%) for NTT database for train noise
SNR [dB]   Proposed   PEFAC     BaNa    YIN
0          14.98      25.28     10.91   20.32
5          4.66       16.3657   5.72    6.76
10         1.92       12.28     4.28    2.34
15         1.38       10.21     3.44    1.61
20         1.36       9.29      3.47    1.33

Table 8. Average GPE rate (%) for NTT database for HF-channel noise
SNR [dB]   Proposed   PEFAC   BaNa    YIN
0          5.73       18.91   5.97    14.32
5          2.34       13.13   4.52    4.84
10         1.62       11.00   4.41    2.02
15         1.49       10.72   4.29    1.48
20         1.45       10.06   4.13    1.39

Table 9. Average FPE rate (Hz) for KEELE database for white noise
SNR [dB]   Proposed   PEFAC   BaNa    YIN
0          4.42       5.45    5.23    4.54
5          4.14       5.36    5.22    3.97
10         4.03       5.32    5.19    3.60
15         3.99       5.26    5.14    3.46
20         3.97       5.25    5.08    3.44

Table 10. Average FPE rate (Hz) for KEELE database for babble noise
SNR [dB]   Proposed   PEFAC   BaNa    YIN
0          4.54       5.62    5.29    4.12
5          4.28       5.49    5.18    3.79
10         4.10       5.38    5.11    3.59
15         4.01       5.30    5.09    3.50
20         3.98       5.24    5.08    3.50

Table 11. Average FPE rate (Hz) for KEELE database for train noise
SNR [dB]   Proposed   PEFAC   BaNa    YIN
0          4.48       5.51    5.30    3.96
5          4.24       5.40    5.15    3.68
10         4.06       5.33    5.11    3.53
15         3.98       5.31    5.05    3.45
20         3.95       5.27    5.03    3.44

Table 12. Average FPE rate (Hz) for KEELE database for HF channel noise
SNR [dB]   Proposed   PEFAC   BaNa    YIN
0          4.62       5.51    5.24    4.21
5          4.30       5.38    5.21    3.80
10         4.10       5.33    5.21    3.56
15         3.99       5.30    5.14    3.48
20         3.97       5.29    5.11    3.43
Table 13. Average FPE rate (Hz) for NTT database for white noise
SNR [dB]   Proposed   PEFAC   BaNa    YIN
0          3.01       3.42    2.39    3.82
5          2.69       3.34    2.20    2.59
10         2.53       3.25    2.09    2.16
15         2.49       3.20    2.00    2.03
20         2.49       3.15    1.95    1.99

Table 14. Average FPE rate (Hz) for NTT database for babble noise
SNR [dB]   Proposed   PEFAC   BaNa    YIN
0          2.26       3.88    2.69    3.09
5          2.40       3.52    2.25    2.42
10         2.50       3.31    2.03    2.15
15         2.50       3.21    1.93    2.02
20         2.48       3.16    1.84    1.99

Table 15. Average FPE rate (Hz) for NTT database for train noise
SNR [dB]   Proposed   PEFAC   BaNa    YIN
0          2.84       3.61    2.51    3.25
5          2.70       3.44    2.19    2.44
10         2.56       3.25    2.05    2.44
15         2.51       3.15    1.94    2.02
20         2.49       3.13    1.87    1.99

Table 16. Average FPE rate (Hz) for NTT database for HF channel noise
SNR [dB]   Proposed   PEFAC   BaNa    YIN
0          3.13       3.55    2.34    3.85
5          2.69       3.40    2.19    2.70
10         2.54       3.28    2.09    2.18
15         2.50       3.17    2.01    2.02
20         2.48       3.11    1.93    1.99
In the case of the KEELE database, the proposed approach consistently exhibits the lowest average GPE rate compared to the other techniques across almost all SNRs in all noise cases, except at low SNR (0 dB) in the train and HF channel noise cases. At 0 dB SNR in the train and HF channel noise cases, BaNa provides a slightly lower gross pitch error rate due to its processing strategy, which follows the noise characteristics. On the other hand, in the case of the NTT database, the proposed method shows almost the same properties as with the KEELE database. In the case of the FPE in Tables 9-12 for the KEELE database, the proposed method provides a lower FPE (Hz) than PEFAC and BaNa at almost all SNRs in all noise cases, the exception being the YIN method; the proposed method is highly competitive with YIN except in the white noise case. In the case of the NTT database, the FPE (Hz) of the proposed method is lower than that of the PEFAC and YIN methods and highly competitive with BaNa except for babble noise. In babble noise, the proposed method shows superior performance compared with the other methods.
4. CONCLUSION

Accurately estimating pitch poses a challenge in speech analysis, especially in noisy environments. In this study, we introduce an improved method that excels in isolating noise from the waveform, particularly in babble noise scenarios, outperforming other techniques. This method exhibits a lower average GPE rate compared to alternative approaches, and it achieves this without any complicated post-processing. Additionally, it efficiently mitigates the impact of vocal tract effects by equalizing unnecessary ripples in the waveform. Across noise types and SNRs, our research also demonstrates that it is more robust than other traditional methods without requiring any complex post-processing. In the future, research might focus on creating a new pitch extraction technique that is more effective in speech processing applications and highly resilient to extremely low SNR instances across a range of real-world noise scenarios.
FUNDING INFORMATION

No funding involved.

AUTHOR CONTRIBUTIONS STATEMENT

This journal uses the Contributor Roles Taxonomy (CRediT) to recognize individual author contributions, reduce authorship disputes, and facilitate collaboration.
Name of Author: C, M, So, Va, Fo, I, R, D, O, E, Vi, Su, P, Fu
Humaira Sunzida: ✓ ✓ ✓ ✓
Nargis Parvin: ✓ ✓ ✓ ✓
Jafrin Akter Jeba: ✓ ✓ ✓
Sulin Chi: ✓ ✓ ✓
Md. Shiplu Ali: ✓ ✓
Moinur Rahman: ✓ ✓ ✓
Md. Saifur Rahman: ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓

C: Conceptualization, M: Methodology, So: Software, Va: Validation, Fo: Formal Analysis, I: Investigation, R: Resources, D: Data Curation, O: Writing - Original Draft, E: Writing - Review & Editing, Vi: Visualization, Su: Supervision, P: Project Administration, Fu: Funding Acquisition
CONFLICT OF INTEREST STATEMENT

Authors state no conflict of interest.

DATA AVAILABILITY

The authors confirm that the data supporting the findings of this study are available within the article.
REFERENCES

[1] S. S. Upadhya, "Pitch detection in time and frequency domain," Proceedings - 2012 International Conference on Communication, Information and Computing Technology (ICCICT 2012), 2012, doi: 10.1109/ICCICT.2012.6398150.
[2] M. S. Rahman, "Pitch extraction for speech signals in noisy environments," Ph.D. Dissertation, Department of Mathematics, Electronics, and Informatics, Saitama University, Saitama, Japan, 2020. [Online]. Available: https://sucra.repo.nii.ac.jp/record/19377/files/GD0001258.pdf.
[3] X. Zhang, H. Zhang, S. Nie, G. Gao, and W. Liu, "A pairwise algorithm using the deep stacking network for speech separation and pitch estimation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 6, pp. 1066–1078, 2016, doi: 10.1109/TASLP.2016.2540805.
[4] D. Wang, C. Yu, and J. H. L. Hansen, "Robust harmonic features for classification-based pitch estimation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 5, pp. 952–964, 2017, doi: 10.1109/TASLP.2017.2667879.
[5] D. Gerhard, "Pitch extraction and fundamental frequency: history and current techniques," Technical Report TR-CS, 2003. [Online]. Available: https://www.cs.bu.edu/fac/snyder/cs583/Literature and Resources/PitchExtractionMastersThesis.pdf.
[6] N. S. B. Ruslan, M. Mamat, R. R. Porle, and N. Parimon, "A comparative study of pitch detection algorithms for microcontroller based voice pitch detector," Advanced Science Letters, vol. 23, no. 11, pp. 11521–11524, 2017, doi: 10.1166/asl.2017.10320.
[7] L. Sukhostat and Y. Imamverdiyev, "A comparative analysis of pitch detection methods under the influence of different noise conditions," Journal of Voice, vol. 29, no. 4, pp. 410–417, 2015, doi: 10.1016/j.jvoice.2014.09.016.
[8] L. R. Rabiner, "On the use of autocorrelation analysis for pitch detection," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 25, no. 1, pp. 24–33, 1977, doi: 10.1109/TASSP.1977.1162905.
[9] A. Cohen et al., "Average magnitude difference function pitch extractor," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-22, no. 5, pp. 353–362, 1974, doi: 10.1109/TASSP.1974.1162598.
[10] R. Chakraborty, D. Sengupta, and S. Sinha, "Pitch tracking of acoustic signals based on average squared mean difference function," Signal, Image and Video Processing, vol. 3, no. 4, pp. 319–327, 2009, doi: 10.1007/s11760-008-0072-5.
[11] T. Shimamura and H. Kobayashi, "Weighted autocorrelation for pitch extraction of noisy speech," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 7, pp. 727–730, 2001, doi: 10.1109/89.952490.
[12] A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," The Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917–1930, 2002, doi: 10.1121/1.1458024.
[13] C. Shahnaz, W. P. Zhu, and M. O. Ahmad, "Pitch estimation based on a harmonic sinusoidal autocorrelation model and a time-domain matching scheme," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 322–335, 2012, doi: 10.1109/TASL.2011.2161579.
[14] H. Hajimolahoseini, R. Amirfattahi, S. Gazor, and H. Soltanian-Zadeh, "Robust estimation and tracking of pitch period using an efficient Bayesian filter," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 7, pp. 1219–1229, 2016, doi: 10.1109/TASLP.2016.2551041.
[15] W. Hu, X. Wang, and P. Gómez, "Robust pitch extraction in pathological voice based on wavelet and cepstrum," European Signal Processing Conference, pp. 297–300, 2015. [Online]. Available: https://new.eurasip.org/Proceedings/Eusipco/Eusipco2004/defevent/papers/cr1417.pdf.
[16] M. S. Rahman, Y. Sugiura, and T. Shimamura, "Utilization of windowing effect and accumulated autocorrelation function and power spectrum for pitch detection in noisy environments," IEEJ Transactions on Electrical and Electronic Engineering, vol. 15, no. 11, pp. 1680–1689, 2020, doi: 10.1002/tee.23238.
[17] S. Gonzalez, "PEFAC - a pitch estimation algorithm robust to high levels of noise," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 2, pp. 518–530, 2014, doi: 10.1109/TASLP.2013.2295918.
[18] B. Li and X. Zhang, "A pitch estimation algorithm for speech in complex noise environments based on the Radon transform," IEEE Access, vol. 11, pp. 9876–9889, 2023, doi: 10.1109/ACCESS.2023.3240181.
[19] Z. Mnasri, S. Rovetta, and F. Masulli, "A novel pitch detection algorithm based on instantaneous frequency for clean and noisy speech," Circuits, Systems, and Signal Processing, vol. 41, no. 11, pp. 6266–6294, 2022, doi: 10.1007/s00034-022-02082-8.
[20] F. Huang and T. Lee, "Pitch estimation in noisy speech using accumulated peak spectrum and sparse estimation technique," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 1, pp. 99–109, 2013, doi: 10.1109/TASL.2012.2215589.
[21] W. Chu and A. Alwan, "SAFE: a statistical approach to F0 estimation under clean and noisy conditions," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 3, pp. 933–944, 2012, doi: 10.1109/TASL.2011.2168518.
[22] B. Gfeller, C. Frank, D. Roblek, M. Sharifi, M. Tagliasacchi, and M. Velimirovic, "SPICE: self-supervised pitch estimation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1118–1128, 2020, doi: 10.1109/TASLP.2020.2982285.
[23] S. Singh, R. Wang, and Y. Qiu, "DeepF0: end-to-end fundamental frequency estimation for music and speech signals," ICASSP 2021 - IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp. 61–65, 2021, doi: 10.1109/ICASSP39728.2021.9414050.
[24] N. Yang, H. Ba, W. Cai, I. Demirkol, and W. Heinzelman, "BaNa: a noise resilient fundamental frequency detection algorithm for speech and music," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 1833–1848, 2014, doi: 10.1109/TASLP.2014.2352453.
[25] N. Ahmed, T. Natarajan, and K. R. Rao, "Discrete cosine transform," IEEE Transactions on Computers, vol. C-23, no. 1, pp. 90–93, Jan. 1974, doi: 10.1109/T-C.1974.223784.
[26] P. Duhamel and M. Vetterli, "Fast Fourier transforms: a tutorial review and a state of the art," Signal Processing, vol. 19, no. 4, pp. 259–299, Apr. 1990, doi: 10.1016/0165-1684(90)90158-U.
[27] F. J. Harris, "Time domain signal processing with the DFT," Handbook of Digital Signal Processing, pp. 633–699, 1987, doi: 10.1016/b978-0-08-050780-4.50013-8.
[28] F. Plante, G. Meyer, and W. Ainsworth, "A pitch extraction reference database," 4th European Conference on Speech Communication and Technology, pp. 837–840, 1995, doi: 10.21437/Eurospeech.1995-191.
[29] Y. Meng, "Speech recognition on DSP: algorithm optimization and performance analysis," Master Thesis, Department of Electronic Engineering, The Chinese University of Hong Kong, Sha Tin, Hong Kong, 2004. [Online]. Available: http://www.ee.cuhk.edu.hk/~myuan/Thesis.pdf.
[30] NTT Advanced Technology Corp., 20 countries language database, NTT Advanced Technology Corp., 1988.
[31] A. Varga and H. J. M. Steeneken, "Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems," Speech Communication, vol. 12, no. 3, pp. 247–251, 1993, doi: 10.1016/0167-6393(93)90095-3.
[32] University of Rochester, "Wireless communication and networking group," hajim.rochester.edu. Accessed: Mar. 02, 2024. [Online]. Available: https://hajim.rochester.edu/ece/sites/wcng//code.html.
[33] M. Brookes, "VOICEBOX: speech processing toolbox for MATLAB," ee.ic.ac.uk. Accessed: Mar. 02, 2024. [Online]. Available: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html.
BIOGRAPHIES OF AUTHORS

Humaira Sunzida obtained her B.Sc. (Engineering) degree in Information and Communication Technology from Comilla University, Cumilla, Bangladesh, in 2024. She started her undergraduate studies in the Department of Information and Communication Technology at Comilla University in 2019. Her current research interests encompass speech analysis and digital signal processing. She can be contacted at email: humairasunzida.311@stud.cou.ac.bd.

Nargis Parvin received her B.Sc. (Honours) and M.Sc. degrees in Information and Communication Engineering from the University of Rajshahi, Rajshahi, Bangladesh, in 2006 and 2007, respectively. In 2013, she joined as a Lecturer in the Department of Computer Science and Engineering, Bangladesh Army International University of Science and Technology (BAIUST), Cumilla Cantonment, Cumilla, Bangladesh, where she is currently serving as Assistant Professor. She pursued her Ph.D. degree in the field of wireless sensor networks (WSN) at the Graduate School of Science and Engineering at Saitama University, Japan. Her research interests include wireless sensor networks, speech analysis, and digital signal processing. She can be contacted at email: nargis.cse@baiust.ac.bd.