IAES International Journal of Artificial Intelligence (IJ-AI)
Vol. 15, No. 2, April 2026, pp. 1733∼1745
ISSN: 2252-8938, DOI: 10.11591/ijai.v15.i2.pp1733-1745
ResNet based deep learning approach for chronic obstructive pulmonary disease prediction using lung sound analysis

Babitha Sudhakar Ullal¹, Veena Kalludi Narasimhaiah¹, Rithul Kamesh²
¹School of Electronics and Communication Engineering, REVA University, Bengaluru, India
²Department of Electronics and Communication Engineering, PES University, Bengaluru, India
Article Info

Article history:
Received Aug 21, 2025
Revised Jan 15, 2026
Accepted Feb 6, 2026

Keywords:
Audio signal processing
Chronic obstructive pulmonary disease
Convolutional neural network
Long short-term memory
Residual networks
ABSTRACT
Chronic obstructive pulmonary disease (COPD) affects around 300-400 million people worldwide, representing a critical healthcare challenge that requires early detection for effective intervention. This work introduces chronic lung analysis via audio signal prediction (CLASP), a novel framework achieving 97.90% accuracy in predicting COPD automatically through respiratory audio signal analysis. The method integrates advanced signal processing and deep learning architectures, comparing long short-term memory (LSTM), convolutional neural network (CNN), and residual network (ResNet) models for optimal performance. The ResNet architecture exhibits superior diagnostic capability, with a precision of 98.72%, a recall of 96.86%, and an area under the curve (AUC) of 0.9937, outperforming existing methods by significant margins. These results establish a new benchmark for noninvasive COPD detection, enabling practical deployment in clinical settings, which can substantially improve patient outcomes through early detection and reduce healthcare costs.
This is an open access article under the CC BY-SA license.
Corresponding Author:
Babitha Sudhakar Ullal
School of Electronics and Communication Engineering, REVA University
Bengaluru, India
Email: babitharoshan91@gmail.com
1. INTRODUCTION
Chronic obstructive pulmonary disease (COPD) is one of the critical challenges in present-day respiratory medicine, affecting around 384 million people globally, and is estimated to become the world's third most common cause of death by 2030 [1], [2]. The disease is progressive in nature [3] and is characterized by persistent respiratory problems and airflow limitation [4], [5]. It demands early detection techniques that can identify patients before irreversible lung damage occurs [6], [7]. Spirometry and clinical evaluation, the current diagnostic approaches, often fail to detect COPD in its early stages [8], when medical interventions would be most effective at improving patient outcomes, creating a pressing need for more sensitive and accessible screening methods.
Machine learning approaches to the detection of various respiratory diseases have exhibited remarkable promise [9]–[13] by applying audio signal analysis to the vast temporal and spectral information present in breath sounds [14], [15]. Recent developments in deep learning architectures, such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and residual networks (ResNets), have shown exceptional capability in pattern recognition tasks on biomedical signals. These technologies provide the necessary computational foundation to implement automated diagnostic systems that can analyse complex respiratory audio patterns with high accuracy while remaining noninvasive.

Journal homepage: http://ijai.iaescore.com
The specific challenge addressed in this work is the implementation of an automated system that distinguishes COPD-related respiratory patterns from other normal or abnormal breathing patterns with a level of accuracy acceptable for clinical deployment. Traditional machine learning approaches have achieved modest success in this area, with accuracies typically varying between 82.5% and 93% [16], [17]. However, these performance levels fall short of the standards required for clinical deployment.
The proposed approach introduces chronic lung analysis via audio signal prediction (CLASP), a complete framework that combines advanced signal processing techniques with cutting-edge deep learning models to achieve high accuracy in automated detection of COPD. The key innovations in the work are as follows. First, development of an optimized audio preprocessing pipeline incorporating Mel-frequency cepstral coefficients (MFCC) along with first-derivative (delta) and second-derivative (delta-delta) features to enhance temporal pattern capture. Second, comparative evaluation of three distinct deep learning models: long short-term memory (LSTM), CNN, and ResNet. Third, implementation of threshold optimization techniques to enhance clinical utility.
2. RELATED WORK
Lee et al. [16] designed a model that uses thermal imaging to capture respiratory patterns, focusing on four primary features: total volume of respiration, average expiration distance, average inspiration distance, and respiration rate. Z-score normalization was applied to these features, which were then combined through weighted summation to generate a composite score used for classification. The accuracy of the model is 82.5%. The model's high recall indicates its ability to identify individuals with COPD while minimizing false negatives.

Siddiqui et al. [17] explored the use of a non-invasive method, ultra-wideband (UWB) radar, to differentiate COPD patients from healthy individuals. Data was collected from 70 subjects (35 COPD patients and 35 healthy controls). The researchers extracted respiration data and incorporated further features such as patient age, smoking history, and gender to enhance accuracy. Several machine learning classifiers, including support vector machine (SVM), naïve Bayes (NB), adaptive boosting (AdaBoost), k-nearest neighbor (KNN), and random forest (RF), as well as deep learning models such as CNN and LSTM networks, were employed. Among these, the highest accuracy of 93% was achieved by the LSTM model, demonstrating the potential of combining UWB radar and machine learning as a non-invasive and effective method for COPD detection.
Abineza et al. [18] utilized time-stamped electronic health records from COPD patients to develop an LSTM model that predicts subsequent exacerbation by analyzing symptoms, patterns, and arterial oxygen saturation levels over time. Experiments were run with varying time windows, ranging from one to six prior days, to forecast the likelihood of an exacerbation on the following day. The model performed best with a one-day time window, achieving a testing accuracy of 85%, a training accuracy of 87%, and an area under the curve (AUC) of 0.83. These results were obtained from a small dataset comprising 54 patients. In addition, saturation of peripheral oxygen (SPO2) is the only clinical variable used as a factor influencing COPD exacerbations.

Mei et al. [19] introduce DeepSpiro, a novel deep learning architecture designed to enhance detection as well as early prediction of COPD using spirogram data. DeepSpiro comprises four primary components: SpiroSmoother, SpiroEncoder, SpiroExplainer, and SpiroPredictor. The model achieved an AUC of 0.8328 in distinguishing individuals with COPD from those without. In early prediction tasks, DeepSpiro effectively differentiated between low-risk and high-risk groups, with substantial differences observed in future COPD development. This underscores the model's potential for forecasting the long-term progression of COPD. The model's accuracy depends on the quality of the spirogram data.

Yin et al. [20] use fractional-order dynamics and deep learning techniques to predict COPD. Thorax breathing effort, respiratory rate, and oxygen saturation levels were extracted to obtain fractional dynamic signatures used to train a deep neural network (DNN). The model accuracy was 94.01% when trained on the WestRo Porti COPD dataset and tested on the WestRo COPD dataset, and 90.13% in the reverse scenario. However, the relatively small number of unique patients may limit the model's generalizability to broader populations.
Bairagi and Kanwade [21] used a non-invasive technique that analyzes surface electromyography (sEMG) signals from the sternomastoid muscle, a primary respiratory muscle, aiming to overcome the limitations of traditional spirometry. EMG signals were examined in the time, frequency, and time-frequency domains. A slope-based onset detection algorithm was employed to identify muscle activation periods, enhancing the accuracy of feature extraction. Features were extracted using the continuous wavelet transform (CWT) at single frequencies of 7, 8, and 10 Hz, facilitating COPD classification based on severity grade. With this technique, a classification accuracy of 85.89% across different COPD severity grades was achieved.

Kumar et al. [22] present an innovative approach to diagnosing COPD that integrates computed tomography (CT) scan images with lung audio samples. Features extracted from the scan images and audio samples include texture, histogram intensity, Gaussian scale space, chroma, and MFCCs. To assess the patient's severity level through early classification, the extracted features are fed into an ensemble learning technique. The proposed framework achieved accuracies of 97.50% for fusion-based early diagnosis, 98% for early diagnosis using the CT diagnostic model, and 95.30% for early diagnosis using the cough sample model. These high accuracies derive not from the audio signals alone but also from the CT images.

Ullah et al. [23] employed the respiratory sound database from Kaggle together with chest wall lung sounds. To ensure uniformity and a fixed-length duration, raw signals were resampled to 4 kHz and then zero-padded. Segmentation was then performed to prepare the data for feature extraction using MFCC (13 features) and short-time Fourier transform (STFT) (1,000 features). SVM, artificial neural network (ANN), KNN, RF, and decision tree (DT) machine learning algorithms were employed. The models were trained on 70% of the data and validated on 30%. The combination STFT+MFCC-ANN yielded the best accuracy.
Nunavath et al. [24] explore deep learning architectures to predict exacerbation in COPD patients. The authors employ LSTM to analyze patient data (only 94) and identify patterns that indicate the likelihood of future exacerbation. The deep neural network performed better than traditional machine learning approaches in predicting COPD exacerbation. The LSTM model showed 92.86% accuracy.

Jenefa et al. [25] present a novel approach to predicting COPD that integrates CNN and LSTM. This approach leverages the strengths of both architectures: the CNN extracts spatial features, and the LSTM captures temporal dependencies in sequential data. The model efficiently captured complex COPD patterns, leading to more effective predictions. The method identified early stages of COPD with accuracy greater than 95%.

However, these approaches often struggle to capture long-term temporal patterns in the audio signals, which are critical for accurate COPD diagnosis. CLASP builds upon these works by using an LSTM model, a CNN model, and a ResNet model to capture long-term and short-term patterns in respiratory audio signals and by comparing their performance.
3. METHOD
The CLASP framework uses a systematic approach to detect COPD through the analysis of respiratory audio signals. The methodology includes three primary components. First, computational signal processing techniques for audio preprocessing and feature extraction. Second, experimental deep learning architectures for pattern recognition. Third, comprehensive evaluation protocols for clinical validation. The proposed methodology uses a publicly available dataset hosted on Kaggle at https://www.kaggle.com/code/mariammagdy22/pulmonary-diseases-detection-system/input.
3.1. Computational techniques
Audio signals are processed at a sampling frequency of 22,050 Hz with windowed segmentation. MFCC extraction yields 13 static coefficients augmented with delta and delta-delta features to capture temporal dynamics. MFCC computation involves STFT analysis, Mel-scale filterbank application, and the discrete cosine transform (DCT). The proposed implementation uses the librosa library with optimized parameters validated through preliminary testing on the International Conference on Biomedical Health Informatics (ICBHI) dataset [15]. This approach gives robust feature representations that capture the spectral characteristics and temporal variations required for respiratory pattern analysis. A spectrogram grid showcasing the respiratory audio signals used in the CLASP framework is shown in Figure 1.
3.2. Experimental techniques
Three different deep learning architectures were implemented and methodically compared: LSTM networks with attention mechanisms, CNN with global pooling strategies, and ResNet with skip connections. Each architecture was designed explicitly for the 39-dimensional MFCC feature vectors, with careful consideration of temporal dependencies in respiratory audio signals. Training protocols used stratified data splitting with an 80-20 train-test split, ensuring balanced representation of COPD and non-COPD classes. The Adam optimizer is used for model optimization with a learning rate of 0.001. To prevent overfitting, early stopping and learning rate reduction strategies are used, which also improve convergence stability.
3.3. Error analysis and validation
The evaluation protocols addressed both random and systematic error sources through metric assessment including accuracy, precision, specificity, F1-score, recall, and area under the curve (AUC). Threshold optimization techniques are used to enhance clinical utility, prioritizing sensitivity for screening applications in the medical field. Cross-validation strategies and confusion matrix analysis provide detailed insights into model performance characteristics. Training time analysis quantified computational efficiency trade-offs, and statistical significance testing confirmed the robustness of performance differences across architectures. The CLASP pipeline architecture with three deep learning models (CNN, ResNet, and LSTM with attention) is shown in Figure 2.

Figure 1. Spectrogram grid used in the CLASP framework

Figure 2. CLASP pipeline architecture
4. SYSTEM ARCHITECTURE AND EXPERIMENTAL SET-UP
The CLASP framework consists of three main components optimized for respiratory audio signal analysis: pre-processing, feature extraction, and neural network-based prediction. Each component is designed to work seamlessly with the others, forming a comprehensive analysis pipeline. The audio signal pre-processing, MFCC computation, dynamic feature computation, and the three models (LSTM, CNN, and ResNet architecture) are discussed in detail in the following sections.
4.1. Audio signal pre-processing
The pre-processing phase converts raw audio signals into a format suitable for analysis through the following steps:
i) Sampling frequency ($f_s$): audio signals are sampled at 22,050 Hz for high resolution.
ii) Window size ($w_s$): a window size of 20 ms corresponds to 441 samples.
iii) Step size ($s_s$): a step size of 10 ms ensures a 50% overlap between consecutive windows.
iv) Fixed number of windows: each segment has 10 consecutive windows, for a total of 110 ms per segment.
To segment the signals for analysis, we apply a windowing function.
Mathematically, the segmentation process is expressed as (1).

$$x_i[n] = x[n + i\,s_s], \quad 0 \le n < N_w \tag{1}$$

Where $x_i[n]$ represents the $i$-th segment, and $N_w$ is the window size in samples.
Spectral leakage at the segment boundaries is minimized by tapering each segment with a Hamming window function, given as (2).

$$w[n] = 0.54 - 0.46\cos\left(\frac{2\pi n}{N_w - 1}\right), \quad 0 \le n < N_w \tag{2}$$
The trade-off between temporal precision and frequency resolution was balanced by a suitable choice of window function and overlap, determined through extensive experimentation. The Hamming window, in particular, was chosen because its low sidelobes reduce spectral leakage compared to rectangular windows, while maintaining good temporal resolution.
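The segmentation in (1) and the taper in (2) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names are hypothetical, and because 10 ms at 22,050 Hz is 220.5 samples, the step is rounded down to 220 samples here as an assumption.

```python
import numpy as np

FS = 22_050                                 # sampling frequency in Hz (from the text)
N_W = int(round(FS * 20 / 1000))            # 20 ms window -> 441 samples
S_S = int(FS * 10 / 1000)                   # 10 ms step -> 220 samples (220.5 rounded down, an assumption)
N_WINDOWS = 10                              # windows per segment (from the text)


def hamming(n_w: int) -> np.ndarray:
    """Hamming window of equation (2)."""
    n = np.arange(n_w)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (n_w - 1))


def frame_signal(x: np.ndarray, n_w: int = N_W, s_s: int = S_S) -> np.ndarray:
    """Build frames x_i[n] = x[n + i*s_s] (equation (1)) and taper each one."""
    if len(x) < n_w:
        raise ValueError("signal is shorter than one window")
    n_frames = 1 + (len(x) - n_w) // s_s
    # Row i holds the sample indices n + i*s_s for 0 <= n < n_w.
    idx = np.arange(n_w)[None, :] + s_s * np.arange(n_frames)[:, None]
    return x[idx] * hamming(n_w)


# Usage: one segment of 10 overlapping windows (441 + 9 * 220 samples, about 110 ms).
segment_len = N_W + (N_WINDOWS - 1) * S_S
x = np.random.default_rng(0).standard_normal(segment_len)
frames = frame_signal(x)
print(frames.shape)
```

Each row of `frames` is one tapered 20 ms window, ready for the STFT step described in the next section.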
4.2. Feature extraction
This section details MFCC computation and dynamic feature computation. MFCC computation involves the STFT, Mel filterbank application, the DCT, and feature composition, whereas dynamic feature computation includes delta (rate of change of the cepstral coefficients) and delta-delta (rate of change of the delta features).

4.2.1. Mel-frequency cepstral coefficients computation
The MFCC computation process, which captures different aspects of the audio signal, is as follows:
i) STFT: the STFT converts each windowed signal into its frequency representation as (3).

$$X_i[k] = \sum_{n=0}^{N_w-1} y_i[n]\, e^{-j 2\pi k n / N} \tag{3}$$

Where $y_i[n]$ represents the windowed signal and $X_i[k]$ gives its frequency components.
ii) Mel filterbank application: a set of Mel-scale filters is applied to the STFT magnitudes. This step maps the frequency components to the Mel scale, which better represents human auditory perception, as (4).

$$S_i[m] = \sum_{k=0}^{N/2} |X_i[k]|^2\, H_m[k], \quad 0 \le m < M \tag{4}$$

Where $H_m[k]$ represents the $m$-th Mel-filter response, $M$ is the total number of Mel-filters in the Mel filterbank, and $S_i[m]$ is the Mel-filtered spectral energy for the $m$-th Mel-filter at time frame $i$.
iii) DCT: the MFCCs are computed using the DCT as (5).

$$c_i[n] = \sum_{m=0}^{M-1} \log\left(S_i[m]\right) \cos\left(\frac{\pi n \left(m + \frac{1}{2}\right)}{M}\right), \quad 0 \le n < 13 \tag{5}$$

Where $c_i[n]$ is the $n$-th MFCC for the $i$-th time frame.
iv) Feature composition: the final feature vector comprises 13 static MFCCs, 13 delta coefficients, and 13 delta-delta coefficients, giving 39 dimensions in total.
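Steps i) to iii) can be traced with a small NumPy sketch that follows (3), (4), and (5) directly. It is an illustration under stated assumptions, not the paper's librosa configuration: the FFT length of 512, the 40 Mel filters, the Hz-to-Mel formula, and all function names are choices made here for the example.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels: int, n_fft: int, sr: int) -> np.ndarray:
    """Triangular filters H_m[k] over the n_fft//2 + 1 non-negative frequency bins."""
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        lo, ctr, hi = hz_pts[m], hz_pts[m + 1], hz_pts[m + 2]
        rising = (freqs - lo) / (ctr - lo)
        falling = (hi - freqs) / (hi - ctr)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb


def mfcc(frames: np.ndarray, sr: int = 22_050, n_fft: int = 512,
         n_mels: int = 40, n_mfcc: int = 13) -> np.ndarray:
    """frames: (T, N_w) windowed frames y_i[n]. Returns (T, n_mfcc) coefficients."""
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)              # equation (3), zero-padded to n_fft
    power = np.abs(spectrum) ** 2
    mel_energy = power @ mel_filterbank(n_mels, n_fft, sr).T     # equation (4)
    log_mel = np.log(mel_energy + 1e-10)                         # small offset avoids log(0)
    n = np.arange(n_mfcc)[:, None]
    m = np.arange(n_mels)[None, :]
    dct_basis = np.cos(np.pi * n * (m + 0.5) / n_mels)           # equation (5)
    return log_mel @ dct_basis.T


# Usage with stand-in frames (10 windows of 441 samples each).
frames = np.random.default_rng(0).standard_normal((10, 441))
coeffs = mfcc(frames)
print(coeffs.shape)
```

In practice a library routine such as the one in librosa would replace this hand-written version; the sketch only makes the three equations concrete.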
4.2.2. Dynamic feature computation
Temporal variations in the audio signal are captured by computing dynamic features (deltas) from the MFCCs, as defined in (6). These features represent the rate of change of the cepstral coefficients and indicate how the spectral characteristics of the audio change over time.

$$\Delta f_t = \frac{\sum_{\theta=1}^{\Theta} \theta \left(f_{t+\theta} - f_{t-\theta}\right)}{2 \sum_{\theta=1}^{\Theta} \theta^2} \tag{6}$$

Where $\Theta = 3$ defines the computation window width, $\theta$ represents the time lag, $f_t$ is the cepstral coefficient at time frame $t$, and $\Delta f_t$ is the first-order derivative of $f_t$. Similarly, the rate of change of the delta features, $\Delta^2 f_t$, is (7).

$$\Delta^2 f_t = \frac{\sum_{\theta=1}^{\Theta} \theta \left(\Delta f_{t+\theta} - \Delta f_{t-\theta}\right)}{2 \sum_{\theta=1}^{\Theta} \theta^2} \tag{7}$$
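Equations (6) and (7) translate into a short NumPy routine. The sketch below is illustrative only: the function names are hypothetical, and the treatment of the first and last frames (repeating the edge frame) is an assumption, since the text does not say how boundaries are handled.

```python
import numpy as np


def delta(features: np.ndarray, big_theta: int = 3) -> np.ndarray:
    """Regression-based delta of equation (6). features has shape (T, D)."""
    n_frames = features.shape[0]
    # Assumption: repeat the edge frames so f_{t-theta} and f_{t+theta} always exist.
    padded = np.pad(features, ((big_theta, big_theta), (0, 0)), mode="edge")
    denom = 2.0 * sum(th * th for th in range(1, big_theta + 1))
    out = np.zeros_like(features, dtype=float)
    for th in range(1, big_theta + 1):
        ahead = padded[big_theta + th: big_theta + th + n_frames]    # f_{t+theta}
        behind = padded[big_theta - th: big_theta - th + n_frames]   # f_{t-theta}
        out += th * (ahead - behind)
    return out / denom


def stack_features(mfcc: np.ndarray) -> np.ndarray:
    """Concatenate static, delta, and delta-delta features: (T, 13) -> (T, 39)."""
    d1 = delta(mfcc)
    d2 = delta(d1)          # equation (7): the same operator applied to the deltas
    return np.hstack([mfcc, d1, d2])


# Usage: a linear ramp has a constant rate of change of 1 away from the edges.
ramp = np.arange(20, dtype=float)[:, None] * np.ones((1, 13))
print(delta(ramp)[3:-3, 0])
print(stack_features(ramp).shape)
```

With $\Theta = 3$ each delta is estimated from a seven-frame neighbourhood, which smooths frame-to-frame noise compared with a simple first difference.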
4.3. Model architectures
4.3.1. Long short-term memory-based architecture
The LSTM model processes sequential MFCC features using bidirectional long short-term memory (BiLSTM) layers followed by an attention mechanism to identify the most important segments in the respiratory audio signals. Each of the two BiLSTM layers is followed by batch normalization and dropout layers. The attention mechanism helps the model focus on important parts of the sequence. The final layers include dense layers and a sigmoid activation for binary classification. The Adam optimizer with a learning rate of 0.001 is used to train the model. The LSTM model architecture with attention mechanism is shown in Figure 3.

Figure 3. LSTM model architecture with attention mechanism
4.3.2. Convolutional neural network-based architecture
The CNN model reshapes sequential MFCC features into an image-like format and applies convolutional blocks to extract spatial and temporal features. This approach treats the time-frequency representation of audio as a 2D image to leverage the pattern recognition strength of CNNs. The architecture consists of three convolutional blocks, each followed by batch normalization, max pooling, and dropout layers to prevent overfitting. For binary classification, the final stage consists of global average pooling, dense layers, and a sigmoid activation. The model is trained using the Adam optimizer with a learning rate of 0.001. The architecture of the CNN model is shown in Figure 4.

Figure 4. CNN model architecture
4.3.3. ResNet-based architecture
The ResNet model uses residual connections to enable deeper networks to learn complex feature representations while avoiding vanishing gradients, effectively capturing subtle respiratory audio patterns that distinguish COPD from other conditions. The architecture begins with a convolutional layer, followed by three stages of residual blocks. Each block includes two convolutional layers, batch normalization followed by rectified linear unit (ReLU) activation, and a shortcut connection. The model concludes with global average pooling, dense layers, and a sigmoid activation. The skip connections in the residual blocks allow the gradient to flow more easily during backpropagation, reducing the vanishing gradient problem. The ResNet architecture is shown in Figure 5 and the ResNet residual block structure is shown in Figure 6, where the solid red arrow shows the projection pathway used when dimensions change (1×1 convolution) and the dashed red arrow shows the direct skip connection used when dimensions match. The model is trained using the Adam optimizer with a learning rate of 0.001 and can be trained with focal loss to address class imbalance.
4.4. Training and evaluation
4.4.1. Dataset
The dataset is taken from Kaggle (respiratory sound database), which contains 920 audio recordings from 126 subjects. It includes samples from healthy individuals and patients with various respiratory conditions, including COPD. This dataset was augmented with signal transformations, resulting in a total of 800 recordings for COPD and Non-COPD, with an 80-20 split between training and testing sets.

4.4.2. Training methodology
This section outlines the training methodology of the CLASP framework. The discussion is organized into three main components, described as follows.
i) Class imbalance handling: the CLASP framework implements three complementary strategies to address class imbalance, a common issue in medical datasets: data augmentation (time-domain augmentation, feature-domain augmentation, and sample mixing), balanced training sets, and focal loss with false negative weighting. Each is described below.
– Data augmentation: the audio processor component applies targeted augmentation to increase minority class representation: time-domain augmentation: time stretching/compression (±10%), pitch shifting (±2 semitones); feature-domain augmentation: spectral masking, random feature scaling, additive Gaussian noise; and sample mixing: linear combination of similar-class samples with random weights.
– Balanced training sets: for each training epoch, the framework dynamically creates balanced mini-batches: minority class samples are upsampled to match majority class frequency; a create balanced subset function ensures equal class representation while maintaining diversity within classes; and random state initialization ensures reproducibility across training runs.
– Focal loss with false negative weighting: a specialized loss function $\mathrm{FL}(p_t)$ is employed to prioritize correct classification of COPD cases as in (8).

$$\mathrm{FL}(p_t) = -\alpha_t \left(1 - p_t\right)^{\gamma} \log\left(p_t\right) \tag{8}$$

Where $p_t$ is the model's estimated probability for the correct class, $\alpha_t$ is the balancing factor, and $\gamma$ is the focusing parameter. For COPD detection specifically, we further modify the loss function to assign a higher penalty to false negatives as in (9).

$$\mathrm{FNFL}(y, \hat{y}) = \begin{cases} \mathrm{FL}(\hat{y}) & \text{if } y = 0 \\ \mathrm{FL}(\hat{y}) \cdot \text{fn\_weight} & \text{if } y = 1 \text{ and } \hat{y} < 0.5 \end{cases} \tag{9}$$

Where fn_weight is set to 3.0 by default, effectively tripling the loss for missed COPD cases.

Figure 5. ResNet architecture
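A per-sample version of the loss in (8) and (9) can be sketched in NumPy as follows. Several details are assumptions made for the example rather than values from the paper: `alpha = 0.25` and `gamma = 2.0` are commonly used focal-loss defaults, $\alpha_t$ is taken as `alpha` for positives and `1 - alpha` for negatives, and correctly detected COPD cases ($y = 1$, $\hat{y} \ge 0.5$), which (9) does not list, are given the unweighted focal loss.

```python
import numpy as np


def fn_weighted_focal_loss(y_true, y_prob, alpha=0.25, gamma=2.0,
                           fn_weight=3.0, eps=1e-7):
    """Per-sample focal loss (8) with the false-negative penalty of (9).

    alpha and gamma are assumed defaults; fn_weight = 3.0 is from the text.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    # p_t: probability the model assigns to the true class.
    p_t = np.where(y_true == 1, y_prob, 1.0 - y_prob)
    alpha_t = np.where(y_true == 1, alpha, 1.0 - alpha)
    focal = -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)      # equation (8)
    # Missed COPD cases: true label 1 but predicted probability below 0.5.
    missed = (y_true == 1) & (y_prob < 0.5)
    return np.where(missed, fn_weight * focal, focal)          # equation (9)


# Usage: a missed COPD case, a detected COPD case, and a correct non-COPD case.
losses = fn_weighted_focal_loss([1, 1, 0], [0.2, 0.9, 0.2])
print(losses)
```

The missed case dominates the batch loss, which is the intended effect: the optimizer is pushed harder to recover false negatives than to refine already confident predictions.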
ii) Threshold optimization: CLASP uses a sensitivity-weighted optimization to find the optimal classification threshold instead of the standard value of 0.5, allowing fine-tuning of the specificity-sensitivity trade-off. A comparison of classification thresholds for each model is shown in Figure 7. The sensitivity–specificity threshold selection method is as follows:
– Multiple candidate thresholds are evaluated on validation data.
– For each threshold, a weighted score is computed as in (10).

$$\mathrm{Score}(\theta) = \frac{\mathrm{sensitivity}(\theta) \cdot w + \mathrm{specificity}(\theta)}{w + 1} \tag{10}$$

Where $w$ is the sensitivity weight (default = 2.0).
– The threshold maximizing this weighted score is selected as optimal.
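The selection procedure built on (10) amounts to a small grid search. The sketch below uses hypothetical function names and an assumed candidate grid from 0.05 to 0.95 in steps of 0.01; the paper does not state which thresholds were tried.

```python
import numpy as np


def weighted_score(y_true, y_prob, threshold, w=2.0):
    """Sensitivity-weighted score of equation (10) at one threshold."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_prob) >= threshold
    tp = np.sum(y_pred & y_true)
    fn = np.sum(~y_pred & y_true)
    tn = np.sum(~y_pred & ~y_true)
    fp = np.sum(y_pred & ~y_true)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (w * sensitivity + specificity) / (w + 1.0)


def select_threshold(y_true, y_prob, candidates=None, w=2.0):
    """Return the candidate threshold with the highest weighted score."""
    if candidates is None:
        candidates = np.arange(0.05, 0.96, 0.01)   # assumed grid, not from the paper
    scores = [weighted_score(y_true, y_prob, t, w) for t in candidates]
    return float(candidates[int(np.argmax(scores))])


# Usage on toy validation data: a low threshold recovers the COPD case scored 0.3.
y_val = [0, 0, 0, 1, 1, 1]
p_val = [0.1, 0.2, 0.6, 0.3, 0.7, 0.9]
best = select_threshold(y_val, p_val, candidates=np.array([0.25, 0.5, 0.65]))
print(best)
```

Because sensitivity counts twice as much as specificity with $w = 2$, the search tends to settle on thresholds below 0.5, which is consistent with the sub-0.5 thresholds reported for all three models.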

Figure 6. ResNet residual block structure

Figure 7. Comparison of classification thresholds
iii) Training configuration: the models are trained with the following configuration:
– Optimizer: Adam with a learning rate of 0.001.
– Batch normalization: applied after each major layer to stabilize training.
– Dropout: applied with increasing rates deeper in the network (20% to 50%).
– Early stopping: based on validation loss with a patience of 10-15 epochs.
– Learning rate reduction: factor of 0.2 when validation loss plateaus.
– Epochs: up to 50 for LSTM and ResNet, 40 for CNN (typically terminating earlier due to early stopping).
– Validation split: 15% of training data.
These configuration choices were selected through extensive experimentation and tuned specifically for respiratory audio classification.
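The interaction between early stopping and learning rate reduction can be traced with a framework-independent sketch. It is only an illustration of the schedule logic: the function name is hypothetical, the learning-rate patience of 5 epochs is an assumption (the text gives only the reduction factor), and the early-stopping patience is set to 10, the lower end of the stated range.

```python
def run_schedule(val_losses, lr=1e-3, factor=0.2, lr_patience=5,
                 stop_patience=10, max_epochs=50):
    """Replay a validation-loss trace and return (epoch, learning_rate) per epoch run.

    lr_patience is an assumed value; lr, factor, stop_patience, and max_epochs
    follow the configuration listed in the text.
    """
    best = float("inf")
    since_best = 0      # epochs since the validation loss last improved
    since_reduce = 0    # epochs without improvement since the last LR reduction
    history = []
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best, since_best, since_reduce = loss, 0, 0
        else:
            since_best += 1
            since_reduce += 1
            if since_reduce >= lr_patience:
                lr *= factor            # plateau detected: shrink the learning rate
                since_reduce = 0
        history.append((epoch, lr))
        if since_best >= stop_patience:
            break                       # early stopping
    return history


# Usage: the loss improves for three epochs and then plateaus.
trace = [1.0, 0.8, 0.7] + [0.7] * 20
history = run_schedule(trace)
print(len(history), history[-1])
```

On this trace the learning rate is reduced twice before training stops, well short of the 50-epoch cap, which mirrors the remark that runs typically terminate early.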
5. RESULTS AND DISCUSSION
The proposed CLASP framework shows clear improvements over existing approaches and is validated through performance evaluations. A systematic comparison of the three deep learning architectures identifies the best-performing model. The following section discusses the precision-recall curve comparison, receiver operating characteristic (ROC) curve, performance metric comparison, training time comparison, and state-of-the-art comparison.
5.1. Model performance comparison
The precision-recall curves for the three models are shown in Figure 8. ResNet has the highest average precision (AP) of 0.994, followed by the LSTM model with an AP of 0.991 and the CNN model with an AP of 0.968. The ROC curves for the LSTM, CNN, and ResNet models are shown in Figure 9, which again shows the ResNet model with the highest AUC of 0.994, outperforming LSTM with an AUC of 0.988 and CNN with an AUC of 0.958. The performance comparison of the three models using accuracy, recall, precision, F1-score, specificity, and AUC as metrics is given in Figure 10 and Table 1. The ResNet architecture demonstrates exceptional diagnostic capability with an accuracy of 97.90%, significantly exceeding the performance of existing approaches, which typically achieve 82% to 93% accuracy. The precision of 98.72% indicates that very few positive COPD identifications are false alarms, while the recall of 96.86% ensures minimal missed cases, a critical consideration for clinical screening applications where false negatives have severe consequences for patient outcomes.
The training time comparison of the three models is shown in Figure 11, with ResNet taking the longest training time of 32.2 seconds, compared with 11.3 seconds for CNN and 10.9 seconds for LSTM. Considering the other metrics (Table 1), this additional time is negligible given the substantial accuracy improvements and the non-real-time nature of diagnostic screening applications. A comparison between the proposed ResNet architecture and existing approaches that use only audio signals as input to their model and achieve accuracies above 90% is shown in Figure 12. The accuracy of 93% is from [17], 92.86% from [24], and 95% from [25].
COPD prediction model evaluation: the confusion matrix values for the LSTM, CNN, and ResNet models provide valuable insights into the classification performance of each model and are as follows:
– LSTM: true negatives: 171, false positives: 3, false negatives: 7, and true positives: 152.
– CNN: true negatives: 170, false positives: 4, false negatives: 20, and true positives: 139.
– ResNet: true negatives: 172, false positives: 2, false negatives: 5, and true positives: 154.
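The headline metrics follow directly from these confusion matrix counts. The short sketch below (the function name is illustrative) recomputes them; for ResNet it reproduces the reported 97.90% accuracy, 98.72% precision, 96.86% recall, and 98.85% specificity.

```python
def metrics(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Standard binary classification metrics from confusion matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # also called sensitivity
    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Counts as listed above, in the order (TN, FP, FN, TP).
confusion = {
    "LSTM": (171, 3, 7, 152),
    "CNN": (170, 4, 20, 139),
    "ResNet": (172, 2, 5, 154),
}

for name, counts in confusion.items():
    as_percent = {k: round(100 * v, 2) for k, v in metrics(*counts).items()}
    print(name, as_percent)
```

The CNN's 20 false negatives, against 5 for ResNet, account for most of the gap in recall between the two models.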
During validation, the optimal threshold values obtained are 0.15 for LSTM, 0.29 for CNN, and 0.36 for ResNet. The learning curves for all three models showed appropriate convergence behavior, with validation metrics closely tracking training metrics, suggesting good generalization without significant overfitting. The ResNet model in particular demonstrated excellent stability in both loss and accuracy metrics during the training process. The 0.9937 AUC score indicates near-perfect discriminative capability, while the balanced precision-recall characteristics provide favorable trade-offs for clinical deployment. Based on this comprehensive evaluation, the ResNet architecture is recommended for deployment in the CLASP system despite its computational requirements, as it provides the best diagnostic accuracy and reliability for clinical applications, with 96.86% recall and 98.85% specificity, which is important for minimizing missed cases in clinical use.

Figure 8. Precision-recall curve comparison

Figure 9. ROC curve comparison