IAES International Journal of Artificial Intelligence (IJ-AI)
Vol. 15, No. 1, February 2026, pp. 257∼268
ISSN: 2252-8938, DOI: 10.11591/ijai.v15.i1.pp257-268
Benchmarking machine learning models for natural disaster prediction with synthetic IoT data
Moath Alsafasfeh¹,², Abdullah Alhasanat¹, Atheer Bassel³, Mohanad Alhasanat¹
¹Department of Computer Engineering, College of Engineering, Al-Hussein Bin Talal University, Ma’an, Jordan
²Department of Electrical and Computer Engineering, College of Engineering, Tuskegee University, Tuskegee, United States
³Department of Artificial Intelligence, College of Computer Science and Technology, University of Anbar, Anbar, Iraq
Article Info

Article history:
Received Oct 2, 2025
Revised Dec 29, 2025
Accepted Jan 22, 2026

Keywords:
Disaster resilience
Early warning systems
Ensemble learning
Extreme weather events
Internet of things
Natural disaster prediction
Synthetic data
ABSTRACT

Natural disasters pose severe threats to human life and infrastructure, demanding robust early warning systems (EWS) supported by machine learning (ML) and internet of things (IoT)-based sensing. This study benchmarks ML models for predicting floods and earthquakes using synthetic IoT sensor data. A dataset comprising nine environmental and seismic parameters was generated and labeled into three classes: no disaster, flood, and earthquake, with feature preprocessing applied during model training. Logistic regression (LR), random forest (RF), and extreme gradient boosting (XGBoost) models were trained and evaluated using accuracy, precision, recall, and F1-score. Experimental results on the World-A test set show that ensemble models consistently outperform LR, with XGBoost and RF achieving F1-scores of up to 97% and 99%, respectively, compared to 79% for LR. An independent test on the separately generated World-B dataset revealed that the ensemble models maintained higher generalization capability, with F1-scores of 80% for XGBoost and 78% for RF. In contrast, LR showed substantial degradation, with an F1-score of 54%. Stress testing on the World-B dataset under simulated conditions, such as sensor failures, noise injection, and extreme weather events, confirms the resilience of the ensemble models in comparison to LR. These results demonstrate the usefulness of ensemble learning in handling unpredictable IoT data for disaster prediction and highlight its potential integration into intelligent EWS. Future work will focus on expanding the framework to include cross-time prediction, incorporating additional environmental features, and deploying the models in real-time IoT systems for field validation.

This is an open access article under the CC BY-SA license.
Corresponding Author:
Moath Alsafasfeh
Department of Computer Engineering, College of Engineering, Al-Hussein Bin Talal University
Ma’an, Jordan
Email: malsafasfeh@tsukegee.edu
1. INTRODUCTION
Natural disasters are serious adverse events of geophysical, hydrological, climatological, or meteorological origin that threaten human life, infrastructure, and the environment [1], [2]. Floods and earthquakes are two of the most damaging hazards, accounting for a large proportion of disaster-related deaths and economic losses worldwide [1]. The frequency of such incidents has increased dramatically in recent decades, emphasizing the need for reliable prediction and response systems. Early warning systems (EWS) are designed to deliver timely alerts through hazard monitoring, forecasting, and risk assessment [3]. However,
by 2020, only a small minority of countries had operable multi-hazard EWS [4]. Improving these systems remains a global goal. Artificial intelligence (AI) provides promising solutions for disaster prediction by analyzing large, complex datasets and detecting patterns that precede extreme events [5]. Machine learning (ML) models, in particular, have the potential to increase disaster prediction accuracy and facilitate proactive decision-making. Despite these advances, the dependability of ML and deep learning (DL) algorithms must be extensively tested, especially under conditions of data scarcity, noise, or sensor failure. Furthermore, evaluation of both the datasets and the prediction models is critical to ensuring reliable outcomes. This study addresses these gaps by benchmarking ML models for flood and earthquake prediction using synthetic internet of things (IoT) sensor data.

A wide range of IoT-based systems have been developed for natural disaster prediction and management, typically combining sensor networks with ML models for early warning. An IoT-based framework integrating multi-sensor data with neural networks, decision trees (DTs), and random forest (RF) demonstrated strong decision-making capabilities but faced scalability and interoperability challenges [6]. Several papers highlight the importance of applying ML and DL for disaster detection and outline key research directions [7]–[9]. A comparative study in [10] benchmarked ML and DL models, showing that convolutional neural networks (CNNs) and hybrid deep networks outperform others, while RF and extreme gradient boosting (XGBoost) remained competitive for smaller datasets; no single algorithm proved universally optimal. For flood prediction, IoT and ML approaches have used water-level, rainfall, and humidity sensors with models such as RF, long short-term memory (LSTM), and CNN, achieving accuracies between 80-95% [11]–[13]. Ensemble RF-LSTM hybrids reached 81% accuracy on the tested data [14]. Rezvani et al. [15] apply geospatial AI to flood hotspot detection in Portugal, producing susceptibility maps with 96% accuracy, though limited by reliance on historical datasets. According to Anbarasan et al. [16], a convolutional deep neural network (CDNN)-based system combining IoT sensors and big data achieved 93.23% accuracy, outperforming artificial neural network (ANN) and deep neural network (DNN) baselines, but was validated mainly on simulated data. For earthquake prediction, accelerometer-based IoT frameworks analyzed seismic vibrations using support vector machine (SVM) and DTs, with SVM reaching 95% accuracy [17]. Mukherjee et al. [18] extracted 61 seismic features from the Himalayan Belt, finding ANN and XGBoost most accurate across longer prediction windows. According to Rosca and Stancu [19], an IoT–cloud system using 8,766 seismic and meteorological records achieved 99.84% accuracy, though limited to the Vrancea region. Kubo et al. [20] review ML applications in seismology, noting strong advances in detection and catalog completion but persistent issues with generalizability and interpretability. For landslides, IoT sensors monitoring soil moisture, slope, and rainfall combined with ML models such as XGBoost and LSTM achieved over 95% accuracy [21]–[23], while U-Net improved susceptibility mapping with satellite imagery [12]. The study in [24] proposed a low-cost IoT-ML framework in Vietnam, where RF achieved strong predictive performance and enabled real-time alerts, though generalizability and device maintenance remain concerns. On the other hand, smart city-oriented systems have integrated IoT, ML, and cloud computing for multi-hazard monitoring, though challenges persist in scalability, data transmission, and privacy [25]. Overall, prior work demonstrates the potential of IoT-ML integration for disaster prediction, but most studies rely on limited real-world datasets and focus on single hazards. Few have systematically benchmarked multiple models under controlled conditions. This paper addresses that gap by generating a synthetic IoT dataset for flood and earthquake scenarios and conducting a comparative robustness evaluation.
2. METHOD
The paper introduces the development and evaluation of an ML-based system for natural disaster prediction. The approach involves generating two synthetic datasets, World-A and World-B, labeling disaster events, and training multiple classification models to detect floods and earthquakes based on environmental and seismic features.
2.1. Dataset-A preparation
We generate a synthetic environmental time series to train models for predicting floods and earthquakes. The data span 01-Jan-2024 to 01-Jan-2025 at hourly resolution (8,784 timestamps) and include three feature groups: i) meteorology: temperature, humidity, wind speed, rainfall; ii) hydrology: water level, flow rate; and iii) seismology: magnitude, depth, frequency. Table 1 summarizes the distribution type and
the physical interpretation for each feature. In natural disaster prediction tasks, model robustness is a critical concern due to the inherent rarity and imbalance of high-severity events. To address this, the data-generation process explicitly simulates rare extreme conditions rather than relying on post hoc data augmentation.
Table 1. Summary of feature generation for synthetic dataset

Feature | Symbol | Distribution / model | Physical interpretation

Meteorological variables:
Temperature | T_i | Gaussian (µ = 25, σ = 5), clipped [−5, 45] | Represents air temperature with daily and seasonal variation.
Humidity | H_i | Derived from T_i with inverse relation; Gaussian noise σ = 3 | Relative humidity inversely related to temperature, bounded between 20–100%.
Wind speed | W_i | Rayleigh (σ = 3.5) with added Gamma bursts during storms; capped at 28 m/s | Background wind speed with convective gusts during storm episodes.
Rainfall | R_i | Gamma (shape = 1.8, scale = 4.0) mixed with Bernoulli rain occurrence; seasonal and diurnal modulation | Hourly precipitation influenced by time of year and time of day.

Hydrological variables:
Water level | L_i | Recursive (AR-1) with α = 0.965 and rainfall-dependent response | Accumulates rainfall and decays slowly over time, capturing antecedent wetness.
Flow rate | Q_i | Power-law Q ∝ (L − H_0)^1.6 with Gaussian noise σ = 6 | River discharge rises nonlinearly with water level, representing runoff response.

Seismic variables:
Magnitude | M_i | Exponential (λ = 1.0), bounded [0, 6.5] | Earthquake magnitude approximates a Gutenberg–Richter-like exponential decay (small frequent, large rare).
Depth | D_i | Normal (µ = 12, σ = 6), clipped [0.5, 50] | Depth of seismic events spanning shallow to intermediate crustal regions.
Frequency | F_i | Poisson with rate λ decreasing exponentially with M_i and D_i | Expected hourly count of seismic events inversely related to magnitude and depth.
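To make the generation scheme in Table 1 concrete, the following Python sketch draws the nine features for one year of hourly timestamps. The distribution families and bounds follow Table 1; the coupling coefficients, rain probability, and the omission of seasonal/diurnal modulation are simplifications assumed for illustration, not the authors' exact generator.

```python
import numpy as np
import pandas as pd

def generate_world(rng):
    """Draw one year of hourly synthetic IoT readings loosely following Table 1."""
    idx = pd.date_range("2024-01-01", "2025-01-01", freq="h", inclusive="left")
    n = len(idx)                                   # 8,784 hourly timestamps (leap year)

    # Meteorology: Gaussian temperature, humidity inversely tied to temperature,
    # Rayleigh wind, Gamma rainfall gated by a Bernoulli wet/dry indicator
    temp = np.clip(rng.normal(25, 5, n), -5, 45)
    humid = np.clip(100 - 1.5 * temp + rng.normal(0, 3, n), 20, 100)
    wind = np.minimum(rng.rayleigh(3.5, n), 28)
    rain = rng.binomial(1, 0.25, n) * rng.gamma(1.8, 4.0, n)

    # Hydrology: AR(1) water level fed by rainfall, power-law flow rate with noise
    level = np.zeros(n)
    for t in range(1, n):
        level[t] = 0.965 * level[t - 1] + 0.2 * rain[t]
    flow = level ** 1.6 + rng.normal(0, 6, n)

    # Seismology: exponential magnitude, normal depth, Poisson event count whose
    # rate decays with magnitude and depth
    mag = np.clip(rng.exponential(1.0, n), 0, 6.5)
    depth = np.clip(rng.normal(12, 6, n), 0.5, 50)
    freq = rng.poisson(np.exp(2.0 - 0.3 * mag - 0.02 * depth))

    return pd.DataFrame({"temperature": temp, "humidity": humid, "wind_speed": wind,
                         "rainfall": rain, "water_level": level, "flow_rate": flow,
                         "magnitude": mag, "depth": depth, "frequency": freq}, index=idx)

world_a = generate_world(np.random.default_rng(42))   # fixed SEED = 42, as in Table 2
```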
2.2. Dataset labeling
The World-A dataset is split into training and testing subsets of 80% and 20%, respectively. Samples in both subsets are assigned class labels based on predefined threshold values indicating potentially hazardous conditions. The empirical selection method is used to determine threshold values that maintain the physical realism of the synthetic disaster simulation, ensuring that both flood and earthquake scenarios appropriately represent plausible environmental dynamics. To eliminate conflicts, Algorithm 1 assigns labels with earthquake priority. Labels begin at class 0 (no disaster) and progress to class 2 (earthquake) when low seismic frequency, high magnitude, and shallow depth are all met. Remaining unlabeled samples are classified as class 1 (flood) if flow rate, wind speed, and humidity exceed the threshold values. This sequential rule-based scheme cleanly separates hazards while prioritizing the rarer, high-severity earthquakes.
Algorithm 1. Disaster labeling logic based on thresholds
1: Initialize all labels as no disaster ← 0
2: for each row in the dataset do
3:   if seismic frequency < threshold and seismic depth < threshold and seismic magnitude > threshold then
4:     Assign label 2 (earthquake)
5:   else if Disaster == 0 and flow rate > threshold and wind speed > threshold and humidity > threshold then
6:     Assign label 1 (flood)
7:   else
8:     Keep label as 0 (no disaster)
9:   end if
10: end for
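A vectorized Python rendering of Algorithm 1 is sketched below. The threshold values are placeholders chosen for illustration; the paper's empirically selected thresholds are not reproduced here.

```python
import numpy as np

# Placeholder thresholds (illustrative only, not the study's empirical values)
EQ_FREQ_MAX, EQ_DEPTH_MAX, EQ_MAG_MIN = 2, 15.0, 4.5
FLOOD_FLOW_MIN, FLOOD_WIND_MIN, FLOOD_HUMID_MIN = 60.0, 12.0, 85.0

def label_disasters(df):
    """0 = no disaster, 1 = flood, 2 = earthquake; earthquakes are assigned first."""
    labels = np.zeros(len(df), dtype=int)                     # step 1: all no-disaster
    quake = ((df["frequency"] < EQ_FREQ_MAX) &
             (df["depth"] < EQ_DEPTH_MAX) &
             (df["magnitude"] > EQ_MAG_MIN))
    labels[quake.to_numpy()] = 2                              # steps 3-4: earthquake priority
    flood = ((df["flow_rate"] > FLOOD_FLOW_MIN) &
             (df["wind_speed"] > FLOOD_WIND_MIN) &
             (df["humidity"] > FLOOD_HUMID_MIN) &
             (labels == 0))
    labels[flood.to_numpy()] = 1                              # steps 5-6: flood on remaining rows
    return labels

# Usage on a frame with the Table 1 columns: df["disaster"] = label_disasters(df)
```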
Twelve predictors are used: 9 numeric (temperature, humidity, wind speed, seismic magnitude, seismic depth, seismic frequency, water level, rainfall, and flow rate) and 3 categorical (hour, day of week,
and month). Numeric features are normalized using min–max scaling, while categorical features are one-hot encoded with the first category dropped. All preprocessing is wrapped in a unified column transformer fit only on A-train, then applied to A-test to prevent leakage.
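This preprocessing maps directly onto scikit-learn's ColumnTransformer. The sketch below is a minimal version assuming column names matching the feature list above; as described, it is fit on the World-A training split only and then reused unchanged for the test data.

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

numeric_cols = ["temperature", "humidity", "wind_speed", "magnitude", "depth",
                "frequency", "water_level", "rainfall", "flow_rate"]
categorical_cols = ["hour", "day_of_week", "month"]

# Min-max scaling for the 9 numeric features, one-hot encoding (first category
# dropped) for the 3 temporal/categorical features
preprocessor = ColumnTransformer([
    ("num", MinMaxScaler(), numeric_cols),
    ("cat", OneHotEncoder(drop="first"), categorical_cols),
])

# Fit on A-train only, then reuse the fitted transformer for A-test and World-B:
# X_train_t = preprocessor.fit_transform(X_train)
# X_test_t  = preprocessor.transform(X_test)
```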
2.3. Models training
Three ML models, logistic regression (LR), RF, and XGBoost, are subsequently trained for multiclass disaster classification, distinguishing between no-disaster, flood, and earthquake events. All preprocessing steps are integrated within a training-only pipeline, and no imputation is applied unless missing values are intentionally introduced during robustness experiments.
2.3.1. Logistic regression
LR can be used to classify input data into multiple classes. In this study, we assign class 0 to non-disaster, class 1 to flood, and class 2 to earthquake, as explained in Algorithm 1. Since there are three possible outcomes, multiclass LR is applied using the softmax function, as shown in Algorithm 2, to calculate the probability that an input corresponds to each class. The probability for class j given input x is calculated using (1), where x ∈ R^d is the input feature vector, θ_j ∈ R^d is the weight vector for class j, and θ_j^⊤ x is the dot product between the weights and the input features. The denominator ensures that the probabilities over all classes sum to 1.

P(y = j \mid x) = \frac{e^{\theta_j^\top x}}{e^{\theta_0^\top x} + e^{\theta_1^\top x} + e^{\theta_2^\top x}}, \quad j \in \{0, 1, 2\}    (1)
Algorithm 2. Multiclass logistic regression using softmax
Require: Dataset {(x^(i), y^(i))}_{i=1}^N, learning rate α, iterations T
1: Initialize weight vectors θ_0, θ_1, ..., θ_{K−1} ∈ R^d
2: for t = 1 to T do
3:   for each sample (x^(i), y^(i)) do
4:     Compute logits: z_j = θ_j^⊤ x^(i)
5:     Compute probabilities: p_j = e^{z_j} / Σ_{k=0}^{K−1} e^{z_k}
6:     for each class j do
7:       Update weights: θ_j ← θ_j − α (p_j − 𝟙{y^(i) = j}) x^(i)
8:     end for
9:   end for
10: end for
11: Prediction: ŷ = arg max_j p_j
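A compact NumPy rendering of Algorithm 2 (per-sample gradient updates on the softmax cross-entropy loss) is sketched below. The learning rate, iteration count, and plain SGD are illustrative choices; in practice the same model could also be fit with scikit-learn's multinomial LogisticRegression.

```python
import numpy as np

def softmax_lr_fit(X, y, num_classes=3, lr=0.1, iters=50, seed=0):
    """Train multiclass logistic regression with per-sample SGD (Algorithm 2)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros((num_classes, d))                # one weight vector per class
    for _ in range(iters):
        for i in rng.permutation(n):                  # visit samples in random order
            z = theta @ X[i]                          # logits z_j = theta_j^T x
            p = np.exp(z - z.max())
            p /= p.sum()                              # softmax probabilities
            onehot = np.eye(num_classes)[y[i]]        # indicator 1{y = j}
            theta -= lr * np.outer(p - onehot, X[i])  # gradient step for every class
    return theta

def softmax_lr_predict(theta, X):
    return np.argmax(X @ theta.T, axis=1)             # y_hat = argmax_j p_j
```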
2.3.2. Random forest
RF enhances prediction accuracy and mitigates overfitting by combining multiple DTs into an ensemble. Initially, multiple random subsets are generated from the training dataset using bootstrap sampling, and each subset is used to independently train a DT. During tree construction, a random selection of features is taken at each node split, and the best feature among them is chosen using impurity reduction criteria. This randomization minimizes correlation between individual trees, enhancing the model's generalization ability. During prediction, each tree votes for a class label, and the final prediction is determined by majority voting across all trees. Algorithm 3 explains the main process of the RF algorithm, where each h_t(x) is a DT trained independently. Bootstrap sampling ensures diversity in the data, random feature selection at each node reduces tree correlation, and the final prediction is made by majority vote in classification tasks.
2.3.3. XGBoost
XGBoost is an ensemble learning algorithm that builds a strong classifier by sequentially combining multiple DTs. Unlike RF, where trees are trained independently, XGBoost trains each new tree to correct the errors made by the previous ensemble. XGBoost begins with an initial prediction of 0, then calculates the first derivative (gradient) and the second derivative (Hessian) of the loss function with respect to the predictions.
These are used to fit a new DT that predicts the residual errors. The output of each new tree is scaled by a learning rate η and added to the previous prediction. This process is repeated over multiple boosting rounds. Algorithm 4 outlines the main steps of XGBoost, where ℓ is the loss function to be minimized, Ω(f_t) is a regularization term that penalizes tree complexity to prevent overfitting, and η ∈ (0, 1] is the learning rate that controls the contribution of each tree. At each iteration, the model fits a new tree f_t(x) to minimize a regularized objective function using a second-order Taylor approximation. The predictions are updated in step 6.
Algorithm 3. Random forest classifier
Require: Training data D, number of trees T, number of features per split m
Ensure: Final prediction ŷ
1: for t = 1 to T do
2:   Draw a bootstrap sample D_t from D (sample with replacement)
3:   Train a DT h_t(x) on D_t
4:   for each split node in the tree do
5:     Select m random features from the d total features
6:     Choose the best feature among the m based on impurity reduction
7:   end for
8: end for
9: For an unseen input x, aggregate predictions: ŷ = mode{h_t(x)}_{t=1}^T
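In practice, the bagging-and-voting procedure of Algorithm 3 is provided by scikit-learn's RandomForestClassifier. The hyperparameter values below (number of trees, features per split) are illustrative assumptions, since the paper does not list the exact settings.

```python
from sklearn.ensemble import RandomForestClassifier

# Bootstrap sampling (bootstrap=True) and random feature selection at each split
# (max_features="sqrt") mirror Algorithm 3; the class prediction is the majority vote.
rf_clf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                bootstrap=True, random_state=42)
# rf_clf.fit(X_train_t, y_train); y_pred = rf_clf.predict(X_test_t)
```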
Algorithm 4. XGBoost classifier
Require: Training data D, number of boosting rounds T, learning rate η
Ensure: Final prediction model ŷ(x)
1: Initialize predictions: ŷ_i^(0) ← 0 for all i
2: for t = 1 to T do
3:   Compute gradients: g_i = ∂ℓ(y_i, ŷ_i^(t−1)) / ∂ŷ_i^(t−1)
4:   Compute Hessians: h_i = ∂²ℓ(y_i, ŷ_i^(t−1)) / ∂(ŷ_i^(t−1))²
5:   Fit a regression tree f_t(x) to minimize: Σ_{i=1}^n [ g_i f_t(x_i) + (1/2) h_i f_t(x_i)² ] + Ω(f_t)
6:   Update prediction: ŷ_i^(t) ← ŷ_i^(t−1) + η · f_t(x_i)
7: end for
8: return Final model: ŷ(x) = Σ_{t=1}^T η · f_t(x)
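The boosting loop of Algorithm 4 corresponds to the standard xgboost scikit-learn API. The hyperparameters shown (learning rate η, tree depth, number of rounds, regularization) are illustrative assumptions rather than the study's tuned values.

```python
from xgboost import XGBClassifier

# Each boosting round fits a tree to the gradients/Hessians of the loss and adds
# its output scaled by the learning rate eta (Algorithm 4, step 6); the multiclass
# objective is selected automatically for the three label values.
xgb_clf = XGBClassifier(n_estimators=300, learning_rate=0.1,
                        max_depth=6, reg_lambda=1.0, eval_metric="mlogloss")
# xgb_clf.fit(X_train_t, y_train); y_pred = xgb_clf.predict(X_test_t)
```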
2.4. Generating the independent test World-B
To confirm the generalization of the trained models, an independent test is performed to determine their robustness for natural disaster classification. The World-B test set is created synthetically, using the same feature schema and temporal structure as the World-A dataset but incorporating new stochastic realizations for all variables. Table 2 shows the key differences between the World-A and World-B datasets. This approach ensures that the models are exposed to unseen environmental and seismic conditions while maintaining consistent physical relationships between features.
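Under this setup, producing World-B amounts to rerunning the same generator with a fresh seed and reapplying the fixed labeling thresholds. A brief sketch, reusing the hypothetical generate_world and label_disasters helpers from the earlier snippets (the seed value is illustrative):

```python
import numpy as np

# New stochastic realization, identical schema and labeling thresholds as World-A
world_b = generate_world(np.random.default_rng(7))
world_b["disaster"] = label_disasters(world_b)
```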
2.5. Experimental setup for evaluating machine learning under IoT inefficiencies
A controlled robustness testing approach was used to assess the trained models' stability and generalization capability across different environmental and sensor conditions. The evaluation utilized the
independent World-B dataset, maintaining its original labels to ensure independent and consistent ground truth across disruption situations. Three stress scenarios were designed to replicate realistic disturbances in IoT-based monitoring systems: i) sensor failure simulation: a portion of sensor readings was randomly halted, simulating temporary device failures or communication outages, and the absent values were substituted using a hold-last-value (HLV) approach; ii) noisy sensor simulation: additive Gaussian noise and low-frequency drift were injected into continuous features to imitate sensor calibration drift or interference; and iii) extreme weather simulation: meteorological and hydrological variables were scaled to reflect storms, including rainfall, humidity, wind speed, water level, and flow rate. In contrast, seismic features remained unchanged, as these events are primarily environmental rather than tectonic. To ensure statistical reliability, each scenario was repeated at five severity levels (10%-50%) using five random seeds. This multi-seed resilience examination provides a quantitative assessment of each algorithm's response to sensor degradation and extreme environmental variation. By isolating disruptions from label generation, the framework assesses model-only robustness, ensuring that any reported performance changes are due to instability at the feature level rather than fluctuations in the target.
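The three disturbance generators can be expressed as simple transformations of the World-B feature matrix. The sketch below is a minimal version, assuming a pandas DataFrame with the Table 1 columns; the drift amplitude and the storm scaling factor are illustrative values, not the study's exact settings.

```python
import numpy as np
import pandas as pd

def sensor_failure(df, frac, rng):
    """Randomly blank a fraction of readings and hold the last valid value (HLV)."""
    fail = pd.DataFrame(rng.random(df.shape) < frac, index=df.index, columns=df.columns)
    return df.mask(fail).ffill().bfill()        # backfill only covers a gap at the very start

def noisy_sensors(df, frac, rng, drift_amp=0.05):
    """Additive Gaussian noise plus a slow sinusoidal drift on each continuous feature."""
    out = df.copy()
    t = np.arange(len(df))
    for col in df.columns:
        scale = frac * df[col].std()
        drift = drift_amp * scale * np.sin(2 * np.pi * t / len(df))
        out[col] = df[col] + rng.normal(0, scale, len(df)) + drift
    return out

def extreme_weather(df, frac):
    """Scale meteorological/hydrological variables upward; seismic features stay untouched."""
    out = df.copy()
    weather_cols = ["rainfall", "humidity", "wind_speed", "water_level", "flow_rate"]
    out[weather_cols] = out[weather_cols] * (1 + frac)
    return out

# Severity sweep: five levels (10%-50%) and five random seeds each, labels left unchanged
# for frac in (0.1, 0.2, 0.3, 0.4, 0.5):
#     for seed in range(5):
#         X_pert = sensor_failure(X_world_b, frac, np.random.default_rng(seed))
```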
Table 2. Comparison between World-A and World-B datasets

Aspect | World-A dataset | World-B dataset
Purpose | Model training and in-domain evaluation | Independent testing for generalization
Data source | Generated once using fixed random seed (SEED = 42) | Regenerated independently using a new random seed
Feature schema | 9 numeric + 3 temporal/categorical features | Identical schema replicated for comparability
Data distribution | Shares same stochastic patterns as training data | Different random realizations and temporal sequences
Thresholds for labeling | Fixed physical thresholds (same across worlds) | Same thresholds reused to preserve label semantics
Evaluation type | In-distribution evaluation | Out-of-sample evaluation
Expected behavior | High performance due to shared distribution | Lower but more realistic performance under new conditions
3. RESULTS AND DISCUSSION
The performance of the trained ML models on the labeled disaster dataset is evaluated. The models are assessed based on their classification accuracy, and their robustness is analyzed under various simulated real-world conditions, including sensor noise and data irregularities.
3.1. Dataset generation
A Pearson correlation heatmap was constructed using all continuous variables to evaluate the internal consistency of the synthetic dataset features. The heatmap in Figure 1 shows physically coherent relationships between the meteorological, hydrological, and seismic features. The meteorological and hydrological features have a moderate positive correlation, while humidity shows a strong negative correlation with temperature. Within the seismic subsystem, there is a moderate negative correlation between seismic magnitude and seismic frequency, reflecting the inverse relationship built into the synthetic dataset generation process. The correlation heatmap in Figure 1 thus validates both the realism and the internal consistency of the synthetically generated dataset. These correlation patterns were produced by a stochastic-empirical generation process designed to mimic the natural behavior of IoT sensors capturing flood and earthquake parameters. This reliance on a synthetic correlation structure is one of the limitations of this study, as real-world data would provide more accurate readings and feature correlations.
3.2. Evaluation of machine learning models performance
The performance of the trained models is assessed using four classification metrics: accuracy, precision, recall, and F1-score, as shown in Table 3. The number of testing samples in the World-A dataset is 1,757 across three classes: 1,690 for the no-disaster class, 34 for the flood class, and 33 for the earthquake class. The accuracy metric measures the proportion of correctly classified samples among all predictions, while precision evaluates the proportion of true positives among all positive predictions for each class. The recall metric shows how many of the actual events have been
detected successfully. The F1-score combines precision and recall into one metric to evaluate the models when dealing with an imbalanced dataset. Since the problem involves multiclass classification (no disaster, flood, and earthquake), the metrics were computed using macro-averaging to treat all classes equally, regardless of sample count.
Figure 1. Dataset feature correlation heatmap
Table 3. Classification metrics, formulas, and brief definitions

Metric | Formula | Definition
Accuracy | (TP + TN) / (TP + TN + FP + FN) | Proportion of correctly classified samples among all predictions.
Precision_i | TP_i / (TP_i + FP_i) | Share of predicted positives for class i that are truly positive.
Recall_i | TP_i / (TP_i + FN_i) | Share of actual positives for class i that are correctly detected.
F1_i | 2 · Precision_i · Recall_i / (Precision_i + Recall_i) | Harmonic mean of precision and recall for class i; robust to imbalance.
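With predictions in hand, the macro-averaged metrics in Table 3 can be computed directly with scikit-learn. The snippet below is a minimal sketch assuming y_test and y_pred arrays with labels {0, 1, 2}.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

def report(y_test, y_pred):
    """Macro averaging weights the three classes equally, regardless of sample count."""
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision_macro": precision_score(y_test, y_pred, average="macro", zero_division=0),
        "recall_macro": recall_score(y_test, y_pred, average="macro", zero_division=0),
        "f1_macro": f1_score(y_test, y_pred, average="macro", zero_division=0),
        "confusion_matrix": confusion_matrix(y_test, y_pred, labels=[0, 1, 2]),
    }
```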
Table 4 summarizes the computed evaluation metrics for the three implemented models: LR, RF, and XGBoost. The results clearly demonstrate the superior performance of the ensemble-based models over the linear baseline. RF and XGBoost achieved near-perfect scores across all metrics, showing their robustness in handling the complex and nonlinear relationships present in the disaster dataset and their resilience to class imbalance. In contrast, LR performed significantly worse, particularly in precision, demonstrating a higher tendency to incorrectly classify non-disaster instances as flood or earthquake events. This underperformance highlights the limitations of linear models in disaster classification scenarios where feature interactions are inherently nonlinear and overlapping.
Table 4. Evaluation metrics for disaster classification models (based on test set)

Model | Accuracy (%) | Precision (%) (Macro) | Recall (%) (Macro) | F1-score (%) (Macro)
LR | 96.8 | 69.7 | 97.9 | 79.6
RF | 99.9 | 99.0 | 99.0 | 99.0
XGBoost | 99.7 | 96.2 | 98.0 | 97.0
Figure 2 shows the confusion matrices for the three models. The ensemble models, RF and XGBoost, show ideal performance for the classification of the three classes, while LR produces false alarms for the no-disaster class. However, these ideal results arise because the models replicate the deterministic threshold logic used for labeling rather than achieving true generalization. As a result, a further evaluation on the independent World-B dataset is required to test predictive robustness.
Figure 2. Confusion matrices for the World-A test set
3.3. Validation of the trained models using independent test World-B
The predictive performance of the three trained models was further evaluated using the independently generated World-B dataset. The purpose of this evaluation is to assess the ability of the models to generalize beyond the statistical distributions of the training data in World-A. The World-B test provides a rigorous independent assessment of robustness under distributional shift: the identical feature schema and fixed labeling thresholds are preserved while the underlying data-generation parameters are modified. The generated World-B dataset consists of 7,830 no-disaster, 870 flood, and 84 earthquake samples. Table 5 presents the performance metrics for the three trained models. XGBoost attains the highest F1-score, with RF slightly lower, which confirms the ability of the ensemble models to generalize for natural disaster prediction. On the other hand, LR shows a notable decline in predictive performance, particularly in precision and F1-score, revealing its limited capacity to capture the nonlinear feature interactions inherent in environmental and seismic systems. Figure 3 confirms that the ensemble models outperform the linear model, demonstrating the ability of XGBoost and RF to adapt to complex and nonlinear data distributions.
Table 5. World-B model comparison (macro metrics)

Model | Accuracy (%) | Precision macro (%) | Recall macro (%) | F1 macro (%)
LR | 70.6 | 50.6 | 83.7 | 54.6
RF | 89.0 | 78.5 | 78.1 | 78.3
XGBoost | 85.2 | 78.2 | 91.1 | 80.5
Figure 4 shows the confusion matrices that visualize the prediction outcomes of the three models on the independent World-B dataset. The LR model exhibits high false-positive rates, in particular misclassifying a large number of no-disaster samples as flood, indicating its sensitivity to overlapping meteorological features. LR's true positives for flood and earthquake remain acceptable, but with substantial confusion across classes. In
contrast, the RF model achieves stronger diagonal dominance, with the majority of no-disaster and earthquake samples correctly classified and moderate confusion between flood and no-disaster. The XGBoost model shows the most balanced behavior, minimizing false negatives while maintaining high true-positive counts for all classes. Overall, Figure 4 demonstrates that the ensemble models, XGBoost and RF, capture nonlinear feature dependencies better than LR. Specifically, XGBoost reduces misclassifications more effectively than RF, which is biased toward the no-disaster class during flood events. In addition, LR struggles to classify complex feature combinations, leading to false alarms.
Figure 3. Comparison of model performance metrics on the independent World-B dataset
Figure 4. Confusion matrices for the three trained models using the World-B dataset
3.4. Testing of realistic scenarios
The robustness of the three trained models, LR, RF, and XGBoost, under realistic scenarios is essential to assess the use of ML models for predicting the occurrence of natural disasters. To evaluate the models' robustness under such conditions, the three classifiers were tested under three scenarios: sensor failures, noisy sensors, and severe weather conditions. Figure 5 shows the robustness evaluation of the three trained models using the World-B dataset. Across all conditions, a consistent performance degradation is observed as the disruption severity increases from 10% to 50%. The ensemble models, XGBoost and RF, demonstrate substantially greater resilience than the linear baseline, LR. In the sensor failure scenario, XGBoost starts with a high macro F1-score of 0.76 and RF with 0.74, and both decrease modestly to 0.57 at 50% failure. This shows that tree-based methods preserve prediction fidelity even with partial data loss. In comparison, LR shows a steeper decline, from an F1-score of 0.52 to 0.43, demonstrating its sensitivity to missing inputs. For the noisy sensor scenario, the XGBoost model starts with a high F1-score and maintains nearly constant performance, reaching 0.73 even at the maximum noise level; the same holds for the RF model, confirming their tolerance to feature disruption. On the other hand, LR maintains a stably low F1-score across all noisy sensor severities. In the extreme weather test, which amplifies environmental parameters only, RF starts with a comparatively robust F1-score of 0.80 and declines to 0.75, while XGBoost's boosting framework is inherently more sensitive to environmental feature shifts. The LR model likewise shows a low F1-score for extreme weather compared with the ensemble models. Overall, ensemble-based approaches exhibit strong robustness to sensor failure, noisy sensor, and extreme weather scenarios, validating their suitability for IoT-based disaster prediction systems operating under uncertain sensing conditions. The linear LR model, limited by its inability to capture nonlinear dependencies, demonstrates lower adaptability and poorer fault tolerance.
Figure 5. Robustness evaluation of the trained models under three disturbance scenarios
4. CONCLUSION
This study demonstrated that ensemble-based ML models can effectively predict floods and earthquakes from synthetic IoT sensor data. The paper underlined the importance of resilient and accurate disaster prediction systems, and the findings supported this expectation. XGBoost and RF outperformed LR and remained robust under stress conditions such as sensor failures, noisy sensors, and extreme weather scenarios. These findings confirm the potential of ensemble learning as a foundation for intelligent EWS. Future studies should focus on expanding the dataset with additional environmental and seismic parameters, investigating DL approaches for spatiotemporal prediction, and deploying the models in real IoT environments for field validation. Such advancements will enhance the reliability and applicability of disaster prediction systems in practice.
ACKNOWLEDGMENTS
The authors would like to thank Al-Hussein Bin Talal University for its support.
FUNDING INFORMATION
This research was funded in part by the NATO Science and Peace Programme under grant SPS MYP G5932-RESCUE.
AUTHOR CONTRIBUTIONS STATEMENT
This journal uses the Contributor Roles Taxonomy (CRediT) to recognize individual author contributions, reduce authorship disputes, and facilitate collaboration.

Name of Author: C M So Va Fo I R D O E Vi Su P Fu
Moath Alsafasfeh: ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Abdullah Alhasanat: ✓ ✓ ✓ ✓ ✓ ✓
Atheer Bassel: ✓ ✓ ✓ ✓
Mohanad Alhasanat: ✓ ✓ ✓ ✓

C: Conceptualization, M: Methodology, So: Software, Va: Validation, Fo: Formal Analysis, I: Investigation, R: Resources, D: Data Curation, O: Writing - Original Draft, E: Writing - Review & Editing, Vi: Visualization, Su: Supervision, P: Project Administration, Fu: Funding Acquisition