IAES International Journal of Artificial Intelligence (IJ-AI)
Vol. 15, No. 1, February 2026, pp. 257-268
ISSN: 2252-8938, DOI: 10.11591/ijai.v15.i1.pp257-268

Benchmarking machine learning models for natural disaster prediction with synthetic IoT data

Moath Alsafasfeh 1,2, Abdullah Alhasanat 1, Atheer Bassel 3, Mohanad Alhasanat 1
1 Department of Computer Engineering, College of Engineering, Al-Hussein Bin Talal University, Ma'an, Jordan
2 Department of Electrical and Computer Engineering, College of Engineering, Tuskegee University, Tuskegee, United States
3 Department of Artificial Intelligence, College of Computer Science and Technology, University of Anbar, Anbar, Iraq

Article Info
Article history: Received Oct 2, 2025; Revised Dec 29, 2025; Accepted Jan 22, 2026
Keywords: Disaster resilience; Early warning systems; Ensemble learning; Extreme weather events; Internet of things; Natural disaster prediction; Synthetic data

ABSTRACT
Natural disasters pose severe threats to human life and infrastructure, demanding robust early warning systems (EWS) supported by machine learning (ML) and internet of things (IoT)-based sensing. This study benchmarks ML models for predicting floods and earthquakes using synthetic IoT sensor data. A dataset comprising nine environmental and seismic parameters was generated and labeled into three classes: no disaster, flood, and earthquake, with feature preprocessing applied during model training. Logistic regression (LR), random forest (RF), and extreme gradient boosting (XGBoost) models were trained and evaluated using accuracy, precision, recall, and F1-score. Experimental results on the World-A test set show that ensemble models consistently outperform LR, with XGBoost and RF achieving F1-scores of up to 97% and 99%, respectively, compared to 79% for LR.
An independent test on the separately generated World-B dataset revealed that the ensemble models maintained higher generalization capability, with F1-scores of 80% for XGBoost and 78% for RF. In contrast, LR showed substantial degradation, with an F1-score of 54%. Stress testing on the World-B dataset under simulated conditions such as sensor failures, noise injection, and extreme weather events confirms the resilience of the ensemble models in comparison to LR. These results demonstrate the usefulness of ensemble learning in handling unpredictable IoT data for disaster prediction and highlight its potential integration into intelligent EWS. Future work will focus on expanding the framework to include cross-time prediction, incorporating additional environmental features, and deploying the models in real-time IoT systems for field validation.

This is an open access article under the CC BY-SA license.

Corresponding Author:
Moath Alsafasfeh
Department of Computer Engineering, College of Engineering, Al-Hussein Bin Talal University, Ma'an, Jordan
Email: malsafasfeh@tsukegee.edu

Journal homepage: http://ijai.iaescore.com

1. INTRODUCTION
Natural disasters are serious adverse events of geophysical, hydrological, climatological, or meteorological origin that threaten human life, infrastructure, and the environment [1], [2]. Floods and earthquakes are two of the most damaging hazards, accounting for a large proportion of disaster-related deaths and economic losses worldwide [1]. The frequency of such incidents has increased dramatically in recent decades, emphasizing the need for reliable prediction and response systems. Early warning systems (EWS) are designed to deliver timely alerts through hazard monitoring, forecasting, and risk assessment [3]. However,
by 2020, only a small minority of countries had operable multi-hazard EWS [4]. Improving these systems remains a global goal. Artificial intelligence (AI) provides promising solutions for disaster prediction by analyzing large, complex datasets and detecting patterns that precede extreme events [5]. Machine learning (ML) models, in particular, have the potential to increase disaster prediction accuracy and facilitate proactive decision-making. Despite these advances, the dependability of ML and deep learning (DL) algorithms must be extensively tested, especially under conditions of data scarcity, noise, or sensor failure. Furthermore, evaluation of both the datasets and the prediction models is critical to ensuring reliable outcomes. This study addresses these gaps by benchmarking ML models for flood and earthquake prediction using synthetic internet of things (IoT) sensor data.

A wide range of IoT-based systems have been developed for natural disaster prediction and management, typically combining sensor networks with ML models for early warning. An IoT-based framework integrating multi-sensor data with neural networks, decision trees (DTs), and random forest (RF) demonstrated strong decision-making capabilities but faced scalability and interoperability challenges [6]. Several papers highlight the importance of applying ML and DL for disaster detection and outline key research directions [7]-[9]. A comparative study in [10] benchmarked ML and DL models, showing that convolutional neural networks (CNNs) and hybrid deep networks outperform the others, while RF and extreme gradient boosting (XGBoost) remained competitive for smaller datasets; no single algorithm proved universally optimal.
F or ood prediction, IoT and ML approaches ha v e used w ater -le v el, rainf all, and humidity sensors with models such as RF , long short-term memory (LSTM), and CNN, achie ving accuracies between 80-95% [11]–[13]. Ensemble RF-LSTM h ybrids reached 81% accurac y on the tested data [14]. Rezv ani et al. [15] applies geospatial AI to ood hotspot detection in Portug al, producing susceptibility maps with 96% accurac y , though limited by reliance on historical datasets. According to Anbarasan et al. [16], a con v olutional deep neural netw ork (CDNN)-based system combining IoT sensors and big data achie v ed 93.23% accurac y , outperforming articial neural netw ork (ANN) and deep neural netw ork (DNN) baselines, b ut w as v alidated mainly on simulated data. F or earthquak e prediction, accelerometer -based IoT frame w orks analyzed seismic vibrations using support v ector machine (SVM) and DTs, with SVM reaching 95% accurac y [17]. Mukherjee et al. [18] e xtracted 61 seismic features from the Himalayan Belt, nding ANN and XGBoost most accurate across longer prediction windo ws. According to Rosca and S tancu [19], an IoT–cloud system using 8,766 seismic and m eteorological records achie v ed 99.84% accurac y , though limited t o the Vrancea re gion. K ubo et al. [20] re vie w ML applications in seismology , noting strong adv ances in detection and catalog completion b ut persistent issues with generalizability and interpretability . F or landslides, IoT sensors monitoring soil moisture, slope, and rainf all combined with ML models such as XGBoost and LSTM achie v ed o v er 95% accurac y [21]–[23], while U-Net impro v ed susceptibility mapping with satellite imagery [12]. The study in [24] proposed a lo w-cost IoT -ML frame w ork in V ietnam , where RF achie v ed strong predicti v e performance and enabled real-time alerts, though generalizability and de vice maintenance remain concerns. 
On the other hand, smart city-oriented systems have integrated IoT, ML, and cloud computing for multi-hazard monitoring, though challenges persist in scalability, data transmission, and privacy [25]. Overall, prior work demonstrates the potential of IoT-ML integration for disaster prediction, but most studies rely on limited real-world datasets and focus on single hazards. Few have systematically benchmarked multiple models under controlled conditions. This paper addresses that gap by generating a synthetic IoT dataset for flood and earthquake scenarios and conducting a comparative robustness evaluation.

2. METHOD
The paper introduces the development and evaluation of an ML-based system for natural disaster prediction. The approach involves generating two synthetic datasets, World-A and World-B, labeling disaster events, and training multiple classification models to detect floods and earthquakes based on environmental and seismic features.

2.1. Dataset-A preparation
We generate a synthetic environmental time series to train models for predicting floods and earthquakes. The data span 01-Jan-2024 to 01-Jan-2025 at hourly resolution (8,784 timestamps) and include three feature groups: i) meteorology: temperature, humidity, wind speed, rainfall; ii) hydrology: water level, flow rate; and iii) seismology: magnitude, depth, frequency. Table 1 summarizes the distribution type and
the physical interpretation for each feature. In natural disaster prediction tasks, model robustness is a critical concern due to the inherent rarity and imbalance of high-severity events. To address this, the data-generation process explicitly simulates rare extreme conditions rather than relying on post hoc data augmentation.

Table 1. Summary of feature generation for the synthetic dataset
Feature | Symbol | Distribution / model | Physical interpretation
Meteorological variables:
Temperature | T_i | Gaussian (mu = 25, sigma = 5), clipped to [5, 45] | Represents air temperature with daily and seasonal variation
Humidity | H_i | Derived from T_i with inverse relation; Gaussian noise sigma = 3 | Relative humidity inversely related to temperature, bounded between 20-100%
Wind speed | W_i | Rayleigh (sigma = 3.5) with added Gamma bursts during storms; capped at 28 m/s | Background wind speed with convective gusts during storm episodes
Rainfall | R_i | Gamma (shape = 1.8, scale = 4.0) mixed with Bernoulli rain occurrence; seasonal and diurnal modulation | Hourly precipitation influenced by time of year and time of day
Hydrological variables:
Water level | L_i | Recursive AR(1) with alpha = 0.965 and rainfall-dependent response | Accumulates rainfall and decays slowly over time, capturing antecedent wetness
Flow rate | Q_i | Power law Q proportional to (L - H_0)^1.6 with Gaussian noise sigma = 6 | River discharge rises nonlinearly with water level, representing runoff response
Seismic variables:
Magnitude | M_i | Exponential (lambda = 1.0), bounded [0, 6.5] | Earthquake magnitude approximates a Gutenberg-Richter-like exponential decay (small frequent, large rare)
Depth | D_i | Normal (mu = 12, sigma = 6), clipped to [0.5, 50] | Depth of seismic events spanning shallow to intermediate crustal regions
Frequency | F_i | Poisson with rate lambda decreasing exponentially with M_i and D_i | Expected hourly count of seismic events, inversely related to magnitude and depth

2.2. Dataset labeling
The World-A dataset is split into training and testing subsets of 80% and 20%, respectively. Samples in both subsets are assigned class labels based on predefined threshold values indicating potentially hazardous conditions. An empirical selection method is used to determine threshold values that maintain the physical realism of the synthetic disaster simulation, ensuring that both flood and earthquake scenarios appropriately represent plausible environmental dynamics. To eliminate conflicts, Algorithm 1 assigns labels with earthquake priority. Labels begin with class 0 (no disaster) and progress to class 2 (earthquake) when low seismic frequency, high magnitude, and shallow depth are all met. Remaining unlabeled samples are classified as class 1 (flood) if flow rate, wind speed, and humidity exceed the threshold values. This sequential rule-based scheme cleanly separates hazards while prioritizing the rarer, high-severity earthquakes.

Algorithm 1. Disaster labeling logic based on thresholds
1: Initialize all labels as no disaster (0)
2: for each row in the dataset do
3:   if seismic frequency < threshold and seismic depth < threshold and seismic magnitude > threshold then
4:     Assign label 2 (earthquake)
5:   else if label == 0 and flow rate > threshold and wind speed > threshold and humidity > threshold then
6:     Assign label 1 (flood)
7:   else
8:     Keep label 0 (no disaster)
9:   end if
10: end for

Twelve predictors are used: 9 numeric (temperature, humidity, wind speed, seismic magnitude, seismic depth, seismic frequency, water level, rainfall, and flow rate) and 3 categorical (hour, day of week,
and month). Numeric features are normalized using min-max scaling, while categorical features are one-hot encoded with the first category dropped. All preprocessing is wrapped in a unified column transformer fit only on A-train, then applied to A-test to prevent leakage.

2.3. Models training
Three ML models, logistic regression (LR), RF, and XGBoost, are subsequently trained for multiclass disaster classification, distinguishing between no-disaster, flood, and earthquake events. All preprocessing steps are integrated within a training-only pipeline, and no imputation is applied unless missing values are intentionally introduced during robustness experiments.

2.3.1. Logistic regression
LR can be used to classify input data into multiple classes. In this study, we assign class 0 for no disaster, class 1 for flood, and class 2 for earthquake, as explained in Algorithm 1. Since there are three possible outcomes, multiclass LR is applied using the softmax function, as shown in Algorithm 2, to calculate the probability that an input corresponds to each class. The probability for class j given input x is calculated using (1), where x in R^d is the input feature vector, theta_j in R^d is the weight vector for class j, and theta_j^T x is the dot product between the weights and the input features. The denominator ensures that the probabilities over all classes sum to 1.

P(y = j | x) = e^(theta_j^T x) / (e^(theta_0^T x) + e^(theta_1^T x) + e^(theta_2^T x)), for j in {0, 1, 2}   (1)

Algorithm 2. Multiclass logistic regression using softmax
Require: dataset {(x^(i), y^(i))}, i = 1..N, learning rate alpha, iterations T
1: Initialize weight vectors theta_0, theta_1, ...
, theta_(K-1) in R^d
2: for t = 1 to T do
3:   for each sample (x^(i), y^(i)) do
4:     Compute logits: z_j = theta_j^T x^(i)
5:     Compute probabilities: p_j = e^(z_j) / sum_(k=0..K-1) e^(z_k)
6:     for each class j do
7:       Update weights: theta_j <- theta_j - alpha (p_j - 1{y^(i) = j}) x^(i)
8:     end for
9:   end for
10: end for
11: Prediction: y_hat = argmax_j p_j

2.3.2. Random forest
RF enhances prediction accuracy and mitigates overfitting by combining multiple DTs into an ensemble. Initially, multiple random subsets are generated from the training dataset using bootstrap sampling, and each subset is used to independently train a DT. During tree construction, a random selection of features is taken at each node split, and the best feature among them is chosen using impurity-reduction criteria. This randomization minimizes correlation between individual trees, enhancing the model's generalization ability. During prediction, each tree votes for a class label, and the final prediction is determined by majority voting across all trees. Algorithm 3 explains the main steps of the RF algorithm, where each h_t(x) is a DT trained independently. Bootstrap sampling ensures diversity in the data, random feature selection at each node reduces tree correlation, and the final prediction is made by majority vote in classification tasks.

2.3.3. XGBoost
XGBoost is an ensemble learning algorithm that builds a strong classifier by sequentially combining multiple DTs. Unlike RF, where trees are trained independently, XGBoost trains each new tree to correct the errors made by the previous ensemble. XGBoost begins with an initial prediction of 0, then calculates the first derivative (gradient) and the second derivative (Hessian) of the loss function with respect to the predictions.
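The softmax probability of (1) and the per-sample gradient used in step 7 of Algorithm 2 can be checked numerically. The sketch below is a toy illustration with random data, not the paper's implementation; the feature dimension, learning rate, and label are assumed for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 3, 9                      # three classes, nine numeric features
theta = rng.normal(size=(K, d))  # one weight vector per class
x = rng.normal(size=d)           # a single (already scaled) feature vector
y = 1                            # assumed true label (flood)

# Softmax over class logits, as in (1); subtracting the max logit is a
# standard numerical-stability trick that leaves the result unchanged.
z = theta @ x
p = np.exp(z - z.max())
p /= p.sum()

# Cross-entropy gradient per class, (p_j - 1{y = j}) x, i.e. the update
# direction used in step 7 of Algorithm 2.
grad = np.outer(p - np.eye(K)[y], x)

alpha = 0.01                      # small learning rate
theta_new = theta - alpha * grad  # one gradient-descent step

# After the step, the per-sample loss -log p_y decreases.
z_new = theta_new @ x
p_new = np.exp(z_new - z_new.max())
p_new /= p_new.sum()
```

Because the cross-entropy loss is convex in the weights, a sufficiently small step along the negative gradient always lowers the per-sample loss, which is the mechanism both Algorithm 2 and the gradient computations of XGBoost rely on.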
These are used to fit a new DT that predicts the residual errors. The output of each new tree is scaled by a learning rate eta and added to the previous prediction. This process is repeated over multiple boosting rounds. Algorithm 4 outlines the main steps of XGBoost, where l is the loss function to be minimized, Omega(f_t) is a regularization term that penalizes tree complexity to prevent overfitting, and eta in (0, 1] is the learning rate that controls the contribution of each tree. At each iteration, the model fits a new tree f_t(x) to minimize a regularized objective function using a second-order Taylor approximation. The predictions are updated in step 6.

Algorithm 3. Random forest classifier
Require: training data D, number of trees T, number of features per split m
Ensure: final prediction y_hat
1: for t = 1 to T do
2:   Draw a bootstrap sample D_t from D (sample with replacement)
3:   Train a DT h_t(x) on D_t
4:   for each split node in the tree do
5:     Select m random features from the d total features
6:     Choose the best feature among the m based on impurity reduction
7:   end for
8: end for
9: For an unseen input x, aggregate predictions: y_hat = mode{h_t(x)}, t = 1..T

Algorithm 4. XGBoost classifier
Require: training data D, number of boosting rounds T, learning rate eta
Ensure: final prediction model y_hat(x)
1: Initialize predictions: y_hat_i^(0) <- 0 for all i
2: for t = 1 to T do
3:   Compute gradients: g_i = d l(y_i, y_hat_i^(t-1)) / d y_hat_i^(t-1)
4:   Compute Hessians: h_i = d^2 l(y_i, y_hat_i^(t-1)) / d (y_hat_i^(t-1))^2
5:   Fit a regression tree f_t(x) to minimize: sum_(i=1..n) [ g_i f_t(x_i) + (1/2) h_i f_t(x_i)^2 ] + Omega(f_t)
6:   Update predictions: y_hat_i^(t) <- y_hat_i^(t-1) + eta * f_t(x_i)
7: end for
8: return final model: y_hat(x) = sum_(t=1..T) eta * f_t(x)

2.4.
Generating the independent test World-B
To confirm the generalization of the trained models, an independent test is performed to determine their robustness for natural disaster classification. The World-B test set is created synthetically, using the same feature schema and temporal structure as the World-A dataset but incorporating new stochastic realizations for all variables. Table 2 shows the key differences between the World-A and World-B datasets. This approach ensures that the models are exposed to unseen environmental and seismic conditions while maintaining consistent physical relationships between features.

2.5. Experimental setup for evaluating machine learning under IoT inefficiencies
A controlled robustness testing approach was used to assess the trained models' stability and generalization capability across different environmental and sensor conditions. The evaluation utilized the
262 ISSN: 2252-8938 independent W orld-B dataset, maintaining its original labels to ensure independent and consistent ground truth across disruption situations. Three stress scenarios were designed to replicate realistic disturbance s in IoT -based monitoring systems: i) sensor f ailure simulation: a portion of sensor readings w as randomly halted, simulating temporary de vice f ailures or communication outa ges, where the absent v alues were substituted using a hold-last-v alue (HL V) approach; ii) noisy sensor simulation: additi v e Gaussian noise and lo w-frequenc y drift were injected into continuous features to imitate sensor calibration drift or interference; and iii) e xtreme weather simulation: meteorological and h ydrological v ariables were scaled to reect storms, including rainf all, humidity , wind speed, w ater le v el, and o w rate. In contrast, seismic feature s remained unchanged, as these e v ents are primarily en vironmental rather than tectonic. T o ensure stati stical reliability , each scenario w as repeated at v e se v erity le v els (10%-50%) using v e random seeds . This e xaminati on of multi-seed resilience pro vides a quantitati v e assessment of each algori thm’ s response to sensor de gradation and e xtreme en vironmental v ariation. By isolati ng disruptions from label generation, the frame w ork assesses model-only rob ustness, ensuring that an y reported performance changes are due to instability at the feature le v el rather than uctuations in the tar get. T able 2. 
Comparison between World-A and World-B datasets
Aspect | World-A dataset | World-B dataset
Purpose | Model training and in-domain evaluation | Independent testing for generalization
Data source | Generated once using a fixed random seed (SEED = 42) | Regenerated independently using a new random seed
Feature schema | 9 numeric + 3 temporal/categorical features | Identical schema replicated for comparability
Data distribution | Shares the same stochastic patterns as the training data | Different random realizations and temporal sequences
Thresholds for labeling | Fixed physical thresholds | Same thresholds reused to preserve label semantics
Evaluation type | In-distribution evaluation | Out-of-sample evaluation
Expected behavior | High performance due to shared distribution | Lower but more realistic performance under new conditions

3. RESULTS AND DISCUSSION
The performance of the trained ML models on the labeled disaster dataset is evaluated. The models are assessed based on their classification accuracy, and their robustness is analyzed under various simulated real-world conditions, including sensor noise and data irregularities.

3.1. Dataset generation
A Pearson correlation heatmap was constructed using all continuous variables to evaluate the internal consistency of the synthetic dataset features. The heatmap shown in Figure 1 shows a physically coherent relationship between the meteorological, hydrological, and seismic features. The meteorological and hydrological features have a moderate positive correlation, where humidity shows a strong negative correlation with temperature. Regarding the seismic subsystem, there is a moderate negative correlation between seismic magnitude and seismic frequency, reflecting the inverse relationship built into the synthetic dataset generation process.
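As a minimal sketch of how such features and their correlation structure can be produced, the snippet below samples a few of the Table 1 distributions and computes the Pearson matrix underlying a heatmap like Figure 1. The coupling constants and variable names are illustrative assumptions, not the paper's exact generator.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 8784  # hourly timestamps for 2024 (a leap year)

# Meteorology (Table 1): Gaussian temperature, inversely derived humidity,
# Rayleigh wind speed, Gamma-Bernoulli rainfall.
temp = np.clip(rng.normal(25, 5, n), 5, 45)
humidity = np.clip(90 - 1.6 * temp + rng.normal(0, 3, n), 20, 100)
wind = np.minimum(rng.rayleigh(3.5, n), 28.0)
rain = rng.gamma(1.8, 4.0, n) * rng.binomial(1, 0.25, n)

# Hydrology: AR(1) water level with rainfall forcing (alpha = 0.965);
# the 0.05 forcing gain is an assumed constant.
level = np.zeros(n)
for i in range(1, n):
    level[i] = 0.965 * level[i - 1] + 0.05 * rain[i]

df = pd.DataFrame({"temperature": temp, "humidity": humidity,
                   "wind_speed": wind, "rainfall": rain,
                   "water_level": level})

# Pearson correlation matrix, the basis of the Figure 1 heatmap.
corr = df.corr(method="pearson")
```

By construction, corr.loc["temperature", "humidity"] comes out strongly negative and rainfall correlates positively with water level, mirroring the relationships the heatmap validates.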
The correlation heatmap in Figure 1 validates both the realism and internal consistency of the synthetically generated dataset. These correlation patterns were obtained from a stochastic-empirical generation process designed to mimic the natural behavior of IoT sensors capturing flood and earthquake parameters. This reliance on synthetic correlations is a limitation of this study, as real-world data would give more accurate readings and correlations among the features.

3.2. Evaluation of machine learning models performance
The performance of the trained models is assessed using four classification metrics: accuracy, precision, recall, and F1-score, as shown in Table 3. The number of testing samples in the World-A dataset is 1,757 across three classes: 1,690 no-disaster, 34 flood, and 33 earthquake samples. Accuracy measures the proportion of correctly classified samples among all predictions, while precision evaluates the proportion of true positives among all positive predictions for each class. Recall shows how many actual events have been
detected successfully. F1-score combines precision and recall in one metric to evaluate the models when dealing with an imbalanced dataset. Since the problem involves multiclass classification (no disaster, flood, and earthquake), the metrics were computed using macro-averaging to treat all classes equally regardless of sample count.

Figure 1. Dataset feature correlation heatmap

Table 3. Classification metrics, formulas, and brief definitions
Metric | Formula | Definition
Accuracy | (TP + TN) / (TP + TN + FP + FN) | Proportion of correctly classified samples among all predictions
Precision_i | TP_i / (TP_i + FP_i) | Share of predicted positives for class i that are truly positive
Recall_i | TP_i / (TP_i + FN_i) | Share of actual positives for class i that are correctly detected
F1_i | 2 * Precision_i * Recall_i / (Precision_i + Recall_i) | Harmonic mean of precision and recall for class i; robust to imbalance

Table 4 summarizes the computed evaluation metrics for the three implemented models: LR, RF, and XGBoost. The results clearly demonstrate the superior performance of the ensemble-based models over the linear baseline. RF and XGBoost achieved near-perfect scores across all metrics, showing their robustness in handling the complex and nonlinear relationships present in the disaster dataset and their resilience to class imbalance. In contrast, LR performed significantly worse, particularly in precision, demonstrating a higher tendency to incorrectly classify non-disaster instances as flood or earthquake events. This underperformance highlights the limitations of linear models in disaster classification scenarios where feature interactions are inherently nonlinear and overlapping.

Table 4.
Evaluation metrics for disaster classification models (based on the World-A test set)
Model | Accuracy (%) | Precision (%, macro) | Recall (%, macro) | F1-score (%, macro)
LR | 96.8 | 69.7 | 97.9 | 79.6
RF | 99.9 | 99.0 | 99.0 | 99.0
XGBoost | 99.7 | 96.2 | 98.0 | 97.0
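The macro-averaged metrics of Table 3 can be computed directly from per-class counts; the sketch below (plain numpy, with toy labels rather than the paper's data) mirrors those formulas.

```python
import numpy as np

def macro_metrics(y_true, y_pred, n_classes=3):
    """Per-class precision/recall/F1 as in Table 3, macro-averaged."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    prec, rec, f1 = [], [], []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        prec.append(p)
        rec.append(r)
        f1.append(2 * p * r / (p + r) if p + r else 0.0)
    acc = np.mean(y_true == y_pred)
    return acc, np.mean(prec), np.mean(rec), np.mean(f1)

# Toy example: 0 = no disaster, 1 = flood, 2 = earthquake.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]
acc, p, r, f = macro_metrics(y_true, y_pred)
```

Macro-averaging weights the rare flood and earthquake classes equally with the dominant no-disaster class, which is why a high overall accuracy (96.8% for LR in Table 4) can coexist with a much lower macro F1-score (79.6%).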
Figure 2 shows the confusion matrices for the three models on the World-A test set. The ensemble models, RF and XGBoost, show near-ideal performance across the three classes, while LR produces false alarms for the no-disaster class. However, these near-ideal results arise from the models' ability to replicate the deterministic threshold logic used for labeling rather than from true generalization. As a result, a further evaluation on the independent World-B dataset is required to test predictive robustness.

Figure 2. Confusion matrices for the World-A test set

3.3. Validation of the trained models using the independent test World-B
The predictive performance of the three trained models was further evaluated using the independently generated World-B dataset. The purpose of this evaluation is to assess the ability of the models to generalize beyond the statistical distributions of the training data in World-A. The World-B test provides a rigorous independent assessment of robustness under distributional shift, where the identical feature schema and fixed labeling thresholds are preserved while the underlying data-generation parameters are modified. The generated World-B dataset consists of 7,830 no-disaster, 870 flood, and 84 earthquake samples. Table 5 lists the performance metrics for the three trained models. The XGBoost model has the highest F1-score, with RF slightly lower, confirming the ability of the ensemble models to generalize for natural disaster prediction. In contrast, LR shows a notable decline in predictive performance, particularly in precision and F1-score, revealing its limited capacity to capture the nonlinear feature interactions inherent in environmental and seismic systems. Figure 3 confirms the outperformance of the ensemble models over the linear model, showing the ability of XGBoost and RF to adapt to complex and nonlinear data distributions.

Table 5.
World-B model comparison (macro metrics)
Model | Accuracy (%) | Precision macro (%) | Recall macro (%) | F1 macro (%)
LR | 70.6 | 50.6 | 83.7 | 54.6
RF | 89.0 | 78.5 | 78.1 | 78.3
XGBoost | 85.2 | 78.2 | 91.1 | 80.5

Figure 4 shows the confusion matrices that visualize the prediction outcomes of the three models on the independent World-B dataset. The LR model exhibits high false-positive rates; in particular, it misclassifies a large number of no-disaster samples as flood, indicating its sensitivity to overlapping meteorological features. LR's true positives for flood and earthquake remain acceptable, but with substantial confusion across classes. In
contrast, the RF model achieves stronger diagonal dominance, with the majority of no-disaster and earthquake samples correctly classified and moderate confusion between flood and no disaster. The XGBoost model shows the most balanced behavior, minimizing false negatives while maintaining high true-positive counts for all classes. Overall, Figure 4 demonstrates that the ensemble models XGBoost and RF better capture nonlinear feature dependencies compared to LR. Specifically, XGBoost reduces misclassifications more effectively than RF, which is biased toward the no-disaster class during flood events. In addition, LR struggles with complex feature classification, leading to false alarms.

Figure 3. Comparison of model performance metrics on the independent World-B dataset

Figure 4. Confusion matrices for the three trained models using the World-B dataset

3.4. Testing of realistic scenarios
The robustness of the three trained models, LR, RF, and XGBoost, under realistic scenarios is essential to assess the use of ML models for predicting the occurrence of natural disasters. To evaluate the models' robustness under such conditions, the three classifiers were tested using three scenarios: sensor failures, noisy sensors, and severe weather conditions. Figure 5 shows the robustness evaluation of the three trained models using the World-B dataset. Across all conditions, a consistent performance degradation is observed as the disruption severity increases from 10% to 50%. The ensemble models, XGBoost and RF, demonstrate substantially greater resilience than the linear baseline, LR. In the sensor failure scenario, XGBoost starts with a high macro F1-score of 0.76 and RF with 0.74, but both decrease modestly to 0.57 at 50% failure. This shows that tree-based methods preserve prediction fidelity even with partial data loss.
In comparison, LR shows a steeper decline from an F1-score of 0.52 to 0.43, demonstrating its sensitivity to missing inputs. In the noisy sensor scenario, the XGBoost model starts with a high F1-score and maintains near-constant performance, reaching 0.73 even at the maximum noise level; the RF model behaves similarly, confirming their tolerance to feature disruption. On the other hand, LR shows a stable but low F1-score across all noise levels. In the extreme weather test, which amplifies environmental parameters only, RF starts with a comparatively robust F1-score of 0.80 and declines to 0.75, while XGBoost's boosting framework proves inherently more sensitive to environmental feature shifts. The LR model again shows a low F1-score for extreme weather compared with the ensemble models. Overall, the ensemble-based approaches exhibit strong robustness to sensor failure, noisy sensor, and extreme weather scenarios, validating their suitability for IoT-based disaster prediction systems operating under uncertain sensing conditions. The linear LR model, limited by its inability to capture nonlinear dependencies, demonstrates lower adaptability and poorer fault tolerance.
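The three disturbance scenarios can be sketched as simple transformations of a feature table. The fractions, noise scales, and column names below are illustrative assumptions, and the hold-last-value imputation is realized as a forward fill with a backward fill as a backstop for leading gaps (an assumption not specified in the paper).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 1000
# Toy stand-in for the World-B feature table (column names assumed).
df = pd.DataFrame({
    "rainfall": rng.gamma(1.8, 4.0, n),
    "wind_speed": rng.rayleigh(3.5, n),
    "seismic_magnitude": rng.exponential(1.0, n),
})

def sensor_failure(df, frac, rng):
    """Randomly blank a fraction of readings, then hold the last value (HLV)."""
    out = df.copy()
    out[rng.random(out.shape) < frac] = np.nan
    # Forward fill implements hold-last-value; the backward fill only covers
    # gaps at the very start of each series.
    return out.ffill().bfill()

def noisy_sensors(df, frac, rng):
    """Additive Gaussian noise plus a slow sinusoidal drift, scaled by severity."""
    out = df.copy()
    t = np.arange(len(out))
    for col in out.columns:
        scale = frac * out[col].std()
        out[col] += rng.normal(0, scale, len(out)) \
                    + 0.5 * scale * np.sin(2 * np.pi * t / len(out))
    return out

def extreme_weather(df, frac, weather_cols=("rainfall", "wind_speed")):
    """Amplify meteorological/hydrological columns; seismic features untouched."""
    out = df.copy()
    for col in weather_cols:
        out[col] *= 1.0 + frac
    return out

# One severity level (30%); the paper sweeps 10%-50% over five random seeds.
failed = sensor_failure(df, 0.3, rng)
noisy = noisy_sensors(df, 0.3, rng)
storm = extreme_weather(df, 0.3)
```

Because the labels are held fixed while only the features are perturbed, re-scoring a trained classifier on `failed`, `noisy`, and `storm` isolates feature-level robustness, matching the model-only evaluation design of section 2.5.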
Figure 5. Robustness evaluation of the trained models under three disturbance scenarios

4. CONCLUSION
This study demonstrated that ensemble-based ML models can effectively predict floods and earthquakes from synthetic IoT sensor data. The paper underlined the importance of resilient and accurate disaster prediction systems, and the findings supported this expectation. XGBoost and RF outperformed LR and remained robust under stress conditions such as sensor failures, noisy sensors, and extreme weather scenarios. These findings confirm the potential of ensemble learning as a foundation for intelligent EWS. Future studies should focus on expanding the dataset with additional environmental and seismic parameters, investigating DL approaches for spatiotemporal prediction, and deploying the models in real IoT environments for field validation. Such advancements will enhance the reliability and applicability of disaster prediction systems in practice.

ACKNOWLEDGMENTS
The authors would like to thank Al-Hussein Bin Talal University for its support.

FUNDING INFORMATION
This research was funded in part by the NATO Science for Peace and Security Programme under grant SPS MYP G5932-RESCUE.

AUTHOR CONTRIBUTIONS STATEMENT
This journal uses the Contributor Roles Taxonomy (CRediT) to recognize individual author contributions, reduce authorship disputes, and facilitate collaboration. Authors: Moath Alsafasfeh, Abdullah Alhasanat, Atheer Bassel, Mohanad Alhasanat. Roles: C: Conceptualization; M: Methodology; So: Software; Va: Validation; Fo: Formal Analysis; I: Investigation; R: Resources; D: Data Curation; O: Writing - Original Draft; E: Writing - Review & Editing; Vi: Visualization; Su: Supervision; P: Project Administration; Fu: Funding Acquisition.