IAES International Journal of Artificial Intelligence (IJ-AI)
Vol. 14, No. 6, December 2025, pp. 4814–4827
ISSN: 2252-8938, DOI: 10.11591/ijai.v14.i6.pp4814-4827

Machine and deep learning classifiers for binary and multi-class network intrusion detection systems

Ahmad Aloqaily¹, Emad Eddien Abdallah¹, Esraa Abu Elsoud², Yazan Hamdan³, Khaled Jallad³
¹ Department of Information Technology, Faculty of Prince Al-Hussein Bin Abdullah II for Information Technology, The Hashemite University, Zarqa, Jordan
² Department of Cybersecurity and Cloud Computing, Faculty of Information Technology, Applied Science Private University, Amman, Jordan
³ Department of Computer Information Systems, Faculty of Prince Al-Hussein Bin Abdullah II for Information Technology, The Hashemite University, Zarqa, Jordan

Article Info
Article history: Received Sep 23, 2024; Revised Jun 24, 2025; Accepted Oct 18, 2025
Keywords: Cyber attacks, Cyber security, Deep learning, Intrusion detection, Machine learning

ABSTRACT
The rapid proliferation of the internet and advancements in communication technologies have significantly improved networking and increased data volume. This growth has in turn given rise to a multitude of novel attacks, presenting significant challenges for network security and for the intrusion detection system (IDS). Moreover, the ongoing threat from authorized entities who try to carry out various types of attacks on the network is a concern that must be handled seriously. IDS are used to provide network availability, confidentiality, and integrity by employing machine learning (ML) and deep learning (DL) algorithms. This research studies the impact of binary and multi-class attack labeling by building an IDS that leverages hybrid algorithms, including artificial neural networks (ANN), random forest (RF), and logistic model trees (LMTs).
The paper addresses challenges such as data preprocessing, feature selection, and managing imbalanced datasets by applying the synthetic minority oversampling technique (SMOTE) and Pearson's correlation. The IDS was tested using the network security laboratory knowledge discovery dataset (NSL-KDD) and the Canadian Institute for Cybersecurity intrusion detection system dataset (CIC-IDS-2017), achieving an average F1-score of 96% for binary classification on NSL-KDD and 85% for binary classification on CIC-IDS-2017. For multi-class classification, the proposed model achieved an average F1-score of 82% and 96% for NSL-KDD and CIC-IDS-2017, respectively.

This is an open access article under the CC BY-SA license.

Corresponding Author:
Ahmad Aloqaily
Department of Information Technology, Faculty of Prince Al-Hussein Bin Abdullah II for Information Technology, The Hashemite University
P.O. Box 330127, Zarqa 13133, Jordan
Email: aloqaily@hu.edu.jo

Journal homepage: http://ijai.iaescore.com

1. INTRODUCTION
The exponential increase in internet usage in daily life has led to an increase in cyberattacks, and incidents such as the SolarWinds breach in 2020 have highlighted the increasing sophistication of network intrusions. According to the Internet Security Threat Report (ISTR), malware is found in one of every thirteen Web queries. A cyberattack starts with target reconnaissance and ends with the exploitation of vulnerabilities to carry out a harmful operation. These cyberattacks result in system intrusions, which are characterized as unauthorized system access that compromises the confidentiality, integrity, and availability (CIA) of computer or network resources.

In recent years, we have seen the emergence of numerous new cyberattacks, including cross-site scripting, brute force, botnets, distributed denial of service, and others. In 2023, the worldwide number of malware attacks reached 6.06 billion, an increase of 10% compared to the preceding year [1]. These intrusions have raised more serious cybersecurity concerns than ever [2]. Securing networks has therefore become essential, and one of the most effective ways to identify these threats is the intrusion detection system (IDS), which depends on analyzing and monitoring network traffic.

A host intrusion detection system (HIDS) is an IDS approach that identifies possible intrusions from system activities recorded in a variety of log files created on the local host computer, where these log files are collected through local sensors [3]. A network intrusion detection system (NIDS), on the other hand, analyzes the contents of packets within network traffic streams, whereas HIDS primarily employs data derived from log files, system logs, sensor logs, file system data, disk resource allocation, and other relevant information from each system. Many organizations use a hybrid approach that combines both NIDS and HIDS techniques [4].

Stateful protocol analysis, anomaly detection, and signature detection techniques are used to analyze network traffic flows. Signature detection depends on human involvement to refresh the signature database continuously and uses pre-established signatures and filtration algorithms to identify attacks.
This methodology works well for identifying known threats, but it is completely ineffective against unknown attacks. Anomaly detection, in contrast, often leads to a significantly higher percentage of false positives. Most organizations choose to apply hybrid approaches to get a more effective detection model [5]. Based on the standard TCP/IP communication model, protocol analysis at the network, transport, and application layers is among the most powerful techniques for detecting potential threats [6].

Machine learning (ML) methods have shown excellence in achieving high detection accuracy. However, they have limitations, such as difficulty handling raw, unlabeled, high-dimensional data and reliance on manual feature extraction, which affect the accuracy of IDS [7]. Deep learning (DL) emerged to address these drawbacks. This research aims to enhance security through IDSs by applying both ML and DL algorithms to the network security laboratory knowledge discovery (NSL-KDD) and Canadian Institute for Cybersecurity intrusion detection system (CIC-IDS-2017) datasets to improve overall system architecture and detection performance. These datasets provide a foundation for benign and attack network traffic, although they have shortcomings such as labeling issues, duplicate flows, and insufficient attack variation.

The proposed model seeks to address these limitations and develop a more resilient IDS through a comprehensive two-phase experiment that studies the effects of binary and multi-class labeling on IDS performance, an issue that several researchers have highlighted because of its importance [8], [9]. Additionally, we identify a major gap in the literature regarding the integration of ML and DL approaches in the context of HIDS and NIDS.
While previous studies have examined the effectiveness of different detection strategies, they often do not explicitly investigate how detection accuracy could be improved by combining ML and DL to work together on different attack vectors. This is especially true when it comes to reducing high false positive rates and the challenges that come with data labeling.

The remainder of this paper is organized as follows: section 2 reviews recent studies on IDS. Section 3 outlines the research methodology. Section 4 presents the results and discussion. Lastly, section 5 contains the conclusion.

2. LITERATURE REVIEW
Many researchers have studied IDS and proposed different approaches, which vary in the technology used, the dataset, the feature selection techniques, and many other criteria that affect the performance of the proposed model. In this section, we illustrate these differences by discussing some of these studies.

The poor performance of conventional intrusion detection techniques prompted the research in [10] to suggest a neural network methodology. A multi-layer convolutional neural network (CNN) is used for feature extraction and selection, and a softmax classifier is used to categorize the network attacks. For additional analysis of network intrusions, a multi-layer deep neural network (DNN) is utilized. Two commonly used benchmark intrusion detection datasets, NSL-KDD and KDDCUP'99, were used in the investigation. Four performance metrics (accuracy, recall, F1-score, and precision) are used to evaluate the suggested model's performance. Compared with other IDSs, the testing findings demonstrate that the suggested method attained
an accuracy of 99%.

The research in [11] examined the applicability of DL to internet of things (IoT) data security and conducted a comparative analysis of three DL models: CNN, long short-term memory (LSTM), and DNN. Based on the results, DNN achieves 94.61% accuracy, while CNN and LSTM achieve 98.61% and 97.67%, respectively. This comparative study and literature review established that DL models perform better in the IoT IDS setting than other approaches. Although the DL models exhibit better accuracy, the authors suggest that future work should focus on creating a hybrid DL model for IoT intrusion detection that can anticipate attacks more accurately while experimenting with real-time datasets. The hybrid model is intended for the IoT IDS deployment strategy and detection techniques.

Patil et al. [12] presented an IDS model that uses ML algorithms such as support vector machine (SVM), random forests (RF), and decision trees. Following the model's training, an ensemble method known as a voting classifier was included, and it attained 96.25% accuracy. The study argues that trust is necessary for human-machine interactions to be productive. Local interpretable model-agnostic explanation (LIME) is an extendable, modular technique that provides concise, comprehensible descriptions of predictions. An explanation of a prediction is highly useful for the selection of representative models. It is employed in model selection, trust evaluation, improvement of unreliable models, and prediction analysis for both system experts and non-experts. To make the model's predictions comprehensible, the paper deploys a LIME explainable framework on top of the ensemble of ML models.
Meng [13] examines the use of supervised and unsupervised learning methods to improve cybersecurity threat detection accuracy. The study also emphasizes the use of reinforcement learning in adaptive threat modeling. The approach helps systems discover the best ways to respond to threats, making them more adaptive to changing cyber threats. The article also addresses real-time threat identification using neural networks and DL algorithms.

Hnamte and Hussain [14] describe an advanced and efficient NIDS that uses DL techniques to detect attacks. The model was trained on two real-time datasets, CIC-IDS-2018 and Edge IIoT. Multi-class classification is used to examine the model's performance, and the results show accuracies of 100% and 99.64%. In contrast, Qazi et al. [15] implemented a hybrid DL-based NIDS, which leverages neural network architectures, applied it to the CIC-IDS-2018 dataset, and attained an accuracy of 98.9%.

Musleh et al. [16] present a comprehensive study on ML-based IDS within the IoT context, employing various feature extraction techniques and ML algorithms to enhance their proposed model. The investigation evaluates an array of feature extractors, including image filtering techniques and transfer learning frameworks. The study culminates in an assessment on the IEEE Dataport dataset, achieving an accuracy of 98.3%.

Moving to research that focuses on the effectiveness of binary and multi-class classification in IDS, Acharya et al. [8] create a heterogeneous ensemble ML model to identify abnormalities in NIDS. To address the class-imbalance issue with NIDS datasets, the suggested model initially uses subsampling. Min-Max normalization then maps the input data into the 0–1 range, reducing overfitting and promoting convergence.
Feature reduction, often employed in meta-heuristic-based techniques, is used to decrease the number of features while retaining the most appropriate ones and avoiding computational overhead. To accomplish both two-class and multi-class classification across the feature-selected NSL-KDD, KDD99, and UNSW-NB-15 datasets, the suggested NIDS approach ultimately builds a heterogeneous ensemble learning model using J48, k-nearest neighbors (k-NN), SVM, Bagging, AdaBoost, and RF algorithms as base classifiers.

Bacevicius and Taraseviciene [17] aim to address the difficulties that arise when testing multi-class classification performance for network intrusions using highly imbalanced raw data, such as the CIC-IDS-2017 and CSE-CIC-IDS-2018 datasets. The main objective of the study is to examine several ML models, such as CNNs, artificial neural networks (ANN), RF, decision trees, and logistic regression. It also uses explainable artificial intelligence (XAI) tools to examine potential interpretations of the data. With an average macro F1-score of 0.96878, the results showed that decision trees using the classification and regression trees (CART) strategy performed better than other methods on the 28-class classification task.

Tseng and Chang [18] presented an ensemble feature selection framework that combines three feature scoring techniques (classification and regression tree, random forest, and extra tree) with two different feature selection methodologies to produce six distinct feature sets. The framework determines the best feature set based on accuracy for each binary model. By utilising random sampling and offering a customised
sample size based on the target class dimensions in each binary model, the proposed ensemble data balancing technique significantly enhances conventional data balancing approaches. Random sampling, the synthetic minority oversampling technique (SMOTE), and Tomek link methods are all included in this framework. It also incorporates four encoder modes to identify the best feature extraction configuration for each binary model. Experimental findings demonstrate that ensemble binary detection models achieve higher accuracy in identifying three types of wireless attacks in the Aegean Wi-Fi intrusion dataset (AWID) compared to similar studies using traditional multi-class detection frameworks [18].

In addition, a data resampling method based on the adaptive synthetic (ADASYN) and Tomek links algorithms is presented in [2], combined with several DL models. Using the benchmark NSL-KDD dataset, the proposed model is evaluated through accuracy, precision, recall, and F-score metrics. Experimental results indicate that the approach achieves 99.8% accuracy in binary classification, outperforming existing models. Its performance in multi-class classification also improves, surpassing state-of-the-art accuracy levels of 99.9%.

3. RESEARCH METHODOLOGY
Our proposed methodology consists of five phases, illustrated in Figure 1. We used two datasets, NSL-KDD [19] and CIC-IDS-2017 [20]. The data were first cleaned by removing noisy instances and any duplicated data. The third phase converted features into numerical data using an ordinal encoder. To ensure that features are treated equally during the training phase, MinMax scaling maps the data to the standard range between 0 and 1. Furthermore, we used Pearson's correlation coefficient to evaluate the linear relationships between features in both the NSL-KDD and CIC-IDS-2017 datasets.
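The scaling and correlation steps just described can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy columns, and the greedy "keep the first feature of a correlated pair" rule are hypothetical, and the 0.8 cut-off is the absolute-correlation threshold the methodology reports.

```python
import math

def min_max_scale(column):
    """Scale a list of numbers to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # undefined for constant columns; treated as 0 in this sketch
    return cov / math.sqrt(var_x * var_y)

def drop_correlated(features, threshold=0.8):
    """Keep a feature only if |r| <= threshold against every feature already kept.

    Assumption: which member of a correlated pair is dropped is not stated in
    the source; this sketch keeps whichever feature appears first.
    """
    kept = {}
    for name, column in features.items():
        if all(abs(pearson(column, other)) <= threshold for other in kept.values()):
            kept[name] = column
    return kept

# Toy columns (hypothetical values, not taken from either dataset).
toy = {
    "duration": [1.0, 2.0, 3.0, 4.0],
    "src_bytes": [10.0, 20.0, 30.0, 41.0],  # almost a multiple of duration
    "count": [5.0, 1.0, 4.0, 2.0],
}
scaled = {name: min_max_scale(col) for name, col in toy.items()}
selected = drop_correlated(scaled, threshold=0.8)
print(sorted(selected))
```

In this toy case `src_bytes` is nearly collinear with `duration` and is filtered out, while `count` is only weakly correlated and survives. Because Pearson's coefficient is invariant to min-max scaling, running the filter before or after scaling selects the same features.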
A correlation coefficient threshold of 0.8 (in absolute value) was chosen to identify highly correlated features. Features with correlation coefficients greater than this threshold were considered redundant and removed, as they did not provide additional information for model training. This threshold was selected to balance reducing dimensionality against retaining informative features. In our study, we applied SMOTE after feature selection to ensure that the generated synthetic data was based on relevant features. The technique was crucial in improving the classifier's performance, particularly for detecting rare attack types in the NSL-KDD and CIC-IDS-2017 datasets, which were otherwise underrepresented. SMOTE addresses class imbalance by generating synthetic samples for the underrepresented class. The algorithm works by selecting a sample from the minority class, finding its k nearest neighbors, and then creating synthetic instances by interpolating between the selected sample and its neighbors. This approach helps to increase the decision boundary complexity for the minority class, thus improving the classifier's ability to distinguish between the classes.

Figure 1. Proposed methodology

According to Table 1 and Figure 2, benign traffic is disproportionately predominant (454,495 instances), indicating a major imbalance in the distribution. DDoS attacks (25,545 instances), DoS Hulk attacks (45,887 instances), and PortScan attacks (31,702 instances) are well represented, but there are very few Heartbleed attacks (2 instances), infiltration attacks (9 instances), and SQL injection attacks (5 instances). Because there is insufficient data to train algorithms, this mismatch makes it difficult to classify attacks accurately, especially for under-represented attack types. Overall, the approach exceeds other classifiers in binary and multi-class cases, especially when applied to handling rare attack types, making it the most effective model for the dataset.

Table 1. Number of instances for each attack type

| Class | Number of instances |
|---|---|
| BENIGN | 454495 |
| Bot | 388 |
| DDoS | 25545 |
| DoS GoldenEye | 2020 |
| DoS Hulk | 45887 |
| DoS slowhttptest | 1140 |
| DoS slowloris | 1180 |
| FTP-Patator | 1620 |
| Heartbleed | 2 |
| Infiltration | 9 |
| PortScan | 31702 |
| SSH-Patator | 1164 |
| Brute force | 281 |
| SQL injection | 5 |
| XSS | 142 |

Figure 2. Data distribution for BENIGN and attack classes

Finally, we applied different classification algorithms to the training data to evaluate the proposed model's performance. The methodology phases can be outlined in these five steps:

i) Data extraction: the experiments depend on two datasets, NSL-KDD and CIC-IDS-2017. Table 2 summarizes the datasets used. For model training and evaluation, we divided the datasets into training and test sets. In the case of the NSL-KDD dataset, we used 80% of the instances for training (100,000 instances) and reserved the remaining 20% (25,000 instances) for testing. Similarly, for the CIC-IDS-2017 dataset, 80% of the instances (2,264,594) were allocated for training, and the remaining 20% (566,149 instances) were used for testing.

Table 2. Summary of NSL-KDD and CIC-IDS-2017 datasets

| Dataset name | Number of instances | Number of features | Attacks |
|---|---|---|---|
| NSL-KDD | 125,000 | 41 | DoS, Probe, R2L, and U2R |
| CIC-IDS-2017 | 2,830,743 | 79 | Brute force FTP, brute force SSH, DoS, Heartbleed, web attack, infiltration, botnet, and DDoS |

ii) Preprocessing: the initial step in the preparation of data is to remove constant features that add no meaningful value to the dataset. Subsequently, data encoding is applied to convert non-numeric properties into
numeric representations. This is especially useful for ordinal data, which are categorical data with a particular hierarchy. After encoding, the data is normalized using the MinMaxScaler, which scales features to a predetermined range (usually 0–1) while preserving the structure of the original distribution [21]. By ensuring that every variable contributes equally to the model, this normalization helps to prevent bias and improves the stability and speed of DL and ML algorithms during training. The MinMaxScaler operates by applying (1) to feature values to fit them into the specified range [22].

$$X_{\text{scaled}} = \frac{X - \min(X)}{\max(X) - \min(X)} \qquad (1)$$

iii) Feature selection: Pearson's correlation coefficient is used to determine the correlations between the variables in the datasets to select features. This statistical tool produces a correlation coefficient that ranges from -1 to +1 by evaluating the linear relationship between two continuous variables [23], [24]. A coefficient close to ±1 represents a strong linear link; a coefficient close to 0 denotes no linear association. The methodology assumes that the variables involved are normally distributed, that observations are independent, and that the relationship between variables is linear. Table 3 shows the features that were identified based on the chosen algorithm.

Table 3. Top features from NSL-KDD and CIC-IDS-2017 datasets

| NSL-KDD features | NSL-KDD features (cont.) | CIC-IDS-2017 features | CIC-IDS-2017 features (cont.) |
|---|---|---|---|
| duration | protocol_type | Flow IAT Std | Max Packet Length |
| service | flag | Init_Win_bytes_forward | act_data_pkt_fwd |
| src_bytes | dst_bytes | Subflow Fwd Bytes | Total Backward Packets |
| land | wrong_fragment | Flow IAT Mean | ACK Flag Count |
| urgent | hot | Avg Bwd Segment Size | URG Flag Count |
| num_failed_logins | logged_in | Fwd Packet Length Max | ECE Flag Count |
| num_compromised | root_shell | Packet Length Std | Idle Mean |
| su_attempted | num_root | Init_Win_bytes_backward | Packet Length Mean |
| num_file_creations | num_shells | RST Flag Count | Fwd Header Length |
| num_access_files | is_host_login | Bwd Packet Length Max | min_seg_size_forward |
| is_guest_login | count | Idle Max | Bwd Packets/s |
| srv_count | serror_rate | Total Fwd Packets | Fwd Packet Length Mean |
| srv_serror_rate | rerror_rate | Fwd Header Length.1 | Fwd Packet Length Std |
| srv_rerror_rate | same_srv_rate | PSH Flag Count | Fwd IAT Max |
| diff_srv_rate | srv_diff_host_rate | Active Mean | Idle Min |
| dst_host_count | dst_host_srv_count | Bwd Packet Length Mean | Average Packet Size |
| dst_host_same_srv_rate | dst_host_diff_srv_rate | Fwd PSH Flags | Total Length of Fwd Packets |
| dst_host_same_src_port_rate | dst_host_srv_diff_host_rate | Fwd IAT Std | Flow IAT Max |
| dst_host_serror_rate | dst_host_srv_serror_rate | Bwd Packet Length Std | Avg Fwd Segment Size |
| dst_host_rerror_rate | dst_host_srv_rerror_rate | Flow Packets/s | Down/Up Ratio |
| | | Destination Port | Packet Length Variance |
| | | Subflow Fwd Packets | SYN Flag Count |

iv) Over-sampling: the datasets' class imbalance was addressed using SMOTE. Rather than just copying samples from the existing dataset, this technique generates new, synthetic samples. We specifically employed SMOTE to increase the size of the CIC-IDS-2017 dataset to 24,607,475 instances and the NSL-KDD dataset to 308,830 instances.

v) Classifiers: in the context of ML, a classifier is an algorithm that automatically sorts or groups data into one or more "classes."
Data is categorized or classified according to specific features [25]. In this research, we have used three classifiers: multi-layer perceptron (MLP), RF, and logistic model trees (LMTs).

MLP is a type of ANN that consists of multiple layers of interconnected nodes, called neurons. It is one of the simplest and most used neural network architectures [26]. For binary classification tasks, the output layer of the MLP typically uses the sigmoid activation function. This activation function outputs a value between 0 and 1, which can be interpreted as the probability of the instance belonging to one of the classes. A threshold of 0.5 is commonly used to assign the class label: values above 0.5 are classified as class 1, and values below 0.5 as class 0. In the hidden layers, the rectified linear unit (ReLU) is often employed to introduce non-linearity, helping the model to learn complex patterns in the data.
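To make the MLP description concrete, the following is a minimal sketch of a forward pass with one ReLU hidden layer, a sigmoid output unit, and the 0.5 decision threshold. The layer sizes, function names, and hand-picked weights are hypothetical and chosen only for illustration; they are not the trained model or the architecture used in this study.

```python
import math

def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]

def sigmoid(z):
    """Logistic function; maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases):
    """Fully connected layer: weights[j] holds the weights of output unit j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def predict_binary(x, hidden_w, hidden_b, out_w, out_b, threshold=0.5):
    """Return (probability of class 1, predicted label) for one instance."""
    hidden = relu(dense(x, hidden_w, hidden_b))
    prob = sigmoid(dense(hidden, [out_w], [out_b])[0])
    return prob, int(prob > threshold)

# Hypothetical, untrained weights for a 2-input, 2-hidden-unit network.
hidden_w = [[1.0, -1.0], [-1.0, 1.0]]
hidden_b = [0.0, 0.0]
out_w = [2.0, -2.0]
out_b = 0.0

prob_a, label_a = predict_binary([0.9, 0.1], hidden_w, hidden_b, out_w, out_b)
prob_b, label_b = predict_binary([0.1, 0.9], hidden_w, hidden_b, out_w, out_b)
print(label_a, label_b)
```

With these weights the first input activates only the first hidden unit and is pushed above the threshold (label 1), while the second activates only the other unit and falls below it (label 0). A multi-class output layer would typically replace the single sigmoid unit with a softmax over all classes.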
RF is a popular ML algorithm that belongs to the ensemble learning category. It is used for both classification and regression tasks and builds on the concept of the decision tree [27].

LMTs combine decision tree structures with logistic regression functions. LMTs divide the instance space into discrete regions, each represented by a leaf node. A logistic regression model is stored in each leaf node and is used to categorize the instances that fall into the corresponding region [28].

4. RESULTS AND DISCUSSION
We evaluate the classification models using binary and multi-class labels to identify the most effective IDS model. In the binary classification setup, all attack instances are labeled as 1, and all normal instances are labeled as 0. For multi-class classification, on the other hand, the targeted attack instances are labeled as 1, other types of attack instances are labeled as 2, and all normal instances are labeled as 0. We illustrate the effect of the label on accuracy by performing an extensive performance analysis of the models on the NSL-KDD and CIC-IDS-2017 datasets. The results of applying the classifiers described in the previous section to the NSL-KDD dataset are presented in Table 4.

Table 4. Performance metrics for binary and multi-class NSL-KDD

| Classifier | Metric | Binary | U2R | DoS | R2L | Probe |
|---|---|---|---|---|---|---|
| MLP | Precision | 0.99 | 0.84 | 0.99 | 0.24 | 0.99 |
| MLP | Recall | 0.99 | 0.94 | 0.99 | 0.43 | 0.99 |
| MLP | F1-score | 0.99 | 0.89 | 0.99 | 0.31 | 0.99 |
| RF | Precision | 0.99 | 0.91 | 0.99 | 0.65 | 0.99 |
| RF | Recall | 0.99 | 0.94 | 0.99 | 0.62 | 0.99 |
| RF | F1-score | 0.99 | 0.93 | 0.99 | 0.63 | 0.99 |
| LMT | Precision | 0.88 | 0.34 | 0.98 | 0.02 | 0.76 |
| LMT | Recall | 0.94 | 0.88 | 0.97 | 0.71 | 0.94 |
| LMT | F1-score | 0.91 | 0.49 | 0.97 | 0.04 | 0.84 |

Table 4 provides information on the precision, recall, and F1-score performance measures for three different classifiers: MLP, RF, and LMT.
These classifiers were evaluated using the binary and multi-class NSL-KDD datasets. In binary classification, both RF and MLP perform almost optimally with similar metrics. They both achieve an F1-score, precision, and recall of 0.99, which indicates that they identify instances with minimal errors. LMT, on the other hand, performs somewhat worse than the other classifiers, with a precision of 0.88, a recall of 0.94, and an F1-score of 0.91. This suggests that although it can detect true positives, it produces more false positives than the other classifiers, as shown in Figure 3.

Figure 3. Binary NSL-KDD classification

In multi-class classification, there is a significant difference in performance among the four attack categories (U2R, DoS, R2L, and Probe). In most classes, the MLP performs effectively; it achieves high metrics for DoS and Probe (all approximately 0.99). However, it has significant issues with R2L, as indicated by a low F1-score of 0.31, which is due to poor precision (0.24) and recall (0.43). On the other
hand, in multi-class scenarios, the RF algorithm typically outperforms MLP, attaining high precision, recall, and F1-scores in most categories. It records an F1-score of 0.93 for U2R and performs well in the DoS and Probe classifications. For R2L, however, it still faces some moderate issues. LMT shows a noticeable drop in metrics for U2R and R2L and inconsistent performance across the multi-class categories. In U2R, it achieves a comparatively high recall (0.88) but low precision (0.34), resulting in a lower F1-score of 0.49. The R2L category presents significant challenges for LMT: a precision of 0.02 and an F1-score of 0.04 indicate that it is almost useless at correctly identifying this category. Despite these difficulties, LMT performs satisfactorily in the DoS and Probe categories, especially in recall. Finally, while RF and MLP both perform exceptionally well in binary classification, RF is the more robust model in multi-class classification, especially when dealing with a variety of attack types, whereas LMT clearly shows deficiencies, particularly concerning less prevalent attack classes. Figure 4 illustrates the performance of multi-class classification for the NSL-KDD dataset.

Figure 4. Multi-class NSL-KDD classification

For the CIC-IDS-2017 dataset, Tables 5 and 6 provide a comprehensive comparison of MLP, RF, and LMT across different attack categories and general types. In the binary classification framework of the CIC-IDS-2017 dataset, we found that the MLP exhibits strong results, with a precision of 0.92, a recall of 0.99, and an F1-score of 0.95. This implies that the MLP is usually good at distinguishing between legitimate and malicious traffic. However, it should be noted that its precision is lower than that of RF.
On the other hand, the RF classifier performs better on all evaluation measures, achieving an F1-score of 0.99, a precision of 0.99, and a recall of 0.99. This indicates that RF can classify instances as malicious or legitimate with remarkable precision and reliability. LMT, however, has a significantly lower performance, with a precision of 0.52, a recall of 0.80, and an F1-score of 0.63. The suboptimal precision and F1-score imply that LMT encounters greater challenges in accurately classifying instances while sustaining a balance between precision and recall, as shown in Figure 5.

Table 5. Performance metrics for binary and multi-class CIC-IDS-2017

| Classifier | Metric | Binary | Bot | DDoS | DoS GoldenEye | DoS Hulk | DoS Slowhttptest | DoS Slowloris |
|---|---|---|---|---|---|---|---|---|
| MLP | Precision | 0.92 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 |
| MLP | Recall | 0.99 | 0.67 | 0.99 | 0.99 | 1.00 | 0.99 | 0.98 |
| MLP | F1-score | 0.95 | 0.80 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 |
| RF | Precision | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.99 |
| RF | Recall | 0.99 | 0.96 | 1.00 | 1.00 | 1.00 | 0.99 | 1.00 |
| RF | F1-score | 0.99 | 0.98 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| LMT | Precision | 0.52 | 0.00 | 0.99 | 0.91 | 0.97 | 0.88 | 0.87 |
| LMT | Recall | 0.80 | 0.00 | 0.97 | 0.85 | 0.95 | 0.71 | 0.81 |
| LMT | F1-score | 0.63 | 0.00 | 0.98 | 0.88 | 0.96 | 0.79 | 0.84 |
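As a quick consistency check on the binary CIC-IDS-2017 column of Table 5, the F1-score can be recomputed from the reported precision $P$ and recall $R$. The formula below is the standard harmonic-mean definition, which the source does not state explicitly; the inputs are the rounded values from the table, so agreement is only expected to two decimal places.

```latex
\begin{align*}
F_1 &= \frac{2PR}{P + R} \\[4pt]
\text{LMT:}\quad F_1 &= \frac{2(0.52)(0.80)}{0.52 + 0.80} = \frac{0.832}{1.32} \approx 0.63 \\[4pt]
\text{MLP:}\quad F_1 &= \frac{2(0.92)(0.99)}{0.92 + 0.99} = \frac{1.8216}{1.91} \approx 0.95
\end{align*}
```

Both values match the reported F1-scores. Because the harmonic mean is pulled toward the smaller of the two inputs, LMT's low precision (0.52) keeps its F1-score well below its recall of 0.80.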
Table 6. Performance metrics for binary and multi-class CIC-IDS-2017

| Classifier | Metric | FTP-Patator | Heartbleed | Infiltration | PortScan | SSH-Patator | Brute Force | SQL Injection | XSS |
|---|---|---|---|---|---|---|---|---|---|
| MLP | Precision | 1.00 | 0.00 | 1.00 | 1.00 | 0.98 | 0.61 | 0.00 | 0.70 |
| MLP | Recall | 1.00 | 0.00 | 0.22 | 1.00 | 0.99 | 0.19 | 0.00 | 0.10 |
| MLP | F1-score | 1.00 | 0.00 | 0.36 | 1.00 | 0.99 | 0.29 | 0.00 | 0.10 |
| RF | Precision | 1.00 | 1.00 | 0.83 | 1.00 | 1.00 | 0.71 | 1.00 | 0.50 |
| RF | Recall | 1.00 | 1.00 | 0.56 | 1.00 | 1.00 | 0.76 | 0.20 | 0.40 |
| RF | F1-score | 1.00 | 1.00 | 0.67 | 1.00 | 1.00 | 0.73 | 0.33 | 0.40 |
| LMT | Precision | 0.84 | 0.50 | 0.00 | 0.90 | 0.87 | 0.00 | 0.00 | 0.00 |
| LMT | Recall | 1.00 | 1.00 | 0.00 | 1.00 | 0.51 | 0.00 | 0.00 | 0.00 |
| LMT | F1-score | 0.91 | 0.67 | 0.00 | 0.94 | 0.64 | 0.00 | 0.00 | 0.00 |

Figure 5. Binary CIC-IDS-2017 classification

In terms of multi-class classification, we found that the MLP classifier performs well (≥ 0.99) in most categories; nevertheless, for "Heartbleed", "Brute Force", "SQL Injection", and "XSS", precision decreases significantly (from 0.00 towards 0.73). The MLP recall numbers show some variation; for example, it performs well in the "DDoS" category (0.99) and the "DoS" attack categories (0.98–1.00), but it performs poorly in the "Heartbleed", "Infiltration", and "XSS" categories (0.00–0.06). In several categories, the F1-scores for MLP are high (0.99 for some). Still, they are significantly lower in "Heartbleed", "SQL Injection", and "XSS", indicating difficulties in finding a balance between precision and recall for various attack types. RF consistently maintains a high recall (≥ 0.96), although some lower values (0.56 to 0.39) are shown in "Infiltration", "SQL Injection", and "XSS", indicating a tendency to miss some rare attacks. The F1-scores for RF are consistently high (≥ 0.98) in most categories; nevertheless, they are lower in "Infiltration", "SQL Injection", and "XSS", showing certain domains where it fails to balance precision and recall.
LMT displays a varied precision profile, with high scores in 'DDoS', 'DoS GoldenEye', 'DoS Hulk', and 'FTP-Patator' (from 0.84 to 0.99) and very low scores in 'Bot', 'Heartbleed', 'Infiltration', and 'XSS' (from 0.00 to 0.50). LMT recall is poor in categories such as 'Bot', 'SQL Injection', and 'XSS' (0.00) but good in 'DDoS', 'DoS GoldenEye', and 'DoS Hulk' (from 0.85 to 0.97). The F1-scores of LMT are highest in 'DDoS', 'DoS GoldenEye', 'DoS Hulk', and 'FTP-Patator' (from 0.88 to 0.98) but low or zero in several other categories, indicating that LMT faces significant difficulties with less frequent or rare attack scenarios. Figure 6 presents the performance metrics for multi-class classification on the CIC-IDS-2017 dataset. Figure 6(a) illustrates the precision values for each attack type across the different classifiers, Figure 6(b) shows the corresponding recall performance, and Figure 6(c) displays the F1-scores, which summarize the balance between precision and recall. Overall, the RF classifier achieved consistently higher scores across most attack categories.
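The per-class figures discussed above follow from one-vs-rest counts of true positives, false positives, and false negatives for each attack label. The following minimal sketch shows how such per-class scores could be computed and why a class that is never predicted ends up with zeros in all three metrics. The function `per_class_metrics`, the toy labels, and the convention of reporting 0.0 when a denominator is zero are assumptions made for illustration; they are not the authors' implementation or data.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} from one-vs-rest counts."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # the predicted class gains a false positive
            fn[truth] += 1  # the true class gains a false negative

    metrics = {}
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        # Assumed convention: report 0.0 when a metric is undefined.
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


# Hypothetical toy labels (not from the paper): 8 DDoS flows, 2 XSS flows.
y_true = ["DDoS"] * 8 + ["XSS"] * 2
# The hypothetical classifier never predicts the rare XSS class.
y_pred = ["DDoS"] * 10

metrics = per_class_metrics(y_true, y_pred)
# Unweighted (macro) average of the per-class F1-scores.
macro_f1 = sum(f1 for _, _, f1 in metrics.values()) / len(metrics)
print(metrics, macro_f1)
```

In this toy case the frequent class keeps a recall of 1.0 while the rare class scores zero throughout, so an unweighted average over classes falls well below the score of the dominant class. This is the same pattern as the all-zero rows for rare attack types in Tables 5 and 6.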
Figure 6. Performance metrics for multi-class in CIC-IDS-2017 dataset: (a) precision for multi-class, (b) recall for multi-class, and (c) F1-score for multi-class

IDS performance can differ notably between binary and multi-class classification. By concentrating on differentiating between benign and malicious communications, binary classification frequently improves the detection process and can increase the detection rates of minority classes. Multi-class classification, on the other hand, seeks to distinguish between different types of attacks, which may make it difficult to reliably identify attacks that occur rarely.

Table 7 compares our findings with other recent studies that analyze the effects of binary and multi-class classification on IDS performance. Using DL models, Singh et al. [29] demonstrated two cutting-edge IDS. The first employs Bi-LSTM and LuNet, while the second combines a temporal convolutional network (TCN), CNN, and bidirectional long short-term memory (Bi-LSTM). The systems outperform conventional ML models in tests conducted on the NSL-KDD and UNSW-NB15 datasets. Classification accuracy of up to 99% was achieved by using ensemble