IAES International Journal of Artificial Intelligence (IJ-AI)
Vol. 15, No. 1, February 2026, pp. 628-641
ISSN: 2252-8938, DOI: 10.11591/ijai.v15.i1.pp628-641

Predicting university student dropouts in Latin America using machine learning

Laberiano Andrade-Arenas 1, Inoc Rubio Paucar 2, Margarita Giraldo Retuerto 1, Cesar Yactayo-Arias 3
1 Facultad de Ciencias e Ingeniería, Universidad de Ciencias y Humanidades, Lima, Perú
2 Facultad de Ingeniería y Negocios, Universidad Privada Norbert Wiener, Lima, Perú
3 Departamento de Estudios Generales, Universidad Continental, Lima, Perú

Article history: Received Aug 14, 2025; Revised Dec 29, 2025; Accepted Jan 22, 2026

Keywords: Decision making; Machine learning; Predictive model; Random forest; Student dropout

ABSTRACT

In the university context, student dropout has become one of the most recurrent problems, in both the short and long term. The objective of this research was to develop a predictive model using the random forest (RF) algorithm to identify patterns associated with university dropout. To achieve this, the knowledge discovery in databases (KDD) methodology was applied, which encompasses the stages of selection, preprocessing, transformation, data mining, and interpretation of results. The RF model demonstrated superior performance compared to the other evaluated models, achieving an accuracy of 87%, a precision of 86%, a recall of 85%, an F1-score of 85%, and a receiver operating characteristic (ROC) area under the curve (AUC) of 0.91, highlighting its high predictive capability relative to the other techniques analyzed. Therefore, the application of the proposed model is recommended in various university institutions in order to identify potential dropout cases at an early stage.

This is an open access article under the CC BY-SA license.
Corresponding Author:
Laberiano Andrade-Arenas
Facultad de Ciencias e Ingeniería, Universidad de Ciencias y Humanidades
Lima, Perú
Email: landrade@uch.edu.pe

1. INTRODUCTION

In the current context, university student dropout has become one of the most pressing global issues, with both social and economic implications. High dropout rates limit students' professional development, reduce the efficiency of higher education institutions, and directly impact the growth and competitiveness of countries. This situation not only represents a loss of talent and resources but also undermines efforts to ensure quality education [1], [2]. It is essential to take action in response to this situation, as student dropout has become a common occurrence in universities, driven by multiple factors.

Despite institutional efforts to improve educational quality, student dropout in higher education remains a persistent and multifactorial challenge. The causes of dropout are diverse and include academic, personal, economic, and contextual factors, which make timely identification difficult through traditional methods. This complexity prevents many universities from anticipating dropout risk and implementing effective interventions in a timely manner [3], [4]. Moreover, the limited availability of resources to carry out individualized student monitoring further complicates the implementation of appropriate preventive strategies.

This situation not only affects institutional performance and educational planning, but also represents a significant loss of human talent, public investment, and personal and professional development opportunities for students. In addition, the emotional and motivational impact of dropping out can affect students' self-esteem,

Journal homepage: http://ijai.iaescore.com
creating a negative effect on their family and social environment [5], [6]. Therefore, it is urgent to strengthen academic support policies, guidance, and comprehensive assistance that can help address this issue from a more humane and inclusive perspective. It is considered essential to approach student dropout with greater attention, as it represents a significant loss not only for students, but also for institutions and society as a whole.

This research is justified by the urgent need to reduce dropout rates in higher education, a problem that negatively impacts students, institutions, and national development. The multifactorial causes of dropout make early detection difficult through traditional methods, limiting the implementation of effective preventive strategies [7], [8]. In this context, it becomes essential to have tools that allow for the analysis of large volumes of academic data and the generation of accurate predictions regarding dropout risk. Machine learning emerges as an innovative and effective alternative for this purpose, as it enables the construction of predictive models capable of identifying risk patterns based on available data. This research will contribute to the development of decision-support systems in universities, facilitating timely and personalized interventions to improve student retention and promote academic success [9], [10]. Anticipating student dropout is essential, as timely intervention not only enhances academic performance but also provides greater opportunities for students' personal and professional development.

The objective of this research is to develop a predictive model based on the random forest (RF) algorithm to identify patterns of student dropout, with the aim of optimizing strategic decision-making in the university context.

2.
LITERATURE REVIEW

This section presents a thorough review of various studies related to the topic addressed, with the purpose of providing a broad and well-founded perspective on the subject of study. Additionally, the theoretical frameworks consulted support the selection and interpretation of the variables considered in the analysis.

2.1. Related works

One line of research proposed a machine learning-based approach for evaluating teaching performance. To address this issue, several classification algorithms were implemented using the Python programming language, including k-nearest neighbors (KNN), extra trees, light gradient boosting machine (LightGBM), and the CatBoost classifier, among others. The results showed that the proposed model achieved a 2% higher accuracy compared to the other evaluated algorithms, highlighting its effectiveness in the educational context. In a complementary area, a student dropout prediction system was developed using machine learning algorithms, based on a longitudinal dataset collected from university students. The results indicated that the risk of dropout is primarily associated with factors such as academic department, gender, and socioeconomic group [11], [12]. Another relevant contribution by Niyogisubizo et al. [13] was a hybrid dropout prediction model, which combines the RF, extreme gradient boosting (XGBoost), gradient boosting (GB), and feedforward neural network (FNN) algorithms. The model's performance was evaluated using the area under the curve (AUC), showing promising results in identifying factors related to school dropout. The analysis highlighted the impact of uncontrolled behaviors as a key variable in dropout risk. On the other hand, Vives et al. [14] emphasize the effectiveness of long short-term memory (LSTM) networks in predicting academic performance.
Through comparisons between different models based on metrics such as accuracy, precision, recall, and F1-score, the superiority of the LSTM-generative adversarial network (GAN) model was confirmed, achieving an accuracy of 98.3% in week 8, followed by the deep neural network (DNN)-GAN model with 98.1%. In the context of predicting dropout in postgraduate programs, classification models such as logistic regression, RF, and neural networks were developed and optimized using resampling techniques to address class imbalance (synthetic minority over-sampling technique (SMOTE), SMOTE-support vector machine (SVM), and adaptive synthetic sampling (ADASYN)), as well as through hyperparameter tuning. The best-performing model was the neural network combined with SMOTE-SVM, achieving a recall value of 0.75, followed by logistic regression with 0.67 and RF with 0.60, the latter also demonstrating strong generalization ability with an optimal decision threshold of 0.427. Complementarily, another study focused on student dropout implemented a predictive model based on LightGBM, which achieved outstanding performance with an F1-score of 0.840, surpassing the results of previous studies that addressed the class imbalance issue. The model's effectiveness was enhanced through the application of oversampling techniques such as SMOTE, ADASYN, and Borderline-SMOTE, which helped improve class distribution and optimize the system's predictive capacity, as noted in [15], [16]. Another application oriented toward virtual learning environments adopted a hybrid approach using machine learning algorithms, specifically RF and XGBoost, to classify students at risk of dropping out. The
model achieved outstanding results, with an accuracy of 93%, a precision of 91.52%, a recall of 96.42%, and an F1-score of 93.91%, demonstrating its high effectiveness in the early detection of academic dropout. A further relevant contribution related to student dropout involved the development of a university dropout prediction system. For this purpose, a software prediction program was created based on machine learning models to identify the correlation between variables and student dropout. The models were evaluated for accuracy, with artificial neural networks of the perceptron type achieving the highest accuracy at 98.1% [17], [18]. Recent studies have developed a university dropout prediction system that significantly improved accuracy (0.963) and recall rate (0.766) by using dimensionality reduction techniques with principal component analysis (PCA) and clustering through K-means. The model outperformed the best previous approach by 0.093 in accuracy and achieved an F1-score of 0.808, surpassing the GB method. Additionally, it identified four main causes of dropout: employment, non-registration, personal problems, and admission to another university, the latter being the most accurately predicted (0.672). In a separate study, a classification model was implemented using machine learning techniques to anticipate student dropout with high levels of accuracy. The proposal followed a technological methodology with a propositional focus, incremental innovation, and synchronous scope. Data collection was conducted through a 20-question survey administered to 237 postgraduate students enrolled in education master's programs. The model, based on gradient boosting machine (GBM), yielded outstanding results: a Gini coefficient of 92.20%, an AUC of 96.10%, and a LogLoss of 24.24%.
These results enabled the effective identification of key factors behind student dropout and provided a strategic tool for educational management [19], [20]. In a relevant alternative approach, the research in [21], [22] applied data mining techniques using academic grades as key predictive variables, combined with various machine learning algorithms aimed at modeling university dropout. The results demonstrated strong model performance, achieving an F1-score of 81% on the final test set. These findings suggest that students' academic performance is a representative indicator of their living conditions and, therefore, allows for the early detection of potential dropout cases in higher education. This supports the idea that academic success is influenced by multiple factors, including class imbalance, which justifies the use of supervised machine learning algorithms such as decision trees (DT) and SVM. However, boosting algorithms, especially LightGBM and CatBoost optimized with Optuna, showed superior performance compared to traditional classifiers, establishing themselves as more effective approaches for academic prediction, as highlighted by the aforementioned authors. In another instance, when analyzing dropout risk among undergraduate students, unsupervised clustering algorithms were applied alongside RF and probability threshold adjustment. The traditional model yielded a low accuracy of 13.2% in predicting dropout, compared to 99.4% in retention. However, after adjusting the threshold, the accuracy in detecting dropout exceeded 50%, while overall and retention accuracy remained above 70% [23], [24]. Further research addresses dropout in massive open online courses (MOOCs), proposing the use of the RF algorithm to predict this phenomenon.
The model demonstrated strong performance, achieving an accuracy of 87.5%, an AUC of 94.5%, a precision of 88%, a recall of 87.5%, and an F1-score of 87.5%, highlighting its effectiveness in the early detection of university students at risk of dropping out. In addition, risk factors associated with dropout in university programs were identified by applying various machine learning algorithms, among which RF exhibited the most notable performance. The highest level of predictive accuracy was reached at the end of the first semester, once sufficient academic information about the students had been collected. At this stage, the model produced performance indicators comparable to those reported in previous research on early identification of dropout risk and low academic achievement [25], [26]. Several studies highlight the relevance of applying machine learning techniques in this context, particularly models such as GB, RF, and SVM, which have shown promising results for supporting institutional decision-making and for designing preventive strategies in university settings [27].

2.2. Student dropout

Student dropout in universities is defined as the student's decision to interrupt their studies for various context-related reasons, whether the interruption is temporary or permanent. Dropout represents a critical issue for universities, as it impacts the efficiency of the educational system, the allocation of resources, and the development of qualified human capital [28], [29]. This phenomenon arises from multiple causes, as outlined in Table 1. In this regard, it is advisable to closely monitor university students' academic performance, as it can significantly influence their long-term professional success or failure.
Table 1. Main causes of university student dropout
Category | Specific cause | Example | Impact type
Academic | Low performance | Continuous failure of courses | Academic
Economic | Lack of resources | Cannot afford tuition or transportation | Economic
Vocational | Demotivation | Insecurity about career choice | Emotional/Vocational
Family | Family problems | Conflicts or responsibilities at home | Psychological
Institutional | Lack of mentoring | Poor academic support | Institutional
Social | Discrimination or exclusion | By gender, race, or social class | Social/Cultural
Health | Medical or psychological problems | Anxiety, depression, chronic illnesses | Staff
Labor | Need to work | Dropping out of school to work full-time | Economic/Labor

2.3. Random forest

The RF algorithm is a supervised machine learning method based on ensemble techniques, which involves building multiple independent DTs and combining their predictions to obtain more robust, accurate, and generalizable results. This model uses the bagging method, where each tree is trained on a random sample of the dataset, and at each split in the tree a random subset of features is considered, which helps reduce the correlation between trees. For classification tasks, the final result is determined by majority voting, while for regression tasks it is calculated by averaging the predictions. This approach improves performance by reducing overfitting and efficiently handles large volumes of data. However, its main drawback is its lower interpretability compared to a single DT [30], [31]. This type of algorithm can be applied in various contexts, such as medicine, education, and finance, depending on the domain in which it is used.

3. METHODOLOGY

The knowledge discovery in databases (KDD) methodology is a comprehensive and systematic process aimed at transforming large volumes of raw data into useful, novel, understandable, and relevant knowledge for decision-making.
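As a brief aside, the ensemble mechanics described in section 2.3 (a bootstrap sample per tree, a random feature subset per split, and majority voting across trees) can be sketched with scikit-learn. The synthetic dataset and parameter values below are illustrative assumptions, not the study's actual configuration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a student dataset (placeholder, not the real data)
X, y = make_classification(n_samples=374, n_features=10, random_state=42)

# Each tree is trained on a bootstrap sample; max_features="sqrt" means a
# random subset of features is considered at every split
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)
rf.fit(X, y)

# The ensemble prediction is the majority vote of the individual trees
votes = np.array([tree.predict(X[:5]) for tree in rf.estimators_])
majority = (votes.mean(axis=0) >= 0.5).astype(int)
ensemble = rf.predict(X[:5])
```

Note that scikit-learn internally averages per-tree class probabilities rather than counting hard votes, so the hand-rolled majority vote above is a didactic approximation of the same idea.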
This process includes several interrelated stages: the selection of relevant data; cleaning and preprocessing to remove inconsistencies or outliers; transformation into suitable formats; application of data mining techniques to extract meaningful patterns; and finally, the evaluation, interpretation, and presentation of the discovered knowledge in a way that can be understood and used by organizations [32], [33]. In this study, the process is applied to an institutional dataset composed of academic, socio-economic, and demographic student records, involving approximately 510 students. Before modeling, the data underwent cleaning, treatment of missing values, detection of outliers, and normalization to ensure analytical reliability. Figure 1 presents the phases of the KDD process, illustrating the data flow toward obtaining relevant results that support informed decision-making. Meanwhile, Figure 2 shows the architecture implemented for student data analysis. The workflow starts with the ingestion of datasets in formats such as .CSV, .XLSX, .TXT, and .JSON, processed using Python. The architecture integrates libraries such as Scikit-learn, XGBoost, NumPy, and Pandas, applying preprocessing steps including cleaning, standardization, and transformation. The RF model was subsequently implemented, allocating 80% of the dataset for training and the remaining 20% for validation. Finally, the model is evaluated through metrics such as accuracy, the confusion matrix, F1-score, precision, recall, and ROC-AUC curves, aiming to obtain meaningful results that contribute to decision-making in educational contexts.

3.1. Selection

This section presents a thorough search focused on selecting the most appropriate dataset for the development of the machine learning project. The selection was based on the research objective, prioritizing data relevance, quality, and availability.
To achieve this, several specialized platforms for public dataset distribution were explored, with Kaggle standing out as a leading platform due to its robustness and wide variety of datasets from different fields of knowledge. Kaggle is a reliable and up-to-date source, supported by an active scientific community that shares high-quality data along with detailed technical descriptions [34]. This allowed for the selection of a dataset aligned with the project's goals, ensuring a solid foundation for subsequent analysis, preprocessing, and modeling using machine learning techniques such as RF. It is important to note that Kaggle offers datasets across various domains and hosts competitions and publications centered on machine learning.
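As an illustration of this selection step, a Kaggle-style dropout dataset (typically distributed as a .CSV file) can be loaded and inspected with pandas. The file contents and column names below are hypothetical, chosen only to mirror the kinds of variables discussed later:

```python
import io
import pandas as pd

# Hypothetical excerpt of a dropout dataset; in practice this would be
# pd.read_csv("dropout.csv") on the file downloaded from Kaggle.
csv_text = """marital_status,tuition_payment,gdp,target
single,1,15000,dropout
married,0,22000,graduate
single,1,31000,graduate
"""
df = pd.read_csv(io.StringIO(csv_text))

# Basic relevance/quality checks: shape, columns, and missing values
n_rows, n_cols = df.shape
missing_per_column = df.isna().sum()
```

Checks like these are what "prioritizing data relevance, quality, and availability" amounts to in practice before any modeling begins.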
Figure 1. KDD methodology

Figure 2. Machine learning architecture

3.2. Preprocessing and transformation

This section presents the preprocessing and transformation of the data. Tables 2 to 5 show the results of the exploratory data analysis and the initial stages of variable preparation for the predictive model. Table 2 displays the analysis of missing values, where low percentages of missing data are observed: 3.75% for mother's occupation and 4.01% for father's occupation. Variables such as debtor, tuition payment, and unemployment rate do not contain any missing values. Table 3 details the distribution of participants by marital status, with "single" being the most common category, followed by married, contributing to the sociodemographic profile of the
study. Table 4 presents the correlation between economic indicators, revealing negative relationships between gross domestic product (GDP) and both the unemployment rate (-0.40) and the inflation rate (-0.55), suggesting a link between economic growth and improved social conditions. Lastly, Table 5 shows the discretization of GDP into three levels (low, medium, and high), with the medium level being the most frequent. This facilitates its integration into classification models such as RF. These tables help to understand how the data is structured and what transformations are applied prior to modeling.

Table 2. Exploratory data analysis and preprocessing results: missing data analysis
Variable | Missing (%)
Marital status | 2.14
Day/Night attendance | 0.00
Mother's occupation | 3.75
Father's occupation | 4.01
Debtor | 0.00
Tuition payment | 0.00
International student | 0.27
Unemployment rate | 0.00
Inflation rate | 0.00
GDP | 0.00

Table 3. Exploratory data analysis and preprocessing results: marital status distribution
Category | Frequency
Single | 210
Married | 125
Divorced | 30
Widowed | 9
Total | 374

Table 4. Exploratory data analysis and preprocessing results: correlation between economic indicators
 | Unemployment rate | Inflation rate | GDP
Unemployment rate | 1.00 | 0.65 | -0.40
Inflation rate | | 1.00 | -0.55
GDP | | | 1.00

Table 5. Exploratory data analysis and preprocessing results: discretization of GDP values
Category | GDP range | Frequency
Low | GDP < 10000 | 80
Medium | GDP 10000-30000 | 200
High | GDP > 30000 | 94
Total | | 374

Table 6 presents the descriptive statistics of the numerically encoded quantitative variables in the dataset. These statistics provide an overview of the behavior of the sociodemographic and economic variables considered in the study. The variable marital status has a mean value of 1.78, indicating that most participants fall between the categories of single and married.
Similarly, the mean value for attendance (day or night) is 1.25, suggesting a higher proportion of students attending daytime classes. Parental occupation variables show average values close to 1.5, reflecting an intermediate distribution among the employed, unemployed, and "other" categories. Regarding binary variables such as debtor, tuition payment, and international student, the low mean values indicate that most individuals are not in debt, are up to date with tuition payments, and are not international students, respectively. On the other hand, economic indicators reveal an average unemployment rate of 6.20%, an inflation rate of 2.45%, and an average GDP of 21,500.75 monetary units. These values help to understand the economic context in which the participants are situated and provide a solid foundation for further analysis. Overall, the statistical information of these variables facilitates data preparation and attribute selection for the construction of predictive models.
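The missing-value analysis of Table 2 and the GDP discretization of Table 5 can be sketched with pandas. The toy values below are illustrative; only the GDP cut points are taken from Table 5:

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps (illustrative, not the study's data)
df = pd.DataFrame({
    "mothers_occupation": [1, 2, np.nan, 1, 2, np.nan, 1, 3],
    "gdp": [8000, 12000, 25000, 31000, 9500, 28000, 42000, 16000],
})

# Missing-value percentage per variable, as reported in Table 2
missing_pct = df.isna().mean() * 100

# Discretize GDP into the three levels of Table 5 (< 10000, 10000-30000, > 30000)
df["gdp_level"] = pd.cut(
    df["gdp"],
    bins=[-np.inf, 10000, 30000, np.inf],
    labels=["low", "medium", "high"],
)
level_counts = df["gdp_level"].value_counts()
```

`pd.cut` with explicit bin edges makes the discretization reproducible, and the resulting categorical column can be fed directly into tree-based classifiers after encoding.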
Table 6. Descriptive statistics of selected variables
Variable | Count | Mean | Std. Dev. | Min | Q1 (25%) | Q2 (Median) | Q3 (75%) | Max
Marital status | 366.00 | 1.78 | 0.89 | 1.00 | 1.00 | 2.00 | 2.00 | 4.00
Day/Night attendance | 374.00 | 1.25 | 0.43 | 1.00 | 1.00 | 1.00 | 1.00 | 2.00
Mother's occupation | 360.00 | 1.65 | 0.75 | 1.00 | 1.00 | 2.00 | 2.00 | 3.00
Father's occupation | 359.00 | 1.52 | 0.70 | 1.00 | 1.00 | 1.00 | 2.00 | 3.00
Debtor | 374.00 | 0.25 | 0.43 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00
Tuition payment | 374.00 | 0.20 | 0.40 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00
International student | 373.00 | 0.09 | 0.29 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00
Unemployment rate | 374.00 | 6.20 | 1.45 | 3.20 | 5.10 | 6.00 | 7.30 | 9.80
Inflation rate | 374.00 | 2.45 | 0.65 | 1.00 | 2.00 | 2.40 | 2.90 | 4.10
GDP | 374.00 | 21500.75 | 7800.42 | 8000.00 | 16000.00 | 21000.00 | 27000.00 | 42000.00

3.3. Data mining

For this procedure, Figure 3 illustrates the architecture underlying the decision-making process within the DT framework, based on the dataset used. Meanwhile, Figure 4 displays the results that allow the evaluation of the RF model's performance in predicting student dropout. Figure 4(a) shows the ROC curve, depicting the relationship between the true positive rate and the false positive rate, with a high AUC indicating strong discrimination between students who drop out and those who do not. Figure 4(b) presents the precision-recall curve, showing the balance between precision and recall, including the AUC value and the optimal threshold, which is especially useful in scenarios involving class imbalance. Figure 4(c) illustrates the relationship between sensitivity and specificity across different classification thresholds. As the threshold increases, sensitivity decreases while specificity increases. The intersection point of the two curves at approximately a threshold of 0.4 suggests a possible balance between these metrics. The legend includes the formulas, and the sidebar indicates the threshold values.

Figure 3. Decision tree representation
Figure 4. Classification model performance evaluation: (a) ROC curve with the AUC, (b) precision-recall curve with AUC and optimal threshold, and (c) sensitivity and specificity across thresholds
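The curves of Figure 4 can be reproduced in outline with scikit-learn. The sketch below computes ROC points and the sensitivity/specificity trade-off across thresholds on synthetic scores; the labels and predicted probabilities are assumed values, not the study's outputs:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Synthetic ground truth and predicted dropout probabilities (illustrative)
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
scores = np.clip(y_true * 0.6 + rng.normal(0.2, 0.25, size=200), 0, 1)

# ROC curve points and AUC, as in Figure 4(a)
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)

# Sensitivity = TPR and specificity = 1 - FPR, as plotted in Figure 4(c);
# sweeping the threshold trades one against the other
sensitivity = tpr
specificity = 1 - fpr
```

Plotting `sensitivity` and `specificity` against `thresholds` yields the crossing-point view described for Figure 4(c).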
3.3.1. Mathematical foundation

The predictive model for student dropout is based on the RF algorithm, an ensemble learning method that combines multiple DTs to improve accuracy and robustness. The following mathematical formulations provide the theoretical foundation for this methodology [35].

Data representation: we define the training dataset as

$$D = \{(x_i, y_i)\}_{i=1}^{n}, \quad x_i \in \mathbb{R}^d, \quad y_i \in \{0, 1\} \tag{1}$$

where $x_i$ denotes the feature vector of student $i$, and $y_i$ is the binary target variable: 1 if the student drops out, and 0 otherwise [36].

Gini impurity: each DT splits the dataset using impurity functions. The Gini impurity is defined as [37]:

$$G(p) = 1 - \sum_{k=1}^{K} p_k^2 \tag{2}$$

In binary classification ($K = 2$), it simplifies to

$$G(p) = 2p(1 - p) \tag{3}$$

where $p$ is the probability of belonging to one of the two classes (dropout or not).

Shannon entropy (alternative): as an alternative to Gini, the Shannon entropy can be used:

$$H(p) = -\sum_{k=1}^{K} p_k \log_2(p_k) \tag{4}$$

RF prediction: let $h_m(x)$ denote the prediction of tree $m$. The final prediction is based on majority voting:

$$\hat{y} = \mathrm{mode}(h_1(x), h_2(x), \ldots, h_M(x)) \tag{5}$$

The estimated probability that a student drops out is

$$P(y = 1 \mid x) = \frac{1}{M} \sum_{m=1}^{M} I(h_m(x) = 1) \tag{6}$$

where $I(\cdot)$ is the indicator function, which returns 1 if the condition is true and 0 otherwise.

Feature importance: the importance of each feature $x_j$ is evaluated as

$$\mathrm{Imp}(x_j) = \sum_{t \in T_j} \frac{n_t}{n} \cdot \phi_t \tag{7}$$

where $T_j$ is the set of nodes where feature $x_j$ is used, $n_t$ is the number of samples at node $t$, and $\phi_t$ is the impurity reduction at that node.

Evaluation metrics: the model is evaluated using the following standard classification metrics.
Precision:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{8}$$

Recall (sensitivity):

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{9}$$

F1-score:

$$F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{10}$$

Accuracy:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{11}$$

These metrics help determine how well the model identifies students at risk of dropping out.
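Equations (3), (5), and (8)-(11) can be checked directly with a few lines of Python. This is a didactic sketch, not the study's production pipeline:

```python
import numpy as np

def gini(p):
    """Binary Gini impurity, G(p) = 2p(1 - p), equation (3)."""
    return 2 * p * (1 - p)

def majority_vote(tree_preds):
    """Equation (5): mode of the per-tree binary predictions."""
    tree_preds = np.asarray(tree_preds)  # shape: (n_trees, n_samples)
    return (tree_preds.mean(axis=0) >= 0.5).astype(int)

def classification_metrics(y_true, y_pred):
    """Equations (8)-(11) from confusion-matrix counts."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy
```

For instance, a sample with an even class split has maximal binary Gini impurity of 0.5, and a perfectly separated node has impurity 0.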
4. RESULTS

In the results stage, Figure 5 illustrates the performance of the classification models (RF, XGBoost, and KNN) over ten epochs, revealing distinct performance patterns. In Figure 5(a), RF consistently maintains the highest accuracy across training epochs, while KNN exhibits the lowest and most unstable accuracy, showing the stability and trend of each model during training. In Figure 5(b), the precision metric follows a similar pattern, with RF and XGBoost achieving high values and KNN remaining low, highlighting how each algorithm's precision improves or fluctuates during training. Figure 5(c) shows that XGBoost attains the best recall over epochs, indicating strong performance in correctly identifying positive cases, whereas KNN performs poorly. Finally, Figure 5(d) confirms through the F1-score that XGBoost achieves the best balance between precision and recall throughout the epochs, followed by RF, while KNN continues to show the weakest performance across all metrics, reflecting the overall trade-off between precision and recall for each algorithm.

Figure 5. Algorithm comparison across epochs: (a) accuracy, (b) precision, (c) recall, and (d) F1-score

Table 7 shows that, in the classification problem addressed, the ensemble models RF and XGBoost consistently outperform KNN, with RF leading across all key performance metrics (accuracy, precision, recall, F1-score, and AUC), indicating its superior predictive reliability. Additionally, Table 8 highlights the feature importance analysis, emphasizing the critical role of GDP, the unemployment rate, and mother's occupation as the most influential factors in the model's predictions, underscoring the significance of socioeconomic and macroeconomic variables.
Finally, Table 9 presents the specific hyperparameter configurations for each model, which are essential for reproducibility and for understanding the tuning process that optimized their performance.

Table 7. Performance comparison between classification algorithms
Model | Accuracy | Precision | Recall | F1-score | AUC
RF | 0.87 | 0.86 | 0.85 | 0.85 | 0.91
XGBoost | 0.85 | 0.84 | 0.83 | 0.83 | 0.89
KNN | 0.76 | 0.73 | 0.71 | 0.72 | 0.76
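A comparison table like Table 7 can be produced along these lines. The sketch below evaluates only RF and KNN (both in scikit-learn) on a synthetic 80/20 split, omitting XGBoost to keep the example dependency-free; the data and hyperparameters are assumptions, so the resulting numbers will not match Table 7:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the student dataset, split 80/20 as in section 3
X, y = make_classification(n_samples=374, n_features=10, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=7)

results = {}
for name, model in [
    ("RF", RandomForestClassifier(n_estimators=100, random_state=7)),
    ("KNN", KNeighborsClassifier(n_neighbors=5)),
]:
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    y_prob = model.predict_proba(X_te)[:, 1]
    results[name] = {
        "accuracy": accuracy_score(y_te, y_pred),
        "f1": f1_score(y_te, y_pred),
        "auc": roc_auc_score(y_te, y_prob),
    }
```

Collecting every model's metrics into one dictionary makes it straightforward to render a comparison table and to add further candidates (e.g. an XGBoost classifier) without changing the evaluation loop.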