IAES International Journal of Artificial Intelligence (IJ-AI)
Vol. 15, No. 1, February 2026, pp. 628∼641
ISSN: 2252-8938, DOI: 10.11591/ijai.v15.i1.pp628-641
Predicting university student dropouts in Latin America using machine learning

Laberiano Andrade-Arenas¹, Inoc Rubio Paucar², Margarita Giraldo Retuerto¹, Cesar Yactayo-Arias³
¹Facultad de Ciencias e Ingeniería, Universidad de Ciencias y Humanidades, Lima, Perú
²Facultad de Ingeniería y Negocios, Universidad Privada Norbert Wiener, Lima, Perú
³Departamento de Estudios Generales, Universidad Continental, Lima, Perú
Article Info

Article history:
Received Aug 14, 2025
Revised Dec 29, 2025
Accepted Jan 22, 2026

Keywords:
Decision making
Machine learning
Predictive model
Random forest
Student dropout
ABSTRACT
In the university context, student dropout has become one of the most recurring problems, both in the short and long term. The objective of this research was to develop a predictive model using the random forest (RF) algorithm to identify patterns associated with university dropout. To achieve this, the knowledge discovery in databases (KDD) methodology was applied, which encompasses the stages of selection, preprocessing, transformation, data mining, and interpretation of results. The RF model demonstrated superior performance compared to other evaluated models, achieving an accuracy of 87%, a precision of 86%, a recall of 85%, an F1-score of 85%, and a receiver operating characteristic (ROC) area under the curve (AUC) of 0.91, highlighting its high predictive capability compared to the other techniques analyzed. Therefore, the application of the proposed model is recommended in various university institutions in order to identify potential dropout cases at an early stage.

This is an open access article under the CC BY-SA license.
Corresponding Author:
Laberiano Andrade-Arenas
Facultad de Ciencias e Ingeniería, Universidad de Ciencias y Humanidades
Lima, Perú
Email: landrade@uch.edu.pe
1. INTRODUCTION
In the current context, university student dropout has become one of the most pressing global issues, with both social and economic implications. High dropout rates limit students' professional development, reduce the efficiency of higher education institutions, and directly impact the growth and competitiveness of countries. This situation not only represents a loss of talent and resources but also undermines efforts to ensure quality education [1], [2]. It is essential to take action in response to this situation, as student dropout has become a common occurrence in universities, driven by multiple factors.
Despite institutional efforts to improve educational quality, student dropout in higher education remains a persistent and multifactorial challenge. The causes of dropout are diverse and include academic, personal, economic, and contextual factors, which make timely identification difficult through traditional methods. This complexity prevents many universities from anticipating dropout risk and implementing effective interventions in a timely manner [3], [4]. Moreover, the limited availability of resources to carry out individualized student monitoring further complicates the implementation of appropriate preventive strategies. This situation not only affects institutional performance and educational planning, but also represents a significant loss of human talent, public investment, and personal and professional development opportunities for students. In addition, the emotional and motivational impact of dropping out can affect students' self-esteem,
creating a negative effect on their family and social environment [5], [6]. Therefore, it is urgent to strengthen academic support policies, guidance, and comprehensive assistance that can help address this issue from a more human and inclusive perspective. It is considered essential to approach student dropout with greater attention, as it represents a significant loss not only for students, but also for institutions and society as a whole.
This research is justified by the urgent need to reduce dropout rates in higher education, a problem that negatively impacts students, institutions, and national development. The multifactorial causes of dropout make early detection difficult through traditional methods, limiting the implementation of effective preventive strategies [7], [8]. In this context, it becomes essential to have tools that allow for the analysis of large volumes of academic data and the generation of accurate predictions regarding dropout risk. Machine learning emerges as an innovative and effective alternative for this purpose, as it enables the construction of predictive models capable of identifying risk patterns based on available data. This research will contribute to the development of decision-support systems in universities, facilitating timely and personalized interventions to improve student retention and promote academic success [9], [10].
Anticipating student dropout is essential, as timely intervention not only enhances academic performance but also provides greater opportunities for students' personal and professional development. The objective of this research is to develop a predictive model based on the random forest (RF) algorithm to identify patterns of student dropout, with the aim of optimizing strategic decision-making in the university context.
2. LITERATURE REVIEW
This section presents a thorough review of various studies related to the topic addressed, with the purpose of providing a broad and well-founded perspective on the subject of study. Additionally, the theoretical frameworks consulted support the selection and interpretation of the variables considered in the analysis.
2.1. Related works
This research proposes a machine learning-based approach for evaluating teaching performance. To address this issue, several classification algorithms were implemented using the Python programming language, including k-nearest neighbors (KNN), extra trees, light gradient boosting machine (LightGBM), and the CatBoost classifier, among others. The results showed that the proposed model achieved a 2% higher accuracy compared to the other evaluated algorithms, highlighting its effectiveness in the educational context. In a complementary area, a student dropout prediction system was developed using machine learning algorithms, based on a longitudinal dataset collected from university students. The results indicated that the risk of dropout is primarily associated with factors such as academic department, gender, and socioeconomic group [11], [12].
Another relevant aspect addressed by Niyogisubizo et al. [13] was the proposal of a hybrid dropout prediction model, which combines the RF, extreme gradient boosting (XGBoost), gradient boosting (GB), and feedforward neural network (FNN) algorithms. The model's performance was evaluated using the area under the curve (AUC), showing promising results in identifying factors related to school dropout. The analysis highlighted the impact of uncontrolled behaviors as a key variable in dropout risk. On the other hand, Vives et al. [14] emphasize the effectiveness of long short-term memory (LSTM) networks in predicting academic performance. Through comparisons between different models based on metrics such as accuracy, precision, recall, and F1-score, the superiority of the LSTM-generative adversarial network (GAN) model was confirmed, achieving an accuracy of 98.3% in week 8, followed by the deep neural network (DNN)-GAN model with 98.1%.
In the context of predicting dropout in postgraduate programs, classification models such as logistic regression, RF, and neural networks were developed and optimized using resampling techniques to address class imbalance (synthetic minority over-sampling technique (SMOTE), SMOTE-support vector machine (SVM), and adaptive synthetic sampling (ADASYN)), as well as through hyperparameter tuning. The best-performing model was the neural network combined with SMOTE-SVM, achieving a recall value of 0.75, followed by logistic regression with 0.67 and RF with 0.60; the latter also demonstrated strong generalization ability with an optimal decision threshold of 0.427. Complementarily, another study focused on student dropout implemented a predictive model based on LightGBM, which achieved outstanding performance with an F1-score of 0.840, surpassing the results of previous studies that addressed the class imbalance issue. The model's effectiveness was enhanced through the application of oversampling techniques such as SMOTE, ADASYN, and Borderline-SMOTE, which helped improve class distribution and optimize the system's predictive capacity, as noted in [15], [16]. Another application oriented toward virtual learning environments adopted a hybrid approach using machine learning algorithms, specifically RF and XGBoost, to classify students at risk of dropping out. The
model achieved outstanding results, with an accuracy of 93%, a precision of 91.52%, a recall of 96.42%, and an F1-score of 93.91%, demonstrating its high effectiveness in the early detection of academic dropout.
A further relevant contribution related to student dropout involved the development of a university dropout prediction system. For this purpose, a software prediction program was created based on machine learning models to identify the correlation between variables and student dropout. The models were evaluated for accuracy, with artificial neural networks of the perceptron type achieving the highest accuracy at 98.1% [17], [18].
Recent studies have developed a university dropout prediction system that significantly improved accuracy (0.963) and recall rate (0.766) by using dimensionality reduction techniques with principal component analysis (PCA) and clustering through K-means. The model outperformed the best previous approach by 0.093 in accuracy and achieved an F1-score of 0.808, surpassing the GB method. Additionally, it identified four main causes of dropout: employment, non-registration, personal problems, and admission to another university, the latter being the most accurately predicted (0.672).
In a separate study, a classification model was implemented using machine learning techniques to anticipate student dropout with high levels of accuracy. The proposal followed a technological methodology with a propositional focus, incremental innovation, and synchronous scope. Data collection was conducted through a 20-question survey administered to 237 postgraduate students enrolled in education master's programs. The model, based on gradient boosting machine (GBM), yielded outstanding results: a Gini coefficient of 92.20%, an AUC of 96.10%, and a LogLoss of 24.24%. These results enabled the effective identification of key factors behind student dropout and provided a strategic tool for educational management [19], [20].
In a relevant alternative approach, the research in [21], [22] applied data mining techniques using academic grades as key predictive variables, combined with various machine learning algorithms aimed at modeling university dropout. The results demonstrated strong model performance, achieving an F1-score of 81% on the final test set. These findings suggest that students' academic performance is a representative indicator of their living conditions and, therefore, allows for the early detection of potential dropout cases in higher education. This supports the idea that academic success is influenced by multiple factors, including class imbalance, which justifies the use of supervised machine learning algorithms such as decision trees (DT) and SVM. However, boosting algorithms, especially LightGBM and CatBoost optimized with Optuna, showed superior performance compared to traditional classifiers, establishing themselves as more effective approaches for academic prediction, as highlighted by the aforementioned author.
In another instance, when analyzing dropout risk among undergraduate students, unsupervised clustering algorithms were applied alongside RF and probability threshold adjustment. The traditional model yielded a low accuracy of 13.2% in predicting dropout, compared to 99.4% in retention. However, after adjusting the threshold, the accuracy in detecting dropout exceeded 50%, while maintaining overall and retention rates above 70% [23], [24].
This research addresses dropout in massive open online courses (MOOCs), proposing the use of the RF algorithm to predict this phenomenon. The model demonstrated strong performance, achieving an accuracy of 87.5%, an AUC of 94.5%, a precision of 88%, a recall of 87.5%, and an F1-score of 87.5%, highlighting its effectiveness in the early detection of university students at risk of dropping out.
In addition, risk factors associated with dropout in university programs were identified by applying various machine learning algorithms, among which RF exhibited the most notable performance. The highest level of predictive accuracy was reached at the end of the first semester, once sufficient academic information about the students had been collected. At this stage, the model produced performance indicators that were comparable to those reported in previous research on early identification of dropout risk and low academic achievement [25], [26].
Several studies highlight the relevance of applying machine learning techniques in this context, particularly models such as GB, RF, and SVM, which have shown promising results for supporting institutional decision-making and for designing preventive strategies in university settings [27].
2.2. Student dropout
Student dropout in universities is defined as the student's decision to interrupt their studies for various context-related reasons, whether the interruption is temporary or permanent. Dropout represents a critical issue for universities, as it impacts the efficiency of the educational system, the allocation of resources, and the development of qualified human capital [28], [29]. This phenomenon arises from multiple causes, which are outlined in Table 1 to help address this challenge. In this regard, it is advisable to closely monitor university students' academic performance, as it can significantly influence their long-term professional success or failure.
Table 1. Main causes of university student dropout
Category       | Specific cause                    | Example                                  | Impact type
Academic       | Low performance                   | Continuous failure of courses            | Academic
Economic       | Lack of resources                 | Cannot afford tuition or transportation  | Economic
Vocational     | Demotivation                      | Insecurity about career choice           | Emotional/Vocational
Familiar       | Family problems                   | Conflicts or responsibilities at home    | Psychological
Institutional  | Lack of mentoring                 | Poor academic support                    | Institutional
Social         | Discrimination or exclusion       | By gender, race or social class          | Social/Cultural
Health         | Medical or psychological problems | Anxiety, depression, chronic illnesses   | Staff
Labor          | Need to work                      | Dropping out of school to work full-time | Economic/Labor
2.3. Random forest
The RF algorithm is a supervised machine learning method based on ensemble techniques, which involves building multiple independent DTs and combining their predictions to obtain more robust, accurate, and generalizable results. This model uses the bagging method, where each tree is trained on a random sample of the dataset, and at each split in the tree, a random subset of features is considered, which helps reduce the correlation between trees. For classification tasks, the final result is determined by majority voting, while for regression tasks, it is calculated by averaging the predictions. This approach improves performance by reducing overfitting and efficiently handles large volumes of data. However, its main drawback is its lower interpretability compared to a single DT [30], [31]. This type of algorithm can be applied in various contexts, such as medicine, education, and finance, depending on the domain in which it is used.
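As a minimal sketch of the bagging and random-feature-subset behavior described above, scikit-learn's RandomForestClassifier can be configured as follows; the data are synthetic and the hyperparameters illustrative, not those used in this study:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a student dataset (illustrative only)
X, y = make_classification(n_samples=510, n_features=10, random_state=42)

# Bagging with random feature subsets, as described above: each tree is
# trained on a bootstrap sample (bootstrap=True) and considers a random
# subset of features at every split (max_features="sqrt")
rf = RandomForestClassifier(
    n_estimators=100,     # number of independent decision trees
    max_features="sqrt",  # random feature subset per split
    bootstrap=True,       # bootstrap sample per tree
    random_state=42,
)
rf.fit(X, y)

# For classification, the ensemble output is the majority vote of the trees
print(rf.predict(X[:5]))
```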
3. METHODOLOGY
The knowledge discovery in databases (KDD) methodology is a comprehensive and systematic process aimed at transforming large volumes of raw data into useful, novel, understandable, and relevant knowledge for decision-making. This process includes several interrelated stages: the selection of relevant data, cleaning and preprocessing to remove inconsistencies or outliers, transformation into suitable formats, application of data mining techniques to extract meaningful patterns, and finally, the evaluation, interpretation, and presentation of the discovered knowledge in a way that can be understood and used by organizations [32], [33]. In this study, the process is applied to an institutional dataset composed of academic, socio-economic, and demographic student records, involving approximately 510 students. Before modeling, the data underwent a cleaning procedure, treatment of missing values, detection of outliers, and normalization to ensure analytical reliability. Figure 1 presents the phases of the KDD process, illustrating the data flow toward obtaining relevant results that support informed decision-making.
Meanwhile, Figure 2 shows the architecture implemented for student data analysis. The workflow starts with the ingestion of datasets in formats such as .CSV, .XLSX, .TXT, and .JSON, processed using Python. The architecture integrates libraries such as Scikit-learn, XGBoost, NumPy, and Pandas, applying preprocessing steps including cleaning, standardization, and transformation. The RF model was subsequently implemented, allocating 80% of the dataset for training and the remaining 20% for validation. Finally, the model is evaluated through metrics such as accuracy, confusion matrix, F1-score, precision, recall, and ROC-AUC curves, aiming to obtain meaningful results that contribute to decision-making in educational contexts.
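The 80/20 split and evaluation flow described above can be sketched as follows; this is a minimal illustration on synthetic stand-in data, not a reproduction of the study's exact pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned student dataset
X, y = make_classification(n_samples=510, n_features=10, random_state=0)

# 80% of the data for training and the remaining 20% for validation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

# Evaluation metrics listed in the architecture description
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))
print("ROC-AUC  :", roc_auc_score(y_test, y_prob))
print("confusion matrix:\n", confusion_matrix(y_test, y_pred))
```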
3.1. Selection
This section presents a thorough search focused on selecting the most appropriate dataset for the development of the machine learning project. The selection was based on the research objective, prioritizing data relevance, quality, and availability. To achieve this, several specialized platforms for public dataset distribution were explored, with Kaggle standing out as a leading platform due to its robustness and wide variety of datasets from different fields of knowledge. Kaggle is a reliable and up-to-date source, supported by an active scientific community that shares high-quality data along with detailed technical descriptions [34]. This feature allowed for the selection of a dataset aligned with the project's goals, ensuring a solid foundation for subsequent analysis, preprocessing, and modeling using machine learning techniques such as RF. It is important to note that Kaggle offers datasets across various domains and hosts competitions and publications centered on machine learning.
Pr
edicting
univer
sity
student
dr
opouts
in
Latin
America
using
...
(Laberiano
Andr
ade-Ar
enas)
Evaluation Warning : The document was created with Spire.PDF for Python.
632
❒
ISSN:
2252-8938
Figure 1. KDD methodology
Figure 2. Machine learning architecture
3.2. Preprocessing and transformation
This section presents the preprocessing and transformation of the data. Tables 2 to 5 show the results of the exploratory data analysis and the initial stages of variable preparation for the predictive model. Table 2 displays the analysis of missing values, where low percentages of missing data are observed: 3.75% in mother's occupation and 4.01% in father's occupation. Variables such as debtor, tuition payment, and unemployment rate do not contain any missing values. Table 3 details the distribution of participants by marital status, with "single" being the most common category, followed by married, contributing to the sociodemographic profile of the
study. Table 4 presents the correlation between economic indicators, revealing negative relationships between gross domestic product (GDP) and both the unemployment rate (-0.40) and the inflation rate (-0.55), suggesting a link between economic growth and improved social conditions. Lastly, Table 5 shows the discretization of GDP into three levels (low, medium, and high), with the medium level being the most frequent. This facilitates its integration into classification models such as RF. These tables help to understand how the data is structured and what transformations are applied prior to modeling.
Table 2. Exploratory data analysis and preprocessing results, missing data analysis
Variable              | Missing (%)
Marital status        | 2.14
Day/Night attendance  | 0.00
Mother's occupation   | 3.75
Father's occupation   | 4.01
Debtor                | 0.00
Tuition payment       | 0.00
International student | 0.27
Unemployment rate     | 0.00
Inflation rate        | 0.00
GDP                   | 0.00
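A missing-percentage analysis like the one reported in Table 2 can be obtained with pandas; the column names and values below are illustrative stand-ins, not the study's dataset:

```python
import numpy as np
import pandas as pd

# Small illustrative frame; columns mirror Table 2 but values are made up
df = pd.DataFrame({
    "marital_status": [1, 2, np.nan, 1],
    "mothers_occupation": [2, np.nan, 1, 3],
    "debtor": [0, 1, 0, 0],
})

# Percentage of missing values per variable, as in Table 2
missing_pct = df.isna().mean() * 100
print(missing_pct.round(2))
```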
Table 3. Exploratory data analysis and preprocessing results, marital status distribution
Category | Frequency
Single   | 210
Married  | 125
Divorced | 30
Widowed  | 9
Total    | 374
Table 4. Exploratory data analysis and preprocessing results, correlation between economic indicators
                  | Unemployment rate | Inflation rate | GDP
Unemployment rate | 1.00              | 0.65           | -0.40
Inflation rate    |                   | 1.00           | -0.55
GDP               |                   |                | 1.00
Table 5. Exploratory data analysis and preprocessing results, discretization of GDP values
Category   | GDP range   | Frequency
Low GDP    | < 10000     | 80
Medium GDP | 10000–30000 | 200
High GDP   | > 30000     | 94
Total      |             | 374
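The discretization shown in Table 5 corresponds to binning a continuous variable; a sketch with pandas, assuming the bin edges from the table (illustrative GDP values, and the handling of the exact boundary values is an assumption):

```python
import pandas as pd

# Illustrative GDP values (monetary units); bin edges follow Table 5
gdp = pd.Series([8500, 15000, 25000, 31000, 42000])

gdp_level = pd.cut(
    gdp,
    bins=[-float("inf"), 10000, 30000, float("inf")],
    labels=["Low GDP", "Medium GDP", "High GDP"],
)
print(gdp_level.value_counts())
```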
Table 6 presents the descriptive statistics of the numerically encoded quantitative variables in the dataset. These statistics provide an overview of the behavior of the sociodemographic and economic variables considered in the study. The variable marital status has a mean value of 1.78, indicating that most participants fall between the categories of single and married. Similarly, the mean value for attendance (day or night) is 1.25, suggesting a higher proportion of students attending daytime classes. Parental occupation variables show average values close to 1.5, reflecting an intermediate distribution among employed, unemployed, or "other" categories. Regarding binary variables such as debtor, tuition payment, and international student, the low mean values indicate that most individuals are not in debt, are up to date with tuition payments, and are not international students, respectively. On the other hand, economic indicators reveal an average unemployment rate of 6.20%, an inflation rate of 2.45%, and a GDP average of 21,500.75 monetary units. These values help to understand the economic context in which the participants are situated and provide a solid foundation for further analysis. Overall, the statistical information of these variables facilitates data preparation and attribute selection for the construction of predictive models.
Table 6. Descriptive statistics of selected variables
Variable              | Count  | Mean     | Std. Dev. | Min     | Q1 (25%) | Q2 (Median) | Q3 (75%) | Max
Marital status        | 366.00 | 1.78     | 0.89      | 1.00    | 1.00     | 2.00        | 2.00     | 4.00
Day/Night attendance  | 374.00 | 1.25     | 0.43      | 1.00    | 1.00     | 1.00        | 1.00     | 2.00
Mother's occupation   | 360.00 | 1.65     | 0.75      | 1.00    | 1.00     | 2.00        | 2.00     | 3.00
Father's occupation   | 359.00 | 1.52     | 0.70      | 1.00    | 1.00     | 1.00        | 2.00     | 3.00
Debtor                | 374.00 | 0.25     | 0.43      | 0.00    | 0.00     | 0.00        | 1.00     | 1.00
Tuition payment       | 374.00 | 0.20     | 0.40      | 0.00    | 0.00     | 0.00        | 0.00     | 1.00
International student | 373.00 | 0.09     | 0.29      | 0.00    | 0.00     | 0.00        | 0.00     | 1.00
Unemployment rate     | 374.00 | 6.20     | 1.45      | 3.20    | 5.10     | 6.00        | 7.30     | 9.80
Inflation rate        | 374.00 | 2.45     | 0.65      | 1.00    | 2.00     | 2.40        | 2.90     | 4.10
GDP                   | 374.00 | 21500.75 | 7800.42   | 8000.00 | 16000.00 | 21000.00    | 27000.00 | 42000.00
3.3. Data mining
For this procedure, Figure 3 illustrates the architecture underlying the decision-making process within the DT framework, based on the dataset used. Meanwhile, Figure 4 displays the results that allow the evaluation of the RF model's performance in predicting student dropout. Figure 4(a) shows the ROC curve, illustrating the relationship between the true positive rate and the false positive rate, with a high AUC indicating strong discrimination between students who drop out and those who do not. Figure 4(b) presents the precision–recall curve, showing the balance between precision and recall, including the AUC value and the optimal threshold, which is especially useful in scenarios involving class imbalance. Figure 4(c) illustrates the relationship between sensitivity and specificity across different classification thresholds. As the threshold increases, sensitivity decreases while specificity increases. The intersection point of the two curves, at a threshold of approximately 0.4, suggests a possible balance between these metrics. The legend includes the formulas, and the sidebar indicates the threshold values.
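The sensitivity–specificity trade-off across thresholds, as shown in Figure 4(c), can be computed from predicted dropout probabilities; a sketch with illustrative labels and scores (not the study's data):

```python
import numpy as np
from sklearn.metrics import roc_curve

# Illustrative true labels and predicted dropout probabilities
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.1, 0.3, 0.35, 0.8, 0.45, 0.7, 0.2, 0.55])

# roc_curve returns, for each candidate threshold, the true positive
# rate (sensitivity) and false positive rate (1 - specificity)
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
sensitivity = tpr
specificity = 1 - fpr

for thr, se, sp in zip(thresholds, sensitivity, specificity):
    print(f"threshold={thr:.2f}  sensitivity={se:.2f}  specificity={sp:.2f}")
```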
Figure 3. Decision tree representation
Figure 4. Classification model performance evaluation: (a) ROC curve with the AUC, (b) precision–recall curve with AUC and optimal threshold, and (c) sensitivity and specificity across thresholds
3.3.1. Mathematical foundation
The predictive model for student dropout is based on the RF algorithm, an ensemble learning method that combines multiple DTs to improve accuracy and robustness. The following mathematical formulations provide the theoretical foundation for this methodology [35].
– Data representation: we define the training dataset as:

D = {(x_i, y_i)}_{i=1}^n,   x_i ∈ ℝ^d,   y_i ∈ {0, 1}        (1)

where x_i denotes the feature vector of student i, and y_i is the binary target variable: 1 if the student drops out, and 0 otherwise [36].
– Gini impurity: each DT splits the dataset using impurity functions. The Gini impurity is defined as [37]:

G(p) = 1 − Σ_{k=1}^K p_k^2        (2)

In binary classification (K = 2), it simplifies to:

G(p) = 2p(1 − p)        (3)

where p is the probability of belonging to one of the two classes (dropout or not).
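Equations (2) and (3) can be checked numerically; a minimal sketch:

```python
import numpy as np

def gini(p):
    """Gini impurity for a class-probability vector p, Eq. (2)."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

# Binary case: Eq. (3) matches Eq. (2) with K = 2
p = 0.3
assert abs(gini([p, 1 - p]) - 2 * p * (1 - p)) < 1e-12

# Impurity is maximal at p = 0.5 and zero for a pure node
print(gini([0.5, 0.5]))  # 0.5
print(gini([1.0, 0.0]))  # 0.0
```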
– Shannon entropy (alternative): as an alternative to Gini, the Shannon entropy can be used:

H(p) = − Σ_{k=1}^K p_k log_2(p_k)        (4)
– RF prediction: let h_m(x) denote the prediction of tree m. The final prediction is based on majority voting:

ŷ = mode(h_1(x), h_2(x), …, h_M(x))        (5)

The estimated probability that a student drops out is:

P(y = 1 | x) = (1/M) Σ_{m=1}^M I(h_m(x) = 1)        (6)

where I(·) is the indicator function, which returns 1 if the condition is true and 0 otherwise.
– Feature importance: the importance of each feature x_j is evaluated as:

Imp(x_j) = Σ_{t ∈ T_j} (n_t / n) · Δϕ_t        (7)

where T_j is the set of nodes where feature x_j is used, n_t is the number of samples at node t, and Δϕ_t is the impurity reduction at that node.
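In scikit-learn, an impurity-based importance in the spirit of Eq. (7), averaged over the trees and normalized to sum to 1, is exposed as feature_importances_; a sketch on synthetic data (not the study's dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=1)
rf = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, y)

# Mean impurity-decrease importance per feature; values sum to 1
importances = rf.feature_importances_
for j, imp in enumerate(importances):
    print(f"x_{j}: {imp:.3f}")
```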
– Evaluation metrics: the model is evaluated using the following standard classification metrics.

Precision = TP / (TP + FP)        (8)

Recall (Sensitivity) = TP / (TP + FN)        (9)

F1 = 2 · (Precision · Recall) / (Precision + Recall)        (10)

Accuracy = (TP + TN) / (TP + TN + FP + FN)        (11)

These metrics help determine how well the model identifies students at risk of dropping out.
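Equations (8)–(11) follow directly from the confusion-matrix counts; a worked sketch with illustrative values (not the study's results):

```python
# Confusion-matrix counts (illustrative values only)
TP, TN, FP, FN = 40, 45, 7, 8

precision = TP / (TP + FP)                          # Eq. (8)
recall = TP / (TP + FN)                             # Eq. (9)
f1 = 2 * precision * recall / (precision + recall)  # Eq. (10)
accuracy = (TP + TN) / (TP + TN + FP + FN)          # Eq. (11)

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"f1={f1:.3f} accuracy={accuracy:.3f}")
```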
4. RESULTS
In the results stage, Figure 5 illustrates the performance of the classification models (RF, XGBoost, and KNN) over ten epochs, revealing distinct performance patterns. In Figure 5(a), RF consistently maintains the highest accuracy across training epochs, while KNN exhibits the lowest and most unstable accuracy, showing the stability and trend of each model during training. In Figure 5(b), the precision metric follows a similar pattern, with RF and XGBoost achieving high values and KNN remaining low, highlighting how each algorithm's precision improves or fluctuates during training. Figure 5(c) shows that XGBoost attains the best recall over epochs, indicating strong performance in correctly identifying positive cases, whereas KNN performs poorly. Finally, Figure 5(d) confirms through the F1-score that XGBoost achieves the best balance between precision and recall throughout the epochs, followed by RF, while KNN continues to show the weakest performance across all metrics, reflecting the overall trade-off between precision and recall for each algorithm.
Figure 5. Algorithm comparison across epochs: (a) accuracy, (b) precision, (c) recall, and (d) F1-score
Table 7 shows that, in the classification problem addressed, the ensemble models RF and XGBoost consistently outperform KNN, with RF leading across all key performance metrics (accuracy, precision, recall, F1-score, and AUC), indicating its superior predictive reliability. Additionally, Table 8 highlights the feature importance analysis, emphasizing the critical role of GDP, unemployment rate, and mother's occupation as the most influential factors in the model's predictions, underscoring the significance of socioeconomic and macroeconomic variables. Finally, Table 9 presents the specific hyperparameter configurations for each model, which are essential for reproducibility and for understanding the tuning process that optimized their performance.
Table 7. Performance comparison between classification algorithms
Model   | Accuracy | Precision | Recall | F1-score | AUC
RF      | 0.87     | 0.86      | 0.85   | 0.85     | 0.91
XGBoost | 0.85     | 0.84      | 0.83   | 0.83     | 0.89
KNN     | 0.76     | 0.73      | 0.71   | 0.72     | 0.76