IAES International Journal of Artificial Intelligence (IJ-AI)
Vol. 14, No. 4, August 2025, pp. 2876-2888
ISSN: 2252-8938, DOI: 10.11591/ijai.v14.i4.pp2876-2888
A survey of missing data imputation techniques: statistical methods, machine learning models, and GAN-based approaches

Rifaa Sadegh, Ahmed Mohameden, Mohamed Lemine Salihi, Mohamedade Farouk Nanne
Scientific Computing, Computer Science and Data Science, Department of Computer Science, Faculty of Science and Technology, University of Nouakchott, Nouakchott, Mauritania
Article Info

Article history:
Received Jun 8, 2024
Revised Jun 11, 2025
Accepted Jul 10, 2025
Keywords:
Data imputation
Generative adversarial networks
Machine learning
Missing data
Statistical methods
ABSTRACT

Efficiently addressing missing data is critical in data analysis across diverse domains. This study evaluates traditional statistical, machine learning, and generative adversarial network (GAN)-based imputation methods, emphasizing their strengths, limitations, and applicability to different data types and missing data mechanisms (missing completely at random (MCAR), missing at random (MAR), missing not at random (MNAR)). GAN-based models, including generative adversarial imputation network (GAIN), view imputation generative adversarial network (VIGAN), and SolarGAN, are highlighted for their adaptability and effectiveness in handling complex datasets, such as images and time series. Despite challenges like computational demands, GANs outperform conventional methods in capturing non-linear dependencies. Future work includes optimizing GAN architectures for broader data types and exploring hybrid models to enhance imputation accuracy and scalability in real-world applications.

This is an open access article under the CC BY-SA license.
Corresponding Author:
Rifaa Sadegh
Scientific Computing, Computer Science and Data Science, Department of Computer Science
Faculty of Science and Technology, University of Nouakchott
Nouakchott, Mauritania
Email: rifasadegh@gmail.com
1. INTRODUCTION
Missing data is a pervasive challenge that affects nearly every scientific discipline, from medicine [1] to geology [2], energy [3], and environmental sciences [4]. Rubin [5] defined missing data as unobserved values that could yield critical insights if available. These gaps introduce biases, distort analysis, and reduce the effectiveness of algorithms, ultimately impairing decision-making processes. The origins of missing data are diverse, arising from incomplete data collection, recording errors, or hardware malfunctions [5]. These gaps skew results and misrepresent the studied population [6], creating a need for robust and scalable solutions to ensure reliable research outcomes.

Addressing missing data has proven to be a multifaceted problem, requiring methods that vary depending on the type and complexity of the dataset. Initial approaches, such as listwise deletion, were simple but often discarded valuable information along with the missing data [7]. Over time, more sophisticated imputation techniques emerged, including statistical methods, machine learning algorithms, and deep learning models. Among these, generative adversarial networks (GANs) have gained prominence for their ability to model complex data distributions and address non-linear dependencies effectively. Despite their potential, implementing GANs for data imputation comes with challenges, including:
i) high computational costs due to complex training processes; ii) sensitivity to hyperparameter tuning, which affects model stability; and iii) the risk of overfitting, particularly when handling small datasets.
This paper provides a comprehensive review of missing data imputation methods. We analyze traditional statistical approaches, machine learning techniques, and deep learning models, with a particular focus on GAN-based imputation. Our findings reveal that while GANs outperform traditional methods in handling complex datasets, their deployment requires careful balancing of model complexity and computational efficiency. We also propose future research directions, including: i) the integration of hybrid models combining statistical techniques with GANs; ii) optimization of GAN architectures for imputation tasks; and iii) application of these techniques to real-world datasets in fields such as healthcare, energy, and environmental science. By addressing these challenges and exploring innovative solutions, this work aims to contribute to the growing body of knowledge in data imputation, enabling researchers and practitioners to better handle missing data scenarios.
The remainder of this article is structured as follows: section 2 introduces missing data mechanisms and the types of variables involved. Section 3 reviews the imputation methods, from statistical techniques to machine learning and deep learning models. Section 4 presents the metrics used to evaluate imputation quality. Section 5 discusses the comparative results, including ethical considerations related to imputation in sensitive domains. Section 6 concludes with key findings and recommendations for future research.
2. MISSING DATA MECHANISMS AND TYPES OF VARIABLES
Handling missing data is critical for ensuring the reliability of statistical analyses. Understanding the mechanisms underlying missing data and the types of variables involved is fundamental for selecting appropriate imputation techniques. This section explores the categories of missing completely at random (MCAR), missing not at random (MNAR), and missing at random (MAR), alongside a classification of statistical variables and imputation approaches.
2.1. Missing data categories
Missing data can be classified into three distinct categories: MCAR, MNAR, and MAR [5].
– MCAR: data is missing randomly, unrelated to observed or unobserved variables. Example: pixels missing in radiological images due to random noise or technical errors, such as sensor malfunction.
– MAR: missingness depends on observed variables. Example: crop yield data missing in regions with extreme weather conditions, where meteorological data is recorded.
– MNAR: missingness depends on unobserved variables [8]. Example: fetal position affects the visibility of genital organs during an ultrasound, leading to gender data being systematically missing when the fetus is positioned laterally or with crossed legs.
Figure 1 provides an illustration. Table 1 summarizes the criteria distinguishing these categories. MCAR is ignorable, while MAR and MNAR require advanced techniques to mitigate bias.
P(M = 1 | Y_o, Y_m, ψ) defines the probability of the missing data mechanism, where Y_o and Y_m denote the observed and missing parts of the data and ψ represents the set of parameters of the missingness model. When data are MNAR, this probability cannot be simplified because it depends on one or more unmeasured quantities, i.e., the unobserved variables themselves.
Figure 1. Missing data mechanisms
Table 1. Comparison of missing data mechanisms

| Criterion | MAR | MNAR | MCAR |
|---|---|---|---|
| Random | No | No | Yes |
| Ignorable | It depends | No | Yes |
| Dependency | Observed variable | Unobserved variable | None |
| P(M = 1 ∣ Y_o, Y_m, ψ) | P(M = 1 ∣ Y_o, ψ) | Undefined | P(M = 1 ∣ ψ) |
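As a complement to Table 1, the sketch below simulates the three mechanisms on synthetic data; the variable names, sample size, and missingness rates are illustrative assumptions, not values taken from the study.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000
age = rng.normal(40, 10, n)         # fully observed variable
income = rng.normal(3000, 800, n)   # variable that will receive missing values

# MCAR: the probability of being missing is constant, unrelated to any variable
mcar_mask = rng.random(n) < 0.2

# MAR: the probability of being missing depends only on the observed variable (age)
mar_mask = rng.random(n) < 1 / (1 + np.exp(-(age - 50) / 5))

# MNAR: the probability of being missing depends on the unobserved value itself (income)
mnar_mask = rng.random(n) < 1 / (1 + np.exp(-(income - 3500) / 400))

income_mcar = np.where(mcar_mask, np.nan, income)
income_mar = np.where(mar_mask, np.nan, income)
income_mnar = np.where(mnar_mask, np.nan, income)
```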
2.2. Imputation approaches
Imputation methods are categorized based on variable relationships:
– Single vs. multiple imputation: single imputation replaces a missing value with one estimate, while multiple imputation generates several plausible values [9].
– Univariate vs. multivariate: univariate imputation considers only the target variable, whereas multivariate imputation incorporates relationships between variables [10]. Multivariate methods are preferable for datasets with strong interdependencies, and they support both single and multiple imputations, as shown in Table 2.
Table 2. Comparison of imputation types

| Criterion | Univariate | Multivariate |
|---|---|---|
| Replacement | 1 | m |
| Correlation | ✕ | ✓ |
| Single imputation | ✕ | ✓ |
| Multiple imputations | ✓ | ✓ |
2.3. Types of variables
Statistical variables are classified as: i) quantitative (e.g., continuous: salary; discrete: age); and ii) qualitative (e.g., nominal: marital status; ordinal: satisfaction level) [11]. Misinterpretations arise when qualitative variables are numerically encoded (e.g., zip codes), as their mean has no significance. Figure 2 provides an overview.

Figure 2. Types of statistical variables
3. IMPUTATION METHODS
Managing missing data is crucial across various fields to ensure the accuracy of analyses and predictive models. This section reviews several imputation techniques, ranging from traditional statistical methods to advanced machine learning and deep learning approaches. Each method's strengths and limitations are discussed, along with their suitability for different data types and contexts.
3.1. Statistical methods
Statistical methods are foundational for imputation. Key approaches include similarity-based methods, observation-based methods, measures of central tendency, and multivariate imputation by chained equations (MICE).
3.1.1. Similarity-based methods
The hot-deck method replaces missing values with those from similar individuals. The cold-deck method uses values from external sources; it is applied when there are not enough similar data points [12], [13].
3.1.2. Observation-based methods
Methods like last observation carried forward (LOCF), baseline observation carried forward (BOCF), worst observation carried forward (WOCF), and next observation carried backward (NOCB) are commonly used for longitudinal data. These methods replace missing values based on temporal patterns. They rely on the assumption that nearby observations carry meaningful information [14]-[17].
3.1.3. Measures of central tendency
The objective of central tendency measures is to summarize, in a single value, the elements of a variable in a dataset. The most commonly used central tendency measures are the mean [18], the median [19], and the mode [20]. There are, in fact, various means [21], such as the arithmetic, quadratic, harmonic, geometric, weighted, and truncated means. Here, we illustrate the arithmetic mean, where imputation involves replacing the missing values of a variable with the sum of its known values divided by the number of known values:

$$\forall i \in \{1, 2, \dots, p\}: \quad \bar{y}_i = \frac{1}{n} \sum_{j=1}^{n} y_{ij}, \qquad y_{ij} \notin Y_m$$

The arithmetic mean is only applicable to quantitative variables, especially continuous ones. However, it can also be used for discrete variables, in which case the result is rounded to the nearest integer.

The median is the value that divides the elements of an observed variable into two equal parts. After sorting the values of the target observed variable in ascending order, imputation by the median involves replacing the missing values of a variable with the middle value when the number of observations n is odd, or the average of the two middle values when n is even:

$$\forall i \in \{1, 2, \dots, p\}: \quad \tilde{y}_i = \begin{cases} y_{\left(\frac{n+1}{2}\right)} & \text{if } n \equiv 1 \pmod{2} \\ \tfrac{1}{2}\left(y_{\left(\frac{n}{2}\right)} + y_{\left(\frac{n}{2}+1\right)}\right) & \text{if } n \equiv 0 \pmod{2} \end{cases}$$

In addition to the classical median, there are other ways [22] to compute a measure of central position, such as the weighted median, the geometric median, and the median absolute deviation.

Imputation by mode replaces missing data with the most frequent value of the target variable:

$$\forall i \in \{1, 2, \dots, p\},\; \exists j \in \{1, 2, \dots, n\} \text{ such that } \hat{y}_i = \underset{y_{ij}}{\operatorname{argmax}}\; P(Y = y_{ij})$$

Although the mode can be calculated for both numerical and categorical variables, in practice it is commonly used only for nominal variables, as they do not have other central tendency measures.
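A minimal sketch of these three imputations with pandas is shown below; the toy DataFrame and column names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "salary": [2500.0, np.nan, 3100.0, 2800.0, np.nan],            # continuous
    "age": [31, 45, np.nan, 29, 52],                                # discrete
    "status": ["single", "married", np.nan, "married", "single"],   # nominal
})

# Arithmetic mean for a continuous variable
df["salary"] = df["salary"].fillna(df["salary"].mean())

# Median for a discrete variable, rounded to the nearest integer
df["age"] = df["age"].fillna(round(df["age"].median()))

# Mode (most frequent value) for a nominal variable
df["status"] = df["status"].fillna(df["status"].mode().iloc[0])
```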
3.1.4. Multivariate imputation by chained equations
MICE is an iterative approach that imputes missing data using regression models. Each missing value is predicted using a regression model based on the other variables in the dataset. The algorithm iterates until the imputed values converge [23], [24].
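A chained-equations imputation of this kind can be sketched with scikit-learn's IterativeImputer, shown here as an illustrative stand-in for MICE (the toy matrix is an assumption):

```python
import numpy as np
# IterativeImputer is experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([
    [7.0, 2.0, 3.0],
    [4.0, np.nan, 6.0],
    [10.0, 5.0, np.nan],
    [np.nan, 8.0, 9.0],
])

# Each column with missing values is regressed on the other columns,
# cycling through the columns until the imputed values stabilize.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```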
3.2. Machine learning methods
Machine learning methods offer advantages over traditional statistical approaches, particularly in handling large and complex datasets [25]. This section reviews four popular machine learning models for data imputation: linear regression, logistic regression, k-nearest neighbors (KNN), and decision trees.
3.2.1. Regression
Regression models estimate relationships between the target and observed variables. We focus on linear and logistic regression [26].
– Linear regression: aims to capture a proportional trend between inputs and outcomes. It operates by applying the least squares method to reduce the gap between actual observations and model predictions:

$$y = \alpha x + \beta + \epsilon \qquad (1)$$

Here, \alpha is the coefficient (slope) of the regression line, \beta is the intercept, and \epsilon is the error term, representing the deviation not explained by the linear relationship between the observed value y and the predicted value \alpha x + \beta.
– Logistic regression: used for binary classification, it models the probability of the target variable being 1 using a logistic function:

$$p = \frac{1}{1 + e^{-z}} \qquad (2)$$

Here, p is the probability that the target variable is 1, and z is a linear function of the form:

$$z = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n \qquad (3)$$

where b_0, b_1, b_2, \dots, b_n are the regression coefficients and x_1, x_2, \dots, x_n are the observed variables.
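The sketch below shows the usual way such a model is used as an imputer: fit on the complete rows, then predict the missing target values (the toy data and column names are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.1, 3.9, 6.2, 8.1, 9.8, 12.2],
    "y": [5.0, 9.1, np.nan, 17.2, np.nan, 25.3],   # quantitative target with gaps
})

observed = df["y"].notna()
model = LinearRegression().fit(df.loc[observed, ["x1", "x2"]], df.loc[observed, "y"])
df.loc[~observed, "y"] = model.predict(df.loc[~observed, ["x1", "x2"]])

# For a binary target, the same pattern applies with LogisticRegression,
# imputing the predicted class (or sampling from predict_proba for multiple imputation).
```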
3.2.2. K-nearest neighbors
The basic idea of KNN is to find the k nearest neighbors of the individual with missing data [27]. The algorithm requires two parameters, namely the value of k and the similarity metric between individuals. The similarity is calculated using a distance measure such as the Euclidean, Manhattan, or Minkowski distance.
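With scikit-learn this corresponds to KNNImputer, sketched below on a toy matrix (k = 2 and the example values are assumptions):

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# k = 2 neighbors; similarity is measured with the NaN-aware Euclidean distance
imputer = KNNImputer(n_neighbors=2, metric="nan_euclidean")
X_imputed = imputer.fit_transform(X)
```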
3.2.3. Decision trees
Decision trees partition data into subsets based on feature values to predict missing values. Random forests, an ensemble of multiple decision trees trained on different subsets, enhance robustness and reduce overfitting. MissForest [28], a widely used variant, begins with naive imputations and iteratively refines predictions via random forests. These methods are more flexible than traditional statistical approaches but may require careful tuning for high-dimensional or sparse datasets.
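A MissForest-style scheme can be approximated by plugging a random forest into an iterative imputer, as in the sketch below (an approximation of MissForest rather than the reference implementation; the toy data are assumptions):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, np.nan, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
    [2.0, 3.0, 2.5],
])

# Start from naive (mean) imputations and iteratively refine each column
# with a random forest fitted on the other columns.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    initial_strategy="mean",
    max_iter=5,
    random_state=0,
)
X_imputed = imputer.fit_transform(X)
```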
3.3. Deep learning methods
Deep learning models offer two major advantages over traditional machine learning models. Firstly, traditional methods often require manual selection of relevant features or variables for training the imputation model. In contrast, deep learning models use neural networks to automatically learn these features from raw data. This "automation" occurs during the learning phase, where the biases and weights in each layer of the neural network are adjusted to better capture the underlying patterns in the data. The second advantage is the versatility of neural networks, which makes them easily adaptable to various scenarios, including the 12 cases illustrated in Table 3. Neural networks can model complex, non-linear relationships, making them particularly effective for imputing data with intricate patterns.
Table 3. Overview of methods for imputing missing data (each cell lists MAR / MNAR / MCAR)

| Method | Univariate, quantitative | Univariate, qualitative | Multivariate, quantitative | Multivariate, qualitative |
|---|---|---|---|---|
| Hot-deck | ✓ / ✕ / ✕ | ✕ / ✕ / ✕ | ✕ / ✕ / ✕ | ✕ / ✕ / ✕ |
| Cold-deck | ✕ / ✕ / ✓ | ✕ / ✕ / ✕ | ✕ / ✕ / ✕ | ✕ / ✕ / ✕ |
| LOCF/BOCF/NOCB | ✕ / ✕ / ✓ | ✕ / ✕ / ✓ | ✕ / ✕ / ✕ | ✕ / ✕ / ✕ |
| Mean and median | ✕ / ✕ / ✓ | ✕ / ✕ / ✕ | ✕ / ✕ / ✕ | ✕ / ✕ / ✕ |
| Mode | ✕ / ✕ / ✓ | ✕ / ✕ / ✓ | ✕ / ✕ / ✕ | ✕ / ✕ / ✕ |
| MICE | ✓ / ✓ / ✓ | ✓ / ✓ / ✓ | ✓ / ✓ / ✓ | ✓ / ✓ / ✓ |
| KNN | ✓ / ✕ / ✓ | ✓ / ✕ / ✓ | ✓ / ✕ / ✓ | ✓ / ✕ / ✓ |
| Linear regression | ✓ / ✓ / ✓ | ✓ / ✓ / ✓ | ✓ / ✓ / ✓ | ✓ / ✓ / ✓ |
| Logistic regression | ✓ / ✓ / ✓ | ✓ / ✓ / ✓ | ✓ / ✓ / ✓ | ✓ / ✓ / ✓ |
| MissForest | ✓ / ✕ / ✓ | ✓ / ✕ / ✓ | ✓ / ✕ / ✓ | ✓ / ✕ / ✓ |
| Neural networks | ✓ / ✓ / ✓ | ✓ / ✓ / ✓ | ✓ / ✓ / ✓ | ✓ / ✓ / ✓ |
In the following, we present the main deep learning models used for missing data imputation. These models include convolutional neural networks (CNNs), recurrent neural networks (RNNs), variational autoencoders (VAEs), and GANs. Each model has a unique approach to handling incomplete data.
3.3.1. Convolutional neural networks
CNNs [29] are particularly well-suited for imputing missing data in images, where missing pixels can be estimated based on spatial correlations with nearby pixels. CNNs utilize convolutional layers to extract features from input images, effectively capturing local dependencies. This makes them ideal for applications where data exhibits spatial patterns, such as medical imaging or satellite data.
3.3.2. Recurrent neural networks
RNNs [30] are frequently employed for imputing temporal data, as they leverage previous information to predict missing values. These models maintain an internal state that captures the sequence of previous inputs, making them suitable for time series imputation. An advanced variant, long short-term memory (LSTM) networks, addresses the vanishing gradient problem by maintaining long-term dependencies, which is particularly useful for long-range temporal correlations.
3.3.3. Variational autoencoders
VAEs [31] use an encoder to compress data into a latent representation and a decoder to reconstruct it. Their probabilistic framework enables realistic imputations in complex, non-linear datasets. They achieve this by generating data distributions close to the original.
3.3.4. Generative adversarial networks
GANs [32] consist of a generator and a discriminator that compete during training. The generator produces synthetic data, while the discriminator distinguishes real from generated data. This adversarial learning enables realistic imputations for complex data types.
3.3.5. Comparative advantages of deep learning models
Deep learning models outperform traditional methods in capturing non-linear and high-dimensional patterns. GANs and VAEs, in particular, generate realistic imputations. However, they require significant computational resources, are sensitive to hyperparameters, and risk overfitting with limited data. Despite these challenges, their feature-learning capability makes them highly effective across various data types.
3.3.6. GAN-based models
GANs [33] iteratively improve data generation through competition between a generator and a discriminator. This adversarial approach has enabled breakthroughs in missing data imputation [34].
a. Generative adversarial imputation network
Generative adversarial imputation network (GAIN) [35] adapts GAN principles for imputation, using a mask matrix to highlight missing values. The generator predicts missing data, while the discriminator evaluates imputations. The architecture involves three components: data, mask, and noise matrices. Algorithm 1 outlines its operation.

Algorithm 1. Pseudo-code of GAIN
Require: Dataset with missing values
Ensure: Complete data vector
1: Initialize generator G and discriminator D
2: while loss has not converged do
3:    Draw random samples and masks
4:    Generate imputations with G
5:    Compute discriminator loss and update D
6:    Compute generator loss and update G
7: end while
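To make the pseudo-code concrete, the following is a minimal PyTorch sketch of one GAIN-style training step, simplified from the original formulation; the layer sizes, hint rate, and loss weight alpha are illustrative assumptions rather than the authors' settings.

```python
import torch
import torch.nn as nn

# Minimal GAIN-style training step (simplified). The values of dim, alpha,
# hint_rate, and the learning rates below are assumptions for illustration.
dim = 8          # number of features
alpha = 10.0     # weight of the reconstruction loss on observed entries
hint_rate = 0.9  # probability of revealing a mask entry to the discriminator

G = nn.Sequential(nn.Linear(dim * 2, dim), nn.ReLU(), nn.Linear(dim, dim), nn.Sigmoid())
D = nn.Sequential(nn.Linear(dim * 2, dim), nn.ReLU(), nn.Linear(dim, dim), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce, mse = nn.BCELoss(), nn.MSELoss()

def train_step(x, m):
    """x: data scaled to [0, 1] with missing entries set to 0; m: mask (1 = observed)."""
    z = torch.rand_like(x)                      # noise for the missing entries
    x_tilde = m * x + (1 - m) * z               # observed values plus noise
    g_out = G(torch.cat([x_tilde, m], dim=1))   # generator sees the data and the mask
    x_hat = m * x + (1 - m) * g_out             # keep observed values, impute the rest

    b = (torch.rand_like(m) < hint_rate).float()
    hint = b * m + 0.5 * (1 - b)                # partially revealed mask (hint mechanism)

    # Discriminator step: classify each entry as observed (1) or imputed (0)
    d_prob = D(torch.cat([x_hat.detach(), hint], dim=1))
    loss_d = bce(d_prob, m)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step: fool D on missing entries and reconstruct observed entries
    d_prob = D(torch.cat([x_hat, hint], dim=1))
    loss_g = -torch.mean((1 - m) * torch.log(d_prob + 1e-8)) + alpha * mse(m * g_out, m * x)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```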
b. Missing data GAN
MisGAN [36] learns high-dimensional data distributions by combining two generators and two discriminators, one pair for the masks and one for the data. Algorithm 2 summarizes its training process.
Algorithm 2. Pseudo-code of MisGAN
Require: Dataset with missing values
Ensure: Complete data
1: while iterations not complete do
2:    Train mask discriminator D_m and generator G_m
3:    Train data discriminator D_x and generator G_x
4:    Update both generators with combined loss
5: end while
c. Other GAN variants
– Stackelberg GAN: uses multiple generators to handle complex imputation tasks [32].
– SolarGAN: tailored for solar data imputation with Wasserstein GAN techniques [37].
– ConvGAIN: extends GAIN with convolutional layers for spatio-temporal correlations [38].
– DEGAIN: builds on GAIN with enhanced loss functions [39].
– GAN-based sperm-inspired pixel imputation: introduces an identity block and a sperm motility-inspired metaheuristic to improve imputation robustness and address mode collapse and vanishing gradients [40].
– Menstrual cycle inspired GAN: integrates adaptive loss functions and identity blocks inspired by endometrial behavior to enhance imputation in medical images [41].
Deep learning, particularly GANs, provides powerful tools for imputing missing data. Despite challenges like high computational demands and overfitting risks, ongoing innovations continue to improve their robustness and adaptability across various domains.
4. EVALUATION METHODS
Evaluation metrics are essential for measuring the quality of missing data imputation in images by quantifying the discrepancy between the original and imputed data. This work focuses on three main metrics: mean squared error (MSE), root mean squared error (RMSE), and Fréchet inception distance (FID).
4.1. Mean squared error
MSE measures the average of the squared differences between the actual and imputed values. A lower MSE indicates better imputation quality. A key variant of MSE is RMSE, which computes the square root of the average squared prediction errors:

$$\mathrm{RMSE}(y, \hat{y}) = \sqrt{\mathrm{MSE}(y, \hat{y})} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2} \qquad (4)$$

RMSE is often preferred for evaluating imputation models as it: provides error measurements in the same units as the target variable, aiding interpretation; penalizes larger errors more significantly; and is less sensitive to outliers compared to MSE.
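A small sketch of how RMSE is typically computed on the imputed entries only (the toy vectors and mask are assumptions):

```python
import numpy as np

def rmse(y_true, y_imputed, missing_mask):
    """RMSE restricted to the entries that were actually missing."""
    diff = (y_true - y_imputed)[missing_mask]
    return np.sqrt(np.mean(diff ** 2))

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_imputed = np.array([3.0, 5.4, 7.0, 8.1])
mask = np.array([False, True, False, True])   # entries that had been removed
print(rmse(y_true, y_imputed, mask))           # approximately 0.70
```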
4.2. Fréchet inception distance
The FID, introduced by [10], is widely used to evaluate the quality of images generated by generative models, including GANs. It has been applied to state-of-the-art models such as StyleGAN1 and StyleGAN2 [42]. FID quantifies the similarity between the feature distributions of generated and real images. It calculates the Fréchet distance between two probability distributions. FID provides a robust measure for assessing the fidelity of generative models by comparing how closely generated images match real image distributions.
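For reference, FID is commonly computed from the means and covariances of Inception features of real (r) and generated (g) images:

$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)$$

where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of the Inception feature distributions of real and generated images; lower values indicate a closer match.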
4.3. Evaluation framework
This work employs the following metrics to evaluate missing data imputation quality: i) MSE and RMSE assess prediction accuracy and variability; and ii) FID evaluates the fidelity of generative models, especially GANs. These metrics establish a strong foundation for selecting and optimizing imputation models in various contexts. The subsequent section analyzes imputation models, highlighting their strengths, limitations, and practical applications.
5. DISCUSSION
This section provides a critical evaluation of the methods for missing data imputation based on three main criteria: the imputation approach (single or multiple), the variable types (quantitative or qualitative), and the missing data mechanisms (MCAR, MAR, or MNAR). Table 3 summarizes these methods, illustrating their applicability and limitations.
– Traditional methods: hot-deck and cold-deck approaches perform well in specific scenarios (MCAR, MAR) but fail under complex mechanisms (MNAR) or when continuous variables are involved. Mean and median imputations are effective under MCAR but introduce bias in MAR and MNAR cases.
– Machine learning approaches, such as KNN and regression, exhibit adaptability to both quantitative and qualitative variables. However, their performance declines in MNAR cases or with non-linear relationships.
– Advanced models: neural networks and MICE provide the most comprehensive solutions, excelling across all criteria, including the ability to handle diverse data types and multiple imputations.
GAN-based models: Table 4 presents a detailed comparison of GAN-based models, showcasing their architectures, evaluation metrics, and domain-specific applications. Key insights include: i) GAIN offers a flexible, fully connected architecture effective for categorical, numerical, and image data; extensions to temporal and textual domains are recommended. ii) View imputation generative adversarial network (VIGAN) focuses on image data with a multimodal DAE and CNN; its performance could improve with multi-view datasets. iii) SolarGAN is designed for time-series data, with potential applications in photovoltaic forecasting.

In conclusion, neural networks and GAN-based models stand out for their robustness and adaptability. However, careful alignment of method selection with data type and missing data mechanism is crucial. Future research should emphasize domain-specific optimizations and comparisons to address complex scenarios effectively.
Table 4. Comparison of GAN-based models

| Model | Year | Dataset | Evaluation | Code | Architecture | G | D | Mechanism | Type |
|---|---|---|---|---|---|---|---|---|---|
| GAIN | 2018 | UCI and MNIST | RMSE | Yes | FC | 1 | 1 | MCAR | Qualitative |
| VIGAN | 2017 | MNIST | RMSE | Yes | FC, CNN | 2 | 2 | NA | Quantitative |
| MisGAN | 2019 | CIFAR-10 and CelebA | FID and RMSE | Yes | FC, CNN | 2 | 2 | MCAR | Quantitative |
| CollaGAN | 2019 | T2-FLAIR and RaFD | NMSE and SSIM | Yes | CNN | 1 | 1 | NA | Quantitative |
| Stackelberg | 2018 | Tiny ImageNet | FID | No | FC | M | 1 | NA | Quantitative |
| SolarGAN | 2020 | GEFCom2014 | MSE | Yes | GRUI, FC | 1 | 1 | NA | Qualitative |
| ConvGAIN | 2021 | CHS dataset | RMSE | Yes | CNN | 1 | 1 | MCAR | Qualitative |
| DEGAIN | 2022 | UCI | RMSE and FID | No | Deconv | 1 | 1 | NA | NA |
| GSIP | 2025 | Energy Images, NREL Solar Images, and NREL Wind Turbine | RMSE, RSNR, SSIM, FID | No | CNN, Deconv | 1 | 1 | NA | Qualitative |
| MCI-GAN | 2025 | Medical images | RMSE, RSNR, FID, IS, SSIM | No | CNN | 1 | 1 | MAR | Quantitative |

(G and D denote the number of generators and discriminators in the internal structure; "Mechanism" and "Type" describe the missing data handled.)
Table 4 provides an overview of GAN-based models for missing data imputation. It compares their internal structures, architectures, evaluation metrics, tested datasets, and data handling capabilities across various domains (categorical, numerical, image, and time series). This analysis offers a detailed understanding of each model's strengths, limitations, and application potential.

Key insights: among GAN-based models, GAIN stands out for its flexibility and broad applicability across categorical, numerical, and image data. VIGAN leverages a multimodal DAE and CNN for image tasks, with room for multi-view improvements. MisGAN performs well under MCAR but requires adaptation for broader use. CollaGAN focuses on image-to-image translation, while Stackelberg GAN explores multi-generator designs for numerical data. SolarGAN is tailored to time-series imputation, and ConvGAIN and DEGAIN enhance spatial modeling and generator performance through CNNs and deconvolution.

Overall, these models illustrate the evolution of GAN-based imputation. GAIN, in particular, provides a strong base for future domain-specific extensions. Emphasis should be placed on improving adaptability and addressing stability and interpretability challenges.
5.1. Best-performing methods by missing data mechanism
Based on the literature synthesis and the comparative Table 3, the following conclusions can be drawn:
– MCAR: simple statistical methods such as mean/median imputation and KNN are often sufficient due to the randomness of missingness. GAN-based models like GAIN and MisGAN also perform well under MCAR assumptions.
– MAR: more advanced methods such as MICE, MissForest, and neural networks are better suited, as they can leverage relationships among observed variables. GAN models like MCI-GAN also show promising results.
– MNAR: handling MNAR remains challenging. Methods based on neural networks and certain robust variants of GANs (e.g., DEGAIN, GSIP) offer improved results, though no method fully resolves the MNAR scenario without domain knowledge or additional assumptions.
5.2. Challenges and limitations of GAN-based imputation models
Despite their powerful capabilities, GAN-based imputation models face several technical challenges that limit their reliability and generalizability.
5.2.1. Mode collapse and convergence issues
GAN training is notoriously unstable due to the adversarial nature of the generator and discriminator. Mode collapse, where the generator produces limited data patterns regardless of input noise, results in biased or unrealistic imputations. Additionally, convergence is difficult to assess, and training may oscillate or diverge without producing meaningful imputations [43].
5.2.2. Hyperparameter sensitivity
GANs are sensitive to hyperparameters such as learning rates, batch sizes, and architecture depth. Fine-tuning these parameters is often problem-specific and computationally expensive, requiring extensive empirical experimentation [44]. Poorly chosen hyperparameters may lead to overfitting or non-convergent training, particularly when working with sparse datasets or complex data structures.
5.2.3. Potential solutions
Several strategies have been proposed to improve the stability and effectiveness of GAN-based imputation: i) pretraining techniques: pretraining the generator or discriminator with autoencoder structures or VAEs can stabilize learning and prevent early collapse [45]; ii) hybrid architectures: models combining GANs with VAEs (e.g., VAE-GAN) or transformer encoders enhance both stability and representational richness [46]; iii) regularization and loss design: advanced loss functions (e.g., Wasserstein loss with gradient penalty, sketched below) and spectral normalization can improve convergence and reduce sensitivity to hyperparameters; and iv) meta-learning: adaptively selecting the best imputation strategy depending on the missingness mechanism (MCAR, MAR, MNAR) and the data type has shown promise in improving generalizability. These improvements not only enhance imputation quality but also address ethical and interpretability concerns by making GANs more stable, transparent, and adaptable to real-world constraints.
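As an illustration of point iii), a minimal gradient-penalty term in PyTorch could look like the sketch below (assuming 2-D tabular batches and the usual penalty weight of 10; this is a generic WGAN-GP helper, not a routine from the surveyed models):

```python
import torch

def gradient_penalty(critic, real, fake):
    """WGAN-GP term: penalize the critic's gradient norm for deviating from 1."""
    eps = torch.rand(real.size(0), 1, device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    score = critic(interp)
    grads = torch.autograd.grad(
        outputs=score, inputs=interp,
        grad_outputs=torch.ones_like(score), create_graph=True,
    )[0]
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()

# Critic loss: critic(fake).mean() - critic(real).mean() + 10.0 * gradient_penalty(critic, real, fake)
```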
5.3. Ethical implications of data imputation
Data imputation techniques, while essential for maintaining data integrity, pose significant ethical challenges, especially when applied in critical domains such as healthcare, finance, and social sciences. The use of advanced imputation methods, particularly those based on GANs, raises concerns related to accuracy, fairness, transparency, and accountability.
5.3.1. Risk of inaccurate imputation
One of the primary ethical concerns in data imputation is the risk of inaccurate imputations leading to erroneous conclusions or biased decision-making. In healthcare, for instance, imputing missing patient data with GAN-based methods without adequate validation could result in misleading diagnostic outcomes or inappropriate treatments [47]. In finance, incorrect imputation of financial metrics might lead to flawed credit scoring, adversely affecting individuals or businesses [48].
5.3.2. Fairness and bias
GAN-based imputation methods may inadvertently propagate or amplify existing biases present in the training data. For example, if demographic data from underrepresented groups are under-imputed or inaccurately generated, this can lead to discriminatory outcomes in automated decision-making systems, such as loan approvals or health risk assessments [49].
5.3.3. Opacity and lack of interpretability
GANs are often considered "black-box" models, meaning their decision-making processes are inherently difficult to interpret. This lack of transparency poses ethical challenges when imputations significantly influence high-stakes decisions. Developing interpretable imputation models or integrating explainable AI (XAI) techniques is essential to ensure accountability and build trust in automated systems [50].
5.3.4. Privacy concerns
The use of GANs for data imputation may also raise privacy issues. Since GANs generate synthetic data that resemble real-world data, there is a risk that sensitive information might be reconstructed, even when anonymization techniques are applied. This potential for data leakage necessitates rigorous privacy-preserving mechanisms during the imputation process [51].
5.3.5. Mitigation strategies
To address these ethical challenges, researchers and practitioners should consider the following approaches: i) ethical guidelines for data imputation: establishing clear guidelines to evaluate the ethical impact of imputation methods, particularly in sensitive domains; ii) algorithmic fairness audits: regularly auditing GAN-based models to identify and mitigate bias, especially when handling demographic data; iii) improving model transparency: incorporating XAI methods, such as feature attribution and latent space visualization, to make imputed results more interpretable and trustworthy; and iv) data privacy mechanisms: employing techniques like differential privacy to ensure that GAN-generated data does not inadvertently reveal personal information.
6. CONCLUSION AND FUTURE WORK
This study underscores the significance of selecting imputation methods that are well-suited to the nature of missing data and variable types. GAN-based models have demonstrated strong potential in handling complex data structures such as images and time series, especially in high-impact fields like healthcare, finance, and environmental analysis. Their adaptability and capacity to generate realistic values make them valuable tools in advancing missing data imputation techniques. However, these models still face notable challenges, including training instability, mode collapse, and hyperparameter tuning difficulties. Hybrid models that combine GANs with VAEs have emerged as a promising direction, offering both the generative strength of GANs and the stability of VAEs. Moreover, the integration of meta-learning techniques could allow for dynamic selection of imputation strategies based on dataset characteristics, thus enhancing generalization.

Despite their performance, the interpretability of GAN-based models remains limited, raising concerns in critical domains where transparency is essential. Future research should therefore explore the incorporation of XAI methods to improve understanding and trust in the imputation process. Additionally, efforts should focus on scaling these models for real-world applications, improving their computational efficiency, and ensuring their reliability across diverse data contexts. Overall, this work lays the groundwork for further exploration into robust, interpretable, and scalable imputation strategies using GANs.
ACKNOWLEDGMENTS
The authors would like to sincerely thank Mr. Mohamed El Hadramy Oumar, founder of Vector Mind, for his generous support in facilitating the transaction required for the publication process. His assistance is gratefully acknowledged.
FUNDING INFORMATION
The authors declare that no funding was involved in the preparation of this manuscript.
AUTHOR CONTRIBUTIONS STATEMENT
This journal uses the Contributor Roles Taxonomy (CRediT) to recognize individual author contributions, reduce authorship disputes, and facilitate collaboration.