IAES International Journal of Artificial Intelligence (IJ-AI)
Vol. 14, No. 4, August 2025, pp. 2876-2888
ISSN: 2252-8938, DOI: 10.11591/ijai.v14.i4.pp2876-2888

A survey of missing data imputation techniques: statistical methods, machine learning models, and GAN-based approaches

Rifaa Sadegh, Ahmed Mohameden, Mohamed Lemine Salihi, Mohamedade Farouk Nanne
Scientific Computing, Computer Science and Data Science, Department of Computer Science, Faculty of Science and Technology, University of Nouakchott, Nouakchott, Mauritania

Article Info

Article history:
Received Jun 8, 2024
Revised Jun 11, 2025
Accepted Jul 10, 2025

Keywords:
Data imputation
Generative adversarial networks
Machine learning
Missing data
Statistical methods

ABSTRACT

Efficiently addressing missing data is critical in data analysis across diverse domains. This study evaluates traditional statistical, machine learning, and generative adversarial network (GAN)-based imputation methods, emphasizing their strengths, limitations, and applicability to different data types and missing data mechanisms (missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR)). GAN-based models, including the generative adversarial imputation network (GAIN), the view imputation generative adversarial network (VIGAN), and SolarGAN, are highlighted for their adaptability and effectiveness in handling complex datasets, such as images and time series. Despite challenges like computational demands, GANs outperform conventional methods in capturing non-linear dependencies. Future work includes optimizing GAN architectures for broader data types and exploring hybrid models to enhance imputation accuracy and scalability in real-world applications.

This is an open access article under the CC BY-SA license.

Corresponding Author:
Rifaa Sadegh
Scientific Computing, Computer Science and Data Science, Department of Computer Science
Faculty of Science and Technology, University of Nouakchott
Nouakchott, Mauritania
Email: rifasadegh@gmail.com

1. INTRODUCTION

Missing data is a pervasive challenge that affects nearly every scientific discipline, from medicine [1] to geology [2], energy [3], and environmental sciences [4]. Rubin [5] defined missing data as unobserved values that could yield critical insights if available. These gaps introduce biases, distort analysis, and reduce the effectiveness of algorithms, ultimately impairing decision-making processes. The origins of missing data are diverse, arising from incomplete data collection, recording errors, or hardware malfunctions [5]. These gaps skew results and misrepresent the studied population [6], creating a need for robust and scalable solutions to ensure reliable research outcomes. Addressing missing data has proven to be a multifaceted problem, requiring methods that vary with the type and complexity of the dataset. Early approaches, such as listwise deletion, were simple but often discarded valuable information along with the missing data [7]. Over time, more sophisticated imputation techniques emerged, including statistical methods, machine learning algorithms, and deep learning models. Among these, generative adversarial networks (GANs) have gained prominence for their ability to model complex data distributions and address non-linear dependencies effectively.
Despite their potential, implementing GANs for data imputation comes with challenges, including: i) high computational costs due to complex training processes; ii) sensitivity to hyperparameter tuning, which affects model stability; and iii) a risk of overfitting, particularly when handling small datasets.

This paper provides a comprehensive review of missing data imputation methods. We analyze traditional statistical approaches, machine learning techniques, and deep learning models, with a particular focus on GAN-based imputation. Our findings reveal that while GANs outperform traditional methods in handling complex datasets, their deployment requires careful balancing of model complexity and computational efficiency. We also propose future research directions, including: i) the integration of hybrid models combining statistical techniques with GANs; ii) the optimization of GAN architectures for imputation tasks; and iii) the application of these techniques to real-world datasets in fields such as healthcare, energy, and environmental science. By addressing these challenges and exploring innovative solutions, this work aims to contribute to the growing body of knowledge in data imputation, enabling researchers and practitioners to better handle missing data scenarios.

The remainder of this article is structured as follows: section 2 introduces the methodology and criteria for evaluating imputation methods. Section 3 presents a comparative analysis of different approaches. Section 4 discusses the implications of the results, including ethical considerations related to imputation in sensitive domains. Section 5 concludes with key findings and recommendations for future research.

2. MISSING DATA MECHANISMS AND TYPES OF VARIABLES

Handling missing data is critical for ensuring the reliability of statistical analyses. Understanding the mechanisms underlying missing data and the types of variables involved is fundamental for selecting appropriate imputation techniques. This section explores the categories of missing completely at random (MCAR), missing not at random (MNAR), and missing at random (MAR), alongside a classification of statistical variables and imputation approaches.

2.1. Missing data categories

Missing data can be classified into three distinct categories: MCAR, MNAR, and MAR [5].
- MCAR: data is missing randomly, unrelated to observed or unobserved variables. Example: pixels missing in radiological images due to random noise or technical errors, such as sensor malfunction.
- MAR: missingness depends on observed variables. Example: crop yield data missing in regions with extreme weather conditions, where the meteorological data is recorded.
- MNAR: missingness depends on unobserved variables [8]. Example: fetal position affects the visibility of genital organs during an ultrasound, leading to gender data being systematically missing when the fetus is positioned laterally or with crossed legs.

Figure 1 provides an illustration. Table 1 summarizes the criteria distinguishing these categories. MCAR is ignorable, while MAR and MNAR require advanced techniques to mitigate bias. $P(M = 1 \mid Y_o, Y_m, \psi)$ defines the probability of the missing data mechanism, where $\psi$ represents the set of parameters of the imputation model, $Y_o$ the observed values, and $Y_m$ the missing values. When data is MNAR, the probability of the mechanism cannot be defined because it depends on one or more unmeasured parameters, i.e., unobserved variables.
Figure 1. Missing data mechanisms
Table 1. Comparison of missing data mechanisms

Criterion | MAR | MNAR | MCAR
Random | No | No | Yes
Ignorable | It depends | No | Yes
Dependency | Observed variables | Unobserved variables | None
$P(M = 1 \mid Y_o, Y_m, \psi)$ | $P(M = 1 \mid Y_o, \psi)$ | Undefined | $P(M = 1 \mid \psi)$

2.2. Imputation approaches

Imputation methods are categorized based on variable relationships:
- Single vs. multiple imputation: single imputation replaces a missing value with one estimate, while multiple imputation generates several plausible values [9].
- Univariate vs. multivariate: univariate imputation considers only the target variable, whereas multivariate imputation incorporates relationships between variables [10].

Multivariate methods are preferable for datasets with strong interdependencies, as they support both single and multiple imputations, as shown in Table 2.

Table 2. Comparison of imputation types

Criterion | Univariate approach | Multivariate approach
Replacement values | 1 | m
Uses correlations between variables | No | Yes
Imputation types supported | Single imputation | Single and multiple imputations

2.3. Types of variables

Statistical variables are classified as: i) quantitative (e.g., continuous: salary; discrete: age); and ii) qualitative (e.g., nominal: marital status; ordinal: satisfaction level) [11]. Misinterpretations arise when qualitative variables are numerically encoded (e.g., zip codes), as their mean has no significance. Figure 2 provides an overview.

Figure 2. Types of statistical variables

3. IMPUTATION METHODS

Managing missing data is crucial across various fields to ensure the accuracy of analyses and predictive models. This section reviews several imputation techniques, ranging from traditional statistical methods to advanced machine learning and deep learning approaches. Each method's strengths and limitations are discussed, along with their suitability for different data types and contexts.

3.1. Statistical methods

Statistical methods are foundational for imputation. Key approaches include similarity-based methods, observation-based methods, measures of central tendency, and multivariate imputation by chained equations (MICE).

3.1.1. Similarity-based methods

The hot-deck method replaces missing values with values from similar individuals in the same dataset. The cold-deck method uses values from external sources; it is applied when there are not enough similar data points [12], [13].
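To make the similarity-based idea concrete, the following minimal Python sketch (not from the original paper) performs hot-deck imputation on a numeric table: complete rows act as donors, and the closest donor under Euclidean distance on the recipient's observed columns supplies the missing values. The column names and data are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def hot_deck_impute(df: pd.DataFrame) -> pd.DataFrame:
    """Fill each incomplete row from its most similar complete row (donor),
    comparing only the columns the incomplete row actually observed."""
    out = df.copy()
    donors = df.dropna()  # complete rows form the donor pool
    for idx, row in df[df.isna().any(axis=1)].iterrows():
        observed = row.notna()
        # Euclidean distance to every donor on the observed columns only
        dists = np.linalg.norm(
            donors.loc[:, observed].values - row[observed].values, axis=1)
        donor = donors.iloc[np.argmin(dists)]
        out.loc[idx, ~observed] = donor[~observed].values
    return out

# Toy example with two hypothetical variables
df = pd.DataFrame({"age": [25, 30, np.nan, 40],
                   "income": [30_000, 32_000, 31_000, np.nan]})
print(hot_deck_impute(df))
```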
3.1.2. Observation-based methods

Methods like last observation carried forward (LOCF), baseline observation carried forward (BOCF), worst observation carried forward (WOCF), and next observation carried backward (NOCB) are commonly used for longitudinal data. These methods replace missing values based on temporal patterns and rely on the assumption that nearby observations carry meaningful information [14]-[17].

3.1.3. Measures of central tendency

The objective of central tendency measures is to summarize, in a single value, the elements of a variable in a dataset. The most commonly used central tendency measures are the mean [18], the median [19], and the mode [20]. There are various means [21], such as the arithmetic, quadratic, harmonic, geometric, weighted, and truncated means. Here, we illustrate the arithmetic mean, where imputation involves replacing the missing values of a variable with the sum of its known values divided by the number of known values:

$$\forall i \in \{1, 2, \dots, p\}, \quad \bar{y}_i = \frac{1}{n} \sum_{j=1}^{n} y_{ij}, \quad y_{ij} \notin Y_m$$

The arithmetic mean is only applicable to quantitative variables, especially continuous ones. However, it can also be used for discrete variables, in which case the result is rounded to the nearest integer.

The median is the value that divides the elements of an observed variable into two equal parts. After sorting the values of the target observed variable in ascending order, imputation by the median involves replacing the missing values of a variable with the middle value when the number of observations n is odd, or the average of the two middle observations when n is even:

$$\forall i \in \{1, 2, \dots, p\}, \quad \tilde{y}_i = \begin{cases} y_{\left(\frac{n+1}{2}\right)} & \text{if } n \equiv 1 \pmod{2} \\[4pt] \dfrac{1}{2}\left(y_{\left(\frac{n}{2}\right)} + y_{\left(\frac{n}{2}+1\right)}\right) & \text{if } n \equiv 0 \pmod{2} \end{cases}$$

In addition to the classical median, there are other ways [22] to calculate a measure of central position, such as the weighted median, the geometric median, and the median absolute deviation. Imputation by mode replaces missing data with the most frequent value of the target variable:

$$\forall i \in \{1, 2, \dots, p\}, \; \exists j \in \{1, 2, \dots, n\} \text{ such that } \hat{y}_i = \operatorname*{arg\,max}_{y_{ij}} \; P(Y = y_{ij})$$

Although the mode can be calculated for both numerical and categorical variables, in practice it is commonly used only for nominal variables, as they do not have other central tendency measures.

3.1.4. Multivariate imputation by chained equations

MICE is an iterative approach that imputes missing data using regression models. Each missing value is predicted by a regression model based on the other variables in the dataset, and the algorithm iterates until the imputed values converge [23], [24].
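As a brief illustration (not from the original paper), the snippet below sketches central-tendency imputation and a MICE-style chained-equations imputation with scikit-learn, assuming scikit-learn is available; note that IterativeImputer is scikit-learn's experimental, MICE-inspired implementation rather than the canonical MICE package, and the data values are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[25.0, 30_000.0],
              [30.0, 32_000.0],
              [np.nan, 31_000.0],
              [40.0, np.nan]])

# Central tendency: replace NaNs with the column mean
# (strategy can also be "median", or "most_frequent" for the mode).
mean_imputer = SimpleImputer(strategy="mean")
X_mean = mean_imputer.fit_transform(X)

# MICE-style chained equations: each incomplete column is regressed on the
# others, cycling until the imputed values stabilise.
mice_imputer = IterativeImputer(max_iter=10, random_state=0)
X_mice = mice_imputer.fit_transform(X)

print(X_mean)
print(X_mice)
```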
3.2. Machine learning methods

Machine learning methods offer advantages over traditional statistical approaches, particularly in handling large and complex datasets [25]. This section reviews four popular machine learning models for data imputation: linear regression, logistic regression, k-nearest neighbors (KNN), and decision trees.

3.2.1. Regression

Regression models estimate relationships between the target and observed variables. We focus on linear and logistic regression [26]. Linear regression models aim to capture a proportional trend between inputs and outcomes. They operate by applying the least squares method to reduce the gap between actual observations and model predictions:

$$y = \alpha x + \beta + \epsilon \tag{1}$$

Here, $\alpha$ is the slope coefficient of the regression line, $\beta$ is the intercept, and $\epsilon$ is the error term, representing the deviation left unexplained by the linear relationship between the observed value $y$ and the predicted value $\alpha x + \beta$.
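The following minimal sketch (not from the paper) shows how such a regression model can be used for imputation: a linear model is fitted on the complete cases and then used to predict the missing entries of a target column. It assumes scikit-learn is available, and the `age` and `income` columns are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"age": [25, 30, 35, 40, 45],
                   "income": [30_000, 32_000, np.nan, 40_000, np.nan]})

observed = df["income"].notna()
model = LinearRegression()
# Fit y = alpha * x + beta on the complete cases only
model.fit(df.loc[observed, ["age"]], df.loc[observed, "income"])

# Predict the missing target values from the observed covariate
df.loc[~observed, "income"] = model.predict(df.loc[~observed, ["age"]])
print(df)
```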
Logistic regression is used for binary classification; it models the probability of the target variable being 1 using a logistic function:

$$p = \frac{1}{1 + e^{-z}} \tag{2}$$

Here, $p$ is the probability that the target variable equals 1, and $z$ is the linear function of the form:

$$z = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n \tag{3}$$

where $b_0, b_1, b_2, \dots, b_n$ are the regression coefficients and $x_1, x_2, \dots, x_n$ are the observed variables.

3.2.2. K-nearest neighbors

The basic idea of KNN is to find the k nearest neighbors of the individual with missing data [27]. The algorithm requires two parameters: the value of k and the similarity metric between individuals. Similarity is computed using a distance measure such as the Euclidean, Manhattan, or Minkowski distance.

3.2.3. Decision trees

Decision trees partition data into subsets based on feature values to predict missing values. Random forests, ensembles of multiple decision trees trained on different subsets, enhance robustness and reduce overfitting. MissForest [28], a widely used variant, begins with naive imputations and iteratively refines predictions via random forests. These methods are more flexible than traditional statistical approaches but may require careful tuning for high-dimensional or sparse datasets.

3.3. Deep learning methods

Deep learning models offer two major advantages over traditional machine learning models. First, traditional methods often require manual selection of relevant features or variables for training the imputation model. In contrast, deep learning models use neural networks to automatically learn these features from raw data; this "automation" occurs during the learning phase, where the biases and weights in each layer of the neural network are adjusted to better capture the underlying patterns in the data. The second advantage is the versatility of neural networks, which makes them easily adaptable to various scenarios, including the 12 cases illustrated in Table 3. Neural networks can model complex, non-linear relationships, making them particularly effective for imputing data with intricate patterns.

Table 3. Overview of methods for imputing missing data, indicating the applicability of each method (hot-deck, cold-deck, LOCF/BOCF/NOCB, mean and median, mode, MICE, KNN, linear regression, logistic regression, MissForest, and neural networks) to univariate and multivariate imputation, quantitative and qualitative variables, and the MAR, MNAR, and MCAR mechanisms.

In the following, we present the main deep learning models used for missing data imputation: convolutional neural networks (CNNs), recurrent neural networks (RNNs), variational autoencoders (VAEs), and GANs. Each model takes a distinct approach to handling incomplete data.

3.3.1. Convolutional neural networks

CNNs [29] are particularly well-suited for imputing missing data in images, where missing pixels can be estimated from spatial correlations with nearby pixels. CNNs utilize convolutional layers to extract features from input images, effectively capturing local dependencies. This makes them ideal for applications where data exhibit spatial patterns, such as medical imaging or satellite data.
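As a toy illustration of this idea (not a published model), the sketch below trains a small convolutional network in PyTorch to fill in artificially masked pixels of random 28x28 images; the architecture, mask rate, and training loop are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConvImputer(nn.Module):
    """Tiny CNN that predicts a full image from a masked image and its mask."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),  # input: masked image + mask
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, x_masked, mask):
        return self.net(torch.cat([x_masked, mask], dim=1))

# Synthetic stand-in for a real image dataset
images = torch.rand(64, 1, 28, 28)
mask = (torch.rand_like(images) > 0.2).float()  # 1 = observed, 0 = missing
x_masked = images * mask

model = ConvImputer()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    recon = model(x_masked, mask)
    # Train on the pixels that were artificially masked out
    loss = (((recon - images) ** 2) * (1 - mask)).sum() / (1 - mask).sum()
    loss.backward()
    opt.step()

# Keep observed pixels, fill the missing ones with the network's prediction
imputed = x_masked + (1 - mask) * model(x_masked, mask).detach()
```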
3.3.2. Recurrent neural networks

RNNs [30] are frequently employed for imputing temporal data, as they leverage previous information to predict missing values. These models maintain an internal state that captures the sequence of previous inputs, making them suitable for time series imputation. An advanced variant, long short-term memory (LSTM) networks, addresses the vanishing gradient problem by maintaining long-term dependencies, which is particularly useful for long-range temporal correlations.

3.3.3. Variational autoencoders

VAEs [31] use an encoder to compress data into a latent representation and a decoder to reconstruct it. Their probabilistic framework enables realistic imputations in complex, non-linear datasets. They achieve this by generating data distributions close to the original.

3.3.4. Generative adversarial networks

GANs [32] consist of a generator and a discriminator that compete during training. The generator produces synthetic data, while the discriminator distinguishes real from generated data. This adversarial learning enables realistic imputations for complex data types.

3.3.5. Comparative advantages of deep learning models

Deep learning models outperform traditional methods in capturing non-linear and high-dimensional patterns. GANs and VAEs, in particular, generate realistic imputations. However, they require significant computational resources, are sensitive to hyperparameters, and risk overfitting with limited data. Despite these challenges, their feature-learning capability makes them highly effective across various data types.

3.3.6. GAN-based models

GANs [33] iteratively improve data generation through competition between a generator and a discriminator. This adversarial approach has enabled breakthroughs in missing data imputation [34].

a. Generative adversarial imputation network
The generative adversarial imputation network (GAIN) [35] adapts GAN principles for imputation, using a mask matrix to highlight missing values. The generator predicts missing data, while the discriminator evaluates the imputations. The architecture involves three components: data, mask, and noise matrices. Algorithm 1 outlines its operation.

Algorithm 1 Pseudo-code of GAIN
Require: Dataset with missing values
Ensure: Complete data vector
1: Initialize generator G and discriminator D
2: while loss has not converged do
3:    Draw random samples and masks
4:    Generate imputations with G
5:    Compute discriminator loss and update D
6:    Compute generator loss and update G
7: end while
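To make Algorithm 1 concrete, here is a minimal PyTorch sketch of a GAIN-style training loop on synthetic data. The network sizes, hint rate, and loss weights are illustrative assumptions, and this is a simplified reading of [35], not the authors' reference implementation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data: a complete matrix with about 20% of entries masked out (MCAR)
n, d = 256, 4
X_full = torch.rand(n, d)
M = (torch.rand(n, d) > 0.2).float()   # mask: 1 = observed, 0 = missing
X = X_full * M                          # observed data, zeros where missing

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(),
                         nn.Linear(32, out_dim), nn.Sigmoid())

G = mlp(2 * d, d)   # generator sees [data with noise at missing slots, mask]
D = mlp(2 * d, d)   # discriminator sees [imputed data, hint], predicts the mask
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss(reduction="none")
alpha, hint_rate = 10.0, 0.9

for step in range(2000):
    Z = torch.rand(n, d)                 # noise for the missing entries
    X_tilde = M * X + (1 - M) * Z
    B = (torch.rand(n, d) < hint_rate).float()
    H = B * M + 0.5 * (1 - B)            # hint matrix partially reveals the mask

    # Discriminator step: guess which entries were truly observed
    G_out = G(torch.cat([X_tilde, M], dim=1)).detach()
    X_hat = M * X + (1 - M) * G_out
    D_prob = D(torch.cat([X_hat, H], dim=1))
    loss_D = bce(D_prob, M).mean()
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator step: fool D on missing entries, reconstruct observed ones
    G_out = G(torch.cat([X_tilde, M], dim=1))
    X_hat = M * X + (1 - M) * G_out
    D_prob = D(torch.cat([X_hat, H], dim=1))
    adv = -((1 - M) * torch.log(D_prob + 1e-8)).mean()
    rec = (M * (G_out - X) ** 2).mean()
    loss_G = adv + alpha * rec
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()

# Final imputation keeps observed values and fills the rest with G's output
with torch.no_grad():
    Z = torch.rand(n, d)
    X_imputed = M * X + (1 - M) * G(torch.cat([M * X + (1 - M) * Z, M], dim=1))
```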
b. Missing data GAN
MisGAN [36] learns high-dimensional data distributions by combining two generators and two discriminators, one pair for masks and one for data. Algorithm 2 summarizes its training process.

Algorithm 2 Pseudo-code of MisGAN
Require: Dataset with missing values
Ensure: Complete data
1: while iterations not complete do
2:    Train mask discriminator D_m and generator G_m
3:    Train data discriminator D_x and generator G_x
4:    Update both generators with combined loss
5: end while

c. Other GAN variants
- Stackelberg GAN: uses multiple generators to handle complex imputation tasks [32].
- SolarGAN: tailored for solar data imputation with Wasserstein GAN techniques [37].
- ConvGAIN: extends GAIN with convolutional layers for spatio-temporal correlations [38].
- DEGAIN: builds on GAIN with enhanced loss functions [39].
- GAN-based sperm-inspired pixel imputation: introduces an identity block and a sperm motility-inspired metaheuristic to improve imputation robustness and address mode collapse and vanishing gradients [40].
- Menstrual cycle inspired GAN: integrates adaptive loss functions and identity blocks inspired by endometrial behavior to enhance imputation in medical images [41].

Deep learning, particularly GANs, provides powerful tools for imputing missing data. Despite challenges such as high computational demands and overfitting risks, ongoing innovations continue to improve their robustness and adaptability across various domains.

4. EVALUATION METHODS

Evaluation metrics are essential for measuring the quality of missing data imputation in images by quantifying the discrepancy between the original and imputed data. This work focuses on three main metrics: mean squared error (MSE), root mean squared error (RMSE), and Fréchet inception distance (FID).

4.1. Mean squared error

MSE measures the average of the squared differences between the actual and imputed values; a lower MSE indicates better imputation quality. A key variant of MSE is RMSE, which computes the square root of the average squared prediction errors:

$$\mathrm{RMSE}(y, \hat{y}) = \sqrt{\mathrm{MSE}(y, \hat{y})} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \tag{4}$$

RMSE is often preferred for evaluating imputation models because it:
- provides error measurements in the same units as the target variable, aiding interpretation;
- penalizes larger errors more significantly; and
- is less sensitive to outliers than MSE.

4.2. Fréchet inception distance

The FID, introduced by [10], is widely used to evaluate the quality of images generated by generative models, including GANs, and has been applied to state-of-the-art models such as StyleGAN1 and StyleGAN2 [42]. FID quantifies the similarity between the feature distributions of generated and real images by calculating the Fréchet distance between two probability distributions. FID provides a robust measure for assessing the fidelity of generative models by comparing how closely generated images match real image distributions.

4.3. Evaluation framework

This work employs the following metrics to evaluate missing data imputation quality: i) MSE and RMSE, which assess prediction accuracy and variability; and ii) FID, which evaluates the fidelity of generative models, especially GANs. These metrics establish a strong foundation for selecting and optimizing imputation models in various contexts. The subsequent section analyzes imputation models, highlighting their strengths, limitations, and practical applications.
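As a small illustration (not part of the original paper), the snippet below computes MSE and RMSE restricted to artificially masked entries, which is the usual way imputation accuracy is scored when the ground truth is available; the masking setup and the mean-imputation baseline are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X_true = rng.normal(size=(100, 5))       # complete ground-truth matrix
mask = rng.random(X_true.shape) > 0.2    # True = observed, False = missing

# Baseline imputation: fill missing entries with column means of observed data
X_imp = X_true.copy()
col_means = np.nanmean(np.where(mask, X_true, np.nan), axis=0)
X_imp[~mask] = np.take(col_means, np.nonzero(~mask)[1])

# Score only the entries that were actually imputed
err = X_true[~mask] - X_imp[~mask]
mse = np.mean(err ** 2)
rmse = np.sqrt(mse)
print(f"MSE = {mse:.4f}, RMSE = {rmse:.4f}")
```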
5. DISCUSSION

This section provides a critical evaluation of the methods discussed, assessing them against three main criteria: the imputation approach (single or multiple), the variable types (quantitative or qualitative), and the missing data mechanisms (MCAR, MAR, or MNAR). Table 3 summarizes these methods, illustrating their applicability and limitations.

Traditional methods: hot-deck and cold-deck approaches perform well in specific scenarios (MCAR, MAR) but fail under complex mechanisms (MNAR) or when continuous variables are involved. Mean and median imputations are effective under MCAR but introduce bias in MAR and MNAR cases. Machine learning approaches such as KNN and regression exhibit adaptability to both quantitative and qualitative variables; however, their performance declines in MNAR cases or with non-linear relationships. Advanced models: neural networks and MICE provide the most comprehensive solutions, excelling across all criteria, including the ability to handle diverse data types and multiple imputations.

GAN-based models: Table 4 presents a detailed comparison of GAN-based models, showcasing their architectures, evaluation metrics, and domain-specific applications. Key insights include: i) GAIN offers a flexible, fully connected architecture effective for categorical, numerical, and image data; extensions to temporal and textual domains are recommended. ii) The view imputation generative adversarial network (VIGAN) focuses on image data with a multimodal denoising autoencoder (DAE) and CNN; its performance could improve with multi-view datasets. iii) SolarGAN is designed for time-series data, with potential applications in photovoltaic forecasting.

In conclusion, neural networks and GAN-based models stand out for their robustness and adaptability. However, careful alignment of method selection with data type and missing data mechanism is crucial. Future research should emphasize domain-specific optimizations and comparisons to address complex scenarios effectively.

Table 4. Comparison of GAN-based models

Model | Year | Dataset | Evaluation | Code | Architecture | G | D | Mechanism | Type
GAIN | 2018 | UCI and MNIST | RMSE | Yes | FC | 1 | 1 | MCAR | Qualitative
VIGAN | 2017 | MNIST | RMSE | Yes | FC, CNN | 2 | 2 | NA | Quantitative
MisGAN | 2019 | CIFAR-10 and CelebA | FID and RMSE | Yes | FC, CNN | 2 | 2 | MCAR | Quantitative
CollaGAN | 2019 | T2-FLAIR and RaFD | NMSE and SSIM | Yes | CNN | 1 | 1 | NA | Quantitative
Stackelberg GAN | 2018 | Tiny ImageNet | FID | No | FC | M | 1 | NA | Quantitative
SolarGAN | 2020 | GEFCom2014 | MSE | Yes | GRUI, FC | 1 | 1 | NA | Qualitative
ConvGAIN | 2021 | CHS dataset | RMSE | Yes | CNN | 1 | 1 | MCAR | Qualitative
DEGAIN | 2022 | UCI | RMSE and FID | No | Deconv | 1 | 1 | NA | NA
GSIP | 2025 | Energy Images, NREL Solar Images, and NREL Wind Turbine | RMSE, RSNR, SSIM, FID | No | CNN, Deconv | 1 | 1 | NA | Qualitative
MCI-GAN | 2025 | Medical images | RMSE, RSNR, FID, IS, SSIM | No | CNN | 1 | 1 | MAR | Quantitative

Table 4 provides an overview of GAN-based models for missing data imputation. It compares their internal structures (architecture and number of generators G and discriminators D), evaluation metrics, tested datasets, and the missing data mechanisms and variable types handled across various domains (categorical, numerical, image, and time series). This analysis offers a detailed understanding of each model's strengths, limitations, and application potential.
Key insights: among GAN-based models, GAIN stands out for its flexibility and broad applicability across categorical, numerical, and image data. VIGAN leverages a multimodal DAE and CNN for image tasks, with room for multi-view improvements. MisGAN performs well under MCAR but requires adaptation for broader use. CollaGAN focuses on image-to-image translation, while Stackelberg GAN explores multi-generator designs for numerical data. SolarGAN is tailored to time-series imputation, and ConvGAIN and DEGAIN enhance spatial modeling and generator performance through CNNs and deconvolution. Overall, these models illustrate the evolution of GAN-based imputation. GAIN, in particular, provides a strong base for future domain-specific extensions. Emphasis should be placed on improving adaptability and addressing stability and interpretability challenges.

5.1. Best-performing methods by missing data mechanism

Based on the literature synthesis and the comparative Table 3, the following conclusions can be drawn:
- MCAR: simple statistical methods such as mean/median imputation and KNN are often sufficient due to the randomness of missingness. GAN-based models like GAIN and MisGAN also perform well under MCAR assumptions.
- MAR: more advanced methods such as MICE, MissForest, and neural networks are better suited, as they can leverage relationships among observed variables. GAN models like MCI-GAN also show promising results.
- MNAR: handling MNAR remains challenging. Methods based on neural networks and certain robust variants of GANs (e.g., DEGAIN, GSIP) offer improved results, though no method fully resolves the MNAR scenario without domain knowledge or additional assumptions.

5.2. Challenges and limitations of GAN-based imputation models

Despite their powerful capabilities, GAN-based imputation models face several technical challenges that limit their reliability and generalizability.

5.2.1. Mode collapse and convergence issues

GAN training is notoriously unstable due to the adversarial nature of the generator and discriminator. Mode collapse, where the generator produces a limited range of data patterns regardless of input noise, results in biased or unrealistic imputations. Additionally, convergence is difficult to assess, and training may oscillate or diverge without producing meaningful imputations [43].

5.2.2. Hyperparameter sensitivity

GANs are sensitive to hyperparameters such as learning rates, batch sizes, and architecture depth. Fine-tuning these parameters is often problem-specific and computationally expensive, requiring extensive empirical experimentation [44]. Poorly chosen hyperparameters may lead to overfitting or non-convergent training, particularly when working with sparse datasets or complex data structures.

5.2.3. Potential solutions

Several strategies have been proposed to improve the stability and effectiveness of GAN-based imputation: i) pretraining techniques: pretraining the generator or discriminator with autoencoder structures or VAEs can stabilize learning and prevent early collapse [45]; ii) hybrid architectures: models combining GANs with VAEs (e.g., VAE-GAN) or transformer encoders enhance both stability and representational richness [46]; iii) regularization and loss design: advanced loss functions (e.g., Wasserstein loss with gradient penalty) and spectral normalization can improve convergence and reduce sensitivity to hyperparameters; and iv) meta-learning: adaptively selecting the best imputation strategy depending on the missingness mechanism (MCAR, MAR, MNAR) and the data type has shown promise in improving generalizability. These improvements not only enhance imputation quality but also address ethical and interpretability concerns by making GANs more stable, transparent, and adaptable to real-world constraints.

5.3. Ethical implications of data imputation

Data imputation techniques, while essential for maintaining data integrity, pose significant ethical challenges, especially when applied in critical domains such as healthcare, finance, and the social sciences. The use of advanced imputation methods, particularly those based on GANs, raises concerns related to accuracy, fairness, transparency, and accountability.

5.3.1. Risk of inaccurate imputation

One of the primary ethical concerns in data imputation is the risk of inaccurate imputations leading to erroneous conclusions or biased decision-making.
In healthcare, for instance, imputing missing patient data with GAN-based methods without adequate validation could result in misleading diagnostic outcomes or inappropriate treatments [47]. In finance, incorrect imputation of financial metrics might lead to flawed credit scoring, adversely affecting individuals or businesses [48].

5.3.2. Fairness and bias

GAN-based imputation methods may inadvertently propagate or amplify existing biases present in the training data. For example, if demographic data from underrepresented groups are under-imputed or inaccurately generated, the result can be discriminatory outcomes in automated decision-making systems, such as loan approvals or health risk assessments [49].
5.3.3. Opacity and lack of interpretability

GANs are often considered "black-box" models, meaning their decision-making processes are inherently difficult to interpret. This lack of transparency poses ethical challenges when imputations significantly influence high-stakes decisions. Developing interpretable imputation models or integrating explainable AI (XAI) techniques is essential to ensure accountability and build trust in automated systems [50].

5.3.4. Privacy concerns

The use of GANs for data imputation may also raise privacy issues. Since GANs generate synthetic data that resemble real-world data, there is a risk that sensitive information might be reconstructed, even when anonymization techniques are applied. This potential for data leakage necessitates rigorous privacy-preserving mechanisms during the imputation process [51].

5.3.5. Mitigation strategies

To address these ethical challenges, researchers and practitioners should consider the following approaches: i) ethical guidelines for data imputation: establishing clear guidelines to evaluate the ethical impact of imputation methods, particularly in sensitive domains; ii) algorithmic fairness audits: regularly auditing GAN-based models to identify and mitigate bias, especially when handling demographic data; iii) improving model transparency: incorporating XAI methods, such as feature attribution and latent space visualization, to make imputed results more interpretable and trustworthy; and iv) data privacy mechanisms: employing techniques like differential privacy to ensure that GAN-generated data do not inadvertently reveal personal information.

6. CONCLUSION AND FUTURE WORK

This study underscores the significance of selecting imputation methods that are well suited to the nature of the missing data and the variable types involved. GAN-based models have demonstrated strong potential in handling complex data structures such as images and time series, especially in high-impact fields like healthcare, finance, and environmental analysis. Their adaptability and capacity to generate realistic values make them valuable tools for advancing missing data imputation techniques. However, these models still face notable challenges, including training instability, mode collapse, and hyperparameter tuning difficulties. Hybrid models that combine GANs with VAEs have emerged as a promising direction, offering both the generative strength of GANs and the stability of VAEs. Moreover, the integration of meta-learning techniques could allow for dynamic selection of imputation strategies based on dataset characteristics, thus enhancing generalization. Despite their performance, the interpretability of GAN-based models remains limited, raising concerns in critical domains where transparency is essential. Future research should therefore explore the incorporation of XAI methods to improve understanding and trust in the imputation process. Additionally, efforts should focus on scaling these models for real-world applications, improving their computational efficiency, and ensuring their reliability across diverse data contexts. Overall, this work lays the groundwork for further exploration into robust, interpretable, and scalable imputation strategies using GANs.

ACKNOWLEDGMENTS

The authors would like to sincerely thank Mr.
Mohamed El Hadramy Oumar, founder of Vector Mind, for his generous support in facilitating the transaction required for the publication process. His assistance is gratefully acknowledged.

FUNDING INFORMATION

The authors declare that no funding was involved in the preparation of this manuscript.

AUTHOR CONTRIBUTIONS STATEMENT

This journal uses the Contributor Roles Taxonomy (CRediT) to recognize individual author contributions, reduce authorship disputes, and facilitate collaboration.