IAES International Journal of Artificial Intelligence (IJ-AI)
Vol. 15, No. 1, February 2026, pp. 725-743
ISSN: 2252-8938, DOI: 10.11591/ijai.v15.i1.pp725-743

Explainable deep learning for scalable record linkage: a TabNet-based framework for structured data integration

Fatima Zahrae Saber1, Ali Choukri1, Mohamed Amnai1, Abderrahim Waga2
1 Department of Computer Science, Faculty of Science, Ibn Tofail University, Kenitra, Morocco
2 School of Digital Engineering and Artificial Intelligence, Euromed University of Fes, Fez, Morocco

Article Info
Article history:
Received Apr 30, 2025
Revised Oct 30, 2025
Accepted Nov 8, 2025

Keywords:
Big data
Data quality
Deep neural networks
Record linkage
TabNet

ABSTRACT
Record linkage is a fundamental process for ensuring data quality and reliability, with critical applications in domains such as healthcare, finance, and commerce. This paper presents a machine learning-based approach for optimizing record linkage in structured datasets. By integrating hybrid blocking methods (combining standard blocking and sorted neighborhood approaches) with advanced similarity measures, computational overhead is significantly reduced while high accuracy is maintained. The performance of TabNet, a deep learning model designed for tabular data, is compared with traditional deep neural networks (DNNs) in the classification phase. Experimental results on a synthetic dataset of 5,000 records demonstrate that TabNet achieves precision and recall comparable to DNNs while reducing execution time by over 79%. These findings highlight the scalability and efficiency of the proposed method, making it well suited for large-scale data management tasks. This work thus contributes practical and computationally efficient solutions for record linkage in the era of big data.

This is an open access article under the CC BY-SA license.

Corresponding Author:
Fatima Zahrae Saber
Department of Computer Science, Faculty of Science, Ibn Tofail University
Kenitra, Morocco
Email: fatimazahrae.saber@uit.ac.ma

1. INTRODUCTION
Data plays a crucial role in many aspects of daily life. Ensuring high data quality often involves the use of record linkage techniques, which aim to identify and remove duplicate entries referring to the same entity, as shown in Figure 1. This process contributes to improved data integrity by reducing redundancy and minimizing errors. However, as databases increase in size and complexity, record linkage becomes increasingly challenging. Traditional methods, such as probabilistic record linkage [1], tend to be time-consuming and resource-intensive. In the context of big data [2], new challenges arise, including high processing demands, increased hardware costs, and difficulties in accurately determining whether records truly match. The record linkage process can be divided into four main steps [3]: data preprocessing, indexing, comparison, and classification of records. In the first step, tasks such as standardizing and normalizing data are performed to create a uniform database. The second step involves building an index of record pairs that may match, which helps reduce the time required for comparison. Only records within the same group are compared.
For large databases, different indexing methods are employed, such as locality-sensitive hashing (LSH) and sorted block indexing [4], each with its advantages and disadvantages. In the third step, similarity scores are calculated
between the values of each record pair, resulting in scores for all pairs. The final step involves classification, where the record pairs are labeled as matching or not matching based on the calculated scores.

Figure 1. Data consolidation process

Some methods for the pair classification phase utilize machine learning algorithms, such as support vector machines (SVM) and XGBoost [5], while others are supported by deep neural networks (DNNs) [6]. DNNs have been recognized as powerful tools in this domain due to their ability to learn complex patterns from large amounts of structured data. To better understand the current challenges and advancements in the field, previous studies that have addressed the record linkage issue are reviewed. Record linkage, also known as entity resolution, is regarded as a critical task for the integration and deduplication of large datasets across diverse domains. Over the years, various methodologies have been proposed to address the challenges of scalability, accuracy, and privacy in record linkage processes. This paper proposes a novel hybrid blocking technique integrated with TabNet, a deep learning model specifically designed for tabular data [7]. Our approach optimizes the record linkage process by reducing computational overhead, improving execution time, and maintaining high accuracy. Through experiments conducted on synthetic datasets, we demonstrate the effectiveness and scalability of our method, highlighting its potential for large-scale applications in data management.

2. RELATED WORK
Record linkage, or entity resolution, is a critical task in data integration. The goal is to identify and merge records that refer to the same entity across different datasets. Over the years, various approaches have been developed to address challenges related to scalability, accuracy, and privacy in record linkage processes.

2.1. Traditional methods
Traditional probabilistic models, such as the Fellegi-Sunter model, have long dominated record linkage [8], focusing on probabilistic scoring to match records based on similarity thresholds. Recent advancements have incorporated ensemble methods and machine learning algorithms, as demonstrated in probabilistic record linkage for families (PRLF), an open-source Python-based tool. PRLF employs generalized linear models and machine learning to improve accuracy under challenging conditions, such as data degradation and missing fields, offering robust performance across synthetic and real-world datasets.

2.2. Machine learning approaches
Heydari et al. [9] propose a distributed record linkage method applied to healthcare data using Apache Spark and its MLlib library. Their approach utilizes machine learning algorithms, such as regression and SVM, to match records based on preprocessed features like names, dates of birth, and zip codes. This study is notable for its use of stratified sampling to address the common issue of imbalanced datasets in record linkage, as well as its rigorous model validation, ensuring robust performance. The results demonstrate remarkable accuracy (up to 96.71% for regression), highlighting the scalability offered by Spark in handling massive data environments. This method showcases the effectiveness of a distributed approach in addressing challenges related to scalability and accuracy, although it focuses primarily on healthcare-specific data.
2.3. Deep learning-based methods
An innovative solution [10] introduces a scalable deep learning-based approach designed for big data scenarios. This method builds an artificial neural network (ANN), specifically a Siamese network, to efficiently encode records for faster similarity computations. By leveraging the cosine similarity metric, the network
classifies record pairs as either matched or unmatched. The use of Apache Spark further enhances the scalability of this method, enabling parallel processing of large datasets and reducing computational overhead. This integration of deep learning and distributed computing makes it particularly suitable for handling large-scale data integration tasks. The application of deep learning to record linkage is a major research direction that seeks to address the scalability and inflexibility problems of conventional rule-based approaches. Yulianton and Santi [11] present a deep learning approach for e-commerce product matching based on Sentence-BERT. Using lightweight transformer embeddings and cosine similarity with a fixed threshold, their method effectively captures semantic similarities between heterogeneous product titles. Evaluated on the Pricerunner dataset, the approach achieves high accuracy and perfect precision, demonstrating that efficient SBERT-based models are well suited for large-scale product matching tasks. Meanwhile, newer models such as transformers [12] hold considerable promise for matching. In the related field of map matching, a transformer model achieved F1-scores of over 96%, setting a very high standard of efficiency for sequence matching problems. This work confirms the ability of deep learning models to handle complex contextual information, and it also highlights the need for solutions that remain effective and powerful when addressing real-world data problems. Table 1 presents a comparative study of record linkage methods using deep learning for tabular data.

Table 1. Comparative table of existing deep learning approaches for record linkage
Category      | Method/study                 | Approach/technique                                                                         | Key features/strengths                                                                                                                            | Mentioned performance                                                | Reference
Deep learning | Sentence-BERT (MiniLM)       | Transformer-based sentence embeddings with cosine similarity and threshold-based matching | Lightweight transformer enabling semantic product matching with high precision and low computational cost; scalable to large e-commerce datasets | Accuracy: 98.10%, Precision: 100%, Recall: 91.84%, F1-score: 95.74% | Yulianton and Santi [11]
Deep learning | Neural ER (tuple embeddings) | DNNs for learning distributed representations of structured entity attributes             | Effective for complex entity matching tasks on heterogeneous structured data, including medical and product datasets                             | F1-score up to 94%                                                   | Peeters and Bizer [13]
Deep learning | Transformer (Seq2Seq)        | Transfer learning with a transformer architecture                                          | Shows high potential for sequence matching tasks, though reported for map matching                                                               | F1-score: > 96% (at segment level)                                   | Jin et al. [12]

2.4. Privacy-preserving record linkage based methods
The method of Wang et al. [14] seeks to enhance bloom filter-based privacy-preserving record linkage (PPRL). Their "(Hash)-A" hashing approach tackles information loss by encoding q-gram frequency to more effectively differentiate between records and thus increase matching accuracy. To protect privacy, the "utility-optimized bloom filter" (UBF) approach applies user-level differential privacy (ULDP) to subject only a subset of bits recognized as sensitive to intense perturbation. This selective protection provides a better trade-off between utility (linkage accuracy) and privacy than current methods.
Ranbaduge et al. [15] present the first multi-party PPRL protocol to combine deep learning with a federated learning paradigm. The database owners initially encode their records into bloom filters, to which differential privacy noise is injected to provide provable privacy protection. Local deep learning models are then trained separately by each party on feature vectors (similarity/distance scores) derived from these noisy bloom filters. Lastly, the local models are submitted to a secure aggregator that ensembles them into a global model, which a linkage unit uses to classify unlabeled data. Table 2 summarizes the categories of record linkage methods, with the advantages and disadvantages of each.

Our proposed solution optimizes record linkage through the combination of a hybrid blocking strategy and a TabNet classifier, a deep learning model specifically designed for tabular data. This combination introduces a new computation-accuracy trade-off for medium to large datasets. First, the hybrid blocking technique significantly reduces the number of pairs that need to be compared, lowering the computational workload while maintaining high recall of potential matches. This is complemented by several critical advantages of the TabNet model: highly accurate classification, enhanced interpretability through its attention mechanism, and, most significantly, efficiency, with execution time reduced by over 79% compared to a standard DNN.
Table 2. Comparative table of record linkage approaches
Category         | Method/study                      | Approach/technique                                                                  | Key features/strengths                                                       | Mentioned performance                    | Reference
PPRL             | Enhanced bloom filter PPRL        | "(Hash)-A" hashing with q-gram frequency and UBF with differential privacy         | Selective protection for a better accuracy-privacy trade-off                 | Improved trade-off                       | Wang et al. [14]
Machine learning | Distributed approach              | Regression and SVM on Apache Spark (MLlib)                                          | High scalability; handles imbalanced data                                    | Up to 96.71% accuracy                    | Heydari et al. [9]
Probabilistic    | PRLF                              | Ensemble methods, generalized linear models, and machine learning                   | Open-source Python tool; robust against data degradation and missing fields  | Robust performance                       | Prindle et al. [8]
PPRL             | Federated PPRL with deep learning | Multi-party protocol using noisy bloom filters to train and aggregate local models  | First protocol combining DL and federated learning for PPRL                  | Robust performance                       | Ranbaduge et al. [15]
Deep learning    | Transformer model                 | Transformer (Seq2Seq) with transfer learning                                        | Very promising; high efficiency for sequences                                | F1-scores of over 96%                    | Jin et al. [12]
Deep learning    | Siamese network on Spark          | ANN (Siamese network) to encode records                                             | Scalable, designed for big data                                              | Efficient for reducing computation time  | Wolcott et al. [10]

3. METHOD
In this framework, record linkage is cast as a supervised learning problem. It starts by leveraging the freely extensible biomedical record linkage 2 (FEBRL 2) dataset, which comprises 5,000 synthetic records with pre-determined duplicates and therefore serves as labeled training and validation data. For each candidate record pair, a vector of similarities is calculated by applying stable measures such as the Jaro-Winkler (JW) and Levenshtein distances. This feature vector is then used to train a deep learning-based classifier, TabNet, which learns to distinguish duplicate pairs (matches) from non-duplicate pairs. The trained model can then predict whether new, unseen record pairs are duplicates. Several inherent challenges of this supervised learning task are directly addressed by our methodology: i) Noisy data: real-world data is notorious for entry errors, format variations, and missing values. Our preprocessing stage of uppercase conversion, removal of irrelevant symbols, and numeric field cleaning reduces the impact of such noise. Furthermore, the JW distance was chosen specifically so that common typographical errors can be tolerated, making the method robust to noise. ii) Class imbalance: class imbalance is a well-known critical issue in record linkage, as was the case in our earlier work, where this problem caused very low precision. Our approach utilizes deep learning models that perform well on imbalanced tabular data, as is evident from the high precision and recall values achieved. iii) Domain adaptation: the fact that the synthetic data might not capture the complexity of real data is cited as a primary limitation. Therefore, as a key future exercise, the generalizability of our model will be evaluated on a large, real-world dataset, the North Carolina voter registration (NCVR) data, to validate that it generalizes well to a range of production environments. Artificial intelligence (AI) models, and deep learning techniques in particular, are better equipped to handle the ambiguity inherent in real-world data, thereby outperforming their classical rule-based counterparts.
While classical rule-based techniques are traditionally described as rigid, our AI-based technique offers greater flexibility and better performance in handling noisy or missing data. Instead of applying fixed, binary rules, the model is trained on a rich similarity vector. The degree of similarity is quantified through similarity scores computed from the JW and Levenshtein distances, so that variations, typos, and other forms of error can be handled by the model. The TabNet model then learns the complex interactions among these scores in order to make a probabilistic decision, a far more sophisticated task than what a set of hand-written rules could accomplish. This ability to dynamically weight the evidence and to identify the most relevant features for any given prediction, as shown in our interpretability analysis, is why AI deals best with uncertainty. As previously mentioned, the process is composed of four main steps. First, the data is preprocessed to clean and normalize it [3]. Next, a hybrid blocking method is employed to reduce the number of comparisons by dividing the data into smaller, more manageable blocks. Two techniques, sorted neighborhood and standard blocking, are used to create an index of candidate record pairs. These pairs are subsequently compared using
similarity measures such as the Levenshtein distance and the JW distance. The resulting similarity scores are then fed into a classification model. To evaluate performance in terms of execution time and accuracy, TabNet and a DNN model were utilized, aiming to determine the best trade-off between speed and precision [7], [16], as shown in Figure 2.

Figure 2. Proposed record linkage process

3.1. Training and validation dataset
The training and validation dataset used for this study is FEBRL 2, which consists of fictitious records simulating personal information typically found in structured databases. It contains 5,000 rows, including 4,000 original records and 1,000 duplicate records. The dataset comprises six columns, each representing a specific attribute related to individuals, such as first name, last name, address, and other personal details. This dataset is employed to test and validate the proposed record linkage method by replicating real-world conditions encountered in large-scale databases. The main columns include both textual and numerical information, as illustrated in Table 3. These columns represent the types of data commonly found in administrative or commercial databases and present typical challenges such as input errors, missing data, and format inconsistencies. In this context, data preprocessing was essential to normalize certain columns and address inconsistencies, as detailed in the next section. This step is critical for improving the quality of record matches. Table 3 provides a detailed description of each column in the dataset, along with concrete examples and remarks regarding the specific characteristics of each field.

Table 3. Description and specific features of the dataset used
Column name   | Description                     | Data type | Example
given_name    | First name                      | Text      | SARAH
surname       | Surname                         | Text      | BRUHN
address_1     | First line of address           | Text      | FORBES STREET
state         | State                           | Text      | VIC
date_of_birth | Date of birth (format YYYYMMDD) | Numeric   | 19300213
soc_sec_id    | Unique social security number   | Numeric   | 7535316

3.2. Experimental dataset
Three datasets were generated from the training dataset to experiment with and test the proposed method, as well as to evaluate the performance of the models and the execution time of each prediction. The execution time is considered an important criterion in this study, as the objective is to identify a method that reduces the time required for record comparison and duplicate prediction. Larger datasets were created to assess the models' performance at a larger scale. As shown in Table 4, the first dataset consists of 13,000 records, with 10,000 original records and 3,000 duplicates. The second dataset contains 16,000 records, including 12,000 original records and 4,000 duplicates. Finally, the third dataset includes 21,000 records, comprising 16,000 original records and 5,000 duplicates.

Table 4. Overview of datasets used for the experimental models
Records           | Dataset 1 | Dataset 2 | Dataset 3
Total records     | 13,000    | 16,000    | 21,000
Original records  | 10,000    | 12,000    | 16,000
Duplicate records | 3,000     | 4,000     | 5,000
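The FEBRL 2 benchmark described above is also bundled with the open-source Python recordlinkage library, which offers a convenient way to reproduce this setup. The following is a minimal sketch, assuming that library's load_febrl2 generator; the return_links flag and the exact column names come from that package rather than from this paper.

```python
# Minimal sketch: loading the FEBRL 2 benchmark via the `recordlinkage` library.
# The dataset holds 5,000 synthetic records (4,000 originals + 1,000 duplicates);
# `true_links` is the index of known duplicate pairs, used later as ground truth.
from recordlinkage.datasets import load_febrl2

df, true_links = load_febrl2(return_links=True)

print(df.shape)             # (5000, number_of_attribute_columns)
print(df.columns.tolist())  # e.g. given_name, surname, address_1, state, ...
print(len(true_links))      # number of known duplicate pairs
```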
3.3. Data preprocessing
To enhance data quality and facilitate comparisons during the record linkage process, several transformations were applied. First, columns containing textual data such as first names, surnames, and addresses were converted to uppercase to ensure consistency in information representation, regardless of variations in case. Next, irrelevant symbols and characters, particularly in address fields, were removed to refine matches and reduce potential inconsistencies [17]. For numeric fields, particularly zip codes, non-conforming (non-numeric) values were identified and removed to improve the accuracy of comparisons. Additionally, other specific columns underwent tailored cleaning operations, such as the standardization of abbreviations and the correction of typographical errors. These preprocessing steps are crucial for ensuring reliable results during the matching phase [18].
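These cleaning steps can be expressed compactly with pandas. The sketch below is illustrative rather than the authors' exact script; the column names follow Table 3, and the regular expressions used to strip symbols are assumptions.

```python
# Illustrative preprocessing sketch (assumptions: column names from Table 3,
# regex patterns for symbol removal). Mirrors Section 3.3: uppercase text fields,
# strip irrelevant symbols from addresses, and coerce numeric fields.
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Normalize case and whitespace for textual attributes.
    for col in ["given_name", "surname", "address_1", "state"]:
        df[col] = df[col].fillna("").str.upper().str.strip()
    # Remove irrelevant symbols from the address field and collapse spaces.
    df["address_1"] = df["address_1"].str.replace(r"[^A-Z0-9 ]", " ", regex=True)
    df["address_1"] = df["address_1"].str.replace(r"\s+", " ", regex=True).str.strip()
    # Flag non-conforming (non-numeric) values in numeric fields as missing.
    for col in ["date_of_birth", "soc_sec_id"]:
        df[col] = pd.to_numeric(df[col], errors="coerce")
    return df
```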
3.4. Indexing
Indexing is a critical step in the record matching process, designed to reduce the number of record pairs to be compared while maintaining a high level of accuracy in match detection [4]. Given the size of the dataset used in this study (5,000 records), the total number of potential comparisons without indexing would be extremely large, potentially reaching several million pairs. To address this challenge, a hybrid blocking approach was adopted, combining two methods: standard blocking and the sorted neighborhood method. This combination significantly enhances the efficiency of the process, reducing the number of pairs to be compared while effectively identifying relevant matches.

3.4.1. Standard blocking
The first method employed is standard blocking [19], which involves dividing records into blocks based on one or more columns. For this study, records were blocked using the state column, so that comparisons are restricted to records within the same block. While this technique effectively reduces the number of comparisons, limitations arise when the blocking key contains missing or incorrect values.

3.4.2. Sorted neighborhood
To address these limitations, standard blocking was combined with the sorted neighborhood method. This technique involves sorting records based on a sort key and then comparing each record with its neighbors within a fixed-size window [20]. By using the surname as the sort key, this method captures matches that may not be grouped together by standard blocking due to minor variations in the blocking key. The sliding window approach allows comparisons to be made only between neighboring records, significantly reducing the number of pairs to be compared. Figure 3 illustrates this process, where pairs of records (Record a and Record b) are compared after sorting. The lines represent potential matches between neighboring records, showing how the sorted neighborhood method limits the comparisons while capturing relevant matches.

Figure 3. Sorted neighborhood algorithm to index record pairs

3.5. Hybrid blocking
The combination of standard state-based blocking and the sorted neighborhood algorithm forms a robust hybrid blocking approach. First, standard state-based blocking reduces the number of pairs to be compared by excluding records that are geographically too distant. Second, the sorted neighborhood algorithm refines this process by performing comparisons between records sorted by surname, thereby capturing matches that might have been missed by standard blocking alone [21]–[23]. As illustrated in Figure 4, these two methods work together to improve the efficiency and effectiveness of record matching.

Figure 4. Hybrid blocking method

The hybrid approach offers several advantages. The complexity of comparisons is significantly reduced while maintaining a high degree of accuracy in matching records. It effectively handles minor variations in textual data, input errors, and missing values in specific columns. Following this step, an index of over 2.8 million record pairs is generated, which will be compared and classified as either matched or unmatched pairs. Table 5 presents the number of record pairs for each dataset, highlighting that as the number of records in a dataset increases, the number of record pairs for comparison also grows.

Table 5. Number of record pairs for each dataset
Dataset          | Train/validation | Exp. dataset 1 | Exp. dataset 2 | Exp. dataset 3
Total records    | 5,000            | 13,000         | 16,000         | 21,000
Pairwise indexes | 28,700           | 2,900,000      | 4,200,000      | 7,300,000

Table 6 compares different blocking strategies in terms of how well they reduce the number of candidate pairs for record linkage. While a full index results in nearly 12.5 million pairs (0% reduction), the best-performing method is the proposed hybrid blocking approach. It reduces the number of comparisons to just 28,702 pairs, achieving a 99.77% reduction ratio (RR). This emphasizes the central contribution of the hybrid strategy to improving the computational efficiency of the record linkage pipeline.

Table 6. Comparison of blocking strategies by number of candidate pairs and RR
Method                        | Number of pairs | RR (%)
Full index                    | 12,497,500      | 0.00
Blocking (state)              | 2,768,103       | 77.85
SortedNeighbourhood (surname) | 75,034          | 99.40
Hybrid blocking               | 28,702          | 99.77
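One way to realize the indexing strategies compared in Table 6 is with the recordlinkage library's built-in indexers. The window size and the use of the block_on argument to combine the sorted neighborhood pass with state-based blocking are assumptions; the paper reports only the resulting pair counts, not the exact parameters used.

```python
# Sketch of the three indexing strategies from Table 6 using `recordlinkage`.
# Assumptions: window size of 9 and intersection-style hybrid via `block_on`.
import recordlinkage

def candidate_pairs(df):
    # Standard blocking on the state column.
    block_idx = recordlinkage.Index()
    block_idx.block("state")
    pairs_block = block_idx.index(df)

    # Sorted neighbourhood with the surname as the sort key.
    snm_idx = recordlinkage.Index()
    snm_idx.sortedneighbourhood("surname", window=9)
    pairs_snm = snm_idx.index(df)

    # Hybrid: sorted neighbourhood restricted to records sharing the same state,
    # i.e. both criteria must hold for a pair to be kept.
    hybrid_idx = recordlinkage.Index()
    hybrid_idx.sortedneighbourhood("surname", window=9, block_on=["state"])
    pairs_hybrid = hybrid_idx.index(df)

    return pairs_block, pairs_snm, pairs_hybrid
```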
3.6. Comparison phase
In the comparison phase, similarity measures are applied to assess the correspondence between record pairs. Two well-established methods, the JW and Levenshtein distances, have been selected for this task. Each comparison produces a similarity score ranging from 0 to 1, reflecting the degree of correspondence between field values. These scores are then aggregated into a similarity vector, which summarizes the overall similarity between the two records. This similarity vector serves as the foundation for the subsequent classification phase, where it is determined whether the records represent the same entity.

3.6.1. Jaro-Winkler distance
The JW distance metric is especially effective for short strings, such as names. In this study, it was applied to several fields, including given_name, surname, address_1, and state. By taking into account both character matches and their order, the JW distance is sensitive to common typographical errors, making it particularly suitable for record matching tasks [24]–[26]. The JW distance improves the Jaro distance $J$ by adding a prefix scale:

$JW = J + l \times p \times (1 - J)$ (1)

In this equation, $l$ is the length of the common prefix (up to 4 characters), and $p$ is a scaling factor, usually 0.1. The adjustment favors strings that match from the beginning.

3.6.2. Levenshtein distance
The Levenshtein distance, also known as the edit distance, measures the minimum number of single-character operations (insertion, deletion, or substitution) required to transform one string into another. In this study, the Levenshtein distance was applied to numeric fields such as date_of_birth and soc_sec_id. This approach effectively quantifies the differences between records, even when there are variations in data entry, such as errors in date formatting or incorrect postal codes [27]. The recursive formula for computing the Levenshtein distance between two strings $a$ and $b$, over prefix lengths $i$ and $j$, is defined as:

$d_{a,b}(i, j) = \begin{cases} \max(i, j) & \text{if } \min(i, j) = 0 \\ d_{a,b}(i-1, j-1) & \text{if } a_i = b_j \\ 1 + \min\big(d_{a,b}(i-1, j),\; d_{a,b}(i, j-1),\; d_{a,b}(i-1, j-1)\big) & \text{otherwise} \end{cases}$ (2)

Here $a$ and $b$ are the two strings being compared, $a_i$ and $b_j$ denote their $i$-th and $j$-th characters, and $d_{a,b}(i, j)$ is the minimum number of edit operations needed to convert the first $i$ characters of $a$ into the first $j$ characters of $b$. The allowed operations are: i) insertion of a single character, ii) deletion of a single character, and iii) substitution of one character for another. This algorithm is widely used in approximate string matching and natural language processing tasks, as it provides a quantifiable measure of similarity between two sequences based on their structural differences.

In Figure 5, each row represents a pair of records, and each column shows the score for the corresponding attribute. For instance, for the first record pair, the scores are 0.466667 for the first name (given_name score), 0.455556 for the surname (surname score), and so on. These individual attribute scores are combined to calculate an overall similarity score for each pair of records. The similarity measures are selectively applied to the record pairs generated in the previous step by the hybrid blocking system. This reduces the number of comparisons required, optimizing the process while maintaining a high precision rate. As a result, similarity vectors are generated, where each pair of records is associated with a similarity score for each attribute.

Figure 5. Comparison of record pair similarities using JW and Levenshtein distances
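The comparison step can be sketched with the recordlinkage Compare object, using Jaro-Winkler for the textual fields and Levenshtein for the numeric fields, as described in this section. The label names are illustrative, and the numeric fields are assumed to have been cast to strings beforehand.

```python
# Sketch of the comparison phase: one normalized similarity score in [0, 1]
# per attribute and per candidate pair, following Section 3.6.
import recordlinkage

def similarity_vectors(pairs, df):
    comp = recordlinkage.Compare()
    # Jaro-Winkler for short textual fields (names, address, state).
    for col in ["given_name", "surname", "address_1", "state"]:
        comp.string(col, col, method="jarowinkler", label=f"{col}_score")
    # Levenshtein for the numeric fields, compared as strings
    # (cast with df[col] = df[col].astype(str) beforehand if needed).
    for col in ["date_of_birth", "soc_sec_id"]:
        comp.string(col, col, method="levenshtein", label=f"{col}_score")
    # One row per candidate pair; columns form the similarity vector.
    return comp.compute(pairs, df)
```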
High scores indicate a strong similarity between the attribute values, suggesting a probable match between the records. This detailed scoring system offers greater flexibility in the final classification step. While traditional record linkage methods typically apply a global similarity threshold to determine matches, our method classifies record pairs as matching or non-matching using TabNet, a deep learning model specifically designed for tabular data. This approach was selected to improve both accuracy and execution time.

3.7. Classification models
For our record linkage experiment, a pragmatic hyperparameter search strategy was used for the two classification models, TabNet and the DNN. Rather than an exhaustive search, we adopted established parameter values. The TabNet model was trained with a learning rate of 0.02, a maximum of 6 epochs, and a patience of 5 for early stopping. The DNN was trained with a binary cross-entropy loss and used dropout and early stopping for regularization. The choice of model was based on a balance between performance, as indicated by accuracy and execution time, and computational efficiency. The goal was to identify the model offering the best balance for deployment at scale.

3.7.1. TabNet
TabNet is a deep learning model uniquely designed for the effective handling of tabular data. Unlike traditional neural network architectures [28], [29], TabNet employs an innovative approach that integrates attention mechanisms with a hierarchical structure to identify and extract relevant features from the data, see Figure 6. This model has demonstrated significant success in tasks involving structured datasets, due to its ability to focus on the most informative parts of the data while maintaining interpretability.

Figure 6. The TabNet model for record pair classification

The TabNet model begins with the feature transformer, which transforms the input variables into richer representations suitable for prediction tasks. This component consists of four layers: fully connected (dense) layers that integrate the variables, batch normalization to stabilize the learning process, and specific activation functions such as gated linear units (GLU) that dynamically select relevant information. The primary purpose of the feature transformer is to extract complex, non-linear representations of the data, capture interactions between variables, and prepare these representations for the next phase, the attentive transformer module. Before progressing to the attentive transformer, the data is divided using a split mechanism into two parts. The first part produces a partial prediction result, while the second part is forwarded to the attentive transformer, which focuses on selecting relevant features. The attentive transformer leverages an attention mechanism to identify and emphasize the most important columns at each stage, capturing intricate relationships among them. This approach enables TabNet to dynamically select relevant combinations of columns, improving its efficiency and flexibility in prediction tasks involving structured data. Once the relationships between columns have been identified, the model dynamically selects the relevant columns at each step using a mask. This process is repeated over 10 steps to generate the final prediction for each record pair, determining whether it is a match or not. Due to its architecture, TabNet has proven to be an effective tool for classification tasks, particularly in the context of structured data.
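A minimal training sketch with the pytorch-tabnet package is given below, using the hyperparameters reported in Section 3.7 (learning rate 0.02, at most 6 epochs, patience 5). The way the labels are derived from the FEBRL true links and the train/validation split are assumptions for illustration.

```python
# Minimal TabNet training sketch (assumptions: label construction from the known
# duplicate pairs, 80/20 stratified split). Hyperparameters follow Section 3.7.
import numpy as np
import torch
from pytorch_tabnet.tab_model import TabNetClassifier
from sklearn.model_selection import train_test_split

def train_tabnet(features, true_links):
    # features: DataFrame of similarity vectors indexed by candidate record pair.
    # A pair is labelled 1 if it appears among the known duplicate pairs.
    y = features.index.isin(true_links).astype(int)
    X = features.to_numpy(dtype=np.float32)

    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )

    clf = TabNetClassifier(
        optimizer_fn=torch.optim.Adam,
        optimizer_params=dict(lr=0.02),
    )
    clf.fit(
        X_tr, y_tr,
        eval_set=[(X_val, y_val)],
        max_epochs=6,
        patience=5,
    )
    return clf
```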
3.7.2. Deep neural networks
Deep learning models, particularly DNNs, have become increasingly utilized for solving record linkage problems, including tasks such as record pair classification, record normalization, and similarity computation between records [6], [30], [31]. The DNN model used for record pair classification consists of three dense layers: an input layer with 256 nodes employing the ReLU activation function, followed by a dropout layer for regularization; a hidden layer with 128 nodes, also using the ReLU activation function and followed by a dropout layer; and an output layer with a single node using a sigmoid activation function for binary classification (1 for matched records and 0 for unmatched records), see Figure 7. Table 7 presents a comparison between the TabNet model and the DNN in terms of architecture, scalability, training time, performance, and other relevant factors. The advantages of the TabNet model over the previously employed DNN are clearly highlighted in the table.

Figure 7. The DNN model for record pair classification

Table 7. Comparison between TabNet and DNNs
Criteria               | TabNet                                         | DNN
Data type              | Optimized for tabular data                     | Can process various data types (images and text)
Architecture           | Uses attention mechanisms and dynamic masks    | Composed of fully connected layers
Activation functions   | Sparse activation via attention                | Non-linear functions such as ReLU and sigmoid
Interpretability       | High, due to the attention mechanism           | Limited, due to the complex structure
Overfitting prevention | Integrated regularization techniques           | Dropout, early stopping, etc.
Scalability            | Efficient on large tabular datasets            | Requires more resources for large datasets
Training time          | Fast, due to feature selection via attention   | Can be long with deep architectures
Performance            | Performs well on imbalanced tabular data       | Requires tuning for optimal performance
Typical applications   | Classification and regression on tabular data  | Computer vision, NLP, and more
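For comparison, the baseline DNN described above can be sketched in Keras as follows. The layer sizes, activations, loss, and early stopping follow Section 3.7.2; the dropout rates, optimizer, batch size, and epoch budget are assumptions not stated in the paper.

```python
# Sketch of the baseline DNN (Section 3.7.2): 256- and 128-unit ReLU layers with
# dropout, a sigmoid output, binary cross-entropy, and early stopping.
import tensorflow as tf
from tensorflow.keras import layers, models, callbacks

def build_dnn(n_features: int) -> tf.keras.Model:
    model = models.Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.3),               # dropout rate is an assumption
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(1, activation="sigmoid"),  # 1 = matched pair, 0 = unmatched
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Usage: early stopping on the validation loss, as described in the paper.
# model = build_dnn(X_tr.shape[1])
# model.fit(X_tr, y_tr, validation_data=(X_val, y_val), epochs=50, batch_size=256,
#           callbacks=[callbacks.EarlyStopping(patience=5, restore_best_weights=True)])
```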