IAES International Journal of Artificial Intelligence (IJ-AI)
Vol. 15, No. 1, February 2026, pp. 725∼743
ISSN: 2252-8938, DOI: 10.11591/ijai.v15.i1.pp725-743
Explainable deep learning for scalable record linkage: a TabNet-based framework for structured data integration
Fatima Zahrae Saber¹, Ali Choukri¹, Mohamed Amnai¹, Abderrahim Waga²
¹Department of Computer Science, Faculty of Science, Ibn Tofail University, Kenitra, Morocco
²School of Digital Engineering and Artificial Intelligence, Euromed University of Fes, Fez, Morocco
Article Info

Article history:
Received Apr 30, 2025
Revised Oct 30, 2025
Accepted Nov 8, 2025

Keywords:
Big data
Data quality
Deep neural networks
Record linkage
TabNet

ABSTRACT
Record linkage is considered a fundamental process for ensuring data quality and reliability, with critical applications in domains such as healthcare, finance, and commerce. A machine learning-based approach for optimizing record linkage in structured datasets is presented in this paper. By integrating hybrid blocking methods (combining standard blocking and sorted neighborhood approaches) with advanced similarity measures, computational overhead is significantly reduced while high accuracy is maintained. The performance of TabNet, a deep learning model designed for tabular data, is compared with traditional deep neural networks (DNNs) in the classification phase. Experimental results on a synthetic dataset of 5,000 records demonstrate that TabNet achieves precision and recall comparable to DNNs while reducing execution time by over 79%. These findings highlight the scalability and efficiency of the proposed method, making it well-suited for large-scale data management tasks. This work contributes practical and computationally efficient solutions for record linkage in the era of big data.

This is an open access article under the CC BY-SA license.
Corresponding Author:
Fatima Zahrae Saber
Department of Computer Science, Faculty of Science, Ibn Tofail University
Kenitra, Morocco
Email: fatimazahrae.saber@uit.ac.ma
1. INTRODUCTION
Data plays a crucial role in many aspects of daily life. Ensuring high quality data often involves the use of record linkage techniques, which aim to identify and remove duplicate entries referring to the same entity, as shown in Figure 1. This process contributes to improved data integrity by reducing redundancy and minimizing errors. However, as databases increase in size and complexity, record linkage becomes increasingly challenging. Traditional methods, such as probabilistic record linkage [1], tend to be time consuming and resource intensive. In the context of big data [2], new challenges arise, including high processing demands, increased hardware costs, and difficulties in accurately determining whether records truly match.

The record linkage process can be divided into four main steps [3]: data preprocessing, indexing, comparison, and classification of records. In the first step, tasks such as standardizing and normalizing data are performed to create a uniform database. The second step involves building an index of record pairs that may match, which helps reduce the time required for comparison. Only records within the same group are compared. For large databases, different indexing methods are employed, such as locality sensitive hashing (LSH) and sorted block indexing [4], each with its advantages and disadvantages.
In the third step, similarity scores are calculated between the values of each record pair, resulting in scores for all pairs. The final step involves classification, where the record pairs are labeled as matching or not matching based on the calculated scores.

Figure 1. Data consolidation process
Some methods for the pair classification phase utilize machine learning algorithms, such as support vector machines (SVM) and XGBoost [5], while others are supported by deep neural networks (DNNs) [6]. DNNs have been recognized as powerful tools in this domain due to their ability to learn complex patterns from large amounts of structured data. To better understand the current challenges and advancements in the field, previous studies that have addressed the record linkage issue are reviewed. Record linkage, also known as entity resolution, is regarded as a critical task for the integration and deduplication of large datasets across diverse domains. Over the years, various methodologies have been proposed to address the challenges of scalability, accuracy, and privacy in record linkage processes.

This paper proposes a novel hybrid blocking technique integrated with TabNet, a deep learning model specifically designed for tabular data [7]. Our approach optimizes the record linkage process by reducing computational overhead, improving execution time, and maintaining high accuracy. Through experiments conducted on synthetic datasets, we demonstrate the effectiveness and scalability of our method, highlighting its potential for large scale applications in data management.
2. RELATED WORK
Record linkage, or entity resolution, is a critical task in data integration. The goal is to identify and merge records that refer to the same entity across different datasets. Over the years, various approaches have been developed to address challenges related to scalability, accuracy, and privacy in record linkage processes.
2.1. Traditional methods
Traditional probabilistic models, such as the Fellegi-Sunter model, have long dominated record linkage [8], focusing on probabilistic scoring to match records based on similarity thresholds. Recent advancements have incorporated ensemble methods and machine learning algorithms, as demonstrated in probabilistic record linkage for families (PRLF), an open source Python based tool. PRLF employs generalized linear models and machine learning to improve accuracy under challenging conditions, such as data degradation and missing fields, offering robust performance across synthetic and real world datasets.
2.2. Machine learning approaches
Heydari et al. [9] propose a distributed record linkage method applied to healthcare data using Apache Spark and its MLlib library. Their approach utilizes machine learning algorithms, such as regression and SVM, to match records based on preprocessed features like names, dates of birth, and zip codes. This study is notable for its use of stratified sampling to address the common issue of imbalanced datasets in record linkage, as well as its rigorous model validation, ensuring robust performance. The results demonstrate remarkable accuracy (up to 96.71% for regression), highlighting the scalability offered by Spark in handling massive data environments. This method showcases the effectiveness of a distributed approach in addressing challenges related to scalability and accuracy, although it focuses primarily on healthcare-specific data.
2.3. Deep learning based methods
An innovative solution [10] introduces a scalable deep learning-based approach designed for big data scenarios. This method builds an artificial neural network (ANN), specifically a Siamese network, to efficiently encode records for faster similarity computations.
By leveraging the cosine similarity metric, the network classifies record pairs as either matched or unmatched. The use of Apache Spark further enhances the scalability of this method, enabling parallel processing of large datasets and reducing computational overhead. This integration of deep learning and distributed computing makes it particularly suitable for handling large-scale data integration tasks.

The application of deep learning to record linkage is a major research area that seeks to address the scalability and inflexibility problems of conventional rule-based approaches. Yulianton and Santi [11] present a deep learning approach for e-commerce product matching based on Sentence-BERT. Using lightweight transformer embeddings and cosine similarity with a fixed threshold, their method effectively captures semantic similarities between heterogeneous product titles. Evaluated on the Pricerunner dataset, the approach achieves high accuracy and perfect precision, demonstrating that efficient SBERT-based models are well suited for large-scale product matching tasks. In the meantime, newer models like transformers [12] hold considerable promise for matching. In the related map matching field, a transformer model achieved F1-scores of over 96%, setting very high levels of efficiency for sequence matching problems. This work confirms the ability of deep learning models to handle complex contextual information, and it also highlights the need for solutions that remain effective and powerful in addressing real-world data problems. Table 1 presents a comparative study of record linkage methods using deep learning for tabular data.
Table 1. Comparative table of existing deep learning approaches for record linkage
Category | Method/study | Approach/technique | Key features/strengths | Mentioned performance | Reference
Deep learning | Sentence-BERT (MiniLM) | Transformer-based sentence embeddings with cosine similarity and threshold-based matching | Lightweight transformer enabling semantic product matching with high precision and low computational cost; scalable to large e-commerce datasets | Accuracy: 98.10%, Precision: 100%, Recall: 91.84%, F1-score: 95.74% | Yulianton and Santi [11]
Deep learning | Neural ER (tuple embeddings) | DNNs for learning distributed representations of structured entity attributes | Effective for complex entity matching tasks on heterogeneous structured data, including medical and product datasets | F1-score up to 94% | Peeters and Bizer [13]
Deep learning | Transformer (Seq2Seq) | Uses transfer learning with a transformer architecture | Shows high potential for sequence matching tasks, though related to "map matching" | F1-score: > 96% (at segment level) | Jin et al. [12]
2.4. Privacy-preserving record linkage based methods
The method of Wang et al. [14] seeks to enhance bloom filter-based privacy-preserving record linkage (PPRL). Their "(Hash)-A" hashing approach tackles information loss by encoding q-gram frequency to more effectively differentiate between records and thus increase matching accuracy. To protect privacy, the "utility-optimized bloom filter" (UBF) approach utilizes user-level differential privacy (ULDP) to subject only a subset of bits recognized as sensitive to intense perturbation. This selective protection provides an improved trade-off between utility (linkage accuracy) and privacy compared with current methods.

Ranbaduge et al. [15] present the first multi-party PPRL protocol to combine deep learning with a federated learning paradigm. The database owners initially encode their records into bloom filters, to which differential privacy noise is injected to provide provable privacy protection. Local deep learning models are then trained separately by each party on feature vectors (similarity/distance scores) derived from such noisy bloom filters. Lastly, the local models are submitted to a secure aggregator that ensembles them into a global model, which a linkage unit uses to classify unlabeled data.
Table 2 summarizes the categories of record linkage methods, with the advantages and disadvantages of each.
Our proposed solution optimizes record linkage through the combination of a hybrid blocking strategy and a TabNet classifier, a deep learning model specifically designed for tabular data. Through this combination, a new computation-accuracy tradeoff for medium to large datasets is introduced. First, the number of pairs that need to be compared is significantly reduced by the hybrid blocking technique, lowering the computational workload while maintaining high recall of potential matches. This is followed by several critical advantages of the TabNet model: highly accurate classification is possible, enhanced interpretability is facilitated through its attention mechanism, and most significantly, it is extremely efficient, with execution time reduced by over 79% compared to a standard DNN.
Table 2. Comparative table of record linkage approaches
Category | Method/study | Approach/technique | Key features/strengths | Mentioned performance | Reference
PPRL | Enhanced bloom filter PPRL | "(Hash)-A" hashing with q-gram frequency and UBF with differential privacy | Selective protection for a better accuracy-privacy trade-off | Improved trade-off | Wang et al. [14]
Machine learning | Distributed approach | Regression and SVM on Apache Spark (MLlib) | High scalability; handles imbalanced data | Up to 96.71% accuracy | Heydari et al. [9]
Probabilistic | PRLF | Ensemble methods, generalized linear models, and machine learning | Open-source Python tool; robust against data degradation and missing fields | Robust performance | Prindle et al. [8]
PPRL | Federated PPRL with deep learning | Multi-party protocol using noisy bloom filters to train and aggregate local models | First protocol combining DL and federated learning for PPRL | Robust performance | Ranbaduge et al. [15]
Deep learning | Transformer model | Transformer (Seq2Seq) with transfer learning | Very promising; high efficiency for sequences | F1-scores of over 96% | Jin et al. [12]
Deep learning | Siamese network on Spark | ANN (Siamese network) to encode records | Scalable, designed for big data | Efficient for reducing computation time | Wolcott et al. [10]
3. METHOD
In this paradigm, record linkage becomes a supervised learning problem. It starts with leveraging the freely extensible biomedical record linkage 2 (FEBRL 2) dataset, which comprises 5,000 synthetic records with pre-determined duplicates, thereby serving as labeled training and validation data. For each candidate record pair, a vector of similarities is calculated by applying stable measures such as the Jaro-Winkler (JW) and Levenshtein distances. This feature vector is then used to train a deep learning-based classifier, TabNet, which learns to distinguish duplicate pairs (matches) from non-duplicate pairs. The trained model can then predict whether new, unseen record pairs are matches or not.
Several inherent challenges of this supervised learning task are directly addressed by our methodology: i) Noisy data: real-world data is notorious for entry errors, format variations, and missing values. Our preprocessing stage of uppercase conversion, removal of irrelevant symbols, and numeric field cleaning guarantees correct results. Furthermore, the JW distance was chosen specifically so that common typographical errors could be tolerated, making the method robust to noise. ii) Class imbalance: class imbalance is famously a critical issue in record linkage, as was the case in our earlier work where this problem caused very low precision. Our approach utilizes deep learning models that perform well on imbalanced tabular data, as is evident from the high precision and recall values achieved. iii) Domain adaptation: the fact that the synthetic data might not capture the complexity of real data is cited as a primary limitation. Therefore, as a key future exercise, the generalizability of our model will be evaluated on a large, real-world dataset, the North Carolina voter registration (NCVR) data, so that it can be validated to generalize well to a number of production environments.
Artificial intelligence (AI) models, and deep learning techniques in particular, are better equipped to handle the ambiguity inherent in real-world data, thereby outperforming their classical rule-based counterparts. While classical rule-based techniques are traditionally described as stiff, our AI technique offers greater flexibility and better performance in handling noisy or missing data. Instead of applying fixed, binary rules, the model is trained from a rich similarity vector. The similarity level is quantified in terms of similarity scores, calculated from the JW and Levenshtein distances, such that variations, typos, and other forms of error can be managed by the model. The TabNet model then learns the complex interactions among the diverse scores so that it can make a probabilistic decision, a much more sophisticated task than what could be accomplished by a set of rules. This ability to dynamically weight evidence and to identify the most relevant features for any prediction, as shown in our interpretability analysis, is why uncertainty is best dealt with by AI.
As previously mentioned, the process is composed of four main steps. First, the data is preprocessed to clean and normalize it [3]. Next, a hybrid blocking method is employed to reduce the number of comparisons by dividing the data into smaller, more manageable blocks. Two techniques, sorted neighborhood and standard blocking, are used to create an index of candidate record pairs.
These pairs are subsequently compared using similarity measures such as the Levenshtein distance and the JW distance. The resulting similarity scores are then fed into a classification model. To evaluate performance in terms of execution time and accuracy, TabNet and a DNN model were utilized, aiming to determine the best trade-off between speed and precision [7], [16]. The overall process is shown in Figure 2.

Figure 2. Proposed record linkage process
3.1. Training and validation dataset
The training and validation dataset used for this study is FEBRL 2, which consists of fictitious records simulating personal information typically found in structured databases. It contains 5,000 rows, including 4,000 original records and 1,000 duplicate records. The dataset comprises six columns, each representing a specific attribute related to individuals, such as first name, last name, address, and other personal details. This dataset is employed to test and validate the proposed record linkage method by replicating real world conditions encountered in large scale databases.

The main columns include both textual and numerical information, as illustrated in Table 3. These columns represent the types of data commonly found in administrative or commercial databases and present typical challenges such as input errors, missing data, and format inconsistencies. In this context, data preprocessing was essential to normalize certain columns and address inconsistencies, as detailed in the next section. This step is critical for improving the quality of record matches. Table 3 provides a detailed description of each column in the dataset, along with concrete examples and remarks regarding the specific characteristics of each field.
Table 3. Description and specific features of the dataset used
Column name | Description | Data type | Example
given name | First name | Text | SARAH
surname | Surname | Text | BRUHN
address 1 | First line of address | Text | FORBES STREET
state | State | Text | VIC
date of birth | Date of birth (format YYYYMMDD) | Numeric | 19300213
soc sec id | Unique social security number | Numeric | 7535316
3.2. Experimental dataset
Three datasets were generated from the training dataset to experiment with and test the proposed method, as well as to evaluate the performance of the models and the execution time of each prediction. The execution time is considered an important criterion in this study, as the objective is to identify a method that reduces the time required for record comparison and duplicate prediction. Larger datasets were created to assess the models' performance at a larger scale. As shown in Table 4, the first dataset consists of 13,000 records, with 10,000 original records and 3,000 duplicates. The second dataset contains 16,000 records, including 12,000 original records and 4,000 duplicates. Finally, the third dataset includes 21,000 records, comprising 16,000 original records and 5,000 duplicates.
Table 4. Overview of datasets used for experimental model
Records | Dataset 1 | Dataset 2 | Dataset 3
Total records | 13,000 | 16,000 | 21,000
Original records | 10,000 | 12,000 | 16,000
Duplicate records | 3,000 | 4,000 | 5,000
3.3. Data preprocessing
To enhance data quality and facilitate comparisons during the record linkage process, several transformations were applied. First, columns containing textual data such as first names, surnames, and addresses were converted to uppercase to ensure consistency in information representation, regardless of variations in case. Next, irrelevant symbols and characters, particularly in address fields, were removed to refine matches and reduce potential inconsistencies [17]. For numeric fields, particularly zip codes, non-conforming (non-numeric) values were identified and removed to improve the accuracy of comparisons. Additionally, other specific columns underwent tailored cleaning operations, such as the standardization of abbreviations and the correction of typographical errors. These preprocessing steps are crucial for ensuring reliable results during the matching phase [18].
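As a minimal sketch of these cleaning steps, assuming a pandas DataFrame with FEBRL-style column names (the exact field list is illustrative, not taken from the paper), the preprocessing could look as follows:

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleaning: uppercase text fields, strip stray symbols, coerce numeric fields."""
    df = df.copy()
    text_cols = ["given_name", "surname", "address_1", "state"]        # assumed column names
    for col in text_cols:
        df[col] = (df[col].astype(str)
                          .str.upper()                                  # uppercase conversion
                          .str.replace(r"[^A-Z0-9 ]", " ", regex=True)  # drop irrelevant symbols
                          .str.strip())
    for col in ["date_of_birth", "soc_sec_id"]:                         # numeric identifier fields
        # non-conforming (non-numeric) values become NaN and can then be removed
        df[col] = pd.to_numeric(df[col], errors="coerce")
    return df
```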
3.4. Indexing
Indexing is a critical step in the record matching process, designed to reduce the number of record pairs to be compared while maintaining a high level of accuracy in match detection [4]. Given the size of the dataset used in this study (5,000 records), the total number of potential comparisons without indexing would be extremely large, potentially reaching several million pairs. To address this challenge, a hybrid blocking approach was adopted, combining two methods: standard blocking and the sorted neighborhood method. This combination significantly enhances the efficiency of the process, reducing the number of pairs to be compared while effectively identifying relevant matches.
3.4.1. Standard blocking
The first method employed is standard blocking [19], which involves dividing records into blocks based on one or more columns. For this study, records were blocked using the state column. This approach results in records being grouped by state, thereby restricting comparisons to within each block. While this technique effectively reduces the number of comparisons, limitations arise when dealing with missing or incorrect state values.
3.4.2. Sorted neighborhood
To address these limitations, standard blocking was combined with the sorted neighborhood method. This technique involves sorting records based on a sort key and then comparing each record with its neighbors within a fixed size window [20]. By using the surname as the sort key, this method captures matches that may not be grouped together in standard blocking due to minor variations or errors in the blocking field. The sliding window approach allows comparisons to be made only between neighboring records, significantly reducing the number of pairs to be compared. Figure 3 illustrates this process, where pairs of records (Record a and Record b) are compared after sorting. The lines represent potential matches between neighboring records, showing how the sorted neighborhood method limits the comparisons while capturing relevant matches.
Figure 3. Sorted neighborhood algorithm to index record pairs
3.5. Hybrid blocking
The combination of standard state-based blocking and the sorted neighborhood algorithm forms a robust hybrid blocking approach. First, standard state-based blocking reduces the number of pairs to be compared by excluding records that are geographically too distant. Second, the sorted neighborhood algorithm refines this process by performing comparisons between records sorted based on their surname, thereby capturing matches that might have been missed by standard blocking alone [21]–[23]. As illustrated in Figure 4, these two methods work together to improve the efficiency and effectiveness of record matching.
Figure 4. Hybrid blocking method used
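A minimal sketch of this hybrid indexing step, assuming the Python recordlinkage library and an illustrative window size (the paper does not state the exact parameters), is given below. Because Table 6 reports fewer hybrid pairs than either strategy alone, the sketch keeps only the candidate pairs produced by both strategies:

```python
import pandas as pd
from recordlinkage.index import Block, SortedNeighbourhood

def hybrid_candidate_pairs(df: pd.DataFrame) -> pd.MultiIndex:
    """Illustrative hybrid blocking: standard blocking on 'state' combined with
    a sorted neighbourhood on 'surname' (window size is an assumption)."""
    block_pairs = Block("state").index(df)                           # standard blocking
    sn_pairs = SortedNeighbourhood("surname", window=5).index(df)    # sorted neighbourhood
    # Keep only pairs retained by both strategies, mirroring the reduction
    # ratios reported in Table 6.
    return block_pairs.intersection(sn_pairs)
```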
The hybrid approach offers several advantages. The complexity of comparisons is significantly reduced while maintaining a high degree of accuracy in matching records. It effectively handles minor variations in textual data, input errors, and missing values in specific columns. Following this step, an index of over 2.8 million record pairs is generated, which will be compared and classified as either matched or unmatched pairs. Table 5 presents the number of record pairs for each dataset, highlighting that as the number of records in a dataset increases, the number of record pairs for comparison also grows.
Table 5. Number of record pairs for each dataset
Dataset | Train/validation | Exp. dataset 1 | Exp. dataset 2 | Exp. dataset 3
Total records | 5,000 | 13,000 | 16,000 | 21,000
Pairwise indexes | 28,700 | 2,900,000 | 4,200,000 | 7,300,000
Table 6 compares different blocking strategies in terms of how well they reduce the number of candidate pairs for record linkage. While a full index results in nearly 12.5 million pairs (0% reduction), the best performing method is the proposed hybrid blocking approach. It minimizes the number of comparisons to just 28,702 pairs, achieving a reduction ratio (RR) of 99.77%. This emphasizes the central contribution of the hybrid strategy to improving the computational efficiency of the record linkage pipeline.
Table 6. Comparison of blocking strategies by number of candidate pairs and RR
Method | Number of pairs | RR (%)
Full index | 12,497,500 | 0.00
Blocking (state) | 2,768,103 | 77.85
SortedNeighbour (surname) | 75,034 | 99.40
Hybrid blocking | 28,702 | 99.77
3.6. Comparison phase
In the comparison phase, similarity measures are applied to assess the correspondence between record pairs. Two well established methods, the JW and Levenshtein distances, have been selected for this task. Each comparison produces a similarity score ranging from 0 to 1, reflecting the degree of correspondence between field values. These scores are then aggregated into a similarity vector, which summarizes the overall similarity between the two records. This similarity vector serves as the foundation for the subsequent classification phase, where it is determined whether the records represent the same entity.
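As an illustration of this step, the recordlinkage Compare API can produce one such similarity vector per candidate pair; the per-field choice of measure follows sections 3.6.1 and 3.6.2, while the column names are assumptions:

```python
import recordlinkage

def similarity_vectors(pairs, df):
    """Compute a normalized similarity score per field for every candidate pair."""
    comp = recordlinkage.Compare()
    for col in ["given_name", "surname", "address_1", "state"]:       # short textual fields
        comp.string(col, col, method="jarowinkler", label=f"{col}_score")
    for col in ["date_of_birth", "soc_sec_id"]:                       # identifier fields, compared as strings
        comp.string(col, col, method="levenshtein", label=f"{col}_score")
    # One row per candidate pair, one column per score in [0, 1]
    return comp.compute(pairs, df)
```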
3.6.1. Jaro-Winkler distance
The JW distance metric is especially effective for short strings, such as names. In this study, it was applied to several fields, including given name, surname, address 1, and state.
By taking into account both character matches and their order, the JW distance is tolerant of common typographical errors, making it particularly suitable for record matching tasks [24]–[26]. The JW measure improves the Jaro distance J by adding a prefix scale:

JW = J + (l × p × (1 − J))    (1)

In this equation, l is the length of the common prefix (up to 4 characters), and p is a scaling factor, usually 0.1. The adjustment favors strings that match from the beginning.
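For illustration, a small self-contained implementation of the Jaro similarity and the Winkler prefix adjustment of (1) is sketched below; in practice a library implementation (e.g. the one bundled with recordlinkage) would be used:

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity based on matching characters and transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(len1, len2) // 2 - 1
    match1, match2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):                        # count characters matching within the window
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transpositions, k = 0, 0
    for i in range(len1):                             # count transposed matches
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    t = transpositions / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Equation (1): boost the Jaro score J for strings sharing a common prefix (at most 4 chars)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)
```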
3.6.2. Levenshtein distance
The Levenshtein distance metric, also known as the edit distance, calculates the minimum number of single-character operations (insertion, deletion, or substitution) required to transform one string into another. In the context of this study, the Levenshtein distance was applied to numerical fields such as date of birth and soc sec id. This approach effectively quantifies the differences between records, even when there are variations in data entry, such as errors in date formatting or incorrect postal codes [27]. The recursive formula for computing the Levenshtein distance d(a, b) between two strings a and b is defined as:
d(a, b) = max(len(a), len(b))                                    if min(len(a), len(b)) = 0
          d(a − 1, b − 1)                                        if the last characters of a and b are equal
          1 + min{ d(a − 1, b), d(a, b − 1), d(a − 1, b − 1) }   otherwise                                   (2)
– a and b are the two strings being compared.
– d(a, b) is the minimum number of edit operations needed to convert string a into string b.
– The allowed operations are: i) insertion of a single character, ii) deletion of a single character, and iii) substitution of one character for another.
– len(a) and len(b) denote the lengths of the strings a and b, respectively.
– a − 1 and b − 1 denote the strings a and b with their last character removed.
This algorithm is widely used in approximate string matching and natural language processing tasks, as it provides a quantifiable measure of similarity between two sequences based on their structural differences.
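A compact dynamic programming sketch of equation (2), written iteratively to avoid the exponential cost of the naive recursion, is shown below together with a normalized similarity of the kind used in the comparison phase:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))               # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion from a
                               current[j - 1] + 1,       # insertion into a
                               previous[j - 1] + cost))  # substitution (or match), as in (2)
        previous = current
    return previous[-1]

def levenshtein_similarity(a: str, b: str) -> float:
    """Normalize the edit distance to a similarity score in [0, 1]."""
    longest = max(len(a), len(b)) or 1
    return 1 - levenshtein(a, b) / longest
```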
In Figure 5, each row represents a pair of records, and each column shows the score for the corresponding attribute. For instance, for the first record pair, the scores are 0.466667 for the first name (given name score), 0.455556 for the surname (surname score), and so on. These individual attribute scores are combined to calculate an overall similarity score for each pair of records. The similarity measures are selectively applied to the record pairs generated in the previous step, which uses a hybrid blocking system. This technique reduces the number of comparisons required, optimizing the process while maintaining a high precision rate. As a result, similarity vectors are generated, where each pair of records is associated with a similarity score for each attribute.

Figure 5. Comparison of record pair similarities using JW and Levenshtein distances
High scores indicate a strong similarity between the attribute values, suggesting a probable match between the records. This detailed scoring system offers greater flexibility in the final classification step. While traditional record linkage methods typically apply a global similarity threshold to determine matches, our method classifies record pairs as either matching or non-matching using TabNet, a deep learning model specifically designed for tabular data. This approach was selected to improve both accuracy and execution time.
3.7. Classification models
For our record linkage experiment, a pragmatic hyperparameter search strategy was used for the two classification models, TabNet and DNN. Rather than an exhaustive search, we adopted established parameter values. We trained the TabNet model with a learning rate of 0.02, a maximum of 6 epochs, and a patience of 5 for early stopping. The DNN, on the other hand, was trained with binary cross entropy loss and regularized with dropout and early stopping. The choice of model was based on a balance between performance, as indicated by accuracy and execution time, and computational efficiency. The goal was to identify the model offering the best balance for deployability at scale.
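A minimal sketch of this configuration with the pytorch-tabnet package is given below; the feature matrix X (similarity vectors) and labels y are assumed to come from the comparison phase, and any setting not stated in the text (batch size, architecture widths) is left at the library default:

```python
import numpy as np
from pytorch_tabnet.tab_model import TabNetClassifier

def train_tabnet(X_train: np.ndarray, y_train: np.ndarray,
                 X_valid: np.ndarray, y_valid: np.ndarray) -> TabNetClassifier:
    """Train TabNet on similarity vectors (y = 1 for matched pairs, 0 otherwise)."""
    clf = TabNetClassifier(optimizer_params={"lr": 0.02})   # learning rate reported in the text
    clf.fit(
        X_train, y_train,
        eval_set=[(X_valid, y_valid)],
        max_epochs=6,        # maximum epochs reported in the text
        patience=5,          # early-stopping patience reported in the text
    )
    return clf
```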
3.7.1. TabNet
TabNet is a deep learning model uniquely designed for the effective handling of tabular data. Unlike traditional neural network architectures [28], [29], TabNet employs an innovative approach that integrates attention mechanisms with a hierarchical structure to identify and extract relevant features from the data, see Figure 6. This model has demonstrated significant success in tasks involving structured datasets, due to its ability to focus on the most informative parts of the data while maintaining interpretability.

Figure 6. The TabNet model for record pair classification
The TabNet model begins with the feature transformer, which transforms the input variables into richer representations suitable for prediction tasks. This component consists of four layers: fully connected layers (dense layers) that integrate the variables, batch normalization to stabilize the learning process, and specific activation functions such as gated linear units (GLU) that dynamically select relevant information.
The primary purpose of the feature transformer is to extract complex, non-linear representations of the data, capture interactions between variables, and prepare these representations for the next phase, the attentive transformer module. Before progressing to the attentive transformer, the data is divided using a split mechanism into two parts. The first part produces a partial prediction result, while the second part is forwarded to the attentive transformer, which focuses on selecting relevant features.

The attentive transformer leverages an attention mechanism to identify and emphasize the most important columns at each stage, capturing intricate relationships among them. This approach enables TabNet to dynamically select relevant combinations of columns, improving its efficiency and flexibility in prediction tasks involving structured data. Once the relationships between columns have been identified, the model dynamically selects the relevant columns at each step using a mask. This process is iterated 10 times to generate the final prediction for each record pair, determining whether they are a match or not. Due to its architecture, TabNet has proven to be an effective tool for classification tasks, particularly in the context of structured data.
3.7.2. Deep neural networks
Deep learning models, particularly DNNs, have become increasingly utilized for solving record linkage problems, including tasks such as record pair classification, record normalization, and similarity computation between records [6], [30], [31]. The DNN model used for record pair classification consists of three dense layers: an input layer with 256 nodes employing the ReLU activation function, followed by a dropout layer for regularization; a hidden layer with 128 nodes, also utilizing the ReLU activation function and a dropout layer; and an output layer with a single node using a sigmoid activation function for binary classification (1 for matched records and 0 for unmatched records), as shown in Figure 7.
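A sketch of this baseline architecture in Keras is shown below; the dropout rate and optimizer are not specified in the text and are assumptions:

```python
import tensorflow as tf

def build_dnn(n_features: int) -> tf.keras.Model:
    """256-128-1 fully connected baseline with dropout and a sigmoid output."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_features,)),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dropout(0.3),                    # dropout rate is an assumption
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # 1 = matched, 0 = unmatched
    ])
    model.compile(optimizer="adam",                      # optimizer is an assumption
                  loss="binary_crossentropy",            # loss stated in section 3.7
                  metrics=["accuracy"])
    return model
```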
Table 7 presents a comparison between the TabNet model and the DNN in terms of architecture, scalability, training time, performance, and other relevant factors. The advantages of the TabNet model over the previously employed DNN are clearly highlighted in the table.

Figure 7. The DNN model for record pair classification
Table 7. Comparison between TabNet and DNNs
Criteria | TabNet | DNN
Data type | Optimized for tabular data | Can process various data types (images and text)
Architecture | Uses attention mechanisms and dynamic masks | Composed of fully connected layers
Activation functions | Sparse activation via attention | Non-linear functions such as ReLU, sigmoid
Interpretability | High due to the attention mechanism | Limited due to complex structure
Overfitting prevention | Integrated regularization techniques | Dropout, early stopping, etc.
Scalability | Efficient on large tabular datasets | Requires more resources for large datasets
Training time | Fast due to feature selection via attention | Can be long with deep architectures
Performance | Performs well on imbalanced tabular data | Requires tuning for optimal performance
Typical applications | Classification and regression on tabular data | Computer vision, NLP, and more