Indonesian Journal of Electrical Engineering and Computer Science
Vol. 42, No. 1, April 2026, pp. 71–80
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v42.i1.pp71-80
Integrating blind source separation and self-supervised learning for Algerian Arabic connected-digit recognition

Mourad Reggab, Mohammed Belkhiri
Laboratory of Telecommunications, Signals and Systems, University Amar Telidji of Laghouat (UATL), Laghouat, Algeria
Article Info

Article history:
Received Jan 29, 2026
Revised Feb 16, 2026
Accepted Mar 4, 2026

Keywords:
Arabic speech recognition
Blind source separation
Conv-TasNet
DUET
Low-resource ASR
SepFormer
Wav2Vec 2.0
ABSTRACT

This paper proposes an improvement in Arabic automatic speech recognition (ASR) by combining blind source separation (BSS) with self-supervised acoustic modeling. The study concentrates on the Algerian Arabic connected-digit recognition task and reexamines the classical degenerate unmixing estimation technique (DUET) as a front-end approach for suppressing noise and interference. The output of the BSS stage is fed into a hidden Markov model (HMM) recognizer developed using the HTK toolkit. To contextualize DUET's performance, it is compared with modern neural separation techniques (Conv-TasNet, SepFormer) paired with both traditional and self-supervised ASR back-ends (Wav2Vec 2.0 and Whisper). A new corpus of 11,230 utterances from 37 speakers, representing dialectal and gender diversity, was collected. Experimental outcomes indicate that DUET enhances word accuracy under stereo mixing conditions; however, neural separation combined with self-supervised ASR results in considerably lower word-error rates and stronger robustness in noisy or overlapping-speech scenarios. The study emphasizes practical trade-offs between computational cost and accuracy for deploying low-resource Arabic ASR systems.

This is an open access article under the CC BY-SA license.
Corresponding Author:
Mourad Reggab
Laboratory of Telecommunications, Signals and Systems, University Amar Telidji of Laghouat (UATL)
Laghouat, Algeria
Email: m.reggab@lagh-univ.dz
1. INTRODUCTION
Background and motivation: automatic speech recognition (ASR) systems have become essential to human-computer interaction, enabling hands-free control, voice search, and conversational AI [1]. However, in real acoustic environments, speech is rarely captured in isolation: background noise, reverberation, and interfering speakers often corrupt the target signal. This challenge, known as the cocktail party effect, has long encouraged research in speech source separation, namely the process of isolating one or more speech signals from a mixture of sources. Early solutions used independent component analysis (ICA) and frequency-domain masking, while more recent approaches utilize deep neural networks such as Conv-TasNet [2] and SepFormer [3] that perform end-to-end time-domain separation. Concurrently, ASR technology has progressed from hidden Markov models (HMMs) and Gaussian mixture models (GMMs) to hybrid DNN-HMMs and fully end-to-end architectures trained on large-scale corpora [4].

Despite these advances, most speech separation and recognition research has focused on high-resource languages, primarily English, Mandarin, and French. For many other languages, including Arabic, limited annotated data, complex morphology, and dialectal variability remain significant obstacles.
Arabic is the fifth most spoken language worldwide, with over 300 million native speakers, yet its automated processing remains comparatively underdeveloped [5].
The diglossic nature of modern standard Arabic (MSA) for formal contexts versus numerous regional dialects for daily communication creates substantial pronunciation and lexical gaps between training and target speech [6]. Moreover, publicly available Arabic speech corpora often emphasize broadcast or scripted MSA, offering limited coverage of colloquial forms and noisy acoustic conditions [7].
Within the Arabic dialect continuum, Algerian Arabic introduces additional complexities [8]. It incorporates Classical Arabic roots with Berber and French influences, leading to distinct phonetic shifts, loanwords, and code-switching. Dialectal variation across Algeria's Western, Central, Eastern, and Southern regions is considerable: vowel harmony, consonant emphasis, and word stress differ noticeably by region [9]. These factors hinder the direct reuse of models trained on MSA or other Arabic dialects [7]. Furthermore, practical Algerian speech data are typically recorded in everyday settings (homes, classrooms, or markets) where overlapping speech and environmental noise are common. Hence, a robust ASR system must integrate dialectal modeling with mechanisms to suppress interference and background noise.
Digits represent a well-defined and important subset of spoken language that provides a controlled benchmark for ASR research [10]. Connected-digit tasks (e.g., telephone numbers, prices, dates) offer constrained grammars and limited vocabularies, facilitating systematic evaluation of modeling and preprocessing techniques [11]. Historically, connected-digit recognition has served as a testing ground for algorithms such as dynamic time warping, HMMs, and early deep neural networks. For Arabic, digit pronunciation varies across dialects; for example, the number "two" may be pronounced "thnin", "tnin", "zoudj", or "zouz" in different regions, making this task challenging [6]. Developing an accurate digit recognizer for Algerian Arabic thus constitutes a meaningful step toward larger-vocabulary systems.
In this context, blind source separation (BSS) presents a powerful preprocessing strategy to improve recognition robustness [12]. BSS techniques aim to recover original source signals from observed mixtures without prior knowledge of the mixing process. Among them, the degenerate unmixing estimation technique (DUET) leverages time-frequency sparsity and inter-channel differences to perform unsupervised separation in stereo recordings. Although computationally lightweight, DUET and similar classical algorithms struggle in highly reverberant or single-channel conditions [13]. Conversely, modern neural separation models achieve superior signal-to-distortion ratios but demand considerable training data and computational resources [2], [3].
This study investigates how such separation methods can improve Algerian Arabic connected-digit recognition, extending our previous work [14], which focused solely on DUET combined with classical HMM-based ASR. We first revisit DUET as a low-cost stereo front-end for an HMM-based recognizer and then compare it against state-of-the-art neural separators, specifically Conv-TasNet and SepFormer, in combination with both conventional and self-supervised ASR back-ends (HTK, Wav2Vec 2.0 [15], and Whisper [4]). To support this investigation, we built a dedicated Algerian Arabic digit corpus comprising 11,230 utterances from 37 speakers of diverse dialectal backgrounds. The goal is to quantify improvements in word-error rate (WER) and noise robustness provided by blind and learned separation, and to identify practical trade-offs between complexity and performance for low-resource ASR deployment in Arabic-speaking environments [16]-[18].
2. RELATED WORK
Research on speech separation and recognition has evolved through several technological stages, beginning with statistical signal processing and advancing toward data-driven neural methods. This section summarizes relevant progress in (a) blind source separation, (b) Arabic and dialectal ASR, (c) connected-digit recognition, and (d) self-supervised learning for speech processing.
2.1. Blind source separation and speech enhancement
Early BSS approaches relied on statistical independence and sparsity assumptions. Independent component analysis (ICA) [19] and non-negative matrix factorization (NMF) [20] were among the first unsupervised algorithms capable of separating multiple speakers from mixed signals. The DUET algorithm proposed by Yilmaz and Rickard [12] became a reference method for two-microphone or stereo mixtures, exploiting inter-channel amplitude and phase differences to cluster time-frequency points belonging to distinct sources. DUET is attractive for its simplicity and real-time feasibility but degrades under heavy reverberation or strong spectral overlap [13].
With the advent of deep learning, separation shifted from frequency-domain masking to end-to-end time-domain modeling. Luo and Mesgarani's Conv-TasNet [2] demonstrated that convolutional encoder-decoder networks can surpass traditional magnitude-masking baselines, achieving near-ideal signal-to-noise ratio improvements on benchmark datasets such as WSJ0-2mix. Subsequent transformer-based architectures, notably SepFormer [3], [21], introduced global self-attention and dual-path processing, further improving separation quality and generalization to unseen speakers. These neural models now represent the state of the art in both single- and multi-channel speech separation and are increasingly used as front-ends for ASR and speaker diarization [17].
Recent advances have focused on integrating separation with recognition objectives. Studies by Bouchakour et al. [22] demonstrate that joint optimization of separation and acoustic modeling can yield significant improvements in noisy conditions. However, most separation research has focused on high-resource languages, leaving low-resource scenarios like Algerian Arabic under-explored [23].
2.2. Arabic and dialectal ASR
Arabic ASR research has followed a slower trajectory than for English or Mandarin due to linguistic and data-availability barriers [1]. Classical systems built with HTK or Kaldi employed phoneme-based HMM-GMM models trained on modern standard Arabic (MSA) corpora such as the Arabic Broadcast News, QASR, or MGB-2 datasets [24]. While these models achieve high accuracy on scripted speech, their performance drops sharply on spontaneous or dialectal data because of phonetic and lexical variability [6].

The diglossic nature of Arabic presents unique challenges. As noted by [5], the gap between MSA and regional dialects affects both acoustic and language modeling. North African dialects, particularly Algerian Arabic, exhibit distinctive phonetic characteristics including vowel reduction, consonant assimilation, and extensive code-switching with French and Berber languages [8]. Droua-Hamdani et al. [7] highlighted the scarcity of resources for the Algerian dialect, with most available corpora focusing on Levantine or Gulf varieties [25].

To address limited resources, several studies have explored transfer learning and multilingual training. The multilingual Wav2Vec 2.0 XLSR-53 and HuBERT models pre-trained on large amounts of multilingual data have recently been fine-tuned for Arabic with substantial word-error-rate (WER) reductions [16]. End-to-end transformer architectures such as Whisper [4] also show strong zero-shot performance on Arabic dialects without explicit retraining. Nevertheless, very few works focus specifically on North African dialects, particularly Algerian Arabic, where the phonetic inventory and code-switching patterns differ significantly from MSA, and where background noise and overlapping speakers are common in natural recordings.
2.3. Connected-digit recognition
Connected-digit recognition provides a compact yet informative benchmark for evaluating ASR models and preprocessing methods [26]. Because the grammar and vocabulary are restricted, this task isolates acoustic and phonetic modeling effects from language-model complexity. English connected-digit datasets such as TIDIGITS have historically driven progress in DTW and HMM techniques, later serving to test neural sequence models. In Arabic, only a few corpora of isolated or connected digits exist, and most target modern standard Arabic. Recent work by Bouchakour et al. [22] demonstrated the effectiveness of attention mechanisms for robust digit recognition in noisy environments. However, dialectal variations in digit pronunciation remain a significant challenge. For instance, the number "two" may be pronounced as "ithnayn" in MSA, "etnin" in Levantine dialects, or "zoudj" in Algerian Arabic, creating recognition ambiguities [6].

The system proposed by Reggab and Belkhiri [14] was among the first to construct an Algerian Arabic digits database and to employ DUET as a denoising stage for an HTK-based recognizer. However, that study predated current neural separation and self-supervised paradigms and did not explore integration with modern ASR back-ends.
2.4. Self-supervised learning
Self-supervised learning has revolutionized speech processing by enabling models to learn powerful representations from unlabeled data [15]. The wav2vec 2.0 framework introduced a contrastive learning objective that masks portions of the audio input and learns to reconstruct the latent representations. This approach has shown remarkable success across multiple languages and tasks, with the XLSR-53 model demonstrating strong cross-lingual transfer capabilities [16]. The HuBERT model [27] extended this paradigm by using clustered representations as training targets, achieving state-of-the-art performance on several benchmarks. More recently, Whisper [4] demonstrated that large-scale weak supervision using audio-transcript pairs from the web can yield models with robust zero-shot capabilities across diverse languages and acoustic conditions.

For low-resource scenarios, Chen et al. [17] showed that self-supervised representations can significantly reduce the amount of labeled data required for effective fine-tuning. However, applying these techniques to dialectal Arabic, particularly in combination with speech separation front-ends, remains underexplored.
2.5. Research gap and contributions
In summary, prior work established the feasibility of BSS-enhanced ASR and produced initial benchmarks for Arabic, but integration of modern neural separation and self-supervised models for Algerian Arabic remains largely unexplored. While several studies have addressed Arabic ASR [1], [5] and dialectal processing [6], [7], few have specifically targeted the Algerian variant or explored the synergy between separation and self-supervised learning in low-resource settings. The present study fills this gap by:
− Comparing classical DUET and contemporary neural front-ends within a unified evaluation framework.
− Investigating the combination of separation techniques with self-supervised ASR back-ends for Algerian Arabic.
− Releasing a dedicated Algerian Arabic digits corpus to support future research.
− Analyzing practical trade-offs between computational cost and recognition accuracy for low-resource deployment.
This comprehensive evaluation offers insights that are particularly relevant for resource-constrained environments where computational efficiency must be balanced against recognition performance.
3. METHOD
This section describes the overall system architecture, including corpus development, preprocessing, BSS front-ends, ASR back-ends, and evaluation protocols. The proposed processing pipeline is illustrated in Figure 1.
Figure 1. Processing pipeline: stereo mixture → BSS front-end (DUET / Conv-TasNet / SepFormer) → ASR back-end (feature extraction → HMM / Wav2Vec 2.0 / Whisper) → decoding → recognized text
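Conceptually, the pipeline composes interchangeable separation and recognition stages. The sketch below makes that composition explicit; it uses hypothetical stand-in interfaces, not names from the actual implementation:

    # Conceptual sketch only: front_end and back_end are hypothetical objects
    # standing in for the Figure 1 stages.
    def recognize(stereo_mixture, front_end, back_end):
        # 1) BSS front-end: DUET, Conv-TasNet, or SepFormer
        sources = front_end.separate(stereo_mixture)
        # 2) ASR back-end: HTK consumes MFCCs, while Wav2Vec 2.0 and
        #    Whisper consume raw waveforms directly
        # 3) Decoding yields the recognized digit sequence per source
        return [back_end.decode(src) for src in sources]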
3.1. Corpus design and data preparation
3.1.1. Speech collection
A dedicated Algerian Arabic connected-digit corpus was developed to address the absence of publicly available data for this dialect. Recordings were collected from 37 native speakers (17 male, 20 female) representing diverse regional accents. Each participant read randomly generated digit sequences of one to nine digits, covering both simple and compound numerical expressions (e.g., "sabEa w thlathin" for "thirty-seven"). Recordings were made in office and quiet home environments using two identical condenser microphones spaced 15 cm apart, enabling stereo processing. Speech was captured at 16 kHz with 16-bit resolution. The dataset was partitioned by speaker into 80% for training, 10% for validation, and 10% for testing. After quality control and trimming, the total duration reached approximately 9 hours.
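Because the partition is by speaker rather than by utterance, no test or validation speaker is ever seen in training. A minimal sketch of such a speaker-disjoint split follows; the file-naming convention mapping utterances to speaker IDs is a hypothetical assumption:

    # Sketch of a speaker-disjoint 80/10/10 split; assumes utterance paths
    # like "spk07_utt0123.wav" so the speaker ID can be parsed from the name.
    import random
    from collections import defaultdict

    def split_by_speaker(wav_paths, seed=0):
        by_spk = defaultdict(list)
        for p in wav_paths:
            by_spk[p.split("_")[0]].append(p)   # speaker-ID prefix (assumed)
        speakers = sorted(by_spk)
        random.Random(seed).shuffle(speakers)
        n = len(speakers)
        train_spk = speakers[: int(0.8 * n)]
        valid_spk = speakers[int(0.8 * n): int(0.9 * n)]
        test_spk = speakers[int(0.9 * n):]
        pick = lambda spks: [p for s in spks for p in by_spk[s]]
        return pick(train_spk), pick(valid_spk), pick(test_spk)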
3.1.2. Lexicon and grammar
A pronunciation lexicon was constructed to capture dialectal variability, incorporating common variants for each digit (e.g., /thnin/, /tnin/, /zoudj/, /zouz/ for "two"). Phonetic transcriptions followed an Algerian Arabic adaptation of the International Phonetic Alphabet (IPA). A context-free grammar was written in HTK's word network format to model valid connected-digit sequences with optional conjunctions such as /u/ ("and"). This grammar supported both training and decoding to ensure linguistic consistency and realistic digit combinations.
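As an illustration, such a grammar can be written in HTK's EBNF notation and compiled into a word network with HParse. In the sketch below, the digit word names are illustrative placeholders, not the paper's actual lexicon entries:

    # Sketch: generate an HTK (HParse) grammar for connected digits.
    # Digit word names below are illustrative, not the actual lexicon.
    DIGITS = ["WAHED", "ZOUDJ", "TLATA", "RABAA", "KHAMSA",
              "SETTA", "SABAA", "THMANYA", "TESAA", "SIFR"]

    grammar = (
        "$digit = " + " | ".join(DIGITS) + " ;\n"
        # "< ... >" means one or more repetitions; "[ U ]" is the
        # optional conjunction /u/ ("and") between digits
        "( SENT-START < $digit [ U ] > SENT-END )\n"
    )

    with open("gram.txt", "w") as f:
        f.write(grammar)

    # The word network is then compiled with HTK's HParse tool:
    #   HParse gram.txt wdnet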
3.1.3. Feature extraction
For the HMM-GMM baseline, acoustic features were computed as 39-dimensional mel-frequency cepstral coefficients (MFCCs): 13 static coefficients augmented with their first- and second-order derivatives (Δ and Δ²). A 25 ms Hamming window with a 10 ms frame shift was used. Cepstral mean and variance normalization was applied on a per-utterance basis to reduce channel and speaker variability. These MFCC features served as the input to the HTK-based recognizer. For the self-supervised models (Wav2Vec 2.0 and Whisper), the separated raw audio waveforms were used directly as input, leveraging the models' internal feature extraction layers.
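The features themselves were produced with HTK; purely as an illustration, an equivalent 39-dimensional MFCC front-end with per-utterance CMVN can be sketched with librosa (the file name is hypothetical):

    # Sketch: 39-dim MFCC + deltas with per-utterance CMVN, approximating
    # the HTK setup (25 ms Hamming window, 10 ms shift, 16 kHz audio).
    import numpy as np
    import librosa

    y, sr = librosa.load("utt0001.wav", sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=400, hop_length=160,  # 25 ms / 10 ms
                                window="hamming")
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc),            # Δ
                       librosa.feature.delta(mfcc, order=2)])  # ΔΔ -> 39 x T
    # Per-utterance cepstral mean and variance normalization
    feats = (feats - feats.mean(axis=1, keepdims=True)) \
            / (feats.std(axis=1, keepdims=True) + 1e-8)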
3.2. Blind source separation front-ends
Three front-end separation approaches were evaluated:
− DUET: the degenerate unmixing estimation technique exploits inter-channel amplitude and phase differences to perform unsupervised separation of stereo mixtures. DUET assumes sparsity in the time-frequency domain and provides efficient real-time separation, but it is sensitive to reverberation and heavy overlap; a minimal sketch of its mask-and-cluster idea is given after this list.
− Conv-TasNet: a fully convolutional time-domain separation model consisting of an encoder-decoder structure and stacked temporal convolutional blocks. The SpeechBrain pretrained model trained on WSJ0-2mix was used without further adaptation.
− SepFormer: a transformer-based dual-path network leveraging self-attention to capture both local and global dependencies. It provides state-of-the-art performance on multi-speaker mixtures. The SpeechBrain pretrained model was used for inference on our data.
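The following sketch illustrates the DUET principle under anechoic and W-disjoint-orthogonality assumptions. It is an illustrative reimplementation (the experiments used MATLAB, section 3.5); the STFT size and the k-means clustering of the attenuation-delay features are our own choices:

    # Minimal DUET sketch for a two-channel mixture; not the paper's code.
    import numpy as np
    from scipy.signal import stft, istft
    from sklearn.cluster import KMeans

    def duet_separate(x_left, x_right, n_sources=2, fs=16000, nfft=1024):
        f, t, L = stft(x_left, fs=fs, nperseg=nfft)
        _, _, R = stft(x_right, fs=fs, nperseg=nfft)
        eps = 1e-10
        ratio = (R + eps) / (L + eps)
        # Inter-channel cues per T-F bin: symmetric attenuation and delay
        a = np.abs(ratio)
        alpha = a - 1.0 / a
        omega = 2 * np.pi * np.maximum(f, eps)[:, None]  # crude DC handling
        delta = -np.angle(ratio) / omega
        # Cluster (alpha, delta) features; each cluster ~ one source
        feats = np.stack([alpha.ravel(), delta.ravel()], axis=1)
        labels = KMeans(n_clusters=n_sources, n_init=10).fit_predict(feats)
        labels = labels.reshape(L.shape)
        # Binary T-F masks applied to one channel, then inverse STFT
        sources = []
        for k in range(n_sources):
            mask = (labels == k).astype(float)
            _, s = istft(mask * L, fs=fs, nperseg=nfft)
            sources.append(s)
        return sources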
Separation performance was evaluated using scale-invariant signal-to-noise ratio improvement (SI-SNRi) and signal-to-distortion ratio improvement (SDRi). For the subsequent stage, the separated waveforms were re-encoded into MFCCs for HTK-based ASR or passed as raw audio to the neural ASR back-ends.
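For reference, SI-SNR (whose gain over the unprocessed mixture gives SI-SNRi) follows the standard definition below; the sketch assumes time-aligned 1-D arrays:

    # Sketch of the SI-SNR metric; SI-SNRi is the SI-SNR of the estimate
    # minus the SI-SNR of the unprocessed mixture.
    import numpy as np

    def si_snr(estimate, target, eps=1e-8):
        estimate = estimate - estimate.mean()
        target = target - target.mean()
        # Project the estimate onto the target (scale-invariant signal part)
        s_target = (np.dot(estimate, target) /
                    (np.dot(target, target) + eps)) * target
        e_noise = estimate - s_target
        return 10 * np.log10(np.dot(s_target, s_target) /
                             (np.dot(e_noise, e_noise) + eps))

    def si_snr_improvement(estimate, mixture, target):
        return si_snr(estimate, target) - si_snr(mixture, target)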
3.3. ASR back-ends
Two classes of recognizers were tested:
3.3.1. HMM-GMM baseline (HTK)
A classical left-to-right 3-state Bakis topology was used to model context-dependent triphones. Each state was represented by an 8-component Gaussian mixture. State tying was performed via decision-tree clustering. Models were trained using five iterations of Baum-Welch re-estimation. Recognition used Viterbi decoding constrained by the connected-digit grammar.
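As an illustration of this workflow, the sketch below drives the standard HTK command-line tools from Python. The configuration, list, and model file names are hypothetical placeholders following the usual HTK Book layout, and the triphone/state-tying steps (HHEd) are omitted for brevity:

    # Sketch of the HTK training/decoding loop, assuming HTK 3.4 on PATH.
    import subprocess

    def run(cmd):
        subprocess.run(cmd, check=True)

    # Flat-start initialization of prototype HMMs from the training data
    run(["HCompV", "-C", "config", "-f", "0.01", "-m",
         "-S", "train.scp", "-M", "hmm0", "proto"])

    # Five iterations of embedded Baum-Welch re-estimation
    for i in range(5):
        run(["HERest", "-C", "config", "-I", "train.mlf", "-S", "train.scp",
             "-H", f"hmm{i}/macros", "-H", f"hmm{i}/hmmdefs",
             "-M", f"hmm{i+1}", "phonelist"])

    # Viterbi decoding constrained by the connected-digit word network
    run(["HVite", "-C", "config", "-H", "hmm5/macros", "-H", "hmm5/hmmdefs",
         "-S", "test.scp", "-i", "rec.mlf", "-w", "wdnet",
         "dict", "phonelist"])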
3.3.2. Self-supervised and end-to-end models
Two self-supervised encoders were evaluated:
− Wav2Vec 2.0 (XLSR-53): a multilingual model pre-trained on 53 languages using a masked prediction objective. Fine-tuning was performed for 15 epochs using our labeled training set with a connectionist temporal classification (CTC) loss. Optimization used AdamW with a 1 × 10⁻⁴ learning rate and a batch size of 8 (a minimal fine-tuning sketch follows this list).
− Whisper (Small): an end-to-end transformer trained on 680K hours of multilingual data. We evaluated both zero-shot inference and light fine-tuning on our dataset using the Whisper toolkit.
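A minimal sketch of the CTC fine-tuning setup with the Hugging Face Transformers library is shown below. The data loader and the vocabulary size are hypothetical placeholders, and the exact recipe (section 3.5) additionally used a linear warm-up and early stopping:

    # Sketch, not the exact recipe: CTC fine-tuning of XLSR-53.
    # `train_loader` is assumed to yield padded batches with "input_values"
    # and CTC "labels" (label padding set to -100 by the collator).
    import torch
    from transformers import Wav2Vec2ForCTC

    model = Wav2Vec2ForCTC.from_pretrained(
        "facebook/wav2vec2-large-xlsr-53",
        vocab_size=34,                  # hypothetical digit-lexicon size
        ctc_loss_reduction="mean",
    )
    model.freeze_feature_encoder()      # keep the conv front-end fixed
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)  # lr as in paper

    model.train()
    for epoch in range(15):             # 15 epochs, batch size 8 in loader
        for batch in train_loader:
            out = model(input_values=batch["input_values"],
                        labels=batch["labels"])
            out.loss.backward()         # CTC loss computed internally
            optim.step()
            optim.zero_grad()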
3.4. Evaluation metrics
Recognition performance was measured using word error rate (WER):

$$\mathrm{WER} = \frac{S + D + I}{N} \times 100, \qquad (1)$$

where S, D, and I denote the number of substitution, deletion, and insertion errors, and N is the total number of reference words. All experiments were repeated three times with different random seeds, and mean values were reported. SI-SNRi and SDRi were used to evaluate separation quality, while real-time factors (RTF) were computed to estimate computational feasibility on CPU and GPU hardware.
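Equation (1) is computed from a Levenshtein alignment between the reference and hypothesis word sequences; a compact illustrative implementation:

    # Sketch of WER via word-level Levenshtein alignment (equation (1)).
    def wer(reference: str, hypothesis: str) -> float:
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i                                   # i deletions
        for j in range(len(hyp) + 1):
            d[0][j] = j                                   # j insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = d[i-1][j-1] + (ref[i-1] != hyp[j-1])  # substitution
                d[i][j] = min(sub, d[i-1][j] + 1, d[i][j-1] + 1)
        return 100.0 * d[len(ref)][len(hyp)] / max(len(ref), 1)

    # e.g. wer("wahed zoudj tlata", "wahed zouz tlata") -> 33.33 (1 sub / 3)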
3.5. Experimental configuration
All experiments were conducted on a workstation equipped with an Intel Core i7-12700 CPU (3.6 GHz), 64 GB RAM, and an NVIDIA RTX A6000 GPU with 48 GB memory. Model training and inference were implemented in Python 3.10 using the PyTorch 2.1 framework and the SpeechBrain and Transformers libraries. Feature extraction, forced alignment, and HMM training utilized the HTK 3.4 toolkit, while waveform-level signal processing (STFT, DUET, and SNR computation) was implemented in MATLAB 2022b. For the neural front-ends, pretrained Conv-TasNet and SepFormer checkpoints from SpeechBrain were used without further fine-tuning. The Wav2Vec 2.0 model was fine-tuned for 15 epochs with a batch size of 8, using a linear learning-rate warm-up over the first 10% of updates and early stopping on validation loss. The Whisper-small model was evaluated both in zero-shot mode and after two epochs of fine-tuning with a learning rate of 5 × 10⁻⁵. During evaluation, inference was performed with a beam width of 5 for all decoders to maintain a consistent decoding strategy across models. All results reported in this work correspond to averages over three independent runs with different random seeds to ensure statistical robustness.
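As an illustration of the zero-shot path, a pretrained SepFormer from SpeechBrain can be chained to Whisper-small as sketched below. Package import paths and file names are assumptions (the import path applies to recent SpeechBrain releases), and the public WSJ0-2mix checkpoint operates at 8 kHz, so resampling of the 16 kHz corpus audio is needed but omitted here:

    # Sketch: SpeechBrain SepFormer front-end + zero-shot Whisper back-end.
    import torchaudio
    import whisper
    from speechbrain.inference.separation import SepformerSeparation

    sep = SepformerSeparation.from_hparams(
        source="speechbrain/sepformer-wsj02mix")       # public checkpoint
    est = sep.separate_file(path="mix_utt.wav")        # (1, time, n_src)
    torchaudio.save("sep_src1.wav", est[:, :, 0].detach().cpu(), 8000)

    asr = whisper.load_model("small")                  # Whisper-small
    out = asr.transcribe("sep_src1.wav", language="ar", beam_size=5)
    print(out["text"])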
4. RESULTS AND DISCUSSION
4.1. Separation performance
Table 1 reports mean SI-SNRi and SDRi over the test mixtures. The neural separators outperform DUET. Figure 2 shows that SepFormer performs best, Conv-TasNet is intermediate, and DUET forms the baseline, with SI-SNRi and SDRi highly correlated across methods.
Table 1. Separation quality on test mixtures

Front-end      SI-SNRi (dB)   SDRi (dB)
DUET                6.4           6.1
Conv-TasNet        12.6          12.5
SepFormer          15.3          15.6
Figure 2. Separation performance (SI-SNRi and SDRi improvement, in dB) for DUET, Conv-TasNet, and SepFormer
4.2. ASR accuracy
As shown in Table 2, DUET provides significant gains for the HMM baseline (a 23% relative WER reduction) but offers diminishing returns for the self-supervised back-ends. This suggests that models like Wav2Vec 2.0 and Whisper already incorporate substantial noise robustness. In contrast, neural separators paired with self-supervised ASR yield the best overall accuracy. These trends are visualized in Figure 3, which shows the WER improvements achievable through different front-end/back-end combinations. The steep initial drop from 'None' to 'DUET' highlights the substantial benefit of even basic separation, while the improvement flattens from Conv-TasNet to SepFormer, indicating diminishing returns from increasingly complex separation methods.
Table 2. Word error rate (%) for different front-end/back-end combinations

Front-end      HTK    Wav2Vec 2.0   Whisper
None           12.5       7.8         5.4
DUET            9.6       6.1         4.6
Conv-TasNet     7.3       4.8         3.9
SepFormer       6.5       3.9         3.4
Figure 3. WER trends across front-end/back-end combinations
4.3. Discussion of performance and practical trade-offs
The experimental results indicate that BSS improves recognition robustness under noisy and overlapping-speech conditions. Classical DUET provides an efficient front-end solution, achieving a 23% relative WER reduction for the HMM baseline (Table 2) while operating in real time on CPU (RTF = 0.3), whereas neural separation methods such as Conv-TasNet and SepFormer yield higher accuracy when combined with self-supervised ASR back-ends. The SepFormer + Wav2Vec 2.0 configuration achieved a WER of 3.9% at 0 dB SNR, demonstrating strong robustness to noise and dialectal variability, although at increased computational cost, requiring GPU acceleration for practical use (RTF = 2.1 on CPU and 0.1 on GPU).

The connected-digit task provides a controlled evaluation framework; however, extension to larger-vocabulary and spontaneous Algerian Arabic speech remains necessary to fully assess scalability. While the nine-hour corpus developed in this study addresses an important resource limitation, further expansion and the inclusion of subjective evaluation measures would provide a more comprehensive assessment. In addition, evaluation with alternative self-supervised architectures such as HuBERT and WavLM, as well as exploration of hybrid or lightweight solutions, may further improve the balance between recognition performance and computational efficiency.

The obtained word error rate (WER) of 3.4% using SepFormer + Whisper represents a notable improvement over previous studies on Algerian Arabic speech recognition. For instance, [25] reported a WER of approximately 14%, while more recent deep learning approaches on North African dialect digits achieved WERs around 8-12% in noisy settings [23]. The integration of neural source separation with self-supervised acoustic modeling thus yields a relative WER reduction of over 50% compared to earlier Algerian Arabic benchmarks, confirming the effectiveness of the proposed pipeline for low-resource dialectal ASR.
5. CONCLUSION
This work investigated the integration of BSS and self-supervised learning for Algerian Arabic connected-digit recognition. A new nine-hour corpus of 11,230 utterances from 37 speakers was created to evaluate both classical and neural BSS front-ends (DUET, Conv-TasNet, SepFormer) combined with conventional and self-supervised ASR back-ends (HTK, Wav2Vec 2.0, Whisper). The experiments confirmed that BSS substantially improves recognition robustness in noisy and overlapping conditions. DUET provides a lightweight, stereo-based enhancement, but the neural separators achieve higher separation quality and recognition accuracy. When paired with Wav2Vec 2.0 or Whisper, they reach state-of-the-art performance, validating the synergy between separation and self-supervised acoustic modeling for low-resource languages.

While DUET remains suitable for real-time embedded systems, SepFormer achieves the best separation metrics (15.6 dB SDRi). However, its WER gains over Conv-TasNet are modest (0.9% absolute for Wav2Vec 2.0), suggesting either ASR back-end saturation or that separation quality beyond roughly 15 dB offers diminishing returns for digit recognition.

Future work will extend this framework to multi-dialect, larger-vocabulary Algerian Arabic corpora, incorporate subjective listening tests (e.g., MOS scores) to complement objective metrics, explore online separation, and develop efficient neural models for deployment on edge devices. This research contributes toward bridging the performance gap between high- and low-resource speech technologies across Arabic dialects.
ACKNOWLEDGMENTS
The authors thank the University Amar Telidji of Laghouat, Algeria, as well as the Telecommunications, Signals and Systems research laboratory, for supporting this research.
FUNDING INFORMATION
Authors state no funding involved.
AUTHOR CONTRIBUTIONS STATEMENT
This journal uses the Contributor Roles Taxonomy (CRediT) to recognize individual author contributions, reduce authorship disputes, and facilitate collaboration.

Name of Author       C  M  So  Va  Fo  I  R  D  O  E  Vi  Su  P  Fu
Mourad Reggab        ✓  ✓  ✓   ✓   ✓   ✓  ✓  ✓  ✓  ✓  ✓   ✓
Mohammed Belkhiri    ✓  ✓  ✓   ✓   ✓   ✓  ✓  ✓

C: Conceptualization, M: Methodology, So: Software, Va: Validation, Fo: Formal analysis, I: Investigation, R: Resources, D: Data Curation, O: Writing - Original Draft, E: Writing - Review & Editing, Vi: Visualization, Su: Supervision, P: Project administration, Fu: Funding acquisition
CONFLICT OF INTEREST STATEMENT
Authors state no conflict of interest.
DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding author, M.R., upon reasonable request.
REFERENCES
[1] W. Algihab, N. Alawwad, A. Aldawish, and S. AlHumoud, "Arabic speech recognition with deep learning: A review," in Social Computing and Social Media, G. Meiselwitz, Ed. Cham, Switzerland: Springer, 2019, vol. 11578, pp. 15–31, doi: 10.1007/978-3-030-21902-4_2.
[2] Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, Aug. 2019, doi: 10.1109/TASLP.2019.2915167.
[3] C. Subakan, M. Ravanelli, S. Cornell, M. Bronzi, and J. Zhong, "Attention is all you need in speech separation," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, Jun. 2021, pp. 21–25, doi: 10.1109/ICASSP39728.2021.9413901.
[4] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, "Robust speech recognition via large-scale weak supervision," arXiv preprint arXiv:2212.04356, 2022.
[5] F. S. Al-Anzi and D. AbuZeina, "Synopsis on Arabic speech recognition," Ain Shams Engineering Journal, vol. 13, no. 2, p. 101534, 2022, doi: 10.1016/j.asej.2021.06.020.
[6] M. Malathi, S. Senthilkumar, C. H. H. Basha, G. Sundaravadivel, M. Kavitha, and P. Arunkumar, "Multi-dialect speech recognition using transfer learning and transformer-based architectures: A comprehensive approach to accurate and efficient dialect identification," in 2024 Conference on Renewable Energy Technologies and Modern Communications Systems: Future and Challenges, 2024, pp. 1–6, doi: 10.1109/IEEECONF63577.2024.10880973.
[7] G. Droua-Hamdani, S. Selouani, and M. Boudraa, "Algerian Arabic Speech Database (ALGASD): Corpus design and automatic speech recognition application," Arabian Journal for Science and Engineering, vol. 35, no. 2C, pp. 157–166, 2010.
[8] Y. Toughrai, K. Smaïli, and D. Langlois, "ABDUL: A new approach to build language models for dialects using formal language corpora only," in Proc. 1st Workshop on Language Models for Underserved Communities (LM4UC 2025), 2025, pp. 16–21, doi: 10.18653/v1/2025.lm4uc-1.3.
[9] M. A. Menacer, O. Mella, D. Fohr, D. Jouvet, D. Langlois, and K. Smaïli, "Development of the Arabic Loria Automatic Speech Recognition system (ALASR) and its evaluation for Algerian dialect," Procedia Computer Science, vol. 117, pp. 81–88, 2017, doi: 10.1016/j.procs.2017.10.096.
[10] L. R. Rabiner, J. G. Wilpon, and F. K. Soong, "High performance connected digit recognition using hidden Markov models," in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP-88), New York, NY, USA, 1988, vol. 1, pp. 119–122, doi: 10.1109/ICASSP.1988.196526.
[11] M. J. Manaileng and M. J. Manamela, "Connected-digits recognition for an under-resourced language using hidden Markov models," in Proceedings ELMAR-2013, Zadar, Croatia, Sep. 2013, pp. 211–214.
[12] O. Yilmaz and S. Rickard, "Blind separation of speech mixtures via time-frequency masking," IEEE Transactions on Signal Processing, vol. 52, no. 7, pp. 1830–1847, Jul. 2004, doi: 10.1109/TSP.2004.828896.
[13] M. I. Mandel, R. J. Weiss, and D. P. W. Ellis, "Model-based expectation-maximization source separation and localization," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 2, pp. 382–394, 2010, doi: 10.1109/TASL.2009.2029711.
[14] M. Reggab and M. Belkhiri, "Blind source separation technique for Arabic language ASR," Technical Report, University Amar Telidji, Laghouat, 2018.
[15] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," arXiv preprint arXiv:2006.11477, 2020.
[16] A. Conneau, A. Baevski, R. Collobert, A. Mohamed, and M. Auli, "Unsupervised cross-lingual representation learning for speech recognition," in Proc. Interspeech, Brno, Czechia, 2021, pp. 2426–2430, doi: 10.21437/Interspeech.2021-329.
[17] Y. Chen, H. Zhang, X. Yang, W. Zhang, and D. Qu, "Meta-Adaptable-Adapter: Efficient adaptation of self-supervised models for low-resource speech recognition," Neurocomputing, vol. 609, p. 128493, 2024, doi: 10.1016/j.neucom.2024.128493.
[18] O. H. Anidjar, R. Marbel, and R. Yozevitch, "Whisper turns stronger: Augmenting Wav2Vec 2.0 for superior ASR in low-resource languages," arXiv preprint arXiv:2501.00425, 2024.
[19] A. Hyvärinen and E. Oja, "Independent component analysis: Algorithms and applications," Neural Networks, vol. 13, no. 4–5, pp. 411–430, 2000, doi: 10.1016/S0893-6080(00)00026-5.
[20] H. Sawada, N. Ono, H. Kameoka, D. Kitamura, and H. Saruwatari, "A review of blind source separation methods: Two converging routes to ILRMA originating from ICA and NMF," APSIPA Transactions on Signal and Information Processing, vol. 8, pp. 1–14, 2019, doi: 10.1017/ATSIP.2019.5.
[21] U.-H. Shin, S. Lee, T. Kim, and H.-M. Park, "Separate and reconstruct: Asymmetric encoder-decoder for speech separation," arXiv preprint arXiv:2406.05983, 2024.
[22] L. Bouchakour, K. Lounnas, and M. Debyeche, "Enhancing robustness of Arabic speech recognition in noisy environments using advanced feature extraction and denoising techniques based on deep learning models," Circuits, Systems, and Signal Processing, 2025, doi: 10.1007/s00034-025-03418-w.
[23] K. Lounnas, M. Abbas, M. Lichouri, M. Hamidi, H. Satori, and H. Teffahi, "Enhancement of spoken digits recognition for under-resourced languages: Case of Algerian and Moroccan dialects," International Journal of Speech Technology, vol. 25, no. 2, pp. 443–455, 2022, doi: 10.1007/s10772-022-09971-y.
[24] H. Mubarak, A. Hussein, S. A. Chowdhury, and A. Ali, "QASR: QCRI Aljazeera Speech Resource – A large scale annotated Arabic speech corpus," in Proc. 59th Annual Meeting of the Association for Computational Linguistics and 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP), 2021, vol. 1, pp. 2274–2285, doi: 10.18653/v1/2021.acl-long.177.
[25] A. R. Ali, "Multi-dialect Arabic speech recognition," in 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 2020, pp. 1–7, doi: 10.1109/ijcnn48605.2020.9206658.
[26] A. Rahman, Md. M. Kabir, M. F. Mridha, M. Alatiyyah, H. F. Alhasson, and S. S. Alharbi, "Arabic speech recognition: Advancement and challenges," IEEE Access, vol. 12, pp. 39689–39716, 2024, doi: 10.1109/ACCESS.2024.3376237.
[27] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, "HuBERT: Self-supervised speech representation learning by masked prediction of hidden units," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021, doi: 10.1109/TASLP.2021.3122291.
BIOGRAPHIES OF AUTHORS

Mourad Reggab is an Associate Professor at the University of Laghouat, Algeria. An academic with over two decades of experience in higher education, he earned his degree in Electronics Engineering from the University of Boumerdès in 2001, specializing in communication systems, and later completed a Magister degree in Electronic Systems Engineering with a focus on automatic speech recognition. He is a member of the Telecommunications, Signals, and Systems Research Laboratory at Amar Telidji University. His research interests include statistical signal processing, automatic speech recognition, blind source separation, image processing, and artificial intelligence. He can be contacted via email at: m.reggab@lagh-univ.dz.