Indonesian Journal of Electrical Engineering and Computer Science
Vol. 42, No. 1, April 2026, pp. 71-80
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v42.i1.pp71-80

Integrating blind source separation and self-supervised learning for Algerian Arabic connected-digit recognition

Mourad Reggab, Mohammed Belkhiri
Laboratory of Telecommunications, Signals and Systems, University Amar Telidji of Laghouat (UATL), Laghouat, Algeria

Article history: Received Jan 29, 2026; Revised Feb 16, 2026; Accepted Mar 4, 2026

Keywords: Arabic speech recognition; Blind source separation; Conv-TasNet; DUET; Low-resource ASR; SepFormer; Wav2Vec 2.0

ABSTRACT
This paper proposes an improvement in Arabic automatic speech recognition (ASR) by combining blind source separation (BSS) with self-supervised acoustic modeling. The study concentrates on the Algerian Arabic connected-digit recognition task and reexamines the classical degenerate unmixing estimation technique (DUET) as a front-end approach for suppressing noise and interference. The output of the BSS stage is fed into a hidden Markov model (HMM) recognizer developed using the HTK toolkit. To contextualize DUET's performance, it is compared with modern neural separation techniques (Conv-TasNet, SepFormer) paired with both traditional and self-supervised ASR back-ends (Wav2Vec 2.0 and Whisper). A new corpus of 11,230 utterances from 37 speakers, representing dialectal and gender diversity, was collected. Experimental outcomes indicate that DUET enhances word accuracy under stereo mixing conditions; however, neural separation combined with self-supervised ASR results in considerably lower word-error rates and stronger robustness in noisy or overlapping-speech scenarios. The study emphasizes practical trade-offs between computational cost and accuracy for deploying low-resource Arabic ASR systems.

This is an open access article under the CC BY-SA license.
Corresponding Author:
Mourad Reggab
Laboratory of Telecommunications, Signals and Systems, University Amar Telidji of Laghouat (UATL)
Laghouat, Algeria
Email: m.reggab@lagh-univ.dz

1. INTRODUCTION
Background and motivation: automatic speech recognition (ASR) systems have become essential to human-computer interaction, enabling hands-free control, voice search, and conversational AI [1]. However, in real acoustic environments, speech is rarely captured in isolation: background noise, reverberation, and interfering speakers often corrupt the target signal. This challenge, known as the cocktail-party effect, has long encouraged research in speech source separation, namely the process of isolating one or more speech signals from a mixture of sources. Early solutions used independent component analysis (ICA) and frequency-domain masking, while more recent approaches use deep neural networks such as Conv-TasNet [2] and SepFormer [3] that perform end-to-end time-domain separation. Concurrently, ASR technology has progressed from hidden Markov models (HMMs) and Gaussian mixture models (GMMs) to hybrid DNN-HMMs and fully end-to-end architectures trained on large-scale corpora [4].
Despite these advances, most speech separation and recognition research has focused on high-resource languages, primarily English, Mandarin, and French. For many other languages, including Arabic, limited annotated data, complex morphology, and dialectal variability remain significant obstacles. Arabic is the fifth
Journal homepage: http://ijeecs.iaescore.com
most spoken language worldwide, with over 300 million native speakers, yet its automated processing remains comparatively underdeveloped [5]. The diglossic nature of Arabic, with modern standard Arabic (MSA) used in formal contexts and numerous regional dialects in daily communication, creates substantial pronunciation and lexical gaps between training and target speech [6]. Moreover, publicly available Arabic speech corpora often emphasize broadcast or scripted MSA, offering limited coverage of colloquial forms and noisy acoustic conditions [7].
Within the Arabic dialect continuum, Algerian Arabic introduces additional complexities [8]. It incorporates Classical Arabic roots with Berber and French influences, leading to distinct phonetic shifts, loanwords, and code-switching. Dialectal variation across Algeria's Western, Central, Eastern, and Southern regions is considerable: vowel harmony, consonant emphasis, and word stress differ noticeably by region [9]. These factors hinder the direct reuse of models trained on MSA or other Arabic dialects [7]. Furthermore, practical Algerian speech data are typically recorded in everyday settings (homes, classrooms, or markets) where overlapping speech and environmental noise are common. Hence, a robust ASR system must integrate dialectal modeling with mechanisms to suppress interference and background noise.
Digits represent a well-defined and important subset of spoken language that provides a controlled benchmark for ASR research [10]. Connected-digit tasks (e.g., telephone numbers, prices, dates) offer constrained grammars and limited vocabularies, facilitating systematic evaluation of modeling and preprocessing techniques [11]. Historically, connected-digit recognition has served as a testing ground for algorithms such as dynamic time warping, HMMs, and early deep neural networks.
For Arabic, digit pronunciation varies across dialects; for example, the number "two" may be pronounced "thnin", "tnin", "zoudj", or "zouz" in different regions, making this task challenging [6]. Developing an accurate digit recognizer for Algerian Arabic thus constitutes a meaningful step toward larger-vocabulary systems.
In this context, blind source separation (BSS) presents a powerful preprocessing strategy to improve recognition robustness [12]. BSS techniques aim to recover original source signals from observed mixtures without prior knowledge of the mixing process. Among them, the degenerate unmixing estimation technique (DUET) leverages time-frequency sparsity and inter-channel differences to perform unsupervised separation in stereo recordings. Although computationally lightweight, DUET and similar classical algorithms struggle in highly reverberant or single-channel conditions [13]. Conversely, modern neural separation models achieve superior signal-to-distortion ratios but demand considerable training data and computational resources [2], [3].
This study investigates how such separation methods can improve Algerian Arabic connected-digit recognition, extending our previous work [14], which focused solely on DUET combined with classical HMM-based ASR. We first revisit DUET as a low-cost stereo front-end for an HMM-based recognizer and then compare it against state-of-the-art neural separators, specifically Conv-TasNet and SepFormer, in combination with both conventional and self-supervised ASR back-ends (HTK, Wav2Vec 2.0 [15], and Whisper [4]). To support this investigation, we built a dedicated Algerian Arabic digit corpus comprising 11,230 utterances from 37 speakers of diverse dialectal backgrounds.
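The DUET mechanism described above can be illustrated with a toy sketch (not the paper's implementation). Under the sparsity assumption, each time-frequency bin is dominated by one source, so its inter-channel amplitude ratio and phase-derived delay cluster around that source's mixing parameters; the synthetic mixture and the (gain, delay) values below are hypothetical:

```python
import cmath
import math

# Toy DUET-style separation sketch (illustrative only, not the authors' code).
# Assumption: each time-frequency bin is dominated by a single source, and
# channel 2 differs from channel 1 by a per-source gain and delay.

def duet_features(x1, x2, freq):
    """Inter-channel amplitude ratio and delay estimate for one TF bin."""
    r = x2 / x1
    alpha = abs(r)                                  # relative attenuation
    delta = -cmath.phase(r) / (2 * math.pi * freq)  # relative delay (s)
    return alpha, delta

# Synthetic stereo "STFT": source A (gain 0.6, delay 1e-4 s) occupies
# bins 0-2, source B (gain 1.4, delay -2e-4 s) occupies bins 3-5.
freqs = [100.0, 200.0, 300.0, 400.0, 500.0, 600.0]
params = [(0.6, 1e-4)] * 3 + [(1.4, -2e-4)] * 3
ch1 = [complex(1.0, 0.5)] * 6
ch2 = [a * x * cmath.exp(-2j * math.pi * f * d)
       for (a, d), x, f in zip(params, ch1, freqs)]

feats = [duet_features(x1, x2, f) for x1, x2, f in zip(ch1, ch2, freqs)]

# Assign each bin to the nearer of two cluster centres; here the centres are
# the true (gain, delay) pairs, whereas real DUET finds them as histogram peaks.
centres = [(0.6, 1e-4), (1.4, -2e-4)]
def nearest(feat):
    return min(range(2), key=lambda k: (feat[0] - centres[k][0]) ** 2
               + (1e4 * (feat[1] - centres[k][1])) ** 2)

masks = [nearest(f) for f in feats]
print(masks)  # [0, 0, 0, 1, 1, 1]: bins correctly grouped by source
```

The binary masks obtained this way are applied to one channel's STFT and inverted to recover each source, which is what makes DUET cheap enough for real-time stereo use.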
The goal is to quantify improvements in word-error rate (WER) and noise robustness provided by blind and learned separation, and to identify practical trade-offs between complexity and performance for low-resource ASR deployment in Arabic-speaking environments [16]-[18].

2. RELATED WORK
Research on speech separation and recognition has evolved through several technological stages, beginning with statistical signal processing and advancing toward data-driven neural methods. This section summarizes relevant progress in (a) blind source separation, (b) Arabic and dialectal ASR, (c) connected-digit recognition, and (d) self-supervised learning for speech processing.

2.1. Blind source separation and speech enhancement
Early BSS approaches relied on statistical independence and sparsity assumptions. Independent component analysis (ICA) [19] and non-negative matrix factorization (NMF) [20] were among the first unsupervised algorithms capable of separating multiple speakers from mixed signals. DUET, proposed by Yilmaz and Rickard [12], became a reference method for two-microphone or stereo mixtures, exploiting inter-channel amplitude and phase differences to cluster time-frequency points belonging to distinct sources. DUET is attractive for its simplicity and real-time feasibility but degrades under heavy reverberation or strong spectral overlap [13].
With the advent of deep learning, separation shifted from frequency-domain masking to end-to-end time-domain modeling. Luo and Mesgarani's Conv-TasNet [2] demonstrated that convolutional encoder-decoder
networks can surpass traditional magnitude-masking baselines, achieving near-ideal signal-to-noise ratio improvements on benchmark datasets such as WSJ0-2mix. Subsequent transformer-based architectures, notably SepFormer [3], [21], introduced global self-attention and dual-path processing, further improving separation quality and generalization to unseen speakers. These neural models now represent the state of the art in both single- and multi-channel speech separation and are increasingly used as front-ends for ASR and speaker diarization [17].
Recent advances have focused on integrating separation with recognition objectives. Studies by Bouchakour et al. [22] demonstrate that joint optimization of separation and acoustic modeling can yield significant improvements in noisy conditions. However, most separation research has focused on high-resource languages, leaving low-resource scenarios like Algerian Arabic under-explored [23].

2.2. Arabic and dialectal ASR
Arabic ASR research has followed a slower trajectory than English or Mandarin due to linguistic and data-availability barriers [1]. Classical systems built with HTK or Kaldi employed phoneme-based HMM-GMM models trained on modern standard Arabic (MSA) corpora such as the Arabic Broadcast News, QASR, or MGB-2 datasets [24]. While these models achieve high accuracy on scripted speech, their performance drops sharply on spontaneous or dialectal data because of phonetic and lexical variability [6].
The diglossic nature of Arabic presents unique challenges. As noted by [5], the gap between MSA and regional dialects affects both acoustic and language modeling. North African dialects, particularly Algerian Arabic, exhibit distinctive phonetic characteristics including vowel reduction, consonant assimilation, and extensive code-switching with French and Berber languages [8]. Droua-Hamdani et al.
[7] highlighted the scarcity of resources for the Algerian dialect, with most available corpora focusing on Levantine or Gulf varieties [25].
To address limited resources, several studies have explored transfer learning and multilingual training. The multilingual Wav2Vec 2.0 XLSR-53 and HuBERT models, pre-trained on hundreds of languages, have recently been fine-tuned for Arabic with substantial word-error-rate (WER) reductions [16]. End-to-end transformer architectures such as Whisper [4] also show strong zero-shot performance on Arabic dialects without explicit retraining. Nevertheless, very few works focus specifically on North African dialects, particularly Algerian Arabic, where the phonetic inventory and code-switching patterns differ significantly from MSA, and where background noise and overlapping speakers are common in natural recordings.

2.3. Connected-digit recognition
Connected-digit recognition provides a compact yet informative benchmark for evaluating ASR models and preprocessing methods [26]. Because the grammar and vocabulary are restricted, this task isolates acoustic and phonetic modeling effects from language-model complexity. English connected-digit datasets such as TIDIGITS have historically driven progress in DTW and HMM techniques, later serving to test neural sequence models.
In Arabic, only a few corpora of isolated or connected digits exist, and most target modern standard Arabic. Recent work by Bouchakour et al. [22] demonstrated the effectiveness of attention mechanisms for robust digit recognition in noisy environments. However, dialectal variations in digit pronunciation remain a significant challenge. For instance, the number "two" may be pronounced as ithnayn in MSA, etnin in Levantine dialects, or zoudj in Algerian Arabic, creating recognition ambiguities [6].
The system proposed by Reggab and Belkhiri [14] was among the first to construct an Algerian Arabic digits database and to employ DUET as a denoising stage for an HTK-based recognizer. However, that study predated current neural separation and self-supervised paradigms and did not explore integration with modern ASR back-ends.

2.4. Self-supervised learning
Self-supervised learning has revolutionized speech processing by enabling models to learn powerful representations from unlabeled data [15]. The wav2vec 2.0 framework introduced a contrastive learning objective that masks portions of the audio input and learns to reconstruct the latent representations. This approach has shown remarkable success across multiple languages and tasks, with the XLSR-53 model demonstrating strong cross-lingual transfer capabilities [16].
The HuBERT model [27] extended this paradigm by using clustered representations as training targets, achieving state-of-the-art performance on several benchmarks. More recently, Whisper [4] demonstrated that
large-scale weak supervision using audio-transcript pairs from the web can yield models with robust zero-shot capabilities across diverse languages and acoustic conditions.
For low-resource scenarios, Chen et al. [17] showed that self-supervised representations can significantly reduce the amount of labeled data required for effective fine-tuning. However, applying these techniques to dialectal Arabic, particularly in combination with speech separation front-ends, remains underexplored.

2.5. Research gap and contributions
In summary, prior work established the feasibility of BSS-enhanced ASR and produced initial benchmarks for Arabic, but integration of modern neural separation and self-supervised models for Algerian Arabic remains largely unexplored. While several studies have addressed Arabic ASR [1], [5] and dialectal processing [6], [7], few have specifically targeted the Algerian variant or explored the synergy between separation and self-supervised learning in low-resource settings. The present study fills this gap by:
- Comparing classical DUET and contemporary neural front-ends within a unified evaluation framework.
- Investigating the combination of separation techniques with self-supervised ASR back-ends for Algerian Arabic.
- Releasing a dedicated Algerian Arabic digits corpus to support future research.
- Analyzing practical trade-offs between computational cost and recognition accuracy for low-resource deployment.
This comprehensive evaluation offers insights that are particularly relevant for resource-constrained environments where computational efficiency must be balanced against recognition performance.

3. METHOD
This section describes the overall system architecture, including corpus development, preprocessing, BSS front-ends, ASR back-ends, and evaluation protocols. The proposed processing pipeline is illustrated in Figure 1.

Figure 1.
Processing pipeline: stereo mixture → BSS front-end (DUET / Conv-TasNet / SepFormer) → ASR back-end (feature extraction, HMM / Wav2Vec 2.0 / Whisper) → decoding → recognized text
3.1. Corpus design and data preparation
3.1.1. Speech collection
A dedicated Algerian Arabic connected-digit corpus was developed to address the absence of publicly available data for this dialect. Recordings were collected from 37 native speakers (17 male, 20 female) representing diverse regional accents. Each participant read randomly generated digit sequences of one to nine digits, covering both simple and compound numerical expressions (e.g., "sabEa w thlathin" for "thirty-seven").
Recordings were made in office and quiet home environments using two identical condenser microphones spaced 15 cm apart, enabling stereo processing. Speech was captured at 16 kHz, 16-bit resolution. The dataset was partitioned by speaker into 80% for training, 10% for validation, and 10% for testing. After quality control and trimming, the total duration reached approximately 9 hours.

3.1.2. Lexicon and grammar
A pronunciation lexicon was constructed to capture dialectal variability, incorporating common variants for each digit (e.g., /thnin/, /tnin/, /zoudj/, /zouz/ for "two"). Phonetic transcriptions followed an Algerian Arabic adaptation of the International Phonetic Alphabet (IPA). A context-free grammar was written in HTK's word network format to model valid connected-digit sequences with optional conjunctions such as /u/ ("and"). This grammar supported both training and decoding to ensure linguistic consistency and realistic digit combinations.

3.1.3. Feature extraction
For the HMM-GMM baseline, acoustic features were computed as 39-dimensional mel-frequency cepstral coefficients (MFCCs): 13 static coefficients augmented with their first- and second-order derivatives (Δ and ΔΔ). A 25 ms Hamming window with a 10 ms frame shift was used. Cepstral mean and variance normalization were applied on a per-utterance basis to reduce channel and speaker variability.
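The framing and windowing step of this feature pipeline can be sketched as follows (an illustrative pure-Python sketch, not the HTK implementation; the pre-emphasis, mel filterbank, log, and DCT stages that complete MFCC extraction are omitted):

```python
import math

# Framing + Hamming windowing for MFCC extraction, at the paper's settings:
# 16 kHz sampling, 25 ms window, 10 ms shift (sketch only).
SAMPLE_RATE = 16000
FRAME_LEN = int(0.025 * SAMPLE_RATE)    # 25 ms -> 400 samples
FRAME_SHIFT = int(0.010 * SAMPLE_RATE)  # 10 ms -> 160 samples

def hamming(n):
    """Hamming window coefficients of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def frame_signal(signal, frame_len=FRAME_LEN, shift=FRAME_SHIFT):
    """Split a waveform into overlapping, windowed frames."""
    win = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, shift):
        frames.append([signal[start + i] * win[i] for i in range(frame_len)])
    return frames

# One second of a 440 Hz tone yields (16000 - 400) // 160 + 1 = 98 frames.
tone = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE)]
frames = frame_signal(tone)
print(len(frames), len(frames[0]))  # 98 400
```

Each 400-sample frame would then pass through the FFT, mel filterbank, and DCT to produce the 13 static cepstra, with Δ and ΔΔ appended to reach 39 dimensions.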
These MFCC features served as the input to the HTK-based recognizer. For the self-supervised models (Wav2Vec 2.0 and Whisper), the separated raw audio waveforms were used directly as input, leveraging the models' internal feature extraction layers.

3.2. Blind source separation front-ends
Three front-end separation approaches were evaluated:
- DUET: the degenerate unmixing estimation technique exploits inter-channel amplitude and phase differences to perform unsupervised separation of stereo mixtures. DUET assumes sparsity in the time-frequency domain and provides efficient real-time separation, but it is sensitive to reverberation and heavy overlap.
- Conv-TasNet: a fully convolutional time-domain separation model consisting of an encoder-decoder structure and stacked temporal convolutional blocks. The SpeechBrain pretrained model trained on WSJ0-2mix was used without further adaptation.
- SepFormer: a transformer-based dual-path network leveraging self-attention to capture both local and global dependencies. It provides state-of-the-art performance on multi-speaker mixtures. The SpeechBrain pretrained model was used for inference on our data.
Separation performance was evaluated using scale-invariant signal-to-noise ratio improvement (SI-SNRi) and signal-to-distortion ratio improvement (SDRi). For the subsequent stage, the separated waveforms were re-encoded into MFCCs for HTK-based ASR or fed as raw audio to the neural ASR models.

3.3. ASR back-ends
Two classes of recognizers were tested.
3.3.1. HMM-GMM baseline (HTK)
A classical left-to-right 3-state Bakis topology was used to model context-dependent triphones. Each state was represented by an 8-component Gaussian mixture. State tying was performed via decision-tree clustering. Models were trained using five iterations of Baum-Welch re-estimation. Recognition used Viterbi decoding constrained by the connected-digit grammar.
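The Viterbi decoding step can be illustrated with a minimal sketch over a toy two-state HMM. The states, observations, and probabilities below are hypothetical and chosen for readability; the paper's actual models are 3-state triphones with 8-component Gaussian mixtures decoded by HTK:

```python
# Minimal Viterbi decoding sketch (illustrative; not the HTK decoder).
# Finds the most likely hidden-state path for a discrete observation sequence.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the best state sequence for obs."""
    # Layer 0: initial probability times emission of the first observation.
    v = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        layer = {}
        for s in states:
            # Best predecessor maximizes previous score * transition * emission.
            prob, path = max(
                (v[-1][prev][0] * trans_p[prev][s] * emit_p[s][o],
                 v[-1][prev][1] + [s])
                for prev in states)
            layer[s] = (prob, path)
        v.append(layer)
    return max(v[-1].values())

states = ("S1", "S2")
start_p = {"S1": 0.8, "S2": 0.2}
trans_p = {"S1": {"S1": 0.6, "S2": 0.4}, "S2": {"S1": 0.1, "S2": 0.9}}
emit_p = {"S1": {"a": 0.7, "b": 0.3}, "S2": {"a": 0.2, "b": 0.8}}

prob, path = viterbi(["a", "a", "b"], states, start_p, trans_p, emit_p)
print(path)  # ['S1', 'S1', 'S2']
```

In the connected-digit system, the same dynamic-programming search runs over the grammar-constrained network of digit HMMs, with Gaussian-mixture likelihoods in place of the discrete emission table.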
3.3.2. Self-supervised and end-to-end models
Two self-supervised encoders were evaluated:
- Wav2Vec 2.0 (XLSR-53): a multilingual model pre-trained on 53 languages using a masked prediction objective. Fine-tuning was performed for 15 epochs on our labeled training set with a connectionist temporal classification (CTC) loss. Optimization used AdamW with a 1 × 10⁻⁴ learning rate and a batch size of 8.
- Whisper (Small): an end-to-end transformer trained on 680K hours of multilingual data. We evaluated both zero-shot inference and light fine-tuning on our dataset using the Whisper toolkit.

3.4. Evaluation metrics
Recognition performance was measured using the word error rate (WER):

WER = (S + D + I) / N × 100,   (1)

where S, D, and I denote the number of substitution, deletion, and insertion errors, and N is the total number of reference words. All experiments were repeated three times with different random seeds, and mean values were reported. SI-SNRi and SDRi were used to evaluate separation quality, while real-time factors (RTF) were computed to estimate computational feasibility on CPU and GPU hardware.

3.5. Experimental configuration
All experiments were conducted on a workstation equipped with an Intel Core i7-12700 CPU (3.6 GHz), 64 GB RAM, and an NVIDIA RTX A6000 GPU with 48 GB memory. Model training and inference were implemented in Python 3.10 using the PyTorch 2.1 framework and the SpeechBrain and Transformers libraries. Feature extraction, forced alignment, and HMM training utilized the HTK 3.4 toolkit, while waveform-level signal processing (STFT, DUET, and SNR computation) was implemented in MATLAB 2022b. For the neural front-ends, pretrained Conv-TasNet and SepFormer checkpoints from SpeechBrain were used without further fine-tuning.
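The WER of (1) is obtained from a minimum-edit-distance alignment between hypothesis and reference; the S, D, and I counts are whatever alignment minimizes the total. A minimal sketch (not the HTK HResults tool, and the digit words in the example are only illustrative spellings):

```python
# WER via dynamic-programming edit distance (sketch; scoring tools such as
# HTK's HResults compute the same S + D + I alignment internally).

def wer(reference, hypothesis):
    """Word error rate in percent, per WER = (S + D + I) / N * 100."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[-1][-1] / len(ref)

# One substitution and one deletion over 4 reference words -> 50% WER.
print(wer("wahed zoudj tlata arbaa", "wahed tnin tlata"))  # 50.0
```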
The Wav2Vec 2.0 model was fine-tuned for 15 epochs with a batch size of 8, using a linear learning-rate warm-up over the first 10% of updates and early stopping on validation loss. The Whisper-small model was evaluated both in zero-shot mode and after two epochs of fine-tuning with a learning rate of 5 × 10⁻⁵. During evaluation, inference was performed with a beam width of 5 for all decoders to maintain a consistent decoding strategy across models. All results reported in this work correspond to averages over three independent runs with different random seeds to ensure statistical robustness.

4. RESULTS AND DISCUSSION
4.1. Separation performance
Table 1 reports mean SI-SNRi and SDRi over the test mixtures. The neural separators outperform DUET. Figure 2 shows that SepFormer performs best, Conv-TasNet is intermediate, and DUET forms the baseline, with SI-SNRi and SDRi highly correlated.

Table 1. Separation quality on test mixtures
Front-End     SI-SNRi (dB)  SDRi (dB)
DUET          6.4           6.1
Conv-TasNet   12.6          12.5
SepFormer     15.3          15.6
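The SI-SNR metric behind Table 1 can be sketched as follows (an illustrative pure-Python version with hypothetical waveforms; SI-SNRi is this value for the separated output minus its value for the unprocessed mixture):

```python
import math

# Scale-invariant SNR sketch (illustrative; the improvement SI-SNRi reported
# in Table 1 is SI-SNR(separated, ref) - SI-SNR(mixture, ref)).

def si_snr(est, ref):
    """SI-SNR in dB between an estimated and a reference waveform."""
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    s_target = [dot / ref_energy * r for r in ref]     # projection onto ref
    e_noise = [e - t for e, t in zip(est, s_target)]   # residual error
    return 10 * math.log10(sum(t * t for t in s_target) /
                           sum(n * n for n in e_noise))

ref = [1.0, 2.0, 3.0, 4.0]        # hypothetical clean source
est = [1.1, 1.9, 3.2, 3.9]        # hypothetical separator output
score = si_snr(est, ref)

# Scale invariance: rescaling the estimate leaves SI-SNR unchanged, which is
# why the metric ignores arbitrary output gain of the separator.
assert abs(score - si_snr([2 * e for e in est], ref)) < 1e-9
print(round(score, 1))
```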
Figure 2. Separation performance (SI-SNRi and SDRi) for each separation method

4.2. ASR accuracy
As shown in Table 2, DUET provides significant gains for the HMM baseline (23% relative WER reduction) but offers diminishing returns for self-supervised back-ends. This suggests that models like Wav2Vec 2.0 and Whisper already incorporate substantial noise robustness. In contrast, neural separators paired with self-supervised ASR yield the best overall accuracy. These trends are visualized in Figure 3, which shows the WER improvements achievable through different front-end/back-end combinations. The steep initial drop from "None" to "DUET" highlights the substantial benefit of even basic separation, while the improvement flattens out through Conv-TasNet to SepFormer, showing diminishing returns from increasingly complex separation methods.

Table 2. Word error rate (%) for different front-end/back-end combinations
Front-End     HTK   Wav2Vec 2.0  Whisper
None          12.5  7.8          5.4
DUET          9.6   6.1          4.6
Conv-TasNet   7.3   4.8          3.9
SepFormer     6.5   3.9          3.4

Figure 3. WER trends across front-end/back-end combinations

4.3. Discussion of performance and practical trade-offs
The experimental results indicate that BSS improves recognition robustness under noisy and overlapping-speech conditions. Classical DUET provides an efficient front-end solution, achieving a 23% relative WER
reduction while operating in real time on CPU (RTF = 0.3), whereas neural separation methods such as Conv-TasNet and SepFormer yield higher accuracy when combined with self-supervised ASR back-ends. The SepFormer + Wav2Vec 2.0 configuration achieved a WER of 3.9% at 0 dB SNR, demonstrating strong robustness to noise and dialectal variability, although at increased computational cost, requiring GPU acceleration for practical use (RTF = 2.1 on CPU and 0.1 on GPU).
The connected-digit task provides a controlled evaluation framework; however, extension to larger-vocabulary and spontaneous Algerian Arabic speech remains necessary to fully assess scalability. While the nine-hour corpus developed in this study addresses an important resource limitation, further expansion and the inclusion of subjective evaluation measures would provide a more comprehensive assessment. In addition, evaluation with alternative self-supervised architectures such as HuBERT and WavLM, as well as exploration of hybrid or lightweight solutions, may further improve the balance between recognition performance and computational efficiency.
The obtained word error rate (WER) of 3.4% using SepFormer + Whisper represents a notable improvement over previous studies on Algerian Arabic speech recognition. For instance, [25] reported a WER of approximately 14%, while more recent deep learning approaches on North African dialect digits achieved WERs around 8-12% in noisy settings [23]. The integration of neural source separation with self-supervised acoustic modeling thus yields a relative WER reduction of over 50% compared to earlier Algerian Arabic benchmarks, confirming the effectiveness of the proposed pipeline for low-resource dialectal ASR.

5. CONCLUSION
This work investigated the integration of BSS and self-supervised learning for Algerian Arabic connected-digit recognition.
A new nine-hour corpus of 11,230 utterances from 37 speakers was created to evaluate both classical and neural BSS front-ends (DUET, Conv-TasNet, SepFormer) combined with conventional and self-supervised ASR back-ends (HTK, Wav2Vec 2.0, Whisper). The experiments confirmed that BSS substantially improves recognition robustness in noisy and overlapping conditions. DUET provides a lightweight, stereo-based enhancement, but neural separators achieve higher separation quality and recognition accuracy. When paired with Wav2Vec 2.0 or Whisper, they reach state-of-the-art performance, validating the synergy between separation and self-supervised acoustic modeling for low-resource languages.
While DUET remains suitable for real-time embedded systems, SepFormer achieves the best separation metrics (15.6 dB SDRi). However, its WER gains over Conv-TasNet are modest (0.9% absolute for Wav2Vec 2.0), suggesting either ASR back-end saturation or that separation quality beyond ~15 dB offers diminishing returns for digit recognition.
Future work will extend this framework to multi-dialect, larger-vocabulary Algerian Arabic corpora, incorporate subjective listening tests (e.g., MOS scores) to complement objective metrics, explore online separation, and develop efficient neural models for deployment on edge devices. This research contributes toward bridging the performance gap between high- and low-resource speech technologies across Arabic dialects.

ACKNOWLEDGMENTS
The authors thank the University Amar Telidji of Laghouat, Algeria, as well as the Telecommunications, Signals and Systems research laboratory for supporting this research.

FUNDING INFORMATION
Authors state no funding involved.

AUTHOR CONTRIBUTIONS STATEMENT
This journal uses the Contributor Roles Taxonomy (CRediT) to recognize individual author contributions, reduce authorship disputes, and facilitate collaboration.
Name of Author       C  M  So  Va  Fo  I  R  D  O  E  Vi  Su  P  Fu
Mourad Reggab
Mohammed Belkhiri

C: Conceptualization, M: Methodology, So: Software, Va: Validation, Fo: Formal analysis, I: Investigation, R: Resources, D: Data Curation, O: Writing - Original Draft, E: Writing - Review & Editing, Vi: Visualization, Su: Supervision, P: Project administration, Fu: Funding acquisition.

CONFLICT OF INTEREST STATEMENT
Authors state no conflict of interest.

DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding author, M.R., upon reasonable request.

REFERENCES
[1] W. Algihab, N. Alawwad, A. Aldawish, and S. AlHumoud, "Arabic speech recognition with deep learning: a review," in Social Computing and Social Media, G. Meiselwitz, Ed. Cham, Switzerland: Springer, 2019, vol. 11578, pp. 15-31, doi: 10.1007/978-3-030-21902-4_2.
[2] Y. Luo and N. Mesgarani, "Conv-TasNet: surpassing ideal time-frequency magnitude masking for speech separation," IEEE/ACM Trans. Audio Speech Lang. Process., vol. 27, no. 8, pp. 1256-1266, Aug. 2019, doi: 10.1109/TASLP.2019.2915167.
[3] C. Subakan, M. Ravanelli, S. Cornell, M. Bronzi, and J. Zhong, "Attention is all you need in speech separation," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Toronto, ON, Canada, Jun. 2021, pp. 21-25, doi: 10.1109/ICASSP39728.2021.9413901.
[4] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, "Robust speech recognition via large-scale weak supervision," arXiv preprint arXiv:2212.04356, 2022.
[5] F. S. Al-Anzi and D. AbuZeina, "Synopsis on Arabic speech recognition," Ain Shams Engineering Journal, vol. 13, no. 2, p. 101534, 2022, doi: 10.1016/j.asej.2021.06.020.
[6] M. Malathi, S. Senthilkumar, C. H. H. Basha, G. Sundaravadivel, M. Kavitha, and P.
Arunkumar, "Multi-dialect speech recognition using transfer learning and transformer-based architectures: a comprehensive approach to accurate and efficient dialect identification," in 2024 Conference on Renewable Energy Technologies and Modern Communications Systems: Future and Challenges, 2024, pp. 1-6, doi: 10.1109/IEEECONF63577.2024.10880973.
[7] G. Droua-Hamdani, S. Selouani, and M. Boudraa, "Algerian Arabic Speech Database (ALGASD): corpus design and automatic speech recognition application," Arabian Journal for Science and Engineering, vol. 35, no. 2C, pp. 157-166, 2010.
[8] Y. Toughrai, K. Smaïli, and D. Langlois, "ABDUL: a new approach to build language models for dialects using formal language corpora only," in Proc. 1st Workshop Lang. Models Underserved Communities (LM4UC 2025), 2025, pp. 16-21, doi: 10.18653/v1/2025.lm4uc-1.3.
[9] M. A. Menacer, O. Mella, D. Fohr, D. Jouvet, D. Langlois, and K. Smaïli, "Development of the Arabic Loria Automatic Speech Recognition system (ALASR) and its evaluation for Algerian dialect," Procedia Computer Science, vol. 117, pp. 81-88, 2017, doi: 10.1016/j.procs.2017.10.096.
[10] L. R. Rabiner, J. G. Wilpon, and F. K. Soong, "High performance connected digit recognition using hidden Markov models," in ICASSP-88, International Conference on Acoustics, Speech, and Signal Processing, New York, NY, USA, 1988, vol. 1, pp. 119-122, doi: 10.1109/ICASSP.1988.196526.
[11] M. J. Manaileng and M. J. Manamela, "Connected-digits recognition for an under-resourced language using hidden Markov models," in Proceedings ELMAR-2013, Zadar, Croatia, Sep. 2013, pp. 211-214.
[12] O. Yilmaz and S. Rickard, "Blind separation of speech mixtures via time-frequency masking," IEEE Transactions on Signal Processing, vol. 52, no. 7, pp. 1830-1847, Jul. 2004, doi: 10.1109/TSP.2004.828896.
[13] M. I. Mandel, R. J. Weiss, and D. P. W.
Ellis, “Model-based e xpectation-maximization source separation and localization, IEEE T ransactions on Audio, Speech, and Language Processing , v ol. 18, no. 2, pp. 382–394, 2010, doi: 10.1109/T ASL.2009.2029711. [14] M . Re gg ab and M. Belkhiri, “Blind Source Separation technique for Arabic language ASR, T echnical Report , Uni v . Amar T elidji, Laghouat, 2018. [15] A . Bae vs ki, Y . Zhou, A. Mohamed, and M. Auli, “w a v2v ec 2.0: A frame w ork for self-supervised learning of speech representations, arXi v preprint arXi v: 2006.11477 , 2020. [16] A. Conneau, A. Bae vski, R. Collobert, A. Mohamed, and M. Auli, “Unsupervised cross-lingual representation learning for speech recognition, in Proc. Interspeech , Brno, Czechia, pp. 2426–2430, 2021, doi: 10.21437/Interspeech.2021-329. [17] Y . Chen, H. Zhang, X. Y ang, W . Zhang, and D. Qu, “Meta-Adaptable-Adapter: Ef cient adaptation of self-supervised models for lo w-resource speech recognition, Neurocomputing , v ol. 609, no. 1, p. 128493, 2024, doi: 10.1016/j.neucom.2024.128493. [18] O. H. Anidjar , R. Marbel, and R. Y oze vitch, “Whisper T urns Stronger: Augmenting W a v2V ec 2.0 for Superior ASR in Lo w- Resource Languages, arXi v preprint arXi v: 2501.00425 , 2024. Inte gr ating blind sour ce separ ation and self-supervised learning for Alg erian Ar abic ... (Mour ad Re g gab) Evaluation Warning : The document was created with Spire.PDF for Python.
[19] A. Hyvärinen and E. Oja, "Independent component analysis: Algorithms and applications," Neural Networks, vol. 13, no. 4–5, pp. 411–430, 2000, doi: 10.1016/S0893-6080(00)00026-5.
[20] H. Sawada, N. Ono, H. Kameoka, D. Kitamura, and H. Saruwatari, "A review of blind source separation methods: Two converging routes to ILRMA originating from ICA and NMF," APSIPA Transactions on Signal and Information Processing, vol. 8, pp. 1–14, 2019, doi: 10.1017/ATSIP.2019.5.
[21] U.-H. Shin, S. Lee, T. Kim, and H.-M. Park, "Separate and reconstruct: Asymmetric encoder-decoder for speech separation," arXiv preprint arXiv:2406.05983, 2024.
[22] L. Bouchakour, K. Lounnas, and M. Debyeche, "Enhancing robustness of Arabic speech recognition in noisy environments using advanced feature extraction and denoising techniques based on deep learning models," Circuits, Systems, and Signal Processing, 2025, doi: 10.1007/s00034-025-03418-w.
[23] K. Lounnas, M. Abbas, M. Lichouri, M. Hamidi, H. Satori, and H. Teffahi, "Enhancement of spoken digits recognition for under-resourced languages: Case of Algerian and Moroccan dialects," International Journal of Speech Technology, vol. 25, no. 2, pp. 443–455, 2022, doi: 10.1007/s10772-022-09971-y.
[24] H. Mubarak, A. Hussein, S. A. Chowdhury, and A. Ali, "QASR: QCRI Aljazeera Speech Resource, a large-scale annotated Arabic speech corpus," in Proc. 59th Annu. Meeting Assoc. Comput. Linguist. 11th Int. Joint Conf. Nat. Lang. Process. (ACL-IJCNLP), 2021, vol. 1, pp. 2274–2285, doi: 10.18653/v1/2021.acl-long.177.
[25] A. R. Ali, "Multi-dialect Arabic speech recognition," in 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 2020, pp. 1–7, doi: 10.1109/ijcnn48605.2020.9206658.
[26] A. Rahman, Md. M. Kabir, M. F. Mridha, M. Alatiyyah, H. F. Alhasson, and S. S. Alharbi, "Arabic speech recognition: Advancement and challenges," IEEE Access, vol. 12, pp. 39689–39716, 2024, doi: 10.1109/ACCESS.2024.3376237.
[27] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, "HuBERT: Self-supervised speech representation learning by masked prediction of hidden units," IEEE/ACM Trans. Audio Speech Lang. Process., vol. 29, pp. 3451–3460, 2021, doi: 10.1109/TASLP.2021.3122291.

BIOGRAPHIES OF AUTHORS
Mourad Reggab is an Associate Professor at the University of Laghouat, Algeria. An academic with over two decades of experience in higher education, he earned his degree in Electronics Engineering from the University of Boumerdès in 2001, specializing in communication systems, and later completed a Magister degree in Electronic Systems Engineering with a focus on automatic speech recognition. He is a member of the Telecommunications, Signals, and Systems Research Laboratory at Amar Telidji University. His research interests include statistical signal processing, automatic speech recognition, blind source separation, image processing, and artificial intelligence. He can be contacted at email: m.reggab@lagh-univ.dz.
Mohammed Belkhiri received the Engineer degree in Electrical Engineering and Electronics from the Institute of Electrical and Electronics Engineering, University of Boumerdès, Algeria, in 2000. He then earned the Magister degree in Robotics and Automatic Control in 2002 and the Ph.D. degree in Automatic Control in 2008, both from the École Nationale Polytechnique, Algiers, Algeria. From 2003 to 2008, he served as an Assistant Professor at the University of Laghouat, Algeria. He was promoted to Associate Professor in 2008, a position he held until 2016.
Since 2011, he has been affiliated with the Telecommunications, Signals, and Systems Research Laboratory at the University Amar Telidji, Laghouat, Algeria. Since 2016, he has been a Full Professor with the Electrical Engineering Department at the University of Laghouat. His research focuses on nonlinear, adaptive, and intelligent neural network control of electromechanical systems, with applications in power conversion, robotics, and autonomous systems. He can be contacted at email: m.belkheiri@lagh-univ.dz.