Indonesian Journal of Electrical Engineering and Computer Science
Vol. 40, No. 2, November 2025, pp. 640-653
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v40.i2.pp640-653

Laryngeal pathology detection using EMD-based voice acoustic features analysis and SVM-RBF

Sofiane Cherif 1, Abdelhafid Kaddour 1, Abdelmoudjib Benkada 2, Said Karoui 2, Ouissem Chibani Bahi 1, Asmaa Bouzid Daho 1

1 Laboratory of Signals, Systems and Data (LSSD), Department of Electronics, Faculty of Electrical Engineering, University of Sciences and Technology of Oran Mohamed Boudiaf (USTO-MB), Oran, Algeria
2 Laboratory of Intelligent Systems Research (LARESI), Department of Electronics, Faculty of Electrical Engineering, University of Sciences and Technology of Oran Mohamed Boudiaf (USTO-MB), Oran, Algeria

Article history: Received Sep 6, 2024; Revised Jul 22, 2025; Accepted Oct 14, 2025

Keywords: Acoustic features; EMD; Laryngeal pathology; SVM; Voice analysis

ABSTRACT

Traditional techniques for detecting laryngeal pathologies, such as laryngoscopy and endoscopy, are costly and invasive. This study presents a novel approach for detecting laryngeal disorders using empirical mode decomposition (EMD)-based acoustic feature analysis and a support vector machine (SVM) with a radial basis function (RBF) kernel. The experiments were conducted using the Saarbrücken voice database (SVD). The voice signals were decomposed using EMD to extract the intrinsic mode functions (IMFs). The IMF with the highest energy value was selected as the most relevant.
A set of acoustic features, including mel-frequency cepstral coefficients (MFCCs), linear predictive cepstral coefficients (LPCCs), pitch (fundamental frequency), higher-order statistics (HOSs), zero-crossing rate (ZCR), spectral centroid (SC), and spectral roll-off (SRO), is derived from the most relevant IMFs and fed into an SVM classifier to differentiate between healthy and pathological voices. Experimental results demonstrate the effectiveness of the proposed methodology, achieving a high classification accuracy of 94.5%, a sensitivity of 94.2%, a specificity of 95.3%, and an F1 score of 96.1%, outperforming conventional approaches. These results highlight the potential of EMD-based voice analysis as a non-invasive and reliable tool for early diagnosis of laryngeal disorders.

This is an open access article under the CC BY-SA license.

Corresponding Author:
Sofiane Cherif
Laboratory of Signals, Systems and Data (LSSD), Department of Electronics, Faculty of Electrical Engineering
University of Sciences and Technology of Oran Mohamed Boudiaf (USTO-MB)
P.O. Box 1505, El Mnaouar, 31000 Oran, Algeria
Email: sofiane.cherif@univ-usto.dz

1. INTRODUCTION

Speech production is a vital function of the vocal tract system, enabling the creation of speech sounds. Impaired voice production can significantly impact an individual's quality of life. Speech pathologists assess impairments affecting communication, language, and voice [1]. The human voice plays a crucial role in facilitating communication and social interaction. However, improper voice use can lead to various problems. Approximately 25% of the world's population suffers from voice disorders [2], which are often caused by conditions affecting the larynx and vocal cords, known as laryngeal pathologies [3]. Conventional diagnostic techniques, such as stroboscopy and laryngoscopy, are commonly used but can cause patients discomfort.
Journal homepage: http://ijeecs.iaescore.com

Non-invasive methods, such as electroglottography (EGG) and self-assessment, offer alternatives but require specialist expertise for accurate analysis [4], [5].

To address these challenges and enhance the accuracy of voice disorder detection, researchers have developed various models that extract vocal characteristics, such as mel-frequency cepstral coefficients (MFCCs) and linear predictive cepstral coefficients (LPCCs). These models utilize large voice databases, such as the Saarbrücken voice database (SVD), and employ advanced classification techniques, including support vector machines (SVM), Gaussian mixture models (GMM), and universal background model Gaussian mixture models (GMM-UBM). Advances in artificial intelligence and machine learning have significantly improved the efficiency of these classification algorithms, enabling more precise and non-invasive detection of laryngeal pathologies [6].

Various innovative approaches, particularly those leveraging deep learning techniques, have achieved significant advancements in voice disorder detection. Alhussein and Muhammad [7] developed a system for detecting speech disorders using deep learning techniques. They trained their model on the SVD dataset and evaluated it using the Massachusetts eye and ear infirmary voice disorders database (MEEI). The visual geometry group-16 (VGG16) and CaffeNet algorithms achieved 94.5% and 94.1% accuracy rates, respectively. Leveraging deep convolutional neural networks (CNNs) further improved the accuracy to 97.5%. Hammami [8] proposed a technique that utilizes wavelet coefficients to classify vocal disorders. Their analysis was based on sustained vowel recordings of the sound /a/ from the SVD dataset. Through experiments with various GMMs, they found that incorporating the Teager energy operator and using 32 Gaussian mixtures yielded an accuracy of 96.66%.
Conversely, when combining three feature vectors, the accuracy dropped to 92.22%. Fang et al. [9] utilized a large set of features, including 430 basic acoustic features (BAFs), 84 cepstral coefficients based on the mel S-transform (MSCCs), and 12 chaotic features. Feature optimization was conducted using radar charts and the F-score, reducing the feature dimensionality from 526 to 96 dimensions for the NKI-CCRT corpus and 104 dimensions for the SVD corpus. These optimized features were fed into an SVM classifier to detect voice disorders. However, their approach achieved only 84.4% accuracy on the NKI-CCRT database and 78.7% on the SVD database. Al-Dhief et al. [10] suggested a way to extract MFCC features from the SVD database and use them with the online sequential extreme learning machine (OS-ELM) classifier. The approach achieved a maximum accuracy of 91.17%, recall of 91%, F-measure of 87%, G-mean of 87.55%, and specificity of 97.67%. Ribas et al. [11] developed a model based on deep neural networks (DNN) to differentiate between healthy and pathological voices. The model achieved maximum accuracy rates of 80.71% for sentences and 82.8% for vowels (/a/, /i/, /u/). The authors utilized the automatic voice disorder detection (AVDD) system with self-supervised representations to extract distinctive auditory features. They incorporated a feedforward layer with a class-token transformer to consolidate temporal feature sequences. The researchers augmented the training dataset with out-of-scope data to address data availability concerns. Experimental results demonstrated a classification accuracy of 93.36%, representing significant improvements of 4.1% without data augmentation and 15.62% with data augmentation. Using self-supervised (SS) representations in AVDD resulted in an accuracy rate of 90% [11].
Lee [12] employed deep learning techniques to classify voice samples, specifically using feedforward neural networks (FNN) and CNNs. Their study found that, utilizing LPCCs, the CNN classifier achieved a maximum accuracy of 82.69% for the vowel /a/ in male subjects. Ding et al. [13] utilized voice signal analysis to develop a method for the early diagnosis and treatment of voice disorders. They also introduced a novel computer-aided assessment approach for pathological voice classification (CS-PVC), specifically designed to distinguish between pathological and healthy voices in areas with significant discrepancies. The model achieved identification accuracies of 81.6% on the SVD dataset and 82.2% on the self-built Shenzhen People's Hospital voice database (SZUPD). Javanmardi et al. [14] conducted a comparative analysis of various data augmentation (DA) techniques for vocal pathology detection, evaluating three temporal methods (noise addition, pitch shifting, and time stretching), one time-frequency technique (SpecAugment), and two vocoder-based approaches (modifying the harmonic-to-noise ratio (HNR) and the glottal pulse length). The extracted features include static and dynamic MFCCs, the spectrogram, and the mel-spectrogram, which were then fed into machine learning models (SVM and random forest) and deep learning models (long short-term memory (LSTM) and CNN). The best performance, achieved with a 2D CNN, reached an accuracy of 80% on the SVD database [14]. Albadr et al. [15] improved the detection and classification of voice pathologies (VP) using a fast-learning network (FLN) classifier based on MFCC features. Their study comprised two phases: the first phase analyzed vocal samples of sustained vowels (/a/, /i/, and /u/) along with spoken phrases. In contrast, the second phase focused on vocal samples from three common voice disorders (paralysis, polyps, and cysts) using the vowel /a/ spoken in a neutral tone. The experimental results achieved an accuracy of 84.64%, a precision of 97.39%, a recall of 86.05%, an F-measure of 86.80%, a G-mean of 86.81%, and a specificity of 88.24%.

According to the literature, traditional methods for identifying laryngeal pathologies rely on vocal signal analysis. However, they have several limitations, particularly the lack of proper pre-processing of voice datasets. Researchers often extract features directly and classify them using a limited number of samples, making it challenging to eliminate residual noise in the reconstructed signal. This leads to oscillations that distort mode decomposition. Additionally, these approaches hinder the systematic evaluation of extracted parameters. To address these issues, we propose a novel method, described in section 2, to improve the detection of laryngeal disorders from speech signals.

This article is structured as follows: section 2 presents the proposed framework, detailing the materials and methodologies used in this study, encompassing both theoretical and practical aspects. Section 3 provides an in-depth discussion of the results, evaluating the effectiveness of the proposed method in detecting laryngeal issues. Finally, section 4 concludes with key findings and suggests potential directions for future research on diagnosing laryngeal pathologies.

2. METHOD

Figure 1 presents the block diagram illustrating the proposed methodology for the accurate and unbiased diagnosis of laryngeal pathologies. This methodology consists of four key steps: silence removal, low-pass filtering, normalization, and empirical mode decomposition (EMD).
This method decomposes the vocal signal into IMFs, representing its harmonic components. The most relevant IMFs are selected based on their maximum temporal energy and are framed into short segments (0.1-second duration with 0.01-second overlap) for analysis. Each frame is then multiplied by a Hamming window to minimize discontinuities at the beginning and end of the signal, thereby enhancing the accuracy of the frequency analysis. The Hamming window is the same length as the frame.

Figure 1. Block diagram illustrating the proposed methodology

Afterward, we extract seven features: pitch (fundamental frequency), spectral roll-off (SRO), spectral centroid (SC), zero-crossing rate (ZCR), higher-order statistics (HOSs), LPCCs, and MFCCs. Finally, each extracted feature serves as input for an SVM-RBF classifier, enhancing the accuracy of laryngeal pathology diagnosis. The originality of this study lies in integrating voice signal pre-processing and empirical mode decomposition to extract acoustic features. The main contributions of this study are as follows:
- Developing a non-invasive, low-cost method for the detection of laryngeal pathologies.
- Experimental validation of the effectiveness of the proposed system using the SVD database.
- Using more advanced voice signal pre-processing methods, including different feature extraction and classification algorithms, to make diagnosing laryngeal pathology much more reliable and accurate.

2.1. Database

This study utilized the SVD database, an online repository containing over 2,000 audio files featuring three distinct vowel sounds: /a/, /i/, and /u/. Each file has a duration ranging from 1 to 4 seconds and is sampled at a frequency of 50 kHz with 16-bit resolution. For analysis, we selected vocal signals of the sustained neutral vowel /a/ from a group of 200 healthy males and 91 males with pathological conditions. The pathology subset includes recordings from four specific conditions: 50 cases of laryngitis, 19 cases of vocal cord cancer, 5 cases of Reinke's edema, and 17 cases of vocal cord polyps.

2.2. Vocal signals preprocessing

Before using vocal signals in speech-processing applications, it is important to perform pre-processing tasks such as zero-mean normalization, amplitude normalization, low-pass filtering, and silence removal. Subtracting the mean from a signal centers it around zero, making the average of all the signal samples equal to zero. This process is commonly used to prepare data for machine learning algorithms. The signal is then scaled by dividing each sample by the maximum absolute value. This ensures that the signal's peak is normalized to 1 if the peak is positive or -1 if the peak is negative. We applied a low-pass filter with a cutoff frequency of 1 kHz to isolate the relevant low-frequency components and remove unwanted high-frequency components.
Silence removal refers to detecting and removing periods of silence in a signal while maintaining its timing. This method uses an energy threshold to identify silent periods. In this study, the threshold was set at 2% of the maximum energy level, and any segment with energy below this threshold was considered silent. Vocal signals primarily contain energy at lower frequencies, while non-vocal signals typically have higher frequencies [16]. As illustrated in Figure 2, we present the preprocessing steps applied to the vocal signal of speaker 563 from the SVD database to improve clarity.

Figure 2(a) shows the voice signal 114-a-n.wav after the application of low-pass filtering and normalization. The signal is centered around zero, reflecting the attenuation of high-frequency components and the standardization of the amplitude scale. Figure 2(b) displays a 10,000-sample excerpt of the same signal, corresponding to a duration of 0.2 seconds, to facilitate visual observation. This excerpt allows for a more detailed analysis of the waveform of the preprocessed signal, enabling a localized examination of its acoustic content. Figure 2(c) illustrates the voice signal after silence removal (7,893 samples, corresponding to a duration of 0.1579 seconds). The reduced signal length highlights the effective elimination of silent segments.

Figure 2. Preprocessing of the voice signal: (a) low-pass filtered and normalized voice signal 114-a-n.wav, (b) 10,000-sample excerpt of the low-pass filtered and normalized voice signal, and (c) voice signal after silence removal (7,893 samples, corresponding to a 0.1579-second duration)
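The pre-processing chain described above can be sketched in Python with NumPy and SciPy. The 4th-order Butterworth filter and the 512-sample analysis frame are illustrative assumptions, since the paper does not specify the filter design or the silence-detection frame size; only the 1 kHz cutoff and the 2% energy threshold come from the text:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess(x, fs=50_000, cutoff=1_000, frame_len=512, energy_thresh=0.02):
    """Zero-mean, peak-normalize, low-pass filter, then drop silent frames."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                        # zero-mean centering
    x = x / np.max(np.abs(x))               # peak normalization to [-1, 1]
    b, a = butter(4, cutoff / (fs / 2))     # 4th-order Butterworth (order assumed)
    x = filtfilt(b, a, x)                   # zero-phase low-pass at 1 kHz
    # Energy-based silence removal: keep frames above 2% of the max frame energy
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.sum(frames ** 2, axis=1)
    return frames[energy > energy_thresh * energy.max()].ravel()
```

Applied to a recording whose first half is silent, the routine returns roughly the voiced half of the samples, mirroring the length reduction seen in Figure 2(c).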
2.3. Empirical mode decomposition

Many researchers have used EMD to process vocal signals due to its excellent performance with this specific type of signal [17]-[20]. To detect the presence of voice in a non-stationary speech signal, we applied EMD to decompose it into a sequence of oscillatory patterns known as IMFs and a residual component, as shown in (1).

$$x(n) = r_k(n) + \sum_{i=1}^{k} \mathrm{IMF}_i(n) \quad (1)$$

where $x(n)$ is the digitized voice signal, $n$ represents the sample index, $k$ is the number of IMFs extracted, and $r_k(n)$ is the residual. We incorporated the stopping condition proposed by Huang et al. [17] for the sifting procedure. This criterion limits the standard deviation (SD) between two consecutive sifting results, typically between 0.2 and 0.3. For an IMF to be considered genuine, it must satisfy two criteria: the difference between the number of zero crossings and the number of extrema must not exceed one, and the average value of the envelope formed by the local maxima and minima must be zero.

Figure 3 illustrates the decomposition process as well as the criteria used to identify the most relevant IMFs, summarizing the key steps of our method. It highlights both the decomposition procedure and the steps used to extract acoustic information from the most significant components. The IMFs, shown in Figure 3(a), are obtained through an iterative sifting process, which involves the following steps:

i) Determine all extrema (local maxima and minima) of the signal $x(t)$.
ii) Estimate the envelopes of the minima and maxima using cubic spline interpolation, creating the lower envelope $e_{min}(t)$ and the upper envelope $e_{max}(t)$.
iii) Determine the envelope's mean by applying the following formula:

$$m_1(t) = \frac{e_{max}(t) + e_{min}(t)}{2} \quad (2)$$

iv) Calculate the candidate IMF as the difference between the $x(t)$ and $m_1(t)$ signals:
$$x(t) - m_1(t) = h_1(t) \quad (3)$$

v) If $h_1(t)$ is an IMF, it is defined as the first IMF component of $x(t)$. Otherwise, $h_1(t)$ is treated as the new signal to be sifted.
vi) Iterate the preceding steps, treating $h_1(t)$ as the new $x(t)$, and obtain $h_{11}(t)$. If $h_{11}(t)$ is an IMF, stop the process. Otherwise, continue iterating.

After the decomposition, we identified the IMF with the highest energy value as the most relevant. The energy is calculated using (4).

$$E_k = \sum_{n=1}^{N} [\mathrm{IMF}_k(n)]^2 \quad (4)$$

where $E_k$ is the energy of the $k$-th IMF, $N$ is the length of the signal, and $\mathrm{IMF}_k(n)$ is the value of the $k$-th IMF at sample $n$. The relevant IMF obtained (Figure 3(b)) is segmented into 0.1-second intervals and then multiplied by a Hamming window of the same length (Figure 3(c)) to extract acoustic features.

2.4. Feature extraction

2.4.1. Mel-frequency cepstral coefficients

MFCCs are extensively utilized features in speech and audio processing. They denote the short-term power spectrum of an auditory input, emulating human speech perception. MFCCs are crucial for identifying vocal abnormalities in the vocal domain [21], [22]. Figure 4 illustrates the steps involved in computing MFCCs. The pre-emphasis step enhances high frequencies to balance the spectrum. The fast Fourier transform (FFT) then converts the time-domain signal into a frequency spectrum. Subsequently, a mel filter bank is applied to map frequencies onto the mel scale, which aligns with human auditory perception. Finally, the amplitudes are converted to a logarithmic scale (similar to human perception) and subjected to a discrete cosine transform (DCT), extracting the most relevant MFCCs for classifying laryngeal diseases.

Figure 3. Decomposition of the voice signal: (a) IMFs, (b) the relevant mode, and (c) the relevant mode multiplied by the Hamming window (0.1-second)

Figure 4. Steps to compute MFCCs
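The energy-based IMF selection of equation (4) and the 0.1-second Hamming framing of section 2.3 can be sketched as follows. The sketch assumes the IMFs have already been computed, for example with an EMD implementation such as the PyEMD package:

```python
import numpy as np

def select_relevant_imf(imfs):
    """Pick the IMF with the highest energy, E_k = sum_n IMF_k(n)^2 (eq. (4))."""
    imfs = np.asarray(imfs)
    energies = np.sum(imfs ** 2, axis=1)      # one energy value per IMF
    return imfs[np.argmax(energies)], energies

def frame_with_hamming(imf, fs=50_000, frame_dur=0.1, overlap_dur=0.01):
    """Cut the selected IMF into 0.1 s frames overlapping by 0.01 s,
    each multiplied by a Hamming window of the same length."""
    frame_len = int(frame_dur * fs)
    hop = frame_len - int(overlap_dur * fs)   # adjacent frames share 0.01 s
    win = np.hamming(frame_len)               # window length equals frame length
    n = 1 + (len(imf) - frame_len) // hop
    return np.stack([imf[i * hop:i * hop + frame_len] * win for i in range(n)])
```

At the paper's 50 kHz sampling rate, each frame holds 5,000 samples and the hop is 4,500 samples; feature extraction (sections 2.4.1-2.4.7) then operates frame by frame on the windowed output.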
2.4.2. Linear predictive cepstral coefficients

LPCCs are an advanced signal processing technique used to estimate the source signal of vocal sounds. This method utilizes LPCCs (also referred to as CPLC) to perform a detailed analysis of the vocal signal. The primary goal of LPCCs is to model the signal's spectral envelope to extract its essential features. The vocal tract is modeled as an infinite impulse response (IIR) filter through a recursive approach [23]. This modeling process is described in (5).

$$H(z) = \frac{G}{1 + \sum_{k=1}^{p} a_p(k) z^{-k}} \quad (5)$$

where $p$ is the number of poles, $G$ denotes the filter gain, and $a_p(k)$ are the predictor coefficients. The extraction of LPCCs involves a series of sequential steps, as illustrated in Figure 5. First, the relevant signal segment, multiplied by a 0.1-second Hamming window, is modeled using a linear predictive model, which assumes that the current sample can be estimated as a linear combination of previous samples. The model coefficients are obtained by minimizing the prediction error. The autocorrelation function of the predicted signal is then computed to assess the similarity between different parts of the signal. Subsequently, the iterative Levinson-Durbin algorithm is employed to derive the prediction coefficients from the autocorrelation function. Finally, the coefficients are transformed into the cepstral domain by applying the discrete cosine transform (DCT) [24], [25].

Figure 5. Steps to compute LPCCs

2.4.3. Pitch

The fundamental frequency ($F_0$), often called pitch, is the frequency at which the vocal cords vibrate when producing voiced sounds. This frequency is a crucial indicator of laryngeal diseases. Several methods for calculating $F_0$ are described in the literature, including those based on autocorrelation, spectral analysis, and combinations of these techniques [20]. For our study, we chose the autocorrelation method, as defined by (6).
$$R[k] = \sum_{n=0}^{N-k-1} x[n] \cdot x[n+k] \quad (6)$$

where $R[k]$ is the autocorrelation function at lag $k$, $x[n]$ is the input signal at sample $n$, $k$ denotes the shift index (lag), and $N$ is the length of the signal. The first peak (local maximum) in the autocorrelation function after the peak at $k = 0$ corresponds to the fundamental period of the signal. The period $T_0$ is the distance between this peak and $k = 0$.
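The autocorrelation pitch estimate of equation (6) can be sketched as follows. The 50-500 Hz search band is an illustrative assumption (the paper does not state the lag range used); restricting the lags this way keeps the peak search inside plausible vocal periods:

```python
import numpy as np

def pitch_autocorr(frame, fs, fmin=50.0, fmax=500.0):
    """Estimate F0 from the first autocorrelation peak after lag 0 (eq. (6)).
    The 50-500 Hz search band is an assumption, not taken from the paper."""
    frame = np.asarray(frame, dtype=float)
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]  # R[k], k >= 0
    lo, hi = int(fs / fmax), int(fs / fmin)   # lags for plausible vocal periods
    lag = lo + np.argmax(r[lo:hi])            # strongest peak inside the band
    return fs / lag                           # F0 = fs / T0 (T0 in samples)
```

On a 0.1-second frame of a pure 200 Hz tone sampled at 50 kHz, the peak falls at lag 250 samples, giving $F_0$ = 50000/250 = 200 Hz.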
2.4.4. Higher-order statistics

Our work explicitly examined the HOS characteristics, focusing on the third-order moment (skewness) and the fourth-order moment (kurtosis). One notable benefit of these HOS features is their compatibility with periodic and non-periodic signals. Skewness quantifies the lack of symmetry in a voice's probability distribution, whereas kurtosis measures the extent to which a distribution is flat and contains impulsive elements in a signal. These two statistics provide a valuable method for analyzing voice features and diagnosing laryngeal pathology, assessing data distribution, and identifying impulsive components. We compute the skewness and kurtosis using (7) and (8), respectively [26]-[28]:

$$\gamma_3 = \frac{\sum_{n=1}^{N} (x_n - \mu)^3}{(N-1)\sigma^3} \quad (7)$$

$$\gamma_4 = \frac{\sum_{n=1}^{N} (x_n - \mu)^4}{(N-1)\sigma^4} \quad (8)$$

where $\gamma_3$ and $\gamma_4$ denote the measures of skewness and kurtosis, respectively, $N$ the number of samples, $\mu$ the mean, and $\sigma$ the SD.

2.4.5. Zero-crossing rate

The ZCR is a quantitative measure employed to assess the frequency characteristics of a signal. The term "sign change rate" refers to the frequency at which a signal changes its polarity within a specific time frame. More precisely, it counts the number of times the signal changes from positive to negative values (or vice versa) and then standardizes this tally by dividing it by the total duration of the frame. The following mathematical expression determines the zero-crossing rate:

$$Z_n = \frac{1}{w_l} \sum_{m=1}^{w_l} \left| \mathrm{sgn}[x_n(m)] - \mathrm{sgn}[x_n(m-1)] \right| \quad (9)$$

where the length of the frame is represented by $w_l$, the sample index within the frame by $m$, and the sign function by sgn:

$$\mathrm{sgn}[x_n(m)] = \begin{cases} 1 & \text{if } x_n(m) > 0, \\ 0 & \text{if } x_n(m) = 0, \\ -1 & \text{if } x_n(m) < 0. \end{cases} \quad (10)$$

2.4.6. Spectral centroid

The spectral centroid is a crucial feature used to identify voice disorders.
It represents the "center of gravity" of the spectrum and is computed using frequency and amplitude information derived from the Fourier transform [29], [30]. The spectral centroid indicates the frequency in Hertz (Hz) at which the spectral energy is balanced or evenly distributed. It is calculated as the weighted average of the frequencies contained in the signal, as expressed by (11).

$$\text{Spectral centroid} = \frac{\sum_{k=1}^{N} f_k \cdot S_k}{\sum_{k=1}^{N} S_k} \quad (11)$$

where $N$ represents the number of spectral bins, $f_k$ is the frequency of the $k$-th spectral bin, and $S_k$ denotes the amplitude of the $k$-th spectral bin.

2.4.7. Spectral roll-off

The term "spectral roll-off" refers to a metric used to define a filter intended to decrease the amplitude of frequencies that fall outside a particular range. This technique is frequently used to reduce undesired frequencies in a transmission. It is a measure that identifies the frequency below which a specific percentage of the total energy in a spectrum is concentrated. The SRO equation states that the spectral energy accumulated up to the $i$-th bin is proportional to the total energy contained between the $b_1$ and $b_2$ bins, and it is typically expressed as follows [28]:

$$\sum_{k=b_1}^{i} S_k = K \sum_{k=b_1}^{b_2} S_k \quad (12)$$

where $S_k$ represents the spectral amplitude at the $k$-th frequency bin, $b_1$ and $b_2$ are the band edges over which the spectral spread is calculated, and $K$ represents the percentage of total energy.

2.5. Classification

Several techniques are available for classifying laryngeal disorders based on vocal signals, including CNNs, AlexNet, SVMs, random forests, K-nearest neighbors (KNN), decision trees, and deep neural networks (DNNs). Each algorithm offers distinct advantages, improving classification accuracy depending on the context and dataset [4], [31]. In our study, we selected an SVM with an RBF kernel. The SVM-RBF is a supervised learning model designed to construct an optimal hyperplane that separates data into two distinct classes. One of its key strengths lies in its deterministic nature, as it does not rely on probabilistic assumptions. Such an approach can lead to more consistent and interpretable results in specific applications.

The SVM-RBF's goal is to find the hyperplane that maximizes the margin, i.e., the distance between the hyperplane and the closest support vectors. This margin serves as a decision boundary that best differentiates the two classes. A wider margin typically improves the model's generalization capability, enabling it to more accurately classify new, unseen data. Additionally, the margin-based approach contributes to robustness by reducing the model's sensitivity to outliers and noise in the dataset [27].
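The frame-level features of sections 2.4.4-2.4.7 above can be sketched in NumPy as follows. The roll-off fraction k = 0.85 is an assumed value, since the paper does not state the percentage used:

```python
import numpy as np

def hos(frame):
    """Skewness and kurtosis of a frame (eqs. (7) and (8))."""
    mu, sigma, n = frame.mean(), frame.std(), len(frame)
    g3 = np.sum((frame - mu) ** 3) / ((n - 1) * sigma ** 3)
    g4 = np.sum((frame - mu) ** 4) / ((n - 1) * sigma ** 4)
    return g3, g4

def zero_crossing_rate(frame):
    """Sign-change sum normalized by the frame length (eq. (9));
    each polarity change contributes 2 to the sum, as in the equation."""
    s = np.sign(frame)
    return np.sum(np.abs(np.diff(s))) / len(frame)

def spectral_centroid(frame, fs):
    """Amplitude-weighted mean frequency of the spectrum (eq. (11))."""
    spec = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1 / fs)
    return np.sum(freqs * spec) / np.sum(spec)

def spectral_rolloff(frame, fs, k=0.85):
    """Frequency below which a fraction k of the spectral energy lies (eq. (12));
    k = 0.85 is an assumed value, not stated in the paper."""
    spec = np.abs(np.fft.rfft(frame))
    cum = np.cumsum(spec)
    idx = int(np.searchsorted(cum, k * cum[-1]))
    return np.fft.rfftfreq(len(frame), d=1 / fs)[idx]
```

For a pure sinusoid these return the textbook values: skewness near 0, kurtosis near 1.5, and a centroid and roll-off at the tone's frequency, which makes the routines easy to sanity-check before applying them to the windowed IMF frames.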
To optimize the performance of the SVM-RBF model for our specific dataset, we conducted an exhaustive parameter search. In particular, we fine-tuned two crucial parameters: the kernel scale ($\gamma$) and the box constraint ($C$). The kernel scale regulates the impact of individual training samples on the configuration of the decision boundary, whereas the box constraint mediates the balance between maximizing the margin and reducing classification mistakes [32]. By carefully adjusting these parameters, we could regulate the complexity of the decision surface and enhance the model's effectiveness in classifying vocal signals associated with laryngeal disorders. The RBF kernel used in SVMs is mathematically defined as follows:

$$K(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2} \quad (13)$$

where $x_i$ and $x_j$ are feature vectors in the input space, $K(x_i, x_j)$ is the kernel function that computes the similarity between the two data points, $\|x_i - x_j\|$ represents the Euclidean distance between them, and $\gamma$ is a parameter that controls the spread of the kernel. A higher value of $\gamma$ results in a narrower kernel, meaning that only points that are very close to each other will be considered similar. Conversely, a lower value of $\gamma$ makes the kernel wider, considering more distant points as similar.

The box constraint ($C$) is a regularization parameter that controls the trade-off between achieving a low training error and maintaining a simpler decision boundary. A higher value of $C$ penalizes misclassifications more heavily, leading to a complex decision boundary that may overfit the data, whereas a lower $C$ allows for more classification errors, promoting a simpler and more generalized model.

$$\min_{w, b, \epsilon} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \epsilon_i \quad (14)$$

Subject to the constraints:

$$y_i(w^T x_i + b) \geq 1 - \epsilon_i, \quad \epsilon_i \geq 0, \quad i = 1, \ldots, l \quad (15)$$

where $w$ denotes the normal vector defining the hyperplane, $b$ represents the bias shifting the hyperplane, $l$ is the total number of data points, $\epsilon_i$ are the slack variables allowing for tolerance of classification errors, and
$y_i \in \{+1, -1\}$ is the class label of the sample $x_i$. We investigated the optimization parameters $C = 2^k$ and $\gamma = 2^m$, where $k$ and $m$ are integers chosen within the range of -20 to 20. By fine-tuning these parameters, we aim to enhance classification performance while preserving a balance between accuracy and generalization.

We evaluated these automated classification and detection methods for laryngeal diseases using four key metrics: accuracy, sensitivity, specificity, and the F1 score. The algorithm classifies pathological samples as either pathological or healthy, labeling them as true positives (TP) or false negatives (FN), respectively. Conversely, healthy samples classified as pathological or healthy correspond to false positives (FP) and true negatives (TN). The following equations define these performance measures.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (16)$$

$$\text{Sensitivity (Recall)} = \frac{TP}{TP + FN} \quad (17)$$

$$\text{Precision} = \frac{TP}{TP + FP} \quad (18)$$

$$\text{Specificity} = \frac{TN}{TN + FP} \quad (19)$$

$$\text{F1 score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \quad (20)$$

3. RESULTS AND DISCUSSION

The proposed laryngeal disease detection and classification method was evaluated using the SVD database, described in section 2.1. In our experiments, 80% of the data was used for training, while 20% was reserved for testing and validation to evaluate the model's performance. Interpreting the confusion matrix is essential for evaluating the model's performance in accurately classifying the different categories (normal or pathological). This evaluation is guided by the metrics defined in section 2.5, which provide a quantitative classification performance assessment.
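The evaluation metrics of equations (16)-(20) can be sketched as a small NumPy routine; the label convention (1 = pathological as the positive class, 0 = healthy) is an assumption for illustration:

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, precision, and F1 score from the
    confusion counts (eqs. (16)-(20)); 1 = pathological, 0 = healthy (assumed)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    accuracy = (tp + tn) / (tp + tn + fp + fn)    # eq. (16)
    sensitivity = tp / (tp + fn)                  # recall, eq. (17)
    precision = tp / (tp + fp)                    # eq. (18)
    specificity = tn / (tn + fp)                  # eq. (19)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)  # eq. (20)
    return {'accuracy': accuracy, 'sensitivity': sensitivity,
            'specificity': specificity, 'precision': precision, 'f1': f1}
```

With scikit-learn, the exhaustive grid search of section 2.5 would amount to fitting `SVC(kernel='rbf', C=2**k, gamma=2**m)` for integer $k, m$ in [-20, 20] on the 80% training split and scoring each candidate with a routine like this one.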
Table 1 presents the metric values corresponding to each feature: MFCCs, LPCCs, HOSs, pitch, SRO, ZCR, and SC.

Table 1. Evaluation metrics of the different characterization parameters

Parameter | Accuracy (%) | Sensitivity (%) | Specificity (%) | F1 (%) | AUC (%)
14 MFCCs  | 94.5         | 94.2            | 95.3            | 96.1   | 94.5
14 LPCCs  | 85.8         | 88.7            | 78.1            | 88.7   | 85.5
HOSs      | 86.1         | 91.3            | 71.9            | 91.3   | 86.1
Pitch     | 86.6         | 87.9            | 83.5            | 87.9   | 89.1
SRO       | 86.1         | 86.9            | 83.9            | 86.9   | 86.1
ZCR       | 79.2         | 90.6            | 50.7            | 90.6   | 79.2
SC        | 86.0         | 93.2            | 68.2            | 93.2   | 86.0

The metrics presented in Table 1 provide valuable insights into the contribution of each acoustic feature to the classification of normal and pathological voices. Among all the parameters, MFCCs and LPCCs exhibit the highest performance across all evaluation metrics, indicating their strong discriminative power in detecting vocal pathologies. This result is consistent with previous studies, which highlight the efficiency of cepstral features in capturing relevant information from speech signals. HOSs also show promising results, suggesting that the voice signal's nonlinear characteristics contain useful diagnostic cues. Pitch and SRO demonstrate moderate classification performance, likely because they capture complementary aspects of vocal signal variability that may not be as robust across all samples.