International Journal of Advances in Applied Sciences (IJAAS)
Vol. 14, No. 3, September 2025, pp. 955-965
ISSN: 2252-8814, DOI: 10.11591/ijaas.v14.i3.pp955-965

Pitch extraction using discrete cosine transform based power spectrum method in noisy speech

Humaira Sunzida1, Nargis Parvin2, Jafrin Akter Jeba1, Sulin Chi3, Md. Shiplu Ali1, Moinur Rahman1, Md. Saifur Rahman1
1Department of Information and Communication Technology, Faculty of Engineering, Comilla University, Cumilla, Bangladesh
2Department of Computer Science and Engineering, Bangladesh Army International University of Science and Technology, Cumilla, Bangladesh
3Department of Information Engineering, Otemon Gakuin University, Osaka, Japan

Article Info

Article history:
Received Jun 8, 2024
Revised Mar 9, 2025
Accepted Jun 8, 2025

Keywords: Autocorrelation function; Cumulative power spectrum; Discrete cosine transform; Fundamental frequency; Pitch

ABSTRACT
The pitch period is a key component of many speech analysis research projects. In real-world applications, voice data is frequently gathered in noisy surroundings, so algorithms must handle background noise well in order to estimate pitch accurately. Despite advancements, many state-of-the-art algorithms struggle to deliver adequate results when faced with low signal-to-noise ratios (SNRs) in processing noisy speech signals. This research proposes an effective concept specifically designed for speech processing applications, particularly in noisy conditions. To achieve this goal, we introduce a fundamental frequency extraction algorithm designed to tolerate non-stationary changes in the amplitude and frequency of the input signal. To improve the extraction accuracy, we also use a cumulative power spectrum (CPS) based on the discrete cosine transform (DCT) rather than the conventional power spectrum.
We enhance the extraction accuracy of our method by utilizing shorter sub-frames of the input signal to mitigate the noise characteristics present in speech signals. According to the experimental results, our proposed technique demonstrates superior performance in noisy conditions compared to other existing state-of-the-art methods, without utilizing any kind of post-processing techniques. This is an open access article under the CC BY-SA license.

Corresponding Author:
Md. Saifur Rahman
Department of Information and Communication Technology, Faculty of Engineering, Comilla University
Kotbari, Cumilla, Bangladesh
Email: saifurice@cou.ac.bd

1. INTRODUCTION
The vocalized form of human communication, known as speech, is defined as the movement of different speech organs to produce sounds. In other words, speech can be defined as a series of sounds arranged in a sequence. Sound is a symbolic representation of information that needs to be transmitted between people or between people and machines. The speech signal, represented acoustically as fluctuations in air pressure, conveys information between individuals or between individuals and machines. Speech may take voiced, unvoiced, or silent forms, reflecting different approaches to vocalization and sound generation. A voiced sound occurs when the speaker's vocal cords vibrate during sound production, while an unvoiced sound is produced without vocal cord vibration; when nothing comes out of the mouth, it is considered silence. When a person speaks, their vocal cords vibrate, and the pitch is determined by how long it takes for the cords to open and close, known as the pitch period. This periodicity defines the fundamental frequency, which is

Journal homepage: http://ijaas.iaescore.com
also represented as the pitch. In voiced sounds, the perceived pitch is determined by the apparent periodicity of vocal cord vibrations. Essentially, "pitch" in speech corresponds to the frequency of vocal cord vibrations during voiced sounds [1]. Pitch level correlates with the fundamental frequency: lower frequencies correspond to lower pitches, while higher frequencies indicate higher pitches [2]. Children and females are capable of reaching frequencies up to 500 Hz, while males typically have a lower fundamental frequency, around 60 Hz [3]. Pitch, or fundamental frequency (F0), is vital in speech production, reflecting the rate of vocal fold vibration and influencing intonation and emotion perception. Accurate pitch estimation is essential across multiple fields such as speech processing and music, enabling tasks such as music analysis, speech prosody understanding, and telecommunications. Precision in pitch extraction significantly impacts the effectiveness of applications like music synthesis, speech processing, and voice modulation [4], [5]. Up to now, a variety of pitch recognition methods have been covered. Pitch detection algorithm (PDA) is the term used to describe these techniques, which are founded on various mathematical principles [6]. PDAs can operate in three different ways: in the frequency domain, in the time domain, or in a combination of the two [7]. Some pitch detection methods focus on identifying and timing specific features in the time domain. Pitch estimators in the time domain usually have three parts: a basic estimator, a post-processor for error correction, and a preprocessor for signal simplification. Within this domain are various techniques, such as the autocorrelation function (ACF) [8], average magnitude difference function (AMDF) [9], average squared mean difference function (ASMDF) [10], weighted autocorrelation function (WAF) [11], and YIN [12].
The autocorrelation approach is the most often used method for determining a voice signal's pitch period. The ACF indicates the correlation between the input signal and a time-delayed version of itself. AMDF, known for showing low points at integral multiples of the pitch period, is often utilized for pitch estimation [13]. AMDF stands as an alternative to autocorrelation analysis, presenting a simplified version compared to the ACF. With AMDF, as opposed to the ACF, the delayed speech is subtracted from the original to create a difference signal, and the absolute magnitude is then determined at each delay value. The WAF method exploits the periodicity property shared with the ACF and AMDF; the WAF is characterized by employing the ACF as its numerator and the AMDF as its denominator. The YIN technique is an algorithm that analyzes the traditional ACF [14]. In the frequency domain, various techniques have been developed to analyze the cepstrum coefficients or spectrum of periodic signals in order to extract pitch. The cepstrum (CEP) method [15] is one of the most well-known. This method relies on spectral characteristics; CEP is able to distinguish vocal tract features from periodic components. However, its performance is significantly compromised in a noisy environment, where the presence of noise has a pronounced impact on the log-amplitude spectrum. Enhancements to the cepstrum method are addressed in the modified cepstrum (MCEP) [16]. Features from both the windowless autocorrelation function (WLACF) and cepstral analysis are included in the cepstrum technique known as WLACF-CEP. WLACF reduces noise in the speech signal without compromising its periodicity. The pitch estimation filter with amplitude compression (PEFAC) utilizes summations of sub-harmonics in the log-frequency domain.
To improve its resilience to noise, PEFAC incorporates an amplitude compression technique [17]. Using both logarithmic and power functions, [18] reduces the effect of formants and utilizes the Radon transform to provide a novel method for estimating pitch in noisy speech conditions; it also incorporates the Viterbi algorithm for pitch pattern refinement. Mnasri et al. [19] establish a pragmatic relationship between the instantaneous frequency (Fi) and the fundamental frequency (F0). Their method determines whether speech areas are voiced or unvoiced and extracts the F0 contour by approximating it as a smoothed envelope of the remaining Fi values. To estimate pitch by comparing the temporal accumulations of clean and noisy speech samples, the topology-aware intra-operator parallelism strategy searching (TAPS) algorithm, as described in [20], trains a set of peak spectrum exemplars. To understand how noise affects the locations and amplitudes within the spectrum of clear speech, Chu and Alwan developed the statistical algorithm for F0 estimation (SAFE) model [21]. Pitch estimation is enhanced using self-supervised pitch estimation (SPICE), as stated in [22], by refining the acquired data and training on a constant-Q transform of the signals. To accommodate pitches with varying noise levels, DeepF0 [23] expands the network's receptive range. It has been demonstrated that HarmoF0 outperforms DeepF0 in pitch estimation by employing a range of dilated convolutions. On the other hand, BaNa [24] opts for the initial five amplitude spectral peaks from the speech signal's spectrum, on average, for both male
and female speakers. Existing methods often struggle with accuracy in noisy conditions, particularly when the signal-to-noise ratio (SNR) is low. In a novel approach, this study explores using the discrete cosine transform (DCT) [25] instead of the fast Fourier transform (FFT) [26], which proves effective for noisy signals but is susceptible to vocal tract effects, resulting in some inconsistencies. However, when the DCT was applied directly to the power spectrum, detection accuracy decreased. To mitigate the impact of noise and improve accuracy, the study introduces a novel method combining the cumulative power spectrum (CPS) with DCT features. Instead of the conventional power spectrum, the proposed technique employs a CPS based on the DCT. The CPS emphasizes shorter sub-frames, which is more effective at reducing noise characteristics as well as mitigating the effect of the vocal tract. Therefore, the proposed approach outperforms traditional pitch extraction methods on noisy speech signals by effectively suppressing noise components, demonstrating superior efficacy in fundamental frequency extraction under noisy conditions.

2. PROPOSED METHOD
Assume that y(n) represents a speech signal impacted by noise, as specified by (1),

y(n) = s(n) + w(n)    (1)

where w(n) is additive noise and s(n) is a clean speech signal. The block diagram of the CPS approach is displayed in Figure 1. The initial step involves dividing the noise-corrupted speech signal y(n) into frames.

Figure 1. Block diagram of DCT based CPS

In this approach, framing is accomplished by employing a rectangular window function. In our experiments, the input signal is partitioned into frames, each comprising 800 samples (equivalent to 50 [ms]). The framed signal y_f(n), where 0 <= n <= N - 1, is partitioned into three sub-frames using a time-division approach. These sub-frames are defined as (2)-(4).
y_{f,1}(n) = y_f(n),        0 <= n <= M - 1           (2)
y_{f,2}(n - D) = y_f(n),    D <= n <= D + M - 1       (3)
y_{f,3}(n - 2D) = y_f(n),   2D <= n <= 2D + M - 1     (4)

In this context, where M represents an integer indicating the sub-frame length and D denotes the frame shift in samples, the goal is typically to set 2D + M - 1 equal to N - 1. In section 3, the lengths of M and D are specified as 30 [ms] and 10 [ms], respectively. The signal y_f(n), where 0 <= n <= N - 1, undergoes frequency-domain transformation through periodogram computation using the DCT. We examine the power spectrum of y_f(n) to obtain information about the fundamental frequencies with respect to the DCT. The DCT is a Fourier-related transform that uses only real values, much like the discrete Fourier transform (DFT) [27]. The DCT is favored over the DFT for transforming real signals, such as an acoustic signal. Different kinds of DCT and inverse discrete cosine transform (IDCT) pairings can be used for implementation purposes. The DFT turns a complex signal into its complex spectrum. However, if the signal is real, as it is in the majority of applications, half of the data is redundant and half of the computation is wasted. The DCT tends to concentrate signal energy in a smaller number of coefficients compared to the DFT.
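As an illustration, the framing and sub-frame division of (2)-(4) can be sketched in a few lines of Python; the function and parameter names here are our own, not from the paper:

```python
import numpy as np

def split_subframes(frame, fs=16000, sub_ms=30, shift_ms=10):
    """Split one N-sample analysis frame into three sub-frames of
    length M with shift D, as in Eqs. (2)-(4)."""
    M = int(fs * sub_ms / 1000)    # sub-frame length: 480 samples at 16 kHz
    D = int(fs * shift_ms / 1000)  # frame shift: 160 samples at 16 kHz
    return [frame[j * D : j * D + M] for j in range(3)]

frame = np.random.randn(800)  # one 50 ms frame at 16 kHz
subs = split_subframes(frame)
# 2D + M = 800 = N, so the last sub-frame ends exactly at the frame boundary
```

With the paper's settings (50 ms frame, 30 ms sub-frames, 10 ms shift at 16 kHz), the three sub-frames tile the frame with 20 ms of overlap between neighbors.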
The DFT provides a complex spectrum for a real signal, thereby wasting over half of the data. In contrast, the DCT eliminates the need to compute redundant data by producing a true spectrum of a real signal. The DCT gathers most of the signal's information into its lower-order coefficients, resulting in a large reduction in processing costs [28]. As a real transform, the DCT avoids superfluous data and computation by producing a real spectrum of a real signal. A further benefit of the DCT is that it requires only a straightforward phase-unwrapping procedure because it is a real function. Furthermore, as the DCT is derived from the DFT, all of the DFT's advantageous characteristics are retained, and a fast algorithm is available. Because the DCT is a fully real transform and doesn't require complex variables or arithmetic, it is computationally more efficient than the DFT. Taking into account the benefits of the DCT for real signals, the DCT Y_f(k) of y_f(n) is derived as (5).

Y_f(k) = c_d(k) * sum_{n=1}^{N} y_f(n) cos( pi (2n - 1)(k - 1) / (2N) )    (5)

In (5), k represents the frequency bin index, and the coefficient c_d(k) is c_d(k) = sqrt(1/N) for k = 1 and c_d(k) = sqrt(2/N) for 2 <= k <= N. Therefore, Y_f(k) is obtained. The fundamental frequency and higher harmonics are represented as sharper, higher-amplitude peaks in the DCT spectrum. The DCT's downsampled, or compressed, spectra allow the higher harmonics to be located at the fundamental frequency. The resultant spectrum, identified as the power spectrum of y_f(n), is denoted as P_{y_f}(k), where k corresponds to the frequency bin number associated with a discrete representation of w, denoted w_k. For each sub-frame y_{f,j}(n), where j = 1, 2, 3 and 0 <= n <= M - 1, the power spectra are computed as P_{y_{f,1}}(k), P_{y_{f,2}}(k), and P_{y_{f,3}}(k).
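For reference, the orthonormal DCT-II in SciPy applies exactly the c_d(k) scaling of (5), so a sub-frame power spectrum can be sketched as follows (a minimal illustration, not the authors' code):

```python
import numpy as np
from scipy.fft import dct

def dct_power_spectrum(x):
    """DCT-II with norm='ortho' matches Eq. (5); squaring the real
    coefficients gives the power spectrum of the sub-frame."""
    return dct(x, type=2, norm="ortho") ** 2

sub = np.random.randn(480)      # one 30 ms sub-frame at 16 kHz
P = dct_power_spectrum(sub)
# orthonormality => Parseval holds: P.sum() equals (sub ** 2).sum()
```

Because the transform is orthonormal, the total power in P equals the energy of the sub-frame, which is a quick sanity check on the scaling.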
The accumulation of these three power spectra is performed for each frequency bin as (6).

P_bar_{y_f}(k) = sum_{j=1}^{3} P_{y_{f,j}}(k)    (6)

The obtained power spectrum undergoes an IDCT. By identifying the location of the maximum in the resulting ACF, the fundamental frequency of y_f(n) is detected. Figure 2 displays the output waveforms of the noisy speech signal, the conventional ACF approach, the DCT-based ACF, and the DCT-based CPS.

Figure 2. Validation of CPS-DCT using output waveform
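Putting the pieces together, a hedged sketch of the whole pipeline is shown below: accumulate the three sub-frame DCT power spectra as in (6), apply the IDCT, and pick the peak of the resulting ACF-like function inside a plausible pitch-lag range. The lag-to-frequency conversion and search bounds are our illustrative choices, not taken from the paper.

```python
import numpy as np
from scipy.fft import dct, idct

def cps_pitch(frame, fs=16000, f_min=50, f_max=400):
    """Estimate F0 from one frame via the DCT-based CPS: accumulate
    sub-frame DCT power spectra (Eq. (6)), take the IDCT, and locate
    the maximum of the ACF-like result within the pitch-lag range."""
    M, D = int(0.030 * fs), int(0.010 * fs)   # sub-frame length and shift
    cps = np.zeros(M)
    for j in range(3):
        sub = frame[j * D : j * D + M]
        cps += dct(sub, type=2, norm="ortho") ** 2
    acf = idct(cps, type=2, norm="ortho")     # ACF-like function of lag
    lo, hi = int(fs / f_max), int(fs / f_min) # search 50-400 Hz as lags
    lag = lo + int(np.argmax(acf[lo:hi]))
    return fs / lag

fs = 16000
n = np.arange(int(0.050 * fs))
frame = np.cos(2 * np.pi * 100 * (n + 0.5) / fs)  # clean 100 Hz tone
f0 = cps_pitch(frame, fs)                          # close to 100 Hz
```

On a clean tone aligned to a DCT bin the peak lands at the pitch lag; on real noisy speech the accumulation over sub-frames is what suppresses the noise and vocal tract ripple, as the paper argues.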
In Figure 2, the false peak represents the vocal tract effect, while the true peak indicates the fundamental frequency. The conventional ACF output waveform is notably impacted by the vocal tract effect, resulting in a false peak close to the true peak. The adoption of the DCT in place of the FFT within the ACF helps alleviate the vocal tract effect, whereas our proposed method plays a crucial role in achieving a smoother signal than the DCT-based ACF. It not only significantly reduces the vocal tract effect but also provides a more seamless waveform compared to the other methods. The results from the autocorrelation method applied to a voiced frame are illustrated in Figure 2. The waveforms in Figure 2 represent the effect of the FFT and the DCT on the ACF of the speech signal. These figures depict the outcome for speech delivered by a male speaker in the presence of white noise. We have already explored that in the cross-correlation of noisy and clean speech, this component becomes zero. Hence, clean speech is significantly emphasized, and the ACF proves to be very effective in the case of a noisy signal. However, the ACF is considerably influenced by the vocal tract effect, leading to some unsmooth occurrences in the signal due to noise. The use of the DCT-based ACF can mitigate the vocal tract effect, yet some residual noise occurrences are still observable in the signal. Also, when we used the DCT in the ACF, the detection accuracy went down. In order to further diminish the impact of noise characteristics and achieve better accuracy, we have introduced our proposed method, which combines the features of the CPS with the DCT. On the other hand, Figure 3 presents the validation of our proposed idea by utilizing the harmonic characteristics.
From Figure 3, we have observed that the DCT-based CPS (proposed) is more effective against noise characteristics than the FFT-based and DCT-based power spectra. In the case of the FFT-based power spectrum, we have found that the harmonics are highly affected by noise, which is marked by a circle.

Figure 3. Validation of CPS-DCT using harmonic characteristics

3. RESULTS AND DISCUSSION
In this section, we assess the effectiveness of the CPS in identifying the fundamental frequency in the presence of noisy speech. Our assessment involves conducting experiments on speech signals to examine the performance of the cumulation-based approach. Ultimately, we present a comparative analysis of the outcomes achieved with our proposed method against those obtained from conventional pitch detection methods.

3.1. Experimental conditions
The proposed pitch detection method is implemented using speech signals obtained from the KEELE database [29] and the NTT database [30]. The KEELE database contains speech recordings from ten speakers, evenly divided between five males and five females. The collective duration of the speech signals extracted from the KEELE database, encompassing the speeches of all ten speakers, amounts to around 5.5 [m]. These speech signals were sampled at a frequency of 16 [kHz]. Eight utterances by Japanese speakers, each lasting ten seconds and with a 3.4 [kHz] band limitation and 10 [kHz] sampling rate, are available in the NTT database. This research introduces a novel idea that proves to be more suitable for speech processing applications, particularly the accurate retrieval of pitch from speech signals under noisy conditions. To simulate noisy speech samples, we blend clean speech recordings with noise collected from environments with high levels of background sound. To create the appropriate noisy voice samples, our method combines several forms of noise with the original speech signals. Four distinct noise categories, each with different SNR levels, were introduced into the initial signals to evaluate the algorithms' robustness to noise. These noise categories are white noise, babble noise, train noise, and high-frequency (HF) channel noise, all obtained from NOISEX-92 [31] and sampled at a frequency of 20 [kHz]. The noises were adjusted to a 16 [kHz] sample frequency to match the KEELE database's signal properties and to a 10 [kHz] sample frequency to match the NTT database's signal properties. The SNR, or signal-to-noise ratio, was systematically varied at levels of 0, 5, 10, 15, and 20 [dB] for the assessment. The remaining experimental parameters for extracting the fundamental frequency were as follows: except for PEFAC and BaNa, the frame length is 50 [ms] and the frame shift is 10 [ms]; the window type is rectangular, with the exception of BaNa and PEFAC; and the number of DCT (IDCT) points is 2048 (KEELE) and 1024 (NTT) when BaNa and PEFAC are not used.

3.2. Evaluation criteria
Pitch estimation error is determined by measuring the difference between the reference and estimated fundamental frequencies. The accuracy of fundamental frequency detection is assessed following Rabiner's rule [31], using the fundamental frequency detection error e(l).

e(l) = F_est(l) - F_true(l)    (7)

Here l is the frame number, F_est(l) is the estimated fundamental frequency at the l-th frame of a noisy spoken signal, and F_true(l) is the true fundamental frequency at the l-th frame. If the absolute value of e(l) exceeds 10% (i.e.
|e(l)| > 10%) of F_true(l), it falls under the category of gross pitch error (GPE), and the overall proportion of this error is computed over the uttered frames in the speech data. The error is designated as fine pitch error (FPE) if |e(l)| <= 10% of the ground-truth first harmonic frequency. We specifically identified and evaluated the voiced portions of sentences with respect to the fundamental frequency. Our analysis used a search range from f_min = 50 [Hz] to f_max = 400 [Hz], corresponding to the fundamental frequency range commonly observed in most people.

3.3. Results and performance comparison
In this section, we conduct a comparative analysis between our proposed method and conventional approaches, namely PEFAC, BaNa, and YIN, using distinct utterances from the KEELE and NTT databases. We evaluate performance under four types of noise: white noise, babble noise, HF channel noise, and train noise. Parameters like frame length, window function, and the number of DFT (IDFT) points specific to PEFAC and BaNa were adjusted, while the other parameters remained consistent across methods. The Hamming window function was applied uniformly in PEFAC and BaNa. For BaNa, the frame duration was set to 60 [ms], and 2^16 points were used for the DFT (IDFT). The source code of BaNa, tailored for this environment, was implemented as described in [32]. PEFAC utilized a Hamming window with a duration of 90 [ms] for both the window function and the frame length; its source code used 2^13 as the number of DFT (IDFT) points. The implementation of PEFAC in this environment follows [17], [33]. Performance evaluation was conducted using the GPE and the FPE.
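The 10% scoring rule above is straightforward to compute. A hedged sketch is given below; the array names and the convention of marking unvoiced frames with F_true = 0 are ours, not from the paper:

```python
import numpy as np

def gpe_fpe(f_est, f_true, thresh=0.10):
    """Gross pitch error (% of voiced frames with |e| > 10% of F_true)
    and fine pitch error (mean |e| in Hz over the remaining frames)."""
    f_est = np.asarray(f_est, dtype=float)
    f_true = np.asarray(f_true, dtype=float)
    voiced = f_true > 0                       # unvoiced frames: F_true = 0
    err = np.abs(f_est[voiced] - f_true[voiced])
    gross = err > thresh * f_true[voiced]
    gpe = 100.0 * gross.mean()                # percentage of voiced frames
    fpe = err[~gross].mean() if (~gross).any() else 0.0  # Hz
    return gpe, fpe

gpe, fpe = gpe_fpe([101, 185, 151, 150], [100, 200, 150, 120])
# one of four frames misses by more than 10% -> GPE = 25.0 (%)
```

In the toy example, only the last frame (150 Hz estimated vs. 120 Hz true, a 30 Hz miss against a 12 Hz threshold) counts as a gross error; the FPE averages the three remaining absolute errors.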
The average GPE and FPE results obtained from the experimental outcomes of the proposed method, PEFAC, BaNa, and YIN were considered for utterances from both female and male speakers at various SNRs (0, 5, 10, 15, and 20 [dB]). Tables 1-8 present a comparison of GPE for the KEELE database and the NTT database, respectively, under various noise conditions, including white noise, babble noise, HF channel noise, and train noise. Tables 9-16 present a comparison of FPE for the KEELE database and the NTT database, respectively, under the same noise conditions. The GPE and FPE values of our proposed method are contrasted with those of PEFAC, BaNa, and YIN.
Table 1. Average GPE rate (%) for KEELE database for white noise
SNR [dB]   Proposed   PEFAC   BaNa    YIN
0          20.58      37.15   22.61   31.37
5          15.96      34.38   19.58   21.59
10         13.86      33.01   17.80   16.57
15         13.12      32.50   16.97   14.29
20         12.90      31.98   16.59   12.87

Table 2. Average GPE rate (%) for KEELE database for babble noise
SNR [dB]   Proposed   PEFAC   BaNa    YIN
0          35.18      49.01   40.54   36.89
5          22.88      41.86   29.48   23.68
10         16.57      37.41   22.84   16.64
15         13.09      34.98   19.69   13.16
20         11.87      33.39   17.70   12.14

Table 3. Average GPE rate (%) for KEELE database for train noise
SNR [dB]   Proposed   PEFAC   BaNa    YIN
0          33.44      43.17   29.08   34.38
5          22.81      38.99   23.11   22.76
10         16.98      35.59   20.04   16.36
15         14.50      33.40   18.31   13.42
20         13.49      32.25   17.36   12.16

Table 4. Average GPE rate (%) for KEELE database for HF-channel noise
SNR [dB]   Proposed   PEFAC   BaNa    YIN
0          24.70      40.13   22.64   31.55
5          17.64      36.86   19.82   21.01
10         14.79      34.37   17.90   16.06
15         13.45      32.98   17.31   13.76
20         13.04      32.11   16.57   12.79

Table 5. Average GPE rate (%) for NTT database for white noise
SNR [dB]   Proposed   PEFAC   BaNa    YIN
0          4.71       17.47   8.00    14.20
5          1.90       12.89   5.52    4.70
10         1.38       11.34   3.98    2.08
15         1.36       11.93   3.26    1.55
20         1.38       13.21   3.30    1.46

Table 6. Average GPE rate (%) for NTT database for babble noise
SNR [dB]   Proposed   PEFAC   BaNa    YIN
0          28.26      39.86   27.71   31.75
5          10.01      24.75   12.60   12.31
10         2.80       16.11   5.20    3.20
15         1.58       12.45   4.08    1.52
20         1.44       11.69   4.02    1.41

Table 7. Average GPE rate (%) for NTT database for train noise
SNR [dB]   Proposed   PEFAC     BaNa    YIN
0          14.98      25.28     10.91   20.32
5          4.66       16.3657   5.72    6.76
10         1.92       12.28     4.28    2.34
15         1.38       10.21     3.44    1.61
20         1.36       9.29      3.47    1.33

Table 8. Average GPE rate (%) for NTT database for HF-channel noise
SNR [dB]   Proposed   PEFAC   BaNa    YIN
0          5.73       18.91   5.97    14.32
5          2.34       13.13   4.52    4.84
10         1.62       11.00   4.41    2.02
15         1.49       10.72   4.29    1.48
20         1.45       10.06   4.13    1.39

Table 9.
Average FPE rate (Hz) for KEELE database for white noise
SNR [dB]   Proposed   PEFAC   BaNa    YIN
0          4.42       5.45    5.23    4.54
5          4.14       5.36    5.22    3.97
10         4.03       5.32    5.19    3.60
15         3.99       5.26    5.14    3.46
20         3.97       5.25    5.08    3.44

Table 10. Average FPE rate (Hz) for KEELE database for babble noise
SNR [dB]   Proposed   PEFAC   BaNa    YIN
0          4.54       5.62    5.29    4.12
5          4.28       5.49    5.18    3.79
10         4.10       5.38    5.11    3.59
15         4.01       5.30    5.09    3.50
20         3.98       5.24    5.08    3.50

Table 11. Average FPE rate (Hz) for KEELE database for train noise
SNR [dB]   Proposed   PEFAC   BaNa    YIN
0          4.48       5.51    5.30    3.96
5          4.24       5.40    5.15    3.68
10         4.06       5.33    5.11    3.53
15         3.98       5.31    5.05    3.45
20         3.95       5.27    5.03    3.44

Table 12. Average FPE rate (Hz) for KEELE database for HF channel noise
SNR [dB]   Proposed   PEFAC   BaNa    YIN
0          4.62       5.51    5.24    4.21
5          4.30       5.38    5.21    3.80
10         4.10       5.33    5.21    3.56
15         3.99       5.30    5.14    3.48
20         3.97       5.29    5.11    3.43
Table 13. Average FPE rate (Hz) for NTT database for white noise
SNR [dB]   Proposed   PEFAC   BaNa    YIN
0          3.01       3.42    2.39    3.82
5          2.69       3.34    2.20    2.59
10         2.53       3.25    2.09    2.16
15         2.49       3.20    2.00    2.03
20         2.49       3.15    1.95    1.99

Table 14. Average FPE rate (Hz) for NTT database for babble noise
SNR [dB]   Proposed   PEFAC   BaNa    YIN
0          2.26       3.88    2.69    3.09
5          2.40       3.52    2.25    2.42
10         2.50       3.31    2.03    2.15
15         2.50       3.21    1.93    2.02
20         2.48       3.16    1.84    1.99

Table 15. Average FPE rate (Hz) for NTT database for train noise
SNR [dB]   Proposed   PEFAC   BaNa    YIN
0          2.84       3.61    2.51    3.25
5          2.70       3.44    2.19    2.44
10         2.56       3.25    2.05    2.44
15         2.51       3.15    1.94    2.02
20         2.49       3.13    1.87    1.99

Table 16. Average FPE rate (Hz) for NTT database for HF channel noise
SNR [dB]   Proposed   PEFAC   BaNa    YIN
0          3.13       3.55    2.34    3.85
5          2.69       3.40    2.19    2.70
10         2.54       3.28    2.09    2.18
15         2.50       3.17    2.01    2.02
20         2.48       3.11    1.93    1.99

In the case of the KEELE database, the proposed approach consistently exhibits the lowest average GPE rate compared to the other techniques across almost all SNRs and all noise cases, except at low SNR (0 [dB]) in the train and HF channel noise cases. At 0 [dB] SNR in the train and HF channel noise cases, BaNa provides a slightly lower gross pitch error rate due to a processing strategy suited to the noise characteristics. For the NTT database, the proposed method shows almost the same behavior as for the KEELE database. Regarding the FPE in Tables 9-12 for the KEELE database, the proposed method provides a lower FPE (Hz) than PEFAC and BaNa at almost all SNRs in all noise cases, the exception being the YIN method; the proposed method is highly competitive with YIN except in the white noise case. For the NTT database, the FPE (Hz) of the proposed method is lower than that of PEFAC and YIN and highly competitive with BaNa, except in babble noise.
In babble noise, the proposed method shows superior performance compared with the other methods.

4. CONCLUSION
Accurately estimating pitch poses a challenge in speech analysis, especially in noisy environments. In this study, we introduced an improved method that excels at isolating noise from the waveform, particularly in babble noise scenarios, outperforming other techniques. This method exhibits a lower average GPE rate compared to alternative approaches, and it achieves this without any complicated post-processing. Additionally, it efficiently mitigates the impact of vocal tract effects by equalizing unnecessary ripples in the waveform. Across noise types and SNRs, our research also demonstrates that it is more robust than other traditional methods without requiring any complex post-processing. Future research might focus on creating a new pitch extraction technique that is even more effective in speech processing applications and highly resilient to extremely low SNR conditions across a range of real-world noise scenarios.

FUNDING INFORMATION
No funding involved.

AUTHOR CONTRIBUTIONS STATEMENT
This journal uses the Contributor Roles Taxonomy (CRediT) to recognize individual author contributions, reduce authorship disputes, and facilitate collaboration.
Name of Author: C, M, So, Va, Fo, I, R, D, O, E, Vi, Su, P, Fu
Humaira Sunzida
Nargis Parvin
Jafrin Akter Jeba
Sulin Chi
Md. Shiplu Ali
Moinur Rahman
Md. Saifur Rahman

C: Conceptualization, M: Methodology, So: Software, Va: Validation, Fo: Formal Analysis, I: Investigation, R: Resources, D: Data Curation, O: Writing - Original Draft, E: Writing - Review & Editing, Vi: Visualization, Su: Supervision, P: Project Administration, Fu: Funding Acquisition

CONFLICT OF INTEREST STATEMENT
Authors state no conflict of interest.

DATA AVAILABILITY
The authors confirm that the data supporting the findings of this study are available within the article.

REFERENCES
[1] S. S. Upadhya, "Pitch detection in time and frequency domain," Proceedings - 2012 International Conference on Communication, Information and Computing Technology, ICCICT 2012, 2012, doi: 10.1109/ICCICT.2012.6398150.
[2] M. S. Rahman, "Pitch extraction for speech signals in noisy environments," Ph.D. Dissertation, Department of Mathematics, Electronics, and Informatics, Saitama University, Saitama, Japan, 2020. [Online]. Available: https://sucra.repo.nii.ac.jp/record/19377/files/GD0001258.pdf.
[3] X. Zhang, H. Zhang, S. Nie, G. Gao, and W. Liu, "A pairwise algorithm using the deep stacking network for speech separation and pitch estimation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 6, pp. 1066-1078, 2016, doi: 10.1109/TASLP.2016.2540805.
[4] D. Wang, C. Yu, and J. H. L. Hansen, "Robust harmonic features for classification-based pitch estimation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 5, pp. 952-964, 2017, doi: 10.1109/TASLP.2017.2667879.
[5] D. Gerhard, "Pitch extraction and fundamental frequency: history and current techniques," Technical Report TR-CS, 2003. [Online].
Available: https://www.cs.bu.edu/fac/snyder/cs583/Literature and Resources/PitchExtractionMastersThesis.pdf.
[6] N. S. B. Ruslan, M. Mamat, R. R. Porle, and N. Parimon, "A comparative study of pitch detection algorithms for microcontroller based voice pitch detector," Advanced Science Letters, vol. 23, no. 11, pp. 11521–11524, 2017, doi: 10.1166/asl.2017.10320.
[7] L. Sukhostat and Y. Imamverdiyev, "A comparative analysis of pitch detection methods under the influence of different noise conditions," Journal of Voice, vol. 29, no. 4, pp. 410–417, 2015, doi: 10.1016/j.jvoice.2014.09.016.
[8] L. R. Rabiner, "On the use of autocorrelation analysis for pitch detection," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 25, no. 1, pp. 24–33, 1977, doi: 10.1109/TASSP.1977.1162905.
[9] A. Cohen et al., "Average magnitude difference function pitch extractor," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-22, no. 5, pp. 353–362, 1974, doi: 10.1109/TASSP.1974.1162598.
[10] R. Chakraborty, D. Sengupta, and S. Sinha, "Pitch tracking of acoustic signals based on average squared mean difference function," Signal, Image and Video Processing, vol. 3, no. 4, pp. 319–327, 2009, doi: 10.1007/s11760-008-0072-5.
[11] T. Shimamura and H. Kobayashi, "Weighted autocorrelation for pitch extraction of noisy speech," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 7, pp. 727–730, 2001, doi: 10.1109/89.952490.
[12] A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," The Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917–1930, 2002, doi: 10.1121/1.1458024.
[13] C. Shahnaz, W. P. Zhu, and M. O. Ahmad, "Pitch estimation based on a harmonic sinusoidal autocorrelation model and a time-domain matching scheme," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp.
322–335, 2012, doi: 10.1109/TASL.2011.2161579.
[14] H. Hajimolahoseini, R. Amirfattahi, S. Gazor, and H. Soltanian-Zadeh, "Robust estimation and tracking of pitch period using an efficient Bayesian filter," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 7, pp. 1219–1229, 2016, doi: 10.1109/TASLP.2016.2551041.
[15] W. Hu, X. Wang, and P. Gómez, "Robust pitch extraction in pathological voice based on wavelet and cepstrum," European Signal Processing Conference, pp. 297–300, 2015. [Online]. Available: https://new.eurasip.org/Proceedings/Eusipco/Eusipco2004/defevent/papers/cr1417.pdf.
[16] M. S. Rahman, Y. Sugiura, and T. Shimamura, "Utilization of windowing effect and accumulated autocorrelation function and power
spectrum for pitch detection in noisy environments," IEEJ Transactions on Electrical and Electronic Engineering, vol. 15, no. 11, pp. 1680–1689, 2020, doi: 10.1002/tee.23238.
[17] S. Gonzalez, "PEFAC - a pitch estimation algorithm robust to high levels of noise," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 2, pp. 518–530, 2014, doi: 10.1109/TASLP.2013.2295918.
[18] B. Li and X. Zhang, "A pitch estimation algorithm for speech in complex noise environments based on the Radon transform," IEEE Access, vol. 11, pp. 9876–9889, 2023, doi: 10.1109/ACCESS.2023.3240181.
[19] Z. Mnasri, S. Rovetta, and F. Masulli, "A novel pitch detection algorithm based on instantaneous frequency for clean and noisy speech," Circuits, Systems, and Signal Processing, vol. 41, no. 11, pp. 6266–6294, 2022, doi: 10.1007/s00034-022-02082-8.
[20] F. Huang and T. Lee, "Pitch estimation in noisy speech using accumulated peak spectrum and sparse estimation technique," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 1, pp. 99–109, 2013, doi: 10.1109/TASL.2012.2215589.
[21] W. Chu and A. Alwan, "SAFE: a statistical approach to F0 estimation under clean and noisy conditions," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 3, pp. 933–944, 2012, doi: 10.1109/TASL.2011.2168518.
[22] B. Gfeller, C. Frank, D. Roblek, M. Sharifi, M. Tagliasacchi, and M. Velimirovic, "SPICE: self-supervised pitch estimation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1118–1128, 2020, doi: 10.1109/TASLP.2020.2982285.
[23] S. Singh, R. Wang, and Y. Qiu, "DeepF0: end-to-end fundamental frequency estimation for music and speech signals," ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp. 61–65, 2021, doi: 10.1109/ICASSP39728.2021.9414050.
[24] N. Yang, H.
Ba, W. Cai, I. Demirkol, and W. Heinzelman, "BaNa: a noise resilient fundamental frequency detection algorithm for speech and music," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 1833–1848, 2014, doi: 10.1109/TASLP.2014.2352453.
[25] N. Ahmed, T. Natarajan, and K. R. Rao, "Discrete cosine transform," IEEE Transactions on Computers, vol. C-23, no. 1, pp. 90–93, Jan. 1974, doi: 10.1109/T-C.1974.223784.
[26] P. Duhamel and M. Vetterli, "Fast Fourier transforms: a tutorial review and a state of the art," Signal Processing, vol. 19, no. 4, pp. 259–299, Apr. 1990, doi: 10.1016/0165-1684(90)90158-U.
[27] F. J. Harris, "Time domain signal processing with the DFT," Handbook of Digital Signal Processing, pp. 633–699, 1987, doi: 10.1016/b978-0-08-050780-4.50013-8.
[28] F. Plante, G. Meyer, and W. Ainsworth, "A pitch extraction reference database," 4th European Conference on Speech Communication and Technology, pp. 837–840, 1995, doi: 10.21437/Eurospeech.1995-191.
[29] Y. Meng, "Speech recognition on DSP: algorithm optimization and performance analysis," Master thesis, Department of Electronic Engineering, The Chinese University of Hong Kong, Sha Tin, Hong Kong, 2004. [Online]. Available: http://www.ee.cuhk.edu.hk/ myuan/Thesis.pdf.
[30] NTT Advanced Technology Corp., 20 countries language database, NTT Advanced Technology Corp., 1988.
[31] A. Varga and H. J. M. Steeneken, "Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems," Speech Communication, vol. 12, no. 3, pp. 247–251, 1993, doi: 10.1016/0167-6393(93)90095-3.
[32] University of Rochester, "Wireless communication and networking group," hajim.rochester.edu. Accessed: Mar. 02, 2024. [Online]. Available: https://hajim.rochester.edu/ece/sites/wcng//code.html.
[33] M.
Brookes, "VOICEBOX: speech processing toolbox for MATLAB," ee.ic.ac.uk. Accessed: Mar. 02, 2024. [Online]. Available: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html.

BIOGRAPHIES OF AUTHORS

Humaira Sunzida obtained her B.Sc. (Engineering) degree in Information and Communication Technology from Comilla University, Cumilla, Bangladesh, in 2024. She started her undergraduate studies in the Department of Information and Communication Technology at Comilla University in 2019. Her current research interests encompass speech analysis and digital signal processing. She can be contacted at email: humairasunzida.311@stud.cou.ac.bd.

Nargis Parvin received her B.Sc. (Honours) and M.Sc. degrees in Information and Communication Engineering from the University of Rajshahi, Rajshahi, Bangladesh, in 2006 and 2007, respectively. In 2013, she joined as a Lecturer in the Department of Computer Science and Engineering, Bangladesh Army International University of Science and Technology (BAIUST), Cumilla Cantonment, Cumilla, Bangladesh, where she is currently serving as Assistant Professor. She pursued her Ph.D. degree in the field of wireless sensor networks (WSN) at the Graduate School of Science and Engineering at Saitama University, Japan. Her research interests include wireless sensor networks, speech analysis, and digital signal processing. She can be contacted at email: nargis.cse@baiust.ac.bd.