
Single-channel speech enhancement algorithm based on mask estimation and optimization

2019-11-15 04:49 GE Wanying, ZHANG Tianqi
Journal of Computer Applications, 2019, Issue 10

GE Wanying, ZHANG Tianqi

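Abstract: Single-channel speech enhancement algorithms obtain enhanced speech by estimating and suppressing the noise components in noisy speech. However, noise estimation algorithms tend to over-estimate, so that part of the estimated noise energy is larger than the actual value. Although these over-estimates can be removed by compensation, the error this introduces also lowers the overall quality of the enhanced speech. To address this problem, a time-frequency mask estimation and optimization algorithm based on Computational Auditory Scene Analysis (CASA) is proposed. First, the a priori Signal-to-Noise Ratio (SNR) is estimated with the Decision-Directed (DD) algorithm and an initial mask is computed. Second, the Inter-Channel Correlation (ICC) coefficient between the noise and the noisy speech within each Gammatone band is used to compute the noise presence probability, which is combined with the energy spectrum of the noisy speech to obtain a new noise estimate with fewer over-estimated components. Then, an optimization algorithm iterates on the initial mask to reduce the error caused by noise over-estimation and to increase the target speech components in the mask; the iteration stops once the stopping condition is met, yielding a new mask. Finally, the enhanced speech is synthesized with the new mask. Experimental results show that, under different background noises, the new mask gives the enhanced speech higher Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI) values than the mask before optimization, improving both the listening quality and the intelligibility of the speech.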

Keywords: computational auditory scene analysis; speech enhancement; time-frequency mask; noise estimation; mask optimization; speech intelligibility

CLC number: TN912.35

Document code: A

Abstract: Monaural speech enhancement algorithms obtain enhanced speech by estimating and suppressing the noise components in noisy speech. However, over-estimation of the noise power, and the error introduced when compensating for it, degrade the enhanced speech. To constrain the distortion caused by noise over-estimation, a time-frequency mask estimation and optimization algorithm based on Computational Auditory Scene Analysis (CASA) was proposed. Firstly, the Decision-Directed (DD) algorithm was used to estimate the a priori Signal-to-Noise Ratio (SNR) and calculate the initial mask. Secondly, the Inter-Channel Correlation (ICC) factor between the noise and the noisy speech in each Gammatone filterbank channel was used to calculate the noise presence probability; a new noise estimate was obtained by combining this probability with the power spectrum of the noisy speech, which reduced the over-estimation in the original noise estimate. Thirdly, the initial mask was iterated by the optimization algorithm to reduce the error caused by noise over-estimation and to increase the target speech components in the mask, and the new mask was obtained when the iteration stopped with the conditions met. Finally, the enhanced speech was synthesized by using the new mask. Experimental results demonstrate that, compared with the mask before optimization, the new mask gives the enhanced speech higher Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI) values, improving the intelligibility and listening quality of the speech.

Key words: computational auditory scene analysis; speech enhancement; time-frequency mask; noise estimation; mask optimization; speech intelligibility

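0 Introduction

Speech enhancement is a front-end processing technique whose goal is to extract the target speech from speech corrupted by noise. According to the number of receiving microphones, speech enhancement methods can be divided into single-channel and multi-channel methods. Compared with multi-channel methods, single-channel methods are low-cost and easy to implement, and are widely used in communications, speech recognition and other fields. Traditional single-channel speech enhancement methods include spectral subtraction [1], Wiener filtering [2] and subspace algorithms [3].

In recent years, researchers have proposed Computational Auditory Scene Analysis (CASA) by imitating the way the human ear processes sound signals; the Gammatone filterbank is an auditory model used to simulate the human cochlea. Speech signals processed by the filterbank can yield better results than with traditional approaches. CASA-based speech enhancement algorithms usually construct a mask that separates target speech from background noise based on features such as the pitch period [4], and then obtain the enhanced speech signal. In single-channel speech enhancement, the noise energy must be estimated; however, because noise is random, over-estimation occurs during estimation, which lowers the overall quality of the enhanced speech [6-7]. Reference [6] exploits the nonlinear frequency characteristics of the Gammatone filterbank, computes the cross-correlation coefficient between the noise and the noisy speech within each band of the filterbank, reduces the over-estimated components in the estimated noise, and then uses a convex optimization algorithm to iteratively obtain an estimate of the speech energy spectrum. However, after obtaining the speech energy spectrum, that algorithm needs a further clustering step and uses the computed mask to recover the enhanced speech. Because of limited clustering accuracy, the recovered enhanced speech is usually lacking in listening quality and intelligibility.

To address these problems, this paper proposes a time-frequency mask estimation and optimization algorithm that combines the Decision-Directed (DD) algorithm [8] with the Inter-Channel Correlation (ICC) coefficient [6]. First, an initial mask estimate is obtained with the DD algorithm. Next, the cross-correlation coefficient between the noise and the noisy speech within each band is computed to obtain the noise presence probability. Then, an objective function is determined from the properties of the mask and, together with the results of the first two steps, an optimization algorithm reduces the error in the initial mask. Finally, the new mask is used to remove the noise from the noisy speech, giving the enhanced speech.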

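1 Principle of speech enhancement

In general, noisy speech is the sum of speech and additive noise:

After Gammatone filterbank filtering, the sound signal is split into 64 bands of different bandwidths; the center frequency and bandwidth of each band are determined by the Equivalent Rectangular Bandwidth (ERB) method [9]. The signal in each band is windowed and framed to obtain a sequence of time-frequency units, and the energy of each time-frequency unit is computed to obtain the energy spectrum of the signal [4]. Assuming that the noise and the target speech are independent, the signal energy after filterbank processing in the time-frequency unit at time frame t and band center frequency f is expressed as: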
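The band layout and the per-unit energy described above can be illustrated with a short sketch. It assumes the widely used Glasberg-Moore ERB formulas, a 50 Hz to 8000 Hz range, a rectangular window, and 20 ms frames with 50% overlap at 16 kHz (320 samples, hop 160); the function names are hypothetical, and the Gammatone filtering step itself is omitted. Under the independence assumption stated above, the energy of the noisy signal in a time-frequency unit is commonly modeled as the sum of the speech and noise energies in that unit.

```python
import math

def erb_rate(f_hz):
    """Glasberg-Moore ERB-rate scale: number of ERBs below f_hz."""
    return 21.4 * math.log10(4.37e-3 * f_hz + 1.0)

def erb_rate_inv(e):
    """Inverse of erb_rate: frequency in Hz for an ERB-rate value."""
    return (10.0 ** (e / 21.4) - 1.0) / 4.37e-3

def erb_bandwidth(f_hz):
    """Equivalent rectangular bandwidth in Hz at center frequency f_hz."""
    return 24.7 * (4.37e-3 * f_hz + 1.0)

def center_frequencies(n_channels=64, f_low=50.0, f_high=8000.0):
    """Center frequencies spaced uniformly on the ERB-rate scale.
    The 50-8000 Hz range is an assumption, not taken from the paper."""
    lo, hi = erb_rate(f_low), erb_rate(f_high)
    step = (hi - lo) / (n_channels - 1)
    return [erb_rate_inv(lo + i * step) for i in range(n_channels)]

def frame_energies(x, frame_len=320, hop=160):
    """Energy of each frame of one subband signal (rectangular window)."""
    if len(x) < frame_len:
        return []
    n_frames = 1 + (len(x) - frame_len) // hop
    return [sum(v * v for v in x[i * hop:i * hop + frame_len])
            for i in range(n_frames)]
```

Applying frame_energies to the output of each of the 64 band filters gives one row of the energy spectrum per band; the bandwidths returned by erb_bandwidth grow with center frequency, which is what makes the band spacing nonlinear.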

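CASA speech enhancement uses a mask together with the noisy speech to synthesize the enhanced speech in the time domain [5-6,10]. The Ideal Binary Mask (IBM) is a commonly used mask: its value is 1 when speech energy dominates the current time-frequency unit and 0 otherwise. Enhanced speech obtained with the IBM keeps the parts where the target speech dominates and removes the rest. Another mask is the Ideal Ratio Mask (IRM), which takes values between 0 and 1, with larger values in speech regions than in noise regions.

Compared with the IBM, enhanced speech obtained with the IRM preserves weak speech components mixed in with the noise and has higher speech quality. The mask computed in this paper is therefore the ideal ratio mask, given by: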
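The difference between the two mask types can be seen on a toy energy grid. The sketch below assumes the commonly used definitions, IBM = 1 where speech energy exceeds noise energy and IRM = S/(S+N) computed on energies; the paper's own equation is not reproduced in this text, so the exact form (for example an exponent applied to the ratio) is an assumption, as are the function names.

```python
def ibm_unit(s, n):
    """Binary mask value for one time-frequency unit: 1 if speech dominates."""
    return 1.0 if s > n else 0.0

def irm_unit(s, n, eps=1e-12):
    """Ratio mask value for one unit, assumed form S / (S + N)."""
    return s / (s + n + eps)

def mask_map(unit_fn, speech_energy, noise_energy):
    """Apply a per-unit mask function over [frame][band] energy grids."""
    return [[unit_fn(s, n) for s, n in zip(s_row, n_row)]
            for s_row, n_row in zip(speech_energy, noise_energy)]

# Toy grid: 2 frames x 2 bands of speech and noise energy.
S = [[4.0, 0.1], [1.0, 0.0]]
N = [[1.0, 1.0], [1.0, 2.0]]
ibm = mask_map(ibm_unit, S, N)
irm = mask_map(irm_unit, S, N)
```

In the unit with speech energy 0.1 and noise energy 1.0, the binary mask is 0 and discards the weak speech entirely, while the ratio mask keeps a small fraction of it, which is the property the text attributes to the IRM.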

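2 Mask estimation and optimization

2.1 Overall framework of the algorithm

The overall framework of the proposed algorithm is shown in Fig. 1.

The algorithm consists of two parts: mask estimation and mask optimization. In the mask estimation part, after the noise is estimated and the a posteriori SNR is computed, the a priori SNR is obtained by maximum likelihood estimation, and the initial mask is then computed. In the mask optimization part, the noisy speech and the estimated noise signal, after passing through the Gammatone filterbank, are framed and windowed and then transformed by the discrete Fourier transform, and the cross-correlation coefficient between the noisy speech and the noise signal within each band is computed. To correct the effect of noise over-estimation on the initial mask, the resulting cross-correlation coefficient is taken as the noise presence probability and combined with the noisy speech to obtain the optimization target; the initial mask is then processed iteratively using this target, which reduces the bias caused by over-estimation while increasing the target speech components contained in the mask. Finally, the optimized new mask is used to synthesize the enhanced speech.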
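The mask estimation stage can be sketched for a single band as follows. This is the classic decision-directed recursion with a Wiener-type gain, not the modified DD of Reference [8], whose equations are not reproduced in this text. The smoothing factor alpha = 0.98 is an assumed textbook value, and using Gmin = 0.178 as a floor on the mask is an assumption about how that parameter is applied; the function name is hypothetical.

```python
def dd_initial_mask(noisy_energy, noise_energy, alpha=0.98, g_min=0.178,
                    eps=1e-12):
    """Initial ratio mask for one band from per-frame energies.

    noisy_energy, noise_energy: sequences of per-frame energies.
    alpha: DD smoothing factor (assumed value, not from the paper).
    g_min: lower bound on the mask (assumed role of Gmin).
    """
    mask = []
    prev_ratio = None  # previous frame's estimated speech-to-noise ratio
    for y, n in zip(noisy_energy, noise_energy):
        n = max(n, eps)
        gamma = y / n                  # a posteriori SNR
        ml = max(gamma - 1.0, 0.0)     # maximum likelihood a priori SNR
        if prev_ratio is None:
            xi = ml                    # first frame: no history yet
        else:
            xi = alpha * prev_ratio + (1.0 - alpha) * ml
        g = max(xi / (1.0 + xi), g_min)  # Wiener-type mask with floor
        mask.append(g)
        prev_ratio = g * y / n         # masked energy over noise energy
    return mask
```

Because alpha is close to 1, the estimate follows the previous frame closely, which smooths the mask over time but makes it lag when the SNR changes abruptly.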

2.2 Mask estimation

From Eqs. (2)-(3), the relation between the mask and the speech energy is:

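2.3.3 Mask optimization

Because the speech energy takes values in (0,+∞) and the energy differs greatly between time-frequency units, computing Ŝ(t, f) at each iteration is very costly. In addition, to remove the influence of clustering accuracy and of the binary mask on the algorithm, this paper uses the ratio mask value in place of the energy value in Eq. (14) as the optimization target: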
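One way to picture optimization in the mask domain is the toy sketch below. It is not the paper's update rule: Eqs. (14)-(24) are not reproduced in this text, so the target mask 1 - p (from a noise presence probability p), the squared-error objective, and the names step, tol and max_iter are all assumptions made for illustration. The correlation helper is a plain normalized cross-correlation, used here as a stand-in for the ICC coefficient.

```python
import math

def normalized_correlation(a, b, eps=1e-12):
    """Normalized cross-correlation of two equal-length sequences.
    For non-negative inputs (e.g. magnitudes) the result lies in [0, 1]."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / (den + eps)

def refine_mask(initial_mask, noise_prob, step=0.5, tol=1e-4, max_iter=200):
    """Clipped gradient descent on 0.5 * (m - target)^2 per unit.

    initial_mask, noise_prob: non-empty sequences of equal length.
    target = 1 - noise_prob is an assumed target, not the paper's.
    """
    target = [min(max(1.0 - p, 0.0), 1.0) for p in noise_prob]
    mask = list(initial_mask)
    for _ in range(max_iter):
        # Gradient step toward the target, then clip to the mask range.
        new_mask = [min(max(m - step * (m - t), 0.0), 1.0)
                    for m, t in zip(mask, target)]
        change = max(abs(a - b) for a, b in zip(new_mask, mask))
        mask = new_mask
        if change < tol:  # stop once the largest update is negligible
            break
    return mask
```

Every quantity here stays in [0, 1], so a single step size and stopping threshold suit all time-frequency units, whereas energies spanning several orders of magnitude would need per-unit scaling.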

3 Experiments and result analysis

3.1 Experimental parameters and evaluation metrics

The simulation experiments use speech signals from the TIMIT database [14]. The signals are sampled at 16 kHz with 16-bit quantization and are about 2 s long. The noise is taken from the NOISEX-92 database [15]: Babble, Engine and White noise. The proposed algorithm is tested at input SNRs from -5 dB to 5 dB in steps of 1 dB.
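Test mixtures at a prescribed input SNR are typically built by scaling the noise before adding it to the clean speech. The sketch below shows one such construction; the function name and the choice to truncate the noise to the speech length are assumptions, and the paper does not state how its mixtures were generated.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Return (mixture, scale) so that 10*log10(Ps / Pn_scaled) == snr_db.

    speech, noise: sequences of samples; noise must be at least as long
    as speech and is truncated to the speech length (assumed convention).
    """
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    n = noise[:len(speech)]
    p_s = sum(v * v for v in speech) / len(speech)  # mean speech power
    p_n = sum(v * v for v in n) / len(n)            # mean noise power
    scale = math.sqrt(p_s / (p_n * 10.0 ** (snr_db / 10.0)))
    mixture = [s + scale * v for s, v in zip(speech, n)]
    return mixture, scale

# The 11 input SNR conditions named in the text: -5 dB to 5 dB, 1 dB apart.
snr_grid = list(range(-5, 6))
```

Lower target SNRs give a larger scale factor, so the same noise recording covers the whole test grid.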

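The experiments use a 4th-order, 64-band Gammatone filterbank; each frame is 20 ms long with 50% frame overlap. The parameters are chosen as follows: in Eq. (10), a=-2, c=2.7, ζ=0.015; in Eq. (12), Gmin=0.178; in Eq. (19), λ=0.02; in Eq. (22), μ=0.01; in Eq. (23), θ=0.3; in Eq. (24), n1=1, n2=1.

This paper uses the algorithm of Reference [16] to estimate the noise signal in the time domain, and takes that algorithm and the algorithm of Reference [6] as the comparison algorithms. Besides the segmental Signal-to-Noise Ratio (segSNR), the evaluation metrics are Perceptual Evaluation of Speech Quality (PESQ) [17] and the Short-Time Objective Intelligibility measure (STOI) [18]. segSNR computes the SNR of each frame of the signal and takes the average; a higher value means better noise suppression. PESQ represents the perceived listening quality of the enhanced speech; a higher score indicates better listening quality. STOI reflects the degree of distortion of the enhanced speech; a larger value indicates less distortion caused by the algorithm and higher intelligibility.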
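The segSNR definition above (per-frame SNR, then the mean) can be sketched as follows. The non-overlapping 320-sample frames and the clamping of each frame to [-10, 35] dB follow common practice and are assumptions; the paper does not state its framing or limits for this metric, and the function name is hypothetical.

```python
import math

def segmental_snr(clean, enhanced, frame_len=320, lo_db=-10.0, hi_db=35.0,
                  eps=1e-12):
    """Mean of per-frame SNRs in dB between clean and enhanced signals.

    Frames are non-overlapping; each frame SNR is clamped to
    [lo_db, hi_db] (assumed limits) before averaging.
    """
    n = min(len(clean), len(enhanced))
    values = []
    for start in range(0, n - frame_len + 1, frame_len):
        s = clean[start:start + frame_len]
        e = enhanced[start:start + frame_len]
        sig = sum(v * v for v in s)                       # frame signal energy
        err = sum((a - b) ** 2 for a, b in zip(s, e))     # frame error energy
        snr = 10.0 * math.log10((sig + eps) / (err + eps))
        values.append(min(max(snr, lo_db), hi_db))
    return sum(values) / len(values) if values else float("nan")
```

Clamping keeps silent frames and perfectly reconstructed frames from dominating the average, which is why the limits are usually applied before taking the mean.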

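3.2 Results and analysis

Tables 1-2 give the average PESQ and STOI values obtained before and after mask optimization under the three background noises. The comparison shows that mask optimization improves both the listening quality and the intelligibility of the enhanced speech, with the improvement in both metrics most evident under Engine noise.

Table 3 gives the average segSNR obtained with the two masks. Under Babble and White noise, the segSNR after mask optimization is lower than before optimization; that is, the proposed mask optimization algorithm cannot effectively suppress noise. The reason is that, although the estimated inter-channel cross-correlation coefficient is similar to the true value (see Fig. 4), their ranges differ in the low-frequency part, so the computed noise presence probability is always larger than the actual probability. As a result, compared with the initial mask, the optimized mask retains more noise components when synthesizing the enhanced speech.

Comparing the PESQ values of the three algorithms in Table 4 shows that the proposed algorithm obtains relatively higher PESQ values than the comparison algorithms, except under Babble noise, where its result is lower than that of the algorithm in Reference [6]. This is because Reference [6] uses a mask that combines the IBM and the IRM, with the IBM making up the larger share of the final mask, so that its algorithm can remove the excess noise components under babble-like noise and obtain better perceived quality. In the other noise environments, the proposed algorithm achieves better perceived quality than the comparison algorithms.

Table 6 lists the segSNR values obtained by the three algorithms. It shows that the algorithm of Reference [16] has the best noise suppression performance, while both the algorithm of Reference [6] and the proposed algorithm, which use the cross-correlation coefficient within Gammatone bands, obtain lower segSNR values. On the one hand, this is because the inter-channel cross-correlation coefficient differs from the true value; on the other hand, the improved DD algorithm does not distinguish the low- and high-frequency components of speech, and its estimate of prop is inaccurate when the instantaneous SNR is low, so the initial mask still retains some noise components. Solving this problem is part of the next stage of this research. However, speech enhancement algorithms place more emphasis on improving listening quality and intelligibility than on noise suppression [6,19]; according to the PESQ and STOI results, the proposed algorithm outperforms the comparison algorithms in these two respects.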

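4 Conclusion

To address the problem that over-estimation in traditional noise estimation algorithms degrades the overall quality of enhanced speech in single-channel speech enhancement, this paper proposes a single-channel speech enhancement algorithm based on time-frequency mask estimation and optimization. After obtaining the initial mask, the algorithm uses iterative optimization to increase the target speech components in it. Experimental results show that, although the algorithm cannot improve the noise suppression performance of the initial mask, it clearly improves on the comparison algorithms in the other two key metrics (PESQ and STOI), which indicates that it can effectively improve the listening quality and intelligibility of the enhanced speech.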

References

[1] CAO L, ZHANG T Q, GAO H X, et al. Multi-band spectral subtraction method for speech enhancement based on masking property of human auditory system[J]. Computer Engineering and Design, 2013, 34(1): 235-240.

[2] LI J B, MA Y B, XIA J, et al. An improved Wiener filtering speech enhancement algorithm based on modified cepstrum smooth technology[J]. Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition), 2016, 28(4): 462-467.

[3] BOROWICZ A, PETROVSKY A. Signal subspace approach for psychoacoustically motivated speech enhancement[J]. Speech Communication, 2011, 53(2): 210-219.

[4] HU K, WANG D. Unvoiced speech segregation from nonspeech interference via CASA and spectral subtraction[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2011, 19(6): 1600-1609.

[5] WANG Y, NARAYANAN A, WANG D, et al. On training targets for supervised speech separation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2014, 22(12): 1849-1858.

[6] BAO F, ABDULLA W H. Noise masking method based on an effective ratio mask estimation in Gammatone channels[J]. APSIPA Transactions on Signal and Information Processing, 2018, 7(e5):1-12.

[7] SUN M, LI Y, GEMMEKE J F, et al. Speech enhancement under low SNR conditions via noise estimation using sparse and low-rank NMF with Kullback-Leibler divergence[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015, 23(7): 1233-1242.

[8] NAHMA L, YONG P C, DAM H H, et al. Convex combination framework for a priori SNR estimation in speech enhancement[C]// Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway, NJ: IEEE, 2017: 4975-4979.

[9] JIANG Y, LIU R S, FENG Z M. Dual-microphone speech enhancement algorithm based on the auditory features for a close-talk system[J]. Journal of Tsinghua University (Science and Technology), 2014, 54(9): 1179-1183.

[10] BAO F, ABDULLA W H. A new ratio mask representation for CASA-based speech enhancement[J]. IEEE/ACM Transactions on Audio, Speech and Language Processing, 2019, 27(1): 7-19.

[11] YONG P C, NORDHOLM S, DAM H H, et al. On the optimization of sigmoid function for speech enhancement[C]// Proceedings of the 19th European Signal Processing Conference. Piscataway: IEEE, 2011: 211-215.

[12] CHEN Z, HOHMANN V. Online monaural speech enhancement based on periodicity analysis and a priori SNR estimation[J]. IEEE/ACM Transactions on Audio, Speech and Language Processing, 2015, 23(11): 1904-1916.

[13] ZHENG C, TAN Z, PENG R, et al. Guided spectrogram filtering for speech dereverberation[J]. Applied Acoustics, 2018, 134(5): 154-159.

[14] GAROFOLO J S, LAMEL L F, FISHER W M, et al. TIMIT Acoustic-Phonetic Continuous Speech Corpus[EB/OL]. [2019-01-12]. https://catalog.ldc.upenn.edu/LDC93S1.

[15] VARGA A, STEENEKEN H J M. Assessment for automatic speech recognition II: NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems[J]. Speech Communication, 1993, 12(3): 247-251.

[16] GERKMANN T, HENDRIKS R C. Unbiased MMSE-based noise power estimation with low complexity and low tracking delay[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2012, 20(4): 1383-1393.

[17] International Telecommunications Union (ITU). Perceptual Evaluation of Speech Quality (PESQ): an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs[EB/OL]. [2019-01-12]. https://www.itu.int/rec/T-REC-P.862-200102-I/en.

[18] TAAL C H, HENDRIKS R C, HEUSDENS R, et al. An algorithm for intelligibility prediction of time-frequency weighted noisy speech[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2011, 19(7): 2125-2136.

[19] LOIZOU P C, KIM G. Reasons why current speech-enhancement algorithms do not improve speech intelligibility and suggested solutions[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2011, 19(1): 47-56.

This work is partially supported by the National Natural Science Foundation of China (61671095, 61702065, 61701067, 61771085), the Project of Key Laboratory of Signal and Information Processing of Chongqing (CSTC2009CA2003), the Chongqing Graduate Research and Innovation Project (CYS17219), and the Research Project of Chongqing Educational Commission (KJ1600427, KJ1600429).

GE Wanying, born in 1994, M. S. candidate. His research interests include signal processing, speech enhancement.

ZHANG Tianqi, born in 1971, Ph. D., professor. Her research interests include spread spectrum communications, blind signal processing, speech signal processing.
