
Speech enhancement algorithm based on multi-scale ladder-type time-frequency Conformer GAN


JIN Yutang1, WANG Yisong1*, WANG Lihui1, ZHAO Pengli2

(1.公共大數(shù)據(jù)國(guó)家重點(diǎn)實(shí)驗(yàn)室(貴州大學(xué)),貴陽(yáng),550025; 2.許昌電氣職業(yè)學(xué)院,河南 許昌 461000)( ? 通信作者電子郵箱yswang@gzu.edu.cn)

針對(duì)頻率域語(yǔ)音增強(qiáng)算法中因相位混亂產(chǎn)生人工偽影,導(dǎo)致去噪性能受限、語(yǔ)音質(zhì)量不高的問題,提出一種基于多尺度階梯型時(shí)頻Conformer生成對(duì)抗網(wǎng)絡(luò)(MSLTF-CMGAN)的語(yǔ)音增強(qiáng)算法。將語(yǔ)音語(yǔ)譜圖的實(shí)部、虛部和振幅譜作為輸入,生成器首先在多個(gè)尺度上利用時(shí)間-頻率Conformer學(xué)習(xí)時(shí)域和頻域的全局及局部特征依賴;其次,利用Mask Decoder分支學(xué)習(xí)振幅掩碼,而Complex Decoder分支則直接學(xué)習(xí)干凈的語(yǔ)譜圖,融合這兩個(gè)Decoder分支的輸出可得到重建后的語(yǔ)音;最后,利用指標(biāo)判別器判別語(yǔ)音的評(píng)價(jià)指標(biāo)得分,通過極大極小訓(xùn)練使生成器生成高質(zhì)量的語(yǔ)音。采用主觀評(píng)價(jià)平均意見得分(MOS)和客觀評(píng)價(jià)指標(biāo)在公開數(shù)據(jù)集VoiceBank+Demand上與各類語(yǔ)音增強(qiáng)模型進(jìn)行對(duì)比,結(jié)果顯示,所提算法的MOS信號(hào)失真(CSIG)和MOS噪聲失真(CBAK)比目前最先進(jìn)的方法CMGAN(基于Conformer的指標(biāo)生成對(duì)抗網(wǎng)絡(luò)語(yǔ)音增強(qiáng)模型)分別提高了0.04和0.07,盡管它的MOS整體語(yǔ)音質(zhì)量(COVL)和語(yǔ)音質(zhì)量的感知評(píng)估(PESQ)略低于CMGAN,但與其他對(duì)比模型相比在多項(xiàng)主客觀語(yǔ)音質(zhì)量評(píng)估方面的評(píng)分均處于領(lǐng)先水平。

語(yǔ)音增強(qiáng);多尺度;Conformer;生成對(duì)抗網(wǎng)絡(luò);指標(biāo)判別器;深度學(xué)習(xí)

0 Introduction

語(yǔ)音增強(qiáng)是去除環(huán)境噪聲、共振噪聲、電磁噪聲等干擾的重要手段,是語(yǔ)音分析和識(shí)別的關(guān)鍵技術(shù)[1],旨在通過去除音頻中的混合噪聲以恢復(fù)高質(zhì)量和高可懂度的語(yǔ)音。

傳統(tǒng)語(yǔ)音增強(qiáng)方法包括譜減法[2]、維納濾波[3]、基于統(tǒng)計(jì)模型的方法[4]和子空間算法[5]等,基本思想是假定加性噪聲和短時(shí)平穩(wěn)的語(yǔ)音信號(hào)相互獨(dú)立的條件下,從帶噪語(yǔ)音中去除噪聲;但這類方法只能處理平穩(wěn)噪聲,在處理非平穩(wěn)噪聲和低信噪比信號(hào)時(shí)性能大幅降低。隨著深度學(xué)習(xí)的出現(xiàn),基于數(shù)據(jù)驅(qū)動(dòng)的語(yǔ)音增強(qiáng)技術(shù)成為主要的研究趨勢(shì)。自20世紀(jì)80年代起就有將神經(jīng)網(wǎng)絡(luò)用于語(yǔ)音增強(qiáng)的方法[6],隨后也出現(xiàn)了基于深度神經(jīng)網(wǎng)絡(luò)(Deep Neural Network, DNN)、循環(huán)神經(jīng)網(wǎng)絡(luò)(Recurrent Neural Network, RNN)、卷積神經(jīng)網(wǎng)絡(luò)(Convolutional Neural Network, CNN)和生成對(duì)抗網(wǎng)絡(luò)(Generative Adversarial Network, GAN)的方法。

早期基于深度學(xué)習(xí)的語(yǔ)音增強(qiáng)方法通過短時(shí)傅里葉變換(Short Time Fourier Transform, STFT)將一維的語(yǔ)音信號(hào)轉(zhuǎn)換為二維的頻率域語(yǔ)譜進(jìn)行去噪,但由于相位混亂且缺乏時(shí)間和結(jié)構(gòu)上的規(guī)律性,會(huì)對(duì)語(yǔ)音增強(qiáng)帶來巨大干擾,因此許多方法主要關(guān)注重建振幅特征而忽略了相位分量。如Wang等[7-8]最早將深度學(xué)習(xí)應(yīng)用到語(yǔ)音增強(qiáng)任務(wù),使用DNN學(xué)習(xí)理想二值掩蔽值(Ideal Binary Mask, IBM),直接將有噪語(yǔ)音信號(hào)映射到干凈語(yǔ)音信號(hào),但DNN存在參數(shù)量大、難以提取上下文特征等問題。隨后,Weninger等[9]利用RNN對(duì)語(yǔ)音上下文特征信息進(jìn)行建模,又進(jìn)一步采用長(zhǎng)短期記憶(Long Short-Term Memory, LSTM)神經(jīng)網(wǎng)絡(luò)對(duì)語(yǔ)音信號(hào)進(jìn)行去噪重建[10];但RNN存在訓(xùn)練時(shí)間長(zhǎng)、網(wǎng)絡(luò)規(guī)模大、難以并行化處理等缺點(diǎn)。Park等[11]提出了基于CNN的編碼-解碼器網(wǎng)絡(luò)RCED,通過輸入前幾幀的帶噪振幅譜預(yù)測(cè)當(dāng)前干凈的振幅譜。與RNN相比,這種基于CNN的模型有更小的參數(shù)量、訓(xùn)練時(shí)間更短,但存在感受野受限、提取上下文特征能力弱等問題。為了緩解傳統(tǒng)CNN的問題,張?zhí)祢U等[12]采用門控機(jī)制和擴(kuò)張卷積神經(jīng)網(wǎng)絡(luò),在擴(kuò)大感受野的基礎(chǔ)上,門控機(jī)制可以較好地提取上下文特征。

However, a growing body of research shows that jointly optimizing the magnitude and phase components of the spectrogram substantially improves the subjective and objective perceptual quality of speech, especially at low signal-to-noise ratios [13]. To cover both magnitude and phase information and avoid the phase disorder problem, some works convert the frequency-domain spectrum into the Cartesian coordinate system to obtain the complex-domain spectrum; enhancing the complex-domain features implicitly enhances both the magnitude and the phase. For example, Tan et al. [14] combined CNN and RNN and proposed a Gated Convolutional Recurrent Network (GCRN), in which the gated recurrent part extracts contextual features well while the receptive field is enlarged, reconstructing a clean complex spectrum. Other studies enhance the raw one-dimensional speech signal directly; for example, Parveen et al. [15] proposed the Wave-U-Net model, which uses a one-dimensional U-Net architecture to learn the mapping from noisy to clean speech in a supervised manner and uses dilated convolutions to enlarge the receptive field over the speech signal while keeping the number of parameters small. To improve the realism and intelligibility of reconstructed speech, some studies use GANs for speech enhancement. Pascual et al. [16] proposed the Speech Enhancement Generative Adversarial Network (SEGAN), whose CNN-based generator extracts features from the raw one-dimensional audio and maps it directly to a clean one-dimensional signal; the adversarial loss from the discriminator also indirectly improves the quality of the reconstructed speech, but because the discriminator does not learn a criterion corresponding to speech quality, the improvement over traditional methods is limited. To address this, Fu et al. [17] proposed MetricGAN, a metric generative adversarial network for speech enhancement, in which the discriminator learns the evaluation-metric function so that it can predict audio quality in place of the metric, solving the problem that evaluation metrics are non-differentiable; meanwhile, the generator is optimized to raise the metric scores of its output, reconstructing speech with higher perceptual quality and intelligibility.

In recent years, the Transformer [18], with its parallelism and ability to model long-term dependencies, has achieved success in speech recognition, natural language processing and image segmentation, and has also been applied to speech enhancement. Kim et al. [19] introduced a Gaussian weighting matrix into the Transformer and proposed a Gaussian-weighted attention mechanism: the network learns a mask of the magnitude spectrum, which is multiplied with the noisy magnitude spectrum to estimate the clean magnitude, and the clean signal is then reconstructed with the original noisy phase through the Inverse Short-Time Fourier Transform (ISTFT). To better capture local features and global dependencies in audio, Gulati et al. [20] proposed Conformer, a model combining CNN and Transformer, which achieved remarkable results in speech recognition. Cao et al. [21] proposed CMGAN, a Conformer-based MetricGAN for speech enhancement on the magnitude and complex spectra. The generator of CMGAN contains two-stage time-frequency Conformer modules that capture long-range dependencies and local features in both the time and frequency domains, and its metric discriminator plays the same role as in MetricGAN, helping to improve the quality of the generated speech without adversely affecting other metrics.

本文針對(duì)頻率域語(yǔ)音增強(qiáng)中因相位混亂產(chǎn)生人工偽影、去噪性能受限、語(yǔ)音質(zhì)量不高的問題,提出一種多尺度階梯型時(shí)頻Conformer生成對(duì)抗網(wǎng)絡(luò)(Multi-Scale Ladder-type Time-Frequency CMGAN, MSLTF-CMGAN)算法用于單通道語(yǔ)音增強(qiáng)。MSLTF-CMGAN由一個(gè)生成器和一個(gè)指標(biāo)判別器組成:生成器采用編碼-解碼器結(jié)構(gòu),包含一個(gè)編碼器、一個(gè)多尺度階梯型時(shí)頻Conformer(MultiScale Ladder-type Time-frequency Conformer, MSLTFC)模塊和兩個(gè)解碼器(掩碼解碼器和復(fù)數(shù)解碼器),用于生成增強(qiáng)后的復(fù)數(shù)域頻譜;指標(biāo)判別器由擴(kuò)張卷積網(wǎng)絡(luò)組成,用于預(yù)測(cè)重建語(yǔ)音的音頻質(zhì)量。

The main work of this paper includes:

1) A speech enhancement algorithm based on MSLTF-CMGAN is proposed, which performs local-global representation learning on the spectrum at multiple scales, learning local detail features while capturing global long-range dependencies. In addition, learning on feature maps of different scales preserves the features while avoiding the high computational cost of operating on full-resolution feature maps in CMGAN, accelerating model training.

2) A metric discriminator is used to predict the evaluation-metric scores of the reconstructed speech, which helps improve the quality of the speech produced by the generator.

3) Experimental results on the public VoiceBank+Demand dataset [22] show that the proposed algorithm achieves effective denoising, and the ablation study verifies that the proposed multi-scale ladder-type structure improves the enhancement performance.

1 語(yǔ)音增強(qiáng)問題描述

頻率域的語(yǔ)音增強(qiáng),首先需要對(duì)語(yǔ)音信號(hào)進(jìn)行STFT得到語(yǔ)音的頻譜,即

2 Proposed algorithm

2.1 語(yǔ)音增強(qiáng)模型原理

為了充分利用時(shí)間域和頻率域信息進(jìn)行語(yǔ)音增強(qiáng),本文提出了MSLTF-CMGAN,它由一個(gè)基于時(shí)頻卷積自注意力的生成器(圖1~3)和一個(gè)用于預(yù)測(cè)評(píng)價(jià)指標(biāo)得分的判別器(圖4)組成,通過兩者的極大極小訓(xùn)練來提升生成器的去噪能力以重建高質(zhì)量的干凈語(yǔ)音。下面分別對(duì)生成器和判別器的具體結(jié)構(gòu)和語(yǔ)音增強(qiáng)的原理進(jìn)行詳細(xì)的介紹。

2.1.1 Generator based on the multi-scale ladder-type time-frequency Conformer

生成器網(wǎng)絡(luò)結(jié)構(gòu)如圖1所示,它由一個(gè)稠密鏈接擴(kuò)張卷積編碼器Dense Encoder、一個(gè)MSLTFC模塊和兩個(gè)解碼器組成(分別是掩碼解碼器Mask Decoder和復(fù)數(shù)解碼器Complex Decoder)。

圖1 生成器網(wǎng)絡(luò)結(jié)構(gòu)

The feature-enhanced spectrum is obtained through power-law compression [23], and the enhanced time-domain waveform is reconstructed by the ISTFT.
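As an illustrative reconstruction of this step (the exponent $c$ and the symbols below are assumptions, not the paper's original equation), power-law compression and the final waveform reconstruction can be written as

$$\tilde{X}=|X|^{c}\,e^{\mathrm{j}\phi_X},\qquad \hat{x}(n)=\mathrm{ISTFT}\Big(|\hat{X}|^{1/c}\,e^{\mathrm{j}\hat{\phi}}\Big),\qquad 0<c\le 1$$

where compressing the magnitude with $c<1$ equalizes the dynamic range of the spectrum before enhancement, and the enhanced magnitude is decompressed with the inverse exponent $1/c$ before the ISTFT restores the time-domain waveform.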

The above denoising process shows that the proposed MSLTFC module fully extracts time-domain and frequency-domain features both globally and in detail, effectively removing persistent low-frequency noise and fragmented high-frequency noise; combining audio features at different scales also improves the quality of speech enhancement and reduces artificial artifacts. The two decoders effectively preserve the main speech content while avoiding the problems caused by phase disorder.

Fig. 3 Network structure of the Conformer module

2.1.2指標(biāo)判別器

在語(yǔ)音增強(qiáng)任務(wù)中,模型的目標(biāo)函數(shù)往往不能直接表示評(píng)價(jià)指標(biāo),并且一些評(píng)價(jià)指標(biāo)函數(shù)是不可微分的,如語(yǔ)音質(zhì)量的感知評(píng)估(Perceptual Evaluation of Speech Quality, PESQ)[24]和短時(shí)客觀可懂度(Short-Time Objective Intelligibility, STOI)[25]。本文受MetricGAN啟發(fā),提出了一個(gè)輕量指標(biāo)判別器來模擬語(yǔ)音評(píng)價(jià)指標(biāo)函數(shù),并將評(píng)價(jià)得分添加至模型訓(xùn)練損失以提升語(yǔ)音增強(qiáng)效果。

As shown in Fig. 4, the metric discriminator contains four convolution blocks, each consisting of a 2D convolution layer, an instance normalization (IN) layer and a Parametric Rectified Linear Unit (PReLU) activation. These are followed by a max-pooling layer and two linear layers, with a PReLU activation and a Dropout layer between the two linear layers to avoid gradient vanishing and overfitting. The model ends with a Sigmoid function that restricts the output to [0, 1]; a smaller output indicates worse speech quality, whereas a larger output indicates better speech quality.
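A minimal PyTorch sketch of such a metric discriminator is given below; the channel widths, kernel sizes, strides, pooling choice and dropout rate are assumptions, and only the overall block structure (Conv2d + IN + PReLU repeated four times, max pooling, two linear layers with PReLU/Dropout, Sigmoid) follows the description above.

```python
import torch
import torch.nn as nn

class MetricDiscriminator(nn.Module):
    """Illustrative sketch of the metric discriminator described above."""
    def __init__(self, in_ch=2, base=16):
        super().__init__()
        chs = [in_ch, base, base * 2, base * 4, base * 8]
        blocks = []
        for c_in, c_out in zip(chs[:-1], chs[1:]):
            blocks += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
                       nn.InstanceNorm2d(c_out, affine=True),
                       nn.PReLU()]
        self.conv = nn.Sequential(*blocks)              # 4 conv blocks
        self.pool = nn.AdaptiveMaxPool2d(1)             # global max pooling
        self.head = nn.Sequential(
            nn.Linear(chs[-1], chs[-1] // 2), nn.PReLU(), nn.Dropout(0.3),
            nn.Linear(chs[-1] // 2, 1), nn.Sigmoid())   # quality score in [0, 1]

    def forward(self, enhanced_mag, clean_mag):
        # Both inputs: (batch, time, freq) magnitude spectrograms, stacked as 2 channels.
        x = torch.stack([enhanced_mag, clean_mag], dim=1)
        feats = self.pool(self.conv(x)).flatten(1)
        return self.head(feats)                          # predicted normalized metric score

# Example: score a random pair of spectrograms.
d = MetricDiscriminator()
score = d(torch.rand(2, 100, 201), torch.rand(2, 100, 201))
print(score.shape)  # torch.Size([2, 1])
```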

該指標(biāo)判別器將有噪聲頻振幅譜和干凈語(yǔ)音振幅譜作為輸入,判別器通過學(xué)習(xí)能夠準(zhǔn)確預(yù)測(cè)去噪后語(yǔ)音的PESQ/STOI得分。此外生成器通過對(duì)抗訓(xùn)練會(huì)生成PESQ/STOI得分越來越高的語(yǔ)音。

圖4 指標(biāo)判別器的網(wǎng)絡(luò)結(jié)構(gòu)

2.2 Loss functions

In addition, an L1 loss is applied between the time-domain waveforms, as shown in Eq. (16):
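For context, a typical composite generator loss in CMGAN-style models, including the time-domain L1 term mentioned above, has the form below; the weights $\gamma_i$ and the exact set of terms are assumptions rather than the paper's Eq. (16):

$$L_{\mathrm{Time}}=\mathbb{E}\big[\lVert x_c-\hat{x}\rVert_1\big],\qquad L_{\mathrm{GAN}}=\mathbb{E}\big[\big(D(\hat{X},X_c)-1\big)^2\big]$$
$$L_G=\gamma_1 L_{\mathrm{Mag}}+\gamma_2 L_{\mathrm{RI}}+\gamma_3 L_{\mathrm{Time}}+\gamma_4 L_{\mathrm{GAN}}$$

where $L_{\mathrm{Mag}}$ and $L_{\mathrm{RI}}$ denote spectral losses on the magnitude and on the real/imaginary parts, respectively.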

2.3 Algorithm framework

The steps of the proposed algorithm are as follows:

輸入有噪語(yǔ)音樣本;干凈語(yǔ)音樣本;生成器模型;判別器模型;

初始化:0;0

forfrom 1 to
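The loop body is sketched below in PyTorch as an assumed, typical MetricGAN-style alternating update; the function and argument names (train_step, pesq_fn, the optimizers) are illustrative, and the STFT/compression steps are folded into the model calls for brevity.

```python
import torch

def train_step(generator, discriminator, opt_g, opt_d, noisy, clean, pesq_fn):
    """One alternating update in the spirit of MetricGAN-style training.

    Assumptions (illustrative only): the generator and discriminator accept and
    return tensors directly, and pesq_fn returns a normalized quality score in
    [0, 1] for (enhanced, clean) pairs.
    """
    # --- Discriminator update: regress the metric score of enhanced and clean speech ---
    with torch.no_grad():
        enhanced = generator(noisy)                     # no gradient to G in this step
    q_enh = pesq_fn(enhanced, clean)                    # target score for enhanced speech
    d_clean = discriminator(clean, clean)               # clean-clean pairs should score 1
    d_enh = discriminator(enhanced, clean)
    loss_d = ((d_clean - 1.0) ** 2).mean() + ((d_enh - q_enh) ** 2).mean()
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # --- Generator update: reconstruction loss + adversarial metric loss ---
    enhanced = generator(noisy)
    loss_time = (enhanced - clean).abs().mean()         # time-domain L1 term
    loss_adv = ((discriminator(enhanced, clean) - 1.0) ** 2).mean()
    loss_g = loss_time + loss_adv                       # loss weights omitted for brevity
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_g.item(), loss_d.item()
```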

3 實(shí)驗(yàn)與結(jié)果分析

本文實(shí)驗(yàn)環(huán)境為L(zhǎng)inux Ubuntu18.04操作系統(tǒng),GPU顯卡Tesla V100,顯存32 GB以及CUDA 11.4、PyTorch1.11和Python3.8的軟件平臺(tái)。

3.1 實(shí)驗(yàn)數(shù)據(jù)集

為了驗(yàn)證本文算法的有效性,采用公開的經(jīng)典數(shù)據(jù)集VoiceBank[26]+Demand[27]比較本文模型和前沿的基線模型。該數(shù)據(jù)集包含28個(gè)用于訓(xùn)練的人聲和2個(gè)用于測(cè)試的未知人聲,訓(xùn)練集包含11 572個(gè)有噪-干凈音頻數(shù)據(jù)對(duì),測(cè)試集包含824個(gè)數(shù)據(jù)對(duì),音頻長(zhǎng)度在2~15 s不等。訓(xùn)練集中的音頻樣本混合了10種噪聲中的任意一種(包括2種人聲噪聲和8種來自Demand數(shù)據(jù)集的環(huán)境噪聲),且按信噪比{0 dB,5 dB,10 dB,15 dB}添加噪聲。測(cè)試集使用Demand數(shù)據(jù)集中5種未出現(xiàn)在訓(xùn)練集的噪聲創(chuàng)建樣本,按信噪比{2.5 dB,7.5 dB,12.5 dB,17.5 dB}添加噪聲。數(shù)據(jù)集中的噪聲類型廣泛,如公共環(huán)境噪聲(餐廳和辦公室)、家庭噪聲(廚房和客廳)以及交通噪聲(地鐵、公交和汽車)等,使該數(shù)據(jù)集具有挑戰(zhàn)性。

3.2 實(shí)驗(yàn)設(shè)置

3.3 Baseline models

為了體現(xiàn)本文算法對(duì)語(yǔ)音增強(qiáng)的效果,與近年來前沿的模型進(jìn)行比較,基線模型包括傳統(tǒng)方法維納濾波(Wiener)和其他深度學(xué)習(xí)模型。其中,SEGAN、HiFiGAN[28]、MetricGAN、MetricGAN+[29]、DVUGAN[31]和CMGAN是基于GAN的模型,它們的生成器用于去除噪聲、提升音頻質(zhì)量,判別器用于區(qū)分干凈音頻和帶噪聲頻。SEGAN是首次使用GAN進(jìn)行語(yǔ)音增強(qiáng)的方法,它的生成器是一個(gè)編碼-解碼的全卷積結(jié)構(gòu)模型,用于在時(shí)域上進(jìn)行語(yǔ)音增強(qiáng)。HiFIGAN包括一個(gè)前饋WaveNet生成器網(wǎng)絡(luò)和一個(gè)時(shí)頻多尺度判別器,利用對(duì)抗訓(xùn)練進(jìn)行語(yǔ)音增強(qiáng)。MetricGAN基于來自判別器的評(píng)價(jià)得分損失訓(xùn)練生成器,生成器將振幅特征圖作為輸入和輸出,通過反傅里葉變換得到增強(qiáng)后的音頻,指標(biāo)判別器解決了評(píng)價(jià)指標(biāo)如PESQ、STOI計(jì)算不可微的問題,有效地提升了重建語(yǔ)音的質(zhì)量。MetricGAN的改進(jìn)版本MetricGAN+優(yōu)化了損失并在模型中添加了可學(xué)習(xí)的Sigmoid函數(shù),對(duì)不同頻率段有更強(qiáng)的適應(yīng)性。PHASEN[30]提出了雙流的深度神經(jīng)網(wǎng)絡(luò)結(jié)構(gòu),分別用于處理幅度和相位,有助于頻譜重建。DVUGAN設(shè)計(jì)具有變分編碼結(jié)構(gòu)的對(duì)抗網(wǎng)絡(luò)模型,采用包含概率瓶頸的變分U-Net,增加未知數(shù)據(jù)分布的先驗(yàn)知識(shí),還利用信噪比(Signal?to?Noise Ratio, SNR)損失指導(dǎo)判別器,提升了語(yǔ)音增強(qiáng)性能。TSTNN[32]提出了時(shí)域兩級(jí)Transformer神經(jīng)網(wǎng)絡(luò),能有效提取長(zhǎng)距離語(yǔ)音序列的局部和全局上下文信息。DB-AIAT[33]提出了雙分支Transformer結(jié)構(gòu)網(wǎng)絡(luò),包括振幅分支和復(fù)數(shù)域分支,一起重建去噪后的頻譜。DPT?FSNet[34]提出了一種基于Transformer的全頻段結(jié)合子頻段的雙路網(wǎng)絡(luò),用于在頻率域進(jìn)行語(yǔ)音增強(qiáng)。CMGAN由MetricGAN得到啟發(fā),設(shè)計(jì)了兩階段的Conformer模塊提取時(shí)間域和頻率域特征,達(dá)到了此前最高的PESQ/STOI得分。

3.4 評(píng)價(jià)指標(biāo)

Several speech quality evaluation metrics are adopted: the subjective Mean Opinion Score (MOS) [35], including the MOS prediction of signal distortion (CSIG), the MOS prediction of background-noise intrusiveness (CBAK) and the MOS prediction of the overall effect (COVL), all of which range over [1, 5]; and the objective metrics PESQ and STOI. PESQ evaluates perceived speech quality and ranges over [-0.5, 4.5]; this paper uses the wideband PESQ defined in ITU-T Recommendation P.862.2. STOI evaluates speech intelligibility and ranges over [0, 1]. For all five metrics, a higher value indicates better speech quality.
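As a usage sketch, the two objective metrics can be computed with the open-source pesq and pystoi packages; these implementations are assumed here for illustration, since the paper does not state which implementation it used.

```python
import numpy as np
from pesq import pesq      # pip install pesq   (ITU-T P.862 / P.862.2 implementation)
from pystoi import stoi    # pip install pystoi

fs = 16000                                              # VoiceBank+Demand is commonly used at 16 kHz
clean = np.random.randn(fs * 3)                         # stand-in signals; use real audio in practice
enhanced = clean + 0.05 * np.random.randn(fs * 3)

pesq_score = pesq(fs, clean, enhanced, 'wb')            # wideband PESQ, roughly in [-0.5, 4.5]
stoi_score = stoi(clean, enhanced, fs, extended=False)  # STOI in [0, 1]
print(f"PESQ = {pesq_score:.2f}, STOI = {stoi_score:.2f}")
```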

3.5 模型對(duì)比實(shí)驗(yàn)

為了評(píng)估模型的語(yǔ)音增強(qiáng)性能,在VoiceBank+Demand數(shù)據(jù)集上與維納濾波(Wiener)和前述的基線模型進(jìn)行對(duì)比,結(jié)果如表1所示。

Table 1 Performance evaluation of different algorithms on the VoiceBank+Demand dataset

In addition, the results of the proposed algorithm on the VoiceBank+Demand test set are visualized; the audio signals and spectrograms enhanced by different algorithms are shown in Fig. 5.

Fig. 5 Spectrogram visualization of speech signals enhanced by different algorithms

To make the audio features more visible, the spectrum is converted to the dB scale, i.e., the logarithm of the original spectrum is taken. In the residual maps, redder colors (closer to the top of the color bar) indicate larger differences, while bluer colors (closer to 0) indicate smaller differences and thus results closer to the clean speech. Fig. 5(a) shows the original noisy audio, in which a large amount of noise is visible in both the time-domain waveform and the spectrogram; Fig. 5(b) shows the corresponding clean speech, used as the reference. Fig. 5(c) shows the speech enhanced by Wiener filtering: the high- and low-frequency noise is removed, but the mid-frequency band shows no obvious improvement. Fig. 5(d) shows the speech enhanced by SEGAN, whose denoising is better than Wiener filtering overall but poor on low-frequency noise, leaving clearly visible white noise that sounds like mechanical electric hum. MetricGAN alleviates this problem: in Fig. 5(e), the white noise is markedly reduced in all frequency bands and the intelligibility is greatly improved compared with the noisy audio. Fig. 5(f) shows the speech enhanced by CMGAN, which denoises all frequency bands, is closer to the clean speech in the low-frequency part, and effectively removes noise in the segments without phonemes. Fig. 5(g) shows the speech enhanced by the proposed algorithm, which removes noise well in both the high- and low-frequency parts, effectively removes irrelevant phonemes in the segments without voice, and shows clear denoising both overall and in the magnified details; its residual map shows a small dB difference from the clean speech, indicating high speech quality and intelligibility. These comparisons show that the proposed algorithm performs better in speech enhancement, clearly removing noise in both the time and frequency domains to recover high-quality clean speech.
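The dB conversion referred to here follows the standard definition; the small constant $\varepsilon$ is added as an assumption to avoid taking the logarithm of zero:

$$X_{\mathrm{dB}}(t,f)=20\,\log_{10}\big(|X(t,f)|+\varepsilon\big)$$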

3.6 消融實(shí)驗(yàn)

為了驗(yàn)證本文MSLTFC模塊嵌入到網(wǎng)絡(luò)中的提升效果以及模型設(shè)計(jì)的合理性,設(shè)計(jì)了如下消融實(shí)驗(yàn):首先研究不同Conformer的組合方式對(duì)特征提取的效果,將兩個(gè)Conformer分別以并行和串行的方式提取時(shí)間域和頻率域特征;其次,研究?jī)蓚€(gè)Decoder的去噪性能,分別僅保留Mask Decoder和Complex Decoder;再次,分別研究不進(jìn)行降采樣的Conformer結(jié)構(gòu)和采用多尺度時(shí)頻Conformer的去噪性能;最后研究有無(wú)指標(biāo)判別器對(duì)語(yǔ)音增強(qiáng)效果的影響。

表2展示了消融實(shí)驗(yàn)的結(jié)果對(duì)比,此外對(duì)測(cè)試樣例繪制了消融實(shí)驗(yàn)結(jié)果圖,如圖6所示。圖6(a)是有噪語(yǔ)音樣例,頻譜中黑框部分表示前部的一段尖銳持續(xù)噪聲;圖6(b)是對(duì)應(yīng)的干凈語(yǔ)音。表2中Parallel-Conformer表示兩個(gè)Conformer以并行方式連接,分別用于提取時(shí)間域和頻率域特征,傳入的特征圖變形以適配各自的維度,經(jīng)過Conformer以后再變形為原始特征圖維度并相加,結(jié)果表明,并行方式的語(yǔ)音增強(qiáng)效果低于順序方式,它的CSIG和PESQ比MSLTF-CMGAN分別低了0.14和0.06,測(cè)試效果如圖6(c)所示。

In Table 2, Mask Decoder denotes using only the magnitude spectrum as input, keeping only the mask decoder while the rest of the network is unchanged; the reconstructed magnitude spectrum is combined with the original phase to obtain the denoised spectrum. Similarly, Complex Decoder denotes keeping only the complex decoder with only the complex spectrum as input, directly outputting the reconstructed complex spectrum. Comparing the two in Table 2, using only the magnitude spectrum and ignoring the phase information lowers CSIG by 0.22 compared with MSLTF-CMGAN, while using only the complex spectrum lowers PESQ by 0.08. Figs. 6(d) and (e) show the enhancement results of Mask Decoder and Complex Decoder, respectively; both remove the sharp noise, but Complex Decoder denoises slightly better in the other frequency bands, as marked by the black boxes. These results also show that the proposed combination of the mask decoder and the complex decoder forms a complementary structure with clear advantages in improving speech quality and intelligibility.

In Table 2, Without Downsample denotes replacing the multi-scale ladder-type time-frequency Conformer module with three time-frequency Conformer modules that do not change the feature-map dimensions. The results show that without the multi-scale ladder-type module, CSIG, CBAK and COVL drop by 0.31, 0.33 and 0.25 compared with MSLTF-CMGAN, respectively; in addition, as shown in Fig. 6(f), this variant is less capable of removing irrelevant phonemes and high-frequency noise than the proposed algorithm. This indicates that the proposed multi-scale ladder-type time-frequency Conformer module better captures both structural and detailed features of the audio and achieves higher enhancement quality.

Finally, the contribution of the metric discriminator to the enhancement model is verified. Without Discriminator denotes training without the metric discriminator and adversarial training; as shown in Table 2, all metrics decrease. As shown in Fig. 6(g), although the high- and low-frequency noise is removed, striped noise (marked by the black box) is introduced, which sounds like electric hum and degrades speech quality; this shows that using only the Minkowski distance (Lp distance) between spectra as the loss function cannot achieve high speech quality. Fig. 6(h) shows the speech enhanced by the proposed algorithm, with significant denoising in all frequency bands; compared with Fig. 6(g), the black-box regions also show that introducing the metric discriminator significantly improves the quality of the reconstructed speech.

表2 消融實(shí)驗(yàn)結(jié)果

圖6 消融實(shí)驗(yàn)的語(yǔ)譜圖可視化

4 結(jié)語(yǔ)

This paper proposed a multi-scale ladder-type time-frequency Conformer generative adversarial network (MSLTF-CMGAN) to remove noise from audio and recover higher speech quality. Through the short-time Fourier transform, speech features are extracted in both the time and frequency domains using the magnitude spectrum together with the complex-domain spectrum, which preserves the magnitude information while avoiding the problem of random phase structure. The proposed multi-scale ladder-type time-frequency Conformer (MSLTFC) module extracts detailed and structural features from feature maps at different resolutions, the metric discriminator solves the problem that evaluation-metric functions are not directly differentiable, and adversarial training further improves the overall enhancement quality. Experimental results on the public VoiceBank+Demand dataset show that the proposed algorithm is feasible and effective, achieves good denoising results, and even obtains the highest scores on some metrics such as CSIG. The ablation study also verifies the effectiveness of each part of the model. Combining the magnitude spectrum with the complex spectrum increases the input dimensionality of the model; how to reduce the input feature dimensionality while preserving magnitude and phase information and further improving enhancement will be the focus of future work.

[1] LOIZOU P C. Speech Enhancement: Theory and Practice[M]. Boca Raton, FL: CRC Press, 2007: 1-9.

[2] BOLL S. Suppression of acoustic noise in speech using spectral subtraction[J]. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1979, 27(2): 113-120.

[3] ZALEVSKY Z, MENDLOVIC D. Fractional Wiener filter[J]. Applied Optics, 1996, 35(20): 3930-3936.

[4] EPHRAIM Y. Statistical-model-based speech enhancement systems[J]. Proceedings of the IEEE, 1992, 80(10): 1526-1555.

[5] EPHRAIM Y, VAN TREES H L. A signal subspace approach for speech enhancement[J]. IEEE Transactions on Speech and Audio Processing, 1995, 3(4): 251-266.

[6] TAMURA S, WAIBEL A. Noise reduction using connectionist models[C]// Proceedings of the 1988 International Conference on Acoustics, Speech, and Signal Processing — Volume 1. Piscataway: IEEE, 1988: 553-556.

[7] WANG Y, WANG D. Towards scaling up classification-based speech separation[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2013, 21(7): 1381-1390.

[8] HEALY E W, YOHO S E, WANG Y, et al. An algorithm to improve speech recognition in noise for hearing-impaired listeners[J]. The Journal of the Acoustical Society of America, 2013, 134(4): 3029-3038.

[9] WENINGER F, HERSHEY J R, LE ROUX J, et al. Discriminatively trained recurrent neural networks for single-channel speech separation[C]// Proceedings of the 2014 IEEE Global Conference on Signal and Information Processing. Piscataway: IEEE, 2014: 577-581.

[10] WENINGER F, ERDOGAN H, WATANABE S, et al. Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR[C]// Proceedings of the 2015 International Conference on Latent Variable Analysis and Signal Separation, LNCS 9237. Cham: Springer, 2015: 91-99.

[11] PARK S R, LEE J W. A fully convolutional neural network for speech enhancement[C]// Proceedings of the INTERSPEECH 2017. [S.l.]: International Speech Communication Association, 2017: 1993-1997.

[12] 張?zhí)祢U,柏浩鈞,葉紹鵬,等. 基于門控殘差卷積編解碼網(wǎng)絡(luò)的單通道語(yǔ)音增強(qiáng)方法[J]. 信號(hào)處理, 2021, 37(10):1986-1995.(ZHANG T Q, BAI H J, YE S P, et al. Single-channel speech enhancement method based on gated residual convolution encoder-and-decoder network[J]. Journal of Signal Processing, 2021, 37(10):1986-1995.)

[13] PALIWAL K, WÓJCICKI K, SHANNON B. The importance of phase in speech enhancement[J]. Speech Communication, 2011, 53(4): 465-494.

[14] TAN K, WANG D. Learning complex spectral mapping with gated convolutional recurrent networks for monaural speech enhancement[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020, 28: 380-390.

[15] PARVEEN S, GREEN P. Speech enhancement with missing data techniques using recurrent neural networks[C]// Proceedings of the 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing — Volume 1. Piscataway: IEEE, 2004: 733-736.

[16] PASCUAL S, BONAFONTE A, SERRÀ J. SEGAN: speech enhancement generative adversarial network[C]// Proceedings of the INTERSPEECH 2017. [S.l.]: International Speech Communication Association, 2017: 3642-3646.

[17] FU S W, LIAO C F, TSAO Y, et al. MetricGAN: generative adversarial networks based black-box metric scores optimization for speech enhancement[C]// Proceedings of the 36th International Conference on Machine Learning. New York: JMLR.org, 2019: 2031-2041.

[18] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2017: 6000-6010.

[19] KIM J, EL-KHAMY M, LEE J. T-GSA: Transformer with Gaussian-weighted self-attention for speech enhancement[C]// Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2020: 6649-6653.

[20] GULATI A, QIN J, CHIU C C, et al. Conformer: convolution-augmented Transformer for speech recognition[C]// Proceedings of the INTERSPEECH 2020. [S.l.]: International Speech Communication Association, 2020: 5036-5040.

[21] CAO R, ABDULATIF S, YANG B. CMGAN: conformer-based metric GAN for speech enhancement[C]// Proceedings of the INTERSPEECH 2022. [S.l.]: International Speech Communication Association, 2022: 936-940.

[22] VALENTINI-BOTINHAO C, WANG X, TAKAKI S, et al. Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech[C]// Proceedings of the 9th ISCA Speech Synthesis Workshop. [S.l.]: International Speech Communication Association, 2016: 146-152.

[23] BRAUN S, TASHEV I. A consolidated view of loss functions for supervised deep learning-based speech enhancement[C]// Proceedings of the 44th International Conference on Telecommunications and Signal Processing. Piscataway: IEEE, 2021: 72-76.

[24] RIX A W, BEERENDS J G, HOLLIER M P, et al. Perceptual Evaluation of Speech Quality (PESQ) — a new method for speech quality assessment of telephone networks and codecs[C]// Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing — Volume 2. Piscataway: IEEE, 2001: 749-752.

[25] TAAL C H, HENDRIKS R C, HEUSDENS R, et al. An algorithm for intelligibility prediction of time-frequency weighted noisy speech[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2011, 19(7): 2125-2136.

[26] VEAUX C, YAMAGISHI J, KING S. The voice bank corpus: design, collection and data analysis of a large regional accent speech database[C]// Proceedings of the 2013 International Conference of the Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation. Piscataway: IEEE, 2013: 1-4.

[27] THIEMANN J, ITO N, VINCENT E. The diverse environments multi-channel acoustic noise database: a database of multichannel environmental noise recordings[J]. The Journal of the Acoustical Society of America, 2013, 133(S5): No.4806631.

[28] SU J, JIN Z, FINKELSTEIN A. HiFi-GAN: high-fidelity denoising and dereverberation based on speech deep features in adversarial networks[C]// Proceedings of the INTERSPEECH 2020. [S.l.]: International Speech Communication Association, 2020: 4506-4510.

[29] FU S W, YU C, HSIEH T A, et al. MetricGAN+: an improved version of MetricGAN for speech enhancement[C]// Proceedings of the INTERSPEECH 2021. [S.l.]: International Speech Communication Association, 2021: 201-205.

[30] YIN D, LUO C, XIONG Z, et al. PHASEN: a phase-and-harmonics-aware speech enhancement network[C]// Proceedings of the 34th AAAI Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press, 2020: 9458-9465.

[31] 徐峰,李平. DVUGAN:基于STDCT的DDSP集成變分U-Net的語(yǔ)音增強(qiáng)[J]. 信號(hào)處理, 2022, 38(3):582-589.(XU F, LI P. DVUGAN: DDSP integrated variational U-Net speech enhancement based on STDCT[J]. Journal of Signal Processing, 2022, 38(3):582-589.)

[32] WANG K, HE B, ZHU W P. TSTNN: two-stage transformer based neural network for speech enhancement in the time domain[C]// Proceedings of the 2021 IEEE International Conference on Acoustics, Speech, and Signal Processing. Piscataway: IEEE, 2021: 7098-7102.

[33] YU G, LI A, ZHENG C, et al. Dual-branch attention-in-attention transformer for single-channel speech enhancement[C]// Proceedings of the 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing. Piscataway: IEEE, 2022: 7847-7851.

[34] DANG F, CHEN H, ZHANG P. DPT-FSNet: dual-path transformer based full-band and sub-band fusion network for speech enhancement[C]// Proceedings of the 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing. Piscataway: IEEE, 2022: 6857-6861.

[35] HU Y, LOIZOU P C. Evaluation of objective quality measures for speech enhancement[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2008, 16(1): 229-238.

Speech enhancement algorithm based on multi-scale ladder-type time-frequency Conformer GAN

JIN Yutang1, WANG Yisong1*, WANG Lihui1, ZHAO Pengli2

(1. State Key Laboratory of Public Big Data (Guizhou University), Guiyang Guizhou 550025, China; 2. Xuchang Electrical Vocational College, Xuchang Henan 461000, China)

Aiming at the problem of artificial artifacts caused by phase disorder in frequency-domain speech enhancement algorithms, which limits the denoising performance and decreases the speech quality, a speech enhancement algorithm based on Multi-Scale Ladder-type Time-Frequency Conformer Generative Adversarial Network (MSLTF-CMGAN) was proposed. Taking the real part, imaginary part and magnitude spectrum of the speech spectrogram as input, the generator first learned the local and global feature dependencies between the time and frequency domains by using time-frequency Conformers at multiple scales. Secondly, the Mask Decoder branch was used to learn the amplitude mask, while the Complex Decoder branch directly learned the clean spectrogram, and the outputs of the two decoder branches were fused to obtain the reconstructed speech. Finally, the metric discriminator was used to predict the scores of speech evaluation metrics, and high-quality speech was generated by the generator through minimax training. Comparison experiments with various types of speech enhancement models were conducted on the public dataset VoiceBank+Demand using the subjective Mean Opinion Score (MOS) and objective evaluation metrics. Experimental results show that compared with the state-of-the-art speech enhancement method CMGAN (Conformer-based MetricGAN), MSLTF-CMGAN improves the MOS prediction of the signal distortion (CSIG) and the MOS predictor of intrusiveness of background noise (CBAK) by 0.04 and 0.07 respectively; although its Perceptual Evaluation of Speech Quality (PESQ) and MOS prediction of the overall effect (COVL) are slightly lower than those of CMGAN, it still outperforms the other comparison models in several subjective and objective speech evaluation metrics.

speech enhancement; multi-scale; Conformer; Generative Adversarial Network (GAN); metric discriminator; deep learning

1001-9081(2023)11-3607-09

10.11772/j.issn.1001-9081.2022111734

2022-11-22;

2023-02-27;

國(guó)家自然科學(xué)基金資助項(xiàng)目(U1836205)。

JIN Yutang, born in 1999 in Anshun, Guizhou, is an M.S. candidate. His research interests include digital signal processing, speech enhancement and signal denoising. WANG Yisong, born in 1975 in Sinan, Guizhou, is a professor, Ph.D. and CCF member. His research interests include knowledge representation and reasoning, answer set programming, artificial intelligence and machine learning. WANG Lihui, born in 1982 in Harbin, Heilongjiang, is a professor and Ph.D. Her research interests include deep learning, machine learning, medical imaging, medical image processing and computer vision. ZHAO Pengli, born in 1992 in Xuchang, Henan, is a teaching assistant with an M.S. degree. Her research interests include databases and software engineering.

TP391.9

A

2023-02-28.

This work is partially supported by National Natural Science Foundation of China (U1836205).

JIN Yutang, born in 1999, M. S. candidate. His research interests include digital signal processing, speech enhancement, signal denoising.

WANG Yisong, born in 1975, Ph. D., professor. His research interests include knowledge representation and reasoning, answer set programming design, artificial intelligence, machine learning.

WANG Lihui, born in 1982, Ph. D., professor. Her research interests include deep learning, machine learning, medical imaging, medical image processing, computer vision.

ZHAO Pengli, born in 1992, M. S., teaching assistant. Her research interests include database, software engineering.
