Pig continuous cough sound recognition based on continuous speech recognition technology
Li Xuan1,2, Zhao Jian1,2, Gao Yun1,2, Liu Wanghong2,3, Lei Minggang2,3, Tan Hequn1,2
(1. College of Engineering, Huazhong Agricultural University, Wuhan 430070, China; 2. Cooperative Innovation Center for Sustainable Pig Production, Wuhan 430070, China; 3. College of Animal Science and Technology & College of Veterinary Medicine, Huazhong Agricultural University, Wuhan 430070, China)
Existing pig cough sound recognition methods based on isolated-word recognition technology can recognize only a limited set of sound classes and cannot reflect the continuous coughing of actually diseased pigs. To address this problem, this paper proposes a method that builds an acoustic model of pig sounds with a bidirectional long short-term memory network combined with connectionist temporal classification (BLSTM-CTC) and uses it to recognize continuous pig cough sounds in the pig farm environment, so as to provide early warning and judgment of pig respiratory diseases. The time-domain characteristics, namely the duration and energy of individual cough sound samples of Landrace pigs weighing about 75 kg, were studied, and a threshold range of 0.24-0.74 s for sample duration and more than 40.15 V²·s for energy was established. Within this threshold range, a single-parameter double-threshold endpoint detection algorithm was applied to 30 h of pig farm sounds that had been processed by a multi-window-spectrum psychoacoustic speech enhancement algorithm, yielding 222 experimental utterances. Sounds in the pig farm environment were divided into pig cough sounds and non-cough sounds, which served as the acoustic modeling units for labeling the corpus. Twenty-six-dimensional Mel frequency cepstral coefficients (MFCC) were extracted as the feature parameters of the experimental utterances. A BLSTM network was used to learn the temporal variation of continuous pig sounds, and CTC was used to realize an end-to-end continuous pig sound recognition system. In a 5-fold cross-validation experiment, the average cough recognition rate reached 92.40%, the error recognition rate was 3.55%, and the total recognition rate reached 93.77%. In addition, an application test on 1 h of corpus outside the data set gave a cough recognition rate of 94.23%, an error recognition rate of 9.09% and a total recognition rate of 93.24%, indicating that the BLSTM-CTC pig cough recognition model based on continuous speech recognition technology is stable and reliable. This study can provide a reference for the recognition of continuous pig cough sounds and disease judgment in healthy pig farming.
signal processing; acoustic signal; recognition; pig industry; continuous cough; bidirectional long short-term memory-connectionist temporal classification model; acoustic model
At present, pork accounts for the largest share of market demand among all meats [1]. However, with the scaling-up of the pig industry, respiratory diseases seriously threaten pork quality and yield, and monitoring pig cough sounds makes it possible to detect respiratory diseases in time [2-4]. The current practice on pig farms is manual on-site monitoring of coughs, which is labor-intensive and cannot guarantee a satisfactory recognition rate. This paper therefore studies the automatic recognition of pig cough sounds in the pig farm environment based on speech recognition technology, in order to promote the development of healthy pig farming [5].
Regarding the time-domain characteristics of pig cough sounds, Mitchell et al. [3] studied the coughs of Belgian Landrace × Duroc crossbred pigs and found that the cough durations of sick and healthy pigs were 0.3 and 0.21 s, respectively; Sara et al. [4] studied the coughs of Landrace × Large White crossbred pigs and found durations of 0.67 and 0.43 s for sick and healthy pigs, respectively. The duration of a pig cough is thus related to both the pig's health status and its breed. In addition, Cordeiro et al. [1] stimulated pigs with different cold and hot environments and found that the sounds produced by stressed pigs lasted longer than 1.02 s; they used this threshold as the decision criterion of a decision tree algorithm to judge whether a pig was under stress.
Regarding pig cough recognition, Exadaktylos et al. [6] used a fuzzy C-means clustering algorithm to recognize pig coughs with a total recognition rate of 85%. Other work based on fuzzy C-means clustering includes Hirtum et al. [7], with a recognition rate of 92% and an error rate of 21%, and Xu et al. [8], with a recognition rate of 83.4%. Guarino et al. [9] used the dynamic time warping (DTW) algorithm to recognize pig coughs with a recognition rate of 85.5%; Liu et al. [10] used a hidden Markov model (HMM) and achieved a recognition rate of 80.0%; Li et al. [11] recognized pig coughs with deep belief nets (DBN), achieving a cough recognition rate of 95.80%, an error recognition rate of 6.83% and a total recognition rate of 94.29%.
All of the above work is based on isolated-word recognition of pig coughs and considers only a limited range of non-cough sounds, so the resulting models cannot make judgments on other farm sounds that they have not learned, which limits their practical use. Moreover, a diseased pig usually coughs several times in succession each time it coughs [12-13], so recognizing continuous cough sounds reflects the pig's disease status better.
To date, no work on continuous pig sound recognition has been reported at home or abroad, but a growing number of researchers have applied continuous speech recognition technology to the sounds of other animals by building acoustic models. The acoustic model is a key component of a continuous speech recognition system; by choosing suitable acoustic modeling units, the physical variation of the speech signal can be conveniently described. Milone et al. [14] built an acoustic model of cattle ingestive sounds and realized the recognition of continuous cattle ingestive sounds. Similar work includes Reby et al. [15], who recognized continuous deer calls, Milone et al. [16], who recognized continuous sheep ingestive sounds, and Trifa et al. [17], who recognized continuous antbird calls.
This paper therefore studies the recognition of continuous pig cough sounds. A bidirectional long short-term memory (BLSTM) network [18-20] learns the features of continuous pig sounds, and connectionist temporal classification (CTC) [21] directly models the alignment between the input continuous sound sequence and its labels, yielding an end-to-end [22-23] continuous pig cough recognition system and providing a methodological reference for recognizing continuous pig coughs and judging diseases in healthy pig farming.
Pig sounds were collected at the pig farm affiliated to Huazhong Agricultural University with a 美博 M66 voice recorder (sampling frequency 48 kHz). Recording took place from March to April 2016, a period of marked temperature fluctuation when pig diseases are frequent. The subjects were ten Landrace pigs weighing about 75 kg, raised five to a pen in two adjacent pens. Veterinary diagnosis showed that five of the ten pigs were infected with respiratory disease and coughed noticeably. The recorder was fixed on the pig house wall between the two pens, 1.5 m above the ground, and pig farm sounds were collected continuously 24 h a day. From the recordings, 30 h of signal covering the periods with frequent coughing were retained for the experiments.
The pig farm environment is acoustically complex, and excessive noise adversely affects both endpoint detection and sound recognition. The multi-window-spectrum psychoacoustic speech enhancement algorithm [11] was therefore used to denoise the continuous pig sounds. Fig.1 compares the waveform of an 8.50 s continuous pig cough segment before and after enhancement; as Fig.1b shows, the noise in the continuous signal is clearly reduced, and listening tests found almost no audible distortion of the pig sound samples.
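The multi-window-spectrum psychoacoustic enhancement of reference [11] is not reproduced here. As a rough illustration of spectral-domain denoising only, the sketch below applies plain spectral subtraction with SciPy; the 1024-point window, the over-subtraction factor and the assumption that the first half second of the recording is noise-only are ours, not part of the original algorithm.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(x, fs, noise_seconds=0.5, alpha=2.0, floor=0.02):
    """Crude spectral-subtraction denoiser (illustrative stand-in for the
    psychoacoustic multi-window enhancement of reference [11])."""
    nperseg = 1024
    hop = nperseg // 2                               # default hop of scipy.signal.stft
    _, _, X = stft(x, fs, nperseg=nperseg)
    mag, phase = np.abs(X), np.angle(X)
    # Assume the first `noise_seconds` of the recording are background noise only
    noise_frames = max(1, int(noise_seconds * fs / hop))
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    # Over-subtract the noise estimate and keep a small spectral floor
    clean_mag = np.maximum(mag - alpha * noise_mag, floor * mag)
    _, y = istft(clean_mag * np.exp(1j * phase), fs, nperseg=nperseg)
    return y
```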
The continuous sounds collected on the farm contain many sound types. Pig sounds mainly include coughing, sneezing, eating, screaming, grunting and ear shaking, while environmental noise mainly includes dog barking, metal clanging, exhaust fan noise and other sounds; these differ clearly from pig coughs in time-domain characteristics such as duration and energy. Inspired by previous studies of the duration and energy of pig sounds [3-4], the duration and energy of individual pig cough samples in this experiment were analyzed.
Fig.1 Waveforms of continuous pig cough sound before and after speech enhancement
The duration dur of a pig cough sound sample y(m) after framing is computed as
dur = [wlen + (N − 1)·inc] / F (1)
where N is the total number of frames of the cough sample after framing, wlen is the frame length in samples (taken as 25 ms according to the short-time stationarity of the sound signal), inc is the frame shift (taken as 40% of the frame length), and F is the sampling frequency, Hz.
Let y_n(m) denote the n-th frame of the cough sample y(m) after framing; the energy E of the sample is computed as
E = (1/F)·Σ_{n=1}^{N} Σ_{m=1}^{wlen} y_n²(m) (2)
where m is the index of the sampling point within a frame.
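A minimal NumPy sketch of the duration and energy computation of equations (1) and (2) as reconstructed above; the helper names, the truncation of the last partial frame and the amplitude scale are our assumptions.

```python
import numpy as np

def frame_signal(y, fs, frame_ms=25, shift_ratio=0.4):
    """Split a sound sample into overlapping frames (25 ms frames, 40% shift)."""
    wlen = int(frame_ms / 1000 * fs)         # frame length in samples
    inc = int(wlen * shift_ratio)            # frame shift in samples
    n_frames = 1 + max(0, (len(y) - wlen) // inc)
    return np.stack([y[i * inc:i * inc + wlen] for i in range(n_frames)]), wlen, inc

def duration_and_energy(y, fs):
    """Duration (eq. 1) and energy (eq. 2) of one framed cough sample."""
    frames, wlen, inc = frame_signal(y, fs)
    dur = (wlen + (len(frames) - 1) * inc) / fs        # eq. (1), seconds
    energy = float(np.sum(frames ** 2)) / fs           # eq. (2), V^2*s as reconstructed
    return dur, energy
```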
Using the Direct Splitter audio processing software, 597 pig cough samples were randomly excerpted from the recordings, and the duration and energy of each sample were computed with equations (1) and (2); the resulting minima and maxima are given in Table 1.
Table 1 Time-domain characteristics of individual pig cough sound samples
Table 1 shows that the cough durations of the Landrace pigs in this experiment ranged from 0.24 to 0.74 s, consistent with previous findings [3-4]. Both the intensity of the cough and the distance between the pig and the recorder affect the energy of a cough sample; a higher-energy sample corresponds to a more violent cough closer to the recorder, so only the lower bound of the energy threshold is used.
Endpoint detection of the pig sound signal means finding the start and end points of all sound samples in the continuous signal and defining the signal between them as valid. The single-parameter double-threshold endpoint detection algorithm based on short-time energy in [11] was used to detect valid signals in the 30 h of continuous farm sound, and each detected sample was checked against the threshold range set by the duration bounds and the energy lower bound in Table 1; samples outside this range were discarded. This yielded 222 experimental utterances, the longest 9.14 s and the shortest 3.91 s. The 222 utterances contain 1 145 sound samples in total, of which 751 are pig coughs and 394 are non-cough sounds.
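The double-threshold detector of [11] itself is not reproduced; the sketch below only illustrates the screening step described here, i.e. keeping detected segments whose duration falls within 0.24-0.74 s and whose energy exceeds 40.15 V²·s. Segment boundaries are assumed to come from the endpoint detector, and duration_and_energy is the helper sketched earlier.

```python
DUR_MIN, DUR_MAX = 0.24, 0.74      # s, duration bounds from Table 1
ENERGY_MIN = 40.15                 # V^2*s, energy lower bound from Table 1

def filter_segments(signal, fs, segments):
    """Keep only endpoint-detected segments that fall inside the cough-like
    duration range and above the energy lower bound."""
    kept = []
    for start, end in segments:                     # boundaries in samples
        seg = signal[start:end]
        dur, energy = duration_and_energy(seg, fs)
        if DUR_MIN <= dur <= DUR_MAX and energy > ENERGY_MIN:
            kept.append((start, end))
    return kept
```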
With the help of a veterinarian, the 222 utterances were labeled manually to obtain the corresponding label sequences, with the acoustic modeling units pig cough and non-cough denoted by the symbols 'k' and 'n', respectively.
Mel frequency cepstral coefficient (MFCC) [24-25] analysis is based on the auditory mechanism of the human ear: the linear spectrum is mapped onto the nonlinear Mel spectrum, and the spectral characteristics of the sound are analyzed according to the results of human hearing experiments. After framing and windowing the continuous pig sound utterance, the spectral energy is computed with the fast Fourier transform and passed through a Mel filter bank; taking the logarithm of the filter outputs gives the Mel filter energies, whose discrete cosine transform yields the 13-dimensional MFCC reflecting the static characteristics of the pig sound; finally the first-order difference coefficients reflecting the dynamic characteristics are appended, giving the 26-dimensional MFCC. The extraction procedure is shown in Fig.2.
Fig.2 Extraction procedure of MFCC feature parameters
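A minimal sketch of the 26-dimensional feature described above, using librosa as an assumed toolkit (the paper does not name one): 13 static MFCC plus their first-order deltas, with the 25 ms frame length and 40% frame shift assumed to carry over from the time-domain analysis.

```python
import librosa
import numpy as np

def extract_mfcc26(wav_path):
    """13 static MFCC + 13 first-order deltas = 26-dimensional features per frame."""
    y, sr = librosa.load(wav_path, sr=48000)          # recordings were sampled at 48 kHz
    n_fft = int(0.025 * sr)                           # 25 ms frame length
    hop = int(n_fft * 0.4)                            # 40% frame shift
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop)
    delta = librosa.feature.delta(mfcc)               # first-order difference coefficients
    return np.vstack([mfcc, delta]).T                 # shape: (frames, 26)
```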
In contrast to feedforward neural networks [26-27], whose hidden neurons are not interconnected, the recurrent neural network (RNN) allows self-feedback connections among hidden neurons. The input to an RNN hidden layer includes not only the pig sound features from the input layer but also the hidden-layer output at the previous time step; this structure lets the model memorize earlier information and use it when computing the current output. Although the RNN is in theory well suited to sequence modeling problems such as speech, it suffers from exploding and vanishing gradients as the sequence grows longer [20,28]. The LSTM is a special RNN that introduces a memory cell and a gating mechanism to learn historical information and control the rate at which information accumulates, which alleviates these problems to some extent; the LSTM unit is shown in Fig.3.
As shown in Fig.3, an LSTM unit consists of four parts: a memory cell, an input gate, an output gate and a forget gate. In an LSTM network the memory cells are connected to one another, and the three nonlinear gates regulate the information entering and leaving the memory cell (dashed connections in Fig.3). The input gate controls which information enters the memory cell: it reads the previous memory cell output h_{t-1} and the current input x_t and outputs a value i_t between 0 and 1, the percentage of information to be admitted, where 0 means discarding everything and 1 means admitting everything. It is computed as
i_t = σ(W_i[h_{t-1}, x_t] + b_i) (3)
where σ is the sigmoid function, W_i is the input gate weight and b_i is the input gate bias.
Note: i_t is the percentage of input information admitted by the input gate; f_t is the percentage of information forgotten by the forget gate; o_t is the state value of the output gate; the black solid circle indicates element-wise multiplication.
Similarly, the forget gate controls which information of the previous memory cell state c_{t-1} is to be forgotten; f_t is the percentage of information to forget, computed as
f_t = σ(W_f[h_{t-1}, x_t] + b_f) (4)
where W_f is the forget gate weight and b_f is the forget gate bias.
The memory cell state c_t at the current time step is then
c_t = f_t c_{t-1} + i_t tanh(W_c[h_{t-1}, x_t] + b_c) (5)
where tanh is the hyperbolic tangent function, W_c is the memory cell weight and b_c is the memory cell bias.
The output gate value o_t controls how much information the memory cell outputs at the current time step:
o_t = σ(W_o[h_{t-1}, x_t] + b_o) (6)
h_t = o_t tanh(c_t) (7)
where W_o is the output gate weight, b_o is the output gate bias, and h_t is the memory cell output at the current time step.
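To make equations (3)-(7) concrete, a single LSTM time step is sketched below in NumPy; this toy cell is only illustrative, while the experiments themselves use a full BLSTM network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_i, b_i, W_f, b_f, W_c, b_c, W_o, b_o):
    """One LSTM time step following equations (3)-(7).
    Each W_* has shape (hidden, hidden + input); each b_* has shape (hidden,)."""
    z = np.concatenate([h_prev, x_t])                  # [h_{t-1}, x_t]
    i_t = sigmoid(W_i @ z + b_i)                       # eq. (3), input gate
    f_t = sigmoid(W_f @ z + b_f)                       # eq. (4), forget gate
    c_t = f_t * c_prev + i_t * np.tanh(W_c @ z + b_c)  # eq. (5), cell state
    o_t = sigmoid(W_o @ z + b_o)                       # eq. (6), output gate
    h_t = o_t * np.tanh(c_t)                           # eq. (7), cell output
    return h_t, c_t
```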
The traditional LSTM unrolls in one direction and can use only past information, whereas continuous pig cough recognition operates on the whole utterance: the features of the current frame are related not only to earlier frames but also to later ones. Two independent LSTMs are therefore used to process the continuous pig sound sequence in the forward and backward directions [29-30] (Fig.4), and their outputs are combined and passed to the next layer of the network, so that contextual temporal information is fully exploited for acoustic modeling of continuous pig sounds.
In a continuous speech recognition system, the CTC (connectionist temporal classification) layer exploits the BLSTM's ability to learn sequential signals and models the mapping from input speech features to output labels directly [21,31], without requiring a frame-level alignment between the feature sequence and the label sequence, thus enabling end-to-end training of the acoustic model.
Note: x_{t-1}, x_t and x_{t+1} are the input-layer inputs at times t-1, t and t+1; c_{t-1}, c_t and c_{t+1} are the states of the hidden-layer memory cells at times t-1, t and t+1; h_{t-1}, h_t and h_{t+1} are the memory cell outputs at times t-1, t and t+1; the superscripts → and ← denote the forward and backward passes, respectively.
The BLSTM output serves as the input of the CTC layer. The number of output neurons equals the number of all possible labels, i.e. the number of acoustic modeling units, plus an extra blank label used to account for silence in the output; in this system the number of labels is 3, namely 'k', 'n' and '_', where '_' is the blank label representing the silence model. The BLSTM output can therefore describe the label probability distribution of the input continuous speech. Given a continuous input utterance x of length T, the probability that the BLSTM outputs label index j (j ∈ {1, 2, 3}) at time t is
P(l_t = j | x) = exp(y_t^j) / Σ_{j'=1}^{K} exp(y_t^{j'}) (8)
where y_t^j is the value output by the BLSTM network for label j at time t, i.e. the output of the j-th output-layer neuron, l_t is the label index output by the BLSTM at time t, and K is the number of labels.
Let the CTC output sequence be π; π is a sequence of length T made up of T labels, and multiplying the probabilities over the T time steps gives the probability of π:
P(π | x) = Π_{t=1}^{T} P(l_t = π_t | x) (9)
In fact, every true label sequence z corresponds to several CTC output sequences π. Define the mapping z = B(π), which converts π into z by removing the repeated labels and the blank labels from the possible sequences [23]. For example, for a continuous pig sound signal with T = 8 whose true label sequence is (n, k, n, k), the corresponding CTC output sequence may be (n, _, k, k, k, n, k, k) or (n, n, _, k, n, k, _, _), etc. It follows that
P(z | x) = Σ_{π ∈ B⁻¹(z)} P(π | x) (10)
This sum can be computed and differentiated by dynamic programming with the forward-backward algorithm [18]. If z* is the label sequence corresponding to the continuous input utterance, the aim of CTC training is to maximize the probability that the BLSTM network outputs z*, i.e. to minimize the negative logarithm of this probability, so the loss function is defined as
L = −ln P(z* | x) (11)
Fig.5 shows the continuous pig cough recognition system based on the BLSTM-CTC acoustic model. The feature parameters of the continuous pig sound are first fed into the BLSTM, whose strong acoustic modeling capability is used to learn the features of the input speech; the network then outputs the label probability distribution of the utterance, which is taken as the input of the CTC layer; the model loss is computed against the original label sequence of the utterance, and the whole acoustic model is trained accordingly.
Fig.5 Block diagram of the continuous pig cough sound recognition system
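A minimal PyTorch sketch of the Fig.5 architecture under stated and assumed settings: the 26-dimensional MFCC input, 300 hidden units per direction, a 300-unit fully connected layer and 3 output labels (including the blank) come from the text, while the ReLU activation, the Adam optimizer and the framework choice are our assumptions.

```python
import torch
import torch.nn as nn

class BLSTMCTC(nn.Module):
    """BLSTM acoustic model with a CTC output layer over the labels '_', 'k', 'n'."""
    def __init__(self, n_feats=26, hidden=300, fc=300, n_labels=3):
        super().__init__()
        self.blstm = nn.LSTM(n_feats, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Sequential(nn.Linear(2 * hidden, fc), nn.ReLU(), nn.Linear(fc, n_labels))

    def forward(self, x):                    # x: (batch, frames, 26)
        out, _ = self.blstm(x)               # (batch, frames, 2 * hidden)
        return self.fc(out)                  # per-frame label scores

model = BLSTMCTC()
ctc_loss = nn.CTCLoss(blank=0)               # index 0 is the blank label '_'
optim = torch.optim.Adam(model.parameters(), lr=0.001)

def train_step(feats, feat_lens, targets, target_lens):
    """feats: (batch, frames, 26); targets: concatenated label indices (1='k', 2='n')."""
    log_probs = model(feats).log_softmax(dim=-1).transpose(0, 1)   # (frames, batch, labels)
    loss = ctc_loss(log_probs, targets, feat_lens, target_lens)
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()
```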
The trained continuous pig cough recognition system can then be applied to continuous pig sound utterances. During testing it outputs a probability matrix with T rows and K columns, representing the label probability distribution of the input frames at all time steps; the maximum-probability output sequence, i.e. the recognition result, is decoded with the beam search algorithm [32].
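The paper decodes with beam search [32]; as a simpler illustration of how a label sequence is read off the T × K probability matrix, the sketch below uses greedy best-path decoding followed by the CTC collapse rule (merge repeated labels, drop blanks). It is a simplification, not the decoder used in the paper.

```python
import numpy as np

def greedy_ctc_decode(prob_matrix, blank=0, labels=('_', 'k', 'n')):
    """prob_matrix: (T, K) per-frame label probabilities from the trained model."""
    best_path = np.argmax(prob_matrix, axis=1)       # most likely label per frame
    decoded, prev = [], None
    for idx in best_path:
        if idx != prev and idx != blank:             # merge repeats, drop blanks
            decoded.append(labels[idx])
        prev = idx
    return decoded                                   # e.g. ['n', 'k', 'k', 'n']
```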
The performance of the BLSTM-CTC continuous pig cough recognition model was evaluated with 5-fold cross-validation: the 222 experimental utterances were divided into 5 mutually exclusive subsets of approximately equal size; in each round the union of 4 subsets was used as the training set and the remaining subset as the test set, giving 5 pairs of training and test sets and hence 5 rounds of training and testing.
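For reference, the 5-fold split described here can be set up, for example, with scikit-learn; the tool choice and the shuffling are our assumptions, since the paper does not state how the folds were drawn.

```python
from sklearn.model_selection import KFold

utterance_ids = list(range(222))                 # the 222 experimental utterances
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kfold.split(utterance_ids), start=1):
    # roughly 177-178 utterances for training and 44-45 for testing per fold
    print(f"fold {fold}: {len(train_idx)} train / {len(test_idx)} test")
```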
In a continuous speech recognition system whose acoustic modeling units are recognition primitives, the word error rate (WER) [33] is generally used as the evaluation index: the recognition result is compared with the label sequence of the test utterance, the numbers of substitution, insertion and deletion errors are counted, and their sum is divided by the total number of samples in the test corpus, i.e.
WER = (S + I + D) / N × 100% (12)
The three kinds of errors are illustrated by the following example: the label sequence is (n, , k, k, n, n) and the recognition result is (n, k, n, _). Comparing the two, the first pig cough in the recognition result is an insertion error, the second non-cough sound is a substitution error, and the last non-cough sound in the label sequence was not recognized, which is a deletion error. Since this paper focuses on recognizing continuous pig coughs, only substitution errors of non-cough sounds are considered during recognition, while insertion and deletion errors of non-cough sounds are ignored. A modified WER is therefore used to evaluate the continuous pig cough recognition system; the evaluation indices, namely the cough recognition rate R_c, the error recognition rate R_e and the total recognition rate R_total, are computed as
R_c = (N_c − S_c − I_c − D_c) / N_c × 100% (13)
R_e = S_n / N_n × 100% (14)
R_total = (N_c + N_n − S_c − I_c − D_c − S_n) / (N_c + N_n) × 100% (15)
where S_c, I_c, D_c, N_c, S_n and N_n denote, respectively, the number of pig coughs recognized as non-cough sounds, the number of inserted coughs, the number of deleted coughs, the number of coughs in the test set, the number of non-cough sounds recognized as coughs, and the number of non-cough sounds in the test set.
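A small helper implementing equations (13)-(15) as reconstructed above; as a sanity check, the application-test counts reported later (52 coughs with one substitution, one insertion and one deletion; 22 non-coughs with two substitutions) reproduce the stated 94.23%, 9.09% and 93.24%.

```python
def cough_metrics(S_c, I_c, D_c, N_c, S_n, N_n):
    """Cough recognition rate, error recognition rate and total recognition rate
    following the modified WER defined in the text (eqs. 13-15)."""
    r_cough = (N_c - S_c - I_c - D_c) / N_c * 100                        # eq. (13)
    r_error = S_n / N_n * 100                                            # eq. (14)
    r_total = (N_c + N_n - S_c - I_c - D_c - S_n) / (N_c + N_n) * 100    # eq. (15)
    return r_cough, r_error, r_total

# Application-test example from the text: 52 coughs, 22 non-cough sounds
print(cough_metrics(S_c=1, I_c=1, D_c=1, N_c=52, S_n=2, N_n=22))
# -> approximately (94.23, 9.09, 93.24)
```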
After repeated comparative experiments, the numbers of hidden-layer neurons in the forward and backward passes of the BLSTM and of neurons in the fully connected layer were all set to 300, the learning rate was set to 0.001, and the maximum number of training iterations was 200. The 5-fold cross-validation results are shown in Table 2.
Table 2 5-fold cross-validation results of continuous pig cough sound recognition
The five groups of results in Table 2 show that the cough recognition rate and the total recognition rate of every group reach at least 90.00%, and the error recognition rate is kept within 8.00%. The averages over the 5 folds are a cough recognition rate of 92.40%, an error recognition rate of 3.55% and a total recognition rate of 93.77%, indicating that the continuous pig cough recognition system based on the BLSTM-CTC acoustic model is stable and effective.
To test the application of the continuous pig cough recognition model based on continuous speech recognition technology, another 1 h of pig farm sound was taken as the test object. After speech enhancement, the threshold-based endpoint detection algorithm yielded a test set of 14 utterances, the longest 8.51 s and the shortest 3.56 s, containing 74 sound samples in total: 52 pig coughs and 22 non-cough sounds. The 14 test utterances were then labeled manually at the utterance level and their feature parameters extracted, and the model trained on the second group in Table 2 was used for the application test. In the results, pig coughs incurred 1 substitution error, 1 insertion error and 1 deletion error, and non-cough sounds incurred 2 substitution errors, giving a cough recognition rate of 94.23%, an error recognition rate of 9.09% and a total recognition rate of 93.24%. The application test shows that the continuous pig cough recognition model also achieves satisfactory recognition of samples outside the training and test data sets, and that the model is stable and reliable.
This paper proposes a method for recognizing continuous pig cough sounds in the pig farm environment. Compared with isolated-word recognition technology, the method can recognize more kinds of farm sounds, reflects the pigs' disease status better, and involves simpler corpus processing, feature extraction and recognition.
1) An acoustic model of pig sounds was proposed and built with a bidirectional long short-term memory network, which has strong temporal signal processing capability, and a connectionist temporal classification layer. With pig cough and non-cough sounds as the acoustic modeling units, the continuous corpus was labeled and an end-to-end continuous pig cough recognition system was realized.
2) In the 5-fold cross-validation experiment, with the numbers of hidden-layer neurons in the forward and backward passes of the BLSTM and of neurons in the fully connected layer all set to 300 and the learning rate set to 0.001, the average cough recognition rate reached 92.40%, the error recognition rate was 3.55% and the total recognition rate reached 93.77%. An application test on 1 h of corpus outside the data set gave a cough recognition rate of 94.23%, an error recognition rate of 9.09% and a total recognition rate of 93.24%, showing that the BLSTM-CTC continuous pig cough recognition model based on continuous speech recognition technology is stable and reliable.
[1] Cordeiro A, Nääs I, Leitão F, et al. Use of vocalisation to identify sex, age, and distress in pig production[J]. Biosystems Engineering, 2018, 173: 57-63.
[2] Silva M, Ferrari S, Costa A, et al. Cough localization for the detection of respiratory diseases in pig houses[J]. Computers and Electronics in Agriculture, 2008, 64(2): 286-292.
[3] Mitchell S, Vasileios E, Sara F, et al. The influence of respiratory disease on the energy envelope dynamics of pig cough sounds[J]. Computers and Electronics in Agriculture, 2009, 69(1): 80-85.
[4] Sara F, Mitchell S, Marcella G, et al. Cough sound analysis to identify respiratory infection in pigs[J]. Computers and Electronics in Agriculture, 2009, 64(2): 318-325.
[5] He Dongjian, Liu Dong, Zhao Kaixuan. Review of perceiving animal information and behavior in precision livestock farming[J]. Transactions of the Chinese Society for Agricultural Machinery, 2016, 47(5): 231-244. (in Chinese with English abstract)
[6] Exadaktylos V, Silva M, Aerts J M, et al. Real-time recognition of sick pig cough sounds[J]. Computers and Electronics in Agriculture, 2008, 63(2): 207-214.
[7] Hirtum A V, Berckmans D. Fuzzy approach for improved recognition of citric acid induced piglet coughing from continuous registration[J]. Journal of Sound and Vibration, 2003, 266(3): 677-686.
[8] Xu Yani, Shen Mingxia, Yan Li, et al. Research of pre-delivery Meishan sow cough recognition algorithm[J]. Journal of Nanjing Agricultural University, 2016, 39(4): 681-687. (in Chinese with English abstract)
[9] Guarino M, Jans P, Costa A, et al. Field test of algorithm for automatic cough detection in pig house[J]. Computers and Electronics in Agriculture, 2008, 62(1): 22-28.
[10] Liu Zhenyu, He Xiaoyan, Sang Jing, et al. Research on pig cough sound recognition based on hidden Markov model[C]// Proceedings of the 10th Symposium of the Information Technology Branch of the Chinese Association of Animal Science and Veterinary Medicine, 2015: 99-104. (in Chinese)
[11] Li Xuan, Zhao Jian, Gao Yun, et al. Recognition of pig cough sound based on deep belief nets[J]. Transactions of the Chinese Society for Agricultural Machinery, 2018, 49(3): 179-186. (in Chinese with English abstract)
[12] Chen Shengke. Analysis of pig cough and asthma and its treatment from the perspective of traditional Chinese veterinary medicine[J]. China Animal Health, 2015, 17(3): 22-23. (in Chinese)
[13] Chen Runsheng. Differential diagnosis of pig cough diseases[J]. Modern Agricultural Science and Technology, 2016(14): 269-270. (in Chinese)
[14] Milone D H, Galli J R, Cangiano C A, et al. Automatic recognition of ingestive sounds of cattle based on hidden Markov models[J]. Computers and Electronics in Agriculture, 2012, 87(3): 51-55.
[15] Reby D, Andre-Obrecht R, Galinier A, et al. Cepstral coefficients and hidden Markov models reveal idiosyncratic voice characteristics in red deer (Cervus elaphus) stags[J]. Journal of the Acoustical Society of America, 2006, 120(6): 4080-4089.
[16] Milone D H, Rufiner H L, Galli J R, et al. Computational method for segmentation and classification of ingestive sounds in sheep[J]. Computers and Electronics in Agriculture, 2009, 65(2): 228-237.
[17] Trifa V M, Kirschel A N, Taylor C E, et al. Automated species recognition of antbirds in a Mexican rainforest using hidden Markov models[J]. Journal of the Acoustical Society of America, 2008, 123(4): 2424-2431.
[18] Hochreiter S, Schmidhuber J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780.
[19] Chen Yingyi, Cheng Qianqian, Fang Xiaomin, et al. Principal component analysis and long short-term memory neural network for predicting dissolved oxygen in water for aquaculture[J]. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 2018, 34(17): 183-191. (in Chinese with English abstract)
[20] Bengio Y, Frasconi P, Simard P. The problem of learning long-term dependencies in recurrent networks[C]// IEEE International Conference on Neural Networks. IEEE, 1993: 1183-1188.
[21] Wang Zhichao, Zhang Pengyuan, Pan Jielin, et al. Optimization of acoustic modeling method with connectionist temporal classification criterion[J]. Acta Acustica, 2018, 43(6): 984-990. (in Chinese with English abstract)
[22] Bahdanau D, Chorowski J, Serdyuk D, et al. End-to-end attention-based large vocabulary speech recognition[C]// IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2016: 4945-4949.
[23] Graves A, Jaitly N. Towards end-to-end speech recognition with recurrent neural networks[C]// International Conference on Machine Learning, 2014: 1764-1772.
[24] Chia A O, Hariharan M, Yaacob S, et al. Classification of speech dysfluencies with MFCC and LPCC features[J]. Expert Systems with Applications, 2012, 39(2): 2157-2165.
[25] Li Zhizhong, Teng Guanghui. Feature extraction for poultry vocalization recognition based on improved MFCC[J]. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 2008, 24(11): 202-205. (in Chinese with English abstract)
[26] Hinton G E. Learning multiple layers of representation[J]. Trends in Cognitive Sciences, 2007, 11(10): 428-434.
[27] Lecun Y, Bengio Y, Hinton G E. Deep learning[J]. Nature, 2015, 521: 436-444.
[28] Zhao Ming, Du Huifang, Dong Cuicui, et al. Diet health text classification based on word2vec and LSTM[J]. Transactions of the Chinese Society for Agricultural Machinery, 2017, 48(10): 202-208. (in Chinese with English abstract)
[29] Schuster M, Paliwal K K. Bidirectional recurrent neural networks[J]. IEEE Transactions on Signal Processing, 1997, 45(11): 2673-2681.
[30] Chen K, Huo Q. Training deep bidirectional LSTM acoustic model for LVCSR by a context-sensitive-chunk BPTT approach[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016, 24(7): 1185-1193.
[31] Woellmer M, Eyben F, Schuller B, et al. Spoken term detection with connectionist temporal classification: A novel hybrid CTC-DBN decoder[C]// International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2010: 5274-5277.
[32] Graves A, Fernandez S, Gomez F, et al. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks[C]// International Conference on Machine Learning. ACM, 2006: 369-376.
[33] Abu-Khzam F N, Fernau H, Langston M A, et al. A fixed-parameter algorithm for string-to-string correction[C]// Sixteenth Symposium on Computing: the Australasian Theory (CATS 2010). Australian Computer Society, 2010: 31-37.
Pig continuous cough sound recognition based on continuous speech recognition technology
Li Xuan1,2, Zhao Jian1,2, Gao Yun1,2, Liu Wanghong2,3, Lei Minggang2,3, Tan Hequn1,2
(1. College of Engineering, Huazhong Agricultural University, Wuhan 430070, China; 2. Cooperative Innovation Center for Sustainable Pig Production, Wuhan 430070, China; 3. College of Animal Science and Technology & College of Veterinary Medicine, Huazhong Agricultural University, Wuhan 430070, China)
Cough is one of the most frequent symptoms in the early stage of pig respiratory diseases, so it is possible to monitor and diagnose the diseases of pigs by detecting their coughs. The existing methods for pig cough recognition are based on isolated-word recognition technology, which cannot recognize samples that the model has not been trained on; another drawback is that these methods target isolated coughs, while the coughs of sick pigs are usually continuous. This paper intends to realize the recognition of pig continuous cough sound based on continuous speech recognition technology. Ten Landrace pigs, with a body weight of about 75 kg, were used as sound collection objects, and pig sounds were collected in pig farms during late winter and early spring when the respiratory diseases of pigs were prevalent. The sound collection devices worked continuously all day. By selecting the frequent coughing phases in the collected signal, a total of 30 h of pig farm sound signals was obtained as the experimental corpus. Firstly, the sound signals were denoised by the speech enhancement algorithm based on a psychoacoustical model. Then the time-domain characteristics, including the duration and energy of individual coughs, were studied, and it was found that the duration of pig cough ranged from 0.24 to 0.74 s and the energy ranged from 40.15 to 822.87 V²·s. So the threshold range of the sound samples was set with the duration bounds and the lower energy bound of individual coughs. Based on the threshold range, the speech endpoint detection algorithm based on short-time energy was used to detect the 30 h of pig farm sound signals which had been preprocessed by the speech enhancement algorithm, and 222 experimental sentences were obtained. The longest was 9.14 s and the shortest was 3.91 s. The 222 sentences contained a total of 1 145 sound samples, including 751 pig coughs and 394 non-pig coughs. Sounds in the pig farm environment, including cough, sneeze, eating, scream, hum and ear-shaking sounds of pigs, and sounds of dogs, metal clanging and other background noise, were divided into pig cough and non-pig cough, which were chosen as the acoustic modeling units. The labels of the experimental sentences were obtained with the help of experts. Then the 13-dimensional Mel frequency cepstral coefficients (MFCC) reflecting the static characteristics of pig sound were extracted, and the first-order differential coefficients reflecting the dynamic characteristics of pig sound were added to obtain the 26-dimensional MFCC, which were used as the feature parameters of the experimental sentences. Finally, the bidirectional long short-term memory-connectionist temporal classification (BLSTM-CTC) model was selected to recognize continuous pig sounds; specifically, the BLSTM network learned the features of continuous pig sounds, and the CTC directly modeled the alignment between the input continuous pig sound sequence and its labels. Through the 5-fold cross-validation experiment and analysis, the numbers of hidden-layer neurons in the BLSTM forward propagation process, the backward propagation process and the fully connected layer were all set to 300, and the learning rate was set to 0.001. The average recognition rate, error recognition rate and total recognition rate of the 5 groups of results were 92.40%, 3.55% and 93.77%, respectively.
Furthermore, an application test was carried out with another 1 h of data; the cough recognition rate reached 94.23%, the error recognition rate was 9.09%, and the total recognition rate was 93.24%, indicating that the pig cough sound recognition model based on continuous speech recognition technology is stable and reliable. This paper provides a reference for the recognition of continuous pig cough sounds and disease judgment in healthy pig farming.
signal processing; acoustic signal; recognition; pig industry; continuous cough; bidirectional long short-term memory-connectionist temporal classification; acoustic model
2018-11-09
2019-01-13
National Key Research and Development Program of China (2018YFD0500700); Independent Science and Technology Innovation Fund of Huazhong Agricultural University; Da Bei Nong Young Scholar Promotion Program of Huazhong Agricultural University (2017DBN005); China Agriculture Research System (CARS-36); National Innovation and Entrepreneurship Training Program for College Students (201810504074)
Li Xuan, Associate Professor, PhD, research interests: intelligent perception of pig information and behavior recognition. Email: lx@mail.hzau.edu.cn
10.11975/j.issn.1002-6819.2019.06.021
TN912.34
A
1002-6819(2019)-06-0174-07
Li Xuan, Zhao Jian, Gao Yun, Liu Wanghong, Lei Minggang, Tan Hequn. Pig continuous cough sound recognition based on continuous speech recognition technology[J]. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 2019, 35(6): 174-180. (in Chinese with English abstract) doi:10.11975/j.issn.1002-6819.2019.06.021 http://www.tcsae.org