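Lodging resistance prediction of maize varieties based on support vector machine and ReliefF algorithm
Zhang Tianliang, Zhang Dongxing, Cui Tao, Yang Li※, Ding Youqiang, Xie Chunji, Du Zhaohui, Zhong Xiangjun
(1. College of Engineering, China Agricultural University, Beijing 100083, China; 2. Key Laboratory of Soil-Machine-Plant System Technology, Ministry of Agriculture, Beijing 100083, China)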
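Current methods for identifying lodging resistance in maize varieties are time-consuming and laborious, and the breeding cycle for lodging-resistant varieties is long. To address this, the study used hyperspectral imaging combined with statistical learning to predict varietal lodging resistance during the vegetative growth period of maize. Field trials were conducted in 2018 and 2019 to collect hyperspectral imaging data from 8 maize varieties differing in lodging resistance. Spectral curves of the region of interest (ROI) were extracted with a region-identification method, and the data characteristics of lodging-resistant and lodging samples were analyzed. The filter-type feature selection algorithm ReliefF (Relevant Features), and principal component analysis (PCA) combined with ReliefF, were then used to mine the spectral classification features of lodging-resistant and lodging varieties. Finally, cross-validation was used to optimize the number of original spectral features selected by ReliefF and the number of principal component features selected by PCAReliefF. ReliefF-SVM and PCAReliefF-SVM support vector machine (SVM) classification models were built, and the penalty and kernel parameters of the SVM models were optimized to improve prediction. After feature optimization, 40 and 50 features were selected for modeling in the 2018 and 2019 trials, respectively, and the principal component features selected by PCAReliefF contained almost no redundant features compared with the original spectral features selected by ReliefF. After optimization of the penalty and kernel parameters, the ReliefF-SVM and PCAReliefF-SVM models classified the test-set samples with accuracies of 84.17% and 85.00% in the 2018 trial, and 84.17% and 85.83% in the 2019 trial. Hyperspectral imaging data and statistical learning can therefore achieve early prediction of lodging resistance in maize varieties, the PCAReliefF-SVM model has better overall performance than the ReliefF-SVM model, and the trials offer a method and reference for efficient screening of lodging-resistant maize varieties.
Keywords: principal component analysis; variety; support vector machine; maize; lodging resistance; ReliefF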
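Maize is one of China's three major food crops, and ensuring high and efficient maize yield is of great significance to national food security. Lodging is an important factor limiting grain yield and mechanized harvesting efficiency, so methods for predicting lodging resistance matter for screening lodging-resistant varieties and shortening the breeding cycle.
The factors affecting maize lodging are mainly internal (genetics, plant morphology, stalk and root characteristics, etc.) and external (natural conditions, cultivation practices, etc.) [1-2]. Current research at home and abroad on evaluating the lodging resistance of maize varieties concentrates on the reproductive growth period, and reflects varietal lodging resistance through multi-gene mapping of lodging traits [3], canopy light intensity [4], the mechanical properties and microstructure of the stalk [5-8], and canopy plant morphology [9]. For example, Wei et al. [3] studied the maize ZmSPL (Zea mays Squamosa-Promoter Binding Protein-Like) gene family, which is associated with plant height, ear height, leaf angle, and stalk strength, and used gene selection to identify and evaluate varieties that resist lodging and tolerate dense planting. Xue et al. [4] studied the effect of the canopy light environment on stalk strength and lodging rate, and showed that varieties with smaller upper-canopy leaves, larger middle-canopy leaves, and medium-sized lower-canopy leaves resist lodging better. Zhang et al. [5] studied the relationship between micro-anatomical features at the stem nodes and the biomechanical properties of the stalk, and showed that stalk micro-phenotypes are important indicators for predicting stalk mechanical properties and evaluating varietal lodging resistance. Maize is more prone to lodging during the reproductive growth period, and the traits measured at that time characterize lodging resistance directly. However, trials in the reproductive period have long cycles and high costs and are time-consuming and laborious. Predicting lodging resistance earlier, for example at the nine-leaf stage, would substantially improve the efficiency of screening lodging-resistant varieties.
Hyperspectral imaging fuses spectroscopy with imaging and integrates image and spectrum, so it can both observe the external phenotype of plants and measure their internal physicochemical properties. Hyperspectral technology has been used for maize variety identification, stress research, and physiological monitoring [10-14], but it has rarely been applied to predicting the lodging resistance of maize varieties. This study uses hyperspectral imaging combined with statistical learning to study the lodging-resistance characteristics of maize varieties during the vegetative growth period and to predict lodging-resistant and lodging varieties early, in order to provide a method and reference for efficient screening of lodging-resistant maize varieties.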
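1.1.1 Experimental procedure
Trials were conducted in 2018 and 2019 at the Wuqiao Experimental Station of China Agricultural University, Wuqiao County, Cangzhou City, Hebei Province (37°41′02″N, 116°37′23″E). The tested hybrids were 8 summer maize varieties suited to the Huang-Huai-Hai region: Denghai 605 (DH605), Jingdan 28 (JD28), Liyu 37 (LY37), Longping 206 (LP206), Longping 208 (LP208), Shengrui 999 (SR999), Woyu 964 (WY964), and Xianyu 335 (XY335). The trial was a single-factor variety trial in a randomized block design, addressing the early prediction of lodging resistance across varieties. Each plot was 5 m long and 4.8 m wide and contained 9 rows of maize with a row length of 5 m, sown by hand at a plant spacing of 22.2 cm and a row spacing of 60 cm (planting density about 75 000 plants/hm2). Each variety had 3 replicates, giving 24 plots in total. Sowing took place on June 15, 2018 and June 16, 2019. Compound fertilizer was side-dressed once after sowing at 720 kg/hm2 (N∶P2O5∶K2O = 24∶8∶10, active ingredient mass fraction ≥42%). Growth-regulator products were prohibited throughout the growth cycle, and all other field management followed local field practice.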
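1.1.2 Lodging characteristic index of maize
After the maize matured, the actual field lodging rate of each variety was counted manually, covering three cases: root lodging, stalk bending, and stalk breakage [1-2]. This rate was the basis for judging varietal lodging resistance.
The field lodging rate of each variety was recorded at the dough stage as the percentage (%) of lodged plants among all plants in the plots of that variety (border rows excluded). A lodging rate of 5% was chosen as the criterion separating lodging-resistant (LR) from lodging (L) varieties; the results are given in Table 1. The lodging rates of each variety differed little between the 2 years, and there were 4 lodging-resistant and 4 lodging varieties.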
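The labeling rule above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, and treating a rate of exactly 5% as lodging (L) is an assumption, since the text does not say how the boundary case is handled.

```python
def lodging_rate(lodged_plants: int, total_plants: int) -> float:
    """Field lodging rate in percent (border rows excluded from both counts)."""
    if total_plants <= 0:
        raise ValueError("total_plants must be positive")
    return 100.0 * lodged_plants / total_plants


def classify_variety(rate_percent: float, threshold: float = 5.0) -> str:
    """Label a variety LR (lodging-resistant) or L (lodging) by the 5% criterion."""
    # Assumption: a rate exactly equal to the threshold counts as lodging (L).
    return "LR" if rate_percent < threshold else "L"
```

For example, a variety with 3 lodged plants out of 100 counted plants would be labeled LR, and one with a 12.5% rate would be labeled L.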
Table 1 Lodging rates of different maize varieties in 2018 and 2019
Note: LR: lodging-resistant; L: lodging; the same below.
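1.2.1 Hyperspectral image acquisition
When the maize reached the nine-leaf stage, the 9th fully expanded leaf was sampled at random within each plot (excluding the outermost row, to avoid the border-row growth advantage). In each of the 24 plots, 21 leaves were sampled, giving 504 leaf samples in total. The samples from each plot were sealed in zip-lock bags, placed in a cooler box with ice packs, taken back to the experimental station, and imaged in the laboratory. As shown in Fig. 1, the hyperspectral imaging system comprised an SOC710VP hyperspectral imaging spectrometer (Surface Optics Corporation, USA) with its data acquisition software HyperScanner and data processing software SRAnal710, an E27 light source (four halogen neodymium lamps, 100 W/220 V, Sun Glo, USA), an optical dark box, and a lifting sample stage.
The SOC710VP acquires images by built-in translational push-broom scanning over a wavelength range of 374 to 1038 nm with a spectral resolution of 4.68 nm. It captures grayscale images in 128 bands in a single scan, and a spectral curve of 128 data points can be extracted from each pixel of the image. To improve efficiency, 3 leaf samples were imaged at a time, with 7 scans per plot; the spectral curve of each leaf was later extracted separately by image processing. During imaging, the leaves and the standard panel were placed together on the stage, with the instrument 75 cm directly above the leaves and perpendicular to them. After the focus and exposure time had been adjusted, HyperScanner was used to control the camera and save the files.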
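1.2.2 Reflectance extraction
After imaging, SRAnal710 was used for spectral calibration, radiometric calibration, and reflectance conversion. The reflectance conversion formula is:

$$R = \frac{I}{I_{\mathrm{std}}} \, R_{\mathrm{std}}$$

where $R$ is the reflectance of the corrected image; $I$ is the reflected intensity of the raw image, cd; $I_{\mathrm{std}}$ is the reflected intensity of the standard panel region, cd; and $R_{\mathrm{std}}$ is the reflectance of the standard panel region, which is known. The corrected hyperspectral image files were obtained after this conversion.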
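As a small illustration of the conversion, the sketch below applies the ratio band by band to one pixel. It assumes the raw intensity, the panel intensity, and the panel reflectance are given as equal-length lists, one value per band; the function name is hypothetical and the SRAnal710 software may implement the step differently.

```python
def to_reflectance(raw, std_raw, std_reflectance):
    """Convert raw intensities to reflectance, band by band.

    raw             -- reflected intensity of the sample pixel per band
    std_raw         -- reflected intensity of the standard panel per band
    std_reflectance -- known reflectance of the standard panel per band
    """
    if not (len(raw) == len(std_raw) == len(std_reflectance)):
        raise ValueError("all inputs must have one value per band")
    # R = I / I_std * R_std, evaluated independently for each band
    return [i / i_std * r_std
            for i, i_std, r_std in zip(raw, std_raw, std_reflectance)]
```

A pixel that returns half the intensity of a panel whose reflectance is 0.5 therefore has a reflectance of 0.25 in that band.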
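Compared with single-pixel spectra, a mean spectrum reduces the data volume, keeps similar pixel spectra shared between classes from interfering with classification, and avoids misclassification of pixels at the curved edges of the leaf, which would degrade model performance [15]. Because the maize leaf surface is uneven, the images contain normal reflection regions, dark reflection regions, and leaf vein regions (Fig. 2a). To obtain the spectral curve of the target region, i.e. the normal reflection region, the image is segmented and clustered, and the mean spectrum of the normal reflection region is taken as the spectral curve of the leaf. The procedure is illustrated with one group of leaf samples (Fig. 2): 1) examine the RGB image of the sample leaves (Fig. 2a) to locate the normal reflection, dark reflection, and vein regions; 2) examine the spectral curves of each region class (Fig. 2b) to choose the bands for threshold segmentation: as Fig. 2b shows, at 779 nm the reflectance of the leaf target classes is significantly higher than that of the other classes, and at 470 nm the 3 leaf target classes are clearly separated from one another, so the band images at 470 and 779 nm were chosen for segmentation; 3) segment the image: as shown in Fig. 2c, first extract the regions with reflectance greater than 0.3 in the 779 nm image as the 3 leaf regions, then cluster each leaf region in the 470 nm image with the K-means algorithm [16-18] into 3 classes: the normal reflection region (green in Fig. 2d), the dark reflection region (blue in Fig. 2d), and the vein region (red in Fig. 2d); 4) extract the mean spectrum of the normal reflection region as the reflectance spectrum of the leaf (Fig. 2e).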
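The four steps can be sketched for a single leaf as follows. This is a simplified illustration in plain Python, not the authors' implementation. It assumes each pixel is a list of band reflectances, that `idx_779` and `idx_470` are the (hypothetical) indices of the 779 nm and 470 nm bands, and that the normal reflection region is the largest of the three clusters. The text does not state how the clusters were labeled, and separating the three leaves in one image is omitted here.

```python
def kmeans_1d(values, k=3, iters=50):
    """Plain 1-D K-means with a deterministic, quantile-based initialization."""
    srt = sorted(values)
    centers = [srt[(2 * j + 1) * len(srt) // (2 * k)] for j in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assign every value to its nearest center
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


def mean_leaf_spectrum(pixels, idx_779, idx_470, threshold=0.3):
    """Mean spectrum of the assumed normal reflection region of one leaf."""
    # Step 3a: keep pixels whose 779 nm reflectance exceeds the threshold (leaf area)
    leaf = [p for p in pixels if p[idx_779] > threshold]
    # Step 3b: cluster the leaf pixels into 3 classes on their 470 nm reflectance
    labels, _ = kmeans_1d([p[idx_470] for p in leaf], k=3)
    # Assumption: the largest cluster is the normal reflection region
    counts = [labels.count(c) for c in range(3)]
    target = counts.index(max(counts))
    chosen = [p for p, lab in zip(leaf, labels) if lab == target]
    # Step 4: average the chosen pixels band by band
    return [sum(p[b] for p in chosen) / len(chosen) for b in range(len(chosen[0]))]
```

In practice the cluster-to-region assignment would be checked against the RGB image, as in step 1 of the procedure.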
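1.2.3 Hyperspectral data preprocessing
Reflectance extraction from every leaf sample gave 504 sample spectral curves in each year, from which abnormal samples that clearly deviated from the data center were removed manually. The samples of each variety were then ranked separately with the Kennard-Stone algorithm and divided into training and test samples at a ratio of 3∶1. Finally, the per-variety splits were combined into the final training and test sets, so that both sets were evenly distributed across varieties. This gave 378 training and 120 test samples for the 2018 trial, and 383 training and 120 test samples for the 2019 trial. The detailed split is given in Table 2.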
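A minimal sketch of Kennard-Stone ranking followed by a 3∶1 split is given below, to be applied to the samples of one variety at a time. It uses Euclidean distance and simple rounding for the training-set size; both are assumptions, as are the function names, because the text does not give these details.

```python
import math


def kennard_stone_order(samples):
    """Return sample indices in Kennard-Stone order (needs at least 2 samples)."""
    n = len(samples)
    dist = [[math.dist(a, b) for b in samples] for a in samples]
    # Start from the two samples that are farthest apart
    i0, j0 = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                 key=lambda p: dist[p[0]][p[1]])
    order = [i0, j0]
    remaining = set(range(n)) - {i0, j0}
    while remaining:
        # Next: the sample whose nearest already-selected neighbor is farthest away
        nxt = max(sorted(remaining), key=lambda r: min(dist[r][s] for s in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order


def split_3_to_1(samples):
    """Split one variety's samples into training and test indices at about 3:1."""
    order = kennard_stone_order(samples)
    n_train = round(len(order) * 3 / 4)  # assumption: simple rounding
    return order[:n_train], order[n_train:]
```

Because the earliest-ranked samples span the extremes of the data, the training set covers the spectral range of each variety and the test set falls inside it.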
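Multiplicative scatter correction (MSC) was applied to each screened training spectrum to remove the baseline shift and offset caused by scattering differences between samples, while retaining as far as possible the spectral information related to chemical composition. The band variables of the training set were then standardized; scaling each band feature proportionally highlights the differences between spectra and improves the predictive ability of the model. Finally, on the assumption that the test set follows the same distribution as the training set, the test data were processed by MSC and standardization with the parameters obtained from the training set. The conversion formula for the MSC and standardization processing is as follows:

$$x_{\mathrm{std}}(i,j) = \frac{x(i,j) - \mu_j}{\sigma_j}$$

where $x(i,j)$ is the $j$-th variable of the $i$-th sample, $x_{\mathrm{std}}(i,j)$ is its standardized value, $\mu_j$ is the mean of the $j$-th feature variable in the training set, and $\sigma_j$ is the standard deviation of the $j$-th feature variable in the training set.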
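The preprocessing chain can be sketched as below. The MSC step uses the common formulation in which each spectrum is regressed on the mean training spectrum and then corrected with the fitted offset and slope; the text does not spell out the MSC formula, so that formulation is an assumption. The use of the population standard deviation and all function names are also assumptions.

```python
import statistics


def fit_msc_reference(train):
    """Mean training spectrum, used as the MSC reference."""
    return [statistics.fmean(col) for col in zip(*train)]


def msc(spectrum, reference):
    """Fit spectrum = a + b * reference by least squares, return (spectrum - a) / b."""
    mx, my = statistics.fmean(reference), statistics.fmean(spectrum)
    sxx = sum((r - mx) ** 2 for r in reference)
    sxy = sum((r - mx) * (s - my) for r, s in zip(reference, spectrum))
    b = sxy / sxx
    a = my - b * mx
    return [(s - a) / b for s in spectrum]


def fit_scaler(train):
    """Per-band mean and standard deviation estimated from the training set only."""
    cols = list(zip(*train))
    mu = [statistics.fmean(c) for c in cols]
    sigma = [statistics.pstdev(c) for c in cols]  # assumption: population std
    return mu, sigma


def standardize(sample, mu, sigma):
    """Apply x_std = (x - mu_j) / sigma_j with the training-set parameters."""
    return [(x - m) / s for x, m, s in zip(sample, mu, sigma)]
```

The reference spectrum, means, and standard deviations are fitted once on the training set and then reused unchanged for the test set, which mirrors the same-distribution assumption stated above.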
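1.3.1 Feature selection and extraction
Two methods were used to extract feature variables: 1) the ReliefF (Relevant Features) method; 2) the PCAReliefF method, i.e. principal component analysis (PCA) combined with ReliefF.
1) ReliefF method
ReliefF is a typical filter-type feature selection method that measures the importance of each feature through a relevance statistic and assigns a weight accordingly. Its basic idea is to evaluate how well each feature variable distinguishes the $k$ nearest-neighbor samples. The relevance-statistic component is increased for feature variables that help separate samples of different classes and decreased for feature variables that hinder the separation. The estimates obtained from the individual samples are finally averaged, and the larger the weight of a component, the stronger the classification ability of the corresponding feature variable. The number of nearest neighbors $k$ usually affects the feature weights: if $k$ is too small, the weight estimates are easily affected by noisy data, and if $k$ is too large, important feature variables may be missed. Different values of $k$ therefore need to be tried, and features are selected by observing the stability of the feature variables [19-20]. The ReliefF weight update formula, in the form given in [19], is as follows:

$$W(A) = W(A) - \sum_{j=1}^{k} \frac{\mathrm{diff}(A, R, H_j)}{mk} + \sum_{C \neq \mathrm{class}(R)} \left[ \frac{P(C)}{1 - P(\mathrm{class}(R))} \sum_{j=1}^{k} \frac{\mathrm{diff}(A, R, M_j(C))}{mk} \right]$$

where $W(A)$ is the weight of feature $A$, $R$ is a sampled instance, $H_j$ is its $j$-th nearest neighbor of the same class, $M_j(C)$ is its $j$-th nearest neighbor of class $C$, $m$ is the number of sampled instances, $P(C)$ is the prior probability of class $C$, and $\mathrm{diff}$ is the normalized difference between two samples on feature $A$.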
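For the two-class case of this study (LR versus L), the weight update can be sketched compactly. The version below is a simplified illustration rather than the authors' implementation: it visits every training sample once instead of drawing random instances, uses a range-normalized Manhattan distance, and drops the class-prior factor, which equals 1 when there are only two classes. The function name is hypothetical.

```python
def relieff_binary(X, y, k):
    """Simplified two-class ReliefF; returns one weight per feature."""
    n, d = len(X), len(X[0])
    # Range of each feature, used to normalize differences (1.0 if constant)
    spans = [(max(r[a] for r in X) - min(r[a] for r in X)) or 1.0 for a in range(d)]

    def diff(a, i, j):
        return abs(X[i][a] - X[j][a]) / spans[a]

    def distance(i, j):
        return sum(diff(a, i, j) for a in range(d))

    w = [0.0] * d
    for i in range(n):
        others = sorted((j for j in range(n) if j != i), key=lambda j: distance(i, j))
        hits = [j for j in others if y[j] == y[i]][:k]      # nearest same-class samples
        misses = [j for j in others if y[j] != y[i]][:k]    # nearest other-class samples
        for a in range(d):
            # Penalize features that differ among hits, reward those that differ among misses
            w[a] -= sum(diff(a, i, j) for j in hits) / (n * k)
            w[a] += sum(diff(a, i, j) for j in misses) / (n * k)
    return w
```

Running the function for a range of `k` values and comparing the resulting weight vectors is one way to check the stability that the text uses as its selection criterion.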
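2) Feature selection and extraction based on ReliefF and principal component analysis
PCA transforms a set of correlated variables into a new coordinate system by a linear transformation. It projects the data from the high-dimensional space to a low-dimensional one along the directions of greatest variance of the sample matrix, so that the largest variance lies on the first coordinate axis, the second largest on the second axis, and so on. The principal components obtained are mutually orthogonal and carry non-overlapping information, which effectively resolves the multicollinearity common in spectral data and removes redundant features [21-22]. In practice, retaining only the first few principal components with the largest variance contribution rates captures the main information in the original data.
In summary, ReliefF assigns high weights to all features that are related to the class, regardless of whether a feature is redundant. PCA removes the correlation between features, but it extracts information only from the properties of the data set itself and takes no account of the class labels. This paper combines the advantages of the two into the PCAReliefF feature selection method: PCA first projects the sample data to eliminate collinearity between features, and ReliefF then selects the principal component features most relevant to the sample class for modeling, in order to achieve better modeling and prediction.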
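The first half of PCAReliefF, projecting the spectra onto principal components, can be sketched in plain Python with power iteration and deflation. This is an illustrative stand-in for a library PCA routine and not the authors' code; the function name, the fixed iteration count, and the deterministic start vector are assumptions, and the simple start vector can fail in the rare case that it is orthogonal to a component.

```python
import math


def pca_scores(X, n_components, iters=200):
    """Return (scores, variances) for the leading principal components."""
    n, d = len(X), len(X[0])
    mean = [sum(r[a] for r in X) / n for a in range(d)]
    Xc = [[r[a] - mean[a] for a in range(d)] for r in X]          # centered data
    cov = [[sum(r[a] * r[b] for r in Xc) / n for b in range(d)] for a in range(d)]
    components, variances = [], []
    for _ in range(n_components):
        norm0 = math.sqrt(sum((a + 1) ** 2 for a in range(d)))
        v = [(a + 1) / norm0 for a in range(d)]                   # deterministic start
        for _ in range(iters):                                    # power iteration
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        lam = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
        components.append(v)
        variances.append(lam)
        # Deflation: remove the component just found from the covariance matrix
        cov = [[cov[a][b] - lam * v[a] * v[b] for b in range(d)] for a in range(d)]
    scores = [[sum(r[a] * c[a] for a in range(d)) for c in components] for r in Xc]
    return scores, variances
```

In the PCAReliefF scheme, the columns of `scores` would then be ranked by a ReliefF-style class-relevance weight rather than by `variances`, so a low-variance component can still be selected if it separates the classes well.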
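1.3.2 Model construction
The original spectral features selected by ReliefF and the principal component features selected by PCAReliefF were input to support vector machine (SVM) models for training, and the number of features was optimized by cross-validation. SVM model construction and parameter optimization were carried out with the libSVM toolkit [23].
SVMs are suited to classification problems with small samples, nonlinearity, and high dimensionality. The separating hyperplane is determined by the support vectors alone, so little data is needed. Data that are not linearly separable in a low-dimensional space are more likely to be separable in a higher-dimensional space, with probability 1 as the dimension tends to infinity. The basic principle of the SVM is first to map data that are linearly inseparable in the feature space into a higher-dimensional space where they become linearly separable, and then to find an optimal hyperplane in that space which separates the classes linearly with the maximum margin [24-27]. In essence, the optimization problem solved by the SVM, written in the standard soft-margin form used by libSVM [23], is:

$$\min_{w,\,b,\,\xi} \; \frac{1}{2} w^{\mathrm{T}} w + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i \left( w^{\mathrm{T}} \phi(x_i) + b \right) \geq 1 - \xi_i, \;\; \xi_i \geq 0, \;\; i = 1, \dots, n$$

where $w$ and $b$ define the hyperplane, $\phi$ is the mapping to the higher-dimensional space, $\xi_i$ are slack variables, $y_i \in \{-1, +1\}$ is the class label of sample $x_i$, and $C$ is the penalty parameter.
The penalty parameter and the kernel parameter are the 2 key parameters of the SVM method and largely determine the learning ability and prediction performance of the model.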
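To make the role of the kernel parameter concrete, the sketch below evaluates the Gaussian (RBF) kernel that the conclusions of this paper name as the kernel used. The symbol `g` follows the libSVM convention for the kernel width and is an assumption here, since the original symbols did not survive in the text.

```python
import math


def rbf_kernel(x, z, g):
    """Gaussian (RBF) kernel K(x, z) = exp(-g * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-g * squared_distance)
```

A larger `g` makes the similarity between two spectra fall off faster with distance, which gives a more flexible decision boundary and a higher risk of overfitting. The penalty parameter plays the complementary role of setting how strongly training errors are penalized.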
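This study uses the prediction accuracy (ACC) on the test set as the model evaluation index. Accuracy is the proportion of all samples, lodging-resistant and lodging alike, that are predicted correctly. Receiver operating characteristic (ROC) curves, with the true positive rate (TPR) and the false positive rate (FPR) as axes, were also plotted to compare the prediction performance of different models. The true positive rate is the proportion of actual lodging-resistant samples that are predicted correctly, and the false positive rate is the proportion of actual lodging samples that are wrongly predicted as lodging-resistant [16]. ROC curves give an overall comparison of model performance: the larger the area under the curve, the better the model [28-30].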
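These three indices follow directly from the confusion-matrix counts. The sketch below takes the lodging-resistant class (LR) as the positive class, consistent with the definitions above; the function name and the label strings are illustrative assumptions.

```python
def classification_metrics(y_true, y_pred, positive="LR"):
    """Return (accuracy, true positive rate, false positive rate)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1   # lodging-resistant predicted as lodging-resistant
            else:
                fn += 1   # lodging-resistant predicted as lodging
        else:
            if pred == positive:
                fp += 1   # lodging predicted as lodging-resistant
            else:
                tn += 1   # lodging predicted as lodging
    acc = (tp + tn) / (tp + tn + fp + fn)
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return acc, tpr, fpr
```

As a point of reference for the accuracies reported in this paper, with 120 test samples the values 84.17%, 85.00%, and 85.83% correspond to 101, 102, and 103 correct predictions.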
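Fig. 3 shows the sample spectral curves of the 2-year trials. Quartile curves show the band distributions of the lodging-resistant and lodging samples in the raw spectral data, and coefficient of variation curves show the degree of variation of each wavelength feature within the two groups. The quartile curves show that the raw spectra of the two groups follow essentially the same trend but overlap heavily, so the two groups are not clearly separated spectrally. The coefficient of variation curves show that, within the same year, the curves of the two groups follow the same trend, but in the 400 to 700 nm range the curve of the lodging samples is clearly higher than that of the lodging-resistant samples. The spectra of the lodging-resistant samples are therefore more concentrated than those of the lodging samples in this range, and the 400 to 700 nm band may be a sensitive band for distinguishing the two groups.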
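The coefficient of variation curve for one group of samples can be computed band by band as the ratio of standard deviation to mean. The sketch below is illustrative only: the function name is hypothetical, and the use of the population standard deviation is an assumption.

```python
import statistics


def cv_curve(spectra):
    """Coefficient of variation (std / mean) of each band across the samples."""
    bands = list(zip(*spectra))
    # Assumption: population standard deviation; band means are non-zero
    return [statistics.pstdev(b) / statistics.fmean(b) for b in bands]
```

Calling the function separately on the lodging-resistant and the lodging spectra gives the two curves that are compared in the text.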
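ReliefF was applied to the training spectra with different numbers of nearest neighbors $k$ to compute the classification weight of each wavelength. PCA was also applied to the training data first, and the weight of each principal component was then computed for different values of $k$. The weights of all wavelengths and principal components stabilized and no longer changed with $k$ when $k \geq 38$ (2018 data) and $k \geq 39$ (2019 data), so the weights at these values can serve as the basis for selecting classification features. The band weights are shown in Fig. 4a: the weights differ somewhat between the 2 years but follow essentially the same trend, with relatively high weights near 400, 750, and 1 000 nm, which are the more important classification bands. The principal components were sorted by contribution rate from high to low and their classification weights are plotted in Fig. 4b. Most of the principal components with high classification weights fall within the first 60; overall, the classification weight decreases as the contribution rate decreases, but a component with a high contribution rate does not necessarily have a high classification weight. A comparison of Fig. 4a and Fig. 4b shows that the principal component weights form steep, sharp peaks: adjacent components are essentially uncorrelated and their weights differ greatly. The wavelength weights, in contrast, form relatively smooth peaks: adjacent wavelengths are highly correlated and have similar weights. This indicates that ReliefF often selects redundant features from highly correlated adjacent bands, whereas the features selected by PCAReliefF are essentially free of redundancy.
The wavelengths and the principal components were each sorted by classification weight from high to low (128 feature variables in total) and divided into 21 groups with a step of 5 features, and the selected wavelengths and principal components were fed into the model group by group. SVM models were trained with the libSVM toolkit [23], the penalty and kernel parameters were optimized by grid search, and each trained model was evaluated by 20-fold cross-validation. The search range of the penalty parameter was $2^{-5}, 2^{-3}, \dots, 2^{29}$, and that of the kernel parameter was $2^{-27}, 2^{-25}, \dots, 2^{13}$. The two parameters were optimized for every feature group, and the combination with the highest cross-validation accuracy was taken as the best one. As shown in Fig. 5, the best combination of penalty and kernel parameters was $2^{-9}$ and $2^{13}$, with a cross-validation accuracy of 91.01%.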
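The search grid described above is easy to enumerate. In the sketch below, `score` stands in for a cross-validated accuracy, for example the 20-fold value returned by an SVM toolkit; it is a hypothetical callable, and the helper names are likewise illustrative rather than part of libSVM.

```python
from itertools import product


def power_of_two_grid(lo_exp, hi_exp, step=2):
    """Values 2^lo_exp, 2^(lo_exp + step), ..., up to 2^hi_exp."""
    return [2.0 ** e for e in range(lo_exp, hi_exp + 1, step)]


def grid_search(penalty_values, kernel_values, score):
    """Return the (penalty, kernel) pair with the highest score, and that score."""
    best = max(product(penalty_values, kernel_values), key=lambda pk: score(*pk))
    return best, score(*best)


penalty_grid = power_of_two_grid(-5, 29)    # 2^-5, 2^-3, ..., 2^29
kernel_grid = power_of_two_grid(-27, 13)    # 2^-27, 2^-25, ..., 2^13
```

With these ranges the grid holds 18 penalty values and 21 kernel values, so each feature group requires 378 cross-validated fits before any local refinement.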
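The cross-validation accuracy for the best parameter combination of each feature group was then plotted as a curve (Fig. 6). The curves all start above 65%, which shows that the first 5 features found by the feature selection algorithms have high classification weights and a clear classification effect. PCAReliefF finds the key classification features faster and reaches a high classification accuracy sooner, whereas with ReliefF the accuracy rises gradually as features are added, which indicates that ReliefF selects more redundant features. Balancing model accuracy against computational complexity, 40 and 50 features were chosen for the final models for the 2018 and 2019 data, respectively. The top 40 and 50 features by classification weight were fed into the models for the 2 years, the parameters were optimized again locally around the corresponding best values, and the final modeling parameters are given in Table 3.
Table 3 Final model parameters and prediction performance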
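Finally, the models were trained with all training samples and the selected final parameters and were tested on the test set. The training and test accuracies are given in Table 3, the confusion matrices of the predictions in Table 4, and the ROC curves in Fig. 7. Table 3 shows test accuracies of 85.00% and 85.83% for the PCAReliefF-SVM model over the 2 years and 84.17% in both years for the ReliefF-SVM model, so PCAReliefF-SVM predicts better. The confusion matrices for the test set show that every model misidentifies lodging samples (L) more often than lodging-resistant samples (LR), i.e. the number of false positives (FP) exceeds the number of false negatives (FN) in every model, so the models are relatively less sensitive to lodging samples. In the ROC evaluation, the curve of PCAReliefF-SVM almost completely encloses that of ReliefF-SVM, which indicates better overall performance. When the true positive rate (lodging-resistant samples predicted as lodging-resistant) and the false positive rate (lodging samples predicted as lodging-resistant) are considered equally important, i.e. the true positive rate is maximized and the false positive rate minimized at the same time, PCAReliefF-SVM still outperforms ReliefF-SVM.
Table 4 Confusion matrices of the final model predictions
These results show that PCAReliefF, which applies PCA first, finds the main classification features faster than ReliefF. The PCAReliefF-SVM modeling approach is better than ReliefF-SVM on all indices, is more efficient to build, and has better overall performance.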
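This study used hyperspectral imaging to classify and predict the lodging resistance of maize varieties at an early stage. It proposed methods for spectral extraction, feature analysis, and predictive modeling that bring forward the time at which lodging resistance can be assessed and improve the efficiency of screening varieties for lodging resistance. The main conclusions are as follows:
1) An accurate spectral reflectance extraction method based on class-region identification was proposed. It extracts the region-of-interest spectra from hyperspectral images of maize leaves automatically and is more efficient than manual selection of the target region;
2) The filter-type feature extraction method that combines spectral PCA with ReliefF mines the typical classification features of lodging-resistant and lodging samples directly while avoiding redundant features, which reduces computational complexity and improves model efficiency;
3) The PCAReliefF-SVM method, a Gaussian-kernel SVM with precisely optimized parameters, predicted samples of unknown lodging resistance with an accuracy of no less than 85.00%.
The results provide a reliable approach for research on maize lodging resistance, demonstrate the potential of hyperspectral imaging for early prediction of lodging resistance, and are of value for improving the efficiency of breeding lodging-resistant maize varieties.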
[1] Yang Deguang, Ma Dezhi, Yu Qiaoqiao, et al. Research progress on influencing factors of lodging and lodging resistance in maize[J]. Journal of China Agricultural University, 2020, 25(7): 28-38. (in Chinese with English abstract)
[2] Xue J, Xie R Z, Zhang W F, et al. Research progress on reduced lodging of high-yield and -density maize[J]. Journal of Integrative Agriculture, 2017, 16(12): 2717-2725.
[3] Wei H B, Zhao Y P, Xie Y R, et al. Exploiting SPL genes to improve maize plant architecture tailored for high-density planting[J]. Journal of Experimental Botany, 2018, 69(20): 4675-4688.
[4] Xue J, Gou L, Zhao Y S, et al. Effects of light intensity within the canopy on maize lodging[J]. Field Crops Research, 2016, 188: 133-141.
[5] Zhang Y, Du J J, Wang J L, et al. High-throughput micro-phenotyping measurements applied to assess stalk lodging in maize (Zea mays L.)[J]. Biological Research, 2018, 51: 40.
[6] Al-Zube L A, Robertson D J, Edwards J N, et al. Measuring the compressive modulus of elasticity of pith-filled plant stems[J]. Plant Methods, 2017, 13: 99.
[7] Huang J L, Liu W Y, Zhou F, et al. Mechanical properties of maize fibre bundles and their contribution to lodging resistance[J]. Biosystems Engineering, 2016, 151: 298-307.
[8] Al-Zube L, Sun W, Robertson D, et al. The elastic modulus for maize stems[J]. Plant Methods, 2018, 14: 11.
[9] Chang Jianfeng, Zhang Haihong, Li Hongping, et al. Effects of different row spaces on canopy structure and resistance of summer maize[J]. Acta Agronomica Sinica, 2016, 42(1): 104-112. (in Chinese with English abstract)
[10] Xia C, Yang S, Huang M, et al. Maize seed classification using hyperspectral image coupled with multi-linear discriminant analysis[J/OL]. Infrared Physics & Technology, 2019. [2019-10-14]. https://doi.org/10.1016/j.infrared.2019.103077.
[11] Zhang F, Zhou G. Estimation of vegetation water content using hyperspectral vegetation indices: A comparison of crop water indicators in response to water stress treatments for summer maize[J]. BMC Ecology, 2019, 19: 18.
[12] Trachsel S, Dhliwayo T, Perez L G, et al. Estimation of Physiological Genomic Estimated Breeding Values (PGEBV) combining full hyperspectral and marker data across environments for grain yield under combined heat and drought stress in tropical maize (Zea mays L.)[J/OL]. Plos One, 2019, 14(3). [2019-03-20]. https://pubmed.ncbi.nlm.nih.gov/30893307/.
[13] Qin H M, Wang C, Zhao K G, et al. Estimation of the fraction of absorbed Photosynthetically Active Radiation (fPAR) in maize canopies using LiDAR data and hyperspectral imagery[J/OL]. Plos One, 2018, 13(5). [2018-05-29]. https://pubmed.ncbi.nlm.nih.gov/29813094/.
[14] Feng L, Zhu S S, Zhang C, et al. Identification of maize kernel vigor under different accelerated aging times using hyperspectral imaging[J]. Molecules, 2018, 23(12): 3078.
[15] Munera S, Amigo J M, Aleixos N, et al. Potential of VIS-NIR hyperspectral imaging and chemometric methods to identify similar cultivars of nectarine[J]. Food Control, 2018, 86: 1-10.
[16] Xie Wenyong, Chai Qinqin, Gan Yonghui, et al. Strains classification of anoectochilus roxburghii using multi-feature extraction and Stacking ensemble learning[J]. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 2020, 36(14): 203-210. (in Chinese with English abstract)
[17] Arthur D, Vassilvitskii S. K-means++: The advantages of careful seeding[C]. Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. New Orleans, LA: ACM, 2007.
[18] Wang Jun, Zhang Haiyang, Zhao Kaixuan, et al. Cow movement behavior classification based on optimal binary decision-tree classification model[J]. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 2018, 34(18): 202-210. (in Chinese with English abstract)
[19] Robnik-Sikonja M, Kononenko I. Theoretical and empirical analysis of ReliefF and RReliefF[J]. Machine Learning, 2003, 53(1/2): 23–69.
[20] Dai Jianguo, Zhang Guoshun, Guo Peng, et al. Classification method of main crops in northern Xinjiang based on UAV visible waveband images[J]. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 2018, 34(18): 122-129. (in Chinese with English abstract)
[21] Li X B, Wang Y S, Fu L H. Monitoring lettuce growth using K-means color image segmentation and principal component analysis method[J]. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 2016, 32(12): 179-186. (in English with Chinese abstract)
[22] Chen Y S, Zhao X, Jia X P. Spectral-spatial classification of hyperspectral data based on deep belief network[C]. 2015 IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing. Piscataway, NJ: IEEE Press, 2015, 8(6): 2381-2392.
[23] Chang C C, Lin C J. LIBSVM: A library for support vector machines[J]. ACM Transactions on Intelligent Systems and Technology, 2007, 2(3): 1-27.
[24] Cruz-Tirado J P, Pierna J A F, Rogez H, et al. Authentication of cocoa (Theobroma cacao) bean hybrids by NIR-hyperspectral imaging and chemometrics[J/OL]. Food Control, 2020, 118: 107445. [2020-06-28]. https://doi.org/10.1016/j.foodcont.2020.107445.
[25] Zhang N, Wang Y T, Zhang X L. Extraction of tree crowns damaged by Dendrolimus tabulaeformis Tsai et Liu via spectral-spatial classification using UAV-based hyperspectral images[J]. Plant Methods, 2020, 16(1): 1-19.
[26] Li L Q, Huang J, Wang Y J, et al. Intelligent evaluation of storage period of green tea based on VNIR hyperspectral imaging combined with chemometric analysis[J/OL]. Infrared Physics & Technology, 2020, 110. [2020-08-06]. https://doi.org/10.1016/j.infrared.2020.103450.
[27] Xu Z P, Jiang Y M, Ji J L, et al. Classification, identification, and growth stage estimation of microalgae based on transmission hyperspectral microscopic imaging and machine learning[J/OL]. Optics Express, 2020, 28(21). [2020-10-12]. https://doi.org/10.1364/OE.406036.
[28] Hu M H, Dong Q L, Liu B L. Classification and characterization of blueberry mechanical damage with time evolution using reflectance, transmittance and interactance imaging spectroscopy[J]. Computers and Electronics in Agriculture, 2016, 122: 19-28.
[29] Wang L, Chang C I, Lee L C, et al. Band subset selection for anomaly detection in hyperspectral imagery[J]. IEEE Transactions on Geoscience and Remote Sensing, 2017, 55(9): 4887-4898.
[30] Wu Y F, Sebastián L, Zhang B, et al. Approximate computing for onboard anomaly detection from hyperspectral images[J]. Journal of Real-Time Image Processing, 2019, 16(1): 99-114.
Member of the Chinese Society of Agricultural Engineering: Yang Li (E041200411S)
Lodging resistance prediction of maize varieties based on support vector machine and ReliefF algorithm
Zhang Tianliang, Zhang Dongxing, Cui Tao, Yang Li※, Ding Youqiang, Xie Chunji, Du Zhaohui, Zhong Xiangjun
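(1. College of Engineering, China Agricultural University, Beijing 100083, China; 2. Key Laboratory of Soil-Machine-Plant System Technology, Ministry of Agriculture, Beijing 100083, China)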
Maize is one of the main food crops in the world, and lodging seriously reduces its yield and the efficiency of mechanized harvesting. Current methods for identifying lodging resistance are time-consuming and laborious, and the breeding cycle for lodging-resistant varieties is long. In this study, hyperspectral imaging was combined with statistical learning to predict the lodging resistance of maize varieties during the vegetative growth period. Field trials were carried out in 2018 and 2019, and hyperspectral images were collected of the 9th fully expanded leaf of 8 maize varieties with and without lodging resistance at the 9-leaf stage. Threshold segmentation was first used to identify the leaf area. K-means clustering then divided each leaf into three areas: normal reflection, dark reflection, and leaf vein. The average spectral curve of the normal reflection area was extracted to analyze the data characteristics of lodging-resistant and lodging samples. The Kennard-Stone algorithm was used to rank the samples of each variety, which were divided into training and test sets at a ratio of 3:1. The per-variety splits were combined into the final training and test sets so that both were evenly distributed across varieties. This gave 378 training and 120 test samples for the 2018 trial, and 383 training and 120 test samples for the 2019 trial. The filter-type feature selection algorithm ReliefF (Relevant Features), and principal component analysis (PCA) combined with ReliefF, were used to mine the spectral classification features of lodging-resistant and lodging varieties. Specifically, different numbers of nearest neighbors were set in ReliefF, and features were selected according to the stability of the feature variables.
ReliefF alone often selected redundant features from highly correlated adjacent bands. In PCAReliefF, PCA was therefore performed on the spectral data first, and ReliefF then selected principal components free of redundant features. ReliefF-SVM and PCAReliefF-SVM support vector machine (SVM) classification models were established from the original spectral features selected by ReliefF and the principal component features selected by PCAReliefF, respectively. Grid search was used to optimize the penalty and kernel parameters of the SVM models. Cross-validation on the training set was used to optimize the number of selected features: 40 and 50 features were selected for the 2018 and 2019 trials, respectively, to balance model accuracy against computational complexity. The models were then trained on all training samples with the final parameters. The prediction accuracies of the PCAReliefF-SVM model were 85.00% and 85.83% in 2018 and 2019, respectively, and those of the ReliefF-SVM model were 84.17% in both years, indicating that PCAReliefF-SVM predicted better. The ROC curve of the PCAReliefF-SVM model almost completely enclosed that of the ReliefF-SVM model, indicating better overall performance. Hyperspectral imaging can thus be used for the early classification of maize varieties by lodging resistance, and the spectral extraction, feature analysis, and modeling methods presented here provide a workable approach for research on maize lodging resistance.
principal component analysis; variety; support vector machine; maize; lodging resistant; ReliefF
10.11975/j.issn.1002-6819.2021.20.026
S126
A
1002-6819(2021)-20-0226-08
Zhang Tianliang, Zhang Dongxing, Cui Tao, et al. Lodging resistance prediction of maize varieties based on support vector machine and ReliefF algorithm[J]. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 2021, 37(20): 226-233. (in Chinese with English abstract) doi:10.11975/j.issn.1002-6819.2021.20.026 http://www.tcsae.org
2020-10-24
2021-09-10
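National Key Research and Development Program of China (2016YFD0300302); Maize Industry Technology System Construction Project (CARS-02)
Zhang Tianliang, Ph.D. candidate, research interests: agricultural applications of hyperspectral technology and plant phenotyping. Email: tianliangzn@163.com
Yang Li, professor and doctoral supervisor, research interests: intelligent agricultural equipment and agricultural applications of hyperspectral technology. Email: yl_hb68@126.com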