CLC number: TP181
Document code: A
Support vector data description method based on probability
YANG Chen1,2,3, WANG Jieting1,2,3, LI Feijiang1,2,3, QIAN Yuhua1,2,3*
1.Research Institute of Big Data Science and Industry, Shanxi University, Taiyuan Shanxi 030006, China;
2.Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education(Shanxi University), Taiyuan Shanxi 030006, China;
3.School of Computer and Information Technology, Shanxi University, Taiyuan Shanxi 030006, China
Abstract:
In view of the high complexity of current probabilistic machine learning methods in solving probability problems, and the fact that the traditional Support Vector Data Description (SVDD), as a kernel density estimation method, can only judge whether a test sample belongs to the target class, a probability-based SVDD method was proposed. Firstly, the traditional SVDD method was used to obtain the data descriptions of two classes of data, and the distance between the test sample and each hypersphere was calculated. Then, a function converting distance into probability was constructed, and an SVDD method based on probability was proposed. At the same time, the Bagging algorithm was used for ensemble learning to further improve the performance of data description. Referring to classification scenarios, the proposed method was compared with the traditional SVDD method on the 13 benchmark datasets of Gunnar Raetsch. The experimental results show that the proposed method outperforms the traditional SVDD method in accuracy and F1 value, and that its data description performance is improved.
Key words:
probabilistic machine learning; Support Vector Data Description (SVDD); ensemble; uncertainty; classification
0 Introduction
機(jī)器學(xué)習(xí)主要是計(jì)算機(jī)通過(guò)已知數(shù)據(jù)訓(xùn)練模型,運(yùn)用模型對(duì)未知數(shù)據(jù)進(jìn)行預(yù)測(cè)。機(jī)器學(xué)習(xí)面臨的未來(lái)的數(shù)據(jù)和將來(lái)的行為引發(fā)的結(jié)果是不確定的, 而概率機(jī)器學(xué)習(xí)[1]針對(duì)這種不確定性提供了一個(gè)概率框架,對(duì)模型和預(yù)測(cè)的不確定性進(jìn)行表示和控制,因此在很多領(lǐng)域發(fā)揮著重要的作用,比如語(yǔ)音識(shí)別[2]、圖像分類[3-4]、文本預(yù)測(cè)[5]等。Ghahramani[1]在2015年指出概率機(jī)器學(xué)習(xí)框架的主要思想是,學(xué)習(xí)可以被認(rèn)為是通過(guò)推斷合理的模型來(lái)解釋預(yù)測(cè)數(shù)據(jù),然后機(jī)器使用模型來(lái)預(yù)測(cè)未來(lái)的數(shù)據(jù),并且根據(jù)這些預(yù)測(cè)作出合理的決策,不確定性在這一切中起著至關(guān)重要的作用,對(duì)于給定的數(shù)據(jù)而言,哪個(gè)模型是合適的尚不確定;同樣,對(duì)未來(lái)數(shù)據(jù)和未來(lái)行動(dòng)結(jié)果的預(yù)測(cè)也是不確定的,因此使用概率論描述這種不確定性,概率論為不確定性建模提供了一個(gè)框架[1]。相對(duì)于二值輸出而言,以概率作為輸出的結(jié)果能夠提供更多的信息,預(yù)測(cè)結(jié)果更加準(zhǔn)確。
Existing probabilistic machine learning methods mainly include Bayesian learning, Gaussian processes, and deep learning. Among them, a neural network is a tunable nonlinear function with many parameters; the concept of deep learning grew out of research on artificial neural networks, but with some important changes, such as new architectural and algorithmic innovations, larger datasets, and large-scale computing resources. Neural networks and deep learning systems achieve high performance on many benchmark tasks, but they also have limitations, such as requiring large amounts of data and massive computing resources for training; as a result, these methods have high complexity when solving probability problems.
近年來(lái),數(shù)據(jù)描述問(wèn)題得到了大量的研究。在域描述領(lǐng)域中,數(shù)據(jù)描述的任務(wù)不是以“區(qū)分不同的類”為目標(biāo)的分類問(wèn)題,也不是以“對(duì)每一個(gè)樣本產(chǎn)生一個(gè)期望輸出”為目標(biāo)的回歸問(wèn)題,而是給出一個(gè)關(guān)于訓(xùn)練樣本集的描述,同時(shí)檢測(cè)那些與訓(xùn)練樣本集相似的新樣本。該描述應(yīng)該覆蓋訓(xùn)練樣本集的所有樣本,同時(shí),在理想情況下,該描述應(yīng)該能夠?qū)颖究臻g中其他所有可能的異常樣本排除在外。
20世紀(jì)末,Tax等[6]在支持向量機(jī)(Support Vector Machine, SVM)的基礎(chǔ)上提出了一種數(shù)據(jù)描述方法,即支持向量數(shù)據(jù)描述(Support Vector Data Description, SVDD)。SVDD方法是一種基于邊界數(shù)據(jù)(即支持向量)的描述方法,其思想是尋找一個(gè)幾乎包含所有目標(biāo)樣本并且體積最小的超球體。SVDD方法相對(duì)于貝葉斯學(xué)習(xí)、高斯過(guò)程和深度學(xué)習(xí)等具有算法復(fù)雜性低、擴(kuò)展性強(qiáng)、對(duì)數(shù)據(jù)規(guī)模需求低等優(yōu)點(diǎn)。Lee等[7]提出了一種改進(jìn)的局部密度支持向量數(shù)據(jù)描述方法(Densityinduced Support Vector Data Description, DSVDD),結(jié)果表明,DSVDD的性能優(yōu)于SVDD和k近鄰數(shù)據(jù)描述方法。Kim等[8]針對(duì)SVDD在處理大數(shù)據(jù)集時(shí)存在一定的局限性提出了一種基于k均值聚類的快速SVDD算法,該方法與原始SVDD方法有相同的結(jié)果,并且降低了計(jì)算成本。Luo等[9]針對(duì)矩陣操作的空間復(fù)雜性,提出了一種基于分解和組合策略的快速SVDD算法,該算法在實(shí)現(xiàn)樣本描述方面優(yōu)于原SVDD算法,尤其是在大規(guī)模樣本數(shù)據(jù)集上。然而在現(xiàn)實(shí)生活中,目標(biāo)數(shù)據(jù)集往往包含一類以上的對(duì)象,每一類對(duì)象都需要同時(shí)進(jìn)行描述和區(qū)分,在這種情況下,傳統(tǒng)的SVDD只能給出一個(gè)描述目標(biāo)數(shù)據(jù)集,不區(qū)分不同的目標(biāo)類數(shù)據(jù)集,或者給目標(biāo)數(shù)據(jù)集中每個(gè)類的對(duì)象一個(gè)描述,Huang等[10]提出了一種改進(jìn)的支持向量數(shù)據(jù)描述方法——兩類支持向量數(shù)據(jù)描述(TwoClass Support Vector Data Description, TCSVDD)。
傳統(tǒng)的SVDD方法作為一種核密度估計(jì)方法,對(duì)未知數(shù)據(jù)進(jìn)行預(yù)測(cè)主要是先對(duì)已知樣本進(jìn)行數(shù)據(jù)描述,然后判斷測(cè)試樣本是否屬于該類,是二值輸出;相對(duì)于二值輸出而言,以概率作為輸出的結(jié)果能夠提供更多的信息,獲得更準(zhǔn)確的數(shù)據(jù)描述。而傳統(tǒng)的SVDD方法判斷測(cè)試樣本是否屬于該類,主要是計(jì)算測(cè)試數(shù)據(jù)到超球體中心的距離,但是該距離無(wú)法得知測(cè)試數(shù)據(jù)所屬類別的概率。因此,本文提出了一個(gè)將距離轉(zhuǎn)換為概率的函數(shù),在本文第2章詳細(xì)解釋。
集成學(xué)習(xí)算法[11]是通過(guò)集成多個(gè)學(xué)習(xí)器共同決策的機(jī)器學(xué)習(xí)技術(shù),通過(guò)不同的樣本集訓(xùn)練有差異的個(gè)體學(xué)習(xí)器,然后使用某種規(guī)則把個(gè)體學(xué)習(xí)器的結(jié)果進(jìn)行集成,得到的集成分類器可以有效提高單個(gè)分類器的學(xué)習(xí)效果。根據(jù)個(gè)體學(xué)習(xí)器的生成方式,目前的集成學(xué)習(xí)方法大致分為兩大類,即個(gè)體學(xué)習(xí)器間存在強(qiáng)依賴關(guān)系、必須串行生成的序列化方法,以及個(gè)體學(xué)習(xí)器間不存在強(qiáng)依賴關(guān)系、可同時(shí)生成的并行化方法。前者的代表是Boosting[12],后者的代表是Bagging[13]和隨機(jī)森林(Random Forest)[14-15]。借鑒分類場(chǎng)景,本文主要使用Bagging集成算法來(lái)提高傳統(tǒng)SVDD方法的數(shù)據(jù)描述能力。
To address the high complexity of deep learning and other probabilistic machine learning methods in solving probability problems, and the fact that traditional SVDD can only judge whether a test sample belongs to a class, this paper first reviews SVDD and the Bagging ensemble algorithm, then constructs a function that converts distance into probability and proposes an SVDD method based on Probability (PSVDD), and finally verifies the effectiveness of PSVDD experimentally.
1?相關(guān)工作
Since the proposed algorithm is a probability-based support vector data description method, this section briefly reviews SVDD and the Bagging ensemble algorithm.
1.1 SVDD theory
支持向量數(shù)據(jù)描述(SVDD)是一種對(duì)目標(biāo)數(shù)據(jù)集進(jìn)行球形描述并用于離群點(diǎn)檢測(cè)或分類的數(shù)據(jù)描述方法。它的基本思想是通過(guò)利用核函數(shù)將樣本從低維空間映射到高維特征空間,并尋求一個(gè)超球體,盡可能在使超球體的半徑小的情況下把所有訓(xùn)練樣本包圍起來(lái),并以最小超球體的邊界對(duì)數(shù)據(jù)進(jìn)行描述和分類。在測(cè)試階段,判斷測(cè)試樣本與所構(gòu)造的超球體的球心之間的距離是否小于半徑:如果小于超球體半徑則認(rèn)為是目標(biāo)點(diǎn);若大于超球體的半徑則被認(rèn)為是異常點(diǎn)。SVDD示意圖如圖1所示。
設(shè)X={xi,i=1,2,…,n}為Rd空間中的訓(xùn)練數(shù)據(jù)集,a和R分別表示球面的中心和半徑。該目標(biāo)被表述為一個(gè)二次規(guī)劃問(wèn)題:
$$\min_{R,a,\xi_{i}} F(R,a,\xi_{i})=R^{2}+C\sum_{i}\xi_{i} \tag{1}$$

$$\text{s.t.}\quad \|x_{i}-a\|^{2}\le R^{2}+\xi_{i},\quad \xi_{i}\ge 0,\quad i=1,2,\dots,n$$
where $\xi_{i}$ is a slack variable that allows for the possibility of outliers in the training set, and $C$ is an error penalty coefficient that balances the misclassification error against the volume of the sphere. To solve this quadratic programming problem, Lagrange multipliers $\alpha_{i}$ and $\gamma_{i}$ are introduced to construct the Lagrangian:
$$L(R,a,\xi_{i},\alpha_{i},\gamma_{i})=R^{2}+C\sum_{i}\xi_{i}-\sum_{i}\alpha_{i}\big[R^{2}+\xi_{i}-\|x_{i}-a\|^{2}\big]-\sum_{i}\gamma_{i}\xi_{i} \tag{2}$$
where the Lagrange multipliers satisfy $\alpha_{i}\ge 0$ and $\gamma_{i}\ge 0$. Setting the partial derivatives of $L$ with respect to $R$, $a$, and $\xi_{i}$ to zero yields the constraints:
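Following the standard SVDD derivation in [6], these conditions are:

$$\frac{\partial L}{\partial R}=0 \;\Rightarrow\; \sum_{i}\alpha_{i}=1$$

$$\frac{\partial L}{\partial a}=0 \;\Rightarrow\; a=\sum_{i}\alpha_{i}x_{i}$$

$$\frac{\partial L}{\partial \xi_{i}}=0 \;\Rightarrow\; C-\alpha_{i}-\gamma_{i}=0$$

so the center is a weighted combination of the training samples and each multiplier is bounded by $0\le\alpha_{i}\le C$. For intuition, problem (1) can also be solved directly with a generic convex solver on small, non-kernelized data; a minimal sketch using cvxpy (an assumed dependency), where the substitution $r_{2}=R^{2}$ keeps the constraints convex:

```python
import cvxpy as cp
import numpy as np

def svdd_primal(X, C=1.0):
    """Solve the SVDD primal (1) in input space, without a kernel.

    A sketch for small data only; practical SVDD implementations
    solve the kernelized dual instead.
    """
    n, d = X.shape
    a = cp.Variable(d)                 # sphere center
    r2 = cp.Variable(nonneg=True)      # squared radius R^2
    xi = cp.Variable(n, nonneg=True)   # slack variables
    cons = [cp.sum_squares(X[i] - a) <= r2 + xi[i] for i in range(n)]
    prob = cp.Problem(cp.Minimize(r2 + C * cp.sum(xi)), cons)
    prob.solve()
    return a.value, float(np.sqrt(r2.value))
```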
References
[1]GHAHRAMANI Z. Probabilistic machine learning and artificial intelligence[J]. Nature, 2015, 521(7553): 452-459.
[2]HINTON G, DENG L, YU D, et al. Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups[J]. IEEE Signal Processing Magazine, 2012, 29(6): 82-97.
[3]KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks[J]. Communications of the ACM, 2017, 60(6): 84-90.
[4]黃凱奇, 任偉強, 譚鐵牛. 圖像物體分類與檢測算法綜述[J]. 計算機學報, 2014, 36(6): 1-18.(HUANG K Q, REN W Q, TAN T N. A review on image object classification and detection[J]. Chinese Journal of Computers, 2014, 36(6): 1-18.)
[5]KANDOLA E J, HOFMANN T, POGGIO T, et al. A neural probabilistic language model[J]. Journal of Machine Learning Research, 2006, 194: 137-186.
[6]TAX D M J, DUIN R P W. Support vector data description[J]. Machine Learning, 2004, 54(1): 45-66.
[7]LEE K Y, KIM D W, LEE D, et al. Improving support vector data description using local density degree[J]. Pattern Recognition, 2005, 38(10): 1768-1771.
[8]KIM P J, CHANG H J, SONG D S, et al. Fast support vector data description using k-means clustering[C]// Proceedings of the 4th International Symposium on Neural Networks. Berlin: Springer, 2007: 506-514.
[9]LUO J, LI B, WU C, et al. A fast SVDD algorithm based on decomposition and combination for fault detection[C]// Proceedings of the 2010 International Conference on Control and Automation. Piscataway: IEEE, 2010: 1924-1928.
[10]HUANG G, CHEN H, ZHOU Z, et al. Two-class support vector data description[J]. Pattern Recognition, 2011, 44(2): 320-329.
[11]李勇, 劉戰(zhàn)東, 張海軍. 不平衡數(shù)據(jù)的集成分類算法綜述[J]. 計(jì)算機(jī)應(yīng)用研究, 2014, 31(5): 1287-1291.(LI Y, LIU Z D, ZHANG H J. Summary of integrated classification algorithm for unbalanced data[J]. Application Research of Computers, 2014, 31(5): 1287-1291.)
[12]FREUND Y, SCHAPIRE R E. A decision-theoretic generalization of on-line learning and an application to boosting[J]. Journal of Computer and System Sciences, 1997, 55(1): 119-139.
[13]BREIMAN L. Bagging predictors[J]. Machine Learning, 1996, 24(2): 123-140.
[14]BREIMAN L. Random forests[J]. Machine Learning, 2001, 45(1): 5-32.
[15]周志華. 機器學習[M]. 北京:清華大學出版社, 2016: 171-189.(ZHOU Z H. Machine Learning[M]. Beijing: Tsinghua University Press, 2016: 171-189.)
[16]QIAN Y, LI F, LIANG J, et al. Space structure and clustering of categorical data[J]. IEEE Transactions on Neural Networks and Learning Systems, 2016, 27(10): 2047-2059.
[17]ZHOU Z. Ensemble Methods: Foundations and Algorithms[M]. Boca Raton, FL: Taylor & Francis Group, 2012: 1-22.
[18]QIAN Y, LI F, LIANG J, et al. Fusing monotonic decision trees[J]. IEEE Transactions on Knowledge and Data Engineering, 2015, 27(10): 2717-2728.
[19]EFRON B. Bootstrap methods: another look at the jackknife[J]. The Annals of Statistics, 1979, 7(1): 1-26.
[20]CAWLEY G, TALBOT N. Gunnar Raetsch's benchmark datasets[DB/OL]. [2018-11-20]. http://theoval.cmp.uea.ac.uk/~gcc/matlab/default.html#benchmarks.
[21]NAGANJANEYULU S, KUPPA M R. A novel framework for class imbalance learning using intelligent undersampling[J]. Progress in Artificial Intelligence, 2013, 2(1): 73-84.
[22]ZHANG X, SONG Q, WANG G, et al. A dissimilarity-based imbalance data classification algorithm[J]. Applied Intelligence, 2015, 42(3): 544-565.
[23]JIANG K, LU J, XIA K. A novel algorithm for imbalance data classification based on genetic algorithm improved SMOTE[J]. Arabian Journal for Science and Engineering, 2016, 41(8): 3255-3266.
This work is partially supported by the National Natural Science Foundation of China (61672332), the Program for the Outstanding Innovative Teams of Higher Learning Institutions of Shanxi, the Program for the San Jin Young Scholars of Shanxi, the Overseas Returnee Research Program of Shanxi Province (2017023).
YANG Chen, born in 1996, M. S. candidate. Her research interests include statistical machine learning theory.
WANG Jieting, born in 1991, Ph. D. candidate. Her research interests include statistical machine learning theory, reinforcement learning.
LI Feijiang, born in 1990, Ph. D. candidate. His research interests include group learning, unsupervised learning.
QIAN Yuhua, born in 1976, Ph. D., professor. His research interests include machine learning, complex network, rough set theory, granular computing.