靳燕 彭新光
摘要:
為進一步弱化數(shù)據(jù)不均衡對分類算法的束縛,從數(shù)據(jù)集區(qū)域分布特性著手,提出了不均衡數(shù)據(jù)集上基于子域?qū)W習的復(fù)合分類模型。子域劃分階段,擴展支持向量數(shù)據(jù)描述(SVDD)算法給出類的最小界定域,劃分出域內(nèi)密集區(qū)與域外稀疏區(qū)。借鑒不同類存在相似樣本的類重疊概念,對邊界樣本進行搜索,組合構(gòu)成重疊域。子域清理階段,基于鄰近算法(KNN)的鄰近性假設(shè),結(jié)合不同域的密疏程度,設(shè)置出樣本有效性參數(shù),對域內(nèi)樣本逐個檢測以清理噪聲。各子域隔離參與分類建模,按序組合產(chǎn)生出用于不均衡數(shù)據(jù)集的復(fù)合分類器CCRD。實驗設(shè)計三類比較方案(相似算法比較、代價敏感MetaCost比較和SMOTE抽樣比較),前兩類比較中,CCRD對正類的正確分類改善明顯,且未加重負類誤判;后一類比較中,改善了負類的誤判情形,且未影響正類的正確分類。在五類數(shù)據(jù)集的逐個比較中,CCRD分類性能均有提升,在Haberman_sur的正類分類性能提升上尤為明顯。結(jié)果表明,基于子域?qū)W習的復(fù)合分類模型的分類性能較好,是一種研究不均衡數(shù)據(jù)集的較有效的方法。
為進一步弱化數(shù)據(jù)不均衡對分類算法的束縛,從數(shù)據(jù)集區(qū)域分布特性著手,提出了不均衡數(shù)據(jù)集上基于子域?qū)W習的復(fù)合分類模型。子域劃分階段,擴展支持向量數(shù)據(jù)描述(SVDD)算法給出類的最小界定域,劃分出域內(nèi)密集區(qū)與域外稀疏區(qū)。借鑒不同類存在相似樣本的類重疊概念,對邊界樣本進行搜索,組合構(gòu)成重疊域。子域清理階段,基于鄰近算法(KNN)的鄰近性假設(shè),結(jié)合不同域的密疏程度,設(shè)置樣本有效性參數(shù),對域內(nèi)樣本逐個檢測以清理噪聲。各子域隔離參與分類建模,按序組合產(chǎn)生出用于不均衡數(shù)據(jù)集的復(fù)合分類器CCRD。在相似算法對比以及代價敏感MetaCost對比中,CCRD對正類的正確分類改善明顯,且未加重負類誤判;在SMOTE抽樣比較中,CCRD改善了負類的誤判情形,且未影響正類的正確分類;在五類數(shù)據(jù)集的逐個比較中,CCRD分類性能均有提升,在Haberman_sur的正類分類性能提升上尤為明顯。結(jié)果表明,基于子域?qū)W習的復(fù)合分類模型的分類性能較好,是一種研究不均衡數(shù)據(jù)集的較有效的方法。
關(guān)鍵詞:
不均衡數(shù)據(jù)集區(qū)域分布;支持向量數(shù)據(jù)描述;稀疏域與重疊域;子域隔離學(xué)習;復(fù)合分類器
中圖分類號:
TP391
文獻標志碼:A
Abstract:
Started with the regional distribution characteristics, a composite classification model learned on multiple isolated subdomains was proposed to further study the class imbalance problem. In the subdomains division stage, each class was described as ultrasmall spheres by improved Support Vector Data Description (SVDD) algorithm, then class domain was divided into intensive and sparse domains. Some instances were founded out from the boundaries of classes and composed of class overlapping domains. In the subdomains cleanup stage, according to sample availability parameters related to domain tightness, noise data was cleaned up by improved KNearest Neighbor (KNN). After combining classifiers sequentially which were learned on isolated subdomains, the Composite Classification model (CCRD) was generated. The classification performance of CCRD is compared with those algorithms' which are divided into three groups of similar ones, MetaCost and SMOTE. CCRD could classify more positive instances than the top two groups and more negative instances than the last group, and the improvement doesn't affect the classification of the other class. The results indicate that the composite classification model learned on multiple isolated subdomains could classify instances more accurately and could be an effective method for imbalance class.
In the comparison with similar algorithms including SVM (Support Vector Machine), KNN, C4.5 and MetaCost, CCRD can obviously improve the accuracy of positive instances without increasing mistake of negative instances; in the comparison with SMOTE (Synthetic Minority Oversampling TEchnique) sampling, CCRD can improve the misjudgement of negative instances without affecting the classification of the positive instances; in the experiments on five datasets, the classification performance of CCRD is also improved, especially in Haberman_sur. Experimental results indicate that the composite classification model learned on multiple isolated subdomains has excellent classification capability, and it is an effective method for inbalanced dataset.
英文關(guān)鍵詞Key words:
regional distribution of imbalanced class; Support Vector Data Description (SVDD); sparse and overlapping domains; leaning classifiers on multiple isolated subdomains; Composite Classification model (CCRD)
0引言
不均衡數(shù)據(jù)集中,按類樣本出現(xiàn)概率的大與小定義多數(shù)類與少數(shù)類。少數(shù)類雖出現(xiàn)概率極小但往往又極其重要。如:通信電話中的騷擾電話、被診患者中的癌癥患者、網(wǎng)絡(luò)用戶行為中的攻擊行為、衛(wèi)星圖片中的油井圖片等。抽樣是使不均衡數(shù)據(jù)集均衡化的常用方法,增加少數(shù)類樣本以減少類間偏斜的方法稱為over_sampling(向上采樣) [1]。文獻[2]提出SMOTE(Synthetic Minority Oversampling TEchnique)算法,在“近距離少數(shù)類樣本間仍為同類樣本”的假設(shè)基礎(chǔ)上生成人工少數(shù)類樣本;文獻[3]選用遺傳算法優(yōu)化SMOTE參數(shù),以得到適合應(yīng)用數(shù)據(jù)集的取樣倍數(shù);文獻[4]提出BorderlineSMOTE方法,將人工合成樣本限于少數(shù)類樣本的邊界處。減少多數(shù)類樣本以降低類間偏斜的方法稱為under_sampling(向下采樣)[5]。最早由Tomek提出的Tomek links方法按距離尋找分屬兩類的Tomek links對,并刪除其中的多數(shù)類樣本;
H. Alhammady和K. Ramamohanarao提出的EPRC(Emerging Patterns for Rareclass Classification)算法提供三個改進步驟來克服EP算法在少數(shù)類分類上的不足[6]。目前,不少研究將兩種抽樣相結(jié)合[7-8],以優(yōu)勢互補。
分析上述兩類抽樣方法,均以改變樣本分布為出發(fā)點。文獻[9]選用了10種分類方法對13個數(shù)據(jù)集進行分類性能分析,認為數(shù)據(jù)集的不均衡不僅體現(xiàn)在樣本數(shù)量的偏差上,類間重疊亦對分類性能產(chǎn)生影響。文獻[10]使用人工數(shù)據(jù)較系統(tǒng)地分析了類間重疊與類不均衡的關(guān)聯(lián)程度,并有望在實際數(shù)據(jù)集中進一步作關(guān)聯(lián)分析。文獻[11]嘗試通過K近鄰(KNearest Neighbor, KNN)方法來尋找類間重疊區(qū)域,距離查找法易受到“維災(zāi)”效應(yīng)影響,且當重疊子域較多時,該方法在有效發(fā)現(xiàn)各類邊界特征時性能較差。
本文在相關(guān)文獻的研究[9-12]基礎(chǔ)上,從數(shù)據(jù)集區(qū)域密疏分布特性著手,引入最小超球體進行類描述,界定出密集與稀疏域,按概率定義找出類邊界的重疊樣本,將樣本空間劃分為密集、重疊與稀疏三個子域。不同于傳統(tǒng)分類方式,本文將各子域隔離參與分類建模,分類模型按序組合構(gòu)成復(fù)合分類器。實驗結(jié)果比較表明,本文設(shè)計的復(fù)合分類器的分類性能提升較明顯,總體較穩(wěn)定。
1樣本空間密疏域隔離方法設(shè)計
類邊界區(qū)域中,存在與其他類極為相似的樣本,該類樣本往往會被誤分類,在覆蓋稀疏小樣本的分類學(xué)習中尤其明顯,類邊界會因此類樣本而發(fā)生偏移,誤分更為嚴重。
當不同類的樣本極其相似時,類重疊就會產(chǎn)生。選用概率法定義類重疊:樣本x均以大于0的概率落在至少兩個類的界定內(nèi),則樣本x位于重疊域O。即:x,x∈O,p(Cm|x)>0且p(Cn|x)>0。
圖1的(a)和(b)分別表示沒有重疊和有重疊的情形。重疊造成兩類邊界模糊,為分類帶來困難。若將類的邊界控制在樣本密集區(qū),并與稀疏區(qū)域相隔離,類重疊可大大降低。
1)與單個分類算法的性能比較。
本文設(shè)計的分類算法是三類算法的擴展與組合,在同一數(shù)據(jù)集上,依次單個應(yīng)用三類算法(支持向量機(Support Vector Machine, SVM)、KNN(K=5)、C4.5)和本文算法CCRD。實驗分類學(xué)習產(chǎn)生的統(tǒng)計結(jié)果見表3。
同一數(shù)據(jù)集的比較中,CCRD算法的TP值為最大,對正類樣本的正確分類數(shù)最多;TP與TN存在互約關(guān)系,CCRD的TN值雖非最大,但與最大值較為接近。CCRD算法修正了SVM在Haberman_sur上的“正類全部錯判”,同時對負類仍有很好的分類效果。從實驗結(jié)果得出:CCRD算法作為三類算法的擴展與組合,總體分類效果得到了較大提升,分類性能要優(yōu)于單個算法的性能。
2)與單個分類算法在SMOTE集下的分類性能比較。
SMOTE算法通過生成少數(shù)類樣本來降低類間不均衡性,是從改變樣本分布角度來研究不均衡問題的。本文所提方法則充分分析了數(shù)據(jù)集分布子域特點,并依據(jù)該特點選用合理算法進行分類學(xué)習。與SMOTE相比,是同一問題的不同角度的設(shè)計。因此,設(shè)計實驗比較SMOTE集上的相近算法與本文算法的性能。
在同一數(shù)據(jù)集上,應(yīng)用SMOTE(倍數(shù)為2)進行預(yù)處理,并依次單個應(yīng)用三類算法(SVM、KNN(K=5)、C4.5)進行分類學(xué)習,統(tǒng)計結(jié)果見表4。
5結(jié)語
數(shù)據(jù)集的不均衡對分類算法性能影響較大,其不均衡性不僅體現(xiàn)在樣本數(shù)量懸殊上,也體現(xiàn)在樣本分布的密疏不均與類重疊上。本文從數(shù)據(jù)樣本分布的密疏程度出發(fā),進行樣本區(qū)域分隔,根據(jù)不同域的分布特點,針對性選用不同分類方法進行學(xué)習,多個分類模型按序組合成復(fù)合分類器。從多組數(shù)據(jù)集的不同算法性能對比分析中得出,本文所用分類器在給定的評價性能指標中,總體提升明顯,且受數(shù)據(jù)集特性影響較小,總體性能較穩(wěn)定。
本文所提方法中,擴展SVDD進行區(qū)域劃分時,實驗選擇了一種核函數(shù),樣本去噪時的近鄰集計算是按固定K值進行的,并據(jù)此給出定值βK。方法實現(xiàn)中涉及的相關(guān)參數(shù)等在今后研究中,嘗試以優(yōu)值選取方式進一步分析給定,以將本文方法推廣至多分類問題和大數(shù)據(jù)集的有效應(yīng)用中。
參考文獻:
[1]
ABDI L, HASHEMI S. To combat multiclass imbalanced problems by means of oversampling and boosting techniques [J]. Knowledge and Data Engineering, IEEE Transactions on, 2015, 19(12): 3369-3385.
ABDI L, HASHEMI S. To combat multiclass imbalanced problems by means of oversampling and boosting techniques [J]. Soft Computing, 2015, 19(12): 3369-3385.
[2]
VERBIEST N, RAMENTOL E, CORNELIS C, et al. Preprocessing noisy imbalanced datasets using SMOTE enhanced with fuzzy rough prototype selection [J]. Applied Soft Computing, 2014, 22(5): 511-517.
[3]
霍玉丹,谷瓊,蔡之華,等.基于遺傳算法改進的少數(shù)類樣本合成過采樣技術(shù)的非平衡數(shù)據(jù)集分類算法[J].計算機應(yīng)用,2015,35(1):121-124.(HUO Y D, GU Q, CAI Z H, et al. Classification method for imbalance based on genetic algorithm improved synthetic minority oversampling technique [J]. Journal of Computer Applications, 2015,35(1):121-124.)
[4]
WANG K J, ADRIAN A M, CHEN K H, et al. A hybrid classifier combining borderlineSMOTE with AIRS algorithm for estimating brain metastasis from lung cancer: a case study in Taiwan [J]. Computer Methods and Programs in Biomedicine, 2015, 119(2): 63-76.
[5]
YU H, NI J, ZHAO J. ACOSampling: an ant colony optimizationbased undersampling method for classifying imbalanced DNA microarray data [J]. Neurocomputing, 2013, 101(3): 309-318.
[6]
GARCABORROTO M, MARTNEZTRINIDAD J F, CARRASCOOCHOA J A. A survey of emerging patterns for supervised classification [J]. Artificial Intelligence Review, 2014, 42(4): 705-721.
[7]
陳睿,張亮,楊靜,等. 基于BSMOTE和逆轉(zhuǎn)欠抽樣的不均衡數(shù)據(jù)分類算法[J]. 計算機應(yīng)用研究,2014,31(11):3299-3303.(CHEN R, ZHANG L, YANG J, et al. Classification algorithm for imbalanced data sets based on combination of BSMOTE and inverse under sampling [J]. Application Research of Computers, 2014,31(11):3299-3303.)
[8]
GALAR M, FERNNDEZ A, BARRENECHEA E, et al. A review on ensembles for the class imbalance problem: bagging, boosting, and hybridbased approaches [J]. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 2012, 42(4): 463-484.
[9]
GARCA V, SNCHEZ J S, MOLLINEDA R A. On the effectiveness of preprocessing methods when dealing with different levels of class imbalance [J]. KnowledgeBased Systems, 2012, 25(1): 13-21.
[10]
ALEJO R, VALDOVINOS R M, GARCA V, et al. A hybrid method to face class overlap and class imbalance on neural networks and multiclass scenarios [J]. Pattern Recognition Letters, 2013, 34(4): 380-388.
[11]
BECKMANN M, EBECKEN N F F, DE LIMA B S L P. A KNN undersampling approach for data balancing [J]. Journal of Intelligent Learning Systems and Applications, 2015, 7(4): 104-116.
[12]
熊海濤,吳俊杰,劉洪甫,等.分類中的類重疊問題及其處理方法研究[J].管理科學(xué)學(xué)報,2013,16(4):8-21.(XIONG H T, WU J J, LIU H P, et al. Towards classification with class overlapping [J]. Journal of Management Sciences in China, 2013,16(4):8-21.)
[13]
KHAZAI S, SAFARI A, MOJARADI B, et al. Improving the SVDD approach to hyperspectral image classification [J]. IEEE Geoscience and Remote Sensing Letters, 2012, 9(4): 594-598.
[14]
蔣盛益,苗邦,余雯.基于一趟聚類的不平衡數(shù)據(jù)下抽樣算法[J].小型微型計算機系統(tǒng),2012,33(2):232-236.(JIANG S Y, MIAO B, YU W. Undersampling method based on onepass clustering for imbalanced data distribution [J]. Journal of Chinese Computer Systems, 2012, 33(2): 232-236.)
[15]
李雄飛,李軍,董元方,等.一種新的不平衡數(shù)據(jù)學(xué)習算法PCBoost [J]. 計算機學(xué)報,2012,35(2):202-209.(LI X F, LI J, DONG Y F, et al. A new learning algorithm for imbalanced dataPCBoost [J]. Chinese Journal of Computers, 2012, 35(2): 202-209.)
[16]
曹鵬,李博,栗偉,等.基于粒子群優(yōu)化的不均衡數(shù)據(jù)學(xué)習[J].計算機應(yīng)用,2013,33(3):789-792.(CAO P, LI B, LI W, et al. Imbalanced data learning based on particle swarm optimization [J]. Journal of Computer Applications, 2013, 33(3): 789-792.)