朱穎雯
摘? 要: 大規(guī)模社交媒體數(shù)據(jù)的復(fù)雜性要求將聚類技術(shù)擴展到大規(guī)模數(shù)據(jù),使其能夠在很少的經(jīng)驗設(shè)置下自動識別數(shù)據(jù)聚簇。文章研究和討論了模糊自適應(yīng)諧振理論(Fuzzy Adaptive Resonance Theory)算法,其具有線性計算復(fù)雜性,僅使用一個單一參數(shù),且對參數(shù)設(shè)置具有魯棒性,可以產(chǎn)生更好的聚類結(jié)果。真實數(shù)據(jù)集上的實驗結(jié)果表明,該算法在大規(guī)模數(shù)據(jù)聚類中取得了可比較的性能和更快的速度,而且也不需要預(yù)先定義聚簇個數(shù)。
關(guān)鍵詞: 大規(guī)模數(shù)據(jù); 聚類; 自適應(yīng)諧振理論; 模糊自適應(yīng)諧振理論
中圖分類號:TP391? ? ? ? ? 文獻標識碼:A? ? ?文章編號:1006-8228(2020)10-24-05
Abstract: The large scale and complex nature of social media data raises the need to scale clustering techniques to big data and make them capable of automatically identifying data clusters with few empirical settings. In this paper, the Fuzzy ART algorithm (Fuzzy Adaptive Resonance Theory) is studied; it has linear computational complexity, uses a single parameter, i.e., the vigilance parameter to identify data clusters, and is robust to modest parameter settings. Experiments on real data sets show that Fuzzy ART can achieve better or comparable performance and much faster speed than the state-of-the-art clustering algorithms, without need for predefining the number of clusters.
Key words: large scale data; data clustering; adaptive resonance theory; fuzzy adaptive resonance theory
0 引言
社交網(wǎng)站的流行導(dǎo)致在線多媒體文檔急劇增加,例如圖片、博客等。近年來,聚類網(wǎng)絡(luò)多媒體數(shù)據(jù)已成為社交網(wǎng)絡(luò)社群識別[1-2],集體行為分析[3],基礎(chǔ)主題發(fā)現(xiàn)[4-5]的熱點。社交媒體數(shù)據(jù)通常規(guī)模很大,涵蓋不同主題內(nèi)容,因此很難評估基礎(chǔ)主題的個數(shù)和數(shù)據(jù)的模式分布,需要現(xiàn)有的聚類方法擴展到大規(guī)模數(shù)據(jù),通過一些經(jīng)驗設(shè)置(如聚簇個數(shù))自動識別數(shù)據(jù)聚簇。
大多廣泛使用的聚類方法如K-Means、譜聚類、矩陣分解均需要設(shè)置聚簇個數(shù)。方法一般可分為兩類:聚類趨勢分析方法[6-8]和聚類驗證方法[9-14]。第一種方法通過模式的相鄰關(guān)系來確定數(shù)據(jù)集中的聚簇個數(shù),第二種方法通過評估不同聚簇的結(jié)構(gòu)。這兩種方法都很慢,均無法擴展到大規(guī)模數(shù)據(jù)。不需要預(yù)先定義聚簇個數(shù)的方法有基于層次的聚類方法[15-17],基于遺傳的聚類方法[18-19],基于密度的聚類方法[20-21],基于近鄰傳播的方法(AP)[22]和自適應(yīng)諧振理論(ART)[23-27]。而層次和遺傳方法類似于聚類驗證方法,其他算法通常具有O(n2)的時間復(fù)雜度,且需要一個或多個參數(shù)來形成聚簇,這使得它們的性能對這些參數(shù)的設(shè)置非常敏感。
為了聚類大規(guī)模數(shù)據(jù),不設(shè)置聚簇個數(shù)且具有高聚類精度,本文討論和研究了模糊自適應(yīng)諧振理論(Fuzzy ART)算法,它是ART算法的變體,其具有線性計算復(fù)雜性,僅使用一個單一參數(shù),且對參數(shù)設(shè)置具有魯棒性,可以產(chǎn)生更好的聚類結(jié)果。
1 相關(guān)研究
1.1 聚類趨勢分析
聚類趨勢分析的目的是聚類前確定數(shù)據(jù)集中的聚簇個數(shù)。主要聚焦于研究模式的相異度矩陣。趨勢視覺評估(VAT)[6]對模式的相異矩陣進行重新排序,形成一個重新排序的不相似圖像(RDI),通過計算對角線像素上的黑色塊來識別聚簇的個數(shù)。聚類計數(shù)提?。–CE)[7]和暗塊提取(DBE)[8]可客觀識別聚簇的個數(shù)取代人工計數(shù)。CCE使用過濾的RDI的非對角線像素值構(gòu)造直方圖,簇的個數(shù)等于直方圖中峰值的個數(shù)。而DBE采用矩陣變換的方法,將RDI的所有像素值投影到主對角線軸上從而獲得投影信號,簇的個數(shù)等于信號中主要峰值的個數(shù)。
1.2 聚類驗證
聚類驗證是通過評估生成的不同聚類結(jié)構(gòu)質(zhì)量來找到最佳聚簇??紤]到聚簇歸屬機制的不同,聚類驗證可以分為硬聚類[28](一個模式屬于一個聚簇)和模糊聚類[9](一個模式具有所有聚簇的模糊隸屬度)。硬聚類方法遵循以下原則。①驗證指標。基于簇內(nèi)緊密度和簇間分離度對不同聚簇個數(shù)生成的不同簇的質(zhì)量進行評價[10,13]。②將不同聚簇的驗證指標值繪制為關(guān)于聚簇個數(shù)的函數(shù),最佳聚簇個數(shù)位于極端值或彎頭值[11,29]。③通過子采樣和添加隨機噪聲等工具扭曲給定數(shù)據(jù)集產(chǎn)生多數(shù)據(jù)集。再對每個數(shù)據(jù)集進行聚類,以確定最佳的聚簇個數(shù)[14]。已有的模糊聚類驗證方法[9,12]通常使用模糊c均值作為基本算法,并對生成的聚類質(zhì)量進行評估,以確定最佳聚簇個數(shù)。
2 模糊自適應(yīng)諧振網(wǎng)絡(luò)
自組織神經(jīng)網(wǎng)絡(luò)是人工智能領(lǐng)域應(yīng)用最為廣泛的一種學(xué)習(xí)模型。為解決大部分神經(jīng)網(wǎng)絡(luò)模型遭遇的“穩(wěn)定性-彈性問題”,美國Boston大學(xué)的S.Grossberg和G.A.Carpenter于1976年提出了一種無監(jiān)督競爭型神經(jīng)網(wǎng)絡(luò)模型,即自適應(yīng)諧振理論網(wǎng)絡(luò)ART(Adaptive Resonance Theory)[30],可在穩(wěn)定原有模式類的前提下,學(xué)習(xí)新的模式。ART模擬了人類大腦如何捕捉、識別、記憶關(guān)于事物和事件的信息。隨著理論不斷完善,有大量基于ART改進的無監(jiān)督學(xué)習(xí)模型被提出,如ART1[23]、ART2[25]、ART2-A[26]、ART3[27]和模糊ART(Fuzzy ART)[24]。模糊ART通過在類別空間實時搜索和匹配現(xiàn)有聚簇增長式的逐步處理每一個輸入模式。其中警戒參數(shù)用于約束在同一個聚簇中的模式的最小相似度。當輸入模式與現(xiàn)有聚簇都不相似時,生成一個新的聚簇來編碼這個新模式。模糊ART已被改進用于解決圖像和文本挖掘問題,如Web文檔管理,基于標記的Web圖像組織,圖像-文本關(guān)聯(lián)和異構(gòu)數(shù)據(jù)聚類。模糊ART適用于大規(guī)模數(shù)據(jù)聚類。
模糊ART模型由輸入層F1和識別層F2組成,如圖1所示。輸入層為輸入向量I,識別層由一些節(jié)點向量組成,代表各個聚簇。在F1層,輸入向量被提交到網(wǎng)絡(luò),與F2層各個聚簇向量的權(quán)值進行相似度比較并歸類。
⑴ 輸入向量(Input Vectors):設(shè)[I=x]表示輸入層F1的輸入模式。[x=(x1,…,xm)],[xi∈[0,1]](i=1,…,m)。通過補編碼(complement coding),x與它的補向量[x=1-x]共同構(gòu)成了[I=(x,x)]。
⑵ 權(quán)值向量(Weight Vectors):設(shè)wj表示識別層F2中第j個類[cj=(j=1,…,J)]的權(quán)值。
⑶ 參數(shù)(Parameters):模糊ART隨著3個參數(shù)動態(tài)改變,它們分別是選擇參數(shù)[α>0],學(xué)習(xí)參數(shù)[β∈[0,1]],以及警戒參數(shù)[ρ∈[0,1]]。
模糊ART聚類過程包含3個關(guān)鍵步驟:
步驟1:類別選擇(Category Choice)。對每個輸入模式I,模糊ART對識別層F2中的每個聚簇根據(jù)選擇函數(shù)計算選擇值,并選擇具有最大值的聚簇作為獲勝聚簇cj*。第j個聚簇cj的選擇函數(shù)定義為:
3 實驗結(jié)果分析
為了驗證本文算法的有效性,我們在2個真實大規(guī)模數(shù)據(jù)集上進行了實驗。使用的計算機配置為Intel Core i5-6300U 2.4GHz處理器和8G內(nèi)存,Windows 10操作系統(tǒng),所有程序均在MATLAB R2015a上設(shè)計和運行。為了對各種聚類算法的精度進行評價,我們引入了3項評價指標:(a)Accuracy(Acc) [31],(b)Normalized Mutual Information(NMI) [31],(c)Rand index(RI)[31]。Acc度量了聚簇的純度,Acc越大表明聚類純度越高。其取值范圍在0到1之間。歸一化互信息NMI是一個量化兩個分布之間共享統(tǒng)計信息的對稱策略。當聚簇標簽和真實樣本類別一對一映射時NMI值到達最大值1.0。[RI∈[0,1]],當兩個算法劃分完全一致時RI=1。
3.1 數(shù)據(jù)集
為了對模糊ART算法的聚類有效性進行評價,實驗中我們使用了2個真實數(shù)據(jù)集,表1給出數(shù)據(jù)集的相關(guān)信息。
KddCup99與CoverType均來自UCI。KddCup99數(shù)據(jù)集最早來源于MIT 林肯實驗室的一項入侵檢測評估項目, 記錄了9 周內(nèi)TCP 網(wǎng)絡(luò)連接和系統(tǒng)審計數(shù)據(jù), 仿真各種不同的用戶類型、網(wǎng)絡(luò)流量和攻擊手段。這些原始數(shù)據(jù)包含約500000條連接記錄的訓(xùn)練集。每個連接記錄包含41個屬性,這些連接記錄含1種正常的標識類型normal和22種訓(xùn)練攻擊類型,共23個類別。CoverType數(shù)據(jù)集來源于US Geological Survey(USGS)和US Forest Service (USFS)對位于Roosevelt國家森林的四片荒野區(qū)域的觀測。數(shù)據(jù)集中包含581012條記錄, 這些記錄最終被分為7種類型。每條觀測記錄包含54個地質(zhì)學(xué)和地理學(xué)屬性。
3.2 聚類性能
我們評估了模糊ART的聚類性能,警戒參數(shù)r取0.85,每個算法重復(fù)實驗10次。聚類結(jié)果如表2所示。
從表2中,我們可以發(fā)現(xiàn)模糊ART在兩個大規(guī)模數(shù)據(jù)集上均取得了較高的Acc、NMI和Rand指數(shù)。
3.3 警戒參數(shù)r的變化
圖2顯示了模糊ART在2個數(shù)據(jù)集上隨警戒參數(shù)r的變化聚類性能的變化。從圖中可以看出:①CoverType數(shù)據(jù)集上聚類純度Acc隨參數(shù)r的增大,到達一定值后有所下降;②2個數(shù)據(jù)集上NMI和Rand指數(shù)都隨參數(shù)r的增大穩(wěn)步增長。
3.4 運行時間
圖3顯示了2個數(shù)據(jù)集上模糊ART的運行時間。從中可以看出,模糊ART的執(zhí)行時間都隨著數(shù)據(jù)量的增加而增加。研究表明,模糊ART算法對大規(guī)模數(shù)據(jù)的處理效率更高。
4 總結(jié)
本文討論和研究了模糊自適應(yīng)諧振理論(Fuzzy ART)算法,其具有線性計算復(fù)雜性,僅使用一個單一參數(shù),且對參數(shù)設(shè)置具有魯棒性,可產(chǎn)生更好的聚類結(jié)果。特別適用于大規(guī)模數(shù)據(jù)聚類。真實數(shù)據(jù)集上的實驗結(jié)果表明,該算法在大規(guī)模數(shù)據(jù)聚類中取得了很好的性能和更快的速度,而且也不需要預(yù)先定義聚簇個數(shù)。未來的研究方向是如何自動調(diào)整模糊ART算法中的警戒參數(shù)用于提高聚類性能。
參考文獻(References):
[1] S. Papadopoulos, Y. Kompatsiaris, A. Vakali, and P.Spyridonos, "Community detection in social media," Data Mining Knowl. Discovery,2012.24(3):515-554
[2] L. Meng and A.-H.Tan, "Community discovery in social?networks via heterogeneous link association and fusion".in Proc. SIAM Int. Conf. Data Mining,2014:803-811
[3] L. Tang and H. Liu, "Scalable learning of collective behavior based on sparse social dimensions," in Proc. ACM Int. Conf. Inf. Knowl.Manage.,2009:1107-1116
[4] A.-H. Tan, H.-L.Ong, H. Pan, J. Ng, and Q.-X.Li,"Towards personalised Web intelligence," Knowl. Inf. Syst.,2004.6(5):595-616
[5] L. Meng and A.-H.Tan, "Semi-supervised hierarchical clustering for personalized Web image organization," in Proc. Int. Joint Conf. Neural Netw.,2012.6:251-258
[6] J. C. Bezdek and R. J. Hathaway, "VAT: A tool for visual assessment of (cluster) tendency," in Proc. Int. Joint Conf. Neural Netw.,2002.5:2225-2230
[7] I. J. Sledge, J. M. Huband, and J. C. Bezdek, "(Automatic)cluster count extraction from unlabeled data sets," in Proc. Int. Conf. Fuzzy Syst. Knowl. Discovery,2008.10:3-13
[8] L. Wang, C. Leckie, K. Ramamohanarao, and J. Bezdek,"Automatically determining the number of clusters in unlabeled data sets," IEEE Trans. Knowl. Data Eng.,2012.21(3):335-350
[9] W. Wang and Y. Zhang, "On fuzzy cluster validity indices,"Fuzzy Sets Syst.,2007.158(19):2095-2117
[10] J. Liang, X. Zhao, D. Li, F. Cao, and C. Dang,?"Determining the number of clusters using information entropy for mixed data," Pattern Recognit.,2012.45(6):2251-2265
[11] C. A. Sugar and G. M. James, "Finding the number of clusters in a dataset: An information-theoretic approach," J. Amer. Statist. Assoc.,2003.98(463):750-763
[12] H. Sun, S. Wang, and Q. Jiang, "FCM-based model selection algorithms for determining the number of clusters," Pattern Recognit.,2004.37(10):2027-2037
[13] R. Kothari and D. Pitts, "On finding the number of clusters," Pattern Recognit.Lett.,1999.20(4):405-416
[14] J.-S. Lee and S. Olafsson, "A meta-learning approach for determining the number of clusters with consideration of nearest neighbors" Inf. Sci.,2013.232(5):208-224
[15] M. J. Li, M. K. Ng, Y.-M. Cheung, and J. Z. Huang,"Agglomerative fuzzy K-means clustering algorithm with selection of number of clusters," IEEE Trans. Knowl. Data Eng.,2008.20(11):1519-1534
[16] Y. Leung, J.-S.Zhang, and Z.-B. Xu, "Clustering by scale-space filtering," IEEE Trans. Pattern Anal. Mach. Intell.,2000.22(12):1396-1410
[17] H. Yan, K. Chen, L. Liu, and J. Bae, "Determining the best K for clustering transactional datasets: A coverage density-based approach," Data Knowl. Eng.,2009.68(1):28-48
[18] S. Bandyopadhyay and S. Saha, "A point symmetry-based clustering technique for automatic evolution of clusters," IEEE Trans. Knowl. Data Eng.,2008.20(11):1441-1457
[19] S. Bandyopadhyay, "Genetic algorithms for clustering and fuzzy clustering,"Wiley Interdiscipl. Rev., Data Mining Knowl. Discovery,2011.1(6):524-531
[20] H.-P. Kriegel, P. Kr?ger, J. Sander, and A. Zimek,"Density-based clustering,"Wiley Interdiscipl. Rev., Data Mining Knowl. Discovery,2011.1(3):231-240
[21] M. Ester, H.-P.Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proc. ACM SIGKDD Conf. Knowl. Discovery Data Mining,1996:226-231
[22] B. J. Frey and D. Dueck, "Clustering by passing messages between data points,"Science,2007.315(5814):972-976
[23] G. A. Carpenter and S. Grossberg, "A massively parallel? architecture for a self-organizing neural pattern recognition machine," Comput. Vis., Graph., Image Process.,1987. 37(1):54-115
[24] G. A. Carpenter, S. Grossberg, and D. B. Rosen, "Fuzzy ART:Fast stable learning and categorization of analog patterns by an adaptive resonance system," Neural Netw.,1991.4(6):759-771
[25] G. A. Carpenter and S. Grossberg, "ART 2: Self-organization of stable category recognition codes for analog input patterns,"Appl. Opt.,1987.26(23):4919-4930
[26] G. A. Carpenter, S. Grossberg, and D. B. Rosen, "ART 2-A: An adaptive resonance algorithm for rapid category learning and recognition,"Neural Netw.,1987.4:493-504
[27] G. A. Carpenter and S. Grossberg, "ART 3: Hierarchical? search using chemical transmitters in self-organizing pattern recognition architectures,"Neural Netw.,1990.3(2):129-152
[28] B. Mirkin, "Choosing the number of clusters," Wiley Interdiscipl. Rev., Data Mining Knowl. Discovery,2011.1(3):252-260
[29] M. M.-T. Chiang and B. Mirkin, "Intelligent choice of the number of clusters in K-means clustering:An experimental study with different cluster spreads," J. Classification,2010.27(1):3-40
[30] Grossberg S. How does a brain build a cognitive code?[M]//Studies of mind and brain. Springer, Dordrecht,1982:1-52
[31] Zhu Y, Chen S. Growing neural gas with random projection method for high-dimensional data stream clustering[C]. soft computing,2019:1-19