杜媛 張世偉
摘 要:針對自然排序樹(CAN-tree)算法構(gòu)建的樹結(jié)構(gòu)節(jié)點個數(shù)過多、壓縮性不高等問題,提出一種基于重構(gòu)的改進CAN-tree算法。首先,使用自然排序法直接構(gòu)建樹結(jié)構(gòu),將頻繁項集挖掘算法實現(xiàn)中數(shù)據(jù)庫掃描次數(shù)減少至1;然后,對構(gòu)建的樹結(jié)構(gòu)以支持度降序方式結(jié)合剪枝操作實現(xiàn)樹結(jié)構(gòu)的重構(gòu),得到高壓縮性的樹結(jié)構(gòu);最后,對重構(gòu)的樹結(jié)構(gòu)進行頻繁項集挖掘。實驗結(jié)果表明,基于重構(gòu)的改進CAN-tree算法所構(gòu)建的樹結(jié)構(gòu)節(jié)點個數(shù)減少至原來的20%以下,執(zhí)行效率提高了4至6倍,在頻繁項集挖掘中有效地壓縮了樹結(jié)構(gòu),縮短了算法的執(zhí)行時間。
關(guān)鍵詞:頻繁項集;頻繁項集頭表;重構(gòu);剪枝;最小支持度
中圖分類號: TP301.6
文獻標志碼:A
Abstract: In order to solve the problems such as too many nodes and low compressibility in the tree structure constructed by CANonical-order tree (CAN-tree) algorithm, an improved CAN-tree algorithm based on restructure was proposed. Firstly, a tree structure was constructed directly with canonical-order, which scans the database only once in the frequent itemset mining algorithm. Then, in order to get a tree structure with high compressibility, a pruning operation was used with support in desending order to restructure the tree. Finally, frequent itemsets were mined out for the reconstructed tree structure. The experimental results show that compared with original CAN-tree algorithm, the number of nodes constructed by the improved CAN-tree algorithm is reduced to less than 20%,and the execution efficiency is improved by 4 to 6 times. The proposed algorithm shortens the execution time of the frequent itemset mining algorithm and effectively compresses the tree structure in it.
Key words: frequent itemset; frequent itemset header table; restructure; pruning; minimum support
0 引言
在大數(shù)據(jù)時代,數(shù)據(jù)成為一種資源,數(shù)據(jù)挖掘技術(shù)能夠發(fā)現(xiàn)隱藏在數(shù)據(jù)中的信息,而關(guān)聯(lián)規(guī)則挖掘就是其重要組成部分之一。在關(guān)聯(lián)規(guī)則挖掘應(yīng)用中,頻繁項集挖掘尤為重要。頻繁項集挖掘算法可分成2類:1)Apriori算法[1],需多次掃描庫,影響執(zhí)行效率,同時還將產(chǎn)生大量候選項集,需大量存儲空間;2)頻繁模式增長(Frequent Pattern growth, FP-growth)算法[2],僅需掃描2次數(shù)據(jù)庫,使用頻繁模式樹(Frequent Pattern tree,F(xiàn)P-tree)存儲壓縮后的原始數(shù)據(jù),相比Apriori,F(xiàn)P-growth減少了數(shù)據(jù)庫掃描次數(shù),不僅提高了算法的執(zhí)行效率,也減少了內(nèi)存的占有量。
對于FP_growth算法,不少學(xué)者提出了改進方向:1)修改FP-tree節(jié)點結(jié)構(gòu)[3]、添加Hash表輔助存儲結(jié)構(gòu)[4-5]、使用構(gòu)造鏈表(Building list, B-list)[6]修改頻繁項集表達形式等以提高節(jié)點查詢速度;2)直接對最大頻繁項進行挖掘,使用剪枝技術(shù)減少FP-tree的節(jié)點數(shù)[7-9];3)使用分布式計算框架,設(shè)計并行運行的FP-growth算法,如并行FP-growth算法(Parallel FP-growth,PFP)[10]、利用MapReduce模型加上動態(tài)閾值法改進并行FP-growth算法[11]以縮短算法執(zhí)行時間。但是這些改進都需要掃描數(shù)據(jù)庫2次:第1次掃描創(chuàng)建降序排列的頻繁項集頭表,第2次掃描構(gòu)建FP-tree,且構(gòu)建過程依賴第1次掃描結(jié)果。而自然排序樹(Canonical-order tree,CAN-tree)算法[12]的提出正好解決了該問題。CAN-tree在構(gòu)建時使用自然排序法(例如字母排序法),例如事務(wù){(diào)a,c,b},在構(gòu)建時直接以{a,b,c}順序插入節(jié)點,而FP-tree在構(gòu)建時需要根據(jù){a}、{c}、的支持度計數(shù)進行降序排列,所以CAN-tree僅需掃描數(shù)據(jù)庫1次即可構(gòu)建完整樹,提高了算法的執(zhí)行效率。但是CAN-tree在構(gòu)建時保留了所有數(shù)據(jù),這導(dǎo)致樹的結(jié)構(gòu)非常龐大,直接降低了后期挖掘速度。對此有學(xué)者對CAN-tree節(jié)點結(jié)構(gòu)進行了改進,如將CAN-tree節(jié)點中指向子節(jié)點的指針改成指向父節(jié)點[13],在此基礎(chǔ)上,基于CAN-tree快速構(gòu)建算法(Fast Construction algorithm based on CAN-tree, FCCAN)[14]又增加了哈希存儲表和葉子頭表,以減少查詢和剪枝時間,提高條件模式基生成速度,但是這些改進都未從本質(zhì)上縮小CAN-tree的結(jié)構(gòu)。對比FP-tree可以發(fā)現(xiàn),CAN-tree的結(jié)構(gòu)明顯比較稀疏,因為FP-tree使用支持度降序排列插入節(jié)點,支持度計數(shù)越大的節(jié)點越靠近Root節(jié)點(因為Root是所有路徑的共享節(jié)點),所以支持度計數(shù)越大的節(jié)點在樹結(jié)構(gòu)中的共享程度越大,其共享前綴路徑也越多,F(xiàn)P-tree也越緊湊。在頻繁項集挖掘過程中,算法需要對整棵樹結(jié)構(gòu)進行遞歸處理,而龐大的樹結(jié)構(gòu)會明顯增加遞歸挖掘的層次和執(zhí)行的時間,所以緊湊的樹結(jié)構(gòu)在后期生成條件模式基進行頻繁項集挖掘時效率更高。
由于CAN-tree構(gòu)建的完整樹結(jié)構(gòu)龐大稀疏、后期遞歸進行頻繁項集挖掘時遞歸層次過多、挖掘效率低等問題,本文著力于縮小CAN-tree構(gòu)建的樹結(jié)構(gòu),提出了一種基于重構(gòu)的改進CAN-tree算法: 1)使用自然排序法實現(xiàn)掃描數(shù)據(jù)庫1次,構(gòu)造CAN-tree和頻繁項集頭表;2)對頻繁項集頭表進行降序排列,并根據(jù)支持度計數(shù)降序模式結(jié)合最小支持度剪枝法對CAN-tree進行重構(gòu)。改進算法保留了CAN-tree僅需掃描數(shù)據(jù)庫1次的優(yōu)勢,降低了算法在數(shù)據(jù)庫讀取上的時間損耗,同時對CAN-tree構(gòu)建的樹結(jié)構(gòu)進行重構(gòu),得到壓縮性更高的樹,以減少后期頻繁項集挖掘過程中遞歸掃描層數(shù),進一步減少算法執(zhí)行的時間損耗,提高執(zhí)行效率。
1 頻繁項集
設(shè)I={i1,i2,…,in}是項目的集合,稱為項集。如果項集長度為k,則稱為k-項集,項集I可稱為n-項集。如果I是頻繁的,則可稱為頻繁n-項集。設(shè)數(shù)據(jù)庫D,D中每條數(shù)據(jù)是一個項集,稱為TID,項集中包含一個或者多個I中的項。假設(shè)A是I中的一個子項集,A的支持度計數(shù)count(A)是D中包含A的個數(shù),支持度support(A)是count(A)在D中事務(wù)總數(shù)所占的比例,最小支持度可記為min_support。
5 結(jié)語
為減少CAN-tree構(gòu)建的樹結(jié)構(gòu)節(jié)點個數(shù),提高算法在頻繁項集挖掘過程中的效率,本文提出了一種基于重構(gòu)的改進CAN-tree算法,該算法有以下特點:1)在構(gòu)建樹結(jié)構(gòu)時首先采用自然排序法,僅需掃描數(shù)據(jù)庫一次;2)利用重構(gòu)機制,采用支持度降序方式構(gòu)建壓縮性更好的樹結(jié)構(gòu),同時加入剪枝操作以減少樹結(jié)構(gòu)中的節(jié)點個數(shù),提高挖掘效率。實驗結(jié)果表明,基于重構(gòu)的改進CAN-tree算法在頻繁項集挖掘中具有更好的執(zhí)行效率。但是改進CAN-tree算法在構(gòu)建樹結(jié)構(gòu)階段中保留了數(shù)據(jù)庫所有信息,并且比原算法多了重構(gòu)階段,所以改進CAN-tree算法在獲得高壓縮性樹結(jié)構(gòu)前對系統(tǒng)內(nèi)存的開銷較大,對于這方面的缺陷,下一步將考慮與分布式計算框架相結(jié)合,同時在內(nèi)存消耗和算法執(zhí)行時間上進行完善。
參考文獻:
[1] AGRAWL R, IMIELINSKI T, SWAMI A. Mining association rules between sets of items in large databases [C] // Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data. New York: ACM, 1993: 207-216.
[2] HAN J, PEI J, YIN Y. Mining frequent patterns without candidate generation [C]// Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. New York: ACM 2000:1-12.
[3] DING Z, WEI Q, DING X. An improved FP-growth algorithm based on Compound single linked list [C] // ICIC 09Proceedings of the 2009 Second International Conference on Information and Computing Science. Washington, DC: IEEE Computer Society, 2009, 1: 351-353.
[4] HAO J, XU H. An improved algorithm for frequency itemsets mining [C] // Proceedings of the 2017 Fifth International Conference on Advanced Cloud and Big Data. Washington, DC: IEEE Computer Society, 2017: 314-317.
[5] 李也白,唐輝,張淳,等. 基于改進的FP-tree的頻繁模式挖掘算法[J]. 計算機應(yīng)用,2011,31(1):101-103.(LI Y B, TANG H, ZHANG C, el al. Frequent pattern mining algorithm based on improve FP-tree [J]. Journal of Computer Applications, 2011, 31(1): 101-103.)
[6] 李校林,杜托,劉彪.基于B-list的快速頻繁模式挖掘算法[J].計算機應(yīng)用,2017,37(8):2357-2361,2367. (LI X L,DU T,LIU B. Fast algorithm for mining frequent patterns based on B-list [J]. Journal of Computer Applications, 2017, 37(8): 2357-2361, 2367.)
[7] 馬麗生,姚光順,楊傳健.基于改進的FP-tree最大頻繁項目集挖掘算法[J].計算機應(yīng)用,2012,32(2):326-329.(MA L S, YAO G S, YANG C J. Mining algorithm for maximal frequent itemsets based on improved FP-tree [J]. Journal of Computer Applications, 2012, 32(2): 326-329.)
[8] 寧慧,王素紅,催立剛,等. 基于改進的FP-tree最大頻繁模式挖掘算法[J]. 應(yīng)用科技,2016,43(2):37-43.(NING H, WANG S H, CUI L G, et al. An algorithm for mining maximal frequent patterns based on improved FP-tree[J]. Applied Science and Technology,2016,43(2):37-43.)
[9] 趙陽,吳廖丹. 一種自底向上的最大頻繁項集挖掘方法[J]. 計算機技術(shù)與發(fā)展,2017,27(8):57-60.(ZHAO Y,WU L D. A bottom-up method for mining maximum frequent itemsets [J]. Computer Technology and Development, 2017, 27(8): 57-60.)
[10] LI H, WANG Y, ZHAN D, el al. PFP: parallel FP-growth for query recommendation [C]// Proceedings of the 2008 ACM Conference on Recommender Systems. New York: ACM, 2008: 107-114.
[11] WEI X, MA Y, ZHANG F, el al. Incremental FP-Growth mining strategy for dynamic threshold value and database based on MapReduce [C] // Proceedings of the 2014 IEEE 18th International Conference on Computer Supported Cooperative Wori in Design. Piscataway, NJ: IEEE, 2014: 271-276.
[12] LEUNG C K-S, KHAN Q I, HOQUE T. CANTree: a tree structure for efficient incremental mining of frequent patterns [C] // Proceedings of the Fifth IEEE International Conference on Data Mining. Washington, DC: IEEE Computer Society, 2005: 274-281.
[13] 鄒力鹍,張其善. 基于CAN-樹的高效關(guān)聯(lián)規(guī)則增量挖掘算法[J]. 計算機工程,2008,34(3):29-31.(ZOU L K, ZHANG Q S. Efficient incremental association rules mining algorithm based on CAN-tree [J]. Computer Engineering, 2008, 34(3): 29-31.)
[14] 陳剛,閆英戰(zhàn),劉秉權(quán). 一種基于CAN-tree快速構(gòu)建算法[J]. 微電子學(xué)與計算機,2014,31(1):76-82.(CHEN G, YAN Y Z, LIU B Q. A fast construction algorithm based on CAN-tree[J]. Microelectronics & Computer, 2014, 32(1): 76-82.)
[15] TAN P M, STEINBACH M, KUMAR V. 數(shù)據(jù)挖掘?qū)д摚ㄍ暾妫M]. 范明,等譯. 北京:人民郵電出版社,2017: 225-228.(TAN P M, STEINBACH M, KUMAR V. Writing Introduction to Data Mining [M]. FAN M, translated. Beijing:Post & Telecom Press, 2017: 225-228.)