王智遠 任崇廣 陳榕 秦莉
Abstract: Log analysis is an important part of cloud computing platform management, and it aims to guarantee the efficiency and availability of cloud platforms. Logs on such platforms are both complex and massive in volume. This paper proposes a log anomaly detection method. First, log templates are formed by text clustering based on edit distance. Next, feature vectors are constructed from the templates, and weak classifiers are trained on them to produce score feature vectors. Finally, the score feature vectors are combined with Random Forest to build a strong classifier. Experimental results show that the mutual information between the real templates and the generated log templates is 0.91, indicating that the two are close, and that the strong classifier built with Random Forest achieves the best classification accuracy on the data sets, up to 0.94.
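The template-forming step based on edit distance could be sketched as follows. This is a minimal illustration, not the paper's implementation: the token-level distance, the greedy single-pass grouping, the normalized `threshold` of 0.4, the `*` wildcard, and the function names `edit_distance`, `cluster_logs`, and `make_template` are all assumptions made for the example.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance: insertions, deletions, substitutions cost 1 each."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cluster_logs(lines: List[str], threshold: float = 0.4) -> List[List[List[str]]]:
    """Greedy clustering (assumed scheme): a line joins the first cluster whose
    first member is within the normalized distance threshold."""
    clusters: List[List[List[str]]] = []
    for line in lines:
        tokens = line.split()
        for cluster in clusters:
            rep = cluster[0]
            norm = edit_distance(tokens, rep) / max(len(tokens), len(rep), 1)
            if norm <= threshold:
                cluster.append(tokens)
                break
        else:
            clusters.append([tokens])
    return clusters


def make_template(cluster: List[List[str]]) -> str:
    """Keep tokens shared by all same-length members; mask the rest with '*'."""
    rep = cluster[0]
    same_len = [m for m in cluster if len(m) == len(rep)]
    return " ".join(
        tok if all(m[i] == tok for m in same_len) else "*"
        for i, tok in enumerate(rep)
    )


logs = [
    "Connection from 10.0.0.1 closed",
    "Connection from 10.0.0.2 closed",
    "Disk usage at 91 percent",
    "Disk usage at 47 percent",
]
templates = [make_template(c) for c in cluster_logs(logs)]
print(templates)
```

Comparing whole tokens rather than characters keeps variable fields such as addresses or counters from dominating the distance; a character-level distance would work with the same `edit_distance` function by passing strings instead of token lists.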
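This paper first performs basic cleaning of the logs, then clusters them based on Levenshtein distance (edit distance) and forms log templates. TF-IDF (term frequency-inverse document frequency) is applied to the log templates to generate feature vectors. Classifiers such as Bayes, logistic regression, support vector machine, and decision tree are used to construct score feature vectors, which are then combined with Random Forest to build a strong classifier. In the evaluation, mutual information is used to measure the association between the real templates and the templates formed by clustering, accuracy and recall are used to measure classifier performance, and the classification results of the various classifiers are presented at the end.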
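The TF-IDF step that turns log templates into feature vectors could look like the sketch below. It assumes the plain weighting tf × log(N/df) over whitespace-separated tokens; the exact variant used in the paper is not stated here, and the function name `tfidf_vectors` and the sample templates are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def tfidf_vectors(templates: List[str]) -> Tuple[List[str], List[List[float]]]:
    """Return (vocabulary, one TF-IDF vector per template).

    Assumed weighting: tf = count / template length, idf = log(N / df).
    """
    docs = [t.split() for t in templates]
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of templates containing each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against an empty template
        vectors.append([counts[tok] / total * idf[tok] for tok in vocab])
    return vocab, vectors


sample_templates = [
    "Connection from * closed",
    "Disk usage at * percent",
]
vocab, vectors = tfidf_vectors(sample_templates)
for template, vec in zip(sample_templates, vectors):
    print(template, [round(x, 3) for x in vec])
```

With this weighting, a token that appears in every template (here the `*` wildcard) gets weight zero, so the vectors emphasize the tokens that distinguish one template from another. These vectors would then be the input to the weak classifiers that produce the score feature vectors.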