王智遠 任崇廣 陳榕 秦莉
Abstract: Log analysis is an essential part of cloud computing platform management, as it helps guarantee the efficiency and availability of cloud platforms; however, cloud logs are both complex and massive. This paper proposes a log anomaly detection method. First, log templates are formed by text clustering based on edit distance; next, feature vectors are constructed on top of the templates, and weak classifiers are trained to produce score feature vectors; finally, the weak classifiers are combined with a Random Forest to build a strong classifier. Experimental results show that the mutual information between the real templates and the mined log templates reaches 0.91, indicating close agreement, and that the strong classifier built with Random Forest achieves the best classification accuracy on the data sets, up to 0.94.
Introduction
Cloud computing and big data technologies provide a solution for distributed computing. For large distributed systems, anomalies or request timeouts may cause major losses, and the large scale and high complexity of such systems pose severe challenges to maintenance staff [1]. A distributed system produces a large volume of logs at run time; these logs record the system's state and execution trace, so maintainers can use them to track the running state of the system and locate problems [2]. However, because the log volume of a distributed system is large, manual inspection is time-consuming and labor-intensive, and extracting useful information from massive logs has become a key issue in system maintenance.
Log-based anomaly detection generally consists of the following steps: log collection, log parsing, feature extraction, and anomaly detection [3]. Reference [4] starts from the source code and generates the abstract syntax trees (ASTs) related to logging statements to parse out log templates, and then applies PCA on top of the templates to find outliers for anomaly detection; the model takes message-type features into account and detects faults accurately. Reference [5] observed experimentally that inter-process communication becomes markedly more frequent when a system turns abnormal, and on this basis designed an anomaly detection model based on sudden surges or abrupt changes in log message frequency, which is easy to implement and apply. Reference [6] extracts log templates using textual and temporal similarity and builds a control flow graph (CFG) from them to detect anomalies in each workflow; the model supports parallelization and suits large-scale data sets. Reference [7] performs failure prediction with time series and probabilistic methods: rule-based log processing first converts the log text into structured data, variable values are extracted to build features, different time series algorithms predict the numerical variables in the logs, and a rule-based probabilistic model then performs anomaly detection; experiments show that the ARMA (autoregressive moving average) model [8] works well when the data volume is large. Reference [9] performs anomaly detection with a probabilistic model: logs are serialized by means of log templates, a Bayesian model computes, within a time window, the logs most strongly associated with error logs to form failure log sequences, and during online detection any log sequence matching a failure sequence is judged anomalous. Reference [10] applies classification models to failure prediction, builds feature vectors from log templates, and uses an SVM (support vector machine) model [11] for training and prediction. Most domestic log analysis research focuses on Web server logs, while analysis of logs in cloud environments remains rare; this paper therefore proposes an analysis method for cloud-environment logs.
This paper first performs basic cleaning of the logs, then clusters them using the Levenshtein (edit) distance to form log templates. TF-IDF (term frequency-inverse document frequency) feature vectors are generated from the templates; Bayesian, logistic regression, support vector machine, and decision tree classifiers are trained on these vectors to build score feature vectors, which are then combined with a Random Forest to build a strong classifier. In the evaluation, mutual information measures the agreement between the real templates and the templates produced by clustering, precision and recall measure classifier performance, and the classification results of the various classifiers are compared.
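As a rough illustration of this pipeline, the following Python sketch builds TF-IDF feature vectors from mined templates, trains the four weak classifiers, stacks their predicted scores into a score feature vector, and feeds that vector to a Random Forest; the scikit-learn classes, parameter values, and the function name build_strong_classifier are illustrative assumptions rather than the exact configuration used in the experiments.

# Minimal sketch (assumed scikit-learn API); not the paper's exact configuration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import numpy as np

def build_strong_classifier(templates, labels):
    # TF-IDF feature vectors built from the mined log templates.
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(templates)

    # Weak classifiers: Bayes, logistic regression, SVM, decision tree.
    weak_learners = [
        MultinomialNB(),
        LogisticRegression(max_iter=1000),
        SVC(probability=True),
        DecisionTreeClassifier(),
    ]
    # Each weak classifier contributes its predicted probability of the
    # positive (anomalous) class as one column of the score feature vector.
    score_columns = []
    for clf in weak_learners:
        clf.fit(X, labels)
        score_columns.append(clf.predict_proba(X)[:, 1])
    scores = np.column_stack(score_columns)

    # Random Forest combines the weak-classifier scores into a strong classifier.
    forest = RandomForestClassifier(n_estimators=100)
    forest.fit(scores, labels)
    return vectorizer, weak_learners, forest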
1 Log Template Mining
1.1 Edit Distance
The edit distance, proposed by Levenshtein, is a method for measuring the similarity of strings [12]. It is defined as the minimum number of single-character operations, namely insertions, deletions, and substitutions, required to transform a source string S into a target string D.
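For example, transforming "kitten" into "sitting" takes two substitutions and one insertion, so their edit distance is 3. A minimal dynamic-programming implementation is sketched below; the function name edit_distance is chosen for illustration.

def edit_distance(s: str, d: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning s into d."""
    m, n = len(s), len(d)
    # dp[i][j] = edit distance between s[:i] and d[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all remaining characters of s
    for j in range(n + 1):
        dp[0][j] = j          # insert all characters of d
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == d[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution (or match)
    return dp[m][n]

# edit_distance("kitten", "sitting") == 3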
1.2 Template Mining
Logs contain both constants and variables, which makes it difficult to process and analyze unstructured logs directly. The goal of template mining is to find a set of structured log patterns that represent the original unstructured logs. Existing template mining techniques fall into two main categories: clustering-based mining and heuristic mining [14-15]. This paper adopts a template mining method based on edit distance. First, the logs are preprocessed and variable fields (IP addresses, UUIDs, etc.) are replaced with empty strings; then the logs are clustered by edit distance (Algorithm 1); finally, the log templates are formed.
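A minimal sketch of this edit-distance clustering is given below, reusing the edit_distance() function from Section 1.1; the normalized-distance threshold of 0.3, the preprocessing regular expressions, and the function names are illustrative assumptions and may differ from the details of Algorithm 1.

# Minimal sketch of edit-distance clustering (assumed threshold and helpers);
# details may differ from the paper's Algorithm 1.
import re

def preprocess(line: str) -> str:
    """Replace variable fields such as IP addresses and UUIDs with empty strings."""
    line = re.sub(r"\b\d{1,3}(\.\d{1,3}){3}\b", "", line)                      # IP address
    line = re.sub(r"\b[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}\b",
                  "", line)                                                    # UUID
    return line.strip()

def cluster_logs(lines, threshold=0.3):
    """Assign each preprocessed log line to the first cluster whose representative
    is within the normalized edit-distance threshold; otherwise open a new cluster."""
    clusters = []          # each cluster is (representative, [member lines])
    for raw in lines:
        line = preprocess(raw)
        for rep, members in clusters:
            dist = edit_distance(line, rep) / max(len(line), len(rep), 1)
            if dist <= threshold:
                members.append(raw)
                break
        else:
            clusters.append((line, [raw]))
    # The representative of each cluster serves as the log template.
    return [rep for rep, _ in clusters]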
4 Conclusion
The key findings of this work are summarized as follows.
(1) Log templates are formed by clustering the logs based on edit distance, TF-IDF feature vectors are built on top of the templates, and a strong classifier is then constructed from weak classifiers. Experiments show that the strong classifier built from the weak classifiers delivers a noticeable improvement in both recall and precision on the training and test sets.
(2) In the log template extraction step, the threshold is set manually, which limits flexibility; the classifiers likewise rely on manually set thresholds. How to set these thresholds automatically is therefore a direction for future work.
References
[1] FU Qiang, LOU Jianguang, WANG Yi, et al. Execution anomaly detection in distributed systems through unstructured log analysis[C]//2009 Ninth IEEE International Conference on Data Mining. Miami,Florida: IEEE, 2009: 149-158.
[2] TANG Liang, LI Tao, PERNG C S. LogSig: Generating system events from raw textual logs[C]// Proceedings of the 20th ACM International Conference on Information and Knowledge Management. Glasgow, Scotland, UK:ACM, 2011:785-794.
[3] HE Shilin, ZHU Jieming, HE Pinjia, et al. Experience report: System log analysis for anomaly detection[C]// 2016 IEEE 27th International Symposium on Software Reliability Engineering (ISSRE). Ottawa, ON, Canada:IEEE, 2016:207-218.
[4] XU Wei, HUANG Ling, FOX A, et al. Detecting large-scale system problems by mining console logs[C]// Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles. Big Sky, Montana, USA:ACM, 2009:117-132.
[5] LIM C, SINGH N, YAJNIK S. A log mining approach to failure analysis of enterprise telephony systems[C]// IEEE International Conference on Dependable Systems and Networks with FTCS and DCC. Anchorage, AK, USA:IEEE, 2008:398-403.
[6] NANDI A, MANDAL A, ATREJA S, et al. Anomaly detection using program control flow graph mining from execution logs[C]// Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Francisco, California, USA:ACM, 2016:215-224.
[7] SAHOO R K, OLINER A J, RISH I, et al. Critical event prediction for proactive management in large-scale computer clusters[C]// Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Washington, DC, USA:ACM, 2003:426-435.
[8] TINGSANCHALI T, GAUTAM M R. Application of tank, NAM, ARMA and neural network models to flood forecasting[J]. Hydrological Processes, 2000, 14(14):2473-2487.
[9] WATANABE Y, MATSUMOTO Y. Online failure prediction in cloud data centers[J]. Fujitsu Scientific & Technical Journal, 2014, 50(1):66-71.
[10] FRONZA I, SILLITTI A, SUCCI G, et al. Failure prediction based on log files using random indexing and Support Vector Machines[J]. Journal of Systems and Software, 2013, 86(1):2-11.
[11] KEERTHI S S, SHEVADE S K, BHATTACHARYYA C, et al. Improvements to Platt's SMO algorithm for SVM classifier design[J]. Neural Computation, 2001, 13(3):637-649.
[12] LEVENSHTEIN V I. Binary codes capable of correcting deletions, insertions and reversals[J]. Soviet Physics Doklady, 1966, 10(8):707-710.
[13] OKUDA T, TANAKA E, KASAI T. A method for the correction of garbled words based on the Levenshtein metric[J]. IEEE Transactions on Computers, 1976, C-25(2):172-178.
[14] MAKANJU A A O, ZINCIR-HEYWOOD A N, MILIOS E E. Clustering event logs using iterative partitioning[C]// Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Paris, France:ACM, 2009:1255-1264.
[15] VAARANDI R. A data clustering algorithm for mining patterns from event logs[C]// 3rd IEEE Workshop on IP Operations & Management. Kansas City, MO, USA:IEEE, 2003:119-126.
[16] AIZAWA A. An information-theoretic perspective of tf-idf measures[J]. Information Processing & Management, 2003, 39(1):45-65.
[17] ZAHARIA M, CHOWDHURY M, FRANKLIN M J, et al. Spark: Cluster computing with working sets[C]// Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing (HotCloud'10). Boston, MA:USENIX Association, 2010:10.
[18] HARRELL F E. Regression modeling strategies: With applications to linear models, logistic regression, and survival analysis[M]. New York:Springer-Verlag, 2001.
[19] LIAW A, WIENER M. Classification and regression by Random Forest[J]. R News, 2002, 2(3):18-22.