JZUS - Journal of Zhejiang University SCIENCE

Frontiers of Information Technology & Electronic Engineering

Accepted manuscript available online (unedited version)

FAAD: an unsupervised fast and accurate anomaly detection method for a multi-dimensional sequence over data stream

Author(s): Bin Li, Yi-jie Wang, Dong-sheng Yang, Yong-mou Li, Xing-kong Ma
Affiliation(s): Science and Technology on Parallel and Distributed Processing Laboratory, College of Computer, National University of Defense Technology, Changsha 410073, China
Corresponding email(s): libin16a@nudt.edu.cn, wangyijie@nudt.edu.cn
Key Words: Data stream, Multi-dimensional sequence, Anomaly detection, Concept drift, Feature selection


Bin Li, Yi-jie Wang, Dong-sheng Yang, Yong-mou Li, Xing-kong Ma. FAAD: an unsupervised fast and accurate anomaly detection method for a multi-dimensional sequence over data stream[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.1800038

@article{title="FAAD: an unsupervised fast and accurate anomaly detection method for a multi-dimensional sequence over data stream",
author="Bin Li, Yi-jie Wang, Dong-sheng Yang, Yong-mou Li, Xing-kong Ma",
journal="Frontiers of Information Technology & Electronic Engineering",
year="in press",
publisher="Zhejiang University Press & Springer",
doi="https://doi.org/10.1631/FITEE.1800038"
}

%0 Journal Article
%T FAAD: an unsupervised fast and accurate anomaly detection method for a multi-dimensional sequence over data stream
%A Bin Li
%A Yi-jie Wang
%A Dong-sheng Yang
%A Yong-mou Li
%A Xing-kong Ma
%J Frontiers of Information Technology & Electronic Engineering
%P 388-404
%@ 2095-9184
%D in press
%I Zhejiang University Press & Springer
%R 10.1631/FITEE.1800038

TY - JOUR
T1 - FAAD: an unsupervised fast and accurate anomaly detection method for a multi-dimensional sequence over data stream
A1 - Bin Li
A1 - Yi-jie Wang
A1 - Dong-sheng Yang
A1 - Yong-mou Li
A1 - Xing-kong Ma
JO - Frontiers of Information Technology & Electronic Engineering
SP - 388
EP - 404
SN - 2095-9184
Y1 - in press
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.1800038
ER -


Abstract: Sequence anomaly detection is now widely used in many fields, where the sequence data are usually multi-dimensional and arrive as a data stream. Designing an anomaly detection method for multi-dimensional sequences over a data stream that is both accurate and fast is challenging for two reasons: (1) redundant dimensions in the sequence data and a large state space weaken sequence modeling; (2) detection cannot keep up with the high-speed nature of the data stream, especially when concept drift occurs, which reduces the detection rate. Most existing sequence anomaly detection methods focus on single-dimensional sequences, and the studies that do address multi-dimensional sequences concentrate mainly on static databases rather than data streams. To improve anomaly detection for multi-dimensional sequences over a data stream, we propose a novel unsupervised fast and accurate anomaly detection (FAAD) method, which comprises three algorithms. First, a method called “information calculation and minimum spanning tree cluster” (IMC) is adopted to reduce redundant dimensions. Second, to speed up model construction while maintaining the detection rate for sequences over the data stream, we propose a method called “random sampling and subsequence partitioning based on the index probabilistic suffix tree” (RSIPST). Last, a method called “anomaly buffer based on model dynamic adjustment” (ABMDA) dramatically reduces the effects of concept drift in the data stream. FAAD is implemented on the streaming platform Storm to detect anomalies in multi-dimensional log audit data. Compared with existing anomaly detection methods, FAAD performs well in both detection rate and speed and is not affected by concept drift.
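The abstract does not spell out how redundant dimensions are grouped, so the following is only a minimal sketch of the general idea behind clustering dimensions with an information measure and a minimum spanning tree. It assumes symmetric uncertainty as the information measure, a distance of 1 - SU between dimensions, and a cut threshold of 0.5; the function names, the threshold, and the toy columns are all hypothetical and not taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def entropy(values):
    """Shannon entropy (in bits) of a discrete column."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 1.0  # two constant columns: treat as fully redundant
    mutual_info = hx + hy - entropy(list(zip(x, y)))
    return 2 * mutual_info / (hx + hy)


def cluster_dimensions(columns, threshold=0.5):
    """Group dimensions whose pairwise SU chain stays at or above `threshold`.

    Kruskal's algorithm adds edges in order of increasing distance (1 - SU).
    Stopping once the distance exceeds 1 - threshold gives the same groups as
    building the full minimum spanning tree and then cutting its long edges.
    """
    names = list(columns)
    parent = {name: name for name in names}

    def find(a):
        # Union-find lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    edges = sorted(
        (1 - symmetric_uncertainty(columns[a], columns[b]), a, b)
        for a, b in combinations(names, 2)
    )
    for dist, a, b in edges:
        if dist > 1 - threshold:
            break
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())


# Hypothetical log-audit columns: "uid" carries the same information as "user".
columns = {
    "user": ["a", "a", "b", "b", "a", "b"],
    "uid": [1, 1, 2, 2, 1, 2],
    "cmd": ["ls", "cd", "ls", "cd", "cd", "ls"],
}
print(cluster_dimensions(columns))
```

In this toy input, "user" and "uid" fall into one cluster while "cmd" stays alone; keeping one representative per cluster would then drop the redundant dimension. How FAAD actually picks representatives is not stated in the abstract.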

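FAAD: an unsupervised, fast, and accurate anomaly detection method for multi-dimensional sequences over data streams

Summary: Sequence anomaly detection has recently been widely applied in many fields. Sequence data in these fields are usually multi-dimensional over the data stream. Designing an anomaly detection method for multi-dimensional sequences over a data stream that satisfies both detection accuracy and speed is a challenge, because: (1) redundant dimensions in the sequence data and a large state space lead to poor sequence modeling; (2) anomaly detection cannot adapt to the high-speed nature of the data stream, and concept drift in particular lowers the detection rate. On the one hand, most existing sequence anomaly detection methods focus on single-dimensional sequences. On the other hand, studies of multi-dimensional sequences concentrate mainly on static data sets rather than data streams. To improve anomaly detection for multi-dimensional sequences over a data stream, a novel unsupervised fast and accurate anomaly detection (FAAD) method is proposed, which comprises three algorithms. First, a method called "information calculation and minimum spanning tree cluster" (IMC) is adopted to reduce redundant dimensions. Second, to speed up model construction and ensure the detection rate for sequences over the data stream, a method called "random sampling and subsequence partitioning based on the index probabilistic suffix tree" (RSIPST) is proposed. Last, the method called "anomaly buffer based on model dynamic adjustment" (ABMDA) significantly reduces the effects of concept drift in the data stream. FAAD is implemented on the streaming platform Storm to detect multi-dimensional log audit data. Compared with existing anomaly detection methods, FAAD performs well in detection accuracy and speed without being affected by concept drift.

Key words: Data stream; Multi-dimensional sequence; Anomaly detection; Concept drift; Feature selection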
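To give a feel for sequence modeling with a probabilistic suffix tree, the sketch below scores a sequence by how well each symbol is predicted from the longest previously seen suffix of its history. It is a simplified stand-in, not the RSIPST algorithm: it stores plain context counts rather than an indexed, pruned tree, and it omits random sampling and subsequence partitioning. The class name, `max_depth`, the add-one smoothing, and the toy log events are assumptions made for illustration.

```python
import math
from collections import defaultdict


class SuffixContextModel:
    """Bounded-order next-symbol model; a simplified stand-in for a PST."""

    def __init__(self, max_depth=2):
        self.max_depth = max_depth
        # counts[context][symbol] = times `symbol` followed `context` in training
        self.counts = defaultdict(lambda: defaultdict(int))
        self.alphabet = set()

    def fit(self, sequences):
        for seq in sequences:
            self.alphabet.update(seq)
            for i, symbol in enumerate(seq):
                # Record the symbol under every context of length 0..max_depth.
                for depth in range(min(i, self.max_depth) + 1):
                    context = tuple(seq[i - depth:i])
                    self.counts[context][symbol] += 1
        return self

    def _prob(self, history, symbol):
        # Back off to the longest suffix of the history seen in training.
        for depth in range(min(len(history), self.max_depth), -1, -1):
            context = tuple(history[len(history) - depth:])
            if context in self.counts:
                seen = self.counts[context]
                total = sum(seen.values())
                # Add-one smoothing keeps unseen symbols at non-zero probability.
                return (seen.get(symbol, 0) + 1) / (total + len(self.alphabet) + 1)
        return 1 / (len(self.alphabet) + 1)  # only reached if the model is unfitted

    def score(self, seq):
        """Average log-probability per symbol; lower means more anomalous."""
        logp = sum(math.log(self._prob(seq[:i], s)) for i, s in enumerate(seq))
        return logp / max(len(seq), 1)


# Hypothetical audit-log events used only to exercise the sketch.
normal = ["login", "read", "write", "logout"]
model = SuffixContextModel(max_depth=2).fit([normal] * 20)
odd = ["login", "delete", "delete", "logout"]
print(model.score(normal), model.score(odd))
```

The sequence containing the unseen "delete" events receives a lower average log-probability than the training pattern, so a threshold on this score could flag it. Handling concept drift, which the summary attributes to ABMDA, would additionally require updating the counts as the stream evolves and is not shown here.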


