JZUS - Journal of Zhejiang University SCIENCE

Frontiers of Information Technology & Electronic Engineering

Accepted manuscript available online (unedited version)

A non-group parallel frequent pattern mining algorithm based on conditional patterns

Author(s): Zhe-jun Kuang, Hang Zhou, Dong-dai Zhou, Jin-peng Zhou, Kun Yang
Affiliation(s): College of Computer Science and Technology, Changchun University, Changchun 130022, China; more
Corresponding email(s): kuangzhejun@ccu.edu.cn, zhouhang0311@163.com, kddzhou@nenu.edu.cn
Key Words: Frequent pattern mining, Parallel algorithm, Conditional pattern bases, MapReduce, Big data

Share this article to： More <<< Previous Paper \|Next Paper >>>

Zhe-jun Kuang, Hang Zhou, Dong-dai Zhou, Jin-peng Zhou, Kun Yang. A non-group parallel frequent pattern mining algorithm based on conditional patterns[J]. Frontiers of Information Technology & Electronic Engineering,in press.https://doi.org/10.1631/FITEE.1800467

@article{title="A non-group parallel frequent pattern mining algorithm based on conditional patterns",
author="Zhe-jun Kuang, Hang Zhou, Dong-dai Zhou, Jin-peng Zhou, Kun Yang",
journal="Frontiers of Information Technology & Electronic Engineering",
year="in press",
publisher="Zhejiang University Press & Springer",
doi="https://doi.org/10.1631/FITEE.1800467"
}

%0 Journal Article
%T A non-group parallel frequent pattern mining algorithm based on conditional patterns
%A Zhe-jun Kuang
%A Hang Zhou
%A Dong-dai Zhou
%A Jin-peng Zhou
%A Kun Yang
%J Frontiers of Information Technology & Electronic Engineering
%P 1234-1245
%@ 2095-9184
%D in press
%I Zhejiang University Press & Springer
doi="https://doi.org/10.1631/FITEE.1800467"

TY - JOUR
T1 - A non-group parallel frequent pattern mining algorithm based on conditional patterns
A1 - Zhe-jun Kuang
A1 - Hang Zhou
A1 - Dong-dai Zhou
A1 - Jin-peng Zhou
A1 - Kun Yang
J0 - Frontiers of Information Technology & Electronic Engineering
SP - 1234
EP - 1245
%@ 2095-9184
Y1 - in press
PB - Zhejiang University Press & Springer
ER -
doi="https://doi.org/10.1631/FITEE.1800467"

Abstract
Chinese Summary
Academic Network
Reviewer Comment

Abstract: Frequent itemset mining serves as the main method of association rule mining. With the limitations in computing space and performance, the association of frequent items in large data mining requires both extensive time and effort, particularly when the datasets become increasingly larger. In the process of associated data mining in a big data environment, the mapReduce programming model is typically used to perform task partitioning and parallel processing, which could improve the execution efficiency of the algorithm. However, to ensure that the associated rule is not destroyed during task partitioning and parallel processing, the inner-relationship data must be stored in the computer space. Because inner-relationship data are redundant, storage of these data will significantly increase the space usage in comparison with the original dataset. In this study, we find that the formation of the frequent pattern (FP) mining algorithm depends mainly on the conditional pattern bases. Based on the parallel frequent pattern (PFP) algorithm theory, the grouping model divides frequent items into several groups according to their frequencies. We propose a non-group PFP (NG-PFP) mining algorithm that cancels the grouping model and reduces the data redundancy between sub-tasks. Moreover, we present the NG-PFP algorithm for task partition and parallel processing, and its performance in the Hadoop cluster environment is analyzed and discussed. Experimental results indicate that the non-group model shows obvious improvement in terms of computational efficiency and the space utilization rate.

基于条件模式的一种无分组并行频繁模式挖掘算法

摘要：频繁项集挖掘是关联规则挖掘的主要方法。由于计算空间和性能限制，特别是当数据集剧增时，挖掘频繁项的关联需要大量时间和资源。在大数据环境下的关联数据挖掘过程中，通常采用MapReduce模型进行任务划分及并行处理，从而提高算法执行效率。为确保关联规则在任务划分和并行处理期间不被破坏，需要将内部关系数据存储在计算机空间中。与原始数据集相比，存储冗余的内部关系数据将显著增加空间的使用。研究发现，频繁模式挖掘算法的形成主要依赖于条件模式基。基于并行频繁模式（PFP）算法理论，本文提出一种无分组的PFP（NG-PFP）挖掘算法。该算法取消了分组模式，减少了子任务之间的数据冗余。实验结果表明，无分组模型在计算效率和空间利用率方面都有显著提高。

关键词组：频繁模式挖掘；并行算法；条件模式基；MapReduce；大数据

Darkslateblue:Affiliate; Royal Blue:Author; Turquoise:Article

Reference

[1]Agrawal R, Srikant R, 1994. Fast algorithms for mining association rules in large databases. Proc Int Conf on Very Large Data Bases, p.487-499.

[2]Agarwal RC, Aggarwal CC, Prasad VVV, 2002. A tree projection algorithm for generation of frequent item sets. J Parall Distrib Comput, 61(3):350-371.

[3]Bauer M, Bruveris M, Charon N, et al., 2018. A relaxed approach for curve matching with elastic metrics. https://arxiv.gg363.site/abs/1803.10893

[4]Berti-Équille L, Harmouch H, Naumann F, et al., 2018. Discovery of genuine functional dependencies from relational data with missing values. Proc VLDB Endowm, 11(8):880-892.

[5]Caruccio L, Deufemia V, Polese G, 2016. On the discovery of relaxed functional dependencies. Proc 20$^rm th$ Int Database Engineering & Applications Symp, p.53-61.

[6]Caruccio L, Deufemia V, Polese G, 2017. Evolutionary mining of relaxed dependencies from big data collections. Proc 7^th Int Conf on Web Intelligence, Mining and Semantics, Article 5.

[7]Caruccio L, Polese G, Tortora G, 2018. Dependency-based query/view synchronization upon schema evolutions. Int Conf on Conceptual Modeling, p.91-105.

[8]Chen JC, Chen YG, Du XY, et al., 2013. Big data challenge: a data management perspective. Front Comput Sci, 7(2):157-164.

[9]Cong S, Han J, Padua D, 2005. Parallel mining of closed sequential patterns. Proc 11^th ACM SIGKDD Int Conf on Knowledge Discovery in Data Mining, p.562-567.

[10]Deng LL, Lou YS, 2015. Improvement and research of FP-growth algorithm based on distributed spark. Proc Int Conf on Cloud Computing and Big Data, p.105-108.

[11]di-Jorio L, Laurent A, Teisseire M, 2009. Mining frequent gradual itemsets from large databases. Int Symp on Intelligent Data Analysis, p.297-308.

[12]El-Hajj M, Zaïane OR, 2006. Parallel bifold: large-scale parallel pattern mining with constraints. Distrib Parall Datab, 20(3):225-243.

[13]Ge KS, Su HY, Li DS, et al., 2017. Efficient parallel implementation of a density peaks clustering algorithm on graphics processing unit. Front Inform Technol Electron Eng, 18(7):915-927.

[14]Han JW, Pei J, Yin YW, 2000. Mining frequent patterns without candidate generation. ACM SIGMOD Rec, 29(2):1-12.

[15]Huhtala Y, Kärkkäinen J, Porkka P, et al., 1999. Tane: an efficient algorithm for discovering functional and approximate dependencies. Comput J, 42(2):100-111.

[16]Kruse S, Naumann F, 2018. Efficient discovery of approximate dependencies. Proc VLDB Endowm, 11(7):759-772.

[17]Li HY, Wang Y, Zhang D, et al., 2008. PFP: parallel FP-growth for query recommendation. Proc ACM Conf on Recommender Systems, p.107-114.

[18]Li N, Zeng L, He Q, et al., 2012. Parallel implementation of Apriori algorithm based on MapReduce. Proc. 13^th ACIS Int Conf on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing, p.236-241.

[19]Lin KW, Chung SH, 2015. A fast and resource efficient mining algorithm for discovering frequent patterns in distributed computing environments. Fut Gener Comput Syst, 52:49-58.

[20]Lin MY, Lee PY, Hsueh SC, 2012. Apriori-based frequent itemset mining algorithms on MapReduce. Proc 6^th Int Conf on Ubiquitous Information Management and Communication, Article 26.

[21]Liu JQ, Wu YS, Zhou QF, et al., 2015. Parallel Eclat for opportunistic mining of frequent itemsets. Int Conf on Database and Expert Systems Applications, p.401-415.

[22]Lucchese C, Orlando S, Perego R, et al., 2004. WebDocs: a real-life huge transactional dataset. Proc IEEE ICDM Workshop on Frequent Itemset Mining Implementations.

[23]Mandros P, Boley M, Vreeken J, 2017. Discovering reliable approximate functional dependencies. Proc 23^rd ACM SIGKDD Int Conf on Knowledge Discovery and Data Mining, p.355-363.

[24]Riondato M, DeBrabant JA, Fonseca R, et al., 2012. PARMA: a parallel randomized algorithm for approximate association rules mining in MapReduce. Proc 21^st ACM Int Conf on Information and Knowledge Management, p.85-94.

[25]Siddiqa A, Karim A, Gani A, 2017. Big data storage technologies: a survey. Front Inform Technol Electron Eng, 18(8):1040-1070.

[26]Srikant R, Agrawal R, 1996. Mining sequential patterns: generalizations and performance improvements. Int Conf on Extending Database Technology, p.1-17.

[27]Wang F, Hu L, Zhou J, et al., 2015. A survey from the perspective of evolutionary process in the Internet of Things. Int J Distrib Sen Networks, 11(3):462752.

[28]Wang J, Han J, 2004. BIDE: efficient mining of frequent closed sequences. Proc 20^th Int Conf on Data Engineering, p.79-90.

[29]Xia D, Zhou Y, Rong Z, et al., 2013. IPFP: an improved parallel FP-growth algorithm for frequent itemsets mining. Proc 59^thISI World Statistics Congress, p.4034-4039.

[30]Xia D, Rong Z, Zhou Y, 2014. A novel parallel algorithm for frequent itemsets mining in massive small files datasets. ICIC Expr Lett Part B, 5(2):459-466.

[31]Yang Q, Du FY, Zhu X, et al., 2016. Improved balanced parallel FP-growth with MapReduce. Joint Int Conf on Artificial Intelligence and Computer Engineering and Int Conf on Network and Communication Security, p.1-5.

[32]Yang XY, Liu Z, Fu Y, 2010. MapReduce as a programming model for association rules algorithm on Hadoop. Proc 3^rd Int Conf on Information Sciences and Interaction Sciences, p.99-102.

[33]Yu KM, Zhou JY, Hsiao WC, 2007. Load balancing approach parallel algorithm for frequent pattern mining. Proc Int Conf on Parallel Computing Technologies, p.623-631.

[34]Zaki MJ, 2000. Scalable algorithms for association mining. IEEE Trans Know Data Eng, 12(3):372-390.

[35]Zaki MJ, 2001a. Parallel sequence mining on shared-memory machines. J Parall Distrib Comput, 61(3):401-426.

[36]Zaki MJ, 2001b. SPADE: an efficient algorithm for mining frequent sequences. Mach Learn, 42(1-2):31-60.

[37]Zhang XL, Breitinger F, Baggili I, 2016. Rapid Android parser for investigating DEX files (RAPID). Dig Investig, 17:28-39.

[38]Zhang XL, Baggili I, Breitinger F, 2017. Breaking into the vault: privacy, security and forensic analysis of Android vault applications. Comput Secur, 70:516-531.

[39]Zhang ZG, Ji GL, Tang MM, 2013. MREclat: an algorithm for parallel mining frequent itemsets. Proc Int Conf on Advanced Cloud and Big Data, p.177-180.

[40]Zhao YX, Zhang WX, Li DS, et al., 2016. Pegasus: a distributed and load-balancing fingerprint identification system. Front Inform Technol Electron Eng, 17(8):766-780.

[41]Zheng XF, Wang S, 2014. Study on the method of road transport management information data mining based on pruning Eclat algorithm and MapReduce. Proc Soc Behav Sci, 138:757-766.

[42]Zhou L, Zhong ZY, Chang J, et al., 2010. Balanced parallel FP-growth with MapReduce. Proc. IEEE Youth Conf on Information, Computing and Telecommunications, p.243-246.

[43]Zhuang YT, Wu F, Chen C, et al., 2017. Challenges and opportunities: from big data to knowledge in AI 2.0. Front Inform Technol Electron Eng, 18(1):3-14.

Open peer comments: Debate/Discuss/Question/Opinion

<1>

- Go to

基于条件模式的一种无分组并行频繁模式挖掘算法

Darkslateblue:Affiliate; Royal Blue:Author; Turquoise:Article

Reference