CLC number: TP301
On-line Access: 2019-10-08
Received: 2018-08-05
Revision Accepted: 2018-12-18
Crosschecked: 2019-06-23
Cited: 0
Clicked: 6269
Citations: Bibtex RefMan EndNote GB/T7714
Zhe-jun Kuang, Hang Zhou, Dong-dai Zhou, Jin-peng Zhou, Kun Yang. A non-group parallel frequent pattern mining algorithm based on conditional patterns[J]. Frontiers of Information Technology & Electronic Engineering, 2019, 20(9): 1234-1245.
@article{title="A non-group parallel frequent pattern mining algorithm based on conditional patterns",
author="Zhe-jun Kuang, Hang Zhou, Dong-dai Zhou, Jin-peng Zhou, Kun Yang",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="20",
number="9",
pages="1234-1245",
year="2019",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.1800467"
}
%0 Journal Article
%T A non-group parallel frequent pattern mining algorithm based on conditional patterns
%A Zhe-jun Kuang
%A Hang Zhou
%A Dong-dai Zhou
%A Jin-peng Zhou
%A Kun Yang
%J Frontiers of Information Technology & Electronic Engineering
%V 20
%N 9
%P 1234-1245
%@ 2095-9184
%D 2019
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.1800467
TY - JOUR
T1 - A non-group parallel frequent pattern mining algorithm based on conditional patterns
A1 - Zhe-jun Kuang
A1 - Hang Zhou
A1 - Dong-dai Zhou
A1 - Jin-peng Zhou
A1 - Kun Yang
J0 - Frontiers of Information Technology & Electronic Engineering
VL - 20
IS - 9
SP - 1234
EP - 1245
%@ 2095-9184
Y1 - 2019
PB - Zhejiang University Press & Springer
ER -
DOI - 10.1631/FITEE.1800467
Abstract: Frequent itemset mining serves as the main method of association rule mining. With the limitations in computing space and performance, the association of frequent items in large data mining requires both extensive time and effort, particularly when the datasets become increasingly larger. In the process of associated data mining in a big data environment, the mapReduce programming model is typically used to perform task partitioning and parallel processing, which could improve the execution efficiency of the algorithm. However, to ensure that the associated rule is not destroyed during task partitioning and parallel processing, the inner-relationship data must be stored in the computer space. Because inner-relationship data are redundant, storage of these data will significantly increase the space usage in comparison with the original dataset. In this study, we find that the formation of the frequent pattern (FP) mining algorithm depends mainly on the conditional pattern bases. Based on the parallel frequent pattern (PFP) algorithm theory, the grouping model divides frequent items into several groups according to their frequencies. We propose a non-group PFP (NG-PFP) mining algorithm that cancels the grouping model and reduces the data redundancy between sub-tasks. Moreover, we present the NG-PFP algorithm for task partition and parallel processing, and its performance in the Hadoop cluster environment is analyzed and discussed. Experimental results indicate that the non-group model shows obvious improvement in terms of computational efficiency and the space utilization rate.
[1]Agrawal R, Srikant R, 1994. Fast algorithms for mining association rules in large databases. Proc Int Conf on Very Large Data Bases, p.487-499.
[2]Agarwal RC, Aggarwal CC, Prasad VVV, 2002. A tree projection algorithm for generation of frequent item sets. J Parall Distrib Comput, 61(3):350-371.
[3]Bauer M, Bruveris M, Charon N, et al., 2018. A relaxed approach for curve matching with elastic metrics. https://arxiv.gg363.site/abs/1803.10893
[4]Berti-Équille L, Harmouch H, Naumann F, et al., 2018. Discovery of genuine functional dependencies from relational data with missing values. Proc VLDB Endowm, 11(8):880-892.
[5]Caruccio L, Deufemia V, Polese G, 2016. On the discovery of relaxed functional dependencies. Proc 20$^rm th$ Int Database Engineering & Applications Symp, p.53-61.
[6]Caruccio L, Deufemia V, Polese G, 2017. Evolutionary mining of relaxed dependencies from big data collections. Proc 7th Int Conf on Web Intelligence, Mining and Semantics, Article 5.
[7]Caruccio L, Polese G, Tortora G, 2018. Dependency-based query/view synchronization upon schema evolutions. Int Conf on Conceptual Modeling, p.91-105.
[8]Chen JC, Chen YG, Du XY, et al., 2013. Big data challenge: a data management perspective. Front Comput Sci, 7(2):157-164.
[9]Cong S, Han J, Padua D, 2005. Parallel mining of closed sequential patterns. Proc 11th ACM SIGKDD Int Conf on Knowledge Discovery in Data Mining, p.562-567.
[10]Deng LL, Lou YS, 2015. Improvement and research of FP-growth algorithm based on distributed spark. Proc Int Conf on Cloud Computing and Big Data, p.105-108.
[11]di-Jorio L, Laurent A, Teisseire M, 2009. Mining frequent gradual itemsets from large databases. Int Symp on Intelligent Data Analysis, p.297-308.
[12]El-Hajj M, Zaïane OR, 2006. Parallel bifold: large-scale parallel pattern mining with constraints. Distrib Parall Datab, 20(3):225-243.
[13]Ge KS, Su HY, Li DS, et al., 2017. Efficient parallel implementation of a density peaks clustering algorithm on graphics processing unit. Front Inform Technol Electron Eng, 18(7):915-927.
[14]Han JW, Pei J, Yin YW, 2000. Mining frequent patterns without candidate generation. ACM SIGMOD Rec, 29(2):1-12.
[15]Huhtala Y, Kärkkäinen J, Porkka P, et al., 1999. Tane: an efficient algorithm for discovering functional and approximate dependencies. Comput J, 42(2):100-111.
[16]Kruse S, Naumann F, 2018. Efficient discovery of approximate dependencies. Proc VLDB Endowm, 11(7):759-772.
[17]Li HY, Wang Y, Zhang D, et al., 2008. PFP: parallel FP-growth for query recommendation. Proc ACM Conf on Recommender Systems, p.107-114.
[18]Li N, Zeng L, He Q, et al., 2012. Parallel implementation of Apriori algorithm based on MapReduce. Proc. 13th ACIS Int Conf on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing, p.236-241.
[19]Lin KW, Chung SH, 2015. A fast and resource efficient mining algorithm for discovering frequent patterns in distributed computing environments. Fut Gener Comput Syst, 52:49-58.
[20]Lin MY, Lee PY, Hsueh SC, 2012. Apriori-based frequent itemset mining algorithms on MapReduce. Proc 6th Int Conf on Ubiquitous Information Management and Communication, Article 26.
[21]Liu JQ, Wu YS, Zhou QF, et al., 2015. Parallel Eclat for opportunistic mining of frequent itemsets. Int Conf on Database and Expert Systems Applications, p.401-415.
[22]Lucchese C, Orlando S, Perego R, et al., 2004. WebDocs: a real-life huge transactional dataset. Proc IEEE ICDM Workshop on Frequent Itemset Mining Implementations.
[23]Mandros P, Boley M, Vreeken J, 2017. Discovering reliable approximate functional dependencies. Proc 23rd ACM SIGKDD Int Conf on Knowledge Discovery and Data Mining, p.355-363.
[24]Riondato M, DeBrabant JA, Fonseca R, et al., 2012. PARMA: a parallel randomized algorithm for approximate association rules mining in MapReduce. Proc 21st ACM Int Conf on Information and Knowledge Management, p.85-94.
[25]Siddiqa A, Karim A, Gani A, 2017. Big data storage technologies: a survey. Front Inform Technol Electron Eng, 18(8):1040-1070.
[26]Srikant R, Agrawal R, 1996. Mining sequential patterns: generalizations and performance improvements. Int Conf on Extending Database Technology, p.1-17.
[27]Wang F, Hu L, Zhou J, et al., 2015. A survey from the perspective of evolutionary process in the Internet of Things. Int J Distrib Sen Networks, 11(3):462752.
[28]Wang J, Han J, 2004. BIDE: efficient mining of frequent closed sequences. Proc 20th Int Conf on Data Engineering, p.79-90.
[29]Xia D, Zhou Y, Rong Z, et al., 2013. IPFP: an improved parallel FP-growth algorithm for frequent itemsets mining. Proc 59thISI World Statistics Congress, p.4034-4039.
[30]Xia D, Rong Z, Zhou Y, 2014. A novel parallel algorithm for frequent itemsets mining in massive small files datasets. ICIC Expr Lett Part B, 5(2):459-466.
[31]Yang Q, Du FY, Zhu X, et al., 2016. Improved balanced parallel FP-growth with MapReduce. Joint Int Conf on Artificial Intelligence and Computer Engineering and Int Conf on Network and Communication Security, p.1-5.
[32]Yang XY, Liu Z, Fu Y, 2010. MapReduce as a programming model for association rules algorithm on Hadoop. Proc 3rd Int Conf on Information Sciences and Interaction Sciences, p.99-102.
[33]Yu KM, Zhou JY, Hsiao WC, 2007. Load balancing approach parallel algorithm for frequent pattern mining. Proc Int Conf on Parallel Computing Technologies, p.623-631.
[34]Zaki MJ, 2000. Scalable algorithms for association mining. IEEE Trans Know Data Eng, 12(3):372-390.
[35]Zaki MJ, 2001a. Parallel sequence mining on shared-memory machines. J Parall Distrib Comput, 61(3):401-426.
[36]Zaki MJ, 2001b. SPADE: an efficient algorithm for mining frequent sequences. Mach Learn, 42(1-2):31-60.
[37]Zhang XL, Breitinger F, Baggili I, 2016. Rapid Android parser for investigating DEX files (RAPID). Dig Investig, 17:28-39.
[38]Zhang XL, Baggili I, Breitinger F, 2017. Breaking into the vault: privacy, security and forensic analysis of Android vault applications. Comput Secur, 70:516-531.
[39]Zhang ZG, Ji GL, Tang MM, 2013. MREclat: an algorithm for parallel mining frequent itemsets. Proc Int Conf on Advanced Cloud and Big Data, p.177-180.
[40]Zhao YX, Zhang WX, Li DS, et al., 2016. Pegasus: a distributed and load-balancing fingerprint identification system. Front Inform Technol Electron Eng, 17(8):766-780.
[41]Zheng XF, Wang S, 2014. Study on the method of road transport management information data mining based on pruning Eclat algorithm and MapReduce. Proc Soc Behav Sci, 138:757-766.
[42]Zhou L, Zhong ZY, Chang J, et al., 2010. Balanced parallel FP-growth with MapReduce. Proc. IEEE Youth Conf on Information, Computing and Telecommunications, p.243-246.
[43]Zhuang YT, Wu F, Chen C, et al., 2017. Challenges and opportunities: from big data to knowledge in AI 2.0. Front Inform Technol Electron Eng, 18(1):3-14.
Open peer comments: Debate/Discuss/Question/Opinion
<1>