JZUS - Journal of Zhejiang University SCIENCE

ENGINEERING Information Technology & Electronic Engineering

Accepted manuscript available online (unedited version)

FrepJoin: an efficient partition-based algorithm for edit similarity join

Author(s): Ji-zhou Luo, Sheng-fei Shi, Hong-zhi Wang, Jian-zhong Li
Affiliation(s): School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China; more
Corresponding email(s): luojizhou@hit.edu.cn, shengfei@hit.edu.cn, wangzh@hit.edu.cn, lijzh@hit.edu.cn
Key Words: String similarity join, Edit distance, Filter and refine, Data partition, Combined frequency vectors

Share this article to： More <<< Previous Paper \|Next Paper >>>

Ji-zhou Luo, Sheng-fei Shi, Hong-zhi Wang, Jian-zhong Li. FrepJoin: an efficient partition-based algorithm for edit similarity join[J]. Frontiers of Information Technology & Electronic Engineering,in press.https://doi.org/10.1631/FITEE.1601347

@article{title="FrepJoin: an efficient partition-based algorithm for edit similarity join",
author="Ji-zhou Luo, Sheng-fei Shi, Hong-zhi Wang, Jian-zhong Li",
journal="Frontiers of Information Technology & Electronic Engineering",
year="in press",
publisher="Zhejiang University Press & Springer",
doi="https://doi.org/10.1631/FITEE.1601347"
}

%0 Journal Article
%T FrepJoin: an efficient partition-based algorithm for edit similarity join
%A Ji-zhou Luo
%A Sheng-fei Shi
%A Hong-zhi Wang
%A Jian-zhong Li
%J Frontiers of Information Technology & Electronic Engineering
%P 1499-1510
%@ 2095-9184
%D in press
%I Zhejiang University Press & Springer
doi="https://doi.org/10.1631/FITEE.1601347"

TY - JOUR
T1 - FrepJoin: an efficient partition-based algorithm for edit similarity join
A1 - Ji-zhou Luo
A1 - Sheng-fei Shi
A1 - Hong-zhi Wang
A1 - Jian-zhong Li
J0 - Frontiers of Information Technology & Electronic Engineering
SP - 1499
EP - 1510
%@ 2095-9184
Y1 - in press
PB - Zhejiang University Press & Springer
ER -
doi="https://doi.org/10.1631/FITEE.1601347"

Abstract
Chinese Summary
Academic Network
Reviewer Comment

Abstract: string similarity join (SSJ) is essential for many applications where near-duplicate objects need to be found. This paper targets SSJ with edit distance constraints. The existing algorithms usually adopt the filter-and-refine framework. They cannot catch the dissimilarity between string subsets, and do not fully exploit the statistics such as the frequencies of characters. We investigate to develop a partition-based algorithm by using such statistics. The frequency vectors are used to partition datasets into data chunks with dissimilarity between them being caught easily. A novel algorithm is designed to accelerate SSJ via the partitioned data. A new filter is proposed to leverage the statistics to avoid computing edit distances for a noticeable proportion of candidate pairs which survive the existing filters. Our algorithm outperforms alternative methods notably on real datasets.

频率连接：基于数据划分的一种高效字符串相似性连接算法

概要：字符串相似性连接（string similarity join, SSJ）在很多应用中，特别是在需要找出重复对象的应用中发挥着关键作用。本文关注基于编辑距离的字符串相似性连接。现有算法大多采用先过滤再细化的框架，使得它们很难发现和利用字符串子集间的不相似性，也很难利用如字符频率这样的统计信息。本研究提出了一种基于数据划分的字符串相似性连接算法，它充分利用了这种统计信息。采用频率向量将字符串集划分成一系列较小的子集，使得子集之间的不相似性很容易被发现。本文提出的新算法利用划分后的数据高效地对字符串进行相似性。此外，本文还给出了一个新的过滤器，它能利用字符频率来过滤很多能够通过现有过滤器的不相似字符串。真实数据集上的试验表明，本文提出的算法性能较现有算法有较大幅度的提升。

关键词组：字符串相似性连接；编辑距离；过滤再细化；数据划分；组合频率向量

Darkslateblue:Affiliate; Royal Blue:Author; Turquoise:Article

Reference

[1]Afrati, F.N., Sarma, A.D., Menestrina, D., et al., 2012. Fuzzy joins using MapReduce. Int. Conf. on Data Engineering, p.498-509.

[2]Arasu, A., Ganti, V., Kaushik, R., 2006. Efficient exact set-similarity joins. Int. Conf. on Very Large Data Bases, p.918-929.

[3]Bayardo, R.J., Ma, Y., Srikant, R., 2007. Scaling up all pairs similarity search. Int. World Wide Web Conf., p.131-140.

[4]Chaudhuri, S., Ganjam, K., Ganti, V., et al., 2003. Robust and efficient fuzzy match for online data cleaning. Int. SIGMOD Conf. on Management of Data, p.313-324.

[5]Chaudhuri, S., Ganti, V., Kaushik, R., 2006a. Data debugger: an operator-centric approach for data quality solutions. IEEE Data Eng. Bull., 29(2):60-66.

[6]Chaudhuri, S., Ganti, V., Kaushik, R., 2006b. A primitive operator for similarity joins in data cleaning. Int. Conf. on Data Engineering, p.687-698.

[7]Dong, X., Halevy, A.Y., Yu, C., 2007. Data integration with uncertainty. Int. Conf. on Very Large Data Bases, p.687-698.

[8]Feng, J., Wang, J., Li, G., 2012. Trie-join: a trie-based method for efficient string similarity joins. VLDB J., 21(4):437-461.

[9]Ge, T., Li, Z., 2011. Approximate substring matching over uncertain strings. Proc. VLDB Endow., {bf 4}(11):772-782.

[10]Gravano, L., Ipeirotis, P.G., Jagadish, H.V., et al., 2001. Approximate string joins in a database (almost) for free. Int. Conf. on Very Large Data Bases, p.491-500.

[11]Hadjieleftheriou, M., Srivastava, D., 2010. Weighted Set-Based String Similarity. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering. AT&T Lab-Research.

[12]Henzinger, M.R., 2006. Finding near-duplicate web pages: a large-scale evaluation of algorithms. Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, p.284-291.

[13]Ji, S., Li, G., Li, C., et al., 2009. Efficient interactive fuzzy keyword search. Int. World Wide Web Conf., p.371-380.

[14]Li, C., Lu, J., Lu, Y., 2008. Efficient merging and filtering algorithms for approximate string searches. Int. Conf. on Data Engineering, p.257-266.

[15]Li, G., Deng, D., Wang, J., et al., 2011. Pass-Join: a partition-based method for similarity joins. Proc. VLDB Endow., {bf 5}(3):253-264.

[16]Metwally, A., Agrawal, D., Abbadi, A.E., 2007. Detectives: detecting coalition hit inflation attacks in advertising networks streams. Int. World Wide Web Conf., p.241-250.

[17]Navarro, G., Salmela, L., 2009. Indexing variable length substrings for exact and approximate matching. Int. Symp. on String Processing and Information Retrieval, p.214-221.

[18]Qin, J., Wang, W., Xiao, C., et al., 2013. Vchunkjoin: an efficient algorithm for edit similarity joins. Trans. Knowl. Dat. Eng., {bf 25}(8):1916-1929.

[19]Sarawagi, S., Kirpal, A., 2004. Efficient set joins on similarity predicates. Int. SIGMOD Conf. on Management of Data, p.743-754.

[20]Scott, D.W., 1979. On optimal and data-based histograms. Biometrika, 66:605-610.

[21]Vernica, R., Carey, M.J., Li, C., 2010. Efficient parallel set-similarity joins using MapReduce. Int. SIGMOD Conf. on Management of Data, p.495-506.

[22]Wang, J., Li, G., Feng, J., 2011. Fast-join: an efficient method for fuzzy token matching based string similarity join. Int. Conf. on Data Engineering, p.458-469.

[23]Wang, W., Xiao, C., Lin, X., et al., 2009. Efficient approximate entity extraction with edit distance constraints. Int. SIGMOD Conf. on Management of Data, p.759-770.

[24]Xiao, C., Wang, W., Lin, X., 2008a. Ed-Join: an efficient algorithm for similarity joins with edit distance constraints. Proc. VLDB Endow., {bf 1}(1):933-944.

[25]Xiao, C., Wang, W., Lin, X., et al., 2008b. Efficient similarity joins for near duplicate detection. Int. World Wide Web Conf., p.131-140.

Open peer comments: Debate/Discuss/Question/Opinion

<1>

- Go to

频率连接：基于数据划分的一种高效字符串相似性连接算法

Darkslateblue:Affiliate; Royal Blue:Author; Turquoise:Article

Reference