
CLC number: TP391.3

On-line Access: 2017-12-04

Received: 2016-06-17

Revision Accepted: 2016-12-12

Crosschecked: 2017-11-01


ORCID: Lei-lei Kong, http://orcid.org/0000-0002-4636-3507


Frontiers of Information Technology & Electronic Engineering  2017 Vol.18 No.10 P.1556-1572

http://doi.org/10.1631/FITEE.1601344


A machine learning approach to query generation in plagiarism source retrieval


Author(s):  Lei-lei Kong, Zhi-mao Lu, Hao-liang Qi, Zhong-yuan Han

Affiliation(s):  College of Information and Communication Engineering, Harbin Engineering University, Harbin 150001, China

Corresponding email(s):   kongleilei1979@gmail.com, haoliang.qi@gmail.com

Key Words:  Plagiarism detection, Source retrieval, Query generation, Machine learning, Learning to rank


Lei-lei Kong, Zhi-mao Lu, Hao-liang Qi, Zhong-yuan Han. A machine learning approach to query generation in plagiarism source retrieval[J]. Frontiers of Information Technology & Electronic Engineering, 2017, 18(10): 1556-1572.



Abstract: 
Plagiarism source retrieval is the core task of plagiarism detection. Using queries extracted from a suspicious document to retrieve the plagiarism sources has become the standard approach, and generating queries from a suspicious document is one of the most important steps in source retrieval. Current research relies mainly on heuristic query generation methods. Each heuristic has its own advantages, and none statistically outperforms the others on all suspicious document segments. Further improvements to heuristic methods therefore depend mainly on expert experience, which makes it difficult to devise new heuristics that overcome the shortcomings of existing ones. This paper paves the way for a statistical machine learning approach that selects the best queries from a set of candidates. We formulate query generation for source retrieval as a ranking problem, aiming to achieve the optimal source retrieval performance for each suspicious document segment, and exploit learning to rank to select queries from the candidates. To our knowledge, this is the first work to apply machine learning methods to the problem of query generation for source retrieval. To address the absence of training data for learning to rank, we also describe how training samples for source retrieval are constructed. We rigorously evaluate various aspects of the proposed method on the publicly available PAN source retrieval corpus. The experimental results show that our machine-learning-based query generation method yields statistically significant improvements over the established baselines in source retrieval effectiveness.
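The core idea of ranking candidate queries can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the three-dimensional feature vectors and the helper names `train_pairwise_ranker` and `select_query` are assumptions, and a simple pairwise perceptron stands in for a full learning-to-rank model such as Ranking SVM. Training pairs (better query, worse query) play the role of the constructed training samples described in the abstract.

```python
import numpy as np

def train_pairwise_ranker(pairs, n_features, epochs=50):
    """Learn a linear scoring model from (better, worse) feature-vector pairs
    with a pairwise perceptron: whenever a pair is mis-ordered, nudge the
    weights toward the difference vector (better - worse)."""
    w = np.zeros(n_features)
    for _ in range(epochs):
        updated = False
        for better, worse in pairs:
            diff = np.asarray(better) - np.asarray(worse)
            if w @ diff <= 0:          # mis-ordered pair: update the weights
                w += diff
                updated = True
        if not updated:                # every training pair correctly ordered
            break
    return w

def select_query(candidates, w):
    """Return the candidate query whose feature vector scores highest."""
    return max(candidates, key=lambda c: w @ np.asarray(c["features"]))

# Hypothetical per-query features, e.g.
# [mean term IDF, query length / 10, fraction of content words].
pairs = [
    ([0.9, 0.5, 0.8], [0.2, 0.9, 0.1]),   # specific query beat a vague one
    ([0.7, 0.4, 0.9], [0.3, 0.8, 0.2]),
]
w = train_pairwise_ranker(pairs, n_features=3)

# Candidate queries generated from one suspicious document segment.
candidates = [
    {"query": "obscure terms from segment", "features": [0.8, 0.4, 0.9]},
    {"query": "common words only",          "features": [0.2, 0.9, 0.1]},
]
best = select_query(candidates, w)
print(best["query"])
```

In a real system the learned scorer would rank the candidate queries produced by the competing heuristics for each segment, and only the top-ranked queries would be submitted to the retrieval engine.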

A machine learning approach to query generation in plagiarism source retrieval (Chinese abstract, translated)

Abstract: Plagiarism source retrieval is the core task of plagiarism detection. Using queries extracted from a suspicious document to retrieve plagiarism sources has become the standard approach, and query generation is the most important step in source retrieval. Current research mainly uses heuristic query generation methods. However, each heuristic has its own advantages: queries generated by different methods yield different retrieval results, and no single method's queries statistically outperform the others on all text segments. Improvements to heuristic query generation therefore depend mainly on expert experience, which makes it difficult to develop a new method that overcomes the shortcomings of the existing heuristics. This paper proposes a statistical machine learning approach to query generation for source retrieval, formulating it within a learning-to-rank framework that selects, from the candidate queries, those that improve retrieval performance, striving for the optimal source retrieval performance on each suspicious document segment. To our knowledge, this is the first work to apply machine learning to query generation for source retrieval. To address the lack of training examples for learning to rank, a method for building a query generation corpus from an existing source retrieval corpus is proposed. Experimental results on the PAN source retrieval evaluation data show that the method statistically significantly outperforms several baseline methods.

Keywords: Plagiarism detection; Source retrieval; Query generation; Machine learning; Learning to rank




Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou 310027, China
Tel: +86-571-87952783; E-mail: cjzhang@zju.edu.cn
Copyright © 2000 - Journal of Zhejiang University-SCIENCE