CLC number: TP391

On-line Access: 2015-11-04

Received: 2015-03-07

Revision Accepted: 2015-08-09

Crosschecked: 2015-10-15




Frontiers of Information Technology & Electronic Engineering, 2015, Vol. 16, No. 11, pp. 940-956


Automatically building large-scale named entity recognition corpora from Chinese Wikipedia

Author(s):  Jie Zhou, Bi-cheng Li, Gang Chen

Affiliation(s):  Department of Signal Analysis and Information Processing, Zhengzhou Information Science and Technology Institute, Zhengzhou 450002, China

Corresponding email(s):   zhoujie.nlp@gmail.com

Key Words:  NER corpora, Chinese Wikipedia, Entity classification, Domain adaptation, Corpus selection

Named entity recognition (NER) is a core component in many natural language processing applications. Most NER systems rely on supervised machine learning methods, which depend on time-consuming and expensive annotations in different languages and domains. This paper presents a method for automatically building silver-standard NER corpora from Chinese Wikipedia. We refine novel and language-dependent features by exploiting the text and structure of Chinese Wikipedia. To reduce tagging errors caused by entity classification, we design four types of heuristic rules based on the characteristics of Chinese Wikipedia and train a supervised NE classifier; the two are then combined to improve precision and coverage. Next, we identify the types of implicit mentions by using the boundary information of outgoing links. By selecting sentences related to the domains of the test data, we can train better NER models. In the experiments, large-scale NER corpora containing 2.3 million sentences are built from Chinese Wikipedia. The results show the effectiveness of the automatically annotated corpora, and the trained NER models achieve the best performance when our silver-standard corpora are combined with gold-standard corpora.
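As a rough illustration of how outgoing links could be turned into silver-standard tags, the sketch below projects anchor texts onto a sentence as character-level BIO labels. It is not the authors' implementation. It assumes each anchor text has already been assigned an entity type by some classifier, and it treats any unlinked occurrence of an anchor string as an implicit mention of the same type. The function name `tag_sentence`, the `link_types` mapping, and the tag set are all hypothetical.

```python
def tag_sentence(sentence, link_types):
    """Return character-level BIO tags for `sentence`.

    `link_types` is a hypothetical mapping from the anchor text of an
    outgoing link to an already-assigned entity type (e.g. "ORG").
    Every occurrence of an anchor string is tagged, linked or not,
    which is an assumed stand-in for implicit-mention handling.
    """
    tags = ["O"] * len(sentence)
    # Longest anchors first, so a shorter anchor nested inside a longer
    # one does not overwrite the longer span.
    for anchor in sorted(link_types, key=len, reverse=True):
        if not anchor:
            continue
        label = link_types[anchor]
        start = sentence.find(anchor)
        while start != -1:
            end = start + len(anchor)
            # Only tag spans that are still completely untagged.
            if all(t == "O" for t in tags[start:end]):
                tags[start] = "B-" + label
                for i in range(start + 1, end):
                    tags[i] = "I-" + label
            start = sentence.find(anchor, start + 1)
    return tags


if __name__ == "__main__":
    example = "北京大学位于北京"
    types = {"北京大学": "ORG", "北京": "LOC"}
    for char, tag in zip(example, tag_sentence(example, types)):
        print(char, tag)
```

Matching longest anchors first is one simple way to respect link boundaries; a real pipeline would also need to handle ambiguous anchor strings, which this sketch ignores.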

Constructing NER corpora is important but labor-intensive. This paper proposes an automatic way to construct NER corpora from Chinese Wikipedia, and the experiments show promising results. The work is meaningful, and the constructed corpora can be used in many Chinese NLP applications. The paper is well written.




