CLC number: TP391

On-line Access: 2017-12-04

Received: 2016-04-01

Revision Accepted: 2016-06-30

Crosschecked: 2017-09-22

ORCID: Hou-kui Zhou, http://orcid.org/0000-0001-7915-8684

Frontiers of Information Technology & Electronic Engineering  2017 Vol.18 No.10 P.1511-1524

10.1631/FITEE.1601125


Topic discovery and evolution in scientific literature based on content and citations


Author(s):  Hou-kui Zhou, Hui-min Yu, Roland Hu

Affiliation(s):  College of Information Science & Electronic Engineering, Zhejiang University, Hangzhou 310027, China

Corresponding email(s):   yhm2005@zju.edu.cn

Key Words:  Topic extraction, Topic evolution, Evaluation method


Hou-kui Zhou, Hui-min Yu, Roland Hu. Topic discovery and evolution in scientific literature based on content and citations[J]. Frontiers of Information Technology & Electronic Engineering, 2017, 18(10): 1511-1524.

@article{Zhou2017,
title="Topic discovery and evolution in scientific literature based on content and citations",
author="Hou-kui Zhou and Hui-min Yu and Roland Hu",
journal="Frontiers of Information Technology \& Electronic Engineering",
volume="18",
number="10",
pages="1511--1524",
year="2017",
publisher="Zhejiang University Press \& Springer",
doi="10.1631/FITEE.1601125"
}

%0 Journal Article
%T Topic discovery and evolution in scientific literature based on content and citations
%A Hou-kui Zhou
%A Hui-min Yu
%A Roland Hu
%J Frontiers of Information Technology & Electronic Engineering
%V 18
%N 10
%P 1511-1524
%@ 2095-9184
%D 2017
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.1601125

TY  - JOUR
T1  - Topic discovery and evolution in scientific literature based on content and citations
A1  - Hou-kui Zhou
A1  - Hui-min Yu
A1  - Roland Hu
JO  - Frontiers of Information Technology & Electronic Engineering
VL  - 18
IS  - 10
SP  - 1511
EP  - 1524
SN  - 2095-9184
Y1  - 2017
PB  - Zhejiang University Press & Springer
DO  - 10.1631/FITEE.1601125
ER  - 


Abstract: 
Researchers across the globe have been increasingly interested in the manner in which important research topics evolve over time within the corpus of scientific literature. In a dataset of scientific articles, each document can be considered to comprise both the words of the document itself and its citations of other documents. In this paper, we propose a citation-content-latent Dirichlet allocation (LDA) topic discovery method that accounts for both document citation relations and the content of the document itself via a probabilistic generative model. The citation-content-LDA topic model exploits a two-level topic model that includes the citation information for ‘father’ topics and text information for sub-topics. The model parameters are estimated by a collapsed Gibbs sampling algorithm. We also propose a topic evolution algorithm that runs in two steps: topic segmentation and topic dependency relation calculation. We have tested the proposed citation-content-LDA model and topic evolution algorithm on two online datasets, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) and IEEE Computer Society (CS), to demonstrate that our algorithm effectively discovers important topics and reflects the topic evolution of important research themes. According to our evaluation metrics, citation-content-LDA outperforms both content-LDA and citation-LDA.
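
As a rough illustration of the collapsed Gibbs sampling step mentioned in the abstract, the sketch below implements a sampler for plain, single-level LDA over word tokens only. It is not the authors' citation-content-LDA: the two-level structure with citation-based 'father' topics is omitted, and the function name lda_gibbs, its parameters, and the hyperparameter defaults are assumptions made for this example.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for plain LDA (not the paper's two-level model).

    docs: list of documents, each a list of integer word ids in [0, vocab_size).
    Returns (doc_topic_counts, topic_word_counts, assignments).
    """
    rng = random.Random(seed)
    n_dk = [[0] * num_topics for _ in docs]               # topic counts per document
    n_kw = [[0] * vocab_size for _ in range(num_topics)]  # word counts per topic
    n_k = [0] * num_topics                                # total tokens per topic
    z = []                                                # topic of every token

    # Random initialisation of the topic assignments.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the current token from all counts.
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # p(z = t | everything else) is proportional to
                # (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta).
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta)
                    / (n_k[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                # Add the token back under its newly sampled topic.
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    return n_dk, n_kw, z
```

Calling lda_gibbs([[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]], num_topics=2, vocab_size=4) returns the per-document topic counts, the per-topic word counts, and the final token assignments. Normalising the smoothed counts gives point estimates of the document-topic and topic-word distributions.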
