CLC number: TP391
On-line Access: 2017-12-04
Received: 2016-04-01
Revision Accepted: 2016-06-30
Crosschecked: 2017-09-22
Hou-kui Zhou, Hui-min Yu, Roland Hu. Topic discovery and evolution in scientific literature based on content and citations[J]. Frontiers of Information Technology & Electronic Engineering, 2017, 18(10): 1511-1524.
@article{title="Topic discovery and evolution in scientific literature based on content and citations",
author="Hou-kui Zhou and Hui-min Yu and Roland Hu",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="18",
number="10",
pages="1511-1524",
year="2017",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.1601125"
}
%0 Journal Article
%T Topic discovery and evolution in scientific literature based on content and citations
%A Hou-kui Zhou
%A Hui-min Yu
%A Roland Hu
%J Frontiers of Information Technology & Electronic Engineering
%V 18
%N 10
%P 1511-1524
%@ 2095-9184
%D 2017
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.1601125
TY - JOUR
T1 - Topic discovery and evolution in scientific literature based on content and citations
A1 - Hou-kui Zhou
A1 - Hui-min Yu
A1 - Roland Hu
JO - Frontiers of Information Technology & Electronic Engineering
VL - 18
IS - 10
SP - 1511
EP - 1524
SN - 2095-9184
Y1 - 2017
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.1601125
ER -
Abstract: Researchers across the globe have been increasingly interested in the manner in which important research topics evolve over time within the corpus of scientific literature. In a dataset of scientific articles, each document can be considered to comprise both the words of the document itself and its citations of other documents. In this paper, we propose a citation-content-latent Dirichlet allocation (LDA) topic discovery method that accounts for both document citation relations and the content of the document itself via a probabilistic generative model. The citation-content-LDA topic model exploits a two-level topic model that includes the citation information for ‘father’ topics and text information for sub-topics. The model parameters are estimated by a collapsed Gibbs sampling algorithm. We also propose a topic evolution algorithm that runs in two steps: topic segmentation and topic dependency relation calculation. We have tested the proposed citation-content-LDA model and topic evolution algorithm on two online datasets, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) and IEEE Computer Society (CS), to demonstrate that our algorithm effectively discovers important topics and reflects the topic evolution of important research themes. According to our evaluation metrics, citation-content-LDA outperforms both content-LDA and citation-LDA.
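The abstract states that the model parameters are estimated by collapsed Gibbs sampling. As a rough illustration of that inference technique only, the sketch below runs a collapsed Gibbs sampler for plain single-level LDA over word tokens. It is not the paper's two-level citation-content-LDA model. The function name `lda_collapsed_gibbs`, the hyperparameter values, and the toy corpus are all assumptions made for this example.

```python
import random


def lda_collapsed_gibbs(docs, num_topics, vocab_size,
                        alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampler for standard LDA (illustrative sketch).

    docs: list of documents, each a list of integer word ids in [0, vocab_size).
    Returns (doc_topic, topic_word) count tables after `iters` sweeps.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]               # n_{d,k}
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # n_{k,w}
    topic_total = [0] * num_topics                              # n_k
    assign = []                                                 # topic of each token

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1

                # Full conditional p(z = t | all other assignments), unnormalized.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]

                # Sample a new topic and restore the counts.
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    return doc_topic, topic_word


# Toy corpus (hypothetical): word ids 0-2 and 3-5 tend to co-occur.
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 1], [5, 3, 4, 4]]
doc_topic, topic_word = lda_collapsed_gibbs(docs, num_topics=2, vocab_size=6)
```

The latent topic proportions are integrated out, so the sampler only needs the three count tables. A two-level model such as the one described in the abstract would presumably need additional count tables and a sampling step for the citation level, which this sketch does not attempt.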