CLC number: O212.8; H03

Received: 2008-11-15

Revision Accepted: 2009-04-10

Crosschecked: 2009-04-29

Journal of Zhejiang University SCIENCE A 2009 Vol.10 No.6 P.858~867

http://doi.org/10.1631/jzus.A0820796


Hierarchical topic modeling with nested hierarchical Dirichlet process


Author(s):  Yi-qun DING, Shan-ping LI, Zhen ZHANG, Bin SHEN

Affiliation(s):  School of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China

Corresponding email(s):   shan@zju.edu.cn

Key Words:  Topic modeling, Natural language processing, Chinese restaurant process, Hierarchical Dirichlet process, Markov chain Monte Carlo, Nonparametric Bayesian statistics


Yi-qun DING, Shan-ping LI, Zhen ZHANG, Bin SHEN. Hierarchical topic modeling with nested hierarchical Dirichlet process[J]. Journal of Zhejiang University Science A, 2009, 10(6): 858~867.

@article{ding2009nhdp,
title="Hierarchical topic modeling with nested hierarchical Dirichlet process",
author="Yi-qun DING and Shan-ping LI and Zhen ZHANG and Bin SHEN",
journal="Journal of Zhejiang University Science A",
volume="10",
number="6",
pages="858--867",
year="2009",
publisher="Zhejiang University Press \& Springer",
doi="10.1631/jzus.A0820796"
}

%0 Journal Article
%T Hierarchical topic modeling with nested hierarchical Dirichlet process
%A Yi-qun DING
%A Shan-ping LI
%A Zhen ZHANG
%A Bin SHEN
%J Journal of Zhejiang University SCIENCE A
%V 10
%N 6
%P 858-867
%@ 1673-565X
%D 2009
%I Zhejiang University Press & Springer
%R 10.1631/jzus.A0820796

TY  - JOUR
T1  - Hierarchical topic modeling with nested hierarchical Dirichlet process
A1  - Yi-qun DING
A1  - Shan-ping LI
A1  - Zhen ZHANG
A1  - Bin SHEN
JO  - Journal of Zhejiang University Science A
VL  - 10
IS  - 6
SP  - 858
EP  - 867
SN  - 1673-565X
Y1  - 2009
PB  - Zhejiang University Press & Springer
DO  - 10.1631/jzus.A0820796
ER  - 


Abstract: 
This paper deals with the statistical modeling of latent topic hierarchies in text corpora. The height of the topic tree is assumed to be fixed, while the number of topics on each level is unknown a priori and is inferred from the data. Taking a nonparametric Bayesian approach to this problem, we propose a new probabilistic generative model based on the nested hierarchical Dirichlet process (nHDP) and present a Markov chain Monte Carlo sampling algorithm for inferring the topic tree structure, the word distribution of each topic, and the topic distribution of each document. Our theoretical analysis and experimental results show that this model produces a more compact hierarchical topic structure and captures more fine-grained topic relationships than the hierarchical latent Dirichlet allocation model.
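
To illustrate how a nonparametric prior lets the number of groups be inferred rather than fixed in advance, the following is a minimal sketch of the Chinese restaurant process listed in the key words. It is not the nHDP model or the sampler from the paper; the function name `crp_assignments`, the concentration parameter `alpha` (assumed positive), and the seed are illustrative assumptions only.

```python
import random


def crp_assignments(n_customers, alpha, seed=0):
    """Seat customers one at a time under a Chinese restaurant process.

    Hypothetical helper for illustration; alpha must be positive.
    Returns the table index of each customer and the final table sizes.
    """
    rng = random.Random(seed)
    table_sizes = []   # table_sizes[k] = number of customers at table k
    assignments = []   # assignments[i] = table index chosen by customer i
    for _ in range(n_customers):
        # An existing table k is chosen with weight table_sizes[k];
        # a new table is opened with weight alpha.
        weights = table_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(table_sizes):
            table_sizes.append(1)      # open a new table
        else:
            table_sizes[k] += 1        # join an existing table
        assignments.append(k)
    return assignments, table_sizes


if __name__ == "__main__":
    _, sizes = crp_assignments(n_customers=200, alpha=1.0)
    print("number of tables:", len(sizes))
```

The number of occupied tables is not a parameter of the function; it emerges from the draws, and larger values of `alpha` tend to open more tables. In a topic model, tables play the role of topics or tree nodes.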
