CLC number: TP391
On-line Access: 2018-06-07
Received: 2016-10-26
Revision Accepted: 2017-01-03
Crosschecked: 2018-04-03
Yue-peng Zou, Ji-hong Ouyang, Xi-ming Li. Supervised topic models with weighted words: multi-label document classification[J]. Frontiers of Information Technology & Electronic Engineering, 2018, 19(4): 513-523.
@article{Zou2018,
title="Supervised topic models with weighted words: multi-label document classification",
author="Yue-peng Zou and Ji-hong Ouyang and Xi-ming Li",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="19",
number="4",
pages="513--523",
year="2018",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.1601668"
}
ISSN: 2095-9184
Abstract: Supervised topic modeling algorithms have been successfully applied to multi-label document classification tasks. Representative models include labeled latent Dirichlet allocation (L-LDA) and dependency-LDA. However, these models neglect the class frequency of words (i.e., the number of classes in which a word occurs in the training data), which is significant for classification. To address this, we propose the class frequency weight (CF-weight), a method that weights words according to their class frequency. The CF-weight is based on the intuition that a word with a higher (lower) class frequency is less (more) discriminative. In this study, the CF-weight is used to improve L-LDA and dependency-LDA. Experiments on real-world multi-label datasets demonstrate that CF-weight-based algorithms are competitive with the existing supervised topic models.
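The class frequency idea can be illustrated with a minimal sketch. The abstract does not state the CF-weight formula, so the log-scaled inverse class frequency used here is an assumption chosen only to match the stated intuition (higher class frequency, lower weight). The function names `class_frequency` and `cf_weight`, the toy corpus, and the choice to credit a word to every label of a multi-label document are likewise hypothetical.

```python
import math
from collections import defaultdict


def class_frequency(docs, labels):
    """Count, for each word, the number of distinct classes it occurs in.

    Assumption: a word in a multi-label document is credited to every
    label attached to that document.
    """
    classes_of_word = defaultdict(set)
    for words, doc_labels in zip(docs, labels):
        for word in set(words):
            classes_of_word[word].update(doc_labels)
    return {word: len(classes) for word, classes in classes_of_word.items()}


def cf_weight(cf, num_classes):
    """Map class frequency to a weight that shrinks as class frequency grows.

    Assumed form (not taken from the paper): 1 + log(num_classes / cf).
    A word seen in every class gets weight 1; rarer words get more.
    """
    return {word: 1.0 + math.log(num_classes / count) for word, count in cf.items()}


# Toy training corpus: tokenized documents and their label sets.
docs = [
    ["goal", "match", "team"],
    ["vote", "party", "team"],
    ["stock", "market", "team"],
]
labels = [["sports"], ["politics"], ["business"]]

num_classes = len({label for doc_labels in labels for label in doc_labels})
cf = class_frequency(docs, labels)
weights = cf_weight(cf, num_classes)

for word in ("team", "goal"):
    print(word, cf[word], round(weights[word], 3))
```

In this toy corpus, "team" occurs in all three classes and receives the smallest weight, whereas "goal" occurs in only one class and is weighted more heavily. How such weights enter the sampling or inference of L-LDA and dependency-LDA is described in the paper itself and is not sketched here.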