CLC number: TP391

On-line Access: 2018-06-07

Received: 2016-10-26

Revision Accepted: 2017-01-03

Crosschecked: 2018-04-03

ORCID: Xi-ming Li, http://orcid.org/0000-0001-8190-5087

Frontiers of Information Technology & Electronic Engineering  2018 Vol.19 No.4 P.513-523

http://doi.org/10.1631/FITEE.1601668


Supervised topic models with weighted words: multi-label document classification


Author(s):  Yue-peng Zou, Ji-hong Ouyang, Xi-ming Li

Affiliation(s):  College of Computer Science and Technology, Jilin University, Changchun 130012, China

Corresponding email(s):   ouyj@jlu.edu.cn, liximing86@gmail.com

Key Words:  Supervised topic model, Multi-label classification, Class frequency, Labeled latent Dirichlet allocation (L-LDA), Dependency-LDA


Yue-peng Zou, Ji-hong Ouyang, Xi-ming Li. Supervised topic models with weighted words: multi-label document classification[J]. Frontiers of Information Technology & Electronic Engineering, 2018, 19(4): 513-523.

@article{Zou2018,
title="Supervised topic models with weighted words: multi-label document classification",
author="Yue-peng Zou and Ji-hong Ouyang and Xi-ming Li",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="19",
number="4",
pages="513--523",
year="2018",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.1601668"
}

%0 Journal Article
%T Supervised topic models with weighted words: multi-label document classification
%A Yue-peng Zou
%A Ji-hong Ouyang
%A Xi-ming Li
%J Frontiers of Information Technology & Electronic Engineering
%V 19
%N 4
%P 513-523
%@ 2095-9184
%D 2018
%I Zhejiang University Press & Springer
%R 10.1631/FITEE.1601668

TY  - JOUR
T1  - Supervised topic models with weighted words: multi-label document classification
A1  - Yue-peng Zou
A1  - Ji-hong Ouyang
A1  - Xi-ming Li
JO  - Frontiers of Information Technology & Electronic Engineering
VL  - 19
IS  - 4
SP  - 513
EP  - 523
SN  - 2095-9184
Y1  - 2018
PB  - Zhejiang University Press & Springer
DO  - 10.1631/FITEE.1601668
ER  -


Abstract: Supervised topic modeling algorithms have been successfully applied to multi-label document classification tasks. Representative models include labeled latent Dirichlet allocation (L-LDA) and dependency-LDA. However, these models neglect the class frequency of words (i.e., the number of classes in which a word occurs in the training data), which is significant for classification. To address this, we propose a weighting method, the class frequency weight (CF-weight), which weights each word according to its class frequency. CF-weight is based on the intuition that a word with a higher (lower) class frequency is less (more) discriminative. In this study, CF-weight is used to improve L-LDA and dependency-LDA. Experiments have been conducted on real-world multi-label datasets. The results demonstrate that the CF-weight based algorithms are competitive with existing supervised topic models.
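The class frequency idea can be illustrated with a small sketch. The abstract defines class frequency as the number of classes in which a word occurs in the training data, but it does not give the weighting formula, so the IDF-style weight `log(1 + C / cf(w))` used here is an assumption made purely for illustration, as are the function names, the toy corpus, and the choice to count a word as occurring in every class of a multi-label document. It only reproduces the stated intuition that a higher class frequency should yield a lower weight.

```python
import math
from collections import defaultdict


def class_frequency(docs, labels):
    """Map each word to the number of distinct classes it occurs in.

    Assumption: a word in a multi-label document counts as occurring
    in every class attached to that document.
    """
    classes_of_word = defaultdict(set)
    for tokens, doc_labels in zip(docs, labels):
        for word in set(tokens):
            classes_of_word[word].update(doc_labels)
    return {word: len(classes) for word, classes in classes_of_word.items()}


def cf_weights(docs, labels):
    """Hypothetical IDF-style weight log(1 + C / cf(w)).

    This is NOT the paper's formula; it is a stand-in that decreases
    as the class frequency cf(w) increases.
    """
    num_classes = len({label for doc_labels in labels for label in doc_labels})
    cf = class_frequency(docs, labels)
    return {word: math.log(1.0 + num_classes / n) for word, n in cf.items()}


# Toy multi-label corpus (invented for illustration).
docs = [
    ["model", "topic", "inference"],
    ["model", "goal", "match"],
    ["model", "market", "stock"],
]
labels = [["ml"], ["sports"], ["finance", "ml"]]

cf = class_frequency(docs, labels)
weights = cf_weights(docs, labels)

# Words seen in fewer classes receive larger weights.
for word in sorted(weights, key=weights.get, reverse=True):
    print(f"{word:10s} cf={cf[word]} weight={weights[word]:.3f}")
```

In this toy corpus, "model" appears in all three classes and gets the smallest weight, while words confined to a single class get the largest. How such weights enter the L-LDA or dependency-LDA inference is described in the full paper and is not reproduced here.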

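Chinese title (translated): Word-weighted supervised topic models: multi-label text classification

Chinese abstract (translated): Supervised topic models have been successfully applied to multi-label text classification. Representative models include labeled latent Dirichlet allocation (L-LDA) and dependency-LDA. These existing models ignore how the class frequency of a word, i.e., the number of classes in which the word occurs in the training set, affects classification. To address this, class frequency information is introduced and a class frequency word-weighting method (class frequency weight, CF-weight) is proposed. CF-weight rests on the assumption that a word with a higher (or lower) class frequency has lower (or higher) discriminative power in classification. The CF-weight method is applied to the L-LDA and dependency-LDA models. Experimental results show that, compared with traditional supervised topic models, the CF-weight based models have an advantage in multi-label classification performance.

Chinese key words (translated): Supervised topic model; Multi-label classification; Class frequency; Labeled latent Dirichlet allocation; Dependency-LDA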


