CLC number: TP391.4

On-line Access: 2012-09-05

Received: 2011-12-19

Revision Accepted: 2012-06-25

Crosschecked: 2012-08-03


Journal of Zhejiang University SCIENCE C 2012 Vol.13 No.9 P.649-659


Short text classification based on strong feature thesaurus

Author(s):  Bing-kun Wang, Yong-feng Huang, Wan-xia Yang, Xing Li

Affiliation(s):  Information Cognitive and Intelligent System Research Institute, Department of Electronic Engineering, Tsinghua University, Beijing 100084, China

Corresponding email(s):   Wangbingkun77@yahoo.com.cn, wbk10@mails.tsinghua.edu.cn

Key Words:  Short text, Classification, Data sparseness, Semantic, Strong feature thesaurus (SFT), Latent Dirichlet allocation (LDA)

Bing-kun Wang, Yong-feng Huang, Wan-xia Yang, Xing Li. Short text classification based on strong feature thesaurus[J]. Journal of Zhejiang University Science C, 2012, 13(9): 649-659.

Publisher: Zhejiang University Press & Springer

ISSN: 1869-1951

DOI: 10.1631/jzus.C1100373

Abstract:  Data sparseness, the defining characteristic of short text, has long been regarded as the main cause of low accuracy when short texts are classified with statistical methods, and it has been studied intensively over the past decade. Most of that work, however, overlooks a second cause: ignoring the semantic importance of certain feature terms may also lower classification accuracy. In this paper we present a method that addresses this problem by building a strong feature thesaurus (SFT) based on latent Dirichlet allocation (LDA) and information gain (IG) models. Giving larger weights to the feature terms in the SFT improves classification accuracy, and the method appears to be more effective as the classification becomes more fine-grained. Experiments on two short-text datasets show that our approach improves on state-of-the-art methods, including support vector machine (SVM) and multinomial naïve Bayes.
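
The abstract does not spell out how the SFT is built or how large the extra weights are, so the following is only a minimal sketch of the general idea, not the authors' procedure. It ranks terms by information gain alone (the LDA step is omitted), treats the top-ranked terms as a stand-in for a strong feature set, and multiplies their counts by a boost factor. The function names, the toy corpus, and the `boost=2.0` value are all assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """IG of the binary feature 'term occurs in the document' w.r.t. the labels."""
    n = len(docs)
    with_term = Counter(l for d, l in zip(docs, labels) if term in d)
    without_term = Counter(l for d, l in zip(docs, labels) if term not in d)
    h_class = entropy(list(Counter(labels).values()))
    h_cond = (
        sum(with_term.values()) / n * entropy(list(with_term.values()))
        + sum(without_term.values()) / n * entropy(list(without_term.values()))
    )
    return h_class - h_cond


def build_strong_features(docs, labels, top_k):
    """Hypothetical stand-in for an SFT: the top_k terms ranked by IG."""
    vocab = sorted({t for d in docs for t in d})
    ranked = sorted(vocab, key=lambda t: (-information_gain(docs, labels, t), t))
    return set(ranked[:top_k])


def weighted_counts(doc, strong, boost=2.0):
    """Term counts with strong features multiplied by an assumed boost factor."""
    counts = Counter(doc)
    return {t: c * (boost if t in strong else 1.0) for t, c in counts.items()}


# Toy corpus, invented for this example only.
docs = [
    ["stock", "market", "rises"],
    ["market", "falls", "today"],
    ["team", "wins", "match"],
    ["match", "ends", "today"],
]
labels = ["finance", "finance", "sports", "sports"]

strong = build_strong_features(docs, labels, top_k=2)
features = weighted_counts(["market", "today", "match"], strong)
```

In this toy corpus, "market" and "match" each separate the two classes perfectly and so rank highest, while "today" occurs equally in both classes and keeps its unboosted weight. The resulting weighted vectors could then be passed to any standard classifier.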


