CLC number: TP391

On-line Access: 2020-07-10

Received: 2019-05-13

Revision Accepted: 2019-08-21

Crosschecked: 2020-06-06

ORCID:
Li Deng, https://orcid.org/0000-0002-9978-3337
Xin Du, https://orcid.org/0000-0002-6215-9733
Ji-zhong Shen, https://orcid.org/0000-0002-9031-2379

Frontiers of Information Technology & Electronic Engineering  2020 Vol.21 No.7 P.995-1004

10.1631/FITEE.1900240


Web page classification based on heterogeneous features and a combination of multiple classifiers


Author(s):  Li Deng, Xin Du, Ji-zhong Shen

Affiliation(s):  College of Information Science & Electronic Engineering, Zhejiang University, Hangzhou 310027, China

Corresponding email(s):   jzshen@zju.edu.cn

Key Words:  Web page classification, Web page features, Combined classifiers


Li Deng, Xin Du, Ji-zhong Shen. Web page classification based on heterogeneous features and a combination of multiple classifiers[J]. Frontiers of Information Technology & Electronic Engineering, 2020, 21(7): 995-1004.

@article{Deng2020,
title="Web page classification based on heterogeneous features and a combination of multiple classifiers",
author="Li Deng and Xin Du and Ji-zhong Shen",
journal="Frontiers of Information Technology \& Electronic Engineering",
volume="21",
number="7",
pages="995--1004",
year="2020",
publisher="Zhejiang University Press \& Springer",
doi="10.1631/FITEE.1900240"
}

%0 Journal Article
%T Web page classification based on heterogeneous features and a combination of multiple classifiers
%A Li Deng
%A Xin Du
%A Ji-zhong Shen
%J Frontiers of Information Technology & Electronic Engineering
%V 21
%N 7
%P 995-1004
%@ 2095-9184
%D 2020
%I Zhejiang University Press & Springer
%R 10.1631/FITEE.1900240

TY  - JOUR
T1  - Web page classification based on heterogeneous features and a combination of multiple classifiers
A1  - Li Deng
A1  - Xin Du
A1  - Ji-zhong Shen
JO  - Frontiers of Information Technology & Electronic Engineering
VL  - 21
IS  - 7
SP  - 995
EP  - 1004
SN  - 2095-9184
Y1  - 2020
PB  - Zhejiang University Press & Springer
DO  - 10.1631/FITEE.1900240
ER  -


Abstract: 
Web pages can be classified precisely by evaluating their features, and the structural features of a page are an effective complement to its textual features. Different classifiers have different strengths, so combining multiple classifiers lets them complement one another. In this study, we propose a web page classification method based on heterogeneous features and a combination of multiple classifiers. Rather than computing the frequency of HTML tags, we exploit the tree-like structure of HTML tags to characterize the structural features of a web page. Heterogeneous textual features and the proposed tree-like structural features are converted into vectors and fused. Confidence, computed from the classification accuracy on a set of samples, is proposed as a criterion for comparing the classification results of different classifiers. Multiple classifiers are combined based on confidence using different decision strategies, such as voting, confidence comparison, and direct output, to give the final classification results. Experimental results show that the accuracies on the Amazon, 7-web-genres, and DMOZ datasets are increased to 94.2%, 95.4%, and 95.7%, respectively. Fusing the textual features with the proposed structural features gives a more comprehensive representation, and the accuracy is higher than when only textual features are used. In addition, combining multiple classifiers improves the accuracy of web page classification, which exceeds that of related web page classification algorithms.
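
The abstract names three decision strategies (direct output, voting, and confidence comparison) but does not spell out how they are applied. The following Python sketch shows one plausible reading and should not be taken as the authors' exact procedure. It assumes that a classifier's confidence in a label is the fraction of held-out samples it assigned to that label that truly carry it, and that the strategies are tried in the order unanimous agreement, strict majority, highest confidence. The function names `class_confidence` and `combine` are hypothetical.

```python
from collections import Counter, defaultdict


def class_confidence(y_true, y_pred):
    """Assumed definition: for each predicted label, the share of held-out
    samples given that label whose true label matches it."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[pred] += 1
        if truth == pred:
            hits[pred] += 1
    return {label: hits[label] / totals[label] for label in totals}


def combine(predictions, confidences):
    """Combine one prediction per classifier into a final label.

    predictions: list with one label per classifier.
    confidences: list with one {label: confidence} dict per classifier.
    Returns (label, name of the strategy that decided it).
    """
    label, votes = Counter(predictions).most_common(1)[0]
    if votes == len(predictions):
        # Every classifier agrees, so the shared label is passed through.
        return label, "direct output"
    if votes > len(predictions) / 2:
        # A strict majority agrees.
        return label, "voting"
    # No majority: trust the classifier most confident in its own label.
    best = max(
        range(len(predictions)),
        key=lambda i: confidences[i].get(predictions[i], 0.0),
    )
    return predictions[best], "confidence comparison"
```

In use, `class_confidence` would be run once per classifier on a labeled validation set, and `combine` would then be called for each new page with the classifiers' individual predictions.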

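Web page classification based on heterogeneous features and combined classifiers

Li Deng, Xin Du, Ji-zhong Shen
College of Information Science & Electronic Engineering, Zhejiang University, Hangzhou 310027, China

Abstract (translated from the Chinese): Web page features are the key to web page classification, and discriminative features make it possible to classify web pages effectively. Structural features of web pages are an effective complement to textual features. Different classifiers have different characteristics, and combining multiple classifiers lets their performance complement one another. A web page classification algorithm based on heterogeneous features and combined classifiers is proposed. Rather than computing the frequency of HTML tags, this paper uses the tree-like distribution of HTML tags to represent the structural features of a web page, and fuses the heterogeneous textual and structural features in vector form. The confidence of a classification result, obtained by calculating the classification accuracy on a set of samples, is proposed as the criterion for comparing the results of different classifiers. Based on confidence, the decision strategies of voting, magnitude comparison, and direct output are used to obtain the classification result of the combined classifier. Experimental results show that on the Amazon, 7-web-genres, and DMOZ datasets the accuracy is increased to 94.2%, 95.4%, and 95.7%, respectively. The classification method that fuses textual and structural features is more comprehensive and effective than one that uses only textual features. Moreover, combining multiple classifiers improves web page classification accuracy, which is higher than that of comparable combined-classifier web page classification algorithms.

Key words: Web page classification; Web page features; Classifier combination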
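
The difference between counting HTML tags and using their tree-like structure can be illustrated with a small Python sketch. It is only an illustration under stated assumptions, since the abstract does not give the exact encoding. Here each tag is counted under its root-to-node path, so a `p` inside a `div` is distinguished from a `p` directly under `body`. The class `TagPathCounter`, the helpers `tag_paths` and `structural_vector`, and the path vocabulary are all hypothetical names.

```python
from collections import Counter
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCounter(HTMLParser):
    """Counts each tag under its full path from the document root."""

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open tags, root first
        self.paths = Counter()   # "html/body/div/p" -> occurrences

    def handle_starttag(self, tag, attrs):
        self.paths["/".join(self.stack + [tag])] += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: count it without opening a level.
        self.paths["/".join(self.stack + [tag])] += 1

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return a Counter of root-to-node tag paths for an HTML string."""
    parser = TagPathCounter()
    parser.feed(html)
    parser.close()
    return parser.paths


def structural_vector(html, vocabulary):
    """Map a page to a fixed-length vector of counts over known tag paths."""
    paths = tag_paths(html)
    return [paths[path] for path in vocabulary]
```

A flat tag count would report three `p` tags for a page with two paragraphs inside a `div` and one directly under `body`; the path counts keep the two cases apart. One simple way to fuse the result with a textual feature vector, assumed here rather than taken from the paper, is to concatenate the two lists.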

Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou 310027, China
Tel: +86-571-87952783; E-mail: cjzhang@zju.edu.cn
Copyright © 2000 - Journal of Zhejiang University-SCIENCE