CLC number: TP391
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2020-06-06
https://orcid.org/0000-0002-9978-3337
Li Deng, Xin Du, Ji-zhong Shen. Web page classification based on heterogeneous features and a combination of multiple classifiers[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.1900240
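Web page classification based on heterogeneous features and a combination of multiple classifiers
College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, China
Abstract: Web page features are the key to web page classification, and discriminative features allow pages to be classified effectively. Structural features of a web page are an effective complement to its text features. Different classifiers have different characteristics, so combining multiple classifiers lets their strengths complement one another. A web page classification algorithm based on heterogeneous features and a combination of classifiers is proposed. Rather than counting the frequency of HTML tags, this paper represents the structural features of a web page by the tree-shaped distribution of its HTML tags, and fuses the heterogeneous text and structural features in vector form. By computing the classification accuracy on a set of samples, the confidence of a classification result is proposed as the criterion for comparing the results of different classifiers. Based on this confidence, the decision strategies of voting, magnitude comparison, and direct output are used to obtain the result of the combined classifier. Experimental results show that accuracy is raised to 94.2%, 95.4%, and 95.7% on the Amazon, 7-web-genres, and DMOZ datasets, respectively. The classification method that fuses text and structural features is more comprehensive and effective than methods that use text features alone. In addition, combining multiple classifiers improves web page classification accuracy, which exceeds that of comparable combined-classifier algorithms for web page classification.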
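The abstract names three confidence-based decision strategies (voting, magnitude comparison, direct output) but does not spell out when each applies. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: confidence is taken to be a classifier's accuracy on a held-out sample set, features are fused by plain vector concatenation, unanimous agreement is output directly, a strict majority wins by vote, and otherwise the label from the most confident classifier is chosen. The function names `fuse`, `estimate_confidence`, and `combine` are hypothetical.

```python
from collections import Counter


def fuse(text_vec, struct_vec):
    """Fuse text and structural features (assumption: simple concatenation)."""
    return list(text_vec) + list(struct_vec)


def estimate_confidence(predictions, truths):
    """Assumed confidence: fraction of held-out samples labelled correctly."""
    correct = sum(p == t for p, t in zip(predictions, truths))
    return correct / len(truths)


def combine(labels, confidences):
    """Combine one predicted label per classifier into a single decision.

    labels[i] is the label from classifier i; confidences[i] is its
    confidence. Returns (label, strategy_used). The ordering of the three
    strategies is an assumption made for illustration.
    """
    top_label, top_votes = Counter(labels).most_common(1)[0]
    if top_votes == len(labels):
        return top_label, "direct"   # all classifiers agree
    if top_votes > len(labels) / 2:
        return top_label, "vote"     # strict majority
    # No majority: compare confidences and trust the most confident classifier.
    best = max(range(len(labels)), key=lambda i: confidences[i])
    return labels[best], "compare"
```

For example, `combine(["news", "shop", "blog"], [0.6, 0.95, 0.7])` finds no majority and falls back to the classifier with confidence 0.95, returning its label `"shop"`.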