CLC number: TP391

On-line Access: 2026-01-09

Received: 2024-10-22

Revision Accepted: 2025-09-28

Crosschecked: 2026-01-11

ORCID: Jingfa LIU, https://orcid.org/0000-0002-0407-1522


Frontiers of Information Technology & Electronic Engineering  2025 Vol.26 No.12 P.2569-2582

http://doi.org/10.1631/FITEE.2400939


A focused crawling strategy based on comprehensive priority evaluation of hyperlinks and improved Bayesian classifier


Author(s):  Jingfa LIU, Yongchuang WU, Zhaoxia LIU

Affiliation(s):  School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou 510006, China

Corresponding email(s):   jfliu@gdufs.edu.cn, wu112002@outlook.com, 554822022@qq.com

Key Words:  Focused crawler (FC), Bayesian classifier, Information retrieval, Priority evaluation


Jingfa LIU, Yongchuang WU, Zhaoxia LIU. A focused crawling strategy based on comprehensive priority evaluation of hyperlinks and improved Bayesian classifier[J]. Frontiers of Information Technology & Electronic Engineering, 2025, 26(12): 2569-2582.

@article{FITEE.2400939,
title="A focused crawling strategy based on comprehensive priority evaluation of hyperlinks and improved Bayesian classifier",
author="Jingfa LIU and Yongchuang WU and Zhaoxia LIU",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="26",
number="12",
pages="2569--2582",
year="2025",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.2400939"
}

%0 Journal Article
%T A focused crawling strategy based on comprehensive priority evaluation of hyperlinks and improved Bayesian classifier
%A Jingfa LIU
%A Yongchuang WU
%A Zhaoxia LIU
%J Frontiers of Information Technology & Electronic Engineering
%V 26
%N 12
%P 2569-2582
%@ 2095-9184
%D 2025
%I Zhejiang University Press & Springer
%R 10.1631/FITEE.2400939

TY  - JOUR
T1  - A focused crawling strategy based on comprehensive priority evaluation of hyperlinks and improved Bayesian classifier
A1  - Jingfa LIU
A1  - Yongchuang WU
A1  - Zhaoxia LIU
JO  - Frontiers of Information Technology & Electronic Engineering
VL  - 26
IS  - 12
SP  - 2569
EP  - 2582
SN  - 2095-9184
Y1  - 2025
PB  - Zhejiang University Press & Springer
DO  - 10.1631/FITEE.2400939
ER  -


Abstract: 
Avoiding topic drift and crossing tunnels are two main difficulties in focused crawling. To overcome topic drift, we design a comprehensive priority evaluation (CPE) method based on the web text, anchor text, and context of hyperlinks, which improves the topic-relevance evaluation of unvisited hyperlinks. We then propose an improved Bayesian classifier with weights (BCW), which adds label weights to the feature words of the Bayesian classifier to enhance the accuracy of webpage classification. To cross tunnels, through which topic-relevant webpages can be reached from low-relevance webpages, we construct a content block segmentation (CBS) technique for webpages based on the backtracking method; it segments a webpage into multiple blocks, judges the relevance of every content block, and extracts hyperlinks with high comprehensive relevance. Finally, we propose a BCW-based focused crawling strategy combining the CPE and CBS strategies (BCW_CC) and evaluate it experimentally on focused crawling in two domains: rainstorm disasters and sports. The results demonstrate the effectiveness of the developed BCW_CC method.
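To make the idea of scoring unvisited hyperlinks from several text sources concrete, the following is a minimal Python sketch. It is not the paper's CPE formula: the abstract does not give one, so the bag-of-words cosine similarity, the linear weighting, the default weights, and the names `link_priority` and `Frontier` are all illustrative assumptions.

```python
import heapq
import math
from collections import Counter


def bag(text: str) -> Counter:
    """Lowercased whitespace tokens as a term-frequency vector (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def link_priority(topic, page_text, anchor_text, context_text,
                  weights=(0.4, 0.4, 0.2)):
    """Hypothetical combined priority: weighted sum of the topic similarity of
    the parent page text, the anchor text, and the text surrounding the link.
    The weights are placeholders, not values from the paper."""
    w_page, w_anchor, w_ctx = weights
    t = bag(topic)
    return (w_page * cosine(t, bag(page_text))
            + w_anchor * cosine(t, bag(anchor_text))
            + w_ctx * cosine(t, bag(context_text)))


class Frontier:
    """Crawl frontier that returns the highest-priority unvisited URL first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url: str, priority: float) -> None:
        # Ignore URLs that were already queued; heapq is a min-heap, so negate.
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (-priority, url))

    def pop(self):
        neg_priority, url = heapq.heappop(self._heap)
        return url, -neg_priority
```

In such a sketch, a crawler would call `link_priority` for every hyperlink extracted from a fetched page and push the result into the `Frontier`, so that links whose page, anchor, and context text all resemble the topic are fetched before weakly related ones.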


Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou 310027, China
Tel: +86-571-87952783; E-mail: cjzhang@zju.edu.cn
Copyright © 2000 - 2026 Journal of Zhejiang University-SCIENCE