CLC number: TP39
On-line Access: 2023-07-03
Received: 2022-07-22
Revision Accepted: 2023-01-06
Crosschecked: 2023-07-03
https://orcid.org/0000-0002-0407-1522
https://orcid.org/0000-0003-4940-2812
Jingfa LIU, Zhen WANG, Guo ZHONG, Zhihe YANG. A new focused crawler using an improved tabu search algorithm incorporating ontology and host information[J]. Frontiers of Information Technology & Electronic Engineering, 2023, 24(6): 859-875.
@article{title="A new focused crawler using an improved tabu search algorithm incorporating ontology and host information",
author="Jingfa LIU, Zhen WANG, Guo ZHONG, Zhihe YANG",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="24",
number="6",
pages="859-875",
year="2023",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.2200315"
}
%0 Journal Article
%T A new focused crawler using an improved tabu search algorithm incorporating ontology and host information
%A Jingfa LIU
%A Zhen WANG
%A Guo ZHONG
%A Zhihe YANG
%J Frontiers of Information Technology & Electronic Engineering
%V 24
%N 6
%P 859-875
%@ 2095-9184
%D 2023
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.2200315
TY - JOUR
T1 - A new focused crawler using an improved tabu search algorithm incorporating ontology and host information
A1 - Jingfa LIU
A1 - Zhen WANG
A1 - Guo ZHONG
A1 - Zhihe YANG
JO - Frontiers of Information Technology & Electronic Engineering
VL - 24
IS - 6
SP - 859
EP - 875
SN - 2095-9184
Y1 - 2023
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.2200315
ER -
Abstract: Traditional focused crawling methods suffer from incomplete topic description and repeated crawling of visited hyperlinks. To address these problems, we propose a novel focused crawler using an improved tabu search algorithm with domain ontology and host information (FCITS_OH), in which a domain ontology is constructed by formal concept analysis to describe topics at the semantic and knowledge levels. To avoid crawling visited hyperlinks and to expand the search range, we present an improved tabu search (ITS) algorithm and a host information memory strategy. In addition, a comprehensive priority evaluation method based on Web text and link structure is designed to improve the assessment of topic relevance for unvisited hyperlinks. Experimental results in both the tourism and rainstorm disaster domains show that the proposed focused crawler outperforms traditional focused crawlers across different performance metrics.
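The abstract names three ingredients (a tabu mechanism against revisiting hyperlinks, a memory of host information, and a priority that combines text and link evidence) without giving their formulas. The Python sketch below is only an illustration of how such pieces could fit together in a crawl frontier; it is not the ITS algorithm of the paper. The class name `TabuFrontier`, the function `combined_priority`, the equal weighting of the two scores, the tabu tenure, and the per-host penalty are all assumptions introduced for the example.

```python
import heapq
from collections import Counter, deque
from urllib.parse import urlparse


def combined_priority(text_score, link_score, weight=0.5):
    """Blend a text-based and a link-based relevance score.

    The linear blend and the default weight are hypothetical choices.
    """
    return weight * text_score + (1.0 - weight) * link_score


class TabuFrontier:
    """Priority frontier with a bounded tabu list and per-host visit counts."""

    def __init__(self, tabu_tenure=1000, host_penalty=0.1):
        self._heap = []                # max-priority queue via negated scores
        self._tabu = deque()           # recently visited URLs, oldest first
        self._tabu_set = set()         # same URLs, for O(1) membership tests
        self._tenure = tabu_tenure     # how many visited URLs stay tabu
        self._host_visits = Counter()  # host -> number of pages visited
        self._host_penalty = host_penalty
        self._counter = 0              # tie-breaker keeps heap order stable

    def push(self, url, text_score, link_score):
        """Queue an unvisited URL; return False if it is currently tabu."""
        if url in self._tabu_set:
            return False
        host = urlparse(url).netloc
        score = combined_priority(text_score, link_score)
        # Assumed use of host memory: demote hosts that were already
        # visited often, so the crawl spreads to other hosts.
        score -= self._host_penalty * self._host_visits[host]
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1
        return True

    def pop(self):
        """Return the best non-tabu URL and mark it visited, or None."""
        while self._heap:
            _, _, url = heapq.heappop(self._heap)
            if url in self._tabu_set:
                continue  # became tabu after it was queued
            self._mark_visited(url)
            return url
        return None

    def _mark_visited(self, url):
        self._tabu.append(url)
        self._tabu_set.add(url)
        self._host_visits[urlparse(url).netloc] += 1
        if len(self._tabu) > self._tenure:
            # The oldest entry leaves the tabu list once the tenure is exceeded.
            self._tabu_set.discard(self._tabu.popleft())
```

In this sketch a URL stays tabu only while it is among the `tabu_tenure` most recently visited URLs, so a crawler that must never revisit a page would also need a permanent visited set.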