CLC number: TN915.08
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
SU Gui-yang, LI Jian-hua, MA Ying-hua, LI Sheng-hong. Improving the precision of the keyword-matching pornographic text filtering method using a hybrid model[J]. Journal of Zhejiang University Science A, 2004, 5(9): 1106-1113.
@article{Su2004,
title="Improving the precision of the keyword-matching pornographic text filtering method using a hybrid model",
author="SU Gui-yang and LI Jian-hua and MA Ying-hua and LI Sheng-hong",
journal="Journal of Zhejiang University Science A",
volume="5",
number="9",
pages="1106-1113",
year="2004",
publisher="Zhejiang University Press & Springer",
doi="10.1631/jzus.2004.1106"
}
%0 Journal Article
%T Improving the precision of the keyword-matching pornographic text filtering method using a hybrid model
%A SU Gui-yang
%A LI Jian-hua
%A MA Ying-hua
%A LI Sheng-hong
%J Journal of Zhejiang University SCIENCE A
%V 5
%N 9
%P 1106-1113
%@ 1869-1951
%D 2004
%I Zhejiang University Press & Springer
%DOI 10.1631/jzus.2004.1106
TY - JOUR
T1 - Improving the precision of the keyword-matching pornographic text filtering method using a hybrid model
A1 - SU Gui-yang
A1 - LI Jian-hua
A1 - MA Ying-hua
A1 - LI Sheng-hong
JO - Journal of Zhejiang University Science A
VL - 5
IS - 9
SP - 1106
EP - 1113
SN - 1869-1951
Y1 - 2004
PB - Zhejiang University Press & Springer
DO - 10.1631/jzus.2004.1106
ER -
Abstract: With pornographic material flooding the Internet, keeping users away from such offensive content has become one of the most important research areas in network information security. Several applications that block or filter this information are already in use. Their approaches fall roughly into two kinds: metadata-based and content-based. As distributed technologies develop, content-based filtering will play an increasingly important role in filtering systems. Keyword matching is a content-based method widely used in harmful-text filtering. Experiments evaluating its recall and precision showed that its recall is rather high but its precision is unsatisfactory. Based on these results, a new pornographic text filtering model based on reconfirming is put forward. Experiments showed that the model is practical, loses little recall compared with keyword matching alone, and achieves higher precision.
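The two-stage idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how the reconfirming step works, so a simple keyword-density threshold stands in for it here; the keyword list, the threshold `MIN_DENSITY`, the sample texts, and all function names are hypothetical and are not taken from the paper.

```python
# Hypothetical keyword list and cutoff; the abstract specifies neither.
KEYWORDS = {"adult", "explicit"}
MIN_DENSITY = 0.2  # assumed threshold for the stand-in reconfirming step


def tokenize(text: str) -> list[str]:
    return text.lower().split()


def keyword_match(text: str) -> bool:
    """Stage 1: flag the text if any keyword occurs at all."""
    return any(tok in KEYWORDS for tok in tokenize(text))


def reconfirm(text: str) -> bool:
    """Stage 2 stand-in: confirm only if keywords are a large enough share of tokens."""
    tokens = tokenize(text)
    hits = sum(tok in KEYWORDS for tok in tokens)
    return bool(tokens) and hits / len(tokens) >= MIN_DENSITY


def hybrid_filter(text: str) -> bool:
    """Block a text only when stage 1 flags it and stage 2 confirms the flag."""
    return keyword_match(text) and reconfirm(text)


def precision_recall(predicted: list[bool], actual: list[bool]) -> tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy corpus, invented for illustration: (text, is_harmful)
SAMPLES = [
    ("explicit adult content explicit", True),
    ("adult education classes for the whole community this autumn", False),
    ("weather report for tomorrow", False),
    ("adult explicit", True),
]
TEXTS = [text for text, _ in SAMPLES]
LABELS = [label for _, label in SAMPLES]

stage1_scores = precision_recall([keyword_match(t) for t in TEXTS], LABELS)
hybrid_scores = precision_recall([hybrid_filter(t) for t in TEXTS], LABELS)
print("keyword matching only:", stage1_scores)
print("with reconfirming:    ", hybrid_scores)
```

On this toy corpus the benign "adult education" text is flagged by keyword matching alone and rejected by the second stage, which is the kind of false positive a reconfirming step is meant to remove. A real second stage would more plausibly be a trained text classifier, and it could also reject some harmful texts, which is where the loss of recall mentioned in the abstract would come from.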