CLC number: TP311
On-line Access: 2018-01-11
Received: 2016-06-11
Revision Accepted: 2016-09-14
Crosschecked: 2017-11-26
Qiao Yu, Shu-juan Jiang, Rong-cun Wang, Hong-yang Wang. A feature selection approach based on a similarity measure for software defect prediction[J]. Frontiers of Information Technology & Electronic Engineering, 2017, 18(11): 1744-1753.
@article{Yu2017,
title="A feature selection approach based on a similarity measure for software defect prediction",
author="Qiao Yu and Shu-juan Jiang and Rong-cun Wang and Hong-yang Wang",
journal="Frontiers of Information Technology \& Electronic Engineering",
volume="18",
number="11",
pages="1744-1753",
year="2017",
publisher="Zhejiang University Press \& Springer",
doi="10.1631/FITEE.1601322"
}
%0 Journal Article
%T A feature selection approach based on a similarity measure for software defect prediction
%A Qiao Yu
%A Shu-juan Jiang
%A Rong-cun Wang
%A Hong-yang Wang
%J Frontiers of Information Technology & Electronic Engineering
%V 18
%N 11
%P 1744-1753
%@ 2095-9184
%D 2017
%I Zhejiang University Press & Springer
%R 10.1631/FITEE.1601322
TY  - JOUR
T1  - A feature selection approach based on a similarity measure for software defect prediction
A1  - Qiao Yu
A1  - Shu-juan Jiang
A1  - Rong-cun Wang
A1  - Hong-yang Wang
JO  - Frontiers of Information Technology & Electronic Engineering
VL  - 18
IS  - 11
SP  - 1744
EP  - 1753
SN  - 2095-9184
Y1  - 2017
PB  - Zhejiang University Press & Springer
DO  - 10.1631/FITEE.1601322
ER  - 
Abstract: Software defect prediction aims to find potential defects based on historical data and software features. Software features reflect the characteristics of software modules. Some of these features may be highly relevant to the class (defective or non-defective), whereas others may be redundant or irrelevant. To fully measure the correlation between different features and the class, we present a feature selection approach based on a similarity measure (SM) for software defect prediction. First, the feature weights are updated according to the similarity of samples in different classes. Second, a feature ranking list is generated by sorting the feature weights in descending order, and feature subsets are selected from the ranking list in sequence. Finally, each feature subset is evaluated on a k-nearest neighbor (KNN) model, with classification performance measured by the area under the curve (AUC) metric. The experiments are conducted on 11 National Aeronautics and Space Administration (NASA) datasets, and the results show that our approach performs better than or comparably to the compared feature selection approaches in terms of classification performance.
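The three steps in the abstract can be illustrated with a small sketch. The abstract does not state the exact similarity-based weight update, so the code below substitutes a Relief-style rule (reward features that differ on the nearest sample of the other class, penalize features that differ on the nearest sample of the same class) purely as a stand-in. All function names, the synthetic data, and the choice k=3 are assumptions made for illustration and are not taken from the paper.

```python
import random

def relief_style_weights(X, y):
    """Assumed stand-in for the paper's similarity-based weight update."""
    n, d = len(X), len(X[0])
    # Per-feature value range, so differences are comparable across features.
    spans = [(max(r[j] for r in X) - min(r[j] for r in X)) or 1.0 for j in range(d)]

    def diff(a, b, j):
        return abs(a[j] - b[j]) / spans[j]

    def dist(a, b):
        return sum(diff(a, b, j) for j in range(d))

    w = [0.0] * d
    for i in range(n):
        same = [k for k in range(n) if k != i and y[k] == y[i]]
        other = [k for k in range(n) if y[k] != y[i]]
        if not same or not other:
            continue
        hit = min(same, key=lambda k: dist(X[i], X[k]))    # nearest same-class sample
        miss = min(other, key=lambda k: dist(X[i], X[k]))  # nearest other-class sample
        for j in range(d):
            w[j] += (diff(X[i], X[miss], j) - diff(X[i], X[hit], j)) / n
    return w

def knn_scores(X_train, y_train, X_test, features, k=3):
    """Score = fraction of the k nearest training samples labeled defective (1)."""
    scores = []
    for q in X_test:
        order = sorted(range(len(X_train)),
                       key=lambda i: sum((X_train[i][j] - q[j]) ** 2 for j in features))
        scores.append(sum(y_train[i] for i in order[:k]) / k)
    return scores

def auc(y_true, scores):
    """AUC as the probability that a defective sample outscores a clean one."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def evaluate_ranked_subsets(X_train, y_train, X_test, y_test, k=3):
    """Rank features by weight, then evaluate the top-1, top-2, ... subsets."""
    w = relief_style_weights(X_train, y_train)
    ranking = sorted(range(len(w)), key=lambda j: w[j], reverse=True)
    results = []
    for m in range(1, len(ranking) + 1):
        subset = ranking[:m]
        scores = knn_scores(X_train, y_train, X_test, subset, k)
        results.append((subset, auc(y_test, scores)))
    return ranking, results

def make_data(n, seed):
    """Synthetic data: feature 0 tracks the label, feature 1 is pure noise."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([2.0 * label + rng.gauss(0, 0.2), rng.random()])
        y.append(label)
    return X, y

X_train, y_train = make_data(40, seed=0)
X_test, y_test = make_data(20, seed=1)
ranking, results = evaluate_ranked_subsets(X_train, y_train, X_test, y_test)
for subset, score in results:
    print(subset, round(score, 3))
```

On this toy data the informative feature should head the ranking, and comparing the AUC of each nested subset shows how a subset size could be chosen. A real study would use the NASA datasets and cross-validation rather than a single synthetic split.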