Improving Particle Swarm Optimization with Hybrid Feature Selection in Software Defect Prediction

Authors

  • Muhammad Yoga Adha Pratama, Department of Computer Science, University of Lambung Mangkurat, Kalimantan Selatan, Indonesia
  • Rudy Herteno, Department of Computer Science, University of Lambung Mangkurat, Kalimantan Selatan, Indonesia
  • Mohammad Reza Faisal, Department of Computer Science, University of Lambung Mangkurat, Kalimantan Selatan, Indonesia
  • Radityo Adi Nugroho, Department of Computer Science, University of Lambung Mangkurat, Kalimantan Selatan, Indonesia
  • Friska Abadi, Department of Computer Science, University of Lambung Mangkurat, Kalimantan Selatan, Indonesia

DOI:

https://doi.org/10.15575/join.v9i1.1307

Keywords:

Software Defect Prediction, Particle Swarm Optimization, Feature Selection, Filter, Wrapper, Naive Bayes

Abstract

Software defect prediction (SDP) is used to identify defective software modules, which remains a challenge in software development. This research addresses problems that arise when Particle Swarm Optimization (PSO) is applied to SDP, namely noisy attributes, high-dimensional data, and premature convergence, and aims to improve PSO performance through a hybrid feature selection approach. The hybrid combines Filter and Wrapper techniques, using Chi-Square (CS), Correlation-Based Feature Selection (CFS), and Forward Selection (FS), because feature selection has been shown to reduce data dimensionality and remove noisy, uncorrelated attributes. The Naive Bayes algorithm is used as the classifier to determine the most probable class. Performance is evaluated with AUC, and differences are tested for significance at an alpha of 0.050. The proposed hybrid feature selection technique brings a significant improvement to PSO performance, with a significance value of 0.00342, well below alpha. Significance values for the other combinations are 0.02535 for FS PSO, 0.00180 for CFS FS PSO, and 0.01186 for CS FS PSO. The method in this study contributes to improving PSO in the SDP domain by significantly increasing its AUC, highlighting the potential of hybrid feature selection techniques to improve PSO performance in SDP.
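
The sketch below is a minimal illustration, not the authors' implementation, of how a filter-plus-wrapper feature selection of the kind described in the abstract can be combined with a Naive Bayes classifier and AUC scoring using scikit-learn. The synthetic dataset, feature counts, and fold settings are illustrative assumptions, and the PSO search itself is omitted; SequentialFeatureSelector stands in as a simple forward-selection wrapper.

```python
# Hypothetical sketch: Chi-Square filter -> forward-selection wrapper -> Naive Bayes,
# evaluated with AUC. Dataset, feature counts, and CV settings are assumptions,
# and PSO (the optimizer studied in the paper) is not modeled here.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2, SequentialFeatureSelector
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Stand-in for a defect dataset: rows = software modules, columns = code metrics.
X, y = make_classification(n_samples=500, n_features=40, n_informative=8,
                           weights=[0.85, 0.15], random_state=42)

pipeline = Pipeline([
    ("scale", MinMaxScaler()),                      # chi2 requires non-negative inputs
    ("filter", SelectKBest(chi2, k=20)),            # filter stage: Chi-Square ranking
    ("wrapper", SequentialFeatureSelector(          # wrapper stage: forward selection
        GaussianNB(), n_features_to_select=10,
        direction="forward", scoring="roc_auc", cv=3)),
    ("clf", GaussianNB()),                          # final Naive Bayes classifier
])

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
auc = cross_val_score(pipeline, X, y, scoring="roc_auc", cv=cv)
print(f"Mean AUC: {auc.mean():.3f} +/- {auc.std():.3f}")
```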


Published

2024-04-23

Section

Article