CatBoost Optimization Using Recursive Feature Elimination

Authors

  • Agus Hadianto, Master of Informatics, President University, Indonesia
  • Wiranto Herry Utomo, Master of Informatics, President University, Indonesia

DOI:

https://doi.org/10.15575/join.v9i2.1324

Keywords:

CatBoost, Feature Selection, RFE

Abstract

CatBoost is a powerful machine learning algorithm for both classification and regression tasks. Many studies focus on its applications, but few address how to enhance its performance, particularly when Recursive Feature Elimination (RFE) is used for feature selection. This study examines CatBoost optimization for regression tasks by applying RFE for feature selection in combination with several regression algorithms as ranking estimators. In addition, the Isolation Forest algorithm is employed during preprocessing to identify and remove outliers from the dataset. The experiments compare the performance of the CatBoost regression model with and without RFE feature selection. The results indicate that CatBoost with RFE, using Random Forest to select features, outperforms the baseline model without feature selection. CatBoost-RFE achieved notable gains over the baseline: more than 48.6% reduction in training time, an 8.2% improvement in RMSE, and a 1.3% improvement in R² score. It also demonstrated better prediction accuracy than AdaBoost, Gradient Boosting, XGBoost, and artificial neural networks (ANN). This improvement has substantial implications for predicting the exhaust temperature in a coal-fired power plant.
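
For readers who want to reproduce the general workflow, the sketch below illustrates the pipeline described in the abstract: Isolation Forest for outlier removal, RFE with a Random Forest ranking estimator for feature selection, and a CatBoost regressor evaluated with RMSE and R². It is a minimal sketch under assumed settings; the file name, column names, contamination rate, number of selected features, and train/test split are illustrative assumptions, not the authors' exact configuration.

# Illustrative sketch of the pipeline described in the abstract (assumed settings).
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest, RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from catboost import CatBoostRegressor

# Hypothetical dataset: boiler sensor readings with an exhaust-temperature target column.
df = pd.read_csv("boiler_sensors.csv")                         # assumed file name
X, y = df.drop(columns=["exhaust_temp"]), df["exhaust_temp"]   # assumed target name

# 1) Outlier removal with Isolation Forest (contamination rate is an assumption).
mask = IsolationForest(contamination=0.05, random_state=42).fit_predict(X) == 1
X, y = X[mask], y[mask]

# 2) Feature selection with RFE using Random Forest as the ranking estimator.
rfe = RFE(RandomForestRegressor(n_estimators=100, random_state=42),
          n_features_to_select=10)                             # feature count is an assumption
X_sel = rfe.fit_transform(X, y)

# 3) Train CatBoost on the selected features and evaluate with RMSE and R².
X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, test_size=0.2, random_state=42)
model = CatBoostRegressor(verbose=0, random_state=42)
model.fit(X_tr, y_tr)
pred = model.predict(X_te)
print("RMSE:", np.sqrt(mean_squared_error(y_te, pred)))
print("R2  :", r2_score(y_te, pred))

Training time with and without the RFE step can be compared by timing model.fit on the full versus the reduced feature set, mirroring the comparison reported in the abstract.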

Published

2024-08-24

Section

Article
