Analysis of Data and Feature Processing on Stroke Prediction using Wide Range Machine Learning Model


  • Untari Novia Wisesty, School of Computing, Telkom University, Bandung, Indonesia
  • Tjokorda Agung Budi Wirayuda, School of Computing, Telkom University, Bandung, Indonesia
  • Febryanti Sthevanie, School of Computing, Telkom University, Bandung, Indonesia
  • Rita Rismala, School of Computing, Telkom University, Bandung, Indonesia



Stroke Prediction, Machine Learning, Sampling Data, Pearson Correlation, PCA


Stroke is a disease that causes the death of brain cells, so that the parts of the body controlled by the affected brain tissue lose their function. If not treated immediately, it can cause long-term disability, brain damage, and death. In this research, stroke prediction was carried out on the Stroke dataset obtained from Kaggle using a wide range of machine learning models. Data sampling techniques, namely Random Undersampling (RUS), Random Oversampling (ROS), and SMOTE, were used to handle the class imbalance in the stroke dataset. Pearson Correlation and Principal Component Analysis (PCA) were also used for dimensionality reduction and for identifying the features most influential in predicting stroke. Pearson Correlation identified the five attributes with the highest Pearson coefficients: age, hypertension, heart disease, blood sugar level, and marital status. Experimental results show that the RUS, ROS, and SMOTE sampling techniques increased the test F1-score by 43.44%, 34.44%, and 35.55%, respectively, compared with experiments conducted without any data sampling. The highest test F1-score, 0.83, was achieved by the Support Vector Machine and Gaussian Naïve Bayes models.
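As a rough illustration of the sampling and correlation steps named in the abstract, the following minimal sketch implements random undersampling, random oversampling, and the Pearson coefficient using only the Python standard library. It is not the authors' implementation: the function names, the toy rows (age, hypertension flag), and the labels are hypothetical and chosen only to show the mechanics. SMOTE is not shown; it differs from random oversampling in that it synthesizes new minority samples by interpolating between neighboring minority points rather than duplicating existing ones.

```python
import random
from collections import Counter
from math import sqrt


def _group_by_class(rows, labels):
    """Collect rows into a dict keyed by class label."""
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    return by_class


def random_undersample(rows, labels, seed=0):
    """Randomly drop rows until every class has as many rows as the smallest class."""
    rng = random.Random(seed)
    by_class = _group_by_class(rows, labels)
    target = min(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        for row in rng.sample(group, target):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def random_oversample(rows, labels, seed=0):
    """Randomly duplicate rows until every class has as many rows as the largest class."""
    rng = random.Random(seed)
    by_class = _group_by_class(rows, labels)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: each row is [age, hypertension]; label 1 = stroke.
rows = [[34, 0], [41, 0], [29, 0], [52, 0], [47, 1],
        [38, 0], [60, 0], [45, 0], [71, 1], [80, 1]]
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

rus_rows, rus_labels = random_undersample(rows, labels)
ros_rows, ros_labels = random_oversample(rows, labels)
age_corr = pearson([row[0] for row in rows], labels)

print("original:", Counter(labels))
print("after RUS:", Counter(rus_labels))
print("after ROS:", Counter(ros_labels))
print("Pearson(age, stroke):", round(age_corr, 3))
```

In practice, resampling is normally applied to the training split only, so that the test F1-score is measured on data with the original class distribution.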








