K-Means-Based Pseudo-Labeling Technique in Supervised Learning Models for Regional Classification Based on Types of Non-Communicable Diseases

Authors

  • Herison Surbakti, Information Technology, Faculty of Science and Technology, Universitas Respati Yogyakarta, Indonesia
  • Tb Ai Munandar, Informatics, Faculty of Computer Science, Universitas Bhayangkara Jakarta Raya, Indonesia

DOI:

https://doi.org/10.15575/join.v10i2.1609

Keywords:

K-means, Non-communicable diseases, Pseudo-labeling, Regional classification, Semi-supervised learning

Abstract

Non-communicable diseases (NCDs) pose a critical threat to global public health, and Indonesia faces significant challenges from high mortality rates and uneven regional distribution. In Banten Province, limited access to labeled health data hampers effective, data-driven intervention strategies. This study proposes a semi-supervised learning approach to building a regional classification model for NCDs. The method first applies K-Means clustering to data from 254 community health centers (Puskesmas) to generate pseudo-labels. Cluster configurations from k = 2 to k = 8 were evaluated, and two clusters were selected as optimal based on a silhouette score of 0.735. These cluster assignments were then used to create a semi-labeled dataset for supervised learning. Eight classification algorithms were trained and compared: CN2 Rule Inducer, k-Nearest Neighbor (kNN), Logistic Regression, Naïve Bayes, Neural Network, Random Forest, Support Vector Machine (SVM), and Decision Tree. The Neural Network model performed best, with an AUC of 0.999 and an MCC of 0.976, indicating excellent stability and predictive accuracy. The findings validate the effectiveness of semi-supervised learning for health classification tasks when labeled data are scarce. The approach can serve as a decision-support tool for regional health planning and targeted interventions, improving the precision and efficiency of public health responses.
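The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the data are synthetic (254 rows with five hypothetical features standing in for the Puskesmas records), scikit-learn is an assumed toolkit, and `MLPClassifier` is only a stand-in for the Neural Network model named above. The sketch scores k = 2 to 8 by silhouette, uses the cluster assignments at the selected k as pseudo-labels, trains a classifier on them, and reports AUC and MCC on a held-out split.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import matthews_corrcoef, roc_auc_score, silhouette_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# ASSUMPTION: synthetic stand-in data, not the study's dataset.
# 254 rows mirror the number of health centers; the 5 features are hypothetical.
X, _ = make_blobs(
    n_samples=254,
    centers=[[0.0] * 5, [6.0] * 5],  # two well-separated groups, chosen for illustration
    cluster_std=1.0,
    random_state=0,
)

# Step 1: evaluate k = 2..8 and record the silhouette score of each clustering.
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)

# Step 2: the cluster assignments at best_k become the pseudo-labels.
pseudo_labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)

# Step 3: train a supervised classifier on the pseudo-labeled data.
# ASSUMPTION: a 70/30 stratified split and a small MLP; the source does not
# state the split, the architecture, or the software used.
X_train, X_test, y_train, y_test = train_test_split(
    X, pseudo_labels, test_size=0.3, stratify=pseudo_labels, random_state=0
)
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
clf.fit(X_train, y_train)

# Step 4: evaluate with the two metrics reported in the abstract.
y_pred = clf.predict(X_test)
proba = clf.predict_proba(X_test)
mcc = matthews_corrcoef(y_test, y_pred)
if best_k == 2:
    auc = roc_auc_score(y_test, proba[:, 1])
else:
    auc = roc_auc_score(y_test, proba, multi_class="ovr")

print(f"selected k = {best_k}, silhouette = {scores[best_k]:.3f}")
print(f"AUC = {auc:.3f}, MCC = {mcc:.3f}")
```

Because the classifier is trained and scored against labels produced by the clustering step, high AUC and MCC values in a setup like this measure how well the classifier reproduces the cluster boundaries rather than agreement with independently verified labels.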


Published

2025-11-08

Section

Article
