LLM-Based Information Retrieval for Disease Detection Using Semantic Similarity
DOI:
https://doi.org/10.15575/join.v10i1.1486
Keywords:
CRISP-DM Framework, Disease Detection, Information Retrieval System, Large Language Model, Semantic Similarity
Abstract
Information retrieval systems are vital for disease prediction, but traditional methods such as TF-IDF do not capture word meaning and produce long, complex vectors. This research uses Large Language Models (LLMs) and follows the CRISP-DM methodology to improve retrieval accuracy. The dataset consists of health forum discussions labeled with specific diseases, which we split into queries and a corpus. For each query, semantic similarity is used to retrieve the most relevant texts from the corpus. After preprocessing, we compare LLMs with TF-IDF: LLMs achieve an accuracy of 0.911 at Top-K = 30, outperforming TF-IDF. LLMs perform better because they produce shorter, meaningful vectors that preserve context, which enables precise semantic matching. These results indicate that LLMs can make healthcare information retrieval more accurate and context-aware, and that they can overcome the limitations of traditional methods in medical informatics.
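The retrieval and scoring setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus, the labels, the helper names (`fit_idf`, `vectorize`, `top_k_labels`, `top_k_accuracy`) and the smoothed IDF formula are all assumptions made for illustration. The abstract also does not define Top-K accuracy, so the sketch assumes one plausible reading, namely that a query counts as correct when at least one of its K most similar corpus texts carries the query's disease label.

```python
import math
from collections import Counter


def fit_idf(corpus_tokens):
    """Smoothed inverse document frequency, learned from the corpus only."""
    n_docs = len(corpus_tokens)
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def vectorize(tokens, idf):
    """Sparse TF-IDF vector as {term: weight}; terms unseen in the corpus are dropped."""
    return {t: tf * idf[t] for t, tf in Counter(tokens).items() if t in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_labels(query_vec, corpus_vecs, corpus_labels, k):
    """Labels of the k corpus texts most similar to the query."""
    ranked = sorted(
        range(len(corpus_vecs)),
        key=lambda i: cosine(query_vec, corpus_vecs[i]),
        reverse=True,
    )
    return [corpus_labels[i] for i in ranked[:k]]


def top_k_accuracy(query_vecs, query_labels, corpus_vecs, corpus_labels, k):
    """Assumed definition: a hit if the true label appears among the top-k labels."""
    hits = sum(
        label in top_k_labels(vec, corpus_vecs, corpus_labels, k)
        for vec, label in zip(query_vecs, query_labels)
    )
    return hits / len(query_labels)


# Hypothetical toy data standing in for labeled health forum posts.
corpus = [
    "fever cough sore throat",
    "high blood sugar thirst",
    "cough chest pain fever",
    "frequent urination thirst fatigue",
]
corpus_labels = ["flu", "diabetes", "flu", "diabetes"]
queries = ["fever and cough", "thirst and fatigue"]
query_labels = ["flu", "diabetes"]

corpus_tokens = [doc.split() for doc in corpus]
idf = fit_idf(corpus_tokens)
corpus_vecs = [vectorize(doc, idf) for doc in corpus_tokens]
query_vecs = [vectorize(q.split(), idf) for q in queries]

accuracy_at_1 = top_k_accuracy(query_vecs, query_labels, corpus_vecs, corpus_labels, k=1)

# A paraphrase with no shared vocabulary gets zero similarity to every text.
paraphrase_vec = vectorize("elevated glucose".split(), idf)
paraphrase_scores = [cosine(paraphrase_vec, vec) for vec in corpus_vecs]
```

On this toy data both queries retrieve a text with the right label at K = 1, while the paraphrase "elevated glucose" scores 0.0 against every corpus text because it shares no terms with them. That is the lexical limitation the abstract attributes to TF-IDF. The same ranking and metric would apply to dense LLM embeddings once `vectorize` and `cosine` are swapped for an embedding model and an array-based cosine.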
License
Copyright (c) 2025 Muhammad Adrinta Abdurrazzaq, Edwin Lesmana Tjiong, Kent Algren Wanady

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.