Performance of Machine Learning Algorithms on Automatic Summarization of Indonesian Language Texts

Authors

  • Galih Wiratmoko, Department of Informatics, Universitas Muhammadiyah Madiun, Indonesia
  • Husni Thamrin, Department of Informatics, Universitas Muhammadiyah Surakarta, Indonesia https://orcid.org/0000-0001-5865-9113
  • Endang Wahyu Pamungkas, Department of Informatics, Universitas Muhammadiyah Surakarta, Indonesia

DOI:

https://doi.org/10.15575/join.v10i1.1506

Keywords:

Abstractive algorithms, Bahasa Indonesia, Hybrid model, T5 model, Text summarization

Abstract

Automatic text summarization (ATS) has become an essential task for processing large volumes of information efficiently. ATS has been extensively studied in resource-rich languages such as English, but research on summarization for under-resourced languages, such as Bahasa Indonesia, is still limited. Indonesian presents distinctive linguistic challenges, including its agglutinative morphology, borrowed vocabulary, and the limited availability of high-quality training data. This study conducts a comparative evaluation of extractive, abstractive, and hybrid models for Indonesian text summarization, using the IndoSum dataset, which contains 20,000 text-summary pairs. We tested several models, including Latent Semantic Analysis (LSA), LexRank, T5, and BART, to assess their effectiveness in generating summaries. The results show that the LexRank+BERT hybrid model outperforms traditional extractive methods, achieving better ROUGE precision, recall, and F-measure scores. Among the abstractive methods, the T5-Large model demonstrated the best performance, producing more coherent and semantically rich summaries than the other models. These findings suggest that hybrid and abstractive approaches are better suited for Indonesian text summarization, especially when leveraging large-scale pre-trained language models.
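As a concrete illustration of the kind of evaluation pipeline the abstract describes, the sketch below generates an abstractive summary with a T5-style model and scores it against a reference with ROUGE-1, ROUGE-2, and ROUGE-L precision, recall, and F-measure. It is a minimal sketch, assuming the Hugging Face `transformers` and Google `rouge-score` packages; the checkpoint name, example article, and reference summary are hypothetical placeholders, not artifacts of this study.

```python
from transformers import pipeline
from rouge_score import rouge_scorer

# Hypothetical checkpoint identifier: substitute whichever T5 model
# fine-tuned on IndoSum (or another Indonesian corpus) you evaluate.
MODEL_NAME = "t5-indosum-checkpoint"

# Illustrative inputs; in practice these come from IndoSum test pairs.
document = "Teks artikel berita berbahasa Indonesia yang akan diringkas ..."
reference_summary = "Ringkasan acuan (gold summary) dari pasangan IndoSum."

# Generate an abstractive summary with beam search.
summarizer = pipeline("summarization", model=MODEL_NAME)
candidate = summarizer(
    document, max_length=96, min_length=16, num_beams=4
)[0]["summary_text"]

# Score the candidate against the reference with ROUGE-1/2/L,
# reporting precision, recall, and F-measure.
scorer = rouge_scorer.RougeScorer(
    ["rouge1", "rouge2", "rougeL"], use_stemmer=False
)
for name, score in scorer.score(reference_summary, candidate).items():
    print(f"{name}: P={score.precision:.3f} "
          f"R={score.recall:.3f} F={score.fmeasure:.3f}")
```

The same scorer can be applied unchanged to extractive outputs (e.g., sentences selected by LexRank or LSA), which is how extractive, abstractive, and hybrid systems can be compared on a common footing.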


Published

2025-05-13

Section

Article
