Study of the Application of Text Augmentation with Paraphrasing to Overcome Imbalanced Data in Indonesian Text Classification

Authors

  • Mutiara Indryan Sari, Computational Statistics, Politeknik Statistika STIS, Indonesia
  • Lya Hulliyyatus Suadaa, Computational Statistics, Politeknik Statistika STIS, Indonesia

DOI:

https://doi.org/10.15575/join.v10i1.1472

Keywords:

Imbalanced dataset, Paraphrase, Pre-trained model, Text augmentation, Text classification

Abstract

Data imbalance in text classification often leads to poor recognition of minority classes, because classifiers tend to favor majority categories. This study addresses the data imbalance problem in Indonesian text classification by proposing a novel text augmentation approach based on paraphrasing with fine-tuned pre-trained models: IndoGPT2, IndoBART-v2, and mBART50. Unlike back-translation, which struggles with informal text, augmentation with pre-trained models significantly improves the F1 score of minority labels; fine-tuned mBART50 outperforms back-translation and the other models by balancing semantic preservation with lexical diversity. The approach nevertheless has limitations, including a risk of overfitting because synthetic text lacks natural variation, restricted generalizability due to reliance on datasets such as ParaCotta, and the high computational cost of fine-tuning large models such as mBART50. Future research should explore hybrid methods that integrate synthetic and real-world data to improve text quality and diversity, as well as smaller, more efficient models that reduce computational demands. The findings underscore the potential of pre-trained models for text augmentation while emphasizing the importance of dataset characteristics, language style, and augmentation volume in achieving optimal results.
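The augmentation strategy the abstract describes can be sketched as an oversampling loop: for each minority label, generate paraphrased copies of its texts until every label reaches the majority-label count. The sketch below is a minimal illustration, not the authors' implementation; the `paraphrase` callable is a hypothetical stand-in for a fine-tuned generator such as mBART50, and the helper name `augment_minority` is an assumption for illustration.

```python
from collections import Counter
from typing import Callable, List, Tuple

def augment_minority(
    data: List[Tuple[str, str]],       # (text, label) pairs
    paraphrase: Callable[[str], str],  # stand-in for a fine-tuned paraphraser
) -> List[Tuple[str, str]]:
    """Oversample minority labels with paraphrased copies until every
    label matches the majority-label count."""
    counts = Counter(label for _, label in data)
    target = max(counts.values())
    augmented = list(data)
    for label, n in counts.items():
        pool = [text for text, lab in data if lab == label]
        for i in range(target - n):
            # Cycle through the minority texts, paraphrasing each in turn.
            augmented.append((paraphrase(pool[i % len(pool)]), label))
    return augmented

# Toy usage with a trivial stand-in paraphraser:
data = [("berita satu", "maj"), ("berita dua", "maj"),
        ("berita tiga", "maj"), ("komentar", "min")]
balanced = augment_minority(data, lambda t: t + " (parafrase)")
print(Counter(label for _, label in balanced))  # both labels now count 3
```

In the study itself, the paraphrase step would be a fine-tuned model (e.g., mBART50) decoding a paraphrase of the input; the loop structure around it is independent of which generator is used.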

Published

2025-04-01
