Multi Rule-based and Corpus-based for Sundanese Stemmer

Ade Sutedi; Muhammad Rikza Nasrulloh; Rickard Elsen

doi:10.15575/join.v7i2.846

Authors

Ade Sutedi, Department of Informatics Engineering, Institut Teknologi Garut, Indonesia
Muhammad Rikza Nasrulloh, Department of Informatics Engineering, Institut Teknologi Garut, Indonesia
Rickard Elsen, Department of Informatics Engineering, Institut Teknologi Garut, Indonesia

Keywords:

Corpus-based, Multi Rule-based, Stemmer, Sundanese

Abstract

The purpose of this study is to develop a stemming method that combines several techniques: morphological analysis (affix and pro-lexeme removal), syllable (canonical) pattern checking, and comparison against corpus data to determine the final stemming result. The algorithm first checks the number of characters in the word and removes affixes, then checks the syllable pattern of the stripped result, and finally compares it with the corpus data, which determines the final stem. In this study, the corpus data were taken from a Sundanese dictionary, consisting of single words used as root words, and from a dataset extracted from an online Sundanese magazine. The results showed that affix and pro-lexeme stripping removes the corresponding affixes and pro-lexemes, that comparing words by syllable pattern allows root words to be identified quickly, and that using the corpus improves accuracy and reduces the over-stemming problems that occur during stemming.
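
The pipeline described in the abstract (length check, affix stripping, syllable-pattern check, corpus lookup) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The affix lists, the length threshold, the accepted syllable shapes, and the four-word corpus are all assumptions chosen for illustration, and pro-lexeme removal is omitted because the abstract does not specify it. Digraphs such as "eu", "ng", and "ny" are treated as separate letters for simplicity.

```python
import re

# Toy resources for illustration only; none of these lists come from the paper.
PREFIXES = ("di", "ka", "pa", "sa", "ti")     # assumed sample prefixes
SUFFIXES = ("keun", "eun", "an", "na")        # assumed sample suffixes
CORPUS = {"baca", "dahar", "jalan", "tulis"}  # stand-in for the root-word corpus
VOWELS = set("aeiou")
MIN_LENGTH = 4                                # assumed length threshold

# Assumed rule: a word is a run of syllables shaped V, VC, CV, CVC, CCV or CCVC.
SYLLABLES = re.compile(r"(?:C{0,2}VC?)+")


def syllable_pattern(word):
    """Map each letter to C (consonant) or V (vowel), e.g. 'baca' -> 'CVCV'."""
    return "".join("V" if ch in VOWELS else "C" for ch in word)


def has_valid_pattern(word):
    """Return True if the whole CV string splits into the assumed syllable shapes."""
    return SYLLABLES.fullmatch(syllable_pattern(word)) is not None


def strip_affixes(word):
    """Return the word plus every candidate left after removing one prefix and/or one suffix."""
    candidates = [word]
    candidates += [word[len(p):] for p in PREFIXES if word.startswith(p)]
    for cand in list(candidates):
        candidates += [cand[:-len(s)] for s in SUFFIXES if cand.endswith(s)]
    return candidates


def stem(word):
    """Length check, affix stripping, syllable-pattern check, then corpus lookup."""
    word = word.lower()
    # Short words and known roots are returned unchanged.
    if len(word) < MIN_LENGTH or word in CORPUS:
        return word
    for cand in strip_affixes(word):
        # The corpus has the final say on which candidate is accepted.
        if has_valid_pattern(cand) and cand in CORPUS:
            return cand
    return word  # no candidate confirmed by the corpus


if __name__ == "__main__":
    for w in ("dibaca", "bacakeun", "dibacakeun", "jalan"):
        print(w, "->", stem(w))
```

With these toy lists, "dibaca", "bacakeun", and "dibacakeun" all reduce to "baca", while "jalan" is returned unchanged because it is already in the corpus. The last case shows how a corpus lookup can guard against over-stemming: a rule-only stripper would remove the final "-an" and produce "jal".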

Author Biography

Ade Sutedi, Department of Informatics Engineering, Institut Teknologi Garut

Informatics Engineering
