The Hybrid of Jaro-Winkler and Rabin-Karp Algorithm in Detecting Indonesian Text Similarity

Authors

  • Muhamad Arief Yulianto, Pamulang University, Indonesia
  • Nurhasanah Nurhasanah, Pamulang University, Indonesia

DOI:

https://doi.org/10.15575/join.v6i1.640

Keywords:

combination, Jaro-Winkler, Rabin-Karp, text similarity

Abstract

String matching is one family of text similarity techniques and can be used to measure how similar two texts are. Rabin-Karp is a string-matching algorithm that is well suited to multiple-pattern searching but is less effective for single-pattern searching. The Jaro-Winkler distance algorithm performs approximate string matching and gives its best results when comparing two short strings. This study aims to overcome the shortcomings of Rabin-Karp in single-pattern searching by combining the Jaro-Winkler and Rabin-Karp algorithms. The combined process starts with pre-processing and forming the K-Gram data, followed by calculating a hash value for each K-Gram with the Rabin-Karp algorithm. The Jaro-Winkler algorithm is then used to find matching hash values and to calculate the percentage of similarity. Testing compared words, sentences, and journal abstracts that had been rearranged. With the combined algorithm, the average similarity percentage for words increased, whereas the similarity percentages for sentences and journal abstracts decreased. The experimental results showed that combining the Jaro-Winkler algorithm with the Rabin-Karp algorithm can improve the accuracy of text similarity detection.
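The two building blocks named in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the hash base (256), the modulus (1,000,003), and the Winkler scaling factor p = 0.1 are all choices made here, and the abstract does not specify how the two stages are joined, so the sketch keeps them separate.

```python
def kgrams(text, k):
    """Return every overlapping substring of length k (the K-Grams)."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def rolling_hashes(text, k, base=256, mod=1_000_003):
    """Rabin-Karp rolling hash of each K-Gram (base and mod are assumed values)."""
    if k <= 0 or len(text) < k:
        return []
    high = pow(base, k - 1, mod)  # weight of the character leaving the window
    h = 0
    for ch in text[:k]:
        h = (h * base + ord(ch)) % mod
    hashes = [h]
    for i in range(k, len(text)):
        # Drop the oldest character, shift, then append the new character.
        h = ((h - ord(text[i - k]) * high) * base + ord(text[i])) % mod
        hashes.append(h)
    return hashes


def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    m = 0
    for i, c in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    seq1 = [c for c, flag in zip(s1, matched1) if flag]
    seq2 = [c for c, flag in zip(s2, matched2) if flag]
    # Transpositions: half the number of matched characters that are out of order.
    t = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3


def jaro_winkler(s1, s2, p=0.1):
    """Jaro-Winkler similarity: Jaro boosted by a common prefix of up to 4 characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


if __name__ == "__main__":
    print(kgrams("makan", 3))
    print(rolling_hashes("makan", 3))
    print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

The rolling update reuses the previous window's hash, so each new K-Gram costs constant time instead of rehashing k characters. The prefix bonus in jaro_winkler is what makes the measure favor short strings that agree at the start, which fits the abstract's remark that it works best on short strings.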

Published

2021-06-17

Section

Article
