Analysis and Implementation Machine Learning for YouTube Data Classification by Comparing the Performance of Classification Algorithms

Riyan Amanda; Edi Surya Negara

doi:10.15575/join.v5i1.505

Authors

Riyan Amanda Universitas Bina Darma, Indonesia
Edi Surya Negara Universitas Bina Darma, Indonesia

Keywords:

Data Mining, Experimental Method, Machine Learning, YouTube Data Classification

Abstract

Every day, people around the world upload 1.2 million videos to YouTube, or more than 100 hours of video per minute, and this number keeps increasing. This continuous stream of data is useless unless it is put to further use. Data mining offers a way to extract information from large-scale data, and classification is one of its techniques. For most YouTube users, the videos returned by a title search often do not match the desired video category. This research was therefore conducted to classify YouTube data based on its search text. This article focuses on comparing three algorithms for classifying YouTube data into the Kesenian (Arts) and Sains (Science) categories. The data were collected by scraping the YouTube website and consist of links, titles, descriptions, and search texts. The research uses an experimental method consisting of data collection, data processing, model proposal, testing, and model evaluation. The models applied are Random Forest, SVM, and Naïve Bayes. The results show that the Random Forest model was 0.004% more accurate when the label encoder was not applied to the target class, and that the label encoder otherwise had no effect on the accuracy of the classification models. The most appropriate model for classifying the YouTube data collected in this study is Naïve Bayes, with an accuracy of 88% and an average precision of 90%.
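To make the classification and evaluation steps concrete, the following is a minimal sketch of a multinomial Naïve Bayes text classifier together with the two reported metrics, accuracy and class-averaged (macro) precision. It is not the authors' implementation: the abstract does not describe the preprocessing, features, or library used, so the tokenizer, the add-one smoothing, the class name `NaiveBayesText`, the helper functions, and the toy search texts are all assumptions made for illustration. Random Forest and SVM are omitted to keep the example self-contained.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, deliberately simple tokenizer)."""
    return text.lower().split()


class NaiveBayesText:
    """Hypothetical multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # word frequencies per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict_one(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.class_counts[label] / total_docs)
            for word in tokenize(text):
                if word in self.vocab:  # skip words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    def predict(self, texts):
        return [self.predict_one(t) for t in texts]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_precision(y_true, y_pred):
    """Unweighted mean of per-class precision."""
    scores = []
    for label in sorted(set(y_true)):
        hits = [t for t, p in zip(y_true, y_pred) if p == label]
        scores.append(sum(t == label for t in hits) / len(hits) if hits else 0.0)
    return sum(scores) / len(scores)


# Toy search texts invented for this sketch; they are not the study's data.
train_texts = [
    "tari tradisional bali", "lukisan cat minyak", "musik gamelan jawa",
    "eksperimen kimia sederhana", "teori fisika kuantum", "biologi sel manusia",
]
train_labels = ["Kesenian"] * 3 + ["Sains"] * 3

test_texts = ["tari jawa", "fisika sederhana", "lukisan bali",
              "kimia sel", "musik fisika kuantum"]
test_labels = ["Kesenian", "Sains", "Kesenian", "Sains", "Kesenian"]

model = NaiveBayesText().fit(train_texts, train_labels)
preds = model.predict(test_texts)
acc = accuracy(test_labels, preds)
prec = macro_precision(test_labels, preds)
print(f"accuracy={acc:.2f}, macro precision={prec:.2f}")
# prints: accuracy=0.80, macro precision=0.83
```

The last toy query is deliberately ambiguous, so the sketch misclassifies it and the two metrics differ. In a comparison like the one described here, the same train and test split and the same metric functions would be reused for each candidate model so that the scores are directly comparable.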

Author Biographies

Riyan Amanda, Universitas Bina Darma

Master of Informatics Engineering (Magister Teknik Informatika)

Edi Surya Negara, Universitas Bina Darma

Master of Informatics Engineering (Magister Teknik Informatika)
