Analysis and Implementation Machine Learning for YouTube Data Classification by Comparing the Performance of Classification Algorithms

Authors

  • Riyan Amanda, Universitas Bina Darma, Indonesia
  • Edi Surya Negara, Universitas Bina Darma, Indonesia

DOI:

https://doi.org/10.15575/join.v5i1.505

Keywords:

Data Mining, Experimental Method, Machine Learning, YouTube Data Classification

Abstract

Every day, people around the world upload 1.2 million videos to YouTube, or more than 100 hours of video per minute, and this number is increasing. This continuous stream of data is of little value unless it is put to further use. Data mining is one way to extract information from data at this scale, and classification is one of its techniques. For most YouTube users, the video titles returned by a search often do not match the desired video category. This research was therefore conducted to classify YouTube data based on its search text. The article compares three algorithms for classifying YouTube data into the Kesenian (Arts) and Sains (Science) categories. The data were collected by scraping the YouTube website and consist of links, titles, descriptions, and search terms. The research follows an experimental method comprising data collection, data processing, model proposal, testing, and model evaluation. The models applied are Random Forest, SVM, and Naïve Bayes. The results show that the Random Forest model was 0.004% more accurate when the label encoder was not applied to the target class; the label encoder otherwise had no effect on the accuracy of the classification models. The most appropriate model for classifying the YouTube data collected in this study is Naïve Bayes, with an accuracy of 88% and an average precision of 90%.
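
The classification and evaluation steps summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain multinomial Naïve Bayes with add-one smoothing written against the Python standard library, and the training titles, test titles, and helper names (tokenize, MultinomialNaiveBayes, accuracy, precision) are hypothetical and chosen only for illustration. Random Forest and SVM are omitted for brevity.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


class MultinomialNaiveBayes:
    """Multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict_one(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Log prior of the class plus log likelihood of each known word.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokenize(text):
                if token not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    def predict(self, texts):
        return [self.predict_one(t) for t in texts]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision(y_true, y_pred, positive):
    """Fraction of items predicted as `positive` that truly are `positive`."""
    predicted_pos = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not predicted_pos:
        return 0.0
    return sum(t == positive for t in predicted_pos) / len(predicted_pos)


# Hypothetical toy data standing in for scraped YouTube search text.
train_texts = [
    "tari tradisional bali",
    "lukisan batik seni rupa",
    "musik gamelan jawa",
    "eksperimen fisika sederhana",
    "reaksi kimia di laboratorium",
    "biologi sel dan dna",
]
train_labels = ["Kesenian"] * 3 + ["Sains"] * 3

test_texts = ["seni tari jawa", "eksperimen kimia sel"]
test_labels = ["Kesenian", "Sains"]

model = MultinomialNaiveBayes().fit(train_texts, train_labels)
predictions = model.predict(test_texts)

print("predictions:", predictions)
print("accuracy:", accuracy(test_labels, predictions))
print("precision (Kesenian):", precision(test_labels, predictions, "Kesenian"))
```

The string labels are used directly as class names here, so no label encoding step is needed in this sketch. On a real dataset the same accuracy and precision functions would be applied to a held-out test split for each candidate model.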

Author Biographies

Riyan Amanda, Universitas Bina Darma

Magister Teknik Informatika (Master of Informatics Engineering)

Edi Surya Negara, Universitas Bina Darma

Magister Teknik Informatika (Master of Informatics Engineering)

Published

2020-07-16

Section

Article