Generating a Balinese Script Dataset Using Web Scraping for Optical Character Recognition-Based Balinese Script Recognition

Authors

  • I Gede Pradipta Adi Nugraha
  • Ahmad Asroni (Universitas Pendidikan Ganesha)
  • Luh Joni Erawati Dewi
  • Gede Indrawan

DOI:

https://doi.org/10.31598/jurnalresistor.v6i2.1475

Keywords:

Balinese script, dataset, character recognition, web scraping, transliteration

Abstract

A large, comprehensive dataset of Balinese script images is an important resource for preserving Balinese cultural heritage and for developing related applications. Such a dataset supports in-depth analysis, character recognition, language processing, and the testing of applications such as automatic text recognition, automated teaching, and language understanding, all of which advance the preservation of Balinese cultural heritage. The research method for generating the Balinese script dataset involves accessing the data source, analyzing the web structure, and applying web scraping techniques in JavaScript to retrieve images automatically. Two main stages, initiation and data collection, allow a large dataset to be gathered efficiently, which speeds up data collection and improves accuracy in Balinese script research. The data come from a Balinese, Indonesian, and English dictionary containing a total of 35,319 Balinese words, which were then converted into Balinese script. The resulting dataset comprises 35,319 data pairs, each consisting of an image of Balinese script text and its transliteration text, and it has considerable potential for training Balinese script recognition models and for research on the Balinese language. This work strengthens the availability of a relevant, high-quality dataset of significant value for technology development and for further research on the Balinese language and Balinese script recognition.
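
The two stages named in the abstract, initiation and data collection, can be illustrated with a minimal sketch. The abstract reports that the scraping was done in JavaScript; Python is used here only for illustration, and this is not the authors' implementation. The function names `initiate`, `collect`, and `fake_render`, the numbered file naming, and the idea that initiation amounts to cleaning the word list are all assumptions. `fake_render` is a hypothetical stand-in for the step that retrieves a rendered Balinese script image from the web source.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterable, List, Tuple


def initiate(raw_lines: Iterable[str]) -> List[str]:
    """Initiation stage (assumed): clean the word list and skip blank lines."""
    return [line.strip() for line in raw_lines if line.strip()]


def collect(
    words: List[str],
    render: Callable[[str], bytes],
    out_dir: Path,
) -> List[Tuple[Path, Path]]:
    """Collection stage (assumed): save one image/transliteration pair per word."""
    out_dir.mkdir(parents=True, exist_ok=True)
    pairs = []
    for index, word in enumerate(words, start=1):
        stem = f"{index:05d}"  # hypothetical naming scheme: 00001, 00002, ...
        image_path = out_dir / f"{stem}.png"
        text_path = out_dir / f"{stem}.txt"
        # `render` stands in for the scraper that fetches the script image.
        image_path.write_bytes(render(word))
        text_path.write_text(word + "\n", encoding="utf-8")
        pairs.append((image_path, text_path))
    return pairs


def fake_render(word: str) -> bytes:
    """Hypothetical placeholder: returns dummy bytes instead of a real image."""
    return b"IMG:" + word.encode("utf-8")


if __name__ == "__main__":
    words = initiate(["bali\n", "\n", "aksara\n"])
    with tempfile.TemporaryDirectory() as tmp:
        for image_path, text_path in collect(words, fake_render, Path(tmp)):
            print(image_path.name, text_path.name)
```

Passing the renderer in as a parameter keeps the pairing logic separate from the site-specific scraping code, so the same loop could be exercised offline before being pointed at a real source.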

Published

2023-08-31

How to Cite

Nugraha, I. G. P. A., Asroni, A., Dewi, L. J. E., & Indrawan, G. (2023). Pembangkitan Dataset Aksara Bali Menggunakan Web Scrapping untuk Pengenalan Aksara Bali Berbasis Optical Character Recognition. Jurnal RESISTOR (Rekayasa Sistem Komputer), 6(2), 92-103. https://doi.org/10.31598/jurnalresistor.v6i2.1475