Evaluating RAG Performance on Small Language Models for Low-Resource Devices through Chunking and Retrieval Methods

Authors

  • Amelia Dewi Agustiani Politeknik Negeri Bandung, Indonesia
  • Salsabila Maharani Putri Politeknik Negeri Bandung, Indonesia
  • Jonner Hutahaean
  • Muhammad Rizqi Sholahuddin Politeknik Negeri Bandung, Indonesia
  • Muhammad Riza Alifi
  • Ade Hodijah

DOI:

https://doi.org/10.15575/join.v11i1.1733

Keywords:

Chunking Technique, Low-Resource Device, Retrieval-Augmented Generation (RAG), Retrieval Approaches, Small Language Model (SLM)

Abstract

Retrieval-Augmented Generation (RAG) combines the generative capabilities of language models with external document retrieval to answer questions grounded in reference texts. However, deploying RAG on low-resource devices such as Android smartphones is challenging because Small Language Models (SLMs) have limited computational capacity and depend heavily on efficient chunking and retrieval. Although interest in on-device processing is growing, research on RAG configurations for SLMs under strict resource constraints, especially for domain-specific tasks, remains limited. This study therefore investigates which combinations of chunking technique, chunk size, overlap, and retrieval strategy best balance accuracy and speed on low-resource devices. The evaluation uses 148 Indonesian questions sourced from an official Hajj guidebook. The study consists of two phases: retrieval and generation. Retrieval is evaluated using BLEU, ROUGE-L, MRR, MAP, and Hit@k, while answer quality is measured with BERTScore. The experiments compare chunking methods (fixed-size or semantic), chunk sizes (128 or 256 tokens), overlaps (25, 50, or 100 tokens), and retrieval methods (dense, sparse, or hybrid). Results show that sparse retrieval with 256-token chunks and 100-token overlap yields the best answer quality (F1 = 0.726), while 128-token chunks with the same overlap provide the fastest generation time (69.737 seconds). The main contribution of this study is a systematic evaluation of RAG configurations for fully on-device SLMs on a domain-specific Hajj and Umrah dataset not explored in prior research. The findings provide practical guidance for designing efficient and accurate RAG-based question-answering systems on low-resource devices.
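As a rough illustration of the configurations compared in the abstract, the sketch below implements fixed-size chunking with token overlap and a min-max-normalized weighted fusion of dense and sparse scores, one common way to realize hybrid retrieval. The function names, the weighting parameter `alpha`, and the toy scores are illustrative assumptions, not the authors' implementation.

```python
def chunk_fixed(tokens, size=256, overlap=100):
    """Split a token sequence into fixed-size chunks that share `overlap` tokens."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than the chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # last chunk reached the end of the document
            break
    return chunks


def minmax(scores):
    """Min-max normalize scores to [0, 1] so dense and sparse scales are comparable."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]


def hybrid_scores(dense, sparse, alpha=0.5):
    """Weighted fusion of normalized dense and sparse retrieval scores per chunk."""
    d, s = minmax(dense), minmax(sparse)
    return [alpha * a + (1 - alpha) * b for a, b in zip(d, s)]


# A 500-token document with 256-token chunks and 100-token overlap
# yields 3 chunks, each sharing 100 tokens with its neighbor.
chunks = chunk_fixed(list(range(500)), size=256, overlap=100)
```

With this layout, a smaller chunk size (e.g. 128 tokens) produces more, shorter chunks, which shortens the generation context and, as the abstract reports, speeds up on-device answer generation at some cost in answer quality.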

References

[1] Y. Gao et al., ‘Retrieval-Augmented Generation for Large Language Models: A Survey’, Dec. 2023, [Online]. Available: http://arxiv.org/abs/2312.10997

[2] X. Wang et al., ‘Searching for Best Practices in Retrieval-Augmented Generation’, Jul. 2024, [Online]. Available: http://arxiv.org/abs/2407.01219

[3] J. Liu, R. Ding, L. Zhang, P. Xie, and F. Huang, ‘CoFE-RAG: A Comprehensive Full-chain Evaluation Framework for Retrieval-Augmented Generation with Enhanced Data Diversity’, Oct. 2024, [Online]. Available: http://arxiv.org/abs/2410.12248

[4] R. F. Reza, Muhammad Thoriq, and Rd. Imam Saepul Millah, ‘Sentiment Analysis of Marketplace Review with Islamic Perspective using Fine-Tuning DistilBERT’, Khazanah Journal of Religion and Technology, vol. 2, no. 2, pp. 45–54, Jan. 2025, doi: 10.15575/kjrt.v2i2.1118.

[5] C. Van Nguyen et al., ‘A Survey of Small Language Models’, Oct. 2024, [Online]. Available: http://arxiv.org/abs/2410.20011

[6] T. Fan, J. Wang, X. Ren, and C. Huang, ‘MiniRAG: Towards Extremely Simple Retrieval-Augmented Generation’, Jan. 2025, [Online]. Available: http://arxiv.org/abs/2501.06713

[7] Z. Lu et al., ‘Small Language Models: Survey, Measurements, and Insights’, Sep. 2024, [Online]. Available: http://arxiv.org/abs/2409.15790

[8] P. Lewis et al., ‘Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks’, Apr. 2021, [Online]. Available: http://arxiv.org/abs/2005.11401

[9] D. Kim, B. Kim, D. Han, and M. Eibich, ‘AutoRAG: Automated Framework for optimization of Retrieval Augmented Generation Pipeline’, Oct. 2024, [Online]. Available: http://arxiv.org/abs/2410.20878

[10] Nerve Sparks, ‘nerve-sparks/iris_android’, 2025, India.

[11] Shubham Panchal, ‘shubham0204/Android-Document-QA’, 2025, India.

[12] S. Setty, H. Thakkar, A. Lee, E. Chung, and N. Vidra, ‘Improving Retrieval for RAG based Question Answering Models on Financial Documents’, Mar. 2024, [Online]. Available: http://arxiv.org/abs/2404.07221

[13] R. Qu, R. Tu, and F. Bao, ‘Is Semantic Chunking Worth the Computational Cost?’, Oct. 2024, [Online]. Available: http://arxiv.org/abs/2410.13070

[14] X. Ma, Y. Gong, P. He, H. Zhao, and N. Duan, ‘Query Rewriting for Retrieval-Augmented Large Language Models’, 2023. [Online]. Available: https://github.com/xbmxb/RAG-query-rewriting

[15] S.-C. Lin, J.-H. Yang, R. Nogueira, M.-F. Tsai, C.-J. Wang, and J. Lin, ‘Multi-Stage Conversational Passage Retrieval: An Approach to Fusing Term Importance Estimation and Neural Query Rewriting’, Mar. 2021, [Online]. Available: http://arxiv.org/abs/2005.02230

[16] P. Mandikal and R. Mooney, ‘Sparse Meets Dense: A Hybrid Approach to Enhance Scientific Document Retrieval’, Jan. 2024, [Online]. Available: http://arxiv.org/abs/2401.04055

[17] B. G. Chepino, R. R. Yacoub, A. Aula, M. Saleh, and B. W. Sanjaya, ‘EFFECT OF MINMAX NORMALIZATION ON ORB DATA FOR IMPROVED ANN ACCURACY’, Journal of Electrical Engineering, Energy, and Information Technology (J3EIT), vol. 11, no. 2, p. 29, Aug. 2023, doi: 10.26418/j3eit.v11i2.68689.

[18] Bartowski, ‘bartowski/gemma-2-2b-it-GGUF’, Hugging Face.

[19] J. Park, K. Atarashi, K. Takeuchi, and H. Kashima, ‘Emulating Retrieval Augmented Generation via Prompt Engineering for Enhanced Long Context Comprehension in LLMs’, Feb. 2025, [Online]. Available: http://arxiv.org/abs/2502.12462

[20] H. Yu, A. Gan, K. Zhang, S. Tong, Q. Liu, and Z. Liu, ‘Evaluation of Retrieval-Augmented Generation: A Survey’, May 2024, doi: 10.1007/978-981-96-1024-2_8.

Published

2026-05-14

Section

Article
