Understanding the Effectiveness of Query Expansion in IndoSBERT-Based Semantic Retrieval
Abstract
Information retrieval systems have shifted from keyword-based approaches to semantic retrieval using transformer-based models such as BERT and its variants. Although these models capture contextual meaning, vocabulary mismatch between queries and documents remains a key challenge. Query expansion (QE) is commonly used to address this issue, but its effectiveness in semantic retrieval is not consistent. This study analyzes the impact of query expansion on a semantic retrieval system based on a fine-tuned IndoSBERT model, using a dataset of undergraduate thesis titles and abstracts from the repository of UPN “Veteran” Jawa Timur. A hybrid QE approach is proposed that combines pretrained FastText embeddings with domain-specific Word2Vec embeddings, with and without filtering mechanisms. System performance is evaluated using Precision@15, Recall@15, Mean Average Precision (MAP), and nDCG@15. The results show that QE can improve retrieval performance when properly controlled. The best performance is achieved by the hybrid QE with filtering, where MAP increases from 0.389 (without QE) to 0.483 and nDCG reaches 0.911. In contrast, FastText-based QE without filtering degrades performance because of query drift, in which expansion terms pull the query away from its original intent. The effectiveness of QE in semantic retrieval therefore depends strongly on the quality of the expansion terms and on the filtering strategy applied. QE is not inherently beneficial; it requires careful design to improve retrieval performance.
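The four evaluation metrics named in the abstract can be illustrated with a minimal sketch. It assumes binary relevance judgments and the standard textbook definitions; the abstract does not state whether the study uses binary or graded relevance, so the function names, the normalization choices, and the toy document IDs below are illustrative assumptions rather than the authors' implementation.

```python
import math

# Assumption: relevance is binary (a document is either relevant or not).
# `ranked` is the system's ranked list of document IDs for one query;
# `relevant` is the set of IDs judged relevant for that query.

def precision_at_k(ranked, relevant, k=15):
    """Fraction of the top-k positions occupied by relevant documents."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall_at_k(ranked, relevant, k=15):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document occurs."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def mean_average_precision(runs):
    """MAP over a list of (ranked, relevant) pairs, one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

def ndcg_at_k(ranked, relevant, k=15):
    """Binary-gain DCG of the top k, divided by the DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, d in enumerate(ranked[:k], start=1)
        if d in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

# Toy example with hypothetical document IDs.
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
print(precision_at_k(ranked, relevant, k=2))
print(recall_at_k(ranked, relevant, k=2))
print(average_precision(ranked, relevant))
print(ndcg_at_k(ranked, relevant, k=4))
```

Under these definitions, a comparison such as the reported MAP of 0.389 without QE versus 0.483 with filtered hybrid QE would correspond to calling `mean_average_precision` on the per-query rankings produced by each configuration.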
Article Details

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution-ShareAlike License (CC BY-SA 4.0) that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.