Understanding the Effectiveness of Query Expansion in IndoSBERT-Based Semantic Retrieval

Dela Puspita Lasminingrum
Eva Yulia Puspaningrum
Budi Mukhammad Mulyo

Abstract

Information retrieval systems have shifted from keyword-based approaches to semantic retrieval using transformer-based models such as BERT and its variants. Although these models capture contextual meaning, vocabulary mismatch between queries and documents remains a key challenge. Query expansion (QE) is commonly used to address this issue, but its effectiveness in semantic retrieval is inconsistent. This study analyzes the impact of query expansion on a semantic retrieval system built on a fine-tuned IndoSBERT model, using a dataset of undergraduate thesis titles and abstracts from the repository of UPN “Veteran” Jawa Timur. A hybrid QE approach is proposed that combines pretrained FastText embeddings with domain-specific Word2Vec embeddings, evaluated with and without filtering mechanisms. System performance is measured using Precision@15, Recall@15, Mean Average Precision (MAP), and nDCG@15. The results show that QE can improve retrieval performance when properly controlled. The best performance is achieved by the hybrid QE with filtering, where MAP increases from 0.389 (without QE) to 0.483 and nDCG reaches 0.911. In contrast, FastText-based QE without filtering degrades performance because of query drift. The effectiveness of QE in semantic retrieval therefore depends strongly on the quality of the expansion terms and on the filtering strategy applied. QE is not inherently beneficial; it requires careful design to improve retrieval performance.
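
For readers unfamiliar with the reported metrics, the following is a minimal sketch of how Precision@k, Recall@k, MAP, and nDCG@k are commonly computed under binary relevance. It is not the authors' evaluation code. The function names, the toy document IDs, and the choice to compute average precision over the full ranked list (rather than a cutoff at 15) are assumptions made for illustration.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document occurs.

    Assumption: normalized by the total number of relevant documents.
    """
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs):
    """MAP: average of per-query AP over (ranked, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG: DCG of the top k divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical toy query: five retrieved documents, three relevant overall.
ranked = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d6"}
scores = {
    "P@5": precision_at_k(ranked, relevant, 5),
    "R@5": recall_at_k(ranked, relevant, 5),
    "AP": average_precision(ranked, relevant),
    "nDCG@5": ndcg_at_k(ranked, relevant, 5),
}
```

In the study the cutoff is k = 15. MAP summarizes ranking quality across all queries, while nDCG additionally rewards placing relevant documents near the top of the list.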

Article Details

How to Cite
Lasminingrum, D., Puspaningrum, E., & Mulyo, B. (2026). Understanding the Effectiveness of Query Expansion in IndoSBERT-Based Semantic Retrieval. Journal of Informatics Information System Software Engineering and Applications (INISTA), 8(2), 75-89. https://doi.org/10.20895/inista.v8i2.2116
Section
Articles
