Teaching Dense Retrieval Models to Specialize with Listwise Distillation and LLM Data Augmentation
Manveer Singh Tamber, University of Waterloo, Waterloo, Ontario, Canada, mtamber@uwaterloo.ca
Suleman Kazi, Vectara, Palo Alto, California, USA, suleman@vectara.com
Vivek Sourabh, Vectara, Palo Alto, California, USA, vivek@vectara.com
Jimmy Lin, University of Waterloo, Waterloo, Ontario, Canada, jimmylin@uwaterloo.ca

arXiv:2502.19712v1 [cs.IR] 27 Feb 2025
Abstract

While the current state-of-the-art dense retrieval models exhibit strong out-of-domain generalization, they might fail to capture nuanced domain-specific knowledge. In principle, fine-tuning these models for specialized retrieval tasks should yield higher effectiveness than relying on a one-size-fits-all model, but in practice, results can disappoint. We show that standard fine-tuning methods using an InfoNCE loss can unexpectedly degrade effectiveness rather than improve it, even for domain-specific scenarios. This holds true even when applying widely adopted techniques such as hard-negative mining and negative de-noising. To address this, we explore a training strategy that uses listwise distillation from a teacher cross-encoder, leveraging rich relevance signals to fine-tune the retriever. We further explore synthetic query generation using large language models. Through listwise distillation and training with a diverse set of queries ranging from natural user searches and factual claims to keyword-based queries, we achieve consistent effectiveness gains across multiple datasets. Our results also reveal that synthetic queries can rival human-written queries in training utility. However, we also identify limitations, particularly in the effectiveness of cross-encoder teachers as a bottleneck. We release our code and scripts to encourage further research.¹

1 Introduction

Embedding models have emerged as key components of search pipelines. Dense retrieval involves embedding models mapping both queries and passages into a vector space, ensuring that relevant passages are positioned close to the corresponding queries.

The BEIR benchmark [38] demonstrated that embedding models for retrieval can struggle when evaluated using out-of-distribution retrieval tasks. The work showed that while the dense retrieval models studied generally outperformed BM25 on effectiveness metrics when evaluated within the same domain they were trained on, these same models underperformed BM25 on average when considering the diverse retrieval tasks in BEIR.

While the state-of-the-art (SOTA) in embedding models has continued to improve, obtaining strong out-of-domain generalization, there may still be an interest in better specializing models for certain retrieval domains or tasks. In this work, we focus on fine-tuning SOTA BERT-base [14] embedding models that lead the MTEB leaderboard [28] for retrieval. We specifically focus on BERT-base models for their computational efficiency and practicality for retrieval. In particular, we consider BGE-base (BAAI/bge-base-en-v1.5) [46], GTE-base (Alibaba-NLP/gte-base-en-v1.5) [23], and Arctic-embed-m (Snowflake/snowflake-arctic-embed-m-v1.5) [27]. We try to improve the effectiveness of multiple SOTA embedding models on multiple retrieval datasets, ensuring that our findings generalize across different datasets and models with their varied training regimens. We also experiment with the unsupervised variant of E5-base (intfloat/e5-base-unsupervised) [45] to investigate fine-tuning an embedding model with strong contrastive pre-training for retrieval, but without supervised fine-tuning. The unsupervised E5 model also allows us to study fine-tuning a model with a lower threat of task contamination, which we argue is prevalent within current SOTA embedding models that aim to be presented competitively in benchmarks such as BEIR.

Our findings reveal that, surprisingly, fine-tuning SOTA embedding models with a contrastive loss often hurts the effectiveness of the models, even when training to specialize for particular retrieval tasks, taking care to train with diverse queries, and de-noising hard negatives using a cross-encoder teacher. However, incorporating a listwise distillation loss alongside the contrastive loss allows for effectiveness gains across a diverse range of retrieval tasks.

Enabled by the recent advances in LLMs, we also experiment with LLM data augmentation by generating a diverse set of queries in the form of natural web search queries, questions, titles, claims, and keywords, without relying on existing query examples from the retrieval dataset. We find that training with diverse generated queries benefits retrieval across BEIR tasks, outperforming prior approaches that rely solely on synthetic user search queries or questions. Furthermore, the quality of synthetic queries is promising, showing effectiveness that is competitive with human-written queries for training retrieval models.

While our approach achieves promising retrieval effectiveness improvements across multiple datasets and models, we also study cases where our methods were not able to improve retrieval effectiveness. These cases reveal challenges with task contamination and the need for stronger cross-encoders trained with more diverse ranking tasks to compete with advances in dense retrieval.

¹ https://github.com/manveertamber/enhancing_domain_adaptation
2 Background

Previous work has explored the domain adaptation of retriever models through synthetic query generation [24, 25]. These approaches typically involved generating synthetic queries for many given passages and then fine-tuning retriever models using these passage-query pairs.

Alongside putting forward the BEIR benchmark, Thakur et al. [38] also investigated adapting retrievers to BEIR datasets using a T5-based query generator [29, 33] trained on MS MARCO. The TAS-B dense retriever [19] fine-tuned with these queries showed mixed results, underperforming the original TAS-B and BM25 on average. GPL [44] extended this work by introducing an additional step: using a cross-encoder to label query-passage pairs. The dense retriever was then trained to mimic the score margins between positive and negative query-passage pairs from the cross-encoder.

Several more recent works have introduced improved approaches to synthetic query generation and domain adaptation. InPars [5] used GPT-3 [7] for few-shot query generation to adapt rerankers instead of retrievers. InPars-v2 [20] later replaced GPT-3 with the open-source GPT-J [43]. Promptagator [12] used few-shot examples of task-specific queries and passages to generate queries more in line with those from the BEIR datasets being used for evaluation.
UDAPDR [35] focused on the domain adaptation of ColBERTv2 [36], a multi-vector retrieval model, leveraging GPT-3 and FLAN-T5 XXL [8] for query generation to train up to 10 cross-encoders that were each in turn used to annotate triples of (query, positive document, negative document) to fine-tune retrieval models.

Unlike previous approaches, we explore the generation of diverse queries and the use of listwise distillation from cross-encoders to provide strong relevance signals for adapting retrievers. We generate diverse queries without relying on few-shot examples from evaluation datasets to allow for fine-tuning that generalizes well to retrieval domains and tasks without needing specific human-written query examples. We also evaluate how generated queries compare to human-written queries, and we demonstrate the need for cross-encoder listwise distillation empirically, showing that a simpler contrastive learning setup fails to improve the retrieval effectiveness of current SOTA embedding models.

3 Methodology

3.1 Synthetic Query Generation

Given up to 100K passages randomly sampled from a retrieval corpus, we generate queries for each passage to fine-tune the retrievers. The assumption is that the content of the 100K passages should represent the retrieval task well and indicate what users might search for. We generate synthetic queries using Llama-3.1 (8B) [16] by providing up to 3 examples of passage-query pairs and then prompting the LLM to generate a query for a given passage. Given Wikipedia passages from BEIR's NQ corpus, we use GPT-4o [1] to generate high-quality queries to use as examples to guide the generation of Llama-3.1 (8B). We examine generating six different types of queries: questions, claims, titles, keywords, natural user search queries, and natural user search queries given human-written examples from MSMARCO [3].

Prior domain adaptation work primarily generated synthetic questions or search queries resembling human-written ones from MSMARCO [5, 35, 38, 44]. Promptagator [12] focused on generating task-specific queries but did so by using up to 8 examples from the same retrieval task that the models were evaluated on. In contrast, our methods generate diverse queries without having to rely on queries from the evaluation dataset.

For brevity, we do not provide the prompts to generate synthetic queries in this paper, but we make them available in our GitHub repository. In general, we prompt the LLM to generate queries of under 20 words that are addressed by the given passage. We use vLLM [22] for inference and find that 100K queries for MSMARCO passages can be generated in roughly 1-2 hours using an RTX 6000 Ada GPU, although times vary depending on passage lengths.
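To make the generation setup concrete, the following is a minimal sketch of few-shot query generation with vLLM. The prompt wording, example pair, checkpoint name, and sampling settings are illustrative assumptions rather than the configuration used in the paper; the actual prompts are released in the authors' repository.

```python
# Minimal sketch of few-shot synthetic query generation with vLLM.
# The prompt template, few-shot example, and sampling settings below are
# illustrative assumptions; the paper's exact prompts are in its repository.
from vllm import LLM, SamplingParams

FEW_SHOT = (
    "Passage: The Amazon rainforest produces roughly 20% of the world's oxygen.\n"
    "Query: how much oxygen does the amazon rainforest produce\n\n"
)  # up to 3 passage-query examples would be used (e.g., GPT-4o-written ones)

def build_prompt(passage: str, query_type: str = "natural user search query") -> str:
    return (
        f"{FEW_SHOT}"
        f"Write a {query_type} of under 20 words that is addressed by the passage.\n"
        f"Passage: {passage}\nQuery:"
    )

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # assumed checkpoint name
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32, stop=["\n"])

passages = ["Dense retrieval maps queries and passages into a shared vector space ..."]
outputs = llm.generate([build_prompt(p) for p in passages], params)
queries = [o.outputs[0].text.strip() for o in outputs]
print(queries)
```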
3.1.1 Filtering Generated Queries. To ensure the quality of generated queries, some filtering is needed. Promptagator filtered out queries if their corresponding passage did not rank first with the unadapted retriever. This may remove high-quality queries that are challenging for the retriever. To improve filtering, we make use of a stronger teacher cross-encoder. First, we discard queries where the passage isn't in the top 20 retrieved passages. Then, we filter out queries where the passage isn't ranked first with the reranker.
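A minimal sketch of this two-stage consistency filter is shown below. The `retrieve` and `rerank` callables are placeholders standing in for the retriever being fine-tuned and the RankT5-3B cross-encoder; their signatures are assumptions for illustration.

```python
# Sketch of the two-stage query filter described above (illustrative only).
# `retrieve` and `rerank` are assumed helpers wrapping the retriever being
# fine-tuned and the cross-encoder teacher, respectively.
from typing import Callable, List, Tuple

def filter_queries(
    pairs: List[Tuple[str, str]],                    # (query, source_passage_id)
    retrieve: Callable[[str, int], List[str]],       # query, k -> ranked passage ids
    rerank: Callable[[str, List[str]], List[str]],   # query, ids -> reranked ids
    k: int = 20,
) -> List[Tuple[str, str]]:
    kept = []
    for query, pid in pairs:
        top_k = retrieve(query, k)
        if pid not in top_k:                 # stage 1: passage must appear in the top 20
            continue
        if rerank(query, top_k)[0] != pid:   # stage 2: reranker must rank it first
            continue
        kept.append((query, pid))
    return kept
```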
3.2 Evaluation

We evaluate retrieval effectiveness after fine-tuning retrievers to adapt to BEIR [38] corpora and the MSMARCO [3] passage ranking corpus. We report NDCG@10 and Recall@100 scores to evaluate the fine-tuned retrievers. For BEIR datasets, we focus on the TREC-COVID [40], NFCorpus [6], FiQA [26], SCIDOCS [9], ArguAna [41], Touché-2020 [4], DBPedia [18], FEVER [39], Climate-FEVER [15], and SciFact [42] datasets. We choose these datasets because their corpora are under open licenses and they represent a diversity of retrieval tasks offering varying types of queries (e.g., factual claims, opinion-based questions), corpora (e.g., Wikipedia, scientific abstracts, forum posts), and topics (e.g., finance, COVID-19, climate change). We also consider MSMARCO passage ranking to examine our fine-tuning methods in-domain. For MSMARCO passages, we sample 200K passages to generate queries instead of 100K to allow for greater possible effectiveness boosts, because MSMARCO passage ranking is a more general retrieval task that should already be in-domain for retrieval models. After fine-tuning with MSMARCO passages, we evaluate using the TREC Deep Learning Tracks from 2019 (DL19) and 2020 (DL20) [10, 11].

3.2.1 Prior Fine-Tuning. Table 1 presents NDCG@10 scores across the studied datasets after reranking BGE-retrieved passages with RankT5-3B. We focus on RankT5-3B [48] as the reranker due to its strong effectiveness and generalization to out-of-domain tasks [32, 37, 47, 48]. While reranking generally improves effectiveness, supporting the motivation for distilling rerankers into retrievers, we observe score drops on FEVER, Climate-FEVER, ArguAna, and SCIDOCS after reranking. This is unexpected given rerankers' typical advantages over retrievers, such as the ability to judge passage relevance directly with respect to the query, and RankT5-3B's significantly larger size (3B parameters) compared to the BGE retriever (110M parameters).

Dataset          BGE    RankT5-3B
DL19             70.2   75.1
DL20             67.7   77.2
TREC-COVID       78.1   85.7
NFCorpus         37.3   40.4
FiQA             40.6   52.6
SCIDOCS          21.7   19.3
ArguAna          63.6   34.9
Touché-2020      25.7   38.6
DBPedia          40.7   48.1
FEVER            86.3   84.9
Climate-FEVER    31.2   27.3
SciFact          74.1   77.1

Table 1: NDCG@10 scores for BGE and for RankT5-3B reranking the top-100 BGE-retrieved passages. Best scores are bolded in the original.

However, the BGE model and the other embedding models examined have undergone extensive fine-tuning. Notably, while RankT5 is trained only on MSMARCO and NQ [21], GTE [23] and Arctic [27] include FEVER and other BEIR datasets in their training data, along with pretraining that involves scientific abstracts with corresponding titles as queries, potentially inflating their performance on SCIDOCS and similar datasets. The BGE model's training data is not clearly documented [46], but we suspect it is similarly suited for BEIR evaluation. Table 1 thus both highlights the promise of cross-encoder distillation and reveals challenges in fair evaluation due to potential task contamination and data leakage for retrievers trained on BEIR datasets, potentially to remain competitive in BEIR and MTEB evaluation.
3.2.2 Passage De-duplication. Similar to previous work [31], we identify many near-duplicate passages in the MSMARCO and BEIR corpora, which may reduce generated query diversity and hinder contrastive learning if duplicates appear in training batches. To address this, we normalize text by considering whitespace, case, and punctuation, and then remove passages that are substrings of others. This approach eliminated over 950K of MSMARCO's 8.8M passages.
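A minimal sketch of this normalization-plus-substring check follows. The exact normalization rules (lowercasing, stripping punctuation, collapsing whitespace) are our assumptions, and the quadratic scan is written for clarity rather than for 8.8M-passage scale.

```python
# Sketch of the de-duplication step: normalize passages, then drop any passage
# whose normalized text is a substring of another's. Normalization rules are
# assumed, and the O(n^2) scan is for clarity, not for scale.
import re
import string
from typing import Dict, List

def normalize(text: str) -> str:
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def dedupe(passages: Dict[str, str]) -> List[str]:
    norm = {pid: normalize(t) for pid, t in passages.items()}
    # sort by length so each passage is only checked against longer ones
    ordered = sorted(norm.items(), key=lambda kv: len(kv[1]))
    kept: List[str] = []
    for i, (pid, text) in enumerate(ordered):
        if any(text in longer for _, longer in ordered[i + 1:]):
            continue  # exact duplicate or substring of a longer passage
        kept.append(pid)
    return kept
```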
3.3 Model Training

3.3.1 Cross-Encoder Teacher. We use RankT5-3B [48] as a cross-encoder teacher. We first retrieve 20 passages for each generated query using the retriever to be fine-tuned. We then rerank these 20 passages for the query using RankT5-3B, verifying that the corresponding passage ranks first as described in Section 3.1.1. We normalize scores across all queries using min-max normalization, but using the 1st and 99th percentiles with clipping to scale scores to [0, 1]. The time for reranking varies depending on passage lengths; however, reranking 20 MSMARCO passages each for 100K queries takes roughly 5 hours using an RTX 6000 Ada GPU.
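A short sketch of this percentile-clipped min-max normalization is given below; reading "across all queries" as pooling all teacher scores before computing the percentiles is our interpretation of the text.

```python
# Sketch of percentile-clipped min-max normalization of teacher scores.
# Pooling scores across all queries before taking percentiles is an assumption.
import numpy as np

def normalize_teacher_scores(scores: np.ndarray) -> np.ndarray:
    """Map raw cross-encoder scores to [0, 1] using the 1st/99th percentiles."""
    lo, hi = np.percentile(scores, [1, 99])
    clipped = np.clip(scores, lo, hi)
    return (clipped - lo) / (hi - lo + 1e-9)

raw = np.random.randn(100_000 * 20)  # e.g., 20 reranked passages per query
norm = normalize_teacher_scores(raw)
print(norm.min(), norm.max())  # both within [0, 1]
```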
3.3.2 Cross-encoder Listwise Distillation. RocketQAv2 [34] proposed jointly training a dense retriever and a cross-encoder, allowing both models to learn from each other. We use a similar formulation in our work to distill a cross-encoder into retrievers given a relevance distribution:

$$\tilde{s}(q_i, p_i) = \frac{e^{s(q_i, p_i)/\tau}}{e^{s(q_i, p_i)/\tau} + \sum_{k=1}^{K} e^{s(q_i, p_{i_k})/\tau}} \qquad (1)$$

based on a scoring function $s_{de}(q, p)$ for the dense retriever, which is the cosine similarity of the query and passage embeddings, and $s_{ce}(q, p)$ for the cross-encoder, which is the relevance score from the cross-encoder for a passage with respect to a query. We minimize the KL divergence of the two relevance distributions $\tilde{s}_{de}(q, p)$ and $\tilde{s}_{ce}(q, p)$ over all the $K = 19$ hard negative passages and the relevant passage for each query. We find that temperature parameters of $\tau = 0.05$ for $\tilde{s}_{de}(q, p)$ and $\tau = 0.3$ for $\tilde{s}_{ce}(q, p)$ work well for distilling into the retriever models. We tuned these hyperparameters by training BGE on MSMARCO and evaluating on DL19 and DL20, and we use these values for all experiments.
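The following PyTorch sketch shows one way to implement this listwise distillation term, with the student scoring by cosine similarity and the teacher supplying normalized cross-encoder scores; the tensor shapes and the batch-mean reduction are our assumptions, not details taken from the paper.

```python
# Sketch of the listwise distillation loss (Eq. 1): KL divergence between the
# student's and teacher's relevance distributions over 1 positive + K hard
# negatives per query. Shapes and reduction are illustrative assumptions.
import torch
import torch.nn.functional as F

def listwise_distill_loss(
    q_emb: torch.Tensor,           # (B, d) query embeddings
    p_emb: torch.Tensor,           # (B, 1 + K, d) positive + hard-negative embeddings
    teacher_scores: torch.Tensor,  # (B, 1 + K) normalized cross-encoder scores
    tau_student: float = 0.05,
    tau_teacher: float = 0.3,
) -> torch.Tensor:
    q = F.normalize(q_emb, dim=-1)
    p = F.normalize(p_emb, dim=-1)
    student_scores = torch.einsum("bd,bkd->bk", q, p)   # cosine similarities
    student_logp = F.log_softmax(student_scores / tau_student, dim=-1)
    teacher_prob = F.softmax(teacher_scores / tau_teacher, dim=-1)
    # KL(teacher || student), averaged over queries in the batch
    return F.kl_div(student_logp, teacher_prob, reduction="batchmean")
```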
3.3.3 Positive-Aware Contrastive Loss. Much of the recent work on dense retrieval, including all of the models considered in this work [23, 27, 45, 46], trains models with an InfoNCE [30] contrastive loss that takes advantage of in-batch or mined hard negatives to learn to represent text with embeddings contrastively. Many of these works have also argued for the importance of training with hard negatives to train effective dense retrieval models [23, 27]. In our work, we contrast with every hard-negative passage in the batch to allow for a larger contrastive batch size using the following loss, where $K = 19$ hard negatives $p_{j_k}$ are mined for each query $q_j$ and used for training. We use a temperature of $\tau = 0.01$, which is the value commonly used with the models considered [23, 27, 45, 46]:

$$-\frac{1}{n}\sum_{i=1}^{n}\log\frac{e^{s_{de}(q_i, p_i)/\tau}}{\sum_{j=1}^{n}\left(e^{s_{de}(q_i, p_j)/\tau}+\sum_{k=1}^{K}e^{s_{de}(q_i, p_{j_k})/\tau}\right)} \qquad (2)$$
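A PyTorch sketch of this contrastive term, under the same shape assumptions as the distillation sketch above, treats the batch's positives and all mined hard negatives as the candidate pool for every query.

```python
# Sketch of the positive-aware InfoNCE loss (Eq. 2): each query is contrasted
# against every positive and every mined hard negative in the batch.
# Shapes follow the distillation sketch above and are illustrative.
import torch
import torch.nn.functional as F

def contrastive_loss(
    q_emb: torch.Tensor,   # (B, d) query embeddings
    p_emb: torch.Tensor,   # (B, 1 + K, d) per-query positive + K hard negatives
    tau: float = 0.01,
) -> torch.Tensor:
    B, num_cand, d = p_emb.shape
    q = F.normalize(q_emb, dim=-1)
    cand = F.normalize(p_emb, dim=-1).reshape(B * num_cand, d)
    scores = q @ cand.T / tau                   # (B, B * (1 + K)) cosine / tau
    # the positive for query i sits at flattened index i * num_cand
    targets = torch.arange(B, device=scores.device) * num_cand
    return F.cross_entropy(scores, targets)
```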
A recent study explored techniques for mining hard negatives while acknowledging that some mined "negatives" may be relevant despite lacking relevance labels, which hurts contrastive learning [13]. It found that the best method for filtering false negatives is to exclude passages with a relevance score above a certain percentage of the relevant passage's score, using a teacher embedding model. Specifically, passages scoring above 95% of the relevant passage's score were removed. In our work, we tune this threshold using the cross-encoder's normalized scores. Figure 1 shows MAP, NDCG@10, and Recall@100 scores on DL19 and DL20 by the threshold applied for the BGE-base model trained using MSMARCO passages and the contrastive loss described above. We find that a threshold of 60% works well and is a reasonable moderate threshold that balances the retrieval metrics.

[Figure 1: Retrieval effectiveness scores on DL19 and DL20 at different hard-negative filtering thresholds during BGE-base fine-tuning on MSMARCO passages.]
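A small sketch of this de-noising rule using the normalized teacher scores is shown below; the 0.6 default mirrors the threshold reported above, and the data layout is an assumption for illustration.

```python
# Sketch of hard-negative de-noising: drop mined "negatives" whose normalized
# cross-encoder score exceeds a fraction of the positive's score (0.6 here,
# matching the threshold found to work well above). Data layout is assumed.
from typing import Dict, List

def denoise_negatives(
    pos_score: float,               # teacher score of the relevant passage
    neg_scores: Dict[str, float],   # teacher scores of mined negative passages
    threshold: float = 0.6,
) -> List[str]:
    """Return ids of negatives that are kept for contrastive training."""
    cutoff = threshold * pos_score
    return [pid for pid, score in neg_scores.items() if score <= cutoff]

kept = denoise_negatives(0.9, {"p1": 0.85, "p2": 0.40, "p3": 0.60})
print(kept)  # ['p2'] -- p1 and p3 exceed 0.6 * 0.9 and are treated as likely relevant
```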
3.3.4 Training Setup. We train models using the sum of the contrastive loss and the listwise distillation loss, finding that a combination works well with a 0.1 weight on the contrastive loss. This weight was tuned based on dev losses when fine-tuning BGE on MSMARCO passages. We examine the choice of loss function further in Section 4.1. We train our models on a single 48GB RTX 6000 Ada or 48GB L40S GPU depending on availability. Models were trained with a learning rate of 2e-4 and 4096 queries per batch, leveraging GradCache [17] to support large contrastive batch sizes, following findings that larger batch sizes help train more effective embedding models for retrieval [23, 27, 46]. We use a 90%/10% train/dev split and train for up to 30 epochs, selecting the model with the best dev loss and stopping if the dev loss fails to improve for two epochs. Training times vary by dataset; for MSMARCO passages, training took 48-60 hours for all models considered.
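Putting the two terms together, a minimal sketch of the combined objective with the 0.1 contrastive weight is shown below; it assumes the `listwise_distill_loss` and `contrastive_loss` sketches above are in scope and is only a toy illustration of how the two losses are summed.

```python
# Sketch of the combined training objective: the listwise distillation loss plus
# a 0.1-weighted contrastive loss, reusing the two sketches defined above.
import torch

def combined_loss(q_emb, p_emb, teacher_scores, contrastive_weight: float = 0.1):
    distill = listwise_distill_loss(q_emb, p_emb, teacher_scores)
    contrast = contrastive_loss(q_emb, p_emb)
    return distill + contrastive_weight * contrast

# Toy check with random tensors (B=4 queries, 1 positive + 19 negatives, d=768):
B, K, d = 4, 19, 768
q = torch.randn(B, d, requires_grad=True)
p = torch.randn(B, 1 + K, d, requires_grad=True)
combined_loss(q, p, torch.rand(B, 1 + K)).backward()
```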
4 Results

4.1 Evaluating the Effectiveness of our Methods

Table 3 first shows that training with a contrastive loss underperforms training with a listwise loss and generally hurts effectiveness over the base models, except for when fine-tuning the unsupervised E5 model. Nonetheless, training with a listwise distillation loss consistently improves effectiveness over the base models. While combining the listwise loss and contrastive loss does not generally clearly beat training with a listwise loss alone, it does do so for E5-unsupervised, which suggests that it is most helpful to incorporate a contrastive loss when the model has not already undergone strong contrastive fine-tuning. For further experiments, we train with the combined loss as it provides arguably the strongest baseline.

                             DL19           DL20           SciFact        FiQA
Model                        NDCG   Recall  NDCG   Recall  NDCG   Recall  NDCG   Recall
BGE                          70.2   60.9    67.7   71.5†   74.1   96.7    40.6†  74.2
 + FT with contrastive loss  68.6   61.0    66.8   70.8    72.1   96.0    39.5   70.7
 + FT with listwise loss     71.8   63.6    69.7   75.1    76.3   97.0    44.8   75.9
 + FT with combined loss     71.8   63.0    69.7   75.5    76.2   97.0    44.3   75.8
GTE                          71.9   62.1†   71.5   69.8†   75.5   97.3    48.7   81.7
 + FT with contrastive loss  71.5   62.6    67.8   69.8    71.2   95.5    45.1   77.1
 + FT with listwise loss     72.3   64.7    72.6   73.8    75.4   97.7    48.5   80.2
 + FT with combined loss     71.8   65.6    72.6   73.3    76.3   97.7    49.5   81.4
Arctic                       74.4   64.7    72.1   74.2    70.5†  94.8    42.5†  74.8
 + FT with contrastive loss  72.2   64.3    71.9   73.1    69.1   94.9    41.8   73.0
 + FT with listwise loss     74.4   67.4    73.6   75.0    73.2   96.0    45.2   75.9
 + FT with combined loss     74.3   67.2    73.6   75.4    73.6   95.7    45.5   75.9
E5-unsupervised              56.3†  52.6†   54.6†  62.1†   74.3   98.7    40.1†  71.8†
 + FT with contrastive loss  67.9   61.7    68.2   71.7    72.3   95.0    40.5   73.8
 + FT with listwise loss     69.9   63.7    72.7   74.3    76.2   96.7    44.0   75.5
 + FT with combined loss     72.4   63.3    73.2   75.2    76.3   97.0    45.4   77.0

Table 3: NDCG@10 and Recall@100 for the original and fine-tuned (FT) models, examining the effect of the training loss. A † indicates a significant difference between the original and FT model with the combined loss, based on a one-sided, paired t-test (p < 0.05), with Holm–Bonferroni correction across all datasets and metrics for each model's results. Best scores for each model are bolded in the original.

Table 2 shows that both NDCG@10 and Recall@100 consistently improve across all three models for most datasets. However, exceptions include FEVER, Climate-FEVER, SCIDOCS, and ArguAna, where Section 3.2.1 highlighted challenges with the reranker's relative effectiveness on these particular datasets, potentially because of the prior fine-tuning of the retriever models on these datasets or retrieval tasks, unlike the reranker. Additionally, some scores marginally decrease after fine-tuning, such as DL19 NDCG for the GTE and Arctic models, FiQA recall for the GTE model, and DBPedia recall and TREC-COVID NDCG for the Arctic model. These results present the difficulty of performing and evaluating domain adaptation. The table also provides scores from Promptagator, which is the most recent domain adaptation work for single-vector dense retrievers. Promptagator's scores are generally lower than the scores for the base models even without further fine-tuning, suggesting that Promptagator's domain adaptation methods have quickly been undermined by the SOTA models.

                 BGE                        GTE                        Arctic                     Promptagator
Dataset          nDCG         Recall        nDCG         Recall        nDCG         Recall        nDCG
DL19             70.2 → 71.8  60.9 → 63.0   71.9 → 71.8  62.1 → 65.6   74.4 → 74.3  64.7 → 67.2   –
DL20             67.7 → 69.7  71.5 → 75.6   71.5 → 72.6  69.8 → 72.3   72.1 → 73.6  74.2 → 75.4   –
TREC-COVID       78.1 → 82.2  14.1 → 15.8   75.3 → 78.2  14.0 → 14.8   82.2 → 80.9  14.8 → 14.9   75.6
NFCorpus         37.3 → 38.3  33.7 → 35.8   35.3 → 37.9  33.1 → 34.6   36.1 → 37.4  32.4 → 33.7   33.4
FiQA             40.6 → 44.3  74.2 → 75.8   48.7 → 49.5  81.7 → 81.4   42.5 → 45.5  74.8 → 75.9   46.2
ArguAna          63.6 → 61.4  99.2 → 99.2   62.1 → 60.9  99.2 → 99.4   56.5 → 58.6  98.4 → 99.2   59.4
Touché-2020      25.7 → 35.3  48.7 → 50.6   27.5 → 32.1  48.3 → 49.0   33.2 → 37.9  50.0 → 52.4   34.5
DBPedia          40.7 → 45.5  53.0 → 56.3   36.9 → 44.7  46.9 → 55.2   44.7 → 45.1  58.7 → 57.2   38.0
SCIDOCS          21.7 → 19.4  49.6 → 45.5   21.7 → 20.2  50.1 → 46.6   20.0 → 18.6  42.3 → 43.5   18.4
FEVER            86.3 → 80.0  97.2 → 95.7   92.1 → 82.4  97.5 → 96.0   85.6 → 80.4  97.6 → 96.1   77.0
Climate-FEVER    31.2 → 25.1  63.6 → 59.3   40.1 → 30.3  71.7 → 63.8   34.7 → 27.2  66.7 → 63.1   16.8
SciFact          74.1 → 76.2  96.7 → 97.0   75.5 → 76.3  97.3 → 97.7   70.5 → 73.6  94.8 → 95.7   65.0

Table 2: nDCG@10 and Recall@100 for the models (before fine-tuning → after fine-tuning). Bolded values in the original indicate the best score for each model on each dataset. Promptagator scores (nDCG only) are shown on the right for reference.
4.2 Evaluating the Quality of Generated Queries

4.2.1 Evaluating the Role of Query Types. Table 4 shows that while training with the query type that most aligns with the retrieval task is helpful, training with a diverse set of queries generally results in the best effectiveness and is the preferred approach. In SciFact evaluation, where queries are scientific claims, training with MSMARCO-style user queries is most beneficial, followed by training with claims. In MSMARCO DL19 and DL20 evaluation, training with MSMARCO-style user queries scores best on DL20, but training with titles scores marginally better on DL19. Queries in FiQA are user-written questions, and training with MSMARCO-style queries scores best. Generally, training with all queries allows for the strongest scores, regardless of the particular query type that most aligns with the retrieval task.

                       #Q                        NDCG@10
Query Type             DL19+20  FiQA   SciFact   DL19   DL20   FiQA   SciFact
(No fine-tuning)       –        –      –         56.3   54.6   40.1   74.4
MSMARCO User Queries   96K      39K    5K        72.8   72.8   43.7   75.2
Claims                 136K     46K    5K        65.3   68.0   40.7   74.5
Titles                 99K      34K    5K        72.9   71.6   42.9   72.3
All Queries            646K     240K   30K       72.4   73.2   45.4   76.3

Table 4: Retrieval effectiveness when training the E5-unsupervised model with different query types and all generated queries. #Q gives the number of queries for training.

4.2.2 Comparing Human-written and Synthetic Queries. Table 5 compares training the E5-unsupervised model using human-written queries from the MSMARCO training set to synthetic queries generated by an LLM in either a few-shot setting (given three MSMARCO passage-query examples) or a zero-shot setting (without examples). We focus on queries for 200K randomly sampled MSMARCO passages from the training set.

Applying the filtering process from Section 3.1.1, we find that human-written queries are filtered at a much higher rate, leaving only 56K of the original 200K compared to 96K for the few-shot synthetic queries and 98K for the zero-shot synthetic queries. This is notable since RankT5-3B was fine-tuned on these same human-written queries but still filtered many out, suggesting that MSMARCO's labeled passages may not always be the most relevant for their query, as argued in previous work [2].

To ensure a fair comparison, we also train with a 56K subset of synthetic queries, aligning with the human-written query set on as many passages as possible. In this controlled setting, models trained with few-shot synthetic queries outperform those trained with human-written queries on DL19, while zero-shot queries achieve slightly higher recall on DL20. When we use all available synthetic queries for training, scores improve further, with the few-shot synthetic queries generally surpassing the zero-shot ones except on DL19 when considering NDCG. Nonetheless, models trained on human-written queries retain the highest NDCG on DL20.

Overall, these findings suggest that relatively lightweight LLMs, such as Llama-3.1 (8B), can generate synthetic queries that are competitive with human-written ones in training usefulness.

                                                 DL19           DL20
Query Type                          # Queries    NDCG   Recall  NDCG   Recall
(No fine-tuning)                    –            56.3   52.6    54.6   62.1
User Queries (Human)                56K          71.6   63.9    73.4   73.0
User Queries Few-Shot (Synthetic)   56K          71.7   64.9    70.3   72.1
User Queries Few-Shot (Synthetic)   96K          72.8   65.3    72.8   74.5
User Queries Zero-Shot (Synthetic)  56K          70.8   62.2    70.5   73.1
User Queries Zero-Shot (Synthetic)  98K          74.0   62.9    71.8   73.7

Table 5: Retrieval effectiveness for the E5-unsupervised model fine-tuned with human-written and synthetic queries. For the synthetic queries, results are provided both for a subset of 56K queries, to provide a fair comparison, and for the full query set.

5 Conclusion

While previous methods in domain adaptation and dense retrieval involving contrastive learning may not be sufficient to enhance the retrieval effectiveness of SOTA embedding models on particular tasks, cross-encoder distillation and diverse synthetic queries are promising, allowing for effectiveness gains across varied datasets and models. We show that synthetic data can rival human-written queries and relevance judgements for training. Nonetheless, we find that cross-encoder effectiveness is a limiting factor for training effective embedding models, suggesting the importance of stronger cross-encoder teachers to further strengthen dense retrievers.
References

[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 Technical Report. arXiv:2303.08774 (2023).
[2] Negar Arabzadeh, Alexandra Vtyurina, Xinyi Yan, and Charles L. A. Clarke. 2022. Shallow pooling for sparse labels. Inf. Retr. 25, 4 (2022), 365–385.
[3] Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, and Tong Wang. 2016. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. arXiv:1611.09268v3 (2016).
[4] Alexander Bondarenko, Maik Fröbe, Meriem Beloucif, Lukas Gienapp, Yamen Ajjour, Alexander Panchenko, Chris Biemann, Benno Stein, Henning Wachsmuth, Martin Potthast, and Matthias Hagen. 2020. Overview of Touché 2020: Argument Retrieval. In Experimental IR Meets Multilinguality, Multimodality, and Interaction. Springer, 384–395.
[5] Luiz Bonifacio, Hugo Abonizio, Marzieh Fadaee, and Rodrigo Nogueira. 2022. InPars: Data Augmentation for Information Retrieval using Large Language Models. arXiv:2202.05144 (2022).
[6] Vera Boteva, Demian Gholipour, Artem Sokolov, and Stefan Riezler. 2016. A Full-Text Learning to Rank Dataset for Medical Information Retrieval. In Advances in Information Retrieval. Springer, 716–722.
[7] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, Vol. 33, 1877–1901.
[8] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2024. Scaling Instruction-Finetuned Language Models. Journal of Machine Learning Research 25, 70 (2024), 1–53.
[9] Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, and Daniel Weld. 2020. SPECTER: Document-level Representation Learning using Citation-informed Transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2270–2282.
[10] Nick Craswell, Bhaskar Mitra, Emine Yilmaz, and Daniel Campos. 2020. Overview of the TREC 2020 Deep Learning Track. In Proceedings of the Twenty-Ninth Text REtrieval Conference (TREC 2020).
[11] Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Ellen M. Voorhees. 2019. Overview of the TREC 2019 Deep Learning Track. In Proceedings of the Twenty-Eighth Text REtrieval Conference (TREC 2019).
[12] Zhuyun Dai, Vincent Y. Zhao, Ji Ma, Yi Luan, Jianmo Ni, Jing Lu, Anton Bakalov, Kelvin Guu, Keith B. Hall, and Ming-Wei Chang. 2022. Promptagator: Few-shot Dense Retrieval from 8 Examples. arXiv:2209.11755 (2022).
[13] Gabriel de Souza P. Moreira, Radek Osmulski, Mengyao Xu, Ronay Ak, Benedikt Schifferer, and Even Oldridge. 2025. NV-Retriever: Improving Text Embedding Models with Effective Hard-Negative Mining. arXiv:2407.15831 (2025).
[14] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186.
[15] Thomas Diggelmann, Jordan Boyd-Graber, Jannis Bulian, Massimiliano Ciaramita, and Markus Leippold. 2020. CLIMATE-FEVER: A Dataset for Verification of Real-World Climate Claims. arXiv:2012.00614 (2020).
[16] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The Llama 3 Herd of Models. arXiv:2407.21783 (2024).
[17] Luyu Gao, Yunyi Zhang, Jiawei Han, and Jamie Callan. 2021. Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup. In Proceedings of the 6th Workshop on Representation Learning for NLP.
[18] Faegheh Hasibi, Fedor Nikolaev, Chenyan Xiong, Krisztian Balog, Svein Erik Bratsberg, Alexander Kotov, and Jamie Callan. 2017. DBpedia-Entity v2: A Test Collection for Entity Search. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '17), 1265–1268.
[19] Sebastian Hofstätter, Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin, and Allan Hanbury. 2021. Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '21), 113–122.
[20] Vitor Jeronymo, Luiz Bonifacio, Hugo Abonizio, Marzieh Fadaee, Roberto Lotufo, Jakub Zavrel, and Rodrigo Nogueira. 2023. InPars-v2: Large Language Models as Efficient Dataset Generators for Information Retrieval. arXiv:2301.01820 (2023).
[21] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural Questions: A Benchmark for Question Answering Research. Transactions of the Association for Computational Linguistics 7 (2019), 452–466.
[22] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
[23] Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. 2023. Towards General Text Embeddings with Multi-stage Contrastive Learning. arXiv:2308.03281 (2023).
[24] Davis Liang, Peng Xu, Siamak Shakeri, Cicero Nogueira dos Santos, Ramesh Nallapati, Zhiheng Huang, and Bing Xiang. 2020. Embedding-based Zero-shot Retrieval through Query Generation. arXiv:2009.10270 (2020).
[25] Ji Ma, Ivan Korotkov, Yinfei Yang, Keith Hall, and Ryan McDonald. 2021. Zero-shot Neural Passage Retrieval via Domain-targeted Synthetic Question Generation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 1075–1088.
[26] Macedo Maia, Siegfried Handschuh, André Freitas, Brian Davis, Ross McDermott, Manel Zarrouk, and Alexandra Balahur. 2018. WWW'18 Open Challenge: Financial Opinion Mining and Question Answering. In Companion Proceedings of The Web Conference 2018 (WWW '18), 1941–1942.
[27] Luke Merrick, Danmei Xu, Gaurav Nuti, and Daniel Campos. 2024. Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models. arXiv:2405.05374 (2024).
[28] Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. 2023. MTEB: Massive Text Embedding Benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2014–2037.
[29] Rodrigo Nogueira and Jimmy Lin. 2019. From doc2query to docTTTTTquery. Online preprint (2019).
[30] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation Learning with Contrastive Predictive Coding. arXiv:1807.03748 (2018).
[31] Ronak Pradeep, Nandan Thakur, Sahel Sharifymoghaddam, Eric Zhang, Ryan Nguyen, Daniel Campos, Nick Craswell, and Jimmy Lin. 2024. Ragnarök: A Reusable RAG Framework and Baselines for TREC 2024 Retrieval-Augmented Generation Track. arXiv:2406.16828 (2024).
[32] Zhen Qin, Rolf Jagerman, Kai Hui, Honglei Zhuang, Junru Wu, Jiaming Shen, Tianqi Liu, Jialu Liu, Donald Metzler, Xuanhui Wang, et al. 2023. Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting. arXiv:2306.17563 (2023).
[33] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research 21, 1 (2020), 5485–5551.
[34] Ruiyang Ren, Yingqi Qu, Jing Liu, Wayne Xin Zhao, QiaoQiao She, Hua Wu, Haifeng Wang, and Ji-Rong Wen. 2021. RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2825–2835.
[35] Jon Saad-Falcon, Omar Khattab, Keshav Santhanam, Radu Florian, Martin Franz, Salim Roukos, Avirup Sil, Md Sultan, and Christopher Potts. 2023. UDAPDR: Unsupervised Domain Adaptation via LLM Prompting and Distillation of Rerankers. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 11265–11279.
[36] Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. 2022. ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 3715–3734.
[37] Manveer Singh Tamber, Ronak Pradeep, and Jimmy Lin. 2023. Scaling Down, LiTting Up: Efficient Zero-Shot Listwise Reranking with Seq2seq Encoder-Decoder Models. arXiv:2312.16098 (2023).
[38] Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models. arXiv:2104.08663 (2021).
[39] James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a Large-scale Dataset for Fact Extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 809–819.
[40] Ellen Voorhees, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, William R. Hersh, Kyle Lo, Kirk Roberts, Ian Soboroff, and Lucy Lu Wang. 2021. TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection. In ACM SIGIR Forum, Vol. 54, 1–12.
[41] Henning Wachsmuth, Shahbaz Syed, and Benno Stein. 2018. Retrieval of the Best Counterargument without Prior Topic Knowledge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 241–251.
[42] David Wadden, Shanchuan Lin, Kyle Lo, Lucy Lu Wang, Madeleine van Zuylen, Arman Cohan, and Hannaneh Hajishirzi. 2020. Fact or Fiction: Verifying Scientific Claims. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 7534–7550.
[43] Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax.
[44] Kexin Wang, Nandan Thakur, Nils Reimers, and Iryna Gurevych. 2022. GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2345–2360.
[45] Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. Text Embeddings by Weakly-Supervised Contrastive Pre-training. arXiv:2212.03533 (2022).
[46] Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff, Defu Lian, and Jian-Yun Nie. 2024. C-Pack: Packed Resources For General Chinese Embeddings. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '24), 641–649.
[47] Soyoung Yoon, Eunbi Choi, Jiyeon Kim, Hyeongu Yun, Yireun Kim, and Seung-won Hwang. 2024. ListT5: Listwise Reranking with Fusion-in-Decoder Improves Zero-shot Retrieval. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2287–2308.
[48] Honglei Zhuang, Zhen Qin, Rolf Jagerman, Kai Hui, Ji Ma, Jing Lu, Jianmo Ni, Xuanhui Wang, and Michael Bendersky. 2023. RankT5: Fine-Tuning T5 for Text Ranking with Ranking Losses. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2308–2313.