Teaching Dense Retrieval Models to Specialize with Listwise Distillation and LLM Data Augmentation
Manveer Singh Tamber, University of Waterloo, Waterloo, Ontario, Canada, mtamber@uwaterloo.ca
Suleman Kazi, Vectara, Palo Alto, California, USA, suleman@vectara.com
Vivek Sourabh, Vectara, Palo Alto, California, USA, vivek@vectara.com
Jimmy Lin, University of Waterloo, Waterloo, Ontario, Canada, jimmylin@uwaterloo.ca

arXiv:2502.19712v1 [cs.IR] 27 Feb 2025
Abstract

While the current state-of-the-art dense retrieval models exhibit strong out-of-domain generalization, they might fail to capture nuanced domain-specific knowledge. In principle, fine-tuning these models for specialized retrieval tasks should yield higher effectiveness than relying on a one-size-fits-all model, but in practice, results can disappoint. We show that standard fine-tuning methods using an InfoNCE loss can unexpectedly degrade effectiveness rather than improve it, even for domain-specific scenarios. This holds true even when applying widely adopted techniques such as hard-negative mining and negative de-noising. To address this, we explore a training strategy that uses listwise distillation from a teacher cross-encoder, leveraging rich relevance signals to fine-tune the retriever. We further explore synthetic query generation using large language models. Through listwise distillation and training with a diverse set of queries ranging from natural user searches and factual claims to keyword-based queries, we achieve consistent effectiveness gains across multiple datasets. Our results also reveal that synthetic queries can rival human-written queries in training utility. However, we also identify limitations, particularly in the effectiveness of cross-encoder teachers as a bottleneck. We release our code and scripts to encourage further research.¹

1 Introduction

Embedding models have emerged as key components of search pipelines. Dense retrieval involves embedding models mapping both queries and passages into a vector space, ensuring that relevant passages are positioned close to the corresponding queries.

The BEIR benchmark [38] demonstrated that embedding models for retrieval can struggle when evaluated using out-of-distribution retrieval tasks. The work showed that while the dense retrieval models studied generally outperformed BM25 on effectiveness metrics when evaluated within the same domain they were trained on, these same models underperformed BM25 on average when considering the diverse retrieval tasks in BEIR.

While the state-of-the-art (SOTA) in embedding models has continued to improve, obtaining strong out-of-domain generalization, there may still be an interest in better specializing models for certain retrieval domains or tasks. In this work, we focus on fine-tuning SOTA BERT-base [14] embedding models that lead the MTEB leaderboard [28] for retrieval. We specifically focus on BERT-base models for their computational efficiency and practicality for retrieval. In particular, we consider BGE-base (BAAI/bge-base-en-v1.5) [46], GTE-base (Alibaba-NLP/gte-base-en-v1.5) [23], and Arctic-embed-m (Snowflake/snowflake-arctic-embed-m-v1.5) [27]. We try to improve the effectiveness of multiple SOTA embedding models on multiple retrieval datasets, ensuring that our findings generalize across different datasets and models with their varied training regimens. We also experiment with the unsupervised variant of E5-base (intfloat/e5-base-unsupervised) [45] to investigate fine-tuning an embedding model with strong contrastive pre-training for retrieval, but without supervised fine-tuning. The unsupervised E5 model also allows us to study fine-tuning a model with a lower threat of task contamination, which we argue is prevalent within current SOTA embedding models that aim to be presented competitively in benchmarks such as BEIR.

Our findings reveal that, surprisingly, fine-tuning SOTA embedding models with a contrastive loss often hurts the effectiveness of the models, even when training to specialize for particular retrieval tasks, taking care to train with diverse queries, and de-noising hard negatives using a cross-encoder teacher. However, incorporating a listwise distillation loss alongside the contrastive loss allows for effectiveness gains across a diverse range of retrieval tasks.

Enabled by the recent advances in LLMs, we also experiment with LLM data augmentation by generating a diverse set of queries in the form of natural web search queries, questions, titles, claims, and keywords, without relying on existing query examples from the retrieval dataset. We find that training with diverse generated queries benefits retrieval across BEIR tasks, outperforming prior approaches that rely solely on synthetic user search queries or questions. Furthermore, the quality of synthetic queries is promising, showing effectiveness that is competitive with human-written queries for training retrieval models.

While our approach achieves promising retrieval effectiveness improvements across multiple datasets and models, we also study cases where our methods were not able to improve retrieval effectiveness. These cases reveal challenges with task contamination and the need for stronger cross-encoders trained with more diverse ranking tasks to compete with advances in dense retrieval.

¹ https://github.com/manveertamber/enhancing_domain_adaptation
2 Background

Previous work has explored the domain adaptation of retriever models through synthetic query generation [24, 25]. These approaches typically involved generating synthetic queries for many given passages and then fine-tuning retriever models using these passage-query pairs.

Alongside putting forward the BEIR benchmark, Thakur et al. [38] also investigated adapting retrievers to BEIR datasets using a T5-based query generator [29, 33] trained on MS MARCO. The TAS-B dense retriever [19] fine-tuned with these queries showed mixed results, underperforming the original TAS-B and BM25 on average. GPL [44] extended this work by introducing an additional step: using a cross-encoder to label query-passage pairs. The dense retriever was then trained to mimic the score margins between positive and negative query-passage pairs from the cross-encoder.

Several more recent works have introduced improved approaches to synthetic query generation and domain adaptation. InPars [5] used GPT-3 [7] for few-shot query generation to adapt rerankers instead of retrievers. InPars-v2 [20] later replaced GPT-3 with the open-source GPT-J [43]. Promptagator [12] used few-shot examples of task-specific queries and passages to generate queries more in line with those from the BEIR datasets being used for evaluation.
UDAPDR [35] focused on the domain adaptation of ColBERTv2 [36], a multi-vector retrieval model, leveraging GPT-3 and FLAN-T5 XXL [8] for query generation to train up to 10 cross-encoders that were each in turn used to annotate triples of (query, positive document, negative document) to fine-tune retrieval models.

Unlike previous approaches, we explore the generation of diverse queries and the use of listwise distillation from cross-encoders to provide strong relevance signals for adapting retrievers. We generate diverse queries without relying on few-shot examples from evaluation datasets to allow for fine-tuning that generalizes well to retrieval domains and tasks without needing specific human-written query examples. We also evaluate how generated queries compare to human-written queries, and we demonstrate the need for cross-encoder listwise distillation empirically, showing that a simpler contrastive learning setup fails to improve the retrieval effectiveness of current SOTA embedding models.

3 Methodology

3.1 Synthetic Query Generation

Given up to 100K passages randomly sampled from a retrieval corpus, we generate queries for each passage to fine-tune the retrievers. The assumption is that the content of the 100K passages should represent the retrieval task well and indicate what users might search for. We generate synthetic queries using Llama-3.1 (8B) [16] by providing up to 3 examples of passage-query pairs and then prompting the LLM to generate a query for a given passage. Given Wikipedia passages from BEIR's NQ corpus, we use GPT-4o [1] to generate high-quality queries to use as examples to guide the generation of Llama-3.1 (8B). We examine generating six different types of queries: questions, claims, titles, keywords, natural user search queries, and natural user search queries given human-written examples from MSMARCO [3].

Prior domain adaptation work primarily generated synthetic questions or search queries resembling human-written ones from MSMARCO [5, 35, 38, 44]. Promptagator [12] focused on generating task-specific queries but did so by using up to 8 examples from the same retrieval task that the models were evaluated on. In contrast, our methods generate diverse queries without having to rely on queries from the evaluation dataset.

For brevity, we do not provide the prompts to generate synthetic queries in this paper, but we make them available in our GitHub repository. In general, we prompt the LLM to generate queries of under 20 words that are addressed by the given passage. We use vLLM [22] for inference and find that 100K queries for MSMARCO passages can be generated in roughly 1-2 hours using an RTX 6000 Ada GPU, although times vary depending on passage lengths.
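To make the generation setup concrete, the following is a minimal sketch of few-shot query generation with vLLM. The prompt wording, example pair, checkpoint name, and sampling settings are illustrative assumptions rather than the configuration used in the paper; the actual prompts are released in the authors' repository.

```python
# Minimal sketch of few-shot synthetic query generation with vLLM.
# The prompt template, few-shot example, and sampling settings below are
# illustrative assumptions; the paper's exact prompts are in its repository.
from vllm import LLM, SamplingParams

FEW_SHOT = (
    "Passage: The Amazon rainforest produces roughly 20% of the world's oxygen.\n"
    "Query: how much oxygen does the amazon rainforest produce\n\n"
)  # up to 3 passage-query examples would be used (e.g., GPT-4o-written ones)

def build_prompt(passage: str, query_type: str = "natural user search query") -> str:
    return (
        f"{FEW_SHOT}"
        f"Write a {query_type} of under 20 words that is addressed by the passage.\n"
        f"Passage: {passage}\nQuery:"
    )

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # assumed checkpoint name
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32, stop=["\n"])

passages = ["Dense retrieval maps queries and passages into a shared vector space ..."]
outputs = llm.generate([build_prompt(p) for p in passages], params)
queries = [o.outputs[0].text.strip() for o in outputs]
print(queries)
```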
3.1.1 Filtering Generated Queries. To ensure the quality of generated queries, some filtering is needed. Promptagator filtered out queries if their corresponding passage did not rank first with the unadapted retriever. This may remove high-quality queries that are challenging for the retriever. To improve filtering, we make use of a stronger teacher cross-encoder. First, we discard queries where the passage isn't in the top 20 retrieved passages. Then, we filter out queries where the passage isn't ranked first with the reranker.
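A minimal sketch of this two-stage consistency filter is shown below. The `retrieve` and `rerank` callables are placeholders standing in for the retriever being fine-tuned and the RankT5-3B cross-encoder; their signatures are assumptions for illustration.

```python
# Sketch of the two-stage query filter described above (illustrative only).
# `retrieve` and `rerank` are assumed helpers wrapping the retriever being
# fine-tuned and the cross-encoder teacher, respectively.
from typing import Callable, List, Tuple

def filter_queries(
    pairs: List[Tuple[str, str]],                    # (query, source_passage_id)
    retrieve: Callable[[str, int], List[str]],       # query, k -> ranked passage ids
    rerank: Callable[[str, List[str]], List[str]],   # query, ids -> reranked ids
    k: int = 20,
) -> List[Tuple[str, str]]:
    kept = []
    for query, pid in pairs:
        top_k = retrieve(query, k)
        if pid not in top_k:                 # stage 1: passage must appear in the top 20
            continue
        if rerank(query, top_k)[0] != pid:   # stage 2: reranker must rank it first
            continue
        kept.append((query, pid))
    return kept
```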
3.2 Evaluation

We evaluate retrieval effectiveness after fine-tuning retrievers to adapt to BEIR [38] corpora and the MSMARCO [3] passage ranking corpus. We report NDCG@10 and Recall@100 scores to evaluate the fine-tuned retrievers. For BEIR datasets, we focus on the TREC-COVID [40], NFCorpus [6], FiQA [26], SCIDOCS [9], ArguAna [41], Touché-2020 [4], DBPedia [18], FEVER [39], Climate-FEVER [15], and SciFact [42] datasets. We choose these datasets because their corpora are under open licenses and they represent a diversity of retrieval tasks offering varying types of queries (e.g., factual claims, opinion-based questions), corpora (e.g., Wikipedia, scientific abstracts, forum posts), and topics (e.g., finance, COVID-19, climate change). We also consider MSMARCO passage ranking to examine our fine-tuning methods in-domain. For MSMARCO passages, we sample 200K passages to generate queries instead of 100K to allow for greater possible effectiveness boosts, because MSMARCO passage ranking is a more general retrieval task that should already be in-domain for retrieval models. After fine-tuning with MSMARCO passages, we evaluate using the TREC Deep Learning Tracks from 2019 (DL19) and 2020 (DL20) [10, 11].

3.2.1 Prior Fine-Tuning. Table 1 presents NDCG@10 scores across the studied datasets after reranking BGE-retrieved passages with RankT5-3B. We focus on RankT5-3B [48] as the reranker due to its strong effectiveness and generalization to out-of-domain tasks [32, 37, 47, 48]. While reranking generally improves effectiveness, supporting the motivation for distilling rerankers into retrievers, we observe score drops on FEVER, Climate-FEVER, ArguAna, and SCIDOCS after reranking. This is unexpected given rerankers' typical advantages over retrievers, such as the ability to judge passage relevance directly with respect to the query, and RankT5-3B's significantly larger size (3B parameters) compared to the BGE retriever (110M parameters).

Dataset          BGE    RankT5-3B
DL19             70.2   75.1
DL20             67.7   77.2
TREC-COVID       78.1   85.7
NFCorpus         37.3   40.4
FiQA             40.6   52.6
SCIDOCS          21.7   19.3
ArguAna          63.6   34.9
Touché-2020      25.7   38.6
DBPedia          40.7   48.1
FEVER            86.3   84.9
Climate-FEVER    31.2   27.3
SciFact          74.1   77.1

Table 1: NDCG@10 scores for BGE and for RankT5-3B reranking the top-100 BGE-retrieved passages. Best scores are bolded in the original.

However, the BGE model and the other embedding models examined have undergone extensive fine-tuning. Notably, while RankT5 is trained only on MSMARCO and NQ [21], GTE [23] and Arctic [27] include FEVER and other BEIR datasets in their training data, along with pretraining that involves scientific abstracts with corresponding titles as queries, potentially inflating their performance on SCIDOCS and similar datasets. The BGE model's training data is not clearly documented [46], but we suspect it is similarly suited for BEIR evaluation. Table 1 thus both highlights the promise of cross-encoder distillation and reveals challenges in fair evaluation due to potential task contamination and data leakage for retrievers trained on BEIR datasets, potentially to remain competitive in BEIR and MTEB evaluation.
3.2.2 Passage De-duplication. Similar to previous work [31], we identify many near-duplicate passages in the MSMARCO and BEIR corpora, which may reduce generated query diversity and hinder contrastive learning if duplicates appear in training batches. To address this, we normalize text by considering whitespace, case, and punctuation, and then remove passages that are substrings of others. This approach eliminated over 950K of MSMARCO's 8.8M passages.
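A minimal sketch of this normalization-plus-substring check follows. The exact normalization rules (lowercasing, stripping punctuation, collapsing whitespace) are our assumptions, and the quadratic scan is written for clarity rather than for 8.8M-passage scale.

```python
# Sketch of the de-duplication step: normalize passages, then drop any passage
# whose normalized text is a substring of another's. Normalization rules are
# assumed, and the O(n^2) scan is for clarity, not for scale.
import re
import string
from typing import Dict, List

def normalize(text: str) -> str:
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def dedupe(passages: Dict[str, str]) -> List[str]:
    norm = {pid: normalize(t) for pid, t in passages.items()}
    # sort by length so each passage is only checked against longer ones
    ordered = sorted(norm.items(), key=lambda kv: len(kv[1]))
    kept: List[str] = []
    for i, (pid, text) in enumerate(ordered):
        if any(text in longer for _, longer in ordered[i + 1:]):
            continue  # exact duplicate or substring of a longer passage
        kept.append(pid)
    return kept
```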
3.3 Model Training

3.3.1 Cross-Encoder Teacher. We use RankT5-3B [48] as a cross-encoder teacher. We first retrieve 20 passages for each generated query using the retriever to be fine-tuned. We then rerank these 20 passages for the query using RankT5-3B, verifying that the corresponding passage ranks first as described in Section 3.1.1. We normalize scores across all queries using min-max normalization, but using the 1st and 99th percentiles with clipping to scale scores to [0, 1]. The time for reranking varies depending on passage lengths; however, reranking 20 MSMARCO passages each for 100K queries takes roughly 5 hours using an RTX 6000 Ada GPU.
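A short sketch of this percentile-clipped min-max normalization is given below; reading "across all queries" as pooling all teacher scores before computing the percentiles is our interpretation of the text.

```python
# Sketch of percentile-clipped min-max normalization of teacher scores.
# Pooling scores across all queries before taking percentiles is an assumption.
import numpy as np

def normalize_teacher_scores(scores: np.ndarray) -> np.ndarray:
    """Map raw cross-encoder scores to [0, 1] using the 1st/99th percentiles."""
    lo, hi = np.percentile(scores, [1, 99])
    clipped = np.clip(scores, lo, hi)
    return (clipped - lo) / (hi - lo + 1e-9)

raw = np.random.randn(100_000 * 20)  # e.g., 20 reranked passages per query
norm = normalize_teacher_scores(raw)
print(norm.min(), norm.max())  # both within [0, 1]
```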
3.3.2 Cross-encoder Listwise Distillation. RocketQAv2 [34] proposed jointly training a dense retriever and a cross-encoder, allowing both models to learn from each other. We use a similar formulation in our work to distill a cross-encoder into retrievers given a relevance distribution:

$$\tilde{s}(q_i, p_i) = \frac{e^{s(q_i, p_i)/\tau}}{e^{s(q_i, p_i)/\tau} + \sum_{k=1}^{K} e^{s(q_i, p_{i_k})/\tau}} \qquad (1)$$

based on a scoring function $s_{de}(q, p)$ for the dense retriever, which is the cosine similarity of the query and passage embeddings, and $s_{ce}(q, p)$ for the cross-encoder, which is the relevance score from the cross-encoder for a passage with respect to a query. We minimize the KL divergence of the two relevance distributions $\tilde{s}_{de}(q, p)$ and $\tilde{s}_{ce}(q, p)$ over all the $K = 19$ hard negative passages and the relevant passage for each query. We find that temperature parameters of $\tau = 0.05$ for $\tilde{s}_{de}(q, p)$ and $\tau = 0.3$ for $\tilde{s}_{ce}(q, p)$ work well for distilling into the retriever models. We tuned these hyperparameters by training BGE on MSMARCO and evaluating on DL19 and DL20, and we use these values for all experiments.
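The following PyTorch sketch shows one way to implement this listwise distillation term, with the student scoring by cosine similarity and the teacher supplying normalized cross-encoder scores; the tensor shapes and the batch-mean reduction are our assumptions, not details taken from the paper.

```python
# Sketch of the listwise distillation loss (Eq. 1): KL divergence between the
# student's and teacher's relevance distributions over 1 positive + K hard
# negatives per query. Shapes and reduction are illustrative assumptions.
import torch
import torch.nn.functional as F

def listwise_distill_loss(
    q_emb: torch.Tensor,           # (B, d) query embeddings
    p_emb: torch.Tensor,           # (B, 1 + K, d) positive + hard-negative embeddings
    teacher_scores: torch.Tensor,  # (B, 1 + K) normalized cross-encoder scores
    tau_student: float = 0.05,
    tau_teacher: float = 0.3,
) -> torch.Tensor:
    q = F.normalize(q_emb, dim=-1)
    p = F.normalize(p_emb, dim=-1)
    student_scores = torch.einsum("bd,bkd->bk", q, p)   # cosine similarities
    student_logp = F.log_softmax(student_scores / tau_student, dim=-1)
    teacher_prob = F.softmax(teacher_scores / tau_teacher, dim=-1)
    # KL(teacher || student), averaged over queries in the batch
    return F.kl_div(student_logp, teacher_prob, reduction="batchmean")
```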
3.3.3 Positive-Aware Contrastive Loss. Much of the recent work on dense retrieval, including all of the models considered in this work [23, 27, 45, 46], trains models with an InfoNCE [30] contrastive loss that takes advantage of in-batch or mined hard negatives to learn to represent text with embeddings contrastively. Many of these works have also argued for the importance of training with hard negatives to train effective dense retrieval models [23, 27]. In our work, we contrast with every hard-negative passage in the batch to allow for a larger contrastive batch size using the following loss, where $K = 19$ hard negatives $p_{j_k}$ are mined for each query $q_j$ and used for training. We use a temperature of $\tau = 0.01$, which is the value commonly used with the models considered [23, 27, 45, 46]:

$$-\frac{1}{n}\sum_{i=1}^{n}\log\frac{e^{s_{de}(q_i, p_i)/\tau}}{\sum_{j=1}^{n}\left(e^{s_{de}(q_i, p_j)/\tau}+\sum_{k=1}^{K}e^{s_{de}(q_i, p_{j_k})/\tau}\right)} \qquad (2)$$
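A PyTorch sketch of this contrastive term, under the same shape assumptions as the distillation sketch above, treats the batch's positives and all mined hard negatives as the candidate pool for every query.

```python
# Sketch of the positive-aware InfoNCE loss (Eq. 2): each query is contrasted
# against every positive and every mined hard negative in the batch.
# Shapes follow the distillation sketch above and are illustrative.
import torch
import torch.nn.functional as F

def contrastive_loss(
    q_emb: torch.Tensor,   # (B, d) query embeddings
    p_emb: torch.Tensor,   # (B, 1 + K, d) per-query positive + K hard negatives
    tau: float = 0.01,
) -> torch.Tensor:
    B, num_cand, d = p_emb.shape
    q = F.normalize(q_emb, dim=-1)
    cand = F.normalize(p_emb, dim=-1).reshape(B * num_cand, d)
    scores = q @ cand.T / tau                   # (B, B * (1 + K)) cosine / tau
    # the positive for query i sits at flattened index i * num_cand
    targets = torch.arange(B, device=scores.device) * num_cand
    return F.cross_entropy(scores, targets)
```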
A recent study explored techniques for mining hard negatives while acknowledging that some mined "negatives" may be relevant despite lacking relevance labels, which hurts contrastive learning [13]. It found that the best method for filtering false negatives is to exclude passages with a relevance score above a certain percentage of the relevant passage's score, using a teacher embedding model. Specifically, passages scoring above 95% of the relevant passage's score were removed. In our work, we tune this threshold using the cross-encoder's normalized scores. Figure 1 shows MAP, NDCG@10, and Recall@100 scores on DL19 and DL20 by the threshold applied for the BGE-base model trained using MSMARCO passages and the contrastive loss described above. We find that a threshold of 60% works well and is a reasonable moderate threshold that balances the retrieval metrics.

[Figure 1: Retrieval effectiveness scores on DL19 and DL20 at different hard-negative filtering thresholds during BGE-base fine-tuning on MSMARCO passages.]
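A small sketch of this de-noising rule using the normalized teacher scores is shown below; the 0.6 default mirrors the threshold reported above, and the data layout is an assumption for illustration.

```python
# Sketch of hard-negative de-noising: drop mined "negatives" whose normalized
# cross-encoder score exceeds a fraction of the positive's score (0.6 here,
# matching the threshold found to work well above). Data layout is assumed.
from typing import Dict, List

def denoise_negatives(
    pos_score: float,               # teacher score of the relevant passage
    neg_scores: Dict[str, float],   # teacher scores of mined negative passages
    threshold: float = 0.6,
) -> List[str]:
    """Return ids of negatives that are kept for contrastive training."""
    cutoff = threshold * pos_score
    return [pid for pid, score in neg_scores.items() if score <= cutoff]

kept = denoise_negatives(0.9, {"p1": 0.85, "p2": 0.40, "p3": 0.60})
print(kept)  # ['p2'] -- p1 and p3 exceed 0.6 * 0.9 and are treated as likely relevant
```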
3.3.4 Training Setup. We train models using the sum of the contrastive loss and the listwise distillation loss, finding that a combination works well with a 0.1 weight on the contrastive loss. This weight was tuned based on dev losses when fine-tuning BGE on MSMARCO passages. We examine the choice of loss function further in Section 4.1. We train our models on a single 48GB RTX 6000 Ada or 48GB L40S GPU depending on availability. Models were trained with a learning rate of 2e-4 and 4096 queries per batch, leveraging GradCache [17] to support large contrastive batch sizes, following findings that larger batch sizes help train more effective embedding models for retrieval [23, 27, 46]. We use a 90%/10% train/dev split and train for up to 30 epochs, selecting the model with the best dev loss and stopping if the dev loss fails to improve for two epochs. Training times vary by dataset; for MSMARCO passages, training took 48-60 hours for all models considered.
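Putting the two terms together, a minimal sketch of the combined objective with the 0.1 contrastive weight is shown below; it assumes the `listwise_distill_loss` and `contrastive_loss` sketches above are in scope and is only a toy illustration of how the two losses are summed.

```python
# Sketch of the combined training objective: the listwise distillation loss plus
# a 0.1-weighted contrastive loss, reusing the two sketches defined above.
import torch

def combined_loss(q_emb, p_emb, teacher_scores, contrastive_weight: float = 0.1):
    distill = listwise_distill_loss(q_emb, p_emb, teacher_scores)
    contrast = contrastive_loss(q_emb, p_emb)
    return distill + contrastive_weight * contrast

# Toy check with random tensors (B=4 queries, 1 positive + 19 negatives, d=768):
B, K, d = 4, 19, 768
q = torch.randn(B, d, requires_grad=True)
p = torch.randn(B, 1 + K, d, requires_grad=True)
combined_loss(q, p, torch.rand(B, 1 + K)).backward()
```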
4 Results

4.1 Evaluating the Effectiveness of our Methods

Table 3 first shows that training with a contrastive loss underperforms training with a listwise loss and generally hurts effectiveness over the base models, except for when fine-tuning the unsupervised E5 model. Nonetheless, training with a listwise distillation loss consistently improves effectiveness over the base models. While combining the listwise loss and contrastive loss does not generally clearly beat training with a listwise loss alone, it does do so for E5-unsupervised, which suggests that it is most helpful to incorporate a contrastive loss when the model has not already undergone strong contrastive fine-tuning. For further experiments, we train with the combined loss as it provides arguably the strongest baseline.

                             DL19           DL20           SciFact        FiQA
Model                        NDCG   Recall  NDCG   Recall  NDCG   Recall  NDCG   Recall
BGE                          70.2   60.9    67.7   71.5†   74.1   96.7    40.6†  74.2
 + FT with contrastive loss  68.6   61.0    66.8   70.8    72.1   96.0    39.5   70.7
 + FT with listwise loss     71.8   63.6    69.7   75.1    76.3   97.0    44.8   75.9
 + FT with combined loss     71.8   63.0    69.7   75.5    76.2   97.0    44.3   75.8
GTE                          71.9   62.1†   71.5   69.8†   75.5   97.3    48.7   81.7
 + FT with contrastive loss  71.5   62.6    67.8   69.8    71.2   95.5    45.1   77.1
 + FT with listwise loss     72.3   64.7    72.6   73.8    75.4   97.7    48.5   80.2
 + FT with combined loss     71.8   65.6    72.6   73.3    76.3   97.7    49.5   81.4
Arctic                       74.4   64.7    72.1   74.2    70.5†  94.8    42.5†  74.8
 + FT with contrastive loss  72.2   64.3    71.9   73.1    69.1   94.9    41.8   73.0
 + FT with listwise loss     74.4   67.4    73.6   75.0    73.2   96.0    45.2   75.9
 + FT with combined loss     74.3   67.2    73.6   75.4    73.6   95.7    45.5   75.9
E5-unsupervised              56.3†  52.6†   54.6†  62.1†   74.3   98.7    40.1†  71.8†
 + FT with contrastive loss  67.9   61.7    68.2   71.7    72.3   95.0    40.5   73.8
 + FT with listwise loss     69.9   63.7    72.7   74.3    76.2   96.7    44.0   75.5
 + FT with combined loss     72.4   63.3    73.2   75.2    76.3   97.0    45.4   77.0

Table 3: NDCG@10 and Recall@100 for the original and fine-tuned (FT) models, examining the effect of the training loss. A † indicates a significant difference between the original and FT model with the combined loss, based on a one-sided, paired t-test (p < 0.05), with Holm–Bonferroni correction across all datasets and metrics for each model's results. Best scores for each model are bolded in the original.

Table 2 shows that both NDCG@10 and Recall@100 consistently improve across all three models for most datasets. However, exceptions include FEVER, Climate-FEVER, SCIDOCS, and ArguAna, where Section 3.2.1 highlighted challenges with the reranker's relative effectiveness on these particular datasets, potentially because of the prior fine-tuning of the retriever models on these datasets or retrieval tasks, unlike the reranker. Additionally, some scores marginally decrease after fine-tuning, such as DL19 NDCG for the GTE and Arctic models, FiQA recall for the GTE model, and DBPedia recall and TREC-COVID NDCG for the Arctic model. These results present the difficulty of performing and evaluating domain adaptation. The table also provides scores from Promptagator, which is the most recent domain adaptation work for single-vector dense retrievers. Promptagator's scores are generally lower than the scores for the base models even without further fine-tuning, suggesting that Promptagator's domain adaptation methods have quickly been undermined by the SOTA models.

                 BGE                        GTE                        Arctic                     Promptagator
Dataset          nDCG         Recall        nDCG         Recall        nDCG         Recall        nDCG
DL19             70.2 → 71.8  60.9 → 63.0   71.9 → 71.8  62.1 → 65.6   74.4 → 74.3  64.7 → 67.2   –
DL20             67.7 → 69.7  71.5 → 75.6   71.5 → 72.6  69.8 → 72.3   72.1 → 73.6  74.2 → 75.4   –
TREC-COVID       78.1 → 82.2  14.1 → 15.8   75.3 → 78.2  14.0 → 14.8   82.2 → 80.9  14.8 → 14.9   75.6
NFCorpus         37.3 → 38.3  33.7 → 35.8   35.3 → 37.9  33.1 → 34.6   36.1 → 37.4  32.4 → 33.7   33.4
FiQA             40.6 → 44.3  74.2 → 75.8   48.7 → 49.5  81.7 → 81.4   42.5 → 45.5  74.8 → 75.9   46.2
ArguAna          63.6 → 61.4  99.2 → 99.2   62.1 → 60.9  99.2 → 99.4   56.5 → 58.6  98.4 → 99.2   59.4
Touché-2020      25.7 → 35.3  48.7 → 50.6   27.5 → 32.1  48.3 → 49.0   33.2 → 37.9  50.0 → 52.4   34.5
DBPedia          40.7 → 45.5  53.0 → 56.3   36.9 → 44.7  46.9 → 55.2   44.7 → 45.1  58.7 → 57.2   38.0
SCIDOCS          21.7 → 19.4  49.6 → 45.5   21.7 → 20.2  50.1 → 46.6   20.0 → 18.6  42.3 → 43.5   18.4
FEVER            86.3 → 80.0  97.2 → 95.7   92.1 → 82.4  97.5 → 96.0   85.6 → 80.4  97.6 → 96.1   77.0
Climate-FEVER    31.2 → 25.1  63.6 → 59.3   40.1 → 30.3  71.7 → 63.8   34.7 → 27.2  66.7 → 63.1   16.8
SciFact          74.1 → 76.2  96.7 → 97.0   75.5 → 76.3  97.3 → 97.7   70.5 → 73.6  94.8 → 95.7   65.0

Table 2: nDCG@10 and Recall@100 for the models (before fine-tuning → after fine-tuning). Bolded values in the original indicate the best score for each model on each dataset. Promptagator scores (nDCG only) are shown on the right for reference.
4.2 Evaluating the Quality of Generated Queries

4.2.1 Evaluating the Role of Query Types. Table 4 shows that while training with the query type that most aligns with the retrieval task is helpful, training with a diverse set of queries generally results in the best effectiveness and is the preferred approach. In SciFact evaluation, where queries are scientific claims, training with MSMARCO-style user queries is most beneficial, followed by training with claims. In MSMARCO DL19 and DL20 evaluation, training with MSMARCO-style user queries scores best on DL20, but training with titles scores marginally better on DL19. Queries in FiQA are user-written questions, and training with MSMARCO-style queries scores best. Generally, training with all queries allows for the strongest scores, regardless of the particular query type that most aligns with the retrieval task.

                       #Q                        NDCG@10
Query Type             DL19+20  FiQA   SciFact   DL19   DL20   FiQA   SciFact
(No fine-tuning)       –        –      –         56.3   54.6   40.1   74.4
MSMARCO User Queries   96K      39K    5K        72.8   72.8   43.7   75.2
Claims                 136K     46K    5K        65.3   68.0   40.7   74.5
Titles                 99K      34K    5K        72.9   71.6   42.9   72.3
All Queries            646K     240K   30K       72.4   73.2   45.4   76.3

Table 4: Retrieval effectiveness when training the E5-unsupervised model with different query types and all generated queries. #Q gives the number of queries for training.

4.2.2 Comparing Human-written and Synthetic Queries. Table 5 compares training the E5-unsupervised model using human-written queries from the MSMARCO training set to synthetic queries generated by an LLM in either a few-shot setting (given three MSMARCO passage-query examples) or a zero-shot setting (without examples). We focus on queries for 200K randomly sampled MSMARCO passages from the training set.

Applying the filtering process from Section 3.1.1, we find that human-written queries are filtered at a much higher rate, leaving only 56K of the original 200K compared to 96K for the few-shot synthetic queries and 98K for the zero-shot synthetic queries. This is notable since RankT5-3B was fine-tuned on these same human-written queries but still filtered many out, suggesting that MSMARCO's labeled passages may not always be the most relevant for their query, as argued in previous work [2].

To ensure a fair comparison, we also train with a 56K subset of synthetic queries, aligning with the human-written query set on as many passages as possible. In this controlled setting, models trained with few-shot synthetic queries outperform those trained with human-written queries on DL19, while zero-shot queries achieve slightly higher recall on DL20. When we use all available synthetic queries for training, scores improve further, with the few-shot synthetic queries generally surpassing the zero-shot ones except on DL19 when considering NDCG. Nonetheless, models trained on human-written queries retain the highest NDCG on DL20.

Overall, these findings suggest that relatively lightweight LLMs, such as Llama-3.1 (8B), can generate synthetic queries that are competitive with human-written ones in training usefulness.

                                                 DL19           DL20
Query Type                          # Queries    NDCG   Recall  NDCG   Recall
(No fine-tuning)                    –            56.3   52.6    54.6   62.1
User Queries (Human)                56K          71.6   63.9    73.4   73.0
User Queries Few-Shot (Synthetic)   56K          71.7   64.9    70.3   72.1
User Queries Few-Shot (Synthetic)   96K          72.8   65.3    72.8   74.5
User Queries Zero-Shot (Synthetic)  56K          70.8   62.2    70.5   73.1
User Queries Zero-Shot (Synthetic)  98K          74.0   62.9    71.8   73.7

Table 5: Retrieval effectiveness for the E5-unsupervised model fine-tuned with human-written and synthetic queries. For the synthetic queries, results are provided both for a subset of 56K queries, to provide a fair comparison, and for the full query set.

5 Conclusion

While previous methods in domain adaptation and dense retrieval involving contrastive learning may not be sufficient to enhance the retrieval effectiveness of SOTA embedding models on particular tasks, cross-encoder distillation and diverse synthetic queries are promising, allowing for effectiveness gains across varied datasets and models. We show that synthetic data can rival human-written queries and relevance judgements for training. Nonetheless, we find that cross-encoder effectiveness is a limiting factor for training effective embedding models, suggesting the importance of stronger cross-encoder teachers to further strengthen dense retrievers.
References

[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 Technical Report. arXiv:2303.08774 (2023).
[2] Negar Arabzadeh, Alexandra Vtyurina, Xinyi Yan, and Charles L. A. Clarke. 2022. Shallow pooling for sparse labels. Inf. Retr. 25, 4 (2022), 365–385.
[3] Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, and Tong Wang. 2016. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. arXiv:1611.09268v3 (2016).
[4] Alexander Bondarenko, Maik Fröbe, Meriem Beloucif, Lukas Gienapp, Yamen Ajjour, Alexander Panchenko, Chris Biemann, Benno Stein, Henning Wachsmuth, Martin Potthast, and Matthias Hagen. 2020. Overview of Touché 2020: Argument Retrieval. In Experimental IR Meets Multilinguality, Multimodality, and Interaction. Springer, 384–395.
[5] Luiz Bonifacio, Hugo Abonizio, Marzieh Fadaee, and Rodrigo Nogueira. 2022. InPars: Data Augmentation for Information Retrieval using Large Language Models. arXiv:2202.05144 (2022).
[6] Vera Boteva, Demian Gholipour, Artem Sokolov, and Stefan Riezler. 2016. A Full-Text Learning to Rank Dataset for Medical Information Retrieval. In Advances in Information Retrieval. Springer, 716–722.
[7] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, Vol. 33, 1877–1901.
[8] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2024. Scaling Instruction-Finetuned Language Models. Journal of Machine Learning Research 25, 70 (2024), 1–53.
[9] Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, and Daniel Weld. 2020. SPECTER: Document-level Representation Learning using Citation-informed Transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2270–2282.
[10] Nick Craswell, Bhaskar Mitra, Emine Yilmaz, and Daniel Campos. 2020. Overview of the TREC 2020 Deep Learning Track. In Proceedings of the Twenty-Ninth Text REtrieval Conference (TREC 2020).
[11] Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Ellen M. Voorhees. 2019. Overview of the TREC 2019 Deep Learning Track. In Proceedings of the Twenty-Eighth Text REtrieval Conference (TREC 2019).
[12] Zhuyun Dai, Vincent Y. Zhao, Ji Ma, Yi Luan, Jianmo Ni, Jing Lu, Anton Bakalov, Kelvin Guu, Keith B. Hall, and Ming-Wei Chang. 2022. Promptagator: Few-shot Dense Retrieval from 8 Examples. arXiv:2209.11755 (2022).
[13] Gabriel de Souza P. Moreira, Radek Osmulski, Mengyao Xu, Ronay Ak, Benedikt Schifferer, and Even Oldridge. 2025. NV-Retriever: Improving Text Embedding Models with Effective Hard-Negative Mining. arXiv:2407.15831 (2025).
[14] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186.
[15] Thomas Diggelmann, Jordan Boyd-Graber, Jannis Bulian, Massimiliano Ciaramita, and Markus Leippold. 2020. CLIMATE-FEVER: A Dataset for Verification of Real-World Climate Claims. arXiv:2012.00614 (2020).
[16] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The Llama 3 Herd of Models. arXiv:2407.21783 (2024).
[17] Luyu Gao, Yunyi Zhang, Jiawei Han, and Jamie Callan. 2021. Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup. In Proceedings of the 6th Workshop on Representation Learning for NLP.
[18] Faegheh Hasibi, Fedor Nikolaev, Chenyan Xiong, Krisztian Balog, Svein Erik Bratsberg, Alexander Kotov, and Jamie Callan. 2017. DBpedia-Entity v2: A Test Collection for Entity Search. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '17), 1265–1268.
[19] Sebastian Hofstätter, Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin, and Allan Hanbury. 2021. Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '21), 113–122.
[20] Vitor Jeronymo, Luiz Bonifacio, Hugo Abonizio, Marzieh Fadaee, Roberto Lotufo, Jakub Zavrel, and Rodrigo Nogueira. 2023. InPars-v2: Large Language Models as Efficient Dataset Generators for Information Retrieval. arXiv:2301.01820 (2023).
[21] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural Questions: A Benchmark for Question Answering Research. Transactions of the Association for Computational Linguistics 7 (2019), 452–466.
[22] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
[23] Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. 2023. Towards General Text Embeddings with Multi-stage Contrastive Learning. arXiv:2308.03281 (2023).
[24] Davis Liang, Peng Xu, Siamak Shakeri, Cicero Nogueira dos Santos, Ramesh Nallapati, Zhiheng Huang, and Bing Xiang. 2020. Embedding-based Zero-shot Retrieval through Query Generation. arXiv:2009.10270 (2020).
[25] Ji Ma, Ivan Korotkov, Yinfei Yang, Keith Hall, and Ryan McDonald. 2021. Zero-shot Neural Passage Retrieval via Domain-targeted Synthetic Question Generation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 1075–1088.
[26] Macedo Maia, Siegfried Handschuh, André Freitas, Brian Davis, Ross McDermott, Manel Zarrouk, and Alexandra Balahur. 2018. WWW'18 Open Challenge: Financial Opinion Mining and Question Answering. In Companion Proceedings of The Web Conference 2018 (WWW '18), 1941–1942.
[27] Luke Merrick, Danmei Xu, Gaurav Nuti, and Daniel Campos. 2024. Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models. arXiv:2405.05374 (2024).
[28] Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. 2023. MTEB: Massive Text Embedding Benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2014–2037.
[29] Rodrigo Nogueira and Jimmy Lin. 2019. From doc2query to docTTTTTquery. Online preprint (2019).
[30] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation Learning with Contrastive Predictive Coding. arXiv:1807.03748 (2018).
[31] Ronak Pradeep, Nandan Thakur, Sahel Sharifymoghaddam, Eric Zhang, Ryan Nguyen, Daniel Campos, Nick Craswell, and Jimmy Lin. 2024. Ragnarök: A Reusable RAG Framework and Baselines for TREC 2024 Retrieval-Augmented Generation Track. arXiv:2406.16828 (2024).
[32] Zhen Qin, Rolf Jagerman, Kai Hui, Honglei Zhuang, Junru Wu, Jiaming Shen, Tianqi Liu, Jialu Liu, Donald Metzler, Xuanhui Wang, et al. 2023. Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting. arXiv:2306.17563 (2023).
[33] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research 21, 1 (2020), 5485–5551.
[34] Ruiyang Ren, Yingqi Qu, Jing Liu, Wayne Xin Zhao, QiaoQiao She, Hua Wu, Haifeng Wang, and Ji-Rong Wen. 2021. RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2825–2835.
[35] Jon Saad-Falcon, Omar Khattab, Keshav Santhanam, Radu Florian, Martin Franz, Salim Roukos, Avirup Sil, Md Sultan, and Christopher Potts. 2023. UDAPDR: Unsupervised Domain Adaptation via LLM Prompting and Distillation of Rerankers. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 11265–11279.
[36] Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. 2022. ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 3715–3734.
[37] Manveer Singh Tamber, Ronak Pradeep, and Jimmy Lin. 2023. Scaling Down, LiTting Up: Efficient Zero-Shot Listwise Reranking with Seq2seq Encoder-Decoder Models. arXiv:2312.16098 (2023).
[38] Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models. arXiv:2104.08663 (2021).
[39] James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a Large-scale Dataset for Fact Extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 809–819.
[40] Ellen Voorhees, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, William R. Hersh, Kyle Lo, Kirk Roberts, Ian Soboroff, and Lucy Lu Wang. 2021. TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection. In ACM SIGIR Forum, Vol. 54, 1–12.
[41] Henning Wachsmuth, Shahbaz Syed, and Benno Stein. 2018. Retrieval of the Best Counterargument without Prior Topic Knowledge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 241–251.
[42] David Wadden, Shanchuan Lin, Kyle Lo, Lucy Lu Wang, Madeleine van Zuylen, Arman Cohan, and Hannaneh Hajishirzi. 2020. Fact or Fiction: Verifying Scientific Claims. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 7534–7550.
[43] Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax.
[44] Kexin Wang, Nandan Thakur, Nils Reimers, and Iryna Gurevych. 2022. GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2345–2360.
[45] Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. Text Embeddings by Weakly-Supervised Contrastive Pre-training. arXiv:2212.03533 (2022).
[46] Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff, Defu Lian, and Jian-Yun Nie. 2024. C-Pack: Packed Resources For General Chinese Embeddings. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '24), 641–649.
[47] Soyoung Yoon, Eunbi Choi, Jiyeon Kim, Hyeongu Yun, Yireun Kim, and Seung-won Hwang. 2024. ListT5: Listwise Reranking with Fusion-in-Decoder Improves Zero-shot Retrieval. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2287–2308.
[48] Honglei Zhuang, Zhen Qin, Rolf Jagerman, Kai Hui, Ji Ma, Jing Lu, Jianmo Ni, Xuanhui Wang, and Michael Bendersky. 2023. RankT5: Fine-Tuning T5 for Text Ranking with Ranking Losses. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2308–2313.