Bridging the Preference Gap between Retrievers and LLMs (2024)

Zixuan Ke2, Weize Kong1, Cheng Li1, Mingyang Zhang1, Qiaozhu Mei3, and Michael Bendersky1
1Google Research
2University of Illinois at Chicago
3University of Michigan
1{weize,chgli,mingyang,bemike}@google.com
2zke4@uic.edu
3qmei@umich.edu

Zixuan Ke's work was done during an internship at Google Research. Qiaozhu Mei's work was done as a visiting researcher at Google Research.

Abstract

Large Language Models (LLMs) have demonstrated superior results across a wide range of tasks, while retrieval has long been established as an effective means of obtaining task-relevant information for humans. Retrieval-augmented Generation (RAG) is known for its effectiveness in knowledge-intensive tasks: it locates relevant information and places it within the context window of the LLM. However, the relationship between retrievers and LLMs remains under-investigated. Most existing work treats the retriever and the LLM as independent components, leaving a gap between retrieving human-friendly information and assembling an LLM-friendly context. In this work, we examine a novel bridge model, validate the ranking and selection assumptions of retrievers in the context of RAG, and propose a training framework that chains supervised and reinforcement learning to learn a bridge model. Empirical results demonstrate the effectiveness of our method in both question-answering and personalized generation tasks.


1 Introduction

Large language models (LLMs) such as GPT-3 Brown et al. (2020) and PaLM 2 Anil et al. (2023) have demonstrated impressive performance on a wide range of language tasks. Alongside LLMs, retrieval-augmented generation (RAG), which retrieves knowledge from an external dataset when needed, has also produced strong results in many knowledge-intensive tasks.

[Figure 1]

However, despite some extensive RAG studies Khandelwal et al. (2020); Borgeaud et al. (2022); Izacard et al. (2022); Yasunaga et al. (2023), most of them study retrievers and LLMs separately. On one hand, considerable effort has been dedicated to designing user-friendly retrievers, based on the general belief that ranking is paramount, as humans typically read from top to bottom. On the other hand, LLMs exhibit preferences different from those of humans and yield accurate predictions only when prompts align with these preferences. This discrepancy leads to sub-optimal designs in current RAG systems, a phenomenon we term the preference gap. This gap manifests in various aspects. For example, the general belief about ranking may not align with the LLM's preferences due to the self-attention mechanism in Transformers, which can attend to any token regardless of its sequential position. Another aspect is selection: humans can disregard irrelevant passages, but it has been shown that LLMs are highly sensitive to irrelevant content Shi et al. (2023a). There are likely more gaps that further diverge from human behavior, such as repetition, which is generally considered detrimental for retrieval systems Xia et al. (2017) but may be useful to LLMs for relevance weighting.

We empirically investigate this preference gap, focusing specifically on ranking and selection. As shown in Fig. 1, the performance varied by around 1% when we randomized the top-5 retrieved passages. (We note that this differs from Liu et al. (2023), where a "lost in the middle" phenomenon is observed, meaning that models are better at using relevant information at the beginning or end of the input context. This is probably due to the different number of passages used: Liu et al. (2023) employ 20 documents in total, whereas we use 5 passages. Regardless, this does not affect the conclusion that there exists a preference gap between retrievers and LLMs which needs to be bridged.) However, the variation exceeded 5% when selecting only the top-1 passage. This indicates that the general belief in the importance of ranking in retrieval does not apply to LLMs, and that selection could be more crucial. This finding confirms the existence of the preference gap between retrievers and LLMs, and highlights the importance of bridging this gap to enhance RAG. This is a crucial insight that has not been revealed before, and it can guide RAG designers towards what should be achieved in the model.

To design a RAG model, existing work has tried to finetune the LLM to align with the retriever's preferences, or to finetune the retriever to align with the LLM's preferences. However, since these approaches primarily focus on ranking, they are sub-optimal for bridging the preference gap. Fine-tuning retrievers can only achieve re-ranking, and fine-tuning LLMs is often impractical, as many are only accessible through APIs. To bridge the gap beyond ranking, we propose a framework called BGM (Bridging the Gap between retrievers and LLMs). We keep the retriever and LLM fixed, and train a bridge model in between. The bridge model aims to transform the retrieved information into a format that LLMs can work with effectively.

Since the preference gap is present in both ranking and selection, and potentially extends to aspects such as repetition, we structure the bridge model as a sequence-to-sequence (seq2seq) model rather than the conventional scoring format. This allows it not only to rerank but also to dynamically select passages for each query, and potentially to employ more advanced strategies such as repetition. In a typical RAG setting, there is no ground-truth relevance label for what should be retrieved, only a ground-truth label for the downstream task. Existing supervised learning (SL) approaches use the supervision provided by the LLM, such as the perplexity of downstream tasks. However, this is ineffective due to sparse supervision, as it is nearly impossible to feed all possible retrieved sequences into the LLM to obtain supervision. Even worse, such SL relies on intermediate relevance labels and is not end-to-end training on the downstream tasks. To address these issues, we apply reinforcement learning (RL) on top of the SL-trained bridge model, where the downstream performance metric is used as the reward and the bridge model is regarded as the policy model. This framework, chaining SL and RL, provides increased supervision from the downstream task. It also offers the model the flexibility to explore more advanced strategies, such as repetition, in forming the optimal passage sequence.

Our experiments reveal that BGM can enhance the performance of various downstream tasks, such as question answering (QA) and personalized generation, across a spectrum of datasets, from public QA and Amazon reviews to private email conversations. Notably, the adapted passage sequences produced by BGM surpass the performance of strong retrievers and reranking baselines. This underscores the significance and promise of the bridging approach in the realm of RAG. In summary, our contributions are as follows:

  • We empirically establish the existence of the preference gap between retrievers and LLMs, and introduce BGM, which is designed to address this gap.

  • We propose a seq2seq bridge model that jointly accomplishes reranking and selection, adapting the retrieved information to be LLM-friendly. We employ an SL-then-RL training scheme to optimize this adaptation process.

  • We evaluate BGM on diverse datasets, covering QA and text generation and including both publicly available and personal information sources. The comprehensive evaluation underscores the effectiveness of BGM in bridging the preference gap and improving performance on downstream tasks.

[Figure 2]

2 Related Work

Retrieval-augmented Generation (RAG). Augmenting LLMs with relevant information retrieved from various knowledge sources has proven effective in improving performance across numerous NLP tasks, including language modeling Borgeaud et al. (2022); Khandelwal et al. (2020); Shi et al. (2023b), question answering Lewis et al. (2020); Izacard et al. (2022); de Jong et al. (2023); De Jong et al. (2023); Shi et al. (2023b); Guu et al. (2020); Izacard and Grave (2020); Xu et al. (2023), fact verification Lewis et al. (2020), and text generation Lewis et al. (2020). Specifically, RAG uses the input as a query and comprises two main components: (1) the retriever retrieves a set of items from a large corpus (note that the retrieval units vary across works, including documents, passages, or even tokens; in this study, we focus on retrieving passages); and (2) the LLM incorporates the retrieved passages as additional information, integrating them into the input context for making final predictions.

A fundamental question in this process concerns the differing preferences of LLMs and existing retrievers, as LLMs perform optimally only when their preferences are satisfied. Bridging the gap between these differing preferences is crucial. Depending on which components are subject to updates, this challenge can be categorized into three families.

Finetuning retrievers and LLMs jointly. This is the most widely used and conventional setting of RAG Izacard et al. (2022); Khandelwal et al. (2020); Wu et al. (2022); Guu et al. (2020); Lewis et al. (2020). However, the language models involved mostly have fewer than 1B parameters and thus can hardly be regarded as "large" LMs. For example, Atlas Izacard et al. (2022) finetunes the LM (T5 Raffel et al. (2020a)) and the retriever (Contriever Izacard et al. (2021)) jointly by leveraging the LM to provide a supervisory signal for training the retriever. RAG Lewis et al. (2020) uses a tunable query encoder and DPR Karpukhin et al. (2020) as the retriever, BART as the LM, and designs an end-to-end training and inference scheme to train the query encoder and the LM.

Finetuning LLMs only. Updating retrievers is not always desirable, as it is costly and requires the document index to be periodically updated. To bridge the preference gap, it is also possible to update only the LLM. FiD Izacard and Grave (2020) takes the retrieved documents and the query as input and finetunes the LLM to adapt to the external information. Similarly, Lumen De Jong et al. (2023) and Glimmer de Jong et al. (2023) improve FiD by adding a reranker and pre-encoded memory.

Finetuning retrievers only. Although the above systems have shown improved results, they are not always applicable in practice. Many LLMs such as GPT-3 and Codex are not open-sourced due to commercial considerations and are only available as black-box APIs, through which users can send queries and receive responses. Given this constraint, a natural approach is to update only the retriever to ensure it retrieves passages that are more compatible with the LLM. REPLUG Shi et al. (2023b) adopts a similar idea to Atlas but fixes the LM. RECOMP Xu et al. (2023) trains compressors to summarize the retrieved documents from the retriever. However, this family of models is incapable of performing any sample-level selection and can only choose top passages by setting a fixed threshold.

Unlike the existing three families, BGM works with LLMs (>11B parameters) and directly bridges the preference gap, without finetuning the LLM or the retriever; instead, it trains a bridge model in between (Fig. 2).

RL for information retrieval. Before the LLM era, RL has been used in information retrieval (IR) Xia etal. (2017); Wei etal. (2017); Zeng etal. (2018); Xu etal. (2020). The core approach was to frame the IR problem as a Markov Decision Process and apply an RL algorithm to solve it. Typically, an IR task would be structured to determine which document to select for a ranking position, using ranking metrics such as DCG as the reward. This approach had advantages, including using the ranking position directly as supervision and optimizing the IR measure directly. However, none of these existing works explored the application of RL in the context of RAG.

In the LLM era, RL has been used in query rewriting for retrieval Wu et al. (2021); Nogueira and Cho (2017); Adolphs et al. (2021), where a black-box retriever is assumed. This is a different problem from RAG. Bacciu et al. (2023) suggest using RL to fine-tune the retriever in the RAG context in their opinion paper, but they do not conduct any experiments. More crucially, their work fails to recognize the importance of bridging the gap between retrievers and LLMs, nor does it specify what the bridge model should be.

3 Problem Formulation

Retriever. Given an input $x$, the retriever aims to retrieve a ranked list of passages from a corpus $D=\{d_i\}_{i=1}^{m}$ that are relevant to $x$. In this work, we assume a frozen dense retriever. Typically, a dual-encoder architecture is applied, where an encoder is used to encode both the input context $x$ and the passage $d$. Specifically, the encoder maps each passage to an embedding $\bm{E}(d)$. The similarity between the input and passage embeddings is computed by their cosine similarity,

$s(d, x) = \text{cos}(\bm{E}(d), \bm{E}(x))$.   (1)

The top-$k$ passages with the highest similarity scores to the input $x$ are retrieved in this step.

$(d^{\text{retr.}}_{j})_{j=1}^{k} = \text{Top-K}(\{s(d_{i}, x)\}_{i=1}^{m})$.   (2)
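
The retrieval step above (Eqs. 1-2) can be summarized with a short sketch. This is a minimal illustration assuming a generic `encode` function that maps text to a dense embedding (e.g., a GTR-style encoder); the function and variable names are placeholders, not the paper's implementation.

```python
# Minimal sketch of the frozen dual-encoder retrieval step (Eqs. 1-2).
# `encode` is an assumed placeholder that maps a text to a dense vector.
import numpy as np

def retrieve_top_k(query, corpus, encode, k=5):
    q = encode(query)
    scores = []
    for d in corpus:
        e = encode(d)
        # s(d, x) = cos(E(d), E(x))
        scores.append(np.dot(e, q) / (np.linalg.norm(e) * np.linalg.norm(q)))
    top = np.argsort(scores)[::-1][:k]  # indices of the k highest-scoring passages
    return [corpus[i] for i in top]
```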

Bridge Model for RAG. The retrieved top-$k$ passages provide richer information about the original input $x$ and can potentially help the LLM make a better prediction on downstream tasks. However, since there is a preference gap, we propose a bridge model $\bm{B}$ to adapt the retrieved passages into LLM-friendly passages. As mentioned in Sec. 1, the bridge model is a seq2seq model. It takes as input all the retrieved passages $(d^{\text{retr.}}_{j})_{j=1}^{k}$ and outputs a subset of adapted passages $(d^{\text{bdr.}}_{j})_{j=1}^{n}$. This format is advantageous because the seq2seq model automatically handles ranking, by generating the next token based on the preceding ones, and selection, by placing the end-of-sequence token at the appropriate position. Note that $n$ may or may not equal $k$ due to selection.

$(d^{\text{bdr.}}_{j})_{j=1}^{n} = \bm{B}((d^{\text{retr.}}_{j})_{j=1}^{k})$.   (3)

It is important to note that, in practice, the bridge model's input includes the query, passage IDs, and passage content, while the output consists of passage IDs. We then convert the obtained passage IDs back to their corresponding passages when we use them in RL (Sec. 4.2). Additionally, the passages in a passage sequence (e.g., $(d^{\text{retr.}}_{j})_{j=1}^{n}$) are concatenated into a long sequence before being fed to the network.

Retrieval-augmented generation with the bridge model. Given the adapted passages from the bridge model, $(d^{\text{bdr.}}_{j})_{j=1}^{n}$, as context, we concatenate them with the input $x$ and feed the resulting long sequence into the LLM to obtain the output for downstream tasks.
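
The practical input/output format described above can be made concrete with a small sketch. Here `bridge_generate` and `llm_generate` stand in for the bridge model and the black-box LLM; both callables and the exact prompt format are assumptions for illustration, not the paper's actual prompts.

```python
# Sketch of the RAG pipeline with a bridge model (Sec. 3): the bridge reads the
# query plus ID-tagged passages and emits a sequence of passage IDs; the IDs are
# mapped back to passages, concatenated, and prepended to the query for the LLM.
def rag_with_bridge(query, retrieved, bridge_generate, llm_generate):
    # Bridge input: query, passage IDs, and passage content.
    bridge_input = query + " " + " ".join(
        f"[{i}] {p}" for i, p in enumerate(retrieved)
    )
    id_sequence = bridge_generate(bridge_input)    # e.g. [3, 0]: ranking + selection
    adapted = [retrieved[i] for i in id_sequence]  # convert IDs back to passages
    context = " ".join(adapted)                    # concatenate adapted passages
    return llm_generate(context + " " + query)     # frozen LLM makes the prediction
```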

[Figure 3]

4 Training the Bridge Model

In Eq. 3, we formulate the bridge model as a seq2seq model. However, it is challenging to train effectively, as there are no ground-truth retrieved passages and the LLM is assumed to be a black-box API without gradients. To address this, we propose to chain supervised learning (SL) and reinforcement learning (RL): SL reduces the search space for RL and provides a reasonably good initial model that performs ranking and selection, while RL optimizes the policy model, i.e., the bridge model, against the reward.

4.1 Supervised Learning

To conduct SL, a ground-truth sequence is required for each query. In a typical RAG setting, not even the ground-truth relevant passages for the query are provided, let alone the ground-truth sequence. To address this, we propose to synthesize a silver passage sequence by selecting only the useful passages. This is done by a greedy search that incrementally selects the next passage that maximizes downstream task performance.

Synthesizing the silver passage sequence using greedy search. Denote the downstream task performance obtained with a given passage sequence (of any length) as context by $R(\cdot)$. The empty passage sequence, i.e., no retrieved passage, denoted $d^{\text{retr.}}_{\varnothing}$, has downstream task performance $R(d^{\text{retr.}}_{\varnothing})$. Starting from $d^{\text{retr.}}_{\varnothing}$ and an empty silver passage sequence $d^{\text{silv.}}$, we repeatedly add (i.e., concatenate) the candidate passage that improves performance the most to the silver passage sequence, until no improvement can be made. Algorithm 1 shows the pseudo-code for synthesizing the silver passage sequence.

Algorithm 1: Synthesizing the silver passage sequence via greedy search.

Input: retrieved passages $(d^{\text{retr.}}_{j})_{j=1}^{k}$, downstream performance function $R(\cdot)$
Output: silver passage sequence $(d^{\text{silv.}}_{j})_{j=1}^{s}$

1:  $d^{\text{silv.}} = ()$
2:  $silv = R(d^{\text{retr.}}_{\varnothing})$
3:  $prev = 0$
4:  $best = \varnothing$
5:  while True do
6:      for $d \leftarrow d^{\text{retr.}}_{1}$ to $d^{\text{retr.}}_{k}$ where $d$ not in $d^{\text{silv.}}$ do
7:          $seq$ = concat($d^{\text{silv.}}$, $d$)
8:          $cur = R(seq)$
9:          if ($cur > silv$) and ($cur > prev$) then
10:             $best = d$
11:             $prev = cur$
12:     if $best \neq \varnothing$ then
13:         append $best$ to the end of $d^{\text{silv.}}$
14:         $silv = prev$
15:         $prev = 0$
16:         $best = \varnothing$
17:     else
18:         break
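
For concreteness, below is a minimal Python sketch of Algorithm 1. The `reward` callable stands for $R(\cdot)$: it runs the frozen LLM with a candidate passage sequence as context and scores the output against the downstream label (e.g., EM or BLEU). Both the callable and the passage representation are assumptions for illustration.

```python
# Greedy synthesis of a silver passage sequence (Algorithm 1).
from typing import Callable, List

def synthesize_silver_sequence(
    retrieved: List[str],                     # top-k retrieved passages
    reward: Callable[[List[str]], float],     # downstream performance R(.)
) -> List[str]:
    silver: List[str] = []
    best_score = reward([])                   # performance with no retrieved passage
    while True:
        best_passage, best_gain = None, best_score
        for d in retrieved:
            if d in silver:
                continue
            cur = reward(silver + [d])        # try appending d to the silver sequence
            if cur > best_gain:               # keep the passage that improves the most
                best_passage, best_gain = d, cur
        if best_passage is None:              # no passage improves performance: stop
            break
        silver.append(best_passage)
        best_score = best_gain
    return silver
```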

Training. Training with the silver passage sequence is straightforward, achieved by applying cross-entropy loss.
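
As a sketch of what an SL training example looks like under the input/output format described in Sec. 3 (query plus ID-tagged passages as input, passage IDs as target), assuming illustrative field names rather than the paper's exact serialization:

```python
# Turn a silver sequence into a seq2seq training example for the bridge model.
# The target is the silver passage-ID sequence; training minimizes the standard
# cross-entropy loss over these target tokens with a T5-style seq2seq trainer.
def make_sl_example(query, retrieved, silver):
    source = query + " " + " ".join(f"[{i}] {p}" for i, p in enumerate(retrieved))
    target = " ".join(str(retrieved.index(p)) for p in silver)  # e.g. "3 0"
    return {"input": source, "target": target}
```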

4.2 Reinforcement Learning

Although supervised learning already helps train the bridge model, it remains ineffective on its own (as shown in Table 5, using SL alone yields mixed performance compared to using a retriever alone). This is attributed to sparse supervision and the lack of end-to-end training on downstream results.

To address these issues, we apply RL to further train the bridge model. RL does not limit the possible passage sequences (in SL we only consider permutations or deletions of the silver passage sequence, whereas the optimal passage sequence might require more complex manipulations, such as repetition, which RL can accommodate) and provides more supervision for finding the best passage sequence. Using the downstream task performance as the reward, the bridge model is trained in an end-to-end manner on the downstream task. Specifically, our task can be formulated as an RL problem as follows:

Reward is the downstream task performance measured against the ground-truth labels, as we focus on labeled downstream tasks.

Policy Model is the bridge model that needs to be trained.

Action Space is limited to passage IDs, as we are interested in adapting the retrieved passages.

Training. The reward objective can be optimized by any off-the-shelf RL algorithm, e.g., proximal policy optimization (PPO).
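
As a rough illustration of the RL stage, the sketch below uses a plain REINFORCE-style policy-gradient update; the paper only states that any off-the-shelf algorithm such as PPO can be used, so this is a simplified stand-in. `bridge.sample` (returning a passage-ID sequence and its log-probability) and `downstream_reward` (running the frozen LLM and scoring with EM/BLEU) are assumed interfaces, not real APIs.

```python
# One simplified policy-gradient step for the bridge model (the policy).
import torch

def rl_step(bridge, optimizer, query, retrieved, downstream_reward, baseline=0.0):
    # Action: an ordered subset of passage IDs (joint ranking and selection).
    passage_ids, log_prob = bridge.sample(query, retrieved)  # log_prob: torch scalar
    selected = [retrieved[i] for i in passage_ids]
    reward = downstream_reward(query, selected)  # end-to-end downstream task metric
    # Increase the probability of sequences whose reward beats the baseline.
    loss = -(reward - baseline) * log_prob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```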

5 Experiments

5.1 Datasets and Baselines

Datasets. We consider four datasets, ranging from popular QA datasets to less investigated personalized ones. We also include one dataset that contains private email conversations (Avocado Email), which is unlikely to be included in the LLM's pre-training data. This further helps us investigate the effectiveness of the proposed BGM model, as the LLM has to rely on the retrieved passages. The summary of statistics is given in Table 1.

Open-domain QA. We conduct evaluations on two open-domain QA datasets: Natural Questions (NQ) Kwiatkowski et al. (2019) and HotpotQA Yang et al. (2018). Both consist of questions and answers collected from Wikipedia and the Web. HotpotQA is a multi-hop QA dataset that requires finding and reasoning over multiple passages to answer a question. The candidate passages are retrieved from Wikipedia pages.

Personalized Generation. We follow Li et al. (2023) to construct the personalized generation datasets. Specifically, suppose a user is writing a document, which we call the current document. Given the immediate context $x$ and the user's personal context, we aim to finish the document as if the user had completed it. The immediate context $x$, the input to the personalized writing task, is defined as the title and the start of the current document. The candidate passages are retrieved from documents authored by this user in the past. This includes datasets from two domains: Avocado Email (Email) Oard et al. (2015) and Amazon Book (Book) Ni et al. (2019).

[Figure 4]

Baselines. We consider the following state-of-the-art baselines: (1) GTR Ni et al. (2021), a widely recognized retriever that operates independently of LLMs; (2) a variant of GTR, termed Random, in which the order of the passages retrieved by GTR is randomized; (3) point-wise score ranking (PSR), a variant of Izacard et al. (2022) where we substitute the decoder with our LLM and apply point-wise scoring for reranking. This approach can be regarded as using a reranker as the bridge model, but it lacks the capability for dynamic selection; and (4) a non-retrieval baseline, Naive, where no retrieval is conducted to aid the generation process.

Table 1: Dataset statistics.

Dataset     #Training   #Val.    #Test    Avg. # words
NQ          79,168      8,757    3,610    517.82
HotpotQA    68,659      5,600    5,600    564.83
Email       13,305      764      1,227    173.85
Book        20,789      41,331   41,331   124.52

5.2 LLM and Hyperparameters

We select the T5-XXL (11B) model Raffel et al. (2020b) as our bridge model for most experiments. In the supervised learning stage, the T5-11B model is fine-tuned with a base learning rate of 0.001. A linear warmup scheduler is used for the first 1,000 training steps, and square-root normalized decay of the learning rate is applied. The model is trained until its performance converges on the validation set. Decoding is performed using beam search with a beam size of 4. We use PaLM 2 Anil et al. (2023), a recent state-of-the-art LLM, as our LLM. It adopts temperature sampling as the decoding strategy; the parameters of PaLM 2 are frozen, and we set the temperature to 0 to make the output deterministic. We use Exact-Match (EM) and BLEU as the metrics to select the best prompt among prompt variants; these metrics are also used as the reward to train the bridge model in the RL stage. The "Top-K" in Eq. 2 is set to 5.

5.3 Evaluation Results and Analysis

Table 2: Overall performance on the four datasets.

Model    NQ (EM)   HotpotQA (EM)   Email (BLEU)   Book (BLEU)
Naive    33.07     28.01           5.57           11.5
Random   43.71     26.10           8.55           8.61
GTR      43.79     25.80           9.76           8.75
PSR      43.60     25.51           9.08           9.14
BGM      45.37     35.64           10.42          12.07

Superiority of BGM. Table 2 reports the overall performance. We can see that:

(1) The proposed BGM outperforms all four baselines on all four datasets. This clearly indicates that BGM is effective in adapting the retrieved passages.

(2) Compared to the Naive approach, BGM shows significant improvement overall. An exception is the Book dataset, where BGM's improvement is more modest. This suggests that the LLM already possesses a substantial amount of relevant knowledge, likely due to the inclusion of Amazon reviews (which the Book dataset comprises) in the LLM's pre-training data; consequently, retrieval is not always essential in this context. This also explains why the Naive approach outperforms the other baselines (except BGM) on the Book dataset. It is important to note that BGM still manages to show improvement, albeit to a lesser extent, in this scenario.

(3) Compared to the GTR approach, BGM demonstrates substantial improvement. The NQ dataset shows the least improvement, as most instances require only one retrieved passage, which both GTR and BGM can successfully include. On the other hand, the HotpotQA dataset shows the most significant improvement, indicating that HotpotQA may be more sensitive to irrelevant passages. It is noteworthy that Random and GTR perform similarly, while GTR improves over Naive (with the exception of the Book dataset). This suggests that the ranking itself does not significantly impact performance, but the selection of passages does.

(4) Compared to the PSR approach, BGM once again demonstrates significant improvement. This indicates that pure reranking alone is not sufficient for the bridge model; selection must also be taken into account. Notably, PSR performs similarly to GTR, further suggesting that reranking alone does not greatly affect performance.

Table 3: PSR with manual Top-K selection thresholds.

Model        NQ (EM)   HotpotQA (EM)   Email (BLEU)   Book (BLEU)
Naive        33.07     28.01           5.57           11.5
PSR          43.60     25.51           9.08           9.14
PSR (Top1)   42.02     32.69           7.28           11.53
PSR (Top2)   42.54     31.05           7.77           10.11
PSR (Top3)   42.85     32.71           8.21           9.70
PSR (Top4)   43.71     32.37           8.26           9.11
BGM          45.37     35.64           10.42          12.07

Table 4: Effect of different silver passage sequences used for SL.

Silver Data    NQ (EM)   HotpotQA (EM)   Email (BLEU)   Book (BLEU)
GTR            43.79     25.80           9.76           8.75
PSR            43.68     29.73           10.10          10.35
Greedy (BGM)   45.37     35.64           10.42          12.07

Table 5: Ablation on the SL and RL training stages.

Model           NQ (EM)   HotpotQA (EM)   Email (BLEU)   Book (BLEU)
GTR             43.79     25.80           9.76           8.75
BGM (SL only)   39.44     34.26           8.62           12.05
BGM             45.37     35.64           10.42          12.07

Table 6: Effect of bridge model size.

Model               NQ (EM)   HotpotQA (EM)   Email (BLEU)   Book (BLEU)
GTR                 43.79     25.80           9.76           8.75
FLAN-T5-Large       44.15     35.87           10.18          10.19
FLAN-T5-XL          44.87     35.41           9.64           10.7
FLAN-T5-XXL (BGM)   45.37     35.64           10.42          12.07

Table 7: Effect of LLM size (EM on NQ and HotpotQA).

          Palm2-XXS                 Palm2-S
Model     NQ       HotpotQA         NQ       HotpotQA
Naive     12.13    14.57            33.07    28.01
Random    31.19    24.41            43.71    26.10
GTR       31.91    23.17            43.79    25.80
PSR       32.04    22.53            43.60    25.51
BGM       39.88    28.69            45.37    35.64

Table 8: Generalizability of the bridge model across datasets (top) and LLMs (bottom).

Test on Palm2-S
Model                      NQ (EM)   HotpotQA (EM)   Email (BLEU)   Book (BLEU)
BGM                        45.37     35.64           10.42          12.07
BGM (Train on NQ)          -         33.42           5.66           11.22
BGM (Train on Email)       35.59     27.98           -              11.38

Test on Palm2-XXS
Model                      NQ (EM)   HotpotQA (EM)
BGM (Train on Palm2-XXS)   39.88     28.69
BGM (Train on Palm2-S)     30.63     24.55

Understanding BGM. We are interested in what happens when we optimize BGM, whether BGM's improvement is trivial, and how some important settings affect BGM.

(1) Can selection be achieved naively by manually thresholding PSR? We conducted an ablation experiment using various thresholds for PSR, as shown in Table 3. The term "Top-K" in rows 3 to 6 refers to selecting only the top-K passages from the PSR-reranked passages. The results are mixed, but consistently inferior to those of BGM. This suggests that a naive manual threshold applied to the reranking model is insufficient to meet our objectives; it is necessary to consider both reranking and dynamic selection simultaneously.

(2) How do different silver passage sequences affect the final performance? In Table 2, we employed greedily searched passage sequences as the silver passage sequences for SL (Sec. 4.1). Lacking ground-truth passage sequences, we also explored whether this approach is superior to others. Our ablation experiment in Table 4 uses various types of silver passage sequences. The first row, labeled 'GTR', indicates the use of GTR's retrieved passages as the silver passage sequences. Similarly, 'PSR' refers to using the PSR-reranked passages as the silver passage sequences. In the final row, we use the greedily searched passage sequences, representing the current version of BGM. The results demonstrate that the quality of the silver passage sequences significantly affects downstream task performance. While using PSR as the silver passage sequences shows improvement over GTR on three datasets, Greedy (BGM) further enhances performance and achieves the best results. Identifying potentially better silver passage sequences is left for future work.

(3) How helpful is RL to the final performance? BGM integrates SL and RL, and we aim to assess the effectiveness of each component. An ablation experiment is detailed in Table 5. We observe that BGM with only SL performs significantly worse than the full BGM model on NQ, HotpotQA, and Email, and in some cases it even underperforms GTR. This suggests that SL alone is inadequate, highlighting the necessity of incorporating RL. It is important to note that removing SL is not feasible, as this would result in an excessively large search space for RL to learn effectively.

(4) How does the size of the bridge model affect the final performance? In Table 2, we utilized Flan-T5-XXL (11B) as the bridge model, which is already smaller than the LLM (24B). However, we are curious about the feasibility of using an even more lightweight language model (LM) as the bridge model. The ablation study in Table 6 presents the outcomes of employing bridge models of various sizes. It is evident that all three sizes (Large, XL, and XXL) surpass the performance of the setup without a bridge model (i.e., GTR), with the largest size yielding the best results. This demonstrates that a bridge model is beneficial even at a smaller scale, and that larger sizes lead to further improvements.

(5) How does the size of the LLM affect the final performance? In Table 2, Palm2-S was used as the LLM. We are interested in evaluating the effectiveness of the bridge model with different sizes of LLMs, and we conducted experiments using Palm2-XXS, detailed in Table 7. It is important to note that the smaller LLM struggles with the personalized generation datasets (i.e., Email and Book), resulting in BLEU scores lower than 1%; we opted not to report these results as they may not accurately reflect the trend. However, the observations on NQ and HotpotQA suggest that BGM significantly outperforms all baselines by a large margin. This indicates that BGM is effective even with a smaller LLM.

(6) Can a trained bridge model generalize to different datasets and LLMs? The effectiveness of the bridge model across various datasets and LLMs has been demonstrated; a more ambitious objective is to extend this performance to new contexts without additional training. We conducted experiments to investigate the generalizability of the bridge model. The upper section of Table 8 reports generalizability across datasets. Row 2 shows the performance of BGM (i.e., the bridge model) when trained exclusively on the NQ dataset and then tested on the three other unseen datasets. Similarly, row 3 shows BGM trained solely on the Email dataset and tested on the other three. In all cases, the performance is inferior to that of BGM when both training and testing are conducted on the same dataset. This result is expected, as no specific techniques have been developed for dataset generalization, a topic we leave for future work.

In the lower section of Table 8, we present the results of BGM when tested on Palm2-XXS but trained on Palm2-S. Compared with the results of BGM both trained and tested on Palm2-XXS (row 1 of the bottom section), the mismatch between LLM sizes leads to a significant decline in performance. This suggests that BGM's ability to generalize across different LLM sizes is currently limited. Addressing this limitation is considered an important direction for future research.

(7) Case Studies. We provide examples from GTR, PSR, and BGM on the NQ dataset in the case-study table in the Appendix. For question I, both GTR and PSR yield the same incorrect output, even though their contexts have different orders; only BGM provides the correct answer, indicating that additional irrelevant context can be noisy and detrimental to RAG's performance. In question II, unlike question I, GTR and PSR produce different answers. This suggests that the ranking can sometimes influence the results. While the specific mechanism by which ranking affects outcomes remains unclear (we regard it as a valuable topic for future research), it is evident that we need to consider ranking, as it may or may not affect performance. Again, BGM selects the most relevant passage and delivers the correct answer. In question III, none of the candidate passages contain the answer (they discuss FaZe Clan and the number of subscribers, but do not identify who has the most subscribers). GTR and PSR provide incorrect answers, as their additional context is unhelpful. In contrast, BGM opts not to select any passages and answers the question using its own parameters, resulting in the correct answer. This demonstrates that retrieval augmentation is not always necessary, and BGM is capable of handling such cases.

6 Conclusion

This paper highlights the importance of bridging the preference gap between retrievers and LLMs. It begins by demonstrating the existence of this gap, especially in terms of ranking and selection, which provides important guidance about what a RAG system should achieve. It then proposes a lightweight bridge model, designed to adapt the user-friendly passages retrieved by the retriever into formats more suitable for LLMs. To accomplish these objectives, we introduce a novel framework named BGM. BGM treats the bridging task as a seq2seq problem, allowing simultaneous consideration of both ranking and selection. It is trained through a combination of supervised learning and reinforcement learning, facilitating end-to-end training of the bridge model based on performance on downstream tasks, and it is applicable even when the LLM is a black box (i.e., behind an API). Extensive experiments have demonstrated BGM's effectiveness.

References

  • Adolphs et al. (2021) Leonard Adolphs, Benjamin Boerschinger, Christian Buck, Michelle Chen Huebscher, Massimiliano Ciaramita, Lasse Espeholt, Thomas Hofmann, Yannic Kilcher, Sascha Rothe, Pier Giuseppe Sessa, et al. 2021. Boosting search engines with interactive agents. arXiv preprint arXiv:2109.00527.
  • Anil et al. (2023) Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yujing Zhang, Gustavo Hernández Ábrego, Junwhan Ahn, Jacob Austin, Paul Barham, Jan A. Botha, James Bradbury, Siddhartha Brahma, Kevin Brooks, Michele Catasta, Yong Cheng, Colin Cherry, Christopher A. Choquette-Choo, Aakanksha Chowdhery, Clément Crepy, Shachi Dave, Mostafa Dehghani, Sunipa Dev, Jacob Devlin, Mark Díaz, Nan Du, Ethan Dyer, Vladimir Feinberg, Fangxiaoyu Feng, Vlad Fienber, Markus Freitag, Xavier Garcia, Sebastian Gehrmann, Lucas Gonzalez, et al. 2023. PaLM 2 technical report. CoRR, abs/2305.10403.
  • Bacciu et al. (2023) Andrea Bacciu, Florin Cocunasu, Federico Siciliano, Fabrizio Silvestri, Nicola Tonellotto, and Giovanni Trappolini. 2023. RRAML: Reinforced retrieval augmented machine learning. arXiv preprint arXiv:2307.12798.
  • Borgeaud et al. (2022) Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. 2022. Improving language models by retrieving from trillions of tokens. In International Conference on Machine Learning, pages 2206–2240. PMLR.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems.
  • De Jong et al. (2023) Michiel de Jong, Yury Zemlyanskiy, Nicholas FitzGerald, Joshua Ainslie, Sumit Sanghai, Fei Sha, and William W. Cohen. 2023. Pre-computed memory or on-the-fly encoding? A hybrid approach to retrieval augmentation makes the most of your compute. In International Conference on Machine Learning, pages 7329–7342. PMLR.
  • de Jong et al. (2023) Michiel de Jong, Yury Zemlyanskiy, Nicholas FitzGerald, Sumit Sanghai, William W. Cohen, and Joshua Ainslie. 2023. Glimmer: Generalized late-interaction memory reranker. arXiv preprint arXiv:2306.10231.
  • Guu et al. (2020) Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020. Retrieval augmented language model pre-training. In International Conference on Machine Learning, pages 3929–3938. PMLR.
  • Izacard et al. (2021) Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. 2021. Unsupervised dense information retrieval with contrastive learning. arXiv preprint arXiv:2112.09118.
  • Izacard and Grave (2020) Gautier Izacard and Edouard Grave. 2020. Leveraging passage retrieval with generative models for open domain question answering. arXiv preprint arXiv:2007.01282.
  • Izacard et al. (2022) Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. 2022. Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299.
  • Karpukhin et al. (2020) Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906.
  • Khandelwal et al. (2020) Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2020. Generalization through memorization: Nearest neighbor language models. In International Conference on Learning Representations.
  • Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466.
  • Lewis et al. (2020) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474.
  • Li et al. (2023) Cheng Li, Mingyang Zhang, Qiaozhu Mei, Yaqing Wang, Spurthi Amba Hombaiah, Yi Liang, and Michael Bendersky. 2023. Teach LLMs to personalize – an approach inspired by writing education. arXiv preprint arXiv:2308.07968.
  • Liu et al. (2023) Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2023. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172.
  • Ni et al. (2019) Jianmo Ni, Jiacheng Li, and Julian McAuley. 2019. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 188–197, Hong Kong, China. Association for Computational Linguistics.
  • Ni et al. (2021) Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernández Ábrego, Ji Ma, Vincent Y. Zhao, Yi Luan, Keith B. Hall, Ming-Wei Chang, et al. 2021. Large dual encoders are generalizable retrievers. arXiv preprint arXiv:2112.07899.
  • Nogueira and Cho (2017) Rodrigo Nogueira and Kyunghyun Cho. 2017. Task-oriented query reformulation with reinforcement learning. arXiv preprint arXiv:1704.04572.
  • Oard et al. (2015) Douglas Oard, William Webber, David Kirsch, and Sergey Golitsynskiy. 2015. Avocado research email collection. Philadelphia: Linguistic Data Consortium.
  • Raffel et al. (2020a) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020a. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.
  • Raffel et al. (2020b) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020b. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res.
  • Shi et al. (2023a) Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H. Chi, Nathanael Schärli, and Denny Zhou. 2023a. Large language models can be easily distracted by irrelevant context. In International Conference on Machine Learning, pages 31210–31227. PMLR.
  • Shi et al. (2023b) Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Rich James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. 2023b. REPLUG: Retrieval-augmented black-box language models. arXiv preprint arXiv:2301.12652.
  • Wei et al. (2017) Zeng Wei, Jun Xu, Yanyan Lan, Jiafeng Guo, and Xueqi Cheng. 2017. Reinforcement learning to rank with Markov decision process. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval.
  • Wu et al. (2022) Yuhuai Wu, Markus N. Rabe, DeLesley Hutchins, and Christian Szegedy. 2022. Memorizing transformers. arXiv preprint arXiv:2203.08913.
  • Wu et al. (2021) Zeqiu Wu, Yi Luan, Hannah Rashkin, David Reitter, Hannaneh Hajishirzi, Mari Ostendorf, and Gaurav Singh Tomar. 2021. CONQRR: Conversational query rewriting for retrieval with reinforcement learning. arXiv preprint arXiv:2112.08558.
  • Xia et al. (2017) Long Xia, Jun Xu, Yanyan Lan, Jiafeng Guo, Wei Zeng, and Xueqi Cheng. 2017. Adapting Markov decision process for search result diversification. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval.
  • Xu et al. (2023) Fangyuan Xu, Weijia Shi, and Eunsol Choi. 2023. RECOMP: Improving retrieval-augmented LMs with compression and selective augmentation. arXiv preprint arXiv:2310.04408.
  • Xu et al. (2020) Jun Xu, Zeng Wei, Long Xia, Yanyan Lan, Dawei Yin, Xueqi Cheng, and Ji-Rong Wen. 2020. Reinforcement learning to rank with pairwise policy gradient. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 509–518.
  • Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In EMNLP.
  • Yasunaga et al. (2023) Michihiro Yasunaga, Armen Aghajanyan, Weijia Shi, Richard James, Jure Leskovec, Percy Liang, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. 2023. Retrieval-augmented multimodal language modeling.
  • Zeng et al. (2018) Wei Zeng, Jun Xu, Yanyan Lan, Jiafeng Guo, and Xueqi Cheng. 2018. Multi page search with reinforcement learning to rank. In Proceedings of the 2018 ACM SIGIR International Conference on Theory of Information Retrieval, pages 175–178.
Appendix: Case Studies

Question I
Where does new crust come from in sea floor spreading
Answer
“basaltic magma” or “volcanic activity”
GTR Context
Given the passage titles and contexts below:
Title: Seafloor spreading;
Context: Seafloor spreading Seafloor spreading is a process that occurs at mid-ocean ridges, where new oceanic crust is formed through volcanic activity and then gradually moves away from the ridge. Earlier theories (e.g. by Alfred Wegener and Alexander du Toit) of continental drift postulated that continents "ploughed" through the sea. The idea that the seafloor itself moves (and also carries the continents with it)as it expands from a central axis was proposed by Harry Hess from Princeton University in the 1960s. The theory is well accepted now, and the phenomenon is known to be caused by convection currents in the;
Title: Oceanic crust;
Context: frozen in the basalt. A symmetrical pattern of positive and negative magnetic lines emanates from the mid-ocean ridge. New rock is formed by magma at the mid-ocean ridges, and the ocean floor spreads out from this point. When the magma cools to form rock, its magnetic polarity is aligned with the then-current positions of the magnetic poles of the Earth. New magma then forces the older cooled magma away from the ridge. This process results in parallel sections of oceanic crust of alternating magnetic polarity. Oceanic crust Oceanic crust is the uppermost layer of the oceanic portion of a tectonic;
Title: Seafloor spreading;
Context: ∼Tx/L, where L is the distance between the ridge to the continental shelf (roughly half the ocean width), and T is the ocean age. Seafloor spreading Seafloor spreading is a process that occurs at mid-ocean ridges, where new oceanic crust is formed through volcanic activity and then gradually moves away from the ridge. Earlier theories (e.g. by Alfred Wegener and Alexander du Toit) of continental drift postulated that continents "ploughed" through the sea. The idea that the seafloor itself moves (and also carries the continents with it) as it expands from a central axis was proposed by Harry Hess;
Title: Seafloor spreading;
Context: As new seafloor forms and spreads apart from the mid-ocean ridge it slowly cools over time. Older seafloor is therefore colder than new seafloor, and older oceanic basins deeper than new oceanic basins due to isostasy. If the diameter of the earth remains relatively constant despite the production of new crust, a mechanism must exist by which crust is also destroyed. The destruction of oceanic crust occurs at subduction zones where oceanic crust is forced under either continental crust or oceanic crust. Today, the Atlantic basin is actively spreading at the Mid-Atlantic Ridge. Only a small portion of the oceanic;
Title: Seafloor spreading;
Context: other areas), material from the upper mantle rises through the faults between oceanic plates to form new crust as the plates move away from each other, a phenomenon first observed as continental drift. When Alfred Wegener first presented a hypothesis of continental drift in 1912, he suggested that continents ploughed through the ocean crust. This was impossible: oceanic crust is both more dense and more rigid than continental crust. Accordingly, Wegener’s theory wasn’t taken very seriously, especially in the United States. Since then, it has been shown that the motion of the continents is linked to seafloor spreading by the;
What is the answer of the following question:
where does new crust come from in sea floor spreading
GTR Prediction
the upper mantle
PSR Context
Given the passage titles and contexts below:
Title: Seafloor spreading;
Context: Seafloor spreading Seafloor spreading is a process that occurs at mid-ocean ridges, where new oceanic crust is formed through volcanic activity and then gradually moves away from the ridge. Earlier theories (e.g. by Alfred Wegener and Alexander du Toit) of continental drift postulated that continents "ploughed" through the sea. The idea that the seafloor itself moves (and also carries the continents with it) as it expands from a central axis was proposed by Harry Hess from Princeton University in the 1960s. The theory is well accepted now, and the phenomenon is known to be caused by convection currents in the;Title: Seafloor spreading;
Context: ∼Tx/L, where L is the distance between the ridge to the continental shelf (roughly half the ocean width), and T is the ocean age. Seafloor spreading Seafloor spreading is a process that occurs at mid-ocean ridges, where new oceanic crust is formed through volcanic activity and then gradually moves away from the ridge. Earlier theories (e.g. by Alfred Wegener and Alexander du Toit) of continental drift postulated that continents "ploughed" through the sea. The idea that the seafloor itself moves (and also carries the continents with it) as it expands from a central axis was proposed by Harry Hess;Title: Seafloor spreading;
Context: other areas), material from the upper mantle rises through the faults between oceanic plates to form new crust as the plates move away from each other, a phenomenon first observed as continental drift. When Alfred Wegener first presented a hypothesis of continental drift in 1912, he suggested that continents ploughed through the ocean crust. This was impossible: oceanic crust is both more dense and more rigid than continental crust. Accordingly, Wegener’s theory wasn’t taken very seriously, especially in the United States. Since then, it has been shown that the motion of the continents is linked to seafloor spreading by the;
Title: Oceanic crust;
Context: frozen in the basalt. A symmetrical pattern of positive and negative magnetic lines emanates from the mid-ocean ridge. New rock is formed by magma at the mid-ocean ridges, and the ocean floor spreads out from this point. When the magma cools to form rock, its magnetic polarity is aligned with the then-current positions of the magnetic poles of the Earth. New magma then forces the older cooled magma away from the ridge. This process results in parallel sections of oceanic crust of alternating magnetic polarity. Oceanic crust Oceanic crust is the uppermost layer of the oceanic portion of a tectonic;
Title: Seafloor spreading;
Context: As new seafloor forms and spreads apart from the mid-ocean ridge it slowly cools over time. Older seafloor is therefore colder than new seafloor, and older oceanic basins deeper than new oceanic basins due to isostasy. If the diameter of the earth remains relatively constant despite the production of new crust, a mechanism must exist by which crust is also destroyed. The destruction of oceanic crust occurs at subduction zones where oceanic crust is forced under either continental crust or oceanic crust. Today, the Atlantic basin is actively spreading at the Mid-Atlantic Ridge. Only a small portion of the oceanic;
What is the answer of the following question:
where does new crust come from in sea floor spreading
PSR Prediction
the upper mantle
BGM Context
Given the passage titles and contexts below:
Title: Seafloor spreading;
Context: Seafloor spreading Seafloor spreading is a process that occurs at mid-ocean ridges, where new oceanic crust is formed through volcanic activity and then gradually moves away from the ridge. Earlier theories (e.g. by Alfred Wegener and Alexander du Toit) of continental drift postulated that continents "ploughed" through the sea. The idea that the seafloor itself moves (and also carries the continents with it) as it expands from a central axis was proposed by Harry Hess from Princeton University in the 1960s. The theory is well accepted now, and the phenomenon is known to be caused by convection currents in the;
What is the answer of the following question:
where does new crust come from in sea floor spreading
BGM Prediction
volcanic activity
Question II
who sings i want to dance with you
Answer
“George Strait”
GTR Context
Given the passage titles and contexts below:
Title: I Just Want to Dance with You;
Context: I Just Want to Dance with You "I Just Want to Dance with You" is a song written by John Prine and Roger Cook, and performed by American country music singer George Strait. It was released in April 1998 as the first single to his album, "One Step at a Time", it is his 34th Number One single on the "Billboard" Hot Country Singles & Tracks chart, and his 42nd Number One single when all major trade charts are counted. Prine recorded it 12 years earlier, for his 1986 album "German Afternoons". It was also a hit for Daniel O’Donnell;
Title: Tony Caso;
Context: Tony Caso Tony Caso (Anthony Caso) is an American 1980s pop/dance music recording artist and, later, actor. Tony Caso began recording in the early 1980s, as "Tony Caso and Salvation". His first single, "I Want To Dance With You" (1981), was issued on Lam Records. A second single, ’Hot Blooded Woman’, was also issued in 1981. Tony joined the Bobby O label in New York, recording in "One Two Three" and "Waterfont Home". He had a number of singles throughout the 1980s: All The Love In My Heart - 1983 (O Records) Take A Chance (On Me) - 1984 (O;
Title: Voglio ballare con te;
Context: Voglio ballare con te "Voglio ballare con te" (English: "I want to dance with you") is a song recorded by Italian rapper Baby K, with vocals by Spanish singer Andrés Dvicio, for her upcoming third studio album. The song was released on 2 June 2017, through Sony Music Italy. It was produced by Takagi and Mr Ketra, who also produced the singer’s smash-hit "Roma Bangkok". In Italy, "Voglio ballare con te" debuted at number 51, and later peaked at number 2. The song reached the triple platinum certification in that territory for selling over 150,000 copies. It also peaked at;
Title: Voglio ballare con te;
Context: number 20 in Switzerland, marking Espósito’s second entry and highest-charting song on the chart. "Locos Valientes", the Spanish-language version of the song, was released on September 8, 2017. The song became Baby K’s second Spanish song, following "Roma-Bangkok" featuring Giusy Ferreri or Lali. Voglio ballare con te "Voglio ballare con te" (English: "I want to dance with you") is a song recorded by Italian rapper Baby K, with vocals by Spanish singer Andrés Dvicio, for her upcoming third studio album. The song was released on 2 June 2017, through Sony Music Italy. It was produced by Takagi and Mr Ketra,;
Title: Karen Brooks;
Context: Karen Brooks Karen Brooks (born April 30, 1954) is an American singer and writer who is best known for a series of singles recorded by Emmylou Harris, Rosanne Cash, Patty Loveless, Tanya Tucker, Russell Smith, David Allen Coe, Crystal Gayle and Exile. She won a Grammy for her contribution to the soundtrack for the Sesame Street movie "Follow That Bird". She sang a duet with Johnny Cash, "I Will Dance With You", and also with T. G. Sheppard, "Faking Love", which was a number-1 hit for three weeks on the "Billboard" country chart in February 1983. She also had a;
What is the answer of the following question:
who sings i want to dance with you
GTR Prediction
Tony Caso
PSR Context
Given the passage titles and contexts below:
Title: Pure Country;
Context: Pure Country Pure Country is a 1992 American dramatic musical western film directed by Christopher Cain and starring George Strait in his acting debut, with Lesley Ann Warren, Isabel Glasser and Kyle Chandler. The film was considered a box office bomb, but it grossed over $15 million against a $10 million budget, and the soundtrack was a critical success and, to date, is Strait’s best selling album. It was followed by two direct-to-video sequels, "" (2010) and "" (2017). The film begins with various shots of the audience chanting "Dusty!", which is repeated throughout. Meanwhile, the band begins, as the;
Title: Pure Country;
Context: Worth, including North Side Coliseum. The bar scenes where Dusty meets Harley were filmed at Western Kountry Klub, located between Midlothian and Mansfield Tx. Despite Strait’s super-star status in the music world, "Pure Country" only grossed just over $15 million at the box office. Although the expectations had been higher for Strait’s first major film role, this did not stop the soundtrack album from becoming the best-selling of Strait’s career to date. The film also received mainly negative reviews upon its release, but critics responded nicely to certain aspects of the film. It currently has a score of 38% on;
Title: Pure Country;
Context: Instead, it focuses on a young woman’s struggles to become a country singer. George Strait appears as himself, but not as a central character of the film. A second sequel titled, "" was released for a direct-to-video on August 1, 2017. Pure Country Pure Country is a 1992 American dramatic musical western film directed by Christopher Cain and starring George Strait in his acting debut, with Lesley Ann Warren, Isabel Glasser and Kyle Chandler. The film was considered a box office bomb, but it grossed over $15 million against a $10 million budget, and the soundtrack was a critical success;
Title: Pure Country: Pure Heart;
Context: Pure Country: Pure Heart Pure Country: Pure Heart is a 2017 American direct-to-video country musical directed by Damon Santostefano. It’s a story about teenage sisters who go to Nashville when they discover their late father was a country music singer. It is nominally a sequel to the 1992 film "Pure Country", but has a separate and unrelated plot. Ada Spencer and her younger sister, Piper, are high schoolers living in a rural Tennessee farmhouse with their widowed mother Elizabeth and their grandmother, Meemaw. When a water pipe bursts, the girls find a Silver Star and other information about their father,;
Title: Pure Country: Pure Heart;
Context: Elizabeth visits Marq, reconciling with her and leading to Ada and Piper singing one of their father’s songs, on stage with Willie Nelson and Marq, at a fund raiser for veterans. Pure Country: Pure Heart Pure Country: Pure Heart is a 2017 American direct-to-video country musical directed by Damon Santostefano. It’s a story about teenage sisters who go to Nashville when they discover their late father was a country music singer. It is nominally a sequel to the 1992 film "Pure Country", but has a separate and unrelated plot. Ada Spencer and her younger sister, Piper, are high schoolers living;
What is the answer of the following question:
who plays dusty in the movie pure country
PSR Prediction
George Strait
BGM Context
Given the passage titles and contexts below:
Title: I Just Want to Dance with You;
Context: I Just Want to Dance with You "I Just Want to Dance with You" is a song written by John Prine and Roger Cook, and performed by American country music singer George Strait. It was released in April 1998 as the first single to his album, "One Step at a Time", it is his 34th Number One single on the "Billboard" Hot Country Singles & Tracks chart, and his 42nd Number One single when all major trade charts are counted. Prine recorded it 12 years earlier, for his 1986 album "German Afternoons". It was also a hit for Daniel O’Donnell;
What is the answer of the following question:
who sings i want to dance with you
BGM Prediction
George Strait
Question III
who has the most subscribers in faze clan
Answer
“FaZe Rug”
GTR Context
Given the passage titles and contexts below:
Title: FaZe Clan;
Context: "FaZe 2.0" by FaZe members and fans. The FaZe "CS:GO" team went on to become one of the most successful rosters for the 2017/2018 seasons. FaZe Clan is the most popular esports organization in the world, based on the organization’s social media following. As of November 28, 2018, FaZe Clan and its members together have 82 million YouTube subscribers, 11.2 billion YouTube views, 11.3 million Twitch followers, 130 million Twitch views, 43.1 million Twitter followers, 45.8 million Instagram followers, 2.8 million Facebook likes and followers. FaZe Clan has made $6,148,290.91 from esport tournament prize pools alone. FaZe Clan started on;
Title: FaZe Clan;
Context: February 18, 2018. FaZe Clan FaZe Clan (formerly FaZe Sniping) is an American esports and entertainment organization that competes in various video game tournaments. The organization was founded as a gaming clan on YouTube by players known as Housecat, ClipZ, and Resistance in 2010, who all created "trickshot" videos for the video game "". In 2012, with the release of "", the organization decided to expand into competitive play. In 2016, a new era for FaZe began when the organization bought a "" professional team. This moment marked the beginning of FaZe Clan expanding into various esports. This movement is;
Title: FaZe Clan;
Context: FaZe Clan FaZe Clan (formerly FaZe Sniping) is an American esports and entertainment organization that competes in various video game tournaments. The organization was founded as a gaming clan on YouTube by players known as Housecat, ClipZ, and Resistance in 2010, who all created "trickshot" videos for the video game "". In 2012, with the release of "", the organization decided to expand into competitive play. In 2016, a new era for FaZe began when the organization bought a "" professional team. This moment marked the beginning of FaZe Clan expanding into various esports. This movement is referred to as;
Title: FaZe Clan;
Context: Duty" community has been frustrated with the way FaZe Clan has been/not been involved in the game in which it was founded. Most of the other FaZe members have changed their video topics and mostly do vlog videos. This sparked FaZe Clan to organize the highly successful FaZe Bootcamp on the release of "". FaZe trickshotters known as Kitty, Dirty, Bloo, GwidT and Replays contributed to a week dedicated to creating "Call of Duty" content. FaZe’s first roster—consisting of players named Heist, Folsom, Secretly, and Sham— was created to compete at the 2013 MLG Winter Championship. Their next roster—consisting of;
Title: F Is for Family;
Context: F Is for Family F is for Family is an American adult animated sitcom created by Bill Burr and Michael Price and produced by Gaumont International Television and Vince Vaughn’s Wild West Television. The show premiered on December 18, 2015, to generally favorable reviews. Season 2 premiered May 30, 2017. On June 28, 2017, the show was renewed for a third season. On July 1, 2018, Burr confirmed season three. On November 30, the third season was released. The series was announced in October 2014 as part of a partnership between Netflix, Gaumont International Television, and Wild West Television.;
What is the answer of the following question:
who has the most subscribers in faze clan
GTR Prediction
Tfue
PSR Context
Given the passage titles and contexts below:
Title: FaZe Clan;
Context: Duty" community has been frustrated with the way FaZe Clan has been/not been involved in the game in which it was founded. Most of the other FaZe members have changed their video topics and mostly do vlog videos. This sparked FaZe Clan to organize the highly successful FaZe Bootcamp on the release of "". FaZe trickshotters known as Kitty, Dirty, Bloo, GwidT and Replays contributed to a week dedicated to creating "Call of Duty" content. FaZe’s first roster—consisting of players named Heist, Folsom, Secretly, and Sham— was created to compete at the 2013 MLG Winter Championship. Their next roster—consisting of;
Title: FaZe Clan;
Context: "FaZe 2.0" by FaZe members and fans. The FaZe "CS:GO" team went on to become one of the most successful rosters for the 2017/2018 seasons. FaZe Clan is the most popular esports organization in the world, based on the organization’s social media following. As of November 28, 2018, FaZe Clan and its members together have 82 million YouTube subscribers, 11.2 billion YouTube views, 11.3 million Twitch followers, 130 million Twitch views, 43.1 million Twitter followers, 45.8 million Instagram followers, 2.8 million Facebook likes and followers. FaZe Clan has made $6,148,290.91 from esport tournament prize pools alone. FaZe Clan started on;
Title: FaZe Clan;
Context: FaZe Clan FaZe Clan (formerly FaZe Sniping) is an American esports and entertainment organization that competes in various video game tournaments. The organization was founded as a gaming clan on YouTube by players known as Housecat, ClipZ, and Resistance in 2010, who all created "trickshot" videos for the video game "". In 2012, with the release of "", the organization decided to expand into competitive play. In 2016, a new era for FaZe began when the organization bought a "" professional team. This moment marked the beginning of FaZe Clan expanding into various esports. This movement is referred to as;
Title: F Is for Family;
Context: F Is for Family F is for Family is an American adult animated sitcom created by Bill Burr and Michael Price and produced by Gaumont International Television and Vince Vaughn’s Wild West Television. The show premiered on December 18, 2015, to generally favorable reviews. Season 2 premiered May 30, 2017. On June 28, 2017, the show was renewed for a third season. On July 1, 2018, Burr confirmed season three. On November 30, the third season was released. The series was announced in October 2014 as part of a partnership between Netflix, Gaumont International Television, and Wild West Television.;
Title: FaZe Clan;
Context: February 18, 2018. FaZe Clan FaZe Clan (formerly FaZe Sniping) is an American esports and entertainment organization that competes in various video game tournaments. The organization was founded as a gaming clan on YouTube by players known as Housecat, ClipZ, and Resistance in 2010, who all created "trickshot" videos for the video game "". In 2012, with the release of "", the organization decided to expand into competitive play. In 2016, a new era for FaZe began when the organization bought a "" professional team. This moment marked the beginning of FaZe Clan expanding into various esports. This movement is;
What is the answer of the following question:
who has the most subscribers in faze clan
PSR Prediction
Tfue
BGM Context
What is the answer of the following question:
who has the most subscribers in faze clan
BGM Prediction
FaZe Rug
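For readers who wish to reproduce the prompt layout shown in the examples above, the following is a minimal sketch, not the authors' implementation, of how a set of selected passages and a question can be assembled into an LLM prompt following that template. The `build_prompt` helper is a hypothetical name introduced here for illustration; as the BGM context of Question III shows, the bridge model may pass zero passages to the LLM.

```python
# Minimal sketch (an assumption, not the paper's code) of the prompt template
# visible in the examples above: selected passages followed by the question.

def build_prompt(passages, question):
    """passages: list of (title, context) pairs chosen by the bridge model."""
    lines = []
    if passages:  # the bridge model may select no passages at all
        lines.append("Given the passage titles and contexts below:")
        for title, context in passages:
            lines.append(f"Title: {title};")
            lines.append(f"Context: {context};")
    lines.append("What is the answer of the following question:")
    lines.append(question)
    return "\n".join(lines)


# Example: the BGM context for Question III contains no passages.
print(build_prompt([], "who has the most subscribers in faze clan"))
```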