Improving RAG with Query expansion & reranking models

Exploring Rerankers such as Cross-Encoders, ColBERT v2 & FlashRank

Query expansion:

When performing information retrieval, you don’t always get what you want. One method for improving the recall of search systems is query expansion, which adds terms to the search query so that it can recover relevant documents that have little or no lexical overlap with the initial query. This idea is valuable for improving the performance of retrieval-augmented generation (RAG) systems.

Why Use Query Expansion?

Query expansion is vital for several reasons:

  • Improves Recall: It helps retrieve documents that are semantically related to the query but don’t necessarily share common keywords.
  • Addresses Query Ambiguity: It’s beneficial for short or ambiguous queries, providing more context and clarity.
  • Enhances Document Matching: Expanding the query terms increases the likelihood of matching with the correct documents in the database.
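
To make the recall argument concrete, here is a minimal sketch that counts the word tokens a query shares with a document before and after expansion. The document text, the expansion text, and the helper names (tokens, overlap) are made up for illustration; in a real pipeline the expansion would come from an LLM and matching would be done by your retriever, not by this toy overlap count.

```python
import re

def tokens(text: str) -> set:
    """Lowercase a string and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap(query: str, document: str) -> int:
    """Count the distinct word tokens shared by the query and the document."""
    return len(tokens(query) & tokens(document))

document = "Sales growth was driven by a new marketing campaign and product launches."
query = "What increased revenue?"

# Hypothetical expansion text; in practice an LLM would generate it.
expanded_query = (
    query + " Revenue rose due to sales growth from marketing and new product lines."
)

print(overlap(query, document))           # 0 shared tokens
print(overlap(expanded_query, document))  # 6 shared tokens
```

The short query shares no words with a document that is clearly relevant, while the expanded query shares six, which is the effect the bullets above describe.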

The LLM Approach to Query Expansion

Recent advancements propose leveraging Large Language Models (LLMs) for query expansion. Unlike traditional methods like Pseudo-Relevance Feedback (PRF), which depends on the content of the retrieved documents, LLMs utilize their generative capabilities to create meaningful query expansions. This approach taps into the inherent knowledge encoded within the LLM, generating alternative terms and phrases that might be relevant to the original query.

Below is the code for query expansion (note that we’re using the OpenAI chat API).

# We use the OpenAI chat API to generate a hypothetical answer to the query
import os
from openai import OpenAI

# The API key is read from the OPENAI_API_KEY environment variable
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def augment_query_generated(query, model="gpt-3.5-turbo"):
    messages = [
        {
            "role": "system",
            "content": "You are a helpful expert financial research assistant. Provide an example answer to the given question that might be found in a document like an annual report. ",
        },
        {"role": "user", "content": query},
    ]

    # Ask the chat model for a hypothetical answer
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
    return content

original_query = "What were the most important factors that contributed to increases in revenue?"
hypothetical_answer = augment_query_generated(original_query)
# Combine the original query with the hypothetical answer
joint_query = f"{original_query} {hypothetical_answer}"

The prompt asks the LLM to generate a hypothetical answer for a given query. We then append the generated answer to our original query and use the result as a joint query for retrieval. The joint query carries more context than the original question alone, which gives the retriever more to match against. For the query above, the joint query looks like this:

What were the most important factors that contributed to increases in
revenue? In the fiscal year 2020, several key factors contributed to
the significant increase in our company's revenue. Firstly, we
implemented a successful marketing campaign that effectively targeted
new customer segments and enhanced our brand visibility. This resulted
in a substantial growth in our customer base and overall

Secondly, we expanded our product line by introducing
innovative products that catered to evolving consumer preferences. This
diversification strategy allowed us to tap into new markets and
capitalize on emerging trends, thereby driving revenue

Additionally, our commitment to customer satisfaction and
delivering exceptional service played a crucial role in increasing
revenue. By focusing on enhancing customer experience and implementing
customer retention programs, we not only fostered loyalty but also
attracted new customers through positive word-of-mouth

Furthermore, we adopted a proactive approach to
pricing and cost management. Through effective cost-cutting measures
and strategic pricing adjustments, we were able to optimize
profitability without compromising on product quality or customer
satisfaction. This emphasis on achieving operational efficiency
positively impacted our revenue growth.

Lastly, our investments in
technology and digital transformation significantly contributed to
revenue increases. By leveraging data analytics and automation, we
streamlined our processes, improved our decision-making capabilities,
and personalized customer experiences. These technological advancements
resulted in higher customer engagement and increased revenue

In conclusion, the most important factors that contributed
to the increases in our revenue included successful marketing
campaigns, product diversification, customer satisfaction initiatives,
strategic pricing, and investments in technology and digital

As you can see, using query expansion in RAG systems this way offers several benefits:

  • Better document retrieval: Expanded queries lead to more accurate and comprehensive document retrieval, a crucial step in RAG models.
  • Enhanced understanding: Expanded queries provide RAG models with a broader context, improving the model’s understanding and responses.
  • Versatility: This approach is adaptable to various domains and types of queries, enhancing the versatility of RAG models.

Drawbacks and the Role of Reranking

Although query expansion offers significant benefits, it’s not without its drawbacks:

  • Over-expansion: Adding too many terms can sometimes lead to irrelevant document retrieval.
  • Quality control: The relevance of expanded terms is not always guaranteed.

To mitigate these issues, reranking plays a crucial role. It refines the initial retrieval output, recalibrating document rankings based on their relevance to the expanded query. This ensures that only the most pertinent documents are prioritized, effectively sifting through the noise introduced by query expansion.
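
The retrieve-then-rerank step can be sketched in a few lines. This is only an illustration under stated assumptions: rerank and word_overlap_score are hypothetical names, and the word-overlap scorer is a toy stand-in for a real reranking model such as the ones covered in the sections that follow.

```python
from typing import Callable, List, Tuple

def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every candidate against the query and keep the top_k best."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

def word_overlap_score(query: str, document: str) -> float:
    """Toy stand-in for a reranker: fraction of query words found in the document."""
    query_words = set(query.lower().split())
    document_words = set(document.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & document_words) / len(query_words)

# Hypothetical candidates, as if returned by a first-stage retriever.
candidates = [
    "office relocation schedule",
    "revenue increased after the product launch",
    "revenue report",
]

top = rerank("revenue product launch", candidates, word_overlap_score, top_k=2)
for doc, score in top:
    print(f"{score:.2f}  {doc}")
```

Swapping word_overlap_score for a model-based scorer leaves the rest of the pipeline unchanged, which is why reranking slots in easily after an existing retrieval step.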

1. Cross-Encoder Reranking

Among the reranking methodologies, cross-encoder models stand out for their ability to significantly enhance search accuracy. These models diverge from traditional ranking metrics, such as cosine similarity, by employing deep learning to evaluate the alignment between each document and the query directly. Cross-encoders output a relevance score by processing the query and document in tandem, enabling a more nuanced document selection process.

Practical Implementation

In practice, cross-encoder reranking is applied after expanding the search query to include a broader set of documents. This approach not only refines the selection of documents from the initial retrieval but also enhances the utility of RAG models by:

  1. Improving Accuracy: Cross-encoders enhance the precision of document retrieval by ensuring that documents are ranked according to their actual relevance to the query.
  2. Expanding Versatility: This method adapts seamlessly across various domains and query types, elevating the adaptability of RAG models.

Example Use Case

Consider a scenario where your application needs to retrieve and rank documents based on their relevance to user queries. After initial query expansion, you’d apply a cross-encoder model to rerank the results.

In the example below, we use the Sentence Transformers CrossEncoder. You pass it the documents retrieved by your RAG pipeline, each paired with the query, and it returns a relevance score per pair that you can use to rank the documents.

import numpy as np
# Cross-encoder reranker
from sentence_transformers import CrossEncoder
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Extract text content from Document objects and convert it to strings
# (retrieved_documents comes from your retriever)
document_texts = [doc.page_content for doc in retrieved_documents]
query_text = "What were the most important factors that contributed to increases in revenue?"
# Create [query, document] pairs as strings
pairs = [[query_text, doc_text] for doc_text in document_texts]
# Predict one relevance score per pair
scores = cross_encoder.predict(pairs)
# Print scores
for score in scores:
    print(score)

# Print document indices from most to least relevant
print("New Ordering:")
for o in np.argsort(scores)[::-1]:
    print(o)

You can use the new ordering from the reranker for further processing, such as passing the top documents to the LLM to get the final answer.
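
As a rough sketch of that last step, the snippet below keeps the highest-scoring documents and assembles them into a prompt. The helper names (top_k_by_score, build_prompt), the prompt wording, and the sample texts and scores are all assumptions for illustration; in the example above, document_texts and scores would come from the retriever and the cross-encoder.

```python
def top_k_by_score(documents, scores, k=3):
    """Return the k documents with the highest scores, best first."""
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:k]]

def build_prompt(question, documents):
    """Join the selected documents into a context block followed by the question."""
    context = "\n\n".join(documents)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Placeholder data standing in for retrieved texts and reranker scores.
document_texts = ["doc A", "doc B", "doc C"]
scores = [-2.1, 4.7, 0.3]

top_docs = top_k_by_score(document_texts, scores, k=2)
prompt = build_prompt("What drove revenue growth?", top_docs)
print(prompt)
```

Only the documents that survive reranking reach the prompt, which keeps the context short and focused on the most relevant passages.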

Next, we’ll look at the reranker model, ColBERTv2, which is among the fastest reranker models available today.

2. ColBERT: The reranker model 

ColBERT is a retrieval and reranking model that uses a late interaction architecture over BERT to improve document retrieval and ranking. It’s particularly notable for balancing computational efficiency with high accuracy.

Core Idea and Architecture

ColBERT separates the encoding of query and document texts using BERT, allowing offline pre-computation of document encodings and significantly reducing the computational load per query. The model employs a unique approach where each query and document token is encoded into a low-dimensional vector, facilitating fast and accurate retrieval.

Late Interaction Mechanism

The crux of ColBERT’s efficiency lies in its late interaction mechanism. Instead of squashing all token vectors into a single vector, it compares each query vector with every vector of the document. This method ensures a more nuanced and accurate representation of the document’s relevance to the query.
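
A minimal sketch of this scoring rule, often called MaxSim: for each query token vector, take its highest similarity to any document token vector, then sum those maxima over the query tokens. The tiny two-dimensional vectors below are made up for illustration; real ColBERT embeddings come from BERT, have many more dimensions, and are normalized so that the dot product equals cosine similarity.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vectors, document_vectors):
    """Late interaction score: sum over query tokens of the best match among document tokens."""
    return sum(
        max(dot(q, d) for d in document_vectors)
        for q in query_vectors
    )

# Toy unit-length token embeddings (illustrative values only).
query_vectors = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # has an exact match for each query token
doc_b = [[0.6, 0.8]]                          # only a partial match for each query token

print(maxsim_score(query_vectors, doc_a))
print(maxsim_score(query_vectors, doc_b))
```

Because the document vectors do not depend on the query, they can be computed once offline, and only the cheap max-and-sum step runs at query time.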

Indexing and Retrieval in ColBERT

ColBERT’s indexing process has three stages:

  1. Centroid Selection: Using k-means clustering to select centroids for residual encoding.
  2. Passage Encoding: Encoding documents with the selected centroid and computing the quantized residual.
  3. Index Inversion: Creating an inverted list of embeddings grouped by centroids for fast retrieval.
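
The three stages above can be illustrated with a toy sketch. It makes several simplifying assumptions that are not part of the original description: the centroids are given rather than learned with k-means, the residual is quantized by rounding to multiples of a fixed step (the real index uses a different, more compact scheme), and every name here (nearest_centroid, encode, decode, step) is hypothetical.

```python
from collections import defaultdict

def nearest_centroid(vector, centroids):
    """Return the index of the centroid closest to the vector (squared Euclidean distance)."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(vector, centroid))
    return min(range(len(centroids)), key=lambda i: squared_distance(centroids[i]))

def encode(vector, centroids, step=0.25):
    """Encode a vector as (centroid id, residual rounded to integer multiples of step)."""
    centroid_id = nearest_centroid(vector, centroids)
    residual = [a - b for a, b in zip(vector, centroids[centroid_id])]
    quantized = [round(r / step) for r in residual]
    return centroid_id, quantized

def decode(centroid_id, quantized, centroids, step=0.25):
    """Approximately reconstruct a vector from its centroid and quantized residual."""
    return [c + q * step for c, q in zip(centroids[centroid_id], quantized)]

# Stage 1 (assumed done): centroids would normally come from k-means.
centroids = [[0.0, 0.0], [1.0, 1.0]]
embeddings = [[0.1, -0.2], [0.9, 1.3], [1.2, 0.8]]

codes = []
inverted_index = defaultdict(list)  # centroid id -> ids of embeddings assigned to it
for embedding_id, embedding in enumerate(embeddings):
    centroid_id, quantized = encode(embedding, centroids)  # stage 2: passage encoding
    codes.append((centroid_id, quantized))
    inverted_index[centroid_id].append(embedding_id)       # stage 3: index inversion

print(codes)
print(dict(inverted_index))
print(decode(*codes[2], centroids))
```

At query time, an index of this shape lets the search look only at embeddings stored under the centroids nearest to each query vector instead of scanning every embedding.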

ColBERT efficiently computes the cosine similarity for each query vector during retrieval, leading to a fast and accurate ranking of documents.

Learn more about ColBERT v2 & other ranker comparison benchmarks on this blog.

Practical Implementation 

We’ll use the ColBERT reranker with LanceDB, which provides an interface to choose from different ranking hybrid methods for querying the documents.

import lancedb
from lancedb.embeddings import EmbeddingFunctionRegistry
from lancedb.pydantic import LanceModel, Vector
from lancedb.rerankers import ColbertReranker

db = lancedb.connect("/tmp/db")
registry = EmbeddingFunctionRegistry.get_instance()
func = registry.get("openai").create()

# Schema: 'text' is embedded automatically into 'vector'
class Words(LanceModel):
    text: str = func.SourceField()
    vector: Vector(func.ndims()) = func.VectorField()

table = db.create_table("colbertv2demo", schema=Words, mode="overwrite")

# data from retriever
formatted_data = [{"text": doc.page_content} for doc in retrieved_documents]

# ingest docs with auto-vectorization
table.add(formatted_data)
# Create the FTS index on the 'text' field
table.create_fts_index("text")

# ColBERT reranker applied to hybrid (vector + full-text) search results
reranker_colbert = ColbertReranker()
results_colbert = (
    table.search("technologies and business models", query_type="hybrid")
    .rerank(reranker=reranker_colbert)
    .to_pandas()
)

The results_colbert DataFrame now holds the hybrid search results reranked by the ColBERT v2 model.

3. FlashRank

FlashRank is an ultra-lite & super-fast Python library to add reranking to your existing search & retrieval pipelines and is based on SoTA cross-encoders. Ensure you pip install flashrank before running the following code.

from flashrank import Ranker, RerankRequest

query = 'What were the most important factors that contributed to increases in revenue?'

ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="/opt")

rerankrequest = RerankRequest(query=query, passages=formatted_data)
results = ranker.rerank(rerankrequest)

All the code is available in the following Colab:

Google Colab for query expansion with reranker model


The integration of LLM-based query expansion and advanced reranking models like Cross-Encoders, ColBERT v2 & FlashRank is opening up new possibilities in the field of information retrieval. These methods enhance the precision and recall of document retrieval systems and help RAG models deliver more relevant and contextually richer results. As we continue to explore and innovate within this space, these tools will become easier to use and more common across use cases and domains.

For those eager to dive deeper and experiment with these concepts, our Google Colab notebooks provide a hands-on introduction to implementing query expansion, Cross-Encoder reranking, and ColBERT v2 reranking. Through practical exploration, you can see firsthand how these techniques affect information retrieval and document ranking.

But that’s not all. For even more exciting applications of vector databases and Large Language Models (LLMs), explore the LanceDB repository. LanceDB offers a powerful and versatile vector database that can revolutionize how you work with data.

Explore the full potential of this cutting-edge technology by visiting the vectordb-recipes repository.