by Mahesh Deshwal
Problem Statement:
In a typical RAG pipeline, the LLM's context window is limited, so for a hypothetical 10,000-page document we need to chunk the document. For any incoming user query, we then fetch the `Top-N` related chunks, and because neither our embeddings nor our search algorithm is 100% accurate, some of those chunks may be unrelated. This is a flaw in the RAG pipeline. How can you deal with it? If you fetch only the Top-1 chunk and its content is off-topic, the answer is guaranteed to be bad. On the other hand, if you fetch more chunks and pass them all to the LLM, it gets confused, and with a large enough number you run out of context.
What’s the remedy?
Out of all the methods available, re-ranking is the simplest. The idea is straightforward:
1. You assume that the embedding model + search algorithm are not 100% precise, so you use recall to your advantage and fetch a fairly high number `N` (say 25) of related chunks from the corpus.
2. The second step is to use a more powerful model to increase precision. You re-rank those `N` chunks so that their relative ordering changes, and then select the Top `K` chunks (say 3) to pass as context, where `K` < `N`, thus increasing precision.
Why can’t you use the bigger model in the first place?
Would your search results be better if you were searching over 100 documents instead of 100,000? Yes. So no matter how big a model you use, you'll always get some irrelevant results because the search domain is huge.
A smaller model paired with an efficient search algorithm does the work of searching the bigger domain and retrieving more candidates, while the larger model is precise; because it only scores the `N` retrieved chunks, it adds a bit of overhead but improves relevancy.
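To make the two-stage idea concrete, here is a minimal sketch. The names `retriever`, `reranker`, and `corpus_chunks` are placeholders, not real APIs; the rest of this post implements the real thing with LanceDB and BGE.
# Hypothetical two-stage retrieval sketch; retriever/reranker are placeholder objects
def retrieve_then_rerank(query, corpus_chunks, retriever, reranker, n=25, k=3):
    candidates = retriever.search(query, corpus_chunks, top_n=n)   # stage 1: high recall, cheap model
    scores = reranker.score([(query, c) for c in candidates])      # stage 2: high precision, strong model
    ranked = [c for _, c in sorted(zip(scores, candidates), reverse=True)]
    return ranked[:k]                                              # Top-K chunks to pass as LLM context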
Follow along with this Colab.
!pip install -U lancedb transformers datasets FlagEmbedding unstructured -qq
# NOTE: If there is an import error, restart and run the notebook again
from FlagEmbedding import LLMEmbedder, FlagReranker
# All documentation is available here: https://github.com/FlagOpen/FlagEmbedding/tree/master
import os
import lancedb
import re
import pandas as pd
import random
from datasets import load_dataset
import torch
import gc
import lance
from lancedb.embeddings import with_embeddings
task = "qa" # Encode for a specific task (qa, icl, chat, lrlm, tool, convsearch)
embed_model = LLMEmbedder('BAAI/llm-embedder', use_fp16=False) # Load model (automatically use GPUs)
reranker_model = FlagReranker('BAAI/bge-reranker-base', use_fp16=True) # use_fp16 speeds up computation with a slight performance degradation
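As a quick, purely illustrative sanity check (the sentences below are made up for this example), you can encode a query and a document 'key' with the embedder, and score (query, passage) pairs with the reranker; higher scores mean higher relevance.
# Toy sanity check: query-side vs key-side encoding, plus reranker pair scoring
q_vec = embed_model.encode_queries("what is a mitochondrion?", task=task)   # encode a QUERY
k_vec = embed_model.encode_keys(["The mitochondrion is the powerhouse of the cell."], task=task)  # encode a document 'key'
print(reranker_model.compute_score(["what is a mitochondrion?",
                                    "The mitochondrion is the powerhouse of the cell."]))  # relevant pair
print(reranker_model.compute_score(["what is a mitochondrion?",
                                    "Stock markets fell sharply on Monday."]))  # irrelevant pair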
Load the BeIR dataset. This dataset is built specifically for retrieval tasks, so you can see how well your search is working.
queries = load_dataset("BeIR/scidocs", "queries")["queries"].to_pandas()
docs = load_dataset('BeIR/scidocs', 'corpus')["corpus"].to_pandas().dropna(subset = "text").sample(10000) # just random samples for faster embed demo
docs.sample(3)
Get embeddings using the LLM Embedder and create a database using LanceDB
def embed_documents(batch):
    """
    Embed a batch of text chunks
    """
    return embed_model.encode_keys(batch, task=task) # Encode data or 'keys'
db = lancedb.connect("./db") # Connect to a local DB
if "doc_embed" in db.table_names():
    table = db.open_table("doc_embed") # Open the existing table
else:
    # Embed the text chunks and save them in the DB
    data = with_embeddings(embed_documents, docs, column="text", show_progress=True, batch_size=128)
    table = db.create_table("doc_embed", data=data) # Create the table
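Optionally, a small sanity check (an assumption on my part, not in the original notebook): peek at the stored table to confirm the text and its embeddings were written.
# Peek at a few stored rows; with_embeddings stores the embedding alongside the original text
print(table.to_pandas().head(3))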
Search with a random query
def search(query, top_k = 10):
    """
    Search the table for a query
    """
    query_vector = embed_model.encode_queries(query, task=task) # Encode the QUERY (it is done differently than the 'keys')
    search_results = table.search(query_vector).limit(top_k)
    return search_results
query = random.choice(queries["text"])
print("QUERY:-> ", query)
# get top_k search results
search_results = search(query, top_k = 10).to_pandas().dropna(subset = "text").reset_index(drop = True)
search_results
Re-rank the search results using the BGE Reranker
Pass all the results to a stronger model so it can score each one's relevance to the query
def rerank(query, search_results):
    search_results["old_similarity_rank"] = search_results.index + 1 # Old ranks from the vector search
    torch.cuda.empty_cache()
    gc.collect()
    search_results["new_scores"] = reranker_model.compute_score([[query, chunk] for chunk in search_results["text"]]) # Re-compute relevance scores
    return search_results.sort_values(by = "new_scores", ascending = False).reset_index(drop = True)
print("QUERY:-> ", query)
rerank(query, search_results)
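Finally, to close the loop on the original problem, here is a minimal sketch of selecting the Top `K` re-ranked chunks and joining them into a context string for the LLM. The value of `K` and the prompt format are assumptions for illustration, not from the original notebook.
# Build the final LLM context from the Top-K re-ranked chunks (K and prompt format are assumed)
K = 3  # assumed value; tune for your context window
reranked = rerank(query, search_results)
context = "\n\n".join(reranked["text"].head(K))  # Top-K chunks after re-ranking
prompt = f"Answer the question using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
print(prompt)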
Visit the LanceDB (YC W22) repo to learn more about the LanceDB Python and TypeScript libraries
To discover more applied GenAI and VectorDB applications, examples, and tutorials, visit vectordb-recipes
Adios Amigos! Until next time …….