Improve (almost) every retriever with LanceDB hybrid search and Reranking

Retrieval is a key component of any recommendation-style system, including RAG. The quality of the responses generated by a chatbot can only be as good as the retrieved context. In this blog, we'll look at how to improve retrieval performance, limiting our discussion to vectorDB-based retrievers.

There are various ways to improve the retrieval performance of a vectorDB-based retriever, but they can often end up disrupting the existing setup. For example, if the first thing you try is changing the chunking strategy, you'll need to re-ingest the entire dataset. The same goes for changing the embedding model. So it's always better to try the simpler, faster-to-implement solutions before jumping into modelling or chunking changes. LanceDB comes with built-in APIs to streamline these experiments. Let's get started.

Experiment setup

We'll start off by selecting two datasets and setting a baseline for retrieval performance: SQuAD and a synthetic dataset generated from the Llama2 review paper. The metric used here is hit rate at top-5.
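
For clarity, hit rate at top-5 is simply the fraction of queries for which the ground-truth context appears among the top 5 retrieved results. A minimal sketch of the metric (the function and data layout here are illustrative, not taken from the benchmark code):

def hit_rate_at_k(retrieved_per_query, ground_truths, k=5):
    # retrieved_per_query: one ranked list of retrieved contexts per query
    # ground_truths: the gold context for each query
    hits = sum(truth in retrieved[:k] for retrieved, truth in zip(retrieved_per_query, ground_truths))
    return 100.0 * hits / len(ground_truths)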

Setting up the embedding model

Throughout this experiment, the embedding model used is BAAI/bge-small-en-v1.5. With the LanceDB embedding API, you can set up tables to automatically embed your dataset and queries in the background, abstracting away the modelling details.

import lancedb
from lancedb.embeddings import get_registry
from lancedb.pydantic import Vector, LanceModel


model = get_registry().get("sentence-transformers").create(name="BAAI/bge-small-en-v1.5")

class Schema(LanceModel):
    text: str = model.SourceField()
    vector: Vector(model.ndims()) = model.VectorField()
    # vector field is automatically generated


db = lancedb.connect("~/db")
table = db.create_table("table", schema=Schema)

data = [{"text": "data"}, {"text": "more data"}]  # vectors are computed automatically on ingest
table.add(data)
table.search("query")
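
The snippet above only builds a query; to materialize results you would typically chain a limit and convert to a DataFrame. A usage sketch, assuming the standard LanceDB query-builder methods:

# the query string is embedded with the same model before searching
hits = table.search("query").limit(5).to_pandas()
print(hits["text"].tolist())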

The baseline

Dataset         Query Type   Reranker   Hit Rate
SQuAD           vector       -          81.322
SQuAD           FTS          -          83.509
LLama2-review   vector       -          58.63
LLama2-review   FTS          -          59.54

Reranking results

In the context of search, reranking means reordering the search results returned by a search algorithm based on some criteria. This can be useful when the initial ranking of the search results is not satisfactory or when the user has provided additional information that can be used to improve the ranking of the search results.
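
Conceptually, a reranker scores each (query, candidate) pair and re-sorts the candidates by that score. A minimal standalone sketch using a sentence-transformers cross-encoder (the query and candidate passages are placeholders):

from sentence_transformers import CrossEncoder

query = "who wrote the llama 2 paper?"
candidates = ["passage one ...", "passage two ...", "passage three ..."]

# score every (query, passage) pair, then reorder passages by score
scorer = CrossEncoder("cross-encoder/ms-marco-TinyBERT-L-6")
scores = scorer.predict([(query, passage) for passage in candidates])
reranked = [passage for _, passage in sorted(zip(scores, candidates), reverse=True)]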

LanceDB Reranking API

LanceDB comes with some built-in rerankers. To use a reranker, you need to create an instance of the reranker and pass it to the rerank method of the query builder.

from lancedb.rerankers import ColbertReranker

colbert = ColbertReranker()

table.search("query").rerank(reranker=colbert)                    # reranked vector search
table.search("query", query_type="fts").rerank(reranker=colbert)  # reranked full-text search

Reranking models used

The rerankers used in this experiment are all available out of the box with LanceDB. They are a mix of open-source and proprietary API-based models; a short instantiation sketch follows the list.

  • Cross Encoder - This uses an open-source cross-encoder architecture to generate similarity scores for query-context pairs. The model used is cross-encoder/ms-marco-TinyBERT-L-6, which was trained on the MS MARCO Passage Ranking task.
  • Cohere - This uses Cohere's proprietary, API-based reranking model. The model used here is rerank-english-v2.0.
  • ColBERT - ColBERT relies on fine-grained contextual late interaction: it encodes each passage into a matrix of token-level embeddings. At search time, it embeds every query into another matrix and efficiently finds passages that contextually match the query using scalable vector-similarity operators. This experiment uses the ColBERT-v2 model with approximately 110M parameters.
  • AnswerDotAi ColBERT-small-v1 - This model builds upon the JaColBERTv2.5 recipe and has just 33 million parameters, meaning it can search through hundreds of thousands of documents in milliseconds, on CPU. Its very small parameter count demonstrates that there is a lot of retrieval performance to be squeezed out of creative approaches such as multi-vector models, which are better suited to many uses than gigantic 7-billion-parameter embedders. Learn more in the AnswerDotAI announcement blog.
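
For reference, here is a sketch of how the rerankers above are typically instantiated in LanceDB. The constructor arguments, and the AnswerDotAI wrapper class in particular, are assumptions that may differ across LanceDB versions, so check the rerankers documentation; the Cohere reranker also expects a COHERE_API_KEY in the environment.

from lancedb.rerankers import CrossEncoderReranker, CohereReranker, ColbertReranker

cross_encoder = CrossEncoderReranker(model_name="cross-encoder/ms-marco-TinyBERT-L-6")
cohere = CohereReranker(model_name="rerank-english-v2.0")  # proprietary, API-based
colbert = ColbertReranker()  # default checkpoint; the experiment above used ColBERT-v2

# assumed wrapper for the AnswerDotAI model; the exact class name and arguments may differ
from lancedb.rerankers import AnswerdotaiRerankers
answerai = AnswerdotaiRerankers(model_type="colbert", model_name="answerdotai/answerai-colbert-small-v1")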

Reranked vector and FTS results

Dataset         Query Type   Reranker              Hit Rate
SQuAD           vector       CrossEncoder          85.22
SQuAD           vector       ColBERT               85.49
SQuAD           vector       Answerdotai ColBERT   85.95
SQuAD           vector       Cohere                86.09
SQuAD           FTS          -                     83.509
SQuAD           FTS          CrossEncoder          86.73
SQuAD           FTS          ColBERT               86.74
SQuAD           FTS          Answerdotai ColBERT   87.24
SQuAD           FTS          Cohere                87.72
LLama2-review   vector       CrossEncoder          62.27
LLama2-review   vector       ColBERT               62.72
LLama2-review   vector       Answerdotai ColBERT   65.45
LLama2-review   vector       Cohere                66.80
LLama2-review   FTS          -                     59.54
LLama2-review   FTS          CrossEncoder          60.90
LLama2-review   FTS          ColBERT               64.54
LLama2-review   FTS          Answerdotai ColBERT   65.00
LLama2-review   FTS          Cohere                67.27

Note: In the case of vector-only and FTS-only search, the results were over-fetched by a factor of 2 for reranking and then the top-k (top-5) results were taken. This is needed because there is only one result set, unlike hybrid search where there are result sets from both vector and FTS. Without over-fetching, reranking would have no effect.
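
A sketch of what that over-fetching looks like with the query builder (the factor of 2 mirrors the setting above; the exact interaction of limit and rerank may vary slightly between LanceDB versions):

k = 5
reranked = (
    table.search("query")        # vector-only search
    .limit(2 * k)                # over-fetch by a factor of 2
    .rerank(reranker=colbert)    # rerank the single result set
    .to_pandas()
    .head(k)                     # keep the top-5 after reranking
)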

Hybrid search is a broad term for a search algorithm that combines multiple search techniques. In LanceDB, you can perform hybrid search by combining the results of semantic and full-text search via a reranking algorithm of your choice.

LanceDB hybrid search with custom Reranking

table.search("query", query_type="hybrid") # uses the LinearCombination reranker by default

table.search("query", query_type="hybrid").rerank(reranker=colbert)
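
The linear-combination and reciprocal-rank-fusion rows in the tables below are rerankers as well; swapping in RRF explicitly looks like this (assuming the built-in RRFReranker):

from lancedb.rerankers import RRFReranker

rrf = RRFReranker()  # merges the vector and FTS result sets by reciprocal rank
table.search("query", query_type="hybrid").rerank(reranker=rrf)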

Hybrid search with reranking results

Dataset         Query Type   Reranker                 Hit Rate
SQuAD           Hybrid       LinearCombination        88.42
SQuAD           Hybrid       Reciprocal Rank Fusion   89.81
SQuAD           Hybrid       CrossEncoder             91.51
SQuAD           Hybrid       ColBERT                  91.53
SQuAD           Hybrid       Answerdotai ColBERT      92.26
SQuAD           Hybrid       Cohere                   92.35
LLama2-review   Hybrid       LinearCombination        62.27
LLama2-review   Hybrid       Reciprocal Rank Fusion   66.36
LLama2-review   Hybrid       CrossEncoder             65.45
LLama2-review   Hybrid       ColBERT                  70.90
LLama2-review   Hybrid       Answerdotai ColBERT      70.90
LLama2-review   Hybrid       Cohere                   75.00

Full benchmark

Dataset   Query Type   Reranker                 Hit Rate
SQuAD     vector       -                        81.322
SQuAD     vector       CrossEncoder             85.22
SQuAD     vector       ColBERT                  85.49
SQuAD     vector       Answerdotai ColBERT      85.95
SQuAD     vector       Cohere                   86.09
SQuAD     FTS          -                        83.509
SQuAD     FTS          CrossEncoder             86.73
SQuAD     FTS          ColBERT                  86.74
SQuAD     FTS          Answerdotai ColBERT      87.24
SQuAD     FTS          Cohere                   87.72
SQuAD     Hybrid       LinearCombination        88.42
SQuAD     Hybrid       Reciprocal Rank Fusion   89.81
SQuAD     Hybrid       CrossEncoder             91.51
SQuAD     Hybrid       ColBERT                  91.53
SQuAD     Hybrid       Answerdotai ColBERT      92.26
SQuAD     Hybrid       Cohere                   92.35

Dataset         Query Type   Reranker                 Hit Rate
LLama2-review   vector       -                        58.63
LLama2-review   vector       CrossEncoder             62.27
LLama2-review   vector       ColBERT                  62.72
LLama2-review   vector       Answerdotai ColBERT      65.45
LLama2-review   vector       Cohere                   66.80
LLama2-review   FTS          -                        59.54
LLama2-review   FTS          CrossEncoder             60.90
LLama2-review   FTS          ColBERT                  64.54
LLama2-review   FTS          Answerdotai ColBERT      65.00
LLama2-review   FTS          Cohere                   67.27
LLama2-review   Hybrid       LinearCombination        62.27
LLama2-review   Hybrid       Reciprocal Rank Fusion   66.36
LLama2-review   Hybrid       CrossEncoder             65.45
LLama2-review   Hybrid       ColBERT                  70.90
LLama2-review   Hybrid       Answerdotai ColBERT      70.90
LLama2-review   Hybrid       Cohere                   75.00

SQuAD

Baseline: Vector Search - 81.322

Best result: Hybrid search with Cohere reranker - 92.35

Llama-2-review

Baseline: Vector Search - 58.63

Best result: Hybrid search with Cohere reranker - 75.00

Most rerankers ended up improving the results, with Cohere leading the pack. The highlight here is AnswerDotAi ColBERT-small-v1, as it performs almost as well as the Cohere reranker, which has by far been the best reranker across all our tests. It is also more than 3x smaller than ColBERT-v2, so it can be run locally on CPU.

Conclusion

Our experiments on the SQuAD and Llama2-review datasets demonstrated significant improvements in retrieval performance using these techniques. Specifically, we observed roughly an 11-point increase in hit rate on SQuAD (from 81.32% to 92.35%) and a 16-point increase on Llama2-review (from 58.63% to 75.00%). These methods are particularly effective in scenarios where a slight increase in query latency is acceptable, as they offer substantial gains in accuracy without sacrificing too much speed.

This study exclusively explored techniques to optimize the retriever's performance without requiring a complete dataset re-ingestion. These methods represent the most straightforward approaches to enhance accuracy. Reranker models, often overlooked due to their substantial latency during queries, are becoming more viable with the emergence of smaller, on-device models that can significantly reduce query time.

Future work

A comprehensive evaluation of reranker performance would require a careful consideration of the trade-off between accuracy and latency. We intend to investigate this aspect in future studies. It's important to note that direct comparisons can be challenging, as some of the most effective rerankers are API-based, and their latency can fluctuate due to network factors.

Another important direction is evaluating other methods of improving retrieval performance, such as fine-tuning embedding models, choosing the best embedding model architecture for a given dataset type, and the effect of various chunking techniques and other best practices. These methods are generally more disruptive than reranking and hybrid search, as they require complete data re-ingestion, but they are useful in cases where low latency at query time is critical.

References

The benchmarking code is based on the Ragged repo, which contains a bunch of utility functions for ingesting and evaluating retrievers.

GitHub: lancedb/ragged

It lets you run evaluations across multiple query types, rerankers, and embedding models, powered by LanceDB. For example, here is the evaluation code for the SQuAD dataset with the linear combination reranker.

from ragged.dataset import CSVDataset, SquadDataset
from ragged.rag import llamaIndexRAG
from ragged.metrics.retriever.hit_rate import HitRate
from lancedb.rerankers import LinearCombinationReranker
from ragged.search_utils import QueryType
import wandb

dataset = SquadDataset()
reranker = LinearCombinationReranker()
hit_rate = HitRate(dataset, embedding_registry_id="sentence-transformers", embed_model_kwarg={"name": "BAAI/bge-small-en-v1.5", "device": "cuda"})

query_types = [QueryType.HYBRID, QueryType.RERANK_VECTOR, QueryType.RERANK_FTS]
use_existing_table = False  # set to True to reuse a previously ingested table

run = wandb.init(project="ragged_bench", name="linearcombination")

for query_type in query_types:
    hr = hit_rate.evaluate(5, query_type=query_type, use_existing_table=use_existing_table)
    run.log({f"{query_type}": hr.model_dump()[f"{query_type}"]})

wandb.finish()

Note: The hit-rate eval in ragged discards failed queries, so evaluation results from other implementations might differ. The purpose of this evaluation is just to show performance gains on a standard metric that remains constant across all evaluations.

The Llama2-review synthetic dataset was generated from the llama-2-paper dataset on LlamaHub using the datagen utils in ragged:

from ragged.dataset.gen.gen_retrieval_data import gen_query_context_dataset
from ragged.dataset.gen.llm_calls import OpenAIInferenceClient

client = OpenAIInferenceClient()
df = gen_query_context_dataset(directory="data/source_files", inference_client=client)

print(df.head())
# save the dataframe
df.to_csv("data.csv")

The public W&B dashboards for both of these experiments can be found here:

SQuAD

LLama2-review