Demystifying LLM Apps with Lance

Demystifying LLM Apps with Lance

6 min read

There’s a lot of justifiable excitement around using LLMs as the basis for building a new generation of smart apps. While tools like LangChain or GPT-Index makes it easy to create prototypes, it’s useful to look behind the curtain to see what the wizard looks like before taking these apps into production. In this post, we’ll show you how to build a Q&A chatbot using just base-level OSS python tooling, OpenAI, and Lance.

tl;dr — if you just want to play with the worked example, you can find a notebook on the lance github.


We’ll build a question-answer bot with answers drawn from youtube transcripts. The result is going to be a Jupyter Notebook where you can get the answer to a question you ask AND it shows the top matching YouTube video starting at the relevant point.

Data prep

To make things simple, we’ll just use a ready made dataset of YouTube transcripts on HuggingFace:

from datasets import load_dataset
data = load_dataset('jamescalam/youtube-transcriptions', split='train')

This dataset has 208619 transcript sentences (across 700 videos)

    features: ['title', 'published', 'url', 'video_id', 'channel_id', 'id', 'text', 'start', 'end'],
    num_rows: 208619

We’ll use pandas to create the context windows of size 20 with stride 4. That means every 4 sentences, it creates a “context” by concatenating the next 20 sentences together. Eventually, the answers will come from finding the right context and summarizing it.

import numpy as np
import pandas as pd

window = 20
stride = 4

def contextualize(raw_df, window, stride):
    def process_video(vid):
        # For each video, create the text rolling window
        text = vid.text.values
        time_end = vid["end"].values
        contexts = vid.iloc[:-window:stride, :].copy()
        contexts["text"] = [' '.join(text[start_i:start_i+window])
                            for start_i in range(0, len(vid)-window, stride)]
        contexts["end"] = [time_end[start_i+window-1]
                            for start_i in range(0, len(vid)-window, stride)]        
        return contexts
    # concat result from all videos
    return pd.concat([process_video(vid) for _, vid in raw_df.groupby("title")])

df = contextualize(data.to_pandas(), 20, 4)

A brief aside: it’s annoying that pandas does not provide rolling window functions on non-numeric columns because process_video should just be simple call to pd.DataFrame.rolling .

Me finding out pandas rolling windows calls text columns “nuisance”

With this we go from 200K+ sentences to <50K contexts

>>> len(df)


Here we’ll use the OpenAI embeddings API to turn text into embeddings. There are other services like Cohere providing similar functionality, or you can run your own.

The OpenAI python API requires an API key for authentication. For details, you can refer to their documentation to see how to set it up.

The documentation *looks* super simple:

import openai
openai.Embedding.create(input="text", engine="text-embedding-ada-002")

But of course, the OpenAI API is almost always out of capacity, or you’re being throttled, or there’s some cryptic JSON decode error. To make our lives easier, we can use the ratelimiter and retry packages in python:

import functools
import openai
import ratelimiter
from retry import retry

embed_model = "text-embedding-ada-002"

# API limit at 60/min == 1/sec
limiter = ratelimiter.RateLimiter(max_calls=0.9, period=1.0)

# Get the embedding with retry
@retry(tries=10, delay=1, max_delay=30, backoff=3, jitter=1)
def embed_func(c):    
    rs = openai.Embedding.create(input=c, engine=embed_model)
    return [record["embedding"] for record in rs["data"]]

rate_limited = limiter(embed_func)

And we can also request batches of embeddings instead of just one at a time

from import tqdm
import math

# We request in batches rather than 1 embedding at a time
def to_batches(arr, batch_size):
    length = len(arr)
    def _chunker(arr):
        for start_i in range(0, len(df), batch_size):
            yield arr[start_i:start_i+batch_size]
    # add progress meter
    yield from tqdm(_chunker(arr), total=math.ceil(length / batch_size))
batch_size = 1000
batches = to_batches(df.text.values.tolist(), batch_size)
embeds = [emb for c in batches for emb in rate_limited(c)]


Once the embeddings are created, we have to make it searchable. At this point, most existing toolchains require you to spin up a separate service or store your vectors separately from the data separately from the vector index. This is where Lance comes in. We can merge the embeddings (vectors) with the original data and write it to disk. So the vectors you need for similarity search lives with the index to make it fast lives with the data you need to filter / return to the user.

import lance
import pyarrow as pa
from lance.vector import vec_to_table

table = vec_to_table(np.array(embeds))
combined = pa.Table.from_pandas(df).append_column("vector", table["vector"])
ds = lance.write_dataset(combined, "chatbot.lance")

This is sufficient to do vector search in Lance

ds.to_table(nearest={"column": "vector",
                     "q": [<query vector>], 
                     "k": <how many results>}).to_pandas()

The above query retrieves the k rows whose vector is most similar to the query vector q .

But a brute force approach is a bit slow (~150ms). Instead, if you’re going to be making a bunch of requests, you can create an ANN index on the vector column:

ds = ds.create_index("vector", index_type="IVF_PQ", 
                     num_partitions=64, num_sub_vectors=96)

If you run the same ANN search query, you should get a much faster answer now.

Because Lance is columnar, you can easily retrieve additional columns. By default, the to_table returns all columns plus an extra “score” column. You can choose to add a columns argument to to_table to select a subset of available columns. You can also filter the ANN results by passing a SQL where clause string to the filter parameter in to_table .

Me talking about Lance


Unless you’re already familiar with how LLM apps work, you might be asking “what do embeddings have to do with Q&A bots?” This is where prompt engineering comes in. There are lots of great material dedicated to prompt engineering so I’ll spare the details in this post. Here we’ll just go through a practical example.

The magic happens in 3 steps:

1. Embed the query text

Let’s say you want to ask “Which training method should I use for sentence transformers when I only have pairs of related sentences?” Since we’ve converted all the transcripts into embeddings, we’ll also do the same with the query text itself

query = ("Which training method should I use for sentence transformers "
         "when I only have pairs of related sentences?")
openai.Embedding.create(input=query, engine="text-embedding-ada-002")

2. Search for most similar context

We then use the query vector to find the most similar context

context = ds.to_table(
            "column": "vector",
            "k": 3,
            "q": query_vector

3. Create a prompt for the OpenAI completion API

LangChain and similar tools have lots of Prompt templates for your use case. For the task at hand, we create our own

        "Answer the question based on the context below.\n\n"+

And we plug in the text from step 2 above into the context. We then use OpenAI again to get the answer

def complete(prompt):
    # query text-davinci-003
    res = openai.Completion.create(
    return res['choices'][0]['text'].strip()

An example usage would look like:

>> query = "who was the 12th person on the moon and when did they land?"
>> complete(query)
'The 12th person on the moon was Harrison Schmitt, and he landed on December 11, 1972.'

Putting it all together

Remember our question was, “Which training method should I use for sentence transformers when I only have pairs of related sentences?” Using the steps outlined above, I get back:

“NLI with multiple negative ranking loss”

And since Lance allows you to retrieve additional columns easily, I can not only show the most relevant YouTube video, but start it at the right place in the video:


In this post, we saw how to use Lance as a critical component to power an LLM-based app. In addition, we went through an end-to-end workflow peeling back the covers. For these types of search workflows, Lance is a great fit because 1) it’s super easy to use for ANN, 2) it’s columnar so you can add a ton of additional features, and 3) it has lightning fast random access speed so the index can be disk-based and extremely scalable.

If you’d like to give it a shot, you can find the Lance repo on github. If you like us, we’d really appreciate a ⭐️ and your feedback!

Special thanks to Rob Meng for inspiring this post