Almost a year ago I announced we were going to be embarking on a journey to build a new 2.0 version of our file format. Several months later we released a beta, and last fall it became our default file format. Overall, I've been super pleased with how well it has worked. As we've been working with the community and stressing the format in production, we've identified a number of areas for improvement, and for the past few months we've been working on a 2.1 format to address them.
As our 2.1 format enters beta I wanted to take some time and look back on the 2.0 format and talk about what worked, what didn't work, and what we're going to be doing differently in 2.1.
2.0: We Got Rid of Row Groups
The most direct success of the 2.0 format, by far, was getting rid of row groups. We have not missed them...ever...at all. We have not lost any opportunity for parallelism or performance, and we have removed the single biggest footgun from Parquet. I described my reasoning for this already in an earlier post, so I don't need to go into details here.
2.0: We Filled the I/O Queue
The most important performance concept in modern I/O, whether it's NVMe or cloud storage, is queue depth. If you don't have enough concurrent I/O requests in flight, you leave bandwidth on the floor. In fact, having too many I/O requests in flight is better than too few, even though it costs you overhead and fairness. In 2.0 we built a scheduling algorithm that hits whatever queue depth we want, in priority order, without going over our parallelism budget, so we don't even have to pay those overhead and fairness costs.
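To make the idea concrete, here is a minimal sketch (not Lance's actual scheduler) of priority-ordered scheduling under a fixed parallelism budget, using Python's asyncio. The request tuples and the `fake_read` stand-in are invented for illustration:

```python
import asyncio

async def fake_read(offset: int, size: int) -> None:
    # Stand-in for a real read against NVMe or object storage.
    await asyncio.sleep(0)

async def schedule(requests: list[tuple[int, int, int]], queue_depth: int = 32) -> None:
    # Requests are (priority, offset, size); lower priority values issue first.
    pq: asyncio.PriorityQueue = asyncio.PriorityQueue()
    for req in requests:
        pq.put_nowait(req)

    async def worker() -> None:
        # Each worker keeps exactly one request in flight, so the worker
        # count is both the parallelism budget and the target queue depth.
        while True:
            try:
                _priority, offset, size = pq.get_nowait()
            except asyncio.QueueEmpty:
                return
            await fake_read(offset, size)

    await asyncio.gather(*(worker() for _ in range(queue_depth)))

asyncio.run(schedule([(0, 0, 4096), (1, 4096, 4096), (0, 8192, 4096)]))
```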

I discussed most of this in depth in an earlier post. However, I was not prepared for just how effective this was. I had hoped, originally, to match Parquet's scan speed while introducing random access. I figured this meant we would be slower than Parquet until we figured out compression. However, I was surprised to find that Lance often beats Parquet in full scans. This turned out to be true even when the file was 2-3 times bigger!
2.0: Our Flexible Container Format um...Flexed?
The overall structure of the file format has worked great. Our encoding scheme for 2.0 evolved significantly as it was developed. In 2.1 we pretty much reworked everything related to encodings. The protobufs have been many, and they have been varied, and they have enabled extremely rapid prototyping. Through all of this, I have not once encountered any reason to change the overall file structure. Pages, column descriptors, and a file footer are a simple, consistent, and extremely flexible structure.

2.1: We're Squeezing Out Compression via Structural Encoding
The 2.0 format has very little compression, primarily limited to dictionary compression and a few other tricks. This often surprises people coming from Parquet, but the truth is, compression isn't as critical for us. Our primary goal is working well with multimodal data such as embeddings, tensors, audio, video, images, etc. These types are typically compressed already (often with lossy compression), and they're so large that compressing the metadata alongside them isn't as significant as you might expect. As a result, compression was a lower priority, and as we started using 2.0 internally and customers started adopting it in production, we realized we needed to tie things off and ship something stable.
That being said, we've found some areas where compression is significant. Compressing metadata doesn't have much impact on overall storage costs, but it can significantly speed up things like prefiltering. Also, text for semantic search is a very common part of our users' datasets, and some of these text-heavy datasets (e.g. Common Crawl, FineWeb, GitHub Code) have massive text columns which need solid compression.
Integrating compression into Lance is trickier than it might seem, because compression is implicitly related to I/O scheduling. For example, if you delta-encode your values, or you strip out NULLs during compression, then it can become difficult to extract a single value from the compressed output. Fortunately, while we were working on 2.0, we had two brilliant interns run a variety of compression experiments, and this really helped us nail down these integration points and refine our encoding traits. What we discovered is that we were mashing together two different concepts. In 2.1 we have split the generic step of "encoding" into structural encoding and compressive encoding.
I'm going to be writing an entire blog post on this topic, but in short: structural encoding tells us how an "array" is split into a series of "buffers", and compressive encoding then focuses on how those buffers can be compressed. For example, in Arrow we have a well-defined standard for splitting an array into buffers. But...there are too many buffers! (This turns out to be bad for random access.) In Parquet, we have a completely different standard for splitting an array into buffers that uses far fewer buffers but can introduce read amplification. In Lance 2.1 we now have two different strategies that we switch between depending on the size of the value.

In fact, there was one more lesson for us. Once we figured out structural encoding we suddenly discovered a way to create a taxonomy of compression algorithms. For example, transparent compression algorithms (like bitpacking) support random access while opaque compression algorithms (like delta encoding) do not. You might think we should obviously just use transparent compression everywhere, but (and here is where the magic comes in) the structural encoding you choose controls which category of compressive encoding you are allowed to use. There are times in Lance 2.1 where opaque compression is perfectly valid! I promise I will write a lot more about this later.
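To illustrate the split with toy encoders (not Lance's actual ones): a transparent encoding like bitpacking lets you compute exactly where value i lives, while an opaque encoding like delta encoding forces you to decode everything that precedes it:

```python
def bitpacked_get(packed: bytes, i: int, width: int) -> int:
    # Transparent: value i sits at a computable bit offset, so we can
    # read just the bytes that cover it and ignore everything else.
    bit = i * width
    byte, shift = divmod(bit, 8)
    nbytes = (shift + width + 7) // 8
    chunk = int.from_bytes(packed[byte:byte + nbytes], "little")
    return (chunk >> shift) & ((1 << width) - 1)

def delta_get(first: int, deltas: list[int], i: int) -> int:
    # Opaque: value i only exists as a running sum, so recovering it
    # means decoding every delta before it.
    return first + sum(deltas[:i])

packed = bytes([0b11011101, 0b1])    # three 3-bit values: 5, 3, 7
print(bitpacked_get(packed, 2, 3))   # -> 7, via one targeted read
print(delta_get(100, [2, 2, 2], 2))  # -> 104, but only after a prefix sum
```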
2.1: We're Knocking Out the 1-2 IOP Challenge
As I was working on 2.0 encodings I would often refer to something I called the "1-2 IOP challenge" (it's a punchy name 😉). The dream was to come up with an encoding so that I could access any element in an array in a single IOP (for fixed width data types) or 2 IOPS (for variable width data types). This sounds simple at first. We all know how to grab a string/binary value. First we grab the offsets and then we grab the value.
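Here's that classic two-IOP lookup as a toy sketch, assuming a hypothetical on-disk layout with a little-endian i64 offsets array followed by the value bytes:

```python
import struct

def read_string(f, offsets_pos: int, values_pos: int, i: int) -> bytes:
    # IOP 1: read the two offsets that bracket value i.
    f.seek(offsets_pos + i * 8)
    start, end = struct.unpack("<qq", f.read(16))
    # IOP 2: read the value's bytes.
    f.seek(values_pos + start)
    return f.read(end - start)
```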
Unfortunately, it seems that someone has told customers about fancy data types. Also, customers love NULL values. As an example, let's look at what happens if you have a list of strings (e.g. tags, facets, etc.). This one array has many pieces of information.
What does this mean? Well, if you've been keeping up on your Arrow columnar format homework, you know that we now have a struct validity buffer, a list validity buffer, a list offsets buffer, a string validity buffer, a string offsets buffer, and a string values buffer. In Lance 2.0 this, unfortunately, meant that we needed to perform 6 IOPS to fetch a single value. We have failed the 1-2 IOP challenge.
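You can watch the buffers pile up with pyarrow (I'm wrapping the list in a struct here so that all six buffers enumerated above appear):

```python
import pyarrow as pa

# A struct wrapping a nullable list of nullable strings, with NULLs at
# every level so that every validity bitmap is materialized.
arr = pa.array(
    [{"tags": ["a", None]}, {"tags": None}, None],
    type=pa.struct([("tags", pa.list_(pa.string()))]),
)

names = [
    "struct validity", "list validity", "list offsets",
    "string validity", "string offsets", "string values",
]
# Array.buffers() includes the buffers of child arrays.
for name, buf in zip(names, arr.buffers()):
    print(f"{name}: {buf}")
```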

In Lance 2.1 it turns out the answer is once again related to the idea of structural encodings. Recall that the structural encoding tells us how an array is split into buffers (that's important because it controls our IOPS). What's more, the structural encodings all have a concept known as a repetition index. The union of these two concepts turns out to be exactly what we need for the 1-2 IOP challenge, and I am proud to say that we have solved it. No matter what data type you have, no matter how complex it is, no matter how many layers of validity you've stashed away, we can access any value in 1 or 2 IOPS. I'll definitely be writing more on this concept soon.
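The repetition index deserves (and will get) its own post, so purely as an illustration of why this collapses IOPS, imagine a hypothetical fixed-width record per row that pre-merges every level of offsets and validity into a single lookup (this is not Lance's actual layout):

```python
import struct

# One record per row: a NULL flag plus the final start/end byte range of
# the row's values, flattened across all nesting levels.
REC = struct.Struct("<Bqq")

def read_row(f, index_pos: int, values_pos: int, row: int):
    f.seek(index_pos + row * REC.size)  # IOP 1: the per-row record
    null_flag, start, end = REC.unpack(f.read(REC.size))
    if null_flag:
        return None                     # NULL at some level: 1 IOP total
    f.seek(values_pos + start)          # IOP 2: the value bytes
    return f.read(end - start)
```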

2.1: We're Pushing Statistics So Far Down They Fall Out
During the 2.0 development process we had a working prototype of pushdown filtering. I didn't love it, but it worked. It then broke. We fixed it. Then it broke again. As I write this, that defunct prototype is still lurking in the code, but I've finally figured out why it just didn't stick.
There are a lot of reasons for this, but I'll boil them down to "introducing all this compute complexity in the file format, when you have a fully functional and complex query engine just sitting there, is just a lot of work for little gain". What's more, keeping statistics decoupled from encoding can be helpful for performance too.
It turns out the answer, for us, is that statistics-based pushdown is "just another index" (technically, a primary index), and it is much easier for us to store this information outside the file (just like we do with all our other indices). This concept may sound strange, but it has existed for a very long time under a different name: the zone map (not the one you use for planting your garden).
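As a sketch of the general zone map idea (not Lance's specific index layout): store min/max statistics per fixed-size block of rows, then skip any block whose range can't satisfy the filter:

```python
def build_zone_map(values: list, block_size: int = 1024) -> list:
    # One (min, max) entry per block of rows.
    return [
        (min(values[i:i + block_size]), max(values[i:i + block_size]))
        for i in range(0, len(values), block_size)
    ]

def blocks_to_read(zone_map: list, lo, hi) -> list:
    # A block can only contain matches for lo <= x <= hi if its
    # [min, max] range overlaps [lo, hi]; every other block is skipped.
    return [i for i, (mn, mx) in enumerate(zone_map) if mx >= lo and mn <= hi]

zones = build_zone_map(list(range(10_000)))
print(blocks_to_read(zones, 2_000, 2_100))  # -> [1, 2]
```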

By storing these indices outside the file we can pick which columns we want to index after we've already written the file. We can also retrain the index with a different block size without rewriting the file. What's more, we just so happen to have infrastructure for managing external indices lying around.
I can already hear you asking, "What if I don't have a query engine lying around?" Well, in that case, we will give you one. We take some kind of compute engine like DataFusion (or possibly just arrow-rs) and shove it into a plugin. This plugin calculates the index during write and stores it in the file as a free global buffer. Then, during read, the plugin will apply the filter.
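To be concrete about the shape (the names and signatures below are hypothetical, not a real Lance API), such a plugin might look something like this:

```python
from typing import Protocol

import pyarrow as pa
import pyarrow.compute as pc

class PushdownPlugin(Protocol):
    def build_index(self, batch: pa.RecordBatch) -> bytes:
        """Called during write; the result would be stored in the file
        as a free global buffer."""
        ...

    def filter_blocks(self, index: bytes, predicate: pc.Expression) -> list[int]:
        """Called during read; returns the ids of blocks that might match."""
        ...
```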
2.1: We're Experiencing Déjà Vu with I/O Scheduling
One of the things we got right in 2.0 was I/O scheduling. It turns out this is one of the things we also got wrong. We could saturate the bandwidth on full scans, but random access on NVMe is tricky. A good NVMe drive can perform close to a million reads a second (the $200 disk I have at home for ~~playing games~~ daily development can achieve ~800K reads/s). That works out to an average of barely a microsecond per read.

At the risk of stating the obvious, this is not a lot of time. In 2.0 our poor little read request requires two thread transfers. First we put it in a queue to the I/O thread. Then we put the response back into a queue for the decode thread. As a result we were spending 2-3 microseconds of overhead for a request that only takes a single microsecond. Adding a fully synchronous blocking API brought our file reader from 40K values/second to 400K values/second in some low-level benchmarking! As part of the stabilization for 2.1 I'm hoping to look into this further, and make up new terms like "semi-synchronous I/O".
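You can feel this overhead with a toy round-trip measurement (a stand-in, not our reader): hand a request to an I/O thread through one queue and take the response back through another:

```python
import queue
import threading
import time

requests: queue.Queue = queue.Queue()
responses: queue.Queue = queue.Queue()

def io_thread() -> None:
    while True:
        req = requests.get()
        if req is None:
            return
        responses.put(req)  # pretend the read itself is free

threading.Thread(target=io_thread, daemon=True).start()

N = 100_000
start = time.perf_counter()
for i in range(N):
    requests.put(i)   # transfer 1: request to the I/O thread
    responses.get()   # transfer 2: response back to the decode side
elapsed = time.perf_counter() - start
requests.put(None)
print(f"~{elapsed / N * 1e6:.1f} us of queue overhead per request")
```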
2.1 Timeline
The 2.1 format is now officially in beta. There are a few known limitations but it should work for most types. Follow our work and track its completion here. If you want to run experiments you can enable the 2.1 format in a few ways:
```python
import lance
import pyarrow as pa

data = pa.table({"x": range(1000)})
ds = lance.write_dataset(data, "/tmp/test_ds", data_storage_version="2.1")
print(ds.data_storage_version)  # Should be "2.1"
```
The `data_storage_version` is set when a new dataset is created and will control all data written by that dataset.
If you'd prefer to work directly with Lance files for experimentation or lower-level access, we also have Python bindings for a file reader/writer:

```python
from lance.file import LanceFileReader, LanceFileWriter
import pyarrow as pa

data = pa.table({"x": range(1000)})
with LanceFileWriter("/tmp/test_file.lance", version="2.1") as writer:
    writer.write_batch(data)

reader = LanceFileReader("/tmp/test_file.lance")
print(reader.metadata())
```
Over the next few months we will be adding new tests, fixing some remaining todos, dogfooding, and tuning the performance. We encourage you all to try it out and do your own experiments. Feel free to start filing bugs about 2.1 and we will take a look. I'll also be writing additional blog posts going over the new features at a much lower level.