
Apache Parquet vs. Newer File Formats (BtrBlocks, FastLanes, Lance, Vortex)

Sep 26, 2025

For over a decade, Apache Parquet has been the cornerstone of analytical data storage. Parquet emerged in the Hadoop era as an open columnar file format designed for large-scale analytical workloads. Its structure (columnar layout, per-page compression, and strong encoding schemes) was perfectly aligned with the needs of the time: high-throughput batch analytics on massive datasets stored in data lakes.


Over time, Parquet became the de facto standard. Every major compute engine, including Spark, Trino, and Flink, works with it. Open table formats in a lakehouse architecture, such as Apache Iceberg, Delta Lake, and Apache Hudi, all rely on Parquet as their default storage substrate. That said, we are now also seeing the emergence of newer workloads. Batch analytics remains important, but today’s pipelines stretch further:

  • AI pipelines require fast feature retrieval, vector search, and low-latency scoring.
  • Hardware has diversified — beyond CPUs, organizations are increasingly using GPUs, wide SIMD instructions, and even ARM and RISC-V architectures.
  • Storage has changed — NVMe-backed systems and memory-mapped datasets call for fine-grained, cache-friendly data access.

This shift has prompted the community to ask different questions. Instead of focusing on whether Parquet is ‘good enough,’ the real question becomes: what additional capabilities do modern workloads demand, and do we need new file formats to meet them?

Why Apache Parquet Became the Default

To understand why new file formats are being built, it’s worth reflecting on why Parquet became so dominant. The thing is, Parquet’s architecture is deceptively simple but extremely effective:

  • Columnar layout: Storing data column by column groups homogeneous values together. This improves compression efficiency and enables predicate pushdown by pruning unnecessary columns during query execution.
  • Row groups and pages: Within this columnar layout, Parquet organizes data into row groups (commonly ~128 MB), which are further divided into column chunks and smaller pages. This structure defines clear, fixed-size chunks of data and makes it easier for query engines to parallelize scans and skip over unneeded sections, improving efficiency at scale.
  • Encodings and compression: Each page can use type-specific encodings, such as Dictionary, Run-Length Encoding (RLE), or Delta encoding, combined with block compression (Snappy, Zstd, LZ4, GZIP). This two-layer design provides both speed and compactness.
  • Statistics and filtering: Parquet stores per-page and per-column statistics such as min/max values, null counts, and distinct counts. These allow query engines to skip pages or entire row groups when predicates fall outside recorded ranges. Parquet also supports dictionary filtering (using dictionary values for comparisons) and optional bloom filters for selective reads. Together, these features make predicate pushdown highly effective.
  • Interoperability: Virtually every compute engine and storage system supports Parquet. It became the default choice not just by technical merit but by ecosystem consensus.
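
To make these design knobs concrete, here is a minimal PyArrow sketch (the file name and column values are illustrative) showing row group sizing, encodings, and compression at write time, plus column pruning and predicate pushdown at read time:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "user_id": pa.array(range(1_000_000), type=pa.int64()),
    "country": pa.array(["US", "DE", "IN", "BR"] * 250_000),
    "score": pa.array([i * 0.5 for i in range(1_000_000)]),
})

# Row groups, dictionary encoding, and block compression are all
# configurable at write time.
pq.write_table(
    table,
    "events.parquet",
    row_group_size=128_000,   # rows per row group
    compression="zstd",       # block compression applied per page
    use_dictionary=True,      # dictionary encoding where it helps
)

# Column pruning + predicate pushdown: only the 'score' column is read,
# and row groups whose min/max statistics exclude the predicate can be
# skipped entirely.
result = pq.read_table(
    "events.parquet",
    columns=["score"],
    filters=[("user_id", ">", 999_000)],
)
print(result.num_rows)
```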

For scan-heavy batch analytics, these features remain hard to beat. A large table stored in Parquet can be efficiently scanned, compressed to save storage costs, and pruned using statistics. This explains why Parquet has remained the backbone of big data architectures (e.g. data lakes) for so long. But as mentioned, the analytics ecosystem itself is evolving, and some workloads bring requirements Parquet was never originally designed to optimize for. Let’s look at some of these aspects.

Where Parquet Struggles

Parquet was designed in an era when workloads were dominated by sequential batch scans and execution was CPU-centric. Today, this design brings some limitations.

  1. Decode bottlenecks. Heavyweight codecs like Zstd produce excellent compression ratios, but decompression can saturate CPU cycles, particularly in pipelines where query latency matters more than disk savings.
  2. Random access inefficiency. Nested and variable-width data in Parquet requires reading and decompressing entire pages, even for small slices. Random lookups become I/O-expensive, with significant read amplification.
  3. Memory pressure. Row groups default to roughly 128 MB. To access even a single record, large chunks may need to be decompressed, inflating working sets in RAM and hurting cache efficiency.
  4. Lack of SIMD/GPU awareness. Parquet’s encodings and compression schemes are not optimized for data-parallel execution. Modern CPUs with wide SIMD extensions (AVX-512) and GPUs cannot be fully leveraged, leaving performance potential on the table.

For many pipelines, these are not a big deal. However, today’s AI pipelines often need to fetch individual feature vectors or embeddings quickly. RAG workloads are especially sensitive to random access performance, since each query may need to fetch small slices of data across massive corpora stored on NVMe. Vector search systems operate over billions of high-dimensional vectors where query performance hinges on microseconds of latency. In these cases, Parquet’s page-oriented decompression and lack of SIMD/GPU optimization can become real bottlenecks, introducing unnecessary I/O, memory pressure, and wasted CPU cycles.
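
To see the read amplification concretely, here is a sketch (reusing the hypothetical events.parquet file from earlier) of what a single-row lookup costs in Parquet: the footer metadata locates the right row group, but the entire group still has to be read and decompressed for one record.

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("events.parquet")

# Walk the footer metadata to find which row group holds the target row,
# then decompress that entire group (~128K rows here) for one record.
target_row = 500_123
rows_seen = 0
for rg in range(pf.metadata.num_row_groups):
    rg_rows = pf.metadata.row_group(rg).num_rows
    if rows_seen + rg_rows > target_row:
        group = pf.read_row_group(rg)                  # whole group decoded
        row = group.slice(target_row - rows_seen, 1)   # the one row we wanted
        break
    rows_seen += rg_rows

print(row.to_pydict())
```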

Rise of New File Formats


The reason we’re seeing a wave of new file formats is simple: researchers and practitioners are pushing beyond the assumptions Parquet was built on. Each new format makes different trade-offs in encoding, layout, and execution to meet these newer demands. In this section, we will quickly go over some of the newer file formats and the value their innovations bring.

BtrBlocks

BtrBlocks, developed at TUM, introduces the idea of cascaded lightweight compression (LWC). Instead of relying on heavyweight compressors like Zstd, it uses chains of lightweight encodings (bit-packing, dictionary, frame-of-reference). A greedy, sample-based algorithm selects the best chain per column segment.

This approach provides fast decompression while maintaining strong compression ratios, particularly on integer-heavy data. The main limitation is that BtrBlocks is CPU-oriented and not fully optimized for SIMD/GPU execution. Still, it set the stage for newer formats to build on the idea of codec chaining.
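
To illustrate the idea (this is a toy sketch of cascaded lightweight compression, not BtrBlocks’s actual implementation), here is what greedy, sample-based selection over candidate encoding chains might look like:

```python
import numpy as np

def frame_of_reference(values: np.ndarray):
    """Subtract the segment minimum so only small residuals remain."""
    base = values.min()
    return base, values - base

def bits_needed(values: np.ndarray) -> int:
    """Bits per value needed to represent the residuals (conceptual)."""
    return max(int(values.max()).bit_length(), 1)

def estimate_chain_cost(sample: np.ndarray, chain: str) -> float:
    """Estimated bits/value for a candidate encoding chain, on a sample."""
    if chain == "for+bitpack":
        _, residuals = frame_of_reference(sample)
        return bits_needed(residuals)
    if chain == "dict+bitpack":
        # Code width only; a real estimator also counts dictionary storage.
        return max(len(np.unique(sample)).bit_length(), 1)
    return 64.0  # plain int64, no compression

# One column segment: values cluster tightly around 1,000,000.
segment = np.random.randint(1_000_000, 1_001_000, size=65_536)
sample = np.random.choice(segment, size=1_024, replace=False)

# Greedy, sample-based selection: pick the cheapest chain per segment.
chains = ["for+bitpack", "dict+bitpack", "plain"]
best = min(chains, key=lambda c: estimate_chain_cost(sample, c))
print("chosen chain:", best)
```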

FastLanes

FastLanes, a research project from CWI, pushes the LWC concept further and adapts it for modern hardware. Its key innovations include:

  • Expression Encoding. Instead of one codec per column, FastLanes allows flexible chains of lightweight codecs (e.g., Frame-of-Reference, Delta, Dictionary, FSST, ALP). These outperform heavyweight codecs in both speed and compression.
  • Multi-Column Compression (MCC). By exploiting inter-column correlations (like equality, one-to-one mappings, or string splitting), FastLanes compresses beyond traditional columnar limits.
  • Segmented layout. Data is decompressed in small vectors (≈1K values) rather than entire row groups. This reduces memory pressure and improves cache efficiency.
  • Compressed execution. Instead of fully materializing decoded vectors, FastLanes returns compressed vectors directly to engines (e.g. DuckDB, Velox), allowing SIMD/GPU-friendly execution on compressed data.

In their paper, the FastLanes authors demonstrate that formats designed with SIMD and GPUs in mind can deliver orders-of-magnitude gains on modern hardware.
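
As a toy illustration of the compressed-execution idea (not the actual FastLanes implementation), consider evaluating a predicate directly on a frame-of-reference-encoded segment by rewriting the constant into residual space:

```python
import numpy as np

# A FastLanes-style segment: ~1K values, frame-of-reference encoded.
base = 1_000_000
residuals = np.random.randint(0, 256, size=1_024).astype(np.uint8)
# Logical values are base + residuals, but we never materialize them.

# Predicate: value > 1_000_200. Rewrite the constant into residual
# space and compare the compressed representation directly. This is a
# single vectorized comparison over narrow uint8 lanes, which maps
# naturally onto SIMD registers.
threshold = 1_000_200 - base  # = 200, still within the residual domain
selection = residuals > threshold

print(int(selection.sum()), "of", len(residuals), "values match")
```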

Lance (LanceDB)

The Lance format, developed alongside LanceDB, is built around the realities of NVMe-backed storage and random access workloads.

Key features:

  • Adaptive structural encodings. Rather than hardcoding how validity and repetition are stored, Lance uses adaptive strategies that balance random access and full scans.
  • Dual encoding schemes. “Full zip” encoding is efficient for scans, while “miniblock” encoding optimizes random access.
  • Repetition index. Enables random access in 1–2 IOPS per lookup, independent of nesting depth. This is a dramatic improvement over Parquet, which scales poorly for nested data.
  • Struct packing. Groups entire structs into single columns, boosting throughput for multi-field access at the expense of fine-grained column projection.

Benchmarks show Lance matching or exceeding Parquet for full scans while significantly outperforming Parquet and Arrow for random access, particularly for nested data types.
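
As a quick taste of the random-access workflow, here is a hedged sketch assuming the pylance package (pip install pylance); the dataset name and values are illustrative:

```python
import lance
import pyarrow as pa

# A small table of IDs and toy 8-dimensional embeddings.
table = pa.table({
    "id": list(range(10_000)),
    "embedding": [[float(i)] * 8 for i in range(10_000)],
})

# Write a Lance dataset (a directory, not a single file).
lance.write_dataset(table, "vectors.lance")

# Point lookups by row offset: the format is designed so that takes
# like this cost only 1-2 IOPS each, instead of decompressing an
# entire row group as Parquet would.
ds = lance.dataset("vectors.lance")
rows = ds.take([3, 1_500, 9_999])
print(rows.num_rows)  # 3
```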

Nimble (Meta)

Meta introduced Nimble, a new columnar file format optimized for machine learning feature stores and very wide tables (thousands of columns). Its design goals:

  • Lightweight metadata for handling extremely wide schemas.
  • Cascaded encodings, with support for SIMD and GPU acceleration.
  • Portable implementation for consistent decoding across engines.

In internal benchmarks, Meta reported Nimble achieving 2–3× faster decode times than other columnar formats like ORC. While still early-stage, Nimble reflects the trend of designing formats around ML pipelines rather than traditional batch analytics.

Vortex: Compressed Arrow Arrays

Vortex extends the Arrow ecosystem by supporting compressed Arrow arrays across memory, disk, and network. Its goal is to keep data compressed end-to-end while retaining Arrow’s zero-copy semantics. Vortex emphasizes:

  • Random access performance: claimed to be 100–200× faster than Parquet.
  • Scan performance: 2–10× faster than Parquet while maintaining similar compression ratios.
  • Portability: WASM decoders make Vortex usable in web and embedded environments.

Some Numbers

To ground the discussion, here are some figures reported by the projects themselves:

FastLanes

  • ~43–44× faster decode vs. Parquet+Snappy/Zstd
  • 315–416× faster random access than Parquet
  • ~2% better compression than Parquet+Zstd (without heavyweight codecs)
  • SIMD scaling: ~40% faster with AVX-512

Lance

  • Random access sustained at 1–2 IOPS per lookup, independent of nesting depth.

Nimble

  • Reported 2–3× faster decode than ORC on ML feature store workloads.

Vortex

  • Claimed 100–200× faster random access vs. Parquet
  • 2–10× faster scans than Parquet, with similar compression ratios and write throughput.

Parquet’s influence is undeniable: it gave the ecosystem a standard file format for analytical storage. And I am sure that foundation isn’t going away. But the rise of new workloads shows that storage formats themselves are now a site of innovation, which requires rethinking how storage interacts with execution engines and hardware. Formats like FastLanes, Lance, Nimble, and Vortex each explore different dimensions of this problem. For practitioners, the question becomes when to rely on Parquet’s universality and when to reach for a specialized format to unlock specific benefits. We will see!
