With the rise of cloud data lakes to democratize data & enable all sorts of analytical workloads (BI, Data Science), one of the most critical decisions for an organization’s data architecture is which table format to adopt. If you are new to the space of data lakes & lakehouses, the article below should give you a high-level overview of what table formats are and why they are necessary.
Table formats in a Data lake & why Apache Iceberg?
In the past few days I have been learning & sharing my experience with the data community about the ‘Apache Iceberg’…
Apache Iceberg is one of the three major open table formats currently available for organizing and tracking data files in data lakes. Before these, Apache Hive was the only table format widely used with HDFS (Hadoop Distributed File System). Among other non-trivial features, Iceberg supports ACID transactions. That means you can perform data warehouse-style operations such as INSERT, DELETE & UPDATE directly on your data lake storage (Amazon S3, Microsoft ADLS, etc.) 🎊
In this article, we will take a look at some of the significant features that Apache Iceberg provides out of the box. Many of these features set Iceberg apart from the other available formats such as Delta Lake & Apache Hudi. Let’s go 🏃‍♂️!
As we would expect with database tables, can you run expressive SQL queries with Iceberg? For sure 💯! Iceberg supports flexible SQL commands to merge new data, update existing rows, and perform targeted deletes. The MERGE INTO command in particular makes merging two tables very flexible.
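To make MERGE INTO a bit more concrete, here is a minimal sketch wrapped in Python. The catalog, table and column names (demo.db.customers, demo.db.customer_updates, customer_id, email, op) are made up for illustration, and `spark` is assumed to be a SparkSession that already has an Iceberg catalog and the Iceberg SQL extensions configured.

```python
# Hypothetical tables and columns, used only for illustration.
# The source table is assumed to carry an `op` column marking deletes.
MERGE_SQL = """
MERGE INTO demo.db.customers t
USING demo.db.customer_updates s
ON t.customer_id = s.customer_id
WHEN MATCHED AND s.op = 'delete' THEN DELETE
WHEN MATCHED THEN UPDATE SET t.email = s.email
WHEN NOT MATCHED AND s.op <> 'delete' THEN
  INSERT (customer_id, email) VALUES (s.customer_id, s.email)
"""


def merge_customer_updates(spark):
    """Apply the staged changes to the target table with one MERGE statement.

    `spark` is assumed to be an already configured SparkSession.
    """
    return spark.sql(MERGE_SQL)
```

Each WHEN clause handles one case, so a single statement can cover deletes, updates and inserts together.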
Partitioning is a technique to make queries faster & more efficient by grouping similar rows together so you don’t have to perform full-table scans. Earlier table formats such as Hive supported partitioning; however, with Hive you have to maintain separate partition columns yourself. At query time, you also need to supply a filter on those partition columns, otherwise the query scans every partition.
What gives Apache Iceberg the edge 🚀 is something called “hidden partitioning”. Basically, Iceberg:
⚡️handles the tedious task of producing partition values for rows in a table
⚡️avoids reading unnecessary partitions automatically
⚡️most importantly, lets partition layouts evolve as needed
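The first two points can be pictured with a toy model in plain Python. This is not Iceberg’s implementation, just a sketch of the idea: the writer only supplies a timestamp, the table derives the day partition from it, and a filter on the timestamp is translated into a filter on partitions. `ToyTable` and its methods are hypothetical names, and partition evolution is not covered here.

```python
from collections import defaultdict
from datetime import datetime


class ToyTable:
    """Toy table partitioned by day(ts). Illustration only, not Iceberg code."""

    def __init__(self):
        self.partitions = defaultdict(list)  # derived day -> rows

    def append(self, row):
        # The writer never sets a partition column; the partition value
        # is derived from the row's own `ts` field.
        self.partitions[row["ts"].date()].append(row)

    def scan(self, ts_from, ts_to):
        """Return (rows with ts_from <= ts < ts_to, partitions actually read)."""
        rows, partitions_read = [], 0
        for day, day_rows in self.partitions.items():
            # The filter on `ts` is turned into a filter on the derived day,
            # so partitions outside the range are skipped entirely.
            if day < ts_from.date() or day > ts_to.date():
                continue
            partitions_read += 1
            rows.extend(r for r in day_rows if ts_from <= r["ts"] < ts_to)
        return rows, partitions_read


table = ToyTable()
for ts in (datetime(2022, 1, 1, 9), datetime(2022, 1, 2, 12), datetime(2022, 1, 5, 18)):
    table.append({"ts": ts, "event": "click"})

# The query filters on `ts` only; no separate partition filter is needed.
rows, partitions_read = table.scan(datetime(2022, 1, 2, 8), datetime(2022, 1, 2, 20))
print(len(rows), partitions_read)  # prints: 1 1
```

Only one of the three day partitions is read, even though the query never mentions a partition column.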
Over the course of time, there will be scenarios where you need to change the schema of your table: add or rename columns, etc. This is called schema evolution. With Apache Iceberg, you can make these changes without worrying about bringing back ‘zombie’ 🧟‍♂️ data. Iceberg schema evolution supports safe column Add, Drop, Update, Reorder and Rename.
And you know the best part? Iceberg guarantees that schema evolution changes are independent and free of side effects. They are metadata-only changes, so you don’t have to rewrite the data files or rebuild the entire table, which is typically very cost-intensive. How does it do that?
⚡️Each column in an Iceberg table is tracked using unique IDs. When you add a column, it gets a new ID so that it doesn’t get mixed up with the existing data.
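Here is a rough way to picture why column IDs matter, as a toy model in plain Python (`ToySchema` is a hypothetical name, not an Iceberg class). Stored rows are keyed by column ID rather than by name, so a rename keeps pointing at the same data, while a dropped and re-added column gets a fresh ID and never resurrects old values.

```python
from dataclasses import dataclass, field


@dataclass
class ToySchema:
    """Toy schema that tracks columns by ID. Illustration only."""

    columns: dict = field(default_factory=dict)  # column ID -> current name
    last_id: int = 0

    def add(self, name):
        # Every new column gets an ID that was never used before.
        self.last_id += 1
        self.columns[self.last_id] = name
        return self.last_id

    def id_of(self, name):
        for col_id, col_name in self.columns.items():
            if col_name == name:
                return col_id
        raise KeyError(name)

    def rename(self, old, new):
        # Only the name changes; the ID (and the data behind it) stays put.
        self.columns[self.id_of(old)] = new

    def drop(self, name):
        del self.columns[self.id_of(name)]

    def read(self, stored_row):
        """Project a stored row (keyed by column ID) onto the current schema."""
        return {name: stored_row.get(col_id) for col_id, name in self.columns.items()}


schema = ToySchema()
schema.add("id")      # ID 1
schema.add("email")   # ID 2
old_row = {1: 42, 2: "a@example.com"}  # written under the first schema

schema.rename("email", "contact")
after_rename = schema.read(old_row)    # {'id': 42, 'contact': 'a@example.com'}

schema.drop("contact")
schema.add("contact")                  # new column, gets ID 3
after_readd = schema.read(old_row)     # {'id': 42, 'contact': None}
```

In both cases the old data file is left untouched; only the schema metadata changes.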
As with any traditional enterprise data warehouse, there will be scenarios:
👉🏻 when you would like to query data as of a specific point in time or audit modified/deleted data
👉🏻 when you need to roll back tables to a specific version for a variety of reasons (bad data, pipeline bug, etc.)
It is therefore imperative to have a similar ability in a data lake table & Apache Iceberg facilitates that very well. But how?
📸 SNAPSHOTS: A snapshot is the state of a table at a given point in time. Iceberg keeps a log of previous snapshots of the table, which allows ‘time travel’ queries. Iceberg supports two Spark read options for accessing snapshots: ‘snapshot-id’ (a specific snapshot ID) & ‘as-of-timestamp’ (a timestamp in milliseconds). Read more about how you can achieve time travel below.
Spark Queries: To use Iceberg in Spark, first configure Spark catalogs. Iceberg uses Apache Spark's DataSourceV2 API…
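As a quick sketch of those two read options in PySpark, the helpers below take an existing `spark` session (assumed to be configured for Iceberg) and a table identifier or path. The helper names are hypothetical, and the rollback statement assumes the Iceberg SQL extensions and the `rollback_to_snapshot` procedure are available in your catalog.

```python
from datetime import datetime, timezone


def read_snapshot(spark, table, snapshot_id):
    """Read the table exactly as it was at the given snapshot ID."""
    return (
        spark.read.format("iceberg")
        .option("snapshot-id", snapshot_id)
        .load(table)
    )


def read_as_of(spark, table, when):
    """Read the table as of a point in time.

    `when` should be a timezone-aware datetime; the option expects
    milliseconds since the Unix epoch.
    """
    millis = int(when.timestamp() * 1000)
    return (
        spark.read.format("iceberg")
        .option("as-of-timestamp", str(millis))
        .load(table)
    )


def rollback_sql(catalog, table, snapshot_id):
    """Build the CALL statement that rolls a table back to a snapshot."""
    return f"CALL {catalog}.system.rollback_to_snapshot('{table}', {snapshot_id})"


# Example call (hypothetical table name):
# df = read_as_of(spark, "demo.db.events", datetime(2022, 1, 1, tzinfo=timezone.utc))
```

The first two helpers cover the point-in-time and audit scenarios, and the statement built by `rollback_sql` can be passed to `spark.sql(...)` for the rollback scenario.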
As more & more data piles up in an Apache Iceberg table, the number of data files grows, and with it the metadata stored in the manifest files. Many small data files make queries less efficient, because the time spent opening files increases. Hence, there is a need for ‘compacting’ the tables.
🧊 The best part about Iceberg is that data compaction is supported out of the box & you can choose between rewrite strategies such as ‘bin-packing’ or ‘sorting’ to optimize file layout and size.
The snippet below shows an example.
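A minimal sketch, assuming you run compaction through Iceberg’s `rewrite_data_files` Spark procedure (which needs the Iceberg SQL extensions enabled). The helper name `compaction_sql`, the catalog `demo`, the table `db.events` and the sort column are hypothetical.

```python
def compaction_sql(catalog, table, strategy="binpack", sort_order=None):
    """Build a CALL statement for Iceberg's rewrite_data_files procedure.

    strategy is 'binpack' (combine small files) or 'sort' (also reorder rows);
    sort_order is required when strategy is 'sort'.
    """
    args = [f"table => '{table}'", f"strategy => '{strategy}'"]
    if strategy == "sort":
        if not sort_order:
            raise ValueError("sort strategy needs a sort_order")
        args.append(f"sort_order => '{sort_order}'")
    return f"CALL {catalog}.system.rewrite_data_files({', '.join(args)})"


# Bin-packing: just combine small files into larger ones.
binpack_stmt = compaction_sql("demo", "db.events")

# Sorting: rewrite files and cluster rows by a (hypothetical) column.
sort_stmt = compaction_sql("demo", "db.events", "sort", "event_time DESC NULLS LAST")

# With a configured session: spark.sql(binpack_stmt)
```

Bin-packing is the cheaper option when you only care about file sizes, while sorting costs more but can also help queries that filter on the sort column.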
Here’s a visual summary of all five features we discussed above.
This brings us to the end of the Features 101 blog. The whole idea behind this was to highlight some of the necessary features that make Apache Iceberg a quintessential table format. In future blogs, we will discuss some of the concepts related to Apache Iceberg & how they are leveraged in real-world use cases.
I also write & share about Data & Analytics (visualization, data infrastructure, Machine Learning) on my socials. Let’s talk 🗣
Icon credits: https://www.flaticon.com/