Apache Hudi (Part 1): History, Getting Started

Dipankar Mazumdar
7 min read · Nov 29, 2023

--

I recently joined Onehouse.ai to contribute to Apache Hudi and work on advocacy efforts, helping engineering teams build and scale robust data platforms. Most of my experience in the data lake storage space has been with another table format, Apache Iceberg. This blog series is an effort to jot down everything I learn about Apache Hudi as I navigate this new journey. I hope it will serve as a guide for people getting started with Hudi.

Before we get into the whats and hows, let’s understand the motivation and a bit of the background that led to the inception of Hudi.

Hudi (pronounced “Hoodie”) emerged from Uber Engineering in 2017 as their ‘incremental processing’ framework on Hadoop. At that point, the term ‘lakehouse’ had not yet gained prominence. Uber was dealing with genuine real-time challenges around data freshness, such as weather and traffic data influencing ride prices, so building a system that could power all business-critical data pipelines at low latency and high efficiency was a non-trivial task.

Obstacles & Motivation

As you would expect from any large organization running analytics at scale, Uber had to go through a couple of ‘critical’ phases in their journey before finally building a system that could serve as the backbone for their long-term analytical requirements (ML, experiments, City Ops).

With the 1st generation, the focus was on aggregating all of Uber’s data in one centralized repository and offering a SQL interface for different stakeholders to access the data. A data warehouse was the obvious first step at that point. However, operating a warehouse to keep up with their scalability needs was getting expensive. The 2nd generation involved re-architecting the entire platform around a Hadoop data lake, which served as the storage for all of their raw data, with Parquet as the file format for better compression and statistics for query planning. While this model made the data lake the central source of truth for all their analytical data, it introduced some new problems:

  • Small file problem: many small data files were being written to HDFS due to increased ingestion jobs, ad hoc batch jobs, etc.
  • Data latency: fresh data availability was limited to once every 24 hours, which is impractical for real-time decision-making.
  • Rewriting entire tables/partitions: even though just over 100 GB of new data arrived per table each day, every run of the ingestion job had to rewrite the entire dataset, which exceeded 100 TB for a given table.
  • Issue with snapshot-based ingestion: both new data ingestion and modeling of the related derived table involved creating new snapshots of the entire dataset, leading to prolonged execution times (taking 20+ hours with over 1,000 Spark executors).
  • Updates/Deletes: the snapshot-based approach refreshed the entire dataset every 24 hours, but the business needed more real-time, incremental updates, and supporting update and delete operations was challenging due to the limitations of HDFS and Parquet.

These problems set the requirements for a new generation of data platform at Uber, with Apache Hudi as the centerpiece, serving as an abstraction layer on top of file formats like Parquet and data lake storage such as HDFS.

Uber’s Big Data Platform — Gen 3 (Credits: https://www.uber.com/blog/uber-big-data-platform/)

Hudi addressed the aforementioned problems while providing a solid foundation to scale to the growing analytical needs within the organization.

Today, Hudi has evolved into a streaming data lake platform that brings database fundamentals to the world of data lakes, with its core aligned to streaming primitives (faster upserts & change streams). Here are some highlights of Hudi:

  • Transactions such as updates, inserts, and deletes on existing data file formats with ACID guarantees
  • A move from a snapshot-based to an incremental processing model, which significantly reduces data latency
  • Incremental pulls of ‘only changed’ data, which make query scans cheaper and enable incremental updates to derived tables (see the sketch after this list)
  • Various table optimization services such as cleaning, compaction, clustering, and Z-ordering (multi-dimensional clustering)
  • Multi-modal indexes — simple, HBase, Bloom filter, and bucket — for efficient upserts
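
To make the upsert and incremental-pull primitives a bit more concrete, here is a minimal PySpark sketch loosely based on the Hudi Spark quick-start. The table name, path, and schema are made up for illustration, and the exact option keys and required Spark configs can vary across Hudi versions, so treat this as a sketch rather than a copy-paste recipe.

```python
from pyspark.sql import SparkSession

# Assumes the Hudi Spark bundle for your Spark version is on the classpath,
# as described in the quick-start guide.
spark = (
    SparkSession.builder
    .appName("hudi-quickstart")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

base_path = "file:///tmp/hudi_trips"  # hypothetical storage location
hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.partitionpath.field": "city",
    "hoodie.datasource.write.operation": "upsert",
}

# Upsert: rows with an existing trip_id are updated in place, new ones are
# inserted. (Use mode("overwrite") for the very first write to initialize the table.)
updates = spark.createDataFrame(
    [("t1", "2023-11-29 10:00:00", "sf", 25.0)],
    ["trip_id", "ts", "city", "fare"],
)
updates.write.format("hudi").options(**hudi_options).mode("append").save(base_path)

# Incremental query: pull only the records that changed after a given commit
# instant, instead of rescanning the entire table.
incremental = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20231129000000")  # example instant
    .load(base_path)
)
incremental.show()
```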

So, now that we are past the motivation and history, let’s tackle — ‘Getting Started with Hudi’.

Since this learning journey is new for me, I wanted to take a more deliberate approach that will serve as a foundation for my future work. This should also benefit developers in the community who are in the same boat or looking to get started with Apache Hudi. With that in mind, I started a concept map to lay out all the elements one might need to navigate Hudi.

Apache Hudi fundamentals

What I have realized is that this comes down to four main pillars:

  1. Concepts: the core theoretical foundations of Hudi, such as the timeline, table types, indexes, and the metadata table. These concepts are essential to understanding how Hudi works and should be the first thing in one’s learning path.
  2. Services: various data and table services that make Hudi a ‘data lake platform’. Understanding some essential services like compaction and clustering that help optimize the data & file layout for better performance is critical. This would potentially be the next learning area.
  3. Practical: once we have a decent understanding of the theoretical concepts, the natural next step is to expand that knowledge in a more hands-on form. Hudi, being an open table format, works with various compute engines such as Presto, Trino, Spark, and Flink, among others. The Hudi docs site has excellent quick-start guides for Spark (SQL, DataSource — Python, Scala) and Flink (SQL, DataStream API); a minimal Spark SQL sketch follows this list. There is also a Docker demo that presents a real-world example to show how Hudi works end to end (micro-batch ingestion from Kafka, syncing with Hive, running incremental queries on Hudi, etc.).
  4. Use Cases: the final learning pillar is the application part. This is where you combine everything (theory + practical aspects) and apply them to real-world business requirements with Hudi as the data lake/lakehouse platform. Some common use cases are — near real-time ingestion, incremental processing, unified batch & streaming architecture, open table format (lakehouse), etc.
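
To give a taste of the ‘Practical’ pillar, here is a minimal Spark SQL sketch modeled on the official quick-start. The table name and columns are hypothetical, and the exact table properties and required packages depend on your Spark and Hudi versions, so consider this a sketch of the workflow rather than the definitive setup.

```python
# Run inside a Spark session that has the Hudi Spark bundle and SQL extensions
# configured, as described in the quick-start guide.

# Create a copy-on-write Hudi table with a primary key and a precombine field.
spark.sql("""
  CREATE TABLE IF NOT EXISTS hudi_trips (
    trip_id STRING,
    ts TIMESTAMP,
    city STRING,
    fare DOUBLE
  ) USING hudi
  TBLPROPERTIES (type = 'cow', primaryKey = 'trip_id', preCombineField = 'ts')
""")

# Insert a row...
spark.sql("INSERT INTO hudi_trips VALUES ('t1', timestamp'2023-11-29 10:00:00', 'sf', 25.0)")

# ...and update it; Hudi treats this as an upsert under the hood.
spark.sql("UPDATE hudi_trips SET fare = 30.0 WHERE trip_id = 't1'")

spark.sql("SELECT trip_id, city, fare FROM hudi_trips").show()
```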

These four learning pillars are expected to cover some of the following questions:

  • What is Hudi — its features and capabilities?
  • How does Hudi work (architecture)?
  • How to get hands-on with Hudi using compute engines like Spark/Flink?
  • How is Hudi used in the industry?

This concept map is a starting point for structuring the Hudi learning path, and I expect it to keep evolving based on my own learning experience and feedback from the Hudi community & contributors.

Finally, two things are going to be critical in this learning journey of ours: content and community.

Content

The Hudi official site does a good job of structuring various forms of content such as documentation, learning materials, quick-start guides, concepts, FAQs, and use cases. This site has been my first contact point for understanding the nitty-gritty around Hudi, its capabilities, and how-to’s. The blog section also does a great job of collating helpful write-ups from the community in one place. You can also learn about the project’s roadmap (something I always look for) and get a link to the version-specific downloads on the site.

Apache Hudi Official Site

Community/Help

Most of the things I have learned about a particular technology (esp. OSS) have been through community involvement. I highly recommend getting involved in any form. In most cases, it starts with being in a learner’s shoes and then gradually sharing those learnings or helping others with their pain points. Here are some ways to be a part of the Hudi community — the dev mailing list for development discussions, quick help via Slack, and RFC proposals for extensive features & changes. There are also regular community syncs for Hudi developers and users to interact, including monthly community calls and office hours.

Apache Hudi Slack for Community

In the next part, I will explore some of the core concepts of Hudi’s architecture from the 1st learning pillar with the idea of distilling the complex technical concepts into easily understandable forms (think lots of graphics). Stay tuned!

If you want to hear more about Hudi, I also write regularly on my socials — (LinkedIn, Twitter).

--

Dipankar Mazumdar

Dipankar is currently a Staff Data Engineering Advocate at Onehouse.ai, where he focuses on open source projects in the data lakehouse space.