New Job: Apache Hudi, Iceberg, what lies ahead?

Dipankar Mazumdar
Nov 23, 2023

Today marks the end of my third week at Onehouse.ai, and these initial weeks have been nothing short of exhilarating. As I navigate through this new journey of mine, I believe it is essential to reflect on a few important elements that led me to this incredible opportunity and to contemplate what the future holds.

My career trajectory over the years has been an interesting one. I started as a Software Engineer. Then, with the advent of ‘Big Data’ and my growing interest in it, I gradually switched gears toward data-specific roles such as Data Visualization/BI, Data Science, and ML research, before exploring a ‘sorta’ new function in the data world: Developer Relations (DevRel).

Career Path over the years leading to DevRel

DevRel, Lakehouses

When I took up my first DevRel role with Qlik in January 2021, I immediately fell in love with the work. The truth is, I had just learned about this new line of work before I applied. I wrote a bit about the time I spent understanding this new function and what helped me here. Although I had very little knowledge of DevRel then, I knew I wanted to pursue it rigorously because of the things that resonated with me — community, open-source & simplifying technology. These three aspects have been the bedrock of my career path, guiding most of my decisions.

In 2022, I joined Dremio and was incredibly lucky to have had the opportunity to work on critical open-source projects such as Apache Iceberg and Arrow. Apache Arrow was already an ecosystem by then and had a well-established community. However, Iceberg presented a massive opportunity in terms of advocacy efforts. By the time I joined, there was some awareness around Iceberg through organizations sharing their implementation journey or tech talks by creators & committers. But we wanted to double down on these efforts. Dremio was an early proponent of Iceberg and was one of the organizations producing quality technical content around this technology. This provided an edge!

Over the past 1.6 years, I built a strong bond with the data engineering community, helping (and learning from) both beginner and experienced engineers who build data platforms using Iceberg and the related OSS stack. Contributions in a DevRel role usually span a wide array of things. I personally enjoyed creating and growing the Iceberg X community to 1K+ people, co-authoring 6–7 chapters for the O’Reilly book on Iceberg, building projects that show how to run analytics on top of table formats, and writing technical blogs and posts.

Here is a glimpse of a few things I managed to ship that hopefully have added some value to the community.

LinkedIn post

What next?

This brings me to what I will be focusing on next — Hi Apache Hudi & OneTable 👋

The last couple of years of working with open table formats have convinced me that a composable lakehouse architecture is the future of running analytical workloads such as BI, ML, and streaming (with flexibility). Some of the reasons that have helped organizations adopt this architecture are:

  • Open data architecture: the flexibility to store data in ‘open’ table formats such as Hudi, Iceberg, and Delta Lake, and the freedom for users to bring the right compute engine for the right workload (ad hoc SQL, stream processing, BI).
  • Reduced costs: the ability to scale storage cheaply on cloud object stores (e.g., S3) and to run workloads such as ingestion and data preparation simultaneously on low-cost compute, instead of running them as expensive workloads on a data warehouse.
  • Best of both worlds: a lakehouse brings all the amazing data management capabilities (compaction, clustering, clean-up services) from the data warehousing world but also allows users to take advantage of the low costs of data lakes and deal with any type of workload (relational, non-relational). This really brings out the best of both platforms.

Returning to the three things that serve as a foundation for me, let me share some POV and what excites me.

Technology

Apache Hudi has been a pioneer in the open table format space. It came out of Uber Engineering in 2017 as their ‘incremental processing’ framework on Hadoop. At that point, the term ‘lakehouse’ wasn’t really a thing! They were dealing with some legit real-time data challenges (e.g., weather or traffic influencing ride prices), so they needed a system that could power all business-critical data pipelines at low latency and high efficiency, which was non-trivial to build.

Hudi has stood the test of time and has today evolved to be a solid foundation for many organizations, such as ByteDance (TikTok’s recommendation system), Walmart (store transactions), Amazon (real-time event analytics for deliveries), Disney+Hotstar (real-time advertising for 20M+ viewers) & more. There is no doubt that Hudi will continue to see similar adoption in the future. Having worked with another table format in the same space, I understand where Hudi can fill in the gaps and the value that brings.

Community

I absolutely resonate with one of the Apache Software Foundation’s core principles: ‘Community Over Code’.

Community Over Code, Halifax, Canada (Apache)

An open-source project is much more than the technology itself. It is about the people and community that build the technology and support fellow developers in their journey. Hudi is one such project, with diverse people (and orgs) contributing to the code, documentation, and everything else, making the developer experience smooth and pleasant. We can see the diversity of the project in the chart below, with developers from Tencent, Uber, ByteDance, Onehouse & many more.

Hudi PR Creators, Credits: OSSInsight.io

This diversity helps keep decisions fair and inclusive, creating an environment where innovation can thrive without bias. I found pretty similar vibes in the Apache Iceberg community, which was one reason I enjoyed my work every day. I hope for the same with the Hudi community.

Open Source Data

Well, there is no doubt that now, more than ever, we are realizing the importance of open source in data, especially in data storage and infrastructure. I touched upon some of the reasons for that above, but one crucial factor is storing data as an ‘independent tier’ in open table and file formats such as Hudi, Iceberg, Parquet, Arrow, etc. This avoids locking us into a proprietary system and lets us use any compute engine of our choice for specific use cases. This flexibility to mix and match suitable components to form a robust data stack has been extremely valuable for customers.

Open Source landscape, Credit: https://mad.firstmark.com/

To end, an essential aspect of open-source software is its openness and its ‘standard’. Having standards defined in the open enables everyone to align with them. It also keeps a project true to its design choices and objectives, so the project can evolve according to the expectations set, which leads to more adoption.

What’s my role in all of this?

I joined Onehouse.ai as a Staff Developer Advocate to continue working on pivotal open-source projects such as Apache Hudi and OneTable, and to help the engineering community build robust data platforms and resolve some of the friction along the way.

Scale by the Bay — 1st talk about OneTable

While I am intrigued by the capabilities of Hudi, I am also excited about the next phase for lakehouses, which is interoperability. The decision to choose a table format is critical since it is the backbone of any data architecture. In my experience, most of it comes down to feature-level comparisons, the complexity of implementation, and support by specific query engines, i.e., the ecosystem. Also, with newer workloads and use cases among organizations operating lakehouses, it has become evident that these table formats must be ‘interoperable’.

OneTable is a recent open-source project with contributions from Google, Microsoft, and Onehouse that aims to make table formats interoperable omnidirectionally. I am really excited to bring this to the community and work toward a common goal. There is a lot to do here, but I would love any initial feedback and ideas (Join Discussion).

Beyond these considerations, I am driven to advance in my DevRel career. I have been lucky to have worked in various capacities (being the 1st DevRel, setting visions, executing critical strategies, etc.). Now, I want to take all these experiences and apply them in a Staff capacity.

Join me!

I have started delving more into the internals of Apache Hudi and OneTable. My aim is to simplify complex technological concepts with a more practical and distilled learning approach. To this end, I have started a ‘Hudi Concept’ series on LinkedIn & Twitter. If you have been following my work for a while, you know I love breaking things down into their canonical form. This series will also be turned into blog posts that explore things in more detail.

Hudi Concept Series: Timeline, File Layout

Finally, I look forward to having you join me as I explore more of the data architecture space. Occasionally, I intend to share insights derived from my experience with different table formats (especially Iceberg). The goal is not to conduct competitive analyses but to provide technical perspectives that help engineers make informed decisions, given the diverse requirements of each organization. Stay tuned for more updates!


Dipankar Mazumdar

Dipankar is currently a Staff Data Engineering Advocate at Onehouse.ai where he focuses on open source projects in the data lakehouse space.