Building a Tableau dashboard directly on the data lake with Dremio

Dipankar Mazumdar
Aug 8, 2022 · 6 min read

As a BI engineer, you are tasked with developing a new dashboard for the revenue unit of your organization. You have all the wireframes defined, you have discussed the key metrics with the business stakeholders, and all you need now is access to the data to start working on it.

Traditionally, your organization’s data flow architecture looks something like this.

A hypothetical data-to-reporting scenario

Your company’s data first lands in a cloud data lake such as Amazon S3 (represented at the bottom of the diagram). Then you submit a ticket for the data engineering team to help move the data for your reporting. Depending on their workload, they move the data via ETL pipelines into some sort of data warehouse. You need access to multiple databases within the data warehouse, so you make another request to get that data. And finally, a data copy request is made for the business unit specific to your analysis.

Now imagine you have to run your analyses on a dataset with millions of records. You also have to consider the performance of the dashboard, so maybe you create some cubes or extracts depending on the BI tool you plan to use, resulting in additional data copies. This is not an unusual way of working; in fact, a lot of my own experience has been quite similar.

Do we see any problems with this approach? 💯%

Here are some non-trivial ones that are relevant to this blog.

  • Data requests take a long time: the turnaround depends on the availability and workload of the engineering team that builds the ETL/ELT pipelines. This ultimately hurts the organization’s time-to-insight, and by the time the data arrives, the analysis may no longer be relevant.
  • Lots of data copies: While the initial idea of storing data centrally in a data lake and making copies to move relevant data into a warehouse & upstream BI tools made sense, it resulted in many unmanageable copies of data. We end up suffering from issues such as data drift, KPI drift, etc.
  • Broken ETL pipelines: With so much data movement among systems, ensuring every ETL pipeline runs smoothly and follows good software engineering practices is hard. No one likes an overnight ETL job failure impacting all their reports & models.
  • Data in closed formats: When data is moved from data lake storage into a data warehouse, you are essentially locking it into a vendor-specific format & you will mostly need the same vendor’s engine to process it. While this may sound fine for some orgs, with the changing needs of a fast-moving company, it can be a limitation. Also, how about future use cases such as ML?

There can be a different way of working too…

Well, it’s hard to say that you can change the traditional way of working overnight and have no ETL pipelines and no data copies. But there is a new approach that brings some significant advantages.

So, your organization stores all of its data in a cloud data lake. Why not use a BI tool like Tableau directly on the data lake? The questions are:

  • Can we do that, and if so, how?
  • What about the performance?

This is where Dremio comes in: it sits between the data stored in the data lake and the client engine (shown in the image above). It is a query engine for open data platforms that provides data-warehouse-level performance and capabilities on the data lake. In addition, Dremio connects natively to BI tools such as Tableau so they can operate directly on the data lake in live query mode. Most importantly, it is super fast 🚀 and comes with a free standard cloud edition.

Let us walk through a quick tutorial to show how fast it is to build a dashboard in Tableau using the data stored in our data lake.

  • Sign up for Dremio and get started. For details, watch this video.
  • Load the dataset into Amazon S3. For this tutorial, I created a dataset based on the data from Kaggle’s H&M Personalized Fashion Recommendations competition, stored in an open file format, Parquet (see the short code sketch further below for one way to do this step programmatically).
  • Now let’s access the dataset from Dremio. You can see all your data sources in the ‘sources’ section (highlighted below). The data lake ‘test’ is basically my S3 source where the data resides. On the right, you can see our Parquet file.
  • Let’s click on this data file. Dremio presents an interface where you can preview the dataset, run queries, join datasets, create new fields, etc.

As seen in the snippet above, I get a preview of my dataset & all the available options.
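
As a quick aside, here is a minimal sketch of how a Parquet file like this might be prepared and loaded into S3 programmatically. The CSV name, bucket, and object key are purely illustrative assumptions; the walkthrough only requires that the Parquet file already sits in your S3 source.

```python
# Minimal sketch: convert a raw Kaggle CSV to Parquet and upload it to S3.
# The file name, bucket, and key below are illustrative assumptions; adjust
# them to your own dataset and AWS setup. Requires pandas, pyarrow, boto3.
import pandas as pd
import boto3

# Read the raw CSV (hypothetical file name from the Kaggle download).
df = pd.read_csv("hm_transactions.csv")

# Write it out as Parquet, the open columnar format Dremio queries directly.
df.to_parquet("hm_transactions.parquet", index=False)

# Upload the Parquet file to the S3 bucket behind the 'test' data lake source.
s3 = boto3.client("s3")
s3.upload_file("hm_transactions.parquet", "my-datalake-bucket", "hm/hm_transactions.parquet")
```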

  • To build the dashboard in Tableau and run your analysis on the data lake’s data, Dremio lets you connect directly from its UI. Click the Tableau icon, as seen below.

This action will download a small Tableau data source file (.tds). Note that this file doesn’t contain any data, but rather the information necessary to connect to the actual data. In our context, it helps us authenticate with Dremio and provides a seamless experience.

  • Once we enter our credentials, the connection is established and we are ready to build our dashboard in Tableau.

To get an idea of the total number of records in this dataset, I dropped in the ‘Migrated data (Count)’ field and can see that we have close to 32M records.

  • Now, I want to analyze the average price per product group in this H&M dataset. So, I will do an ‘AVG(Price)’ by ‘Product Group Name’.

The critical thing to note is that every analysis you run in Tableau is actually a live query back to Dremio. Recall that Dremio sits between the data lake & the client. It takes the query for this multi-million-row dataset, processes it, and returns the result to Tableau in sub-second time. The gif above illustrates how fast it is to do this directly on the data lake. Dremio achieves this sub-second performance using a feature called Data Reflections.
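
For readers who want to see what such a live query looks like outside of Tableau, here is a rough sketch of issuing the same aggregation to Dremio from Python over Arrow Flight. The endpoint, personal access token, and table path are illustrative assumptions; in the walkthrough, Tableau generates and sends the equivalent SQL for you automatically.

```python
# Rough sketch: run the same AVG(price)-by-product-group query against Dremio
# over Arrow Flight. The endpoint, token, and table path are illustrative
# assumptions, not values from the walkthrough. Requires pyarrow.
from pyarrow import flight

TOKEN = "<dremio-personal-access-token>"  # hypothetical credential
client = flight.FlightClient("grpc+tls://data.dremio.cloud:443")  # assumed Dremio Cloud endpoint
options = flight.FlightCallOptions(
    headers=[(b"authorization", f"Bearer {TOKEN}".encode())]
)

sql = """
    SELECT product_group_name, AVG(price) AS avg_price
    FROM test."hm_transactions.parquet"   -- hypothetical source/table path
    GROUP BY product_group_name
"""

# Ask Dremio to plan the query, then stream the result back as Arrow record batches.
info = client.get_flight_info(flight.FlightDescriptor.for_command(sql), options)
table = client.do_get(info.endpoints[0].ticket, options).read_all()
print(table.to_pandas())
```

Whether the query comes from Tableau or from a script like this, Dremio can transparently accelerate it with the same reflections.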

  • To understand what query was run & what happened underneath, Dremio keeps a list of all the jobs on the ‘Jobs’ page (shown below).
  • Now, if you open one of the jobs, you will be presented with details such as the query executed, a summary of the run, accelerated queries (reflections), etc., as seen below.
  • After adding some more analysis, our final dashboard looks like the one below.

Dremio provides a new way to deal with some of the challenges discussed above in traditional data architectures. It mitigates the wait time for analysts and data scientists, helps reduce data copies (where you can), and provides amazing performance, all while embracing open data formats so the same data can be leveraged by multiple engines depending on the use case.
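
To make the open-format point a bit more concrete, here is a small sketch of a second engine reading the very same Parquet file straight from S3, with no export from Dremio or vendor-specific copy involved. The bucket and key are the same hypothetical ones used in the upload sketch earlier.

```python
# Small sketch: another engine (plain Arrow, in this case) reading the same
# Parquet data straight from S3 -- no export or vendor-specific copy needed.
# The bucket and key are the same hypothetical ones used earlier.
import pyarrow.dataset as ds

dataset = ds.dataset(
    "s3://my-datalake-bucket/hm/hm_transactions.parquet", format="parquet"
)

# Recompute the same average price per product group locally with Arrow.
table = dataset.to_table(columns=["product_group_name", "price"])
print(
    table.group_by("product_group_name")
         .aggregate([("price", "mean")])
         .to_pandas()
)
```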

If you are interested in getting a hands-on experience connecting Dremio to Tableau to explore data and build interactive dashboards directly on the AWS data lake, join us for the AWS Dev Day Chicago on August 24th. You’ll need to register — seats are limited.

Lunch, swag & more!

Interested in data-related discussions? Connect with me on LinkedIn & Twitter.


Dipankar Mazumdar

Dipankar is currently a Staff Data Engineering Advocate at Onehouse.ai where he focuses on open source projects in the data lakehouse space.