Manifesto of Internal Data Densification

Sep 5, 2019

Klaus Schulte

*by Klaus Schulte & Rosario Gauna*

Data Densification – you will have already come across this topic, when you have gone beyond ‘Show Me’ in Tableau:

- You’ve added curvy aspects to your designs with curved lines or polygons,
- you’ve created radial bar charts, or
- you’ve created other visualizations, where the original data was simply not enough, to let your design idea come to life.

For all these mentioned and unmentioned use cases, you have used **Data Densification** to enrich the original data by adding extra points.

There are a lot of good blogs around describing the techniques that can be used for Data Densification. We can highly recommend for example Ken Flerlage’s video “An Introduction to Data Densification” as part of the #ThinkDataTuesday campaign by Tableau and the accompanied blogpost. Tableau greats like Jeffrey Shaffer, Jonathan Drummey, and Joe Mako have also talked and written about the topic.

In Ludovic Tavernier’s and my (Klaus) session Surprise Me – Creating Advanced and Unique Charts in Tableau at Tableau Conference Europe in Berlin earlier this year, Ludovic explained the concept of Data Densification with this slide:

- Let’s assume, the original data had only one record.
- Now you have your design idea to visualize the data coming from this record as a rectangle. You cannot draw a rectangle with only one point, you need four points.
- When you want to draw a curve, you need even more than four points, for a sigmoid curve some authors recommend for example 49 points. For shorter or longer curves, you may need less or more than 49 points.
- So, for both design ideas, you have to
**add extra points**.

- There are two techniques to add extra points, for both techniques, a set of records has to be created outside Tableau in a so-called
**model dataset**. - The first technique uses
**pre-densified data**in the model dataset. That means, the model dataset includes as many rows as needed for your design (e. g. four rows for a rectangle or 49(ish) rows for a curve). - The second technique uses
**bins to artificially densify**the data. To create the bins in Tableau at a later stage, it needs only two rows to define the range in the model dataset, wherein the bins will be calculated (e. g. 1 & 4 for the rectangle and 1 & 49 for the curve) later. - For both techniques, the
**Cartesian product**has to be built using a common link or a join calculation. - That means we have
**multiple copies**of the data for the first technique (e. g. 49 copies for curvy designs). For the range & bins-technique the**data has doubled**.

Both techniques perform Data Densification on **row level**. The main advantage is that you could drill down to row level in your visualization and the design would still work. The biggest **disadvantage** is, obviously, that you have to duplicate your data. When you already have a ton of data, you probably don’t want to do this.

*But: Do we really have to? Let’s try something different…*

The common techniques of Data Densification allow users to drill down to **row level**. There are for sure use cases, when you visualize row level data, for example when the data is pre-aggregated. We have seen this for example in many #makeovermonday datasets.

But in many, if not even in most (real world) visualizations, we visualize **aggregations**, often based on transactional data, like practiced many times with Tableau Superstore. And this is the key to a technique, that we have named **Internal Data Densification**.

While Data Densification densifies and multiplies the data by a factor **n** using a model dataset, Internal Data Densification creates a **densificator** from **the original data** and splits up the data into **n** partitions (these partitions do not necessarily have the same number of records). E. g. for n=2 we would double the data with Data Densification and build two (smaller) partitions with Internal Data Densification.

This can be done based on any dimension or measure in the original data, that has at least two distinct (for doubling the data) members, and that’s why this technique only works for **aggregations**.

This is **important**: Internal Data Densification is based on the assumption, that you always have enough members for the desired number of “copies” in your densification dimension for the **level of detail in your view**.

Let’s come back to this assumption after introducing the **techniques**.

- Use
**IF- or CASE-Statements**for qualitative dimensions like names or geographical data. - Use
**Modulus**for date dimensions like minutes, hours, days, weeks, or months and for any integer dimension or measure.*(With the ASCII-function you can translate every name into an integer to be able to use Modulus as well.)* - With Data Densification, each copy of the data is identical. Therefore, you can for example sum up averages. You can’t do this with Internal Data Densification. To get back the aggregation we’ve split up into our different partitions, we can use
**LODs.**We come back to this later when we will show some use cases for this technique.

Let’s give an example with Tableau Superstore data and try to “**double**” the data with Internal Data Densification. The visualization we have in mind is to show **category sales 2018 by month**.

**EXAMPLE 1**

Using [Region] as densificator with an IF-Statement:

[Region_Dens] = IF [Region]="West" THEN 1 ELSE 2 END

Creating a view with MONTH(OrderDate) on columns, Category on rows and COUNTD(Region_dens) on label shows us, whether this densificator is valid:

Yes, we have two distinct values (1, and 2) in each period for each category.

Would it also work for Sub-Category?

No, we have periods with only one value (e. g. region West *or* one of the other regions) and some periods without any sales.

**EXAMPLE 2**

Using [Order Date] as densificator with Modulus:

[OrderDate_Dens] = DATEPART('day',[Order Date]) % 2

Modulus calculates a **remainder**. In this case, the formula above is calculating a “0” for even days (e. g. 8 = **2** x 4, remainder = **0**) and a “1” for odd days (e. g. 7 = **2** x 3, remainder = **1**).

You can use Modulus with any other dividend to define your densificator.

When we look again at MONTH(OrderDate) on columns, category on rows and COUNTD(OrderDate_dens) on label, we can see that [OrderDate_Dens] is also a valid densificator.

**EXAMPLE 3**

What if we needed **four copies** of the data? We can again use [Order Date] as our densificator:

[OrderDate_Dens_4x] = STR(DATEPART('day',[Order Date]) % 2)+"-" +STR(DATEPART('week',[Order Date]) % 2)

This formula is combining a “0” for even and a “1” for odd **days** with a “0” for even and a “1” for odd **weeks** and is giving us four possible combinations (0-0,0-1,1-0,1-1).

Is it valid? Yes, it is!

**EXAMPLE 4**

For the visualization we have in mind here (**category sales 2018 by month**) the number of “copies” you can create with this data is limited. We managed to create 10 copies by combining two densificators, but not more.

Superstore is not the perfect dataset for this technique. Perfect would be for example production data, where we **have logs that come in regularly** like every minute or every 10 seconds or something like that.

Trying modulus with data from my (Klaus) solar power plant (where the data comes in every 5 minutes), we managed to put together 3,456 valid combinations within minutes using modulus with days, hours and minutes on MONTH(Time)- and “Dimension”-level.

Enough “copies” to draw some nice curves 😎.

- Internal Data Densification
**creates a densificator from the original data**. This can be done based on any dimension or measure in the original data, that has at least two distinct members. - The technique performs Data Densification on
**aggregation level**. - In order to use the technique, it must always be
**ensured that there are enough members for the desired number of “copies”**in the densification dimension. Therefore, we highly recommend**densificability checks**(*this term was made up by Ken Flerlage*).

Ok, let’s head over to some use cases to demonstrate the capabilities of this technique. And these are massive!

For our first use case, we want to create **a Sigmoid-Style Line Chart**. You can of course argue the purpose of such a chart – but hey, this is just to demonstrate the capabilities of Internal Data Densification.

To create this, we are leveraging Internal Data Densification in combination with range & bins-technique. To do this, we need one extra “copy” of our data.

We want to show **category profits for all years.** Let’s check first if there is a valid densificator and try again Modulus with day of Order Date:

[Densificator] = DATEPART('day',[Order Date]) % 2

Yes, we have a valid densificator!

For the range & bins-technique we have to set the range of 49 first. This is the number of points we want to use for our sigmoid-style curve. We call this calculated field **path**…

[Path] = [Densificator] * 48

…and this gives us either a 0 or a 48. When we now create a bin and set the bin size to 1, we should get our 49 points.

Let’s check: Yes, 49 marks (you have to select “Show missing values” to get this)!

We have split up our data into two parts, therefore we have to redefine our Profit using a LOD:

[Profit LOD] = { FIXED MONTH([Order Date]),[Category]:SUM([Profit])}

We want to select YEAR(Order Date) as a filter at a later point. Make sure to set the filter as a context filter, otherwise you would have to add YEAR(Order Date) to this Fixed-LOD.

You can now use [Path] and [Profit LOD] and follow for example this blog by Toan Hoang to create curvy line charts.

Check out the curvy lines using this technique on Tableau Public.

In this exercise, I (Rosario) used the technique to create a radial chart with **weekly sales in 2018**.

In this viz, we want to show **weekly sales** and two data points are required per week. Let’s again check first if there is a valid densificator and try Modulus, but now with the **ASCII value** of the first character of the Product Name field:

[Densificator] = ASCII([Product Name]) % 2

Yes, we get a valid densificator.

We have split up our data into **two partitions**, therefore we have to redefine our weekly sales with the following LOD:

[Weekly Sales] = {FIXED [Week]: SUM([Sales])}

In this exercise, using the technique allows each of the partitions to use different values for the radius:

[Radius] = 0.15+(ATTR([Data_Dens]) * (sum([Weekly Sales] / {FIXED: MAX([Weekly Sales])}))

Resulting in:

- When densificator equal 0: [Radius] = 0.15
- When densificator equal 1: [Radius] = 0.15 + (sum([Weekly Sales] / {FIXED: MAX([Weekly Sales])}))

Play with the interactive version or download the workbook here.

Just a few weeks ago, I (Klaus) wrote about how to encode change in line charts.

The blogpost describes how the data was tripled to achieve this. Again, let’s try it with Internal Data Densification.

In this viz, we want to show **category sales for all years.** Let’s check first if there is a valid densificator and try again Modulus with day of Order Date:

No, we have some months with only two distinct remainders and therefore the densificator** isn’t valid**.

What about this:

[Densificator] = int(Sales) % 3

Yes, with Modulus of integer Sales, we get a valid one.

We have split up our data into three partitions, therefore we have to redefine our Sales with the following LOD:

`[Sales LOD] = { FIXED YEAR([Order Date]), MONTH([Order Date]),
[Category]: SUM([Sales])}`

Here, we’ve included YEAR(Order Date) in our LOD and can take year and category just as a dimension filter.

The rest is just finding and replacing both the old densificator (“Table Name”) and the old measure (Sales) with our new ones. That’s it!

You can find and download the updated workbook on Tableau Public.

In this exercise, I (Rosario) used the technique to **avoid pivoting** data with three measures (Low, High and Price) contained in each daily record by stock. Thus, I could simplify the necessary calculations.

The challenge in question was launched by the unique and incredible Andy Kriebel and corresponds to the 45th week of 2017 of #WorkoutWednesday. In that week, he challenged us to build a beautiful display style of a portfolio of stocks, an idea originally coming from Jeffrey Shaffer. This exercise was the first time where I experienced the benefits of using the Internal Data Densification technique. Below I share a summary of the main steps of the solution, applying the technique as I use it today.

In this viz, we want to show the **52-week low** and the **52-week high** of a stock in the last year and the **current price**. Three data points are required per Stock (or Symbol). Let’s again check first if there is a valid densificator:

[Marks_Dens] = CASE DATEPART(‘day’, [Date]) % 3 WHEN 0 THEN “Low” WHEN 1 THEN “High” WHEN 2 THEN “Close” END

Yes, we get a valid densificator.

We have split up our data into **three partitions**, therefore we get the value of the minimum price and the maximum price per stock in the last year and the current price per stock (the stock, in the table is identified with the name of Symbol) with the following LOD calculations:

[52-Week Low] = {FIXED [Symbol] : MIN([High])}

[52-Week High] = {FIXED [Symbol] : MAX([High])}

[Price] = {FIXED [Symbol]: SUM(IIF([Date] = {FIXED: MAX([Date])}, [Close], NULL))}

Subsequently, for each of the three possible densifier values, associate a value within the x-axis.

[Mark Axis Value] = CASE [Marks_Dens] WHEN "Low" THEN0.004 WHEN "High" THEN0.996 WHEN "Close" THEN ([Price] -[52-Week Low]) / ([52-Week High] -[52-Week Low]) END

Finally, for each of the three possible densifier values, associate a measure with the label you wish to display.

[Mark Label Value] = CASE [Marks_Dens] WHEN "Low" THEN [52-Week Low] WHEN "High" THEN [52-Week High] WHEN "Close" THEN NULL END

In this exercise, using the technique allows us to show for each of the stocks three marks (thanks to densification) and also for each of the marks to use **independent measure values**.

The updated version of the viz can be found here.

We both think, that **Internal Data Densification** is a very powerful technique, that allows you to create any chart in Tableau that would normally require the addition of extra points. We’ve shown four use cases but could have added many more.

Internal Data Densification can be used only under certain **assumptions**. Like described in our use cases, there has to be a **valid densificator**. Furthermore, it **works only on aggregation level** and best with large and granular data sets. Therefore, it is a nice addition to the existing Data Densification techniques, because these data sets are the cases, where Data Densification would ‘hurt’ the most. As mentioned before: when you already have tons of data, you don’t want to multiply it. For small or/and pre-aggregated data sets, you won’t have a problem to multiply your data.

That means, you have to **know your data**! We don’t recommend using this technique with live data, where you cannot be sure about what data comes in. Use it with completed/closed data sets or in cases, where you have full control over extract refreshes. **Make sure that your densificator will work in ANY case!**

That said we hope you enjoyed reading. Please use this technique and share your use cases with us on Twitter (@ProfDrKSchulte and @rosariogaunag) or in the comments below.

Thanks for feedback along the way to Jeffery Shaffer, Ken Flerlage and Ludovic Tavernier.

*Rosario & Klaus*

## COMMENTS