Datasource Based Lookups in Tableau

Mar 11, 2019

Klaus Schulte

a collaborative blog by Ken Flerlage and Klaus Schulte

### The Challenge

Last week at work I (Klaus) puzzled my head over an interesting question. I was looking at production orders and production dates and had to calculate differences between production dates:

• If a production order has consecutive production days (the datediff between the days is 1) then the machine only has to be equipped once.
• If the datediff between two production days is >1, the machine has to be equipped on each production date.

### The Table Calculation-Approach

My first impulse was: Table Calculations! This should be something lookup() should do easily for me!

Adapting my question to superstore data, I’m looking at customers and order dates. I created this lookup-formula, let it restart every customer and was ready to go:

For my business question I calculated only 1s and 0s (if datediff =1 then 0 else 1 end) and was then able to sum up the number of equipment activities for each production order very easily.

The problem with this approach is: you cannot narrow down your time dimension without further ado. Look what happens when, for example, filtering down the years to 2018:

There are no lookup-values for the first orders of each customer in 2018 because I filtered them out of the view. That’s how table calculations work! It is due to Tableau’s order of operations. We have brought in order date as a dimension filter and these are applied before table calculations:

Source: Tableau Help

There are ways to deal with this, for example using another table calculation as a filter. This can be achieved with index() or with lookup(), like very cleverly done and described in this blogpost by Gauthier Bonnot from the Data School.

But with bringing in order date as a measure filter, you lose some of the filter functionalities you have when bringing it in as a dimension filter – for example to choose a certain month with just one click/checkbox.

Dimension filter

Measure filter

That’s why I was in need of a different solution, one without table calculations.

### The SQL LAG Approach

At this point, Ken was in the game. I loved to read Ken’s SQL series and ever since then I’ve had so much fun transforming ETL flows I had formerly used into nice little SQL-Querys.

I still had my credentials for the MSSQL Server Ken provided in his initial SQL post and after some quick googling I discovered the LAG() function and created this query:

```select
[customer name],
[order date],
count([order id]) as orders,
sum([sales]) as sales,
LAG([order date]) OVER (PARTITION BY [customer name] ORDER BY [customer name], [order date]) as LagDate

from orders
group by [customer name], [order date]```

SQL window functions are very similar to Tableau’s table calculations, also allowing to define where to compute the function along (PARTITION BY) and the underlying order (ORDER BY).

This query then allowed me to calculate my date difference (after changing the format from datetime to date) in a very simple calculation, being on ‘day’-level, I didn’t even need to use the datediff-formula.

When I now add this measure to my filtered view the date differences for each first order of 2018 are back!

How nice 🤩, thank you Ken!

Unfortunately the database I used in my project didn’t support the lag()-function and I was back on start 🤨.

### The SQL Self-Join Approach

Being not able to use the lag()-function the alternative approach was to create a row ID in SQL and then self-joining the table on this ID.

In MSSQL this again can be done very easily using the row_number()-function. (My MySQL database unfortunately didn’t allow row_number either, but I eventually found another way to do the same.) Row_number in combination with partition gives me IDs for order dates by customers.

```select
[customer name],
[order date],
count([order id]) as orders,
sum([sales]) as sales,
ROW_NUMBER() OVER (PARTITION BY [customer name] ORDER BY [customer name],[order date]) AS ID

from orders
group by [customer name],[order date]```

Then Ken’s idea was to do the self-join in Tableau. So, I just doubled the query (without orders and sales) and created this join calculation that brought me the exact same result like the SQL LAG-approach.

With the same formula like in the LAG-Approach ([order date]-[LagDate]) we are getting the same results:

## Completing the Picture

Since Klaus and I (Ken) collaborated on this solution, I thought I’d provide some of my own thoughts on this technique. First, of all, let me just say that I love LAG and self-join approaches as they provide you with a lot more flexibility and control than the table calculation approach. In this case, LAG would certainly be my preferred approach (note: there are a variety of other SQL Window Functions that can be equally as powerful as LAG and LEAD). But, as Klaus noted, not every database supports window functions (as discussed in SQL for Tableau: The Basics, these are not part of ANSI standard SQL). So, let’s see if we can come up with some additional alternatives to Klaus’s approach.

### The Brute Force SQL-Approach

First, I’m curious of we can do this using brute force SQL, rather than using Tableau to perform the join. Essentially, we want to take the following SQL, written by Klaus, then join it to itself:

```SELECT
[Customer Name],
[Order Date],
COUNT([Order ID]) AS [Orders],
SUM(Sales) AS [Sales],
ROW_NUMBER() OVER (PARTITION BY [customer name] ORDER BY [Customer Name], [Order Date]) AS ID

FROM [Orders]
GROUP BY [Customer Name], [Order Date]```

But one of the big problems here is that we need to join this entire result to itself. We cannot simply insert a JOIN in its normal location (as part of the FROM clause) because we need the grouping to occur before the join (for more, see SQL for Tableau: Order of Operations). So, we’re going to need to use some subqueries (Guess What! We have a blog for that too! See SQL for Tableau: Subqueries).

Let’s start by duplicating our query:

```SELECT
[Customer Name],
[Order Date],
COUNT([Order ID]) AS [Orders],
SUM(Sales) AS [Sales],
ROW_NUMBER() OVER (PARTITION BY [customer name] ORDER BY [Customer Name], [Order Date]) AS ID

FROM [Orders]
GROUP BY [Customer Name], [Order Date]

SELECT
[Customer Name],
[Order Date],
COUNT([Order ID]) AS [Orders],
SUM(Sales) AS [Sales],
ROW_NUMBER() OVER (PARTITION BY [customer name] ORDER BY [Customer Name], [Order Date]) AS ID

FROM [Orders]
GROUP BY [Customer Name], [Order Date]```

Now let’s turn each of them into a subquery with a join in between. In the following, I’ll highlight the added SQL in bold and red.

```(SELECT
[Customer Name],
[Order Date],
COUNT([Order ID]) AS [Orders],
SUM(Sales) AS [Sales],
ROW_NUMBER() OVER (PARTITION BY [customer name] ORDER BY [Customer Name], [Order Date]) AS ID

FROM [Orders]
GROUP BY [Customer Name], [Order Date]) as [o1]

LEFT JOIN

(SELECT
[Customer Name],
[Order Date],
COUNT([Order ID]) AS [Orders],
SUM(Sales) AS [Sales],
ROW_NUMBER() OVER (PARTITION BY [customer name] ORDER BY [Customer Name], [Order Date]) AS ID

FROM [Orders]
GROUP BY [Customer Name], [Order Date]) as [o2]```

Basically, we are aliasing the subqueries as o1 and o2. Then, as Klaus did earlier, we are performing a left join so we can be sure we don’t drop the first record.

We now need to add in our ON clause.

```(SELECT
[Customer Name],
[Order Date],
COUNT([Order ID]) AS [Orders],
SUM(Sales) AS [Sales],
ROW_NUMBER() OVER (PARTITION BY [customer name] ORDER BY [Customer Name], [Order Date]) AS ID

FROM [Orders]
GROUP BY [Customer Name], [Order Date]) as [o1]

LEFT JOIN

(SELECT
[Customer Name],
[Order Date],
COUNT([Order ID]) AS [Orders],
SUM(Sales) AS [Sales],
ROW_NUMBER() OVER (PARTITION BY [customer name] ORDER BY [Customer Name], [Order Date]) AS ID

FROM [Orders]
GROUP BY [Customer Name], [Order Date]) as [o2]

ON [o1].[customer name] = [o2].[customer name] and [o1].[ID] = [o2].[ID]+1```

So we now have two sub-queries that we are joining together. But we haven’t specified any fields to pull back from this big mess, so let’s do that now:

```SELECT
[o1].[Customer Name],
[o1].[Order Date],
[o1].[Orders],
[o1].[Sales],
[o2].[Order Date] as [LagDate]

FROM

(SELECT
[Customer Name],
[Order Date],
COUNT([Order ID]) AS [Orders],
SUM(Sales) AS [Sales],
ROW_NUMBER() OVER (PARTITION BY [customer name] ORDER BY [Customer Name], [Order Date]) AS ID

FROM [Orders]
GROUP BY [Customer Name], [Order Date]) as [o1]

LEFT JOIN

(SELECT
[Customer Name],
[Order Date],
COUNT([Order ID]) AS [Orders],
SUM(Sales) AS [Sales],
ROW_NUMBER() OVER (PARTITION BY [customer name] ORDER BY [Customer Name], [Order Date]) AS ID

FROM [Orders]
GROUP BY [Customer Name], [Order Date]) as [o2]

ON [o1].[Customer Name] = [o2].[Customer Name] and [o1].[ID] = [o2].[ID]+1```

And that should do it. If we use this custom SQL in Tableau, we’ll get the results we’re looking for:

And, again, we can calculate the difference using  [order date]-[LagDate].

### The Tableau Prep-Approach

Okay, so that’s one method, but that SQL query is pretty ugly and not everyone is adept at SQL, so is there another method we can use that avoids custom SQL altogether? Enter Tableau Prep!

Unfortunately, Tableau Prep does not yet have the ability to create a row number (if you’d like to have this feature, give the following idea a vote: Add Record ID / Index feature in Tableau Prep), so we’re going to have to do something a bit more creative. Let’s start by connecting to our Orders table, then we’ll aggregate Sales and Order Count by Customer Name and Order Date.

Now we’re going to self-join the aggregate to itself, grabbing all previous records. To do that, we’ll first create a new branch (essentially a copy of the aggregated data). Then we’ll join like this:

This will result in a list of each customer’s order dates, along with all of that customer’s previous order dates.

Next, after removing some unnecessary fields and renaming others, we’ll find the maximum Previous Order Date for each Customer Name and Order Date.

This will give us a list of the order dates immediately before the order date we’re looking at.

Next, we’ll left join this back to the previous set of data.

This will eliminate all those extraneous previous date records, leaving us with only one record that contains the customer name, order date, and information from immediately previous order.

Finally, we’ll do some cleanup to remove unnecessary fields, then we’ll output to a hyper data file.

The result should look just like the custom SQL we used earlier.

And it works!

If you’re interested in seeing this Tableau Prep flow, feel free to download it here.

### Conclusion

To wrap up my little piece of this, I will just reiterate how much I love this approach. It’s quite powerful and has lots of use cases. For example, a common need is to compare data from the current and previous month (or year). This can be done in lots of different ways, including LOOKUP table calculations, LODs, etc., but in some cases, the self-join method will work as well. And this method can really help to eliminate some of the pain of table calcs and LODs.

We both hope that you enjoyed reading and find own use cases for this!

Klaus & Ken