Setting up an Automated Data-Pipeline — From the Web to Tableau Public
Dec 15, 2020 · Klaus Schulte
In my work as an analyst and consultant, as well as in my work at university, there are few things more satisfying to me than building automated data pipelines, with an effective and elegant Tableau or Tableau Public visualization as the icing on the cake: set up once – runs for a lifetime.
There are countless ways to achieve the data-pipeline part of this, with applications like Tableau Prep, Alteryx, or Pentaho PDI (the one I specialized in before Tableau Prep became a thing), or with code solutions using R, Python, or any other programming language of your preference. All of these tools and languages can be used to create automated, scheduled processes that will, for example, update databases or Tableau data sources.
However, when it comes to publishing the final dashboard to Tableau Public, we are very limited because there is exactly one data source that can get a daily refresh on Tableau Public: Google Sheets.
While there have already been some blog posts about this topic (like this one by Jacob Olsufka on the Tableau Public blog), I'd like to dig a little deeper in this post and show a no-cost (under certain assumptions) and (almost) no-code solution to scrape data from the web, save it to a database, and bring the data into a Tableau visualization using Google Sheets, Google BigQuery, and, of course, Tableau Public.
For this blog, I’ve set up a brand-new Google account. It takes just a few clicks and you’re ready to go.
Sign up to Google Cloud using the credentials you’ve just created in step 1 and start your free trial.
I’ve signed up as an individual. And don’t be surprised: you will be asked for your credit card information for billing purposes. However, the Google Cloud free trial includes $300 in credit to spend over the next 90 days. Even after this period, 10 GB of storage and 1 TB of queries are free every month. Please check the BigQuery pricing information if you need more than that.
Within the Google Cloud Platform, select APIs & Services from the hamburger menu, then select the API library and search for the Google Sheets API. Then, you guessed it, enable the API.
Now you’re set to create your data-pipeline.
There are a million use cases for data pipelines; I chose to create a self-updating table of Germany’s 1. Bundesliga standings. Why did I choose this example? Mainly because I want to reuse a visualization created for one of the recent #WoW2020 challenges, a Premier League table originally created by Luke Stanke.
In the following, I will…
There are a lot of blog posts around that describe how to set up the Google Sheets/BigQuery part of this. I’ve mainly leaned on this blog by Richard Peterson, to name one.
The first step was to find a website that shows the current matchday. It didn’t take long to find this one on sportschau.de.
When I saw the URL, I knew that this site would always show the current matchday, because no matchday number is included in it.
Again, there are many blogs available about how to use the IMPORTHTML function in Google Sheets. To keep it short: all you need is a URL, a query type (table or list), and an index. With only one table on this website, this was easy to figure out.
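As a sketch, a call with these three arguments looks like this; the URL here is a placeholder, not the actual sportschau.de address, and the index is 1 because the page contains only one table:

```
=IMPORTHTML("https://www.example.com/bundesliga/current-matchday", "table", 1)
```

Entered in a single cell (e.g. A1), this formula spills the whole scraped table into the sheet.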
Under File/Spreadsheet settings I set the spreadsheet to be recalculated On change and every hour (I think even the default setting “On change” should do it).
While the result includes all the data I need, some columns contain multiple variables. Therefore, I will split the information into separate columns to be able to process the variables separately in later steps.
Here I’m splitting column B into columns F and G using the SPLIT function. Then I’m calculating the missing times in column H with an IF statement.
Column C is split into columns I and J, again using the SPLIT function.
To split column D, I used a combination of SPLIT and INDEX (and an IF statement for games not yet played).
Here, for the overall goals scored by the home team (column K); away goals (column L), halftime home goals (column M), and halftime away goals (column N) are calculated accordingly:
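Hedged sketches of what these formulas could look like; the cell references, delimiters, and the assumed score format (“2:1 (1:0)”, with “-:-” for unplayed games) are guesses based on the description above, not taken from the original sheet:

```
F2: =SPLIT(B2, " ")                                        date (F) and time (G) from column B
H2: =IF(G2 = "", H1, G2)                                   fill a missing time from the row above (assumed logic)
I2: =SPLIT(C2, " - ")                                      split column C into columns I and J
K2: =IF(D2 = "-:-", "", INDEX(SPLIT(D2, ": ()"), 1, 1))    overall home goals
L2: =IF(D2 = "-:-", "", INDEX(SPLIT(D2, ": ()"), 1, 2))    overall away goals
```

Note that SPLIT with the delimiter string ": ()" splits on each of those characters by default, so a score like “2:1 (1:0)” yields four values that INDEX can pick apart one by one.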
That’s the data prep part!
First thing we have to do in BigQuery is to create a dataset in our brand new project, let’s name it codatablog.
Then, within codatablog we can create our first table:
To create the table, choose the following settings.
The URL can be taken from the Share-dialogue in Google Sheets:
Choosing the Google Sheets file format can cause some problems with identifying table headers and formats correctly; it should work fine with the CSV format most of the time.
Et voilà: we have our table, which can be queried with Google’s Standard SQL…
…which gives us back the data from our Google Sheet:
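Such a query can be as simple as the following; note that matchday as the name of this first table is my assumption, since the post only names the later tables explicitly:

```sql
SELECT *
FROM `codatablog.matchday`
LIMIT 20
```

Because the table is backed by the live Google Sheet, every run of this query reads the sheet’s current contents.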
Nice! Thanks to the connection through the Google Sheets API, this table will always contain the latest information from the website we’ve imported into Google Sheets.
For my second table, I can get rid of some of the columns. I will also define a key and add a timestamp which I will use later to consolidate my data. Under More, I can define the second table (matchday_datalake) to write the data into.
Hitting Run will execute the query and create my second table.
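A sketch of what this query could look like, reusing the column names that appear in the consolidation query in Step 5; the key construction and the source table name matchday are assumptions:

```sql
SELECT
  CONCAT(Date, '_', Home, '_', Away) AS Key,  -- assumed key definition
  Date, time_comp, Home, Away,
  G_Home, G_Away, H_Home, H_Away,
  CURRENT_TIMESTAMP() AS timestamp            -- used later to pick the latest record per match
FROM `unitxxxxxxxxxx.codatablog.matchday`
```

With Append to table as the write preference, each scheduled run adds a freshly timestamped snapshot to matchday_datalake.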
Next step is to schedule this query.
To do so, you first have to enable the BigQuery Data Transfer API:
After a short time, you will be able to create a new scheduled query. Something end-of-day-ish seems like a good time to schedule the query, since all games end before 11pm at the latest. I’m also defining an end date for my scheduled query, since the season normally ends in mid-June. I’ve added some extra days here because of the COVID uncertainty this season.
The data from each query will be appended to the destination table.
Part of the scheduling process is also to enable billing. Again — 10 GB of storage and 1 TB of queries are free every month.
This daily query will create duplicate records because it runs every day. Therefore, in basically the same way as in Step 4, I’m creating a third table matchday_cons and scheduling a second query to consolidate my “datalake” using this SQL statement:
    SELECT
      Key, Date, time_comp, Home, Away,
      G_Home, G_Away, H_Home, H_Away,
      MAX(timestamp)
    FROM `unitxxxxxxxxxx.codatablog.matchday_datalake`
    WHERE G_Home IS NOT NULL
    GROUP BY
      Key, Date, time_comp, Home, Away,
      G_Home, G_Away, H_Home, H_Away
Make sure to choose “Overwrite table” instead of “Append to table” to keep only distinct matches in this third table.
Also, make sure to schedule this query some time after the query from Step 4, so that the latest data is included.
That’s it for the BigQuery part.
The easiest way to connect Google Sheets to BigQuery would be Connected Sheets, which is, however, limited to certain Google accounts (G Suite Enterprise, G Suite Enterprise for Education, and G Suite Enterprise Essentials).
Another way to do it is to use the OWOX BI BigQuery Reports add-on. To install it, choose Get add-ons from the Add-ons menu, search for it and install it.
Then create a new report with this add-on. Choose Add a new report, then select the BigQuery project and create a new query. With a simple SELECT * FROM the consolidated table, I’m just querying all of its data.
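In this setup, the table in question is the consolidated table from Step 5, so the query reduces to:

```sql
SELECT *
FROM `unitxxxxxxxxxx.codatablog.matchday_cons`
```

The add-on writes the result set into a sheet tab, which is what Tableau Public will later read.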
Next step is to schedule refreshes of the report:
While my other two queries run shortly before midnight, I’ve set this one to be updated directly after midnight.
Now, we’re all set to connect this Google Sheet to a Tableau visualization.
As mentioned before, I’m reusing a visualization from a recent WorkoutWednesday. To be able to use it, I’m pivoting the Home and Away columns and adapting the calculations where necessary.
The last things to do are replacing the data source, tweaking the colors a bit, and uploading to Tableau Public.
This post describes a way to keep your Tableau Public visualizations up to date. It may incur costs for the Google Cloud services used here if used too heavily, so please keep an eye on that.