Converting NetCDF files to .hyper for use in Tableau
Nov 16, 2022mapsoverlord
Recently, Klaus Schulte asked for some help with wrestling NetCDF files into Tableau. To help out, I wrote up a little Python script to pull the data out and convert it to a .hyper file for use in Tableau and figured that I should go ahead and document how to work with NetCDF in Tableau and share the script that I wrote in case anyone else had a similar problem. So here it is. Thank you to Klaus for encouraging me to write this up and for proving many of the helpful tips for working with the NetCDF data in Tableau!
In this blog post I’m going to talk more conceptually about the process and about how to deal with the data in Tableau, but if you want the technical details and a script to get you started with processing NetCDF data, the whole script with lots of annotation is up in a Jupyter notebook that you can read through, download, etc.: The Jupyer notebook is here!
NetCDF files are a special file format for storing array-oriented scientific data.
Typically you’ll have a grid of locations and then a number of attributes associated with each of those locations. For each attribute at each location you might have multiple measurements or time steps. The data gets complicated pretty fast! The UCAR web site on NetCDF has all sorts of information about the file format, structure, and use cases! I highly recommend scrolling through if you have to create or deal with NetCDF files.
Here is a look at a grid of data with one time step and attribute from the NetCDF file Klaus shared with me – it’s just a bunch of raster cells:
Let’s zoom in and see more of what this looks like – it’s just a bunch of squares and each square has a value (or, in the case of a NetCDF, each square often has LOTS of values):
No, Tableau doesn’t work directly with raster datasets, therefore we’ll need to start by converting this data into a grid of POINTS (with latitude and longitude coordinates) so that we can map and analyze them in Tableau. What we want to end up with is a bunch of point locations – one to represent every cell in the NetCDF file. Something like this:
When we convert the data to points, we’ll drop those into a big file so that we have a row in a table for each grid cell. In the case of this dataset, we have a 175 x 225 grid, so we’ll have 39,375 rows to represent each individual point in the raster. That will be data for the whole grid, though, as we’ll see in a bit, there may not be data values at each of these locations.
So when we map it in Tableau some of these points won’t show up. Why is this? Rasters are rectangular grids, but the area with data may be just a subset. In this case we have something like this – the grey dots are the “cells” (in Tableau these are really just point locations representing the center of the raster cell) without any recorded data values and the red dots are the “cells” where we have data recorded:
While the script that I wrote produces a .hyper table with the locations for all of the cells in the raster, you could streamline and just record the location for the cells that have valid data values. Since the geometry didn’t add that much to the overall size of the resulting .hyper file I just left it all in.
As I noted earlier, I put together a Jupyter notebook with all of the steps and documentation and you can grab it here. I’m not going to go into a lot of details about the script since you can look at the notebook, but I’ll give a little bit of an overview of the process.
I used Python to convert the NetCDF into a Tableau .hyper file. There are two python libraries that make this really easy – one is NetCDF4, a special library for working with NetCDF files, and the other is the Tableau Hyper API.
Using these libraries you can load up a NetCDF, obtain information on the variables and structure, then walk through the file and grab data for each location (the attributes & the location information) and write it directly into a .hyper file. Since the geometry is constant (i.e., the grid doesn’t change), you can create a .hyper file with TWO tables in it – one with just the geometry and one with the attributes. The nice thing about this is that you don’t waste storage space with lots of copies of the same latitude ad longitude values.
While the geometry table is relatively small (~39k rows), the attribute table becomes large quickly! The file that I was working with had 852 time steps, so when I grabbed ONE attribute for each location with valid data in the dataset (the red locations in the map above), I ended up with a table with 19,765,548 rows! We’re starting to talk about some reasonably large data!!
In the end, my hyper table gave me a structure like this, and I could link the attributes and geometry together with a simple relationship using an ID field. The ID field that I put in each table was just an indicator of the row and column for each cell, since that was a small and easy unique identifier:
I left the geometry values as latitude and longitude when I wrote the .hyper. You could convert those to Tableau spatial data points as part of the process of creating the .hyper, but it’s not necessary since you can easily turn them into a point geometry in Tableau using a MakePoint(latitude, longitude) calculation. You can just drop that single geometry field on the map to visualize your data.
Don’t be surprised when you first add the geometry and see a set of points that extend beyond the “study area” of your data. Remember that a raster is a big rectangle so there may be many points that don’t have data values. All you need to do is filter on the ID from your data table since that table only contains IDs for the cells with data!
The data table likely has a LOT of time steps – filter to the one time step of interest before you map the data so that you just see that one value and not the sum of all attributes across all time steps in the dataset. Also note that the time step value is in relative terms – so in the example dataset the first data point is on January 1st, 1951. Therefore, we can calculate a proper date from the time field like this: DATE(DATEADD(‘day’,[Time],#1951-01-01#)).
You might also want to use a spatial join to bring in information about locations of interest, for instance if you wanted to just filter the results to a single state of Germany you can join in a spatial file with the boundaries of the German states, and use the state name to filter!
Finally, the geometry comes in as point locations since we just have the latitude and longitude values. If you want a polygon representation that will fill in the space a little better you might consider using the Buffer functionality in Tableau to transform your point geometries to polygons. To do this, you need to figure out a bit about the configuration of the grid. In the example dataset it was a 4x4km grid. If you set the buffer radius to 2km you will end up with polygon circles that will fill in the space nicely: BUFFER([geometry],2,’km’)
When you put this all together you can start doing some fun analysis and exploration of your NetCDF data! Here’s an example from Klaus, you can also check it out on Tableau Public.
Let us know what you do with NetCDF files in Tableau!