Expanding the Tableau Data Pipeline: Auto-Creation of TDEs from CSVs

The scenario

Imagine that you’ve pulled some data out of a database (choose your flavor) and now want to analyze it in Tableau. You notice, however, that it’s too massive to tinker with in Excel (yes, this actually happened).

So you think: “Wouldn’t it be awesome to have Tableau generate a shared data source for me if I just point it at a CSV and then choose what data types are in it?” At that point, you can analyze to your heart’s content. Bonus: it’s repeatable, and if your CSV updates, so does your data source.

We can now use this as part of our data pipeline: we write an awesome query and want to share the results with our colleagues. Instead of sending a CSV file to the crew, we just send them a Tableau data source.

The setup

You will need two config/base Tableau files (we’ve included them here):

  • XML file with the structure for a TDS (see the sketch after this list)
  • Basic/simple/dummy TDE file (really, it’s not dumb at all; TDEs are amazing. It’s just a basic ‘helper’ TDE we’ll package with the TDS to form the TDSX)
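For orientation, here is a stripped-down sketch of what the TDS template might look like. The element and attribute names follow Tableau’s standard TDS XML format, but the field names and version shown here are illustrative assumptions; the exact contents of the file in the repo may differ:

```xml
<?xml version='1.0' encoding='utf-8' ?>
<datasource formatted-name='csv_to_tde' inline='true' version='9.0'>
  <!-- points at the packaged 'helper' extract -->
  <connection class='dataengine' dbname='Data/Extracts/helper.tde' />
  <!-- one <column> element per CSV field; the script rewrites these -->
  <column caption='trip_distance' datatype='real'
          name='[trip_distance]' role='measure' type='quantitative' />
</datasource>
```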

[Image: csv_to_tde_1]

[Image: csv_to_tde_2]

The script (see our GitHub repo for the script) does the following; a minimal sketch of these steps appears after the list:

  • Reads your CSV file
  • Lets you choose the data type you need for each column; if you want to let Tableau work its magic, just set ‘Choose Data Type?’ to false and the Extract engine will come to the rescue
  • Updates the XML in the TDS
  • Packages the ‘helper’ TDE with the updated TDS
  • Publishes to Server and refreshes with the new data and new file
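Here is a minimal Python sketch of the first four steps. The file names (template.tds, helper.tde, output.tdsx), the ‘string’ default data type, and the archive layout are illustrative assumptions; see the GitHub repo for the real script:

```python
# Minimal sketch: CSV header -> updated TDS XML -> packaged TDSX.
import csv
import xml.etree.ElementTree as ET
import zipfile

CSV_PATH = 'data.csv'

# Step 1: read the CSV header to discover the column names.
with open(CSV_PATH, newline='') as f:
    columns = next(csv.reader(f))

# Steps 2-3: rewrite the <column> elements in the TDS template.
# Tableau TDS files are plain XML with one <column> per field.
tree = ET.parse('template.tds')
root = tree.getroot()
for old in root.findall('column'):
    root.remove(old)                      # drop any placeholder columns
for name in columns:
    ET.SubElement(root, 'column', {
        'name': '[{}]'.format(name),
        'caption': name,
        'datatype': 'string',             # or prompt the user per column
        'role': 'dimension',
        'type': 'nominal',
    })
tree.write('updated.tds', xml_declaration=True, encoding='utf-8')

# Step 4: a TDSX is essentially a zip of the TDS plus its extract
# (archive layout assumed; inspect a TDSX saved from Desktop to confirm).
with zipfile.ZipFile('output.tdsx', 'w', zipfile.ZIP_DEFLATED) as z:
    z.write('updated.tds', 'updated.tds')
    z.write('helper.tde', 'Data/Extracts/helper.tde')
```

The final publish step can then be handled with tabcmd (for example, `tabcmd publish output.tdsx --overwrite`) or Tableau Server’s REST API.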

[Image: csv_to_tde_3]

Don’t believe us? Watch the video (this is where we’ve set ‘Choose Data Type?’ to true)…


A Practical Example

With the ability to search the web for interesting and varied data sets, we run into CSVs a lot. The NYC Taxi data is no exception.

So, we grabbed a month’s worth of data (approx. 2 GB), pulled it into Tableau (see image below), and in 180 seconds it was a shared data source (that was the refresh time for 12 million rows; the script itself took less than 3 seconds).

Another version of this example exported all the months, merged them (as sketched below), and then made a data source in Tableau. Either way, all one needs is a CSV file.
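For the merged-months variant, a minimal sketch might look like the following; the file-naming pattern and output name are assumptions, and we assume all months share the same header:

```python
# Hypothetical sketch: concatenate the monthly taxi CSVs into a single
# file, writing the shared header only once.
import csv
import glob

with open('taxi_all_months.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    header_written = False
    for path in sorted(glob.glob('taxi_month_*.csv')):  # assumed naming
        with open(path, newline='') as f:
            reader = csv.reader(f)
            header = next(reader)          # skip each file's header row
            if not header_written:
                writer.writerow(header)
                header_written = True
            writer.writerows(reader)       # append the data rows
```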

[Image: taxi_data]

We’ll update this in future releases to support tables/custom SQL, as well as to make the pipeline more robust.

[Image: CsvTdeCreator_Master]

[Image: TdeContent]

What is DataOps-IST & Why Do You Need It?

In the technology world, things change quickly. That’s common knowledge, so we don’t feel a huge need to convince anyone of the fact. Oftentimes, buzzwords surface and catch on like wildfire: there’s no escaping them. Some of these persist and become veritable domains; others wither and are quickly extinguished.

One word (or domain) which has certainly taken root amid an abundance of phenomenal technology and devices is Data; big, small, or any size in between, it’s simply data, and we have plenty of it. The theme that follows is our belief that Data needs its own domain and cannot exist without a solid foundation in, well, Data.

There are domains which serve to enhance and support the DataOps realm: IT, Development, and Engineering, for example, each support a piece of the larger DataOps domain in some form. Companies abound with analysts and reporting experts, but there is no specific focus on the merit and value of Data as a whole. We propose that without a focus on Data, businesses will fail to realize the bigger picture in which each piece of the organization connects to the others.

DataOps, and more specifically DataOps-IST, is simple and involves three foundational pillars:

  • Infrastructure: Data and an appropriate Analytics Infrastructure (Tableau in our case)
  • Social Engineering (and Research)
  • Toolmaking & Delivery

[Image: DataOps]

Between each pillar, there are the obvious feedback loops wherein each connects to the others and serves to enhance and support the domain. Around the domain exists a logging and monitoring layer; we want to know how well, and to what extent, we are serving our customers. For this function, we use Logentries to answer questions such as the following (a sketch of the kind of event we log appears after the list):

  • Who is using a particular report, and what are the load times?
  • Where are the gaps in performance? Are they hardware- or report-related?
  • Can we potentially remove parts of a report/dashboard because of performance?
  • Can we, in real time, monitor a new report and quickly (also in real time) update parts which are not performing?
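To make that concrete, here is a hypothetical sketch of the kind of report load-time event we might emit. The token, the key=value field names, and the report/user values are illustrative assumptions; the LogentriesHandler comes from the logentries Python package:

```python
# Hypothetical sketch: ship a report load-time event to Logentries.
import logging
import time
from logentries import LogentriesHandler  # pip install logentries

log = logging.getLogger('dataops')
log.setLevel(logging.INFO)
log.addHandler(LogentriesHandler('YOUR-LOGENTRIES-TOKEN'))  # assumed token

start = time.time()
# ... load/render the report for the user here ...
load_seconds = time.time() - start
log.info('event=report_load report=taxi_dashboard user=%s load_time=%.2fs',
         'some_user', load_seconds)
```

Structured key=value lines like this make the questions above answerable with simple searches and aggregations on the Logentries side.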

Again, this domain-level monitoring is critical, since it informs how we discuss progress and future enhancements with customers.

In the DataOps domain, there is a vast opportunity, driven largely by the natural growth of other fields, to firmly establish a pattern of analytics and data in any organization, big or small. We are in a transitional time in which technology has promoted the growth of analytics as a separate, yet equally foundational, part of any company.

Look for further updates on the topic. We’ll start to discuss the specifics of each pillar, why it’s important to leverage them correctly, and the functionality that makes each uniquely DataOps-oriented.

Enjoy!


 This post was originally published on blog.logentries.com in April 2015.