DataOps-IST: Toolmaking and Delivery

“In God we trust; all others must bring data.”

- W. Edwards Deming

We should all have concerns about the modern business world. We ask our colleagues to do things they are not equipped to do, and we set unrealistic expectations for individual performance. As data professionals, we provide the material support but neglect to provide the requisite skills and training to take advantage of those resources. We ignore evidence that suggests mastering a new skill is much more difficult than we allow ourselves to believe. And we allow managers to repeat buzzwords like ‘analytics’, ‘big data’, and ‘business intelligence’ without providing the infrastructure necessary to handle the weight of those terms.

We can do better. At Pluralsight, we are doing better; DataOps-IST enables this.

DataOps is a domain through which we can begin to tackle the problems associated with the waves of data, and the corresponding undue expectations we place upon our users.

There are three primary pillars of DataOps-IST. In previous posts, we covered the first two; this post covers the third: Toolmaking and Delivery. It is the aspect of DataOps that first provides users with a resource and then provides the underpinning of continual monitoring and improvement, which leads to optimal utilization of that resource.

If we are to accomplish the objective of increasing data literacy, analytical skill, and data-supported output throughout the organization, then it is imperative that the Data Team (however it manifests itself in your org) work towards a role that entails less report-making and more ‘tool’ provisioning. To put it another way, they need to provide the means rather than the ends of the reporting cycle.

And those means include the following:

  • Shared Data Sources that users can access and analyze
  • Plug-and-Play Report Templates that allow for quick and sensible data visualizations
  • Data Models
  • Data Dictionaries that denote agreed-upon contexts and definitions of fields, tables, views, etc.

Everything mentioned above, in addition to our previous posts, is only the beginning, though. The last component of DataOps is the most important, for it affords us the opportunity for ceaseless evaluation of our previous efforts. It kicks off the invaluable cycle of assessment. How do we determine that the tools and resources we’ve provided our colleagues are of the utmost value and relevance? How do we ensure that our users are leveraging our data to provide the best possible end product?

We study the logs, of course. Logentries provides us the ability to do so. If we are falling short in any particular aspect of the process, our examination of the logs will show us where, even if it’s simply by identifying the legacy reports and leftover dashboards that inevitably take up too much space on your server and need to be removed. The monitoring and re-evaluation step of DataOps allows us to get back to the data, begin the process anew, and ensure that our colleagues are continually equipped to ‘bring the data’ they need in order to do their work in a meaningful way.

Expanding the Tableau Data Pipeline: Auto-Creation of a TDE from a CSV

The scenario

Imagine that you’ve pulled some data out of a database (choose your flavor) and want to analyze it in Tableau. You notice, however, that it’s too massive to tinker with in Excel (yes, this actually happened).

So, you think: “Wouldn’t it be awesome to have Tableau generate a shared data source for me if I just point it to a csv and then choose what datatypes are in it?” At that point, you can analyze to your heart’s content. Bonus: it’s repeatable and, if your csv updates, so will your data source.

We can now use this as part of our data pipeline: we write some awesome query and want to share the results with our colleagues. Instead of sending a csv file to the crew, we just send them a Tableau data source.
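The upstream step is nothing fancy: dump the query’s results to a csv that the pipeline can pick up. Here is a minimal sketch of that step, using sqlite3 as a stand-in for whatever database flavor you prefer; the database, table, and file names are purely illustrative.

```python
# Minimal sketch: dump query results to a csv for the csv-to-tde pipeline to pick up.
# sqlite3 is a stand-in for "choose your flavor"; swap in your own driver.
import csv
import sqlite3

conn = sqlite3.connect('warehouse.db')              # hypothetical database
cur = conn.execute('SELECT * FROM monthly_sales')   # your "awesome query" goes here

with open('monthly_sales.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cur.description])  # header row
    writer.writerows(cur)                                  # data rows

conn.close()
```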

The setup

You will need 2 config/base Tableau files (we’ve included them here):

  • XML file with the structure for a TDS
  • Basic/simple/dummy TDE file (really, it’s not dumb at all, as TDEs are amazing; rather, it’s just a basic ‘helper’ TDE we’ll package with the TDS to form the TDSX; see the sketch after this list for how such a helper might be generated)
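If you’re wondering what that ‘helper’ TDE amounts to, here is a rough sketch of how such a file could be generated with Tableau’s Data Extract API (the Python dataextract module). This isn’t the included file itself; the column name and path are placeholders.

```python
# Rough sketch: generate a bare-bones 'helper' TDE with Tableau's Data Extract API.
# Column name and file path are placeholders; the real helper file is included above.
import dataextract as tde

extract = tde.Extract('helper.tde')
if not extract.hasTable('Extract'):
    schema = tde.TableDefinition()
    schema.addColumn('placeholder', tde.Type.UNICODE_STRING)  # a single dummy column
    extract.addTable('Extract', schema)
extract.close()
```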

[Image: csv_to_tde_1]

[Image: csv_to_tde_2]

The script (see our GitHub repo for the full script; a stripped-down sketch follows this list):

  • Reads your csv file
  • Lets you choose what data type you need for each column; if you want to let Tableau work its magic, just set ‘Choose Data Type?’ to false and the Extract engine will come to the rescue
  • Updates the XML in the TDS
  • Packages the ‘helper’ TDE with the updated TDS
  • Publishes to Server and refreshes with the new data and new file
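To make those steps concrete, here is a stripped-down sketch of the flow (not the production script; see the repo for that). It assumes the base TDS declares its columns as <column> elements directly under the datasource root, that a TDSX is simply a zip of the TDS plus the helper TDE, and that tabcmd is installed and already signed in to your Server; file names, the data source name, and the project are illustrative.

```python
# Stripped-down sketch of the csv -> TDS -> TDSX -> Server flow.
# Assumptions: <column> elements sit directly under the TDS root, and tabcmd is
# installed and already authenticated. Names and paths are illustrative.
import csv
import subprocess
import xml.etree.ElementTree as ET
import zipfile


def read_csv_columns(csv_path):
    """Grab the header row so we know which columns the data source needs."""
    with open(csv_path, newline='') as f:
        return next(csv.reader(f))


def update_tds(base_tds, out_tds, columns, chosen_types=None):
    """Declare one <column> per csv field in the base TDS.

    With chosen_types=None ('Choose Data Type?' = false), the datatype is left
    for the Extract engine to infer; otherwise we stamp the chosen type.
    """
    tree = ET.parse(base_tds)
    root = tree.getroot()
    for old in root.findall('column'):
        root.remove(old)                                   # drop the dummy columns
    for name in columns:
        col = ET.SubElement(root, 'column')
        col.set('name', '[{}]'.format(name))
        if chosen_types and name in chosen_types:
            col.set('datatype', chosen_types[name])
    tree.write(out_tds, xml_declaration=True, encoding='utf-8')


def package_tdsx(tds_path, helper_tde, tdsx_path):
    """A TDSX is just a zip: bundle the updated TDS with the helper TDE."""
    with zipfile.ZipFile(tdsx_path, 'w') as z:
        z.write(tds_path, arcname='datasource.tds')
        z.write(helper_tde, arcname='Data/Extracts/helper.tde')


def publish(tdsx_path, name, project):
    """Push the packaged data source to Tableau Server via tabcmd."""
    subprocess.check_call(['tabcmd', 'publish', tdsx_path,
                           '-n', name, '--project', project, '--overwrite'])


if __name__ == '__main__':
    cols = read_csv_columns('monthly_sales.csv')
    update_tds('base.tds', 'monthly_sales.tds', cols)        # let Tableau infer types
    package_tdsx('monthly_sales.tds', 'helper.tde', 'monthly_sales.tdsx')
    publish('monthly_sales.tdsx', 'Monthly Sales', 'Analytics')
```

Keeping the helper TDE dumb and letting Server’s refresh do the heavy lifting is what keeps the script itself down to a few seconds; the refresh is where the real extract gets built.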

 

[Image: csv_to_tde_3]

Don’t believe us? Watch the video (this is where we’ve set the ‘Choose Data Type’ to True)…

 

 

A Practical Example

With the ability to search the web for interesting and varied data sets, we run into csvs a lot. The NYC Taxi data is no exception.

So, we grabbed a month’s worth of data (approximately 2 GB), pulled it into Tableau (see image below), and in 180 seconds it was a shared data source (that was the time for the refresh on 12 million rows; the script itself took less than 3 seconds).

Another version of this example exported all the months, merged them (as sketched below), and then made a data source in Tableau. Either way, all one needs is a csv file.
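If you want to do the merging yourself before handing the file to the pipeline, the stitching step is short with pandas. A hypothetical sketch (the file pattern and output name are illustrative):

```python
# Hypothetical sketch: stitch monthly taxi csvs into a single file for the pipeline.
import glob
import pandas as pd

frames = [pd.read_csv(path) for path in sorted(glob.glob('trip_data_*.csv'))]
pd.concat(frames, ignore_index=True).to_csv('trip_data_all.csv', index=False)
```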

[Image: taxi_data]

We’ll update this in future releases to use tables/custom SQL, as well as make it a more robust pipeline.

 

[Attachment: CsvTdeCreator_Master]

[Attachment: TdeContent]