Everybody seems to be talking about data these days, and for good reason. Organizations large and small are collecting more information than ever before and using it to gain valuable insight into performance metrics, company health, and customer behavior. At the same time, advanced technologies for storing, processing, and analyzing all of that data are more accessible than ever.
On the surface, these are some pretty cool developments! Making sense of large amounts of data can support a lot of important business decisions. So while a few talented folks may impressively continue doing things the old-fashioned way, most of us should be trying to embrace increasingly large pools of data in more effective ways. For Dropbox Sign, this was a big part of the motivation behind our recent decision to overhaul our patchwork of siloed systems and reporting tools. As a company, we had reached a size and complexity where we needed a more consistent, reliable, and integrated data solution.
This project was, as you might expect, a big endeavor! The title of this article “6 easy steps” is a bit tongue-in-cheek, but after putting in the work we can now look at things with the clarity of hindsight. All the moving parts have come together and really begun to make sense. From a BI perspective, we’re in a much improved place, and it’s made us realize that the knowledge we’ve gained may be useful to others.
We therefore wanted to share our methods, results, and some of the lessons we learned along the way.
When You’re Just Starting Out
In the beginning, a project like this can feel daunting. With so many stakeholders, disparate data sources, and tooling options, where to begin? There’s really no one-size-fits-all solution here, so it’s important to ground yourself with research and planning.
As Andy Grove famously said, “Energy put in early in any product process pays off ten-fold, energy put in late negative ten-fold!”
Reading blog posts like this one is a great start and should give you a good foundation for execution. For a company of our size, or perhaps a bit bigger, the following steps will probably strike about the right balance of cost, power, and flexibility.
To summarize our project strategy: we begin by copying data from its primary locations into a “single source of truth,” applying transforms along the way in a process known as ETL (more on those details below). We then connect a reporting layer to join tables from different sources and build reports, write some scripts to sync data out to external platforms, and finally verify the integrity of the whole pipeline. Let’s get started!
Step 1: Get a “Single Source of Truth”
It’s hard to overstate the importance of having a central data warehouse from which everything else is driven. With this constraint in place, even if numbers are reported incorrectly, at least you’ll know where to look in order to fix them. That’s on top of the benefits that come from offloading analytic workloads from your production database. In deciding what to use here, one of our first considerations was scale: how much data would we need to manage, and how would it be accessed?
For very large businesses, particularly those in the B2C space, large amounts of information necessitate sophisticated software and querying layers. And while our needs were not yet so big, we knew that we wanted to be forward-thinking and evaluate all our possible options.
On the more extreme end there are technologies like Hadoop and Hive, along with HBase and Apache Spark. Hadoop and Hive are built on the distributed filesystem HDFS and the MapReduce processing framework, and they excel at analyzing large data sets and finding trends over long periods of time through long-running batch jobs. HBase and Spark, by comparison, are geared toward lower-latency, more interactive access: HBase serves fast random reads and writes on top of HDFS, while Spark speeds up queries by keeping working data sets in distributed memory.
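To make the distinction concrete, here’s a minimal PySpark sketch of the kind of interactive aggregation that becomes possible once data is cached in distributed memory. This is purely illustrative and not part of our stack; the HDFS path and column names are hypothetical placeholders.

```python
# Minimal PySpark sketch: an interactive aggregation over data cached in memory.
# The HDFS path and column names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("signup-trends").getOrCreate()

# Load structured event data (e.g. JSON logs) from HDFS and keep it in memory
events = spark.read.json("hdfs:///logs/events/*.json").cache()

# Interactive query: daily signup counts, computed across the cluster
daily_signups = (
    events.filter(events.event_type == "signup")
          .groupBy("event_date")
          .count()
          .orderBy("event_date")
)
daily_signups.show(30)
```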
A simpler storage solution, which may not handle the same scale but is still quite powerful in its own right, is Amazon Redshift. Redshift is an MPP (massively parallel processing) distributed database that scales both horizontally and vertically. And because it runs in the AWS ecosystem, it offers convenient access to what are probably familiar high-level support and maintenance tools, such as automated backups, access management, and streamlined upgrades. Oh, and did I mention it’s based on Postgres? There’s a lot to like here.
To be thorough, we should also mention the turn-key solutions available which often combine storage with ETL. Some examples in this space are companies like Segment and RJMetrics. Going this route definitely puts you on the “buy” side of the classic buy vs. build question, which is not necessarily a bad thing if you’ve got the cash. While purchasing products like these tends to be faster and less impactful on resources, it’s probably going to be less flexible and also more expensive (over the long run) than if you build something yourself.
Since we were building a custom system and projected that our needs wouldn’t require the sophistication of a technology like Hadoop, we ultimately chose Amazon Redshift because of its speed, its cost-effectiveness, and our generally positive past experiences with AWS.
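Because Redshift speaks the Postgres wire protocol, querying it from a script is straightforward. Here’s a minimal sketch using psycopg2; the host, credentials, and table and column names are hypothetical placeholders, not our actual setup.

```python
# Minimal sketch: querying Redshift with psycopg2 (Redshift is Postgres-compatible).
# Host, credentials, and table/column names are hypothetical placeholders.
import psycopg2

conn = psycopg2.connect(
    host="our-warehouse.example.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="bi_readonly",
    password="********",
)

with conn, conn.cursor() as cur:
    # Example analytic query against the warehouse copy of production data
    cur.execute("""
        SELECT date_trunc('month', created_at) AS month, count(*) AS new_accounts
        FROM accounts
        GROUP BY 1
        ORDER BY 1
    """)
    for month, new_accounts in cur.fetchall():
        print(month, new_accounts)
```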
Step 2: Set Up an Ingress ETL Pipeline
We now had a shiny new Amazon Redshift instance, empty and waiting. Time to fill it up! This is the ETL stage, where you extract tables from a primary source, like a production database, transform them along the way, and load the results into the warehouse. The transform step is where you make improvements such as adjusting formatting, altering schemas, augmenting with new columns, and generally preparing things for easier querying later.
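To make the concept concrete, here’s a minimal sketch of the extract-transform-load loop. It is not our production pipeline; the connection details, tables, and columns are hypothetical, and at real volumes you would stage files in S3 and use Redshift’s COPY command rather than row-by-row inserts.

```python
# Minimal ETL sketch: extract from a MySQL read-replica, apply a small transform,
# and load into Redshift. Connection details and table names are hypothetical;
# at real volumes you would stage files in S3 and use Redshift's COPY command.
import pymysql
import psycopg2

source = pymysql.connect(host="mysql-replica.internal", user="etl",
                         password="********", database="app")
warehouse = psycopg2.connect(host="our-warehouse.example.redshift.amazonaws.com",
                             port=5439, dbname="analytics",
                             user="etl", password="********")

with source.cursor() as src, warehouse, warehouse.cursor() as dst:
    # Extract
    src.execute("SELECT id, email, plan, created_at FROM users")
    for user_id, email, plan, created_at in src.fetchall():
        # Transform: normalize casing and add a derived column
        email = email.lower()
        is_paid = plan != "free"
        # Load
        dst.execute(
            "INSERT INTO dim_users (id, email, plan, is_paid, created_at) "
            "VALUES (%s, %s, %s, %s, %s)",
            (user_id, email, plan, is_paid, created_at),
        )
```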
As with storage solutions, there are a lot of tooling possibilities here. For real-time ingestion of large quantities of data there are streaming options like Kafka and Spark Streaming. Those, combined with something like Logstash, can ingest structured data sources such as JSON-based log files. While these technologies offer some cool benefits, like multiple publishers and subscribers and real-time updates, we ultimately decided they were overkill for our current needs.
A step down in complexity brings you to open-source ETL tools like Pentaho and Talend. These could potentially have worked for us but still felt a bit awkward. The main data source we planned to “ETL” into Redshift was our production MySQL read-replica, so what we really needed was a lightweight framework that provides good conventions. Niceties like incremental loading, error handling, and lifecycle hooks are also helpful, but above all we needed something rock-solid reliable and well-designed enough to extend and customize easily.
While several such tools could probably have done the job, we eventually settled on a Ruby framework called DataDuck ETL (GitHub source). It had most of the features we were looking for out of the box, along with a clean, well-designed interface to build from.
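DataDuck ETL’s table definitions are written in Ruby, so we won’t reproduce them here, but the incremental-loading idea is simple enough to sketch on its own: remember the high-water mark of the last run and only pull rows changed since then. The sketch below is a generic illustration, not DataDuck’s API; it assumes cursor objects like those in the earlier sketch, a bookkeeping table called etl_watermarks, and hypothetical column names.

```python
# Sketch of incremental loading: only extract rows modified since the last run.
# Assumes the source table has an indexed updated_at column and the warehouse
# keeps a small bookkeeping table of high-water marks. Names are hypothetical.

def last_watermark(dst, table):
    dst.execute("SELECT max_updated_at FROM etl_watermarks WHERE table_name = %s",
                (table,))
    row = dst.fetchone()
    return row[0] if row and row[0] else "1970-01-01 00:00:00"

def incremental_extract(src, dst, table):
    since = last_watermark(dst, table)
    src.execute(
        f"SELECT id, payload, updated_at FROM {table} "
        "WHERE updated_at > %s ORDER BY updated_at",
        (since,),
    )
    rows = src.fetchall()
    # ... transform and upsert `rows` into the warehouse here ...
    if rows:
        newest = rows[-1][2]  # updated_at of the most recent row extracted
        dst.execute(
            "UPDATE etl_watermarks SET max_updated_at = %s WHERE table_name = %s",
            (newest, table),
        )
    return rows
```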
Step 3: Connect a Visualization and Reporting Tool
When it comes to reporting, there are plenty of options to choose from. There are more mature solutions like Tableau as well as some new up-and-comers like Amazon Quicksight, Looker, and ChartIO. This is probably one of the more subjective areas of the project, and I could see people going a variety of ways here. For us, we eventually settled on Tableau due to its relative affordability, strong integration support with multiple platforms, and our existing in-house expertise with it.
It turned out to be an excellent choice. Tableau is fairly robust, with a number of reporting options and features and a broad ecosystem of integrations. Its interface is well-polished and comes with both native clients and cloud-based reporting components. We’re primarily using it for internal dashboards across all our departments and teams.
It’s worth noting here that we’re currently still developing reports “as ordered” from our various departments and changes all need to go through our BI team. At some point, I could see us wanting to package these reports with additional tools and capabilities. This, combined with the proper training, could enable departments to “own their data” and handle many more customizations themselves. How exactly to do this remains an open question but it’s definitely something being successfully done at larger companies with dedicated data development teams.
Step 4: Sync Data to Additional Platforms
One of our major goals with this project was to be able to run reports from a central location, but reporting isn’t the only reason to have a single source of truth. It’s also an excellent launching point for outbound data syncing. Sharing information this way can make the paid platforms you rely on far more useful, and the people who use them will appreciate the added context.
One prime example is how we now send subscription data into Salesforce (using their Bulk API). This gives our account managers a central place to view our users’ plan and renewal statuses in an environment they’re already extremely comfortable working in. We’ve also begun to form a customer success department, so we started sending usage metrics into a tool called Totango to identify people who may benefit from proactive outreach.
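The outbound pattern itself is simple: query the warehouse for the records the external platform cares about, shape them into that platform’s field names, and push them in batches. Here’s a hedged sketch of that shape; the table, field names, and the push_to_salesforce helper are hypothetical stand-ins (the real upload would go through Salesforce’s Bulk API via a client library).

```python
# Sketch of the outbound sync pattern: read subscription data from the warehouse,
# shape it into Salesforce-friendly records, and push it in batches.
# The push itself is stubbed out; field and table names are hypothetical.
import psycopg2

BATCH_SIZE = 5000

def fetch_subscription_records(warehouse):
    with warehouse.cursor() as cur:
        cur.execute("""
            SELECT account_id, plan_name, renewal_date
            FROM subscriptions
            WHERE updated_at > current_date - 1
        """)
        for account_id, plan_name, renewal_date in cur:
            yield {
                "Account_ID__c": account_id,
                "Plan__c": plan_name,
                "Renewal_Date__c": renewal_date.isoformat(),
            }

def push_to_salesforce(records):
    # Placeholder: in practice this would call Salesforce's Bulk API
    # through a client library with the batch of records.
    print(f"would upload {len(records)} records")

def sync(warehouse):
    batch = []
    for record in fetch_subscription_records(warehouse):
        batch.append(record)
        if len(batch) == BATCH_SIZE:
            push_to_salesforce(batch)
            batch = []
    if batch:
        push_to_salesforce(batch)
```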
The key here is to favor one-way dataflow whenever possible. If you do need to sync data from a third-party platform back to the warehouse, the data from the external source should be new and authoritative. For example, bringing in support ticket information from Zendesk makes sense, because that’s the only place that information originates. Updating a customer’s name from Zendesk, on the other hand, makes less sense, because it could conflict with data in our production database. In short, knowing which sources are authoritative for which data will help you avoid messy reconciliation efforts in the warehouse.
Fortunately, with a tool like Tableau you often have an alternative to merging new data sources on the backend. Tableau can connect natively to many popular platforms, like Zendesk and Salesforce, and join their tables to yours through common, unique fields directly within the report. The report can then be published to the cloud and scheduled to pull new data on an ongoing basis so it’s always up to date. We use this strategy frequently, and it’s an excellent way to simplify your work without giving much up.
Step 5: Check Your Data Integrity
Without a doubt, you will need a quality-control stage for identifying the parts of your system that just aren’t functioning properly and are producing bad data. The root cause will vary: poor validation on input, a mistaken transformation in the ETL stage, or a bug in the production code base. This is probably the least fun part of the process, but it’s also one of the most essential.
Here are a few tips that you can keep in mind to improve integrity:
- Check your numbers for sanity and accuracy, and cross-check them with supporting sources if possible (see the sketch after this list for a simple row-count comparison).
- Try to institute “unit” tests (of a sort) where you introduce known state at the beginning of the system and log it as it travels through your warehouse and, eventually, into your reporting tools.
- If you had previous BI processes in place, leave them running long enough to check your new numbers against the old ones. You can then ensure they match before deprecating the old way.
- When you identify irregularities, try to trace them to their source and fix the underlying issue. You may also be able to retroactively repair data that has been corrupted if you know the source.
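As a starting point for the first item above, here’s a minimal sketch of a row-count comparison between the production read-replica and the warehouse. Connection setup is omitted, and the table names and drift tolerance are hypothetical choices, not our actual checks.

```python
# Sketch of a basic integrity check: compare row counts between the production
# read-replica and the warehouse copy for a handful of important tables.
# Connection setup is omitted; table names and the tolerance are hypothetical.

TABLES_TO_CHECK = ["users", "documents", "subscriptions"]

def row_count(cursor, table):
    cursor.execute(f"SELECT count(*) FROM {table}")
    return cursor.fetchone()[0]

def check_counts(mysql_cur, redshift_cur, tolerance=0.001):
    for table in TABLES_TO_CHECK:
        source = row_count(mysql_cur, table)
        warehouse = row_count(redshift_cur, table)
        drift = abs(source - warehouse) / max(source, 1)
        status = "OK" if drift <= tolerance else "INVESTIGATE"
        print(f"{table}: source={source} warehouse={warehouse} "
              f"drift={drift:.4%} [{status}]")
```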
Like QA for software development, this stage never really ends. Whenever changes are made that touch the data pipeline, or new services become integrated, updates will likely need to be made and checked for accuracy.
Step 6: Supplement, Maintain, Improve
In addition to the primary pipeline, there may be ancillary systems that support your goals and require minimal overhead to implement.
A perfect example of this for us is the frontend analytics platform Heap (which, incidentally, also supports a native connection to Redshift). Heap enables our Product department to own frontend usage and engagement metrics while keeping that data collection and analysis separate from engineering.
Other useful tools in this category include Google Tag Manager for deploying front-end analytics tags and Optimizely for A/B testing.
Conclusion and Parting Thoughts
While by no means finished, our new pipeline has already begun to provide much-appreciated clarity and insight into a number of processes. Our automated scripts run once a night, bringing fresh data to our reports and business platforms every day. For version “1.0” we’re quite satisfied with how it’s worked out, and even more encouraging, we feel we’ve only begun to realize its full potential.
Additionally, as we continue to grow there will be new data sources to integrate and new stakeholders to report to. We’ve already begun talking about ways to expand these efforts, and we feel well-positioned to bring those ideas to life.
This post was originally published by The Next Web.