6 Easy Steps To An Integrated Data Solution

by Aron Solberg · June 1, 2016 · 10 minute read

Everybody seems to be talking about data these days, and for good reason. Organizations large and small are collecting more information than ever before and using it to gain valuable insight into performance metrics, company health, and customer behavior. At the same time, advanced technologies for storing and analyzing that data are becoming more accessible than ever.

On the surface, these are some pretty cool developments! Making sense of large amounts of data can support a lot of important business decisions. So while a few talented folks may impressively continue doing things the old-fashioned way, most of us should be trying to embrace increasingly large pools of data in more effective ways. For Dropbox Sign, this was a big motivation behind our recent decision to overhaul our patchwork of siloed systems and reporting tools. As a company, we had reached the size and complexity where we needed a more consistent and reliable integrated data solution.

This project was, as you might expect, a big endeavor! The title of this article “6 easy steps” is a bit tongue-in-cheek, but after putting in the work we can now look at things with the clarity of hindsight. All the moving parts have come together and really begun to make sense. From a BI perspective, we’re in a much improved place, and it’s made us realize that the knowledge we’ve gained may be useful to others. 


We therefore wanted to share our methods, results, and some of the lessons we learned along the way.




When You’re Just Starting Out



In the beginning, a project like this can feel daunting. With so many stakeholders, disparate data sources, and tooling options, where to begin? There’s really no one-size-fits-all solution here so it’s important to ground yourself with research and planning. 


As Andy Grove famously said, “Energy put in early in any product process pays off ten-fold, energy put in late negative ten-fold!” 


Reading blog posts like this one is a great start and should give you a good foundation for execution. However, for a company of our size, or perhaps a bit bigger, the following steps will probably strike about the right balance of cost, power, and flexibility.

To summarize our project strategy, we’ll begin by copying our data from primary locations into a “single source of truth” while simultaneously applying transforms in a process known as ETL (and we’ll discuss those details more below). We’ll then connect a reporting layer to join tables from different sources and build reports, and finally write some scripts to sync data out to external platforms before kicking off integrity verification. Let’s get started!


Step 1: Get a “Single Source of Truth”



It’s hard to overstate the importance of having a central data warehouse from which everything else is driven. With this constraint in place, even if numbers are reported incorrectly, at least you’ll know where to look in order to fix them. That’s in addition to the benefit of offloading analytic workloads from your production database. In deciding what to use here, one of our first considerations was scale: how much data would we need to manage, and how would it be accessed?

For very large businesses, particularly those in the B2C space, large amounts of information necessitate sophisticated software and querying layers. And while our needs were not yet so big, we knew that we wanted to be forward thinking and evaluate all our possible options. 


On the more extreme side there are technologies like Hadoop and Hive, and more recently HBase and Apache Spark. Hadoop and Hive are built on the distributed filesystem HDFS and the MapReduce processing framework, and they excel at analyzing large data sets and finding trends over long periods of time through long-running batch jobs. HBase and Spark, by comparison, provide interactive, low-latency querying: HBase by serving data from a distributed column-oriented store, and Spark by holding working sets in distributed memory.

A simpler storage solution, which may not handle the same scale but is still quite powerful in its own right, is Amazon Redshift. Redshift is an MPP-style distributed database that scales both horizontally and vertically. Also, because it runs in the AWS ecosystem, it offers convenient access to familiar high-level support and maintenance tools such as automated backups, access management, and streamlined upgrades. Oh, and did I mention it’s based on Postgres? There’s a lot to like here.
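
Because Redshift speaks the Postgres wire protocol, you can work with it from standard Postgres tooling. Here’s a minimal sketch (not from our actual setup) of creating a table with distribution and sort keys and bulk-loading it from S3, assuming the psycopg2 driver and placeholder cluster, bucket, and IAM role names:

```python
# A minimal sketch: connect to Redshift with a standard Postgres driver,
# create a table with distribution/sort keys, and bulk-load it from S3.
# The hostname, credentials, bucket, and IAM role below are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="etl_user",
    password="...",
)
conn.autocommit = True

with conn.cursor() as cur:
    # DISTKEY/SORTKEY choices drive how Redshift spreads and orders the data.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS events (
            user_id     BIGINT,
            event_type  VARCHAR(64),
            occurred_at TIMESTAMP
        )
        DISTKEY (user_id)
        SORTKEY (occurred_at);
    """)
    # COPY is the idiomatic way to bulk-load Redshift from S3.
    cur.execute("""
        COPY events
        FROM 's3://example-bucket/events/2016-06-01/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS CSV;
    """)

conn.close()
```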


To be thorough, we should also mention the turn-key solutions available which often combine storage with ETL. Some examples in this space are companies like Segment and RJMetrics. Going this route definitely puts you on the “buy” side of the classic buy vs. build question, which is not necessarily a bad thing if you’ve got the cash. While purchasing products like these tends to be faster and less impactful on resources, it’s probably going to be less flexible and also more expensive (over the long run) than if you build something yourself. 


Since we were building a custom system and projected that our needs wouldn’t require the sophistication of a technology like Hadoop, we ultimately chose Amazon Redshift because of its speed, its cost-effectiveness, and our generally positive past experiences with AWS.

Step 2: Set Up an Ingress ETL Pipeline



We now had a shiny new Amazon Redshift instance, empty and waiting. Time to fill it up! This is the start of the ETL stage, where you extract tables from a primary source, like a production database. Along the way, there will likely be transformational improvements you can make, such as adjusting formatting, altering schemas, augmenting with new columns, and generally preparing things for easier querying later. Once this is done, you’ll take the result and load it into your warehouse.
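
To make that flow concrete, here’s a rough sketch of the kind of ingress job described above. It assumes a MySQL read replica as the source, Redshift as the warehouse, and hypothetical table and column names; a real pipeline would add batching, incremental loading, and error handling:

```python
# A minimal ingress ETL sketch: extract rows from a MySQL read replica, apply a
# small transform, and load the result into the warehouse. Connection details
# and table/column names are illustrative placeholders.
import pymysql
import psycopg2

source = pymysql.connect(host="mysql-replica.internal", user="reader",
                         password="...", database="app")
warehouse = psycopg2.connect(host="redshift.internal", port=5439,
                             dbname="analytics", user="etl_user", password="...")

# Extract: pull the raw rows from the primary source.
with source.cursor() as read_cur:
    read_cur.execute("SELECT id, email, created_at FROM users")
    rows = read_cur.fetchall()

# Transform: normalize formatting and add a derived column for easier querying.
transformed = [
    (user_id, email.strip().lower(), created_at, email.split("@")[-1].lower())
    for (user_id, email, created_at) in rows
]

# Load: write the prepared rows into the warehouse table.
# (At scale you'd stage the rows to S3 and COPY them in, as in the earlier sketch.)
with warehouse, warehouse.cursor() as load_cur:
    load_cur.executemany(
        "INSERT INTO dim_users (id, email, created_at, email_domain) "
        "VALUES (%s, %s, %s, %s)",
        transformed,
    )

source.close()
warehouse.close()
```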


Similar to storage solutions, there are a lot of possibilities here for tooling. For real-time ingestion of large quantities of data there are stream-processing options like Kafka and Spark Streaming. Those, combined with something like Logstash, can ingest structured data sources such as JSON-based log files. While these technologies offer some cool benefits, like multiple publishers and subscribers and real-time updates, we ultimately decided they were overkill for our current needs.

A step down in complexity brings you to open-source ETL libraries like Pentaho and Talend. These tools could have worked for us but still felt a bit awkward. The main data source we planned to “ETL” into Redshift was our production MySQL read replica, and we mainly needed a lightweight framework that provided good conventions. Niceties like incremental loading, error handling, and lifecycle hooks are also helpful, but more than anything we needed something rock-solid reliable and well-designed enough to easily extend and customize.
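
As an illustration of the incremental-loading convention we wanted (a generic sketch, not DataDuck ETL’s actual API), each run can use a high-water mark stored in the warehouse so it only pulls rows changed since the last run. Table names, columns, and the cursor arguments are hypothetical:

```python
# A sketch of incremental loading via a high-water mark: each run pulls only
# rows modified since the previous run's maximum updated_at value.
# Table and column names are hypothetical; the cursors come from existing
# source and warehouse connections.

def latest_watermark(warehouse_cur):
    # The warehouse itself records how far previous runs have gotten.
    warehouse_cur.execute("SELECT MAX(updated_at) FROM dim_users")
    (watermark,) = warehouse_cur.fetchone()
    return watermark  # None on the very first run

def extract_changed_rows(source_cur, watermark):
    # Full extract on the first run, delta extract afterwards.
    if watermark is None:
        source_cur.execute("SELECT id, email, updated_at FROM users")
    else:
        source_cur.execute(
            "SELECT id, email, updated_at FROM users WHERE updated_at > %s",
            (watermark,),
        )
    return source_cur.fetchall()

def load(warehouse_cur, rows):
    # Delete-then-insert keeps re-extracted rows from being duplicated.
    for user_id, email, updated_at in rows:
        warehouse_cur.execute("DELETE FROM dim_users WHERE id = %s", (user_id,))
        warehouse_cur.execute(
            "INSERT INTO dim_users (id, email, updated_at) VALUES (%s, %s, %s)",
            (user_id, email, updated_at),
        )

def run(source_cur, warehouse_cur):
    rows = extract_changed_rows(source_cur, latest_watermark(warehouse_cur))
    load(warehouse_cur, rows)
```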


While there may be several such tools that could have done the job, eventually we settled on a Ruby framework called DataDuck ETL (github source). We found it had most of the features we were looking for out-of-the-box and a clean, well-designed interface to continue to build from.


Step 3: Connect a Visualization and Reporting Tool


When it comes to reporting, there are plenty of options to choose from. There are more mature solutions like Tableau as well as some new up-and-comers like Amazon Quicksight, Looker, and ChartIO. This is probably one of the more subjective areas of the project, and I could see people going a variety of ways here. For us, we eventually settled on Tableau due to its relative affordability, strong integration support with multiple platforms, and our existing in-house expertise with it.


It turned out to be an excellent choice. Tableau is fairly robust, with a number of reporting options and features and a broad ecosystem of integrations. Its interface is well-polished and comes with both native clients and cloud-based reporting components. We’re primarily using it for internal dashboards across all our departments and teams. 


It’s worth noting that we’re currently still developing reports “as ordered” by our various departments, and all changes need to go through our BI team. At some point, I could see us wanting to package these reports with additional tools and capabilities. This, combined with the proper training, could enable departments to “own their data” and handle many more customizations themselves. How exactly to do this remains an open question, but it’s definitely something being done successfully at larger companies with dedicated data development teams.

Step 4: Sync Data to Additional Platforms


One of our major goals with this project was to be able to run reports from a central location, but reporting isn’t the only reason to have a single source of truth. It’s also an excellent launching point for outbound data syncing. The information shared in this step can make your paid products far more useful, and the people who use them will love the added context.

One prime example is how we now send subscription data into Salesforce (using their bulk API). This gives our account managers a central place to view our users’ plan and renewal statuses in an environment they’re already extremely comfortable working in. We’ve also begun to form a customer success department, so we started sending usage metrics into a tool called Totango to identify people who may benefit from proactive outreach.
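
As a rough illustration of this kind of outbound sync (not our exact script), the sketch below reads subscription rows from the warehouse and upserts them into Salesforce in bulk. It assumes the simple-salesforce client, a hypothetical Subscription__c object keyed by an External_Id__c field, and placeholder credentials:

```python
# A sketch of an outbound sync: read subscription data from the warehouse and
# upsert it into Salesforce via the Bulk API. Object and field names here
# (Subscription__c, External_Id__c, etc.) are hypothetical placeholders.
import psycopg2
from simple_salesforce import Salesforce

warehouse = psycopg2.connect(host="redshift.internal", port=5439,
                             dbname="analytics", user="etl_user", password="...")
sf = Salesforce(username="ops@example.com", password="...", security_token="...")

with warehouse.cursor() as cur:
    cur.execute("SELECT account_external_id, plan_name, renewal_date FROM subscriptions")
    records = [
        {
            "External_Id__c": external_id,
            "Plan_Name__c": plan_name,
            "Renewal_Date__c": renewal_date.isoformat(),
        }
        for external_id, plan_name, renewal_date in cur.fetchall()
    ]

# Upsert in bulk so one job carries all the records, keyed on the external ID.
sf.bulk.Subscription__c.upsert(records, "External_Id__c")

warehouse.close()
```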


The key here is to favor one-way dataflow whenever possible. If you do need to sync data from a third-party platform back to the warehouse, the data from the external source should be new and authoritative. For example, bringing in support ticket information from Zendesk makes sense, because that’s the only place that information originates. Updating a customer’s name from Zendesk, on the other hand, makes less sense, because it could conflict with data in our production database. In short, knowing which sources are authoritative for which data will help you avoid reconciliation headaches in the warehouse.
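
For the inbound direction, here’s a minimal sketch of pulling ticket data from Zendesk’s incremental export endpoint into the warehouse, keeping the flow one-way. The subdomain, credentials, start time, and destination table are placeholders:

```python
# A sketch of one-way inbound sync: Zendesk is the authoritative source for
# ticket data, so tickets flow into the warehouse and nothing flows back.
# The subdomain, credentials, start time, and destination table are placeholders.
import requests
import psycopg2

ZENDESK_URL = "https://example.zendesk.com/api/v2/incremental/tickets.json"

warehouse = psycopg2.connect(host="redshift.internal", port=5439,
                             dbname="analytics", user="etl_user", password="...")

resp = requests.get(
    ZENDESK_URL,
    params={"start_time": 1464739200},            # resume point from the last run
    auth=("ops@example.com/token", "zendesk-api-token"),
)
resp.raise_for_status()
tickets = resp.json().get("tickets", [])

with warehouse, warehouse.cursor() as cur:
    for ticket in tickets:
        cur.execute(
            "INSERT INTO support_tickets (id, status, subject, updated_at) "
            "VALUES (%s, %s, %s, %s)",
            (ticket["id"], ticket["status"], ticket["subject"], ticket["updated_at"]),
        )

warehouse.close()
```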


Fortunately, with a tool like Tableau, you often have an alternative to merging new data sources on the backend. Tableau can connect natively to many popular platforms, like Zendesk and Salesforce, and join their tables to your warehouse data through common, unique fields directly within the report. The report can then be pushed to the cloud and scheduled to pull new data on an ongoing basis so it’s always up to date. We use this strategy frequently, and it’s an excellent way to simplify your work without making any real concessions.

Step 5: Check Your Data Integrity


Without a doubt, you will need a quality-control stage for identifying parts of your system that just aren’t functioning properly and are producing bad data. The root cause will vary: it may be poor validation on input, a mistaken transformation in the ETL stage, or a bug in the production code base. This is probably the least fun part of the process, but it’s also one of the most essential.

Here are a few tips that you can keep in mind to improve integrity:


  • Check your numbers for sanity and accuracy, and cross-check them with supporting sources if possible (a minimal cross-check sketch follows this list).
  • Try to institute “unit” tests of a sort, where you introduce known state at the start of the system and log it as it travels through your warehouse and, eventually, into your reporting tools.
  • If you had previous BI processes in place, leave them up long enough to check your new numbers against them. You can then ensure they match before deprecating the old way.
  • When you identify irregularities, try to trace them to their source and fix the underlying issue. If you know the source, you may also be able to retroactively repair data that has been corrupted.
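
Here’s a minimal sketch of the cross-checking mentioned in the first tip: run the same counts against the production read replica and the warehouse, then flag any mismatch. Connection details and table names are placeholders:

```python
# A sketch of a basic integrity check: run the same counts against the
# production read replica and the warehouse, then flag any mismatches.
# Connection details and table names are illustrative placeholders.
import pymysql
import psycopg2

source = pymysql.connect(host="mysql-replica.internal", user="reader",
                         password="...", database="app")
warehouse = psycopg2.connect(host="redshift.internal", port=5439,
                             dbname="analytics", user="etl_user", password="...")

CHECKS = [
    # (description, query on the replica, query on the warehouse)
    ("total users",
     "SELECT COUNT(*) FROM users",
     "SELECT COUNT(*) FROM dim_users"),
    ("signups on 2016-05-31",
     "SELECT COUNT(*) FROM users "
     "WHERE created_at >= '2016-05-31' AND created_at < '2016-06-01'",
     "SELECT COUNT(*) FROM dim_users "
     "WHERE created_at >= '2016-05-31' AND created_at < '2016-06-01'"),
]

for description, source_query, warehouse_query in CHECKS:
    with source.cursor() as src_cur:
        src_cur.execute(source_query)
        (source_count,) = src_cur.fetchone()
    with warehouse.cursor() as wh_cur:
        wh_cur.execute(warehouse_query)
        (warehouse_count,) = wh_cur.fetchone()
    status = "OK" if source_count == warehouse_count else "MISMATCH"
    print(f"{status}: {description} (source={source_count}, warehouse={warehouse_count})")

source.close()
warehouse.close()
```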



Like QA for software development, this stage never really ends. Whenever changes are made that touch the data pipeline, or new services become integrated, updates will likely need to be made and checked for accuracy.


Step 6: Supplement, Maintain, Improve


In addition to the primary pipeline, there may be ancillary systems that support your goals and require minimal overhead to implement.

A perfect example of this for us is the frontend analytics platform Heap (which, incidentally, also supports a native connection to Redshift). Heap enables our Product department to own frontend usage and engagement metrics while keeping data collection and analysis separate from engineering.

Other useful tools in this category include Google Tag Manager for distributing front-end analytics tags and Optimizely for A/B testing.


Conclusion and Parting Thoughts


While by no means finished, our new pipeline has already begun to provide much-appreciated clarity and insight into a number of processes. Our automated scripts run once a night, bringing fresh data to our reports and business products every day. For version “1.0”, we’re quite satisfied with how it’s worked out, and even more encouraging is that we feel we’ve only begun to realize its full potential.

Additionally, as we continue to grow, there will be new data sources to integrate and new employees to deliver reports to. We’ve already begun to talk about new ways these efforts could be expanded, and we feel well-positioned to bring those ideas to life.
 
 
This post was originally published by The Next Web.
