How to Avoid “Data Wrangling” in your Data Projects
Guest blogger Paul Laughlin considers how, despite the amount of media coverage for Deep Learning and other more advanced techniques, most Data Science teams are still struggling with basic data problems. We refer to this as “data wrangling”.
Even well-established analytics teams can still lack the Single Customer View, easily accessible Data Lake or analytical playpen that they need for their work.
Insight Leaders also regularly express frustration that they and their teams are still bogged down in data ‘fire fighting’, rather than getting onto analytical work that could be transformative.
Part of the problem may be a lack of focus in this area. Data and Data Management are often still considered the least sexy part of Customer Insight or Data Science. All too often, leaders lack the clear data plans, models or strategy needed to develop the data ecosystem (including infrastructure) that will enable all other work by the team.
Back in 2015, we conducted a poll of leaders, asking about use of data models & metadata. Shockingly, none of those surveyed had Conceptual Data Models in place and half also lacked Logical Data Models. Exacerbating this lack of a clear, technology-independent understanding of their data, all leaders surveyed cited a lack of effective metadata. If this is still the case in your company, then without these tools your data management is in danger of considerable rework and of feeling like a frustrating, best-endeavours DIY job.
So, what are the common data problems I hear, when meeting data leaders across the country? Here is the one that crops up most often:
Too much time taken up on data prep
I was reminded of this often-cited challenge by a post on LinkedIn from Martin Squires, experienced leader of the Boots insight team. Sharing a post originally published in Forbes magazine, Martin reflected on how little has changed in two decades. The survey showed that, just as Martin & I found 20 years ago, over 60% of data scientists’ time is taken up with cleaning & organising data.
Why has this continued to be the case for so many years? Here are some common causes:
- Under investment in technology whose benefit is not seen outside of analytics teams (Data Lakes/ETL software)
- Lack of transparency to internal customers as to the amount of time taken up in data prep (inadequate briefing process)
- Lack of consequences for IT or internal customers if they let the situation continue (share the pain)
On that last point, I want to reiterate advice given to coaching clients. Ask yourself honestly, are you your own worst enemy by ‘keeping the show on the road’ despite these data barriers? Have you ever considered letting a piece of work or regular job fail, so as to highlight technology problems that your team are currently masking by manual workarounds? It’s worth considering as a tactic.
Beyond that more radical approach, what can data leaders do to overcome these problems & achieve delivery of successful data projects to reduce the data wrangling workload? Here are 3 tips that I hope help set you on the right path.
Create a playpen to enable play to prioritise data needed
Here, once again, language can confuse or divide. Whether one talks about Data Lakes or, less impressively, ‘playpens’ or ‘sandpits’ within a server or data warehouse, common benefits can be realised.
Over a decade working across IT roles, followed by leading data projects from the business side, taught me that one of the biggest causes of delay and mistakes was data mapping work. The arduous task of accurately mapping all the data required by a business, from source systems through any required ETL (Extract, Transform & Load) layers, onto the analytics database solution is fraught with problems.
All too often this is the biggest cost & cause of delays or rework for data projects. Frustratingly, those who do audit usage afterwards can find that not all the data loaded is actually used. So, after frustration for both IT and Insight teams, it is discovered too late that only a subset of the data really added value to their work.
This is where free-format Data Lakes or Playpens can really add value. They should be used to enable IT to dump data there with minimal effort, or to enable insight teams to access potential data sources for one-off extracts to the playpen. Here, as that name suggests, analysts or data scientists have the opportunity to play with the data. However, this is far more valuable work than that sounds. Better language is perhaps a ‘data lab’. Here, the business experts have the opportunity to trial different potential data feeds & the variables within them, to learn which are actually useful, predictive or used for analysis or modelling that will add value.
The great benefit of this approach is that it offers a lower cost & more flexible way of de-scoping the data variables & data feeds actually required in live systems. Reducing those can radically increase the speed of delivery for new data warehouses or releases of changes/upgrades.
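As a rough illustration of that de-scoping step, here is a minimal Python sketch. It assumes the team keeps a simple record of which playpen variables each piece of analysis actually used. The variable names, the analysis names and the `split_by_usage` helper are all hypothetical, invented purely for this example.

```python
from collections import Counter

# Hypothetical record of which playpen variables each analysis actually used.
analysis_usage = {
    "churn_model": ["tenure_months", "last_contact_date", "product_count"],
    "cross_sell_model": ["product_count", "tenure_months", "channel_pref"],
    "complaints_review": ["last_contact_date", "channel_pref"],
}

# Hypothetical list of every variable that was dumped into the playpen.
loaded_variables = [
    "tenure_months", "last_contact_date", "product_count",
    "channel_pref", "fax_number", "legacy_segment_code",
]


def split_by_usage(loaded, usage, min_uses=1):
    """Return (keep, descope) lists based on how many analyses used each variable."""
    # Count each variable at most once per analysis.
    counts = Counter(var for used in usage.values() for var in set(used))
    keep = [v for v in loaded if counts[v] >= min_uses]
    descope = [v for v in loaded if counts[v] < min_uses]
    return keep, descope


keep, descope = split_by_usage(loaded_variables, analysis_usage)
print("Candidates for live feeds:", keep)
print("Candidates to de-scope:", descope)
```

In practice the threshold (`min_uses`) and the usage record itself would need agreeing with your analysts; a variable used only once may still be the most predictive one in a key model.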
Recruit and develop Data Specialist roles outside of IT
The approach proposed above, together with innumerable change projects across today’s businesses, need to be informed by someone who knows what each data item means. That may sound obvious, but too few businesses have clear knowledge management or career development strategies to meet that need.
Decades ago, small IT teams contained long serving experts who had built all the systems used & were actively involved with fixing any data issues that arose. If they were also sufficiently knowledgeable about the business and how each data item was used by different teams, they could potentially provide the data expertise I propose. However, those days have long gone.
Most corporate IT teams are now closer to the proverbial baked bean factory. They may have the experience & skills needed to deliver the data infrastructure. But they lack any depth of understanding of the data items (or blood) that flow through those arteries. If the data needs of analysts or data scientists are to be met, they need to be able to talk with experts in data models, data quality & metadata. Together they can discuss what analysts are seeking to understand or model in the real world of a customer, and translate that into the most accurate & accessible proxy within the data variables available.
So, I recommend insight leaders seriously consider the benefit of ‘in-house’ data management teams, with real specialism in understanding data and curating it to meet team needs. We’ve previously posted some hints for getting the best out of these teams.
Grow incrementally, delivering value each time, to justify investment
I’m sure all change leaders and most insight leaders have heard the advice on how to eat an elephant or deliver major change. That rubric, to deliver one bite at a time, is as true as ever.
Although it can help for an insight leader to take time out, step back & consider all their data needs/gaps – they also need to be pragmatic about the best approach to deliver those. Using the data lake approach & data specialists mentioned above, time should be taken to prioritise data requirements.
Investigating data requirements so as to be able to score each against both potential business value & ease of implementation (classic Boston Consulting Grid style), can help with scoping decisions. But, I’d also counsel against just selecting randomly the most promising & easiest to access variables.
Instead, think in terms of ‘use cases’. Most successful insight teams have grown incrementally, by proving the value they can add to a business one application at a time. So, dimensions like the relative urgency & importance of business problems, as well as the openness of leaders, come into play as well.
Your first iteration should be a project that invests in extra data and then proves its value to the business, in order to secure budget for the next wave. For that project, look for the following characteristics:
- Analysis using Data Lake/Playpen has shown potential
- Relatively easy to access data & not too many variables (in the ‘quick win’ category for IT team)
- Important business problem that is widely seen as a current priority to fix (with rapid impact able to be measured)
- Good stakeholder relationship with business leader in application area (current or potential advocate)
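For those who like to make such a screening explicit, the checklist above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `UseCase` fields, the 1 to 5 scales and the choice to rank by value multiplied by ease are my own simplifications, and the candidate use cases are made up.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    name: str
    business_value: int     # assumed scale: 1 (low) to 5 (high)
    ease_of_delivery: int   # assumed scale: 1 (hard) to 5 (easy 'quick win' for IT)
    playpen_evidence: bool  # analysis in the Data Lake/Playpen has shown potential
    current_priority: bool  # widely seen as a business problem to fix now
    has_advocate: bool      # good relationship with a business leader in the area


def shortlist(use_cases):
    """Keep use cases that meet every yes/no criterion, ranked by value x ease."""
    eligible = [
        u for u in use_cases
        if u.playpen_evidence and u.current_priority and u.has_advocate
    ]
    return sorted(
        eligible,
        key=lambda u: u.business_value * u.ease_of_delivery,
        reverse=True,
    )


# Made-up candidates, purely for illustration.
candidates = [
    UseCase("retention targeting", 5, 3, True, True, True),
    UseCase("branch footfall model", 4, 2, True, False, True),
    UseCase("complaint root causes", 3, 4, True, True, True),
    UseCase("pricing elasticity", 5, 1, False, True, False),
]

for u in shortlist(candidates):
    print(u.name, u.business_value * u.ease_of_delivery)
```

The scores themselves matter less than the conversation needed to agree them with IT and business stakeholders; treat the ranking as a prompt for discussion rather than a decision.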
How is your Data Wrangling going?
Do your analysts spend too much time hunting down the right data & then corralling it into the form needed for required analysis? Have you overcome the time burnt by data prep? If so, what has worked for you and your team?
Let’s not wait another 20 years to stop the data wrangling drain. There is too much potentially valuable insight or data science work to be done.