#301 Learnings From 25+ Years in Data Quality - Interview w/ Olga Maydanchik
Please Rate and Review us on your podcast app of choice!
Get involved with Data Mesh Understanding's free community roundtables and introductions: https://landing.datameshunderstanding.com/
If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see here
Episode list and links to all available episode transcripts here.
Provided as a free resource by Data Mesh Understanding. Get in touch with Scott on LinkedIn.
Transcript for this episode (link) provided by Starburst. You can download their Data Products for Dummies e-book (info-gated) here and their Data Mesh for Dummies e-book (info-gated) here.
Olga's LinkedIn: https://www.linkedin.com/in/olga-maydanchik-23b3508/
Walter Shewhart - Father of Statistical Quality Control: https://en.wikipedia.org/wiki/Walter_A._Shewhart
William Edwards Deming - Father of Quality Improvement/Control: https://en.wikipedia.org/wiki/W._Edwards_Deming
Larry English - Information Quality Pioneer: https://www.cdomagazine.tech/opinion-analysis/article_da6de4b6-7127-11eb-970e-6bb1aee7a52f.html
Tom Redman - 'The Data Doc': https://www.linkedin.com/in/tomredman/
In this episode, Scott interviewed Olga Maydanchik, an Information Management Practitioner, Educator, and Evangelist.
Some key takeaways/thoughts from Olga's point of view:
- Learn your data quality history. There are people who have been fighting this good fight for 25+ years, or for over a century if you count statistical quality control. Don't needlessly reinvent some of it :)
- Data literacy is a very important aspect of data quality. If people don't understand the costs of bad quality, they are far less likely to care about quality.
- Data quality can be a tricky topic - if you let consumers know that the data quality isn't perfect, they can lose trust. But A) in general, that conversation is getting better/easier to have and B) we _have_ to be able to identify quality as a problem in order to fix it.
- Data quality is NOT a project - it's a continuous process.
- Even now, people are finding it hard to use the well-established data quality dimensions. They are a framework for considering/measuring/understanding data quality, so they are not very helpful to data stewards / data engineers in creating data quality rules.
- The majority of quality errors are not random; they come from faulty data mapping / bugs in pipelines. Having good quality rules will catch a large percentage of errors that can be fixed in bulk.
- Getting started with data quality doesn't have to be complex or involve lots of tools. It can be people looking at the data for potential issues and talking to producers. Then you can build a business case for fixing the data to get funding. You have to roll up your sleeves and talk to people but you can get forward momentum.
- Data quality issues aren't inherently material to the business processes - they are only bad when they cause issues for the business. You have to find those actual business issues to get people to care and to get funding for fixing them. Quality for the sake of quality is just extra cost. Do not create lots of data quality rules that do not matter.
- Relatedly, being able to show someone a relatively basic quality indicator early is far better than asking for a lot of budget to figure out the quality levels. You can do that with something as simple as randomly sampling 100-200 records and an hour of 1-2 people's time.
- To understand which data quality challenges and use cases are the most important, data people simply have to learn more about the business. Good data quality is about fit for purpose and that means understanding the purposes :)
- To find your initial good data quality use cases, look to mission criticality. What dashboards or reports are actually important to the company and why? Then work backwards to see if quality is an issue for those dashboards and reports. That's how you find your early buy-in to work on a quality initiative that can scale.
- !Controversial!: Data contracts are not at all new, we just now have a good enough set of tools and technologies to be able to do them better at scale.
- ?Controversial?: Most are doing data contracts … not that well. For them, it's about the technology and not the process. There isn't a continuous approach. Scott note: Andrew Jones has said the same. It's about ensuring a process that results in quality data, not the tools.
- For data contracts, there MUST be a feedback loop or we aren't actually delivering to needs, especially as needs evolve. Look to the widely used customer supply model for insights into what we need to achieve and how when it comes to data contracts.
- Many companies are creating actual financial incentives tied to data quality in order to ensure people care about data quality. That's not right for every organization but it does send a clear message as to the importance of data quality.
- You have to consider your data supply chain - if your interface for data input is bad, your data is very likely to be bad. People will simply enter garbage to move forward.
- Doing data quality manually is not sustainable/scalable. But you don't need to start with expensive tools; you can get your arms around things initially pretty easily. That will help you identify your actual problems instead of spending time specifically on tools.
- ?Controversial?: Many vendors are selling their tools as the fix to data quality. But detecting data errors with the tools is only the start of data quality improvement. Once errors are detected, root cause analysis for the errors needs to be performed and the processes / code need to be fixed. None of the data quality tools can do this. It is a human's job. Beware the snake oil.
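To make the "sample 100-200 records" and "errors are not random" takeaways concrete, here is a minimal Python sketch of a first-pass quality indicator. It is an illustration only, not something described in the episode: the record fields (`email`, `age`, `source`), the two rules, and all function names are hypothetical, and the sample data is made up.

```python
import random
from collections import Counter


def sample_records(records, sample_size=150, seed=0):
    """Return a simple random sample (without replacement) for review."""
    rng = random.Random(seed)
    return rng.sample(records, min(sample_size, len(records)))


# Hypothetical rules: each predicate returns True when the record passes.
RULES = {
    "email_present": lambda r: bool(r.get("email")),
    "age_in_range": lambda r: r.get("age") is not None and 0 <= r["age"] <= 120,
}


def quality_indicator(sample, rules):
    """Return (share of records passing every rule, failure count per rule)."""
    failures = Counter()
    clean = 0
    for record in sample:
        failed = [name for name, check in rules.items() if not check(record)]
        failures.update(failed)
        if not failed:
            clean += 1
    pass_rate = clean / len(sample) if sample else 0.0
    return pass_rate, failures


def failures_by_source(sample, rules):
    """Count failing records per source to spot systematic (non-random) errors."""
    counts = Counter()
    for record in sample:
        if any(not check(record) for check in rules.values()):
            counts[record.get("source", "unknown")] += 1
    return counts


if __name__ == "__main__":
    # Made-up data: the "crm" feed drops emails, standing in for a faulty mapping.
    records = [
        {
            "id": i,
            "source": "crm" if i % 4 == 0 else "web",
            "email": None if i % 4 == 0 else f"user{i}@example.com",
            "age": 30,
        }
        for i in range(1000)
    ]
    sample = sample_records(records)
    pass_rate, failures = quality_indicator(sample, RULES)
    print(f"pass rate: {pass_rate:.0%}")
    print("failures per rule:", dict(failures))
    print("failing records per source:", dict(failures_by_source(sample, RULES)))
```

If nearly all failures cluster in one source, that points to a mapping or pipeline bug that can be fixed in bulk rather than record by record. The code only detects the pattern; finding and fixing the root cause is still a human's job.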
Learn more about Data Mesh Understanding: https://datameshunderstanding.com/about
Data Mesh Radio is hosted by Scott Hirleman. If you want to connect with Scott, reach out to him on LinkedIn: https://www.linkedin.com/in/scotthirleman/
If you want to learn more and/or join the Data Mesh Learning Community, see here: https://datameshlearning.com/community/
All music used this episode was found on PixaBay and was created by (including slight edits by Scott Hirleman): Lesfm, MondayHopes, SergeQuadrado, ItsWatR, Lexin_Music, and/or nevesf