Data quality management: What you need to know
By: John Bauman, SAS Insights Editor
As organizations collect more data, managing the quality of that data becomes more important every day. After all, data is the lifeblood of your organization. Data quality management helps by combining organizational culture, technology and data to deliver results that are accurate and useful.
Data quality is not good or bad, high or low. It’s a range, or measure, of the health of the data pumping through your organization. For some processes, a marketing list with 5 percent duplicate names and 3 percent bad addresses might be acceptable. But if you’re meeting regulatory requirements, the risk of fines demands higher levels of data quality.
Data quality management provides a context-specific process for improving the fitness of data that’s used for analysis and decision making. The goal is to create insights into the health of that data using various processes and technologies on increasingly bigger and more complex data sets.
Why do we need data quality management?
Data quality management is an essential process in making sense of your data, which can ultimately help your bottom line.
First, good data quality management builds a foundation for all business initiatives. Outdated or unreliable data can lead to mistakes and missteps. A data quality management program establishes a framework for all departments in the organization that provides for – and enforces – rules for data quality.
Second, accurate and up-to-date data provides a clear picture of your company’s day-to-day operations so you can be confident in upstream and downstream applications that use all that data. Data quality management also cuts down on unnecessary costs. Poor quality can lead to costly mistakes and oversights, like losing track of orders or spending. Data quality management builds an information foundation that allows you to understand your organization and expenses by having a firm grasp on your data.
Finally, you need data quality management to meet compliance and risk objectives. Good data governance requires clear procedures and communication, as well as good underlying data. For example, a data governance committee may define what should be considered “acceptable” for the health of the data. But how do you define it in the database? How do you monitor and enforce the policies? Data quality is an implementation of the policy at the database level.
Data quality is an important part of implementing a data governance framework. And good data quality management supports data stewards in doing their jobs.
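As a rough illustration of what "an implementation of the policy at the database level" could look like, the sketch below stores acceptable thresholds as data and compares measured rates against them. The marketing-list limits reuse the 5 percent and 3 percent figures mentioned earlier; the regulatory limits, the metric names and the `policy_violations` helper are assumptions made up for this example, not part of any particular product.

```python
# Thresholds a governance committee might agree on, keyed by use case.
# The regulatory limits are hypothetical placeholders for illustration.
POLICIES = {
    "marketing_list": {"duplicate_rate": 0.05, "bad_address_rate": 0.03},
    "regulatory_report": {"duplicate_rate": 0.0, "bad_address_rate": 0.001},
}


def policy_violations(measured, use_case):
    """Return {metric: (measured value, limit)} for every limit exceeded."""
    limits = POLICIES[use_case]
    return {
        metric: (measured[metric], limit)
        for metric, limit in limits.items()
        if measured[metric] > limit
    }


# Rates measured on one (hypothetical) data set.
measured = {"duplicate_rate": 0.04, "bad_address_rate": 0.02}

marketing_result = policy_violations(measured, "marketing_list")
regulatory_result = policy_violations(measured, "regulatory_report")
print(marketing_result)
print(regulatory_result)
```

The same measurements pass one policy and fail the other, which is the sense in which data quality is context-specific rather than simply good or bad.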
The dimensions of data quality management
There are several data quality dimensions in use. This list continues to grow as data grows in size and diversity; however, a few of the core dimensions remain constant across data sources.
- Accuracy measures the degree to which data values are correct – and is paramount to the ability to draw accurate conclusions from your data.
- Completeness means all data elements have tangible values.
- Consistency focuses on uniform data elements across different data instances, with values taken from a known reference data domain.
- Age addresses the fact that data should be fresh and current, with values that are up to date across the board.
- Uniqueness demonstrates that each record or element is represented once within a data set, helping avoid duplicates.
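Several of the dimensions above can be expressed as simple ratios. The following is a minimal sketch under stated assumptions: the records, field names, reference domain and 365-day freshness window are all invented for illustration. Accuracy is left out because measuring it requires a trusted source to compare against.

```python
from datetime import date

# Hypothetical customer records; None marks a missing value.
records = [
    {"id": 1, "email": "ann@example.com", "country": "US", "updated": date(2024, 5, 1)},
    {"id": 2, "email": None, "country": "US", "updated": date(2021, 1, 15)},
    {"id": 3, "email": "bob@example.com", "country": "USA", "updated": date(2024, 4, 20)},
    {"id": 4, "email": "ann@example.com", "country": "DE", "updated": date(2024, 3, 2)},
]

REFERENCE_COUNTRIES = {"US", "DE", "FR"}  # assumed reference data domain


def completeness(rows, field):
    """Share of rows where the field has a value."""
    return sum(r[field] is not None for r in rows) / len(rows)


def consistency(rows, field, domain):
    """Share of rows whose value comes from the reference domain."""
    return sum(r[field] in domain for r in rows) / len(rows)


def uniqueness(rows, field):
    """Share of non-missing values that are distinct."""
    values = [r[field] for r in rows if r[field] is not None]
    return len(set(values)) / len(values)


def freshness(rows, field, as_of, max_age_days):
    """Share of rows updated within max_age_days of as_of (the Age dimension)."""
    return sum((as_of - r[field]).days <= max_age_days for r in rows) / len(rows)


scores = {
    "completeness(email)": completeness(records, "email"),
    "consistency(country)": consistency(records, "country", REFERENCE_COUNTRIES),
    "uniqueness(email)": uniqueness(records, "email"),
    "age(updated)": freshness(records, "updated", date(2024, 6, 1), 365),
}
for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

Each score falls between 0 and 1, so the same functions can feed whatever thresholds a given use case considers acceptable.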
Key features of data quality management
A good data quality program uses a system with a variety of features that help improve the trustworthiness of your data.
First, data cleansing helps correct duplicate records, nonstandard data representations and unknown data types. Cleansing enforces the data standardization rules that are needed to deliver insights from your data sets. This also establishes data hierarchies and reference data definitions to customize data to fit your unique needs.
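To make the cleansing step concrete, here is a small sketch that applies a few standardization rules and then removes duplicates. The rules, the alias table and the choice of email as the matching key are assumptions for this example; real cleansing tools typically use far richer matching logic.

```python
# Assumed alias table mapping nonstandard country spellings to one code.
COUNTRY_ALIASES = {"usa": "US", "united states": "US", "us": "US"}


def standardize(record):
    """Trim whitespace, normalize case and map country aliases."""
    name = " ".join(record["name"].split()).title()
    email = record["email"].strip().lower()
    country_raw = record["country"].strip()
    country = COUNTRY_ALIASES.get(country_raw.lower(), country_raw.upper())
    return {"name": name, "email": email, "country": country}


def deduplicate(records, key="email"):
    """Keep the first record seen for each value of the key field."""
    seen = {}
    for rec in records:
        seen.setdefault(rec[key], rec)
    return list(seen.values())


raw = [
    {"name": "  ann   lee ", "email": "Ann@Example.com ", "country": "usa"},
    {"name": "Ann Lee", "email": "ann@example.com", "country": "US"},
    {"name": "bob RAY", "email": "bob@example.com", "country": "de"},
]

# Standardize first so that records differing only in formatting match.
cleaned = deduplicate([standardize(r) for r in raw])
for rec in cleaned:
    print(rec)
```

Standardizing before deduplicating matters here: the first two raw records only collapse into one once their email addresses share the same representation.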
Data profiling, the act of examining data and summarizing its characteristics, is used to validate data against standard statistical measures, uncover relationships and verify data against matching descriptions. Data profiling steps will establish trends to help you discover, understand and potentially expose inconsistencies in your data.
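One simple form of checking data against standard statistical measures is to summarize a numeric column and flag values that sit far from the mean. The sketch below assumes a hypothetical list of order amounts and a two-standard-deviation cutoff; both are illustrative choices, and other outlier tests may suit real data better.

```python
from statistics import mean, pstdev


def profile_numeric(values, z_cutoff=2.0):
    """Summarize a numeric column and flag values far from the mean."""
    present = [v for v in values if v is not None]
    mu = mean(present)
    sigma = pstdev(present)  # population standard deviation
    outliers = [v for v in present if sigma and abs(v - mu) > z_cutoff * sigma]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": mu,
        "outliers": outliers,
    }


# Hypothetical order amounts; None marks a missing value.
order_amounts = [20, 22, 19, 21, 20, 23, 18, 500, None]

profile = profile_numeric(order_amounts)
print(profile)
```

A profile like this does not say whether the flagged value is wrong, only that it deserves a closer look.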
Validating business rules, and creating a business glossary and lineage, help you act on poor-quality data before it harms your organization. This entails creating descriptions and requirements for system-to-system business term translations. Data can also be validated against standard statistical measures or customized rules.
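Customized rules of the kind described above can be kept as named checks and run against each record. This is a minimal sketch: the rule names, the order fields and the deliberately loose email pattern are assumptions for illustration, not a recommended rule set.

```python
import re
from datetime import date

# Deliberately simple pattern: something@something.something
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

# Each hypothetical business rule maps a name to a check returning True on pass.
RULES = {
    "email_format": lambda r: EMAIL_PATTERN.fullmatch(r["email"] or "") is not None,
    "amount_positive": lambda r: r["amount"] > 0,
    "ship_after_order": lambda r: r["shipped"] >= r["ordered"],
}


def validate(record, rules=RULES):
    """Return the names of the rules the record fails, in rule order."""
    return [name for name, check in rules.items() if not check(record)]


good_order = {
    "email": "ann@example.com",
    "amount": 25.0,
    "ordered": date(2024, 5, 1),
    "shipped": date(2024, 5, 3),
}
bad_order = {
    "email": "ann@example",
    "amount": 25.0,
    "ordered": date(2024, 5, 2),
    "shipped": date(2024, 5, 1),
}

good_failures = validate(good_order)
bad_failures = validate(bad_order)
print(good_failures)
print(bad_failures)
```

Keeping rules as named entries makes it easier to tie each failure back to a term in the business glossary.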
In addition to those key features, having a centralized view of enterprise activity through a data management console is a key way to make the process simpler.
How important is data quality management for big data?
Big data has been, and will continue to be, a disruptive influence on businesses. Consider the massive volumes of streaming data from connected devices in the Internet of Things. Or consider the numerous shipment tracking points that flood business servers and must be combed through for analysis. With all that big data come bigger data quality management problems. These can be summed up in three main points.
Repurposing
These days, the same data sets are routinely repurposed in different contexts. This can give the same data different meanings in different settings, which raises questions about data validity and consistency. You need good data quality to grasp these structured and unstructured big data sets.
Validating
When using the externally created data sets that are commonplace in big data, it can be hard to embed controls for validation. Correcting the errors will make the data inconsistent with its original source, but maintaining consistency can mean making some concessions on quality. This issue of balancing oversight with big data sets begs for data quality management features that can provide a solution.
Rejuvenation
Data rejuvenation extends the lifetime of historical information that previously may have been left in storage, but it also increases the need for validation and governance. New insights can be extracted from old data – but first, that data must be correctly integrated into newer data sets.
Where and when should data quality happen?
You can best observe data quality management in action through the lens of a modern-day data problem. In real-life applications, different data problems require different latencies.
For example, there is a real-time need for data quality when you’re processing a credit card transaction. This could flag fraudulent purchases, aiding both customers and businesses. But if you’re updating loyalty cards and reward points for that same customer, you can do overnight processing for this less-pressing task. In both cases, you’re applying the principles of data quality management in the real world. At the same time, you are recognizing the needs of your customers and approaching the task in the most efficient and helpful way possible.
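The split between real-time and overnight processing can be sketched as two code paths: a check that runs inline before a transaction is approved, and a queue that is drained later. Everything here is an assumption for illustration, including the field names, the two checks and the one-point-per-whole-currency-unit loyalty rule; real fraud screening is far more involved.

```python
from collections import deque

# Loyalty updates wait here until the nightly batch runs.
batch_queue = deque()


def check_transaction(txn):
    """Real-time quality checks that must pass before approval."""
    problems = []
    if txn["amount"] <= 0:
        problems.append("non-positive amount")
    if len(txn["card_last4"]) != 4 or not txn["card_last4"].isdigit():
        problems.append("malformed card digits")
    return problems


def handle(txn):
    """Approve or reject immediately; defer the loyalty update."""
    problems = check_transaction(txn)
    if problems:
        return "rejected", problems
    batch_queue.append((txn["customer"], int(txn["amount"])))
    return "approved", []


def run_nightly_batch():
    """Drain the queue and total loyalty points per customer."""
    totals = {}
    while batch_queue:
        customer, points = batch_queue.popleft()
        totals[customer] = totals.get(customer, 0) + points
    return totals


r1 = handle({"customer": "c1", "amount": 42.5, "card_last4": "1234"})
r2 = handle({"customer": "c2", "amount": -5, "card_last4": "12a4"})
r3 = handle({"customer": "c1", "amount": 10.0, "card_last4": "1234"})
totals = run_nightly_batch()
print(r1, r2, r3, totals)
```

The design choice is that only the checks a decision depends on run in the low-latency path, while everything else is deferred to cheaper batch processing.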