After reviewing our list of data quality dimensions to consider if your data is clean, use this post to get started with data cleaning in your organization.
Business data is typically relied upon for analytics and business intelligence. If your data is not of high enough quality for those use cases, data cleaning can reduce errors and improve the documentation and presentation of the data. As you manage your data, make sure that the causes of errors are not only detected and corrected, but the process of collecting and managing the data is improved to prevent future errors and maintain a sound data system.
Common causes of data quality issues
Data Conversion. Most often your data came from somewhere else before it was in your database. During transfers, data can get lost or modified in ways that make it unusable. If databases are mapped using old data structures or conversions, the data in the new database will be incorrect. Often users do not see what is what is actually being stored, so the data structure and mapping between the old and new databases are often the culprit for errors in the data.
Merging databases and other system consolidations. When you phase out or combine an old system, a poorly planned merge can leave you with little time to plan or prevent errors. As the data is moving into a non-empty database, there is little flexibility for changes in the data structure. Often data will not fit in with the new structure, and duplicates and conflicts will result.
Manual data entry. Much data is typed into databases by people, and mistakes are inevitable. People misspell entries, use the wrong format for a field, enter data into the wrong box, or input the wrong value. Because these errors are not systematic, they can be difficult to trace or correct.
Batch feeds. A batch feed carries large volumes of data, and any problem in it can cause issues magnified by future feeds. The problems may accumulate and become difficult to track down and fix. If an error finds its way into the source system, it can flow through the batch feeds and go unnoticed.
Real time interfaces. The problem with data exchanged through real-time interfaces is that there is little time to verify that the data is accurate. There are typically multiple points in the capture process where real time data can be corrupted or lost. The data comes in small packets, and can be incorrect, leaving you with unreliable data.
Disconnect between data priorities with business priorities. Data priorities and business priorities should be in alignment. Avoid collecting and managing data that is not important to your business and which can corrupt other data elements. Prioritize data that is important to the business, and keeping the quality of that data high.
Before you start a data cleaning process, make a plan
Audit and organize the data. Understanding your data before cleaning improves the efficiency of your project and reduces the time and cost of data cleaning. Understand the purpose, location, flow, and workflows of your data before you start.
Document data quality requirements and define rules for measuring quality. Create a reference for success, and targets to keep the project in check along the way. Set statistical checks on the data, and set a standard of quality control and completeness.
Create a strategy. Outline a plan for your data quality that supports ongoing operations and data management. Identify the data sets that meet your quality standard, and the data sets that need to be cleaned. Identify possible solutions with a plan for implementation. Your general plan should be to define, identify, correct, and document data errors, and modify procedures to avoid errors in the future.
Identify and correct errors in data. The method to error detection will depend on the database and the dataset you are using. Depending on your team and the issues you have encountered, there are a variety of free or open source tools or services and enterprise solutions for cleaning your data. The pandas python library is an open-source software library for data manipulation and analysis. You can use CSVKit for converting and working with CSV files. R is a popular system for data cleaning, and features such as R plyr, Reshape2 or ggplot2 can be used for cleaning data. Trifacta Data Wrangler, Informatica and Trillium Software are among a few of the companies that offer data cleaning services.
Coming up Next: I will cover the importance and details of documentation and prevention methods in the next post.