Data Cleaning

Work illustrations by Storyset

Uncover your data

Companies across nearly all industries are now recognizing the competitive edge and insights that their data can offer, and new analytical tools make these data-driven insights far more accessible, especially for firms without extensive expertise in data analytics.



Data are getting more important every day, but even the most well-kept data are often unclean. That is, they are often riddled with:

  • Incorrect Values
  • Extreme Outliers
  • Missing Values and Rows
  • Duplicate Records
  • Inconsistent Formatting

And many other headaches—which WILL take your organization forever to fix.

Raw data are rarely ready for immediate consumption. Dealing with dirty data is extremely time-consuming, and it is an issue that all data-oriented professionals face. If you input low-quality or inaccurate data, you WILL output low-quality and inaccurate insights. You’ve heard the phrase before: garbage in, garbage out!

Here Are 7 of the Most Common Forms of Data Impurity Which You May Be Observing

Missing Data

Data are “missing” when certain values or entire rows are absent from a data set. Untrained analysts often make the mistake of simply deleting incomplete records, but those records can still be extremely valuable. To extract this value, proficient data professionals use several techniques to account for missing data. Boxplot’s experts know how to fill in the gaps with techniques such as data merging and imputation.
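
To make imputation concrete, here is a minimal sketch of one simple approach, median imputation, in plain Python. The function name `impute_median` and the sample `ages` column are hypothetical and only for illustration. The right imputation strategy depends on why the values are missing.

```python
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        # Nothing to estimate from, so return the column unchanged.
        return list(values)
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical column with two missing entries.
ages = [34, None, 29, 41, None, 38]
filled_ages = impute_median(ages)
print(filled_ages)
```

The median is used here rather than the mean because it is less sensitive to extreme values, which often appear in the same data sets as missing values.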

Poor Accuracy

“Poor accuracy” is when the data are of a valid format and data type, but are simply wrong. As with missing data, poor data accuracy is difficult for untrained users to recognize. It’s important to understand the range within which values can possibly fall, and to correct data points that don’t seem reasonable. Through aggregation, distribution estimation, and other techniques, data experts can quickly identify whether a value is likely to be incorrect.
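
As a small illustration of checking values against the range in which they can possibly fall, the sketch below flags records outside plausible bounds. The helper `flag_out_of_range`, the sample `patients` records, and the 0 to 120 age bounds are all assumptions made for this example.

```python
def flag_out_of_range(records, field, low, high):
    """Return the records whose `field` value falls outside [low, high]."""
    return [r for r in records if not (low <= r[field] <= high)]


# Hypothetical records: every age is a valid integer, but not every age is plausible.
patients = [
    {"id": 1, "age": 42},
    {"id": 2, "age": 420},  # likely a typo for 42
    {"id": 3, "age": -3},   # impossible value
]

suspect = flag_out_of_range(patients, "age", low=0, high=120)
print([r["id"] for r in suspect])
```

Flagging rather than deleting keeps the suspect records available for review, since some of them can be corrected instead of discarded.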

Data Type Constraints

Data type constraint violations occur when data are supposed to be of a particular data type (text, number, date, etc.) but are actually of another data type. The tricky part about this problem is that it’s often not obvious that the data type is the issue. A common example is “Dirty Dates”: some dates in a column of calendar data are stored as strings when they are supposed to be stored as datetime values or another built-in date type.
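
The “Dirty Dates” problem can be sketched in a few lines of Python. The example below assumes a column that mixes real date objects with date strings, and the list of formats in `KNOWN_FORMATS` is an assumption; a real data set may need different or additional formats.

```python
from datetime import date, datetime

# Assumed string formats; extend this tuple to match the data at hand.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")


def to_date(value):
    """Coerce a date object or a date string to datetime.date, or None if unparseable."""
    if isinstance(value, datetime):
        return value.date()
    if isinstance(value, date):
        return value
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date()
        except ValueError:
            continue
    return None  # leave unparseable entries visible for manual review


# Hypothetical column mixing a real date with date-like strings.
raw_column = [date(2023, 1, 15), "2023-02-01", "03/04/2023", "not a date"]
cleaned = [to_date(v) for v in raw_column]
print(cleaned)
```

Note that a string such as "03/04/2023" is ambiguous between month-first and day-first conventions. This sketch resolves it by format order, which is a choice that should be confirmed against the source of the data.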

Extreme Outliers

An extreme outlier is a data point that is not obviously incorrect, but does not seem reasonable given how the other data are distributed. Identifying outliers is a key phase of the data cleaning process: in some instances you want to delete outliers outright, while in others they provide key insights that would be lost if the data responsible were deleted. Because it is such a crucial (and, for inexperienced users, time-consuming) step in the data cleaning process, the experts at Boxplot devote much of their expertise to efficient, automated outlier identification.
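
One widely used way to automate outlier identification is the interquartile range (IQR) rule, sketched below. This is a generic textbook technique shown for illustration, not a description of any particular firm's method. The function name, the multiplier `k`, and the sample `order_totals` are assumptions.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical order totals with one suspicious entry.
order_totals = [52, 48, 55, 61, 47, 50, 58, 49, 53, 940]
flagged = iqr_outliers(order_totals)
print(flagged)
```

The function only flags values. Whether a flagged point is an error to remove or a real event worth investigating remains a judgment call.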

Poor Uniformity

Poor data uniformity is when values of the same attribute do not share a unit of measurement. An example would be weight data in which some data points are measured in pounds and others in kilograms.
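
For the pounds and kilograms example, a minimal normalization sketch might look like the following. It assumes each weight is stored alongside a unit label; the accepted unit spellings and the sample `weights` are hypothetical.

```python
LB_TO_KG = 0.45359237  # exact definition of the international avoirdupois pound


def to_kilograms(value, unit):
    """Convert a weight to kilograms based on its unit label."""
    unit = unit.strip().lower()
    if unit in ("kg", "kilogram", "kilograms"):
        return value
    if unit in ("lb", "lbs", "pound", "pounds"):
        return value * LB_TO_KG
    # Fail loudly instead of guessing at an unrecognized unit.
    raise ValueError(f"unknown unit: {unit!r}")


# Hypothetical mixed-unit measurements.
weights = [(150, "lb"), (70, "kg"), (200, "LBS")]
weights_kg = [round(to_kilograms(v, u), 2) for v, u in weights]
print(weights_kg)
```

When no unit label is recorded at all, the unit has to be inferred from context or from the distribution of values, which is a harder problem than this sketch covers.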

Unmerged Data

Unmerged data are data spread across multiple sources that need to be combined into one. Merging data by hand can be a tedious, mind-numbing task. Believe us, we know. Our goal is to save you this busy work; that’s why we at Boxplot emphasize task automation for merging data.
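
As a rough illustration of automated merging, the sketch below joins two sources on a shared key and reports the rows that found no match. The function `merge_on_key`, the `customer_id` key, and both sample sources are hypothetical. The sketch also assumes the key is unique within the second source.

```python
def merge_on_key(left, right, key):
    """Join two lists of dict rows on `key`; also return left rows with no match."""
    # Index the right-hand source by key (assumes keys are unique there).
    right_by_key = {row[key]: row for row in right}
    merged, unmatched = [], []
    for row in left:
        match = right_by_key.get(row[key])
        if match is None:
            unmatched.append(row)
        else:
            merged.append({**row, **match})
    return merged, unmatched


# Two hypothetical sources describing the same customers.
crm = [{"customer_id": 1, "name": "Ada"}, {"customer_id": 2, "name": "Grace"}]
billing = [{"customer_id": 1, "balance": 120.0}]

merged, unmatched = merge_on_key(crm, billing, "customer_id")
print(merged)
print(unmatched)
```

Returning the unmatched rows separately makes gaps between sources visible instead of silently dropping records.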

Extracting Data

Sometimes the information we care about is embedded in long, ill-structured data points, and the values need to be separated or parsed in order to be useful. For example, a data set may have mailing addresses recorded as a single data point, whereas what you really want is city, state, and zip recorded as separate data points. As with many other data cleanliness issues, automation is the cure for time wasted extracting data by hand.
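
The mailing address example can be sketched with a regular expression. The pattern below assumes US-style addresses that end in "City, ST 12345" (optionally with a ZIP+4 suffix). Real address data are messier, so treat the pattern and the `split_address` helper as illustrative assumptions.

```python
import re

# Assumed layout: "<street>, <city>, <two-letter state> <zip>"
ADDRESS_TAIL = re.compile(
    r"(?P<city>[^,]+),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)\s*$"
)


def split_address(address):
    """Extract city, state, and zip from a single-line address, or return None."""
    match = ADDRESS_TAIL.search(address)
    if match is None:
        return None
    return {
        "city": match.group("city").strip(),
        "state": match.group("state"),
        "zip": match.group("zip"),
    }


print(split_address("123 Main St, Springfield, IL 62704"))
print(split_address("no address here"))
```

Addresses that do not fit the assumed layout return None, so they can be collected and handled separately.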