Understanding Data Cleaning

Data is information collected through observation. It is often a set of qualitative or quantitative variables, or a combination of both. Data entered into a system can carry multiple layers of issues when it is retrieved, which in most cases means you have to clean it before you can make sense of it and turn it into actionable insights.

Contributed by: Krina

Data cleaning is a crucial first step in any machine learning project. It is an inevitable part of model building and data analysis, yet nobody really tells you how to go about it. It is not the most glamorous part of machine learning, but it is the part that can make or break your algorithm. You might imagine that a data scientist spends most of their time building ML algorithms and models; in reality, most spend a good 70-75% of their time cleaning data. It is widely believed that a cleaner dataset beats any fancy algorithm you build.

Different types of data will need different types of cleaning. The steps involved in data cleaning are as follows:

  1. Removal of unwanted observations
  2. Fixing structural errors
  3. Managing unwanted outliers
  4. Handling missing data

Steps in Data Cleaning

  1. Removing Unwanted Observations
    • Dealing with duplicates:
      Duplicates generally arise during data collection: combining two datasets, scraping data, or receiving data with different primary keys from different departments in the organization can all lead to duplicates. These are observations that hold no value for your primary objective.
    • Irrelevant observations:
      This is where EDA, from univariate to bivariate analysis, comes in handy for uncovering crucial insights about the data. If you look at the distributions of categorical features, you may come across classes that probably should not exist. You will have to keep that in mind and categorize those features correctly before model building (see the sketch after this list).
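Below is a minimal pandas sketch of both ideas, using a hypothetical toy table with customer_id and segment columns: exact duplicate rows are dropped, and a categorical class that should not exist is inspected and removed.

```python
import pandas as pd

# A minimal sketch: a toy customer table with a duplicated row and a
# category ("N/A") that should not exist as a real class.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "segment": ["retail", "corporate", "corporate", "N/A"],
})

# Drop exact duplicate rows (keep the first occurrence).
df = df.drop_duplicates()

# Inspect categorical distributions to spot classes that should not exist.
print(df["segment"].value_counts())

# Remove (or remap) observations belonging to an invalid class.
df = df[df["segment"] != "N/A"]
```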
  2. Fixing Structural Errors

The next step in data cleaning is fixing structural errors. These range from typographical errors and inconsistent capitalization to misspellings and stray characters such as unwanted spaces or symbols. You should also look for mislabeled classes, which can distort your analysis and sabotage your algorithm, leaving you spending resources in a direction that yields poor results.

E.g., a country column could list the same country as U.S.A, USA, United States of America, U .S .A., or u.s.a, and so on. Although all of these refer to the same country, writing them inconsistently can lead to wrongful categorization.
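A minimal sketch of one way to standardize such values with pandas, assuming a hypothetical country column: normalize case and punctuation, then map known variants to a single canonical label.

```python
import pandas as pd

# A minimal sketch: the same country written several different ways.
df = pd.DataFrame({
    "country": ["U.S.A", "USA", "United States of America", "u.s.a ", "U .S .A."],
})

# Normalize case, strip spaces and punctuation.
cleaned = (
    df["country"]
    .str.upper()
    .str.replace(r"[.\s]", "", regex=True)   # "u.s.a " -> "USA"
)

# Map known variants to a single canonical label; leave unknowns as-is.
canonical = {"USA": "USA", "UNITEDSTATESOFAMERICA": "USA"}
df["country"] = cleaned.map(canonical).fillna(cleaned)

print(df["country"].unique())   # ['USA']
```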

  3. Managing Unwanted Outliers

Outliers can cause issues in some model-building processes; linear regression, for example, is less robust to outliers than random forests or decision trees. It is crucial to have some logical backing when removing outliers, and doing so should generally improve model performance. You cannot simply remove a large value just because it looks like an outlier.
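As one illustration, the interquartile-range (IQR) rule is a common way to flag candidate outliers. The sketch below assumes a hypothetical income column and only drops rows once there is a justification for doing so.

```python
import pandas as pd

# A minimal sketch of rule-based outlier detection using the IQR method
# on a hypothetical "income" column; confirm an outlier is a genuine
# error (or irrelevant) before removing it.
df = pd.DataFrame({"income": [42_000, 48_000, 51_000, 45_000, 1_200_000]})

q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag candidates for review rather than deleting them blindly.
outliers = df[(df["income"] < lower) | (df["income"] > upper)]
print(outliers)

# Only drop after you have a logical justification for doing so.
df_clean = df[(df["income"] >= lower) & (df["income"] <= upper)]
```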

  4. Handling Missing Data
    • Missing data is one of the trickiest parts of data cleaning for machine learning. We cannot simply remove a piece of information unless we understand how important it is with respect to the target variable and how it relates to it. E.g., imagine you are trying to predict customer churn based on customer ratings, and the ratings have missing values. If you drop the variable, you may lose a crucial part of the data that could play an important role in prediction, which matters in real-world problems.
    • Imputing missing values based on existing values or past observations is another way to deal with them. Imputation is suboptimal because the original data was missing and we are filling it in; this always leads to some loss of information, no matter how sophisticated your imputation technique is (a minimal sketch follows this list).
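Here is a minimal sketch of both options on a hypothetical customer_rating column: dropping rows versus filling missing values with the median.

```python
import pandas as pd

# A minimal sketch of simple handling of missing values on a
# hypothetical "customer_rating" column; imputed values are estimates,
# so some information loss is unavoidable regardless of technique.
df = pd.DataFrame({"customer_rating": [4.0, None, 3.5, None, 5.0]})

# Option 1: drop rows with missing ratings (loses observations).
dropped = df.dropna(subset=["customer_rating"])

# Option 2: impute with the column median (keeps rows, blurs the signal).
df["customer_rating"] = df["customer_rating"].fillna(df["customer_rating"].median())
```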

Either of these two methods is suboptimal: however hard you try, you are dropping or replacing a puzzle piece and pretending the missing data never existed. There is always a risk of reinforcing the patterns already present in the data, which can introduce a little bias into the result.

So, missing data is often informative in itself and a warning that something important is going on. We should make the algorithm aware of the missing data rather than hiding it, which can be done by flagging it. With flagging, you effectively allow the algorithm to estimate the optimal constant for missing values instead of just imputing the mean.
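A minimal sketch of flagging with pandas, again assuming a hypothetical customer_rating column: an indicator column records where values were missing before the gaps are filled with a constant.

```python
import pandas as pd

# A minimal sketch of "flagging": keep an explicit indicator of
# missingness so the model can learn from the fact that a value was
# absent, then fill the original column with a constant.
df = pd.DataFrame({"customer_rating": [4.0, None, 3.5, None, 5.0]})

# 1 where the value was missing, 0 otherwise.
df["customer_rating_missing"] = df["customer_rating"].isna().astype(int)

# Fill the original column with a constant (0 here) rather than the mean,
# letting the algorithm weigh the flag and the constant together.
df["customer_rating"] = df["customer_rating"].fillna(0)
```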

If you use dirty data, skip data cleaning, or do only a little of it here and there, and then present your analysis and inferences to your organization, it will cost your client or organization time, brand image, and profits. You can land in a whole lot of trouble, since incorrect or incomplete data can lead you to inappropriate conclusions.

In the real world, incorrect data can be very costly. Companies use huge amounts of information from databases, ranging from customer details and contact addresses to banking information. Any error can cause financial damage or even the loss of customers in the long run.

This is why it is so important to keep algorithms simple and focus on high-quality data.

While cleaning the data, some points to focus on are:

1. Data Quality

Validity: The data is related to the business problem and contains the variables and fields required for the analysis. Typical validity constraints include the following (a sketch of such checks appears after this list):

  • Datatype: Values in a particular column must be of the expected data type (e.g., numeric, string, or date).
  • Unique constraints: A field or combination of fields must be unique in the dataset.
  • Primary-key & foreign-key constraints: As in a relational database, a foreign-key variable cannot have a value that does not appear in the referenced primary key.
  • Regular expression patterns: Variable fields must follow a certain pattern, e.g., a date should be dd-mm-yyyy or dd-mm-yy.
  • Cross-field validation: A stock price can be volatile, and so can any value across dates, but the document date of a sale cannot be before the purchase order date.
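A minimal sketch of a few such validity checks with pandas, using a hypothetical sales table with order_id, order_date, and po_date columns.

```python
import pandas as pd

# A minimal sketch of validity checks: datatype, uniqueness, a date
# pattern, and cross-field logic on a hypothetical sales table.
df = pd.DataFrame({
    "order_id":   [101, 102, 103],
    "order_date": ["01-02-2024", "15-02-2024", "2024/03/01"],
    "po_date":    ["01-01-2024", "20-02-2024", "01-02-2024"],
})

# Datatype: order_id should be an integer column.
assert pd.api.types.is_integer_dtype(df["order_id"])

# Unique constraint: order_id must not repeat.
assert df["order_id"].is_unique

# Regular-expression pattern: dates must look like dd-mm-yyyy.
bad_dates = df[~df["order_date"].str.match(r"^\d{2}-\d{2}-\d{4}$")]

# Cross-field validation: the sale date cannot precede the purchase order date.
order = pd.to_datetime(df["order_date"], format="%d-%m-%Y", errors="coerce")
po = pd.to_datetime(df["po_date"], format="%d-%m-%Y")
bad_order = df[order < po]
```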

Completeness: The extent to which the data is known. Consider the missing values in the data and how they can impact the study.

Accuracy: The degree to which the data fields are close to true values.

E.g., claiming during a pandemic that tourism is flourishing when the facts do not support it.

Another important aspect to keep in mind is the difference between accuracy and precision. Saying your place of residence is Earth is true, but not precise; a country, state, city, or street name is precise.

Consistency: Whether the data is consistent within a dataset or across datasets. An employee recorded as a graduate but aged 16 is a contradiction, since it is not plausible that someone has graduated by that age. The two entries conflict with each other.

Uniformity: The degree to which the data uses the same unit of measure. If export data contains a document-currency variable, values could be in dollars for the USA, euros for European countries, INR for India, and so on. You cannot really analyze the data like that, so it should first be converted to a common unit of measure.
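A minimal sketch of enforcing uniformity with pandas, assuming a hypothetical amount/currency pair and illustrative exchange rates.

```python
import pandas as pd

# A minimal sketch of enforcing uniformity: convert a mixed-currency
# amount column to a single unit.
df = pd.DataFrame({
    "amount":   [100.0, 250.0, 5000.0],
    "currency": ["USD", "EUR", "INR"],
})

# Illustrative rates only; in practice these would come from a reliable source.
rates_to_usd = {"USD": 1.0, "EUR": 1.1, "INR": 0.012}

df["amount_usd"] = df["amount"] * df["currency"].map(rates_to_usd)
```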

2. Workflow (inspection, cleaning, verifying & reporting)

This is the sequence of steps to follow to clean the data and make the best of what has been provided to you.

  1. Inspecting:  Detection of incorrect, inconsistent & unwanted data

Data Profiling: Summary statistics give a fair idea of the quality of the data. How many missing values are there? What is the shape of the data before cleaning? Are variables stored as strings or as numeric types?
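A minimal profiling sketch with pandas, assuming a hypothetical input file data.csv:

```python
import pandas as pd

# A minimal profiling sketch: shape, dtypes, missing-value counts,
# and summary statistics before any cleaning.
df = pd.read_csv("data.csv")          # hypothetical input file

print(df.shape)                       # rows x columns before cleaning
print(df.dtypes)                      # string vs numeric columns
print(df.isna().sum())                # missing values per column
print(df.describe(include="all"))     # summary statistics
```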

Visualizations: Is it easier to look at a plot and say the distribution is normal or skewed than to infer it from summary statistics alone? Visualizing data can give you brilliant insights and take you beyond summary numbers such as the mean, standard deviation, range, or quantiles. For example, it is easier to spot outliers with a boxplot of average income across countries.
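A minimal sketch of that boxplot idea using pandas and matplotlib, with a small made-up income table:

```python
import matplotlib.pyplot as plt
import pandas as pd

# A minimal sketch: average income across countries, with outliers
# standing out as points beyond the whiskers.
df = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "avg_income": [40_000, 42_000, 300_000, 38_000, 39_500, 41_000],
})

df.boxplot(column="avg_income", by="country")
plt.suptitle("")                      # drop the automatic group title
plt.ylabel("Average income")
plt.show()
```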

  2. Cleaning: Remove or impute the anomalies in the data. Missing-value treatment, outlier treatment, and the removal or addition of variables all belong to this part. Incorrect data is generally removed, imputed, or corrected based on specific evidence from the clients. Standardizing, scaling, and normalization are important parts of preprocessing the data, but they come after the cleaning has been done appropriately and iteratively (see the sketch after this list).
  3. Verifying: After cleaning, verify with domain experts that the data is appropriate.
  4. Reporting: Describe the data cleaning process, the changes made, and the quality of the data once preprocessing is complete.
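As a small illustration of the standardization mentioned above, the sketch below computes z-scores for a hypothetical income column after cleaning.

```python
import pandas as pd

# A minimal sketch of standardization (z-scores), applied only after
# the cleaning steps above are complete.
df = pd.DataFrame({"income": [42_000.0, 48_000.0, 51_000.0, 45_000.0]})

df["income_scaled"] = (df["income"] - df["income"].mean()) / df["income"].std()
```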

Last but not least, data cleaning is a massive part of data preprocessing and model building for machine learning, so it is something you can never skip. In a data science course or project, it is crucial to spend a substantial amount of time on it. And if you are considering working in the data science industry, addressing these aspects of learning can help you land an entry-level job that gives you a graceful and swift entry into the job of your dreams.

I would like to conclude by quoting Jim Bergeson: "Data will talk to you if you're willing to listen."

Thank you for reading this blog on Data Cleaning. If you wish to learn more such concepts, check out Great Learning Academy’s pool of free online courses and upskill today.
