A Short Note on Data Cleaning for Data Science

May 19, 2020, 5:43 p.m. By: Merlyn Shelley

Data Cleaning for Data Science

Data cleaning is one of the most important steps in any data science project. Data must be clean and well ordered before it is fed to machine learning models, because only with clean data can those models give accurate predictions.

In computer science and mathematics, it is commonly understood that the quality of the output depends on the quality of the input. In data cleaning, we make sure that the input fed to the model is of high enough quality to produce the desired output.

garbage in, garbage out

So, data cleaning is the process of finding erroneous or inconsistent values in a raw dataset and either replacing them with close approximations or discarding the corrupt values completely.

This is a corrective process that needs to be handled with care, because the data is the only resource from which we can draw inferences. If we lose essential data, the results won't be accurate and the whole data science project can become meaningless. For this reason, professional data scientists spend much of their time cleaning data and putting it into an ordered form for further manipulation.
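The two options described above, replacing or discarding, can be sketched with pandas. This is a minimal illustration and not a general recipe: the data, the column names and the plausible age range of 0 to 120 are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical raw data: the ages -1 and 230 are clearly invalid entries.
raw = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy", "Dee", "Eve"],
    "age": [34, -1, 29, 230, 41],
})

# Flag values outside an assumed plausible range (0 to 120).
invalid = ~raw["age"].between(0, 120)

# Option 1: replace invalid values with the median of the valid ones.
median_age = raw.loc[~invalid, "age"].median()
replaced = raw.copy()
replaced.loc[invalid, "age"] = median_age

# Option 2: discard the rows that hold invalid values.
discarded = raw.loc[~invalid].reset_index(drop=True)
```

Replacing keeps every row but introduces estimated values, while discarding keeps only observed values at the cost of losing rows. Which trade-off is acceptable depends on the dataset.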

According to a survey reported by Forbes, data scientists spend about 60% of their time cleaning and organising data, and a further 19% collecting data sets. Together, that is roughly 80% of their time spent collecting, cleaning and preparing data before any model is run. The same survey reported that 76% of data scientists regard data preparation as the least enjoyable part of their work.

Let's discuss the best ways to perform data cleaning effectively.

  • We can use the pandas library in Python or dplyr in R to convert raw data into well-structured data frames.

  • We can use data cleaning tools such as OpenRefine (formerly Google Refine), Trifacta, Talend, Paxata, Alteryx, Data Ladder’s DataMatch and WinPure, along with the many open-source tools available, to automate parts of the data cleaning process.

  • For distributed systems built on Apache Spark, Optimus is an open-source framework for performing data cleaning effectively.
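As a rough illustration of the first point, the sketch below uses pandas to turn string-only raw records into a typed data frame. The column names and values are hypothetical, and treating unparseable entries as missing (`errors="coerce"`) is one possible choice rather than the only one.

```python
import pandas as pd

# Hypothetical raw records, as they might arrive from a CSV export:
# every field is a string, and some entries are not valid values.
raw = pd.DataFrame({
    "order_id": ["1001", "1002", "1003"],
    "amount": ["19.99", "n/a", "5.50"],
    "order_date": ["2020-05-01", "2020-05-03", "not recorded"],
})

tidy = raw.copy()
# Identifiers become integers.
tidy["order_id"] = tidy["order_id"].astype(int)
# Unparseable numbers become NaN instead of raising an error.
tidy["amount"] = pd.to_numeric(tidy["amount"], errors="coerce")
# Unparseable dates become NaT instead of raising an error.
tidy["order_date"] = pd.to_datetime(
    tidy["order_date"], format="%Y-%m-%d", errors="coerce"
)
```

After this step the invalid entries show up as missing values, so they can be counted with `tidy.isna().sum()` and handled explicitly.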

Now let's look at the strategies for data cleaning in data science.

  • To find the critical data fields

  • To collect the required data

  • To remove the duplicate data

  • To fix the empty values and missing values

  • To run the data cleaning process on a regular schedule, either weekly or monthly

  • To review, adapt and repeat the above process
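Several of the strategies above can be combined into one repeatable function. The following is a minimal sketch with pandas, assuming a hypothetical customer table; the function name `clean`, the column names and the choice of critical fields are illustrative only.

```python
import numpy as np
import pandas as pd

def clean(df, critical_fields):
    """Apply the cleaning steps in order and return a new DataFrame."""
    out = df.copy()
    # Treat empty or whitespace-only strings as missing values.
    out = out.replace(r"^\s*$", np.nan, regex=True)
    # Remove exact duplicate rows.
    out = out.drop_duplicates()
    # Discard rows that are missing any critical field.
    out = out.dropna(subset=critical_fields)
    return out.reset_index(drop=True)

# Hypothetical raw data with a duplicate row and two empty values.
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "email": ["a@example.com", "b@example.com", "b@example.com",
              "  ", "d@example.com"],
    "city": ["Pune", "Delhi", "Delhi", "Mumbai", ""],
})

cleaned = clean(customers, critical_fields=["customer_id", "email"])
```

Rows with a missing value in a non-critical field, such as `city` here, are kept with the gap marked as missing, so they can be filled in or reviewed later. Wrapping the steps in a function makes it easy to rerun the same process on a weekly or monthly schedule.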

Done this way, we can efficiently prepare suitable data for use in a machine learning model, which improves the chances of a successful result.

A few GitHub references: