Data cleaning is the most important part of data analysis and if we have missing values in our dataset, our task is going to be more tedious. If number of observations with missing values is <=5% we can simply delete those observations, but what if the number of observations are a lot? then we can’t afford to delete them. In that case we need to “somehow” impute those missing values. Here we will discuss that “somehow”.
Very basic way to imputation is mean, median or mode, based on the type of the attribute.
If it is continuous variable for which we are doing the imputation, we can replace the missing values with the mean or median of that column.
There can be some situation that may appear in this case, like if our column attribute has very large or very small values (outliers) then our mean will not be a good representative to replace the missing values, in that case we have to choose median.
Another way to make it more perfect would be to take “Conditional mean” or “Conditional median” as impute value. Suppose we have a data set like this, where “Salary” column has two missing values for person “E” and “J”.
If we will replace those “NA” with the mean of the column it will be 28450, but are we really doing the right assignment by assigning same salary to a person with “Highest Education” as “School” and other with “Graduate”? In this case “Conditional mean” will be the right choice, so for the person “E”, salary will be the mean of all salary of all “Graduate” people, which is 30333.33, and for “J”, salary will be the mean of all salary of all people who are having their “highest education” as “School”, which is 11000. Now it sounds reasonable.
But what if our data is not continuous, it is ordinal (i.e. rating in a range). Suppose we have a data set like below.
Here we can’t simply replace the missing value (product E) with the mean of the rating, which will be 3.18, if we do so it will not be correct representative of the “User Rating”. In this case we need to take the mode instead of mean/median, to see what maximum people have rated. In this case also we can take conditional mode based on the “Type” of the product to see what maximum people have rated about that particular type of product, which will give more accurate imputed “rating”.
To impute missing value for different data type and more accurately see this