Analysis with Outliers and Missing Values

There are several methods for dealing with data that contain outliers and missing values.

This page focuses on missing values.

We can treat an outlier as normal data, but this approach often leads to poor results. One solution is to treat the outlier as a missing value.
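As a minimal sketch of treating outliers as missing values, the example below replaces every value outside a known valid range with None. The function name outliers_to_missing, the sample temperatures, and the range 0 to 50 are assumptions made for illustration only. In practice the rule for deciding what counts as an outlier depends on the data.

```python
def outliers_to_missing(values, lower, upper):
    """Replace values outside [lower, upper] with None (a missing value)."""
    return [
        v if v is not None and lower <= v <= upper else None
        for v in values
    ]


# Hypothetical sensor readings; 999.0 and -50.0 are treated as outliers.
temperatures = [21.5, 22.0, 999.0, 20.8, -50.0, 21.9]
cleaned = outliers_to_missing(temperatures, lower=0.0, upper=50.0)
print(cleaned)  # [21.5, 22.0, None, 20.8, None, 21.9]
```

After this step, the methods for missing values described on this page can be applied to the outliers as well.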

Remove the Line

In my experience, many statistical software packages remove any line (row) that includes a missing value without giving any notice to the user.

Because this approach is the default in much software, it is difficult to notice the bad effect of missing values.

Strength : Very easy

Weakness : Normal data on the same line are also removed
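The following sketch shows line removal on a small made-up table, and counts how many normal values are thrown away together with the missing ones. The data and the function name remove_incomplete_rows are hypothetical. Printing the number of removed rows is one simple way to avoid the silent removal described above.

```python
def remove_incomplete_rows(rows):
    """Keep only the rows (lines) that contain no missing value (None)."""
    return [row for row in rows if None not in row]


# Hypothetical data: each row is [height, weight, age].
rows = [
    [170.0, 65.0, 34],
    [None, 72.0, 41],
    [158.0, None, 29],
    [181.0, 80.0, 52],
]

complete = remove_incomplete_rows(rows)
removed = [row for row in rows if None in row]

# Normal values that disappear only because they share a row with a missing value.
lost_normal_values = sum(1 for row in removed for v in row if v is not None)

print(f"removed {len(removed)} of {len(rows)} rows")       # removed 2 of 4 rows
print(f"normal values discarded: {lost_normal_values}")    # normal values discarded: 4
```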

Another method is to fill in the missing values with new data. We need to decide the rule for generating the new data.

Strength : Easy. The effect on outputs derived from the normal data is small.

Weakness : Could introduce a bias
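One common rule is to fill in with the mean of the observed values. The sketch below uses that rule as an assumption (the text above does not name a specific rule), with made-up data. It also shows one form of the bias: the mean is unchanged, but the variance becomes smaller than that of the observed values.

```python
from statistics import mean, pvariance


def fill_with_mean(values):
    """Fill each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


values = [10.0, None, 14.0, 12.0, None]  # hypothetical data
observed = [v for v in values if v is not None]
filled = fill_with_mean(values)

print(filled)                                 # [10.0, 12.0, 14.0, 12.0, 12.0]
print(mean(observed), mean(filled))           # the mean stays the same
print(pvariance(observed), pvariance(filled)) # the variance shrinks after filling
```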

If we know the range of the data, a limit of the range or an outlier value can be used to fill in the missing values.

Strength : The outputs can reflect the causes of the missing values

Weakness : The effect of the filled-in data may be large
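The sketch below fills in missing values with the upper limit of a known range. The scenario is an assumption for illustration: a sensor that measures up to 100.0 and records nothing when the true value exceeds that limit. The comparison of the means shows how strongly the filled-in values can move the output.

```python
from statistics import mean


def fill_with_limit(values, limit):
    """Fill each missing value (None) with a limit of the known data range."""
    return [limit if v is None else v for v in values]


UPPER_LIMIT = 100.0  # hypothetical upper limit of the measurable range
readings = [42.0, 57.5, None, 63.0, None]

observed = [v for v in readings if v is not None]
filled = fill_with_limit(readings, UPPER_LIMIT)

print(filled)                    # [42.0, 57.5, 100.0, 63.0, 100.0]
print(round(mean(observed), 1))  # 54.2
print(mean(filled))              # 72.5
```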

Another method is to fill in the missing value with the average of nearby data, found by k-NN (k-nearest neighbors).

Strength : If the cause of the missing values is not special, it may be the best of the fill-in approaches.

Weakness : The amount of calculation is large
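A minimal sketch of k-NN filling is shown below. It assumes that only one column (target) has missing values and that the other columns are complete. The function name knn_impute and the data are hypothetical. Each row with a missing value is compared with every complete row, which is the reason the amount of calculation grows quickly with the size of the data.

```python
import math


def knn_impute(rows, target, k=2):
    """Fill None in column `target` with the mean of the k nearest complete rows.

    Distance is Euclidean over all columns except `target`.
    """
    donors = [r for r in rows if r[target] is not None]
    result = []
    for row in rows:
        if row[target] is not None:
            result.append(list(row))
            continue

        def distance(other, row=row):
            return math.sqrt(sum(
                (a - b) ** 2
                for i, (a, b) in enumerate(zip(row, other))
                if i != target
            ))

        # Every donor is compared with this row: the costly part of k-NN.
        nearest = sorted(donors, key=distance)[:k]
        filled_row = list(row)
        filled_row[target] = sum(d[target] for d in nearest) / len(nearest)
        result.append(filled_row)
    return result


# Hypothetical data: [x, y], where y is missing in the third row.
rows = [[1.0, 10.0], [2.0, 20.0], [3.0, None], [4.0, 40.0], [9.0, 90.0]]
filled = knn_impute(rows, target=1, k=2)
print(filled[2])  # [3.0, 30.0], the mean of y for the neighbors x=2.0 and x=4.0
```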

The EM algorithm is an approach that uses all of the normal values, leaving out only the missing values themselves.

Strength : No changed (filled-in) values are used

Weakness : The cause of the missing values is ignored. This may have a bad effect on the output of the analysis.
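As a rough sketch of the idea, the example below applies the EM algorithm to two variables, assuming a bivariate normal distribution in which x is always observed and y is sometimes missing. These assumptions, the function name em_bivariate, and the data are not from the text above. The point is that a row with a missing y still contributes its x, and no single filled-in value is stored: only expected sums are used to estimate the means and covariances.

```python
def em_bivariate(xs, ys, iterations=100):
    """Estimate means and (co)variances of (x, y) when some y are None.

    Assumes a bivariate normal model and that the cause of the
    missing values can be ignored.
    """
    n = len(xs)
    # x is complete, so its estimates never change.
    mu_x = sum(xs) / n
    s_xx = sum((x - mu_x) ** 2 for x in xs) / n

    # Starting values from the observed y only.
    obs_y = [y for y in ys if y is not None]
    mu_y = sum(obs_y) / len(obs_y)
    s_yy = sum((y - mu_y) ** 2 for y in obs_y) / len(obs_y)
    s_xy = 0.0

    for _ in range(iterations):
        beta = s_xy / s_xx                # slope of y on x
        resid_var = s_yy - beta * s_xy    # variance of y given x
        sum_y = sum_yy = sum_xy = 0.0
        for x, y in zip(xs, ys):
            if y is None:
                # E-step: expected y and y^2 given x and current estimates.
                ey = mu_y + beta * (x - mu_x)
                eyy = ey ** 2 + resid_var
            else:
                ey, eyy = y, y ** 2
            sum_y += ey
            sum_yy += eyy
            sum_xy += x * ey
        # M-step: update the estimates from the expected sums.
        mu_y = sum_y / n
        s_yy = sum_yy / n - mu_y ** 2
        s_xy = sum_xy / n - mu_x * mu_y
    return mu_x, mu_y, s_xx, s_xy, s_yy


# Hypothetical data: y is missing for x = 4.0 and x = 6.0.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, None, 9.8, None]
mu_x, mu_y, s_xx, s_xy, s_yy = em_bivariate(xs, ys)
print(mu_x, mu_y, s_xx, s_xy, s_yy)
```

In this example the estimated mean of y is larger than the mean of the four observed y values, because the rows with missing y have large x.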

Analysis using category data (decision trees, association analysis, etc.) is also useful. We treat the missing value as a category of its own, "missing value".

The Decision Tree in RapidMiner treats the missing value as the category "?".

Natto uses the category " " (blank) for the missing value.

Strength : Information about the missing data can be used

Weakness : Small differences in the normal data are not used in the analysis. (The resolution of the values becomes coarse.)
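The sketch below converts numeric values into categories and gives the missing values their own category. The boundaries, labels, and data are assumptions for illustration. It also shows the weakness: 35 and 47 fall into the same category, so the difference between them is no longer used.

```python
from collections import Counter


def to_category(value, boundaries, labels):
    """Map a number to a category label; None becomes "missing value".

    `labels` must have one more entry than `boundaries`.
    """
    if value is None:
        return "missing value"
    for boundary, label in zip(boundaries, labels):
        if value < boundary:
            return label
    return labels[-1]


ages = [23, None, 47, 68, None, 35]  # hypothetical data
categories = [
    to_category(a, boundaries=[30, 60], labels=["low", "middle", "high"])
    for a in ages
]

print(categories)
# ['low', 'missing value', 'middle', 'high', 'missing value', 'middle']
print(Counter(categories)["missing value"])  # 2
```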