Prediction by cluster analysis

Cluster Analysis is generally introduced as a method for grouping samples: it answers the question "Which group does this sample belong to?" for the samples used in the grouping.

However, the mixture distribution method can also be used for Prediction by Statistical Model. Methods other than the mixture distribution method can also be used, as a data pre-processing step, when making predictions with a statistical model.

Work before prediction

When making a prediction by cluster analysis, the work done before the prediction is to determine which group each sample belongs to.

For example, if you perform cluster analysis on the data on the left, you can extract three groups, as shown on the right. This result is then used when making predictions.

Prediction of group

The mixture distribution method is a type of cluster analysis, but it does more than group samples. Because it is a statistical model, it can also predict which group a sample belongs to even when that sample was not used for the grouping. For example, using the result of the work above to predict the group at every position gives the figure below.
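As a rough illustration of predicting the group of a new sample, the sketch below assigns each point to the mixture component with the highest posterior probability. It assumes a Gaussian mixture that has already been fitted; the values of `weights`, `means` and `covs` are hypothetical and are not taken from the data in this article, and the fitting step itself is not shown.

```python
import numpy as np

# Hypothetical parameters of an already-fitted 3-component Gaussian mixture
# (illustrative values only; they do not come from the article's data).
weights = np.array([0.3, 0.3, 0.4])
means = np.array([[0.0, 0.0], [5.0, 0.0], [2.5, 4.0]])
covs = np.array([np.eye(2), np.eye(2), np.eye(2)])


def log_gaussian(x, mean, cov):
    """Log density of a multivariate normal, evaluated at each row of x."""
    d = x.shape[1]
    diff = x - mean
    inv = np.linalg.inv(cov)
    maha_sq = np.einsum("ij,jk,ik->i", diff, inv, diff)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha_sq)


def predict_group(x):
    """Return (group index, posterior probabilities) for each row of x."""
    log_joint = np.column_stack([
        np.log(w) + log_gaussian(x, m, c)
        for w, m, c in zip(weights, means, covs)
    ])
    # Normalise in log space for numerical stability.
    log_joint -= log_joint.max(axis=1, keepdims=True)
    post = np.exp(log_joint)
    post /= post.sum(axis=1, keepdims=True)
    return post.argmax(axis=1), post


# New samples that were not used for the grouping.
new_points = np.array([[0.2, -0.1], [4.8, 0.3], [2.4, 3.5]])
groups, posteriors = predict_group(new_points)
```

Evaluating `predict_group` on a grid of positions instead of three points would give a map of predicted groups over the whole plane.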

Variation of method

The mixture distribution method itself can be used for prediction. Cluster analysis techniques other than the mixture distribution method can also be used to predict groups.

The procedure is to run the cluster analysis, store the resulting group names in a new variable, and then treat that variable as the label for a Label Classification technique.

Many variations are possible by combining different cluster analysis methods with different label classification methods.
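The two-step procedure can be sketched as follows. This is only one possible combination, chosen for brevity: a plain k-means for the cluster analysis and a k-nearest-neighbour vote for the label classification. The data are synthetic, and the function names `kmeans` and `knn_predict` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic training data: three well-separated blobs (illustrative only).
centers_true = np.array([[0.0, 0.0], [6.0, 0.0], [3.0, 5.0]])
train = np.vstack([c + 0.3 * rng.standard_normal((30, 2)) for c in centers_true])


def kmeans(x, init_centers, n_iter=20):
    """Plain Lloyd's algorithm; returns (labels, centers)."""
    centers = init_centers.copy()
    for _ in range(n_iter):
        dist = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for k in range(len(centers)):
            if np.any(labels == k):
                centers[k] = x[labels == k].mean(axis=0)
    # Final assignment against the updated centers.
    dist = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
    return dist.argmin(axis=1), centers


def knn_predict(x_new, x_train, y_train, k=5):
    """Classify each row of x_new by majority vote of its k nearest samples."""
    dist = np.linalg.norm(x_new[:, None, :] - x_train[None, :, :], axis=2)
    nearest = np.argsort(dist, axis=1)[:, :k]
    return np.array([np.bincount(votes).argmax() for votes in y_train[nearest]])


# Step 1: cluster analysis produces a group label for every training sample.
# One sample from each blob is used as the initial center to keep this
# example deterministic.
cluster_labels, _ = kmeans(train, train[[0, 30, 60]])

# Step 2: the group labels become the target of a label classifier,
# which can then predict the group of samples not used for the grouping.
new_points = np.array([[0.1, 0.2], [5.8, -0.3], [3.2, 4.6]])
predicted = knn_predict(new_points, train, cluster_labels)
```

Either step can be swapped out independently, for example a different clustering method in step 1 or a decision tree in step 2.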

Prediction of outlier

As described above, group prediction tells you which group a sample belongs to. Depending on the purpose of the analysis, however, you may want to judge whether a sample is an outlier, that is, whether it belongs to no group at all. Group prediction cannot answer this, because it always assigns the sample to some group.

The question "Does it belong to no group?" can be answered by asking "Does it belong to this group?" for each group and collecting the results. This approach uses cluster analysis as an intermediate step for a One-Class Model.

In the explanation of the words "analysis" and "stratification", the expression "if you divide, you can understand" is used. Predicting outliers with cluster analysis is another case of "if you divide, you can understand". Cluster analysis is useful as an easy way to "divide".

If, for example, the mixture distribution MT method is used instead of the mixture distribution method to predict at every position, the result is the figure below. The value becomes higher the farther a position is from the center of each group, so you can see the size of the deviation. It can be used with a rule such as "If the value is XX or more, the sample is considered not to belong to any group".
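The idea of a deviation that grows with distance from every group center can be sketched as below. This is a simplified illustration rather than the exact mixture distribution MT procedure. It assumes that the deviation is the smallest squared Mahalanobis distance over the groups, divided by the number of variables as is customary in the MT method. The data, the group labels and the value of `THRESHOLD` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in for a clustering result: samples plus a group label for each
# (synthetic, illustrative only).
centers_true = np.array([[0.0, 0.0], [6.0, 0.0], [3.0, 5.0]])
samples = np.vstack([c + 0.5 * rng.standard_normal((50, 2)) for c in centers_true])
labels = np.repeat(np.arange(3), 50)

# Mean and inverse covariance matrix of each group.
group_stats = []
for g in np.unique(labels):
    members = samples[labels == g]
    group_stats.append((members.mean(axis=0),
                        np.linalg.inv(np.cov(members, rowvar=False))))


def deviation(x):
    """Smallest squared Mahalanobis distance over all groups,
    divided by the number of variables."""
    n_vars = x.shape[1]
    per_group = []
    for mean, inv_cov in group_stats:
        diff = x - mean
        per_group.append(np.einsum("ij,jk,ik->i", diff, inv_cov, diff) / n_vars)
    return np.min(per_group, axis=0)


THRESHOLD = 4.0  # hypothetical cut-off; choose it from your own data

# One point inside a group, one between the groups, one far from all of them.
points = np.array([[0.1, 0.1], [3.0, 2.0], [20.0, 20.0]])
scores = deviation(points)
belongs_to_no_group = scores >= THRESHOLD
```

Taking the minimum over the groups means that a sample is flagged only when it is far from every group, which matches the idea of checking each group in turn and collecting the results.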

Variation of method

There are variations in both the cluster analysis method and the judgment method.

Cluster analysis methods that can be used include the k-Means method and the mixture distribution method. DBSCAN and similar methods might also work, but although they can handle distributions with complicated shapes, there does not seem to be a judgment method that pairs well with them.

If you use the k-Means method or the mixture distribution method for the cluster analysis, you can judge by the Euclidean distance or the Mahalanobis distance from the center of each group.
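The choice between the two distances matters when a group is elongated. The sketch below uses a single synthetic group (an assumption made only for illustration) to compare two points that lie at roughly the same Euclidean distance from the group center, one along the long axis and one along the short axis.

```python
import numpy as np

rng = np.random.default_rng(2)
# One elongated synthetic group: wide along x, narrow along y.
group = rng.standard_normal((200, 2)) * np.array([3.0, 0.3])
center = group.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(group, rowvar=False))


def euclidean(x):
    """Euclidean distance of each row of x from the group center."""
    return np.linalg.norm(x - center, axis=1)


def mahalanobis(x):
    """Mahalanobis distance of each row of x from the group center."""
    diff = x - center
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, inv_cov, diff))


# First point lies along the long axis, second along the short axis.
points = np.array([[2.0, 0.0], [0.0, 2.0]])
d_euc = euclidean(points)
d_mah = mahalanobis(points)
```

The Euclidean distances of the two points are similar, while the Mahalanobis distance of the second point is much larger because it accounts for the small spread of the group along y.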

Mixture distribution MT

Mixture distribution MT is one of the methods used to predict outliers. It uses the mixture distribution method for the cluster analysis and the MT method for the judgment.

Choosing between this method and the method of creating outlier groups

The difference between predicting outliers using cluster analysis and Outlier detection with cluster analysis can be confusing.

If you have many samples and want to divide them into several groups plus a group of outliers, use the method on the Outlier detection with cluster analysis page.

Predicting outliers using cluster analysis is for the case where you have data that serves as the reference and want to check whether other data falls within it.

Software

Prediction of group

How to predict a group with the mixture distribution method in R is described on the Cluster analysis by R page.

Prediction of outlier

The mixture distribution MT method in R is described on the Analysis of anomaly quantification by R page.