
Standardization and Normalization with PCA

Standardization, normalization, and principal component analysis (PCA) are all used for data preprocessing. Each method transforms the original data and changes the range of its values.
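The two scalings are not defined on this page. The following is a minimal sketch, assuming that standardization means the z-score (mean 0, standard deviation 1) and that normalization means min-max scaling to the range 0 to 1. The function names and the sample values are illustrative only.

```python
import statistics


def standardize(values):
    """Z-score: subtract the mean, divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def normalize(values):
    """Min-max scaling: the smallest value maps to 0, the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical values for one variable
x = [10.0, 20.0, 30.0, 40.0, 50.0]
z = standardize(x)   # centered on 0, spread measured in standard deviations
m = normalize(x)     # confined to the interval from 0 to 1
```

Both transformations keep the order and relative spacing of the values. Only the origin and the unit of the axis change.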

Depending on the data, applying standardization or normalization after computing the principal components may make the characteristics of the data easier to see. The examples below compare each method used alone with the methods used in combination. In Example 1, each method is effective enough on its own, and combining them adds nothing. In Example 2, the combination has a large effect, and the subsequent analysis is difficult without it.

Below, we look at how the result of UMAP, used for Visualization by compressing high dimensions into two dimensions, changes with the preprocessing.

The principal component analysis used here is the version based on the correlation matrix.
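For two variables, correlation-matrix PCA can be written out by hand, because the 2x2 correlation matrix has a closed-form eigendecomposition. The sketch below is an illustration under that assumption, not the code used to produce the figures on this page. The function names and the data are hypothetical.

```python
import math
import statistics


def standardize(values):
    """Z-score using the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def pca_correlation_2d(x, y):
    """Correlation-matrix PCA for exactly two variables.

    The correlation matrix [[1, r], [r, 1]] has eigenvalues 1 + r and 1 - r,
    with eigenvectors (1, 1)/sqrt(2) and (1, -1)/sqrt(2).
    Returns (pc1_scores, pc2_scores, (eigenvalue1, eigenvalue2)).
    """
    zx, zy = standardize(x), standardize(y)
    r = sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)
    root2 = math.sqrt(2.0)
    along_sum = [(a + b) / root2 for a, b in zip(zx, zy)]   # variance 1 + r
    along_diff = [(a - b) / root2 for a, b in zip(zx, zy)]  # variance 1 - r
    if r >= 0:
        return along_sum, along_diff, (1 + r, 1 - r)
    return along_diff, along_sum, (1 - r, 1 + r)


# Hypothetical data: y is roughly 100 times x
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [110.0, 190.0, 320.0, 390.0, 510.0]
pc1, pc2, eigenvalues = pca_correlation_2d(x, y)
```

Because the variables are converted to z-scores before the rotation, the difference in scale between x and y has no influence on the components.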

Example 1: When the ranges of values on the two axes differ by an order of magnitude

The ranges of values on the two axes differ by an order of magnitude.
[Figure: data for Example 1]

Without pre-processing

The order-of-magnitude difference has a large effect: the data split into two groups, and A and B cannot be distinguished.

[Figure: result without pre-processing]
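One way to see why the scale difference matters: methods that rely on distances between points, such as UMAP with a Euclidean metric, let the axis with the larger range dominate. The sketch below uses four hypothetical points and min-max scaling to show how the nearest neighbor of a point can change once both axes are put on the same scale. It illustrates the general effect and does not reproduce the figure.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def normalize_columns(points):
    """Min-max scale every axis to the range 0..1."""
    scaled_columns = []
    for column in zip(*points):
        lo, hi = min(column), max(column)
        scaled_columns.append([(v - lo) / (hi - lo) for v in column])
    return [tuple(row) for row in zip(*scaled_columns)]


# Hypothetical points: x spans 0..1, y spans 0..10
points = [(0.0, 0.0), (1.0, 0.5), (0.0, 3.0), (1.0, 10.0)]
scaled = normalize_columns(points)

# Raw data: point 1, at the opposite end of x, looks closer to point 0
# than point 2 does, because differences in y outweigh differences in x.
raw_d01 = euclidean(points[0], points[1])
raw_d02 = euclidean(points[0], points[2])

# Scaled data: the ordering flips, because x now counts as much as y.
scaled_d01 = euclidean(scaled[0], scaled[1])
scaled_d02 = euclidean(scaled[0], scaled[2])
```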

With a single pre-processing method

From left to right: standardization only, normalization only, and principal component analysis (PCA) only.

In each case, the three groups are cleanly separated.

[Figure: three results, one per method]

Normalization after principal component analysis

The figure shows principal component analysis (PCA) followed by normalization.

The result is not noticeably different, because each method is already effective enough on its own.

[Figure: result for PCA followed by normalization]

Example 2: Two variables with high correlation and outliers that maintain the correlation

The data to be analyzed are shown below. B makes up the majority of the distribution, and A behaves like a group of outliers.
[Figure: data for Example 2]

Without pre-processing

Because the points are colored, you can see where A was placed. Without the color, however, the characteristics of the original data would not be visible.

[Figure: result without pre-processing]

With a single pre-processing method

From left to right: standardization only, normalization only, and principal component analysis (PCA) only.

In all three cases, the outlier group A forms a cluster that sits slightly apart from B.

[Figure: three results, one per method]

The plots below show the pre-processing result only.

[Figure: three plots of the pre-processed data, one per method]

Standardization or normalization after principal component analysis

On the left is principal component analysis (PCA) followed by standardization, and on the right is PCA followed by normalization.

The cluster of A is completely separated from B.

[Figure: two results, PCA followed by standardization and PCA followed by normalization]

The range of the y-axis is larger with normalization than with standardization.

The plots below show the pre-processing result only. Visually, these graphs look the same as the PCA-only graph, but the ranges of the axes are different.

The ratio "range of y-axis / range of x-axis" is 1.4 / 7 = 0.2 for PCA only, 14 / 5 = 2.8 for PCA followed by standardization, and 1 / 1 = 1 for PCA followed by normalization.

The order of these ratios is similar to the order of clarity of the graphs.

The reason the ratio for PCA followed by normalization is larger is that the number of samples in A is far smaller than in B.

[Figure: two plots of the pre-processed data, PCA followed by standardization and PCA followed by normalization]
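The axis-range ratios quoted above can be checked with a few lines of arithmetic. The sketch below only reuses the ranges stated in the text. The dictionary keys are labels chosen here for illustration.

```python
# Axis ranges (x, y) as quoted in the text
axis_ranges = {
    "PCA only": (7.0, 1.4),
    "PCA then standardization": (5.0, 14.0),
    "PCA then normalization": (1.0, 1.0),
}

# Ratio of the y-axis range to the x-axis range for each pre-processing
ratios = {
    name: y_range / x_range
    for name, (x_range, y_range) in axis_ranges.items()
}

# Sort from the largest ratio to the smallest
ranked = sorted(ratios, key=ratios.get, reverse=True)
for name in ranked:
    print(f"{name}: {ratios[name]:.1f}")
```

A larger ratio means that the second principal component, which has little variance after PCA alone, is stretched to occupy more of the plane.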

Principal component analysis after standardization or normalization

On the left is standardization followed by principal component analysis (PCA), and on the right is normalization followed by PCA.

The results are the same as when standardization, normalization, and PCA were each performed alone. This shows that the order matters when the methods are combined.

[Figure: two results, standardization followed by PCA and normalization followed by PCA]

The plots below show the pre-processing result only. They are the same as the PCA-only plot.

[Figure: two plots of the pre-processed data, standardization followed by PCA and normalization followed by PCA]
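A likely explanation for the effect of the order: correlation-matrix PCA converts each variable to a z-score internally, so any rescaling applied beforehand is cancelled, whereas rescaling applied afterwards changes the relative spread of the components. The sketch below checks this for two variables, assuming z-score standardization and min-max normalization. The function names and the data are hypothetical.

```python
import math
import statistics


def standardize(values):
    """Z-score using the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def normalize(values):
    """Min-max scaling to the range 0..1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def pca_correlation_2d(x, y):
    """Scores of correlation-matrix PCA for two variables (closed form)."""
    zx, zy = standardize(x), standardize(y)
    r = sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)
    root2 = math.sqrt(2.0)
    along_sum = [(a + b) / root2 for a, b in zip(zx, zy)]
    along_diff = [(a - b) / root2 for a, b in zip(zx, zy)]
    return (along_sum, along_diff) if r >= 0 else (along_diff, along_sum)


def max_abs_diff(scores_a, scores_b):
    """Largest absolute difference between two sets of PC scores."""
    return max(
        abs(a - b)
        for comp_a, comp_b in zip(scores_a, scores_b)
        for a, b in zip(comp_a, comp_b)
    )


# Hypothetical, highly correlated data
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

pca_only = pca_correlation_2d(x, y)
std_then_pca = pca_correlation_2d(standardize(x), standardize(y))
norm_then_pca = pca_correlation_2d(normalize(x), normalize(y))

# Scaling first changes nothing: the scores match those of PCA alone
diff_std = max_abs_diff(pca_only, std_then_pca)
diff_norm = max_abs_diff(pca_only, norm_then_pca)

# Scaling afterwards stretches PC2 until it is as wide as PC1
pca_then_std = tuple(standardize(component) for component in pca_only)
```

Here `diff_std` and `diff_norm` are zero up to rounding error, while the second component of `pca_then_std` has a standard deviation of 1 instead of the much smaller value it has after PCA alone.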

Software

R-EDA1

In R-EDA1, when using Visualization by compressing high dimensions into two dimensions, the above methods can be used alone or in combination. When they are combined, principal component analysis is applied first.