
Mahalanobis' Distance

Mahalanobis' Distance (MD) is a distance measure. It takes the correlation between variables into account.

I put this page under the section on the MT method, but MD is also used in Discriminant Analysis.

Calculation of MD

In this example, there are two variables, (x, y).
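As a sketch of the formula (my reconstruction, since the original figure is not reproduced here), the squared distance of a point (x, y) from the averages of x and y is

$$ MD^2 = \begin{pmatrix} x - \bar{x} & y - \bar{y} \end{pmatrix} \begin{pmatrix} s_x^2 & s_{xy} \\ s_{xy} & s_y^2 \end{pmatrix}^{-1} \begin{pmatrix} x - \bar{x} \\ y - \bar{y} \end{pmatrix} $$

where s_x^2 and s_y^2 are the variances of x and y, and s_{xy} is their covariance.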

Meaning of Average

The average is used as the center of the data, and the distance is measured from the center to each point.

MD and MD^2

In real analyses, I use MD^2 more often than MD. So in the discussion below, I often write "MD" when I actually mean "MD^2".

Various Definitions of Mahalanobis' Distance

There are various definitions of MD. I describe 5 types of definitions below. Each has its own meaning and use.

No.1 : Use "n - 1" for Covariance

On this site, I call the definition above "No.1". The covariance matrix is calculated with "n - 1":

$$ S = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T, \qquad MD^2 = (x_i - \bar{x})^T S^{-1} (x_i - \bar{x}) $$

The average of the square of MD is

$$ \overline{MD^2} = \frac{k(n - 1)}{n} $$

It is determined only by k (the number of variables) and n (the number of samples).

This average value can be used as a check in the analysis.

Before running the calculation, the data file "Data1.csv" must be in the folder "C:/Rtest".

setwd("C:/Rtest")
Data1 <- read.table("Data1.csv", header=T, sep=",")
Ave1 <- colMeans(Data1)
Var1 <- var(Data1)
MD1 <- mahalanobis(Data1, Ave1, Var1)
write.csv(MD1, file = "MD1.csv") # Write MD into csv file

The output of R's mahalanobis() is the square of MD.
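As a quick check of the average mentioned above (my own addition, assuming the code above has been run):

k <- ncol(Data1)   # number of variables
n <- nrow(Data1)   # number of samples
mean(MD1)          # average of the squared distances
k*(n-1)/n          # theoretical value for the No.1 definition; the two should match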

If we change "colMeans(Data1)", we can calculate other distance from except average point. If we change "var(Data1)", we can calculate using oather covariance matrix.

No.2 : No.1 and Normalized Data

In No.2, normalization (subtracting the average and dividing by the standard deviation) is done before the calculation of No.1. With the normalized data z, the formula of MD is simple:

$$ MD^2 = z^T R^{-1} z $$

where R is the correlation matrix.

The values of MD in No.2 are equal to those in No.1.

In No.2, checking for multicollinearity is easy because the covariance matrix of the normalized data is equal to the correlation matrix. Also, a variable with zero variance is detected during the normalization step.
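A minimal R sketch of the No.2 calculation (my own addition, assuming the same Data1 as above); the values should match MD1 from No.1:

Data2 <- scale(Data1)                                  # normalize: subtract averages, divide by standard deviations
Cor1 <- cor(Data1)                                     # correlation matrix = covariance matrix of the normalized data
MD2 <- mahalanobis(Data2, rep(0, ncol(Data2)), Cor1)   # the center of the normalized data is 0
all.equal(as.numeric(MD1), as.numeric(MD2))            # TRUE: No.2 equals No.1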

No.3 : Use "n" for Covariance

In general, "n - 1" is often used for the denominator of the covariance, and the calculations above use "n - 1".

In definition of No.3, "n" is used. Covariance is
MD2

If "n" is used, the average of the square of MD is

$$ \overline{MD^2} = k $$

where k is the number of variables.

In R, the part that calculates Var1 in No.1 is changed into
n <- nrow(Data1)             # Get the number of samples
Var1 <- var(Data1)*(n-1)/n   # Covariance matrix with the "n" denominator
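As a quick check (my own addition, continuing the code above), after Var1 is recalculated with the "n" denominator the average of the squared distances should be exactly k:

MD1 <- mahalanobis(Data1, Ave1, Var1)   # recalculate MD^2 with the new Var1
mean(MD1)                               # should be equal to ncol(Data1), i.e. k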

No.4 : No.3 / k

In the MT method, MD is

$$ MD^2 = \frac{1}{k} (x - \bar{x})^T S_n^{-1} (x - \bar{x}) $$

In this definition,

$$ \overline{MD^2} = 1 $$

In real analyses, we often change the combination of variables. In No.4, changing the number of variables does not strongly affect the scale of MD, because the average stays at 1. So this definition is convenient.

In R, the calculation of MD can be written in a single line:
MD1 <- mahalanobis(Data1, colMeans(Data1), var(Data1)*(nrow(Data1)-1)/nrow(Data1))/ncol(Data1)

Written step by step, the part of No.1 is changed into the lines below:
n <- nrow(Data1)                          # Number of samples
Var1 <- var(Data1)*(n-1)/n                # Covariance matrix with the "n" denominator
k <- ncol(Data1)                          # Number of variables
MD1 <- mahalanobis(Data1, Ave1, Var1)/k   # Divide the squared distance by k
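With this definition, the average of MD1 should be exactly 1. A quick check (my own addition):

mean(MD1)   # should be 1 for the No.4 definition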

No.5 : No.4 and No.2

In many cases, "n - 1" is used to calculate standard deviaion. No.2 uses "n - 1".

For No.4, to study data, I use "n" to calculate standard deviation. By n, covariance matrix is same to correlation matrix. Correlation matrix is useful.

In the Excel sample file on the page "Process of MT method", I use the No.5 type of calculation.
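The Excel file is on that page, but the same No.5 calculation can also be sketched in R (my own sketch, assuming the same Data1 and Ave1 as above):

n <- nrow(Data1)                                                # number of samples
k <- ncol(Data1)                                                # number of variables
Sd_n <- apply(Data1, 2, sd) * sqrt((n-1)/n)                     # standard deviations with the "n" denominator
Data5 <- sweep(sweep(as.matrix(Data1), 2, Ave1), 2, Sd_n, "/")  # normalized data
Cor1 <- cor(Data1)                                              # correlation matrix
MD5 <- mahalanobis(Data5, rep(0, k), Cor1) / k                  # No.4 style: divide by k
mean(MD5)                                                       # should be 1, as in No.4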

Average of Square of MD

It might seem strange that the average is such a simple number, and that it can be calculated from k and n alone.

Here I confirm the case of No.4; No.1 and No.3 are similar.

$$ \overline{MD^2} = \frac{1}{n} \sum_{i=1}^{n} MD_i^2 = \frac{1}{nk} \sum_{i=1}^{n} (x_i - \bar{x})^T S_n^{-1} (x_i - \bar{x}) = \frac{1}{nk} \, \mathrm{tr}\!\left( S_n^{-1} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T \right) $$

The last part is the trace of "B * A", where A = S_n^{-1} and B = (1/n) * sum of (x_i - x_bar)(x_i - x_bar)^T = S_n. "B * A" is the unit matrix, so its trace is k, and

$$ \overline{MD^2} = \frac{1}{nk} \cdot n \cdot \mathrm{tr}(S_n S_n^{-1}) = \frac{n \, \mathrm{tr}(I_k)}{nk} = \frac{nk}{nk} = 1 $$

Euclidean Distance

Euclidean distance is the distance measure we use in daily life.

If the covariance matrix is a unit matrix, MD is equal to Euclidean distance:

$$ MD^2 = (x - \bar{x})^T I^{-1} (x - \bar{x}) = \sum_{j=1}^{k} (x_j - \bar{x}_j)^2 $$

So MD is not completely different from Euclidean distance.
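A small check in R (my own addition, assuming the same Data1 and Ave1 as above): passing the unit matrix to mahalanobis() gives the squared Euclidean distance from the average.

Unit1 <- diag(ncol(Data1))                            # unit (identity) matrix
MD_e <- mahalanobis(Data1, Ave1, Unit1)               # MD^2 with a unit covariance matrix
ED2 <- rowSums(sweep(as.matrix(Data1), 2, Ave1)^2)    # squared Euclidean distance from the average
all.equal(as.numeric(MD_e), as.numeric(ED2))          # TRUE: they are the same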




NEXT Difference between MT method and Hotelling's theory
