
Mahalanobis' Distance

Mahalanobis' Distance (MD) is a kind of distance. Its feature is that it takes the correlation between variables into account.

I put this page under the section on the MT method, but MD is also used in Discriminant Analysis.

Calculation of MD

In this case, there are two variables, (x, y).
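As a reference for this two-variable case, the definition can be written as follows (my notation; Σ is the covariance matrix of x and y):

```latex
MD = \sqrt{\begin{pmatrix} x - \bar{x} & y - \bar{y} \end{pmatrix}
\Sigma^{-1}
\begin{pmatrix} x - \bar{x} \\ y - \bar{y} \end{pmatrix}},
\qquad
\Sigma = \begin{pmatrix} s_x^2 & s_{xy} \\ s_{xy} & s_y^2 \end{pmatrix}
```

The same form works for k variables, with the deviation vector and the covariance matrix sized accordingly.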

Meaning of Average

The average is used as the center of the data, and the distance is measured from the center to each point.

MD and MD^2

In real analysis, I use MD^2 more often than MD. So in the discussion, I often say "MD" when I actually mean MD^2.

Various Definitions of Mahalanobis' Distance

There are various definitions of MD. I describe 5 types of definitions below. The meaning and use of MD also vary with the definition.

No.1 : Use "n - 1" for Covariance

On this site, I call the definition above "No.1". The covariance uses "n - 1" in the denominator: s_xy = Σ(x - x̄)(y - ȳ) / (n - 1).

The average of the square of MD is (n - 1) * k / n.
It is calculated only from k (the number of variables) and n (the number of samples).

This average value could be used as a check in the analysis.

Before the calculation, the data file "Data1.csv" needs to be in the folder "C:/Rtest", and R's working directory must be set there (for example with setwd("C:/Rtest")).

Data1 <- read.table("Data1.csv", header=T, sep=",")  # Read the data
Ave1 <- colMeans(Data1)                              # Column averages (the center of the data)
Var1 <- var(Data1)                                   # Covariance matrix ("n - 1" denominator)
MD1 <- mahalanobis(Data1, Ave1, Var1)                # Squared MD of each sample
write.csv(MD1, file = "MD1.csv")                     # Write MD into csv file

Note that the output of R's mahalanobis() is the square of MD.
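The average property of No.1 (the average of the square of MD is (n - 1) * k / n) can be checked quickly. This is a minimal sketch with random data; the data and variable names here are my own illustration, not from "Data1.csv".

```r
# Check: with "n - 1" covariance, mean of squared MD is k * (n - 1) / n
set.seed(1)
n <- 50                                      # number of samples
k <- 3                                       # number of variables
X <- as.data.frame(matrix(rnorm(n * k), nrow = n))

MD2 <- mahalanobis(X, colMeans(X), var(X))   # squared MD, No.1 definition

mean(MD2)                                    # equals k * (n - 1) / n
```

The equality is exact (up to rounding), whatever data is used.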

If we change "colMeans(Data1)", we can calculate the distance from a point other than the average. If we change "var(Data1)", we can calculate using another covariance matrix.
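For example, the center can be replaced by any reference point. This sketch uses random data and an illustrative reference point "Ref1" (both are my assumptions, not from the original file):

```r
# Squared MD measured from a chosen reference point instead of the average
set.seed(1)
Data1 <- as.data.frame(matrix(rnorm(100 * 2), ncol = 2))
Ref1 <- c(1, 1)                                  # reference point replacing colMeans(Data1)
MD1 <- mahalanobis(Data1, Ref1, var(Data1))      # squared MD from (1, 1)
```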

No.2 : No.1 and Normalized Data

In No.2, normalization is done before the calculation of No.1. With normalized data, the formulation of MD becomes simple.

The value of MD in No.2 is equal to that in No.1.

In No.2, checking for multicollinearity is easy, because the covariance matrix of the normalized data is equal to the correlation matrix. Also, variables with 0 variance are found during the normalization step.
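The equality of No.1 and No.2 can be confirmed in R. This is a sketch with random data (my own illustration): scale() normalizes with the "n - 1" standard deviation, so the covariance matrix of the normalized data is the correlation matrix of the original data.

```r
set.seed(1)
Data1 <- as.data.frame(matrix(rnorm(60 * 3), ncol = 3))

# No.1: raw data with the covariance matrix
MD_no1 <- mahalanobis(Data1, colMeans(Data1), var(Data1))

# No.2: normalized data; its covariance matrix equals the correlation matrix
Z <- scale(Data1)                                # (x - mean) / sd
MD_no2 <- mahalanobis(Z, colMeans(Z), cor(Data1))

all.equal(as.numeric(MD_no1), as.numeric(MD_no2))   # the values are identical
```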

No.3 : Use "n" for Covariance

In general calculations, we often use "n - 1" for the denominator of the covariance. I use "n - 1" in the calculations above.

In the definition of No.3, "n" is used. The covariance is s_xy = Σ(x - x̄)(y - ȳ) / n.

If n is used, the average of the square of MD is k, where k is the number of variables.

In R, the part for Var1 in No.1 is changed into
n <- nrow(Data1)             # Get the number of samples
Var1 <- var(Data1)*(n-1)/n   # Covariance with "n" denominator

No.4 : No.3 / k

In the MT method, MD is defined as the No.3 value divided by k (the number of variables).

In this definition, the average of the square of MD is 1.

In real analysis, we change the combination of variables. In No.4, changing the number of variables does not strongly affect MD, so this definition is useful.

In R, the calculation of MD is changed into the line below.
MD1 <- mahalanobis(Data1, colMeans(Data1), var(Data1)*(nrow(Data1)-1)/nrow(Data1))/ncol(Data1)

Equivalently, the part from No.1 is changed into the lines below,
n <- nrow(Data1)             # Number of samples
Var1 <- var(Data1)*(n-1)/n   # Covariance with "n" denominator
k <- ncol(Data1)             # Number of variables
MD1 <- mahalanobis(Data1, Ave1, Var1)/k
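A quick check that the No.4 average is 1, independent of the number of variables (random data of my own, for illustration):

```r
set.seed(1)
Data1 <- as.data.frame(matrix(rnorm(30 * 5), ncol = 5))
n <- nrow(Data1)
k <- ncol(Data1)

Var1 <- var(Data1) * (n - 1) / n                     # "n" denominator covariance
MD4 <- mahalanobis(Data1, colMeans(Data1), Var1) / k # No.4 definition

mean(MD4)                                            # equals 1, regardless of k
```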

No.5 : No.4 and No.2

In many cases, "n - 1" is used to calculate the standard deviation. No.2 uses "n - 1".

In No.5, to study data in the No.4 style, I use "n" to calculate the standard deviation. With n, the covariance matrix of the normalized data is the same as the correlation matrix, and the correlation matrix is useful.

In the Excel sample file on the page "Process of MT method", I use the No.5 type of calculation.

Average of Square of MD

It might seem strange that the average is such a simple number, or that it can be calculated from only k and n.

I confirm this for the case of No.4. No.1 and No.3 are similar.


The last part of the formulation contains "B * A", and "B * A" is the unit matrix.
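Written out in my own notation, with A the inverse covariance matrix ("n" denominator, as in No.4) and B the average of the deviation products, so that B * A is the unit matrix:

```latex
\overline{MD^2}
= \frac{1}{n}\sum_{i=1}^{n}\frac{1}{k}\,(x_i-\bar{x})^{\top} A\,(x_i-\bar{x})
= \frac{1}{k}\,\mathrm{tr}\!\left(\underbrace{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})(x_i-\bar{x})^{\top}}_{B}\,A\right)
= \frac{1}{k}\,\mathrm{tr}(BA)
= \frac{1}{k}\,\mathrm{tr}(I_k)
= 1
```

The middle step uses the fact that a quadratic form equals the trace of the matrix product, and that the trace is unchanged by cyclic reordering.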

Euclidean Distance

Euclidean distance is the way of calculating distance that we use in daily life.

If the covariance matrix is a unit matrix, MD is equal to the Euclidean distance.
So MD is not completely different from the Euclidean distance.
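This relation is easy to see in R. A sketch with random data of my own: passing the unit matrix as the "covariance" makes the square root of mahalanobis() equal to the ordinary Euclidean distance.

```r
set.seed(1)
Data1 <- as.data.frame(matrix(rnorm(20 * 2), ncol = 2))

# MD with the unit matrix as the covariance, measured from the origin
MD_id <- sqrt(mahalanobis(Data1, c(0, 0), diag(2)))

# Ordinary Euclidean distance from the origin
ED <- sqrt(rowSums(Data1^2))

all.equal(as.numeric(MD_id), as.numeric(ED))   # the two distances are identical
```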

NEXT Difference between MT method and Hotelling's theory