Data Analysis by R

Decision tree by R

The decision tree is a powerful data mining method. Here, we introduce how to use it in R.

A decision tree can also be used for prediction as a machine learning method, but it is rarely used that way, because it tends to either overfit or be too coarse, and it is hard to tune it to something in between.

Types of R libraries

There are various libraries for building decision trees in R, each with its own advantages and disadvantages.

For practical use, the sample code on this site tries to work around those shortcomings.

The completeness of rpart

rpart is well made: it automatically determines the type of Y, building a classification tree when Y is qualitative data and a regression tree when Y is quantitative data. X can also be a mixture of qualitative and quantitative data.
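
If you want to force the type of tree instead of relying on this automatic choice, the method argument can be set explicitly. A minimal sketch, assuming a data frame Data that contains a column Y:

library(rpart)

treeModel <- rpart(Y ~ ., data = Data, method = "class")
# method = "class" forces a classification tree; method = "anova" forces a regression tree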

If you run CHAID or C5.0 as casually as rpart, an error occurs when the data contains logical or character columns. The CHAID and C5.0 code is therefore longer than the rpart code, because some preprocessing is needed to avoid this error.
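
For reference, a minimal sketch of that preprocessing, assuming a data frame Data that may contain character or logical columns:

for (i in 1:ncol(Data)) {
if (is.character(Data[,i]) || is.logical(Data[,i])) {
Data[,i] <- as.factor(Data[,i])
# CHAID and C5.0 need factors, so character and logical columns are converted
}
}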

Introductory decision tree

Here are introductory examples of a classification tree and a regression tree.

Classification tree

An example in R is as follows. (The following can be copied and pasted and used as it is. In this example, it is assumed that a folder named "Rtest" on the C drive contains a file named "Data.csv" before this code is run. In addition, the library "partykit" needs to be installed.)

Suppose the data has Y and 20 Xs. There are 90 rows.
(Figure: the data)

setwd("C:/Rtest")
library(partykit)

library(rpart)

Data <- read.csv("Data.csv", header=T)

treeModel <- rpart(Y ~ ., data = Data)

plot(as.party(treeModel))

(Figure: the resulting classification tree)

For the following data, the columns of quantitative data can be converted into qualitative data (binned) as follows.
(Figure: the data)
setwd("C:/Rtest")
Data <- read.csv("Data.csv", header=T)

for (i in 1:ncol(Data)) {

if (class(Data[,i]) == "numeric") {

Data[,i] <- droplevels(cut(Data[,i], breaks = 5, include.lowest = TRUE))
# Divide into 5 bins: the quantitative data is converted into qualitative data.
}

}
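
After this conversion, the tree can be fitted and plotted in the same way as above. A minimal sketch, assuming rpart and partykit are loaded as in the first example:

treeModel <- rpart(Y ~ ., data = Data)

plot(as.party(treeModel))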

(Figure: the classification tree after binning)

Regression tree

Suppose the data has Y and four Xs. There are 106 rows.
(Figure: the data)

The code is exactly the same as above; if Y is quantitative data, it is treated as a regression tree (see the sketch below).
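
A minimal sketch, repeating the earlier code for reference:

setwd("C:/Rtest")
library(partykit)

library(rpart)

Data <- read.csv("Data.csv", header=T)

treeModel <- rpart(Y ~ ., data = Data)
# Y is quantitative here, so rpart builds a regression tree

plot(as.party(treeModel))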
(Figure: the resulting regression tree)

N-ary tree

Here are examples of N-ary trees.

For qualitative data only (CHAID)

Before this code, you need to install the library "CHAID".

Suppose the data has Y and three Xs. There are 100 rows.
(Figure: the data)

setwd("C:/Rtest")
library(CHAID)

Data <- read.csv("Data.csv", header=T, stringsAsFactors=TRUE)

treeModel <- chaid(Y ~ ., data = Data)

plot(treeModel)

(Figure: the resulting CHAID tree)

When quantitative data is mixed in X and Y (CHAID)

Suppose the data has Y and three Xs. There are 100 rows. X1 is quantitative data.
(Figure: the data)

Note that this code can also be used when Y is quantitative data.

setwd("C:/Rtest")
library(CHAID)

Data <- read.csv("Data.csv", header=T, stringsAsFactors=TRUE)

for (i in 1:ncol(Data)) {

if (class(Data[,i]) == "numeric") {

Data[,i] <- droplevels(cut(Data[,i], breaks = 5, include.lowest = TRUE))

}

}

treeModel <- chaid(Y ~ ., data = Data)

plot(treeModel)

(Figure: the resulting CHAID tree)

The sample data "weather.numeric" in Weka was converted to csv format and the Y column name was changed, but the tree did not branch. I think the cause is that the data is small, with only 14 rows. Changing the parameters may change the result, but I have not tried that far.
(Figure: the CHAID result for the Weka data, which did not branch)
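
If you do want to try changing the parameters, chaid() accepts a control argument. A minimal sketch, assuming the chaid_control() settings of the CHAID package (the exact parameter names and defaults may differ by version):

ctrl <- chaid_control(minsplit = 5, minbucket = 2, alpha2 = 0.2, alpha4 = 0.2)
# Looser stopping and significance settings so that small data can still branch

treeModel <- chaid(Y ~ ., data = Data, control = ctrl)

plot(treeModel)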

When quantitative data is mixed in X and Y (C5.0)

C5.0 is a method that can be used even when quantitative and qualitative data are mixed. Splits on quantitative variables are binary, while splits on qualitative variables can be N-ary.

It can be used via R's C50 library, which is easy to install. In the code below, the columns that would otherwise be read as character strings are converted to factor type when the csv file is read (stringsAsFactors=TRUE). The five lines starting with for convert the columns that are read as logical type to factor type. Without this preprocessing, the C50 package raises an error.

setwd("C:/Rtest")
library(C50)

library(partykit)

Data <- read.csv("Data.csv", header=T, stringsAsFactors=TRUE)

if (class(Data$Y) == "numeric") {

Data$Y <- droplevels(cut(Data$Y, breaks = 5, include.lowest = TRUE))

}

for (i in 1:ncol(Data)) {

if (class(Data[,i]) == "logical") {

Data[,i] <- as.factor(Data[,i])

}

}

treeModel <- C5.0(Y ~ ., data = Data)

plot(as.party(treeModel))

The left is the result for the data used in "When quantitative data is mixed in X and Y (CHAID)" above. The right is the result for the Weka sample data "weather.numeric" converted to csv format with the Y column renamed. The result is almost the same as that of Weka's J48.
(Figures: the two resulting C5.0 trees)

Finding important variables with Random Forest

This is an example of a Random forest.

Before this code, you need to install the library "randomForst".

Suppose the data has Y and 20 Xs. There are 90 rows.
(Figure: the data)

setwd("C:/Rtest")
library(randomForest)

Data <- read.csv("Data.csv", header=T, stringsAsFactors=TRUE)

treeModel <- randomForest(Y ~ ., data = Data, ntree = 10)

varImpPlot(treeModel)


(Figure: the variable importance plot)

The method of obtaining predicted values is the same as for the other methods, as described on the Software for Prediction page (see the sketch below).
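
A minimal sketch, assuming a hypothetical data frame NewData that has the same X columns as the training data:

Output <- predict(treeModel, newdata = NewData)
# NewData is a hypothetical data frame of new X values; Output holds the predicted Y

Output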

Data mining by Random Forest

The randomForest library is built for finding important variables and for prediction with Random Forest. However, although you can inspect information about the individual trees created during the calculation, there seems to be no function that outputs them as tree diagrams.
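
For example, getTree() in the randomForest package shows the structure of one tree as a matrix of splits, not as a tree diagram. A minimal sketch, continuing from the model fitted above:

getTree(treeModel, k = 1, labelVar = TRUE)
# Structure of the 1st of the trees, with variable names instead of column indices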

So, I wrote code that outputs the structure of the trees.

It is based on rpart, CHAID, and C5.0, not on the randomForest code. Since a typical random forest uses binary trees, the rpart-based version should be closest to the trees built inside an ordinary random forest calculation. I also wanted N-ary trees, so I made CHAID-based and C5.0-based versions as well.

The point is how a different dataset is created for each tree. With this code, it is convenient that you can adjust that part yourself.

rpart-based random forest

Suppose the data has Y and 20 Xs. There are 90 rows.
(Figure: the data)

setwd("C:/Rtest")
library(rpart)

library(partykit)

Data <- read.csv("Data.csv", header=T)

ncolMax <- ncol(Data)

nrowMax <- nrow(Data)

DataY <- Data$Y

Data$Y <- NULL

for (i in 1:9) {

DataX <- Data[, sample(ncol(Data), floor(sqrt(ncolMax)))]
# Randomly select about the square root of the number of columns

Data1 <- transform(DataX, Y = DataY)

Data2 <- Data1[sample(nrowMax, floor(sqrt(nrowMax))), ]
# Randomly select about the square root of the number of rows
treeModel <- rpart(Y ~ ., data = Data2, minsplit = 3)

jpeg(paste("plot",i,".jpg"), width = 300, height = 300)

plot(as.party(treeModel))

dev.off()

}


Running this code creates one image file per tree in your working directory.
(Figure: examples of the output tree images)
As intended, we have also created a tree with three or more branches.

C5.0-based random forest

setwd("C:/Rtest")
library(C50)
library(partykit)
Data <- read.csv("Data.csv", header=T, stringsAsFactors=TRUE)
ncolMax <- ncol(Data)
nrowMax <- nrow(Data)
if (class(Data$Y) == "numeric") {
Data$Y <- droplevels(cut(Data$Y, breaks = 5, include.lowest = TRUE))
}
DataY <- Data$Y
for (i in 1:ncolMax) {
if (class(Data[,i]) == "logical") {
Data[,i] <- as.factor(Data[,i])
}
}
Data$Y <- NULL
for (i in 1:9) {
DataX <- Data[, sample(ncol(Data), floor(sqrt(ncolMax)))]
Data1 <- transform(DataX, Y = DataY)
Data2 <- Data1[sample(nrowMax, floor(sqrt(nrowMax))), ]
treeModel <- C5.0(Y ~ ., data = Data2)
jpeg(paste("plot",i,".jpg"), width = 300, height = 300)
plot(as.party(treeModel))
dev.off()
}

Running this code creates one image file per tree in your working directory.
(Figure: examples of the output tree images)
As intended, we have also created a tree with three or more branches.

CHAID-based random forest

The feature of the code below is that, in addition to using CHAID, it is the "column bagging" described on the ensemble learning page. That is, only columns are sampled, not rows.

It takes longer to calculate than the C5.0 based one.

setwd("C:/Rtest")
library(CHAID)
Data <- read.csv("Data.csv", header=T, stringsAsFactors=TRUE)
ncolMax <- ncol(Data)
nrowMax <- nrow(Data)
for (i in 1:ncolMax) {
if (class(Data[,i]) == "numeric") {
Data[,i] <- droplevels(cut(Data[,i], breaks = 5, include.lowest = TRUE))
}
}
DataY <- Data$Y
Data$Y <- NULL
for (i in 1:9) {
DataX <- Data[, sample(ncol(Data), floor(sqrt(ncolMax)))]
Data1 <- transform(DataX, Y = DataY)
# Data2 <- Data1[sample(nrowMax, floor(sqrt(nrowMax))), ]   # row sampling is not used (column bagging only)
Data2 <- Data1
treeModel <- chaid(Y ~ ., data = Data2)
jpeg(paste("plot",i,".jpg"), width = 600, height = 300)
plot(treeModel)
dev.off()
}


(Figure: examples of the output tree images)

Model tree

An example of a model tree (Cubist).

setwd("C:/Rtest")
library(Cubist)

library(ggplot2)

Data <- read.csv("Data.csv", header=T)

Ydata <- Data$Y

Data$Y<-NULL

Cu <- cubist(y = Ydata, x = Data)

summary(Cu)

(Figure: the summary output of the Cubist model)
Output <- predict(Cu,Data)

Data2 <- cbind(Data, Y = Ydata, Output)
# Y is added back so that it can be plotted against the predictions

ggplot(Data2, aes(x=Y, y=Output)) + geom_point()+xlab("Data Y")+ylab("Predicted Y")+ggtitle("Model Tree (Cubist)")


(Figure: scatter plot of data Y vs. predicted Y)

An example of limiting the number of rules to 3:
Cu <- cubist(y = Ydata, x = Data, control = cubistControl(rules = 3))
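
For reference, the number of committees (boosting-like model iterations) can also be set. A minimal sketch:

Cu <- cubist(y = Ydata, x = Data, committees = 5)
# 5 committee models are combined instead of a single rule set

summary(Cu)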