Data Analysis by R

About data analysis by R

Data Analysis by R was created as a spin-off from Data Science for Environment and Quality. It is a sister version of Data Analysis by Excel and Data Analysis by Python.

I had two motives for creating it.

First, Data Science for Environment and Quality was written for studying methods, so although it contains R examples, it is very hard to use as an everyday tool. The code you need is scattered, and even when it exists, it is hard to find. I wanted to fix this and make R easier to use.

Second, I wanted to make R usable even for people who do data analysis alongside their main job. R's appeal is that it can quickly reach outputs that cannot be reached with Excel or existing statistical software.

R is a very hard tool for people who analyze data alongside their main job. It takes considerable time to learn to use it without affecting your main work, so unless you do not mind that effort or can set the time aside, you cannot get over the threshold.

Sample code editing policy

Code for actual analysis

The content of this site is aimed at actual analysis. Combined topics such as "Principal component analysis and multidimensional scaling" appear without hesitation. In real analysis a single method is often not enough, but most explanations available elsewhere cover individual methods only, so this site also aims to bridge that gap.

For individual methods, you can look at the explanations on Data Science for Environment and Quality, and a search will turn up quite a few explanations elsewhere.

In exploratory data analysis, supervised learning techniques are rarely used

Supervised learning methods (regression analysis, pattern recognition) are useful when you just want to create a model that fits well.

Even in cause analysis, it is tempting to use the variable that represents the result as the teacher data (objective variable). However, with these methods the result is pulled by sampling bias in the prepared data and by quirks of the variables, and you may not see what you should see. I myself have taken quite a few detours this way when I thought I was taking a shortcut.

Therefore, for exploratory data analysis I choose methods that do not themselves narrow down what you look at.

Code that can be used with copy and paste

Some sample code available elsewhere requires you to rewrite the variable names to match your own data. However, in data analysis where speed matters, manual editing is inefficient and prone to error. Also, when there are many variables, it is hard to enter them by hand one by one.

Therefore, I try as much as possible to write sample code that does not require this kind of manual rewriting.

Some machine learning methods require the user to set various parameters, but as far as possible I choose methods that have none. Where a method does have parameters, I choose one that can be used with the defaults.
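As a rough illustration of this policy, the sketch below picks the quantitative columns automatically instead of naming them, then runs a principal component analysis. The small data frame and its column names are made up for illustration and stand in for your own `Data`; the choice of `prcomp` with only `scale. = TRUE` set is likewise an assumption, not code taken from this site.

```r
# Stand-in for your own data; in practice Data would come from read.csv().
Data <- data.frame(
  Height = c(150, 160, 170, 180, 175),
  Weight = c(50, 58, 66, 80, 72),
  Group  = c("A", "B", "A", "B", "A")
)

# Select the quantitative columns without typing any variable name.
is_num <- sapply(Data, is.numeric)
NumData <- Data[, is_num, drop = FALSE]

# Principal component analysis; scale. = TRUE standardizes each variable first.
pc <- prcomp(NumData, scale. = TRUE)
summary(pc)
```

Because no column name appears after the data is loaded, the same lines can be pasted unchanged for a file with different variables.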

Common points of sample code

setwd("C:/Rtest")

The sample code assumes that the input data is in a folder called "Rtest" on the C drive, and this line sets that folder as the working directory. If you create a folder with this name in advance and put the file you want to analyze in it, you can use the sample code without any changes.

Data <- read.csv("Data.csv", header = TRUE)

The input data is assumed to be a csv file. If you prepare it in Excel, the data must start in column A. The top row is assumed to hold the variable names (column names).

A "csv file" can be created by selecting the save format when saving in Excel.
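The two common lines can be tried end to end with a sketch like the following. To keep it runnable on any machine, a temporary folder stands in for "C:/Rtest", and the tiny file contents (Temp, Pressure, Line) are invented for illustration only.

```r
# A temporary folder stands in for "C:/Rtest" so the sketch runs anywhere.
work_dir <- file.path(tempdir(), "Rtest")
dir.create(work_dir, showWarnings = FALSE)
old_dir <- setwd(work_dir)

# Write a tiny csv: variable names in the top row, data starting in the first column.
writeLines(
  c("Temp,Pressure,Line", "20.5,101,A", "21.0,99,B", "19.8,100,A"),
  "Data.csv"
)

# Read the file exactly as the sample code does.
Data <- read.csv("Data.csv", header = TRUE)
str(Data)  # shows the column names and the type of each column

setwd(old_dir)  # restore the previous working directory
```

Checking the result with `str(Data)` right after reading is a quick way to confirm that the top row was taken as column names.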

There is a lot of R sample code available, but it often uses "manually entered data", "random number data", or "R sample data", so when you want to apply the code to your own data, you may not know how to handle the input.

On this site, I wrote the sample code so that you do not have to spend time worrying about that.

Common errors

Load library

Some R libraries must be loaded in the code with "library (...)" every time, and some need not be.

The sample code includes the "library (...)" lines that have to be written every time.

Also, R's basic libraries are installed together with R itself, but other libraries are downloaded from the Internet by the user as needed.

Once downloaded, a library can be used just like one that was installed initially.

The sample code assumes that every library it needs has already been downloaded. Therefore, if a library has not been downloaded, an error will occur.

If you get this error, download the library from CRAN. Once it is downloaded, the error will not appear again.
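One possible way to handle a missing library is sketched below: check whether it is installed, download it from CRAN only if it is not, and then load it. The package name "MASS" is only a placeholder chosen for illustration; replace it with the name shown in your error message.

```r
# "MASS" is only an example name; use the package named in the error message.
pkg <- "MASS"

# Download from CRAN only if the package is not installed yet.
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# Load the package; character.only = TRUE lets library() take the name from a variable.
library(pkg, character.only = TRUE)
```

The download step runs only the first time, which matches the behavior described above: after one download the error no longer appears.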

Automatic selection of variable types

R automatically decides whether each variable is quantitative (numeric type) or qualitative (character type) and processes it accordingly. Even if you regard a variable as quantitative, if even one value contains a character other than a number, the whole variable is treated as qualitative and you may not be able to perform the analysis you expected.
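The following sketch shows one way to spot this problem. The inline data and the stray value "4a" are invented for illustration; with your own file you would run the same checks on `Data` after reading it.

```r
# Column y contains "4a", so the whole column is not read as numeric.
Data <- read.csv(text = "x,y\n1,2\n2,3\n3,4a", header = TRUE)

# Check the type R assigned to each column.
sapply(Data, class)

# Find the entries of y that cannot be read as numbers.
y_text <- as.character(Data$y)
y_num <- suppressWarnings(as.numeric(y_text))
bad_rows <- which(is.na(y_num) & !is.na(y_text))

# Show the offending rows so they can be fixed in the source file.
Data[bad_rows, ]
```

Fixing the offending cells in the csv file and reading it again is usually simpler than converting the column afterwards.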

Variable name "X1, X2"

R sometimes outputs matrix data without column names. When the "transform" process is applied to such data, the column names "X1, X2 ..." are automatically given to the columns that have none.

"Transform" is a process that combines matrix data. If the data being combined already contains the column names "X1, X2", the names would be duplicated, so the duplicated "X1, X2" names are changed to something different. Because this can be confusing, it is better to avoid "X1, X2" as column names. Names such as "X01, X02" do not cause this problem.
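The naming behavior can be reproduced with a small sketch. It uses base R's `data.frame()` directly to show the automatic names; that this matches what happens inside the "transform" step is an assumption, and the small matrices and values are made up for illustration.

```r
# A matrix without column names, such as one returned by an analysis function.
m <- matrix(1:6, nrow = 3)

# Converting it to a data frame assigns the names X1, X2 automatically.
converted <- data.frame(m)
names(converted)

# Combining with data that already uses X1, X2 forces a rename to avoid duplicates.
other <- data.frame(X1 = c(10, 20, 30), X2 = c(0.1, 0.2, 0.3))
clash <- data.frame(other, m)
names(clash)

# With X01, X02 there is no clash and every name is kept as it is.
other2 <- data.frame(X01 = c(10, 20, 30), X02 = c(0.1, 0.2, 0.3))
no_clash <- data.frame(other2, m)
names(no_clash)
```

Printing `names()` after each combination makes it easy to see whether any column was silently renamed.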