Data Analysis by Python

Visualizing the entire dataset with Python

This is a way to look at all of the variables at once when the data has multiple variables. If the rows are arranged in time order, this becomes a form of time series analysis.

Qualitative variables are converted to dummy variables so that they can be handled by the same method. If one column contains 3 categories, 3 columns of 0/1 data are created from it.

In the example below, the variables X1 to X9 are quantitative variables and the variable X10 is a qualitative variable.
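As a rough illustration of the dummy conversion described above, the sketch below uses a small made-up table. The column names X1 and X10 and the category labels A, B, and C are hypothetical and are not the contents of Data.csv.

```python
import pandas as pd

# Hypothetical data: X1 is quantitative, X10 is qualitative with 3 categories.
df = pd.DataFrame({
    "X1": [1.2, 3.4, 2.2, 5.0],
    "X10": ["A", "B", "C", "A"],
})

# Each category of X10 becomes its own 0/1 column; X1 is left unchanged.
# dtype=float is an explicit choice here so the dummy columns are numeric.
df2 = pd.get_dummies(df, dtype=float)

print(list(df2.columns))  # ['X1', 'X10_A', 'X10_B', 'X10_C']
print(df2.shape)          # (4, 4)
```

The one qualitative column with 3 categories has become 3 columns, so the table grows from 2 columns to 4.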

Line graph by variable

Line graphs by variable fall roughly into two approaches: overlaying all variables on one graph, or drawing a separate graph for each variable. With the per-variable approach shown below, drawing takes a long time when there are many variables, and depending on the PC it may freeze. I haven't found an easy way to make this kind of graph in R. If you want to use it, it is easier to use Python with pandas plot (matplotlib), or Excel sparklines.

Overlay on one graph

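import os
import pandas as pd

os.chdir("C:\\PyTest")  # Move to the working folder
df = pd.read_csv("Data.csv", engine='python')  # Read the data
df2 = pd.get_dummies(df, dtype=float)  # Dummy-convert qualitative variables into numeric 0/1 columns
df2.plot()  # Overlay all variables on one line graph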

Separate the graph for each variable, but match the Y-axis range

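import os
import pandas as pd

os.chdir("C:\\PyTest")  # Move to the working folder
df = pd.read_csv("Data.csv", engine='python')  # Read the data
df2 = pd.get_dummies(df, dtype=float)  # Dummy-convert qualitative variables into numeric 0/1 columns
df2.plot(subplots=True, sharey=True)  # One graph per variable, all sharing the same Y-axis range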

The Y-axis range changes for each graph

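import os
import pandas as pd

os.chdir("C:\\PyTest")  # Move to the working folder
df = pd.read_csv("Data.csv", engine='python')  # Read the data
df2 = pd.get_dummies(df, dtype=float)  # Dummy-convert qualitative variables into numeric 0/1 columns
df2.plot(subplots=True)  # One graph per variable, each with its own Y-axis range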
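When there are many variables, one option is to arrange the per-variable graphs in a grid instead of a single tall stack. The sketch below is only an illustration under assumptions: the data is randomly generated, and the 2 x 2 grid and figure size are arbitrary choices. `layout` and `figsize` are standard arguments of pandas `DataFrame.plot`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 4 numeric variables, 50 rows.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 4)), columns=["X1", "X2", "X3", "X4"])

# One graph per variable, arranged in a 2 x 2 grid with a shared Y-axis range.
axes = df.plot(subplots=True, layout=(2, 2), sharey=True, figsize=(8, 6))

# axes is an array holding one matplotlib Axes object per grid cell.
print(axes.size)
```

Plotting only a subset of columns at a time, for example `df[["X1", "X2"]].plot(subplots=True)`, is another way to keep the drawing time down.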

Heat map

Heat map of the data as it is

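import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter Notebook magic: show graphs inside the notebook
%matplotlib inline
sns.set(font='HGMaruGothicMPRO')  # Set the font (a Japanese font, so Japanese labels can be displayed)
os.chdir("C:\\PyTest")  # Move to the working folder
df = pd.read_csv("Data.csv", engine='python')  # Read the data
df2 = pd.get_dummies(df, dtype=float)  # Dummy-convert qualitative variables into numeric 0/1 columns
sns.heatmap(df2)  # Draw the heat map (rows = samples, columns = variables)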

Standardize the data, then draw a heat map

Each variable is standardized to mean 0 and standard deviation 1 and then graphed. Even when the data includes variables with very different scales, you can see how each variable behaves.

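import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing

# Jupyter Notebook magic: show graphs inside the notebook
%matplotlib inline
sns.set(font='HGMaruGothicMPRO')  # Set the font (a Japanese font, so Japanese labels can be displayed)
os.chdir("C:\\PyTest")  # Move to the working folder
df = pd.read_csv("Data.csv", engine='python')  # Read the data
df2 = pd.get_dummies(df, dtype=float)  # Dummy-convert qualitative variables into numeric 0/1 columns
df3 = preprocessing.scale(df2)  # Standardize each column to mean 0, standard deviation 1 (returns a NumPy array)
sns.heatmap(df3)  # Draw the heat map of the standardized data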
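The standardization used above can also be written directly in pandas, which keeps the column names so they appear on the heat map axis. This is a minimal sketch on a made-up two-column table, not the author's data. `ddof=0` is chosen to match the population standard deviation that `preprocessing.scale` uses.

```python
import pandas as pd

# Hypothetical data with very different scales.
df2 = pd.DataFrame({
    "X1": [1.0, 2.0, 3.0, 4.0],
    "X2": [100.0, 300.0, 200.0, 400.0],
})

# Standardize each column: subtract its mean, then divide by its standard deviation.
df3 = (df2 - df2.mean()) / df2.std(ddof=0)

# After standardization every column has mean 0 and standard deviation 1
# (up to floating-point rounding).
print(df3.mean())
print(df3.std(ddof=0))
```

Because `df3` is still a DataFrame, it can be passed to `sns.heatmap` in the same way as the array. Note that a column whose values are all identical has standard deviation 0, so this formula would produce NaN for it.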

Normalize the data, then draw a heat map

Each variable is normalized to a minimum value of 0 and a maximum value of 1 and then graphed. The effect is similar to standardization. When qualitative variables are mixed in, this version makes their 0 and 1 values easier to see.

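import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing

# Jupyter Notebook magic: show graphs inside the notebook
%matplotlib inline
sns.set(font='HGMaruGothicMPRO')  # Set the font (a Japanese font, so Japanese labels can be displayed)
os.chdir("C:\\PyTest")  # Move to the working folder
df = pd.read_csv("Data.csv", engine='python')  # Read the data
df2 = pd.get_dummies(df, dtype=float)  # Dummy-convert qualitative variables into numeric 0/1 columns
df3 = preprocessing.minmax_scale(df2)  # Normalize each column to the range 0 to 1 (returns a NumPy array)
sns.heatmap(df3)  # Draw the heat map of the normalized data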
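To make the effect of this normalization concrete, the sketch below applies the same min-max formula by hand in pandas. The table is made up for illustration: X1 stands for a quantitative variable and X10_A for a dummy column.

```python
import pandas as pd

# Hypothetical data: one quantitative column and one 0/1 dummy column.
df2 = pd.DataFrame({
    "X1": [1.0, 2.0, 3.0, 5.0],
    "X10_A": [1.0, 0.0, 0.0, 1.0],
})

# Min-max normalization: (value - column minimum) / (column maximum - column minimum).
df3 = (df2 - df2.min()) / (df2.max() - df2.min())

print(df3["X1"].tolist())     # [0.0, 0.25, 0.5, 1.0]
print(df3["X10_A"].tolist())  # [1.0, 0.0, 0.0, 1.0]
```

The dummy column is unchanged because its minimum and maximum are already 0 and 1, which is why the 0 and 1 values stay easy to see on the heat map.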

Line graph that can be enlarged

With Plotly, you can zoom in on part of a graph. This is convenient for time-series data with many oscillations: at full scale the waveform is squashed and hard to read, but you can zoom in on any region. Plotly is also attractive because it is very responsive.

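import os
import pandas as pd
import plotly.express as px
import plotly.io as pio

os.chdir("C:\\PyTest")  # Move to the working folder
df = pd.read_csv("Data2.csv", engine='python')  # Read the data
df['X'] = df.index  # Use the row number as the X-axis value
fig = px.line(x=df['X'], y=df['Y'])  # Create a line graph of Y against the row number
fig.show()  # Display the graph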
[Figure: Plotly line graph]
* This image is a screenshot of the Jupyter Notebook screen, so it cannot be zoomed. On the Jupyter Notebook screen itself, you can zoom in and out.