
Data Analysis of Bad Data

Data with outliers, missing values, multicollinearity, uneven distributions, bias, or inconsistent formats is often called "bad data" or "dirty data".

Also, when the number of variables (columns in the data table) or the number of samples (rows in the data table) is small, many people give up on data analysis, thinking that they cannot learn anything from it. Such data is also bad data in a broad sense.

Then, what is data that is not bad? One example is data measured in a place such as a university laboratory, where the measurement environment and measurement methods are controlled as strictly as possible.

The data used in practical data analysis, such as in business and social research, is usually bad in some respect. Discussions of bad data are scattered across this site, so this page summarizes them.

Guidelines and basic ideas for proceeding with data analysis

Where a link is given, the linked page has the full story; this page gives only a quick summary.

Outliers and missing values

I sometimes see explanations that simply remove outliers and missing values as a countermeasure. This approach ignores the outliers and missing values and looks only at the data that remains.

However, for data with outliers or missing values, investigating the reasons why the values are outlying or missing comes first.
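
As a minimal sketch of "investigate first", the function below flags missing and outlying values together with their positions instead of deleting them, so that each one can be traced back to its cause. The function name, the sample readings, and the 1.5 x IQR rule are my own assumptions for illustration; the rule is one common convention, not the only way to define an outlier.

```python
import math
import statistics

def flag_for_investigation(values, k=1.5):
    """Return (index, value, reason) for every missing or outlying entry.

    Nothing is removed: the result is a to-do list for investigation.
    The k * IQR fence is an assumed convention, not a fixed rule.
    """
    present = [v for v in values if v is not None and not math.isnan(v)]
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr

    flagged = []
    for i, v in enumerate(values):
        if v is None or math.isnan(v):
            flagged.append((i, v, "missing"))
        elif v < low or v > high:
            flagged.append((i, v, "outlier"))
    return flagged

# Hypothetical sensor readings with one gap and one suspicious spike.
readings = [10.1, 9.8, None, 10.3, 55.0, 10.0, 9.9, 10.2]
flagged = flag_for_investigation(readings)
for index, value, reason in flagged:
    print(index, value, reason)
```

The positions matter as much as the values: knowing that the gap and the spike sit next to each other in time, for example, is often the first hint about what happened.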


Multicollinearity

When there is multicollinearity, sparse modeling tries to build a model that handles it mechanically.

However, in data with multicollinearity, the multicollinearity can be used as a clue to better understand the background of the data. If you start data analysis from that perspective, the final analysis results will be more solid.
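
One way to start using multicollinearity as a clue is to list which pairs of variables move together, and then ask why. The following is a minimal sketch assuming the columns are held as plain Python lists; the column names, the values, and the 0.9 threshold are made up for illustration.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def strongly_correlated_pairs(columns, threshold=0.9):
    """List variable pairs whose |r| is at or above the (assumed) threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, round(r, 3)))
    return pairs

# Hypothetical process data.
columns = {
    "temperature": [20.0, 22.0, 24.0, 26.0, 28.0, 30.0],
    "pressure": [1.01, 1.10, 1.21, 1.30, 1.41, 1.50],
    "humidity": [40.0, 55.0, 35.0, 60.0, 45.0, 50.0],
}
pairs = strongly_correlated_pairs(columns)
print(pairs)
```

Each pair in the list is a question about the background of the data: do the two variables share a cause, is one derived from the other, or were they simply always changed together during operation?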

By the way, if you want data without multicollinearity so that you can rigorously examine the effect of each variable, one approach is to run experiments using Design of Experiments. However, it is necessary to confirm how well the conclusions obtained from artificially collected experimental data apply to what is actually happening.

Uneven distribution

Normal distribution

Many well-known statistical methods assume a normal distribution.

For this reason, some people think that a method should not be used if normality cannot be confirmed by normality testing. Others think that we should use the median instead of the mean, or that we should use non-parametric tests.

However, as I wrote on the page "It's not a normal distribution, what should I do?", if the purpose is to solve problems that occur in business, methods that assume a normal distribution are often useful enough.

By the way, when the distribution is not uneven and has a smooth mountain shape but still looks different from the normal distribution, it is good to use a distribution derived from the normal distribution or a generalized linear mixed model.

High variability

It would not be normal for the same person to get values that differ by a few kilograms every time they step on a scale. With a measurement that is less familiar to you, however, such large variability may well be present.

In such cases, rather than analyzing the data, it is necessary to look at the specific values and how they change, and then investigate the object itself and the method of measurement. Improving these may resolve the problem.


Bias

Take a data analysis of "ice cream sales" as an example. I think anyone would find it strange if I used only summer data to say "this is what will happen in winter".

In the field of machine learning, this is called "model degradation", and it too is caused by data bias.

Survey data does not include the opinions of "people who did not answer the survey" or "people who were not targeted by the survey", so it too is biased. As a basic way of thinking, it is good to assume that "the data is biased somewhere, but I don't know how". Doing so makes it easier to implement measures that do not rely excessively on the results of data analysis.
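
The ice cream example can be sketched in a few lines. A straight line is fitted to summer data only, and every prediction is returned together with a flag that says whether the input lies inside the range the data actually covered. The numbers and function names are hypothetical, and the range check is only a crude guard that I am assuming for illustration; it catches bias in the observed range of one variable, not bias in general.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

def predict_with_range_check(x_new, x_train, slope, intercept):
    """Return (prediction, inside), where inside is False when extrapolating."""
    inside = min(x_train) <= x_new <= max(x_train)
    return slope * x_new + intercept, inside

# Hypothetical summer-only data: temperature (deg C) and daily sales.
summer_temp = [26.0, 28.0, 30.0, 32.0, 34.0]
summer_sales = [200.0, 240.0, 280.0, 320.0, 360.0]
slope, intercept = fit_line(summer_temp, summer_sales)

print(predict_with_range_check(29.0, summer_temp, slope, intercept))
print(predict_with_range_check(5.0, summer_temp, slope, intercept))
```

For a winter temperature of 5 degrees the fitted line returns negative sales. The model fits the summer data perfectly and is still useless for winter, which is exactly the situation described above.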

1 variable (small data 1)

I sometimes meet people who think, about data with only one variable, that "the data analysis is easy" or that "there is not much I can learn or do".

However, as with the analysis of sensor data, even one variable can be quite difficult. And that hard work can lead to big results.

Small number of samples (small data 2)

Occasionally, I meet people who think that an analysis is not valid if the number of samples is small.

However, statistics was originally developed at a time when it was difficult to obtain a large number of samples. By using tests and estimations, you can get hints for grasping the current situation and considering the next move from a small sample.
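
To illustrate that a test can still say something with very few samples, here is a sketch of an exact permutation test on the difference of two group means. This is a technique I have chosen for the example because it needs no distributional assumption and runs with the standard library alone; the page itself does not prescribe a specific test, and the measurements are invented.

```python
from itertools import combinations

def permutation_test(a, b):
    """Exact two-sided permutation test on the difference of means.

    Enumerates every way of splitting the pooled values into groups of
    the original sizes, so it is only practical for small samples.
    """
    pooled = a + b
    n_a, n_b = len(a), len(b)
    total = sum(pooled)
    observed = abs(sum(a) / n_a - sum(b) / n_b)

    count = 0
    extreme = 0
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        count += 1
        # Small tolerance so the observed split itself is always counted.
        if diff >= observed - 1e-12:
            extreme += 1
    return extreme / count

# Hypothetical measurements: four samples before and four after a change.
before = [12.1, 11.8, 12.4, 12.0]
after = [13.0, 13.3, 12.9, 13.4]
p_value = permutation_test(before, after)
print(round(p_value, 4))
```

With four samples per group there are only 70 possible splits, and only 2 of them are as extreme as the observed one, so the p-value is 2/70. That is a hint for deciding the next move, not a proof, which matches how small samples are meant to be used here.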

Data is not managed uniformly

Data can also be bad in ways such as "data is spread across different databases", "different words are used for the same meaning", "numeric data and character data are mixed", "there are multiple formats for time data", and "time data is out of sync".

This kind of bad data is bad in a sense that has nothing to do with statistical rigor. For such bad data, careful preprocessing is the countermeasure.

By the way, many explanations of data analysis say that "90% of the work is preprocessing". That preprocessing is the countermeasure against this kind of bad data.
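
A small sketch of this kind of preprocessing, using only the Python standard library, is shown below. The list of time formats, the synonym table, and the function names are all assumptions made for this example; in practice they have to be built by looking at the actual data. Values that cannot be interpreted are returned as None so that they can be investigated rather than silently guessed.

```python
from datetime import datetime

# Assumed formats and vocabulary; replace with what the real data contains.
TIME_FORMATS = ["%Y-%m-%d %H:%M:%S", "%Y/%m/%d %H:%M", "%d.%m.%Y %H:%M"]
SYNONYMS = {"ng": "defective", "bad": "defective", "ok": "good", "pass": "good"}

def parse_time(text):
    """Try each known time format; return None if none of them match."""
    for fmt in TIME_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    return None

def parse_number(text):
    """Convert text such as '1,234.5' to float; return None for non-numbers."""
    try:
        return float(str(text).replace(",", "").strip())
    except ValueError:
        return None

def normalize_label(text):
    """Map different words with the same meaning onto one label."""
    key = text.strip().lower()
    return SYNONYMS.get(key, key)

print(parse_time("2023/04/01 09:30"))
print(parse_time("01.04.2023 09:30"))
print(parse_number("1,234.5"), parse_number("N/A"))
print(normalize_label(" NG "), normalize_label("Pass"))
```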

Common ideas in data analysis of bad data

The idea of "what is data?"

Even with bad data, to do data analysis that contributes to problem solving, I think the most important thing is to keep in mind that the data at hand represents only a small part of what is happening and cannot explain everything.

Use qualitative hypothesis exploration and systems thinking to uncover areas not covered by data.

Visualize data with graphs

When you study various methods, you tend to become preoccupied with numerical values such as model accuracy (degree of fit). Data analysis of bad data starts from the premise that the data is bad, so it is easy to become discouraged. Even so, to do data analysis that leads to problem solving, increase the certainty of the result by combining multiple approaches.

Combine not only mathematical processing and numerical judgments, but also graphical statistics.

Purpose of data analysis

The more you study statistics, the more you tend to judge whether a data analysis is "good" or "bad" based on the assumptions of statistics. People tend to say, "You can't use that method unless you test normality and confirm that a normal distribution holds." Since bad data is included as a matter of course, if you insist on statistical rigor, it becomes difficult to make progress in data analysis using statistics.

To avoid falling into this trap, the success criterion should be "whether or not the problem is solved". As for statistics, it is enough to be able to say "it helped solve the problem" or "it gave a hint for solving the problem".

Data analysis without a model

With bad data, it often happens that a good model cannot be built. If you were in a situation where you had to create a highly accurate model from such data, you would come to a dead end.

But, as the Data Science Jobs page says, all too often the model doesn't matter for the purpose of problem solving.

What matters for problem solving is not what the model can do, but what you notice while creating the model, such as "this data behaves like this".

Mismatch between the name of a discipline and the scope of that discipline

There are fields such as "causal inference", "time series analysis", and "quality engineering".

Details are explained on each topic's page on this site. Each of these fields has developed as an academic discipline, and its scope is narrower than its name suggests. As a result, there are many causal problems in the world that you cannot solve even if you master the explanations of "causal inference". The same thing happens in time series analysis and quality engineering.

It is rare that these disciplines can be applied directly to business problems or problems occurring in factories in a way that moves the data analysis forward. In that case, it is not the data that is bad, but rather the name of the discipline.

Data analysis pitfalls of good data

On this page, we have discussed data analysis of bad data on the premise that "bad data is commonplace". By the way, even when the data is statistically ideal, that is, good data, it is worth keeping the same premise in mind, because good data has pitfalls of its own.

Data collected in design of experiments

Data collected by design of experiments is ideal in that the variables are designed to be independent of one another.

However, to repeat the point above, it is necessary to confirm whether the data collected by design of experiments represents what is actually happening. Sometimes an experiment does things that do not normally happen, and as a result the data does not tell us what we want to know.

In addition, it is sometimes possible to "create a data set by collecting, from a huge amount of data, only the samples that fit the conditions planned by the experimental design". However, something may be lost by removing the samples that do not fit.
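
If you do build a data set this way, one simple precaution is to keep the removed samples and look at them, instead of discarding them unseen. The sketch below assumes a full-factorial design described by a dictionary of factor levels; the function name, factor names, and rows are hypothetical.

```python
from itertools import product

def split_by_design(rows, levels):
    """Split rows into those matching a full-factorial design and the rest.

    levels maps each factor name to the list of levels planned for it.
    A full-factorial design is an assumption made for this sketch.
    """
    factors = list(levels)
    design = set(product(*(levels[f] for f in factors)))
    kept, dropped = [], []
    for row in rows:
        key = tuple(row[f] for f in factors)
        if key in design:
            kept.append(row)
        else:
            dropped.append(row)
    return kept, dropped

# Hypothetical operating records.
rows = [
    {"temp": 100, "speed": 1, "y": 5.0},
    {"temp": 150, "speed": 2, "y": 7.1},
    {"temp": 120, "speed": 1, "y": 6.0},
    {"temp": 150, "speed": 1, "y": 6.8},
]
levels = {"temp": [100, 150], "speed": [1, 2]}

kept, dropped = split_by_design(rows, levels)
print(len(kept), "kept;", len(dropped), "dropped")
print(dropped)
```

How many rows were dropped, and under what conditions they occurred, gives a rough idea of how much of the real operation the designed data set no longer represents.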

Sampled data

If the original data is biased, or the phenomenon that is occurring is biased, the sampled data may be biased.
