Statistics tells us that "the more samples we have, the more precise the analysis becomes."
But the benefit of analyzing big data is not explained by this logic alone. On this page, I describe the strengths and weaknesses of big data in my own way.
With one data point, we can estimate the average. With two data points, we can also estimate the spread of the distribution.
In science, there are cases where it is very difficult to obtain even one data point.
But, generally, it is difficult to confirm a conclusion drawn from only one or two data points, so we often add data. The more samples we have, the more precise the conclusion. The logic of estimation explains this mathematically. This logic was developed in an age when data sets were small.
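The mathematical side of "more samples, more precision" can be sketched with the standard error of the mean, which shrinks with the square root of the sample size. The following is a minimal illustration, assuming independent measurements with a known standard deviation. The function name standard_error and the value sigma = 10 are hypothetical and chosen only for this example.

```python
import math

def standard_error(sigma: float, n: int) -> float:
    """Standard error of the mean of n independent measurements
    that each have standard deviation sigma."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return sigma / math.sqrt(n)

# sigma = 10.0 is an assumed value for illustration, not a value from the text.
for n in (1, 100, 10_000, 1_000_000):
    print(n, standard_error(10.0, n))
```

Each 100-fold increase in data size makes the estimate of the average only 10 times more precise, which is one reason why adding data alone does not explain the value of big data.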
When the data size exceeds about 10,000, the problems in the old logic become remarkable.
Hypothesis testing says that "a p-value of less than 0.05 is the standard for confirming a difference."
But in the analysis of big data it is common to find that "the p-value is less than 0.000000001," because for the same observed difference the p-value tends to become smaller as the data size becomes bigger.
Usually, such output is not useful, because the difference found by the test is often too small to matter in practice.
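The effect of data size on the p-value can be sketched as follows. This is a simplified illustration, assuming a two-sided z-test with equal group sizes and a known common standard deviation. The function name two_sample_p_value and the numbers (a mean difference of 0.1 with a standard deviation of 1.0) are assumptions for this example only.

```python
import math

def two_sample_p_value(mean_diff: float, sd: float, n_per_group: int) -> float:
    """Two-sided p-value of a z-test for the difference between two group means,
    assuming equal group sizes and a known common standard deviation."""
    se = sd * math.sqrt(2.0 / n_per_group)  # standard error of the difference
    z = abs(mean_diff) / se
    return math.erfc(z / math.sqrt(2.0))    # equals 2 * (1 - Phi(z))

# Assumed values for illustration: the difference is only 0.1 standard deviations.
MEAN_DIFF = 0.1
SD = 1.0

for n in (100, 10_000, 1_000_000):
    p = two_sample_p_value(MEAN_DIFF, SD, n)
    print(f"n per group = {n:>9}  p-value = {p:.3g}  difference / sd = {MEAN_DIFF / SD}")
```

The difference relative to the standard deviation stays the same in every row. Only the p-value changes, so a very small p-value by itself says little about whether the difference is practical.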
The kinds of errors are an example of the weakness of estimation.
With bigger data, the evaluation of accidental (random) errors becomes better. But systematic errors do not change. The problem of systematic errors is difficult because their causes lie outside the field of statistics.
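The difference between the two kinds of errors can be sketched with a small simulation. It assumes that every measurement equals the true value plus a fixed bias (the systematic error) plus normally distributed noise (the accidental error). The function name simulate_mean_error and all parameter values are hypothetical.

```python
import random
import statistics

def simulate_mean_error(n: int, true_value: float = 10.0, bias: float = 0.5,
                        noise_sd: float = 1.0, seed: int = 0) -> float:
    """Return (sample mean - true value) for n simulated measurements.
    Each measurement = true value + fixed bias + random noise."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    measurements = [true_value + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]
    return statistics.fmean(measurements) - true_value

# The error of the mean settles near the bias; it does not approach zero.
for n in (10, 1_000, 100_000):
    print(n, round(simulate_mean_error(n), 3))
```

As the data size grows, the noise averages out, but the error of the average stays close to the assumed bias of 0.5. More data cannot remove it.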
"Measured under the same conditions" is the precondition for data in statistics. The logic of hypothesis testing and of estimation using the normal distribution is useful for such data.
But when we analyze big data, it may have been "measured under complex conditions." Even if statistical software can process such data, the conclusion may mislead us.
Stratified sampling is useful for treating data measured under complex conditions as data measured under the same conditions. But this method is difficult when we cannot classify the data.
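The idea of stratification can be sketched as splitting the records by a known condition label and summarizing each stratum separately. This is a minimal illustration, assuming the condition of every record is known. The function name summarize_by_stratum, the labels "A" and "B", and the values are made up for this example.

```python
import statistics
from collections import defaultdict

def summarize_by_stratum(records):
    """records: iterable of (condition_label, value) pairs.
    Returns {label: (mean, population standard deviation)} for each stratum."""
    groups = defaultdict(list)
    for label, value in records:
        groups[label].append(value)
    return {label: (statistics.fmean(values), statistics.pstdev(values))
            for label, values in groups.items()}

# Made-up data: two measurement conditions mixed into one data set.
records = [("A", 10), ("A", 11), ("A", 9), ("A", 10),
           ("B", 20), ("B", 21), ("B", 19), ("B", 20)]

pooled_sd = statistics.pstdev([value for _, value in records])
summary = summarize_by_stratum(records)

print("pooled standard deviation:", round(pooled_sd, 3))
for label, (mean, sd) in summary.items():
    print(label, "mean:", mean, "standard deviation:", round(sd, 3))
```

In the pooled data the spread is dominated by the difference between the conditions. Inside each stratum the data behave like data measured under the same conditions. When no label is available, this split cannot be made, which is the difficulty described above.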
When there are only a few data points, the difference between two groups is not clear from graphs. In this case, analysis using the p-value is useful.
When there is much data, the graph shows the difference clearly. So the p-value is not the main tool of this analysis.
For large data, graphs of distribution and line graphs are useful.
I often introduce the idea of the decision tree on this site. It is a method for numerically analyzing data measured under complex conditions.
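The core step of a decision tree can be sketched as a search for the single threshold that best separates the data into two groups. The following is a simplified illustration of one split of a regression tree, not a full decision tree implementation. The function names and the data are hypothetical.

```python
import statistics

def sum_squared_error(values):
    """Sum of squared deviations of the values from their own mean."""
    mean = statistics.fmean(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(xs, ys):
    """Find the threshold on x that minimizes the total squared error of y
    when each side of the threshold is described by its own mean.
    Returns (threshold, total squared error)."""
    pairs = sorted(zip(xs, ys))
    best_threshold, best_sse = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        sse = sum_squared_error(left) + sum_squared_error(right)
        if sse < best_sse:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_sse = sse
    return best_threshold, best_sse

# Made-up data: y jumps when the condition x goes from 4 to 5.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [10, 11, 9, 10, 20, 21, 19, 20]
print(best_split(xs, ys))
```

The split recovers the boundary between the two conditions from the numbers alone. A decision tree repeats this kind of search inside each resulting group.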
When I explain the reason for the power of big data using old-style statistics, I also need to explain the weaknesses. So big data may look like troublesome data with no benefit.
But when I analyze big data visually, I often find interesting phenomena. These phenomena are often difficult to explain with statistical values such as the average and the standard deviation.
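Why a graph can show what the average and the standard deviation hide can be sketched with two small made-up data sets. They have the same average and the same standard deviation, but their distributions look different. The function name text_histogram is hypothetical, and the data sets are tiny only to keep the example readable.

```python
import statistics
from collections import Counter

def text_histogram(values, bin_width):
    """Count the values in each bin; each key is the lower edge of a bin."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Made-up data: both sets have average 5 and standard deviation 1.
one_peak = [3, 4, 5, 5, 5, 5, 5, 5, 6, 7]
two_peaks = [4, 4, 4, 4, 4, 6, 6, 6, 6, 6]

for name, data in (("one_peak", one_peak), ("two_peaks", two_peaks)):
    print(name, "average:", statistics.fmean(data),
          "standard deviation:", statistics.pstdev(data))
    for edge, count in text_histogram(data, 1).items():
        print(f"{edge:>3} {'#' * count}")
```

The statistical values are identical, but the histogram of the second data set shows two separate groups. With real big data, a graph of the distribution can reveal this kind of structure directly.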