The page on bad data analysis says that what the data tells us is only a fraction of what is actually happening.
Dark data is data we cannot see. Simply saying "it is invisible" does not explain what we should be careful about, so I drew a picture of how I imagine it.
Here, dark data means the part of what is actually happening that cannot be captured by "data".
Drawn as a picture, invisible dark data looks like this to me.
That is my mental image, but dark data also changes significantly over a span of about 10 years.
In the figure I also changed the colors, to express the feeling that "the number is the same, but its meaning is changing."
In machine learning, we collect data, build a model, and then use the model.
Dark data changes significantly over a period of about 10 years, but the data collected to build a model covers only a few hours at the shortest and about three months at the longest.
This is often called "model degradation": the accuracy of the model gradually changes as the dark data changes.
Should we then prepare about 10 years of data? Not necessarily. Even when a good model can be built from a short period of data, it often happens that no single model can be settled on for a long period. In that case, the countermeasure is to keep updating the model on short periods of data, rather than building it once and leaving it alone.
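The idea of updating the model over short periods can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data is synthetic, the slowly drifting `slope` stands in for a change in dark data that is never recorded, and the names `fit_line`, `simulate`, and `window` are hypothetical. It compares a straight-line model built once and left alone with one that is refit on the most recent observations only.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares fit of y = a*x + b; returns (a, b)."""
    a, b = np.polyfit(x, y, 1)
    return a, b

def simulate(n_steps=1000, window=50, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, n_steps)
    # Hypothetical drift: the slope relating x to y changes slowly over time,
    # standing in for a change in dark data that appears in no recorded column.
    slope = np.linspace(1.0, 3.0, n_steps)
    y = slope * x + rng.normal(0.0, 0.05, n_steps)

    # Model A: built once from the first window and then left alone.
    a0, b0 = fit_line(x[:window], y[:window])

    err_fixed, err_rolling = [], []
    for t in range(window, n_steps):
        # Model B: refit on the most recent `window` observations only.
        a1, b1 = fit_line(x[t - window:t], y[t - window:t])
        err_fixed.append(abs(a0 * x[t] + b0 - y[t]))
        err_rolling.append(abs(a1 * x[t] + b1 - y[t]))
    return float(np.mean(err_fixed)), float(np.mean(err_rolling))

fixed_mae, rolling_mae = simulate()
print(f"fixed model mean absolute error:   {fixed_mae:.3f}")
print(f"rolling model mean absolute error: {rolling_mae:.3f}")
```

In this toy setting the fixed model's error grows as the hidden slope drifts away from the value it was trained on, while the rolling model stays close to the current relationship. In practice the window length is a trade-off: a window that is too short fits noise, and one that is too long averages over the very change it is meant to follow.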
In a factory, machine settings are adjusted on a daily basis to keep conditions at their best, and human workers adjust how much force they apply.
These adjustments make it possible to respond flexibly to change.
Although it depends on the field, industrial products generally go through model changes, and new products are developed one after another. These changes occur at intervals of a few months at the shortest and several years at the longest.
For this reason, even if the dark data changes, problems are less likely to occur, because production ends before the effects of the change begin to appear.
However, a problem may occur when someone says, "We have an order, so we will make a product from 3 years ago for the first time in a long while." Even though the product should be made with exactly the same materials and under exactly the same conditions as when it was made successfully in the past, incidents occur in which defective products come out.
As for how to deal with such problems, in my own experience the only way was to examine what exactly was being done, and how, in the production and measurement methods, and to connect that to solving the problem.
A person who had worked in the same factory for decades once said, "A factory is a monster." In data science terms, dark data is a monster for data scientists.