Why there will be a shortage of data in 2 years and what we can do about it


About ten years ago, we talked about “big data”: the huge amounts of data that were becoming available and how to extract value from them. Since then, the volume of data has grown explosively: according to Statista, 90% of all data was collected in the last two years alone. And the growth is far from over. This increase is driven on the one hand by our internet consumption behavior, and on the other by growing computing power, which both raises the demand for data and generates new data. It has also created a major problem: data quality is declining significantly, and high-quality data is becoming increasingly scarce. Epoch.ai even predicts that we will have “consumed” all available data by 2026.
Compare it to a gas supply: if more gas is consumed than produced, scarcity occurs. The same is now happening with data, with one important nuance: there is plenty of data, but high-quality, useful data is lacking. Good-quality data is becoming scarcer and is being depleted. How does this happen?
There are two main reasons for the scarcity of high-quality data. First, with the massive rise of AI and language models, the demand for data has grown exponentially; data is, after all, the fuel for AI. Second, the rise of synthetic data has worsened the situation. Synthetic data is data created or derived by AI, such as AI-generated images or texts. This data is often used in turn as training data for AI models, which creates a vicious cycle: if a language model gives an incorrect answer, that output can still end up in (re)training data, further degrading the quality of both the data and the models.
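To make this feedback loop concrete, here is a minimal, purely illustrative simulation (not taken from any specific AI system; all numbers are assumptions chosen for the example): a “model”, reduced here to a simple Gaussian distribution, is repeatedly refitted on samples of its own synthetic output instead of fresh real-world data.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Generation 0: "real" human-generated data.
data = rng.normal(loc=0.0, scale=1.0, size=100)
mu, sigma = data.mean(), data.std()

for generation in range(1, 21):
    # Each new generation trains only on synthetic samples
    # drawn from the previous generation's model.
    synthetic = rng.normal(loc=mu, scale=sigma, size=100)
    mu, sigma = synthetic.mean(), synthetic.std()
    if generation % 5 == 0:
        print(f"generation {generation:2d}: mean={mu:+.3f}, std={sigma:.3f}")

# Over many generations the fitted spread tends to shrink and the mean
# to drift: the model gradually loses the diversity of the original data.
```

Real language models degrading on their own output is of course far more complex than this toy example, but the direction is the same: without fresh, high-quality human data, each generation inherits and amplifies the flaws of the previous one.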
There is a huge demand for datasets with unique, high-quality data. Data collected directly from human behavior in the physical world is essential here. One example is the extensive photo and film archive of the British broadcaster BBC, which has been approached by tech companies seeking access to millions of recordings that have never been broadcast. These images and audio recordings are crucial for the further development of image generators such as DALL-E and Midjourney, and for training AI models to recognize specific objects.
Another example is the multi-million dollar partnership between Google and Universal Music, which gives Google access to audio recordings and the rights to use them. This too is aimed at obtaining high-quality input for further developing AI models, for example for speech recognition. Companies that collect unique data will be able to earn a lot of money selling it in the coming years. The importance of good data will only increase, because AI only works optimally if the data is in order.
Preventing bias in AI is essential, and that is only possible with the right, high-quality data. Bias occurs when the data used to train AI reflects existing prejudices or is not representative of the population it describes. Such bias carries over into the AI's results, leading to undesirable and discriminatory outcomes. By using high-quality, diverse, and representative data, bias can be minimized as much as possible.
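As a rough illustration of what “representative data” can mean in practice, the sketch below compares group proportions in a training set against reference proportions and flags under-represented groups. The group labels, counts, and the 80% threshold are all hypothetical assumptions for the example, not a standard method.

```python
from collections import Counter

# Hypothetical training examples, each tagged with a demographic group.
training_groups = ["A"] * 700 + ["B"] * 250 + ["C"] * 50

# Reference proportions (e.g., in the real population) that the
# training data should roughly match.
reference = {"A": 0.50, "B": 0.30, "C": 0.20}

counts = Counter(training_groups)
total = sum(counts.values())

for group, target in reference.items():
    observed = counts.get(group, 0) / total
    # Flag a group if it falls below 80% of its target share
    # (an arbitrary threshold chosen for illustration).
    status = "OK" if observed >= 0.8 * target else "UNDER-REPRESENTED"
    print(f"group {group}: observed {observed:.0%} vs target {target:.0%} -> {status}")
```

Checks like this are only a starting point; real bias audits also look at labels, outcomes, and how the data was collected. But they show why representative data is a measurable requirement, not just a slogan.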
The future of AI depends heavily on the availability of high-quality data. As the amount of data continues to grow, its quality decreases, which is a major challenge for the development of reliable AI systems. It is crucial to invest in collecting and maintaining high-quality data so that we can continue to develop AI in a way that is useful and ethical. Companies that succeed in this will play an important role in the future of technology and data analysis.

