In 2006, British mathematician and entrepreneur Clive Humby made the astute observation that “data is the new oil”. Over the 15 years that followed, tech companies like Facebook and Google amassed enormous amounts of data through their free services. This data generally sits in big data stores, often called data lakes.
Data lake: What you need to know
A data lake is a centralised repository that allows organisations to store all of their structured and unstructured data at any scale. The easiest way to picture it is to look at the file system on your computer and all the data stored in it. Now imagine the same for an organisation with hundreds or thousands of employees and customers, with its data stored in giant data centres and scalable cloud networks.
A data lake allows its users to store their data as-is, without first having to structure it. This includes structured data from relational databases (rows and columns); semi-structured data such as CSV, logs, XML and JSON; unstructured data such as emails, documents and PDFs; and binary data such as images, audio and video.
An organisation can either build a data lake on-premises or build a cloud-based data lake with a provider such as Amazon, Microsoft or Google. Once the data is stored in a data lake, organisations can run different types of analytics, ranging from dashboards and visualisations to real-time analytics, big data processing and even machine learning.
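To make the idea of storing data as-is more concrete, here is a minimal sketch in Python. It assumes a local folder stands in for the lake's storage, and the function name `ingest_raw`, the folder layout and the `.meta.json` sidecar file are illustrative choices rather than any standard. The file is copied byte for byte, and a small metadata record is written next to it so the data can be found and understood later.

```python
import hashlib
import json
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def ingest_raw(source: Path, lake_root: Path, source_system: str) -> Path:
    """Copy a file into the lake unchanged and write a metadata sidecar."""
    # Hypothetical layout: <lake>/raw/<source_system>/<YYYY-MM-DD>/<filename>
    today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    target_dir = lake_root / "raw" / source_system / today
    target_dir.mkdir(parents=True, exist_ok=True)

    # The bytes are stored as-is: no parsing, no schema, no conversion.
    target = target_dir / source.name
    shutil.copy2(source, target)

    # Basic metadata recorded alongside the raw file.
    metadata = {
        "source_system": source_system,
        "original_name": source.name,
        "size_bytes": target.stat().st_size,
        "sha256": hashlib.sha256(target.read_bytes()).hexdigest(),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    sidecar = target_dir / (target.name + ".meta.json")
    sidecar.write_text(json.dumps(metadata, indent=2))
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        sample = tmp_path / "clicks.json"
        sample.write_text('{"user": 1, "page": "/home"}')
        stored = ingest_raw(sample, tmp_path / "lake", "web")
        print("stored at:", stored.relative_to(tmp_path))
```

The same function works for a CSV export, a log file or an image, because nothing in it depends on the file's format. A production lake would more likely write to cloud object storage than to a local folder.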
Data lake: why do organisations need it
There was once a time when every company wanted to be a tech company. Uber and Tesla are good examples: both are in the transportation industry, but often describe themselves as tech companies. Today, every company wants to be a data company. It is widely believed that organisations that generate business value from their data will outperform their peers.
A data lake allows organisations and their leaders to run new types of analytics, such as big data processing and machine learning, over new sources of data stored in the lake, such as log files, click-streams, social media and connected devices. These analytics allow companies to identify and tap into opportunities for business growth faster than their rivals.
Advanced analytics can help organisations to attract and retain customers, boost productivity, make informed decisions and maintain devices. A data lake is also useful for making organisational data from different sources available to various end-users, such as business analysts, data engineers, data scientists, product managers and executives.
Dr. Kirk Borne, Principal Data Scientist & Data Science Fellow, Booz Allen Hamilton said, “With the data lake, business value maximisation from data is within every organisation’s reach.”
Difference between a data lake and a data warehouse
The similarity between a data lake and a data warehouse ends with the fact that both are data repositories. Beyond that, they differ, and a typical organisation will need both a data lake and a data warehouse to serve different needs and use cases.
The major difference is that a data warehouse holds highly structured data, whereas a data lake supports all types of data. A data lake also offers massive scale, so data that might only be analysed in the future can easily be kept. A data warehouse, by contrast, comes with storage limitations, so organisations generally remove data they consider irrelevant.
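The contrast between highly structured storage and storing all types of data can be sketched in a few lines of Python. This is a toy illustration, not how any particular warehouse or lake product works: an in-memory SQLite table stands in for the warehouse, a plain list of JSON strings stands in for the lake, and the sample records and function names are made up for the example.

```python
import json
import sqlite3

# Three raw records with different shapes (invented sample data).
raw_events = [
    '{"user_id": 1, "amount": 9.5}',
    '{"user_id": 2, "amount": 3.0, "coupon": "SPRING"}',
    '{"device": "sensor-7", "temp_c": 21.4}',
]


def load_into_warehouse(lines):
    """Warehouse style: a fixed structure is enforced when data is written."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (user_id INTEGER NOT NULL, amount REAL NOT NULL)"
    )
    rejected = []
    for line in lines:
        record = json.loads(line)
        try:
            # Only the two known columns are kept; extra fields are dropped.
            conn.execute(
                "INSERT INTO orders VALUES (?, ?)",
                (record["user_id"], record["amount"]),
            )
        except KeyError:
            # Records that do not fit the table cannot be stored at all.
            rejected.append(line)
    return conn, rejected


def read_from_lake(lines, fields):
    """Lake style: every raw line is kept; fields are chosen only on read."""
    rows = []
    for line in lines:
        record = json.loads(line)
        if all(field in record for field in fields):
            rows.append({field: record[field] for field in fields})
    return rows


if __name__ == "__main__":
    conn, rejected = load_into_warehouse(raw_events)
    print("rows in warehouse:", conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])
    print("rejected by warehouse:", len(rejected))
    print("sensor readings from lake:", read_from_lake(raw_events, ["device", "temp_c"]))
```

In this sketch the warehouse table loses the `coupon` field and cannot hold the sensor reading, while the raw lines still contain both and can be queried later for a question nobody planned for.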
Another key difference between a data warehouse and a data lake is cost. The intensive data management that data warehouses require makes them more expensive to maintain than data lakes. A data lake is also beneficial for those working with metadata, allowing users to gain basic insights quickly. Data warehouses don’t offer similar flexibility for working with metadata.
Use cases
The biggest advantage of a data lake is the ability to store the data in its raw form without having to worry about structuring the data. This allows organisations to run their advanced analytics or mining software to glean useful insights from that data. Here are some of the ways in which data lakes are used by organisations around the world.
- Improved interactions with customers: A data lake allows organisations to combine customer data from a CRM platform with social media analytics and a marketing platform to understand the most profitable customer cohorts, the causes of customer churn and opportunities, such as promotions or rewards, to increase loyalty.
- Improved R&D choices: A data lake can help R&D teams to test their hypotheses, refine assumptions and assess results. This includes doing genomic research to create more effective medication, choosing the right materials in product design and understanding whether customers will pay for different attributes.
- Improved operational efficiency: A data lake makes it easier for organisations to store information from IoT devices and run analytics on machine-generated IoT data. This helps them discover ways to reduce operating costs and increase quality and efficiency.
What are the challenges with a data lake
One of the biggest challenges with a data lake is ensuring that the large repository of data does not turn into a data swamp. In 2015, David Needle called out data lakes as one of the more controversial ways to manage big data. Even PwC has noted in its research that not all data lake initiatives are successful.
“We see customers creating big data graveyards, dumping everything into the Hadoop distributed file system (HDFS) and hoping to do something with it down the road. But then they just lose track of what’s there,” Sean Martin, CTO of Cambridge Semantics, was quoted by PwC in its research.
With poor planning or management, a data lake will essentially turn into a data swamp with degraded value. A business implementing a data lake should prepare for a number of challenges, the most important of which is setting business priorities. While a data lake can store any kind of data, it is not ideal to throw everything into it in the belief that it will provide value in the future.
Another challenge is designating use cases and end-users for a data lake. A data lake should ideally hold accurate data, be fit for a purpose and cater to the people capable of working with it. Building a robust data ingestion process into the design of the data lake is another challenge that businesses need to tackle early.
The last hurdle for an effective data lake is maintaining good communication with all the stakeholders. A data lake should not be an opaque store of data, and businesses must make sure that all stakeholders know how and why to use the data in it.
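One simple way to keep track of what is in a lake, and who it is for, is a catalogue that records an owner and a purpose for each dataset. The Python sketch below is a hypothetical illustration of that idea: `CatalogEntry` and `find_uncatalogued` are names invented for this example, and a local folder again stands in for the lake. It lists every file that nobody has registered, which are the files most likely to end up forgotten.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass(frozen=True)
class CatalogEntry:
    path: str     # location relative to the lake root
    owner: str    # team or person responsible for the data
    purpose: str  # the use case the data is kept for


def find_uncatalogued(lake_root: Path, catalog: List[CatalogEntry]) -> List[str]:
    """Return the files in the lake that have no catalogue entry."""
    known = {entry.path for entry in catalog}
    missing = []
    for file in sorted(lake_root.rglob("*")):
        if file.is_file():
            relative = file.relative_to(lake_root).as_posix()
            if relative not in known:
                missing.append(relative)
    return missing


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "crm").mkdir()
        (root / "crm" / "customers.csv").write_text("id\n1\n")
        (root / "dump.bin").write_bytes(b"\x00")

        catalog = [
            CatalogEntry("crm/customers.csv", "sales-team", "churn analysis"),
        ]
        print("no owner or purpose recorded for:", find_uncatalogued(root, catalog))
```

A check like this could run as part of ingestion, so that data without a stated owner and purpose is flagged before it piles up. Dedicated data catalogue tools cover the same ground with far more features.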