A data lake is a relatively recent concept that has grown in popularity with the advent of big data technologies. Big data technologies and cloud infrastructure have made it cheaper and more efficient to store and process huge volumes of data. They also make it possible to store and process unstructured data, and to process data in real time. This has opened up new possibilities for companies.

What is a Data Lake?

A data lake is a collection of all enterprise data on big data platforms. It involves collecting and storing large volumes of enterprise data in its raw format, from multiple data sources inside or outside the organization, without having to define the exact business requirements up front.
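As a minimal sketch of what "storing data in its raw format" can look like in practice, the Python snippet below writes incoming payloads to disk byte for byte, grouped by source and ingestion date. The function name land_raw, the source names, and the directory layout are hypothetical choices made for illustration only. A real data lake would typically sit on a distributed file system or cloud object storage rather than a local folder.

```python
import tempfile
from datetime import date
from pathlib import Path
from typing import Optional
from uuid import uuid4


def land_raw(lake_root: Path, source: str, payload: bytes,
             ingest_date: Optional[date] = None) -> Path:
    """Store one payload unchanged under raw/source=<source>/ingest_date=<date>/.

    The layout is a hypothetical convention for this sketch, not a standard.
    """
    day = (ingest_date or date.today()).isoformat()
    target_dir = lake_root / "raw" / f"source={source}" / f"ingest_date={day}"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{uuid4().hex}.bin"
    # No parsing and no schema check: the bytes are kept exactly as received.
    target.write_bytes(payload)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Two made-up sources with different formats land side by side.
        crm_file = land_raw(root, "crm", b'{"customer_id": 1, "name": "Ada"}')
        log_file = land_raw(root, "web_logs", b"GET /index.html 200\n")
        for path in (crm_file, log_file):
            print(path.relative_to(root))
```

Note that nothing in the sketch needs to know what the data means or how it will be used. Interpretation is deferred until someone reads the files back, which is what allows data to be collected before the business requirements are known.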

Why is a data lake necessary?

Recent advances in artificial intelligence have further increased the need for underlying data to build smart, cutting-edge systems. Research in these fields has gained tremendous traction, and new business possibilities are emerging that rely heavily on the underlying enterprise data.

Companies are quickly realizing that they may have to capture and retain data within the organization for activities they do not yet know about. Even if they have no idea how to use the data today, they know that new possibilities will open up tomorrow, and that those possibilities will be of no use to them if they have not stored enough data today.

How are data lakes different?

Large corporations and even small companies have been using data for decision making for a long time. It is now evident, however, that enterprise data will have a much wider variety of applications, not limited to traditional decision support. The applications of data lakes are likely to outpace the applications of traditional decision support systems.

Where are data lakes used?

In the future, data will be used to build new business models, carry out extensive market research, understand customer behavior, automate many operational activities, build new products and services, enhance existing products and services, identify risks, and so on. Not all of this may be cost-effective today, but there is a good chance that machines will carry out these processes cost-effectively tomorrow, using data from data lakes.

What does a data lake have to do with advanced analytics?

Recent progress in advanced analytics is enabling companies to build analytical models quickly. These self-learning models are only as good as the data that goes into them, so they need relevant historical data in huge volumes. The importance of a data lake becomes evident as companies start to build their data science capabilities. Analytical models that use machine learning techniques work well when they analyze as much data as possible from a variety of data sources. A data lake enables quick onboarding of such advanced analytical models.

Conclusion – Why should companies build data lakes?

A data lake is among the most cost-effective ways of storing vast volumes of enterprise data for current and future use. It can and will extend the applications of data beyond traditional decision support, and will open up new possibilities for companies to stay ahead of the game. Data is the new oil, and it makes sense to store as much data as you can today so that it can be put to use tomorrow.