Modern data strategies heavily promote building an enterprise data lake: a repository of all data within the enterprise, stored in its raw format. This strategy gives quick benefits, but if not designed correctly, a data lake can soon become toxic.
Here are some of the guiding principles for designing a data lake.
- Data within the data lake is stored in the same format as that of the source. The idea is to store data quickly with minimal processing to make the process fast and cost-efficient.
- Data within the data lake is reconciled with the source every time a new data set is loaded, to ensure that it is a mirror copy of the data inside the source.
- Data within the data lake is well documented to ensure correct interpretation of data. Data catalogue and definitions are made available to all authorised users through a convenient channel.
- Data within the data lake can be traced back to its source to ensure integrity of data.
- Data within the data lake is secured through a controlled access mechanism. It is generally made available to data analysts and data scientists to explore further.
- Data within the data lake is generally large in volume. The idea is to store as much data as possible, without worrying about which data elements are going to be useful and which are not. This enables an exploratory environment, where users can keep looking at more data and build reports or analytical models in an incremental fashion.
- Data within the data lake is stored in the form of daily copies of data so that previous versions of data can be easily accessed for exploration. Accumulation of historic data over time enables companies to do trend analysis as well as build intelligent machine learning models that can learn from previous data to predict outcomes.
- Data within the data lake is never deleted.
- Data within the data lake is generally stored in open-source big data platforms like Hadoop to keep storage costs to a minimum. This also enables very efficient querying and processing of large volumes of data during iterative data exploration and analysis.
- Data within the data lake is stored in the format in which it is received from the source, and is not necessarily structured. The idea is to spend minimal effort while storing data in the data lake. All efforts to organize and decipher data happen after loading.
- Data within the data lake is not subjected to any data quality checks prior to loading the data.
- Data within the data lake is not subjected to any preprocessing (like de-duplication, standardization, cleansing, referential checks or homogenization). It is available for analysis in its raw form.
- Data within the data lake is accessible only to authorised users. The authorised users are generally data analysts and data scientists.
- Data within the data lake always reflects the latest state of business and gets refreshed as frequently as deemed necessary for effective usage.
- Data within the data lake is subjected to minimal governance.
- Data within the data lake is never edited or manipulated and is a true reflection of its source. Data gets loaded or copied into the data lake from source systems through automated and scheduled jobs.
- Data within the data lake is not used directly for reporting because, most of the time, it is not in a format that can be reported on. It is subjected to heavy post-load processing until it becomes useful for analysis and reporting.
- Data within the data lake represents the lowest grain of source data in its most basic form, and does not contain any processed or aggregated data. This data can be further processed, cleansed, standardized, homogenized, consolidated, aggregated, etc., as required by data analysts or data scientists to create final data sets.
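
Several of these principles (store the data as received, keep daily copies, never edit or delete, reconcile with the source on every load) can be illustrated with a small loading routine. The following is only a minimal sketch under stated assumptions: the source extract arrives as a single file, the lake is a plain file system, and reconciliation is done by comparing checksums. The function names `sha256_of` and `land_daily_copy`, the `load_date=YYYY-MM-DD` folder layout and the `.partial` suffix are hypothetical choices made for illustration, not part of any specific platform.

```python
import hashlib
import shutil
from datetime import date
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def land_daily_copy(source_file: Path, lake_root: Path,
                    source_name: str, load_date: date) -> Path:
    """Copy a source extract, byte for byte, into a dated folder of the lake.

    Assumed (hypothetical) layout:
        <lake_root>/<source_name>/load_date=YYYY-MM-DD/<file name>
    """
    target_dir = lake_root / source_name / f"load_date={load_date.isoformat()}"
    target = target_dir / source_file.name
    if target.exists():
        # Loaded data is never edited or overwritten, so refuse a second load.
        raise FileExistsError(f"{target} is already loaded")

    target_dir.mkdir(parents=True, exist_ok=True)
    staging = target_dir / (source_file.name + ".partial")
    # Raw copy only: no parsing, quality checks, cleansing or reformatting.
    shutil.copyfile(source_file, staging)

    # Reconcile with the source before the copy becomes visible in the lake.
    if sha256_of(source_file) != sha256_of(staging):
        staging.unlink()  # discard the unverified staging file only
        raise ValueError(f"reconciliation failed for {source_file}")

    staging.rename(target)
    return target
```

A scheduled job could call `land_daily_copy(Path("orders.csv"), Path("/lake"), "crm", date.today())` once per day (the file, path and source name here are placeholders). Because each load lands in its own dated folder, earlier versions stay available for trend analysis, and all cleansing or aggregation happens later on separate, derived data sets.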