Data warehouses are built to support tactical and strategic decision making. Here are a few challenges that should be properly addressed while designing a data warehouse.
Business analysts need to understand the data in order to build transformation rules. Hence, data from the various source systems should be made available to them for analysis in some sort of staging area, well before the data warehouse is built.
Confidential data should be masked or encrypted. There should be no manual updates to data within the data warehouse. Data access authorisation should be designed ahead of time, well before the data warehousing project begins.
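As an illustration of masking, a salted one-way hash can replace confidential values while keeping them deterministic, so masked records remain joinable. This is a minimal sketch; the `mask_value` and `mask_record` helpers and the hard-coded salt are hypothetical, and a production system would manage salts or encryption keys through a proper secrets facility.

```python
import hashlib

def mask_value(value: str, salt: str = "example-salt") -> str:
    """Irreversibly mask a confidential value using a salted SHA-256 hash.

    The salt is a placeholder for this sketch; a real system would fetch
    it from a secrets store rather than hard-code it."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:16]  # truncation keeps the mask short but still deterministic

def mask_record(record: dict, confidential_fields: set) -> dict:
    """Return a copy of the record with the confidential fields masked."""
    return {key: mask_value(val) if key in confidential_fields else val
            for key, val in record.items()}

record = {"customer_id": "C1001", "name": "Jane Doe", "city": "Leeds"}
masked = mask_record(record, {"name"})
```

Because the same input always produces the same mask, analysts can still count distinct customers or join masked tables without seeing the underlying values.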
Data within the data warehouse should be easily understood by business users. It is necessary to have a data catalogue and clear definitions of data within the data warehouse. The data catalogue and definitions should be easily accessible to authorised users, preferably through an online portal.
Data within the data warehouse should be trustworthy and should reflect the reality in the outside world. Hence, a careful analysis of data and data collection methods should be done to ensure that the data is fit for purpose. If necessary, data should be cleansed to improve its quality. Users should be able to easily trace data within the data warehouse back to its source. A data governance framework should be in place to monitor data quality and improve it over time.
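As one small example of cleansing, a rule might standardise phone numbers before they enter the warehouse and reject values that cannot be salvaged. This is a sketch only; `cleanse_phone` and its 7-to-15-digit plausibility range are assumptions, and real cleansing rules would come out of the data analysis described above.

```python
from typing import Optional

def cleanse_phone(raw: str) -> Optional[str]:
    """Standardise a phone number to digits only.

    Values without a plausible number of digits return None so they can
    be flagged for data-quality follow-up rather than loaded as-is.
    The 7-15 digit range is an assumption for this sketch."""
    digits = "".join(ch for ch in raw if ch.isdigit())
    return digits if 7 <= len(digits) <= 15 else None
```

Rejected values would typically be routed to a data-quality report so the governance process can decide how to correct them at source.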
Referential integrity constraints should be enforced on data within the data warehouse to ensure data integrity in the long run. This introduces design complications, as the data warehouse also retains an audit trail of changes that happen to data over time. Exception handling and exception re-processing should be designed into the automated data loads to ensure data integrity.
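A common pattern for exception handling during loads is to divert rows that would violate referential integrity into an exception queue for later re-processing, rather than failing the whole load. A minimal sketch, with hypothetical field names:

```python
def load_with_exceptions(rows, valid_customer_ids):
    """Split incoming rows into loadable rows and exceptions.

    Rows referencing a customer_id with no matching dimension record
    would violate referential integrity, so they are diverted to an
    exception queue for re-processing instead of failing the load.
    Field names here are hypothetical."""
    loadable, exceptions = [], []
    for row in rows:
        if row["customer_id"] in valid_customer_ids:
            loadable.append(row)
        else:
            exceptions.append({**row, "error": "unknown customer_id"})
    return loadable, exceptions

rows = [
    {"order_id": 1, "customer_id": "C1"},
    {"order_id": 2, "customer_id": "C9"},  # no matching customer dimension row
]
loadable, exceptions = load_with_exceptions(rows, valid_customer_ids={"C1", "C2"})
```

On a later load cycle the exception queue is retried; rows whose dimension record has since arrived are loaded, and the rest remain queued for investigation.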
Data should reflect a global enterprise view within the data warehouse. Standardising data involves business stakeholder management and a strong data governance framework. Agreement on data definitions, nomenclature and taxonomy should be reached ahead of time.
Data from the various source systems should be consolidated to provide an enterprise view of the data. This requires combining data from multiple sources and homogenising it to ensure a consistent view of all data.
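Homogenising data typically means conforming source-specific codes to one enterprise standard. A sketch, assuming two hypothetical sources (`crm` and `billing`) that encode countries differently:

```python
# Hypothetical per-source mappings: each source encodes countries its own way,
# and all local codes are conformed to one enterprise-standard ISO 3166 code.
COUNTRY_CONFORMANCE = {
    "crm":     {"UK": "GBR", "USA": "USA"},
    "billing": {"826": "GBR", "840": "USA"},
}

def conform_country(source: str, local_code: str) -> str:
    """Translate a source-specific country code to the enterprise standard.

    Unknown codes map to "UNK" so they can be reported and added to the
    mapping under data governance, rather than loaded silently."""
    return COUNTRY_CONFORMANCE.get(source, {}).get(local_code, "UNK")
```

The mapping tables themselves become governed reference data: adding a new source system means agreeing and registering its mapping, not changing ETL code.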
Data processing is an expensive activity, so only new data or changes to previously loaded data should be processed into the data warehouse during each load cycle. This introduces design complications in the ETL interfaces.
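One way to load only new or changed data is to keep a hash of each row from the previous load and compare the incoming data against it. A minimal sketch, assuming a business key named `id`:

```python
import hashlib

def row_hash(row: dict) -> str:
    """Deterministic hash of a row's attributes, used to detect changes."""
    payload = "|".join(f"{key}={row[key]}" for key in sorted(row))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def detect_delta(source_rows, previous_hashes):
    """Return only the rows that are new or changed since the last load.

    previous_hashes maps the business key ("id" is an assumption for
    this sketch) to the row hash stored during the previous load cycle."""
    return [row for row in source_rows
            if previous_hashes.get(row["id"]) != row_hash(row)]
```

Where source systems provide reliable change timestamps or change-data-capture feeds, those are cheaper than hashing; the hash comparison is the fallback when they do not.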
Data transformation is necessary to convert data from the source systems into a format that is useful to end users. Data transformation techniques can get quite complex, so standard design patterns for data transformation should be created ahead of time to ensure consistency across the various ETL interfaces.
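One simple standard pattern is to build every transformation as a pipeline of small, reusable steps, so all ETL interfaces compose transformations the same way. A sketch with two hypothetical steps:

```python
from typing import Callable

Transform = Callable[[dict], dict]

def pipeline(*steps: Transform) -> Transform:
    """Compose small transformation steps into one transform, so every
    ETL interface applies its transformations in the same standard way."""
    def run(row: dict) -> dict:
        for step in steps:
            row = step(row)
        return row
    return run

# Two hypothetical reusable steps:
def trim_strings(row: dict) -> dict:
    return {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}

def upper_country(row: dict) -> dict:
    return {**row, "country": row["country"].upper()}

standardise = pipeline(trim_strings, upper_country)
```

Because each step is a plain function, steps can be unit-tested in isolation and shared across interfaces, which is what keeps complex transformations consistent.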
Data access within the data warehouse should be efficient. Data within the data warehouse should be properly modelled, and indexed and partitioned where necessary, to ensure efficient query performance.
The referential integrity checks carried out during data loads may consume large amounts of memory, causing data load failures. Data storage and memory allocation requirements need to be communicated to the database administrators ahead of time.
Initial Data Loads
Initial loads are the heaviest data loads, occurring when historical data from the source systems is loaded into the data warehouse for the first time. Special memory allocations are required during initial data loads.
Source Data Validation
Technical checks are necessary on source data, especially if it is received in the form of flat files, to ensure smooth data loads. These checks should be performed as soon as the source data is available for loading into the data warehouse.
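Typical technical checks on a flat file include verifying that every row has the expected column count and that the total row count matches a control total supplied with the file. A minimal sketch using Python's csv module; the function name and check set are assumptions for illustration:

```python
import csv
import io

def validate_flat_file(text: str, expected_columns: int, control_row_count: int):
    """Run basic technical checks on a delimited file before loading:
    every row must have the expected column count, and the total row
    count must match the control total supplied with the file.
    Returns a list of error messages; an empty list means the file passed."""
    errors = []
    rows = list(csv.reader(io.StringIO(text)))
    for line_no, row in enumerate(rows, start=1):
        if len(row) != expected_columns:
            errors.append(
                f"line {line_no}: expected {expected_columns} columns, got {len(row)}")
    if len(rows) != control_row_count:
        errors.append(
            f"row count {len(rows)} does not match control total {control_row_count}")
    return errors
```

Running these checks as soon as the file arrives means a malformed extract is rejected back to the source team immediately, rather than failing mid-load.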
Data reconciliation needs to be done after every data load to confirm that the loads have happened correctly. These checks can compare key values, such as record counts and totals, between the source systems and the data warehouse.
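A reconciliation check can compare a record count and a key total between the source extract and the loaded data. A sketch, assuming a numeric `amount` field as the reconciled value:

```python
def reconcile(source_rows, warehouse_rows, value_field="amount"):
    """Compare a record count and a key total between source and warehouse.

    value_field ("amount" here) is an assumption for this sketch; any
    stable numeric measure agreed with the business can serve as the
    reconciled value."""
    return {
        "count_match": len(source_rows) == len(warehouse_rows),
        "total_match": (sum(r[value_field] for r in source_rows)
                        == sum(r[value_field] for r in warehouse_rows)),
    }
```

In practice these checks run automatically at the end of each load cycle, and any mismatch raises an alert before the data is released to business users.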