Data Warehousing and BI Analytics — Aamir P

Aamir P · May 7, 2024


Hello Readers!
In this article, we will build a beginner-level understanding of Data Warehousing and BI Analytics. I hope you find it useful!

A data warehouse is a centralized repository of high-quality data, facilitating accurate reporting and analysis for informed business decision-making. It serves as the backbone of modern business intelligence, offering a single source of truth for both current and historical organizational data.


Let me put it in simple words: consider a data warehouse as a system that aggregates data from one or more sources into a single, consistent data store to support data analytics. It also feeds AI, machine learning, and data mining workloads. Crucially, analytical operations are kept separate from the day-to-day transactional database operations.

Businesses evaluate data warehouse systems based on features, compatibility, ease of use, support quality, and costs. Location is crucial, balancing data security with accessibility. Architectural considerations include scalability, data types supported, and support for big data processing. Implementation ease depends on data governance, migration, and user management capabilities. User skill levels and vendor support play vital roles. Cost considerations extend beyond initial expenses, encompassing infrastructure, software licensing, migration, administration, and ongoing support. Ultimately, organizations must weigh these factors to choose between on-premises or cloud-based solutions, considering the total cost of ownership and adherence to data privacy regulations.

IBM Db2 Warehouse is a versatile data warehousing solution suitable for various environments, including cloud, on-premises, and hybrid setups. Offering scalability and speed with Massively Parallel Processing (MPP), it facilitates rapid data analysis through in-database analytics and machine learning algorithms. Db2 Warehouse simplifies schema generation and data transformation, supporting structured and unstructured data sources. With features like BLU Acceleration for fast queries and dashboard monitoring for performance tracking, it caters to diverse use cases such as elasticity, data integration, and line-of-business analytics. Integration with JDBC, Apache Spark, Python, R Studio, and REST API extends its analytical capabilities for efficient data processing and visualization.

Data Mart

A data mart is a specialized subset of an enterprise data warehouse, tailored to meet the needs of a specific business function or user community. It provides focused support for tactical decision-making by offering timely access to relevant data. Structured with a star or snowflake schema, a data mart typically consists of a central fact table surrounded by dimension tables, enabling efficient analysis. Unlike transactional databases, data marts store validated and cleaned data, accumulating historical information for trend analysis. There are three main types of data marts: dependent, independent, and hybrid. Dependent data marts draw from the enterprise data warehouse, ensuring data integrity and security. Independent data marts bypass the warehouse, requiring custom data pipelines and separate security measures. Hybrid data marts combine warehouse data with external sources. Regardless of type, data marts aim to provide end-users with timely insights, accelerate business processes, facilitate informed decisions, and ensure secure data access.
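To make the "dependent data mart" idea concrete, here is a minimal sketch in Python using SQLite. It is not from any specific product; the table names, regions, and values are purely illustrative. The point is simply that the mart is a focused, already-validated subset carved out of the warehouse for one user community.

```python
import sqlite3

# Minimal sketch of a dependent data mart: the mart is populated directly
# from the enterprise warehouse. All names and values are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE warehouse_sales (
        sale_id INTEGER PRIMARY KEY,
        sale_date TEXT, region TEXT, product TEXT, amount REAL
    );
    INSERT INTO warehouse_sales VALUES
        (1, '2024-01-15', 'EMEA', 'Widget', 120.0),
        (2, '2024-01-16', 'APAC', 'Gadget', 75.5);

    -- The EMEA sales team's data mart: a focused subset of warehouse data
    -- supporting one business function's tactical decisions.
    CREATE TABLE mart_emea_sales AS
        SELECT sale_date, product, amount
        FROM warehouse_sales
        WHERE region = 'EMEA';
""")
for row in conn.execute("SELECT * FROM mart_emea_sales"):
    print(row)
```

An independent data mart would replace the `warehouse_sales` source with its own extraction pipeline, which is exactly why it needs separate security and data-quality measures.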

Data Lake

A data lake serves as a storage repository for vast amounts of structured, semi-structured, and unstructured data in its raw or native format, tagged with metadata for classification. Unlike data warehouses, data lakes do not require predefined schemas, allowing for agile storage and analysis of data without strict conformance. Deployable across various technologies, including cloud object storage and distributed systems like Apache Hadoop, data lakes offer scalability and flexibility, catering to evolving business needs. Users benefit from the ability to store diverse data types, scalable storage capacities, and time savings by eliminating the need for upfront schema definition and data transformation. However, data lakes may lack curated data quality and governance compared to data warehouses. Typical users of data lakes include data scientists and developers, while business and data analysts predominantly utilize data warehouses. Overall, data lakes complement data warehouses, addressing different data storage and analysis requirements within organizations.
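A tiny sketch of what "raw or native format, tagged with metadata" can look like in practice. The directory layout, file names, and tag fields below are my own assumptions for illustration, not a standard; the key idea is that the file is stored as-is and the schema is only defined later, at read time.

```python
import json
import shutil
from pathlib import Path

# Minimal sketch of landing a raw file in a data lake: the file keeps its
# native format and gets a metadata sidecar for classification.
lake = Path("datalake/raw/clickstream")
lake.mkdir(parents=True, exist_ok=True)

# Hypothetical source extract, created here so the sketch is self-contained.
source_file = Path("events_2024-05-01.json")
source_file.write_text('{"user": 42, "event": "page_view"}\n')

target = lake / source_file.name
shutil.copy(source_file, target)          # stored as-is, no schema applied

# Metadata tags used to find and classify the data later.
metadata = {
    "source_system": "web_tracker",
    "ingested_at": "2024-05-01T02:00:00Z",
    "format": "json",
    "schema": None,   # schema-on-read: defined only when the data is analyzed
}
meta_path = target.with_name(target.stem + ".meta.json")
meta_path.write_text(json.dumps(metadata, indent=2))
```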

Lakehouse

The process of managing data within organizations mirrors the logistics of a restaurant kitchen. Just as raw ingredients are delivered to a restaurant’s loading dock, data flows into organizations from various sources like cloud environments and operational applications. This influx of data requires efficient storage and processing, akin to sorting and organizing ingredients in a kitchen. Data lakes serve as repositories for raw, structured, and unstructured data, providing a cost-effective solution for capturing diverse data types quickly, similar to storing ingredients upon delivery.

However, like ingredients needing preparation before cooking, data from lakes must be organized and transformed for analytical insights. This is where enterprise data warehouses (EDWs) come in, optimizing and organizing data for specific analytical tasks, analogous to the pantry and freezers in a kitchen. EDWs ensure data integrity, governance, and trustworthiness, crucial for generating insights and analytics for business decision-making.

Yet, challenges arise with both data lakes and data warehouses. Data lakes may become “data swamps” due to duplicate or incomplete data, while query performance can be an issue. On the other hand, data warehouses offer exceptional query performance but can be costly and may struggle with semi-structured and unstructured data sources.

To address these challenges, developers have introduced the concept of a data lakehouse, a fusion of data lake and data warehouse functionalities. This approach combines the flexibility and cost-effectiveness of data lakes with the performance and structure of data warehouses. Data lakehouses enable organizations to store data from diverse sources efficiently while leveraging built-in data management and governance features to support both business intelligence and high-performance machine learning workloads.

In essence, just as ingredients undergo a journey from delivery to plate in a restaurant kitchen, data undergoes a transformation from raw to insights within organizations’ data architectures, with data lakes, data warehouses, and emerging data lakehouses playing vital roles in this process.

A general data warehousing architecture consists of data sources, ETL pipelines, staging areas, a data warehouse repository, optional data marts, and analytics tools. Companies can customize this model to meet their analytics needs. Vendors provide proprietary reference architectures based on this model, ensuring interoperability among components. For example, IBM’s architecture includes a data acquisition layer, integration layer, repository layer, analytics layer, and presentation layer. IBM’s InfoSphere suite offers tools like DataStage for ETL, MetaData Workbench for metadata management, QualityStage for data quality, Db2 Warehouse for data storage, and Cognos Analytics for business intelligence.
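Here is a compressed, end-to-end sketch of that generic architecture in Python: a source, an extract-and-transform step, a small staged batch, a warehouse repository, and an analytics query at the end. The in-memory CSV "source" and all table names are illustrative assumptions, not part of any vendor reference architecture.

```python
import csv
import io
import sqlite3

# Minimal sketch of the generic architecture: source -> ETL -> staging ->
# warehouse repository -> analytics query. All names are illustrative.
source_csv = io.StringIO("order_id,amount\n1,19.99\n2,\n3,42.50\n")

# Extract + transform: parse rows and drop records with missing amounts.
staged = []
for row in csv.DictReader(source_csv):
    if row["amount"]:
        staged.append((int(row["order_id"]), float(row["amount"])))

# Load into the warehouse repository.
wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE fact_orders (order_id INTEGER, amount REAL)")
wh.executemany("INSERT INTO fact_orders VALUES (?, ?)", staged)

# Analytics/presentation layer: a simple aggregate a BI tool might run.
print(wh.execute("SELECT COUNT(*), SUM(amount) FROM fact_orders").fetchone())
```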

Data Cube and Materialised Views

A data cube, derived from a star schema, uses dimensions to define its coordinates and facts to populate its cells. Operations like slicing, dicing, drilling down, and rolling up can be performed on data cubes to analyze multidimensional data effectively. Materialized views, or snapshots of query results, are utilized in data warehousing to replicate data or precompute expensive queries, ensuring efficient data analysis. In Oracle and PostgreSQL, materialized views can be refreshed automatically or on demand. In Db2, they are termed MQTs (Materialized Query Tables), with system-maintained options for immediate refresh. Overall, data cubes facilitate multidimensional analysis, while materialized views optimize query performance in enterprise data warehousing systems.
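The SQL below sketches the materialized-view side of this in PostgreSQL-style syntax, kept as strings rather than executed here. The table and view names are illustrative; Oracle and Db2 MQT syntax differs in the details, so treat this as the general shape rather than a portable recipe.

```python
# PostgreSQL-style materialized view: a stored snapshot of a query result.
create_mv = """
CREATE MATERIALIZED VIEW mv_daily_sales AS
SELECT sale_date, SUM(amount) AS total_amount
FROM fact_sales
GROUP BY sale_date;
"""

# Refreshing re-runs the stored query and replaces the snapshot.
refresh_mv = "REFRESH MATERIALIZED VIEW mv_daily_sales;"

# Slicing a data cube amounts to fixing one dimension's coordinate.
slice_query = "SELECT * FROM mv_daily_sales WHERE sale_date = '2024-05-01';"
```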

Facts and Dimensions

Data in business analytics comprises facts and dimensions. Facts are measurable quantities like temperature or sales, while dimensions provide context to these facts, such as location or time. Fact tables store business process data with foreign keys linking to dimension tables, categorizing facts for analysis. Dimension tables describe attributes like products or employees and are joined to fact tables via foreign keys. This structure allows for the filtering, grouping, and labelling of business data for analysis.
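A minimal runnable sketch of the fact/dimension split, using the temperature example from the text. The fact table holds the measurable quantity plus a foreign key, and the dimension table supplies the labels used for filtering and grouping; the specific columns and values below are my own illustration.

```python
import sqlite3

# Fact table = measures + foreign keys; dimension table = descriptive context.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE dim_location (location_id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE fact_temperature (
        reading_id INTEGER PRIMARY KEY,
        location_id INTEGER REFERENCES dim_location(location_id),
        temperature REAL
    );
    INSERT INTO dim_location VALUES (1, 'Chennai'), (2, 'Mumbai');
    INSERT INTO fact_temperature VALUES (1, 1, 34.5), (2, 1, 33.0), (3, 2, 31.2);
""")

# Group the measurable fact by a dimension attribute (the "context").
for city, avg_t in db.execute("""
        SELECT d.city, AVG(f.temperature)
        FROM fact_temperature f JOIN dim_location d USING (location_id)
        GROUP BY d.city"""):
    print(city, round(avg_t, 1))
```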

Data Modelling

Star and snowflake schemas are data modelling techniques used in designing specialized data warehouses, particularly for scenarios like “A to Z Discount Warehouse” managing point-of-sale transactions.

In a star schema, a central fact table contains measures such as transaction amount and quantity, with foreign keys linking to dimension tables for attributes like date, store, product, cashier, and member. Designing such a schema involves selecting a business process (e.g., point-of-sale transactions), determining granularity (e.g., individual line items), identifying dimensions (e.g., date, store, product), and defining facts (e.g., price, quantity).
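The DDL below sketches what that design could translate into. Only the dimensions (date, store, product, cashier, member) and the facts (price, quantity) come from the design steps above; the individual columns are assumptions added for illustration.

```python
# Star schema for the point-of-sale example, as generic SQL DDL strings.
star_schema_ddl = """
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, full_date DATE, month INTEGER, year INTEGER);
CREATE TABLE dim_store   (store_id INTEGER PRIMARY KEY, store_name TEXT, city TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE dim_cashier (cashier_id INTEGER PRIMARY KEY, cashier_name TEXT);
CREATE TABLE dim_member  (member_id INTEGER PRIMARY KEY, member_name TEXT);

-- Grain: one row per line item on a point-of-sale receipt.
CREATE TABLE fact_sales (
    date_id    INTEGER REFERENCES dim_date(date_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    cashier_id INTEGER REFERENCES dim_cashier(cashier_id),
    member_id  INTEGER REFERENCES dim_member(member_id),
    price      NUMERIC,
    quantity   INTEGER
);
"""
```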

Snowflake schemas extend star schemas by normalizing dimension tables, separating hierarchies into individual tables to reduce the storage footprint. For instance, in the snowflake schema, the store dimension might include separate tables for city, state, and country, each linked by foreign keys. Other dimensions like product and date could similarly be normalized into hierarchies like brand, category, and time elements.
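For example, snowflaking the store dimension from the star schema above might look like the following sketch (column names are illustrative). Note that reaching the country for a given sale now requires three extra joins compared with the denormalized store dimension, which is exactly the trade-off discussed below.

```python
# Snowflaked store dimension: the geographic hierarchy is split into its
# own lookup tables, each linked by a foreign key.
snowflaked_store_ddl = """
CREATE TABLE dim_country (country_id INTEGER PRIMARY KEY, country_name TEXT);
CREATE TABLE dim_state   (state_id   INTEGER PRIMARY KEY, state_name TEXT,
                          country_id INTEGER REFERENCES dim_country(country_id));
CREATE TABLE dim_city    (city_id    INTEGER PRIMARY KEY, city_name TEXT,
                          state_id   INTEGER REFERENCES dim_state(state_id));
CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, store_name TEXT,
                          city_id    INTEGER REFERENCES dim_city(city_id));
"""
```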

The process of designing these schemas involves understanding the business process’s requirements, determining the appropriate level of detail (granularity), identifying relevant dimensions and facts, and potentially normalizing dimension tables to optimize storage and query performance.

Overall, star and snowflake schemas provide structured frameworks for organizing and analyzing data in data warehousing environments, facilitating efficient querying and reporting for businesses like “A to Z Discount Warehouse” to gain insights from their point-of-sale transactions.

Star schemas are optimized for reads and are widely used for designing data marts, whereas snowflake schemas are optimized for writes and are widely used for transactional data warehousing.

Comparison between Snowflake and Star Schema

Normalization, a key aspect of both star and snowflake schemas, addresses the issue of redundancy in data storage. It involves creating surrogate keys to replace natural keys and storing the original values in lookup tables. While this process doesn’t directly reduce redundancy within the normalized table itself, it significantly reduces redundancy in the fact table and minimizes storage requirements overall.

In a normalized schema, dimensions are structured efficiently, with each dimension having its own table or hierarchy of tables. By using surrogate keys, which are typically smaller in size compared to the original data types, such as strings, the overall data size is reduced. For instance, a 32-bit integer surrogate key occupies far less space than a 10-character string, leading to significant storage savings, especially in large datasets.
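A rough back-of-the-envelope calculation makes the saving concrete. The sizes below are simplified (no encoding or page overhead) and the row count is an assumption chosen only to illustrate the scale.

```python
# Replacing a repeated 10-character string with a 32-bit integer surrogate
# key in a fact table with 100 million rows.
rows = 100_000_000
string_key_bytes = 10       # 10 one-byte characters
surrogate_key_bytes = 4     # 32-bit integer

saving = rows * (string_key_bytes - surrogate_key_bytes)
print(f"Approximate saving: {saving / 1e9:.1f} GB")   # ~0.6 GB per key column
```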

The choice between star and snowflake schemas depends on factors like query performance and ease of maintenance. Snowflake schemas, being fully normalized, offer the smallest storage footprint and minimal redundancy. However, they may result in slower query performance due to the additional joins required to access deeper levels of hierarchical dimensions. On the other hand, star schemas denormalize certain dimensions, improving query efficiency but potentially increasing redundancy.

A compromise between the two is a partially normalized schema, combining elements of both star and snowflake designs. This approach allows for the optimization of storage and query performance based on specific requirements. Additionally, data analysts and scientists often prefer star schemas for their simplicity and ease of querying, as they typically involve fewer joins.

Ultimately, the choice of schema depends on the specific needs and priorities of the organization, balancing factors such as storage efficiency, query performance, and ease of maintenance. By carefully considering these factors and gathering requirements, one can design an optimal normalization scheme for their data warehouse, ensuring efficient storage and effective data retrieval for analytical purposes.

Staging

A staging area serves as a vital intermediary between data sources and target systems, facilitating the integration of disparate data sources in data warehousing. Typically transient, staging areas can be implemented as simple flat files managed by scripts or as tables in a database. They enable decoupling of data processing from the source systems, reducing the risk of data corruption. While primarily used for data integration, staging areas also support functions like change detection, scheduling, data cleansing, validation, aggregation, and normalization. They streamline ETL workflows, simplify maintenance, and help ensure data integrity. Despite their transient nature, staging areas may be retained for archival or troubleshooting purposes. In essence, staging areas play a crucial role in facilitating seamless data flow and optimizing data management processes within enterprise data warehousing architectures.
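Here is a minimal sketch of a flat-file staging area of the kind described above: rows are extracted from a source system to a transient CSV, and a later load step reads, cleanses, and writes them to the target. The "source" rows, file path, and table names are illustrative assumptions.

```python
import csv
import sqlite3
from pathlib import Path

# Extract to a flat-file staging area, decoupled from the source system.
staging_file = Path("staging/orders_extract.csv")
staging_file.parent.mkdir(exist_ok=True)

source_rows = [("1001", "19.99"), ("1002", ""), ("1003", "42.50")]  # stand-in extract
with staging_file.open("w", newline="") as f:
    csv.writer(f).writerows(source_rows)

# Later (possibly on a schedule), the load step validates the staged file
# and writes clean rows to the warehouse.
wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
with staging_file.open() as f:
    clean = [(int(i), float(a)) for i, a in csv.reader(f) if a]  # simple cleansing
wh.executemany("INSERT INTO orders VALUES (?, ?)", clean)
# The staging file can now be deleted, or retained for troubleshooting.
```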

Data verification

Data verification is essential for managing data quality and reliability, ensuring accurate, complete, consistent, and current data. Common data quality concerns include accuracy, often compromised by duplicated records and manual entry errors like typos. Completeness issues arise from missing or placeholder values, while consistency problems stem from deviations in terminology, data formats, and units. Currency concerns involve keeping data up to date, such as customer address changes. Resolving these issues involves implementing detection rules, quarantining bad data, and automating cleaning processes. Leading vendors offer data quality solutions like IBM InfoSphere Information Server, which enables continuous monitoring, analysis, and cleaning of data, ensuring its reliability and maximizing its value. These tools assist in understanding data relationships, maintaining data lineage, and ultimately turning data into trusted information. Overall, data verification is crucial for informed decision-making and can be facilitated by enterprise-grade solutions like IBM’s.
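The sketch below shows what simple, rule-based detection and quarantining can look like for the concerns listed above: accuracy (duplicates), completeness (missing values), and consistency (terminology and unit deviations). The records and rules are invented for illustration and are far simpler than what a tool like IBM InfoSphere provides.

```python
# Minimal rule-based data verification with a quarantine list.
records = [
    {"id": 1, "email": "a@x.com", "country": "IN",    "height_cm": 172},
    {"id": 2, "email": "a@x.com", "country": "India", "height_cm": None},
    {"id": 3, "email": "c@x.com", "country": "IN",    "height_cm": 5.6},  # looks like feet
]

quarantine = []
seen_emails = set()
for r in records:
    issues = []
    if r["email"] in seen_emails:
        issues.append("duplicate email")                  # accuracy
    seen_emails.add(r["email"])
    if r["height_cm"] is None:
        issues.append("missing height")                   # completeness
    if r["country"] not in {"IN", "US", "GB"}:
        issues.append("non-standard country code")        # consistency
    if r["height_cm"] is not None and r["height_cm"] < 50:
        issues.append("implausible unit for height_cm")   # consistency
    if issues:
        quarantine.append((r["id"], issues))              # set aside for cleaning

print(quarantine)
```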

Populating a Data Warehouse

Populating an enterprise data warehouse involves initially creating fact and dimension tables, establishing relationships, and loading clean data. It’s an ongoing process with initial and periodic incremental loads, with fact tables being dynamic and needing frequent updates, while dimension tables are relatively static. Automation, using tools like Apache Airflow or custom scripts, streamlines incremental loading and maintenance tasks. Mechanisms like database utilities or custom scripts detect new or updated data for incremental loads. Periodic maintenance, typically monthly or yearly, involves archiving older data. Overall, the process ensures a robust, updated data warehouse supporting efficient analytics and decision-making.
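One common mechanism for detecting new or updated data is a high-water mark on a timestamp column, sketched below. The table names, timestamp format, and where the mark is stored are assumptions for illustration; in practice this logic would typically live inside an orchestrated pipeline (for example, an Apache Airflow task).

```python
import sqlite3

# Incremental load with a high-water mark: pull only source rows newer
# than the last successful load, then advance the mark.
src = sqlite3.connect(":memory:")
src.executescript("""
    CREATE TABLE source_orders (order_id INTEGER, amount REAL, updated_at TEXT);
    INSERT INTO source_orders VALUES
        (1, 10.0, '2024-05-01T09:00:00'),
        (2, 25.0, '2024-05-02T14:30:00');
""")

last_loaded = '2024-05-01T23:59:59'   # stored from the previous run

new_rows = src.execute(
    "SELECT order_id, amount, updated_at FROM source_orders WHERE updated_at > ?",
    (last_loaded,),
).fetchall()

print("incremental batch:", new_rows)                      # load these into the fact table
last_loaded = max((r[2] for r in new_rows), default=last_loaded)  # advance the mark
```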

CUBE, ROLLUP, and Materialised Views

In data warehousing, using CUBE and ROLLUP operations on materialized views facilitates efficient data summarization, aligning with management’s reporting needs. By denormalizing star schemas through joins, key dimensions like date, auto class, and salesperson are combined with meaningful facts like sales amounts in a single view.

CUBE generates comprehensive summaries across all permutations of dimensions, while ROLLUP provides hierarchical summaries based on the order of listed dimensions. Additionally, staging tables derived from materialized views offer refreshed data for analysis, with updates scheduled during off-peak hours. Overall, leveraging these techniques enhances query efficiency and supports informed decision-making in data warehousing environments.
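The queries below sketch what ROLLUP and CUBE look like over a denormalized view, in PostgreSQL/Db2-style SQL kept as strings. The view and column names follow the date, auto class, and salesperson example in the text, but the exact identifiers are illustrative.

```python
# ROLLUP: hierarchical subtotals following the order of the listed columns.
rollup_sql = """
SELECT sale_year, auto_class, salesperson, SUM(sales_amount) AS total
FROM mv_sales_denorm
GROUP BY ROLLUP (sale_year, auto_class, salesperson);
"""

# CUBE: subtotals for every combination of the grouping columns.
cube_sql = """
SELECT sale_year, auto_class, salesperson, SUM(sales_amount) AS total
FROM mv_sales_denorm
GROUP BY CUBE (sale_year, auto_class, salesperson);
"""
```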

Analytics and Business Intelligence

Analytics leverages data, statistics, and operations research to improve decision-making, aided by advancements in compute storage and processing power. Modern data warehousing solutions enable the swift processing of vast datasets, enhancing accuracy. Business Intelligence (BI) tools facilitate data preparation, mining, management, and visualization, empowering organizations to uncover insights through descriptive, predictive, and prescriptive analytics. IBM Cognos Analytics stands out in the market, integrating AI and natural language processing for comprehensive data analytics. With a focus on the entire analytics lifecycle, from discovery to operations, it exemplifies the evolution and impact of BI software in driving informed business decisions.

Let me summarise

  1. A data warehouse is a system that aggregates data from one or more sources into a single consistent data store to support data analytics.
  2. The major benefit of a data lake is that it handles all types of data: unstructured, semi-structured, and structured.
  3. Organisations use the Cloud to host data warehouses.
  4. A data warehouse vendor that offers “on-premises and cloud” services is IBM Db2 Warehouse.
  5. Data Mart helps end users focus only on relevant data.
  6. Data marts form the spokes of the “hub and spoke” architecture that is sometimes included in the general architectural model for an Enterprise Data Warehouse.
  7. Pivoting changes the point of view from which you analyze the data.
  8. Dimensions are called “categorical variables” by statisticians and machine learning engineers.
  9. Selecting a business process, identifying dimensions, and identifying facts are all design considerations.
  10. The staging area can be used to summarize data through aggregation.
  11. Staging and sandbox areas are the components that hold data and support workflow development.
  12. A staging area is a separate location where data from source systems is extracted to.
  13. The normalized version of the star schema is called a snowflake schema, due to its multiple layers of branching, which resemble a snowflake pattern.
  14. Materialized views can be used to replicate the data.
  15. Materialized Views can be set up to have different refresh options, such as never, upon request, and immediately.
  16. When considering features related to structure, a business needs to weigh which data types are supported against the types of data it ingests.
  17. CUBE and ROLLUP provide summary reports.
  18. You can denormalize star schemas using joins to bring together human-interpretable facts and dimensions in a single materialized view.
  19. Designing star schemas involves identifying the facts.
  20. A data lake is a data repository that can store a large amount of structured, semi-structured, and unstructured data in its native format. There is no need to define the structure and schema of data before loading the data into the data lake.

Check out this link to know more about me

Let’s get to know each other!
https://lnkd.in/gdBxZC5j

Get my books, podcasts, placement preparation, etc.
https://linktr.ee/aamirp

Get my Podcasts on Spotify
https://lnkd.in/gG7km8G5

Catch me on Medium
https://lnkd.in/gi-mAPxH

Follow me on Instagram
https://lnkd.in/gkf3KPDQ

Udemy (Python Course)
https://lnkd.in/grkbfz_N

YouTube
https://www.youtube.com/@knowledge_engine_from_AamirP

Subscribe to my Channel for more useful content.
