Data Warehouse - DWDM
https://aws.amazon.com/data-warehouse/
Characteristics of Data warehouse
A data warehouse has the following characteristics (features):
• Subject-Oriented: A data warehouse is organized around major subjects,
such as customer, vendor, product, and sales. Rather than concentrating on
the day-to-day operations and transaction processing of an organization, a
data warehouse focuses on the modeling and analysis of data for decision
makers.
• Integrated: A data warehouse is usually constructed by integrating
multiple heterogeneous sources, such as relational databases, flat files,
and on-line transaction records.
• Data cleaning and data integration techniques are applied to ensure
consistency in naming conventions, encoding structures, attribute
measures, and so on.
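The consistency step above can be sketched as follows. This is only an illustration: the two source systems, their field names, and the agreed gender encoding are assumptions made up for the example.

```python
# Hypothetical records from two source systems with inconsistent conventions.
crm_rows = [{"cust_id": 1, "gender": "M"}, {"cust_id": 2, "gender": "F"}]
billing_rows = [{"CustomerID": 3, "sex": "male"}, {"CustomerID": 4, "sex": "female"}]

# One agreed encoding for the warehouse (an assumption for this example).
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}

def from_crm(row):
    """Map a CRM record onto the warehouse naming and encoding."""
    return {"customer_id": row["cust_id"],
            "gender": GENDER_CODES[row["gender"].lower()]}

def from_billing(row):
    """Map a billing record onto the same naming and encoding."""
    return {"customer_id": row["CustomerID"],
            "gender": GENDER_CODES[row["sex"].lower()]}

# Integrated rows share one attribute name and one encoding per attribute.
integrated = [from_crm(r) for r in crm_rows] + [from_billing(r) for r in billing_rows]
```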
• Time-variant: Data are stored to provide information from a historical
perspective (e.g., the past 5-10 years).
• Non-volatile: A data warehouse is always a physically separate store of
data transformed from the application data found in the operational
environment. Due to this separation, a data warehouse does not require
transaction processing, recovery, and concurrency control mechanisms. It
usually requires only two operations in data accessing: initial loading of
data and access of data.
Data Warehouse overview
https://en.wikipedia.org/wiki/Data_warehouse#/media/File:Data_Warehouse_Feeding_Data_Mart.jpg
Data Warehouse Component
https://www.guru99.com/data-warehouse-architecture.html#8
ETL (Extract, Transform and Load) vs ELT (Extract, Load and Transform)
ETL based Data warehousing
ETL (Extract, Transform and Load) is a process that extracts data from different
RDBMS source systems, then transforms it (for example, by applying calculations or
concatenations), and finally loads it into the data warehouse system.
Transformation Examples
• Converting numerical values
• Editing text strings
• Matching rows and columns
• Find and replace
• Changing column names
• Recombining columns from different tables and databases
• Precalculating intermediate aggregates
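A minimal sketch of the ETL flow, assuming an in-memory SQLite database as a stand-in for the warehouse. The source rows, table name, and column names are hypothetical; the point is only that the transformation runs before anything is loaded.

```python
import sqlite3

# Extract: rows as they might arrive from a source system (hypothetical data).
source_rows = [
    ("  alice ", "SMITH", "19.99", 2),
    ("Bob", "jones ", "5.00", 10),
]

def transform(row):
    """Clean one source row before it reaches the warehouse."""
    first, last, unit_price, qty = row
    # Edit text strings and recombine two columns into one.
    full_name = f"{first.strip().title()} {last.strip().title()}"
    # Convert a numeric value stored as text.
    price = float(unit_price)
    # Apply a calculation.
    total = round(price * qty, 2)
    return (full_name, price, qty, total)

# Transform happens outside the warehouse, before loading.
clean_rows = [transform(r) for r in source_rows]

# Load: an in-memory SQLite database stands in for the warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales (customer_name TEXT, unit_price REAL, quantity INTEGER, total REAL)"
)
warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", clean_rows)
warehouse.commit()

loaded = warehouse.execute(
    "SELECT customer_name, total FROM sales ORDER BY customer_name"
).fetchall()
```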
ETL based Data warehousing
https://panoply.io/data-warehouse-guide/etl-tutorial/
ELT based Data warehousing
In this approach, data is extracted from heterogeneous source systems and then
loaded directly into the data warehouse before any transformation occurs. All necessary
transformations are then handled inside the data warehouse itself. Finally, the transformed
data is loaded into target tables in the same data warehouse.
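A minimal sketch of the ELT flow, again assuming an in-memory SQLite database as the warehouse and hypothetical table names (staging_sales, sales). Raw rows are loaded first, and the transformation is a SQL statement run inside the warehouse.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")  # stand-in for the data warehouse

# Extract + Load: raw source rows go into a staging table untransformed.
warehouse.execute(
    "CREATE TABLE staging_sales (first TEXT, last TEXT, unit_price TEXT, quantity INTEGER)"
)
warehouse.executemany(
    "INSERT INTO staging_sales VALUES (?, ?, ?, ?)",
    [("  alice ", "SMITH", "19.99", 2), ("Bob", "jones ", "5.00", 10)],
)

# Transform: done inside the warehouse with SQL, writing to a target table.
warehouse.execute("CREATE TABLE sales (customer_name TEXT, total REAL)")
warehouse.execute(
    """
    INSERT INTO sales
    SELECT TRIM(first) || ' ' || TRIM(last),
           ROUND(CAST(unit_price AS REAL) * quantity, 2)
    FROM staging_sales
    """
)
warehouse.commit()

target = warehouse.execute(
    "SELECT customer_name, total FROM sales ORDER BY total"
).fetchall()
```

The raw staging table remains available after the transformation, so a different target table can later be derived from the same loaded data without re-extracting it.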
ELT based Data warehousing
https://www.xplenty.com/blog/etl-vs-elt/#overview
ETL vs ELT
https://www.softwareadvice.com/resources/etl-vs-elt-for-your-data-warehouse/
Three-Tier Data Warehouse Architecture
Data Warehouse Design
There are two approaches:
• Top-Down approach (Bill Inmon)
• Bottom-Up approach (Ralph Kimball)
Data Warehouse Design(Top-Down)
In the top-down approach, the data warehouse is designed first, and data
marts are then built on top of the data warehouse.
Advantages of top-down design are:
• Provides consistent dimensional views of data across data marts, as all data
marts are loaded from the data warehouse.
• This approach is robust against business changes. Creating a new data mart
from the data warehouse is very easy.
Disadvantages of top-down design are:
• This methodology is inflexible to changing departmental needs during the
implementation phase.
• The cost and time required for design and maintenance are very high.
Data Warehouse Design(Bottom-up)
In this method, data marts are created first to provide reporting and
analytics capability for specific business processes. The enterprise data
warehouse is later built from these data marts.
• In the bottom-up design approach, the data marts are created first to provide
reporting capability.
• A data mart addresses a single business area, such as sales or finance. These data
marts are then integrated to build a complete data warehouse.
• The integration of data marts is implemented using data warehouse bus
architecture.
• In the bus architecture, a dimension is shared between facts in two or more data
marts.
• These shared dimensions are called conformed dimensions. The data marts are
integrated through them, and the data warehouse is then built.
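A minimal sketch of a conformed dimension, assuming two hypothetical data marts (sales and inventory) held in one in-memory SQLite database. The table and column names are illustrative; what matters is that both fact tables reference the same dim_product table, so their measures can be combined.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- One conformed dimension, shared by both marts.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    -- Fact table of the (hypothetical) sales mart.
    CREATE TABLE fact_sales (product_id INTEGER, units_sold INTEGER);
    -- Fact table of the (hypothetical) inventory mart.
    CREATE TABLE fact_inventory (product_id INTEGER, units_on_hand INTEGER);

    INSERT INTO dim_product VALUES (1, 'phone'), (2, 'laptop');
    INSERT INTO fact_sales VALUES (1, 30), (1, 20), (2, 5);
    INSERT INTO fact_inventory VALUES (1, 100), (2, 40);
""")

# Because both marts use the same product dimension, their measures line up.
rows = db.execute("""
    SELECT p.name, s.units_sold, i.units_on_hand
    FROM dim_product AS p
    JOIN (SELECT product_id, SUM(units_sold) AS units_sold
          FROM fact_sales GROUP BY product_id) AS s
      ON s.product_id = p.product_id
    JOIN fact_inventory AS i
      ON i.product_id = p.product_id
    ORDER BY p.name
""").fetchall()
```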
Advantages of the bottom-up approach:
• Because the data marts are created first, reports can be generated quickly.
• More data marts can be added over time, and in this way the data
warehouse can be extended.
• The cost and time taken to design this model are comparatively low.
Disadvantage of the bottom-up approach:
• This model is not as strong as the top-down approach, because the dimensional
view of the data marts is not as consistent as it is in the top-down approach.
Why build a data warehouse at all—why not just run analytics queries
directly on an online transaction processing (OLTP) database, where
the transactions are recorded?
Answer: Data warehouses are optimized for batched write operations
and for reading high volumes of data, whereas OLTP databases are
optimized for continuous write operations and high volumes of small
read operations.
Data Warehouse vs Database
Characteristics    | Data Warehouse                                          | Database
Suitable workloads | Analytics, reporting, big data                          | Transaction processing
Data source        | Data collected and normalized from many sources         | Data captured as-is from a single source, such as a transactional system
Data access        | Optimized to minimize I/O and maximize data throughput  | High volumes of small read operations
Data Mart
Why do we need Data Mart?
• A data mart improves response time for users because it holds a smaller volume of
data.
• It provides easy access to frequently requested data.
• Data marts are simpler to implement than a corporate data warehouse.
The cost of implementing a data mart is also lower than that of
implementing a full data warehouse.
• Compared to a data warehouse, a data mart is agile. If the model changes, a data
mart can be rebuilt more quickly because of its smaller size.
• A data mart is defined by a single Subject Matter Expert (SME). In contrast, a data
warehouse is defined by interdisciplinary SMEs from a variety of domains. Hence,
a data mart is more open to change than a data warehouse.
• Data is partitioned and allows very granular access control privileges.
• Data can be segmented and stored on different hardware/software platforms.
Types of Data Mart
1. Dependent: A dependent data mart is created by drawing data from an existing
central data warehouse.
2. Independent: An independent data mart is created without the use of a central data
warehouse, drawing data directly from operational sources, external sources, or both.
3. Hybrid: This type of data mart can take data from data warehouses or operational
systems.
Type of Data Mart(Dependent Data Mart)
• A dependent data mart (top-down approach) takes its
data from a data warehouse. Data in the
data warehouse is aggregated, restructured, and
summarized when it passes into a dependent data
mart.
Type of Data Mart(Independent Data Mart)
• An independent data mart, also known as a stand-alone
data mart, focuses on a particular subject area. It is
not designed in an enterprise context.
• Business intelligence tools or analytics tools query data
directly from the data mart and present information to
the user.
• Independent data marts can be built in a short time.
Type of Data Mart(Hybrid Data Mart)
• A hybrid data mart combines input from a data warehouse
with input from other sources. This can be helpful
when you want ad-hoc integration, such as after a new
group or product is added to the organization.
Steps in Implementing a Data Mart
Implementing a data mart involves the following steps:
• Designing
• Building(Constructing)
• Populating
• Accessing and Managing
Designing: This step includes identifying the subject or topic for which the data mart
will store data.
It involves the following tasks:
• Gathering the business and technical requirements
• Identifying data sources
• Selecting the appropriate subset of data
• Designing the logical and physical architecture of the data mart.
Building (Constructing): This step involves creating the physical database and the logical
structures associated with the data mart to provide fast and efficient access to the data.
• Typically, the term datacube is applied in contexts where these arrays are massively
larger than the hosting computer's main memory; examples include multi-
terabyte/petabyte data warehouses and time series of image data.
--Wikipedia
Example
Example: In the 2-D representation, we will look at the All Electronics sales data for items
sold per quarter in the city of Vancouver. The measure displayed is dollars sold (in
thousands).
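A minimal sketch of how such a 2-D view can be built from individual sales records. The records and figures below are hypothetical placeholders, not the All Electronics data; rows are quarters, columns are items, and each cell holds the summed measure for one city.

```python
from collections import defaultdict

# Hypothetical sales records for one city:
# (quarter, item, dollars_sold_in_thousands).
records = [
    ("Q1", "phone", 10), ("Q1", "computer", 25),
    ("Q2", "phone", 12), ("Q2", "computer", 30),
    ("Q1", "phone", 5),
]

# 2-D view: rows are quarters, columns are items, cells hold the summed measure.
table = defaultdict(lambda: defaultdict(int))
for quarter, item, dollars in records:
    table[quarter][item] += dollars

# Render the view as a small text grid.
items = sorted({item for _, item, _ in records})
lines = ["quarter " + " ".join(f"{i:>9}" for i in items)]
for quarter in sorted(table):
    cells = " ".join(f"{table[quarter][i]:>9}" for i in items)
    lines.append(f"{quarter:<7} {cells}")
grid = "\n".join(lines)
print(grid)
```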
Example (4-D cuboid)
Computed versus Stored Data Cubes
• Pre-compute all cells in the cube: If the whole cube is pre-computed, then queries run on
the cube will be very fast. The disadvantage is that the pre-computed cube requires a lot of
memory.
• Pre-compute no cells: To minimize memory requirements, we can pre-compute none of the
cells in the cube. The disadvantage here is that queries on the cube will run more slowly
because the cube will need to be rebuilt for each query.
• Pre-compute some of the cells: As a compromise between these two, we can pre-compute
only those cells in the cube which will most likely be used for decision support queries. The
trade-off between memory space and computing time is called the space-time trade-off,
and it often exists in data mining and computer science in general.
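The third option (pre-computing only some cells) can be sketched as follows. The fact rows, the dimension names, and the choice of which cuboids to store are assumptions for illustration; a real system would choose them from the expected query workload.

```python
from collections import defaultdict

DIMENSIONS = ("quarter", "item", "city")

# Hypothetical base facts: (quarter, item, city, dollars_sold).
facts = [
    ("Q1", "phone", "Vancouver", 10),
    ("Q1", "phone", "Toronto", 7),
    ("Q2", "phone", "Vancouver", 12),
    ("Q2", "computer", "Vancouver", 30),
]

def compute_cuboid(group_by):
    """Aggregate the base facts over the given subset of dimensions."""
    idx = [DIMENSIONS.index(d) for d in group_by]
    cells = defaultdict(int)
    for row in facts:
        cells[tuple(row[i] for i in idx)] += row[3]
    return dict(cells)

# Partial materialization: only cuboids expected to be queried often are stored.
materialized = {gb: compute_cuboid(gb) for gb in [("quarter",), ("item", "city")]}

def query(group_by):
    """Return (cells, was_precomputed); compute on demand if not stored."""
    if group_by in materialized:
        return materialized[group_by], True
    return compute_cuboid(group_by), False
```

Here query(("quarter",)) is answered from stored cells, while query(("city",)) has to scan the base facts: stored cuboids cost space, and everything else costs computing time at query time.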
Operations of OLAP(Data Cube)
Four types of analytical operations in OLAP are:
• Roll-up
• Drill-down
• Slice and dice
• Pivot (rotate)
Roll-up: Roll-up is also known as "consolidation" or "aggregation." The roll-up operation can
be performed in two ways:
• Reducing dimensions
• Climbing up a concept hierarchy (a concept hierarchy is a system of grouping things based on
their order or level)
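Both forms of roll-up can be sketched on a small cube. The cell values and the city-to-country hierarchy below are hypothetical and only serve to show the two operations.

```python
from collections import defaultdict

# Hypothetical cube cells keyed by (quarter, city), with dollars sold as the measure.
cube = {
    ("Q1", "Vancouver"): 10, ("Q1", "Toronto"): 7,
    ("Q1", "Chicago"): 4, ("Q2", "Vancouver"): 12,
}

# Assumed concept hierarchy for the location dimension: city -> country.
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada", "Chicago": "USA"}

def roll_up_by_hierarchy(cells):
    """Climb the hierarchy: replace each city with its country and re-aggregate."""
    out = defaultdict(int)
    for (quarter, city), value in cells.items():
        out[(quarter, CITY_TO_COUNTRY[city])] += value
    return dict(out)

def roll_up_by_dimension_reduction(cells):
    """Reduce dimensions: drop the location dimension and aggregate over it."""
    out = defaultdict(int)
    for (quarter, _city), value in cells.items():
        out[(quarter,)] += value
    return dict(out)

by_country = roll_up_by_hierarchy(cube)
by_quarter = roll_up_by_dimension_reduction(cube)
```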
Drill-down: In drill-down, data is broken into smaller, more detailed parts. It is the opposite of the roll-up
process. It can be done via