unit-1 data warehousing
1. Data warehouse
A Database Management System (DBMS) stores data in the form of tables, is typically designed using
an ER model, and aims to guarantee the ACID properties. For example, a college DBMS has tables for
students, faculty, etc.
A Data Warehouse is separate from the DBMS. It stores a huge amount of data, which is typically
collected from multiple heterogeneous sources like files, DBMS, etc. The goal is to produce
statistical results that may help in decision-making. For example, a college might want to quickly
see different results, like how the placement of CS students has improved over the last 10 years, in
terms of salaries, counts, etc.
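The college example can be sketched as a small aggregate query. This is only an illustration using Python's built-in sqlite3 module; the table name placements, its columns, and the figures are hypothetical and not taken from any real data set.

```python
import sqlite3

# Hypothetical table and figures, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE placements (year INTEGER, branch TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO placements VALUES (?, ?, ?)",
    [
        (2022, "CS", 6.0),
        (2022, "CS", 8.0),
        (2023, "CS", 9.0),
        (2023, "CS", 11.0),
        (2023, "EE", 7.0),
    ],
)


def placement_trend(connection, branch):
    """Return (year, students placed, average salary) per year for one branch."""
    return connection.execute(
        "SELECT year, COUNT(*), AVG(salary) FROM placements "
        "WHERE branch = ? GROUP BY year ORDER BY year",
        (branch,),
    ).fetchall()


cs_trend = placement_trend(conn, "CS")
for year, placed, avg_salary in cs_trend:
    print(year, placed, avg_salary)
```

The query summarises many rows into one line per year, which is the kind of statistical result a warehouse is meant to produce.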
Cost reductions: Data warehousing can result in cost savings over time by reducing data
management procedures and increasing overall efficiency, even when there are setup costs
initially.
Data security: Data warehouses employ security protocols to safeguard confidential
information, guaranteeing that only authorized personnel are granted access to certain data.
Disadvantages of Data Warehousing
Cost: Building a data warehouse can be expensive, requiring significant investments in
hardware, software, and personnel.
Complexity: Data warehousing can be complex, and businesses may need to hire
specialized personnel to manage the system.
Time-consuming: Building a data warehouse can take a significant amount of time,
requiring businesses to be patient and committed to the process.
Data integration challenges: Data from different sources can be challenging to integrate,
requiring significant effort to ensure consistency and accuracy.
Data security: Data warehousing can pose data security risks, and businesses must take
measures to protect sensitive data from unauthorized access or breaches.
1.2. Data warehouse components
Architecture is the proper arrangement of the elements. We build a data warehouse with software
and hardware components. To suit the requirements of our organization, we arrange these building
blocks, and we may want to strengthen a particular part with extra tools and services. All of this
depends on our circumstances.
[Figure: Data warehouse components. Source data (production, internal, archived, external); Data
storage (data warehouse DBMS, multidimensional database, data marts); Information delivery (OLAP,
data mining); Metadata; Management and control.]
The figure shows the essential elements of a typical warehouse. The Source Data component is shown
on the left. The Data Staging element serves as the next building block. In the middle is the Data
Storage component, which handles the data warehouse's data. This element not only stores and
manages the data; it also keeps track of the data using the metadata repository. The Information
Delivery component, shown on the right, consists of all the different ways of making the information
from the data warehouse available to the users.
3) Data Loading: Two distinct categories of tasks form data loading functions. When we complete
the structure and construction of the data warehouse and go live for the first time, we do the initial
loading of the information into the data warehouse storage. The initial load moves high volumes of
data and takes a substantial amount of time.
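An initial load of this kind can be sketched as one bulk insert into an empty table. This is a minimal sketch using Python's sqlite3 module; the sales table, the initial_load function, and the generated rows are hypothetical stand-ins for a real extract from the source systems.

```python
import sqlite3


def initial_load(warehouse, source_rows):
    """Load every source row into the warehouse table in one transaction."""
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS sales (sale_id INTEGER PRIMARY KEY, amount REAL)"
    )
    with warehouse:  # commits on success, rolls back if any insert fails
        warehouse.executemany("INSERT INTO sales VALUES (?, ?)", source_rows)
    return warehouse.execute("SELECT COUNT(*) FROM sales").fetchone()[0]


warehouse = sqlite3.connect(":memory:")
# Stand-in for a large extract from the source systems (hypothetical data).
source_rows = [(i, float(i) * 10) for i in range(1, 1001)]
loaded = initial_load(warehouse, source_rows)
print(loaded)
```

Wrapping the whole load in a single transaction means a failed load leaves the table empty rather than half filled.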
[Figure: Information delivery component. Labels: Data marts, Intranet, E-mail, MD Analysis, EIS
Feed, Data mining.]
Metadata Component
Metadata in a data warehouse is similar to the data dictionary or the data catalog in a database
management system. In the data dictionary, we keep the data about the logical data structures, the
data about the records and addresses, the information about the indexes, and so on.
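As a small illustration of a data dictionary, SQLite keeps its own catalog of tables and indexes, which can be queried like any other data. The students table and the index name below are hypothetical; sqlite_master and PRAGMA table_info are SQLite's own catalog features, used here only as a stand-in for a warehouse metadata repository.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (student_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE INDEX idx_students_name ON students (name)")

# sqlite_master is SQLite's built-in catalog of tables and indexes.
catalog = conn.execute(
    "SELECT type, name FROM sqlite_master ORDER BY type, name"
).fetchall()

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, default value, pk).
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(students)")]

print(catalog)
print(columns)
```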
Data Marts
A data mart includes a subset of corporate-wide data that is of value to a specific group of users.
Its scope is confined to particular selected subjects. Data in a data warehouse should be fairly
current, but not necessarily up to the minute, although developments in the data warehouse industry
have made standard and incremental data dumps more achievable. Data marts are smaller than data
warehouses and usually contain data for a single part of the organization. The current trend in data
warehousing is to develop a data warehouse with several smaller, related data marts for particular
kinds of queries and reports.
The management and control elements coordinate the services and functions within the data
warehouse. These components control the data transformation and the data transfer into the data
warehouse storage. They also moderate the data delivery to the clients. They work with the
database management systems and ensure that data is correctly saved in the repositories. They
monitor the movement of information into the staging area and from there into the data
warehouse storage itself.
1. Database: It is used for Online Transactional Processing (OLTP) but can be used for other
   objectives such as Data Warehousing. It records the data from the clients for history.
   Data warehouse: It is used for Online Analytical Processing (OLAP). It reads the historical
   information about the customers for business decisions.
2. Database: The tables and joins are complicated since they are normalized for the RDBMS. This is
   done to reduce redundant data and to save storage space.
   Data warehouse: The tables and joins are simple since they are de-normalized. This is done to
   minimize the response time for analytical queries.
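The normalized versus de-normalized trade-off can be sketched with two small designs that answer the same question. This is an illustrative sketch with Python's sqlite3 module; the customers, orders, and orders_wide tables and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized (operational) design: customer details are stored once.
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Pune'), (2, 'Delhi');
    INSERT INTO orders VALUES (10, 1, 100.0), (11, 1, 50.0), (12, 2, 70.0);

    -- De-normalized (warehouse) design: the city is repeated on every order row.
    CREATE TABLE orders_wide AS
        SELECT o.order_id, c.city, o.amount
        FROM orders o JOIN customers c ON c.customer_id = o.customer_id;
""")

# The normalized design needs a join to total sales per city.
with_join = conn.execute(
    "SELECT c.city, SUM(o.amount) FROM orders o "
    "JOIN customers c ON c.customer_id = o.customer_id "
    "GROUP BY c.city ORDER BY c.city"
).fetchall()

# The de-normalized design answers the same question from one table.
without_join = conn.execute(
    "SELECT city, SUM(amount) FROM orders_wide GROUP BY city ORDER BY city"
).fetchall()

print(with_join)
print(without_join)
```

Both queries return the same totals; the de-normalized table trades extra storage for a simpler analytical query.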
The Operational Database is the source of information for the data warehouse. It includes detailed
information used to run the day-to-day operations of the business. The data changes frequently as
updates are made and reflects the current value of the last transactions.
Operational Database Management Systems, also called OLTP (Online Transaction Processing)
databases, are used to manage dynamic data in real time.
Data Warehouse Systems serve users or knowledge workers for the purpose of data analysis and
decision-making. Such systems can organize and present information in specific formats to
accommodate the diverse needs of various users. These systems are called Online Analytical
Processing (OLAP) systems.
Data Warehouse and the OLTP database are both relational databases. However, the goals of both
these databases are different.
Operational Database vs. Data Warehouse
1. Operational database: Operational systems are designed to support high-volume transaction
   processing.
   Data warehouse: Data warehousing systems are typically designed to support high-volume
   analytical processing (i.e., OLAP).
2. Operational database: Operational systems are usually concerned with current data.
   Data warehouse: Data warehousing systems are usually concerned with historical data.
3. Operational database: Data within operational systems are updated regularly according to need.
   Data warehouse: Data are non-volatile; new data may be added regularly, but once added they are
   rarely changed.
4. Operational database: It is designed for real-time business dealings and processes.
   Data warehouse: It is designed for analysis of business measures by subject area, categories,
   and attributes.
5. Operational database: It is optimized for a simple set of transactions, generally adding or
   retrieving a single row at a time per table.
   Data warehouse: It is optimized for bulk loads and large, complex, unpredictable queries that
   access many rows per table.
6. Operational database: Operational systems are largely process-oriented.
   Data warehouse: Data warehousing systems are largely subject-oriented.
7. Operational database: Operational systems are usually optimized to perform fast inserts and
   updates of relatively small volumes of data.
   Data warehouse: Data warehousing systems are usually optimized to perform fast retrievals of
   relatively high volumes of data.
8. Operational database: Data in.
   Data warehouse: Data out.
An OLTP system handles operational data. Operational data are the data involved in the operation
of a particular system, for example ATM transactions and bank transactions.
OLAP System
An OLAP system handles historical or archival data. Historical data are data that are archived
over a long period. For example, if we collect the last 10 years of information about flight
reservations, the data can give us meaningful insights such as trends in reservations. This may
provide useful information like the peak time of travel and what kinds of people travel in the
various classes (Economy/Business), etc.
The major difference between an OLTP and an OLAP system is the amount of data analyzed in a
single transaction. An OLTP system manages many concurrent customers and queries, each touching
only an individual record or a limited group of records at a time, whereas an OLAP system must be
able to operate on millions of records to answer a single query.
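The difference in the amount of data touched can be sketched with the flight reservation example. This is only an illustration using Python's sqlite3 module; the reservations table and the generated rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reservations (id INTEGER PRIMARY KEY, month INTEGER, class TEXT)"
)
# Hypothetical data: month m has m reservations; months 4, 8 and 12 are Business.
rows = [
    (month, "Business" if month % 4 == 0 else "Economy")
    for month in range(1, 13)
    for _ in range(month)
]
conn.executemany("INSERT INTO reservations (month, class) VALUES (?, ?)", rows)

# OLTP-style access: look up one record by its key.
single = conn.execute(
    "SELECT month, class FROM reservations WHERE id = ?", (1,)
).fetchone()

# OLAP-style access: scan every record to summarise it.
by_class = conn.execute(
    "SELECT class, COUNT(*) FROM reservations GROUP BY class ORDER BY class"
).fetchall()
peak_month = conn.execute(
    "SELECT month FROM reservations GROUP BY month ORDER BY COUNT(*) DESC LIMIT 1"
).fetchone()[0]

print(single, by_class, peak_month)
```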
Feature-by-feature comparison of OLTP and OLAP systems:
Database design
   OLTP: An OLTP system usually uses an entity-relationship (ER) data model and an
   application-oriented database design.
   OLAP: An OLAP system typically uses either a star or snowflake model and a subject-oriented
   database design.
View
   OLTP: An OLTP system focuses primarily on the current data within an enterprise or department,
   without referring to historical information or data in different organizations.
   OLAP: An OLAP system often spans multiple versions of a database schema, due to the evolutionary
   process of an organization. OLAP systems also deal with data that originates from various
   organizations, integrating information from many data stores.
Volume of data
   OLTP: Not very large.
   OLAP: Because of their large volume, OLAP data are stored on multiple storage media.
Access patterns
   OLTP: The access patterns of an OLTP system consist mainly of short, atomic transactions. Such a
   system requires concurrency control and recovery techniques.
   OLAP: Accesses to OLAP systems are mostly read-only operations because these data warehouses
   store historical data.
Access mode
   OLTP: Read/write.
   OLAP: Mostly read.
Inserts and updates
   OLTP: Short and fast inserts and updates initiated by end users.
   OLAP: Periodic long-running batch jobs refresh the data.
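A short, atomic OLTP transaction can be sketched as a bank transfer that either applies both updates or neither. This is a minimal sketch using Python's sqlite3 module; the accounts table, the transfer function, and the balances are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 20.0)])
conn.commit()


def transfer(connection, src, dst, amount):
    """Move money between two accounts atomically; return True on success."""
    try:
        with connection:  # commit if both updates succeed, otherwise roll back
            connection.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            connection.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(connection):
    """Return all (id, balance) pairs ordered by account id."""
    return connection.execute(
        "SELECT id, balance FROM accounts ORDER BY id"
    ).fetchall()


ok = transfer(conn, 1, 2, 30.0)       # succeeds
failed = transfer(conn, 1, 2, 500.0)  # debit violates the CHECK, so the credit is undone too
print(ok, failed, balances(conn))
```

The rollback on failure is a simple form of the recovery behaviour that OLTP systems depend on.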
2. A bottom-up approach
The bottom-up approach starts with experiments and prototypes. This is useful in the early
stage of business modeling and technology development. It allows an organisation to move
forward at considerably less expense and to evaluate the benefits of the technology before
making significant commitments.
3. A combination of both approaches
In the combined approach, an organisation can exploit the planned and strategic nature of
the top-down approach while retaining the rapid implementation and opportunistic
application of the bottom-up approach.
In general, the warehouse design process consists of the following steps:
1. Choose a business process to model, e.g., orders, invoices, shipments, inventory, account
administration, sales, and the general ledger. If the business process is organisational and
involves multiple, complex object collections, a data warehouse model should be followed.
However, if the process is departmental and focuses on the analysis of one kind of business
process, a data mart model should be chosen.
2. Choose the grain of the business process. The grain is the fundamental, atomic level of data
to be represented in the fact table for this process, e.g., individual transactions, individual
daily snapshots, etc.
3. Choose the dimensions that will apply to each fact table record. Typical dimensions are
time, item, customer, supplier, warehouse, transaction type, and status.
4. Choose the measures that will populate each fact table record. Typical measures are
numeric additive quantities like dollars-sold and units-sold.
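The four design steps can be sketched as a tiny star schema. This is an illustrative sketch using Python's sqlite3 module, assuming a sales business process; the table names dim_time, dim_item, and fact_sales and all values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Step 3: dimensions that describe each fact record.
    CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
    CREATE TABLE dim_item (item_id INTEGER PRIMARY KEY, item_name TEXT);

    -- Steps 1 and 2: the sales process, at the grain of one row per transaction.
    CREATE TABLE fact_sales (
        time_id INTEGER REFERENCES dim_time (time_id),
        item_id INTEGER REFERENCES dim_item (item_id),
        dollars_sold REAL,   -- step 4: additive measure
        units_sold INTEGER   -- step 4: additive measure
    );

    INSERT INTO dim_time VALUES (1, 2023, 'Q1'), (2, 2023, 'Q2');
    INSERT INTO dim_item VALUES (1, 'laptop'), (2, 'phone');
    INSERT INTO fact_sales VALUES
        (1, 1, 1000.0, 1), (1, 2, 500.0, 1), (2, 1, 2000.0, 2), (2, 2, 500.0, 1);
""")

# Additive measures can be summed along any dimension, here by item.
sales_by_item = conn.execute(
    "SELECT i.item_name, SUM(f.dollars_sold), SUM(f.units_sold) "
    "FROM fact_sales f JOIN dim_item i ON i.item_id = f.item_id "
    "GROUP BY i.item_name ORDER BY i.item_name"
).fetchall()

print(sales_by_item)
```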
Once a data warehouse is designed and constructed, the initial deployment of the warehouse
includes initial installation, rollout planning, training and orientation. Platform upgrades
and maintenance must also be considered. Data warehouse administration will include data
refreshment, data source synchronisation, planning for disaster recovery, managing access control
and security, managing data growth, managing database performance, and data warehouse
enhancement and extension.
operational RDBMS. This warehouse type is relatively easy to build but requires excess
computational capacity of the underlying operational database system. The users directly access
operational data via middleware tools. This architecture is feasible only if queries are posed
infrequently, and usually is used as a temporary solution until a permanent data warehouse is
developed.
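One way to picture this warehouse type is a database view defined directly over operational tables. This is a sketch under that assumption, using Python's sqlite3 module; the orders table and the sales_by_region view are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Operational table that the business updates day to day.
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'north', 10.0), (2, 'south', 20.0);

    -- The view stores no data; it is recomputed from orders on every query.
    CREATE VIEW sales_by_region AS
        SELECT region, SUM(amount) AS total FROM orders GROUP BY region;
""")

before = conn.execute("SELECT * FROM sales_by_region ORDER BY region").fetchall()
conn.execute("INSERT INTO orders VALUES (3, 'north', 5.0)")
after = conn.execute("SELECT * FROM sales_by_region ORDER BY region").fetchall()

print(before)
print(after)
```

Because every analytical query runs against the operational table itself, the operational system carries the extra load, which is why this approach suits infrequent queries.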
Data Mart: The data mart contains a subset of the organisation-wide data that is of value to a
small group of users, e.g., marketing or customer service. This is usually a precursor (and/or a
successor) of the actual data warehouse, which differs with respect to the scope that is confined
to a specific group of users.
Depending on the source of data, data marts can be categorized into the following two classes:
1. Independent data marts are sourced from data captured from one or more operational
systems or external information providers, or from data generated locally within a
particular department or geographic area.
2. Dependent data marts are sourced directly from enterprise data warehouses.
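A dependent data mart can be sketched as a department-specific table derived from the enterprise warehouse. This is only an illustration using Python's sqlite3 module; the enterprise_facts and marketing_mart tables and their rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Stand-in for the enterprise data warehouse (hypothetical data).
    CREATE TABLE enterprise_facts (department TEXT, metric TEXT, value REAL);
    INSERT INTO enterprise_facts VALUES
        ('marketing', 'campaign_cost', 120.0),
        ('marketing', 'leads', 45.0),
        ('customer_service', 'tickets', 300.0);

    -- Dependent data mart: only the subset that matters to marketing.
    CREATE TABLE marketing_mart AS
        SELECT metric, value FROM enterprise_facts WHERE department = 'marketing';
""")

mart_rows = conn.execute(
    "SELECT metric, value FROM marketing_mart ORDER BY metric"
).fetchall()

print(mart_rows)
```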
Enterprise warehouse: This warehouse type holds all information about subjects spanning the
entire organisation. For a medium- to a large-size company, usually several years are needed to
design and build the enterprise warehouse.
The differences between the virtual and the enterprise DWs are shown in Figure 1.4. Data marts
can also be created as successors of an enterprise data warehouse. In this case, the DW consists of
an enterprise warehouse and (several) data marts.
[Figure 1.4: Virtual versus enterprise data warehouse. Labels: middleware, server, enterprise
warehouse, decision support environment.]
Oracle Autonomous Data Warehouse (ADW) and Snowflake are both cloud-based data
warehousing solutions, but they have some differences in terms of architecture, features, and
approach to data management. Here's a comparison between Oracle Autonomous Data Warehouse
and Snowflake:
1. Vendor:
Oracle ADW is built on Oracle Database technology and is part of the Oracle Cloud
Infrastructure. It utilizes Oracle's Autonomous Database technology, which includes
self-driving, self-securing, and self-repairing capabilities.
Snowflake is built as a multi-cloud, multi-cluster, and multi-region data warehouse
service. It has a unique architecture that separates storage and compute resources,
providing elasticity and scalability.
3. Automation:
Both ADW and Snowflake emphasize automation. ADW, as part of the Oracle
Autonomous Database family, is designed to automate various database management
tasks, including provisioning, patching, and tuning.
Snowflake also offers automation features, such as automatic scaling of compute
resources based on demand and automatic performance optimization.
4. Scalability:
ADW provides the ability to scale computing resources up or down based on
workload demands, allowing for flexibility in resource allocation.
Snowflake's architecture allows compute and storage to be scaled independently,
and it automatically handles the distribution of data across clusters.
5. Performance:
1. Cloud-Native Architecture:
Modern data warehouses are often built on cloud platforms, such as AWS, Azure, or
Google Cloud, to take advantage of scalable and flexible computing resources, as
well as the ability to pay for resources on a consumption basis.
2. Data Lakes Integration:
Integration with data lakes allows for the storage and analysis of both structured and
unstructured data. This integration supports diverse data types and enables more
comprehensive analytics.
3. Scalability:
Modern data warehouses are designed to scale horizontally and vertically, allowing
organizations to easily add or remove resources based on data volume and
processing needs.
4. Automated Data Management:
Automation is a key aspect, covering various tasks such as data ingestion, data
transformation, and data quality checks. Automated processes reduce manual effort,
enhance efficiency, and improve overall system reliability.
5. Data Virtualization:
Data virtualization enables users to access and analyze data without physically
moving it. This can be particularly useful for integrating data from multiple sources
and providing a unified view without the need for extensive data movement.
6. Advanced Analytics and Machine Learning:
Modern data warehouses often incorporate advanced analytics and machine learning
capabilities directly within the platform. This allows organizations to derive insights
from data and build predictive models without having to move the data to external
systems.
7. Real-Time Data Processing:
The ability to handle real-time data processing and analytics is a crucial aspect of a
modern data warehouse. This is especially important for organizations that require
up-to-the-minute insights for decision-making.
8. Security and Compliance:
Security features are a priority, including robust authentication, encryption, and
compliance with regulatory standards. Modern data warehouses often provide
fine-grained access controls to ensure data privacy and security.
9. Cost Management:
Cost-effective solutions are a focus, with modern data warehouses allowing
organizations to pay for the resources they consume. This pay-as-you-go model is
often more cost-efficient than traditional on-premises solutions.
10. Integration with BI Tools and Visualization:
Seamless integration with business intelligence (BI) tools and visualization
platforms is essential to empower users to easily analyze and visualize data stored in
the warehouse.
11. Flexible Data Models:
Modern data warehouses support flexible data models, including both relational and
non-relational data. This flexibility accommodates diverse data types and structures.
12. Data Governance:
Robust data governance features are included to ensure data quality, lineage, and
compliance with regulatory requirements. This includes metadata management, data
cataloging, and lineage tracking.