A Data Warehouse (DW) is a relational database that is designed for query and
analysis rather than transaction processing. It contains historical data derived from
transaction data from single or multiple sources.
A Data Warehouse provides integrated, enterprise-wide, historical data and focuses
on providing support for decision-makers for data modeling and analysis.
A Data Warehouse is a group of data specific to the entire organization, not only to
a particular group of users.
It is not used for daily operations and transaction processing but used for making
decisions.
A Data Warehouse can be viewed as a data system with the following attributes:
o It is a database designed for investigative tasks, using data from various
applications.
o It supports a relatively small number of clients with relatively long
interactions.
o It includes current and historical data to provide a historical perspective of
information.
o Its usage is read-intensive.
o It contains a few large tables.
"Data Warehouse is a subject-oriented, integrated, and time-variant store of
information in support of management's decisions."
Characteristics of Data Warehouse
Subject-Oriented
A data warehouse targets the modeling and analysis of data for decision-makers.
Therefore, data warehouses typically provide a concise and straightforward view of a
particular subject, such as customer, product, or sales, instead of the organization's
ongoing global operations. This is done by excluding data that are not useful for the
subject and including all data needed by users to understand the subject.
Integrated
A data warehouse integrates various heterogeneous data sources such as RDBMSs, flat
files, and online transaction records. This requires performing data cleaning and
integration during data warehousing to ensure consistency in naming conventions,
attribute types, and so on among the different data sources.
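As an illustration, here is a minimal sketch of this kind of harmonization, assuming the pandas library is available; the source systems, column names, and values are hypothetical:

import pandas as pd

# Two source systems describe customers with different naming conventions
# and attribute encodings (all names and values here are hypothetical).
crm = pd.DataFrame({"cust_id": ["C1", "C2"], "Gender": ["M", "F"]})
billing = pd.DataFrame({"customer_no": ["C1", "C3"], "sex": ["male", "female"]})

# Harmonize naming conventions and attribute types before loading.
crm = crm.rename(columns={"cust_id": "customer_id", "Gender": "gender"})
billing = billing.rename(columns={"customer_no": "customer_id", "sex": "gender"})
billing["gender"] = billing["gender"].map({"male": "M", "female": "F"})

# A single, consistent customer dimension for the warehouse.
customers = pd.concat([crm, billing], ignore_index=True).drop_duplicates("customer_id")
print(customers)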
Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve
data from 3 months, 6 months, 12 months, or even earlier from a data warehouse.
This contrasts with a transaction system, where often only the most current data is
kept.
Non-Volatile
The data warehouse is a physically separate data store, into which data is transformed
from the source operational RDBMS. Operational updates of data do not occur in
the data warehouse, i.e., update, insert, and delete operations are not performed. It
usually requires only two procedures for data access: the initial loading of data and
read access to data. Therefore, the DW does not require transaction processing,
recovery, or concurrency-control capabilities, which allows for a substantial speedup of
data retrieval. Non-volatile means that, once entered into the warehouse, data
should not change.
Data warehouse applications are designed to support users' ad-hoc data
requirements, an activity now dubbed online analytical processing (OLAP).
These include applications such as forecasting, profiling, summary reporting, and
trend analysis.
Data warehouses and their architectures vary depending upon the specifics of an
organization's situation.
Flat Files
A Flat file system is a system of files in which transactional data is stored, and
every file in the system must have a different name.
Meta Data
A set of data that defines and gives information about other data.
Metadata summarizes necessary information about data, which makes it easier to find
and work with particular instances of data. For example, the author, the date created,
the date modified, and the file size are examples of very basic document
metadata.
The summarized-data area of the data warehouse stores all the predefined lightly and
highly summarized (aggregated) data generated by the warehouse manager.
The goal of the summarized data is to speed up query performance. The
summarized data is updated continuously as new information is loaded into the
warehouse.
End-User Access Tools
Data Warehouse Staging Area
A staging area is a temporary location where data from the source systems is copied.
It simplifies data cleansing and consolidation for operational data coming from
multiple source systems, especially for enterprise data warehouses where all relevant
data of an enterprise is consolidated.
The figure illustrates an example where purchasing, sales, and stocks are
separated. In this example, a financial analyst wants to analyze historical data for
purchases and sales or mine historical information to make predictions about
customer behavior.
Single-Tier Architecture
In this architecture, the only layer physically available is the source layer; the data
warehouse is virtual. This means that the data warehouse is implemented as a
multidimensional view of operational data created by specific middleware, or an
intermediate processing layer.
The vulnerability of this architecture lies in its failure to meet the requirement for
separation between analytical and transactional processing. Analysis queries are
submitted to operational data after the middleware interprets them, so queries
affect the transactional workload.
Two-Tier Architecture
The requirement for separation plays an essential role in defining the two-tier
architecture for a data warehouse system, in which the source layer is kept
physically separate from the data warehouse layer.
Three-Tier Architecture
The three-tier architecture consists of the source layer (containing multiple source
systems), the reconciled layer, and the data warehouse layer (containing both data
warehouses and data marts). The reconciled layer sits between the source data and
the data warehouse.
The main advantage of the reconciled layer is that it creates a standard reference
data model for the whole enterprise. At the same time, it separates the problems of
source data extraction and integration from those of data warehouse population. In
some cases, the reconciled layer is also used directly to better accomplish some
operational tasks, such as producing daily reports that cannot be satisfactorily
prepared using the corporate applications, or generating data flows to feed external
processes periodically so as to benefit from the cleaning and integration.
Business Case Analysis: After the IT strategy has been designed, the next step is
the business case. It is essential to understand the level of investment that can be
justified and to recognize the projected business benefits which should be derived
from using the data warehouse.
Education & Prototyping: Company will experiment with the ideas of data
analysis and educate themselves on the value of the data warehouse. This is
valuable and should be required if this is the company first exposure to the benefits
of the DS record. Prototyping method can progress the growth of education. It is
better than working models. Prototyping requires business requirement, technical
blueprint, and structures.
Building the Vision: This is the phase where the first production deliverable is
produced. This stage will probably create significant infrastructure elements for
extracting and loading information, but limit them to the extraction and loading of
the initial information sources.
History Load: The next step is one where the remainder of the required history is
loaded into the data warehouse. This means that new entities would not be
added to the data warehouse, but additional physical tables would probably be
created to store the increased data volumes.
Ad-Hoc Query: In this step, we configure an ad-hoc query tool to operate against
the data warehouse.
Extending Scope: In this phase, the scope of the DWH is extended to address a new
set of business requirements. This involves loading additional data sources into the
DWH, i.e., introducing new data marts.
Requirement Evolution: This is the last step of the delivery process of a data
warehouse. Requirements are not static; they evolve continuously. As the business
requirements change, they must be reflected in the system.
Concept Hierarchy
A concept hierarchy is a directed acyclic graph of concepts, where each concept is
identified by a unique name.
An arc from concept a to concept b denotes that a is a more general concept than b.
We can tag text with these concepts.
Each text report is tagged with a set of concepts that corresponds to its content.
Tagging a report with a concept implicitly entails tagging it with all the ancestors of
that concept in the hierarchy. It is therefore desirable that a report be tagged with
the lowest (most specific) concept possible: the tag is pushed down the hierarchy
until it cannot be pushed any further.
The outcome of this step is a hierarchy of reports in which, at each node, there is a
set of reports sharing the concept associated with that node.
The hierarchy of reports resulting from the tagging step is useful for many text
mining processes.
It is assumed that the hierarchy of concepts is known a priori. We can even obtain
such a hierarchy of documents without a concept hierarchy by using any
hierarchical clustering algorithm, which produces such a hierarchy.
A concept hierarchy defines a sequence of mappings from a set of specific, low-level
concepts to more general, higher-level concepts.
Concept hierarchies are crucial for the formulation of useful OLAP queries: they
allow the user to summarize the data at various levels.
For example, using a location hierarchy, the user can retrieve data that summarizes
sales for each location, for all the areas in a given state, or even for a given country,
without the necessity of reorganizing the data.
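As an illustration, the following minimal sketch (assuming pandas is available; the location hierarchy and sales figures are invented for the example) shows how a concept hierarchy lets the same data be summarized at several levels:

import pandas as pd

# Hypothetical sales records tagged at the lowest level of the location hierarchy.
sales = pd.DataFrame({
    "city":   ["Chicago", "Chicago", "New York", "Vancouver"],
    "amount": [500, 300, 700, 200],
})

# Concept hierarchy: city -> state/province -> country, as per-level mappings.
city_to_state = {"Chicago": "Illinois", "New York": "New York", "Vancouver": "British Columbia"}
state_to_country = {"Illinois": "USA", "New York": "USA", "British Columbia": "Canada"}

sales["state"] = sales["city"].map(city_to_state)
sales["country"] = sales["state"].map(state_to_country)

# Rolling up: summarize sales at any level without reorganizing the data.
print(sales.groupby("city")["amount"].sum())
print(sales.groupby("state")["amount"].sum())
print(sales.groupby("country")["amount"].sum())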
Data warehouse design takes an approach different from view materialization as
practiced in industry. It treats data warehouses as database systems with particular
needs, such as answering management-related queries. The target of the design
becomes how the records from multiple data sources should be extracted,
transformed, and loaded (ETL) to be organized in a database as the data warehouse.
Two design approaches are commonly distinguished:
1. "top-down" approach
2. "bottom-up" approach
Developing a new data mart from the data warehouse is very easy.
Data marts include the lowest-grain data and, if needed, aggregated data too.
Instead of a normalized database for the data warehouse, a denormalized
dimensional database is adopted to meet the data delivery requirements of data
warehouses. Using this method, for a set of data marts to serve as the enterprise
data warehouse, the data marts should be built with conformed dimensions in mind,
meaning that common objects are represented in the same way in the different data
marts. The conformed dimensions connect the data marts to form a data warehouse,
which is generally called a virtual data warehouse.
The advantage of the "bottom-up" design approach is that it has quick ROI, as
developing a data mart, a data warehouse for a single subject, takes far less time
and effort than developing an enterprise-wide data warehouse. Also, the risk of
failure is even less. This method is inherently incremental. This method allows the
project team to learn and grow.
Advantages of bottom-up design
o It is easy to develop new data marts and then integrate them with the existing ones.
o The locations of the data warehouse and the data marts are reversed in the
bottom-up design approach.
Comparing the two approaches:
o Top-down: breaks the vast problem into smaller subproblems. Bottom-up: solves
the essential low-level problems and integrates them into a higher one.
o Top-down: inherently architected, not a union of several data marts. Bottom-up:
inherently incremental; essential data marts can be scheduled first.
o Top-down: may see quick results if implemented with iterations. Bottom-up: less
risk of failure, favorable return on investment, and proof of techniques.
5. Sources: The data for the data warehouse is likely to come from several
data sources. This step involves identifying and connecting to the sources using
gateways, ODBC drivers, or other wrappers.
6. ETL: The data from the source systems will need to go through an ETL phase.
The process of designing and implementing the ETL phase may involve selecting a
suitable ETL tool vendor and purchasing and implementing the tools, and may
include customizing the tool to suit the needs of the enterprise. (A minimal sketch
of such an ETL step is given after this list.)
7. Populate the data warehouses: Once the ETL tools have been agreed upon,
testing the tools will be needed, perhaps using a staging area. Once everything is
working adequately, the ETL tools may be used in populating the warehouses given
the schema and view definition.
8. User applications: For the data warehouses to be helpful, there must be end-
user applications. This step contains designing and implementing applications
required by the end-users.
9. Roll-out the warehouse and applications: Once the data warehouse has
been populated and the end-client applications tested, the warehouse system and
the applications may be rolled out for the user community to use.
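The following is the minimal ETL sketch referred to in step 6. It assumes pandas is available and uses SQLite as a stand-in for the warehouse target; the file name, table name, and column names are hypothetical:

import sqlite3
import pandas as pd

# Extract: read a hypothetical flat-file export from a source system.
orders = pd.read_csv("orders_export.csv")  # file name is illustrative only

# Transform: clean and conform the data before loading.
orders = orders.drop_duplicates()
orders["order_date"] = pd.to_datetime(orders["order_date"], errors="coerce")
orders = orders.dropna(subset=["order_date", "customer_id"])
orders["amount"] = orders["amount"].astype(float)

# Load: append the conformed records to a warehouse fact table
# (SQLite stands in for the target warehouse here).
with sqlite3.connect("warehouse.db") as conn:
    orders.to_sql("fact_orders", conn, if_exists="append", index=False)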
Implementation Guidelines
1. Build incrementally: Data warehouses must be built incrementally. Generally,
it is recommended that a data mart be created with one particular project in
mind; once it is implemented, several other sections of the enterprise may also
want to implement similar systems. An enterprise data warehouse can then be
implemented in an iterative manner, allowing all data marts to extract information
from the data warehouse.
4. Ensure quality: Only data that has been cleaned and is of a quality accepted by
the organization should be loaded into the data warehouse.
7. Training: Data warehouse projects must not overlook training requirements. For
a data warehouse project to be successful, the users must be trained to use the
warehouse and to understand its capabilities.
Other than these two categories, one more type exists, called "Hybrid Data Marts."
Hybrid Data Marts
A hybrid data mart allows us to combine input from sources other than a data
warehouse. This can be helpful in many situations, especially when ad-hoc
integration is needed, such as after a new group or product is added to the
organization.
Designing
The design step is the first in the data mart process. This phase covers all of the
functions from initiating the request for a data mart through gathering data about
the requirements and developing the logical and physical design of the data mart.
Constructing
This step contains creating the physical database and logical structures associated
with the data mart to provide fast and efficient access to the data.
Accessing
This step involves putting the data to use: querying the data, analyzing it, creating
reports, charts and graphs and publishing them.
1. Set up an intermediate layer (meta layer) for the front-end tool to use. This
layer translates database structures and object names into business terms, so
that end-users can interact with the data mart using words that relate to
business functions.
2. Set up and manage database structures, such as summarized tables, that help
queries submitted through the front-end tools execute rapidly and efficiently.
Managing
This step involves managing the data mart over its lifetime; ongoing management
functions are performed here.
Metadata is used for building, maintaining, managing, and using the data
warehouse. Metadata allows users to understand the content and to find data.
Metadata is like a nerve center. Various processes during the building and
administering of the data warehouse generate parts of the data warehouse
metadata, and other processes use the parts generated by them. In the data
warehouse, metadata assumes a key position and enables communication among
the various processes; it acts as a nerve center of the data warehouse.
Types of Metadata
Metadata in a data warehouse falls into three major categories:
o Operational Metadata
o Extraction and Transformation Metadata
o End-User Metadata
Operational Metadata
As we know, data for the data warehouse comes from the various operational
systems of the enterprise. These source systems use different data structures, and
the data elements selected for the data warehouse have various field lengths and
data types.
In selecting data from the source systems for the data warehouse, we split records,
combine parts of records from different source files, and deal with multiple coding
schemes and field lengths. When we deliver information to the end-users, we must
be able to tie it back to the source data sets. Operational metadata contains all of
this information about the operational data sources.
End-User Metadata
The end-user metadata is the navigational map of the data warehouse. It enables
the end-users to find data in the data warehouse using their own business
terminology and to look for information in the ways in which they usually think of
the business.
o Procedural Approach
o ASCII Batch Approach
o Hybrid Approach
In the procedural approach, the communication is built into the tool through an API.
This enables the highest degree of flexibility.
In the ASCII batch approach, instead of an API, the tools rely on a standardized
ASCII file format that contains the information on the various metadata items and
the standardized access requirements that make up the interchange standard's
metadata model.
The user configuration is a file describing the legal interchange paths for metadata
in the user's environment.
Metadata Repository
The metadata itself is housed in and managed by the metadata repository. Metadata
repository management software can be used to map the source data to the target
database, integrate and transform the data, generate code for data transformation,
and move data to the warehouse.
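As an illustration of the kind of information such a repository holds, here is a minimal sketch of one source-to-target mapping entry; the structure and all names are hypothetical and do not follow any particular product's format:

# One illustrative metadata repository entry describing how a warehouse
# column is derived from a source field (all names are hypothetical).
mapping_entry = {
    "target_table":   "fact_orders",
    "target_column":  "order_amount_usd",
    "source_system":  "billing",
    "source_table":   "ORDERS",
    "source_column":  "AMT",
    "data_type":      "DECIMAL(12,2)",
    "transformation": "convert currency to USD; round to 2 decimals",
    "load_frequency": "daily",
}

# Such entries let the repository drive transformation code generation and
# help end-users trace a warehouse value back to its operational source.
print(mapping_entry["target_column"], "<-", mapping_entry["source_column"])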
Data Preprocessing
Data preprocessing is an important step in the data mining process. It refers
to the cleaning, transforming, and integrating of data in order to make it
ready for analysis. The goal of data preprocessing is to improve the quality of
the data and to make it more suitable for the specific data mining task.
Some common steps in data preprocessing include:
Data Cleaning: This involves identifying and correcting errors or
inconsistencies in the data, such as missing values, outliers, and duplicates.
Various techniques can be used for data cleaning, such as imputation,
removal, and transformation.
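A minimal data-cleaning sketch, assuming pandas and NumPy are available; the dataset, the imputation choices, and the outlier threshold are hypothetical:

import numpy as np
import pandas as pd

# A small, invented dataset with missing values, a duplicate row, and an outlier.
df = pd.DataFrame({
    "age":    [25, np.nan, 31, 31, 230],
    "income": [40000, 52000, np.nan, np.nan, 61000],
})

df = df.drop_duplicates()                                   # remove duplicate rows
df["income"] = df["income"].fillna(df["income"].median())   # impute missing income
df = df[df["age"].between(0, 120) | df["age"].isna()]       # drop an implausible age (outlier)
df["age"] = df["age"].fillna(df["age"].mean())              # impute the remaining gap
print(df)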
Data Integration: This involves combining data from multiple sources to
create a unified dataset. Data integration can be challenging as it requires
handling data with different formats, structures, and semantics. Techniques
such as record linkage and data fusion can be used for data integration.
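A minimal integration sketch using a simple record-linkage key, assuming pandas is available; the sources, column names, and values are hypothetical:

import pandas as pd

# Two sources describe overlapping customers with different keys and formats.
crm = pd.DataFrame({"email": ["a@x.com", "b@y.com"], "name": ["Ann", "Bob"]})
sales = pd.DataFrame({"Email": ["A@X.COM", "c@z.com"], "total": [120.0, 80.0]})

# Simple record linkage on a normalized key, then fusion into one dataset.
crm["key"] = crm["email"].str.lower()
sales["key"] = sales["Email"].str.lower()
unified = crm.merge(sales.drop(columns=["Email"]), on="key", how="outer")
print(unified)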
Data Transformation: This involves converting the data into a suitable
format for analysis. Common techniques used in data transformation include
normalization, standardization, and discretization. Normalization is used to
scale the data to a common range, while standardization is used to transform
the data to have zero mean and unit variance. Discretization is used to
convert continuous data into discrete categories.
Data Reduction: This involves reducing the size of the dataset while
preserving the important information. Data reduction can be achieved
through techniques such as feature selection and feature extraction. Feature
selection involves selecting a subset of relevant features from the dataset,
while feature extraction involves transforming the data into a lower-
dimensional space while preserving the important information.
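A minimal sketch of both techniques, assuming scikit-learn is available and using its bundled iris dataset purely for illustration:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Feature selection: keep the 2 features most related to the class label.
X_selected = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

# Feature extraction: project the data onto 2 principal components (PCA).
X_extracted = PCA(n_components=2).fit_transform(X)

print(X.shape, X_selected.shape, X_extracted.shape)  # (150, 4) (150, 2) (150, 2)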
Data Discretization: This involves dividing continuous data into discrete
categories or intervals. Discretization is often used in data mining and
machine learning algorithms that require categorical data. Discretization can
be achieved through techniques such as equal width binning, equal
frequency binning, and clustering.
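A minimal sketch of equal-width and equal-frequency binning, assuming pandas is available; the values are invented:

import pandas as pd

ages = pd.Series([22, 25, 31, 38, 45, 52, 61, 70])

# Equal-width binning: intervals of equal length.
equal_width = pd.cut(ages, bins=4)

# Equal-frequency binning: each bin holds roughly the same number of values.
equal_freq = pd.qcut(ages, q=4)

print(pd.concat({"equal_width": equal_width, "equal_freq": equal_freq}, axis=1))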
Data Normalization: This involves scaling the data to a common range,
such as between 0 and 1 or -1 and 1. Normalization is often used to handle
data with different units and scales. Common normalization techniques
include min-max normalization, z-score normalization, and decimal scaling.
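A minimal sketch of the three normalization techniques, assuming NumPy is available; the values are invented:

import numpy as np

x = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization: rescale to the range [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: zero mean and unit variance.
z_score = (x - x.mean()) / x.std()

# Decimal scaling: divide by the smallest power of 10 that brings all values below 1.
j = int(np.floor(np.log10(np.abs(x).max()))) + 1
decimal_scaled = x / (10 ** j)

print(min_max, z_score, decimal_scaled, sep="\n")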
Data preprocessing plays a crucial role in ensuring the quality of data and
the accuracy of the analysis results. The specific steps involved in data
preprocessing may vary depending on the nature of the data and the
analysis goals.
By performing these steps, the data mining process becomes more efficient
and the results become more accurate.
Preprocessing in Data Mining:
Data preprocessing is a data mining technique used to transform raw data into a
useful and efficient format.
Steps Involved in Data Preprocessing:
1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this, data cleaning
is done. It involves handling missing data, noisy data, and so on.
2. Regression:
Here data can be made smooth by fitting it to a regression
function. The regression used may be linear (having one
independent variable) or multiple (having multiple independent
variables). A minimal sketch of regression-based smoothing is
given at the end of these steps.
3. Clustering:
This approach groups similar data into clusters. Outliers either
fall outside the clusters or may go undetected.
2. Data Transformation:
This step is taken in order to transform the data into forms appropriate for the
mining process. This involves the following ways:
1. Normalization:
It is done in order to scale the data values in a specified range (-1.0 to
1.0 or 0.0 to 1.0)
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.
3. Discretization:
This is done to replace the raw values of a numeric attribute by interval
levels or conceptual levels.
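The regression-based smoothing mentioned in the steps above, as a minimal sketch assuming NumPy is available; the data is synthetic:

import numpy as np

# Noisy measurements of a roughly linear relationship (values are synthetic).
x = np.arange(10, dtype=float)
y = 3.0 * x + 5.0 + np.random.normal(scale=2.0, size=x.size)

# Fit a linear regression and replace the noisy values with the fitted ones.
slope, intercept = np.polyfit(x, y, deg=1)
y_smoothed = slope * x + intercept

print(np.column_stack([y.round(2), y_smoothed.round(2)]))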