
What is a Data Warehouse?

A Data Warehouse (DW) is a relational database that is designed for query and
analysis rather than transaction processing. It includes historical data derived from
transactional data drawn from one or more sources.
A Data Warehouse provides integrated, enterprise-wide, historical data and focuses
on providing support for decision-makers for data modeling and analysis.
A Data Warehouse holds data that is specific to the entire organization, not only to
a particular group of users.
It is not used for daily operations and transaction processing but used for making
decisions.
A Data Warehouse can be viewed as a data system with the following attributes:
o It is a database designed for investigative tasks, using data from various
applications.
o It supports a relatively small number of clients with relatively long
interactions.
o It includes current and historical data to provide a historical perspective of
information.
o Its usage is read-intensive.
o It contains a few large tables.
"Data Warehouse is a subject-oriented, integrated, and time-variant store of
information in support of management's decisions."
Characteristics of Data Warehouse

Subject-Oriented
A data warehouse targets the modeling and analysis of data for decision-makers.
Therefore, data warehouses typically provide a concise and straightforward view
of a particular subject, such as customer, product, or sales, instead of the
global organization's ongoing operations. This is done by excluding data that are
not useful concerning the subject and including all data needed by the users to
understand the subject.
Integrated
A data warehouse integrates various heterogeneous data sources such as RDBMSs, flat
files, and online transaction records. It requires performing data cleaning and
integration during data warehousing to ensure consistency in naming conventions,
attribute types, etc., among the different data sources.

Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve
data from 3 months, 6 months, 12 months, or even older from a data
warehouse. This contrasts with a transaction system, where often only the most
current data is kept.
Non-Volatile
The data warehouse is a physically separate data store, into which data is transformed
from the source operational RDBMS. Operational updates of data do not occur in
the data warehouse, i.e., update, insert, and delete operations are not performed. It
usually requires only two procedures for data access: initial loading of data and
access to data. Therefore, the DW does not require transaction processing,
recovery, and concurrency capabilities, which allows for substantial speedup of data
retrieval. Non-volatile means that once data has entered the warehouse, it
should not change.

Goals of Data Warehousing


o To help reporting as well as analysis
o Maintain the organization's historical information
o Be the foundation for decision making.
Need for Data Warehouse
Data Warehouse is needed for the following reasons:
1. Business User: Business users require a data warehouse to view
summarized data from the past. Since these people are non-technical, the
data may be presented to them in an elementary form.
2. Store historical data: A data warehouse is required to store time-variant
data from the past. This data is used for various purposes.
3. Make strategic decisions: Some strategies may depend upon the
data in the data warehouse. So, the data warehouse contributes to making
strategic decisions.
4. For data consistency and quality: By bringing the data from different
sources to a common place, the user can effectively ensure
uniformity and consistency of the data.
5. High response time: The data warehouse has to be ready for somewhat
unexpected loads and types of queries, which demands a significant degree
of flexibility and quick response time.
Benefits of Data Warehouse
1. Understand business trends and make better forecasting decisions.
2. Data warehouses are designed to perform well with enormous amounts of data.
3. The structure of data warehouses is easier for end-users to
navigate, understand, and query.
4. Queries that would be complex in many normalized databases can be
easier to build and maintain in data warehouses.
5. Data warehousing is an efficient method of managing demand for lots of
information from lots of users.
6. Data warehousing provides the capability to analyze large amounts of
historical data.
Data Warehouse Architecture
A data warehouse architecture is a method of defining the overall architecture of
data communication, processing, and presentation that exists for end-client
computing within the enterprise. Each data warehouse is different, but all are
characterized by standard vital components.

Production applications such as payroll, accounts payable, product purchasing, and
inventory control are designed for online transaction processing (OLTP). Such
applications gather detailed data from day-to-day operations.

Data Warehouse applications are designed to support the user ad-hoc data
requirements, an activity recently dubbed online analytical processing (OLAP).
These include applications such as forecasting, profiling, summary reporting, and
trend analysis.

Production databases are updated continuously, either by hand or via OLTP
applications. In contrast, a warehouse database is updated from operational
systems periodically, usually during off-hours. As OLTP data accumulates in
production databases, it is regularly extracted, filtered, and then loaded into a
dedicated warehouse server that is accessible to users. As the warehouse is
populated, it must be restructured: tables de-normalized, data cleansed of errors
and redundancies, and new fields and keys added to reflect the users' needs for
sorting, combining, and summarizing data.

Data warehouses and their architectures vary depending upon the elements of an
organization's situation.

Three common architectures are:

o Data Warehouse Architecture: Basic


o Data Warehouse Architecture: With Staging Area
o Data Warehouse Architecture: With Staging Area and Data Marts

Data Warehouse Architecture: Basic


Operational System

In data warehousing, an operational system refers to a system that is used to process
the day-to-day transactions of an organization.

Flat Files

A Flat file system is a system of files in which transactional data is stored, and
every file in the system must have a different name.

Meta Data

A set of data that defines and gives information about other data.

Metadata is used in a data warehouse for a variety of purposes, including:

Metadata summarizes necessary information about data, which can make finding
and working with particular instances of data easier. For example, author,
date created, date modified, and file size are examples of very basic document
metadata.

Metadata is used to direct a query to the most appropriate data source.

Lightly and highly summarized data

This area of the data warehouse stores all the predefined lightly and highly
summarized (aggregated) data generated by the warehouse manager.

The goal of the summarized information is to speed up query performance. The
summarized records are updated continuously as new information is loaded into the
warehouse.
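To make this concrete, the following is a minimal sketch (not taken from the text) of how pre-computed summaries could be produced from detailed records using Python and pandas; the table and column names (store, region, sale_date, amount) are assumed purely for illustration.

# Illustrative sketch: pre-computing lightly and highly summarized aggregates
# from detailed sales records. Column names are assumed, not from the text.
import pandas as pd

detailed = pd.DataFrame({
    "store": ["S1", "S1", "S2", "S2"],
    "region": ["East", "East", "West", "West"],
    "sale_date": pd.to_datetime(["2024-01-05", "2024-02-10",
                                 "2024-01-20", "2024-02-25"]),
    "amount": [100.0, 150.0, 200.0, 250.0],
})

# Lightly summarized: monthly totals per store.
lightly = (detailed
           .groupby(["store", detailed["sale_date"].dt.to_period("M")])["amount"]
           .sum())

# Highly summarized: yearly totals per region.
highly = (detailed
          .groupby(["region", detailed["sale_date"].dt.to_period("Y")])["amount"]
          .sum())

print(lightly)
print(highly)

When such aggregates are stored ahead of time, queries that would otherwise scan the detailed table can read the much smaller summary instead.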
End-User access Tools

The principal purpose of a data warehouse is to provide information to business
managers for strategic decision-making. These users interact with the
warehouse using end-client access tools.

Examples of end-user access tools include:

o Reporting and Query Tools


o Application Development Tools
o Executive Information Systems Tools
o Online Analytical Processing Tools
o Data Mining Tools

Data Warehouse Architecture: With Staging Area


We must clean and process operational data before putting it into the
warehouse.

We can do this programmatically, although most data warehouses use a staging area (a
place where data is processed before entering the warehouse).

A staging area simplifies data cleansing and consolidation for operational data
coming from multiple source systems, especially for enterprise data warehouses
where all relevant data of an enterprise is consolidated.
Data Warehouse Staging Area is a temporary location where a record from
source systems is copied.

Data Warehouse Architecture: With Staging Area and Data Marts


We may want to customize our warehouse's architecture for multiple groups within
our organization.

We can do this by adding data marts. A data mart is a segment of a data
warehouse that provides information for reporting and analysis on a section,
unit, department, or operation in the company, e.g., sales, payroll, production, etc.

The figure illustrates an example where purchasing, sales, and stocks are
separated. In this example, a financial analyst wants to analyze historical data for
purchases and sales or mine historical information to make predictions about
customer behavior.

Properties of Data Warehouse Architectures


The following architecture properties are necessary for a data warehouse system:

1. Separation: Analytical and transactional processing should be kept apart as
much as possible.

2. Scalability: Hardware and software architectures should be easy to upgrade as
the data volume that has to be managed and processed, and the number of
users' requirements that have to be met, progressively increase.

3. Extensibility: The architecture should be able to host new applications and
technologies without redesigning the whole system.

4. Security: Monitoring accesses is necessary because of the strategic data
stored in the data warehouse.

5. Administerability: Data Warehouse management should not be complicated.

Types of Data Warehouse Architectures


Single-Tier Architecture
Single-tier architecture is not frequently used in practice. Its purpose is to
minimize the amount of data stored; to reach this goal, it removes data
redundancies.

The figure shows the only layer physically available is the source layer. In this
method, data warehouses are virtual. This means that the data warehouse is
implemented as a multidimensional view of operational data created by specific
middleware, or an intermediate processing layer.

The vulnerability of this architecture lies in its failure to meet the requirement for
separation between analytical and transactional processing. Analysis queries are
submitted to operational data after the middleware interprets them. In this way,
queries affect transactional workloads.

Two-Tier Architecture
The requirement for separation plays an essential role in defining the two-tier
architecture for a data warehouse system, as shown in fig:

Although it is typically called two-layer architecture to highlight the separation
between physically available sources and the data warehouse, it in fact consists of four
subsequent data flow stages:

1. Source layer: A data warehouse system uses heterogeneous sources of
data. That data is initially stored in corporate relational databases or legacy
databases, or it may come from an information system outside the corporate
walls.
2. Data Staging: The data stored in the sources should be extracted, cleansed
to remove inconsistencies and fill gaps, and integrated to merge
heterogeneous sources into one standard schema. The so-
called Extraction, Transformation, and Loading (ETL) tools can
combine heterogeneous schemata, extract, transform, cleanse, validate,
filter, and load source data into a data warehouse (see the sketch after this list).
3. Data Warehouse layer: Information is saved in one logically centralized
repository: the data warehouse. The data warehouse can be accessed directly,
but it can also be used as a source for creating data marts, which
partially replicate data warehouse contents and are designed for specific
enterprise departments. Meta-data repositories store information on sources,
access procedures, data staging, users, data mart schemas, and so on.
4. Analysis: In this layer, integrated data is efficiently and flexibly accessed to
issue reports, dynamically analyze information, and simulate hypothetical
business scenarios. It should feature aggregate information navigators,
complex query optimizers, and customer-friendly GUIs.
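As a concrete illustration of the data staging step, here is a minimal ETL sketch in Python; the source table, column names, and the use of SQLite and pandas are assumptions made only for the example, not part of any specific warehouse product.

# Minimal ETL sketch: extract from an operational source, transform/cleanse,
# and load into a warehouse fact table. Tables and columns are made up.
import sqlite3
import pandas as pd

source = sqlite3.connect(":memory:")     # stands in for the operational system
warehouse = sqlite3.connect(":memory:")  # stands in for the warehouse server

# A tiny operational table, created here only so the sketch runs end to end.
source.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL, order_date TEXT)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)",
                   [(1, " alice ", 120.5, "2024-01-05"),
                    (2, "Bob", 99.0, "2024-01-06"),
                    (None, "carol", 10.0, "2024-01-07")])

# Extract: pull raw orders from the operational system.
orders = pd.read_sql_query("SELECT * FROM orders", source)

# Transform: cleanse inconsistencies and conform to the warehouse schema.
orders = orders.dropna(subset=["order_id"])            # drop incomplete rows
orders["customer"] = orders["customer"].str.strip().str.title()
orders["order_date"] = pd.to_datetime(orders["order_date"])

# Load: append the conformed records into the warehouse fact table.
orders.to_sql("fact_orders", warehouse, if_exists="append", index=False)

In a real system, the extract, transform, and load stages are usually handled by a dedicated ETL tool and scheduled during off-hours, as described above.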

Three-Tier Architecture
The three-tier architecture consists of the source layer (containing multiple source
systems), the reconciled layer, and the data warehouse layer (containing both data
warehouses and data marts). The reconciled layer sits between the source data and
the data warehouse.

The main advantage of the reconciled layer is that it creates a standard reference
data model for the whole enterprise. At the same time, it separates the problems of
source data extraction and integration from those of data warehouse population. In
some cases, the reconciled layer is also used directly to better accomplish some
operational tasks, such as producing daily reports that cannot be satisfactorily
prepared using the corporate applications, or generating data flows to feed external
processes periodically so as to benefit from cleaning and integration.

This architecture is especially useful for extensive, enterprise-wide systems. A
disadvantage of this structure is the extra file storage space used by the
redundant reconciled layer. It also places the analytical tools a little further away
from being real-time.
Data Warehouse Delivery Process
The main steps used in the data warehouse delivery process are as follows:
IT Strategy: A data warehouse project must include an IT strategy for procuring and
retaining funding.

Business Case Analysis: After the IT strategy has been designed, the next step is
the business case. It is essential to understand the level of investment that can be
justified and to recognize the projected business benefits that should be derived
from using the data warehouse.

Education & Prototyping: Company will experiment with the ideas of data
analysis and educate themselves on the value of the data warehouse. This is
valuable and should be required if this is the company first exposure to the benefits
of the DS record. Prototyping method can progress the growth of education. It is
better than working models. Prototyping requires business requirement, technical
blueprint, and structures.

Business Requirements: These include:

o The logical model for data within the data warehouse
o The source systems that provide this data (mapping rules)
o The business rules to be applied to the information
o The query profiles for the immediate requirement


Technical Blueprint: This arranges the architecture of the warehouse. The technical
blueprint stage of the delivery process produces an architecture plan that satisfies long-
term requirements. It lays out the server and data mart architecture and the essential
components of the database design.

Building the vision: It is the phase where the first production deliverable is
produced. This stage will probably create significant infrastructure elements for
extracting and loading information but limit them to the extraction and load of
information sources.

History Load: The next step is the one where the remainder of the required history is
loaded into the data warehouse. This means that new entities are not
added to the data warehouse, but additional physical tables would probably be
created to store the increased record volumes.

Ad-Hoc Query: In this step, we configure an ad-hoc query tool to operate against
the data warehouse. These end-user access tools are capable of automatically
generating the database query that answers any question posed by the user.

Automation: The automation phase is where many of the operational management
processes are fully automated within the data warehouse. These would include:

o Extracting and loading the data from a variety of source systems
o Transforming the information into a form suitable for analysis
o Backing up, restoring, and archiving data
o Generating aggregations from predefined definitions within the data warehouse
o Monitoring query profiles and determining the appropriate aggregates to maintain
system performance

Extending Scope: In this phase, the scope of the data warehouse is extended to address a new
set of business requirements. This involves the loading of additional data sources
into the data warehouse, i.e., the introduction of new data marts.

Requirement Evolution: This is the last step of the delivery process of a data
warehouse. Requirements are not static; they evolve
continuously. As the business requirements change, they must be reflected in
the system.

Concept Hierarchy
A concept hierarchy is a directed acyclic graph of concepts, where a unique name identifies
each concept.
An arc from concept a to concept b denotes that a is a more general concept than b. We
can tag text with concepts.

Each text report is tagged by a set of concepts which corresponds to its content.

Tagging a report with a concept implicitly entails tagging it with all the ancestors
of that concept in the hierarchy. It is therefore desirable that a report be tagged with
the lowest possible concept.

The method used to automatically tag a report into the hierarchy is a top-down
approach. An evaluation function determines whether a record currently tagged to a
node can also be tagged to any of its child nodes.

If so, the tag moves down the hierarchy until it cannot be pushed any
further.

The outcome of this step is a hierarchy of reports and, at each node, there is a set of
reports sharing the concept associated with that node.

The hierarchy of reports resulting from the tagging step is useful for many text
mining processes.

It is assumed that the hierarchy of concepts is known a priori. We can even build
such a hierarchy of documents without a concept hierarchy, by using any
hierarchical clustering algorithm, which produces such a hierarchy.
A concept hierarchy defines a sequence of mappings from a set of particular, low-level
concepts to more general, higher-level concepts.

In a data warehouse, it is usually used to express different levels of granularity of an
attribute from one of the dimension tables.

Concept hierarchies are crucial for the formulation of useful OLAP queries. The
hierarchies allow the user to summarize the data at various levels.

For example, using the location hierarchy, the user can retrieve data which
summarizes sales for each location, for all the areas in a given state, or even a
given country without the necessity of reorganizing the data.
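A small sketch of the location example above, assuming a city → province_or_state → country hierarchy and made-up sales figures, shows how the same data can be summarized at different levels without reorganizing it:

# Illustrative only: rolling sales up the location concept hierarchy.
import pandas as pd

sales = pd.DataFrame({
    "city": ["Vancouver", "Victoria", "Toronto", "New York"],
    "province_or_state": ["British Columbia", "British Columbia", "Ontario", "New York"],
    "country": ["Canada", "Canada", "Canada", "USA"],
    "dollars_sold": [1000, 400, 1500, 2000],
})

# Summaries at successively higher levels of the hierarchy.
by_city = sales.groupby("city")["dollars_sold"].sum()
by_state = sales.groupby("province_or_state")["dollars_sold"].sum()
by_country = sales.groupby("country")["dollars_sold"].sum()
print(by_country)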

Data Warehouse Design


A data warehouse is a single data repository where data from multiple data
sources is integrated for online analytical processing (OLAP). This implies a
data warehouse needs to meet the requirements of all the business stages within
the entire organization. Thus, data warehouse design is a hugely complex, lengthy,
and hence error-prone process. Furthermore, business analytical functions change
over time, which results in changes in the requirements for the systems. Therefore,
data warehouse and OLAP systems are dynamic, and the design process is
continuous.

Data warehouse design takes a different approach from view materialization in
industry. It sees data warehouses as database systems with particular needs, such
as answering management-related queries. The target of the design becomes how
the data from multiple sources should be extracted, transformed, and loaded
(ETL) to be organized in a database as the data warehouse.

There are two approaches

1. "top-down" approach
2. "bottom-up" approach

Top-down Design Approach


In the "Top-Down" design approach, a data warehouse is described as a subject-
oriented, time-variant, non-volatile and integrated data repository for the entire
enterprise data from different sources are validated, reformatted and saved in a
normalized (up to 3NF) database as the data warehouse. The data warehouse stores
"atomic" information, the data at the lowest level of granularity, from where
dimensional data marts can be built by selecting the data required for specific
business subjects or particular departments. An approach is a data-driven approach
as the information is gathered and integrated first and then business requirements
by subjects for building data marts are formulated. The advantage of this method is
which it supports a single integrated data source. Thus data marts built from it will
have consistency when they overlap.

Advantages of top-down design

Data marts are loaded from the data warehouse.

Developing new data mart from the data warehouse is very easy.

Disadvantages of top-down design

This technique is inflexible to changing departmental needs.

The cost of implementing the project is high.


Bottom-Up Design Approach
In the "Bottom-Up" approach, a data warehouse is described as "a copy of
transaction data specifical architecture for query and analysis," term the star
schema. In this approach, a data mart is created first to necessary reporting and
analytical capabilities for particular business processes (or subjects). Thus it is
needed to be a business-driven approach in contrast to Inmon's data-driven
approach.

Data marts contain the lowest-grain data and, if needed, aggregated data too.
Instead of a normalized database for the data warehouse, a denormalized
dimensional database is adopted to meet the data delivery requirements of data
warehouses. Using this method, to use the set of data marts as the enterprise data
warehouse, data marts should be built with conformed dimensions in mind, meaning
that common objects are represented the same way in different data marts. The
conformed dimensions connect the data marts to form a data warehouse, which
is generally called a virtual data warehouse.

The advantage of the "bottom-up" design approach is that it has quick ROI, as
developing a data mart, a data warehouse for a single subject, takes far less time
and effort than developing an enterprise-wide data warehouse. Also, the risk of
failure is lower. This method is inherently incremental, and it allows the
project team to learn and grow.
Advantages of bottom-up design

Documents can be generated quickly.

The data warehouse can be extended to accommodate new business units.

It is just developing new data marts and then integrating with other data marts.

Disadvantages of bottom-up design

The positions of the data warehouse and the data marts are reversed in the bottom-
up design approach.

Differentiate between Top-Down Design Approach and Bottom-Up Design Approach

o Top-down breaks the vast problem into smaller subproblems; bottom-up solves the
essential low-level problems and integrates them into a higher one.
o Top-down is inherently architected, not a union of several data marts; bottom-up is
inherently incremental and can schedule essential data marts first.
o Top-down keeps a single, central store of information about the content; bottom-up
stores information departmentally.
o Top-down uses centralized rules and control; bottom-up uses departmental rules and
control.
o Top-down may include redundant information; with bottom-up, redundancy can be
removed.
o Top-down may see quick results if implemented with iterations; bottom-up has less
risk of failure, a favorable return on investment, and proof of techniques.

Data Warehouse Implementation


The various steps in implementing a data warehouse are as follows:

1. Requirements analysis and capacity planning: The first process in data
warehousing involves defining enterprise needs, defining architectures, carrying out
capacity planning, and selecting the hardware and software tools. This step involves
consulting senior management as well as the different stakeholders.
2. Hardware integration: Once the hardware and software have been selected,
they need to be put together by integrating the servers, the storage systems, and the user
software tools.

3. Modeling: Modeling is a significant stage that involves designing the warehouse
schema and views. This may involve using a modeling tool if the data warehouse
is sophisticated.

4. Physical modeling: For the data warehouse to perform efficiently, physical
modeling is needed. This involves designing the physical data warehouse
organization, data placement, data partitioning, deciding on access methods, and
indexing.

5. Sources: The information for the data warehouse is likely to come from several
data sources. This step involves identifying and connecting the sources using
gateways, ODBC drivers, or other wrappers.

6. ETL: The data from the source systems will need to go through an ETL phase.
The process of designing and implementing the ETL phase may involve selecting
suitable ETL tool vendors, purchasing and implementing the tools, and possibly
customizing the tools to suit the needs of the enterprise.

7. Populate the data warehouse: Once the ETL tools have been agreed upon,
testing the tools will be needed, perhaps using a staging area. Once everything is
working adequately, the ETL tools may be used to populate the warehouse given
the schema and view definitions.

8. User applications: For the data warehouse to be helpful, there must be end-
user applications. This step involves designing and implementing the applications
required by the end-users.

9. Roll-out the warehouse and applications: Once the data warehouse has
been populated and the end-client applications tested, the warehouse system and
the applications may be rolled out for the user community to use.

Implementation Guidelines
1. Build incrementally: Data warehouses must be built incrementally. Generally,
it is recommended that a data mart be created with one particular project in
mind, and once it is implemented, several other sections of the enterprise may also
want to implement similar systems. An enterprise data warehouse can then be
implemented in an iterative manner, allowing all data marts to extract information
from the data warehouse.

2. Need a champion: A data warehouse project must have a champion who is
willing to carry out considerable research into the expected costs and benefits of the
project. Data warehousing projects require input from many units in an enterprise
and therefore need to be driven by someone who is capable of interacting with
people across the enterprise and can actively persuade colleagues.

3. Senior management support: A data warehouse project must be fully
supported by senior management. Given the resource-intensive nature of such
projects and the time they can take to implement, a warehouse project calls for a
sustained commitment from senior management.

4. Ensure quality: Only data that has been cleansed and is of a quality accepted
by the organization should be loaded into the data warehouse.

5. Corporate strategy: A data warehouse project must fit with the corporate
strategy and business goals. The purpose of the project must be defined before
the beginning of the project.

6. Business plan: The financial costs (hardware, software, and peopleware),
expected benefits, and a project plan for a data warehouse project must be
clearly outlined and understood by all stakeholders. Without such understanding,
rumors about expenditure and benefits can become the only sources of information,
undermining the project.

7. Training: A data warehouse project must not overlook training
requirements. For a data warehouse project to be successful, the users must
be trained to use the warehouse and to understand its capabilities.

8. Adaptability: The project should build in flexibility so that changes may be
made to the data warehouse if and when required. Like any system, a data
warehouse will need to change as the needs of the enterprise change.

9. Joint management: The project must be handled by both IT and business
professionals in the enterprise. To ensure proper communication with
stakeholders and that the project is targeted at assisting the enterprise's
business, business professionals must be involved in the project along with
technical professionals.

What is Data Mart?


A Data Mart is a subset of an organizational data store, generally oriented to a
specific purpose or primary data subject, which may be distributed to support
business needs. Data marts are analytical data stores designed to focus on
particular business functions for a specific community within an organization. Data
marts are derived from subsets of data in a data warehouse, though in the bottom-
up data warehouse design methodology, the data warehouse is created from the
union of organizational data marts.

The fundamental use of a data mart is for Business Intelligence
(BI) applications. BI is used to gather, store, access, and analyze data. Data marts can be
used by smaller businesses to utilize the data they have accumulated, since a data mart is less
expensive than implementing a full data warehouse.
Reasons for creating a data mart
o Provides a collective view of data for a group of users
o Easy access to frequently needed data
o Ease of creation
o Improves end-user response time
o Lower cost than implementing a complete data warehouse
o Potential clients are more clearly defined than in a comprehensive data
warehouse
o It contains only essential business data and is less cluttered.

Types of Data Marts


There are mainly two approaches to designing data marts. These approaches are

o Dependent Data Marts


o Independent Data Marts

Dependent Data Marts


A dependent data mart is a logical subset or a physical subset of a larger data
warehouse. In this technique, the data marts are treated as subsets
of a data warehouse. First a data warehouse is created, from
which further data marts can be created. These data marts are dependent on
the data warehouse and extract the essential data from it. Since
the data warehouse creates the data marts, there is no need for data mart
integration. This is also known as a top-down approach.

Independent Data Marts


The second approach is Independent Data Marts (IDM). Here, independent
data marts are created first, and then a data warehouse is designed using these
multiple independent data marts. In this approach, as all the data marts are
designed independently, the integration of data marts is required. It is
also termed a bottom-up approach, as the data marts are integrated to
develop a data warehouse.

Other than these two categories, one more type exists that is called "Hybrid Data
Marts."
Hybrid Data Marts
It allows us to combine input from sources other than a data warehouse. This can
be helpful in many situations, especially when ad hoc integrations are needed, such
as after a new group or product is added to the organization.

Steps in Implementing a Data Mart


The significant steps in implementing a data mart are to design the schema,
construct the physical storage, populate the data mart with data from source
systems, access it to make informed decisions, and manage it over time. So, the
steps are:

Designing
The design step is the first in the data mart process. This phase covers all of the
functions from initiating the request for a data mart through gathering data about
the requirements and developing the logical and physical design of the data mart.

It involves the following tasks:

1. Gathering the business and technical requirements


2. Identifying data sources
3. Selecting the appropriate subset of data
4. Designing the logical and physical architecture of the data mart.

Constructing
This step contains creating the physical database and logical structures associated
with the data mart to provide fast and efficient access to the data.

It involves the following tasks:

1. Creating the physical database and logical structures such as tablespaces


associated with the data mart.
2. Creating the schema objects such as tables and indexes described in the
design step.
3. Determining how best to set up the tables and access structures.
Populating
This step includes all of the tasks related to the getting data from the source,
cleaning it up, modifying it to the right format and level of detail, and moving it into
the data mart.

It involves the following tasks:

1. Mapping data sources to target data structures


2. Extracting data
3. Cleansing and transforming the information.
4. Loading data into the data mart
5. Creating and storing metadata

Accessing
This step involves putting the data to use: querying the data, analyzing it, creating
reports, charts and graphs and publishing them.

It involves the following tasks:

1. Setting up an intermediate layer (meta layer) for the front-end tool to use. This
layer translates database operations and object names into business
terms so that the end-users can interact with the data mart using words
which relate to the business functions.
2. Setting up and managing database structures, like summarized tables, which
help queries submitted through the front-end tools execute rapidly and efficiently.

Managing
This step involves managing the data mart over its lifetime. In this step,
management functions are performed such as:

1. Providing secure access to the data.


2. Managing the growth of the data.
3. Optimizing the system for better performance.
4. Ensuring the availability of data even with system failures.
What is Meta Data?
Metadata is data about the data or documentation about the information which is
required by the users. In data warehousing, metadata is one of the essential
aspects.

Metadata includes the following:

1. The location and descriptions of warehouse systems and components.


2. Names, definitions, structures, and content of the data warehouse and end-user
views.
3. Identification of authoritative data sources.
4. Integration and transformation rules used to populate the data warehouse.
5. Integration and transformation rules used to deliver information to end-user
analytical tools.
6. Subscription information for information delivery to analysis subscribers.
7. Metrics used to analyze warehouse usage and performance.
8. Security authorizations, access control lists, etc.

Metadata is used for building, maintaining, managing, and using the data
warehouse. Metadata helps users understand the content and find
data.

Several examples of metadata are:


1. A library catalog may be considered metadata. The catalog metadata
consists of several predefined components representing specific attributes of
a resource, and each item can have one or more values. These components
could be the name of the author, the name of the document, the publisher's
name, the publication date, and the categories to which it belongs.
2. The table of contents and the index in a book may be treated as metadata for the
book.
3. Suppose we say that a data item about a person is 80. This must be defined
by noting that it is the person's weight and the unit is kilograms. Therefore,
(weight, kilograms) is the metadata about the data value 80.
4. Another example of metadata is data about the tables and figures in a
report. A table has a name (e.g., its title),
and the column names of the table may be treated as metadata. The
figures also have titles or names.
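The third example above can be expressed as a tiny, purely illustrative Python structure that keeps the metadata next to the raw value:

# Illustrative only: the (weight, kilograms) metadata attached to the value 80.
record = {
    "value": 80,
    "metadata": {"attribute": "weight", "unit": "kilograms", "subject": "person"},
}
print(record["metadata"]["attribute"], "=", record["value"], record["metadata"]["unit"])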

Why is metadata necessary in a data warehouses?


o First, it acts as the glue that links all parts of the data warehouse.
o Next, it provides information about the contents and structures to the
developers.
o Finally, it opens the door to the end-users and makes the contents
recognizable in their own terms.

Metadata is like a nerve center. Various processes during the building and
administering of the data warehouse generate parts of the data warehouse
metadata, and one process uses parts of the metadata generated by another. In the data
warehouse, metadata assumes a key position and enables communication among
various processes. It acts as the nerve center of the data warehouse.

Figure shows the location of metadata within the data warehouse.

Types of Metadata
Metadata in a data warehouse falls into three major categories:
o Operational Metadata
o Extraction and Transformation Metadata
o End-User Metadata

Operational Metadata
As we know, data for the data warehouse comes from several operational systems
of the enterprise. These source systems contain different data structures. The data
elements selected for the data warehouse have various field lengths and data
types.

In selecting information from the source systems for the data warehouse, we
split records, combine fields from different source files, and deal
with multiple coding schemes and field lengths. When we deliver information to the
end-users, we must be able to tie it back to the source data sets. Operational
metadata contains all of this information about the operational data sources.

Extraction and Transformation Metadata


Extraction and transformation metadata include data about the extraction of data
from the source systems, namely the extraction frequencies, extraction methods,
and business rules for the data extraction. This category of metadata also contains
information about all the data transformations that take place in the data staging
area.

End-User Metadata
The end-user metadata is the navigational map of the data warehouse. It enables
the end-users to find data in the data warehouse. The end-user metadata allows
the end-users to use their own business terminology and look for information in
the ways in which they usually think of the business.

Metadata Interchange Initiative


The metadata interchange initiative was launched to bring industry vendors and
users together to address a variety of difficult problems and issues concerning
exchanging, sharing, and managing metadata. The goal of the metadata interchange
standard is to define an extensible mechanism that will allow vendors to
exchange standard metadata as well as carry along "proprietary" metadata. The
founding members agreed on the following initial goals:

1. Creating a vendor-independent, industry-defined and maintained standard
access mechanism and application programming interface (API) for
metadata.
2. Enabling users to control and manage the access and manipulation of
metadata in their unique environments through the use of interchange
standards-compliant tools.
3. Allowing users to build tools that meet their needs and to adjust
those tools' configurations accordingly.
4. Allowing individual tools to satisfy their metadata requirements freely and
efficiently within the context of an interchange model.
5. Describing a simple, clean implementation infrastructure that will facilitate
compliance and speed up adoption by minimizing the amount of modification.
6. Creating a procedure and process not only for establishing and maintaining
the interchange standard specification but also for updating and extending it
over time.

Metadata Interchange Standard Framework


The interchange standard metadata model assumes that the metadata
itself may be stored in a storage format of any type: ASCII files, relational tables, fixed
or customized formats, etc.

The framework translates an access request into the standard interchange format.

Several approaches have been proposed by the metadata interchange coalition:

o Procedural Approach
o ASCII Batch Approach
o Hybrid Approach

In the procedural approach, communication with the API is built into the tool. It
enables the highest degree of flexibility.

The ASCII batch approach instead relies on a standardized ASCII file format that contains
descriptions of the various metadata items and the standardized access requirements that
make up the interchange standard metadata model.

The hybrid approach follows a data-driven model.

Components of Metadata Interchange Standard Frameworks


1) Standard Metadata Model: It refers to the ASCII file format used to
represent the metadata being exchanged.

2) Standard Access Framework: It describes the minimum number of API
functions.

3) Tool Profile: A file provided by each tool vendor.

4) User Configuration: A file describing the legal interchange paths for
metadata in the user's environment.

Metadata Repository
The metadata itself is housed in and controlled by the metadata repository. Metadata
repository management software can be used to map the source data
to the target database, integrate and transform the data, generate code for data
transformation, and move data to the warehouse.

Benefits of Metadata Repository


1. It provides a set of tools for enterprise-wide metadata management.
2. It eliminates and reduces inconsistency, redundancy, and underutilization.
3. It improves organization control, simplifies management, and accounting of
information assets.
4. It increases coordination, understanding, identification, and utilization of
information assets.
5. It enforces CASE development standards with the ability to share and reuse
metadata.
6. It leverages investment in legacy systems and utilizes existing applications.
7. It provides a relational model for heterogeneous RDBMS to share information.
8. It provides a useful data administration tool to manage corporate information
assets with the data dictionary.
9. It increases reliability, control, and flexibility of the application development
process.

Data Warehousing - Schemas

A schema is a logical description of the entire database. It includes the name
and description of records of all record types, including all associated data
items and aggregates. Much like a database, a data warehouse also requires
a schema to be maintained. A database uses the relational model, while a data
warehouse uses the Star, Snowflake, and Fact Constellation schemas. In this
chapter, we will discuss the schemas used in a data warehouse.
Star Schema
 Each dimension in a star schema is represented with only one
dimension table.
 This dimension table contains the set of attributes.
 The following diagram shows the sales data of a company with respect
to the four dimensions, namely time, item, branch, and location.
 There is a fact table at the center. It contains the keys to each of the four
dimensions.
 The fact table also contains the attributes, namely dollars sold and
units sold.
Note − Each dimension has only one dimension table and each table holds a
set of attributes. For example, the location dimension table contains the
attribute set {location_key, street, city, province_or_state,country}. This
constraint may cause data redundancy. For example, "Vancouver" and
"Victoria" both the cities are in the Canadian province of British Columbia.
The entries for such cities may cause data redundancy along the attributes
province_or_state and country.
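The following is a minimal sketch of the star schema described above, with a central sales fact table joined to the location dimension (the other dimensions are omitted for brevity, and all values are made up):

# Illustrative star-schema query: total dollars and units sold per province/state.
import pandas as pd

location_dim = pd.DataFrame({
    "location_key": [1, 2],
    "city": ["Vancouver", "Victoria"],
    "province_or_state": ["British Columbia", "British Columbia"],
    "country": ["Canada", "Canada"],
})

sales_fact = pd.DataFrame({
    "time_key": [10, 10, 11],
    "item_key": [100, 101, 100],
    "branch_key": [1, 1, 2],
    "location_key": [1, 2, 1],
    "dollars_sold": [500.0, 300.0, 450.0],
    "units_sold": [5, 3, 4],
})

report = (sales_fact
          .merge(location_dim, on="location_key")
          .groupby("province_or_state")[["dollars_sold", "units_sold"]]
          .sum())
print(report)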
Snowflake Schema
 Some dimension tables in the Snowflake schema are normalized.
 The normalization splits up the data into additional tables.
 Unlike the star schema, the dimension tables in a snowflake schema are
normalized. For example, the item dimension table in the star schema is
normalized and split into two dimension tables, namely the item and
supplier tables.
 Now the item dimension table contains the attributes item_key,
item_name, type, brand, and supplier-key.
 The supplier key is linked to the supplier dimension table. The supplier
dimension table contains the attributes supplier_key and supplier_type.
Note − Due to normalization in the Snowflake schema, redundancy is
reduced; therefore, it becomes easier to maintain and saves storage
space.
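A short sketch of the normalization described above, using made-up values, splits the item dimension into separate item and supplier tables linked by supplier_key, so resolving a supplier attribute now takes an extra join:

# Illustrative snowflake normalization of the item dimension.
import pandas as pd

item_dim = pd.DataFrame({
    "item_key": [100, 101],
    "item_name": ["Laptop", "Monitor"],
    "type": ["electronics", "electronics"],
    "brand": ["BrandA", "BrandB"],
    "supplier_key": [7, 8],
})

supplier_dim = pd.DataFrame({
    "supplier_key": [7, 8],
    "supplier_type": ["wholesale", "retail"],
})

# The extra join introduced by snowflaking.
item_full = item_dim.merge(supplier_dim, on="supplier_key")
print(item_full[["item_name", "brand", "supplier_type"]])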
Fact Constellation Schema
 A fact constellation has multiple fact tables. It is also known as galaxy
schema.
 The following diagram shows two fact tables, namely sales and
shipping.
 The sales fact table is the same as that in the star schema.
 The shipping fact table has the five dimensions, namely item_key,
time_key, shipper_key, from_location, to_location.
 The shipping fact table also contains two measures, namely dollars
sold and units sold.
 It is also possible to share dimension tables between fact tables. For
example, time, item, and location dimension tables are shared
between the sales and shipping fact table.

Data Preprocessing
Data preprocessing is an important step in the data mining process. It refers
to cleaning, transforming, and integrating data in order to make it
ready for analysis. The goal of data preprocessing is to improve the quality of
the data and to make it more suitable for the specific data mining task.
Some common steps in data preprocessing include:
Data Cleaning: This involves identifying and correcting errors or
inconsistencies in the data, such as missing values, outliers, and duplicates.
Various techniques can be used for data cleaning, such as imputation,
removal, and transformation.
Data Integration: This involves combining data from multiple sources to
create a unified dataset. Data integration can be challenging as it requires
handling data with different formats, structures, and semantics. Techniques
such as record linkage and data fusion can be used for data integration.
Data Transformation: This involves converting the data into a suitable
format for analysis. Common techniques used in data transformation include
normalization, standardization, and discretization. Normalization is used to
scale the data to a common range, while standardization is used to transform
the data to have zero mean and unit variance. Discretization is used to
convert continuous data into discrete categories.
Data Reduction: This involves reducing the size of the dataset while
preserving the important information. Data reduction can be achieved
through techniques such as feature selection and feature extraction. Feature
selection involves selecting a subset of relevant features from the dataset,
while feature extraction involves transforming the data into a lower-
dimensional space while preserving the important information.
Data Discretization: This involves dividing continuous data into discrete
categories or intervals. Discretization is often used in data mining and
machine learning algorithms that require categorical data. Discretization can
be achieved through techniques such as equal width binning, equal
frequency binning, and clustering.
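As a small illustration of these binning techniques, the following sketch (with made-up ages) uses pandas to produce equal-width, equal-frequency, and labeled bins:

# Illustrative discretization of a continuous attribute.
import pandas as pd

ages = pd.Series([22, 25, 31, 37, 41, 48, 52, 60, 65, 73])

equal_width = pd.cut(ages, bins=3)    # equal-width binning
equal_freq = pd.qcut(ages, q=3)       # equal-frequency binning
labeled = pd.cut(ages, bins=[0, 30, 60, 100],
                 labels=["young", "middle-aged", "senior"])
print(labeled.value_counts())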
Data Normalization: This involves scaling the data to a common range,
such as between 0 and 1 or -1 and 1. Normalization is often used to handle
data with different units and scales. Common normalization techniques
include min-max normalization, z-score normalization, and decimal scaling.
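The three techniques just mentioned can be sketched in a few lines of Python with NumPy; the input values are made up for illustration:

# Illustrative min-max, z-score, and decimal-scaling normalization.
import numpy as np

x = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

min_max = (x - x.min()) / (x.max() - x.min())      # scaled to [0, 1]
z_score = (x - x.mean()) / x.std()                 # zero mean, unit variance
j = int(np.floor(np.log10(np.abs(x).max()))) + 1   # smallest power of 10 above the max
decimal = x / 10 ** j                              # decimal scaling
print(min_max, z_score, decimal, sep="\n")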
Data preprocessing plays a crucial role in ensuring the quality of data and
the accuracy of the analysis results. The specific steps involved in data
preprocessing may vary depending on the nature of the data and the
analysis goals.
By performing these steps, the data mining process becomes more efficient
and the results become more accurate.
Preprocessing in Data Mining:
Data preprocessing is a data mining technique used to transform raw data
into a useful and efficient format.
Steps Involved in Data Preprocessing:
1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part,
data cleaning is done. It involves handling of missing data, noisy data etc.

 (a). Missing Data:


This situation arises when some values are missing in the data. It can be
handled in various ways.
Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite
large and multiple values are missing within a tuple.

2. Fill the Missing values:


There are various ways to do this task. You can choose to fill the
missing values manually, by attribute mean or the most probable
value.

 (b). Noisy Data:


Noisy data is meaningless data that cannot be interpreted by
machines. It can be generated due to faulty data collection, data entry
errors, etc. It can be handled in the following ways:
1. Binning Method:
This method works on sorted data in order to smooth it. The
whole data is divided into segments of equal size, and then
various methods are performed to complete the task. Each
segment is handled separately. One can replace all data in a
segment by its mean, or boundary values can be used to
complete the task.

2. Regression:
Here data can be made smooth by fitting it to a regression
function. The regression used may be linear (having one
independent variable) or multiple (having multiple independent
variables).

3. Clustering:
This approach groups similar data into clusters. Outliers
may go undetected, or they will fall outside the clusters.
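A brief sketch of the cleaning steps above on a made-up table: missing ages are filled with the attribute mean, and a noisy income column is smoothed by bin means using the binning method.

# Illustrative data cleaning: mean imputation and smoothing by bin means.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [23, None, 31, 29, None, 40],
    "income": [48000, 52000, 51000, 250000, 50000, 49000],  # 250000 is noisy
})

# (a) Missing data: fill with the attribute mean (one of the options above).
df["age"] = df["age"].fillna(df["age"].mean())

# (b) Noisy data, binning method: sort, split into equal-size segments,
#     and replace every value in a segment by the segment mean.
values = np.sort(df["income"].to_numpy())
bins = np.array_split(values, 3)
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])

print(df["age"].tolist())
print(smoothed)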
2. Data Transformation:
This step is taken in order to transform the data in appropriate forms suitable
for mining process. This involves following ways:
1. Normalization:
It is done in order to scale the data values into a specified range (-1.0 to
1.0 or 0.0 to 1.0).

2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.

3. Discretization:
This is done to replace the raw values of numeric attribute by interval
levels or conceptual levels.

4. Concept Hierarchy Generation:


Here attributes are converted from a lower level to a higher level in the
hierarchy. For example, the attribute "city" can be converted to
"country".
3. Data Reduction:
Data reduction is a crucial step in the data mining process that involves
reducing the size of the dataset while preserving the important information.
This is done to improve the efficiency of data analysis and to avoid
overfitting of the model. Some common steps involved in data reduction are:
Feature Selection: This involves selecting a subset of relevant features
from the dataset. Feature selection is often performed to remove irrelevant
or redundant features from the dataset. It can be done using various
techniques such as correlation analysis, mutual information, and principal
component analysis (PCA).
Feature Extraction: This involves transforming the data into a lower-
dimensional space while preserving the important information. Feature
extraction is often used when the original features are high-dimensional and
complex. It can be done using techniques such as PCA, linear discriminant
analysis (LDA), and non-negative matrix factorization (NMF).
Sampling: This involves selecting a subset of data points from the dataset.
Sampling is often used to reduce the size of the dataset while preserving the
important information. It can be done using techniques such as random
sampling, stratified sampling, and systematic sampling.
Clustering: This involves grouping similar data points together into clusters.
Clustering is often used to reduce the size of the dataset by replacing similar
data points with a representative centroid. It can be done using techniques
such as k-means, hierarchical clustering, and density-based clustering.
Compression: This involves compressing the dataset while preserving the
important information. Compression is often used to reduce the size of the
dataset for storage and transmission purposes. It can be done using
techniques such as wavelet compression, JPEG compression, and gzip
compression.
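As a short illustration of two of these reduction techniques, the sketch below applies PCA-based feature extraction and simple random sampling to synthetic data; it assumes scikit-learn is available.

# Illustrative data reduction: PCA feature extraction and random sampling.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))          # 1000 rows, 10 features (synthetic)

# Feature extraction: project onto 3 principal components.
X_reduced = PCA(n_components=3).fit_transform(X)

# Sampling: keep a random 10% of the rows.
sample_idx = rng.choice(len(X), size=len(X) // 10, replace=False)
X_sample = X[sample_idx]

print(X_reduced.shape, X_sample.shape)   # (1000, 3) (100, 10)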
