A Data Warehouse (DW) is a relational database that is designed for query and
analysis rather than transaction processing. It contains historical data derived from
transaction data from single or multiple sources.
A Data Warehouse provides integrated, enterprise-wide, historical data and focuses
on providing support for decision-makers for data modeling and analysis.
A Data Warehouse is a group of data specific to the entire organization, not only to
a particular group of users.
It is not used for daily operations and transaction processing but used for making
decisions.
A Data Warehouse can be viewed as a data system with the following attributes:
o It is a database designed for investigative tasks, using data from various
applications.
o It supports a relatively small number of clients with relatively long
interactions.
o It includes current and historical data to provide a historical perspective of
information.
o Its usage is read-intensive.
o It contains a few large tables.
"Data Warehouse is a subject-oriented, integrated, and time-variant store of
information in support of management's decisions."
Characteristics of Data Warehouse
Subject-Oriented
A data warehouse targets the modeling and analysis of data for decision-makers.
Therefore, data warehouses typically provide a concise and straightforward view of a
particular subject, such as customer, product, or sales, instead of the organization's
ongoing global operations. This is done by excluding data that are not useful for the
subject and including all data needed by users to understand the subject.
Integrated
A data warehouse integrates various heterogeneous data sources such as RDBMSs, flat
files, and online transaction records. This requires performing data cleaning and
integration during data warehousing to ensure consistency in naming conventions,
attribute types, and so on among the different data sources.
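As an illustration, here is a minimal sketch of this kind of harmonization, assuming the pandas library is available; the source systems, column names, and values are hypothetical:

import pandas as pd

# Two source systems describe customers with different naming conventions
# and attribute encodings (all names and values here are hypothetical).
crm = pd.DataFrame({"cust_id": ["C1", "C2"], "Gender": ["M", "F"]})
billing = pd.DataFrame({"customer_no": ["C1", "C3"], "sex": ["male", "female"]})

# Harmonize naming conventions and attribute types before loading.
crm = crm.rename(columns={"cust_id": "customer_id", "Gender": "gender"})
billing = billing.rename(columns={"customer_no": "customer_id", "sex": "gender"})
billing["gender"] = billing["gender"].map({"male": "M", "female": "F"})

# A single, consistent customer dimension for the warehouse.
customers = pd.concat([crm, billing], ignore_index=True).drop_duplicates("customer_id")
print(customers)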
Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve
data from 3 months, 6 months, 12 months, or even earlier from a data warehouse.
This contrasts with a transaction system, where often only the most current data is
kept.
Non-Volatile
The data warehouse is a physically separate data store, into which data is transformed
from the source operational RDBMS. Operational updates of data do not occur in
the data warehouse, i.e., update, insert, and delete operations are not performed. It
usually requires only two procedures for data access: the initial loading of data and
read access to data. Therefore, the DW does not require transaction processing,
recovery, or concurrency-control capabilities, which allows for a substantial speedup of
data retrieval. Non-volatile means that, once entered into the warehouse, data
should not change.
Data warehouse applications are designed to support users' ad-hoc data
requirements, an activity now dubbed online analytical processing (OLAP).
These include applications such as forecasting, profiling, summary reporting, and
trend analysis.
Data warehouses and their architectures vary depending upon the specifics of an
organization's situation.
Flat Files
A Flat file system is a system of files in which transactional data is stored, and
every file in the system must have a different name.
Meta Data
A set of data that defines and gives information about other data.
Metadata summarizes necessary information about data, which makes it easier to find
and work with particular instances of data. For example, the author, the date created,
the date modified, and the file size are examples of very basic document
metadata.
The summarized-data area of the data warehouse stores all the predefined lightly and
highly summarized (aggregated) data generated by the warehouse manager.
The goal of the summarized data is to speed up query performance. The
summarized data is updated continuously as new information is loaded into the
warehouse.
End-User Access Tools
Data Warehouse Staging Area
A staging area is a temporary location where data from the source systems is copied.
It simplifies data cleansing and consolidation for operational data coming from
multiple source systems, especially for enterprise data warehouses where all relevant
data of an enterprise is consolidated.
The figure illustrates an example where purchasing, sales, and stocks are
separated. In this example, a financial analyst wants to analyze historical data for
purchases and sales or mine historical information to make predictions about
customer behavior.
Single-Tier Architecture
In this architecture, the only layer physically available is the source layer; the data
warehouse is virtual. This means that the data warehouse is implemented as a
multidimensional view of operational data created by specific middleware, or an
intermediate processing layer.
The vulnerability of this architecture lies in its failure to meet the requirement for
separation between analytical and transactional processing. Analysis queries are
submitted to operational data after the middleware interprets them, so queries
affect the transactional workload.
Two-Tier Architecture
The requirement for separation plays an essential role in defining the two-tier
architecture for a data warehouse system, in which the source layer is kept
physically separate from the data warehouse layer.
Three-Tier Architecture
The three-tier architecture consists of the source layer (containing multiple source
systems), the reconciled layer, and the data warehouse layer (containing both data
warehouses and data marts). The reconciled layer sits between the source data and
the data warehouse.
The main advantage of the reconciled layer is that it creates a standard reference
data model for the whole enterprise. At the same time, it separates the problems of
source data extraction and integration from those of data warehouse population. In
some cases, the reconciled layer is also used directly to better accomplish some
operational tasks, such as producing daily reports that cannot be satisfactorily
prepared using the corporate applications, or generating data flows to feed external
processes periodically so as to benefit from the cleaning and integration.
Business Case Analysis: After the IT strategy has been designed, the next step is
the business case. It is essential to understand the level of investment that can be
justified and to recognize the projected business benefits which should be derived
from using the data warehouse.
Education & Prototyping: Company will experiment with the ideas of data
analysis and educate themselves on the value of the data warehouse. This is
valuable and should be required if this is the company first exposure to the benefits
of the DS record. Prototyping method can progress the growth of education. It is
better than working models. Prototyping requires business requirement, technical
blueprint, and structures.
Building the Vision: This is the phase where the first production deliverable is
produced. This stage will probably create significant infrastructure elements for
extracting and loading information, but limit them to the extraction and loading of
the initial information sources.
History Load: The next step is one where the remainder of the required history is
loaded into the data warehouse. This means that new entities would not be
added to the data warehouse, but additional physical tables would probably be
created to store the increased data volumes.
Ad-Hoc Query: In this step, we configure an ad-hoc query tool to operate against
the data warehouse.
Extending Scope: In this phase, the scope of the DWH is extended to address a new
set of business requirements. This involves loading additional data sources into the
DWH, i.e., introducing new data marts.
Requirement Evolution: This is the last step of the delivery process of a data
warehouse. Requirements are not static; they evolve continuously. As the business
requirements change, they must be reflected in the system.
Concept Hierarchy
A concept hierarchy is a directed acyclic graph of concepts, where each concept is
identified by a unique name.
An arc from concept a to concept b denotes that a is a more general concept than b.
We can tag text with these concepts.
Each text report is tagged with a set of concepts that corresponds to its content.
Tagging a report with a concept implicitly entails tagging it with all the ancestors of
that concept in the hierarchy. It is therefore desirable that a report be tagged with
the lowest (most specific) concept possible: the tag is pushed down the hierarchy
until it cannot be pushed any further.
The outcome of this step is a hierarchy of reports in which, at each node, there is a
set of reports sharing the concept associated with that node.
The hierarchy of reports resulting from the tagging step is useful for many text
mining processes.
It is assumed that the hierarchy of concepts is known a priori. We can even obtain
such a hierarchy of documents without a concept hierarchy by using any
hierarchical clustering algorithm, which produces such a hierarchy.
A concept hierarchy defines a sequence of mappings from a set of specific, low-level
concepts to more general, higher-level concepts.
Concept hierarchies are crucial for the formulation of useful OLAP queries: they
allow the user to summarize the data at various levels.
For example, using a location hierarchy, the user can retrieve data that summarizes
sales for each location, for all the areas in a given state, or even for a given country,
without the necessity of reorganizing the data.
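As an illustration, the following minimal sketch (assuming pandas is available; the location hierarchy and sales figures are invented for the example) shows how a concept hierarchy lets the same data be summarized at several levels:

import pandas as pd

# Hypothetical sales records tagged at the lowest level of the location hierarchy.
sales = pd.DataFrame({
    "city":   ["Chicago", "Chicago", "New York", "Vancouver"],
    "amount": [500, 300, 700, 200],
})

# Concept hierarchy: city -> state/province -> country, as per-level mappings.
city_to_state = {"Chicago": "Illinois", "New York": "New York", "Vancouver": "British Columbia"}
state_to_country = {"Illinois": "USA", "New York": "USA", "British Columbia": "Canada"}

sales["state"] = sales["city"].map(city_to_state)
sales["country"] = sales["state"].map(state_to_country)

# Rolling up: summarize sales at any level without reorganizing the data.
print(sales.groupby("city")["amount"].sum())
print(sales.groupby("state")["amount"].sum())
print(sales.groupby("country")["amount"].sum())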
Data warehouse design takes an approach different from view materialization as
practiced in industry. It treats data warehouses as database systems with particular
needs, such as answering management-related queries. The target of the design
becomes how the records from multiple data sources should be extracted,
transformed, and loaded (ETL) to be organized in a database as the data warehouse.
Two design approaches are commonly distinguished:
1. "top-down" approach
2. "bottom-up" approach
Developing a new data mart from the data warehouse is very easy.
Data marts include the lowest-grain data and, if needed, aggregated data too.
Instead of a normalized database for the data warehouse, a denormalized
dimensional database is adopted to meet the data delivery requirements of data
warehouses. Using this method, for a set of data marts to serve as the enterprise
data warehouse, the data marts should be built with conformed dimensions in mind,
meaning that common objects are represented in the same way in the different data
marts. The conformed dimensions connect the data marts to form a data warehouse,
which is generally called a virtual data warehouse.
The advantage of the "bottom-up" design approach is that it has quick ROI, as
developing a data mart, a data warehouse for a single subject, takes far less time
and effort than developing an enterprise-wide data warehouse. Also, the risk of
failure is even less. This method is inherently incremental. This method allows the
project team to learn and grow.
Advantages of bottom-up design
o It is easy to develop new data marts and then integrate them with the existing ones.
o The locations of the data warehouse and the data marts are reversed in the
bottom-up design approach.
Comparing the two approaches:
o Top-down: breaks the vast problem into smaller subproblems. Bottom-up: solves
the essential low-level problems and integrates them into a higher one.
o Top-down: inherently architected, not a union of several data marts. Bottom-up:
inherently incremental; essential data marts can be scheduled first.
o Top-down: may see quick results if implemented with iterations. Bottom-up: less
risk of failure, favorable return on investment, and proof of techniques.
5. Sources: The data for the data warehouse is likely to come from several
data sources. This step involves identifying and connecting to the sources using
gateways, ODBC drivers, or other wrappers.
6. ETL: The data from the source systems will need to go through an ETL phase.
The process of designing and implementing the ETL phase may involve selecting a
suitable ETL tool vendor and purchasing and implementing the tools, and may
include customizing the tool to suit the needs of the enterprise. (A minimal sketch
of such an ETL step is given after this list.)
7. Populate the data warehouses: Once the ETL tools have been agreed upon,
testing the tools will be needed, perhaps using a staging area. Once everything is
working adequately, the ETL tools may be used in populating the warehouses given
the schema and view definition.
8. User applications: For the data warehouses to be helpful, there must be end-
user applications. This step contains designing and implementing applications
required by the end-users.
9. Roll-out the warehouse and applications: Once the data warehouse has
been populated and the end-client applications tested, the warehouse system and
the applications may be rolled out for the user community to use.
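The following is the minimal ETL sketch referred to in step 6. It assumes pandas is available and uses SQLite as a stand-in for the warehouse target; the file name, table name, and column names are hypothetical:

import sqlite3
import pandas as pd

# Extract: read a hypothetical flat-file export from a source system.
orders = pd.read_csv("orders_export.csv")  # file name is illustrative only

# Transform: clean and conform the data before loading.
orders = orders.drop_duplicates()
orders["order_date"] = pd.to_datetime(orders["order_date"], errors="coerce")
orders = orders.dropna(subset=["order_date", "customer_id"])
orders["amount"] = orders["amount"].astype(float)

# Load: append the conformed records to a warehouse fact table
# (SQLite stands in for the target warehouse here).
with sqlite3.connect("warehouse.db") as conn:
    orders.to_sql("fact_orders", conn, if_exists="append", index=False)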
Implementation Guidelines
1. Build incrementally: Data warehouses must be built incrementally. Generally,
it is recommended that a data mart be created with one particular project in
mind; once it is implemented, several other sections of the enterprise may also
want to implement similar systems. An enterprise data warehouse can then be
implemented in an iterative manner, allowing all data marts to extract information
from the data warehouse.
4. Ensure quality: Only data that has been cleaned and is of a quality accepted by
the organization should be loaded into the data warehouse.
7. Training: Data warehouse projects must not overlook training requirements. For
a data warehouse project to be successful, the users must be trained to use the
warehouse and to understand its capabilities.
Other than these two categories, one more type exists, called "Hybrid Data Marts."
Hybrid Data Marts
A hybrid data mart allows us to combine input from sources other than a data
warehouse. This can be helpful in many situations, especially when ad-hoc
integration is needed, such as after a new group or product is added to the
organization.
Designing
The design step is the first in the data mart process. This phase covers all of the
functions from initiating the request for a data mart through gathering data about
the requirements and developing the logical and physical design of the data mart.
Constructing
This step contains creating the physical database and logical structures associated
with the data mart to provide fast and efficient access to the data.
Accessing
This step involves putting the data to use: querying the data, analyzing it, creating
reports, charts and graphs and publishing them.
1. Set up an intermediate layer (meta layer) for the front-end tool to use. This
layer translates database structures and object names into business terms, so
that end-users can interact with the data mart using words that relate to
business functions.
2. Set up and manage database structures, such as summarized tables, that help
queries submitted through the front-end tools execute rapidly and efficiently.
Managing
This step involves managing the data mart over its lifetime; ongoing management
functions are performed here.
Metadata is used for building, maintaining, managing, and using the data
warehouse. Metadata allows users to understand the content and to find data.
Metadata is like a nerve center. Various processes during the building and
administering of the data warehouse generate parts of the data warehouse
metadata, and other processes use the parts generated by them. In the data
warehouse, metadata assumes a key position and enables communication among
the various processes; it acts as a nerve center of the data warehouse.
Types of Metadata
Metadata in a data warehouse falls into three major categories:
o Operational Metadata
o Extraction and Transformation Metadata
o End-User Metadata
Operational Metadata
As we know, data for the data warehouse comes from the various operational
systems of the enterprise. These source systems use different data structures, and
the data elements selected for the data warehouse have various field lengths and
data types.
In selecting data from the source systems for the data warehouse, we split records,
combine parts of records from different source files, and deal with multiple coding
schemes and field lengths. When we deliver information to the end-users, we must
be able to tie it back to the source data sets. Operational metadata contains all of
this information about the operational data sources.
End-User Metadata
The end-user metadata is the navigational map of the data warehouse. It enables
the end-users to find data in the data warehouse using their own business
terminology and to look for information in the ways in which they usually think of
the business.
o Procedural Approach
o ASCII Batch Approach
o Hybrid Approach
In the procedural approach, the communication is built into the tool through an API.
This enables the highest degree of flexibility.
In the ASCII batch approach, instead of an API, the tools rely on a standardized
ASCII file format that contains the information on the various metadata items and
the standardized access requirements that make up the interchange standard's
metadata model.
The user configuration is a file describing the legal interchange paths for metadata
in the user's environment.
Metadata Repository
The metadata itself is housed in and managed by the metadata repository. Metadata
repository management software can be used to map the source data to the target
database, integrate and transform the data, generate code for data transformation,
and move data to the warehouse.
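As an illustration of the kind of information such a repository holds, here is a minimal sketch of one source-to-target mapping entry; the structure and all names are hypothetical and do not follow any particular product's format:

# One illustrative metadata repository entry describing how a warehouse
# column is derived from a source field (all names are hypothetical).
mapping_entry = {
    "target_table":   "fact_orders",
    "target_column":  "order_amount_usd",
    "source_system":  "billing",
    "source_table":   "ORDERS",
    "source_column":  "AMT",
    "data_type":      "DECIMAL(12,2)",
    "transformation": "convert currency to USD; round to 2 decimals",
    "load_frequency": "daily",
}

# Such entries let the repository drive transformation code generation and
# help end-users trace a warehouse value back to its operational source.
print(mapping_entry["target_column"], "<-", mapping_entry["source_column"])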
Data Preprocessing
Data preprocessing is an important step in the data mining process. It refers
to the cleaning, transforming, and integrating of data in order to make it
ready for analysis. The goal of data preprocessing is to improve the quality of
the data and to make it more suitable for the specific data mining task.
Some common steps in data preprocessing include:
Data Cleaning: This involves identifying and correcting errors or
inconsistencies in the data, such as missing values, outliers, and duplicates.
Various techniques can be used for data cleaning, such as imputation,
removal, and transformation.
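A minimal data-cleaning sketch, assuming pandas and NumPy are available; the dataset, the imputation choices, and the outlier threshold are hypothetical:

import numpy as np
import pandas as pd

# A small, invented dataset with missing values, a duplicate row, and an outlier.
df = pd.DataFrame({
    "age":    [25, np.nan, 31, 31, 230],
    "income": [40000, 52000, np.nan, np.nan, 61000],
})

df = df.drop_duplicates()                                   # remove duplicate rows
df["income"] = df["income"].fillna(df["income"].median())   # impute missing income
df = df[df["age"].between(0, 120) | df["age"].isna()]       # drop an implausible age (outlier)
df["age"] = df["age"].fillna(df["age"].mean())              # impute the remaining gap
print(df)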
Data Integration: This involves combining data from multiple sources to
create a unified dataset. Data integration can be challenging as it requires
handling data with different formats, structures, and semantics. Techniques
such as record linkage and data fusion can be used for data integration.
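A minimal integration sketch using a simple record-linkage key, assuming pandas is available; the sources, column names, and values are hypothetical:

import pandas as pd

# Two sources describe overlapping customers with different keys and formats.
crm = pd.DataFrame({"email": ["a@x.com", "b@y.com"], "name": ["Ann", "Bob"]})
sales = pd.DataFrame({"Email": ["A@X.COM", "c@z.com"], "total": [120.0, 80.0]})

# Simple record linkage on a normalized key, then fusion into one dataset.
crm["key"] = crm["email"].str.lower()
sales["key"] = sales["Email"].str.lower()
unified = crm.merge(sales.drop(columns=["Email"]), on="key", how="outer")
print(unified)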
Data Transformation: This involves converting the data into a suitable
format for analysis. Common techniques used in data transformation include
normalization, standardization, and discretization. Normalization is used to
scale the data to a common range, while standardization is used to transform
the data to have zero mean and unit variance. Discretization is used to
convert continuous data into discrete categories.
Data Reduction: This involves reducing the size of the dataset while
preserving the important information. Data reduction can be achieved
through techniques such as feature selection and feature extraction. Feature
selection involves selecting a subset of relevant features from the dataset,
while feature extraction involves transforming the data into a lower-
dimensional space while preserving the important information.
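A minimal sketch of both techniques, assuming scikit-learn is available and using its bundled iris dataset purely for illustration:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Feature selection: keep the 2 features most related to the class label.
X_selected = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

# Feature extraction: project the data onto 2 principal components (PCA).
X_extracted = PCA(n_components=2).fit_transform(X)

print(X.shape, X_selected.shape, X_extracted.shape)  # (150, 4) (150, 2) (150, 2)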
Data Discretization: This involves dividing continuous data into discrete
categories or intervals. Discretization is often used in data mining and
machine learning algorithms that require categorical data. Discretization can
be achieved through techniques such as equal width binning, equal
frequency binning, and clustering.
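A minimal sketch of equal-width and equal-frequency binning, assuming pandas is available; the values are invented:

import pandas as pd

ages = pd.Series([22, 25, 31, 38, 45, 52, 61, 70])

# Equal-width binning: intervals of equal length.
equal_width = pd.cut(ages, bins=4)

# Equal-frequency binning: each bin holds roughly the same number of values.
equal_freq = pd.qcut(ages, q=4)

print(pd.concat({"equal_width": equal_width, "equal_freq": equal_freq}, axis=1))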
Data Normalization: This involves scaling the data to a common range,
such as between 0 and 1 or -1 and 1. Normalization is often used to handle
data with different units and scales. Common normalization techniques
include min-max normalization, z-score normalization, and decimal scaling.
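A minimal sketch of the three normalization techniques, assuming NumPy is available; the values are invented:

import numpy as np

x = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization: rescale to the range [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: zero mean and unit variance.
z_score = (x - x.mean()) / x.std()

# Decimal scaling: divide by the smallest power of 10 that brings all values below 1.
j = int(np.floor(np.log10(np.abs(x).max()))) + 1
decimal_scaled = x / (10 ** j)

print(min_max, z_score, decimal_scaled, sep="\n")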
Data preprocessing plays a crucial role in ensuring the quality of data and
the accuracy of the analysis results. The specific steps involved in data
preprocessing may vary depending on the nature of the data and the
analysis goals.
By performing these steps, the data mining process becomes more efficient
and the results become more accurate.
Preprocessing in Data Mining:
Data preprocessing is a data mining technique used to transform raw data into a
useful and efficient format.
Steps Involved in Data Preprocessing:
1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this, data cleaning
is done. It involves handling missing data, noisy data, and so on.
2. Regression:
Here data can be made smooth by fitting it to a regression
function. The regression used may be linear (having one
independent variable) or multiple (having multiple independent
variables). A minimal sketch of regression-based smoothing is
given at the end of these steps.
3. Clustering:
This approach groups similar data into clusters. Outliers either
fall outside the clusters or may go undetected.
2. Data Transformation:
This step is taken in order to transform the data into forms appropriate for the
mining process. This involves the following ways:
1. Normalization:
It is done in order to scale the data values in a specified range (-1.0 to
1.0 or 0.0 to 1.0)
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.
3. Discretization:
This is done to replace the raw values of a numeric attribute by interval
levels or conceptual levels.
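The regression-based smoothing mentioned in the steps above, as a minimal sketch assuming NumPy is available; the data is synthetic:

import numpy as np

# Noisy measurements of a roughly linear relationship (values are synthetic).
x = np.arange(10, dtype=float)
y = 3.0 * x + 5.0 + np.random.normal(scale=2.0, size=x.size)

# Fit a linear regression and replace the noisy values with the fitted ones.
slope, intercept = np.polyfit(x, y, deg=1)
y_smoothed = slope * x + intercept

print(np.column_stack([y.round(2), y_smoothed.round(2)]))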