
NAME: NAMUGUMYA SHAUFA

REG: 21/2/314/DJ/287
COURSE: BIT
COURSE UNIT: DATA WAREHOUSE
LECTURER: JUDE LUKYE

OLAP IN DATA WAREHOUSING

OLAP (Online Analytical Processing) is a key concept in data warehousing. It refers to a set of
techniques used for retrieving, analyzing, and processing data in a multidimensional way.

A data warehouse extracts information from multiple data sources and formats, such as text files,
Excel sheets, and multimedia files.

The extracted data is then cleaned and transformed.

Key points of OLAP in data warehousing

1. Multidimensional data
In data warehousing, data is typically organized into a multidimensional model. This
means data is stored in a way that allows for easy and efficient analysis across multiple
dimensions or attributes; for example, sales data can be analyzed by time, product, location, etc.

2. Operations
Online analytical processing provides various operations for data analysis, including roll-up
(aggregating data from a lower level of granularity to a higher level), drill-down (the opposite of roll-
up), slice and dice (selecting a subset of the data), and pivot (changing the orientation of the
cube).

3. Cubes.
Online analytical processing systems often use data cubes, which are multidimensional
structures that store data in a format optimized for analytical queries. Each cell in the
cube represents a data point at the intersection of different dimensions.
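
A minimal sketch of such a cube (assuming Python with the pandas library and invented sample sales
figures), with product, location, and time as the dimensions and sales as the measure:

import pandas as pd

# Invented sample sales records: each row is one fact.
data = pd.DataFrame({
    "product":  ["TV", "TV", "Radio", "Radio", "TV", "Radio"],
    "location": ["Kampala", "Gulu", "Kampala", "Gulu", "Kampala", "Kampala"],
    "quarter":  ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "sales":    [100, 80, 40, 30, 120, 50],
})

# A cube-like structure: product and location on one axis, quarter on the other.
cube = data.pivot_table(index=["product", "location"],
                        columns="quarter",
                        values="sales",
                        aggfunc="sum")
print(cube)  # each cell sits at the intersection of the dimensions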

ONLINE ANALYTICAL PROCESSING FUNCTIONS IN A DATA WAREHOUSE


It has an intuitive, easy-to-use interface
Online analytical processing supports complex calculations
Online analytical processing provides a multidimensional view of data
Online analytical processing has time intelligence

BASIC ANALYTICAL OPERATIONS OF OLAP


Since OLAP servers are based on a multidimensional view of data, they support the following basic
analytical operations:

Roll-up
Drill-down
Slice and dice
Pivot (rotate)

1) Roll-up

This is also known as consolidation or aggregation. The roll-up operation can be performed in two ways:

1. Reducing dimensions

2. Climbing up a concept hierarchy. A concept hierarchy is a system of grouping things based on their
order or level (for example, city → country → region)
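
A minimal roll-up sketch, assuming Python with pandas and an invented city-to-country location
hierarchy:

import pandas as pd

# Invented data: sales recorded at the city level of a location hierarchy.
sales = pd.DataFrame({
    "country": ["Uganda", "Uganda", "Kenya", "Kenya"],
    "city":    ["Kampala", "Gulu", "Nairobi", "Mombasa"],
    "sales":   [100, 40, 90, 60],
})

# Roll-up: climb the hierarchy from city to country by aggregating.
rolled_up = sales.groupby("country", as_index=False)["sales"].sum()
print(rolled_up)  # Uganda 140, Kenya 150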

2) Drill-down

In drill-down, data is broken down into smaller parts. It is the opposite of the roll-up process. It can be
done by:

Moving down the concept hierarchy

Increasing the number of dimensions
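
A matching drill-down sketch under the same assumptions (pandas, invented monthly figures),
descending the time hierarchy from quarter back to month:

import pandas as pd

# Invented data stored at month granularity; the quarterly view is the rolled-up form.
sales = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "month":   ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "sales":   [30, 40, 30, 50, 20, 45],
})

quarterly = sales.groupby("quarter", as_index=False)["sales"].sum()  # summary view
# Drill-down: move down the concept hierarchy from quarter to month,
# recovering the detailed rows behind each quarterly total.
monthly = sales.groupby(["quarter", "month"], as_index=False)["sales"].sum()
print(monthly)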

3) Slice

One dimension is selected, and a new sub-cube is created. For example:

The time dimension is sliced on a single value

A new, smaller cube is created altogether
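
A minimal slice sketch, assuming pandas and invented cube data; fixing a single value on the time
dimension (quarter = "Q1") yields a sub-cube over the remaining dimensions:

import pandas as pd

# Invented cube data with three dimensions: product, location, quarter.
cube = pd.DataFrame({
    "product":  ["TV", "Radio", "TV", "Radio"],
    "location": ["Kampala", "Kampala", "Gulu", "Gulu"],
    "quarter":  ["Q1", "Q1", "Q2", "Q2"],
    "sales":    [100, 40, 80, 30],
})

# Slice: fix one value of the time dimension to create a new sub-cube.
q1_slice = cube[cube["quarter"] == "Q1"]
print(q1_slice)  # a smaller cube over product and location only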

4) Dice

This operation is similar to a slice. The difference is that in dice you select two or more dimensions,
which results in the sub-cube.

Data engineers use the dice operation to create a smaller sub-cube from an OLAP cube. They determine the
required dimensions and build the smaller cube from the original hypercube.
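
A minimal dice sketch under the same assumptions (pandas, the same invented cube data), selecting
on two dimensions at once:

import pandas as pd

# The same invented cube data as in the slice example.
cube = pd.DataFrame({
    "product":  ["TV", "Radio", "TV", "Radio"],
    "location": ["Kampala", "Kampala", "Gulu", "Gulu"],
    "quarter":  ["Q1", "Q1", "Q2", "Q2"],
    "sales":    [100, 40, 80, 30],
})

# Dice: select on two or more dimensions at once to build a sub-cube.
diced = cube[(cube["product"] == "TV") & (cube["location"] == "Kampala")]
print(diced)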

5) Pivot

This is also known as rotation. In pivot, you rotate the data axes to provide an alternative presentation
of the data. For example, a three-dimensional OLAP cube has the following on the respective axes:

X-axis: product

Y-axis: location
Z-axis: time

Upon a pivot, the OLAP cube has the following:

X-axis: location

Y-axis: time

Z-axis: product
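
A minimal pivot sketch, assuming pandas and invented data; transposing the tabular view rotates
which dimension sits on which axis:

import pandas as pd

# Invented data with two dimensions and one measure.
data = pd.DataFrame({
    "product":  ["TV", "TV", "Radio", "Radio"],
    "location": ["Kampala", "Gulu", "Kampala", "Gulu"],
    "sales":    [100, 80, 40, 30],
})

view1 = data.pivot_table(index="product", columns="location", values="sales")
# Pivot (rotate): swap the axes so location becomes the rows and product the columns.
view2 = view1.T
print(view2)  # the same data in a substitute presentation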

TYPES OF ONLINE ANALYTICAL PROCESSING IN DATA WAREHOUSING

(Figure: OLAP hierarchical structure)

There are three major OLAP models in a data warehouse:

1. Relational Online Analytical Processing (ROLAP)

This is the kind of system where users query data from a relational database or from their own local
tables; thus, the number of potential questions is not limited (a minimal query sketch follows the list
below).

It includes

Implementation of aggregation navigation logic

Optimization of each DBMS

Additional tools and services
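
A minimal ROLAP-style sketch, assuming Python's built-in sqlite3 module and an invented sales table;
the multidimensional summary is computed on the fly with SQL rather than read from a precomputed cube:

import sqlite3

# Invented relational fact table held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, location TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("TV", "Kampala", 100), ("TV", "Gulu", 80),
    ("Radio", "Kampala", 40), ("Radio", "Gulu", 30),
])

# A ROLAP-style aggregation query: any grouping the user asks for can be
# answered, so the number of potential questions is not limited.
for row in conn.execute("SELECT product, SUM(amount) FROM sales GROUP BY product"):
    print(row)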

2. Multidimensional Online Analytical Processing (MOLAP)

MOLAP involves creating a data cube that represents the multidimensional data in the warehouse.

It provides high-speed calculations.


3. Hybrid Online Analytical Processing (HOLAP)

It combines MOLAP and ROLAP to provide the best of both architectures. Precomputed aggregates and
the cube structure are stored in a multidimensional database.

HOLAP allows data engineers to quickly retrieve analytical results from a data cube and extract
detailed information from relational databases.

Benefits of OLAP in data warehouse

Online Analytical Processing in a data warehouse allows users to perform complex, ad-hoc queries for
business intelligence and reporting purposes.

It enables faster query performance compared to traditional databases, making it well-suited for
analytical workloads.

It plays a vital role in helping businesses make informed decisions by providing a flexible and efficient
way to analyze their data from various perspectives.

2. DATA MINING

Data mining refers to the extraction or mining of knowledge from large amounts of data. It might more
appropriately have been named knowledge mining, which emphasizes mining knowledge from large
amounts of data.

Its goal is to extract information from a data set and transform it into an understandable structure for
further use.

Key properties of data mining

Automatic discovery of patterns

Prediction of likely outcomes

Focus on large datasets and databases

Creation of actionable information

The scope of data mining

Data mining derives its name from the similarities between searching for valuable business information
in a large database (for example, finding linked products in gigabytes of store scanner data) and mining a
mountain for a vein of valuable ore. Both processes require sifting through an immense amount of
material, or intelligently probing it to find exactly where the value resides.
Data mining automates the process of finding predictive information in large databases.

Data mining tools sweep through databases and identify previously hidden patterns in one step. An
example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products
that are often purchased together. Other pattern discovery problems include detecting fraudulent credit
card transactions and identifying anomalous data that could represent data-entry keying errors.

Tasks of Data Mining

Data mining involves six classes of tasks that are common:

Clustering

This is the task of discovering groups and structures in the data that are in some way or another
similar, without using known structures in the data.
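
A minimal clustering sketch, assuming Python with NumPy and scikit-learn, on invented two-dimensional
points; k-means discovers two groups without any predefined labels:

import numpy as np
from sklearn.cluster import KMeans

# Invented 2-D points forming two loose groups.
points = np.array([[1, 2], [1, 4], [2, 3],
                   [8, 8], [9, 9], [8, 10]])

# Discover two groups purely from similarity, with no known structure given.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(model.labels_)  # the cluster index assigned to each point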

Regression

This attempts to find a function that models the data with the least error.
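
A minimal regression sketch, assuming NumPy and invented observations; polyfit finds the straight
line that models the data with the least squared error:

import numpy as np

# Invented observations roughly following y = 2x + 1 with noise.
x = np.array([0, 1, 2, 3, 4], dtype=float)
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Fit the degree-1 polynomial (a line) minimizing squared error.
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)  # close to 2 and 1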

Summarization

Providing a more compact representation of the data set, including visualization and report
generation.
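
A minimal summarization sketch, assuming pandas and invented sales amounts; describe() produces a
compact statistical representation of the data set:

import pandas as pd

# Invented sales amounts.
sales = pd.DataFrame({"amount": [100, 80, 40, 30, 120, 50]})

# Summarization: count, mean, standard deviation, min, quartiles, and max.
print(sales.describe())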

Association rule learning

Association rule learning searches for relationships between variables. For example, a supermarket can
determine which products are frequently bought together and use this information for marketing
purposes.

It is often referred to as market basket analysis.
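
A minimal market basket sketch in plain Python with invented transactions, counting how often each
pair of products is bought together:

from collections import Counter
from itertools import combinations

# Invented market baskets: each inner list is one customer transaction.
baskets = [
    ["bread", "milk"],
    ["bread", "milk", "eggs"],
    ["milk", "eggs"],
    ["bread", "milk"],
]

# Count co-occurring product pairs across all transactions.
pair_counts = Counter()
for basket in baskets:
    pair_counts.update(combinations(sorted(basket), 2))

print(pair_counts.most_common(1))  # e.g. [(('bread', 'milk'), 3)]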

Classification

This is the task of generalizing known structure to apply to new data. For example, an e-mail program
might attempt to classify an e-mail as legitimate or as spam.
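
A minimal classification sketch, assuming scikit-learn and invented training e-mails; a naive Bayes
model learns from labeled examples and is then applied to new, unseen data:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented training e-mails with known labels.
emails = ["win money now", "cheap pills win", "meeting at noon", "lunch tomorrow"]
labels = ["spam", "spam", "legitimate", "legitimate"]

vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(emails), labels)

# Apply the learned structure to a new e-mail.
print(clf.predict(vec.transform(["win cheap money"])))  # likely ['spam']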

Anomaly detection

This is the identification of unusual data records that might be interesting, or data errors that require
further investigation.
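
A minimal anomaly detection sketch, assuming NumPy and invented measurements; records far from
the mean are flagged (two standard deviations here, since the extreme value itself inflates the spread):

import numpy as np

# Invented measurements with one suspicious reading.
values = np.array([10.1, 9.8, 10.3, 10.0, 9.9, 42.0])

# Standard scores: how many standard deviations each value is from the mean.
z = (values - values.mean()) / values.std()
print(values[np.abs(z) > 2])  # flags the 42.0 reading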

Architecture of data mining

A data mining system may have the following components:


1. Knowledge base

This is the domain knowledge that is used to guide the search or evaluate the interestingness of
resulting patterns. It can include concept hierarchies, used to organize attribute values into different
levels of abstraction.

2. Data mining engine

This is essential to the data mining system and ideally consists of a set of functional modules for tasks
such as characterization, association and correlation analysis, classification, prediction, cluster analysis,
outlier analysis, and evolution analysis.

3. Pattern evaluation module:

This component typically employs interestingness measures and interacts with the data mining modules
so as to focus the search toward interesting patterns. It may use interestingness thresholds to filter out
discovered patterns. Alternatively, the pattern evaluation module may be integrated with the mining
module, depending on the implementation of the data mining method used. For efficient data mining, it
is highly recommended to push the evaluation of pattern interestingness as deep as possible into the
mining process so as to confine the search to only the interesting patterns.

4. User interface:

This module communicates between users and the data mining system, allowing the user to interact
with the system by specifying a data mining query or task, providing information to help focus the
search, and performing exploratory data mining based on the intermediate data mining results. In
addition, this component allows the user to browse database and data warehouse schemas or data
structures, evaluate mined patterns, and visualize the patterns in different forms.

Data Mining Process:

Data Mining is a process of discovering various models, summaries, and derived values from a given
collection of data. The general experimental procedure adapted to data-mining problems involves the
following steps:

1. State the problem and formulate the hypothesis

Most data-based modeling studies are performed in a particular application domain. Hence, domain-
specific knowledge and experience are usually necessary in order to come up with a meaningful problem
statement. Unfortunately, many application studies tend to focus on the data-mining technique at the
expense of a clear problem statement. In this step, a modeler usually specifies a set of variables for the
unknown dependency and, if possible, a general form of this dependency as an initial hypothesis. There
may be several hypotheses formulated for a single problem at this stage. The first step requires the
combined expertise of an application domain and a data-mining model. In practice, it usually means a
close interaction between the data-mining expert and the application expert. In successful data-mining
applications, this cooperation does not stop in the initial phase; it continues during the entire data-
mining process.

2. Collect the data

This step is concerned with how the data are generated and collected. In general, there are two distinct
possibilities. The first is when the data-generation process is under the control of an expert (modeler):
this approach is known as a designed experiment. The second possibility is when the expert cannot
influence the data- generation process: this is known as the observational approach. An observational
setting, namely, random data generation, is assumed in most data-mining applications. Typically, the
sampling distribution is completely unknown after data are collected, or it is partially and implicitly given
in the data-collection procedure. It is very important, however, to understand how data collection
affects its theoretical distribution, since such a priori knowledge can be very useful for modeling and,
later, for the final interpretation of results. Also, it is important to make sure that the data used for
estimating a model and the data used later for testing and applying a model come from the same,
unknown, sampling distribution. If this is not the case, the estimated model cannot be successfully used
in a final application of the results.

3. Preprocessing the data

In the observational setting, data are usually "collected" from the existing databases, data warehouses,
and data marts. Data preprocessing usually includes at least two common tasks:

1. Outlier detection (and removal) – Outliers are unusual data values that are not consistent with most
observations. Commonly, outliers result from measurement errors, coding and recording errors, and,
sometimes, are natural, abnormal values. Such nonrepresentative samples can seriously affect the model
produced later. There are two strategies for dealing with outliers:

a. Detect and eventually remove outliers as a part of the preprocessing phase, or

b. Develop robust modeling methods that are insensitive to outliers.
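
A minimal sketch of strategy (a), assuming NumPy and invented sensor readings; it uses the median and
the median absolute deviation, which are themselves robust statistics in the spirit of strategy (b):

import numpy as np

# Invented readings with one gross outlier.
values = np.array([10.1, 9.8, 10.3, 10.0, 9.9, 42.0])

# Robust detection: distance from the median, scaled by the median absolute deviation.
med = np.median(values)
mad = np.median(np.abs(values - med))
cleaned = values[np.abs(values - med) <= 3 * mad]
print(cleaned)  # the 42.0 reading is removed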

2. Scaling, encoding, and selecting features – Data preprocessing includes several steps such as variable
scaling and different types of encoding. For example, one feature with the range [0, 1] and the other
with the range [−100, 1000] will not have the same weights in the applied technique; they will also
influence the final data-mining results differently. Therefore, it is recommended to scale them and bring
both features to the same weight for further analysis. Also, application-specific encoding methods
usually achieve dimensionality reduction by providing a smaller number of informative features for
subsequent data modeling.
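
A minimal scaling sketch, assuming NumPy and the two invented feature ranges from the example
above; min-max scaling brings both features onto [0, 1] so they carry comparable weight:

import numpy as np

# Invented features on very different scales.
f1 = np.array([0.1, 0.5, 0.9])          # already in [0, 1]
f2 = np.array([-100.0, 450.0, 1000.0])  # range [-100, 1000]

def min_max(x):
    # Rescale a feature linearly onto [0, 1].
    return (x - x.min()) / (x.max() - x.min())

print(min_max(f1), min_max(f2))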

These two classes of preprocessing tasks are only illustrative examples of a large spectrum of
preprocessing activities in a data-mining process. Data-preprocessing steps should not be considered
completely independent from other data-mining phases. In every iteration of the data-mining process,
all activities, together, could define new and improved data sets for subsequent iterations. Generally, a
good preprocessing method provides an optimal representation for a data-mining technique by
incorporating a priori knowledge in the form of application-specific scaling and encoding.

4. Estimate the model

The selection and implementation of the appropriate data-mining technique is the main task in this
phase. This process is not straightforward; usually, in practice, the implementation is based on several
models, and selecting the best one is an additional task.

5. Interpret the model and draw conclusions

In most cases, data-mining models should help in decision making. Hence, such models need to be
interpretable in order to be useful because humans are not likely to base their decisions on complex
"black-box" models. Note that the goals of accuracy of the model and accuracy of its interpretation are
somewhat contradictory. Usually, simple models are more interpretable, but they are also less accurate.
Modern data-mining methods are expected to yield highly accurate results using high dimensional
models. The problem of interpreting these models, also very important, is considered a separate task,
with specific techniques to validate the results. A user does not want hundreds of pages of numeric
results. He does not understand them; he cannot summarize, interpret, and use them for successful
decision making.

(Figure: the data mining process)

Classification of Data mining Systems:

The data mining system can be classified according to the following criteria:

Database Technology

Statistics

Machine Learning
Information Science

Visualization

Other Disciplines

Some Other Classification Criteria:

Classification according to kind of databases mined

Classification according to kind of knowledge mined

Classification according to kinds of techniques utilized

Classification according to applications adapted

Classification according to kind of databases mined

We can classify a data mining system according to the kind of databases mined. Database systems can
be classified according to different criteria, such as data models or types of data, and the data mining
system can be classified accordingly. For example, if we classify the database according to the data
model, then we may have a relational, transactional, object-relational, or data warehouse mining system.

Classification according to kind of knowledge mined

We can classify a data mining system according to the kind of knowledge mined. This means data
mining systems are classified on the basis of functionalities such as:

Characterization

Discrimination

Association and Correlation Analysis

Classification

Prediction

Clustering

Outlier Analysis

Evolution Analysis

Classification according to kinds of techniques utilized

We can classify a data mining system according to the kind of techniques used. We can describe these
techniques according to the degree of user interaction involved or the methods of analysis employed.

Classification according to applications adapted

We can classify a data mining system according to the applications adapted. These applications are as
follows:
Finance

Telecommunications

DNA

Stock Markets

E-mail

Major Issues in Data Mining

Mining different kinds of knowledge in databases. – The needs of different users are not the same, and
different users may be interested in different kinds of knowledge. Therefore, it is necessary for data
mining to cover a broad range of knowledge discovery tasks.

Interactive mining of knowledge at multiple levels of abstraction. –

The data mining process needs to be interactive because it allows users to focus the search for patterns,
providing and refining data mining requests based on returned results.

Incorporation of background knowledge.

Background knowledge can be used to guide the discovery process and to express the discovered
patterns, not only in concise terms but at multiple levels of abstraction.

Data mining query languages and ad hoc data mining.

A data mining query language that allows the user to describe ad hoc mining tasks should be integrated
with a data warehouse query language and optimized for efficient and flexible data mining.

Presentation and visualization of data mining results.

Once the patterns are discovered, they need to be expressed in high-level languages and visual
representations. These representations should be easily understandable by the users.

Handling noisy or incomplete data.

Data cleaning methods are required that can handle noise and incomplete objects while mining the data
regularities. Without such methods, the accuracy of the discovered patterns will be poor.

Pattern evaluation.

This refers to the interestingness of the discovered patterns. The patterns discovered should be
interesting; a pattern may fail to be interesting because it merely represents common knowledge or
lacks novelty.

Efficiency and scalability of data mining algorithms.

In order to effectively extract information from the huge amounts of data in databases, data mining
algorithms must be efficient and scalable.
Parallel, distributed, and incremental mining algorithms.

Factors such as the huge size of databases, wide distribution of data, and complexity of data mining
methods motivate the development of parallel and distributed data mining algorithms. These algorithms
divide the data into partitions, which are processed in parallel, and then the results from the partitions
are merged. Incremental algorithms update the mined knowledge when the database changes, without
mining the data again from scratch.

Knowledge Discovery in Databases (KDD)

Some people treat data mining as the same as knowledge discovery, while others view data mining as an
essential step in the process of knowledge discovery. Here is the list of steps involved in the knowledge
discovery process:

Data Cleaning - In this step, noise and inconsistent data are removed.

Data Integration - In this step, multiple data sources are combined.

Data Selection - In this step, data relevant to the analysis task are retrieved from the database.

Data Transformation - In this step, data are transformed or consolidated into forms appropriate for
mining by performing summary or aggregation operations.

Data Mining - In this step, intelligent methods are applied in order to extract data patterns.

Pattern Evaluation - In this step, data patterns are evaluated.

Knowledge Presentation - In this step, knowledge is represented to the user.
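
A compact end-to-end sketch of these steps, assuming pandas and invented data from two sources; the
comments map each line to the step it illustrates:

import pandas as pd

# Invented raw data from two separate sources, with a missing value to clean.
source_a = pd.DataFrame({"id": [1, 2, 3], "age": [25, None, 40]})
source_b = pd.DataFrame({"id": [1, 2, 3], "income": [300, 500, 900]})

data = source_a.merge(source_b, on="id")                  # data integration
data = data.dropna()                                      # data cleaning
selected = data[["age", "income"]]                        # data selection
selected = (selected - selected.mean()) / selected.std()  # data transformation
pattern = selected.corr()                                 # data mining
print(pattern)                                            # evaluation and presentation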
