PREPROCESSING
• OLAP
– Types of Data
Objects
variable, field, characteristic, dimension, or feature
• A collection of attributes describes an object
– An object is also known as a record, point, case, sample, entity, or instance
Tid  Refund  Marital Status  Taxable Income  Cheat
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Attribute Values
• Attribute values are numbers or symbols assigned to an
attribute for a particular object
• Nominal Attributes:
– Nominal means “relating to names.”
– Interval-scaled :
– Ratio-scaled attributes:
• Graph
– World Wide Web
– Generic graph
– Social or information networks
• Ordered
– Spatial Data
– Temporal Data
– Sequential Data
July 2, 2019 Compiled By: Kamal Acharya 13
Record Data
• Data that consists of a collection of records, each of which consists of a
fixed set of attributes
Tid  Refund  Marital Status  Taxable Income  Cheat
• If data objects have the same fixed set of numeric attributes, then the
data objects can be thought of as points in a multi-dimensional space,
where each dimension represents a distinct attribute
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Graph Data
• Examples:
– Generic graph
– World-wide web
• Data cleaning
• Data reduction
• Summary
• Data integration
– Integration of multiple databases, data cubes, or files
• Data transformation
– Normalization and aggregation
• Data reduction
– Obtains a reduced representation of the data that is smaller in volume but produces
the same or similar analytical results
• Data discretization
– Part of data reduction but with particular importance, especially for numerical data
– Can cause confusion for the data mining procedure, resulting in unreliable output.
• Clustering
– detect and remove outliers
• Regression
– smooth by fitting the data into regression functions
Binning
• Three step process:
• if A and B are the lowest and highest values of the attribute, the
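- Bin 1: 4, 8, 9, 15 (values as partitioned into the bin)
- Bin 1: 9, 9, 9, 9 (after smoothing by bin means)
- Bin 1: 4, 4, 4, 15 (after smoothing by bin boundaries)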
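The three Bin 1 lines above can be reproduced with a short sketch. It assumes Python; the helper names `smooth_by_means` and `smooth_by_boundaries` are hypothetical, and only the bin values 4, 8, 9, 15 come from the slide.

```python
def smooth_by_means(bin_values):
    """Replace every value in the bin with the bin mean."""
    mean = sum(bin_values) / len(bin_values)
    return [mean] * len(bin_values)


def smooth_by_boundaries(bin_values):
    """Replace every value with the closer of the bin's minimum and maximum."""
    low, high = min(bin_values), max(bin_values)
    # Ties go to the lower boundary (an assumption; the slide does not say).
    return [low if (v - low) <= (high - v) else high for v in bin_values]


bin_1 = [4, 8, 9, 15]  # Bin 1 from the slide
by_means = smooth_by_means(bin_1)
by_boundaries = smooth_by_boundaries(bin_1)
print(by_means)       # [9.0, 9.0, 9.0, 9.0]
print(by_boundaries)  # [4, 4, 4, 15]
```

Smoothing by means pulls every value to 9, while smoothing by boundaries keeps the extremes 4 and 15 and moves 8 and 9 to the nearer one.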
• This can help improve the accuracy and speed of the subsequent
are different.
• For example: how can the data analyst or the computer be sure that
customer_id in one database and cust_number in another refer to the
same attribute?
• Solution: Metadata can be used to help with the transformation of the data
– Attribute construction
– Aggregation
– Normalization
– Discretization
attributes are constructed and added from the given set of attributes
applied to the data. For example, the daily sales data may be
• Data reduction
– Obtains a reduced representation of the data set that is much
smaller in volume yet produces the same (or almost the
same) analytical results.
– Dimensionality reduction
– Histograms
– Clustering
– Sampling
• Divide data into buckets and store average (sum) for each bucket
[Figure: histogram with horizontal-axis ticks from 10000 to 90000 and vertical-axis ticks from 0 to 40]
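The bucket idea above can be sketched as follows. This is a minimal illustration, assuming equal-width buckets; the function name `equal_width_histogram` and the sample values are hypothetical and are not taken from the slide.

```python
def equal_width_histogram(values, num_buckets):
    """Summarize values as (low, high, count, total, average) per equal-width bucket.

    Assumes the values are not all identical, so the bucket width is non-zero.
    """
    low, high = min(values), max(values)
    width = (high - low) / num_buckets
    buckets = [[] for _ in range(num_buckets)]
    for v in values:
        # The maximum value is placed in the last bucket.
        index = min(int((v - low) / width), num_buckets - 1)
        buckets[index].append(v)
    summary = []
    for i, members in enumerate(buckets):
        total = sum(members)
        average = total / len(members) if members else None
        summary.append((low + i * width, low + (i + 1) * width,
                        len(members), total, average))
    return summary


prices = [10, 12, 18, 25, 27, 40]  # hypothetical data
histogram = equal_width_histogram(prices, 3)
for bucket in histogram:
    print(bucket)
```

Only the per-bucket summaries need to be stored, which is where the reduction in volume comes from.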
• Concept hierarchies
– reduce the data by collecting and replacing low level concepts
(such as numeric values for the attribute age) by higher level
concepts (such as young, middle-aged, or senior).
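A concept hierarchy for age can be sketched as a simple mapping. The cut-off ages 35 and 60 and the function name `age_to_concept` are assumptions made for illustration; the slide gives only the concept labels.

```python
def age_to_concept(age):
    """Map a numeric age to a higher-level concept (cut-offs are illustrative)."""
    if age < 35:
        return "young"
    if age < 60:
        return "middle-aged"
    return "senior"


ages = [23, 41, 67, 35]  # hypothetical low-level values
concepts = [age_to_concept(a) for a in ages]
print(concepts)  # ['young', 'middle-aged', 'senior', 'middle-aged']
```

After the replacement the attribute has three distinct values instead of one per age, at the cost of losing the exact numbers.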
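[Figure: data warehouse architecture. Operational DBs and other sources feed Extract, Transform, Load, and Refresh steps (with a Monitor & Integrator and Metadata) into the Data Warehouse and Data Marts; an OLAP Server serves Analysis, Query, Reports, and Data mining tools.]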
information.
• Often OLAP systems are data warehouse front end software tools to make
• Major OLAP applications are trend analysis over a number of time periods,
slicing, dicing, drill-down and roll-up to look at the data at different levels
– Users:
• OLTP systems are designed for office workers, while OLAP
systems are designed for decision makers. Therefore, while an
OLTP system may be accessed by hundreds or even thousands of
users in a large enterprise, an OLAP system is likely to be accessed
only by a selected group of managers, perhaps dozens of users.
Contd..
• Functions:
– OLTP systems are mission-critical (vital to the functioning of an
organization). These systems carry out simple, repetitive operations.
• Nature:
– The nature of queries in an OLTP system is simple
• Design:
– OLTP systems are designed to be application-oriented while OLAP
systems are designed to be subject-oriented.
• Data:
– OLTP systems normally deal only with the current status of
information.
– On the other hand, OLAP systems require historical data over several
years, since trends are often important in decision making.
• Kinds of use:
– OLTP systems are used for read and write operations while OLAP
systems normally do not update the data but refresh the data.
– Analytic
– Shared
– Multidimensional
– Information
– The system should be able to cope with any relevant queries for the
application and the user.
databases.
– A large data cube may have a large number of zeros as well as some
missing values.
– If a distinction is not made between zero values and missing values, the
aggregates are likely to be computed incorrectly.
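The effect of confusing zeros with missing values can be seen in a small sketch. It assumes missing cells are represented by `None`; the function names and the sample cell values are hypothetical.

```python
def average_ignoring_missing(cells):
    """Average over known cells only; missing cells (None) are skipped."""
    known = [c for c in cells if c is not None]
    return sum(known) / len(known) if known else None


def average_missing_as_zero(cells):
    """Average that wrongly treats every missing cell as a real zero."""
    filled = [0 if c is None else c for c in cells]
    return sum(filled) / len(filled)


sales = [100, 0, None, 200]  # one true zero, one missing value (hypothetical)
correct = average_ignoring_missing(sales)
distorted = average_missing_as_zero(sales)
print(correct)    # 100.0
print(distorted)  # 75.0
```

The true zero still counts toward the average; only the missing cell is left out, which is the distinction the system has to preserve.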
• Generic dimensionality:
– An OLAP system should treat each dimension as equivalent in both its
structure and operational capabilities. Additional operational
capabilities may be granted to selected dimensions, but such
additional functions should be grantable to any dimension
• Need to check other similar applicants (age, gender, income, etc…) and
– These dimensions allow the store to keep track of things like monthly
sales of items and the branches and locations at which the items were
sold.
• Facts:
– Facts are numeric measures.
• i.e., quantities by which we want to analyze relationships between
dimensions.
– Examples of facts for a sales data warehouse include dollars sold (sales
amount in dollars), units sold (number of units sold), and amount
budgeted.
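Facts and dimensions can be illustrated with a tiny in-memory fact table. This is only a sketch: the rows, the dimension values, and the helper `total_by` are hypothetical, while the measure names follow the examples above.

```python
from collections import defaultdict

# Each fact row holds dimension values (time, item, location)
# plus numeric measures (dollars_sold, units_sold).
sales_facts = [
    {"time": "Q1", "item": "TV", "location": "Vancouver",
     "dollars_sold": 500.0, "units_sold": 2},
    {"time": "Q1", "item": "phone", "location": "Vancouver",
     "dollars_sold": 300.0, "units_sold": 3},
    {"time": "Q2", "item": "TV", "location": "Toronto",
     "dollars_sold": 250.0, "units_sold": 1},
]


def total_by(facts, dimension, measure):
    """Sum one numeric measure for each value of one dimension."""
    totals = defaultdict(float)
    for row in facts:
        totals[row[dimension]] += row[measure]
    return dict(totals)


dollars_by_item = total_by(sales_facts, "item", "dollars_sold")
units_by_time = total_by(sales_facts, "time", "units_sold")
print(dollars_by_item)  # {'TV': 750.0, 'phone': 300.0}
print(units_by_time)    # {'Q1': 5.0, 'Q2': 1.0}
```

The dimensions decide how rows are grouped, and the facts are the numbers that get summed within each group.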
• E.g.,
• E.g.,
• Fig: A 3-D data cube representation of the data in the table on the previous slide,
according to time, item, and location.
• 4-D cubes: a 4-D cube can be viewed as a series of 3-D cubes, as shown in the figure below:
– Full Materialization
– Partial Materialization
• Access methods: how OLAP data can be indexed (bitmap and join indices)
– MOLAP
– HOLAP
order of seconds.
techniques.
Efficient Computation of Data Cubes
• Cuboids:
– Data at a particular degree of summarization (aggregation) is often
referred to as a cuboid.
– The result would form a lattice of cuboids, each showing the data at a
• where ( ) means that the group-by is empty (i.e., the dimensions are not grouped).
– These group-bys form a lattice of cuboids for the data cube, as shown
in the figure below.
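The set of group-bys can be enumerated with a few lines of Python. This is a sketch that assumes three dimensions named time, item, and location; the function name `cuboids` is hypothetical.

```python
from itertools import combinations


def cuboids(dimensions):
    """Return every group-by over the dimensions, from the empty group-by () to the base cuboid."""
    result = []
    for size in range(len(dimensions) + 1):
        result.extend(combinations(dimensions, size))
    return result


lattice = cuboids(["time", "item", "location"])
for group_by in lattice:
    print(group_by)
print(len(lattice))  # 8, i.e. 2 ** 3
```

With n dimensions and no concept hierarchies there are 2^n cuboids, which is why materializing all of them quickly becomes expensive.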
1. No Materialization
2. Full Materialization
3. Partial Materialization
• No Materialization
• Full Materialization
processing.
• If the attribute has the value v for a given row in the data table, then the bit
bit arithmetic.
The figure below shows a base (data) table containing the dimensions item and city,
and its mapping to bitmap index tables for each of the dimensions.
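A bitmap index over a base table with the dimensions item and city can be sketched as follows. The row values ("H", "C", "P", "V", "T") and the function name `build_bitmap_index` are hypothetical and do not reproduce the figure.

```python
def build_bitmap_index(rows, attribute):
    """Map each distinct value of `attribute` to a bit vector with one bit per row."""
    index = {}
    for position, row in enumerate(rows):
        bits = index.setdefault(row[attribute], [0] * len(rows))
        bits[position] = 1  # the row at this position has this value
    return index


base_table = [  # hypothetical rows
    {"item": "H", "city": "V"},
    {"item": "C", "city": "V"},
    {"item": "P", "city": "T"},
    {"item": "H", "city": "T"},
]
item_index = build_bitmap_index(base_table, "item")
city_index = build_bitmap_index(base_table, "city")

# Rows with item == "H" AND city == "T", found with bit arithmetic only.
matches = [a & b for a, b in zip(item_index["H"], city_index["T"])]
print(item_index["H"])  # [1, 0, 0, 1]
print(city_index["T"])  # [0, 0, 1, 1]
print(matches)          # [0, 0, 0, 1]
```

Each distinct value gets its own bit vector, so selections and joins on low-cardinality attributes reduce to AND/OR operations on bits.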
• Join indexing:
• The join indexing method gained popularity from its use in
relational database query processing.
Here, the “Main Street” value in the location dimension table joins with tuples T57,
T238, and T884 of the sales fact table.
Similarly, the “Sony-TV” value in the item dimension table joins with tuples T57 and
T459 of the sales fact table.
The corresponding join index tables are shown in the figure below.
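The join indices described above can be sketched in Python. Only the links for "Main Street" and "Sony-TV" come from the slide; the filler values "Other Street" and "Other-Item" and the function name `build_join_index` are hypothetical.

```python
from collections import defaultdict

# Sales fact tuples. "Other Street" and "Other-Item" are hypothetical fillers
# for dimension values the slide does not give.
sales_facts = {
    "T57": {"location": "Main Street", "item": "Sony-TV"},
    "T238": {"location": "Main Street", "item": "Other-Item"},
    "T459": {"location": "Other Street", "item": "Sony-TV"},
    "T884": {"location": "Main Street", "item": "Other-Item"},
}


def build_join_index(facts, dimension):
    """Map each dimension value to the IDs of the fact tuples it joins with."""
    index = defaultdict(list)
    for tuple_id, row in facts.items():
        index[row[dimension]].append(tuple_id)
    return dict(index)


location_index = build_join_index(sales_facts, "location")
item_index = build_join_index(sales_facts, "item")
print(location_index["Main Street"])  # ['T57', 'T238', 'T884']
print(item_index["Sony-TV"])          # ['T57', 'T459']

# Fact tuples that join with both "Main Street" and "Sony-TV".
both = sorted(set(location_index["Main Street"]) & set(item_index["Sony-TV"]))
print(both)  # ['T57']
```

Because the joinable tuple IDs are precomputed, a query on a dimension value can go straight to the matching fact tuples without scanning the fact table.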
Efficient Processing of OLAP Queries
• The purpose of materializing cuboids and constructing OLAP
follows:
cuboids.
should be applied.
Types of OLAP Servers
• OLAP servers present business users with multidimensional data from data
warehouses, without concerns regarding how or where the data are stored.
– These are the intermediate servers that stand in between a relational back-
services.
technology.
– The advantage of using a data cube is that it allows fast indexing to
precomputed summarized data.
• Benefiting from the greater scalability of ROLAP and the faster computation of
MOLAP.
– Maintains a separate database for data cubes.
– It may not require space other than that available in the data warehouse.
– Dice
– Roll-up (drill-up)
– Drill-down (roll-down)
– Pivot (rotate)
• by dimension reduction.
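Two of the operations listed above, roll-up by dimension reduction and slicing, can be sketched on a tiny cube. The cube contents, the dimension names, and the functions `roll_up` and `slice_cube` are hypothetical and serve only as an illustration.

```python
from collections import defaultdict

DIMENSIONS = ("time", "item", "location")

# Hypothetical cube: (time, item, location) -> dollars sold.
cube = {
    ("Q1", "TV", "Vancouver"): 500,
    ("Q1", "TV", "Toronto"): 200,
    ("Q2", "TV", "Vancouver"): 300,
    ("Q2", "phone", "Toronto"): 100,
}


def roll_up(cells, drop):
    """Roll up by dimension reduction: sum the measure over the dropped dimension."""
    keep = [i for i, name in enumerate(DIMENSIONS) if name != drop]
    totals = defaultdict(int)
    for key, value in cells.items():
        totals[tuple(key[i] for i in keep)] += value
    return dict(totals)


def slice_cube(cells, dimension, value):
    """Slice: keep only the cells where one dimension equals a fixed value."""
    position = DIMENSIONS.index(dimension)
    return {k: v for k, v in cells.items() if k[position] == value}


by_time_item = roll_up(cube, "location")
q1_only = slice_cube(cube, "time", "Q1")
print(by_time_item)  # {('Q1', 'TV'): 700, ('Q2', 'TV'): 300, ('Q2', 'phone'): 100}
print(q1_only)       # {('Q1', 'TV', 'Vancouver'): 500, ('Q1', 'TV', 'Toronto'): 200}
```

Rolling up removes a dimension and aggregates over it, whereas slicing keeps all dimensions but fixes one of them to a single value.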
– Corporate strategy
– Joint management
• Vision:
– The OLAP team must, in consultation with the users, develop a clear
vision for the OLAP system. This vision, including the business
objectives, should be clearly defined, understood, and shared by the
stakeholders.
• Corporate strategy:
– The OLAP strategy should fit with the enterprise strategy and business
objectives. A good fit will result in the OLAP tools being used more
widely.
• Focus on users:
– The OLAP project should be focused on users. Users should, in
consultation with the technical professionals, decide what tasks will be
done first and what will be done later. Attempts should be made to
provide each user with a tool suitable for that person’s skill level and
information needs. A good graphical user interface should be provided to
non-technical users. The project can only be successful with the full
support of the users.
• Joint Management:
– The OLAP project must be managed by both IT and business
professionals. Many other people should be involved in supplying ideas.
An appropriate committee structure may be necessary to channel these
ideas.
prediction techniques