Sess02 Data
Dr Sandipan Karmakar
What is Data?
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
  • A qualitative (categorical) attribute takes values from a discrete set, e.g. gender ∈ {male, female}
• Example: record data, where each row is an object and each column an attribute

  Tid   Refund   Marital Status   Taxable Income   Cheat
   2    No       Married          100K             No
   3    No       Single           70K              No
   6    No       Married          60K              No
   9    No       Married          75K              No
Document Data
• Each document can be represented as a term vector: each term is an attribute and its value is the number of times the term appears in the document.

              team  coach  play  ball  score  game  win  lost  timeout  season
  Document 1    3     0      5     0     2      6     0     2      0       2
  Document 2    0     7      0     2     1      0     0     3      0       0
  Document 3    0     1      0     0     1      2     2     0      3       0
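A hedged sketch of how such a document-term matrix can be built in practice; the three toy documents and the use of scikit-learn's CountVectorizer are assumptions for illustration, not part of the slides.

```python
# Minimal sketch (assumption): building a document-term matrix with scikit-learn.
# The three toy documents below are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the team won the game after the coach called a timeout",
    "the coach lost the ball and then the season",
    "a timeout in the last game of the season",
]

vectorizer = CountVectorizer()           # tokenizes the text and counts terms
dtm = vectorizer.fit_transform(docs)     # sparse documents-by-terms matrix

print(vectorizer.get_feature_names_out())  # the term vocabulary (one attribute per term)
print(dtm.toarray())                       # one row of term counts per document
```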
Transaction Data
• A special type of record data, where
• Each record (transaction) involves a set of items.
• For example, consider a grocery store. The set of products
purchased by a customer during one shopping trip constitute a
transaction, while the individual products that were purchased are
the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
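As a hedged illustration (not from the slides), the transaction table above can be represented either as item sets or as a one-hot "market-basket" matrix; the pandas-based encoding below is one common choice.

```python
# Minimal sketch (assumption): the transactions above as item sets and as a
# one-hot (market-basket) matrix with pandas.
import pandas as pd

transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

items = sorted(set().union(*transactions.values()))
onehot = pd.DataFrame(
    [[item in basket for item in items] for basket in transactions.values()],
    index=transactions.keys(),
    columns=items,
)
print(onehot)   # one Boolean column per item, one row per transaction (TID)
```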
Graph Data
• Examples: Generic graph, a molecule, and webpages
(Figure: a generic graph shown as an activity precedence diagram.)
Ordered Data
• Sequences of transactions: an element of the sequence is a set of items/events
• Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data
• Spatio-Temporal Data
(Figure: average monthly temperature of land and ocean.)
Data Quality
• What are the kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?
• Examples:
• Same person with multiple email addresses
• Data cleaning
• Process of dealing with duplicate data issues
• Data Imputation
  • Ask questions like: “What would be the most likely value for the missing entry, given all the other attributes of that record?”
  • E.g. an American car with a 300-cubic-inch, 150 HP engine is expected to have more cylinders than a 100-cubic-inch, 90 HP Japanese car
• Be cautious when using the mean to fill missing values
  • If the distribution is skewed, i.e. the mean is pulled towards the 100th or the 0th percentile, the median is the better measure to use
  • Why? (The median is robust to extreme values; the mean is not.) See the sketch below.
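A minimal sketch of mean- versus median-based imputation, assuming pandas and a hypothetical, skewed "horsepower" column; it is meant only to illustrate why a skewed mean can be a poor fill value.

```python
# Minimal sketch (assumption): filling missing values with the mean vs. the median.
# The column name and values are hypothetical.
import numpy as np
import pandas as pd

df = pd.DataFrame({"horsepower": [150, 90, 110, np.nan, 400, np.nan, 95]})

print(df["horsepower"].mean(), df["horsepower"].median())   # the mean is pulled up by 400

mean_filled = df["horsepower"].fillna(df["horsepower"].mean())
median_filled = df["horsepower"].fillna(df["horsepower"].median())
# With a skewed column, the median-based fill stays closer to the typical value.
```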
Identifying Misclassifications
• Can you identify the anomaly in this data set?
• This is the frequency distribution of the manufacturers' country of origin for the automobiles in a data set
Graphical Methods of Detecting Outliers
• Z-Score Standardization
• Decimal Scaling
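A hedged sketch of the two transformations listed above, using NumPy on a hypothetical attribute; flagging |z| > 3 as an outlier candidate is a common rule of thumb, not something stated on the slide.

```python
# Minimal sketch (assumption): Z-score standardization and decimal scaling.
import numpy as np

x = np.array([12.0, 15.0, 20.0, 22.0, 300.0])   # hypothetical attribute values

z_scores = (x - x.mean()) / x.std()              # Z-score standardization
j = int(np.ceil(np.log10(np.abs(x).max())))      # decimal scaling: divide by 10^j
decimal_scaled = x / 10**j                       # so the largest magnitude falls below 1

print(z_scores)        # values with |z| > 3 are often treated as outlier candidates
print(decimal_scaled)
```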
Euclidean Distance
• dist(x, y) = √( Σₖ (xₖ − yₖ)² ), where k runs over the attributes

  point   x   y
  p1      0   2
  p2      2   0
  p3      3   1
  p4      5   1

          p1      p2      p3      p4
  p1      0       2.828   3.162   5.099
  p2      2.828   0       1.414   3.162
  p3      3.162   1.414   0       2
  p4      5.099   3.162   2       0

  Distance Matrix
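The distance matrix above can be reproduced with SciPy; this is a hedged sketch, and `scipy.spatial.distance.cdist` is simply one convenient pairwise-distance helper.

```python
# Minimal sketch: the Euclidean distance matrix for p1..p4 above, via SciPy.
import numpy as np
from scipy.spatial.distance import cdist

points = np.array([[0, 2],   # p1
                   [2, 0],   # p2
                   [3, 1],   # p3
                   [5, 1]])  # p4

dist = cdist(points, points, metric="euclidean")
print(np.round(dist, 3))     # matches the distance matrix shown above
```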
Minkowski Distance
• Minkowski Distance is a generalization of the Euclidean distance:
  dist(x, y) = ( Σₖ |xₖ − yₖ|ʳ )^(1/r)
• r = 1: city block (Manhattan, L1) distance
• r = 2: Euclidean (L2) distance
• r → ∞: supremum (L∞) distance, the maximum difference between any attribute of the objects
• Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions.
Minkowski Distance

  point   x   y
  p1      0   2
  p2      2   0
  p3      3   1
  p4      5   1

  L1      p1   p2   p3   p4
  p1      0    4    4    6
  p2      4    0    2    4
  p3      4    2    0    2
  p4      6    4    2    0

  L2      p1      p2      p3      p4
  p1      0       2.828   3.162   5.099
  p2      2.828   0       1.414   3.162
  p3      3.162   1.414   0       2
  p4      5.099   3.162   2       0

  L∞      p1   p2   p3   p4
  p1      0    2    3    5
  p2      2    0    1    3
  p3      3    1    0    2
  p4      5    3    2    0

  Distance Matrices
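A short sketch reproducing the three matrices above with SciPy; the metric names ("cityblock", "euclidean", "chebyshev") are SciPy's labels for r = 1, r = 2 and r → ∞.

```python
# Minimal sketch: L1, L2 and L-infinity distance matrices for the same four points.
import numpy as np
from scipy.spatial.distance import cdist

points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]])   # p1..p4

l1   = cdist(points, points, metric="cityblock")      # r = 1 (Manhattan)
l2   = cdist(points, points, metric="euclidean")      # r = 2
linf = cdist(points, points, metric="chebyshev")      # r -> infinity (supremum)

print(l1, np.round(l2, 3), linf, sep="\n\n")
```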
Mahalanobis Distance
mahalanobis(x, y) = (x − y)ᵀ Σ⁻¹ (x − y), where Σ is the covariance matrix of the data
Mahal(A,B) = 5
Mahal(A,C) = 4
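A minimal sketch of the formula above in NumPy. The data used to estimate the covariance matrix Σ and the points A and B are hypothetical, so the result will not match the Mahal(A,B) = 5 and Mahal(A,C) = 4 values quoted from the slide's figure.

```python
# Minimal sketch (assumption): Mahalanobis distance with a covariance matrix
# estimated from hypothetical data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[0.3, 0.2], [0.2, 0.3]], size=500)

cov_inv = np.linalg.inv(np.cov(X, rowvar=False))     # Sigma^-1 estimated from the data

def mahalanobis(x, y, cov_inv):
    d = x - y
    return float(d @ cov_inv @ d)    # (x - y)^T Sigma^-1 (x - y), as defined above

A, B = np.array([0.5, 0.5]), np.array([0.0, 1.0])
print(mahalanobis(A, B, cov_inv))
```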
Common Properties of a Distance
• Distances, such as the Euclidean distance, have some well
known properties.
1. d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 only if x = y. (Positive definiteness)
2. d(x, y) = d(y, x) for all x and y. (Symmetry)
3. d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z. (Triangle Inequality)
where d(x, y) is the distance (dissimilarity) between points (data
objects), x and y.
• A distance that satisfies these properties is a metric
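As a quick hedged check (not from the slides), the three properties can be verified numerically for the Euclidean distance on the four example points used earlier.

```python
# Minimal sketch: numerically checking the three metric properties for the
# Euclidean distance on the example points p1..p4.
import itertools
import numpy as np

pts = np.array([[0, 2], [2, 0], [3, 1], [5, 1]], dtype=float)
d = lambda x, y: np.linalg.norm(x - y)

for x, y in itertools.product(pts, repeat=2):
    assert d(x, y) >= 0 and (d(x, y) == 0) == np.allclose(x, y)   # positive definiteness
    assert np.isclose(d(x, y), d(y, x))                           # symmetry

for x, y, z in itertools.product(pts, repeat=3):
    assert d(x, z) <= d(x, y) + d(y, z) + 1e-12                   # triangle inequality

print("All three metric properties hold on this sample.")
```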
Common Properties of a Similarity
• Similarities also have some well-known properties:
  1. s(x, y) = 1 (or maximum similarity) only if x = y.
  2. s(x, y) = s(y, x) for all x and y. (Symmetry)
(Figure: scatter plots showing similarity (correlation) values from −1 to 1.)
Drawback of Correlation
• If the correlation is 0, then there is no linear relationship between the two variables
• But it says nothing about a possible nonlinear relationship (see the sketch below)
• Example:
  • x = (−3, −2, −1, 0, 1, 2, 3)
  • y = (9, 4, 1, 0, 1, 4, 9), i.e. yᵢ = xᵢ²
  • mean(x) = 0, mean(y) = 4
  • std(x) = 2.16, std(y) = 3.74
  • corr(x, y) = 0, even though y is completely determined by x
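The same example in NumPy, as a minimal sketch: the correlation coefficient comes out to zero even though y is a deterministic (nonlinear) function of x.

```python
# Minimal sketch: zero linear correlation despite a perfect nonlinear relationship.
import numpy as np

x = np.array([-3, -2, -1, 0, 1, 2, 3], dtype=float)
y = x ** 2

print(x.mean(), y.mean())             # 0.0 and 4.0
print(x.std(ddof=1), y.std(ddof=1))   # ~2.16 and ~3.74 (sample standard deviations)
print(np.corrcoef(x, y)[0, 1])        # 0.0: no linear relationship is detected
```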
Mutual Information
• Mutual information measures how much information one variable provides about another:
  I(X, Y) = H(X) + H(Y) − H(X, Y)
  where H(X) = −Σᵢ pᵢ log₂ pᵢ is the entropy of X and H(X, Y) = −Σᵢ Σⱼ pᵢⱼ log₂ pᵢⱼ is the joint entropy, with pᵢⱼ the probability that the ith value of X and the jth value of Y occur together
• For discrete variables, this is easy to compute
• High mutual information indicates a large reduction in uncertainty; low mutual information indicates a small reduction; and zero mutual information between two random variables means the variables are independent
• The maximum mutual information for discrete variables is log₂(min(nX, nY)), where nX (nY) is the number of values of X (Y)
Mutual Information Example

  Student Status   Count   p      −p log₂ p
  Undergrad        45      0.45   0.5184
  Grad             55      0.55   0.4744
  Total            100     1.00   0.9928

  Student Status   Grade   Count   p      −p log₂ p
  Undergrad        A       5       0.05   0.2161
  Undergrad        B       30      0.30   0.5211
  …                …       …       …      …

Mutual information of Student Status and Grade = 0.9928 + 1.4406 − 2.2710 = 0.1624
(0.9928 is the entropy of Student Status, 1.4406 the entropy of Grade, and 2.2710 the joint entropy.)
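A hedged sketch of the computation I(X, Y) = H(X) + H(Y) − H(X, Y) from a joint count table. The joint counts below are illustrative assumptions chosen to be consistent with the Student Status marginals shown above (45 undergraduates, 55 graduates); the full table from the slide is not reproduced here.

```python
# Minimal sketch (assumption): mutual information from a joint count table.
# The joint counts are illustrative, matching only the Student Status marginals above.
import numpy as np

joint_counts = np.array([[5, 30, 10],    # Undergrad x grades {A, B, C}
                         [30, 20, 5]])   # Grad      x grades {A, B, C}

p_xy = joint_counts / joint_counts.sum()
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

mi = entropy(p_x) + entropy(p_y) - entropy(p_xy.ravel())   # H(X) + H(Y) - H(X, Y)
print(entropy(p_x), entropy(p_y), entropy(p_xy.ravel()), mi)
```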
Maximal Information Coefficient
• Applies mutual information to two continuous variables
• Consider the possible binning of the variables into discrete
categories
• nX × nY ≤ N^0.6, where
•nX is the number of values of X
•nY is the number of values of Y
•N is the number of samples (observations, data objects)
• Compute the mutual information
• Normalized by log2(min( nX, nY ))
• Take the highest value
• Reshef, David N., Yakir A. Reshef, Hilary K. Finucane, Sharon R. Grossman, Gilean McVean, Peter J. Turnbaugh, Eric S. Lander, Michael Mitzenmacher, and Pardis C. Sabeti. "Detecting novel associations in large data sets." Science 334, no. 6062 (2011): 1518–1524.
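A simplified, hedged sketch of the MIC idea: it uses only equal-frequency bins (the published method also optimizes bin boundaries), scans grid sizes with nX × nY ≤ N^0.6, and keeps the largest normalized mutual information. The function name `mic_sketch` is an invention for this illustration.

```python
# Simplified sketch (assumption): MIC-style search over equal-frequency binnings.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def mic_sketch(x, y):
    n, limit, best = len(x), len(x) ** 0.6, 0.0
    for nx in range(2, int(limit) + 1):
        for ny in range(2, int(limit // nx) + 1):      # keeps nx * ny <= N**0.6
            # bin both variables with equal-frequency (quantile) cut points
            bx = np.searchsorted(np.quantile(x, np.linspace(0, 1, nx + 1)[1:-1]), x)
            by = np.searchsorted(np.quantile(y, np.linspace(0, 1, ny + 1)[1:-1]), y)
            joint = np.zeros((nx, ny))
            np.add.at(joint, (bx, by), 1)
            p = joint / n
            mi = entropy(p.sum(1)) + entropy(p.sum(0)) - entropy(p.ravel())
            best = max(best, mi / np.log2(min(nx, ny)))   # normalize, keep the highest
    return best

x = np.random.default_rng(1).uniform(-3, 3, 200)
print(mic_sketch(x, x ** 2))   # high value: a strong, though nonlinear, association
```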
General Approach for Combining
Similarities
• Sometimes attributes are of many different types, but an overall
similarity is needed.
1: For the kth attribute, compute a similarity, sₖ(x, y), in the range [0, 1].
2: Define an indicator variable, δₖ, for the kth attribute as follows:
   δₖ = 0 if the kth attribute is an asymmetric attribute and both objects have a value of 0, or if one of the objects has a missing value for the kth attribute
   δₖ = 1 otherwise
3: Compute the overall similarity as
   similarity(x, y) = Σₖ δₖ sₖ(x, y) / Σₖ δₖ
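A hedged sketch of these three steps for a small mixed-type record. The per-attribute similarity choices (exact match for nominal attributes, a simple rescaled difference for numeric ones) are illustrative assumptions, not prescribed by the slide.

```python
# Minimal sketch (assumption): combining per-attribute similarities with the
# indicator variables delta_k described above.
import numpy as np

def combined_similarity(x, y, asymmetric):
    """x, y: equal-length records; asymmetric[k] is True for asymmetric binary attributes."""
    sims, deltas = [], []
    for k, (xv, yv) in enumerate(zip(x, y)):
        # Step 2: delta_k = 0 for a missing value, or an asymmetric attribute
        # where both objects are 0; delta_k = 1 otherwise.
        if xv is None or yv is None or (asymmetric[k] and xv == 0 and yv == 0):
            deltas.append(0)
            sims.append(0.0)
            continue
        deltas.append(1)
        # Step 1: a per-attribute similarity s_k(x, y) in [0, 1] (toy choices).
        if isinstance(xv, str):
            sims.append(1.0 if xv == yv else 0.0)                          # nominal
        else:
            sims.append(1.0 - abs(xv - yv) / (abs(xv) + abs(yv) + 1e-12))  # numeric
    sims, deltas = np.array(sims), np.array(deltas)
    # Step 3: similarity(x, y) = sum_k delta_k * s_k(x, y) / sum_k delta_k
    return float((deltas * sims).sum() / deltas.sum())

print(combined_similarity(["Married", 1, 70.0],
                          ["Single",  0, 60.0],
                          asymmetric=[False, True, False]))
```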
Using Weights to Combine Similarities
• We may not want to treat all attributes the same.
• Use non-negative weights wₖ (summing to 1) so that more important attributes contribute more to the overall similarity.
Aggregation
• Combining two or more attributes (or objects) into a single attribute (or object)
• Purpose
• Data reduction
• Reduce the number of attributes or objects
• Change of scale
• Cities aggregated into regions, states, countries, etc.
• Days aggregated into weeks, months, or years
• More “stable” data
• Aggregated data tends to have less variability
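A minimal sketch of aggregation as a change of scale, assuming pandas and synthetic daily values; it also illustrates the "more stable data" point by comparing relative variability at the two scales.

```python
# Minimal sketch (assumption): aggregating synthetic daily values to monthly
# and yearly totals with pandas.
import numpy as np
import pandas as pd

days = pd.date_range("1982-01-01", "1993-12-31", freq="D")
daily = pd.Series(np.random.default_rng(0).gamma(2.0, 1.5, len(days)), index=days)

monthly = daily.resample("MS").sum()   # change of scale: days -> months
yearly = daily.resample("YS").sum()    # days -> years

# Aggregated data tends to have less (relative) variability:
print(monthly.std() / monthly.mean(), yearly.std() / yearly.mean())
```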
Example: Precipitation in Australia
• This example is based on precipitation in Australia (Rain or
Snow) from the period 1982 to 1993.
The next slide shows
• A histogram for the standard deviation of average monthly
precipitation for 3,030 0.5◦ by 0.5◦ grid cells in Australia, and
• A histogram for the standard deviation of the average yearly
precipitation for the same locations.
• The average yearly precipitation has less variability than the
average monthly precipitation.
• All precipitation measurements (and their standard
deviations) are in centimeters.
Example: Precipitation in Australia …
(Figure: variation of precipitation in Australia, shown as histograms of the standard deviations of the average monthly and the average yearly precipitation.)
Sampling
• Using a sample will work almost as well as using the entire data set, if the sample is representative
Dimensionality Reduction
• Techniques
  • Principal Components Analysis (PCA)
  • Singular Value Decomposition (SVD)
  • Others: supervised and non-linear techniques
Dimensionality Reduction: PCA
• Goal is to find a projection that captures the largest amount
of variation in data
(Figure: two-dimensional data in the x1-x2 plane and the direction that captures the largest variation.)
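A minimal sketch of PCA on synthetic correlated data, assuming scikit-learn's `PCA`; the covariance values are made up for illustration.

```python
# Minimal sketch (assumption): projecting correlated 2-D data onto its first
# principal component with scikit-learn.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[3.0, 1.8], [1.8, 1.5]], size=300)  # correlated x1, x2

pca = PCA(n_components=1)             # keep the direction with the largest variance
X_proj = pca.fit_transform(X)

print(pca.explained_variance_ratio_)  # fraction of the total variation captured
print(pca.components_)                # the principal axis (projection direction)
```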
Feature Subset Selection
• Another way to reduce dimensionality of data
• Redundant features
• Duplicate much or all of the information contained in one or more
other attributes
• Example: purchase price of a product and the amount of sales tax
paid
• Irrelevant features
• Contain no information that is useful for the data mining task at
hand
• Example: students' ID is often irrelevant to the task of predicting
students' GPA
• Many techniques developed, especially for classification
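A hedged sketch of two simple filters matching the examples above: dropping an ID-style column as irrelevant and dropping a near-duplicate (highly correlated) column as redundant. The column names and the 0.95 correlation threshold are assumptions.

```python
# Minimal sketch (assumption): removing an irrelevant ID column and a redundant
# (highly correlated) column with pandas.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
price = rng.uniform(10, 100, 200)
df = pd.DataFrame({
    "student_id": np.arange(200),   # irrelevant for predicting a target such as GPA
    "price": price,
    "sales_tax": 0.08 * price,      # redundant: duplicates the price information
    "rating": rng.normal(3.5, 0.5, 200),
})

corr = df.drop(columns="student_id").corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # upper triangle only
redundant = [c for c in upper.columns if (upper[c] > 0.95).any()]

selected = df.drop(columns=["student_id"] + redundant)
print(redundant, list(selected.columns))
```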
Feature Creation
• Create new attributes that can capture the important
information in a data set much more efficiently than the
original attributes
(Figure: frequency histogram of petal length values, with counts on the vertical axis.)
Discretization Without Using Class Labels
• Data consists of four groups of points and two outliers. The data is one-dimensional, but a random y component is added to reduce overlap.
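A hedged sketch of three common unsupervised (class-label-free) discretization schemes applied to data like that described above; the group locations and scikit-learn's `KBinsDiscretizer` strategies are assumptions for illustration.

```python
# Minimal sketch (assumption): equal-width, equal-frequency and k-means binning
# of a one-dimensional attribute with four groups of points plus two outliers.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(loc, 0.3, 50) for loc in (2, 5, 8, 11)] + [[0.0, 14.0]])
x = x.reshape(-1, 1)

for strategy in ("uniform", "quantile", "kmeans"):   # equal width / equal frequency / k-means
    disc = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy=strategy)
    labels = disc.fit_transform(x).ravel().astype(int)
    print(strategy, np.bincount(labels))   # how many points fall into each bin
```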