
Data

Dr Sandipan Karmakar
What is Data?
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
  • Examples: eye color of a person, temperature, etc.
  • Attribute is also known as variable, field, characteristic, dimension, or feature
• A collection of attributes describes an object
  • Object is also known as record, point, case, sample, entity, or instance

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Types of Attributes
• Nominal
• Examples: ID numbers, eye color, zip codes
• Ordinal
• Examples: rankings (e.g., taste of potato chips on a scale from 1-
10), grades, height {tall, medium, short}
• Interval
• Examples: calendar dates, temperatures in Celsius or Fahrenheit
• Ratio
• Examples: temperature in Kelvin, length, time, counts
Properties of Attribute Values
• The type of an attribute depends on which of the following properties/operations it possesses:
  • Distinctness: =, ≠
  • Order: <, >
  • Differences are meaningful: +, −
  • Ratios are meaningful: *, /

• Nominal attribute: distinctness
• Ordinal attribute: distinctness & order
• Interval attribute: distinctness, order & meaningful differences
• Ratio attribute: all 4 properties/operations
Difference Between Ratio and Interval
• Is it physically meaningful to say that a temperature of 10° is twice that of 5° on
  • the Celsius scale?
  • the Fahrenheit scale?
  • the Kelvin scale?
• Consider measuring height above average
  • If Bill's height is three inches above average and Bob's height is six inches above average, then would we say that Bob is twice as tall as Bill?
  • Is this situation analogous to that of temperature?
Categorization of Attributes

Attribute Type | Description | Examples | Operations

Categorical (Qualitative)
  Nominal | Nominal attribute values only distinguish. (=, ≠) | zip codes, employee ID numbers, eye color, sex: {male, female} | mode, entropy, contingency correlation, χ² test
  Ordinal | Ordinal attribute values also order objects. (<, >) | hardness of minerals, {good, better, best}, grades, street numbers | median, percentiles, rank correlation, run tests, sign tests

Numeric (Quantitative)
  Interval | For interval attributes, differences between values are meaningful. (+, −) | calendar dates, temperature in Celsius or Fahrenheit | mean, standard deviation, Pearson's correlation, t and F tests
  Ratio | For ratio variables, both differences and ratios are meaningful. (*, /) | temperature in Kelvin, monetary quantities, counts, age, mass, length, current | geometric mean, harmonic mean, percent variation
Categorization of Attributes

Attribute Type | Transformation | Comments

Categorical (Qualitative)
  Nominal | Any permutation of values | If all employee ID numbers were reassigned, would it make any difference?
  Ordinal | An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function | An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Numeric (Quantitative)
  Interval | new_value = a * old_value + b, where a and b are constants | The Fahrenheit and Celsius temperature scales differ in where their zero value is and in the size of a unit (degree).
  Ratio | new_value = a * old_value | Length can be measured in meters or feet.
Discrete v/s Continuous Attributes
• Discrete Attribute
  • Has only a finite or countably infinite set of values
  • Examples: zip codes, counts, or the set of words in a collection of documents
  • Often represented as integer variables
  • Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
  • Has real numbers as attribute values
  • Examples: temperature, height, or weight
  • Practically, real values can only be measured and represented using a finite number of digits
  • Continuous attributes are typically represented as floating-point variables
Asymmetric Attributes
• For asymmetric attributes, only non-zero values (presence, as opposed to absence) are regarded as important
• Consider a data set where each object is a student and each attribute records whether or not a student took a particular course at a university
  • For a specific student, an attribute has a value of 1 if the student took the course associated with that attribute and a value of 0 otherwise
  • Because students take only a small fraction of all available courses, most of the values in such a data set would be 0
  • Therefore, it is more meaningful and more efficient to focus on the non-zero values
  • To illustrate, if students are compared on the basis of the courses they don't take, then most students would seem very similar, at least if the number of courses is large
• Binary attributes where only non-zero values are important are called asymmetric binary attributes
  • This type of attribute is particularly important for association analysis
• It is also possible to have discrete or continuous asymmetric features
  • For instance, if the number of credits associated with each course is recorded, then the resulting data set will consist of asymmetric discrete or continuous attributes
General Characteristics of Data Sets
• Dimensionality - The dimensionality of a data set is the number of attributes that the objects in the data set possess. Data with a small number of dimensions tends to be qualitatively different from moderate or high-dimensional data
  • Curse of Dimensionality
  • Dimensionality Reduction
• Sparsity - For some data sets, such as those with asymmetric features, most attributes of an object have values of 0; in many cases, fewer than 1% of the entries are non-zero. In practical terms, sparsity is an advantage because usually only the non-zero values need to be stored and manipulated
• Resolution - It is frequently possible to obtain data at different levels of resolution, and often the properties of the data are different at different resolutions
  • For instance, the surface of the Earth seems very uneven at a resolution of a few meters, but is relatively smooth at a resolution of tens of kilometers
  • The patterns in the data also depend on the level of resolution
  • If the resolution is too fine, a pattern may not be visible or may be buried in noise; if the resolution is too coarse, the pattern may disappear
  • For example, variations in atmospheric pressure on a scale of hours reflect the movement of storms and other weather systems; on a scale of months, such phenomena are not detectable
Types of Data Sets
• Record Data
  • Data Matrix
  • Transaction Data
  • Document or Term Matrix
• Graph-based Data
  • WWW
  • Molecular Structures
• Ordered Data
  • Spatial Data
  • Sequential Data
• Unstructured Data
Record Data
• Data that consists of a collection of records, each of which consists of a fixed set of attributes

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Data Matrix
• If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
• Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute

Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
Document Data
• Each document becomes a 'term' vector
  • Each term is a component (attribute) of the vector
  • The value of each component is the number of times the corresponding term occurs in the document

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1    3     0     5     0      2     6     0     2      0       2
Document 2    0     7     0     2      1     0     0     3      0       0
Document 3    0     1     0     0      1     2     2     0      3       0
Transaction Data
• A special type of record data, where
• Each record (transaction) involves a set of items.
• For example, consider a grocery store. The set of products
purchased by a customer during one shopping trip constitute a
transaction, while the individual products that were purchased are
the items.

TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Graph Data
• Examples: a generic graph (activity precedence diagram), a molecule (benzene, C6H6), and linked webpages
Ordered Data
• Sequences of transactions: each element of the sequence is a set of items/events
Ordered Data
• Genomic sequence data

GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data
• Spatio-Temporal Data
  • Example: average monthly temperature of land and ocean
Data Quality
• What are the kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?

• Examples of data quality problems:


• Obsolete or redundant fields
• Noise and outlier values
• Missing values
• Duplicate data
• Wrong data
Noise
• For objects, noise is an extraneous object
• For attributes, noise refers to modification of original values
• Examples: distortion of a person’s voice when talking on a poor phone and
“snow” on television screen

Figures: two sine waves, and the same two sine waves with noise added


Outliers
• Outliers are data objects with characteristics that are considerably different from those of most of the other data objects in the data set
• Case 1: Outliers are noise that interferes with data analysis
• Case 2: Outliers are the goal of our analysis
  • Credit card fraud
  • Intrusion detection
Missing Values
• Reasons for missing values
• Information is not collected
(e.g., people decline to give their age and weight)
• Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)

• Handling missing values


• Eliminate data objects or variables
• Estimate missing values
• Example: time series of temperature
• Example: census results
• Ignore the missing value during analysis
Missing Values …
• Missing completely at random (MCAR)
• Missingness of a value is independent of attributes
• Fill in values based on the attribute
• Analysis may be unbiased overall
• Missing at Random (MAR)
• Missingness is related to other variables
• Fill in values based on other values
• Almost always produces a bias in the analysis
• Missing Not at Random (MNAR)
• Missingness is related to unobserved measurements
• Informative or non-ignorable missingness
• Not possible to know the situation from the data
Duplicate Data
• Data set may include data objects that are duplicates, or
almost duplicates of one another
• Major issue when merging data from heterogeneous sources

• Examples:
• Same person with multiple email addresses

• Data cleaning
• Process of dealing with duplicate data issues

• When should duplicate data not be removed?


Need of Data Cleaning
• Can you find any problem in this tiny data set?
  • Source is an American bank
  • ZIP codes are assumed to be of 5 digits
  • Incomes are assumed to be positive
  • Age must be represented by a number
Need of Data Cleaning
• Take the ZIP code attribute first
  • Customer 1002 has an unusual ZIP code
  • Is it a data-entry mistake? No…
  • It is the postal code of Saint-Hyacinthe, Canada
  • So it is not to be treated as a wrong entry
  • What about customer ID 1004? Its ZIP code has only 4 digits
  • This is also not wrong, because the State of Connecticut has ZIP code 06269 and the leading zero might have been dropped because the ZIP attribute was declared as an integer
Need of Data Cleaning
• Take the Gender attribute
  • Customer 1003 has a missing gender record
  • Might be a data-entry mistake
Need of Data Cleaning
• Take the Income attribute
  • Customer 1002 has a negative value - might be a data-entry mistake
  • Customer 1003 has an income that is too high - might be either a data-entry mistake or an outlier - needs verification
  • Customer 1005 has an anomalous value - might be a missing-data code carried over from a legacy database system
Handling Missing Data
• Some field values may be missing, as in the following example
• Generally the rows with missing values are omitted - potentially dangerous, as it can create bias in the resulting subset of the original data
• Some common ways to deal with this problem (see the sketch after this list):
  • Replace with some constant specified by the analyst
  • Replace with the field mean (continuous) or field mode (categorical)
  • Replace with a value randomly generated from the observed distribution of the field
  • Replace with imputed values based on the other characteristics of the record
  • Replace with values derived using kNN, MICE, or deep learning
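
To make these options concrete, here is a minimal sketch using pandas and scikit-learn's KNNImputer. The DataFrame and its column names (income, age, gender) are hypothetical, invented only for illustration.

```python
# Sketch: common missing-value replacements with pandas / scikit-learn.
# The DataFrame and column names are hypothetical, for illustration only.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "income": [75000, 48000, np.nan, 91000, 55000],
    "age":    [41, 29, 35, np.nan, 52],
    "gender": ["F", "M", None, "M", "F"],
})

# Field mean for a continuous attribute, field mode for a categorical one
df["income"] = df["income"].fillna(df["income"].mean())
df["gender"] = df["gender"].fillna(df["gender"].mode()[0])

# kNN imputation: fill "age" from the k most similar records (numeric columns)
imputer = KNNImputer(n_neighbors=2)
df[["income", "age"]] = imputer.fit_transform(df[["income", "age"]])
print(df)
```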
Handling Missing Data
• Data Imputation
  • Ask a question like: "What would be the most likely value of the missing entry, given all the other attributes of this particular record?"
  • E.g., an American car with a 300-cubic-inch, 150 HP engine is expected to have more cylinders than a 100-cubic-inch, 90 HP Japanese car
• Be cautious when using the mean to fill missing values
  • If the mean is pulled too far toward the 100th or the 0th percentile (i.e., the distribution is strongly skewed), the median is the better measure to use
  • Why???
Identifying Misclassifications
• Can you identify the anomaly in this dataset?
• This is the frequency distribution of the country of origin of the automobile manufacturers in a data set

Graphical Methods of Detecting Outliers

Figures: car weight histogram; car weight v/s miles-per-gallon scatter plot
Data Transformations
• Min-Max Normalization

  X* = (X − min(X)) / (max(X) − min(X))

• Z-Score Standardization

  X* = (X − mean(X)) / SD(X)

• Decimal Scaling

  X* = X / 10^d

  where d is the number of digits in the data value with the largest absolute value
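
A minimal Python sketch of all three transformations, on a small illustrative array:

```python
# Sketch of the three transformations on a small array (illustrative data).
import numpy as np

x = np.array([1.0, 5.0, 10.0, 50.0, 100.0])

min_max = (x - x.min()) / (x.max() - x.min())   # maps values into [0, 1]
z_score = (x - x.mean()) / x.std(ddof=1)        # mean 0, sample SD 1

d = len(str(int(np.abs(x).max())))              # digits in largest |value|: 3
decimal = x / 10 ** d                           # all |values| now < 1

print(min_max, z_score, decimal, sep="\n")
```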
Data Transformation for Normality
• Most data mining and machine learning algorithms require the underlying variables to be normally distributed, which is rare in reality
• A skewness measure can be used to check normality
  • Negative skewness -> tail extended on the negative side
  • Positive skewness -> tail extended on the positive side

Figures: original data, skewness = 0.6; standardized data, skewness = 0.6 (Z-score standardization does not remove skew)
Data Transformation for Normality
• Most common transformations are Square Root, Natural Log and
Inverse Square Root
• Skewness (sqrt(weight)) = 0.40
• Skewness (ln(weight)) = 0.19
• Skewness (inv-sqrt(weight)) = 0
Finding Outliers Numerically
• Z-score method for identifying outliers states that a data
value is an outlier if it has a Z-score that is either less than -3
or greater than 3
• Unfortunately, the mean and SD, which are both part of the
formula for the Z-score standardization, are both rather
sensitive to the presence of outliers
• One elementary robust method is to use the IQR.
• The quartiles of a data set divide the data set into the
following four parts, each containing 25% of the data
• A robust measure of outlier detection is therefore defined as
follows. A data value is an outlier if
a. it is located 1.5(IQR) or more below Q1, or
b. it is located 1.5(IQR) or more above Q3
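
A short sketch of both rules on illustrative data. Note that the single extreme value inflates the mean and SD enough that the Z-score rule (|z| > 3) just misses it here, while the IQR rule flags it, which is exactly the robustness point above.

```python
# Sketch: flag outliers by Z-score and by the more robust IQR rule.
import numpy as np

data = np.array([12, 14, 15, 15, 16, 17, 18, 19, 21, 95])  # 95 is suspicious

# Z-score rule: |z| for 95 is only ~2.98, because the outlier itself
# inflates the mean and SD, so this rule misses it (masking effect)
z = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z) > 3]

# IQR rule: fences at Q1 - 1.5*IQR and Q3 + 1.5*IQR; 95 is flagged
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

print(z_outliers, iqr_outliers)   # [] vs. [95]
```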
Similarity and Dissimilarity Measures
• Similarity measure
• Numerical measure of how alike two data objects are.
• Is higher when objects are more alike.
• Often falls in the range [0,1]
• Dissimilarity measure
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
• Proximity refers to a similarity or dissimilarity
Similarity/Dissimilarity for Simple Attributes
The following table shows the similarity and dissimilarity between two objects, x and y, with respect to a single, simple attribute.

Attribute Type    | Dissimilarity                                              | Similarity
Nominal           | d = 0 if x = y, d = 1 if x ≠ y                             | s = 1 if x = y, s = 0 if x ≠ y
Ordinal           | d = |x − y| / (n − 1), values mapped to integers 0 to n − 1 | s = 1 − d
Interval or Ratio | d = |x − y|                                                | s = −d, s = 1/(1 + d), or s = 1 − (d − min_d)/(max_d − min_d)
Euclidean Distance
• Euclidean Distance

  d(x, y) = sqrt( Σ_{k=1}^{n} (x_k − y_k)^2 )

  where n is the number of dimensions (attributes) and x_k and y_k are, respectively, the kth attributes (components) of data objects x and y.

• Standardization is necessary if scales differ.
Euclidean Distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

       p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0

Distance Matrix
Minkowski Distance
• Minkowski Distance is a generalization of Euclidean Distance

  d(x, y) = ( Σ_{k=1}^{n} |x_k − y_k|^r )^{1/r}

  where r is a parameter, n is the number of dimensions (attributes) and x_k and y_k are, respectively, the kth attributes (components) of data objects x and y.
Minkowski Distance: Examples
• r = 1. City block (Manhattan, taxicab, L1 norm) distance.
  • A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors
• r = 2. Euclidean distance
• r → ∞. "Supremum" (Lmax norm, L∞ norm) distance.
  • This is the maximum difference between any component of the vectors
• Do not confuse r with n; all these distances are defined for any number of dimensions.
Minkowski Distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1     p1  p2  p3  p4
p1     0   4   4   6
p2     4   0   2   4
p3     4   2   0   2
p4     6   4   2   0

L2     p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0

L∞     p1  p2  p3  p4
p1     0   2   3   5
p2     2   0   1   3
p3     3   1   0   2
p4     5   3   2   0

Distance Matrix
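
The three matrices above can be reproduced with scipy's pairwise-distance helper; a minimal sketch:

```python
# Sketch: reproduce the L1, L2 and L-infinity distance matrices for p1..p4.
import numpy as np
from scipy.spatial.distance import cdist

points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]])   # p1..p4 from the table

L1   = cdist(points, points, metric="cityblock")   # r = 1
L2   = cdist(points, points, metric="euclidean")   # r = 2
Linf = cdist(points, points, metric="chebyshev")   # r -> infinity

print(np.round(L2, 3))   # matches the L2 distance matrix above
```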
Mahalanobis Distance

mahalanobis(x, y) = (x − y)^T Σ^(−1) (x − y)

• Σ is the covariance matrix

• For the red points in the figure, the Euclidean distance is 14.7 and the Mahalanobis distance is 6.
Mahalanobis Distance

Covariance Matrix:
Σ = [ 0.3  0.2 ]
    [ 0.2  0.3 ]

A = (0.5, 0.5)
B = (0, 1)
C = (1.5, 1.5)

Mahal(A, B) = 5
Mahal(A, C) = 4
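
A minimal numpy sketch verifying Mahal(A, B) = 5 and Mahal(A, C) = 4. It uses the squared form shown above; note that scipy.spatial.distance.mahalanobis returns the square root of this quantity.

```python
# Sketch: verify Mahal(A,B) = 5 and Mahal(A,C) = 4 with numpy.
# This is the squared form used on the slide; scipy's mahalanobis()
# would return its square root.
import numpy as np

cov = np.array([[0.3, 0.2],
                [0.2, 0.3]])
cov_inv = np.linalg.inv(cov)   # equals [[6, -4], [-4, 6]]

def mahal(x, y):
    d = np.asarray(x) - np.asarray(y)
    return d @ cov_inv @ d

A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)
print(mahal(A, B), mahal(A, C))   # ~5.0, ~4.0
```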
Common Properties of a Distance
• Distances, such as the Euclidean distance, have some well-known properties.
  1. d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 only if x = y (positive definiteness)
  2. d(x, y) = d(y, x) for all x and y (symmetry)
  3. d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z (triangle inequality)
  where d(x, y) is the distance (dissimilarity) between points (data objects) x and y.
• A distance that satisfies these properties is a metric
Common Properties of a Similarity
• Similarities also have some well-known properties.
  1. s(x, y) = 1 (or maximum similarity) only if x = y
  2. s(x, y) = s(y, x) for all x and y (symmetry)
  where s(x, y) is the similarity between points (data objects) x and y.
Similarity Between Binary Vectors
• Common situation is that objects, p and q, have only binary attributes

• Compute similarities using the following quantities


f01 = the number of attributes where p was 0 and q was 1
f10 = the number of attributes where p was 1 and q was 0
f00 = the number of attributes where p was 0 and q was 0
f11 = the number of attributes where p was 1 and q was 1

• Simple Matching and Jaccard Coefficients


SMC = number of matches / number of attributes
= (f11 + f00) / (f01 + f10 + f11 + f00)
J = number of 11 matches / number of non-zero attributes
= (f11) / (f01 + f10 + f11)
SMC versus Jaccard: Example
x= 1000000000
y= 0000001001

f01 = 2 (the number of attributes where p was 0 and q was 1)


f10 = 1 (the number of attributes where p was 1 and q was 0)
f00 = 7 (the number of attributes where p was 0 and q was 0)
f11 = 0 (the number of attributes where p was 1 and q was 1)

SMC = (f11 + f00) / (f01 + f10 + f11 + f00)


= (0+7) / (2+1+0+7) = 0.7

J = (f11) / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0
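
A small numpy sketch reproducing these counts and both coefficients:

```python
# Sketch: SMC and Jaccard for the binary vectors x and y above.
import numpy as np

x = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

f11 = np.sum((x == 1) & (y == 1))   # 0
f00 = np.sum((x == 0) & (y == 0))   # 7
f10 = np.sum((x == 1) & (y == 0))   # 1
f01 = np.sum((x == 0) & (y == 1))   # 2

smc = (f11 + f00) / (f01 + f10 + f11 + f00)   # 0.7
jaccard = f11 / (f01 + f10 + f11)             # 0.0
print(smc, jaccard)
```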


Cosine Similarity
• If d1 and d2 are two document vectors, then
  cos(d1, d2) = <d1, d2> / (||d1|| ||d2||),
  where <d1, d2> indicates the inner product (vector dot product) of the vectors d1 and d2, and ||d|| is the length of vector d.
• Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
<d1, d2> = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
|| d1 || = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)0.5 = (42) 0.5 = 6.481
|| d2 || = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2) 0.5 = (6) 0.5 = 2.449
cos(d1, d2 ) = 0.3150
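
A minimal numpy check of this computation:

```python
# Sketch: verify cos(d1, d2) = 0.3150 for the document vectors above.
import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 4))   # 0.315
```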
Extended Jaccard Coefficient (Tanimoto)
• Variation of Jaccard for continuous or count attributes
• Reduces to Jaccard for binary attributes

Correlation
• Correlation measures the linear relationship between objects
Visually Evaluating Correlation

Figures: scatter plots showing the similarity from −1 to 1.
Drawback of Correlation
• If the correlation is 0, then there is no linear relationship
• But this guarantees nothing about a nonlinear relationship
• x = (−3, −2, −1, 0, 1, 2, 3)
• y = (9, 4, 1, 0, 1, 4, 9), i.e., y_i = x_i^2

• mean(x) = 0, mean(y) = 4
• std(x) = 2.16, std(y) = 3.74

• corr = [(−3)(5) + (−2)(0) + (−1)(−3) + (0)(−4) + (1)(−3) + (2)(0) + (3)(5)] / (6 × 2.16 × 3.74) = 0
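
A two-line numpy check of the example:

```python
# Sketch: a perfect quadratic relationship y = x^2 with zero correlation.
import numpy as np

x = np.array([-3, -2, -1, 0, 1, 2, 3])
y = x ** 2

print(np.corrcoef(x, y)[0, 1])   # ~0, despite y being fully determined by x
```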
Comparison of Proximity Measures
• Domain of application
• Similarity measures tend to be specific to the type of attribute and
data
• Record data, images, graphs, sequences, 3D-protein structure, etc.
tend to have different measures
• However, one can talk about various properties that you
would like a proximity measure to have
• Symmetry is a common one
• Tolerance to noise and outliers is another
• Ability to find more types of patterns
• Many others possible
• The measure must be applicable to the data and produce
results that agree with domain knowledge
Information Based Measures
• Information theory is a well-developed and fundamental discipline with broad applications

• Some similarity measures are based on information theory


• Mutual information in various versions
• Maximal Information Coefficient (MIC) and related measures
• General and can handle non-linear relationships
• Can be complicated and time intensive to compute
Information and Probability
• Information relates to possible outcomes of an event
• transmission of a message, flip of a coin, or measurement of a
piece of data

• The more certain an outcome, the less information that it


contains and vice-versa
• For example, if a coin has two heads, then an outcome of heads
provides no information
• More quantitatively, the information is related to the probability of an outcome
• The smaller the probability of an outcome, the more information it provides
and vice-versa
• Entropy is the commonly used measure
Entropy
• For
  • a variable (event) X,
  • with n possible values (outcomes) x1, x2, …, xn,
  • each outcome having probability p1, p2, …, pn,
  • the entropy of X, H(X), is given by

    H(X) = − Σ_{i=1}^{n} p_i log2(p_i)

• Entropy is between 0 and log2(n) and is measured in bits
  • Thus, entropy is a measure of how many bits it takes to represent an observation of X on average
• The maximum entropy of an image depends on its number of gray scales
  • For example, for an image with 256 gray scales the maximum entropy is log2(256) = 8. The maximum occurs when all bins of the histogram have the same constant value, i.e., the image intensity is uniformly distributed in [0, 255]
Entropy Examples
• For a coin with probability p of heads and probability q = 1 − p of tails

  H = − p log2(p) − q log2(q)

  • For p = 0.5, q = 0.5 (fair coin), H = 1
  • For p = 1 or q = 1, H = 0

• What is the entropy of a fair four-sided die?
Entropy for Sample Data: Example

Hair Color  Count  p     −p log2(p)
Black       75     0.75  0.3113
Brown       15     0.15  0.4105
Blond       5      0.05  0.2161
Red         0      0.00  0
Other       5      0.05  0.2161
Total       100    1.00  1.1540

Maximum entropy is log2(5) = 2.3219
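
A minimal sketch reproducing the 1.1540-bit entropy of this sample:

```python
# Sketch: reproduce H = 1.1540 bits for the hair-color sample above.
import numpy as np

counts = np.array([75, 15, 5, 0, 5])   # Black, Brown, Blond, Red, Other
p = counts / counts.sum()
p = p[p > 0]                           # 0 * log2(0) is taken as 0

H = -np.sum(p * np.log2(p))
print(round(H, 4))                     # 1.154, vs. maximum log2(5) = 2.3219
```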


Entropy for Sample Data
• Suppose we have
  • a number of observations (m) of some attribute X, e.g., the hair color of students in the class,
  • where there are n different possible values,
  • and the number of observations in the ith category is m_i
• Then, for this sample,

  H(X) = − Σ_{i=1}^{n} (m_i / m) log2(m_i / m)

• For continuous data, the calculation is harder
Mutual Information
• Information one variable provides about another
• Formally, I(X, Y) = H(X) + H(Y) − H(X, Y), where H(X, Y) is the joint entropy of X and Y:

  H(X, Y) = − Σ_i Σ_j p_ij log2(p_ij)

  where p_ij is the probability that the ith value of X and the jth value of Y occur together
• For discrete variables, this is easy to compute
• High mutual information indicates a large reduction in uncertainty; low mutual information indicates a small reduction; and zero mutual information between two random variables means the variables are independent
• Maximum mutual information for discrete variables is log2(min(nX, nY)), where nX (nY) is the number of values of X (Y)
Mutual Information Example

Student Status  Count  p     −p log2(p)
Undergrad       45     0.45  0.5184
Grad            55     0.55  0.4744
Total           100    1.00  0.9928

Grade  Count  p     −p log2(p)
A      35     0.35  0.5301
B      50     0.50  0.5000
C      15     0.15  0.4105
Total  100    1.00  1.4406

Student Status  Grade  Count  p     −p log2(p)
Undergrad       A      5      0.05  0.2161
Undergrad       B      30     0.30  0.5211
Undergrad       C      10     0.10  0.3322
Grad            A      30     0.30  0.5211
Grad            B      20     0.20  0.4644
Grad            C      5      0.05  0.2161
Total                  100    1.00  2.2710

Mutual information of Student Status and Grade = 0.9928 + 1.4406 − 2.2710 = 0.1624
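
A short sketch reproducing this result from the joint counts:

```python
# Sketch: mutual information of Student Status and Grade from joint counts.
import numpy as np

# rows: Undergrad, Grad; columns: A, B, C
joint = np.array([[ 5, 30, 10],
                  [30, 20,  5]]) / 100.0

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H_status = entropy(joint.sum(axis=1))   # 0.9928
H_grade  = entropy(joint.sum(axis=0))   # 1.4406
H_joint  = entropy(joint.ravel())       # 2.2710

print(H_status + H_grade - H_joint)     # ~0.1624
```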
Maximal Information Coefficient
• Applies mutual information to two continuous variables
• Consider the possible binnings of the variables into discrete categories
  • nX × nY ≤ N^0.6, where
    • nX is the number of values of X
    • nY is the number of values of Y
    • N is the number of samples (observations, data objects)
• Compute the mutual information
  • Normalized by log2(min(nX, nY))
• Take the highest value

• Reshef, David N., Yakir A. Reshef, Hilary K. Finucane, Sharon R. Grossman, Gilean McVean, Peter J. Turnbaugh, Eric S. Lander, Michael Mitzenmacher, and Pardis C. Sabeti. "Detecting novel associations in large data sets." Science 334, no. 6062 (2011): 1518-1524.
General Approach for Combining Similarities
• Sometimes attributes are of many different types, but an overall similarity is needed (a small numeric sketch follows the steps).
  1. For the kth attribute, compute a similarity, s_k(x, y), in the range [0, 1].
  2. Define an indicator variable, δ_k, for the kth attribute as follows:
     δ_k = 0 if the kth attribute is an asymmetric attribute and both objects have a value of 0, or if one of the objects has a missing value for the kth attribute
     δ_k = 1 otherwise
  3. Compute

     similarity(x, y) = Σ_{k=1}^{n} δ_k s_k(x, y) / Σ_{k=1}^{n} δ_k
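
A minimal numpy sketch of the combination rule; the per-attribute similarities and indicator values are hypothetical, chosen only for illustration:

```python
# Sketch: combine per-attribute similarities, skipping attributes whose
# indicator is 0 (asymmetric 0-0 match or missing value). Values are made up.
import numpy as np

s     = np.array([0.8, 0.5, 1.0, 0.0])  # s_k(x, y) for each attribute k
delta = np.array([1,   1,   0,   1])    # delta_k indicator per attribute

similarity = np.sum(delta * s) / np.sum(delta)
print(similarity)   # (0.8 + 0.5 + 0.0) / 3 = 0.4333...
```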
Using Weights to Combine Similarities
• We may not want to treat all attributes the same
  • Use non-negative weights w_k

  similarity(x, y) = Σ_k w_k δ_k s_k(x, y) / Σ_k w_k δ_k

• Can also define a weighted form of distance:

  d(x, y) = ( Σ_{k=1}^{n} w_k |x_k − y_k|^r )^{1/r}
Density
• Measures the degree to which data objects are close to each other in a
specified area
• The notion of density is closely related to that of proximity
• Concept of density is typically used for clustering and anomaly
detection
• Examples:
• Euclidean density
• Euclidean density = number of points per unit volume
• Probability density
• Estimate what the distribution of the data looks like
• Graph-based density
• Connectivity
Euclidean Density: Grid-based Approach
• Simplest approach is to divide the region into a number of rectangular cells of equal volume and define density as the number of points each cell contains

Figure: grid-based density, with counts for each cell.

Euclidean Density: Center-Based
• Euclidean density is the number of points within a specified radius of the point

Figure: illustration of center-based density.
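
A minimal sketch of both density notions on random 2-D points (the data is illustrative):

```python
# Sketch: two Euclidean density estimates for random 2-D points.
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)
pts = rng.uniform(0, 10, size=(500, 2))

# Grid-based: count points per cell of an equal-volume grid
grid_counts, _, _ = np.histogram2d(pts[:, 0], pts[:, 1], bins=7)

# Center-based: number of other points within radius 1 of each point
center_counts = (cdist(pts, pts) < 1.0).sum(axis=1) - 1

print(grid_counts.max(), center_counts.max())   # densest cell / densest point
```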


Data Preprocessing
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• Attribute Transformation
Aggregation
• Combining two or more attributes (or objects) into a single
attribute (or object)

• Purpose
• Data reduction
• Reduce the number of attributes or objects
• Change of scale
• Cities aggregated into regions, states, countries, etc.
• Days aggregated into weeks, months, or years
• More “stable” data
• Aggregated data tends to have less variability
Example: Precipitation in Australia
• This example is based on precipitation in Australia (Rain or
Snow) from the period 1982 to 1993.
The next slide shows
• A histogram for the standard deviation of average monthly
precipitation for 3,030 0.5◦ by 0.5◦ grid cells in Australia, and
• A histogram for the standard deviation of the average yearly
precipitation for the same locations.
• The average yearly precipitation has less variability than the
average monthly precipitation.
• All precipitation measurements (and their standard
deviations) are in centimeters.
Example: Precipitation in Australia …

Figures: variation of precipitation in Australia - standard deviation of average monthly precipitation, and standard deviation of average yearly precipitation.
Sampling
• Sampling is the main technique employed for data
reduction.
• It is often used for both the preliminary investigation of the data
and the final data analysis.

• Statisticians often sample because obtaining the entire set of


data of interest is too expensive or time consuming.

• Sampling is typically used in data mining because


processing the entire set of data of interest is too expensive
or time consuming.
Sampling …
• The key principle for effective sampling is the following:

• Using a sample will work almost as well as using the entire data
set, if the sample is representative

• A sample is representative if it has approximately the same


properties (of interest) as the original set of data
Sample Size

Figures: the same data set sampled at 8000 points, 2000 points, and 500 points.
Types of Sampling
• Simple Random Sampling
• There is an equal probability of selecting any particular item
• Sampling without replacement
• As each item is selected, it is removed from the population
• Sampling with replacement
• Objects are not removed from the population as they are
selected for the sample.
• In sampling with replacement, the same object can be picked
up more than once
• Stratified sampling
• Split the data into several partitions; then draw random samples
from each partition
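
A minimal pandas sketch of the three schemes; the DataFrame and its stratum column are hypothetical, for illustration:

```python
# Sketch: simple random sampling (with and without replacement) and
# stratified sampling with pandas. The DataFrame is made-up toy data.
import pandas as pd

df = pd.DataFrame({"value": range(1000),
                   "stratum": ["A"] * 500 + ["B"] * 300 + ["C"] * 200})

srs_without = df.sample(n=100, replace=False, random_state=0)
srs_with    = df.sample(n=100, replace=True,  random_state=0)

# Stratified: draw 10% from each partition so group proportions are preserved
stratified = df.groupby("stratum").sample(frac=0.10, random_state=0)
print(stratified["stratum"].value_counts())   # A: 50, B: 30, C: 20
```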
Sample Size
• What sample size is necessary to get at least one
object from each of 10 equal-sized groups.
Curse of Dimensionality
• When dimensionality increases, data becomes increasingly sparse in the space that it occupies
• Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful
• Illustration: randomly generate 500 points, and compute the difference between the maximum and minimum distance between any pair of points
Dimensionality Reduction
• Purpose:
• Avoid curse of dimensionality
• Reduce amount of time and memory required by data mining
algorithms
• Allow data to be more easily visualized
• May help to eliminate irrelevant features or reduce noise

• Techniques
• Principal Components Analysis (PCA)
• Singular Value Decomposition
• Others: supervised and non-linear techniques
Dimensionality Reduction: PCA
• Goal is to find a projection that captures the largest amount of variation in the data

Figure: data points in the (x1, x2) plane with their principal direction of variation.
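
A minimal scikit-learn sketch on synthetic data; the data, the standardization step, and the choice of two components are illustrative assumptions, not from the slides:

```python
# Sketch: project standardized data onto its top principal components.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))           # 200 objects, 10 attributes
X[:, 1] = 2 * X[:, 0] + 0.1 * X[:, 1]    # make two attributes correlated

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(StandardScaler().fit_transform(X))

print(pca.explained_variance_ratio_)     # variance captured per component
```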
Feature Subset Selection
• Another way to reduce dimensionality of data
• Redundant features
• Duplicate much or all of the information contained in one or more
other attributes
• Example: purchase price of a product and the amount of sales tax
paid
• Irrelevant features
• Contain no information that is useful for the data mining task at
hand
• Example: students' ID is often irrelevant to the task of predicting
students' GPA
• Many techniques developed, especially for classification
Feature Creation
• Create new attributes that can capture the important
information in a data set much more efficiently than the
original attributes

• Three general methodologies:


• Feature extraction
• Example: extracting edges from images
• Feature construction
• Example: dividing mass by volume to get density
• Mapping data to new space
• Example: Fourier and wavelet analysis
Mapping Data to a New Space
• Fourier and wavelet transform

Figures: two sine waves + noise (time domain), and its frequency-domain representation.
Discretization
• Discretization is the process of converting a continuous
attribute into an ordinal attribute
• A potentially infinite number of values are mapped into a small
number of categories
• Discretization is commonly used in classification
• Many classification algorithms work best if both the independent
and dependent variables have only a few values
• We give an illustration of the usefulness of discretization using the
Iris data set
Iris Sample Data Set
• Iris plant data set
  • Can be obtained from the UCI Machine Learning Repository
    http://www.ics.uci.edu/~mlearn/MLRepository.html
  • From the statistician R. A. Fisher
• Three flower types (classes):
  • Setosa
  • Versicolour
  • Virginica
• Four (non-class) attributes
  • Sepal width and length
  • Petal width and length

Image: Virginica. Robert H. Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Science Institute.
Discretization: Iris Example

Petal width low or petal length low implies Setosa.


Petal width medium or petal length medium implies Versicolour.
Petal width high or petal length high implies Virginica.
Discretization: Iris Example …
• How can we tell what the best discretization is?
• Unsupervised discretization: find breaks in the data values
  • Example: histogram of petal length counts (x-axis: petal length, 0 to 8; y-axis: counts, 0 to 50)
• Supervised discretization: use class labels to find breaks
Discretization Without Using Class Labels

Data consists of four groups of points and two outliers. Data is one-
dimensional, but a random y component is added to reduce overlap.
Discretization Without Using Class
Labels

Equal interval width approach used to obtain 4 values.


Discretization Without Using Class
Labels

Equal frequency approach used to obtain 4 values.


Discretization Without Using Class
Labels

K-means approach to obtain 4 values.
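
A sketch of the three unsupervised approaches on synthetic data loosely resembling the four-group example above (the data itself is made up):

```python
# Sketch: equal-width and equal-frequency discretization with pandas,
# and K-means binning with scikit-learn, on illustrative data.
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(mu, 0.3, 50) for mu in (0, 4, 8, 12)])

equal_width = pd.cut(x, bins=4, labels=False)   # same interval length per bin
equal_freq  = pd.qcut(x, q=4, labels=False)     # same number of points per bin

kmeans_bins = KBinsDiscretizer(n_bins=4, encode="ordinal",
                               strategy="kmeans").fit_transform(x.reshape(-1, 1))

print(np.bincount(equal_width), np.bincount(equal_freq))
```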


Binarization
• Binarization maps a continuous or categorical attribute into
one or more binary variables

• Typically used for association analysis

• Often convert a continuous attribute to a categorical attribute


and then convert a categorical attribute to a set of binary
attributes
• Association analysis needs asymmetric binary attributes
• Examples: eye color and height measured as
{low, medium, high}
Attribute Transformation
• An attribute transform is a function that maps the entire set
of values of a given attribute to a new set of replacement
values such that each old value can be identified with one of
the new values
• Simple functions: x^k, log(x), e^x, |x|
• Normalization
• Refers to various techniques to adjust to differences among
attributes in terms of frequency of occurrence, mean, variance,
range
• Take out unwanted, common signal, e.g., seasonality
• In statistics, standardization refers to subtracting off the means and
dividing by the standard deviation
