
Data

Dr Sandipan Karmakar
What is Data?
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
  • Examples: eye color of a person, temperature, etc.
  • Attribute is also known as variable, field, characteristic, dimension, or feature
• A collection of attributes describes an object
  • Object is also known as record, point, case, sample, entity, or instance

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Types of Attributes
• Nominal
• Examples: ID numbers, eye color, zip codes
• Ordinal
• Examples: rankings (e.g., taste of potato chips on a scale from 1-
10), grades, height {tall, medium, short}
• Interval
• Examples: calendar dates, temperatures in Celsius or Fahrenheit
• Ratio
• Examples: temperature in Kelvin, length, time, counts
Properties of Attribute Values
• The type of an attribute depends on which of the following properties/operations it possesses:
  • Distinctness: =, ≠
  • Order: <, >
  • Differences are meaningful: +, −
  • Ratios are meaningful: *, /

• Nominal attribute: distinctness
• Ordinal attribute: distinctness & order
• Interval attribute: distinctness, order & meaningful differences
• Ratio attribute: all 4 properties/operations
Difference Between Ratio and Interval
• Is it physically meaningful to say that a temperature of 10° is twice that of 5° on
  • the Celsius scale?
  • the Fahrenheit scale?
  • the Kelvin scale?
• Consider measuring height above average
  • If Bill's height is three inches above average and Bob's height is six inches above average, then would we say that Bob is twice as tall as Bill?
  • Is this situation analogous to that of temperature?
Categorization of Attributes

Attribute Type | Description | Examples | Operations

Categorical (Qualitative)
  Nominal | Nominal attribute values only distinguish. (=, ≠) | zip codes, employee ID numbers, eye color, sex: {male, female} | mode, entropy, contingency correlation, χ² test
  Ordinal | Ordinal attribute values also order objects. (<, >) | hardness of minerals, {good, better, best}, grades, street numbers | median, percentiles, rank correlation, run tests, sign tests

Numeric (Quantitative)
  Interval | For interval attributes, differences between values are meaningful. (+, −) | calendar dates, temperature in Celsius or Fahrenheit | mean, standard deviation, Pearson's correlation, t and F tests
  Ratio | For ratio variables, both differences and ratios are meaningful. (*, /) | temperature in Kelvin, monetary quantities, counts, age, mass, length, current | geometric mean, harmonic mean, percent variation
Categorization of Attributes

Attribute Type | Transformation | Comments

Categorical (Qualitative)
  Nominal | Any permutation of values | If all employee ID numbers were reassigned, would it make any difference?
  Ordinal | An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function | An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Numeric (Quantitative)
  Interval | new_value = a * old_value + b, where a and b are constants | The Fahrenheit and Celsius temperature scales differ in where their zero value is and in the size of a unit (degree).
  Ratio | new_value = a * old_value | Length can be measured in meters or feet.
Discrete v/s Continuous Attributes
• Discrete Attribute
  • Has only a finite or countably infinite set of values
  • Examples: zip codes, counts, or the set of words in a collection of documents
  • Often represented as integer variables
  • Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
  • Has real numbers as attribute values
  • Examples: temperature, height, or weight
  • Practically, real values can only be measured and represented using a finite number of digits
  • Continuous attributes are typically represented as floating-point variables
Asymmetric Attributes
• For asymmetric attributes, only non-zero values (presence, as opposed to absence) are regarded as important
• Consider a data set where each object is a student and each attribute records whether or not a student took a particular course at a university
  • For a specific student, an attribute has a value of 1 if the student took the course associated with that attribute and a value of 0 otherwise
  • Because students take only a small fraction of all available courses, most of the values in such a data set would be 0
  • Therefore, it is more meaningful and more efficient to focus on the non-zero values
  • To illustrate, if students are compared on the basis of the courses they don't take, then most students would seem very similar, at least if the number of courses is large
• Binary attributes where only non-zero values are important are called asymmetric binary attributes
  • This type of attribute is particularly important for association analysis
• It is also possible to have discrete or continuous asymmetric features
  • For instance, if the number of credits associated with each course is recorded, then the resulting data set will consist of asymmetric discrete or continuous attributes
General Characteristics of Data Sets
• Dimensionality - The dimensionality of a data set is the number of attributes that the objects in the data set possess. Data with a small number of dimensions tends to be qualitatively different from moderate or high-dimensional data
  • Curse of Dimensionality
  • Dimensionality Reduction
• Sparsity - For some data sets, such as those with asymmetric features, most attributes of an object have values of 0; in many cases, fewer than 1% of the entries are non-zero. In practical terms, sparsity is an advantage because usually only the non-zero values need to be stored and manipulated
• Resolution - It is frequently possible to obtain data at different levels of resolution, and often the properties of the data are different at different resolutions
  • For instance, the surface of the Earth seems very uneven at a resolution of a few meters, but is relatively smooth at a resolution of tens of kilometers
  • The patterns in the data also depend on the level of resolution
  • If the resolution is too fine, a pattern may not be visible or may be buried in noise; if the resolution is too coarse, the pattern may disappear
  • For example, variations in atmospheric pressure on a scale of hours reflect the movement of storms and other weather systems; on a scale of months, such phenomena are not detectable
Types of Data Sets
• Record Data
  • Data Matrix
  • Transaction Data
  • Document or Term Matrix
• Graph-based Data
  • WWW
  • Molecular Structures
• Ordered Data
  • Spatial Data
  • Sequential Data
• Unstructured Data
Record Data
• Data that consists of a collection of records, each of which consists of a fixed set of attributes

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Data Matrix
• If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
• Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute

Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
Document Data
• Each document becomes a 'term' vector
  • Each term is a component (attribute) of the vector
  • The value of each component is the number of times the corresponding term occurs in the document

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1    3     0     5     0      2     6     0     2      0       2
Document 2    0     7     0     2      1     0     0     3      0       0
Document 3    0     1     0     0      1     2     2     0      3       0
Transaction Data
• A special type of record data, where
• Each record (transaction) involves a set of items.
• For example, consider a grocery store. The set of products
purchased by a customer during one shopping trip constitute a
transaction, while the individual products that were purchased are
the items.

TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Graph Data
• Examples: a generic graph (activity precedence diagram), a molecule (benzene, C6H6), and linked webpages
Ordered Data
• Sequences of transactions: each element of the sequence is a set of items/events
Ordered Data
• Genomic sequence data

GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data
• Spatio-Temporal Data
  • Example: average monthly temperature of land and ocean
Data Quality
• What are the kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?

• Examples of data quality problems:


• Obsolete or redundant fields
• Noise and outlier values
• Missing values
• Duplicate data
• Wrong data
Noise
• For objects, noise is an extraneous object
• For attributes, noise refers to modification of original values
• Examples: distortion of a person’s voice when talking on a poor phone and
“snow” on television screen

Figures: two sine waves, and the same two sine waves with noise added


Outliers
• Outliers are data objects with characteristics that are considerably different from those of most of the other data objects in the data set
• Case 1: Outliers are noise that interferes with data analysis
• Case 2: Outliers are the goal of our analysis
  • Credit card fraud
  • Intrusion detection
Missing Values
• Reasons for missing values
• Information is not collected
(e.g., people decline to give their age and weight)
• Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)

• Handling missing values


• Eliminate data objects or variables
• Estimate missing values
• Example: time series of temperature
• Example: census results
• Ignore the missing value during analysis
Missing Values …
• Missing completely at random (MCAR)
• Missingness of a value is independent of attributes
• Fill in values based on the attribute
• Analysis may be unbiased overall
• Missing at Random (MAR)
• Missingness is related to other variables
• Fill in values based on other values
• Almost always produces a bias in the analysis
• Missing Not at Random (MNAR)
• Missingness is related to unobserved measurements
• Informative or non-ignorable missingness
• Not possible to know the situation from the data
Duplicate Data
• Data set may include data objects that are duplicates, or
almost duplicates of one another
• Major issue when merging data from heterogeneous sources

• Examples:
• Same person with multiple email addresses

• Data cleaning
• Process of dealing with duplicate data issues

• When should duplicate data not be removed?


Need of Data Cleaning
• Can you find any problem in this tiny data set?
  • Source is an American bank
  • ZIP codes are assumed to be of 5 digits
  • Incomes are assumed to be positive
  • Age must be represented by a number
Need of Data Cleaning
• Take the ZIP code attribute first
  • Customer 1002 has an unusual ZIP code
  • Is it a data-entry mistake? No…
  • It is the postal code of Saint-Hyacinthe, Canada
  • So it is not to be treated as a wrong entry
  • What about customer ID 1004? Its ZIP code has only 4 digits
  • This is also not wrong, because the State of Connecticut has ZIP code 06269 and the leading zero might have been dropped because the ZIP attribute was declared as an integer
Need of Data Cleaning
• Take the Gender attribute
  • Customer 1003 has a missing gender record
  • Might be a data-entry mistake
Need of Data Cleaning
• Take the Income attribute
  • Customer 1002 has a negative value - might be a data-entry mistake
  • Customer 1003 has an income that is too high - might be either a data-entry mistake or an outlier - needs verification
  • Customer 1005 has an anomalous value - might be a missing-data code carried over from a legacy database system
Handling Missing Data
• Some field values may be missing, as in the following example
• Generally the rows with missing values are omitted - potentially dangerous, as it can create bias in the resulting subset of the original data
• Some common ways to deal with this problem (see the sketch after this list):
  • Replace with some constant specified by the analyst
  • Replace with the field mean (continuous) or field mode (categorical)
  • Replace with a value randomly generated from the observed distribution of the field
  • Replace with imputed values based on the other characteristics of the record
  • Replace with values derived using kNN, MICE, or deep learning
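
To make these options concrete, here is a minimal sketch using pandas and scikit-learn's KNNImputer. The DataFrame and its column names (income, age, gender) are hypothetical, invented only for illustration.

```python
# Sketch: common missing-value replacements with pandas / scikit-learn.
# The DataFrame and column names are hypothetical, for illustration only.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "income": [75000, 48000, np.nan, 91000, 55000],
    "age":    [41, 29, 35, np.nan, 52],
    "gender": ["F", "M", None, "M", "F"],
})

# Field mean for a continuous attribute, field mode for a categorical one
df["income"] = df["income"].fillna(df["income"].mean())
df["gender"] = df["gender"].fillna(df["gender"].mode()[0])

# kNN imputation: fill "age" from the k most similar records (numeric columns)
imputer = KNNImputer(n_neighbors=2)
df[["income", "age"]] = imputer.fit_transform(df[["income", "age"]])
print(df)
```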
Handling Missing Data
• Data Imputation
  • Ask a question like: "What would be the most likely value of the missing entry, given all the other attributes of this particular record?"
  • E.g., an American car with a 300-cubic-inch, 150 HP engine is expected to have more cylinders than a 100-cubic-inch, 90 HP Japanese car
• Be cautious when using the mean to fill missing values
  • If the mean is pulled too far toward the 100th or the 0th percentile (i.e., the distribution is strongly skewed), the median is the better measure to use
  • Why???
Identifying Misclassifications
• Can you identify the anomaly in this dataset?
• This is the frequency distribution of the country of origin of the automobile manufacturers in a data set

Graphical Methods of Detecting Outliers

Figures: car weight histogram; car weight v/s miles-per-gallon scatter plot
Data Transformations
• Min-Max Normalization

  X* = (X − min(X)) / (max(X) − min(X))

• Z-Score Standardization

  X* = (X − mean(X)) / SD(X)

• Decimal Scaling

  X* = X / 10^d

  where d is the number of digits in the data value with the largest absolute value
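
A minimal Python sketch of all three transformations, on a small illustrative array:

```python
# Sketch of the three transformations on a small array (illustrative data).
import numpy as np

x = np.array([1.0, 5.0, 10.0, 50.0, 100.0])

min_max = (x - x.min()) / (x.max() - x.min())   # maps values into [0, 1]
z_score = (x - x.mean()) / x.std(ddof=1)        # mean 0, sample SD 1

d = len(str(int(np.abs(x).max())))              # digits in largest |value|: 3
decimal = x / 10 ** d                           # all |values| now < 1

print(min_max, z_score, decimal, sep="\n")
```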
Data Transformation for Normality
• Most data mining and machine learning algorithms require the underlying variables to be normally distributed, which is rare in reality
• A skewness measure can be used to check normality
  • Negative skewness -> tail extended on the negative side
  • Positive skewness -> tail extended on the positive side

Figures: original data, skewness = 0.6; standardized data, skewness = 0.6 (Z-score standardization does not remove skew)
Data Transformation for Normality
• Most common transformations are Square Root, Natural Log and
Inverse Square Root
• Skewness (sqrt(weight)) = 0.40
• Skewness (ln(weight)) = 0.19
• Skewness (inv-sqrt(weight)) = 0
Finding Outliers Numerically
• Z-score method for identifying outliers states that a data
value is an outlier if it has a Z-score that is either less than -3
or greater than 3
• Unfortunately, the mean and SD, which are both part of the
formula for the Z-score standardization, are both rather
sensitive to the presence of outliers
• One elementary robust method is to use the IQR.
• The quartiles of a data set divide the data set into the
following four parts, each containing 25% of the data
• A robust measure of outlier detection is therefore defined as
follows. A data value is an outlier if
a. it is located 1.5(IQR) or more below Q1, or
b. it is located 1.5(IQR) or more above Q3
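
A short sketch of both rules on illustrative data. Note that the single extreme value inflates the mean and SD enough that the Z-score rule (|z| > 3) just misses it here, while the IQR rule flags it, which is exactly the robustness point above.

```python
# Sketch: flag outliers by Z-score and by the more robust IQR rule.
import numpy as np

data = np.array([12, 14, 15, 15, 16, 17, 18, 19, 21, 95])  # 95 is suspicious

# Z-score rule: |z| for 95 is only ~2.98, because the outlier itself
# inflates the mean and SD, so this rule misses it (masking effect)
z = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z) > 3]

# IQR rule: fences at Q1 - 1.5*IQR and Q3 + 1.5*IQR; 95 is flagged
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

print(z_outliers, iqr_outliers)   # [] vs. [95]
```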
Similarity and Dissimilarity Measures
• Similarity measure
• Numerical measure of how alike two data objects are.
• Is higher when objects are more alike.
• Often falls in the range [0,1]
• Dissimilarity measure
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
• Proximity refers to a similarity or dissimilarity
Similarity/Dissimilarity for Simple Attributes
The following table shows the similarity and dissimilarity between two objects, x and y, with respect to a single, simple attribute.

Attribute Type    | Dissimilarity                                              | Similarity
Nominal           | d = 0 if x = y, d = 1 if x ≠ y                             | s = 1 if x = y, s = 0 if x ≠ y
Ordinal           | d = |x − y| / (n − 1), values mapped to integers 0 to n − 1 | s = 1 − d
Interval or Ratio | d = |x − y|                                                | s = −d, s = 1/(1 + d), or s = 1 − (d − min_d)/(max_d − min_d)
Euclidean Distance
• Euclidean Distance

  d(x, y) = sqrt( Σ_{k=1}^{n} (x_k − y_k)^2 )

  where n is the number of dimensions (attributes) and x_k and y_k are, respectively, the kth attributes (components) of data objects x and y.

• Standardization is necessary if scales differ.
Euclidean Distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

       p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0

Distance Matrix
Minkowski Distance
• Minkowski Distance is a generalization of Euclidean Distance

  d(x, y) = ( Σ_{k=1}^{n} |x_k − y_k|^r )^{1/r}

  where r is a parameter, n is the number of dimensions (attributes) and x_k and y_k are, respectively, the kth attributes (components) of data objects x and y.
Minkowski Distance: Examples
• r = 1. City block (Manhattan, taxicab, L1 norm) distance.
  • A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors
• r = 2. Euclidean distance
• r → ∞. "Supremum" (Lmax norm, L∞ norm) distance.
  • This is the maximum difference between any component of the vectors
• Do not confuse r with n; all these distances are defined for any number of dimensions.
Minkowski Distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1     p1  p2  p3  p4
p1     0   4   4   6
p2     4   0   2   4
p3     4   2   0   2
p4     6   4   2   0

L2     p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0

L∞     p1  p2  p3  p4
p1     0   2   3   5
p2     2   0   1   3
p3     3   1   0   2
p4     5   3   2   0

Distance Matrix
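
The three matrices above can be reproduced with scipy's pairwise-distance helper; a minimal sketch:

```python
# Sketch: reproduce the L1, L2 and L-infinity distance matrices for p1..p4.
import numpy as np
from scipy.spatial.distance import cdist

points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]])   # p1..p4 from the table

L1   = cdist(points, points, metric="cityblock")   # r = 1
L2   = cdist(points, points, metric="euclidean")   # r = 2
Linf = cdist(points, points, metric="chebyshev")   # r -> infinity

print(np.round(L2, 3))   # matches the L2 distance matrix above
```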
Mahalanobis Distance

mahalanobis(x, y) = (x − y)^T Σ^(−1) (x − y)

• Σ is the covariance matrix

• For the red points in the figure, the Euclidean distance is 14.7 and the Mahalanobis distance is 6.
Mahalanobis Distance

Covariance Matrix:
Σ = [ 0.3  0.2 ]
    [ 0.2  0.3 ]

A = (0.5, 0.5)
B = (0, 1)
C = (1.5, 1.5)

Mahal(A, B) = 5
Mahal(A, C) = 4
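
A minimal numpy sketch verifying Mahal(A, B) = 5 and Mahal(A, C) = 4. It uses the squared form shown above; note that scipy.spatial.distance.mahalanobis returns the square root of this quantity.

```python
# Sketch: verify Mahal(A,B) = 5 and Mahal(A,C) = 4 with numpy.
# This is the squared form used on the slide; scipy's mahalanobis()
# would return its square root.
import numpy as np

cov = np.array([[0.3, 0.2],
                [0.2, 0.3]])
cov_inv = np.linalg.inv(cov)   # equals [[6, -4], [-4, 6]]

def mahal(x, y):
    d = np.asarray(x) - np.asarray(y)
    return d @ cov_inv @ d

A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)
print(mahal(A, B), mahal(A, C))   # ~5.0, ~4.0
```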
Common Properties of a Distance
• Distances, such as the Euclidean distance, have some well-known properties.
  1. d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 only if x = y (positive definiteness)
  2. d(x, y) = d(y, x) for all x and y (symmetry)
  3. d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z (triangle inequality)
  where d(x, y) is the distance (dissimilarity) between points (data objects) x and y.
• A distance that satisfies these properties is a metric
Common Properties of a Similarity
• Similarities also have some well-known properties.
  1. s(x, y) = 1 (or maximum similarity) only if x = y
  2. s(x, y) = s(y, x) for all x and y (symmetry)
  where s(x, y) is the similarity between points (data objects) x and y.
Similarity Between Binary Vectors
• Common situation is that objects, p and q, have only binary attributes

• Compute similarities using the following quantities


f01 = the number of attributes where p was 0 and q was 1
f10 = the number of attributes where p was 1 and q was 0
f00 = the number of attributes where p was 0 and q was 0
f11 = the number of attributes where p was 1 and q was 1

• Simple Matching and Jaccard Coefficients


SMC = number of matches / number of attributes
= (f11 + f00) / (f01 + f10 + f11 + f00)
J = number of 11 matches / number of non-zero attributes
= (f11) / (f01 + f10 + f11)
SMC versus Jaccard: Example
x= 1000000000
y= 0000001001

f01 = 2 (the number of attributes where p was 0 and q was 1)


f10 = 1 (the number of attributes where p was 1 and q was 0)
f00 = 7 (the number of attributes where p was 0 and q was 0)
f11 = 0 (the number of attributes where p was 1 and q was 1)

SMC = (f11 + f00) / (f01 + f10 + f11 + f00)


= (0+7) / (2+1+0+7) = 0.7

J = (f11) / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0
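
A small numpy sketch reproducing these counts and both coefficients:

```python
# Sketch: SMC and Jaccard for the binary vectors x and y above.
import numpy as np

x = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

f11 = np.sum((x == 1) & (y == 1))   # 0
f00 = np.sum((x == 0) & (y == 0))   # 7
f10 = np.sum((x == 1) & (y == 0))   # 1
f01 = np.sum((x == 0) & (y == 1))   # 2

smc = (f11 + f00) / (f01 + f10 + f11 + f00)   # 0.7
jaccard = f11 / (f01 + f10 + f11)             # 0.0
print(smc, jaccard)
```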


Cosine Similarity
• If d1 and d2 are two document vectors, then
  cos(d1, d2) = <d1, d2> / (||d1|| ||d2||),
  where <d1, d2> indicates the inner product (vector dot product) of the vectors d1 and d2, and ||d|| is the length of vector d.
• Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
<d1, d2> = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
|| d1 || = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)0.5 = (42) 0.5 = 6.481
|| d2 || = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2) 0.5 = (6) 0.5 = 2.449
cos(d1, d2 ) = 0.3150
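
A minimal numpy check of this computation:

```python
# Sketch: verify cos(d1, d2) = 0.3150 for the document vectors above.
import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 4))   # 0.315
```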
Extended Jaccard Coefficient (Tanimoto)
• Variation of Jaccard for continuous or count attributes
• Reduces to Jaccard for binary attributes

Correlation
• Correlation measures the linear relationship between objects
Visually Evaluating Correlation

Figures: scatter plots showing the similarity from −1 to 1.
Drawback of Correlation
• If the correlation is 0, then there is no linear relationship
• But this guarantees nothing about a nonlinear relationship
• x = (−3, −2, −1, 0, 1, 2, 3)
• y = (9, 4, 1, 0, 1, 4, 9), i.e., y_i = x_i^2

• mean(x) = 0, mean(y) = 4
• std(x) = 2.16, std(y) = 3.74

• corr = [(−3)(5) + (−2)(0) + (−1)(−3) + (0)(−4) + (1)(−3) + (2)(0) + (3)(5)] / (6 × 2.16 × 3.74) = 0
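
A two-line numpy check of the example:

```python
# Sketch: a perfect quadratic relationship y = x^2 with zero correlation.
import numpy as np

x = np.array([-3, -2, -1, 0, 1, 2, 3])
y = x ** 2

print(np.corrcoef(x, y)[0, 1])   # ~0, despite y being fully determined by x
```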
Comparison of Proximity Measures
• Domain of application
• Similarity measures tend to be specific to the type of attribute and
data
• Record data, images, graphs, sequences, 3D-protein structure, etc.
tend to have different measures
• However, one can talk about various properties that you
would like a proximity measure to have
• Symmetry is a common one
• Tolerance to noise and outliers is another
• Ability to find more types of patterns
• Many others possible
• The measure must be applicable to the data and produce
results that agree with domain knowledge
Information Based Measures
• Information theory is a well-developed and fundamental discipline with broad applications

• Some similarity measures are based on information theory


• Mutual information in various versions
• Maximal Information Coefficient (MIC) and related measures
• General and can handle non-linear relationships
• Can be complicated and time intensive to compute
Information and Probability
• Information relates to possible outcomes of an event
• transmission of a message, flip of a coin, or measurement of a
piece of data

• The more certain an outcome, the less information that it


contains and vice-versa
• For example, if a coin has two heads, then an outcome of heads
provides no information
• More quantitatively, the information is related to the probability of an outcome
• The smaller the probability of an outcome, the more information it provides
and vice-versa
• Entropy is the commonly used measure
Entropy
• For
  • a variable (event) X,
  • with n possible values (outcomes) x1, x2, …, xn,
  • each outcome having probability p1, p2, …, pn,
  • the entropy of X, H(X), is given by

    H(X) = − Σ_{i=1}^{n} p_i log2(p_i)

• Entropy is between 0 and log2(n) and is measured in bits
  • Thus, entropy is a measure of how many bits it takes to represent an observation of X on average
• The maximum entropy of an image depends on its number of gray scales
  • For example, for an image with 256 gray scales the maximum entropy is log2(256) = 8. The maximum occurs when all bins of the histogram have the same constant value, i.e., the image intensity is uniformly distributed in [0, 255]
Entropy Examples
• For a coin with probability p of heads and probability q = 1 − p of tails

  H = − p log2(p) − q log2(q)

  • For p = 0.5, q = 0.5 (fair coin), H = 1
  • For p = 1 or q = 1, H = 0

• What is the entropy of a fair four-sided die?
Entropy for Sample Data: Example

Hair Color  Count  p     −p log2(p)
Black       75     0.75  0.3113
Brown       15     0.15  0.4105
Blond       5      0.05  0.2161
Red         0      0.00  0
Other       5      0.05  0.2161
Total       100    1.00  1.1540

Maximum entropy is log2(5) = 2.3219
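
A minimal sketch reproducing the 1.1540-bit entropy of this sample:

```python
# Sketch: reproduce H = 1.1540 bits for the hair-color sample above.
import numpy as np

counts = np.array([75, 15, 5, 0, 5])   # Black, Brown, Blond, Red, Other
p = counts / counts.sum()
p = p[p > 0]                           # 0 * log2(0) is taken as 0

H = -np.sum(p * np.log2(p))
print(round(H, 4))                     # 1.154, vs. maximum log2(5) = 2.3219
```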


Entropy for Sample Data
• Suppose we have
  • a number of observations (m) of some attribute X, e.g., the hair color of students in the class,
  • where there are n different possible values,
  • and the number of observations in the ith category is m_i
• Then, for this sample,

  H(X) = − Σ_{i=1}^{n} (m_i / m) log2(m_i / m)

• For continuous data, the calculation is harder
Mutual Information
• Information one variable provides about another
• Formally, I(X, Y) = H(X) + H(Y) − H(X, Y), where H(X, Y) is the joint entropy of X and Y:

  H(X, Y) = − Σ_i Σ_j p_ij log2(p_ij)

  where p_ij is the probability that the ith value of X and the jth value of Y occur together
• For discrete variables, this is easy to compute
• High mutual information indicates a large reduction in uncertainty; low mutual information indicates a small reduction; and zero mutual information between two random variables means the variables are independent
• Maximum mutual information for discrete variables is log2(min(nX, nY)), where nX (nY) is the number of values of X (Y)
Mutual Information Example

Student Status  Count  p     −p log2(p)
Undergrad       45     0.45  0.5184
Grad            55     0.55  0.4744
Total           100    1.00  0.9928

Grade  Count  p     −p log2(p)
A      35     0.35  0.5301
B      50     0.50  0.5000
C      15     0.15  0.4105
Total  100    1.00  1.4406

Student Status  Grade  Count  p     −p log2(p)
Undergrad       A      5      0.05  0.2161
Undergrad       B      30     0.30  0.5211
Undergrad       C      10     0.10  0.3322
Grad            A      30     0.30  0.5211
Grad            B      20     0.20  0.4644
Grad            C      5      0.05  0.2161
Total                  100    1.00  2.2710

Mutual information of Student Status and Grade = 0.9928 + 1.4406 − 2.2710 = 0.1624
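
A short sketch reproducing this result from the joint counts:

```python
# Sketch: mutual information of Student Status and Grade from joint counts.
import numpy as np

# rows: Undergrad, Grad; columns: A, B, C
joint = np.array([[ 5, 30, 10],
                  [30, 20,  5]]) / 100.0

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H_status = entropy(joint.sum(axis=1))   # 0.9928
H_grade  = entropy(joint.sum(axis=0))   # 1.4406
H_joint  = entropy(joint.ravel())       # 2.2710

print(H_status + H_grade - H_joint)     # ~0.1624
```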
Maximal Information Coefficient
• Applies mutual information to two continuous variables
• Consider the possible binnings of the variables into discrete categories
  • nX × nY ≤ N^0.6, where
    • nX is the number of values of X
    • nY is the number of values of Y
    • N is the number of samples (observations, data objects)
• Compute the mutual information
  • Normalized by log2(min(nX, nY))
• Take the highest value

• Reshef, David N., Yakir A. Reshef, Hilary K. Finucane, Sharon R. Grossman, Gilean McVean, Peter J. Turnbaugh, Eric S. Lander, Michael Mitzenmacher, and Pardis C. Sabeti. "Detecting novel associations in large data sets." Science 334, no. 6062 (2011): 1518-1524.
General Approach for Combining Similarities
• Sometimes attributes are of many different types, but an overall similarity is needed (a small numeric sketch follows the steps).
  1. For the kth attribute, compute a similarity, s_k(x, y), in the range [0, 1].
  2. Define an indicator variable, δ_k, for the kth attribute as follows:
     δ_k = 0 if the kth attribute is an asymmetric attribute and both objects have a value of 0, or if one of the objects has a missing value for the kth attribute
     δ_k = 1 otherwise
  3. Compute

     similarity(x, y) = Σ_{k=1}^{n} δ_k s_k(x, y) / Σ_{k=1}^{n} δ_k
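
A minimal numpy sketch of the combination rule; the per-attribute similarities and indicator values are hypothetical, chosen only for illustration:

```python
# Sketch: combine per-attribute similarities, skipping attributes whose
# indicator is 0 (asymmetric 0-0 match or missing value). Values are made up.
import numpy as np

s     = np.array([0.8, 0.5, 1.0, 0.0])  # s_k(x, y) for each attribute k
delta = np.array([1,   1,   0,   1])    # delta_k indicator per attribute

similarity = np.sum(delta * s) / np.sum(delta)
print(similarity)   # (0.8 + 0.5 + 0.0) / 3 = 0.4333...
```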
Using Weights to Combine Similarities
• We may not want to treat all attributes the same
  • Use non-negative weights w_k

  similarity(x, y) = Σ_k w_k δ_k s_k(x, y) / Σ_k w_k δ_k

• Can also define a weighted form of distance:

  d(x, y) = ( Σ_{k=1}^{n} w_k |x_k − y_k|^r )^{1/r}
Density
• Measures the degree to which data objects are close to each other in a
specified area
• The notion of density is closely related to that of proximity
• Concept of density is typically used for clustering and anomaly
detection
• Examples:
• Euclidean density
• Euclidean density = number of points per unit volume
• Probability density
• Estimate what the distribution of the data looks like
• Graph-based density
• Connectivity
Euclidean Density: Grid-based Approach
• Simplest approach is to divide the region into a number of rectangular cells of equal volume and define density as the number of points each cell contains

Figure: grid-based density, with counts for each cell.

Euclidean Density: Center-Based
• Euclidean density is the number of points within a specified radius of the point

Figure: illustration of center-based density.
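
A minimal sketch of both density notions on random 2-D points (the data is illustrative):

```python
# Sketch: two Euclidean density estimates for random 2-D points.
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)
pts = rng.uniform(0, 10, size=(500, 2))

# Grid-based: count points per cell of an equal-volume grid
grid_counts, _, _ = np.histogram2d(pts[:, 0], pts[:, 1], bins=7)

# Center-based: number of other points within radius 1 of each point
center_counts = (cdist(pts, pts) < 1.0).sum(axis=1) - 1

print(grid_counts.max(), center_counts.max())   # densest cell / densest point
```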


Data Preprocessing
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• Attribute Transformation
Aggregation
• Combining two or more attributes (or objects) into a single
attribute (or object)

• Purpose
• Data reduction
• Reduce the number of attributes or objects
• Change of scale
• Cities aggregated into regions, states, countries, etc.
• Days aggregated into weeks, months, or years
• More “stable” data
• Aggregated data tends to have less variability
Example: Precipitation in Australia
• This example is based on precipitation in Australia (Rain or
Snow) from the period 1982 to 1993.
The next slide shows
• A histogram for the standard deviation of average monthly
precipitation for 3,030 0.5◦ by 0.5◦ grid cells in Australia, and
• A histogram for the standard deviation of the average yearly
precipitation for the same locations.
• The average yearly precipitation has less variability than the
average monthly precipitation.
• All precipitation measurements (and their standard
deviations) are in centimeters.
Example: Precipitation in Australia …

Figures: variation of precipitation in Australia - standard deviation of average monthly precipitation, and standard deviation of average yearly precipitation.
Sampling
• Sampling is the main technique employed for data
reduction.
• It is often used for both the preliminary investigation of the data
and the final data analysis.

• Statisticians often sample because obtaining the entire set of


data of interest is too expensive or time consuming.

• Sampling is typically used in data mining because


processing the entire set of data of interest is too expensive
or time consuming.
Sampling …
• The key principle for effective sampling is the following:

• Using a sample will work almost as well as using the entire data
set, if the sample is representative

• A sample is representative if it has approximately the same


properties (of interest) as the original set of data
Sample Size

Figures: the same data set sampled at 8000 points, 2000 points, and 500 points.
Types of Sampling
• Simple Random Sampling
• There is an equal probability of selecting any particular item
• Sampling without replacement
• As each item is selected, it is removed from the population
• Sampling with replacement
• Objects are not removed from the population as they are
selected for the sample.
• In sampling with replacement, the same object can be picked
up more than once
• Stratified sampling
• Split the data into several partitions; then draw random samples
from each partition
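
A minimal pandas sketch of the three schemes; the DataFrame and its stratum column are hypothetical, for illustration:

```python
# Sketch: simple random sampling (with and without replacement) and
# stratified sampling with pandas. The DataFrame is made-up toy data.
import pandas as pd

df = pd.DataFrame({"value": range(1000),
                   "stratum": ["A"] * 500 + ["B"] * 300 + ["C"] * 200})

srs_without = df.sample(n=100, replace=False, random_state=0)
srs_with    = df.sample(n=100, replace=True,  random_state=0)

# Stratified: draw 10% from each partition so group proportions are preserved
stratified = df.groupby("stratum").sample(frac=0.10, random_state=0)
print(stratified["stratum"].value_counts())   # A: 50, B: 30, C: 20
```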
Sample Size
• What sample size is necessary to get at least one
object from each of 10 equal-sized groups.
Curse of Dimensionality
• When dimensionality increases, data becomes increasingly sparse in the space that it occupies
• Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful
• Illustration: randomly generate 500 points, and compute the difference between the maximum and minimum distance between any pair of points
Dimensionality Reduction
• Purpose:
• Avoid curse of dimensionality
• Reduce amount of time and memory required by data mining
algorithms
• Allow data to be more easily visualized
• May help to eliminate irrelevant features or reduce noise

• Techniques
• Principal Components Analysis (PCA)
• Singular Value Decomposition
• Others: supervised and non-linear techniques
Dimensionality Reduction: PCA
• Goal is to find a projection that captures the largest amount of variation in the data

Figure: data points in the (x1, x2) plane with their principal direction of variation.
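
A minimal scikit-learn sketch on synthetic data; the data, the standardization step, and the choice of two components are illustrative assumptions, not from the slides:

```python
# Sketch: project standardized data onto its top principal components.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))           # 200 objects, 10 attributes
X[:, 1] = 2 * X[:, 0] + 0.1 * X[:, 1]    # make two attributes correlated

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(StandardScaler().fit_transform(X))

print(pca.explained_variance_ratio_)     # variance captured per component
```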
Feature Subset Selection
• Another way to reduce dimensionality of data
• Redundant features
• Duplicate much or all of the information contained in one or more
other attributes
• Example: purchase price of a product and the amount of sales tax
paid
• Irrelevant features
• Contain no information that is useful for the data mining task at
hand
• Example: students' ID is often irrelevant to the task of predicting
students' GPA
• Many techniques developed, especially for classification
Feature Creation
• Create new attributes that can capture the important
information in a data set much more efficiently than the
original attributes

• Three general methodologies:


• Feature extraction
• Example: extracting edges from images
• Feature construction
• Example: dividing mass by volume to get density
• Mapping data to new space
• Example: Fourier and wavelet analysis
Mapping Data to a New Space
• Fourier and wavelet transform

Figures: two sine waves + noise (time domain), and its frequency-domain representation.
Discretization
• Discretization is the process of converting a continuous
attribute into an ordinal attribute
• A potentially infinite number of values are mapped into a small
number of categories
• Discretization is commonly used in classification
• Many classification algorithms work best if both the independent
and dependent variables have only a few values
• We give an illustration of the usefulness of discretization using the
Iris data set
Iris Sample Data Set
• Iris plant data set
  • Can be obtained from the UCI Machine Learning Repository
    http://www.ics.uci.edu/~mlearn/MLRepository.html
  • From the statistician R. A. Fisher
• Three flower types (classes):
  • Setosa
  • Versicolour
  • Virginica
• Four (non-class) attributes
  • Sepal width and length
  • Petal width and length

Image: Virginica. Robert H. Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Science Institute.
Discretization: Iris Example

Petal width low or petal length low implies Setosa.


Petal width medium or petal length medium implies Versicolour.
Petal width high or petal length high implies Virginica.
Discretization: Iris Example …
• How can we tell what the best discretization is?
• Unsupervised discretization: find breaks in the data values
  • Example: histogram of petal length counts (x-axis: petal length, 0 to 8; y-axis: counts, 0 to 50)
• Supervised discretization: use class labels to find breaks
Discretization Without Using Class Labels

Data consists of four groups of points and two outliers. Data is one-
dimensional, but a random y component is added to reduce overlap.
Discretization Without Using Class
Labels

Equal interval width approach used to obtain 4 values.


Discretization Without Using Class
Labels

Equal frequency approach used to obtain 4 values.


Discretization Without Using Class
Labels

K-means approach to obtain 4 values.
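
A sketch of the three unsupervised approaches on synthetic data loosely resembling the four-group example above (the data itself is made up):

```python
# Sketch: equal-width and equal-frequency discretization with pandas,
# and K-means binning with scikit-learn, on illustrative data.
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(mu, 0.3, 50) for mu in (0, 4, 8, 12)])

equal_width = pd.cut(x, bins=4, labels=False)   # same interval length per bin
equal_freq  = pd.qcut(x, q=4, labels=False)     # same number of points per bin

kmeans_bins = KBinsDiscretizer(n_bins=4, encode="ordinal",
                               strategy="kmeans").fit_transform(x.reshape(-1, 1))

print(np.bincount(equal_width), np.bincount(equal_freq))
```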


Binarization
• Binarization maps a continuous or categorical attribute into
one or more binary variables

• Typically used for association analysis

• Often convert a continuous attribute to a categorical attribute


and then convert a categorical attribute to a set of binary
attributes
• Association analysis needs asymmetric binary attributes
• Examples: eye color and height measured as
{low, medium, high}
Attribute Transformation
• An attribute transform is a function that maps the entire set
of values of a given attribute to a new set of replacement
values such that each old value can be identified with one of
the new values
• Simple functions: x^k, log(x), e^x, |x|
• Normalization
• Refers to various techniques to adjust to differences among
attributes in terms of frequency of occurrence, mean, variance,
range
• Take out unwanted, common signal, e.g., seasonality
• In statistics, standardization refers to subtracting off the means and
dividing by the standard deviation
