
Data Mining - Data

Pramod Kumar Singh


Professor (Computer Science and Engineering)
ABV – Indian Institute of Information Technology Management Gwalior
Gwalior – 474015, MP, India
Introduction

Because data mining tools and techniques depend heavily on the type of data, it is necessary to understand the data first.

The Types of Data


Data sets differ in a number of ways: in the kinds of attributes they contain (e.g., quantitative, qualitative), in special characteristics (e.g., a time component, correlation among the objects), and so on. New research in data mining is often driven by the need to accommodate new application areas and their new types of data.

The Quality of the Data


Most data is far from perfect. Though most data mining techniques can tolerate some level of imperfection in the data, understanding and improving data quality improves the quality of the resulting analysis. Typical data quality issues are missing values, noise and outliers, inconsistency, duplication, and bias.
Introduction
Preprocessing Steps to Make the Data More Suitable for Data Mining
Raw data usually has to be preprocessed to make it suitable for analysis, with the following two goals.
1. Improve the quality of the data and
2. Modify the data so that it better fits a specified data mining technique or tool.

Analyzing Data in Terms of Its Relationships


One approach to data analysis is to find relationships among the data objects and then perform the
remaining analysis using these relationships rather than the data objects themselves.

For example, we can compute the similarity or dissimilarity between pairs of objects and then
perform the analysis—clustering, classification, or anomaly detection—based on these
similarities or dissimilarities.

There are many such similarity or dissimilarity measures, and the proper choice depends on (i) the
type of data, and (ii) the particular application.
Types of Data

A data set can often be viewed as a collection of data objects. Other names for a data object are
record, point, vector, pattern, event, case, sample, observation, or entity.

These data objects are described by a number of attributes that capture the basic characteristics of an
object. Other names for an attribute are variable, characteristic, field, feature, or dimension.

Example: Often, a data set is a file, in which the objects are records (or rows) and each field (or
column) corresponds to an attribute. The table below shows a data set that consists of student
information. Here, each row corresponds to a student and each column is an attribute that describes
some aspect of a student, e.g., cumulative grade point average (CGPA), identification number (ID).
Table: Student Information
Student ID Year CGPA …
2034625 Freshman 7.8 …
1934364 Sophomore 9.5 …
1737637 Senior 6.8 …
Attributes and Measurements

What Is an attribute?
An attribute is a property or characteristic of an object that may vary, either from one object to
another or from one time to another.

For example, eye color varies from person to person, while the temperature of an object varies
over time. The eye color is a symbolic attribute with a small number of possible values {brown,
black, blue, green, hazel, etc.} whereas the temperature is a numerical attribute with a potentially
unlimited number of values.

However, at the most basic level, attributes are not about numbers or symbols. Rather, we assign
numbers or symbols to them to discuss and analyze the characteristics of objects.

We need a measurement scale to do this in a well-defined way.


Attributes and Measurements

What Is a measurement scale?


A measurement scale is a function that associates a numerical or symbolic value to an attribute of an
object.

The process of measurement is the application of a measurement scale to associate a value with a
particular attribute of a specific object. We do the process of measurement all the time.

For example, we use a bathroom scale to determine our weight, or we classify someone as male or
female. In these cases, the “physical value” of an attribute of an object is mapped to a numerical
or symbolic value.
The Type of an Attribute

The type of an attribute tells us which properties of its values are meaningful and, therefore, whether a particular data analysis technique is appropriate for that attribute.
However, the properties of an attribute need not be the same as the properties of the values used to
measure it. In other words, the values used to represent an attribute may have properties that are not
properties of the attribute itself, and vice versa.
Example: Two attributes that might be associated with an employee are ID and age. Both of these
attributes can be represented as integers.
Though it is reasonable to talk about the average age of an employee, it makes no sense to talk
about the average ID because it simply captures the aspect that each employee is distinct. The
only valid operation on IDs is to test whether they are equal. However, there is no hint of this
limitation when integers are used to represent the employee ID attribute.
For the age attribute, the properties of the integers used to represent age are very much the
properties of the attribute. However, the correspondence is not complete since, for example, ages
have a maximum, while integers do not.
The Different Types of Attributes

Categorical (Qualitative)

Nominal: The values of a nominal attribute are just different names; i.e., nominal values provide only enough information to distinguish one object from another. (=, ≠)
  Examples: zip codes, employee ID numbers, eye color, gender
  Operations: mode, entropy, contingency correlation, χ2 test

Ordinal: The values of an ordinal attribute provide enough information to order objects. (<, >)
  Examples: hardness of minerals, {good, better, best}, grades, street numbers
  Operations: median, percentiles, rank correlation, run tests, sign tests

Numeric (Quantitative)

Interval: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, −)
  Examples: calendar dates, temperature in Celsius or Fahrenheit
  Operations: mean, standard deviation, Pearson’s correlation, t and F tests

Ratio: For ratio attributes, both differences and ratios are meaningful. (*, /)
  Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
  Operations: geometric mean, harmonic mean, percent variation
Each attribute type possesses all of the properties and operations of the attribute types above it.
The Different Types of Attributes

Nominal and ordinal attributes are collectively referred to as categorical or qualitative attributes.
They lack most of the properties of numbers even if they are represented by numbers. They are
treated like symbols.

Interval and ratio are collectively referred to as numeric or quantitative attributes. They are
represented by numbers and have most of the properties of numbers. They can be integer-valued or
continuous.
The Different Types of Attributes
The types of attributes can also be described in terms of transformations that do not change the
meaning of an attribute.
Categorical (Qualitative)

Nominal: Any one-to-one mapping, e.g., a permutation of values.
  Comment: If all employee ID numbers are reassigned, it makes no difference.

Ordinal: An order-preserving change of values, i.e., new value = f(old value), where f is a monotonic function.
  Comment: An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Numeric (Quantitative)

Interval: new value = a ∗ old value + b, where a and b are constants.
  Comment: The Fahrenheit and Celsius temperature scales differ in the location of their zero value and the size of a degree (unit).

Ratio: new value = a ∗ old value.
  Comment: Length can be measured in meters or feet.
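As a small illustration of these transformation rules, the following is a minimal Python sketch (with standard conversion constants and made-up values) of an interval transformation (Celsius to Fahrenheit, new value = a ∗ old value + b) and a ratio transformation (meters to feet, new value = a ∗ old value).

```python
# Sketch: permissible transformations for interval and ratio attributes.

def celsius_to_fahrenheit(c):
    # Interval attribute: new value = a * old value + b (here a = 9/5, b = 32).
    return 9.0 / 5.0 * c + 32.0

def meters_to_feet(m):
    # Ratio attribute: new value = a * old value (here a ≈ 3.28084, no offset).
    return 3.28084 * m

print([celsius_to_fahrenheit(c) for c in [0.0, 20.0, 100.0]])   # [32.0, 68.0, 212.0]
print([meters_to_feet(m) for m in [1.0, 2.0]])                  # [3.28084, 6.56168]
```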
Describing Attributes by the Number of Values
An independent way of distinguishing between attributes is by the number of values they can take.
Discrete: A discrete attribute has a finite set of values. Such attributes can be categorical, e.g., pin
codes, or numeric, e.g., counts. Discrete attributes are often represented using integer variables.
Binary attributes are a special case of discrete attributes and assume only two values, e.g.,
true/false, yes/no, or 0/1. They are often represented as Boolean variables, or as integer
variables that take either of two values 0 or 1.
Continuous: A continuous attribute is one whose values are real numbers. For example,
temperature, height, or weight. They are typically represented as floating-point variables.
In theory, any of the measurement scale types—nominal, ordinal, interval, and ratio—can be
combined with any of the types based on the number of attribute values—binary, discrete, and
continuous. However, some combinations do not make much sense. For instance, it is difficult to
think of a realistic data set that contains a continuous binary attribute.
Typically, nominal and ordinal attributes are binary or discrete, whereas interval and ratio attributes
are continuous. However, count attributes, which are discrete, are also ratio attributes.
Asymmetric Attributes

For asymmetric attributes, only presence—a non-zero attribute value—is regarded as important.
Binary attributes where only non-zero values are important are called asymmetric binary attributes.
This type of attribute is particularly important for association analysis.
Example: Consider a data set where each object is a student and each attribute records whether
or not a student took a particular course at a university. For a specific student, an attribute has a
value of 1 if the student took the course associated with that attribute and a value of 0
otherwise. Because students take only a small fraction of all available courses, most of the values
in such a data set would be 0. Therefore, it is more meaningful and more efficient to focus on the
non-zero values. If students are compared on the basis of the courses they don’t take, then most
students would seem very similar if the number of courses is large.

However, it is also possible to have discrete or continuous asymmetric features. For example, if the number of credits associated with each course is recorded, then the resulting attribute is an asymmetric discrete attribute.
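To illustrate why only the non-zero values should drive comparisons of asymmetric binary attributes, the following minimal sketch contrasts a simple matching count with the Jaccard coefficient (a standard measure for asymmetric binary data, not defined in these slides) on two hypothetical course-taken vectors.

```python
# Sketch: comparing two made-up students on 10 asymmetric binary "took this course" attributes.
student_a = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
student_b = [0, 0, 0, 1, 0, 0, 0, 0, 0, 1]

# Simple matching counts shared 0s (courses neither student took) as agreement.
matches = sum(a == b for a, b in zip(student_a, student_b))
simple_matching = matches / len(student_a)

# Jaccard uses only positions where at least one value is non-zero.
both_one = sum(a == 1 and b == 1 for a, b in zip(student_a, student_b))
either_one = sum(a == 1 or b == 1 for a, b in zip(student_a, student_b))
jaccard = both_one / either_one

print(simple_matching)   # 0.8 -> misleadingly high, dominated by courses neither student took
print(jaccard)           # 0.333... -> based only on courses actually taken
```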
Types of Data Sets
There are many types of data sets. Although not every data set fits these groups, and other groupings are also possible, we group them as record data, graph-based data, and ordered data.

General Characteristics of Data Sets


Three characteristics that apply to many data sets and have a significant impact on the data mining techniques and tools are dimensionality, sparsity, and resolution.

Dimensionality: The dimensionality of a data set is the number of attributes that the objects possess
in the data set. The difficulty associated with analyzing high-dimensional data is referred to as the
curse of dimensionality. Because of this, an important motivation in preprocessing the data is
dimensionality reduction.

Sparsity: For some data sets, such as those with asymmetric features, most attribute values of an object are 0; in many cases, fewer than 1% of the entries are non-zero. For some algorithms (e.g., Naïve Bayes, logistic regression) this is an advantage, because only the non-zero values need to be stored and manipulated, which results in significant savings in computation time and storage. However, for some applications (e.g., recommender systems), sparsity is a major problem.
Types of Data Sets

Resolution: It is frequently possible to obtain data at different levels of resolution, and often the
properties of the data are different at different resolutions.
Example: The surface of the Earth seems very uneven at a resolution of a few meters, but is
relatively smooth at a resolution of tens of kilometers.

Why is it important?
The patterns in the data also depend on the level of resolution. If the resolution is too fine, a pattern
may not be visible or may be buried in noise; if the resolution is too coarse, the pattern may
disappear.
Example: Variations in atmospheric pressure on a scale of hours reflect the movement of storms
and other weather systems. On a scale of months, such phenomena are not detectable.
Types of Data Sets – Record Data

Most of the data mining work assumes that the data set is a collection of records (data objects), each
of which consists of a fixed set of data fields (attributes). In record data, there is no explicit
relationship among records or data fields, and every record (object) has the same set of attributes.
Record data is usually stored either in flat files or in relational databases. Though the relational
databases are more than a collection of records, data mining often does not use any additional
information available in a relational database.

Fig: Record data


Types of Data Sets – Record Data

Transaction or Market Basket Data: Transaction data is a special type of record data, where each
record (transaction) involves a set of items. Consider a grocery store. The set of products purchased
by a customer during one shopping trip constitutes a transaction, while the individual products that
were purchased are the items. This type of data is called market basket data because the items in
each record are the products in a person’s market basket. Transaction data is a collection of sets of
items, but it can be viewed as a set of records whose fields are asymmetric attributes. Most often,
the attributes are binary, indicating whether or not an item was purchased.

Fig: Transaction data


Types of Data Sets – Record Data
The Data Matrix: If all the data objects have the same fixed set of numeric attributes, then the data
objects can be thought of as points (vectors) in a multidimensional space, where each dimension
represents a distinct attribute describing the object. A set of such data objects can be interpreted as
an M x N matrix, where there are M rows, one for each object, and N columns, one for each
attribute. This matrix is called a data matrix or a pattern matrix. A data matrix is a variation of record
data, but because it consists of numeric attributes, standard matrix operations can be applied to
transform and manipulate the data. Therefore, the data matrix is the standard data format for most
statistical data.

Fig: Data matrix
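The following is a minimal sketch (using NumPy, assumed to be available) of a small M x N data matrix and of applying standard matrix operations such as per-attribute means and centering; the values are made up.

```python
import numpy as np

# Sketch: a data matrix with M = 4 objects (rows) and N = 3 numeric attributes (columns).
X = np.array([
    [1.2, 3.4, 0.5],
    [0.8, 2.9, 1.1],
    [1.5, 3.1, 0.7],
    [0.9, 3.6, 0.9],
])

# Standard matrix operations apply directly, e.g., per-attribute means and centering.
column_means = X.mean(axis=0)
centered = X - column_means

print(column_means)        # one mean per attribute (column)
print(centered.shape)      # (4, 3): still an M x N matrix
```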


Types of Data Sets – Record Data
The Sparse Data Matrix: A sparse data matrix is a special case of a data matrix in which the
attributes are of the same type and are asymmetric; i.e., only non-zero values are important.
Transaction data is an example of a sparse data matrix that has only 0–1 entries. Another common
example is document data. In particular, if the order of the terms (words) in a document is ignored,
then a document can be represented as a term vector, where each term is a component (attribute)
of the vector and the value of each component is the number of times the corresponding term
occurs in the document. This representation of a collection of documents is often called a document-
term matrix.

Fig: Document-term matrix
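Below is a minimal, standard-library sketch of building a document-term matrix for a toy collection of documents; the documents themselves are made up.

```python
from collections import Counter

# Sketch: a document-term matrix for a made-up collection, with term order ignored.
docs = [
    "data mining turns data into knowledge",
    "graph data and record data",
    "mining patterns in record data",
]

tokenized = [doc.split() for doc in docs]
vocabulary = sorted(set(term for tokens in tokenized for term in tokens))

# Row i, column j = number of times term j occurs in document i (mostly zeros).
matrix = [[Counter(tokens)[term] for term in vocabulary] for tokens in tokenized]

print(vocabulary)
for row in matrix:
    print(row)
```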


Types of Data Sets – Graph-based Data
A graph can sometimes be a convenient and powerful representation for data. We consider two
specific cases: (1) the graph captures relationships among data objects and (2) the data objects
themselves are represented as graphs.

Data with Relationships among Objects: The relationships among objects frequently convey important information. In such cases, the data is often represented as a graph. In particular, the data objects are mapped to nodes of the graph, while the relationships among objects are captured by the links between objects and link properties, such as direction and weight. For example, in web pages, the links to and from each page provide a great deal of information about the relevance of a Web page to a query.
Fig: Linked Web pages
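A minimal sketch of this idea, representing made-up pages as nodes and directed, weighted links in a plain Python dictionary.

```python
# Sketch: data objects as nodes, relationships as directed, weighted links.
# The page names and weights are purely illustrative.
web_graph = {
    "page_a": {"page_b": 1.0, "page_c": 0.5},   # page_a links to page_b and page_c
    "page_b": {"page_c": 2.0},
    "page_c": {},                               # no outgoing links
}

def in_links(graph, target):
    # Links *to* a page often carry information about its relevance to a query.
    return [source for source, links in graph.items() if target in links]

print(in_links(web_graph, "page_c"))   # ['page_a', 'page_b']
```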
Types of Data Sets – Graph-based Data

Data with Objects That Are Graphs: If objects have structure, that is, the objects contain sub-objects that have relationships, then such objects are frequently represented as graphs. Substructure mining is a branch of data mining that analyzes such data. For example, the structure of chemical compounds can be represented by a graph, where the nodes are atoms and the links between nodes are chemical bonds.
Fig: Benzene Molecule
Types of Data Sets – Sequential (Ordered) Data
For some types of data, the attributes have relationships that involve order in time or space.

Sequential Data: It is also referred to as temporal data and can be thought of as an extension of
record data, where each record has a time associated with it. Consider a retail transaction data set
that also stores the time at which the transaction took place. This time information makes it possible
to find patterns such as “candy sales peak before Halloween.” A time can also be associated with
each attribute. For example, each record could be the purchase history of a customer, with a listing
of items purchased at different times. Using this information, it is possible to find patterns such as
“people who buy DVD players tend to buy DVDs in the period immediately following the purchase.”

Fig: Sequential transaction data


Types of Data Sets – Sequence (Ordered) Data
Sequence Data: It consists of a data set that is a sequence of individual entities, such as a sequence
of words or letters. It is quite similar to sequential data, except that there are no time stamps;
instead, there are positions in an ordered sequence. For example, the genetic information of plants
and animals can be represented in the form of sequences of nucleotides that are known as genes.

Fig: Genomic Sequence data


Types of Data Sets – Time Series (Ordered) Data
Time Series Data: It is a special type of sequential data in which each record is a time series, i.e., a series of measurements taken over time. For example, a financial data set might contain objects that are time series of the daily prices of various stocks. As another example (below), consider a time series of the average monthly temperature of a city over some period. When working with temporal data, it is
important to consider temporal autocorrelation; i.e., if two measurements are close in time, then
the values of those measurements are often very similar.

Fig: Temperature time series


Types of Data Sets – Spatial (Ordered) Data
Spatial Data: Some objects have spatial attributes, such as positions or areas, as well as other types
of attributes. An example of spatial data is weather data (precipitation, temperature, pressure) that is
collected for a variety of geographical locations. An important aspect of spatial data is spatial
autocorrelation; i.e., objects that are physically close tend to be similar in other ways as well. Thus,
two points on the Earth that are close to each other usually have similar values for temperature and
rainfall.

Fig: Sequential Temperature data


Handling Non-Record Data

Most data mining algorithms are designed for record data or its variations, such as transaction data
and data matrices. Record-oriented techniques can be applied to non-record data by extracting
features from data objects and using these features to create a record corresponding to each object.

However, in some cases, it is easy to represent the data in a record format, but this type of
representation does not capture all the information in the data.
Data Quality

Data mining applications are often applied to data that was collected for another purpose, or for
future, but unspecified applications. For that reason, data mining cannot usually take advantage of
the significant benefits of addressing quality issues at the source. Because preventing data quality
problems is typically not an option, data mining focuses on the following.
1. The detection and correction of data quality problems (it is known as data cleaning) and
2. The use of algorithms that can tolerate poor data quality.
Data Quality - Measurement and Data Collection Issues

It is unrealistic to expect that data will be perfect. There may be problems due to human error,
limitations of measuring devices, or flaws in the data collection process.

Values or even entire data objects may be missing. In other cases, there may be spurious or duplicate
objects; i.e., multiple data objects that all correspond to a single object.
For example, there might be two different records for a person who has recently lived at two
different addresses.

Even if all the data is present and looks fine, there may be inconsistencies.
For example, a person has a height of 2 meters, but weighs only 2 kilograms.
Data Quality - Measurement and Data Collection Issues

Measurement and Data Collection Errors

The term measurement error refers to any problem resulting from the measurement process. A
common problem is that the recorded value differs from the true value to some extent.

For continuous attributes, the numerical difference of the measured and true value is called the
error.

The term data collection error refers to errors such as omitting data objects or attribute values, or
inappropriately including a data object.

Both measurement errors and data collection errors can be either systematic or random.
Data Quality - Measurement and Data Collection Issues
Noise and Artifacts
Noise is the random component of a measurement error. It may involve the distortion of a value or the addition of spurious objects.
Because the term noise is often used in connection with data that has a spatial or temporal component, techniques from signal or image processing can be used to reduce noise. This helps to discover patterns (signals) that might otherwise be lost in the noise. However, the elimination of noise is frequently difficult, and much work in data mining focuses on devising robust algorithms that produce acceptable results even when noise is present.
Data errors that are the result of a more deterministic phenomenon are often referred to as artifacts.

Fig: A time series disrupted by random noise. If a bit more noise were added to the time series, its shape would be lost.
Fig: A set of data points before and after some noise points (indicated by ‘+’s) have been added. Notice that some of the noise points are intermixed with the non-noise points.
Data Quality - Measurement and Data Collection Issues
Precision, Bias, and Accuracy

The quality of the measurement process and the resulting data are measured by precision and bias.
(Their definitions assume that the measurements are repeated to calculate a mean (average) value
that serves as an estimate of the true value.)

Precision: The closeness of repeated measurements (of the same quantity) to one another.

Bias: A systematic variation of measurements from the quantity being measured.

Precision is often measured by the standard deviation of a set of values, while bias is measured by
taking the difference between the mean of the set of values and the known value of the quantity
being measured.

It is common to use the more general term, accuracy, to refer to the degree of measurement error
in data.
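A minimal sketch of the two definitions above, using hypothetical repeated measurements of a standard 1.000 g mass.

```python
import statistics

# Sketch: precision and bias for hypothetical repeated measurements of a 1.000 g mass.
measurements = [1.015, 0.990, 1.013, 1.001, 0.986]
true_value = 1.000

precision = statistics.stdev(measurements)            # spread of the repeated measurements
bias = statistics.mean(measurements) - true_value     # systematic deviation from the true value

print(round(precision, 3))   # ~0.013
print(round(bias, 3))        # ~0.001
```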
Data Quality - Measurement and Data Collection Issues

Precision, Bias, and Accuracy

Accuracy: The closeness of measurements to the true value of the quantity being measured.

Accuracy depends on precision and bias, but since it is a general concept, there is no specific
formula for accuracy in terms of these two quantities. One important aspect of accuracy is the use
of significant digits. The goal is to use only as many digits to represent the result of a measurement
or calculation as are justified by the precision of the data.

Issues such as significant digits, precision, bias, and accuracy are sometimes overlooked, but they
are important for data mining. Without some understanding of the accuracy of the data and the
results, an analyst runs the risk of committing serious data analysis blunders.
Data Quality - Measurement and Data Collection Issues

Outliers (Anomalies)
Outliers are either (1) data objects that, in some sense, have characteristics that are different from
most of the other data objects in the data set, or (2) values of an attribute that are unusual with
respect to the typical values for that attribute.

It is important to distinguish between the notions of noise and outliers. Outliers can be legitimate
data objects or values. Thus, unlike noise, outliers may sometimes be of interest.

Missing Values
It is not unusual for an object to be missing one or more attribute values. There are various possible reasons, e.g., the information was not collected, or some attributes are not applicable to all objects. In any case, missing values should be handled carefully during data analysis.

There are several strategies (and variations on these strategies) for dealing with missing data, each
of which may be appropriate in certain circumstances.
Data Quality - Measurement and Data Collection Issues
Ways to handle the missing values
Eliminate Data Objects or Attributes: A simple and effective strategy is to eliminate objects with
missing values. However, even a partially specified data object contains some information, and if
many objects have missing values, then a reliable analysis can be difficult or impossible.
Nonetheless, if a data set has only a few objects that have missing values, then it may be expedient
to omit them. A related strategy is to eliminate attributes that have missing values. This should be
done with caution, however, since the eliminated attributes may be the ones that are critical to the
analysis.
Estimate Missing Values: Sometimes missing data can be reliably estimated. For example:
i. In time series data, the missing values can be estimated (interpolated) by using the remaining
values.
ii. If the data set has many similar data points, the attribute values of the points closest to the point with the missing value are often used to estimate the missing value.
a) If the attribute is continuous, then the average attribute value of the nearest neighbors is
used.
b) If the attribute is categorical, then the most commonly occurring attribute value can be
taken.
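A minimal sketch of the nearest-neighbour estimates described in (ii); the neighbour values and the choice of k = 3 are hypothetical.

```python
import statistics

# Sketch: estimating a missing value from the k nearest neighbours (toy values, k = 3).
neighbour_heights = [172.0, 169.5, 175.0]                  # continuous attribute of the 3 neighbours
neighbour_years = ["Sophomore", "Junior", "Sophomore"]     # categorical attribute of the 3 neighbours

estimated_height = statistics.mean(neighbour_heights)      # continuous: use the average
estimated_year = statistics.mode(neighbour_years)          # categorical: use the most common value

print(round(estimated_height, 1))   # 172.2
print(estimated_year)               # 'Sophomore'
```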
Data Quality - Measurement and Data Collection Issues

Ways to handle the missing values

Ignore the Missing Value during Analysis: Many data mining approaches can be modified to ignore
missing values. For example, suppose that objects are being clustered and the similarity between
pairs of data objects needs to be calculated. If one or both objects of a pair have missing values for
some attributes, then the similarity can be calculated by using only the attributes that do not have
missing values. It is true that the similarity will only be approximate, but unless the total number of
attributes is small or the number of missing values is high, this degree of inaccuracy may not matter
much. Likewise, many classification schemes can be modified to work with missing values.
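A minimal sketch of this idea for a distance-like dissimilarity computed only over attributes present in both objects; the vectors and the rescaling by the fraction of usable attributes are illustrative choices, not a prescribed formula.

```python
import math

# Sketch: a distance computed only over attributes present in both objects.
# None marks a missing value; the vectors and the rescaling are illustrative.
x = [2.0, None, 4.0, 1.0]
y = [1.0, 3.5, None, 2.0]

pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
scale = len(x) / len(pairs)          # compensate for the attributes that were skipped
distance = math.sqrt(scale * sum((a - b) ** 2 for a, b in pairs))

print(len(pairs), distance)          # 2 usable attributes, distance 2.0
```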
Data Quality - Measurement and Data Collection Issues

Inconsistent Values
Data can contain inconsistent values. Consider an address field, where both a zip code and city are
listed, but the specified zip code area is not contained in that city.

It is important to detect and, if possible, correct such problems.

Some types of inconsistencies are easy to detect. For instance, a person’s height should not be
negative.

In other cases, it can be necessary to consult an external source of information. The correction of an
inconsistency requires additional or redundant information.
Data Quality - Measurement and Data Collection Issues
Duplicate Data

A data set may include data objects that are duplicates, or almost duplicates, of one another.
Deduplication is the process of dealing with these issues. The following two main issues must be addressed during deduplication.

(1) If there are two (data) objects that actually represent a single object (entity in the real-world),
then the values of corresponding attributes may differ, and these inconsistent values must be
resolved.

(2) Care needs to be taken to avoid accidentally combining data objects that are similar, but not
duplicates, e.g., two distinct people with identical names.

In some cases, two or more objects are identical with respect to the attributes measured by the
database, but they still represent different objects. Here, the duplicates are legitimate, but may still
cause problems for some algorithms if the possibility of identical objects is not specifically
accounted for in their design.
Data Quality - Issues Related to Applications
Data quality issues can also be considered from an application viewpoint, as expressed by the statement “data is of high quality if it is suitable for its intended use.”

Timeliness: Some data starts to age as soon as it has been collected, e.g., purchasing behavior of
customers, web browsing patterns. If the data is out of date, then so are the models and patterns
that are based on it.

Relevance: The available data must contain the information necessary for the application. For example, a model that predicts the accident rate of drivers, built from a data set that omits the age and gender of the drivers, is not of much use.

Making sure that the objects in a data set are relevant is also challenging. A common problem is
sampling bias, which occurs when a sample does not contain different types of objects in proportion
to their actual occurrence in the population. Because the results of a data analysis can reflect only
the data that is present, sampling bias results in an erroneous analysis.
Data Quality - Issues Related to Applications

Knowledge about the Data: Ideally, data sets are accompanied by documentation that describes
different aspects of the data; the quality of this documentation can either aid or hinder the
subsequent analysis.
For example, if the documentation identifies several attributes as being strongly related, these
attributes are likely to provide highly redundant information, and we may decide to keep just
one. (Consider sales tax and purchase price.) If the documentation is poor, however, and fails to
tell us, for example, that the missing values for a particular field are indicated with a -9999, then
our analysis of the data may be faulty.

Other important characteristics are the precision of the data, the type of features (nominal, ordinal,
interval, ratio), the scale of measurement (e.g., meters or feet for length), and the origin of the data.
Data Preprocessing

Data preprocessing is applied to make the data more suitable for data mining. It consists of a
number of different strategies and techniques that are interrelated in complex ways. The most
important ideas and approaches are as follows.
• Aggregation
• Sampling
• Dimensionality reduction
• Feature subset selection
• Feature creation
• Discretization and binarization
• Variable transformation

Roughly speaking, these items fall into two categories: selecting data objects and attributes for the
analysis or creating/changing the attributes. In both cases the goal is to improve the data mining
analysis with respect to time, cost, and quality.
Data Preprocessing - Aggregation
Sometimes less is more and this is the case with aggregation - the combining of two or more objects
into a single object. An obvious issue is how to combine the values of each attribute of all the
records. Quantitative attributes are typically aggregated by taking a sum or an average, while qualitative attributes are either omitted or summarized.

There are several motivations for aggregation.


(1) The smaller data sets resulting from data reduction require less memory and processing time.
Hence, aggregation may permit the use of more expensive data mining algorithms.
(2) Aggregation can act as a change of scope or scale by providing a high-level view of the data
instead of a low-level view.
(3) The behavior of groups of objects or attributes is often more stable than that of individual
objects or attributes. For example, averages or totals have less variability than the individual objects
being aggregated.

However, a disadvantage of aggregation is the potential loss of interesting details.
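A minimal sketch of aggregation on made-up transaction records, summing the quantitative attribute and omitting the qualitative one.

```python
from collections import defaultdict

# Sketch: aggregating made-up daily transactions into one record per store.
transactions = [
    {"store": "S1", "item": "milk",  "amount": 40.0},
    {"store": "S1", "item": "bread", "amount": 25.0},
    {"store": "S2", "item": "milk",  "amount": 60.0},
]

totals = defaultdict(float)
for record in transactions:
    # The quantitative attribute "amount" is summed; the qualitative "item" is omitted.
    totals[record["store"]] += record["amount"]

print(dict(totals))   # {'S1': 65.0, 'S2': 60.0}
```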


Data Preprocessing - Sampling

Sampling is a commonly used approach for selecting a subset of the data objects to be analyzed. It is
done because it is too expensive or time consuming to process all the data. In some cases, using a
sampling algorithm can reduce the data size to the point where a better, but more expensive
algorithm can be used.

The key principle for effective sampling is that the sample should be a representative of the entire
data set. A sample is representative if it has approximately the same property (of interest) as the
original set of data.

Because sampling is a statistical process, the representativeness of any particular sample will vary,
and the best that we can do is choose a sampling scheme that guarantees a high probability of
getting a representative sample. It involves choosing the appropriate sample size and sampling
techniques.
Data Preprocessing – Sampling … Contd
Sampling Approaches
There are two variations of sampling: (1) sampling without replacement — as each item is selected,
it is removed from the set of all objects that together constitute the population, and (2) sampling
with replacement — objects are not removed from the population as they are selected for the
sample; hence, the same object can be picked more than once. The samples produced by the two
methods are not much different when sample size is relatively small compared to the data set size.

The simplest type of sampling is simple random sampling. Here, there is an equal probability of
selecting any particular item. When the population consists of different types of objects, with widely
different numbers of objects, simple random sampling can fail to adequately represent those types
of objects that are less frequent. It can cause problems when the analysis requires proper
representation of all object types.

Hence, a sampling scheme that can accommodate differing frequencies for the items of interest is
needed. Stratified sampling is one such method. In the simplest version, equal numbers of objects
are drawn from each group even though the groups are of different sizes. In another variation, the
number of objects drawn from each group is proportional to the size of that group.
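A minimal sketch of these sampling variants using the Python standard library; the population and the group labels are illustrative.

```python
import random
from collections import defaultdict

random.seed(0)
population = list(range(100))                         # 100 toy objects

without_replacement = random.sample(population, 10)   # each object can be picked at most once
with_replacement = random.choices(population, k=10)   # the same object may be picked repeatedly

# Stratified sampling (simplest version): draw an equal number of objects from each group.
groups = defaultdict(list)
for obj in population:
    groups["rare" if obj < 10 else "common"].append(obj)   # illustrative group labels
stratified = [obj for members in groups.values() for obj in random.sample(members, 5)]

print(len(without_replacement), len(with_replacement), len(stratified))   # 10 10 10
```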
Data Preprocessing – Sampling … Contd
Once a sampling technique is selected, it is necessary to choose the sample size. Larger sample sizes
increase the probability that a sample will be representative, but they also eliminate much of the
advantage of sampling. Conversely, with smaller sample sizes, patterns may be missed or erroneous
patterns can be detected.

Figure (a) shows a data set that contains 8000 two-dimensional points, while Figures (b) and (c) show samples from this data set of size
2000 and 500, respectively. Although most of the structure of this data set is present in the sample of 2000 points, much of the structure
is missing in the sample of 500 points.
Data Preprocessing – Sampling … Contd

Progressive Sampling: The proper sample size can be difficult to determine, so adaptive or
progressive sampling schemes are sometimes used. These approaches start with a small sample,
and then increase the sample size until a sample of sufficient size has been obtained. While this
technique eliminates the need to determine the correct sample size initially, it requires that there be
a way to evaluate the sample to judge if it is large enough.
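A minimal sketch of the progressive-sampling loop; evaluate_sample is a hypothetical placeholder for whatever adequacy check the application would actually use.

```python
import random

def evaluate_sample(sample):
    # Hypothetical adequacy check, used only for illustration: accept once the sample
    # reaches 400 objects. A real check would evaluate model quality, stability, etc.
    return len(sample) >= 400

def progressive_sample(population, initial_size=50, growth=2.0):
    size = initial_size
    while True:
        sample = random.sample(population, min(size, len(population)))
        if evaluate_sample(sample) or size >= len(population):
            return sample
        size = int(size * growth)      # not large enough: grow the sample and retry

print(len(progressive_sample(list(range(10000)))))   # 400 with the placeholder check above
```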
Data Preprocessing – Dimensionality Reduction
There are a variety of benefits to dimensionality reduction.
1) Many data mining algorithms work better if the dimensionality — the number of attributes in
the data — is lower. This is because dimensionality reduction can eliminate irrelevant features, reduce noise, and mitigate the curse of dimensionality.
2) It can lead to a more understandable model because the model may involve fewer attributes.
3) It may allow easy data visualization. Even if dimensionality reduction doesn’t reduce the data to
two or three dimensions, data is often visualized by looking at pairs or triplets of attributes, and
the number of such combinations is greatly reduced.
4) The amount of time and memory required by the data mining algorithm is reduced with a
reduction in dimensionality.
The term dimensionality reduction is often reserved for those techniques that reduce the
dimensionality of a data set by creating new attributes that are a combination of the old attributes.
The reduction of dimensionality by selecting new attributes that are a subset of the old is known as
feature subset selection or feature selection.
Data Preprocessing – Dimensionality Reduction

The Curse of Dimensionality

The curse of dimensionality refers to the phenomenon that many types of data analysis become
significantly harder as the dimensionality of the data increases. Specifically, as dimensionality
increases, the data becomes increasingly sparse in the space that it occupies.

For classification, this can mean that there are not enough data objects to allow the creation of a
model that reliably assigns a class to all possible objects. For clustering, the definitions of density
and the distance between points, which are critical for clustering, become less meaningful.

Principal Components Analysis (PCA) and Singular Value Decomposition (SVD) are the most common approaches for dimensionality reduction of continuous data. They use linear algebra to project the
data from a high-dimensional space into a lower-dimensional space.
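A minimal sketch of PCA computed via NumPy's SVD on illustrative random data, projecting the objects onto the first k principal components.

```python
import numpy as np

# Sketch: PCA via SVD, projecting illustrative random data onto k = 2 principal components.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))             # 100 objects, 5 attributes

X_centered = X - X.mean(axis=0)           # centre each attribute at zero
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
k = 2
X_reduced = X_centered @ Vt[:k].T         # coordinates in the lower-dimensional space

print(X_reduced.shape)                    # (100, 2)
```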
Data Preprocessing – Feature Subset Selection
Another way to reduce the dimensionality is to use only a subset of the features. It may seem that
such an approach will lose information but this is not the case if redundant and irrelevant features
are present.
Redundant features duplicate much or all of the information contained in one or more other
attributes. Irrelevant features contain almost no useful information for the data mining task at
hand. Such features can reduce classification accuracy and the quality of the clusters.
While some of these attributes can be eliminated immediately by using common sense or domain
knowledge, selecting the best subset of features frequently requires a systematic approach.
The ideal approach to feature selection is to try all possible subsets of features as input to the data
mining algorithm, and then take the subset that produces the best results. This method has the
advantage of reflecting the objective and bias of the data mining algorithm that will eventually be
used.
Unfortunately, since the number of subsets of n attributes is 2^n, such an approach is impractical in most situations and alternative strategies are needed. There are three standard approaches to feature selection: embedded, filter, and wrapper.
Data Preprocessing – Feature Subset Selection … Contd

Embedded approaches
Feature selection occurs naturally as part of the data mining algorithm. Specifically, during the
operation of the data mining algorithm, the algorithm itself decides which attributes to use and
which to ignore.

Filter approaches
Features are selected before the data mining algorithm is run, using some approach that is
independent of the data mining task. For example, we might select sets of attributes whose pairwise
correlation is as low as possible.

Wrapper approaches
These methods use the target data mining algorithm as a black box to find the best subset of
attributes but typically without enumerating all possible subsets.
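A minimal sketch of the filter idea mentioned above (keeping attributes whose pairwise correlation stays low); the data, the injected redundant attribute, and the 0.7 threshold are all illustrative.

```python
import numpy as np

# Sketch of a filter approach: greedily keep attributes whose absolute pairwise
# correlation with every already-selected attribute stays below a chosen threshold.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))                                 # 200 objects, 6 attributes
X[:, 3] = 0.95 * X[:, 0] + rng.normal(scale=0.1, size=200)    # attribute 3 duplicates attribute 0

corr = np.abs(np.corrcoef(X, rowvar=False))                   # 6 x 6 correlation matrix
selected = []
for j in range(X.shape[1]):
    if all(corr[j, s] < 0.7 for s in selected):
        selected.append(j)

print(selected)   # attribute 3 is dropped as redundant
```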
Data Preprocessing – Feature Subset Selection … Contd
An Architecture for Feature Subset Selection

It is possible to encompass both the filter and wrapper approaches within a common architecture.
The feature selection process is viewed as consisting of four parts: (i) a measure for evaluating a
subset, (ii) a search strategy that controls the generation of a new subset of features, (iii) a
stopping criterion, and (iv) a validation procedure.

Filter methods and wrapper methods differ only in the way in which they evaluate a subset of
features. For a wrapper method, subset evaluation uses the target data mining algorithm, while for
a filter approach, the evaluation technique is distinct from the target data mining algorithm.

Conceptually, feature subset selection is a search over all possible subsets of features. Many
different types of search strategies can be used, but the search strategy should be computationally
inexpensive and should find optimal or near optimal sets of features. It is usually not possible to
satisfy both requirements, and thus, tradeoffs are necessary.
Data Preprocessing – Feature Subset Selection … Contd

Fig: Flowchart of a feature subset selection process


Data Preprocessing – Feature Subset Selection … Contd
Search Strategy
Basic heuristic methods of attribute subset selection are as follows.
• Stepwise forward selection: The procedure starts with an empty set of attributes as the reduced set. The best of the original attributes is determined and added to the reduced set. At each subsequent iteration or step, the best of the remaining original attributes is added to the set (a sketch of this procedure follows the list).
• Stepwise backward elimination: The procedure starts with the full set of attributes. At each step,
it removes the worst attribute remaining in the set.
• Combination of forward selection and backward elimination: The stepwise forward selection
and backward elimination methods can be combined so that, at each step, the procedure selects
the best attribute and removes the worst from among the remaining attributes.
• Decision tree induction: Decision tree algorithms were originally intended for classification.
When decision tree induction is used for attribute subset selection, a tree is constructed from the
given data. All attributes that do not appear in the tree are assumed to be irrelevant. The set of
attributes appearing in the tree form the reduced subset of attributes.
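A minimal sketch of stepwise forward selection (referenced in the first bullet above). Here evaluate_subset is a hypothetical stand-in for the subset evaluation step, e.g., running the target data mining algorithm in a wrapper approach; the attribute names and scores are made up.

```python
# Sketch: stepwise forward selection with a hypothetical black-box evaluation.
def evaluate_subset(subset):
    # Made-up scores; a real evaluation would train and score a model on `subset`.
    scores = {frozenset(): 0.0, frozenset({"a"}): 0.6, frozenset({"b"}): 0.5,
              frozenset({"a", "b"}): 0.7, frozenset({"a", "c"}): 0.65,
              frozenset({"a", "b", "c"}): 0.69}
    return scores.get(frozenset(subset), 0.0)

def forward_selection(attributes):
    selected = set()
    while len(selected) < len(attributes):
        candidates = [(evaluate_subset(selected | {a}), a)
                      for a in attributes if a not in selected]
        best_score, best_attr = max(candidates)
        if best_score <= evaluate_subset(selected):   # stop when no candidate improves the score
            break
        selected.add(best_attr)
    return selected

print(forward_selection(["a", "b", "c"]))   # {'a', 'b'} with the scores above
```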
Data Preprocessing – Feature Subset Selection … Contd
Search Strategy – Pictorial Representation

Figure: Greedy (heuristic) methods for attribute subset selection.


Data Preprocessing – Feature Subset Selection … Contd

Evaluation Strategy

An integral part of the search is an evaluation step to judge how the current subset of features
compares to others that have been considered.

This requires an evaluation measure that attempts to determine the goodness of a subset of
attributes with respect to a particular data mining task, such as classification or clustering.

For the filter approach, such measures attempt to predict how well the actual data mining algorithm
will perform on a given set of attributes.

For the wrapper approach, where evaluation consists of actually running the target data mining
application, the subset evaluation function is simply the criterion normally used to measure the
result of the data mining.
Data Preprocessing – Feature Subset Selection … Contd

Stopping Criteria

Because the number of subsets can be enormous and it is impractical to examine them all, some
sort of stopping criterion is necessary.

This strategy is usually based on one or more conditions involving the following:
• The number of iterations,
• Whether the value of the subset evaluation measure is optimal or exceeds a certain threshold,
• Whether a subset of a certain size has been obtained,
• Whether simultaneous size and evaluation criteria have been achieved, and
• Whether any improvement can be achieved by the options available to the search strategy.
Data Preprocessing – Feature Subset Selection … Contd

Validation
Finally, once a subset of features has been selected, the results of the target data mining algorithm
on the selected subset should be validated.

A straightforward evaluation approach is to run the algorithm with the full set of features and
compare the full results to results obtained using the subset of features. Hopefully, the subset of
features will produce results that are better than or almost as good as those produced when using
all features.

Another validation approach is to use a number of different feature selection algorithms to obtain
subsets of features and then compare the results of running the data mining algorithm on each
subset.
Data Preprocessing – Feature Subset Selection … Contd
Feature Weighting

Feature weighting is an alternative to keeping or eliminating features.

More important features are assigned a higher weight, while less important features are given a
lower weight.

These weights are sometimes assigned based on domain knowledge about the relative importance
of features.

Alternatively, they may be determined automatically. For example, some classification schemes,
such as support vector machines, produce classification models in which each feature is given a
weight.

Features with larger weights play a more important role in the model.
Data Preprocessing – Feature Creation

It is often possible to create, from the original attributes, a new and smaller set of attributes that captures the important information in the data much more effectively. This allows us to reap all the
previously described benefits of dimensionality reduction. Three popular methodologies for creating
new attributes are: (i) feature extraction, (ii) mapping the data to a new space, and (iii) feature
construction.
Data Preprocessing – Feature Creation … Contd

Feature Extraction

The creation of a new set of features from the original raw data is known as feature extraction.

Example: Consider a set of photographs, where each photograph is to be classified according to whether or not it contains a human face. The raw data is a set of pixels, and as such, is not suitable
for many types of classification algorithms. However, if the data is processed to provide higher level
features, such as the presence or absence of certain types of edges and areas that are highly
correlated with the presence of human faces, then a much broader set of classification techniques
can be applied to this problem.

Unfortunately, feature extraction is highly domain-specific. Therefore, whenever data mining is applied to a relatively new area, a key task is the development of new features and feature
extraction methods.
Data Preprocessing – Feature Creation … Contd

Mapping the Data to a New Space

A totally different view of the data can reveal important and interesting features.

Example: Consider time series data, which often contains periodic patterns. If there is only a single
periodic pattern and not much noise, then the pattern is easily detected. However, if there are a
number of periodic patterns and a significant amount of noise is present, then these patterns are
hard to detect. Nonetheless, such patterns can be detected by applying a Fourier transform to the
time series to change to a representation in which frequency information is explicit (because, for
each time series, the Fourier transform produces a new data object whose attributes are related to
frequencies).

Many other sorts of transformations are also possible. Besides the Fourier transform, the wavelet
transform has also proven very useful for time series and other types of data.
Data Preprocessing – Feature Creation … Contd
Mapping the Data to a New Space

Figure: (b) is the sum of three other time series, two of which are shown in (a) and the third time series is random
noise. (c) shows the power spectrum computed after applying a Fourier transform to the original time series.
Data Preprocessing – Feature Creation … Contd
Feature Construction

Sometimes the features in the original data sets have the necessary information, but it is not in a
form suitable for the data mining algorithm. In this situation, one or more new features constructed
out of the original features can be more useful than the original features.

Example: Consider a data set consisting of information about historical artifacts, which, along with
other information, contains the volume and mass of each artifact. For simplicity, assume that these
artifacts are made of a small number of materials (wood, clay, bronze, gold) and that we want to
classify the artifacts with respect to the material of which they are made. In this case, a density
feature constructed from the mass and volume features, i.e., density = mass/volume, would most
directly yield an accurate classification.

Though there have been some attempts to automatically perform feature construction by exploring
simple mathematical combinations of existing attributes, the most common approach is to construct
features using domain expertise.
Data Preprocessing – Discretization and Binarization
Some data mining algorithms require the data to be in a specific format. For example, some classification algorithms require the data to be in the form of categorical attributes, while algorithms that find association patterns require the data to be in the form of binary attributes. Thus, it is
often necessary to transform a continuous attribute into a categorical attribute (discretization), and
both continuous and discrete attributes may need to be transformed into one or more binary
attributes (binarization).

Additionally, if a categorical attribute has a large number of values (categories), or some values
occur infrequently, then it may be beneficial for certain data mining tasks to reduce the number of
categories by combining some of the values.

As with feature selection, the best discretization and binarization approach is the one that produces
the best result for the data mining algorithm that will be used to analyze the data. However, it is
not practical to apply such a criterion directly. Therefore, discretization or binarization is performed
in a way that satisfies a criterion that is thought to have a relationship to good performance for the
considered data mining task.
Data Preprocessing – Discretization and Binarization … Contd
Binarization
A simple technique to binarize a categorical attribute is the following: If there are m categorical
values, then uniquely assign each original value to an integer in the interval [0, m−1].
If the attribute is ordinal, then order must be maintained by the assignment.
Even if the attribute is originally represented using integers, this process is necessary if the integers
are not in the interval [0,m−1].
Next, convert each of these m integers to a binary number. Since n = ⌈log2(m)⌉ binary digits are required to represent these integers, represent these binary numbers using n binary attributes.

Example: a categorical variable with 5 values {awful, poor, OK, good, great} would require three binary variables x1, x2, and x3.

Categorical Value   Integer Value   x1   x2   x3
awful               0               0    0    0
poor                1               0    0    1
OK                  2               0    1    0
good                3               0    1    1
great               4               1    0    0
Data Preprocessing – Discretization and Binarization … Contd
Binarization
However, the previous transformation can cause complications, such as creating unintended relationships among the transformed attributes. Additionally, association analysis requires asymmetric binary
attributes, where only the presence of the attribute (value = 1) is important. For association
problems, it is therefore necessary to introduce one binary attribute for each categorical value.
Categorical Value   Integer Value   x1   x2   x3   x4   x5
awful               0               1    0    0    0    0
poor                1               0    1    0    0    0
OK                  2               0    0    1    0    0
good                3               0    0    0    1    0
great               4               0    0    0    0    1

If the number of resulting attributes is too large, then the techniques that reduce the number of
categorical values before binarization can be used.
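A minimal sketch of both encodings described above, using the same five ordered categories: the compact encoding with n = ⌈log2(m)⌉ binary attributes and the one-attribute-per-value encoding that yields asymmetric binary attributes.

```python
import math

# Sketch of both binarization schemes for the ordered categories used in the tables above.
values = ["awful", "poor", "OK", "good", "great"]          # m = 5 categorical values
index = {v: i for i, v in enumerate(values)}               # integers 0..m-1, order preserved
n_bits = math.ceil(math.log2(len(values)))                 # n = ceil(log2(m)) = 3

def compact_binary(value):
    # Compact encoding: the integer code written with n binary attributes x1..xn.
    code = index[value]
    return [(code >> bit) & 1 for bit in reversed(range(n_bits))]

def one_hot(value):
    # One asymmetric binary attribute per categorical value (as needed for association analysis).
    return [1 if v == value else 0 for v in values]

print(compact_binary("good"))   # [0, 1, 1]
print(one_hot("good"))          # [0, 0, 0, 1, 0]
```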
Data Preprocessing – Discretization and Binarization … Contd
Discretization of Continuous Attributes
Discretization is primarily applied to attributes that are used in classification or association analysis.
In general, the best discretization depends on the algorithm being used, as well as on the other attributes being considered. In practice, however, the discretization of an attribute is usually performed in isolation.
Transformation of a continuous attribute to a categorical attribute involves two subtasks:
• deciding how many categories to use and
• determining how to map the values of the continuous attribute to these categories.
In the first step, after the values of the continuous attribute are sorted, they are then divided into n
intervals by specifying n−1 split points. In the second, all the values in one interval are mapped to
the same categorical value.
Therefore, the problem of discretization is one of deciding how many split points to choose and
where to place them. The result can be represented either as a set of intervals {(x0, x1], (x1, x2], . . .
, (xn−1, xn)}, where x0 and xn may be −∞ and +∞, respectively, or equivalently, as a series of inequalities x0 < x ≤ x1, . . . , xn−1 < x < xn.
Data Preprocessing – Discretization and Binarization … Contd

Unsupervised Discretization

A basic distinction between discretization methods for classification is whether class information is
used (supervised) or not (unsupervised).

If class information is not used, then relatively simple approaches are common.
• The equal width approach divides the range of the attribute into a user-specified number of
intervals each having the same width. However, it can be badly affected by outliers.
• The equal frequency (equal depth) approach tries to put the same number of objects into each interval. It is often preferred because it mitigates the above problem. (Both approaches are sketched after this list.)
• A clustering method, e.g., K-means, can also be used for unsupervised discretization.
• Visually inspecting the data can sometimes be an effective approach for unsupervised
discretization.
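A minimal sketch of the equal width and equal frequency approaches on skewed, made-up data (NumPy assumed available).

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.exponential(scale=10.0, size=1000)    # skewed, made-up data with a long tail
n_bins = 4

# Equal width: intervals of identical width over the attribute's range (outlier-sensitive).
width_edges = np.linspace(values.min(), values.max(), n_bins + 1)
width_labels = np.digitize(values, width_edges[1:-1])      # categorical codes 0..n_bins-1

# Equal frequency: split points chosen so each interval holds about the same number of values.
freq_edges = np.quantile(values, np.linspace(0, 1, n_bins + 1))
freq_labels = np.digitize(values, freq_edges[1:-1])

print(np.bincount(width_labels))   # very uneven counts: the long tail empties the upper bins
print(np.bincount(freq_labels))    # roughly 250 values per interval
```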
Data Preprocessing – Discretization and Binarization … Contd

Supervised Discretization

However, discretization methods that use class labels often produce better results. Unsupervised discretization can perform poorly because an interval constructed with no knowledge of class labels often contains a mixture of classes.

A conceptually simple approach is to place the splits in a way that maximizes the purity of the intervals. In practice, this requires (1) a way to judge the purity of an interval and (2) a minimum interval size. For this reason, some statistics-based approaches start with each attribute value as a separate interval and create larger intervals by merging adjacent intervals that are similar according to a statistical test.

Entropy-based approaches are one of the most promising approaches to discretization.


Data Preprocessing – Discretization and Binarization … Contd
Supervised Discretization
Entropy: Let k be the number of different class labels, mi be the number of values in the ith interval of a partition, and mij be the number of values of class j in interval i. Then the entropy ei of the ith interval is given by the equation

    ei = − Σ_{j=1..k} pij log2(pij)

Here, pij = mij/mi is the probability (fraction of values) of class j in the ith interval. The total entropy (e) of the partition is the weighted average of the individual interval entropies, i.e.,

    e = Σ_{i=1..n} wi ei

Here, m is the total number of values, wi = mi/m is the fraction of values in the ith interval, and n is the number of intervals. Hence, the entropy of an interval is a measure of the purity of an interval.
If an interval contains only values of one class (is perfectly pure), then the entropy is 0 and it
contributes nothing to the overall entropy. If the classes of values in an interval occur equally often
(the interval is as impure as possible), then the entropy is maximum.
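A minimal sketch of these two formulas; the per-interval class counts are made up.

```python
import math

def interval_entropy(class_counts):
    # ei = -sum_j pij * log2(pij), with pij = mij / mi (0 log 0 treated as 0).
    m_i = sum(class_counts)
    return -sum((c / m_i) * math.log2(c / m_i) for c in class_counts if c > 0)

def partition_entropy(intervals):
    # e = sum_i wi * ei, with wi = mi / m.
    m = sum(sum(counts) for counts in intervals)
    return sum((sum(counts) / m) * interval_entropy(counts) for counts in intervals)

# Made-up partition: per-interval counts [mi1, mi2] for two classes.
intervals = [[9, 1], [2, 8]]
print(interval_entropy([10, 0]))           # 0.0 (printed as -0.0): a perfectly pure interval
print(interval_entropy([5, 5]))            # 1.0: as impure as possible for two classes
print(round(partition_entropy(intervals), 3))
```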
Data Preprocessing – Discretization and Binarization … Contd

Supervised Discretization

A simple approach for partitioning a continuous attribute starts by bisecting the initial values so that
the resulting two intervals give minimum entropy. Since each interval contains an ordered set of values,
it is enough to consider each value as a possible split point. The splitting process is then repeated with
another interval, typically choosing the interval with the worst (highest) entropy, until a user-
specified number of intervals is reached or a stopping criterion is satisfied.

There are two aspects of discretization.


• Discretizing each attribute separately often guarantees suboptimal results.
• It is desirable to have a stopping criterion that automatically finds the right number of partitions.
Data Preprocessing – Discretization and Binarization … Contd

Categorical Attributes with Too Many Values

Categorical attributes can sometimes have too many values.

If the categorical attribute is an ordinal attribute, then techniques similar to those for continuous
attributes can be used to reduce the number of categories.

If the categorical attribute is nominal, however, it is better to use domain knowledge to group the
values. If domain knowledge does not serve the purpose or results in poor classification performance,
then a more empirical approach is necessary, such as grouping values together only if the grouping
improves classification accuracy or achieves some other data mining objective.
Data Preprocessing – Variable Transformation
It refers to a transformation that is applied to all the values of a variable (attribute). The two
important types of variable transformations are simple functional transformations and
normalization.

Simple Functions
For this type of variable transformation, a simple mathematical function is applied to each value
individually. If x is a variable, then examples of such transformations include x^k, log x, e^x, √x, 1/x, sin x,
and |x|. The square root, log, and 1/x transformations are often used to transform data that does not have
a Gaussian (normal) distribution.

Variable transformations should be applied with caution since they change the nature of the
data. While this is often exactly what is desired, there can be problems if the nature of the transformation
is not fully appreciated. For example, the transformation 1/x reduces the magnitude of values that are 1
or larger, but increases the magnitude of values between 0 and 1. Thus, for any set of positive values, the
transformation 1/x reverses the order.
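A brief illustration of such transformations in NumPy; the array x is just example data.

```python
import numpy as np

x = np.array([0.1, 0.5, 1.0, 2.0, 10.0])

print(np.sqrt(x))   # mildly compresses large values
print(np.log(x))    # strongly compresses large values; often used for skewed data
print(1.0 / x)      # [10.  2.  1.  0.5  0.1] -- reverses the order of positive values
```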
Data Preprocessing – Variable Transformation… Contd
Normalization or Standardization
Another common type of variable transformation is the standardization or normalization of a variable. The
goal of standardization or normalization is to make an entire set of values have a particular property. For
example, if x̄ is the mean (average) of the attribute values and sx is their standard deviation, then the
transformation y = (x − x̄)/sx creates a new variable that has a mean of 0 and a standard deviation of 1.
If different variables are to be combined in some way, then such a transformation is necessary to avoid
having a variable with large values dominate the results of the calculation. For example, comparing
people based on two variables: age and income.
As the mean and standard deviation are strongly affected by outliers, they are sometimes replaced by the
median, i.e., the middle value, and the absolute standard deviation, respectively. The absolute standard
deviation of a variable x is given as follows.

σ_A = Σ_{i=1}^{m} |x_i − μ|

Here, x_i is the ith value of the variable, m is the number of objects, and μ is either the mean or the median.
Other approaches for computing estimates of the location (center) and spread of a set of values can also
be used to define a standardization transformation.
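The sketch below shows both the classical z-score transformation and the more outlier-resistant variant based on the median and the absolute standard deviation defined above; names such as robust_standardize are illustrative, not standard.

```python
import numpy as np

def standardize(x):
    """z-score: y = (x - mean) / standard deviation, giving mean 0 and std 1."""
    return (x - x.mean()) / x.std(ddof=1)

def robust_standardize(x):
    """Replace the mean with the median and the standard deviation with the
    absolute standard deviation sigma_A = sum_i |x_i - mu| (mu = median here)."""
    mu = np.median(x)
    sigma_a = np.sum(np.abs(x - mu))
    return (x - mu) / sigma_a

incomes = np.array([30_000, 45_000, 52_000, 80_000, 1_000_000.0])  # one extreme outlier
print(standardize(incomes))         # the outlier dominates the mean and std
print(robust_standardize(incomes))  # median and absolute deviation are less affected
```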
Similarity and Dissimilarity Measures
Similarity and dissimilarity are important because they are used by a number of data mining
techniques, such as clustering, nearest neighbor classification, and anomaly detection.
In many cases, the initial data set is not needed once these similarities or dissimilarities have been
computed. Such approaches can be viewed as transforming the data to a similarity (dissimilarity)
space and then performing the analysis.
Since the proximity (proximity refers to either similarity or dissimilarity) between two objects is a
function of the proximity between the corresponding attributes of the two objects, we shall first
discuss measurement of the proximity between objects having only one simple attribute, and then
consider proximity measures for objects with multiple attributes.
The similarity between two objects is a numerical measure of the degree to which the two objects
are alike. Similarities are usually non-negative and are often between 0 (no similarity) and 1
(complete similarity).
The dissimilarity between two objects is a numerical measure of the degree to which the two
objects are different. Dissimilarities sometimes fall in the interval [0, 1], but it is also common for
them to range from 0 to ∞. Sometimes, the term distance is used as a synonym for dissimilarity.
Similarity and Dissimilarity Measures
Transformations
Transformations are often applied to convert a similarity to a dissimilarity, or vice versa, or to
transform a proximity measure to fall within a particular range, e.g., [0,1].
Usually, proximity measures, especially similarities, are defined or transformed to have values in the
interval [0,1]. Such a transformation is often relatively straightforward.
For example:
• If the similarities between objects range from 1 (not at all similar) to 10 (completely similar), we
can make them fall within the range [0, 1] by using the transformation s’ = (s−1)/9, where s and
s’ are the original and new similarity values, respectively.
• In the more general case, the transformation of similarities to the interval [0, 1] is given by the
expression s’ = (s − min_s)/(max_s−min_s), where max_s and min_s are the maximum and
minimum similarity values, respectively.
• Similarly, dissimilarity measures with a finite range can be mapped to the interval [0,1] by using
the formula d’ = (d − min_d)/(max_d − min_d).
Similarity and Dissimilarity Measures
Transformations … Contd
However, there can be various complications in mapping proximity measures to the interval [0, 1].
For example, if the proximity measure originally takes values in the interval [0, ∞), then a non-
linear transformation is needed, and the values will not have the same relationship to one another
on the new scale.
Another complication is that the meaning of the proximity measure may be changed.
For example, correlation is a measure of similarity that takes values in the interval [-1,1].
Mapping these values to the interval [0,1] by taking the absolute value loses information about
the sign, which can be important in some applications.
Transforming similarities to dissimilarities and vice versa is also relatively straightforward, although
we again face the issues of preserving meaning and changing a linear scale into a non-linear scale.
• If the similarity (or dissimilarity) falls in the interval [0,1], then the dissimilarity can be defined as
d = 1−s (s = 1 − d).
• Another simple approach is to define similarity as the negative of the dissimilarity (or vice versa).
However, it is not restricted to the range [0, 1], but if that is desired, then transformations such
as s = 1/(d+1), s = e−d, or s = 1− (d−min_d)/(max_d − min_d) can be used.
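A small illustrative sketch of these transformations; the helper names are not standard and the example scores are hypothetical.

```python
import numpy as np

def to_unit_interval(values):
    """Map a proximity measure with a finite range onto [0, 1]."""
    v = np.asarray(values, dtype=float)
    return (v - v.min()) / (v.max() - v.min())

def similarity_from_dissimilarity(d, method="one_minus"):
    """Convert dissimilarities to similarities with a monotonically decreasing function."""
    d = np.asarray(d, dtype=float)
    if method == "one_minus":        # assumes d already lies in [0, 1]
        return 1.0 - d
    if method == "reciprocal":       # s = 1 / (d + 1)
        return 1.0 / (d + 1.0)
    if method == "exponential":      # s = e^(-d)
        return np.exp(-d)
    raise ValueError(f"unknown method: {method}")

print(to_unit_interval([1, 4, 7, 10]))                                # similarities on a 1-10 scale
print(similarity_from_dissimilarity([0.0, 0.5, 2.0], "exponential"))  # dissimilarities to similarities
```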
Similarity and Dissimilarity Measures

Transformations … Contd
In general, any monotonic decreasing function can be used to convert dissimilarities to similarities,
or vice versa. Of course, other factors (e.g., the issues of preserving meaning, changing a linear
scale into a non-linear scale) must also be considered when transforming similarities to
dissimilarities, or vice versa, or when transforming the values of a proximity measure to a new
scale.
Similarity and Dissimilarity Measures
Similarity and Dissimilarity between Simple Attributes

Consider objects described by one nominal attribute. What would it mean for two such objects to
be similar? Since nominal attributes only convey information about the distinctness of objects, all
we can say is that two objects either have the same value or they do not. Hence, in this case
similarity is traditionally defined as 1 if attribute values match, and as 0 otherwise. A dissimilarity
would be defined in the opposite way: 0 if the attribute values match, and 1 if they do not.

For objects with a single ordinal attribute, the situation is more complicated because information
about order should be taken into account.
Consider an attribute that measures the quality of a product on the scale {poor, fair, OK, good,
wonderful}. It would seem reasonable that a product, P1, which is rated wonderful, would be
closer to a product P2, which is rated good, than it would be to a product P3, which is rated OK.
To make this observation quantitative, the values of the ordinal attribute are often mapped to
successive integers, beginning at 0 or 1, e.g., {poor=0, fair=1, OK=2, good=3, wonderful=4}.
Then, d(P1, P2) = 4 − 3 = 1 or, if we want the dissimilarity to fall between 0 and 1, d(P1, P2) =
(4−3)/4 = 0.25. A similarity for ordinal attributes can then be defined as s = 1− d.
Similarity and Dissimilarity Measures
Similarity and Dissimilarity between Simple Attributes
For interval or ratio attributes, the natural measure of dissimilarity between two objects is the
absolute difference of their values.
The similarity of interval or ratio attributes is typically obtained by transforming this dissimilarity into a
similarity, as discussed in the previous slides.

Attribute Type      Dissimilarity                                   Similarity
Nominal             d = 0 if x = y; d = 1 if x ≠ y                  s = 1 if x = y; s = 0 if x ≠ y
Ordinal             d = |x − y|/(n − 1), where the values are       s = 1 − d
                    mapped to integers 0 to n−1 and n is the
                    number of values
Interval or Ratio   d = |x − y|                                     s = −d, s = 1/(1 + d), s = e^(−d),
                                                                    s = 1 − (d − min_d)/(max_d − min_d)
Similarity and Dissimilarity Measures
Dissimilarities between Data Objects
Distances
The Euclidean distance, d, between two points, x and y, in one-, two-, three-, or higher dimensional
space, is given by the following formula:
d(x, y) = √( Σ_{k=1}^{n} (x_k − y_k)² )

Here, n is the number of dimensions and xk and yk are, respectively, the kth attributes (components)
of x and y.

The Euclidean distance measure is generalized by the Minkowski distance metric shown below.
d(x, y) = ( Σ_{k=1}^{n} |x_k − y_k|^r )^{1/r}
Here, r is a parameter.
Similarity and Dissimilarity Measures
Dissimilarities between Data Objects … Contd
Distances
The following are the three most common examples of Minkowski distances.
• r = 1. City block (Manhattan, taxicab, L1 norm) distance. A common example is the Hamming
distance, which is the number of bits that are different between two objects that have only
binary attributes, i.e., between two binary vectors.

• r = 2. Euclidean distance (L2 norm).

• r = ∞. Supremum (Lmax or L∞ norm) distance. This is the maximum difference between any
attribute of the objects. More formally, the L∞ distance is defined as follows.
d(x, y) = lim_{r→∞} ( Σ_{k=1}^{n} |x_k − y_k|^r )^{1/r}
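The three cases can be computed directly, as in this short NumPy sketch with illustrative vectors.

```python
import numpy as np

def minkowski(x, y, r):
    """Minkowski distance; r = 1 gives the city block (L1) distance,
    r = 2 the Euclidean (L2) distance."""
    return np.sum(np.abs(x - y) ** r) ** (1.0 / r)

def supremum(x, y):
    """L_inf norm: the maximum difference over all attributes."""
    return np.max(np.abs(x - y))

x = np.array([0.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])
print(minkowski(x, y, 1))   # 6.0   (L1)
print(minkowski(x, y, 2))   # ~4.47 (L2)
print(supremum(x, y))       # 4.0   (L_inf)
```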
Similarity and Dissimilarity Measures
Dissimilarities between Data Objects … Contd

Distances
Distances, such as the Euclidean distance, have some well-known properties. If d(x, y) is the
distance between two points, x and y, then the following properties hold.
1. Positivity
(a) d(x, y) ≥ 0 for all x and y,
(b) d(x, y) = 0 only if x = y.

2. Symmetry
d(x, y) = d(y, x) for all x and y.

3. Triangle Inequality
d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z.

Measures that satisfy all three properties are known as metrics.


Similarity and Dissimilarity Measures

Similarities between Data Objects

The triangle inequality does not hold for similarities, but symmetry and positivity do.

If s(x, y) is the similarity between points x and y, then the typical properties of similarities are as
follows.
1. s(x, y) = 1 only if x = y. (0 ≤ s ≤ 1)
2. s(x, y) = s(y, x) for all x and y. (Symmetry)
Similarity and Dissimilarity Measures
Similarity Measures for Binary Data
Similarity measures between objects that contain only binary attributes are called similarity
coefficients, and typically have values between 0 and 1. A value of 1 indicates that the two objects
are completely similar, while a value of 0 indicates that the objects are not at all similar.

Let x and y be two objects that consist of n binary attributes. The comparison of two such objects,
i.e., two binary vectors, leads to the following four quantities (frequencies):
f00 = the number of attributes where x is 0 and y is 0
f01 = the number of attributes where x is 0 and y is 1
f10 = the number of attributes where x is 1 and y is 0
f11 = the number of attributes where x is 1 and y is 1

Simple Matching Coefficient One commonly used similarity coefficient is the simple matching
coefficient (SMC), which is defined as follows:
SMC = (number of matching attribute values) / (number of attributes)
    = (f11 + f00) / (f01 + f10 + f11 + f00)
Similarity and Dissimilarity Measures

Similarity Measures for Binary Data


The SMC counts both presences and absences equally. Hence, it is not suitable for asymmetric
attributes; for example, on the market basket data of a big store, all pairs of customers would be rated
as very similar because they share many 0–0 matches. In such cases, we use the Jaccard coefficient.
The Jaccard coefficient (J) is computed as follows.

J = (number of matching presences) / (number of attributes not involved in 0–0 matches)
  = f11 / (f01 + f10 + f11)
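A minimal sketch of both coefficients for binary vectors; the example vectors are illustrative market-basket style data.

```python
import numpy as np

def smc(x, y):
    """Simple matching coefficient: (f11 + f00) / (f01 + f10 + f11 + f00)."""
    return np.sum(x == y) / len(x)

def jaccard(x, y):
    """Jaccard coefficient: f11 / (f01 + f10 + f11), ignoring 0-0 matches."""
    f11 = np.sum((x == 1) & (y == 1))
    f10 = np.sum((x == 1) & (y == 0))
    f01 = np.sum((x == 0) & (y == 1))
    denom = f01 + f10 + f11
    return f11 / denom if denom > 0 else 0.0

x = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])
print(smc(x, y))      # 0.7 -- inflated by the many shared absences
print(jaccard(x, y))  # 0.0 -- the two objects share no presences
```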
Similarity and Dissimilarity Measures
Cosine Similarity
Documents are often represented as vectors, where, in its simplest form, each attribute represents
the frequency with which a particular term (word) occurs in the document. Even though documents
have thousands or tens of thousands of attributes (terms), each document is sparse since it has
relatively few non-zero attributes.
The normalizations used for documents do not create a non-zero entry where there was a zero
entry; i.e., they preserve sparsity. Thus, as with transaction data, similarity should not depend on the
number of shared 0 values since any two documents are likely to “not contain” many of the same
words, and therefore, if 0–0 matches are counted, most documents will be highly similar to most
other documents. Therefore, a similarity measure for documents needs to ignore 0–0 matches, like
the Jaccard measure, but must also be able to handle non-binary vectors. The cosine similarity is
one of the most common measures of document similarity. If x and y are two document vectors, then

cos(x, y) = (x · y) / (‖x‖ ‖y‖)

Here, · indicates the vector dot product, x · y = Σ_{k=1}^{n} x_k y_k, and ‖x‖ is the length of the vector x,
‖x‖ = √( Σ_{k=1}^{n} x_k² ) = √(x · x).
Similarity and Dissimilarity Measures
Cosine Similarity
Geometrically, cosine similarity is a measure of the (cosine of the) angle between x and y. Thus, if the
cosine similarity is 1, the angle between x and y is 0°, and x and y are the same except for magnitude
(length). If the cosine similarity is 0, then the angle between x and y is 90°, and they do not share any
terms (words).

It can also be written as follows.

cos(x, y) = (x / ‖x‖) · (y / ‖y‖)

Dividing x and y by their lengths normalizes them to have a length of 1. This means that cosine
similarity does not take the magnitude of the two data objects into account when computing
similarity.
For vectors with a length of 1, the cosine measure can be calculated by taking a simple dot product.
Consequently, when many cosine similarities between objects are being computed, normalizing the
objects to have unit length can reduce the time required.
Euclidean distance may be a better choice when magnitude is important.
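A minimal sketch of cosine similarity between two term-frequency vectors; the document vectors are illustrative.

```python
import numpy as np

def cosine_similarity(x, y):
    """cos(x, y) = (x . y) / (||x|| ||y||)."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0], dtype=float)   # term frequencies
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2], dtype=float)
print(cosine_similarity(d1, d2))    # ~0.31

# Pre-normalizing to unit length reduces each later comparison to a dot product.
u1, u2 = d1 / np.linalg.norm(d1), d2 / np.linalg.norm(d2)
print(np.dot(u1, u2))               # same value
```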
Similarity and Dissimilarity Measures

Extended Jaccard Coefficient (Tanimoto Coefficient)

The extended Jaccard coefficient (EJ), also known as the Tanimoto coefficient, can be used for
document data and reduces to the Jaccard coefficient in the case of binary attributes. It is
defined as follows.

EJ(x, y) = (x · y) / (‖x‖² + ‖y‖² − x · y)
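A one-function sketch, with a binary example showing the reduction to the ordinary Jaccard coefficient.

```python
import numpy as np

def extended_jaccard(x, y):
    """Tanimoto coefficient: (x . y) / (||x||^2 + ||y||^2 - x . y)."""
    dot = np.dot(x, y)
    return dot / (np.dot(x, x) + np.dot(y, y) - dot)

x = np.array([1.0, 0.0, 1.0, 1.0])
y = np.array([1.0, 1.0, 0.0, 1.0])
print(extended_jaccard(x, y))   # 0.5, equal to f11 / (f01 + f10 + f11) = 2/4
```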
Similarity and Dissimilarity Measures
Correlation
The correlation between two data objects that have binary or continuous variables is a measure of
the linear relationship between the attributes of the objects. More precisely, Pearson’s correlation
coefficient between two data objects, x and y, is defined by the following equation:
corr(x, y) = covariance(x, y) / (standard_deviation(x) · standard_deviation(y)) = s_xy / (s_x · s_y)

Here,

covariance(x, y) = s_xy = (1/(n − 1)) Σ_{k=1}^{n} (x_k − x̄)(y_k − ȳ)

standard_deviation(x) = s_x = √( (1/(n − 1)) Σ_{k=1}^{n} (x_k − x̄)² )

standard_deviation(y) = s_y = √( (1/(n − 1)) Σ_{k=1}^{n} (y_k − ȳ)² )

x̄ = (1/n) Σ_{k=1}^{n} x_k is the mean of x, and ȳ = (1/n) Σ_{k=1}^{n} y_k is the mean of y.
Similarity and Dissimilarity Measures

Correlation
Correlation is always in the range −1 to 1. A correlation of 1 (−1) means that x and y have a perfect
positive (negative) linear relationship; that is, xk = ayk + b, where a and b are constants.

If the correlation is 0, then there is no linear relationship between the attributes of the two data
objects. However, non-linear relationships may still exist.

If we transform x and y by subtracting off their means and then normalizing them so that their
lengths are 1, then their correlation can be calculated by taking the dot product.
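A short NumPy sketch of Pearson's correlation, including the dot-product view mentioned above; the example vectors are illustrative and have a perfect negative linear relationship.

```python
import numpy as np

def pearson_correlation(x, y):
    """corr(x, y) = s_xy / (s_x * s_y)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
    return s_xy / (x.std(ddof=1) * y.std(ddof=1))

x = np.array([-3, 6, 0, 3, -6])
y = np.array([1, -2, 0, -1, 2])          # y = -x / 3, a perfect negative relationship
print(pearson_correlation(x, y))         # -1.0

# Equivalent view: center, normalize to unit length, then take the dot product.
xc, yc = x - x.mean(), y - y.mean()
print(np.dot(xc / np.linalg.norm(xc), yc / np.linalg.norm(yc)))   # -1.0
```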
Similarity and Dissimilarity Measures

Issues in Proximity Calculation

1) How to handle the case in which attributes have different scales and/or are correlated?

2) How to calculate proximity between objects that are composed of different types of attributes,
e.g., quantitative and qualitative?

3) How to handle proximity calculation when attributes have different weights; i.e., when not all
attributes contribute equally to the proximity of objects?
Similarity and Dissimilarity Measures
Standardization and Correlation for Distance Measures

An important issue with distance measures is how to handle the situation when attributes do not
have the same range of values, i.e., the variables have different scales.

A related issue is how to compute distance when there is correlation between some of the
attributes, perhaps in addition to differences in the ranges of values.

A generalization of Euclidean distance, the Mahalanobis distance, is useful when attributes are
correlated, have different ranges of values (different variances), and the distribution of the data is
approximately Gaussian (normal).

The Mahalanobis distance between two objects (vectors) x and y is defined as follows.

mahalanobis(x, y) = (x − y) Σ⁻¹ (x − y)ᵀ

Here, Σ⁻¹ is the inverse of the covariance matrix of the data.
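A hedged sketch using NumPy; the covariance matrix is estimated from a synthetic, illustrative data matrix whose rows are objects, and the function follows the formula above literally.

```python
import numpy as np

def mahalanobis(x, y, data):
    """(x - y) Sigma^-1 (x - y)^T, with Sigma estimated from `data`
    (rows = objects, columns = attributes)."""
    cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
    diff = x - y
    return float(diff @ cov_inv @ diff)

rng = np.random.default_rng(0)
# Two correlated attributes with different variances (synthetic example data).
data = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[4.0, 1.5], [1.5, 1.0]], size=500)
x = np.array([1.0, 0.5])
y = np.array([-1.0, -0.5])
print(mahalanobis(x, y, data))            # accounts for scale and correlation
print(float(np.linalg.norm(x - y)))       # plain Euclidean distance, for comparison
```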
Similarity and Dissimilarity Measures

Combining Similarities for Heterogeneous Attributes


A general approach is needed when the attributes are of different types. One straightforward
approach is to compute the similarity between each attribute separately, and then combine these
similarities using a method that results in a similarity between 0 and 1. Typically, the overall
similarity is defined as the average of all the individual attribute similarities.
Unfortunately, the above approach does not work well if some of the attributes are asymmetric
attributes. The easiest way to fix this problem is to omit asymmetric attributes from the similarity
calculation when their values are 0 for both of the objects whose similarity is being computed.
A similar approach also works well for handling missing values.
Similarity and Dissimilarity Measures

Combining Similarities for Heterogeneous Attributes

Algorithm: Similarities of heterogeneous objects.


1. For the kth attribute, compute a similarity, sk(x, y), in the range [0, 1].

2. Define an indicator variable, δk, for the kth attribute as follows:


δ_k = 0 if the kth attribute is an asymmetric attribute and both objects have a value of 0, or
      if one of the objects has a missing value for the kth attribute;
δ_k = 1 otherwise.

3. Compute the overall similarity between the two objects using the following formula:
similarity(x, y) = Σ_{k=1}^{n} δ_k s_k(x, y) / Σ_{k=1}^{n} δ_k
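A direct, illustrative implementation of the algorithm above; the per-attribute similarity functions and the example objects are hypothetical.

```python
def heterogeneous_similarity(x, y, attribute_sims, asymmetric):
    """Combine per-attribute similarities s_k in [0, 1] using the indicator
    variables delta_k defined above.

    attribute_sims: one similarity function per attribute.
    asymmetric:     one boolean per attribute (True if the attribute is asymmetric)."""
    numerator, denominator = 0.0, 0.0
    for k, sim_k in enumerate(attribute_sims):
        missing = x[k] is None or y[k] is None
        both_zero_asym = asymmetric[k] and x[k] == 0 and y[k] == 0
        delta_k = 0 if (missing or both_zero_asym) else 1
        if delta_k:
            numerator += sim_k(x[k], y[k])
            denominator += 1
    return numerator / denominator if denominator else 0.0

# Hypothetical objects: (city, #purchases of item A, #purchases of item B)
nominal_sim = lambda a, b: 1.0 if a == b else 0.0
count_sim   = lambda a, b: 1.0 / (1.0 + abs(a - b))
sims = [nominal_sim, count_sim, count_sim]
asym = [False, True, True]
print(heterogeneous_similarity(("Gwalior", 0, 3), ("Gwalior", 0, 1), sims, asym))  # item A is skipped
```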
Similarity and Dissimilarity Measures

Using Weights

In much of the previous discussion, all attributes were treated equally when computing proximity.
This is not desirable when some attributes are more important to the definition of proximity than
others.

To address these situations, the formulas for proximity can be modified by weighting the
contribution of each attribute. If the weights wk sum to 1, then similarity(x, y) becomes
similarity(x, y) = Σ_{k=1}^{n} w_k δ_k s_k(x, y) / Σ_{k=1}^{n} δ_k

The definition of the Minkowski distance can also be modified as follows:


d(x, y) = ( Σ_{k=1}^{n} w_k |x_k − y_k|^r )^{1/r}

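A minimal sketch of the weighted Minkowski distance; the weights and vectors are illustrative.

```python
import numpy as np

def weighted_minkowski(x, y, w, r):
    """Minkowski distance with per-attribute weights w (assumed to sum to 1)."""
    return np.sum(w * np.abs(x - y) ** r) ** (1.0 / r)

x = np.array([1.0, 10.0])
y = np.array([2.0, 14.0])
w = np.array([0.8, 0.2])                 # the first attribute matters more
print(weighted_minkowski(x, y, w, 2))    # 2.0, versus an unweighted Euclidean distance of ~4.12
```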