Chapter-2 (Data)
Because data mining tools and techniques depend heavily on the type of data, it is necessary to
understand the data first.
For example, we can compute the similarity or dissimilarity between pairs of objects and then
perform the analysis—clustering, classification, or anomaly detection—based on these
similarities or dissimilarities.
There are many such similarity or dissimilarity measures, and the proper choice depends on (i) the
type of data, and (ii) the particular application.
Types of Data
A data set can often be viewed as a collection of data objects. Other names for a data object are
record, point, vector, pattern, event, case, sample, observation, or entity.
These data objects are described by a number of attributes that capture the basic characteristics of an
object. Other names for an attribute are variable, characteristic, field, feature, or dimension.
Example: Often, a data set is a file, in which the objects are records (or rows) and each field (or
column) corresponds to an attribute. The table below shows a data set that consists of student
information. Here, each row corresponds to a student and each column is an attribute that describes
some aspect of a student, e.g., cumulative grade point average (CGPA), identification number (ID).
Table: Student Information
Student ID Year CGPA …
2034625 Freshman 7.8 …
1934364 Sophomore 9.5 …
1737637 Senior 6.8 …
Attributes and Measurements
What is an attribute?
An attribute is a property or characteristic of an object that may vary, either from one object to
another or from one time to another.
For example, eye color varies from person to person, while the temperature of an object varies
over time. The eye color is a symbolic attribute with a small number of possible values {brown,
black, blue, green, hazel, etc.} whereas the temperature is a numerical attribute with a potentially
unlimited number of values.
However, at the most basic level, attributes are not about numbers or symbols. Rather, we assign
numbers or symbols to them to discuss and analyze the characteristics of objects.
The process of measurement is the application of a measurement scale to associate a value with a
particular attribute of a specific object. We perform measurements all the time.
For example, we use a bathroom scale to determine our weight, or we classify someone as male or
female. In these cases, the “physical value” of an attribute of an object is mapped to a numerical
or symbolic value.
The Type of an Attribute
The type of an attribute determines whether a particular data analysis technique is consistent with a
specific type of attribute.
However, the properties of an attribute need not be the same as the properties of the values used to
measure it. In other words, the values used to represent an attribute may have properties that are not
properties of the attribute itself, and vice versa.
Example: Two attributes that might be associated with an employee are ID and age. Both of these
attributes can be represented as integers.
Though it is reasonable to talk about the average age of an employee, it makes no sense to talk
about the average ID because it simply captures the aspect that each employee is distinct. The
only valid operation on IDs is to test whether they are equal. However, there is no hint of this
limitation when integers are used to represent the employee ID attribute.
For the age attribute, the properties of the integers used to represent age are very much the
properties of the attribute. However, the correspondence is not complete since, for example, ages
have a maximum, while integers do not.
The Different Types of Attributes

Categorical (Qualitative) attributes:
Nominal: The values of a nominal attribute are just different names; i.e., nominal values provide only enough information to distinguish one object from another. (=, ≠)
    Examples: zip codes, employee ID numbers, eye color, gender.
    Operations: mode, entropy, contingency correlation, χ2 test.
Ordinal: The values of an ordinal attribute provide enough information to order objects. (<, >)
    Examples: hardness of minerals, {good, better, best}, grades, street numbers.
    Operations: median, percentiles, rank correlation, run tests, sign tests.

Numeric (Quantitative) attributes:
Interval: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, −)
    Examples: calendar dates, temperature in Celsius or Fahrenheit.
    Operations: mean, standard deviation, Pearson's correlation, t and F tests.
Ratio: For ratio variables, both differences and ratios are meaningful. (*, /)
    Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current.
    Operations: geometric mean, harmonic mean, percent variation.
Each attribute type possesses all of the properties and operations of the attribute types above it.
The Different Types of Attributes
Nominal and ordinal attributes are collectively referred to as categorical or qualitative attributes.
They lack most of the properties of numbers even if they are represented by numbers. They are
treated like symbols.
Interval and ratio are collectively referred to as numeric or quantitative attributes. They are
represented by numbers and have most of the properties of numbers. They can be integer-valued or
continuous.
The Different Types of Attributes
The types of attributes can also be described in terms of transformations that do not change the
meaning of an attribute.
Categorical (Qualitative) attributes:
Nominal: Any one-to-one mapping, e.g., a permutation of values.
    Comment: If all employee ID numbers are reassigned, it will not make any difference.
Ordinal: An order-preserving change of values, i.e., new value = f(old value), where f is a monotonic function.
    Comment: An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Numeric (Quantitative) attributes:
Interval: new value = a * old value + b, where a and b are constants.
    Comment: The Fahrenheit and Celsius temperature scales differ in the location of their zero value and the size of a degree (unit).
Ratio: new value = a * old value.
    Comment: Length can be measured in meters or feet.
Describing Attributes by the Number of Values
An independent way of distinguishing between attributes is by the number of values they can take.
Discrete: A discrete attribute has a finite or countably infinite set of values. Such attributes can be categorical, e.g., pin
codes, or numeric, e.g., counts. Discrete attributes are often represented using integer variables.
Binary attributes are a special case of discrete attributes and assume only two values, e.g.,
true/false, yes/no, or 0/1. They are often represented as Boolean variables, or as integer
variables that take either of two values 0 or 1.
Continuous: A continuous attribute is one whose values are real numbers. For example,
temperature, height, or weight. They are typically represented as floating-point variables.
In theory, any of the measurement scale types—nominal, ordinal, interval, and ratio—can be
combined with any of the types based on the number of attribute values—binary, discrete, and
continuous. However, some combinations do not make much sense. For instance, it is difficult to
think of a realistic data set that contains a continuous binary attribute.
Typically, nominal and ordinal attributes are binary or discrete, whereas interval and ratio attributes
are continuous. However, count attributes, which are discrete, are also ratio attributes.
Asymmetric Attributes
For asymmetric attributes, only presence—a non-zero attribute value—is regarded as important.
Binary attributes where only non-zero values are important are called asymmetric binary attributes.
This type of attribute is particularly important for association analysis.
Example: Consider a data set where each object is a student and each attribute records whether
or not a student took a particular course at a university. For a specific student, an attribute has a
value of 1 if the student took the course associated with that attribute and a value of 0
otherwise. Because students take only a small fraction of all available courses, most of the values
in such a data set would be 0. Therefore, it is more meaningful and more efficient to focus on the
non-zero values. If students are compared on the basis of the courses they don’t take, then most
students would seem very similar if the number of courses is large.
However, it is also possible to have discrete or continuous asymmetric features. For example, if the
number of credits associated with each course is recorded, then the resulting attribute is an asymmetric discrete attribute.
Types of Data Sets
There are many types of data sets. Though not every data set fits neatly into these groups, and other
groupings are also possible, we group them as record data, graph-based data, and ordered data.
Dimensionality: The dimensionality of a data set is the number of attributes that the objects possess
in the data set. The difficulty associated with analyzing high-dimensional data is referred to as the
curse of dimensionality. Because of this, an important motivation in preprocessing the data is
dimensionality reduction.
Sparsity: For some data sets, such as those with asymmetric features, most attributes of an object
have values of 0; in many cases, fewer than 1% of the entries are non-zero. For some algorithms (e.g.,
Naïve Bayes, logistic regression), sparsity is an advantage because only the non-zero values need to be
stored and manipulated. This results in significant savings in computation time and storage. However,
for other applications (e.g., recommender systems), sparsity is a significant challenge.
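As a rough illustration of this point, a sparse data set can be stored so that only the non-zero entries are kept. The sketch below assumes NumPy and SciPy are available; the tiny student-by-course matrix is made up.

    import numpy as np
    from scipy.sparse import csr_matrix

    # A small student-by-course matrix: 1 means the course was taken (hypothetical data).
    # Most entries are 0, so only the non-zero values are worth storing.
    dense = np.array([
        [1, 0, 0, 0, 1, 0],
        [0, 0, 1, 0, 0, 0],
        [1, 0, 0, 0, 0, 1],
    ])
    sparse = csr_matrix(dense)          # compressed sparse row format stores only non-zero entries
    print(sparse.nnz)                   # number of stored (non-zero) entries: 5
    print(sparse.nnz / dense.size)      # fraction of entries that are non-zero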
Types of Data Sets
Resolution: It is frequently possible to obtain data at different levels of resolution, and often the
properties of the data are different at different resolutions.
Example: The surface of the Earth seems very uneven at a resolution of a few meters, but is
relatively smooth at a resolution of tens of kilometers.
Why is it important?
The patterns in the data also depend on the level of resolution. If the resolution is too fine, a pattern
may not be visible or may be buried in noise; if the resolution is too coarse, the pattern may
disappear.
Example: Variations in atmospheric pressure on a scale of hours reflect the movement of storms
and other weather systems. On a scale of months, such phenomena are not detectable.
Types of Data Sets – Record Data
Most of the data mining work assumes that the data set is a collection of records (data objects), each
of which consists of a fixed set of data fields (attributes). In record data, there is no explicit
relationship among records or data fields, and every record (object) has the same set of attributes.
Record data is usually stored either in flat files or in relational databases. Though the relational
databases are more than a collection of records, data mining often does not use any additional
information available in a relational database.
Transaction or Market Basket Data: Transaction data is a special type of record data, where each
record (transaction) involves a set of items. Consider a grocery store. The set of products purchased
by a customer during one shopping trip constitutes a transaction, while the individual products that
were purchased are the items. This type of data is called market basket data because the items in
each record are the products in a person’s market basket. Transaction data is a collection of sets of
items, but it can be viewed as a set of records whose fields are asymmetric attributes. Most often,
the attributes are binary, indicating whether or not an item was purchased.
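As a minimal sketch of this representation, the transactions below (with made-up items) are converted into a set of records whose fields are asymmetric binary attributes, assuming pandas is available.

    import pandas as pd

    # Each transaction is the set of items in one market basket (hypothetical items).
    transactions = [
        {"bread", "milk"},
        {"bread", "diapers", "beer", "eggs"},
        {"milk", "diapers", "beer", "cola"},
    ]

    items = sorted(set().union(*transactions))
    # One row per transaction, one asymmetric binary attribute per item.
    records = pd.DataFrame(
        [[int(item in t) for item in items] for t in transactions],
        columns=items,
    )
    print(records)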
Sequential Data: It is also referred to as temporal data and can be thought of as an extension of
record data, where each record has a time associated with it. Consider a retail transaction data set
that also stores the time at which the transaction took place. This time information makes it possible
to find patterns such as “candy sales peak before Halloween.” A time can also be associated with
each attribute. For example, each record could be the purchase history of a customer, with a listing
of items purchased at different times. Using this information, it is possible to find patterns such as
“people who buy DVD players tend to buy DVDs in the period immediately following the purchase.”
Most data mining algorithms are designed for record data or its variations, such as transaction data
and data matrices. Record-oriented techniques can be applied to non-record data by extracting
features from data objects and using these features to create a record corresponding to each object.
However, in some cases, it is easy to represent the data in a record format, but this type of
representation does not capture all the information in the data.
Data Quality
Data mining applications are often applied to data that was collected for another purpose, or for
future, but unspecified applications. For that reason, data mining cannot usually take advantage of
the significant benefits of addressing quality issues at the source. Because preventing data quality
problems is typically not an option, data mining focuses on the following.
1. The detection and correction of data quality problems (it is known as data cleaning) and
2. The use of algorithms that can tolerate poor data quality.
Data Quality - Measurement and Data Collection Issues
It is unrealistic to expect that data will be perfect. There may be problems due to human error,
limitations of measuring devices, or flaws in the data collection process.
Values or even entire data objects may be missing. In other cases, there may be spurious or duplicate
objects; i.e., multiple data objects that all correspond to a single object.
For example, there might be two different records for a person who has recently lived at two
different addresses.
Even if all the data is present and looks fine, there may be inconsistencies.
For example, a person has a height of 2 meters, but weighs only 2 kilograms.
Data Quality - Measurement and Data Collection Issues
The term measurement error refers to any problem resulting from the measurement process. A
common problem is that the recorded value differs from the true value to some extent.
For continuous attributes, the numerical difference of the measured and true value is called the
error.
The term data collection error refers to errors such as omitting data objects or attribute values, or
inappropriately including a data object.
Both measurement errors and data collection errors can be either systematic or random.
Data Quality - Measurement and Data Collection Issues
Noise and Artifacts
Noise is the random component of a measurement error. It may involve the distortion of a value or
the addition of spurious objects.
Figure: A time series disrupted by random noise. If a bit more noise were added to the time series, its shape would be lost.
Because the term noise is often used in connection with data that has a spatial or temporal
component, techniques from signal or image processing can be used to reduce noise. This helps to
discover patterns (signals) that might otherwise be lost in the noise. However, the elimination of
noise is frequently difficult, and much work in data mining focuses on devising robust algorithms
that produce acceptable results even when noise is present.
Data errors that are the result of a more deterministic phenomenon are often referred to as artifacts.
Figure: A set of data points before and after some noise points (indicated by '+'s) have been added. Notice that some of the noise points are intermixed with the non-noise points.
Data Quality - Measurement and Data Collection Issues
Precision, Bias, and Accuracy
The quality of the measurement process and the resulting data are measured by precision and bias.
(Their definitions assume that the measurements are repeated to calculate a mean (average) value
that serves as an estimate of the true value.)
Precision: The closeness of repeated measurements (of the same quantity) to one another.
Precision is often measured by the standard deviation of a set of values, while bias is measured by
taking the difference between the mean of the set of values and the known value of the quantity
being measured.
It is common to use the more general term, accuracy, to refer to the degree of measurement error
in data.
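A minimal sketch of these two quantities, assuming NumPy and a set of repeated (made-up) measurements of a standard weight whose true mass is known:

    import numpy as np

    true_value = 1.000                         # known mass of a standard weight, in kg (hypothetical)
    measurements = np.array([1.015, 0.990, 1.013, 1.001, 0.986])   # repeated measurements

    precision = measurements.std(ddof=1)       # closeness of the repeated measurements to one another
    bias = measurements.mean() - true_value    # difference between their mean and the true value
    print(f"precision = {precision:.4f}, bias = {bias:+.4f}")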
Data Quality - Measurement and Data Collection Issues
Accuracy: The closeness of measurements to the true value of the quantity being measured.
Accuracy depends on precision and bias, but since it is a general concept, there is no specific
formula for accuracy in terms of these two quantities. One important aspect of accuracy is the use
of significant digits. The goal is to use only as many digits to represent the result of a measurement
or calculation as are justified by the precision of the data.
Issues such as significant digits, precision, bias, and accuracy are sometimes overlooked, but they
are important for data mining. Without some understanding of the accuracy of the data and the
results, an analyst runs the risk of committing serious data analysis blunders.
Data Quality - Measurement and Data Collection Issues
Outliers (Anomalous)
Outliers are either (1) data objects that, in some sense, have characteristics that are different from
most of the other data objects in the data set, or (2) values of an attribute that are unusual with
respect to the typical values for that attribute.
It is important to distinguish between the notions of noise and outliers. Outliers can be legitimate
data objects or values. Thus, unlike noise, outliers may sometimes be of interest.
Missing Values
It is not unusual for an object to be missing one or more attribute values. There may be various
reasons for it, e.g., the information was not collected, some attributes are not applicable to all
objects. However, they should be considered seriously during the data analysis.
There are several strategies (and variations on these strategies) for dealing with missing data, each
of which may be appropriate in certain circumstances.
Data Quality - Measurement and Data Collection Issues
Ways to handle the missing values
Eliminate Data Objects or Attributes: A simple and effective strategy is to eliminate objects with
missing values. However, even a partially specified data object contains some information, and if
many objects have missing values, then a reliable analysis can be difficult or impossible.
Nonetheless, if a data set has only a few objects that have missing values, then it may be expedient
to omit them. A related strategy is to eliminate attributes that have missing values. This should be
done with caution, however, since the eliminated attributes may be the ones that are critical to the
analysis.
Estimate Missing Values Sometimes missing data can be reliably estimated. For example:
i. In time series data, the missing values can be estimated (interpolated) by using the remaining
values.
ii. If a data set has many similar data points, then the attribute values of the points closest to the
point with the missing value are often used to estimate the missing value.
a) If the attribute is continuous, then the average attribute value of the nearest neighbors is
used.
b) If the attribute is categorical, then the most commonly occurring attribute value can be
taken.
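The following sketch, assuming pandas and made-up values, illustrates estimation of missing values: interpolation for a time series, and a simplified version of the rules above that uses the overall mean for a continuous attribute and the most common value for a categorical one.

    import numpy as np
    import pandas as pd

    # Time series with a missing value: estimate it by interpolating the remaining values.
    series = pd.Series([20.1, 20.4, np.nan, 21.0, 21.3])
    print(series.interpolate())                    # linear interpolation fills the gap

    # Tabular data (hypothetical): fill a continuous attribute with the mean
    # and a categorical attribute with the most commonly occurring value.
    df = pd.DataFrame({"cgpa": [7.8, np.nan, 6.8, 9.1],
                       "year": ["Freshman", "Senior", None, "Senior"]})
    df["cgpa"] = df["cgpa"].fillna(df["cgpa"].mean())
    df["year"] = df["year"].fillna(df["year"].mode()[0])
    print(df)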
Data Quality - Measurement and Data Collection Issues
Ignore the Missing Value during Analysis Many data mining approaches can be modified to ignore
missing values. For example, suppose that objects are being clustered and the similarity between
pairs of data objects needs to be calculated. If one or both objects of a pair have missing values for
some attributes, then the similarity can be calculated by using only the attributes that do not have
missing values. It is true that the similarity will only be approximate, but unless the total number of
attributes is small or the number of missing values is high, this degree of inaccuracy may not matter
much. Likewise, many classification schemes can be modified to work with missing values.
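A minimal sketch of this idea, assuming NumPy and that missing values are encoded as NaN: the distance is computed only over the attributes that are present in both objects.

    import numpy as np

    def euclidean_ignore_missing(x, y):
        """Euclidean distance over the attributes that are non-missing in both objects."""
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        mask = ~np.isnan(x) & ~np.isnan(y)        # attributes present in both objects
        if not mask.any():
            return np.nan                         # nothing to compare
        return np.sqrt(np.sum((x[mask] - y[mask]) ** 2))

    print(euclidean_ignore_missing([1.0, np.nan, 3.0], [2.0, 5.0, np.nan]))   # only attribute 0 is used -> 1.0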
Data Quality - Measurement and Data Collection Issues
Inconsistent Values
Data can contain inconsistent values. Consider an address field, where both a zip code and city are
listed, but the specified zip code area is not contained in that city.
Some types of inconsistencies are easy to detect. For instance, a person's height should not be
negative.
In other cases, it can be necessary to consult an external source of information. The correction of an
inconsistency requires additional or redundant information.
Data Quality - Measurement and Data Collection Issues
Duplicate Data
A data set may include data objects that are duplicates, or almost duplicates, of one another.
Deduplication is the process of dealing with these issues. The following two main issues must be
addressed during deduplication.
(1) If there are two (data) objects that actually represent a single object (entity in the real-world),
then the values of corresponding attributes may differ, and these inconsistent values must be
resolved.
(2) Care needs to be taken to avoid accidentally combining data objects that are similar, but not
duplicates, e.g., two distinct people with identical names.
In some cases, two or more objects are identical with respect to the attributes measured by the
database, but they still represent different objects. Here, the duplicates are legitimate, but may still
cause problems for some algorithms if the possibility of identical objects is not specifically
accounted for in their design.
Data Quality - Issues Related to Applications
Data quality issues can also be considered from an application viewpoint, as expressed by the
statement "data is of high quality if it is suitable for its intended use."
Timeliness: Some data starts to age as soon as it has been collected, e.g., purchasing behavior of
customers, web browsing patterns. If the data is out of date, then so are the models and patterns
that are based on it.
Relevance: The available data must contain the information necessary for the application. For
example, building a model that predicts the accident rate for drivers from a data set that omits the
age and gender of the driver is not very useful.
Making sure that the objects in a data set are relevant is also challenging. A common problem is
sampling bias, which occurs when a sample does not contain different types of objects in proportion
to their actual occurrence in the population. Because the results of a data analysis can reflect only
the data that is present, sampling bias results in an erroneous analysis.
Data Quality - Issues Related to Applications
Knowledge about the Data: Ideally, data sets are accompanied by documentation that describes
different aspects of the data; the quality of this documentation can either aid or hinder the
subsequent analysis.
For example, if the documentation identifies several attributes as being strongly related, these
attributes are likely to provide highly redundant information, and we may decide to keep just
one. (Consider sales tax and purchase price.) If the documentation is poor, however, and fails to
tell us, for example, that the missing values for a particular field are indicated with a -9999, then
our analysis of the data may be faulty.
Other important characteristics are the precision of the data, the type of features (nominal, ordinal,
interval, ratio), the scale of measurement (e.g., meters or feet for length), and the origin of the data.
Data Preprocessing
Data preprocessing is applied to make the data more suitable for data mining. It consists of a
number of different strategies and techniques that are interrelated in complex ways. The most
important ideas and approaches are as follows.
• Aggregation
• Sampling
• Dimensionality reduction
• Feature subset selection
• Feature creation
• Discretization and binarization
• Variable transformation
Roughly speaking, these items fall into two categories: selecting data objects and attributes for the
analysis or creating/changing the attributes. In both cases the goal is to improve the data mining
analysis with respect to time, cost, and quality.
Data Preprocessing - Aggregation
Sometimes less is more, and this is the case with aggregation, the combining of two or more objects
into a single object. An obvious issue is how to combine the values of each attribute across the
records being combined. Quantitative attributes are typically aggregated by taking a sum or an average, and
qualitative attributes are either omitted or summarized.
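A minimal sketch of aggregation with pandas (column names are made up): daily sales records are combined into one record per store and month, summing the quantitative attribute and omitting a qualitative one.

    import pandas as pd

    # Daily store sales (hypothetical data).
    sales = pd.DataFrame({
        "store":  ["S1", "S1", "S2", "S2", "S2"],
        "month":  ["Jan", "Jan", "Jan", "Feb", "Feb"],
        "amount": [120.0, 80.0, 200.0, 150.0, 60.0],    # quantitative: aggregated by summing
        "item":   ["pen", "book", "pen", "pen", "book"] # qualitative: omitted in the aggregate
    })
    monthly = sales.groupby(["store", "month"], as_index=False)["amount"].sum()
    print(monthly)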
Data Preprocessing – Sampling
Sampling is a commonly used approach for selecting a subset of the data objects to be analyzed. It is
done because it is too expensive or time consuming to process all the data. In some cases, using a
sampling algorithm can reduce the data size to the point where a better, but more expensive
algorithm can be used.
The key principle for effective sampling is that the sample should be a representative of the entire
data set. A sample is representative if it has approximately the same property (of interest) as the
original set of data.
Because sampling is a statistical process, the representativeness of any particular sample will vary,
and the best that we can do is choose a sampling scheme that guarantees a high probability of
getting a representative sample. It involves choosing the appropriate sample size and sampling
techniques.
Data Preprocessing – Sampling … Contd
Sampling Approaches
There are two variations of sampling: (1) sampling without replacement — as each item is selected,
it is removed from the set of all objects that together constitute the population, and (2) sampling
with replacement — objects are not removed from the population as they are selected for the
sample; hence, the same object can be picked more than once. The samples produced by the two
methods are not much different when sample size is relatively small compared to the data set size.
The simplest type of sampling is simple random sampling. Here, there is an equal probability of
selecting any particular item. When the population consists of different types of objects, with widely
different numbers of objects, simple random sampling can fail to adequately represent those types
of objects that are less frequent. It can cause problems when the analysis requires proper
representation of all object types.
Hence, a sampling scheme that can accommodate differing frequencies for the items of interest is
needed. Stratified sampling is one such method. In the simplest version, equal numbers of objects
are drawn from each group even though the groups are of different sizes. In another variation, the
number of objects drawn from each group is proportional to the size of that group.
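A minimal sketch of these sampling approaches with pandas (the group sizes are made up): simple random sampling with and without replacement, and proportional stratified sampling.

    import pandas as pd

    df = pd.DataFrame({"group": ["A"] * 90 + ["B"] * 10, "value": range(100)})

    without_repl = df.sample(n=20, replace=False, random_state=0)   # sampling without replacement
    with_repl    = df.sample(n=20, replace=True,  random_state=0)   # the same object can be picked twice

    # Proportional stratified sampling: draw from each group in proportion to its size.
    stratified = df.groupby("group").sample(frac=0.2, random_state=0)
    print(stratified["group"].value_counts())    # about 18 objects from A and 2 from B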
Data Preprocessing – Sampling … Contd
Once a sampling technique is selected, it is necessary to choose the sample size. Larger sample sizes
increase the probability that a sample will be representative, but they also eliminate much of the
advantage of sampling. Conversely, with smaller sample sizes, patterns may be missed or erroneous
patterns can be detected.
Figure (a) shows a data set that contains 8000 two-dimensional points, while Figures (b) and (c) show samples from this data set of size
2000 and 500, respectively. Although most of the structure of this data set is present in the sample of 2000 points, much of the structure
is missing in the sample of 500 points.
Data Preprocessing – Sampling … Contd
Progressive Sampling: The proper sample size can be difficult to determine, so adaptive or
progressive sampling schemes are sometimes used. These approaches start with a small sample,
and then increase the sample size until a sample of sufficient size has been obtained. While this
technique eliminates the need to determine the correct sample size initially, it requires that there be
a way to evaluate the sample to judge if it is large enough.
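A minimal sketch of progressive sampling, assuming the data is a NumPy array and that a user-supplied evaluate() function measures how good a sample is (for example, the accuracy of a model trained on it); the doubling schedule and the improvement-based stopping rule are just one possible choice.

    import numpy as np

    def progressive_sample(data, evaluate, start=100, growth=2.0, tol=0.01, random_state=0):
        """Grow the sample until evaluate() stops improving by more than tol."""
        rng = np.random.default_rng(random_state)
        size, prev_score = start, -np.inf
        while size <= len(data):
            sample = data[rng.choice(len(data), size=size, replace=False)]
            score = evaluate(sample)              # e.g., accuracy of a model fit on the sample
            if score - prev_score < tol:          # no meaningful improvement: the sample is large enough
                return sample
            prev_score, size = score, int(size * growth)
        return data                               # fall back to the full data set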
Data Preprocessing – Dimensionality Reduction
There are a variety of benefits to dimensionality reduction.
1) Many data mining algorithms work better if the dimensionality — the number of attributes in
the data — is lower. This is partly because dimensionality reduction can eliminate irrelevant features
and reduce noise, and partly because it can lessen the effects of the curse of dimensionality.
2) It can lead to a more understandable model because the model may involve fewer attributes.
3) It may allow easy data visualization. Even if dimensionality reduction doesn’t reduce the data to
two or three dimensions, data is often visualized by looking at pairs or triplets of attributes, and
the number of such combinations is greatly reduced.
4) The amount of time and memory required by the data mining algorithm is reduced with a
reduction in dimensionality.
The term dimensionality reduction is often reserved for those techniques that reduce the
dimensionality of a data set by creating new attributes that are a combination of the old attributes.
The reduction of dimensionality by selecting new attributes that are a subset of the old is known as
feature subset selection or feature selection.
Data Preprocessing – Dimensionality Reduction
The curse of dimensionality refers to the phenomenon that many types of data analysis become
significantly harder as the dimensionality of the data increases. Specifically, as dimensionality
increases, the data becomes increasingly sparse in the space that it occupies.
For classification, this can mean that there are not enough data objects to allow the creation of a
model that reliably assigns a class to all possible objects. For clustering, the definitions of density
and the distance between points, which are critical for clustering, become less meaningful.
Principal Components Analysis (PCA) and Singular Value Decomposition (SVD) are most common
approaches for dimensionality reduction in continuous data. They use linear algebra to project the
data from a high-dimensional space into a lower-dimensional space.
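As a brief sketch (assuming scikit-learn is available), PCA can project continuous data onto the few directions of largest variance; the data below is synthetic.

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))                    # 200 objects, 10 continuous attributes
    X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=200)    # make one attribute nearly redundant

    pca = PCA(n_components=2)                         # keep the two directions of largest variance
    X_reduced = pca.fit_transform(X)
    print(X_reduced.shape)                            # (200, 2)
    print(pca.explained_variance_ratio_)              # variance captured by each new attribute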
Data Preprocessing – Feature Subset Selection
Another way to reduce the dimensionality is to use only a subset of the features. It may seem that
such an approach will lose information but this is not the case if redundant and irrelevant features
are present.
Redundant features duplicate much or all of the information contained in one or more other
attributes. Irrelevant features contain almost no useful information for the data mining task at
hand. Such features can reduce classification accuracy and the quality of the clusters.
While some of these attributes can be eliminated immediately by using common sense or domain
knowledge, selecting the best subset of features frequently requires a systematic approach.
The ideal approach to feature selection is to try all possible subsets of features as input to the data
mining algorithm, and then take the subset that produces the best results. This method has the
advantage of reflecting the objective and bias of the data mining algorithm that will eventually be
used.
Unfortunately, since the number of subsets involving n attributes is 2^n, such an approach is
impractical in most situations and alternative strategies are needed. There are three standard
approaches to feature selection: embedded, filter, and wrapper.
Data Preprocessing – Feature Subset Selection … Contd
Embedded approaches
Feature selection occurs naturally as part of the data mining algorithm. Specifically, during the
operation of the data mining algorithm, the algorithm itself decides which attributes to use and
which to ignore.
Filter approaches
Features are selected before the data mining algorithm is run, using some approach that is
independent of the data mining task. For example, we might select sets of attributes whose pairwise
correlation is as low as possible.
Wrapper approaches
These methods use the target data mining algorithm as a black box to find the best subset of
attributes but typically without enumerating all possible subsets.
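The sketch below contrasts the two ideas under some assumptions of mine: the filter criterion drops one attribute from each highly correlated pair (the 0.9 threshold is arbitrary), and the wrapper step scores a candidate subset by cross-validating a particular classifier, here logistic regression, which stands in for whatever target algorithm is actually used.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def filter_by_correlation(X: pd.DataFrame, threshold=0.9):
        """Filter approach: keep attributes whose pairwise correlation with already-kept ones is low."""
        corr = X.corr().abs()
        keep = []
        for col in X.columns:
            if all(corr.loc[col, k] < threshold for k in keep):
                keep.append(col)
        return keep

    def wrapper_score(X: pd.DataFrame, y, subset):
        """Wrapper approach: evaluate a subset by running the target algorithm itself."""
        model = LogisticRegression(max_iter=1000)
        return cross_val_score(model, X[list(subset)], y, cv=5).mean()

A search strategy (see the architecture described below) would call wrapper_score repeatedly on candidate subsets rather than enumerating all of them.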
Data Preprocessing – Feature Subset Selection … Contd
An Architecture for Feature Subset Selection
It is possible to encompass both the filter and wrapper approaches within a common architecture.
The feature selection process is viewed as consisting of four parts: (i) a measure for evaluating a
subset, (ii) a search strategy that controls the generation of a new subset of features, (iii) a
stopping criterion, and (iv) a validation procedure.
Filter methods and wrapper methods differ only in the way in which they evaluate a subset of
features. For a wrapper method, subset evaluation uses the target data mining algorithm, while for
a filter approach, the evaluation technique is distinct from the target data mining algorithm.
Conceptually, feature subset selection is a search over all possible subsets of features. Many
different types of search strategies can be used, but the search strategy should be computationally
inexpensive and should find optimal or near optimal sets of features. It is usually not possible to
satisfy both requirements, and thus, tradeoffs are necessary.
Data Preprocessing – Feature Subset Selection … Contd
Evaluation Strategy
An integral part of the search is an evaluation step to judge how the current subset of features
compares to others that have been considered.
This requires an evaluation measure that attempts to determine the goodness of a subset of
attributes with respect to a particular data mining task, such as classification or clustering.
For the filter approach, such measures attempt to predict how well the actual data mining algorithm
will perform on a given set of attributes.
For the wrapper approach, where evaluation consists of actually running the target data mining
application, the subset evaluation function is simply the criterion normally used to measure the
result of the data mining.
Data Preprocessing – Feature Subset Selection … Contd
Stopping Criteria
Because the number of subsets can be enormous and it is impractical to examine them all, some
sort of stopping criterion is necessary.
This strategy is usually based on one or more conditions involving the following:
• The number of iterations,
• Whether the value of the subset evaluation measure is optimal or exceeds a certain threshold,
• Whether a subset of a certain size has been obtained,
• Whether simultaneous size and evaluation criteria have been achieved, and
• Whether any improvement can be achieved by the options available to the search strategy.
Data Preprocessing – Feature Subset Selection … Contd
Validation
Finally, once a subset of features has been selected, the results of the target data mining algorithm
on the selected subset should be validated.
A straightforward evaluation approach is to run the algorithm with the full set of features and
compare the full results to results obtained using the subset of features. Hopefully, the subset of
features will produce results that are better than or almost as good as those produced when using
all features.
Another validation approach is to use a number of different feature selection algorithms to obtain
subsets of features and then compare the results of running the data mining algorithm on each
subset.
Data Preprocessing – Feature Subset Selection … Contd
Feature Weighting
Feature weighting is an alternative to keeping or eliminating features: more important features are
assigned a higher weight, while less important features are given a lower weight.
These weights are sometimes assigned based on domain knowledge about the relative importance
of features.
Alternatively, they may be determined automatically. For example, some classification schemes,
such as support vector machines, produce classification models in which each feature is given a
weight.
Features with larger weights play a more important role in the model.
Data Preprocessing – Feature Creation
Usually it is possible to create a new set of reduced attributes from the original attributes that
captures the important information in a data set much more effectively. It allows us to reap all the
previously described benefits of dimensionality reduction. Three popular methodologies for creating
new attributes are: (i) feature extraction, (ii) mapping the data to a new space, and (iii) feature
construction.
Data Preprocessing – Feature Creation … Contd
Feature Extraction
The creation of a new set of features from the original raw data is known as feature extraction.
A totally different view of the data can reveal important and interesting features.
Example: Consider time series data, which often contains periodic patterns. If there is only a single
periodic pattern and not much noise, then the pattern is easily detected. However, if there are a
number of periodic patterns and a significant amount of noise is present, then these patterns are
hard to detect. Nonetheless, such patterns can be detected by applying a Fourier transform to the
time series to change to a representation in which frequency information is explicit (because, for
each time series, the Fourier transform produces a new data object whose attributes are related to
frequencies).
Many other sorts of transformations are also possible. Besides the Fourier transform, the wavelet
transform has also proven very useful for time series and other types of data.
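A small sketch of this kind of feature extraction with NumPy's FFT; the two frequencies and the noise level are made up.

    import numpy as np

    t = np.linspace(0, 1, 500, endpoint=False)
    # Two periodic patterns (7 Hz and 31 Hz) buried in noise (hypothetical signal).
    series = (np.sin(2 * np.pi * 7 * t) + 0.5 * np.sin(2 * np.pi * 31 * t)
              + 0.8 * np.random.default_rng(0).normal(size=t.size))

    spectrum = np.abs(np.fft.rfft(series))            # magnitude at each frequency: the new features
    freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])
    print(freqs[np.argsort(spectrum)[-2:]])           # the two dominant frequencies, about 31 and 7 Hz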
Data Preprocessing – Feature Creation … Contd
Mapping the Data to a New Space
Figure: (b) is the sum of three other time series, two of which are shown in (a) and the third time series is random
noise. (c) shows the power spectrum computed after applying a Fourier transform to the original time series.
Data Preprocessing – Feature Creation … Contd
Feature Construction
Sometimes the features in the original data sets have the necessary information, but it is not in a
form suitable for the data mining algorithm. In this situation, one or more new features constructed
out of the original features can be more useful than the original features.
Example: Consider a data set consisting of information about historical artifacts, which, along with
other information, contains the volume and mass of each artifact. For simplicity, assume that these
artifacts are made of a small number of materials (wood, clay, bronze, gold) and that we want to
classify the artifacts with respect to the material of which they are made. In this case, a density
feature constructed from the mass and volume features, i.e., density = mass/volume, would most
directly yield an accurate classification.
Though there have been some attempts to automatically perform feature construction by exploring
simple mathematical combinations of existing attributes, the most common approach is to construct
features using domain expertise.
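A minimal sketch of this example, assuming pandas and made-up mass and volume values:

    import pandas as pd

    # Hypothetical artifacts: mass in kilograms, volume in cubic metres.
    artifacts = pd.DataFrame({
        "mass":   [1.0, 8.9, 19.3, 0.6],
        "volume": [0.002, 0.001, 0.001, 0.001],
    })
    # Constructed feature: density separates the materials far better than mass or volume alone.
    artifacts["density"] = artifacts["mass"] / artifacts["volume"]
    print(artifacts)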
Data Preprocessing – Discretization and Binarization
Some data mining algorithms require the data to be in a specific format. For example,
some classification algorithms require the data to be in the form of categorical attributes, while algorithms
that find association patterns require that the data be in the form of binary attributes. Thus, it is
often necessary to transform a continuous attribute into a categorical attribute (discretization), and
both continuous and discrete attributes may need to be transformed into one or more binary
attributes (binarization).
Additionally, if a categorical attribute has a large number of values (categories), or some values
occur infrequently, then it may be beneficial for certain data mining tasks to reduce the number of
categories by combining some of the values.
As with feature selection, the best discretization and binarization approach is the one that produces
the best result for the data mining algorithm that will be used to analyze the data. However, it is
not practical to apply such a criterion directly. Therefore, discretization or binarization is performed
in a way that satisfies a criterion that is thought to have a relationship to good performance for the
considered data mining task.
Data Preprocessing – Discretization and Binarization … Contd
Binarization
A simple technique to binarize a categorical attribute is the following: If there are m categorical
values, then uniquely assign each original value to an integer in the interval [0, m−1].
If the attribute is ordinal, then order must be maintained by the assignment.
Even if the attribute is originally represented using integers, this process is necessary if the integers
are not in the interval [0,m−1].
Next, convert each of these m integers to a binary number. Since n = ⌈log2(m)⌉ binary digits are
required to represent these integers, represent these binary numbers using n binary attributes.
Table: Example conversion of categorical values to integer values and then to n binary attributes x1, x2, x3.
If the number of resulting attributes is too large, then the techniques that reduce the number of
categorical values before binarization can be used.
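The following sketch implements the integer-then-binary-digit encoding described above for a hypothetical ordinal attribute with five values, and also shows the common alternative of one binary (one-hot) attribute per value; it assumes pandas is available.

    import math
    import pandas as pd

    values = ["poor", "fair", "OK", "good", "wonderful"]   # ordinal, so order must be preserved
    to_int = {v: i for i, v in enumerate(values)}          # integers in [0, m-1]
    m = len(values)
    n = math.ceil(math.log2(m))                            # number of binary digits needed

    def to_binary_attributes(value):
        i = to_int[value]
        return [int(b) for b in format(i, f"0{n}b")]       # n binary attributes x1..xn

    data = pd.Series(["good", "poor", "wonderful"])
    print(data.map(to_binary_attributes).tolist())         # [[0, 1, 1], [0, 0, 0], [1, 0, 0]]

    # One binary attribute per categorical value (one-hot encoding) is a common alternative.
    print(pd.get_dummies(data).astype(int))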
Data Preprocessing – Discretization and Binarization … Contd
Discretization of Continuous Attributes
Discretization is primarily applied to attributes that are used in classification or association analysis.
In general, the best discretization depends on the algorithm being used, as well as the other
attributes being considered. However, the discretization of an attribute is considered in isolation.
Transformation of a continuous attribute to a categorical attribute involves two subtasks:
• deciding the number of categories and
• determining how to map the values of the continuous attribute to these categories.
In the first step, after the values of the continuous attribute are sorted, they are divided into n
intervals by specifying n−1 split points. In the second step, all the values in one interval are mapped to
the same categorical value.
Therefore, the problem of discretization is one of deciding how many split points to choose and
where to place them. The result can be represented either as a set of intervals {(x0, x1], (x1, x2], . . . ,
(xn−1, xn]}, where x0 and xn may be −∞ or +∞, respectively, or equivalently, as a series of
inequalities x0 < x ≤ x1, . . . , xn−1 < x ≤ xn.
Data Preprocessing – Discretization and Binarization … Contd
Unsupervised Discretization
A basic distinction between discretization methods for classification is whether class information is
used (supervised) or not (unsupervised).
If class information is not used, then relatively simple approaches are common.
• The equal width approach divides the range of the attribute into a user-specified number of
intervals each having the same width. However, it can be badly affected by outliers.
• The equal frequency (equal depth) approach tries to put the same number of objects into
each interval. It is preferred as it mitigates the above problem.
• A clustering method, e.g., K-means, can also be used for unsupervised discretization.
• Visually inspecting the data can sometimes be an effective approach for unsupervised
discretization.
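The equal width and equal frequency approaches above can be sketched with pandas as follows; the skewed attribute is synthetic.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    x = pd.Series(rng.exponential(scale=10, size=1000))    # a skewed continuous attribute

    equal_width = pd.cut(x, bins=4)     # intervals of equal width; counts are badly affected by the long tail
    equal_freq  = pd.qcut(x, q=4)       # intervals holding (roughly) equal numbers of objects

    print(equal_width.value_counts().sort_index())
    print(equal_freq.value_counts().sort_index())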
Data Preprocessing – Discretization and Binarization … Contd
Supervised Discretization
However, discretization methods that use class labels often produce better results. Unsupervised
discretization can perform poorly because an interval constructed with no knowledge of class
labels often contains a mixture of class labels.
A conceptually simple approach is to place the splits in a way that maximizes the purity of the
intervals. It requires (1) decisions about the purity of an interval and (2) the minimum size of an
interval. Therefore, some statistically based approaches start with each attribute value as a separate
interval and create larger intervals by merging adjacent intervals that are similar according to a
statistical test. A common, entropy-based measure of the purity of an interval is defined as follows. Let m_i be
the number of values in the ith interval of a partition and m_ij be the number of values of class j in that
interval. Then the entropy of the ith interval is

e_i = -\sum_{j=1}^{k} p_{ij} \log_2 p_{ij}

where k is the number of classes and p_ij = m_ij / m_i is the fraction of values of class j in the ith interval.
The total entropy, e, of the partition is the weighted average of the individual interval entropies:

e = \sum_{i=1}^{n} w_i e_i

Here, m is the number of values, w_i = m_i / m is the fraction of values in the ith interval, and n is the
number of intervals. Hence, the entropy of an interval is a measure of the purity of an interval.
If an interval contains only values of one class (is perfectly pure), then the entropy is 0 and it
contributes nothing to the overall entropy. If the classes of values in an interval occur equally often
(the interval is as impure as possible), then the entropy is maximum.
Data Preprocessing – Discretization and Binarization … Contd
Supervised Discretization
A simple approach for partitioning a continuous attribute starts by bisecting the initial values so that
the resulting two intervals give minimum entropy. If the intervals contain ordered sets of values, it is
required to consider each value as a possible split point. The splitting process is then repeated with
another interval, typically choosing the interval with the worst (highest) entropy, until a user-
specified number of intervals is reached, or a stopping criterion is satisfied.
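A minimal sketch of the entropy-based bisection step described above, assuming NumPy arrays of attribute values and class labels; repeatedly applying best_split to the interval with the highest entropy would yield the full procedure.

    import numpy as np

    def entropy(labels):
        """Entropy of the class labels in one interval (0 when the interval is pure)."""
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def best_split(values, labels):
        """Split point that minimises the weighted entropy of the two resulting intervals."""
        order = np.argsort(values)
        values, labels = values[order], labels[order]
        best = (np.inf, None)
        for i in range(1, len(values)):
            if values[i] == values[i - 1]:
                continue                              # only split between distinct values
            w = i / len(values)
            e = w * entropy(labels[:i]) + (1 - w) * entropy(labels[i:])
            best = min(best, (e, (values[i - 1] + values[i]) / 2))
        return best                                   # (weighted entropy, split point)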
If the categorical attribute is an ordinal attribute, then techniques similar to those for continuous
attributes can be used to reduce the number of categories.
If the categorical attribute is nominal, however, then it is better to use domain knowledge. If domain
knowledge does not serve the purpose or results in poor classification performance, then it is
necessary to use a more empirical approach, such as grouping values together only if such a
grouping results in improved classification accuracy or achieves some other data mining objective.
Data Preprocessing – Variable Transformation
It refers to a transformation that is applied to all the values of a variable (attribute). The two
important types of variable transformations are simple functional transformations and
normalization.
Simple Functions
For this type of variable transformation, a simple mathematical function is applied to each value
individually. If x is a variable, then examples of such transformations include x^k, log x, e^x, √x, 1/x, sin x,
and |x|. The square root, log, and 1/x transformations are often used to transform data that does not have a Gaussian
(normal) distribution.
Variable transformations should be applied with caution since they change the nature of the
data. While this is often what is desired, there can be problems if the nature of the transformation is not
fully appreciated. For example, the transformation 1/x reduces the magnitude of values that are 1
or larger, but increases the magnitude of values between 0 and 1. Thus, for any set of positive values, the
transformation 1/x reverses the order.
Data Preprocessing – Variable Transformation… Contd
Normalization or Standardization
Another common type of variable transformation is the standardization or normalization of a variable. The
goal of standardization or normalization is to make an entire set of values have a particular property. For
example, if x̄ is the mean (average) of the attribute values and s_x is their standard deviation, then the
transformation y = (x − x̄) / s_x creates a new variable that has a mean of 0 and a standard deviation of 1.
If different variables are to be combined in some way, then such a transformation is necessary to avoid
having a variable with large values dominate the results of the calculation. For example, comparing
people based on two variables: age and income.
As the mean and standard deviation are strongly affected by outliers, they are sometimes replaced by the
median, i.e., the middle value, and the absolute standard deviation, respectively. The absolute standard
deviation of a variable x is given as follows.

\sigma_A = \sum_{i=1}^{m} |x_i − µ|

Here, x_i is the ith value of the variable, m is the number of objects, and µ is either the mean or median.
Other approaches for computing estimates of the location (center) and spread of a set of values can also
be used to define a standardization transformation.
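A small sketch of both variants with NumPy, using a made-up set of values that contains one outlier; the absolute standard deviation follows the definition above.

    import numpy as np

    x = np.array([12.0, 15.0, 14.0, 13.0, 250.0])     # the last value is an outlier

    z = (x - x.mean()) / x.std(ddof=1)                # standard z-score: mean 0, standard deviation 1

    # Robust variant: the median and the absolute standard deviation are less affected by the outlier.
    mu = np.median(x)
    sigma_a = np.sum(np.abs(x - mu))                  # absolute standard deviation, as defined above
    robust = (x - mu) / sigma_a
    print(np.round(z, 2))
    print(np.round(robust, 3))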
Similarity and Dissimilarity Measures
Similarity and dissimilarity are important because they are used by a number of data mining
techniques, such as clustering, nearest neighbor classification, and anomaly detection.
In many cases, the initial data set is not needed once these similarities or dissimilarities have been
computed. Such approaches can be viewed as transforming the data to a similarity (dissimilarity)
space and then performing the analysis.
Since the proximity (proximity refers to either similarity or dissimilarity) between two objects is a
function of the proximity between the corresponding attributes of the two objects, we shall first
discuss measurement of the proximity between objects having only one simple attribute, and then
consider proximity measures for objects with multiple attributes.
The similarity between two objects is a numerical measure of the degree to which the two objects
are alike. Similarities are usually non-negative and are often between 0 (no similarity) and 1
(complete similarity).
The dissimilarity between two objects is a numerical measure of the degree to which the two
objects are different. Dissimilarities sometimes fall in the interval [0, 1], but it is also common for
them to range from 0 to ∞. Sometimes, the term distance is used as a synonym for dissimilarity.
Similarity and Dissimilarity Measures
Transformations
Transformations are often applied to convert a similarity to a dissimilarity, or vice versa, or to
transform a proximity measure to fall within a particular range, e.g., [0,1].
Usually, proximity measures, especially similarities, are defined or transformed to have values in the
interval [0,1]. Such a transformation is often relatively straightforward.
For example:
• If the similarities between objects range from 1 (not at all similar) to 10 (completely similar), we
can make them fall within the range [0, 1] by using the transformation s’ = (s−1)/9, where s and
s’ are the original and new similarity values, respectively.
• In the more general case, the transformation of similarities to the interval [0, 1] is given by the
expression s’ = (s − min_s)/(max_s−min_s), where max_s and min_s are the maximum and
minimum similarity values, respectively.
• Similarly, dissimilarity measures with a finite range can be mapped to the interval [0,1] by using
the formula d’ = (d − min_d)/(max_d − min_d).
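As a tiny sketch of the min-max transformation above (assuming NumPy):

    import numpy as np

    def to_unit_interval(values):
        """Map similarity or dissimilarity values with a finite range onto [0, 1]."""
        values = np.asarray(values, dtype=float)
        lo, hi = values.min(), values.max()
        return (values - lo) / (hi - lo)

    s = np.array([1, 4, 7, 10])          # similarities on a 1-to-10 scale
    print(to_unit_interval(s))           # 1 -> 0, 4 -> 1/3, 7 -> 2/3, 10 -> 1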
Similarity and Dissimilarity Measures
Transformations … Contd
However, there can be various complications in mapping proximity measures to the interval [0, 1].
For example, if the proximity measure originally takes values in the interval [0, ∞), then a non-
linear transformation is needed, and the values will not have the same relationship to one another
on the new scale.
Another complication is that the meaning of the proximity measure may be changed.
For example, correlation is a measure of similarity that takes values in the interval [-1,1].
Mapping these values to the interval [0,1] by taking the absolute value loses information about
the sign, which can be important in some applications.
Transforming similarities to dissimilarities and vice versa is also relatively straightforward, although
we again face the issues of preserving meaning and changing a linear scale into a non-linear scale.
• If the similarity (or dissimilarity) falls in the interval [0,1], then the dissimilarity can be defined as
d = 1−s (s = 1 − d).
• Another simple approach is to define similarity as the negative of the dissimilarity (or vice versa).
However, it is not restricted to the range [0, 1], but if that is desired, then transformations such
as s = 1/(d+1), s = e^{−d}, or s = 1 − (d − min_d)/(max_d − min_d) can be used.
Similarity and Dissimilarity Measures
Transformations … Contd
In general, any monotonic decreasing function can be used to convert dissimilarities to similarities,
or vice versa. Of course, other factors (e.g., the issues of preserving meaning, changing a linear
scale into a non-linear scale) must also be considered when transforming similarities to
dissimilarities, or vice versa, or when transforming the values of a proximity measure to a new
scale.
Similarity and Dissimilarity Measures
Similarity and Dissimilarity between Simple Attributes
Consider objects described by one nominal attribute. What would it mean for two such objects to
be similar? Since nominal attributes only convey information about the distinctness of objects, all
we can say is that two objects either have the same value or they do not. Hence, in this case
similarity is traditionally defined as 1 if attribute values match, and as 0 otherwise. A dissimilarity
would be defined in the opposite way: 0 if the attribute values match, and 1 if they do not.
For objects with a single ordinal attribute, the situation is more complicated because information
about order should be taken into account.
Consider an attribute that measures the quality of a product on the scale {poor, fair, OK, good,
wonderful}. It would seem reasonable that a product, P1, which is rated wonderful, would be
closer to a product P2, which is rated good, than it would be to a product P3, which is rated OK.
To make this observation quantitative, the values of the ordinal attribute are often mapped to
successive integers, beginning at 0 or 1, e.g., {poor=0, fair=1, OK=2, good=3, wonderful=4}.
Then, d(P1, P2) = 4 − 3 = 1 or, if we want the dissimilarity to fall between 0 and 1, d(P1, P2) =
(4−3)/4 = 0.25. A similarity for ordinal attributes can then be defined as s = 1− d.
Similarity and Dissimilarity Measures
Similarity and Dissimilarity between Simple Attributes
For interval or ratio attributes, the natural measure of dissimilarity between two objects is the
absolute difference of their values.
The similarity of interval or ratio attributes is typically expressed by transforming a dissimilarity into a
similarity, as discussed in previous slides.
Dissimilarities between Data Objects
Distances
The Euclidean distance, d, between two points, x and y, in an n-dimensional space is defined as follows.

d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k − y_k)^2}

Here, n is the number of dimensions and x_k and y_k are, respectively, the kth attributes (components)
of x and y.
The Euclidean distance measure is generalized by the Minkowski distance metric shown below.
d(x, y) = \left( \sum_{k=1}^{n} |x_k − y_k|^r \right)^{1/r}

Here, r is a parameter.
Similarity and Dissimilarity Measures
Dissimilarities between Data Objects … Contd
Distances
The following are the three most common examples of Minkowski distances.
• r = 1. City block (Manhattan, taxicab, L1 norm) distance. A common example is the Hamming
distance, which is the number of bits that are different between two objects that have only
binary attributes, i.e., between two binary vectors.
• r = ∞. Supremum (Lmax or L∞ norm) distance. This is the maximum difference between any
attribute of the objects. More formally, the L∞ distance is defined as follows.
d(x, y) = \lim_{r \to \infty} \left( \sum_{k=1}^{n} |x_k − y_k|^r \right)^{1/r}
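The distances above can be sketched in a few lines of NumPy; the two example points are made up.

    import numpy as np

    def minkowski(x, y, r):
        """Minkowski distance: r=1 is city block, r=2 is Euclidean, r=np.inf is the supremum distance."""
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        if np.isinf(r):
            return np.max(np.abs(x - y))
        return np.sum(np.abs(x - y) ** r) ** (1 / r)

    x, y = [0, 2], [3, 6]
    print(minkowski(x, y, 1))        # 7.0  (city block)
    print(minkowski(x, y, 2))        # 5.0  (Euclidean)
    print(minkowski(x, y, np.inf))   # 4.0  (supremum)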
Similarity and Dissimilarity Measures
Dissimilarities between Data Objects … Contd
Distances
Distances, such as the Euclidean distance, have some well-known properties. If d(x, y) is the
distance between two points, x and y, then the following properties hold.
1. Positivity
(a) d(x, y) ≥ 0 for all x and y,
(b) d(x, y) = 0 only if x = y.
2. Symmetry
d(x, y) = d(y, x) for all x and y.
3. Triangle Inequality
d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z.
The triangle inequality typically does not hold for similarities, but symmetry and positivity typically do.
If s(x, y) is the similarity between points x and y, then the typical properties of similarities are as
follows.
1. s(x, y) = 1 only if x = y. (0 ≤ s ≤ 1)
2. s(x, y) = s(y, x) for all x and y. (Symmetry)
Similarity and Dissimilarity Measures
Similarity Measures for Binary Data
Similarity measures between objects that contain only binary attributes are called similarity
coefficients, and typically have values between 0 and 1. A value of 1 indicates that the two objects
are completely similar, while a value of 0 indicates that the objects are not at all similar.
Let x and y be two objects that consist of n binary attributes. The comparison of two such objects,
i.e., two binary vectors, leads to the following four quantities (frequencies):
f00 = the number of attributes where x is 0 and y is 0
f01 = the number of attributes where x is 0 and y is 1
f10 = the number of attributes where x is 1 and y is 0
f11 = the number of attributes where x is 1 and y is 1
Simple Matching Coefficient One commonly used similarity coefficient is the simple matching
coefficient (SMC), which is defined as follows:
SMC = (number of matching attribute values) / (number of attributes)
    = (f11 + f00) / (f01 + f10 + f11 + f00)
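The sketch below computes the SMC for two binary vectors, together with the Jaccard coefficient, J = f11 / (f01 + f10 + f11), the standard companion measure for asymmetric binary attributes that the extended Jaccard coefficient (later in this section) generalizes; it assumes NumPy.

    import numpy as np

    def smc(x, y):
        """Simple matching coefficient: (f11 + f00) / (f00 + f01 + f10 + f11)."""
        x, y = np.asarray(x), np.asarray(y)
        return np.mean(x == y)

    def jaccard(x, y):
        """Jaccard coefficient: f11 / (f01 + f10 + f11); 0-0 matches are ignored."""
        x, y = np.asarray(x), np.asarray(y)
        f11 = np.sum((x == 1) & (y == 1))
        mismatches = np.sum(x != y)            # f01 + f10
        return f11 / (f11 + mismatches)

    x = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
    y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])
    print(smc(x, y))       # 0.7
    print(jaccard(x, y))   # 0.0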
Similarity and Dissimilarity Measures
Cosine Similarity
If x and y are two document vectors, then their cosine similarity is defined as follows.

cos(x, y) = (x · y) / (‖x‖ ‖y‖)

Here, · indicates the vector dot product, x · y = \sum_{k=1}^{n} x_k y_k, and ‖x‖ is the length of the vector x,
‖x‖ = \sqrt{\sum_{k=1}^{n} x_k^2} = \sqrt{x · x}.
Similarity and Dissimilarity Measures
Cosine Similarity
As indicated in the figure below, cosine similarity really is a measure of the (cosine of the) angle
between x and y. Thus, if the cosine similarity is 1, the angle between x and y is 0◦, and x and y are
the same except for magnitude (length). If the cosine similarity is 0, then the angle between x and y
is 90◦, and they do not share any terms (words).
Dividing x and y by their lengths normalizes them to have a length of 1. This means that cosine
similarity does not take the magnitude of the two data objects into account when computing
similarity.
For vectors with a length of 1, the cosine measure can be calculated by taking a simple dot product.
Consequently, when many cosine similarities between objects are being computed, normalizing the
objects to have unit length can reduce the time required.
Euclidean distance may be a better choice when magnitude is important.
Similarity and Dissimilarity Measures
The extended Jaccard coefficient (EJ), also known as the Tanimoto coefficient, can be used for
document data and that reduces to the Jaccard coefficient in the case of binary attributes. It is
defined as follows.
EJ(x, y) = (x · y) / (‖x‖^2 + ‖y‖^2 − x · y)
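Both measures can be sketched directly from their definitions with NumPy; the two term-frequency vectors are made up.

    import numpy as np

    def cosine(x, y):
        """Cosine similarity: the cosine of the angle between x and y."""
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        return x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))

    def extended_jaccard(x, y):
        """Extended Jaccard (Tanimoto) coefficient: x.y / (||x||^2 + ||y||^2 - x.y)."""
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        dot = x.dot(y)
        return dot / (x.dot(x) + y.dot(y) - dot)

    # Two document term-frequency vectors (hypothetical word counts).
    d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
    d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])
    print(round(cosine(d1, d2), 3))             # about 0.315
    print(round(extended_jaccard(d1, d2), 3))   # about 0.116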
Similarity and Dissimilarity Measures
Correlation
The correlation between two data objects that have binary or continuous variables is a measure of
the linear relationship between the attributes of the objects. More precisely, Pearson’s correlation
coefficient between two data objects, x and y, is defined by the following equation:
corr(x, y) = covariance(x, y) / (standard deviation(x) * standard deviation(y)) = s_xy / (s_x s_y)

Here,

covariance(x, y) = s_xy = \frac{1}{n − 1} \sum_{k=1}^{n} (x_k − x̄)(y_k − ȳ)

standard deviation(x) = s_x = \sqrt{ \frac{1}{n − 1} \sum_{k=1}^{n} (x_k − x̄)^2 }

standard deviation(y) = s_y = \sqrt{ \frac{1}{n − 1} \sum_{k=1}^{n} (y_k − ȳ)^2 }

x̄ = \frac{1}{n} \sum_{k=1}^{n} x_k is the mean of x, and ȳ = \frac{1}{n} \sum_{k=1}^{n} y_k is the mean of y.
Similarity and Dissimilarity Measures
Correlation
Correlation is always in the range −1 to 1. A correlation of 1 (−1) means that x and y have a perfect
positive (negative) linear relationship; that is, xk = ayk + b, where a and b are constants.
If the correlation is 0, then there is no linear relationship between the attributes of the two data
objects. However, non-linear relationships may still exist.
If we transform x and y by subtracting off their means and then normalizing them so that their
lengths are 1, then their correlation can be calculated by taking the dot product.
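A small sketch of the formula above with NumPy, using the kind of example just described (a perfect negative linear relationship):

    import numpy as np

    def pearson(x, y):
        """Pearson's correlation: covariance of x and y over the product of their standard deviations."""
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        xc, yc = x - x.mean(), y - y.mean()
        return np.sum(xc * yc) / (np.sqrt(np.sum(xc ** 2)) * np.sqrt(np.sum(yc ** 2)))

    x = np.array([-3, 6, 0, 3, -6])
    y = np.array([1, -2, 0, -1, 2])              # y = -x/3, a perfect negative linear relationship
    print(pearson(x, y))                         # -1.0
    print(np.corrcoef(x, y)[0, 1])               # same result with NumPy's built-in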
Similarity and Dissimilarity Measures
Issues in Proximity Calculation
Some general issues arise when computing proximity between objects:
1) How to handle the case in which attributes have different scales and/or are correlated?
2) How to calculate proximity between objects that are composed of different types of attributes,
e.g., quantitative and qualitative?
3) How to handle proximity calculation when attributes have different weights; i.e., when not all
attributes contribute equally to the proximity of objects?
Similarity and Dissimilarity Measures
Standardization and Correlation for Distance Measures
An important issue with distance measures is how to handle the situation when attributes do not
have the same range of values, i.e., the variables have different scales.
A related issue is how to compute distance when there is correlation between some of the
attributes, perhaps in addition to differences in the ranges of values.
A generalization of Euclidean distance, the Mahalanobis distance, is useful when attributes are
correlated, have different ranges of values (different variances), and the distribution of the data is
approximately Gaussian (normal).
The Mahalanobis distance between two objects (vectors) x and y is defined as follows.

mahalanobis(x, y) = \sqrt{ (x − y)^T \Sigma^{-1} (x − y) }

Here, \Sigma^{-1} is the inverse of the covariance matrix of the data.
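A minimal sketch with NumPy (SciPy's scipy.spatial.distance also provides a mahalanobis function); the correlated two-attribute data set below is synthetic.

    import numpy as np

    def mahalanobis(x, y, data):
        """Mahalanobis distance between x and y, using the covariance matrix of the data."""
        cov_inv = np.linalg.inv(np.cov(data, rowvar=False))     # inverse covariance of the data set
        diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
        return float(np.sqrt(diff @ cov_inv @ diff))

    rng = np.random.default_rng(0)
    a = rng.normal(size=500)
    data = np.column_stack([a, 3 * a + rng.normal(scale=0.5, size=500)])   # two correlated attributes
    print(mahalanobis([0, 0], [1, 3], data))    # small: the difference follows the correlation
    print(mahalanobis([0, 0], [1, -3], data))   # large: the difference goes against it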
Combining Similarities for Heterogeneous Attributes
When objects are described by attributes of different types, the per-attribute similarities can be combined as follows.
1. For the kth attribute, compute a similarity, s_k(x, y), in the range [0, 1].
2. Define an indicator variable, δ_k, for the kth attribute: δ_k = 0 if the kth attribute is an asymmetric
attribute and both objects have a value of 0, or if one of the objects has a missing value for the kth
attribute; otherwise, δ_k = 1.
3. Compute the overall similarity between the two objects using the following formula:

similarity(x, y) = \frac{ \sum_{k=1}^{n} δ_k s_k(x, y) }{ \sum_{k=1}^{n} δ_k }
Similarity and Dissimilarity Measures
Using Weights
In much of the previous discussion, all attributes were treated equally when computing proximity.
This is not desirable when some attributes are more important to the definition of proximity than
others.
To address these situations, the formulas for proximity can be modified by weighting the
contribution of each attribute. If the weights wk sum to 1, then similarity(x, y) becomes
similarity(x, y) = \frac{ \sum_{k=1}^{n} w_k δ_k s_k(x, y) }{ \sum_{k=1}^{n} δ_k }
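A minimal sketch of this weighted combination, assuming the per-attribute similarities s_k (each in [0, 1]), the indicators δ_k, and the weights w_k have already been computed; all the numbers are made up.

    import numpy as np

    def combined_similarity(s, delta, w):
        """Overall similarity from per-attribute similarities s_k, indicators delta_k, and weights w_k."""
        s, delta, w = (np.asarray(a, dtype=float) for a in (s, delta, w))
        return np.sum(w * delta * s) / np.sum(delta)

    s     = [0.8, 1.0, 0.4]    # per-attribute similarities
    delta = [1, 0, 1]          # the second attribute is asymmetric and 0 in both objects, so it is ignored
    w     = [0.5, 0.2, 0.3]    # weights summing to 1
    print(combined_similarity(s, delta, w))    # (0.5*0.8 + 0.3*0.4) / 2 = 0.26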