Unit 1 - IDS
Objects
temperature, etc.
● Attribute is also known as
variable, field, characteristic,
dimension, or feature
● A collection of attributes describes an
object
● Object is also known as record,
point, case, sample, entity, or
instance
Attribute Values
● Attribute values are numbers or symbols assigned to an attribute
for a particular object
● Addition : + and -
■ (Meaningful Differences)
● Multiplication : * and /
■ (Meaningful Ratios)
Types of Attributes
● There are 4 different types of attributes
● Nominal
● Examples: ID numbers, eye color, zip codes
● Ordinal
● Examples: rankings (e.g., taste of potato chips on a scale
from 1-10), grades, height {tall, medium, short}
● Interval
● Examples: calendar dates, temperatures in Celsius or
Fahrenheit.
● Ratio
● Examples: temperature in Kelvin, length, time, counts
● Nominal attribute : distinctness
● Ordinal attribute : distinctness & order
● Interval attribute : distinctness, order &
meaningful differences
● Ratio attribute : all 4
properties/operations
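The cumulative relationship between the attribute types and the properties listed above can be sketched as a small lookup table. The names `PROPERTIES` and `supports` are illustrative assumptions, not part of the course material.

```python
# Properties each attribute type possesses, following the list above.
PROPERTIES = {
    "nominal": {"distinctness"},
    "ordinal": {"distinctness", "order"},
    "interval": {"distinctness", "order", "differences"},
    "ratio": {"distinctness", "order", "differences", "ratios"},
}


def supports(attribute_type: str, prop: str) -> bool:
    """Return True if the given attribute type has the given property."""
    return prop in PROPERTIES[attribute_type]


print(supports("ordinal", "order"))    # True
print(supports("interval", "ratios"))  # False
```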
Difference Between Ratio and Interval
● Is it physically meaningful to say that a temperature of 10°
is twice that of 5° on
● the Celsius scale?
● the Fahrenheit scale?
● the Kelvin scale?
● Biased Scale
● Interval or Ratio
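One way to explore the temperature question is to convert each reading to Kelvin, which has a true zero, and compare the ratios. This is a minimal sketch; the function names are assumptions chosen for illustration.

```python
def celsius_to_kelvin(c: float) -> float:
    """Convert degrees Celsius to Kelvin."""
    return c + 273.15


def fahrenheit_to_kelvin(f: float) -> float:
    """Convert degrees Fahrenheit to Kelvin."""
    return (f - 32.0) * 5.0 / 9.0 + 273.15


# Ratio of the underlying physical temperatures for "10 degrees vs 5 degrees".
ratio_celsius = celsius_to_kelvin(10.0) / celsius_to_kelvin(5.0)
ratio_fahrenheit = fahrenheit_to_kelvin(10.0) / fahrenheit_to_kelvin(5.0)
ratio_kelvin = 10.0 / 5.0

print(round(ratio_celsius, 3))     # roughly 1.018, not 2
print(round(ratio_fahrenheit, 3))  # roughly 1.011, not 2
print(ratio_kelvin)                # 2.0
```

Only on the Kelvin scale does the ratio of the numbers equal the ratio of the physical quantities, which is why Celsius and Fahrenheit are treated as interval attributes and Kelvin as a ratio attribute.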
Key Messages for Attribute Types
● The types of operations you choose should be “meaningful” for the
type of data you have
● Distinctness, order, meaningful intervals, and meaningful ratios are
only four properties of data
● The data type you see – often numbers or strings – may not capture
all the properties or may suggest properties that are not there
● Analysis may depend on these other properties of the data
● Many statistical analyses depend only on the distribution
● Many times what is meaningful is measured by statistical
significance
● But in the end, what is meaningful is measured by the domain
Types of data sets
● Record
● Data Matrix
● Document Data
● Transaction Data
● Graph
● World Wide Web
● Molecular Structures
● Ordered
● Spatial Data
● Temporal Data
● Sequential Data
● Genetic Sequence Data
Important Characteristics of Data
● Dimensionality (number of attributes)
● High dimensional data brings a number of challenges
● Sparsity
● Only presence counts
● Resolution
● Patterns depend on the scale
● Size
● Type of analysis may depend on size of data
Record Data
● Data that consists of a collection of records, each of which
consists of a fixed set of attributes
Data Matrix
● If data objects have the same fixed set of numeric attributes,
then the data objects can be thought of as points in a multi-
dimensional space, where each dimension represents a distinct
attribute
● Such a data set can be represented by an m by n matrix, where
there are m rows, one for each object, and n columns, one for
each attribute
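As a rough sketch, a data matrix can be held as a list of rows, one row per object and one column per attribute. The attribute names and values below are made up for illustration only.

```python
# Hypothetical numeric attributes (columns); values are invented.
attributes = ["length", "width", "thickness"]

# One row per data object, one column per attribute.
data_matrix = [
    [10.2, 5.3, 1.2],
    [12.7, 6.3, 1.1],
    [11.4, 5.9, 1.3],
    [9.8, 5.0, 1.0],
]

m = len(data_matrix)     # number of objects (rows)
n = len(data_matrix[0])  # number of attributes (columns)


def column(matrix, j):
    """Return all values of attribute j, one per object."""
    return [row[j] for row in matrix]


print(m, n)                    # 4 3
print(column(data_matrix, 1))  # [5.3, 6.3, 5.9, 5.0]
```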
Document Data
● Each document becomes a ‘term’ vector
● Each term is a component (attribute) of the vector
● The value of each component is the number of times the
corresponding term occurs in the document.
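The term-vector idea can be sketched with a word counter. The vocabulary, the sample sentence, and the whitespace tokenization are simplifying assumptions; real text processing usually also handles punctuation and word forms.

```python
from collections import Counter


def term_vector(document: str, vocabulary: list) -> list:
    """Count how often each vocabulary term occurs in the document."""
    counts = Counter(document.lower().split())
    return [counts[term] for term in vocabulary]


vocabulary = ["team", "coach", "play", "ball", "score"]
doc = "Team play wins the ball and the team score"

print(term_vector(doc, vocabulary))  # [2, 0, 1, 1, 1]
```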
Transaction Data
● A special type of record data, where
● Each record (transaction) involves a set of items.
● For example, consider a grocery store. The set of products
purchased by a customer during one shopping trip constitutes
a transaction, while the individual products that were
purchased are the items.
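A minimal sketch of the grocery example: each transaction is stored as a set of items keyed by a transaction ID. The IDs, the items, and the helper name `count_transactions_with` are invented for illustration.

```python
# Each transaction (one shopping trip) maps to the set of items purchased.
transactions = {
    1: {"bread", "coke", "milk"},
    2: {"beer", "bread"},
    3: {"beer", "coke", "diaper", "milk"},
}


def count_transactions_with(item: str, transactions: dict) -> int:
    """Return the number of transactions that contain the given item."""
    return sum(1 for items in transactions.values() if item in items)


print(count_transactions_with("bread", transactions))  # 2
```

Unlike a record with a fixed set of attributes, each transaction here can hold a different number of items.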
Graph Based Data
1. Data with relationships among objects
Ex: Web pages
2. Data with objects that are graphs
Ex: Molecule
Graph Data
● Examples: Generic graph, a molecule, and webpages
Benzene Molecule:
C6H6
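As an illustration of an object that is itself a graph, benzene can be sketched as an adjacency structure: six carbon atoms in a ring, each bonded to one hydrogen atom. The atom labels (`C1`, `H1`, ...) are assumptions, and the sketch ignores bond order.

```python
def benzene_graph() -> dict:
    """Build an undirected graph of benzene: atoms are nodes, bonds are edges."""
    graph = {}

    def add_edge(u, v):
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)

    for i in range(6):
        carbon = f"C{i + 1}"
        next_carbon = f"C{(i + 1) % 6 + 1}"  # wraps around to close the ring
        add_edge(carbon, next_carbon)        # carbon-carbon bond
        add_edge(carbon, f"H{i + 1}")        # carbon-hydrogen bond
    return graph


graph = benzene_graph()
bond_count = sum(len(neighbors) for neighbors in graph.values()) // 2

print(len(graph))           # 12 atoms
print(bond_count)           # 12 bonds
print(sorted(graph["C1"]))  # ['C2', 'C6', 'H1']
```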
Ordered Data
1. Sequential Data
2. Sequence Data
3. Time Series Data
4. Spatial Data
● Sequential Data : Also referred to as temporal data
● Sequences of transactions
An element of the sequence
● Sequence Data: Ex: Genome sequencing
● Spatio-Temporal Data
Average monthly temperature of land and ocean
Data Quality
● Poor data quality negatively affects many data processing
efforts
“The most important point is that poor data quality is an
unfolding disaster.
● Poor data quality costs the typical company at least ten
percent (10%) of revenue; twenty percent (20%) is
probably a better estimate.”
Thomas C. Redman, DM Review, August 2004
● Data mining example: a classification model for detecting
people who are loan risks is built using poor data
● Some credit-worthy candidates are denied loans
● More loans are given to individuals that default
Data Quality …
● What kinds of data quality problems?
● How can we detect problems with the data?
● What can we do about these problems?
● Causes?
Missing Values
● It is not unusual for an object to be missing one or more
attribute values.
● In some cases, the information was not collected; e.g.,
some people decline to give their age or weight.
● In other cases, some attributes are not applicable to all
objects; e.g., often, forms have conditional parts that are
filled out only when a person answers a previous question
in a certain way, but for simplicity, all fields are stored.
● Regardless, missing values should be taken into account
during the data analysis.
Eliminate Data Objects or Attributes
● A simple and effective strategy is to eliminate objects with
missing values.
● However, even a partially specified data object contains
some information, and if many objects have missing
values, then a reliable analysis can be difficult or
impossible.
● Nonetheless, if a data set has only a few objects that have
missing values, then it may be expedient to omit them.
● A related strategy is to eliminate attributes that have
missing values. This should be done with caution,
however, since the eliminated attributes may be the ones
that are critical to the analysis.
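Both elimination strategies can be sketched on a small list of records in which `None` marks a missing value. The records and function names are hypothetical.

```python
# Hypothetical records; None marks a missing attribute value.
records = [
    {"age": 34, "weight": 70.5, "income": 52000},
    {"age": None, "weight": 82.0, "income": 61000},
    {"age": 45, "weight": None, "income": 58000},
    {"age": 29, "weight": 64.3, "income": 47000},
]


def drop_incomplete_objects(records):
    """Keep only the objects that have a value for every attribute."""
    return [r for r in records if all(v is not None for v in r.values())]


def drop_incomplete_attributes(records):
    """Keep only the attributes that are present for every object."""
    complete = [k for k in records[0] if all(r[k] is not None for r in records)]
    return [{k: r[k] for k in complete} for r in records]


print(len(drop_incomplete_objects(records)))   # 2
print(drop_incomplete_attributes(records)[0])  # {'income': 52000}
```

In this toy data, dropping objects loses half of the records, while dropping attributes loses both `age` and `weight`, which shows why either strategy needs caution.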
Estimate Missing Values
● Sometimes missing data can be reliably estimated.
● For example, consider a time series that changes in a
reasonably smooth fashion, but has a few, widely scattered
missing values.
● In such cases, the missing values can be estimated
(interpolated) by using the remaining values.
● As another example, consider a data set that has many
similar data points. In this situation, the attribute values of
the points closest to the point with the missing value are
often used to estimate the missing value.
● If the attribute is continuous, then the average attribute
value of the nearest neighbors is used.
● If the attribute is categorical, then the most commonly
occurring attribute value can be taken.
● For a concrete illustration, consider precipitation
measurements that are recorded by ground stations. For
areas not containing a ground station, the precipitation
can be estimated using values observed at nearby ground
stations.
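The two estimation approaches above can be sketched as follows. Linear interpolation and squared Euclidean distance are assumptions made for this example; the slides do not prescribe a particular interpolation or distance measure, and the station coordinates and readings are invented.

```python
from collections import Counter
from statistics import mean


def interpolate_missing(series):
    """Linearly interpolate None values that lie between two known values."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled


def knn_estimate(target, neighbors, k=3, categorical=False):
    """Estimate a missing value from the k nearest (point, value) pairs."""
    def squared_distance(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    nearest = sorted(neighbors, key=lambda pv: squared_distance(pv[0], target))[:k]
    values = [value for _, value in nearest]
    if categorical:
        # Most commonly occurring value among the nearest neighbors.
        return Counter(values).most_common(1)[0][0]
    # Average value of the nearest neighbors.
    return mean(values)


print(interpolate_missing([1.0, None, 3.0, None, None, 6.0]))
# [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Hypothetical ground stations: ((x, y) location, precipitation reading).
stations = [((0, 0), 10.0), ((1, 0), 12.0), ((0, 1), 14.0), ((10, 10), 50.0)]
print(knn_estimate((0.5, 0.5), stations, k=3))  # 12.0
```

The distant station at (10, 10) is ignored because only the three closest stations contribute to the estimate.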
Inconsistent Values
● Data can contain inconsistent values. Consider an address
field, where both a zip code and city are listed, but the
specified zip code area is not contained in that city.
● It may be that the individual entering this information
transposed two digits, or perhaps a digit was misread when
the information was scanned from a handwritten form.
● Some types of inconsistencies are easy to detect. For
instance, a person's height should not be negative.
● In other cases, it can be necessary to consult an external
source of information.
● For example, when an insurance company processes
claims for reimbursement, it checks the names and
addresses on the reimbursement forms against a database
of its customers.
● A product code may have "check" digits, or it may be
possible to double-check a product code against a list of
known product codes, and then correct the code if it is
incorrect, but close to a known code. The correction of
an inconsistency requires additional or redundant
information.
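A minimal sketch of the two kinds of checks described above: an easy range check (negative height) and a correction that relies on redundant information (a list of known product codes). The codes, the function names, and the rule "accept a correction only when exactly one known code differs in a single character" are assumptions for illustration.

```python
def is_plausible_height(height_cm: float) -> bool:
    """A height must be positive; anything else is an inconsistency."""
    return height_cm > 0


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def correct_code(code: str, known_codes: set):
    """Return the code if known, a unique one-character correction, or None."""
    if code in known_codes:
        return code
    candidates = [
        k for k in known_codes
        if len(k) == len(code) and hamming(k, code) == 1
    ]
    # Correct only when the fix is unambiguous.
    return candidates[0] if len(candidates) == 1 else None


known_codes = {"A1234", "B5678", "C9012"}

print(is_plausible_height(-170.0))         # False
print(correct_code("A1284", known_codes))  # A1234
print(correct_code("Z0000", known_codes))  # None
```

Note that two transposed digits differ from the correct code in two positions, so this sketch would flag such a code as uncorrectable rather than repair it.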
Duplicate Data
● A data set may include data objects that are duplicates, or
almost duplicates, of one another.
● Many people receive duplicate mailings because they
appear in a database multiple times under slightly different
names.
● To detect and eliminate such duplicates, two main issues
must be addressed.
● First, if there are two objects that actually represent a single
object, then the values of corresponding attributes may
differ, and these inconsistent values must be resolved.
● Second, care needs to be taken to avoid accidentally
combining data objects that are similar, but not
duplicates, such as two distinct people with identical
names.
● In some cases, two or more objects are identical with
respect to the attributes measured by the database, but
they still represent different objects.
● Here, the duplicates are legitimate, but may still cause
problems for some algorithms if the possibility of
identical objects is not specifically accounted for in their
design.
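The mailing-list example can be sketched by normalizing names and addresses before comparing records. Requiring both fields to match is one assumed way to avoid merging two distinct people who share a name; the records and function names are invented.

```python
def normalize(text: str) -> str:
    """Lowercase, drop periods, and collapse repeated whitespace."""
    return " ".join(text.lower().replace(".", "").split())


def find_candidate_duplicates(records):
    """Group record indices that share a normalized name and address."""
    groups = {}
    for index, record in enumerate(records):
        key = (normalize(record["name"]), normalize(record["address"]))
        groups.setdefault(key, []).append(index)
    return [indices for indices in groups.values() if len(indices) > 1]


records = [
    {"name": "John Smith", "address": "12 Oak St."},
    {"name": "john  smith", "address": "12 Oak St"},
    {"name": "John Smith", "address": "99 Elm Ave"},
]

print(find_candidate_duplicates(records))  # [[0, 1]]
```

The third record shares a name with the first two but has a different address, so it is kept as a separate person. Any flagged group would still need its differing attribute values resolved before merging.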