
Unit 1 - IDS

This document provides an overview of data types and collection. It discusses the different types of data, including attributes, objects, and attribute values. The key data types are nominal, ordinal, interval, and ratio. It also describes different types of data sets such as records, matrices, documents, transactions, and graphs. Finally, it covers important data quality issues like noise, missing values, and duplicates that can negatively impact analysis.

Uploaded by

mahadevunisirisha02


Unit 1

Data Types & Collection

Outline
● Types of Data
○ Attributes & Measurements
○ Types of Data Sets
● Data Quality
○ Data Measurement and Data Collection Issues
What is Data?
● Collection of Data Objects and their Attributes
Attributes
● An attribute is a property or
characteristic of an object that may
vary, either from one object to
another or from one time to another.
● Examples: eye color of a person,
temperature, etc.
● Attribute is also known as
variable, field, characteristic,
dimension, or feature
● A collection of attributes describes an
object
● Object is also known as record,
point, case, sample, entity, or
instance
Attribute Values
● Attribute values are numbers or symbols assigned to an attribute
for a particular object

● Distinction between attributes and attribute values

● Same attribute can be mapped to different attribute values
● Example: height can be measured in feet or meters

● Different attributes can be mapped to the same set of values

● Example: Attribute values for ID and age are integers
● But properties of attribute values can be different
Measurement
● A measurement scale is a rule (function) that associates
a numerical or symbolic value with an attribute of an
object.
● For instance, we step on a bathroom scale to determine
our weight, we classify someone as male or female, or
we count the number of chairs in a room to see if there
will be enough to seat all the people coming to a
meeting.
● In all these cases the "physical value" of an attribute
of an object is mapped to a numerical or symbolic
value.
● With this background, we can now discuss the type
of an attribute, a concept that is important in
determining if a particular data analysis technique is
consistent with a specific type of attribute.
Type of an attribute
● The values used to represent an attribute may
have properties that are not properties of the
attribute itself, and vice versa
● Example 1: Employee Age and ID Number
● Example 2: Length of Line Segments
Example 2: Length of Line Segments
● The way you measure an attribute may not match the
attribute's properties.

Figure: two measurement scales for the length of line segments.
One scale preserves only the ordering property of length.
The other scale preserves the ordering and additivity properties of length.
Properties of Attribute Values
● The type of an attribute depends on which of the
following properties/operations it possesses:
● Distinctness : = and ≠
● Order : < , <=, > and >=

● Addition : + and -
■ (Meaningful Differences)
● Multiplication : * and /
■ (Meaningful Ratios)
Types of Attributes
● There are 4 different types of attributes
● Nominal
● Examples: ID numbers, eye color, zip codes
● Ordinal
● Examples: rankings (e.g., taste of potato chips on a scale
from 1-10), grades, height {tall, medium, short}
● Interval
● Examples: calendar dates, temperatures in Celsius or
Fahrenheit.
● Ratio
● Examples: temperature in Kelvin, length, time, counts
● Nominal attribute : distinctness
● Ordinal attribute : distinctness & order
● Interval attribute : distinctness, order &
meaningful differences
● Ratio attribute : all 4
properties/operations
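
The mapping from attribute type to meaningful properties can be written down as a small lookup table. This is a minimal sketch only; the names MEANINGFUL_OPERATIONS and is_meaningful are hypothetical and chosen for illustration.

```python
# Hypothetical lookup table: the properties that are meaningful
# for each of the four attribute types listed above.
MEANINGFUL_OPERATIONS = {
    "nominal":  {"distinctness"},
    "ordinal":  {"distinctness", "order"},
    "interval": {"distinctness", "order", "differences"},
    "ratio":    {"distinctness", "order", "differences", "ratios"},
}


def is_meaningful(attribute_type: str, operation: str) -> bool:
    """Return True if the operation is meaningful for the attribute type."""
    return operation in MEANINGFUL_OPERATIONS[attribute_type]


print(is_meaningful("ordinal", "order"))     # True
print(is_meaningful("interval", "ratios"))   # False
```

Each type inherits every property of the type before it, which is why the sets grow from nominal to ratio.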
Difference Between Ratio and Interval
● Is it physically meaningful to say that a temperature of 10°
is twice that of 5° on
● the Celsius scale?
● the Fahrenheit scale?
● the Kelvin scale?

● Consider measuring the height above average

● If Bill’s height is three inches above average and Bob’s
height is six inches above average, then would we say that
Bob is twice as tall as Bill?
● Is this situation analogous to that of temperature?
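
One way to explore the temperature question is to express the same two physical temperatures on all three scales and compare the ratios. The sketch below assumes the readings are 5 and 10 degrees Celsius; the function and variable names are illustrative.

```python
def celsius_to_kelvin(c: float) -> float:
    return c + 273.15


def celsius_to_fahrenheit(c: float) -> float:
    return c * 9.0 / 5.0 + 32.0


low_c, high_c = 5.0, 10.0  # the same two physical temperatures throughout

# The ratio changes with the scale, so "twice as hot" is not a property
# of the temperatures themselves on the Celsius or Fahrenheit scale.
ratio_celsius = high_c / low_c
ratio_fahrenheit = celsius_to_fahrenheit(high_c) / celsius_to_fahrenheit(low_c)
ratio_kelvin = celsius_to_kelvin(high_c) / celsius_to_kelvin(low_c)

print(ratio_celsius)               # 2.0
print(round(ratio_fahrenheit, 3))  # 1.22
print(round(ratio_kelvin, 3))      # 1.018
```

Only the Kelvin scale has a true zero, so only the Kelvin ratio (about 1.018) describes the physical relationship. Differences, by contrast, stay consistent across scales up to a constant factor, which is what makes Celsius and Fahrenheit interval attributes.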
Discrete and Continuous Attributes
● Discrete Attribute
● Has only a finite or countably infinite set of values
● Examples: zip codes, counts, or the set of words in a
collection of documents
● Often represented as integer variables.
● Note: binary attributes are a special case of discrete
attributes
● Continuous Attribute
● Has real numbers as attribute values
● Examples: temperature, height, or weight.
● Practically, real values can only be measured and
represented using a finite number of digits.
● Continuous attributes are typically represented as floating-
point variables.
Asymmetric Attributes
● Only presence (a non-zero attribute value) is regarded as important
● Words present in documents
● Items present in customer transactions

● If we met a friend in the grocery store would we ever say the
following?
“I see our purchases are very similar since we didn’t buy most of the
same things.”
● We need two asymmetric binary attributes to represent one ordinary
binary attribute
● Association analysis uses asymmetric attributes

● Asymmetric attributes typically arise from objects that are sets
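
The grocery store remark can be made concrete by comparing two ways of counting agreement between baskets. This is an illustrative sketch: the catalogue, the baskets, and the two agreement scores are made-up examples, not measures defined in these slides.

```python
# Hypothetical store catalogue and two shopping baskets.
catalogue = {"bread", "milk", "eggs", "beer", "diapers", "cola", "rice", "tea"}
basket_a = {"bread", "milk"}
basket_b = {"beer", "diapers"}

both_present = basket_a & basket_b               # items both customers bought
both_absent = catalogue - (basket_a | basket_b)  # items neither customer bought

# Symmetric view: shared absences count as agreement.
symmetric_agreement = (len(both_present) + len(both_absent)) / len(catalogue)

# Asymmetric view: only presence counts, so shared absences are ignored.
bought_by_either = basket_a | basket_b
asymmetric_agreement = len(both_present) / len(bought_by_either)

print(symmetric_agreement)   # 0.5
print(asymmetric_agreement)  # 0.0
```

The two baskets share no items, yet the symmetric score reports 50% agreement purely because both customers skipped the same four products. With a realistic catalogue of thousands of items, that score would approach 1 for almost any pair of baskets.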

Critiques …
● Not a good guide for statistical analysis
● May unnecessarily restrict operations and results
● Statistical analysis is often approximate
● Thus, for example, using interval analysis for ordinal
values may be justified
● Transformations are common but don’t preserve scales
● Can transform data to a new scale with better statistical
properties
● Many statistical analyses depend only on the distribution
More Complicated Examples
● ID numbers
● Nominal, ordinal, or interval?

● Number of cylinders in an automobile engine

● Nominal, ordinal, or ratio?

● Biased Scale
● Interval or Ratio
Key Messages for Attribute Types
● The types of operations you choose should be “meaningful” for the
type of data you have
● Distinctness, order, meaningful intervals, and meaningful ratios are
only four properties of data
● The data type you see – often numbers or strings – may not capture
all the properties or may suggest properties that are not there
● Analysis may depend on these other properties of the data
● Many statistical analyses depend only on the distribution
● Many times what is meaningful is measured by statistical
significance
● But in the end, what is meaningful is measured by the domain
Types of data sets
● Record
● Data Matrix
● Document Data
● Transaction Data
● Graph
● World Wide Web
● Molecular Structures
● Ordered
● Spatial Data
● Temporal Data
● Sequential Data
● Genetic Sequence Data
Important Characteristics of Data
● Dimensionality (number of attributes)
● High dimensional data brings a number of challenges
● Sparsity
● Only presence counts
● Resolution
● Patterns depend on the scale
● Size
● Type of analysis may depend on size of data
Record Data
● Data that consists of a collection of records, each of which
consists of a fixed set of attributes
Data Matrix
● If data objects have the same fixed set of numeric attributes,
then the data objects can be thought of as points in a multi-
dimensional space, where each dimension represents a distinct
attribute
● Such data set can be represented by an m by n matrix, where
there are m rows, one for each object, and n columns, one for
each attribute
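
A minimal sketch of a data matrix, using plain Python lists. The attribute names and values are invented for illustration.

```python
# Hypothetical data set: m = 3 objects, n = 2 numeric attributes.
attribute_names = ["height_cm", "weight_kg"]
data_matrix = [
    [170.0, 65.0],   # object 1
    [182.5, 80.2],   # object 2
    [158.3, 52.7],   # object 3
]

m = len(data_matrix)     # number of rows (objects)
n = len(data_matrix[0])  # number of columns (attributes)

# A column holds the values of one attribute across all objects.
heights = [row[0] for row in data_matrix]

print(m, n)     # 3 2
print(heights)  # [170.0, 182.5, 158.3]
```

Each row can be read as a point in n-dimensional space, which is what allows distance-based techniques to be applied to this kind of data.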
Document Data
● Each document becomes a ‘term’ vector
● Each term is a component (attribute) of the vector
● The value of each component is the number of times the
corresponding term occurs in the document.
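
The term-vector representation can be sketched with the standard library. The two documents are made-up examples, and splitting on whitespace is a simplifying assumption; real text would need proper tokenization.

```python
from collections import Counter

# Hypothetical two-document collection.
documents = [
    "the team won the game",
    "the coach lost the game",
]

# Count how often each term occurs in each document.
term_counts = [Counter(doc.split()) for doc in documents]

# The vocabulary defines the components (attributes) of every vector.
vocabulary = sorted(set().union(*term_counts))

# One vector per document; a Counter returns 0 for absent terms.
term_vectors = [[counts[term] for term in vocabulary] for counts in term_counts]

print(vocabulary)       # ['coach', 'game', 'lost', 'team', 'the', 'won']
print(term_vectors[0])  # [0, 1, 0, 1, 2, 1]
print(term_vectors[1])  # [1, 1, 1, 0, 2, 0]
```

With a large vocabulary most components are zero, so document data is typically sparse.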
Transaction Data
● A special type of record data, where
● Each record (transaction) involves a set of items.
● For example, consider a grocery store. The set of products
purchased by a customer during one shopping trip constitute
a transaction, while the individual products that were
purchased are the items.
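
A short sketch of transaction data, assuming each transaction is stored as a set of item names. The items are invented; the second half shows one possible conversion to record data with one binary attribute per item.

```python
# Hypothetical transactions: each one is the set of items from one shopping trip.
transactions = [
    {"bread", "cola", "milk"},
    {"beer", "bread"},
    {"beer", "cola", "diapers", "milk"},
]

# All distinct items across the transactions, in a fixed order.
items = sorted(set().union(*transactions))

# Record view: 1 if the item is present in the transaction, else 0.
binary_rows = [[1 if item in t else 0 for item in items] for t in transactions]

print(items)  # ['beer', 'bread', 'cola', 'diapers', 'milk']
for row in binary_rows:
    print(row)
```

The set form stores only what was bought, while the binary form stores a value for every item in every transaction, most of which would be 0 in a real store.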
Graph Based Data
1. Data with relationships among objects
Ex: web pages
2. Data with objects that are graphs
Ex: molecule
Graph Data
● Examples: Generic graph, a molecule, and webpages

Benzene Molecule:
C6H6
Ordered Data
1. Sequential Data
2. Sequence Data
3. Time Series Data
4. Spatial Data
Ordered Data
● Sequential Data: also referred to as temporal data
● Sequences of transactions

Figure: a sequence of transactions, with one element of the sequence marked.
Ordered Data
● Sequence Data: Ex: Genome sequencing
Ordered Data
● Spatio-Temporal Data
Figure: average monthly temperature of land and ocean.
Data Quality
● Poor data quality negatively affects many data processing
efforts
“The most important point is that poor data quality is an
unfolding disaster. Poor data quality costs the typical company
at least ten percent (10%) of revenue; twenty percent (20%) is
probably a better estimate.”
Thomas C. Redman, DM Review, August 2004
● Data mining example: a classification model for detecting
people who are loan risks is built using poor data
● Some credit-worthy candidates are denied loans
● More loans are given to individuals that default
Data Quality …
● What kinds of data quality problems?
● How can we detect problems with the data?
● What can we do about these problems?

● Examples of data quality problems:

● Noise and outliers
● Missing values
● Duplicate data
● Wrong data
Measurement and Data Collection Errors
● The term Measurement Error refers to any problem resulting
from the measurement process.
● A common problem is that the value recorded differs from
the true value to some extent.
● For continuous attributes, the numerical difference of the
measured and true value is called the error.
● The term Data Collection Error refers to errors such as
omitting data objects or attribute values, or inappropriately
including a data object.
● For example, a study of animals of a certain species might
include animals of a related species that are similar in
appearance to the species of interest.
Noise and Artifacts
● Noise is the random component of a measurement error. It
may involve the distortion of a value or the addition of
spurious objects.
● The term noise is often used in connection with data that
has a spatial or temporal component.
● In such cases, techniques from signal or image processing
can frequently be used to reduce noise and thus, help to
discover patterns (signals) that might be "lost in the noise."
● Nonetheless, the elimination of noise is frequently difficult,
and much work in data mining focuses on devising robust
algorithms that produce acceptable results even when noise
is present.
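
As a rough illustration of noise reduction on data with a temporal component, the sketch below adds random noise to a sine wave and smooths it with a centered moving average. The moving average is just one simple smoothing technique chosen for this example; the signal, noise level, and window size are arbitrary assumptions.

```python
import math
import random


def moving_average(values, window=5):
    """Centered moving average; the window is truncated at the series ends."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)  # fixed seed so the run is repeatable
signal = [math.sin(2 * math.pi * t / 100) for t in range(200)]
noisy = [s + rng.gauss(0.0, 0.3) for s in signal]
smoothed = moving_average(noisy, window=5)

print("error before smoothing:", mean_squared_error(noisy, signal))
print("error after smoothing: ", mean_squared_error(smoothed, signal))
```

Smoothing lowers the error here because the underlying signal changes slowly relative to the window. A wider window removes more noise but also distorts the signal more, so the noise is reduced rather than eliminated.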
Noise
● For objects, noise is an extraneous object
● For attributes, noise refers to modification of original
values
● Examples: distortion of a person’s voice when talking
on a poor phone and “snow” on television screen

Figure panels: Two Sine Waves; Two Sine Waves + Noise

● Data errors may be the result of a more deterministic
phenomenon, such as a streak in the same place on a set
of photographs. Such deterministic distortions of the data
are often referred to as artifacts.
● A data artifact is a data flaw caused by equipment,
techniques or conditions.
● Common sources of data flaws include hardware or
software errors, conditions such as electromagnetic
interference and flawed designs such as an algorithm
prone to miscalculations
Precision, Bias, and Accuracy
Precision: The closeness of repeated measurements (of the
same quantity) to one another.
Precision is often measured by the standard deviation of a
set of values
Bias: A systematic variation of measurements from the quantity being measured.
Bias is measured by taking the difference between the mean
of the set of values and the known value of the quantity
being measured.
Bias can only be determined for objects whose measured
quantity is known by means external to the current situation
● For example: Suppose that we have a standard laboratory
weight with a mass of 1g and want to assess the precision
and bias of our new laboratory scale.
● We weigh the mass five times, and obtain the following five
values: {1.015, 0.990, 1.013, 1.001, 0.986}.
● The mean of these values is 1.001, and hence, the bias is
0.001. The precision, as measured by the standard
deviation, is 0.013.
● Accuracy: The closeness of measurements to the true value
of the quantity being measured.
● Accuracy depends on precision and bias, but since it is a
general concept, there is no specific formula for accuracy in
terms of these two quantities
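
The laboratory scale example above can be reproduced with the standard library. Note one assumption: the quoted precision of 0.013 matches the sample standard deviation (dividing by n - 1), which is what statistics.stdev computes.

```python
import statistics

true_mass = 1.000  # known mass of the standard laboratory weight, in grams
measurements = [1.015, 0.990, 1.013, 1.001, 0.986]

mean_value = statistics.mean(measurements)
bias = mean_value - true_mass                # mean minus the known value
precision = statistics.stdev(measurements)   # sample standard deviation

print(round(mean_value, 3))  # 1.001
print(round(bias, 3))        # 0.001
print(round(precision, 3))   # 0.013
```

The scale is nearly unbiased on average, but individual readings scatter by about 0.013 g, so any single measurement may be off by more than the bias suggests.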
Outliers
● Outliers are either (1) data objects that, in some sense, have
characteristics that are different from most of the other data
objects in the data set, or (2) values of an attribute that are
unusual with respect to the typical values for that attribute.
● It is important to distinguish between the notions of noise
and outliers.
● Outliers can be legitimate data objects or values. Thus,
unlike noise, outliers may sometimes be of interest.
● In fraud and network intrusion detection, for example, the
goal is to find unusual objects or events from among a large
number of normal ones.
Outliers
● Outliers are data objects with characteristics that are
considerably different than most of the other data objects in the
data set
● Case 1: Outliers are
noise that interferes
with data analysis

● Case 2: Outliers are
the goal of our analysis
● Credit card fraud
● Intrusion detection

● Causes?
Missing Values
● It is not unusual for an object to be missing one or more
attribute values.
● In some cases, the information was not collected; e.g.,
some people decline to give their age or weight.
● In other cases, some attributes are not applicable to all
objects; e.g., often, forms have conditional parts that are
filled out only when a person answers a previous question
in a certain way, but for simplicity, all fields are stored.
● Regardless, missing values should be taken into account
during the data analysis.
Eliminate Data Objects or Attributes
● A simple and effective strategy is to eliminate objects with
missing values.
● However, even a partially specified data object contains
some information, and if many objects have missing
values, then a reliable analysis can be difficult or
impossible.
● Nonetheless, if a data set has only a few objects that have
missing values, then it may be expedient to omit them.
● A related strategy is to eliminate attributes that have
missing values. This should be done with caution,
however, since the eliminated attributes may be the ones
that are critical to the analysis.
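
Both elimination strategies can be sketched on a small set of records, assuming missing values are stored as None. The records and field names are made up for illustration.

```python
# Hypothetical records; None marks a missing attribute value.
records = [
    {"age": 34,   "weight": 70.5, "city": "Pune"},
    {"age": None, "weight": 82.0, "city": "Delhi"},
    {"age": 51,   "weight": None, "city": "Chennai"},
    {"age": 29,   "weight": 64.3, "city": "Mumbai"},
]

# Strategy 1: eliminate objects that have any missing value.
complete_records = [
    r for r in records if all(v is not None for v in r.values())
]

# Strategy 2: eliminate attributes that have any missing value.
complete_attributes = [
    name for name in records[0]
    if all(r[name] is not None for r in records)
]
records_without_incomplete_attributes = [
    {name: r[name] for name in complete_attributes} for r in records
]

print(len(complete_records))  # 2
print(complete_attributes)    # ['city']
```

Dropping objects halves this data set, while dropping attributes keeps every object but discards both age and weight, which shows why the second strategy needs caution.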
Estimate Missing Values
● Sometimes missing data can be reliably estimated.
● For example, consider a time series that changes in a
reasonably smooth fashion, but has a few, widely scattered
missing values.
● In such cases, the missing values can be estimated
(interpolated) by using the remaining values.
● As another example, consider a data set that has many
similar data points. In this situation, the attribute values of
the points closest to the point with the missing value are
often used to estimate the missing value.
● If the attribute is continuous, then the average attribute
value of the nearest neighbors is used.
● If the attribute is categorical, then the most commonly
occurring attribute value can be taken.
● For a concrete illustration, consider precipitation
measurements that are recorded by ground stations. For
areas not containing a ground station, the precipitation
can be estimated using values observed at nearby ground
stations.
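
The estimation strategies described above can be sketched as follows. This is a simplified illustration: linear interpolation is one possible way to interpolate a smooth series, the choice of k = 3 neighbors and Euclidean distance is an assumption, and all station data is invented.

```python
import math
from collections import Counter


def interpolate_missing(series):
    """Fill None entries by linear interpolation between known neighbors.

    Gaps at the start or end of the series are left as None.
    """
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled


def nearest_values(location, stations, k):
    """Values of the k stations closest to location (Euclidean distance)."""
    ranked = sorted(stations, key=lambda s: math.dist(location, s["location"]))
    return [s["value"] for s in ranked[:k]]


def estimate_continuous(location, stations, k=3):
    """Average value of the k nearest neighbors."""
    values = nearest_values(location, stations, k)
    return sum(values) / len(values)


def estimate_categorical(location, stations, k=3):
    """Most commonly occurring value among the k nearest neighbors."""
    values = nearest_values(location, stations, k)
    return Counter(values).most_common(1)[0][0]


# Time series with a few scattered missing values.
print(interpolate_missing([10.0, None, 14.0, None, None, 20.0]))

# Hypothetical ground stations with precipitation readings.
precipitation = [
    {"location": (0.0, 0.0), "value": 10.0},
    {"location": (1.0, 0.0), "value": 12.0},
    {"location": (0.0, 1.0), "value": 14.0},
    {"location": (10.0, 10.0), "value": 50.0},
]
print(estimate_continuous((0.5, 0.5), precipitation))  # 12.0

# Hypothetical categorical attribute recorded at the same locations.
soil_type = [
    {"location": (0.0, 0.0), "value": "clay"},
    {"location": (1.0, 0.0), "value": "clay"},
    {"location": (0.0, 1.0), "value": "sand"},
    {"location": (10.0, 10.0), "value": "rock"},
]
print(estimate_categorical((0.5, 0.5), soil_type))  # clay
```

The distant station at (10, 10) is excluded from both estimates because only the three nearest stations are used.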
Inconsistent Values
● Data can contain inconsistent values. Consider an address
field, where both a zip code and city are listed, but the
specified zip code area is not contained in that city.
● It may be that the individual entering this information
transposed two digits, or perhaps a digit was misread when
the information was scanned from a handwritten form.
● Some types of inconsistencies are easy to detect. For
instance, a person's height should not be negative.
● In other cases, it can be necessary to consult an external
source of information.
● For example, when an insurance company processes
claims for reimbursement, it checks the names and
addresses on the reimbursement forms against a database
of its customers.
● A product code may have "check" digits, or it may be
possible to double-check a product code against a list of
known product codes, and then correct the code if it is
incorrect, but close to a known code. The correction of
an inconsistency requires additional or redundant
information.
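
The checks described above can be sketched in code. The slides do not name a product code scheme, so this example assumes 12-digit UPC-A codes and their standard check-digit rule. The function names and the "differs in at most two positions" closeness rule are hypothetical choices for illustration.

```python
def is_valid_height(height_cm: float) -> bool:
    """An easy-to-detect inconsistency: height should not be negative or zero."""
    return height_cm > 0


def upc_check_digit(first_eleven: str) -> int:
    """Compute the UPC-A check digit from the first eleven digits."""
    digits = [int(ch) for ch in first_eleven]
    odd_sum = sum(digits[0::2])   # positions 1, 3, ..., 11
    even_sum = sum(digits[1::2])  # positions 2, 4, ..., 10
    return (10 - (3 * odd_sum + even_sum) % 10) % 10


def has_valid_check_digit(code: str) -> bool:
    return (
        len(code) == 12
        and code.isdigit()
        and upc_check_digit(code[:11]) == int(code[-1])
    )


def correct_code(code, known_codes, max_differences=2):
    """Return the single known code close to `code`, or None if ambiguous."""
    if code in known_codes:
        return code
    close = [
        known for known in known_codes
        if len(known) == len(code)
        and sum(a != b for a, b in zip(code, known)) <= max_differences
    ]
    return close[0] if len(close) == 1 else None


known_codes = ["036000291452", "012345678905"]
entered = "036000291542"  # two adjacent digits transposed during data entry

print(is_valid_height(-170.0))               # False
print(has_valid_check_digit(entered))        # False
print(correct_code(entered, known_codes))    # 036000291452
```

The check digit detects the error using redundancy inside the code itself, while the correction step needs the external list of known codes.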
Duplicate Data
● A data set may include data objects that are duplicates, or
almost duplicates, of one another.
● Many people receive duplicate mailings because they
appear in a database multiple times under slightly different
names.
● To detect and eliminate such duplicates, two main issues
must be addressed.
● First, if there are two objects that actually represent a single
object, then the values of corresponding attributes may
differ, and these inconsistent values must be resolved.
● Second, care needs to be taken to avoid accidentally
combining data objects that are similar, but not
duplicates, such as two distinct people with identical
names.
● In some cases, two or more objects are identical with
respect to the attributes measured by the database, but
they still represent different objects.
● Here, the duplicates are legitimate, but may still cause
problems for some algorithms if the possibility of
identical objects is not specifically accounted for in their
design
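
A minimal sketch of flagging candidate duplicates, assuming customer records with a name and a zip code. The records, the normalization rules, and the choice of matching key are all illustrative assumptions.

```python
def normalize_name(name: str) -> str:
    """Lowercase, drop punctuation, and collapse repeated whitespace."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())


# Hypothetical mailing list.
customers = [
    {"id": 1, "name": "John  Smith", "zip": "10001"},
    {"id": 2, "name": "john smith.", "zip": "10001"},
    {"id": 3, "name": "John Smith",  "zip": "94105"},
    {"id": 4, "name": "Mary Jones",  "zip": "10001"},
]

# Group records by normalized name plus zip code.
groups = {}
for customer in customers:
    key = (normalize_name(customer["name"]), customer["zip"])
    groups.setdefault(key, []).append(customer["id"])

candidate_duplicates = [ids for ids in groups.values() if len(ids) > 1]
print(candidate_duplicates)  # [[1, 2]]
```

Records 1 and 2 are flagged because they differ only in spacing, case, and punctuation. Record 3 has the same name but a different zip code, so it is kept apart as a possibly distinct person. Flagged groups are only candidates: any attribute values that differ between them still have to be resolved before merging.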
