UNIT 3
To Do:
1. Overview, Motivation, Definition & Functionalities
2. Data Processing, Form of Data Pre-processing
3. Data Cleaning: Missing Values, Noisy Data (Binning, Clustering, Regression, Computer
and Human inspection)
4. Inconsistent Data
5. Data Integration and Transformation
6. Data Reduction:
a. Data Cube Aggregation
b. Dimensionality reduction
c. Data Compression
d. Numerosity Reduction
e. Discretization and Concept Hierarchy Generation
7. Decision Tree
# Overview, Motivation, Definition, and Functionalities of Data Mining:
Conclusion
The overview and motivation for data mining highlight its importance in handling large
datasets for actionable insights. The definition shows that it is a systematic process
involving machine learning and statistical methods, while the functionalities provide
specific tasks that can help businesses, researchers, and industries.
This powerful tool allows organizations to unlock the value hidden within data and improve
decision-making in diverse fields like healthcare, retail, finance, and more.
# Data Processing and Forms of Data Pre-processing:
1. Data Processing
Data processing refers to the steps taken to transform raw data into meaningful and usable
information. Raw data collected from various sources often contains errors, inconsistencies,
and noise. To make it suitable for data mining and analysis, it needs to undergo several
processes, collectively called data preprocessing.
• Why is Data Processing Important?
o Improves Data Quality: Clean and accurate data leads to better analysis and
predictions.
o Reduces Redundancy: Eliminates duplicate or irrelevant data.
o Optimizes Performance: Well-processed data improves the efficiency of data
mining algorithms.
o Ensures Consistency: Standardizes data formats and removes inconsistencies.
A. Data Cleaning
Definition: Data cleaning involves identifying and correcting errors, inconsistencies, or
missing values in the dataset to ensure high-quality data.
• Steps in Data Cleaning (a short code sketch follows this list):
o Handling Missing Values:
▪ Replace missing values with a constant (e.g., zero) or a mean/median.
▪ Remove rows or columns with too many missing values.
▪ Use predictive models to estimate missing values.
▪ Example: If income data has missing values, you can replace them with
the average income.
o Removing Noisy Data (Irrelevant or outlier data):
▪ Binning: Sort data into bins (ranges) and smooth values by mean,
median, or boundaries.
▪ Regression: Use a regression model to predict and smooth noisy data
points.
▪ Clustering: Group data into clusters and remove outliers.
▪ Human Inspection: Manual review for small datasets.
▪ Example: In a sales dataset, entries like "99999" as age could be outliers.
o Resolving Inconsistent Data:
▪ Standardize data formats, units, or naming conventions.
▪ Correct inconsistencies (e.g., "NY" vs. "New York").
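A minimal Python sketch of the cleaning steps above (missing-value imputation, outlier handling, and standardizing inconsistent labels), assuming a small hypothetical pandas DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical records with a missing income, an implausible age, and inconsistent city names.
df = pd.DataFrame({
    "income": [30000.0, np.nan, 45000.0, 52000.0],
    "age": [25.0, 34.0, 99999.0, 41.0],
    "city": ["NY", "New York", "ny", "New York"],
})

# Handling missing values: fill missing income with the column mean.
df["income"] = df["income"].fillna(df["income"].mean())

# Removing noisy data: treat implausible ages as missing, then impute the median.
df.loc[df["age"] > 120, "age"] = np.nan
df["age"] = df["age"].fillna(df["age"].median())

# Resolving inconsistent data: standardize naming conventions ("NY" vs. "New York").
df["city"] = df["city"].str.strip().replace({"NY": "New York", "ny": "New York"})

print(df)
```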
B. Data Integration
Definition: Combining data from multiple sources into a single, unified view.
• Purpose: Ensures that all the data required for analysis is merged and accessible.
• Challenges:
o Data may have different formats (e.g., databases, spreadsheets, files).
o Redundancies: Duplicate records or fields must be removed.
o Schema Integration: Combining fields like "Customer_ID" vs. "C_ID".
• Example: Combining sales data from a CRM (Customer Relationship Management)
and product data from an ERP (Enterprise Resource Planning) system into one
dataset.
C. Data Transformation
Definition: Converting data into a suitable format for data mining.
• Types of Data Transformation:
1. Normalization: Scaling data values into a specific range so that no single
attribute dominates simply because of its scale.
▪ Methods:
▪ Min-Max Normalization: rescales values to a fixed range, typically [0, 1].
▪ Z-Score Normalization: rescales values to mean 0 and standard deviation 1.
2. Smoothing: Removing noise from the data, e.g., by binning or regression.
3. Aggregation: Summarizing data, e.g., rolling daily sales up into monthly totals.
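A minimal sketch of the two normalization methods above (min-max and z-score), assuming a small array of hypothetical income values:

```python
import numpy as np

# Hypothetical income values to normalize.
income = np.array([30000.0, 45000.0, 52000.0, 61000.0])

# Min-max normalization: rescale into the range [0, 1].
min_max = (income - income.min()) / (income.max() - income.min())

# Z-score normalization: rescale to mean 0 and standard deviation 1.
z_score = (income - income.mean()) / income.std()

print(min_max)  # values between 0 and 1
print(z_score)  # values centered on 0
```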
D. Data Reduction
Definition: Reducing the size of the dataset while preserving its integrity, so it remains
useful for analysis.
• Purpose: To improve processing time and algorithm efficiency.
• Techniques:
1. Data Cube Aggregation: Summarizing data into a data cube for
multidimensional analysis.
▪ Example: Aggregating total sales across regions, months, and product
categories.
2. Dimensionality Reduction: Reducing the number of attributes (features) in a
dataset while maintaining essential information.
▪ Methods:
▪ Principal Component Analysis (PCA)
▪ Linear Discriminant Analysis (LDA)
▪ Example: If you have 20 features in a dataset, PCA might reduce them to
5 key components.
3. Data Compression: Storing data in a compact form using encoding techniques.
▪ Example: Encoding categorical values as integers.
4. Numerosity Reduction: Representing data with fewer records through
sampling or clustering.
E. Data Discretization
Definition: Converting continuous numerical data into categorical (discrete) intervals.
• Purpose: Simplifies data for analysis and algorithms that work better with categorical
inputs.
• Techniques:
1. Binning: Dividing data into intervals or bins.
▪ Equal-width Binning: Bins have equal size (e.g., [0-10], [10-20]).
▪ Equal-frequency Binning: Bins have equal data points.
2. Histogram Analysis: Grouping data into frequency intervals.
• Example: Age data can be discretized into categories: 0–18 (child), 19–35 (youth), 36–
60 (adult), 60+ (senior).
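A short pandas sketch of both binning styles, reusing the age categories from the example above as assumed labels:

```python
import pandas as pd

ages = pd.Series([5, 17, 23, 31, 44, 58, 63, 72])

# Equal-width binning: four bins of equal width over the value range.
equal_width = pd.cut(ages, bins=4)

# Equal-frequency binning: four bins holding roughly the same number of points.
equal_freq = pd.qcut(ages, q=4)

# Concept-style labels following the age example above (upper bound of 120 assumed).
category = pd.cut(ages, bins=[0, 18, 35, 60, 120],
                  labels=["child", "youth", "adult", "senior"])

print(pd.DataFrame({"age": ages, "equal_width": equal_width,
                    "equal_freq": equal_freq, "category": category}))
```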
Conclusion
Data preprocessing forms/methods:
→ Data cleaning (missing values, noisy data, inconsistent data)
→ Data integration
→ Data transformation (normalization, smoothing, aggregation)
→ Data reduction (dimensionality reduction, data compression)
→ Data discretization (binning, histogram analysis)
→ Concept hierarchy generation
# Data Cleaning: Missing Values, Noisy Data, and the Methods Used to Address Them
3. Regression
o Definition: Use regression models to smooth noisy data by fitting a line or
curve.
o Types:
▪ Linear Regression: Fits a straight line.
▪ Polynomial Regression: Fits a curve.
o Example:
If sales data fluctuates with noise, regression can estimate a trend line to
smooth the values.
4. Computer and Human Inspection
o Computer Inspection: Use automated scripts or tools to detect and flag noisy
data.
o Human Inspection: Experts manually review small datasets to identify and fix
errors.
o When to use:
▪ For small datasets or critical data.
o Example:
In medical data, experts manually check abnormal entries like inconsistent
patient records.
5. Smoothing Techniques
o Moving Average: Replace each value with the average of neighboring values
over a specified window.
o Weighted Moving Average: Assign higher weights to recent values.
o Example:
Daily temperature data can be smoothed by averaging values over 3 days.
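A minimal sketch of moving-average smoothing, assuming a short series of hypothetical daily temperatures and a 3-day window as in the example above:

```python
import pandas as pd

# Hypothetical daily temperatures with one noisy spike.
temps = pd.Series([21.0, 35.0, 22.5, 23.0, 19.0, 24.5, 23.5])

# Simple moving average over a 3-day window.
moving_avg = temps.rolling(window=3, center=True).mean()

# Weighted moving average: the most recent day gets the largest weight (weights assumed).
weights = [0.2, 0.3, 0.5]
weighted = temps.rolling(window=3).apply(lambda w: (w * weights).sum(), raw=True)

print(pd.DataFrame({"raw": temps, "moving_avg": moving_avg, "weighted": weighted}))
```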
Summary of Key Concepts

| Issue | Technique | Description | Example |
|---|---|---|---|
| Missing Values | Remove Data | Delete rows/columns with missing data. | Remove rows with null values. |
| Missing Values | Replace with Default | Use constant values like 0 or "N/A." | Replace missing salaries with ₹0. |
| Missing Values | Replace with Mean/Median/Mode | Replace with mean, median, or mode values. | Fill missing age with average age = 30. |
| Missing Values | Predict Values | Use regression or ML models to predict. | Predict income based on education level. |
| Noisy Data | Binning | Smooth data using bin means, medians, or boundaries. | Replace [15, 20, 22] with [20, 20, 20]. |
| Noisy Data | Clustering | Remove outliers using clustering methods. | Exclude a house price of ₹10,000,000. |
| Noisy Data | Regression | Fit a regression line to smooth data. | Sales trends smoothed using regression. |
| Noisy Data | Smoothing | Apply a moving average to reduce variations. | Smooth daily temperatures over a week. |
Conclusion
Handling missing values and noisy data is an essential step in data cleaning to improve the
quality and accuracy of datasets. By using techniques like binning, clustering, regression,
and statistical measures, raw data can be prepared effectively for data mining and analysis.
This process ensures that algorithms work efficiently and produce reliable results.
# Inconsistent Data, Data Integration, and Data Transformation as they relate to the data
cleaning process in data mining:
1. Inconsistent Data
Definition
Inconsistent data refers to data that conflicts within the dataset or with other datasets due
to errors, discrepancies, or lack of standardization. This inconsistency can impact data
analysis and decision-making.
2. Data Integration
Definition
Data integration is the process of combining data from multiple sources into a unified,
consistent, and accurate dataset. This step ensures data from various systems, formats, or
databases is ready for analysis.
3. Data Transformation
Definition
Data transformation involves converting raw data into a format suitable for data mining and
analysis. It includes processes like normalization, aggregation, and feature engineering to
make data consistent and useful.
Summary
Inconsistent Data
• Deals with errors like format mismatches, naming conflicts, and contradictory
values.
• Resolved using standardization, reconciliation, and duplicate removal.
Data Integration
• Combines data from multiple sources into a unified view.
• Handles schema conflicts, entity identification, and redundancy.
Data Transformation
• Converts data into a suitable format for analysis.
• Includes normalization, aggregation, discretization, and feature engineering.
By resolving inconsistencies, integrating data, and transforming it into a usable format, data
mining models can operate efficiently, resulting in accurate insights and improved decision-
making.
# Data Reduction
1. Data Cube Aggregation
Key Concepts
1. Data Cube: A data structure that organizes data across multiple dimensions, allowing
analysis at different levels of abstraction.
o Example Dimensions: Time, Location, Product Category, etc.
2. Aggregation: Performing summary operations (like sum, count, average, etc.) on data
grouped by specific dimensions.
Example
Imagine a sales dataset with three dimensions:
• Time: Year, Month
• Location: City, Country
• Product: Category, Brand
Time Location Product Sales
Jan New York Electronics 1000
Feb New York Electronics 1200
Jan New York Furniture 800
Step 1: Aggregation by Time
Summarize data by month:
Month Total Sales
Jan 1800
Feb 1200
Step 2: Aggregation by Location
Summarize data by city:
City Total Sales
New York 3000
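A small pandas sketch of the two aggregation steps above, using the same three hypothetical sales rows:

```python
import pandas as pd

sales = pd.DataFrame({
    "Time": ["Jan", "Feb", "Jan"],
    "Location": ["New York", "New York", "New York"],
    "Product": ["Electronics", "Electronics", "Furniture"],
    "Sales": [1000, 1200, 800],
})

# Step 1: aggregate by time (month).
print(sales.groupby("Time", sort=False)["Sales"].sum())   # Jan: 1800, Feb: 1200

# Step 2: aggregate by location (city).
print(sales.groupby("Location")["Sales"].sum())           # New York: 3000

# A simple data-cube-style view across two dimensions at once.
print(sales.pivot_table(values="Sales", index="Time", columns="Product", aggfunc="sum"))
```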
2. Dimensionality Reduction
Definition
Dimensionality reduction refers to reducing the number of attributes (features) in the
dataset while preserving important information.
Techniques
1. Principal Component Analysis (PCA)
o Projects data onto a new set of axes (principal components) that maximize
variance.
o Reduces redundancy by combining correlated features.
2. Linear Discriminant Analysis (LDA)
o Focuses on maximizing class separability while reducing dimensions.
3. Feature Selection
o Selects the most relevant features using methods like:
▪ Filter Methods (e.g., Correlation).
▪ Wrapper Methods (e.g., Recursive Feature Elimination).
Example
Imagine a dataset with 10 features, but only 3 contribute significantly to predicting the
target variable. PCA can reduce these 10 features to the top 3 principal components.
| Features | After PCA |
|---|---|
| 10 Attributes | 3 Components |
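A brief scikit-learn sketch of the example above, assuming a synthetic 10-feature dataset reduced to 3 principal components:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic dataset: 100 samples, 10 correlated features (assumed for illustration).
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(100, 10))

# Reduce the 10 attributes to 3 principal components.
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)        # (100, 10) -> (100, 3)
print(pca.explained_variance_ratio_.sum())   # share of variance kept by the 3 components
```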
3. Data Compression
Definition
Data compression reduces the storage space required for data by encoding or reorganizing
it in a way that retains its meaning.
Techniques
1. Lossless Compression
o Compresses data without losing any information.
o Techniques: Huffman coding, Run-length encoding.
2. Lossy Compression
o Reduces size by removing insignificant or redundant data.
o Used in multimedia data (images, videos).
Example
• Original Dataset: AAAAABBBCC
• Compressed Dataset (Lossless): 5A3B2C
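A minimal run-length encoder reproducing the lossless example above:

```python
from itertools import groupby

def run_length_encode(text: str) -> str:
    # Collapse each run of identical characters into "<count><character>".
    return "".join(f"{len(list(group))}{char}" for char, group in groupby(text))

print(run_length_encode("AAAAABBBCC"))  # 5A3B2C
```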
Benefits
• Saves storage space.
• Reduces data transmission time.
• Improves performance of data processing tasks.
4. Numerosity Reduction
Definition
Numerosity reduction involves representing data using a smaller set of numbers. It
approximates the original data without losing important patterns.
Techniques
1. Parametric Methods
o Assume that the data fits a model and represents it using model parameters.
o Example: Using a regression equation to approximate data points.
2. Non-Parametric Methods
o Store a reduced representation of the data, such as:
▪ Histograms
▪ Clustering (e.g., k-means)
Example
• Original Data: 1, 2, 3, 4, 5, 6
• Parametric: Approximate as y = x + 1
• Non-Parametric: Represent using cluster centroids:
o Cluster 1: {1, 2} → 1.5
o Cluster 2: {3, 4} → 3.5
o Cluster 3: {5, 6} → 5.5
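A short sketch of both approaches on the six values above: a fitted line for the parametric case and k-means centroids for the non-parametric case (a cluster count of three is assumed):

```python
import numpy as np
from sklearn.cluster import KMeans

data = np.array([1, 2, 3, 4, 5, 6], dtype=float)

# Parametric: approximate the values with a fitted line y = a*x + b.
x = np.arange(len(data))
a, b = np.polyfit(x, data, deg=1)
print(f"y = {a:.1f}x + {b:.1f}")                 # y = 1.0x + 1.0

# Non-parametric: represent the data by three cluster centroids.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data.reshape(-1, 1))
print(sorted(kmeans.cluster_centers_.ravel()))   # roughly [1.5, 3.5, 5.5]
```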
Summary
Data Reduction
Data reduction helps in simplifying data for efficient storage, analysis, and computation.
1. Data Cube Aggregation: Summarizes data across multiple dimensions.
2. Dimensionality Reduction: Reduces the number of attributes using PCA, LDA, etc.
3. Data Compression: Reduces the physical size of data.
4. Numerosity Reduction: Represents data with fewer numbers using parametric or
non-parametric methods.
5. Discretization & Concept Hierarchy: Converts continuous data into categories and
organizes it hierarchically.
These techniques ensure that large datasets become manageable while retaining key
insights.
# Decision Tree - Detailed Explanation
Splitting Criteria
1. Gini Index:
o Measures the impurity of a dataset.
o Lower Gini index → Higher purity → Better split.
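For a node whose classes occur with proportions p1, ..., pk, the Gini index is Gini = 1 − Σ pi². A minimal sketch of this computation, using assumed class counts:

```python
def gini_index(class_counts):
    # Gini = 1 - sum(p_i^2), where p_i is the proportion of class i at the node.
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

print(gini_index([5, 5]))   # 0.5 -> maximally impure two-class node
print(gini_index([10, 0]))  # 0.0 -> pure node, the ideal outcome of a split
```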