Data Mining - Lab 1
Lab 1
Data Preprocessing
In this lab, we will explore Data Preprocessing, a critical step in the data mining process
that prepares raw data for analysis by cleaning, reducing, normalizing, and discretizing it.
1 Introduction
The Adult Census Income dataset, also known as the Census Income dataset or the Adult dataset,
is sourced from the U.S. Census Bureau Database and is commonly used in classification tasks.
The goal of this dataset is to predict if an individual’s annual income exceeds $50,000 based on
demographic characteristics.
The dataset contains 48,842 records and 14 attributes (features) as follows:
Attribute Description
age The age of the individual.
workclass The type of employment (e.g., Private, Government, Self-employed, etc.).
fnlwgt The final sampling weight of the record (how many people in the census the row represents).
education The level of education.
education-num The number of years of education, in integer form.
marital-status Marital status (single, married, divorced, etc.).
occupation Occupation (Managerial, Clerical, Service, etc.).
relationship Relationship to the household (spouse, child, other relative, etc.).
race Race.
sex Gender.
capital-gain Income from capital (not from wages).
capital-loss Loss from capital (from investments).
hours-per-week The number of hours worked per week.
native-country Country of origin.
The dataset also contains a target variable, income, with two values, <=50K and >50K,
representing the individual's income level.
To get the dataset, please visit: Adult - UCI Machine Learning.
Students will learn to apply various data preprocessing techniques on the dataset mentioned
above using essential Python libraries such as pandas, numpy, matplotlib, etc. (no sklearn).
The results must be presented in a Jupyter Notebook.
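A minimal loading sketch is given below. It assumes the adult.data file from the UCI repository, with no header row and missing values coded as "?"; verify both assumptions against the downloaded file.

```python
import pandas as pd

# Column names follow the attribute list above; the order is assumed to
# match the UCI adult.data file.
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]

# The raw file has no header row; "?" is assumed to mark missing values.
df = pd.read_csv(
    "adult.data",
    header=None,
    names=columns,
    na_values="?",
    skipinitialspace=True,
)

print(df.shape)
print(df.head())
```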
2 Description
2.1 Data Cleaning
Real-world data is often dirty: it can contain incorrect values introduced by human mistakes,
computing errors, or transmission problems. That is why data needs preprocessing to ensure
the accuracy and reliability of later analyses.
To perform data cleaning tasks, let’s answer the questions and address the problems below:
• Assessment of Missing Data: Is there any missing data present? If so, what actions
should be taken and why? (A pandas sketch for this check is given after this list.)
• Identification of Duplicate Records: Are there any duplicate records present in the dataset?
If duplicates exist, keep only one of them (see the deduplication sketch after this list).
• Additional Data Cleaning Methods: Are there any further steps or methods that can be
applied to make the data cleaner?
• Feature Selection: Which features (attributes) are the most informative in the data? Why?
Which features should be kept in the dataset?
– Are there any features that are redundant or highly correlated with others? If so,
which methods (e.g., correlation thresholding, Variance Inflation Factor) will be used
to eliminate these features? (A correlation-based sketch is given after this list.)
– How many features should be retained to ensure minimal information loss?
• Need for Normalization: Does the data contain features with different scales?
– Min-Max Scaling:
$$X_{scaled} = \frac{X - X_{min}}{X_{max} - X_{min}} \times (X_{max\_new} - X_{min\_new}) + X_{min\_new}$$
– Decimal Scaling:
$$X_{scaled} = \frac{X}{10^{j}}$$
where $j$ is the smallest integer such that $\max(|X_{scaled}|) < 1$.
– Z-score Standardization:
$$X_{normalized} = \frac{X - \mu}{\sigma}$$
where $\mu$ and $\sigma$ are the mean and standard deviation of the feature.
– Are there any other normalization methods (e.g., Robust Scaling) that can handle
outliers effectively? (Sketches of these scaling methods in pandas appear after this list.)
• Provide a detailed explanation of each step. Illustrative images, diagrams, and equations are required.
• Each processing step must be fully commented, and results should be printed for observation.
• Before submitting, re-run the notebook (Kernel → Restart & Run All).
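The following pandas/numpy sketches illustrate the steps referred to in the list above; they are illustrative starting points under stated assumptions, not required solutions. First, the missing-data assessment, assuming the "?" placeholders were already read as NaN (as in the loading sketch in Section 1) and that mode filling is only one of several possible actions:

```python
# Count missing values per attribute (assumes "?" was read as NaN on load).
missing_counts = df.isna().sum()
print(missing_counts[missing_counts > 0])

# Share of rows affected, to help decide between dropping and imputing.
print(f"Rows with at least one missing value: {df.isna().any(axis=1).mean():.2%}")

# One possible action (illustrative): fill missing categorical values
# with the most frequent value of each affected column.
for col in missing_counts[missing_counts > 0].index:
    if df[col].dtype == object:
        df[col] = df[col].fillna(df[col].mode()[0])
```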
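Next, the duplicate-record check, keeping only the first occurrence of each duplicated row:

```python
# Identify exact duplicate records and keep only the first occurrence.
n_duplicates = df.duplicated().sum()
print(f"Duplicate records found: {n_duplicates}")

df = df.drop_duplicates(keep="first").reset_index(drop=True)
print(f"Shape after deduplication: {df.shape}")
```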
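For the feature-selection questions, a simple correlation-thresholding sketch on the numeric attributes; the 0.8 cut-off is an illustrative choice, not a prescribed value:

```python
import numpy as np

# Pairwise Pearson correlation of the numeric attributes.
numeric_cols = df.select_dtypes(include=[np.number]).columns
corr = df[numeric_cols].corr()
print(corr.round(2))

# Correlation thresholding: flag one feature of every pair whose absolute
# correlation exceeds the (illustrative) threshold of 0.8.
threshold = 0.8
upper = corr.abs().where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
print("Candidate features to drop:", to_drop)
```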
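Finally, the scaling formulas above written out in pandas/numpy, applied to hours-per-week purely as an example column; the robust-scaling variant divides by the interquartile range, which makes it less sensitive to outliers:

```python
import numpy as np
import pandas as pd

x = df["hours-per-week"].astype(float)

# Min-Max scaling to a chosen new range [new_min, new_max].
new_min, new_max = 0.0, 1.0
x_minmax = (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min

# Decimal scaling: divide by 10^j, with j the smallest integer that brings
# the largest absolute value below 1.
j = int(np.floor(np.log10(x.abs().max()))) + 1
x_decimal = x / (10 ** j)

# Z-score standardization: subtract the mean, divide by the standard deviation.
x_zscore = (x - x.mean()) / x.std()

# Robust scaling: subtract the median, divide by the interquartile range.
iqr = x.quantile(0.75) - x.quantile(0.25)
x_robust = (x - x.median()) / iqr

print(pd.DataFrame({
    "min-max": x_minmax, "decimal": x_decimal,
    "z-score": x_zscore, "robust": x_robust,
}).describe().round(3))
```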
4 Assessment
No. Details Score
1 Data Cleaning 25%
2 Feature Selection 25%
3 Data Normalization 25%
4 Data Discretization 25%
Total 100%
5 Notices
Please pay attention to the following notices:
• Any plagiarism, cheating, or other dishonest behavior will result in a grade of 0 for the course.
The end.