0% found this document useful (0 votes)

73 views

Data Wrangling Python.

This document discusses three approaches for handling categorical variables in machine learning models: 1) Drop categorical variables, 2) Label encoding, and 3) One-hot encoding. It introduces a Melbourne housing dataset and defines a function to evaluate the mean absolute error of a random forest model under each approach. Label encoding and one-hot encoding the categorical variables "Type", "Method", and "Regionname" results in lower error than dropping these variables. One-hot encoding generally performs best for nominal variables without a clear ordering.

Uploaded by

Robert

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

73 views

Data Wrangling Python.

Uploaded by

Robert

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 8

Intermediate Machine Learning Home Page (https://www.kaggle.

com/learn/intermediate-machine-
learning)
In this tutorial, you will learn what a categorical variable is, along with three approaches for handling this type
of data.

Introduction
A categorical variable takes only a limited number of values.

Consider a survey that asks how often you eat breakfast and provides four options: "Never", "Rarely",
"Most days", or "Every day". In this case, the data is categorical, because responses fall into a fixed set of
categories.
If people responded to a survey about which what brand of car they owned, the responses would fall into
categories like "Honda", "Toyota", and "Ford". In this case, the data is also categorical.

You will get an error if you try to plug these variables into most machine learning models in Python without
preprocessing them first. In this tutorial, we'll compare three approaches that you can use to prepare your
categorical data.

Three Approaches
1) Drop Categorical Variables
The easiest approach to dealing with categorical variables is to simply remove them from the dataset. This
approach will only work well if the columns did not contain useful information.

2) Label Encoding
Label encoding assigns each unique value to a different integer.

This approach assumes an ordering of the categories: "Never" (0) < "Rarely" (1) < "Most days" (2) < "Every
day" (3).

This assumption makes sense in this example, because there is an indisputable ranking to the categories. Not
all categorical variables have a clear ordering in the values, but we refer to those that do as ordinal
variables. For tree-based models (like decision trees and random forests), you can expect label encoding to
work well with ordinal variables.

3) One-Hot Encoding
One-hot encoding creates new columns indicating the presence (or absence) of each possible value in the
original data. To understand this, we'll work through an example.

In the original dataset, "Color" is a categorical variable with three categories: "Red", "Yellow", and "Green". The
corresponding one-hot encoding contains one column for each possible value, and one row for each row in the
original dataset. Wherever the original value was "Red", we put a 1 in the "Red" column; if the original value
was "Yellow", we put a 1 in the "Yellow" column, and so on.

In contrast to label encoding, one-hot encoding does not assume an ordering of the categories. Thus, you can
expect this approach to work particularly well if there is no clear ordering in the categorical data (e.g., "Red" is
neither more nor less than "Yellow"). We refer to categorical variables without an intrinsic ranking as nominal
variables.

One-hot encoding generally does not perform well if the categorical variable takes on a large number of values
(i.e., you generally won't use it for variables taking more than 15 different values).

Example
As in the previous tutorial, we will work with the Melbourne Housing dataset
(https://www.kaggle.com/dansbecker/melbourne-housing-snapshot/home).

We won't focus on the data loading step. Instead, you can imagine you are at a point where you already have
the training and validation data in X_train , X_valid , y_train , and y_valid .
In [1]: import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data

data = pd.read_csv('../input/melbourne-housing-snapshot/melb_data.csv')

# Separate target from predictors

y = data.Price
X = data.drop(['Price'], axis=1)

# Divide data into training and validation subsets

X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train
_size=0.8, test_size=0.2,
random_stat
e=0)

# Drop columns with missing values (simplest approach)

cols_with_missing = [col for col in X_train_full.columns if X_train_full[co
l].isnull().any()]
X_train_full.drop(cols_with_missing, axis=1, inplace=True)
X_valid_full.drop(cols_with_missing, axis=1, inplace=True)

# "Cardinality" means the number of unique values in a column

# Select categorical columns with relatively low cardinality (convenient bu
t arbitrary)
low_cardinality_cols = [cname for cname in X_train_full.columns if X_train_
full[cname].nunique() < 10 and
X_train_full[cname].dtype == "object"]

# Select numerical columns

numerical_cols = [cname for cname in X_train_full.columns if X_train_full[c
name].dtype in ['int64', 'float64']]

# Keep selected columns only

my_cols = low_cardinality_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()

/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py:4102: SettingWi
thCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/

stable/user_guide/indexing.html#returning-a-view-versus-a-copy
errors=errors,

We take a peek at the training data with the head() method below.
In [2]: X_train.head()

Out[2]:
Type Method Regionname Rooms Distance Postcode Bedroom2 Bathroom Landsize

Southern
12167 u S 1 5.0 3182.0 1.0 1.0 0.0
Metropolitan

Western
6524 h SA 2 8.0 3016.0 2.0 2.0 193.0
Metropolitan

Western
8413 h S 3 12.6 3020.0 3.0 1.0 555.0
Metropolitan

Northern
2919 u SP 3 13.0 3046.0 3.0 1.0 265.0
Metropolitan

Western
6043 h S 3 13.3 3020.0 3.0 1.0 673.0
Metropolitan

Next, we obtain a list of all of the categorical variables in the training data.

We do this by checking the data type (or dtype) of each column. The object dtype indicates a column has
text (there are other things it could theoretically be, but that's unimportant for our purposes). For this dataset,
the columns with text indicate categorical variables.

In [3]: # Get list of categorical variables

s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)

print("Categorical variables:")
print(object_cols)

Categorical variables:
['Type', 'Method', 'Regionname']

Define Function to Measure Quality of Each Approach

We define a function score_dataset() to compare the three different approaches to dealing with
categorical variables. This function reports the mean absolute error
(https://en.wikipedia.org/wiki/Mean_absolute_error) (MAE) from a random forest model. In general, we want the
MAE to be as low as possible!

In [4]: from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import mean_absolute_error

# Function for comparing different approaches

def score_dataset(X_train, X_valid, y_train, y_valid):
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
preds = model.predict(X_valid)
return mean_absolute_error(y_valid, preds)
Score from Approach 1 (Drop Categorical Variables)
We drop the object columns with the select_dtypes() (https://pandas.pydata.org/pandas-
docs/stable/reference/api/pandas.DataFrame.select_dtypes.html) method.

In [5]: drop_X_train = X_train.select_dtypes(exclude=['object'])

drop_X_valid = X_valid.select_dtypes(exclude=['object'])

print("MAE from Approach 1 (Drop categorical variables):")

print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))

MAE from Approach 1 (Drop categorical variables):

175703.48185157913

Score from Approach 2 (Label Encoding)

Scikit-learn has a LabelEncoder (https://scikit-
learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) class that can be used to get
label encodings. We loop over the categorical variables and apply the label encoder separately to each
column.

In [6]: from sklearn.preprocessing import LabelEncoder

# Make copy to avoid changing original data

label_X_train = X_train.copy()
label_X_valid = X_valid.copy()

# Apply label encoder to each column with categorical data

label_encoder = LabelEncoder()
for col in object_cols:
label_X_train[col] = label_encoder.fit_transform(X_train[col])
label_X_valid[col] = label_encoder.transform(X_valid[col])

print("MAE from Approach 2 (Label Encoding):")

print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))

MAE from Approach 2 (Label Encoding):

165936.40548390493
In the code cell above, for each column, we randomly assign each unique value to a different integer. This is a
common approach that is simpler than providing custom labels; however, we can expect an additional boost in
performance if we provide better-informed labels for all ordinal variables.

Score from Approach 3 (One-Hot Encoding)

We use the OneHotEncoder (https://scikit-
learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) class from scikit-learn to get
one-hot encodings. There are a number of parameters that can be used to customize its behavior.

We set handle_unknown='ignore' to avoid errors when the validation data contains classes that aren't
represented in the training data, and
setting sparse=False ensures that the encoded columns are returned as a numpy array (instead of a
sparse matrix).

To use the encoder, we supply only the categorical columns that we want to be one-hot encoded. For instance,
to encode the training data, we supply X_train[object_cols] . ( object_cols in the code cell below is a
list of the column names with categorical data, and so X_train[object_cols] contains all of the categorical
data in the training set.)

In [7]: from sklearn.preprocessing import OneHotEncoder

# Apply one-hot encoder to each column with categorical data

OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols
]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))

# One-hot encoding removed index; put it back

OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns (will replace with one-hot encoding)

num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features

OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

print("MAE from Approach 3 (One-Hot Encoding):")

print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))

MAE from Approach 3 (One-Hot Encoding):

166089.4893009678
Which approach is best?
In this case, dropping the categorical columns (Approach 1) performed worst, since it had the highest MAE
score. As for the other two approaches, since the returned MAE scores are so close in value, there doesn't
appear to be any meaningful benefit to one over the other.

In general, one-hot encoding (Approach 3) will typically perform best, and dropping the categorical columns
(Approach 1) typically performs worst, but it varies on a case-by-case basis.

Conclusion
The world is filled with categorical data. You will be a much more effective data scientist if you know how to use
this common data type!

Your Turn
Put your new skills to work in the next exercise (https://www.kaggle.com/kernels/fork/3370279)!

Intermediate Machine Learning Home Page (https://www.kaggle.com/learn/intermediate-machine-

learning)

Have questions or comments? Visit the Learn Discussion forum (https://www.kaggle.com/learn-forum) to chat
with other Learners.

Mercedes-Benz Greener Manufacturing Ai
0% (1)
Mercedes-Benz Greener Manufacturing Ai
16 pages
Programming With MSW Logo
No ratings yet
Programming With MSW Logo
22 pages
Gateway Zx4800-02 Oobe
No ratings yet
Gateway Zx4800-02 Oobe
11 pages
ML-Lab05-Data Preprocessing Techniques in Python
No ratings yet
ML-Lab05-Data Preprocessing Techniques in Python
7 pages
Dealing with categorical
No ratings yet
Dealing with categorical
25 pages
LAB MANUAL 5 SOLVED 40 (1)
No ratings yet
LAB MANUAL 5 SOLVED 40 (1)
13 pages
Train
No ratings yet
Train
17 pages
Building Good Training Sets UNIT 1 PART2
No ratings yet
Building Good Training Sets UNIT 1 PART2
46 pages
ML Book Notes
No ratings yet
ML Book Notes
9 pages
22K61A0654_2_sasi_auto
No ratings yet
22K61A0654_2_sasi_auto
24 pages
Using Categorical Data With One Hot Encoding - Kaggle PDF
No ratings yet
Using Categorical Data With One Hot Encoding - Kaggle PDF
4 pages
Categorical Variable
No ratings yet
Categorical Variable
2 pages
003-FIN7790 (Part2)
No ratings yet
003-FIN7790 (Part2)
162 pages
EDA Cheatsheet - Class Note
No ratings yet
EDA Cheatsheet - Class Note
29 pages
EDA Cheatsheet - Class Note
No ratings yet
EDA Cheatsheet - Class Note
29 pages
Exp 6
No ratings yet
Exp 6
9 pages
Abhiml ML File
No ratings yet
Abhiml ML File
74 pages
Feature Engineering
No ratings yet
Feature Engineering
20 pages
EDA+Cheatsheet+ +Class+Note
No ratings yet
EDA+Cheatsheet+ +Class+Note
29 pages
Lab File
No ratings yet
Lab File
96 pages
EDA+Cheatsheet+ +Class+Note
No ratings yet
EDA+Cheatsheet+ +Class+Note
29 pages
EDA Cheatsheet - Class Note
No ratings yet
EDA Cheatsheet - Class Note
29 pages
Dwdm-Lab Manual
No ratings yet
Dwdm-Lab Manual
39 pages
100 Days of Machine Learning
No ratings yet
100 Days of Machine Learning
14 pages
Xgboost
No ratings yet
Xgboost
12 pages
Data Clearning
No ratings yet
Data Clearning
7 pages
Practical 3 - Categorical Feature Engineering
No ratings yet
Practical 3 - Categorical Feature Engineering
6 pages
StarterNotebook - Jupyter Notebook
No ratings yet
StarterNotebook - Jupyter Notebook
12 pages
ml lab
No ratings yet
ml lab
14 pages
ML unit 3
No ratings yet
ML unit 3
17 pages
Project paarth (1) (1)
No ratings yet
Project paarth (1) (1)
21 pages
L1_Data Pre-processing & Steps of Building a Model (1)
No ratings yet
L1_Data Pre-processing & Steps of Building a Model (1)
30 pages
House Price Prediction Using Machine Learning in Python
No ratings yet
House Price Prediction Using Machine Learning in Python
13 pages
ML Final Prac
No ratings yet
ML Final Prac
47 pages
Analysis and Prediction of House Prices by Linear Regression Model
No ratings yet
Analysis and Prediction of House Prices by Linear Regression Model
91 pages
Data Preprocessing Techniques in ML
No ratings yet
Data Preprocessing Techniques in ML
12 pages
Data Preparation For ML in Practice v213
No ratings yet
Data Preparation For ML in Practice v213
78 pages
Week 10
No ratings yet
Week 10
50 pages
Data Wrangling and Preprocessing
100% (1)
Data Wrangling and Preprocessing
41 pages
EDA - Exploratory Data Analysis
No ratings yet
EDA - Exploratory Data Analysis
16 pages
EDA+Cheatsheet+-+Class+Note
No ratings yet
EDA+Cheatsheet+-+Class+Note
29 pages
Data Mining Lab 03
No ratings yet
Data Mining Lab 03
10 pages
Reading Data: #Importing Required Libraries
No ratings yet
Reading Data: #Importing Required Libraries
16 pages
ML Lab Programs
No ratings yet
ML Lab Programs
18 pages
Data Pre Processing
No ratings yet
Data Pre Processing
2 pages
Dealing With Missing Data in Python Pandas
100% (1)
Dealing With Missing Data in Python Pandas
14 pages
EDA Cheatsheet - Class Note
No ratings yet
EDA Cheatsheet - Class Note
29 pages
Data Mining & Data Science Practical Slips
No ratings yet
Data Mining & Data Science Practical Slips
45 pages
Data Analysis: Data Preparation
No ratings yet
Data Analysis: Data Preparation
9 pages
ModuleAr Merged
No ratings yet
ModuleAr Merged
42 pages
MACHINE LEARNING manual
No ratings yet
MACHINE LEARNING manual
36 pages
PCA2-1
No ratings yet
PCA2-1
26 pages
83 Sklearn Pipeline
No ratings yet
83 Sklearn Pipeline
8 pages
AD3461_ML Lab Manual
No ratings yet
AD3461_ML Lab Manual
54 pages
EDS - Python Cheat Sheet
0% (1)
EDS - Python Cheat Sheet
3 pages
ML in 10 Pages 1683806402
No ratings yet
ML in 10 Pages 1683806402
10 pages
5) Randomforest - Ipynb - Colaboratory
No ratings yet
5) Randomforest - Ipynb - Colaboratory
12 pages
ML Interview Questions
No ratings yet
ML Interview Questions
10 pages
week_11_features_categorical
No ratings yet
week_11_features_categorical
15 pages
01 - Feature Engg
No ratings yet
01 - Feature Engg
43 pages
23BCE7092_ML_Lab_Assignment[1]
No ratings yet
23BCE7092_ML_Lab_Assignment[1]
14 pages
Advanced C Concepts and Programming: First Edition
From Everand
Advanced C Concepts and Programming: First Edition
Gayatri
3/5 (1)
DX Diag
No ratings yet
DX Diag
33 pages
FP Growth Algorithm
No ratings yet
FP Growth Algorithm
21 pages
r/health Mods Ban For /r/ideasfortheadmins Post
No ratings yet
r/health Mods Ban For /r/ideasfortheadmins Post
1 page
Programming With WxDev-C++
100% (1)
Programming With WxDev-C++
340 pages
ReaLearn User Guide
No ratings yet
ReaLearn User Guide
24 pages
Brochure - Condition Monitoring With SIMATIC S7-1200
0% (1)
Brochure - Condition Monitoring With SIMATIC S7-1200
2 pages
Scribd
No ratings yet
Scribd
2 pages
Data Privacy Regulation Matrix For Salesforce by Cloud Compliance
No ratings yet
Data Privacy Regulation Matrix For Salesforce by Cloud Compliance
2 pages
DC SS009 7 PDF
No ratings yet
DC SS009 7 PDF
11 pages
Database Management - Class Notes
No ratings yet
Database Management - Class Notes
6 pages
TR 181 - Issue 2 - Amendment 2 PDF
No ratings yet
TR 181 - Issue 2 - Amendment 2 PDF
88 pages
SSBP Slides PDF
No ratings yet
SSBP Slides PDF
118 pages
Advanced Ubuntu Sheet PDF
No ratings yet
Advanced Ubuntu Sheet PDF
4 pages
Iproject.com.Ng 33742.1966 Design and Implementation of a Road Transport Booking System
No ratings yet
Iproject.com.Ng 33742.1966 Design and Implementation of a Road Transport Booking System
52 pages
Bella Model Limitations PDF
No ratings yet
Bella Model Limitations PDF
7 pages
Business Analytics Using R
No ratings yet
Business Analytics Using R
1 page
RDBMS Lab Assignment
67% (3)
RDBMS Lab Assignment
60 pages
Ficha Técnica Magellan 3550hsi
No ratings yet
Ficha Técnica Magellan 3550hsi
2 pages
DS-3E1326P-EI-B_Datasheet_20240822
No ratings yet
DS-3E1326P-EI-B_Datasheet_20240822
7 pages
Axiohm NCR Thermal Receipt Printer A794 Owners Guide
No ratings yet
Axiohm NCR Thermal Receipt Printer A794 Owners Guide
146 pages
Circuits and Systems - Examiners
No ratings yet
Circuits and Systems - Examiners
3 pages
Assignment # 02: Air University
No ratings yet
Assignment # 02: Air University
5 pages
Syllabus M.tech Extc-1
No ratings yet
Syllabus M.tech Extc-1
59 pages
Sanchita Sarkar - Portfolio 2019
No ratings yet
Sanchita Sarkar - Portfolio 2019
6 pages
Fitting Guide Remote Support Phonak Target 6.0
No ratings yet
Fitting Guide Remote Support Phonak Target 6.0
7 pages
Deploying and Managing AD-Windows PowerShell
100% (4)
Deploying and Managing AD-Windows PowerShell
687 pages
Naomi Adjoa Jewel Degadzor - CV
No ratings yet
Naomi Adjoa Jewel Degadzor - CV
2 pages
e - 20240509 VL - Complete
No ratings yet
e - 20240509 VL - Complete
12 pages

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.

Data Wrangling Python.

Uploaded by

Data Wrangling Python.

Uploaded by

Intermediate Machine Learning Home Page (https://www.kaggle.

# Read the data

# Separate target from predictors

# Divide data into training and validation subsets

# Drop columns with missing values (simplest approach)

# "Cardinality" means the number of unique values in a column

# Select numerical columns

# Keep selected columns only

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/

In [3]: # Get list of categorical variables

Define Function to Measure Quality of Each Approach

In [4]: from sklearn.ensemble import RandomForestRegressor

# Function for comparing different approaches

In [5]: drop_X_train = X_train.select_dtypes(exclude=['object'])

print("MAE from Approach 1 (Drop categorical variables):")

MAE from Approach 1 (Drop categorical variables):

Score from Approach 2 (Label Encoding)

In [6]: from sklearn.preprocessing import LabelEncoder

# Make copy to avoid changing original data

# Apply label encoder to each column with categorical data

print("MAE from Approach 2 (Label Encoding):")

MAE from Approach 2 (Label Encoding):

Score from Approach 3 (One-Hot Encoding)

In [7]: from sklearn.preprocessing import OneHotEncoder

# Apply one-hot encoder to each column with categorical data

# One-hot encoding removed index; put it back

# Remove categorical columns (will replace with one-hot encoding)

# Add one-hot encoded columns to numerical features

print("MAE from Approach 3 (One-Hot Encoding):")

MAE from Approach 3 (One-Hot Encoding):

Intermediate Machine Learning Home Page (https://www.kaggle.com/learn/intermediate-machine-

You might also like

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.