Data Wrangling Python.
Data Wrangling Python.
com/learn/intermediate-machine-
learning)
In this tutorial, you will learn what a categorical variable is, along with three approaches for handling this type
of data.
Introduction
A categorical variable takes only a limited number of values.
Consider a survey that asks how often you eat breakfast and provides four options: "Never", "Rarely",
"Most days", or "Every day". In this case, the data is categorical, because responses fall into a fixed set of
categories.
If people responded to a survey about which what brand of car they owned, the responses would fall into
categories like "Honda", "Toyota", and "Ford". In this case, the data is also categorical.
You will get an error if you try to plug these variables into most machine learning models in Python without
preprocessing them first. In this tutorial, we'll compare three approaches that you can use to prepare your
categorical data.
Three Approaches
1) Drop Categorical Variables
The easiest approach to dealing with categorical variables is to simply remove them from the dataset. This
approach will only work well if the columns did not contain useful information.
2) Label Encoding
Label encoding assigns each unique value to a different integer.
This approach assumes an ordering of the categories: "Never" (0) < "Rarely" (1) < "Most days" (2) < "Every
day" (3).
This assumption makes sense in this example, because there is an indisputable ranking to the categories. Not
all categorical variables have a clear ordering in the values, but we refer to those that do as ordinal
variables. For tree-based models (like decision trees and random forests), you can expect label encoding to
work well with ordinal variables.
3) One-Hot Encoding
One-hot encoding creates new columns indicating the presence (or absence) of each possible value in the
original data. To understand this, we'll work through an example.
In the original dataset, "Color" is a categorical variable with three categories: "Red", "Yellow", and "Green". The
corresponding one-hot encoding contains one column for each possible value, and one row for each row in the
original dataset. Wherever the original value was "Red", we put a 1 in the "Red" column; if the original value
was "Yellow", we put a 1 in the "Yellow" column, and so on.
In contrast to label encoding, one-hot encoding does not assume an ordering of the categories. Thus, you can
expect this approach to work particularly well if there is no clear ordering in the categorical data (e.g., "Red" is
neither more nor less than "Yellow"). We refer to categorical variables without an intrinsic ranking as nominal
variables.
One-hot encoding generally does not perform well if the categorical variable takes on a large number of values
(i.e., you generally won't use it for variables taking more than 15 different values).
Example
As in the previous tutorial, we will work with the Melbourne Housing dataset
(https://www.kaggle.com/dansbecker/melbourne-housing-snapshot/home).
We won't focus on the data loading step. Instead, you can imagine you are at a point where you already have
the training and validation data in X_train , X_valid , y_train , and y_valid .
In [1]: import pandas as pd
from sklearn.model_selection import train_test_split
/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py:4102: SettingWi
thCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
We take a peek at the training data with the head() method below.
In [2]: X_train.head()
Out[2]:
Type Method Regionname Rooms Distance Postcode Bedroom2 Bathroom Landsize
Southern
12167 u S 1 5.0 3182.0 1.0 1.0 0.0
Metropolitan
Western
6524 h SA 2 8.0 3016.0 2.0 2.0 193.0
Metropolitan
Western
8413 h S 3 12.6 3020.0 3.0 1.0 555.0
Metropolitan
Northern
2919 u SP 3 13.0 3046.0 3.0 1.0 265.0
Metropolitan
Western
6043 h S 3 13.3 3020.0 3.0 1.0 673.0
Metropolitan
Next, we obtain a list of all of the categorical variables in the training data.
We do this by checking the data type (or dtype) of each column. The object dtype indicates a column has
text (there are other things it could theoretically be, but that's unimportant for our purposes). For this dataset,
the columns with text indicate categorical variables.
print("Categorical variables:")
print(object_cols)
Categorical variables:
['Type', 'Method', 'Regionname']
We set handle_unknown='ignore' to avoid errors when the validation data contains classes that aren't
represented in the training data, and
setting sparse=False ensures that the encoded columns are returned as a numpy array (instead of a
sparse matrix).
To use the encoder, we supply only the categorical columns that we want to be one-hot encoded. For instance,
to encode the training data, we supply X_train[object_cols] . ( object_cols in the code cell below is a
list of the column names with categorical data, and so X_train[object_cols] contains all of the categorical
data in the training set.)
In general, one-hot encoding (Approach 3) will typically perform best, and dropping the categorical columns
(Approach 1) typically performs worst, but it varies on a case-by-case basis.
Conclusion
The world is filled with categorical data. You will be a much more effective data scientist if you know how to use
this common data type!
Your Turn
Put your new skills to work in the next exercise (https://www.kaggle.com/kernels/fork/3370279)!
Have questions or comments? Visit the Learn Discussion forum (https://www.kaggle.com/learn-forum) to chat
with other Learners.