
Team 3 - Data Science A

Machine Learning
Week #2

Kampus Merdeka x MyEduSolve


Feature Engineering

1 Feature Encoding
2 Feature Scaling
3 Feature Binning
4 Feature Selection


Feature Encoding

Machine learning models can only work with numerical values. For this reason, it is necessary to transform the categorical values of the relevant features into numerical ones.

This process is called feature encoding.
Feature Encoding

There are several types of encoding that are often used:

1 One-Hot Encoding

2 Ordinal Encoding

3 Label Encoding
One-Hot Encoding

One-hot encoding turns your categorical data into a binary vector representation.

We can do one-hot encoding on nominal data in several ways, but the easiest way is to use pandas' get_dummies(), as in the sketch below.
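A minimal sketch with pandas, assuming a hypothetical nominal column named "city":

import pandas as pd

# Hypothetical nominal feature
df = pd.DataFrame({"city": ["Jakarta", "Bandung", "Surabaya", "Bandung"]})

# get_dummies() creates one binary (0/1) column per category
encoded = pd.get_dummies(df, columns=["city"], prefix="city")
print(encoded)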
Ordinal Encoding

Ordinal encoding is a good choice if the order of the categorical variables matters.

For example, if we were predicting the price of a house, the labels "small", "medium", and "large" would imply that a small house is cheaper than a medium house, which in turn is cheaper than a large house.

The encoding is easily reversible and doesn't increase the dimensionality of the data.
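A minimal sketch using scikit-learn's OrdinalEncoder on the house-size example above; the column name "size" and the sample values are assumptions for illustration:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature: house size
df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Pass the categories in their meaningful order so that small < medium < large
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
df["size_encoded"] = encoder.fit_transform(df[["size"]])
print(df)

# The mapping is easily reversible
print(encoder.inverse_transform(df[["size_encoded"]]))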
Label Encoding

Encode target labels with values between 0 and n_classes-1.

A label encoder is used when:
- the number of categories is quite large, since one-hot encoding can lead to high memory consumption;
- the order does not matter in the categorical feature.
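A minimal sketch with scikit-learn's LabelEncoder, using made-up target labels:

from sklearn.preprocessing import LabelEncoder

# Hypothetical target labels
y = ["cat", "dog", "bird", "dog", "cat"]

le = LabelEncoder()
y_encoded = le.fit_transform(y)   # values between 0 and n_classes-1
print(y_encoded)                  # [1 2 0 2 1]
print(le.classes_)                # ['bird' 'cat' 'dog']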
Feature Scaling

Feature scaling is a technique to standardize the independent features present in the data to a fixed range.

If feature scaling is not done, a machine learning algorithm tends to weigh greater values higher and treat smaller values as lower, regardless of the unit of the values.
Feature Scaling

There are several types of scaling that are often used:

1 Standard Scaler

2 MinMax Scaler

3 Robust Scaler
StandardScaler()
sklearn.preprocessing.StandardScaler

Standardize features by removing the mean and scaling to unit variance.

The standard score of a sample x is calculated as:

z = (x - u) / s

where u is the mean of the training samples (or zero if with_mean=False), and s is the standard deviation of the training samples (or one if with_std=False).
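A minimal sketch, assuming a small made-up feature matrix with two columns on very different scales:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features: age and income
X = np.array([[25, 40000], [32, 55000], [47, 120000], [51, 90000]], dtype=float)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # applies z = (x - u) / s per column

print(X_scaled.mean(axis=0))   # approximately 0 for each feature
print(X_scaled.std(axis=0))    # approximately 1 for each feature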
MinMaxScaler()
sklearn.preprocessing.MinMaxScaler

Transform features by scaling each feature to a given range.

This estimator scales and translates each feature individually such that it falls within the given range on the training set, e.g. between zero and one.

The transformation is given by:

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min

where min, max = feature_range.

This transformation is often used as an alternative to zero-mean, unit-variance scaling.
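A minimal sketch with the default feature_range=(0, 1), on made-up data:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 200.0], [2.0, 500.0], [4.0, 800.0]])

scaler = MinMaxScaler()            # default feature_range is (0, 1)
X_scaled = scaler.fit_transform(X)

print(X_scaled)                    # each column now lies between 0 and 1
print(scaler.data_min_, scaler.data_max_)   # per-feature min and max seen during fit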
RobustScaler()
sklearn.preprocessing.RobustScaler

Scale features using statistics that are robust to outliers.

This scaler removes the median and scales the data according to the quantile range (defaults to the IQR: interquartile range).

The IQR is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).

Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. The median and interquartile range are then stored to be used on later data via the transform method.
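A minimal sketch on a made-up feature with one large outlier, showing that the median and IQR are barely affected by it:

import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical feature with an outlier
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Removes the median and scales by the IQR (25th-75th percentile) by default
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.ravel())
print(scaler.center_, scaler.scale_)   # learned median and IQR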
Feature Binning

Binning or discretization is used for the transformation of a continuous or numerical variable into a categorical feature.

For example, if you have data about a group of people, you might want to organize their ages into a smaller number of age intervals such as child, teens, adults, etc.
Feature Binning Example
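A minimal sketch of the age example using pandas' cut(); the bin edges and labels are assumptions for illustration:

import pandas as pd

ages = pd.Series([4, 12, 17, 25, 38, 61, 73])

# Bin the continuous ages into a small number of labelled intervals
age_group = pd.cut(
    ages,
    bins=[0, 12, 19, 59, 120],
    labels=["child", "teen", "adult", "senior"],
)
print(age_group)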
Feature Selection

Feature selection is the method of reducing the input variables to your model by using only relevant data and getting rid of noise in the data.

It is the process of automatically choosing relevant features for your machine learning model based on the type of problem you are trying to solve.
Feature Selection

1 Filter Method
2 Embedded Method (Feature Importances)
Filter Method

The filter method evaluates each feature independently, ranks the features after evaluation, and picks the best ones.

The filter method uses statistical tests to assign a score to each feature. Each feature is then ranked by its score and either kept in or dropped from the dataset.
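A minimal sketch of a filter method using scikit-learn's SelectKBest with the ANOVA F-test; the Iris dataset and k=2 are just for illustration:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Score every feature independently with a statistical test, then keep the best k
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)         # one score per feature
print(selector.get_support())   # mask of the features that are kept
print(X_selected.shape)         # (150, 2)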
Embedded Method
(Feature Importances)

The embedded method is a feature selection method that combines the advantages of the filter method and the wrapper method.

The wrapper method requires one type of machine learning algorithm and uses its performance as the evaluation criterion.

The approach that is most often used is a tree-based model. We can use ExtraTreeClassifier for classification problems or ExtraTreeRegressor for regression problems.
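A minimal sketch using the ensemble ExtraTreesClassifier from scikit-learn (the ensemble counterpart of the single-tree ExtraTreeClassifier named above), again on the Iris dataset for illustration:

from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier

data = load_iris()
X, y = data.data, data.target

# Fit a tree-based model and read the importance it assigns to each feature
model = ExtraTreesClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

for name, importance in zip(data.feature_names, model.feature_importances_):
    print(f"{name}: {importance:.3f}")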
Modeling

1 Linear Regression

2 Decision Tree
Simple Linear Regression

Simple linear regression is a type of regression algorithm that models the relationship between a dependent variable and a single independent variable. The relationship shown by a simple linear regression model is linear, a sloped straight line, hence the name.

The key point in simple linear regression is that the dependent variable must be a continuous/real value. The independent variable, however, can be measured on a continuous or categorical scale.

The simple linear regression algorithm has two main objectives:
1. Model the relationship between the two variables, such as the relationship between income and expenditure, or experience and salary.
2. Forecast new observations, such as forecasting the weather according to temperature, or a company's revenue according to its investments in a year.
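A minimal sketch of the experience-vs-salary example with scikit-learn; the numbers are made up:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: years of experience vs. salary
experience = np.array([[1], [3], [5], [7], [9]])
salary = np.array([35000, 50000, 68000, 82000, 101000])

model = LinearRegression()
model.fit(experience, salary)

print(model.coef_[0], model.intercept_)   # slope and intercept of the fitted line
print(model.predict([[6]]))               # forecast salary for 6 years of experience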
Multiple Linear Regression

We can define it as: "Multiple linear regression is one of the important regression algorithms; it models the linear relationship between a single dependent continuous variable and more than one independent variable."

We have to perform some classical assumption tests:
1. Linearity test: check whether the variables X and y are linearly correlated. We can use the Pearson or Spearman correlation test, a pairplot, or a scatterplot.
2. Normality test (goodness of fit): check whether the variables are normally distributed (parametric). We can use the D'Agostino test, the Shapiro-Wilk test, or a Q-Q plot.
3. Multicollinearity: check for strong correlation among the independent variables (X). We can use variance_inflation_factor() or a correlation heatmap.
4. Homoscedasticity: check that the variance of each residual value (error) is constant. We can use the Bartlett, Levene, or ANOVA tests.
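A minimal sketch with statsmodels, fitting a multiple linear regression and running one of the assumption checks (VIF for multicollinearity); the data and column names are assumptions for illustration:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical data: predict house price from area and number of rooms
df = pd.DataFrame({
    "area":  [50, 70, 80, 100, 120, 150],
    "rooms": [2, 3, 3, 4, 4, 5],
    "price": [150, 200, 230, 290, 330, 400],
})
X = sm.add_constant(df[["area", "rooms"]])
y = df["price"]

# Multicollinearity check: VIF for each column of the design matrix
for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))

# Fit the multiple linear regression and inspect the coefficients
model = sm.OLS(y, X).fit()
print(model.params)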
Decision Tree

A decision tree is a supervised learning technique that can be used for both classification and regression problems, but it is mostly preferred for solving classification problems.

It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents the outcome.
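A minimal sketch of a decision tree classifier with scikit-learn, using the Iris dataset for illustration:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Internal nodes test features, branches encode decision rules, leaves hold the outcome
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)

print(clf.score(X_test, y_test))   # accuracy on held-out data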
Thank you!
