
Team 3 - Data Science A

Machine Learning
Week #2

Kampus Merdeka x MyEduSolve


Feature Engineering

1 Feature Encoding
2 Feature Scaling
3 Feature Binning
4 Feature Selection


Feature Encoding

Machine learning models can only work with numerical values. For this reason, it is necessary to transform the categorical values of the relevant features into numerical ones.

This process is called feature encoding.
Feature Encoding

There are several types of encoding that are often used:

1 One-Hot Encoding

2 Ordinal Encoding

3 Label Encoding
One-Hot Encoding

One-hot encoding turns your categorical data into a binary vector representation.

We can do one-hot encoding on nominal data in several ways, but the easiest way is to use pandas' get_dummies(), as in the sketch below.
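A minimal sketch with pandas, assuming a hypothetical nominal column named "city":

import pandas as pd

# Hypothetical nominal feature
df = pd.DataFrame({"city": ["Jakarta", "Bandung", "Surabaya", "Bandung"]})

# get_dummies() creates one binary (0/1) column per category
encoded = pd.get_dummies(df, columns=["city"], prefix="city")
print(encoded)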
Ordinal Encoding

Ordinal encoding is a good choice if the order of the categorical variables matters.

For example, if we were predicting the price of a house, the labels "small", "medium", and "large" would imply that a small house is cheaper than a medium house, which in turn is cheaper than a large house.

The encoding is easily reversible and doesn't increase the dimensionality of the data.
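A minimal sketch using scikit-learn's OrdinalEncoder on the house-size example above; the column name "size" and the sample values are assumptions for illustration:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature: house size
df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Pass the categories in their meaningful order so that small < medium < large
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
df["size_encoded"] = encoder.fit_transform(df[["size"]])
print(df)

# The mapping is easily reversible
print(encoder.inverse_transform(df[["size_encoded"]]))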
Label Encoding

Encode target labels with values between 0 and n_classes-1.

A label encoder is used when:
- the number of categories is quite large, since one-hot encoding can lead to high memory consumption;
- the order does not matter in the categorical feature.
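A minimal sketch with scikit-learn's LabelEncoder, using made-up target labels:

from sklearn.preprocessing import LabelEncoder

# Hypothetical target labels
y = ["cat", "dog", "bird", "dog", "cat"]

le = LabelEncoder()
y_encoded = le.fit_transform(y)   # values between 0 and n_classes-1
print(y_encoded)                  # [1 2 0 2 1]
print(le.classes_)                # ['bird' 'cat' 'dog']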
Feature Scaling

Feature scaling is a technique to standardize the independent features present in the data to a fixed range.

If feature scaling is not done, a machine learning algorithm tends to weigh greater values higher and treat smaller values as lower, regardless of the unit of the values.
Feature Scaling

There are several types of scaling that are often used:

1 Standard Scaler

2 MinMax Scaler

3 Robust Scaler
StandardScaler()
sklearn.preprocessing.StandardScaler

Standardize features by removing the mean and scaling to unit variance.

The standard score of a sample x is calculated as:

z = (x - u) / s

where u is the mean of the training samples (or zero if with_mean=False), and s is the standard deviation of the training samples (or one if with_std=False).
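A minimal sketch, assuming a small made-up feature matrix with two columns on very different scales:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features: age and income
X = np.array([[25, 40000], [32, 55000], [47, 120000], [51, 90000]], dtype=float)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # applies z = (x - u) / s per column

print(X_scaled.mean(axis=0))   # approximately 0 for each feature
print(X_scaled.std(axis=0))    # approximately 1 for each feature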
MinMaxScaler()
sklearn.preprocessing.MinMaxScaler

Transform features by scaling each feature to a given range.

This estimator scales and translates each feature individually such that it falls within the given range on the training set, e.g. between zero and one.

The transformation is given by:

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min

where min, max = feature_range.

This transformation is often used as an alternative to zero-mean, unit-variance scaling.
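A minimal sketch with the default feature_range=(0, 1), on made-up data:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 200.0], [2.0, 500.0], [4.0, 800.0]])

scaler = MinMaxScaler()            # default feature_range is (0, 1)
X_scaled = scaler.fit_transform(X)

print(X_scaled)                    # each column now lies between 0 and 1
print(scaler.data_min_, scaler.data_max_)   # per-feature min and max seen during fit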
RobustScaler()
sklearn.preprocessing.RobustScaler

Scale features using statistics that are robust to outliers.

This scaler removes the median and scales the data according to the quantile range (defaults to the IQR: interquartile range).

The IQR is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).

Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. The median and interquartile range are then stored to be used on later data via the transform method.
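A minimal sketch on a made-up feature with one large outlier, showing that the median and IQR are barely affected by it:

import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical feature with an outlier
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Removes the median and scales by the IQR (25th-75th percentile) by default
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.ravel())
print(scaler.center_, scaler.scale_)   # learned median and IQR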
Feature Binning

Binning or discretization is used for the transformation of a continuous or numerical variable into a categorical feature.

For example, if you have data about a group of people, you might want to organize their ages into a smaller number of age intervals such as child, teens, adults, etc.
Feature Binning Example
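A minimal sketch of the age example using pandas' cut(); the bin edges and labels are assumptions for illustration:

import pandas as pd

ages = pd.Series([4, 12, 17, 25, 38, 61, 73])

# Bin the continuous ages into a small number of labelled intervals
age_group = pd.cut(
    ages,
    bins=[0, 12, 19, 59, 120],
    labels=["child", "teen", "adult", "senior"],
)
print(age_group)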
Feature Selection

Feature selection is the method of reducing the input variables to your model by using only relevant data and getting rid of noise in the data.

It is the process of automatically choosing relevant features for your machine learning model based on the type of problem you are trying to solve.
Feature Selection

1 Filter Method
2 Embedded Method (Feature Importances)
Filter Method

The filter method evaluates each feature independently, ranks the features after evaluation, and picks the best ones.

The filter method uses statistical tests to assign a score to each feature. Each feature is then ranked by its score and either kept in or dropped from the dataset.
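A minimal sketch of a filter method using scikit-learn's SelectKBest with the ANOVA F-test; the Iris dataset and k=2 are just for illustration:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Score every feature independently with a statistical test, then keep the best k
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)         # one score per feature
print(selector.get_support())   # mask of the features that are kept
print(X_selected.shape)         # (150, 2)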
Embedded Method
(Feature Importances)

The embedded method is a feature selection method that combines the advantages of the filter method and the wrapper method.

The wrapper method requires one type of machine learning algorithm and uses its performance as the evaluation criterion.

The approach that is most often used is a tree-based model. We can use ExtraTreeClassifier for classification problems or ExtraTreeRegressor for regression problems.
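A minimal sketch using the ensemble ExtraTreesClassifier from scikit-learn (the ensemble counterpart of the single-tree ExtraTreeClassifier named above), again on the Iris dataset for illustration:

from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier

data = load_iris()
X, y = data.data, data.target

# Fit a tree-based model and read the importance it assigns to each feature
model = ExtraTreesClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

for name, importance in zip(data.feature_names, model.feature_importances_):
    print(f"{name}: {importance:.3f}")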
Modeling

1 Linear Regression

2 Decision Tree
Simple Linear Regression

Simple linear regression is a type of regression algorithm that models the relationship between a dependent variable and a single independent variable. The relationship shown by a simple linear regression model is linear, a sloped straight line, hence the name.

The key point in simple linear regression is that the dependent variable must be a continuous/real value. The independent variable, however, can be measured on a continuous or categorical scale.

The simple linear regression algorithm has two main objectives:
1. Model the relationship between the two variables, such as the relationship between income and expenditure, or experience and salary.
2. Forecast new observations, such as forecasting the weather according to temperature, or a company's revenue according to its investments in a year.
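A minimal sketch of the experience-vs-salary example with scikit-learn; the numbers are made up:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: years of experience vs. salary
experience = np.array([[1], [3], [5], [7], [9]])
salary = np.array([35000, 50000, 68000, 82000, 101000])

model = LinearRegression()
model.fit(experience, salary)

print(model.coef_[0], model.intercept_)   # slope and intercept of the fitted line
print(model.predict([[6]]))               # forecast salary for 6 years of experience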
Multiple Linear Regression

We can define it as: "Multiple linear regression is one of the important regression algorithms; it models the linear relationship between a single dependent continuous variable and more than one independent variable."

We have to perform some classical assumption tests:
1. Linearity test: check whether the variables X and y are linearly correlated. We can use the Pearson or Spearman correlation test, a pairplot, or a scatterplot.
2. Normality test (goodness of fit): check whether the variables are normally distributed (parametric). We can use the D'Agostino test, the Shapiro-Wilk test, or a Q-Q plot.
3. Multicollinearity: check for strong correlation among the independent variables (X). We can use variance_inflation_factor() or a correlation heatmap.
4. Homoscedasticity: check that the variance of each residual value (error) is constant. We can use the Bartlett, Levene, or ANOVA tests.
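A minimal sketch with statsmodels, fitting a multiple linear regression and running one of the assumption checks (VIF for multicollinearity); the data and column names are assumptions for illustration:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical data: predict house price from area and number of rooms
df = pd.DataFrame({
    "area":  [50, 70, 80, 100, 120, 150],
    "rooms": [2, 3, 3, 4, 4, 5],
    "price": [150, 200, 230, 290, 330, 400],
})
X = sm.add_constant(df[["area", "rooms"]])
y = df["price"]

# Multicollinearity check: VIF for each column of the design matrix
for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))

# Fit the multiple linear regression and inspect the coefficients
model = sm.OLS(y, X).fit()
print(model.params)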
Decision Tree

A decision tree is a supervised learning technique that can be used for both classification and regression problems, but it is mostly preferred for solving classification problems.

It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents the outcome.
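A minimal sketch of a decision tree classifier with scikit-learn, using the Iris dataset for illustration:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Internal nodes test features, branches encode decision rules, leaves hold the outcome
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)

print(clf.score(X_test, y_test))   # accuracy on held-out data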
Thank you!
