20242 Semester
Train test split
Splits the data into training and test sets
• build the model using the training data
• test the model using the test data
A model can memorise the training data, so a split is needed to check that it generalises to new, unseen data
By default, the train_test_split function splits the data into 75% (training set) and 25% (test set)
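The 75/25 split can be sketched in pure Python (a toy version of what scikit-learn's `train_test_split` does; the function name and seed here are illustrative):

```python
import random

def train_test_split(data, test_size=0.25, seed=0):
    """Shuffle the data and split it into train and test portions.
    Pure-Python sketch of scikit-learn's train_test_split."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]  # train, test

train, test = train_test_split(list(range(100)))
print(len(train), len(test))  # 75 25
```

Shuffling before splitting matters: if the data is ordered by class, an unshuffled split would put whole classes into only one of the sets.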
Evaluating a model
Accuracy is the ratio of correctly predicted instances to the total number of instances
• often in percentage
• not reliable on imbalanced datasets → only use on balanced datasets
Comparing with a baseline is useful → used for evaluating the performance of complex models
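A small illustration of why accuracy alone misleads on imbalanced data, and why a baseline comparison helps (labels and the "model output" are made up for the example):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Imbalanced labels: 9 negatives, 1 positive
y_true     = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
model_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # hypothetical model output
# Baseline: always predict the majority class
baseline   = [max(set(y_true), key=y_true.count)] * len(y_true)

print(accuracy(y_true, model_pred))  # 0.9 — looks good...
print(accuracy(y_true, baseline))    # 0.9 — but no better than the baseline
```

The model scores 90% accuracy yet never finds the positive class; only the baseline comparison reveals that.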
Supervised vs un-supervised
Supervised
• you know what to predict
• labelled data with a target variable
• train test split
• goal: the model should predict well on new, unseen data
Unsupervised
• discover unknown patterns
• unlabelled data without a target variable
• No train test split
• clustering
SMOTE
Used especially in imbalanced datasets
Oversampling technique for balancing the class distribution in a dataset
Creates synthetic samples for the minority class to balance the distribution
→ improves model performance, reduces bias, and enhances generalisation
→ carries a risk of overfitting, and the synthetic samples may not introduce sufficient variability
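The core idea — interpolating between existing minority samples — can be sketched in a few lines (a simplification: real SMOTE, e.g. in the imbalanced-learn library, interpolates toward one of the k nearest neighbours rather than a random minority point):

```python
import random

def smote_like(minority, n_new, seed=0):
    """Create n_new synthetic minority samples by interpolating
    between pairs of existing minority samples (SMOTE-style sketch)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)  # two distinct minority points
        t = rng.random()                # interpolation factor in [0, 1)
        synthetic.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic

minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2]]
new = smote_like(minority, n_new=4)
print(len(new))  # 4
```

Because every synthetic point lies on a line segment between two real points, the samples stay inside the minority region — which is also why they may add little new variability.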
Classification
Predicts categorical labels
Regression
Predicts continuous values
Generalisability
When the model applies to data that was not used to build the model
Regularization
Balancing a model's fit to the data against the model's complexity → guards against overfitting
Accuracy, precision & recall
Accuracy is the ratio of correctly predicted instances to the total number of instances
• good measure when classes are well balanced, meaning the number of instances per class is roughly the same
• can be misleading when data is imbalanced
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision is the ratio of correctly predicted positives to the total predicted positives
• how many instances predicted positive were actually positive?
• important when the cost of false positives is high → does not account for false negatives
Precision = TP / (TP + FP)
Recall is the ratio of correctly predicted positives to the actual positives
• how many positive instances were correctly identified?
• important when the cost of false negatives is high → does not account for false positives
Recall = TP / (TP + FN)
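The three metrics computed directly from the TP/FP/TN/FN counts (labels here are invented for the example; scikit-learn's `metrics` module offers the same calculations):

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, tn, fn

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)

accuracy  = (tp + tn) / (tp + tn + fp + fn)  # 0.75
precision = tp / (tp + fp)                   # 2/3
recall    = tp / (tp + fn)                   # 2/3
print(accuracy, precision, recall)
```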
GridSearchCV
Hyperparameter tuning technique that searches through all combinations of the given parameter values to find the best set for the model
Evaluates performance for each combination using cross-validation
• cross-validation is a technique for evaluating a model
• splits the dataset into training and validation sets multiple times → ensures that the model's performance is assessed reliably across different subsets of the data
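A minimal sketch of the grid search + cross-validation loop, using a made-up one-parameter threshold classifier so no library is needed (scikit-learn's GridSearchCV does the same loop, plus fitting real models):

```python
from itertools import product

def kfold_indices(n, k=3):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    fold = n // k
    for i in range(k):
        val = list(range(i * fold, (i + 1) * fold if i < k - 1 else n))
        train = [j for j in range(n) if j not in val]
        yield train, val

def grid_search(X, y, grid, k=3):
    """Try every parameter combination, score each with k-fold CV,
    keep the best (sketch of what GridSearchCV does)."""
    best_params, best_score = None, -1.0
    for params in (dict(zip(grid, vals)) for vals in product(*grid.values())):
        scores = []
        for tr, va in kfold_indices(len(X), k):
            # our toy "model": predict 1 when the value >= threshold
            preds = [int(X[i] >= params["threshold"]) for i in va]
            hits = sum(preds[j] == y[va[j]] for j in range(len(va)))
            scores.append(hits / len(va))
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best_params, best_score = params, mean
    return best_params, best_score

X = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 0.15, 0.85, 0.25, 0.75]
y = [0,   0,   0,   0,   1,   1,   1,   1,   0,    1,    0,    1]
params, score = grid_search(X, y, {"threshold": [0.2, 0.5, 0.8]})
print(params, score)  # {'threshold': 0.5} 1.0
```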
KNN
Measures similarity among instances (e.g. customers)
It is an instance-based learning algorithm (supervised)
Predicts an outcome for a test instance by finding the K most similar instances in the training data and
aggregating the observed outcomes
Simple model and effective with sufficient training data
k = 5 by default (in scikit-learn)
• k = 1 → complex model → risk of overfitting
• k = N → simple model → risk of underfitting
Tuning
• K = number of neighbours
• distance weighting
Hamming distance is a metric for comparing two binary strings → the number of bit positions in which the two strings differ
• compares data points one to one to see where they are similar or dissimilar
• the result is the number of attributes that were different
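KNN with Hamming distance in a few lines (the training data and labels are invented; a real kNN would typically also support distance weighting):

```python
from collections import Counter

def hamming(a, b):
    """Number of positions where two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(X_train, y_train, query, k=5):
    """Predict the majority label among the k nearest training instances."""
    order = sorted(range(len(X_train)), key=lambda i: hamming(X_train[i], query))
    top_labels = [y_train[i] for i in order[:k]]
    return Counter(top_labels).most_common(1)[0][0]

print(hamming("10111", "10010"))  # 2 — differs in 2 bit positions

# Binary attribute vectors with made-up labels
X_train = [[1, 0, 1, 1], [1, 1, 1, 1], [0, 0, 0, 1], [0, 0, 0, 0], [1, 0, 0, 0]]
y_train = ["buys", "buys", "skips", "skips", "skips"]
print(knn_predict(X_train, y_train, [1, 0, 1, 0], k=3))  # buys
```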
Logistic regression
Outputs a categorical value → for classification tasks
Supervised algorithm which can be used to classify data into categories or classes, by predicting the probability that an instance falls into a particular class based on its attributes
• smaller C → stronger regularization
Support vector classifier (SVC) is a linear model that outputs a categorical value by finding an optimal line
• classifies instances by finding the optimal line or hyperplane that separates the classes in a feature space
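How logistic regression turns a linear score into a class probability, with made-up (not trained) weights:

```python
import math

def sigmoid(z):
    """Map a linear score to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

def predict_proba(weights, bias, x):
    """P(class = 1 | x) for logistic regression with given
    (hypothetical, already-trained) weights."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)

p = predict_proba([0.8, -0.5], bias=0.1, x=[2.0, 1.0])
print(round(p, 3), "→ class", int(p >= 0.5))
```

The linear score z is the same kind of weighted sum an SVC uses; logistic regression just squashes it through the sigmoid to get a probability.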
Decision tree
Used for both classification and regression
Builds a hierarchy of if/else questions leading to a decision
Controlling complexity
• built until leaves are pure (impurity close to 0) → tree will be 100% accurate on the training data
Prevent overfitting
• limit depth, limit max number of leaves, require a minimum number of points in a node to keep splitting
Each node (rectangular box) holds a condition based on a feature
• leaf nodes represent the final classification
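The if/else hierarchy written out literally for a tiny hand-made tree (the features and thresholds are invented for illustration, not learned from data):

```python
def predict(petal_length, petal_width):
    """A tiny hand-written decision tree: each if/else is a node's
    question on a feature, each return is a leaf's final class."""
    if petal_length <= 2.5:        # root node
        return "setosa"            # leaf
    else:
        if petal_width <= 1.7:     # internal node
            return "versicolor"    # leaf
        else:
            return "virginica"     # leaf

print(predict(1.4, 0.2))  # setosa
print(predict(4.5, 1.3))  # versicolor
```

Tree depth here is 2; limiting the depth literally means limiting how deeply these if/else blocks may nest.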
Random forest
Builds many decision trees, where each tree differs in random ways → combines their predictions instead of relying on a single tree
It is built on the training data
It improves prediction accuracy by combining predictions from multiple trees
Reduces overfitting
Randomly selects n items from the original dataset with repetition allowed (bootstrap sampling) → each tree gets a sample of the same size, but randomly different
max_features is a parameter that selects a random subset of features of that size for each split → a high max_features increases the chance of overfitting
Classification: each tree makes "soft predictions" (class probabilities) → the class with the highest average probability is output by the forest
Regression: each tree makes its own prediction → the forest outputs the average of these predictions
Tuning
• number of trees (n_estimators), max_depth, min_samples_split, min_samples_leaf, max_features, bootstrap, criterion
• by grid search or cross-validation
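The two forest mechanics above — bootstrap sampling and averaging — in miniature (the numbers are arbitrary; tree fitting itself is omitted):

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items WITH replacement:
    same size as the original, but randomly different."""
    return [rng.choice(data) for _ in data]

def forest_predict(tree_predictions):
    """Regression: the forest outputs the average of the trees."""
    return sum(tree_predictions) / len(tree_predictions)

rng = random.Random(42)
data = [1, 2, 3, 4, 5]
print(bootstrap_sample(data, rng))       # e.g. [1, 1, 3, 2, 2] — repetition allowed
print(forest_predict([3.1, 2.9, 3.4]))   # average of three hypothetical tree outputs
```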
Gradient boosting
Powerful ensemble learning technique used for both regression and classification. Builds a series of weak learners (decision trees) sequentially, where each new learner improves on the errors of the previous one
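The sequential "fit the errors of the previous round" loop, stripped to its core: here each "weak learner" is just the mean of the current residuals (a real implementation fits a small decision tree to the residuals instead):

```python
def boost_fit(y, n_rounds=50, learning_rate=0.3):
    """Gradient boosting for regression, reduced to its core loop."""
    prediction = [0.0] * len(y)
    for _ in range(n_rounds):
        # residuals = what the current ensemble still gets wrong
        residuals = [yi - pi for yi, pi in zip(y, prediction)]
        weak = sum(residuals) / len(residuals)  # "fit" the weak learner
        # add a damped copy of the weak learner to the ensemble
        prediction = [p + learning_rate * weak for p in prediction]
    return prediction

print(boost_fit([1.0, 2.0, 3.0]))  # each prediction approaches the mean 2.0
```

The learning rate damps each step so no single weak learner dominates; this is the "shrinkage" knob that trades training speed against overfitting.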
Neural network (ANN) & multilayer perception (MLP)
ANN is a computational model inspired by the way biological neural networks in the human brain process information
MLP is a type of ANN that consists of multiple layers of neurons
• Input layer (receive input features from the dataset)
• hidden layer(s) (process the values received from the previous layer)
• output layer (produces the final output of the network)
Tuning MLP
• number of hidden layers
• number of units in each layer.
• regularization
• scaling of the input → which is important
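The three-layer structure as a forward pass (all weights here are arbitrary numbers for illustration, not trained values):

```python
def relu(v):
    """Standard MLP activation: negatives become 0."""
    return [max(0.0, x) for x in v]

def dense(inputs, weights, biases):
    """One fully connected layer: each unit is a weighted sum + bias."""
    return [b + sum(w, ) if False else b + sum(w * x for w, x in zip(ws, inputs))
            for ws, b in zip(weights, biases)]

def mlp_forward(x):
    """Forward pass through a tiny MLP: input(2) → hidden(3, ReLU) → output(1)."""
    hidden = relu(dense(x, [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]))
    output = dense(hidden, [[1.0, -1.0, 0.5]], [0.2])
    return output[0]

print(mlp_forward([1.0, 2.0]))  # ≈ -0.1
```

Tuning "number of hidden layers" and "units per layer" literally means changing how many `dense` calls there are and how many weight rows each one has.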
Scaling
Adjusting the range and distribution of numerical features in the dataset → ensures all features contribute equally to the model
Important for SVM and neural networks
MinMaxScaler ensures all features are between 0 and 1
Applied before supervised ML
Train + test should be scaled the same way (fit the scaler on the training data only)
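A sketch of MinMax scaling with the fit/transform split done correctly — the min and max come from the training data only, and the same parameters are reused on the test data (the numbers are invented):

```python
def minmax_fit(X_train):
    """Learn per-feature min and max from the TRAINING data only."""
    cols = list(zip(*X_train))
    return [min(c) for c in cols], [max(c) for c in cols]

def minmax_transform(X, mins, maxs):
    """Scale each feature to [0, 1] using the training min/max —
    train and test must be scaled with the SAME parameters."""
    return [[(x - lo) / (hi - lo) for x, lo, hi in zip(row, mins, maxs)]
            for row in X]

X_train = [[1.0, 100.0], [3.0, 300.0], [5.0, 500.0]]
X_test  = [[2.0, 400.0]]
mins, maxs = minmax_fit(X_train)
print(minmax_transform(X_train, mins, maxs))  # all values in [0, 1]
print(minmax_transform(X_test, mins, maxs))   # [[0.25, 0.75]]
```

Note that without scaling, the second feature (hundreds) would dominate any distance-based model over the first (single digits) — which is why SVMs, kNN and neural networks need it.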
Dummy variables
Also called one-hot encoding
If features F has three values a, b & c → creates three new features Fa, Fb, Fc
A powerful tool in statistical modelling for incorporating categorical data
Makes new columns
Dummy variables for words
• one feature for each word
• value is 1 if the word occurs in the text, otherwise 0 → makes a new column for each word
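The F → Fa, Fb, Fc expansion above, sketched directly (column naming is illustrative; pandas' `get_dummies` produces the same layout):

```python
def one_hot(values):
    """Turn one categorical column into one 0/1 column per category."""
    categories = sorted(set(values))
    return {f"F_{c}": [int(v == c) for v in values] for c in categories}

print(one_hot(["a", "b", "c", "a"]))
# {'F_a': [1, 0, 0, 1], 'F_b': [0, 1, 0, 0], 'F_c': [0, 0, 1, 0]}
```

For text, the "categories" are the words of the vocabulary, so each distinct word gets its own 0/1 column per document.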
Dummy classifier
Simple baseline model used to evaluate the performance of more complex models
Primary purpose is to provide a benchmark against which the performance of more advanced models
can be compared
Confusion matrix
Powerful tool for understanding the performance of a classification model.
Provides a detailed breakdown of correct and incorrect predictions
• true positives → number of instances that are correctly predicted positive
• false positives → number of instances that are incorrectly predicted positive
• true negatives → number of instances that are correctly predicted negative
• false negatives → number of instances that are incorrectly predicted negative