20242 Semester
Train test split
Splits the data into training and test sets
• build the model using the training data
• test the model using the test data
A model can memorise the training data, so a split is needed to check that it generalises to new, unseen data
By default, the train_test_split function splits the data into 75% (training set) and 25% (test set)
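The 75/25 split can be sketched in pure Python (a toy version of what scikit-learn's `train_test_split` does; the function name and seed here are illustrative):

```python
import random

def train_test_split(data, test_size=0.25, seed=0):
    """Shuffle the data and split it into train and test portions.
    Pure-Python sketch of scikit-learn's train_test_split."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]  # train, test

train, test = train_test_split(list(range(100)))
print(len(train), len(test))  # 75 25
```

Shuffling before splitting matters: if the data is ordered by class, an unshuffled split would put whole classes into only one of the sets.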
Evaluating a model
Accuracy is the ratio of correctly predicted instances to the total number of instances
• often in percentage
• not reliable on imbalanced datasets → only use on balanced datasets
Comparing with a baseline is useful → used for evaluating the performance of complex models
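A small illustration of why accuracy alone misleads on imbalanced data, and why a baseline comparison helps (labels and the "model output" are made up for the example):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Imbalanced labels: 9 negatives, 1 positive
y_true     = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
model_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # hypothetical model output
# Baseline: always predict the majority class
baseline   = [max(set(y_true), key=y_true.count)] * len(y_true)

print(accuracy(y_true, model_pred))  # 0.9 — looks good...
print(accuracy(y_true, baseline))    # 0.9 — but no better than the baseline
```

The model scores 90% accuracy yet never finds the positive class; only the baseline comparison reveals that.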
Supervised vs un-supervised
Supervised
• you know what to predict
• labelled data with a target variable
• train test split
• goal: the model should predict well on new, unseen data
Unsupervised
• discover unknown patterns
• unlabelled data without a target variable
• No train test split
• clustering
SMOTE
Used especially in imbalanced datasets
Oversampling technique for balancing the class distribution in a dataset
Creates synthetic samples for the minority class to balance the distribution
→ improves model performance, reduces bias, and enhances generalisation
→ carries a risk of overfitting, and the synthetic samples may not introduce sufficient variability
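The core idea — interpolating between existing minority samples — can be sketched in a few lines (a simplification: real SMOTE, e.g. in the imbalanced-learn library, interpolates toward one of the k nearest neighbours rather than a random minority point):

```python
import random

def smote_like(minority, n_new, seed=0):
    """Create n_new synthetic minority samples by interpolating
    between pairs of existing minority samples (SMOTE-style sketch)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)  # two distinct minority points
        t = rng.random()                # interpolation factor in [0, 1)
        synthetic.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic

minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2]]
new = smote_like(minority, n_new=4)
print(len(new))  # 4
```

Because every synthetic point lies on a line segment between two real points, the samples stay inside the minority region — which is also why they may add little new variability.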
Classification
Predicts categorical labels
Regression
Predicts continuous values
Generalisability
When the model applies to data that was not used to build the model
Regularization
Balancing a model's fit to the data against the model's complexity → guards against overfitting
Accuracy, precision & recall
Accuracy is the ratio of correctly predicted instances to the total number of instances
• good measure when classes are well balanced, meaning the number of instances per class is roughly the same
• can be misleading when data is imbalanced
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision is the ratio of correctly predicted positives to the total predicted positives
• how many instances predicted positive were actually positive?
• important when the cost of false positives is high → does not account for false negatives
Precision = TP / (TP + FP)
Recall is the ratio of correctly predicted positives to the actual positives
• how many positive instances were correctly identified?
• important when the cost of false negatives is high → does not account for false positives
Recall = TP / (TP + FN)
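The three metrics computed directly from the TP/FP/TN/FN counts (labels here are invented for the example; scikit-learn's `metrics` module offers the same calculations):

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, tn, fn

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)

accuracy  = (tp + tn) / (tp + tn + fp + fn)  # 0.75
precision = tp / (tp + fp)                   # 2/3
recall    = tp / (tp + fn)                   # 2/3
print(accuracy, precision, recall)
```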
GridSearchCV
Hyperparameter tuning technique that searches through all combinations of the given parameter values to find the best set for the model
Evaluates performance for each combination using cross-validation
• cross-validation is a technique for evaluating a model
• splits the dataset into training and validation sets multiple times → ensures that the model's performance is assessed reliably across different subsets of the data
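A minimal sketch of the grid search + cross-validation loop, using a made-up one-parameter threshold classifier so no library is needed (scikit-learn's GridSearchCV does the same loop, plus fitting real models):

```python
from itertools import product

def kfold_indices(n, k=3):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    fold = n // k
    for i in range(k):
        val = list(range(i * fold, (i + 1) * fold if i < k - 1 else n))
        train = [j for j in range(n) if j not in val]
        yield train, val

def grid_search(X, y, grid, k=3):
    """Try every parameter combination, score each with k-fold CV,
    keep the best (sketch of what GridSearchCV does)."""
    best_params, best_score = None, -1.0
    for params in (dict(zip(grid, vals)) for vals in product(*grid.values())):
        scores = []
        for tr, va in kfold_indices(len(X), k):
            # our toy "model": predict 1 when the value >= threshold
            preds = [int(X[i] >= params["threshold"]) for i in va]
            hits = sum(preds[j] == y[va[j]] for j in range(len(va)))
            scores.append(hits / len(va))
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best_params, best_score = params, mean
    return best_params, best_score

X = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 0.15, 0.85, 0.25, 0.75]
y = [0,   0,   0,   0,   1,   1,   1,   1,   0,    1,    0,    1]
params, score = grid_search(X, y, {"threshold": [0.2, 0.5, 0.8]})
print(params, score)  # {'threshold': 0.5} 1.0
```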
KNN
Measures similarity among instances (e.g. customers)
It is an instance-based learning algorithm (supervised)
Predicts an outcome for a test instance by finding the K most similar instances in the training data and
aggregating the observed outcomes
Simple model and effective with sufficient training data
k = 5 by default (in scikit-learn)
• k = 1 → complex model → risk of overfitting
• k = N → simple model → risk of underfitting
Tuning
• K = number of neighbours
• distance weighting
Hamming distance is a metric for comparing two binary strings → the number of bit positions in which the two strings differ
• compares data points one to one to see where they are similar or dissimilar
• the result is the number of attributes that were different
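KNN with Hamming distance in a few lines (the training data and labels are invented; a real kNN would typically also support distance weighting):

```python
from collections import Counter

def hamming(a, b):
    """Number of positions where two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(X_train, y_train, query, k=5):
    """Predict the majority label among the k nearest training instances."""
    order = sorted(range(len(X_train)), key=lambda i: hamming(X_train[i], query))
    top_labels = [y_train[i] for i in order[:k]]
    return Counter(top_labels).most_common(1)[0][0]

print(hamming("10111", "10010"))  # 2 — differs in 2 bit positions

# Binary attribute vectors with made-up labels
X_train = [[1, 0, 1, 1], [1, 1, 1, 1], [0, 0, 0, 1], [0, 0, 0, 0], [1, 0, 0, 0]]
y_train = ["buys", "buys", "skips", "skips", "skips"]
print(knn_predict(X_train, y_train, [1, 0, 1, 0], k=3))  # buys
```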
Logistic regression
Outputs a categorical value → for classification tasks
Supervised algorithm which can be used to classify data into categories or classes, by predicting the probability that an instance falls into a particular class based on its attributes
• smaller C → stronger regularization
Support vector classifier (SVC) is a linear model that outputs a categorical value by finding an optimal line
• classifies instances by finding the optimal line or hyperplane that separates the classes in a feature space
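How logistic regression turns a linear score into a class probability, with made-up (not trained) weights:

```python
import math

def sigmoid(z):
    """Map a linear score to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

def predict_proba(weights, bias, x):
    """P(class = 1 | x) for logistic regression with given
    (hypothetical, already-trained) weights."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)

p = predict_proba([0.8, -0.5], bias=0.1, x=[2.0, 1.0])
print(round(p, 3), "→ class", int(p >= 0.5))
```

The linear score z is the same kind of weighted sum an SVC uses; logistic regression just squashes it through the sigmoid to get a probability.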
Decision tree
Used for both classification and regression
Builds a hierarchy of if/else questions leading to a decision
Controlling complexity
• built until leaves are pure (impurity close to 0) → tree will be 100% accurate on the training data
Prevent overfitting
• limit depth, limit max number of leaves, require a minimum number of points in a node to keep splitting
Each node (rectangular box) holds a condition based on a feature
• leaf nodes represent the final classification
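The if/else hierarchy written out literally for a tiny hand-made tree (the features and thresholds are invented for illustration, not learned from data):

```python
def predict(petal_length, petal_width):
    """A tiny hand-written decision tree: each if/else is a node's
    question on a feature, each return is a leaf's final class."""
    if petal_length <= 2.5:        # root node
        return "setosa"            # leaf
    else:
        if petal_width <= 1.7:     # internal node
            return "versicolor"    # leaf
        else:
            return "virginica"     # leaf

print(predict(1.4, 0.2))  # setosa
print(predict(4.5, 1.3))  # versicolor
```

Tree depth here is 2; limiting the depth literally means limiting how deeply these if/else blocks may nest.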
Random forest
Builds many decision trees, where each tree differs in random ways → combines their predictions instead of relying on a single tree
It is built on the training data
It improves prediction accuracy by combining predictions from multiple trees
Reduces overfitting
Randomly selects n items from the original dataset with repetition allowed (bootstrap sampling) → each tree gets a sample of the same size, but randomly different
max_features is a parameter that selects a random subset of features of that size for each split → a high max_features increases the chance of overfitting
Classification: each tree makes "soft predictions" (class probabilities) → the class with the highest average probability is output by the forest
Regression: each tree makes its own prediction → the forest outputs the average of these predictions
Tuning
• number of trees (n_estimators), max_depth, min_samples_split, min_samples_leaf, max_features, bootstrap, criterion
• by grid search or cross-validation
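The two forest mechanics above — bootstrap sampling and averaging — in miniature (the numbers are arbitrary; tree fitting itself is omitted):

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items WITH replacement:
    same size as the original, but randomly different."""
    return [rng.choice(data) for _ in data]

def forest_predict(tree_predictions):
    """Regression: the forest outputs the average of the trees."""
    return sum(tree_predictions) / len(tree_predictions)

rng = random.Random(42)
data = [1, 2, 3, 4, 5]
print(bootstrap_sample(data, rng))       # e.g. [1, 1, 3, 2, 2] — repetition allowed
print(forest_predict([3.1, 2.9, 3.4]))   # average of three hypothetical tree outputs
```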
Gradient boosting
Powerful ensemble learning technique used for both regression and classification. Builds a series of weak learners (decision trees) sequentially, where each new learner improves on the errors of the previous one
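The sequential "fit the errors of the previous round" loop, stripped to its core: here each "weak learner" is just the mean of the current residuals (a real implementation fits a small decision tree to the residuals instead):

```python
def boost_fit(y, n_rounds=50, learning_rate=0.3):
    """Gradient boosting for regression, reduced to its core loop."""
    prediction = [0.0] * len(y)
    for _ in range(n_rounds):
        # residuals = what the current ensemble still gets wrong
        residuals = [yi - pi for yi, pi in zip(y, prediction)]
        weak = sum(residuals) / len(residuals)  # "fit" the weak learner
        # add a damped copy of the weak learner to the ensemble
        prediction = [p + learning_rate * weak for p in prediction]
    return prediction

print(boost_fit([1.0, 2.0, 3.0]))  # each prediction approaches the mean 2.0
```

The learning rate damps each step so no single weak learner dominates; this is the "shrinkage" knob that trades training speed against overfitting.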
Neural network (ANN) & multilayer perception (MLP)
ANN is a computational model inspired by the way biological neural networks in the human brain process information
MLP is a type of ANN that consists of multiple layers of neurons
• Input layer (receive input features from the dataset)
• hidden layer(s) (process the values received from the previous layer)
• output layer (produces the final output of the network)
Tuning MLP
• number of hidden layers
• number of units in each layer.
• regularization
• scaling of the input → which is important
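The three-layer structure as a forward pass (all weights here are arbitrary numbers for illustration, not trained values):

```python
def relu(v):
    """Standard MLP activation: negatives become 0."""
    return [max(0.0, x) for x in v]

def dense(inputs, weights, biases):
    """One fully connected layer: each unit is a weighted sum + bias."""
    return [b + sum(w, ) if False else b + sum(w * x for w, x in zip(ws, inputs))
            for ws, b in zip(weights, biases)]

def mlp_forward(x):
    """Forward pass through a tiny MLP: input(2) → hidden(3, ReLU) → output(1)."""
    hidden = relu(dense(x, [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]))
    output = dense(hidden, [[1.0, -1.0, 0.5]], [0.2])
    return output[0]

print(mlp_forward([1.0, 2.0]))  # ≈ -0.1
```

Tuning "number of hidden layers" and "units per layer" literally means changing how many `dense` calls there are and how many weight rows each one has.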
Scaling
Adjusting the range and distribution of numerical features in the dataset → ensures all features contribute equally to the model
Important for SVM and neural networks
MinMaxScaler ensures all features are between 0 and 1
Applied before supervised ML
Train + test should be scaled the same way (fit the scaler on the training data only)
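A sketch of MinMax scaling with the fit/transform split done correctly — the min and max come from the training data only, and the same parameters are reused on the test data (the numbers are invented):

```python
def minmax_fit(X_train):
    """Learn per-feature min and max from the TRAINING data only."""
    cols = list(zip(*X_train))
    return [min(c) for c in cols], [max(c) for c in cols]

def minmax_transform(X, mins, maxs):
    """Scale each feature to [0, 1] using the training min/max —
    train and test must be scaled with the SAME parameters."""
    return [[(x - lo) / (hi - lo) for x, lo, hi in zip(row, mins, maxs)]
            for row in X]

X_train = [[1.0, 100.0], [3.0, 300.0], [5.0, 500.0]]
X_test  = [[2.0, 400.0]]
mins, maxs = minmax_fit(X_train)
print(minmax_transform(X_train, mins, maxs))  # all values in [0, 1]
print(minmax_transform(X_test, mins, maxs))   # [[0.25, 0.75]]
```

Note that without scaling, the second feature (hundreds) would dominate any distance-based model over the first (single digits) — which is why SVMs, kNN and neural networks need it.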
Dummy variables
Also called one-hot encoding
If features F has three values a, b & c → creates three new features Fa, Fb, Fc
A powerful tool in statistical modelling for incorporating categorical data
Makes new columns
Dummy variables for words
• one feature for each word
• value is 1 if the word occurs in the text, otherwise 0 → makes a new column for each word
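The F → Fa, Fb, Fc expansion above, sketched directly (column naming is illustrative; pandas' `get_dummies` produces the same layout):

```python
def one_hot(values):
    """Turn one categorical column into one 0/1 column per category."""
    categories = sorted(set(values))
    return {f"F_{c}": [int(v == c) for v in values] for c in categories}

print(one_hot(["a", "b", "c", "a"]))
# {'F_a': [1, 0, 0, 1], 'F_b': [0, 1, 0, 0], 'F_c': [0, 0, 1, 0]}
```

For text, the "categories" are the words of the vocabulary, so each distinct word gets its own 0/1 column per document.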
Dummy classifier
Simple baseline model used to evaluate the performance of more complex models
Primary purpose is to provide a benchmark against which the performance of more advanced models
can be compared
Confusion matrix
Powerful tool for understanding the performance of a classification model.
Provides a detailed breakdown of correct and incorrect predictions
• true positives → number of instances that are correctly predicted positive
• false positives → number of instances that are incorrectly predicted positive
• true negatives → number of instances that are correctly predicted negative
• false negatives → number of instances that are incorrectly predicted negative