Unit 4 Part 2
Step-02: Compute the mean vector (µ): for each variable, the sum of its values divided by the total number of observations.
Step-03: Standardize the Dataset - Subtract the mean from the given data.
The aim is to standardize the range of the continuous initial variables so that each of them contributes equally to the analysis. If there are large differences between the ranges of the initial variables, those with larger ranges will dominate those with smaller ranges.
Step-04: Calculate the covariance matrix.
The aim is to understand how the variables of the input data set vary from the mean with respect to each other, or in other words, to see whether there is any relationship between them. Sometimes variables are highly correlated in such a way that they contain redundant information, so to identify these correlations we compute the covariance matrix. The covariance matrix is simply a table that summarizes the correlations between all possible pairs of variables.
Step-05: Identify the Principal Components - Calculate the eigenvectors and eigenvalues of the covariance matrix.
Principal components are new variables that are constructed as linear combinations, or mixtures, of the initial variables. These combinations are made in such a way that the new variables (i.e., the principal components) are uncorrelated and most of the information within the initial variables is squeezed, or compressed, into the first components. So the idea is that 10-dimensional data gives you 10 principal components, but PCA tries to put the maximum possible information in the first component, then the maximum remaining information in the second, and so on.
Organizing information in principal components this way allows you to reduce dimensionality without losing much information, by discarding the components with low information and treating the remaining components as your new variables.
Step-07: Derive the New Data Set / Recast the Data along the Principal Component Axes
The aim is to use the feature vector formed from the selected eigenvectors of the covariance matrix to reorient the data from the original axes to the axes represented by the principal components (hence the name Principal Component Analysis). This is done by multiplying the transpose of the feature vector (the matrix whose columns are the chosen eigenvectors) by the transpose of the mean-adjusted data set. Equivalently, each pattern is transformed onto the eigenvectors as:
Transformed Pattern = (Eigenvector Matrix)ᵀ × (Pattern − Mean Vector)
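A minimal NumPy sketch of Steps 02 through 07, assuming the data is arranged with one observation per row and two components are kept; the array X and the other names are illustrative, not part of the notes:

```python
import numpy as np

# Toy data: 5 observations of 3 variables (rows = observations); values are illustrative.
X = np.array([[2.5, 2.4, 1.2],
              [0.5, 0.7, 0.3],
              [2.2, 2.9, 1.0],
              [1.9, 2.2, 0.9],
              [3.1, 3.0, 1.4]])

mu = X.mean(axis=0)                       # Step-02: mean vector
X_centred = X - mu                        # Step-03: subtract the mean
cov = np.cov(X_centred, rowvar=False)     # Step-04: covariance matrix of the variables
eig_vals, eig_vecs = np.linalg.eigh(cov)  # Step-05: eigenvalues/eigenvectors

# Sort components by decreasing eigenvalue and keep the top k as the feature vector.
order = np.argsort(eig_vals)[::-1]
k = 2
feature_vector = eig_vecs[:, order[:k]]   # columns are the chosen eigenvectors

# Step-07: recast the data along the principal component axes,
# i.e. (feature_vector)^T x (X - mu)^T, written row-wise below.
transformed = X_centred @ feature_vector
print(transformed)                        # 5 observations expressed in 2 principal components
```

numpy.linalg.eigh is used because the covariance matrix is symmetric; it returns eigenvalues in ascending order, hence the re-sorting before selecting components.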
B. Model Fitting:
1. Fit regression models for each subset, such as simple linear regression or multiple regression.
2. Use appropriate techniques to fit the models (e.g., least squares method).
C. Model Evaluation:
1. Use criteria like R², adjusted R², or cross-validation error.
2. Compare models to select the best subset.
D. Validation:
1. Validate the selected subset on new, unseen data to ensure generalizability.
2. Avoid overfitting by using techniques like cross-validation.
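The steps above (B through D) can be sketched with scikit-learn: fit an ordinary least squares model on every candidate subset of predictors and score each by cross-validated R². The synthetic data and variable names below are purely illustrative:

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                        # 4 candidate predictors (synthetic)
y = 3 * X[:, 0] + 2 * X[:, 3] + rng.normal(size=100)

# B. Fit an ordinary least squares model for each subset of predictors,
# C. score each subset by 5-fold cross-validated R2, and keep the best one.
best_score, best_subset = -np.inf, None
for k in range(1, X.shape[1] + 1):
    for subset in combinations(range(X.shape[1]), k):
        score = cross_val_score(LinearRegression(), X[:, list(subset)], y,
                                cv=5, scoring="r2").mean()
        if score > best_score:
            best_score, best_subset = score, subset

# D. The chosen subset should still be validated on genuinely unseen data before use.
print(best_subset, round(best_score, 3))
```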
Numerical Example:
Dataset: Consider variables A, B, C, and D.
Association Probabilities:
1. P(A,B)=0.85 (strong association)
2. P(A,C)=0.40 (weak association)
3. P(A,D)=0.92 (strong association)
4. P(B,C)=0.65 (moderate association)
5. P(B,D)=0.90 (strong association)
6. P(C,D)=0.20 (very weak association)
Step 1: Compute Association Probabilities
Utilize appropriate statistical measures to calculate probabilities, such as correlation
coefficients or mutual information scores.
Step 2: Identify Strong Associations
Set a threshold (e.g., 0.70) for significant associations.
Strong associations: A−B, A−D, B−D.
Step 3: Select Relevant Variables
Variables A, B, and D are selected due to their strong associations.
Variable C is discarded due to its weak associations.
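A small sketch of Steps 1 through 3, assuming the pairwise association scores have already been computed and stored in a dictionary (here, the values from the example above):

```python
# Pairwise association scores from the example above.
assoc = {("A", "B"): 0.85, ("A", "C"): 0.40, ("A", "D"): 0.92,
         ("B", "C"): 0.65, ("B", "D"): 0.90, ("C", "D"): 0.20}

threshold = 0.70  # Step 2: cut-off for a "strong" association
strong_pairs = [pair for pair, p in assoc.items() if p >= threshold]

# Step 3: keep every variable that appears in at least one strong pair.
selected = sorted({v for pair in strong_pairs for v in pair})
print(strong_pairs)  # [('A', 'B'), ('A', 'D'), ('B', 'D')]
print(selected)      # ['A', 'B', 'D']
```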
Cross-Validation Techniques:
K-Fold Cross-Validation: Divides data into K folds and uses each fold as a testing set
while using the remaining K-1 folds as training data, repeating the process K times.
Stratified K-Fold Cross-Validation: Ensures that each fold is representative of all
strata of the data, i.e., each fold maintains the same class distribution as the original data.
Leave-One-Out Cross-Validation (LOOCV): Uses a single observation as the
validation set and the rest of the data as the training set.
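These three schemes correspond directly to scikit-learn splitters; a minimal sketch (the iris dataset and logistic-regression model are placeholders):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# K-Fold: K splits, each fold serves once as the test set.
kfold_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))

# Stratified K-Fold: each fold preserves the class distribution of the full data.
strat_scores = cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

# LOOCV: one observation per test set, repeated n times (expensive for large n).
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())

print(kfold_scores.mean(), strat_scores.mean(), loo_scores.mean())
```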
Model Selection:
Grid Search: Technique to find the best performing combination of hyperparameters
for a model by systematically testing a range of hyperparameters.
Random Search: Randomly selects combinations of hyperparameters for evaluation;
useful when computational resources are limited. A short sketch of both search strategies follows this list.
Model Ensembling: Combining predictions from multiple models to improve overall
performance.
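A hedged sketch of grid search and random search with scikit-learn; the SVC model and the parameter ranges are arbitrary examples:

```python
from scipy.stats import uniform
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Grid search: exhaustively tries every combination in the grid.
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=5)
grid.fit(X, y)

# Random search: samples a fixed number of combinations, cheaper when the grid is large.
rand = RandomizedSearchCV(SVC(), {"C": uniform(0.1, 10), "gamma": uniform(0.01, 1)},
                          n_iter=10, cv=5, random_state=0)
rand.fit(X, y)

print(grid.best_params_, rand.best_params_)
```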
Prediction Measures
Prediction measures refer to the metrics and techniques used to assess the accuracy and
reliability of predictions made by a model. These measures are essential for evaluating the
performance of predictive models and understanding how well they generalize to new, unseen
data. Here are some common prediction measures used in machine learning:
Mean Absolute Error (MAE):
MAE = (1/n) Σ |yᵢ − ŷᵢ|
Where:
n = the number of data points, yᵢ = the actual value, and ŷᵢ = the predicted value.
Interpretation:
A smaller MAE indicates a better fit of the model to the data. MAE is measured in the same
units as the data, making it easy to understand in the context of the problem.
Numerical Example:
Let's consider a simple dataset of actual and predicted values for house prices:
Actual Prices: [200,300,400,500,600] (in thousands)
Predicted Prices: [220,280,380,480,590] (in thousands)
In this example, the MAE is calculated as 18. It means, on average, the predictions differ from
the actual values by $18,000. This indicates that the predictive model's average error in
estimating house prices is $18,000.
MAE is a useful metric for evaluating the accuracy of regression models. It gives a clear
understanding of how well the predictions match the actual data points and is often used in
various real-world applications where understanding prediction accuracy is crucial.
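The MAE above can be reproduced in a few lines of NumPy using the house-price figures from the example (in thousands):

```python
import numpy as np

actual = np.array([200, 300, 400, 500, 600])     # actual prices, in thousands
predicted = np.array([220, 280, 380, 480, 590])  # predicted prices, in thousands

mae = np.abs(actual - predicted).mean()
print(mae)  # 18.0  -> an average error of $18,000
```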
Mean Squared Error (MSE):
MSE = (1/n) Σ (yᵢ − ŷᵢ)²
Where:
n = the number of data points, yᵢ = the actual value, and ŷᵢ = the predicted value.
Interpretation:
A smaller MSE indicates a better fit of the model to the data. MSE is measured in the square
of the units of the data, making it sensitive to outliers and large errors.
Numerical Example:
Let's consider a simple dataset of actual and predicted values for house prices:
Actual Prices: [200,300,400,500,600] (in thousands)
Predicted Prices: [220,280,380,480,590] (in thousands)
In this example, the MSE is calculated as 340. Since the prices are expressed in thousands of dollars, this means that, on average, the squared difference between the predicted and actual house prices is 340 (in thousands of dollars squared). This metric gives higher penalties to larger errors, making it suitable for applications where accurately predicting outliers is crucial.
MSE is a valuable metric for evaluating the accuracy of regression models, especially when
you want to emphasize larger errors in the predictions. However, MSE is sensitive to outliers,
so it's essential to consider the data characteristics and the problem context when choosing
evaluation metrics.
Root Mean Squared Error (RMSE):
RMSE = √((1/n) Σ (yᵢ − ŷᵢ)²)
Where:
n = the number of data points, yᵢ = the actual value, and ŷᵢ = the predicted value.
Numerical Example:
Using the same house-price data as above:
RMSE = √((1/5)(400 + 400 + 400 + 400 + 100))
RMSE = √340
RMSE ≈ 18.44
In this example, the RMSE is calculated as approximately 18.44. It means, on average, the
difference between the predicted house prices and the actual prices is approximately $18,440.
This metric provides a clear understanding of the average prediction error in the same unit as
the house prices.
RMSE is a valuable metric for evaluating the accuracy of regression models. It's particularly
useful when you want to understand the average magnitude of errors, especially in applications
where the prediction errors need to be interpretable in the same unit as the target variable.
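Both MSE and RMSE for the same house-price example can be checked with a short NumPy sketch:

```python
import numpy as np

actual = np.array([200, 300, 400, 500, 600])     # actual prices, in thousands
predicted = np.array([220, 280, 380, 480, 590])  # predicted prices, in thousands

mse = ((actual - predicted) ** 2).mean()
rmse = np.sqrt(mse)
print(mse, round(rmse, 2))  # 340.0 18.44
```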
Mean Absolute Percentage Error (MAPE):
MAPE = (1/n) Σ |(yᵢ − ŷᵢ) / yᵢ| × 100%
Where:
n = the number of data points, yᵢ = the actual value, and ŷᵢ = the predicted value.
MAPE ≈ 7.4%
In this example, the MAPE is calculated as approximately 7.4%. It means, on average, the
predictions differ from the actual sales figures by about 7.4% of the actual values. This metric
provides a clear understanding of the average percentage error in predictions.
MAPE is a useful metric for evaluating the accuracy of forecasting models, especially in
business and economics, where understanding prediction errors as percentages is essential. It
helps analysts and decision-makers assess the reliability of their forecasts in practical
applications.
Mean Squared Percentage Error (MSPE):
MSPE = (1/n) Σ ((yᵢ − ŷᵢ) / yᵢ × 100%)²
Where:
n = the number of data points, yᵢ = the actual value, and ŷᵢ = the predicted value.
Interpretation:
A lower MSPE indicates a better fit of the forecasting model to the data, as it represents smaller
squared percentage errors.
Numerical Example:
Let's consider a dataset of actual and predicted sales figures for a product over a period of time:
Actual Sales: [150,200,180,250,300] (in units)
Predicted Sales: [140,220,190,240,280] (in units)
Using the MSPE formula, we can calculate MSPE as follows:
MSPE = (1/5)((6.67%)² + (−10%)² + (−5.56%)² + (4%)² + (6.67%)²)
MSPE ≈ (1/5)(0.44% + 1% + 0.31% + 0.16% + 0.44%)
MSPE ≈ 2.35% / 5
MSPE ≈ 0.47%
In this example, the MSPE is calculated as approximately 0.47%. It means, on average, the
squared percentage difference between the predicted and actual sales figures is 0.47%. A
smaller MSPE indicates a better fit of the forecasting model.
MSPE is a useful metric for evaluating the accuracy of forecasting models, especially when
you want to understand the prediction errors in terms of percentages. It helps data scientists
and analysts assess the reliability of their forecasts in practical applications.
Root Mean Squared Percentage Error (RMSPE):
RMSPE = √((1/n) Σ ((yᵢ − ŷᵢ) / yᵢ × 100%)²)
Where:
n = the number of data points, yᵢ = the actual value, and ŷᵢ = the predicted value.
Numerical Example:
Let's consider a dataset of actual and predicted sales figures for a product over a period of
time:
Actual Sales: [150,200,180,250,300] (in units)
Predicted Sales: [140,220,190,240,280] (in units)
Using the RMSPE formula, we can calculate RMSPE as follows:
RMSPE = √((1/5)(((150−140)/150 × 100%)² + ((200−220)/200 × 100%)² + ((180−190)/180 × 100%)² + ((250−240)/250 × 100%)² + ((300−280)/300 × 100%)²))
RMSPE = √((1/5)((6.67%)² + (−10%)² + (−5.56%)² + (4%)² + (6.67%)²))
RMSPE ≈ √((1/5)(0.44% + 1% + 0.31% + 0.16% + 0.44%))
RMSPE ≈ √(2.35% / 5)
RMSPE ≈ √(0.47%) = √0.0047
RMSPE ≈ 0.0687 ≈ 6.87%
RMSPE is a valuable metric for evaluating the accuracy of time series forecasting models,
especially when you want to understand prediction errors in terms of percentages. It provides
a more interpretable measure of prediction accuracy, allowing data scientists and analysts to
assess the reliability of their forecasts.
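Running the same sales data through a short NumPy sketch confirms the MSPE and RMSPE values above (printed in decimal form, where 0.0047 corresponds to 0.47% and 0.0687 to 6.87%):

```python
import numpy as np

actual = np.array([150, 200, 180, 250, 300])
predicted = np.array([140, 220, 190, 240, 280])

pct_err = (actual - predicted) / actual   # fractional (percentage) errors
mspe = (pct_err ** 2).mean()
rmspe = np.sqrt(mspe)
print(round(mspe, 4), round(rmspe, 4))    # 0.0047 0.0687
```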
Prediction measures are crucial for understanding a model's accuracy and reliability. Evaluating them is a fundamental step in the machine learning workflow and aids in making data-driven decisions in various domains.
Avoiding Overtraining
Overtraining, also known as overfitting, occurs when a machine learning model performs
exceptionally well on the training data but fails to generalize to new, unseen data. This happens
because the model has essentially memorized the training data, including its noise and outliers,
instead of learning the underlying patterns. To ensure the model's effectiveness in real-world
scenarios, it's crucial to avoid overtraining. Here are some effective techniques:
1. Increase Training Data:
Providing more diverse and abundant training data can help the model generalize better.
With a larger dataset, the model is exposed to a wider range of patterns and variations
in the data.
2. Feature Selection:
Careful selection of relevant features can significantly impact the model's performance.
Irrelevant or redundant features can introduce noise and confuse the learning process.
Use techniques like feature importance analysis to identify and select the most
informative features.
3. Cross-Validation:
Instead of relying on a single train-test split, use techniques like k-fold cross-validation.
This divides the data into multiple folds and trains the model on different subsets,
ensuring that the model's performance is evaluated across various parts of the data.
4. Regularization:
Regularization techniques, such as L1 (Lasso) and L2 (Ridge) regularization, add
penalty terms to the model's loss function, discouraging overly complex models. These
penalties prevent the model from assigning too much importance to any particular
feature, thus mitigating overfitting.
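A minimal scikit-learn sketch of L2 (Ridge) and L1 (Lasso) regularization on synthetic data; the alpha values, which control the penalty strength, are arbitrary:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)  # only 2 of 10 features matter

ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks all coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)   # L1: drives irrelevant coefficients exactly to zero

print(np.round(ridge.coef_, 2))
print(np.round(lasso.coef_, 2))      # most coefficients become 0.0
```

Note how Lasso sets the coefficients of the irrelevant features to exactly zero, effectively performing feature selection as well.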
5. Early Stopping:
During the training process, monitor the model's performance on a validation dataset.
If the performance starts degrading on the validation set while improving on the training
set, stop the training early. This prevents the model from memorizing the noise in the
training data.
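One way to implement early stopping is a manual training loop that watches validation error; a hedged sketch using SGDRegressor and synthetic data:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = X @ rng.normal(size=20) + rng.normal(size=500)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = SGDRegressor(learning_rate="constant", eta0=0.01, random_state=0)
best_val, patience, wait = np.inf, 5, 0
for epoch in range(200):
    model.partial_fit(X_train, y_train)                  # one pass over the training data
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    if val_mse < best_val:
        best_val, wait = val_mse, 0                      # validation error still improving
    else:
        wait += 1
        if wait >= patience:                             # no improvement for `patience` epochs
            break                                        # stop before the model memorizes noise
```

Many estimators also expose this behaviour directly, for example the early_stopping option of SGDRegressor and MLPRegressor.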
6. Ensemble Methods:
Ensemble methods like Random Forests and Gradient Boosting combine predictions
from multiple models. These methods often lead to more robust and generalized
models, as they mitigate the biases and errors of individual models.
7. Neural Network Techniques:
In neural networks, techniques like dropout layers and batch normalization help in
preventing overfitting. Dropout layers randomly deactivate neurons during training,
while batch normalization normalizes input batches to stabilize learning.
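A minimal Keras sketch showing where dropout and batch normalization layers typically sit in a small network; the layer sizes, dropout rate, and input width are arbitrary:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),                  # 20 input features (arbitrary width)
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.BatchNormalization(),         # normalizes each batch to stabilize learning
    tf.keras.layers.Dropout(0.3),                 # randomly deactivates 30% of units during training
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.summary()
```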
8. Hyperparameter Tuning:
Carefully tune hyperparameters, such as learning rate and regularization strength, using
techniques like grid search or random search. Optimal hyperparameters ensure the
model's balance between complexity and generalization.
Avoiding overtraining is critical to ensuring that machine learning models generalize well to
new data. By employing a combination of techniques like increasing training data, feature
selection, cross-validation, regularization, early stopping, ensemble methods, and careful
hyperparameter tuning, data scientists can build models that are both accurate and robust in
real-world applications.