ML U2
The machine learning process involves several steps that help develop and deploy a
successful machine learning model.
Machine Learning Process:
1. Problem Definition
2. Data Collection
3. Data Preprocessing
4. Data Split
5. Model Selection
6. Model Training
7. Model Evaluation
8. Model Deployment
At a project level, the same workflow can also be described as:
1. Requirements gathering
2. Data preparation
3. Model development
4. Model testing
5. Model deployment
6. Model maintenance
Types of data:
Data can broadly be divided into two types: 1. Qualitative data 2. Quantitative data
Qualitative data provides information about the quality of an object or information which cannot be
measured. Qualitative data is also called categorical data. Eg: name or roll number
Qualitative data can be further subdivided into two types: 1. Nominal data 2. Ordinal data
Nominal data has no numeric value, only a named value.
Eg: 1. Blood group: A, B, O, AB, etc. 2. Nationality: Indian, American, British, etc. 3. Gender: Male,
Female, Other
Mathematical operations cannot be performed on nominal data, so statistical measures such as mean and variance cannot be applied to it (although the mode can still be identified).
Ordinal data, in addition to possessing the properties of nominal data, can also be naturally ordered.
This means they can be arranged in a sequence of increasing or decreasing value so that we can say
whether a value is better than or greater than another value.
Eg: 1. Customer satisfaction: ‘Very Happy’, ‘Happy’, ‘Unhappy’, etc. 2. Grades: A, B, C, etc. 3. Hardness of
Metal: ‘Very Hard’, ‘Hard’, ‘Soft’, etc. Basic counting is possible for ordinal data.
Quantitative data relates to information about the quantity of an object – hence it can be measured.
For example, if we consider the attribute ‘marks’, it can be measured using a scale of measurement.
Quantitative data can be further subdivided into two types:
1. Interval data
2. Ratio data
Interval data is numeric data for which not only the order is known, but the exact difference between values is also known; however, it has no true zero point (e.g. temperature in Celsius or Fahrenheit).
Ratio data is numeric data for which the exact value can be measured and an absolute zero is available, so ratios of values are meaningful (e.g. height, weight, age).
There are various types of data, classified based on their characteristics, format, and
usage.
Data Types:
1. Quantitative Data
2. Qualitative Data
3. Structured Data
4. Unstructured Data
5. Semi-Structured Data
Data Formats:
Data Sources:
Characteristics of data (the five Vs):
1. Volume
2. Velocity
3. Variety
4. Veracity
5. Value
Dimensions of data quality:
1. Accuracy
2. Completeness
3. Consistency
4. Timeliness
5. Relevance
2.1.2 Structure of the data
The data set that we take as a reference is the Auto MPG data set, available in the UCI Machine Learning Repository.
1. Numeric: the attributes 'mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model year', and 'origin' are all numeric.
2. Categorical: the attribute 'car name' is categorical (text-valued).
The two most effective plots for exploring numerical data are the box plot and the histogram.
To understand the central tendency of the data, we can apply the measures mean and median.
Exercise (Auto MPG data set): for each of the numeric attributes, find the mean, the median, and the deviation between these two values.
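A minimal pandas sketch of this exercise (illustrative only): it assumes the Auto MPG data has already been loaded into a DataFrame df, and the helper name central_tendency_report is not part of the original text.

import pandas as pd

# Assumes the Auto MPG data is already loaded into a DataFrame `df` whose numeric
# attributes include 'mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
# 'acceleration', 'model year' and 'origin'.
def central_tendency_report(df: pd.DataFrame) -> pd.DataFrame:
    numeric = df.select_dtypes(include="number")
    report = pd.DataFrame({"mean": numeric.mean(), "median": numeric.median()})
    # Deviation between the two measures of central tendency, per attribute.
    report["mean_median_deviation"] = (report["mean"] - report["median"]).abs()
    return report

Calling central_tendency_report(df) returns one row per numeric attribute; a large deviation between mean and median hints at skew or outliers in that attribute.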
Understanding data spread gives a more granular view of the data, in the form of: 1. Dispersion of data 2. Position of the different data values
1. Measuring data dispersion
Two attributes may have the same mean and median, yet the values of attribute 1 may be concentrated or clustered around the mean/median, whereas the values of attribute 2 may be quite spread out or dispersed.
To measure the extent of dispersion of data, i.e. how spread out its different values are, the variance of the data is measured.
A larger value of variance or standard deviation indicates more dispersion in the data, and vice versa. In the example above, attribute 1 would therefore have a smaller variance and standard deviation than attribute 2.
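A small illustrative sketch of this idea, using two invented attributes that share the same mean but differ in spread:

import numpy as np

# Two hypothetical attributes with the same mean (46) but different dispersion.
attribute_1 = np.array([44, 46, 48, 45, 47])   # clustered around the mean
attribute_2 = np.array([34, 46, 59, 39, 52])   # spread out

for name, values in [("attribute 1", attribute_1), ("attribute 2", attribute_2)]:
    # Variance = average squared deviation from the mean; std dev = its square root.
    print(f"{name}: mean={values.mean():.1f}, "
          f"variance={values.var():.2f}, std dev={values.std():.2f}")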
2. Measuring data value position
The median of a data set gives the central data value, which divides the entire data set into two halves.
If the first half of the data is divided into two halves so that each half consists of one-quarter of the data
set, then that median of the first half is known as first quartile or Q1.
If the second half of the data is divided into two halves, then that median of the second half is known as
third quartile or Q3. The overall median is also known as second quartile or Q2.
So any data set has a five-number summary: minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum.
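A small NumPy sketch computing the five-number summary (the sample values are invented):

import numpy as np

def five_number_summary(values):
    # Minimum, Q1, median (Q2), Q3 and maximum of a numeric attribute.
    values = np.asarray(values, dtype=float)
    q1, q2, q3 = np.percentile(values, [25, 50, 75])
    return {"min": values.min(), "Q1": q1, "median": q2, "Q3": q3, "max": values.max()}

print(five_number_summary([12, 15, 17, 19, 22, 24, 29, 31, 35]))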
Plotting and exploring numerical data
a) Box plot
A box plot is an extremely effective mechanism to get a one-shot view of the data and understand its nature. The box plot (also called a box and whisker plot) gives a standard visualization of the five-number summary statistics of a data set, namely minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum.
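An illustrative box-plot sketch with matplotlib; the synthetic values stand in for a numeric attribute (with the Auto MPG data you would pass, e.g., df['mpg'] instead):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
values = rng.normal(loc=23, scale=7, size=200)   # hypothetical numeric attribute

# The box spans Q1..Q3, the line inside is the median, whiskers extend to the most
# extreme values within 1.5 x IQR, and points beyond them are drawn as potential outliers.
plt.boxplot(values)
plt.title("Box plot of a numeric attribute")
plt.ylabel("value")
plt.show()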
b) Histogram
A histogram helps in understanding the distribution of numeric data over a series of intervals, also termed 'bins'.
The focus of a histogram is to plot ranges of data values (the bins); the number of data elements in each range depends on the data distribution.
The focus of a box plot, in contrast, is to divide the data elements of a data set into four portions such that each portion contains an equal number of data elements.
Histograms may take different shapes (e.g. uniform, symmetric, or skewed) depending on the nature of the data.
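An illustrative histogram sketch with matplotlib (synthetic values; the choice of 20 bins is arbitrary):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
values = rng.normal(loc=23, scale=7, size=500)   # hypothetical numeric attribute

# Each bar counts how many data elements fall into one bin (interval) of values.
plt.hist(values, bins=20, edgecolor="black")
plt.title("Histogram of a numeric attribute")
plt.xlabel("value")
plt.ylabel("count")
plt.show()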
Exploring the relationship between variables:
a) Scatter plot
A scatter plot helps in visualizing the relationship between two numeric attributes by plotting the value pairs as points in two dimensions.
b) Two-way cross-tabulations
Two-way cross-tabulations (also called cross-tabs or contingency tables) are used to understand the relationship of two categorical attributes. A cross-tab has a matrix format that presents a summarized view of the bivariate frequency distribution.
A cross-tab, much like a scatter plot, helps to understand how the data values of one attribute change with a change in the data values of another.
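A small sketch showing both plots of relationship on an invented mini data set (the column names echo the Auto MPG attributes but the values are made up):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "weight":    [2130, 3430, 1985, 4100, 2670, 3850],
    "mpg":       [31.0, 17.5, 33.0, 13.0, 26.0, 15.5],
    "cylinders": [4, 6, 4, 8, 4, 8],
    "origin":    ["japan", "usa", "japan", "usa", "europe", "usa"],
})

# Scatter plot: relationship between two numeric attributes.
plt.scatter(df["weight"], df["mpg"])
plt.xlabel("weight")
plt.ylabel("mpg")
plt.show()

# Two-way cross-tab: bivariate frequency distribution of two categorical attributes.
print(pd.crosstab(df["cylinders"], df["origin"]))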
Data quality:
The success of machine learning depends largely on the quality of data. Data of the right quality helps to achieve better prediction accuracy.
Two main types of data quality issues are:
1. Missing values
2. Outliers
Eg: if we select a sample set of sales transactions from a festive period and try to use that data to predict sales in the future, the prediction will be misleading because the sample is not representative of regular sales.
In many cases, a person or group of persons are responsible for the collection of data to be used in a
learning activity.
In this manual process, there is the possibility of wrongly recording data either in terms of value (say
20.67 is wrongly recorded as 206.7 or 2.067) or in terms of a unit of measurement (say cm. is wrongly
recorded as m. or mm.).
It may also happen that the data is not recorded at all. In the case of a survey conducted to collect data, respondents may choose not to answer a certain question, so the data value for that element in the respondent's record is missing.
Data remediation:
The issues in data quality need to be remediated if the desired efficiency is to be achieved in the learning activity.
1. Handling outliers
Outliers are data elements with abnormally high or low values, i.e. values that deviate sharply from the rest of the data, which may impact prediction accuracy, especially in regression models.
Once the outliers are identified and the decision has been taken to amend those values, we may consider one of the following approaches:
Remove outliers: If the number of records which are outliers is not many, a simple approach may be to
remove them.
Imputation: One other way is to impute the value with mean or median or mode or value of the most
similar data element
Capping: for values that lie outside the 1.5 × IQR limits, we can cap them by replacing observations below the lower limit with the value of the 5th percentile, and observations above the upper limit with the value of the 95th percentile.
If there is a significant number of outliers, they should be treated separately in the statistical model
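A minimal sketch of the capping approach described above; the helper name cap_outliers_iqr and the sample values are illustrative:

import pandas as pd

def cap_outliers_iqr(series: pd.Series) -> pd.Series:
    # Cap values lying outside the 1.5 x IQR limits at the 5th / 95th percentiles.
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower_limit, upper_limit = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    p5, p95 = series.quantile([0.05, 0.95])
    capped = series.copy()
    capped[series < lower_limit] = p5
    capped[series > upper_limit] = p95
    return capped

values = pd.Series([20.1, 21.4, 19.8, 22.0, 20.6, 206.7])   # 206.7 is an outlier
print(cap_outliers_iqr(values))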
2. Handling missing values
Eliminate records with missing values: if the proportion of data elements having missing values is within a tolerable limit, a simple but effective approach is to remove the records containing them. This will not be possible, however, if the proportion of records having missing values is really high.
For quantitative attributes, missing values can be imputed with the mean, median, or mode of the remaining values under the same attribute.
For qualitative attributes, missing values can be imputed with the mode of the remaining values of the same attribute.
If there are data points similar to the ones with missing attribute values, then the attribute values from those similar data points can be used in place of the missing values.
For example, assume that the weight of a Russian student aged 12 years and 5 ft. tall is missing. Then the weight of any other Russian student of about the same age and height can be assigned.
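A minimal imputation sketch with pandas, covering the mean (quantitative) and mode (qualitative) cases on invented records; scikit-learn's SimpleImputer offers the same strategies in pipeline form:

import numpy as np
import pandas as pd

# Illustrative records: 'horsepower' is quantitative, 'origin' is qualitative.
df = pd.DataFrame({
    "horsepower": [130.0, np.nan, 95.0, 150.0, np.nan],
    "origin":     ["usa", "japan", None, "usa", "usa"],
})

# Quantitative attribute: impute with the mean (median would work the same way).
df["horsepower"] = df["horsepower"].fillna(df["horsepower"].mean())

# Qualitative attribute: impute with the mode of the remaining values.
df["origin"] = df["origin"].fillna(df["origin"].mode()[0])

print(df)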
More generally, common data quality issues include:
1. Missing values
2. Noisy or erroneous data
3. Inconsistent formatting
4. Duplicate records
5. Outliers
6. Biased data
7. Incomplete data
8. Data drift (concept drift)
Data remediation in Machine Learning (ML) involves identifying and correcting data
quality issues to improve model performance and reliability.
Data Remediation Steps:
2.2.1 Feature transformation
Types:
Non-Linear Methods
Benefits:
Applications:
1. Image classification
2. Natural Language Processing
3. Recommender systems
4. Anomaly detection
5. Clustering
2.2.2 Feature subset selection
Wrapper Methods (see the RFE sketch after this list)
1. Forward Selection
2. Backward Elimination
3. Recursive Feature Elimination (RFE)
4. Genetic Algorithms
5. Particle Swarm Optimization
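A minimal wrapper-method sketch using scikit-learn's RFE; the built-in breast cancer data set and the target of 10 features are illustrative choices:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# RFE repeatedly fits the estimator and eliminates the weakest feature(s)
# until only the requested number of features remains.
X, y = load_breast_cancer(return_X_y=True)
selector = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10)
selector.fit(X, y)

print("Selected feature mask:", selector.support_)
print("Feature ranking (1 = selected):", selector.ranking_)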
Embedded Methods (see the Lasso sketch after this list)
1. L1 Regularization (Lasso)
2. L2 Regularization (Ridge)
3. Elastic Net Regularization
4. Decision Trees
5. Random Forests
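A minimal embedded-method sketch: L1 (Lasso) regularization performs feature selection during training by driving some coefficients to exactly zero. The built-in diabetes data set and the alpha value are illustrative choices:

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)   # regularization assumes comparable scales

lasso = Lasso(alpha=1.0).fit(X, y)
kept = np.flatnonzero(lasso.coef_ != 0)
print(f"Features kept: {kept} ({kept.size} of {X.shape[1]})")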
Benefits:
Applications:
1. Image classification
2. Natural Language Processing
3. Recommender systems
4. Anomaly detection
5. Clustering
2.3 Learning of the data model
Learning Types:
Learning Process:
1. Accuracy
2. Precision
3. Recall
4. F1-Score
5. Mean Squared Error (MSE)
6. Computational Complexity
7. Interpretability
8. Scalability
1. Cross-Validation
2. Grid Search (see the sketch after this list)
3. Random Search
4. Bayesian Optimization
5. Ensemble Methods
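A minimal sketch of hyperparameter tuning that combines two items from the list above, grid search and cross-validation; the SVC parameter grid and the iris data set are illustrative choices:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Try every combination in the grid and keep the best mean cross-validated score.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 3))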
Supervised Learning
1. Linear Regression
2. Logistic Regression
3. Decision Trees
4. Random Forests
5. Support Vector Machines (SVM)
6. Neural Networks
7. Gradient Boosting
8. K-Nearest Neighbors (KNN)
Unsupervised Learning
1. K-Means Clustering
2. Hierarchical Clustering
3. Principal Component Analysis (PCA)
4. t-Distributed Stochastic Neighbor Embedding (t-SNE)
1. Scikit-learn
2. TensorFlow
3. PyTorch
4. Keras
5. Microsoft Cognitive Toolkit (CNTK)
Best Practices:
Training Types:
Training Algorithms:
Supervised Learning
1. Linear Regression
2. Logistic Regression
3. Decision Trees
4. Random Forests
5. Support Vector Machines (SVM)
6. Neural Networks
7. Gradient Boosting
8. K-Nearest Neighbors (KNN)
Unsupervised Learning
1. K-Means Clustering
2. Hierarchical Clustering
3. Principal Component Analysis (PCA)
4. t-Distributed Stochastic Neighbor Embedding (t-SNE)
Training Techniques:
1. Batch Training
2. Online Training
3. Mini-Batch Training
4. Transfer Learning
5. Ensemble Methods
Training Metrics:
1. Accuracy
2. Precision
3. Recall
4. F1-Score
5. Mean Squared Error (MSE)
6. Cross-Entropy Loss
7. R-Squared
Training Tools:
1. Scikit-learn
2. TensorFlow
3. PyTorch
4. Keras
5. Microsoft Cognitive Toolkit (CNTK)
Best Practices:
Common Challenges:
1. Overfitting
2. Underfitting
3. Data Imbalance
4. Feature Engineering
5. Model Interpretability
Real-World Applications:
1. Image Classification
2. Natural Language Processing
3. Recommender Systems
4. Predictive Maintenance
5. Autonomous Vehicles
Model Representations:
1. Mathematical Equations
2. Graphical Representations (e.g., decision trees)
3. Probabilistic Models (e.g., Bayesian networks)
4. Neural Network Architectures
Interpretability Techniques:
1. Feature Importance (see the sketch after this list)
2. Partial Dependence Plots
3. SHAP Values (SHapley Additive exPlanations)
4. LIME (Local Interpretable Model-agnostic Explanations)
5. Model-agnostic interpretability
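A minimal sketch of the first technique in the list above, feature importance, using a random forest's impurity-based importances (SHAP and LIME are separate libraries and are not shown; the data set choice is illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(data.data, data.target)

# Rank features by how much they reduce impurity across the forest.
ranked = sorted(zip(data.feature_names, model.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked[:5]:
    print(f"{name}: {importance:.3f}")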
Interpretability Metrics:
1. Accuracy
2. Precision
3. Recall
4. F1-Score
5. Mean Squared Error (MSE)
6. R-Squared
Interpretability Tools:
1. TensorFlow Explainability
2. PyTorch Explainer
3. Scikit-learn's Interpretation Tools
4. LIME
5. SHAP
Challenges:
1. Model complexity
2. Data quality issues
3. Feature engineering
4. Overfitting
5. Scalability
Real-World Applications:
1. Healthcare diagnosis
2. Financial risk assessment
3. Image classification
4. Natural Language Processing
5. Autonomous vehicles
The main categories of machine learning tasks are:
1. Classification
2. Regression
3. Clustering
Performance Evaluation of a model in Machine Learning (ML) involves assessing its
ability to make accurate predictions or decisions.
Evaluation Metrics:
Classification:
1. Accuracy
2. Precision
3. Recall
4. F1-Score
5. ROC-AUC
6. Confusion Matrix
Regression:
1. Mean Squared Error (MSE)
2. Root Mean Squared Error (RMSE)
3. Mean Absolute Error (MAE)
4. R-Squared
Clustering:
1. Silhouette Coefficient
2. Calinski-Harabasz Index
3. Davies-Bouldin Index
4. Cluster Purity
5. Cluster Entropy
Evaluation Techniques (see the sketch after this list):
1. Cross-Validation
2. Train-Test Split
3. Walk-Forward Optimization
4. Bootstrap Resampling
5. Monte Carlo Simulation
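A minimal sketch of the first two techniques in the list, a train-test split and k-fold cross-validation, with scikit-learn (data set and model choices are illustrative):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Hold-out evaluation: test on data the model never saw during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model.fit(X_train, y_train)
print("Hold-out accuracy:", round(model.score(X_test, y_test), 3))

# 5-fold cross-validation: average the score over five train/validation splits.
scores = cross_val_score(model, X, y, cv=5)
print("Mean 5-fold CV accuracy:", round(scores.mean(), 3))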
Best Practices:
Common Challenges:
1. Overfitting
2. Underfitting
3. Data imbalance
4. Feature engineering
5. Model interpretability
Real-World Applications:
1. Image classification
2. Natural Language Processing
3. Recommender systems
4. Predictive maintenance
5. Autonomous vehicles
2.4.1 Classification
Classification in Machine Learning (ML) involves assigning input data points to one of a set of discrete classes or categories.
Classification Algorithms:
Supervised Learning:
1. Logistic Regression
2. Decision Trees
3. Random Forests
4. Support Vector Machines (SVM)
5. Neural Networks
6. K-Nearest Neighbors (KNN)
7. Gradient Boosting
Deep Learning:
Evaluation Metrics (see the classifier sketch after this list):
1. Accuracy
2. Precision
3. Recall
4. F1-Score
5. ROC-AUC
6. Confusion Matrix
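A minimal end-to-end classification sketch: a decision tree trained on a built-in data set and evaluated with the confusion matrix and per-class precision, recall and F1 from the list above (all specific choices are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = DecisionTreeClassifier(max_depth=4, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))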
Classification Techniques:
1. Feature Engineering
2. Data Preprocessing
3. Hyperparameter Tuning
4. Ensemble Methods
5. Transfer Learning
Real-World Applications:
1. Image Classification
2. Sentiment Analysis
3. Spam Detection
4. Medical Diagnosis
5. Product Recommendation
Classification Tools:
1. Scikit-learn
2. TensorFlow
3. PyTorch
4. Keras
5. Microsoft Cognitive Toolkit (CNTK)
Best Practices:
Common Challenges:
1. Class imbalance
2. Overfitting
3. Underfitting
4. Data noise
5. Model interpretability
Advanced Classification Settings:
1. Anomaly Detection
2. One-Class Classification
3. Zero-Shot Learning
4. Few-Shot Learning
5. Meta-Learning
2.4.2 Regression
Regression in Machine Learning (ML) involves predicting a continuous numeric value from input features.
Regression Algorithms (a linear regression sketch follows these lists):
Linear Models:
1. Linear Regression
2. Ridge Regression
3. Lasso Regression
4. Elastic Net
Non-Linear Models:
1. Decision Trees
2. Random Forests
3. Support Vector Regression (SVR)
4. Neural Networks
5. Gradient Boosting
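A minimal regression sketch: an ordinary linear regression model evaluated with MSE and R-squared on held-out data (the diabetes data set is an illustrative choice):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)

print("MSE:", round(mean_squared_error(y_test, y_pred), 1))
print("R-squared:", round(r2_score(y_test, y_pred), 3))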
Evaluation Metrics:
1. Mean Squared Error (MSE)
2. Root Mean Squared Error (RMSE)
3. Mean Absolute Error (MAE)
4. R-Squared
Regression Techniques:
1. Feature Engineering
2. Data Preprocessing
3. Hyperparameter Tuning
4. Ensemble Methods
5. Transfer Learning
Real-World Applications:
Regression Tools:
1. Scikit-learn
2. TensorFlow
3. PyTorch
4. Keras
5. Microsoft Cognitive Toolkit (CNTK)
Best Practices:
Common Challenges:
1. Overfitting
2. Underfitting
3. Data noise
4. Model interpretability
5. Non-linear relationships
2.4.3 Clustering
Clustering in Machine Learning (ML) involves grouping similar data points into
clusters based on their features.
Types of Clustering:
1. Partitioning-based (e.g. K-Means)
2. Hierarchical (e.g. agglomerative)
3. Density-based (e.g. DBSCAN)
Clustering Algorithms:
1. K-Means
2. Hierarchical Agglomerative Clustering (HAC)
3. DBSCAN
4. OPTICS (Ordering Points To Identify the Clustering Structure)
5. K-Medoids
Evaluation Metrics (a K-Means and silhouette sketch follows this list):
1. Silhouette Coefficient
2. Calinski-Harabasz Index
3. Davies-Bouldin Index
4. Cluster Purity
5. Cluster Entropy
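A minimal clustering sketch: K-Means on synthetic blob data, scored with the silhouette coefficient from the metrics list above (all specific choices are illustrative):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups, clustered with K-Means.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print("Cluster sizes:", [int((kmeans.labels_ == k).sum()) for k in range(3)])
print("Silhouette coefficient:", round(silhouette_score(X, kmeans.labels_), 3))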
Clustering Techniques:
1. Feature Scaling
2. Data Preprocessing
3. Dimensionality Reduction
4. Ensemble Methods
5. Model Selection
Real-World Applications:
1. Customer Segmentation
2. Image Segmentation
3. Gene Expression Analysis
4. Text Clustering
5. Anomaly Detection
Clustering Tools:
1. Scikit-learn
2. TensorFlow
3. PyTorch
4. Keras
5. Microsoft Cognitive Toolkit (CNTK)
Best Practices:
Advanced Clustering Approaches:
1. Semi-Supervised Clustering
2. Active Learning
3. Transfer Learning
4. Deep Clustering
5. Spectral Clustering
1. Data Preprocessing
2. Feature Engineering
3. Hyperparameter Tuning
4. Model Selection
5. Ensemble Methods
6. Regularization Techniques
7. Cross-Validation
1. Gradient Boosting
2. Transfer Learning
3. Early Stopping
4. Batch Normalization
5. Dropout
6. Data Augmentation
7. Attention Mechanisms
Optimization Algorithms:
1. Gradient Descent (see the sketch after this list)
2. Stochastic Gradient Descent
3. Adam
4. RMSprop
5. Adagrad
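A minimal NumPy sketch of the first algorithm in the list, plain (batch) gradient descent, fitting a simple linear model; the synthetic data, learning rate and iteration count are arbitrary illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 5.0 + rng.normal(scale=1.0, size=100)   # true slope 3, intercept 5

w, b, lr = 0.0, 0.0, 0.01
for _ in range(2000):
    y_pred = w * x + b
    error = y_pred - y
    grad_w = 2 * np.mean(error * x)   # d(MSE)/dw
    grad_b = 2 * np.mean(error)       # d(MSE)/db
    w -= lr * grad_w                  # move against the gradient
    b -= lr * grad_b

print(f"Learned w={w:.2f}, b={b:.2f} (true values 3 and 5)")

Stochastic and mini-batch gradient descent use the same update rule but estimate the gradient from one example or a small batch at a time; Adam, RMSprop and Adagrad additionally adapt the learning rate per parameter.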
Performance Metrics:
Classification:
1. Accuracy
2. Precision
3. Recall
4. F1-Score
5. ROC-AUC
Regression:
1. Mean Squared Error (MSE)
2. Root Mean Squared Error (RMSE)
3. Mean Absolute Error (MAE)
4. R-Squared
Tools:
1. Scikit-learn
2. TensorFlow
3. PyTorch
4. Keras
5. Microsoft Cognitive Toolkit (CNTK)
Best Practices:
Common Challenges:
1. Overfitting
2. Underfitting
3. Data quality issues
4. Model interpretability
5. Scalability
Real-World Applications:
1. Image classification
2. Natural Language Processing
3. Recommender systems
4. Predictive maintenance
5. Autonomous vehicles
Key Considerations:
1. Data quality
2. Model complexity
3. Hyperparameter tuning
4. Regularization
5. Model interpretability
Advanced Techniques:
1. Bayesian Optimization
2. Gradient-Based Optimization
3. Evolutionary Optimization
4. Deep Learning
5. Transfer Learning