ML U2
The machine learning process involves several steps that help develop and deploy a
successful machine learning model.
Machine Learning Process:
1. Problem Definition
2. Data Collection
3. Data Preprocessing
4. Data Split
5. Model Selection
6. Model Training
7. Model Evaluation
8. Model Deployment
At a project level, the same workflow can also be described as:
1. Requirements gathering
2. Data preparation
3. Model development
4. Model testing
5. Model deployment
6. Model maintenance
Types of data:
Data can broadly be divided into two types: 1. Qualitative data 2. Quantitative data
Qualitative data provides information about the quality of an object or information which cannot be
measured. Qualitative data is also called categorical data. Eg: name or roll number
Qualitative data can be further subdivided into two types: 1. Nominal data 2. Ordinal data
Nominal data has no numeric value, only a named value.
Eg: 1. Blood group: A, B, O, AB, etc. 2. Nationality: Indian, American, British, etc. 3. Gender: Male,
Female, Other
Mathematical operations cannot be performed on nominal data, so statistical measures such as mean and variance cannot be applied to it (although the mode can still be identified).
Ordinal data, in addition to possessing the properties of nominal data, can also be naturally ordered.
This means they can be arranged in a sequence of increasing or decreasing value so that we can say
whether a value is better than or greater than another value.
Eg: 1. Customer satisfaction: ‘Very Happy’, ‘Happy’, ‘Unhappy’, etc. 2. Grades: A, B, C, etc. 3. Hardness of
Metal: ‘Very Hard’, ‘Hard’, ‘Soft’, etc. Basic counting is possible for ordinal data.
Quantitative data relates to information about the quantity of an object – hence it can be measured.
For example, if we consider the attribute ‘marks’, it can be measured using a scale of measurement.
Quantitative data can be further subdivided into two types:
1. Interval data
2. Ratio data
Interval data is numeric data for which not only the order is known, but the exact difference between values is also known; however, it has no true zero point (e.g. temperature in Celsius or Fahrenheit).
Ratio data is numeric data for which the exact value can be measured and an absolute zero is available, so ratios of values are meaningful (e.g. height, weight, age).
There are various types of data, classified based on their characteristics, format, and
usage.
Data Types:
1. Quantitative Data
2. Qualitative Data
3. Structured Data
4. Unstructured Data
5. Semi-Structured Data
Data Formats:
Data Sources:
Characteristics of data (the five Vs):
1. Volume
2. Velocity
3. Variety
4. Veracity
5. Value
Dimensions of data quality:
1. Accuracy
2. Completeness
3. Consistency
4. Timeliness
5. Relevance
2.1.2 Structure of the data
The data set that we take as a reference is the Auto MPG data set, available in the UCI Machine Learning Repository.
1. Numeric: the attributes 'mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model year', and 'origin' are all numeric.
2. Categorical: the attribute 'car name' is categorical (text-valued).
The two most effective plots for exploring numerical data are the box plot and the histogram.
To understand the central tendency of the data, we can apply the measures mean and median.
Exercise (Auto MPG data set): for each of the numeric attributes, find the mean, the median, and the deviation between these two values.
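A minimal pandas sketch of this exercise (illustrative only): it assumes the Auto MPG data has already been loaded into a DataFrame df, and the helper name central_tendency_report is not part of the original text.

import pandas as pd

# Assumes the Auto MPG data is already loaded into a DataFrame `df` whose numeric
# attributes include 'mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
# 'acceleration', 'model year' and 'origin'.
def central_tendency_report(df: pd.DataFrame) -> pd.DataFrame:
    numeric = df.select_dtypes(include="number")
    report = pd.DataFrame({"mean": numeric.mean(), "median": numeric.median()})
    # Deviation between the two measures of central tendency, per attribute.
    report["mean_median_deviation"] = (report["mean"] - report["median"]).abs()
    return report

Calling central_tendency_report(df) returns one row per numeric attribute; a large deviation between mean and median hints at skew or outliers in that attribute.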
Understanding data spread gives a more granular view of the data, in the form of: 1. Dispersion of data 2. Position of the different data values
1. Measuring data dispersion
Two attributes may have the same mean and median, yet the values of attribute 1 may be concentrated or clustered around the mean/median, whereas the values of attribute 2 may be quite spread out or dispersed.
To measure the extent of dispersion of data, i.e. how spread out its different values are, the variance of the data is measured.
A larger value of variance or standard deviation indicates more dispersion in the data, and vice versa. In the example above, attribute 1 would therefore have a smaller variance and standard deviation than attribute 2.
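A small illustrative sketch of this idea, using two invented attributes that share the same mean but differ in spread:

import numpy as np

# Two hypothetical attributes with the same mean (46) but different dispersion.
attribute_1 = np.array([44, 46, 48, 45, 47])   # clustered around the mean
attribute_2 = np.array([34, 46, 59, 39, 52])   # spread out

for name, values in [("attribute 1", attribute_1), ("attribute 2", attribute_2)]:
    # Variance = average squared deviation from the mean; std dev = its square root.
    print(f"{name}: mean={values.mean():.1f}, "
          f"variance={values.var():.2f}, std dev={values.std():.2f}")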
2. Measuring data value position
The median of a data set gives the central data value, which divides the entire data set into two halves.
If the first half of the data is divided into two halves so that each half consists of one-quarter of the data
set, then that median of the first half is known as first quartile or Q1.
If the second half of the data is divided into two halves, then that median of the second half is known as
third quartile or Q3. The overall median is also known as second quartile or Q2.
So any data set has a five-number summary: minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum.
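A small NumPy sketch computing the five-number summary (the sample values are invented):

import numpy as np

def five_number_summary(values):
    # Minimum, Q1, median (Q2), Q3 and maximum of a numeric attribute.
    values = np.asarray(values, dtype=float)
    q1, q2, q3 = np.percentile(values, [25, 50, 75])
    return {"min": values.min(), "Q1": q1, "median": q2, "Q3": q3, "max": values.max()}

print(five_number_summary([12, 15, 17, 19, 22, 24, 29, 31, 35]))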
Plotting and exploring numerical data
a) Box plot
A box plot is an extremely effective mechanism to get a one-shot view of the data and understand its nature. The box plot (also called a box and whisker plot) gives a standard visualization of the five-number summary statistics of a data set, namely minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum.
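An illustrative box-plot sketch with matplotlib; the synthetic values stand in for a numeric attribute (with the Auto MPG data you would pass, e.g., df['mpg'] instead):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
values = rng.normal(loc=23, scale=7, size=200)   # hypothetical numeric attribute

# The box spans Q1..Q3, the line inside is the median, whiskers extend to the most
# extreme values within 1.5 x IQR, and points beyond them are drawn as potential outliers.
plt.boxplot(values)
plt.title("Box plot of a numeric attribute")
plt.ylabel("value")
plt.show()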
b) Histogram
A histogram helps in understanding the distribution of numeric data over a series of intervals, also termed 'bins'.
The focus of a histogram is to plot ranges of data values (the bins); the number of data elements in each range depends on the data distribution.
The focus of a box plot, in contrast, is to divide the data elements of a data set into four portions such that each portion contains an equal number of data elements.
Histograms may take different shapes (e.g. uniform, symmetric, or skewed) depending on the nature of the data.
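An illustrative histogram sketch with matplotlib (synthetic values; the choice of 20 bins is arbitrary):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
values = rng.normal(loc=23, scale=7, size=500)   # hypothetical numeric attribute

# Each bar counts how many data elements fall into one bin (interval) of values.
plt.hist(values, bins=20, edgecolor="black")
plt.title("Histogram of a numeric attribute")
plt.xlabel("value")
plt.ylabel("count")
plt.show()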
Exploring the relationship between variables:
a) Scatter plot
A scatter plot helps in visualizing the relationship between two numeric attributes by plotting the value pairs as points in two dimensions.
b) Two-way cross-tabulations
Two-way cross-tabulations (also called cross-tabs or contingency tables) are used to understand the relationship of two categorical attributes. A cross-tab has a matrix format that presents a summarized view of the bivariate frequency distribution.
A cross-tab, much like a scatter plot, helps to understand how the data values of one attribute change with a change in the data values of another.
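A small sketch showing both plots of relationship on an invented mini data set (the column names echo the Auto MPG attributes but the values are made up):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "weight":    [2130, 3430, 1985, 4100, 2670, 3850],
    "mpg":       [31.0, 17.5, 33.0, 13.0, 26.0, 15.5],
    "cylinders": [4, 6, 4, 8, 4, 8],
    "origin":    ["japan", "usa", "japan", "usa", "europe", "usa"],
})

# Scatter plot: relationship between two numeric attributes.
plt.scatter(df["weight"], df["mpg"])
plt.xlabel("weight")
plt.ylabel("mpg")
plt.show()

# Two-way cross-tab: bivariate frequency distribution of two categorical attributes.
print(pd.crosstab(df["cylinders"], df["origin"]))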
Data quality:
The success of machine learning depends largely on the quality of data. Data of the right quality helps to achieve better prediction accuracy.
Two main types of data quality issues are:
1. Missing values
2. Outliers
Eg: if we select a sample set of sales transactions from a festive period and try to use that data to predict sales in the future, the prediction will be misleading because the sample is not representative of regular sales.
In many cases, a person or group of persons are responsible for the collection of data to be used in a
learning activity.
In this manual process, there is the possibility of wrongly recording data either in terms of value (say
20.67 is wrongly recorded as 206.7 or 2.067) or in terms of a unit of measurement (say cm. is wrongly
recorded as m. or mm.).
It may also happen that the data is not recorded at all. In the case of a survey conducted to collect data, respondents may choose not to answer a certain question, so the data value for that element in the respondent's record is missing.
Data remediation:
The issues in data quality need to be remediated if the desired efficiency is to be achieved in the learning activity.
1. Handling outliers
Outliers are data elements with abnormally high or low values, i.e. values that deviate sharply from the rest of the data, which may impact prediction accuracy, especially in regression models.
Once the outliers are identified and the decision has been taken to amend those values, we may consider one of the following approaches:
Remove outliers: If the number of records which are outliers is not many, a simple approach may be to
remove them.
Imputation: One other way is to impute the value with mean or median or mode or value of the most
similar data element
Capping: for values that lie outside the 1.5 × IQR limits, we can cap them by replacing observations below the lower limit with the value of the 5th percentile, and observations above the upper limit with the value of the 95th percentile.
If there is a significant number of outliers, they should be treated separately in the statistical model
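A minimal sketch of the capping approach described above; the helper name cap_outliers_iqr and the sample values are illustrative:

import pandas as pd

def cap_outliers_iqr(series: pd.Series) -> pd.Series:
    # Cap values lying outside the 1.5 x IQR limits at the 5th / 95th percentiles.
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower_limit, upper_limit = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    p5, p95 = series.quantile([0.05, 0.95])
    capped = series.copy()
    capped[series < lower_limit] = p5
    capped[series > upper_limit] = p95
    return capped

values = pd.Series([20.1, 21.4, 19.8, 22.0, 20.6, 206.7])   # 206.7 is an outlier
print(cap_outliers_iqr(values))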
2. Handling missing values
Eliminate records with missing values: if the proportion of data elements having missing values is within a tolerable limit, a simple but effective approach is to remove the records containing them. This will not be possible, however, if the proportion of records having missing values is really high.
For quantitative attributes, missing values can be imputed with the mean, median, or mode of the remaining values under the same attribute.
For qualitative attributes, missing values can be imputed with the mode of the remaining values of the same attribute.
If there are data points similar to the ones with missing attribute values, then the attribute values from those similar data points can be used in place of the missing values.
For example, assume that the weight of a Russian student aged 12 years and 5 ft. tall is missing. Then the weight of any other Russian student of about the same age and height can be assigned.
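A minimal imputation sketch with pandas, covering the mean (quantitative) and mode (qualitative) cases on invented records; scikit-learn's SimpleImputer offers the same strategies in pipeline form:

import numpy as np
import pandas as pd

# Illustrative records: 'horsepower' is quantitative, 'origin' is qualitative.
df = pd.DataFrame({
    "horsepower": [130.0, np.nan, 95.0, 150.0, np.nan],
    "origin":     ["usa", "japan", None, "usa", "usa"],
})

# Quantitative attribute: impute with the mean (median would work the same way).
df["horsepower"] = df["horsepower"].fillna(df["horsepower"].mean())

# Qualitative attribute: impute with the mode of the remaining values.
df["origin"] = df["origin"].fillna(df["origin"].mode()[0])

print(df)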
More generally, common data quality issues include:
1. Missing values
2. Noisy or erroneous data
3. Inconsistent formatting
4. Duplicate records
5. Outliers
6. Biased data
7. Incomplete data
8. Data drift (concept drift)
Data remediation in Machine Learning (ML) involves identifying and correcting data
quality issues to improve model performance and reliability.
Data Remediation Steps:
2.2.1 Feature transformation
Types:
Non-Linear Methods
Benefits:
Applications:
1. Image classification
2. Natural Language Processing
3. Recommender systems
4. Anomaly detection
5. Clustering
2.2.2 Feature subset selection
Wrapper Methods (see the RFE sketch after this list)
1. Forward Selection
2. Backward Elimination
3. Recursive Feature Elimination (RFE)
4. Genetic Algorithms
5. Particle Swarm Optimization
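A minimal wrapper-method sketch using scikit-learn's RFE; the built-in breast cancer data set and the target of 10 features are illustrative choices:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# RFE repeatedly fits the estimator and eliminates the weakest feature(s)
# until only the requested number of features remains.
X, y = load_breast_cancer(return_X_y=True)
selector = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10)
selector.fit(X, y)

print("Selected feature mask:", selector.support_)
print("Feature ranking (1 = selected):", selector.ranking_)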
Embedded Methods (see the Lasso sketch after this list)
1. L1 Regularization (Lasso)
2. L2 Regularization (Ridge)
3. Elastic Net Regularization
4. Decision Trees
5. Random Forests
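A minimal embedded-method sketch: L1 (Lasso) regularization performs feature selection during training by driving some coefficients to exactly zero. The built-in diabetes data set and the alpha value are illustrative choices:

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)   # regularization assumes comparable scales

lasso = Lasso(alpha=1.0).fit(X, y)
kept = np.flatnonzero(lasso.coef_ != 0)
print(f"Features kept: {kept} ({kept.size} of {X.shape[1]})")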
Benefits:
Applications:
1. Image classification
2. Natural Language Processing
3. Recommender systems
4. Anomaly detection
5. Clustering
2.3 Learning of the data model
Learning Types:
Learning Process:
1. Accuracy
2. Precision
3. Recall
4. F1-Score
5. Mean Squared Error (MSE)
6. Computational Complexity
7. Interpretability
8. Scalability
1. Cross-Validation
2. Grid Search (see the sketch after this list)
3. Random Search
4. Bayesian Optimization
5. Ensemble Methods
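A minimal sketch of hyperparameter tuning that combines two items from the list above, grid search and cross-validation; the SVC parameter grid and the iris data set are illustrative choices:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Try every combination in the grid and keep the best mean cross-validated score.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 3))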
Supervised Learning
1. Linear Regression
2. Logistic Regression
3. Decision Trees
4. Random Forests
5. Support Vector Machines (SVM)
6. Neural Networks
7. Gradient Boosting
8. K-Nearest Neighbors (KNN)
Unsupervised Learning
1. K-Means Clustering
2. Hierarchical Clustering
3. Principal Component Analysis (PCA)
4. t-Distributed Stochastic Neighbor Embedding (t-SNE)
1. Scikit-learn
2. TensorFlow
3. PyTorch
4. Keras
5. Microsoft Cognitive Toolkit (CNTK)
Best Practices:
Training Types:
Training Algorithms:
Supervised Learning
1. Linear Regression
2. Logistic Regression
3. Decision Trees
4. Random Forests
5. Support Vector Machines (SVM)
6. Neural Networks
7. Gradient Boosting
8. K-Nearest Neighbors (KNN)
Unsupervised Learning
1. K-Means Clustering
2. Hierarchical Clustering
3. Principal Component Analysis (PCA)
4. t-Distributed Stochastic Neighbor Embedding (t-SNE)
Training Techniques:
1. Batch Training
2. Online Training
3. Mini-Batch Training
4. Transfer Learning
5. Ensemble Methods
Training Metrics:
1. Accuracy
2. Precision
3. Recall
4. F1-Score
5. Mean Squared Error (MSE)
6. Cross-Entropy Loss
7. R-Squared
Training Tools:
1. Scikit-learn
2. TensorFlow
3. PyTorch
4. Keras
5. Microsoft Cognitive Toolkit (CNTK)
Best Practices:
Common Challenges:
1. Overfitting
2. Underfitting
3. Data Imbalance
4. Feature Engineering
5. Model Interpretability
Real-World Applications:
1. Image Classification
2. Natural Language Processing
3. Recommender Systems
4. Predictive Maintenance
5. Autonomous Vehicles
Model Representations:
1. Mathematical Equations
2. Graphical Representations (e.g., decision trees)
3. Probabilistic Models (e.g., Bayesian networks)
4. Neural Network Architectures
Interpretability Techniques:
1. Feature Importance (see the sketch after this list)
2. Partial Dependence Plots
3. SHAP Values (SHapley Additive exPlanations)
4. LIME (Local Interpretable Model-agnostic Explanations)
5. Model-agnostic interpretability
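A minimal sketch of the first technique in the list above, feature importance, using a random forest's impurity-based importances (SHAP and LIME are separate libraries and are not shown; the data set choice is illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(data.data, data.target)

# Rank features by how much they reduce impurity across the forest.
ranked = sorted(zip(data.feature_names, model.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked[:5]:
    print(f"{name}: {importance:.3f}")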
Interpretability Metrics:
1. Accuracy
2. Precision
3. Recall
4. F1-Score
5. Mean Squared Error (MSE)
6. R-Squared
Interpretability Tools:
1. TensorFlow Explainability
2. PyTorch Explainer
3. Scikit-learn's Interpretation Tools
4. LIME
5. SHAP
Challenges:
1. Model complexity
2. Data quality issues
3. Feature engineering
4. Overfitting
5. Scalability
Real-World Applications:
1. Healthcare diagnosis
2. Financial risk assessment
3. Image classification
4. Natural Language Processing
5. Autonomous vehicles
The main categories of machine learning tasks are:
1. Classification
2. Regression
3. Clustering
Performance Evaluation of a model in Machine Learning (ML) involves assessing its
ability to make accurate predictions or decisions.
Evaluation Metrics:
Classification:
1. Accuracy
2. Precision
3. Recall
4. F1-Score
5. ROC-AUC
6. Confusion Matrix
Regression:
1. Mean Squared Error (MSE)
2. Root Mean Squared Error (RMSE)
3. Mean Absolute Error (MAE)
4. R-Squared
Clustering:
1. Silhouette Coefficient
2. Calinski-Harabasz Index
3. Davies-Bouldin Index
4. Cluster Purity
5. Cluster Entropy
Evaluation Techniques (see the sketch after this list):
1. Cross-Validation
2. Train-Test Split
3. Walk-Forward Optimization
4. Bootstrap Resampling
5. Monte Carlo Simulation
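A minimal sketch of the first two techniques in the list, a train-test split and k-fold cross-validation, with scikit-learn (data set and model choices are illustrative):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Hold-out evaluation: test on data the model never saw during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model.fit(X_train, y_train)
print("Hold-out accuracy:", round(model.score(X_test, y_test), 3))

# 5-fold cross-validation: average the score over five train/validation splits.
scores = cross_val_score(model, X, y, cv=5)
print("Mean 5-fold CV accuracy:", round(scores.mean(), 3))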
Best Practices:
Common Challenges:
1. Overfitting
2. Underfitting
3. Data imbalance
4. Feature engineering
5. Model interpretability
Real-World Applications:
1. Image classification
2. Natural Language Processing
3. Recommender systems
4. Predictive maintenance
5. Autonomous vehicles
2.4.1 Classification
Classification in Machine Learning (ML) involves assigning input data points to one of a set of discrete classes or categories.
Classification Algorithms:
Supervised Learning:
1. Logistic Regression
2. Decision Trees
3. Random Forests
4. Support Vector Machines (SVM)
5. Neural Networks
6. K-Nearest Neighbors (KNN)
7. Gradient Boosting
Deep Learning:
Evaluation Metrics (see the classifier sketch after this list):
1. Accuracy
2. Precision
3. Recall
4. F1-Score
5. ROC-AUC
6. Confusion Matrix
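A minimal end-to-end classification sketch: a decision tree trained on a built-in data set and evaluated with the confusion matrix and per-class precision, recall and F1 from the list above (all specific choices are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = DecisionTreeClassifier(max_depth=4, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))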
Classification Techniques:
1. Feature Engineering
2. Data Preprocessing
3. Hyperparameter Tuning
4. Ensemble Methods
5. Transfer Learning
Real-World Applications:
1. Image Classification
2. Sentiment Analysis
3. Spam Detection
4. Medical Diagnosis
5. Product Recommendation
Classification Tools:
1. Scikit-learn
2. TensorFlow
3. PyTorch
4. Keras
5. Microsoft Cognitive Toolkit (CNTK)
Best Practices:
Common Challenges:
1. Class imbalance
2. Overfitting
3. Underfitting
4. Data noise
5. Model interpretability
Advanced Classification Settings:
1. Anomaly Detection
2. One-Class Classification
3. Zero-Shot Learning
4. Few-Shot Learning
5. Meta-Learning
2.4.2 Regression
Regression in Machine Learning (ML) involves predicting a continuous numeric value from input features.
Regression Algorithms (a linear regression sketch follows these lists):
Linear Models:
1. Linear Regression
2. Ridge Regression
3. Lasso Regression
4. Elastic Net
Non-Linear Models:
1. Decision Trees
2. Random Forests
3. Support Vector Regression (SVR)
4. Neural Networks
5. Gradient Boosting
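A minimal regression sketch: an ordinary linear regression model evaluated with MSE and R-squared on held-out data (the diabetes data set is an illustrative choice):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)

print("MSE:", round(mean_squared_error(y_test, y_pred), 1))
print("R-squared:", round(r2_score(y_test, y_pred), 3))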
Evaluation Metrics:
1. Mean Squared Error (MSE)
2. Root Mean Squared Error (RMSE)
3. Mean Absolute Error (MAE)
4. R-Squared
Regression Techniques:
1. Feature Engineering
2. Data Preprocessing
3. Hyperparameter Tuning
4. Ensemble Methods
5. Transfer Learning
Real-World Applications:
Regression Tools:
1. Scikit-learn
2. TensorFlow
3. PyTorch
4. Keras
5. Microsoft Cognitive Toolkit (CNTK)
Best Practices:
Common Challenges:
1. Overfitting
2. Underfitting
3. Data noise
4. Model interpretability
5. Non-linear relationships
2.4.3 Clustering
Clustering in Machine Learning (ML) involves grouping similar data points into
clusters based on their features.
Types of Clustering:
1. Partitioning-based (e.g. K-Means)
2. Hierarchical (e.g. agglomerative)
3. Density-based (e.g. DBSCAN)
Clustering Algorithms:
1. K-Means
2. Hierarchical Agglomerative Clustering (HAC)
3. DBSCAN
4. OPTICS (Ordering Points To Identify the Clustering Structure)
5. K-Medoids
Evaluation Metrics (a K-Means and silhouette sketch follows this list):
1. Silhouette Coefficient
2. Calinski-Harabasz Index
3. Davies-Bouldin Index
4. Cluster Purity
5. Cluster Entropy
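A minimal clustering sketch: K-Means on synthetic blob data, scored with the silhouette coefficient from the metrics list above (all specific choices are illustrative):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups, clustered with K-Means.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print("Cluster sizes:", [int((kmeans.labels_ == k).sum()) for k in range(3)])
print("Silhouette coefficient:", round(silhouette_score(X, kmeans.labels_), 3))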
Clustering Techniques:
1. Feature Scaling
2. Data Preprocessing
3. Dimensionality Reduction
4. Ensemble Methods
5. Model Selection
Real-World Applications:
1. Customer Segmentation
2. Image Segmentation
3. Gene Expression Analysis
4. Text Clustering
5. Anomaly Detection
Clustering Tools:
1. Scikit-learn
2. TensorFlow
3. PyTorch
4. Keras
5. Microsoft Cognitive Toolkit (CNTK)
Best Practices:
Advanced Clustering Approaches:
1. Semi-Supervised Clustering
2. Active Learning
3. Transfer Learning
4. Deep Clustering
5. Spectral Clustering
1. Data Preprocessing
2. Feature Engineering
3. Hyperparameter Tuning
4. Model Selection
5. Ensemble Methods
6. Regularization Techniques
7. Cross-Validation
1. Gradient Boosting
2. Transfer Learning
3. Early Stopping
4. Batch Normalization
5. Dropout
6. Data Augmentation
7. Attention Mechanisms
Optimization Algorithms:
1. Gradient Descent (see the sketch after this list)
2. Stochastic Gradient Descent
3. Adam
4. RMSprop
5. Adagrad
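A minimal NumPy sketch of the first algorithm in the list, plain (batch) gradient descent, fitting a simple linear model; the synthetic data, learning rate and iteration count are arbitrary illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 5.0 + rng.normal(scale=1.0, size=100)   # true slope 3, intercept 5

w, b, lr = 0.0, 0.0, 0.01
for _ in range(2000):
    y_pred = w * x + b
    error = y_pred - y
    grad_w = 2 * np.mean(error * x)   # d(MSE)/dw
    grad_b = 2 * np.mean(error)       # d(MSE)/db
    w -= lr * grad_w                  # move against the gradient
    b -= lr * grad_b

print(f"Learned w={w:.2f}, b={b:.2f} (true values 3 and 5)")

Stochastic and mini-batch gradient descent use the same update rule but estimate the gradient from one example or a small batch at a time; Adam, RMSprop and Adagrad additionally adapt the learning rate per parameter.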
Performance Metrics:
Classification:
1. Accuracy
2. Precision
3. Recall
4. F1-Score
5. ROC-AUC
Regression:
1. Mean Squared Error (MSE)
2. Root Mean Squared Error (RMSE)
3. Mean Absolute Error (MAE)
4. R-Squared
Tools:
1. Scikit-learn
2. TensorFlow
3. PyTorch
4. Keras
5. Microsoft Cognitive Toolkit (CNTK)
Best Practices:
Common Challenges:
1. Overfitting
2. Underfitting
3. Data quality issues
4. Model interpretability
5. Scalability
Real-World Applications:
1. Image classification
2. Natural Language Processing
3. Recommender systems
4. Predictive maintenance
5. Autonomous vehicles
Key Considerations:
1. Data quality
2. Model complexity
3. Hyperparameter tuning
4. Regularization
5. Model interpretability
Advanced Techniques:
1. Bayesian Optimization
2. Gradient-Based Optimization
3. Evolutionary Optimization
4. Deep Learning
5. Transfer Learning