ML - Unit - 1 (24-25)
UNIT I
Paavai Engineering College Department of MCA
CONTENTS
TECHNICAL TERMS
1.1 MACHINE LEARNING LANDSCAPE:
1.1.1 INTRODUCTION
1.2 TYPES OF MACHINE LEARNING SYSTEMS
1.2.1 Supervised Learning Systems
1.2.2 UnSupervised Learning Systems
1.2.3 Semi-Supervised Learning Systems
1.2.4 Reinforcement Learning Systems
1.2.5 MAIN CHALLENGES OF MACHINE LEARNING
1.3 TESTING AND VALIDATING
1.4 END TO END MACHINE LEARNING PROJECT:
1.4.1 Working With Real Data
1.4.2 Discover and Generalize the Data to Gain Insights
1.4.3 Prepare the Data for Machine Learning Algorithms
1.4.4 Select and Train a Model
1.4.5 Fine – Tune the Model
TECHNICAL TERMS
S.NO | TERMS | LITERAL MEANING | TECHNICAL MEANING
1 | LOGICAL | IMAGINATION | VIRTUAL
INTRODUCTION
1.0 Machine Learning
1. Machine learning is a growing technology which enables computers to learn automatically from past data.
2. Machine learning uses various algorithms for building mathematical models and making predictions using
historical data or information.
3. Currently, it is being used for various tasks such as image recognition, speech recognition, email
filtering, Facebook auto-tagging, recommender system, and many more.
1. Machine learning is a subset of artificial intelligence that is mainly concerned with the development of
algorithms which allow a computer to learn from data and past experiences on its own.
2. The term machine learning was first introduced by Arthur Samuel in 1959.
1.0.2 Definition
Machine learning enables a machine to automatically learn from data, improve performance from experiences,
and predict things without being explicitly programmed.
1. A Machine Learning system learns from historical data, builds the prediction models, and whenever it
receives new data, predicts the output for it.
2. The accuracy of the predicted output depends upon the amount and quality of the data.
3. In general, a larger amount of representative data helps to build a better model, which predicts the output more accurately.
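The learn-then-predict cycle described above can be sketched in a few lines. This is a minimal illustration, assuming scikit-learn is installed; the hours and scores are made-up numbers, and LinearRegression is only one possible choice of model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical historical data: hours studied -> exam score (made-up numbers).
hours = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
scores = np.array([52.0, 54.0, 56.0, 58.0, 60.0])  # exactly 50 + 2 * hours

model = LinearRegression()
model.fit(hours, scores)              # learn from the historical data

new_hours = np.array([[6.0]])         # new, unseen input
predicted = model.predict(new_hours)  # predict the output for it
print(predicted)
```

Nothing in the code states the rule "score = 50 + 2 * hours"; the model recovers it from the examples alone.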
2. IDENTIFYING PATTERNS
2.1 Language Translation Services
Learning from Examples: Just like how you learn a language by listening and practicing, machine
learning algorithms learn from lots of examples. They look at texts in two languages to understand
how words and sentences are translated from one language to another.
Finding Patterns:
The algorithms find patterns in these texts. For example, they learn how common phrases in
English are said in Spanish, and they remember these patterns.
Translating New Sentences:
When you ask the algorithm to translate something new, it uses the patterns it learned to translate
the new sentences. This is how Google Translate or other translation apps work. They use these
patterns to guess the best translation.
Reduced Human Load: As the system improves, the reliance on human moderators for routine content
checks decreases, allowing them to focus on more complex moderation tasks.
5. GENERALIZATION
5.1 Self Driving Car
Training Phase:
1. Learning to Drive: Imagine a self-driving car is being trained using machine learning. During its
training, it's exposed to various driving scenarios - city streets, highways, different weather conditions,
daytime and nighttime driving, etc.
2. Data Collection: It collects data from sensors, cameras, and GPS, learning how to recognize road signs,
traffic lights, pedestrians, and other vehicles. It also learns how to respond to these - when to stop, how
to navigate turns, and how to maintain a safe distance from other objects.
3. Generalization:
New City, New Roads: Now, suppose this self-driving car is placed in a completely new city that it has
never 'seen' during its training.
4. Applying Learned Skills: The car's machine learning system uses the general driving rules and patterns
it learned during training to navigate the new environment. It recognizes stop signs (even if they look
slightly different), understands traffic light signals, and knows how to avoid pedestrians and other
vehicles, despite never encountering these specific roads and conditions before.
5. Adaptability: If the car can successfully navigate this new environment, it shows good generalization.
It means the car's machine learning model didn't just memorize specific roads and scenarios but learned
general driving rules applicable to various situations.
CONCLUSION: Machine learning, with its diverse applications, is transforming industries and everyday
life. Its ability to learn from data, recognize complex patterns, and make informed predictions or decisions
is what makes it a revolutionary tool. Whether it's enhancing personal health, breaking language barriers,
securing financial transactions, moderating online content, automating driving, or optimizing
manufacturing processes, ML's impact is profound and growing.
Types of Machine Learning
Machine learning is a subset of AI, which enables the machine to automatically learn from data,
improve performance from past experiences, and make predictions.
Machine learning contains a set of algorithms that work on a huge amount of data.
Data is fed to these algorithms to train them, and on the basis of training, they build the model & perform
a specific task.
These ML algorithms help to solve different business problems like Regression, Classification, Forecasting,
Clustering, and Associations, etc.
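Based on the methods and way of learning, machine learning is divided into mainly four types, which are:
o Supervised Learning
o Unsupervised Learning
o Semi-Supervised Learning
o Reinforcement Learning
1. Supervised Learning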
It means in the supervised learning technique, we train the machines using the "labelled" dataset, and based
on the training, the machine predicts the output.
Here, the labelled data specifies that some of the inputs are already mapped to the output.
More precisely, we first train the machine with inputs and their corresponding outputs, and then
we ask the machine to predict the output using the test dataset.
For example, suppose we have an input dataset of cat and dog images. First, we train the machine
to understand the images using features such as the shape and size of the tail,
the shape of the eyes, colour, and height (dogs are usually taller, cats smaller).
After completion of training, we input the picture of a cat and ask the machine to identify the object and
predict the output.
Now, the machine is well trained, so it will check all the features of the object, such as height, shape, colour,
eyes, ears, tail, etc., and find that it's a cat. So, it will put it in the Cat category.
This is the process of how the machine identifies the objects in Supervised Learning.
The main goal of the supervised learning technique is to map the input variable(x) with the output
variable(y).
Some real-world applications of supervised learning are Risk Assessment, Fraud Detection, Spam
filtering, etc.
Supervised machine learning can be classified into two types of problems, which are given below:
o Classification
o Regression
a) Classification
Classification algorithms are used to solve the classification problems in which the output variable is
categorical, such as "Yes" or "No", Male or Female, Red or Blue, etc. The classification algorithms predict the
categories present in the dataset.
Some real-world examples of classification algorithms are Spam Detection, Email filtering, etc.
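A classification problem of this kind can be sketched as follows. This is a hedged example assuming scikit-learn; the two features (number of links and number of "free" mentions per email) and all values are hypothetical, and LogisticRegression stands in for any classification algorithm.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical features per email: [number of links, number of "free" mentions]
X = [[0, 0], [1, 0], [0, 1], [7, 5], [8, 6], [9, 4]]
y = ["not spam", "not spam", "not spam", "spam", "spam", "spam"]

clf = LogisticRegression()
clf.fit(X, y)                            # train on labelled examples

labels = clf.predict([[0, 1], [8, 5]])   # output is a category, not a number
print(labels)
```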
b) Regression
Regression algorithms are used to solve regression problems in which the output variable is a continuous value that depends on the
input variables.
These are used to predict continuous output variables, such as market trends, weather prediction, etc. Some popular
Regression algorithms are given below:
Advantages:
o Since supervised learning works with a labelled dataset, we can have an exact idea about the classes of objects.
o These algorithms are helpful in predicting the output on the basis of prior experience.
Disadvantages:
o Image Segmentation:
Supervised Learning algorithms are used in image segmentation. In this process, image classification is performed
on different image data with pre-defined labels.
o Medical Diagnosis:
Supervised algorithms are also used in the medical field for diagnosis purposes. It is done by using medical images
and past data labelled with disease conditions. With such a process, the machine can identify a disease
for new patients.
o Fraud Detection - Supervised Learning classification algorithms are used for identifying fraud transactions, fraud
customers, etc. It is done by using historic data to identify the patterns that can lead to possible fraud.
o Spam detection - In spam detection & filtering, classification algorithms are used. These algorithms classify an
email as spam or not spam. The spam emails are sent to the spam folder.
o Speech Recognition - Supervised learning algorithms are also used in speech recognition. The algorithm is trained
with voice data, and various identifications can be done using the same, such as voice-activated passwords, voice
commands, etc.
2. Unsupervised Learning
Unsupervised learning is different from the Supervised learning technique; as its name suggests, there is no need for
supervision.
It means, in unsupervised machine learning, the machine is trained using the unlabeled dataset, and the machine
predicts the output without any supervision.
In unsupervised learning, the models are trained with the data that is neither classified nor labelled, and the model
acts on that data without any supervision.
The main aim of the unsupervised learning algorithm is to group or categorize the unsorted dataset according
to the similarities, patterns, and differences.
Machines are instructed to find the hidden patterns from the input dataset.
Let's take an example to understand it more precisely; suppose there is a basket of fruit images, and we input it into
the machine learning model.
The images are totally unknown to the model, and the task of the machine is to find the patterns and categories of the
objects.
So, now the machine will discover its patterns and differences, such as colour difference, shape difference, and predict
the output when it is tested with the test dataset.
Unsupervised Learning can be further classified into two types, which are given below:
o Clustering
o Association
1) Clustering
The clustering technique is used when we want to find the inherent groups from the data.
It is a way to group the objects into a cluster such that the objects with the most similarities remain in one group and
have fewer or no similarities with the objects of other groups.
An example of the clustering algorithm is grouping the customers by their purchasing behaviour.
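The customer-grouping example can be sketched with k-means, one common clustering algorithm (chosen here for illustration; the notes do not name a specific one). The customer values are made up, and scikit-learn is assumed to be available.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers: [orders per month, average order value]
customers = np.array([
    [1, 20], [2, 25], [1, 22],         # occasional, low spend
    [10, 200], [12, 220], [11, 210],   # frequent, high spend
])

# No labels are given; the algorithm only sees the feature values.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
groups = kmeans.fit_predict(customers)
print(groups)
```

The cluster numbers themselves are arbitrary; what matters is that similar customers receive the same number.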
2) Association
Association rule learning is an unsupervised learning technique, which finds interesting relations among variables
within a large dataset.
The main aim of this learning algorithm is to find the dependency of one data item on another data item and map
those variables accordingly so that it can generate maximum profit.
This algorithm is mainly applied in Market Basket analysis, Web usage mining, continuous production, etc.
Some popular algorithms of Association rule learning are Apriori Algorithm, Eclat, FP-growth algorithm.
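The "dependency of one data item on another" is usually quantified with support and confidence. The sketch below computes both for one rule on made-up transactions. It is not an implementation of Apriori, Eclat, or FP-growth, only the two measures such algorithms rely on.

```python
# Hypothetical market-basket transactions (made-up data).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Share of transactions with the antecedent that also hold the consequent."""
    return support(antecedent | consequent) / support(antecedent)

rule_support = support({"bread", "milk"})           # 3 of 5 baskets
rule_confidence = confidence({"bread"}, {"milk"})   # 3 of the 4 bread baskets
print(rule_support, rule_confidence)
```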
Advantages:
o These algorithms can be used for complicated tasks compared to the supervised ones because these algorithms
work on the unlabeled dataset.
o Unsupervised algorithms are preferable for various tasks as getting the unlabeled dataset is easier as compared to
the labelled dataset.
Disadvantages:
o The output of an unsupervised algorithm can be less accurate as the dataset is not labelled, and algorithms are not
trained with the exact output in prior.
o Working with Unsupervised learning is more difficult as it works with the unlabelled dataset that does not map
with the output.
o Network Analysis: Unsupervised learning is used for identifying plagiarism and copyright in document network
analysis of text data for scholarly articles.
o Recommendation Systems: Recommendation systems widely use unsupervised learning techniques for building
recommendation applications for different web applications and e-commerce websites.
o Anomaly Detection: Anomaly detection is a popular application of unsupervised learning, which can identify
unusual data points within the dataset. It is used to discover fraudulent transactions.
o Singular Value Decomposition: Singular Value Decomposition or SVD is used to extract particular information
from the database. For example, extracting information of each user located at a particular location.
3. Semi-Supervised Learning
Semi-Supervised learning is a type of Machine Learning algorithm that lies between Supervised and
Unsupervised machine learning.
It represents the intermediate ground between Supervised (With Labelled training data) and Unsupervised learning
(with no labelled training data) algorithms and uses the combination of labelled and unlabeled datasets during the
training period.
Although Semi-supervised learning is the middle ground between supervised and unsupervised learning and operates
on the data that consists of a few labels, it mostly consists of unlabeled data.
Labels are costly to obtain, but in practice an organization may still have a few of them. Semi-supervised learning differs from both supervised
and unsupervised learning, which are defined by the presence and absence of labels respectively.
To overcome the drawbacks of supervised learning and unsupervised learning algorithms, the concept of Semi-
supervised learning is introduced.
The main aim of semi-supervised learning is to effectively use all the available data, rather than only labelled data
like in supervised learning.
Initially, similar data is clustered along with an unsupervised learning algorithm, and further, it helps to label the
unlabeled data into labelled data.
It is because labelled data is comparatively more expensive to acquire than unlabeled data.
For example, supervised learning is like a student who is under the supervision of an instructor at home and college.
If that student analyzes the same concept on their own, without any help from the instructor, it comes under
unsupervised learning.
Under semi-supervised learning, the student revises the concept on their own after analyzing it under the guidance
of an instructor at college.
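The cluster-then-label idea described earlier in this section can be sketched as follows. This is a minimal illustration with made-up points, assuming scikit-learn; it also assumes every cluster contains at least one labelled point, which real data does not guarantee.

```python
import numpy as np
from sklearn.cluster import KMeans

# Six points, only two of which carry a label (None means unlabelled).
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
              [5.0, 5.0], [5.1, 4.8], [4.9, 5.2]])
labels = ["cat", None, None, "dog", None, None]

# Step 1: group similar points without using any labels.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Step 2: each cluster inherits the label of its labelled member.
cluster_label = {c: lab for c, lab in zip(clusters, labels) if lab is not None}
filled = [cluster_label[c] for c in clusters]
print(filled)
```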
Advantages:
o It is simple and easy to understand the algorithm.
o It is highly efficient.
o It is used to solve drawbacks of Supervised and Unsupervised Learning algorithms.
Disadvantages:
4. Reinforcement Learning
The agent gets rewarded for each good action and punished for each bad action; hence the goal of a reinforcement
learning agent is to maximize the rewards.
In reinforcement learning, there is no labelled data like supervised learning, and agents learn from their experiences
only.
The reinforcement learning process is similar to a human being; for example, a child learns various things by
experiences in his day-to-day life.
An example of reinforcement learning is to play a game, where the Game is the environment, moves of an agent at
each step define states, and the goal of the agent is to get a high score.
Due to its way of working, reinforcement learning is employed in different fields such as Game theory, Operation
Research, Information theory, multi-agent systems.
A reinforcement learning problem can be formalized using Markov Decision Process(MDP). In MDP, the agent
constantly interacts with the environment and performs actions; at each action, the environment responds and
generates a new state.
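The agent-environment loop can be sketched with tabular Q-learning, one standard RL algorithm that these notes do not name; it is used here only as an illustration. The environment is a hypothetical five-cell corridor where reaching the last cell gives a reward of 1; the learning rate, discount factor, and episode count are assumed values.

```python
import random

N_STATES = 5          # corridor cells 0..4; reaching cell 4 ends the episode
ACTIONS = (-1, +1)    # move left or move right
ALPHA, GAMMA = 0.5, 0.9

def step(state, action):
    """Environment: return (next_state, reward, done) for one action."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

rng = random.Random(0)
q = [[0.0, 0.0] for _ in range(N_STATES)]   # q[state][action index]

for _ in range(500):                         # episodes
    state, done = 0, False
    while not done:
        a = rng.randrange(2)                 # explore with random actions
        nxt, reward, done = step(state, ACTIONS[a])
        # Q-learning update: move the estimate toward reward + discounted best next value.
        target = reward + (0.0 if done else GAMMA * max(q[nxt]))
        q[state][a] += ALPHA * (target - q[state][a])
        state = nxt

policy = ["left" if q[s][0] > q[s][1] else "right" for s in range(N_STATES - 1)]
print(policy)
```

The agent is never told that moving right is correct; it infers this from the rewards it experiences.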
o Positive Reinforcement Learning: Positive reinforcement learning specifies increasing the tendency that the
required behaviour would occur again by adding something. It enhances the strength of the behaviour of the agent
and positively impacts it.
o Negative Reinforcement Learning: Negative reinforcement learning works exactly opposite to the positive RL.
It increases the tendency that the specific behaviour would occur again by avoiding the negative condition.
o VideoGames:
RL algorithms are very popular in gaming applications, where they are used to reach super-human performance. Well-known
RL-based game-playing programs are AlphaGo and AlphaGo Zero.
o Resource Management:
The "Resource Management with Deep Reinforcement Learning" paper showed how to use RL in computers
to automatically learn to schedule resources among waiting jobs in order to minimize average job slowdown.
o Robotics:
RL is widely being used in Robotics applications. Robots are used in the industrial and manufacturing area, and
these robots are made more powerful with reinforcement learning. There are different industries that have their
vision of building intelligent robots using AI and Machine learning technology.
o Text Mining
Text-mining, one of the great applications of NLP, is now being implemented with the help of Reinforcement
Learning by Salesforce company.
Advantages
o It helps in solving complex real-world problems which are difficult to be solved by general techniques.
o The learning model of RL is similar to the learning of human beings; hence it can reach highly accurate results on such problems.
o Helps in achieving long term results.
Disadvantages
Overfitting occurs when a machine learning model fits its training data, including the noise and bias in that data, so closely that its
performance on new data is negatively affected.
Unfortunately, this is one of the significant issues faced by machine learning professionals.
This means that the algorithm is trained with noisy and biased data, which will affect its overall performance.
Consider a model trained to differentiate between a cat, a rabbit, a dog, and a tiger.
The training data contains 1000 cats, 1000 dogs, 1000 tigers, and 4000 Rabbits.
Then there is a considerable probability that it will identify the cat as a rabbit.
In this example, we had a vast amount of data, but it was biased; hence the prediction was negatively affected.
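The imbalance in this example becomes visible once the classes are counted. The sketch below uses the counts from the example and also computes inverse-frequency class weights. That weighting scheme is an assumption added for illustration, a common way to compensate for imbalance rather than a remedy taken from these notes.

```python
from collections import Counter

# Class counts taken from the example above.
labels = ["cat"] * 1000 + ["dog"] * 1000 + ["tiger"] * 1000 + ["rabbit"] * 4000
counts = Counter(labels)
total = len(labels)

# Share of each class in the training data.
shares = {cls: n / total for cls, n in counts.items()}

# Inverse-frequency weights: rarer classes get larger weights.
weights = {cls: total / (len(counts) * n) for cls, n in counts.items()}

print(shares)
print(weights)
```

Rabbits make up more than half of the data, so a model can score well by favouring "rabbit"; the weights show how much each class would need to be scaled to count equally.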
We can tackle this issue by:
Analyzing the data with the utmost level of perfection
Use data augmentation technique
Remove outliers in the training set
Select a model with lesser features
4. Machine Learning is a Complex Process
The most important task in the machine learning process is to train the model on enough data to achieve an
accurate output.
Too little training data will produce inaccurate or overly biased predictions.
Consider a machine learning algorithm similar to training a child.
One day you decided to explain to a child how to distinguish between an apple and a watermelon. You will take an
apple and a watermelon and show him the difference between both based on their color, shape, and taste.
In this way, soon, he will attain perfection in differentiating between the two.
But on the other hand, a machine-learning algorithm needs a lot of data to distinguish.
For complex problems, it may even require millions of training examples.
Therefore we need to ensure that Machine learning algorithms are trained with sufficient amounts of data.
6. Slow Implementation
This is one of the common issues faced by machine learning professionals.
The machine learning models are highly efficient in providing accurate results, but it takes a tremendous amount
of time.
Slow programs, data overload, and excessive requirements usually take a lot of time to provide accurate results.
Further, it requires constant monitoring and maintenance to deliver the best output.
So you have found quality data, trained it amazingly, and the predictions are really concise and accurate.
Yay, you have learned how to create a machine learning algorithm!! But wait, there is a twist; the model may
become useless in the future as data grows.
The best model of the present may become inaccurate in the coming Future and require further rearrangement.
So you need regular monitoring and maintenance to keep the algorithm working.
This is one of the most exhausting issues faced by machine learning professionals.
Conclusion: Machine learning is all set to bring a big bang transformation in technology.
It is one of the most rapidly growing technologies used in medical diagnosis, speech recognition, robotic training,
product recommendations, video surveillance, and this list goes on.
This continuously evolving domain offers immense job satisfaction, excellent opportunities, global exposure, and
exorbitant salary.
It is a high risk and a high return technology.
Before starting your machine learning journey, ensure that you carefully examine the challenges mentioned above.
To learn this fantastic technology, you need to plan carefully, stay patient, and maximize your efforts.
Once you win this battle, you can conquer the Future of work and land your dream job!
In this article, we are going to see how to Train, Test and Validate the Sets.
The fundamental purpose of splitting the dataset is to assess how effective the trained model will be in generalizing
to new data. This split can be achieved by using the train_test_split function of scikit-learn.
Training Set
This is the actual dataset from which a model trains, i.e. the model sees and learns from this data to predict the outcome
or to make the right decisions. Most of the training data is collected from several resources and then preprocessed and
organized to provide proper performance of the model. The type of training data hugely determines the ability of the
model to generalize, i.e. the better the quality and diversity of training data, the better will be the performance of the
model. This data is more than 60% of the total data available for the project.
Example:
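Python3
import numpy as np
from sklearn.model_selection import train_test_split
x = np.arange(16).reshape((8,2))
# target variable
y = range(8)
x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                    train_size=0.8,
                                                    random_state=42)
# Training set
print("Training set x:", x_train)
print("Training set y:", y_train)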
Output:
Training set x: [[ 0 1]
[14 15]
[ 4 5]
[ 8 9]
[ 6 7]
[12 13]]
Training set y: [0, 7, 2, 4, 3, 6]
Explanation:
Firstly we created a dummy matrix of 8×2 shape using NumPy library to represent input x. And a list of 0 to 7 integers
representing our target variable y.
Now in order to split our dataset into training and testing data, a function named train_test_split of sklearn library is
used.
Input data x with target variable y is passed as parameters to the function, which then divides the dataset into 2 parts according to
the size given in train_size, i.e. if train_size=0.8 is given then the training set will be about 80% of the given dataset
and the testing set about 20% (here 6 and 2 of the 8 rows).
The function shuffles the data randomly before splitting; specifying random_state as a fixed integer makes that shuffle, and therefore the split, reproducible.
Testing Set
This dataset is independent of the training set but has a similar probability distribution of classes
and is used as a benchmark to evaluate the model, used only after the training of the model is complete. The testing set is
usually a properly organized dataset having all kinds of data for scenarios that the model would probably be facing
when used in the real world. Often the validation and testing sets are combined and used as a single testing set, which is not
considered a good practice. If the accuracy of the model on training data is much greater than that on testing data, then the
model is said to be overfitting. This data is approximately 20-25% of the total data available for the project.
Example:
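Python3
import numpy as np
from sklearn.model_selection import train_test_split
x = np.arange(16).reshape((8, 2))
# target variable
y = range(8)
x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                    test_size=0.2,
                                                    random_state=42)
# Testing set
print("Testing set x:", x_test)
print("Testing set y:", y_test)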
Output:
Testing set x: [[ 2 3] [10 11]]
Testing set y: [1, 5]
Explanation:
To show how the train_test_split function works we first created a dummy matrix of 8×2 shape using NumPy library
to represent input x. And a list of 0 to 7 integers representing our target variable y.
Now in order to split our dataset into training and testing data, input data x with target variable y is passed as
parameters to the function, which then divides the dataset into 2 parts according to the size given in test_size, i.e. if test_size=0.2 is
given then the testing set will be about 20% of the given dataset and the training set
about 80%.
The function shuffles the data randomly before splitting; specifying random_state as a fixed integer makes that shuffle, and therefore the split, reproducible.
Validation Set
The validation set is used to fine-tune the hyperparameters of the model and is considered a part of the training of the
model. The model only sees this data for evaluation but does not learn from this data, providing an objective, unbiased
evaluation of the model. The validation dataset can also be used for regularization by interrupting the training of the model
when the loss on the validation dataset stops improving while the loss on the training dataset keeps falling, i.e. to limit overfitting. This data
is approximately 10-15% of the total data available for the project, but this can change depending upon the number of
hyperparameters, i.e. if the model has many hyperparameters then using a larger validation set will give better results.
When the accuracy of the model on validation data is close to or greater than that on training data, the model is said
to have generalized well.
Example:
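Python3
import numpy as np
from sklearn.model_selection import train_test_split
x = np.arange(24).reshape((8,3))
# target variable
y = range(8)
# first split: training set vs. the combined validation + testing set
x_train, x_Combine, y_train, y_Combine = train_test_split(x, y,
                                                          train_size=0.8,
                                                          random_state=42)
# second split: validation set vs. testing set
x_val, x_test, y_val, y_test = train_test_split(x_Combine,
                                                y_Combine,
                                                test_size=0.5,
                                                random_state=42)
# Training set
print("Training set x:", x_train)
print("Training set y:", y_train)
print(" ")
# Testing set
print("Testing set x:", x_test)
print("Testing set y:", y_test)
print(" ")
# Validation set
print("Validation set x:", x_val)
print("Validation set y:", y_val)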
Output:
Training set x: [[ 0 1 2]
[21 22 23]
[ 6 7 8]
[12 13 14]
[ 9 10 11]
[18 19 20]]
Training set y: [0, 7, 2, 4, 3, 6]
Explanation:
To get the validation set, a dummy matrix of 8×3 shape is created using the NumPy library to represent input x, and
a list of 0 to 7 integers represents our target variable y.
Dividing the dataset into 3 parts takes two steps. To begin with, input data x with target variable y is passed as
parameters to train_test_split, which divides the dataset into 2 parts according to the size given in
train_size (from this we’ll get our training set), i.e. if train_size=0.8 is given then the training set will be 80% of the
given dataset and the other set will be 20%.
So now we have a combined validation and testing set holding 20% of the initially given dataset. This combined set is
then passed to train_test_split again, which divides it into 2 parts according to the size given in test_size, i.e. if
test_size=0.5 is given then the testing set and the validation set will each be 50% of the combined dataset.
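The role of the three sets can be sketched end to end: tune on the validation set, then report once on the test set. This is a minimal illustration on synthetic data, assuming scikit-learn; the k-nearest-neighbours model and the candidate values of k are arbitrary choices made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data (made up for illustration).
rng = np.random.RandomState(42)
x = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(3.0, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

# 80% training, then the remaining 20% split evenly into validation and test.
x_train, x_rest, y_train, y_rest = train_test_split(
    x, y, train_size=0.8, random_state=42)
x_val, x_test, y_val, y_test = train_test_split(
    x_rest, y_rest, test_size=0.5, random_state=42)

# Pick the hyperparameter value that scores best on the validation set.
best_k, best_score = None, -1.0
for k in (1, 3, 5, 7):
    model = KNeighborsClassifier(n_neighbors=k).fit(x_train, y_train)
    score = model.score(x_val, y_val)
    if score > best_score:
        best_k, best_score = k, score

# The test set is used once, at the end, for the final estimate.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(x_train, y_train)
test_accuracy = final_model.score(x_test, y_test)
print(best_k, test_accuracy)
```

Keeping the test set out of the tuning loop is what makes the final accuracy an unbiased estimate.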
Introduction
Pre-Requisites
Basic understanding of Linear Regression Algorithm. If you have no idea about the algorithm,
please refer to the link before going to the later part of the article, so that you have a basic understanding
of all the concepts which we will cover.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Step 2: Study the Data
Here we will work on the E-commerce Customers dataset (CSV file). It has Customer information,
such as Email, Address, and color Avatar. Then it also has numerical value columns:
Average Session Length: Average session of in-store style advice sessions.
Time on App: Average time spent by the customer on App in minutes
Time on Website: Average time spent by the customer on Website in minutes
Length of Membership: How many years the customer has been a member.
In this step, we will read and load the dataset using some basic function of pandas such as
Output:
Joint plots: Time on Website vs Yearly Amount Spent; Time on App vs Yearly Amount Spent; Time on App vs Length of Membership
Use seaborn to create a joint plot to compare the Time on Website and Yearly Amount Spent columns.
sns.jointplot(x='Time on Website',y='Yearly Amount Spent',data=df)
Output:
Use joint plot to create a 2D hex bin plot comparing Time on App and Length of Membership
sns.jointplot(x='Time on App',y='Length of Membership',kind="hex",data=df)
Output:
Let’s explore these types of relationships across the entire data set. Use Pair plot to recreate the plot
below:
sns.pairplot(df)
Based on this plot what looks to be the most correlated feature with the Yearly Amount Spent?
Length of Membership
Create a linear model plot (using seaborn’s lmplot) of Yearly Amount Spent vs. Length of Membership
Now that we have explored the data a bit, it’s time to go ahead and split our initial data into training
and testing subsets. Here we set a variable X i.e, independent columns as the numerical features of
the customers, and a variable y i.e, dependent column as the “Yearly Amount Spent” column.
Separate Dependent and Independent Variable
X = customers[['Avg. Session Length', 'Time on App', 'Time on Website', 'Length of Membership']]
y = customers['Yearly Amount Spent']
Use model_selection.train_test_split from sklearn to split the data into training and testing sets
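A possible form of this split is sketched below. The DataFrame is a synthetic stand-in for the E-commerce Customers data (random values, 20 rows), and test_size=0.25 and random_state=101 are assumed values, since the notes do not state which were used.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the E-commerce Customers data (values are random).
rng = np.random.RandomState(0)
features = ['Avg. Session Length', 'Time on App',
            'Time on Website', 'Length of Membership']
customers = pd.DataFrame(rng.rand(20, 4), columns=features)
customers['Yearly Amount Spent'] = rng.rand(20) * 500

X = customers[features]
y = customers['Yearly Amount Spent']

# test_size and random_state are assumptions, not taken from the notes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=101)
print(X_train.shape, X_test.shape)
```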
Now, at this step we are able to train our model on our training data using Linear Regression.
Import LinearRegression from sklearn.linear_model
from sklearn.linear_model import LinearRegression
Create an instance of a LinearRegression() model named lr_model.
lr_model = LinearRegression()
lr_model.fit(X_train,y_train)
Output:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
Print out the coefficients of the model
lr_model.coef_
Output:
array([25.98154972, 38.59015875, 0.19040528, 61.27909654])
Step 7: Predictions on Test Data
Now that we have trained our model, let’s evaluate its performance by doing the predictions on the
unseen data.
Use lr_model.predict() to predict off the X_test set of the data
predictions = lr_model.predict(X_test)
Create a scatterplot of the real test values versus the predicted values
plt.scatter(y_test,predictions)
plt.xlabel('Y Test')
plt.ylabel('Predicted Y')
Output:
To evaluate our model performance, we will be calculating the residual sum of squares and the explained
variance score (R2).
Determine the metrics such as Mean Absolute Error, Mean Squared Error, and the Root Mean Squared Error.
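These metrics can be computed with sklearn.metrics, as in the sketch below. The two small arrays are made-up stand-ins for y_test and predictions, so the numbers are not the results of the customer model.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up values standing in for y_test and predictions.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)   # mean of |error|
mse = mean_squared_error(y_true, y_pred)    # mean of error squared
rmse = np.sqrt(mse)                         # back in the units of y
r2 = r2_score(y_true, y_pred)               # share of variance explained

print('MAE:', mae)
print('MSE:', mse)
print('RMSE:', rmse)
print('R2:', r2)
```

RMSE is often the easiest to interpret because it is expressed in the same units as the target.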
sns.distplot(y_test - predictions,bins=50)
Output:
Steps in an ML Project
Getting Real Data
System Design
Supervised learning
Data is labeled
Regression
Model will predict a value
Batch learning
No additional data will be added later
Types of Regression
Multiple regression
Uses multiple features to predict a value
Univariate regression
Predicts a single value
Multivariate regression
Predicts multiple values
Select a Performance Measure
Root Mean Square Error (RMSE)
• The square root of the mean of the squared errors over all items of data
• The most commonly used measure for regression tasks
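Written out, RMSE takes the following form. The symbols are notation assumed here, not defined in these notes: m is the number of items, h is the model's prediction function, and x^(i) and y^(i) are the features and true value of item i.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( h\left(x^{(i)}\right) - y^{(i)} \right)^{2}}
```

Because each error is squared before averaging, large errors count more heavily than small ones.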
Check the Assumptions
We're assuming the price will be used as a numerical value
If the next stage just uses categories, like "cheap", "medium", or "expensive" we should be using
classification instead of regression.
2GetTheData
UNIT-I 1. 28
Paavai Engineering College Department of MCA
LoadDatafromGithub
head()ShowsFirstFiveRows
UNIT-I 1. 29
Paavai Engineering College Department of MCA
UNIT-I 1. 30
Paavai Engineering College Department of MCA
value_counts()
describe() Shows Statistics
Histograms
Median Income
Median income is not expressed in dollars. It has been scaled and capped at a maximum of 15 and a minimum of 0.5. The numbers represent roughly tens of thousands of
dollars. Preprocessed attributes are common in ML, so this should be OK.
Other Capped Values
Stratified Sampling
• Take the important feature and gather it into categories
• Then sample the correct number from each category
• Training and test sets match now
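The two stratified sampling steps can be sketched in plain Python as follows. This is an illustration under stated assumptions: the helper names stratified_split and income_category, the bin edges, and the synthetic rows are all hypothetical. scikit-learn's train_test_split accepts a stratify argument that serves the same purpose.

```python
import random
from bisect import bisect_left
from collections import Counter, defaultdict

def income_category(row):
    # Hypothetical bin edges over the scaled median income
    return bisect_left([1.5, 3.0, 4.5, 6.0], row["median_income"]) + 1

def stratified_split(rows, key, test_ratio=0.2, seed=42):
    """Split rows so each category keeps its share in the test set."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)          # gather rows into categories
    train, test = [], []
    for members in strata.values():
        members = members[:]                  # copy, leave the input untouched
        rng.shuffle(members)
        n_test = round(len(members) * test_ratio)
        test.extend(members[:n_test])         # sample the right number per category
        train.extend(members[n_test:])
    return train, test

# Synthetic rows standing in for the housing data
rows = [{"median_income": 0.5 + (i % 100) * 0.1} for i in range(1000)]
train_set, test_set = stratified_split(rows, income_category)

overall = Counter(income_category(r) for r in rows)
in_test = Counter(income_category(r) for r in test_set)
for cat in sorted(overall):
    print(cat, overall[cat] / len(rows), in_test[cat] / len(test_set))
```

The printed proportions for the full data and the test set should be close for every category, which is what "training and test sets match" means here.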
Transparency
• Alpha = 0.2 shows more detail in high-density areas
Add Price with Color
• Areas near the ocean and with higher population density have higher prices
Correlations
Strongest correlations with median_house_value: median_income, total_rooms,
housing_median_age, latitude
median_income
• Correlation is strong
• Clusters of points at $500,000, $450,000, and $350,000
Correlation Assumes a Line
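A small sketch can show why the correlation coefficient only captures linear relationships. The function pearson_r and the sample points are hypothetical illustrations: a perfect straight line gives a coefficient of about 1, while a perfect parabola over symmetric x values gives 0, even though y is fully determined by x.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

xs = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * x + 1 for x in xs]      # straight-line relationship
parabola = [x * x for x in xs]        # nonlinear but fully dependent on x

r_linear = pearson_r(xs, linear)
r_parabola = pearson_r(xs, parabola)
print(r_linear, r_parabola)
```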
One-Hot Vectors
• A better way to represent categorical data
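One-hot encoding can be sketched in a few lines of plain Python. The function one_hot_encode and the category labels below are hypothetical examples. scikit-learn provides a OneHotEncoder class for real pipelines.

```python
def one_hot_encode(values):
    """Return the sorted category list and one 0/1 vector per input value."""
    categories = sorted(set(values))
    position = {cat: i for i, cat in enumerate(categories)}
    vectors = []
    for value in values:
        vector = [0] * len(categories)
        vector[position[value]] = 1       # exactly one entry is "hot"
        vectors.append(vector)
    return categories, vectors

# Hypothetical category labels
proximity = ["INLAND", "NEAR OCEAN", "INLAND", "NEAR BAY"]
categories, encoded = one_hot_encode(proximity)
print(categories)
print(encoded)
```

Each category gets its own column, so the model does not read an artificial ordering into the category codes.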
Feature Scaling and Transformation
• Number of rooms ranges from 6 to 39,320
• Median incomes range from 0 to 15
• Models will weight number of rooms far more highly than income
• To prevent this, scale data in one of two ways:
min-max scaling
• Every value ranges from 0 to 1
• Or -1 to 1 for neural nets
standardization
• Subtract the mean, then divide by the standard deviation
• Does not limit values to a fixed range
• Less affected by outliers
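Both scaling methods can be sketched in plain Python. The function names and the sample list are hypothetical; only the endpoints 6 and 39,320 come from the text. scikit-learn offers MinMaxScaler and StandardScaler for the same job.

```python
import statistics

def min_max_scale(values, low=0.0, high=1.0):
    """Rescale values linearly so they span [low, high]."""
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    return [low + (v - v_min) * (high - low) / span for v in values]

def standardize(values):
    """Subtract the mean, then divide by the standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Hypothetical room counts; only the endpoints come from the text
rooms = [6, 1200, 2500, 5000, 39320]

scaled_01 = min_max_scale(rooms)               # 0 to 1
scaled_pm1 = min_max_scale(rooms, -1.0, 1.0)   # -1 to 1, e.g. for neural nets
standardized = standardize(rooms)              # mean 0, standard deviation 1
print(scaled_01)
print(standardized)
```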
Heavy Tail
• Values far from the mean are not exponentially rare
• Take square root or log to get closer to a Gaussian
Do this before normalization
• Another solution is bucketizing
Grouping values into ranges
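The two remedies for a heavy-tailed feature can be sketched as follows. The sample values, the bucket edges, and the helper bucketize are hypothetical.

```python
import math
from bisect import bisect_right

# Hypothetical heavy-tailed feature values
populations = [3, 30, 300, 3000, 30000]

# Log transform: compresses the long tail before any scaling is applied
log_values = [math.log(p) for p in populations]

def bucketize(value, edges):
    """Return the index of the range that value falls into."""
    return bisect_right(edges, value)

# Hypothetical bucket edges
edges = [10, 100, 1000, 10000]
buckets = [bucketize(p, edges) for p in populations]

print(log_values)
print(buckets)
```

After the log transform the values are evenly spaced, while bucketizing replaces each value with the index of its range.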
5. Select a Model and Train It
LinearRegression
The first prediction is off by more than $200,000!
LinearRegression
The root mean squared error is over $68,000.
The median_house_value values range from $120,000 to $265,000,
so these are pretty bad predictions.
DecisionTreeRegressor
A more powerful model, capable of finding complex nonlinear relationships.
A much lower error on the training set suggests overfitting.
Better Evaluation Using Cross-Validation
Splits the training set into 10 subsets called folds.
Trains the model 10 times, each time on 9 folds,
evaluating it on the remaining fold.
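The fold mechanics can be sketched in plain Python. This is an illustration only: the helpers k_fold_indices, fit_line, and cross_val_rmse are hypothetical, and a one-feature least-squares line stands in for the real model. scikit-learn's cross_val_score function performs the same procedure for any estimator.

```python
import math
import random

def k_fold_indices(n_samples, k=10, seed=42):
    """Shuffle the sample indices and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_line(xs, ys):
    """Least-squares slope and intercept for a single feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def cross_val_rmse(xs, ys, k=10):
    """Train on k-1 folds, evaluate on the held-out fold, repeat k times."""
    scores = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        squared = [(slope * xs[i] + intercept - ys[i]) ** 2 for i in held_out]
        scores.append(math.sqrt(sum(squared) / len(squared)))
    return scores

# Synthetic data: a noisy straight line
rng = random.Random(0)
xs = [i / 10 for i in range(100)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 0.5) for x in xs]

fold_rmse = cross_val_rmse(xs, ys, k=10)
print(sum(fold_rmse) / len(fold_rmse))
```

The result is one score per fold, so both the mean error and its spread across folds can be reported.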
RandomForestRegressor
Results are somewhat better, with an error of $47,000. But on the training set, the error is $17,000.
Still a lot of overfitting.
6. FINE-TUNE YOUR MODEL
GRID SEARCH
Scikit-Learn's GridSearchCV class
Tell it which hyperparameters you want to try, and what values to try.
It will use cross-validation to evaluate them.
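A minimal sketch of a grid search, assuming scikit-learn and NumPy are installed. The synthetic data, the chosen hyperparameter values, and cv=3 are illustrative assumptions kept small so the example runs quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the prepared housing features
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(120, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(0, 0.5, size=120)

# Which hyperparameters to try, and what values to try for each
param_grid = {"n_estimators": [5, 20], "max_features": [1, 3]}

grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,                                   # cross-validation folds
    scoring="neg_root_mean_squared_error",  # higher (closer to 0) is better
)
grid_search.fit(X, y)

print(grid_search.best_params_)
print(-grid_search.best_score_)             # RMSE of the best combination
```

Every combination in the grid is evaluated, here 2 x 2 = 4 combinations, so the cost grows quickly as more values are added.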
Randomized Search
Evaluates a fixed number of random hyperparameter values.
Useful when the hyperparameter search space is large.
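A minimal sketch of a randomized search, assuming scikit-learn, SciPy, and NumPy are installed. The synthetic data, the sampling ranges, and n_iter=5 are illustrative assumptions.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data standing in for the prepared housing features
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(120, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(0, 0.5, size=120)

# Distributions to sample from instead of a fixed grid of values
param_distributions = {
    "n_estimators": randint(5, 30),   # integers from 5 to 29
    "max_features": randint(1, 4),    # integers from 1 to 3
}

random_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions,
    n_iter=5,                               # fixed number of random combinations
    cv=3,
    scoring="neg_root_mean_squared_error",
    random_state=42,
)
random_search.fit(X, y)

print(random_search.best_params_)
```

The number of fits is controlled by n_iter rather than by the size of the search space.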
Ensemble Methods
Combines several models together.
Performance Monitoring
A component may break, causing performance to drop, or it may
drop gradually, due to model rot.
The parameters go out of date.
One measure of performance is downstream metrics.
Number of recommended products sold per day.
Or send human raters sample pictures of products the model
classified, to verify them
It can be a lot of work to set up good performance monitoring.
Automatic Updating and Retraining
Collect fresh data and label it
Write a script to train the model and fine-tune the hyperparameters periodically
Write another script to evaluate both the new model and the previous model on the
updated test set.
Evaluate input data quality
Keep backups of every model
Be ready to roll back.
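The evaluation script that compares the new and previous models might look roughly like this. It is a sketch under stated assumptions: the function pick_model_to_deploy, the two toy models, and the updated test data are all hypothetical.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def pick_model_to_deploy(models, X_test, y_test):
    """Score every model on the updated test set and return the best name."""
    scores = {
        name: rmse(y_test, [predict(x) for x in X_test])
        for name, predict in models.items()
    }
    best_name = min(scores, key=scores.get)
    return best_name, scores

# Hypothetical models: the retrained one has become worse
def previous_model(x):
    return 2.0 * x

def new_model(x):
    return 2.0 * x + 5.0

# Hypothetical updated test set
X_updated = [1.0, 2.0, 3.0]
y_updated = [2.1, 3.9, 6.2]

deployed, scores = pick_model_to_deploy(
    {"previous": previous_model, "new": new_model}, X_updated, y_updated
)
print(deployed, scores)
```

Here the previous model is kept because the new one scores worse, which is the rollback case the backups exist for.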
QUESTION BANK
PART A
10. Draw the Diagram of Machine Learning Types with Neat Presentation.
12. List the categories of Supervised Machine Learning with their algorithms.
13. Which algorithm has a linear relationship between Input and Output variables?
15. What is the main aim of the Unsupervised Learning Algorithm in Machine Learning?
16. List some of the popular Clustering Algorithms of Unsupervised Learning systems.
19. Which Machine Learning algorithm lies between Supervised and Unsupervised
Learning algorithms?
20. Write the pros and cons of Semisupervised learning algorithm in Machine Learning.
21. Which algorithm works on a feedback process in Machine Learning?
23. What are the Real world use cases of Reinforcement Learning Systems?
25. What is the main purpose of splitting the dataset into a Training Data set?
27. Write the steps for End – to – End Machine Learning Project.
28. Write the code to Read and Load the Dataset.
30. Which Regression is used to train the model in Machine Learning Systems?
32. Write the names of Popular Open Data Repositories in Machine Learning.
33. Draw the Big Picture of the ML pipeline for Real Estate Investments.
34. Which measurement is preferred if the data has many outliers in Machine Learning
Systems?
40. What are the two ways of Feature Scaling and Transformation in an end-to-end ML
project?
PART B
1. Illustrate how Machine Learning works, its Features, and the Basic Principles of Machine
Learning.
2. Explain the Types of Machine Learning for the following with their categories
a) Supervised Learning Algorithm.
b) Semi-Supervised Learning Algorithm.
c) Un-Supervised Learning Algorithm.
d) Reinforcement Learning Algorithm.
4. Discuss the Advantages and Disadvantages of the following algorithms
a) Supervised Learning Algorithm.
b) Semi-Supervised Learning Algorithm.
c) Un-Supervised Learning Algorithm.
d) Reinforcement Learning Algorithm.
5. Elucidate the Main Challenges of ML Algorithms and write the steps to overcome those
issues in ML.
6. Describe Training, Testing, and Validation with their code and Output.
7. Explain the first three steps of the ML Project Pipeline with their statistical Details.
8. Explain Exploratory Data Analysis (EDA) with its diagram in a Machine Learning
Project.
9. Elucidate exploring the Residuals and Model Evaluation in Machine Learning
Algorithms.
10. Illustrate the "Get the Real Data" and "Look at the Big Picture" steps in ML.