
DELHI TECHNOLOGICAL UNIVERSITY
Delhi – 110042, India

B.Tech. 5th Semester

LAB FILE
IT-323 Machine Learning

Submitted by: Mohit Jindal


Roll Number: 2K22/IT/104

Course Coordinator: Prof. Dinesh Kumar Vishwakarma
List of Experiments

1. To understand the basics of Python
2. To understand the NumPy functions
3. To demonstrate graphical visualizations in python using matplotlib and seaborn libraries
4. To read, display and save an image using OpenCV
5. To implement Linear Regression
6. To implement Logistic Regression
7. To process the data (Data preprocessing) using Pandas
8. To implement KNN algorithm
9. To classify data using SVM
10. To classify data using neural network
11. To implement Decision Tree algorithm
12. To implement Random Forest algorithm
13. To implement K-Means Clustering algorithm

Experiment 1
OBJECTIVE: To understand the basics of Python by the following programs:
(i) Write a program to add two numbers to illustrate the use of print statement.
(ii) Write a program to illustrate the use of conditional statements by checking if the
input number is odd or even.
(iii) Write a program to illustrate the use of functions.
(iv) Write a program to access the values in a dictionary.
(v) Write a program to perform various string operations.

RELATED THEORY
(i) Python is a powerful high-level, object-oriented programming language. It has a
simple, easy-to-use syntax, making it a good language for someone learning
computer programming for the first time.
(ii) In the first program, two numbers are added to illustrate the use of print statement.
In the second program, use of conditional statements is demonstrated. In order to
write useful programs, we almost always need the ability to check conditions and
change the behavior of the program accordingly. Conditional statements give us
this ability.
(iii) Next, python functions are explained where a function is a block of organized,
reusable code that is used to perform a single, related action. In the fourth program
values in a dictionary are accessed. Each key is separated from its value by a
colon (:), the items are separated by commas, and the whole thing is enclosed in
curly braces.
(iv) Finally, various string operations are demonstrated using string functions. Python
treats single quotes the same as double quotes.
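
A minimal sketch of the five programs described above (the sample values, the greet() function and the student dictionary are illustrative):

```python
# (i) Add two numbers and print the result
a, b = 10, 5
print("Sum:", a + b)

# (ii) Check whether an input number is odd or even
num = int(input("Enter a number: "))
if num % 2 == 0:
    print(num, "is even")
else:
    print(num, "is odd")

# (iii) Define and call a simple function
def greet(name):
    return "Hello, " + name

print(greet("DTU"))

# (iv) Access values in a dictionary
student = {"name": "Mohit", "roll": "2K22/IT/104", "branch": "IT"}
print(student["name"])        # access by key
print(student.get("branch"))  # access using get()

# (v) Common string operations
s = "machine learning"
print(s.upper())                      # MACHINE LEARNING
print(s.title())                      # Machine Learning
print(s.replace("machine", "deep"))   # replace a substring
print(len(s), s.find("learn"))        # length and substring search
```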

CONCLUSION
From the above programs, we conclude that Python provides rich features for programming.
It supports functional and structured programming methods as well as OOP. Python has few
keywords, a simple structure, and a clearly defined syntax, which makes Python code easy to
read. The bulk of Python's library is also very portable.

Experiment 2
OBJECTIVE: To understand the NumPy functions.

RELATED THEORY

(i) NumPy, which stands for Numerical Python, is a library consisting of
multidimensional array objects and a collection of routines for processing those
arrays. Using NumPy, mathematical and logical operations on arrays can be
performed.
(ii) NumPy is the fundamental package for scientific computing with Python.
NumPy’s main object is the homogeneous multidimensional array.
(iii) The most important object defined in NumPy is an N-dimensional array type
called ndarray. It describes the collection of items of the same type. Items in the
collection can be accessed using a zero-based index.
(iv) Every item in an ndarray takes the same size of block in the memory. Each
element in ndarray is an object of data-type object (called dtype).
(v) Any item extracted from an ndarray object (by slicing) is represented by a Python
object of one of the array scalar types. The main attributes of an ndarray are
described below:

(a) ndarray.ndim
The number of axes (dimensions) of the array.

(b) ndarray.shape
The dimensions of the array. This is a tuple of integers indicating the size of the array
in each dimension. For a matrix with n rows and m columns, shape will be (n,m). The
length of the shape tuple is therefore the number of axes, ndim

(c) ndarray.dtype
An object describing the type of the elements in the array. One can create or specify
dtype’s using standard Python types. Additionally NumPy provides types of its own.
numpy.int32, numpy.int16, and numpy.float64 are some examples.
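
A short illustrative sketch of the operations discussed here and summarised in the result below (the array values are arbitrary):

```python
import numpy as np

# Create a 1D array and a 2D array
a = np.array([1, 2, 3, 4, 5])
b = np.array([[1, 2, 3], [4, 5, 6]])
print(a.ndim, a.shape, a.dtype)   # 1 (5,) int64 (dtype is platform dependent)
print(b.ndim, b.shape)            # 2 (2, 3)

# A slice is a view: changes are reflected in the original array
view = a[1:4]
view[0] = 99
print(a)                          # [ 1 99  3  4  5]

# A copy is independent of the original array
c = a.copy()
c[0] = -1
print(a[0], c[0])                 # 1 -1

# Indexing a 2D array: row 1, column 2
print(b[1, 2])                    # 6

# Clipping limits values to a given range
print(np.clip(a, 2, 50))          # values below 2 become 2, values above 50 become 50
```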

DISCUSSION & RESULT
The above program performs the following tasks using numpy array:
(i) Create a 1D array
(ii) Create a 2D array
(iii) How to reflect a change
(iv) Make a copy
(v) Indexing
(vi) Clipping

CONCLUSION
NumPy is a general-purpose array-processing package. It provides a high-performance
multidimensional array object and tools for working with these arrays. It contains the
following:
- a powerful N-dimensional array object
- sophisticated (broadcasting) functions
- tools for integrating C/C++ and Fortran code
- useful linear algebra, Fourier transform, and random number capabilities

Experiment 3

OBJECTIVE: To demonstrate graphical visualizations in python using matplotlib and


seaborn libraries

RELATED THEORY
(i) Matplotlib is a Python 2D plotting library which provides both a very quick way
to visualize data from Python and publication-quality figures in many
formats. Matplotlib can be used in Python scripts, the Python and IPython shells,
the Jupyter notebook, web application servers, and four graphical user interface
toolkits.
(ii) Matplotlib tries to make easy things easy and hard things possible. You can
generate plots, histograms, power spectra, bar charts, error charts, scatterplots,
etc., with just a few lines of code. For examples, see the sample
plots and thumbnail gallery.
(iii) Seaborn is an amazing visualization library for statistical graphics plotting in
Python. It provides beautiful default styles and color palettes to make statistical
plots more attractive. It is built on the top of matplotlib library and also closely
integrated to the data structures from pandas. Seaborn aims to make visualization
the central part of exploring and understanding data. It provides dataset-oriented
APIs, so that we can switch between different visual representations for same
variables for better understanding of dataset.
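
A minimal sketch of the kinds of plots produced in this experiment, i.e. a line plot, vertical and horizontal bar charts and a seaborn heatmap (all data values are arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

x = np.arange(1, 6)
y = x ** 2

# Line plot with basic customization
plt.plot(x, y, color="green", marker="o", label="y = x^2")
plt.xlabel("x")
plt.ylabel("y")
plt.title("Line plot")
plt.legend()
plt.show()

# Vertical and horizontal bar charts
labels = ["A", "B", "C", "D"]
values = [5, 7, 3, 8]
plt.bar(labels, values)
plt.title("Vertical bar chart")
plt.show()
plt.barh(labels, values)
plt.title("Horizontal bar chart")
plt.show()

# Heatmap of a random matrix using seaborn
data = np.random.rand(4, 4)
sns.heatmap(data, annot=True, cmap="viridis")
plt.title("Heatmap")
plt.show()
```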

CONCLUSION
In the above program we used the matplotlib and seaborn libraries to plot various
visualizations, including line plots, bar charts (both horizontal and vertical) and
heatmaps. Several customization options have been displayed and discussed in the
demonstrations above.

Experiment 4

OBJECTIVE: To read, display and save an image using OpenCV.


RELATED THEORY
OpenCV (Open-Source Computer Vision Library) is an open-source computer vision and
machine learning software library. The various functions used under OpenCV are:

1. Read an image:
We use the function cv2.imread() to read an image. The image should be in the working
directory, or a full path to the image should be given. The second argument is a flag
which specifies the way the image should be read.
cv2.IMREAD_COLOR : Loads a color image. Any transparency of image will be
neglected. It is the default flag.
cv2.IMREAD_GRAYSCALE : Loads image in grayscale mode
cv2.IMREAD_UNCHANGED : Loads image as such including alpha channel
2. Display an image:
We use the function cv2.imshow() to display an image in a window. The window
automatically fits to the image size. First argument is a window name which is a string.
Second argument is our image. You can create as many windows as you wish, but with
different window names.
3. Wait Key
cv2.waitKey() is a keyboard binding function. Its argument is the time in milliseconds.
The function waits for the specified milliseconds for any keyboard event. If you press any
key in that time, the program continues. If 0 is passed, it waits indefinitely for a key
stroke. It can also be set to detect specific key strokes, such as whether the key 'a' is
pressed.
4. Destroy All Windows
cv2.destroyAllWindows() simply destroys all the windows we created. If you want to
destroy any specific window, use the function
cv2.destroyWindow()where you pass the exact window name as the argument.
5. Write an image
We use the function cv2.imwrite() to save an image. First argument is the file name,
second argument is the image you want to save.
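
A minimal sketch of the read/display/save workflow described above; the input file name is illustrative, while the output name "copy.png" matches the result discussed below:

```python
import cv2

# Read the image in grayscale mode (the file name is illustrative and assumed
# to be in the working directory)
img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)

# Display the image in a window and wait indefinitely for a key press
cv2.imshow("Grayscale image", img)
cv2.waitKey(0)
cv2.destroyAllWindows()

# Save the grayscale image under a new name
cv2.imwrite("copy.png", img)
```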

DISCUSSION & RESULT

In the above program, we have successfully loaded the image from the system, displayed it,
and saved it in the file directory with the name "copy.png" using the OpenCV library. Both
the displayed image and the saved image are grayscale. The output of the program is shown
below:

Figure 1. Original Image
Figure 2. Grayscaled Image

CONCLUSION

From the above program, we conclude that OpenCV is an important library for reading,
displaying and writing images. We can also use matplotlib for displaying an image,
zooming into it and saving it.

Experiment 5

OBJECTIVE: To implement Linear Regression.


RELATED THEORY
In Linear Regression, we assume a linear relationship between the input variables (x) and the
single output variable (y). In a linear model, we try to fit a line (y = mx + c) to the given data
in such a manner that it has minimum error. The general line is written as: y(hat) = w0 + w1 * x.

- Cost function:

  J(w_0, w_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_w(x^i) - y^i \right)^2

- Derivative of the cost function:

  \frac{\partial}{\partial w_j} J(w_0, w_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_w(x^i) - y^i \right) x_j^i

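A short sketch of batch gradient descent for the cost function above (the toy data, learning rate and iteration count are illustrative and are not the values used to obtain the results reported below):

```python
import numpy as np

# Toy data: y is roughly linear in x
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([3, 5, 7, 9, 11], dtype=float)
m = len(y)

w0, w1 = 0.0, 0.0          # initial parameters
alpha, iters = 0.01, 2000  # learning rate and number of iterations

def cost(w0, w1):
    h = w0 + w1 * x
    return np.sum((h - y) ** 2) / (2 * m)

print("Initial cost:", cost(w0, w1))

for _ in range(iters):
    h = w0 + w1 * x
    # Gradients of J with respect to w0 and w1
    grad0 = np.sum(h - y) / m
    grad1 = np.sum((h - y) * x) / m
    w0 -= alpha * grad0
    w1 -= alpha * grad1

print("Final cost:", cost(w0, w1), "w0:", w0, "w1:", w1)
```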
DISCUSSION & RESULT

In the above program, we have successfully implemented linear regression by computing
the initial cost and then reducing the cost through gradient descent. The cost decreases
significantly, from 319.406315894 to 56.041973777981703. The final output is shown in
figure 2.

Figure 1. Plotting the relationship between data points

Figure 2. Linear Regression between data points

CONCLUSION
From the above program, we conclude that regression is a statistical process for estimating
the relationship between a dependent variable and one or more independent variables.

Experiment 6

OBJECTIVE: To implement Logistic Regression


RELATED THEORY
Logistic regression is a predictive analysis. Logistic regression is used to describe data and to
explain the relationship between one dependent variable and one or more independent
variables. The standard logistic function, called the sigmoid function, is given by
1 / (1 + e^(-value)), where e is the base of the natural logarithms (Euler's number) and value is the actual numerical
value that you want to transform. The cost function is given by:
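For the sigmoid hypothesis h_w(x) = 1 / (1 + e^(-w^T x)), the standard cross-entropy form of this cost over m training examples is:

J(w) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^i \log h_w(x^i) + (1 - y^i) \log\left(1 - h_w(x^i)\right) \right]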

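A compact sketch of fitting such a model with SciPy's TNC optimizer, which reports the function-evaluation counts mentioned in the result below (the toy data and the optimizer choice are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, X, y):
    # Clip the hypothesis to avoid log(0) on perfectly separated points
    h = np.clip(sigmoid(X @ w), 1e-10, 1 - 1e-10)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient(w, X, y):
    h = sigmoid(X @ w)
    return X.T @ (h - y) / len(y)

# Toy 2-feature data with an intercept column
X = np.array([[1, 1.0, 2.0], [1, 2.0, 1.0], [1, 3.0, 4.0], [1, 4.0, 3.0]])
y = np.array([0, 0, 1, 1])

res = minimize(cost, np.zeros(X.shape[1]), args=(X, y),
               jac=gradient, method="TNC")
print("Weights:", res.x)
print("Function evaluations:", res.nfev)
```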
DISCUSSION & RESULT

In the above program, we have successfully implemented logistic regression; the number
of function evaluations and gradient evaluations is computed, and the resulting fit is shown
in the plotted graph.
The output of the above program is shown below:

Figure 1. Initial data plotting

Figure 2. Logistic regression

CONCLUSION
From the above program, we see that logistic regression is used when the response variable is
categorical in nature, for instance yes/no, true/false, red/green/blue, 1st/2nd/3rd/4th, etc.,
whereas linear regression is used when the response variable is continuous, for instance
weight, height, number of hours, etc. Linear regression uses the ordinary least squares method
to minimize the errors and arrive at the best possible fit, while logistic regression uses the
maximum likelihood method to arrive at the solution. Hence, logistic regression is the better
choice when the response variable is categorical.

Experiment 7

OBJECTIVE: To process the data (Data preprocessing) using Pandas


RELATED THEORY
Pandas is a popular Python library used for data science and analysis. Used in conjunction
with other data science toolsets like SciPy, NumPy, and Matplotlib, a modeler can create end-
to-end analytic workflows to solve business problems. Many datasets have missing,
malformed, or erroneous data. It’s often unavoidable–anything from incomplete reporting to
technical glitches can cause “dirty” data. Pandas provides a robust library of functions to
help you clean up, sort through, and make sense of your datasets, no matter what state they’re
in. The dataset of 5,000 movies scraped from IMDB is used for pre-processing. It contains
information on the actors, directors, budget, and gross, as well as the IMDB rating and
release year.
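
A small sketch of the cleaning steps shown in the figures below, assuming the scraped data sits in a CSV file; the file and column names follow the usual layout of the IMDB 5000 dataset but should be treated as illustrative:

```python
import pandas as pd

# Load the scraped movie data (file name is illustrative)
movies = pd.read_csv("movie_metadata.csv")
print(movies.head())

# Replace missing country values with an empty string
movies["country"] = movies["country"].fillna("")

# Fill missing durations with the mean of the column
movies["duration"] = movies["duration"].fillna(movies["duration"].mean())

# Convert movie titles to uppercase
movies["movie_title"] = movies["movie_title"].str.upper()

# Drop rows that are still missing critical fields such as gross and budget
movies = movies.dropna(subset=["gross", "budget"])
print(movies.head())
```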

DISCUSSION & RESULT
In the above program, we have used a dataset of 5,000 movies scraped from IMDB. It
contains information on the actors, directors, budget, and gross, as well as the IMDB rating
and release year. The output of the program is:

Figure 1. Snapshot of the IMDB dataset

Figure 2. 'NaN' values in the country column are replaced by the empty string ''

Figure 3. Missing values in the duration column are filled with the mean of the column

Figure 4. The movie names are converted to uppercase

CONCLUSION
From the above program, we conclude that Pandas has selection methods which you can use
to slice and dice the dataset based on your queries. It mainly helps with the following
data-processing tasks:
- Deal with missing data
- Add default values
- Remove incomplete rows
- Deal with error-prone columns
- Normalize data types
- Change case
- Rename columns

Experiment 8

OBJECTIVE: To implement KNN algorithm.


RELATED THEORY
The KNN algorithm assumes that similar things exist in close proximity. In other words,
similar things are near to each other.
KNN algorithm:
1. Load the data
2. Initialize K to your chosen number of neighbors
3. For each example in the data:
   a. Calculate the distance between the query example and the current example from
      the data.
   b. Add the distance and the index of the example to an ordered collection.
4. Sort the ordered collection of distances and indices from smallest to largest (in
   ascending order) by the distances.
5. Pick the first K entries from the sorted collection.
6. Get the labels of the selected K entries.
7. If regression, return the mean of the K labels.
8. If classification, return the mode of the K labels.

CODE
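
A from-scratch sketch of the algorithm above used for classification (the toy dataset and the value of K are illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    # Step 3: compute the distance from the query to every training example
    distances = [(np.linalg.norm(x - query), label)
                 for x, label in zip(X_train, y_train)]
    # Steps 4-5: sort by distance and keep the first K entries
    distances.sort(key=lambda pair: pair[0])
    k_labels = [label for _, label in distances[:k]]
    # Steps 6 and 8: get the labels and return their mode (classification)
    return Counter(k_labels).most_common(1)[0][0]

# Toy 2D dataset with two classes
X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([2, 2]), k=3))  # expected: 0
print(knn_predict(X_train, y_train, np.array([6, 5]), k=3))  # expected: 1
```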

Experiment 9
OBJECTIVE: To classify data using SVM.
RELATED THEORY
The objective of the support vector machine algorithm is to find a hyperplane in an N-
dimensional space (N being the number of features) that distinctly classifies the data points.

Predicting iris flower category

To identify the iris flower type using the modeled SVM classifier, we need to call the predict
function on the fitted model. For example, to predict the iris flower category using the
lin_svc model, we need to call lin_svc.predict with the features. In our case, these features
include the sepal length and width and the petal length and width.

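A minimal sketch of such a classifier on the iris dataset with scikit-learn; lin_svc and the sepal/petal features follow the description above, while the train/test split and the sample measurement are illustrative:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the iris dataset: sepal length/width and petal length/width as features
iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# Fit a linear-kernel SVM
lin_svc = SVC(kernel="linear")
lin_svc.fit(X_train, y_train)

# Predict the flower category for unseen measurements
pred = lin_svc.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))

# Predict a single flower: [sepal length, sepal width, petal length, petal width]
sample = [[5.1, 3.5, 1.4, 0.2]]
print(iris.target_names[lin_svc.predict(sample)[0]])
```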
Experiment 10

OBJECTIVE: To classify data using neural network.


RELATED THEORY
Neural networks are a biologically inspired programming paradigm that enables a
computer to learn from observational data. An Artificial Neural Network is based on a
collection of connected units or nodes called artificial neurons. ANNs have been used on a
variety of tasks, including computer vision, speech recognition, machine translation, social
network filtering, playing board and video games, and medical diagnosis.

Figure 1. Neural Network
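
A from-scratch sketch of the forward and backward passes enumerated in the result below, for a single hidden layer; the toy input/output arrays, layer size and learning rate are illustrative, while the 5000 iterations match the conclusion:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # derivative expressed in terms of the already-activated value
    return x * (1.0 - x)

# Read input and output (toy dataset)
X = np.array([[1, 0, 1, 0], [1, 0, 1, 1], [0, 1, 0, 1]], dtype=float)
y = np.array([[1], [1], [0]], dtype=float)

# Initialize weights and biases with random values
np.random.seed(0)
wh = np.random.rand(4, 3)   # input -> hidden weights
bh = np.random.rand(1, 3)   # hidden biases
wo = np.random.rand(3, 1)   # hidden -> output weights
bo = np.random.rand(1, 1)   # output bias
lr = 0.1

for _ in range(5000):
    # Forward pass: hidden layer input, non-linear activation, output layer
    hidden = sigmoid(X @ wh + bh)
    output = sigmoid(hidden @ wo + bo)

    # Backward pass: error, slopes and deltas at output and hidden layers
    error = y - output
    d_output = error * sigmoid_derivative(output)
    d_hidden = (d_output @ wo.T) * sigmoid_derivative(hidden)

    # Update weights and biases at both layers
    wo += hidden.T @ d_output * lr
    bo += d_output.sum(axis=0, keepdims=True) * lr
    wh += X.T @ d_hidden * lr
    bh += d_hidden.sum(axis=0, keepdims=True) * lr

print(output.round(3))  # after training, the outputs should be close to the targets
```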

DISCUSSION & RESULT

In the above program we have built a neural network using the following steps:
- Read input and output
- Initialize weights and biases with random values (there are methods to initialize
  weights and biases, but for now we initialize with random values)
- Calculate hidden layer input
- Perform non-linear transformation on hidden linear input
- Perform linear and non-linear transformation of hidden layer activation at output layer
- Calculate gradient of Error (E) at output layer
- Compute slope at output and hidden layer
- Compute delta at output layer
- Calculate Error at hidden layer
- Compute delta at hidden layer
- Update weights at both output and hidden layer
- Update biases at both output and hidden layer

The output of the program is:

CONCLUSION
From the above program, we conclude that after training the model for 5000 iterations,
the outputs are close to the target values.

Experiment 11

OBJECTIVE: To implement Decision Tree Algorithm on breast cancer data to predict whether
a person is having cancer or not.
RELATED THEORY
Decision trees are among the most powerful and popular tools for classification and prediction.
A Decision tree is a flowchart-like tree structure, where each internal node denotes a test on an
attribute, each branch represents an outcome of the test, and each leaf node (terminal node)
holds a class label. The strengths of decision tree methods are:
1. Decision trees are able to generate understandable rules.
2. Decision trees perform classification without requiring much computation.
3. Decision trees are able to handle both continuous and categorical variables.
4. Decision trees provide a clear indication of which fields are most important for
prediction or classification.
The weaknesses of decision tree methods are:
1. Decision trees are less appropriate for estimation tasks where the goal is to predict the
value of a continuous attribute.
2. Decision trees are prone to errors in classification problems with many classes and a
relatively small number of training examples.
3. Decision tree can be computationally expensive to train. The process of growing a
decision tree is computationally expensive. At each node, each candidate splitting field
must be sorted before its best split can be found. In some algorithms, combinations of
fields are used and a search must be made for optimal combining weights. Pruning
algorithms can also be expensive since many candidate sub-trees must be formed and
compared.

Step 1: Importing libraries and reading the dataset.

Step 2: Description of the data frame.

Step 3: To divide the data into malignant and benign data.

Step 4: To plot the classified data.

Step 5: To divide the data into train and test data and to predict the accuracy to determine
whether a person is having cancer or not.

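A compact sketch of the steps above using scikit-learn's built-in breast cancer dataset (the split ratio and random states are illustrative; the plotting of step 4 is omitted here):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Steps 1-2: load the dataset and describe the data frame
cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df["target"] = cancer.target          # 0 = malignant, 1 = benign
print(df.describe())

# Step 3: divide the data into malignant and benign subsets
malignant = df[df["target"] == 0]
benign = df[df["target"] == 1]
print(len(malignant), "malignant,", len(benign), "benign")

# Step 5: train/test split, fit the tree and report accuracy
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=42)
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```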
Experiment 12

OBJECTIVE: Implementation of Random Forest in Python


RELATED THEORY:
Random forest, as its name implies, consists of a large number of individual decision trees
that operate as an ensemble. Each individual tree in the random forest spits out a class
prediction, and the class with the most votes becomes our model's prediction.

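A minimal sketch of a random forest classifier in scikit-learn (the dataset and hyperparameters are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# An ensemble of 100 decision trees, each trained on a bootstrap sample
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# Each tree votes; the majority class becomes the prediction
pred = forest.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))
```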
Experiment 13
OBJECTIVE: Implementation of K Means Clustering algorithm
RELATED THEORY
K-means is an unsupervised learning method for clustering data points. The algorithm
iteratively divides data points into K clusters by minimizing the variance in each cluster.
Here, we will show you how to estimate the best value for K using the elbow method, then use
K-means clustering to group the data points into clusters.
First, each data point is randomly assigned to one of the K clusters. Then, we compute the
centroid (functionally the center) of each cluster, and reassign each data point to the cluster
with the closest centroid. We repeat this process until the cluster assignments for each data
point are no longer changing. K-means clustering requires us to select K, the number of
clusters we want to group the data into. The elbow method lets us graph the inertia (a distance-
based metric) and visualize the point at which it starts decreasing linearly. This point is
referred to as the "elbow" and is a good estimate for the best value for K based on our data.
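
A short sketch of the elbow method followed by K-means clustering (the toy points and the range of K values are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Toy 2D points forming three loose groups
X = np.array([[1, 2], [1, 4], [2, 3],
              [8, 8], [9, 9], [8, 9],
              [15, 2], [16, 3], [15, 1]], dtype=float)

# Elbow method: plot inertia for K = 1..6 and look for the "elbow"
inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)
plt.plot(range(1, 7), inertias, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("Inertia")
plt.show()

# Cluster with the chosen K (here K = 3) and show the assignments
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(kmeans.labels_)
print(kmeans.cluster_centers_)
```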

