ML MANUAL
1. Implement and demonstrate the FIND-S algorithm for finding the most specific
hypothesis based on a given set of training data samples. Read the training data from a .CSV
file.
FIND-S Algorithm
1. Initialize h to the most specific hypothesis in H
2. For each positive training instance x
       For each attribute constraint ai in h
           If the constraint ai is satisfied by x
           Then do nothing
           Else replace ai in h by the next more general constraint that is satisfied by x
3. Output hypothesis h
Training Examples:
import csv
a = []
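The listing breaks off after these two lines; a sketch of the usual completion, assuming
the training data sits in 'enjoysport.csv' (the same file read in Program 2) with the target
value in the last column:
with open('enjoysport.csv', 'r') as csvfile:
    for row in csv.reader(csvfile):
        a.append(row)
print("The total number of training instances:", len(a))

num_attribute = len(a[0]) - 1
hypothesis = ['0'] * num_attribute  # start from the most specific hypothesis

for i in range(len(a)):
    if a[i][num_attribute].lower() == 'yes':  # consider positive instances only
        for j in range(num_attribute):
            if hypothesis[j] == '0' or hypothesis[j] == a[i][j]:
                hypothesis[j] = a[i][j]
            else:
                hypothesis[j] = '?'  # generalize the conflicting attribute
print("The maximally specific hypothesis:", hypothesis)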
Output:
2. For a given set of training data examples stored in a .CSV file, implement and
demonstrate the Candidate-Elimination algorithm to output a description of the set of all
hypotheses consistent with the training examples.
Candidate-Elimination Algorithm
For each training example d, do:
• If d is a positive example
• Remove from G any hypothesis inconsistent with d
• For each hypothesis s in S that is not consistent with d
• Remove s from S
• Add to S all minimal generalizations h of s such that
• h is consistent with d, and some member of G is more general than h
• Remove from S any hypothesis that is more general than another hypothesis in S
• If d is a negative example
• Remove from S any hypothesis inconsistent with d
• For each hypothesis g in G that is not consistent with d
• Remove g from G
• Add to G all minimal specializations h of g such that
• h is consistent with d, and some member of S is more specific than h
• Remove from G any hypothesis that is less general than another hypothesis in G
Training Examples:
import numpy as np
import pandas as pd
data = pd.read_csv('enjoysport.csv')
concepts = np.array(data.iloc[:,0:-1])
print(concepts)
target = np.array(data.iloc[:,-1])
print(target)
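The extracted listing stops here; a sketch of the usual learn() routine that produces the
Final Specific_h / Final General_h output shown below (the 'yes' target value matches the
enjoysport data):
def learn(concepts, target):
    specific_h = concepts[0].copy()
    general_h = [["?" for _ in range(len(specific_h))]
                 for _ in range(len(specific_h))]
    for i, h in enumerate(concepts):
        if target[i] == "yes":  # positive example: generalize S
            for x in range(len(specific_h)):
                if h[x] != specific_h[x]:
                    specific_h[x] = '?'
                    general_h[x][x] = '?'
        else:  # negative example: specialize G
            for x in range(len(specific_h)):
                if h[x] != specific_h[x]:
                    general_h[x][x] = specific_h[x]
                else:
                    general_h[x][x] = '?'
    # drop the rows of G that remained fully general
    general_h = [h for h in general_h if h != ['?'] * len(specific_h)]
    return specific_h, general_h

s_final, g_final = learn(concepts, target)
print("Final Specific_h:")
print(s_final)
print("Final General_h:")
print(g_final)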
Output:
Final Specific_h:
['sunny' 'warm' '?' 'strong' '?' '?']
Final General_h:
[['sunny', '?', '?', '?', '?', '?'],
['?', 'warm', '?', '?', '?', '?']]
3. Write a program to demonstrate the working of the decision tree based ID3 algorithm. Use
an appropriate data set for building the decision tree and apply this knowledge to classify a
new sample.
ID3 Algorithm
Examples are the training examples. Target_attribute is the attribute whose value is to
be predicted by the tree. Attributes is a list of other attributes that may be tested by the
learned decision tree. Returns a decision tree that correctly classifies the given
Examples.
ID3(Examples, Target_attribute, Attributes)
Create a Root node for the tree
If all Examples are positive, Return the single-node tree Root with label = +
If all Examples are negative, Return the single-node tree Root with label = -
If Attributes is empty, Return the single-node tree Root with label = most common
value of Target_attribute in Examples
Otherwise Begin
A ← the attribute from Attributes that best* classifies Examples
(* the best attribute is the one with the highest information gain, defined below)
The decision attribute for Root ← A
For each possible value vi of A,
    Add a new tree branch below Root, corresponding to the test A = vi
    Let Examples_vi be the subset of Examples that have value vi for A
    If Examples_vi is empty
        Then below this new branch add a leaf node with label = most common
        value of Target_attribute in Examples
    Else below this new branch add the subtree
        ID3(Examples_vi, Target_attribute, Attributes – {A})
End
Return Root
INFORMATION GAIN:
Entropy(S) = Σi −pi log2 pi, where pi is the proportion of S belonging to class i
Gain(S, A) = Entropy(S) − Σv∈Values(A) (|Sv| / |S|) · Entropy(Sv)
Training Dataset:
Test Dataset:
import math
import csv
def load_csv(filename):
    lines = csv.reader(open(filename, "r"))
    dataset = list(lines)
    headers = dataset.pop(0)
    return dataset, headers

class Node:
    def __init__(self, attribute):
        self.attribute = attribute
        self.children = []
        self.answer = ""

def subtables(data, col, delete):
    dic = {}
    coldata = [row[col] for row in data]
    attr = list(set(coldata))
    counts = [0] * len(attr)
    r = len(data)
    c = len(data[0])
    for x in range(len(attr)):
        for y in range(r):
            if data[y][col] == attr[x]:
                counts[x] += 1
    for x in range(len(attr)):
        dic[attr[x]] = [[0 for i in range(c)] for j in range(counts[x])]
        pos = 0
        for y in range(r):
            if data[y][col] == attr[x]:
                if delete:
                    del data[y][col]
                dic[attr[x]][pos] = data[y]
                pos += 1
    return attr, dic

def entropy(S):
    attr = list(set(S))
    if len(attr) == 1:
        return 0
    counts = [0, 0]
    for i in range(2):
        counts[i] = sum([1 for x in S if attr[i] == x]) / (len(S) * 1.0)
    sums = 0
    for cnt in counts:
        sums += -1 * cnt * math.log(cnt, 2)
    return sums
def compute_gain(data, col):
    attr, dic = subtables(data, col, delete=False)
    total_size = len(data)
    entropies = [0] * len(attr)
    ratio = [0] * len(attr)
    total_entropy = entropy([row[-1] for row in data])
    for x in range(len(attr)):
        ratio[x] = len(dic[attr[x]]) / (total_size * 1.0)
        entropies[x] = entropy([row[-1] for row in dic[attr[x]]])
        total_entropy -= ratio[x] * entropies[x]
    return total_entropy
def build_tree(data, features):
    lastcol = [row[-1] for row in data]
    if (len(set(lastcol))) == 1:
        node = Node("")
        node.answer = lastcol[0]
        return node
    n = len(data[0]) - 1
    gains = [0] * n
    for col in range(n):
        gains[col] = compute_gain(data, col)
    split = gains.index(max(gains))
    node = Node(features[split])
    fea = features[:split] + features[split+1:]
    attr, dic = subtables(data, split, delete=True)
    for x in range(len(attr)):
        child = build_tree(dic[attr[x]], fea)
        node.children.append((attr[x], child))
    return node

def print_tree(node, level):
    if node.answer != "":
        print(" " * level, node.answer)
        return
    print(" " * level, node.attribute)
    for value, n in node.children:
        print(" " * (level + 1), value)
        print_tree(n, level + 2)

def classify(node, x_test, features):
    if node.answer != "":
        print(node.answer)
        return
    pos = features.index(node.attribute)
    for value, n in node.children:
        if x_test[pos] == value:
            classify(n, x_test, features)
'''Main program'''
dataset, features = load_csv("data3.csv")
node1 = build_tree(dataset, features)
print_tree(node1, 0)
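The problem statement also asks to classify a new sample; a sketch of the usual test loop
(the test-file name 'data3_test.csv' is an assumption):
testdata, features_test = load_csv("data3_test.csv")
for xtest in testdata:
    print("The test instance:", xtest)
    print("The label for test instance:", end=" ")
    classify(node1, xtest, features)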
Output:
Outlook
 rain
  Wind
   strong
    no
   weak
    yes
 overcast
  yes
 sunny
  Humidity
   normal
    yes
   high
    no
4. Build an Artificial Neural Network by implementing the Backpropagation algorithm
and test the same using appropriate data sets.
BACKPROPAGATION Algorithm
BACKPROPAGATION (training_examples, η, nin, nhidden, nout)
Each training example is a pair of the form (x⃗, t⃗ ), where x⃗ is the vector of network
input values, and t⃗ is the vector of target network output values.
η is the learning rate (e.g., .05). nin is the number of network inputs, nhidden the number
of units in the hidden layer, and nout the number of output units.
The input from unit i into unit j is denoted xji, and the weight from unit i to unit j is
denoted wji.
Create a feed-forward network with nin inputs, nhidden hidden units, and nout output
units.
Initialize all network weights to small random numbers.
Until the termination condition is met, Do
    For each (x⃗, t⃗ ) in training_examples, Do
        Propagate the input forward through the network:
        1. Input the instance x⃗ to the network and compute the output ou of every unit u
        Propagate the errors backward through the network:
        2. For each network output unit k, calculate its error term δk ← ok(1 − ok)(tk − ok)
        3. For each hidden unit h, calculate its error term δh ← oh(1 − oh) Σk∈outputs wkh δk
        4. Update each network weight wji ← wji + Δwji, where Δwji = η δj xji
Training Examples:
Example   Sleep   Study   Expected % in Exams
1         2       9       92
2         1       5       86
3         3       6       89
Program:
import numpy as np
X = np.array(([2, 9], [1, 5], [3, 6]), dtype=float)
y = np.array(([92], [86], [89]), dtype=float)
X = X/np.amax(X,axis=0) # maximum of X array longitudinally
y = y/100
#Sigmoid Function
def sigmoid(x):
    return 1/(1 + np.exp(-x))

#Derivative of sigmoid (in terms of the sigmoid output), used in backpropagation
def derivatives_sigmoid(x):
    return x * (1 - x)
#Variable initialization
epoch=5000 #Setting training iterations
lr=0.1 #Setting learning rate
inputlayer_neurons = 2 #number of features in data set
hiddenlayer_neurons = 3 #number of hidden layers neurons
output_neurons = 1 #number of neurons at output layer
#weight and bias initialization
wh = np.random.uniform(size=(inputlayer_neurons, hiddenlayer_neurons))
bh = np.random.uniform(size=(1, hiddenlayer_neurons))
wout = np.random.uniform(size=(hiddenlayer_neurons, output_neurons))
bout = np.random.uniform(size=(1, output_neurons))
for i in range(epoch):
    #Forward Propagation
    hinp1 = np.dot(X, wh)
    hinp = hinp1 + bh
    hlayer_act = sigmoid(hinp)
    outinp1 = np.dot(hlayer_act, wout)
    outinp = outinp1 + bout
    output = sigmoid(outinp)
    #Backpropagation
    EO = y - output
    outgrad = derivatives_sigmoid(output)
    d_output = EO * outgrad
    EH = d_output.dot(wout.T)
    hiddengrad = derivatives_sigmoid(hlayer_act)  # how much the hidden layer contributed to the error
    d_hiddenlayer = EH * hiddengrad
    wout += hlayer_act.T.dot(d_output) * lr  # update the weights layer by layer
    wh += X.T.dot(d_hiddenlayer) * lr

print("Input: \n" + str(X))
print("Actual Output: \n" + str(y))
print("Predicted Output: \n", output)
Output:
Input:
[[0.66666667 1. ]
[0.33333333 0.55555556]
[1. 0.66666667]]
Actual Output:
[[0.92]
[0.86]
[0.89]]
Predicted Output:
[[0.89726759]
[0.87196896]
[0.9000671]]
5. Write a program to implement the naïve Bayesian classifier for a sample training data set
stored as a .CSV file. Compute the accuracy of the classifier, considering a few test data sets.
Bayes’ Theorem is stated as:
P(h|D) = ( P(D|h) · P(h) ) / P(D)
Where,
P(h|D) is the probability of hypothesis h given the data D. This is called the posterior
probability.
P(D|h) is the probability of data D given that the hypothesis h was true.
P(h) is the probability of hypothesis h being true. This is called the prior probability of h.
P(D) is the probability of the data. This is called the prior probability of D.
In many learning scenarios, the learner is interested in finding the most probable
hypothesis h ∈ H given the observed data D. Any such maximally probable hypothesis is
called a maximum a posteriori (MAP) hypothesis. Using Bayes theorem to calculate the
posterior probability of each candidate hypothesis, hMAP is a MAP hypothesis provided
hMAP = argmax h∈H P(h|D) = argmax h∈H P(D|h) · P(h)
A Gaussian Naive Bayes algorithm is a special type of Naïve Bayes algorithm. It is used
when the features have continuous values, and it assumes that every feature follows a
Gaussian (normal) distribution:
P(x | class) = (1 / √(2πσ²)) · exp(−(x − μ)² / (2σ²))
This means that in addition to the probabilities for each class, we must also store the mean
μ and standard deviation σ of each input variable for each class.
Sample Examples:
Example  Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin  BMI   DiabetesPedigreeFunction  Age  Outcome
1        6            148      72             35             0        33.6  0.627                     50   1
2        1            85       66             29             0        26.6  0.351                     31   0
3        8            183      64             0              0        23.3  0.672                     32   1
4        1            89       66             23             94       28.1  0.167                     21   0
5        0            137      40             35             168      43.1  2.288                     33   1
6        5            116      74             0              0        25.6  0.201                     30   0
7        3            78       50             32             88       31    0.248                     26   1
8        10           115      0              0              0        35.3  0.134                     29   0
9        2            197      70             45             543      30.5  0.158                     53   1
10       8            125      96             0              0        0     0.232                     54   1
Program:
import csv
import random
import math
def loadcsv(filename):
    lines = csv.reader(open(filename, "r"))
    dataset = list(lines)
    for i in range(len(dataset)):
        # converting strings into numbers for processing
        dataset[i] = [float(x) for x in dataset[i]]
    return dataset

def separatebyclass(dataset):
    separated = {}  # dictionary of classes 1 and 0
    # creates a dictionary of classes 1 and 0 where the values are
    # the instances belonging to each class
    for i in range(len(dataset)):
        vector = dataset[i]
        if (vector[-1] not in separated):
            separated[vector[-1]] = []
        separated[vector[-1]].append(vector)
    return separated

def mean(numbers):
    return sum(numbers)/float(len(numbers))

def stdev(numbers):
    avg = mean(numbers)
    variance = sum([pow(x-avg, 2) for x in numbers])/float(len(numbers)-1)
    return math.sqrt(variance)

def summarize(dataset):  # creates a list of (mean, stdev) tuples per attribute
    summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
    del summaries[-1]  # excluding labels +ve or -ve
    return summaries

def summarizebyclass(dataset):
    separated = separatebyclass(dataset)
    # summaries is a dict mapping each class value to its (mean, std) tuples
    summaries = {}
    for classvalue, instances in separated.items():
        summaries[classvalue] = summarize(instances)  # mean and std per attribute
    return summaries
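# The extraction jumps from summarizebyclass straight to getaccuracy; the
# helper functions below are a sketch of the standard versions that main()
# relies on.
def splitdataset(dataset, splitratio):
    # split the data set randomly into train and test parts
    trainsize = int(len(dataset) * splitratio)
    trainset = []
    copy = list(dataset)
    while len(trainset) < trainsize:
        index = random.randrange(len(copy))
        trainset.append(copy.pop(index))
    return [trainset, copy]

def calculateprobability(x, mean, stdev):
    # Gaussian probability density of attribute value x
    exponent = math.exp(-(math.pow(x - mean, 2) / (2 * math.pow(stdev, 2))))
    return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent

def calculateclassprobabilities(summaries, inputvector):
    # multiply the per-attribute densities for each class
    probabilities = {}
    for classvalue, classsummaries in summaries.items():
        probabilities[classvalue] = 1
        for i in range(len(classsummaries)):
            mean, stdev = classsummaries[i]
            probabilities[classvalue] *= calculateprobability(inputvector[i], mean, stdev)
    return probabilities

def predict(summaries, inputvector):
    # pick the class with the largest probability
    probabilities = calculateclassprobabilities(summaries, inputvector)
    bestlabel, bestprob = None, -1
    for classvalue, probability in probabilities.items():
        if bestlabel is None or probability > bestprob:
            bestprob = probability
            bestlabel = classvalue
    return bestlabel

def getpredictions(summaries, testset):
    return [predict(summaries, testset[i]) for i in range(len(testset))]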
def getaccuracy(testset, predictions):
    correct = 0
    for i in range(len(testset)):
        if testset[i][-1] == predictions[i]:
            correct += 1
    return (correct/float(len(testset))) * 100.0
def main():
    filename = 'naivedata.csv'
    splitratio = 0.67
    dataset = loadcsv(filename)
    trainingset, testset = splitdataset(dataset, splitratio)
    print('Split {0} rows into train={1} and test={2} rows'.format(
        len(dataset), len(trainingset), len(testset)))
    # prepare model
    summaries = summarizebyclass(trainingset)
    # test model
    predictions = getpredictions(summaries, testset)
    accuracy = getaccuracy(testset, predictions)
    print('Accuracy of the classifier is: {0}%'.format(accuracy))

main()
Output:
6. Assuming a set of documents that need to be classified, use the naïve Bayesian Classifier
model to perform this task. Built-in Java classes/API can be used to write the program.
Calculate the accuracy, precision, and recall for your data set.
LEARN_NAIVE_BAYES_TEXT (Examples, V)
Examples is a set of text documents along with their target values. V is the set of all
possible target values. This function learns the probability terms P(wk|vj), describing the
probability that a randomly drawn word from a document in class vj will be the English
word wk. It also learns the class prior probabilities P(vj).
1. Collect all words, punctuation, and other tokens that occur in Examples
   Vocabulary ← the set of all distinct words and other tokens occurring in any text
   document from Examples
2. Calculate the required P(vj) and P(wk|vj) probability terms
   For each target value vj in V do
       docsj ← the subset of documents from Examples for which the target value is vj
       P(vj) ← |docsj| / |Examples|
       Textj ← a single document created by concatenating all members of docsj
       n ← total number of distinct word positions in Textj
       For each word wk in Vocabulary
           nk ← number of times word wk occurs in Textj
           P(wk|vj) ← (nk + 1) / (n + |Vocabulary|)
CLASSIFY_NAIVE_BAYES_TEXT (Doc)
positions ← all word positions in Doc that contain tokens found in Vocabulary
Return vNB, where
vNB = argmax vj∈V  P(vj) · ∏ i∈positions P(ai|vj)
Program:
import pandas as pd
msg=pd.read_csv('naivetext.csv',names=['message','label'])
msg['labelnum']=msg.label.map({'pos':1,'neg':0})
X=msg.message
y=msg.labelnum
print(X)
print(y)
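# The extraction jumps from the label printout to the document-term
# DataFrame; a minimal bridge, assuming sklearn's train_test_split and
# CountVectorizer as in the standard version of this program.
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# split the messages into train and test sets
xtrain, xtest, ytrain, ytest = train_test_split(X, y)

# turn each message into a vector of word counts
count_vect = CountVectorizer()
xtrain_dtm = count_vect.fit_transform(xtrain)
xtest_dtm = count_vect.transform(xtest)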
df = pd.DataFrame(xtrain_dtm.toarray(), columns=count_vect.get_feature_names())
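# The training and evaluation steps are also missing from the extraction;
# a sketch using sklearn's MultinomialNB, whose printed metrics line up
# with the output below.
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

clf = MultinomialNB().fit(xtrain_dtm, ytrain)
predicted = clf.predict(xtest_dtm)

print('Accuracy of the classifier is', metrics.accuracy_score(ytest, predicted))
print('Confusion matrix')
print(metrics.confusion_matrix(ytest, predicted))
print('The value of Precision', metrics.precision_score(ytest, predicted))
print('The value of Recall', metrics.recall_score(ytest, predicted))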
Output:
Accuracy of the classifier is 0.8
Confusion matrix
[[2 1]
 [0 2]]
The value of Precision 0.6666666666666666
The value of Recall 1.0
Basic knowledge
Unique words (the vocabulary, |Vocabulary| = 10):
< I, loved, the, movie, hated, a, great, good, poor, acting >

Doc   I   loved  the  movie  hated  a   great  good  poor  acting  Class
1     1   1      1    1      -      -   -      -     -     -       +
2     1   -      1    1      1      -   -      -     -     -       -
3     -   -      -    2      -      1   1      1     -     -       +
4     -   -      -    -      -      -   -      -     1     1       -
5     -   -      -    1      -      1   1      1     -     1       +

Documents with class +:
Doc   I   loved  the  movie  hated  a   great  good  poor  acting  Class
1     1   1      1    1      -      -   -      -     -     -       +
3     -   -      -    2      -      1   1      1     -     -       +
5     -   -      -    1      -      1   1      1     -     1       +

P(+) = 3/5 = 0.6

Documents with class -:
Doc   I   loved  the  movie  hated  a   great  good  poor  acting  Class
2     1   -      1    1      1      -   -      -     -     -       -
4     -   -      -    -      -      -   -      -     1     1       -

P(-) = 2/5 = 0.4

Each word probability is estimated with Laplace smoothing, P(wk|vj) = (nk + 1)/(n + |Vocabulary|).
The negative documents contain n = 6 word positions in total, so, for example:

P(movie|-) = (1 + 1)/(6 + 10) = 0.125        P(poor|-) = (1 + 1)/(6 + 10) = 0.125

For a new document containing the words I, hated, the, poor, acting, then,

P(-) P(I|-) P(hated|-) P(the|-) P(poor|-) P(acting|-)
    = 0.4 * 0.125 * 0.125 * 0.125 * 0.125 * 0.125
    = 1.22 × 10^-5
7. Write a program to construct a Bayesian network considering medical data. Use this model
to demonstrate the diagnosis of heart patients using standard Heart Disease Data Set. You can
use Java/Python ML library classes/API
Theory
A Bayesian network is a directed acyclic graph in which each
edge corresponds to a conditional dependency, and each node
corresponds to a unique random variable.
Program:
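The program body is missing from the extraction; a sketch of the usual pgmpy-based
version, assuming a 'heart.csv' copy of the UCI Heart Disease data (the column names and
the network structure below are assumptions based on common versions of this program):
import pandas as pd
from pgmpy.models import BayesianModel  # BayesianNetwork in newer pgmpy releases
from pgmpy.estimators import MaximumLikelihoodEstimator
from pgmpy.inference import VariableElimination

heartdisease = pd.read_csv('heart.csv')

# each directed edge encodes a conditional dependency
model = BayesianModel([('age', 'heartdisease'), ('sex', 'heartdisease'),
                       ('exang', 'heartdisease'), ('cp', 'heartdisease'),
                       ('heartdisease', 'restecg'), ('heartdisease', 'chol')])
model.fit(heartdisease, estimator=MaximumLikelihoodEstimator)

# query the network: probability of heart disease given observed evidence
infer = VariableElimination(model)
q = infer.query(variables=['heartdisease'], evidence={'restecg': 1})
print(q)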
8. Apply EM algorithm to cluster a set of data stored in a .CSV file. Use the same data set
for clustering using the k-Means algorithm. Compare the results of these two algorithms
and comment on the quality of clustering. You can use Java/Python ML library classes/API.
Theory
Expectation-Maximization (EM) fits a Gaussian mixture model by alternating between
assigning each point a soft responsibility for every cluster (E-step) and re-estimating the
Gaussian parameters from those responsibilities (M-step). k-Means is its hard-assignment
counterpart: each point is assigned to its nearest centroid, and the centroids are recomputed
as cluster means.
Program:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets, preprocessing
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

iris = datasets.load_iris()
X = pd.DataFrame(iris.data)
X.columns = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width']
y = pd.DataFrame(iris.target)
y.columns = ['Targets']

model = KMeans(n_clusters=3)
model.fit(X)
plt.figure(figsize=(14,7))
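# The extraction skips from plt.figure to the GMM prediction; a sketch of
# the usual middle section (the colormap and the standardization step are
# taken from common versions of this program).
colormap = np.array(['red', 'lime', 'black'])

# Plot 1: the true iris classes
plt.subplot(2, 2, 1)
plt.scatter(X.Petal_Length, X.Petal_Width, c=colormap[y.Targets], s=40)
plt.title('Real Classification')

# Plot 2: the k-Means clustering
plt.subplot(2, 2, 2)
plt.scatter(X.Petal_Length, X.Petal_Width, c=colormap[model.labels_], s=40)
plt.title('K Mean Classification')

# standardize the features, then fit a Gaussian mixture with EM
scaler = preprocessing.StandardScaler()
scaler.fit(X)
xs = pd.DataFrame(scaler.transform(X), columns=X.columns)
gmm = GaussianMixture(n_components=3)
gmm.fit(xs)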
y_gmm = gmm.predict(xs)
#y_cluster_gmm
plt.subplot(2, 2, 3)
plt.scatter(X.Petal_Length, X.Petal_Width, c=colormap[y_gmm], s=40)
plt.title('GMM Classification')
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')
plt.show()
9. Write a program to implement k-Nearest Neighbour algorithm to classify the iris data set.
Print both correct and wrong predictions. Java/Python ML library classes can be used for this
problem.
Training algorithm:
For each training example (x, f(x)), add the example to the list training_examples
Classification algorithm:
Given a query instance xq to be classified,
Let x1 . . . xk denote the k instances from training_examples that are nearest to xq
Return
f̂(xq) ← argmax v∈V Σ(i=1..k) δ(v, f(xi)),  where δ(a, b) = 1 if a = b and 0 otherwise,
i.e., the most common class value among the k nearest training examples. (For a
real-valued target function, the mean value of the k nearest training examples is
returned instead.)
Data Set:
""" Iris Plants Dataset, dataset contains 150 (50 in each of three
classes)Number of Attributes: 4 numeric, predictive attributes and
the Class
"""
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
""" The x variable contains the first four columns of the dataset
(i.e. attributes) while y contains the labels.
"""
x = iris.data
y = iris.target
""" Splits the dataset into 70% train data and 30% test data. This
means that out of total 150 records, the training set will contain
105 records and the test set contains 45 of those records
"""
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
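The fitting and evaluation steps are missing from the extraction; a sketch of the usual
remainder of this program (k = 5 is an assumption):
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)

# print both correct and wrong predictions, as the problem statement asks
for i in range(len(x_test)):
    result = 'Correct' if y_pred[i] == y_test[i] else 'Wrong'
    print(x_test[i], 'predicted:', y_pred[i], 'actual:', y_test[i], '->', result)

print('Confusion Matrix')
print(metrics.confusion_matrix(y_test, y_pred))
print('Accuracy Metrics')
print(metrics.classification_report(y_test, y_pred))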
Confusion Matrix
[[20 0 0]
[ 0 10 0]
[ 0 1 14]]
Accuracy Metrics
True positives: data points labelled as positive that are actually positive
False positives: data points labelled as positive that are actually negative
True negatives: data points labelled as negative that are actually negative
False negatives: data points labelled as negative that are actually positive
Precision = TP / (TP + FP)        Recall = TP / (TP + FN)
F1-Score: F1 = 2 × (Precision × Recall) / (Precision + Recall)
10. Implement the non-parametric Locally Weighted Regression algorithm in order to fit data
points. Select appropriate data set for your experiment and draw graphs.
Regression:
Regression is a technique from statistics that is used to predict values of a desired
target quantity when the target quantity is continuous.
In regression, we seek to identify (or estimate) a continuous variable y associated with
a given input vector x.
y is called the dependent variable.
x is called the independent variable.
Loess/Lowess Regression:
Loess regression is a nonparametric technique that uses
local weighted regression to fit a smooth curve through
points in a scatter plot.
Lowess Algorithm:
Locally weighted regression is a very powerful nonparametric model used in
statistical learning.
Given a dataset X, y, we attempt to find a model parameter β(x) that minimizes the
residual sum of weighted squared errors.
The weights are given by a kernel function (k or w), which can be chosen arbitrarily.
Algorithm
1. Read the given data sample into X and the curve values (linear or non-linear) into Y
2. Set the value of the smoothing parameter (free parameter) τ
3. Set the point of interest x0, which is a subset of X
4. Determine the weight matrix W using:
   w(x, x0) = exp(−(x − x0)² / (2τ²))
5. Determine the value of the model parameter β using:
   β = (XᵀWX)⁻¹ XᵀWy
6. Prediction = x0 · β
Program:
import numpy as np
from bokeh.plotting import figure, show, output_notebook
from bokeh.layouts import gridplot
from bokeh.io import push_notebook
n = 1000
# generate dataset
X = np.linspace(-3, 3, num=n)
print("The Data Set ( 10 Samples) X :\n",X[1:10])
Y = np.log(np.abs(X ** 2 - 1) + .5)
print("The Fitting Curve Data Set (10 Samples) Y
:\n",Y[1:10])
# jitter X
X += np.random.normal(scale=.1, size=n)
print("Normalised (10 Samples) X :\n",X[1:10])
show(gridplot([
[plot_lwr(10.), plot_lwr(1.)],
[plot_lwr(0.1), plot_lwr(0.01)]]))
Output
Program (an alternative implementation, using the tips data set):
# -*- coding: utf-8 -*-
"""
Spyder Editor
This is a temporary script file.
"""
from numpy import *
from os import listdir
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np1
import numpy.linalg as np
from scipy.stats.stats import pearsonr

def kernel(point, xmat, k):
    m, n = np1.shape(xmat)
    weights = np1.mat(np1.eye((m)))
    for j in range(m):
        diff = point - xmat[j]
        weights[j, j] = np1.exp(diff * diff.T / (-2.0 * k**2))
    return weights

def localWeight(point, xmat, ymat, k):
    wei = kernel(point, xmat, k)
    W = (xmat.T * (wei * xmat)).I * (xmat.T * (wei * ymat.T))
    return W

def localWeightRegression(xmat, ymat, k):
    m, n = np1.shape(xmat)
    ypred = np1.zeros(m)
    for i in range(m):
        ypred[i] = xmat[i] * localWeight(xmat[i], xmat, ymat, k)
    return ypred

# load the data set and read the bill and tip columns
# (the file name 'tips.csv' is the usual choice for this program)
data = pd.read_csv('tips.csv')
bill = np1.array(data.total_bill)
tip = np1.array(data.tip)

# preparing X: add a column of ones in front of bill
mbill = np1.mat(bill)
mtip = np1.mat(tip)  # mat converts the 1-D arrays to 2-D matrix form
m = np1.shape(mbill)[1]  # print(m): all 244 records are counted in m
one = np1.mat(np1.ones(m))
X = np1.hstack((one.T, mbill.T))  # create a stack of ones and bill
# print(X)

# set k (the bandwidth) here
ypred = localWeightRegression(X, mtip, 0.3)
SortIndex = X[:, 1].argsort(0)
xsort = X[SortIndex][:, 0]

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.scatter(bill, tip, color='green')
ax.plot(xsort[:, 1], ypred[SortIndex], color='red', linewidth=5)
plt.xlabel('Total bill')
plt.ylabel('Tip')
plt.show()