IR Final Lab Manual
(Sem-I) [2023-24]
ASSIGNMENT No: 08
Title: Write a program to construct a Bayesian network considering medical data. Use this model to
demonstrate the diagnosis of heart patients using the standard Heart Disease Data Set. (You can use
Java/Python ML library classes/APIs.)
Problem Statement: To construct a Bayesian network considering medical data. Use this model to
demonstrate the diagnosis of heart patients using the standard Heart Disease Data Set. (You can use
Java/Python ML library classes/APIs.)
Prerequisite:
Basics of Python
Learning Objectives:
Learn to construct a Bayesian network considering medical data and use it to demonstrate the
diagnosis of heart patients.
Outcomes:
After completion of this assignment, students are able to understand how to construct a Bayesian
network considering medical data.
Theory:
A Bayesian network is a directed acyclic graph in which each edge corresponds to a conditional
dependency, and each node corresponds to a unique random variable.
A Bayesian network consists of two major parts: a directed acyclic graph and a set of conditional
probability distributions.
The directed acyclic graph is a set of random variables represented by nodes.
The conditional probability distribution of a node (random variable) is defined for
every possible outcome of its preceding causal (parent) node(s).
For illustration, consider the following example. Suppose we attempt to turn on our
computer, but the computer does not start (observation/evidence). We would like to know
which of the possible causes of computer failure is more likely. In this simplified
illustration, we assume only two possible causes of this misfortune: electricity failure and
computer malfunction.
The corresponding directed acyclic graph is depicted in the figure below.
Fig: Directed acyclic graph representing two independent possible causes of a computer failure.
The goal is to calculate the posterior conditional probability distribution of each of the possible
unobserved causes given the observed evidence, i.e. P [Cause | Evidence].
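To make this concrete, here is a minimal pgmpy sketch of the two-cause network above. The prior and
conditional probability values are illustrative assumptions, not given in the text, and in older
pgmpy versions the model class is named BayesianModel rather than BayesianNetwork:

from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# Two independent causes, one common observed effect
model = BayesianNetwork([('Electricity', 'ComputerFails'),
                         ('Malfunction', 'ComputerFails')])

# Illustrative CPDs (the numbers are assumed for this sketch)
cpd_e = TabularCPD('Electricity', 2, [[0.9], [0.1]])    # 0 = ok, 1 = failure
cpd_m = TabularCPD('Malfunction', 2, [[0.95], [0.05]])  # 0 = ok, 1 = malfunction
cpd_c = TabularCPD('ComputerFails', 2,
                   # P(ComputerFails | Electricity, Malfunction)
                   [[0.99, 0.1, 0.1, 0.01],   # computer starts
                    [0.01, 0.9, 0.9, 0.99]],  # computer does not start
                   evidence=['Electricity', 'Malfunction'],
                   evidence_card=[2, 2])
model.add_cpds(cpd_e, cpd_m, cpd_c)

# P(Cause | Evidence): posterior of each cause given that the computer failed
infer = VariableElimination(model)
print(infer.query(['Electricity'], evidence={'ComputerFails': 1}))
print(infer.query(['Malfunction'], evidence={'ComputerFails': 1}))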
Data Set:
Attribute Information:
1. age: age in years
2. sex: sex (1 = male; 0 = female)
3. cp: chest pain type (values 1-4)
4. trestbps: resting blood pressure (in mm Hg)
5. chol: serum cholesterol (in mg/dl)
6. fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
7. restecg: resting electrocardiographic results (values 0-2)
8. thalach: maximum heart rate achieved
9. exang: exercise-induced angina (1 = yes; 0 = no)
10. oldpeak: ST depression induced by exercise relative to rest
11. slope: the slope of the peak exercise ST segment (values 1-3)
12. ca: number of major vessels (0-3) colored by fluoroscopy
13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
14. Heartdisease: diagnosis of heart disease (0 = absence; 1-4 = presence)
age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  Heartdisease
63   1    1   145       233   1    2        150      0      2.3      3      0   6     0
67   1    4   160       286   0    2        108      1      1.5      2      3   3     2
67   1    4   120       229   0    2        129      1      2.6      2      2   7     1
41   0    2   130       204   0    2        172      0      1.4      1      0   3     0
62   0    4   140       268   0    2        160      0      3.6      3      2   3     3
60   1    4   130       206   0    2        132      1      2.4      2      2   7     4
Program:
# display the data
print('Few examples from the dataset are given below')
print(heartDisease.head())

# Learning CPDs using Maximum Likelihood Estimators
print('\nLearning CPD using Maximum Likelihood estimators')
model.fit(heartDisease, estimator=MaximumLikelihoodEstimator)
Output:
[5 rows x 14 columns]
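The program fragment above shows only the display and fitting steps. For completeness, here is a
hedged sketch of a full program; the file name heart.csv and the chosen network edges are
illustrative assumptions, not the only valid structure:

import pandas as pd
from pgmpy.models import BayesianNetwork
from pgmpy.estimators import MaximumLikelihoodEstimator
from pgmpy.inference import VariableElimination

# Load the heart disease data (file name assumed)
heartDisease = pd.read_csv('heart.csv')

# Network structure over a few attributes (edges assumed for this sketch)
model = BayesianNetwork([('age', 'Heartdisease'),
                         ('sex', 'Heartdisease'),
                         ('cp', 'Heartdisease'),
                         ('Heartdisease', 'restecg'),
                         ('Heartdisease', 'chol')])

# Learn the CPDs from the data by maximum likelihood estimation
model.fit(heartDisease, estimator=MaximumLikelihoodEstimator)

# Diagnosis: P(Heartdisease | evidence)
infer = VariableElimination(model)
q = infer.query(variables=['Heartdisease'], evidence={'restecg': 1})
print(q)

In practice, continuous columns such as age and chol are often discretized before fitting, so that
the learned conditional probability tables stay small.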
ASSIGNMENT No: 09
Title: Implement the Agglomerative hierarchical clustering algorithm using an appropriate dataset.
Problem Statement: Implement the Agglomerative hierarchical clustering algorithm using an
appropriate dataset.
Prerequisite:
Basics of Python
Learning Objectives:
Learn to implement the Agglomerative hierarchical clustering algorithm using an appropriate dataset.
Outcomes:
After completion of this assignment, students are able to understand how to implement the
Agglomerative hierarchical clustering algorithm using an appropriate dataset.
Theory:
In data mining and statistics, hierarchical clustering analysis is a method of cluster analysis
that seeks to build a hierarchy of clusters, i.e., a tree-type structure.
In machine learning, clustering is an unsupervised learning technique that groups data based on the
similarity between data points. There are different types of clustering algorithms in machine
learning:
Connectivity-based clustering: This type of clustering algorithm builds clusters based on the
connectivity between the data points. Example: Hierarchical clustering.
Centroid-based clustering: This type of clustering algorithm forms clusters around the centroids of
the data points. Example: K-Means clustering, K-Mode clustering.
Distribution-based clustering: This type of clustering algorithm is modeled using statistical
distributions. It assumes that the data points in a cluster are generated from a particular probability
distribution, and the algorithm aims to estimate the parameters of the distribution to group similar
data points into clusters. Example: Gaussian Mixture Models (GMM).
Density-based clustering: This type of clustering algorithm groups together data points that lie in
high-density regions and separates points in low-density regions. The basic idea is that it
identifies regions of the data space that have a high density of data points and groups those
points together into clusters. Example: DBSCAN (Density-Based Spatial Clustering of
Applications with Noise).
Hierarchical clustering
Hierarchical clustering is a connectivity-based clustering model that groups the data points together
that are close to each other based on the measure of similarity or distance. The assumption is that data
points that are close to each other are more similar or related than data points that are farther apart.
A dendrogram, a tree-like figure produced by hierarchical clustering, depicts the hierarchical
relationships between groups. Individual data points are located at the bottom of the dendrogram,
while the largest clusters, which include all the data points, are located at the top. In order to generate
different numbers of clusters, the dendrogram can be sliced at various heights.
The dendrogram is created by iteratively merging or splitting clusters based on a measure of similarity
or distance between data points. Clusters are divided or merged repeatedly until all data points are
contained within a single cluster, or until the predetermined number of clusters is attained.
We can look at the dendrogram and measure the height at which the branches of the dendrogram form
distinct clusters to calculate the ideal number of clusters. The dendrogram can be sliced at this height
to determine the number of clusters.
Agglomerative Clustering is one of the most common hierarchical clustering techniques.
Dataset: Credit Card Dataset.
Assumption: The clustering technique assumes that each data point is similar enough to the other
data points that, at the start, all the data can be assumed to form a single cluster.
Step 1: Importing the required libraries and loading the data
import os
import pandas as pd
import numpy as np

# Change to the folder that holds the dataset, then load it
os.chdir(r'C:\Users\Dev\Desktop\Kaggle\Credit_Card')
X = pd.read_csv('CC_GENERAL.csv')
X = X.drop('CUST_ID', axis = 1)
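Steps 2-4 (handling missing values, scaling/normalizing, and reducing the data to two dimensions)
are elided above, but the dendrogram code in Step 5 relies on a two-column frame X_principal. The
following is a minimal sketch of that preprocessing, with assumed column names P1 and P2:

from sklearn.preprocessing import StandardScaler, normalize
from sklearn.decomposition import PCA

# Step 2: handle missing values
X = X.fillna(X.mean())

# Step 3: scale and normalize the data
X_scaled = StandardScaler().fit_transform(X)
X_normalized = normalize(X_scaled)

# Step 4: reduce to two principal components for visualization
pca = PCA(n_components = 2)
X_principal = pd.DataFrame(pca.fit_transform(X_normalized),
                           columns = ['P1', 'P2'])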
Dendrograms are used to decide how a given cluster should be split into smaller clusters.
Step 5: Visualizing the working of the dendrogram
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as shc

plt.figure(figsize =(8, 8))
plt.title('Visualising the data')
Dendrogram = shc.dendrogram(shc.linkage(X_principal, method ='ward'))
plt.show()
To determine the optimal number of clusters from the dendrogram, find the longest vertical distance
that can be drawn without crossing any horizontal (merge) line, and draw a horizontal cut through
it; the number of vertical branches the cut intersects gives the number of clusters.
The dendrogram shows that the optimal number of clusters for the given data should be 2.
Step 6: Building and visualizing the different clustering models for different values of k
a) k = 2
from sklearn.cluster import AgglomerativeClustering

ac2 = AgglomerativeClustering(n_clusters = 2)
ac3 = AgglomerativeClustering(n_clusters = 3)
ac4 = AgglomerativeClustering(n_clusters = 4)
ac5 = AgglomerativeClustering(n_clusters = 5)
ac6 = AgglomerativeClustering(n_clusters = 6)
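As a sketch of the visualization for one value of k (the P1/P2 columns come from the assumed PCA
step above):

# Fit the k = 2 model and color the points by cluster label
plt.figure(figsize =(6, 6))
plt.scatter(X_principal['P1'], X_principal['P2'],
            c = ac2.fit_predict(X_principal), cmap = 'rainbow')
plt.show()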
We now determine the optimal number of clusters using a mathematical technique; here, we will use
silhouette scores for the purpose.
Step 7: Evaluating the different models and visualizing the results
k = [2, 3, 4, 5, 6]
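A sketch of the evaluation loop, assuming the models and X_principal defined above:

from sklearn.metrics import silhouette_score

silhouette_scores = []
for ac in [ac2, ac3, ac4, ac5, ac6]:
    # Fit each model and score the resulting labels
    labels = ac.fit_predict(X_principal)
    silhouette_scores.append(silhouette_score(X_principal, labels))

# Plot silhouette score against the number of clusters
plt.bar(k, silhouette_scores)
plt.xlabel('Number of clusters', fontsize = 20)
plt.ylabel('Silhouette Score', fontsize = 20)
plt.show()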
With the help of the silhouette scores, it is concluded that the optimal number of clusters for the
given data and clustering technique is 2.
Conclusion: In this way, we implemented the Agglomerative hierarchical clustering algorithm using
an appropriate dataset.
ASSIGNMENT No: 10
Title: Implement the Page Rank Algorithm. (Use Python or Beautiful Soup for implementation.)
Problem Statement: Implement the Page Rank Algorithm. (Use Python or Beautiful Soup for
implementation.)
Prerequisite:
Basics of Python
Learning Objectives:
Learn to implement the Page Rank Algorithm. (Use Python or Beautiful Soup for implementation.)
Outcomes:
After completion of this assignment, students are able to understand how to implement the Page
Rank Algorithm.
Theory:
PageRank (PR) is an algorithm used by Google Search to rank websites in its search engine results.
PageRank was named after Larry Page, one of the founders of Google. PageRank is a way of measuring
the importance of website pages. According to Google:
PageRank works by counting the number and quality of links to a page to determine a rough estimate
of how important the website is. The underlying assumption is that more important websites are likely
to receive more links from other websites.
It is not the only algorithm used by Google to order search engine results, but it is the first algorithm
that was used by the company, and it is the best-known.
Note that the NetworkX implementation of this centrality measure, used below, does not support
multigraphs.
Algorithm
The PageRank algorithm outputs a probability distribution used to represent the likelihood that a person
randomly clicking on links will arrive at any particular page. PageRank can be calculated for collections
of documents of any size. It is assumed in several research papers that the distribution is evenly divided
among all documents in the collection at the beginning of the computational process. The PageRank
computations require several passes, called "iterations", through the collection to adjust approximate
PageRank values to more closely reflect the theoretical true value.
Simplified algorithm
Assume a small universe of four web pages: A, B, C, and D. Links from a page to itself, or multiple outbound links
from one single page to another single page, are ignored. PageRank is initialized to the same value for all pages.
In the original form of PageRank, the sum of PageRank over all pages was the total number of pages on the web
at that time, so each page in this example would have an initial value of 1. However, later versions of PageRank,
and the remainder of this section, assume a probability distribution between 0 and 1. Hence the initial value for
each page in this example is 0.25.
The PageRank transferred from a given page to the targets of its outbound links upon the next iteration is
divided equally among all outbound links.
If the only links in the system were from pages B, C, and D to A, each link would transfer 0.25 PageRank to A
upon the next iteration, for a total of 0.75.
Suppose instead that page B had a link to pages C and A, page C had a link to page A, and page D had links to all
three pages. Thus, upon the first iteration, page B would transfer half of its existing value, or 0.125, to page A
and the other half, or 0.125, to page C. Page C would transfer all of its existing value, 0.25, to the only page it
links to, A. Since D had three outbound links, it would transfer one-third of its existing value, or approximately
0.083, to A. At the completion of this iteration, page A will have a PageRank of approximately 0.458.
In other words, the PageRank conferred by an outbound link is equal to the document's own PageRank score
divided by the number of its outbound links, L(v).
In the general case, the PageRank value for any page u can be expressed as:

    PR(u) = Σ_{v ∈ B_u} PR(v) / L(v)

i.e. the PageRank value for a page u is dependent on the PageRank values for each page v contained in the set
B_u (the set containing all pages linking to page u), divided by the number L(v) of links from page v. The
algorithm also involves a damping factor d (commonly 0.85) in the calculation of PageRank. It models a
"random surfer" who follows an outbound link with probability d and jumps to a random page with
probability 1 - d, so that for a collection of N pages:

    PR(u) = (1 - d) / N + d · Σ_{v ∈ B_u} PR(v) / L(v)
The computation below follows the power-iteration implementation of PageRank from the NetworkX
library:

import networkx as nx

def pagerank(G, alpha=0.85, personalization=None, max_iter=100,
             tol=1.0e-6, nstart=None, weight='weight', dangling=None):
    """Return the PageRank of the nodes in the graph.

    Parameters
    ----------
    G : graph
        A NetworkX graph. Undirected graphs will be converted to a directed
        graph with two directed edges for each undirected edge.

    Returns
    -------
    pagerank : dictionary
        Dictionary of nodes with PageRank as value

    Notes
    -----
    The eigenvector calculation is done by the power iteration method
    and has no guarantee of convergence. The iteration will stop
    after max_iter iterations or an error tolerance of
    number_of_nodes(G)*tol has been reached.

    The PageRank algorithm was designed for directed graphs but this
    algorithm does not check if the input graph is directed and will
    execute on undirected graphs by converting each edge in the
    directed graph to two edges.
    """
    if len(G) == 0:
        return {}
    if not G.is_directed():
        D = G.to_directed()
    else:
        D = G
    # Create a copy in (right) stochastic form
    W = nx.stochastic_graph(D, weight=weight)
    N = W.number_of_nodes()
    # Choose a fixed starting vector if not given
    if nstart is None:
        x = dict.fromkeys(W, 1.0 / N)
    else:
        # Normalized nstart vector
        s = float(sum(nstart.values()))
        x = dict((k, v / s) for k, v in nstart.items())
    if personalization is None:
        # Assign a uniform personalization vector if not given
        p = dict.fromkeys(W, 1.0 / N)
    else:
        s = float(sum(personalization.values()))
        p = dict((k, v / s) for k, v in personalization.items())
    if dangling is None:
        # Use the personalization vector if a dangling vector is not given
        dangling_weights = p
    else:
        s = float(sum(dangling.values()))
        dangling_weights = dict((k, v / s) for k, v in dangling.items())
    dangling_nodes = [n for n in W if W.out_degree(n, weight=weight) == 0.0]
    # Power iteration: make up to max_iter iterations
    for _ in range(max_iter):
        xlast = x
        x = dict.fromkeys(xlast.keys(), 0)
        danglesum = alpha * sum(xlast[n] for n in dangling_nodes)
        for n in x:
            # Distribute each node's rank among its out-neighbours
            for nbr in W[n]:
                x[nbr] += alpha * xlast[n] * W[n][nbr][weight]
            x[n] += danglesum * dangling_weights[n] + (1.0 - alpha) * p[n]
        # Check convergence using the L1 norm
        err = sum(abs(x[n] - xlast[n]) for n in x)
        if err < N * tol:
            return x
    raise nx.PowerIterationFailedConvergence(max_iter)
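As a brief usage sketch, the built-in nx.pagerank can be run on the four-page example described
earlier (the edge list mirrors the illustration in which B links to A and C, C links to A, and D
links to all three pages):

import networkx as nx

# Directed web graph from the simplified example
G = nx.DiGraph()
G.add_edges_from([('B', 'A'), ('B', 'C'), ('C', 'A'),
                  ('D', 'A'), ('D', 'B'), ('D', 'C')])

# PageRank with the usual damping factor of 0.85
ranks = nx.pagerank(G, alpha=0.85)
for page, rank in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(rank, 3))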
Conclusion: Thus, in this way, the PageRank centrality measure is calculated for the given graph.