DWM Manual
Laboratory Manual
Year 2023-2024
Prepared by
Prof. Diksha D. Bhave and Dr. Uttara Gogate
VISION
To impart high quality technical education for creating competent and ethically strong
professionals with capabilities of accepting new challenges.
MISSION
Our efforts are dedicated to imparting high quality technical education based on a balanced
program of instruction and practical experience.
Our strength is to provide value-based technical education to develop core competencies
and ethics for overall personality development.
Our endeavor is to impart in-depth knowledge and versatility to meet global
challenges.
Shivajirao S Jondhale College of Engineering, Dombivli (E)
Department of Computer Engineering
VISION
To impart quality technical education in the department of Computer Engineering for
creating competent and ethically strong engineers with capabilities of accepting new
challenges.
MISSION
Ability to use software methodology and various software tools for developing
system programs, high quality web apps, and solutions to complex real world
problems.
Ability to identify and use suitable data structures and analyze various algorithms
for given problems from different domains.
Lab Objectives:
Lab Outcomes:
Learner will be able to…
1. Design data warehouse and perform various OLAP operations.
2. Implement data mining algorithms like classification.
3. Implement clustering algorithms on a given set of data sample.
4. Implement Association rule mining & web mining algorithm.
List of Experiments
1. One case study on building Data Warehouse/Data Mart. Write a detailed problem statement and design dimensional modelling (creation of star and snowflake schema).
2. Implementation of all dimension tables and the fact table based on the Experiment No. 1 case study.
3. To study various OLAP operations such as Slice, Dice, Roll-up, Drill-down and Pivot.
4. Implement Naïve Bayes classification algorithm using JAVA.
5. Introduction to the WEKA tool.
6. Implement Decision tree classification algorithm using the WEKA tool.
7. Implement K-means clustering algorithm using JAVA.
8. Implementation of HITS algorithm.
9. Implement Apriori association algorithm using the WEKA tool.
10. Implement linear regression algorithm using Python.
Experiments to CO Mapping
Experiment No - 01
Title: Dimensional modeling
Aim:-
One case study on building Data Warehouse/Data Mart. Write a detailed
problem statement and design dimensional modelling (creation of star
and snowflake schema).
Theory:
“A data warehouse is a subject-oriented, integrated, time-variant, and non-
volatile collection of data in support of management’s decision-making process.”
A schema is a logical description of the entire database. It includes the name and
description of records of all record types, including all associated data items and
aggregates. Much like a database, a data warehouse also requires a schema to be
maintained. A database uses a relational model, while a data warehouse uses a Star,
Snowflake, or Fact Constellation schema.
1. Star Schema:-
Each dimension in a star schema is represented with only one dimension table.
This dimension table contains the set of attributes.
For example, consider the sales data of a company with respect to four dimensions,
namely time, item, branch, and location.
There is a fact table at the center. It contains the keys to each of the four
dimensions.
The fact table also contains the measures, namely dollars sold and units sold.
Each dimension has only one dimension table and each table holds a set of attributes.
For example, the location dimension table contains the attribute set {location_key,
street, city, province_or_state, country}. This constraint may cause data redundancy.
For example, "Vancouver" and "Victoria" are both cities in the Canadian province
of British Columbia; the entries for such cities cause data redundancy along the
attributes province_or_state and country.
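As a minimal, hedged sketch of how such a star schema could be created (using Python's built-in sqlite3 module; the table and column names are illustrative, based on the sales example above):
import sqlite3

# create the five tables of the star schema: four dimension tables
# around one central fact table that stores their keys plus measures
conn = sqlite3.connect("sales_dw.db")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT, quarter TEXT, year INTEGER);
CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE dim_branch (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT, province_or_state TEXT, country TEXT);
CREATE TABLE fact_sales (
    time_key INTEGER REFERENCES dim_time(time_key),
    item_key INTEGER REFERENCES dim_item(item_key),
    branch_key INTEGER REFERENCES dim_branch(branch_key),
    location_key INTEGER REFERENCES dim_location(location_key),
    dollars_sold REAL,
    units_sold INTEGER
);
""")
conn.commit()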
2. Snowflake schema:
The main benefit of the snowflake schema is that it uses smaller disk space.
It is easier to add a dimension to the schema.
Query performance is reduced because queries must join multiple tables.
The primary challenge that you will face while using the snowflake schema
is that you need to perform more maintenance effort because of the larger
number of lookup tables.
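Continuing the same hedged sketch, snowflaking the location dimension means normalizing it into lookup tables, so that a city/province/country combination is stored only once:
import sqlite3

conn = sqlite3.connect("sales_dw.db")
cur = conn.cursor()
cur.executescript("""
-- city details move into their own lookup table ...
CREATE TABLE dim_city (city_key INTEGER PRIMARY KEY, city TEXT, province_or_state TEXT, country TEXT);
-- ... so the location dimension keeps only a reference to the city row;
-- "Vancouver"/"British Columbia"/"Canada" is now stored once, not per street
CREATE TABLE dim_location_sf (location_key INTEGER PRIMARY KEY, street TEXT, city_key INTEGER REFERENCES dim_city(city_key));
""")
conn.commit()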
DWM Lab/ V 2
Shivajirao S Jondhale College of Engineering, Dombivli (E)
Department of Computer Engineering
Result:
Experiment No - 02
Title: Dimension and Fact Tables
Aim:-
Implementation of all dimension tables and the fact table based on the Experiment No. 1 case study.
DIMENSION TABLE:
A dimension table keeps records of the dimensions. Each dimension may have a table
associated with it; this is called the dimension table. A dimension table consists of the
textual descriptions of the dimensions. For example, for a class data warehouse we need
dimensions containing information such as subjects offered, student name, student roll
number, student marks, etc.; these dimensions allow us to keep track of student performance.
1) A dimension table is related to the fact table with the help of a simple primary key.
2) Dimension tables contain the constraints used to link them to the fact table.
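A hedged sketch of loading such tables (reusing the illustrative sqlite3 schema from the Experiment 1 sketch; the row values are made up). Dimension rows are inserted first so that the fact row can reference their primary keys:
import sqlite3

conn = sqlite3.connect("sales_dw.db")  # schema from the Experiment 1 sketch
cur = conn.cursor()
# populate one row of each dimension table
cur.execute("INSERT INTO dim_time VALUES (1, '2023-07-01', 'July', 'Q3', 2023)")
cur.execute("INSERT INTO dim_item VALUES (1, 'Laptop', 'Acme', 'Electronics')")
cur.execute("INSERT INTO dim_branch VALUES (1, 'Dombivli', 'Retail')")
cur.execute("INSERT INTO dim_location VALUES (1, 'MG Road', 'Vancouver', 'British Columbia', 'Canada')")
# the fact row holds only the four foreign keys plus the measures
cur.execute("INSERT INTO fact_sales VALUES (1, 1, 1, 1, 55000.0, 2)")
conn.commit()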
Steps to Create Dimensional Data Modelling:
Conclusion: - Thus we have implemented all dimension tables and the fact table based on the Experiment No. 1 case study.
Experiment No - 03
Title: OLAP Operations
Aim:-
To study various OLAP operations such as Slice, Dice, Roll-up, Drill-down and Pivot.
OLAP Operations:
Roll-up
Drill-down
Slice and dice
Pivot (rotate)
Roll-up Operation:
Roll-up performs aggregation on a data cube in either of the following ways:
By climbing up a concept hierarchy for a dimension, or
By dimension reduction.
Drill-down Operation:
Drill-down is the reverse operation of roll-up. It is performed in either of the
following ways:
By stepping down a concept hierarchy for a dimension, or
By introducing a new dimension.
Initially, the concept hierarchy was "day < month < quarter < year".
On drilling down, the time dimension is descended from the level of quarter
to the level of month.
When drill-down is performed, one or more dimensions of the data cube
are added.
It navigates from less detailed data to highly detailed data.
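A minimal sketch of roll-up and drill-down along this time hierarchy, assuming pandas and an illustrative month-level sales table (all names and values are made up):
import pandas as pd

# illustrative sales data at month granularity
sales = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "location": ["Vancouver"] * 6,
    "dollars_sold": [100, 150, 120, 200, 180, 160],
})

# roll-up: climb the hierarchy month -> quarter by aggregating
print(sales.groupby("quarter", as_index=False)["dollars_sold"].sum())

# drill-down: descend back to the more detailed month-level view
print(sales.groupby(["quarter", "month"], as_index=False)["dollars_sold"].sum())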
Slice Operation:
The slice operation selects one particular dimension from a given cube and provides
a new sub-cube. For example, a slice can be performed for the dimension "time" using
the criterion time = "Q1".
Dice Operation:
The dice operation selects two or more dimensions from a given cube and provides a
new sub-cube.
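In the same hedged pandas sketch from the roll-up example, slice and dice reduce to row selection on one or several dimensions:
# slice: fix a single dimension, e.g. time = "Q1"
q1_slice = sales[sales["quarter"] == "Q1"]

# dice: select on two or more dimensions at once
sub_cube = sales[(sales["quarter"] == "Q1") &
                 (sales["location"] == "Vancouver") &
                 (sales["month"].isin(["Jan", "Feb"]))]
print(q1_slice)
print(sub_cube)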
Experiment No - 04
Title: Naïve Bayes Algorithm
Aim:-
Implementation of Naïve Bayes classifier using Java.
Theory: -
The Naive Bayesian classifier is a statistical classifier. It is based on
Bayes' theorem and the maximum a posteriori hypothesis. The naive assumption
of class conditional independence is often made to reduce the computational cost.
Bayesian classifiers are statistical classifiers. They can predict class
membership probabilities, such as the probability that a given sample belongs to a
particular class. Naive Bayesian classifiers assume that the effect of an attribute
value on a given class is independent of the values of the other attributes. This
assumption is called class conditional independence. It is made to simplify the
computation involved and, in this sense, is considered "naive".
Bayes' Theorem: - Let X = {x1, x2, . . ., xn} be a sample whose components represent
values measured on a set of n attributes. In Bayesian terms, X is considered "evidence".
Let H be some hypothesis, such as that the data X belongs to a specific class C. For
classification problems, our goal is to determine P(H|X), the probability that the
hypothesis H holds given the "evidence" (i.e., the observed data sample X). In other
words, we are looking for the probability that sample X belongs to class C, given
that we know the attribute description of X. P(H|X) is the a posteriori probability of
H conditioned on X. For example, suppose our data samples have the attributes age and
income, and that sample X is a 35-year-old customer with an income of $40,000.
Suppose that H is the hypothesis that our customer will buy a computer. Then
P(H|X) is the probability that customer X will buy a computer given that we know
the customer's age and income. In contrast, P(H) is the a priori probability of H. For
our example, this is the probability that any given customer will buy a computer,
regardless of age, income, or any other information.
IV. Given data sets with many attributes, it would be computationally expensive to
compute P(X|Ci). In order to reduce computation in evaluating P(X|Ci)P(Ci), the
naive assumption of class conditional independence is made. This presumes that the
values of the attributes are conditionally independent of one another, given the class
label of the sample. Mathematically this means that
P(X|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci).
Program:
import java.io.*;

public class NaiveBayes
{
    public static void main(String args[]) throws IOException
    {
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));

        // the opening of the original listing did not survive extraction; this
        // prelude is reconstructed from the prompts visible in the output below
        System.out.print("Enter the no. of columns: ");
        int n = Integer.parseInt(br.readLine());
        System.out.print("Enter the no. of rows: ");
        int m = Integer.parseInt(br.readLine());

        System.out.println("Enter the names of the attributes and keep the classification column last and its value as yes or no:");
        String col[] = new String[n];
        for (int i = 0; i < n; i++)
        {
            System.out.print("Column" + i + ":");
            col[i] = br.readLine();
        }

        String a[][] = new String[m][n];
        for (int i = 0; i < m; i++)
        {
            System.out.println("Enter the values of row" + (i + 1) + ":");
            for (int j = 0; j < n; j++)
            {
                System.out.print(col[j] + ":");
                a[i][j] = br.readLine();
            }
        }

        double count_yes = 0.0;
        double count_no = 0.0;

        // echo the table and count the Yes/No frequencies of the class column
        System.out.println("The Table that you have entered is:");
        for (int i = 0; i < n; i++)
        {
            System.out.print(col[i] + "\t\t");
        }
        System.out.println();
        for (int i = 0; i < m; i++)
        {
            for (int j = 0; j < n; j++)
            {
                if (j == (n - 1))
                {
                    if (a[i][j].equals("Yes") || a[i][j].equals("yes"))
                    {
                        count_yes++;
                    }
                    else if (a[i][j].equals("No") || a[i][j].equals("no"))
                    {
                        count_no++;
                    }
                }
                System.out.print(a[i][j] + "\t\t");
            }
            System.out.println();
        }

        // prior probabilities P(yes) and P(no)
        System.out.println("p(yes):" + count_yes + "/" + m);
        System.out.println("p(no):" + count_no + "/" + m);

        // read the unseen tuple: n - 1 attribute values, the class is unknown
        String decision[] = new String[n - 1];
        double decisiony_ctr[] = new double[n - 1];
        double decisionn_ctr[] = new double[n - 1];
        System.out.println("Enter the unseen tuple for classification:");
        for (int i = 0; i < n - 1; i++)
        {
            System.out.print(col[i] + ":");
            decision[i] = br.readLine();
            System.out.println();
        }

        // count how often each attribute value co-occurs with Yes and with No
        for (int i = 0; i < m; i++)
        {
            for (int j = 0; j < n - 1; j++)
            {
                if (a[i][j].equals(decision[j]))
                {
                    if (a[i][n - 1].equals("Yes"))
                    {
                        decisiony_ctr[j]++;
                    }
                    else if (a[i][n - 1].equals("No"))
                    {
                        decisionn_ctr[j]++;
                    }
                }
            }
        }
        for (int j = 0; j < n - 1; j++)
        {
            System.out.println(decision[j] + "/Yes:" + decisiony_ctr[j]);
            System.out.println(decision[j] + "/No:" + decisionn_ctr[j]);
        }

        // multiply the class-conditional probabilities (naive independence)
        double yprobability = 1.0;
        double nprobability = 1.0;
        for (int j = 0; j < n - 1; j++)
        {
            yprobability *= (decisiony_ctr[j] / count_yes);
            nprobability *= (decisionn_ctr[j] / count_no);
        }

        // multiply by the priors and choose the class with the larger posterior
        yprobability = (count_yes / m) * yprobability;
        nprobability = (count_no / m) * nprobability;
        if (yprobability > nprobability)
        {
            System.out.println("The Decision is Yes");
        }
        else
        {
            System.out.println("The Decision is No");
        }
    }
}
******Output******
C:\Program Files\Java\jdk1.8.0_31\bin>javac NaiveBayes.java
C:\Program Files\Java\jdk1.8.0_31\bin>java NaiveBayes
Enter the no. of columns: 5
Enter the no. of rows: 14
Enter the names of the attributes and keep the classification column last and its value
as yes or no:
Column0:Age
Column1:Income
Column2:Student
Column3:Credit
Column4:Buys
Enter the values of row1:
Age:<=30
Income:High
Student:No
Credit:Fair
Buys:No
Enter the values of row2:
Age:<=30
Income:High
Student:No
Credit:Excellent
Buys:No
Enter the values of row3:
Age:31-40
Income:High
Student:No
Credit:Fair
Buys:Yes
Enter the values of row4:
Age:>40
Income:Medium
Student:No
Credit:Fair
Buys:Yes
Enter the values of row5:
Age:>40
Income:Low
Student:Yes
Credit:Fair
Buys:Yes
Enter the values of row6:
Age:>40
Income:Low
Student:Yes
Credit:Excellent
Buys:No
Enter the values of row7:
Age:31-40
Income:Low
Student:Yes
Credit:Excellent
Buys:Yes
Enter the values of row8:
Age:<=30
Income:Medium
Student:No
Credit:Fair
Buys:No
Enter the values of row9:
Age:<=30
Income:Low
Student:Yes
Credit:Fair
Buys:Yes
Enter the values of row10:
Age:>40
Income:Medium
Student:Yes
Credit:Fair
Buys:Yes
Enter the values of row11:
Age:<=30
Income:Medium
Student:Yes
Credit:Excellent
Buys:Yes
Enter the values of row12:
Age:31-40
Income:Medium
Student:No
Credit:Excellent
Buys:Yes
Enter the values of row13:
Age:31-40
Income:High
Student:Yes
Credit:Fair
Buys:Yes
Enter the values of row14:
Age:>40
Income:Medium
Student:No
Credit:Excellent
Buys:No
The Table that you have entered is:
Age Income Student Credit Buys
<=30 High No Fair No
<=30 High No Excellent No
31-40 High No Fair Yes
>40 Medium No Fair Yes
>40 Low Yes Fair Yes
>40 Low Yes Excellent No
31-40 Low Yes Excellent Yes
<=30 Medium No Fair No
<=30 Low Yes Fair Yes
>40 Medium Yes Fair Yes
<=30 Medium Yes Excellent Yes
31-40 Medium No Excellent Yes
31-40 High Yes Fair Yes
>40 Medium No Excellent No
p(yes):9.0/14
p(no):5.0/14
Enter the unseen tuple for classification:
Age:<=30
Income:Medium
Student:Yes
Credit:Fair
<=30/Yes:2.0
<=30/No:3.0
Medium/Yes:4.0
Medium/No:2.0
Yes/Yes:6.0
Yes/No:1.0
Fair/Yes:6.0
Fair/No:2.0
The Decision is Yes
Experiment No - 05
Title: Introduction to the WEKA Tool
Aim:-
Introduction to the WEKA tool.
Preprocessing Data:
At the very top of the window, just below the title bar, there is a row of tabs. Only the
first tab, 'Preprocess', is active at the moment because there is no dataset open. The
first three buttons at the top of the Preprocess section enable you to load data into
WEKA. Data can be imported from a file in various formats: ARFF, CSV, C4.5, or
binary. It can also be read from a URL or from an SQL database (using JDBC). The
easiest and most common way of getting data into WEKA is to store it as an
Attribute-Relation File Format (ARFF) file.
File Conversion
We assume that all your data is stored in a Microsoft Excel spreadsheet "weather.xls".
WEKA expects the data file to be in Attribute-Relation File Format (ARFF).
Before you apply an algorithm to your data, you need to convert your data
into a comma-separated file and then into ARFF format (a file with the .arff
extension). To save your data in comma-separated format, select the 'Save As…'
menu item from the Excel 'File' pull-down menu. In the ensuing dialog box, select
'CSV (Comma Delimited)' from the file type pop-up menu, enter a name for the file,
and click the 'Save' button. Ignore all messages that appear by clicking 'OK'.
Open this file with Microsoft Word.
The rows of the original spreadsheet are converted into lines of text, where the
elements are separated from each other by commas. In this file you need to change
the first line, which holds the attribute names, into the header structure that makes up
the beginning of an ARFF file: add a @relation tag with the dataset's name, an
@attribute tag for each attribute, and a @data tag, as shown below.
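For instance, the converted weather file's beginning might look like this (a hedged sketch; the attribute names and values are modeled on WEKA's bundled weather dataset and may differ from your spreadsheet):
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes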
Choose 'Save As…' from the 'File' menu and specify 'Text Only with Line Breaks'
as the file type. Enter a file name and click the 'Save' button. Rename the file so
that it has the .arff extension to indicate that it is in ARFF format.
Opening a file from the local file system
Click on the 'Open file…' button. It brings up a dialog box allowing you to browse for
the data file on the local file system; choose the "weather.arff" file. Some databases
can save data in CSV format. In this case, you can select a CSV file from the
local file system. If you would like to convert this file into ARFF format, you can
click on the 'Save' button; WEKA automatically creates an ARFF file from your CSV file.
Experiment No - 06
Title: Decision Tree Classification
Aim:-
Implement Decision tree classification algorithm using the WEKA tool.
Theory: -
Decision tree learning uses a decision tree (as a predictive model) to go from
observations about an item (represented in the branches) to conclusions about the
item's target value (represented in the leaves). It is one of the predictive modelling
approaches used in statistics, data mining and machine learning. Tree models where
the target variable can take a discrete set of values are called classification trees; in
these tree structures, leaves represent class labels and branches
represent conjunctions of features that lead to those class labels. Decision trees
where the target variable can take continuous values (typically real numbers) are
called regression trees. In decision analysis, a decision tree can be used to visually
and explicitly represent decisions and decision making. In data mining, a decision
tree describes data (but the resulting classification tree can be an input for decision
making). This section deals with decision trees in data mining.
Decision tree learning is a method commonly used in data mining. The goal is to
create a model that predicts the value of a target variable based on several input
variables. Each interior node corresponds to one of the input variables; there are
edges to children for each of the possible values of that input variable. Each leaf
represents a value of the target variable given the values of the input variables
represented by the path from the root to the leaf.
A decision tree is a simple representation for classifying examples. For this section,
assume that all of the input features have finite discrete domains, and there is a
single target feature called the "classification". Each element of the domain of the
classification is called a class.
A decision tree is a set of conditions organized hierarchically in such a way that the
final decision can be determined by following the conditions that are fulfilled from
the root of the tree to one of its leaves.
They are easily understandable: they build a model (made up of rules) that is easy
for the user to understand.
They work over a single table, and over a single attribute at a time.
They are one of the most widely used data mining techniques.
Decision trees are based on the "divide and conquer" strategy.
There are two possible types of divisions or partitions (see the sketch after this list):
Nominal partitions: a nominal attribute may lead to a split with as many
branches as there are values for the attribute.
Numerical partitions: typically, they allow partitions like "X > a" and "X ≤ a".
Partitions relating two different attributes are not permitted.
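A minimal, hedged code sketch of the same idea, using scikit-learn's DecisionTreeClassifier on an illustrative, numerically encoded weather-style table (the data values and encodings are made up for illustration):
from sklearn.tree import DecisionTreeClassifier, export_text

# outlook: 0=sunny, 1=overcast, 2=rainy; windy: 0=false, 1=true
X = [[0, 85, 0], [0, 80, 1], [1, 83, 0], [2, 70, 0],
     [2, 68, 0], [2, 65, 1], [1, 64, 1], [0, 72, 0]]
y = ["no", "no", "yes", "yes", "yes", "no", "yes", "no"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, y)

# each root-to-leaf path is a conjunction of conditions ending in a class label
print(export_text(tree, feature_names=["outlook", "temperature", "windy"]))
print(tree.predict([[1, 75, 0]]))  # classify an unseen tuple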
Result:
Experiment No - 07
Title: K-means Clustering
Aim:-
Implement K-means clustering algorithm using JAVA.
Theory: -
Clustering is the process of partitioning or grouping a given set of patterns
into disjoint clusters. This is done such that patterns in the same cluster are alike and
patterns belonging to two different clusters are different. Clustering has been a
widely studied problem in a variety of application domains including neural
networks, AI, and statistics.
K-Means clustering intends to partition n objects into k clusters in which each
object belongs to the cluster with the nearest mean. This method produces
exactly k different clusters of greatest possible distinction. The best number of
clusters k leading to the greatest separation (distance) is not known a priori and
must be computed from the data. The objective of K-Means clustering is to
minimize the total intra-cluster variance, i.e., the squared error function
J = Σⱼ₌₁ᵏ Σᵢ ||xᵢ⁽ʲ⁾ − cⱼ||²,
where xᵢ⁽ʲ⁾ are the objects assigned to cluster j and cⱼ is the centre of cluster j.
Algorithm
1. Cluster the data into k groups, where k is predefined.
2. Select k points at random as cluster centers.
3. Assign objects to their closest cluster center according to the Euclidean distance function.
4. Calculate the centroid or mean of all objects in each cluster.
5. Repeat steps 3 and 4 until the same points are assigned to each cluster in
consecutive rounds.
Program:
import java.io.*;

public class Kmeans
{
    public static void main(String args[]) throws IOException
    {
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));

        // the opening of the original listing did not survive extraction; this
        // prelude is reconstructed from the prompts visible in the output below
        System.out.print("Enter the no. of elements: ");
        int n = Integer.parseInt(br.readLine());

        int count, sum;
        // elements[i][0] holds the value, elements[i][1] its cluster index
        int elements[][] = new int[n][2];
        System.out.println("Enter the elements:");
        for (int i = 0; i < n; i++)
            elements[i][0] = Integer.parseInt(br.readLine());

        System.out.println("No. of clusters:");
        int noc = Integer.parseInt(br.readLine());

        double m[] = new double[noc];
        System.out.println("Enter initial means:");
        for (int i = 0; i < noc; i++)
            m[i] = Double.parseDouble(br.readLine());

        int temp[][] = new int[n][2];
        int itn = 1;
        while (true)
        {
            // assignment step: move each element to its nearest mean
            for (int i = 0; i < n; i++)
            {
                for (int j = 0; j < noc; j++)
                {
                    if (Math.abs(elements[i][0] - m[elements[i][1]]) > Math.abs(elements[i][0] - m[j]))
                    {
                        elements[i][1] = j;
                    }
                }
            }
            // update step: recompute each mean as the average of its cluster
            for (int j = 0; j < noc; j++)
            {
                sum = 0;
                count = 0;
                for (int i = 0; i < n; i++)
                {
                    if (elements[i][1] == j)
                    {
                        sum = sum + elements[i][0];
                        count++;
                    }
                }
                m[j] = (double) sum / count;
            }
            // stop when no element changed its cluster since the last round
            int c = 0;
            for (int i = 0; i < n; i++)
                if (elements[i][1] == temp[i][1])
                    c++;
            if (c == n)
                break;
            for (int i = 0; i < n; i++)
                temp[i][1] = elements[i][1];

            // report the means and cluster members for this iteration
            System.out.println("\nIteration" + itn);
            for (int j = 0; j < noc; j++)
                System.out.println("Mean" + (j + 1) + ":" + m[j]);
            for (int i = 0; i < noc; i++)
            {
                System.out.println("Cluster" + (i + 1) + ":");
                for (int j = 0; j < n; j++)
                {
                    if (elements[j][1] == i)
                    {
                        System.out.println("" + elements[j][0] + "");
                    }
                }
                System.out.println();
            }
            itn++;
        }
    }
}
******Output******
C:\Program Files\Java\jdk1.8.0_31\bin>javac Kmeans.java
C:\Program Files\Java\jdk1.8.0_31\bin>java Kmeans
Enter the no. of elements: 9
Enter the elements:2 4 10 12 3 20 30 11 25
No. of clusters: 2
Enter initial means: 3 4
Iteration1
Mean1:2.5
Mean2:16.0
Cluster1: 2 3
Cluster2: 4 10 12 20 30 11 25
Iteration2
Mean1:3.0
Mean2:18.0
Cluster1: 2 4 3
Cluster2: 10 12 20 30 11 25
Iteration3
Mean1:4.75
Mean2:19.6
Cluster1: 2 4 10 3
Cluster2: 12 20 30 11 25
Iteration4
Mean1:7.0
Mean2:25.0
Cluster1: 2 4 10 12 3 11
Cluster2: 20 30 25
Experiment No - 08
Title: HITS Algorithm
Aim:-
Implementation of HITS algorithm.
Algorithm:
In the HITS algorithm, the hub and authority scores are calculated iteratively:
1. Initialize the hub score and the authority score of every node to 1.
2. Apply the authority update rule.
3. Apply the hub update rule.
4. Normalize the hub and authority scores.
5. Repeat from step 2 until the scores converge.
The algorithm performs a series of iterations, each consisting of two basic steps:
Authority update: Update each node's authority score to be equal to the sum of the hub
scores of each node that points to it. That is, a node is given a high authority score by being
linked from pages that are recognized as Hubs for information.
Hub update: Update each node's hub score to be equal to the sum of the authority scores of
each node that it points to. That is, a node is given a high hub score by linking to nodes that are
considered to be authorities on the subject.
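Written as formulas, with auth(p) and hub(p) denoting the two scores of page p, one iteration computes
auth(p) = Σ hub(q) over all pages q that link to p,
hub(p) = Σ auth(q) over all pages q that p links to,
after which both score vectors are normalized (for example, each is divided by the square root of the sum of squares of its entries).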
Program:
# importing modules
import networkx as nx
import matplotlib.pyplot as plt

G = nx.DiGraph()
G.add_edges_from([('A', 'D'), ('B', 'C'), ('B', 'E'), ('C', 'A'),
                  ('D', 'C'), ('E', 'D'), ('E', 'B'), ('E', 'F')])
# (the rest of the original listing was lost; the ending below is a plausible
# reconstruction using networkx's built-in HITS implementation)
nx.draw(G, with_labels=True)
plt.show()

hubs, authorities = nx.hits(G)
print("Hub Scores:", hubs)
print("Authority Scores:", authorities)
Experiment No - 09
Title: Apriori Algorithm
Aim:
Implement Apriori association algorithm using the WEKA tool.
Theory:
Association rule generation is usually split up into two separate steps:
1. First, minimum support is applied to find all frequent itemsets in a database.
2. Second, these frequent itemsets and the minimum confidence constraint are used
to form rules.
While the second step is straightforward, the first step needs more attention. Finding
all frequent itemsets in a database is difficult since it involves searching all possible
itemsets (item combinations). The set of possible itemsets is the power set over I and
has size 2^n − 1 (excluding the empty set, which is not a valid itemset). Although the
size of the power set grows exponentially in the number of items n in I, efficient
search is possible using the downward-closure property of support (also called anti-
monotonicity), which guarantees that for a frequent itemset all its subsets are also
frequent, and thus that for an infrequent itemset all its supersets must also be infrequent.
Exploiting this property, efficient algorithms (e.g., Apriori and Eclat) can find all
frequent itemsets.
Association Rule Mining:
Given a set of transactions, association rule mining aims to find the rules
which enable us to predict the occurrence of a specific item based on the occurrences
of the other items in the transaction.
Association Rule:
For a transaction database, an association rule is given as X => Y, where X and Y
are subsets of the item set A. The rule X => Y is specified by two factors: the support
factor and the confidence factor.
The rule X => Y has a confidence factor z, which means that z% of the transactions
in the database that support X also support Y.
Similarly, the rule X => Y has a support factor s, which means that s% of the
transactions contain X ∪ Y.
We use association rules to find transactions in which the presence of X tends to
imply the presence of Y as well.
Apriori Algorithm:
An algorithm for frequent itemset mining and association rule learning over
transactional databases, proposed by Agrawal and Srikant in 1994.
Apriori Property:
All non-empty subsets of a frequent itemset must also be frequent; equivalently, if an
itemset is infrequent, all of its supersets are infrequent.
Steps In Apriori
1. In the first iteration of the algorithm, each item is taken as a 1-itemset candidate.
The algorithm counts the occurrences of each item.
2. Let there be some minimum support, min_sup (e.g., 2). The set of 1-itemsets
whose occurrence count satisfies min_sup is determined. Only those candidates
whose count is greater than or equal to min_sup are taken ahead to the next
iteration; the others are pruned.
3. Next, frequent 2-itemsets with min_sup are discovered. In the join step, the
2-itemset candidates are generated by joining the set of frequent 1-itemsets
with itself.
4. The 2-itemset candidates are pruned using the min_sup threshold value, so the
table will then contain only the 2-itemsets satisfying min_sup.
5. The next iteration forms 3-itemsets using the join and prune steps. This iteration
follows the antimonotone property: the 2-itemset subsets of each candidate
3-itemset must themselves satisfy min_sup. If all 2-itemset subsets are frequent,
the superset may be frequent; otherwise it is pruned.
6. The next step forms 4-itemsets by joining the 3-itemsets with themselves and
pruning any candidate whose subsets do not meet the min_sup criteria. The
algorithm stops when no further frequent itemsets can be generated (see the
sketch after this list).
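The join-and-prune loop above can be sketched in a few lines of Python (a hedged, self-contained illustration over a toy transaction list; it mines the frequent itemsets only, not the rules):
from itertools import combinations

# toy transaction database (illustrative)
transactions = [{"bread", "milk"},
                {"bread", "diaper", "beer", "eggs"},
                {"milk", "diaper", "beer", "cola"},
                {"bread", "milk", "diaper", "beer"},
                {"bread", "milk", "diaper", "cola"}]
min_sup = 2  # absolute support threshold

def support(itemset):
    # number of transactions that contain every item of the itemset
    return sum(itemset <= t for t in transactions)

# frequent 1-itemsets
items = {i for t in transactions for i in t}
frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_sup}]

k = 2
while frequent[-1]:
    prev = frequent[-1]
    # join step: combine frequent (k-1)-itemsets into k-itemset candidates
    candidates = {a | b for a in prev for b in prev if len(a | b) == k}
    # prune step: every (k-1)-subset must be frequent, then check support
    frequent.append({c for c in candidates
                     if all(frozenset(s) in prev for s in combinations(c, k - 1))
                     and support(c) >= min_sup})
    k += 1

for size, itemsets in enumerate(frequent[:-1], start=1):
    print(size, sorted(sorted(s) for s in itemsets))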
Advantages:
Disadvantages:
Conclusion: In this way we have implemented the Apriori algorithm using the WEKA tool.
Experiment No - 10
Title: Linear Regression
Aim:-
Implement linear regression algorithm using Python.
What Is Regression?
Regression searches for relationships among variables. For example, you can observe
several employees of some company and try to understand how their salaries depend
on features such as experience, level of education, role, the city they work in, and so on.
Similarly, you can try to establish a mathematical dependence of the prices of houses
on their areas, numbers of bedrooms, distances to the city center, and so on.
In other words, you need to find a function that maps some features or variables to
others sufficiently well.
- The dependent features are called the dependent variables, outputs, or responses.
- The independent features are called the independent variables, inputs,
or predictors.
It is a common practice to denote the outputs with 𝑦 and the inputs with 𝑥. If there are
two or more independent variables, they can be represented as the vector 𝐱 = (𝑥₁,
…, 𝑥ᵣ), where 𝑟 is the number of inputs.
Typically, you need regression to answer whether and how some phenomenon
influences another, or how several variables are related. For example, you can use it
to determine if and to what extent experience or gender impacts salaries.
Regression is also useful when you want to forecast a response using a new set of
predictors. For example, you could try to predict electricity consumption of a
household for the next hour given the outdoor temperature, time of day, and number
of residents in that household.
Linear Regression
Linear regression is probably one of the most important and widely used regression
techniques. It’s among the simplest regression methods. One of its main advantages
is the ease of interpreting results.
Problem Formulation
To get the best weights, you usually minimize the sum of squared residuals (SSR)
for all observations 𝑖 = 1, …, 𝑛: SSR = Σᵢ(𝑦ᵢ − 𝑓(𝐱ᵢ))². This approach is called
the method of ordinary least squares.
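For simple linear regression, 𝑓(𝑥) = 𝑏₀ + 𝑏₁𝑥, minimizing SSR has the standard closed-form solution
𝑏₁ = Σᵢ(𝑥ᵢ − x̄)(𝑦ᵢ − ȳ) / Σᵢ(𝑥ᵢ − x̄)²,  𝑏₀ = ȳ − 𝑏₁x̄,
where x̄ and ȳ are the sample means; the program below computes exactly these two coefficients.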
Regression Performance
The variation of actual responses 𝑦ᵢ, 𝑖 = 1, …, 𝑛, occurs partly due to the dependence
on the predictors 𝐱ᵢ. However, there is also an additional inherent variance of the
output.
The coefficient of determination, denoted 𝑅², tells you how much of the variation
in 𝑦 can be explained by the dependence on 𝐱 using the particular regression model.
Larger 𝑅² indicates a better fit and means that the model can better explain the
variation of the output with different inputs.
The value 𝑅² = 1 corresponds to SSR = 0, that is, to the perfect fit, since the values of
predicted and actual responses agree completely.
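Concretely, 𝑅² is commonly computed as 𝑅² = 1 − SSR/SST, where SST = Σᵢ(𝑦ᵢ − ȳ)² is the total sum of squares of the actual responses around their mean ȳ.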
import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)

    # means of x and y vectors
    m_x = np.mean(x)
    m_y = np.mean(y)

    # cross-deviation and deviation about x
    # (the body of this function was incomplete in the original listing and is
    # reconstructed here from the standard least-squares formulas above)
    SS_xy = np.sum(y * x) - n * m_y * m_x
    SS_xx = np.sum(x * x) - n * m_x * m_x

    # regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1 * m_x
    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color="m", marker="o", s=30)

    # predicted response vector and the fitted line
    y_pred = b[0] + b[1] * x
    plt.plot(x, y_pred, color="g")

    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')
    plt.show()
def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))

    # plotting the regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()
Estimated coefficients:
b_0 = 1.2363636363636363
b_1 = 1.1696969696969697
The graph obtained (the scatter points plus the fitted regression line) is displayed by plot_regression_line().