DWDM Record
CERTIFICATE
This is to certify that this is the BONAFIDE RECORD of the work done in
______________________________________________________________ Laboratory by
Total number of experiments held: _____ Total number of experiments done: _____
EXTERNAL EXAMINER
INDEX
There are different options available in MySQL Administrator. After a connection is established through MySQL Administrator, we use another tool, SQLyog Enterprise, for building and managing tables in a database. Below we can see the window of SQLyog Enterprise.
The left-side navigation shows the different databases and their related tables. Now we are going to build tables and populate them with data through SQL queries. These tables can then be used for building the data warehouse.
In the above two windows, we created a database named "sample" and, in that database, two tables named "user_details" and "hockey" through SQL queries. Now we are going to populate (fill) the two tables with sample data through SQL queries, as shown in the windows below.
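The same tables can also be created and populated programmatically. Below is a minimal Python sketch assuming the mysql-connector-python package and the "sample" database created above; the column layout and credentials are illustrative placeholders, not the exact schema used in the record.

import mysql.connector

# connect to the "sample" database created through MySQL Administrator
conn = mysql.connector.connect(host="localhost", user="root",
                               password="<password>", database="sample")
cur = conn.cursor()

# create a table (illustrative columns)
cur.execute("""CREATE TABLE IF NOT EXISTS user_details (
                   id INT PRIMARY KEY,
                   name VARCHAR(50),
                   city VARCHAR(50))""")

# populate one sample row
cur.execute("INSERT INTO user_details VALUES (%s, %s, %s)", (1, "Ravi", "Hyderabad"))

conn.commit()
cur.close()
conn.close()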
Through MySQL Administrator and SQLyog, we can import databases from other sources (.xls, .csv, .sql) and export our databases as backups for further processing. We can also connect MySQL to other applications for data analysis and reporting.
In the above window, the left-side navigation bar shows a database named "sales_dw" in which six tables (dimcustdetails, dimcustomer, dimproduct, dimsalesperson, dimstores, factproductsales) have been created.
After creating the tables in the database, we use a tool called "Microsoft Visual Studio 2012 for Business Intelligence" for building multi-dimensional models.
The ETL Process
Extract:
The Extract step covers the data extraction from the source system and makes it accessible for further processing. The main objective of the extract step is to retrieve all the required data from the source system with as few resources as possible. The extract step should be designed in a way that it does not negatively affect the source system in terms of performance, response time or any kind of locking.
There are several ways to perform the extract:
▪ Update notification - if the source system is able to provide a notification that
a record has been changed and describe the change, this is the easiest way to
get the data.
▪ Incremental extract - some systems may not be able to provide notification
that an update has occurred, but they are able to identify which records have
been modified and provide an extract of such records. During further ETL
steps, the system needs to identify these changes and propagate them down
(a short sketch of this approach follows the list). Note that with a daily
extract, we may not be able to handle deleted records properly.
▪ Full extract - some systems are not able to identify which data has been
changed at all, so a full extract is the only way one can get the data out of the
system. The full extract requires keeping a copy of the last extract in the same
format in order to be able to identify changes. Full extract handles deletions as
well.
▪ When using incremental or full extracts, the extract frequency is extremely
important, particularly for full extracts, where the data volumes can be in the
tens of gigabytes.
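As a rough illustration of the incremental approach, the sketch below pulls only the rows changed since a saved watermark. sqlite3 is used here only for brevity (a MySQL connection would work the same way), and the table and column names (orders, updated_at) are illustrative assumptions.

import sqlite3

last_extract = "2024-01-01 00:00:00"        # watermark saved by the previous run

conn = sqlite3.connect("source.db")
changed_rows = conn.execute(
    "SELECT * FROM orders WHERE updated_at > ?", (last_extract,)
).fetchall()
conn.close()

print(len(changed_rows), "records changed since", last_extract)
# After loading these rows, persist the current timestamp as the new watermark.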
Clean:
The cleaning step is one of the most important as it ensures the quality of the data in
the data warehouse.
Cleaning should perform basic data unification rules, such as the following (a short sketch follows the list):
▪ Making identifiers unique (sex categories Male/Female/Unknown, M/F/null,
Man/Woman/Not Available are translated to standard
Male/Female/Unknown)
▪ Convert null values into a standardized Not Available/Not Provided value
▪ Convert phone numbers, ZIP codes to a standardized form
▪ Validate address fields, convert them into proper naming, e.g.
Street/St/St./Str./Str
▪ Validate address fields against each other (State/Country, City/State, City/ZIP
code, City/Street).
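A minimal pandas sketch of the unification rules listed above; the column names, mappings and sample values are illustrative assumptions.

import pandas as pd

df = pd.DataFrame({
    "sex":   ["M", "Woman", None, "Male"],
    "phone": ["(040) 123-4567", "040 1234567", None, "0401234567"],
    "zip":   ["500 072", "500072", "50007", None],
})

# unify sex identifiers to Male/Female/Unknown
sex_map = {"M": "Male", "Man": "Male", "F": "Female", "Woman": "Female"}
df["sex"] = df["sex"].replace(sex_map).fillna("Unknown")

# standardize phone numbers (digits only) and ZIP codes, convert nulls
df["phone"] = df["phone"].str.replace(r"[^0-9]", "", regex=True).fillna("Not Available")
df["zip"] = df["zip"].str.replace(" ", "", regex=False).fillna("Not Available")

print(df)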
Transform:
The transform step applies a set of rules to transform the data from the source to the
target. This includes converting any measured data to the same dimension (i.e. a
conformed dimension) using the same units so that they can later be joined. The
transformation step also requires joining data from several sources.
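A minimal pandas sketch of the transform step: converting one source's amounts to a common unit and joining the combined sales with a product dimension. All names and figures are illustrative assumptions.

import pandas as pd

eu_sales = pd.DataFrame({"product_id": [1, 2], "amount_eur": [100.0, 250.0]})
us_sales = pd.DataFrame({"product_id": [1, 3], "amount_usd": [120.0, 80.0]})
products = pd.DataFrame({"product_id": [1, 2, 3], "name": ["Pen", "Book", "Bag"]})

EUR_TO_USD = 1.10                                    # assumed fixed rate for the example
eu_sales["amount_usd"] = eu_sales["amount_eur"] * EUR_TO_USD

# conform both sources to the same unit, then join to the product dimension
sales = pd.concat([eu_sales[["product_id", "amount_usd"]],
                   us_sales[["product_id", "amount_usd"]]])
fact = sales.merge(products, on="product_id", how="left")

print(fact.groupby("name")["amount_usd"].sum())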
➢ Perform various OLAP operations such as slice, dice, roll-up, drill-up and pivot.
ANS:
OLAP operations are implemented practically using Microsoft Excel.
Procedure for OLAP Operations:
1. Open Microsoft Excel, go to the Data tab at the top and click on "Existing Connections".
2. The Existing Connections window will open; there the "Browse for more" option
should be clicked to import a .cub file for performing OLAP operations. As a
sample, the music.cub file is used.
5. Now we are going to perform the roll-up (drill-up) operation. In the above window, the month
of January is selected, and the Drill-up option is automatically enabled at the top. Clicking the
Drill-up option displays the window below.
While inserting slicers for the slicing operation, we select only two dimensions (e.g., CategoryName
and Year) with one measure (e.g., Sum of Sales). After inserting the slicers and adding a filter
(CategoryName: AVANT ROCK & BIG BAND; Year: 2009 & 2010), we get the table shown
below.
8. Finally, the pivot (rotate) OLAP operation is performed by swapping the rows (Order Date -
Year) and columns (Values: Sum of Quantity and Sum of Sales) through the navigation bar at the
bottom right, as shown below.
After swapping (rotating), we get the result shown below, along with a pie chart for the
Classical category and year-wise data.
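The same OLAP ideas can also be sketched outside Excel. The short pandas example below mimics the slice, roll-up and pivot steps performed above; only the field names mirror the cube, while the DataFrame and its values are illustrative assumptions.

import pandas as pd

df = pd.DataFrame({
    "CategoryName": ["AVANT ROCK", "AVANT ROCK", "BIG BAND", "BIG BAND"],
    "Year":         [2009, 2010, 2009, 2010],
    "Month":        ["Jan", "Feb", "Jan", "Feb"],
    "Sales":        [120.0, 150.0, 90.0, 60.0],
})

# Slice/dice: keep only the two categories and two years (like the slicer filter)
sliced = df[df["CategoryName"].isin(["AVANT ROCK", "BIG BAND"]) & df["Year"].isin([2009, 2010])]

# Roll-up: aggregate months up to the year level
rollup = sliced.groupby(["CategoryName", "Year"])["Sales"].sum()

# Pivot (rotate): swap rows and columns of the summary
pivoted = rollup.unstack("Year")      # years become columns
print(pivoted)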
The GUI Chooser application allows you to run five different types of applications -
▪ The Explorer is the central panel where most data mining tasks are performed.
▪ The Experimenter panel is used to run experiments and conduct statistical tests
between learning schemes.
▪ The KnowledgeFlow panel is used to provide an interface to drag and drop
components, connect them to form a knowledge flow and analyze the data and
results.
▪ The Workbench panel combines all of the other GUI interfaces into a single
application.
▪ The Simple CLI panel provides the command-line interface powers to run
WEKA.
The Explorer - When you click on the Explorer button in the Applications selector, it
opens the following screen.
The Weka Explorer is designed to investigate your machine learning dataset. It is useful when you
are thinking about different data transforms and modeling algorithms that you could investigate
with a controlled experiment later. It is excellent for getting ideas and playing out what-if scenarios.
The interface is divided into 6 tabs, each with a specific function:
1. The preprocess tab is for loading your dataset and applying filters to transform the data
into a form that better exposes the structure of the problem to the modelling processes.
It also provides some summary statistics about the loaded data.
2. The classify tab is for training and evaluating the performance of different machine
learning algorithms on your classification or regression problem. Algorithms are
divided up into groups, results are kept in a result list and summarized in the main
Classifier output.
3. The cluster tab is for training and evaluating the performance of different unsupervised
clustering algorithms on your unlabelled dataset. Like the Classify tab, algorithms are
divided into groups, results are kept in a result list and summarized in the main
Clustered output.
4. The associate tab is for automatically finding associations in a dataset. The techniques
are often used for market basket analysis type data mining problems and require data
where all attributes are categorical.
5. The select attributes tab is for performing feature selection on the loaded dataset and
identifying those features that are most likely to be relevant in developing a predictive
model.
6. The visualize tab is for reviewing pairwise scatterplot matrix of each attribute plotted
against every other attribute in the loaded dataset. It is useful to get an idea of the
shape and relationship of attributes that may aid in data filtering, transformation, and
modelling.
The Experimenter - When you click on the Experimenter button in the
Applications selector, it opens the following screen.
• Click the “Add New” button in the Datasets pane and select
the required dataset (ARFF format files).
• Click the “Add New” button in the “Algorithms” pane and click “OK”
to add the required algorithm.
2. The run tab is for running your designed experiments. Experiments can be started and
stopped. There is not a lot to it.
• Click the “Start” button to run the small experiment you designed.
3. The analyze tab is for analyzing the results collected from an experiment. Results can
be loaded from a file, from the database or from an experiment just completed in the
tool. A no. of performance measures are collected from agiven experiment which can
be compared between algorithms using tools like statistical significance.
• Click the “Experiment” button in the “Source” pane to load the results
from the experiment you just ran.
• Click the “Perform Test” button to summarize the classification accuracy
results for the single algorithm in the experiment.
The KnowledgeFlow – When you click on the KnowledgeFlow button in the Applications
selector, it opens the following screen.
The Workbench - When you click on the Workbench button in the Applications selector, it
opens the Weka Workbench, an environment that combines all the GUI interfaces into a single
interface. It is useful if you find yourself jumping a lot between two or more different
interfaces, such as between the Explorer and the Experiment Environment. This can happen
if you try out a lot of what if’s in the Explorer and quickly take what you learn and put it into
controlled experiments.
The Simple CLI – When you click on the Simple CLI button in the Applications
selector, it opens the following screen.
Weka can be used from a simple Command Line Interface (CLI). This is powerful
because you can write shell scripts to use the full API from command line calls with
parameters, allowing you to build models, run experiments and make predictions without a
graphical user interface.
Classify Panel
Test Options
1. The result of applying the chosen classifier will be tested according to the options that
are set by clicking in the Test options box.
2. There are four test modes:
▪ Use training set: The classifier is evaluated on how well it predicts the class
of the instances it was trained on.
▪ Supplied test set: The classifier is evaluated on how well it predicts the class
of a set of instances loaded from a file. Clicking the Set... button brings up a
dialog allowing you to choose the file to test on.
▪ Cross-validation: The classifier is evaluated by cross-validation, using the
number of folds that are entered in the Folds text field.
▪ Percentage split: The classifier is evaluated on how well it predicts a certain
percentage of the data which is held out for testing. The amount of data held
out depends on the value entered in the % field.
3. Click the “Start” button to run the ZeroR classifier on the dataset and
summarize the results.
Associate Panel
1. Click the “Start” button to run the Apriori association algorithm on the dataset
and summarize the results.
Visualize Panel
1. Increase the point size and the jitter and click the “Update” button to set an
improved plot of the categorical attributes of the loaded dataset.
EXPERIMENTER
Setup Panel
1. Click the “New” button to create a new Experiment.
2. Click the “Add New” button in the Datasets pane and select
the data/diabetes.arff dataset.
3. Click the “Add New” button in the “Algorithms” pane and click “OK” to add
the ZeroR algorithm.
Run Panel
1. Click the “Start” button to run the small experiment you designed.
Analyse Panel
1. Click the “Experiment” button in the “Source” pane to load the results from the
experiment you just ran.
2. Click the “Perform Test” button to summarize the classification accuracy results
for the single algorithm in the experiment.
➢ Study the ARFF file format. Explore the available datasets in WEKA. Load a
dataset (e.g. Weather dataset, Iris dataset, etc.)
ANS:
1. An ARFF (Attribute-Relation File Format) file is an ASCII text file that
describes a list of instances sharing a set of attributes.
2. ARFF files have two distinct sections – The Header & the Data.
• The Header describes the name of the relation, a list of the attributes, and
their types.
• The Data section contains a comma separated list of data.
The ARFF Header Section
The ARFF Header section of the file contains the relation declaration and attribute
declarations.
The @relation Declaration
The relation name is defined as the first line in the ARFF file. The format is:
@relation <relation-name>
where <relation-name> is a string. The string must be quoted if the name includes spaces.
The @attribute Declarations
Attribute declarations take the form of an ordered sequence of @attribute statements. Each
attribute in the data set has its own @attribute statement which uniquely defines the name
of that attribute and its data type. The order in which the attributes are declared indicates the
column position in the data section of the file.
The format for the @attribute statement is:
@attribute <attribute-name> <datatype>
where the <attribute-name> must start with an alphabetic character. If spaces are to be
included in the name, then the entire name must be quoted.
The <datatype> can be any of the four types:
1. numeric
2. <nominal-specification>
3. string
4. date [<date-format>]
ARFF Data Section
The ARFF Data section of the file contains the data declaration line and the actual
instance lines.
The @data Declaration
The @data declaration is a single line denoting the start of the data segment in the file.
The format is:
@data
The instance data:
Each instance is represented on a single line, with carriage returns denoting the end of
the instance.
Attribute values for each instance are delimited by commas. They must appear in the order
that they were declared in the header section (i.e. the data corresponding to the nth
@attribute declaration is always the nth field of each instance).
Missing values are represented by a single question mark, as in:
@data
4.4, ?, 1.5, ?, Iris-setosa
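Putting the two sections together, a small hand-written ARFF file (a sketch in the spirit of the weather.nominal dataset used later in this record) looks like this:

@relation weather.symbolic

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,hot,high,FALSE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes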
4. Plot Histogram
Applying the Resample filter to the selected dataset yields the following results.
➢ Load the weather.nominal, Iris and Glass datasets into Weka and run the Apriori
algorithm with different support and confidence values.
ANS:
Loading WEATHER.NOMINAL dataset
1. Select WEATHER.NOMINAL dataset from the available datasets in the
preprocessing tab.
2. Apply Apriori algorithm by selecting it from the Associate tab and click start
button.
3. The Associator output displays the following result.
=== Run information ===
Scheme: weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S
-1.0 -c -1
Relation: weather.symbolic
Instances: 14
Attributes: 5
outlook
temperature
humidity
windy
play
=== Associator model (full training set) ===
Apriori
=======
Minimum support: 0.15 (2 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 17
a b c <-- classified as
5 0 0 | a = soft
0 4 0 | b = hard
0 0 15 | c = none
age
spectacle-prescrip
astigmatism
tear-prod-rate
contact-lenses
Test mode: evaluate on training data
=== Classifier model (full training set) ===
J48 pruned tree
➢ Extract if-then rules from the decision tree generated by the classifier and observe the
confusion matrix.
ANS:
Loading CONTACT-LENSES dataset and Run JRip algorithm.
1. Select CONTACT-LENSES dataset from the available datasets in the
preprocessing tab.
2. Apply JRip algorithm by selecting it from the Classify tab.
3. Now select one of the Test Options available (Training Set/ Supplied Test Set/
Cross-Validation/ Percentage Split).
4. Also select Output Entropy Evaluation Measures from the More options available in
Test Options.
5. Now click Start button available.
6. The Classifier output displays the following result.
=== Run information ===
Scheme: weka.classifiers.rules.JRip -F 3 -N 2.0 -O 2 -S 1
Relation: contact-lenses
Instances: 24
Attributes: 5
age
spectacle-prescrip
astigmatism
tear-prod-rate
contact-lenses
Test mode: evaluate on training data
rules:
===========
(tear-prod-rate = normal) and (astigmatism = yes) => contact-lenses=hard (6.0/2.0)
(tear-prod-rate = normal) => contact-lenses=soft (6.0/1.0)
=> contact-lenses=none (12.0/0.0)
Number of Rules : 3
Time taken to build model: 0 seconds
a b c <-- classified as
5 0 0 | a = soft
0 4 0 | b = hard
1 2 12 | c = none
➢ Load each dataset into Weka and perform Naïve-bayes classification and k-
Nearest Neighbour classification. Interpret the results obtained.
ANS:
Loading CONTACT-LENSES dataset and Run Naïve-Bayes algorithm.
1. Select CONTACT-LENSES dataset from the available datasets in the
preprocessing tab.
2. Apply Naïve-Bayes algorithm by selecting it from the Classify tab.
3. Now select one of the Test Options available (Training Set/ Supplied Test Set/
Cross-Validation/ Percentage Split).
4. Also select Output Entropy Evaluation Measures from the More options available in
Test Options.
5. Now click Start button available.
6. The Classifier output displays the following result.
=== Run information ===
Scheme: weka.classifiers.bayes.NaiveBayes
Relation: contact-lenses
Instances: 24
Attributes: 5
age
spectacle-prescrip
astigmatism
tear-prod-rate
contact-lenses
Test mode: evaluate on training data
=== Classifier model (full training set) ===
Naive Bayes Classifier
Class
Attribute soft hard none
(0.22) (0.19) (0.59)
==========================================
age
young 3.0 3.0 5.0
pre-presbyopic 3.0 2.0 6.0
presbyopic 2.0 2.0 7.0
[total] 8.0 7.0 18.0
spectacle-prescrip
myope 3.0 4.0 8.0
hypermetrope 4.0 2.0 9.0
[total] 7.0 6.0 17.0
astigmatism
no 6.0 1.0 8.0
yes 1.0 5.0 9.0
[total] 7.0 6.0 17.0
tear-prod-rate
➢ Compare classification results of ID3, J48, Naïve-Bayes and k-NN classifiers for each
dataset, and deduce which classifier is performing best and poor for each dataset and
justify.
ANS:
By observing all the classification results of the ID3, k-NN, J48 and Naïve Bayes algorithms:
The ID3 algorithm gives the best accuracy and performance.
The J48 algorithm gives the poorest accuracy and performance.
RESULT
Knowledge flow layout for finding strong association rules by using the FPGrowth
algorithm
RESULT
➢ Set up the knowledge flow to load an ARFF file (batch mode) and perform a cross-
validation using the J48 algorithm.
ANS:
Knowledge flow to load an ARFF file (batch mode) and perform a cross-validation
using the J48 algorithm.
RESULT
➢ Demonstrate plotting multiple ROC curves in the same plot window by using J48
and Random Forest trees.
Plotting multiple ROC curves in the same plot window by using J48 and Random
Forest trees.
RESULT
a b c <-- classified as
15 0 0 | a = Iris-setosa
19 0 0 | b = Iris-versicolor
17 0 0 | c = Iris-virginica
8. Write a Java program to prepare a simulated data set with unique instances.
PROGRAM:
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
OUTPUT:
Set12
37
1
38
84
28
24
61
88
65
66
72
85
75
64
91
27
47
42
9. Write a Python program to generate frequent item sets / association rules using
Apriori algorithm.
PROGRAM:
# installing the apyori package
!pip install apyori
# importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
OUTPUT:
# Displaying the results non-sorted
output_DataFrame
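Since only the installation and import lines of the program are reproduced above, the following is a minimal sketch of how the apyori package is typically called, assuming a small in-memory list of transactions instead of the CSV used in the original program; the item names and thresholds are illustrative.

from apyori import apriori

transactions = [
    ["milk", "bread", "butter"],
    ["bread", "butter"],
    ["milk", "bread"],
    ["milk", "butter"],
]

# generate frequent item sets / rules above the given support and confidence
rules = apriori(transactions, min_support=0.5, min_confidence=0.7, min_lift=1.0)
for rule in rules:
    print(list(rule.items), "support =", rule.support)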
10. Write a program to calculate chi-square value using Python. Report your
observation.
PROGRAM:
# importing libraries
import numpy as np
from scipy.stats import chi2
OUTPUT:
Chi-square statistic: 10.0
P-value: 0.0014
OBSERVATION:
The chi-square statistic is a measure of the discrepancy between the observed and expected
frequencies in a contingency table. The p-value is the probability of obtaining a chi-square
statistic as large or larger than the one observed, assuming that the null hypothesis is true. In
this case, the null hypothesis is that there is no association between the two variables in the
contingency table. The p-value of 0.0014 is less than the significance level of 0.05, so we
reject the null hypothesis and conclude that there is a significant association between the two
variables.
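For reference, a minimal sketch of such a chi-square computation using only the imports shown above (numpy and scipy.stats.chi2); the observed and expected counts are illustrative assumptions, so the printed values differ from the output above.

import numpy as np
from scipy.stats import chi2

observed = np.array([50, 30, 20])     # illustrative observed frequencies
expected = np.array([40, 40, 20])     # illustrative expected frequencies

stat = ((observed - expected) ** 2 / expected).sum()   # chi-square statistic
dof = len(observed) - 1
p_value = chi2.sf(stat, dof)                           # upper-tail probability

print("Chi-square statistic:", stat)   # 5.0 for this data
print("P-value:", round(p_value, 4))   # about 0.082, so here H0 would not be rejected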
        dict[mydata[i][-1]].append(mydata[i])
    return dict
# Calculating Mean
def mean(numbers):
    return sum(numbers) / float(len(numbers))

def MeanAndStdDev(mydata):
    info = [(mean(attribute), std_dev(attribute)) for attribute in zip(*mydata)]
    # e.g. list = [[a, b, c], [m, n, o], [x, y, z]]
    # here mean of 1st attribute = (a + m + x)/3, mean of 2nd attribute = (b + n + y)/3
    # delete summaries of the last (class) attribute
    del info[-1]
    return info
            probabilities[classValue] *= calculateGaussianProbability(x, mean, std_dev)
    return probabilities
# Accuracy score
def accuracy_rate(test, predictions):
    correct = 0
    for i in range(len(test)):
        if test[i][-1] == predictions[i]:
            correct += 1
    return (correct / float(len(test))) * 100.0
# driver code
# add the data path in your system
filename = r'E:\user\MACHINE LEARNING\machine learning algos\Naive bayes\filedata.csv'
# 70% of data is training data and 30% is test data used for testing
ratio = 0.7
train_data, test_data = splitting(mydata, ratio)
print('Total number of examples are: ', len(mydata))
print('Out of these, training examples are: ', len(train_data))
print("Test examples are: ", len(test_data))
# prepare model
info = MeanAndStdDevForClass(train_data)
# test model
predictions = getPredictions(info, test_data)
accuracy = accuracy_rate(test_data, predictions)
print("Accuracy of your model is: ", accuracy)
OUTPUT:
Total number of examples are: 200
Out of these, training examples are: 140
Test examples are: 60
Accuracy of your model is: 71.237678
item[i][j]=0;
}
}
}
//creating array for 2-frequency itemset
int nt1[][]=new int[10][10];
for(j=0;j<10;j++)
{
//generating unique items for 2-frequency itemlist
for(m=j+1;m<10;m++)
{
for(i=0;i<n;i++)
{
if(item[i][j]==1 && item[i][m]==1)
nt1[j][m]++; //count the transactions that contain both items
}
}
for(j=0;j<10;j++)
{
for(m=j+1;m<10;m++)
{
if(((nt1[j][m]/(float)n)*100)>=50)
q[j]=1;
else
q[j]=0;
if(q[j]==1)
{
System.out.println("Item "+itemlist[j]+"& "+itemlist[m]+" is selected ");
}
}
}
}
OUTPUT:
Enter the number of transaction:
3
items :1--Milk 2--Bread 3--Coffee 4--Juice 5--Cookies 6--Jam 7--Tea 8--Butter 9--Sugar
10--Water
Transaction 1:
Is Item MILK present in this transaction(1/0)?
:1
Is Item BREAD present in this transaction(1/0)? :
1
Is Item COFFEE present in this transaction(1/0)? :
1
Is Item JUICE present in this transaction(1/0)? :
1
Is Item COOKIES present in this transaction(1/0)? :
1
Is Item JAM present in this transaction(1/0)? :
1
Is Item TEA present in this transaction(1/0)? :
1
Is Item BUTTER present in this transaction(1/0)? :
1
Is Item SUGAR present in this transaction(1/0)? :
1
Is Item WATER present in this transaction(1/0)? :
1
Transaction 2:
Is Item MILK present in this transaction(1/0)? :
1
Is Item BREAD present in this transaction(1/0)? :
1
Is Item COFFEE present in this transaction(1/0)? :
0
Is Item JUICE present in this transaction(1/0)? :
0
Is Item COOKIES present in this transaction(1/0)? :
1
Is Item JAM present in this transaction(1/0)?
13. Write a program to cluster your choice of data using the simple k-means algorithm
using JDK.
PROGRAM:
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
@Override
public String toString() {
return "(" + x + ", " + y + ")";
}
}
}
OUTPUT:
(0.8, 1.2)
(3.2, 3.6)
(1.0, 1.2)
14. Write a program for cluster analysis using the simple k-means algorithm in the Python
programming language.
PROGRAM:
# importing libraries
import numpy as np
import matplotlib.pyplot as plt
OUTPUT:
OBSERVATION:
This program generates 100 random data points and then clusters them using the k-means
algorithm. The number of clusters is chosen to be 3. The program then plots the data points
and the centroids.
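A minimal numpy-only sketch matching this description (100 random points, 3 clusters, plot of points and centroids); the random data, the seed and the fixed number of iterations are illustrative assumptions.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.random((100, 2))                              # 100 random 2-D points
k = 3
centroids = X[rng.choice(len(X), k, replace=False)]   # random initial centroids

for _ in range(10):                                   # a few Lloyd iterations
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)                     # assign points to nearest centroid
    # recompute centroids (assumes no cluster becomes empty)
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.scatter(centroids[:, 0], centroids[:, 1], marker="x", s=200, c="red")
plt.show()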
15. Write a program to compute/display dissimilarity matrix (for your own dataset
containing at least four instances with two attributes) using Python.
PROGRAM:
# importing library
import numpy as np
OUTPUT:
[[0. 2.82842712 5.65685425 8.48528137]
[2.82842712 0. 2.82842712 5.65685425]
[5.65685425 2.82842712 0. 2.82842712]
[8.48528137 5.65685425 2.82842712 0. ]]
OBSERVATION:
This program first creates a dataset containing four instances with two attributes. Then, it
computes the dissimilarity matrix using the Euclidean distance formula. Finally, it displays
the dissimilarity matrix.
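A minimal sketch of such a computation; the four points (1,1), (3,3), (5,5) and (7,7) are an assumed dataset, chosen here because their pairwise Euclidean distances reproduce the matrix shown above.

import numpy as np

X = np.array([[1, 1], [3, 3], [5, 5], [7, 7]], dtype=float)

diff = X[:, None, :] - X[None, :, :]                 # pairwise differences
dissimilarity = np.sqrt((diff ** 2).sum(axis=2))     # Euclidean distance matrix
print(dissimilarity)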
16. Visualize the datasets using matplotlib in Python (Histogram, Box plot, Bar chart,
Pie chart, etc.)
PROGRAM:
LINE CHART
import matplotlib.pyplot as plt
OUTPUT:
HISTOGRAM
import matplotlib.pyplot as plt
import numpy as np
# Create a histogram of some sample data (the data here is an assumed random sample)
data = np.random.randn(1000)
plt.hist(data)
plt.show()
OUTPUT:
BOXPLOT
import matplotlib.pyplot as plt
import numpy as np
OUTPUT:
BAR CHART
import matplotlib.pyplot as plt
import numpy as np
OUTPUT:
PIE CHART
import matplotlib.pyplot as plt
import numpy as np
OUTPUT:
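Since only the import lines of the chart programs are reproduced above, the following is a minimal self-contained sketch that draws all four chart types in one figure; the data values are illustrative assumptions.

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=200)                            # sample data for histogram/boxplot

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].hist(data, bins=20); axes[0, 0].set_title("Histogram")
axes[0, 1].boxplot(data);       axes[0, 1].set_title("Box plot")
axes[1, 0].bar(["A", "B", "C"], [5, 3, 7]); axes[1, 0].set_title("Bar chart")
axes[1, 1].pie([30, 45, 25], labels=["X", "Y", "Z"]); axes[1, 1].set_title("Pie chart")
plt.tight_layout()
plt.show()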