DWDM Record
CERTIFICATE
This is to certify that this is the BONAFIDE RECORD of the work done in
______________________________________________________________ Laboratory by
Total number of experiments held: _____ Total number of experiments done: _____
EXTERNAL EXAMINER
INDEX
There are different options available in MySQL Administrator. After a connection is established through MySQL Administrator, we use another tool, SQLyog Enterprise, for building and managing tables in a database. Below we can see the window of SQLyog Enterprise.
The left-side navigation shows the different databases and their related tables. Now we are going to build tables and populate them with data through SQL queries. These tables can then be used for building the data warehouse.
In the above two windows, we created a database named "sample" and, in that database, two tables named "user_details" and "hockey" through SQL queries. Now we are going to populate (fill) the two tables with sample data through SQL queries, as shown in the windows below.
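The same tables can also be created and populated programmatically. Below is a minimal Python sketch assuming the mysql-connector-python package and the "sample" database created above; the column layout and credentials are illustrative placeholders, not the exact schema used in the record.

import mysql.connector

# connect to the "sample" database created through MySQL Administrator
conn = mysql.connector.connect(host="localhost", user="root",
                               password="<password>", database="sample")
cur = conn.cursor()

# create a table (illustrative columns)
cur.execute("""CREATE TABLE IF NOT EXISTS user_details (
                   id INT PRIMARY KEY,
                   name VARCHAR(50),
                   city VARCHAR(50))""")

# populate one sample row
cur.execute("INSERT INTO user_details VALUES (%s, %s, %s)", (1, "Ravi", "Hyderabad"))

conn.commit()
cur.close()
conn.close()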
Through MySQL Administrator and SQLyog, we can import databases from other sources (.xls, .csv, .sql) and export our databases as backups for further processing. We can also connect MySQL to other applications for data analysis and reporting.
In the above window, the left-side navigation bar shows a database named "sales_dw" in which six tables (dimcustdetails, dimcustomer, dimproduct, dimsalesperson, dimstores, factproductsales) have been created.
After creating the tables in the database, we use a tool called "Microsoft Visual Studio 2012 for Business Intelligence" for building multi-dimensional models.
The ETL Process
Extract:
The Extract step covers the data extraction from the source system and makes it accessible for further processing. The main objective of the extract step is to retrieve all the required data from the source system with as few resources as possible. The extract step should be designed in a way that it does not negatively affect the source system in terms of performance, response time or any kind of locking.
There are several ways to perform the extract:
▪ Update notification - if the source system is able to provide a notification that
a record has been changed and describe the change, this is the easiest way to
get the data.
▪ Incremental extract - some systems may not be able to provide notification
that an update has occurred, but they are able to identify which records have
been modified and provide an extract of such records. During further ETL
steps, the system needs to identify these changes and propagate them down
(a short sketch of this approach follows the list). Note that with a daily
extract, we may not be able to handle deleted records properly.
▪ Full extract - some systems are not able to identify which data has been
changed at all, so a full extract is the only way one can get the data out of the
system. The full extract requires keeping a copy of the last extract in the same
format in order to be able to identify changes. Full extract handles deletions as
well.
▪ When using incremental or full extracts, the extract frequency is extremely
important, particularly for full extracts, where the data volumes can be in the
tens of gigabytes.
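As a rough illustration of the incremental approach, the sketch below pulls only the rows changed since a saved watermark. sqlite3 is used here only for brevity (a MySQL connection would work the same way), and the table and column names (orders, updated_at) are illustrative assumptions.

import sqlite3

last_extract = "2024-01-01 00:00:00"        # watermark saved by the previous run

conn = sqlite3.connect("source.db")
changed_rows = conn.execute(
    "SELECT * FROM orders WHERE updated_at > ?", (last_extract,)
).fetchall()
conn.close()

print(len(changed_rows), "records changed since", last_extract)
# After loading these rows, persist the current timestamp as the new watermark.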
Clean:
The cleaning step is one of the most important as it ensures the quality of the data in
the data warehouse.
Cleaning should perform basic data unification rules, such as the following (a short sketch follows the list):
▪ Making identifiers unique (sex categories Male/Female/Unknown, M/F/null,
Man/Woman/Not Available are translated to standard
Male/Female/Unknown)
▪ Convert null values into a standardized Not Available/Not Provided value
▪ Convert phone numbers, ZIP codes to a standardized form
▪ Validate address fields, convert them into proper naming, e.g.
Street/St/St./Str./Str
▪ Validate address fields against each other (State/Country, City/State, City/ZIP
code, City/Street).
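A minimal pandas sketch of the unification rules listed above; the column names, mappings and sample values are illustrative assumptions.

import pandas as pd

df = pd.DataFrame({
    "sex":   ["M", "Woman", None, "Male"],
    "phone": ["(040) 123-4567", "040 1234567", None, "0401234567"],
    "zip":   ["500 072", "500072", "50007", None],
})

# unify sex identifiers to Male/Female/Unknown
sex_map = {"M": "Male", "Man": "Male", "F": "Female", "Woman": "Female"}
df["sex"] = df["sex"].replace(sex_map).fillna("Unknown")

# standardize phone numbers (digits only) and ZIP codes, convert nulls
df["phone"] = df["phone"].str.replace(r"[^0-9]", "", regex=True).fillna("Not Available")
df["zip"] = df["zip"].str.replace(" ", "", regex=False).fillna("Not Available")

print(df)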
Transform:
The transform step applies a set of rules to transform the data from the source to the
target. This includes converting any measured data to the same dimension (i.e. a
conformed dimension) using the same units so that they can later be joined. The
transformation step also requires joining data from several sources.
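A minimal pandas sketch of the transform step: converting one source's amounts to a common unit and joining the combined sales with a product dimension. All names and figures are illustrative assumptions.

import pandas as pd

eu_sales = pd.DataFrame({"product_id": [1, 2], "amount_eur": [100.0, 250.0]})
us_sales = pd.DataFrame({"product_id": [1, 3], "amount_usd": [120.0, 80.0]})
products = pd.DataFrame({"product_id": [1, 2, 3], "name": ["Pen", "Book", "Bag"]})

EUR_TO_USD = 1.10                                    # assumed fixed rate for the example
eu_sales["amount_usd"] = eu_sales["amount_eur"] * EUR_TO_USD

# conform both sources to the same unit, then join to the product dimension
sales = pd.concat([eu_sales[["product_id", "amount_usd"]],
                   us_sales[["product_id", "amount_usd"]]])
fact = sales.merge(products, on="product_id", how="left")

print(fact.groupby("name")["amount_usd"].sum())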
➢ Perform various OLAP operations such as slice, dice, roll-up, drill-up and pivot.
ANS:
OLAP operations are implemented practically using Microsoft Excel.
Procedure for OLAP Operations:
1. Open Microsoft Excel, go to the Data tab at the top and click on "Existing Connections".
2. The Existing Connections window will open; there the "Browse for more" option
should be clicked to import a .cub file for performing OLAP operations. As a
sample, the music.cub file is used.
5. Now we are going to perform the roll-up (drill-up) operation. In the above window, the month
of January is selected, and the Drill-up option is automatically enabled at the top. Clicking the
Drill-up option displays the window below.
While inserting slicers for the slicing operation, we select only two dimensions (e.g., CategoryName
and Year) with one measure (e.g., Sum of Sales). After inserting the slicers and adding a filter
(CategoryName: AVANT ROCK & BIG BAND; Year: 2009 & 2010), we get the table shown
below.
8. Finally, the pivot (rotate) OLAP operation is performed by swapping the rows (Order Date -
Year) and columns (Values: Sum of Quantity and Sum of Sales) through the navigation bar at the
bottom right, as shown below.
After swapping (rotating), we get the result shown below, along with a pie chart for the
Classical category and year-wise data.
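The same OLAP ideas can also be sketched outside Excel. The short pandas example below mimics the slice, roll-up and pivot steps performed above; only the field names mirror the cube, while the DataFrame and its values are illustrative assumptions.

import pandas as pd

df = pd.DataFrame({
    "CategoryName": ["AVANT ROCK", "AVANT ROCK", "BIG BAND", "BIG BAND"],
    "Year":         [2009, 2010, 2009, 2010],
    "Month":        ["Jan", "Feb", "Jan", "Feb"],
    "Sales":        [120.0, 150.0, 90.0, 60.0],
})

# Slice/dice: keep only the two categories and two years (like the slicer filter)
sliced = df[df["CategoryName"].isin(["AVANT ROCK", "BIG BAND"]) & df["Year"].isin([2009, 2010])]

# Roll-up: aggregate months up to the year level
rollup = sliced.groupby(["CategoryName", "Year"])["Sales"].sum()

# Pivot (rotate): swap rows and columns of the summary
pivoted = rollup.unstack("Year")      # years become columns
print(pivoted)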
The GUI Chooser application allows you to run five different types of applications -
▪ The Explorer is the central panel where most data mining tasks are performed.
▪ The Experimenter panel is used to run experiments and conduct statistical tests
between learning schemes.
▪ The KnowledgeFlow panel is used to provide an interface to drag and drop
components, connect them to form a knowledge flow and analyze the data and
results.
▪ The Workbench panel combines all of the other GUI interfaces into a single
application.
▪ The Simple CLI panel provides the command-line interface powers to run
WEKA.
The Explorer - When you click on the Explorer button in the Applications selector, it
opens the following screen.
The Weka Explorer is designed to investigate your machine learning dataset. It is useful when you
are thinking about different data transforms and modeling algorithms that you could investigate
with a controlled experiment later. It is excellent for getting ideas and playing out what-if scenarios.
The interface is divided into 6 tabs, each with a specific function:
1. The preprocess tab is for loading your dataset and applying filters to transform the data
into a form that better exposes the structure of the problem to the modelling processes.
It also provides some summary statistics about the loaded data.
2. The classify tab is for training and evaluating the performance of different machine
learning algorithms on your classification or regression problem. Algorithms are
divided up into groups, results are kept in a result list and summarized in the main
Classifier output.
3. The cluster tab is for training and evaluating the performance of different unsupervised
clustering algorithms on your unlabelled dataset. Like the Classify tab, algorithms are
divided into groups, results are kept in a result list and summarized in the main
Clustered output.
4. The associate tab is for automatically finding associations in a dataset. The techniques
are often used for market basket analysis type data mining problems and require data
where all attributes are categorical.
5. The select attributes tab is for performing feature selection on the loaded dataset and
identifying those features that are most likely to be relevant in developing a predictive
model.
6. The visualize tab is for reviewing pairwise scatterplot matrix of each attribute plotted
against every other attribute in the loaded dataset. It is useful to get an idea of the
shape and relationship of attributes that may aid in data filtering, transformation, and
modelling.
The Experimenter - When you click on the Experimenter button in the
Applications selector, it opens the following screen.
• Click the “Add New” button in the Datasets pane and select
the required dataset (ARFF format files).
• Click the “Add New” button in the “Algorithms” pane and click “OK”
to add the required algorithm.
2. The run tab is for running your designed experiments. Experiments can be started and
stopped. There is not a lot to it.
• Click the “Start” button to run the small experiment you designed.
3. The analyze tab is for analyzing the results collected from an experiment. Results can
be loaded from a file, from the database or from an experiment just completed in the
tool. A no. of performance measures are collected from agiven experiment which can
be compared between algorithms using tools like statistical significance.
• Click the “Experiment” button in the “Source” pane to load the results
from the experiment you just ran.
• Click the “Perform Test” button to summarize the classification accuracy
results for the single algorithm in the experiment.
The KnowledgeFlow – When you click on the KnowledgeFlow button in the Applications
selector, it opens the following screen.
The Workbench - When you click on the Workbench button in the Applications selector, it
opens the Weka Workbench, an environment that combines all the GUI interfaces into a single
interface. It is useful if you find yourself jumping a lot between two or more different
interfaces, such as between the Explorer and the Experiment Environment. This can happen
if you try out a lot of what if’s in the Explorer and quickly take what you learn and put it into
controlled experiments.
The Simple CLI – When you click on the Simple CLI button in the Applications
selector, it opens the following screen.
Weka can be used from a simple Command Line Interface (CLI). This is powerful
because you can write shell scripts to use the full API from command line calls with
parameters, allowing you to build models, run experiments and make predictions without a
graphical user interface.
Classify Panel
Test Options
1. The result of applying the chosen classifier will be tested according to the options that
are set by clicking in the Test options box.
2. There are four test modes:
▪ Use training set: The classifier is evaluated on how well it predicts the class
of the instances it was trained on.
▪ Supplied test set: The classifier is evaluated on how well it predicts the class
of a set of instances loaded from a file. Clicking the Set... button brings up a
dialog allowing you to choose the file to test on.
▪ Cross-validation: The classifier is evaluated by cross-validation, using the
number of folds that are entered in the Folds text field.
▪ Percentage split: The classifier is evaluated on how well it predicts a certain
percentage of the data which is held out for testing. The amount of data held
out depends on the value entered in the % field.
3. Click the “Start” button to run the ZeroR classifier on the dataset and
summarize the results.
Associate Panel
1. Click the “Start” button to run the Apriori association algorithm on the dataset
and summarize the results.
Visualize Panel
1. Increase the point size and the jitter and click the “Update” button to set an
improved plot of the categorical attributes of the loaded dataset.
EXPERIMENTER
Setup Panel
1. Click the “New” button to create a new Experiment.
2. Click the “Add New” button in the Datasets pane and select
the data/diabetes.arff dataset.
3. Click the “Add New” button in the “Algorithms” pane and click “OK” to add
the ZeroR algorithm.
Run Panel
1. Click the “Start” button to run the small experiment you designed.
Analyse Panel
1. Click the “Experiment” button in the “Source” pane to load the results from the
experiment you just ran.
2. Click the “Perform Test” button to summarize the classification accuracy results
for the single algorithm in the experiment.
➢ Study the ARFF file format. Explore the available datasets in WEKA. Load a
dataset (e.g. Weather dataset, Iris dataset, etc.)
ANS:
1. An ARFF (Attribute-Relation File Format) file is an ASCII text file that
describes a list of instances sharing a set of attributes.
2. ARFF files have two distinct sections – The Header & the Data.
• The Header describes the name of the relation, a list of the attributes, and
their types.
• The Data section contains a comma separated list of data.
The ARFF Header Section
The ARFF Header section of the file contains the relation declaration and attribute
declarations.
The @relation Declaration
The relation name is defined as the first line in the ARFF file. The format is:
@relation <relation-name>
where <relation-name> is a string. The string must be quoted if the name includes spaces.
The @attribute Declarations
Attribute declarations take the form of an ordered sequence of @attribute statements. Each
attribute in the data set has its own @attribute statement which uniquely defines the name
of that attribute and its data type. The order in which the attributes are declared indicates the
column position in the data section of the file.
The format for the @attribute statement is:
@attribute <attribute-name> <datatype>
where the <attribute-name> must start with an alphabetic character. If spaces are to be
included in the name, then the entire name must be quoted.
The <datatype> can be any of the four types:
1. numeric
2. <nominal-specification>
3. string
4. date [<date-format>]
ARFF Data Section
The ARFF Data section of the file contains the data declaration line and the actual
instance lines.
The @data Declaration
The @data declaration is a single line denoting the start of the data segment in the file.
The format is:
@data
The instance data:
Each instance is represented on a single line, with carriage returns denoting the end of
the instance.
Attribute values for each instance are delimited by commas. They must appear in the order
that they were declared in the header section (i.e. the data corresponding to the nth
@attribute declaration is always the nth field of each instance).
Missing values are represented by a single question mark, as in:
@data
4.4, ?, 1.5, ?, Iris-setosa
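Putting the two sections together, a small hand-written ARFF file (a sketch in the spirit of the weather.nominal dataset used later in this record) looks like this:

@relation weather.symbolic

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,hot,high,FALSE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes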
4. Plot Histogram
Applying the Resample filter to the selected dataset yields the following results.
➢ Load the weather.nominal, Iris and Glass datasets into Weka and run the Apriori
algorithm with different support and confidence values.
ANS:
Loading WEATHER.NOMINAL dataset
1. Select WEATHER.NOMINAL dataset from the available datasets in the
preprocessing tab.
2. Apply Apriori algorithm by selecting it from the Associate tab and click start
button.
3. The Associator output displays the following result.
=== Run information ===
Scheme: weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S
-1.0 -c -1
Relation: weather.symbolic
Instances: 14
Attributes: 5
outlook
temperature
humidity
windy
play
=== Associator model (full training set) ===
Apriori
=======
Minimum support: 0.15 (2 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 17
a b c <-- classified as
5 0 0 | a = soft
0 4 0 | b = hard
0 0 15 | c = none
age
spectacle-prescrip
astigmatism
tear-prod-rate
contact-lenses
Test mode: evaluate on training data
=== Classifier model (full training set) ===
J48 pruned tree
➢ Extract if-then rules from the decision tree generated by the classifier and observe the
confusion matrix.
ANS:
Loading CONTACT-LENSES dataset and Run JRip algorithm.
1. Select CONTACT-LENSES dataset from the available datasets in the
preprocessing tab.
2. Apply JRip algorithm by selecting it from the Classify tab.
3. Now select one of the Test Options available (Training Set/ Supplied Test Set/
Cross-Validation/ Percentage Split).
4. Also select Output Entropy Evaluation Measures from the More options available in
Test Options.
5. Now click Start button available.
6. The Classifier output displays the following result.
=== Run information ===
Scheme: weka.classifiers.rules.JRip -F 3 -N 2.0 -O 2 -S 1
Relation: contact-lenses
Instances: 24
Attributes: 5
age
spectacle-prescrip
astigmatism
tear-prod-rate
contact-lenses
Test mode: evaluate on training data
rules:
===========
(tear-prod-rate = normal) and (astigmatism = yes) => contact-lenses=hard (6.0/2.0)
(tear-prod-rate = normal) => contact-lenses=soft (6.0/1.0)
=> contact-lenses=none (12.0/0.0)
Number of Rules : 3
Time taken to build model: 0 seconds
a b c <-- classified as
5 0 0 | a = soft
0 4 0 | b = hard
1 2 12 | c = none
➢ Load each dataset into Weka and perform Naïve-bayes classification and k-
Nearest Neighbour classification. Interpret the results obtained.
ANS:
Loading CONTACT-LENSES dataset and Run Naïve-Bayes algorithm.
1. Select CONTACT-LENSES dataset from the available datasets in the
preprocessing tab.
2. Apply Naïve-Bayes algorithm by selecting it from the Classify tab.
3. Now select one of the Test Options available (Training Set/ Supplied Test Set/
Cross-Validation/ Percentage Split).
4. Also select Output Entropy Evaluation Measures from the More options available in
Test Options.
5. Now click Start button available.
6. The Classifier output displays the following result.
=== Run information ===
Scheme: weka.classifiers.bayes.NaiveBayes
Relation: contact-lenses
Instances: 24
Attributes: 5
age
spectacle-prescrip
astigmatism
tear-prod-rate
contact-lenses
Test mode: evaluate on training data
=== Classifier model (full training set) ===
Naive Bayes Classifier
Class
Attribute soft hard none
(0.22) (0.19) (0.59)
==========================================
age
young 3.0 3.0 5.0
pre-presbyopic 3.0 2.0 6.0
presbyopic 2.0 2.0 7.0
[total] 8.0 7.0 18.0
spectacle-prescrip
myope 3.0 4.0 8.0
hypermetrope 4.0 2.0 9.0
[total] 7.0 6.0 17.0
astigmatism
no 6.0 1.0 8.0
yes 1.0 5.0 9.0
[total] 7.0 6.0 17.0
tear-prod-rate
➢ Compare classification results of ID3, J48, Naïve-Bayes and k-NN classifiers for each
dataset, and deduce which classifier is performing best and poor for each dataset and
justify.
ANS:
By observing all the classification results of the ID3, k-NN, J48 and Naïve Bayes algorithms:
The ID3 algorithm gives the best accuracy and performance.
The J48 algorithm gives the poorest accuracy and performance.
RESULT
Knowledge flow layout for finding strong association rules by using the FPGrowth
algorithm
RESULT
➢ Set up the knowledge flow to load an ARFF file (batch mode) and perform a cross-
validation using the J48 algorithm.
ANS:
Knowledge flow to load an ARFF file (batch mode) and perform a cross-validation
using the J48 algorithm.
RESULT
➢ Demonstrate plotting multiple ROC curves in the same plot window by using J48
and Random Forest trees.
Plotting multiple ROC curves in the same plot window by using J48 and Random
Forest trees.
RESULT
a b c <-- classified as
15 0 0 | a = Iris-setosa
19 0 0 | b = Iris-versicolor
17 0 0 | c = Iris-virginica
8. Write a Java program to prepare a simulated data set with unique instances.
PROGRAM:
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
OUTPUT:
Set12
37
1
38
84
28
24
61
88
65
66
72
85
75
64
91
27
47
42
9. Write a Python program to generate frequent item sets / association rules using
Apriori algorithm.
PROGRAM:
# installing the apyori package
!pip install apyori
# importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
OUTPUT:
# Displaying the results non-sorted
output_DataFrame
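Since only the installation and import lines of the program are reproduced above, the following is a minimal sketch of how the apyori package is typically called, assuming a small in-memory list of transactions instead of the CSV used in the original program; the item names and thresholds are illustrative.

from apyori import apriori

transactions = [
    ["milk", "bread", "butter"],
    ["bread", "butter"],
    ["milk", "bread"],
    ["milk", "butter"],
]

# generate frequent item sets / rules above the given support and confidence
rules = apriori(transactions, min_support=0.5, min_confidence=0.7, min_lift=1.0)
for rule in rules:
    print(list(rule.items), "support =", rule.support)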
10. Write a program to calculate chi-square value using Python. Report your
observation.
PROGRAM:
# importing libraries
import numpy as np
from scipy.stats import chi2
OUTPUT:
Chi-square statistic: 10.0
P-value: 0.0014
OBSERVATION:
The chi-square statistic is a measure of the discrepancy between the observed and expected
frequencies in a contingency table. The p-value is the probability of obtaining a chi-square
statistic as large or larger than the one observed, assuming that the null hypothesis is true. In
this case, the null hypothesis is that there is no association between the two variables in the
contingency table. The p-value of 0.0014 is less than the significance level of 0.05, so we
reject the null hypothesis and conclude that there is a significant association between the two
variables.
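For reference, a minimal sketch of such a chi-square computation using only the imports shown above (numpy and scipy.stats.chi2); the observed and expected counts are illustrative assumptions, so the printed values differ from the output above.

import numpy as np
from scipy.stats import chi2

observed = np.array([50, 30, 20])     # illustrative observed frequencies
expected = np.array([40, 40, 20])     # illustrative expected frequencies

stat = ((observed - expected) ** 2 / expected).sum()   # chi-square statistic
dof = len(observed) - 1
p_value = chi2.sf(stat, dof)                           # upper-tail probability

print("Chi-square statistic:", stat)   # 5.0 for this data
print("P-value:", round(p_value, 4))   # about 0.082, so here H0 would not be rejected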
        dict[mydata[i][-1]].append(mydata[i])
    return dict
# Calculating Mean
def mean(numbers):
    return sum(numbers) / float(len(numbers))

def MeanAndStdDev(mydata):
    info = [(mean(attribute), std_dev(attribute)) for attribute in zip(*mydata)]
    # e.g. list = [[a, b, c], [m, n, o], [x, y, z]]
    # here mean of 1st attribute = (a + m + x)/3, mean of 2nd attribute = (b + n + y)/3
    # delete summaries of the last (class) attribute
    del info[-1]
    return info
            probabilities[classValue] *= calculateGaussianProbability(x, mean, std_dev)
    return probabilities
# Accuracy score
def accuracy_rate(test, predictions):
    correct = 0
    for i in range(len(test)):
        if test[i][-1] == predictions[i]:
            correct += 1
    return (correct / float(len(test))) * 100.0
# driver code
# add the data path in your system
filename = r'E:\user\MACHINE LEARNING\machine learning algos\Naive bayes\filedata.csv'
# 70% of data is training data and 30% is test data used for testing
ratio = 0.7
train_data, test_data = splitting(mydata, ratio)
print('Total number of examples are: ', len(mydata))
print('Out of these, training examples are: ', len(train_data))
print("Test examples are: ", len(test_data))
# prepare model
info = MeanAndStdDevForClass(train_data)
# test model
predictions = getPredictions(info, test_data)
accuracy = accuracy_rate(test_data, predictions)
print("Accuracy of your model is: ", accuracy)
OUTPUT:
Total number of examples are: 200
Out of these, training examples are: 140
Test examples are: 60
Accuracy of your model is: 71.237678
item[i][j]=0;
}
}
}
//creating array for 2-frequency itemset
int nt1[][]=new int[10][10];
for(j=0;j<10;j++)
{
//generating unique items for 2-frequency itemlist
for(m=j+1;m<10;m++)
{
for(i=0;i<n;i++)
{
if(item[i][j]==1 && item[i][m]==1)
nt1[j][m]++; //count the transactions that contain both items
}
}
for(j=0;j<10;j++)
{
for(m=j+1;m<10;m++)
{
if(((nt1[j][m]/(float)n)*100)>=50)
q[j]=1;
else
q[j]=0;
if(q[j]==1)
{
System.out.println("Item "+itemlist[j]+"& "+itemlist[m]+" is selected ");
}
}
}
}
OUTPUT:
Enter the number of transaction:
3
items :1--Milk 2--Bread 3--Coffee 4--Juice 5--Cookies 6--Jam 7--Tea 8--Butter 9--Sugar
10--Water
Transaction 1:
Is Item MILK present in this transaction(1/0)?
:1
Is Item BREAD present in this transaction(1/0)? :
1
Is Item COFFEE present in this transaction(1/0)? :
1
Is Item JUICE present in this transaction(1/0)? :
1
Is Item COOKIES present in this transaction(1/0)? :
1
Is Item JAM present in this transaction(1/0)? :
1
Is Item TEA present in this transaction(1/0)? :
1
Is Item BUTTER present in this transaction(1/0)? :
1
Is Item SUGAR present in this transaction(1/0)? :
1
Is Item WATER present in this transaction(1/0)? :
1
Transaction 2:
Is Item MILK present in this transaction(1/0)? :
1
Is Item BREAD present in this transaction(1/0)? :
1
Is Item COFFEE present in this transaction(1/0)? :
0
Is Item JUICE present in this transaction(1/0)? :
0
Is Item COOKIES present in this transaction(1/0)? :
1
Is Item JAM present in this transaction(1/0)?
13. Write a program to cluster your choice of data using the simple k-means algorithm
using JDK.
PROGRAM:
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
@Override
public String toString() {
return "(" + x + ", " + y + ")";
}
}
}
OUTPUT:
(0.8, 1.2)
(3.2, 3.6)
(1.0, 1.2)
14. Write a program for cluster analysis using the simple k-means algorithm in the Python
programming language.
PROGRAM:
# importing libraries
import numpy as np
import matplotlib.pyplot as plt
OUTPUT:
OBSERVATION:
This program generates 100 random data points and then clusters them using the k-means
algorithm. The number of clusters is chosen to be 3. The program then plots the data points
and the centroids.
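A minimal numpy-only sketch matching this description (100 random points, 3 clusters, plot of points and centroids); the random data, the seed and the fixed number of iterations are illustrative assumptions.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.random((100, 2))                              # 100 random 2-D points
k = 3
centroids = X[rng.choice(len(X), k, replace=False)]   # random initial centroids

for _ in range(10):                                   # a few Lloyd iterations
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)                     # assign points to nearest centroid
    # recompute centroids (assumes no cluster becomes empty)
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.scatter(centroids[:, 0], centroids[:, 1], marker="x", s=200, c="red")
plt.show()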
15. Write a program to compute/display dissimilarity matrix (for your own dataset
containing at least four instances with two attributes) using Python.
PROGRAM:
# importing library
import numpy as np
OUTPUT:
[[0. 2.82842712 5.65685425 8.48528137]
[2.82842712 0. 2.82842712 5.65685425]
[5.65685425 2.82842712 0. 2.82842712]
[8.48528137 5.65685425 2.82842712 0. ]]
OBSERVATION:
This program first creates a dataset containing four instances with two attributes. Then, it
computes the dissimilarity matrix using the Euclidean distance formula. Finally, it displays
the dissimilarity matrix.
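A minimal sketch of such a computation; the four points (1,1), (3,3), (5,5) and (7,7) are an assumed dataset, chosen here because their pairwise Euclidean distances reproduce the matrix shown above.

import numpy as np

X = np.array([[1, 1], [3, 3], [5, 5], [7, 7]], dtype=float)

diff = X[:, None, :] - X[None, :, :]                 # pairwise differences
dissimilarity = np.sqrt((diff ** 2).sum(axis=2))     # Euclidean distance matrix
print(dissimilarity)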
16. Visualize the datasets using matplotlib in Python (Histogram, Box plot, Bar chart,
Pie chart, etc.)
PROGRAM:
LINE CHART
import matplotlib.pyplot as plt
OUTPUT:
HISTOGRAM
import matplotlib.pyplot as plt
import numpy as np
# Create a histogram of some sample data (the data here is an assumed random sample)
data = np.random.randn(1000)
plt.hist(data)
plt.show()
OUTPUT:
BOXPLOT
import matplotlib.pyplot as plt
import numpy as np
OUTPUT:
BAR CHART
import matplotlib.pyplot as plt
import numpy as np
OUTPUT:
PIE CHART
import matplotlib.pyplot as plt
import numpy as np
OUTPUT:
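Since only the import lines of the chart programs are reproduced above, the following is a minimal self-contained sketch that draws all four chart types in one figure; the data values are illustrative assumptions.

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=200)                            # sample data for histogram/boxplot

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].hist(data, bins=20); axes[0, 0].set_title("Histogram")
axes[0, 1].boxplot(data);       axes[0, 1].set_title("Box plot")
axes[1, 0].bar(["A", "B", "C"], [5, 3, 7]); axes[1, 0].set_title("Bar chart")
axes[1, 1].pie([30, 45, 25], labels=["X", "Y", "Z"]); axes[1, 1].set_title("Pie chart")
plt.tight_layout()
plt.show()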