DATA MINING LAB MANUAL

1. Implement the following Multidimensional Data Models
i. Star Schema
ii. Snowflake Schema
iii. Fact Constellation

Consider a fully normalized data model. Now think of exactly the opposite, where you fully denormalize your relational data model so that you have only one flat record, like a spreadsheet with a very wide row. Now back up from this flat record just a little so that you have a data model that is only two levels deep: one big table, and several small tables that the big table points back to. This is a STAR schema.

Thus a true star data model has two defining characteristics: it is always two levels deep, and it always contains exactly one large table that is the focus of the model. There are, of course, variations of this concept, namely the SNOWFLAKE and FACT CONSTELLATION schemas.

STAR DESIGN is the outgrowth of a special need: the need to analyze large amounts of data quickly and interactively, with no opportunity to rely on canned queries. For this need a design theory was eventually constructed. Of course, for a design theory to be useful it must be implementable in a practical way.

Databases using star designs always get their data from somewhere else; they are a form of reporting database. As such, star schemas are not required to follow the normalization rules we are accustomed to. The presumption is that the feeding systems have already applied edits and constraints to the data, so the star data repository does not need to.

When you see your star design trying to accommodate other ideas and purposes that focus on something other than the fact table, you should re-evaluate the direction you are heading in. Star designs are for analyzing the one fact table central to the design of the model; doing anything else with your star data model reduces its effectiveness as an analytic data store.



Deviations from true star models usually manifest in two ways: 1) the desire to retain the relationships between dimensions in the star model, which often takes the form of SNOWFLAKING, and 2) the existence of two or more fact tables in the design, known as a FACT CONSTELLATION.

A great way to appreciate the difference between RELATIONAL DESIGN THEORY and STAR DESIGN THEORY is to see an example of how the data models for the same data will differ for the two design strategies.

Remember, both designs are valid; each, however, is attempting to address a different need. The relational data model seeks to model data as it exists in the real world, with the important relationships in the data accounted for. The star data model seeks to reincarnate the relational model into a design that makes slicing and dicing one specific subject area easy and fast.

One of the keys to understanding star models is to look at the practicality of querying the data. In a star model, getting the total quantity sold by project is the same as getting the total quantity sold by department, which is the same as getting the total quantity sold by category, and so on; doing this in the relational model is more complex, with each query being potentially very different.
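As an illustration of this uniformity, the queries below sketch how the same aggregation pattern simply swaps one dimension for another, assuming the SALE_FACT and dimension tables defined later in this section:

-- Total quantity sold by project (star model)
select p.project_name, sum(f.qty_sold)
from sale_fact f
join project_dim p on p.project_id = f.project_id
group by p.project_name;

-- Total quantity sold by department: same shape, different dimension
select d.dept_name, sum(f.qty_sold)
from sale_fact f
join dept_dim d on d.dept_id = f.dept_id
group by d.dept_name;

In the relational model the department question would instead have to traverse sale, item_use, project, emp and dept, so each question requires a different join path.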



Figure 1.1 Relational and Star Schema



There are some interesting differences between our relational model and our star
model:

1) the relational model shown here is five levels deep whereas the star model
shown here is only two.

2) the relational model does not suggest by its design that any of the data it
models is special whereas to the star model, the fact table is the centre of the
universe.

3) the relational model carefully maps the relationships between tables, treating relationships between so-called reference tables as just as important as all other relationships. The star model, by contrast, relies on its load processes to load data correctly based on the relationships in the data, but then can (and in this case did) toss those relationships out of its design; it does not care about them, because dealing with them after the data is loaded takes our focus away from the fact data, and a star design wants all eyes on the fact data.

4) the relational model is equally adept at answering questions about any of the tables in its model, whereas the star model is about slicing and dicing the fact table and little else matters. Indeed, the star model does very poorly at answering questions about its dimensions, because its focus is on the fact table.

The create table statements for our fact table and the original relational table it was derived from are shown below.

create table sale


(
sale_id number not null primary key
, sale_date date not null
, qty_sold number not null
, item_use_id number not null
--
, foreign key (item_use_id) references item_use
)
/
create table sale_fact



(
sale_id number not null primary key
, sale_date date not null
, qty_sold number not null
--
, dept_id number not null
, emp_id number not null
, project_id number not null
, item_id number not null
, category_id number not null
--
, sale_time_id number not null
, emp_salary_range_id number not null
, item_price_range_id number not null
--
, emp_salary_NA number not null
, item_price_NA number not null
--
, foreign key (dept_id) references dept_dim
, foreign key (emp_id) references emp_dim
, foreign key (project_id) references project_dim
, foreign key (item_id) references item_dim
, foreign key (category_id) references category_dim
, foreign key (sale_time_id) references time_dim
, foreign key (emp_salary_range_id) references emp_salary_range_dim
, foreign key (item_price_range_id) references item_price_range_dim
)
/

1) At the beginning of our fact table, we have the same basic table as we saw in
our relational model. SALE_FACT is one-to-one with SALE. One row in SALE_FACT
is one row in SALE.
2) We have tossed out SALE.ITEM_USE_ID because the information given by that table has no value in our star design. Instead we flattened the relationships all the way up the foreign key chain of our relational model, with the ultimate result that the keys of all our reference tables become foreign keys in our fact table. We then created dimensions in our star model for the data pointed to by each of these foreign keys. The result is that the relationships between dimension tables (which are, roughly speaking, our original relational reference tables) are lost in favour of representing these relationships directly on the fact table.

3) We have added new data in our star design that did not exist in our relational design. More specifically, we created a TIME DIMENSION which represents time in our system; think of it as all the different interesting ways to represent a date. We also took the salary on the emp table and created a bucketing scheme, which we then referred to in our fact table. We did the same for item price. The result is our EMP_SALARY_RANGE_DIM and ITEM_PRICE_RANGE_DIM. This new data would be accounted for in our load process when the fact table is loaded.

4) We also placed salary from our original emp table and item price from our original item table as NOT AGGREGATABLE (not summable) metrics on our fact table. Consider, for example, that if you sum qty_sold from sale_fact for a specific employee, you get the total quantity of items sold by that employee, because qty_sold is summable on our fact table. But if you take the sum of emp_salary from sale_fact for a specific employee, you do not get the employee's total salary; you get the employee's salary multiplied by however many rows were selected (more or less, assuming the employee's salary does not change over time), a rather meaningless number. You cannot sum emp_salary off the fact table, nor can you sum item_price. This is why they carry the _NA suffix.
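The contrast can be sketched with two queries against SALE_FACT; the employee id 101 is a made-up value used only for illustration:

-- Summable measure: total quantity sold by one employee
select sum(qty_sold) from sale_fact where emp_id = 101;

-- Not summable: this multiplies the salary by the number of fact rows, a meaningless number
select sum(emp_salary_NA) from sale_fact where emp_id = 101;

-- The _NA columns are still useful as row-level context, for example as a filter
select sum(qty_sold) from sale_fact where emp_salary_NA > 50000;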

Relational Model



create table dept
(
dept_id number not null primary key
, dept_name varchar2(30) not null unique
)
/

create table emp


(
emp_id number not null primary key
, emp_name varchar2(30) not null unique
, salary number not null
, dept_id number not null
, foreign key (dept_id) references dept
)
/

create table project


(
project_id number not null primary key
, project_name varchar2(30) not null unique
, emp_id number not null
, foreign key (emp_id) references emp
)
/

create table category


(
category_id number not null primary key
, category_name varchar2(30) not null unique
)
/

create table item



(
item_id number not null primary key
, item_name varchar2(30) not null unique
, price number not null
, category_id number not null
, foreign key (category_id) references category
)
/

create table item_use


(
item_user_id number not null primary key
, item_id number not null
, project_id number not null
, unique (item_id,project_id)
, foreign key (item_id) references item
, foreign key (project_id) references project
)
/

create table sale


(
sale_id number not null primary key
, sale_date date not null
, qty_sold number not null
, item_use_id number not null
, foreign key (item_use_id) references item_use
)
/

Star Model
create table dept_dim



(
dept_id number not null primary key
, dept_name varchar2(30) not null unique
)
/

create table emp_dim


(
emp_id number not null primary key
, emp_name varchar2(30) not null unique
)
/

create table emp_salary_range_dim


(
emp_salary_range_id number not null primary key
, range_name varchar2(30) not null unique
, range_start number not null
, range_end number not null
)
/

create table item_dim


(
item_id number not null primary key
, item_name varchar2(30) not null unique
)
/

create table item_price_range_dim


(



item_price_range_id number not null primary key
, range_name varchar2(30) not null unique
, range_start number not null
, range_end number not null
)
/

create table project_dim


(
project_id number not null primary key
, project_name varchar2(30) not null unique
)
/

create table category_dim


(
category_id number not null primary key
, category_name varchar2(30) not null unique
)
/

create table time_dim


(
time_id number not null primary key
, day_date date not null unique
, week_date date not null
)
/

create table sale_fact


(
sale_id number not null primary key



, sale_date date not null
, qty_sold number not null
--
, dept_id number not null
, emp_id number not null
, project_id number not null
, item_id number not null
, category_id number not null
--
, sale_time_id number not null
, emp_salary_range_id number not null
, item_price_range_id number not null
--
, emp_salary_NA number not null
, item_price_NA number not null
--
, foreign key (dept_id) references dept_dim
, foreign key (emp_id) references emp_dim
, foreign key (project_id) references project_dim
, foreign key (item_id) references item_dim
, foreign key (category_id) references category_dim
, foreign key (sale_time_id) references time_dim
, foreign key (emp_salary_range_id) references emp_salary_range_dim
, foreign key (item_price_range_id) references item_price_range_dim
)
/

Developing a Data Warehouse

The phases of a data warehouse project listed below are similar to those of most database projects, starting with identifying requirements and ending with executing the T-SQL scripts that create the data warehouse:

1. Identify and collect requirements


2. Design the dimensional model

3. Execute T-SQL queries to create and populate dimension and fact tables

Identify and Collect Requirements

We need to interview the key decision makers to learn what factors define success in the business, how management wants to analyze their data, and which important business questions the new system needs to answer.

We also need to work with people in different departments to understand the data and any common relationships in it, and to document all the requirements that this system must satisfy. Let us first identify the requirements from management:

Need to see monthly, quarterly and yearly sales.

Comparison of sales on various time periods.

Comparison of sales of different items.

Need to know which item has more demand on which location?

Need to study trend of sales by branch?

Design the Dimensional Model

We need to design a dimensional model that suits the requirements of users: it must address the business needs and contain information that is easily accessible. The design should be easily extensible to accommodate future needs, and it must support OLAP cubes that provide "instantaneous" query results for analysts.



Dimension: A dimension is a master table composed of individual, non-overlapping data elements. The primary functions of dimensions are to provide filtering, grouping and labelling of your data. Dimension tables contain textual descriptions of the subjects of the business.

Let us identify the dimensions for the above case study: item, branch, location and time.

Measure: A measure represents a column that contains quantifiable data, usually numeric, that can be aggregated. A measure is generally mapped to a column in a fact table.

Let us define the measures in our case: dollars_sold and units_sold.

Fact Table: The data in a fact table are called measures (or dependent attributes). The fact table provides statistics for sales broken down by the item, branch, location and time dimensions. A fact table usually contains the historical transactional entries of your live system; it is mainly made up of foreign key columns which reference the various dimensions, and numeric measure values on which aggregation will be performed.

Let us identify the attributes of our sales fact table.

Foreign key columns: time_key, item_key, branch_key, location_key. Measures: dollars_sold and units_sold.

Design the Relational Database

We have done some basic work to identify dimensions and measures; now we have to use an appropriate schema to relate the dimension and fact tables. A few popular schemas used to develop a dimensional model are the star schema, snowflake schema, starflake schema, distributed star schema, etc.



In a star schema the diagram resembles a star, with points radiating from a center. The center of the star consists of the fact table, and the points of the star are the dimension tables.

Let us create our first star schema; please refer to the figure below:

Figure 1.2 Star Schema

Using the Code

Let us execute our T-SQL script step by step to create the tables and populate them with appropriate test values. Follow the given steps to run the queries in SSMS (SQL Server Management Studio 2012).

1. Open SQL Server Management Studio



2. Connect Database Engine

3. Open New Query editor

4. Copy and paste the scripts given in the steps below into the new query editor window, one by one

5. To run the given SQL Script, press F5

Step 1 : Create database for your Data Warehouse in SQL Server:

Create database Sales_DW
Go
Use Sales_DW
Go

Step 2 : Create time dimension table in Data Warehouse which will hold time
details.

Create table DimTime


(
time_key int primary key identity,
day_of_the_week varchar(20),
month varchar(15),
quarter varchar(10),
year varchar(4)
)
go
Fill the time dimension with some sample values, for example:
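The rows below are illustrative sample values only; with the identity column, time_key is generated automatically:

-- Sample rows for DimTime (values are illustrative)
Insert into DimTime(day_of_the_week, month, quarter, year)
values ('Monday', 'January', 'Q1', '2015');
Insert into DimTime(day_of_the_week, month, quarter, year)
values ('Tuesday', 'April', 'Q2', '2015');
Go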

Step 3 : Create item Dimension table

Create table DimItem


(
item_Key int primary key identity,
item_Name varchar(50),
brand varchar(25),



type varchar(15),
supplier_type varchar(15)
)
Go
Fill the item dimension with sample Values

Step 4 : Create branch Dimension table

Create table DimBranch


(
branch_key int primary key identity,
branch_name varchar(50),
branch_type varchar(15)
)
Go
Fill the branch Dimension with sample Values

Step 5 : Create location Dimension table

Create table DimLocation


(
location_key int primary key identity,
street varchar(100)not null,
city varchar(100),
state varchar(100),
country varchar(100)
)
Go
Fill the Dimension location with sample values

Step 6 : Create the fact table to hold all the transactional entries of sales, with appropriate foreign key columns that refer to the primary key columns of your dimensions; take care while populating the fact table to use the primary key values of the appropriate dimensions.

Create Table FactSales



(
time_key int ,
item_key int ,
branch_key int ,
location_key int ,
dollars_sold float,
units_sold float
)
Go

Add Relation between Fact table and dimension tables:

-- Add relations between the fact table foreign keys and the primary keys of the dimensions
ALTER TABLE FactSales ADD CONSTRAINT FK_time_key
FOREIGN KEY (time_key) REFERENCES DimTime(time_key);

ALTER TABLE FactSales ADD CONSTRAINT FK_item_key
FOREIGN KEY (item_key) REFERENCES DimItem(item_key);

ALTER TABLE FactSales ADD CONSTRAINT FK_branch_key
FOREIGN KEY (branch_key) REFERENCES DimBranch(branch_key);

ALTER TABLE FactSales ADD CONSTRAINT FK_location_key
FOREIGN KEY (location_key) REFERENCES DimLocation(location_key);
Go

Populate your fact table with historical transaction values of sales, using proper dimension key values. After executing the above T-SQL script, your sample data warehouse for sales will be ready; you can now create an OLAP cube on the basis of this data warehouse.
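For illustration, a single fact row might be loaded as follows; the key values are assumptions and must correspond to rows already present in the dimension tables:

-- Illustrative fact row; key values must match existing dimension rows
Insert into FactSales(time_key, item_key, branch_key, location_key, dollars_sold, units_sold)
values (1, 1, 1, 1, 250.00, 5);
Go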



ii. Snowflake Schema: The main difference between the star and snowflake schemas is in the definition of the dimension tables.

The single dimension table for item in the star schema is normalised in the snowflake schema, resulting in new item and supplier tables. The item dimension table now contains the attributes item_key, item_name, brand, type and supplier_key, where supplier_key links to the supplier dimension table containing the supplier_key and supplier_type information, as shown below.

Similarly, the single location dimension table in the star schema is normalised into two tables, location and city. The city_key in the new location table links to the city dimension as shown below.
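A sketch of the normalised item and supplier dimensions is given below; the table definitions are illustrative only and the column types are assumptions chosen to match the star-schema tables above (the location/city split would follow the same pattern):

-- Sketch: supplier dimension split out of the item dimension
Create table DimSupplier
(
supplier_key int primary key identity,
supplier_type varchar(15)
)
Go
-- Sketch: item dimension redefined to reference the supplier dimension
Create table DimItem
(
item_key int primary key identity,
item_name varchar(50),
brand varchar(25),
type varchar(15),
supplier_key int foreign key references DimSupplier(supplier_key)
)
Go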



Figure 1.3 Snowflake Schema

iii. Fact Constellation or Galaxy Schema: Sophisticated applications may require multiple fact tables to share dimension tables.

The schema below specifies two fact tables, sales and shipping. The sales fact table is identical to that of the star schema.

The shipping fact table has five dimension keys, item_key, time_key, shipper_key, from_location and to_location, and two measures, dollars_cost and units_shipped.

The dimension tables for time, item and location are shared between the sales and shipping fact tables as shown below.
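A minimal sketch of the second fact table, reusing the existing shared dimensions, is shown below; the column types are assumptions, and the shipper dimension is not defined in this manual:

-- Sketch: a second fact table sharing the time, item and location dimensions
Create Table FactShipping
(
time_key int,
item_key int,
shipper_key int,      -- would reference a shipper dimension, not shown here
from_location int,
to_location int,
dollars_cost float,
units_shipped int
)
Go
-- Both location keys reference the shared location dimension
ALTER TABLE FactShipping ADD CONSTRAINT FK_ship_from
FOREIGN KEY (from_location) REFERENCES DimLocation(location_key);
ALTER TABLE FactShipping ADD CONSTRAINT FK_ship_to
FOREIGN KEY (to_location) REFERENCES DimLocation(location_key);
Go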



Figure 1.4 Fact Constellation or Galaxy Schema

2. Perform data Pre-processing using WEKA

1. Add

1. Start Weka – you get the Weka GUI chooser window.



Figure 2.1 Weka GUI chooser window

2. Click on the Explorer button to get the Weka Knowledge Explorer window.

Figure 2.2 Weka Knowledge Explorer window



3. Click on the “Open File.” button and open an ARFF file (try it first with an
example supplied in Weka-3-6/data, e.g. Weather.arff). You get the following:

Figure 2.3 Weka Weather.arff File

4.Click on Choose and select filters/unsupervised/attribute/Add.



Figure 2.4 Weka Filter Selection

5.Then click on the area right of the Choose button. You get the following:

Figure 2.5 Weka Object Editor

Click on More to get more information about these parameters.



6. Click on the Apply button to perform the addition and see the new attribute appear in the Selected attribute window.

Figure 2.6 Weka File with added attribute


Try other parameters for the filter and see how the Addition changes. Don’t
forget to reload the original (numeric) relation or Undo the Addition before
applying another one.

2. Remove

1. Start Weka – you get the Weka GUI chooser window.



Figure 2.7 Weka GUI chooser window
2. Click on the Explorer button and you get the Weka Knowledge Explorer
window. Click on the “Open File.” button and open an ARFF file (try it first
with an example supplied in Weka-3-6/data, e.g. weather.arff).

Figure 2.8 Weka Knowledge Explorer window



Figure 2.9 Weka Weather.arff File

3. Click on Choose and select filters/unsupervised/attribute/Remove.



Figure 2.10 Weka Filter Selection

4. Click on the area to the right of the Choose button. You get the following:

Figure 2.11 Weka Filter Selection properties



5. You see here the default parameters of this filter. Enter the indices of the attributes to be removed. Click on More to get more information about these parameters.

Figure 2.12 Weka Object Editor

6. Click on the Apply button to do the remove.



Figure 2.13 Weka Final File

Try other parameters for the filter and see how the remove changes. Don’t
forget to reload the original (numeric) relation or Undo the remove before
applying another one.

3. Replace Missing Values



The Pima Indians dataset is a good basis for exploring missing data. Some attributes, such as blood pressure (pres) and Body Mass Index (mass), have values of zero, which are impossible. These are examples of corrupt or missing data that must be marked manually. You can mark missing values in Weka using the NumericCleaner filter.

1. Start Weka – you get the Weka GUI chooser window.

Figure 2.14 Weka GUI chooser window

2. Click on the Explorer button and you get the Weka Knowledge Explorer window.



Figure 2.15 Weka Knowledge Explorer window

3. Click on the “Open File.” button and open an ARFF file (try it first with an
example supplied in Weka-3-6/data, e.g. diabetes.arff). You get the
following:



Figure 2.16 Weka diabetes.arff



4. Click the “Choose” button for the Filter and select NumericCleaner; it is under unsupervised.attribute.NumericCleaner.

Figure 2.17 Weka Select NumericCleaner Data Filter



5. Click on the filter to configure it.

6. Set attributeIndices to 6, the index of the mass attribute.

Figure 2.18 Weka Select NumericCleaner properties

7. Set minThreshold to 0.1E-8 (close to zero), which is the minimum value allowed for the attribute.
8. Set minDefault to NaN, which means unknown and will replace values below the threshold.



Figure 2.19 Weka configuring NumericCleaner properties

9. Click the “OK” button on the filter configuration.

10. Click the “Apply” button to apply the filter.

Click “mass” in the “attributes” pane and review the details of the “selected attribute”. Notice that the 11 attribute values that were formerly set to 0 are now marked as Missing.



Figure 2.20 Weka Missing Data Marked

In this recipe we marked values below a threshold as missing.

You could just as easily mark them with a specific numerical value. You could also mark values as missing between an upper and lower range of values.

Next, let’s look at how we can remove instances with missing values from our
dataset.



Remove Missing Data
Now that you know how to mark missing values in your data, you need to
learn how to handle them.

A simple way to handle missing data is to remove those instances that have
one or more missing values.

You can do this in Weka using the RemoveWithValues filter.

Continuing on from the above to mark missing values, you can remove
missing values as follows:

1. Click the “Choose” button for the Filter and select RemoveWithValues; it is under unsupervised.instance.RemoveWithValues.



Figure 2.21 Weka Select RemoveWithValues Data Filter

2. Click on the filter to configure it.

3. Set attributeIndices to 6, the index of the mass attribute.


4. Set matchMissingValues to “True”.

Figure 2.22 Weka configuring RemoveWithValues Data Filter

5. Click the “OK” button to use the configuration for the filter.

6. Click the “Apply” button to apply the filter.



Click “mass” in the “attributes” section and review the details of the “selected
attribute”.

Notice that the 11 attribute values that were marked Missing have been removed
from the dataset.

Figure 2.23 Weka Missing Values Removed

Note: you can undo this operation by clicking the “Undo” button.

Impute Missing Values


Instances with missing values do not have to be removed; you can replace the missing values with some other value.

This is called imputing missing values.


It is common to impute missing values with the mean of the numerical
distribution. You can do this easily in Weka using the ReplaceMissingValues filter.

Continuing on from the first recipe above to mark missing values, you can
impute the missing values as follows:

1. Click the “Choose” button for the Filter and select ReplaceMissingValues; it is under unsupervised.attribute.ReplaceMissingValues.

Figure 2.24 Weka ReplaceMissingValues Data Filter

2. Click the “Apply” button to apply the filter to your dataset.

Click “mass” in the “attributes” section and review the details of the “selected
attribute”.



Notice that the 11 attribute values that were marked Missing have been set to the
mean value of the distribution.

Figure 2.25 Weka Imputed Values

Try other parameters for the filter and see how the replaced values change. Don’t forget to reload the original (numeric) relation or Undo the replacement before applying another one.
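The imputation step can also be scripted with the Weka Java API instead of the Explorer. The sketch below only applies ReplaceMissingValues (the marking step with NumericCleaner is done beforehand, as above); the file path and the class name MissingValueDemo are assumptions used for illustration:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class MissingValueDemo {
    public static void main(String[] args) throws Exception {
        // Load the dataset (path is an assumption)
        Instances data = DataSource.read("data/diabetes.arff");
        // Replace missing values with the mean (numeric) or mode (nominal)
        ReplaceMissingValues filter = new ReplaceMissingValues();
        filter.setInputFormat(data);
        Instances imputed = Filter.useFilter(data, filter);
        System.out.println(imputed.toSummaryString());
    }
}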



4. Standardize

1. Start Weka – you get the Weka GUI chooser window.

Figure 2.26 Weka GUI chooser window


2. Click on the Explorer button and you get the Weka Knowledge Explorer window.



Figure 2.27 Weka Knowledge Explorer window

3. Click on the “Open File.” button and open an ARFF file (try it first with an
example supplied in Weka-3-6/data, e.g. weather.arff). You get the
following:



Figure 2.28 Weka weather.arff

Figure 2.29 Weka weather.arff in viewer

4. Click on Choose and select filters/unsupervised/attribute/Standardize.



Figure 2.30 Weka Standardize Filter

Then click on the area right of the Choose button. You get the following:

Figure 2.31 About Weka Standardize Filter

You see here the default parameters of this filter. Click on more to get more
information about these parameters.



5. Click on the Apply button to perform the standardization. Then select the Edit tab to view the data and see how the values have been standardized in the data window.

Figure 2.32 Data after Weka Standardize Filter

Try other parameters for the filter and see how the standardization changes. Don’t forget to reload the original (numeric) relation or Undo the standardization before applying another one.



3. Perform Discretization of data using WEKA

1. Start Weka – you get the Weka GUI chooser window.

Figure 3.1 Weka GUI chooser window


2. Click on the Explorer button and you get the Weka Knowledge Explorer window.

Figure 3.2 Weka Knowledge Explorer


3. Click on the “Open File.” button and open an ARFF file (try it first with an example supplied in Weka-3-6/data, e.g. labor.arff). You get the following:



Figure 3.3 Weka labor.arff

4. Click on Choose and select filters/unsupervised/attribute/Discretize.



Figure 3.4 Weka Discretize Filter
5. Then click on the area right of the Choose button. You get the following:



Figure 3.5 Weka Discretize Filter properties
You see here the default parameters of this filter. Click on More to get more
information about these parameters.

6. Click on the Apply button to do the discretization. Then select one of the
original numeric attributes (e.g. temperature) and see how it is discretized in
the Selected attribute window.



Figure 3.6 Weka file after Discretize Filter

Try other parameters for the filter and see how the discretization changes.
Don’t forget to reload the original (numeric) relation or Undo the discretization
before applying another one.
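The same discretization can also be applied through the Weka Java API; a minimal sketch is shown below, where the file path, the ten-bin setting and the class name DiscretizeDemo are assumptions:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class DiscretizeDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/labor.arff");   // path is an assumption
        Discretize discretize = new Discretize();
        discretize.setBins(10);              // number of equal-width bins
        discretize.setInputFormat(data);
        Instances discretized = Filter.useFilter(data, discretize);
        System.out.println(discretized);     // numeric attributes are now nominal ranges
    }
}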

4. Perform data transformation using an ETL tool

Microsoft SQL Server Integration Services (SSIS) is a platform for building high-performance data integration solutions, including extraction, transformation, and load (ETL) packages for data warehousing.

SSIS includes graphical tools and wizards for building and debugging
packages; tasks for performing workflow functions such as FTP operations,
executing SQL statements, and sending e-mail messages; data sources and
destinations for extracting and loading data; transformations for cleaning,
aggregating, merging, and copying data; a management service, the Integration
Services service for administering package execution and storage; and application
programming interfaces (APIs) for programming the Integration Services object
model.

The package that you create takes data from a flat file, reformats the data,
and then inserts the reformatted data into a fact table. SSIS Designer is used to
create a simple ETL package that includes looping, configurations, error flow logic
and logging.



ETL using SSIS Lookup

LookUp is a very useful SSIS transformation component; it performs a lookup operation by connecting an input value with the columns of a data table or dataset. It compares the source data with an existing table's dataset and separates the matching rows from the non-matching ones.

For example, let's say you have a Customer table with columns CustomerID, CustomerName, CustomerAddress and CustomerCityID, where CustomerID is the primary key and CustomerCityID is a foreign key to a City table. And let's say you have sample source data in this format: "1001" as CustomerID, "Shaam" as CustomerName, "R-no 202 - mulund naka" as CustomerAddress and "Mumbai" as CustomerCity. In the destination Customer table we have CustomerCityID, which is a foreign key (an integer value), but in the source file we have a string value, and for a proper insert we need its foreign key value. To get this foreign key value we use the #LookUp component, which compares the source records with the City master table to obtain the matching key values, which can then be written to the Customer table.

So we will take a Customer table with columns CustomerID, CustomerName, CustomerAmount, CustomerAddress, CustomerCountryID and CustomerISActive. We will also create a master table for the country list, named Country, with columns CountryId and CountryName, and add some country names to this master table.

When we load data from the source file, which contains customer records with country names, we apply the LookUp component before the data reaches the destination table (the Customer table) to compare the source records with the existing Country table and separate the matching rows from the non-matching ones. For the matching rows we replace the country name with its key and write the result to the destination table.
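Conceptually, the lookup performs the equivalent of the join sketched below; this is only an illustration of the matching logic (SSIS does it through its cached lookup), and the staging table name CustomerStage and its CustomerCountry column are assumptions:

-- Staging table holding the rows read from the flat file (name is an assumption)
SELECT s.CustomerID, s.CustomerName, s.CustomerAmount, s.CustomerAddress,
       c.CountryId AS CustomerCountryID, s.CustomerISActive
FROM CustomerStage s
JOIN Country c ON c.CountryName = s.CustomerCountry;
-- Rows whose CustomerCountry has no match in Country would go to the "no match" output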

Step 1

In this step we will go to SQL Server Management Studio and create the Country master table with columns CountryID and CountryName. After that we will add some country names to this table.

Step 2

Here in this step we will create the CustomerMaster table with columns CustomerID, CustomerName, CustomerAmount, CustomerAddress, CustomerCountryID and CustomerISActive in SQL Server Management Studio, as shown in the image below.



Step 3

Let's create our source file. For this example we will use a flat file source and add some dummy data, as shown in the image below.

Step 4

Open up MSBI studio and create an SSIS project. Once done, just drag and drop a Data Flow Task from the toolbox and double click on it.

Step 5

Since our source file is a flat file, we will use the Flat File Source component; if you want, you can use other sources such as Excel and so on.

For now, just drag and drop the Flat File Source component from the SSIS toolbox and configure it.



Since we are loading from a source file, make sure you have assigned the proper data types. To configure them, go to Connection Manager -> double click on the flat file connection -> Advanced, and assign the appropriate data types as shown in the image below.



Step 6

This is the most important step: we drag and drop the SSIS #LookUp component and attach it to the Flat File Source component as shown in the image below.



Select #LookUp, right click and configure it. Once you right click and edit, a modal box will pop up where you will see some menus on the left-hand side. Select the General menu -> "Specify how to handle rows with no matching entries".

This setting determines what to do in case rows are not matched for some reason. Here we choose "Redirect rows to no match output", meaning that unmatched rows are sent out via the no-match output. As discussed earlier, #LookUp has two outputs, Matched and No Match, so we will route unmatched rows through the No Match output. This will also help us identify errors that occur at runtime.

Keep the cache mode set to Full Cache and the connection type to an OLE DB connection. An image representation is given below.



Now, from the left-hand menu, select Connection to configure the lookup against our Country master table. As discussed earlier, we are doing this because we want the CountryID to insert into the CustomerMaster table, while the source file contains country names. So we compare the country names of the source file with the Country master table using #LookUp, output only the CountryID, and load it into the destination CustomerMaster table.

Select the Connection menu -> choose your SQL connection name -> select the Country table as shown in the image below.



Again, from the left-hand menu choose Columns; there you will find the source column names and the Country master table columns (on the right side). Since we are comparing country names, just drag the arrow between the country name columns on both sides (SCustomerCountry -> CountryName) and select CountryID as the output column, as shown in the image below.



All done; now simply click on the OK button and save it.

Step 7

Connect the Lookup Matched Output to an OLE DB Destination, and run the project.



5. Apriori algorithm using WEKA
1. Open Notepad, type the following ARFF file and save it as marketbasketanalysis.arff:

@relation marketbasketanalysis
@attribute transaction_id{100,200,300,400}
@attribute item1{0,1}
@attribute item2{0,1}
@attribute item3{0,1}
@attribute item4{0,1}
@attribute item5{0,1}
@data
100,1,1,1,0,0
200,1,1,0,1,1
300,1,0,1,1,0
400,1,0,1,0,0



1. Start Weka – you get the Weka GUI chooser window.

Figure 5.1 Weka GUI chooser window


2. Click on the Explorer button and you get the Weka Knowledge Explorer window.

Figure 5.2 Weka Knowledge Explorer



3. In the Preprocess tab, click Open file and select the marketbasketanalysis.arff file created above.

Figure 5.3 Weka marketbasketanalysis.arff file

4. Check the attribute "transaction_id" and click on the Remove button.



Figure 5.4 Weka Remove Filter

5. Go to the Associate tab.



Figure 5.5 Weka Associate tab



6. Beside the Choose button you can see Apriori with a few parameters; click on it.

Figure 5.6 Weka Apriori parameters


Set the car attribute to True and the numRules parameter to 5 to generate 5 rules.
NAME : weka.associations.Apriori : Class implementing an Apriori-type algorithm.
Iteratively reduces the minimum support until it finds the required number of
rules with the given minimum confidence. The algorithm has an option to mine
class association rules. It is adapted as explained in the second reference.

OPTIONS

minMetric -- Minimum metric score. Consider only rules with scores higher than this value.

verbose -- If enabled the algorithm will be run in verbose mode.

numRules -- Number of rules to find.

lowerBoundMinSupport -- Lower bound for minimum support.

classIndex -- Index of the class attribute. If set to -1, the last attribute is taken as class attribute.

outputItemSets -- If enabled the itemsets are output as well.

car -- If enabled class association rules are mined instead of (general) association rules.

doNotCheckCapabilities -- If set, associator capabilities are not checked before associator is built (Use with caution to reduce runtime).

removeAllMissingCols -- Remove columns with all missing values.

significanceLevel -- Significance level. Significance test (confidence metric only).

treatZeroAsMissing -- If enabled, zero (that is, the first value of a nominal) is treated in the same way as a missing value.

delta -- Iteratively decrease support by this factor. Reduces support until min support is reached or required number of rules has been generated.

metricType -- Set the type of metric by which to rank rules. Confidence is the proportion of the examples covered by the premise that are also covered by the consequence (class association rules can only be mined using confidence). Lift is confidence divided by the proportion of all examples that are covered by the consequence; this is a measure of the importance of the association that is independent of support. Leverage is the proportion of additional examples covered by both the premise and consequence above those expected if the premise and consequence were independent of each other; the total number of examples that this represents is presented in brackets following the leverage. Conviction is another measure of departure from independence and is given by P(premise)P(!consequence) / P(premise, !consequence).

upperBoundMinSupport -- Upper bound for minimum support. Start iteratively decreasing minimum support from this value.

7. Then click on the Start button; you will get the generated Apriori rules.

Figure 5.7 Weka Generated Apriori Rules
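A similar run can be reproduced from Java with the Weka API. The sketch below removes the transaction_id attribute and mines five association rules with default support settings (the car option is omitted here for simplicity); the file path and class name AprioriDemo are assumptions:

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class AprioriDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("marketbasketanalysis.arff");
        // Remove the transaction_id attribute (index 1), as done in the Explorer
        Remove remove = new Remove();
        remove.setAttributeIndices("1");
        remove.setInputFormat(data);
        Instances items = Filter.useFilter(data, remove);
        Apriori apriori = new Apriori();
        apriori.setNumRules(5);              // generate 5 rules
        apriori.buildAssociations(items);
        System.out.println(apriori);         // prints frequent item sets and the best rules
    }
}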

6. Implementation of Apriori algorithm to generate frequent item sets

C/C++ VERSION



#include<iostream.h>
#include<conio.h>
void main()
{
int i,j,t1,k,l,m,f,f1,f2,f3;
//Initial item-purchase
int a[5][5];
for(i=0;i<5;i++)
{
cout<<"\n Enter items from purchase "<<i+1<<":";
for(j=0;j<5;j++)
{
cin>>a[i][j];
}
}
//Defining minimum level for acceptance
int min;
cout<<"\n Enter minimum acceptance level";
cin>>min;
//Printing initial input
cout<<"\nInitial Input:\n";
cout<<"\nTrasaction\tItems\n";
for(i=0;i<5;i++)
{
cout<<i+1<<":\t";
for(j=0;j<5;j++)
{
cout<<a[i][j]<<"\t";
}
cout<<"\n";
}
cout<<"\nAssume minimum support: "<<min;
//First pass
int l1[5];
for(i=0;i<5;i++)
{
t1=0;
for(j=0;j<5;j++)
{
for(k=0;k<5;k++)
{
if(a[j][k]==i+1)
{
t1++;
}
}
}
l1[i]=t1;
}
//Printing first pass
cout<<"\n\nGenerating C1 from data\n";
for(i=0;i<5;i++)
{
cout<<i+1<<": "<<l1[i]<<"\n";
}
//Second pass
//Counting number of possibilities for pass2
int p2pcount=0;
int p2items[5];
int p2pos=0;
for(i=0;i<5;i++)
{
if(l1[i]>=min)
{
p2pcount++;
p2items[p2pos]=i;
p2pos++;
}
}
//Printing selected items for second pass
cout<<"\nGenerating L1 From C1\n";
for(i=0;i<p2pos;i++)
{
cout<<p2items[i]+1<<"\t"<<l1[p2items[i]]<<"\n";
}
//Joining items
int l2[5][3];
int l2t1; //will hold first item for join
int l2t2; //will hold second item for join
int l2pos1=0; //position pointer in l2 array
int l2ocount=0; //product join occurrence counter
int l2jcount=0; //join counter
for(i=0;i<p2pcount;i++)
{
for(j=i+1;j<p2pcount;j++)
{
l2t1=p2items[i]+1;
l2t2=p2items[j]+1;
if(l2t1==l2t2)
{
//it is self join
continue;
}
//join the elements
l2[l2pos1][0]=l2t1;
l2[l2pos1][1]=l2t2;
l2jcount++;
//count occurrences
l2ocount=0; //reset counter
for(k=0;k<5;k++)
{
f1=f2=0; //resetting flag
//scan a purchase
for(l=0;l<5;l++)
{
if(l2t1==a[k][l])
{
//one of the element found
f1=1;
}
if(l2t2==a[k][l])
{
//second elements also found
f2=1;
}
}
//one purchase scanned
if(f1==1&&f2==1) //both items are present in purchase
{
l2ocount++;
}
}
//assign count
l2[l2pos1][2]=l2ocount;
l2pos1++;
}
}
//Printing second pass
cout<<"\n\nGenerating L2\n";
for(i=0;i<l2jcount;i++)
{
for(j=0;j<3;j++)
{
cout<<l2[i][j]<<"\t";
}
cout<<"\n";
}
//Third pass
int p3pcount=0;
int p3items[5]={-1,-1,-1,-1,-1};
int p3pos=0;
for(i=0;i<5;i++)
{
if(l2[i][2]>=min)
{
f=0;
for(j=0;j<5;j++)
{
if(p3items[j]==l2[i][0])
{
f=1;
}
}
if(f!=1)
{
p3items[p3pos]=l2[i][0];
p3pos++;
p3pcount++;
}
f=0;
for(j=0;j<5;j++)
{
if(p3items[j]==l2[i][1])
{
f=1;
}
}
if(f!=1)
{
p3items[p3pos]=l2[i][1];
p3pos++;
p3pcount++;
}
}
}
//Joining
int l3[5][4];
int l3ocount=0; //occurrence counter
int l3jcount=0; //join counter
for(i=0;i<p3pcount;i++)
{
for(j=i+1;j<p3pcount;j++)
{
for(k=j+1;k<p3pcount;k++)
{
l3[i][0]=p3items[i];
l3[i][1]=p3items[j];
l3[i][2]=p3items[k];
l3jcount++;
//count occurrences
l3ocount=0; //reset counter
for(k=0;k<5;k++)
{
f1=f2=f3=0; //resetting flag
//scan a purchase
for(l=0;l<5;l++)
{
if(l3[i][0]==a[k][l])
{
//one of the element found
f1=1;
}
if(l3[i][1]==a[k][l])
{
//second elements also found
f2=1;
}
if(l3[i][2]==a[k][l])
{
//third element also found
f3=1;
}
}
//one purchase scanned
if(f1==1&&f2==1&&f3==1) //all items are present in purchase
{
l3ocount++;
}
}
//assign count
l3[i][3]=l3ocount;
}
}
}
//Printing second pass
cout<<"\n\nGenerating L3\n";
for(i=0;i<l3jcount;i++)
{
for(j=0;j<4;j++)
{
cout<<l3[i][j]<<"\t";
}
cout<<"\n";
}
//Ending
getch();
}

/* Output
Enter items from purchase 1:1
5
2
0
0
Enter items from purchase 2:2
3
4
1
0
Enter items from purchase 3:3
4
0
0
0
Enter items from purchase 4:2
1
3
0
0
Enter items from purchase 5:1
2
3
0
0
Enter minimum acceptance level3
Initial Input:
Transaction Items
1: 1 5 2 0 0
2: 2 3 4 1 0
3: 3 4 0 0 0
4: 2 1 3 0 0
5: 1 2 3 0 0
Assume minimum support: 3
Generating C1 from data
1: 4
2: 4
3: 4
4: 2
5: 1
Generating L1 From C1
1 4
2 4
3 4
Generating L2
1 2 4
1 3 3
2 3 3
Generating L3
1 2 3 3
*/

JAVA VERSION

/* SQL Queries for database:



CREATE TABLE apriori(transaction_id int, object int);

INSERT INTO apriori VALUES(1, 1);


INSERT INTO apriori VALUES(1, 3);
INSERT INTO apriori VALUES(1, 4);
INSERT INTO apriori VALUES(2, 2);
INSERT INTO apriori VALUES(2, 3);
INSERT INTO apriori VALUES(2, 5);
INSERT INTO apriori VALUES(3, 1);
INSERT INTO apriori VALUES(3, 2);
INSERT INTO apriori VALUES(3, 3);
INSERT INTO apriori VALUES(3, 5);
INSERT INTO apriori VALUES(4, 2);
INSERT INTO apriori VALUES(4, 5);

*/

import java.util.*;
import java.sql.*;

class Tuple {
Set<Integer> itemset;
int support;

Tuple() {
itemset = new HashSet<>();
support = -1;
}

Tuple(Set<Integer> s) {
itemset = s;
support = -1;
}

Tuple(Set<Integer> s, int i) {
itemset = s;
support = i;
}
}

class Apriori {
static Set<Tuple> c;
static Set<Tuple> l;
static int d[][];
static int min_support;

public static void main(String args[]) throws Exception {


getDatabase();
c = new HashSet<>();
l = new HashSet<>();
Scanner scan = new Scanner(System.in);
int i, j, m, n;
System.out.println("Enter the minimum support (as an integer value):");
min_support = scan.nextInt();
Set<Integer> candidate_set = new HashSet<>();
for(i=0 ; i < d.length ; i++) {
System.out.println("Transaction Number: " + (i+1) + ":");
for(j=0 ; j < d[i].length ; j++) {
System.out.print("Item number " + (j+1) + " = ");
System.out.println(d[i][j]);
candidate_set.add(d[i][j]);
}
}

Iterator<Integer> iterator = candidate_set.iterator();


while(iterator.hasNext()) {
Set<Integer> s = new HashSet<>();
s.add(iterator.next());
Tuple t = new Tuple(s, count(s));
c.add(t);
}

prune();
generateFrequentItemsets();
}

static int count(Set<Integer> s) {


int i, j, k;
int support = 0;
int count;
boolean containsElement;
for(i=0 ; i < d.length ; i++) {
count = 0;
Iterator<Integer> iterator = s.iterator();
while(iterator.hasNext()) {
int element = iterator.next();
containsElement = false;
for(k=0 ; k < d[i].length ; k++) {
if(element == d[i][k]) {
containsElement = true;
count++;
break;
}
}
if(!containsElement) {
break;
}
}
if(count == s.size()) {
support++;
}
}
return support;
}

static void prune() {


l.clear();
Iterator<Tuple> iterator = c.iterator();
while(iterator.hasNext()) {
Tuple t = iterator.next();
if(t.support >= min_support) {
l.add(t);
}
}
System.out.println("-+- L -+-");
for(Tuple t : l) {
System.out.println(t.itemset + " : " + t.support);
}
}

static void generateFrequentItemsets() {


boolean toBeContinued = true;
int element = 0;
int size = 1;
Set<Set> candidate_set = new HashSet<>();
while(toBeContinued) {
candidate_set.clear();
c.clear();
Iterator<Tuple> iterator = l.iterator();
while(iterator.hasNext()) {
Tuple t1 = iterator.next();
Set<Integer> temp = t1.itemset;
Iterator<Tuple> it2 = l.iterator();
while(it2.hasNext()) {
Tuple t2 = it2.next();
Iterator<Integer> it3 = t2.itemset.iterator();
while(it3.hasNext()) {
try {
element = it3.next();
} catch(ConcurrentModificationException e)
{
// Sometimes this Exception gets thrown, so simply break in that case.
break;
}
temp.add(element);
if(temp.size() != size) {
Integer[] int_arr = temp.toArray(new
Integer[0]);
Set<Integer> temp2 = new
HashSet<>();
for(Integer x : int_arr) {
temp2.add(x);
}
candidate_set.add(temp2);
temp.remove(element);
}
}
}
}
Iterator<Set> candidate_set_iterator = candidate_set.iterator();
while(candidate_set_iterator.hasNext()) {
Set s = candidate_set_iterator.next();
// These lines cause warnings, as the candidate_set Set stores a raw set.
c.add(new Tuple(s, count(s)));
}
prune();
if(l.size() <= 1) {
toBeContinued = false;
}
size++;
}
System.out.println("\n=+= FINAL LIST =+=");
for(Tuple t : l) {
System.out.println(t.itemset + " : " + t.support);
}
}

static void getDatabase() throws Exception {


Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
Connection con = DriverManager.getConnection("jdbc:odbc:DWM");
Statement s = con.createStatement();



ResultSet rs = s.executeQuery("SELECT * FROM apriori;");
Map<Integer, List <Integer>> m = new HashMap<>();
List<Integer> temp;
while(rs.next()) {
int list_no = Integer.parseInt(rs.getString(1));
int object = Integer.parseInt(rs.getString(2));
temp = m.get(list_no);
if(temp == null) {
temp = new LinkedList<>();
}
temp.add(object);
m.put(list_no, temp);
}

Set<Integer> keyset = m.keySet();


d = new int[keyset.size()][];
Iterator<Integer> iterator = keyset.iterator();
int count = 0;
while(iterator.hasNext()) {
temp = m.get(iterator.next());
Integer[] int_arr = temp.toArray(new Integer[0]);
d[count] = new int[int_arr.length];
for(int i=0 ; i < d[count].length ; i++) {
d[count][i] = int_arr[i].intValue();
}
count++;
}
}
}

/*

OUTPUT:
Enter the minimum support :2
Transaction Number: 1:
Item number 1 = 1
Item number 2 = 3
Item number 3 = 4
Transaction Number: 2:
Item number 1 = 2
Item number 2 = 3
Item number 3 = 5
Transaction Number: 3:
Item number 1 = 1
Item number 2 = 2
Item number 3 = 3
Item number 4 = 5
Transaction Number: 4:
Item number 1 = 2
Item number 2 = 5
-+- L -+-
[1] : 2
[3] : 3
[2] : 3
[5] : 3
-+- L -+-
[2, 3] : 2
[3, 5] : 2
[1, 3] : 2
[2, 5] : 3
-+- L -+-
[2, 3, 5] : 2

=+= FINAL LIST =+=


[2, 3, 5] : 2

*/



7. Classification algorithms using WEKA (i. Decision Tree Induction, ii. KNN)

i.Decision Tree Induction

This experiment illustrates the use of C4.5 (J48) classifier in WEKA. The
sample data set used, unless otherwise indicated, is the bank data available in
comma-separated format (bank-data.csv).

This assumes that appropriate data preprocessing has been performed; in this case the ID field has been removed. Since the C4.5 algorithm can handle numeric attributes, there is no need to discretize any of the attributes. For our purpose, however, the "Children" attribute has been converted into a categorical attribute with values "YES" or "NO".

WEKA has implementations of numerous classification and prediction algorithms. The basic ideas behind using all of these are similar. In this exercise we will use the modified version of the bank data to classify new instances using the C4.5 algorithm (note that C4.5 is implemented in WEKA by the classifier class weka.classifiers.trees.J48).



The modified (and smaller) version of the bank data can be found in the
file "bank.arff" and the new unclassified instances are in the file "bank-new.arff".

As usual, we begin by loading the data into WEKA, as seen in the figure below:

Figure 7.1 Weka bankdata.arff

Next, we select the "Classify" tab and click the "Choose" button to select the J48 classifier, as depicted in the figures below. Note that J48 (an implementation of the C4.5


algorithm) does not require discretization of numeric attributes, in contrast to the
ID3 algorithm from which C4.5 has evolved.

Figure 7.2 Weka Classify tab



Figure 7.3 Weka Classify Rules Selection

Now, we can specify the various parameters. These can be specified by clicking in the text box to the right of the "Choose" button. In this example we accept the default values. The default version does perform some pruning (using the subtree raising approach), but does not perform error pruning. The selected parameters are depicted in the figure below.



Figure 7.4 Weka Classify Parameters Selection

Under the "Test options" in the main panel we select 10-fold cross-
validation as our evaluation approach. Since we do not have separate evaluation
data set, this is necessary to get a reasonable idea of accuracy of the generated
model. We now click "Start" to generate the model.



Figure 7.5 Weka Classified Rules1

We can view this information in a separate window by right clicking the last
result set (inside the "Result list" panel on the left) and selecting "View in separate
window" from the pop-up menu. These steps and the resulting window containing
the classification results are depicted in the below Figures .



Figure 7.6 Weka Classified Rules2



Figure 7.7 Weka Classified Rules3

Note that the classification accuracy of our model is only about 69%. This may indicate that we need to do more work (either in preprocessing or in selecting the correct parameters for classification) before building another model.



WEKA also lets us view a graphical rendition of the classification tree. This can be done by right clicking the last result set (as before) and selecting "Visualize tree" from the pop-up menu. The tree for this example is depicted in the figure below. Note that by resizing the window and selecting various menu items from inside the tree view (using the right mouse button), we can adjust the tree view to make it more readable.

Figure 7.8 Weka Classified tree view
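The same experiment can be run from Java code; the minimal sketch below evaluates J48 with 10-fold cross-validation using the Weka API. The file path, the class name J48Demo, and the assumption that the class attribute is the last attribute in bank.arff are all illustrative:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Demo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("bank.arff");        // path is an assumption
        data.setClassIndex(data.numAttributes() - 1);          // assume class is the last attribute
        J48 tree = new J48();                                   // default parameters, as in the Explorer
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());             // accuracy and error statistics
        tree.buildClassifier(data);                             // build on the full data to inspect the tree
        System.out.println(tree);                               // textual rendition of the pruned tree
    }
}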



ii.KNN

In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of its single nearest neighbor.
k-NN is a type of instance-based learning, or lazy learning, where the
function is only approximated locally and all computation is deferred until
classification. The k-NN algorithm is among the simplest of all machine learning
algorithms.
Both for classification and regression, it can be useful to assign weight to
the contributions of the neighbors, so that the nearer neighbors contribute more
to the average than the more distant ones. For example, a common weighting
scheme consists in giving each neighbor a weight of 1/d, where d is the distance
to the neighbor.
The neighbors are taken from a set of objects for which the class (for k-NN
classification) or the object property value (for k-NN regression) is known. This can
be thought of as the training set for the algorithm, though no explicit training step
is required.
In Weka this classifier is called IBk (instance-based learning with parameter k) and it
lives in the "lazy" folder of classifiers. IBk's KNN parameter specifies the
number of nearest neighbors to use when classifying a test instance, and the
outcome is determined by majority vote.

Weka's IBk implementation also has a cross-validation option that can help choose the
best value automatically: when it is enabled, Weka uses hold-one-out cross-validation to
select the best value for the KNN parameter (which is the same as k).
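
As an aside, the same selection can be requested programmatically. The sketch below is a minimal example of our own, using only the -K and -X options documented further down, that lets IBk pick k by hold-one-out cross-validation; the file name is a placeholder.

import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class IBkSetup {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata.arff");   // placeholder file name
        data.setClassIndex(data.numAttributes() - 1);

        IBk knn = new IBk();
        // -K 10: upper limit on the number of neighbours,
        // -X: pick the best k (1..10) by hold-one-out cross-validation
        knn.setOptions(weka.core.Utils.splitOptions("-K 10 -X"));
        knn.buildClassifier(data);

        System.out.println(knn);   // prints the chosen number of neighbours
    }
}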



Class weka.classifiers.IBk
java.lang.Object
|
+----weka.classifiers.Classifier
|
+----weka.classifiers.DistributionClassifier
|
+----weka.classifiers.IBk

public class IBk extends DistributionClassifier


implements OptionHandler, UpdateableClassifier, WeightedInstancesHandler

Valid options are:

-K num
Set the number of nearest neighbors to use in prediction (default 1)

-W num
Set a fixed window size for incremental train/testing. As new training instances are
added, oldest instances are removed to maintain the number of training instances
at this size. (default no window)

-D
Neighbors will be weighted by the inverse of their distance when voting. (default
equal weighting)

-F
Neighbors will be weighted by their similarity when voting. (default equal
weighting)

-X
Selects the number of neighbors to use by hold-one-out cross validation, with an
upper limit given by the -K option.

-S
When k is selected by cross-validation for numeric class attributes, minimize
mean-squared error. (default mean absolute error)
NAME : weka.classifiers.lazy.IBk : K-nearest neighbours classifier. Can select
appropriate value of K based on cross-validation. Can also do distance weighting.

OPTIONS

numDecimalPlaces -- The number of decimal places to be used for the output of
numbers in the model.

batchSize -- The preferred number of instances to process if batch prediction is
being performed. More or fewer instances may be provided, but this gives
implementations a chance to specify a preferred batch size.

KNN -- The number of neighbours to use.

distanceWeighting -- Gets the distance weighting method used.

nearestNeighbourSearchAlgorithm -- The nearest neighbour search algorithm to
use (Default: weka.core.neighboursearch.LinearNNSearch).

debug -- If set to true, classifier may output additional info to the console.

windowSize -- Gets the maximum number of instances allowed in the training
pool. The addition of new instances above this value will result in old instances
being removed. A value of 0 signifies no limit to the number of training instances.

doNotCheckCapabilities -- If set, classifier capabilities are not checked before
the classifier is built (use with caution to reduce runtime).

meanSquared -- Whether the mean squared error is used rather than mean
absolute error when doing cross-validation for regression problems.

crossValidate -- Whether hold-one-out cross-validation will be used to select the
best k value between 1 and the value specified as the KNN parameter.

Nearest Neighbor: Nearest Neighbor (also known as Collaborative Filtering
or Instance-based Learning) is a useful data mining technique that allows you to
use your past data instances, with known output values, to predict an unknown
output value of a new data instance.

So, at this point, this description should sound similar to both regression
and classification. How is this different from those two? Well, first off, remember
that regression can only be used for numerical outputs. That differentiates it from
Nearest Neighbor immediately.

Classification uses every data instance to create a tree, which we would
traverse to find our answer. This can be a serious problem with some data. Think
about a company like Amazon and the common "Customers who purchased X also
purchased Y" feature.

If Amazon were to create a classification tree, how many branches and
nodes could it have? There are maybe a few hundred thousand products. How big
would that tree be? How accurate do you think a tree that big would be? Even if
you got to a single branch, you might be shocked to learn that it only has three
products. Amazon's page likes to have 12 products on it to recommend to you. It's
a bad data mining model for this data.

You'll find that Nearest Neighbor fixes all those problems in a very efficient
manner, especially in the example used above for Amazon. It's not limited to any
number of comparisons. It's as scalable for a 20-customer database as it is for a 20
million-customer database, and you can define the number of results you want to
find. Seems like a great technique! It really is — and probably will be the most
useful for anyone reading this who has an e-commerce store.

Math behind Nearest Neighbor : You will see that the math behind the
Nearest Neighbor technique is a lot like the math involved with the clustering
technique. Taking the unknown data point, the distance between the unknown
data point and every known data point needs to be computed. Finding the
distance is really quite trivial with a spreadsheet, and a high-powered computer
can zip through these calculations nearly instantly. The easiest and most common
distance calculation is the "Normalized Euclidian Distance." It sounds much more
complicated than it really is.

Let's take a look at an example in action and try to figure out what
Customer No. 5 is likely to purchase.

Listing 1. Nearest Neighbor math

Customer   Age   Income   Purchased Product
1          45    46k      Book
2          39    100k     TV
3          35    38k      DVD
4          69    150k     Car Cover
5          58    51k      ???

Step 1: Apply the Distance Formula

Step 2: Calculate the Score

Customer   Score   Purchased Product
1          .385    Book
2          .710    TV
3          .686    DVD
4          .941    Car Cover
5          0.0     ???
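
For reference, the scores in Listing 1 appear to come from min-max normalization of Age and Income over their observed ranges, followed by a Euclidean distance to Customer No. 5. The short sketch below is our own illustration (the class and variable names are not part of the original listing) and reproduces those values:

public class NearestNeighborScores {
    public static void main(String[] args) {
        double[] age    = {45, 39, 35, 69, 58};     // customers 1..5
        double[] income = {46, 100, 38, 150, 51};   // in thousands

        // spans used for min-max normalization
        double ageSpan = 69 - 35, incomeSpan = 150 - 38;

        for (int i = 0; i < 4; i++) {
            double dAge = (age[4] - age[i]) / ageSpan;           // normalized age difference
            double dInc = (income[4] - income[i]) / incomeSpan;  // normalized income difference
            double score = Math.sqrt(dAge * dAge + dInc * dInc);
            System.out.printf("Customer %d: %.3f%n", i + 1, score);
        }
        // prints roughly 0.385, 0.710, 0.686, 0.941, matching Listing 1
    }
}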

To answer the question "What is Customer No. 5 most likely to buy?" based
on the Nearest Neighbor algorithm we ran through above, the answer would be a
book. This is because the distance between Customer No. 5 and Customer No. 1 is
less (far less, actually) than the distance between Customer No. 5 and any other
customer. Based on this model, we say that the customer most like Customer No.
5 can predict the behavior of Customer No. 5.

However, the positives of Nearest Neighbor don't end there. The Nearest
Neighbor algorithm can be expanded beyond the closest match to include any
number of closest matches. These are termed "N-Nearest Neighbors" (for
example, 3-Nearest Neighbors).

Using the above example, if we want to know the two most likely products
to be purchased by Customer No. 5, we would conclude that they are a book and a
DVD. Using the Amazon example from above, if they wanted to know the 12
products most likely to be purchased by a customer, they would want to run a 12-
Nearest Neighbor algorithm (though Amazon actually runs something more
complicated than just a simple 12-Nearest Neighbor algorithm).

Further, the algorithm shouldn't be constrained to predicting a product to
be purchased. It can also be used to predict a Yes/No output value. Considering
the above example, if we changed the last column to the following (from
customers 1-4), "Yes,No,Yes,No," a 1-Nearest Neighbor model would predict
Customer No. 5 to say "Yes," a 2-Nearest Neighbor model would predict "Yes" (both
customer nos. 1 and 3 say "Yes"), and a 3-Nearest Neighbor model would also say
"Yes" (customer nos. 1 and 3 say "Yes," customer No. 2 says "No," so the majority
vote is "Yes").

The final question to consider is "How many neighbors should we use in our
model?". You'll find that experimentation will be needed to determine the best
number of neighbors to use. Also, if you are trying to predict the output of a
column with a 0 or 1 value, you'd obviously want to select an odd number of
neighbors, in order to break ties.

Data set for WEKA : The data set we'll use is our fictional BMW dealership
and the promotional campaign to sell a two-year extended warranty to past
customers. There are 4,500 data points from past sales of extended warranties.
The attributes in the data set are Income Bracket [0=$0-$30k, 1=$31k-$40k,
2=$41k-$60k, 3=$61k-$75k, 4=$76k-$100k, 5=$101k-$150k, 6=$151k-$500k,
7=$501k+], the year/month their first BMW was bought, the year/month the most
recent BMW was bought, and whether they responded to the extended warranty
offer in the past.

Listing 2. Nearest Neighbor WEKA data


@attribute IncomeBracket {0,1,2,3,4,5,6,7}
@attribute FirstPurchase numeric
@attribute LastPurchase numeric
@attribute responded {1,0}

@data

4,200210,200601,0
5,200301,200601,1
...

Nearest Neighbor in WEKA : Load the data file bmw-training.arff into WEKA
using the same steps we've used to this point in the Preprocess tab. Your screen
should look like the figure below after loading in the data.

Figure 7.9 BMW Nearest Neighbor data in WEKA


Like we did with the regression and classification model in the previous
articles, we should next select the Classify tab. On this tab, we should select lazy,
then select IBk (the IB stands for Instance-Based, and the k allows us to specify
the number of neighbors to examine).

Figure 7.10 BMW Nearest Neighbor algorithm

At this point, we are ready to create our model in WEKA. Ensure that Use
training set is selected so we use the data set we just loaded to create our model.
Click Start and let WEKA run. The figure below shows a screenshot, and the listing
that follows contains the output from this model.

Figure 7.11 BMW Nearest Neighbor model

Listing. Output of IBk calculations

=== Evaluation on training set ===
=== Summary ===

Correctly Classified Instances        2663               88.7667 %
Incorrectly Classified Instances       337               11.2333 %
Kappa statistic                          0.7748
Mean absolute error                      0.1326
Root mean squared error                  0.2573
Relative absolute error                 26.522  %
Root relative squared error             51.462  %
Total Number of Instances             3000

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 0.95      0.177     0.847      0.95       0.896      0.972      1
                 0.823     0.05      0.941      0.823      0.878      0.972      0
Weighted Avg.    0.888     0.114     0.893      0.888      0.887      0.972

=== Confusion Matrix ===

    a    b   <-- classified as
 1449   76 |    a = 1
  261 1214 |    b = 0
How does this compare with our results when we used classification to
create a model? Well, this model using Nearest Neighbor has an 89-percent
accuracy rating, while the previous model only had a 59-percent accuracy rating,
so that's definitely a good start. Nearly a 90-percent accuracy rating would be very
acceptable. Let's take this a step further and interpret the results in terms of false
positives and false negatives, so you can see how the results from WEKA apply in a
real business sense.

The results of the model say we have 76 false positives (2.5 percent), and
we have 261 false negatives (8.7 percent). Remember a false positive, in this
example, means that our model predicted the customer would buy an extended
warranty and actually didn't, and a false negative means that our model predicted
they wouldn't buy an extended warranty, and they actually did.

Let's estimate that the flier the dealership sends out cost $3 each and that
the extended warranty brings in $400 profit for the dealer. This model from a
cost/benefit perspective to the dealership would be $400 - (2.5% * $3) - (8.7% *
400) = $365. So, the model looks rather profitable for the dealership. Compare
that to the classification model, which had a cost/benefit of only $400 - (17.2% *
$3) - (23.7% * $400) = $304, and you can see that using the right model offered a
20-percent increase in potential revenue for the dealership.
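
As a quick check of that arithmetic, here is a small sketch of our own; the percentages and dollar amounts are the ones quoted above.

public class CostBenefit {
    public static void main(String[] args) {
        double profit = 400, flierCost = 3;
        // Nearest Neighbor model: 2.5% false positives, 8.7% false negatives
        double knnValue  = profit - 0.025 * flierCost - 0.087 * profit;
        // Classification model: 17.2% false positives, 23.7% false negatives
        double treeValue = profit - 0.172 * flierCost - 0.237 * profit;
        // prints 365.13 and 304.68, which the text rounds to $365 and $304
        System.out.printf("KNN: %.2f   classification tree: %.2f%n", knnValue, treeValue);
    }
}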

As an exercise for yourself, play with the number of nearest neighbors in the
model (you do this by right-clicking on the text "IBk -K 1...." and you see a list of
parameters). You can change the "KNN" (K-nearest neighbors) to be anything you
want. You'll see in this example, that the accuracy of the model actually decreases
with the inclusion of additional neighbors.
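
If you would rather run that sweep in code than through the GUI, one possible sketch is shown below. It is our own variant: it scores each k with 10-fold cross-validation rather than the training-set evaluation used above, and it assumes the same bmw-training.arff file.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SweepK {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("bmw-training.arff");
        data.setClassIndex(data.numAttributes() - 1);

        for (int k = 1; k <= 7; k += 2) {              // try k = 1, 3, 5, 7
            IBk knn = new IBk(k);                      // fixed number of neighbours
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(knn, data, 10, new Random(1));
            System.out.printf("k = %d  accuracy = %.2f%%%n", k, eval.pctCorrect());
        }
    }
}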

Some final take-aways from this model: The power of Nearest Neighbor
becomes obvious when we talk about data sets like Amazon. With its 20 million
users, the algorithm is very accurate, since there are likely many potential
customers in Amazon's database with similar buying habits to you.

Thus, the nearest neighbor to yourself is likely very similar. This creates an
accurate and effective model. Contrarily, the model breaks down quickly and
becomes inaccurate when you have few data points for comparison. In the early
stages of an online e-commerce store for example, when there are only 50
customers, a product recommendation feature will likely not be accurate at all, as
the nearest neighbor may in fact be very distant from yourself.

The final challenge with the Nearest Neighbor technique is that it has the
potential to be a computationally expensive algorithm. In Amazon's case, with 20
million customers, each customer must be compared against the other 20 million
customers to find the nearest neighbors.

First, if your business has 20 million customers, that's not technically a
problem because you're likely rolling in money. Second, these types of
computations are ideal for the cloud in that they can be offloaded to dozens of
computers to be run simultaneously, with a final comparison done at the end.
(Google's MapReduce for example.)

Third, in practice, it wouldn't be necessary to compare every customer in
Amazon's database to myself if I'm only purchasing a book. The assumption can
be made that I can be compared to only other book buyers to find the best match,
narrowing the potential neighbors to a fraction of the entire database.

Remember: Data mining models aren't always simple input-output
mechanisms — the data must be examined to determine the right model to
choose, the input can be managed to reduce computing time, and the output
must be analyzed and accurate before you are ready to put a stamp of approval on
the entire thing.

8. Implement the following classification algorithms


i.Decision Tree Induction ii.KNN

i.)Decision Tree Induction

import java.io.*;

class DecisionTree {

    /* ------ FIELDS ------ */

    /* NESTED CLASS */

    private class BinTree {

        /* FIELDS */

        private int nodeID;
        private String questOrAns = null;
        private BinTree yesBranch = null;
        private BinTree noBranch = null;

        /* CONSTRUCTOR */

        public BinTree(int newNodeID, String newQuestAns) {
            nodeID = newNodeID;
            questOrAns = newQuestAns;
        }
    }

    /* OTHER FIELDS */

    static BufferedReader keyboardInput =
            new BufferedReader(new InputStreamReader(System.in));
    BinTree rootNode = null;

    /* ------ CONSTRUCTORS ------ */

    /* Default Constructor */

    public DecisionTree() {
    }

    /* ------ TREE BUILDING METHODS ------ */

    /* CREATE ROOT NODE */

    public void createRoot(int newNodeID, String newQuestAns) {
        rootNode = new BinTree(newNodeID, newQuestAns);
        System.out.println("Created root node " + newNodeID);
    }

    /* ADD YES NODE */

    public void addYesNode(int existingNodeID, int newNodeID, String newQuestAns) {
        // If no root node do nothing
        if (rootNode == null) {
            System.out.println("ERROR: No root node!");
            return;
        }
        // Search tree
        if (searchTreeAndAddYesNode(rootNode, existingNodeID, newNodeID, newQuestAns)) {
            System.out.println("Added node " + newNodeID +
                    " onto \"yes\" branch of node " + existingNodeID);
        }
        else System.out.println("Node " + existingNodeID + " not found");
    }

    /* SEARCH TREE AND ADD YES NODE */

    private boolean searchTreeAndAddYesNode(BinTree currentNode,
            int existingNodeID, int newNodeID, String newQuestAns) {
        if (currentNode.nodeID == existingNodeID) {
            // Found node
            if (currentNode.yesBranch == null)
                currentNode.yesBranch = new BinTree(newNodeID, newQuestAns);
            else {
                System.out.println("WARNING: Overwriting previous node " +
                        "(id = " + currentNode.yesBranch.nodeID +
                        ") linked to yes branch of node " + existingNodeID);
                currentNode.yesBranch = new BinTree(newNodeID, newQuestAns);
            }
            return (true);
        }
        else {
            // Try yes branch if it exists
            if (currentNode.yesBranch != null) {
                if (searchTreeAndAddYesNode(currentNode.yesBranch,
                        existingNodeID, newNodeID, newQuestAns)) {
                    return (true);
                }
                else {
                    // Try no branch if it exists
                    if (currentNode.noBranch != null) {
                        return (searchTreeAndAddYesNode(currentNode.noBranch,
                                existingNodeID, newNodeID, newQuestAns));
                    }
                    else return (false); // Not found here
                }
            }
            return (false); // Not found here
        }
    }

    /* ADD NO NODE */

    public void addNoNode(int existingNodeID, int newNodeID, String newQuestAns) {
        // If no root node do nothing
        if (rootNode == null) {
            System.out.println("ERROR: No root node!");
            return;
        }
        // Search tree
        if (searchTreeAndAddNoNode(rootNode, existingNodeID, newNodeID, newQuestAns)) {
            System.out.println("Added node " + newNodeID +
                    " onto \"no\" branch of node " + existingNodeID);
        }
        else System.out.println("Node " + existingNodeID + " not found");
    }

    /* SEARCH TREE AND ADD NO NODE */

    private boolean searchTreeAndAddNoNode(BinTree currentNode,
            int existingNodeID, int newNodeID, String newQuestAns) {
        if (currentNode.nodeID == existingNodeID) {
            // Found node
            if (currentNode.noBranch == null)
                currentNode.noBranch = new BinTree(newNodeID, newQuestAns);
            else {
                System.out.println("WARNING: Overwriting previous node " +
                        "(id = " + currentNode.noBranch.nodeID +
                        ") linked to no branch of node " + existingNodeID);
                currentNode.noBranch = new BinTree(newNodeID, newQuestAns);
            }
            return (true);
        }
        else {
            // Try yes branch if it exists
            if (currentNode.yesBranch != null) {
                if (searchTreeAndAddNoNode(currentNode.yesBranch,
                        existingNodeID, newNodeID, newQuestAns)) {
                    return (true);
                }
                else {
                    // Try no branch if it exists
                    if (currentNode.noBranch != null) {
                        return (searchTreeAndAddNoNode(currentNode.noBranch,
                                existingNodeID, newNodeID, newQuestAns));
                    }
                    else return (false); // Not found here
                }
            }
            else return (false); // Not found here
        }
    }

    /* ------ TREE QUERY METHODS ------ */

    public void queryBinTree() throws IOException {
        queryBinTree(rootNode);
    }

    private void queryBinTree(BinTree currentNode) throws IOException {
        // Test for leaf node (answer) and missing branches
        if (currentNode.yesBranch == null) {
            if (currentNode.noBranch == null)
                System.out.println(currentNode.questOrAns);
            else System.out.println("Error: Missing \"Yes\" branch at \"" +
                    currentNode.questOrAns + "\" question");
            return;
        }
        if (currentNode.noBranch == null) {
            System.out.println("Error: Missing \"No\" branch at \"" +
                    currentNode.questOrAns + "\" question");
            return;
        }
        // Question
        askQuestion(currentNode);
    }

    private void askQuestion(BinTree currentNode) throws IOException {
        System.out.println(currentNode.questOrAns + " (enter \"Yes\" or \"No\")");
        String answer = keyboardInput.readLine();
        if (answer.equals("Yes")) queryBinTree(currentNode.yesBranch);
        else {
            if (answer.equals("No")) queryBinTree(currentNode.noBranch);
            else {
                System.out.println("ERROR: Must answer \"Yes\" or \"No\"");
                askQuestion(currentNode);
            }
        }
    }

    /* ------ TREE OUTPUT METHODS ------ */

    /* OUTPUT BIN TREE */

    public void outputBinTree() {
        outputBinTree("1", rootNode);
    }

    private void outputBinTree(String tag, BinTree currentNode) {
        // Check for empty node
        if (currentNode == null) return;
        // Output
        System.out.println("[" + tag + "] nodeID = " + currentNode.nodeID +
                ", question/answer = " + currentNode.questOrAns);
        // Go down yes branch
        outputBinTree(tag + ".1", currentNode.yesBranch);
        // Go down no branch
        outputBinTree(tag + ".2", currentNode.noBranch);
    }
}

class DecisionTreeApp {

    /* ------ FIELDS ------ */

    static BufferedReader keyboardInput =
            new BufferedReader(new InputStreamReader(System.in));
    static DecisionTree newTree;

    /* ------ METHODS ------ */

    /* MAIN */

    public static void main(String[] args) throws IOException {
        // Create instance of class DecisionTree
        newTree = new DecisionTree();
        // Generate tree
        generateTree();
        // Output tree
        System.out.println("\nOUTPUT DECISION TREE");
        System.out.println("====================");
        newTree.outputBinTree();
        // Query tree
        queryTree();
    }

    /* GENERATE TREE */

    static void generateTree() {
        System.out.println("\nGENERATE DECISION TREE");
        System.out.println("======================");
        newTree.createRoot(1, "Does animal eat meat?");
        newTree.addYesNode(1, 2, "Does animal have stripes?");
        newTree.addNoNode(1, 3, "Does animal have stripes?");
        newTree.addYesNode(2, 4, "Animal is a Tiger");
        newTree.addNoNode(2, 5, "Animal is a Leopard");
        newTree.addYesNode(3, 6, "Animal is a Zebra");
        newTree.addNoNode(3, 7, "Animal is a Horse");
    }

    /* QUERY TREE */

    static void queryTree() throws IOException {
        System.out.println("\nQUERY DECISION TREE");
        System.out.println("===================");
        newTree.queryBinTree();
        // Option to exit
        optionToExit();
    }

    /* OPTION TO EXIT PROGRAM */

    static void optionToExit() throws IOException {
        System.out.println("Exit? (enter \"Yes\" or \"No\")");
        String answer = keyboardInput.readLine();
        if (answer.equals("Yes")) return;
        else {
            if (answer.equals("No")) queryTree();
            else {
                System.out.println("ERROR: Must answer \"Yes\" or \"No\"");
                optionToExit();
            }
        }
    }
}

Output:

GENERATE DECISION TREE


======================
Created root node 1
Added node 2 onto "yes" branch of node 1
Added node 3 onto "no" branch of node 1
Added node 4 onto "yes" branch of node 2
Added node 5 onto "no" branch of node 2
Added node 6 onto "yes" branch of node 3
Added node 7 onto "no" branch of node 3
OUTPUT DECISION TREE
====================
[1] nodeID = 1, question/answer = Does animal eat meat?
[1.1] nodeID = 2, question/answer = Does animal have stripes?
[1.1.1] nodeID = 4, question/answer = Animal is a Tiger
[1.1.2] nodeID = 5, question/answer = Animal is a Leopard
[1.2] nodeID = 3, question/answer = Does animal have stripes?
[1.2.1] nodeID = 6, question/answer = Animal is a Zebra
[1.2.2] nodeID = 7, question/answer = Animal is a Horse

QUERY DECISION TREE
===================
Does animal eat meat? (enter "Yes" or "No")
Yes
Does animal have stripes? (enter "Yes" or "No")
Yes
Animal is a Tiger
Exit? (enter "Yes" or "No")
No
QUERY DECISION TREE
===================
Does animal eat meat? (enter "Yes" or "No")
No
Does animal have stripes? (enter "Yes" or "No")
No
Animal is a Horse
Exit? (enter "Yes" or "No")
Yes

ii.)K-nearest Neighbour(KNN):

Imagine you have a blog which contains a lot of nice articles. You put ads at the top
of each article and hope to gain some revenue. After a while, from your
report you see that some posts generate revenue and some do not. Assume
that whether an article generates revenue or not depends on how many pictures
and text paragraphs are in it.

Given the dataset (the same data that appears in the ads.txt file below),

we can plot the articles in the figure below.

How does the K-Nearest Neighbors (KNN) algorithm work?
If we want to know whether the new article can generate revenue, we can 1)
compute the distances between the new article and each of the 6 existing
articles, 2) sort the distances in ascending order, 3) take the majority vote of the
classes of the k nearest neighbors. This is the basic idea of KNN.

Now let's predict whether a new article, which contains 13 pictures and 1 paragraph, can
make revenue or not. By visualizing this point in the figure, we can guess it will
make a profit. But we will do it in Java.
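
Before turning to Weka, here is a minimal hand-rolled sketch of those three steps for this example (the arrays and the choice of k = 3 are ours, purely for illustration):

import java.util.Arrays;
import java.util.Comparator;

public class SimpleKnn {
    public static void main(String[] args) {
        // pictures, paragraphs and profit flag for the 6 existing articles
        int[][] articles = {{10, 2}, {12, 3}, {9, 2}, {0, 10}, {1, 9}, {3, 11}};
        String[] profit  = {"Y", "Y", "Y", "N", "N", "N"};
        int[] newArticle = {13, 1};   // the article we want to classify
        int k = 3;

        // 1) compute the distance from the new article to every existing article
        Integer[] idx = {0, 1, 2, 3, 4, 5};
        double[] dist = new double[articles.length];
        for (int i = 0; i < articles.length; i++) {
            double dx = articles[i][0] - newArticle[0];
            double dy = articles[i][1] - newArticle[1];
            dist[i] = Math.sqrt(dx * dx + dy * dy);
        }

        // 2) sort the article indices by ascending distance
        Arrays.sort(idx, Comparator.comparingDouble(i -> dist[i]));

        // 3) majority vote among the k nearest neighbours
        int yes = 0;
        for (int i = 0; i < k; i++) if (profit[idx[i]].equals("Y")) yes++;
        System.out.println(yes > k / 2 ? "Y" : "N");   // prints Y for (13, 1)
    }
}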

Java Solution
kNN is also provided by Weka as a class "IBk". IBk implements kNN. It uses
normalized distances for all attributes so that attributes on different scales have
the same impact on the distance function. It may return more than k neighbors if
there are ties in the distance. Neighbors are voted to form the final classification.

First prepare your data by creating a txt file "ads.txt":

@relation ads
@attribute pictures numeric
@attribute paragraphs numeric
@attribute profit {Y,N}
@data
10,2,Y
12,3,Y
9,2,Y
0,10,N
1,9,N
3,11,N
10,2,Y
12,3,Y

Java source code

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import weka.classifiers.Classifier;
import weka.classifiers.lazy.IBk;
import weka.core.Instance;
import weka.core.Instances;

public class KNN {


public static BufferedReader readDataFile(String filename) {
BufferedReader inputReader = null;

try {
inputReader = new BufferedReader(new FileReader(filename));
} catch (FileNotFoundException ex) {
System.err.println("File not found: " + filename);
}

return inputReader;
}

public static void main(String[] args) throws Exception {
BufferedReader datafile = readDataFile("ads.txt");

Instances data = new Instances(datafile);


data.setClassIndex(data.numAttributes() - 1);

//hold out the first two instances; they will be classified later

Instance first = data.instance(0);
Instance second = data.instance(1);
//after deleting index 0, the old second instance sits at index 0
data.delete(0);
data.delete(0);

Classifier ibk = new IBk();


ibk.buildClassifier(data);

double class1 = ibk.classifyInstance(first);


double class2 = ibk.classifyInstance(second);

System.out.println("first: " + class1 + "\nsecond: " + class2);


}
}

Output:

First:0.0

Second:1.0

9. Implement the following clustering algorithms

i. K-means   ii. K-medoids

i) One-Dimensional

#include<stdio.h>
int mod(int k)
{
if(k>0) return k;
else return -k;
}
int small(int b[],int n)
{
int m,pos,r=0; m=b[0];
for(pos=0;pos<n;pos++)
{
if(m>b[pos]) { m=b[pos];
r=pos;
}
}
return r;
}
void main()
{
int n,j,s=0;
int x=0,y=0,z=0;
int obj[20],c[20][20],mean[20],a[20];
int i,nc,k,m,min,count;
printf("\n\n Enter no. of items");
scanf("%d",&n);
printf("\n Enter items");
for(i=0;i<n;i++)
scanf("%d",&obj[i]);
printf("\n Enter no of clusters");
scanf("%d",&nc);
for(i=0;i<nc;i++)
for(j=0;j<n;j++)
{
c[i][j]=0; a[i]=0; }
for(i=0;i<nc;i++)
{
c[i][0]=obj[i];
mean[i]=obj[i];
}
for(i=0;i<nc;i++)
for(j=0;j<n;j++)
if(c[i][j]>0)
printf(" I:%d",c[i][j]);
j=nc;
for(i=1;i<n;i++)
{
if(j<n)
{
for(k=0;k<nc;k++)
a[k]=mod(obj[j]-mean[k]);
min=small(a,nc);
c[min][i]=obj[j];
for(k=0;k<nc;k++)
{
s=0;count=0;
for(m=0;m<n;m++)
{
if(c[k][m]>0)
{
s=s+c[k][m];
count++;
}
}
mean[k]=s/count;
}
for(k=0;k<nc;k++)
printf("\n mean values..%d\t",mean[k]);
printf("\n");
j++;
}}
for(i=0;i<nc;i++)
{
printf("\n");
for(j=0;j<n;j++)
{
if(c[i][j]>0)
printf("%d\t",c[i][j]);
}}}
Output:

ii) Two-Dimensional

#include<stdio.h>

#include<math.h>

double distance(int a[][2],double b[][2],int j,int k)


{

double n=0,x1,y1,total;
int x,y;
x=a[j][0]-b[k][0];
y=a[j][1]-b[k][1];
x1=x*x;
y1=y*y;
total=x1+y1;
n=sqrt(total);
return n;
}
int small(double b[],int n)
{
int pos,r=0;double m=b[0];
for(pos=0;pos<n;pos++)
{
if(m>b[pos])
{
m=b[pos];
r=pos;
}
}
return r;
}
void main()
{
int n,j,s=0;
int x=0,y=0,z=0;
int x1,y1;
int obj[20][2],c[20][20][2];
double mean[20][2];
double a[20];
int i,nc,k,m,min,count;
printf("\n\n Enter no. of items");
scanf("%d",&n);
printf("\n Enter n items");
for(i=0;i<n;i++)
for(k=0;k<2;k++)
scanf("%d",&obj[i][k]);
printf("\n Enter no of clusters");
scanf("%d",&nc);
for(i=0;i<nc;i++)
for(j=0;j<n;j++)
{
for(k=0;k<2;k++)
c[i][j][k]=0; a[i]=0;
}
for(i=0;i<nc;i++)
{
j=0;
for(k=0;k<2;k++)
{
c[i][j][k]=obj[i][k];
mean[i][k]=obj[i][k];
}
}
for(i=0;i<nc;i++)
{
printf("\nI%d:",i);
for(j=0;j<n;j++)
for(k=0;k<2;k++)
if(c[i][j][k]>0)
printf("%d ",c[i][j][k]);
printf("\n");
}
for(i=0;i<nc;i++)
{
for(k=0;k<2;k++)
printf("\n mean values...%lf ",mean[i][k]);
printf("\n");
}
j=nc;
for(i=1;i<n;i++)
{
if(j<n)
{
for(k=0;k<nc;k++)
a[k]=distance(obj,mean,j,k);
min=small(a,nc);
c[min][i][0]=obj[j][0];
c[min][i][1]=obj[j][1];
for(m=0;m<nc;m++) /* recompute the mean of each cluster m */
{
x1=0;y1=0;count=0;
for(k=0;k<n;k++)
{
if(c[m][k][0]>0||c[m][k][1]>0)
{
x1=x1+c[m][k][0];
y1=y1+c[m][k][1];
count++;
}
}
if(count>0)
{
mean[m][0]=(double)x1/count;
mean[m][1]=(double)y1/count;
}
}
j++;
}
}
for(i=0;i<nc;i++)
{
for(j=0;j<n;j++)
for(k=0;k<2;k++)
printf("%d ",c[i][j][k]);
printf("\n");
}
printf("final kmean values are....\n");
for(i=0;i<nc;i++)
printf("%lf....%lf\n",mean[i][0],mean[i][1]);
}

Output:

ii) A program to implement the k-medoid algorithm

#include<stdio.h>
#include<math.h>
int distance(int [],int []);
int i,j,n,nc=3;
void main()
{
int j,count,t;
int obj[10][2],c[10][10][2],mean[10][2],c1[10][10][2];
int i,k,m,cost=0,cost1=0;
printf("\n enter the no. of items:");
scanf("%d",&n);
printf("\n enter the items(%d)",n);
for(i=0;i<n;i++)
for(j=0;j<2;j++)
scanf("%d",&obj[i][j]);
for(i=0;i<nc;i++)
for(j=0;j<n;j++)
for(k=0;k<2;k++)
{
c[i][j][k]=0;
c1[i][j][k]=0;
}
printf("\n enter center points");
for(i=0;i<nc;i++)
for(j=0;j<2;j++)
{
scanf("%d",&mean[i][j]);
c[i][0][j]=mean[i][j];
}
j=0;
for(i=1;i<=n;i++)
{
if(j<n)
{
if(distance(obj[j],mean[0])<distance(obj[j],mean[1]))
if(distance(obj[j],mean[0])<distance(obj[j],mean[2]))
for(k=0;k<2;k++)
{
c[0][i][k]=obj[j][k];
cost=cost+distance(obj[j],mean[0]);
}
if(distance(obj[j],mean[1])<distance(obj[j],mean[0]))
if(distance(obj[j],mean[1])<distance(obj[j],mean[2]))
for(k=0;k<2;k++)
{
c[1][i][k]=obj[j][k];
cost=cost+distance(obj[j],mean[1]);
}
if(distance(obj[j],mean[2])<distance(obj[j],mean[0]))
if(distance(obj[j],mean[2])<distance(obj[j],mean[1]))
for(k=0;k<2;k++)
{
c[2][i][k]=obj[j][k];
cost=cost+distance(obj[j],mean[2]);
}
j++;
}
}
printf("\n enter the next center points:");
for(i=0;i<nc;i++)
for(j=0;j<2;j++)
{
scanf("%d",&mean[i][j]);
c1[i][0][j]=mean[i][j];
}
j=0;
for(i=1;i<=n;i++)
{
if(j<n)
{
if(distance(obj[j],mean[0])<distance(obj[j],mean[1]))
if(distance(obj[j],mean[0])<distance(obj[j],mean[2]))
for(k=0;k<2;k++)
{
c1[0][i][k]=obj[j][k];
cost1=cost1+distance(obj[j],mean[0]);
}
if(distance(obj[j],mean[1])<distance(obj[j],mean[0]))
if(distance(obj[j],mean[1])<distance(obj[j],mean[2]))
for(k=0;k<2;k++)
{
c1[1][i][k]=obj[j][k];
cost1=cost1+distance(obj[j],mean[1]);
}
if(distance(obj[j],mean[2])<distance(obj[j],mean[0]))
if(distance(obj[j],mean[2])<distance(obj[j],mean[1]))
for(k=0;k<2;k++)
{
c1[2][i][k]=obj[j][k]; /* was c[2][i][k]: the second configuration must be stored in c1 */
cost1=cost1+distance(obj[j],mean[2]);
}
}
j++;
}
}
if(cost<cost1)
{
for(i=0;i<nc;i++)
{
printf("\n");
for(j=0;j<n;j++)
for(k=0;k<2;k++)
{
if(c[i][j][k]>0)
printf("%d\t",c[i][j][k]);
}
}
}
else
{
for(i=0;i<nc;i++)
{
printf("\n");
for(j=0;j<n;j++)
for(k=0;k<2;k++)
{
if(c1[i][j][k]>0)
printf("%d\t",c1[i][j][k]);
}
}
}
}
int distance(int obj[],int mean[])
{
int x1,x2,y1,y2,dist;
x1=obj[0];
x2=mean[0];
y1=obj[1];
y2=mean[1];
dist=(sqrt(pow((x1-x2),2)+pow((y1-y2),2)));
return dist;
}
Output:

10. A small case study involving all stages of KDD. (Datasets are available online,
e.g., in the UCI Repository.)

CASE STUDY - KDD PROCESS

DEFINITION: The KDD process is the process of using data mining methods
(algorithms) to extract (identify) what is deemed knowledge according to the
specifications of measures and thresholds, along with any required preprocessing,
subsampling, and transformation.

KDD:

In a multistep process many decisions are made by the user (domain expert):

• Iterative and interactive – loops between any two steps are possible
• Usually the most focus is on the DM step, but the other steps are of
  considerable importance for the successful application of KDD in practice

GOALS:

• Verification of the user's hypothesis (this goes against the EDA principle…)
• Autonomous discovery of new patterns and models
• Prediction of future behavior of some entities
• Description of interesting patterns and models

STEPS OF DM:

1. Domain understanding and goal setting
2. Creating a target data set
3. Data cleaning and pre-processing
4. Data reduction and projection
5. Data mining
   • Choosing the data mining task
   • Choosing the data mining algorithm(s)
   • Use of data mining algorithms
6. Interpretation of mined patterns
7. Utilization of discovered knowledge

Figure 10.1 KDD PROCESS model

1) Domain analysis
• Development of domain understanding
• Discovery of relevant prior knowledge
• Definition of the goal of the knowledge discovery

2) Data selection
• Selection and integration of the target data from possibly many different
  and heterogeneous sources
• Interesting data may exist, e.g., in relational databases, document
  collections, e-mails, photographs, video clips, process databases, customer
  transaction databases, web logs, etc.
• Focus on the correct subset of variables and data samples
  - E.g., customer behavior in a certain country, relationship between
    items purchased and customer income and age

3) Data cleaning and preprocessing
• Dirty data can confuse the mining procedures and lead to unreliable and
  invalid outputs
• Complex analysis and mining on a huge amount of data may take a very
  long time
• Preprocessing and cleaning should improve the quality of data and mining
  results by enhancing the actual mining process
• The actions to be taken include:
  - Removal of noise or outliers
  - Collecting necessary information to model or account for noise
  - Using prior domain knowledge to remove the inconsistencies
    and duplicates from the data
  - Choice or usage of strategies for handling missing data fields

4) Data reduction and projection
• Data transformation techniques
  - Smoothing (binning, clustering, regression, etc.)
  - Aggregation (use of summary operations, e.g., averaging, on data)
  - Generalization (primitive data objects can be replaced by higher-level
    concepts)
  - Normalization (min-max scaling, z-score; see the sketch after this list)
  - Feature construction from the existing attributes (PCA, MDS)
• Data reduction techniques are applied to produce a reduced representation
  of the data (smaller volume that closely maintains the integrity of the
  original data)
  - Aggregation
  - Dimension reduction (attribute subset selection, PCA, MDS, …)
  - Compression (e.g., wavelets, PCA, clustering, …)
  - Numerosity reduction
    * parametric models: regression and log-linear models
    * non-parametric models: histograms, clustering, sampling, …
  - Discretization (e.g., binning, histograms, cluster analysis, …)
  - Concept hierarchy generation (numeric value of "age" to a higher
    level concept "young, middle-aged, senior")
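
To make the two normalization schemes named in the list above concrete, here is a minimal sketch (the sample values, class name, and method names are ours, purely illustrative):

import java.util.Arrays;

public class NormalizationDemo {

    // Min-max scaling: maps each value into the range [0, 1]
    static double[] minMax(double[] x) {
        double min = Arrays.stream(x).min().getAsDouble();
        double max = Arrays.stream(x).max().getAsDouble();
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++) out[i] = (x[i] - min) / (max - min);
        return out;
    }

    // z-score normalization: zero mean and unit standard deviation
    static double[] zScore(double[] x) {
        double mean = Arrays.stream(x).average().getAsDouble();
        double var = 0;
        for (double v : x) var += (v - mean) * (v - mean);
        double sd = Math.sqrt(var / x.length);
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++) out[i] = (x[i] - mean) / sd;
        return out;
    }

    public static void main(String[] args) {
        double[] age = {45, 39, 35, 69, 58};   // example attribute values
        System.out.println(Arrays.toString(minMax(age)));
        System.out.println(Arrays.toString(zScore(age)));
    }
}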

5) Choice of data mining task
• Define the task for data mining
  - Exploration/summarization
    * Summarizing statistics (mean, median, mode, std, …)
    * Class/concept description
  - Explorative data analysis
    * Graphical techniques, low-dimensional plots, …
  - Predictive
    * Classification or regression
  - Descriptive
    * Cluster analysis, dependency modelling, change and
      outlier detection

6) Choosing the DM algorithm(s)
• Select the most appropriate methods to be used for the model and pattern
  search
• Match the chosen method with the overall goal of the KDD process (this
  necessitates communication between the end user and method specialists)
• Note that this step requires understanding in many fields, such as computer
  science, statistics, machine learning, optimization, etc.

7) Use of data mining algorithms
• Application of the chosen DM algorithms to the target data set
• Search for the patterns and models of interest in a particular
  representational form or a set of such representations
  - Classification rules or trees, regression models, clusters, mixture models, …
• Should be relatively automatic
• Generally DM involves:
  - Establishing the structural form (model/pattern) one is interested in
  - Estimating the parameters from the available data
  - Interpreting the fitted models

8) Interpretation/evaluation
• The mined patterns and models are interpreted
• The results should be presented in understandable form
• Visualization techniques are important for making the results useful –
  mathematical models or text-type descriptions may be difficult for domain
  experts
• Possibly return to any of the previous steps

11. Using COGNOS IMPROMPTU 7 to Generate Report

Start Impromptu: You can start Impromptu by double-clicking the Impromptu icon
on your desktop, or by clicking the Start button. You see the Welcome dialog box
when you start Impromptu.

Select a Catalog: To use Impromptu to create or open reports for your business,
you must select an existing catalog. Catalogs are usually created by an
administrator. You can open a different catalog at any time during your Impromptu
session, but you can only open one catalog at a time.

Open the Great Outdoors Sales Data Catalog (Great Outdoors Sales Data.cat). You
get this catalog when you do a typical installation of Impromptu.

Try This... To open the Great Outdoors Sales Data catalog

1. If you have just started Impromptu, click Close to close the Welcome dialog box.

2. If you do not have the Great Outdoors Sales Data catalog open, from the
Catalog menu, click Open to show the Open Catalog dialog box.

3. Locate and double-click the Great Outdoors Sales Data catalog.

4. If the Cognos Common Logon dialog box appears, click Cancel.

5. In the Catalog Logon dialog box, click OK to accept your catalog User Class and
open the catalog.

When working with the Great Outdoors Sales Data catalog, your user class is User
if you have the User version of Impromptu, and Creator if you have the
Administrator version of Impromptu. Tip: Check the message in the status line.
When it says "Sales data for The Great Outdoors Co.," this catalog is open. Note:
If a Catalog Upgrade dialog box appears, select Upgrade this catalog and click OK
to close the dialog box.

Open an Existing Report


In this example, you are the Sales Manager for a camping equipment company called the
Great Outdoors. You are completing your annual performance reviews for your
sales staff, and you need a report detailing all the sales made by each sales
representative. You’ll need to open a report you’ve already created called Sales
Totals for Representative that shows all the sales each representative made.

Open the Report

You can open an Impromptu report by

• using the Welcome dialog box when you start Impromptu

• clicking the Open button on the toolbar

• clicking the Open command from the File menu.

Try This... To open an existing report 1. From the File menu, click Open. If the
Reports folder isn’t open, double-click the Reports folder to open it.

2. Locate and double-click the SalesRep Sales Totals report. Impromptu prompts
you to select one or more sales representatives.

Do not click OK yet. Note: If a Report Upgrade dialog box appears, select Upgrade
this report and click OK to close the dialog box.

Respond to a Prompt

Your report may prompt you for information before retrieving the data. Your response to a
prompt determines what is included in the report. The prompt acts as a filter for
the data so that only the information you require appears in the report. One or
more prompt dialog boxes may appear when you open a report. Each prompt
dialog box further refines the data you will see in your report. You may be
prompted to select one or more values from a list, or you may be required to type
in a value. For example, this report requires you to select a sales representative
from a list. You can select one or more values from the Prompts dialog box. Try
This... To respond to a prompt

1. Click OK to accept Bill Gibbons and to open the report.

You can see the details of Bill Gibbons’ sales this year, including sales by customer
and maximum and minimum sales. You can use this report during your
performance review of Bill Gibbons.

2. From the Report menu, click Prompt to show the Prompts dialog box.

3. Click Bill Smertal, and Ctrl+click Charles Loo Nam, then click OK to show the
Sales Totals for Representative report for Bill Smertal and Charles Loo Nam.

Print Your Report

Impromptu lets you print your report. To print a report:

1. From the File menu, click Print.

2. In the Print dialog box, select the appropriate print settings, and then click OK to
send the report to the printer.

3. From the File menu, click Close to close the report.

Create a List Report Using the Report Wizard


Using Impromptu’s Report Wizard is an easy way to create simple reports. For
example, you buy several GO Sport Line products, and the policy is to sell the
products from that manufacturer at cost plus 50%. When you review the product
cost and product price for the GO Small Waist Pack, the product margin seems
low. You can make a report that lists the cost, price, and margin information for
the GO Sport Line products to check their margins. You can use this information
to help you decide whether to raise the prices on GO Sport Line products to keep
the margins in line with the policy.

Try This... To create a list report using the Report Wizard

1. Click the New button to show the Report Wizard. Note: Do not click New from
the File menu. This will open the New dialog box instead of the Report Wizard.

2. Type GO Product Margins, and click Next to show the list/crosstab choice page.
For more information on crosstab reports

3. Click List Report and then click Next to show the data item selection page

Select the Data

On the data item selection page you select data for your report.
Each data item is presented as a column in your report.

Try This... To select data for your list report

1. Double-click the Products folder to open it.

2. Double-click the Product Line data item to add it to the Report Columns box.

3. Double-click the Product data item to add it.

4. Double-click the Price and Cost folder to open it and then double-click these
data items: • Product Cost • Product Price • Product % Margin

5. Click the Next button to group the report data.

6. Click the check box beside Product Line. By grouping Product Line, Impromptu
sorts the information in the product line, removing any duplicate values. Note:
Ensure the Automatically Generate Totals check box is selected. When you select
Automatically Generate Totals, the Wizard adds the totals for the numeric
columns in the report to the overall list footer. If the report is grouped, the
Wizard also adds footers at each change in the value of the grouped data item
and inserts totals for the group in the group footers.

7. Click the Next button to show the filter page.

Filter the Data

On the filter page:
1. To create a filter to look at all products with margins less than or equal to 50%,
double-click Product % Margin in the Available Components box.

2. Double-click <=.
3. Double-click number, type 50 in the Enter Value dialog box, and then click OK.

4. Click Finish to retrieve the data and show the report.

You can now see all the information you need to compare the product margins,
and you can focus the report further to see only the margins on GO products.

5. From the File menu, click Save As.

6. Type GO Product Margins Tutorial in the File Name box and click Save
