Data Mining Lab Manual
i. Star Schema
ii. Snowflake Schema
iii. Fact Constellation
Thus a true star data model has two defining characteristics: it is always two levels
deep, and it always contains only one large table that is the focus of the model.
There are, of course, variations of this concept, such as the SNOWFLAKE and
FACT CONSTELLATION schemas.
Databases using star designs always get their data from somewhere
else. They are a form of reporting database. As such, star schemas are not
required to follow normalization rules as we are accustomed to. The presumption
is that feeding systems have already applied edits and constraints on the data so
the star data repository does not need to.
When you see your star design trying to accommodate other ideas and
other purposes that focus on something other than the fact table, you should re-
evaluate the direction you are heading in. Star designs are for analyzing the one
fact table central to the design of the model, and doing anything else with your
star data model reduces its effectiveness as an analytic data store.
1) The relational model shown here is five levels deep, whereas the star model
shown here is only two.
2) The relational model does not suggest by its design that any of the data it
models is special, whereas to the star model the fact table is the centre of the
universe.
3) The relational model carefully maps the relationships between tables, treating
relationships between so-called reference tables as just as important as all other
relationships. The star model, by contrast, relies on its load processes to load data
correctly based on the relationships in the data, but can then (and in this case did)
discard those relationships from its design; dealing with them after the data is
loaded would take our focus away from the fact data, and a star design wants all
eyes on the fact data.
4) The relational model is equally adept at answering questions about any of the
tables in its model, whereas the star model is about slicing and dicing the fact
table and little else matters. Indeed, the star model does very poorly at answering
questions about its dimensions because its focus is on the fact table.
Below are the table create statements for our fact table and its original relational table.
1) At the beginning of our fact table, we have the same basic table as we saw in
our relational model. SALE_FACT is one-to-one with SALE. One row in SALE_FACT
is one row in SALE.
2) We have tossed out SALE.ITEM_USE_ID because the information given by this
table has no value in our star design. Instead we flattened the relationships all the
way up the relational foreign key chain in our relational model with the ultimate
result being that keys in all our reference tables become foreign keys in our fact
table. Subsequently we created dimensions in our star model for the data pointed
to by each of these foreign keys. The result is that the relationships between
dimension tables (which, roughly speaking, are our original relational-model
reference tables) are lost in favour of directly representing these relationships on
our fact table.
3) We have added new data in our star design that did not exist in our relational
design. More specifically we created a TIME DIMENSION which represents time in
our system. Think of it as all the different interesting ways to represent a date. We
also took the salary on the emp table and created a bucketing scheme which we
then referred to in our fact table (a small illustrative sketch of such a bucketing
scheme appears after this list). We did the same for item price as well. The result
is our EMP_SALARY_RANGE_DIM and ITEM_PRICE_RANGE_DIM. This new data
would be accounted for in our load process when our fact table is loaded.
4) We also placed salary from our original emp table and item price from our
original item table as NOT AGGREGATABLE (or not summable) metrics on our fact
table. Consider for example that if you sum qty_sold from sale_fact for a specific
employee, you get the total quantity of items sold for that employee. This is
because qty_sold is summable for our fact table. But if you take the sum of
emp_salary from sale_fact for a specific employee you do not get the total salary
of the employee; you get the employee's salary times however many rows were
selected (well more or less assuming the employee's salary does not change over
time), a rather meaningless number. You cannot sum emp_salary off the fact
table, nor can you sum item_price. This is why they have the _NA suffixes on
them.
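As referenced in point 3 above, the load process turns raw salary values into range buckets. The following is only an illustrative sketch of that idea; the bucket boundaries are invented and are not taken from the manual's EMP_SALARY_RANGE_DIM.

// Illustrative only: maps a raw salary onto a salary-range label of the kind a
// load process might look up in EMP_SALARY_RANGE_DIM. Boundaries are made up.
public class SalaryBucketing {
    static String salaryRange(double salary) {
        if (salary < 1000) return "0 - 999";
        if (salary < 2000) return "1000 - 1999";
        if (salary < 3000) return "2000 - 2999";
        return "3000 and above";
    }
    public static void main(String[] args) {
        System.out.println(salaryRange(1250.0)); // prints "1000 - 1999"
    }
}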
Relational Model
Star Model
create table dept_dim
The phases of a data warehouse project listed below are similar to those of
most database projects, starting with identifying requirements and ending with
executing the T-SQL script to create the data warehouse:
3. Execute T-SQL queries to create and populate dimension and fact tables
We need to interview the key decision makers to learn: What factors define
success in the business? How does management want to analyze their data?
What are the most important business questions that need to be answered by
this new system?
Let us identify the dimensions related to the above case study: item, branch,
location and time.
Fact Table: Data in the fact table are called measures (or dependent
attributes). The fact table provides statistics for sales broken down by the item,
branch, location and time dimensions. A fact table usually contains the historical
transactional entries of your live system; it is mainly made up of foreign key
columns that reference the various dimensions, plus numeric measure values on
which aggregation will be performed.
Let us identify what attributes should be there in our Fact Sales Table.
E.g. Star Schema, Snow Flake Schema, Star Flake Schema, Distributed Star
Schema, etc.
Let us create Our First Star Schema, please refer to the below figure:
Let us execute our T-SQL script step by step to create the tables and populate
them with appropriate test values. Follow the given steps to run the query in
SSMS (SQL Server Management Studio 2012).
4. Copy and paste the scripts given below in the various steps into a new query
editor window, one by one.
Step 1 : Create the Sales_DW database.
Create Database Sales_DW
Go
Use Sales_DW
Go
Step 2 : Create time dimension table in Data Warehouse which will hold time
details.
Step 6 : Create Fact table to hold all your transactional entries of sales with
appropriate foreign key columns which refer to primary key column of
your dimensions; you have to take care while populating your fact table
to refer to primary key values of appropriate dimensions.
-- Add relation between fact table foreign keys to Primary keys of Dimensions
ALTER TABLE FactSales ADD CONSTRAINT
FK_time_key FOREIGN KEY (time_key) REFERENCES DimTime(time_key);
Populate your fact table with the historical transaction values of sales, using the
proper dimension key values. After executing the above T-SQL script, your
sample data warehouse for sales will be ready, and you can then create an OLAP
cube on the basis of this data warehouse.
The single dimension table for Item in the star schema is normalised in the
snowflake schema, resulting in new Item and Supplier tables. The Item dimension
table now contains the attributes item_key, item_name, brand, type and
supplier_key, where supplier_key is linked to the Supplier dimension table
containing supplier_key and supplier_type information as shown below.
The below schema specifies two fact tables: sales and shipping. The sales fact
table is identical to that of the star schema.
The dimension tables for time, item and location are shared between the sales
and shipping fact tables as shown below.
1. Add
2. Click on the Explorer button to get the Weka Knowledge Explorer window.
5. Then click on the area to the right of the Choose button. You get the following:
2. Remove
4. Click on the area to the right of the Choose button. You get the following:
Try other parameters for the filter and see how the remove changes. Don’t
forget to reload the original (numeric) relation or Undo the remove before
applying another one.
2. Click on the Explorer button and you get the Weka Knowledge Explorer window.
3. Click on the “Open File.” button and open an ARFF file (try it first with an
example supplied in Weka-3-6/data, e.g. diabetes.arff). You get the
following:
6. Set minThreshold to 0.1E-8 (close to zero), which is the minimum value allowed
for the attribute.
7. Set minDefault to NaN, which is unknown and will replace values below the
threshold.
Click “mass” in the “attributes” pane and review the details of the “selected
attribute”. Notice that the 11 attribute values that were formerly set to 0 are now
marked as Missing.
You could just as easily mark them with a specific numerical value. You could also
mark values missing between an upper and lower range of values.
Next, let’s look at how we can remove instances with missing values from our
dataset.
A simple way to handle missing data is to remove those instances that have
one or more missing values.
Continuing on from the above to mark missing values, you can remove
missing values as follows:
1. Click the “Choose” button for the Filter and select RemoveWithValues; it is
under unsupervised.instance.RemoveWithValues.
5. Click the “OK” button to use the configuration for the filter.
Notice that the 11 attribute values that were marked Missing have been removed
from the dataset.
Continuing on from the first recipe above to mark missing values, you can
impute the missing values as follows:
1. Click the “Choose” button for the Filter and select ReplaceMissingValues; it is
under unsupervised.attribute.ReplaceMissingValues.
Click “mass” in the “attributes” section and review the details of the “selected
attribute”.
Try other parameters for the filter and see how the replaced values change.
Don’t forget to reload the original (numeric) relation or Undo the replacement
before applying another one. A programmatic sketch of the same filter follows.
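This is only a minimal sketch using Weka's Java API; the file name diabetes.arff is an assumption, so adjust the path to your environment.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class ImputeMissing {
    public static void main(String[] args) throws Exception {
        // Load the dataset; the file name is an assumption for this sketch
        Instances data = DataSource.read("diabetes.arff");
        // Replace each missing value with the mean (numeric) or mode (nominal) of its attribute
        ReplaceMissingValues filter = new ReplaceMissingValues();
        filter.setInputFormat(data);
        Instances imputed = Filter.useFilter(data, filter);
        System.out.println("Instances after imputation: " + imputed.numInstances());
    }
}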
3. Click on the “Open File.” button and open an ARFF file (try it first with an
example supplied in Weka-3-6/data, e.g. weather.arff). You get the
following:
Then click on the area to the right of the Choose button. You get the following:
You see here the default parameters of this filter. Click on more to get more
information about these parameters.
Try other parameters for the filter and see how the standardize changes.
Don’t forget to reload the original (numeric) relation or Undo the standardize
before applying another one.
6. Click on the Apply button to do the discretization. Then select one of the
original numeric attributes (e.g. temperature) and see how it is discretized in
the Selected attribute window.
Try other parameters for the filter and see how the discretization changes.
Don’t forget to reload the original (numeric) relation or Undo the discretization
before applying another one.
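The discretization can also be scripted through Weka's Java API instead of the Explorer. A minimal sketch, assuming weather.arff is available locally and that three equal-width bins are an acceptable choice:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class DiscretizeNumeric {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff");   // file name is an assumption
        Discretize filter = new Discretize();
        filter.setBins(3);              // three equal-width bins (illustrative choice)
        filter.setInputFormat(data);
        Instances discretized = Filter.useFilter(data, filter);
        // Print the now-nominal temperature attribute to see its bins
        System.out.println(discretized.attribute("temperature"));
    }
}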
SSIS includes graphical tools and wizards for building and debugging
packages; tasks for performing workflow functions such as FTP operations,
executing SQL statements, and sending e-mail messages; data sources and
destinations for extracting and loading data; transformations for cleaning,
aggregating, merging, and copying data; a management service, the Integration
Services service for administering package execution and storage; and application
programming interfaces (APIs) for programming the Integration Services object
model.
The package that you create takes data from a flat file, reformats the data,
and then inserts the reformatted data into a fact table. SSIS Designer is used to
create a simple ETL package that includes looping, configurations, error flow logic
and logging.
For example, let's say you have a Customer table with columns CustomerID,
CustomerName, CustomerAddress and CustomerCityID, where CustomerID is the
primary key and CustomerCityID is a foreign key to the City table. And let's say
you have sample source data in this format: "1001 as CustomerID", "Shaam as
CustomerName", "R-no 202 - mulund naka as CustomerAddress" and "Mumbai as
CustomerCity". In the destination Customer table, CustomerCityID is a foreign key
[an integer value], while the source file holds a string value, so for a proper insert
we need the corresponding foreign key value. To get this foreign key value we use
the #LookUp component, which compares source records with the City master
table to find the matching key values, which can then be written to the Customer
table.
When we load data from a source file that contains customer records with a
country name, we apply the LookUp component before the data reaches the
destination table (the Customer table). The LookUp compares the source records
with the existing Country table and separates the matching rows from the
non-matching ones. For matching rows, the country name is replaced with the
matching key value and loaded into the destination table.
Step 1
In this step we will go to SQL Management Studio and create the country master
table with columns (CountryID, CountryName). After that we will add some
country names to this table.
Step 2
Here we will create the CustomerMaster table in SQL Management Studio with
the columns CustomerID, CustomerName, CustomerAmount, CustomerAddress,
CustomerCountryID and CustomerISActive, as shown in the image below.
Step 3
Let's create our source file. For this example we will use a flat file source and
add some dummy data, as shown in the image below.
Step 4
Open up the MSBI studio and create an SSIS project. Once done, just drag and drop
a Data Flow Task from the toolbox and double-click on it.
Step 5
Since our source file is a flat file, we will use the Flat File Source component; if you
want, you can use other sources such as Excel and so on.
For now, just drag and drop the Flat File Source component from the SSIS toolbox
and configure it.
The most important step: here we drag and drop the SSIS #LookUp component and
attach it to the Flat File Source component, as shown in the image below.
This setting controls what to do in case rows are not matched for some reason.
Here we will say redirect rows to no match output, meaning that if rows are not
matched for some reason they are sent out via the no match output. As we
discussed earlier, #LookUp has two outputs, Matched Output and No Matched
Output, so we will send unmatched rows via the No Matched Output. So in the
drop-down choose "Redirect rows to no match output". This will also help us to
identify errors that occur during runtime.
Keep cache mode to Full Cache and Connection mode to OLEDB connection.
Image representation is given below.
Select the Connection menu -> choose your SQL connection name -> select
CountryTable, as shown in the image below.
Step 7
@relation marketbasketanalysis
@attribute transaction_id{100,200,300,400}
@attribute item1{0,1}
@attribute item2{0,1}
@attribute item3{0,1}
@attribute item4{0,1}
@attribute item5{0,1}
@data
100,1,1,1,0,0
200,1,1,0,1,1
300,1,0,1,1,0
400,1,0,1,0,0
OPTIONS
minMetric -- Minimum metric score. Consider only rules with scores higher than
this value.
classIndex -- Index of the class attribute. If set to -1, the last attribute is taken as
class attribute.
car -- If enabled class association rules are mined instead of (general) association
rules.
delta -- Iteratively decrease support by this factor. Reduces support until min
support is reached or required number of rules has been generated.
metricType -- Set the type of metric by which to rank rules. Confidence is the
proportion of the examples covered by the premise that are also covered by the
consequence (Class association rules can only be mined using confidence). Lift is
confidence divided by the proportion of all examples that are covered by the
consequence. This is a measure of the importance of the association that is
independent of support. Leverage is the proportion of additional examples
covered by both the premise and consequence above those expected if the
premise and consequence were independent of each other. The total number of
examples that this represents is presented in brackets following the leverage.
Conviction is another measure of departure from independence. Conviction is
given by P(premise)P(!consequence) / P(premise, !consequence).
upperBoundMinSupport -- Upper bound for minimum support. Start iteratively
decreasing minimum support from this value.
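For reference, the rule metrics described above can be restated compactly in standard notation (this is a restatement, not Weka output), where X is the premise, Y the consequence, and P a probability estimated from the transaction data:

\[
\begin{aligned}
\text{confidence}(X \Rightarrow Y) &= P(Y \mid X)\\
\text{lift}(X \Rightarrow Y) &= \frac{P(Y \mid X)}{P(Y)}\\
\text{leverage}(X \Rightarrow Y) &= P(X \wedge Y) - P(X)\,P(Y)\\
\text{conviction}(X \Rightarrow Y) &= \frac{P(X)\,P(\neg Y)}{P(X \wedge \neg Y)}
\end{aligned}
\]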
8. Then click on the Start button; you will now get the generated Apriori rules.
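The same association rules can also be generated outside the Explorer through Weka's Java API. A minimal sketch, assuming the market-basket ARFF shown above has been saved as marketbasket.arff (the file name and parameter values are assumptions):

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriRules {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("marketbasket.arff");
        Apriori apriori = new Apriori();
        apriori.setNumRules(10);               // number of rules to report
        apriori.setLowerBoundMinSupport(0.5);  // minimum support (assumed value)
        apriori.setMinMetric(0.9);             // minimum confidence (assumed value)
        apriori.buildAssociations(data);
        System.out.println(apriori);           // prints the generated rules
    }
}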
C/C++ VERSION
/* Output
Enter items from purchase 1:1
5
2
0
0
Enter items from purchase 2:2
3
4
1
0
Enter items from purchase 3:3
4
0
0
0
Enter items from purchase 4:2
1
3
0
0
Enter items from purchase 5:1
2
3
0
0
Enter minimum acceptance level3
Initial Input:
Trasaction Items
1: 1 5 2 0 0
2: 2 3 4 1 0
3: 3 4 0 0 0
4: 2 1 3 0 0
5: 1 2 3 0 0
Assume minimum support: 3
Generating C1 from data
1: 4
2: 4
3: 4
4: 2
5: 1
Generating L1 From C1
1: 4
2: 4
3: 4
Generating L2
1 2: 4
1 3: 3
2 3: 3
Generating L3
1 2 3: 3
*/
JAVA VERSION
import java.util.*;
import java.sql.*;
class Tuple {
Set<Integer> itemset;
int support;
Tuple() {
itemset = new HashSet<>();
support = -1;
}
Tuple(Set<Integer> s) {
itemset = s;
support = -1;
}
Tuple(Set<Integer> s, int i) {
itemset = s;
support = i;
}
}
class Apriori {
static Set<Tuple> c;
static Set<Tuple> l;
static int d[][];
static int min_support;
prune();
generateFrequentItemsets();
}
/*
OUTPUT:
Enter the minimum support :2
Transaction Number: 1:
Item number 1 = 1
Item number 2 = 3
Item number 3 = 4
Transaction Number: 2:
Item number 1 = 2
Item number 2 = 3
Item number 3 = 5
Transaction Number: 3:
Item number 1 = 1
Item number 2 = 2
Item number 3 = 3
Item number 4 = 5
Transaction Number: 4:
Item number 1 = 2
Item number 2 = 5
-+- L -+-
[1] : 2
[3] : 3
[2] : 3
[5] : 3
-+- L -+-
[2, 3] : 2
[3, 5] : 2
[1, 3] : 2
[2, 5] : 3
-+- L -+-
[2, 3, 5] : 2
*/
This experiment illustrates the use of C4.5 (J48) classifier in WEKA. The
sample data set used, unless otherwise indicated, is the bank data available in
comma-separated format (bank-data.csv).
As usual, we begin by loading the data into WEKA, as seen in below Figure :
Next, we select the "Classify" tab and click the "Choose" button to select
the J48 classifier, as depicted in the figures. Note that J48 is the WEKA
implementation of the C4.5 algorithm.
Under the "Test options" in the main panel we select 10-fold cross-
validation as our evaluation approach. Since we do not have a separate evaluation
data set, this is necessary to get a reasonable idea of the accuracy of the generated
model. We now click "Start" to generate the model.
We can view this information in a separate window by right clicking the last
result set (inside the "Result list" panel on the left) and selecting "View in separate
window" from the pop-up menu. These steps and the resulting window containing
the classification results are depicted in the below Figures .
Note that the classification accuracy of our model is only about 69%. This
may indicate that we need to do more work (either in preprocessing or in
selecting the correct parameters for classification) before building another model.
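The same J48 model and 10-fold cross-validation can be reproduced from Java code. This is only a sketch: the file name bank-data.arff is an assumption (the CSV can be converted or loaded directly), and the class attribute is assumed to be the last one.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48CrossValidation {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("bank-data.arff"); // file name is an assumption
        data.setClassIndex(data.numAttributes() - 1);        // class assumed to be last attribute
        J48 tree = new J48();                                 // C4.5 decision tree learner
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1)); // 10-fold cross-validation
        System.out.println(eval.toSummaryString());
    }
}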
Weka's IBk implementation has a “cross-validation” option that can help
by choosing the best value automatically: Weka uses cross-validation to select the
best value for KNN (which is the same as k).
-K num
Set the number of nearest neighbors to use in prediction (default 1)
-W num
Set a fixed window size for incremental train/testing. As new training instances are
added, oldest instances are removed to maintain the number of training instances
at this size. (default no window)
-D
Neighbors will be weighted by the inverse of their distance when voting. (default
equal weighting)
-F
Neighbors will be weighted by their similarity when voting. (default equal
weighting)
-X
Selects the number of neighbors to use by hold-one-out cross validation, with an
upper limit given by the -K option.
-S
When k is selected by cross-validation for numeric class attributes, minimize
mean-squared error. (default mean absolute error)
NAME : weka.classifiers.lazy.IBk : K-nearest neighbours classifier. Can select
appropriate value of K based on cross-validation. Can also do distance weighting.
OPTIONS
debug -- If set to true, classifier may output additional info to the console.
meanSquared -- Whether the mean squared error is used rather than mean
absolute error when doing cross-validation for regression problems.
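The command-line options above map directly onto setter methods of the IBk class. A small sketch that configures IBk so that k is chosen by cross-validation and then prints the equivalent option string:

import weka.classifiers.lazy.IBk;

public class IBkOptionsDemo {
    public static void main(String[] args) {
        IBk ibk = new IBk();
        ibk.setKNN(10);              // -K: upper bound on the number of neighbours
        ibk.setCrossValidate(true);  // -X: pick k by hold-one-out cross-validation
        ibk.setWindowSize(0);        // -W: 0 means no window (keep all training instances)
        ibk.setMeanSquared(false);   // -S default: minimise mean absolute error
        // Print the equivalent command-line options
        System.out.println(String.join(" ", ibk.getOptions()));
    }
}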
So, at this point, this description should sound similar to both regression
and classification. How is this different from those two? Well, first off, remember
that regression can only be used for numerical outputs. That differentiates it from
Nearest Neighbor immediately.
You'll find that Nearest Neighbor fixes all those problems in a very efficient
manner, especially in the example used above for Amazon. It's not limited to any
number of comparisons. It's as scalable for a 20-customer database as it is for a 20
million-customer database, and you can define the number of results you want to
find. Seems like a great technique! It really is — and probably will be the most
useful for anyone reading this who has an e-commerce store.
Math behind Nearest Neighbor : You will see that the math behind the
Nearest Neighbor technique is a lot like the math involved with the clustering
technique. Taking the unknown data point, the distance between the unknown
data point and every known data point needs to be computed. Finding the
distance is really quite trivial with a spreadsheet, and a high-powered computer
can zip through these calculations nearly instantly. The easiest and most common
distance calculation is the "Normalized Euclidean Distance." It sounds much more
complicated than it really is.
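A minimal sketch of that calculation, assuming every attribute has already been scaled into the range [0, 1] (that scaling is what makes the Euclidean distance "normalized"); the customer vectors below are made-up values:

// Sketch of a normalized Euclidean distance between two customers whose
// attributes are assumed to be pre-scaled to [0, 1] (e.g. by min-max normalization).
public class NormalizedDistance {
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];   // attributes already normalized to [0, 1]
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] customer1 = {0.2, 0.8, 1.0};  // hypothetical normalized attribute values
        double[] customer5 = {0.3, 0.7, 1.0};
        System.out.println("Distance = " + distance(customer1, customer5));
    }
}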
Let's take a look at an example in action and try to figure out what
Customer No. 5 is likely to purchase.
To answer the question "What is Customer No. 5 most likely to buy?" based
on the Nearest Neighbor algorithm we ran through above, the answer would be a
book. This is because the distance between Customer No. 5 and Customer No. 1 is
less (far less, actually) than the distance between Customer No. 5 and any other
customer. Based on this model, we say that the customer most like Customer No.
5 can predict the behavior of Customer No. 5.
However, the positives of Nearest Neighbor don't end there. The Nearest
Neighbor algorithm can be expanded beyond the closest match to include any
number of closest matches. These are termed "N-Nearest Neighbors" (for
example, 3-Nearest Neighbors).
Using the above example, if we want to know the two most likely products
to be purchased by Customer No. 5, we would conclude that they are books and a
DVD. Using the Amazon example from above, if they wanted to know the 12
products most likely to be purchased by a customer, they would want to run a 12-
Nearest Neighbor algorithm (though Amazon actually runs something more
complicated than just a simple 12-Nearest Neighbor algorithm).
The final question to consider is "How many neighbors should we use in our
model?". You'll find that experimentation will be needed to determine the best
number of neighbors to use. Also, if you are trying to predict the output of a
column with a 0 or 1 value, you'd obviously want to select an odd number of
neighbors, in order to break ties.
Data set for WEKA : The data set we'll use is our fictional BMW dealership
and the promotional campaign to sell a two-year extended warranty to past
customers. There are 4,500 data points from past sales of extended warranties.
The attributes in the data set are Income Bracket [0=$0-$30k, 1=$31k-$40k,
2=$41k-$60k, 3=$61k-$75k, 4=$76k-$100k, 5=$101k-$150k, 6=$151k-$500k,
7=$501k+], the year/month their first BMW was bought, the year/month the most
recent BMW was bought, and whether they responded to the extended warranty
offer in the past.
4,200210,200601,0
5,200301,200601,1
...
Nearest Neighbor in WEKA : Load the data file bmw-training.arff into WEKA
using the same steps we've used to this point in the Preprocess tab. Your screen
should look like below Figure after loading in the data.
Figure 7.10 BMW Nearest Neighbor algorithm
At this point, we are ready to create our model in WEKA. Ensure that Use
training set is selected so we use the data set we just loaded to create our model.
Click Start and let WEKA run. Below Figure shows a screenshot, and Listing
contains the output from this model.
Figure 7.11 BMW Nearest Neighbor model
=== Evaluation on training set ===
=== Summary ===

Correctly Classified Instances        2663              88.7667 %
Incorrectly Classified Instances       337              11.2333 %
Kappa statistic                          0.7748
Mean absolute error                      0.1326
Root mean squared error                  0.2573
Relative absolute error                 26.522  %
Root relative squared error             51.462  %
Total Number of Instances             3000

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.95     0.177    0.847      0.95    0.896      0.972     1
               0.823    0.05     0.941      0.823   0.878      0.972     0
Weighted Avg.  0.888    0.114    0.893      0.888   0.887      0.972

=== Confusion Matrix ===

   a    b   <-- classified as
1449   76 |   a = 1
 261 1214 |   b = 0
How does this compare with our results when we used classification to
create a model? Well, this model using Nearest Neighbor has an 89-percent
accuracy rating, while the previous model only had a 59-percent accuracy rating,
so that's definitely a good start. Nearly a 90-percent accuracy rating would be very
acceptable. Let's take this a step further and interpret the results in terms of false
positives and false negatives, so you can see how the results from WEKA apply in a
real business sense.
The results of the model say we have 76 false positives (2.5 percent), and
we have 261 false negatives (8.7 percent). Remember a false positive, in this
example, means that our model predicted the customer would buy an extended
warranty and actually didn't, and a false negative means that our model predicted
they wouldn't buy an extended warranty, and they actually did.
Let's estimate that the flier the dealership sends out costs $3 each and that
the extended warranty brings in $400 profit for the dealer. This model, from a
cost/benefit perspective to the dealership, would be $400 - (2.5% * $3) - (8.7% *
$400) = $365. So, the model looks rather profitable for the dealership. Compare
that to the classification model, which had a cost/benefit of only $400 - (17.2% *
$3) - (23.7% * $400) = $304, and you can see that using the right model offered a
20-percent increase in potential revenue for the dealership.
As an exercise for yourself, play with the number of nearest neighbors in the
model (you do this by right-clicking on the text "IBk -K 1...." and you see a list of
parameters). You can change the "KNN" (K-nearest neighbors) to be anything you
want. You'll see in this example, that the accuracy of the model actually decreases
with the inclusion of additional neighbors.
Some final take-aways from this model: The power of Nearest Neighbor
becomes obvious when we talk about data sets like Amazon. With its 20 million
users, the algorithm is very accurate, since there are likely many potential
customers in Amazon's database with similar buying habits to you.
Thus, the nearest neighbor to yourself is likely very similar. This creates an
accurate and effective model. Contrarily, the model breaks down quickly and
becomes inaccurate when you have few data points for comparison. In the early
stages of an online e-commerce store for example, when there are only 50
customers, a product recommendation feature will likely not be accurate at all, as
the nearest neighbor may in fact be very distant from yourself.
The final challenge with the Nearest Neighbor technique is that it has the
potential to be a computing-expensive algorithm. In Amazon's case, with 20
million customers, each customer must be calculated against the other 20 million
customers to find the nearest neighbors.
First, if your business has 20 million customers, that's not technically a
problem because you're likely rolling in money. Second, these types of
computations are ideal for the cloud in that they can be offloaded to dozens of
computers to be run simultaneously, with a final comparison done at the end.
(Google's MapReduce for example.)
i.) Decision Tree Induction
import java.io.*;
class DecisionTree {
/* ------------------------------- */
/* */
/* FIELDS */
/* */
/* ------------------------------- */
/* NESTED CLASS */
/* FIELDS */
/* CONSTRUCTOR */
nodeID = newNodeID;
questOrAns = newQuestAns;
/* OTHER FIELDS */
BufferedReader(new InputStreamReader(System.in));
/* ------------------------------------ */
/* */
/* CONSTRUCTORS */
/* */
/* ------------------------------------ */
/* Default Constructor */
public DecisionTree() {
/* ----------------------------------------------- */
/* */
/* */
/* ----------------------------------------------- */
if (rootNode == null) {
return;
// Search tree
if
(searchTreeAndAddYesNode(rootNode,existingNodeID,newNodeID,newQuestAns)
){
// Found node
BinTree(newNodeID,newQuestAns);
else {
existingNodeID);
return(true);
else {
if (currentNode.yesBranch != null) {
if (searchTreeAndAddYesNode(currentNode.yesBranch,
existingNodeID,newNodeID,newQuestAns)) {
return(true);
else {
return(searchTreeAndAddYesNode(currentNode.noBranch,
existingNodeID,newNodeID,newQuestAns));
/* ADD NO NODE */
if (rootNode == null) {
return;
// Search tree
if
(searchTreeAndAddNoNode(rootNode,existingNodeID,newNodeID,newQuestAns)
){
System.out.println("Added node " + newNodeID +
if (currentNode.nodeID == existingNodeID) {
// Found node
BinTree(newNodeID,newQuestAns);
else {
existingNodeID);
return(true);
else {
// Try yes branch if it exists
if (currentNode.yesBranch != null) {
if (searchTreeAndAddNoNode(currentNode.yesBranch,
existingNodeID,newNodeID,newQuestAns)) {
return(true);
else {
if (currentNode.noBranch != null) {
return(searchTreeAndAddNoNode(currentNode.noBranch,
existingNodeID,newNodeID,newQuestAns));
/* --------------------------------------------- */
/* */
/* */
/* --------------------------------------------- */
queryBinTree(rootNode);
if (currentNode.yesBranch==null) {
if (currentNode.noBranch==null)
System.out.println(currentNode.questOrAns);
return;
if (currentNode.noBranch==null) {
return;
// Question
askQuestion(currentNode);
System.out.println(currentNode.questOrAns + " (enter \"Yes\" or \"No\")");
if (answer.equals("Yes")) queryBinTree(currentNode.yesBranch);
else {
if (answer.equals("No")) queryBinTree(currentNode.noBranch);
else {
askQuestion(currentNode);
/* ----------------------------------------------- */
/* */
/* */
/* ----------------------------------------------- */
outputBinTree("1",rootNode);
// Output
outputBinTree(tag + ".1",currentNode.yesBranch);
// Go down no branch
outputBinTree(tag + ".2",currentNode.noBranch);
class DecisionTreeApp {
/* ------------------------------- */
/* */
/* FIELDS */
/* */
/* ------------------------------- */
BufferedReader(new InputStreamReader(System.in));
/* --------------------------------- */
/* */
/* METHODS */
/* */
/* --------------------------------- */
/* MAIN */
// Generate tree
generateTree();
// Output tree
System.out.println("====================");
newTree.outputBinTree();
// Query tree
queryTree();
/* GENERATE TREE */
System.out.println("======================");
newTree.addNoNode(2,5,"Animal is a Leopard");
newTree.addYesNode(3,6,"Animal is a Zebra");
newTree.addNoNode(3,7,"Animal is a Horse");
/* QUERY TREE */
System.out.println("===================");
newTree.queryBinTree();
// Option to exit
optionToExit();
if (answer.equals("Yes")) return;
else {
if (answer.equals("No")) queryTree();
else {
Output:
QUERY DECISION TREE
===================
Does animal eat meat? (enter "Yes" or "No")
Yes
Does animal have stripes? (enter "Yes" or "No")
Yes
Animal is a Tiger
Exit? (enter "Yes" or "No")
No
QUERY DECISION TREE
===================
Does animal eat meat? (enter "Yes" or "No")
No
Does animal have stripes? (enter "Yes" or "No")
No
Animal is a Horse
Exit? (enter "Yes" or "No")
Yes
ii.) K-nearest Neighbour (KNN):
Imagine you have a blog which contains a lot of nice articles. You put ads at the top
of each article and hope to gain some revenue. After a while, from your
reports, you see that some posts generate revenue and some do not. Assume
that whether an article generates revenue or not depends on how many pictures
and text paragraphs it contains.
How does the K-Nearest Neighbors (KNN) algorithm work?
If we want to know whether the new article can generate revenue, we can 1)
compute the distances between the new article and each of the 6 existing
articles, 2) sort the distances in ascending order, and 3) take the majority vote
among the k nearest articles. This is the basic idea of KNN.
Now let's guess whether a new article, which contains 13 pictures and 1 paragraph,
can make revenue or not. By visualizing this point in the figure, we can guess it will
make a profit. But we will do it in Java.
Java Solution
kNN is also provided by Weka as a class "IBk". IBk implements kNN. It uses
normalized distances for all attributes so that attributes on different scales have
the same impact on the distance function. It may return more than k neighbors if
there are ties in the distance. Neighbors are voted to form the final classification.
@relation ads
@attribute pictures numeric
@attribute paragraphs numeric
@attribute profit {Y,N}
@data
10,2,Y
12,3,Y
9,2,Y
0,10,N
1,9,N
3,11,N
10,2,Y
12,3,Y
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import weka.classifiers.Classifier;
import weka.classifiers.lazy.IBk;
import weka.core.Instance;
import weka.core.Instances;

public class KnnClassifier {
    public static BufferedReader readDataFile(String filename) {
        BufferedReader inputReader = null;
        try {
            inputReader = new BufferedReader(new FileReader(filename));
        } catch (FileNotFoundException ex) {
            System.err.println("File not found: " + filename);
        }
        return inputReader;
    }

    public static void main(String[] args) throws Exception {
        BufferedReader datafile = readDataFile("ads.txt");
        // Reconstructed sketch (the manual shows only fragments of main):
        // hold out the first two rows, train IBk on the rest, then classify them.
        Instances data = new Instances(datafile);
        data.setClassIndex(data.numAttributes() - 1);
        Instance first = data.instance(0);
        Instance second = data.instance(1);
        data.delete(0);
        data.delete(0);
        Classifier ibk = new IBk();
        ibk.buildClassifier(data);
        System.out.println("First:" + ibk.classifyInstance(first));
        System.out.println("Second:" + ibk.classifyInstance(second));
    }
}
Output:
First:0.0
Second:1.0
i) One-Dimension
#include<stdio.h>
int mod(int k)
{
if(k>0) return k;
else return -k;
}
int small(int b[],int n)
{
int m,pos,r=0; m=b[0];
for(pos=0;pos<n;pos++)
{
if(m>b[pos]) { m=b[pos];
r=pos;
}
}
return r;
}
void main()
{
int n,j,s=0;
int x=0,y=0,z=0;
int obj[20],c[20][20],mean[20],a[20];
int i,nc,k,m,min,count;
printf("\n\n Enter no. of items");
scanf("%d",&n);
printf("\n Enter items");
for(i=0;i<n;i++)
scanf("%d",&obj[i]);
printf("\n Enter no of clusters");
scanf("%d",&nc);
for(i=0;i<nc;i++)
for(j=0;j<n;j++)
{
c[i][j]=0; a[i]=0; }
for(i=0;i<nc;i++)
{
c[i][0]=obj[i];
mean[i]=obj[i];
}
for(i=0;i<nc;i++)
for(j=0;j<n;j++)
if(c[i][j]>0)
printf(" I:%d",c[i][j]);
j=nc;
for(i=1;i<n;i++)
{
if(j<n)
{
for(k=0;k<nc;k++)
a[k]=mod(obj[j]-mean[k]);
min=small(a,nc);
c[min][i]=obj[j];
for(k=0;k<nc;k++)
{
s=0;count=0;
for(m=0;m<n;m++)
{
if(c[k][m]>0)
{
s=s+c[k][m];
count++;
}
}
mean[k]=s/count;
}
for(k=0;k<nc;k++)
printf("\n mean values..%d\t",mean[k]);
printf("\n");
j++;
}}
for(i=0;i<nc;i++)
{
printf("\n");
for(j=0;j<n;j++)
{
if(c[i][j]>0)
printf("%d\t",c[i][j]);
}}}
Output:
ii) Two-Dimension
#include<stdio.h>
#include<math.h>
double distance(int a[][2],double b[][2],int j,int k) /* signature reconstructed; it is missing in the manual */
{
double n=0,x1,y1,total;
int x,y;
x=a[j][0]-b[k][0];
y=a[j][1]-b[k][1];
x1=x*x;
y1=y*y;
total=x1+y1;
n=sqrt(total);
return n;
}
int small(double b[],int n)
{
int pos,r=0;double m=b[0];
for(pos=0;pos<n;pos++)
{
if(m>b[pos])
{
m=b[pos];
r=pos;
}
}
return r;
}
void main()
{
int n,j,s=0;
int x=0,y=0,z=0;
int x1,y1;
int obj[20][2],c[20][20][2];
double mean[20][2];
double a[20];
int i,nc,k,m,min,count;
printf("\n\n Enter no. of items");
scanf("%d",&n);
printf("\n Enter n items");
for(i=0;i<n;i++)
for(k=0;k<2;k++)
scanf("%d",&obj[i][k]);
printf("\n Enter no of clusters");
scanf("%d",&nc);
for(i=0;i<nc;i++)
for(j=0;j<n;j++)
{
for(k=0;k<2;k++)
c[i][j][k]=0; a[i]=0;
}
for(i=0;i<nc;i++)
{
j=0;
for(k=0;k<2;k++)
{
c[i][j][k]=obj[i][k];
mean[i][k]=obj[i][k];
}
}
for(i=0;i<nc;i++)
{
printf("\nI%d:",i);
for(j=0;j<n;j++)
for(k=0;k<2;k++)
if(c[i][j][k]>0)
printf("%d ",c[i][j][k]);
printf("\n");
}
for(i=0;i<nc;i++)
{
for(k=0;k<2;k++)
printf("\n mean values...%lf ",mean[i][k]);
printf("\n");
}
j=nc;
for(i=1;i<n;i++)
{
if(j<n)
{
for(k=0;k<nc;k++)
a[k]=distance(obj,mean,j,k);
min=small(a,nc);
c[min][i][0]=obj[j][0];
c[min][i][1]=obj[j][1];
for(m=0;m<n;m++)
{
x1=0;y1=0;count=0;
for(k=0;k<nc;k++)
{
if(c[m][k][0]>0||c[m][k][1]>0)
{
x1=x1+c[m][k][0];
y1=y1+c[m][k][1];
count++;
}
}
if(count>0)
{
mean[k][0]=x1/count;
mean[k][1]=y1/count;
}
}
j++;
}
}
for(i=0;i<nc;i++)
{
for(j=0;j<n;j++)
for(k=0;k<2;k++)
printf("%d ",c[i][j][k]);
printf("\n");
}
printf("final kmean values are....\n");
for(i=0;i<nc;i++)
printf("%lf....%lf\n",mean[i][0],mean[i][1]);
}
Output:
ii) A program to implement the k-medoid algorithm
#include<stdio.h>
#include<math.h>
int distance(int [],int []);
int i,j,n,nc=3;
void main()
{
int j,count,t;
int obj[10][2],c[10][10][2],mean[10][2],c1[10][10][2];
int i,k,m,cost=0,cost1=0;
printf("\n enter the no. of items:");
scanf("%d",&n);
printf("\n enter the items(%d)",n);
for(i=0;i<n;i++)
for(j=0;j<2;j++)
scanf("%d",&obj[i][j]);
for(i=0;i<nc;i++)
for(j=0;j<n;j++)
for(k=0;k<2;k++)
{
c[i][j][k]=0;
c1[i][j][k]=0;
}
printf("\n enter center points");
for(i=0;i<nc;i++)
for(j=0;j<2;j++)
{
scanf("%d",&mean[i][j]);
c[i][0][j]=mean[i][j];
}
j=0;
for(i=1;i<=n;i++)
{
if(j<n)
{
if(distance(obj[j],mean[0])<distance(obj[j],mean[1]))
if(distance(obj[j],mean[0])<distance(obj[j],mean[2]))
for(k=0;k<2;k++)
{
c[0][i][k]=obj[j][k];
cost=cost+distance(obj[j],mean[0]);
}
if(distance(obj[j],mean[1])<distance(obj[j],mean[0]))
if(distance(obj[j],mean[1])<distance(obj[j],mean[2]))
for(k=0;k<2;k++)
{
c[1][i][k]=obj[j][k];
cost=cost+distance(obj[j],mean[1]);
}
if(distance(obj[j],mean[2])<distance(obj[j],mean[0]))
if(distance(obj[j],mean[2])<distance(obj[j],mean[1]))
for(k=0;k<2;k++)
{
c[2][i][k]=obj[j][k];
cost=cost+distance(obj[j],mean[2]);
}
j++;
}
}
printf("\n enter the next center points:");
for(i=0;i<nc;i++)
for(j=0;j<2;j++)
{
scanf("%d",&mean[i][j]);
c1[i][0][j]=mean[i][j];
}
j=0;
for(i=1;i<=n;i++)
{
if(j<n)
{
if(distance(obj[j],mean[0])<distance(obj[j],mean[1]))
if(distance(obj[j],mean[0])<distance(obj[j],mean[2]))
for(k=0;k<2;k++)
{
c1[0][i][k]=obj[j][k];
cost1=cost1+distance(obj[j],mean[0]);
}
if(distance(obj[j],mean[1])<distance(obj[j],mean[0]))
if(distance(obj[j],mean[1])<distance(obj[j],mean[2]))
for(k=0;k<2;k++)
{
c1[1][i][k]=obj[j][k];
cost1=cost1+distance(obj[j],mean[1]);
}
if(distance(obj[j],mean[2])<distance(obj[j],mean[0]))
if(distance(obj[j],mean[2])<distance(obj[j],mean[1]))
for(k=0;k<2;k++)
{
c1[2][i][k]=obj[j][k];
cost1=cost1+distance(obj[j],mean[2]);
}
j++;
}
}
if(cost<cost1)
{
for(i=0;i<nc;i++)
{
printf("\n");
for(j=0;j<n;j++)
for(k=0;k<2;k++)
{
if(c[i][j][k]>0)
printf("%d\t",c[i][j][k]);
}
}
}
else
{
for(i=0;i<nc;i++)
{
printf("\n");
for(j=0;j<n;j++)
for(k=0;k<2;k++)
{
if(c1[i][j][k]>0)
printf("%d\t",c1[i][j][k]);
}
}
}
}
int distance(int obj[],int mean[])
{
int x1,x2,y1,y2,dist;
x1=obj[0];
x2=mean[0];
y1=obj[1];
y2=mean[1];
dist=(sqrt(pow((x1-x2),2)+pow((y1-y2),2)));
return dist;
}
Output:
10. A small case study involving all stages of KDD. (Datasets are available online
like UCI Repository etc.)
KDD:
STEPS OF DM:
1) Domain analysis
Development of domain understanding
Discovery of relevant prior knowledge
Definition of the goal of the knowledge discovery
2) Data selection
Selection and integration of the target data from possibly many different
and heterogeneous sources
Interesting data may exist, e.g., in relational databases, document
collections, e-mails, photographs, video clips, process database, customer
transaction database, web logs etc.
Focus on the correct subset of variables and data samples
E.g., customer behavior in a certain country, relationship between
items purchased and customer income and age
3) Data cleaning and preprocessing
Dirty data can confuse the mining procedures and lead to unreliable and
invalid outputs
Complex analysis and mining on a huge amount of data may take a very
long time
Preprocessing and cleaning should improve the quality of data and mining
results by enhancing the actual mining process
The actions to be taken includes
Removal of noise or outliers
Collecting necessary information to model or account for noise
Using prior domain knowledge to remove the inconsistencies
and duplicates from the data
Choice or usage of strategies for handling missing data fields
Data reduction techniques are applied to produce reduced representation
of the data (smaller volume that closely maintains the integrity of the
original data)
Aggregation
Dimension reduction (Attribute subset selection, PCA, MDS,…)
Compression (e.g., wavelets, PCA, clustering,…)
Numerosity reduction
parametric models: regression and log-linear models
non-parametric models: histograms, clustering, sampling…
Discretization (e.g., binning, histograms, cluster analysis, …)
Concept hierarchy generation (numeric value of ”age” to a higher
level concept ”young, middle-aged, senior”)
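As a small illustration of concept hierarchy generation, a numeric age can be mapped onto the higher-level concepts mentioned above; the cut-off values in this sketch are invented for illustration.

// Sketch: map a numeric age onto the concept hierarchy "young / middle-aged / senior".
// The boundary values are illustrative assumptions, not taken from the manual.
public class AgeConceptHierarchy {
    static String ageConcept(int age) {
        if (age < 35) return "young";
        if (age < 60) return "middle-aged";
        return "senior";
    }
    public static void main(String[] args) {
        System.out.println(ageConcept(42));  // prints "middle-aged"
    }
}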
7) Use of data mining algorithms
Application of the chosen DM algorithms to the target data set
Search for the patterns and models of interest in a particular
representational form or a set of such representations
Classification rules or trees, regression models, clusters, mixture models…
Should be relatively automatic
Generally DM involves:
Establish the structural form (model/pattern) one is interested in
Estimate the parameters from the available data
Interpret the fitted models
8) Interpretation/evaluation
The mined patterns and models are interpreted
The results should be presented in understandable form
Visualization techniques are important for making the results useful –
mathematical models or text type descriptions may be difficult for domain
experts
Possible return to any of the previous steps
11. Using COGNOS IMPROMPTU 7 to Generate Report
Start Impromptu: You can start Impromptu by double-clicking the Impromptu icon
on your desktop, or by clicking the Start button. You see the Welcome dialog box
when you start Impromptu.
Select a Catalog: To use Impromptu to create or open reports for your business,
you must select an existing catalog. Catalogs are usually created by an
administrator. You can open a different catalog at any time during your Impromptu
session, but you can only open one catalog at a time.
Open the Great Outdoors Sales Data Catalog (Great Outdoors Sales Data.cat). You
get this catalog when you do a typical installation of Impromptu.
1. If you have just started Impromptu, click Close to close the Welcome dialog box.
2. If you do not have the Great Outdoors Sales Data catalog open, from the
Catalog menu, click Open to show the Open Catalog dialog box.
5. In the Catalog Logon dialog box, click OK to accept your catalog User Class and
open the catalog.
When working with the Great Outdoors Sales Data catalog, your user class is User
if you have the User version of Impromptu, and Creator if you have the
Administrator version of Impromptu. Tip: Check the message in the status line.
When it says "Sales data for The Great Outdoors Co.," this catalog is open. Note:
If a Catalog Upgrade dialog box appears, select Upgrade this catalog and click OK
to close the dialog box.
Try This... To open an existing report:
1. From the File menu, click Open. If the Reports folder isn’t open, double-click the
Reports folder to open it.
2. Locate and double-click the SalesRep Sales Totals report. Impromptu prompts
you to select one or more sales representatives.
Do not click OK yet. Note: If a Report Upgrade dialog box appears, select Upgrade
this report and click OK to close the dialog box.
Respond to a Prompt
Your report may prompt you for information before retrieving the data. Your
response to a prompt determines what is included in the report. The prompt acts
as a filter for the data so that only the information you require appears in the
report. One or more prompt dialog boxes may appear when you open a report.
Each prompt dialog box further refines the data you will see in your report. You
may be prompted to select one or more values from a list, or you may be required
to type in a value. For example, this report requires you to select a sales
representative from a list. You can select one or more values from the Prompts
dialog box.
Try This... To respond to a prompt
You can see the details of Bill Gibbons’ sales this year, including sales by customer
and maximum and minimum sales. You can use this report during your
performance review of Bill Gibbons.
2. From the Report menu, click Prompt to show the Prompts dialog box.
3. Click Bill Smertal, and Ctrl+click Charles Loo Nam, then click OK to show the
Sales Totals for Representative report for Bill Smertal and Charles Loo Nam.
Print Your Report
Impromptu lets you print your report. To print a report:
1. From the File menu, click Print.
2. In the Print dialog box, select the appropriate print settings, and then click OK to
send the report to the printer.
3. From the File menu, click Close to close the report.
1. Click the New button to show the Report Wizard. Note: Do not click New from
the File menu. This will open the New dialog box instead of the Report Wizard.
2. Type GO Product Margins, and click Next to show the list/crosstab choice page.
For more information on crosstab reports
3. Click List Report and then click Next to show the data item selection page
Select the Data
On the data item selection page you select data for your report. Each data item is
presented as a column in your report.
2. Double-click the Product Line data item to add it to the Report Columns box.
4. Double-click the Price and Cost folder to open it and then double-click these
data items:
• Product Cost
• Product Price
• Product % Margin
6. Click the check box beside Product Line. By grouping Product Line, Impromptu
sorts the information in the product line, removing any duplicate values. Note:
Ensure the Automatically Generate Totals check box is selected. When you select
Automatically Generate Totals, the Wizard adds the totals for the numeric
columns in the report to the overall list footer. If the report is grouped, the
Wizard also adds footers at each change in the value of the grouped data item
and inserts totals for the group in the group footers.
Filter the Data
On the filter page:
1. To create a filter to look at all products with margins less than or equal to 50%,
double-click Product % Margin in the Available Components box.
2. Double-click <=.
3. Double-click number, type 50 in the Enter Value dialog box, and then click OK.
You can now see all the information you need to compare the product margins,
and you can focus the report further to see only the margins on GO products.
6. Type GO Product Margins Tutorial in the File Name box and click Save