DW Lab Record
C.S.I INSTITUTE OF TECHNOLOGY
Thovalai, Kanyakumari District, Tamil Nadu- 629 302.
CCS341
DATA WAREHOUSING
NAME :
REGISTER NUMBER :
YEAR : III
SEMESTER : V
Certificate of Completion
Register Number
Certificate of Submission
Submitted for the Anna University practical examination held at C.S.I INSTITUTE OF
TECHNOLOGY, THOVALAI on ………………………….
4. Click Next
5. Click I Agree.
6. Change the settings as required and click Next. The Full option with file associations is the recommended setting.
7. Change to your desired installation location.
8. If you want a shortcut then check the box and click Install.
9. The installation will start; wait a moment, as it finishes within a minute.
10. After the installation completes, click Next.
11. That's all; click Finish, take a shovel, and start mining.
This is the GUI shown at start-up. It offers four options: Explorer, Experimenter,
Knowledge Flow and Simple CLI.
The Graphical User Interface:
The Weka GUI Chooser (class weka.gui.GUIChooser) provides a starting point for
launching Weka's main GUI applications and supporting tools. If one prefers an MDI (multiple
document interface) appearance, this is provided by an alternative launcher called Main.
The GUI Chooser consists of four buttons, one for each of the four major Weka applications,
and four menus.
The buttons can be used to start the following applications:
Explorer: An environment for exploring data with WEKA
Experimenter: An environment for performing experiments and conducting statistical
tests between learning schemes.
Knowledge Flow: This environment supports essentially the same functions as the
Explorer but with a drag-and-drop interface. One advantage is that it supports
incremental learning.
Simple CLI: Provides a simple command-line interface that allows direct execution of
WEKA commands for operating systems that do not provide their own command line
interface.
1. Explorer
The Graphical user interface
1.1 Section Tabs
At the very top of the window, just below the title bar, is a row of tabs. When the Explorer is
first started only the first tab is active; the others are greyed out. This is because it is necessary
to open (and potentially pre-process) a data set before starting to explore the data. The tabs are
as follows:
1. Preprocess. Choose and modify the data being acted on.
2. Classify. Train and test learning schemes that classify or perform regression.
3. Cluster. Learn clusters for the data.
4. Associate. Learn association rules for the data.
5. Select attributes. Select the most relevant attributes in the data.
6. Visualize. View an interactive 2D plot of the data.
2. Weka Experimenter:
The Weka Experiment Environment enables the user to create, run, modify, and analyse
experiments in a more convenient manner than is possible when processing the schemes
individually. For example, the user can create an experiment that runs several schemes against
a series of datasets and then analyse the results to determine if one of the schemes is
(statistically) better than the other schemes.
The Experiment Environment can be run from the command line using the Simple CLI.
The Experimenter comes in two flavours, either with a simple interface that provides most of
the functionality one needs for experiments, or with an interface with full access to the
Experimenter’s capabilities. You can choose between those two with the Experiment
Configuration Mode radio buttons:
Simple
Advanced
Both setups allow you to set up standard experiments that are run locally on a single
machine, or remote experiments that are distributed between several hosts. Distributing the
experiments cuts down the time they take to complete, but the setup takes more time. The
next section covers the standard experiments (both simple and advanced), followed by the
remote experiments and finally the analysis of the results.
3. Knowledge Flow:
The Knowledge Flow provides an alternative to the Explorer as a graphical front end
to WEKA’s core algorithms. The Knowledge Flow presents a data-flow inspired interface to
WEKA. The user can select WEKA components from a palette, place them on a layout canvas
and connect them together in order to form a knowledge flow for processing and analysing
data. At present, all of WEKA's classifiers, filters, clusterers, associators, loaders and savers
are available in the Knowledge Flow along with some extra tools.
The Knowledge Flow can handle data either incrementally or in batches (the Explorer
handles batch data only).
4. Simple CLI:
The Simple CLI provides full access to all Weka classes, i.e., classifiers, filters, clusterers,
etc., but without the hassle of the CLASSPATH (it uses the CLASSPATH with which Weka
was started). It offers a simple Weka shell with separated command line and output.
Sample Weka Data Sets
Below are some sample WEKA data sets, in arff format.
contact-lens.arff
cpu.arff
cpu.with-vendor.arff
diabetes.arff
glass.arff
ionosphere.arff
iris.arff
labor.arff
ReutersCorn-train.arff
segment-test.arff
soybean.arff
supermarket.arff
vote.arff
weather.arff
weather.nominal.arff
Steps to load the weather data set:
1. Open WEKA Tool.
2. Click on WEKA Explorer.
3. Click on open file button.
4. Choose the WEKA folder on the C drive.
5. Select and click on the data folder.
6. Choose the weather.arff file and open it.
List out the attribute names:
1. outlook
2. temperature
3. humidity
4. windy
5. play
Steps to plot the histogram:
1. Open WEKA Tool.
2. Click on WEKA Explorer.
3. Click on Visualize button.
4. Right-click on the plot.
5. Select and click on the polyline option.
CONCLUSION:
Thus, the exploration and integration of data was done successfully with WEKA.
EX NO:2
APPLY WEKA TOOL FOR DATA VALIDATION
AIM:
To apply the WEKA tool for data validation by splitting a data set into training, testing
and cross-validation instances.
PROCEDURE:
1. Load a sample data set to be validated, iris.arff, from the WEKA data folder (open
C:\program files -> WEKA 8.5 -> data -> iris.arff), keep 10 instances in the file and
save it as iris1.arff in a desired location (D:\new folder).
5. For data validation, first split 60% of the data set into a training set with 6 instances.
Click on Weka -> select Filters -> select Unsupervised -> select Instance -> select
Resample, then click the filter pane.
6. From the filter pane, choose invertSelection - False, noReplacement - True, sample
size percentage - 60, then click OK -> click Apply.
7. The sample data set is filtered to 6 instances. Click Save and save the file as train.arff
in a desired location.
8. Click Undo -> click Resample in the filter pane, choose invertSelection - True,
noReplacement - True, sample size percentage - 60, then click OK -> click Apply.
10. The remaining sample data set (4 instances) is split 50/50 into cross-validation data
and test data.
11. Click Resample in the filter pane, select invertSelection - True, noReplacement -
True, sample size percentage - 50, then click OK -> click Apply.
12. 2 instances are created. Click on Save and save the file as cv.arff in a desired location.
13. Click Undo -> click Resample in the filter pane, choose invertSelection - False,
noReplacement - True, sample size percentage - 50, then click OK -> click Apply.
14. 2 instances are created. Click Save and save the file as test.arff in a desired location.
15. To check the instances created in the training, cross-validation and testing data,
right-click and open the files in Notepad; the files will be:
train.arff:
cv.arff:
test.arff:
CONCLUSION:
Thus, the data validation by splitting a data set into training, testing and cross-validation
instances was done successfully with WEKA.
EX NO:3
PLAN THE ARCHITECTURE FOR REAL TIME APPLICATION
AIM:
To design an architecture for classifying, cross-validating and splitting a data set into
training and testing sets, and thereby computing the accuracy by applying a decision tree
rule.
Steps to plan the architecture:
1. Define the problem
2. Data collection and preprocessing
3. Choose the appropriate Weka algorithms
4. Real-time data streaming
5. Model training and evaluation.
6. Integration and deployment
7. Monitoring and maintenance
PROCEDURE:
1. Open WEKA Tool.
2. Click on WEKA Explorer.
3. Click on Test options -> Supplied test set -> Set.
4. Click on Open file
5. Load the iris.arff data set from the data folder in WEKA.
The output will be:
CONCLUSION:
Thus, the architecture for classifying and testing a real time application (data set) was
designed successfully.
EX NO: 4
WRITE THE QUERY FOR SCHEMA DEFINITION
AIM:
To write the queries for schema definition using the PostgreSQL tool.
PROCEDURE:
1. Click Start - All Programs - PostgreSQL 16 - open pgAdmin 4.
2. Click the server icon, then enter the name, host and password as postgre.
3. Double click PostgreSQL 16.
4. Right click databases (1) and choose Create and type database name as dwftp and
Save.
5. Double click dwftp and click schemas (1) - Right click and select Create and type
schema name as dw and Save.
6. Double-click dw, right-click Tables, select Query Tool and run the queries for creating the
tables:
(1) location
(2) phonerate
(3) timedim
(4) facts
CREATE TABLE dw.location
(
id_location integer NOT NULL,
city VARCHAR (20),
provinence VARCHAR (20),
region VARCHAR (20),
PRIMARY KEY (id_location)
);
To insert values into the tables, right-click the table (location) and type the values, which are
then displayed in the Data Output pane.
CREATE TABLE dw.phonerate(
ID_phoneRate INTEGER NOT NULL,
phoneRateType VARCHAR (20),
PRIMARY KEY(ID_phoneRate)
);
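The CREATE statement for the timedim table is not listed here; a minimal sketch consistent with the columns used in the queries that follow (ID_time, dateMonth, dateYear), with an assumed dateDay column, would be:
CREATE TABLE dw.timedim (
ID_time INTEGER NOT NULL,
dateDay INTEGER,            -- assumed; only month and year are used below
dateMonth INTEGER NOT NULL,
dateYear INTEGER NOT NULL,
PRIMARY KEY (ID_time)
);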
CREATE TABLE dw.facts(
ID_time INTEGER NOT NULL,
ID_phoneRate INTEGER NOT NULL,
ID_location_Caller INTEGER NOT NULL,
ID_location_Receiver INTEGER NOT NULL,
Price FLOAT NOT NULL,
NumberOfCalls INTEGER NOT NULL,
PRIMARY KEY (ID_time, ID_phoneRate, ID_location_Caller, ID_location_Receiver) -- composite key over the dimension keys (assumed)
);
(1) To display the sum of the prices, year and phone rate type (hint: use the facts, timedim and
phonerate tables; group by phoneRateType and dateYear).
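A possible formulation, written in the same style as query (3) below and using the column names from the table definitions above (the TotalPrice alias is only illustrative):
select dateYear, phoneRateType, sum (Price) as TotalPrice
from dw.facts F, dw.timedim Te, dw.phonerate P
where F.ID_time = Te.ID_time and F.ID_phoneRate = P.ID_phoneRate
group by phoneRateType, dateYear;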
OUTPUT:
(2) To display the month, year, total number of calls, total income and rank of income
(hint: use the facts and timedim tables; rename sum(NumberOfCalls) as Total number of
calls, sum(Price) as Total income and rank(sum(Price)) as Rank Income; group by
dateMonth and dateYear).
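A possible formulation following the hint above (aliases are only illustrative):
select dateMonth, dateYear,
sum (NumberOfCalls) as TotalNumberOfCalls,
sum (Price) as TotalIncome,
rank () over (order by sum (Price) desc) as RankIncome
from dw.facts F, dw.timedim Te
where F.ID_time = Te.ID_time
group by dateMonth, dateYear;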
OUTPUT:
(3) To display the month, total number of calls and rank of the number of calls where
year = 2003, grouped by dateMonth (hint: use the facts and timedim tables; rename
sum(NumberOfCalls) as TotNumOfCalls and rank(sum(NumberOfCalls)) as RankNumOfCalls).
select dateMonth, sum (NumberOfCalls) as TotNumOfCalls,
rank () over (order by sum (NumberOfCalls) desc) as RankNumOfCalls
from dw.facts F, dw.timedim Te
where F.ID_time = Te.ID_time and dateYear=2003
group by dateMonth;
OUTPUT:
CONCLUSION:
Thus, the queries for schema definition using the PostgreSQL tool were executed
successfully.
EX NO: 5
DESIGN DATA WAREHOUSE FOR REAL TIME APPLICATION
AIM:
To design a data warehouse for a real-time application using the PostgreSQL tool.
PROCEDURE:
1. Click Start - All Programs - PostgreSQL 16 - open pgAdmin 4.
2. Click the server icon, then enter the name, host and password as postgre.
3. Double click PostgreSQL 16.
4. Right click databases (1) and choose Create and type database name as dwftp and
Save.
5. Double click dwftp and click schemas (1) - Right click and select Create and type
schema name as dw and Save.
6. Double-click dw, right-click Tables, select Create -> Table and create a table for Employee
as emp1 with the columns:
Eno integer PRIMARY KEY
Empname VARCHAR(20)
Age integer
Salary integer
Job Char
Deptno integer
and Save
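The equivalent table can also be created with a query instead of the dialog; a minimal sketch matching the columns listed above (the CHAR length is an assumption, since the column list only says "Char"):
CREATE TABLE dw.emp1 (
Eno INTEGER PRIMARY KEY,
Empname VARCHAR (20),
Age INTEGER,
Salary INTEGER,
Job CHAR (10),   -- length assumed
Deptno INTEGER
);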
7. To insert values into the table, right-click the table emp1, select View/Edit Data -> All
Rows, add the required number of rows, insert the values by double-clicking each attribute,
and Save.
8. Right click on table emp1, select Query tool and perform the query operations:
(a) To list the records in the emp1 table ordered by salary in descending order:
select * from dw.emp1 order by salary desc;
OUTPUT:
CONCLUSION:
Thus, the data warehouse for a real-time application such as employee details was
designed successfully.
EX NO: 6
ANALYSE THE DIMENSIONAL MODELING
AIM:
To analyse dimensional modelling for applications and design the star, snowflake
and fact constellation schemas.
PROCEDURE:
Multidimensional schema is defined using Data Mining Query Language (DMQL). The
two primitives, cube definition and dimension definition, can be used for defining the data
warehouses and data marts.
Star Schema
Each dimension in a star schema is represented with only one dimension table.
This dimension table contains the set of attributes.
The following diagram shows the sales data of a company with respect to the four
dimensions, namely time, item, branch, and location.
There is a fact table at the centre. It contains the keys to each of four dimensions.
The fact table also contains the attributes, namely dollars sold and units sold.
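A minimal SQL sketch of this star schema follows. The dimension and measure names come from the description above; the _dim suffix, column sizes and the branch/location column choices are added here only for illustration.
-- Star schema: one denormalized table per dimension, keyed from a central fact table
CREATE TABLE time_dim (
time_key INTEGER PRIMARY KEY,
day INTEGER,
month INTEGER,
quarter INTEGER,
year INTEGER
);
CREATE TABLE item_dim (
item_key INTEGER PRIMARY KEY,
item_name VARCHAR (30),
brand VARCHAR (20),
type VARCHAR (20),
supplier_type VARCHAR (20)
);
CREATE TABLE branch_dim (
branch_key INTEGER PRIMARY KEY,
branch_name VARCHAR (30),
branch_type VARCHAR (20)
);
CREATE TABLE location_dim (
location_key INTEGER PRIMARY KEY,
street VARCHAR (30),
city VARCHAR (20),
province VARCHAR (20),
country VARCHAR (20)
);
CREATE TABLE sales_fact (
time_key INTEGER REFERENCES time_dim (time_key),
item_key INTEGER REFERENCES item_dim (item_key),
branch_key INTEGER REFERENCES branch_dim (branch_key),
location_key INTEGER REFERENCES location_dim (location_key),
dollars_sold NUMERIC (12,2),
units_sold INTEGER
);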
Snowflake Schema
Some dimension tables in the Snowflake schema are normalized.
The normalization splits up the data into additional tables.
Unlike the star schema, the dimension tables in a snowflake schema are normalized. For
example, the item dimension table of the star schema is normalized and split into two
dimension tables, namely the item and supplier tables.
Now the item dimension table contains the attributes item_key, item_name, type, brand,
and supplier-key.
The supplier key is linked to the supplier dimension table. The supplier dimension table
contains the attributes supplier_key and supplier_type.
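A sketch of the normalization just described, replacing the item dimension of the star-schema sketch with two tables (table names with the _sf/_dim suffixes and column sizes are only illustrative):
-- Snowflake: the item dimension is normalized into item and supplier tables
CREATE TABLE supplier_dim (
supplier_key INTEGER PRIMARY KEY,
supplier_type VARCHAR (20)
);
CREATE TABLE item_dim_sf (      -- snowflaked replacement for item_dim
item_key INTEGER PRIMARY KEY,
item_name VARCHAR (30),
type VARCHAR (20),
brand VARCHAR (20),
supplier_key INTEGER REFERENCES supplier_dim (supplier_key)
);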
CONCLUSION:
Thus, the dimensional modelling was analysed and the star, snowflake and fact
constellation schemas were designed successfully.
EX NO: 7
CASE STUDY USING OLAP
AIM:
To study the different OLAP operations.
OLAP Operations:
Since OLAP servers are based on a multidimensional view of data, we will discuss
OLAP operations on multidimensional data.
The different OLAP operations are:
1. Roll-up (Drill-up)
2. Drill-down
3. Slice and dice
4. Pivot (rotate)
1. Roll-up (Drill-up):
Roll-up performs aggregation on a data cube in any of the following ways:
By climbing up a concept hierarchy for a dimension
By dimension reduction
Roll-up is performed by climbing up a concept hierarchy for the dimension location.
Initially the concept hierarchy was "street < city < province < country". On rolling up,
the data is aggregated by ascending the location hierarchy from the level of city to the
level of country, so the data is grouped by country rather than by city. When roll-up is
performed, one or more dimensions from the data cube are removed.
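A minimal sketch of roll-up as a query over the sales star schema sketched in EX NO: 6:
-- Roll-up: aggregate along the location hierarchy from city up to country
SELECT L.country, SUM(S.dollars_sold) AS dollars_sold
FROM sales_fact S, location_dim L
WHERE S.location_key = L.location_key
GROUP BY L.country;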
2. Drill-down:
Drill-down is the reverse operation of roll-up. It is performed by either of these ways:
By stepping down a concept hierarchy for a dimension.
By introducing a new dimension.
Drill-down is performed by stepping down a concept hierarchy for the dimension time.
Initially the concept hierarchy was “day < month < quarter < year”. On drilling down, the
time dimension is descended from the level of quarter to the level of month. When drill-
down is performed, one or more dimensions from the data cube are added. It navigates the
data from less detailed data to highly detailed data.
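A minimal sketch of drill-down over the same sales star schema:
-- Drill-down: descend the time hierarchy from quarter to month
SELECT T.year, T.month, SUM(S.dollars_sold) AS dollars_sold
FROM sales_fact S, time_dim T
WHERE S.time_key = T.time_key
GROUP BY T.year, T.month;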
3. Slice:
The slice operation selects one particular dimension from a given cube and provides a
new sub-cube.
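A minimal sketch of a slice over the sales star schema, fixing the time dimension to a single quarter:
-- Slice: fix one dimension value (quarter 1) to obtain a sub-cube
SELECT L.city, I.item_name, SUM(S.dollars_sold) AS dollars_sold
FROM sales_fact S, time_dim T, location_dim L, item_dim I
WHERE S.time_key = T.time_key
  AND S.location_key = L.location_key
  AND S.item_key = I.item_key
  AND T.quarter = 1
GROUP BY L.city, I.item_name;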
4. Dice:
Dice selects two or more dimensions from a given cube and provides a new sub-cube.
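A minimal sketch of a dice over the sales star schema; the city names and item type used in the restrictions are hypothetical example values:
-- Dice: restrict two or more dimensions at once to form a sub-cube
SELECT T.quarter, L.city, SUM(S.dollars_sold) AS dollars_sold
FROM sales_fact S, time_dim T, location_dim L, item_dim I
WHERE S.time_key = T.time_key
  AND S.location_key = L.location_key
  AND S.item_key = I.item_key
  AND T.quarter IN (1, 2)
  AND L.city IN ('Chennai', 'Madurai')
  AND I.type = 'phone'
GROUP BY T.quarter, L.city;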
5. Pivot (rotate):
The pivot operation is also known as rotation. It rotates the data axes in view in order
to provide an alternative presentation of data.
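A minimal sketch of a pivot over the sales star schema, rotating quarters from rows into columns:
-- Pivot: a city-by-quarter view built with conditional aggregation
SELECT L.city,
       SUM(CASE WHEN T.quarter = 1 THEN S.dollars_sold ELSE 0 END) AS q1_sales,
       SUM(CASE WHEN T.quarter = 2 THEN S.dollars_sold ELSE 0 END) AS q2_sales,
       SUM(CASE WHEN T.quarter = 3 THEN S.dollars_sold ELSE 0 END) AS q3_sales,
       SUM(CASE WHEN T.quarter = 4 THEN S.dollars_sold ELSE 0 END) AS q4_sales
FROM sales_fact S, time_dim T, location_dim L
WHERE S.time_key = T.time_key
  AND S.location_key = L.location_key
GROUP BY L.city;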
CONCLUSION:
Thus, the case study of different OLAP operations was done successfully.
EX NO: 8
CASE STUDY USING OLTP
AIM:
To study the various aspects of OLTP operations.
OLTP:
Online Transaction Processing (OLTP) is a type of database system that is optimized for
high-speed data processing and rapid transaction execution in real-time. OLTP is an
operational system that supports transaction-oriented applications in a 3-tier architecture.
OLTP basically focuses on query processing and on maintaining data integrity in multi-access
environments, with effectiveness measured by the total number of transactions per second.
OLTP Architecture:
An OLTP system uses a 3-tier architecture:
• The presentation layer: This layer is the front end or user interface where
transactions are generated.
• The logic layer: Also called the business logic or application layer, this layer
processes transaction data based on predefined rules.
• The data or data store layer: This is where each transaction and related data are
stored and indexed. It includes the database management system (DBMS) and
the database server.
Key characteristics of OLTP:
OLTP systems have four critical characteristics:
1. Fast query processing:
OLTP systems are used for high-speed query processing. They handle transactions in
real-time, meaning that transactions are executed as soon as they are received, with little or
no delay.
2. High concurrency:
OLTP systems use algorithms that allow many concurrent users to perform transactions
simultaneously. Each transaction is executed independently of the others and in the proper
order.
3. ACID properties:
To ensure data integrity, consistency, and reliability, OLTP transactions comply with
the ACID (Atomicity, Consistency, Isolation, Durability) properties; a small SQL
transaction sketch follows this list. These are:
Atomicity: Transactions in OLTP systems are atomic, meaning they are treated
as a single, indivisible unit of work. If any part of a transaction fails, the entire
transaction is rolled back, so the database is left in its original state.
Consistency: OLTP databases are designed to maintain data consistency
despite failures or errors. Every transaction changes tables in predefined and
predictable ways, and the database is always left in a valid state.
Isolation: Transactions in OLTP systems are isolated from each other. This
means that when multiple users read and write data simultaneously, they are
executed independently. It keeps the database in a consistent state.
Durability: When a transaction is successfully executed in an OLTP database,
the changes to the data are permanent and will survive any subsequent failures
or errors, like system crashes or power outages.
4. Support for simple transactions:
OLTP systems support specific applications or business processes like order processing,
inventory management, or customer service. They are typically not used for complex
queries, data analysis, or reporting tasks.
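The atomicity and durability behaviour described in point 3 can be seen in a plain SQL transaction; a minimal sketch (the accounts table, account numbers and amount are hypothetical):
BEGIN;  -- start an atomic unit of work
UPDATE accounts SET balance = balance - 500 WHERE acc_no = 1001;  -- debit
UPDATE accounts SET balance = balance + 500 WHERE acc_no = 1002;  -- credit
-- If either update fails, ROLLBACK restores the database to its original state;
-- otherwise COMMIT makes the change permanent (durable).
COMMIT;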
OLTP use cases:
Online transaction processing systems are used in applications where the primary goal
is to manage and process many transactions in real time. Some standard use cases include:
E-commerce systems:
E-commerce applications use OLTP to manage customer orders, payments, and
inventory in real time. This allows them to provide exceptional customer service, boost
customer loyalty, and drive growth. For example, an OLTP database can help maintain up-to-
date and accurate inventory data, allowing e-commerce companies to fulfill orders promptly.
Banking and financial services:
Banks and other financial services use OLTP to process financial transactions in real-
time, manage customer data, and enable customers to make deposits, withdraw money, transfer
funds, and access other services quickly. OLTP solutions for financial transaction systems, like
online banking, must have secure and reliable data management practices, multi-currency
support, and custom reporting options. ATMs are the most common example of an OLTP
system used in the financial industry.
Reservation systems:
OLTP drives online reservation systems in the travel and hospitality industry. It is used
in applications that manage bookings, flights, payments, and related services. An online
transaction processing system in the travel and hospitality industry must integrate with external
systems, like airline reservation or car rental applications, and have multi-language and multi-
currency support. OLTP also enables efficient customer data management and helps provide
personalized recommendations that create seamless experiences for travellers.
Customer relationship management (CRM):
Customer relationship management (CRM) platforms use OLTP to manage customer
data, interactions, and transactions. An OLTP system can centralize customer data and record
interactions across multiple channels, like phone calls, emails, and chat messages. OLTP helps
CRM applications automate many sales and marketing processes, including lead generation
and campaign management, so companies can focus on nurturing customer relationships and
increasing sales.
Designing an effective OLTP system:
There are three key factors to consider when building an OLTP solution:
Best practices for schema design
A schema outlines how data is organized in a relational database. In OLTP, the database schema
is designed to process high data volumes. Here are some best practices when creating an OLTP
schema:
Normalization: Normalization involves breaking down a large table into smaller, more
manageable tables to minimize data duplication. This ensures data consistency and
helps maintain data integrity.
Choose appropriate indexes: Indexing is a method used to provide quick access to
data in the database. Indexed data can speed up query processing and improve performance.
This involves identifying the most frequently used queries and creating indexes on the
relevant columns (a small example follows this list).
Use the correct keys: Keys are used to uniquely identify the rows in a table. They
establish relationships between different tables and ensure that there are no duplicate
rows in the database. There are three types of keys:
* Primary key
* Composite key
* Foreign key
Monitor and optimize: Regularly monitor and optimize the schema to ensure it meets
performance metrics and can handle the required transaction volumes.
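A small illustration of the keys and index practices above, applied to the emp1 table from EX NO: 5 (the dept table, constraint name and index name are hypothetical):
-- Foreign key: a separate dept table referenced from emp1 (the primary key on Eno
-- was already declared when emp1 was created).
CREATE TABLE dw.dept (
Deptno INTEGER PRIMARY KEY,
Deptname VARCHAR (20)
);
ALTER TABLE dw.emp1
    ADD CONSTRAINT fk_emp1_dept FOREIGN KEY (Deptno) REFERENCES dw.dept (Deptno);
-- Index on a frequently queried column (salary is used in the EX NO: 5 query).
CREATE INDEX idx_emp1_salary ON dw.emp1 (Salary);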
Scalability considerations:
Scalability determines the OLTP system’s ability to handle increasing transaction
volumes as the business grows. Factors that improve scalability are:
1. Sharding (horizontal scaling): Horizontal scaling involves adding more servers or
nodes to the database cluster to distribute the workload and improve performance.
2. Vertical scaling: Vertical scaling involves increasing the capacity of the hardware
or infrastructure that the database is running on. This can include adding more
memory, CPUs, or storage to the server.
3. Cloud-based solutions: Cloud-based solutions, such as Amazon RDS, Azure SQL
Database, and Google Cloud SQL, are scalable and highly available OLTP
databases that don’t need infrastructure maintenance.
Transaction and concurrency management:
Transaction and concurrency management are critical components of OLTP solutions.
They ensure database consistency and prevent conflicts between multiple users accessing and
modifying data. ACID compliance is the first consideration for transaction management in
OLTP. Here are three other crucial mechanisms:
Locking: Locking is a technique that ensures that multiple transactions do not access
the same data simultaneously. Locks can be implemented at the row, table, or database level,
and can be either shared or exclusive. Shared locks allow multiple transactions to read
the same data at the same time, while exclusive locks prevent other transactions from
accessing the data until the lock is released (a small sketch follows this list).
Isolation level: An isolation level defines how transactions interact and how they see
changes made by other transactions. Most OLTP systems offer four different isolation
levels - Read Uncommitted, Read Committed, Repeatable Read, and Serializable.
Deadlock handling: Deadlocks occur when two or more transactions are waiting for
each other to release locked resources and cannot proceed. Deadlock handling
techniques, such as timeout mechanisms and transaction prioritization, are used to
detect and resolve deadlocks.
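A minimal sketch of explicit locking and an isolation level in PostgreSQL (the accounts table and values are hypothetical):
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- FOR UPDATE takes an exclusive row lock: other transactions that try to update
-- or lock this row must wait until this transaction ends.
SELECT balance FROM accounts WHERE acc_no = 1001 FOR UPDATE;
UPDATE accounts SET balance = balance - 500 WHERE acc_no = 1001;
COMMIT;  -- releases the lock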
CONCLUSION:
Thus, the case study of Online Transaction Processing was done successfully.