DM - Weka Report


WEKA Tool & Its Report

Report on The NSL-KDD Data Set.

Submitted By

Sr. No. Name Roll No.

1. Mrunal Singade 59
2. Nehal Saonerkar 61
3. Prasanna Anjankar 64
4. Savinay Surbhik 72

Submitted to

Prof. Prarthana Deshkar

Department Of Computer Technology

YESHWANTRAO CHAVAN COLLEGE OF ENGINEERING,

NAGPUR SESSION 2022-2023


Introduction

What is data mining?


Data mining is the process of sorting through large data sets to identify patterns and
relationships that can help solve business problems through data analysis. Data mining
techniques and tools enable enterprises to predict future trends and make more-informed
business decisions.

Data mining is a key part of data analytics overall and one of the core disciplines in data
science, which uses advanced analytics techniques to find useful information in data sets. At
a more granular level, data mining is a step in the knowledge discovery in databases (KDD)
process, a data science methodology for gathering, processing and analyzing data. Data
mining and KDD are sometimes referred to interchangeably, but they're more commonly seen
as distinct things.

Why is data mining important?


Data mining is a crucial component of successful analytics initiatives in organizations. The
information it generates can be used in business intelligence (BI) and advanced analytics
applications that involve analysis of historical data, as well as real-time analytics
applications that examine streaming data as it's created or collected.

Data mining process: How does it work?


Data mining is typically done by data scientists and other skilled BI and analytics
professionals. But it can also be performed by data-savvy business analysts, executives and
workers who function as citizen data scientists in an organization.

Its core elements include machine learning and statistical analysis, along with data
management tasks done to prepare data for analysis. The use of machine learning
algorithms and artificial intelligence (AI) tools has automated more of the process and made
it easier to mine massive data sets, such as customer databases, transaction records and log
files from web servers, mobile apps and sensors.

The data mining process can be broken down into these four primary stages:

1. Data gathering: Relevant data for an analytics application is identified and
assembled. The data may be located in different source systems, a data warehouse or a
data lake, an increasingly common repository in big data environments that contains a mix
of structured and unstructured data. External data sources may also be used. Wherever the
data comes from, a data scientist often moves it to a data lake for the remaining steps in
the process.
2. Data preparation: This stage includes a set of steps to get the data ready to be mined.
It starts with data exploration, profiling and pre-processing, followed by data cleansing
work to fix errors and other data quality issues. Data transformation is also done to make
data sets consistent, unless a data scientist is looking to analyze unfiltered raw data for a
particular application.
3. Mining the data: Once the data is prepared, a data scientist chooses the appropriate
data mining technique and then implements one or more algorithms to do the mining. In
machine learning applications, the algorithms typically must be trained on sample data
sets to look for the information being sought before they're run against the full set of data.
4. Data analysis and interpretation: The data mining results are used to create
analytical models that can help drive decision-making and other business actions. The data
scientist or another member of a data science team also must communicate the findings to
business executives and users, often through data visualization and the use of data
storytelling techniques.
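As a toy illustration, the four stages above can be sketched in a few lines of Python. The data, threshold, and "suspicious transaction" framing are hypothetical, not part of any real mining workflow:

```python
# Minimal sketch of the four data mining stages on hypothetical toy data.

def gather():
    # 1. Data gathering: assemble records from (here, in-memory) sources.
    return [{"amount": 120.0}, {"amount": None}, {"amount": 80.0}, {"amount": 4000.0}]

def prepare(records):
    # 2. Data preparation: drop records with missing values (simple cleansing).
    return [r for r in records if r["amount"] is not None]

def mine(records):
    # 3. Mining: flag values whose distance from the mean exceeds the mean
    # itself (a crude anomaly-detection rule, chosen only for illustration).
    amounts = [r["amount"] for r in records]
    mean = sum(amounts) / len(amounts)
    return [r for r in records if abs(r["amount"] - mean) > mean]

def interpret(outliers):
    # 4. Analysis and interpretation: summarise findings for decision-makers.
    return f"{len(outliers)} suspicious transaction(s) found"

print(interpret(mine(prepare(gather()))))  # prints "1 suspicious transaction(s) found"
```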
Types of data mining techniques
Various techniques can be used to mine data for different data science applications. Pattern
recognition is a common data mining use case that's enabled by multiple techniques, as is
anomaly detection, which aims to identify outlier values in data sets. Popular data mining
techniques include the following types:

 Association rule mining: In data mining, association rules are if-then statements that
identify relationships between data elements. Support and confidence criteria are used to
assess the relationships -- support measures how frequently the related elements appear in
a data set, while confidence reflects how often the if-then statement holds true.
 Classification: This approach assigns the elements in data sets to different categories
defined as part of the data mining process. Decision trees, Naive Bayes classifiers,
k-nearest neighbor and logistic regression are some examples of classification methods.
 Clustering: In this case, data elements that share particular characteristics are grouped
together into clusters as part of data mining applications. Examples include k-means
clustering, hierarchical clustering and Gaussian mixture models.
 Regression: This is another way to find relationships in data sets, by calculating predicted
data values based on a set of variables. Linear regression and multivariate regression are
examples. Decision trees and some other classification methods can be used to do
regressions, too.
 Sequence and path analysis: Data can also be mined to look for patterns in which a
particular set of events or values leads to later ones.
 Neural networks: A neural network is a set of algorithms that simulates the activity of
the human brain. Neural networks are particularly useful in complex pattern recognition
applications involving deep learning, a more advanced offshoot of machine learning.
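The support and confidence criteria from the association-rule bullet above can be computed directly. The following sketch uses a hypothetical toy set of market-basket transactions:

```python
# Support and confidence for an if-then association rule, on hypothetical
# toy market-basket transactions.

transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"butter"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # Of the transactions matching the "if" part, the fraction that also
    # match the "then" part (assumes the antecedent occurs at least once).
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Rule: {bread} -> {butter}
print(support({"bread", "butter"}, transactions))   # 0.5 (2 of 4 transactions)
print(confidence({"bread"}, {"butter"}, transactions))
```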

Benefits of data mining


In general, the business benefits of data mining come from the increased ability to uncover
hidden patterns, trends, correlations and anomalies in data sets. That information can be used
to improve business decision-making and strategic planning through a combination of
conventional data analysis and predictive analytics.

Specific data mining benefits include the following:

 More effective marketing and sales: Data mining helps marketers better understand
customer behavior and preferences, which enables them to create targeted marketing and
advertising campaigns. Similarly, sales teams can use data mining results to improve lead
conversion rates and sell additional products and services to existing customers.
 Better customer service: Thanks to data mining, companies can identify potential
customer service issues more promptly and give contact center agents up-to-date
information to use in calls and online chats with customers.
 Improved supply chain management: Organizations can spot market trends and forecast
product demand more accurately, enabling them to better manage inventories of goods and
supplies. Supply chain managers can also use information from data mining to optimize
warehousing, distribution and other logistics operations.
 Increased production uptime: Mining operational data from sensors on manufacturing
machines and other industrial equipment supports predictive maintenance applications to
identify potential problems before they occur, helping to avoid unscheduled downtime.
 Stronger risk management: Risk managers and business executives can better assess
financial, legal, cybersecurity and other risks to a company and develop plans for
managing them.
 Lower costs: Data mining helps drive cost savings through operational efficiencies in
business processes and reduced redundancy and waste in corporate spending.
Weka – Waikato Environment for Knowledge Analysis

Weka is a collection of machine learning algorithms for data mining tasks. The algorithms
can either be applied directly to a dataset or called from your own Java
code. Weka contains tools for data pre-processing, classification, regression, clustering,
association rules, and visualization.

Preprocessing

Data pre-processing is a data mining technique used to transform raw data into a
useful and efficient format.

Steps Involved in Data Preprocessing:

1. Data Cleaning:
The data can have many irrelevant and missing parts. Data cleaning is done to handle
this. It involves handling missing data, noisy data, etc.

(a). Missing Data:


This situation arises when some values are missing from the data. It can be handled in
various ways.
Some of them are:

 Ignore the tuples:


This approach is suitable only when the dataset we have is quite large and
multiple values are missing within a tuple.

 Fill the Missing values:

There are various ways to do this task. You can choose to fill the missing
values manually, with the attribute mean, or with the most probable value.
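Filling missing values with the attribute mean, mentioned above, can be sketched as follows (hypothetical toy column, with None marking missing entries):

```python
# Mean imputation: replace missing entries (None) in an attribute column
# with the mean of the values that are present.

values = [4.0, None, 6.0, None, 8.0]

present = [v for v in values if v is not None]
mean = sum(present) / len(present)          # (4 + 6 + 8) / 3 = 6.0

filled = [mean if v is None else v for v in values]
print(filled)  # [4.0, 6.0, 6.0, 6.0, 8.0]
```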

(b). Noisy Data:

Noisy data is meaningless data that can’t be interpreted by machines. It can be generated
due to faulty data collection, data entry errors, etc. It can be handled in the following ways:

 Binning Method: This method works on sorted data in order to smooth it. The whole
data is divided into segments of equal size, and various methods are then performed
to complete the task. Each segment is handled separately: one can replace all data in
a segment by its mean, or boundary values can be used to complete the task.

 Regression: Here data can be made smooth by fitting it to a regression function. The
regression used may be linear (having one independent variable) or multiple (having
multiple independent variables).

 Clustering: This approach groups similar data into clusters. Outliers may go
undetected, or they will fall outside the clusters.
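Smoothing by bin means, described under the binning method above, might look like this (hypothetical toy values; a bin size of 3 is assumed):

```python
# Smoothing noisy sorted data by bin means: divide the sorted values into
# equal-size segments and replace each value with its segment's mean.

data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26])
bin_size = 3

smoothed = []
for i in range(0, len(data), bin_size):
    segment = data[i:i + bin_size]
    mean = sum(segment) / len(segment)
    smoothed.extend([mean] * len(segment))

print(smoothed)  # [7.0, 7.0, 7.0, 19.0, 19.0, 19.0, 25.0, 25.0, 25.0]
```

Replacing each value with its segment's boundary value (whichever boundary is closer) would follow the same loop with a different replacement rule.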

2. Data Transformation:

This step transforms the data into forms appropriate for the mining process. It
involves the following methods:

 Normalization:
It is done in order to scale the data values into a specified range (-1.0 to 1.0 or 0.0
to 1.0).

 Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to
help the mining process.

 Discretization:
This is done to replace the raw values of a numeric attribute with interval levels or
conceptual levels.
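The min-max normalization mentioned above, rescaling into the 0.0 to 1.0 range, can be sketched as (hypothetical toy values):

```python
# Min-max normalization: rescale each value into the range 0.0 to 1.0
# using the column's minimum and maximum.

values = [200.0, 300.0, 400.0, 600.0, 1000.0]

lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

print(normalized)  # [0.0, 0.125, 0.25, 0.5, 1.0]
```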

3. Data Reduction:

Data mining is used to handle huge amounts of data, and analysis becomes harder as the
volume grows. To deal with this, we use data reduction techniques, which aim to increase
storage efficiency and reduce data storage and analysis costs.
The various steps to data reduction are:
1. Data Cube Aggregation: Aggregation operation is applied to data for the
construction of the data cube.

2. Attribute Subset Selection: The highly relevant attributes should be used;
the rest can be discarded. For performing attribute selection, one can use the level of
significance and the p-value of the attribute: attributes with a p-value greater than the
significance level can be discarded.

3. Numerosity Reduction: This enables storing a model of the data instead of the
whole data, for example regression models.

4. Dimensionality Reduction: This reduces the size of data through encoding
mechanisms. It can be lossy or lossless. If the original data can be retrieved after
reconstruction from the compressed data, the reduction is called lossless; otherwise it is
called lossy. Two effective methods of dimensionality reduction are wavelet
transforms and PCA (Principal Component Analysis).
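Data cube aggregation, the first reduction step above, can be illustrated by rolling hypothetical quarterly sales up to yearly totals, cutting the number of stored records:

```python
# Data cube aggregation sketch: roll quarterly sales up to yearly totals
# (hypothetical figures), so fewer, coarser records need to be stored.

from collections import defaultdict

quarterly = [
    ("2022", "Q1", 100), ("2022", "Q2", 150),
    ("2022", "Q3", 120), ("2022", "Q4", 130),
    ("2023", "Q1", 110), ("2023", "Q2", 160),
]

yearly = defaultdict(int)
for year, _quarter, amount in quarterly:
    yearly[year] += amount

print(dict(yearly))  # {'2022': 500, '2023': 270}
```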
Screenshots :
Installation Process
Step 1: Go to the official website and download the WEKA tool.
Basic Interface of the WEKA tool:

Step 2: Click on “Explorer” to open the Explorer window.

For data mining purposes we will need options such as “Preprocess”, “Classify”, “Cluster”
and “Associate”.
Data Pre-Processing
Step 1: Go to the location of your data, select the data file and click on Open.

Step 2: Select the dataset to be studied and then click on open.

The dataset’s attributes will be shown on the screen.

(Attribute visualisations shown: “Protocol Type” and “Server Count”.)

We can see the visualisation of all the attributes by clicking on the “Visualize All” button.
Data Classification
Step 1: Open the “Classify” tab.

Step 2: Click on the Choose option and select the “Naïve Bayes” classifier technique.
Step 3: Now select any of the test options from the given menu to classify the dataset and
then click on Start.

Naïve Bayes Classification using Cross Validation (summary part)


Naïve Bayes Classification using Use Training Set (summary part)

Naïve Bayes Classification using Percentage Split
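For intuition about what the Naïve Bayes runs above compute, here is a minimal Gaussian Naïve Bayes with a 66% percentage split in plain Python. The single-feature toy data is hypothetical (it is not the NSL-KDD set, and WEKA itself is not used):

```python
# Minimal Gaussian Naive Bayes with a 66% percentage split, mirroring the
# WEKA workflow in plain Python on hypothetical single-feature data.

import math

# (feature, class) pairs: small values -> "normal", large -> "attack".
data = [(1.0, "normal"), (1.2, "normal"), (0.8, "normal"), (1.1, "normal"),
        (5.0, "attack"), (5.5, "attack"), (4.8, "attack"), (5.2, "attack"),
        (0.9, "normal"), (5.1, "attack"), (1.05, "normal"), (4.9, "attack")]

split = int(len(data) * 0.66)          # percentage split: 66% train, rest test
train, test = data[:split], data[split:]

# Estimate per-class mean, variance and prior from the training set.
stats = {}
for label in {c for _, c in train}:
    xs = [x for x, c in train if c == label]
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs) + 1e-9  # avoid zero variance
    stats[label] = (mean, var, len(xs) / len(train))

def classify(x):
    # Pick the class maximising log(prior) + log Gaussian likelihood.
    def score(label):
        mean, var, prior = stats[label]
        return math.log(prior) - 0.5 * math.log(2 * math.pi * var) \
               - (x - mean) ** 2 / (2 * var)
    return max(stats, key=score)

correct = sum(classify(x) == c for x, c in test)
print(f"accuracy: {correct}/{len(test)}")
```

WEKA's cross-validation option repeats this idea over several train/test folds instead of a single split.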


Clustering
Step 1: Select the “Cluster” tab.

Step 2: Click on the Choose button and select the required clustering algorithm.
Step 3: Here we have selected the “SimpleKMeans” Algorithm.

SimpleKMeans using “Use training set”


SimpleKMeans using “Percentage split”.
SimpleKMeans using “Cluster Evaluation (Nom Class)”.
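The SimpleKMeans runs above can be mimicked with a minimal k-means sketch in plain Python (k = 2 and the one-dimensional points are hypothetical; WEKA's implementation handles many attributes and seeding strategies):

```python
# Minimal k-means (k = 2) on hypothetical one-dimensional data, analogous
# in spirit to WEKA's SimpleKMeans.

def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (keep the old centroid if its cluster is empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 1.5, 1.2, 8.0, 8.5, 8.2]
centroids, clusters = kmeans(points, centroids=[0.0, 10.0])
print(centroids)  # two centroids, near 1.23 and 8.23
```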
