Robust Data Model for Enhanced Anomaly Detection
R. Ravinder Reddy, Dr. Y. Ramadevi, Dr. K. V. N. Sunitha
1 Introduction
2 Related Work
Rough set theory is an extension of conventional set theory [5, 6] that supports
approximations in decision making: a vague concept is approximated by a pair of
precise concepts, called the lower and upper approximations, obtained by classifying
the domain of interest into disjoint categories. The feature selection property of
rough set theory helps in finding reducts of the oversampled dataset; in this way it
not only reduces the size of the dataset but also improves classifier performance.
Feature selection methods are used in the intrusion detection domain to eliminate
unimportant or irrelevant features. Feature selection reduces computational
complexity, removes redundant information, increases the accuracy of the detection
algorithm, facilitates data understanding, and improves generalization.
Concepts from rough set theory are used to define the necessity of features: the
measures of necessity are calculated from the lower and upper approximation
functions, and these measures are employed as heuristics to guide the feature
selection process, deciding which attributes are relevant to the target concept.
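As a concrete illustration of these measures, the sketch below computes the lower
and upper approximations of a target concept over a toy decision table. This is a
minimal sketch in Python; the function and variable names (equivalence_classes,
lower_approx, upper_approx, table) are ours and do not come from the paper.

    def equivalence_classes(table, attrs):
        """Partition object ids into classes that agree on the chosen attributes."""
        classes = {}
        for obj, row in table.items():
            key = tuple(row[a] for a in attrs)
            classes.setdefault(key, set()).add(obj)
        return list(classes.values())

    def lower_approx(classes, target):
        """Objects certainly in the target: union of classes fully contained in it."""
        return {x for c in classes if c <= target for x in c}

    def upper_approx(classes, target):
        """Objects possibly in the target: union of classes that overlap it."""
        return {x for c in classes if c & target for x in c}

    # Toy decision table and the set of objects labeled anomalous.
    table = {1: {"a": 0, "b": 1}, 2: {"a": 0, "b": 1},
             3: {"a": 1, "b": 0}, 4: {"a": 1, "b": 1}}
    anomalies = {1, 3}
    classes = equivalence_classes(table, ["a", "b"])
    print(lower_approx(classes, anomalies))  # {3}: certainly anomalous
    print(upper_approx(classes, anomalies))  # {1, 2, 3}: possibly anomalous

The gap between the two approximations (here, objects 1 and 2) is the boundary
region; attributes that shrink this region are the ones the heuristics favor.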
2.4 Dataset
To evaluate any system, we need a benchmark input against which to compare the
results. For the evaluation of the intrusion detection system we used the "HTTP
dataset CSIC 2010" [7], which contains thousands of automatically generated web
requests. It can be used for testing web attack protection systems, and it was
developed at the Information Security Institute of CSIC (Spanish National Research
Council).
The main motivation behind this choice is that a current problem in web attack
detection is the lack of publicly available datasets for testing WAFs (Web
Application Firewalls). The DARPA dataset [8, 9] has been widely used for intrusion
detection; however, it has been criticized by the IDS community [10]. Regarding web
traffic, the DARPA dataset is out of date and does not include many of the current
attacks, so it is not appropriate for web attack detection. Data privacy is also a
concern in the generation of publicly available datasets, and it is probably one of
the reasons why most of the available HTTP datasets do not target real web
applications. For these reasons, we decided to use the HTTP dataset CSIC 2010.
The HTTP dataset CSIC 2010 contains traffic generated against an e-commerce web
application in which users can buy items using a shopping cart and register by
providing some personal information. As it is a web application in Spanish, the
dataset contains some Latin characters. The dataset was generated automatically and
contains 36,000 normal requests and more than 25,000 anomalous requests. The HTTP
requests are labeled as normal or anomalous, and the dataset includes attacks such
as SQL injection, buffer overflow, information gathering, file disclosure, CRLF
injection, XSS, server-side include, and parameter tampering.
3 Methodologies
In this method we address two issues regarding anomaly detection: feature selection
and balancing the dataset, with the main focus on class balancing. Anomaly datasets
are class-imbalanced, and machine learning techniques such as classification do not
perform as well on them as on normally distributed data. Balancing the dataset
[14, 15] using data mining techniques improves the prediction rate. In this approach
we increase the rare-class data using an oversampling technique; the proposed
approach addresses both issues.
In the process of balancing the dataset, its size may increase, which consumes
system resources. To avoid this problem we apply the rough set approach to reduce
the dimensionality of the dataset; the rough set approach greatly reduces the data
size without affecting classifier accuracy.
In this approach we used the SMOTE algorithm for balancing the dataset. SMOTE
(Synthetic Minority Oversampling Technique) is an oversampling technique that
generates synthetic samples of the minority class in order to balance the dataset.
The SMOTE algorithm [11, 12] oversamples the minority class instances by generating
synthetic examples along the lines joining each instance to its k minority-class
nearest neighbors. The value of k depends upon the amount of oversampling to be
done. The process begins by selecting some point yi and determining its nearest
neighbors yi1 through yik; random numbers r1 through rk are generated for randomized
interpolation between yi and the selected nearest neighbors, as follows (see the
sketch after this list):
1. Take a feature vector (minority sample) and one of its nearest neighbors, and
compute the difference between them.
2. Multiply this difference by a random number between 0 and 1, and add the result
to the feature vector under consideration.
3. This selects a random point along the line segment between the two feature
vectors.
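A minimal NumPy sketch of this interpolation step is given below, assuming
scikit-learn for the nearest-neighbor search. The helper name smote_samples is
illustrative; in practice a library implementation such as imbalanced-learn's
SMOTE would typically be used.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def smote_samples(minority, n_new, k=5, seed=0):
        """Generate n_new synthetic samples by interpolating between minority
        samples and their k nearest minority-class neighbors."""
        rng = np.random.default_rng(seed)
        nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)
        _, idx = nn.kneighbors(minority)           # idx[:, 0] is the point itself
        synthetic = []
        for _ in range(n_new):
            i = rng.integers(len(minority))        # pick a minority sample yi
            j = idx[i][rng.integers(1, k + 1)]     # pick one of its k neighbors
            r = rng.random()                       # random number in [0, 1)
            synthetic.append(minority[i] + r * (minority[j] - minority[i]))
        return np.vstack(synthetic)

Because each synthetic point lies on a segment between two real minority samples,
the minority region is expanded rather than merely replicated.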
Once the data sampling is completed we need to redistribute the records, since the
distribution of samples is also an important issue in the classification process.
To reduce the dimensionality of the data we used the rough set approach,
specifically Johnson's reduct. Johnson's algorithm is a dynamic reduct computation
algorithm [13]. Reduct generation starts with an empty set, RS. Iteratively, each
conditional attribute in the discernibility matrix is evaluated with a heuristic
measure; the attribute with the highest heuristic value is added to RS and deleted
from the discernibility matrix. The algorithm ends when all clauses have been
removed from the discernibility matrix. Pseudocode for Johnson's reduct generation
is given below.
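As the pseudocode figure is not reproduced here, the following Python sketch follows
the description above; the names johnson_reduct and clauses are ours, and we assume
the common heuristic of counting how many remaining discernibility clauses contain
each attribute.

    def johnson_reduct(clauses):
        """Greedy Johnson reduct. clauses is a list of attribute sets, one per
        entry of the discernibility matrix; returns an approximate reduct."""
        reduct = set()
        clauses = [set(c) for c in clauses if c]      # drop empty clauses
        while clauses:
            # Heuristic: frequency of each attribute in the remaining clauses.
            counts = {}
            for c in clauses:
                for a in c:
                    counts[a] = counts.get(a, 0) + 1
            best = max(counts, key=counts.get)        # highest-valued attribute
            reduct.add(best)
            clauses = [c for c in clauses if best not in c]  # remove covered clauses
        return reduct

    # Example: each clause lists the attributes discerning one pair of objects.
    print(johnson_reduct([{"a", "b"}, {"b", "c"}, {"a", "d"}]))  # e.g. {'a', 'b'}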
The reduct generated by Johnson's algorithm may not be optimal; research is still
ongoing on finding an optimal feature set for a given dataset. Here the HTTP dataset
CSIC 2010 is used; it contains 17 conditional features, and the decision attribute
is normal or anomalous. Applying rough set feature selection reduces the 17
conditional features to 8.
Table 1 compares the computational time for the rough set model; Fig. 1 shows the
large difference between the two models.
Fig. 1: Computational time of the rough set approach versus the model without
feature selection.
4 Result Analysis
In this model we used the HTTP dataset CSIC 2010, which contains thousands of
records labeled as normal or anomalous. We applied the SMOTE algorithm to balance
the normal and anomalous records by oversampling, and then randomized the records
to distribute the oversampled records throughout the dataset. Once we obtained the
optimal feature vector we applied the data to an SVM classifier and calculated the
results, using the RBF kernel for evaluation. The classifier performs well on the
balanced data and produces good results compared with the other balancing methods.
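For reference, a minimal sketch of this evaluation pipeline is shown below, assuming
scikit-learn and imbalanced-learn; the data file names are placeholders for the
prepared feature matrix and labels, not files shipped with the dataset.

    import numpy as np
    from imblearn.over_sampling import SMOTE
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.metrics import classification_report

    # Hypothetical files holding the reduct features and normal/anomaly labels.
    X = np.load("csic2010_reduct.npy")
    y = np.load("csic2010_labels.npy")
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    # Oversample the minority class, then shuffle to redistribute the records.
    X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
    perm = np.random.default_rng(0).permutation(len(y_bal))
    X_bal, y_bal = X_bal[perm], y_bal[perm]

    # RBF-kernel SVM, evaluated with precision, recall, and F-measure.
    clf = SVC(kernel="rbf").fit(X_bal, y_bal)
    print(classification_report(y_te, clf.predict(X_te)))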
Table 2: Comparison of results with balanced and rough set reduct datasets.

Measure     Unbalanced   Balanced   Balanced + Rough Set Reduct
Time        2593         2780       1782
Accuracy    99.44        99.85      99.80
Precision   0.994        0.998      0.998
FP Rate     0.005        0.002      0.002
Recall      0.994        0.998      0.998
F-Measure   0.994        0.998      0.998
The empirical results show that using balancing together with the rough set reduct
improves classifier accuracy and reduces computational time. As Table 2 shows, the
false positive rate also decreases, and the computed precision, recall, and
F-measure indicate that the robust approach performs well compared to the earlier
methods. Fig. 2 shows that the hybrid approach outperforms training on the
unbalanced dataset.
In the future, genetic and fuzzy algorithms may be used to obtain better records
using fitness functions and membership values. The feature vector size may also
decrease as optimal feature selection algorithms become available; much research
is still ongoing on finding an optimal feature subset. Using these techniques, a
better anomaly detection model may be achieved, and other classification techniques
can be applied to the well-distributed, balanced data.
References
1. Lee, W., Stolfo, S., Chan, P., Eskin, E., Fan, W., Miller, M., Hershkop, S.,
Zhang, J.: Real time data mining-based intrusion detection. In: DARPA Information
Survivability Conference & Exposition II (DISCEX'01), vol. 1, pp. 89-100 (2001)