Sciencedirect: Survey On Anomaly Detection Using Data Mining Techniques
Procedia Computer Science 60 (2015) 708-713
19th International Conference on Knowledge Based and Intelligent Information and Engineering Systems
Abstract
In the present world, huge amounts of data are stored and transferred from one location to another. Data that is transferred or stored is
exposed to attack. Although various techniques and applications are available to protect data, loopholes exist. Data mining techniques have
therefore emerged to analyze data and identify various kinds of attack, making the data less vulnerable. Anomaly detection uses these data
mining techniques to detect surprising behaviour hidden within data that increases the chances of intrusion or attack. Various hybrid
approaches have also been proposed in order to detect known and unknown attacks more accurately. This paper reviews various data mining
techniques for anomaly detection to provide a better understanding of the existing techniques, which may help interested researchers to work
further in this direction.
© 2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license
Keywords: Anomaly Detection, Clustering, Classification, Data Mining, Intrusion Detection System.
1. Introduction
Intrusion Detection Systems (IDS) are security tools provided to strengthen the security of communication and
information systems. This approach is similar to other measures such as antivirus software, firewalls and access control schemes.
Conventionally, these systems have been classified as signature detection systems, anomaly detection systems or hybrid
detection systems [29]. In signature-based detection, the system identifies patterns of traffic or application data presumed to be
malicious, while anomaly detection systems compare activities against a defined normal behavior. Hybrid intrusion detection
systems combine the techniques of both approaches. Each technique has its own advantages and disadvantages. A few
benefits of anomaly detection techniques over the others can be stated as follows. Firstly, they are capable of detecting insider
attacks. For example, if a user is using a stolen account and performs actions that are beyond the normal profile of that user,
the anomaly detection system will generate an alarm. Secondly, the detection system is based on custom-made profiles, so it
becomes very difficult for an attacker to carry out any activity without setting off an alarm. Finally, it can detect attacks that
are previously unknown, because anomaly detection systems look for anomalous events rather than specific attacks. In this paper
we focus on the various anomaly detection techniques.
1.1. Anomaly Detection
Anomaly detection is the process of finding the patterns in a dataset whose behavior is not normal or expected. These
unexpected behaviors are also termed anomalies or outliers. An anomaly cannot always be categorized as an attack; it can
be a surprising behavior that was previously not known, and it may or may not be harmful. Anomaly detection provides very
significant and critical information in various applications, for example credit card theft or identity theft [1]. When data has to
be analyzed in order to find relationships or to predict known or unknown data, data mining techniques are used. These include
clustering, classification and machine-learning-based techniques. Hybrid approaches are also being created in order to attain a
higher level of accuracy in detecting anomalies. In these approaches the authors try to combine existing data mining algorithms to
derive better results. Detecting abnormal or unexpected behavior thus makes it possible to study it and categorize it as a
new type of attack or a particular type of intrusion. This survey attempts to provide a better understanding of the various
types of data mining approaches to anomaly detection that have been proposed until now.
1877-0509 © 2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license
(http://creativecommons.org/licenses/by-nc-nd/4.0/).
Peer-review under responsibility of KES International
doi:10.1016/j.procs.2015.08.220
Shikha Agrawal and Jitendra Agrawal / Procedia Computer Science 60 (2015) 708-713
1.2. Basic Methodology of anomaly detection technique
Although different anomaly detection approaches exist, as shown in figure 1 they broadly follow the same stages: the data is
parameterized and a model is trained prior to detection.
Parameterization: Pre-processing data into a pre-established format such that it is acceptable to, or in accordance with, the targeted
system's behavior.
Training stage: A model is built on the basis of the normal (or abnormal) behavior of the system. Different ways can
be chosen depending on the type of anomaly detection considered, and training can be either manual or automatic.
Detection stage: Once the model for the system is available, it is compared with the (parameterized or pre-defined) observed
traffic. If the deviation found exceeds (or, in the case of abnormality models, is less than) a pre-defined threshold, then
an alarm is triggered.
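The three stages above can be illustrated with a minimal Python sketch. It is only one possible instance, not the method of any surveyed paper: the record field `bytes`, the mean and standard deviation model, and the threshold of 3 standard deviations are all assumptions made for illustration.

```python
from statistics import mean, stdev

def parameterize(raw_records):
    """Parameterization: map raw records to numeric feature values.
    The 'bytes' field is a hypothetical feature chosen for this sketch."""
    return [float(r["bytes"]) for r in raw_records]

def train(normal_values):
    """Training: summarize normal behavior as a mean and standard deviation."""
    return {"mean": mean(normal_values), "std": stdev(normal_values)}

def detect(model, value, threshold=3.0):
    """Detection: raise an alarm when the deviation from the normal model
    exceeds the pre-defined threshold (assumed here to be 3 std deviations)."""
    deviation = abs(value - model["mean"]) / model["std"]
    return deviation > threshold

# Hypothetical traffic observed during normal operation.
normal_traffic = [{"bytes": b} for b in (100, 110, 90, 105, 95)]
model = train(parameterize(normal_traffic))

alarm_small = detect(model, 102.0)   # close to the normal profile
alarm_large = detect(model, 500.0)   # far from the normal profile
```

A model of abnormal behavior would invert the comparison, raising the alarm when the deviation falls below the threshold.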
2. Anomaly Detection Using Data Mining Techniques
Anomalies are patterns in the data that do not conform to a well-defined normal behavior. The cause of an anomaly may be a
malicious activity or some kind of intrusion. This abnormal behavior found in the dataset is interesting to the analyst, and this is
the most important feature of anomaly detection [14].
Anomaly detection is a topic that has been covered in various surveys, review articles and books [4, 5]. Phua et al. (2010) have
done a detailed survey of various fraud detection techniques that have been proposed in the past few years. They defined
the professional fraudster and the main types and subtypes of known fraud, and also presented the nature of the data evidence collected
within affected industries [6]. Padhy et al. (2012) provided a detailed survey of data mining applications and their feature scope.
They stated that anomaly detection is an application of data mining where various data mining techniques can be applied [3].
Amanpreet, Mishra, and Kumar (2012) described ready-made data mining techniques that can be applied directly to detect
intrusions [7]. García et al. (2009) have surveyed the most relevant works in the field of automatic network intrusion detection
[15]. They provided a wide perspective on the techniques that can be practically deployed by examining the possible causes for
the lack of acceptance of the proposed novel approaches.
In this paper, the review of different approaches to anomaly detection focuses on the broad classification of existing data mining
techniques. Data mining consists of four classes of task: association rule learning, clustering, classification and
regression. The next subsection presents anomaly detection techniques under these four classes of task:
Fuzzy logic rules are applied to these portions of data to classify them as normal or malicious. There are various
other fuzzy data mining techniques for extracting patterns that represent normal behaviour for intrusion detection;
they describe a variety of modifications to existing data mining algorithms in order to increase efficiency and
accuracy [17].
Naïve Bayes network: There are many cases where statistical dependencies or causal relationships between
system variables exist. It can be difficult to express the probabilistic relationships among these variables precisely. In
other words, the prior knowledge about the system is simply that some variables might be influenced by others. To
take advantage of this structural relationship between the random variables of a problem, a probabilistic graph model
called the Naïve Bayesian Network (NB) can be used. This model answers questions such as: given a few observed
events, what is the probability of a particular kind of attack? This can be computed using the formula for
conditional probability. The structure of an NB is typically represented by a Directed Acyclic Graph (DAG) in which
each node represents one of the system variables and each link encodes the influence of one node upon another [21].
When decision tree and Bayesian techniques are compared, the accuracy of the decision tree is far better, but the
computational time of the Bayesian network is lower [19]. Hence, when the data set is very large, it is efficient to use
NB models.
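The question "given a few observed events, what is the probability of a particular kind of attack" can be worked through with Bayes' rule under the naive independence assumption. The sketch below is illustrative only: the event names, priors and likelihoods are hypothetical numbers, not values from any cited study.

```python
def posterior(priors, likelihoods, observed):
    """Return P(class | observed events), assuming the events are
    conditionally independent given the class (the naive assumption)."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for event in observed:
            score *= likelihoods[cls][event]   # P(event | class)
        scores[cls] = score
    total = sum(scores.values())               # normalizing constant
    return {cls: s / total for cls, s in scores.items()}

# Hypothetical prior class probabilities and per-class event likelihoods.
PRIORS = {"normal": 0.9, "attack": 0.1}
LIKELIHOODS = {
    "normal": {"failed_login": 0.10, "port_scan": 0.05},
    "attack": {"failed_login": 0.80, "port_scan": 0.50},
}

result = posterior(PRIORS, LIKELIHOODS, ["failed_login", "port_scan"])
```

With these assumed numbers the unnormalized scores are 0.9 x 0.10 x 0.05 = 0.0045 for normal and 0.1 x 0.80 x 0.50 = 0.04 for attack, so the posterior probability of an attack is 0.04 / 0.0445, roughly 0.90, even though attacks are rare a priori.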
Genetic Algorithm: It was introduced in the field of computational biology. These algorithms belong to the larger
class of Evolutionary Algorithms (EA). They generate solutions to optimization problems using techniques inspired
by natural evolution, such as inheritance, selection, mutation and crossover. Since then, they have been applied in
various fields with very promising results. In intrusion detection, the Genetic Algorithm (GA) is applied to derive a
set of classification rules from the network audit data. The support-confidence framework is used as a fitness
function to judge the quality of each rule. Significant properties of GA are its robustness against noise and its
self-learning capabilities. The advantages of GA techniques reported for anomaly detection are a high attack
detection rate and a lower false-positive rate [17].
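As a rough sketch of how a GA might search for a classification rule with a support-confidence fitness, consider the Python example below. Everything specific in it is an assumption for illustration: the binary features, the toy audit records, the rule encoding with wildcards, the fitness weights 0.2 and 0.8, and the GA parameters.

```python
import random

# Hypothetical audit records: (binary feature flags, is_attack).
RECORDS = [
    ((1, 1, 0), True), ((1, 1, 1), True), ((1, 0, 1), True),
    ((0, 0, 1), False), ((0, 1, 0), False), ((0, 0, 0), False),
    ((1, 0, 0), False), ((0, 1, 1), False),
]

def matches(rule, features):
    """A rule is a tuple of 0, 1 or None per feature; None is a wildcard."""
    return all(r is None or r == f for r, f in zip(rule, features))

def fitness(rule, records, w_support=0.2, w_confidence=0.8):
    """Weighted support and confidence of 'if rule matches then attack'."""
    covered = [is_attack for feats, is_attack in records if matches(rule, feats)]
    if not covered:
        return 0.0
    hits = sum(covered)                      # matched records that are attacks
    support = hits / len(records)
    confidence = hits / len(covered)
    return w_support * support + w_confidence * confidence

def evolve(records, generations=30, pop_size=12, seed=0):
    rng = random.Random(seed)
    n = len(records[0][0])
    genes = (0, 1, None)
    population = [tuple(rng.choice(genes) for _ in range(n))
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda r: fitness(r, records), reverse=True)
        parents = population[: pop_size // 2]        # selection, keeps the best
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)                # one-point crossover
            child = list(a[:cut] + b[cut:])
            if rng.random() < 0.2:                   # mutation of one gene
                child[rng.randrange(n)] = rng.choice(genes)
            children.append(tuple(child))
        population = parents + children
    return max(population, key=lambda r: fitness(r, records))

best_rule = evolve(RECORDS)
```

Because the better half of each generation is carried over unchanged, the best fitness in the population never decreases from one generation to the next.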
Neural Networks: A neural network is a set of interconnected nodes designed to imitate the functioning of the human brain. Each
node has a weighted connection to several other nodes in neighbouring layers. Individual nodes take the input
received from connected nodes and use the weights together with a simple function to compute output values. Neural
networks can be constructed for supervised or unsupervised learning [20]. The user specifies the number of hidden
layers as well as the number of nodes within each hidden layer. Depending on the application, the output layer of
the neural network may contain one or several nodes. Multilayer Perceptron (MLP) neural networks have been
very successful in a variety of applications, producing more accurate results than other existing computational
learning models. They are capable of approximating any continuous function to arbitrary accuracy, as long as they
contain enough hidden units. This means that such models can form any classification decision boundary in feature
space and thus act as a non-linear discriminant function.
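The node computation described above (a weighted sum passed through a simple function) and the non-linear decision boundary of a multilayer network can be seen in a very small example. The sketch below uses hand-picked weights, assumed purely for illustration, rather than trained ones. It builds a two-layer network that computes XOR, a function that no single linear node can represent.

```python
def step(z):
    """Simple threshold activation."""
    return 1 if z > 0 else 0

def node(inputs, weights, bias):
    """One node: weighted sum of its inputs plus a bias, then the activation."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)

def mlp_xor(x1, x2):
    """Two hidden nodes and one output node with hand-picked weights."""
    h_or = node((x1, x2), (1.0, 1.0), -0.5)     # fires if either input is 1
    h_and = node((x1, x2), (1.0, 1.0), -1.5)    # fires only if both are 1
    return node((h_or, h_and), (1.0, -1.0), -0.5)

outputs = [mlp_xor(a, b) for a, b in ((0, 0), (0, 1), (1, 0), (1, 1))]
```

In practice the weights are learned from data rather than set by hand; the example only shows that a hidden layer is what makes the non-linear boundary possible.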
Support Vector Machine: These are a set of related supervised learning methods used for classification and
regression. The Support Vector Machine (SVM) is widely applied in the field of pattern recognition and is also used in
intrusion detection systems. The one-class SVM is trained on a set of examples belonging to one particular class, with no
negative examples, rather than on both positive and negative examples [18]. When compared with neural networks on the KDD
Cup data set, SVM was found to outperform NN in terms of false alarm rate and accuracy for most kinds of
attack [18].
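To make the supervised, two-class case concrete, the following sketch trains a linear SVM by subgradient descent on the regularized hinge loss. It is a simplified stand-in, not the one-class variant and not the algorithm evaluated in [18]; the two-dimensional toy points, the learning rate, the regularization constant and the number of epochs are all assumptions.

```python
def train_linear_svm(samples, labels, epochs=200, lr=0.01, lam=0.01):
    """Minimize the L2-regularized hinge loss with plain subgradient descent.
    Labels must be -1 (normal) or +1 (attack)."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            w = [wi - lr * lam * wi for wi in w]      # regularization step
            if margin < 1:                            # inside margin or wrong side
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    """Classify by the side of the separating hyperplane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Hypothetical, already-centered feature vectors.
NORMAL = [(-2.0, -2.0), (-1.0, -2.0), (-2.0, -1.0), (-1.0, -1.0)]
ATTACK = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0)]
SAMPLES = NORMAL + ATTACK
LABELS = [-1] * len(NORMAL) + [1] * len(ATTACK)

w, b = train_linear_svm(SAMPLES, LABELS)
predictions = [predict(w, b, x) for x in SAMPLES]
```

A one-class SVM would instead be fitted on the normal examples alone and would flag points that fall outside the learned region.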
was achieved [24]. A new approach for the detection of network attacks was proposed that aims to study the effectiveness of
machine-learning-based methods in intrusion detection, including artificial neural networks and support vector
machines. The experimental results obtained by applying this approach to the KDD CUP'99 data set
demonstrate that the proposed approach achieves high performance, especially for U2R and R2L type attacks [25]. A
hybrid approach combining the entropy of network features and SVM has been proposed that outperformed the
individual entropy and SVM techniques [2]. Thus hybrid approaches yield better results, as combining different
techniques overcomes the drawbacks of each and results in higher accuracy of anomaly detection. Table 1
presents a few hybrid approaches proposed for anomaly detection:
Table 1: Compilation of hybrid approaches for anomaly detection
Author Name | Methods used | Methodology
Chitrakar, Roshan, and Chuanhe (2012) | SVM classification and k-medoids clustering | Similar data instances are grouped by the k-medoids technique and the resulting clusters are classified using SVM classifiers
Chitrakar, Roshan, and Chuanhe (2012) | |
Yasami and Mozaffari (2009) | |
Peddabachigari, Abraham, Grosan and Thomas (2007) | |
Peddabachigari, Abraham, Grosan and Thomas (2007) | Ensemble approach |
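One of the hybrid approaches discussed before Table 1 combines the entropy of network features with an SVM [2]. As an illustration of the first ingredient only, the sketch below computes the Shannon entropy of a single traffic feature within a window. The choice of destination port as the feature and the sample windows are assumptions, not details taken from [2].

```python
from collections import Counter
from math import log2

def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Hypothetical windows of destination ports seen in network traffic.
spread_window = [80, 443, 22, 53]        # four different ports, equally often
focused_window = [80, 80, 80, 80]        # all traffic aimed at a single port

entropy_spread = shannon_entropy(spread_window)
entropy_focused = shannon_entropy(focused_window)
```

A sudden rise or drop in such an entropy value between windows is the kind of signal that could be passed to a classifier as an input feature.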
References
1. Chandola V., Banerjee A., Kumar V., Anomaly detection: A survey, ACM Computing Surveys (CSUR); 41(3); 2009; p. 15.
2. Agarwal B., Mittal N., Hybrid Approach for Detection of Anomaly Network Traffic using Data Mining Techniques, Procedia Technology; 6; 2012; p. 996-1003.
3. Padhy N., Mishra P. , Panigrahi R., The Survey of Data Mining Applications and Feature Scope; International Journal of Computer Science, Engineering and
Information Technology (IJCSEIT), 2(3) ;2012; p. 43-58.
4. Lee W., Stolfo S. J., Data mining approaches for intrusion detection; Proceedings of the 7th USENIX Security Symposium, San Antonio, Texas;
1998; p. 79-94.
5. Lee W., Stolfo S.J., Mok K.W., Adaptive intrusion detection: A data mining approach; Artificial Intelligence Review;14(6);2000; p. 533-567.
6. Phua C., Lee V., Smith K., Gayler R., A comprehensive survey of data mining-based fraud detection ; research; 2010; p. 1-14.
7. Chauhan A., Mishra G. , Kumar G. , Survey on Data mining Techniques in Intrusion Detection; International Journal of Scientific & Engineering Research ;
2(7), 2011; p.1-4.
8. Xu L., Yeh Y. R., Lee Y. J., Li J., A Hierarchical Framework Using Approximated Local Outlier Factor for Efficient Anomaly Detection; Procedia
Computer Science ; 19; 2013; p. 1174-1181.
9. T. Pang-Ning, M. Steinbach, V. Kumar, Introduction to data mining, Library of Congress, 2006.
10. Munz,G., Li S., Carle G., Traffic Anomaly Detection Using K-Means Clustering; GI/ITG Workshop MMBnet; 2007;p.1-8.
11. Syarif I., Prugel-Bennett A., Wills G., Data mining approaches for network intrusion detection from dimensionality reduction to misuse and anomaly
detection; Journal of Information Technology Review ; 3(2); 2012; p. 70-83.
12. Han J., Kamber M., Data Mining: Concepts and Techniques, 2nd edition, Morgan Kaufmann, 2006.
13. Berkhin P., A survey of clustering data mining techniques;Grouping multidimensional data; Springer Berlin Heidelberg; 2006; p. 25-71.
14. Dokas P., Ertoz L., Kumar V., Lazarevic A., Srivastava J., Tan P. N., Data mining for network intrusion detection, In Proceedings of NSF Workshop on
Next Generation Data Mining; 2002; p. 21-30.
15. Garcia-Teodoro P., Diaz-Verdejo J., Maciá-Fernández G., Vázquez E., Anomaly-based network intrusion detection: Techniques, systems and challenges;
Computers and security; 28(1); 2009; p. 18-28.
16. Wu S. Y., Yen E., Data mining-based intrusion detectors; Expert Systems with Applications; 36( 3); 2009; p. 5605-5612.
17. Kaur N., Survey paper on Data Mining techniques of Intrusion Detection;International Journal of Science, Engineering and Technology Research; 2( 4);
2013; p. 799-804.
18. Tang D. H., Cao Z.,Machine Learning-based Intrusion Detection Algorithm; Journal of Computational Information Systems;5(6); 2009; p. 1825-1831.
19. Amor N. B., Benferhat S., Elouedi Z., Naive Bayes vs decision trees in intrusion detection systems, In Proceedings of the ACM symposium on Applied
computing; 2004; p. 420-424
20. Kou Y., Lu C. T., Sirwongwattana S., Huang Y. P., Survey of fraud detection techniques; In Proceedings of the IEEE International conference Networking,
sensing and control; 2; 2004; p. 749-754.
21. Tsai C. F., Hsu Y. F., Lin C. Y., Lin W. Y., Intrusion detection by machine learning: A review; Expert Systems with Applications; 36(10); 2009; p. 11994-12000.
22. Farid D. M., Harbi N., Rahman M. Z., Combining naive bayes and decision tree for adaptive intrusion detection; International Journal of Network Security &
Its Applications (IJNSA);2( 2);2010;p. 12-25.
23. Fu S., Liu J., Pannu H., A Hybrid Anomaly Detection Framework in Cloud Computing Using One-Class and Two-Class Support Vector Machines; In
Advanced Data Mining and Applications; Springer Berlin Heidelberg; 2012; p. 726-738.
24. Yasami Y., Mozaffari S. P., A novel unsupervised classification approach for network anomaly detection by k-Means clustering and ID3 decision tree
learning methods; The Journal of Supercomputing; 53(1); 2010; p. 231-245.
25. Tang D. H., Cao Z., Machine Learning-based Intrusion Detection Algorithms; Journal of Computational Information Systems; 5( 6); 2009; p. 1825-1831.
26. Chitrakar R., Chuanhe H., Anomaly based Intrusion Detection using Hybrid Learning Approach of combining k-Medoids Clustering and Naïve Bayes
Classification, In Proceedings of 8th IEEE International Conference on Wireless Communications, Networking and Mobile Computing (WiCOM); 2012;
p. 1-5.
27. Chitrakar R., Chuanhe H., Anomaly detection using Support Vector Machine classification with k-Medoids clustering; In Proceedings of IEEE Third Asian
Himalayas International Conference on Internet (AH-ICI); 2012; p. 1-5.
28. Peddabachigari S., Abraham A., Grosan C., Thomas J., Modeling intrusion detection system using hybrid intelligent systems; Journal of network and
computer applications; 30( 1); 2007; p. 114-132.
29. Patcha A., Park J. M., An overview of anomaly detection techniques: Existing solutions and latest technological trends; Computer Networks; 51(12); 2007;
p. 3448-3470.