BAT: Deep Learning Methods on Network Intrusion Detection
ABSTRACT Intrusion detection can identify unknown attacks from network traffic and has become an effective means of network security. Nowadays, existing methods for network anomaly detection are usually based on traditional machine learning models such as KNN and SVM. Although these methods can obtain some outstanding features, they achieve relatively low accuracy and rely heavily on manually designed traffic features, which is obsolete in the age of big data. To solve the problems of low accuracy and feature engineering in intrusion detection, a traffic anomaly detection model called BAT is proposed. The BAT model combines BLSTM (Bidirectional Long Short-Term Memory) and an attention mechanism. The attention mechanism is used to screen the network flow vector composed of the packet vectors generated by the BLSTM model, which can obtain the key features for network traffic classification. In addition, we adopt multiple convolutional layers to capture the local features of the traffic data. As multiple convolutional layers are used to process the data samples, we refer to this model as BAT-MC. A softmax classifier is used for network traffic classification. The proposed end-to-end model does not use any feature engineering skills and can automatically learn the key features of the hierarchy. It can describe the network traffic behavior well and effectively improve the ability of anomaly detection. We test our model on a public benchmark dataset, and the experimental results demonstrate that our model has better performance than other comparison methods.
INDEX TERMS Network Traffic, Intrusion Detection, Deep Learning, BLSTM, Attention Mechanism
Intrusion detection methods based on deep learning have been proposed successively. In [9], the authors propose a malware traffic classification method based on a convolutional neural network that treats traffic data as images. This method does not need manually designed features and directly takes the original traffic as the input to the classifier. In [10], the authors analyze the viability of Recurrent Neural Networks (RNN) for detecting the behavior of network traffic by modeling it as a sequence of states that change over time. In [11], the authors verify the performance of the Long Short-Term Memory (LSTM) network in classifying intrusion traffic. Experimental results show that LSTM can learn all the attack classes hidden in the training data. All the above methods treat the entire network traffic as a whole consisting of a sequence of traffic bytes. They do not make full use of the domain knowledge of network traffic. For example, CNN converts continuous network traffic into images for processing, which is equivalent to treating traffic records as independent and ignores the internal relations of network traffic. Firstly, network traffic has a hierarchical structure: network traffic is a traffic unit composed of multiple data packets, and a data packet is a traffic unit composed of multiple bytes. Secondly, traffic features in the same and in different packets are significantly different, so sequential features between different packets need to be extracted independently. In other words, not all traffic features are equally important for traffic classification when extracting features from a certain network traffic.

However, little prior work has utilized the above-mentioned structure of network traffic. Inspired by these characteristics, in this paper we propose and demonstrate a method to analyze network traffic from an overall view. Network traffic is generally collected at fixed time intervals. Repeating this collecting process m times, we can get the network traffic X', where X' = (x'_1, x'_2, ..., x'_m) is a matrix with m data packets. Each x' represents a data packet, and each data packet is seen as a whole consisting of a sequence of traffic bytes. Before entering the data into the BAT model, the original data is preprocessed by multiple convolutional layers. Global features can be obtained as the number of convolutional layers increases. With this preprocessing, we get an abstract representation X of the network traffic from X'. In order to make full use of the domain knowledge of network traffic, we propose a deep learning model, BAT-MC, that mainly combines bidirectional long short-term memory (BLSTM) [12] and an attention mechanism [13]. BLSTM is used to learn the characteristics of each packet and obtain the vector corresponding to each packet. The attention mechanism is then used to perform feature learning on the sequence data composed of the packet vectors to obtain fine-grained features. At this point, the key features of the network traffic have been extracted via the attention mechanism. The whole process of feature learning does not use any feature engineering skills. The automatically learnt key features can better describe the traffic behavior, which effectively improves the anomaly detection capability. Finally, a fully connected network and a softmax function are applied to the obtained fine-grained features for anomaly detection. To verify the effectiveness of the BAT-MC network, it is comprehensively evaluated on the NSL-KDD dataset and obtains the best results. The accuracy of the BAT-MC network can reach 84.25%, which is about 4.12% and 2.96% higher than the existing CNN and RNN models, respectively.

The following are the key contributions and findings of our work:
1) We propose an end-to-end deep learning model, BAT-MC, that is composed of BLSTM and an attention mechanism. BAT-MC can well solve the problem of intrusion detection and provides a new research method for intrusion detection.
2) We introduce the attention mechanism into the BLSTM model to highlight the key input. The attention mechanism conducts feature learning on the sequential data composed of data packet vectors. The obtained feature information is reasonable and accurate.
3) We compare the performance of BAT-MC with traditional deep learning methods. The BAT-MC model can extract information from each packet; by making full use of the structure information of network traffic, it can capture features more comprehensively.
4) We evaluate our proposed network on the real NSL-KDD dataset. The experimental results show that the performance of BAT-MC is better than that of the traditional methods.

The rest of the paper is organized as follows. In Section 2, we give a brief overview of the related work, especially how intelligent algorithms facilitate the development of intrusion detection. In Section 3, we present the details of the proposed BAT-MC model. In Section 4, we explain the experimental setup and present our results; the performance of the BAT-MC model is compared with other machine learning methods in both binary and multiclass classification. Section 5 draws the conclusions.

II. RELATED WORKS
Intrusion detection technology can be divided into three major categories: pattern matching methods, traditional machine learning methods and deep learning methods.

At the beginning, people mainly used pattern matching algorithms for intrusion detection. The pattern matching algorithm [14], [15] is the core algorithm of intrusion detection systems based on feature matching. Most such algorithms have been considered for use in the past. In [16], the authors summarize pattern matching algorithms in intrusion detection systems: the KMP, BM, BMH, BMHS, AC and AC-BM algorithms. Experiments show that the improved algorithm can accelerate the matching speed and has good time performance. In [17], the Naive approach, the Knuth-Morris-Pratt algorithm and the Rabin-Karp algorithm are compared in order to check which of them is most efficient in pattern/intrusion detection. Pcap files have been used as datasets in order to
determine the efficiency of the algorithms by taking their respective running times into consideration. These traditional pattern recognition algorithms have serious defects and cannot achieve the desired effect of intrusion detection. Finding an efficient algorithm that reaches high efficiency and a low false positive rate is still the focus of current work. With the development of artificial intelligence, the application of intelligent algorithms to intrusion detection has become a new research hotspot.

Traffic anomaly detection methods based on machine learning have achieved a lot of success. In [18], the authors propose a new method of feature selection and classification based on the support vector machine (SVM). Experimental results on the NSL-KDD Cup 99 intrusion detection dataset showed that the classification accuracy of this method with all training features reached 99%. In [19], the authors combine k-means clustering with a KNN classifier. The experimental results on the NSL-KDD dataset show that this method greatly improves the performance of the KNN classifier. In [20], the authors propose a new framework that combines misuse and anomaly detection, in which they apply the random forests algorithm. Experimental results show that the overall detection rate of the hybrid system is 94.7% and the overall false positive rate is 2%. In [21], the performance on the NSL-KDD dataset is evaluated via an Artificial Neural Network (ANN). The detection rates obtained are 81.2% and 79.9% for the intrusion detection and attack type classification tasks, respectively. In [22], an intrusion detection method based on a decision tree (DT) is proposed. Experimental results of feature selection using the correlation-based feature selection (CFS) subset evaluation method show that the DT-based intrusion detection system has a higher accuracy. As described above, machine learning methods have been proposed and have achieved success for intrusion detection systems. However, these methods require large-scale preprocessing and complex feature engineering of the traffic data, which makes it very difficult to solve the massive intrusion data classification problem with traditional machine learning methods.

With the superior performance of deep learning in image recognition [23] [24] and speech recognition [25] [26], traffic anomaly detection methods based on deep learning have been proposed. In [27], the authors use Self-taught Learning (STL) on the NSL-KDD dataset for network intrusion detection. Testing results show that their 5-class classification achieved an average f-score of 75.76%. In [28], the authors propose an intrusion detection method using a deep belief network (DBN) and a probabilistic neural network (PNN). The experimental result on the KDD Cup 1999 dataset shows that the method performs better than the traditional PNN, PCA-PNN and unoptimized DBN-PNN. Similarly, [29] and [30] train a DBN as a classifier to detect intrusions. In [31], the authors propose a novel network intrusion detection model utilizing convolutional neural networks (CNNs). The CNN model not only reduces the false alarm rate (FAR) but also improves the accuracy of the classes with small numbers of samples. In [32], an artificial intelligence (AI) intrusion detection system using a deep neural network (DNN) is investigated and tested with the KDD Cup 99 dataset in response to ever-evolving network attacks. The results show a significantly high accuracy and detection rate, averaging 99%. However, current deep learning methods do not make full use of the structured information of network traffic. Network traffic is essentially a kind of time series data. Similar to the structure of letters, words, sentences and paragraphs in natural language processing (NLP), network traffic is composed of multiple data packets, and each data packet is a set of multiple bytes.

In this paper, drawing on the application of deep learning methods in NLP, we adopt phased processing. The BLSTM is used to learn the sequential features in each data packet to obtain a vector corresponding to that packet. Then, an attention layer is used to perform feature learning on the sequential data composed of the packet vectors. Attention can filter the characteristics to obtain a network flow vector, which helps to achieve more accurate network traffic classification. Through the two-phase learning of BLSTM and attention on the time series features, the BAT-MC model finally outputs a network flow vector that contains the structured information of the network traffic. Hence, the BAT-MC model makes full use of the structure information of network traffic.

III. PROPOSED WORK
As shown in Figure 1, the BAT-MC model consists of five components, including the input layer, multiple convolutional layers, the BLSTM layer, the attention layer and the output layer, from bottom to top. At the input layer, the BAT-MC model converts each traffic byte into a one-hot data format; each traffic byte is encoded as an n-dimensional vector. After the traffic bytes are converted into a numerical form, we perform normalization operations. At the multiple convolutional layers, we convert the numerical data into traffic images. The convolutional operation is used as a feature extractor that takes an image representation of the data packet. At the BLSTM layer, the BLSTM model, which connects a forward LSTM and a backward LSTM, is used to extract features from the traffic bytes of each packet. The BLSTM model can learn the sequential characteristics within the traffic bytes because BLSTM is well suited to the structure of network traffic. In the attention layer, the attention mechanism is used to analyze the importance of the packet vectors to obtain fine-grained features which are more salient for malicious traffic detection. At the output layer, the features generated by the attention mechanism are imported into a fully connected layer for feature fusion, which obtains the key features that accurately characterize the network traffic behavior. Finally, the fused features are fed into a classifier to get the final recognition results.

A. DATA PREPROCESSING LAYER
There are three symbolic data types in the NSL-KDD data features: protocol type, flag and service. We use a one-hot encoder to map these features into binary vectors.
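The following is a minimal sketch of this preprocessing step, assuming the NSL-KDD records have been loaded into a pandas DataFrame; the column names and the use of pandas/scikit-learn here are illustrative assumptions, not part of the original implementation.

```python
# Illustrative sketch of the data preprocessing layer (Section III-A):
# one-hot encode the three symbolic features and min-max normalize the rest.
# Column names are assumptions; the NSL-KDD CSV files ship without headers.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

SYMBOLIC = ["protocol_type", "service", "flag"]   # e.g. protocol_type: tcp/udp/icmp

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    numeric_cols = [c for c in df.columns if c not in SYMBOLIC + ["label"]]
    # One-hot encoding, e.g. tcp -> [1, 0, 0], udp -> [0, 1, 0], icmp -> [0, 0, 1]
    df = pd.get_dummies(df, columns=SYMBOLIC)
    # Min-max normalization into [0, 1], as in equation (1)
    df[numeric_cols] = MinMaxScaler().fit_transform(df[numeric_cols])
    return df
```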
FIGURE 1. The architecture of the BAT-MC model. The whole architecture is divided into five parts.
One-hot processing. The NSL-KDD dataset is processed by the one-hot method to transform symbolic features into numerical features. For example, the second feature of an NSL-KDD data sample is the protocol type, which has three values: tcp, udp and icmp. The one-hot method encodes it into a binary code that can be recognized by a computer, where tcp is [1, 0, 0], udp is [0, 1, 0] and icmp is [0, 0, 1].

Normalization processing. The values of the original data may be too large, resulting in problems such as large numbers "eating" decimals, data processing overflows and inconsistent weights. We use min-max scaling to normalize the continuous data into the range [0, 1]. Normalization eliminates the influence of the measurement unit on the model training and makes the training result depend more on the characteristics of the data itself. The formula is shown in equations (1) and (2):

r' = \frac{r - r_{min}}{r_{max} - r_{min}},   (1)

r_{max} = \max\{r\},   (2)

where r stands for the numeric feature value, r_{min} stands for the minimal value of the feature, r_{max} stands for the maximal value, and r' stands for the value after normalization.

B. MULTIPLE CONVOLUTIONAL LAYERS
After the above processing operations, convolutional layers are used to capture the local features of the traffic data. The convolutional layer [33] [34] is the most important part of a CNN; it convolves the input images (or feature maps) with multiple convolutional kernels to create different feature maps. According to [35], the shallower convolutional layers, whose receptive field is narrow, extract local information, while the deeper layers can capture global information with a larger field of view. Hence, as the number of convolutional layers increases, the scale of the convolutional features gradually becomes coarser. In this paper, the input of the convolutional layer can be formulated as a tensor of size H × W × 1, where H and W denote the height and width of the data yielded by the normalization processing. Suppose we have a layer of N units as input, followed by a convolutional layer. If we use a filter w of width m, the convolutional output will have N − m + 1 units. The convolutional calculation is shown in equation (3):

x_{i,k}^{l,j} = f\left(b_j + \sum_{a=1}^{m} w_{a,k}^{j}\, r_{i+(k-1)\times s+a-1}^{l-1,j}\right),   (3)

where x_{i,k}^{l,j} is the i-th unit of the j-th feature map of the k-th section in the l-th layer, and s is the range of the section. f is a non-linear mapping, usually the hyperbolic tangent function tanh(·).
C. BLSTM LAYER
For the time series data composed of traffic bytes, BLSTM can effectively use the context information of the data for feature learning. The BLSTM is used to learn the time series features in the data packet. The traffic bytes of each data packet are sequentially input into a BLSTM, which finally produces a packet vector for each packet. Each LSTM unit regulates the information flow through a forget gate, an input gate and an output gate.

The forget gate takes the output of the hidden layer h_{t-1} at the previous moment and the input x_t at the current moment as input to selectively forget information in the cell state C_t, which can be expressed as:

f_t = \mathrm{sigmoid}(W_{xf} x_t + W_{hf} h_{t-1} + b_f),   (4)

The input gate cooperates with a tanh function to control the addition of new information. tanh generates a new candidate vector \tilde{C}_t. The input gate generates a value between 0 and 1 for each item in \tilde{C}_t to control how much new information will be added, which can be expressed as:

C_t = \mathrm{sigmoid}(f_t \cdot C_{t-1} + i_t \cdot \tilde{C}_t),   (5)

i_t = \mathrm{sigmoid}(W_{xi} x_t + W_{hi} h_{t-1} + b_t),   (6)

\tilde{C}_t = \tanh(W_c x_t + W_c h_{t-1} + b_c),   (7)

The output gate is used to control how much of the current unit state will be filtered out, which can be expressed as:

o_t = \mathrm{sigmoid}(W_{xo} x_t + W_{ho} h_{t-1} + b_o),   (8)

For the BLSTM model at time t, the hidden state h_t, which is the packet vector generated for each packet, is obtained from the backward state \overleftarrow{h}_t and the forward state \overrightarrow{h}_t, which can be expressed as:

h_t = \overleftarrow{h}_t + \overrightarrow{h}_t,   (9)

D. ATTENTION LAYER
The packet vectors h_t produced by the BLSTM layer are first passed through a non-linear transformation to obtain an implicit representation u_t:

u_t = \tanh(W_w h_t + b_w),   (12)

We next measure the importance of the packet vectors based on the similarity of the representation u_t with a context vector u_w and obtain the normalized importance weight coefficient \alpha_t. u_w is a randomly initialized matrix that can focus on the important information in u_t. The weight coefficient for the above coarse-grained features can be expressed as:

\alpha_t = \frac{\exp(u_t^{T} u_w)}{\sum_{t}\exp(u_t^{T} u_w)},   (13)

Finally, the fine-grained feature s can be computed as the weighted sum of h_t based on \alpha_t, which can be expressed as:

s = \sum_{t} \alpha_t h_t,   (14)

The fine-grained feature vector s generated by the attention mechanism is used for malicious traffic recognition with a softmax classifier, which can be expressed as:

y = \mathrm{softmax}(W_h s + b_h),   (15)

where W_h represents the weight matrix of the classifier, which maps s to a new vector of length h, and h is the number of categories of network traffic.
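A minimal Keras sketch of the BLSTM layer, the attention computation of equations (12)-(14) and the softmax classifier of equation (15) is given below. The BLSTM size follows Table 3; the sequence length, the input dimensionality and the omission of the additional dense and dropout layers of Algorithm 1 are simplifying assumptions.

```python
# Sketch of the BLSTM + attention core (equations (12)-(15)); dimensions are
# illustrative and the extra dense/dropout layers of Algorithm 1 are omitted.
import tensorflow as tf
from tensorflow.keras import layers

class Attention(layers.Layer):
    """u_t = tanh(W_w h_t + b_w); alpha_t = softmax(u_t^T u_w); s = sum_t alpha_t h_t."""
    def build(self, input_shape):
        d = int(input_shape[-1])
        self.W_w = self.add_weight(name="W_w", shape=(d, d))
        self.b_w = self.add_weight(name="b_w", shape=(d,))
        self.u_w = self.add_weight(name="u_w", shape=(d, 1))

    def call(self, h):                                                   # h: (batch, T, d)
        u = tf.tanh(tf.einsum("btd,de->bte", h, self.W_w) + self.b_w)    # eq. (12)
        alpha = tf.nn.softmax(tf.einsum("btd,dk->btk", u, self.u_w), axis=1)  # eq. (13)
        return tf.reduce_sum(alpha * h, axis=1)                          # eq. (14)

T, d_in, n_classes = 10, 60, 5     # time steps, input size, traffic categories (assumed)
inputs = tf.keras.Input(shape=(T, d_in))
h = layers.Bidirectional(layers.LSTM(80, return_sequences=True))(inputs)  # packet vectors h_t
s = Attention()(h)                                                  # network flow vector s
outputs = layers.Dense(n_classes, activation="softmax")(s)          # eq. (15)
model = tf.keras.Model(inputs, outputs)
```

In this sketch, u_w is a single trainable context vector, consistent with its description above as a randomly initialized matrix that is learned jointly with the rest of the network.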
E. MODEL TRAINING
Training the proposed network involves a forward pass and a backward pass.

Forward propagation. The BAT-MC model is mainly composed of the BLSTM layer and the attention layer, each of which has a different structure and thus plays a different role in the whole model. The forward propagation [40] [41] is conducted from the BLSTM layer to the attention layer; the input of the current module is obtained from the output of the previous module. After the completion of the forward propagation, the final recognition result is obtained. The NSL-KDD dataset is defined as X, and the divided training and testing datasets can be expressed as x_1, x_2, x_3. After the one-hot and normalization operations, every sample is converted into a format X'' that is acceptable to the BAT-MC model. Meanwhile, we set the cell state vector size to S_state. In summary, the abnormal traffic detection algorithm based on the BAT-MC model is summarized as Algorithm 1. The objective function of our model is the cross-entropy based cost function [42]. The goal of training this model is to minimize the cross entropy of the expected and actual outputs for all activities. The formula is shown in (16):

C = -\sum_{i}\sum_{j}\left[ y_i^j \ln a_i^j + (1 - y_i^j)\ln(1 - a_i^j) \right],   (16)

where i is the index of the network traffic sample, j is the traffic category, a_i^j is the actual output of the model and y_i^j is the expected category label.

Algorithm 1: BAT-MC intrusion detection algorithm
Input: NSL-KDD dataset, Adam, lr, batch_size
Output: Accuracy
1: get X = (x_1, x_2, x_3) from the NSL-KDD dataset;
2: x'_1, x'_2, x'_3 = one-hot(x_1, x_2, x_3);
3: x''_1, x''_2, x''_3 = normalization(x'_1, x'_2, x'_3);
4: conduct convolutional processing;
5: for t = 1; t <= T do
6:   create the backward LSTM cell with S_state;
7:   create the forward LSTM cell with S_state;
8:   connect BLSTM_net from the backward and forward LSTM cells;
9:   initialize BLSTM_net by seed;
10:  get the hidden states h_t of BLSTM_net;
11: end
12: add a fully connected layer with 320 units;
13: add a dropout layer with rate 0.1;
14: for each hidden state in 1:h_t do
15:  obtain the implicit representation u_t of h_t through a non-linear transformation;
16:  generate the randomly initialized matrix u_w;
17:  obtain the normalized importance weight coefficient alpha_t;
18:  get the fine-grained feature s via alpha_t and h_t;
19: end
20: add a fully connected layer with 1024 units;
21: add a fully connected layer with 10 units;
22: P = BAT-MC_net(X'');
23: get Loss from p_i and y_i;
24: update BAT-MC_net by Adam with the loss and learning rate η;
25: return accuracy, f1-score;

Backward propagation. The model is trained with Adam [43], whose updates are computed via the back-propagation algorithm. Error differentials are back-propagated through the network, and Back-Propagation Through Time (BPTT) [44] [45] is applied to calculate the error differentials of the recurrent layers. In this paper, we use the BPTT algorithm to obtain the derivatives of the objective function with respect to all the weights and minimize the objective function by stochastic gradient descent.
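To make the training procedure concrete, the sketch below compiles and fits a Keras model with a cross-entropy objective in the spirit of equation (16) and the Adam optimizer; the tiny stand-in model and random data are placeholders for the BAT-MC model and the preprocessed NSL-KDD records, and the learning rate, batch size and epoch count follow the experimental settings reported in Section IV.

```python
# Training sketch: cross-entropy objective (cf. equation (16)) minimized with
# Adam; Keras performs back-propagation through time automatically when it
# differentiates through the recurrent layers. The small model and random data
# below are stand-ins for the BAT-MC model and the preprocessed NSL-KDD records.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Bidirectional(layers.LSTM(80), input_shape=(10, 12)),
    layers.Dense(5, activation="softmax"),
])
x_train = np.random.rand(256, 10, 12).astype("float32")
y_train = tf.keras.utils.to_categorical(np.random.randint(0, 5, size=256), num_classes=5)

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),  # lr from Table 3
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=128, epochs=100, verbose=0)      # settings from Section IV
```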
IV. EVALUATION
In this section, we first determine the parameters of BAT-MC to obtain the optimal model through experiments carried out on a public dataset, the NSL-KDD dataset [46] [47]. Then, we analyze the performance of the BAT-MC model. Finally, in order to verify the advancement and practicability of the BAT-MC model, we compare its performance with some state-of-the-art works.

A. BENCHMARK DATASETS
The final result of network traffic anomaly detection is closely related to the dataset. The NSL-KDD dataset is an enhanced version of the KDD Cup 1999 dataset [48] [49] and is widely used in intrusion detection experiments. The NSL-KDD dataset not only effectively solves the inherent redundant-records problem of the KDD Cup 1999 dataset but also makes the number of records reasonable in the training and testing datasets. The NSL-KDD dataset is mainly composed of the KDDTrain+ training dataset and the KDDTest+ and KDDTest-21 testing datasets, which allows a reasonable comparison of experimental results across different methods. As shown in Table 1, the NSL-KDD dataset has normal records and four different types of abnormal records. The KDDTest-21 dataset is a subset of KDDTest+ and is more difficult to classify.

TABLE 1. Different classifications in the NSL-KDD dataset.

              Total    Normal   DoS     Probe   R2L    U2R
KDDTrain+     125973   67343    45927   11656   995    52
KDDTest+      22544    9711     7458    2421    2754   200
KDDTest-21    11850    2152     4342    2402    2754   200

Network traffic is generally collected at fixed time intervals; essentially, network traffic data is a kind of time series data. Network traffic is a traffic unit composed of multiple data packets, and each data packet is seen as a whole consisting of a sequence of traffic bytes. There are 41 features from each data packet and 1 class label for every data packet. A record can be described in the following form: x = (b_0, ..., b_i, ...), where b_i is the i-th feature in a data packet and x represents the continuous features of a data packet. These features include basic features (1-10), content features (11-22) and traffic features (23-41) [50]. According to their characteristics, there are four types of attacks in this dataset: DoS (Denial of Service attacks), R2L (Remote to Local attacks), U2R (User to Root attacks), and Probe (Probing attacks).

B. EVALUATION METRIC
In this paper, Accuracy (A) is used to evaluate the BAT-MC model. Besides accuracy, the true positive rate (TPR) and the false positive rate (FPR) are also introduced [51]. These three indicators are commonly used in the research field of network traffic anomaly detection, and their calculation formulas are given below. True Positive (TP) represents the correct classification of an intruder. False Positive (FP) represents the incorrect classification of a normal user as an intruder. True Negative (TN) represents a normal user classified correctly. False Negative (FN) represents an instance where an intruder is incorrectly classified as a normal user.

Accuracy represents the proportion of correctly classified samples to the total number of samples:

A = \frac{TP + TN}{TP + FP + FN + TN},   (17)

True Positive Rate (TPR), equivalent to the Detection Rate (DR), represents the percentage of anomaly records correctly identified over the total number of anomaly records:

DR = TPR = \frac{TP}{TP + FN},   (18)

False Positive Rate (FPR) represents the percentage of normal records incorrectly rejected over the total number of normal records:

FPR = \frac{FP}{FP + TN},   (19)

C. EXPERIMENTAL SETTINGS
In order to test the performance of the BAT-MC model proposed in this paper, the NSL-KDD dataset is used for verification. The data samples of the NSL-KDD dataset are divided into two parts: one is used to build the classifier and is called the training dataset; the other is used to evaluate the classifier and is called the testing dataset. There are 125,973 records in the training set and 22,544 records in the testing set. Table 2 shows the distribution of training and testing records for the (normal/attack) types of network traffic.

TABLE 2. Distribution of training and testing records.

         Normal    DoS      Probe    U2R    R2L     Total
Train    67,343    45,927   11,656   52     995     125,973
Test     9,711     7,458    2,421    200    2,754   22,544

The operating environment for all experiments is Keras with TensorFlow as the backend; the operating system is 64-bit CentOS 7; the processor is an E5-2620 v4 with a main frequency of 2.10 GHz; the memory is 32.0 GB; the Python version is 3.6. In view of the many hyper-parameters in the BAT-MC model, we performed 100 iterations of training on the NSL-KDD set. The hyper-parameters with the highest accuracy were selected as the model parameters. The BAT-MC model was also verified on the testing dataset. After many experiments, three one-dimensional convolution layers were adopted when building the BAT-MC model for the intrusion detection task. The parameter list of the BAT-MC network is given in Table 3.

TABLE 3. Hyper-parameters of the end-to-end learning model.

Parameter                    Filters/neurons
conv + tanh                  20
conv + tanh                  40
conv + tanh                  60
BLSTM hidden nodes           80
BLSTM activation function    relu
Dense                        320
Dropout                      0.1
softmax                      10
cost function                cross entropy
optimizer                    Adam
batch size                   128
learning rate                0.001

D. PERFORMANCE ANALYSIS OF BAT-MC
Experiments have been designed to study the performance of the BAT-MC model for 2-category and 5-category classification (Normal, DoS, R2L, U2R and Probe). In the experiment on identifying malicious traffic, the accuracy of BAT-MC on the KDDTest+ dataset is highest when there are 80 hidden nodes in the BAT-MC model. Meanwhile, the learning rate is set to 0.01 and the number of training epochs is 100. The confusion matrices generated by the BAT-MC model on the KDDTest+ dataset are shown in Figure 3 and Figure 4, which represent the experimental results of the BAT-MC model for the 2-class and 5-class classification, respectively. The experimental results show that most samples are concentrated on the diagonal of the confusion matrix, indicating that the overall classification performance is very high. However, it can be intuitively seen from the confusion matrix in Figure 3 that the BAT-MC network achieves good detection performance in distinguishing normal traffic from attack traffic (only 51 samples are false positives), but there is still room for improvement in distinguishing between the different attack types. The detection of DoS and Probe attack traffic is relatively good, while the detection of R2L and U2R attack traffic is poor.

After careful fine-tuning, the accuracy of the BAT-MC model is compared on the KDDTest+ and KDDTest-21 sets.
FIGURE 3. Confusion matrix yielded by the BAT-MC model (5-class).

TABLE 4. DR and FPR of the BAT-MC model on the NSL-KDD dataset (5-class).

          FPR       DR
Normal    25.70%    97.50%
DoS       1.52%     87.55%
R2L       0.91%     44.25%
U2R       0.09%     20.95%
Probe     1.15%     85.76%
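The per-class DR and FPR values in Table 4 follow equations (17)-(19); the sketch below shows how these metrics can be computed from a confusion matrix, using illustrative binary (normal vs. attack) labels rather than the actual experimental predictions.

```python
# Accuracy, detection rate (TPR) and FPR from a binary confusion matrix,
# following equations (17)-(19); y_true / y_pred are illustrative only.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])     # 0 = normal, 1 = attack
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)       # equation (17)
dr = tp / (tp + fn)                              # detection rate / TPR, equation (18)
fpr = fp / (fp + tn)                             # false positive rate, equation (19)
print(f"accuracy={accuracy:.3f}  DR={dr:.3f}  FPR={fpr:.3f}")
```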
In [53], the authors build a Deep Neural Network (DNN) model for an intrusion detection system and train the model with the NSL-KDD dataset. Experimental results confirm that the deep learning approach shows strong potential for flow-based anomaly detection in SDN environments. In [54], the authors propose to use a typical deep learning method, Convolutional Neural Networks (CNN), for detecting cyber intrusions. The experimental results show that the performance of this IDS model is superior to that of models based on traditional machine learning methods and novel deep learning methods in multi-class classification. These works use the same NSL-KDD dataset for network traffic classification. They are not only recent, highly relevant and representative works on intrusion detection, but they also achieve excellent accuracy. The comparison results among these works on the NSL-KDD dataset are shown in Figure 6 and Figure 7, respectively.

The BAT-MC model takes the collected traffic as the original input. Then, the attention mechanism captures key features from the outputs produced by the BLSTM model. Experimental results show that the BAT-MC model can automatically extract features by means of end-to-end learning, which achieves better classification results than manual design methods. Meanwhile, we compare our model with recent works that use deep learning models for abnormal traffic detection. As can be seen from Figure 7, the BAT-MC model achieves the best results on both the KDDTest+ and KDDTest-21 testing sets. On the KDDTest+ set, the accuracy of the BAT-MC model is 4.12% and 2.96% higher than CNN [54] and RNN [52], respectively. On the KDDTest-21 set, the accuracy of the BAT-MC model is 4.75% and 7.1% higher than CNN [54] and RNN [52], respectively. The BAT-MC network is more accurate than CNN because CNN is better suited to processing image data; additionally, CNN uses a fixed convolution kernel that cannot model longer contextual information, which is not conducive to feature extraction from time series data. The BAT-MC network is better than RNN, LSTM and BLSTM because the BAT-MC model combines the attention mechanism to capture the key features and obtain more context information. The BAT-MC model can capture the features of network traffic more comprehensively, extracting the information of each data packet and then utilizing it in a frame-by-frame way. These results prove that the BAT-MC network can offer a significant advantage across very different scenarios.
This proves the advantages of the multiple convolutional layers. The RNN model has small-scale fluctuations in accuracy during the iterative process; it improves faster but also reaches lower accuracy than the BAT and BAT-MC models. The CNN model starts to improve at a slower rate and has the worst performance among the models. In summary, the BAT-MC network can accurately identify the time series data with 84.25% accuracy, which makes it an effective intrusion detection method.

V. CONCLUSION
The current deep learning methods in network traffic classification research do not make full use of the structured information of network traffic. Drawing on the application of deep learning methods in the field of natural language processing, we propose a novel model, BAT-MC, that applies two-phase learning of BLSTM and attention over the time series features for intrusion detection on the NSL-KDD dataset. The BLSTM layer, which connects a forward LSTM and a backward LSTM, is used to extract features from the traffic bytes of each packet. Each data packet produces a packet vector, and these packet vectors are arranged to form a network flow vector. The attention layer is then used to perform feature learning on the network flow vector composed of the packet vectors. The above feature learning process is completed automatically by the deep neural network without any feature engineering technology, so this model effectively avoids the problem of manually designed features. The performance of the BAT-MC method is tested on the KDDTest+ and KDDTest-21 datasets. Experimental results on the NSL-KDD dataset indicate that the BAT-MC model achieves high accuracy. Comparisons with some standard classifiers show that the BAT-MC model's results are very promising compared with other current deep learning-based methods. Hence, we believe that the proposed method is a powerful tool for the intrusion detection problem.

REFERENCES
[1] B. B. Zarpelo, R. S. Miani, C. T. Kawakani, and S. C. D. Alvarenga, "A survey of intrusion detection in internet of things," vol. 84, no. C, 2017, pp. 25–37.
[2] O. Tirelo and C. H. Yang, "Network intrusion detection," IEEE Network, vol. 8, no. 3, pp. 26–41, 2003.
[3] S. Kishorwagh, V. K. Pachghare, and S. R. Kolhe, "Survey on intrusion detection system using machine learning techniques," International Journal of Computer Applications, vol. 78, no. 16, pp. 30–37, 2013.
[4] N. Sultana, N. Chilamkurti, P. Wei, and R. Alhadad, "Survey on sdn based network intrusion detection system using machine learning approaches," Peer-to-Peer Networking and Applications, no. 1-2, pp. 1–9, 2018.
[5] M. Panda, A. Abraham, S. Das, and M. R. Patra, "Network intrusion detection system: A machine learning approach," Intelligent Decision Technologies, vol. 5, no. 4, pp. 347–356, 2011.
[6] W. Li, P. Yi, Y. Wu, L. Pan, and J. Li, "A new intrusion detection system based on KNN classification algorithm in wireless sensor network," Journal of Electrical and Computer Engineering, vol. 2014, pp. 1–8, 2014.
[7] S. Garg and S. Batra, "A novel ensembled technique for anomaly detection," International Journal of Communication Systems, vol. 30, no. 11, p. e3248, 2017.
[8] F. Kuang, W. Xu, and S. Zhang, "A novel hybrid kpca and svm with ga model for intrusion detection," Applied Soft Computing Journal, vol. 18, no. C, pp. 178–184, 2014.
[9] W. Wei, Z. Ming, X. Zeng, X. Ye, and Y. Sheng, "Malware traffic classification using convolutional neural network for representation learning," in International Conference on Information Networking, 2017.
[10] P. Torres, C. Catania, S. Garcia, and C. G. Garino, "An analysis of recurrent neural networks for botnet detection behavior," in Biennial Congress of Argentina, 2016.
[11] R. C. Staudemeyer and C. W. Omlin, "Evaluating performance of long short-term memory recurrent neural networks on intrusion detection data," in Proceedings of the South African Institute for Computer Scientists and Information Technologists Conference, 2013.
[12] S. Cornegruta, R. Bakewell, S. Withey, and G. Montana, "Modelling radiological language with bidirectional long short-term memory networks," 2016.
[13] O. Firat, K. Cho, and Y. Bengio, "Multi-way, multilingual neural machine translation with a shared attention mechanism," 2016.
[14] Z. Hu, "Design of intrusion detection system based on a new pattern matching algorithm," in International Conference on Computer Engineering & Technology, 2009.
[15] C. Yin, "An improved bm pattern matching algorithm in intrusion detection system," Applied Mechanics & Materials, vol. 48-49, pp. 203–207, 2011.
[16] P. F. Wu and H. J. Shen, "The research and amelioration of pattern-matching algorithm in intrusion detection system," in IEEE International Conference on High Performance Computing & Communication / IEEE International Conference on Embedded Software & Systems, 2012.
[17] V. Dagar, V. Prakash, and T. Bhatia, "Analysis of pattern matching algorithms in network intrusion detection systems," in International Conference on Advances in Computing, 2016.
[18] M. S. Pervez and D. M. Farid, "Feature selection and intrusion classification in nsl-kdd cup 99 dataset employing svms," in International Conference on Software, 2015.
[19] H. Shapoorifard and P. Shamsinejad, "Intrusion detection using a novel hybrid method incorporating an improved knn," International Journal of Computer Applications, vol. 173, no. 1, pp. 5–9, 2017.
[20] J. Zhang, M. Zulkernine, and A. Haque, "Random-forests-based network intrusion detection systems," IEEE Transactions on Systems, Man & Cybernetics, Part C, vol. 38, no. 5, pp. 649–659, 2008.
[21] B. Ingre and A. Yadav, "Performance analysis of NSL-KDD dataset using ANN," in International Conference on Signal Processing and Communication Engineering Systems (SPACES), 2015.
[22] B. Ingre, A. Yadav, and A. K. Soni, "Decision tree based intrusion detection system for nsl-kdd dataset," in International Conference on Information & Communication Technology for Intelligent Systems, 2017.
[23] M. Asadiaghbolaghi, A. Clapes, M. Bellantonio, H. J. Escalante, V. Ponce-López, X. Baró, I. Guyon, S. Kasaei, and S. Escalera, "A survey on deep learning based approaches for action and gesture recognition in image sequences," in IEEE International Conference on Automatic Face & Gesture Recognition, 2017.
[24] Z. Yan, "Multi-instance multi-stage deep learning for medical image recognition," in Deep Learning for Medical Image Analysis, 2017, pp. 83–104.
[25] Z. Zhang, J. Geiger, J. Pohjalainen, A. E.-D. Mousa, W. Jin, and B. Schuller, "Deep learning for environmentally robust speech recognition," ACM Transactions on Intelligent Systems & Technology, vol. 9, no. 5, pp. 1–28, 2017.
[26] K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T. Ogata, "Audio-visual speech recognition using deep learning," Applied Intelligence, vol. 42, no. 4, pp. 722–737, 2015.
[27] A. Y. Javaid, Q. Niyaz, W. Sun, and M. Alam, "A deep learning approach for network intrusion detection system," in EAI International Conference on Bio-inspired Information & Communications Technologies, 2015.
[28] G. Zhao, C. Zhang, and L. Zheng, "Intrusion detection using deep belief network and probabilistic neural network," in IEEE International Conference on Computational Science & Engineering, 2017.
[29] G. Ni, G. Ling, Q. Gao, and W. Hai, "An intrusion detection model based on deep belief networks," in Second International Conference on Advanced Cloud & Big Data, 2015.
[30] M. Z. Alom, V. R. Bontupalli, and T. M. Taha, "Intrusion detection using deep belief networks," in Aerospace & Electronics Conference, 2016.
[31] K. Wu, Z. Chen, and W. Li, "A novel intrusion detection model for a massive network using convolutional neural networks," IEEE Access, vol. 6, pp. 50850–50859, 2018.
[32] K. Jin, N. Shin, S. Y. Jo, and H. K. Sang, "Method of intrusion detection using deep neural network," in IEEE International Conference on Big Data & Smart Computing, 2017.
[33] A. Tatsuma and M. Aono, "Food image recognition using covariance of convolutional layer feature maps," IEICE Transactions on Information & Systems, vol. E99.D, no. 6, pp. 1711–1715, 2016.
[34] Y. Zeng, T. Li, G. Luo, H. Fujita, Y. Ning, and P. Yi, "Convolutional networks with cross-layer neurons for image recognition," Information Sciences, vol. 433–434, pp. 241–254, 2018.
[35] W. Luo, Y. Li, R. Urtasun, and R. Zemel, "Understanding the effective receptive field in deep convolutional neural networks," 2016.
[36] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, "Lstm: A search space odyssey," IEEE Transactions on Neural Networks & Learning Systems, vol. 28, no. 10, pp. 2222–2232, 2016.
[37] F. J. Ordóñez and D. Roggen, "Deep convolutional and lstm recurrent neural networks for multimodal wearable activity recognition," Sensors, vol. 16, no. 1, p. 115, 2016.
[38] F. A. Gers, J. Schmidhuber, and F. Cummins, "Learning to forget: Continual prediction with lstm," Neural Computation, vol. 12, no. 10, pp. 2451–2471, 2000.
[39] N. Pappas and A. Popescu-Belis, "Multilingual hierarchical attention networks for document classification," in IJCNLP, 2017.
[40] Y. Hua, Z. Zhao, R. Li, X. Chen, Z. Liu, and H. Zhang, "Deep learning with long short-term memory for time series prediction," IEEE Communications Magazine, vol. PP, no. 99, pp. 1–6, 2018.
[41] S. Iamsa-At and P. Horata, "Handwritten character recognition using histograms of oriented gradient features in deep learning of artificial neural network," in International Conference on IT Convergence & Security, 2014.
[42] A. Boubezoul and S. Paris, "Application of global optimization methods to model and feature selection," Pattern Recognition, vol. 45, no. 10, pp. 3676–3686, 2012.
[43] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," Computer Science, 2014.
[44] M. Zeng, T. N. Le, B. Yu, O. J. Mengshoel, J. Zhu, P. Wu, and J. Zhang, "Convolutional neural networks for human activity recognition using mobile sensors," in International Conference on Mobile Computing, Applications and Services, 2015, pp. 197–205.
[45] A. Graves, S. Fernández, and F. Gomez, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in International Conference on Machine Learning, 2006.
[46] S. Revathi and A. Malathi, "A detailed analysis on nsl-kdd dataset using various machine learning techniques for intrusion detection," International Journal of Engineering Research and Technology, vol. 2, no. 12, 2013.
[47] D. H. Deshmukh, T. Ghorpade, and P. Padiya, "Improving classification using preprocessing and machine learning algorithms on nsl-kdd dataset," in International Conference on Communication, 2015.
[48] M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani, "A detailed analysis of the kdd cup 99 data set," in IEEE International Conference on Computational Intelligence for Security & Defense Applications, 2009.
[49] V. Engen, J. Vincent, and K. T. Phalp, "Exploring discrepancies in findings obtained with the kdd cup '99 data set," Intelligent Data Analysis, vol. 15, no. 2, pp. 251–276, 2011.
[50] L. Dhanabal and S. P. Shantharajah, "A study on nsl-kdd dataset for intrusion detection system based on classification algorithms," 2015.
[51] E. M. Stock, J. D. Stamey, R. Sankaranarayanan, D. M. Young, R. Muwonge, and M. Arbyn, "Estimation of disease prevalence, true positive rate, and false positive rate of two screening tests when disease verification is applied on only screen-positives: a hierarchical model using multi-center data," Cancer Epidemiology, vol. 36, no. 2, pp. 153–160, 2012.
[52] C. L. Yin, Y. F. Zhu, J. L. Fei, and X. Z. He, "A deep learning approach for intrusion detection using recurrent neural networks," IEEE Access, vol. 5, pp. 21954–21961, 2017.
[53] T. A. Tang, L. Mhamdi, D. Mclernon, S. Zaidi, and M. Ghogho, "Deep learning approach for network intrusion detection in software defined networking," in International Conference on Wireless Networks & Mobile Communications, 2016.
[54] Y. Ding and Y. Zhai, "Intrusion detection system for nsl-kdd dataset using convolutional neural networks," in Proceedings of the 2018 2nd International Conference on Computer Science and Artificial Intelligence, 2018, pp. 81–85.

TONGTONG SU was born in 1992. From 2017 to 2019, he was a graduate student with the School of Computer and Information Engineering, Tianjin Normal University. His main research interests include machine learning and pattern recognition.

HUAZHI SUN received the Ph.D. degree from the University of Science and Technology Beijing, China, in 2008. He is currently a Professor with the School of Computer and Information Engineering, Tianjin Normal University, China. His main research interests include mobile computing and distributed computing.

SHENG WANG is currently pursuing a master's degree with the School of Computer and Information Engineering, Tianjin Normal University, China. His current research focuses on network technology, big data analysis and deep learning.

JINQI ZHU received the Ph.D. degree in computer science from the University of Electronic Science and Technology of China (UESTC) in 2009. In 2013, she joined Nanyang Technological University (NTU) as a Visiting Scholar, under the supervision of Dr. Y. G. Wen. She is currently an Associate Professor with the School of Computer and Information Engineering, Tianjin Normal University, China. Her main research interests include mobile computing, vehicular networks, and network security.

YABO LI is currently pursuing a master's degree in computer application technology at Tianjin Normal University. She is mainly engaged in research on wireless self-organizing networks and mobile computing.