HAST-IDS Learning Hierarchical Spatial-Temporal Fe

This document summarizes a research paper that proposes a new intrusion detection system called HAST-IDS. HAST-IDS uses deep neural networks to automatically learn hierarchical spatial-temporal features from network traffic, without requiring feature engineering. It uses CNNs to learn low-level spatial features, and LSTMs to learn high-level temporal features based on the spatial features. The goal is for the automatically learned features to better characterize network traffic and reduce false alarm rates compared to manually designed features. Experimental results on standard datasets show HAST-IDS outperforms other approaches in accuracy, detection rate, and false alarm rate, demonstrating its effectiveness.


See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/321738892

HAST-IDS: Learning Hierarchical Spatial-Temporal Features using Deep Neural Networks to Improve Intrusion Detection

Article in IEEE Access · December 2017


DOI: 10.1109/ACCESS.2017.2780250

All content following this page was uploaded by Xuewen Zeng on 07 February 2018.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/ACCESS.2017.2780250, IEEE Access

Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier 10.1109/ACCESS.2017.Doi Number

HAST-IDS: Learning Hierarchical Spatial-Temporal Features using Deep Neural Networks to Improve Intrusion Detection

WEI WANG¹, YIQIANG SHENG², JINLIN WANG², XUEWEN ZENG², XIAOZHOU YE², YONGZHONG HUANG³, AND MING ZHU¹
¹Department of Automation, University of Science and Technology of China, Hefei 230026, China
²National Network New Media Engineering Research Center, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China
³Guilin University of Electronic Technology, Guilin 541004, China
Corresponding author: W. Wang (ww8137@mail.ustc.edu.cn).
This work was supported by the Pioneer Program of Institute of Acoustics, Chinese Academy of Sciences, Grant No. Y654101601.

ABSTRACT The development of an anomaly-based intrusion detection system (IDS) is a primary research
direction in the field of intrusion detection. An IDS learns normal and anomalous behavior by analyzing
network traffic and can detect unknown and new attacks. However, the performance of an IDS is highly
dependent on feature design, and designing a feature set that can accurately characterize network traffic is
still an ongoing research issue. Anomaly-based IDSs also have the problem of a high false alarm rate (FAR),
which seriously restricts their practical applications. In this paper, we propose a novel IDS called the
hierarchical spatial-temporal features-based intrusion detection system (HAST-IDS), which first learns the
low-level spatial features of network traffic using deep convolutional neural networks (CNNs) and then
learns high-level temporal features using long short-term memory (LSTM) networks. The entire process of
feature learning is completed by the deep neural networks automatically; no feature engineering techniques
are required. The automatically learned traffic features effectively reduce the FAR. The standard
DARPA1998 and ISCX2012 datasets are used to evaluate the performance of the proposed system. The
experimental results show that the HAST-IDS outperforms other published approaches in terms of accuracy,
detection rate and FAR, which successfully demonstrates its effectiveness in both feature learning and FAR
reduction.

INDEX TERMS Network intrusion detection, deep neural networks, representation learning.

2169-3536 © 2017 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION
Cyberspace security has recently gained increasing attention. Creating effective defenses against various types of network attacks and ensuring the safety of network equipment and information security have become major concerns. Network intrusion detection systems (IDSs) identify malicious attack behaviors by analyzing the network traffic of key nodes of a network and have become an important part of the network security protection architecture.
The anomaly-based detection method, which is a primary research direction in the field of intrusion detection, learns normal and anomalous behaviors by analyzing network traffic and can detect unknown and new attacks [1]. However, its performance is highly dependent on feature design, and designing a feature set that can accurately characterize network traffic is still an ongoing research issue [2]. Anomaly-based IDSs also have a high false alarm rate (FAR), which seriously restricts their practical application [3].
To solve the abovementioned problems, using the representation learning approach, we propose a hierarchical spatial-temporal features-based intrusion detection system (HAST-IDS) that automatically learns network traffic features. It first learns the spatial features of network traffic using deep convolutional neural networks (CNNs) and then learns the temporal features using long short-term memory (LSTM) networks, which are a special type of recurrent neural networks (RNNs). The high-level temporal features are based on the low-level spatial features. The entire feature-learning process is conducted automatically: no feature
engineering techniques are required. We expect that the automatically learned hierarchical spatial-temporal features are better at characterizing network traffic behaviors than are manually designed features and that they can effectively improve the intrusion detection capability. The experimental results successfully demonstrate the effectiveness of the proposed method for both feature learning and FAR reduction.
The rest of this paper is organized as follows. Section II describes the related work. Section III describes the design and implementation of the proposed method. Section IV mainly covers the evaluation methodology and experimental results, and Section V presents conclusions and future work.

II. RELATED WORK

A. INTRUSION DETECTION TECHNIQUES
IDSs can be classified as either signature-based or anomaly-based. Signature-based detection, also called misuse detection, analyzes known attacks to extract their discriminating characteristics and patterns, called signatures. These signatures are compared against the new traffic to detect intrusions. The advantages of signature-based detection are that it has a high detection rate and a low FAR for known attacks, while its disadvantage is that it cannot detect any unknown and new (0-day) attacks. Anomaly-based detection, also called behavior-based detection, mainly uses a machine learning-based method. In this approach, some network traffic features are designed first; then, a model is built based on those features using supervised or unsupervised learning approaches. The model can identify both normal and anomalous traffic patterns. Its advantage is that it can detect unknown and new attacks; thus, it has attracted increasing interest in research and industrial circles.
However, in practical applications, some problems still exist for anomaly-based intrusion detection methods. The first problem is the difficulty of designing representative traffic features. The detection effect of this method is highly dependent on the design of the traffic features used in training. The detection effects often vary widely when different feature sets of network traffic are applied. No standard guiding principle currently exists for the design of a feature set that accurately characterizes network traffic. The second serious problem of the anomaly-based intrusion detection method is its high FAR, which is a major obstacle to its practical application [3].
The representation learning approach [4] is a promising method for solving both these problems. Representation learning, also called feature learning, is a technique that allows a system to automatically extract features from raw data. Its biggest advantage is that it replaces manual feature engineering and can directly learn the best features from raw data. Deep neural networks have been the most successful technique of representation learning and have achieved remarkable results in the fields of computer vision and natural language processing. In the field of network intrusion detection, some research results obtained using the representation learning approach have recently emerged. For example, Ma et al. [5] applied deep neural networks to detect intrusion behaviors using the KDD99 dataset. Niyaz et al. [6] studied the intrusion detection method on the NSL-KDD dataset using deep belief networks. The common ground of these research methods is that their models learn features from manually designed traffic features. However, applications in the fields of computer vision and natural language processing have shown that the biggest advantage of deep neural networks lies in their ability to learn features directly from raw data [7]. The abovementioned research methods used deep neural networks based on manually designed features without taking full advantage of the deep neural networks.
The literature reveals that raw network traffic data have not been used to learn features. This revelation motivates us to develop a process for learning features directly from raw network traffic data using deep neural networks to obtain a better traffic feature set and develop a more efficient IDS. Additionally, Eesa et al. [8] showed that the detection rate can be increased and the FAR decreased by using a better traffic feature set. We also hope to reduce the FAR by using an automatically learned traffic feature set.

B. DEEP NEURAL NETWORKS
CNNs and RNNs are the two most widely used deep neural network models; they are capable of learning effective spatial and temporal features, respectively. In common neural networks, every neural node of every hidden layer sums the weighted values coming from the previous layer, applies a nonlinear transform, and transfers the resulting values to the next layer. The output value of the last layer can be regarded as the representation or feature learned by the neural networks from the input data. CNNs, which improve upon the architecture of the common neural networks, benefit from the following: sparse connectivity, shared weights and pooling. CNNs are capable of learning spatial features and have already yielded impressive achievements in many computer vision tasks, such as image classification [9]. RNNs add a self-connected weighted value as the memory unit for every neural node based on the architecture of common neural networks, and they can memorize the previous state of the neural networks. Long short-term memory (LSTM) networks further add a component called the forget gate, and LSTMs can effectively learn temporal features from a long sequence. LSTM networks have already been shown to perform well for many natural language processing tasks, such as machine translation [10].
Spatial and temporal features are two commonly used types of traffic features in the field of intrusion detection. When using spatial features, network traffic is first transformed into traffic images; then, the image classification method based on geometric features is used to classify the
traffic images, which also indirectly achieves the goal of identifying the malware traffic. This approach is relatively new, but many recent research results demonstrate its great potential. For example, Tan et al. [11] applied a widely used dissimilarity measure in the computer vision domain, namely, the earth mover’s distance (EMD), to detect denial of service (DoS) traffic and achieved a good effect. When using temporal features, the temporal features of a network traffic flow are extracted and can be used to detect intrusion behaviors via the time series analysis method. This approach was developed early and has been adopted by many researchers. For example, Ariu et al. [12] developed an effective IDS using the hidden Markov model method based on the temporal relations of traffic payload bytes.
Over the past two years, a few studies have used CNNs or RNNs to perform intrusion detection tasks based on spatial and temporal features. For example, Wang et al. [13] used a CNN to learn the spatial features of network traffic and achieved malware traffic classification using the image classification method. Torres et al. [14] first transformed network traffic features into a sequence of characters and then used RNNs to learn their temporal features, which were further applied to detect malware traffic. The common point of these research methods is that they used CNNs or RNNs alone and learned a single type of traffic feature.
Network traffic has an obvious hierarchical structure as illustrated in Figure 1, where the bottom row shows a sequence of traffic bytes. Based on the format of specific network protocols, multiple traffic bytes are combined to form a network packet, and multiple network packets communicated between two sides are further combined to form a network flow. Notably, these traffic bytes, network packets and network flows are similar to the characters, sentences and paragraphs in the field of natural language processing. Moreover, the task of classifying a network flow as either normal or malware is very similar to classifying a paragraph as either positive or negative, which is a common task in the field of natural language processing, namely, sentiment analysis. In some recent studies on sentiment analysis, deep neural networks were used to learn the hierarchical features of natural language and achieved good results [15][16][17]. Those studies motivated us to use deep neural networks to learn the hierarchical features of network traffic and further perform the intrusion detection task.

[Figure 1: three levels, top to bottom: Traffic Flow; Packet, Packet, Packet; Byte, Byte, Byte, Byte, Byte.]
FIGURE 1. Hierarchical structure of network traffic

However, no studies that combine the use of CNNs and RNNs to detect network intrusion exist in the literature. To take full advantage of these two types of deep neural networks, we use both CNNs and RNNs to learn the spatial-temporal features of raw network traffic data to develop a more effective IDS.

III. HAST-IDS

A. HAST-IDS OVERVIEW
The goals of the HAST-IDS are to automatically learn the spatial-temporal features of raw network traffic data using deep neural networks and to improve the effectiveness of the IDS. The basic design concept is as follows. At the network packet level, every network packet is transformed into a two-dimensional image, the internal spatial features of which are learned by a CNN. At the network flow level, the temporal features of a sequence of network packets of a network flow are further learned by an RNN. Finally, the resulting spatial-temporal traffic features are used to classify the traffic as normal or malware.
Two implementation schemes are used for the HAST-IDS, as shown in Figure 2. HAST-I uses a CNN and learns only spatial features, while HAST-II uses both CNNs and RNNs and learns spatial-temporal features. Each scheme has different application scenarios for different types of network traffic, which will be discussed in Section IV.

[Figure 2: flowcharts of the two schemes. Both run Start → Network Traffic Dataset → Preprocessing → 10-fold Cross-Validation → Spatial Features Learning (CNN) on the training dataset → Softmax Classifier → validation on the validating dataset, repeating until cross-validation is completed, then Test & Aggregated Results on the testing dataset → End. HAST-II adds Temporal Features Learning (LSTM) between the CNN and the softmax classifier.]
FIGURE 2. Workflow of HAST-IDS

The various stages of the HAST-IDS are described below.
Preprocessing. In this stage, the input raw network traffic data are transformed into the two-dimensional images required by the CNN. The basic traffic units for intrusion detection are network flows; thus, the input raw traffic data must be split into multiple network flows. Each network flow contains multiple network packets communicated between two endpoints. One-hot encoding (OHE) is used as the transformation method. In HAST-I, the first n traffic bytes of the network flow are transformed. If the OHE vector is m-dimensional, then the entire network flow can be transformed into an m×n two-dimensional image. For HAST-II, similar preprocessing is required; however, every network packet must be transformed individually. If r packets exist in a flow and if the first q bytes of every packet are transformed and the OHE vector is p-dimensional, then the entire network flow can be transformed into r different p×q two-dimensional images.
Cross-validation. The k-fold cross-validation technique is used for performance evaluation. In this technique, a dataset is randomly divided into k equal parts. In each iteration, one part is selected as the validating dataset, while all the other k−1 parts are treated as the training dataset. In our experiments, k was set to 10 because of the resulting low bias, low variance, low overfitting and good error estimate [18].
Spatial feature learning. CNNs are used to learn the spatial features of the two-dimensional traffic images. In HAST-I, the spatial features of the entire flow image are learned from a single m×n image, and the output is a single flow vector. In HAST-II, the spatial features of every p×q packet image are learned individually, and the output is r packet vectors.
Temporal feature learning. An LSTM is used to learn the temporal features of multiple traffic vectors. In HAST-II, the LSTM further learns the temporal relations among the r packet vectors. The output is a single flow vector that represents the spatial-temporal features of the network flow.
Softmax classifier. The softmax classifier is used to determine whether the input traffic is normal or malware based on the flow vector. Softmax is a commonly used multi-class classification method in the field of machine learning.
Test and result aggregation. The fine-tuned model is tested using the test dataset. The results of each experiment are collected and analyzed.

B. LEARNING SPATIAL FEATURES WITH CNNS
CNNs are used to learn the spatial features from the two-dimensional image of the input network traffic bytes, as shown in Figure 3. In HAST-I, the CNN is applied to the entire network flow. The first n traffic bytes of the entire flow are transformed into a single m×n flow image via OHE. The image is further processed via convolution and pooling. The final output is a flow vector that represents the features of the entire flow. In HAST-II, the CNN is applied to every network packet. The first q bytes of every packet are transformed into a p×q packet image via OHE, and each image is further processed via convolution and pooling. The final output consists of multiple packet vectors that represent the features of the individual network packets.
When implemented, two convolution filters with different sizes are used to output two different flow/packet vectors, which are concatenated together as the final vector. The application of CNNs in the field of computer vision has shown that this method can obtain better spatial features. Algorithm 1 presents the details of the spatial feature learning stage.
The key components used in Algorithm 1 are as follows.
One-hot encoding. Let $x_i \in \mathbb{R}^k$ be the k-dimensional vector corresponding to the i-th traffic byte in a packet or flow. A packet or flow of length n can be encoded according to the following formula, where $\oplus$ is the concatenation operator. In general, $x_{i:i+j}$ denotes the concatenation of traffic bytes $x_i, x_{i+1}, \ldots, x_{i+j}$.

$$x_{1:n} = x_1 \oplus x_2 \oplus \cdots \oplus x_n \tag{1}$$

Convolution operation. A convolution operation involves a filter $w \in \mathbb{R}^{hk}$, which is applied to a window of h traffic bytes to produce a new feature. For example, a feature $c_i$ is generated using this formula, where $b \in \mathbb{R}$ is a bias term, and f denotes ReLUs [19].

$$c_i = f(w \cdot x_{i:i+h-1} + b) \tag{2}$$

Feature mapping. A convolution filter is applied to each possible window $\{x_{1:h}, x_{2:h+1}, \ldots, x_{n-h+1:n}\}$ to produce a feature map with $c \in \mathbb{R}^{n-h+1}$.

$$c = [c_1, c_2, \ldots, c_{n-h+1}] \tag{3}$$

Pooling operation. A max-over-time pooling operation is then applied to the feature map and takes the maximum value as the final feature.

$$\hat{c} = \max\{c\} \tag{4}$$
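Equations (1) to (4) can be traced by hand on a toy input. The Python sketch below is illustrative only: the function names, the tiny alphabet of k = 4 byte values (real traffic bytes would need k = 256) and the hand-picked filter weights are assumptions made for the example, not values taken from the paper. The encoded sequence is stored as n rows of length k, which is the transpose of the m×n image layout described in the text.

```python
from typing import List

Vector = List[float]


def one_hot(byte: int, k: int) -> Vector:
    """k-dimensional one-hot vector x_i for one traffic byte (requires byte < k)."""
    vec = [0.0] * k
    vec[byte] = 1.0
    return vec


def encode(data: bytes, k: int) -> List[Vector]:
    """Eq. (1): the sequence x_1 ... x_n, stored as n rows of length k."""
    return [one_hot(b, k) for b in data]


def feature_map(x: List[Vector], w: List[Vector], b: float) -> Vector:
    """Eqs. (2)-(3): slide a filter of h rows over the n - h + 1 windows."""
    h = len(w)
    c = []
    for i in range(len(x) - h + 1):
        # Dot product of the filter with the window x_{i:i+h-1}.
        z = sum(w[r][j] * x[i + r][j] for r in range(h) for j in range(len(w[r])))
        c.append(max(0.0, z + b))  # f is the ReLU
    return c


def max_over_time(c: Vector) -> float:
    """Eq. (4): keep the largest activation of the feature map."""
    return max(c)


# Toy example: alphabet of k = 4 byte values, filter width h = 2.
x = encode(bytes([0, 1, 2, 3, 1]), k=4)
# Hypothetical filter that responds when byte 0 is followed by byte 1.
w = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
c = feature_map(x, w, b=-1.0)
c_hat = max_over_time(c)
```

With n = 5 and h = 2 the feature map has n − h + 1 = 4 entries, and only the first window (byte 0 followed by byte 1) produces a nonzero activation, so the pooled feature simply records that the pattern occurred somewhere in the sequence.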

[Figure 3: two panels. (HAST-I) flow bytes (length n) → OHE → flow image (m×n) → CNN → flow vec. (HAST-II) for each packet, packet bytes (length q) → OHE → packet image (p×q) → CNN → packet vec; the packet vecs → LSTM → flow vec. The OHE step is labeled Preprocessing, the CNN step Spatial Feature Learning and the LSTM step Temporal Feature Learning.]
FIGURE 3. Feature learning process of HAST-IDS
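The preprocessing step at the bottom of Figure 3 (HAST-II panel) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the zero-padding of short packets and short flows is an assumption, since the text only states that the first q bytes of each packet are transformed. Each image is stored as q rows of length p, the transpose of the p×q layout in the text.

```python
from typing import List

Image = List[List[int]]


def packet_to_image(packet: bytes, q: int, p: int = 256) -> Image:
    """One-hot encode the first q bytes of a packet as q rows of length p.

    Packets shorter than q bytes are padded with all-zero rows
    (an assumption; the padding rule is not given in the text).
    """
    image = []
    for value in packet[:q]:
        row = [0] * p
        row[value] = 1
        image.append(row)
    missing = q - len(image)
    image.extend([0] * p for _ in range(missing))
    return image


def flow_to_images(packets: List[bytes], r: int, q: int, p: int = 256) -> List[Image]:
    """Turn the first r packets of a flow into r packet images.

    Flows with fewer than r packets are padded with empty packets
    (also an assumption).
    """
    selected = list(packets[:r])
    selected += [b""] * (r - len(selected))
    return [packet_to_image(pkt, q, p) for pkt in selected]


# Toy flow with two packets, brought to r = 3 packets of q = 4 bytes each.
flow = [b"\x01\x02\x03\x04\x05\x06", b"\xff"]
images = flow_to_images(flow, r=3, q=4)
```

For HAST-I the same `packet_to_image` helper could be applied once to the first n bytes of the whole flow, giving a single flow image.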

Algorithm 1: Spatial Feature Learning

Input: Network traffic image: a flow image (f1) in HAST-I or r packet images (p1, p2, …, pr) in HAST-II.
Output: Spatial features of network traffic: a flow vector (vf1) or r packet vectors (vp1, vp2, …, vpr).
Step 1: Create CNN model_1
  1. Add 1st convolution layer with c1 filters of size s1, followed by 1st max pooling layer of size t1.
  2. Add 2nd convolution layer with c2 filters of size s2, followed by 2nd max pooling layer of size t2.
  3. Add 1st dense layer, the output of which is a temp vector Vtemp1.
Step 2: Create CNN model_2
  4. Add 3rd convolution layer with c3 filters of size s3, followed by 3rd max pooling layer of size t3.
  5. Add 4th convolution layer with c4 filters of size s4, followed by 4th max pooling layer of size t4.
  6. Add 2nd dense layer, the output of which is a temp vector Vtemp2.
Step 3: Concatenate two temp vectors
  7. Vtemp = Vtemp1 ⊕ Vtemp2, where ⊕ denotes concatenation.
Step 4: Train and validate model
  8. while early termination condition is not met, do
       while training dataset is not empty, do
         Prepare the mini-batch dataset as the model input.
         Compute the categorical cross-entropy loss function $H(p, q) = -\sum_x p(x) \log q(x)$, where p = true_dist and q = predict_dist.
         Update the weights and biases using the RMSprop gradient descent optimization algorithm.
       end
       Validate model using the validating dataset.
     end
Step 5: Test model
  9. Test the fine-tuned model using the test dataset.
  10. return the Vtemp of every network traffic image in the test dataset.
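Two pieces of Algorithm 1 are easy to check by hand: the concatenation in Step 3 and the categorical cross-entropy loss in Step 4. The sketch below is a plain-Python illustration with hypothetical function names; it does not build or train the CNN, and the `eps` guard against log(0) is an implementation assumption rather than part of the algorithm.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw class scores into a probability distribution."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(p: Sequence[float], q: Sequence[float],
                              eps: float = 1e-12) -> float:
    """H(p, q) = -sum_x p(x) * log(q(x)); eps guards against log(0)."""
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))


def concatenate(v1: Sequence[float], v2: Sequence[float]) -> List[float]:
    """Step 3: join the two branch outputs end to end (not element-wise addition)."""
    return list(v1) + list(v2)


# A two-class example: the true label is class 0 and the model is undecided.
true_dist = [1.0, 0.0]
predict_dist = softmax([0.0, 0.0])
loss = categorical_cross_entropy(true_dist, predict_dist)

# Hypothetical branch outputs of length 2 and 1 give a final vector of length 3.
v_temp = concatenate([0.1, 0.2], [0.3])
```

For an undecided two-class prediction the loss equals ln 2 (about 0.693); it approaches 0 as the predicted probability of the true class approaches 1.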

Algorithm 2: Temporal Feature Learning


Input: r packet vectors of network traffic (vp1, vp2, …, vpr).
Output: spatial-temporal features of network traffic: a flow vector (vf).
Step 1: Create LSTM model
1. Add 1st LSTM layer of l1 units, with dropout d1 and recurrent dropout r1.
2. Add 2nd LSTM layer of l2 units, with dropout d2 and recurrent dropout r2.
3. Add a dense layer whose output is a flow vector vf.
Step 2: Train and validate model
4. Train and validate the model as described in Algorithm 1.
Step 3: Test model
5. Test the fine-tuned model using the test dataset.
6. return the vf of every network traffic image in the test dataset.
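A single time step of the LSTM cell that Algorithm 2 stacks can be sketched in plain Python, following the gate equations (5) to (9) of Section III-C. This is a hedged illustration: the `params` dictionary layout and the helper names are assumptions, the weights are set to zero only so the result can be checked by hand, and the bidirectional scan, dropout and dense layer of Algorithm 2 are not shown.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[Vector]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def affine(W: Matrix, b: Vector, v: Vector) -> Vector:
    """Compute W . v + b for a row-major matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x_t: Vector, h_prev: Vector, c_prev: Vector,
              params: Dict[str, Tuple[Matrix, Vector]]) -> Tuple[Vector, Vector]:
    """One time step; params maps 'f', 'g', 'i', 'o' to a (W, b) pair."""
    v = h_prev + x_t  # concatenation [h_{t-1}, x_t]
    f = [sigmoid(z) for z in affine(*params["f"], v)]    # forget gate, Eq. (5)
    g = [math.tanh(z) for z in affine(*params["g"], v)]  # input node, Eq. (6)
    i = [sigmoid(z) for z in affine(*params["i"], v)]    # input gate, Eq. (7)
    # Internal state, Eq. (8): keep part of the old state, add gated new input.
    c_t = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    o = [sigmoid(z) for z in affine(*params["o"], v)]    # output gate, Eq. (9)
    h_t = [ok * math.tanh(ck) for ok, ck in zip(o, c_t)]
    return h_t, c_t


# Hidden size 1, input size 2, all weights and biases zero (for hand checking).
zero = ([[0.0, 0.0, 0.0]], [0.0])
params = {"f": zero, "g": zero, "i": zero, "o": zero}
h_t, c_t = lstm_step(x_t=[1.0, 2.0], h_prev=[0.0], c_prev=[1.0], params=params)
```

With zero weights every gate outputs 0.5 and the input node outputs 0, so the cell state is halved (from 1.0 to 0.5) and the hidden state is 0.5 · tanh(0.5). To process a flow, the step would be applied to the r packet vectors in order, carrying `h_t` and `c_t` forward.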

____________________________________________________________________________________________________

C. LEARNING TEMPORAL FEATURES WITH LSTM
An LSTM [20] is used to learn the temporal features based on the sequence of network packet vectors, as shown in Figure 3. In HAST-II, a bidirectional LSTM is used to scan the sequence both from beginning to end and in reverse. The application of LSTM in the field of natural language processing has shown that bidirectional scanning can yield more accurate features [10]. Algorithm 2 presents the details of the temporal feature learning stage.
Compared to common neural networks and the common RNN, an LSTM offers several key improvements for each neural node as described below. In the following formulas, $\sigma$ denotes the logistic sigmoid function.
Forget gate. The forget gate provides a forgetting coefficient by looking at the input layer $x_t$ and previous hidden layer $h_{t-1}$ for cell state $C_{t-1}$. The coefficient ranges from 0 to 1 and controls the information between $C_{t-1}$ and $C_t$.

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \tag{5}$$

Input node. This unit also considers the input layer $x_t$ and the previous hidden layer $h_{t-1}$. Typically, a tanh layer is used to process the summed weighted input.

$$g_t = \tanh(W_g \cdot [h_{t-1}, x_t] + b_g) \tag{6}$$

Input gate. The input gate decides which values should be updated in $C_{t-1}$, and its output multiplies the value of the input node to generate a new candidate for $C_t$.

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \tag{7}$$

Internal state. The memory unit combines the computation results mentioned above.

$$C_t = f_t * C_{t-1} + i_t * g_t \tag{8}$$

Output gate. The hidden layer $h_t$ is produced based on the internal state $C_t$ and the value of the output gate $o_t$.

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o), \qquad h_t = o_t * \tanh(C_t) \tag{9}$$

IV. EVALUATION AND DISCUSSION
This section evaluates the performance of the proposed HAST-IDS by performing various experiments on DARPA1998 and ISCX2012, two commonly used public standard intrusion detection datasets. More specifically, these experiments aim to achieve the following:
• Evaluate the effect of network flow size on the performance of the HAST-IDS.
• Evaluate the effect of network packet size on the performance of the HAST-IDS.
• Evaluate the effect of the number of network packets of a network flow on the performance of the HAST-IDS.
• Evaluate the spatial-temporal features learned by the HAST-IDS using the t-SNE visualization approach.
• Show the best experimental results of the HAST-IDS and compare them with other published IDS techniques.

A. EXPERIMENTAL METHODOLOGY
1) DATASETS
The HAST-IDS learns features based on raw network traffic data; thus, the source dataset must contain raw traffic data. At present, most research methods use manually designed network traffic features. Thus, most public intrusion detection datasets, such as NSL-KDD [21] and Kyoto2009 [22], do not contain raw traffic data. From among the few public datasets that do contain raw traffic data, we choose DARPA1998 [23] and ISCX2012 [24] as our experiment datasets. The years of publication and the malware traffic types these datasets contain differ greatly; thus, they can be used to effectively evaluate the generality of the proposed method.
DARPA1998. In 1998, MIT’s Lincoln Laboratory conducted an intrusion detection evaluation project funded by DARPA. One result of this project was a network traffic dataset simulating various intrusion behaviors, namely, DARPA1998. This dataset contains both normal traffic and four types of malware traffic (i.e., DoS, Probe, U2R, and R2L) and is divided into seven weeks of training traffic and two weeks of test traffic. The famous KDD99 dataset,
which includes 41 manually designed traffic features, was derived from the DARPA1998 dataset. It has become the most commonly used dataset in the field of intrusion detection and has been used to produce numerous research results [25]. For comparison purposes, we choose DARPA1998 as one of the evaluation datasets.

The traffic format of DARPA1998 is non-split pcap, which must be split into multiple network flow files. In addition, the label files contain a few problems, such as duplicated records and incorrect labels. For example, the label file “Test/Week2/Friday” contains a record of “07/32/1998,” which is an obvious date error. Therefore, the dataset requires preprocessing before the experiments can be conducted. First, the pkt2flow tool [26] is used to split the raw network traffic data into multiple network flows. Second, every label file is checked, and all duplicated records and incorrect records are removed. Finally, we match every network flow file to the processed label files. Table 1 shows the preprocessing results for the DARPA1998 dataset.

TABLE 1. Preprocessing results of the DARPA1998 dataset
Dataset   Training count   Training percentage   Test count   Test percentage
Normal    849,991          34.46%                459,547      41.79%
DoS       1,561,231        63.29%                591,619      53.80%
Probe     48,984           1.99%                 40,317       3.67%
R2L       6,494            0.26%                 8,041        0.73%
U2R       229              0.01%                 207          0.02%
Total     2,466,929                              1,099,731

ISCX2012. In 2012, the Information Security Center of Excellence (ISCX) of the University of New Brunswick (UNB) in Canada published an intrusion detection dataset named ISCX2012. This dataset contains seven days of raw network traffic data, including normal traffic and four types of attack traffic (i.e., BruteForce SSH, DDoS, HttpDoS, and Infiltrating). Some researchers have noted that the attack types considered in KDD99 are now obsolete [27]. In contrast, the attack types of ISCX2012 are more modern and closer to reality. In addition, the percentage of attack traffic is approximately 2.8%, which makes ISCX2012 similar to real-world datasets [28].

The ISCX2012 dataset also needs to be preprocessed before the experiments can be conducted. The preprocessing method is very similar to that used for DARPA1998. Because the traffic data of 16 June has only 11 attack flows, according to the provider’s description, we remove these 11 flows and consider all traffic data of 16 June as normal. ISCX2012 was not divided by the provider into training and test datasets; therefore, we divide it into training and test datasets using a ratio of 60% to 40%, respectively. Moreover, this ratio has recently been used by many researchers, thus simplifying the comparison of our method with other methods. Table 2 shows the preprocessing results for the ISCX2012 dataset.

TABLE 2. Preprocessing results of the ISCX2012 dataset
Dataset       Training count   Training percentage   Test count   Test percentage
Normal        890,726          97.27%                593,811      97.27%
BFSSH         4,179            0.46%                 2,785        0.46%
Infiltrating  6,027            0.66%                 4,017        0.66%
HttpDoS       2,090            0.23%                 1,392        0.23%
DDoS          12,673           1.38%                 8,448        1.38%
Total         915,695                                610,453

2) METRICS
Three metrics are used to evaluate the performance of the HAST-IDS: accuracy, detection rate (DR) and FAR, which are all commonly used in the field of intrusion detection. Accuracy is used to evaluate the overall performance of the system. DR is used to evaluate the system's performance with respect to its malware traffic detection. FAR is used to evaluate misclassifications of normal traffic. Their definitions are presented below. TP is the number of instances correctly classified as X, TN is the number of instances correctly classified as Not-X, FP is the number of instances incorrectly classified as X, and FN is the number of instances incorrectly classified as Not-X.

$$\mathrm{Accuracy\ (ACC)} = \frac{TP + TN}{TP + FP + FN + TN}$$
$$\mathrm{Detection\ Rate\ (DR)} = \frac{TP}{TP + FN} \quad (10)$$
$$\mathrm{False\ Alarm\ Rate\ (FAR)} = \frac{FP}{FP + TN}$$

In addition, an important goal of the HAST-IDS is to reduce the FAR as much as possible while improving the DR. To comprehensively evaluate the HAST-IDS considering both the DR and FAR, the effectiveness measure (EM) proposed by [29] is used in our research to compare the HAST-IDS with other published methods. The EM is slightly modified, and its definition is given below, where C is the number of classes and 1 denotes the normal class, which is excluded from the calculation. This formula results in a high value only when the DR is high and the FAR is low. In addition, when the test dataset size is bigger, the generalization capability of the system is better; thus, the EM value is larger. This metric, which considers the DR, FAR and test dataset size, can better reflect the comprehensive performance of the HAST-IDS. In addition, EMDF is also used, which is calculated using only the DR and FAR.

$$\mathrm{Effect.\ Mea.\ (EM)} = \frac{\sum_{i=2}^{C} \frac{DR_i}{FAR_i}}{C - 1} \times \ln(\#\,\text{Testing samples}) \quad (11)$$
$$\mathrm{Effect.\ Mea.\ DF\ (EM_{DF})} = \frac{\sum_{i=2}^{C} \frac{DR_i}{FAR_i}}{C - 1}$$
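As an illustration of the definitions in (10), the following minimal sketch derives TP, TN, FP and FN for one class X in a one-vs-rest fashion and then computes ACC, DR and FAR. The label lists and the function names (`one_vs_rest_counts`, `rates`) are hypothetical and are not taken from the paper.

```python
def one_vs_rest_counts(y_true, y_pred, target):
    """Count TP, TN, FP, FN for class `target` (X) against all other classes (Not-X)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    tn = sum(1 for t, p in pairs if t != target and p != target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    return tp, tn, fp, fn


def rates(tp, tn, fp, fn):
    """Return (ACC, DR, FAR) as fractions, following the definitions in (10)."""
    acc = (tp + tn) / (tp + fp + fn + tn)
    dr = tp / (tp + fn) if (tp + fn) else 0.0
    far = fp / (fp + tn) if (fp + tn) else 0.0
    return acc, dr, far


# Hypothetical ground-truth and predicted labels for eight flows.
y_true = ["Normal", "DoS", "DoS", "Probe", "Normal", "DoS", "Normal", "Probe"]
y_pred = ["Normal", "DoS", "Normal", "Probe", "DoS", "DoS", "Normal", "Probe"]

tp, tn, fp, fn = one_vs_rest_counts(y_true, y_pred, "DoS")
acc, dr, far = rates(tp, tn, fp, fn)
print(tp, tn, fp, fn)   # 2 4 1 1
print(acc, dr, far)     # ACC = 0.75, DR = 2/3, FAR = 0.2
```

Multiplying the fractions by 100 gives percentages in the form used by the result tables.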


3) EXPERIMENTAL SETUP
Keras [30] and TensorFlow [31], which are run on the Ubuntu 16.04 64-bit OS, are used as the software frameworks. The server is a DELL R720 with 16 CPU cores and 16GB of memory. An Nvidia Tesla K40m GPU is used as the accelerator. Tables 3 and 4 describe the architectures of the deep neural networks of the HAST-I and HAST-II, respectively.

TABLE 3. DNN architectural parameters of HAST-I
Layer   Type          Filters/neurons   Stride   Pad
1       conv+ReLU     32                1        same
2       max pooling   3                 3        same
3       conv+ReLU     64                1        same
4       max pooling   3                 3        same
5       dense         1,024             --       none
6       dense         5                 --       none
7       softmax       --                --       none

TABLE 4. DNN architectural parameters of HAST-II
Layer   Type          Filters/neurons   Stride   Pad
1       conv+tanh     128               1        valid
2       max pooling   2                 2        valid
3       conv+tanh     256               1        valid
4       max pooling   2                 2        valid
5       dense         128               --       none
6       conv+tanh     192               1        valid
7       max pooling   2                 2        valid
8       conv+tanh     320               1        valid
9       max pooling   2                 2        valid
10      dense         128               --       none
11      lstm          92                --       none
12      lstm          92                --       none
13      dense         5                 --       none
14      softmax       --                --       none

B. INFLUENCE OF NETWORK FLOW SIZE
In HAST-I, each network flow must be reduced to a fixed size in the preprocessing stage due to the special requirements of CNN input data. Computer vision tasks do not have this problem because the size of every image or every video frame is fixed, and the images and video frames are easy to preprocess to an identical size. However, in the field of network traffic analysis, the number of packets in a flow is variable. Furthermore, the packet size is also variable, which causes large fluctuations among flow sizes. Table 5 shows the network flow size statistics of the two datasets.

TABLE 5. Network flow size statistics
Flow Length   DARPA1998       ISCX2012
Count         3,561,018       1,526,148
Mean          5,582           57,719
Max           116,690,260     1,276,599,890
Min           59              60
Median        120             3,019
Mode          60/21,268,245   186/10,199
Std           536             1,896,410

As shown in Table 5, large differences in network flow sizes exist in both datasets. Thus, a suitable flow size must be determined via experiments. In our experiments, the flow size ranges from 100 to 1,500 bytes. At sizes above 1,500, the performance no longer improves. The evaluation metrics are accuracy, overall DR and overall FAR. Table 6 shows the experimental results. For the DARPA1998 dataset, the system achieves similarly superior performance when the flow size is 800 and when it is between 1,000 and 1,300. Considering the training time, we choose 800 bytes as the best flow size. For the ISCX2012 dataset, the system achieves the best performance when the flow size is 600 bytes. When the flow size is larger than 700, the performance no longer improves. Tables 7 and 8 show the best experimental results of HAST-I on the DARPA1998 and ISCX2012 datasets, respectively. Note that the two best sizes differ greatly from the mean flow sizes presented in Table 5 (5,582 for DARPA1998 and 57,719 for ISCX2012). A possible explanation is that the information in the front part of a flow, which mainly corresponds to connection creation, may be more important when detecting malware traffic.

The above analysis shows that even though many differences exist between the two datasets, both schemes achieve their highest accuracy, highest DR and lowest FAR when the flow size ranges from 600 to 800. An interesting finding is that applying a similar deep neural network architecture to classify the MNIST hand-written digit dataset also achieves very high accuracy [32]. The size of an MNIST image is 784 (28*28), which is very close to our chosen size. This similarity may indicate that the architecture of deep neural networks is suitable for detecting key information in data of that size, regardless of whether the data represents images or traffic.

TABLE 6. Performance for different network flow sizes (%)
Flow Size   DARPA1998 ACC   DARPA1998 DR   DARPA1998 FAR   ISCX2012 ACC   ISCX2012 DR   ISCX2012 FAR
100         96.97           94.84          0.05            99.61          96.67         0.29
200         98.30           97.13          0.06            99.51          90.85         0.23
300         98.41           97.30          0.04            99.58          95.66         0.30
400         98.51           97.51          0.09            99.63          96.68         0.28
500         98.50           97.49          0.08            99.66          96.75         0.25
600         98.62           97.68          0.05            99.69          96.91         0.22
700         98.61           97.67          0.06            97.27          0.00          0.00
800         98.68           97.78          0.07            97.27          0.00          0.00
900         41.78           0.00           0.00            97.27          0.00          0.00
1,000       98.68           97.81          0.09            97.27          0.00          0.00
1,100       98.69           97.82          0.09            97.27          0.00          0.00
1,200       98.69           97.83          0.10            97.27          0.00          0.00
1,300       98.70           96.89          0.08            97.27          0.00          0.00
1,400       41.78           0.00           0.00            97.27          0.00          0.00
1,500       41.78           0.00           0.00            97.27          0.00          0.00

TABLE 7. Best performance of HAST-I for DARPA1998
Dataset   ACC     DR      FAR
DoS       99.51   99.10   0.02
Probe     99.32   83.35   0.01
R2L       99.81   74.19   0.02
U2R       99.99   64.25   0.02
Overall   99.68   97.78   0.07
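The fixed-size requirement discussed in section B can be illustrated with a minimal sketch. It assumes that a flow is available as a raw byte string, that only the leading bytes are kept, and that shorter flows are right-padded with zero bytes. The padding value and the function name `to_fixed_size` are assumptions made for illustration and are not stated in the text.

```python
def to_fixed_size(flow_bytes, size=800, pad_byte=0):
    """Keep the first `size` bytes of a flow, or right-pad it with `pad_byte`.

    Returns a list of exactly `size` integers in the range 0..255.
    The zero padding is an assumption, not a detail given in the paper.
    """
    head = flow_bytes[:size]
    return list(head) + [pad_byte] * (size - len(head))


# Hypothetical flows: one shorter and one longer than the chosen size.
short_flow = bytes(range(60))        # 60 bytes, the minimum size in Table 5
long_flow = bytes([0xAB]) * 5582     # 5,582 bytes, the DARPA1998 mean size

fixed_short = to_fixed_size(short_flow)   # 60 real bytes + 740 padding bytes
fixed_long = to_fixed_size(long_flow)     # truncated to the first 800 bytes
print(len(fixed_short), len(fixed_long))  # 800 800
```

Keeping the head of the flow rather than its tail matches the observation above that the front part of a flow appears to carry the most useful information.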


TABLE 8. Best performance of HAST-I for ISCX2012
Dataset        ACC     DR      FAR
BFSSH          99.99   99.61   0.02
Infiltrating   99.98   97.68   0.04
HttpDoS        99.96   83.12   0.09
DDoS           99.76   97.94   0.07
Overall        99.69   96.91   0.22

C. INFLUENCE OF NETWORK PACKET SIZE
The next two sections measure the effects of two key parameters of HAST-II on the system performance, namely, the packet size and the number of packets per flow. Note that the number of packets per flow is generally very small in the DARPA1998 dataset. Figure 4 shows that the largest group of flows, i.e., 1,471,468, contain only one packet, accounting for 41.32% of the total, and 775,649 flows contain only two packets, accounting for 21.78% of the total. Together, these two groups account for 63.1% of the total. The most important advantage of LSTM is its ability to learn the temporal features of a long sequence of data. However, the sequence length of the flows in the DARPA1998 dataset is too short. In particular, when the number of packets is 1, no sequence exists, which makes it meaningless to apply LSTM to the DARPA1998 dataset. Thus, the evaluation experiments for HAST-II are conducted only for ISCX2012.

FIGURE 4. The number of packets per flow in the DARPA1998 dataset

During spatial feature learning in HAST-II, the input data unit of the CNNs is network packets. Similar to the discussion in section B, every packet must be reduced to a fixed size. In a network flow, the packet sizes generally differ greatly. Table 9 shows the statistics regarding packet sizes for the ISCX2012 dataset. As shown in Table 9, large differences in packet sizes exist in the ISCX2012 dataset. Section B discussed how flow size has an important effect on the system performance. Similarly, this section measures the effect of packet size on the performance of HAST-II and determines the best packet size.

TABLE 9. Statistics regarding network packet size for ISCX2012
Packet Bytes   Value
Count          118,419,538
Mean           743
Max            1,514
Min            60
Median         598
Mode           60/39,805,790
Std            676

In our experiments, the packet sizes range from 100 to 1,000. We choose the median (14; see the median row in Table 10) as the number of packets per flow. The accuracy, DR and FAR of every class of traffic are used as evaluation metrics. Table 11 shows the experimental results, from which we can see that those metrics yield the best results when the packet sizes are 100 and 200. Considering the training time, we choose 100 bytes as the best packet size, which differs greatly from the mean packet size of 743 shown in Table 9. A possible explanation is that the first 100 bytes mainly comprise the packet header, whose information may be more important for detecting intrusion.

An interesting comparison can be made with the sentiment analysis task in the field of natural language processing. Using a similar deep neural network architecture, [15] chooses 512 characters as the best sentence size, which is equivalent to the packet size in our research. However, 512 characters cover the length of most sentences, whereas the 100 bytes used in our study cover only the first small part of a network packet. Intuitively, this result may occur because the structure of a sentence is very different from that of a packet. The information in a sentence tends to follow a uniform distribution, whereas a packet can be divided into a packet header and payload; thus, its information tends to be emphasized differently.

D. INFLUENCE OF THE NUMBER OF NETWORK PACKETS
In the temporal feature learning stage of HAST-II, LSTM requires that the number of packets per flow be fixed. However, the number of packets per flow generally differs greatly. Table 10 shows the statistics regarding the number of packets per flow for the ISCX2012 dataset. The table shows that in practice, the number of packets per flow varies widely. Therefore, we must measure the effect of the number of packets per flow on the system performance and determine the best number of packets via multiple evaluation experiments.

TABLE 10. Statistics regarding number of network packets per flow for ISCX2012
Number of Packets per Flow   Value
Count                        1,526,148
Mean                         77
Max                          1,306,267
Min                          1
Median                       14
Mode                         10/177,277
Std                          2,105

In our experiments, the number of packets ranges from 6 to 30. When the number of packets is larger than 30, the performance no longer improves. The packet size is 100 bytes, as determined in section C. The accuracy, DR and FAR of every class of traffic are used as evaluation metrics. Table 12 shows the experimental results. From the table, we see that those metrics yield the best results when the number of packets is 6, 8 and 14. Considering the


TABLE 11 Effect of packet size on the performance of the HAST-IDS (%)

Packet BFSSH Infiltrating HttpDoS DDoS Overall


Size ACC DR FAR ACC DR ACC DR FAR ACC DR ACC DR FAR
100 99.96 95.76 0.01 99.95 95.29 100 99.96 95.76 0.01 99.95 95.29 100 99.96 95.76 0.01
200 99.95 95.90 0.01 99.95 95.76 200 99.95 95.90 0.01 99.95 95.76 200 99.95 95.90 0.01
300 99.96 96.19 0.01 98.51 95.07 300 99.96 96.19 0.01 98.51 95.07 300 99.96 96.19 0.01
400 99.94 95.97 0.03 96.44 94.82 400 99.94 95.97 0.03 96.44 94.82 400 99.94 95.97 0.03
500 99.93 98.99 0.01 99.90 86.08 500 99.93 98.99 0.01 99.90 86.08 500 99.93 98.99 0.01
600 99.92 98.59 0.01 99.90 85.98 600 99.92 98.59 0.01 99.90 85.98 600 99.92 98.59 0.01
700 99.96 96.55 0.01 99.91 96.31 700 99.96 96.55 0.01 99.91 96.31 700 99.96 96.55 0.01
800 99.96 96.05 0.01 99.94 93.45 800 99.96 96.05 0.01 99.94 93.45 800 99.96 96.05 0.01
900 99.96 96.05 0.01 99.94 93.45 900 99.96 96.05 0.01 99.94 93.45 900 99.96 96.05 0.01
1,000 99.94 98.92 0.01 99.75 88.00 1,000 99.94 98.92 0.01 99.75 88.00 1,000 99.94 98.92 0.01

TABLE 12 Effect of number of packets on the performance of the HAST-IDS (%)

Number BFSSH Infiltrating HttpDoS DDoS Overall


of
Packets ACC DR FAR ACC DR ACC DR FAR ACC DR ACC DR FAR
6 99.96 97.09 0.01 99.96 96.21 6 99.96 97.09 0.01 99.96 96.21 6 99.96 97.09 0.01
8 99.96 96.48 0.01 99.95 94.97 8 99.96 96.48 0.01 99.95 94.97 8 99.96 96.48 0.01
10 99.97 96.98 0.01 99.95 95.61 10 99.97 96.98 0.01 99.95 95.61 10 99.97 96.98 0.01
12 99.96 96.55 0.01 99.95 95.22 12 99.96 96.55 0.01 99.95 95.22 12 99.96 96.55 0.01
14 99.96 96.37 0.01 99.95 95.24 14 99.96 96.37 0.01 99.95 95.24 14 99.96 96.37 0.01
16 99.96 96.19 0.01 99.94 93.67 16 99.96 96.19 0.01 99.94 93.67 16 99.96 96.19 0.01
18 99.96 95.61 0.01 99.95 95.54 18 99.96 95.61 0.01 99.95 95.54 18 99.96 95.61 0.01
20 99.94 98.74 0.01 99.87 82.07 20 99.94 98.74 0.01 99.87 82.07 20 99.94 98.74 0.01
22 99.95 96.01 0.02 99.68 90.88 22 99.95 96.01 0.02 99.68 90.88 22 99.95 96.01 0.02
24 99.96 95.90 0.01 99.93 91.63 24 99.96 95.90 0.01 99.93 91.63 24 99.96 95.90 0.01
26 99.95 96.80 0.02 99.94 95.41 26 99.95 96.80 0.02 99.94 95.41 26 99.95 96.80 0.02
28 99.97 98.16 0.01 99.95 94.24 28 99.97 98.16 0.01 99.95 94.24 28 99.97 98.16 0.01
30 99.95 96.91 0.01 99.91 94.00 30 99.95 96.91 0.01 99.91 94.00 30 99.95 96.91 0.01

training time, we choose 6 as the best number of packets. This number differs greatly from the mean number of packets (77) shown in Table 10. This result probably occurs because the first few packets in a flow correspond with connection creation, whose information may be more important for detecting intrusion behaviors. This fact and its possible explanation are both very similar to those of section B. Celik et al. [33] reported similar findings (i.e., the first few packets play a relatively more important role in malware traffic detection).

Via the experiments presented in sections C and D, we finally obtain the best values for the two parameters of HAST-II for the ISCX2012 dataset, namely, the best packet size (100) and the best packet number (6). The experimental results obtained using those two parameters are shown in Table 13. Compared with Table 8, Table 13 shows that the accuracy and DR are approximately equal, while the FAR is remarkably reduced (the overall FAR falls from 0.22% to 0.02%). We conclude that HAST-II achieves a comprehensively better performance than that of HAST-I on the ISCX2012 dataset, which indicates that spatial-temporal features are more effective than single spatial features in reducing the FAR on the ISCX2012 dataset.

TABLE 13. Best performance of HAST-II for ISCX2012
Dataset        ACC     DR      FAR
BFSSH          99.96   97.09   0.01
Infiltrating   99.96   96.21   0.00
HttpDoS        99.96   92.88   0.00
DDoS           99.95   97.95   0.00
Overall        99.89   96.96   0.02

E. VISUALIZATION OF SPATIAL-TEMPORAL FEATURES USING T-SNE
The t-SNE algorithm [34] is used in this section to perform dimensionality reduction and visualization analysis for the spatial-temporal traffic features learned by the HAST-IDS. The t-SNE algorithm is a nonlinear dimensionality reduction algorithm. Compared to linear dimensionality reduction algorithms such as principal component analysis (PCA), t-SNE can obtain better low-dimensional results and better visualizations. According to the results discussed in sections B, C and D, we apply the t-SNE algorithm to the high-dimensional result vectors of the two best experiments, namely, the network flow vectors learned by HAST-I for DARPA1998 and by HAST-II for ISCX2012. Specifically, the flow vectors learned by the CNN in HAST-I or by the LSTM in HAST-II are saved before


application of the softmax classifier. The flow vector is a five-dimensional vector. In each test dataset of the two datasets, 500 samples are randomly selected from every class of traffic. The total number of samples in the ISCX2012 dataset is 2,500; however, because there are only 207 total test samples of U2R traffic, the total number of samples from the DARPA1998 dataset is 2,207. The t-SNE source code file we used was sourced from [35]. Because the flow vectors are only five-dimensional, we do not use the PCA preprocessing stage of that code file; instead, we directly apply t-SNE to reduce the dimensionality. The visualization results for the resulting low-dimensional vectors are shown in Figures 5 and 6.

From Figure 5, for the DARPA1998 dataset, the cluster effects for three classes of traffic, i.e., Normal, DoS, and Probe, are very good. The experimental data also yield the same result. The cluster effects for the R2L and U2R classes of traffic are not as good as those of the other three classes. Although a cluster effect exists, there are too many clusters, and their distances from other clusters are too small to clearly distinguish between clusters. The experimental data also yield the same results. This outcome possibly occurs because there are too few training samples of R2L and U2R compared to those of the other three classes. The numbers of training samples of the Normal, DoS, and Probe classes are 849,991, 1,561,231 and 48,984, respectively, whereas the R2L and U2R classes have only 6,494 and 229 training samples, respectively. The imbalance of training data [36] results in the HAST-IDS being unable to learn sufficient representative features, which causes a poor detection performance for these two classes. Figure 6 shows a good overall clustering effect for the ISCX2012 dataset, although the discrimination degrees of some classes, such as DDoS and Infiltrating, are not very high. The reason is also most likely an imbalance in the training samples. In general, the visualization results of the two datasets using the t-SNE algorithm are quite good, which directly demonstrates the capability of the HAST-IDS to learn spatial-temporal features and indirectly explains why the subsequent softmax classifiers achieve such a high performance.

FIGURE 5. Visualization result of flow vectors for DARPA1998

FIGURE 6. Visualization result of flow vectors for ISCX2012

F. COMPARISON WITH OTHER PUBLISHED METHODS
Researchers have proposed many intrusion detection methods, most of which are based on manually designed traffic features. In contrast, the method proposed in this paper learns spatial-temporal traffic features directly from raw network traffic data. We expect the automatically learned traffic features to be more accurate and more representative; thus, we compare the experimental results of the HAST-IDS with those of other published methods.

Table 14 shows the experimental results comparison for the DARPA1998 dataset. The compared methods are all based on KDD99, which is a dataset containing 41 manually designed features extracted from the DARPA1998 dataset. Accuracy and EM, proposed in section A, are used as the evaluation metrics. The EM metric comprehensively considers three metrics, DR, FAR and test dataset size, and can better measure the system performance. For example, the DR of the Probe class of the EID3 method is very high, but the FAR is too high, while the FAR of the U2R class of the SVM method is very low, but the DR is too low. The EM metric can be used to comprehensively compare these methods more fairly. Table 14 shows that the HAST-IDS does not always perform best on every evaluation metric. For example, the DR of the DoS class is lower than that of the EID3 method, and the FAR of U2R is higher than that of MARK-ELM. However, the HAST-IDS achieves the best EM result among all listed methods. Even without considering the test dataset size, the EMDF result of the HAST-IDS is still the best. Thus, our method achieves the best comprehensive performance among all methods. It learns more accurate features, and better features effectively result in a reduced FAR, which is consistent with the conclusion of [8].

Table 15 shows a comparison of the experimental results for the ISCX2012 dataset. This dataset was published much later than DARPA1998; thus, there are relatively fewer available corresponding experimental results. The methods listed in Table 15 all use 16 manually designed traffic features. Because most of these methods do not report the accuracy and FAR for every traffic class, we cannot compare their EM values. Based on the available


TABLE 14. Comparison with other published methods for DARPA1998

Method   DoS DR (%)   DoS FAR (%)   Probe DR (%)   Probe FAR (%)   R2L DR (%)   R2L FAR (%)   U2R DR (%)   U2R FAR (%)   Test Dataset Size   EMDF   EM
PLSSVM[37] 78.69 0.73 86.46 13.87 84.85 0.53 30.7 0.47 311,029 8.49E+01 1.07E+03
Multi-Classifier[38] 97.3 0.4 88.7 0.4 9.6 0.4 29.8 0.1 311,029 1.97E+02 2.49E+03
Random Forest[39] 98.91 3.15 55.12 0.45 66.67 5.18 100 0.13 77,287 2.34E+02 2.63E+03
Bayes Net[40] 94.6 0.2 83.8 0.13 5.2 0.6 30.3 0.3 15,437 3.07E+02 2.96E+03
JRip[40] 97.4 0.3 83.8 0.1 0.1 0.4 12.8 0.1 15,437 3.23E+02 3.11E+03
SVM[41] 76.7 0.09 81.2 0.36 11.2 0.08 21.4 0.08 10,000 3.71E+02 3.42E+03
Naive Bayes[42] 99.69 0.04 99.11 0.45 99.11 8.02 64 0.14 311,029 7.95E+02 1.01E+04
ID3[43] 99.9 0.03 99.7 0.55 93.5 0.98 49.1 0.15 311,029 9.84E+02 1.24E+04
EID3[43] 99.9 0.03 99.8 0.39 99.7 0.22 99.8 0.12 311,029 1.22E+03 1.54E+04
MARK-ELM[29] 99.96 0.02 97.42 0.03 94.94 0.05 62.87 0.01 72,793 4.11E+03 4.60E+04
HAST-IDS 99.10 0.02 83.35 0.01 74.19 0.02 64.25 0.02 1,099,731 5.05E+03 7.03E+04
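The EMDF and EM columns of Table 14 can be reproduced from the per-class DR and FAR values. The sketch below assumes that EMDF is the mean of DR/FAR over the four attack classes and that EM multiplies EMDF by the natural logarithm of the test dataset size. This reading matches the tabulated values for the two rows checked here; the function names `em_df` and `em` are hypothetical.

```python
from math import log


def em_df(dr_far_pairs):
    """Mean of DR_i / FAR_i over the attack classes (normal class excluded)."""
    return sum(dr / far for dr, far in dr_far_pairs) / len(dr_far_pairs)


def em(dr_far_pairs, n_test):
    """EMDF scaled by the natural log of the number of test samples."""
    return em_df(dr_far_pairs) * log(n_test)


# (DR, FAR) in percent for DoS, Probe, R2L and U2R, copied from Table 14.
hast_ids = [(99.10, 0.02), (83.35, 0.01), (74.19, 0.02), (64.25, 0.02)]
mark_elm = [(99.96, 0.02), (97.42, 0.03), (94.94, 0.05), (62.87, 0.01)]

hast_emdf = em_df(hast_ids)
hast_em = em(hast_ids, 1_099_731)
mark_emdf = em_df(mark_elm)
mark_em = em(mark_elm, 72_793)

print(format(hast_emdf, ".2E"), format(hast_em, ".2E"))  # 5.05E+03 7.03E+04
print(format(mark_emdf, ".2E"), format(mark_em, ".2E"))  # 4.11E+03 4.60E+04
```

Because each term is a ratio, a class with a very small FAR dominates the sum, which is why a method with a moderate DR but a very low FAR can still score highly.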

TABLE 15. Comparison with other published methods for ISCX2012 (%)

Method Normal-DR Attack-DR Accuracy FAR


MHCVF[44] 99.9 68.2 99.5 0.03
ALL-AGL[45] 99.5 93.2 95.4 0.30
KMC+NBC[46] 97.7 99.7 99.0 2.2
AMGA2-NB[11] 95.2 92.7 94.5 7.0
HAST-IDS 99.97 96.96 99.89 0.02

TABLE 16. Comparison of training time and testing time achieved using HAST-IDS with those of other published methods for DARPA1998

Method Training time Testing time


MHCVF[44] 240 min 35 min
PLSSVM[37] 104 min 46 min
Multiple-Level Hybrid Classifier (MLHC)[47] 1,443 min 4 min
HAST-IDS 58 min 1.7 min

experimental results for the compared methods, the DR of normal traffic, DR of attack traffic, accuracy and overall FAR are used as the evaluation metrics. Table 15 shows that the HAST-IDS achieves the best performances regarding the DR of normal traffic, accuracy, and overall FAR, exceeding those of the other state-of-the-art methods by 0.07%, 0.39% and 0.01%, respectively. The DR of attack traffic obtained by the HAST-IDS is lower than that of the KMC+NBC method by 2.74% and ranks second among all five methods.

Regarding the training and testing times, the input data for the HAST-IDS consists of raw network traffic; thus, the training and testing times of our method include the time required for feature extraction and feature selection. In contrast, the previously mentioned methods directly use manually designed features and do not require time for feature extraction and selection. Thus, it is not suitable to compare their training and testing times directly. However, we do list the training and testing times that exist in the literature for some methods. The experimental hardware we used is listed in Section A. Table 16 shows the comparison of the training and testing times for the DARPA1998 dataset. The authors of the MHCVF and MLHC methods did not report the FAR for every traffic class, and the corresponding EM values cannot be calculated; thus, they do not appear in Table 14. Because the ISCX2012 dataset is relatively new and few studies report training and testing times for it, we cannot perform a comparison study. Table 16 shows that the training and testing times of the HAST-IDS for the entire DARPA1998 dataset are 58 min (3,533 s) and 1.7 min (103 s), respectively. The table also shows that although the HAST-IDS times include feature extraction and selection, it still achieves the lowest training and testing times, which are 46 min and 2.3 min shorter, respectively, than the best times of the previously mentioned state-of-the-art methods. This result clearly shows the high efficiency of our proposed method for intrusion detection.

V. CONCLUSION AND FUTURE WORK
Because of the difficulty of hand-designing accurate traffic features in the field of intrusion detection, we propose the HAST-IDS, which uses deep neural networks that can automatically learn hierarchical spatial-temporal features directly from raw network traffic data. To the best of our knowledge, this is the first time that a representation/feature-learning method based on raw traffic data has been applied in the field of intrusion detection. The method uses CNNs to learn the spatial features of network packets and then uses an LSTM to learn the temporal features among multiple network packets. As a result, the proposed method obtains more accurate spatial-temporal traffic features. The method does not require any of the engineering techniques used in traditional intrusion


detection methods. The experimental results show that the
HAST-IDS effectively improves the accuracy and DR
compared to other published methods. In addition, the FAR
of many current intrusion detection methods is generally
high. Eesa et al. [8] showed that the detection rate can be
increased while the FAR can be decreased by using a better
traffic feature set. Our experimental results show that the
HAST-IDS effectively reduces the FAR because it
automatically learns the spatial-temporal features, which
improve the overall performance of the IDS.

Two problems require further study in future work. The
first involves improving the detection performance on
imbalanced datasets [36]. In the real world, the amount of
malware traffic is small compared to the amount of normal
traffic, and the proportions of different classes of malware
traffic often differ greatly. The t-SNE visualization results
and experimental data both show that the performance of
the HAST-IDS is not good enough for the classes of traffic
with fewer samples. We will focus on that problem in
future work. The second problem involves combining
traditional traffic features. Many published research results
show that in certain cases, some manually designed traffic
features can be very useful. To further improve the system
performance, the usefulness of either an integration of those
features into the HAST-IDS framework or the use of an
ensemble learning method is worth exploring.

Combining the previous research results [13][48], we
conclude that deep neural networks can automatically learn
features directly from raw network traffic data and achieve
good results in the field of intrusion detection or network
anomaly detection. The preliminary experimental results
are promising. Following up on this idea, we will continue
to research the application of deep neural networks in the
IDS field with the goal of further improving IDS
performance.

REFERENCES
[1] H. J. Liao, C. H. R. Lin, Y. C. Lin, and K. Y. Tung, “Intrusion
detection system: a comprehensive review,” Journal of Network and
Computer Application, vol. 36, no. 1, pp. 16-24, 2013.
[2] F. Zhang and D. Wang, “An effective feature selection approach for
network intrusion detection,” 2013 IEEE Eighth International
Conference on Networking, Architecture and Storage, Xi'an, 2013,
pp. 307-311.
[3] N. Hubballi, V. Suryanarayanan, “False alarm minimization
techniques in signature-based intrusion detection systems: A survey,”
Computer Communications, vol. 49, no. 1, pp. 1-17, 2014.
[4] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: a
review and new perspectives,” IEEE Trans. Pattern Anal. Mach.
Intell., vol. 35, pp. 1798-1828, Aug. 2013.
[5] T. Ma, F. Wang, J. Cheng, Y. Yu, and X. Chen, “A hybrid spectral
clustering and deep neural network ensemble algorithm for intrusion
detection in sensor networks,” Sensors, vol. 16, no. 10, pp. 1701,
2016.
[6] Q. Niyaz, W. Sun, A. Y. Javaid, and M. Alam, “A deep learning
approach for network intrusion detection system,” Proceedings of the
9th EAI International Conference on Bio-inspired Information and
Communications Technologies (formerly BIONETICS), 2015, pp. 21-26.
[7] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning,
Cambridge, MA, USA: MIT Press, 2016.
[8] A. S. Eesa, Z. Orman, and A. M. A. Brifcani, “A novel feature-
selection approach based on the cuttlefish optimization algorithm for
intrusion detection systems,” Expert Systems with Applications, vol.
42, pp. 2670-2679, 2015.
[9] J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, and B. Shuai,
“Recent advances in convolutional neural networks,” arXiv preprint
arXiv:1512.07108, 2017. [Online]. Available:
https://arxiv.org/abs/1512.07108
[10] Y. Goldberg, “A primer on neural network models for natural
language processing,” Journal of Artificial Intelligence Research, vol.
57, pp. 345-420, 2016.
[11] Z. Tan, A. Jamdagni, X. He, P. Nanda, R. P. Liu, and J. Hu,
“Detection of denial-of-service attacks based on computer vision
techniques,” in IEEE Transactions on Computers, vol. 64, no. 9, pp.
2519-2533, 2015.
[12] D. Ariu, R. Tronci, and G. Giacinto, “HMMPayl: An intrusion
detection system based on Hidden Markov Models,” Computers &
Security, vol. 30, no. 4, pp. 221-241, 2011.
[13] W. Wang, X. Zeng, X. Ye, Y. Sheng, and M. Zhu, “Malware traffic
classification using convolutional neural networks for representation
learning,” the 31st International Conference on Information
Networking (ICOIN), Da Nang, 2017, pp. 712-717.
[14] P. Torres, C. Catania, S. Garcia, and C. G. Garino, “An analysis of
recurrent neural networks for botnet detection behavior,” 2016 IEEE
Biennial Congress of Argentina (ARGENCON), Buenos Aires,
Argentina, 2016, pp. 1-6.
[15] S. Vafeias, “Character level models for sentiment analysis,” [Online]
Available: https://github.com/offbit/char-models
[16] J. Li, H. Xu, X. He, J. Deng, and X. Sun, “Tweet modeling with
LSTM recurrent neural networks for hashtag recommendation,” 2016
International Joint Conference on Neural Networks (IJCNN),
Vancouver, BC, 2016, pp. 1570-1577.
[17] X. Zhang and Y. LeCun, “Text understanding from scratch,” Apr.
2016, [Online] Available: http://arxiv.org/pdf/1502.01710v5.
[18] J. D. Rodríguez, A. Pérez, and J. A. Lozano, “Sensitivity analysis of
k-fold cross validation in prediction error estimation,” IEEE Trans.
Pattern Anal. Mach. Intell., vol. 32, no. 3, pp. 569-575, Mar. 2010.
[19] V. Nair and G. E. Hinton, “Rectified linear units improve restricted
Boltzmann machines,” Proc. Int'l Conf. Machine Learning, Haifa,
2010, pp. 807-814.
[20] S. Hochreiter and J. Schmidhuber, “Long short-term memory,”
Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[21] M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani, “A detailed
analysis of the KDD CUP 99 data set,” 2009 IEEE Symposium on
Computational Intelligence for Security and Defense Applications,
Ottawa, 2009, pp. 1-6.
[22] J. Song, H. Takakura, and Y. Okabe, “Description of kyoto
university benchmark data,” [Online]. Available:
http://www.takakura.com/Kyoto_data/BenchmarkData-Description-v5.pdf
[23] R. Lippman, R. Cunningham, D. Fried, et al., “Results of the
DARPA 1998 offline intrusion detection evaluation,” 1998, [Online]
Available: https://ll.mit.edu/ideval/files/RAID_1999a.pdf
[24] A. Shiravi, H. Shiravi, M. Tavallaee, and A. A. Ghorbani, “Towards
developing a systematic approach to generate benchmark datasets for
intrusion detection,” Computers & Security, vol. 31, no. 3, pp. 357-374,
2012.
[25] S. Devaraju and S. Ramakrishnan, “Performance comparison for
intrusion detection system using neural network with KDD dataset,”
ICTACT J. SOFT Comput., vol. 4, no. 3, pp. 743-752, 2014.
[26] X. M. Chen, “A simple utility to classify packets into flows,”
[Online]. Available: https://github.com/caesar0301/pkt2flow
[27] A. L. Buczak and E. Guven, “A survey of data mining and machine
learning methods for cyber security intrusion detection,” IEEE
Communications Surveys & Tutorials, vol. 18, no. 2, pp. 1153-1176,
2016.
[28] J. Song, H. Takakura, Y. Okabe, and Y. Kwon, “Correlation analysis
between honeypot data and IDS alerts using one-class SVM,”
INTECH Open Access Publisher, 2011.
[29] J. M. Fossaceca, T. A. Mazzuchi, and S. Sarkani, “MARK-ELM:
application of a novel Multiple Kernel Learning framework for
improving the robustness of network intrusion detection,” Expert
Syst. Appl., vol. 42, no. 8, pp. 4062-4080, 2015.


[30] F. Chollet, Keras, [Online] Available:
https://github.com/fchollet/keras
[31] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, et
al., “TensorFlow: Large-scale machine learning on heterogeneous
distributed systems,” arXiv preprint arXiv:1603.04467, [Online]
Available: https://arxiv.org/abs/1603.04467
[32] Y. A. LeCun, L. D. Jackel, L. Bottou, A. Brunot, C. Cortes, J. S.
Denker, et al., “Learning algorithms for classification: A comparison
on handwritten digit recognition,” Neural Networks, pp. 261-276,
1995.
[33] Z. B. Celik, R. J. Walls, P. McDaniel, and A. Swami, “Malware
traffic detection using tamper resistant features,” Military
Communications Conference, MILCOM 2015 IEEE, Tampa, FL,
2015, pp. 330-335.
[34] L. V. D. Maaten and G. E. Hinton, “Visualizing high-dimensional
data using t-SNE,” J. Machine Learning Research, vol. 9, pp. 2579-2605,
2008.
[35] L. V. D. Maaten, “Python implementation of t-SNE,” [Online]
Available: https://lvdmaaten.github.io/tsne
[36] H. He and E. A. Garcia, “Learning from imbalanced data,” IEEE
Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp.
1263-1284, 2009.
[37] F. Amiri, M. M. R. Yousefi, C. Lucas, A. Shakery, and N. Yazdani,
“Mutual information-based feature selection for intrusion detection
systems,” J. Network and Computer Applications, vol. 34, no. 4, pp.
1184-1199, 2011.
[38] M. Sabhnani and G. Serpen, “Application of machine learning
algorithms to KDD intrusion detection dataset within misuse
detection context,” Proc. Int. Conf. Mach. Learn.: Models Technol.
and Appl., Las Vegas, 2003, pp. 209-215.
[39] M. A. M. Hasan, M. Nasser, B. Pal, and S. Ahmad, “Support vector
machine and random forest modeling for intrusion detection system
(IDS),” Journal of Intelligent Learning Systems and Applications,
vol. 6, no. 1, pp. 45-52, 2014.
[40] H. A. Nguyen and D. Choi, “Application of data mining to network
intrusion detection: classifier selection model,” Challenges for Next
Generation Network Operations and Service Management, vol. 5297,
Springer-Verlag, LNCS, 2008, pp. 399-408.
[41] X. Xu, “Adaptive intrusion detection based on machine learning:
feature extraction, classifier construction and sequential pattern
prediction,” Int'l Journal of Web Services Practices, vol. 2, no. 1-2, pp.
49-58, 2006.
[42] D. M. D. Ferid and N. Harbi, “Combining naïve Bayes and decision
tree for adaptive intrusion detection,” International Journal of
Network Security and application (IJNSA), vol. 2, pp. 189-196, 2010.
[43] V. Jaiganesh, P. Rutravigneshwaran, and P. Sumathi, “An efficient
algorithm for network intrusion detection system,” International
Journal of Computer Applications, vol. 90, no. 12, pp. 0975-8887,
2014.
[44] A. Akyol, M. Hacibeyoğlu, and B. Karlik, “Design of hybrid
classifier with variant feature sets for intrusion detection system,”
IEICE Transactions on Information and Systems, vol. E99-D, no. 7,
pp. 1810-1821, 2016.
[45] H. Sallay, A. Ammar, M. B. Saad, and S. Bourouis, “A real time
adaptive intrusion detection alert classifier for high speed networks,”
2013 IEEE 12th International Symposium on Network Computing
and Applications, Cambridge, MA, 2013, pp. 73-80.
[46] W. Yassin, N. Udzir, Z. Muda, and M. Sulaiman, “Anomaly-based
intrusion detection through K-means clustering and Naives Bayes
classification,” Proceedings of the 4th International Conference on
Computing and Informatics, ICOCI 2013, 2013, pp. 298-303.
[47] C. Xiang, P.C. Yong, and L.S. Meng, “Design of multiple level
hybrid classifier for intrusion detection system using Bayesian
clustering and decision tree,” Pattern Recognition Letters, vol. 29, no.
7, pp. 918-924, 2008.
[48] W. Wang, M. Zhu, J. Wang, X. Zeng, and Z. Yang, “End-to-end
encrypted traffic classification with one-dimensional convolution
neural networks,” 2017 IEEE International Conference on
Intelligence and Security Informatics (ISI), Beijing, China, 2017, pp.
43-48.
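
The conclusion compares methods by accuracy, detection rate (DR), and false alarm rate (FAR), and notes that some prior work did not report the FAR for every traffic class. As a hedged illustration of how such per-class figures are typically derived, the sketch below computes them one-vs-rest from true and predicted labels. It assumes the usual confusion-matrix definitions (DR = TP/(TP+FN), FAR = FP/(FP+TN), accuracy = (TP+TN)/total). The function name `per_class_metrics` and the toy labels are hypothetical and are not taken from the paper's experiments.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return one-vs-rest accuracy, DR, and FAR for every class label.

    Assumed definitions (standard confusion-matrix form):
      DR  = TP / (TP + FN)
      FAR = FP / (FP + TN)
      acc = (TP + TN) / total
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    labels = sorted(set(y_true) | set(y_pred))
    pairs = Counter(zip(y_true, y_pred))  # (true, predicted) -> count
    total = len(y_true)
    results = {}
    for c in labels:
        tp = pairs[(c, c)]
        fn = sum(n for (t, p), n in pairs.items() if t == c and p != c)
        fp = sum(n for (t, p), n in pairs.items() if t != c and p == c)
        tn = total - tp - fn - fp
        results[c] = {
            "acc": (tp + tn) / total,
            # Guard against classes with no positives or no negatives.
            "dr": tp / (tp + fn) if tp + fn else float("nan"),
            "far": fp / (fp + tn) if fp + tn else float("nan"),
        }
    return results


# Hypothetical toy labels: "probe" is the rare class.
y_true = ["normal", "normal", "normal", "dos", "dos", "probe"]
y_pred = ["normal", "normal", "dos", "dos", "dos", "normal"]
metrics = per_class_metrics(y_true, y_pred)
for label, m in metrics.items():
    print(label, m)
```

On this toy data the rare `probe` class has a DR of 0 even though its one-vs-rest accuracy is high, which mirrors the point made in the conclusion about weaker performance on classes with few samples.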
