HAST-IDS Learning Hierarchical Spatial-Temporal Fe
7 authors, including:
Xuewen Zeng
Chinese Academy of Sciences
All content following this page was uploaded by Xuewen Zeng on 07 February 2018.
ABSTRACT The development of an anomaly-based intrusion detection system (IDS) is a primary research
direction in the field of intrusion detection. An IDS learns normal and anomalous behavior by analyzing
network traffic and can detect unknown and new attacks. However, the performance of an IDS is highly
dependent on feature design, and designing a feature set that can accurately characterize network traffic is
still an ongoing research issue. Anomaly-based IDSs also have the problem of a high false alarm rate (FAR),
which seriously restricts their practical applications. In this paper, we propose a novel IDS called the
hierarchical spatial-temporal features-based intrusion detection system (HAST-IDS), which first learns the
low-level spatial features of network traffic using deep convolutional neural networks (CNNs) and then
learns high-level temporal features using long short-term memory (LSTM) networks. The entire process of
feature learning is completed by the deep neural networks automatically; no feature engineering techniques
are required. The automatically learned traffic features effectively reduce the FAR. The standard
DARPA1998 and ISCX2012 datasets are used to evaluate the performance of the proposed system. The
experimental results show that the HAST-IDS outperforms other published approaches in terms of accuracy,
detection rate and FAR, which successfully demonstrates its effectiveness in both feature learning and FAR
reduction.
INDEX TERMS Network intrusion detection, deep neural networks, representation learning.
2169-3536 © 2017 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/ACCESS.2017.2780250, IEEE Access
engineering techniques are required. We expect that the automatically learned hierarchical spatial-temporal features are better at characterizing network traffic behaviors than are manually designed features and that they can effectively improve the intrusion detection capability. The experimental results successfully demonstrate the effectiveness of the proposed method for both feature learning and FAR reduction.
The rest of this paper is organized as follows. Section II describes the related work. Section III describes the design and implementation of the proposed method. Section IV mainly covers the evaluation methodology and experimental results, and Section V presents conclusions and future work.
II. RELATED WORK
A. INTRUSION DETECTION TECHNIQUES
IDSs can be classified as either signature-based or anomaly-based detection. Signature-based detection, also called misuse detection, analyzes known attacks to extract their discriminating characteristics and patterns, called signatures. These signatures are compared against the new traffic to detect intrusions. The advantages of signature-based detection are that it has a high detection rate and a low FAR for known attacks, while its disadvantage is that it cannot detect any unknown and new (0-day) attacks. Anomaly-based detection, also called behavior-based detection, mainly uses a machine learning-based method. In this approach, some network traffic features are designed first; then, a model is built based on those features using supervised or unsupervised learning approaches. The model can identify both normal and anomalous traffic patterns. Its advantage is that it can detect unknown and new attacks; thus, it has attracted increasing interest in research and industrial circles.
However, in practical applications, some problems still exist for anomaly-based intrusion detection methods. The first problem is the difficulty of designing representative traffic features. The detection effect of this method is highly dependent on the design of the traffic features used in training. The detection effects often vary widely when different feature sets of network traffic are applied. No standard guiding principle currently exists for the design of a feature set that accurately characterizes network traffic. The second serious problem of the anomaly-based intrusion detection method is its high FAR, which is a major obstacle to its practical application [3].
The representation learning approach [4] is a promising method for solving both these problems. Representation learning, also called feature learning, is a technique that allows a system to automatically extract features from raw data. Its biggest advantage is that it replaces manual feature engineering and can directly learn the best features from raw data. Deep neural networks have been the most successful technique of representation learning and have achieved remarkable results in the fields of computer vision and natural language processing. In the field of network intrusion detection, some research results obtained using the representation learning approach have recently emerged. For example, Ma et al. [5] applied deep neural networks to detect intrusion behaviors using the KDD99 dataset. Niyaz et al. [6] studied the intrusion detection method on the NSL-KDD dataset using deep belief networks. The common ground of these research methods is that their models learn features from manually designed traffic features. However, applications in the fields of computer vision and natural language processing have shown that the biggest advantage of deep neural networks lies in their ability to learn features directly from raw data [7]. The abovementioned research methods used deep neural networks based on manually designed features without taking full advantage of the deep neural networks.
The literature reveals that raw network traffic data have not been used to learn features. This gap motivates us to develop a process for learning features directly from raw network traffic data using deep neural networks to obtain a better traffic feature set and develop a more efficient IDS. Additionally, Eesa et al. [8] proved that the detection rate can be increased and the FAR decreased by using a better traffic feature set. We also hope to reduce the FAR by using an automatically learned traffic feature set.
B. DEEP NEURAL NETWORKS
CNNs and RNNs are the two most widely used deep neural network models; they are capable of learning effective spatial and temporal features, respectively. In common neural networks, every neural node of every hidden layer sums the weighted values coming from the previous layer, applies a nonlinear transform, and transfers the resulting values to the next layer. The output value of the last layer can be regarded as the representation or feature learned by the neural networks from the input data. CNNs, which improve upon the architecture of the common neural networks, benefit from the following: sparse connectivity, shared weights and pooling. CNNs are capable of learning spatial features and have already yielded impressive achievements in many computer vision tasks, such as image classification [9]. RNNs add a self-connected weighted value as the memory unit for every neural node based on the architecture of common neural networks, and they can memorize the previous state of the neural networks. Long short-term memory (LSTM) networks further add a component called the forget gate, and LSTMs can effectively learn temporal features from a long sequence. LSTM networks have already been shown to perform well for many natural language processing tasks, such as machine translation [10].
Spatial and temporal features are two commonly used types of traffic features in the field of intrusion detection. When using spatial features, network traffic is first transformed into traffic images; then, the image classification method based on geometric features is used to classify the
traffic images, which also indirectly achieves the goal of identifying the malware traffic. This approach is relatively new, but many recent research results demonstrate its great potential. For example, Tan et al. [11] applied a widely used dissimilarity measure in the computer vision domain, namely, the earth mover’s distance (EMD), to detect denial of service (DoS) traffic and achieved a good effect. When using temporal features, the temporal features of a network traffic flow are extracted and can be used to detect intrusion behaviors via the time series analysis method. This approach was developed early and has been adopted by many researchers. For example, Ariu et al. [12] developed an effective IDS using the hidden Markov model method based on the temporal relations of traffic payload bytes.
Over the past two years, a few studies have used CNNs or RNNs to perform intrusion detection tasks based on spatial and temporal features. For example, Wang et al. [13] used a CNN to learn the spatial features of network traffic and achieved malware traffic classification using the image classification method. Torres et al. [14] first transformed network traffic features into a sequence of characters and then used RNNs to learn their temporal features, which were further applied to detect malware traffic. The common point of these research methods is that they used CNNs or RNNs alone and learned a single type of traffic feature.
Network traffic has an obvious hierarchical structure as illustrated in Figure 1, where the bottom row shows a sequence of traffic bytes. Based on the format of specific network protocols, multiple traffic bytes are combined to form a network packet, and multiple network packets communicated between two sides are further combined to form a network flow. Notably, these traffic bytes, network packets and network flows are similar to the characters, sentences and paragraphs in the field of natural language processing. Moreover, the task of classifying a network flow as either normal or malware is very similar to classifying a paragraph as either positive or negative, which is a common task in the field of natural language processing, namely, sentiment analysis. In some recent studies on sentiment analysis, deep neural networks were used to learn the hierarchical features of natural language and achieved good results [15][16][17]. Those studies motivated us to use deep neural networks to learn the hierarchical features of network traffic and further perform the intrusion detection task.
FIGURE 1. Hierarchical structure of network traffic (a traffic flow consists of packets, and each packet consists of bytes)
However, no studies that combine the use of CNNs and RNNs to detect network intrusion exist in the literature. To take full advantage of these two types of deep neural networks, we use both CNNs and RNNs to learn the spatial-temporal features of raw network traffic data to develop a more effective IDS.
III. HAST-IDS
A. HAST-IDS OVERVIEW
The goals of the HAST-IDS are to automatically learn the spatial-temporal features of raw network traffic data using deep neural networks and to improve the effectiveness of the IDS. The basic design concept is as follows. At the network packet level, every network packet is transformed into a two-dimensional image, the internal spatial features of which are learned by a CNN. At the network flow level, the temporal features of a sequence of network packets of a network flow are further learned by an RNN. Finally, the resulting spatial-temporal traffic features are used to classify the traffic as normal or malware.
Two implementation schemes are used for the HAST-IDS, as shown in Figure 2. HAST-I uses a CNN and learns only spatial features, while HAST-II uses both CNNs and RNNs and learns spatial-temporal features. Each scheme has different application scenarios for different types of network traffic, which will be discussed in Section IV.
The various stages of the HAST-IDS are described below.
Preprocessing. In this stage, the input raw network traffic data are transformed into the two-dimensional images required by the CNN. The basic traffic units for intrusion detection are network flows; thus, the input raw traffic data must be split into multiple network flows. Each network flow contains multiple network packets communicated between two endpoints. One-hot encoding (OHE) is used as the transformation method. In HAST-I, the first n traffic bytes of the network flow are transformed. If the OHE vector is m-dimensional, then the entire network flow can be transformed into an m*n two-dimensional image. For HAST-II, similar preprocessing is required; however, every network packet must be transformed individually. If r packets exist in a flow and if the first q bytes of every packet are transformed and the OHE vector is p-dimensional, then the entire network flow can be transformed into r different p*q two-dimensional images.
Cross-validation. The k-fold cross-validation technique is used for performance evaluation. In this technique, a dataset is randomly divided into k equal parts. In each iteration, one part is selected as the validating dataset, while all the other k-1 parts are treated as the training dataset. In our experiments, k was set to 10 because of the resulting low bias, low variance, low overfitting and good error estimate [18].
Spatial feature learning. CNNs are used to learn the spatial features of the two-dimensional traffic images. In HAST-I, the spatial features of the entire flow image are
[Figure 2: two flowcharts, labeled (HAST-I) and (HAST-II). Recoverable box labels: Start; Network Traffic Dataset; Preprocessing; 10-fold Cross-Validation; Training Dataset; Validating Dataset; Spatial Features Learning (CNN); Softmax Classifier; Is Cross-Validation Completed (Yes/No); Testing Dataset; Test & Aggregated Results; End.]
FIGURE 2. Workflow of HAST-IDS
learned from a single m*n image, and the output is a single flow vector. In HAST-II, the spatial features of every p*q packet image are learned individually, and the output is r packet vectors.
Temporal feature learning. An LSTM is used to learn the temporal features of multiple traffic vectors. In HAST-II, the LSTM further learns the temporal relations among the r packet vectors. The output is a single flow vector that represents the spatial-temporal features of the network flow.
Softmax classifier. The softmax classifier is used to determine whether the input traffic is normal or malware based on the flow vector. Softmax is a commonly used multi-class classification method in the field of machine learning.
Test and result aggregation. The fine-tuned model is tested using the test dataset. The results of each experiment are collected and analyzed.
to every network packet. The first q bytes of every packet are transformed into a p*q packet image via OHE, and each image is further processed via convolution and pooling. The final output consists of multiple packet vectors that represent the features of the individual network packets. When implemented, two convolution filters with different sizes are used to output two different flow/packet vectors, which are concatenated together as the final vector. The application of CNNs in the field of computer vision has shown that this method can obtain better spatial features. Algorithm 1 presents the details of the spatial feature learning stage.
The key components used in Algorithm 1 are as follows.
One-hot encoding. Let $x_i \in \mathbb{R}^k$ be the k-dimensional vector corresponding to the i-th traffic byte in a packet or flow. A packet or flow of length n can be encoded according to the following formula, where is the
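[Figure 3: two panels, labeled (HAST-I) and (HAST-II). In HAST-I, preprocessing produces one image of height m from the first n flow bytes. In HAST-II, preprocessing produces one image of height p per packet from the first q bytes of each packet.]
FIGURE 3. Feature learning process of HAST-IDS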
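The preprocessing stage described in Section III-A can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the default one-hot dimension of 256 (one slot per possible byte value), and the zero-padding of short flows or packets are all assumptions made for illustration. Each image is stored with one row per byte, that is, as the transpose of the m*n (or p*q) layout used in the text.

```python
from typing import List, Sequence


def one_hot_byte(value: int, dim: int = 256) -> List[int]:
    """Return a dim-dimensional one-hot vector for one byte value."""
    if not 0 <= value < dim:
        raise ValueError(f"byte value {value} is outside [0, {dim})")
    vec = [0] * dim
    vec[value] = 1
    return vec


def bytes_to_image(data: bytes, length: int, dim: int = 256) -> List[List[int]]:
    """Encode the first `length` bytes as a (length x dim) one-hot image.

    Inputs shorter than `length` are padded with all-zero rows
    (an assumption of this sketch, not stated in the paper).
    """
    rows = [one_hot_byte(b, dim) for b in data[:length]]
    rows.extend([0] * dim for _ in range(length - len(rows)))
    return rows


def flow_to_packet_images(
    packets: Sequence[bytes], r: int, q: int, dim: int = 256
) -> List[List[List[int]]]:
    """HAST-II style input: the first r packets, each as a (q x dim) image.

    Flows with fewer than r packets are padded with empty packets
    (again an assumption of this sketch).
    """
    selected = list(packets[:r])
    selected.extend(b"" for _ in range(r - len(selected)))
    return [bytes_to_image(p, q, dim) for p in selected]
```

For HAST-I, `bytes_to_image(flow_bytes, n)` would produce the single flow image. For HAST-II, `flow_to_packet_images(packets, r, q)` would produce the r packet images that are passed to the CNN one by one.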
Update the weights and biases using the RMSprop gradient descent optimization algorithm.
end
Validate model using the validating dataset.
end
Step 5: Test model
9. Test the fine-tuned model using the test dataset.
10. return the Vtemp of every network traffic image in the test dataset.
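The k-fold cross-validation loop used around training and validation can be sketched as follows. This is an illustrative sketch only. The function name `k_fold_indices`, the fixed shuffling seed, and the round-robin assignment of shuffled indices to folds are assumptions, and fold sizes differ by at most one when the dataset size is not a multiple of k.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(
    n_samples: int, k: int = 10, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (training, validating) index lists for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)         # random division of the dataset
    folds = [indices[i::k] for i in range(k)]    # k nearly equal parts
    for i in range(k):
        validating = folds[i]                    # one part validates
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validating               # the other k-1 parts train
```

With k = 10, as in the experiments, each sample appears in the validating dataset exactly once across the ten iterations.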
which includes 41 manually designed traffic features, was derived from the DARPA1998 dataset and has become the most commonly used dataset in the field of intrusion detection and has been used to produce numerous research results [25]. For comparison purposes, we choose DARPA1998 as one of the evaluation datasets.
The traffic format of DARPA1998 is non-split pcap, which must be split into multiple network flow files. In addition, the label files contain a few problems, such as duplicated records and incorrect labels. For example, the label file “Test/Week2/Friday” contains a record of “07/32/1998,” which is an obvious date error. Therefore, the dataset requires preprocessing before the experiments can be conducted. First, the pkt2flow tool [26] is used to split the raw network traffic data into multiple network flows. Second, every label file is checked, and all duplicated records and incorrect records are removed. Finally, we match every network flow file to the processed label files. Table 1 shows the preprocessing results for the DARPA1998 dataset.
TABLE 1. Preprocessing results of the DARPA1998 dataset
Dataset   Training Count   Training Percentage   Test Count   Test Percentage
Normal    849,991          34.46%                459,547      41.79%
DoS       1,561,231        63.29%                591,619      53.80%
Probe     48,984           1.99%                 40,317       3.67%
R2L       6,494            0.26%                 8,041        0.73%
U2R       229              0.01%                 207          0.02%
Total     2,466,929                              1,099,731
ISCX2012. In 2012, the Information Security Center of Excellence (ISCX) of the University of New Brunswick (UNB) in Canada published an intrusion detection dataset named ISCX2012. This dataset contains seven days of raw network traffic data, including normal traffic and four types of attack traffic (i.e., BruteForce SSH, DDoS, HttpDoS, and Infiltrating). Some researchers have noted that the attack types considered in KDD99 are now obsolete [27]. In contrast, the attack types of ISCX2012 are more modern and closer to reality. In addition, the percentage of attack traffic is approximately 2.8%, which makes ISCX2012 similar to real-world datasets [28].
The ISCX2012 dataset also needs to be preprocessed before the experiments can be conducted. The preprocessing method is very similar to that used for DARPA1998. Because the traffic data of 16 June has only 11 attack flows, according to the provider’s description, we removed these 11 flows and consider all traffic data of 16 June as normal. ISCX2012 was not divided by the provider into training and test datasets; therefore, we divide it into training and test datasets using a ratio of 60% to 40%, respectively. Moreover, this ratio has recently been used by many researchers, thus simplifying the comparison of our method with other methods. Table 2 shows the preprocessing results for the ISCX2012 dataset.
TABLE 2. Preprocessing results of the ISCX2012 dataset
Dataset       Training Count   Training Percentage   Test Count   Test Percentage
Normal        890,726          97.27%                593,811      97.27%
BFSSH         4,179            0.46%                 2,785        0.46%
Infiltrating  6,027            0.66%                 4,017        0.66%
HttpDoS       2,090            0.23%                 1,392        0.23%
DDoS          12,673           1.38%                 8,448        1.38%
Total         915,695                                610,453
2) METRICS
Three metrics are used to evaluate the performance of the HAST-IDS: accuracy, detection rate (DR) and FAR, which are all commonly used in the field of intrusion detection. Accuracy is used to evaluate the overall performance of the system. DR is used to evaluate the system's performance with respect to its malware traffic detection. FAR is used to evaluate misclassifications of normal traffic. Their definitions are presented below. TP is the number of instances correctly classified as X, TN is the number of instances correctly classified as Not-X, FP is the number of instances incorrectly classified as X, and FN is the number of instances incorrectly classified as Not-X.
$$\mathrm{Accuracy\ (ACC)} = \frac{TP + TN}{TP + FP + FN + TN}$$
$$\mathrm{Detection\ Rate\ (DR)} = \frac{TP}{TP + FN} \tag{10}$$
$$\mathrm{False\ Alarm\ Rate\ (FAR)} = \frac{FP}{FP + TN}$$
In addition, an important goal of the HAST-IDS is to reduce the FAR as much as possible while improving the DR. To comprehensively evaluate the HAST-IDS considering both the DR and FAR, the effectiveness measure (EM) proposed by [29] is used in our research to compare the HAST-IDS with other published methods. The EM is slightly modified, and its definition is given below, where C is the number of classes and 1 denotes the normal class, which is excluded from the calculation. This formula results in a high value only when the DR is high and the FAR is low. In addition, when the test dataset size is bigger, the generalization capability of the system is better; thus, the EM value is larger. This metric, which considers the DR, FAR and test dataset size, can better reflect the comprehensive performance of the HAST-IDS. In addition, EMDF is also used, which is calculated using only the DR and FAR.
$$\mathrm{Effect.\ Mea.\ (EM)} = \frac{\sum_{i=2}^{C} \frac{DR_i}{FAR_i}}{C - 1} \times \ln(\#\,\mathrm{Testing\ samples}) \tag{11}$$
$$\mathrm{Effect.\ Mea._{DF}\ (EM_{DF})} = \frac{\sum_{i=2}^{C} \frac{DR_i}{FAR_i}}{C - 1}$$
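The metrics of Section IV can be computed with a few lines of code. The sketch below is illustrative and not the authors' evaluation script. The function names are hypothetical. It assumes that EMDF is the mean DR/FAR ratio over the C-1 attack classes and that EM multiplies this mean by the natural logarithm of the number of test samples, which is how Eq. (11) reads. A class with FAR = 0 would raise a division error, and the text does not say how that case is handled.

```python
import math
from typing import Dict, Sequence, Tuple


def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Accuracy, detection rate and false alarm rate for one class X."""
    return {
        "ACC": (tp + tn) / (tp + fp + fn + tn),
        "DR": tp / (tp + fn),
        "FAR": fp / (fp + tn),
    }


def effectiveness_measure(
    dr_far_pairs: Sequence[Tuple[float, float]], n_test: int
) -> Tuple[float, float]:
    """Return (EM, EM_DF) from per-class (DR, FAR) pairs.

    `dr_far_pairs` holds the attack classes only (i = 2..C); the normal
    class is excluded, so len(dr_far_pairs) equals C - 1.
    """
    if not dr_far_pairs:
        raise ValueError("at least one attack class is required")
    em_df = sum(dr / far for dr, far in dr_far_pairs) / len(dr_far_pairs)
    return em_df * math.log(n_test), em_df
```

For example, `binary_metrics(tp=90, tn=880, fp=20, fn=10)` gives an accuracy of 0.97 and a DR of 0.9. Because the ratio DR/FAR grows quickly as the FAR approaches zero, the EM strongly rewards systems with very few false alarms.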
TABLE 8. Best performance of HAST-I for ISCX2012
Dataset       ACC (%)   DR (%)   FAR (%)
BFSSH         99.99     99.61    0.02
Infiltrating  99.98     97.68    0.04
HttpDoS       99.96     83.12    0.09
DDoS          99.76     97.94    0.07
Overall       99.69     96.91    0.22
C. INFLUENCE OF NETWORK PACKET SIZE
The next two sections measure the effects of two key parameters of HAST-II on the system performance, namely, the packet size and the number of packets per flow. Note that the number of packets per flow is generally very small in the DARPA1998 dataset. Figure 4 shows that the largest group of flows, i.e., 1,471,468, contains only one packet, accounting for 41.32% of the total, and 775,649 flows contain only two packets, accounting for 21.78% of the total. Together, these two flow sizes account for 63.1% of the total. The most important advantage of LSTM is its ability to learn the temporal features of a long sequence of data. However, the sequence length of the flows in the DARPA1998 dataset is too short. In particular, when the number of packets is 1, no sequence exists, which makes it meaningless to apply LSTM to the DARPA1998 dataset. Thus, the evaluation experiments for HAST-II are conducted only for ISCX2012.
FIGURE 4. The number of packets per flow in the DARPA1998 dataset
During spatial feature learning in HAST-II, the input data unit of the CNNs is network packets. Similar to the discussion in section B, every packet must be reduced to a fixed size. In a network flow, the packet sizes generally differ greatly. Table 9 shows the statistics regarding packet sizes for the ISCX2012 dataset. As shown in Table 9, large differences in packet sizes exist in the ISCX2012 dataset. Section B discussed how flow size has an important effect on the system performance. Similarly, this section measures the effect of packet size on the performance of HAST-II and determines the best packet size.
TABLE 9. Statistics regarding network packet size for ISCX2012
Packet Bytes          Value
Count                 118,419,538
Mean                  743
Max                   1,514
Min                   60
Median                598
Mode (value/count)    60/39,805,790
Std                   676
In our experiments, the packet sizes range from 100 to 1,000. We choose the median (14; see the median row in Table 10) as the number of packets per flow. The accuracy, DR and FAR of every class of traffic are used as evaluation metrics. Table 11 shows the experimental results, from which we can see that those metrics yield the best results when the packet sizes are 100 and 200. Considering the training time, we choose 100 bytes as the best packet size, which differs greatly from the mean packet size of 743 shown in Table 9. A possible explanation is that the first 100 bytes mainly comprise the packet header, whose information may be more important for detecting intrusion.
An interesting finding is the comparison with the sentiment analysis task in the field of natural language processing. Using a similar deep neural network architecture, [15] chooses 512 characters as the best sentence size, which is equivalent to the packet size in our research. However, 512 characters cover the length of most sentences, whereas the 100 bytes used in our study cover only the first small part of a network packet. Intuitively, this result may occur because the structure of a sentence is very different from that of a packet. The information in a sentence tends to follow a uniform distribution, whereas a packet can be divided into a packet header and payload; thus, its information tends to be emphasized differently.
D. INFLUENCE OF THE NUMBER OF NETWORK PACKETS
In the temporal feature learning stage of HAST-II, LSTM requires that the number of packets per flow be fixed. However, the number of packets per flow generally differs greatly. Table 10 shows the statistics regarding the number of packets per flow for the ISCX2012 dataset. The table shows that in practice, the number of packets per flow varies widely. Therefore, we must measure the effect of the number of packets per flow on the system performance and determine the best number of packets via multiple evaluation experiments.
TABLE 10. Statistics regarding number of network packets per flow for ISCX2012
Number of Packets per Flow   Value
Count                        1,526,148
Mean                         77
Max                          1,306,267
Min                          1
Median                       14
Mode (value/count)           10/177,277
Std                          2,105
In our experiments, the number of packets ranges from 6 to 30. When the number of packets is larger than 30, the performance no longer improves. The packet size is 100 bytes, as determined in section C. The accuracy, DR and FAR of every class of traffic are used as evaluation metrics. Table 12 shows the experimental results. From the table, we see that those metrics yield the best results when the number of packets is 6, 8 and 14. Considering the
training time, we choose 6 as the best number of packets. This number differs greatly from the mean number of packets (77) shown in Table 10. This result probably occurs because the first few packets in a flow correspond with connection creation, whose information may be more important for detecting intrusion behaviors. This fact and its possible explanation are both very similar to those of section B. Celik et al. [33] reported similar findings (i.e., the first few packets play a relatively more important role in malware traffic detection).
Via the experiments presented in sections C and D, we finally obtain the best values for the two parameters of HAST-II for the ISCX2012 dataset, namely, the best packet size (100) and the best packet number (6). The experimental results obtained using those two parameters are shown in Table 13. Compared with Table 8, Table 13 shows that the accuracy and DR are approximately equal, while the FAR is remarkably reduced. We conclude that HAST-II achieves a comprehensively better performance than that of HAST-I on the ISCX2012 dataset, which indicates that spatial-temporal features are more effective than single spatial features in reducing the FAR on the ISCX2012 dataset.
TABLE 13. Best performance of HAST-II for ISCX2012
Dataset       ACC     DR      FAR
BFSSH         99.96   97.09   0.01
Infiltrating  99.96   96.21   0.00
HttpDoS       99.96   92.88   0.00
DDoS          99.95   97.95   0.00
Overall       99.89   96.96   0.02
E. VISUALIZATION OF SPATIAL-TEMPORAL FEATURES USING T-SNE
The t-SNE algorithm [34] is used in this section to perform dimensionality reduction and visualization analysis for the spatial-temporal traffic features learned by the HAST-IDS. The t-SNE algorithm is a nonlinear dimensionality reduction algorithm. Compared to linear dimensionality reduction algorithms such as principal component analysis (PCA), t-SNE can obtain better low-dimensional results and better visualizations. According to the results discussed in sections B, C and D, we apply the t-SNE algorithm to the high-dimensional result vectors of the two best experiments, namely, the network flow vectors learned by HAST-I for DARPA1998 and by HAST-II for ISCX2012.
Specifically, the flow vectors learned by the CNN in
HAST-I or by the LSTM in HAST-II are saved before
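
As a rough illustration of the t-SNE step described in this section, the sketch below projects a set of high-dimensional vectors to two dimensions with scikit-learn's TSNE class. It is not the authors' code: the random vectors, their dimensionality (64), the three synthetic classes, the perplexity value and the helper name embed_flow_vectors are all placeholders assumed for this example. In practice, the saved flow vectors would take the place of the synthetic data.

```python
import numpy as np
from sklearn.manifold import TSNE


def embed_flow_vectors(vectors, perplexity=10.0, seed=0):
    """Project (n_flows, n_features) vectors to 2-D with t-SNE.

    Hypothetical helper for illustration; perplexity must be smaller
    than the number of flows.
    """
    tsne = TSNE(n_components=2, perplexity=perplexity,
                init="pca", random_state=seed)
    return tsne.fit_transform(vectors)


# Synthetic stand-in for learned flow vectors (assumption, not paper data):
# three loose clusters of 64-dimensional points, 20 points per cluster.
rng = np.random.default_rng(0)
centers = rng.normal(scale=5.0, size=(3, 64))
labels = np.repeat(np.arange(3), 20)
flow_vectors = centers[labels] + rng.normal(size=(60, 64))

embedding = embed_flow_vectors(flow_vectors)
print(embedding.shape)  # (60, 2)
```

Plotting the two columns of the embedding, colored by traffic class, would then give a scatter plot of the kind used for visual inspection of class separation.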
TABLE 15. Comparison with other published methods for ISCX2012 (%)
TABLE 16. Comparison of training time and testing time achieved using HAST-IDS with those of other published methods for DARPA1998
experimental results for the compared methods, the DR of normal traffic, DR of attack traffic, accuracy and overall FAR are used as the evaluation metrics. Table 15 shows that the HAST-IDS achieves the best performance regarding the DR of normal traffic, accuracy, and overall FAR, exceeding those of the other state-of-the-art methods by 0.07%, 0.39% and 0.01%, respectively. The DR of attack traffic obtained by the HAST-IDS is lower than that of the KMC+NBC method by 2.74% and ranks second among all five methods.

Regarding the training and testing times, the input data for the HAST-IDS consists of raw network traffic; thus, the training and testing times of our method include the time required for feature extraction and feature selection. In contrast, the previously mentioned methods directly use manually designed features and do not require time for feature extraction and selection. Thus, it is not appropriate to compare their training and testing times directly. However, we do list the training and testing times that exist in the literature for some methods. The experimental hardware we used is listed in Section A. Table 16 shows the comparison of the training and testing times for the DARPA1998 dataset. The authors of the MHCVF and MLHC methods did not report the FAR for every traffic class, and the corresponding EM values cannot be calculated; thus, they do not appear in Table 14. Because the ISCX2012 dataset is relatively new and few studies report training and testing times for it, we cannot perform a comparison study. Table 16 shows that the training and testing times of the HAST-IDS for the entire DARPA1998 dataset are 58 min (3,533 s) and 1.7 min (103 s), respectively. The table also shows that although the HAST-IDS times include feature extraction and selection, it still achieves the lowest training and testing times, 46 min and 2.3 min, respectively, compared to those of the previously mentioned state-of-the-art methods, which clearly shows the high efficiency of our proposed method for intrusion detection.

V. CONCLUSION AND FUTURE WORK
Because of the difficulty of hand-designing accurate traffic features in the field of intrusion detection, we propose the HAST-IDS, which uses deep neural networks that can automatically learn hierarchical spatial-temporal features directly from raw network traffic data. To the best of our knowledge, this is the first time that a representation/feature-learning method based on raw traffic data has been applied in the field of intrusion detection. The method uses CNNs to learn the spatial features of network packets and then uses an LSTM to learn the temporal features among multiple network packets. As a result, the proposed method obtains more accurate spatial-temporal traffic features. The method does not require any of the engineering techniques used in traditional intrusion
detection methods. The experimental results show that the HAST-IDS effectively improves the accuracy and DR compared to other published methods. In addition, the FAR of many current intrusion detection methods is generally high. Eesa et al. [8] showed that the detection rate can be increased while the FAR can be decreased by using a better traffic feature set. Our experimental results show that the HAST-IDS effectively reduces the FAR because it automatically learns the spatial-temporal features, which improve the overall performance of the IDS.

Two problems require further study in future work. The first involves improving the detection performance on imbalanced datasets [36]. In the real world, the amount of malware traffic is small compared to the amount of normal traffic, and the proportions of different classes of malware traffic often differ greatly. The t-SNE visualization results and experimental data both show that the performance of the HAST-IDS is not good enough for the classes of traffic with fewer samples. We will focus on that problem in future work. The second problem involves combining traditional traffic features. Many published research results show that in certain cases, some manually designed traffic features can be very useful. To further improve the system performance, the usefulness of either an integration of those features into the HAST-IDS framework or the use of an ensemble learning method is worth exploring.

Combining the previous research results [13][48], we conclude that deep neural networks can automatically learn features directly from raw network traffic data and achieve good results in the field of intrusion detection or network anomaly detection. The preliminary experimental results are promising. Following up on this idea, we will continue to research the application of deep neural networks in the IDS field with the goal of further improving IDS performance.

REFERENCES
[1] H. J. Liao, C. H. R. Lin, Y. C. Lin, and K. Y. Tung, “Intrusion detection system: a comprehensive review,” Journal of Network and Computer Application, vol. 36, no. 1, pp. 16-24, 2013.
[2] F. Zhang and D. Wang, “An effective feature selection approach for network intrusion detection,” 2013 IEEE Eighth International Conference on Networking, Architecture and Storage, Xi'an, 2013, pp. 307-311.
[3] N. Hubballi and V. Suryanarayanan, “False alarm minimization techniques in signature-based intrusion detection systems: A survey,” Computer Communications, vol. 49, no. 1, pp. 1-17, 2014.
[4] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: a review and new perspectives,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, pp. 1798-1828, Aug. 2013.
[5] T. Ma, F. Wang, J. Cheng, Y. Yu, and X. Chen, “A hybrid spectral clustering and deep neural network ensemble algorithm for intrusion detection in sensor networks,” Sensors, vol. 16, no. 10, pp. 1701, 2016.
[6] Q. Niyaz, W. Sun, A. Y. Javaid, and M. Alam, “A deep learning approach for network intrusion detection system,” Proceedings of the 9th EAI International Conference on Bio-inspired Information and Communications Technologies (formerly BIONETICS), 2015, pp. 21-26.
[7] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, Cambridge, MA, USA: MIT Press, 2016.
[8] A. S. Eesa, Z. Orman, and A. M. A. Brifcani, “A novel feature-selection approach based on the cuttlefish optimization algorithm for intrusion detection systems,” Expert Systems with Applications, vol. 42, pp. 2670-2679, 2015.
[9] J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, and B. Shuai, “Recent advances in convolutional neural networks,” arXiv preprint arXiv:1512.07108, 2017. [Online]. Available: https://arxiv.org/abs/1512.07108
[10] Y. Goldberg, “A primer on neural network models for natural language processing,” Journal of Artificial Intelligence Research, vol. 57, pp. 345-420, 2016.
[11] Z. Tan, A. Jamdagni, X. He, P. Nanda, R. P. Liu, and J. Hu, “Detection of denial-of-service attacks based on computer vision techniques,” in IEEE Transactions on Computers, vol. 64, no. 9, pp. 2519-2533, 2015.
[12] D. Ariu, R. Tronci, and G. Giacinto, “HMMPayl: An intrusion detection system based on Hidden Markov Models,” Computers & Security, vol. 30, no. 4, pp. 221-241, 2011.
[13] W. Wang, X. Zeng, X. Ye, Y. Sheng, and M. Zhu, “Malware traffic classification using convolutional neural networks for representation learning,” the 31st International Conference on Information Networking (ICOIN), Da Nang, 2017, pp. 712-717.
[14] P. Torres, C. Catania, S. Garcia, and C. G. Garino, “An analysis of recurrent neural networks for botnet detection behavior,” 2016 IEEE Biennial Congress of Argentina (ARGENCON), Buenos Aires, Argentina, 2016, pp. 1-6.
[15] S. Vafeias, “Character level models for sentiment analysis,” [Online]. Available: https://github.com/offbit/char-models
[16] J. Li, H. Xu, X. He, J. Deng, and X. Sun, “Tweet modeling with LSTM recurrent neural networks for hashtag recommendation,” 2016 International Joint Conference on Neural Networks (IJCNN), Vancouver, BC, 2016, pp. 1570-1577.
[17] X. Zhang and Y. LeCun, “Text understanding from scratch,” Apr. 2016, [Online]. Available: http://arxiv.org/pdf/1502.01710v5.
[18] J. D. Rodríguez, A. Pérez, and J. A. Lozano, “Sensitivity analysis of k-fold cross validation in prediction error estimation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 3, pp. 569-575, Mar. 2010.
[19] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” Proc. Int'l Conf. Machine Learning, Haifa, 2010, pp. 807-814.
[20] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[21] M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani, “A detailed analysis of the KDD CUP 99 data set,” 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications, Ottawa, 2009, pp. 1-6.
[22] J. Song, H. Takakura, and Y. Okabe, “Description of kyoto university benchmark data,” [Online]. Available: http://www.takakura.com/Kyoto_data/BenchmarkData-Description-v5.pdf
[23] R. Lippman, R. Cunningham, D. Fried, et al., “Results of the DARPA 1998 offline intrusion detection evaluation,” 1998, [Online]. Available: https://ll.mit.edu/ideval/files/RAID_1999a.pdf
[24] A. Shiravi, H. Shiravi, M. Tavallaee, and A. A. Ghorbani, “Towards developing a systematic approach to generate benchmark datasets for intrusion detection,” Computers & Security, vol. 31, no. 3, pp. 357-374, 2012.
[25] S. Devaraju and S. Ramakrishnan, “Performance comparison for intrusion detection system using neural network with KDD dataset,” ICTACT J. SOFT Comput., vol. 4, no. 3, pp. 743–752, 2014.
[26] X. M. Chen, “A simple utility to classify packets into flows,” [Online]. Available: https://github.com/caesar0301/pkt2flow
[27] A. L. Buczak and E. Guven, “A survey of data mining and machine learning methods for cyber security intrusion detection,” IEEE Communications Surveys & Tutorials, vol. 18, no. 2, pp. 1153-1176, 2016.
[28] J. Song, H. Takakura, Y. Okabe, and Y. Kwon, “Correlation analysis between honeypot data and IDS alerts using one-class SVM,” INTECH Open Access Publisher, 2011.
[29] J. M. Fossaceca, T. A. Mazzuchi, and S. Sarkani, “MARK-ELM: application of a novel Multiple Kernel Learning framework for improving the robustness of network intrusion detection,” Expert Syst. Appl., vol. 42, no. 8, pp. 4062-4080, 2015.