A Dynamic Graph-Based Malware Classifier
by
is by itself an NP-hard problem. Approximate graph comparison algorithms
such as Graph Edit Distance have been commonly studied in the field of graph
classification.
To address the two major weaknesses of current graph-based approaches, we
propose a dynamic and scalable graph-based malware classifier. At the time
of this proposal, this is the first attempt to generate and classify dynamic
graphs. Although dynamic analysis provides more accurate graphs, it also
generates larger graphs, which aggravates the graph comparison problem. To
address this problem, we modify an existing algorithm, Simulated Annealing,
to reduce its computational complexity. To obtain a reasonable estimate of
its effectiveness, our proposed system is compared against Classy, the
state-of-the-art graph-based system. Our results show that the proposed
classifier outperforms Classy, achieving an average classification accuracy
of 94% and a 4% false positive rate while leaving only 2% of samples
unlabeled.
Dedication
This thesis work is dedicated to my wife, Elaheh Samani, who has been a
constant source of support and encouragement during the challenges of grad-
uate school and life. I am truly thankful for having you in my life. This work
is also dedicated to my parents who have always loved me unconditionally
and whose good examples have taught me to work hard for the things that
I aspire to achieve.
Acknowledgements
Table of Contents

Abstract
Dedication
Acknowledgments
Table of Contents
List of Tables
List of Figures
Abbreviations

1 Introduction
1.1 Introduction
1.2 Summary of Contributions
1.3 Thesis Organization

2 Literature Review
2.1 Graph Representations
2.1.1 Function Call Graphs (FCGs)
2.1.2 Control Flow Graphs
2.1.3 Hybrid Graphs
2.2 Binary Unpacking
2.3 Graph Matching
2.3.1 Exact Matching
2.3.2 Inexact Matching
2.3.2.1 Graph Edit Distance
2.3.2.2 Other Inexact Graph Matching Techniques
2.4 Analyzers
2.4.1 Static Analyzers
2.4.2 Dynamic Analyzers
2.5 Detection
2.6 Concluding Remarks

3 Proposed System
3.1 Overview
3.2 Graph Generation
3.2.1 Dynamic Analyzer
3.2.2 Preprocessor
3.2.3 Graph Generator
3.3 Graph Matching
3.3.1 GED Calculator
3.3.1.1 Background
3.3.1.2 Simulated Annealing
3.3.1.3 Improved Simulated Annealing
3.4 Classification
3.4.1 Prototype Extraction
3.4.2 Clustering of Prototypes
3.4.3 Classification
3.4.4 Incremental Classification
3.5 Concluding Remarks

4 Implementation
4.0.1 System Overview
4.0.2 Module View
4.0.2.1 Instruction Tracer
4.0.2.2 Instruction Decoder
4.0.2.3 Preprocessor
4.0.2.4 Graph Generator
4.0.2.5 BitMatrix
4.0.2.6 Graph
4.0.2.7 Handler
4.0.2.8 Classifier
4.0.2.9 GEDCalculator
4.0.2.10 Simulated Annealing
4.0.2.11 Adapted Simulated Annealing
4.0.3 System Behaviour
4.0.4 Conclusion

Bibliography

Vita

List of Tables

List of Figures

3.2 Dynamic Function Call Graph of Fareit Malware
3.3 Dynamic CFG of Fareit Malware in Assembly Instructions
3.4 Dynamic CFG of Fareit Malware in Intermediate Representation

List of Symbols, Nomenclature or Abbreviations
Chapter 1
Introduction
1.1 Introduction
The signature-based method is the most common technique used by anti-virus
products to determine whether a sample is indeed malicious. It performs well
against known malicious executables. However, signatures are currently
generated through manual analysis, which is expensive, time-consuming, and
error-prone. Moreover, since this method ignores a program's functionality,
it can easily be crippled by code-obfuscation techniques such as code
re-ordering, routine re-ordering, and self-mutation [19].
Given the wide variety of obfuscation techniques employed by malware
authors, the extraction of a high-level representation of malware structure
is required. Control flow graphs (CFGs) and function call graphs (FCGs) are
the most common abstract representations of an executable. These graphs
provide distinctive characteristics of a binary that are identifiable across
strains of malware variants [16], and they represent the execution paths a
program may take. The FCG represents the inter-procedural control flow of a
program, while the CFG represents intra-procedural control flow.
Both CFGs and FCGs have been widely used as basic components of most
malware detection approaches [11, 16, 42]. However, these methods suffer
from the following limitations:
1. They resort to static analysis to generate CFGs and FCGs, mostly
by employing the PE-Explorer [79] or IDA Pro [66] disassemblers. Static
analysis can easily be bypassed by obfuscation, for example by using
packers. To alleviate this drawback, some approaches [52, 49] have
used unpacker tools to remove the obfuscation layer of the executable
before disassembling it. However, most of them are limited to a fixed
set of known packers or are restricted by the fidelity of the emulation
environment.
to make the method scalable.
4. The static approaches [28, 21, 15, 71, 56, 88, 49] cannot handle on-line
streams of malicious executables, because their classification or
clustering algorithms operate off-line.
follows:
is extremely valuable and, under some circumstances, reliable enough to
provide automatic classifications.
The rest of this thesis is organized as follows. Chapter 2 reviews
graph-based malware classification approaches that employ CFGs or FCGs and
discusses their drawbacks. Since all of these methods use static analysis,
we also provide an overview of techniques that can be used to generate
dynamic call graphs. Moreover, we summarize different graph matching
techniques and consider their advantages and disadvantages. Finally, we
provide an overview of detection methods that are suitable for graph
clustering.
Chapter 3 provides a general overview of the system, followed by a description
of each module. It explains how an input binary sample is passed through
the system to be classified as an existing or a new malware family. First,
the process of graph extraction based on dynamic analysis is introduced.
Then, the employed graph matching algorithms are discussed. Finally, the
algorithm for incremental classification of received samples is explained.
Design and implementation criteria of the proposed framework are described
in Chapter 4. This chapter introduces the main components and classes
implemented in our framework and lists their main functions. Chapter 5
reports the evaluation dataset, metrics, and comparative experimental
results of the proposed framework against an existing malware classification
approach. Finally, Chapter 6 concludes the thesis by discussing its
contributions, limitations, and possible improvements to the work done in
this thesis.
Chapter 2
Literature Review
2.1.1 Function Call Graphs (FCGs)
and black colours, respectively. The node with the green colour represents
the starting point of the FCG.
FCGs can be extracted by dynamic or static analysis. A dynamic FCG is
the record of an execution of a program, e.g., as output by a profiler. Thus,
a dynamic FCG can be exact, but only describes one run of the program.
A static FCG is a call graph intended to represent every possible run of
the program. Computing the exact static FCG is undecidable, so static
call-graph algorithms are generally based on over-approximations. That is,
every call relationship that actually occurs is represented in the graph,
possibly together with some call relationships that would never occur in
actual runs of the program.
Since the FCG provides an abstraction of a program that captures its main
functionality, it has been widely used to identify malicious programs [28,
21, 15, 71, 56, 88, 43, 49, 52]. However, these approaches have mostly
revolved around static analysis of the binary and share the common drawbacks
of any static approach. The main weakness of static analysis is that the
code analyzed may not be the code that is actually executed, e.g., when
files are packed or third-party code is downloaded and executed. Malware can
also employ a wide range of obfuscation mechanisms that render static
analysis ineffective. Recent research has widely demonstrated the
ineffectiveness of static methods in the analysis of sophisticated malware.
In general, when the malware author employs obfuscation techniques, the
graph extracted by static analysis does not reflect the real behaviour of
the executable.
Figure 2.1: An Example of Function Call Graph
Figure 2.2: Control Flow Graph Example
A CFG represents the paths that an application's code might take during its
execution. CFGs have been used in the analysis of software and have been
studied for many years [46, 59, 77]. A CFG is a directed graph in which each
node represents a statement of the program and each edge represents control
flow between statements. Examples of such statements include copy
statements, assignments, and branches. Figure 2.2 shows an example of a CFG.
CFGs are used in detecting metamorphic malware [6, 11, 4, 92, 76, 81, 1]
to alleviate the limitations of byte sequence-based methods, whose syntactic
signatures ignore program semantics. A CFG can capture the nature of an
executable and its functionality, and is therefore distinguishable across
variants of samples. Nevertheless, CFG-based approaches suffer from the same
limitation as FCG-based ones, since they resort to static analysis to
extract the graphs.
A CFG or FCG by itself does not contain enough information about malicious
samples [24, 25]. As a result, the graphs need to be improved, either by
merging them together or by adding more information to them.
Some approaches merge the CFG, FCG, and register flow graph of a given
binary [2, 9], while others enrich the generated graphs with statistical
information about dependent assembly instructions and API calls [24, 25].
Since these approaches also revolve around static analysis, they share the
same limitations as plain CFGs and FCGs.
The packer creates the semantically equivalent representation by obfuscat-
ing or encrypting the original binary and stores the result as data in a new
executable. An unpacker routine is prepended to the data, whose responsi-
bility upon invocation lies in restoring (i.e., deobfuscating or decrypting) the
data to the original representation. This reconstruction takes place solely
in memory, which prevents leaking any unpacked versions of the binary to
the disk. After unpacking, the control is handed over to the, now unpacked,
original binary that performs the intended tasks. Polymorphic variants of
a given binary can be automatically created by choosing random keys for
the encryption. However, their unpacking routines are, apart from the de-
cryption keys, largely identical. Therefore, while signatures cannot assess
the threat of the packed binary, signature matching can be used to detect
the fact that a packer program was used to create the binary. Metamorphic
variants, in contrast to polymorphic binaries, can also mutate the unpacking
routine, and may encumber detection even more.
According to Wei et al. [89], a large percentage of malicious software today
comes in packed form. Moreover, malware instances that apply multiple
recursive layers of packers are becoming more prevalent. Due to this increase,
most of the graph-based approaches are ineffective [1, 2, 4, 6, 11, 15, 21, 28,
71, 81] since they rely on static analysis for graph extraction process. To
combat this issue, some of the works employ unpacker tools [9, 16, 42, 49,
52, 56]. In their methods, before a call-graph is extracted, the binary is first
examined to determine whether it is packed or protected. To detect whether
a file is packed, they mostly use pattern-matching tools such as PEiD [47],
which contains signature databases for a series of known packers. Once a
packer has been identified, one needs to apply the appropriate unpacker.
This approach is fast and works well for the vast majority of known packers,
but the primary limitation of pattern-matching unpackers is that they are
ineffective against unknown or new packers.
An alternative solution is to use heuristics. Most heuristic unpackers are
run-time tools and may require an isolated environment. In general,
unpacking heuristics are of questionable reliability and can be evaded; for
example, they may fail in the presence of anti-virtualization and
anti-emulation techniques. Finally, more than one packing technique may be
applied at once.
Detecting malware through the use of FCGs requires means to compare FCGs
mutually, and ultimately, means to distinguish FCGs representing benign
programs from call graphs based on malware samples. This process can be
done by employing graph matching.
The process of evaluating the similarity of two graphs is commonly referred
to as graph matching. The overall aim of graph matching is to find a cor-
respondence between the nodes and edges of two graphs that satisfies some,
more or less, stringent constraints. That is, by means of the graph matching
process similar substructures in one graph are mapped to similar substruc-
tures in the other graph. Based on this matching, a dissimilarity or similarity
score can eventually be computed indicating the proximity of two graphs.
Graph matching has been the topic of numerous studies in computer science
over the last decades. Roughly speaking, graph matching techniques can be
classified into two categories, namely, exact matching and inexact matching.
In the former case, for a matching to be successful, it is required that a
strict correspondence is found between the two graphs being matched, or
at least among their sub-parts. In the latter approach this requirement is
substantially relaxed, since matchings between non-identical graphs are also
possible. That is, inexact matching algorithms are endowed with a certain
tolerance to errors and noise, enabling them to detect similarities in a
more general way than the exact matching approach. Therefore, inexact graph
matching is also referred to as error-tolerant graph matching. In the
following subsections we consider these techniques in detail.
possibilities to order the graph nodes. Consequently, for checking two graphs
for structural identity, we cannot simply compare their adjacency matrices.
The identity of two graphs is commonly established by defining a function,
called a graph isomorphism, that maps one graph onto the other.
Graph Isomorphism: Let G1 = (V1, E1, µ1) and G2 = (V2, E2, µ2) be two
graphs, where V denotes the vertex set, E the edge set, and µ the vertex
labeling. A graph isomorphism is a bijective function φ : V1 → V2
satisfying:
1. µ1(u) = µ2(φ(u)) for every u ∈ V1;
2. for every edge e1 = (u, v) ∈ E1, there exists an edge
e2 = (φ(u), φ(v)) ∈ E2;
3. for every edge e2 = (u, v) ∈ E2, there exists an edge
e1 = (φ⁻¹(u), φ⁻¹(v)) ∈ E1.
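The definition above can be checked mechanically. As a concrete illustration (not part of the thesis's system), the following sketch uses networkx's VF2 matcher; the two toy graphs and the `mu` label attribute are assumptions made for the example:

```python
import networkx as nx
from networkx.algorithms import isomorphism

# Two labeled graphs that differ only in node naming and ordering.
G1 = nx.Graph()
G1.add_nodes_from([(1, {"mu": "a"}), (2, {"mu": "b"}), (3, {"mu": "c"})])
G1.add_edges_from([(1, 2), (2, 3)])
G2 = nx.Graph()
G2.add_nodes_from([(10, {"mu": "b"}), (20, {"mu": "a"}), (30, {"mu": "c"})])
G2.add_edges_from([(20, 10), (10, 30)])

# VF2 searches for a bijection phi: V1 -> V2 preserving edges and labels.
gm = isomorphism.GraphMatcher(G1, G2,
                              node_match=lambda a, b: a["mu"] == b["mu"])
print(gm.is_isomorphic())  # True
print(gm.mapping)          # maps 1 -> 20, 2 -> 10, 3 -> 30
```

Because the three labels are distinct, the isomorphism φ found here is unique.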
Figure 2.3: Graph isomorphism
matching procedure an associated similarity or dissimilarity score can be
easily inferred. In view of this, graph isomorphism as well as sub-graph
isomorphism provide us with a basic similarity measure, which is 1 (maximum
similarity) for (sub)graph isomorphic graphs and 0 (minimum similarity) for
non-isomorphic graphs. Hence, two graphs must be completely identical, or the
smaller graph must be identically contained in the other graph, to be deemed
similar. Consequently, the applicability of this graph similarity measure is
rather limited. Consider a case where most, but not all, nodes and edges in
two graphs are identical. The rigid concept of (sub)graph isomorphism fails
in such a situation in the sense of considering the two graphs to be totally
dissimilar. Due to this observation, the formal concept of the largest common
part of two graphs is established.
Maximum common sub-graph: Let G1 = (V1 , E1 , µ1 ) and G2 = (V2 ,
E2 , µ2 ) be graphs. A common sub-graph of G1 and G2 , cs(G1 , G2 ), is a
graph G = (V, E, µ) such that there exist sub-graph isomorphisms from G
to G1 and from G to G2 . We call G a maximum common sub-graph (MCS)
of G1 and G2 , MCS(G1 , G2 ), if there exists no other common sub-graph of
G1 and G2 that has more nodes than G.
A maximum common sub-graph of two graphs represents the maximal part
of both graphs that is identical in terms of structure and labels. Note that,
in general, the maximum common sub-graph is not uniquely defined, that is,
there may be more than one common sub-graph with a maximal number of
nodes. A standard approach to computing maximum common sub-graphs is
based on solving the maximum clique problem in an association graph [57,
60]. The association graph of two graphs represents the whole set of possible
node-to-node mappings that preserve the edge structure and labels of both
graphs. Finding a maximum clique in the association graph, that is, a fully
connected maximal sub-graph, is equivalent to finding a maximum common
sub-graph.
Graph dissimilarity measures can be derived from the maximum common
sub-graph of two graphs. Intuitively speaking, the larger a maximum com-
mon sub-graph of two graphs is, the more similar are the two graphs. For
instance, in [13] such a distance measure is introduced, defined by:

    dMCS(G1, G2) = 1 − |MCS(G1, G2)| / max{|G1|, |G2|}
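The clique formulation described above can be sketched in a few lines of Python. This is an illustrative sketch, not the thesis's implementation: the toy graphs, the `mu` label attribute, and the helper names are assumptions, and for simplicity it computes the maximum common induced sub-graph.

```python
import itertools
import networkx as nx

def mcs_size(G1, G2):
    """Size of a maximum common induced sub-graph, found as a maximum
    clique in the association graph of G1 and G2."""
    A = nx.Graph()
    # One association node per label-compatible vertex pair.
    A.add_nodes_from((u, v) for u in G1 for v in G2
                     if G1.nodes[u]["mu"] == G2.nodes[v]["mu"])
    for (u1, v1), (u2, v2) in itertools.combinations(A.nodes, 2):
        # Two pairs are compatible when they preserve adjacency in both graphs.
        if u1 != u2 and v1 != v2 and G1.has_edge(u1, u2) == G2.has_edge(v1, v2):
            A.add_edge((u1, v1), (u2, v2))
    return max((len(c) for c in nx.find_cliques(A)), default=0)

def d_mcs(G1, G2):
    # The distance measure of [13]: 1 - |MCS| / max(|G1|, |G2|).
    return 1 - mcs_size(G1, G2) / max(len(G1), len(G2))

G1 = nx.Graph()
G1.add_nodes_from([(1, {"mu": "a"}), (2, {"mu": "b"}), (3, {"mu": "c"})])
G1.add_edges_from([(1, 2), (2, 3)])
G2 = nx.Graph()
G2.add_nodes_from([(1, {"mu": "a"}), (2, {"mu": "b"}), (3, {"mu": "d"})])
G2.add_edges_from([(1, 2), (2, 3)])
print(mcs_size(G1, G2))         # 2 (the common a-b edge)
print(round(d_mcs(G1, G2), 3))  # 0.333
```

Since maximum clique is itself NP-hard, this sketch is only practical for small graphs, which is exactly the scalability concern raised elsewhere in this thesis.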
Figure 2.4: Maximum common sub-graph
graphs G is a sub-graph of maximal size for which there is an isomorphic
sub-graph in all of the graphs.
Due to the intrinsic variability of the patterns under consideration and the
noise resulting from the graph extraction process, it cannot be expected that
two graphs representing the same class of objects are completely, or at least to
a large part, identical in their structure. Moreover, if the node label alphabet
L is used to describe non-discrete properties of the underlying patterns, e.g.
L ⊆ ℝ^n, it is most probable that the actual graphs differ somewhat from
their ideal model. Obviously, such noise crucially hampers the applicability
of exact graph matching techniques, and consequently exact graph matching
is rarely used in real-world applications.
In order to overcome this drawback, it is advisable to endow the graph match-
ing framework with a certain tolerance to errors. That is, the matching pro-
cess must be able to accommodate the differences of the graphs by relaxing
to some extent the underlying constraints. In the first part of this section
the concept of graph edit distance is introduced to illustrate the paradigm
of inexact graph matching. In the second part, several other approaches to
inexact graph matching are briefly discussed.
2.3.2.1 Graph Edit Distance
Graph Edit Distance (GED) [12, 70] offers an intuitive way to integrate error-
tolerance into the graph matching process and is applicable to virtually all
types of graphs. Originally, edit distance has been developed for string
matching [84] and a considerable amount of variants and extensions to the
edit distance have been proposed for strings and graphs. The key idea is
to model structural variation by edit operations reflecting modifications in
structure and labeling. A standard set of edit operations is given by inser-
tions, deletions, and substitutions of both nodes and edges. Note that other
edit operations, such as merging and splitting of nodes [3], can be useful in
certain applications.
GED calculates the minimum number of edit operations required to transform
graph G1 into graph G2. Given two graphs, the source graph G1 and
the target graph G2 , the idea of graph edit distance is to delete some nodes
and edges from G1 , relabel some of the remaining nodes, and insert some
nodes and edges in G2 , such that G1 is finally transformed into G2 . A se-
quence of edit operations e1 , . . . , ek that transform G1 into G2 is called
an edit path between G1 and G2 . In Figure 2.5 an example of an edit path
between two graphs G1 and G2 is given. This edit path consists of three edge
deletions, one node deletion, one node insertion, two edge insertions, and two
node substitutions.
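For small graphs, an optimal edit path can be computed directly, for example with networkx's A*-based implementation; the two toy graphs below are illustrative (not from the thesis), and default unit edit costs are assumed:

```python
import networkx as nx

G1 = nx.path_graph(4)   # source: path 0-1-2-3
G2 = nx.cycle_graph(4)  # target: cycle 0-1-2-3-0

# With default unit costs, a single edge insertion turns the path into
# the cycle, so the optimal edit path has cost 1.
print(nx.graph_edit_distance(G1, G2))  # 1.0
```

As the next paragraphs note, such exact computations are expensive, which is why approximation algorithms dominate in practice.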
Since exact solutions for GED are computationally expensive, a large amount
of research has been devoted to developing fast and accurate approximation
Figure 2.5: A possible edit path between two graphs
algorithms for these problems, mainly in the field of image processing and
for bio-chemical applications [86]. In the following sections we will consider
these approaches.
A survey of three different approaches to performing GED calculations is con-
ducted by Riesen et al. in [62, 68, 69]. They first give an exact GED
algorithm using A* search, but this algorithm is only suitable for small
graphs [62]. Next, A*-Beam search, a variant of A* search, which prunes
the search tree more rapidly, is used. As is to be expected, the latter algo-
rithm provides fast but suboptimal results. The last algorithm they survey
uses Munkres' bipartite graph matching algorithm as an underlying scheme.
Benchmarks show that this approach, compared to the A*-search variations,
handles large graphs well, without affecting the accuracy too much.
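The bipartite idea can be sketched as follows. This is a simplified sketch, not Riesen's exact formulation: unit node costs and a degree-based estimate of the edge costs are assumed (the real algorithm solves a secondary assignment over the incident edges), and the function name is illustrative.

```python
import numpy as np
import networkx as nx
from scipy.optimize import linear_sum_assignment

def bipartite_ged(G1, G2, c_sub=0.0, c_ins=1.0, c_del=1.0):
    """Approximate GED via the assignment heuristic: build an
    (n+m) x (n+m) cost matrix of node substitutions, deletions and
    insertions, then solve it with the Hungarian algorithm."""
    n, m = len(G1), len(G2)
    u, v = list(G1), list(G2)
    INF = 1e9
    C = np.full((n + m, n + m), INF)
    for i in range(n):
        for j in range(m):
            # Substitution: node cost plus a degree-based edge estimate.
            C[i, j] = c_sub + abs(G1.degree(u[i]) - G2.degree(v[j]))
        C[i, m + i] = c_del + G1.degree(u[i])  # delete u[i] and its edges
    for j in range(m):
        C[n + j, j] = c_ins + G2.degree(v[j])  # insert v[j] and its edges
    C[n:, m:] = 0.0                            # dummy-to-dummy entries
    rows, cols = linear_sum_assignment(C)
    return C[rows, cols].sum()

print(bipartite_ged(nx.path_graph(3), nx.path_graph(3)))  # 0.0
```

The assignment step runs in cubic time, which is why this family of methods scales to large graphs where exact search does not; the price is that the returned value only estimates the true GED.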
Justice and Hero [48] formulate the GED problem as a Binary Linear Pro-
gram, but the authors conclude that their approach is not suitable for large
graphs. Nevertheless, they derive algorithms to calculate the lower and upper
bounds of the GED in polynomial time, which can be deployed for large graph
instances as estimators of the exact GED. Inspired by the approach of Justice
and Hero [48], Zeng et al. [91] provide a new polynomial algorithm that finds
tighter upper and lower bounds for the GED problem.
In the area of malware classification there exist a few works that have used
GED to compare FCGs or CFGs. SMIT [43] is the first approach that
employs GED in malware analysis. It identifies variants using minimum cost
bipartite graph matching and the Hungarian algorithm, which finds an exact
one-to-one vertex assignment with the goal of minimizing the total mapping
cost, improving upon the greedy approach to graph matching. SIGMA [2]
also used the minimum cost bipartite graph matching to calculate the simi-
larity between their own graph representations.
Kostakis et al. [53] propose an adapted version of Simulated Annealing to com-
pute the GED. It is a local search algorithm that looks for a vertex mapping
minimizing the GED. This algorithm turns out to be both faster and more
accurate than, for example, the algorithms based on Munkres' bipartite graph
matching algorithm as applied in the approach of Hu et al. [43]. Two steps can be
distinguished in the Simulated Annealing algorithm for call graph matching.
In the first step, the algorithm determines which external functions a pair of
call graphs have in common. These functions are mapped one-to-one. Next,
the remaining functions are mapped based on the outcome of the Simulated
Annealing algorithm, which attempts to map the remaining functions in such
a way that the GED for the call graphs under consideration is minimized.
Simulated Annealing has also been used in several works [52, 49] to compute
the GED.
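A simplified sketch of such a Simulated Annealing search is given below. It is not the algorithm of Kostakis et al.: it assumes equal node counts, counts only unit edge costs, skips the external-function pre-mapping step, and uses an illustrative linear cooling schedule.

```python
import math
import random
import networkx as nx

def mapping_cost(G1, G2, phi):
    """Edit cost induced by a vertex mapping phi: V1 -> V2, with unit
    edge costs: every G1 edge not preserved by phi must be deleted and
    every G2 edge not covered by phi must be inserted."""
    matched = sum(1 for a, b in G1.edges if G2.has_edge(phi[a], phi[b]))
    return (G1.number_of_edges() - matched) + (G2.number_of_edges() - matched)

def sa_ged(G1, G2, steps=20000, t0=2.0, seed=0):
    """Local search over vertex mappings in the spirit of Simulated
    Annealing: propose swaps, always accept improvements, and accept
    worsening moves with a temperature-dependent probability."""
    rng = random.Random(seed)
    v1, v2 = list(G1), list(G2)
    phi = dict(zip(v1, rng.sample(v2, len(v2))))  # random initial mapping
    cost = best = mapping_cost(G1, G2, phi)
    for k in range(steps):
        t = t0 * (1 - k / steps) + 1e-9           # linear cooling
        a, b = rng.sample(v1, 2)                  # propose: swap two images
        phi[a], phi[b] = phi[b], phi[a]
        new = mapping_cost(G1, G2, phi)
        if new <= cost or rng.random() < math.exp((cost - new) / t):
            cost = new
            best = min(best, cost)
        else:
            phi[a], phi[b] = phi[b], phi[a]       # reject: undo the swap
    return best

print(sa_ged(nx.path_graph(5), nx.path_graph(5)))   # 0
print(sa_ged(nx.path_graph(4), nx.cycle_graph(4)))  # 1
```

Each proposal only perturbs the current mapping locally, so per-step cost is low; this is the property that makes annealing-style search attractive for the large dynamic graphs discussed later in this thesis.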
Elhadi et al. [23] employ a modified greedy approach that supports the GED
metric to find the set of best paths in the data graph that match the set
of query graph edges, and to construct the best sub-graph with a high degree
of similarity.
ing how suitable each candidate label is. The initial labeling, which is based
on the node attributes, node connectivity, and other information available,
is then refined in an iterative procedure until a sufficiently accurate labeling,
i.e. a matching of two graphs, is found. Wilson and Hancock [87] employed
a Bayesian consistency measure to derive a graph edit distance.
The general idea of spectral methods is that the eigenvalues and the eigenvec-
tors of the adjacency or Laplacian matrix of a graph are invariant with respect
to node permutation. Hence, if two graphs are isomorphic, their structural
matrices will have the same eigendecomposition. The converse, i.e., deducing
graph isomorphism from the equality of eigendecompositions, is not true
in general. However, by representing the underlying graphs by means of the
eigendecomposition of their structural matrix, the matching process of the
graphs can be conducted on some features derived from their eigendecompo-
sition. The main problem of spectral methods is that they are rather sensitive
to structural errors, such as missing or spurious nodes. Moreover, most of these
methods are purely structural, in the sense that they are only applicable to
unlabeled graphs, or they allow only severely constrained label alphabets.
Kernel methods were originally developed for vectorial representations, but
the kernel framework can be extended to graphs in a very natural way. A
number of graph kernels have been designed for graph matching [31]. A
seminal contribution is the work on convolution kernels, which provides a
general framework for dealing with complex objects that consist of simpler
parts [38]. Convolution kernels infer the similarity of complex objects from
the similarity of their parts.
A second class of graph kernels is based on the analysis of random walks in
graphs. These kernels measure the similarity of two graphs by the number of
random walks in both graphs that have all or some labels in common [7, 35].
Gartner et al. [35] show that the number of matching walks in two graphs
can be computed by means of the product graph of two graphs, without the
need to explicitly enumerate the walks. In order to handle continuous labels
the random walk kernel has been extended by Borgwardt et al. [7]. This
extension allows one to also take non-identically labeled walks into account.
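The product-graph idea can be sketched as follows for unlabeled graphs. This is an assumption-laden toy: the kernel here simply sums matching walks up to a fixed length, whereas the cited kernels use weighted (possibly infinite) sums, and the function name is made up.

```python
import numpy as np
import networkx as nx

def walk_kernel(G1, G2, length=3):
    """Count matching walks up to a given length on the direct product
    graph: walks in G1 x G2 correspond one-to-one to pairs of
    equal-length walks in G1 and G2 (the observation of Gartner et al.)."""
    P = nx.tensor_product(G1, G2)   # direct (tensor) product graph
    A = nx.to_numpy_array(P)
    total, Ak = 0.0, np.eye(len(A))
    for _ in range(length):
        Ak = Ak @ A                 # entries of A^k count walks of length k
        total += Ak.sum()
    return total

print(walk_kernel(nx.path_graph(2), nx.path_graph(2), length=1))  # 4.0
```

The key point is that no walks are ever enumerated explicitly; everything reduces to powers of the product graph's adjacency matrix.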
A third class of graph kernels is given by diffusion kernels. The kernels of
this class are defined with respect to a base similarity measure which is used
to construct a valid kernel matrix [74]. This base similarity measure only
needs to satisfy the condition of symmetry and can be defined for any kind
of objects.
2.4 Analyzers
tected environment, e.g. sandbox, and its actual behavior is captured in the
form of API/system calls, or in an instruction dump. Dynamic analysis of
malware is immune to most obfuscation techniques and has been shown to be more
effective in differentiating malware families. In this section we consider
the static analyzers and also dynamic analyzers that can be used to extract
FCGs.
IDA Pro and PE-Explorer are the most popular disassembler tools that have
been widely used in many research works to extract assembly instructions or
generate graphs from a binary.
PE-Explorer PE-Explorer is another disassembler tool, which decomposes
Portable Executable (PE) and DLL files. It is less powerful than IDA Pro;
however, [24] and [25] use this tool to extract the instructions of Windows
malware binaries. PE-Explorer is a static analysis tool and suffers from the
same limitations as IDA Pro.
user-space. These providers allow tracing of any function entries and exits
by attaching a trap immediately before each call instruction. DTrace is noti-
fied when this trap hits and automatically executes the user-defined actions.
Because DTrace can instrument programs with low overhead, it is suitable
for production environments.
Although such approaches are powerful and high-performance, they are tightly
integrated with kernels and therefore can only work in kernels that support
such features. Consequently, such tools do not work in a large class
of embedded devices because they rarely use operating systems with such
support.
in the ELF binary. Both ltrace and latrace can operate with low overhead.
for both statically and dynamically linked libraries [85]. The overhead of
Callgrind ranges from 20 to 100 times slower than native execution. Also,
similar to Pin, it is unable to capture kernel-space instructions.
with DineroIV, a memory reference tracing simulator, to generate execution
traces and perform analysis [22]. However, QEMU lacks the capability to
allow developers to model a full range of hardware devices.
In summary, full-system simulators provide an attractive platform to carry
out dynamic FCG extractions for two reasons.
developed specifically for malware analysis, such as TEMU and DECAF [90,
41]. The benefit of these systems is that they employ several transparent
techniques that try to alleviate the anti-virtualization techniques used by
malware authors. Therefore, our system can extract complete and precise
dynamic FCGs that include both user- and kernel-space behaviours.
2.5 Detection
Classy [52] is the only system that devises a new online clustering algorithm
to cluster streams of FCGs. However, its clustering algorithm makes an
assumption that limits its effectiveness: as a very first step, to determine
the candidate clusters for an incoming sample, it only considers those graphs
that have the same number of nodes as the incoming sample. Therefore, the
algorithm ignores many samples that have a different number of nodes but may
exhibit similar behaviour.
logic bombs, and can be slow and tedious to identify and disable.
For the purpose of identifying, quantifying, and expressing similarities be-
tween malware samples, most works use MCS or graph isomorphism computations,
which are proven to be NP-hard. Therefore, applying these methods to a large
number of graphs with huge numbers of nodes is impractical. Moreover, the
time complexity of the system is an important parameter that also needs to
be considered. SMIT and Classy are the only works that consider the
effectiveness of their systems in large-scale scenarios.
To address the mentioned shortcomings, dynamic analysis has been used to
generate graphs. Moreover, graph edit distance, which is known to be the
most suitable method for comparing graphs, has been extensively improved.
Analysis Type | Graph Representation | Unpacker Type | Graph Comparison Method | Ref.
Static | FCG | RL!depacker and UPX | graph maximum common vertices or edges | [88]
Chapter 3
Proposed System
that a different approach must be taken to better detect them in the future.
Finally, the additional knowledge gained from the clustering results may also
be used for prioritizing samples in the queues of other automation systems
or manual inspections.
3.1 Overview
Figure 3.1: General overview of the proposed malware detection system
the execution of an incoming sample. It also records statistics about loaded
executables and libraries, tracks the entry of tainted data to the process
space, and produces a log of function calls, including arguments and return
values that we later use to generate function call graphs.
The outputs of the dynamic analyzer are the generated trace file in hexadec-
imal format and the function-call log for each running sample.
3.2.2 Preprocessor
to the end of the current instruction line. These are the external function
calls. If no match exists for the requested address in the log file, the function
call is internal. In this case, the algorithm generates an internal function-call
name using "sub_" followed by the address (e.g., sub_0x7009453), and the
generated function-call name is appended to the end of the instruction line.
The graph generator takes the generated assembly file, which includes the
function-call names, and generates the corresponding dynamic function call
graph and dynamic control flow graph.
Dynamic function call graph generator
Static function call graph generation algorithms are not applicable to
dynamic graphs because they extract the assembly instructions of a given
sample without running it; as a result, there is not a ret instruction for
every call instruction. In dynamic analysis, by contrast, there is a ret
instruction for each call instruction. Therefore, the edge-generation methods
of static analysis are completely different from those of dynamic analysis.
Because of this, we propose a new algorithm to generate dynamic function
call graphs; its pseudo-code is given in Algorithm 2. In this algorithm, a
function object is created from the function-call name at the end of each
instruction line. The algorithm works as follows:
• For the first instruction, create a function object at the address of that
instruction.
• For each call statement or push + ret, create a function object and add
an edge from the current function to this new function object.
• For each new call statement, create or reuse a function object and add
an edge from the current function to the new or already known function
object.
• After a ret instruction, change the current function to the previous one.
Figure 3.2 shows the dynamic function call graph of the Fareit malware,
generated by our system.
Algorithm 2 Dynamic Call Graph Generation
1: function CallGraphGenerator(Assembly instruction file including function calls (AIF))
2: Read the first line of AIF
3: For the first instruction, create a Function object at its address
4: CurrentFunction ← Created function object
5: Push the address to the stack
6: while it is not the end of file do
7: Read the next line of the AIF
8: Get the address and instruction
9: if instruction = Call then
10: Create a node with the address or function name
11: Create an edge from the currentFunction to this node
12: Push the address or function name to the stack
13: CurrentFunction ← Node address or node function name
14: Read the next line of the AIF
15: end if
16: if Instruction = push then
17: Read the next line of the AIF
18: Get the address and instruction of the next line
19: if Instruction = Ret then
20: Create a node with the address or function name
21: Create an edge from the currentFunction to this node
22: Push the address or function name to the stack
23: Read the next line of the AIF
24: CurrentFunction ← Node address or node function name
25: end if
26: if Instruction = Call then
27: Create a node with the address or function name
28: Create an edge from the currentFunction to this node
29: Push the address or function name to the stack
30: CurrentFunction ← Node address or node function name
31: Read the next line of the AIF
32: end if
33: Push the address or function name to the stack
34: CurrentFunction ← Node address or node function name
35: end if
36: if Instruction = Ret then
37: Pop from stack
38: CurrentFunction ← Top of stack
39: end if
40: end while
41: end function
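The core of Algorithm 2 (ignoring the push + ret special case, which is omitted for brevity) can be sketched as a short runnable routine. The trace format below, a list of (address, mnemonic, target) records, is an illustrative assumption and not the system's actual trace layout.

```python
# Minimal sketch of Algorithm 2: build a dynamic function call graph
# from a sequence of (address, mnemonic, target) trace records.
# The push+ret call idiom is omitted here for brevity.

def build_call_graph(trace):
    """trace: list of (address, mnemonic, target) tuples."""
    nodes, edges = set(), set()          # vertices and (caller, callee) pairs
    stack = []
    addr, _, _ = trace[0]
    current = f"sub_{addr:#x}"           # first instruction starts the root function
    nodes.add(current)
    stack.append(current)
    for addr, mnemonic, target in trace[1:]:
        if mnemonic == "call":
            # external name from the function-call log, or internal sub_<addr>
            callee = target or f"sub_{addr:#x}"
            nodes.add(callee)
            edges.add((current, callee))
            stack.append(callee)
            current = callee
        elif mnemonic == "ret" and len(stack) > 1:
            stack.pop()                  # return to the caller on top of the stack
            current = stack[-1]
    return nodes, edges

# Tiny synthetic trace: the root calls A, A calls B, both return.
trace = [
    (0x1000, "start", None),
    (0x1001, "call", "A"),
    (0x2000, "call", "B"),
    (0x3000, "ret", None),
    (0x2005, "ret", None),
]
nodes, edges = build_call_graph(trace)
```

The explicit call stack mirrors the Push/Pop steps of Algorithm 2: every call pushes the callee and every ret restores the previous current function.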
Figure 3.2: Dynamic Function Call Graph of Fareit Malware
Dynamic control flow graph generator To generate dynamic control
flow graphs, we employ the method introduced by Kinder et al. [50]. The
algorithm implements multiple rounds of assembly instruction analysis inter-
leaved with dataflow analysis. In each round, the assembly instructions are
translated to an intermediate representation, from which the platform builds
a more accurate control-flow graph.
Figure 3.3 shows a small part of the dynamic CFG of the Fareit malware in
assembly format, and Figure 3.4 shows its equivalent in the intermediate language.
Graph comparison is the most essential part of our proposed system. Its
accuracy and time complexity directly determine the quality and throughput
of the system.
Figure 3.3: Dynamic CFG of Fareit Malware in Assembly Instructions
Figure 3.4: Dynamic CFG of Fareit Malware in Intermediate Representation
3.3.1.1 Background
when comparing call graphs. To circumvent this problem, the smaller of the
vertex sets V(G) and V(H) can be supplemented with disconnected (dummy)
vertices such that the resulting sets V′(G) and V′(H) are of equal size. A
mapping of a vertex v in graph G to a dummy vertex is then interpreted
as deleting vertex v from graph G, whereas the opposite mapping implies a
vertex insertion into graph H.
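As a concrete illustration, padding the smaller vertex set with dummy vertices might look like the following sketch (the dummy naming scheme eps_i is an assumption for illustration):

```python
def pad_with_dummies(vertices_g, vertices_h):
    """Pad the smaller vertex set with dummy ('epsilon') vertices so both
    sets have equal size; mapping a real vertex to a dummy then models a
    deletion from one graph / insertion into the other."""
    g, h = list(vertices_g), list(vertices_h)
    diff = len(g) - len(h)
    dummies = [f"eps_{i}" for i in range(abs(diff))]
    if diff > 0:
        h.extend(dummies)           # H was smaller: pad H
    else:
        g.extend(dummies)           # G was smaller (or equal): pad G
    return g, h
```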
Now, for a given graph matching, we can define three cost functions: Ver-
texCost, EdgeCost and RelabelCost.
The sum of these cost functions gives the graph edit distance:

λφ(G,H) = VertexCost(G,H) + EdgeCost(G,H) + RelabelCost(G,H)
Definition (Graph dissimilarity): The dissimilarity δ(G,H) between two graphs
G and H is a real value on the interval [0,1], where 0 indicates that graphs
G and H are identical, whereas a value near 1 implies that the pair is highly
dissimilar. In addition, the following constraints hold: δ(G,H) = δ(H,G)
(symmetry), δ(G,G) = 0, and δ(G,K0) = 1, where K0 is the null graph and
G ≠ K0. Finally, the dissimilarity δ(G,H) of two graphs is obtained from the
graph edit distance λφ(G,H):

δ(G,H) = λφ(G,H) / (|V(G)| + |V(H)| + |E(G)| + |E(H)|)
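In code, this normalisation is a one-liner; the function below simply evaluates the formula above given the edit distance and the vertex/edge counts:

```python
def dissimilarity(ged, vg, vh, eg, eh):
    """delta(G,H) = lambda_phi(G,H) / (|V(G)| + |V(H)| + |E(G)| + |E(H)|)."""
    return ged / (vg + vh + eg + eh)
```

For identical graphs the edit distance is 0 and so is δ; comparing G against the null graph K0 costs |V(G)| + |E(G)| edit operations, which equals the denominator, giving δ = 1.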
possible bijective mappings φ between two graphs. The SA process starts
from an arbitrary bijective mapping as an initial state. Then a neighbour-
ing state in the search space is selected randomly. Neighbouring states are
created by choosing a pair of vertices in one of the graphs and swapping
their matching vertices. The difference in the cost function for the two states
determines whether the current state must be replaced by the new state or
not. We denote the difference in the cost function evaluated for two states by
∆(λφi , λφi+1 ). If the new state (bijective mapping) gives a lower value for the
cost function, it replaces the current state. Otherwise, the move is accepted
with probability e^(−β Δ(λφi, λφi+1)). SA is allowed to run for a predefined number
of steps before the value of β is increased.
The annealed parameter β is the inverse temperature used in statistical
physics. For small values of β almost any move is accepted in the process.
For β → ∞ the process is essentially a downhill move in which the SA state
will be replaced by the new bijective mapping only if the new state gives a
lower cost. The annealed parameter is introduced to overcome the problem
of getting stuck in local minima by allowing non-preferential moves.
The sequence of β values can be viewed as an annealing schedule. The schedule
comprises the initial and final values of the annealed parameter, denoted
β0 and βfinal, together with the cooling rate ε, which determines the changes
in β. In our implementation the cooling rate is a multiplicative factor that
takes values on the interval (0, 1), and the sequence of β values is determined
by βt+1 = βt/ε. We refer to the number of times that β changes as the
relaxation iterations.
There are two terminating conditions for SA. The first is achieving the mini-
mum graph edit distance; since this is precisely the problem SA is called to
solve, a lower bound is computed and used instead [52]. The second terminates
the SA process when no better solution has been identified within a certain
number of the most recent neighbouring solutions; this is the no_progress()
function in Algorithm 3.
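A generic sketch of this SA procedure is shown below, with the cost function and neighbour move abstracted away. The parameter defaults are illustrative (β0 = 4.0 and cooling by β ← β/ε, as in the text); the toy usage that follows is an assumption for demonstration, not the system's GED cost function.

```python
import math, random

def simulated_annealing(cost, swap, mapping, beta0=4.0, beta_final=1000.0,
                        eps=0.9, steps_per_beta=100, patience=500):
    """Minimise cost(mapping) over candidate bijective mappings.
    swap(m) must return a neighbour of m (two matched vertices swapped).
    Cooling: beta <- beta / eps with eps in (0, 1), as in the text."""
    random.seed(0)                      # deterministic for illustration
    beta = beta0
    best = current = mapping
    best_cost = current_cost = cost(current)
    since_improvement = 0               # mimics the no_progress() criterion
    while beta < beta_final and since_improvement < patience:
        for _ in range(steps_per_beta):
            candidate = swap(current)
            delta = cost(candidate) - current_cost
            # downhill moves are always taken; uphill ones with prob e^(-beta*delta)
            if delta <= 0 or random.random() < math.exp(-beta * delta):
                current, current_cost = candidate, current_cost + delta
            if current_cost < best_cost:
                best, best_cost = current, current_cost
                since_improvement = 0
            else:
                since_improvement += 1
        beta /= eps
    return best, best_cost

# Toy usage: recover the identity permutation of 5 elements.
target = list(range(5))
perm_cost = lambda m: sum(1 for i, v in enumerate(m) if v != target[i])
def swap_two(m):
    m = m[:]
    i, j = random.sample(range(len(m)), 2)
    m[i], m[j] = m[j], m[i]
    return m
best, best_cost = simulated_annealing(perm_cost, swap_two, [4, 3, 2, 1, 0])
```

For small β almost every move is accepted; as β grows, e^(−βΔ) vanishes for any Δ > 0 and the search degenerates to pure downhill moves, exactly as described above.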
3.3.1.3 Improved Simulated Annealing
and for each random solution, k neighbour solutions are generated. The cost
function of every solution is then calculated, the solutions are ordered by
cost, and k of the best solutions are retained, each selected with probability
e^(−β Δ(λφi, λφi+1)). The rest of the algorithm is similar to the original SA.
The modified SA procedure is given in Algorithm 4.
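One iteration of this population-based variant can be sketched as follows. The thesis describes the selection rule only informally, so the exact acceptance and top-up behaviour below is an assumption for illustration:

```python
import math, random

def improved_sa_step(population, cost, neighbour, k, beta):
    """One iteration of the modified SA: each of the k current solutions
    spawns one neighbour, all candidates are ranked by cost, and k survivors
    are chosen; a worse candidate may still survive with probability
    e^(-beta * delta) relative to the best candidate."""
    candidates = population + [neighbour(s) for s in population]
    candidates.sort(key=cost)
    best_cost = cost(candidates[0])
    survivors = []
    for s in candidates:
        delta = cost(s) - best_cost
        if delta <= 0 or random.random() < math.exp(-beta * delta):
            survivors.append(s)
        if len(survivors) == k:
            break
    # top up with the best remaining candidates if too few were accepted
    for s in candidates:
        if len(survivors) == k:
            break
        if s not in survivors:
            survivors.append(s)
    return survivors
```

Exploring k solutions in parallel lets the search follow several promising mappings at once, which is why the modified SA can find a shorter edit path in fewer relaxation iterations.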
3.4 Classification
algorithm.
A prototype in our system is a function call graph that can represent its sur-
rounding function call graphs. Extracting an optimal set of prototypes from a
data set is NP-hard; it can be approximated with clustering algorithms or
super-linear computations, but neither is suitable for efficient approximation.
We use the linear-time prototype-extraction algorithm suggested by Gonzalez
[36] (Algorithm 5), where distance[x] holds the distance between graph x and
its nearest prototype. The algorithm starts by adding the first graph in the
training set to the list of prototypes. Subsequently, the farthest graphs are
selected as prototypes one at a time, and distance[x] is recalculated for each
graph x. This process continues until the distance of every graph from its
closest prototype is less than a specified threshold, dp. The algorithm's run
time increases linearly with the number of graphs and prototypes.
The clustering phase clusters only the extracted prototypes into groups of
similar malware families and identifies the unknown samples. Algorithm 6
describes the employed hierarchical clustering algorithm. The algorithm starts
with each prototype as an individual cluster, and then iteratively determines
Algorithm 5 Prototype extraction
1: function Prototype Extraction(Graphs)
2: prototypes ← ∅
3: distance[x ] ← ∞ for all x ∈ graphs
4: while max(distance) > max dist prototype( dp ) do
5: choose z such that distance[z] = max(distance)
6: for x ∈ graphs and x 6= z do
7: if distance(x ) > Simulated Annealing(x , z, β, βf inal , , m) then
8: distance(x ) ← Simulated Annealing(x ,z, β, βf inal , , m)
9: end if
10: end for
11: add z to prototypes
12: end while
13: end function
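A runnable version of Algorithm 5 is sketched below, with the SA-approximated GED replaced by a plug-in distance function. One detail is added for termination: a selected prototype's own distance is set to zero, which the pseudo-code leaves implicit.

```python
def extract_prototypes(graphs, dist, d_p):
    """Gonzalez-style farthest-point prototype extraction (Algorithm 5).
    dist(x, z) stands in for the SA-approximated GED; d_p is
    max_dist_prototype. Runs in O(#graphs * #prototypes) distance calls."""
    prototypes = []
    distance = {x: float("inf") for x in graphs}
    while max(distance.values()) > d_p:
        z = max(distance, key=distance.get)   # farthest graph from any prototype
        for x in graphs:
            if x != z:
                distance[x] = min(distance[x], dist(x, z))
        distance[z] = 0.0                     # z is now a prototype of itself
        prototypes.append(z)
    return prototypes
```

In the toy usage below, "graphs" are points on a line and the distance is the absolute difference; two natural cluster representatives emerge.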
3.4.3 Classification
While most existing approaches are restricted to batch analysis, in a real
environment we face a stream of malware every day. To handle this stream,
we process incoming samples in small chunks, for
example on a daily basis. To realize an incremental analysis, we need to keep
track of intermediate results, such as clusters determined during previous
runs of the algorithm. Fortunately, the concept of prototypes enables us to
store discovered clusters in a concise representation and, moreover, provides
a significant speed-up if used for classification. Algorithm 8 sketches the
incremental classification procedure which starts by checking input graphs
against previous prototypes, then re-clustering the remaining graphs.
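The first half of that procedure can be sketched as follows. The function and parameter names are illustrative assumptions; d_max plays the role of the acceptance threshold:

```python
def classify_chunk(incoming, prototypes, dist, d_max):
    """Incremental classification sketch (in the spirit of Algorithm 8):
    each incoming graph receives the label of its nearest stored prototype
    if that distance is below d_max; the rest are left for re-clustering,
    where new families may be discovered."""
    labelled, leftover = {}, []
    for g in incoming:
        nearest = min(prototypes, key=lambda p: dist(g, p))
        if dist(g, nearest) <= d_max:
            labelled[g] = prototypes[nearest]   # prototypes: prototype -> label
        else:
            leftover.append(g)
    return labelled, leftover
```

Because each graph is compared only against the (small) prototype set rather than all previously seen graphs, this is where the concept of prototypes provides its speed-up.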
able for graph extraction. In the preprocessing phase, traces are converted
to assembly instructions and function-call names are added to the assembly
instruction files. A new algorithm is devised to generate dynamic graphs from
assembly instructions. The graph comparison measure is the GED, which is
approximated using a revised version of the Simulated Annealing algorithm.
Finally, we adopt a stream clustering algorithm to cluster and classify streams
of call graphs.
Chapter 4
Implementation
The system has been built upon TEMU (version 1.0) [90], an open-source
whole-system dynamic binary analysis platform that provides a whole-system
view to facilitate fine-grained instrumentation while offering sufficient
efficiency. Our modular system interfaces with Intel's XED2 [44] library for
instruction decoding and with Tracecap [90] for reading and writing in-
struction traces.
The proposed classifier consists of six main components, as depicted in the
component diagram (Figure 4.1): instruction tracer, instruction decoder, pre-
processor, graph generator, graph comparator, and classifier. The instruction
tracer captures the executed instructions of an incoming binary. The instruc-
tion decoder converts the trace file to an assembly file. The generated assembly
file is preprocessed by the preprocessor, which adds the external function calls
to it. The graph generator is responsible for generating the dynamic control
flow and dynamic function call graphs. The distance between graphs is calcu-
lated by the graph comparator. Finally, the classifier classifies graphs based
on the calculated distances.
The module view shows how the proposed system is decomposed into man-
ageable software units. The elements of the module view type are modules.
A module is an implementation unit of the system that provides a coherent
unit of functionality.
The module view of the proposed system in the UML class diagram is illus-
trated in Figure 4.2, followed by the description of each class.
Figure 4.1: Proposed System Component Diagram
This class employs lib XED2 to decode the generated trace files into assembly
instructions files. It includes the following functions:
readTraceFiles: reads the generated trace files.
convertTraceToInstruction: converts hex trace file to assembly instruc-
tions file.
4.0.2.3 Preprocessor
This class prepares the generated assembly files for graph generation. Fol-
lowing are the descriptions of the functions:
TraceToAssembly: uses an instance of Instruction Decoder class to decode
the hexadecimal trace into assembly instructions.
addExternalFunctionCalls: adds the external function calls, taken from
the function-call log, to the assembly file.
4.0.2.5 BitMatrix
way.
4.0.2.6 Graph
The graph class is used to store each generated graph for graph comparison.
4.0.2.7 Handler
4.0.2.8 Classifier
4.0.2.9 GEDCalculator:
This class compares the incoming graphs. The main functions of this class
are:
calculateUpdatedCost: updates cost function after swapping two nodes.
relabelingCostForNode: calculates the cost of relabelling of each node.
edgeCostForNode: calculates the edge cost for each node.
madeSameSizeGraph: compares two graphs and adds the required number
of dummy nodes to the smaller graph to make the two graphs the same size.
generateRandomBijectiveFunction: generates bijective function for two
incoming graphs.
findNeighborSolution: finds neighbor solution for the bijective function
by swapping two random nodes.
costFunction: computes the overall cost which includes relabelling cost,
edge cost and vertex cost.
relabelingCost: computes relabelling cost for the graph.
edgeCost: calculates edge cost for the graph.
vertexCost: computes the vertex cost for the graph.
lowerBound: computes the lower bound on the GED.
no_progress: terminates the process when no progress is made.
4.0.2.11 Adapted Simulated Annealing
A sequence diagram is employed to depict the behaviour of the proposed
system and show how the modules interact. As illustrated in Figure 4.3, the
sequence diagram shows how an incoming sample passes through the different
objects of the system to receive its final label.
4.0.4 Conclusion
This chapter presents the design and implementation of the proposed system.
Two different architectural views are used to show different aspects of the
system. The module view, which employs UML class diagram, shows the
implementation modules of the system and behavioural view (sequence dia-
gram) depicts the interactions between objects in the sequential order that
those interactions occur.
Figure 4.2: Proposed System Class Diagram
Figure 4.3: Proposed System Sequence Diagram
Chapter 5
samples to an acceptable level. We built such a dataset by selecting a reason-
able mix of different benign and malicious code variants currently popular
on the Internet from the following well-known sources: Ether dataset [20],
Malicia dataset [65], VirusTotal repository [80], VirusShare [72] and Virus-
Sign [73].
Our dataset contains 9,850 benign executables and 40,000 malware samples
from 346 different families. We added diversity by selecting malware from
different categories (viruses, rootkits, etc.) and with different packers. Table
5.1 shows the distribution of malware types, and the distribution of the 18
most popular packers observed in our dataset is reported in Table 5.2. Packers
with a frequency of fewer than 20 samples have been grouped under the
"Other" category. We scanned all samples with 57 online AVs and selected
the Microsoft AV for labeling the dataset, as it successfully labeled the largest
number of samples.
Table 5.2: Distribution of Packers Types in the Dataset
Figure 5.1: Accuracy for different parameter values
get the highest accuracy and the minimum number of rejected samples, the
value of max_dist_prototype should be between 0.25 and 0.4, min_dist needs
to be from 0.85 to 0.9, and min_dist_cluster has to be tuned from 0.7 to 0.9.
Figure 5.2: Rejected samples for different parameter values
dynamic graphs. All experiments are conducted on an Intel Core i7-3770
Quad-Core Processor with 3.4 GHz and 16GB memory.
Since the proposed system uses dynamic analysis to generate FCGs, it re-
quires time to run each binary in a protected environment. Each sample is
run for a fixed amount of time (3 minutes) to capture its traces. We exclude
the dynamic analyzer's running time when evaluating the time complexity of
the graph-generation phase. We consider the time requirements of the pre-
processing and extraction steps separately in order to compare static and
dynamic graph extraction. In static graph extraction, preprocessing refers to
the unpacking phase of a binary. Figures 5.3 and 5.4 depict the run-time
performance of the preprocessing and graph-extraction phases, respectively.
In static graph extraction the preprocessing time depends on the packing
technique and the size of the binary, while in dynamic graph extraction it
depends on the size of the trace file. Depending on the packing technique, 55%
of samples can be preprocessed in less than 100 seconds, while fewer than 10
percent of samples require more than 250 seconds for unpacking. The pre-
processing time for dynamic graphs is higher than for static graphs: only 22%
of trace files can be preprocessed in less than 100 seconds, and around 26% of
samples need more than 250 seconds. A high preprocessing time for dynamic
graphs is expected, since the average size of the trace files is more than 3 GB,
which requires significant
Figure 5.3: Graph extraction comparison
Figure 5.4: Preprocessing comparison
at each step of the graph-generation phase, and also the size of the incoming
binaries. As can be seen, the trace files and assembly files generated in the
dynamic analyser and preprocessor steps are extremely large and can exceed
10 GB. However, the dynamic FCGs generated in the graph-generator step
are 48 and 15 times smaller than the trace files and assembly files, respec-
tively. The size of the generated dynamic FCGs is also comparable to that of
static FCGs and, in most cases, similar to it.
The outputs of the preprocessing and graph-extraction steps are the generated
function call graphs. The percentages of extracted graphs are reported in
Table 5.4. The proposed dynamic method generates the corresponding FCG
for every incoming sample, as reported in Table 5.4. The percentage of ex-
tracted static graphs depends on the preprocessing step. If the graphs are
extracted directly from IDA Pro without any unpacker tools, the percentage
of generated graphs is 76.29%: IDA Pro cannot disassemble samples that
employ packing techniques and as a result cannot generate the desired FCGs.
By employing an unpacker, the percentage of generated FCGs increases by
around 14%. Even though this improves the percentage of generated graphs,
most of the samples still cannot be unpacked, and IDA Pro therefore cannot
generate the corresponding FCGs.
The most important part of the system is the graph-matching phase, which
directly affects the performance and effectiveness of the proposed system. As
in Classy, the original SA and the Improved SA are used with parameter
values β0 = 4.0 and ε = 0.9. Figures 5.5 and 5.6 show the time required in
the comparison phase for static and dynamic graphs, respectively.
Figure 5.5: Time requirement comparison original SA
Only a small difference in computational cost can be observed when using the
proposed improved approximation algorithm: the time requirement for 80%
of static graphs is between 1 and 20 centiseconds with the proposed algorithm,
whereas the original SA computes the GED for around 75% of graphs in the
same time. The case is similar for dynamic graphs, where 78% of graphs are
compared in 100 to 300 centiseconds using the Improved SA, while the original
approximates the GED for around 74% of graphs in the same time.
As can be seen in Figures 5.5 and 5.6, the time required to compare dynamic
graphs is approximately 10 times greater than for static graphs. The reason
is that the numbers of nodes and edges of the generated dynamic graphs are
significantly higher than those of static graphs. Based on the statistics in
Figures 5.8 and 5.7, 90% of static graphs have 10 to 2,000 nodes, while 70%
of dynamic graphs have more than 2,000 nodes. For the number of edges the
case is even worse: around 90% of dynamic graphs have more than 2,000
edges, whereas around 85% of static graphs have fewer than 2,000 edges.
Therefore, due to the large numbers of nodes and edges in dynamic graphs,
a significant amount of time is required to compare them relative to static
ones.
5.3.1.3 Classification
The purpose of this part is to show the power of our system in detecting
different malware families. As shown in Figure 5.9, several experiments have been
Figure 5.7: Distribution of number of edges
conducted to demonstrate the effectiveness of all system modules. Based on
the results, our system (Dynamic Modified SA in Figure 5.9) shows consistent
results while meeting the performance goals of a low rejection rate and close-
to-optimal accuracy. At the end of day 5, our system is able to classify samples
with an average accuracy of 94% while leaving only 2% of samples unlabeled.
In contrast, with static graphs the average accuracy would be 68% with 7%
unlabeled samples.
The results also show that the Improved SA performs better than the original
SA, with a 2-6 percent increase in accuracy. In all cases (static graphs, dy-
namic graphs), the accuracy obtained with the Improved SA exceeds that of
the original SA. This is because the Improved SA discovers a shorter edit
path, so the GED is better approximated, which improves the system accu-
racy significantly.
Moreover, it can be seen that employing an unpacker does not improve the
accuracy of the system compared to using dynamic graphs. The primary
drawback of static unpackers is that they are limited to a set of known packing
techniques and are unable to unpack new and unknown packers. Since our
dataset contains a diverse set of packers, the employed unpacker tool can
unpack only the small percentage of binaries packed with UPX, NsPack, and
Upack.
Figure 5.9: Classification Performance
5.3.2 Comparative Evaluation with Classy
Classy applies an in-house unpacker to binaries and then feeds them into
IDA Pro to extract static FCGs. A Simulated Annealing algorithm is used to
measure the similarity between extracted graphs, and a new online algorithm
clusters the incoming graphs. Since the authors do not name their unpacker,
we employ PE Explorer [78] to unpack the incoming binaries. To compare
Classy with the proposed system, we use both dynamic and static graphs as
input to Classy, in order to assess the effectiveness of both the proposed
dynamic FCGs and the online machine-learning method. Results of the com-
parison are presented in Figure 5.10.
The proposed system outperforms Classy, reaching 94% accuracy and leaving
only a few percent of samples unlabelled, while Classy reaches only 60%
accuracy using static graphs. The results also confirm that dynamic FCGs
perform better than static ones: with dynamic graphs as Classy's input, the
system's accuracy increases by more than 10%.
There are three reasons for Classy's low accuracy. The first is its use of static
graphs, which do not reflect the real structure of malware samples. Even using
an unpacker tool to unpack the binaries does not help much in increasing the
performance of the system, since the unpacker is limited to a set of known
packing techniques. The second reason is the assumption in its clustering
algorithm, which makes the clustering ineffective. At the beginning of the
first step, to determine the
Figure 5.10: Comparison Study
candidate clusters for the incoming sample, Classy considers only those graphs
that have the same number of nodes as the incoming sample. The algorithm
therefore ignores many samples that have a different number of nodes but can
exhibit similar behavior. Finally, as mentioned before, their Simulated An-
nealing cannot find the optimal GED between FCGs, which affects the system
accuracy.
In this chapter we evaluated our system in terms of efficiency and time com-
plexity by applying it to a large and diverse dataset of executables. To do so,
each module of the proposed system was compared to its static counterpart.
Moreover, we provided a comparative analysis with a state-of-the-art static
graph-based method, namely Classy, to demonstrate the discriminative power
of our dynamic system.
In terms of time requirements, the proposed system needs more time to com-
pare dynamic graphs than static graphs, since dynamic FCGs have larger
numbers of nodes and edges, which demand a significant amount of time for
graph matching. However, employing dynamic graphs instead of static ones
significantly increases the system's detection accuracy, by approximately 20%.
The results also showed that the proposed Improved SA performs better in
both time complexity and effectiveness, since it approximates the optimal
GED in less time than the original SA.
A comparative analysis with Classy was performed with the aim of comparing
the employed classification algorithm with their proposed online clustering
algorithm. It was found that employing static graphs in combination with
their clustering results in a significant drop in classification performance. The
superior classification performance of our system indicates its effectiveness in
stream malware classification.
Chapter 6
6.1 Conclusion
tional complexity. The generated dynamic graph reflects the real structure
of a malware sample and therefore has high discriminative power when used
by machine-learning techniques.
Our comparative analyses confirm the effectiveness of our system in classify-
ing streams of samples, reaching an average accuracy of 94% with a minimum
number of unlabeled samples (2% of the total). In general, the superior per-
formance of our system stems from (1) the power of dynamic FCGs to reflect
real malware behaviour that is distinguishable within a stream of malware
samples; and (2) the ability to estimate the optimal GED using the improved
version of Simulated Annealing, which increases the overall system accuracy.
performs very well in terms of accuracy and time complexity on pattern-
recognition datasets. As future work, we would therefore like to employ the
proposed algorithm for graph comparison to obtain a lower comparison time.
Another way to improve our time complexity is to combine graphics process-
ing units (GPUs) with Hadoop. He et al. [40] show that this combination can
significantly improve the time required for many graph problems. Therefore,
to make our proposed system more scalable, we will reconfigure it based on
the combination of GPU and Hadoop.
Bibliography
[1] Shahid Alam, Issa Traore, and Ibrahim Sogukpinar. Annotated control
flow graph for metamorphic malware detection. The Computer Journal,
(10):2608–2621, 2014.
[2] Saed Alrabaee, Paria Shirani, Lingyu Wang, and Mourad Debbabi.
Sigma: A semantic integrated graph matching approach for identify-
ing reused functions in binary code. Digital Investigation, 12:S61–S71,
2015.
[3] R Ambauen, Stefan Fischer, and Horst Bunke. Graph edit distance with
node splitting and merging, and its application to diatom identification.
In Graph Based Representations in Pattern Recognition, pages 95–106.
Springer, 2003.
[5] Fabrice Bellard. Qemu, a fast and portable dynamic translator. In
USENIX Annual Technical Conference, FREENIX Track, pages 41–46,
2005.
[7] Karsten M Borgwardt, Cheng Soon Ong, Stefan Schönauer, SVN Vish-
wanathan, Alex J Smola, and Hans-Peter Kriegel. Protein function
prediction via graph kernels. Bioinformatics, 21(suppl 1):i47–i56, 2005.
[9] Ismael Briones and Aitor Gomez. Graphs, entropy and grid computing:
Automatic comparison of malware. In Virus bulletin conference, pages
1–12. Citeseer, 2008.
of Intrusions and Malware & Vulnerability Assessment, pages 129–143.
Springer, 2006.
[12] Horst Bunke and Gudrun Allermann. Inexact graph matching for struc-
tural pattern recognition. Pattern Recognition Letters, 1(4):245–253,
1983.
[13] Horst Bunke and Kim Shearer. A graph distance metric based on the
maximal common subgraph. Pattern recognition letters, 19(3):255–259,
1998.
[16] Silvio Cesare and Yang Xiang. Malware variant detection using sim-
ilarity search over sets of control flow graphs. In Trust, Security and
Privacy in Computing and Communications (TrustCom), 2011 IEEE
10th International Conference on, pages 181–189. IEEE, 2011.
[17] Silvio Cesare, Yang Xiang, and Wanlei Zhou. Malwise - an effective
and efficient classification system for packed and polymorphic malware.
IEEE Transactions on Computers, 62(6):1193–1206, 2013.
[18] Xueling Chen. Simsight: a virtual machine based dynamic call graph
generator. 2010.
[19] Mihai Christodorescu and Somesh Jha. Testing malware detectors. ACM
SIGSOFT Software Engineering Notes, 29(4):34–44, 2004.
[20] Artem Dinaburg, Paul Royal, Monirul Sharif, and Wenke Lee. Ether:
malware analysis via hardware virtualization extensions. In Proceedings
of the 15th ACM conference on Computer and communications security,
pages 51–62. ACM, 2008.
[22] Jan Edler and Mark D Hill. Dinero iv trace-driven uniprocessor cache
simulator, 1998.
[23] Ammar Ahmed E Elhadi, Mohd Aizaini Maarof, Bazara IA Barry, and
Hentabli Hamza. Enhancing the detection of metamorphic malware
using call graphs. Computers & Security, 46:62–78, 2014.
[24] Mojtaba Eskandari and Sattar Hashemi. Ecfgm: enriched control flow
graph miner for unknown vicious infected code detection. Journal in
Computer Virology, 8(3):99–108, 2012.
for malware detection. Journal of Computer Virology and Hacking Tech-
niques, 9(2):77–93, 2013.
[26] Andreas Fischer, Ching Y Suen, Volkmar Frinken, Kaspar Riesen, and
Horst Bunke. Approximation of graph edit distance based on hausdorff
matching. Pattern Recognition, 48(2):331–343, 2015.
[27] Andreas Fischer, Seiichi Uchida, Volkmar Frinken, Kaspar Riesen, and
Horst Bunke. Improving hausdorff edit distance using structural node
context. In Graph-Based Representations in Pattern Recognition, pages
148–157. Springer, 2015.
[31] Thomas Gärtner, John W Lloyd, and Peter A Flach. Kernels for struc-
tured data. Springer, 2003.
[32] Debin Gao, Michael K Reiter, and Dawn Song. Binhunt: Automatically
finding semantic differences in binary programs. In Information and
Communications Security, pages 238–255. Springer, 2008.
[33] Michael R Garey and David S Johnson. Computers and Intractability:
A Guide to the Theory of NP-Completeness. San Francisco, CA: Freeman,
1979.
[35] Thomas Gärtner, Peter Flach, and Stefan Wrobel. On graph kernels:
Hardness results and efficient alternatives. In Learning Theory and Ker-
nel Machines, pages 129–143. Springer, 2003.
[40] Bingsheng He, Wenbin Fang, Qiong Luo, Naga K Govindaraju, and
Tuyong Wang. Mars: a mapreduce framework on graphics processors. In
Proceedings of the 17th international conference on Parallel architectures
and compilation techniques, pages 260–269. ACM, 2008.
[41] Andrew Henderson, Aravind Prakash, Lok Kwong Yan, Xunchao Hu,
Xujiewen Wang, Rundong Zhou, and Heng Yin. Make it work, make it
right, make it fast: Building a platform-neutral whole-system dynamic
binary analysis platform. In Proceedings of the 2014 International Sym-
posium on Software Testing and Analysis, pages 248–258. ACM, 2014.
[42] Xin Hu. Large-Scale Malware Analysis, Detection, and Signature Gen-
eration. PhD thesis, The University of Michigan, 2011.
[43] Xin Hu, Tzi-cker Chiueh, and Kang G Shin. Large-scale malware index-
ing using function-call graphs. In Proceedings of the 16th ACM confer-
ence on Computer and communications security, pages 611–620. ACM,
2009.
[45] Rohit Jalan and Arun Kejariwal. Trin-Trin: Who's calling? A Pin-based
dynamic call graph extraction framework. International Journal of Par-
allel Programming, 40(4):410–442, 2012.
[47] Jibz, Qwerton, Snaker, and XineohP. PEiD. http://www.peid.info/.
Accessed: 2011.
[48] Derek Justice and Alfred Hero. A binary linear programming formula-
tion of the graph edit distance. Pattern Analysis and Machine Intelli-
gence, IEEE Transactions on, 28(8):1200–1214, 2006.
[49] Joris Kinable and Orestis Kostakis. Malware classification based on call
graph clustering. Journal in computer virology, 7(4):233–245, 2011.
[50] Johannes Kinder and Dmitry Kravchenko. Alternating control flow re-
construction. In VMCAI, pages 267–282. Springer, 2012.
[53] Orestis Kostakis, Joris Kinable, Hamed Mahmoudi, and Kimmo Mus-
tonen. Improved call graph comparison using simulated annealing. In
Proceedings of the 2011 ACM Symposium on Applied Computing, pages
1516–1523. ACM, 2011.
[55] McAfee Labs. McAfee Labs threats report. http://www.mcafee.com/
ca/resources/reports/rp-quarterly-threats-aug-2015.pdf. Ac-
cessed: 2015-09-30.
[56] Jusuk Lee, Kyoochang Jeong, and Heejo Lee. Detecting metamorphic
malwares using code graphs. In Proceedings of the 2010 ACM symposium
on applied computing, pages 1970–1977. ACM, 2010.
[58] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur
Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim
Hazelwood. Pin: building customized program analysis tools with dy-
namic instrumentation. In ACM Sigplan Notices, volume 40, pages 190–
200. ACM, 2005.
[60] James J McGregor. Backtrack search algorithms and the maximal com-
mon subgraph problem. Software: Practice and Experience, 12(1):23–34,
1982.
[62] Michel Neuhaus, Kaspar Riesen, and Horst Bunke. Fast suboptimal
algorithms for the computation of graph edit distance. In Structural,
Syntactic, and Statistical Pattern Recognition, pages 163–172. Springer,
2006.
[63] Younghee Park, Douglas S Reeves, and Mark Stamp. Deriving com-
mon malware behavior through graph clustering. Computers & Security,
39:419–430, 2013.
[67] Konrad Rieck, Philipp Trinius, Carsten Willems, and Thorsten Holz.
Automatic analysis of malware behavior using machine learning. TU,
Professoren der Fak. IV, 2009.
[68] Kaspar Riesen and Horst Bunke. Approximate graph edit distance com-
putation by means of bipartite graph matching. Image and Vision Com-
puting, 27(7):950–959, 2009.
[69] Kaspar Riesen, Michel Neuhaus, and Horst Bunke. Bipartite graph
matching for computing the edit distance of graphs. In Graph-Based
Representations in Pattern Recognition, pages 1–12. Springer, 2007.
[70] Alberto Sanfeliu and King-Sun Fu. A distance measure between at-
tributed relational graphs for pattern recognition. IEEE Transactions
on Systems, Man and Cybernetics, (3):353–362, 1983.
[71] Shanhu Shang, Ning Zheng, Jian Xu, Ming Xu, and Haiping Zhang. De-
tecting malware variants via function-call graph similarity. In 2010 5th
International Conference on Malicious and Unwanted Software (MAL-
WARE), pages 113–120. IEEE, 2010.
[73] VirusSign. Malware research & data center, virus free downloads.
http://www.virussign.com/. Accessed: 2015-03-24.
[75] Dawn Song, David Brumley, Heng Yin, Juan Caballero, Ivan Jager,
Min Gyung Kang, Zhenkai Liang, James Newsome, Pongsin Poosankam,
and Prateek Saxena. Bitblaze: A new approach to computer security via
binary analysis. In Information systems security, pages 1–25. Springer,
2008.
[76] Fu Song and Tayssir Touili. Efficient malware detection using model-
checking. In FM 2012: Formal Methods, pages 418–433. Springer, 2012.
[77] Lili Tan. The worst case execution time tool challenge 2006: Technical
report for the external test. Uni-DUE, Technical Reports of WCET Tool
Challenge, 1, 2006.
[78] Heaventools. PE Explorer: EXE file editor, resource editor, DLL view
scan tool, disassembler. http://www.heaventools.com/. Accessed: 2015-
03-26.
[80] VirusTotal. VirusTotal: free online virus, malware and URL scanner.
www.virustotal.com. Accessed: 2015-03-15.
[81] P Vinod, Vijay Laxmi, Manoj Singh Gaur, GVSS Kumar, and Yadven-
dra S Chundawat. Static cfg analyzer for metamorphic malware code.
In Proceedings of the 2nd international conference on Security of infor-
mation and networks, pages 225–228. ACM, 2009.
[85] Josef Weidendorfer. Kcachegrind, 2012.
[86] Nils Weskamp, Eyke Hullermeier, Daniel Kuhn, and Gerhard Klebe.
Multiple graph alignment for the structural analysis of protein active
sites. Computational Biology and Bioinformatics, IEEE/ACM Transac-
tions on, 4(2):310–320, 2007.
[88] Ming Xu, Lingfei Wu, Shuhui Qi, Jian Xu, Haiping Zhang, Yizhi Ren,
and Ning Zheng. A similarity metric method of obfuscated malware
using function-call graph. Journal of Computer Virology and Hacking
Techniques, 9(1):35–47, 2013.
[89] Wei Yan, Zheng Zhang, and Nirwan Ansari. Revealing packed malware.
Security & Privacy, IEEE, 6(5):65–69, 2008.
[90] Heng Yin and Dawn Song. Temu: Binary code analysis via whole-system
layered annotative execution. Submitted to: VEE, 10, 2010.
[91] Zhiping Zeng, Anthony KH Tung, Jianyong Wang, Jianhua Feng, and
Lizhu Zhou. Comparing stars: On approximating graph edit distance.
Proceedings of the VLDB Endowment, 2(1):25–36, 2009.
[92] Zongqu Zhao. A virus detection scheme based on features of control flow
graph. In Artificial Intelligence, Management Science and Electronic
Commerce (AIMSEC), 2011 2nd International Conference on, pages
943–947. IEEE, 2011.
Vita
University attended:
Master of Computer Science
University of New Brunswick
2013-2016
Publications:
Conference Presentations:
Elaheh Biglar Beigi, Hossein Hadian Jazi, Natalia Stakhanova, and Ali Ghor-
bani. "Towards effective feature selection in machine learning-based botnet
detection approaches." In Communications and Network Security (CNS),
2014 IEEE Conference on, pp. 247–255. IEEE, 2014.