
A Comb for Decompiled C Code

Andrea Gussoni (Politecnico di Milano) andrea1.gussoni@polimi.it
Alessandro Di Federico (rev.ng Srls) ale@rev.ng
Pietro Fezzardi (rev.ng Srls) pietro@rev.ng
Giovanni Agosta (Politecnico di Milano) agosta@acm.org

ABSTRACT

Decompilers are fundamental tools to perform security assessments of third-party software. The quality of decompiled code can be a game changer in order to reduce the time and effort required for analysis. This paper proposes a novel approach to restructure the control flow graph recovered from binary programs in a semantics-preserving fashion. The algorithm is designed from the ground up with the goal of producing C code that is goto-free and that drastically reduces the mental load required for an analyst to understand it. As a result, the code generated with this technique is well-structured, idiomatic, readable, easy to understand, and fully exploits the expressiveness of the C language. The algorithm has been implemented on top of the rev.ng [12] static binary analysis framework. The resulting decompiler, revng-c, is compared on real-world binaries with state-of-the-art commercial and open source tools. The results show that our decompilation process introduces between 40% and 50% less extra cyclomatic complexity.

CCS CONCEPTS

• Security and privacy → Software security engineering; Software reverse engineering; Security requirements.

KEYWORDS

decompilation, reverse engineering, goto, control flow restructuring

ACM Reference Format:
Andrea Gussoni, Alessandro Di Federico, Pietro Fezzardi, and Giovanni Agosta. 2020. A Comb for Decompiled C Code. In 15th ACM Asia Conference on Computer and Communications Security (ASIA CCS '20), October 5–9, 2020, Taipei, Taiwan. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3320269.3384766

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ASIA CCS '20, October 5–9, 2020, Taipei, Taiwan
© 2020 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-6750-9/20/10 . . . $15.00
https://doi.org/10.1145/3320269.3384766

1 INTRODUCTION

In the last decades, software has steadily become increasingly ubiquitous, and programmable electronic devices are nowadays part of every aspect of everyone's life. Most often, users have little control on the software that runs on these devices and on the life cycle of release upgrades to fix outstanding bugs. Companies tend to be very secretive about their implementations and rarely provide access to the source code of their applications, either to protect patents and trade secrets, or in the hope to provide security by obscurity. In some fields, it is also common to find legacy code that runs parts of critical infrastructures, for which gaining access to the source is not even an option, since the company that originally provided it ran out of business.

In all these scenarios, it is challenging for external analysts to conduct independent security assessments of the implementations, let alone to provide fixes for bugs and vulnerabilities.

In this context, performing an in-depth analysis of a piece of software without access to its source code is significantly more difficult. To this end, decompilers are powerful tools that, starting from a binary executable program, can reconstruct a representation of its behavior using a high-level programming language, typically C. These tools save the analyst from the need of looking directly at the assembly code, leading to a dramatic reduction of the effort necessary to perform a security assessment, making it viable in new scenarios.

The compilation process is not perfectly reversible, which complicates the task of evaluating the quality of the results of a decompiler. Due to aggressive compiler optimizations and hand-written assembly, it is often impossible to recover the exact original source from which a binary executable was produced. A decompiler could even be used to recover C code from a Fortran program. In principle the process should work, but the recovered C code would be neither the original source nor very idiomatic C.

Therefore, in practice, the goal of a decompiler is not really to produce the exact same source code that originated the program, which might be plainly unfeasible, but to produce some high-level representation that is easy for analysts to reason about. For this reason, it is very important for a decompiler to produce high-quality code.

The quality of decompiled code can be measured in different ways. Informally, it can be described as the readability of the code, i.e., the ease with which a snippet of decompiled code can be understood by an analyst. This qualitative measure is strongly related to the mental load necessary to understand the behavior of the code, which in turn depends on the amount of information that the analyst has to track during the analysis. This information can be ascribed mainly to the complexity of the control flow and depends on all the possible entangled execution paths that can lead to a certain portion of the code. All these factors contribute to the mental load of an analyst.

To minimize such load, and to produce high-quality output code, decompilers adopt various techniques to restructure the control flow of decompiled programs, to make them easier to read and to
reduce the burden of understanding their functioning. As an example, control flow restructuring can be used to reduce the number of goto statements [23], cutting the number of unstructured jumps across the program, hence reducing the mental load necessary to track all the possible paths. As another example, control flow restructuring can be used to produce if-then-else or loops that naturally match high-level programming constructs, or to collapse multiple ifs into a single one if they check the same condition. All these modifications on the control flow contribute to make the code easier to understand.

In summary, this paper makes the following contributions:
• we present a novel algorithm for control flow restructuring that 1) produces well-structured programs that can always be emitted in C, without resorting to goto statements, 2) significantly reduces the cyclomatic complexity [19] of the generated C code compared to the state of the art, a measure of the complexity of the control flow strictly related to the mental load required to understand the observed code, and 3) fully exploits the expressiveness of the C language (such as short-circuiting of if conditions and switch statements);
• we implement the proposed approach employing the rev.ng binary analysis framework as a basis;
• we compare the resulting decompiler, revng-c, with state-of-the-art commercial and academic decompilers, on a set of real-world programs, measuring the size of the decompiled code and its cyclomatic complexity.

The remainder of this work is structured as follows. Section 2 introduces the fundamental concepts necessary to understand the rest of the work. Section 3 discusses related works, while Section 4 presents the design of the control flow restructuring algorithm. Section 5 shows the experimental results obtained on a set of real-world programs, the GNU coreutils, comparing the approach proposed in this work with other commercial, open-source and academic decompilers: the Hex-Rays Decompiler [1, 14], Ghidra [3], and DREAM [22, 23]. Finally, Appendix B discusses more idiomatic case studies and corner cases before the concluding remarks in Section 6.

2 BACKGROUND

This section briefly outlines the main concepts that are necessary for the understanding of the paper.

Graph basics. In this paper we take for granted a number of fundamental concepts revolving around directed graphs. For the interested reader, these concepts are discussed in more detail in Appendix A, and a more thorough reference can be found in [15]. Most of these concepts should be familiar, since they are widely used in program analysis for representing the control flow of a program by means of Control Flow Graphs (CFG). In particular, we will make wide use of the following concepts:
• the control flow graph representation of a program;
• directed acyclic graphs (DAG);
• searches and visits over CFGs, in particular the Depth First Search algorithm and the orderings it induces on a CFG, such as the preorder, postorder and reverse postorder;
• dominance and post-dominance, and the data structures they induce, the dominator- and post-dominator-tree.

Short-circuit evaluation. In this paper, short-circuit evaluation refers to the semantics of boolean expressions in C. If a boolean expression has more than one argument, each argument is evaluated only if the evaluation of the previous arguments is not sufficient to establish the value of the expression containing the boolean operator. This is particularly important when the evaluation of some operand of the boolean expression has side-effects, because only the side-effects of the arguments that are actually evaluated will be triggered.

Cyclomatic complexity. The cyclomatic complexity is a well-known software metric used to capture the complexity of a program. It was originally conceived by T. J. McCabe in 1976 [19]. It represents a quantitative measure of the number of linearly independent paths in a program's source code, and it is computed on the control flow graph of the program. In general, the cyclomatic complexity of a program is given by M = E − N + 2P, where M is the cyclomatic complexity itself, E is the number of edges in the CFG, N is the number of nodes in the CFG, and P is the number of connected components in the graph. In the case of a single subroutine P is always 1, hence the formula can be simplified to M = E − N + 2. If we consider a program as the union of all the CFGs of its subroutines, the cyclomatic complexity of the program can be computed as the sum of the cyclomatic complexities of the single subroutines.

3 RELATED WORK

This section describes the related work in two main fields: recovery of Control Flow Graphs, and decompilation.

CFG Recovery. In this work we focus on control flow restructuring, and we use as a starting point the Control Flow Graph of the function we want to analyze and decompile. The problem of correctly recovering such graphs from binary code is well-known, and a lot of research work has been done in this field. The CMU Binary Analysis Platform (BAP) [5] is a binary analysis framework which disassembles and lifts binary code into a RISC-like intermediate language, called BAP Intermediate Language (BIL). BAP also integrates all the techniques developed previously for BitBlaze [20]. The rev.ng [9, 10] project, an architecture-independent binary analysis framework based on QEMU [4] and LLVM [17], is able to lift a binary into an equivalent LLVM IR representation. Other research groups have also dedicated efforts to tackle the problem of disassembling obfuscated code [16]. The approach presented in this paper does not rely on any specific technique for extracting CFGs from binary code, hence it is general enough to be used with any of these approaches.

Decompilation. The academic foundational work in the field of decompilers is probably Cifuentes' PhD thesis [8]. The techniques presented there have been implemented in the dcc decompiler, a C decompiler for Intel 80286. In the field of commercial decompilers, Hex-Rays [1] is the de-facto leader, and its decompiler is provided as a plug-in for the Interactive Disassembler Pro (IDA) [14] tool. No specific information on the internal structure of the decompiler is publicly available, apart from the fact that it uses some kind of structural analysis [13].
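The cyclomatic-complexity formula recalled in Section 2 is easy to check mechanically. The following minimal sketch (the helper names are ours, for illustration only, and not part of revng-c) computes M = E − N + 2P and its single-subroutine simplification:

```c
#include <assert.h>

/* McCabe's cyclomatic complexity: M = E - N + 2P, where E is the number
   of edges of the CFG, N the number of nodes, and P the number of
   connected components. */
static int cyclomatic_complexity(int edges, int nodes, int components) {
    return edges - nodes + 2 * components;
}

/* For a single subroutine P is always 1, so the formula reduces to
   M = E - N + 2. */
static int cyclomatic_complexity_single(int edges, int nodes) {
    return cyclomatic_complexity(edges, nodes, 1);
}
```

For instance, the CFG of a single if-then-else (4 nodes, 4 edges) has complexity 2, matching its two linearly independent paths, and a program made of two such subroutines (P = 2, 8 nodes, 8 edges) has complexity 4, the sum of the complexities of the two subroutines.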
Phoenix [6], a decompiler tool built on top of the CMU Binary Analysis Platform (BAP) [5], uses iterative refinement and semantics-preserving transformations. The iterative part is implemented through the emission of a goto instruction when the decompilation algorithm cannot make progress.

A very recent entry in the field of decompilers is the decompiler included in the Ghidra reverse engineering tool [3], initially developed by the US National Security Agency (NSA) for internal use. The tool has been open sourced very recently (April 2019) but, as of today, no extensive documentation on the design of its components has been released. The decompiled code generated by Ghidra bears some similarities with that of the Hex-Rays Decompiler, and this suggests that it also uses an approach based on structural analysis. In particular, both tools, the Hex-Rays Decompiler and Ghidra, emit C code with many goto statements, which is not idiomatic and results in very convoluted control flow. This makes it hard to keep track of all the entangled overlapping control flow paths in the code, making it hard to understand. This characteristic is also somehow shared by the Phoenix decompiler, which emits goto statements when it cannot make further progress.

The DREAM [23] decompiler takes a drastically different direction. The authors present various semantics-preserving transformations for the CFG, and a decompilation technique that emits no goto statements by design. However, to avoid gotos, DREAM handles "pathological" loops by means of what can be seen as predicated execution. If a CFG has a loop and a branch that jumps straight into the middle of the loop from a point outside the loop, DREAM wraps parts of the body of the loop inside a conditional statement guarded by a state variable. This design choice prevents gotos but generates code where multiple execution paths are entangled and partially overlap. An example can be seen in an open dataset of code snippets released by the authors [21], in Section 1.5, page 7, at lines 12–16 of the code generated by DREAM. In larger functions, this can significantly increase the mental load of an analyst, especially if a loop contains more than one of these conditional blocks, possibly nested or with multiple conditions.

4 CONTROL FLOW COMBING

In this paper we make a novel choice for the generation of decompiled code that is free from goto statements: we accept to duplicate code in order to emit more idiomatic C code that reduces the mental load of an analyst, being more readable and easier to understand. This section focuses on the details of this technique, called Control Flow Combing.

The algorithm is composed of three stages: a Preprocessing, which prepares the input CFG for the manipulation, transforming it into a hierarchy of nested Directed Acyclic Graphs (DAG); the actual Combing stage, which disentangles complex portions of the control flow by duplicating code portions or introducing dummy nodes; and a final Matching stage, which matches idiomatic C constructs, while trying to reduce unnecessary duplication.

Informally, the idea is to "comb" the Control Flow Graph, duplicating code to disentangle convoluted overlapping paths, so that the properties necessary to emit idiomatic C code naturally emerge. We pay this potential duplication as a cost necessary to handle generic binary programs. To gracefully handle common cases, the Matching step is performed as post-processing to reduce duplication when possible, leaving freedom and generality to the Combing without sacrificing the capability to emit high-quality code.

The remainder of this section is structured as follows. Section 4.1 provides an overview of the high-level design goals. Section 4.2 introduces the CFG properties that are later enforced by the Preprocessing and Combing stages. Section 4.3 provides a general overview before digging into the details of the three stages: Preprocessing (Section 4.4), Combing (Section 4.5), and Matching (Section 4.6).

4.1 Design Goals

The fundamental goal of the algorithm presented here is to increase the quality of the produced decompiled code. As mentioned in Section 1, this means reducing the informative load on the shoulders of analysts. To achieve this, the algorithm is designed with some fundamental goals.

Generality. It must be able to work on any CFG, independently of its complexity. This is important since, in decompilation, input CFGs might originate from hand-written or compiler-optimized machine code. To build a decompiler that consistently generates high-quality output, very few assumptions can be made on the input CFGs.

Structured. It needs to transform any CFG so that it can be expressed in terms of C constructs, excluding gotos. gotos, and unstructured programming in general, can considerably increase the complexity of the control flow [11].

Expressive. Starting from such structured CFGs, it must be able to emit a wide range of idiomatic C constructs, such as while and do-while loops, switch statements, and if statements with or without else and with short-circuited conditions.

4.2 CFG Properties

The Preprocessing and the Combing stages of the algorithm enforce some properties on the input CFGs. Such properties are inspired by fundamental characteristics of structured C programs and designed to mimic them. The fundamental idea of the algorithm is to enforce each of these properties one at a time. Once a property has been enforced it becomes an invariant, so that it is preserved by all the subsequent steps. In this way, the final result of applying the transformations on the original CFG will feature all the properties. Since these properties are tailored to describe structured C programs, the resulting CFG at the end of the algorithm is straightforward to translate into C without gotos. In the following we list the properties we aim to enforce.

Two Successors. The first important property of structured C programs is that each basic block has at most two successors. The only case that does not respect this condition is the switch statement, but every switch can always be transformed into a sequence of if-else statements and vice versa. The Preprocessing phase will always enforce this property on CFGs, deferring to the Matching stage the decision of whether to emit ifs or switches.

Two Predecessors. This property holds whenever a basic block in the CFG has at most two predecessors.
Figure 1: A CFG without the Diamond Shape property. Node 3 is reachable from node 2 but is not dominated by it. To emit this code in C, a goto statement would be required (either ⟨1,3⟩ or ⟨2,3⟩).

Loop Properties. In a well-structured C program each loop has the following three characteristics.

Single Entry. In C, the only way to enter a loop without passing from the entry node is using the goto statement.

Single Successor. In C, the only way to abandon a loop without gotos is using the break statement, and all the break statements in a loop jump to the same point in the program: the single successor of the loop. The case of a natural exit from the loop is just an implicit break.

Single Retreating Target. In C, all the retreating edges in a structured loop jump to the same target since, given that we have no gotos, they must be continue statements. Just like breaks, all continues in a loop jump to the same point: the single retreating target, which is also the entry node.

The fact that break and continue statements always target a single node is only true under two assumptions. The first is that there are no switch statements. The second assumption is that there are no nested loops since, e.g., break statements of a nested loop do not jump to the same target as breaks of its parent loop. These two assumptions might sound strong, but the Preprocessing phase is designed to ensure that these properties are enforced in strict order, so that when the Loop Properties are enforced all their prerequisites are guaranteed to hold. All the loop properties described above will be enforced on the input CFG by the Preprocessing stage.

Diamond Shape Property. The Diamond Shape property holds for a DAG whenever each node with two successors dominates all the nodes between itself and its immediate post-dominator. This property is enforced by the Combing stage on the DAGs generated by the Preprocessing stage. It mimics the fact that in well-structured C programs all the scopes are either nested or non-overlapping. In other words, enforcing this property means forcing a DAG into the form of a diamond, where each node with more than a single successor induces a region of nodes with a single entry and a single exit.

To grasp the implications of this property, it might be useful to think about a scenario where it does not hold. An example is portrayed in Figure 1. In this setting, there exists a conditional node, node 2 in Figure 1, and another node, node 3, reachable from the conditional node but not dominated by it. There exists another node, node 1, from which it is possible to reach node 3 without passing from node 2. But node 2 is a conditional node (an if statement in C), and since node 3 is reachable from node 2, if the program is well-structured node 3 should be either in the then, in the else, or after the if-else altogether. At the same time, there is a path from node 1 to node 3 that does not pass from node 2. The main problem with this scenario is that such a graph cannot be emitted in well-structured C programs without using gotos. Hence the Combing stage enforces this property.

4.3 Overview of the Algorithm

As previously anticipated, the Control Flow Combing algorithm is designed in three incremental stages: Preprocessing, Combing, and Matching.

Preprocessing. The goal of this stage is to massage the input CFG into a shape that can be digested by the Combing. To do this, the Preprocessing incrementally enforces all the properties described in Section 4.2, except for the Diamond Shape property. It does so by working on a tree-like hierarchy of nested Regions of the CFG, called the Region Tree. At the end of Preprocessing all the Regions in the tree are transformed into DAGs.

Combing. This stage works on the Region Tree generated by Preprocessing, which is now constituted only by DAGs. The Combing enforces the Diamond Shape property on all the DAGs in the tree. After this transformation the tree is ready to be transformed into a C Abstract Syntax Tree.

Matching. This stage uses the combed Region Tree to generate a C AST representation. The AST is subsequently manipulated with a set of rules to match idiomatic C constructs. The rules presented in this paper cover short-circuited ifs, switch statements, and loops in the form do {...} while(...) and while(...) {...}, but others can be added. After matching idiomatic C constructs, the final C code is emitted in textual form.

4.4 Preprocessing

This section describes the Preprocessing stage in detail. The first part of the Preprocessing, described in Section 4.4.1, is designed to divide the CFG into a hierarchy of nested Regions, each roughly representing a loop. The goal is to superimpose on the CFG a Region Tree that represents the hierarchy of loops in the CFG itself. Each Region in the tree is then handled independently of the others by the next steps of Preprocessing and Combing, reducing the complexity of the algorithm. The second part of the Preprocessing, described in Section 4.4.2, works on the Region Tree, transforming each Region into a DAG, so that it can subsequently be handled by the Combing stage.

4.4.1 Building the Region Tree. This process is composed of three steps. The first adds a sink node as a successor of all the exit nodes, which is necessary to compute post-dominance. The second starts to enforce some of the properties discussed in Section 4.2. The third identifies nested Regions and builds the Region Tree.

Adding the sink Node. In general, CFGs obtained from binary programs do not have a single exit, which is a requirement to compute the post-dominator tree, which in turn is a requirement to reason about the Diamond Shape property that is enforced later. Hence, every CFG needs to be brought into a shape with a single exit. This is done by adding an artificial sink node, and attaching an artificial edge from each original exit basic block to the sink. This makes the sink the single exit node, allowing the computation of the post-dominator tree. This operation does not alter the semantics of the program, and is preserved by all the following steps.
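A minimal sketch of this sink-insertion step over a toy adjacency-matrix CFG (the data layout and names are ours for illustration, not revng-c's):

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 8  /* capacity of the toy CFG, sink included */

/* Adds an artificial sink: every node among the first n that has no
   successors (an exit block) gets an artificial edge to a fresh node n.
   Returns the new node count. Afterwards the CFG has a single exit, so
   the post-dominator tree becomes computable. */
static int add_sink(bool adj[MAX_NODES][MAX_NODES], int n) {
    int sink = n;
    for (int node = 0; node < n; node++) {
        bool has_successor = false;
        for (int succ = 0; succ < n; succ++)
            if (adj[node][succ])
                has_successor = true;
        if (!has_successor)
            adj[node][sink] = true;  /* artificial edge: exit block -> sink */
    }
    return n + 1;
}
```

The artificial sink and its incoming edges exist only for analysis purposes and do not correspond to emitted code, which is why, as stated above, the program semantics are untouched.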
0 1 2
1 2 3 4
Figure 3: Electing a Region’s head. The retreating edges are dashed.
Node 1 has one incoming retreating edge, while node 0 has 2. For
Figure 2: Merging partially-overlapping SCS. There are two SCS this reason, node 0 is elected head of the Region.
( ⟨1,2,3⟩ and ⟨2,3,4⟩) induced by the retreating edges 3 → 1 and 4 → 2
(dashed). They overlap but they have no inclusion relationship, point, all the steps of the algorithm only work on Regions and the
therefore they are merged into a new SCS ⟨1,2,3,4⟩. Region Tree.

Enforcing Two Predecessors and Two Successors. First, the 4.4.2 Turning Regions into DAGs. The goal of this phase is to turn
Two Successors property is enforced by transforming all the switch each Region into a DAG that can be then be reasoned about in
statements into cascaded conditional branches, with two targets simpler terms. This process is composed of various incremental
each. Similarly, the Two Predecessors property is enforced by taking steps. The combination of all these steps enforces on the Regions
each basic block with more than one predecessor and transforming all the remaining properties introduced in Section 4.2 except for
it into a tree of empty basic blocks (dummies) that only jump to the Diamond Shape property, i.e., the Two Successors property, and
their single successor. the Loop Properties. Where noted, some of steps are optional and
These operations do not alter in any way the semantic of the dedicated to gracefully handle common cases.
program. Moreover, the Two Predecessors and Two Successors are The following steps work on a single Region at a time, moving
preserved in all the following steps. from the leaves to the root of the Region Tree. At the beginning of
this process all Regions but the root are still SCS. At the end of this
Identifying Nested Regions. The core idea of this step is to merge process the Regions are transformed in DAGs, so that they can be
sets of the partially overlapping loops in the CFG into an individual treated by the next phase, Combing.
Region that we can then reason about as a single loop.
To define these Regions, the algorithm starts from all the Strongly Electing Regions’ Heads. The Loop Properties require every Re-
Connected Subgraphs (SCS), i.e., subgraphs of the original CFG gion to have a Single Entry and a Single Retreating Target. However,
whose nodes are all reachable from each other. There might be at this stage, each of them may contain multiple retreating edges,
several overlapping and non-overlapping SCS in a graph. Note that possibly targeting different nodes. This step elects the entry node:
a SCS is a difference concept from a Strongly Connected Component the node that is target of the highest number of retreating edges.
(SCC), typically used in loop analysis. In fact, SCCs are always non- This node, the head node, represents the beginning of the loop body,
overlapping by definition, and their union represent the entire CFG. and will be the target of all the retreating edges in the loop.
In particular, we are interested in SCSs induced by retreating edges. Retreating Edges Normalization. After the election of the head,
Given a retreating edge 𝑒 = ⟨𝑠,𝑡⟩ the SCS induced by 𝑒 is constituted all the retreating edges that do not point to it are considered abnor-
by all the nodes that are on at least one path starting from 𝑡 and ending in 𝑠 that does not cross 𝑡 nor 𝑠.
First, the algorithm identifies all the SCSs induced by all the retreating edges in the CFG, simply applying the definition above. Note that at this stage the resulting SCSs can still overlap, whereas to build a hierarchy between SCSs it is necessary for the set of SCSs to form a partially ordered set under the strict subset relation (⊂). Hence, for each pair of SCSs 𝐴 and 𝐵, if 𝐴 ∩ 𝐵 ≠ ∅, 𝐴 ⊄ 𝐵, and 𝐵 ⊄ 𝐴, then 𝐴 ∪ 𝐵 is added to the set of SCSs, removing 𝐴 and 𝐵 from it. When this happens, the algorithm restarts from the beginning, until a fixed point is reached. Notice that the union of two SCSs is always an SCS, hence the process can proceed. An example of partially overlapping SCSs that trigger this condition is shown in Figure 2.
This process converges since the ∪ operator is monotonic and the CFG has a finite number of nodes. At the end, only a set of SCSs that is partially ordered by the ⊂ relationship is left. Each of these remaining SCSs is a Region, roughly representing a loop or a set of loops tightly entangled together. Considering the whole CFG as a Region itself, the ⊂ relationship naturally induces a tree on all the Regions. The whole CFG is the root of the tree and, moving towards the leaves, we encounter more and more deeply nested loops. This tree structure is called the Region Tree.
Notice that the grouping of nodes in Regions does not alter the CFG, hence it does not alter the program semantics. The same holds if a node is moved inside or outside of an existing Region. From this
mal, since they do not respect the Single Retreating Target property and, therefore, need to be handled.
Consider the graph in Figure 3: the head is node 0 and there is a single abnormal edge from node 2 to 1. In C parlance, this edge is not a continue, since it jumps to the middle of a loop. Informally, to handle this situation, we can introduce a state variable in the program so that the abnormal edge can be represented with a continue. In practice, this edge will target a virtual head node that will check the value of the state variable and dispatch execution at the correct location (Node 1). To discriminate between retreating edges, the state variable is set before every retreating edge and checked at the beginning of the loop with a dedicated construct.
This is exactly what the normalization step does for abnormal edges. For each Region, a state variable 𝑣 is created. Then, a distinct identifier is assigned to each node with incoming abnormal edges, as well as to the head elected at the previous step. Then, a new set of nodes is created before the head, containing only conditional jumps that check the state variable to dispatch the execution to the correct target (either a target of an abnormal edge or the head). This set of nodes is called the head dispatcher, and its first node is called ℎ. Finally, each abnormal edge 𝑒 = ⟨𝑠,𝑡⟩ is replaced with a new pair of edges. The former edge of this pair is 𝑒ℎ = ⟨𝑠,ℎ⟩. This edge points to the entry point of the head dispatcher, and sets the state variable to the value associated to 𝑡, say 𝑣𝑡. The latter edge is added from the node in the head dispatcher that checks for the condition 𝑣 == 𝑣𝑡 to
𝑡. Finally, the single entry point of the head dispatcher is promoted to new head of the Region. Figure 4 shows the result of the normalization of abnormal edges applied to the CFG originally depicted in Figure 3.

Figure 4: Normalizing retreating edges on the CFG from Figure 3. All the retreating edges (dashed) now point to the new head dispatcher and set the state variable (values are reported on the edge labels). The head dispatcher then jumps to the original target node.

Figure 5: Absorbing Successors. Left: the Region with the nodes with dashed border has two successors, 3 and 4. Right: node 3 has been absorbed in the Region, which now has a single successor.
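The normalization of abnormal edges described above can be sketched compactly. The following is an illustrative Python sketch of ours, not revng-c's actual implementation; the edge-list encoding, the `"dispatcher"` node name, and the function name are all assumptions made for the example:

```python
def normalize(edges, back_edges):
    """Enforce Single Retreating Target: redirect every retreating or
    abnormal edge to a fresh 'dispatcher' node that tests a state
    variable v and forwards execution to the original target.
    Edges are returned as (source, target, value-written-to-v) triples,
    where the third field is None for plain edges."""
    targets = sorted({t for (_, t) in back_edges})
    state_of = {t: i for i, t in enumerate(targets)}   # value of v per target
    out = [(s, t, None) for (s, t) in edges if (s, t) not in back_edges]
    for (s, t) in back_edges:
        # the edge now sets v and jumps to the dispatcher instead of t
        out.append((s, "dispatcher", state_of[t]))
    for t in targets:
        # dispatcher branch taken when v == state_of[t]
        out.append(("dispatcher", t, None))
    return out, state_of

# Toy CFG in the spirit of Figure 3: head 0, loop 0->1->2->0,
# plus an abnormal edge 2->1 into the middle of the loop.
out, state_of = normalize([(0, 1), (1, 2), (2, 0), (2, 1)],
                          back_edges={(2, 0), (2, 1)})
```

After the rewriting, every retreating edge targets the dispatcher, which is then promoted to head, so the Single Retreating Target property holds by construction.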
The idea is to enforce the Single Retreating Target property, editing the control flow without altering the semantics of the original program, except for the introduction of the state variable 𝑣. This process also preserves the Two Successors property. All these properties are from now on invariant and preserved in the next steps.
Notice that redirecting the abnormal edges to the entry dispatcher may momentarily break the Two Predecessors property. But this does not represent a problem, since the normalized abnormal edges are later removed and substituted with continue statements.
Loop Successors Absorption. This is an optional step that starts moving in the direction of the Single Successor loop property. It is designed to gracefully handle a scenario observed frequently in real-world examples, depicted on the left in Figure 5. The Region ⟨0,1,2⟩ in the figure has two successors, 3 and 4. Informally, it is easy to see that the Region, along with node 3, is substantially a loop that executes the code in node 3 on break. Given that one of the goals is to emit idiomatic C, this would be better represented with a loop containing an if statement that executes the code in 3 and breaks. In order to reach this form, node 3 must be absorbed into the Region, as shown in Figure 5.b.
More formally, this step starts with the creation of new empty dummy frontier nodes on each edge whose source is in the Region and whose target is not (see the empty dashed node in Figure 5.b). Then, it computes the dominator tree of the entire CFG (not only the current Region) and adds to the Region all the nodes that are dominated both by the head of the Region and by at least one dummy frontier node.
This embodies the idea that, given a node, if it is only reachable by passing through the head of the Region and through a dummy frontier, it is in fact part of the Region itself, and it must be handled accordingly by the remaining steps.
This step does not alter the semantics of the program, as it only adds empty dummy frontiers, and it also does not break any of the previously enforced invariants.
First Iteration Outlining. This step enforces the Single Entry property on Regions, removing potential multiple entry points that at this stage are still possible by means of abnormal entries. An abnormal entry is an edge 𝑒 = ⟨𝑠,𝑡⟩ such that 𝑠 is not in the Region, 𝑡 is in the Region, and 𝑡 is not the head.

Figure 6: First iteration outlining. Dashed nodes are the outlined ones.

Abnormal entries are removed based on the observation that each of them generates a set of paths that enter the Region, execute some parts of the loop and, at some point, reach the proper head of the loop and proceed with regular iterations.
Thanks to this observation, the nodes and edges that compose the first iteration can be duplicated and moved out of the Region; once they are outlined, they have no retreating edges and bear no signs of being loops.
Note that it would be possible to leave the first iteration inside the loop, but doing so requires guarding each statement with conditional constructs, an approach adopted by previous works [22, 23]. However, we deem that choice to be suboptimal, since it generates decompiled code where paths are entangled together and artificially guarded by conditional constructs. Moving the first iteration outside the Region makes it easier to reason about, since it can be analyzed in isolation, while also leading to more idiomatic C code.
Exit Dispatcher Creation. Symmetrically to the creation of entry dispatchers, this step normalizes the Regions to completely enforce the Single Successor loop property. The successors absorption step was an optional step to get the low-hanging fruit in this direction, gracefully handling common cases, but not all scenarios.
If the Single Successor loop property does not hold after the successors absorption step, this step injects an exit dispatcher, which is built and acts similarly to the entry dispatcher, changing the control flow without altering the semantics of the program thanks to a state variable. Again, each of the edges 𝑒 = ⟨𝑠,𝑡⟩ with 𝑠 in the Region and 𝑡 outside is substituted with two edges. The first starts from 𝑠, sets the state variable, and jumps to the exit dispatcher. The second starts from the exit dispatcher and goes to 𝑡. In this way, the first node of the exit dispatcher becomes the single successor of the Region. Notice that this means that the exit dispatcher itself is not part of the Region, but is part of its parent Region in the Region Tree.
Figure 7 shows a case where the creation of the exit dispatcher is necessary.
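The dominator-based successor absorption described above can be sketched as follows. This is an illustrative Python sketch of ours, not the paper's implementation: it uses a naive reachability-based dominance test instead of a real dominator tree, and the node names, tuple encoding of dummy frontiers, and function names are all assumptions:

```python
def reachable(edges, start, banned=None):
    """Nodes reachable from start along edges, never entering 'banned'."""
    adj = {}
    for s, t in edges:
        adj.setdefault(s, []).append(t)
    seen, stack = set(), [start]
    while stack:
        n = stack.pop()
        if n in seen or n == banned:
            continue
        seen.add(n)
        stack.extend(adj.get(n, []))
    return seen

def dominates(edges, entry, d, n):
    """d dominates n iff removing d cuts every path from entry to n."""
    return n == d or n not in reachable(edges, entry, banned=d)

def absorb_successors(edges, entry, region, head):
    """Put a dummy frontier node on every region-exiting edge, then absorb
    every node dominated both by the region head and by some dummy."""
    frontier, new_edges = [], []
    for s, t in edges:
        if s in region and t not in region:
            f = ("dummy", s, t)
            frontier.append(f)
            new_edges += [(s, f), (f, t)]
        else:
            new_edges.append((s, t))
    nodes = {x for e in new_edges for x in e}
    absorbed = {n for n in nodes if n not in region
                and dominates(new_edges, entry, head, n)
                and any(dominates(new_edges, entry, f, n) for f in frontier)}
    return region | absorbed, new_edges

# Figure-5-like shape: region {0,1,2} with head 0; node 3 is only
# reachable through the region and gets absorbed, node 4 stays out.
region, new_edges = absorb_successors(
    [("e", 0), (0, 1), (1, 2), (2, 0), (1, 3), (2, 4), (3, 4)],
    entry="e", region={0, 1, 2}, head=0)
```

In the example, node 3 is absorbed (every path to it crosses the head and one dummy frontier), and the enlarged region is left with the single successor 4.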
Figure 7: Creating the exit dispatcher. Left: the Region composed of nodes 0 and 1 has two successors (2 and 3). Right: creation of the exit dispatcher, making it the target of the outgoing edges. The edges also carry the values assigned to the state variable, later used to dispatch the execution to the real successors.

Figure 8: Creation of break and continue nodes.

Note that this step does not alter the program semantics and does not break any of the previously enforced invariants, since the exit dispatcher is built of conditional statements. The Single Successor loop property is enforced for all Regions and will be preserved by all the following transformations.
break and continue Emission. This step transforms each Region in a DAG that is then ready to be fed into the Combing stage.
After the previous steps, the execution of a program in a given Region can either take an exit edge and jump to the single successor, or take a retreating edge and jump to the head to execute another iteration. At this point, all the properties introduced in Section 4.2 have been enforced, with two exceptions: Diamond Shape, which will be enforced later by Combing, and Two Predecessors, which was enforced at the beginning of Section 4.4.1 but might have been broken during Retreating Edges Normalization to enforce the Single Retreating Target and Single Exit. As a matter of fact, if all the retreating edges in a Region point to the head, the head might have more than two predecessors. This step re-enforces the Two Predecessors property while transforming the Region into a DAG.
It starts by removing all the retreating edges and substituting them with jumps to a newly created continue node. This naturally conveys the same semantics: a retreating edge jumps to the head to start another iteration of the loop.
Then, all the edges jumping out of the Region to the single successor are substituted with jumps to a newly created break node. This also conveys the same semantics: an exit edge jumps straight out of the loop to its single successor.
Figure 8 shows an example of these transformations.
Collapsing Regions. At this point, a Region that has been transformed by all the previous steps of Preprocessing is finally a DAG. As mentioned at the beginning of Section 4.4.2, only one Region at a time is turned into a DAG, working on the Region Tree from the leaves to the root. After a Region has been transformed into a DAG, this step collapses it into a single virtual node in its parent's representation.
This is possible since each DAG Region has a Single Entry (part of the DAG) and a Single Successor (not part of the DAG). Retreating edges have been removed from the DAG and substituted by continue nodes, which represent jumps to the Single Entry. Paths that exit from the DAG have been substituted with break nodes jumping to the Single Successor.
Hence, in the parent's representation, a DAG Region is collapsed into a single virtual node 𝑉𝐶 as follows. Given a Region 𝑃 and a nested DAG Region 𝐶 with Single Entry 𝐸 ∈ 𝐶 and Single Successor 𝑆 ∈ 𝑃 \ 𝐶, first, all the nodes in 𝐶 are moved into the virtual node 𝑉𝐶. Then, each edge 𝑒 = ⟨𝑋,𝐸⟩ jumping from 𝑃 \ 𝐶 to 𝐸 is substituted with an edge 𝑒𝑉𝐶 = ⟨𝑋,𝑉𝐶⟩. These represent all the entry paths to 𝑉𝐶 (hence to the collapsed Region 𝐶), since the Diamond Shape property guarantees that there are no edges in the form ⟨𝑋,𝑌⟩ with 𝑋 ∈ 𝑃 \ 𝐶, 𝑌 ∈ 𝐶, and 𝑌 ≠ 𝐸. From a semantic standpoint, every new edge 𝑒𝑉𝐶 jumps from 𝑋 to the head of the Region 𝐶 collapsed into 𝑉𝐶. Finally, a new edge 𝑒𝑆 = ⟨𝑉𝐶,𝑆⟩ is added to represent the fact that break nodes inside the Region 𝐶 collapsed into 𝑉𝐶 can jump straight to the successor 𝑆.
This step concludes the collapsing of a single Region. An example can be seen in Figure 9.

Figure 9: Collapsing a nested DAG Region. The Region with red nodes on the left (composed of 1, 2, 3, 4, break, and continue) can be collapsed into a virtual node from the point of view of its parent.

Once all the children of a Region have been collapsed, the Region can be processed, until all the Regions in the tree become DAGs. These DAGs contain, among others, virtual nodes that represent nested collapsed DAG Regions. The Region Tree is now ready to be processed by the Combing stage.

4.5 Combing
This is the core of the Control Flow Combing algorithm. It enforces the Diamond Shape property on all the DAGs in the Region Tree. Enforcing this property reshapes the DAG so that it is only composed of nested diamond-shaped regions. These regions have a single entry and a single exit node. They have no branches that jump directly into the middle of the region, nor branches jumping out from the middle of the region. All the paths incoming into a diamond-shaped region pass by the entry, and all the paths outgoing from the region pass by the exit. A simple example is visible in Figure 10.a. Diamond-shaped regions are easily convertible to C if-else constructs, with then and else branches, and with a single common successor that is the code emitted in C after both then and else.
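Before detailing the algorithm, the node-splitting at its heart can be sketched. The following is an illustrative Python sketch of ours (not revng-c's code): it uses a naive reachability-based dominance test, toy integer node names, and a primed string for the duplicated node:

```python
def _reach(edges, start, banned=None):
    adj = {}
    for s, t in edges:
        adj.setdefault(s, []).append(t)
    seen, stack = set(), [start]
    while stack:
        n = stack.pop()
        if n in seen or n == banned:
            continue
        seen.add(n)
        stack.extend(adj.get(n, []))
    return seen

def dominates(edges, entry, d, n):
    """d dominates n iff every path from entry to n passes through d."""
    return n == d or n not in _reach(edges, entry, banned=d)

def comb_node(edges, entry, cond, n):
    """Split node n: predecessors dominated by the conditional node 'cond'
    are retargeted to a fresh copy n', which inherits all of n's
    outgoing edges, so n' becomes dominated by 'cond'."""
    dup = str(n) + "'"
    out = [(s, dup) if (t == n and dominates(edges, entry, cond, s))
           else (s, t)
           for (s, t) in edges]
    out += [(dup, t) for (s, t) in edges if s == n]
    return out

# A non-diamond shape in the spirit of Figure 10.b: conditional 2
# targets node 4, which is also reachable straight from node 1.
edges = [(1, 2), (1, 4), (2, 3), (2, 4), (3, 5), (4, 5)]
combed = comb_node(edges, entry=1, cond=2, n=4)
```

After the split, the copy 4' is reachable only through the conditional node 2, while the original node 4 keeps the predecessors that bypass it, which is the edge grouping described in the rest of this section.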
Figure 10: (a) A diamond-shaped region. (b) A region which is not diamond-shaped. The arc between 1 and 4 breaks the assumption of not having edges incoming from outside the region. (c) The same region after the Combing has two nested diamond-shaped regions. D is a dummy node, i.e., an empty node useful only to highlight the diamond shape.

Informally, the key idea of the Combing step is to take all the regions that are not diamond-shaped (as the one in Figure 10.b) and restructure them to be diamond-shaped (as the one in Figure 10.c). In order to achieve this goal, it is necessary to duplicate some nodes in the graph. Node duplication can increase the size of the final generated C code. However, we deem that this increases clarity, since it disentangles complex overlapping paths in the control flow, linearizing them and making them easier to reason about for an analyst, who can consider them one at a time. Moreover, in most cases the duplicated nodes introduced by the Combing can be deduplicated by the Matching stage, which uses them to emit idiomatic C code such as short-circuited ifs, as explained in Section 4.6.2.
The Combing Algorithm. As all previous steps, Combing is done on a single Region at a time. Thanks to the previous steps, Regions at this point are DAGs. These two properties greatly reduce the complexity, thanks to the shift of the problem from a global to a local perspective, and since DAGs are acyclic.
For each Region DAG, the comb works as follows. First, it collects all the conditional nodes in the DAG. The Diamond Shape property states that every conditional node must dominate all the nodes between itself and its immediate post-dominator. Hence, for each of these nodes, it identifies the immediate post-dominator. This is always possible since the DAG has a single exit, thanks to the sink node injected at the beginning of Preprocessing. In this way, for each conditional node 𝐶, the algorithm identifies the set of nodes D(𝐶) between 𝐶 and its immediate post-dominator.
Second, for each node 𝑁 in D(𝐶) that is not dominated by 𝐶, there is some incoming edge 𝑒 = ⟨𝑋,𝑁⟩ such that the source node 𝑋 is not dominated by 𝐶. To enforce the Diamond Shape property, 𝑁 should be dominated by 𝐶. Hence, the node 𝑁 is duplicated, creating a basic block node 𝑁′ that contains the same instructions as 𝑁. Initially, 𝑁′ has no incoming nor outgoing edges. Then, for every outgoing edge 𝑒𝑆 = ⟨𝑁,𝑆⟩ from 𝑁, an outgoing edge 𝑒𝑆′ = ⟨𝑁′,𝑆⟩ is created from 𝑁′. This ensures that 𝑁′ jumps to the same places where 𝑁 jumped, preserving the semantics of the program after 𝑁. Then, each incoming edge 𝑒𝑃 = ⟨𝑃,𝑁⟩ into 𝑁 such that 𝐶 dominates 𝑃 is substituted with an edge 𝑒𝑃′ = ⟨𝑃,𝑁′⟩ incoming into 𝑁′. This means that after this transformation the node 𝑁′ is dominated by 𝐶, and the node 𝑁 is not reachable from 𝐶 anymore. See Figure 10.b and Figure 10.c for an example of this transformation.
Basically, the underlying idea is to group the incoming edges of node 𝑁 in two sets: one composed of the edges dominated by the conditional node 𝐶, which will be moved to node 𝑁′, and the other composed of the edges not dominated by 𝐶, which will remain attached to node 𝑁.
This is sufficient to enforce the Diamond Shape property for 𝐶 and 𝑁, but there might be other nodes in D(𝐶). Repeating this on each 𝑁 ∈ D(𝐶) fully enforces the Diamond Shape property for the conditional node 𝐶. In turn, repeating the process in post-order on all conditional nodes in the DAG enforces the property on the whole Region. Notice that the process either never touches a node 𝑁 (since it already fulfills the Diamond Shape property for all the conditional nodes from which it is reachable) or it splits the incoming edges of 𝑁 into two sets.
At the end of the procedure, the Region DAG fulfills the Diamond Shape property and is said to be combed.
Note that, as shown in Figure 10, the Combing can insert dummy nodes (i.e., empty nodes) to reinstate the Two Predecessors property and to highlight the diamond shape.
Improved Combing Algorithm: Untangling Return Paths. The Combing Algorithm as described above still has a drawback in some common cases: it duplicates code very aggressively, which can lead to a big increase in code size if not controlled.
Consider Figure 11. The source code in the figure is very simple, and represents a pretty common case where some checks are performed on the arguments (A and B), a complex computation composes the body of the function (C), and some final error check is performed before going on.
As we can see in Figure 11(a), one of the typical optimizations performed by compilers, even at lower optimization levels, is to coalesce all the returns in a single node (R). This is intended to reduce code size in the binaries, but from an analysis standpoint it "entangles" different execution paths that were originally separate (the two return statements at lines 4 and 10).
If the vanilla Combing Algorithm is applied on the graph in Figure 11(a), both nodes C and R would be duplicated, as shown in Figure 11(b). This would be very detrimental, because it would end up duplicating the whole bulk of the computation (C), unnecessarily inflating the size of the decompiled source code.
In order to cope with these cases, we devised an improved combing algorithm: the Untangling Algorithm. This improved version is focused on handling these cases, and is performed just before the vanilla Combing Algorithm. After the Untangling, the Combing is executed on the untangled graphs, so that it can iron out all the situations left behind by the Untangling because they were not beneficial to untangle.
The Untangling is applied on each conditional node in a Region DAG, and only if beneficial. Its benefits are evaluated with a heuristic that determines, for each conditional node, if the duplication induced by untangling the return path is significantly lower than the duplication that the Combing pass would introduce if the Untangling is not performed. To do this, the heuristic assigns a weight to each node in the graph, to evaluate the consequences of applying the Untangling compared to the vanilla Combing. The weight of each node is proportional to the number of instructions each node
contains, and for collapsed Regions this number is computed cumulatively on all the nodes they contain. If, according to these weights, Untangling would duplicate more code than vanilla Combing, the graph is not untangled and only combed.
Whenever triggered on a conditional node 𝑁 (B in Figure 11), the Untangling duplicates all the blocks from the post-dominator of 𝑁 to the exit of the graph (only R in Figure 11, but potentially any other node after R). This transformation allows the Combing step to keep duplication under control.
Going back to Figure 11, we can see in Figure 11(c) how the Untangling would transform the graph. Only R is duplicated, saving a huge amount of unnecessary duplication if C is big. This is also an example where, after the Untangling, the plain Combing Algorithm does not have anything to do, because all nodes in Figure 11(c) are dominated by all the conditional nodes from which they are reachable. This means that the Diamond Shape property already holds after Untangling, saving the work that would have been necessary to comb the graph.
Finally, notice how the graph in Figure 11(c) is much more structurally close to the original source code than what plain Combing would obtain, i.e., Figure 11(b).

 1 if (arg0) {    // A
 2   fun_call();  // B
 3   if (arg1)    // B
 4     return;    // R
 5 }
 6 // complex     // C
 7 // code        // C
 8 // here        // C
 9 if (err())     // C
10   return;      // R

Figure 11: Situation in which the baseline Combing would be very costly in terms of duplicated code size. The graph in (a) is the CFG of the snippet of code on the left. The red dashed node (C) represents a big and messy portion of the CFG that would greatly increase code size if duplicated. With the baseline Combing, both C and R would be duplicated, as shown in (b). With the Untangling, only R is duplicated instead, as shown in (c).

4.6 Matching C Constructs
This phase builds the initial Abstract Syntax Tree (AST) representation of each of the combed Regions, and then manipulates it to emit idiomatic C code.
4.6.1 Building the AST. Thanks to the Preprocessing and Combing stages and all the enforced properties, building an AST is straightforward. The Two Successors rule ensures that each conditional node can be emitted as an if and, since all the DAGs are diamond-shaped regions, the DAG naturally represents a program with perfectly nested scopes (each diamond-shaped part represents a scope). Moreover, all retreating edges have already been removed and converted to break and continue nodes.
All these properties imply that the dominator tree of each DAG Region is a tree where each node can have at most three children. Exploiting this property, the algorithm works on the dominator tree (from root to leaves) to emit the AST. If a node 𝐴 in the dominator tree has only a single child 𝐵, 𝐴 and 𝐵 are emitted as subsequent statements in a single scope in C. If a node 𝐴 in the dominator tree has two or three children, then 𝐴 is an if statement. Depending on how 𝐴 is connected to its children, they can represent the then branch of 𝐴, the else branch of 𝐴, and the code that is emitted in the AST after both the then and the else. This allows representing all the conditional nodes as well-structured if constructs.
A special treatment is reserved to nodes in a DAG Region that represent another nested DAG Region that was collapsed by the Collapsing Regions step. Whenever one of such nodes is encountered, it is emitted in the AST as a while(1) {...} construct. The AST representing the body of the loop is then generated iteratively from the DAG of the collapsed Region. In general, this representation is not optimal for any loop, but it is only a preliminary AST form that will be made more idiomatic as described in the next section.
4.6.2 Matching Idiomatic C Constructs. The preliminary AST is now post-processed to match idiomatic C constructs, striving to emit even more readable code while, at the same time, reducing the duplication introduced by Combing, when this is possible without sacrificing readability.
This post-processing is modular and extensible. We only report some basic matching steps leading to significant improvements with a reduced effort. Additional matching criteria can be devised and added to the pipeline, to emit even better code.
Each matching criterion listed in the following is basically structured as a top-down visit on the AST, which recognizes certain patterns and transforms the AST to more idiomatic, but semantically equivalent, C constructs.
Short-Circuit Reduction. This criterion recognizes and reconstructs short-circuited if statements in C. In fact, the Combing step breaks these constructs, as shown in Figure 10.c. This matching criterion reverts that choice when possible, allowing the Combing to handle general situations, while also emitting idiomatic short-circuited ifs whenever possible.
Figure 10.c shows two nested if statements that have the same duplicated node in their else branches. This criterion matches that pattern. Whenever two nested if nodes on the AST have the same code on one of their branches, they are transformed into a single if node, short-circuiting their conditions with the appropriate combination of &&, ||, and ! operators.
Note that previous works [23] did not perform short-circuited if matching, often leading to suboptimal results.
Switch Reconstruction. This criterion recognizes and builds switch statements. As mentioned in Section 4.4.1, to enforce the Two Successors property, switches are decomposed in nested ifs in the preprocessing phase of the Restructuring. This criterion looks in the AST for nested ifs whose conditions compare a variable for equality with different constants. Matched sequences of ifs are transformed into switches.
Loop Promotion. Similarly to what is done in Yakdan et al. [23], this criterion manipulates loops, initially emitted as while(1) {...}, to transform them into more idiomatic loops, with complex exit conditions and various shapes such as while(...) {...} and do {...} while(...).
To match while loops, the AST is scanned looking for loops whose body starts with a statement in the form if(𝑋) break;. Any such cycle can be converted into a while(!𝑋) {...}, leaving the rest of the loop body untouched.
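The loop-promotion matching can be sketched on a toy AST. The following is an illustrative Python sketch of ours; the tuple-based node shapes ("while1", "if-break", "if-continue-else-break") are assumptions made for the example, not the paper's actual AST representation:

```python
def promote(node):
    """Loop Promotion sketch: rewrite ('while1', body) nodes into
    ('while', cond, body) when the body starts with an if-break, or
    into ('do-while', cond, body) when it ends with an
    if-continue-else-break."""
    if isinstance(node, tuple) and node and node[0] == "while1":
        body = [promote(s) for s in node[1]]   # promote nested loops first
        if body and body[0][0] == "if-break":
            return ("while", "!(" + body[0][1] + ")", body[1:])
        if body and body[-1][0] == "if-continue-else-break":
            return ("do-while", body[-1][1], body[:-1])
        return ("while1", body)
    return node

# while(1) { if (a == 0) break; a -= 1; }  ->  while (!(a == 0)) { ... }
head_exit = ("while1", [("if-break", "a == 0"), ("stmt", "a -= 1")])

# while(1) { x += 1; if (x < 10) continue; else break; }  ->  do { ... } while (x < 10)
tail_exit = ("while1", [("stmt", "x += 1"),
                        ("if-continue-else-break", "x < 10")])
```

Loops that match neither pattern simply stay in while(1) form, mirroring the fact that the criteria are best-effort pattern matches.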
To match do-while loops, the AST is scanned looking for loops whose body's final statement is in the form if(𝑋) continue; else break;. These loops are transformed into do {...} while(𝑋). The same is done for loops where the continue and break statements are inverted, simply negating the condition.

5 EXPERIMENTAL RESULTS
This section evaluates the proposed approach. Section 5.1 describes the experimental setup, while Section 5.2 compares our implementation with state-of-the-art commercial and open-source decompilers.

5.1 Experimental Setup
To evaluate the performance of the Control Flow Combing described in the previous section, the algorithm has been implemented on top of the rev.ng static binary analysis framework [9, 10], based on qemu [4] and llvm [17]. rev.ng is capable of generating CFGs from binary programs for various CPU architectures. The Preprocessing and Combing stages of the algorithm have been implemented on top of the LLVM IR. After these phases, the Matching stage has been implemented on a simple custom AST for C, which is then translated into the AST employed by clang (LLVM's C/C++ compiler) and finally serialized to C in textual form.
The resulting decompiler is called revng-c. The quality of the code generated by revng-c is compared with that of two other well-known decompilers: IDA Pro's Hex-Rays Decompiler, the leading commercial decompiler developed by Hex-Rays [14], and Ghidra, developed by the National Security Agency (NSA) of the USA for internal use and recently open-sourced [3].
For a more thorough comparison, we tried to reach the authors of two other recent academic contributions in the control flow restructuring research area, namely [7] and [23]. Unfortunately, the authors of Brumley et al. [7] were not able to retrieve the artifacts to reproduce the results, while the main authors of DREAM [23] have left academia to focus on different topics, and therefore were not able to answer our inquiry. Given that DREAM [23] is the only other approach that generates goto-free C code, it would have been the perfect candidate to compare our approach with. This comparison would have enabled an evaluation of the main novelty of our approach: allowing duplication of code in order to reduce the cyclomatic complexity of the decompiled code, which is a measure of the mental load required for an analyst to understand the program. DREAM tries not to resort to duplication, while we accept small to moderate duplication because it reduces the cyclomatic complexity of the generated code. Lacking reproducible results to compare directly with DREAM, we decided to focus on a restricted number of case studies that show how the code generated by revng-c compares with their results. Given the limited reach of this manual comparison with DREAM, we have left it in Appendix B.
The remainder of this section provides comparisons between revng-c, Ghidra, and the Hex-Rays Decompiler. For these decompilers, the quality of the decompiled code was evaluated on the GNU Coreutils. These are the basic file, shell and text manipulation command line utilities of the GNU operating system. These benchmarks have been used in the past in related works on control flow restructuring [7, 23], since they implement a large set of utilities, hence producing a wide range of different real-world CFGs that do not emerge in toy examples. For the benchmarks, the GNU Coreutils 9.29 have been compiled with GCC 4.9.3, targeting the x86-64 architecture, without debug symbols and dynamic linking. We evaluated the performance at 4 optimization levels: O0, O1, O2 and O3.
At this point, all the generated binaries have been decompiled with all the decompilers, to generate C code. It is worth noting that, on the basis of the data collected during our evaluation, revng-c is the only decompiler that produces valid C code as output. The decompiled code generated by the Hex-Rays Decompiler and Ghidra cannot be parsed as-is by a standard-conforming C parser. In order to parse it, which was a requirement to collect our evaluation metrics on the decompiled code, we had to perform ad-hoc changes on the decompiled sources, such as declaring missing variables and types, correcting the number of parameters in function calls, and others.
To ensure a fair comparison, with the help of IDAPython for the Hex-Rays Decompiler and Java scripts for Ghidra, we extracted some information on the decompiled functions, such as their entry point and their size. We then proceeded to compare only the functions for which the three tools gave identical information about entry point and size. Later, in Table 1, we report the percentage of functions which matched in dimension, and that we used in our evaluation.
There is another aspect to keep in mind about using Coreutils as benchmarks. All these programs share a core library, called gnulib, whose code is statically linked with all binaries. This means that the functions in gnulib are duplicated many times. This problem has already been pointed out in the past by the authors of DREAM [23], who also designed a strategy to overcome it. The idea is simple: all the decompiled functions across all Coreutils need to be deduplicated before the final comparison, to avoid overrepresenting duplicated functions. We adopted the same strategy for the comparison of our results.

5.2 Evaluation of the Results
The quality of the generated code has been measured according to the two following metrics.
gotos. The number of emitted goto statements. goto statements are very detrimental to the readability of the decompiled code, since they can arbitrarily divert the control flow, and keeping track of the execution becomes significantly more difficult [11].
Cyclomatic Complexity. The increment in cyclomatic complexity [19] of the decompiled code, using that of the original code as a baseline. This measures the mental effort required to understand the decompiled code.
We evaluate the code generated by the three decompilers from binaries with different optimization levels according to these metrics. The evaluation is limited to the functions that all the decompilers were able to correctly identify. The results are reported in Table 1.
By construction, revng-c zeros out the gotos metric, generating 0 gotos over the entire GNU Coreutils suite. We can also see how revng-c generates decompiled code with a reduced cyclomatic complexity with respect to the Hex-Rays Decompiler and Ghidra.
Note that the cyclomatic complexity of the decompiled code of the tools is expressed with respect to this baseline. In fact, we assume that this complexity is intrinsic in the code, and the objective of the decompilation is to introduce as little additional complexity as
-O0 -O1 -O2 -O3
revng-c IDA Ghidra revng-c IDA Ghidra revng-c IDA Ghidra revng-c IDA Ghidra
Cyclomatic Complexity +11% +12% +16% +13% +17% +20% +36% +60% +72% +78% +86% +94%
Gotos 0 1010 1370 0 2370 2622 0 2082 2062 0 2119 2282
Matched functions 93% 91% 89% 81%
Table 1: Comparison between revng-c, IDA, and Ghidra. Results are aggregated by the optimization level used to obtain the binaries that were then decompiled (-O0, -O1, -O2, -O3). For each optimization level, the first row shows the additional cyclomatic complexity introduced by the decompilation process with respect to the baseline cyclomatic complexity of the original source code, the second row shows the number of gotos produced by each decompiler, and the third row shows the percentage of functions that were matched by all the decompilers (as a percentage of the binary code size).
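The cyclomatic complexity figures above are based on McCabe's metric [19], which for a single connected CFG is M = E − N + 2, with E edges and N nodes. As a minimal illustrative sketch (the graph encoding and function below are ours, not part of revng-c):

```python
def cyclomatic_complexity(cfg):
    """McCabe's metric M = E - N + 2 for a single-function CFG,
    where cfg maps each basic block to the list of its successors."""
    num_nodes = len(cfg)
    num_edges = sum(len(succs) for succs in cfg.values())
    return num_edges - num_nodes + 2

# A single if/else diamond: one decision, hence two independent paths.
diamond = {"entry": ["then", "else"], "then": ["join"], "else": ["join"], "join": []}
print(cyclomatic_complexity(diamond))  # 2
```

Every extra conditional branch adds one to M, which is why branches added by a decompiler show up directly as additional cyclomatic complexity over the baseline of the original source.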
possible. If we observe the -O2 optimization level, the one typically adopted in release builds, we can notice that revng-c is able to reduce the additional cyclomatic complexity by 40% with respect to IDA, and by almost 50% with respect to Ghidra.

For what concerns the metrics for the Hex-Rays Decompiler and Ghidra, we can see that Ghidra performs slightly better in terms of goto statements emitted, producing fewer gotos than the Hex-Rays Decompiler. Overall, as previously stated, we think that these metrics show that the two decompilers adopt a similar approach to decompilation.

In Table 2 we provide an overview of the increase in size of the decompiled code due to the duplication introduced by our approach. In Figure 12 we also show an estimate of the probability distribution function (using the KDE method) of the increase in size for all the optimization levels. This metric has not been computed for the other tools, since they do not introduce duplication.

Note also that the effects of duplication could be significantly mitigated by performing regular optimizations on the generated code, such as dead code elimination. In fact, the optimizer might be able to prove that, for instance, part of the code duplicated outside of a loop due to the outlining of the first iteration will never be executed and can therefore be dropped. However, due to timing constraints, we have not been able to assess the impact of such optimizations.

We also produced a pair of heat maps (Figure 13) that help visualize how the relationship between duplication and the decrease in cyclomatic complexity evolves. We plotted the values for the -O2 optimization level, comparing with both Hex-Rays and Ghidra. In particular, apart from the bright spots in correspondence of a low duplication level, which are positive, we can see some reddish clouds towards the center of the heat maps, which represent a class of functions for which the duplication is significant, but for which the cyclomatic complexity is reduced with respect to IDA and Ghidra. This represents the fact that even when a cost in terms of duplication is paid, we have a gain in terms of reduction of cyclomatic complexity.

            -O0    -O1    -O2    -O3
Goto       1.07×  1.10×  1.15×  1.32×
No-Goto    1.04×  1.08×  1.12×  1.25×
Table 2: Size increment metrics (over the original size) for the functions over different optimization levels. For each optimization level, we also provide the duplication factor metric computed only on functions which do not have goto statements in the original source code. While our algorithm is able to completely eliminate gotos in the decompiled code, we can see that when we approach the decompilation of code with gotos our duplication factor is penalized. Indeed, our algorithm makes the assumption that we are decompiling well-structured code; therefore, gotos in the original source code make this assumption fail.

Figure 12: Plot showing the probability distribution functions of the duplication introduced by revng-c at -O0, -O1, -O2, and -O3. On the x axis we have the amount of duplication introduced (measured in terms of code size increase over the original value). We can notice that, as the optimization level increases, revng-c introduces a little more duplication in order to be able to emit goto-free decompiled code.

6 CONCLUSION
In this work we presented a novel approach to control flow restructuring and to decompilation, introducing new techniques for transforming any given CFG into a DAG form, which we called Preprocessing, to which we later apply our Combing algorithm. Thanks to Combing, we are able to build a C AST from the input code, which is then transformed by the Matching phase to emit idiomatic C. We implemented our solution on top of the rev.ng framework, building a decompiler tool called revng-c. In the evaluation we performed, we compared our results against both academic and commercial state-of-the-art decompilers.

The experimental results show that our solution is able to avoid the emission of goto statements, which is an improvement over the Hex-Rays Decompiler and Ghidra, but at the same time does not resort to predicated execution, which on the other hand affects DREAM. In future work, we will improve the quality of the decompiled code by focusing on the recovery of more idiomatic C constructs. This type of work will be greatly simplified by the already modular nature of the Matching phase of our algorithm.
(a) Cyclomatic complexity improvement w.r.t. IDA    (b) Cyclomatic complexity improvement w.r.t. Ghidra
Figure 13: Cyclomatic complexity improvements of revng-c at O2. These heat maps help us visualize where the cyclomatic complexity improvement obtained by revng-c is introduced. The cyclomatic improvement is represented on the x axis, while the duplication factor introduced by revng-c is represented on the y axis (higher means less duplication). The color of a cell of the heat map is computed by performing a sum of bivariate distributions for each data point in our dataset (every function). This means that a cell assumes a brighter color as more data points showing those values of duplication and decrease in cyclomatic complexity are present in its surroundings.
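The cell coloring described in the caption of Figure 13 — a sum of bivariate distributions centered on each function's data point — can be sketched as follows. This is an illustrative reimplementation under our own assumptions (isotropic Gaussian kernels, arbitrary bandwidth and grid), not the plotting code used for the paper:

```python
import math

def heatmap(points, xs, ys, bandwidth=0.2):
    """Each grid cell accumulates one isotropic Gaussian kernel per data
    point (x: cyclomatic improvement, y: duplication factor)."""
    grid = [[0.0 for _ in xs] for _ in ys]
    for px, py in points:
        for i, y in enumerate(ys):
            for j, x in enumerate(xs):
                dist2 = (x - px) ** 2 + (y - py) ** 2
                grid[i][j] += math.exp(-dist2 / (2 * bandwidth ** 2))
    return grid

# Two hypothetical functions with low duplication and a modest improvement.
xs = [1.0, 1.5, 2.0, 2.5, 3.0]  # cyclomatic improvement axis
ys = [0.9, 1.2, 1.5, 1.8]       # duplication factor axis
grid = heatmap([(1.5, 1.2), (1.6, 1.2)], xs, ys)
```

Cells surrounded by many data points accumulate a larger sum and are therefore drawn brighter, which is exactly the effect described in the caption.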
More in general, in the future we aim to further improve the quality of the decompiled code in other areas, such as arguments and return values detection and advanced type recognition techniques.

In addition, we are also considering the possibility for our decompiler to support the emission of some goto statements in a very limited and controlled setting, i.e., where they may be considered idiomatic and legitimate, e.g., in the goto cleanup pattern. The goal of this would be to trade the introduction of a goto for a further reduction of the duplication introduced by our Combing algorithm.

We also want to address the verification of the semantics preservation of the control flow restructuring transformation we introduce. We deem this goal achievable thanks to the very nature of the rev.ng framework. The idea is to enforce back the modifications done by the control flow restructuring algorithm at the level of the LLVM IR lifted by rev.ng, and to use the recompilation features of the framework to prove the behavioural equivalence between the original binary and the one generated after the restructuring.

REFERENCES
[1] Hex-Rays decompiler. https://www.hex-rays.com/products/decompiler/.
[2] International Carnahan Conference on Security Technology, ICCST 2018, Montréal, Canada, October 22-25, 2018. IEEE, 2018.
[3] National Security Agency. Ghidra. https://ghidra-sre.org/.
[4] Fabrice Bellard. QEMU, a fast and portable dynamic translator. In Proceedings of the FREENIX Track: 2005 USENIX Annual Technical Conference, April 10-15, 2005, Anaheim, CA, USA, 2005.
[5] David Brumley, Ivan Jager, Thanassis Avgerinos, and Edward J. Schwartz. BAP: A binary analysis platform. In International Conference on Computer Aided Verification. Springer, 2011.
[6] David Brumley, JongHyup Lee, Edward J. Schwartz, and Maverick Woo. Native x86 decompilation using semantics-preserving structural analysis and iterative control-flow structuring. In Presented as part of the 22nd USENIX Security Symposium (USENIX Security 13), 2013.
[7] David Brumley, JongHyup Lee, Edward J. Schwartz, and Maverick Woo. Native x86 decompilation using semantics-preserving structural analysis and iterative control-flow structuring. In Proceedings of the 22nd USENIX Security Symposium, Washington, DC, USA, August 14-16, 2013, 2013.
[8] Cristina Cifuentes. Reverse compilation techniques. Queensland University of Technology, Brisbane, 1994.
[9] Alessandro Di Federico and Giovanni Agosta. A jump-target identification method for multi-architecture static binary translation. In Compilers, Architectures, and Synthesis of Embedded Systems (CASES), 2016 International Conference on, 2016.
[10] Alessandro Di Federico, Mathias Payer, and Giovanni Agosta. rev.ng: A unified binary analysis framework to recover CFGs and function boundaries. In Proceedings of the 26th International Conference on Compiler Construction, 2017.
[11] Edsger W. Dijkstra. Go to statement considered harmful. Communications of the ACM, 11(3), 1968.
[12] Alessandro Di Federico, Pietro Fezzardi, and Giovanni Agosta. rev.ng: A multi-architecture framework for reverse engineering and vulnerability discovery. In International Carnahan Conference on Security Technology, ICCST 2018, Montréal, Canada, October 22-25, 2018 [2].
[13] Ilfak Guilfanov. Decompilers and beyond. Black Hat USA, 2008.
[14] Hex-Rays. IDA Pro. https://www.hex-rays.com/products/ida/.
[15] Donald E. Knuth. The Art of Computer Programming, Volume 1 (3rd Ed.): Fundamental Algorithms. Addison Wesley Longman Publishing Co., Inc., Redwood City, CA, USA, 1997.
[16] Christopher Kruegel, William Robertson, Fredrik Valeur, and Giovanni Vigna. Static disassembly of obfuscated binaries. In USENIX Security Symposium, volume 13, 2004.
[17] Chris Lattner and Vikram Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In CGO 2004.
[18] Thomas Lengauer and Robert Endre Tarjan. A fast algorithm for finding dominators in a flowgraph. ACM Transactions on Programming Languages and Systems (TOPLAS), 1979.
[19] T. J. McCabe. A complexity measure. IEEE Transactions on Software Engineering, SE-2(4), Dec 1976.
[20] Dawn Song, David Brumley, Heng Yin, Juan Caballero, Ivan Jager, Min Gyung Kang, Zhenkai Liang, James Newsome, Pongsin Poosankam, and Prateek Saxena. BitBlaze: A new approach to computer security via binary analysis. In International Conference on Information Systems Security. Springer, 2008.
[21] Khaled Yakdan. DREAM code snippets. https://net.cs.uni-bonn.de/fileadmin/ag/martini/Staff/yakdan/code_snippets_ndss_2015.pdf.
[22] Khaled Yakdan, Sergej Dechand, Elmar Gerhards-Padilla, and Matthew Smith. Helping Johnny to analyze malware: A usability-optimized decompiler and malware analysis user study. In 2016 IEEE Symposium on Security and Privacy (SP). IEEE, 2016.
[23] Khaled Yakdan, Sebastian Eschweiler, Elmar Gerhards-Padilla, and Matthew Smith. No more gotos: Decompilation using pattern-independent control-flow structuring and semantic-preserving transformations. In NDSS, 2015.
A GRAPHS BASICS
This section introduces the fundamental concepts needed to understand the design of the Control Flow Combing algorithm, described in Section 4.

Graphs. A directed graph is a pair G = ⟨V, E⟩, where V is a set and E ⊂ V × V is a set of pairs of elements of V. Each v ∈ V is called a node, and each e = ⟨v1, v2⟩ is called an edge. Given e as defined above, v1 is said to be a predecessor of v2, while v2 is said to be a successor of v1. e is said to be outgoing from v1 and incoming in v2. v1 is called the source of e and v2 is called the target of e. A sequence of edges e1 = ⟨v1,1, v1,2⟩, ..., en = ⟨vn,1, vn,2⟩ is called a path if, for all k = 1, ..., n−1, it holds that vk,2 = vk+1,1.

Control Flow Graphs. A Control Flow Graph (CFG) is a directed graph used to represent the control flow of a function in a program. Each node of a CFG is called a basic block and represents a sequence of instructions in the program that are executed sequentially, without any branch with the exception of the last instruction in the basic block. Each edge in a CFG is called a branch. A branch b = ⟨BB1, BB2⟩ in a CFG represents the notion that the execution of the program at the end of BB1 might jump to the beginning of BB2. Branches can be conditional or unconditional. A branch b is unconditional if it is always taken, independently of the specific values of conditions in the program at runtime. The source s of an unconditional branch b has no other outgoing edges. Conversely, a branch b is called conditional if it might be taken by the execution at runtime, depending on the runtime value of specific conditions in the program. The source of a conditional branch always has multiple outgoing edges, and the conditions associated to each outgoing edge are always strictly exclusive. Finally, a CFG representing a function has a special basic block, called entry node, entry basic block, or simply entry, that represents the point in the CFG where the execution of the function starts. In the remainder of this work, where not specified otherwise, we will refer to CFGs.

Depth First Search. The concept of Depth First Search (DFS) [15] is very important for the rest of this work. Briefly, DFS is a search algorithm over a graph, which starts from a root node and explores as far as possible following a branch (going deep in the graph, hence the name), before backtracking and following the other branches. When the algorithm explores a new node, it is pushed on a stack, and, when the visit of the subtree starting in that node is completed, it is popped from the exploration stack. Such a traversal can be used both for inducing an ordering on the nodes of a graph and for calculating the set of retreating edges (informally, edges that jump back in the control flow). Directed graphs without retreating edges are called Directed Acyclic Graphs (DAGs). A Depth First Search induces the following orderings of the nodes of a graph:
Preorder. Ordering of the nodes according to when they were first visited by the DFS and, therefore, pushed on the exploration stack.
Postorder. Ordering of the nodes according to when their subtree has been visited completely and, therefore, popped from the exploration stack.
Reverse Postorder. The reverse of postorder.

Dominance and Post-Dominance. Two other fundamental concepts in program analysis are dominance and post-dominance. Informally, dominance describes, given a node, which nodes have to be traversed on all paths from the entry (which must be present and unique) to that node. Given a graph and two nodes A and B, A dominates B iff every path from the entry to B contains A. A properly dominates B if A dominates B and A ≠ B. A immediately dominates B if A properly dominates B and there does not exist a node C such that A properly dominates C and C properly dominates B. Conversely, post-dominance is related to which nodes must be traversed on paths from a given node to the exit, if this is present and unique. In cases where there is not a single exit node, post-dominance is not defined. A node A post-dominates B if every path from B to the exit contains A. A properly post-dominates B if A post-dominates B and A ≠ B. A immediately post-dominates B if A properly post-dominates B and there does not exist a node C such that A properly post-dominates C and C properly post-dominates B.

Dominator and Post-Dominator Tree. The dominator tree and the post-dominator tree are compact representations of the dominance and post-dominance relationships within a graph. The dominator tree (DT) contains a node for each node of the input graph, and an edge from node A to node B iff A is the immediate dominator of B. The resulting graph is guaranteed to be a tree, since each node except the entry node has a unique immediate dominator. As a consequence, the entry node is the root of the dominator tree. Whenever the exit of a graph is unique, it is possible to build an analogous data structure for the post-dominance relationship, called the post-dominator tree (PDT). A well-known and widely used algorithm for calculating a dominator tree is Lengauer-Tarjan's algorithm, which has the peculiarity of having an almost linear complexity [18]. Examples of a dominator and a post-dominator tree for a CFG are represented in Figure 14.

Figure 14: Left – An example graph. Node 1 is the entry and 5 is the exit. Middle – Dominator tree of the graph on the left. Edges in this tree go from the immediate dominator to the dominated node. Right – Post-dominator tree of the graph on the left. Edges in this tree go from the immediate post-dominator to the post-dominated node.
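To make the notions above concrete, the following sketch computes the DFS orderings, the retreating edges, and the dominance sets on a small example CFG. The graph encoding and the example graph are ours for illustration; the dominance computation is the textbook iterative fixed point, not the Lengauer-Tarjan algorithm [18] that a production implementation would use:

```python
from collections import defaultdict

def dfs_orders(graph, entry):
    """DFS recording preorder (push time), postorder (pop time), and
    retreating edges (edges whose target is still on the DFS stack)."""
    preorder, postorder, retreating = [], [], []
    visited, on_stack = set(), set()

    def visit(node):
        visited.add(node)
        on_stack.add(node)
        preorder.append(node)
        for succ in graph[node]:
            if succ in on_stack:
                retreating.append((node, succ))  # jumps back in the control flow
            elif succ not in visited:
                visit(succ)
        on_stack.remove(node)
        postorder.append(node)

    visit(entry)
    return preorder, postorder, list(reversed(postorder)), retreating

def dominators(graph, entry):
    """Iterative fixed point: Dom(n) = {n} ∪ ⋂ Dom(p) over predecessors p,
    with Dom(entry) = {entry}. Assumes every node is reachable from entry."""
    preds = defaultdict(set)
    for node, succs in graph.items():
        for succ in succs:
            preds[succ].add(node)
    dom = {node: set(graph) for node in graph}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for node in graph:
            if node == entry:
                continue
            new = {node} | set.intersection(*(dom[p] for p in preds[node]))
            if new != dom[node]:
                dom[node] = new
                changed = True
    return dom

# Example CFG: a loop 2 -> {3, 4} -> 5 -> 2, with entry 1 and exit 6.
cfg = {1: [2], 2: [3, 4], 3: [5], 4: [5], 5: [2, 6], 6: []}
pre, post, rpo, retreating = dfs_orders(cfg, 1)
dom = dominators(cfg, 1)
print(retreating)       # [(5, 2)]: the loop back edge
print(sorted(dom[6]))   # [1, 2, 5, 6]
```

Note that a graph stripped of its retreating edges has no cycles left, i.e., it is a DAG, which is the form the Preprocessing step described in the paper produces before Combing is applied.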
void sub43E100(void *a1, int a2) {
  ...
  v4 = *(result + 0xc);
  if (a2 >= v4)
    v5 = *(result + 8) + v4;
  if (a2 < v5 && a2 >= v4)
    break;
  i++;
  result += 0x28;
  ...
}
Figure 15: Snippet showing the reuse of the condition a2 >= v4 on the same execution path.

                  DREAM  revng-c
Cridex                9        5
ZeusP2P               9        5
SpyEye               19       15
OverlappingLoop       4        3
Table 3: This table presents the cyclomatic complexity of the code produced by DREAM and revng-c. As we can see, DREAM consistently presents higher figures compared to revng-c, due to the high number of situations in which conditions are reused multiple times.
B CASE STUDIES
This section is devoted to a comparison of our solution with DREAM. We need a special section for this since the only artifacts of decompilation available from DREAM are the ones included in a whitepaper [21] cited in [23]. In this document, for every sample of code decompiled by DREAM, the corresponding code decompiled by Hex-Rays is present. Unfortunately, we were not able to recover the original functions (in terms of binary code) used for the evaluation of DREAM. This is due to the fact that the presented snippets belong to malware samples for which a lot of different variants are available.

We observe that the provided Hex-Rays Decompiler decompiled source resembles very closely the original assembly representation (e.g., due to the abundance of goto statements). Therefore, in order to be able to compare our results with DREAM's, we decided to use the Hex-Rays Decompiler decompiled sources as a starting point for obtaining the CFG of the functions, then, in turn, apply our algorithm and obtain the revng-c decompiled sources.

To assess on a large scale how revng-c performed compared to DREAM, we collected the mentioned metrics both on the DREAM decompiled sources provided in the whitepaper and on the sources produced by revng-c.

Table 3 presents the results we obtained. The higher cyclomatic complexity in code produced by DREAM is due to the predicated execution-like code. In fact, in these cases, the same condition will be employed multiple times as a state variable that enables or disables certain portions of the code. This approach forces the analyst to keep track of the state of the variables, increasing their mental load in a non-negligible way. As we can see, both DREAM and revng-c provide decompiled sources without goto statements, but DREAM presents reuse of the conditions, as expected and informally explained throughout the paper. As a concrete example of conditional reuse, consider the snippet in Figure 15 (lines from 11 to 17, extracted from snippet 1.5 in the DREAM whitepaper [21]), where we can see how the condition a2 >= v4 is reused twice on the same execution path.

As an additional example, Figure 16 compares how a situation that DREAM (on the left of the listing) handles through predicated execution is handled with duplication in revng-c (on the right of the listing). The revng-c listing has been manually modified to reflect the same variable names used by DREAM. Also, some optimizations in terms of code readability have been performed, but these changes do not concern the control flow and are simple aesthetic improvements. We can see that DREAM uses a predicated execution approach, guarding the statement at line 11 with a complicated condition. On the other hand, revng-c duplicates some code, in this case the assignment, directly where the conditions to evaluate if the assignment needs to be performed are available, specifically at line 5. The idea is to inline the portion of code, paying a cost in terms of duplication, instead of deferring it and paying a cost in terms of the mental load necessary to understand when this assignment is actually executed. In the example, the cost is visible in the DREAM snippet as the convoluted condition of the if statement at lines 9 and 10. In this case, duplication also highlights immediately what value is assigned to v2, that is, the return value of the function.

In this section we illustrated why we think that predicated execution is suboptimal in terms of mental load for the analyst that reads the decompiled code. The point is that, in the presence of predicated execution, different parts of the code in different conditional constructs are executed on the basis of the state of the conditional variable. This causes a mix of control flow and information on the state of the conditional variable, which causes the heavy mental load. Of course this is something that can be present in C code in principle, but the predicated execution introduced by DREAM pushes this to the limit where it becomes an impediment for the analyst. We have shown this by highlighting in a couple of examples where this happens and how we instead approach the decompilation of the same snippet of code, and by showing that the predicated execution approach increases the cyclomatic complexity of the code.

During the design of the validation of our work, we also evaluated the possibility of conducting a user study to evaluate the performance of different decompilers, as done in [22]. However, we deemed that such a user study is really helpful to evaluate the overall performance of a decompiler tool only once aspects orthogonal to what is presented in this paper are developed, such as the identification of library functions and type identification techniques. In this paper, instead, we focused on the control flow recovery portion of the decompilation task, and this led us to set up the experimental evaluation the way we did. Anyway, we do not exclude conducting a user study once the other mentioned aspects of the decompiler have matured.
DREAM (left):
 1 if (!cond1 && !cond2) {
 2   v4 = sub4634E2(a1 + a2*4, a7, 0, ...);
 3   v2 = v4;
 4   if (v4) {
 5     cond3 = v4 == -4;
 6     ...
 7   }
 8 }
 9 if ((cond1 || v4) && (cond1 || !cond2)
10     && (cond3 || !v3) && (!cond1 || !v3))
11   v2 = -3;
12 if (!HeapValidate(GetProcessHeap(), 0, lpMem))
13   return v2;
14 HeapFree(GetProcessHeap(), 0, lpMem);
15 return v2;

revng-c (right):
 1 if (var_4) {
 2   v4 = sub4634E2(a1 + a2*4, a7, 0, ...);
 3   if (v4) {
 4     if (v4 == -4) {
 5       v2 = -3;
 6     } else {
 7       v2 = v4;
 8     }
 9     if (!HeapValidate(GetProcessHeap(), 0, lpMem))
10       return v2;
11     HeapFree(GetProcessHeap(), 0, lpMem);
12   }
13   ...
14 }
15 return v2;

Figure 16: Side by side SpyEye listings of the decompiled source by DREAM (on the left) and revng-c (on the right).