A Comb for Decompiled C Code
Enforcing Two Predecessors and Two Successors. First, the Two Successors property is enforced by transforming all the switch statements into cascaded conditional branches, with two targets each. Similarly, the Two Predecessors property is enforced by taking each basic block with more than one predecessor and transforming it into a tree of empty basic blocks (dummies) that only jump to their single successor.
These operations do not alter in any way the semantics of the program. Moreover, the Two Predecessors and Two Successors properties are preserved in all the following steps.
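As a source-level illustration of the Two Successors enforcement (a minimal sketch only: the function names and the three-way switch are hypothetical, and the actual transformation operates on the CFG rather than on C source), a switch with three targets can be rewritten as a cascade of two-way conditional branches:

    #include <stdio.h>

    static void f(void) { puts("case 0"); }
    static void g(void) { puts("case 1"); }
    static void h(void) { puts("default"); }

    /* Original CFG: one switch node with three successors. */
    static void with_switch(int x) {
        switch (x) {
        case 0: f(); break;
        case 1: g(); break;
        default: h(); break;
        }
    }

    /* After enforcing Two Successors: cascaded conditional
     * branches, each with exactly two targets. */
    static void with_cascade(int x) {
        if (x == 0)
            f();
        else if (x == 1)
            g();
        else
            h();
    }

    int main(void) {
        for (int x = 0; x < 3; ++x) {
            with_switch(x);
            with_cascade(x);
        }
        return 0;
    }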
Identifying Nested Regions. The core idea of this step is to merge sets of partially overlapping loops in the CFG into an individual Region that we can then reason about as a single loop.
To define these Regions, the algorithm starts from all the Strongly Connected Subgraphs (SCS), i.e., subgraphs of the original CFG whose nodes are all reachable from each other. There might be several overlapping and non-overlapping SCSs in a graph. Note that an SCS is a different concept from a Strongly Connected Component (SCC), typically used in loop analysis. In fact, SCCs are always non-overlapping by definition, and their union represents the entire CFG. In particular, we are interested in SCSs induced by retreating edges. Given a retreating edge 𝑒 = ⟨𝑠, 𝑡⟩, the SCS induced by 𝑒 is constituted by all the nodes that are on at least one path starting from 𝑡 and ending in 𝑠 that crosses neither 𝑡 nor 𝑠.
First, the algorithm identifies all the SCSs induced by the retreating edges in the CFG, simply applying the definition above. Note that at this stage the resulting SCSs can still overlap, whereas to build a hierarchy between SCSs it is necessary for the set of SCSs to form a partially ordered set under the strict subset relation (⊂). Hence, for each pair of SCSs 𝐴 and 𝐵, if 𝐴 ∩ 𝐵 ≠ ∅, 𝐴 ⊄ 𝐵, and 𝐵 ⊄ 𝐴, then 𝐴 ∪ 𝐵 is added to the set of SCSs, and 𝐴 and 𝐵 are removed from it. When this happens, the algorithm restarts from the beginning, until a fixed point is reached. Notice that the union of two SCSs is always an SCS, hence the process can proceed. An example of partially overlapping SCSs that trigger this condition is shown in Figure 2.
This process converges since the ∪ operator is monotonic and the CFG has a finite number of nodes. At the end, only a set of SCSs that is partially ordered by the ⊂ relationship is left. Each of these remaining SCSs is a Region, roughly representing a loop or a set of loops tightly entangled together. Considering the whole CFG as a Region itself, the ⊂ relationship naturally induces a tree on all the Regions. The whole CFG is the root of the tree and, moving towards the leaves, we encounter more and more deeply nested loops. This tree structure is called the Region Tree.
Notice that the grouping of nodes into Regions does not alter the CFG, hence it does not alter the program semantics. The same holds if a node is moved inside or outside of an existing Region.
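The fixed-point merging step can be sketched as follows (an illustrative sketch, not the actual implementation: node sets are encoded as 64-bit bitmasks and the helper name is made up):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Each SCS is a set of nodes encoded as a bitmask (at most 64
     * nodes); scs[i] is the i-th SCS, *count the current number of
     * SCSs. Partially overlapping pairs are replaced by their union
     * until a fixed point is reached. */
    static void merge_partially_overlapping(uint64_t scs[], size_t *count) {
        bool changed = true;
        while (changed) {               /* restart until a fixed point */
            changed = false;
            for (size_t i = 0; i < *count && !changed; ++i) {
                for (size_t j = i + 1; j < *count && !changed; ++j) {
                    uint64_t a = scs[i], b = scs[j];
                    bool overlap = (a & b) != 0;
                    bool a_in_b  = (a & ~b) == 0;    /* A contained in B */
                    bool b_in_a  = (b & ~a) == 0;    /* B contained in A */
                    if (overlap && !a_in_b && !b_in_a) {
                        scs[i] = a | b;              /* the union is still an SCS */
                        scs[j] = scs[*count - 1];    /* drop A and B, keep A ∪ B  */
                        --*count;
                        changed = true;
                    }
                }
            }
        }
    }

    int main(void) {
        /* Two partially overlapping SCSs: {0,1,2} and {1,2,3}. */
        uint64_t scs[2] = { 0x7, 0xE };
        size_t count = 2;
        merge_partially_overlapping(scs, &count);
        /* count is now 1 and scs[0] is 0xF, i.e., the union {0,1,2,3}. */
        return 0;
    }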
4.4.2 Turning Regions into DAGs. The goal of this phase is to turn each Region into a DAG that can then be reasoned about in simpler terms. This process is composed of various incremental steps. The combination of all these steps enforces on the Regions all the remaining properties introduced in Section 4.2 except for the Diamond Shape property, i.e., the Two Successors property and the Loop Properties. Where noted, some of the steps are optional and dedicated to gracefully handling common cases.
The following steps work on a single Region at a time, moving from the leaves to the root of the Region Tree. At the beginning of this process all Regions but the root are still SCSs. At the end of this process the Regions have been transformed into DAGs, so that they can be treated by the next phase, Combing.
Electing Regions' Heads. The Loop Properties require every Region to have a Single Entry and a Single Retreating Target. However, at this stage, each Region may contain multiple retreating edges, possibly targeting different nodes. This step elects the entry node: the node that is the target of the highest number of retreating edges. This node, the head node, represents the beginning of the loop body, and will be the target of all the retreating edges in the loop.
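A minimal sketch of the head election (illustrative only; the edge representation and the helper name are assumptions, and ties are broken by the lowest node identifier):

    #include <stddef.h>

    struct edge { int source; int target; };

    /* Elect the head of a Region: among the targets of the retreating
     * edges, pick the node hit by the largest number of them. */
    int elect_head(const struct edge *retreating, size_t count, int num_nodes) {
        int best = -1, best_hits = 0;
        for (int node = 0; node < num_nodes; ++node) {
            int hits = 0;
            for (size_t i = 0; i < count; ++i)
                if (retreating[i].target == node)
                    ++hits;
            if (hits > best_hits) {
                best_hits = hits;
                best = node;
            }
        }
        return best;
    }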
Region itself, the ⊂ relationship naturally induces a tree on all the correct target (either a target of an abnormal edge or the head). This
regions. The whole CFG is the root of the tree, and moving towards set of nodes is called the head dispatcher, and its first node is called
leaves we encounter more and more deeply nested loops. This tree ℎ. Finally, each abnormal edge 𝑒 = ⟨𝑠,𝑡⟩ is replaced with a new pair
structure is called the Region Tree. of edges. The former edge of this pair is 𝑒ℎ = ⟨𝑠,ℎ⟩. This edge points
Notice that the grouping of nodes in Regions does not alter the to the entry point of the head dispatcher, and sets the state variable
CFG, hence it does not alter the program semantic. The same holds to the value associated to 𝑡, say 𝑣𝑡 . The latter edge is added from the
if a node is moved inside or outside of an existing Region. From this node in the head dispatcher that checks for the condition 𝑣 ==𝑣𝑡 to
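The effect of this normalization, expressed at the C level, can be sketched as follows (a sketch under the assumption that the Region of Figure 3 falls through nodes 0, 1, 2; the node bodies and the branch conditions are placeholders, not the actual output of revng-c):

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical node bodies and branch conditions; only the
     * control-flow shape matters here. */
    static void node(int n) { printf("node %d\n", n); }
    static bool abnormal_condition(int i) { return i == 1; }
    static bool exit_condition(int i)     { return i >= 3; }

    /* Every retreating edge sets the state variable v and jumps back
     * to the head dispatcher, which routes execution either to the
     * head (node 0) or to node 1, the target of the former abnormal
     * edge. The abnormal edge 2 -> 1 thus becomes a plain continue. */
    static void normalized_region(void) {
        int v = 0;            /* 0: resume at the head, 1: resume at node 1 */
        for (int i = 0; ; ++i) {
            /* head dispatcher */
            if (v == 0)
                node(0);      /* head elected for the Region */
            node(1);          /* reached from the head or when v == 1 */
            node(2);
            if (abnormal_condition(i)) {
                v = 1;        /* former abnormal retreating edge 2 -> 1 */
                continue;
            }
            if (exit_condition(i))
                break;
            v = 0;            /* regular retreating edge back to the head */
        }
    }

    int main(void) { normalized_region(); return 0; }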
Figure 4: Normalizing retreating edges on the CFG from Figure 3. All the retreating edges (dashed) now point to the new head dispatcher and set the state variable (values are reported on the edge labels). The head dispatcher then jumps to the original target node.
Figure 5: Absorbing Successors. Left – The Region with the nodes with dashed border has two successors: 3 and 4. Right – Node 3 has been absorbed in the Region, which now has a single successor.
If we observe the -O2 optimization level, the one typically adopted in release builds, we can notice that revng-c is able to reduce the additional cyclomatic complexity by 40% with respect to IDA, and by almost 50% with respect to Ghidra.
Concerning the metrics for the Hex-Rays Decompiler and Ghidra, we can see that Ghidra performs slightly better in terms of goto statements emitted, emitting fewer gotos than the Hex-Rays Decompiler. Overall, as previously stated, we think that these metrics show that the two decompilers adopt a similar approach to decompilation.
In Table 2 we provide an overview of the increase in size of the decompiled code due to the duplication introduced by our approach. In Figure 12 we also show an estimate of the probability distribution function (using the KDE method) of the increase in size for all the optimization levels. This metric has not been computed for the other tools, since they do not introduce duplication.

            -O0     -O1     -O2     -O3
  Goto      1.07×   1.10×   1.15×   1.32×
  No-Goto   1.04×   1.08×   1.12×   1.25×

Table 2: Size increment metrics (over the original size) for the functions over different optimization levels. For each optimization level, we also provide the duplication factor metric computed only on functions which do not have goto statements in the original source code. While our algorithm is able to completely eliminate gotos in the decompiled code, we can see that when we decompile code containing gotos our duplication factor is penalized. Indeed, our algorithm assumes that we are decompiling well-structured code, therefore gotos in the original source code make this assumption fail.

Figure 12: Probability distribution functions of the duplication introduced by revng-c, for optimization levels O0–O3. On the x axis, we have the amount of duplication introduced (measured in terms of code size increase over the original value). We can notice that, as the optimization level increases, revng-c introduces slightly more duplication in order to be able to emit goto-free decompiled code.

Note also that the effects of duplication could be significantly mitigated by performing regular optimizations on the generated code, such as dead code elimination. In fact, the optimizer might be able to prove that, for instance, part of the code duplicated outside of a loop due to the outlining of the first iteration will never be executed and can therefore be dropped. However, due to timing constraints, we have not been able to assess the impact of such optimizations.
We also produced a pair of heat maps (Figure 13) that help visualize how the relationship between duplication and decrease in cyclomatic complexity evolves. We plotted the values for the -O2 optimization level, comparing with both Hex-Rays and Ghidra. In particular, apart from the bright spots in correspondence of a low duplication level, which are positive, we can see some reddish clouds towards the center of the heat maps, which represent a class of functions for which the duplication is significant, but for which the cyclomatic complexity is reduced with respect to IDA and Ghidra. This represents the fact that even when a cost in terms of duplication is paid, we obtain a gain in terms of reduction of cyclomatic complexity.

6 CONCLUSION
In this work we presented a novel approach to control flow restructuring and to decompilation, introducing new techniques for transforming any given CFG into a DAG form, which we called Preprocessing, to which we later apply our Combing algorithm. Thanks to Combing, we are able to build a C AST from the input code, which is then transformed by the Matching phase to emit idiomatic C. We implemented our solution on top of the rev.ng framework, building a decompiler tool called revng-c. In the evaluation we performed, we compared our results against both academic and commercial state-of-the-art decompilers.
The experimental results show that our solution is able to avoid the emission of goto statements, which is an improvement over the Hex-Rays Decompiler and Ghidra, while at the same time it does not resort to predicated execution, which on the other hand affects DREAM. In future work, we aim to improve the quality of the decompiled code by focusing on the recovery of more idiomatic C constructs. This type of work will be greatly simplified by the already modular nature of the Matching phase of our algorithm.
Figure 13: Cyclomatic complexity improvements of revng-c at O2: (a) cyclomatic complexity improvement w.r.t. IDA, (b) cyclomatic complexity improvement w.r.t. Ghidra. These heat maps help visualize where the cyclomatic complexity improvement obtained by revng-c is introduced. The cyclomatic improvement is represented on the x axis, while the duplication factor introduced by revng-c is represented on the y axis (higher means less duplication). The color of a cell of the heat map is computed by summing bivariate distributions centered on each data point in our dataset (every function). This means that a cell assumes a brighter color as more data points showing values of duplication and decrease in cyclomatic complexity in its surroundings are present.
More in general, in the future we aim to further improve the quality of the decompiled code in other areas, such as argument and return value detection and advanced type recognition techniques.
In addition, we are also considering the possibility for our decompiler to support the emission of some goto statements in a very limited and controlled setting, i.e., where they may be considered idiomatic and legitimate, e.g., in the goto cleanup pattern. The goal would be to trade the introduction of a goto for a further reduction of the duplication introduced by our Combing algorithm.
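For reference, the goto cleanup pattern mentioned above is the common C error-handling idiom sketched below (an illustrative example only, not output of our decompiler; the function name and buffer size are made up):

    #include <stdio.h>
    #include <stdlib.h>

    /* "goto cleanup" idiom: a single, forward goto funnels all
     * error paths through one cleanup sequence. */
    int process_file(const char *path) {
        int ret = -1;
        char *buffer = NULL;
        FILE *file = fopen(path, "rb");
        if (file == NULL)
            goto cleanup;
        buffer = malloc(4096);
        if (buffer == NULL)
            goto cleanup;
        if (fread(buffer, 1, 4096, file) == 0)
            goto cleanup;
        ret = 0;  /* success */
    cleanup:
        free(buffer);
        if (file != NULL)
            fclose(file);
        return ret;
    }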
We also want to address the verification of the semantics preservation of the control flow restructuring transformation we introduce. We deem this goal achievable thanks to the very nature of the rev.ng framework. The idea is to enforce back the modifications done by the control flow restructuring algorithm at the level of the LLVM IR lifted by rev.ng, and to use the recompilation features of the framework to prove the behavioural equivalence between the original binary and the one generated after the restructuring.
REFERENCES
[1] Hex-Rays decompiler. https://www.hex-rays.com/products/decompiler/.
[2] International Carnahan Conference on Security Technology, ICCST 2018, Montréal, Canada, October 22-25, 2018. IEEE, 2018.
[3] National Security Agency. Ghidra. https://ghidra-sre.org/.
[4] Fabrice Bellard. QEMU, a fast and portable dynamic translator. In Proceedings of the FREENIX Track: 2005 USENIX Annual Technical Conference, April 10-15, 2005, Anaheim, CA, USA, 2005.
[5] David Brumley, Ivan Jager, Thanassis Avgerinos, and Edward J. Schwartz. BAP: A binary analysis platform. In International Conference on Computer Aided Verification. Springer, 2011.
[6] David Brumley, JongHyup Lee, Edward J. Schwartz, and Maverick Woo. Native x86 decompilation using semantics-preserving structural analysis and iterative control-flow structuring. In Presented as part of the 22nd USENIX Security Symposium (USENIX Security 13), 2013.
[7] David Brumley, JongHyup Lee, Edward J. Schwartz, and Maverick Woo. Native x86 decompilation using semantics-preserving structural analysis and iterative control-flow structuring. In Proceedings of the 22nd USENIX Security Symposium, Washington, DC, USA, August 14-16, 2013, 2013.
[8] Cristina Cifuentes. Reverse compilation techniques. Queensland University of Technology, Brisbane, 1994.
[9] Alessandro Di Federico and Giovanni Agosta. A jump-target identification method for multi-architecture static binary translation. In Compilers, Architectures, and Synthesis of Embedded Systems (CASES), 2016 International Conference on, 2016.
[10] Alessandro Di Federico, Mathias Payer, and Giovanni Agosta. rev.ng: A unified binary analysis framework to recover CFGs and function boundaries. In Proceedings of the 26th International Conference on Compiler Construction, 2017.
[11] Edsger W. Dijkstra. Go to statement considered harmful. Communications of the ACM, 11(3), 1968.
[12] Alessandro Di Federico, Pietro Fezzardi, and Giovanni Agosta. rev.ng: A multi-architecture framework for reverse engineering and vulnerability discovery. In International Carnahan Conference on Security Technology, ICCST 2018, Montréal, Canada, October 22-25, 2018 [2].
[13] Ilfak Guilfanov. Decompilers and beyond. Black Hat USA, 2008.
[14] Hex-Rays. IDA Pro. https://www.hex-rays.com/products/ida/.
[15] Donald E. Knuth. The Art of Computer Programming, Volume 1 (3rd Ed.): Fundamental Algorithms. Addison Wesley Longman Publishing Co., Inc., Redwood City, CA, USA, 1997.
[16] Christopher Kruegel, William Robertson, Fredrik Valeur, and Giovanni Vigna. Static disassembly of obfuscated binaries. In USENIX Security Symposium, volume 13, 2004.
[17] Chris Lattner and Vikram Adve. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In CGO 2004.
[18] Thomas Lengauer and Robert Endre Tarjan. A fast algorithm for finding dominators in a flowgraph. ACM Transactions on Programming Languages and Systems (TOPLAS), 1979.
[19] T. J. McCabe. A complexity measure. IEEE Transactions on Software Engineering, SE-2(4), Dec 1976.
[20] Dawn Song, David Brumley, Heng Yin, Juan Caballero, Ivan Jager, Min Gyung Kang, Zhenkai Liang, James Newsome, Pongsin Poosankam, and Prateek Saxena. BitBlaze: A new approach to computer security via binary analysis. In International Conference on Information Systems Security. Springer, 2008.
[21] Khaled Yakdan. DREAM code snippets. https://net.cs.uni-bonn.de/fileadmin/ag/martini/Staff/yakdan/code_snippets_ndss_2015.pdf.
[22] Khaled Yakdan, Sergej Dechand, Elmar Gerhards-Padilla, and Matthew Smith. Helping Johnny to analyze malware: A usability-optimized decompiler and malware analysis user study. In 2016 IEEE Symposium on Security and Privacy (SP). IEEE, 2016.
[23] Khaled Yakdan, Sebastian Eschweiler, Elmar Gerhards-Padilla, and Matthew Smith. No more gotos: Decompilation using pattern-independent control-flow structuring and semantic-preserving transformations. In NDSS, 2015.
A GRAPHS BASICS
This section introduces the fundamental concepts needed to understand the design of the Control Flow Combing algorithm, described in Section 4.
Graphs. A directed graph is a pair 𝐺 = ⟨𝑉, 𝐸⟩, where 𝑉 is a set and 𝐸 ⊂ 𝑉 × 𝑉 is a set of pairs of elements of 𝑉. Each 𝑣 ∈ 𝑉 is called a node, and each 𝑒 = ⟨𝑣1, 𝑣2⟩ is called an edge. Given 𝑒 as defined above, 𝑣1 is said to be a predecessor of 𝑣2, while 𝑣2 is said to be a successor of 𝑣1. 𝑒 is said to be outgoing from 𝑣1 and incoming in 𝑣2. 𝑣1 is called the source of 𝑒 and 𝑣2 is called the target of 𝑒. A sequence of edges 𝑒1 = ⟨𝑣1,1, 𝑣1,2⟩, ..., 𝑒𝑛 = ⟨𝑣𝑛,1, 𝑣𝑛,2⟩ is called a path if 𝑣𝑘,2 = 𝑣𝑘+1,1 holds for all 𝑘 = 1, ..., 𝑛 − 1.
Control Flow Graphs. A Control Flow Graph (CFG) is a directed graph used to represent the control flow of a function in a program. Each node of a CFG is called a basic block and represents a sequence of instructions in the program that are executed sequentially, without any branch except possibly the last instruction in the basic block.
Each edge in a CFG is called a branch. A branch 𝑏 = ⟨𝐵𝐵1, 𝐵𝐵2⟩ in a CFG represents the notion that the execution of the program at the end of 𝐵𝐵1 might jump to the beginning of 𝐵𝐵2. Branches can be conditional or unconditional. A branch 𝑏 is unconditional if it is always taken, independently of the specific values of conditions in the program at runtime. The source 𝑠 of an unconditional branch 𝑏 has no other outgoing edges. Conversely, a branch 𝑏 is called conditional if it might be taken by the execution at runtime, depending on the runtime value of specific conditions in the program. The source of a conditional branch always has multiple outgoing edges, and the conditions associated to each outgoing edge are always mutually exclusive.
Finally, a CFG representing a function has a special basic block, called entry node, entry basic block, or simply entry, that represents the point in the CFG where the execution of the function starts. In the remainder of this work, where not specified otherwise, we will refer to CFGs.
Depth First Search. The concept of Depth First Search (DFS) [15] is very important for the rest of this work. Briefly, DFS is a search algorithm over a graph which starts from a root node and explores as far as possible following a branch (going deep into the graph, hence the name), before backtracking and following the other branches. When the algorithm explores a new node, it is pushed on a stack and, when the visit of the subtree starting in that node is completed, it is popped from the exploration stack. Such a traversal can be used both for inducing an ordering on the nodes of a graph and to calculate the set of retreating edges (informally, edges that jump back in the control flow). Directed graphs without retreating edges are called Directed Acyclic Graphs (DAGs). A Depth First Search induces the following orderings of the nodes of a graph:
Preorder. Ordering of the nodes according to when they were first visited by the DFS and, therefore, pushed on the exploration stack.
Postorder. Ordering of the nodes according to when their subtree has been visited completely and, therefore, popped from the exploration stack.
Reverse Postorder. The reverse of postorder.
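As an illustration, a recursive DFS that assigns preorder and postorder numbers and detects retreating edges (edges whose target is still on the exploration stack) could look as follows (a sketch; the adjacency-matrix representation and the fixed node limit are simplifications, not the data structures used by revng-c):

    #include <stdio.h>

    #define MAX_NODES 16

    /* Adjacency matrix of a small directed graph (illustrative only). */
    static int adj[MAX_NODES][MAX_NODES];
    static int num_nodes;

    static int pre[MAX_NODES], post[MAX_NODES];  /* preorder / postorder numbers */
    static int seen[MAX_NODES], on_stack[MAX_NODES];
    static int pre_counter, post_counter;

    /* Recursive DFS: assigns preorder and postorder numbers and reports
     * retreating edges, i.e., edges whose target is still on the
     * exploration stack when the edge is examined. */
    static void dfs(int node) {
        seen[node] = 1;
        on_stack[node] = 1;
        pre[node] = pre_counter++;
        for (int succ = 0; succ < num_nodes; ++succ) {
            if (!adj[node][succ])
                continue;
            if (on_stack[succ])
                printf("retreating edge %d -> %d\n", node, succ);
            else if (!seen[succ])
                dfs(succ);
        }
        post[node] = post_counter++;
        on_stack[node] = 0;
    }

    int main(void) {
        /* Tiny example: 0 -> 1 -> 2 -> 0 (a loop) and 1 -> 3. */
        num_nodes = 4;
        adj[0][1] = adj[1][2] = adj[2][0] = adj[1][3] = 1;
        dfs(0);
        for (int n = 0; n < num_nodes; ++n)
            printf("node %d: preorder %d, postorder %d\n", n, pre[n], post[n]);
        return 0;
    }

Reverse postorder is then obtained by simply reading the postorder numbers in decreasing order.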
Dominance and Post-Dominance. Two other fundamental concepts in program analysis are dominance and post-dominance. Informally, dominance describes, given a node, which nodes have to be traversed on all paths from the entry, which must be present and unique, to that node. Given a graph and two nodes 𝐴 and 𝐵, 𝐴 dominates 𝐵 iff every path from the entry to 𝐵 contains 𝐴. 𝐴 properly dominates 𝐵 if 𝐴 dominates 𝐵 and 𝐴 ≠ 𝐵. 𝐴 immediately dominates 𝐵 if 𝐴 properly dominates 𝐵 and there does not exist a node 𝐶 such that 𝐴 properly dominates 𝐶 and 𝐶 properly dominates 𝐵.
Conversely, post-dominance is related to which nodes must be traversed on paths from a given node to the exit, if this is present and unique. In cases where there is not a single exit node, post-dominance is not defined. The node 𝐴 post-dominates 𝐵 if every path from 𝐵 to the exit contains 𝐴. 𝐴 properly post-dominates 𝐵 if 𝐴 post-dominates 𝐵 and 𝐴 ≠ 𝐵. 𝐴 immediately post-dominates 𝐵 if 𝐴 properly post-dominates 𝐵 and there does not exist a node 𝐶 such that 𝐴 properly post-dominates 𝐶 and 𝐶 properly post-dominates 𝐵.
Dominator and Post-Dominator Tree. The dominator tree (and the post-dominator tree) are a compact representation of the dominance (and post-dominance) relationship within a graph. The dominator tree (DT) contains a node for each node of the input graph, and an edge from node 𝐴 to node 𝐵 iff 𝐴 is the immediate dominator of 𝐵. The resulting graph is guaranteed to be a tree, since each node except the entry node has a unique immediate dominator. As a consequence, the entry node is the root of the dominator tree. Whenever the exit of a graph is unique, it is possible to build an analogous data structure for the post-dominance relationship, called the post-dominator tree (PDT). A well known and widely used algorithm for calculating a dominator tree is Lengauer-Tarjan's algorithm, which has the peculiarity of having an almost linear complexity [18]. Examples of a dominator and a post-dominator tree for a CFG are represented in Figure 14.
Figure 14: Left – An example graph. Node 1 is the entry and 5 is the exit. Middle – Dominator tree of the graph on the left. Edges in this tree go from immediate dominator to dominated node. Right – Post-dominator tree of the graph on the left. Edges in this tree go from immediate post-dominator to post-dominated node.
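For completeness, dominance can also be computed with a simple iterative data-flow fixed point. This is not the Lengauer-Tarjan algorithm cited above, only an illustrative and less efficient alternative; node sets are encoded as bitmasks and all nodes are assumed reachable from the entry:

    #include <stdint.h>

    /* dom[n] is a bitmask of all nodes dominating n; preds[n] is a
     * bitmask of n's predecessors; at most 32 nodes. */
    void compute_dominators(const uint32_t preds[], uint32_t dom[],
                            int num_nodes, int entry) {
        uint32_t all = (num_nodes == 32) ? 0xFFFFFFFFu
                                         : ((1u << num_nodes) - 1u);
        for (int n = 0; n < num_nodes; ++n)
            dom[n] = (n == entry) ? (1u << entry) : all;
        int changed = 1;
        while (changed) {
            changed = 0;
            for (int n = 0; n < num_nodes; ++n) {
                if (n == entry)
                    continue;
                /* dom(n) = {n} ∪ (intersection of dom(p) over all predecessors p) */
                uint32_t meet = all;
                for (int p = 0; p < num_nodes; ++p)
                    if (preds[n] & (1u << p))
                        meet &= dom[p];
                uint32_t next = meet | (1u << n);
                if (next != dom[n]) {
                    dom[n] = next;
                    changed = 1;
                }
            }
        }
    }

    int main(void) {
        /* Diamond: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3; entry is node 0. */
        uint32_t preds[4] = { 0, 1u << 0, 1u << 0, (1u << 1) | (1u << 2) };
        uint32_t dom[4];
        compute_dominators(preds, dom, 4, 0);
        /* dom[3] is {0, 3}: node 3 is dominated only by the entry and itself. */
        return 0;
    }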
void sub43E100(void *a1, int a2) {
    ...
    v4 = *(result + 0xc);
    if (a2 >= v4)
        v5 = *(result + 8) + v4;
    if (a2 < v5 && a2 >= v4)
        break;
    i++;
    result += 0x28;
    ...
}

Figure 15: Snippet showing the reuse of the condition a2 >= v4 on the same execution path.

                   DREAM   revng-c
  Cridex               9         5
  ZeusP2P              9         5
  SpyEye              19        15
  OverlappingLoop      4         3

Table 3: Cyclomatic complexity of the code produced by DREAM and revng-c. As we can see, DREAM consistently presents higher figures compared to revng-c, due to the high number of situations in which conditions are reused multiple times.

Figure 16: Side-by-side SpyEye listings of the source decompiled by DREAM (on the left) and revng-c (on the right).