Program-Analysis-ThuTrangNguyen-Day-2
iSE, UET
How do you verify the correctness of
a software program?
Testing is one of the most
common methods:
• Bug detection
• Correctness verification
Why do we
need testing?
• All software has bugs
• Bugs are hard to find
• Bugs cause serious harm
Is there any other method for
software quality assurance without
running the program?
What is static (program) analysis?
• Static analysis is a method of (automatically) examining the source code without having to execute the program
Static vs. Dynamic analysis
Example
Example
• Over-approximation:
“yes”, “no”, “maybe”
Example
• Sound and complete:
“yes”, “no”
Another
example
Another
example
• Over-approximation:
any value
Another
example
• Under-approximation:
some value in [0, 2)
Another
example
• Sound and complete?
• Exploring all possible outputs:
practically impossible
• This is the case for most real-
world programs
Under vs. Over-approximation
• Program: P
• For input 𝑖 ∈ I, we observe a behavior P(𝑖)
False negatives
False positives
Program representations
Many ways to represent a (part of) program:
• Sequence of characters
• Sequence of tokens
• Abstract syntax tree (AST)
• Control flow graph
• Program dependence graph
• Call graph
• Intermediate representation
• Etc.
Sequence of
characters
• Original code written by the
programmer
• Human-readable form of the
program
Abstract syntax tree (AST)
• Tree representation of source code
• “Abstract” because some details of syntax are omitted
  • E.g. { in Java
• Nodes: Construct in source code
• Edges: Parent-child relationship
• Tools: Esprima, Joern
• Used for syntax analysis
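The AST idea can be tried out with Python's standard `ast` module. This is only a small sketch: the slides mention Java and the tools Esprima and Joern, whereas the snippet below uses Python source and the helper name `node_types` is made up for illustration. It lists the node types of a tiny assignment and shows that redundant parentheses, a purely syntactic detail, do not appear in the tree.

```python
import ast


def node_types(node, depth=0):
    """Return (depth, node type name) pairs in pre-order."""
    pairs = [(depth, type(node).__name__)]
    for child in ast.iter_child_nodes(node):
        pairs.extend(node_types(child, depth + 1))
    return pairs


tree = ast.parse("x = a + b")

# Print the tree, indenting each node by its depth (parent-child edges).
for depth, name in node_types(tree):
    print("  " * depth + name)

# Parentheses are concrete syntax only: both inputs yield the same AST dump.
same_tree = ast.dump(ast.parse("x = (a + b)")) == ast.dump(tree)
```

Here `same_tree` plays the role of the "{ in Java" example: a token that matters to the parser but is omitted from the abstract tree.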
Control flow graph (CFG)
• Models flow of control through a program
• Directed graph (N, E) with:
  • Nodes N: basic blocks = sequences of operations executed together
  • Edges E: possible transfers of control
• Typically on the method level
• Used for analyzing possible paths of a program
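As a rough illustration of the (N, E) definition, a CFG can be stored as a dictionary from each node to its successors. The graph below is a hypothetical if/else with made-up block names (`cond`, `then`, `else`, `join`); it is not taken from the slides. The helper `paths` enumerates the loop-free paths from entry to exit.

```python
# Hypothetical CFG of an if/else: node -> list of successor nodes (edges E).
CFG = {
    "entry": ["cond"],
    "cond": ["then", "else"],
    "then": ["join"],
    "else": ["join"],
    "join": ["exit"],
    "exit": [],
}


def paths(cfg, node="entry", target="exit", prefix=()):
    """Enumerate paths from node to target that visit no node twice."""
    prefix = prefix + (node,)
    if node == target:
        return [list(prefix)]
    found = []
    for succ in cfg[node]:
        # Skip nodes already on the path so that loops cannot recurse forever.
        if succ not in prefix:
            found.extend(paths(cfg, succ, target, prefix))
    return found


all_paths = paths(CFG)
```

Restricting the search to loop-free paths is a simplification chosen to keep the enumeration finite; a CFG with loops has infinitely many paths in general.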
More about
CFG
Program dependence graph (PDG)
• A directed graph (N, E) that represents the control/data dependencies between program components
  • Nodes N: basic blocks = sequences of operations executed together
  • Edges E: possible (data/control) dependence relationships
• Typically on the method level
• Used for optimization, parallelization, vulnerability detection
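The data-dependence half of a PDG can be sketched for straight-line code as follows. This is a simplified illustration under stated assumptions: nodes are single statements rather than basic blocks, there is no branching (so no control dependences), and the program `PROGRAM` and the function `data_dependences` are hypothetical.

```python
def data_dependences(stmts):
    """Compute data-dependence edges for straight-line code.

    stmts is a list of (label, defined_variable, used_variables) in
    execution order. An edge (d, u) means statement u reads a value
    that was most recently written by statement d.
    """
    last_def = {}
    edges = set()
    for label, defined, used in stmts:
        for var in used:
            if var in last_def:
                edges.add((last_def[var], label))
        if defined is not None:
            last_def[defined] = label
    return edges


# Hypothetical program: 1: x = 5   2: y = 1   3: z = x + y   4: x = z
PROGRAM = [
    (1, "x", set()),
    (2, "y", set()),
    (3, "z", {"x", "y"}),
    (4, "x", {"z"}),
]

edges = data_dependences(PROGRAM)
```

Statements 1 and 2 have no edge between them, which is the kind of information an optimizer or parallelizer could use to reorder them.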
Types of program analysis
Data flow analysis
One popular way of formulating a static analysis
Real-world use cases
Many IDE features are based on data flow analysis
• E.g.
  • Reaching definitions
  • Unused variables
• Propagate analysis information along the
edges of a control flow graph
Available expression analysis
• Goal: for each program point, compute which expressions must have already been computed, and not later modified.
• Useful, e.g., to avoid re-computing an expression
• Used as part of compiler optimization
Example
Transfer functions
• Transfer function of a statement: How the statement affects the analysis state
• Here: analysis state = available expressions
• Two functions:
  • gen: Available expressions generated by a statement
  • kill: Available expressions killed by a statement
gen function
Function $gen: Stmt \to \mathcal{P}(Expr)$
• A statement generates an available expression e if:
  • It evaluates e and
  • It does not later write any variable used in e
• Otherwise, function returns empty set
Example:
var x = a + b; generates a + b
kill function
Function $kill: Stmt \to \mathcal{P}(Expr)$
• A statement kills an available expression e if:
  • It modifies any of the variables used in e
• Otherwise, function returns empty set
Example:
a = 23; kills a * b
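The two definitions above can be sketched in a few lines. The encoding is an assumption made for illustration: a statement is given as its assigned variable (`target`, or `None` for a condition) plus its right-hand side as a string, expressions are compared textually, and an expression counts as non-trivial if it contains an arithmetic operator. The names `variables`, `is_nontrivial` and `ALL_EXPRS` are hypothetical.

```python
import re


def variables(expr):
    """Variable names occurring in an expression string."""
    return set(re.findall(r"[A-Za-z_]\w*", expr))


def is_nontrivial(expr):
    """Assumption: non-trivial means the expression contains an operator."""
    return re.search(r"[-+*/]", expr) is not None


def gen(target, rhs):
    """Expression made available by 'target = rhs' (empty set otherwise)."""
    if not is_nontrivial(rhs):
        return set()
    if target in variables(rhs):
        # The statement writes a variable used in rhs, e.g. a = a + 1.
        return set()
    return {rhs}


def kill(target, all_exprs):
    """Expressions invalidated by writing to target."""
    return {e for e in all_exprs if target in variables(e)}


# Domain of a small example: all non-trivial expressions of interest.
ALL_EXPRS = {"a + b", "a * b", "a + 1"}
```

With this encoding, `gen("x", "a + b")` yields `{"a + b"}` as in the slide's first example, and `kill("a", ALL_EXPRS)` contains `"a * b"` as in the second.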
Example
Draw the control flow
graph of this code snippet
Example
CFG: entry → x=a+b → y=a*b → y>a+b
• y>a+b (T) → a=a+1 → x=a+b → back to y>a+b
• y>a+b (F) → exit
Propagating available expressions
• Initially, no available expressions
• Forward analysis: Propagate available expressions in the direction of control flow
• For each statement $s$, outgoing available expressions = incoming available expressions minus $kill(s)$ plus $gen(s)$
Data flow equations
CFG: entry → (1) x = a + b → (2) y = a * b → (3) y > a + b
• (3) y > a + b (T) → (4) a = a + 1 → (5) x = a + b → back to (3)
• (3) y > a + b (F) → exit
• $AE_{entry}(s)$: available expressions at the entry of s
• $AE_{exit}(s)$: available expressions at the exit of s
• $AE_{entry}(1) = \emptyset$
• $AE_{entry}(2) = AE_{exit}(1)$
• $AE_{entry}(3) = AE_{exit}(2) \cap AE_{exit}(5)$
• $AE_{entry}(4) = AE_{exit}(3)$
• $AE_{entry}(5) = AE_{exit}(4)$
• $AE_{exit}(1) = AE_{entry}(1) \cup \{a + b\}$
• $AE_{exit}(2) = AE_{entry}(2) \cup \{a * b\}$
• $AE_{exit}(3) = AE_{entry}(3) \cup \{a + b\}$
• $AE_{exit}(4) = AE_{entry}(4) \setminus \{a + b, a * b, a + 1\}$
• $AE_{exit}(5) = AE_{entry}(5) \cup \{a + b\}$
Solution of the equations
S | $AE_{entry}(S)$ | $AE_{exit}(S)$
1 | $\emptyset$ | $\{a + b\}$
2 | $\{a + b\}$ | $\{a + b, a * b\}$
3 | $\{a + b\}$ | $\{a + b\}$
4 | $\{a + b\}$ | $\emptyset$
5 | $\emptyset$ | $\{a + b\}$
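The solution can be reproduced by iterating the equations until nothing changes. The sketch below hard-codes the gen and kill sets read off the equations for statements 1 to 5 and starts every set at ∅; the names `GEN`, `KILL`, `PREDS` and `solve` are illustrative, and the round-robin iteration is one simple choice among several possible solving strategies.

```python
# gen/kill sets per statement, taken from the equations for statements 1-5.
GEN = {1: {"a+b"}, 2: {"a*b"}, 3: {"a+b"}, 4: set(), 5: {"a+b"}}
KILL = {1: set(), 2: set(), 3: set(), 4: {"a+b", "a*b", "a+1"}, 5: set()}
# CFG predecessors: statement 3 is reached from 2 and from the loop end 5.
PREDS = {1: [], 2: [1], 3: [2, 5], 4: [3], 5: [4]}


def solve():
    """Iterate the available-expression equations to a fixed point."""
    entry = {s: set() for s in GEN}
    exit_ = {s: set() for s in GEN}
    changed = True
    while changed:
        changed = False
        for s in sorted(GEN):
            preds = PREDS[s]
            if preds:
                # Meet operator: intersection over all predecessors.
                new_entry = set.intersection(*(exit_[p] for p in preds))
            else:
                # Boundary condition: nothing is available at the entry node.
                new_entry = set()
            new_exit = (new_entry - KILL[s]) | GEN[s]
            if new_entry != entry[s] or new_exit != exit_[s]:
                entry[s], exit_[s] = new_entry, new_exit
                changed = True
    return entry, exit_


ae_entry, ae_exit = solve()
for s in sorted(GEN):
    print(s, sorted(ae_entry[s]), sorted(ae_exit[s]))
```

The loop stops after the second pass changes only statement 3, whose entry set grows from ∅ to {a+b} once the exit set of statement 5 is known.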
Solution of the equation
Is x – y an available expression when entering statement 7?
Defining a data flow analysis
Any data flow analysis is defined by six properties:
• Domain
• Direction
• Transfer function
• Meet operator
• Boundary condition
• Initial values
Domain
• Analysis associates some information with every program point
  • “Information” means elements of a set
• Domain of the analysis: All possible elements the set may have
• E.g., for available expressions analysis: Domain is set of non-trivial expressions
Direction
• Analysis propagates information along the control flow graph:
  • Forward analysis: normal flow of control
  • Backward analysis: invert all edges
    • Reasons about executions in reverse
• E.g., available expression analysis: Forward
Transfer function
• Defines how a statement affects the analysis state
Meet operator
• What if two statements $s_1$, $s_2$ flow to a statement $s$?
  • Forward analysis: Execution branches merge
  • Backward analysis: Branching point
• Meet operator defines how to combine the incoming information
  • Union: $DF_{entry}(s) = DF_{exit}(s_1) \cup DF_{exit}(s_2)$
  • Intersection: $DF_{entry}(s) = DF_{exit}(s_1) \cap DF_{exit}(s_2)$
Boundary condition
• What information to start with at the first CFG node?
  • Forward analysis: first node is entry node
  • Backward analysis: first node is exit node
• Common choices:
  • Empty set
  • Entire domain
Initial values
• What is the information to start with at intermediate nodes?
• Common choices:
  • Empty set
  • Entire domain
Defining a data flow analysis
Any data flow analysis is defined by six properties. Available expression analysis is defined as:
Property | Available expression analysis
Domain | Non-trivial expressions
Direction | Forward
Transfer function | $AE_{exit}(s) = (AE_{entry}(s) \setminus kill(s)) \cup gen(s)$
Meet operator | Intersection ($\cap$)
Boundary condition | $AE_{entry}(entryNode) = \emptyset$
Initial values | $\emptyset$
Reaching definitions analysis
• Goal: for each program point, compute which assignments may have been made and may not have been overwritten
• Useful in various program analyses:
  • Detect uninitialized variables
  • Optimize register allocation
  • E.g. to compute a data flow graph
Example
Definition (x) reaches the entry of this statement
Example
All definitions reach the entry of this statement
Defining the analysis
• Domain: Definitions (assignments) in the code
  • Set of pairs $(v, s)$ of variables and statements
  • $(v, s)$ means a definition of $v$ at $s$
• Direction: Forward
• Meet operator: Union
  • Because we care about definitions that may reach a program point
Defining the analysis (2)
• Transfer function: $RD_{exit}(s) = (RD_{entry}(s) \setminus kill(s)) \cup gen(s)$
• Function $gen(s)$
  • If $s$ is an assignment to $v$: $(v, s)$
  • Otherwise: empty set
• Function $kill(s)$
  • If $s$ is an assignment to $v$: $(v, s')$ for all $s'$ that define $v$ (whether $s' = s$ or $s' \neq s$)
  • Otherwise: empty set
Defining the analysis (3)
• Boundary condition: Entry node starts with all variables undefined
  • Special “statement” for undefined variables: ?
  • $RD_{entry}(entryNode) = \{(v, ?) \mid v \in Vars\}$
• Initially, all nodes have no reaching definitions
Example
Draw CFG for this code
snippet
Example
CFG: entry → (1) x = 5 → (2) y = 1 → (3) x > 1
• (3) x > 1 (T) → (4) y = x * y → (5) x = x - 1 → back to (3)
• (3) x > 1 (F) → exit
Data flow equations
• $RD_{entry}(1) = \{(x, ?), (y, ?)\}$
• $RD_{entry}(2) = RD_{exit}(1)$
• $RD_{entry}(3) = RD_{exit}(2) \cup RD_{exit}(5)$
• $RD_{entry}(4) = RD_{exit}(3)$
• $RD_{entry}(5) = RD_{exit}(4)$
• $RD_{exit}(1) = (RD_{entry}(1) \setminus \{(x, 1), (x, 5), (x, ?)\}) \cup \{(x, 1)\}$
• $RD_{exit}(2) = (RD_{entry}(2) \setminus \{(y, 2), (y, 4), (y, ?)\}) \cup \{(y, 2)\}$
• $RD_{exit}(3) = RD_{entry}(3)$
• $RD_{exit}(4) = (RD_{entry}(4) \setminus \{(y, 2), (y, 4), (y, ?)\}) \cup \{(y, 4)\}$
• $RD_{exit}(5) = (RD_{entry}(5) \setminus \{(x, 1), (x, 5), (x, ?)\}) \cup \{(x, 5)\}$
Solution of the equations
S | $RD_{entry}(S)$ | $RD_{exit}(S)$
1 | $\{(x, ?), (y, ?)\}$ | $\{(x, 1), (y, ?)\}$
2 | $\{(x, 1), (y, ?)\}$ | $\{(x, 1), (y, 2)\}$
3 | $\{(x, 1), (y, 2), (x, 5), (y, 4)\}$ | $\{(x, 1), (y, 2), (x, 5), (y, 4)\}$
4 | $\{(x, 1), (y, 2), (x, 5), (y, 4)\}$ | $\{(x, 1), (x, 5), (y, 4)\}$
5 | $\{(x, 1), (x, 5), (y, 4)\}$ | $\{(x, 5), (y, 4)\}$
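A minimal sketch of how this table can be computed, assuming the five-statement example with its loop from statement 5 back to statement 3. The dictionaries `ASSIGNS` and `PREDS` and the function names are illustrative choices; the undefined-variable marker is written as the string `"?"`.

```python
VARS = {"x", "y"}
# Statement label -> variable it assigns (None for the condition x > 1).
ASSIGNS = {1: "x", 2: "y", 3: None, 4: "y", 5: "x"}
# CFG predecessors: statement 3 is reached from 2 and from the loop end 5.
PREDS = {1: [], 2: [1], 3: [2, 5], 4: [3], 5: [4]}


def gen(s):
    """The definition created by statement s, if it is an assignment."""
    v = ASSIGNS[s]
    return {(v, s)} if v is not None else set()


def kill(s):
    """All definitions of the variable assigned by s, including (v, '?')."""
    v = ASSIGNS[s]
    if v is None:
        return set()
    return {(v, t) for t, w in ASSIGNS.items() if w == v} | {(v, "?")}


def solve():
    """Iterate the reaching-definition equations to a fixed point."""
    boundary = {(v, "?") for v in VARS}
    entry = {s: set() for s in ASSIGNS}
    exit_ = {s: set() for s in ASSIGNS}
    changed = True
    while changed:
        changed = False
        for s in sorted(ASSIGNS):
            if PREDS[s]:
                # Meet operator: union over all predecessors.
                new_entry = set().union(*(exit_[p] for p in PREDS[s]))
            else:
                new_entry = set(boundary)
            new_exit = (new_entry - kill(s)) | gen(s)
            if (new_entry, new_exit) != (entry[s], exit_[s]):
                entry[s], exit_[s] = new_entry, new_exit
                changed = True
    return entry, exit_


rd_entry, rd_exit = solve()
```

Compared with available expressions, only the meet operator (union instead of intersection) and the boundary condition differ; the iteration scheme is the same.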
Live variables analysis
• Goal: for each statement, find variables that may be “live” at the exit from the statement
  • “live”: the variable is used before being redefined
• Useful, e.g., for identifying dead code
  • Bug detection: dead assignments are typically unexpected
  • Optimization: remove dead code
Example
x is not live after this statement
Example
Both x and y are live after this statement
Defining the analysis
• Domain: all variables occurring in the code
• Direction: Backward
• Meet operator: Union
  • Because we care about whether a variable may be used
• Transfer function: $LV_{entry}(s) = (LV_{exit}(s) \setminus kill(s)) \cup gen(s)$
  • Backward analysis: Returns set of variables that are live at entry of statement
Defining the analysis
• Boundary condition: Final node starts with no live variables
  • $LV_{exit}(finalNode) = \emptyset$
• Initially, all nodes have no live variables
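The backward direction can be sketched on a small straight-line program. Everything concrete here is an assumption for illustration: the four-statement program is hypothetical, and since the slides do not spell out gen and kill for this analysis, the sketch uses the common choice gen(s) = variables read by s and kill(s) = variable written by s.

```python
# Hypothetical program: 1: x = 2   2: y = 4   3: x = 1   4: z = x + y
# Statement label -> (variable written, variables read).
STMTS = {
    1: ("x", set()),
    2: ("y", set()),
    3: ("x", set()),
    4: ("z", {"x", "y"}),
}
# CFG successors; statement 4 is the final node.
SUCCS = {1: [2], 2: [3], 3: [4], 4: []}


def live_variables(stmts, succs):
    """Backward analysis: live variables at entry and exit of each statement."""
    entry = {s: set() for s in stmts}
    exit_ = {s: set() for s in stmts}
    changed = True
    while changed:
        changed = False
        # Visit statements in reverse order since information flows backward.
        for s in sorted(stmts, reverse=True):
            # Meet operator: union over successors (empty at the final node).
            new_exit = set().union(*(entry[t] for t in succs[s]))
            written, read = stmts[s]
            # Assumed transfer: kill = {written variable}, gen = read variables.
            new_entry = (new_exit - {written}) | read
            if (new_entry, new_exit) != (entry[s], exit_[s]):
                entry[s], exit_[s] = new_entry, new_exit
                changed = True
    return entry, exit_


lv_entry, lv_exit = live_variables(STMTS, SUCCS)
```

In this program x is not live after statement 1, because statement 3 overwrites it before any use, so the assignment at statement 1 is dead; both x and y are live after statement 3.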
Quiz
Compute the live variables
before and after every
statement
Intra- vs. inter-procedural analysis
• Intra-procedural analysis:
  • Reason about a function in isolation
• Inter-procedural analysis:
  • Reason about multiple functions
  • One control flow graph per function
entry → x=1 → exit
entry → Console.log(y) → exit
• Arguments passed into call
• Propagate to formal parameters of callee
Application of Static Analysis
Vulnerability Detection through Static Analysis
What is
A security vulnerability is a flaw/weakness in a system that can be exploited by attackers to
Vulnerability vs. Bug
Bug Vulnerability
Vulnerability types
Out of bounds
Use after free
SQL injection
XSS
Null pointer dereference
Integer overflow or wraparound
Improper input validation
Use of hard-coded credentials
Vulnerabilities
by year
Source: https://www.cvedetails.com
Example
Is there any problem
within this program?
Example
Use-after-free
• Occurs when a program continues to use a pointer to memory that has already been freed or deallocated
• This can lead to undefined behavior, crashes, or security vulnerabilities
• Prevention of use-after-free
  • Set pointer to null after freeing memory
Example
Is there any problem
within this program?
Example
Double free
• A double free vulnerability occurs when a program attempts to free (or deallocate) the same block of memory more than once.
• Consequences of double free:
  • Memory corruption
  • Crash or unstable behavior
  • Exploited by attackers
• Prevention of double free
  • Set pointers to null after freeing
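To connect these two vulnerability classes back to static analysis, here is a toy checker over a made-up straight-line list of memory operations. The operation names (`malloc`, `free`, `use`, `set_null`) and the function `check_memory_ops` are hypothetical, and real tools such as those listed later in the deck reason over full control flow and aliasing, which this sketch ignores.

```python
def check_memory_ops(ops):
    """Flag use-after-free and double free in a straight-line operation list.

    ops is a list of (operation, pointer_name) pairs, where operation is
    one of "malloc", "free", "use" or "set_null".
    """
    freed = set()  # pointers that currently refer to freed memory
    warnings = []
    for index, (op, ptr) in enumerate(ops):
        if op in ("malloc", "set_null"):
            # The pointer no longer refers to freed memory.
            freed.discard(ptr)
        elif op == "free":
            if ptr in freed:
                warnings.append((index, "double free", ptr))
            freed.add(ptr)
        elif op == "use" and ptr in freed:
            warnings.append((index, "use after free", ptr))
    return warnings


BUGGY = [("malloc", "p"), ("use", "p"), ("free", "p"), ("use", "p"), ("free", "p")]
FIXED = [("malloc", "p"), ("use", "p"), ("free", "p"), ("set_null", "p"), ("free", "p")]

buggy_warnings = check_memory_ops(BUGGY)
fixed_warnings = check_memory_ops(FIXED)
```

The `FIXED` list mirrors the prevention advice: once the pointer is set to null after freeing, the second free is no longer reported (in C, freeing a null pointer is a no-op). A later use of the null pointer would be a null pointer dereference instead, which this sketch does not model.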
Example
Is there any problem
within this program?
Example
Null pointer dereference
• A null pointer dereference occurs when a program tries to access or manipulate data through a pointer that has a null value
• Dereferencing null pointer causes an error because there’s no valid data to access
Null pointer dereference
• Attackers can leverage null pointer dereference to cause various forms of harm such as crashing the program, causing a denial of service
Buffer
overflow
Buffer
overflow
Integer
overflow or
Wraparound
SQL injection
Path
traversal
Use hard-
coded
credentials
How to avoid vulnerabilities?
• Comply with coding standards
  • CERT (https://wiki.sei.cmu.edu/confluence/display/seccode)
  • MISRA (https://misra.org.uk)
• Follow secure designs
• Apply software quality assurance techniques/tools
  • Clang analyzer (https://clang-analyzer.llvm.org)
  • Cppcheck (https://cppcheck.sourceforge.io)
  • Infer (https://fbinfer.com)
Future trends
How to create a program analysis?
Traditional program analysis
• Manually crafted
• Years of work
• Precise, logical reasoning
Insight: Lots of data about software
development to learn from
analysis
Documentation Predictive tool
learning
Bug reports
Etc.
Information
useful for
developer
Traditional vs. neural software analysis
Application
• Type prediction
Q&A