Hidden Overhead of a Function API

What we do at Snap with C++
● Neural style transfer
● Full body tracking
● Ray tracing
● Face tracking
● Cloth simulation
● Wrist tracking
Thank you, Serhii Huralnik and Eduardo Madrid!!
Section 0. Introduction
Section 1. Return value
Section 2. Parameter passing
Section 3. Multiple parameters
Tony Van Eerd: “people are not writing enough functions”
When people finally start writing more functions, we’d prefer to get only the well-designed ones!
When talking about performance, we typically think about the function logic. We’ll see that a well-designed function API can have an even larger impact.
How will we compare performance?

● Benchmarks at this low level are not very reliable, and they don’t represent performance in large projects well.
● Dynamic instruction count is more reliable on modern CPUs.
● We’ll use simple examples, so that we can just compare the number of instructions generated by a compiler.
Accelerate large-scale applications with BOLT (link)

“… machine code … can range from 10s to 100s of megabytes in size, which is
often too large to fit in any modern CPU instruction cache. As a result, the
hardware spends a considerable amount of processing time — nearly 30 percent,
in many cases — getting an instruction stream from memory to the CPU.”
Disclaimer:
Our discussion is relevant
only for non-inlined functions
ISO C++ wiki: Do inline functions improve performance?
Yes and no. Sometimes. Maybe.
There are no simple answers. inline functions might make the code faster, they
might make it slower. They might make the executable larger, they might make it
smaller. They might cause thrashing, they might prevent thrashing. And they might
be, and often are, totally irrelevant to speed.
Credit to Khalil Estell: Firefox function distribution

157946 functions above (127B)
167404 functions below (127B)
Understanding how machine code is generated from C++
● C++ Standard
● C++ ABIs: Itanium, Microsoft
● Platform ABIs: System V gABI, Windows
● psABI: ARM, x86, …
Targets

armv8-a                       x86-64 (AMD64)            x86-64 (AMD64)
System V ABI                  System V ABI              Microsoft ABI
- iPhone                      - Linux server            - Windows device
- M1 Mac and newer            - old Mac
- Android smartphone

armv7-a                       x86 (IA-32)               x86 (IA-32)
- ancient iPhone              - ancient Linux server    - ancient Windows device
- low-end Android smartphone

ABI documents, compilers and flags

Architecture      ABI document                                                Compiler and flags
armv8-a           Procedure Call Standard for the Arm® 64-bit Architecture    armv8-a clang 18.1.0, -O2 -std=c++20
x86-64 (AMD64)    AMD64 Architecture Processor Supplement                     x86-64 gcc 14.2, -O2 -std=c++20
x86-64 (AMD64)    x64 calling convention                                      x64 msvc v19.40 VS17.10, -O2 /std:c++20
armv7-a           Procedure Call Standard for the Arm® Architecture           -O2 -std=c++20
x86 (IA-32)       Intel386 Architecture Processor Supplement                  -O2 -std=c++20 -m32
x86 (IA-32)       calling conventions                                         -O2 /std:c++20
Things are complicated

We’ll be looking for simple guidelines to navigate this complexity.
C++ Core Guidelines seem like a good candidate.

C++ Core Guidelines
F.20: For “out” output values, prefer return values to output parameters
Reason A return value is self-documenting, whereas a & could be either in-out or out-only and is liable to be misused.
Returning std::unique_ptr
#include <memory>
std::unique_ptr<int> value_ptr() {  // return by value
    return nullptr;
}
void output_ptr(std::unique_ptr<int>& dst) {  // output parameter
    dst = nullptr;
}
https://godbolt.org/z/ea9M3G94s
value_ptr (return by value)

armv8-a clang 18.1.0          x86-64 gcc 14.2                     x64 msvc v19.40 VS17.10
str xzr, [x8]                 mov QWORD PTR [rdi], 0              sub rsp, 24
ret                           mov rax, rdi                        xor eax, eax
                              ret                                 mov QWORD PTR [rcx], rax
                                                                  mov rax, rcx
                                                                  add rsp, 24
                                                                  ret 0

output_ptr (output parameter)

armv8-a clang 18.1.0          x86-64 gcc 14.2                     x64 msvc v19.40 VS17.10
mov x8, x0                    mov rax, QWORD PTR [rdi]            mov rax, QWORD PTR [rcx]
ldr x0, [x0]                  mov QWORD PTR [rdi], 0              mov QWORD PTR [rcx], 0
str xzr, [x8]                 test rax, rax                       test rax, rax
cbz x0, .LBB1_2               je .L3                              je SHORT $LN34@output_ptr
b operator delete(void*)      mov esi, 4                          mov edx, 4
.LBB1_2:                      mov rdi, rax                        mov rcx, rax
ret                           jmp operator delete(void*)          jmp operator delete(void*)
                              .L3:                                $LN34@output_ptr:
                              ret                                 ret 0
In output_ptr, dst might be non-empty on entry, so the function has to check it and delete the old object.
Returning std::unique_ptr : call site
#include <memory>
// definitions removed to avoid inlining
std::unique_ptr<int> value_ptr();
void output_ptr(std::unique_ptr<int>& dst);
int value_ptr_call() {
    auto ptr = value_ptr();  // return by value
    return *ptr;
}
int output_ptr_call() {
    std::unique_ptr<int> ptr;
    output_ptr(ptr);  // output parameter
    return *ptr;
}
https://godbolt.org/z/G9aPehqM1
value_ptr_call (return by value)

armv8-a clang 18.1.0          x86-64 gcc 14.2                     x64 msvc v19.40 VS17.10
stp x29, x30, [sp, #-32]!     push rbx                            push rbx
str x19, [sp, #16]            sub rsp, 16                         sub rsp, 32
mov x29, sp                   lea rdi, [rsp+8]                    lea rcx, QWORD PTR ptr$[rsp]
add x8, x29, #24              call value_ptr()                    call value_ptr()
bl value_ptr()                mov rdi, QWORD PTR [rsp+8]          mov rcx, QWORD PTR ptr$[rsp]
ldr x0, [x29, #24]            mov esi, 4                          mov edx, 4
ldr w19, [x0]                 mov ebx, DWORD PTR [rdi]            mov ebx, DWORD PTR [rcx]
bl operator delete(void*)     call operator delete(void*)         call operator delete(void*)
mov w0, w19                   add rsp, 16                         mov eax, ebx
ldr x19, [sp, #16]            mov eax, ebx                        add rsp, 32
ldp x29, x30, [sp], #32       pop rbx                             pop rbx
ret                           ret                                 ret 0

output_ptr_call (output parameter)

armv8-a clang 18.1.0:
    stp x29, x30, [sp, #-32]!
    str x19, [sp, #16]
    mov x29, sp
    str xzr, [x29, #24]
    add x0, x29, #24
    bl output_ptr(std::unique_ptr&)
    ldr x0, [x29, #24]
    ldr w19, [x0]
    bl operator delete(void*)
    mov w0, w19
    ldr x19, [sp, #16]
    ldp x29, x30, [sp], #32
    ret
    ldr x8, [x29, #24]
    mov x19, x0
    cbz x8, .LBB1_4
    mov x0, x8
    bl operator delete(void*)
.LBB1_4:
    mov x0, x19
    bl _Unwind_Resume
DW.ref.__gxx_personality_v0:
    .xword __gxx_personality_v0

x86-64 gcc 14.2:
    push rbx
    sub rsp, 16
    mov QWORD PTR [rsp+8], 0
    lea rdi, [rsp+8]
    call output_ptr(std::unique_ptr&)
    mov rdi, QWORD PTR [rsp+8]
    mov esi, 4
    mov ebx, DWORD PTR [rdi]
    call operator delete(void*)
    add rsp, 16
    mov eax, ebx
    pop rbx
    ret
    mov rbx, rax
    jmp .L5

x64 msvc v19.40 VS17.10:
$stateUnwindMap$int output_ptr_call()
    DB 02H
    DB 0aH
    DD imagerel std::unique_ptr::~unique_ptr()
    DB 060H
    push rbx
    sub rsp, 32
    mov QWORD PTR ptr$[rsp], 0
    lea rcx, QWORD PTR ptr$[rsp]
    call output_ptr(std::unique_ptr&)
    mov rcx, QWORD PTR ptr$[rsp]
    mov ebx, DWORD PTR [rcx]
    mov edx, 4
    call operator delete(void*)
    mov eax, ebx
    add rsp, 32
    pop rbx
    ret 0
The extra code after the return in output_ptr_call is stack unwinding on exception: ptr has to be destroyed if output_ptr throws.
value_ptr_call                  output_ptr_call
push rbx                        push rbx
sub rsp, 16                     sub rsp, 16
                                mov QWORD PTR [rsp+8], 0              <- puts 0 into memory
lea rdi, [rsp+8]                lea rdi, [rsp+8]
call value_ptr()                call output_ptr(std::unique_ptr&)
mov rdi, QWORD PTR [rsp+8]      mov rdi, QWORD PTR [rsp+8]
mov esi, 4                      mov esi, 4
mov ebx, DWORD PTR [rdi]        mov ebx, DWORD PTR [rdi]
call operator delete(void*)     call operator delete(void*)
add rsp, 16                     add rsp, 16
mov eax, ebx                    mov eax, ebx
pop rbx                         pop rbx
ret                             ret
In output_ptr_call, ptr is default constructed before the call to output_ptr.
C++ Core Guidelines
ES.20: Always initialize an object
Reason Avoid used-before-set errors and their associated undefined behavior.

Avoid problems with comprehension of complex initialization. Simplify refactoring.
Quick benchmark
https://quick-bench.com/q/mOAHh7zZeagJlCJ63GtWVMPngsw
25% overhead
deferred_construction<std::string> output;
read_strings(in, out(output));
https://www.foonathan.net/2016/10/output-parameter/
Does it solve our problems?
Pros:
● Default constructor before the function call is avoided

Cons (unless we fully trust the user and don’t have exceptions):
● Stack unwind on exception is still necessary
● Extra bool flag is required to know if the object
○ was actually initialized
○ was not initialized more than once
With return by value, the compiler helps instead: GCC and Clang warn (-Wreturn-type) when an execution path in a non-void function doesn’t end with a return statement. We just need to return by value.
Hopefully, you’re convinced that an output parameter is a bad idea.
Now, let’s see how return by value works, specifically for C++ abstractions.
C++ Core Guidelines
F.26: Use a unique_ptr<T> to transfer ownership where a pointer is needed
Reason Using unique_ptr is the cheapest way to pass a pointer safely.

Returning a pointer
#include <memory>
int* raw_ptr() {  // returning raw pointer
    return nullptr;
}
std::unique_ptr<int> smart_ptr() {  // returning smart pointer
    return nullptr;
}
https://godbolt.org/z/ExaoKfT1z
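raw_ptr (returning raw pointer)

armv8-a clang 18.1.0          x86-64 gcc 14.2                     x64 msvc v19.40 VS17.10
mov x0, xzr                   xor eax, eax                        xor eax, eax
ret                           ret                                 ret 0

smart_ptr (returning smart pointer)

armv8-a clang 18.1.0          x86-64 gcc 14.2                     x64 msvc v19.40 VS17.10
str xzr, [x8]                 mov QWORD PTR [rdi], 0              sub rsp, 24
ret                           mov rax, rdi                        xor eax, eax
                              ret                                 mov QWORD PTR [rcx], rax
                                                                  mov rax, rcx
                                                                  add rsp, 24
                                                                  ret 0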
x0, xzr, eax, rsp… are machine registers. They are the fastest storage available on a machine.
A register in square brackets means it stores an address, and we’re accessing memory at that address: [x8], QWORD PTR [rdi].
Quick benchmark
https://quick-bench.com/q/pJ3z9L_Q1M16qob8-sg8lM3-T60
200% overhead
So F.26 needs a caveat: std::unique_ptr may be the cheapest way to pass a pointer safely, but it is not free.
Wrapper over int
struct INT {
int value;
INT(int value = 0) : value{value} {}

INT(INT&& src) : value{src.value} {}
INT& operator=(INT&& src) {
value = src.value;
return *this;
}
INT(INT const& src) : value{src.value} {}
INT& operator=(INT const& src) {
value = src.value;
return *this;
}
~INT() {}
};
Libraries that wrap integers

● Smart pointers similarly wrap raw pointers
● std::chrono
● All other units libraries

● Safe integers
● Bindings for other languages
Wrapper over int
int int_seconds() {
    return 60;
}
INT INT_seconds() {
    return 60;
}
https://godbolt.org/z/Tddd4E6hs
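int_seconds (int)

armv8-a clang 18.1.0          x86-64 gcc 14.2                     x64 msvc v19.40 VS17.10
mov w0, #60                   mov eax, 60                         mov eax, 60
ret                           ret                                 ret 0

INT_seconds (INT)

armv8-a clang 18.1.0          x86-64 gcc 14.2                     x64 msvc v19.40 VS17.10
mov w9, #60                   mov DWORD PTR [rdi], 60             mov QWORD PTR [rcx], 60
str w9, [x8]                  mov rax, rdi                        mov rax, rcx
ret                           ret                                 ret 0

int_seconds (int)

armv7-a clang 11.0.1          x86-64 gcc 14.2 (-m32)              x86 msvc v19.40 VS17.10
mov r0, #60                   mov eax, 60                         mov eax, 60
bx lr                         ret                                 ret 0

INT_seconds (INT)

armv7-a clang 11.0.1          x86-64 gcc 14.2 (-m32)              x86 msvc v19.40 VS17.10
mov r1, #60                   mov eax, DWORD PTR [esp+4]          mov eax, DWORD PTR ___$ReturnUdt$[esp-4]
str r1, [r0]                  mov DWORD PTR [eax], 60             mov DWORD PTR [eax], 60
bx lr                         ret 4                               ret 0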
The problem
Itanium C++ ABI
3.1.3.1 Non-trivial Return Values
If the return type is a class type that is non-trivial for the purposes of calls, the
caller passes an address as an implicit parameter. The callee then constructs the
return value into this address.
C++ reference: Trivial class
A trivial class is a class that
● is trivially copyable, and

● has one or more eligible default constructors such that each is trivial.
Wrapper over int : no custom copy and move
struct INT {
    int value;
    INT(int value = 0) : value{value} {}
    // INT(INT&&) = default;
    // INT& operator=(INT&&) = default;
    // INT(INT const&) = default;
    // INT& operator=(INT const&) = default;
    ~INT() {}
};
https://godbolt.org/z/f6s1P96Tx
The generated code does not change: INT is still returned through memory on every target.
The problem : digging deeper
Itanium C++ ABI
non-trivial for the purposes of calls
A type is considered non-trivial for the purposes of calls if:
● it has a non-trivial copy constructor, move constructor, or destructor, or

● all of its copy and move constructors are deleted.
different from
C++ reference: Trivially copyable class
Also requires trivial copy and move assignment operators.
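The difference between the two definitions can be checked at compile time. The sketch below assumes C++17; `trivial_for_calls_v` is a hypothetical helper (not part of any ABI or library) that only approximates the Itanium wording conservatively, and the three structs are illustrative.

```cpp
#include <type_traits>

// Hypothetical helper: a conservative approximation of
// "trivial for the purposes of calls" built from standard type traits.
template<class T>
inline constexpr bool trivial_for_calls_v =
    std::is_trivially_copy_constructible_v<T> &&
    std::is_trivially_move_constructible_v<T> &&
    std::is_trivially_destructible_v<T>;

struct plain_int {
    int value = 0;
};

struct with_destructor {
    int value = 0;
    ~with_destructor() {}  // user-provided, hence non-trivial
};

struct with_assignment {
    int value = 0;
    with_assignment() = default;
    with_assignment(with_assignment const&) = default;
    // User-provided assignment: not trivially copyable,
    // but irrelevant for parameter passing and return values.
    with_assignment& operator=(with_assignment const& src) {
        value = src.value;
        return *this;
    }
};

static_assert(trivial_for_calls_v<plain_int>);
static_assert(!trivial_for_calls_v<with_destructor>);
static_assert(trivial_for_calls_v<with_assignment>);
static_assert(!std::is_trivially_copyable_v<with_assignment>);
```

A static_assert like this next to a wrapper type is one way to notice when an innocent-looking destructor changes the calling convention.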

Wrapper over int : no custom destructor
struct INT {
    int value;
    INT(int value = 0) : value{value} {}
};
INT INT_seconds() {
    return 60;
}
https://godbolt.org/z/qeq9rEE8T

Target                      INT returned in
armv8-a clang               register (mov w0, #60)
x86-64 gcc                  register (mov eax, 60)
x64 msvc                    memory (mov QWORD PTR [rcx], 60)
armv7-a clang               register
x86 gcc (-m32)              memory (mov DWORD PTR [eax], 60)
x86 msvc                    memory (mov DWORD PTR [eax], 60)
Wrapper over int : no custom constructor
struct INT {
    int value = 0;
};
int int_seconds() {
    return 60;
}
INT INT_seconds() {
    return INT{60};
}
https://godbolt.org/z/c6GbTdjc7

Target                      INT returned in
armv8-a clang               register (mov w0, #60)
x86-64 gcc                  register (mov eax, 60)
x64 msvc                    register (mov eax, 60)
armv7-a clang               register (mov r0, #60)
x86 gcc (-m32)              memory (mov DWORD PTR [eax], 60)
x86 msvc                    register (mov eax, 60)
armv8-a System V
If the argument type is a Composite Type that is larger than 16 bytes, then the argument is copied to memory. … the result is returned in the same registers as would be used for such an argument. Otherwise, … The address… shall be passed … in x8.

x86-64 System V
If a C++ object is non-trivial for the purpose of calls, as specified in the C++ ABI, it is passed by invisible reference… in %rdi… If the class is INTEGER, the next available register of the sequence %rax, %rdx is used.

x86-64 Microsoft
To return a user-defined type by value in RAX, it must have a length of 1, 2, 4, 8, 16, 32, or 64 bits. It must also have no user-defined constructor, destructor, or copy assignment operator… This definition is essentially the same as a C++03 POD type.

armv7-a System V
A Composite Type not larger than 4 bytes is returned in r0… A Composite Type larger than 4 bytes, or whose size cannot be determined statically by both caller and callee, is stored in memory at an address passed as an extra argument.

x86 System V
Some fundamental types and all aggregate types are returned in memory.

x86 Microsoft
Return values are … returned in the EAX register, except for 8-byte structures, which are returned in the EDX:EAX register pair. Larger structures are returned in the EAX register as pointers to hidden return structures… Structures that are not PODs will not be returned in registers.
General purpose registers allocation for function parameters and return values

Architecture    ABI          Composite types returned in registers
armv8-a         System V     ≤ 16 bytes
armv7-a         System V     ≤ 4 bytes
x86-64          System V     ≤ 16 bytes
x86             System V     fundamental only
x86-64          Microsoft    1,2,4,8 bytes, C++03 POD
x86             Microsoft    1,2,4,8 bytes, C++03 POD

Composite types are required to be “trivial” to get into registers!
C++ Core Guidelines
C.20: If you can avoid defining default operations, do
Reason It’s the simplest and gives the cleanest semantics.
Note This is known as “the rule of zero”.

Approved
Surely, this problem
is handled properly
in the popular libraries,
right?
std::chrono
#include <chrono>
#include <cstdint>
#include <type_traits>
int64_t int_seconds() {
    return 60;
}
std::chrono::seconds chrono_seconds() {
    return std::chrono::seconds{60};
}
static_assert(std::is_same_v<int64_t, std::chrono::seconds::rep>);
https://godbolt.org/z/E5e1nGY94

int_seconds() is returned in registers on every target (a register pair on the 32-bit ones). For chrono_seconds():

Target             Returned in    Key instructions                                       Reason
armv8-a clang      register       mov w0, #60
x86-64 gcc         register       mov eax, 60
x64 msvc           memory         mov QWORD PTR [rcx], 60                                not a POD
armv7-a clang      memory         str r2, [r0]; str r1, [r0, #4]                         size > 4
x86 gcc (-m32)     memory         mov DWORD PTR [eax], 60; mov DWORD PTR [eax+4], 0      not fundamental
x86 msvc           memory         mov DWORD PTR [eax], 60; mov DWORD PTR [eax+4], 0      not a POD
Can we do something about it?
● std::chrono would have to give up encapsulation
to be maximally efficient on Windows.
● It cannot use a type smaller than int64_t

just to optimize code on armv7-a.
std::pair and std::tuple
● std::pair copy and move constructors are defaulted according to the C++ standard.
● std::pair is trivially destructible if its elements are trivially destructible, but only since C++17. This is an ABI breakage, but a quick search gave only one complaint.
● Copy and move assignment operators are trivial only on MSVC. This is not a problem for function calls, but it is a problem for std::memcpy and std::bit_cast.
● std::tuple is never trivially move constructible.
https://godbolt.org/z/r7McGEb8o
Can we do something about it?
Don’t use std::pair and especially std::tuple.
Named struct is better for both readability and performance.
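As a minimal sketch of that advice (assuming C++17; `min_max` and `ordered` are illustrative names, not taken from the talk), a named aggregate keeps every special member implicit, so it stays trivial and its two ints fit the register budgets discussed earlier:

```cpp
#include <type_traits>

// Named struct instead of std::pair<int, int>: self-documenting fields,
// and all special member functions are implicit and trivial.
struct min_max {
    int min;
    int max;
};

// Returns the two arguments in ascending order.
constexpr min_max ordered(int a, int b) {
    return a < b ? min_max{a, b} : min_max{b, a};
}

static_assert(std::is_trivially_copyable_v<min_max>);
static_assert(std::is_trivially_destructible_v<min_max>);
```

At the call site, `ordered(5, 3).min` also reads better than `.first`.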

Can we do something about std::unique_ptr?
#include <memory>
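namespace detail {
int* smart_ptr_impl() {
    return nullptr;
}
} // namespace detail
// always_inline is not a standard attribute: GCC and Clang spell it
// [[gnu::always_inline]], MSVC uses __forceinline instead.
[[gnu::always_inline]] inline std::unique_ptr<int> smart_ptr() {
    return std::unique_ptr<int>{detail::smart_ptr_impl()};
}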
Return Value Optimization (copy elision)

C++ reference: copy elision
Since C++17, a prvalue (“pure” rvalue) is not materialized until needed, and then it
is constructed directly into the storage of its final destination.
RVO: how it works

Itanium C++ ABI
… the pointer is passed as if it were the first parameter in the function prototype,
preceding all other parameters, including the this and VTT parameters.
It’s an output parameter done right by the compiler, and only when necessary!
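One way to see that no copy or move happens is to return a type that can be neither copied nor moved. This is a small sketch assuming C++17 or later; `pinned`, `make_pinned` and `demo_pinned` are hypothetical names used only for illustration.

```cpp
// A type with deleted copy and move constructors can still be returned
// by value since C++17: the prvalue is constructed directly in the
// caller's storage, whose address is passed as the hidden parameter.
struct pinned {
    int value;
    constexpr explicit pinned(int v) : value{v} {}
    pinned(pinned const&) = delete;
    pinned(pinned&&) = delete;
};

constexpr pinned make_pinned(int v) {
    return pinned{v};  // no copy, no move
}

constexpr int demo_pinned() {
    pinned p = make_pinned(7);  // initialized in place
    return p.value;
}

static_assert(demo_pinned() == 7);
```

Because all copy and move constructors are deleted, `pinned` is also non-trivial for the purposes of calls, so it always travels through the hidden pointer.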
RVO: inserting a function result into a container
#include <optional>
struct large {
large();
large(large&&);
large& operator=(large&&);
large(large const&);
large& operator=(large const&);
~large();
};
large make_large();
std::optional<large> optional_large() {
return std::optional<large>{make_large()};
}
https://godbolt.org/z/bcdsx7aP4

armv8-a clang 18.1.0:
    stp x29, x30, [sp, #-32]!
    str x19, [sp, #16]
    mov x29, sp
    mov x19, x8
    add x8, x29, #31
    bl make_large()
    add x1, x29, #31
    mov x0, x19
    bl large::large(large&&)
    mov w8, #1
    add x0, x29, #31
    strb w8, [x19, #1]
    bl large::~large()
    ldr x19, [sp, #16]
    ldp x29, x30, [sp], #32
    ret
    mov x19, x0
    add x0, x29, #31
    bl large::~large()
    mov x0, x19
    bl _Unwind_Resume
DW.ref.__gxx_personality_v0:
    .xword __gxx_personality_v0

x86-64 gcc 14.2:
    push rbp
    push rbx
    mov rbx, rdi
    sub rsp, 24
    lea rbp, [rsp+15]
    mov rdi, rbp
    call make_large()
    mov rsi, rbp
    mov rdi, rbx
    call large::large(large&&)
    mov BYTE PTR [rbx+1], 1
    mov rdi, rbp
    call large::~large()
    add rsp, 24
    mov rax, rbx
    pop rbx
    pop rbp
    ret
    mov rbx, rax
    jmp .L2
optional_large() [clone .cold]:
.L2:
    mov rdi, rbp
    call large::~large()
    mov rdi, rbx
    call _Unwind_Resume

x64 msvc v19.40 VS17.10:
$stateUnwindMap$std::optional<large> optional_large()
    DB 02H
    DB 0aH
    DD imagerel large::~large()
    DB 080H
    mov QWORD PTR [rsp+8], rcx
    push rbx
    sub rsp, 48
    mov rbx, rcx
    lea rcx, QWORD PTR $T1[rsp]
    call make_large()
    npad 1
    mov rdx, rax
    mov rcx, rbx
    call large::large(large&&)
    mov BYTE PTR [rbx+1], 1
    lea rcx, QWORD PTR $T1[rsp]
    call large::~large()
    mov rax, rbx
    add rsp, 48
    pop rbx
    ret 0
The problem
https://en.cppreference.com/w/cpp/utility/optional/optional
template < class U = T >
constexpr optional( U&& value );
Binding to U&& materializes the prvalue into a temporary, which is then moved into the storage.
Affects constructors / emplace / insert into all the containers:

● std::optional
● std::variant
● std::vector and all other sequence containers
● std::map and all other associative containers
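The extra move can be made visible by counting it. The sketch below assumes C++17; `tracked`, `make_tracked` and the two `moves_*` functions are hypothetical names. It compares the converting constructor with in-place construction, which only helps when the object can be built from constructor arguments rather than from a function result.

```cpp
#include <optional>
#include <utility>

// Every move construction increments the counter carried by the object.
struct tracked {
    int moves = 0;
    constexpr tracked() = default;
    constexpr tracked(tracked&& src) noexcept : moves{src.moves + 1} {}
};

constexpr tracked make_tracked() {
    return tracked{};
}

// optional(U&&): the function result becomes a temporary,
// which is then moved into the optional's storage.
constexpr int moves_from_function_result() {
    std::optional<tracked> result{make_tracked()};
    return result->moves;
}

// optional(std::in_place_t, Args&&...): constructed directly in the storage.
constexpr int moves_in_place() {
    std::optional<tracked> result{std::in_place};
    return result->moves;
}

static_assert(moves_from_function_result() == 1);
static_assert(moves_in_place() == 0);
```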
There is a solution!
https://quuxplusone.github.io/blog/2018/05/17/super-elider-round-2/
https://akrzemi1.wordpress.com/2018/05/16/rvalues-redefined/
Lazy evaluation with ac::lazy
https://alcash07.github.io/ACTL/actl/functional/lazy.html
template<class Function>
struct lazy {
operator std::invoke_result_t<Function>() {
return function();
}
Function function;
};
template<class Function>
lazy(Function&&) -> lazy<Function>;
std::optional<large> lazy_optional_large() {
return std::optional<large>{lazy{make_large}};
}
https://godbolt.org/z/PYq6KTPKh

lazy_optional_large

armv8-a clang 18.1.0          x86-64 gcc 14.2                     x64 msvc v19.40 VS17.10
stp x29, x30, [sp, #-32]!     push rbx                            push rbx
str x19, [sp, #16]            mov rbx, rdi                        sub rsp, 48
mov x29, sp                   call make_large()                   mov rbx, rcx
mov x19, x8                   mov rax, rbx                        call make_large()
bl make_large()               mov BYTE PTR [rbx+1], 1             mov rax, rbx
mov w8, #1                    pop rbx                             mov BYTE PTR [rbx+1], 1
strb w8, [x19, #1]            ret                                 add rsp, 48
ldr x19, [sp, #16]                                                pop rbx
ldp x29, x30, [sp], #32                                           ret 0
ret
Negative-overhead abstraction!
C++ Core Guidelines

Approved
Valid use cases for output parameters
std::ranges::transform(x, y, z);
std::ranges::sort(x);
Return value cannot be allocated on stack,

for example, because it’s a range with run-time size.
If we decouple memory allocation and data processing,

the code is more reusable.
ac::out and ac::inout
https://alcash07.github.io/ACTL/actl/functional/out_inout.html
template<class InRange, class OutRange, class Function>

void transform(InRange const& src, ac::out<OutRange&> dst, Function f);
template<class Range>
void sort(ac::inout<Range&> range);
template<class Range>
[[nodiscard]] Range sort(Range const& range);
transform(x, ac::out{y}, z);

sort(ac::inout{x});
auto y = sort(x);
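The idea behind such wrappers can be sketched in a few lines. This is not the ACTL implementation: `out_ref`, `add_one` and `demo_out` are hypothetical names, and the sketch (assuming C++17 for class template argument deduction) only shows how an explicit wrapper makes the output argument visible at the call site.

```cpp
// Hypothetical minimal wrapper marking an argument as output-only.
// The explicit constructor forces the caller to spell out out_ref{...}.
template<class T>
struct out_ref {
    T& ref;
    constexpr explicit out_ref(T& target) : ref(target) {}
};

// The signature documents that dst is written, not read.
constexpr void add_one(int src, out_ref<int> dst) {
    dst.ref = src + 1;
}

constexpr int demo_out() {
    int y = 0;
    add_one(41, out_ref{y});  // the output argument is obvious here
    return y;
}

static_assert(demo_out() == 42);
```

Passing a plain `y` would not compile, which is the point: a reader of the call site never has to look up the declaration to find out which argument is modified.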
C++ Core Guidelines
F.16: For “in” parameters, pass cheaply-copied types by value and others by
reference to const
Reason Both let the caller know that a function will not modify the argument, and
both allow initialization by rvalues.
What is “cheap to copy” depends on the machine architecture, but two or three
words (doubles, pointers, references) are usually best passed by value.
int parameter
bool value_is_zero(int x) {  // passing by value
    return x == 0;
}
bool ref_is_zero(int const& x) {  // passing by (const) reference
    return x == 0;
}
https://godbolt.org/z/ofznMvKWc
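value_is_zero (passing by value)

armv8-a clang 18.1.0          x86-64 gcc 14.2                     x64 msvc v19.40 VS17.10
cmp w0, #0                    test edi, edi                       test ecx, ecx
cset w0, eq                   sete al                             sete al
ret                           ret                                 ret 0

ref_is_zero (passing by reference)

armv8-a clang 18.1.0          x86-64 gcc 14.2                     x64 msvc v19.40 VS17.10
ldr w8, [x0]                  mov eax, DWORD PTR [rdi]            cmp DWORD PTR [rcx], 0
cmp w8, #0                    test eax, eax                       sete al
cset w0, eq                   sete al                             ret 0
ret                           ret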
In ref_is_zero, the reference has to be dereferenced first.
int parameter : call site
bool value_is_zero(int x);
bool ref_is_zero(int const& x);
bool value_is_zero_call() {
    return value_is_zero(1);  // passing by value
}
bool ref_is_zero_call() {
    return ref_is_zero(1);  // passing by (const) reference
}
https://godbolt.org/z/fzshqhd85
value_is_zero_call (passing by value)

armv8-a clang 18.1.0          x86-64 gcc 14.2                     x64 msvc v19.40 VS17.10
mov w0, #1                    mov edi, 1                          mov ecx, 1
b value_is_zero(int)          jmp value_is_zero(int)              jmp value_is_zero(int)

ref_is_zero_call (passing by reference)

armv8-a clang 18.1.0          x86-64 gcc 14.2                     x64 msvc v19.40 VS17.10
sub sp, sp, #32               sub rsp, 24                         sub rsp, 40
stp x29, x30, [sp, #16]       lea rdi, [rsp+12]                   lea rcx, QWORD PTR $T1[rsp]
add x29, sp, #16              mov DWORD PTR [rsp+12], 1           mov DWORD PTR $T1[rsp], 1
mov w8, #1                    call ref_is_zero(int const&)        call ref_is_zero(int const&)
sub x0, x29, #4               add rsp, 24                         add rsp, 40
stur w8, [x29, #-4]           ret                                 ret 0
bl ref_is_zero(int const&)
and w0, w0, #0x1
ldp x29, x30, [sp, #16]
add sp, sp, #32
ret

By value, we just put constant 1 into a register and call the function.
By reference, we put constant 1 on the stack and pass its address, and after the function call we restore the stack.
No profiler will guide you here.
Quick benchmark
https://quick-bench.com/q/gVbxyQvoqxN76wfqnWVrFF8kqvQ
200% overhead
int parameter : extra function
void some_extra_function();
bool value_extra_function(int x) {  // passing by value
    int const copy = x;
    some_extra_function();
    return copy == x;
}
bool ref_extra_function(int const& x) {  // passing by (const) reference
    int const copy = x;
    some_extra_function();
    return copy == x;
}
https://godbolt.org/z/4r946xh8T
value_extra_function (passing by value)

armv8-a clang 18.1.0          x86-64 gcc 14.2                     x64 msvc v19.40 VS17.10
stp x29, x30, [sp, #-16]!     sub rsp, 8                          sub rsp, 40
mov x29, sp                   call some_extra_function()          call some_extra_function()
bl some_extra_function()      mov eax, 1                          mov al, 1
mov w0, #1                    add rsp, 8                          add rsp, 40
ldp x29, x30, [sp], #16       ret                                 ret 0
ret

ref_extra_function (passing by reference)

armv8-a clang 18.1.0          x86-64 gcc 14.2                     x64 msvc v19.40 VS17.10
stp x29, x30, [sp, #-32]!     push rbp                            mov QWORD PTR [rsp+8], rbx
stp x20, x19, [sp, #16]       push rbx                            push rdi
mov x29, sp                   mov rbx, rdi                        sub rsp, 32
ldr w20, [x0]                 sub rsp, 8                          mov ebx, DWORD PTR [rcx]
mov x19, x0                   mov ebp, DWORD PTR [rdi]            mov rdi, rcx
bl some_extra_function()      call some_extra_function()          call some_extra_function()
ldr w8, [x19]                 cmp DWORD PTR [rbx], ebp            cmp ebx, DWORD PTR [rdi]
cmp w20, w8                   sete al                             mov rbx, QWORD PTR [rsp+48]
cset w0, eq                   add rsp, 8                          sete al
ldp x20, x19, [sp, #16]       pop rbx                             add rsp, 32
ldp x29, x30, [sp], #32       pop rbp                             pop rdi
ret                           ret                                 ret 0
some_extra_function() can change the referenced value. This is why ref_extra_function has to keep the copy in a callee-saved register and reload x after the call, while value_extra_function simply returns true.
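The aliasing effect is easy to reproduce without a separate translation unit. In this sketch (hypothetical names, C++14 or later), the second parameter plays the role of whatever some_extra_function() might modify behind the scenes:

```cpp
// By value: x is a private copy, so modifying `other` cannot affect it.
constexpr bool value_unchanged(int x, int& other) {
    int const copy = x;
    other += 1;
    return copy == x;
}

// By const reference: x may alias `other`, so it can change under us
// even though we are not allowed to modify it through x.
constexpr bool ref_unchanged(int const& x, int& other) {
    int const copy = x;
    other += 1;
    return copy == x;
}

constexpr bool demo_by_value() {
    int n = 1;
    return value_unchanged(n, n);
}

constexpr bool demo_by_reference() {
    int n = 1;
    return ref_unchanged(n, n);
}

static_assert(demo_by_value());
static_assert(!demo_by_reference());
```

Since the compiler cannot rule out this kind of aliasing for an opaque call, it has to reload the referenced value.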
Perfect forwarding
“In C++, perfect forwarding is the act of passing a function’s parameters to another
function while preserving its reference category.” link
The main purpose is to replace copies with moves when possible.
template<class T, class... Args>

std::unique_ptr<T> make_unique(Args&&... args) {
return std::unique_ptr<T>(
new T(std::forward<Args>(args)...));
}
Perfect forwarding is not perfect!
● It breaks RVO.
● A forwarding reference is still a reference, so it prevents passing in registers.
Hopefully, you’re convinced that
built-in types should be passed by value.
Now, let’s see which C++ abstractions
should also be passed by value.
Chandler Carruth: There Are No Zero-Cost Abstractions
The problem
Itanium C++ ABI
3.1.2.3 Non-Trivial Parameters
If a parameter type is a class type that is non-trivial for the purposes of calls, the
caller must allocate space for a temporary and pass that temporary by reference.
For such types, passing by reference is likely more efficient,

because it avoids making an extra copy on the stack
(unless you need that copy anyway).
C++ Core Guidelines
F.24: Use a span<T> or a span_p<T> to designate a half-open sequence
Reason Informal/non-explicit ranges are a source of errors.
“use span” + “use a span”: 16 occurrences

C++20 std::span vs raw pointer and size
#include <span>
int raw_back(int const* ptr, size_t size) {

return ptr[size - 1];
}
int span_back(std::span<int const> span) {

return span[span.size() - 1];
}
https://godbolt.org/z/bez7c5PMK
raw_back (raw pointer and size), 64-bit

armv8-a clang 18.1.0          x86-64 gcc 14.2                     x64 msvc v19.40 VS17.10
add x8, x0, x1, lsl #2        mov eax, DWORD PTR [rdi-4+rsi*4]    mov eax, DWORD PTR [rcx+rdx*4-4]
ldur w0, [x8, #-4]            ret                                 ret 0
ret

span_back (std::span), 64-bit

armv8-a clang 18.1.0          x86-64 gcc 14.2                     x64 msvc v19.40 VS17.10
add x8, x0, x1, lsl #2        mov eax, DWORD PTR [rdi-4+rsi*4]    mov rdx, QWORD PTR [rcx+8]
ldur w0, [x8, #-4]            ret                                 mov rax, QWORD PTR [rcx]
ret                                                               mov eax, DWORD PTR [rax+rdx*4-4]
                                                                  ret 0

raw_back (raw pointer and size), 32-bit

armv7-a clang                 x86 gcc (-m32)                      x86 msvc
add r0, r0, r1, lsl #2        mov eax, DWORD PTR [esp+4]          mov ecx, DWORD PTR _size$[esp-4]
ldr r0, [r0, #-4]             mov edx, DWORD PTR [esp+8]          mov eax, DWORD PTR _ptr$[esp-4]
bx lr                         mov eax, DWORD PTR [eax-4+edx*4]    mov eax, DWORD PTR [eax+ecx*4-4]
                              ret                                 ret 0

span_back (std::span), 32-bit

armv7-a clang                 x86 gcc (-m32)                      x86 msvc
add r0, r0, r1, lsl #2        mov eax, DWORD PTR [esp+8]          mov ecx, DWORD PTR _span$[esp]
ldr r0, [r0, #-4]             lea eax, [-4+eax*4]                 mov eax, DWORD PTR _span$[esp-4]
bx lr                         add eax, DWORD PTR [esp+4]          mov eax, DWORD PTR [eax+ecx*4-4]
                              mov eax, DWORD PTR [eax]            ret 0
                              ret

? (x86 gcc emits one more instruction for the span version)
C++23 std::mdspan vs raw pointer and sizes
#include <cstddef>
int raw_back2(int const* ptr, size_t width, size_t height) {

return ptr[width * height - 1];
}
struct mdspan2 {
int const* ptr;
size_t width;
size_t height;
};
int mdspan_back2(mdspan2 span) {

return span.ptr[span.width * span.height - 1];
}
https://godbolt.org/z/EcfanMoYf

raw_back2 (raw pointer and sizes), 64-bit

armv8-a clang 18.1.0          x86-64 gcc 14.2                     x64 msvc v19.40 VS17.10
mul x8, x2, x1                imul rsi, rdx                       imul rdx, r8
add x8, x0, x8, lsl #2        mov eax, DWORD PTR [rdi-4+rsi*4]    mov eax, DWORD PTR [rcx+rdx*4-4]
ldur w0, [x8, #-4]            ret                                 ret 0
ret

mdspan_back2 (mdspan2), 64-bit

armv8-a clang 18.1.0          x86-64 gcc 14.2                     x64 msvc v19.40 VS17.10
ldp x8, x9, [x0, #8]          mov rax, QWORD PTR [rsp+16]         mov rdx, QWORD PTR [rcx+16]
mul x8, x9, x8                imul rax, QWORD PTR [rsp+24]        imul rdx, QWORD PTR [rcx+8]
ldr x9, [x0]                  mov rdx, QWORD PTR [rsp+8]          mov rax, QWORD PTR [rcx]
add x8, x9, x8, lsl #2        mov eax, DWORD PTR [rdx-4+rax*4]    mov eax, DWORD PTR [rax+rdx*4-4]
ldur w0, [x8, #-4]            ret                                 ret 0
ret

raw_back2 (raw pointer and sizes), 32-bit

armv7-a clang                 x86 gcc (-m32)                      x86 msvc
mul r3, r2, r1                mov eax, DWORD PTR [esp+12]         mov ecx, DWORD PTR _width$[esp-4]
add r0, r0, r3, lsl #2        imul eax, DWORD PTR [esp+8]         imul ecx, DWORD PTR _height$[esp-4]
ldr r0, [r0, #-4]             mov edx, DWORD PTR [esp+4]          mov eax, DWORD PTR _ptr$[esp-4]
bx lr                         mov eax, DWORD PTR [edx-4+eax*4]    mov eax, DWORD PTR [eax+ecx*4-4]
                              ret                                 ret 0

mdspan_back2 (mdspan2), 32-bit

armv7-a clang                 x86 gcc (-m32)                      x86 msvc
mul r3, r1, r2                mov eax, DWORD PTR [esp+8]          mov ecx, DWORD PTR _span$[esp+4]
add r0, r0, r3, lsl #2        imul eax, DWORD PTR [esp+12]        imul ecx, DWORD PTR _span$[esp]
ldr r0, [r0, #-4]             mov edx, DWORD PTR [esp+4]          mov eax, DWORD PTR _span$[esp-4]
bx lr                         mov eax, DWORD PTR [edx-4+eax*4]    mov eax, DWORD PTR [eax+ecx*4-4]
                              ret                                 ret 0
armv8-a System V
If the argument type is a Composite Type that is larger than 16 bytes, then the argument is copied to memory allocated by the caller and the argument is replaced by a pointer to the copy.

x86-64 System V
If the class is MEMORY, pass the argument on the stack… If the size of the aggregate exceeds two eightbytes and the first eightbyte isn’t SSE or any other eightbyte isn’t SSEUP, the whole argument is passed in memory.

x86-64 Microsoft
Any argument that doesn't fit in 8 bytes, or isn't 1, 2, 4, or 8 bytes, must be passed by reference. A single argument is never spread across multiple registers.

armv7-a System V
When a Composite Type argument is assigned to core registers (either fully or partially), the behavior is as if the argument had been stored to memory at a word-aligned (4-byte) address and then loaded into consecutive registers using a suitable load-multiple instruction.

x86 System V
Most parameters are passed on the stack.
- The first three parameters of type __m64 are passed in %mm0, %mm1, and %mm2…

x86 Microsoft
Parameters are pushed onto the stack from right to left.
__fastcall: Classes, structs, and unions: Treated as "multibyte" types (regardless of size) and passed on the stack.
Architecture   | ABI       | Composite types returned in registers | Composite types passed in registers
armv8-a        | System V  | ≤ 16 bytes                            | ≤ 16 bytes
armv7-a        | System V  | ≤ 4 bytes                             | ≤ 16 bytes
x86-64         | System V  | ≤ 16 bytes                            | ≤ 16 bytes
x86            | System V  | fundamental only                      | SIMD only
x86-64         | Microsoft | 1,2,4,8 bytes, C++03 POD              | 1,2,4,8 bytes
x86            | Microsoft | 1,2,4,8 bytes, C++03 POD              | not even fundamental
x86 __fastcall | Microsoft | 1,2,4,8 bytes, C++03 POD              | fundamental only

117
C++ Core Guidelines
F.24: Use a span<T> or a span_p<T> to designate a half-open sequence
Reason Informal/non-explicit ranges are a source of errors.
Not free
118
Empty parameter : use cases
● Predicates and transform functions passed to STL algorithms:
    std::ranges::find_if(range, predicate);
    std::ranges::transform(input_range, output, unary_op);
● Tag dispatch (somewhat obsolete after C++20 concepts)
    template <class InputIter, class Diff = iter_difference_t<InputIter>>
    void advance(InputIter& iter, Diff n, input_iterator_tag) {
        for (; n > 0; --n)
            ++iter;
    }
    template <class RandIter, class Diff = iter_difference_t<RandIter>>
    void advance(RandIter& iter, Diff n, random_access_iterator_tag) {
        iter += n;
    }
● Access token to make some API available only inside the library
(like the default “package private” access modifier in Java)
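The access-token use case can be sketched as follows. This is only an illustration under assumptions: the names lib, widget, internal_key and scale_impl are hypothetical and do not come from the slides.

```cpp
#include <type_traits>

namespace lib {

class widget;

// Empty access token. Only lib::widget can create one. The constructor is
// user-provided (not "= default") so the class is not an aggregate and cannot
// be brace-initialized by outside code in pre-C++20 language modes.
class internal_key {
    friend class widget;
    internal_key() {}
};

static_assert(std::is_empty<internal_key>::value, "the token carries no data");

// Visible to everyone, callable only by code that can obtain an internal_key.
inline int scale_impl(internal_key, int value) { return value * 2; }

class widget {
public:
    // widget is a friend of internal_key, so it can mint the token.
    int scaled(int value) const { return scale_impl(internal_key{}, value); }
};

}  // namespace lib
```

The token is an empty parameter, so whether it costs anything at a non-inlined call depends on the target ABI.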
119
Empty parameter : tag dispatch
int raw_rand();
struct mt19937 {};

int tagged_rand(mt19937);
int raw_rand_call() {
    return raw_rand();
}
int tagged_rand_call() {
    return tagged_rand(mt19937{});
}
https://godbolt.org/z/vxG4eY1r4

RAW, 64-bit
armv8-a                    | x86-64 System V                     | x86-64 Microsoft
b raw_rand()               | jmp raw_rand()                      | jmp raw_rand()

TAG, 64-bit
armv8-a                    | x86-64 System V                     | x86-64 Microsoft
b tagged_rand(mt19937)     | jmp tagged_rand(mt19937)            | xor ecx, ecx
                           |                                     | jmp tagged_rand(mt19937)

RAW, 32-bit
armv7-a                    | x86 System V                        | x86 Microsoft
b raw_rand()               | jmp raw_rand()                      | jmp raw_rand()

TAG, 32-bit
armv7-a                    | x86 System V                        | x86 Microsoft
b tagged_rand(mt19937)     | sub esp, 24                         | push ecx
                           | push 0                              | mov BYTE PTR $T1[esp+4], 0
                           | call tagged_rand(mt19937)           | push DWORD PTR $T1[esp+4]
                           | add esp, 28                         | call tagged_rand(mt19937)
                           | ret                                 | add esp, 8
                           |                                     | ret 0
121
Empty parameter
Itanium C++ ABI
2.2 POD Data Types
If the base ABI does not specify rules for empty classes, then an empty class has
size and alignment 1.
3.1.2.6 Empty Parameters
Arguments of empty class types that are not non-trivial for the purposes of calls
are passed no differently from ordinary classes.
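A small compile-time check of what the quoted rules imply, assuming one of the mainstream ABIs discussed in this talk. The type name empty_tag and the function takes_tag are hypothetical.

```cpp
#include <type_traits>

// Hypothetical empty tag type, similar in spirit to the mt19937 tag above.
struct empty_tag {};

// No non-static data members...
static_assert(std::is_empty<empty_tag>::value, "tag has no data");
// ...but it still occupies one byte with alignment 1 on these ABIs,
// so it counts as a real argument in the calling convention.
static_assert(sizeof(empty_tag) == 1, "empty class has size 1");
static_assert(alignof(empty_tag) == 1, "empty class has alignment 1");
// Trivial for the purposes of calls: no copy constructor or destructor to run.
static_assert(std::is_trivially_copyable<empty_tag>::value, "tag is trivial");

// The tag parameter is unnamed and unused; only its type matters.
inline int takes_tag(empty_tag) { return 1; }
```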
122
C++ Core Guidelines
reference to const
Approved*
124
Class member functions
Itanium C++ ABI
3.1.2.1 this Parameters
Non-static member functions, including constructors and destructors, take an
implicit this parameter of pointer type. It is passed as if it were the first
parameter in the function prototype…
This isn’t efficient if the class is small enough to be passed by value.
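A minimal sketch of the difference, assuming a hypothetical 4-byte wrapper type meters. Both functions compute the same thing; only the way the object reaches a non-inlined callee differs.

```cpp
// Hypothetical small wrapper type (not from the slides).
struct meters {
    int value;

    // Member function: the object is reached through the implicit `this`
    // pointer, so a non-inlined call needs the object to have an address.
    int doubled() const { return value * 2; }
};

// Free function taking the object by value: on ABIs that pass small
// composite types in registers, no memory round trip is required.
inline int doubled(meters m) { return m.value * 2; }
```

On ABIs that never pass composite types in registers (see the table above) the by-value form does not have this advantage.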

125
Effect on empty function objects
template<class T>
struct plus {
    constexpr T operator()(T const& lhs, T const& rhs) const {
        return lhs + rhs;
    }
};
Simple function objects like std::plus above would most likely be inlined,
but more complex empty function objects would introduce overhead if not inlined.
Luckily, C++23 introduces static operator() and static operator[].
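A hedged sketch of what a static call operator looks like. The name plus_fn is hypothetical, and the feature-test macro guard is only there so the snippet also compiles in pre-C++23 modes, where it falls back to the ordinary member form.

```cpp
template <class T>
struct plus_fn {
#if defined(__cpp_static_call_operator)
    // C++23: a static call operator has no implicit `this` parameter,
    // so nothing has to be passed for the empty function object.
    static constexpr T operator()(T const& lhs, T const& rhs) {
        return lhs + rhs;
    }
#else
    // Fallback for older language modes: ordinary const member operator().
    constexpr T operator()(T const& lhs, T const& rhs) const {
        return lhs + rhs;
    }
#endif
};

// Call sites look the same in both cases.
template <class F>
int fold3(F f, int a, int b, int c) {
    return f(f(a, b), c);
}
```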

128
Chain of function calls
int sum(int x1, int x2);
int sum_12_3(int x1, int x2, int x3) {
    return sum(sum(x1, x2), x3);
}
https://godbolt.org/z/MsjeT8TTK
130
sum_12_3: 9 instructions (armv8-a) | 7 instructions (x86-64 System V) | 9 instructions (x86-64 Microsoft)

131
x86-64 System V
sum_12_3                   | sum_13_2
push rbx                   | push rbx
mov ebx, edx               | mov ebx, esi
                           | mov esi, edx
call sum(int, int)         | call sum(int, int)
mov esi, ebx               | mov esi, ebx
pop rbx                    | pop rbx
mov edi, eax               | mov edi, eax
jmp sum(int, int)          | jmp sum(int, int)
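The source of sum_13_2 is not in the extracted text. The sketch below is inferred from the assembly listing, where the inner call receives x1 and x3 and the outer call receives x2, so treat it as an assumption. sum is given a body here only so that the snippet links.

```cpp
// Declared without a body in the slides; defined here for completeness.
int sum(int x1, int x2) { return x1 + x2; }

// Inner call uses (x1, x2): both are already in the first two argument
// registers, so only x3 has to be saved across the call.
int sum_12_3(int x1, int x2, int x3) {
    return sum(sum(x1, x2), x3);
}

// Assumed shape of sum_13_2: the inner call uses (x1, x3), so x3 must first
// be moved into the second argument register, costing one extra instruction.
int sum_13_2(int x1, int x2, int x3) {
    return sum(sum(x1, x3), x2);
}
```

Both functions return the same value; they differ only in how well the argument order matches the order expected by the callee.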
132
Order of parameters is fixed in every ABI

swap is required (3 moves)
133

134
Eduardo Madrid: about the overhead of std::function
136
Knowledge needed
● this parameter passing, because std::function is a function object
● consistent parameter order
● enhanced “perfect forwarding”, which preserves passing in registers
137
C++ Core Guidelines
I.13: Do not pass an array as a single pointer
Example Consider:
void copy_n(const T* p, T* q, int n); // copy from [p:p+n) to [q:q+n)
What if there are fewer than n elements in the array pointed to by q? Then, we
overwrite some probably unrelated memory. What if there are fewer than n
elements in the array pointed to by p? Then, we read some probably unrelated
memory. Either is undefined behavior and a potentially very nasty bug.
Alternative Consider using explicit spans:

void copy(span<const T> r, span<T> r2); // copy r to r2
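A rough sketch of what a span-style copy has to do. The byte_view and const_byte_view structs are hypothetical stand-ins for std::span, used only to keep the snippet within C++17. Returning false on a size mismatch is an assumption of this sketch, not something the guideline prescribes.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-ins for std::span<std::byte> and std::span<std::byte const>.
struct byte_view {
    std::byte* data;
    std::size_t size;
};
struct const_byte_view {
    std::byte const* data;
    std::size_t size;
};

// Each view carries its own size, so the callee can validate the request.
// The price is that two sizes travel through the call instead of one.
inline bool checked_copy(byte_view dst, const_byte_view src) {
    if (dst.size != src.size) {
        return false;  // refuse instead of reading or writing out of bounds
    }
    if (src.size != 0) {
        std::memcpy(dst.data, src.data, src.size);
    }
    return true;
}
```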
138
Copy of a byte span
#define NDEBUG
#include <cassert>
#include <cstddef>
#include <cstring>
void raw_copy(std::byte* dst, std::byte const* src, size_t size) {
    std::memcpy(dst, src, size);
}
void checked_copy( // imagine 2 std::spans here
    std::byte* dst, std::byte const* src, size_t dst_size, size_t src_size
) {
    assert(src_size == dst_size);
    std::memcpy(dst, src, dst_size);
}
https://godbolt.org/z/3Tqs849eh
RAW
armv8-a                    | x86-64 System V                     | x86-64 Microsoft
b memcpy                   | jmp memcpy                          | jmp memcpy

CHECKED
armv8-a                    | x86-64 System V                     | x86-64 Microsoft
b memcpy                   | jmp memcpy                          | jmp memcpy
140
Copy of a byte span : call site
#include <array>
#include <cstddef>
void raw_copy(std::byte* dst, std::byte const* src, size_t size);

void checked_copy(
std::byte* dst, std::byte const* src, size_t dst_size, size_t src_size
);
std::array<std::byte, 8> arr;
void raw_copy_call() {
    raw_copy(arr.data(), arr.data(), 8);
}
void checked_copy_call() {
    checked_copy(arr.data(), arr.data(), 8, 8);
}
https://godbolt.org/z/7M45xz9ha
RAW
armv8-a                    | x86-64 System V                     | x86-64 Microsoft
adrp x0, arr               | mov esi, OFFSET FLAT:arr            | mov r8d, 8
add x0, x0, :lo12:arr      | mov edx, 8                          | lea rdx, OFFSET FLAT:arr
mov w2, #8                 | mov rdi, rsi                        | lea rcx, OFFSET FLAT:arr
mov x1, x0                 | jmp raw_copy                        | jmp raw_copy
b raw_copy                 |                                     |

CHECKED
armv8-a                    | x86-64 System V                     | x86-64 Microsoft
adrp x0, arr               | mov esi, OFFSET FLAT:arr            | mov r9d, 8
add x0, x0, :lo12:arr      | mov ecx, 8                          | lea rdx, OFFSET FLAT:arr
mov w2, #8                 | mov edx, 8                          | mov r8d, r9d
mov x1, x0                 | mov rdi, rsi                        | lea rcx, OFFSET FLAT:arr
mov w3, #8                 | jmp checked_copy                    | jmp checked_copy
b checked_copy             |                                     |

Size is passed twice!
143
C++ Core Guidelines
I.13: Do not pass an array as a single pointer
Example Consider:
void copy_n(const T* p, T* q, int n); // copy from [p:p+n) to [q:q+n)
Alternative Consider using explicit spans:

void copy(span<const T> r, span<T> r2); // copy r to r2
Not free
144
C++ Core Guidelines
I.23: Keep the number of function arguments low
Reason Having many arguments opens opportunities for confusion. Passing lots
of arguments is often costly compared to alternatives.
Discussion The two most common reasons why functions have too many
parameters are:
1. Missing an abstraction. …
2. Violating “one function, one responsibility.” …
145
Triple product (wiki)
Geometrically, the scalar triple product is the (signed) volume of the
parallelepiped defined by the three vectors.
The scalar triple product is unchanged under a circular shift of its three
operands (a, b, c):
a · (b × c) = b · (c × a) = c · (a × b)
Swapping the positions of the operators without re-ordering the operands
leaves the triple product unchanged:
a · (b × c) = (a × b) · c
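The two identities can be checked on concrete numbers. This is a small worked example; the vec3 type and the function names are hypothetical and are kept separate from the vector3 API used on the following slides.

```cpp
// Hypothetical helper type for the worked example.
struct vec3 {
    int x, y, z;
};

constexpr int dot(vec3 a, vec3 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

constexpr vec3 cross(vec3 a, vec3 b) {
    return {a.y * b.z - a.z * b.y,
            a.z * b.x - a.x * b.z,
            a.x * b.y - a.y * b.x};
}

// Scalar triple product a · (b × c).
constexpr int triple(vec3 a, vec3 b, vec3 c) {
    return dot(a, cross(b, c));
}

constexpr vec3 a{1, 2, 3}, b{4, 5, 6}, c{7, 8, 10};

// b × c = (2, 2, -3), so a · (b × c) = 2 + 4 - 9 = -3.
static_assert(triple(a, b, c) == -3, "direct evaluation");
// Circular shift of the operands leaves the value unchanged.
static_assert(triple(b, c, a) == -3, "b . (c x a)");
static_assert(triple(c, a, b) == -3, "c . (a x b)");
// Swapping the operators: (a × b) · c.
static_assert(dot(cross(a, b), c) == -3, "(a x b) . c");
```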
146
Triple product : all int
struct vector3 {
    int x, y, z;
};
int dot_product(int ax, int ay, int az, int bx, int by, int bz);
vector3 cross_product(int ax, int ay, int az, int bx, int by, int bz);
int triple_product(
    int ax, int ay, int az,
    int bx, int by, int bz,
    int cx, int cy, int cz
) {
    vector3 d = cross_product(ax, ay, az, bx, by, bz);
    return dot_product(d.x, d.y, d.z, cx, cy, cz);
}
147
Triple product : vector3
struct vector3 {
    int x, y, z;
};
int vector_dot_product(vector3 const& a, vector3 const& b);
vector3 vector_cross_product(vector3 const& a, vector3 const& b);
int vector_triple_product(
    vector3 const& a,
    vector3 const& b,
    vector3 const& c
) {
    return vector_dot_product(vector_cross_product(a, b), c);
}
https://godbolt.org/z/PfdcEjo1h
VECTOR
armv8-a                    | x86-64 System V                     | x86-64 Microsoft
sub sp, sp, #48            | push rbx                            | push rbx
stp x29, x30, [sp, #16]    | mov rbx, rdx                        | sub rsp, 80
str x19, [sp, #32]         | sub rsp, 16                         | mov rax, QWORD PTR __security_cookie
add x29, sp, #16           | call vector_cross_product           | xor rax, rsp
mov x19, x2                | lea rdi, [rsp+4]                    | mov QWORD PTR __$ArrayPad$[rsp], rax
bl vector_cross_product    | mov rsi, rbx                        | mov rbx, r8
str x0, [sp]               | mov QWORD PTR [rsp+4], rax          | mov r8, rdx
mov x0, sp                 | mov DWORD PTR [rsp+12], edx         | mov rdx, rcx
str w1, [sp, #8]           | call vector_dot_product             | lea rcx, QWORD PTR $T1[rsp]
mov x1, x19                | add rsp, 16                         | call vector_cross_product
bl vector_dot_product      | pop rbx                             | mov rdx, rbx
ldp x29, x30, [sp, #16]    | ret                                 | lea rcx, QWORD PTR $T2[rsp]
ldr x19, [sp, #32]         |                                     | movsd xmm0, QWORD PTR [rax]
add sp, sp, #48            |                                     | movsd QWORD PTR $T2[rsp], xmm0
ret                        |                                     | mov eax, DWORD PTR [rax+8]
                           |                                     | mov DWORD PTR $T2[rsp+8], eax
                           |                                     | call vector_dot_product
                           |                                     | mov rcx, QWORD PTR __$ArrayPad$[rsp]
                           |                                     | xor rcx, rsp
                           |                                     | call __security_check_cookie
                           |                                     | add rsp, 80
                           |                                     | pop rbx
                           |                                     | ret 0

Keep an eye out for buffer security checks, by Nicholas Frechette

VECTOR with __declspec(safebuffers): the armv8-a and x86-64 System V columns
are unchanged, the x86-64 Microsoft column becomes:
push rbx
sub rsp, 64
mov rbx, r8
mov r8, rdx
mov rdx, rcx
lea rcx, QWORD PTR $T2[rsp]
call vector_cross_product
mov rdx, rbx
lea rcx, QWORD PTR $T1[rsp]
movsd xmm0, QWORD PTR [rax]
movsd QWORD PTR $T1[rsp], xmm0
mov eax, DWORD PTR [rax+8]
mov DWORD PTR $T1[rsp+8], eax
call vector_dot_product
add rsp, 64
pop rbx
ret 0
RAW
armv8-a                    | x86-64 System V                     | x86-64 Microsoft
stp x29, x30, [sp, #-48]!  | push r12                            | sub rsp, 104
str x21, [sp, #16]         | push rbp                            | mov eax, DWORD PTR z2$[rsp]
stp x20, x19, [sp, #32]    | push rbx                            | mov DWORD PTR [rsp+48], eax
mov x29, sp                | sub rsp, 16                         | mov eax, DWORD PTR y2$[rsp]
ldr w21, [x29, #48]        | mov ebx, DWORD PTR [rsp+48]         | mov DWORD PTR [rsp+40], eax
mov w19, w7                | mov ebp, DWORD PTR [rsp+56]         | mov DWORD PTR [rsp+32], r9d
mov w20, w6                | mov r12d, DWORD PTR [rsp+64]        | mov r9d, r8d
bl cross_product           | call cross_product                  | mov r8d, edx
lsr x8, x0, #32            | add rsp, 16                         | mov edx, ecx
mov w2, w1                 | mov r8d, ebp                        | lea rcx, QWORD PTR $T1[rsp]
mov w3, w20                | mov rcx, rax                        | call cross_product
mov w4, w19                | mov r9d, r12d                       | mov r9d, DWORD PTR x3$[rsp]
mov w1, w8                 | mov edi, eax                        | movsd xmm0, QWORD PTR [rax]
mov w5, w21                | shr rcx, 32                         | mov r8d, DWORD PTR [rax+8]
ldp x20, x19, [sp, #32]    | mov esi, ecx                        | mov eax, DWORD PTR z3$[rsp]
ldr x21, [sp, #16]         | mov ecx, ebx                        | mov DWORD PTR z2$[rsp], eax
ldp x29, x30, [sp], #48    | pop rbx                             | mov eax, DWORD PTR y3$[rsp]
b dot_product              | pop rbp                             | movsd QWORD PTR v4$[rsp], xmm0
                           | pop r12                             | mov rcx, QWORD PTR v4$[rsp]
                           | jmp dot_product                     | mov rdx, rcx
                           |                                     | mov DWORD PTR y2$[rsp], eax
                           |                                     | shr rdx, 32
                           |                                     | add rsp, 104
                           |                                     | jmp dot_product

A lot of moving!
Stack pointer is heavily used

armv8-a (System V):
The first eight registers, r0-r7, are used to pass argument values into a
subroutine and to return result values from a function.

x86-64 (System V):
If the class is INTEGER, the next available register of the sequence %rdi,
%rsi, %rdx, %rcx, %r8 and %r9 is used.

x86-64 (Microsoft):
By default, the x64 calling convention passes the first four arguments to a
function in registers.

armv7-a (System V):
The first four registers r0-r3 (a1-a4) are used to pass argument values into a
subroutine and to return a result value from a function.

x86 (System V):
Most parameters are passed on the stack.
- The first three parameters of type __m64 are passed in %mm0, %mm1, and %mm2…

x86 (Microsoft):
Parameters are pushed onto the stack from right to left.
__fastcall: The first two DWORD or smaller arguments that are found in the
argument list from left to right are passed in ECX and EDX registers.
154
Architecture   | ABI       | Composite types returned in registers | Composite types passed in registers | Number of registers for parameters + return
armv8-a        | System V  | ≤ 16 bytes                            | ≤ 16 bytes                          | 8 total
armv7-a        | System V  | ≤ 4 bytes                             | ≤ 16 bytes                          | 4 total
x86-64         | System V  | ≤ 16 bytes                            | ≤ 16 bytes                          | 6+2
x86            | System V  | fundamental only                      | SIMD only                           | 0+2
x86-64         | Microsoft | 1,2,4,8 bytes, C++03 POD              | 1,2,4,8 bytes                       | 4+1
x86            | Microsoft | 1,2,4,8 bytes, C++03 POD              | not even fundamental                | 0+2
x86 __fastcall | Microsoft | 1,2,4,8 bytes, C++03 POD              | fundamental only                    | 2+2

155
C++ Core Guidelines
I.23: Keep the number of function arguments low
Reason Having many arguments opens opportunities for confusion. Passing lots
of arguments is often costly compared to alternatives.
Approved*
156
Conclusions
● Compilers do unexpected things to your code,
because they have to follow all the specifications
● Compiler Explorer is your friend
https://godbolt.org/
● C++ Core Guidelines are pretty reasonable
from a performance point of view
157
Most important guidelines to avoid function call overhead

● Return by value
● Pass “trivial” types by value, others by reference
● Follow the Rule of 0 (or at least support trivial copy)
● Make APIs consistent
● Understand the cost of abstractions on your target platform
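One way to keep the "trivial copy" guideline from regressing is to assert it at compile time. The sketch below uses hypothetical types celsius and logged_celsius. It only checks the type properties; whether the object then travels in registers still depends on the target ABI.

```cpp
#include <type_traits>

// Rule of 0: no user-declared copy operations or destructor, so the wrapper
// stays trivially copyable and can be treated like a plain int by ABIs that
// pass small composite types in registers.
struct celsius {
    int value;
};

// Counter-example: a user-provided destructor, even an empty one, makes the
// type non-trivial for the purposes of calls.
struct logged_celsius {
    int value;
    ~logged_celsius() {}
};

static_assert(std::is_trivially_copyable<celsius>::value,
              "celsius must stay trivially copyable");
static_assert(!std::is_trivially_copyable<logged_celsius>::value,
              "a user-provided destructor breaks trivial copyability");
static_assert(sizeof(celsius) == sizeof(int), "wrapper adds no storage");

// Small trivial type: pass and return by value.
inline celsius warmer(celsius c) { return celsius{c.value + 1}; }
```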
Thank you for your attention!
