Hidden Overhead of a Function API


2

What we do at Snap with C++

● Neural style transfer
● Full body tracking
● Ray tracing
● Face tracking
● Cloth simulation
● Wrist tracking
3
Thank you, Serhii Huralnik and Eduardo Madrid!!
Section 0. Introduction
Section 1. Return value
Section 2. Parameter passing
Section 3. Multiple parameters
8
Tony Van Eerd: “people are not writing enough functions”
When people finally start
writing more functions,
we’d prefer to get only the
well designed ones!
When talking about performance, we
typically think about the function logic.
We’ll see that a well designed function
API can have an even larger impact.
13

How will we compare performance?


● Benchmarks at this low level are not too reliable,
and also don’t represent performance in large projects well.
● Dynamic instruction count is more reliable on modern CPUs.
● We’ll use simple examples, so that we can just compare
the number of instructions generated by a compiler.
14

Accelerate large-scale applications with BOLT (link)


“… machine code … can range from 10s to 100s of megabytes in size, which is
often too large to fit in any modern CPU instruction cache. As a result, the
hardware spends a considerable amount of processing time — nearly 30 percent,
in many cases — getting an instruction stream from memory to the CPU.”
Disclaimer:
Our discussion is relevant
only for non-inlined functions
17
ISO C++ wiki: Do inline functions improve performance?
Yes and no. Sometimes. Maybe.

There are no simple answers. inline functions might make the code faster, they
might make it slower. They might make the executable larger, they might make it
smaller. They might cause thrashing, they might prevent thrashing. And they might
be, and often are, totally irrelevant to speed.

Credit to Khalil Estell: Firefox function distribution


157946 functions above (127B)
167404 functions below (127B)
18
Understanding how machine code is generated from C++

C++ Standard
Itanium C++ ABI          Microsoft Windows ABIs
System V gABI
psABI: ARM, x86, …
19
armv8-a, System V ABI: iPhone, M1 Mac and newer, Android smartphone
x86-64 (AMD64), System V ABI: Linux server, old Mac
x86-64 (AMD64), Microsoft ABI: Windows device

armv7-a, System V ABI: ancient iPhone, low-end Android smartphone
x86 (IA-32), System V ABI: ancient Linux server
x86 (IA-32), Microsoft ABI: ancient Windows device
20
armv8-a, System V ABI
    Specification: Procedure Call Standard for the Arm® 64-bit Architecture
    Compiler: armv8-a clang 18.1.0, -O2 -std=c++20
x86-64 (AMD64), System V ABI
    Specification: AMD64 Architecture Processor Supplement
    Compiler: x86-64 gcc 14.2, -O2 -std=c++20
x86-64 (AMD64), Microsoft ABI
    Specification: x64 calling convention
    Compiler: x64 msvc v19.40 VS17.10, -O2 /std:c++20

armv7-a, System V ABI
    Specification: Procedure Call Standard for the Arm® Architecture
    Compiler: armv7-a clang 11.0.1, -O2 -std=c++20
x86 (IA-32), System V ABI
    Specification: Intel386 Architecture Processor Supplement
    Compiler: x86-64 gcc 14.2, -O2 -std=c++20 -m32
x86 (IA-32), Microsoft ABI
    Specification: calling conventions
    Compiler: x86 msvc v19.40 VS17.10, -O2 /std:c++20
21

Things are complicated


We’ll be looking for simple guidelines to navigate this complexity.

C++ Core Guidelines seem like a good candidate.


Section 1. Return value
23
C++ Core Guidelines
F.20: For “out” output values, prefer return values to output parameters

Reason: A return value is self-documenting, whereas a & could be either in-out or out-only and is liable to be misused.
24
Returning std::unique_ptr
#include <memory>

// return by value
std::unique_ptr<int> value_ptr() {
    return nullptr;
}

// output parameter
void output_ptr(std::unique_ptr<int>& dst) {
    dst = nullptr;
}

https://godbolt.org/z/ea9M3G94s
25
Return by value: value_ptr()

armv8-a clang 18.1.0
    str xzr, [x8]
    ret

x86-64 gcc 14.2
    mov QWORD PTR [rdi], 0
    mov rax, rdi
    ret

x64 msvc v19.40 VS17.10
    sub rsp, 24
    xor eax, eax
    mov QWORD PTR [rcx], rax
    mov rax, rcx
    add rsp, 24
    ret 0

Output parameter: output_ptr(std::unique_ptr&)

armv8-a clang 18.1.0
    mov x8, x0
    ldr x0, [x0]
    str xzr, [x8]
    cbz x0, .LBB1_2
    b operator delete(void*)
.LBB1_2:
    ret

x86-64 gcc 14.2
    mov rax, QWORD PTR [rdi]
    mov QWORD PTR [rdi], 0
    test rax, rax
    je .L3
    mov esi, 4
    mov rdi, rax
    jmp operator delete(void*)
.L3:
    ret

x64 msvc v19.40 VS17.10
    mov rax, QWORD PTR [rcx]
    mov QWORD PTR [rcx], 0
    test rax, rax
    je SHORT $LN34@output_ptr
    mov edx, 4
    mov rcx, rax
    jmp operator delete(void*)
$LN34@output_ptr:
    ret 0
26
Returning std::unique_ptr

In output_ptr, the dst parameter might be non-empty on entry, so the function has to check for an old object and delete it.

27
Returning std::unique_ptr : call site
#include <memory>

// definitions removed to avoid inlining
std::unique_ptr<int> value_ptr();
void output_ptr(std::unique_ptr<int>& dst);

// return by value
int value_ptr_call() {
    auto ptr = value_ptr();
    return *ptr;
}

// output parameter
int output_ptr_call() {
    std::unique_ptr<int> ptr;
    output_ptr(ptr);
    return *ptr;
}

https://godbolt.org/z/G9aPehqM1
29
Return by value: value_ptr_call()

armv8-a clang 18.1.0
    stp x29, x30, [sp, #-32]!
    str x19, [sp, #16]
    mov x29, sp
    add x8, x29, #24
    bl value_ptr()
    ldr x0, [x29, #24]
    ldr w19, [x0]
    bl operator delete(void*)
    mov w0, w19
    ldr x19, [sp, #16]
    ldp x29, x30, [sp], #32
    ret

x86-64 gcc 14.2
    push rbx
    sub rsp, 16
    lea rdi, [rsp+8]
    call value_ptr()
    mov rdi, QWORD PTR [rsp+8]
    mov esi, 4
    mov ebx, DWORD PTR [rdi]
    call operator delete(void*)
    add rsp, 16
    mov eax, ebx
    pop rbx
    ret

x64 msvc v19.40 VS17.10
    push rbx
    sub rsp, 32
    lea rcx, QWORD PTR ptr$[rsp]
    call value_ptr()
    mov rcx, QWORD PTR ptr$[rsp]
    mov edx, 4
    mov ebx, DWORD PTR [rcx]
    call operator delete(void*)
    mov eax, ebx
    add rsp, 32
    pop rbx
    ret 0

Output parameter: output_ptr_call()

armv8-a clang 18.1.0
    stp x29, x30, [sp, #-32]!
    str x19, [sp, #16]
    mov x29, sp
    str xzr, [x29, #24]
    add x0, x29, #24
    bl output_ptr(std::unique_ptr&)
    ldr x0, [x29, #24]
    ldr w19, [x0]
    bl operator delete(void*)
    mov w0, w19
    ldr x19, [sp, #16]
    ldp x29, x30, [sp], #32
    ret
  Stack unwinding on exception:
    ldr x8, [x29, #24]
    mov x19, x0
    cbz x8, .LBB1_4
    mov x0, x8
    bl operator delete(void*)
.LBB1_4:
    mov x0, x19
    bl _Unwind_Resume

DW.ref.__gxx_personality_v0:
    .xword __gxx_personality_v0

x86-64 gcc 14.2
    push rbx
    sub rsp, 16
    mov QWORD PTR [rsp+8], 0
    lea rdi, [rsp+8]
    call output_ptr(std::unique_ptr&)
    mov rdi, QWORD PTR [rsp+8]
    mov esi, 4
    mov ebx, DWORD PTR [rdi]
    call operator delete(void*)
    add rsp, 16
    mov eax, ebx
    pop rbx
    ret
  Stack unwinding on exception:
    mov rbx, rax
    jmp .L5

x64 msvc v19.40 VS17.10
  Stack unwinding on exception:
$stateUnwindMap$int output_ptr_call()
    DB 02H
    DB 0aH
    DD imagerel std::unique_ptr::~unique_ptr()
    DB 060H
  Function body:
    push rbx
    sub rsp, 32
    mov QWORD PTR ptr$[rsp], 0
    lea rcx, QWORD PTR ptr$[rsp]
    call output_ptr(std::unique_ptr&)
    mov rcx, QWORD PTR ptr$[rsp]
    mov ebx, DWORD PTR [rcx]
    mov edx, 4
    call operator delete(void*)
    mov eax, ebx
    add rsp, 32
    pop rbx
    ret 0
30
Returning std::unique_ptr : call site

In output_ptr_call, ptr already exists when output_ptr is called, so it has to be destroyed if output_ptr throws.
31
value_ptr_call                      output_ptr_call

push rbx                            push rbx
sub  rsp, 16                        sub  rsp, 16
                                    mov  QWORD PTR [rsp+8], 0     ; puts 0 into memory
lea  rdi, [rsp+8]                   lea  rdi, [rsp+8]
call value_ptr()                    call output_ptr(std::unique_ptr&)
mov  rdi, QWORD PTR [rsp+8]         mov  rdi, QWORD PTR [rsp+8]
mov  esi, 4                         mov  esi, 4
mov  ebx, DWORD PTR [rdi]           mov  ebx, DWORD PTR [rdi]
call operator delete(void*)         call operator delete(void*)
add  rsp, 16                        add  rsp, 16
mov  eax, ebx                       mov  eax, ebx
pop  rbx                            pop  rbx
ret                                 ret
32
Returning std::unique_ptr : call site

In output_ptr_call, ptr is default constructed before the call to output_ptr. That is the extra instruction that puts 0 into memory.
33
C++ Core Guidelines
ES.20: Always initialize an object

Reason: Avoid used-before-set errors and their associated undefined behavior. Avoid problems with comprehension of complex initialization. Simplify refactoring.
34
Quick benchmark
https://quick-bench.com/q/mOAHh7zZeagJlCJ63GtWVMPngsw

Result: 25% overhead for the output parameter version.
35

deferred_construction<std::string> output;
read_strings(in, out(output));

https://www.foonathan.net/2016/10/output-parameter/
38
Does it solve our problems?
Pros:

● Default constructor before the function call is avoided

Cons (unless we fully trust the user and don’t have exceptions):

● Stack unwind on exception is still necessary


● Extra bool flag is required to know if the object
○ was actually initialized
○ not initialized more than once

Mainstream C++ compilers diagnose an execution path in a non-void function that doesn’t end with a
return statement. We just need to return by value.
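
To make the cons above concrete, here is a minimal sketch of a deferred-construction wrapper. The `deferred` class and the `read_name` functions are hypothetical names made up for this illustration; this is not the API from the linked article. It assumes C++17.

```cpp
#include <cassert>
#include <new>
#include <string>
#include <utility>

// Hypothetical wrapper: storage for a T that the callee constructs later.
template<class T>
class deferred {
public:
    deferred() = default;
    deferred(deferred const&) = delete;
    deferred& operator=(deferred const&) = delete;

    // The destructor still runs on every path, including stack unwinding,
    // and it needs the flag to know whether a T was ever constructed.
    ~deferred() {
        if (initialized_) get().~T();
    }

    template<class... Args>
    void emplace(Args&&... args) {
        assert(!initialized_);  // guards against initializing twice
        ::new (static_cast<void*>(storage_)) T(std::forward<Args>(args)...);
        initialized_ = true;
    }

    bool has_value() const { return initialized_; }

    T& get() {
        assert(initialized_);  // guards against use before initialization
        return *std::launder(reinterpret_cast<T*>(storage_));
    }

private:
    alignas(T) unsigned char storage_[sizeof(T)];
    bool initialized_ = false;  // the extra flag
};

// Output-parameter style: the callee must remember to call emplace().
void read_name(deferred<std::string>& dst) {
    dst.emplace("Snap");
}

// Return-by-value style: no wrapper, no flag, nothing to forget.
std::string read_name() {
    return "Snap";
}
```

The default constructor of the payload is avoided, but the flag and the conditional destructor replace it, which is the trade-off listed in the cons.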
Hopefully, you’re convinced that
an output parameter is a bad idea.
Now, let’s see how return by value works,
specifically for C++ abstractions.
40
C++ Core Guidelines
F.26: Use a unique_ptr<T> to transfer ownership where a pointer is needed

Reason Using unique_ptr is the cheapest way to pass a pointer safely.


41
Returning a pointer
#include <memory>

// returning raw pointer
int* raw_ptr() {
    return nullptr;
}

// returning smart pointer
std::unique_ptr<int> smart_ptr() {
    return nullptr;
}

https://godbolt.org/z/ExaoKfT1z
43
Raw pointer: raw_ptr()

armv8-a clang 18.1.0
    mov x0, xzr
    ret

x86-64 gcc 14.2
    xor eax, eax
    ret

x64 msvc v19.40 VS17.10
    xor eax, eax
    ret 0

Smart pointer: smart_ptr()

armv8-a clang 18.1.0
    str xzr, [x8]
    ret

x86-64 gcc 14.2
    mov QWORD PTR [rdi], 0
    mov rax, rdi
    ret

x64 msvc v19.40 VS17.10
    sub rsp, 24
    xor eax, eax
    mov QWORD PTR [rcx], rax
    mov rax, rcx
    add rsp, 24
    ret 0

x0, xzr, eax, rsp… are machine registers.
They are the fastest storage available on a machine.

A register in square brackets means it stores an address,
and we’re accessing memory at that address: [x8], QWORD PTR [rdi]
44
Quick benchmark
https://quick-bench.com/q/pJ3z9L_Q1M16qob8-sg8lM3-T60

Result: 200% overhead for returning std::unique_ptr compared to a raw pointer.
45
C++ Core Guidelines
F.26: Use a unique_ptr<T> to transfer ownership where a pointer is needed

Reason Using unique_ptr is the cheapest way to pass a pointer safely.

Not free
46
Wrapper over int
struct INT {
    int value;

    INT(int value = 0) : value{value} {}

    INT(INT&& src) : value{src.value} {}
    INT& operator=(INT&& src) {
        value = src.value;
        return *this;
    }
    INT(INT const& src) : value{src.value} {}
    INT& operator=(INT const& src) {
        value = src.value;
        return *this;
    }
    ~INT() {}
};
47

Libraries that wrap integers


● Smart pointers similarly wrap raw pointers
● std::chrono

● All other units libraries


● Safe integers
● Bindings for other languages
49
Wrapper over int
struct INT {
    int value;

    INT(int value = 0) : value{value} {}
    INT(INT&& src) : value{src.value} {}
    INT& operator=(INT&& src) {
        value = src.value;
        return *this;
    }
    INT(INT const& src) : value{src.value} {}
    INT& operator=(INT const& src) {
        value = src.value;
        return *this;
    }
    ~INT() {}
};

int int_seconds() {
    return 60;
}

INT INT_seconds() {
    return 60;
}

https://godbolt.org/z/Tddd4E6hs
50
armv8-a clang 18.1.0
  int int_seconds():
    mov w0, #60
    ret
  INT INT_seconds():
    mov w9, #60
    str w9, [x8]
    ret

x86-64 gcc 14.2
  int int_seconds():
    mov eax, 60
    ret
  INT INT_seconds():
    mov DWORD PTR [rdi], 60
    mov rax, rdi
    ret

x64 msvc v19.40 VS17.10
  int int_seconds():
    mov eax, 60
    ret 0
  INT INT_seconds():
    mov QWORD PTR [rcx], 60
    mov rax, rcx
    ret 0

armv7-a clang 11.0.1
  int int_seconds():
    mov r0, #60
    bx lr
  INT INT_seconds():
    mov r1, #60
    str r1, [r0]
    bx lr

x86-64 gcc 14.2 (-m32)
  int int_seconds():
    mov eax, 60
    ret
  INT INT_seconds():
    mov eax, DWORD PTR [esp+4]
    mov DWORD PTR [eax], 60
    ret 4

x86 msvc v19.40 VS17.10
  int int_seconds():
    mov eax, 60
    ret 0
  INT INT_seconds():
    mov eax, DWORD PTR ___$ReturnUdt$[esp-4]
    mov DWORD PTR [eax], 60
    ret 0
51
The problem
Itanium C++ ABI
3.1.3.1 Non-trivial Return Values

If the return type is a class type that is non-trivial for the purposes of calls, the
caller passes an address as an implicit parameter. The callee then constructs the
return value into this address.

C++ reference: Trivial class

A trivial class is a class that

● is trivially copyable, and


● has one or more eligible default constructors such that each is trivial.
53
Wrapper over int : no custom copy and move
struct INT {
    int value;

    INT(int value = 0) : value{value} {}
    // INT(INT&&) = default;
    // INT& operator=(INT&&) = default;
    // INT(INT const&) = default;
    // INT& operator=(INT const&) = default;
    ~INT() {}
};

int int_seconds() {
    return 60;
}

INT INT_seconds() {
    return 60;
}

https://godbolt.org/z/f6s1P96Tx
54
armv8-a clang 18.1.0
  int int_seconds():
    mov w0, #60
    ret
  INT INT_seconds():
    mov w9, #60
    str w9, [x8]
    ret

x86-64 gcc 14.2
  int int_seconds():
    mov eax, 60
    ret
  INT INT_seconds():
    mov DWORD PTR [rdi], 60
    mov rax, rdi
    ret

x64 msvc v19.40 VS17.10
  int int_seconds():
    mov eax, 60
    ret 0
  INT INT_seconds():
    mov QWORD PTR [rcx], 60
    mov rax, rcx
    ret 0

armv7-a clang 11.0.1
  int int_seconds():
    mov r0, #60
    bx lr
  INT INT_seconds():
    mov r1, #60
    str r1, [r0]
    bx lr

x86-64 gcc 14.2 (-m32)
  int int_seconds():
    mov eax, 60
    ret
  INT INT_seconds():
    mov eax, DWORD PTR [esp+4]
    mov DWORD PTR [eax], 60
    ret 4

x86 msvc v19.40 VS17.10
  int int_seconds():
    mov eax, 60
    ret 0
  INT INT_seconds():
    mov eax, DWORD PTR ___$ReturnUdt$[esp-4]
    mov DWORD PTR [eax], 60
    ret 0
55
The problem : digging deeper
Itanium C++ ABI
non-trivial for the purposes of calls

A type is considered non-trivial for the purposes of calls if:

● it has a non-trivial copy constructor, move constructor, or destructor, or


● all of its copy and move constructors are deleted.

different from

C++ reference: Trivially copyable class

Also requires trivial copy and move assignment operators.
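
The difference between the two definitions can be checked at compile time. The sketch below is only an approximation: `trivial_for_calls_approx` is a hypothetical helper, not a standard trait, and it ignores the corner cases around deleted constructors that the ABI wording covers. The struct names are made up for this illustration. It assumes C++17.

```cpp
#include <type_traits>

// Hypothetical approximation of "trivial for the purposes of calls":
// trivial copy constructor, move constructor, and destructor.
template<class T>
inline constexpr bool trivial_for_calls_approx =
    std::is_trivially_copy_constructible_v<T> &&
    std::is_trivially_move_constructible_v<T> &&
    std::is_trivially_destructible_v<T>;

struct with_copy_ctor {
    int value;
    with_copy_ctor(with_copy_ctor const& src) : value{src.value} {}
};

struct with_destructor {
    int value;
    ~with_destructor() {}
};

struct with_assignment {
    int value;
    with_assignment& operator=(with_assignment const& src) {
        value = src.value;
        return *this;
    }
};

struct plain {
    int value = 0;
};

// A user-provided copy constructor or destructor is enough to lose the property.
static_assert(!trivial_for_calls_approx<with_copy_ctor>);
static_assert(!trivial_for_calls_approx<with_destructor>);

// A user-provided assignment operator does not affect calls,
// but it does make the type not trivially copyable.
static_assert(trivial_for_calls_approx<with_assignment>);
static_assert(!std::is_trivially_copyable_v<with_assignment>);

// The rule-of-zero type satisfies both definitions.
static_assert(trivial_for_calls_approx<plain>);
static_assert(std::is_trivially_copyable_v<plain>);
```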


56
Wrapper over int : no custom destructor
struct INT {
    int value;

    INT(int value = 0) : value{value} {}
};

int int_seconds() {
    return 60;
}

INT INT_seconds() {
    return 60;
}

https://godbolt.org/z/qeq9rEE8T
57
armv8-a clang 18.1.0
  int int_seconds():
    mov w0, #60
    ret
  INT INT_seconds():
    mov w0, #60
    ret

x86-64 gcc 14.2
  int int_seconds():
    mov eax, 60
    ret
  INT INT_seconds():
    mov eax, 60
    ret

x64 msvc v19.40 VS17.10
  int int_seconds():
    mov eax, 60
    ret 0
  INT INT_seconds():
    mov QWORD PTR [rcx], 60
    mov rax, rcx
    ret 0

armv7-a clang 11.0.1
  int int_seconds():
    mov r0, #60
    bx lr
  INT INT_seconds():
    mov r0, #60
    bx lr

x86-64 gcc 14.2 (-m32)
  int int_seconds():
    mov eax, 60
    ret
  INT INT_seconds():
    mov eax, DWORD PTR [esp+4]
    mov DWORD PTR [eax], 60
    ret 4

x86 msvc v19.40 VS17.10
  int int_seconds():
    mov eax, 60
    ret 0
  INT INT_seconds():
    mov eax, DWORD PTR ___$ReturnUdt$[esp-4]
    mov DWORD PTR [eax], 60
    ret 0
58
Wrapper over int : no custom constructor
struct INT {
    int value = 0;
};

int int_seconds() {
    return 60;
}

INT INT_seconds() {
    return INT{60};
}

https://godbolt.org/z/c6GbTdjc7
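59
armv8-a clang 18.1.0
  int int_seconds():
    mov w0, #60
    ret
  INT INT_seconds():
    mov w0, #60
    ret

x86-64 gcc 14.2
  int int_seconds():
    mov eax, 60
    ret
  INT INT_seconds():
    mov eax, 60
    ret

x64 msvc v19.40 VS17.10
  int int_seconds():
    mov eax, 60
    ret 0
  INT INT_seconds():
    mov eax, 60
    ret 0

armv7-a clang 11.0.1
  int int_seconds():
    mov r0, #60
    bx lr
  INT INT_seconds():
    mov r0, #60
    bx lr

x86-64 gcc 14.2 (-m32)
  int int_seconds():
    mov eax, 60
    ret
  INT INT_seconds():
    mov eax, DWORD PTR [esp+4]
    mov DWORD PTR [eax], 60
    ret 4

x86 msvc v19.40 VS17.10
  int int_seconds():
    mov eax, 60
    ret 0
  INT INT_seconds():
    mov eax, 60
    ret 0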
60
armv8-a System V
If the argument type is a Composite Type that is larger than 16 bytes, then the argument is copied to memory.
… the result is returned in the same registers as would be used for such an argument.
Otherwise, … The address… shall be passed … in x8.

x86-64 System V
If a C++ object is non-trivial for the purpose of calls, as specified in the C++ ABI, it is passed by invisible reference… in %rdi…
If the class is INTEGER, the next available register of the sequence %rax, %rdx is used.

x86-64 Microsoft
To return a user-defined type by value in RAX, it must have a length of 1, 2, 4, 8, 16, 32, or 64 bits. It must also have no user-defined constructor, destructor, or copy assignment operator… This definition is essentially the same as a C++03 POD type.

armv7-a System V
A Composite Type not larger than 4 bytes is returned in r0…
A Composite Type larger than 4 bytes, or whose size cannot be determined statically by both caller and callee, is stored in memory at an address passed as an extra argument.

x86 System V
Some fundamental types and all aggregate types are returned in memory.

x86 Microsoft
Return values are … returned in the EAX register, except for 8-byte structures, which are returned in the EDX:EAX register pair. Larger structures are returned in the EAX register as pointers to hidden return structures…
Structures that are not PODs will not be returned in registers.
61
General purpose registers allocation
for function parameters and return values

Architecture   ABI         Composite types returned in registers
armv8-a        System V    ≤ 16 bytes
armv7-a        System V    ≤ 4 bytes
x86-64         System V    ≤ 16 bytes
x86            System V    fundamental only
x86-64         Microsoft   1, 2, 4, 8 bytes, C++03 POD
x86            Microsoft   1, 2, 4, 8 bytes, C++03 POD

Composite types are required to be “trivial” to get into registers!


63
C++ Core Guidelines
C.20: If you can avoid defining default operations, do

Reason It’s the simplest and gives the cleanest semantics.

Note This is known as “the rule of zero”.

Approved
Surely, this problem
is handled properly
in the popular libraries,
right?
65
std::chrono
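#include <chrono>
#include <cstdint>
#include <type_traits>

std::int64_t int_seconds() {
    return 60;
}

std::chrono::seconds chrono_seconds() {
    return std::chrono::seconds{60};
}

// Holds on the compilers tested here; the standard only requires at least 35 bits.
static_assert(std::is_same_v<std::int64_t, std::chrono::seconds::rep>);

https://godbolt.org/z/E5e1nGY94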
67
armv8-a clang 18.1.0
  int64_t int_seconds():
    mov w0, #60
    ret
  std::chrono::seconds chrono_seconds():
    mov w0, #60
    ret

x86-64 gcc 14.2
  int64_t int_seconds():
    mov eax, 60
    ret
  std::chrono::seconds chrono_seconds():
    mov eax, 60
    ret

x64 msvc v19.40 VS17.10
  int64_t int_seconds():
    mov eax, 60
    ret 0
  std::chrono::seconds chrono_seconds() (not a POD):
    mov QWORD PTR [rcx], 60
    mov rax, rcx
    ret 0

armv7-a clang 11.0.1
  int64_t int_seconds():
    mov r0, #60
    mov r1, #0
    bx lr
  std::chrono::seconds chrono_seconds() (size > 4):
    mov r1, #0
    mov r2, #60
    str r2, [r0]
    str r1, [r0, #4]
    bx lr

x86-64 gcc 14.2 (-m32)
  int64_t int_seconds():
    mov eax, 60
    xor edx, edx
    ret
  std::chrono::seconds chrono_seconds() (not fundamental):
    mov eax, DWORD PTR [esp+4]
    mov DWORD PTR [eax], 60
    mov DWORD PTR [eax+4], 0
    ret 4

x86 msvc v19.40 VS17.10
  int64_t int_seconds():
    mov eax, 60
    xor edx, edx
    ret 0
  std::chrono::seconds chrono_seconds() (not a POD):
    mov eax, DWORD PTR ___$ReturnUdt$[esp-4]
    mov DWORD PTR [eax], 60
    mov DWORD PTR [eax+4], 0
    ret 0
68
Can we do something about it?
● std::chrono would have to give up encapsulation to be maximally efficient on Windows.

● It cannot use a type smaller than int64_t just to optimize code on armv7-a.
69
std::pair and std::tuple
● std::pair copy and move constructors are defaulted according to the C++ standard.

● Only since C++17 is std::pair trivially destructible if its elements are trivially destructible.
  This is an ABI breakage, but a quick search gave only one complaint.

● Copy and move assignment operators are trivial only on MSVC.
  This is not a problem for function calls, but it is a problem for std::memcpy and std::bit_cast.

● std::tuple is never trivially move constructible.

https://godbolt.org/z/r7McGEb8o
70
Can we do something about it?
Don’t use std::pair and especially std::tuple.

A named struct is better for both readability and performance.
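
As a small illustration of this advice, the sketch below returns a named struct instead of a std::pair. The `range_bounds` struct and the `find_bounds` function are hypothetical names made up for this example. Whether std::pair<int, int> is trivially copyable depends on the standard library, so the sketch records that as a constant instead of asserting it. It assumes C++17.

```cpp
#include <algorithm>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical named result type: self-documenting field names, rule of zero.
struct range_bounds {
    int lowest;
    int highest;
};

// Holds for the named struct on every implementation.
static_assert(std::is_trivially_copyable_v<range_bounds>);

// Implementation-dependent, so it is not asserted here.
inline constexpr bool pair_is_trivially_copyable =
    std::is_trivially_copyable_v<std::pair<int, int>>;

// Assumes values is non-empty.
range_bounds find_bounds(std::vector<int> const& values) {
    auto [lowest, highest] = std::minmax_element(values.begin(), values.end());
    return range_bounds{*lowest, *highest};
}
```

At the call site, `bounds.lowest` and `bounds.highest` read better than `.first` and `.second`.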


71
Can we do something about std::unique_ptr?
#include <memory>

namespace detail {

int* smart_ptr_impl() {
    return nullptr;
}

} // namespace detail

// [[gnu::always_inline]] is the GCC and Clang spelling; MSVC uses __forceinline.
[[gnu::always_inline]] inline std::unique_ptr<int> smart_ptr() {
    return std::unique_ptr<int>{detail::smart_ptr_impl()};
}
72

Return Value Optimization (copy elision)


C++ reference: copy elision
Since C++17, a prvalue (“pure” rvalue) is not materialized until needed, and then it
is constructed directly into the storage of its final destination.
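
One way to see that no copy or move takes place is to return a type that cannot be copied or moved at all. The `pinned` type and the functions below are hypothetical names made up for this sketch, which assumes C++17 or later.

```cpp
// Hypothetical type with deleted copy operations and no move operations.
struct pinned {
    int value;

    explicit pinned(int v) : value{v} {}
    pinned(pinned const&) = delete;
    pinned& operator=(pinned const&) = delete;
};

// Compiles since C++17: the prvalue initializes the caller's object directly.
pinned make_pinned(int v) {
    return pinned{v};
}

int use_pinned() {
    pinned p = make_pinned(42);  // no copy or move constructor is needed
    return p.value;
}
```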
74

RVO: how it works


Itanium C++ ABI
3.1.3.1 Non-trivial Return Values

If the return type is a class type that is non-trivial for the purposes of calls, the
caller passes an address as an implicit parameter. The callee then constructs the
return value into this address.

… the pointer is passed as if it were the first parameter in the function prototype,
preceding all other parameters, including the this and VTT parameters.
It’s an output parameter done right by the compiler, and only when necessary!
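
The hidden pointer can be observed from C++: an object that records its own address during construction ends up at the address of the caller's variable. The `self_aware` type and the functions below are hypothetical names made up for this sketch, which assumes C++17 or later.

```cpp
// Hypothetical type that remembers where it was constructed.
// The user-provided copy constructor makes it non-trivial for the purposes of calls.
struct self_aware {
    self_aware const* constructed_at;

    self_aware() : constructed_at{this} {}
    self_aware(self_aware const&) : constructed_at{this} {}
};

self_aware make_self_aware() {
    return self_aware{};  // constructed into the address passed by the caller
}

bool constructed_in_place() {
    self_aware object = make_self_aware();
    return object.constructed_at == &object;
}
```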
75
76
RVO: inserting a function result into a container
#include <optional>

struct large {
    large();
    large(large&&);
    large& operator=(large&&);
    large(large const&);
    large& operator=(large const&);
    ~large();
};
large make_large();

std::optional<large> optional_large() {
    return std::optional<large>{make_large()};
}

https://godbolt.org/z/bcdsx7aP4
77
armv8-a clang 18.1.0
    stp x29, x30, [sp, #-32]!
    str x19, [sp, #16]
    mov x29, sp
    mov x19, x8
    add x8, x29, #31
    bl make_large()
    add x1, x29, #31
    mov x0, x19
    bl large::large(large&&)
    mov w8, #1
    add x0, x29, #31
    strb w8, [x19, #1]
    bl large::~large()
    ldr x19, [sp, #16]
    ldp x29, x30, [sp], #32
    ret
    mov x19, x0
    add x0, x29, #31
    bl large::~large()
    mov x0, x19
    bl _Unwind_Resume

DW.ref.__gxx_personality_v0:
    .xword __gxx_personality_v0

x86-64 gcc 14.2
    push rbp
    push rbx
    mov rbx, rdi
    sub rsp, 24
    lea rbp, [rsp+15]
    mov rdi, rbp
    call make_large()
    mov rsi, rbp
    mov rdi, rbx
    call large::large(large&&)
    mov BYTE PTR [rbx+1], 1
    mov rdi, rbp
    call large::~large()
    add rsp, 24
    mov rax, rbx
    pop rbx
    pop rbp
    ret
    mov rbx, rax
    jmp .L2
optional_large() [clone .cold]:
.L2:
    mov rdi, rbp
    call large::~large()
    mov rdi, rbx
    call _Unwind_Resume

x64 msvc v19.40 VS17.10
$stateUnwindMap$std::optional<large> optional_large()
    DB 02H
    DB 0aH
    DD imagerel large::~large()
    DB 080H

    mov QWORD PTR [rsp+8], rcx
    push rbx
    sub rsp, 48
    mov rbx, rcx
    lea rcx, QWORD PTR $T1[rsp]
    call make_large()
    npad 1
    mov rdx, rax
    mov rcx, rbx
    call large::large(large&&)
    mov BYTE PTR [rbx+1], 1
    lea rcx, QWORD PTR $T1[rsp]
    call large::~large()
    mov rax, rbx
    add rsp, 48
    pop rbx
    ret 0
79
The problem
https://en.cppreference.com/w/cpp/utility/optional/optional
template < class U = T >
constexpr optional( U&& value );

Binding the prvalue to U&& materializes a temporary,
which is then forwarded (moved) into the storage.

Affects constructors / emplace / insert into all the containers:

● std::optional
● std::variant
● std::vector and all other sequence containers
● std::map and all other associative containers
83
There is a solution!
https://quuxplusone.github.io/blog/2018/05/17/super-elider-round-2/

https://akrzemi1.wordpress.com/2018/05/16/rvalues-redefined/
84
Lazy evaluation with ac::lazy
https://alcash07.github.io/ACTL/actl/functional/lazy.html

template<class Function>
struct lazy {
    operator std::invoke_result_t<Function>() {
        return function();
    }

    Function function;
};
template<class Function>
lazy(Function&&) -> lazy<Function>;

std::optional<large> lazy_optional_large() {
    return std::optional<large>{lazy{make_large}};
}

https://godbolt.org/z/PYq6KTPKh
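
The effect can also be observed without reading assembly by counting move constructor calls. In the sketch below, `counted`, `make_counted`, and the two `moves_with_*` functions are hypothetical names made up for this illustration, and `lazy` repeats the struct above so that the block compiles on its own. It assumes C++17. The direct version performs exactly one move. Whether the lazy version performs zero depends on the compiler eliding the move after the conversion operator, so that count is not stated here.

```cpp
#include <optional>
#include <type_traits>

// Hypothetical type that counts how many times it is move constructed.
struct counted {
    static inline int moves = 0;

    counted() = default;
    counted(counted&&) noexcept { ++moves; }
    counted(counted const&) = default;
};

counted make_counted() {
    return counted{};
}

// Same shape as the lazy struct above.
template<class Function>
struct lazy {
    operator std::invoke_result_t<Function>() {
        return function();
    }

    Function function;
};
template<class Function>
lazy(Function&&) -> lazy<Function>;

// The prvalue binds to U&&, so optional has to move from a temporary.
int moves_with_direct_call() {
    counted::moves = 0;
    std::optional<counted> result{make_counted()};
    return counted::moves;
}

// optional receives a lazy object and converts it to counted in its own storage.
int moves_with_lazy() {
    counted::moves = 0;
    std::optional<counted> result{lazy{make_counted}};
    return counted::moves;
}
```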
86
Lazy: lazy_optional_large()

armv8-a clang 18.1.0
    stp x29, x30, [sp, #-32]!
    str x19, [sp, #16]
    mov x29, sp
    mov x19, x8
    bl make_large()
    mov w8, #1
    strb w8, [x19, #1]
    ldr x19, [sp, #16]
    ldp x29, x30, [sp], #32
    ret

x86-64 gcc 14.2
    push rbx
    mov rbx, rdi
    call make_large()
    mov rax, rbx
    mov BYTE PTR [rbx+1], 1
    pop rbx
    ret

x64 msvc v19.40 VS17.10
    push rbx
    sub rsp, 48
    mov rbx, rcx
    call make_large()
    mov rax, rbx
    mov BYTE PTR [rbx+1], 1
    add rsp, 48
    pop rbx
    ret 0

Negative-overhead abstraction!
88
C++ Core Guidelines
F.20: For “out” output values, prefer return values to output parameters

Reason: A return value is self-documenting, whereas a & could be either in-out or out-only and is liable to be misused.

Approved
89
Valid use cases for output parameters
std::ranges::transform(x, y, z);
std::ranges::sort(x);

The return value cannot be allocated on the stack, for example, because it’s a range with run-time size.

If we decouple memory allocation and data processing, the code is more reusable.
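
A minimal sketch of that decoupling, assuming a caller-owned std::vector as the output range. The `scale_into` function is a hypothetical helper made up for this example. The caller decides when memory is allocated, and the same buffer can be reused across calls.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: writes src[i] * factor into a caller-owned buffer.
// It only processes data; the caller controls when dst allocates.
void scale_into(std::vector<int> const& src, std::vector<int>& dst, int factor) {
    dst.resize(src.size());  // no reallocation if dst already has this size
    std::transform(src.begin(), src.end(), dst.begin(),
                   [factor](int x) { return x * factor; });
}
```

Calling `scale_into` twice with inputs of the same size reuses the storage of `dst`, which a version that returns a new vector on every call cannot do.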
90
ac::out and ac::inout
https://alcash07.github.io/ACTL/actl/functional/out_inout.html

template<class InRange, class OutRange, class Function>
void transform(InRange const& src, ac::out<OutRange&> dst, Function f);

template<class Range>
void sort(ac::inout<Range&> range);

template<class Range>
[[nodiscard]] Range sort(Range const& range);

transform(x, ac::out{y}, z);
sort(ac::inout{x});
auto y = sort(x);
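
To show how such a marker can work, here is a minimal sketch of an `out` wrapper. It is a hypothetical simplification written for this illustration; the real ac::out in ACTL may be implemented differently. The `squares` function is also made up. It assumes C++17.

```cpp
#include <type_traits>
#include <vector>

// Hypothetical sketch of an output-parameter marker.
template<class Ref>
class out {
    static_assert(std::is_lvalue_reference_v<Ref>, "out wraps an lvalue reference");

public:
    explicit out(Ref ref) : ref_(ref) {}

    Ref get() const { return ref_; }

private:
    Ref ref_;
};

// Makes out{x} deduce out<T&>, so the call site spells the intent explicitly.
template<class T>
out(T&) -> out<T&>;

// The signature documents that dst is written, not read.
void squares(std::vector<int> const& src, out<std::vector<int>&> dst) {
    std::vector<int>& result = dst.get();
    result.clear();
    for (int x : src) result.push_back(x * x);
}
```

Because the constructor is explicit, a call such as `squares(input, results)` does not compile; the caller has to write `squares(input, out{results})`.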
Section 2. Parameter passing
92
C++ Core Guidelines
F.16: For “in” parameters, pass cheaply-copied types by value and others by
reference to const

Reason Both let the caller know that a function will not modify the argument, and
both allow initialization by rvalues.

What is “cheap to copy” depends on the machine architecture, but two or three
words (doubles, pointers, references) are usually best passed by value.
93
int parameter
// passing by value
bool value_is_zero(int x) {
    return x == 0;
}

// passing by (const) reference
bool ref_is_zero(int const& x) {
    return x == 0;
}

https://godbolt.org/z/ofznMvKWc
95
Passing by value: value_is_zero(int)

armv8-a clang 18.1.0
    cmp w0, #0
    cset w0, eq
    ret

x86-64 gcc 14.2
    test edi, edi
    sete al
    ret

x64 msvc v19.40 VS17.10
    test ecx, ecx
    sete al
    ret 0

Passing by reference: ref_is_zero(int const&)

armv8-a clang 18.1.0
    ldr w8, [x0]
    cmp w8, #0
    cset w0, eq
    ret

x86-64 gcc 14.2
    mov eax, DWORD PTR [rdi]
    test eax, eax
    sete al
    ret

x64 msvc v19.40 VS17.10
    cmp DWORD PTR [rcx], 0
    sete al
    ret 0

The reference has to be dereferenced.

96
int parameter : call site
bool value_is_zero(int x);
bool ref_is_zero(int const& x);

// passing by value
bool value_is_zero_call() {
    return value_is_zero(1);
}

// passing by (const) reference
bool ref_is_zero_call() {
    return ref_is_zero(1);
}

https://godbolt.org/z/fzshqhd85
99
Passing by value: value_is_zero_call()

Here, we just put constant 1 into a register and call the function.

armv8-a clang 18.1.0
    mov w0, #1
    b value_is_zero(int)

x86-64 gcc 14.2
    mov edi, 1
    jmp value_is_zero(int)

x64 msvc v19.40 VS17.10
    mov ecx, 1
    jmp value_is_zero(int)

Passing by reference: ref_is_zero_call()

Here, we put constant 1 on the stack and pass its address,
and after the function call we restore the stack.

armv8-a clang 18.1.0
    sub sp, sp, #32
    stp x29, x30, [sp, #16]
    add x29, sp, #16
    mov w8, #1
    sub x0, x29, #4
    stur w8, [x29, #-4]
    bl ref_is_zero(int const&)
    and w0, w0, #0x1
    ldp x29, x30, [sp, #16]
    add sp, sp, #32
    ret

x86-64 gcc 14.2
    sub rsp, 24
    lea rdi, [rsp+12]
    mov DWORD PTR [rsp+12], 1
    call ref_is_zero(int const&)
    add rsp, 24
    ret

x64 msvc v19.40 VS17.10
    sub rsp, 40
    lea rcx, QWORD PTR $T1[rsp]
    mov DWORD PTR $T1[rsp], 1
    call ref_is_zero(int const&)
    add rsp, 40
    ret 0

No profiler will guide you here.
100
Quick benchmark
https://quick-bench.com/q/gVbxyQvoqxN76wfqnWVrFF8kqvQ

200% overhead when passing by reference
int parameter : extra function
void some_extra_function();

bool value_extra_function(int x) {        // passing by value
    int const copy = x;
    some_extra_function();
    return copy == x;
}

bool ref_extra_function(int const& x) {   // passing by (const) reference
    int const copy = x;
    some_extra_function();
    return copy == x;
}

https://godbolt.org/z/4r946xh8T
        armv8-a clang 18.1.0          x86-64 gcc 14.2               x64 msvc v19.40 VS17.10
VALUE   stp x29, x30, [sp, #-16]!     sub rsp, 8                    sub rsp, 40
        mov x29, sp                   call some_extra_function()    call some_extra_function()
        bl some_extra_function()      mov eax, 1                    mov al, 1
        mov w0, #1                    add rsp, 8                    add rsp, 40
        ldp x29, x30, [sp], #16       ret                           ret 0
        ret

REF     stp x29, x30, [sp, #-32]!     push rbp                      mov QWORD PTR [rsp+8], rbx
        stp x20, x19, [sp, #16]       push rbx                      push rdi
        mov x29, sp                   mov rbx, rdi                  sub rsp, 32
        ldr w20, [x0]                 sub rsp, 8                    mov ebx, DWORD PTR [rcx]
        mov x19, x0                   mov ebp, DWORD PTR [rdi]      mov rdi, rcx
        bl some_extra_function()      call some_extra_function()    call some_extra_function()
        ldr w8, [x19]                 cmp DWORD PTR [rbx], ebp      cmp ebx, DWORD PTR [rdi]
        cmp w20, w8                   sete al                       mov rbx, QWORD PTR [rsp+48]
        cset w0, eq                   add rsp, 8                    sete al
        ldp x20, x19, [sp, #16]       pop rbx                       add rsp, 32
        ldp x29, x30, [sp], #32       pop rbp                       pop rdi
        ret                           ret                           ret 0

In ref_extra_function, some_extra_function() can change the referenced value,
so x has to be reloaded after the call and the comparison cannot be folded away.

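The aliasing problem can be made visible with a small runnable sketch. The global `counter` and the body of some_extra_function() below are assumptions made for illustration; the slides only declare the function.

```cpp
// Hypothetical demo: `counter` and this body of some_extra_function()
// are not in the slides, they only make the aliasing observable.
int counter = 0;

void some_extra_function() {
    ++counter;  // modifies an object the caller may hold a reference to
}

bool value_extra_function(int x) {
    int const copy = x;
    some_extra_function();
    return copy == x;  // x is a private copy, so the result is always true
}

bool ref_extra_function(int const& x) {
    int const copy = x;
    some_extra_function();
    return copy == x;  // x may alias `counter`, so it must be read again
}
```

Calling ref_extra_function(counter) compares the old value with the incremented one, which is why the compiler cannot replace the comparison with a constant.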
Perfect forwarding
“In C++, perfect forwarding is the act of passing a function’s parameters to another
function while preserving its reference category.” link
The main purpose is to replace copies with moves when possible.

#include <memory>
#include <utility>

template<class T, class... Args>
std::unique_ptr<T> make_unique(Args&&... args) {
    return std::unique_ptr<T>(
        new T(std::forward<Args>(args)...));
}

Perfect forwarding is not perfect!
● It breaks RVO: the argument is bound to a reference, so a temporary has to be
materialized and then moved into place.
● A forwarding reference is still a reference, so it prevents passing in registers.

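A minimal sketch of the "breaks RVO" point, assuming C++17. The names `Counted`, `move_count`, `make_counted` and `make_unique_sketch` are hypothetical and only serve to count move constructions.

```cpp
#include <memory>
#include <utility>

int move_count = 0;  // hypothetical instrumentation, not from the slides

struct Counted {
    Counted() = default;
    Counted(Counted&&) noexcept { ++move_count; }
};

Counted make_counted() { return Counted{}; }

// Same shape as the make_unique shown on the slide.
template <class T, class... Args>
std::unique_ptr<T> make_unique_sketch(Args&&... args) {
    return std::unique_ptr<T>(new T(std::forward<Args>(args)...));
}

int moves_via_forwarding() {
    move_count = 0;
    // The prvalue binds to Counted&&, so a temporary is materialized
    // and then moved into the heap object.
    auto p = make_unique_sketch<Counted>(make_counted());
    (void)p;
    return move_count;
}

int moves_direct() {
    move_count = 0;
    // The prvalue initializes the heap object directly (copy elision).
    std::unique_ptr<Counted> p(new Counted(make_counted()));
    (void)p;
    return move_count;
}
```

Under these assumptions the forwarding path performs one move construction, while the direct path performs none.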
Hopefully, you’re convinced that
built-in types should be passed by value.
Now, let’s see which C++ abstractions
should also be passed by value.
Chandler Carruth: There Are No Zero-Cost Abstractions
The problem
Itanium C++ ABI

3.1.2.3 Non-Trivial Parameters

If a parameter type is a class type that is non-trivial for the purposes of calls, the
caller must allocate space for a temporary and pass that temporary by reference.

For such types, passing by reference is likely more efficient,
because it avoids making an extra copy on the stack
(unless you need that copy anyway).
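Whether a class is affected can be checked at compile time. The trait below is a conservative approximation of "trivial for the purposes of calls" and is an assumption of this sketch, not the ABI's exact wording; `trivial_for_calls_v`, `view` and `logged_view` are hypothetical names.

```cpp
#include <cstddef>
#include <memory>
#include <type_traits>

// Conservative approximation (assumption): trivial copy constructor,
// trivial move constructor and trivial destructor.
template <class T>
inline constexpr bool trivial_for_calls_v =
    std::is_trivially_copy_constructible_v<T> &&
    std::is_trivially_move_constructible_v<T> &&
    std::is_trivially_destructible_v<T>;

// Rule of 0: nothing user-provided, so the type stays trivial.
struct view {
    int const* ptr;
    std::size_t size;
};

// A user-provided destructor, even an empty one, makes the type non-trivial.
struct logged_view {
    int const* ptr;
    std::size_t size;
    ~logged_view() {}
};

static_assert(trivial_for_calls_v<view>);
static_assert(!trivial_for_calls_v<logged_view>);
static_assert(!trivial_for_calls_v<std::unique_ptr<int>>);
```

The last assertion is the case discussed in the Chandler Carruth talk: std::unique_ptr has a non-trivial destructor, so it is passed through a temporary in memory.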
C++ Core Guidelines
F.24: Use a span<T> or a span_p<T> to designate a half-open sequence

Reason Informal/non-explicit ranges are a source of errors.

“use span” + “use a span”: 16 occurrences


C++20 std::span vs raw pointer and size
#include <cstddef>
#include <span>

int raw_back(int const* ptr, std::size_t size) {
    return ptr[size - 1];
}

int span_back(std::span<int const> span) {
    return span[span.size() - 1];
}

https://godbolt.org/z/bez7c5PMK
        armv8-a clang 18.1.0      x86-64 gcc 14.2                     x64 msvc v19.40 VS17.10
RAW     add x8, x0, x1, lsl #2    mov eax, DWORD PTR [rdi-4+rsi*4]    mov eax, DWORD PTR [rcx+rdx*4-4]
        ldur w0, [x8, #-4]        ret                                 ret 0
        ret

SPAN    add x8, x0, x1, lsl #2    mov eax, DWORD PTR [rdi-4+rsi*4]    mov rdx, QWORD PTR [rcx+8]
        ldur w0, [x8, #-4]        ret                                 mov rax, QWORD PTR [rcx]
        ret                                                           mov eax, DWORD PTR [rax+rdx*4-4]
                                                                      ret 0

        armv7-a clang 11.0.1      x86-64 gcc 14.2 (-m32)              x86 msvc v19.40 VS17.10
RAW     add r0, r0, r1, lsl #2    mov eax, DWORD PTR [esp+4]          mov ecx, DWORD PTR _size$[esp-4]
        ldr r0, [r0, #-4]         mov edx, DWORD PTR [esp+8]          mov eax, DWORD PTR _ptr$[esp-4]
        bx lr                     mov eax, DWORD PTR [eax-4+edx*4]    mov eax, DWORD PTR [eax+ecx*4-4]
                                  ret                                 ret 0

SPAN    add r0, r0, r1, lsl #2    mov eax, DWORD PTR [esp+8]          mov ecx, DWORD PTR _span$[esp]
        ldr r0, [r0, #-4]         lea eax, [-4+eax*4]                 mov eax, DWORD PTR _span$[esp-4]
        bx lr                     add eax, DWORD PTR [esp+4]          mov eax, DWORD PTR [eax+ecx*4-4]
                                  mov eax, DWORD PTR [eax]            ret 0
                                  ret
                                  ?

C++23 std::mdspan vs raw pointer and sizes
#include <cstddef>

int raw_back2(int const* ptr, std::size_t width, std::size_t height) {
    return ptr[width * height - 1];
}

struct mdspan2 {
    int const* ptr;
    std::size_t width;
    std::size_t height;
};

int mdspan_back2(mdspan2 span) {
    return span.ptr[span.width * span.height - 1];
}

https://godbolt.org/z/EcfanMoYf
        armv8-a clang 18.1.0      x86-64 gcc 14.2                     x64 msvc v19.40 VS17.10
RAW     mul x8, x2, x1            imul rsi, rdx                       imul rdx, r8
        add x8, x0, x8, lsl #2    mov eax, DWORD PTR [rdi-4+rsi*4]    mov eax, DWORD PTR [rcx+rdx*4-4]
        ldur w0, [x8, #-4]        ret                                 ret 0
        ret

SPAN    ldp x8, x9, [x0, #8]      mov rax, QWORD PTR [rsp+16]         mov rdx, QWORD PTR [rcx+16]
        mul x8, x9, x8            imul rax, QWORD PTR [rsp+24]        imul rdx, QWORD PTR [rcx+8]
        ldr x9, [x0]              mov rdx, QWORD PTR [rsp+8]          mov rax, QWORD PTR [rcx]
        add x8, x9, x8, lsl #2    mov eax, DWORD PTR [rdx-4+rax*4]    mov eax, DWORD PTR [rax+rdx*4-4]
        ldur w0, [x8, #-4]        ret                                 ret 0
        ret

        armv7-a clang 11.0.1      x86-64 gcc 14.2 (-m32)              x86 msvc v19.40 VS17.10
RAW     mul r3, r2, r1            mov eax, DWORD PTR [esp+12]         mov ecx, DWORD PTR _width$[esp-4]
        add r0, r0, r3, lsl #2    imul eax, DWORD PTR [esp+8]         imul ecx, DWORD PTR _height$[esp-4]
        ldr r0, [r0, #-4]         mov edx, DWORD PTR [esp+4]          mov eax, DWORD PTR _ptr$[esp-4]
        bx lr                     mov eax, DWORD PTR [edx-4+eax*4]    mov eax, DWORD PTR [eax+ecx*4-4]
                                  ret                                 ret 0

SPAN    mul r3, r1, r2            mov eax, DWORD PTR [esp+8]          mov ecx, DWORD PTR _span$[esp+4]
        add r0, r0, r3, lsl #2    imul eax, DWORD PTR [esp+12]        imul ecx, DWORD PTR _span$[esp]
        ldr r0, [r0, #-4]         mov edx, DWORD PTR [esp+4]          mov eax, DWORD PTR _span$[esp-4]
        bx lr                     mov eax, DWORD PTR [edx-4+eax*4]    mov eax, DWORD PTR [eax+ecx*4-4]
                                  ret                                 ret 0

What the ABIs say about passing composite types

armv8-a, System V:
If the argument type is a Composite Type that is larger than 16 bytes, then the
argument is copied to memory allocated by the caller and the argument is
replaced by a pointer to the copy.

x86-64, System V:
If the class is MEMORY, pass the argument on the stack…
If the size of the aggregate exceeds two eightbytes and the first eightbyte isn’t
SSE or any other eightbyte isn’t SSEUP, the whole argument is passed in memory.

x86-64, Microsoft:
Any argument that doesn't fit in 8 bytes, or isn't 1, 2, 4, or 8 bytes, must be
passed by reference. A single argument is never spread across multiple registers.

armv7-a, System V:
When a Composite Type argument is assigned to core registers (either fully or
partially), the behavior is as if the argument had been stored to memory at a
word-aligned (4-byte) address and then loaded into consecutive registers using
a suitable load-multiple instruction.

x86, System V:
Most parameters are passed on the stack.
- The first three parameters of type __m64 are passed in %mm0, %mm1, and %mm2…

x86, Microsoft:
Parameters are pushed onto the stack from right to left.
__fastcall: Classes, structs, and unions: Treated as "multibyte" types
(regardless of size) and passed on the stack.

General purpose registers allocation
for function parameters and return values

Architecture     ABI         Composite types             Composite types
                             returned in registers       passed in registers
armv8-a          System V    ≤ 16 bytes                  ≤ 16 bytes
armv7-a          System V    ≤ 4 bytes                   ≤ 16 bytes
x86-64           System V    ≤ 16 bytes                  ≤ 16 bytes
x86              System V    fundamental only            SIMD only
x86-64           Microsoft   1,2,4,8 bytes, C++03 POD    1,2,4,8 bytes
x86              Microsoft   1,2,4,8 bytes, C++03 POD    not even fundamental
x86 __fastcall   Microsoft   1,2,4,8 bytes, C++03 POD    fundamental only

Composite types are required to be “trivial” to get into registers!
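The size limits in the table can be turned into a compile-time check. This is a sketch under stated assumptions: the helper `fits_two_words_v` is hypothetical, and the two-word limit matches only the armv8-a and x86-64 System V rows (on x86-64 Microsoft the limit is 8 bytes).

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical helper: trivially copyable and at most two machine words,
// i.e. 16 bytes on typical 64-bit targets.
template <class T>
inline constexpr bool fits_two_words_v =
    std::is_trivially_copyable_v<T> && sizeof(T) <= 2 * sizeof(void*);

// Pointer + size, the usual layout of a span-like type.
struct span_like {
    int const* ptr;
    std::size_t size;
};

// Same layout as the mdspan2 struct from the slides: three words.
struct mdspan2 {
    int const* ptr;
    std::size_t width;
    std::size_t height;
};

static_assert(fits_two_words_v<span_like>);
static_assert(!fits_two_words_v<mdspan2>);
```

A static_assert of this kind next to a type definition can flag the moment a new member pushes the type over the register limit.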


C++ Core Guidelines
F.24: Use a span<T> or a span_p<T> to designate a half-open sequence

Reason Informal/non-explicit ranges are a source of errors.

Not free
Empty parameter : use cases
● Predicates and transform functions passed to STL algorithms:
std::ranges::find_if(range, predicate);
std::ranges::transform(input_range, output, unary_op);

● Tag dispatch (somewhat obsolete after C++20 concepts)


template <class InputIter, class Diff = iter_difference_t<InputIter>>
void advance(InputIter& iter, Diff n, input_iterator_tag) {
for (; n > 0; --n)
++iter;
}
template <class RandIter, class Diff = iter_difference_t<RandIter>>
void advance(RandIter& iter, Diff n, random_access_iterator_tag) {
iter += n;
}

● Access token to make some API available only inside the library
(like the default “package private” access modifier in Java)
Empty parameter : tag dispatch
int raw_rand();

struct mt19937 {};


int tagged_rand(mt19937);

int raw_rand_call() {
return raw_rand();
}

int tagged_rand_call() {
return tagged_rand(mt19937{});
}

https://godbolt.org/z/vxG4eY1r4
        armv8-a clang 18.1.0      x86-64 gcc 14.2               x64 msvc v19.40 VS17.10
RAW     b raw_rand()              jmp raw_rand()                jmp raw_rand()

TAG     b tagged_rand(mt19937)    jmp tagged_rand(mt19937)      xor ecx, ecx
                                                                jmp tagged_rand(mt19937)

        armv7-a clang 11.0.1      x86-64 gcc 14.2 (-m32)        x86 msvc v19.40 VS17.10
RAW     b raw_rand()              jmp raw_rand()                jmp raw_rand()

TAG     b tagged_rand(mt19937)    sub esp, 24                   push ecx
                                  push 0                        mov BYTE PTR $T1[esp+4], 0
                                  call tagged_rand(mt19937)     push DWORD PTR $T1[esp+4]
                                  add esp, 28                   call tagged_rand(mt19937)
                                  ret                           add esp, 8
                                                                ret 0

Empty parameter
Itanium C++ ABI

2.2 POD Data Types

If the base ABI does not specify rules for empty classes, then an empty class has
size and alignment 1.

3.1.2.6 Empty Parameters

Arguments of empty class types that are not non-trivial for the purposes of calls
are passed no differently from ordinary classes.
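A small sketch of what this means in C++ terms. The names `mt19937_tag` and `rand_with` and the placeholder return value are hypothetical; moving the tag into a template argument is one possible alternative, not something the slides prescribe.

```cpp
#include <type_traits>

struct mt19937_tag {};  // hypothetical empty tag type

// An empty class still has size 1, so the ABI sees a real argument.
static_assert(std::is_empty_v<mt19937_tag>);
static_assert(sizeof(mt19937_tag) == 1);

// Tag as a function argument: an object is passed at run time.
int tagged_rand(mt19937_tag) {
    return 42;  // placeholder body
}

// Tag as a template argument: selection happens at compile time
// and nothing is passed at run time.
template <class Tag>
int rand_with();

template <>
int rand_with<mt19937_tag>() {
    return 42;  // placeholder body
}
```

Both calls select the same implementation; only the first one has a parameter the caller must materialize.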
C++ Core Guidelines
F.16: For “in” parameters, pass cheaply-copied types by value and others by
reference to const

Reason Both let the caller know that a function will not modify the argument, and
both allow initialization by rvalues.

What is “cheap to copy” depends on the machine architecture, but two or three
words (doubles, pointers, references) are usually best passed by value.
Approved*
Class member functions
Itanium C++ ABI

3.1.2.1 this Parameters

Non-static member functions, including constructors and destructors, take an
implicit this parameter of pointer type. It is passed as if it were the first
parameter in the function prototype…

This isn’t efficient if the class is small enough to be passed by value.


Effect on empty function objects
template<class T>
struct plus {
constexpr T operator()(T const& lhs, T const& rhs) const {
return lhs + rhs;
}
};

Simple function objects like std::plus above would most likely be inlined,
but more complex empty function objects would introduce overhead if not inlined.
Luckily, C++23 introduces static operator() and static operator[],
which take no implicit this parameter.
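A minimal sketch of the same function object with a static call operator. The name `plus_static` is hypothetical, and the feature-test macro guard is an assumption added so the snippet also compiles before C++23.

```cpp
template <class T>
struct plus_static {
#if defined(__cpp_static_call_operator)
    // C++23: no implicit this parameter is passed.
    static constexpr T operator()(T const& lhs, T const& rhs) {
        return lhs + rhs;
    }
#else
    // Fallback for older standards: regular const member function.
    constexpr T operator()(T const& lhs, T const& rhs) const {
        return lhs + rhs;
    }
#endif
};
```

Call sites stay unchanged, for example plus_static<int>{}(2, 3); only the non-inlined call loses the extra pointer argument.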


Section 3. Multiple parameters
Chain of function calls
int sum(int x1, int x2);

int sum_12_3(int x1, int x2, int x3) {
    return sum(sum(x1, x2), x3);
}
int sum_13_2(int x1, int x2, int x3) {
    return sum(sum(x1, x3), x2);
}
int sum_23_1(int x1, int x2, int x3) {
    return sum(sum(x2, x3), x1);
}
int sum_21_3(int x1, int x2, int x3) {
    return sum(sum(x2, x1), x3);
}

https://godbolt.org/z/MsjeT8TTK

            armv8-a clang 18.1.0    x86-64 gcc 14.2    x64 msvc v19.40 VS17.10
sum_12_3    9 instructions          7 instructions     9 instructions
sum_13_2    10 instructions         8 instructions     10 instructions
sum_23_1    11 instructions         9 instructions     12 instructions
sum_21_3    12 instructions         10 instructions    12 instructions


sum_12_3                  sum_13_2
push rbx                  push rbx
mov ebx, edx              mov ebx, esi
                          mov esi, edx
call sum(int, int)        call sum(int, int)
mov esi, ebx              mov esi, ebx
pop rbx                   pop rbx
mov edi, eax              mov edi, eax
jmp sum(int, int)         jmp sum(int, int)

Order of parameters is fixed in every ABI
int sum(int x1, int x2);

int sum_21_3(int x1, int x2, int x3) {
    return sum(sum(x2, x1), x3);
}

For sum_21_3, a swap of x1 and x2 is required (3 moves).
Eduardo Madrid: about the overhead of std::function
Knowledge needed
● this parameter passing, because std::function is a function object
● consistent parameters order
● enhanced “perfect forwarding”, which preserves passing in registers
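The last bullet can be sketched as a trait that chooses the parameter type per argument. Everything here is an assumption for illustration: `forward_param_t`, the two-word threshold and `invoke_erased` are hypothetical and are not claimed to be how any particular std::function replacement is implemented.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical trait: small trivially copyable types are passed by value
// (so they can travel in registers), everything else by forwarding reference.
template <class T>
using forward_param_t = std::conditional_t<
    std::is_trivially_copyable_v<T> && sizeof(T) <= 2 * sizeof(void*),
    T,
    T&&>;

static_assert(std::is_same_v<forward_param_t<int>, int>);
static_assert(std::is_same_v<forward_param_t<std::string>, std::string&&>);
static_assert(std::is_same_v<forward_param_t<int&>, int&>);

// Hypothetical inner call of a type-erased wrapper: Args is deduced from
// the function pointer, the argument types come from the trait.
template <class R, class... Args>
R invoke_erased(R (*fn)(Args...), forward_param_t<Args>... args) {
    return fn(std::forward<Args>(args)...);
}

int add(int a, int b) { return a + b; }
std::size_t length(std::string s) { return s.size(); }
```

With this signature, invoke_erased(&add, 2, 3) receives both ints by value, while invoke_erased(&length, std::string("abc")) still forwards the string by rvalue reference.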
C++ Core Guidelines
I.13: Do not pass an array as a single pointer

Example Consider:
void copy_n(const T* p, T* q, int n); // copy from [p:p+n) to [q:q+n)

What if there are fewer than n elements in the array pointed to by q? Then, we
overwrite some probably unrelated memory. What if there are fewer than n
elements in the array pointed to by p? Then, we read some probably unrelated
memory. Either is undefined behavior and a potentially very nasty bug.

Alternative Consider using explicit spans:


void copy(span<const T> r, span<T> r2); // copy r to r2
Copy of a byte span
#define NDEBUG
#include <cassert>
#include <cstddef>
#include <cstring>

void raw_copy(std::byte* dst, std::byte const* src, std::size_t size) {
    std::memcpy(dst, src, size);
}

void checked_copy(  // imagine 2 std::spans here
    std::byte* dst, std::byte const* src, std::size_t dst_size, std::size_t src_size
) {
    assert(src_size == dst_size);
    std::memcpy(dst, src, dst_size);
}

https://godbolt.org/z/3Tqs849eh
           armv8-a clang 18.1.0    x86-64 gcc 14.2    x64 msvc v19.40 VS17.10
RAW        b memcpy                jmp memcpy         jmp memcpy

CHECKED    b memcpy                jmp memcpy         jmp memcpy

Copy of a byte span : call site
#include <array>
#include <cstddef>

void raw_copy(std::byte* dst, std::byte const* src, std::size_t size);

void checked_copy(
    std::byte* dst, std::byte const* src, std::size_t dst_size, std::size_t src_size
);

std::array<std::byte, 8> arr;

void raw_copy_call() {
    raw_copy(arr.data(), arr.data(), 8);
}
void checked_copy_call() {
    checked_copy(arr.data(), arr.data(), 8, 8);
}

https://godbolt.org/z/7M45xz9ha
           armv8-a clang 18.1.0      x86-64 gcc 14.2               x64 msvc v19.40 VS17.10
RAW        adrp x0, arr              mov esi, OFFSET FLAT:arr      mov r8d, 8
           add x0, x0, :lo12:arr     mov edx, 8                    lea rdx, OFFSET FLAT:arr
           mov w2, #8                mov rdi, rsi                  lea rcx, OFFSET FLAT:arr
           mov x1, x0                jmp raw_copy                  jmp raw_copy
           b raw_copy

CHECKED    adrp x0, arr              mov esi, OFFSET FLAT:arr      mov r9d, 8
           add x0, x0, :lo12:arr     mov ecx, 8                    lea rdx, OFFSET FLAT:arr
           mov w2, #8                mov edx, 8                    mov r8d, r9d
           mov x1, x0                mov rdi, rsi                  lea rcx, OFFSET FLAT:arr
           mov w3, #8                jmp checked_copy              jmp checked_copy
           b checked_copy

Size is passed twice!

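One way to keep the check without paying for the second size in the out-of-line call is to do the check in a thin inline wrapper. This is a sketch under stated assumptions, not a recommendation from the slides: `byte_span` and `const_byte_span` are hypothetical stand-ins for std::span so that the snippet stays within C++17.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical minimal stand-ins for std::span<std::byte> and
// std::span<std::byte const>.
struct byte_span {
    std::byte* data;
    std::size_t size;
};
struct const_byte_span {
    std::byte const* data;
    std::size_t size;
};

// Out-of-line worker: the size crosses the call boundary once.
void raw_copy(std::byte* dst, std::byte const* src, std::size_t size) {
    std::memcpy(dst, src, size);
}

// Thin wrapper meant to be inlined: the size check happens at the call
// site, and only one size is forwarded to the worker.
inline void checked_copy(byte_span dst, const_byte_span src) {
    assert(dst.size == src.size);
    raw_copy(dst.data, src.data, dst.size);
}
```

The safety of the span-based interface is kept at the call site, while the non-inlined function keeps the three-parameter signature.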
C++ Core Guidelines
I.13: Do not pass an array as a single pointer

Example Consider:
void copy_n(const T* p, T* q, int n); // copy from [p:p+n) to [q:q+n)

Alternative Consider using explicit spans:


void copy(span<const T> r, span<T> r2); // copy r to r2

Not free
C++ Core Guidelines
I.23: Keep the number of function arguments low

Reason Having many arguments opens opportunities for confusion. Passing lots
of arguments is often costly compared to alternatives.

Discussion The two most common reasons why functions have too many
parameters are:

1. Missing an abstraction. …
2. Violating “one function, one responsibility.” …
Triple product (wiki)
Geometrically, the scalar triple product is the (signed) volume of the
parallelepiped defined by the three vectors.

The scalar triple product is unchanged under a circular shift of its three
operands (a, b, c):

a · (b × c) = b · (c × a) = c · (a × b)

Swapping the positions of the operators without re-ordering the operands
leaves the triple product unchanged:

a · (b × c) = (a × b) · c
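Both identities can be checked with a few lines of constexpr code. This is a sketch with hypothetical sample vectors `va`, `vb`, `vc`; passing vector3 by value is a choice made for this sketch.

```cpp
struct vector3 {
    int x, y, z;
};

constexpr int dot_product(vector3 a, vector3 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

constexpr vector3 cross_product(vector3 a, vector3 b) {
    return {a.y * b.z - a.z * b.y,
            a.z * b.x - a.x * b.z,
            a.x * b.y - a.y * b.x};
}

// a · (b × c)
constexpr int triple_product(vector3 a, vector3 b, vector3 c) {
    return dot_product(a, cross_product(b, c));
}

// Hypothetical sample vectors.
constexpr vector3 va{1, 2, 3}, vb{4, 5, 6}, vc{7, 8, 10};

// Circular shift of the operands.
static_assert(triple_product(va, vb, vc) == triple_product(vb, vc, va));
static_assert(triple_product(vb, vc, va) == triple_product(vc, va, vb));

// Swapping the operators: a · (b × c) = (a × b) · c.
static_assert(dot_product(va, cross_product(vb, vc)) ==
              dot_product(cross_product(va, vb), vc));
```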
Triple product : all int
struct vector3 {
int x, y, z;
};

int dot_product(int ax, int ay, int az, int bx, int by, int bz);
vector3 cross_product(int ax, int ay, int az, int bx, int by, int bz);

int triple_product(
int ax, int ay, int az,
int bx, int by, int bz,
int cx, int cy, int cz
) {
vector3 d = cross_product(ax, ay, az, bx, by, bz);
return dot_product(d.x, d.y, d.z, cx, cy, cz);
}
Triple product : vector3
struct vector3 {
int x, y, z;
};

int vector_dot_product(vector3 const& a, vector3 const& b);
vector3 vector_cross_product(vector3 const& a, vector3 const& b);

int vector_triple_product(
vector3 const& a,
vector3 const& b,
vector3 const& c
) {
return vector_dot_product(vector_cross_product(a, b), c);
}

https://godbolt.org/z/PfdcEjo1h
VECTOR, armv8-a clang 18.1.0:
    sub sp, sp, #48
    stp x29, x30, [sp, #16]
    str x19, [sp, #32]
    add x29, sp, #16
    mov x19, x2
    bl vector_cross_product
    str x0, [sp]
    mov x0, sp
    str w1, [sp, #8]
    mov x1, x19
    bl vector_dot_product
    ldp x29, x30, [sp, #16]
    ldr x19, [sp, #32]
    add sp, sp, #48
    ret

VECTOR, x86-64 gcc 14.2:
    push rbx
    mov rbx, rdx
    sub rsp, 16
    call vector_cross_product
    lea rdi, [rsp+4]
    mov rsi, rbx
    mov QWORD PTR [rsp+4], rax
    mov DWORD PTR [rsp+12], edx
    call vector_dot_product
    add rsp, 16
    pop rbx
    ret

VECTOR, x64 msvc v19.40 VS17.10:
    push rbx
    sub rsp, 80
    mov rax, QWORD PTR __security_cookie
    xor rax, rsp
    mov QWORD PTR __$ArrayPad$[rsp], rax
    mov rbx, r8
    mov r8, rdx
    mov rdx, rcx
    lea rcx, QWORD PTR $T1[rsp]
    call vector_cross_product
    mov rdx, rbx
    lea rcx, QWORD PTR $T2[rsp]
    movsd xmm0, QWORD PTR [rax]
    movsd QWORD PTR $T2[rsp], xmm0
    mov eax, DWORD PTR [rax+8]
    mov DWORD PTR $T2[rsp+8], eax
    call vector_dot_product
    mov rcx, QWORD PTR __$ArrayPad$[rsp]
    xor rcx, rsp
    call __security_check_cookie
    add rsp, 80
    pop rbx
    ret 0

“Keep an eye out for buffer security checks” by Nicholas Frechette

VECTOR, x64 msvc v19.40 VS17.10 with __declspec(safebuffers):
    push rbx
    sub rsp, 64
    mov rbx, r8
    mov r8, rdx
    mov rdx, rcx
    lea rcx, QWORD PTR $T2[rsp]
    call vector_cross_product
    mov rdx, rbx
    lea rcx, QWORD PTR $T1[rsp]
    movsd xmm0, QWORD PTR [rax]
    movsd QWORD PTR $T1[rsp], xmm0
    mov eax, DWORD PTR [rax+8]
    mov DWORD PTR $T1[rsp+8], eax
    call vector_dot_product
    add rsp, 64
    pop rbx
    ret 0

RAW, armv8-a clang 18.1.0:
    stp x29, x30, [sp, #-48]!
    str x21, [sp, #16]
    stp x20, x19, [sp, #32]
    mov x29, sp
    ldr w21, [x29, #48]
    mov w19, w7
    mov w20, w6
    bl cross_product
    lsr x8, x0, #32
    mov w2, w1
    mov w3, w20
    mov w4, w19
    mov w1, w8
    mov w5, w21
    ldp x20, x19, [sp, #32]
    ldr x21, [sp, #16]
    ldp x29, x30, [sp], #48
    b dot_product

RAW, x86-64 gcc 14.2:
    push r12
    push rbp
    push rbx
    sub rsp, 16
    mov ebx, DWORD PTR [rsp+48]
    mov ebp, DWORD PTR [rsp+56]
    mov r12d, DWORD PTR [rsp+64]
    call cross_product
    add rsp, 16
    mov r8d, ebp
    mov rcx, rax
    mov r9d, r12d
    mov edi, eax
    shr rcx, 32
    mov esi, ecx
    mov ecx, ebx
    pop rbx
    pop rbp
    pop r12
    jmp dot_product

RAW, x64 msvc v19.40 VS17.10:
    sub rsp, 104
    mov eax, DWORD PTR z2$[rsp]
    mov DWORD PTR [rsp+48], eax
    mov eax, DWORD PTR y2$[rsp]
    mov DWORD PTR [rsp+40], eax
    mov DWORD PTR [rsp+32], r9d
    mov r9d, r8d
    mov r8d, edx
    mov edx, ecx
    lea rcx, QWORD PTR $T1[rsp]
    call cross_product
    mov r9d, DWORD PTR x3$[rsp]
    movsd xmm0, QWORD PTR [rax]
    mov r8d, DWORD PTR [rax+8]
    mov eax, DWORD PTR z3$[rsp]
    mov DWORD PTR z2$[rsp], eax
    mov eax, DWORD PTR y3$[rsp]
    movsd QWORD PTR v4$[rsp], xmm0
    mov rcx, QWORD PTR v4$[rsp]
    mov rdx, rcx
    mov DWORD PTR y2$[rsp], eax
    shr rdx, 32
    add rsp, 104
    jmp dot_product

A lot of moving!
Stack pointer is heavily used

What the ABIs say about registers for parameters

armv8-a, System V:
The first eight registers, r0-r7, are used to pass argument values into a
subroutine and to return result values from a function.

x86-64, System V:
If the class is INTEGER, the next available register of the sequence
%rdi, %rsi, %rdx, %rcx, %r8 and %r9 is used.

x86-64, Microsoft:
By default, the x64 calling convention passes the first four arguments to a
function in registers.

armv7-a, System V:
The first four registers r0-r3 (a1-a4) are used to pass argument values into a
subroutine and to return a result value from a function.

x86, System V:
Most parameters are passed on the stack.
- The first three parameters of type __m64 are passed in %mm0, %mm1, and %mm2…

x86, Microsoft:
Parameters are pushed onto the stack from right to left.
__fastcall: The first two DWORD or smaller arguments that are found in the
argument list from left to right are passed in ECX and EDX registers.

General purpose registers allocation
for function parameters and return values

Architecture     ABI         Composite types             Composite types           Number of registers for
                             returned in registers       passed in registers       parameters + return
armv8-a          System V    ≤ 16 bytes                  ≤ 16 bytes                8 total
armv7-a          System V    ≤ 4 bytes                   ≤ 16 bytes                4 total
x86-64           System V    ≤ 16 bytes                  ≤ 16 bytes                6+2
x86              System V    fundamental only            SIMD only                 0+2
x86-64           Microsoft   1,2,4,8 bytes, C++03 POD    1,2,4,8 bytes             4+1
x86              Microsoft   1,2,4,8 bytes, C++03 POD    not even fundamental      0+2
x86 __fastcall   Microsoft   1,2,4,8 bytes, C++03 POD    fundamental only          2+2

Composite types are required to be “trivial” to get into registers!


C++ Core Guidelines
I.23: Keep the number of function arguments low

Reason Having many arguments opens opportunities for confusion. Passing lots
of arguments is often costly compared to alternatives.

Approved*

Conclusions
● Compilers do unexpected things to your code,
because they have to follow all the specifications
● Compiler Explorer is your friend
https://godbolt.org/
● C++ Core Guidelines are pretty reasonable
from a performance point of view

Most important guidelines to avoid function call overhead


● Return by value
● Pass “trivial” types by value, others by reference
● Follow the Rule of 0 (or at least support trivial copy)
● Make APIs consistent
● Understand the cost of abstractions on your target platform
Thank you for your attention!
