Hidden Overhead of A Function API
ISO C++ wiki: Do inline functions improve performance?
Yes and no. Sometimes. Maybe.
There are no simple answers. inline functions might make the code faster, they
might make it slower. They might make the executable larger, they might make it
smaller. They might cause thrashing, they might prevent thrashing. And they might
be, and often are, totally irrelevant to speed.
C++ Standard
armv7-a clang 11.0.1: -O2 -std=c++20
x86-64 gcc 14.2: -O2 -std=c++20 -m32
x86 msvc v19.40 VS17.10: -O2 /std:c++20
https://godbolt.org/z/ea9M3G94s
OUTPUT
armv8-a clang 18.1.0        x86-64 gcc 14.2               x64 msvc v19.40 VS17.10
mov x8, x0                  mov rax, QWORD PTR [rdi]      mov rax, QWORD PTR [rcx]
ldr x0, [x0]                mov QWORD PTR [rdi], 0        mov QWORD PTR [rcx], 0
str xzr, [x8]               test rax, rax                 test rax, rax
cbz x0, .LBB1_2             je .L3                        je SHORT $LN34@output_ptr
b operator delete(void*)    mov esi, 4                    mov edx, 4
.LBB1_2:                    mov rdi, rax                  mov rcx, rax
ret                         jmp operator delete(void*)    jmp operator delete(void*)
                            .L3:                          $LN34@output_ptr:
                            ret                           ret 0
Returning std::unique_ptr
#include <memory>
// definitions removed to avoid inlining
std::unique_ptr<int> value_ptr();
void output_ptr(std::unique_ptr<int>& dst);
int value_ptr_call() {
    auto ptr = value_ptr(); // return by value
    return *ptr;
}
int output_ptr_call() {
    std::unique_ptr<int> ptr;
    output_ptr(ptr); // output parameter
    return *ptr;
}
https://godbolt.org/z/G9aPehqM1
[Assembly for armv8-a clang 18.1.0, x86-64 gcc 14.2, x64 msvc v19.40 VS17.10. Annotation: stack unwinding on exception (bl _Unwind_Resume, __gxx_personality_v0).]
Returning std::unique_ptr : call site
#include <memory>
std::unique_ptr<int> value_ptr();
void output_ptr(std::unique_ptr<int>& dst);
int value_ptr_call() {
    auto ptr = value_ptr(); // return by value
    return *ptr;
}
25% overhead
deferred_construction<std::string> output;
read_strings(in, out(output));
https://www.foonathan.net/2016/10/output-parameter/
Does it solve our problems?
Pros:
Cons (unless we fully trust the user and don’t have exceptions):
Mainstream C++ compilers can diagnose an execution path in a non-void function
that does not end with a return statement (for example, -Wreturn-type in GCC and
Clang, warning C4715 in MSVC). We just need to return by value.
Hopefully, you’re convinced that
an output parameter is a bad idea.
Now, let’s see how return by value works,
specifically for C++ abstractions.
C++ Core Guidelines
F.26: Use a unique_ptr<T> to transfer ownership where a pointer is needed
https://godbolt.org/z/ExaoKfT1z
200% overhead
Not free
Wrapper over int
struct INT {
int value;
armv7-a clang 11.0.1    x86-64 gcc 14.2 (-m32)          x86 msvc v19.40 VS17.10
int:
mov r0, #60             mov eax, 60                     mov eax, 60
bx lr                   ret                             ret 0
INT:
mov r1, #60             mov eax, DWORD PTR [esp+4]      mov eax, DWORD PTR ___$ReturnUdt$[esp-4]
str r1, [r0]            mov DWORD PTR [eax], 60         mov DWORD PTR [eax], 60
bx lr                   ret 4                           ret 0
The problem
Itanium C++ ABI
3.1.3.1 Non-trivial Return Values
If the return type is a class type that is non-trivial for the purposes of calls, the
caller passes an address as an implicit parameter. The callee then constructs the
return value into this address.
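The rule quoted above can be probed at compile time. The following is a minimal sketch, assuming that "non-trivial for the purposes of calls" can be approximated by requiring trivial copy construction, move construction, and destruction; the exact ABI definition also covers deleted constructors, which this sketch ignores. The names trivial_for_calls, plain_wrapper, and owning_wrapper are hypothetical and do not come from the slides.

```cpp
#include <memory>
#include <type_traits>

// Approximation (assumption) of "trivial for the purposes of calls":
// copy construction, move construction, and destruction are all trivial.
template<class T>
inline constexpr bool trivial_for_calls =
    std::is_trivially_copy_constructible_v<T> &&
    std::is_trivially_move_constructible_v<T> &&
    std::is_trivially_destructible_v<T>;

// Hypothetical wrapper: a user-provided converting constructor does not
// change the triviality of copy, move, or destruction.
struct plain_wrapper {
    explicit plain_wrapper(int v) : value{v} {}
    int value;
};

// Hypothetical wrapper with a user-provided destructor.
struct owning_wrapper {
    ~owning_wrapper() {}
    int value;
};

static_assert(trivial_for_calls<int>);
static_assert(trivial_for_calls<plain_wrapper>);
static_assert(!trivial_for_calls<owning_wrapper>);
static_assert(!trivial_for_calls<std::unique_ptr<int>>);
```

Under this approximation, std::unique_ptr<int> falls into the category that the Itanium C++ ABI returns through a caller-provided address, while a wrapper with only a converting constructor does not. Other ABIs may use stricter criteria.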
https://godbolt.org/z/f6s1P96Tx
The problem : digging deeper
Itanium C++ ABI
non-trivial for the purposes of calls
different from
https://godbolt.org/z/qeq9rEE8T
armv7-a clang 11.0.1    x86-64 gcc 14.2 (-m32)          x86 msvc v19.40 VS17.10
int:
mov r0, #60             mov eax, 60                     mov eax, 60
bx lr                   ret                             ret 0
INT:
mov r0, #60             mov eax, DWORD PTR [esp+4]      mov eax, DWORD PTR ___$ReturnUdt$[esp-4]
bx lr                   mov DWORD PTR [eax], 60         mov DWORD PTR [eax], 60
                        ret 4                           ret 0
Wrapper over int : no custom constructor
struct INT {
    int value = 0;
};
int int_seconds() {
    return 60;
}
INT INT_seconds() {
    return INT{60};
}
https://godbolt.org/z/c6GbTdjc7
armv7-a clang 11.0.1    x86-64 gcc 14.2 (-m32)          x86 msvc v19.40 VS17.10
int:
mov r0, #60             mov eax, 60                     mov eax, 60
bx lr                   ret                             ret 0
INT:
mov r0, #60             mov eax, DWORD PTR [esp+4]      mov eax, 60
bx lr                   mov DWORD PTR [eax], 60         ret 0
                        ret 4
armv8-a:
If the argument type is a Composite Type that is larger than 16 bytes, then the
argument is copied to memory. … the result is returned in the same registers as
would be used for such an argument. Otherwise, … The address… shall be passed …
in x8.
System V x86-64:
If a C++ object is non-trivial for the purpose of calls, as specified in the
C++ ABI, it is passed by invisible reference… in %rdi… If the class is INTEGER,
the next available register of the sequence %rax, %rdx is used.
Microsoft x86-64:
To return a user-defined type by value in RAX, it must have a length of 1, 2, 4,
8, 16, 32, or 64 bits. It must also have no user-defined constructor, destructor,
or copy assignment operator… This definition is essentially the same as a C++03
POD type.
Approved
Surely, this problem
is handled properly
in the popular libraries,
right?
std::chrono
#include <chrono>
#include <cstdint>
#include <type_traits>
int64_t int_seconds() {
    return 60;
}
std::chrono::seconds chrono_seconds() {
    return std::chrono::seconds{60};
}
static_assert(std::is_same_v<int64_t, std::chrono::seconds::rep>);
https://godbolt.org/z/E5e1nGY94
armv7-a clang 11.0.1    x86-64 gcc 14.2 (-m32)          x86 msvc v19.40 VS17.10
int:
mov r0, #60             mov eax, 60                     mov eax, 60
mov r1, #0              xor edx, edx                    xor edx, edx
bx lr                   ret                             ret 0
chrono:
mov r1, #0              mov eax, DWORD PTR [esp+4]      mov eax, DWORD PTR ___$ReturnUdt$[esp-4]
mov r2, #60             mov DWORD PTR [eax], 60         mov DWORD PTR [eax], 60
str r2, [r0]            mov DWORD PTR [eax+4], 0        mov DWORD PTR [eax+4], 0
str r1, [r0, #4]        ret 4                           ret 0
bx lr
Why std::chrono::seconds is returned through memory:
armv7-a clang 11.0.1: size > 4
x86-64 gcc 14.2 (-m32): not fundamental
x86 msvc v19.40 VS17.10: not a POD
Can we do something about it?
● std::chrono would have to give up encapsulation
to be maximally efficient on Windows.
This is an ABI breakage, but a quick search gave only one complaint.
https://godbolt.org/z/r7McGEb8o
Don’t use std::pair and especially std::tuple.
namespace detail {
int* smart_ptr_impl() {
return nullptr;
}
} // namespace detail
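One way to read the snippet above: the out-of-line function returns a raw pointer, which fits in a register, and a thin inline wrapper takes ownership immediately. The following is a minimal sketch under that assumption; the wrapper name smart_ptr, the function smart_ptr_call, and the body of smart_ptr_impl (the slide's version returns nullptr) are hypothetical.

```cpp
#include <cassert>
#include <memory>

namespace detail {
// In a real library this would be defined out of line in a source file.
// It is defined here only to keep the sketch self-contained.
// Hypothetical body: returns an owning raw pointer.
inline int* smart_ptr_impl() {
    return new int{42};
}
} // namespace detail

// Inline wrapper: the raw pointer crosses the call boundary in a register,
// and ownership is taken right away at the call site.
inline std::unique_ptr<int> smart_ptr() {
    return std::unique_ptr<int>{detail::smart_ptr_impl()};
}

inline int smart_ptr_call() {
    auto ptr = smart_ptr();
    assert(ptr != nullptr);
    return *ptr;
}
```

The trade-off is that the raw-pointer function must stay an implementation detail, otherwise callers could leak the returned pointer.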
If the return type is a class type that is non-trivial for the purposes of calls, the
caller passes an address as an implicit parameter. The callee then constructs the
return value into this address.
… the pointer is passed as if it were the first parameter in the function prototype,
preceding all other parameters, including the this and VTT parameters.
It’s an output parameter done right by the compiler, and only when necessary!
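The effect of the hidden address parameter can be observed from inside the program. The following is a minimal sketch with a hypothetical type tracked that records where it was constructed; it assumes C++17, where returning a prvalue initializes the caller's object directly.

```cpp
#include <cassert>

// Hypothetical type that remembers the address it was constructed at.
// The user-provided copy constructor and destructor make it non-trivial
// for the purposes of calls.
struct tracked {
    tracked() : constructed_at{this} {}
    tracked(tracked const&) : constructed_at{this}, copied{true} {}
    ~tracked() {}
    tracked const* constructed_at;
    bool copied = false;
};

tracked make_tracked() {
    return tracked{};
}

// True when the callee constructed the object directly in the caller's storage.
bool constructed_in_place() {
    tracked result = make_tracked();
    assert(!result.copied);
    return result.constructed_at == &result;
}
```

Here constructed_in_place() returns true because make_tracked() receives the address of result and constructs the object there, without any copy.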
RVO: inserting a function result into a container
#include <optional>
struct large {
large();
large(large&&);
large& operator=(large&&);
large(large const&);
large& operator=(large const&);
~large();
};
large make_large();
std::optional<large> optional_large() {
return std::optional<large>{make_large()};
}
https://godbolt.org/z/bcdsx7aP4
There is a solution!
https://quuxplusone.github.io/blog/2018/05/17/super-elider-round-2/
https://akrzemi1.wordpress.com/2018/05/16/rvalues-redefined/
Lazy evaluation with ac::lazy
https://alcash07.github.io/ACTL/actl/functional/lazy.html
#include <optional>
#include <type_traits>
// large and make_large are declared as on the RVO slide
template<class Function>
struct lazy {
    // the conversion calls the stored function only when the value is needed
    operator std::invoke_result_t<Function>() {
        return function();
    }
    Function function;
};
template<class Function>
lazy(Function&&) -> lazy<Function>;
std::optional<large> lazy_optional_large() {
    return std::optional<large>{lazy{make_large}};
}
https://godbolt.org/z/PYq6KTPKh
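The difference between the eager and the lazy version can be measured by counting move constructions. The following is a minimal sketch; the type counted and the functions moves_eager and moves_lazy are hypothetical, and lazy repeats the wrapper shown above. Whether the lazy version reaches zero moves depends on the compiler eliding the move after the conversion operator, so the check only assumes that it never needs more moves than the eager version.

```cpp
#include <cassert>
#include <optional>
#include <type_traits>

// Hypothetical stand-in for `large` that counts move constructions.
struct counted {
    counted() = default;
    counted(counted&&) noexcept { ++moves; }
    counted(counted const&) = delete;
    static inline int moves = 0;
};

counted make_counted() {
    return counted{};
}

// Same shape as the lazy wrapper from the slide.
template<class Function>
struct lazy {
    operator std::invoke_result_t<Function>() {
        return function();
    }
    Function function;
};
template<class Function>
lazy(Function&&) -> lazy<Function>;

// The temporary returned by make_counted() is moved into the optional.
int moves_eager() {
    counted::moves = 0;
    std::optional<counted> result{make_counted()};
    return counted::moves;
}

// The optional constructs its value from the lazy wrapper, so the function
// result can be created directly inside the optional's storage.
int moves_lazy() {
    counted::moves = 0;
    std::optional<counted> result{lazy{make_counted}};
    return counted::moves;
}

void check_move_counts() {
    assert(moves_eager() == 1);
    assert(moves_lazy() <= moves_eager());
}
```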
Negative-overhead abstraction!
C++ Core Guidelines
F.20: For “out” output values, prefer return values to output parameters
Approved
Valid use cases for output parameters
std::ranges::transform(x, y, z);
std::ranges::sort(x);
template<class Range>
void sort(ac::inout<Range&> range);
template<class Range>
[[nodiscard]] Range sort(Range const& range);
C++ Core Guidelines
F.16: For “in” parameters, pass cheaply-copied types by value and others by
reference to const
Reason Both let the caller know that a function will not modify the argument, and
both allow initialization by rvalues.
What is “cheap to copy” depends on the machine architecture, but two or three
words (doubles, pointers, references) are usually best passed by value.
int parameter
bool value_is_zero(int x) { // passing by value
    return x == 0;
}
https://godbolt.org/z/ofznMvKWc
REF
armv8-a clang 18.1.0    x86-64 gcc 14.2               x64 msvc v19.40 VS17.10
ldr w8, [x0]            mov eax, DWORD PTR [rdi]      cmp DWORD PTR [rcx], 0
cmp w8, #0              test eax, eax                 sete al
cset w0, eq             sete al                       ret 0
ret                     ret
bool value_is_zero_call() {
    return value_is_zero(1); // passing by value
}
bool ref_is_zero_call() {
    return ref_is_zero(1); // passing by (const) reference
}
https://godbolt.org/z/fzshqhd85
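The definition of ref_is_zero is not shown on the slides. The following is a minimal sketch of both variants with their call sites, assuming that ref_is_zero takes int const& and has the same body as value_is_zero; check_is_zero is a hypothetical helper.

```cpp
#include <cassert>

bool value_is_zero(int x) {          // passing by value, as on the slide
    return x == 0;
}

bool ref_is_zero(int const& x) {     // assumed definition: passing by const reference
    return x == 0;
}

bool value_is_zero_call() {
    return value_is_zero(1);         // the constant is passed directly
}

bool ref_is_zero_call() {
    return ref_is_zero(1);           // a temporary int is created so that its address can be passed
}

void check_is_zero() {
    assert(!value_is_zero_call());
    assert(!ref_is_zero_call());
    assert(value_is_zero(0) && ref_is_zero(0));
}
```

When the callee is not inlined, the by-reference call has to store the argument to memory first, and the callee has to load it back.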
200% overhead
int parameter : extra function
void some_extra_function();
https://godbolt.org/z/4r946xh8T
REF
armv8-a clang 18.1.0           x86-64 gcc 14.2                x64 msvc v19.40 VS17.10
stp x29, x30, [sp, #-32]!      push rbp                       mov QWORD PTR [rsp+8], rbx
stp x20, x19, [sp, #16]        push rbx                       push rdi
mov x29, sp                    mov rbx, rdi                   sub rsp, 32
ldr w20, [x0]                  sub rsp, 8                     mov ebx, DWORD PTR [rcx]
mov x19, x0                    mov ebp, DWORD PTR [rdi]       mov rdi, rcx
bl some_extra_function()       call some_extra_function()     call some_extra_function()
ldr w8, [x19]                  cmp DWORD PTR [rbx], ebp       cmp ebx, DWORD PTR [rdi]
cmp w20, w8                    sete al                        mov rbx, QWORD PTR [rsp+48]
cset w0, eq                    add rsp, 8                     sete al
ldp x20, x19, [sp, #16]        pop rbx                        add rsp, 32
ldp x29, x30, [sp], #32        pop rbp                        pop rdi
ret                            ret                            ret 0
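The listing loads the referenced value, calls the opaque function, loads the value again, and compares. The source of that function is not in the slides; the following is a minimal sketch of code with this shape, assuming hypothetical names value_unchanged and ref_unchanged and a hypothetical global counter that the opaque function modifies.

```cpp
#include <cassert>

int counter = 0;  // hypothetical global state touched by the opaque function

// Only declared on the slide; defined here to keep the sketch self-contained.
void some_extra_function() {
    ++counter;
}

// By value: x is a private copy, so the call cannot change it.
bool value_unchanged(int x) {
    int before = x;
    some_extra_function();
    return before == x;
}

// By reference: x may alias memory that the call modifies,
// so the compiler has to reload it after the call.
bool ref_unchanged(int const& x) {
    int before = x;
    some_extra_function();
    return before == x;
}

void check_aliasing() {
    counter = 0;
    assert(value_unchanged(counter));   // the copy is not affected by the call
    assert(!ref_unchanged(counter));    // the referenced global changed during the call
}
```

The second assertion shows why the reload is required: a const reference only promises that the function will not modify the object through that reference.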
breaks RVO
Perfect forwarding is not perfect!
“In C++, perfect forwarding is the act of passing a function’s parameters to another
function while preserving its reference category.”
The main purpose is to replace copies with moves when possible.
If a parameter type is a class type that is non-trivial for the purposes of calls, the
caller must allocate space for a temporary and pass that temporary by reference.
https://godbolt.org/z/bez7c5PMK
armv8-a clang 18.1.0        x86-64 gcc 14.2                        x64 msvc v19.40 VS17.10
RAW:
add x8, x0, x1, lsl #2      mov eax, DWORD PTR [rdi-4+rsi*4]       mov eax, DWORD PTR [rcx+rdx*4-4]
ldur w0, [x8, #-4]          ret                                    ret 0
ret
SPAN:
add x8, x0, x1, lsl #2      mov eax, DWORD PTR [rdi-4+rsi*4]       mov rdx, QWORD PTR [rcx+8]
ldur w0, [x8, #-4]          ret                                    mov rax, QWORD PTR [rcx]
ret                                                                mov eax, DWORD PTR [rax+rdx*4-4]
                                                                   ret 0
armv7-a clang 11.0.1        x86-64 gcc 14.2 (-m32)                 x86 msvc v19.40 VS17.10
RAW:
add r0, r0, r1, lsl #2      mov eax, DWORD PTR [esp+4]             mov ecx, DWORD PTR _size$[esp-4]
ldr r0, [r0, #-4]           mov edx, DWORD PTR [esp+8]             mov eax, DWORD PTR _ptr$[esp-4]
bx lr                       mov eax, DWORD PTR [eax-4+edx*4]       mov eax, DWORD PTR [eax+ecx*4-4]
                            ret                                    ret 0
SPAN:
add r0, r0, r1, lsl #2      mov eax, DWORD PTR [esp+8]             mov ecx, DWORD PTR _span$[esp]
ldr r0, [r0, #-4]           lea eax, [-4+eax*4]                    mov eax, DWORD PTR _span$[esp-4]
bx lr                       add eax, DWORD PTR [esp+4]             mov eax, DWORD PTR [eax+ecx*4-4]
                            mov eax, DWORD PTR [eax]               ret 0
                            ret
?
C++23 std::mdspan vs raw pointer and sizes
#include <cstddef>
struct mdspan2 {
int const* ptr;
size_t width;
size_t height;
};
armv7-a clang 11.0.1        x86-64 gcc 14.2 (-m32)                 x86 msvc v19.40 VS17.10
RAW:
mul r3, r2, r1              mov eax, DWORD PTR [esp+12]            mov ecx, DWORD PTR _width$[esp-4]
add r0, r0, r3, lsl #2      imul eax, DWORD PTR [esp+8]            imul ecx, DWORD PTR _height$[esp-4]
ldr r0, [r0, #-4]           mov edx, DWORD PTR [esp+4]             mov eax, DWORD PTR _ptr$[esp-4]
bx lr                       mov eax, DWORD PTR [edx-4+eax*4]       mov eax, DWORD PTR [eax+ecx*4-4]
                            ret                                    ret 0
SPAN:
mul r3, r1, r2              mov eax, DWORD PTR [esp+8]             mov ecx, DWORD PTR _span$[esp+4]
add r0, r0, r3, lsl #2      imul eax, DWORD PTR [esp+12]           imul ecx, DWORD PTR _span$[esp]
ldr r0, [r0, #-4]           mov edx, DWORD PTR [esp+4]             mov eax, DWORD PTR _span$[esp-4]
bx lr                       mov eax, DWORD PTR [edx-4+eax*4]       mov eax, DWORD PTR [eax+ecx*4-4]
                            ret                                    ret 0
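The functions compiled for the RAW and SPAN listings are not shown on the slides. Judging by the generated code, both read the element at index width * height - 1. The following is a minimal sketch under that assumption; the names raw_last, span_last, and check_last are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

struct mdspan2 {
    int const* ptr;
    std::size_t width;
    std::size_t height;
};

// RAW: the pointer and both sizes are three separate scalar arguments.
int raw_last(int const* ptr, std::size_t width, std::size_t height) {
    return ptr[width * height - 1];
}

// SPAN: the same three values travel as one struct passed by value.
// On typical 64-bit targets the struct is 24 bytes, which is more than
// the 16-byte limit quoted for register passing.
int span_last(mdspan2 span) {
    return span.ptr[span.width * span.height - 1];
}

void check_last() {
    int const data[6] = {1, 2, 3, 4, 5, 6};
    assert(raw_last(data, 2, 3) == 6);
    assert(span_last(mdspan2{data, 2, 3}) == 6);
}
```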
armv8-a:
If the argument type is a Composite Type that is larger than 16 bytes, then the
argument is copied to memory allocated by the caller and the argument is
replaced by a pointer to the copy.
System V x86-64:
If the class is MEMORY, pass the argument on the stack… If the size of the
aggregate exceeds two eightbytes and the first eightbyte isn’t SSE or any other
eightbyte isn’t SSEUP, the whole argument is passed in memory.
Microsoft x86-64:
Any argument that doesn't fit in 8 bytes, or isn't 1, 2, 4, or 8 bytes, must be
passed by reference. A single argument is never spread across multiple registers.
Not free
Empty parameter : use cases
● Predicates and transform functions passed to STL algorithms:
std::ranges::find_if(range, predicate);
std::ranges::transform(input_range, output, unary_op);
● Access token to make some API available only inside the library
(like the default “package private” access modifier in Java)
Empty parameter : tag dispatch
int raw_rand();
int raw_rand_call() {
return raw_rand();
}
int tagged_rand_call() {
return tagged_rand(mt19937{});
}
https://godbolt.org/z/vxG4eY1r4
armv7-a clang 11.0.1    x86-64 gcc 14.2 (-m32)    x86 msvc v19.40 VS17.10
RAW:
b raw_rand()            jmp raw_rand()            jmp raw_rand()
If the base ABI does not specify rules for empty classes, then an empty class has
size and alignment 1.
Arguments of empty class types that are not non-trivial for the purposes of calls
are passed no differently from ordinary classes.
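The declarations of the tag type and of tagged_rand are not shown on the slides. The following is a minimal sketch of tag dispatch with an empty class, assuming a hypothetical tag named mt19937_tag and placeholder function bodies; it also checks the size rule quoted above.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical empty tag type used only to select an overload.
struct mt19937_tag {};

// An empty class still has a nonzero size, so it is passed like any other
// small class unless the ABI has a special rule for empty arguments.
static_assert(std::is_empty_v<mt19937_tag>);
static_assert(sizeof(mt19937_tag) == 1);

int raw_rand() {
    return 4;  // placeholder body, only declared on the slide
}

int tagged_rand(mt19937_tag) {
    return raw_rand();  // placeholder body: the tag carries no data
}

int raw_rand_call() {
    return raw_rand();
}

int tagged_rand_call() {
    return tagged_rand(mt19937_tag{});
}

void check_tag_dispatch() {
    assert(tagged_rand_call() == raw_rand_call());
}
```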
C++ Core Guidelines
F.16: For “in” parameters, pass cheaply-copied types by value and others by
reference to const
Reason Both let the caller know that a function will not modify the argument, and
both allow initialization by rvalues.
What is “cheap to copy” depends on the machine architecture, but two or three
words (doubles, pointers, references) are usually best passed by value.
Approved*
Class member functions
Itanium C++ ABI
Effect on empty function objects
template<class T>
struct plus {
constexpr T operator()(T const& lhs, T const& rhs) const {
return lhs + rhs;
}
};
Simple function objects like std::plus above would most likely be inlined,
but more complex empty function objects would introduce overhead if not inlined.
https://godbolt.org/z/MsjeT8TTK
Example Consider:
void copy_n(const T* p, T* q, int n); // copy from [p:p+n) to [q:q+n)
What if there are fewer than n elements in the array pointed to by q? Then, we
overwrite some probably unrelated memory. What if there are fewer than n
elements in the array pointed to by p? Then, we read some probably unrelated
memory. Either is undefined behavior and a potentially very nasty bug.
void raw_copy_call() {
raw_copy(arr.data(), arr.data(), 8);
}
void checked_copy_call() {
checked_copy(arr.data(), arr.data(), 8, 8);
}
https://godbolt.org/z/7M45xz9ha
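The definitions of raw_copy and checked_copy are not shown on the slides. The following is a minimal sketch, assuming that the two extra arguments of checked_copy are the source and destination sizes and that a failed check is reported through the return value; both signatures and the helper check_copy are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Unchecked: trusts the caller that both arrays hold at least n elements.
void raw_copy(int const* src, int* dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = src[i];
    }
}

// Checked (assumed signature): both sizes are passed explicitly.
// Returns false without copying when the destination is too small.
bool checked_copy(int const* src, int* dst,
                  std::size_t src_size, std::size_t dst_size) {
    if (dst_size < src_size) {
        return false;
    }
    raw_copy(src, dst, src_size);
    return true;
}

void check_copy() {
    std::array<int, 8> const source{1, 2, 3, 4, 5, 6, 7, 8};
    std::array<int, 8> target{};
    assert(checked_copy(source.data(), target.data(), source.size(), target.size()));
    assert(target == source);

    std::array<int, 4> small{};
    assert(!checked_copy(source.data(), small.data(), source.size(), small.size()));
}
```

The checked version pays for safety with one more argument and one more comparison per call.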
Not free
C++ Core Guidelines
I.23: Keep the number of function arguments low
Reason Having many arguments opens opportunities for confusion. Passing lots
of arguments is often costly compared to alternatives.
Discussion The two most common reasons why functions have too many
parameters are:
1. Missing an abstraction. …
2. Violating “one function, one responsibility.” …
Triple product (wiki)
Geometrically, the scalar triple product is the
(signed) volume of the parallelepiped defined by
the three vectors.
a · (b × c) = b · (c × a) = c · (a × b)
a · (b × c) = (a × b) · c
Triple product : all int
struct vector3 {
int x, y, z;
};
int dot_product(int ax, int ay, int az, int bx, int by, int bz);
vector3 cross_product(int ax, int ay, int az, int bx, int by, int bz);
int triple_product(
int ax, int ay, int az,
int bx, int by, int bz,
int cx, int cy, int cz
) {
vector3 d = cross_product(ax, ay, az, bx, by, bz);
return dot_product(d.x, d.y, d.z, cx, cy, cz);
}
Triple product : vector3
struct vector3 {
int x, y, z;
};
int vector_triple_product(
vector3 const& a,
vector3 const& b,
vector3 const& c
) {
return vector_dot_product(vector_cross_product(a, b), c);
}
https://godbolt.org/z/PfdcEjo1h
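The definitions of vector_dot_product and vector_cross_product are not shown on the slides. The following is a minimal sketch using the standard component formulas, together with a check of the identity a · (b × c) = (a × b) · c on one concrete input; the function bodies and check_triple_product are assumptions.

```cpp
#include <cassert>

struct vector3 {
    int x, y, z;
};

// Assumed definition: sum of component products.
int vector_dot_product(vector3 const& a, vector3 const& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Assumed definition: standard right-handed cross product.
vector3 vector_cross_product(vector3 const& a, vector3 const& b) {
    return vector3{
        a.y * b.z - a.z * b.y,
        a.z * b.x - a.x * b.z,
        a.x * b.y - a.y * b.x
    };
}

// As on the slide: (a × b) · c.
int vector_triple_product(vector3 const& a, vector3 const& b, vector3 const& c) {
    return vector_dot_product(vector_cross_product(a, b), c);
}

void check_triple_product() {
    vector3 const a{1, 2, 3};
    vector3 const b{4, 5, 6};
    vector3 const c{7, 8, 10};
    // (a × b) = (-3, 6, -3), so (a × b) · c = -21 + 48 - 30 = -3.
    assert(vector_triple_product(a, b, c) == -3);
    // a · (b × c) gives the same signed volume.
    assert(vector_dot_product(a, vector_cross_product(b, c)) == -3);
}
```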
__declspec(safebuffers)
RAW
armv8-a clang 18.1.0           x86-64 gcc 14.2                     x64 msvc v19.40 VS17.10
stp x29, x30, [sp, #-48]!      push r12                            sub rsp, 104
str x21, [sp, #16]             push rbp                            mov eax, DWORD PTR z2$[rsp]
stp x20, x19, [sp, #32]        push rbx                            mov DWORD PTR [rsp+48], eax
mov x29, sp                    sub rsp, 16                         mov eax, DWORD PTR y2$[rsp]
ldr w21, [x29, #48]            mov ebx, DWORD PTR [rsp+48]         mov DWORD PTR [rsp+40], eax
mov w19, w7                    mov ebp, DWORD PTR [rsp+56]         mov DWORD PTR [rsp+32], r9d
mov w20, w6                    mov r12d, DWORD PTR [rsp+64]        mov r9d, r8d
bl cross_product               call cross_product                  mov r8d, edx
lsr x8, x0, #32                add rsp, 16                         mov edx, ecx
mov w2, w1                     mov r8d, ebp                        lea rcx, QWORD PTR $T1[rsp]
mov w3, w20                    mov rcx, rax                        call cross_product
mov w4, w19                    mov r9d, r12d                       mov r9d, DWORD PTR x3$[rsp]
mov w1, w8                     mov edi, eax                        movsd xmm0, QWORD PTR [rax]
mov w5, w21                    shr rcx, 32                         mov r8d, DWORD PTR [rax+8]
ldp x20, x19, [sp, #32]        mov esi, ecx                        mov eax, DWORD PTR z3$[rsp]
ldr x21, [sp, #16]             mov ecx, ebx                        mov DWORD PTR z2$[rsp], eax
ldp x29, x30, [sp], #48        pop rbx                             mov eax, DWORD PTR y3$[rsp]
b dot_product                  pop rbp                             movsd QWORD PTR v4$[rsp], xmm0
                               pop r12                             mov rcx, QWORD PTR v4$[rsp]
                               jmp dot_product                     mov rdx, rcx
                                                                   mov DWORD PTR y2$[rsp], eax
                                                                   shr rdx, 32
                                                                   add rsp, 104
                                                                   jmp dot_product
A lot of moving!
armv8-a:
The first eight registers, r0-r7, are used to pass argument values into a
subroutine and to return result values from a function.
System V x86-64:
If the class is INTEGER, the next available register of the sequence %rdi, %rsi,
%rdx, %rcx, %r8 and %r9 is used.
Microsoft x86-64:
By default, the x64 calling convention passes the first four arguments to a
function in registers.
x86 Microsoft 1,2,4,8 bytes, C++03 POD not even fundamental 0+2
x86 __fastcall Microsoft 1,2,4,8 bytes, C++03 POD fundamental only 2+2
C++ Core Guidelines
I.23: Keep the number of function arguments low
Approved*
Conclusions
● Compilers do unexpected things to your code,
because they have to follow all the specifications
● Compiler Explorer is your friend
https://godbolt.org/
● C++ Core Guidelines are pretty reasonable
from a performance point of view