Move add_to_alignment logic to BufferVec #9928
Draft
Objective
The "Uninitialized buffer uniform tail" trick was used by both skinning and morphing.
We should abstract this and merge them to have a consistent and explicit implementation. We may also take the opportunity to optimize it.
Solution
- Move `add_to_alignment` logic to a `BufferVec` impl block.
- Make part of the calculation `const`, and panic at compile time when alignment is impossible.
- Add a `BufferVec` method that is as efficient as possible.
The goal is to avoid the overhead of `push`, which involves checking for available capacity each iteration.
We set a new capacity and set the newly allocated memory positions to zero, because with any other `Vec` method the compiler fails to do any optimizations on this.
Alternatives
I've tried a lot of different approaches to improve performance.
Using `buffer.values.extend((0..to_add).map(|_| T::zeroed()))`
In isolated code, this inlines the whole operation, but in the context of the extract systems it still calls `SpecializedExtend` as an external function, and is slower than the current `while` solution.
I can confirm that in isolation this is the best option: `Range` has a `TrustedLen` impl, which allows the compiler to remove a lot of bounds checks and makes the optimizer more capable. This is in contrast to `iter::repeat(T::zeroed()).take(to_add)`.
Using a zeroed vector
This does an allocation, and rust is not capable of using `calloc` on `bytemuck::Zeroable` types [1], so it allocates the vec and pushes zeros to it, then calls `ptr::copy_nonoverlapping` to copy them to the end of `buffer.values`. I'm not sure there is any gain over the other solutions, especially when we expect the number of additional zeros to be between 4 and 64.
Using `set_len` without initialization
This is very unsafe, as it breaks an important invariant of `Vec` (no uninitialized memory within `len`). It is unsound in rust to have any value be uninitialized, even stuff like `i32` where all bit patterns are accepted, because "uninitialized" in C terms means the value is not fixed, which breaks a lot of rust assumptions. But according to my research, it should be sound here, as the values of the `values` field of `BufferVec` are never read (so fixedness is irrelevant). In fact, `wgpu` does handle it like FFI data, using `ptr::copy_nonoverlapping` and passing it directly to the driver.
For our specific use-case of `add_to_alignment`, it's fine, because even in the shader, we do not read the uninitialized values. I didn't test performance on the current iteration, but for this, we get a 3% speedup on `extract_skinned_meshes`.
However, this requires disabling a forbid clippy lint. I'm comfortable enough to say "this is fine" but I suspect this would be rejected by most of the community.
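For contrast with the uninitialized variant, here is a minimal sketch of the approach described in the Solution section: reserve the capacity, zero the newly reserved slots, and only then call `set_len`, so the `Vec` invariant holds. The `Zeroable` trait below is a local stand-in for `bytemuck::Zeroable`, and `extend_zeroed` is a hypothetical name, not the method added by this PR.

```rust
use std::ptr;

/// Local stand-in for `bytemuck::Zeroable`, so the sketch has no dependencies.
///
/// # Safety
/// Implementors must be valid values when every byte is zero.
pub unsafe trait Zeroable: Copy {}
unsafe impl Zeroable for u32 {}
unsafe impl Zeroable for [f32; 4] {}

/// Hypothetical helper (not the PR's code): append `to_add` zeroed elements.
pub fn extend_zeroed<T: Zeroable>(values: &mut Vec<T>, to_add: usize) {
    // One capacity check for the whole operation instead of one per element.
    values.reserve(to_add);
    let len = values.len();
    // SAFETY: `reserve` guarantees room for `len + to_add` elements, the
    // `to_add` slots past `len` are zeroed before `set_len` exposes them,
    // and all-zero bytes are a valid `T` because `T: Zeroable`.
    unsafe {
        ptr::write_bytes(values.as_mut_ptr().add(len), 0, to_add);
        values.set_len(len + to_add);
    }
}
```

Unlike the bare `set_len` variant, every element within `len` is initialized here, so no lint needs to be disabled for soundness reasons; whether it recovers the same speedup is not something this sketch measures.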
`push` with explicit alloc elision
Would you believe that this generates a capacity check for each loop iteration? We know we will never overflow capacity though! Here is the way to remove them:
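One possible shape of this, as a sketch rather than the exact code from the PR: reserve once, then tell the optimizer that the "vector is full" branch inside `push` cannot happen. `push_zeros` is a hypothetical name, and whether the check is actually removed depends on the compiler version and surrounding code.

```rust
use std::hint::unreachable_unchecked;

/// Hypothetical helper: push `to_add` zeros after a single `reserve`.
pub fn push_zeros(values: &mut Vec<u32>, to_add: usize) {
    values.reserve(to_add);
    for _ in 0..to_add {
        if values.len() == values.capacity() {
            // SAFETY: `reserve(to_add)` above guarantees room for `to_add`
            // more elements, and this loop pushes at most `to_add` of them,
            // so the length can never reach the capacity here.
            unsafe { unreachable_unchecked() }
        }
        // With the hint above, the grow path of `push` is dead code that
        // the optimizer is free to drop.
        values.push(0);
    }
}
```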
When applying this to the `add_to_alignment` method, we get something pretty nice. But for some reason we still have individual increments of the `len` field, and each `0` is added individually.
Prefer consts
One important insight is that the compiler handles values derived from consts much better.
So instead of:
We could do:
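A minimal sketch of that contrast, with hypothetical names throughout. The first function receives the element size and alignment as runtime values; the second derives `t_align` from constants only, which also lets an impossible alignment fail at compile time. `ALIGN_BYTES = 256` is an assumed value chosen for illustration.

```rust
use std::marker::PhantomData;
use std::mem::size_of;

/// Assumed alignment requirement in bytes (illustrative value).
const ALIGN_BYTES: usize = 256;

/// Runtime variant: `t_align` is only known once the arguments are.
pub fn zeros_to_add_runtime(len: usize, t_size: usize, align_bytes: usize) -> usize {
    let t_align = align_bytes / t_size;
    (t_align - len % t_align) % t_align
}

/// Carrier type for a per-`T` associated constant.
struct Align<T>(PhantomData<T>);

impl<T> Align<T> {
    /// Number of `T`s per aligned block, computed at compile time.
    /// Using it with a `T` whose size does not divide `ALIGN_BYTES`
    /// is a compile error instead of a runtime panic.
    const T_ALIGN: usize = {
        let size = size_of::<T>();
        assert!(size != 0 && ALIGN_BYTES % size == 0, "T cannot be aligned");
        ALIGN_BYTES / size
    };
}

/// Const variant: `t_align` is a compile-time constant for each `T`.
pub fn zeros_to_add<T>(len: usize) -> usize {
    let t_align = Align::<T>::T_ALIGN;
    (t_align - len % t_align) % t_align
}
```

For a 4-byte element such as `u32`, `t_align` is 64 in both variants, but only in the second can the modulo be lowered to a mask without relying on inlining.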
From my testing, this helps a lot, because `t_align` will be known at compile time, since it is directly derived from a constant value, and the compiler is more capable of optimizing around that.
Footnotes
1. Note that `Vec` has a specialized implementation that supports `calloc` when initializing zeroed vectors, but only on std types such as integer types.