Move add_to_alignment logic to BufferVec #9928

Draft
wants to merge 1 commit into
base: main

nicopap
Contributor

@nicopap nicopap commented Sep 26, 2023

Objective

The "Uninitialized buffer uniform tail" trick was used by both skinning and morphing.

We should abstract this and merge them to have a consistent and explicit implementation. We may also take the opportunity to optimize it.

Solution

  • Move the add_to_alignment logic to a BufferVec impl block
  • Make the const part of the calculation const, and panic at compile time when alignment is impossible.
  • Devise a new way to extend the BufferVec that is as efficient as possible.
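A minimal sketch of what the const part could look like, using only std. The names `MIN_ALIGN`, `elems_per_align`, and `AlignOf` are hypothetical and not taken from the PR, the 256-byte value is an assumption, and `T::default()` stands in for `T::zeroed()` to avoid a bytemuck dependency.

```rust
use std::marker::PhantomData;
use std::mem::size_of;

/// Assumed alignment requirement in bytes (hypothetical value for this sketch).
const MIN_ALIGN: usize = 256;

/// Number of `T` elements in one aligned block.
/// Panics during const evaluation when `T` cannot tile `MIN_ALIGN` exactly.
const fn elems_per_align<T>() -> usize {
    let size = size_of::<T>();
    assert!(size != 0 && MIN_ALIGN % size == 0, "T cannot be aligned to MIN_ALIGN");
    MIN_ALIGN / size
}

/// Helper type whose associated const forces compile-time evaluation per `T`.
struct AlignOf<T>(PhantomData<T>);

impl<T> AlignOf<T> {
    const ELEMS: usize = elems_per_align::<T>();
}

/// Pads `values` with default (zero) elements up to the next multiple of `ELEMS`.
fn add_to_alignment<T: Copy + Default>(values: &mut Vec<T>) {
    let t_align = AlignOf::<T>::ELEMS;
    let len = values.len();
    let aligned_len = (len + t_align - 1) / t_align * t_align;
    values.resize(aligned_len, T::default());
}
```

With these assumptions, calling `add_to_alignment` on a `Vec<u32>` of length 3 pads it to 64 elements, while instantiating it with a 3-byte type such as `[u8; 3]` should fail to compile rather than panic at run time.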

The goal is to avoid the overhead of push, which checks for available capacity on each iteration.

We set a new capacity and zero the newly allocated memory positions directly, because going through any other Vec method keeps the compiler from optimizing this.
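The reserve-then-zero idea can be sketched roughly as follows. This is not the PR's code: `ZeroSafe` is a hypothetical stand-in for `bytemuck::Zeroable`, and `extend_zeroed` is an illustrative name.

```rust
use std::ptr;

/// Hypothetical marker trait standing in for `bytemuck::Zeroable`:
/// implementors promise that the all-zero bit pattern is a valid value.
unsafe trait ZeroSafe: Copy {}
unsafe impl ZeroSafe for u32 {}
unsafe impl ZeroSafe for f32 {}
unsafe impl ZeroSafe for [f32; 16] {}

/// Appends `to_add` zeroed elements without a per-element capacity check.
fn extend_zeroed<T: ZeroSafe>(values: &mut Vec<T>, to_add: usize) {
    // Grow the allocation once, up front.
    values.reserve(to_add);
    let len = values.len();
    // SAFETY: `reserve` guarantees capacity >= len + to_add, so the written
    // range lies inside the allocation. `T: ZeroSafe` means zeroed bytes are
    // a valid `T`, so every element below the new length is initialized
    // before `set_len` exposes it.
    unsafe {
        ptr::write_bytes(values.as_mut_ptr().add(len), 0, to_add);
        values.set_len(len + to_add);
    }
}
```

Unlike the `set_len` alternative discussed further down, the tail is initialized before the length changes, so the Vec invariant holds.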

Alternatives

I've tried a lot of different approaches to improve performance.

Using buffer.values.extend((0..to_add).map(|_| T::zeroed()))

In isolated code, this inlines the whole operation. In the context of the extract systems, however, it still calls SpecializedExtend as an external function and is slower than the current while-loop solution.

I can confirm that in isolation this is the best option: Range has a TrustedLen impl, which lets the compiler remove a lot of bounds checks and gives the optimizer more to work with. This is in contrast to iter::repeat(T::zeroed()).take(to_add).
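For reference, a self-contained version of the range-based extend might look like the sketch below. `pad_with_extend` and its `align` parameter (counted in elements and assumed nonzero) are hypothetical, and `T::default()` stands in for `T::zeroed()`.

```rust
/// Pads `values` to the next multiple of `align` elements using a
/// `Range`-driven iterator, whose exact length is known up front.
/// `align` must be nonzero.
fn pad_with_extend<T: Default>(values: &mut Vec<T>, align: usize) {
    let len = values.len();
    let aligned_len = (len + align - 1) / align * align;
    let to_add = aligned_len - len;
    values.extend((0..to_add).map(|_| T::default()));
}
```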

Using a zeroed vector

This does an allocation, and rust is not capable of using calloc on bytemuck::Zeroable types [1], so it allocates the vec and pushes zeros to it, then calls ptr::copy_nonoverlapping to copy them to the end of buffer.values. I'm not sure it is any gain over the other solutions, especially when we expect the number of additional zeros to be between 4 and 64.
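The temporary-vector variant, sketched under the same assumptions (`pad_with_temp_vec` is a hypothetical name, and `T::default()` stands in for `T::zeroed()`):

```rust
/// Builds a temporary vector of zeros, then copies it onto the end of `values`.
/// This costs one extra allocation per call.
fn pad_with_temp_vec<T: Default + Clone>(values: &mut Vec<T>, to_add: usize) {
    let zeros = vec![T::default(); to_add];
    values.extend_from_slice(&zeros);
}
```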

Using set_len without initialization

This is very unsafe, as it breaks an important invariant of Vec (no uninitialized memory within len). In general it is unsound in rust to have any value be uninitialized, even for types like i32 where all bit patterns are accepted, because "uninitialized" in C terms means the value is not fixed, which breaks a lot of rust assumptions. According to my research, though, it should be sound here, because the contents of the values field of BufferVec are never read (so fixedness is irrelevant). In fact, wgpu handles it like FFI data, using ptr::copy_nonoverlapping and passing it directly to the driver.

For our specific use case of add_to_alignment, it's fine, because even in the shader we do not read the uninitialized values. I didn't test performance on the current iteration, but with this approach we get a 3% speedup on extract_skinned_meshes.

[Image: set_len_perf-2023-09-25]

However, this requires disabling a forbid clippy lint. I'm comfortable enough to say "this is fine" but I suspect this would be rejected by most of the community.

push with explicit alloc elision

let mut my_vec = Vec::new();
my_vec.reserve(12);
for _ in 0..12 {
  my_vec.push(0);
}

Would you believe that this generates a capacity check for each loop iteration? We know we will never overflow capacity, though! Here is the way to remove them:

let mut my_vec = Vec::new();
my_vec.reserve(12);
for _ in 0..12 {
  if my_vec.len() == my_vec.capacity() {
    // SAFETY: reserve(12) above guarantees room for all 12 pushes.
    unsafe { std::hint::unreachable_unchecked() };
  }
  my_vec.push(0);
}

When applying this to the add_to_alignment method, we get something pretty nice. But for some reason we still have individual increments of the len field, and each 0 is added individually.
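Applied to a padding helper, the hint could be wrapped up as in the sketch below. `pad_with_push` is a hypothetical name, and the element type is fixed to `u32` for brevity.

```rust
/// Pushes `to_add` zeros, telling the optimizer that the growth branch
/// inside `push` can never be taken.
fn pad_with_push(values: &mut Vec<u32>, to_add: usize) {
    values.reserve(to_add);
    for _ in 0..to_add {
        if values.len() == values.capacity() {
            // SAFETY: `reserve(to_add)` guarantees capacity for `to_add` more
            // elements, and this loop pushes at most `to_add` of them, so
            // `len < capacity` holds at the top of every iteration.
            unsafe { std::hint::unreachable_unchecked() };
        }
        values.push(0);
    }
}
```

The hint is only valid because the reserve and the loop bound use the same count. If the two ever diverge, this becomes undefined behavior.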

prefer consts

One important insight is that the compiler handles values derived from consts much better.

So instead of:

let len = buffer.values.len();
let t_aligned_len = div_ceil(len, t_align) * t_align;
let to_add = t_aligned_len - len;
buffer.values.extend((0..to_add).map(|_| T::zeroed()));

We could do:

buffer.values.extend((0..t_align).map(|_| T::zeroed()));
buffer.values.truncate(t_aligned_len);

From my testing, this helps a lot, because t_align will be known at compile time, since it is directly derived from a constant value, and the compiler is more capable of optimizing around that.
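A self-contained sketch of the extend-then-truncate variant, with the aligned length computed before extending. `T_ALIGN = 64` is an assumed element count for `u32`, and `pad_const` is a hypothetical name.

```rust
/// Assumed number of elements per aligned block (hypothetical value).
const T_ALIGN: usize = 64;

/// Always extends by the compile-time constant `T_ALIGN`, then trims the
/// excess, so the extend loop has a constant trip count.
fn pad_const(values: &mut Vec<u32>) {
    let len = values.len();
    let t_aligned_len = (len + T_ALIGN - 1) / T_ALIGN * T_ALIGN;
    values.extend((0..T_ALIGN).map(|_| 0u32));
    values.truncate(t_aligned_len);
}
```

Because the aligned length never exceeds `len + T_ALIGN`, the truncate only ever removes padding that was just added.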

Footnotes

  1. Note that Vec has a specialized implementation that supports calloc when initializing zeroed vectors, but only on std types such as integer types.

@nicopap nicopap added A-Rendering Drawing game state to the screen C-Code-Quality A section of code that is hard to understand or change labels Sep 26, 2023
@nicopap nicopap added this to the 0.13 milestone Oct 25, 2023
@alice-i-cecile alice-i-cecile removed this from the 0.13 milestone Jan 24, 2024
@janhohenheim
Member

Triage: has merge conflicts and is draft
@lkolbly do you want to update this PR or should I tag it S-Needs-Adoption? :)

@janhohenheim janhohenheim added the S-Waiting-on-Author The author needs to make changes or address concerns before this can be merged label May 17, 2025
Labels
A-Rendering Drawing game state to the screen C-Code-Quality A section of code that is hard to understand or change S-Waiting-on-Author The author needs to make changes or address concerns before this can be merged

3 participants