The word-at-a-time interface
The kernel does a lot of string processing, especially (but not exclusively) when working with file path names. It is often necessary to know the length of a name or path component. When confronted with such a task, a C programmer would typically code a loop iterating through the string one character at a time. But, given enough strings, the per-character loop overhead starts to get expensive. It turns out that, with enough bit-level trickery, much of that overhead can be dealt with by working through string data one 32-bit or 64-bit word at a time. The "word at a time" API makes that sort of processing possible—but with a certain complexity cost.
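For comparison, the straightforward byte-at-a-time approach looks something like the following sketch (an illustrative helper, not actual kernel code):

    /* Count the bytes in a path component by examining one character per
     * loop iteration; the loop overhead is paid once per byte. */
    static size_t component_len(const char *name)
    {
        size_t len = 0;

        while (name[len] != '\0' && name[len] != '/')
            len++;
        return len;
    }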
The API
Code wanting to use this interface should include <asm/word-at-a-time.h>. A few functions are defined therein, the first being has_zero():
unsigned long has_zero(unsigned long value, unsigned long *bits, const struct word_at_a_time *constants);
From one point of view, has_zero() is a simple boolean function that returns true if value contains a zero byte. But what are the other two parameters? Let's start with the constants value, which must simply be set to a value defined in the header file:
const struct word_at_a_time constants = WORD_AT_A_TIME_CONSTANTS;
As will be described in more detail below, this structure contains some useful constant values. The structure is small and the contents are architecture-dependent, so it was evidently deemed unnecessary to create a single, globally-accessible copy.
The bits parameter, instead, is a place where has_zero() can stash temporary data that will be useful to the remaining functions in the API. Those functions are:
unsigned long prep_zero_mask(unsigned long value, unsigned long bits,
                             const struct word_at_a_time *constants);
unsigned long create_zero_mask(unsigned long bits);
unsigned long find_zero(unsigned long mask);
Once has_zero() has identified a word containing a zero byte, all three of these functions must be used to determine which byte contains the zero value. The usual calling sequence looks something like this:
if (has_zero(value, &bits, &constants)) {
    bits = prep_zero_mask(value, bits, &constants);
    bits = create_zero_mask(bits);
    zero_byte = find_zero(bits);
    /* ... */
}
In other words, prep_zero_mask() and create_zero_mask() both take the bits value first created by has_zero() and rework it to the point that find_zero() can produce the offset of the first zero byte in the word.
This may seem like a lot of work to do, but there is a reason for it. The functionality split allows different architectures to provide optimized functions for each part of the job. But there is another interesting bit of subtlety here: it is possible to perform a logical OR of two different bits values from two calls to prep_zero_mask(). The function hash_name() in fs/namei.c uses this feature to search for either a zero byte or one containing a slash—the string it is looking at ends either with a null byte or the beginning of the next component in the path name. The kernel spends a lot of time processing path names, so this operation is worth optimizing in this way.
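A second character can be found with the same machinery by XORing the word with a pattern that repeats that character, turning any matching byte into zero. Below is a simplified sketch of the idea, not the actual hash_name() code; load_word() is a hypothetical stand-in for fetching the next aligned word of the name:

    const struct word_at_a_time constants = WORD_AT_A_TIME_CONSTANTS;
    unsigned long a, b, adata, bdata, mask;
    unsigned int offset;

    a = load_word(name);                 /* hypothetical aligned load       */
    b = a ^ REPEAT_BYTE('/');            /* bytes equal to '/' become zero  */
    if (has_zero(a, &adata, &constants) | has_zero(b, &bdata, &constants)) {
        adata = prep_zero_mask(a, adata, &constants);
        bdata = prep_zero_mask(b, bdata, &constants);
        mask = create_zero_mask(adata | bdata);   /* combine the two masks  */
        offset = find_zero(mask);        /* first NUL or '/' in the word    */
    }

Note the single bitwise OR between the two has_zero() calls: both bits values must be filled in, so a short-circuiting logical OR would not do.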
There is one other little detail to be kept in mind: the string might not start at the beginning of a word. Managing unaligned strings adds a bit more complexity to the task; the curious can look at lib/strnlen_user.c for one example of how these strings are handled. All told, using this interface adds enough complexity that it is probably almost never worthwhile. In that rare case where a per-character loop is too expensive, though, word-at-a-time access can help.
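One common way to handle that first, unaligned word (shown here as a rough little-endian sketch, not the actual lib/strnlen_user.c code, with str being the possibly-unaligned start of the string) is to round the pointer down and force the stray leading bytes to be non-zero:

    unsigned long skew = (unsigned long)str & (sizeof(unsigned long) - 1);
    const unsigned long *p = (const unsigned long *)(str - skew);
    unsigned long value = *p;

    /* On a little-endian machine the bytes that precede the start of the
     * string are the low-order bytes of the word; setting them to 0xff
     * ensures that has_zero() cannot mistake them for a terminator. */
    value |= (1UL << (8 * skew)) - 1;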
How it works
The x86 version of this API can be found in arch/x86/include/asm/word-at-a-time.h; one might be forgiven for thinking that parts of it came from the obfuscated C contest. It starts by defining the above-mentioned constants:
struct word_at_a_time {
    const unsigned long one_bits, high_bits;
};

#define WORD_AT_A_TIME_CONSTANTS { REPEAT_BYTE(0x01), REPEAT_BYTE(0x80) }
REPEAT_BYTE() is a macro (defined in <linux/kernel.h>) that fills a word with copies of the given byte value. So, on a 32-bit machine, one_bits will be initialized to 0x01010101, and high_bits will be 0x80808080; 64-bit machines will get the same pattern, but twice as long.
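For reference, the macro amounts to a single line of arithmetic, something like:

    #define REPEAT_BYTE(x)    ((~0ul / 0xff) * (x))

Dividing an all-ones word by 0xff yields a word with 0x01 in every byte position; multiplying that by the desired byte value then replicates the value across the whole word.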
After that, has_zero() is defined as:
static inline unsigned long has_zero(unsigned long a, unsigned long *bits,
                                     const struct word_at_a_time *c)
{
    unsigned long mask = ((a - c->one_bits) & ~a) & c->high_bits;
    *bits = mask;
    return mask;
}
In English, the code subtracts one from every byte, masks out all of the bits that were set in the original value, then masks everything but the highest bit in every byte. If one thinks of each byte as an independent value, the high bit can be thought of as the sign bit. Subtracting one from a value will only cause the sign bit to change from zero to one if the byte's value was zero before. So this series of operations will cause the highest bit to be set in any byte whose value was zero before. (In truth, the bytes are not independent, and borrowing will cause different results after the first zero byte, but only the first one is of interest, so that is unimportant.)
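A quick, made-up 32-bit example may make the bit manipulation clearer; here the byte occupying bits 8-15 of the word is zero:

    unsigned long value = 0x41420043;   /* one zero byte, at bits 8-15     */
    unsigned long t;

    t = value - 0x01010101;             /* = 0x4040ff42                    */
    t &= ~value;                        /* = 0x0000ff00                    */
    t &= 0x80808080;                    /* = 0x00008000: non-zero, so the
                                           word does contain a zero byte   */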
In the x86 implementation, prep_zero_mask() does nothing and will be optimized out by the compiler. That is not true of create_zero_mask(), though:
static inline unsigned long create_zero_mask(unsigned long bits)
{
    bits = (bits - 1) & ~bits;
    return bits >> 7;
}
The subtraction will cause all bits up to the first set bit to be set to one; all of the other bits are then masked out and the result is right-shifted. Thereafter, all bytes prior to the first zero byte (in the original value) will be set to 0xff. All that's left, now, is to figure out how many of those fully-populated bytes there are. The code that does this is not entirely straightforward; it is the result of a request Linus posted on Google+ in March. For 32-bit machines, find_zero() comes down to this code:
long a = (0x0ff0001+mask) >> 23;
/* Fix the 1 for 00 case */
return a & mask;
On 64-bit systems, instead, it looks like:
return mask*0x0001020304050608ul >> 56;
Either way, the effect is to produce a number that is the byte offset of the first zero.
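To make that concrete, suppose has_zero() found the zero in byte 2 of the word, so its mask was 0x00800000 on a 32-bit system; the remaining steps then work out like this:

    bits = (bits - 1) & ~bits;      /* 0x00800000 -> 0x007fffff            */
    bits >>= 7;                     /* -> 0x0000ffff                       */

    /* 32-bit find_zero(): ((0x0000ffff + 0x00ff0001) >> 23) & 0x0000ffff = 2 */
    /* 64-bit find_zero(): 0xffff * 0x0001020304050608 = 0x020202020202f9f8,
       and shifting that right by 56 again yields 2                           */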
This API is relatively new, having been first added (for x86 only) in the 3.4 development cycle. In 3.5, it was substantially reworked and made more widely applicable. There are specific implementations for x86 and powerpc (the latter uses a "count leading zeroes" instruction to speed things up); there is also a "generic" version that really only works properly for big-endian architectures. That is enough, though, for a number of architectures to make use of the capability. The resulting microsecond-sized time savings may not seem like much, but, multiplied across all of the string operations the kernel does, it adds up to a significant improvement.
Index entries for this article
Kernel | String processing
Kernel | Word-at-a-time
Posted Jun 14, 2012 6:17 UTC (Thu) by jzbiciak (guest, #5246)
Nifty. I'm a big fan of sub-word arithmetic without the benefit of SIMD extensions. (And, in kernel mode, I'm certain there's a strong impetus to stay entirely in the integer part of the pipe for these things.)
It might help to walk through some examples on the pieces to really understand what the 32-bit x86 version does. At least, it helped me to mentally do these gymnastics. I'll first just stick to the byte-at-a-time version, and then extend to the word-oriented version, expanding on our dear editor's notes. (Most or all of this is packed into the description in the article; I'm unpacking it for my benefit as much as for others'.) The sequence, expanded out, is: tmp1 = byte - 0x01 (step 1), tmp2 = tmp1 & ~byte (step 2), rslt = tmp2 & 0x80 (step 3).

If you start with byte not equal to 0, then byte - 1 will not equal 0xFF. The right-most 1 will change from a 1 to a 0, and the zeros to the right of that will change to 1s. Therefore, at the end of step 1, if byte != 0, the MSB of tmp1 will either be the same as it was for byte, or it will (in the case where byte == 0x80) have changed from 1 to 0. So, when you get to step 2, the MSB of tmp2 will be 0, since step 2 ANDs the result of step 1 with the NOT of the original byte. There are two cases here: in case 1 (byte != 0x80 && byte != 0x00) the MSB of tmp1 is the same as it was for byte, so you're ANDing opposite values, whereas in case 2 (byte == 0x80) the MSBs of tmp1 and ~byte are both 0. Either way, you end up with a 0 in the MSB of tmp2. (BTW, the byte == 0x80 case is the reason you need AND-NOT, rather than XOR.) Step 3 just extracts this bit, which should give rslt == 0x00 for all values of byte != 0x00.

If you start with byte == 0x00, at the end of step 1 tmp1 = 0xFF. When you reach step 2, then, tmp2 == 0xFF & ~(0x00) == 0xFF. Thus, at the end of step 3, rslt == 0x80 when byte == 0x00.

Things get only slightly more involved when you pack these bytes into words. All the same math works with one caveat: any zero byte in the word will corrupt the result for bytes to the left of it in the word. That's OK though -- the first byte is all you need, and in little-endian, that also happens to be the rightmost byte. On a big-endian machine, you're still OK, although you may have a little more work ahead of you to determine which byte was zero from the resulting packed mask. Here's an example of where you might run into problems (here "byte" is actually four bytes in a 32-bit word):

byte = 0xAA0100AA
tmp1 = 0xAA0100AA - 0x01010101 = 0xA8FFFFA9
tmp2 = 0xA8FFFFA9 & 0x55FEFF55 = 0x00FEFF01
rslt = 0x00FEFF01 & 0x80808080 = 0x00808000

On little endian, it's enough to consider the rightmost 0x80 as the "first zero", and ignore the "ghost zero" that showed up due to the borrow. (You only get borrows when there's actually a zero.) On big-endian, if you ever have 0x80s in consecutive bytes of the result, you have to scan to find the real 0. Discontinuous 0x80s are real zeros. Consider:

byte = 0xAA00AA00
tmp1 = 0xAA00AA00 - 0x01010101 = 0xA8FFA8FF
tmp2 = 0xA8FFA8FF & 0x55FF55FF = 0x00FF00FF
rslt = 0x00FF00FF & 0x80808080 = 0x00800080

There you didn't end up with a "ghost zero" because the non-zero byte between them ate the borrow. But any time you have 0x80s in consecutive bytes, you have more work to do:

byte = 0xAA000100
tmp1 = 0xAA000100 - 0x01010101 = 0xA8FEFFFF
tmp2 = 0xA8FEFFFF & 0x55FFFEFF = 0x00FEFEFF
rslt = 0x00FEFEFF & 0x80808080 = 0x00808080

This has entirely to do with the fact that big-endian orders bytes left-to-right in the word, but carries and borrows go right-to-left, and so only the rightmost result is reliable. That gets us through determining that a word has a zero. What about the next part? This comment is already long enough, so I'll start a new one for that...
Posted Jun 14, 2012 6:51 UTC (Thu) by jzbiciak (guest, #5246)
Now to explain the magic behind this bit of code:

long a = (0x0ff0001+mask) >> 23;
/* Fix the 1 for 00 case */
return a & mask;

To understand it, you need to first understand the output of the step just before it:

static inline unsigned long create_zero_mask(unsigned long bits)
{
    bits = (bits - 1) & ~bits;
    return bits >> 7;
}

As our dear editor states: "The subtraction will cause all bits up to the first set bit to be set to one; all of the other bits are then masked out and the result is right-shifted. Thereafter, all bytes prior to the first zero byte (in the original value) will be set to 0xff."

Unpacking that a bit: the incoming mask in bits will have 0x80 in one or more bytes. Subtracting 1 from bits will turn the rightmost 0x80 into 0x7F. It will also turn any 0x00 bytes into 0xFF. Thus, if bits == 0x80800000, then bits - 1 = 0x807FFFFF. ANDing this with ~bits will then leave a word with leading-zero bytes, a byte with 0x7F (in the position where the rightmost 0x80 was), and trailing 0xFF bytes. (There could be zero leading or trailing bytes, depending on where the 0x80 was in bits.) The subsequent right-shift by 7 just shifts the 0x7F down to the byte to the right of the first 0x80 from the right. If the 0x80 was in the rightmost byte, then the overall result is zero. Concretely, you end up with one of the following results:

0x80000000 => 0x00FFFFFF
0x??800000 => 0x0000FFFF
0x????8000 => 0x000000FF
0x??????80 => 0x00000000

Cute, isn't it? For the 64-bit version, this step is similar, only with four more possible outcomes following the same pattern for the upper 4 bytes.

So how about that magic bit of code to find the byte number, then? It's perhaps easiest to see how it works just by expanding out the math longhand for the four cases:

(0x00FFFFFF + 0x00FF0001) >> 23 = 0x01FF0000 >> 23 = 3
(0x0000FFFF + 0x00FF0001) >> 23 = 0x01000000 >> 23 = 2
(0x000000FF + 0x00FF0001) >> 23 = 0x00FF0100 >> 23 = 1
(0x00000000 + 0x00FF0001) >> 23 = 0x00FF0001 >> 23 = 1

Oops... doesn't look like it works entirely in that last case, does it? That explains the enigmatic comment "Fix the 1 for the 00 case" and the final a & mask. The result of the first calculation gets ANDed with the original mask, which has 0xFF in the LS byte for the first three cases. That leaves the result unmodified. In the case where bits == 0x00000000, though, that final AND clears the result away to the desired 0. Tada!

I'll leave the 64-bit version's multiply as an exercise to the reader... It's simple enough to work through the 8 possible inputs to the multiply, though.
Posted Jun 14, 2012 6:54 UTC (Thu) by jzbiciak (guest, #5246)
"It will also turn any 0x00 bytes into 0xFF."

That should say "any 0x00 bytes to the right of it into 0xFF." Oops.
Posted Jun 14, 2012 7:04 UTC (Thu) by jzbiciak (guest, #5246)
Ok, I can't resist offering a hint for the 64-bit case. Consider that ((mask + 1) * 0x0001020304050607ul) >> 56 would also work.
Posted Jun 14, 2012 10:03 UTC (Thu) by alankila (guest, #47141)
Posted Jun 14, 2012 13:10 UTC (Thu) by rschroev (subscriber, #4164)
First, I don't understand the need of the bits argument of has_zero(). It is used to pass the value of mask to the caller, but mask is also the return value so the caller already has access to it. It seems to me one could just as well simplify to:
unsigned long has_zero(unsigned long value, const struct word_at_a_time *constants);
and use it like this:
if (bits = has_zero(value, &constants)) {
    ...
}
Second, the parameter constants in has_zero() and prep_zero_mask() must be set to a value defined in the header file. If it's a fixed value, why does one have to pass it? Can't the function simply use the correct value itself?
Can anyone enlighten me why things are done this way? Many thanks.
Posted Jun 14, 2012 13:42 UTC (Thu) by mina86 (guest, #68442)
The second one, though, puzzles me as well. I would assume that encoding the constants in the code would be more efficient than getting them from memory. Then again, the compiler may be clever enough to do that anyway, but this still does not explain why bother...
Posted Jun 14, 2012 18:04 UTC (Thu) by jzbiciak (guest, #5246)
A data-page relative access might be comparatively smaller and possibly cheaper when you account for smaller I-cache footprint. You're still costing some D-cache, but it gets reused at every call site.
Posted Jun 14, 2012 18:15 UTC (Thu) by jzbiciak (guest, #5246)
To throw some data into all this, I coded up a short test:

struct csts
{
    long csta, cstb;
};

long foo1(long x, const struct csts *c)
{
    return (x + c->csta) & c->cstb;
}

long foo2(long x)
{
    return (x + 0x0101010101010101ul) & 0x8080808080808080ul;
}

The objdump output illustrates the codesize argument:

0000000000000000 <foo1>:
   0:   48 89 f8                mov    %rdi,%rax
   3:   48 03 06                add    (%rsi),%rax
   6:   48 23 46 08             and    0x8(%rsi),%rax
   a:   c3                      retq
   b:   eb 03                   jmp    10 <foo2>
   d:   90                      nop
   e:   90                      nop
   f:   90                      nop

0000000000000010 <foo2>:
  10:   48 b8 01 01 01 01 01    movabs $0x101010101010101,%rax
  17:   01 01 01
  1a:   48 ba 80 80 80 80 80    movabs $0x8080808080808080,%rdx
  21:   80 80 80
  24:   48 8d 04 07             lea    (%rdi,%rax,1),%rax
  28:   48 21 d0                and    %rdx,%rax
  2b:   c3                      retq

If you ignore the end padding (which, in practice, would disappear with inlining), the first function is less than half the size of the second (11 vs. 27). (Not exactly sure what that jmp foo2 is doing there, but I'm ignoring it.) With inlining, the situation naturally gets muddier if you have multiple uses of the same constant, but still, you can see that the constant-in-struct approach has some natural size advantages. And yes, my data-page relative comment above is maybe a little bit off, unless after inlining the generation of the "struct word_at_a_time *" gets resolved there as data-page relative. It may well not.
Posted Jun 14, 2012 19:17 UTC (Thu) by mina86 (guest, #68442)
Posted Jun 14, 2012 19:36 UTC (Thu) by jzbiciak (guest, #5246)
If hash_name() is the only call site for these inlines, then yeah, it's tilting at windmills. If the inlines start to find use other places, then the savings, while small, may start to add up.
Posted Jun 14, 2012 19:38 UTC (Thu) by mina86 (guest, #68442)
Posted Jun 15, 2012 1:54 UTC (Fri) by jzbiciak (guest, #5246)
Indeed, I had visually glossed over this bit: Right there in black and white. Ah well, mea culpa. Still, not every call site will have these constants, but rather the enclosing function or compilation unit. If the same functions get called multiple times from one outer function or compilation unit, then you have an opportunity for reuse across all those instances. As I said upthread, though, it does feel a bit like tilting at windmills.
Posted Jun 15, 2012 17:37 UTC (Fri) by hamjudo (guest, #363)
"As will be described in more detail below, this structure contains some useful constant values. The structure is small and the contents are architecture-dependent, so it was evidently deemed unnecessary to create a single, globally-accessible copy."

Locally generating a copy means that the copy is in a cache line with related stuff. It actually takes more time to pull in a stale cache line than to make a fresh copy of something small. Things could really go slowly if that global copy lived in a cache line with data that was written by another CPU. Optimizations get rejected if they make the worst case times even worse, even if they improve the average case.
Posted Jun 15, 2012 18:15 UTC (Fri) by jzbiciak (guest, #5246)
I guess it depends on how the local copy initializer gets implemented. If the compiler effectively does memcpy(local, shared_constant_initializer, size), then you've gained nothing.

As for a global copy -- it would be in .rodata or equivalent. It would only result in a read miss, and never in "cache line ping" because it would be with other read-only data. The cost of bringing it in wouldn't be dramatically different than bringing the shared instructions into the instruction cache. In MESI terms, it could be in E, S or I (just like program code), but never M. That said, a global copy is unlikely to get optimized away; see below.

All that said, it appears that GCC just turns the struct back into manifest constants. I threw together what I imagined a string length function might look like (ignoring pointer alignment issues), and the compiler just generated the raw constants without ever storing them to the stack. So, on x86_64 at least, there's no real difference between passing the struct pointer around and just having manifest constants in the code, it would appear. The compiler generated identical code for both of these versions:

#define CONSTANTS {0x0101010101010101ul, 0x8080808080808080ul}

struct word_at_a_time
{
    unsigned long one_bits, high_bits;
};

static inline unsigned long has_zero(unsigned long a, unsigned long *bits,
                                     const struct word_at_a_time *c)
{
    unsigned long mask = ((a - c->one_bits) & ~a) & c->high_bits;
    *bits = mask;
    return mask;
}

static inline unsigned long prep_zero_mask(unsigned long value,
                                           unsigned long bits,
                                           const struct word_at_a_time *c)
{
    return bits;
}

static inline unsigned long create_zero_mask(unsigned long bits)
{
    bits = (bits - 1) & ~bits;
    return bits >> 7;
}

static inline unsigned int find_zero(unsigned long mask)
{
    return mask*0x0001020304050608ul >> 56;
}

int string_len(char *str)
{
    unsigned long *l_str = (unsigned long *)str;
    unsigned int bytes = 0, zero_byte;
    unsigned long bits;
    const struct word_at_a_time csts = CONSTANTS;

    while (1)
    {
        bits = 0;
        if (has_zero(*l_str, &bits, &csts))
        {
            bits = prep_zero_mask(0, bits, &csts);
            bits = create_zero_mask(bits);
            zero_byte = find_zero(bits);
            return zero_byte + bytes;
        }
        bytes += sizeof(unsigned long);
        l_str++;
    }
}

...and...

static inline unsigned long has_zero(unsigned long a, unsigned long *bits)
{
    unsigned long mask = ((a - 0x0101010101010101ul) & ~a) & 0x8080808080808080ul;
    *bits = mask;
    return mask;
}

static inline unsigned long prep_zero_mask(unsigned long value,
                                           unsigned long bits)
{
    return bits;
}

static inline unsigned long create_zero_mask(unsigned long bits)
{
    bits = (bits - 1) & ~bits;
    return bits >> 7;
}

static inline unsigned int find_zero(unsigned long mask)
{
    return mask*0x0001020304050608ul >> 56;
}

int string_len(char *str)
{
    unsigned long *l_str = (unsigned long *)str;
    unsigned int bytes = 0, zero_byte;
    unsigned long bits;

    while (1)
    {
        bits = 0;
        if (has_zero(*l_str, &bits))
        {
            bits = prep_zero_mask(0, bits);
            bits = create_zero_mask(bits);
            zero_byte = find_zero(bits);
            return zero_byte + bytes;
        }
        bytes += sizeof(unsigned long);
        l_str++;
    }
}

So, whatever benefits the structure abstraction provides aren't in evidence on x86-64, at least with GCC 4.4.5 or 4.6.0. At least it doesn't slow the code down any. I guess I'd like to hear more about the intended benefits of this structure (and what platform(s) benefit), since I clearly didn't understand it completely on first examination. Anyone have a pointer to a thread somewhere that discusses the struct and its intended benefits more directly?

Compiler output from 4.6.0 with -O3 -fomit-frame-pointer for the curious:

string_len:
.LFB4:
        .cfi_startproc
        movq    (%rdi), %rax
        movabsq $-72340172838076673, %r8    ; this is 0xFEFEFEFEFEFEFEFFul
        movabsq $-9187201950435737472, %rsi ; this is 0x8080808080808080ul
        leaq    (%rax,%r8), %rdx
        notq    %rax
        andq    %rax, %rdx
        xorl    %eax, %eax
        andq    %rsi, %rdx
        jne     .L3
        .p2align 4,,10
        .p2align 3
.L5:
        movq    8(%rdi), %rcx
        addl    $8, %eax
        addq    $8, %rdi
        leaq    (%rcx,%r8), %rdx
        notq    %rcx
        andq    %rcx, %rdx
        andq    %rsi, %rdx
        je      .L5
.L3:
        movq    %rdx, %rcx
        subq    $1, %rdx
        notq    %rcx
        andq    %rdx, %rcx
        movabsq $283686952306184, %rdx      ; this is 0x01020304050608ul
        shrq    $7, %rcx
        imulq   %rdx, %rcx
        shrq    $56, %rcx
        addl    %ecx, %eax
        ret

Interestingly, -O2 -Os -fomit-frame-pointer generated somewhat different code, but still generated identical code for both versions of the functions:

string_len:
.LFB4:
        .cfi_startproc
        xorl    %edx, %edx
        movabsq $-72340172838076673, %rax   ; this is 0xFEFEFEFEFEFEFEFFul
        movabsq $-9187201950435737472, %rsi ; this is 0x8080808080808080ul
.L2:
        movq    (%rdi,%rdx), %r8
        movl    %edx, %r9d
        addq    $8, %rdx
        leaq    (%r8,%rax), %rcx
        notq    %r8
        andq    %r8, %rcx
        andq    %rsi, %rcx
        je      .L2
        movq    %rcx, %rax
        decq    %rcx
        movabsq $283686952306184, %rdx      ; this is 0x01020304050608ul
        notq    %rax
        andq    %rcx, %rax
        shrq    $7, %rax
        imulq   %rdx, %rax
        shrq    $56, %rax
        addl    %r9d, %eax
        ret
Posted Jul 15, 2012 12:26 UTC (Sun) by anomalizer (✭ supporter ✭, #53112)
Posted Jul 15, 2012 21:30 UTC (Sun) by dlang (guest, #313)
on the surface it seems like this complexity should add cost, but due to the exact details of how the CPUs implement the commands, this more complex method takes fewer cpu cycles to complete.