An unexpected way to do strlen()

The general rule of thumb is that if you have a function written in an interpreted language (Perl, Python, etc.), rewriting it in C will make it faster.  If you need it faster than that, you go to assembly.  Prepare to have your world rocked: Venkatesh Srinivas found that the strlen() in libc was actually slower written in assembly than in C.  His commit message has numbers to back that up.

8 Replies to “An unexpected way to do strlen()”

  1. This post is quite disingenuous. The version of strlen() written in assembly language made sense when the 80286 and 80386 were popular, but it should have been rewritten from scratch the moment the 80486 was released, again when the Pentium was released, and replaced with a native C implementation once compilers generated code good enough to equal human effort.

    Starting from the 80486 and continuing onward, dedicated string instructions have been (much) slower than explicit loops, and have been advertised as such for about as long. Why, then, is this coming as a surprise to anyone, and more importantly, why is it that assembly language once again gets a “bad rap” because of people’s ignorance of basic CPU architecture? Need I remind everyone that C compilers compile *to* assembly language? What happens when the compiler writers themselves no longer understand the target they’re compiling to?

    BTW, want to know what is even faster than using strlen()? Use counted-strings instead of terminated-strings. Getting the length of the string is then an O(1) operation. NUL-terminated strings are evil, evil, evil, evil, evil. Avoid them at all costs.

    It really saddened me to read this article, clearly written to give support to the STUPID, STUPID, STUPID falsism that “C is faster than hand-coded assembly” that so many people seem to support these days. As someone skilled in the art of assembly language programming, I find it factually inaccurate and outright insulting.
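    The counted-string approach the commenter advocates can be sketched in C. This is a minimal illustration, not any particular library's API; the type and function names here are made up:

```c
#include <stddef.h>
#include <string.h>

/* A counted string: the length travels with the bytes, so no NUL
   scan is ever needed. Names are illustrative only. */
typedef struct {
    size_t len;
    const char *data;   /* need not be NUL-terminated */
} cstr;

/* O(1): the length is just a stored field. */
static size_t cstr_len(const cstr *s)
{
    return s->len;
}

/* The O(n) scan happens at most once, at construction from a
   plain C string. */
static cstr cstr_from(const char *z)
{
    cstr s = { strlen(z), z };
    return s;
}
```

    With this layout, a strlen()-style query is a single field read, at the cost of carrying the length everywhere and losing drop-in compatibility with the C string functions.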

  2. Zoey4ever — I have seen the NetBSD versions of the string routines; they are certainly faster, primarily because they work word-at-a-time. However, I am not sure that is actually a valid optimization — there are two places where there can be problems. First, if you put a one-character string and its NUL byte as the last two bytes of a valid page, with the subsequent page unmapped, any word-wise strlen() will generate a page fault. This is solvable if you are careful not to cross page boundaries with word reads. Second, it is not valid C to read beyond the end of a static array. It will probably work, but I’d rather strlen() not depend on that.

    Salvio — The original implementation was written in 1993, before FreeBSD even existed; no one had gotten around to taking a look at the string implementation since. The surprise was not that this implementation was slow on modern systems, just that no one had looked in this corner.

    I was surprised at gcc’s take on the naive C strlen — it generates a simple loop, roughly this:
    ; eax = edx = string base
    CMP BYTE PTR [edx], 0x0
    JE done
    loop:
    ADD EAX, 0x1
    CMP BYTE PTR [eax], 0x0
    JNE loop
    done:

    Bytewise compares shouldn’t have been _that_ much faster; but oh well.
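    For reference, the byte-at-a-time C loop that gcc compiles into roughly the listing above looks like this (a generic naive strlen, not the exact libc source):

```c
#include <stddef.h>

/* Naive bytewise strlen: test one byte per iteration.
   gcc turns this into a compare/increment loop much like
   the assembly shown above. */
static size_t strlen_naive(const char *s)
{
    const char *p = s;
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}
```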

    Wrt the string instructions — they are not always a bad idea. REP MOVS and STOS can operate cacheline-at-a-time if a particular MSR bit is set (fast string enable or some such) and strings are aligned correctly; under those conditions, they are pretty good ways to move data. Even Atom has some optimizations for small count REP string ops.

  3. Samuel, I didn’t think this article was supporting the idea of C being faster than assembly — it’s expressing surprise that this could even happen in what is, as you described, a weird, out-of-date chunk of code.

    Any sentence where I use a phrase like “Prepare to have your world rocked” should perhaps not be considered a prescriptive description of language quality. This is more of a man-bites-dog sort of item.

  4. For what it’s worth, I was able to write an assembler version that worked out even faster on the Core 2, but it checks in much slower (slower even than REP SCASB) on the K8:

    movl 4(%esp), %eax
    movl %eax, %edx
    L1:
    movzbl (%eax), %ecx
    leal 1(%eax), %eax
    test %ecx, %ecx
    jne L1

    subl $1, %eax
    subl %edx, %eax
    ret

    The LEA for the increment is kinda important — switching to INC or ADD hurt performance measurably; ditto for swapping the test and the lea. The latter was unexpected; usually tests do better back-to-back with their load. Oh well.

    On the Core 2, the 10-byte string test fell from 9.5 sec to 4.6 sec, the 68-byte string fell from 19.8 to 16.8 sec, and the 175-byte string fell from 40 sec to 33 sec; I don’t have the exact numbers for the K8 machine handy, but the 77 sec 175-byte run ballooned to over 200 sec… oh well.

  5. Venkatesh: the inner loop in NetBSD’s strlen is aligned. That ensures that no page boundaries are crossed, since an aligned word load never straddles a page. It also avoids the small penalty for unaligned access in the hot path.
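    The aligned, word-at-a-time approach discussed in the comments above can be sketched like this. It is a generic illustration of the technique (using the well-known zero-byte bit trick), not NetBSD's actual code, and it assumes pages are a multiple of the word size, so an aligned word load never crosses a page boundary:

```c
#include <stddef.h>
#include <stdint.h>

size_t strlen_words(const char *s)
{
    const char *p = s;

    /* Align byte-by-byte first: after this, every word load is
       aligned and therefore never straddles a page boundary. */
    while ((uintptr_t)p % sizeof(unsigned long) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    /* 0x0101...01 and 0x8080...80 patterns for any word size. */
    const unsigned long ones = (unsigned long)-1 / 0xFF;
    const unsigned long highs = ones << 7;

    /* (w - ones) & ~w & highs is nonzero iff some byte of w is 0.
       Note that the word holding the NUL may cover bytes past the
       string's end — the validity concern raised in comment 2. */
    const unsigned long *w = (const unsigned long *)p;
    while (((*w - ones) & ~*w & highs) == 0)
        w++;

    /* Locate the exact NUL within the final word. */
    p = (const char *)w;
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}
```

    Real code also has to mind C's aliasing rules when casting the char pointer to a word pointer; this sketch glosses over that.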
