NUMA and cache-line locality in DragonFly

Matthew Dillon has been doing a significant amount of work on cache lines, and I haven’t been linking to it because it’s hard to point at single commits with such a technical subject.  However, he’s summarized it all, along with news on NUMA handling and vkernel improvements.

21 Replies to “NUMA and cache-line locality in DragonFly”

  1. Really good post to read.

    Like others have said, I’m itching to read a benchmark comparison of DragonFly 4.8 vs *BSD vs Linux

  2. We will have one soon, for networking – Sepherosa has a paper in for AsiaBSDCon 2017. I have not seen the paper yet, but I assume it will be published very soon, since AsiaBSDCon is, I think, a week away.

  3. Sepherosa

    Your benchmark is interesting.

    It firmly debunks the myth that FreeBSD is best at networking.

    It’s actually the worst, by a mile.

  4. I’m surprised to see Linux tied for 1st and FreeBSD in last by a large margin.

  5. Anonymous – “FreeBSD is worst” is not an accurate statement – these are tests of network load, so it’s a specific scenario. Also, the results are segmented by operating system, but that doesn’t explain the cause – the last benchmark appears to show the best results for Linux simply because the driver for that network card uses more of the available hardware than on any other system. I’d expect all the BSDs to perform relatively better if they had the same access to the hardware. (Which is probably a matter of working on the driver; I don’t know the details because I don’t write network drivers.)

    That’s my long-winded way of saying don’t put too much stock in benchmarks; they can be good for comparative analysis, but they are also dangerous.

  6. >>”I’d expect all the BSDs to perform relatively better … which is probably a matter of working on the driver”

    Since we’re talking about software here, isn’t it always a “matter of working” more on it?

    Isn’t it a fact that Linux has worked more on the problem, given the huge delta in development effort versus the BSDs?

    Yes, it’s amazing how well DragonFly does with such a small development team … but how will it ever catch up to the performance of Linux, given the sheer scale of the resources dedicated to improving it?

  7. To be frank, I don’t think there is a _huge_ number of Linux developers doing low-level network stuff. In most cases, only one or two developers made the changes that actually move the needle in the real world.

    If you mean ‘more money spent’ == ‘better performance’, well, I would only reply with a ‘heh’.

  8. Sepherosa

    Can you comment on why it’s possible for Linux, which uses a lock-based approach, to achieve performance similar to DragonFly, which is lockless?

  9. Anonymous – yes, in general more work on a driver is good. I’m not talking about that in a general sense, though. Look again at the PDF Sephe linked, and you’ll see that the bidirectional test was fastest on Linux specifically because it’s using 24 RX/TX rings, while DragonFly currently only uses power-of-two ring counts, i.e. 16. That, specifically, is something to fix.

    Until then, the only conclusion to draw is the one Sephe points out: DragonFly is faster than an earlier version of Linux under the same ring restriction.
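    (Editorial aside: a power-of-two ring/CPU count is commonly chosen so a flow hash can be mapped to a ring with a cheap bitmask instead of a modulo. A minimal sketch of that idea – the function names are mine, not DragonFly’s:)

```python
def usable_rings(ncpus: int) -> int:
    """Largest power of two <= ncpus; e.g. 24 CPUs -> 16 rings."""
    rings = 1
    while rings * 2 <= ncpus:
        rings *= 2
    return rings

def ring_for_flow(flow_hash: int, rings: int) -> int:
    """With a power-of-two ring count, ring selection is a single AND."""
    return flow_hash & (rings - 1)

print(usable_rings(24))           # 16 -> eight CPUs sit idle for protocol work
print(ring_for_flow(0x1234, 16))  # same result as 0x1234 % 16
```

    (The mask-vs-modulo trick is one common reason for the restriction; lifting it to use all 24 CPUs means replacing the mask with a more general mapping.)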

  10. Why is FreeBSD so slow? I always heard it was the fastest, but this test seems to disprove that.

  11. I too am confused about how Linux’s lock-based approach can perform so well against DragonFly, which is lockless.

    Can someone who understands networking better than me explain?

  12. Wayne, Kevin – DragonFly performs better than Linux in these benchmarks, except for the one with the explicitly mentioned hardware limitation.

    Are you sure recent versions of the Linux kernel are using a locking model in a significant way?

  13. @Justin

    https://leaf.dragonflybsd.org/~sephe/perf_cmp.pdf

    Slide # 1 says “Linux, lock based”

    Linux has very similar performance to DragonFly … one is lock-based and one is lockless.

    It seems like Linux has even more room to improve simply by removing its locks, whereas I’m not sure why DragonFly is only barely better than Linux when it’s lockless.

  14. The major difference between DragonFly and Linux for the HTTP workload is latency; DragonFly has much lower and more stable latency.

    Do you really think removing a lock/locks could be done ‘simply’? :)

    And to be frank, DragonFly does have a lot of room to improve, since we only use a power-of-2 number of CPUs for network protocol processing; in my test only 16 CPUs were used, while Linux used all 24. My next goal is to enable all CPUs for network protocol processing.

    I am afraid that enabling all available CPUs can’t be done ‘simply’ :); it will take time.

  15. Sepherosa

    The part that I’m confused about is that you imply the reason DragonFly is slower is that it’s not using all the CPUs.

    Yet look at the test where both Linux and DragonFly use the same number of CPUs (last slide):

    DragonFly was 9.25 Mbps
    Linux was 8.25 Mbps

    So even though Linux is lock-based, it’s still within ~10% of DragonFly, which is lockless.

    Maybe my expectations are off, but I would have thought this apples-to-apples comparison would have shown the lockless implementation (DragonFly) to be far faster, e.g. 40%+ better, but it’s not.

    Are my expectations of the benefits of being lockless exaggerated?

  16. Wayne – do you have any reference as to why you think the difference should be greater, or why that’s the only differentiating factor?

    Keep in mind that even though it’s generally being summed up as one network stack has locks, and the other doesn’t, the entire rest of the network stack is also different.

  17. Heh, not Mbps, but Mpps (Mpackets/second, 18B UDP datagrams :).

    Pushing 10% in Mpps probably is not as easy as you may imagine nowadays, especially when all obvious bits are ironed out. Admittedly, we are still far far far away from 29.6Mpps.

    After all, the gap between Linux4.9-16rxring and Linux4.9-24rxring kinda testified that using all CPUs for network processing probably is necessary, especially in the forwarding case, i.e. no userland gets involved. As a developer, it means I have more homework to do.

    Well, in order to keep your expectation up, the latency stdev of Dragonfly is 1/6 of Linux’s :).
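    (Editorial aside: the 29.6 Mpps ceiling is consistent with line rate for minimum-size 64-byte Ethernet frames – which is what an 18-byte UDP payload produces – on two 10GbE ports. The two-port setup is my assumption; check the PDF for the actual test hardware. A back-of-the-envelope check:)

```python
# 18B payload + 8B UDP + 20B IPv4 + 14B Ethernet + 4B FCS = 64B minimum frame
frame_bytes = 18 + 8 + 20 + 14 + 4
wire_bytes = frame_bytes + 7 + 1 + 12   # plus preamble, SFD, inter-frame gap

pps_10gbe = 10e9 / (wire_bytes * 8)     # packets/second on one 10GbE port
print(round(pps_10gbe / 1e6, 2))        # 14.88 Mpps per port
print(round(2 * pps_10gbe / 1e6, 2))    # 29.76 Mpps across two ports
```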

  18. What I find most interesting on the last slide of the perf comparison is that Linux scales linearly from 16 CPUs to 24 CPUs.

    At ~0.5 Mpps / CPU:

    ~8Mpps / 16 CPU, and
    ~12Mpps / 24 CPU

  19. Not to troll, but based on these results, why should I use DragonFly?

    If networking isn’t radically better on DragonFly vs Linux … what other qualities does DragonFly have that should make me run my company’s servers on it?

    Performance and stability are top priorities for me as a system admin.
