wayland and libkscreen benchmarks

So, first of all, this is all very much work-in-progress and highly experimental. It’s related to the work on screen management which I’ve outlined in an earlier article.

libkscreen wayland benchmark data

I ran a few benchmarks across our wayland stack, mainly measuring interprocess communication performance when switching from X11 (or, more precisely, XCB and XRandR) to wayland. This isn’t a highly scientific setup: I just ran the same code with different backends to see how long it takes to receive information about the connected screens, their modes, etc.
I also ran the numbers when loading the libkscreen backend in-process, more on that later.
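
To make the measurement approach concrete, here is a minimal sketch of a timing harness of this kind. It is not the benchmark code used for this article: `query_screens` is a hypothetical stand-in for a call that fetches screen information through a backend, and the output name and mode it returns are made up.

```python
import time
from statistics import mean


def time_call_ns(func, runs=5):
    """Call func `runs` times and return each call's wall-clock duration in nanoseconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter_ns()
        func()
        durations.append(time.perf_counter_ns() - start)
    return durations


def query_screens():
    # Hypothetical placeholder: a real benchmark would ask the backend
    # (xrandr, qscreen, wayland) for the current screen configuration here.
    return [{"name": "eDP-1", "modes": ["1920x1080"]}]


samples = time_call_ns(query_screens, runs=5)
print(len(samples), "runs, mean", mean(samples), "ns")
```

Running the same harness once per backend, and once per loading mode, gives the kind of per-run numbers collected in the spreadsheet.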

data

The spreadsheet shows three data columns. Vertical blocks, one per backend, list the results of 4-5 individual runs and their mean value. One column holds the default out-of-process mode, one holds the results with the backend loaded in-process, and one shows the speedup between in- and out-of-process for the same backend.
The lower part contains some cross referencing of the mean values to compare different setups.
All values are in nanoseconds.
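
As a rough illustration of how a speedup figure follows from the per-run values, the sketch below takes the mean of each set of runs and divides one by the other. The numbers are made-up placeholders chosen for easy arithmetic, not values from the spreadsheet, and the function name is an assumption.

```python
from statistics import mean


def speedup(out_of_process_ns, in_process_ns):
    """Ratio of mean out-of-process time to mean in-process time."""
    return mean(out_of_process_ns) / mean(in_process_ns)


# Placeholder run times in nanoseconds (NOT measured data).
out_of_process_runs = [100, 110, 90, 100]
in_process_runs = [50, 55, 45, 50]

factor = speedup(out_of_process_runs, in_process_runs)
print(factor)
```

A factor above 1 means the in-process variant was faster for that backend.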

2x

My results show that querying screen information is between 2 and 2.5 times faster on wayland than on X11.
The qscreen and xrandr backends perform quite similarly, which checks out, since they both go through XCB. The difference between wayland and xrandr/qscreen can then be attributed to either the wayland protocol or its implementation in KWayland being much faster than the corresponding XCB implementations.

But, here’s the kicker…

in- vs. out-of-process

The main overhead, as it turns out, is libkscreen loading the backend plugins out-of-process. That means that it starts a dbus-autolaunched backend process and then passes data over DBus between the libkscreen front-end API and the backend plugin. It’s done that way to shield the client API (for example the plasma shell process or systemsettings) from unsafe calls into X11, as it encapsulates some crash-prone code in the XRandR backend. When using the wayland backend, this is not necessary, as we’re using KWayland, which is much safer.
I went ahead and modified libkscreen so that these backends are loaded in-process, which avoids most of the overhead. This change has an even more dramatic influence on performance: on X11, the speedup is 1.6x to 2x, while on wayland, loading the backend in-process makes it run 10 times faster. Of course, these two speedups compound, so combined, querying screen information on wayland can be done about 20 times faster.
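
The combined figure comes from multiplying the two factors, since each one applies on top of the other. A small sketch of that arithmetic, using the rounded factors quoted above (the function name is just for illustration):

```python
def combined_speedup(*factors):
    """Independent speedup factors stack multiplicatively."""
    total = 1.0
    for factor in factors:
        total *= factor
    return total


# wayland vs. X11 (2x to 2.5x), then in-process vs. out-of-process on wayland (10x)
low = combined_speedup(2.0, 10.0)
high = combined_speedup(2.5, 10.0)
print(low, high)
```

With the lower backend factor this gives the "about 20 times" quoted above; the upper factor would put it somewhat higher.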

While this change from out-of-process to in-process backends introduces a bit more complexity in the library, it has a couple of other advantages in addition to the performance gains. In-process means that debugging is much easier. If there are crashes, we no longer hide them, but identify and fix them. It also makes development more rewarding, since it’s much easier to debug and test the backends and frontend API together. It also means that we can load multiple backend plugins at the same time.

I’ve uploaded the benchmark data here. Before merging this, I’ll have to iron out some more wrinkles and have the code reviewed, so it’s not quite ready for prime-time yet.

10 thoughts on “wayland and libkscreen benchmarks”

  1. Nice work! But enlighten me, who is not so knowledgeable on these things: Isn’t libkscreen mainly for querying monitors, switching resolution and such? How are speed improvements there relevant, if runtimes are already on the nanosecond scale? I know I wouldn’t notice :) But I bet I’m missing something.

    1. libkscreen is currently used by plasmashell, so it affects startup time. See the linked article on screen management.

      It may also be interesting to run the same test with the qscreen backend under Wayland, which then uses not XCB but the Wayland protocol.

        1. Bah. Accidentally hit submit… but still, chasing nanoseconds in the startup time? Who will notice? Or is it called a lot by plasmashell during startup? (if so, for what?).

          1. That’s the usual fallacy which causes us to suffer so many slow programs! I noticed in my own software that saving a few milliseconds here and a few milliseconds there and there and there … can quickly add up to a noticeable speedup. So I am in favor of such work, especially if it also comes with clean code and easier debugging. Anyway, everything that makes that darn slow Plasma startup faster is much welcomed by me.

          2. anonymity is great: I really didn’t mean to suggest small improvements don’t count, or to diminish the work. Not at all. They do count! It’s just that I thought there was no real bottleneck in the things libkscreen does. Now I know better :) In general it’s always better to first focus on the big gains when optimizing (and then turn to the small ones). I was just surprised by the small scale (nanoseconds) that we’re talking about here, because in that domain, I dare say it’s a hardly perceptible change, even if it’s a 20x speedup.

          3. The performance gains of this are in the range of milliseconds — I only used nanoseconds as base unit to conduct the benchmarks. This means that it makes a difference of skipping something like 10 frames or not at 60 FPS. On rotating metal disks, the speedup is probably a bit more dramatic, but I haven’t done any measurements of that.

            I’m not saying that I did this just for the performance gain, it would actually be very painful to implement the wayland backend when having to debug two processes at once, and that’s my main motivation. The performance improvement is just a nice icing on the cake.

          4. Perhaps a useful addition about the performance aspect.

            In my tests on this system, these calls into in-process libkscreen are fairly reliably under 5ms. Previously, they were in the range of 80ms. Assuming sync access (which isn’t necessary, but let’s consider it as worst case here), that means that we skip 5 frames, that is a noticeable effect on the user experience. Staying under 5ms on the other hand means that it can fit into the allowance for one frame (1/60 of a second, 16.6ms) easily. This means that the API doesn’t come with the caveat of introducing frame skips to certain operations.

            This may indeed not make a noticeable difference right now, since Plasma only queries on startup, but even then, the I/O saved by skipping the whole dbus-autolaunch dance of the backendloader process may already make a difference.

            As I said, the main motivation for this change is to make the code easier to work with, especially the backend code. For the work on the wayland backend, it’s pretty much a necessity to even be able to work effectively and debug the whole thing well enough for it to run stably and reliably. And fast, as a welcome side-effect. :)
