wine

Wine with a bit of extra spice

This is eventfd-based synchronization, or 'esync' for short. Turn it on with WINEESYNC=1; debug it with +esync.

== BUGS AND LIMITATIONS ==

Please let me know if you find any bugs. If you can, also attach a log with +seh,+pid,+esync,+server,+timestamp.

If you get something like "eventfd: Too many open files" and then things start crashing, you've probably run out of file descriptors. esync creates one eventfd descriptor for each synchronization object, and some games may use a large number of these. Linux by default limits a process to 4096 file descriptors, which probably was reasonable back in the nineties but isn't really anymore. (Fortunately Debian and derivatives [Ubuntu, Mint] already have a reasonable limit.) To raise the limit you'll want to edit /etc/security/limits.conf and add a line like

* hard nofile 1048576

then restart your session.

On distributions using systemd, the settings in /etc/security/limits.conf will be overridden by systemd's own settings. If you run ulimit -Hn and it returns a lower number than the one you've previously set, then you can set

DefaultLimitNOFILE=1024:1048576

in both /etc/systemd/system.conf and /etc/systemd/user.conf. You can then execute sudo systemctl daemon-reexec and restart your session. Check again with ulimit -Hn that the limit is correct.

Also note that if the wineserver has esync active, all clients also must, and vice versa. Otherwise things will probably crash quite badly.

== EXPLANATION ==

The aim is to execute all synchronization operations in "user-space", that is, without going through wineserver. We do this using Linux's eventfd facility. The main impetus to using eventfd is so that we can poll multiple objects at once; in particular we can't do this with futexes, or pthread semaphores, or the like. The only way I know of to wait on any of multiple objects is to use select/poll/epoll to wait on multiple fds, and eventfd gives us those fds in a quite usable way.
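
For illustration, here is a minimal, standalone C demo of the property we rely on: several objects, each backed by an eventfd, waited on with a single poll() call. (This is not code from the patchset, just a sketch of the mechanism.)

    #include <poll.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/eventfd.h>
    #include <unistd.h>

    int main(void)
    {
        /* Two independent "objects", each backed by an eventfd. */
        int fds[2] = { eventfd(0, EFD_NONBLOCK), eventfd(0, EFD_NONBLOCK) };
        struct pollfd pollfds[2] = { { fds[0], POLLIN, 0 }, { fds[1], POLLIN, 0 } };
        uint64_t value = 1;

        write(fds[1], &value, sizeof(value));  /* signal the second object */

        /* Block until *any* object is signaled -- the operation that
         * futexes and pthread semaphores can't express. */
        poll(pollfds, 2, -1);

        if (pollfds[1].revents & POLLIN)
            printf("object 1 is signaled\n");
        return 0;
    }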

Whenever a semaphore, event, or mutex is created, we have the server create an 'esync' primitive instead of a traditional server-side event/semaphore/mutex. These live in esync.c and are very slim objects; in fact, they don't even know what type of primitive they are. The server is involved at all only because we still need a way of creating named objects, passing handles to another process, etc.

The server creates an eventfd file descriptor with the requested parameters and passes it back to ntdll. ntdll creates an object of the appropriate type, then caches it in a table. This table is copied almost wholesale from the fd cache code in server.c.
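
Conceptually the cache is just a handle-indexed table. The following simplified sketch conveys the idea (the types and the fixed size here are invented for illustration; the real table also handles growth and invalidation):

    #include <stddef.h>

    typedef void *HANDLE;

    struct cache_entry
    {
        int fd;    /* eventfd descriptor received from the server */
        int type;  /* semaphore, event, mutex, ... */
    };

    static struct cache_entry cache[4096];

    /* Handles are allocated in multiples of 4, so index by handle / 4.
     * Returns NULL if we still need a server round trip to get the fd. */
    static struct cache_entry *get_cached_object( HANDLE handle )
    {
        unsigned long index = (unsigned long)handle / 4;
        if (index < 4096 && cache[index].fd > 0)
            return &cache[index];
        return NULL;
    }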

Specific operations follow quite straightforwardly from eventfd, as the sketch after this list illustrates:

  • To release an object, or set an event, we simply write() to it.
  • An object is signalled if read() succeeds on it. Notably, we create all
    eventfd descriptors with O_NONBLOCK, so that we can atomically check if an
    object is signalled and grab it if it is. This also lets us reset events.
  • For objects whose state should not be reset upon waiting (e.g. manual-reset
    events), we simply check for the POLLIN flag instead of reading.
  • Semaphores are handled by the EFD_SEMAPHORE flag. This matches up quite well
    (although with some difficulties; see below).
  • Mutexes store their owner thread locally. This isn't reliable information
    if a different process's thread owns the mutex, but this doesn't matter: a
    thread should only care whether it owns the mutex, so it knows whether to
    try waiting on it or simply to increase the recursion count.
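
In terms of raw calls, those bullets look roughly like the following (a sketch with error handling omitted; the function names are illustrative, not the patchset's actual ones):

    #include <poll.h>
    #include <stdint.h>
    #include <sys/eventfd.h>
    #include <unistd.h>

    /* Release an object or set an event: just write to it. */
    static void signal_obj( int fd )
    {
        uint64_t value = 1;
        write( fd, &value, sizeof(value) );
    }

    /* Try to grab an auto-reset object: since the fd is O_NONBLOCK, a
     * read atomically checks the signaled state and consumes it. */
    static int try_grab( int fd )
    {
        uint64_t value;
        return read( fd, &value, sizeof(value) ) != -1;
    }

    /* Check a manual-reset event without consuming its state. */
    static int is_signaled( int fd )
    {
        struct pollfd pfd = { fd, POLLIN, 0 };
        return poll( &pfd, 1, 0 ) > 0;
    }

    /* Semaphores: with EFD_SEMAPHORE, each read() decrements the
     * counter by one instead of draining it entirely. */
    static int create_semaphore_fd( unsigned int initial_count )
    {
        return eventfd( initial_count, EFD_NONBLOCK | EFD_SEMAPHORE );
    }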

The interesting part about esync is that (almost) all waits happen in ntdll, including those on server-bound objects. The idea here is that on the server side, for any waitable object, we create an eventfd file descriptor (not an esync primitive), and then pass it to ntdll if the program tries to wait on it. These are cached too, so only the first wait will require a round trip to the server. Then the server signals the file descriptor as appropriate, and thereby wakes up the client. So far this is implemented for processes, threads, message queues (difficult; see below), and device managers (necessary for drivers to work). All of these are necessarily server-bound, so we wouldn't really gain anything by signalling on the client side instead. Of course, except possibly for message queues, it's not likely that any program (cutting-edge D3D game or not) is going to be causing a great wineserver load by waiting on any of these objects; the motivation was rather to provide a way to wait on ntdll-bound and server-bound objects at the same time.
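
The wait itself then reduces to one poll() over the cached descriptors, whichever side they came from. A heavily simplified sketch of the wait-any case (the real code in the patchset handles timeouts, wait-all, and much more):

    #include <poll.h>
    #include <stdint.h>
    #include <unistd.h>

    /* Wait for any one of `count` objects; returns the index of the
     * object we actually grabbed. Assumes count <= 64 for brevity. */
    static int wait_any( const int *fds, int count )
    {
        struct pollfd pollfds[64];
        uint64_t value;
        int i;

        for (i = 0; i < count; i++)
        {
            pollfds[i].fd = fds[i];
            pollfds[i].events = POLLIN;
        }

        for (;;)
        {
            poll( pollfds, count, -1 );
            for (i = 0; i < count; i++)
            {
                /* Nonblocking read: if another thread grabbed the object
                 * between poll() and here, we simply go back to sleep. */
                if ((pollfds[i].revents & POLLIN)
                        && read( fds[i], &value, sizeof(value) ) != -1)
                    return i;
            }
        }
    }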

Some cases are still passed to the server, and there's probably no reason not to keep them that way. Those that I noticed while testing include: async objects, which are internal to the file APIs and never exposed to userspace; startup_info objects, which are internal to the loader and signalled when a process starts; and keyed events, which are exposed through an ntdll API (although not through kernel32) but can't be mixed with other objects (you have to use NtWaitForKeyedEvent()). Other cases include named pipes, debug events, sockets, and timers. It's unlikely we'll want to optimize debug events or sockets (or any of the other, rather rare, objects), but it is possible we'll want to optimize named pipes or timers.

There were two complications when working out the above. The first one was events. The trouble is that (1) the server actually creates some events by itself and (2) the server sometimes manipulates events passed by the client. Resolving the first case was easy enough, and merely entailed creating eventfd descriptors for the events the same way as for processes and threads (note that we don't really lose anything this way; the events include "LowMemoryCondition" and the event that signals system processes to shut down). For the second case I basically had to hook the server-side event functions to redirect to esync versions if the event was actually an esync primitive.

The second complication was message queues. The difficulty here is that X11 signals events by writing into a pipe (at least I think it's a pipe?), and so as a result wineserver has to poll on that descriptor. In theory we could just let wineserver do so and then signal us as appropriate, except that wineserver only polls on the pipe when the thread is waiting for events (otherwise we'd get e.g. keyboard input while the thread is doing something else, and spin forever trying to wake up a thread that doesn't care). The obvious solution is just to poll on that fd ourselves, and that's what I did; it's just that getting the fd from wineserver was kind of ugly, and the code for waiting was also kind of ugly, basically because we have to wait on both X11's fd and the "normal" process/thread-style wineserver fd that we use to signal sent messages. The upshot of the whole thing is that races are basically impossible, since a thread can only wait on its own queue.

System APCs already work, since the server will forcibly suspend a thread if it's not already waiting, and so we just need to check for EINTR from poll(). User APCs and alertable waits are implemented in a similar style to message queues (well, sort of): whenever someone executes an alertable wait, we add an additional eventfd to the list, which the server signals when an APC arrives. If that eventfd gets signaled, we hand it off to the server to take care of, and return STATUS_USER_APC.
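
In sketch form, handling APCs amounts to one extra descriptor plus an EINTR check; `apc_fd` below is a hypothetical per-thread eventfd that the server signals when a user APC arrives (illustrative only, not the literal patchset code):

    #include <errno.h>
    #include <poll.h>

    #define WAIT_SYSTEM_APC  (-2)
    #define WAIT_USER_APC    (-1)   /* would map to STATUS_USER_APC */

    /* pollfds must have room for one extra slot past `count`. */
    static int wait_alertable( struct pollfd *pollfds, int count, int apc_fd )
    {
        pollfds[count].fd = apc_fd;
        pollfds[count].events = POLLIN;

        /* The server forcibly suspends threads to deliver system APCs,
         * which shows up here as EINTR. */
        if (poll( pollfds, count + 1, -1 ) == -1 && errno == EINTR)
            return WAIT_SYSTEM_APC;

        if (pollfds[count].revents & POLLIN)
            return WAIT_USER_APC;   /* hand the APC off to the server */

        return 0;  /* one of the real objects is signaled */
    }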

Originally I kept the volatile state of semaphores and mutexes inside a variable local to the handle, with the knowledge that this would break if someone tried to open the handle elsewhere or duplicate it. It did break, and so now this state is stored inside shared memory. This memory is of the POSIX variety: it is allocated by the server (but never mapped there) and lives under the path "/wine-esync".
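
On the client side the setup looks roughly like this (a sketch; `struct shm_state` and its fields are invented for illustration, and the real patch decides the actual layout):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Hypothetical volatile state kept in shared memory per object. */
    struct shm_state
    {
        int count;      /* semaphore count / mutex recursion count */
        int owner_tid;  /* owning thread, for mutexes */
    };

    static void *map_esync_shm( size_t size )
    {
        /* The server created "/wine-esync" (shm_open with O_CREAT) and
         * sized it with ftruncate(); clients only open and map it. */
        int fd = shm_open( "/wine-esync", O_RDWR, 0 );
        void *ptr = mmap( NULL, size, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0 );
        close( fd );
        return ptr;
    }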

There are a couple things that this infrastructure can't handle, although surprisingly there aren't that many. In particular:

  • Implementing wait-all, i.e. WaitForMultipleObjects(..., TRUE, ...), is not
    possible in exactly the way we'd like. In theory that function should wait
    until it knows all objects are available, then grab them all at once
    atomically. The server (like the kernel) can do this because the server is
    single-threaded and can't race with itself. We can't do this in ntdll,
    though. The approach I've taken is laid out in great detail in the relevant
    patch, but for a quick summary (sketched in code after this list): we poll
    on each object until it's signaled (but don't grab it), check them all
    again, and if they're all signaled we try to grab them all at once in a
    tight loop; if we fail on any of them, we reset the count on whatever we
    shouldn't have consumed. Such a blip would necessarily be very quick.
  • The whole patchset only works on Linux, where eventfd is available. However,
    it should be possible to make it work on a Mac, since eventfd is just a
    quicker, easier way to use pipes (i.e. instead of writing 1 to the fd you'd
    write 1 byte; instead of reading a 64-bit value from the fd you'd read as
    many bytes as you can carry, which is admittedly less than 2**64 but
    can probably be something reasonable). It's also possible, although I
    haven't yet looked, to use some different kind of synchronization
    primitive, but pipes would be easiest to tack onto this framework.
  • PulseEvent() can't work the way it's supposed to work. Fortunately it's rare
    and deprecated. It's also explicitly mentioned on MSDN that a thread can
    miss the notification for a kernel APC, so in a sense we're not necessarily
    doing anything wrong.
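
For concreteness, a compressed sketch of the wait-all strategy from the first bullet above (illustrative only; it assumes at most 64 objects and omits the many edge cases the real patch handles):

    #include <poll.h>
    #include <stdint.h>
    #include <unistd.h>

    static void wait_all( const int *fds, int count )
    {
        struct pollfd pollfds[64];
        uint64_t values[64];
        int i, grabbed;

        for (i = 0; i < count; i++)
        {
            pollfds[i].fd = fds[i];
            pollfds[i].events = POLLIN;
        }

        for (;;)
        {
            /* Poll each object until it is signaled, without grabbing it. */
            for (i = 0; i < count; i++)
            {
                struct pollfd pfd = { fds[i], POLLIN, 0 };
                poll( &pfd, 1, -1 );
            }

            /* Check them all again; if any went unsignaled in the
             * meantime, start over. */
            if (poll( pollfds, count, 0 ) < count) continue;

            /* Try to grab them all at once in a tight loop. */
            for (grabbed = 0; grabbed < count; grabbed++)
                if (read( fds[grabbed], &values[grabbed],
                          sizeof(values[0]) ) == -1) break;
            if (grabbed == count) return;

            /* Failed partway: put back whatever we shouldn't have
             * consumed. Such a blip is necessarily very quick. */
            while (grabbed--)
                write( fds[grabbed], &values[grabbed], sizeof(values[0]) );
        }
    }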

There are some things that are perfectly implementable but that I just haven't done yet:

  • Other synchronizable server primitives. It's unlikely we'll need any of
    these, except perhaps named pipes (which would honestly be rather difficult)
    and (maybe) timers.
  • Access masks. We'd need to store these inside ntdll, and validate them when
    someone tries to execute esync operations.

This patchset was inspired by Daniel Santos' "hybrid synchronization" patchset. My idea was to create a framework whereby even contended waits could be executed in userspace, eliminating a lot of the complexity that his synchronization primitives used. I do however owe some significant gratitude toward him for setting me on the right path.

I've tried to maximize code separation, both to make any potential rebases easier and to ensure that esync is only active when configured. All code in existing source files is guarded with "if (do_esync())", and generally that condition is followed by "return esync_version_of_this_method(...);", where the latter lives in esync.c and is declared in esync.h. I've also tried to make the patchset very clear and readable: to write it as if I were going to submit it upstream. (Some intermediate patches do break things, which Wine is generally against, but I think it's for the better in this case.) I have cut some corners, though; there is some error checking missing, or implicit assumptions that the program is behaving correctly.

I've tried to be careful about races. There are a lot of comments whose purpose is basically to assure me that races are impossible. In most cases we don't have to worry about races, since all of the low-level synchronization is done by the kernel.

Anyway, yeah, this is esync. Use it if you like.

--Zebediah Figura