FEX

A fast usermode x86 and x86-64 emulator for Arm64 Linux

MIT License

FEX - FEX-2405 Latest Release

Published by Sonicadvance1 6 months ago

Read the blog post at FEX-Emu's Site!

One month older and we have a new release for FEX. This month is a little bit slower for user facing features but behind the scenes we had a large amount of refactoring to facilitate future improvements.

Support OpenGL and Vulkan thunking with forwarding X11

A thorn in FEX's side has been forwarding X11 calls when thunking OpenGL and Vulkan. This has caused us pain since X11's API is a fairly leaky
abstraction that exposes data symbols which FEX's thunking can't accurately thunk. We now instead redirect the X11 Display object directly in
OpenGL and Vulkan. This not only reduces the amount of code we need to thunk, but is also required for us to eventually thunk 32-bit libraries.

This may change behaviour in some games when thunks are enabled, so it will be useful to do some testing and regression testing.

Enable Enhanced REP MOVS when memcpy TSO is disabled

Following last month's optimization of the memcpy instructions, we now enable this CPUID bit when memcpy TSO emulation is disabled. This
means that glibc will take advantage of the optimization when it is doing memcpy and memset operations. This is a minor performance improvement
depending on the application.
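As a rough sketch of the idea, the snippet below models a CPUID leaf 7 EBX value in which the Enhanced REP MOVSB/STOSB (ERMS) bit, bit 9, is only advertised when memcpy TSO emulation is off. The helper names and the `memcpy_tso_enabled` parameter are made up for illustration and are not FEX's actual code.

```python
# CPUID leaf 7, sub-leaf 0: EBX bit 9 advertises Enhanced REP MOVSB/STOSB (ERMS).
ERMS_BIT = 1 << 9


def leaf7_ebx(base_ebx: int, memcpy_tso_enabled: bool) -> int:
    """Hypothetical helper: advertise ERMS only when memcpy TSO emulation is off."""
    if memcpy_tso_enabled:
        return base_ebx & ~ERMS_BIT
    return base_ebx | ERMS_BIT


def guest_sees_erms(ebx: int) -> bool:
    """What a guest library would conclude after reading EBX from leaf 7."""
    return bool(ebx & ERMS_BIT)


print(guest_sees_erms(leaf7_ebx(0, memcpy_tso_enabled=False)))  # True
print(guest_sees_erms(leaf7_ebx(0, memcpy_tso_enabled=True)))   # False
```

A guest library that checks this bit can then choose its REP MOVSB based copy path, which is what allows it to benefit from the optimized implementation.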

Implement support for SMSW instruction

This instruction isn't too notable since all it does on recent x86 CPUs is return the same data no matter what, but on legacy CPUs it was useful for
checking if x87 was supported. As this is considered a system level instruction, FEX didn't implement it originally but we finally found a game that
uses it. The original Far Cry game from 2004 uses this instruction for some reason. Now that we have implemented the instruction the game at least
gets to the menus but seems to still stall out when going in-game. Kind of neat!

Far Cry running under FEX

Fix disabling TSO emulation on some stack accesses

When emulating the x86 memory model, we can get away with not emulating it when a thread accesses its stack. This works since we know that stack
accesses are usually not shared between threads. While we usually disabled the TSO emulation in these cases, we had accidentally missed some cases.
This means there are some performance improvements for "free."

Fix ADC and SBC instructions

Back in FEX-2403 we had landed some optimizations for these instructions. Turns out that we had inadvertently introduced some broken behaviour as an
edge case. Most games didn't hit these edge cases so they were fine but it completely broke rendering in the Steam release of Final Fantasy 7.
With that fixed, we now get correct rendering again!

Final Fantasy 7 under FEX

Option to disable half-barrier TSO emulation on unaligned atomics

On x86 most atomic instructions don't care about alignment, with a feature that Intel calls "Split-locks" to even support unaligned atomics that cross
a cacheline. On ARM we don't have this luxury, where all atomic instructions need to be naturally aligned to their size. Newer ARM CPUs added a
feature that allows unaligned atomics inside of a 16-byte region, but if the data crosses the edge then FEX needs to handle this instruction in a
different way. We backpatch the instruction stream with a "half-barrier" access which still emulates the x86 memory model but is exceedingly heavy.
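To make the alignment cases concrete, here is a small sketch that classifies an atomic access by whether it is naturally aligned, stays inside a 16-byte granule, or crosses a 64-byte cacheline. The function names and category strings are illustrative assumptions; how FEX actually dispatches each case is more involved than this.

```python
def crosses_boundary(addr: int, size: int, granule: int) -> bool:
    """True if the bytes [addr, addr + size) span more than one granule-sized block."""
    return (addr // granule) != ((addr + size - 1) // granule)


def classify_atomic(addr: int, size: int) -> str:
    """Illustrative classification of an atomic access of `size` bytes at `addr`."""
    if addr % size == 0:
        return "naturally aligned"
    if not crosses_boundary(addr, size, 16):
        return "unaligned, inside a 16-byte granule"
    if not crosses_boundary(addr, size, 64):
        return "crosses a 16-byte granule"
    return "crosses a 64-byte cacheline"


for address in (0x1004, 0x1006, 0x100E, 0x103E):
    print(hex(address), classify_atomic(address, 4))
```

Only the first two categories can be handled by a single ARM atomic instruction (the second one requiring the newer hardware feature); the last two are the accesses that end up on the slow path described above.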

Now we have an option to convert these "half-barrier" implementations into non-atomic accesses. While this doesn't match true x86 behaviour, this can
accelerate some games that heavily abuse unaligned atomic memory accesses. So if a game is running slow, this is an option to try out!

FEXConfig option

Refactor code using clang-format

As alluded to at the start, the FEX codebase has now been completely refactored using clang-format to ensure a more consistent coding experience. We
provide a helper script and CI enforcement to ensure this stays in place. This took quite a bit of work to ensure the feature was up to everyone's
standards. Major thanks to everyone that worked on this!

Eliminate cross-block liveness for instructions

This is preparation work for future JIT improvements and speedups. Cross-block liveness makes our register allocator more complex and by removing this
from our JIT we can start working on improving the register allocator. Look forward to register allocator improvements in the coming months.

Support arm64ec function calls

This changes how some of FEX's JIT infrastructure works in order to support the ARM64ec Windows ABI. This will get used in conjunction with Wine's
ARM64ec support. While still a work in progress, this is starting to become interesting as WINE continues to implement more features to handle this
case.

Video game showcase

YouTube Link
Hades running under FEX

Raw Changes

FEX Release FEX-2405

FEX - FEX-2404

Published by Sonicadvance1 6 months ago

Read the blog post at FEX-Emu's Site!

After last month having an absolute ton of improvements, this month of changes is going to look positively tiny in comparison. We have some good new
options for tinkering with FEX's behaviour and more performance improvements. Let's get in to it!

Implement more memory model emulation toggles

The biggest performance hit with FEX's x86 emulation has always been emulating the memory model of x86. ARM has added various extensions over the years to make this emulation faster but it still isn't enough.

  • FEAT_LSE - Adds a bunch of atomic memory instructions
    • Original ARMv8.0 doesn't support this. Massive impact on performance.
  • FEAT_LSE2 - Adds unaligned atomics (within a 16-byte granule) to improve performance of x86 atomics.
    • Doesn't quite cover the full 64-byte cacheline of unaligned atomics that x86 supports
  • FEAT_LRCPC - Adds new load instructions which match the x86 memory model
  • FEAT_LRCPC2 - Adds even more loadstore instructions which match x86 instructions
  • FEAT_LRCPC3 - Adds even more, including vector loadstore instructions
    • No hardware today supports this extension

Even with this set of extensions, emulating x86's memory model can incur close to a 10x performance hit. This impact is felt most in games because they use vector instructions very heavily, and those are exactly what the missing FEAT_LRCPC3 extension would cover.
With this in mind, we are introducing some sub-options around emulating x86's TSO memory model to try and lessen the impact when we can get away with it. These new options can be found in the FEXConfig Hacks tab.

These two new options are only available for toggling when TSO emulation is enabled. If your CPU supports FEAT_LRCPC and FEAT_LRCPC2 then a recommended configuration is to keep the TSO Enabled option enabled, but disable the Vector and Memcpy options.
While this will incur a performance hit compared to disabling TSO emulation, it is significantly more stable to keep TSO emulation on.

If you still need more performance, then it may be beneficial to turn off TSO emulation entirely. It's unstable though! It's incorrect emulation to gain speed!

Vector TSO enabled

This option enables emulating the memory model around vector loadstore instructions. This has a HUGE performance impact even on latest generation hardware.

Memcpy TSO enabled

This option enables emulating the memory model around x86's REP MOVS and REP STOS instructions. These are used for doing memory copying and memory setting respectively.

The impact of this option depends heavily on the application. This is because most memcpy and memset functions actually use vectors to modify the memory.
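For readers unfamiliar with these instructions, the following is a minimal behavioural model of REP MOVSB and REP STOSB operating on a flat byte array. It only illustrates the architectural semantics (count in RCX, pointers stepping by the direction flag); it says nothing about how FEX implements them, and the function names are made up for the example.

```python
def rep_movsb(mem: bytearray, rsi: int, rdi: int, rcx: int, df: bool = False):
    """Copy rcx bytes from mem[rsi] to mem[rdi], one byte per iteration."""
    step = -1 if df else 1  # the direction flag selects descending addresses
    while rcx:
        mem[rdi] = mem[rsi]
        rsi += step
        rdi += step
        rcx -= 1
    return rsi, rdi, rcx


def rep_stosb(mem: bytearray, rdi: int, rcx: int, al: int, df: bool = False):
    """Store the low byte of al into rcx consecutive bytes starting at mem[rdi]."""
    step = -1 if df else 1
    while rcx:
        mem[rdi] = al & 0xFF
        rdi += step
        rcx -= 1
    return rdi, rcx


memory = bytearray(b"abcdef" + bytes(6))
print(rep_movsb(memory, rsi=0, rdi=6, rcx=6))  # (6, 12, 0)
print(bytes(memory))                           # b'abcdefabcdef'
print(rep_stosb(memory, rdi=0, rcx=3, al=0x2A))  # (3, 0)
print(bytes(memory))                             # b'***defabcdef'
```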

JIT core improvements

Once again this month there has been a focus on JIT optimizations, although this time it might be hard to see what is improving. Overall in benchmarks
there has been roughly a 3% performance improvement, with a mixture of this month's improvements being foundational work to lower JIT compile time
overhead in the coming months. As usual there is too much to dive in to each change individually so we'll just have a list.

  • Optimize LOOP/N/E
  • Negate more inline constants
  • Optimize PF calculation using integers rather than vectors
  • Optimize CLC
  • Optimize cmpxchg
  • A bunch of instructions cleaned up and rewritten to remove small amounts of overhead
  • Improves 32-bit address mode accesses
  • Implements support for prefetch and rdpid instructions

Optimize memcpy and memset IR operations when TSO emulation is disabled

Related to the Memcpy TSO option above, we have now optimized the implementation of the memcpy and memset instructions to be significantly faster. Sometimes a compiler will inline these instructions, which was causing upwards of 5% of CPU time to be spent doing memory copies.

With this optimization in place we have benchmarked an improvement from 2-3GB/s up to 88GB/s! That'll teach that code to be slow.

Fix memory leaks in thread creation

A memory leak had occurred where FEX would leak some thread stacks when threads shut down. This has now been resolved which lowers memory usage for
long running applications that shut down threads. In particular this makes Steam consume less RAM.

We have more memory leaks to solve as we move forward but they are significantly less severe than this.

A ton of small cleanups in the code

This month has had a lot of code cleanup in FEX but these aren't user facing so it isn't very interesting. Let it be known, though, that something
like half the commits this month were cleaning up or restructuring various bits of code, work that isn't otherwise getting a focus.

Raw Changes

FEX Release FEX-2404

FEX - FEX-2403

Published by Sonicadvance1 8 months ago

Read the blog post at FEX-Emu's Site!

Welcome back to another new tagged version of FEX-Emu! This month we have quite a few important bug fixes and optimizations, so let's get right in to
it!

Steam fix

As of Steam's February 27th update there was a fairly major change to how Steam starts its embedded Chromium instance.
With this most recent change it now is run inside of the Steam Linux Runtime environment. In turn Steam has disabled the sandbox feature of the
Chromium instance because it is incompatible. FEX was already disabling this sandbox and forcibly passing in the argument to disable it.

Chromium really didn't like the argument being passed in twice and it was causing it to crash early. We have now removed our application profile and
let Steam configure the arguments as required.

As a side effect of Steam updating their version of Chromium, some users have noted that they are experiencing problems with GPU acceleration on
Raspberry Pi systems. This is seemingly a video driver problem and unrelated to this crash that was fixed. It is currently unknown if we can fix this
problem, as it is working fine on Tegra and Snapdragon systems.

Rootfs images updated

FEX's rootfs images have been updated to include the latest versions of Mesa, gfxreconstruct, and Renderdoc. The major change here is having Mesa
updated to 24.0 as the other two packages are mostly for developers.

Fix a potential hang on forking with memory allocations

We have fixed a known hang that occurs when a process is forking while another thread is allocating memory. This tended to occur as a hang when
running Proton applications. While this fixes one hang, we still have another one that sporadically happens that we haven't tracked down. While the
occurrence is relatively rare, it's good to watch the process trees if the program is stuck waiting on a futex.

A bunch of CPU optimizations

As per usual, this last month has added a bunch of CPU optimizations! We have noted up to a 14% performance improvement in one benchmark and an
average of around 4% in Geekbench. We need to commend our developers for
hammering out these optimizations, even a small optimization can have big impacts on games that abuse a particular feature.

  • Use FlagM SETF8/16 for INC/DEC
  • Optimize LOCK DEC
  • Optimize ADC/SBC
  • Fuse add+cmn in to adds
  • Misc other optimizations
  • Optimize less than 32-bit add and sub

Small timestamp counter scaling

Recently we have found out that some games rely on an x86 CPU's RDTSC instruction operating at GHz frequencies. While making this assumption is not
a good idea, it is relatively common for x86 CPUs to have a really high frequency cycle counter. Most laptop CPUs operate in the 1-2GHz
range while desktop CPUs can go up to 3GHz in our testing.

Unreal Engine 5 has a new work graph system that spin-loops CPUs for a fixed number of cycles, expecting to not spin for very long. While this is
relatively okay at 1GHz, since it is only a few nanoseconds, when an ARM CPU's cycle counter gets added to the mix it starts encountering problems.

The primary problem here is that all 64-bit Snapdragon processors ship with a fixed rate 19.2MHz cycle counter. This continues all the way to their
latest flagship the Snapdragon 8 Gen 3. Additionally other ARM devices we have tested like the Nvidia Jetson ARM boards and Apple M1 also ship a
similarly slow cycle counter. So while Unreal Engine will only spin-loop for a measly 1597 cycles, on an ARM board this takes ~51,000 nanoseconds but
on an x86 PC it only takes 591 nanoseconds! This was causing games to burn CPU time unnecessarily and run slower than they should!

To compensate for these slow cycle counters on most ARM devices, we are now scaling the value we return to the applications by multiplying it
by 128! This makes Snapdragon cycle counters behave more like a 2457MHz cycle counter, but with a 128 cycle granularity. This improves
the FPS in Tekken 8 and will also improve performance in all other Unreal Engine 5 games. There may be other games affected as well!
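The arithmetic behind the scaling can be sketched in a few lines. This is only a model of the numbers quoted above (a 19.2MHz host counter and a factor of 128); the function names are hypothetical and the real implementation lives in FEX's JIT.

```python
HOST_COUNTER_HZ = 19_200_000  # fixed-rate Snapdragon cycle counter from the text
SCALE = 128                   # multiplier applied to the value returned to the guest


def guest_tsc(host_counter: int) -> int:
    """Value the guest would read for a given raw host counter value."""
    return host_counter * SCALE


def apparent_frequency_mhz(host_hz: int) -> float:
    """Frequency the guest perceives after scaling, in MHz."""
    return host_hz * SCALE / 1e6


def host_ticks_for(guest_cycles: int) -> int:
    """Host ticks needed before the guest observes at least guest_cycles (ceiling division)."""
    return -(-guest_cycles // SCALE)


print(apparent_frequency_mhz(HOST_COUNTER_HZ))  # 2457.6
print(guest_tsc(1001) - guest_tsc(1000))        # 128, the visible granularity
print(host_ticks_for(1597))                     # 13
```

The second line of output shows the trade-off: the guest never sees the counter advance by less than 128, so short intervals are measured coarsely even though the apparent frequency is now in the expected range.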

As a side-note, a 1GHz cycle counter is now mandated by the ARMv8.6 and ARMv9.1 specs! So this problem will soon go away as new SoCs get on to the market.

Introduce ARM64EC static register allocation mapping

As part of the ongoing effort to support WINE's Arm64EC code, we have changed the order in which our registers get allocated to more closely match
what the Arm64EC ABI wants for register layout. Matching what Arm64EC wants for the register layout means that when the JIT jumps out to some code,
we shuffle less data around which gives a performance improvement. The Linux side of code doesn't need this, so this only happens when building as a
WINE module.

32-bit thunking improvements

This last month has had an exciting milestone for 32-bit thunking! We have landed support for thunking Wayland on 32-bit. This means that with our
previously implemented Vulkan thunking, we can now run some games using Wayland plus Vulkan and Zink thrown in to the mix! In particular we have been
able to test that Super Meat Boy works in this configuration! We still have more work to do before X11 and GLX works with 32-bit thunking so stay
tuned to the future!

Memory leak fix

FEX had an issue with long running processes leaking memory. This showed up in applications that would start hundreds of threads and tear them down
over and over. Steam is one of these long running processes that would starve the system of memory if left open over night. This is because the
program spins up helper threads fairly aggressively and then shuts them down.

We have fixed one major memory leak but we still have a few more to go before it's nailed down!

Syscall passthrough optimization

One important thing to be wary of when running games is syscall overhead. Every time an ioctl or other syscall is made, FEX can incur significant
overhead compared to running the application natively. Additionally if we are passing syscalls through to glibc helpers then this can add more
overhead and sometimes introduce bad behaviour.

This month we spent some time looking at how syscalls are handled when we know that we can pass the data directly to the kernel. This allows us to
more quickly add new system calls when the kernel adds them, and ensures they are as fast as possible. With this optimization in place FEX now
directly emits small syscall handlers per syscall and jumps directly to the kernel if possible. This lowers CPU overhead for the most common syscalls,
thus removing emulation overhead. While FEX's syscalls were already fairly low overhead, this just improves the situation further!

Raw Changes

FEX Release FEX-2403

FEX - FEX-2402

Published by Sonicadvance1 8 months ago

Read the blog post at FEX-Emu's Site!

Welcome back everyone! After last month's cancelled release and this month being a bit late we have a lot of changes that happened.

More JIT performance improvements

A lot of the work these past two months has been optimizing our JIT more. We have run Geekbench and Bytemark for these which showed a marginal
performance improvement in these benchmarks, with Bytemark showing the biggest improvement of 16% in one sub-benchmark. A lot of the performance
improvements are targeting real-world applications rather than benchmarks, which shows up as games getting more of an improvement.

As typical, explaining each individual optimization would take too long so we're going to spam out a bunch in a list.

  • Removes a vtable indirection for syscalls
  • Fix RCL/RCR wraparound behaviour
  • Remove process-wide lock in JIT
  • Fixes syscall rcx/r11 state
  • Optimize SIB address calculation from three instructions to one
  • Optimize TST instruction with -1
  • Optimize TST more
  • Improve XCHG instructions
  • Optimize rotates
  • Optimize CDQ
  • Optimize shifts
  • Optimize PTEST, VTESTP, PDEP
  • Optimize SHA256 instructions to remove spilling
  • Optimize CMPXCHG
  • Stop zero extending a bunch of instructions where it doesn't matter
  • Optimize ANDN
  • Optimize a bunch of instructions using NZCV flags

Fix glibc clone usage of CLONE_CLEAR_SIGHAND

Newer glibc versions starting with 2.38 have started using this new clone flag for executing a program. We also fixed this in 2312.1 but can now make
a note of it.

Fix VDSO symbol fetching on ARM64

This is a fairly minor change but it fixes what could be a big performance hit. When FEX was querying for VDSO interface functions we were using the wrong names on
ARM64. Since the wrong names were used, this meant we always fell back to the slower glibc implementation of functions. This in particular fixes a
performance hit when games call clock_gettime excessively.
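To illustrate why the names matter, the sketch below uses the vDSO symbol names documented for each architecture: x86-64 exports `__vdso_*` symbols while ARM64 exports `__kernel_*` symbols. The lookup table and the `vdso_symbol` helper are illustrative only; the source does not say which names FEX was originally querying.

```python
# vDSO symbol names per architecture, as documented in the Linux vdso(7) man page.
VDSO_SYMBOLS = {
    "x86_64": {
        "clock_gettime": "__vdso_clock_gettime",
        "gettimeofday": "__vdso_gettimeofday",
    },
    "aarch64": {
        "clock_gettime": "__kernel_clock_gettime",
        "gettimeofday": "__kernel_gettimeofday",
        "clock_getres": "__kernel_clock_getres",
    },
}


def vdso_symbol(arch: str, function: str):
    """Return the symbol to look up, or None to signal 'fall back to the slow path'."""
    return VDSO_SYMBOLS.get(arch, {}).get(function)


print(vdso_symbol("aarch64", "clock_gettime"))  # __kernel_clock_gettime
print(vdso_symbol("x86_64", "clock_gettime"))   # __vdso_clock_gettime
```

Looking up an x86-64 style name in an ARM64 vDSO simply finds nothing, so the failure mode is silent: everything still works, just through the slower fallback.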

Fix Proton again

Sometime in December there were some changes to Valve's Proton layer which caused us to break it. This has now been fixed.

Expose Linux 6.6

With some relatively minor changes we now support reporting kernel version 6.6 to the guest application. This gives us a range from v5.0 to v6.6 now.

Workaround hang when process is forking

A long-standing bug in FEX is that sometimes a process can hang when it is forking, usually to execute another program. We have now worked around this
issue to an extent that lets the application continue. It's not a full fix because we can still have a crash but that is easier to see instead of a
program hanging forever in the background.

Commonize some WOW64 code to share with ARM64ec

In preparation for sharing some code with FEX built for ARM64ec, we have moved some Windows code to a common location where it can be shared.

An absolute ton of work went in to thunking

Over the past couple months this has been one of the more active projects within FEX. Today FEX has support for thunking 64-bit x86 libraries across
to ARM64. A significant portion of this work is doing analysis of API interfaces in order to allow thunking 32-bit x86 libraries over to ARM64
libraries with data repacking. This isn't yet complete but since a ton of work has gone in to this, we wanted to call it out.

NOTE: Memory leak on long-running processes like Steam

We have found a memory leak when a process shuts down a thread that has been around for quite a while. We only identified this memory leak this last month and it hasn't been fixed yet.
We are hoping to fix this bug for the next release but be aware that long running processes like Steam have a relatively aggressive memory leak. This is exacerbated by how Steam spins up threads for
doing work which makes this application particularly heavy.

Raw Changes

FEX Release FEX-2402

FEX - FEX-2312

Published by Sonicadvance1 11 months ago

Read the blog post at FEX-Emu's Site!

We're back with another month of changes. After last month being a bit slower, we're back in the swing of implementing more optimizations and bug fixes. No dilly dallying, let's get right in to it!

More optimizations this month!

Once again this month has a whole bunch of optimizations, which is very exciting! We will lightly go over the changes to talk about what changed.

Keep guest SF/ZF/CF/OF flags resident in host NZCV

This is one of the bigger optimizations this month. A bit of backstory is needed for what this optimization is for. x86 has a flags register called EFLAGS which contains quite a few random bits of information. The subflags we care about here are the SF, ZF, CF, and OF flags inside of it. These are various flags that are typically set by ALU operations to describe the result. So something like an integer Add will set ZF if the result was zero, SF if the result has the sign bit set, CF if a carry occurred, and OF if the operation overflowed. These are usually quite cheap for the CPU to calculate by itself, but manually calculating the flags usually takes a few additional instructions each.

The original implementation inside of FEX for calculating these flags would spend the additional instructions and calculate each one manually. This
would usually end up with a dozen or so of additional instructions for calculating flags. While FEX would typically optimize out the calculations if
they weren't used, it would still add CPU time when we couldn't.
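As a rough illustration of what calculating each flag manually involves, here is a minimal model of the four flags for a 32-bit add. It follows the x86 flag definitions only and is not FEX's code; the function name `add32_flags` is made up for the example.

```python
MASK32 = 0xFFFFFFFF


def add32_flags(a: int, b: int) -> dict:
    """Compute the result and the SF/ZF/CF/OF flags of a 32-bit x86 ADD."""
    a &= MASK32
    b &= MASK32
    full = a + b                 # unbounded sum, used to detect the carry out
    result = full & MASK32
    return {
        "result": result,
        "ZF": result == 0,                   # result is zero
        "SF": bool(result >> 31),            # sign bit of the result
        "CF": full > MASK32,                 # unsigned carry out of bit 31
        # Signed overflow: both operands differ in sign from the result.
        "OF": bool(((a ^ result) & (b ^ result)) >> 31 & 1),
    }


print(add32_flags(0x7FFFFFFF, 1))  # SF and OF set: signed overflow into the sign bit
print(add32_flags(0xFFFFFFFF, 1))  # ZF and CF set: unsigned wraparound to zero
```

Each flag costs a separate expression here, which mirrors why emitting them by hand in generated code adds up quickly.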

Luckily ARM also has a flags register called NZCV which maps almost perfectly to x86's EFLAGS. This lets us optimize these instruction implementations
to instead use the ARM flags directly. This has a couple of effects, not only does it remove the instructions from our code generation, it has
knock-on effects that the flags are now stored inside of the NZCV which reduces memory accesses. A multi-hit combo for improving performance.

While not all x86 instructions map their flags registers 1:1, this has a fairly significant performance uplift in most situations!

Dedicate registers for PF/AF

Related to the previous change, x86 has two flags stored inside of EFLAGS that don't have a direct equivalent on ARM CPUs. These two
flags are fairly uncommon but instructions will still generate them. These flags have the additional problem that they are fairly costly to calculate,
with one of them requiring a GPR population count instruction which ARM didn't even support until a new instruction extension called CSSC. While in
most cases the result of these flags isn't used, the overhead of calculating them can add up a bit. This is why we are now dedicating two registers to
these flags to reduce their overhead as much as possible!
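For reference, the two flags can be modelled as follows. PF is set when the low byte of the result has an even number of set bits (hence the population count), and AF is the carry out of bit 3 of an addition. This is a sketch of the x86 definitions with made-up function names, not a description of how FEX stores or defers these flags.

```python
def parity_flag(result: int) -> bool:
    """PF: set when the low 8 bits of the result contain an even number of 1 bits."""
    return bin(result & 0xFF).count("1") % 2 == 0


def aux_carry_add(a: int, b: int, result: int) -> bool:
    """AF for an addition: carry out of bit 3, visible as bit 4 of a ^ b ^ result."""
    return bool((a ^ b ^ result) & 0x10)


print(parity_flag(0b0000_0011))          # True, two bits set
print(parity_flag(0b0000_0111))          # False, three bits set
print(aux_carry_add(0x0F, 0x01, 0x10))   # True, carry from the low nibble
print(aux_carry_add(0x01, 0x01, 0x02))   # False
```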

Misc optimizations

  • Optimize BT/BTC/BTS/BTR
  • Optimize shifts/rotates
  • Optimize selects & branches & more nzcv goodies
  • Optimize three sha instructions
  • Make "not" not garbage
  • Optimize memcpy and memset when direction is compile time constant

With all these optimizations in place this month we have a fairly significant performance uplift!

<-- Geekbench and bytemark graphs -->

While Geekbench is showing a fairly modest 17.6% performance uplift, bytemark is showing up to a 60% performance uplift! Over the course of
the last three months we have had benchmarks that have improved by over 100%! These improvements can be seen in games as well, with some CPU heavy
games having their FPS improve by over 2x. A lot of the games tested have even changed from being CPU limited to GPU limited on our Lenovo X13s
laptops! We are looking forward to when these companies release new laptops based on Snapdragon X Elite in the middle of next year!

Various bug fixes

In addition to performance improvements, we have some bug fixes this month.

  • Fixed corruption in the JIT
    • Caused corruption with x87 heavy games
  • Fixes integer multiply corrupting results
    • Corrupted some register state, which was breaking the game Dungeon Defenders

Support extracting erofs images

One of the features that FEXRootFSFetcher was missing was the ability to extract erofs images once downloaded. This was because we didn't know that
erofs-utils provided an application for extracting these images without FUSE. Turns out the developers put an extractor inside of their fsck
application that we had completely missed! Now if a user wants to extract an x86 rootfs image for lower overhead, they can do this directly from our
FEXRootFSFetcher tool.

Preparation for improving gdbserver

GDBServer is a socket interface that GDB supports for remotely debugging applications. One of the harder things about working on FEX-Emu is that the
ability to debug an application is usually quite hard. GDBServer is a way to improve this situation so that GDB can remotely connect to a FEX process.

There's a bunch of work this month towards cleaning up this interface and getting it to work correctly. While it is still not quite usable for
debugging, we are working towards this so applications can actually be debugged!

Improvements to WOW64 compatibility for newer WINE

Newer versions of WINE have changed some behaviour around WOW64 support. So this month we have added support for some of this newer behaviour. Thanks
again to Bylaws for implementing this!

FEX rootfs image updates

This month we are updating our rootfs images to incorporate the latest Mesa 23.3.0 release that occurred a few days ago. We have updated our
Ubuntu 22.04, 23.04, 23.10, ArchLinux, and Fedora 38 images with this latest version of Mesa. As usual if
there are any issues, let us know so we can sort them out.

Raw Changes

FEX Release FEX-2312

FEX - FEX-2311

Published by Sonicadvance1 12 months ago

Read the blog post at FEX-Emu's Site!

Another month gone by and another FEX release out the door! This last month was a bit of a less busy month as most of our team spent a week in Spain
to take part in XDC 2023! We did still have the rest of the month to do some work though, so let's get to the changes!

Small bug fixes

This month we fixed a couple of bugs which could have caused spurious crashes! In fact while testing some upcoming performance optimizations, we fixed
a few unrelated bugs that were crashing Steam periodically! Always nice to see a bunch of little work that just improves the software, even if they
aren't a single big fix.

  • Fix register corruption when jumping out of JIT
  • Fixes double munmap which would cause spurious pointer unmaps
    • Fixes crashes when a program would shut down a thread
  • Implements RPRES support and fixes an implementation issue with ARM's new RPRES feature
    • RPRES gives us the ability to do reciprocals in one instruction instead of using ARM's divide instruction.
    • The bug would have caused invalid data to be returned
    • No CPU supports this yet luckily
  • Fixed issue with *at syscalls not working with absolute paths
    • Broke Proton's pressure-vessel in weird and unique ways
  • Fixes bug with named enum argument parser
    • This is used to override CPU features with the FEX_HOSTFEATURES option so typically not hit

32-bit thunking infrastructure

While 32-bit thunking is not yet fully integrated this month, some of the code has been landing to work towards this
goal. In order to do 32-bit thunking the right way we are spending a bunch of time ensuring that we have a proper data layout analysis system in place
that is based on clang to do a couple of things. This analysis will let us do automatic translations of data structures from 32-bit in to 64-bit and
also alert us if something needs to be manually translated. This needs to be in place because otherwise we can end up in a situation where we
unknowingly corrupt data and it would be a nightmare to find. So this month we now have the ability to annotate our thunk definitions and start having
clang work for us. While not complete, some of this work has already gotten thunking working for 32-bit Super Meat Boy! It's getting there!
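To show why the layouts need translating at all, here is a small sketch that repacks a made-up structure (a 32-bit integer followed by a pointer) from its 32-bit guest layout into the 64-bit host layout. The structure, the format strings, and the helper name are all assumptions for illustration; FEX's real analysis is generated from clang and handles far more than this.

```python
import struct

# Hypothetical struct { int32_t id; void *data; } in each ABI (little-endian).
GUEST32 = struct.Struct("<iI")    # 4-byte id, 4-byte pointer          -> 8 bytes
HOST64 = struct.Struct("<i4xQ")   # 4-byte id, 4 pad bytes, 8-byte ptr -> 16 bytes


def repack_guest_to_host(raw: bytes) -> bytes:
    """Translate the 32-bit layout into the 64-bit one, zero-extending the pointer."""
    ident, pointer32 = GUEST32.unpack(raw)
    return HOST64.pack(ident, pointer32)


guest_bytes = GUEST32.pack(7, 0xDEADBEEF)
host_bytes = repack_guest_to_host(guest_bytes)
print(GUEST32.size, HOST64.size)   # 8 16
print(HOST64.unpack(host_bytes))   # (7, 3735928559)
```

Passing the 8-byte guest buffer straight to a 64-bit library would make it read the pointer from the wrong offset, which is the kind of silent corruption the layout analysis is meant to catch.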

NZCV usage preparation

A big performance improvement that FEX is working on is to use the CPU's flags to directly emulate the x86 flags when possible. This is a long and
arduous task but the performance improvements will be huge once the code lands! A bunch of prep work this month has landed to start down this path but
we're going to need to let this sit in the oven for a bit longer. Check back next month to see if we get there!

Minor optimizations

With XDC being in the middle of the month, it caused most of the bigger work to be delayed so we have a bunch of smaller things this month!

  • Minor optimization to bfi/bfxil
    • Removes one or two instructions for some instruction translations
  • Optimize atomic fetch operations in to atomic if the result isn't used
    • Removes a couple of instructions if the resulting fetch data isn't used.
  • Implements support for ARM's new AFP extension
    • Currently disabled until we can audit the codebase to ensure we aren't corrupting anything
    • Lets us remove an insert after every scalar operation to match SSE behaviour
  • Optimize palignr that behaves like a move
    • Compilers shouldn't use this, but now we optimize it to a move
  • Optimize pblendw
    • A fairly uncommon instruction but now its implementation is basically as fast as it can be
  • Optimize blendps
    • We had already optimized blendpd last month, so this time was to optimize the 32-bit version
    • Fairly commonly used so should improve perf in some games
  • Optimize dpps and dppd
    • These instructions do a dot product and a broadcast of their result but we couldn't find a game using it heavily
    • So while this is now optimal, this is unlikely to affect any real game
  • Optimize some 3DNow! instructions
    • 3DNow! is a really old instruction extension that is basically only used in some really old games
    • All of these instruction implementations are basically as fast as we can make them now, which is good!
  • Optimize direction flag pointer offset calculation
    • This converts a three instruction calculation down to one and stops using a ternary selection
    • This happens with x86's repeat instructions, which typically happens for memcpy and memset
    • Used a lot but is a minimal improvement.
  • A few other random bits and bobs!

AVX optimizations!

While nothing supports our AVX implementation today, we have optimized a handful of implementations for when hardware supports what we need. We have
optimized a smattering of instruction translations.

  • 256-bit VExtr, VFCADD, VURAvg, VFDiv, VSMax, VSMin, VUMax, VUMin
  • Removes a bunch of truncating moves
    • If we know an AVX instruction is operating at 128-bit width, we can remove a redundant move which speeds things up!

Raw Changes

FEX Release FEX-2311

FEX - FEX-2310

Published by Sonicadvance1 about 1 year ago

Read the blog post at FEX-Emu's Site!

Welcome back to another monthly release for FEX-Emu. You might be thinking that after last month's optimizations that we wouldn't have much to show
for this month. Well you would be wrong! We optimized even more! Let's get in to it!

More instruction optimizations!

As stated last month, we introduced Instruction Count CI which has allowed us to do targeted optimizations of our code. Once again we have optimized so
many instructions that it would be impossible to go through each individual change. Check our detailed change log if you want to see all the
instructions optimized. Let's just look at the final benchmark numbers compared to last month.

<- Geekbench 5 versus last month ->
<- Bytemark versus last month ->

Let's talk about the Geekbench 5.4 results first since they don't look very impressive at first glance. While we are only showing a ~13%
performance improvement, this number is an aggregate of multiple smaller benchmarks. Looking at the breakdown of all the subtests, some have
improved by up to 66%! This is of course because some benchmarks lean on the instructions that we optimized more heavily than others do. Luckily
this improvement also carries over to video games.

The Bytemark improvements are a bit hard to make out. Some numbers are hardly changed at all while a couple stand out as huge improvements. This
mostly comes down to some very specific instruction optimizations that significantly improved performance in a couple of tests, while the rest
don't benefit as much.

With this month's optimizations and last month's combined, the results end up being significantly more interesting. Some Geekbench results are
showing an average of 50% to 65% higher performance, sometimes even higher. Some benchmark results are showing nearly 2x the performance compared
to before! These numbers translate very well to gaming performance, where some games have more than doubled their FPS over the past couple of
months.

We're not slowing down either. We still have a ton of optimizations to go on our march to get our emulation close to native performance.

Support preserve_all for interpreter fallbacks

We're calling out this particular optimization for three reasons.

  1. It improves performance of x87 heavy code
  2. It only works with the super recently released Clang 17
  3. wine packages in FEX's rootfs use x87 heavily in some instances.

Let's talk about what this optimization is and how it improves performance. In Clang 17 they added support for a new function calling ABI called
preserve_all. x86 has supported this ABI for a very long time but it is a new addition for Arm64. This ABI breaks convention from the regular AAPCS64
ABI in that if a called function needs to use more registers, it first has to save pretty much any of them itself, unlike AAPCS64 where the
callee has a bunch of registers free to use. This is beneficial for FEX's JIT since we can save significant time by not saving any state when we
need to jump out of the JIT and execute x87 softfloat code.

In particular this manifests as upwards of a 200% performance improvement in some microbenchmarks around x87 code! While this advantage is quite
significant, the only way to take advantage of it is to compile FEX with Clang 17. Since this compiler release came out only last month, pretty much
no distros have adopted it so it is unlikely to be used soon. In a few months time, or years depending on distro, they should naturally upgrade their
compiler stack and free performance improvements will happen.

As a fairly major side note to this excursion, FEX has found that the 32-bit wine packages compiled in Canonical's repository use x87
heavily in some instances. This causes some really bad performance issues with some 32-bit games and installers. It is recommended to use Proton
where you can, since it compiles its 32-bit libraries with SSE optimizations instead, which work significantly better.

FEX-Emu may look to provide its own wine packages in the future with this same optimization in place to help alleviate some of this burden. Until then
it is recommended to use FEX's x87 reduced precision mode to try and alleviate some of the overhead.

Fixes a bug when chrooting in to rootfs

Quite a few months ago FEX-Emu changed some behaviour around chrooting in to the FEX rootfs. While chrooting isn't generally advised, if a user
wants to modify the rootfs then it's the only option. While we provide some scripts inside of our rootfs images to facilitate this, they have
been broken for a few months.

We have now fixed this bug in both FEX-Emu and the scripts inside of our rootfs images. So if you want to modify packages inside of the image you will
now be able to do so again. Make sure to update your image to get the new scripts!

Remove x86-64 JIT and Interpreter

This has been a long time coming in the FEX-Emu project. We have had support for an IR interpreter and x86-64 host JIT for compatibility testing since
the project's inception. It has always been the case that if these CPU backends get in the way of the ARM64 JIT that they would get removed.

That time has finally come. Between some upcoming changes around how flags are represented in FEX's JIT and the general burden of implementing
FEX's IR operations three times, often undoing an x86->Arm64 translation to go back to x86, maintaining these backends has been deemed too much
of a burden and they have been removed. This is a necessary step for the performance our ARM64 JIT will be gaining in the coming months!

We are looking forward to future ARM platforms that can take Radeon GPUs through PCIe slots to regain a platform which can test RADV directly, but
until that point we will have to make do with our current devices.

Instruction Count CI on x86-64 hosts

While we removed our x86-64 JIT, we do have a fun addition to our instruction count CI. Developers that don't have an Arm64 device handy can now
still run the Instruction Count CI and attempt to optimize implementations. This is as simple as building FEX on an x86-64 device with the Vixl
disassembler and simulator enabled, and you will be able to optimize to your heart's content!

We've got a need for JIT speed! Let's go fast!

Implement first optimizations using 128-bit SVE

This is a fairly minor change but previously FEX was not using any 128-bit SVE instructions. This is primarily because there aren't really any SVE
supporting devices in the consumer market, even though Snapdragon hardware theoretically supports it. 128-bit SVE adds a couple of optimizations that
we can use.

  • Wide-element shifts
  • Index instruction for generating simple index masks

While these are fairly simple initially, they take some translations from six instructions down to one or two, depending on the case. This is a
fairly minor change, but it is good to note that FEX is now taking advantage of SVE if it is available!
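
For readers unfamiliar with it, SVE's INDEX instruction fills a vector so that lane i holds base + i * step, wrapped to the lane width. The
sketch below models that behaviour on plain Python lists. The function name and the default lane width are choices made for illustration only,
and it says nothing about which translations FEX actually uses the instruction for.

```python
def sve_index(base: int, step: int, lanes: int, lane_bits: int = 8) -> list:
    """Model of an INDEX-style operation: lane i gets (base + i * step) modulo the lane width."""
    mask = (1 << lane_bits) - 1
    return [(base + i * step) & mask for i in range(lanes)]
```

For example, sve_index(0, 1, 16) yields the byte indices 0 through 15 of a 128-bit vector, which would otherwise need to be loaded from a
constant or built up over several instructions.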

Adds WOW64 frontend

This has been a long time coming, with us adding initial mingw support back in FEX-2305. FEXCore now supports being built with a brand new WOW64 WINE
frontend. While currently not being utilized, this will allow WINE to integrate FEX directly in to its WOW64 layer for running both x86 and x86-64
applications on Arm64 host devices.

This is a very substantial change to how WINE integrates with FEX, since today FEX-Emu just runs the full x86-64 WINE process and eats the overhead of
emulating everything WINE needs to do. With the WOW64 layer now implemented, a bunch of the WINE code can now be Arm64 native code and when it needs
to execute application code it just jumps back to the emulator. This is similar to how Windows natively handles its emulation through its "XTA" layer.
Sadly today this is only wired up to work through a 32-bit x86 part of the layer. We still need to get set up to support Wine when it inevitably
supports Wow64 for x86_64->Arm64.

Big shout out to ByLaws for implementing support for this! We look forward to future Wine integration work landing!

Implement thunking support for wayland-client and zink

We have some improvements to thunking this month! As we are working towards supporting thunking more code, we implemented some features to get
wayland-client thunking wired up. While this support is early, it is enough to get Super Meat Boy up and running using wayland and zink overrides
within a Wayland environment. We look forward to additional thunking improvements going forward so that performance can be improved everywhere.

Raw Changes

FEX Release FEX-2310

  • AppConfig
    • Removes Steam config (02da6d6)
  • Arm64
    • Fixes inline syscalls (4e9a114)
    • Optimize wide shifts slightly for 64-bit OpSize (f5c4e28)
    • Recover two unused vector vector temporary registers (90f7937)
    • ALUOps
      • Remove spills in PEXT (4604c01)
    • VectorOps
      • Elide moves where applicable in 128-bit VSQXTUN2 (fd1b639)
      • Improve handling of 128-bit vector VInsElement (950a8db)
      • Elide moves in ASIMD VUShrNI2 if possible (b3269f2)
      • Assert VTMP1 and VTMP2 are sequential in VTBL2 (8168a49)
      • Fix SVE aliasing-path move in VSShr (ffb5876)
  • CI
    • Run tests with <30s runtime first (e1eb151)
  • CPUID
    • Enabled Enhanced REP MOVSB/STOSB (6fe643d)
  • Config
    • Fixes core sanitization (da3e172)
  • ConstProp
    • Fixes unscaled signed 9-bit range (72d092e)
  • DeadContextStoreElimination
    • Silence unused function warning (773e946)
  • ELFCodeLoader
    • Expose FSGSBase in getauxval HWCAP2 (fbc4bda)
  • FEX
    • Moves Linux utils and adds spdx (ba56e51)
    • Common
      • Adds SPDX identifier (9f5f09b)
    • Tools
      • Adds SPDX identifier (ddf4b5c)
  • FEXCore
    • Support CpuState relative vector named constants (3413eb3)
    • Merge Arm64Dispatcher in to Dispatcher (935b3a3)
    • Removes x86 JIT. (879b41c)
    • Removes vestigial Interpreter code (65b6df9)
    • Support preserve_all ABI for interpreter fallbacks (fea72ce)
    • Gut interpreter (7d99eb0)
    • Adds SPDX identifier (67680d7)
    • Implements support for shifted bitwise ops (d578256)
    • Disable Enhanced REP MOVSB if Atomic TSO is enabled (0604336)
    • Defer setting x87 softflow rounding mode until use (9866e23)
    • Minor optimization to StoreRegisterSRA (6fdf2f9)
    • Include
      • Adds SPDX identifier (d86f41e)
    • JitSymbols
      • Buffer writes to reduce overhead (2ea2300)
  • FEXServerClient
    • Adds back ServerSocketPath config option (e795ec6)
  • FHU
    • Prepend SPDX identifier (3b188b7)
  • FileManagement
    • Fix inverted boolean check for procfs/interpreter support (615ab8d)
  • Github
    • Changes jobs to have unique names (220761a)
  • HostFeatures
    • Fix x86 CLWB support check (6e08ac6)
    • Detect FlagM/2 (98f1487)
  • IR
    • Changes Select operation to not have implicit sizes (a8c1720)
    • Changes crc32 operation to always return a 32-bit result. (6d9b524)
    • RCLSE: Partially reenables the RCLSE pass (879fcdc)
  • InstCountCI
    • Enable running on x86 hosts (a1a709f)
    • Support f64 reduced precision mode tests (5eed24a)
    • Fail CI if there was any difference. (93aeb15)
    • Adds negative immediate primary tests (c38beff)
    • Add log before compiling instruction (1804b00)
    • Adds missing instructions from Secondary OpSize tables (750d909)
  • OpcodeDispatcher
    • Don't mask logic op inputs (d1d3de8)
    • Optimize lock btr (2b7e1d1)
    • Optimize reconstructing FSW (5fc8699)
    • Removes non-explicit SelectCC function (43fd159)
    • Improve output of SHLX/SHRX/SARX (ad8b0c6)
    • Improve output of MULX (e574cfe)
    • Handle RORX corner cases better (647629a)
    • Optimize cmov (c8e7c34)
    • Optimize CRC32 (96bbd01)
    • Optimize 16-bit MOVBE (f84a264)
    • Optimize blendp{s,d} (92824f5)
    • Optimize pins{b,w,d,q} (213d3c4)
    • Optimize pextr{b,w} (d4c6749)
    • Optimize shufpd (655cee0)
    • Implement shufps with VTBL2 in worst case (31d8283)
    • Optimize a bunch of shufps variants (cfe620a)
    • Optimize 32-bit bswap (ebdca02)
    • Optimize NOP vector move (f7e652b)
    • Minor optimization to BT/BTC/BTR/BTS (48521a4)
    • Update 32/64-bit RCL for operating size (950007c)
    • Update 32/64-bit RCR for operating size (d029394)
  • PassManager
    • Optimize out CPUID and XGetBV calls (234e029)
  • RCLSE
    • Optimize redundant store->load operations (6dc5c0d)
  • Scripts
    • Update generate_doc_outline for moved FEXCore (9ff5544)
  • Thunks
    • Only build guest target for libfex_thunk_test if FEXLinuxTests are enabled (507cf82)
    • Analyze data layout to detect platform differences (48fa4f1)
    • Fix AddressSanitizer build (1439874)
    • Update Vulkan thunk to v1.3.261.1 (76d4637)
    • Avoid recompiling thunk interfaces on FEXLoader changes (533f359)
    • Minor restructuring and small cleanups (be07254)
    • wayland
      • Add support for APIs required by zink and Super Meat Boy (0d9dce9)
  • Tools
    • Fixes usage of waitpid in the face of EINTR (dda5861)
  • X87F64
    • Implement FABS with vector instruction (3a25dd6)
    • Use Bfe for rounding mode, FCHS use float instruction (ccfd770)
  • Misc
    • Minor AVX optimizations (3ba1c79)
    • Optimize ASCII flags (6b4ff4a)
    • Use adcs (ca87d86)
    • Optimize 8/16-bit CF calculation (8b3881b)
    • Optimize PF calculation in lahf (19a7b51)
    • More opts to the dispatcher + 1 to the JIT (bee9730)
    • Requiem for the x86 jit (86ad35c)
    • Add WOW64 JIT frontend (797c890)
    • Optimize reconstructing x87, harder (5444810)
    • Make x87 FCMOV slightly less terrible (65d558b)
    • Minor/flag opts (8b52308)
    • InstCountCI/VEX_map3: Add missing zeroing vperm2f128/vperm2i128 test cases (3d0b664)
    • Inline constant with PF calculation (9152fb0)
    • Optimize out carry invert for DEC (b6922df)
  • unittests
    • Instruct CTest to print output from tests on failure (e32601f)
    • Add test thunk library (000fb2e)
    • ASM
      • Implements tests for vpgatherqd/vgatherqps (ab4642a)
      • Implements tests for vpgatherqq/vgatherqpd (d94e5ce)
      • Implements tests for vpgatherdq/vgatherpq (dad7086)
      • Implements tests for vpgatherdd/vgatherps (85da0f0)
      • Adds unit test caught by #3153 (0d12cce)

FEX - FEX-2309.1

Published by Sonicadvance1 about 1 year ago

Hotfix patch to fix a bug around accessing files!

FEX - FEX-2309

Published by Sonicadvance1 about 1 year ago

Read the blog post at FEX-Emu's Site!

Last month we hinted that we didn't get all optimizations in that we wanted. There's more of that this month but we have also had an entire month to
push optimizations in. This month was a whirlwind of optimizations improving performance all over the place because of one feature that landed:
Instruction Count continuous integration! Let's dive in to what this is.

Instruction Count CI

This is a major feature that we added last month that doesn't directly affect users but is such a huge quality of life improvement to our developers
that we need to discuss what it is. At its core, InstCountCI is a database (Actually JSON) of x86 instructions that FEX-Emu supports and shows how
that instruction gets converted to Arm64 instructions. This is in textual format for easily reading these instruction implementations and updating
quickly when the implementation changes. This has had a profound effect on our developers, who can't help but look at poor instruction
implementations and find ways to optimize them.

<- Optimized versus non-optimized picture ->

As you can see in the example, one very complex instruction that was not optimal before has now translated in to something much more reasonable.
So far this has nerdsniped at least half a dozen developers in to finding more optimal implementations of these instruction translations!

Some design considerations must be understood when looking at FEX's instruction implementations though. The most important thing to remember
is that these implementations are looking at the instruction in a vacuum. These are translated as only single instruction entities, so any sort of
multi-instruction optimization is not going to be visible in this CI system. Additionally this isn't getting run on hardware in our CI, so
implementations that are close on instruction count may have wildly different performance characteristics depending on the hardware. So while it is a
good guide for getting eyes on the assembly, there still needs to be some knowledge of what the translation is doing to ensure it's both fast and
correct.
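
To make the idea concrete, here is a minimal sketch of how a JSON database of translations could be diffed to spot instruction count changes.
The schema (an "arm64" list of instruction strings per x86 instruction) and the function name are hypothetical and chosen only for
illustration; they are not FEX's actual file format or tooling.

```python
import json


def instruction_count_changes(old_json: str, new_json: str) -> dict:
    """Map each instruction name to (old_count, new_count) wherever the counts differ."""
    old_db = json.loads(old_json)
    new_db = json.loads(new_json)
    changes = {}
    for name, entry in new_db.items():
        # Hypothetical schema: each entry lists its translated Arm64 instructions.
        before = len(old_db.get(name, {}).get("arm64", []))
        after = len(entry.get("arm64", []))
        if before != after:
            changes[name] = (before, after)
    return changes
```

A check like this only counts instructions, so as noted above it can't tell you whether the shorter sequence is actually faster on real hardware.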

This CI system was used heavily this last month for what our next topic is.

Optimization Extravaganza!

With InstCountCI in place, we can now quantify optimizations going in to the FEX CPU JIT without accidentally compromising performance of other
instructions. With this in place we have had an absolute ton of CPU optimizations land in our JIT, enough that if we went through them all it would
take longer than all of our previous progress reports!

Instead of going through each individual change, let's just discuss the main optimizations that have landed. The bulk of the optimizations has been
making sure the translation from SSE instructions to Arm64's ASIMD instructions is more optimal. This is because reasoning about vector
optimizations is easier in this instance, and also because games more heavily abuse vector instructions than regular desktop applications. There were
other optimizations as well, like some flag generation instructions becoming more optimal and eliminating redundant move instructions!

Let's take a look at the bytemark results.

<- Bytemark graphs ->

There's some surprising uplift in numbers here! Even more so since bytemark shouldn't heavily utilize SSE instructions, so this is more just coming
from general optimizations that occurred. Let's take a look at another benchmark for fun.

<- Geekbench 5.4.0 graph ->

Whoa, that is a surprising uplift in one month! Geekbench actually has some benchmarks that use vector operations so they can get more
improvements than expected. We should expect even more performance once we start optimizing more non-vector instruction translations!

As for gaming benchmarks, we're not going to do any in this blog post, but we have been told that due to various optimizations this month Portal
performance has gained 30% and Oblivion has gained 50%. Big improvements towards making games feel better when playing them. The main concern here
is that the Adreno 690 in our Lenovo X13s test systems is actually quite unstable during testing, so finding suitable games that are CPU bound
without crashing the kernel driver is surprisingly difficult. Most of the lighter games that don't crash the MSM kernel driver are already running
at hundreds of FPS anyway so they aren't interesting.

A fun quirk of optimizing vector operations this month is that we have finally landed our first optimizations that use ARM's SVE instruction set when
operating at 128-bit width. Turns out there are a few optimizations that can be done here aside from implementing AVX with the 256-bit version! I'm
sure we will see more of these as we continue optimizing.

Remove most implicit sized IR operations

Continuing from the last topic, this is one of the main changes that allows us to start working on non-vector instruction optimizations. FEX's IR
around general purpose ALU operations has a history of using implicit sized IR operations. This means we would check the size of the incoming data
sources and make an assumption for what the operating size of the whole thing should be. While this worked, it has been an absolute thorn in our side
for years at this point. Any time we would make a seemingly innocuous change it would subtly change the behaviour of some IR operations as a new size
propagates through the stack. Now that all of these operations explicitly state their operating size at generation time there is less room for error.

This follows how our vector operations worked, where all of these were explicitly sized from the start and have had significantly fewer issues over
time.

With this change in place we can start optimizing general purpose ALU operations with less worry about breaking the world.
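
The difference between implicit and explicit sizing can be sketched with a toy add operation. The Value type and both functions below are
hypothetical stand-ins for illustration and do not mirror FEX's IR definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Value:
    bits: int  # width of the value in bits
    data: int  # the value itself


def add_implicit(a: Value, b: Value) -> Value:
    """Implicit sizing: the operating size is guessed from the widest input."""
    bits = max(a.bits, b.bits)
    return Value(bits, (a.data + b.data) & ((1 << bits) - 1))


def add_explicit(bits: int, a: Value, b: Value) -> Value:
    """Explicit sizing: the operating size is stated when the operation is generated."""
    return Value(bits, (a.data + b.data) & ((1 << bits) - 1))
```

With the implicit form, an unrelated change that widens one input from 32 to 64 bits silently turns a wrapping 32-bit add into a 64-bit add.
With the explicit form the result stays the same no matter how the inputs were produced.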

Mingw work

Some more work landed this month towards getting WINE WOW64 support wired up: adding a toolchain file to help facilitate cross compiling, no
longer saving and restoring the x18 platform register, and various other things. While full support isn't yet merged, there's
a lot of preliminary work landing so we can support this. While this work is very early, it is already showing significant performance improvements
for Windows native games. A game like Bioshock Infinite is already running faster than FEX emulating x86 WINE fully! Look forward to future
improvements and integrations as this gets wired up!

Raw Changes

FEX Release FEX-2309

FEX - FEX-2308

Published by Sonicadvance1 about 1 year ago

Read the blog post at FEX-Emu's Site!

Whoa jeez, another month already? We've had our heads down working hard this last month, trying to make FEX-Emu the greatest x86/x86-64 emulator on
Linux. A huge focus this month is optimizations because of course what we want is to go fast. We're all cats and we've got the zoomies.

Every day we're optimizing

As said before, this month has been an absolute mess of optimizations as we've been optimizing the project as thoroughly as possible. We could spend
another month talking about the optimizations that we did this last month, so let's blast through what we did. First let's show a graph for how much
FEX has improved over this last month.

Look at those numbers! Some benchmarks from bytemark have cracked the 200% mark! While a couple benchmarks do have regressions, we're pretty sure that
we know what they are and they will be rectified soon. These are the sorts of optimizations that can be felt in real games though.

So let's quickly run through some of the optimizations we ran in to this last month.

Switch to using half-barriers for memory accesses

When ARM hits an unaligned atomic memory access, we previously wrapped that load or store in two slow barrier instructions. We can now safely only use
one barrier on one half of the instruction! This makes unaligned accesses quite a bit quicker.

Optimize x87 memory accesses

This removes a couple instructions when we access 80-bit floats.

Only clear icache for code

Some large code blocks can generate a decent amount of metadata that doesn't need an icache clear. Skipping it can remove a bit of stutter.

Const prop BFI operation

Sometimes when a BFI instruction has constants in it, we can compute the result ahead of time and remove the BFI instruction.
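
As a small illustration, BFI (bitfield insert) copies the low bits of a source into a destination at a given position and leaves the other
destination bits alone. The sketch below models that and folds the operation when both inputs are known constants. The tuple-based "IR" and the
function names are made up for this example and are not FEX's constant propagation pass.

```python
def bfi(dst: int, src: int, lsb: int, width: int) -> int:
    """Insert the low `width` bits of src into dst starting at bit `lsb`."""
    mask = ((1 << width) - 1) << lsb
    return (dst & ~mask) | ((src << lsb) & mask)


def fold_bfi(dst, src, lsb: int, width: int):
    """Return a constant when both inputs are constants (ints); otherwise keep the operation."""
    if isinstance(dst, int) and isinstance(src, int):
        return bfi(dst, src, lsb, width)
    return ("bfi", dst, src, lsb, width)
```

Here fold_bfi(0xFFFF, 0, 4, 8) collapses to the constant 0xF00F, while a call with a non-constant input such as a register name is left as an
operation for the JIT to emit.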

Optimize vector TSO loadstores

Vector operations typically need an additional add on their address if the offset can't fit in the instruction encoding's immediate offset. We
missed the optimization for the case in which the immediate offset CAN actually fit. This commonly removes an instruction per vector loadstore.

Use TST instead of CMN

Sometimes these instructions hit a slow path on Cortex-A57 so a minor win there.

Optimize xor reg, reg

x86's universally agreed upon instruction for generating zero in a register is xor. This instruction isn't actually optimal on ARM hardware. We now
emit a move of constant zero which gets optimized to a register rename on most ARM hardware.
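
The pattern match involved is tiny. The toy translator below returns Arm64-style assembly strings for an x86 xor and special-cases identical
operands. It ignores x86 flag updates entirely and uses the operand names as-is, so treat it as a sketch of the idea and not FEX's emitter.

```python
def translate_xor(dst: str, src: str) -> list:
    """Translate `xor dst, src`; identical operands always produce zero."""
    if dst == src:
        # Zeroing idiom: emit a constant move instead of an exclusive-or.
        return [f"mov {dst}, #0"]
    return [f"eor {dst}, {dst}, {src}"]
```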

More instructions optimized

These mostly just make the implementations use fewer instructions which makes them faster. There will be way more of this in the coming month.

  • rotate flag calculations
  • phsubsw/phaddsw
  • cmpxchg8b/16b
  • psad*
  • 8-bit, 16-bit rcr
  • fcmov
  • shld/shrd
  • movss
  • maskmovdqu
  • maskmovq
  • phminposuw
  • fild
  • PF flag calculation optimization
  • Optimizing packing RFLAGS
  • Optimize ADD/ADC OF flag packing
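
For context on the PF item in the list above: x86's parity flag is set when the low 8 bits of a result contain an even number of set bits.
A compact way to compute it is to fold the byte onto itself with XORs, as sketched below. This shows the flag's definition, not the exact
instruction sequence FEX emits.

```python
def parity_flag(result: int) -> int:
    """Return 1 when the low byte of result has an even number of set bits, else 0."""
    low = result & 0xFF
    low ^= low >> 4  # fold the high nibble into the low nibble
    low ^= low >> 2
    low ^= low >> 1  # bit 0 now holds the XOR of all eight bits
    return (~low) & 1
```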

Fixes bug in SSE4.2 pcmpestri

This was causing Java applications to crash. Now that we fixed a different bug last month, we now have Java working to an extent. It still crashes on
shutdown which is interesting and not all games are expected to work. But good luck testing random Java games!

Pack NZCV flags

This is the first step towards FEX generating x86 flags in a more optimal way. These flags match the ARM flags fairly closely and can be emulated in a
more optimal way if we pack them together. This is likely what causes the regression in bytemark, but since this is an intermediate step it is
expected to go away with the next optimization after this. Look forward to future optimizations that make this faster!
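
As a sketch of what packing means here: Arm64 keeps its N, Z, C and V condition flags in bits 31 down to 28 of the NZCV register, and x86's
SF, ZF, CF and OF line up with them fairly directly. The helper names below are illustrative. Note that the two architectures disagree on the
carry flag's meaning after a subtraction (Arm64 sets C when there is no borrow), which this sketch deliberately leaves out.

```python
N_BIT, Z_BIT, C_BIT, V_BIT = 31, 30, 29, 28


def pack_nzcv(sf: int, zf: int, cf: int, of: int) -> int:
    """Pack four one-bit x86 flags into the Arm64 NZCV bit layout."""
    return (sf << N_BIT) | (zf << Z_BIT) | (cf << C_BIT) | (of << V_BIT)


def unpack_nzcv(nzcv: int) -> tuple:
    """Recover (sf, zf, cf, of) from a packed NZCV value."""
    return tuple((nzcv >> bit) & 1 for bit in (N_BIT, Z_BIT, C_BIT, V_BIT))
```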

Remove weak symbol declarations in thunks

A bug that cropped up in thunks has been a crash that occurs when trying to use thunks from Ubuntu's PPA system. This has been a major thorn in FEX's
side for months because once you rebuild the project locally, it would never reproduce. The problem stems from the fact that clang would decide that
it can inline a "weak" symbol if its implementation is visible. This would only occur on Canonical's ARM builders, potentially due to whatever device
they use to compile the code on. This would cause our thunks to crash almost immediately if a user tried them from the PPA system. We have now worked
around this clang quirk and this will now fix thunks when enabled from the PPA system.

Mingw build work

As part of FEX's effort towards supporting running as a WINE dll, we have been slowly adding support for compiling FEXCore as a Windows DLL.
This month we have removed a bunch of Linux assumptions and API usages from FEXCore and moved them to the frontend FEXInterpreter application. In
doing so, FEXCore can now be compiled using llvm-mingw as a WINE specific DLL. This is completely unusable for users today but lays the groundwork
for what will eventually become a WoW64 integration in the future. We have also added mingw building of FEXCore to our CI so we ensure it doesn't
get broken.

To be clear, even though this work allows us to compile as a Windows DLL, this doesn't allow us to run under Windows. FEX still does a bunch of things
that are Linux specific inside of the code.

ARMEmitter cleanups

Another improvement that doesn't affect our users, but one worth a shoutout on behalf of our developers. @Lioncache has spent a good amount of time
this last month adding missing instructions and aliases to our AArch64 code emitter. While our code emitter covers a
decent amount of the AArch64 instruction space, it takes time to ensure full coverage. Whenever we're writing code for our JIT and an instruction is
missing, it slows down whatever we are working on. So kudos for improving our coverage because it makes everyone's lives easier.

Implement missing accept4, recvmmsg, sendmmsg for 32-bit socketcall

In a recent Steam client update, it started using accept4 for some background thing. This would cause it to spam a bunch of logs when failing to
accept some connection. It was a simple fix for a few missing system calls, and Steam is now no longer complaining loudly.

Fix variadic packing in X11 thunking

WINE had broken X11 thunking for all of FEX's history without any indication as to why. We never had time to look in to this but this last month we
finally hit a game that crashed which made this easier to debug. This bug occurred because WINE is one of the few applications that pass more than
seven arguments through a few variadic API calls. This triggered a bug in FEX's variadic repacking code once we started packing the arguments on to
the stack. With this fixed, WINE X11 thunking now works in significantly more games. This means that both OpenGL and Vulkan applications can be
thunked under WINE.

Fixes dead context store elimination pass

This optimization pass removes redundant stores to FEX's CPU context state. While this usually doesn't save much, it can improve performance for some
edge cases in FEX's JIT. While this is a performance optimization, it likely won't affect many things.
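
The general idea behind this kind of pass can be shown with a toy version that works on a single block of ("store", field, value) and
("load", field) tuples. This is a simplified sketch with a made-up operation format. A real pass also has to worry about anything that can
observe the context mid-block, such as faults and signals, which is not modelled here.

```python
def eliminate_dead_stores(ops: list) -> list:
    """Drop stores that are overwritten by a later store before any load of the same field."""
    overwritten = set()  # fields that get stored again later without being read first
    kept = []
    for op in reversed(ops):
        kind, field = op[0], op[1]
        if kind == "store":
            if field in overwritten:
                continue  # a later store clobbers this value before anyone reads it
            overwritten.add(field)
        elif kind == "load":
            overwritten.discard(field)  # earlier stores to this field are now observable
        kept.append(op)
    kept.reverse()
    return kept
```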

Fix 16-bit POPA instruction

This instruction was accidentally zero extending the 16-bit value in to the 32-bit register. We now insert the 16 bits as expected. This fixes an
issue with OpenAL in some cases.
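
In terms of bit manipulation, the difference between the two behaviours looks like the sketch below. The function names are illustrative.

```python
def write16_zero_extend(reg32: int, value16: int) -> int:
    """Buggy behaviour: the upper 16 bits of the register are lost."""
    return value16 & 0xFFFF


def write16_insert(reg32: int, value16: int) -> int:
    """Expected behaviour: only the low 16 bits of the register change."""
    return (reg32 & 0xFFFF0000) | (value16 & 0xFFFF)
```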

Raw Changes

FEX - FEX-2307

Published by Sonicadvance1 over 1 year ago

Read the blog post at FEX-Emu's Site!

This release we had a bit of a slower month as some larger pieces were being worked on, but we still have some good stuff that is worth talking about.

Implement per-instruction RIP reconstruction

This was a fairly curious bug that FEX encountered. When trying to run the game Ultimate Chicken Horse, the game would crash very early in its
startup. While investigating the game we determined that this was one of the first games we tested that uses Unity Engine's AOT scripting
reflection(?) mechanism. This codepath seemingly relies heavily on either tagged pointers or some other
mechanism that causes a SIGSEGV when accessing it the first time. After that point the Unity AOT will catch the SIGSEGV and, depending on the RIP of
the instruction, it will change behaviour. One of the problems with FEX is that on synchronous faults like SIGSEGV, we don't yet support full state
reconstruction. Since it seems like this only relies on RIP being correct, we can fairly easily wire this up and get Ultimate Chicken Horse running!
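
One common way to recover a guest RIP from a faulting host PC is a per-block table that records where each guest instruction's translated code
begins. The lookup below is a minimal sketch of that idea. The table layout and function name are assumptions for illustration, not a
description of how FEX stores this data.

```python
import bisect


def guest_rip_for_host_pc(entries: list, host_pc: int) -> int:
    """entries: (host_offset, guest_rip) pairs sorted by host_offset, one per guest instruction."""
    offsets = [host for host, _ in entries]
    # Find the last guest instruction whose host code starts at or before host_pc.
    idx = bisect.bisect_right(offsets, host_pc) - 1
    if idx < 0:
        raise ValueError("host PC precedes the block")
    return entries[idx][1]
```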

AVX work completed!

This last month FEX has done the last remaining work to implement AVX. This month the remaining SSE4.2 instructions were finished,
and the prerequisite XSAVE and XRSTOR instructions were implemented. While the feature is effectively complete, we aren't enabling the
CPUID bit yet. We want to first investigate a potential crash that has cropped up in Java games due to the extension, and additionally we want
to finish up AVX2 work and enable them both in one step! Next month is looking to be the first version with AVX and AVX2 support in the source.

Fix 32-bit robust futex fetching

This issue has been a thorn in our sides for quite a while now. Usually this only ever manifested as an issue if the user was running Steam using
FEX's official PPA binaries in their setup. Once the user tried running Steam, it would crash with a really obscure message about
"Fatal error: futex robust_list not initialized by pthreads." This was something that would then never
reproduce if the code was rebuilt locally.

With a bit of poking around and using a local pbuilder version of FEX we were finally able to reproduce the error. Turns out FEX was writing a 64-bit
pointer back in to the result when the application tried querying the robust list pointer, overwriting part of the stack and corrupting its data.
This falls under one of the circumstances of "How did this ever work!?" but now with it resolved, theoretically Steam should finally work for our
users that are using the PPA build of FEX. Enjoy~!
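
The class of bug is easy to picture with a byte buffer standing in for the guest's stack. In the sketch below the guest has reserved a 4-byte
slot for the returned pointer, and writing 8 bytes tramples whatever sits next to it. The buffer layout and function names are invented for
illustration.

```python
import struct


def store_head_64(stack: bytearray, slot: int, head: int) -> None:
    """Buggy: writes an 8-byte pointer into a slot the 32-bit guest sized at 4 bytes."""
    struct.pack_into("<Q", stack, slot, head)


def store_head_32(stack: bytearray, slot: int, head: int) -> None:
    """Fixed: writes only the 4 bytes the 32-bit guest expects."""
    struct.pack_into("<I", stack, slot, head & 0xFFFFFFFF)
```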

Fix application hangs due to mutexes being locked on forks

This has been a very spicy bug that has been haunting FEX for years at this point. Whenever a modern application wants to execute a process it
will use a combination of fork and execve. Fork might end up being a vfork, or might end up being a clone syscall that does the same thing. Regardless,
fork in a threaded environment has some very strict requirements: the child can basically only do an execve afterwards. vfork even adds an
additional restriction that it can't corrupt the stack at all because it's sharing memory space.

The problem with this approach is that even if the application is only ever going to call execve after the fact, FEX needs to do a bunch of
bookkeeping or additional JIT emission and execution. This causes the problem that FEX's mutexes might end up being in an unknown state going in to a
fork, which will cause this new child process to hang indefinitely on the mutex.

To work around this issue, FEX will now globally lock all mutexes that matter, do the fork, and then immediately unlock the mutexes on the parent
side. On the child process FEX needs to be a bit mean to these mutexes, resetting them to zero to ensure no thread is holding the mutex. While this is
fairly heavy-handed this dramatically reduces how frequently FEX hangs when fork is used.
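
The lock, fork, unlock-or-reset pattern described above can be sketched as a small wrapper around a lock. This is an illustration of the
general technique using Python's threading primitives, with hypothetical method names, and not FEX's implementation.

```python
import threading


class ForkSafeLock:
    """A lock with hooks to call around fork so the child never inherits it in a held state."""

    def __init__(self) -> None:
        self._lock = threading.Lock()

    def before_fork(self) -> None:
        # Take the lock so no other thread is mid-critical-section when the fork happens.
        self._lock.acquire()

    def after_fork_parent(self) -> None:
        # The parent simply releases the lock it took in before_fork.
        self._lock.release()

    def after_fork_child(self) -> None:
        # The child only contains the forking thread, so reset to a fresh, unlocked lock.
        self._lock = threading.Lock()

    def locked(self) -> bool:
        return self._lock.locked()
```

In Python these three hooks could be registered with os.register_at_fork(before=..., after_in_parent=..., after_in_child=...) so they run
automatically around every fork.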

Specifically Steam tended to launch a bunch of background processes which would hang indefinitely, causing Proton games or downloads to never
continue. This should pretty much entirely be fixed!

Stop using faccessat2 to emulate faccessat

This was an oops on our part. faccessat2 was added in Linux kernel 5.8, so if your device was running an older kernel this syscall would /never/ work.
We didn't notice this since most of our devices are running a new enough kernel that faccessat2 just worked.

Thanks to the user that found this problem!

Handle xattr syscalls with overlayfs rootfs

Turns out that FEX had missed the various syscalls that access files to get xattr information. This was causing weird failures where some applications
would say that a file doesn't exist purely because it was in the rootfs overlay only.
Sadly Linux doesn't support *at variants of these syscalls so they aren't quite as fast as native execution, but that's fine.

Fix conflicting ARM64 register allocation

A couple months ago we added one more register to our register allocation for slightly more optimal register allocation. This broke a game called
Osmos under FEX. This is purely a bug but in resolving it, we likely fixed crashes in various
applications that we didn't notice before. Oops!

RootFS additions

This month we have a couple new rootfs images on our server that have been hotly requested! We now have an ArchLinux rootfs image and a Fedora 38
rootfs image. These haven't been as thoroughly tested as our Ubuntu images so if you find any problems with them, make sure to let us know on our
Discord.

Raw Changes

FEX - FEX-2306

Published by Sonicadvance1 over 1 year ago

Read the blog post at FEX-Emu's Site!

Another interesting month of changes for FEX-Emu! While this release is shorter than last, this also only has a month of work rather than two. We had
some great work done this month, including a bunch of plumbing that most people won't notice. Let's see what changed!

Adds support for hardware TSO memory emulation prctl

Emulating the x86 memory model is the number one thing that slows down FEX emulation today. Apple Silicon supports this memory model in hardware which
is why Rosetta on macOS can get amazing performance. With some recent changes from the
Asahi developers, FEX can ask the hardware to enable the TSO emulation bit. If the kernel reports back that hardware TSO memory is enabled, FEX can
take a more lax approach to its memory emulation, getting an automatic speedup for Asahi systems.

Additionally, not only is this a speed-up, it's required for correct emulation. When this feature isn't supported by the hardware, FEX needs to emulate
the memory model using atomics and LRCPC instructions. This absolutely demolishes performance so it is usually recommended to disable the emulation to
get "free" performance. The issue with this is that it can cause instability and crashes in the most awkward and peculiar of ways. We even found out in this last
month that Unity games with their complex buffer management are highly likely to crash due to old cachelines of data hanging around. The only fix is
to emulate the memory model using hardware TSO flags or our atomic/LRCPC path. Sadly, ARM Cortex's hardware LRCPC implementation is barely any
faster than atomics.
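
As a rough illustration of the atomic/LRCPC path described above, the sketch below expresses guest loads and stores with C++ acquire and release orderings, which is the kind of ordering an LRCPC load or a store-release gives on Arm64. The `GuestLoad`, `GuestStore`, `Mailbox`, `Publish`, and `TryConsume` names are hypothetical and this is not what FEX's JIT emits; it only shows why ordered accesses keep a "write payload, then set flag" pattern safe.

```cpp
#include <atomic>
#include <cstdint>

// On x86 every ordinary load behaves like an acquire and every ordinary
// store like a release, so an emulator has to ask for that ordering.
inline uint32_t GuestLoad(const std::atomic<uint32_t>& addr) {
  return addr.load(std::memory_order_acquire);
}
inline void GuestStore(std::atomic<uint32_t>& addr, uint32_t value) {
  addr.store(value, std::memory_order_release);
}

struct Mailbox {
  std::atomic<uint32_t> payload{0};
  std::atomic<uint32_t> ready{0};
};

// Producer: the payload store must become visible before the flag store.
inline void Publish(Mailbox& box, uint32_t value) {
  GuestStore(box.payload, value);
  GuestStore(box.ready, 1);
}

// Consumer: once the flag is observed, the payload load cannot see stale data.
inline bool TryConsume(Mailbox& box, uint32_t* out) {
  if (GuestLoad(box.ready) == 0) {
    return false;
  }
  *out = GuestLoad(box.payload);
  return true;
}
```

With relaxed (plain Arm64) accesses, a consumer on another core could observe `ready` set while still reading an old `payload`, which is the "old data hanging around" class of bug mentioned above.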

Steamwebhelper crash fix

New beta versions of Steam have started relying on AT_EXECFN existing. FEX didn't previously emulate this auxv value, which was causing steamwebhelper to crash. With this fixed, steamwebhelper is now working again.
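
For readers unfamiliar with the auxiliary vector, here is a minimal sketch of what providing AT_EXECFN means. The vector is a list of type/value pairs terminated by AT_NULL, and AT_EXECFN (type 31) holds a pointer to the pathname that was executed. The `AuxvEntry`, `BuildGuestAuxv`, and `LookupAuxv` names are hypothetical and the layout is simplified compared to a real loader.

```cpp
#include <cstdint>
#include <vector>

constexpr uint64_t kAtNull = 0;     // AT_NULL terminates the vector.
constexpr uint64_t kAtExecFn = 31;  // AT_EXECFN: address of the executed pathname.

struct AuxvEntry {
  uint64_t type;
  uint64_t value;
};

// Appends AT_EXECFN and the terminator to the entries built so far.
// `guest_execfn_addr` is assumed to be a guest address where the emulator
// has already copied the program's pathname.
inline std::vector<AuxvEntry> BuildGuestAuxv(std::vector<AuxvEntry> entries,
                                             uint64_t guest_execfn_addr) {
  entries.push_back(AuxvEntry{kAtExecFn, guest_execfn_addr});
  entries.push_back(AuxvEntry{kAtNull, 0});
  return entries;
}

// Mirrors how a getauxval-style lookup walks the vector; returns 0 when
// the requested type is missing.
inline uint64_t LookupAuxv(const std::vector<AuxvEntry>& auxv, uint64_t type) {
  for (const AuxvEntry& entry : auxv) {
    if (entry.type == kAtNull) {
      break;
    }
    if (entry.type == type) {
      return entry.value;
    }
  }
  return 0;
}
```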

More AVX work!

This has been a long time coming and we're finally almost there. After these changes FEX only needs to fix a few implementation bugs in the string
operations and implement the XSAVE instructions before allowing AVX emulation. In addition to that, FEX is also almost able to support AVX2, with the
gather load instructions being the only ones that still need to be implemented!

Implements support for XGETBV

This is a fairly simple instruction as it lets the application query which CPU features are enabled. Necessary for an application to check before
enabling any AVX usage.
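
The check an application typically performs looks roughly like the sketch below. It takes the CPUID leaf 1 ECX value and the XCR0 value (what XGETBV returns for index 0) as plain integers so the logic can be shown without inline assembly; `CanUseAvx` is a hypothetical helper name.

```cpp
#include <cstdint>

constexpr uint32_t kCpuid1EcxOsxsave = 1u << 27;  // OS has enabled XSAVE/XGETBV.
constexpr uint32_t kCpuid1EcxAvx = 1u << 28;      // CPU implements AVX.
constexpr uint64_t kXcr0Sse = 1ull << 1;          // XMM state is saved by the OS.
constexpr uint64_t kXcr0Avx = 1ull << 2;          // YMM state is saved by the OS.

// Returns true only if the CPU reports AVX and the OS has enabled the
// register state that AVX needs.
constexpr bool CanUseAvx(uint32_t cpuid1_ecx, uint64_t xcr0) {
  if ((cpuid1_ecx & kCpuid1EcxOsxsave) == 0) {
    return false;  // XGETBV must not be executed without OSXSAVE.
  }
  if ((cpuid1_ecx & kCpuid1EcxAvx) == 0) {
    return false;
  }
  return (xcr0 & (kXcr0Sse | kXcr0Avx)) == (kXcr0Sse | kXcr0Avx);
}
```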

Handle PCMPESTRM, PCMPISTRM and AVX variants

These are the remaining string instructions that FEX has implemented. While mostly implemented there are still a couple of edge case behaviours that
aren't quite correct and just need to be fixed.

Implement support for deferring asynchronous signals

This change has been a long time coming to make FEX's JIT faster in the face of handling asynchronous signals. The issue is that FEX needs to enter
code regions that are effectively "uninterruptible" until they are complete. This is basically a reentrancy problem where a piece of code could
have locked a mutex, or be non-atomically updating a container's data, when a signal occurs; the handler will then jump out of the code and potentially come back to this
corrupted state.

As an initial workaround to this problem, FEX would just disable all signals in each of these "signal-deferring regions." This had the overhead that
every region would have a system call going in to it and then another one coming out of it. If these regions were rare then
it would be a non-issue; but as is commonly the case, FEX's JIT needs to be wrapped in these signal blocks. If a game is executing a bunch of code,
this means we can be doing thousands of additional system calls per second which adds up as direct overhead.

With this new change, FEX marks that it is in a signal-deferring region with some very cheap memory accesses and if a signal doesn't occur the
overhead is negligible. In the case that a signal does occur, it will get stored to a queue, FEX will finish its signal-deferring region, and then
come back to handle the signal.
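
The scheme above can be sketched in a few lines. This is a simplified illustration rather than FEX's implementation: it uses a single pending flag instead of a real queue, one thread, and hypothetical names (`SignalDeferRegion`, `HostSignalHandler`, `HandleGuestSignal`).

```cpp
#include <csignal>

static volatile std::sig_atomic_t g_defer_depth = 0;  // >0 while in a region.
static volatile std::sig_atomic_t g_pending = 0;      // A signal was deferred.
static volatile std::sig_atomic_t g_handled = 0;      // Signals actually handled.

// Stand-in for the expensive work of delivering a signal to the guest.
static void HandleGuestSignal() { g_handled = g_handled + 1; }

// Host handler: inside a deferring region it only records the signal,
// which costs a couple of memory accesses and no system calls.
extern "C" void HostSignalHandler(int) {
  if (g_defer_depth > 0) {
    g_pending = 1;
    return;
  }
  HandleGuestSignal();
}

// RAII marker for an "uninterruptible" region. Entering and leaving is
// just a counter update; the deferred signal is handled on exit.
struct SignalDeferRegion {
  SignalDeferRegion() { g_defer_depth = g_defer_depth + 1; }
  ~SignalDeferRegion() {
    g_defer_depth = g_defer_depth - 1;
    if (g_defer_depth == 0 && g_pending) {
      g_pending = 0;
      HandleGuestSignal();
    }
  }
};
```

Installing `HostSignalHandler` with `std::signal` and raising a signal inside a `SignalDeferRegion` scope leaves it pending until the scope ends, at which point it is handled exactly once.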

This mostly works because asynchronous signals don't have guarantees about the timeliness of the signal being delivered. Sadly this can result in
signal queue depths being subtly incorrect but we are monitoring the situation to know if any game is affected. All in all this finally makes it so
FEX can be straced without being overwhelmed and improves stutter problems!

Grand Theft Auto 5 AVX fix

FEX was accidentally reporting support for the BMI1 and BMI2 CPU instructions. These extensions require AVX to be implemented as well.
This was causing Grand Theft Auto 5 to crash early trying to use AVX. We will now only report these extensions if emulated AVX is supported, which fixes this game.
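
A minimal sketch of that gating, assuming the emulator builds its CPUID leaf 7 EBX value from a feature mask. The `FilterLeaf7Ebx` helper is hypothetical; the bit positions are the architectural ones for BMI1 and BMI2.

```cpp
#include <cstdint>

constexpr uint32_t kCpuid7EbxBmi1 = 1u << 3;  // CPUID.(EAX=7,ECX=0):EBX bit 3.
constexpr uint32_t kCpuid7EbxBmi2 = 1u << 8;  // CPUID.(EAX=7,ECX=0):EBX bit 8.

// Hides BMI1/BMI2 from the guest unless emulated AVX is also available,
// so applications never see BMI without AVX.
constexpr uint32_t FilterLeaf7Ebx(uint32_t ebx, bool avx_supported) {
  return avx_supported ? ebx : (ebx & ~(kCpuid7EbxBmi1 | kCpuid7EbxBmi2));
}
```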

Make vfork wait for the process to exit

FEX's previous implementation of vfork actually behaved like fork. The difference between these two syscalls is fairly subtle. In the case of vfork, the parent process will end up sleeping until the child either exits or executes execve.
We were instead treating this like a fork, where the parent continues immediately without waiting. While no known issues were encountered, it is good
to ensure this behaviour is correct for future work.

Getdents optimization

This classic syscall is used for querying directory contents. FEX needs to emulate it since 64-bit applications now use a newer syscall called
getdents64. FEX's original implementation was fairly slow due to a misunderstanding as to how this syscall worked. It would create a temporary working
buffer and copy data around a couple of times. The new implementation uses the buffer provided by the application and does some
minor fixups, which makes the overhead fairly light. This improves performance when an application is doing heavy folder scanning, which mostly means
it improves Proton startup time.
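
To give an idea of what an in-place fixup can look like, here is a sketch that converts getdents64 records into the legacy 64-bit getdents layout inside the same buffer. It relies on the two record layouts differing only in where the type byte lives: getdents64 stores it before the name, while getdents stores it in the last byte of each record. `ConvertDirent64InPlace` is a hypothetical helper and this is an assumption about the approach, not FEX's exact code.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// getdents64 record: d_ino (8) + d_off (8) + d_reclen (2) + d_type (1) + name.
constexpr std::size_t kDirent64NameOffset = 19;
// Legacy getdents record: d_ino (8) + d_off (8) + d_reclen (2) + name,
// with d_type stored in the final byte of the record.
constexpr std::size_t kDirentNameOffset = 18;
constexpr std::size_t kReclenOffset = 16;

// Rewrites `size` bytes of getdents64 records into getdents records without
// a temporary buffer. Record lengths are left unchanged.
inline void ConvertDirent64InPlace(uint8_t* buffer, std::size_t size) {
  std::size_t offset = 0;
  while (offset + kDirent64NameOffset <= size) {
    uint8_t* record = buffer + offset;
    uint16_t reclen = 0;
    std::memcpy(&reclen, record + kReclenOffset, sizeof(reclen));
    if (reclen <= kDirent64NameOffset || offset + reclen > size) {
      break;  // Malformed or truncated record; stop rather than overrun.
    }
    const uint8_t type = record[kDirent64NameOffset - 1];
    // Slide the name (and its padding) down by one byte...
    std::memmove(record + kDirentNameOffset, record + kDirent64NameOffset,
                 reclen - kDirent64NameOffset);
    // ...and park the type in the last byte, where getdents expects it.
    record[reclen - 1] = type;
    offset += reclen;
  }
}
```

Because the name plus its terminator always ends at least one byte before the end of a getdents64 record, the shifted name never collides with the relocated type byte.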

Minor optimizations

There were a handful of minor optimizations that improve performance so minorly that it falls within noise, but is nice to have.

Optimize ARM64 thunk trampolines

This is a very small optimization that changes an indirect load in to a PC relative load, removing a single data dependency.

Minor x87 FCMOV optimization

FEX was duplicating a mask from a GPR in to a vector register using two instructions and now it only uses one instruction.

Optimize ADC/ADD OF flag calculation

This was a small mistake where a bitwise negate was using two instructions instead of one.

Optimize EFLAG unpacking

Each time FEX was unpacking the EFLAG register it was using four instructions per bit of the flag. This has now been improved to only two, cutting
flag unpacking to 82% of its original size in some edge cases.
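
For context, "unpacking" here means splitting the packed EFLAGS value into one storage slot per flag. A minimal sketch is shown below; the idea that the two remaining instructions are a single bitfield extract plus a store is an assumption on our part, and `UnpackedFlags` and `UnpackEflags` are hypothetical names.

```cpp
#include <cstdint>

// Architectural bit positions of the arithmetic flags inside EFLAGS.
constexpr unsigned kCfBit = 0;
constexpr unsigned kPfBit = 2;
constexpr unsigned kAfBit = 4;
constexpr unsigned kZfBit = 6;
constexpr unsigned kSfBit = 7;
constexpr unsigned kOfBit = 11;

struct UnpackedFlags {
  uint8_t cf, pf, af, zf, sf, of;
};

// One shift-and-mask per flag; on Arm64 this can be one bitfield extract.
constexpr uint8_t ExtractFlag(uint32_t eflags, unsigned bit) {
  return static_cast<uint8_t>((eflags >> bit) & 1u);
}

inline UnpackedFlags UnpackEflags(uint32_t eflags) {
  return UnpackedFlags{
      ExtractFlag(eflags, kCfBit), ExtractFlag(eflags, kPfBit),
      ExtractFlag(eflags, kAfBit), ExtractFlag(eflags, kZfBit),
      ExtractFlag(eflags, kSfBit), ExtractFlag(eflags, kOfBit)};
}
```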

Support emulated Linux kernel versions up to 6.2

FEX used to max out the reported kernel version up to 5.18. Now we can report up to 6.2 with this change. 6.3 is going to be harder since it
introduces a new prctl that FEX needs to work around.
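
Conceptually this is a clamp: the guest is told the host's kernel version unless the host is newer than what the emulator knows how to emulate. A small sketch with hypothetical helper names and a simple major/minor packing:

```cpp
#include <algorithm>
#include <cstdint>

// Packs major.minor so that versions compare correctly as integers.
constexpr uint32_t KernelVersion(uint32_t major, uint32_t minor) {
  return (major << 16) | (minor << 8);
}

// Newest kernel whose syscall surface the emulator claims to cover.
constexpr uint32_t kMaxEmulatedKernel = KernelVersion(6, 2);

// The guest sees the host version, capped at the emulated maximum.
constexpr uint32_t ReportedKernelVersion(uint32_t host_version) {
  return std::min(host_version, kMaxEmulatedKernel);
}
```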

Video game showcase

As said previously, Unity engine games have issues without TSO emulation. Here is a clip of Hollow Knight running under FEX at full speed on a
Lenovo X13s. Even with the overhead of emulating the x86-TSO memory model, this game runs remarkably well. With x86-TSO emulation disabled this game
would have crashed a few seconds in.

Raw Changes

FEX - FEX-2305

Published by Sonicadvance1 over 1 year ago

Read the blog post at FEX-Emu's Site!

Welcome back to another release of FEX-Emu! We had cancelled last month's release due to a large amount of code churn happening. In order to ensure
the highest quality of stability we were forced to do so. Now we're back with an even lengthier release this month, so buckle up because there were a
large number of changes that happened.

More AVX Work!

These last two months have been a wild ride towards implementing AVX. @Lioncache has been burning down a ton of
instructions to get everything in place for AVX emulation.

New instructions implemented

  • PCMPISTRI/VPCMPISTRI
  • VPMASKMOVD/VPMASKMOVQ
  • VCVTPD2PS/VCVTPS2PD
  • VCVTSD2SS/VCVTSS2SD
  • PCMPESTRI/VPCMPESTRI
  • VMPSADBW
  • VPSLLVD/VPSLLVQ
  • VPSRLVD/VPSRLVQ
  • VCVTSI2SD/VCVTSI2SS
  • VPINSRB/VPINSRD/VPINSRQ/VPINSRW
  • VPSADBW
  • VTESTPD/VTESTPS
  • VPMADDUBSW
  • VPMOVMSKB
  • VMASKMOVPD/VMASKMOVPS

That's a whole bunch of instructions implemented! We have now nearly implemented all the instructions required for AVX.
The two major instructions remaining before AVX can be exposed are the SSE4.2 instructions PCMPISTRM and PCMPESTRM. This is because these two
instructions also have AVX versions, so they are required in order to support AVX.

We are getting really close and once this feature is done, we can quickly move on to finishing support for AVX2, F16C, and the fused
multiply-accumulate extensions. At that point our CPU emulation will be effectively "feature-complete" for everything that games will care about in
the short-term. Exciting times!

llvm-mingw and WINE support

This is a very big change that has been coming down the pipe for a while now. We have been mostly working behind the scenes to get FEX-Emu wired up so
that it can be compiled as a Windows shared library. This last month is where this work has finally come to a head and most of the work is in place
for this.

How this works is that FEX-Emu has a shared library and static library that get compiled called FEXCore. This is where all the CPU emulation
happens and it tries to be mostly OS agnostic, while everything that is Linux specific lives in the frontend called FEXInterpreter. It is FEXCore
that can now be compiled as a Windows AArch64 PE library. While this isn't currently useful to end users, it means that WINE can link to this
library for emulating x86/x86-64 on AArch64 platforms. It should be noted that there are still some Linux assumptions strewn about the code, so this
isn't a generic solution for emulation on a true Windows platform. We're writing this support specifically for WINE today.

Converting away from C++ containers that allocate memory

This is the significant change that caused us to cancel last month's release. While @Neobrain was writing code to
support 32-bit library thunking, they had discovered a very big problem. FEX-Emu has long overridden the glibc memory allocation routines in order for
us to ensure that FEX can allocate memory when emulating 32-bit applications. We discovered that this overriding also extends to system libraries that
we load in after the fact. This meant that any time libGL would allocate memory, it would end up being a 64-bit pointer and there was nothing we could
do about it.

The workaround for this problem is to stop overriding the system allocators, which will allow shared libraries to allocate memory that can safely be
used by the 32-bit guest. But this also has the problem that FEX would then run out of memory when executing 32-bit applications. This is due to a
quirk that FEX-Emu needs to allocate all the memory on the system before executing 32-bit applications.

The new workaround is to replace usage of every C++ container that allocates memory with FEX's own container that will use its own allocator. This was
an exceedingly invasive change that touches almost everything in our codebase. With the pain done, FEX now can use its own internal allocators while
system libraries will use the regular glibc allocator as expected. See more about the limitations of this with our
documentation.
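
The general technique is the standard C++ allocator hook: give the container an allocator type that routes through your own functions, then alias the container. The sketch below uses hypothetical names (`InternalAlloc`, `InternalAllocator`, `InternalVector`) and backs the allocator with malloc purely for illustration; it is not FEX's actual container library.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

static std::size_t g_internal_allocations = 0;

// Stand-ins for the emulator's private allocator entry points.
static void* InternalAlloc(std::size_t bytes) {
  ++g_internal_allocations;
  return std::malloc(bytes);
}
static void InternalFree(void* ptr) { std::free(ptr); }

// Minimal allocator that sends every container allocation through the
// private allocator instead of the global operator new.
template <typename T>
struct InternalAllocator {
  using value_type = T;

  InternalAllocator() = default;
  template <typename U>
  InternalAllocator(const InternalAllocator<U>&) noexcept {}

  T* allocate(std::size_t count) {
    void* ptr = InternalAlloc(count * sizeof(T));
    if (ptr == nullptr) {
      throw std::bad_alloc();
    }
    return static_cast<T*>(ptr);
  }
  void deallocate(T* ptr, std::size_t) noexcept { InternalFree(ptr); }
};

template <typename T, typename U>
bool operator==(const InternalAllocator<T>&, const InternalAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const InternalAllocator<T>&, const InternalAllocator<U>&) { return false; }

// Drop-in replacement for std::vector that never touches the system allocator.
template <typename T>
using InternalVector = std::vector<T, InternalAllocator<T>>;
```

The invasive part of such a change is that every `std::vector`, `std::string`, `std::map`, and so on in the codebase has to be swapped for an alias like this one.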

Re-enable glibc allocator hooking again

Okay, the previous paragraph was a ruse; FEX-Emu needed to actually override the glibc allocator again. In this case FEX-Emu will actually have three
allocators active at any given moment.

  • FEX-Emu uses jemalloc for its internal allocator.
  • The system allocator is overridden with another jemalloc allocator.
  • The guest application's glibc allocator is untouched.

The problems start occurring when a pointer is shared between thunks and the guest application. If one allocator tries to free a pointer from a
different allocator then fireworks occur. The way around this is to use a jemalloc function to determine if it owns the pointer and choose which
allocator to end up freeing the pointer from. This is particularly painful with X11 thunking because pointers are passed between client and server in
a very laissez-faire fashion. This may not stay around in the future but it is a necessary evil for now.
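
The ownership check can be pictured with the sketch below. A tiny bump arena stands in for an allocator that can answer "did I allocate this pointer?"; the real jemalloc query FEX uses is not shown here, and `BumpArena`, `g_thunk_arena`, and `FreeFromAnyAllocator` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical allocator that knows which address range belongs to it.
class BumpArena {
 public:
  void* Allocate(std::size_t bytes) {
    const std::size_t aligned = (bytes + 15) & ~std::size_t{15};
    if (used_ + aligned > sizeof(storage_)) {
      return nullptr;
    }
    void* ptr = storage_ + used_;
    used_ += aligned;
    ++live_;
    return ptr;
  }
  bool Owns(const void* ptr) const {
    const auto addr = reinterpret_cast<std::uintptr_t>(ptr);
    const auto base = reinterpret_cast<std::uintptr_t>(storage_);
    return addr >= base && addr < base + sizeof(storage_);
  }
  void Free(void*) { --live_; }  // A bump arena only tracks the live count.
  std::size_t live() const { return live_; }

 private:
  alignas(16) unsigned char storage_[4096] = {};
  std::size_t used_ = 0;
  std::size_t live_ = 0;
};

static BumpArena g_thunk_arena;

// Frees a pointer of unknown origin by asking who owns it first, instead
// of handing it to the wrong allocator.
inline void FreeFromAnyAllocator(void* ptr) {
  if (ptr == nullptr) {
    return;
  }
  if (g_thunk_arena.Owns(ptr)) {
    g_thunk_arena.Free(ptr);
  } else {
    std::free(ptr);
  }
}
```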

JIT Optimizations and improvements

Reclaim static assigned registers on 32-bit

This allows us to use 8 more general purpose registers and 8 more floating point registers with 32-bit applications. Depending on the game this can
improve performance by a decent margin. We have seen upwards of 20% performance uplift in various games due to it.

Fix Visual C++ redistributable crashing

This was a really annoying bug, where every.single.time. that Proton would run, it would try to install the C++ runtime at least four times. The user
would be required to kill the processes after they were installed. This was fairly egregious because we had thought it was fixed months ago and didn't
realize that it wasn't actually fixed. Depending on the version of the Visual C++ redistributable and Proton it would still occur.

Root-causing this issue, it turns out that the redistributable uses Windows' structured exception handling to catch the case when it passes a null pointer
to strlen, which results in a SIGSEGV on the Linux side. FEX was incorrectly saving and restoring state when this occurred, which caused it to
infinitely loop and crash. Now that this is fixed, these install correctly and Proton doesn't try doing it on every single run.

Implement REP MOVS as a memcpy

This instruction behaves like a fairly fast memory copy on the CPU. We now convert this over to an internal memory copy operation.
Similar to last month where we converted an instruction to a memset, implementing this instruction as an IR operation improves its performance
many times over. In real games this usually translates to an FPS improvement of a few percent, which is a nice uplift.
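
As a reference for what the instruction does, the sketch below spells out REP MOVS semantics element by element: copy `count` elements of `element_size` bytes, walking forwards or backwards depending on the direction flag. `RepMovs` is a hypothetical helper; a single IR operation lets the backend replace this loop with a bulk copy when it is safe to do so.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Element-by-element model of REP MOVS. `dest` and `src` point at the
// first element to copy; with the direction flag set, later elements sit
// at lower addresses.
inline void RepMovs(uint8_t* dest, const uint8_t* src, std::size_t count,
                    std::size_t element_size, bool direction_flag) {
  for (std::size_t i = 0; i < count; ++i) {
    const std::ptrdiff_t distance = static_cast<std::ptrdiff_t>(i * element_size);
    const std::ptrdiff_t offset = direction_flag ? -distance : distance;
    // memmove keeps each element copy well-defined even if the source and
    // destination ranges overlap.
    std::memmove(dest + offset, src + offset, element_size);
  }
}
```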

Fix restoring of AVX state

While not actually being utilized today (Except due to a bug), @AndreRH found out that we were accidentally failing to
restore AVX register state when a signal handler returned. It's surprising that this wasn't noticed earlier but it could have resulted in some really
bad floating point state.

Remove double syscall overhead on filesystem accesses

When an application accesses a file, FEX first needs to check whether the file exists in the overlayfs style rootfs image we provide. If the
file exists we will redirect the file to be opened from the rootfs instead of the host filesystem. We had an issue that if the file didn't exist, we
would then check for it again by accident before accessing the host file. This would mean that one syscall turned in to three. With this fix in place
we are now only converting it in to two.
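
A minimal sketch of the lookup, with the existence check passed in as a callback so the number of probes is easy to see. `ResolveGuestPath` and the `exists` parameter are hypothetical; in a real emulator the probe would be a host syscall.

```cpp
#include <functional>
#include <string>

// Resolves a guest path against an overlay-style rootfs with exactly one
// existence probe. The caller then performs the real syscall on the
// returned path, for two host syscalls in total.
inline std::string ResolveGuestPath(
    const std::string& rootfs, const std::string& guest_path,
    const std::function<bool(const std::string&)>& exists) {
  std::string candidate = rootfs + guest_path;
  if (exists(candidate)) {
    return candidate;  // Redirect into the rootfs image.
  }
  // No second probe: the host path is used as-is and the real syscall
  // reports any error itself.
  return guest_path;
}
```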

If you're running a rootfs image off of a particularly slow drive (or a network share) then this can shave a decent amount of time off of load times.
This was particularly noticeable when running a Proton game under Steam because they will access a ton of files before starting up.

Adds default DRM ioctl interface

This is a fairly basic change. Instead of breaking when hitting an unknown ioctl, pass it to the kernel and hope for the best. This is mostly so Asahi
and other drivers can test things under FEX without pushing patches to us for downstream support.

Add support for thunking Wayland

This doesn't affect most users today but adding support for thunking wayland means in the future applications that use this can sanely use this thunk.
SDL applications today might be able to take advantage of it but it is fairly fresh. We're looking forward to the inevitable Wayland and WINE
utilization to let things move away from X11.

Fixed 32-bit clock_nanosleep

There was a fairly nasty implementation detail where a 32-bit application trying to sleep with this syscall would actually consume a CPU core to 100%.
While fairly uncommon, this allows the game Alwa's Awakening to not burn a CPU core while running.

Add a bunch of functions to FEX's ARMEmitter

Not really a user facing feature but our code emitter has gained a bunch of new instruction support. This will be used in the future for our AVX2
implementation and various things. So it's good to have.

Raw Changes

FEX - FEX-2303

Published by Sonicadvance1 over 1 year ago

Read the blog post at FEX-Emu's Site!

Oh jeez, another month already? I guess it's time for another FEX-Emu release. Let's pick a commit, spin the roulette wheel, and hope for the best!
Surely that's how releases work?

Rootfs images are now on a new CDN!

While this is something that doesn't directly impact FEX when running applications, it's a problem that most of our users need to deal with when
installing FEX. Our previous CDN which was hosting our x86 images had a fair number of problems that couldn't be solved. The main issue that affected
users was that it was slow to download the images and depending where you were in the world, it could have an unstable connection. This resulted in
gigabyte sized files taking forever to download or never at all!

This month we have switched our CDN to a service that has worldwide data replication across multiple data servers. This improves the speed at which
users can download our prebuilt images. Going from an average of 20MB/s to over 300MB/s is a significant boost. In addition to that, the connection is
significantly more stable to the far corners of the world. Also something that doesn't affect users at all is that this new CDN is actually
significantly lower cost than what we were previously using. This was unexpected but it's a nice bonus that this CDN is an improvement in every regard,
including cost.

This month's code changes

With that out of the way, onward to this month's changes.

Optimize REP STOS instruction in to inline memset

This is an instruction that x86 offers that behaves similarly to a memory set operation. It behaves slightly differently since it allows you to set
the memory by element size, and you can also choose the direction in which the memory is set. In particular this instruction tends to get used for
zeroing out memory. The latest x86 CPUs have even optimized this instruction to be as fast as possible. Previously FEX had decomposed this instruction
in to a complex series of code blocks that was inefficient for our JIT and everything surrounding it. Now we instead convert this to a single IR
operation called MemSet which exposes the semantics of how the instruction works, allowing our IR to be cleaner and the backend to decompose it in
a more optimal fashion. Currently we emit a fairly trivial loop that handles this memory set operation. ARM has recently announced that future CPUs
are going to support a memory set instruction that is very similar to the 8-bit REP STOS which will make this implementation even faster!
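
To make the "element size and direction" semantics concrete, here is a sketch of the kind of trivial loop mentioned above, written in C++ rather than emitted Arm64 code. `RepStos` is a hypothetical helper; the element type plays the role of the operand size and the boolean plays the role of the x86 direction flag.

```cpp
#include <cstddef>
#include <cstdint>

// Stores `value` into `count` consecutive elements starting at `dest`.
// With the direction flag set, each following element is at a lower address.
template <typename T>
inline void RepStos(T* dest, T value, std::size_t count, bool direction_flag) {
  for (std::size_t i = 0; i < count; ++i) {
    T* slot = direction_flag ? dest - i : dest + i;
    *slot = value;
  }
}
```

For byte-sized elements going forwards this is exactly a memset, which is why the common "zero this buffer" case maps so well onto a dedicated memory set operation.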

As seen by this graph, FEX is nowhere near a native implementation. It's important to note that even without writing "optimal" codegen, this change
has still given FEX up to an 11% performance improvement on its implementation. Since this was primarily focused around improving the IR, we can now
optimize the code that the JIT emits significantly more easily! Getting closer to native is likely something to come in the
future.

Add config option to hide hypervisor CPUID bit

We encountered the first game that has anti-virtual machine code and refuses to run if it thinks it is running in a VM. While FEX isn't a virtual
machine, we expose this CPUID bit so software that cares can use it as a hint to query FEX specific CPUID information. Now that this game has stumbled
upon this issue, we added a configuration profile to disable this CPUID bit for the game. If any other games also pick up on this issue then we will
need more profiles.

Proton and pressure-vessel startup optimizations

One of this month's efforts has been improving the time it takes for Proton to start up. pressure-vessel is the project that is used to set up the
Proton execution environment, which takes a while overall. One of the hardest things about Proton is that it executes thousands of programs and does an
absolute ton of filesystem accesses. ARM devices typically don't have the highest performance filesystems, which makes one part of this hard, but also
FEX's filesystem overlay adds overhead to this. Additionally one of FEX's shortcomings currently is that every application execution must JIT fresh
code every time it restarts. Since pressure-vessel starts so many programs, a lot of the time is just spent emitting code to memory. There were a few
optimizations that went towards making this faster this month.

With the couple of optimizations in place we managed to shave a second off of the start-up time. Cutting the execution from 9.7 seconds down to 8.7
seconds. Or in the case of running on an Apple M1, execution is now down to 7 seconds. Almost all of this time improvement comes from faster syscall
wrapping and the remaining CPU time is code JIT and execution. It'll only get faster in the future!

Fix a race condition with syscall emulation

While this is a fairly minor change, we fixed a race condition around system calls which would consistently cause crashes when Steam was starting up.
Every piece of work that improves stability just makes the whole emulation experience so much better and needs to be celebrated!

Signal frame improvements!

A significant problem with using FEX is the debugging experience when something breaks. We spent a good amount of time this month improving how FEX
sets up its signal frames when the guest application hits a fault. Since we weren't following traditional signal frame generation, tooling around
backtracing was broken in most cases. We have now reworked this so that libSegFault will now work to give FEX a backtrace of the application's
state when it crashes.

We will be shipping a new rootfs which includes x86 and x86-64 libraries for libSegFault so that if users want to debug a crashing application, they
can try and get a backtrace.

AVX work continues

Another month, another bunch of AVX work that has been implemented.

Instructions implemented

  • VPHSUBSW
  • VHSUBPD/VHSUBPS
  • VPERMILPD/VPERMILPS
  • VPERMD/VPERMPS
  • VPHADDSW
  • VPTEST
  • VPMOVSD/VPMOVSS
  • VSHUFPD/VSHUFPS
  • VPSHUFD/VPSHUFHW/VPSHUFLW
  • VPSHUFB
  • VPALIGNR
  • VEXTRACTF128/VEXTRACTI128
  • VPBLENDVB/VBLENDVPD/VBLENDVPS
  • VBLENDPD/VPBLENDW

As you can see, a lot of new instructions are now implemented. This now leaves us with about thirty more instructions that need to be implemented
before we can start advertising the features on SVE2-256bit supporting hardware. This is significant as we keep finding more and more games that are
requiring AVX to run.

ARM emitter cleanups

Another change that isn't user facing but is always nice to point out some janitorial tasks that have been done. When we switched over to using our
own code emitter there were some design choices and implementations that weren't quite optimal. This usually culminates as developer pain when using
the emitter but was a necessary evil since we wanted to get rid of VIXL's assembler as fast as possible. @Lioncache
spent some time this month cleaning up a lot of the dirty code in the emitter, in some cases making it slightly faster as well. This is always greatly
appreciated as it reduces maintenance burden when working in the JIT.

They also implemented an absolute ton of new instruction emitter functions which previously didn't exist. While we don't use these yet, we will likely
use them at some point which will make our lives easier in the future.

New development machines for our developers

Just recently a new Snapdragon laptop has gotten working OpenGL and Vulkan drivers up and running! We are gifting each of our developers one of these
great machines in order to ensure we have testing platforms for all the OpenGL 4, DXVK, and VKD3D applications we want to be running! Kudos to all the
developers that worked on bringing this hardware up so quickly!

Raw Changes

FEX - FEX-2302

Published by Sonicadvance1 over 1 year ago

Read the blog post at FEX-Emu's Site!

This month certainly passed in the blink of an eye. A lot of good bug fixes this month as usual! Continue reading to find out more.

Fix incorrect operation for cache line clears

In emulating the CLFLUSH instruction, FEX was using the wrong operation for clearing caches. We were accidentally using the CVAU operation instead of CIVAC.
While this is incorrect, it was hard to find anything that was actually affected by the wrong implementation. With Snapdragon's open source Vulkan driver implementing what is required for VKD3D,
it became evident from Vulkan tests that this was incorrectly implemented. Switching the implementation is easy and will let VKD3D run without hacks
when the required feature is finished.

Bug fixes to 64-bit x87 emulation

A big thanks to CallumDev for finding and fixing these latest bugs in FEX's less accurate x87 emulation. As a
reminder, x87 on original hardware operates using 80-bit float values. This is a feature that ARM doesn't natively support, so FEX needs to emulate
this using a software floating point library. We have a hack in our configuration to allow removing this software implementation and instead operate
using 64-bit double operations instead. This can significantly improve performance in some 32-bit games but introduce rendering artifacts.

This month there were many bug fixes:

  • ALU operations that consume integers converted to floats are fixed
  • Float comparison that also consumes 16-bit integers fixed
  • FPREM instruction no longer infinite looping

With these fixes in place, a large number of games now actually render correctly with this hack enabled. It will be interesting to see how well this
improves performance or battery savings in 32-bit games!

More AVX instructions emulated

With one of FEX's developers taking some time away, this was a little less involved than the last couple of months.
There were still a handful of instructions implemented:

  • VPBLENDD, VBLENDPS, and VPSRAVD

Additionally, while these aren't AVX instructions, we also implemented the CLWB and CLFLUSHOPT instructions. These match their ARM equivalents so it was
mostly an easy implementation that applications can use if they want.

Fix copy and paste error in Arm64 JIT

While this is a fairly minor issue, we had a copy and paste error in FEX's register spilling code. This caused Steam to crash in certain situations,
so fixing this since the previous release helps users wanting to run that.

A bunch of minor optimizations

This month had a bunch of small optimizations around the entire project. Alone these are all quite minor but added together should result in a couple
percentage of CPU time removed from FEX's JIT.

  • Arm64 Dispatcher is slightly faster
  • CPUID emulation initialization is faster
  • Optimize File loading, improving config loading time
  • Frontend instruction decoder optimizations to be faster
  • Makes IR operations 1 byte smaller, improving memory usage
  • Inline IR constants optimization to reduce IR memory size

Fixing thunk symbol override fetching

FEX's thunks had an issue where if a library was loaded, we would only ever fetch relevant symbols from that library directly. While this worked for
our use case, it breaks when wanting to use MangoHud in OpenGL applications. Resolving this issue fixes most things that will override symbols with
LD_PRELOAD.

Update JEMalloc from 5.2.1 to 5.3.0

While this is a fairly minor change, this release of JEMalloc fixes some bugs and improves performance. Small but every performance improvement is
welcome.

Support for execveat with AT_EMPTY_PATH

This is an interesting feature where an application can be executed directly through a file descriptor instead of a filepath on disk. This is a fairly
simple idea but has some interesting edge cases that might be interesting to some people. To see the more technical information about implementing
this, check out the pull request.

Raw Changes

FEX - FEX-2301

Published by Sonicadvance1 almost 2 years ago

Read the blog post at FEX-Emu's Site!

Happy new year! A new month brings a new release of FEX-Emu, bringing in the new year.

A large amount of work in this last month, showing that FEX-Emu isn't slowing down even through the holiday season.

AVX emulation work continues

An absolute ton of work landed this last month towards bringing up AVX emulation. In total there were around 185 new
AVX instructions implemented in FEX-Emu's backend this month. At this point it starts becoming easier to talk about the number of missing instructions
rather than what is implemented.

According to FEX-Emu's instruction decoder tables, we have around 60 more instructions to implement before we can start advertising the feature. Of course
with anything programming related, the last 10% is going to take the longest to implement.

A huge shoutout to @lioncash for smashing out these implementations so quickly. The amount of work going in to this is
extensive.

As a side-note for users looking forward to this feature. The implementation requires hardware that supports both SVE and SVE2 with a 256-bit register
width now. Which means that Fujitsu A64FX, Neoverse-V1, and all current consumer class Cortex chips are incapable of taking advantage of AVX once
complete. This is a future proofing implementation for when future hardware becomes available that supports what FEX-Emu needs.

Implement a new AArch64 code emitter

One thing that has been a stand out performance bottleneck has been how quickly FEX-Emu can emit AArch64 binary code to memory. The project that
FEX-Emu used for this is ARM/Linaro's project called vixl. This project is a suite of tools including assemblers,
simulators, and disassemblers and many open source projects do use this. This is a very nice project that eases the developer's burden when writing a
JIT that targets ARM devices. Sadly when profiling our code, it turns out that FEX-Emu spends a decent amount of time inside of vixl code due to how
obtusely large it is. Even with Link-Time-Optimization enabled in our code, we can't reduce the overhead incurred from vixl sadly.

With this in mind, FEX-Emu decided to create its own AArch64 code emitter tailored to what the project needs, which is high performance and low
overhead.

As seen in the chart above, the difference in how long it takes to emit code between vixl and our new emitter is significant, with the
Cortex-X1 only taking 68.7% of the time, and a smaller Cortex-A55 only taking 60.2% of the time. The Cortex-A55 having more of a win showcases
that due to how much code vixl takes to emit code, it is effectively saturating the icache and
BTB of the poor little CPU core.

Code emission performance isn't the only story that matters here though. We also need to show how much of an improvement this is once the rest of the
translation from x86 code is included.

Although code emission is only a percentage of our total time spent when translating x86 code, this new emitter delivers a fairly massive ~8%
reduction in time spent JITing. This will manifest as reduced stutters when users are running games and generally faster application execution for
short-lived applications.

We're not stopping there, of course. Look forward to the coming months as we spend more time optimizing our JIT so it runs even faster!

Initial 32-bit thunk support

A tricky feature that FEX-Emu does with its emulation is that it is translating 32-bit x86 applications to run inside of a 64-bit process space. This
is a hard problem to resolve which is why we don't currently support thunking of libraries when running 32-bit applications. This is the initial work
required to start supporting this use case.

While not wired up to any library currently, we are quickly working towards getting Vulkan and OpenGL wired up to this interface so we can accelerate
older 32-bit games.

Various JIT optimizations

There have been various JIT optimizations this month which will improve performance a small amount. These aren't benchmarked since the percentage
improvements are so small that they would likely fall into single-digit noise.

Optimize inline syscall spilling

When FEX handles a syscall inline with our JIT, we were spilling all of our registers to memory. Now with this optimization correctly working we only
spill exactly what is required, making inline syscalls faster.

Optimize generic spilling and filling

When jumping out of the JIT to C code, we need to spill both general purpose registers and vector registers to the stack. With this optimization in place we now
generate roughly half the instructions necessary when doing so.

Optimize SVE register spilling and filling

While currently not utilized today, this cuts the number of instructions required for spilling SVE registers to a quarter. Should be quite nice for
future hardware.

Zip elements for PHSUB instructions

These horizontal vector instructions behave a little weirdly and our original JIT implementation wasn't quite optimal. Previously we were doing
explicit element inserts to combine the final result. Now we are using the AArch64 Zip instructions, which are significantly more efficient.

Fix global application configurations

This was a bug where we accidentally broke application configurations shipped with the fex-emu package. In particular this caused the steamwebhelper
to break. With this resolved, Steam will work correctly again.

Fix misspelled library names in Thunks Database

While a fairly minor fix, this can have a profound impact on users that are using our thunking infrastructure. Our XCB thunks were incorrectly named,
which meant that if users were enabling XCB thunks independently of Vulkan/GL, then they wouldn't have actually been enabled.
With this typo fixed, this is no longer a concern.

Note that if Vulkan or GL thunks were enabled, then this likely wouldn't have been an issue since X11 would have loaded xcb independently anyway.

Misc

There was a bunch more this month that was smaller and spread out. We don't want to take up too much of your time so if you want to see more, make
sure to check out the detailed change log!

Raw Changes

FEX - FEX-2212

Published by Sonicadvance1 almost 2 years ago

Read the blog post at FEX-Emu's Site!

A lot of good work this month with the highlight being that we have started working on our AVX implementation and started optimizing our IR to be more efficient.

Disable PCLMUL if not supported on host

This carry-less multiplication instruction is only implemented on ARM SoCs that ship the cryptographic extension.
This extension is unsupported on the Raspberry Pi, which was causing applications that use OpenSSL to crash.
Specifically this fixes Steam running on the Raspberry Pi again.

Adds 256-bit support to the remaining IR vector ops

A lot of work this month for implementing support for 256-bit operations.
With this work in place our JITs now support 256-bit for all of the IR operations.

Work started on AVX emulation

With the previous work completed for having our JITs support 256-bit operations, work could now be started on implementing AVX.
This AVX work is implemented as native SVE 256-bit operations, so the only hardware that can currently execute this partial implementation is Neoverse-V1 CPUs.
The expectation is that as ARM CPUs become more powerful, they will eventually support SVE with 256-bit sized registers.
It may take a few generations to get hardware that supports this. If ARM CPUs want to run AVX games then they will need to support the equivalent hardware feature-set.

Current instructions implemented:

  • VZEROUPPER, VZEROALL
  • VMOVAPS, VMOVQ
  • VMOVNTDQ, VMOVNTDQA, VMOVNTPD, VMOVNTPS
  • VMOVDQA, VMOVDQU
  • VMOVAPD, VMOVUPD, VMOVUPS
  • VMOVLPD, VMOVLPS
  • VMOVSHDUP, VMOVSLDUP
  • VMOVHPD, VMOVHPS
  • VMOVDDUP
  • VORPD, VORPS, VPOR
  • VPXOR, VXORPD, VXORPS
  • VANDPD, VANDPS, VPAND, VANDNPD, VANDNPS, VPANDN
  • VADDPD, VADDPS, VPADDB, VPADDW, VPADDD, VPADDQ

This is just the beginning of us implementing support for this, stay tuned as we implement the remaining operations over the next few months.

Generate register access IR operations directly

As an original implementation design detail, FEX implemented GPR and XMM register accesses as a generic emulated CPU state access. Once we added
static register allocation we also added an optimization pass to convert these generic accesses into register accesses which directly map to our
static register allocator.

This is a redundant pass since we know upfront which registers are being accessed. With this change we generate register access IR operations
directly and have removed the optimization pass. This removes around 12% of JIT compilation time, which improves responsiveness and lets FEX spend
less time compiling code.

Systemd fixes

While this is a niche supported operation, some people may be interested in running FEXServer as a systemd service.
A FEXServer is meant to be a user-wide server that the FEX clients talk to for rootfs and eventually other management.
Using a systemd user service, a FEXServer can be started early, letting it mount the rootfs image, and run in the background.
This can be fairly useful as FEX error logs can then be printed to journalctl, where they can be inspected to find out why a process has crashed.

Add support for steamid based configuration files

As an ongoing effort of documenting which applications can run with FEX's OpenGL and Vulkan thunk libraries, it was determined that some applications
use generic executable names. This means that a configuration file that uses the application name would have erroneously enabled thunks for other
untested applications.

In order to work around this issue, our configuration system now supports an optional steamid based naming convention for games that are launched from
Steam. With this in place, we now have a repository that contains application configurations that users can install at their leisure. This repository
can be found on GitHub.

As part of the documentation process, all of these configurations must be documented on our Wiki with
testing results to ensure it works.

Implement SGDT

This is a quirky instruction that is emulated on a native x86 system these days. This instruction is a system instruction that is used by the OS for
getting the configuration of the global descriptor table. Linux captures this instruction and returns a configuration that says the table is living in
kernel memory space. While this is already true, an application usually doesn't need to care about this data.

Curiously enough Denuvo uses this instruction in some of their implementations for some reason. With us implementing this instruction, Denuvo games
now get slightly further before they horribly crash.

auxv fixes

When FEX executes an application, it needs to set up an emulated auxv state since this isn't a cross-architecture state.

  • AT_RANDOM
    • This now correctly passes through the host's AT_RANDOM value rather than fixed values
  • AT_PLATFORM
    • Some tooling uses this to determine if it is running as i686 or x86-64
  • AT_HWCAP/HWCAP2
    • This just returns some CPUID values, most applications use CPUID directly instead of this
  • AT_MINSIGSTKSZ
    • The minimum signal stack size is no longer a hardcoded constant
    • Applications are supposed to use this to calculate a signal stack size
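
For readers who haven't looked at auxv before, the following is a small sketch of how the auxiliary vector is laid out: a sequence of (type, value) pairs terminated by an AT_NULL entry. The constants are the values from the Linux auxvec headers; the `parse_auxv` helper and the sample values are hypothetical and are not FEX code.

```python
import struct

# Entry types from the Linux auxvec headers.
AT_NULL = 0
AT_PLATFORM = 15
AT_HWCAP = 16
AT_RANDOM = 25
AT_HWCAP2 = 26
AT_MINSIGSTKSZ = 51


def parse_auxv(raw: bytes, word_size: int = 8) -> dict:
    """Parse a raw auxv blob into {type: value} (hypothetical helper).

    word_size is 8 for a 64-bit process and 4 for a 32-bit one.
    """
    entry = struct.Struct("<QQ" if word_size == 8 else "<II")
    usable = len(raw) - (len(raw) % entry.size)
    entries = {}
    for a_type, a_val in entry.iter_unpack(raw[:usable]):
        if a_type == AT_NULL:
            break  # AT_NULL terminates the vector
        entries[a_type] = a_val
    return entries


# Synthetic example: a pointer-like AT_PLATFORM value and a signal stack size.
sample = struct.pack("<6Q", AT_PLATFORM, 0x7FFF0000, AT_MINSIGSTKSZ, 3632, AT_NULL, 0)
parsed = parse_auxv(sample)
print(parsed)
```

On a Linux system the same helper can be pointed at the contents of /proc/self/auxv. Note that AT_PLATFORM and AT_RANDOM hold pointers into the process's memory rather than the string or random bytes themselves.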

Support radeon drm driver in ioctl emulation

Most Radeon GPUs these days use the amdgpu kernel driver, but a user found a hole in our ioctl emulation by using an old Radeon GPU on a Phytium ARM
board.

With this in place, older Radeon cards that use the radeon kernel driver can now have accelerated OpenGL.

Misc optimizations

This month we have had a random smattering of optimizations that improve startup, shutdown, and execve performance. While not individually providing a
lot of benefit, small optimizations like these add up to make FEX better over time.

  • Defer cpuinfo file initialization until first access
    • Improves startup time
  • Use tsl::robin_map for some internal maps
    • Improves JIT time, and some minor shutdown performance improvements
  • Disable multiblock by default
    • This causes excessive JIT overhead which makes the experience worse for the user
    • Significantly reduces stutters
  • Improve hot path of file existence checking in syscall wrapping
    • During our overlayfs handling, this can be hit quite hard during file accesses
    • Improves file IO in applications

Raw Changes

FEX - FEX-2211

Published by Sonicadvance1 almost 2 years ago

Read the blog post at FEX-Emu's Site!

A lot of good changes this month for our users. Both performance and compatibility improvements to be had!

Segment register index optimization

This optimization has been a long time coming, sitting in pull-request limbo since back in April. It caches segment register addresses so the JIT can
generate memory accesses more optimally. While segment registers are mostly gone with x86-64, 32-bit segment registers are used fairly commonly, with
some instructions using them completely implicitly. Without caching, this just adds overhead to fetch the LDT and GDT entries for something that
typically doesn't change very quickly.

With this optimization in place, we get an average of 4.3% uplift in 32-bit Bytemark. This performance improvement will be directly felt when
running 32-bit applications.

48-bit Proton Experimental fixes

For a while now FEX has worked with Proton 7.0 and older, but we have had issues running Proton Experimental in some cases.
This was a tricky problem to nail down but we had some good leads. If your ARM device was running its kernel with 48-bit Virtual address space (VA)
enabled then Proton Experimental wouldn't work. On the other hand, if your kernel is compiled using a 36-bit VA then it would run fine. After a few
days of debugging, it turns out that Proton/Wine allocates the lowest 32MB of its stack space, and the kernel by default allocates a 128MB space for
the application.

When an application is run natively, the stack is allocated at a fixed location in memory. FEX was failing to allocate the stack at the correct
location. By the time Wine's preloader eventually ran, FEX had allocated JIT code at that fixed location, which Wine would then map over, zeroing the
memory and breaking the FEX JIT. The preloader has done this for a long time and it was by pure chance that we weren't breaking older versions of Wine
and Proton.

With this problem fixed in FEX, we are now able to run triple-A games on AArch64, just like the following images of God of War running on a
Snapdragon 888.

Even more IR changes preparing for AVX emulation

Once again this month we have an absolute ton of commits from Lioncash working on making our JIT ready for AVX emulation. Around 25 commits went
towards this, with only about four more IR vector operations left to extend before AVX can be supported.

Once the JITs support 256-bit operations, we can start working towards emulating the instructions themselves.

Fix thunk crashing due to insufficient stack space

When FEX starts we potentially need to allocate all memory inside of the 48-bit VA space to match how x86-64 only has 47 bits.
This intersects with our stack space allocation, which is supposed to autogrow, but we allocated it instead. Now we give the full 128MB stack space to
FEX so it won't crash anymore.

Implements support for remaining BCD instructions

Thanks to @wannacu for implementing the remaining handful of 32-bit BCD instructions. DAA, DAS, AAA, AAS, AAM,
AAD were all missing in FEX's implementation. While BCD is fairly uncommonly used these days, they still managed to find an application that uses
these instructions. With these implemented, FEX should have all of the BCD instructions finally implemented.
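
As an illustration of what these instructions do, here is a sketch of DAA (decimal adjust AL after addition) that follows the operation pseudocode in Intel's instruction set reference. It is written in Python for readability and is not FEX's implementation; the function name and return shape are our own.

```python
def daa(al: int, cf: bool, af: bool) -> tuple:
    """Decimal-adjust AL after an ADD of two packed BCD bytes.

    Takes AL and the CF/AF flags left by the ADD; returns (al, cf, af).
    """
    old_al, old_cf = al, cf
    cf = False
    # Fix up the low digit if it overflowed past 9 or carried out of bit 3.
    if (al & 0x0F) > 9 or af:
        cf = old_cf or (al + 6) > 0xFF
        al = (al + 6) & 0xFF
        af = True
    else:
        af = False
    # Fix up the high digit if the original value exceeded 99 in BCD.
    if old_al > 0x99 or old_cf:
        al = (al + 0x60) & 0xFF
        cf = True
    else:
        cf = False
    return al, cf, af


# BCD 19 + 28: the binary ADD yields 0x41 with AF set, DAA corrects it to 0x47.
print(daa(0x41, cf=False, af=True))
```

The other five instructions follow the same pattern of small fix-ups on AL/AH driven by the carry and auxiliary-carry flags.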

Implement gpuvis timeline profiler support

While not majorly important for users, this is a very good interface for developers wanting to watch why a game has stuttered and for how long code
took to compile. This lets us take advantage of the same interface that GPU profiling events are using to see why a game missed a vsync.

This isn't enabled by default out of concern for taking too much CPU time, so it needs to be enabled with the ENABLE_FEXCORE_PROFILER cmake
option.

Fix ROR OF flag calculation

This is a fairly minor bug since not many things rely on the OF flag specifically. But in our testing of new Proton games, we found out that Denuvo
Anti-Tamper is relying on this edge case behaviour and we messed it up. While this gets Denuvo running slightly farther, it still doesn't quite work
under FEX.
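
For reference, the documented flag behaviour of ROR can be sketched as below. The post doesn't say exactly which edge case FEX got wrong, so this only shows the rule from Intel's instruction set reference: CF receives the new most significant bit, and for a rotate by one OF is the XOR of the two most significant bits of the result. The `ror32` helper is hypothetical and uses `None` for flags that are unaffected or architecturally undefined.

```python
def ror32(value: int, count: int) -> tuple:
    """Rotate a 32-bit value right; returns (result, cf, of)."""
    value &= 0xFFFFFFFF
    count &= 0x1F  # 32-bit rotates mask the count to 5 bits
    if count == 0:
        return value, None, None  # flags are left untouched
    result = ((value >> count) | (value << (32 - count))) & 0xFFFFFFFF
    msb = (result >> 31) & 1
    next_msb = (result >> 30) & 1
    cf = bool(msb)
    # OF is only defined for single-bit rotates.
    of = bool(msb ^ next_msb) if count == 1 else None
    return result, cf, of


print(ror32(0x00000001, 1))
```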

Fixes FPREM1 C2 flag calculation

FPREM1 will set a flag if the number was too large to calculate in one step, which is usually not the case. Since we are calculating the full
remainder, we now never claim to return a partial remainder. This solves an infinite loop in Mono applications that are using SIN/COS math
operations.
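
The idea can be sketched as follows, assuming (as the paragraph above describes) that the emulator always computes the complete IEEE remainder in one go. Python's `math.remainder` implements the IEEE 754 round-to-nearest remainder that FPREM1 is specified to produce, although on 64-bit doubles rather than x87's 80-bit format. The function name is hypothetical.

```python
import math


def fprem1_full(st0: float, st1: float) -> tuple:
    """Return (remainder, c2) for a one-step FPREM1 emulation (sketch)."""
    remainder = math.remainder(st0, st1)
    # The reduction is always complete, so C2 ("partial remainder") is clear.
    c2 = False
    return remainder, c2


# Guest code typically loops on FPREM1 until C2 is clear. With C2 always
# clear, such a loop runs exactly once.
value, partial = fprem1_full(5.0, 3.0)
while partial:
    value, partial = fprem1_full(value, 3.0)
print(value)
```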

Claim X87 transcendental ops are in range

X87 will set a flag if a program tries to operate on a value that is out of range for transcendental SIN/COS/TAN operations.
FEX-Emu doesn't actually detect these for performance reasons, so we instead claim these are always in range. While not always true, if they are out
of range then we weren't detecting them anyway. Fixes an issue where glibc would do some fixups to try and bring the value in range, resulting in
invalid results.

Add missing thunk library versions

This fixes an issue where FEX thunks would try to dlopen development libraries, which are missing on most users' devices.

Fixes indirect thunks with 8+ arguments

This fixes a quite bad crash with OpenGL and Vulkan thunking where every function with 8 or more arguments would be likely to break.
Fixes thunks for a bunch of games.

Add support for disabling thunks in application configurations

This is useful for narrowing down thunk compatibility issues in certain applications. While it is still not recommended to enable thunks globally,
this allows more flexibility when tinkering with it.

Implements four more auxv values

FEX implements most of these values for applications to pull but in some cases we didn't have these set up. Specifically AT_PLATFORM is required
so ldconfig can work correctly. AT_HWCAP/AT_HWCAP2 is used for an application to check for CPU features, and AT_RANDOM is a 128-bit
random number that the kernel provides.

Misc

Quite a few more things that were changed this month, but this report has been going on long enough.

Raw Changes

FEX - FEX-2210

Published by Sonicadvance1 about 2 years ago

Read the blog post at FEX-Emu's Site!

This month's release was a bit delayed due to the fact that most of FEX-Emu's developers were meeting up physically at the X.Org Developer's
Conference this year! Before we talk about this month's changes we need to spend a bit of time talking about some cool things.

FEX-Emu XDC talk

This year FEX-Emu had a talk to discuss some of the weird interactions with Mesa in an emulated environment. You can see the full talk in the embedded
video.
XDC Talk

At the end of the video we showed a quick demo of (mostly!) Proton games running under FEX-Emu on a Snapdragon 888 device. You can see this demo
directly embedded below.
XDC Sizzle Reel

Ubuntu 22.04 Rootfs Mesa update

We have had to update the Ubuntu 22.04 rootfs image with a newer version of Mesa today. Unfortunately our last update with Mesa 22.2 had a bug in the
Raspberry Pi Vulkan driver which completely broke Vulkan on ALL devices, not just the Raspberry Pi. We have updated the rootfs today with a Mesa git
version of the library to work around this issue. As a benefit, this version of the FEX rootfs includes the new Venus Vulkan 1.3 driver which can be
useful for testing.

Pick up the latest rootfs with the FEXRootFSFetcher tool.

New Lenovo ThinkPad X13s Gen 1 laptops

Last month Lenovo launched a new Snapdragon laptop that is one of the best development platforms that FEX-Emu devs could ask for. This platform is
shipping the Snapdragon 8cx Gen 3 SoC which is one of Qualcomm's most powerful chips. The only downside with this platform currently is that the GPU
doesn't yet work under Linux. There is an ongoing community effort to get the GPU up and running but these Snapdragon chips typically take a while
before support is fully in place.

Once the GPU works then this will be a perfect platform for testing Adreno with the Turnip Vulkan driver and Freedreno. At that point we will be
shipping out these laptops to all of our devs so we have a good Vulkan development platform.

FEX-2210

Although most of our developers were at XDC, there is no shortage of code that was merged this last month.

IR changes preparing for AVX emulation

This last month had at least 32 commits preparing our JITs for emulating AVX. While AVX isn't yet wired up, this is still a required step before it is supported. We are still requiring ARM SVE hardware that ships with 256-bit wide registers. This means that current consumer CPUs and the just-announced Neoverse-V2 won't work for our emulation here! This is future-proofing work since more games are requiring AVX to run, but we'll just need to live with the fact that new CPUs will be needed for the latest AAA games to run under FEX.

Support clang for thunks

We added support for building our thunks with clang this release. In particular the Ubuntu PPA is shipping this already. This might give a very minor perf increase but the main thing is removing a hard dependency on GCC.

Add uninstall cmake target

While it is generally advised to not install directly from a source build, users tend to still do this.
We were asked multiple times for an uninstall target so we finally added this convenience feature.

32-bit VDSO thunking support

This is FEX-Emu's first 32-bit thunk library! This exercises most of the thunking framework to bring this feature to 32-bit, without some of the harder parts that require data repacking. Now that this is proving that our 32-bit thunking is working, it is likely that we will start working towards getting the rest of the thunks supporting 32-bit as well!

IR cleanups

While this isn't directly user facing, this makes the JIT IR a bit easier to handle, making the devs' lives easier. We've removed redundant operations that aren't necessary.

Add support for vixl simulator in CI

While we are waiting for SVE-256bit hardware to get on the market, we need CI to prove that our implementation is correct. We have once again added the vixl simulator to our source tree.
The vixl simulator supports emulating the SVE instructions at whichever register width you want. While stacking emulators isn't good for performance, it is good for ensuring correct behaviour.
Sadly ARM's simulator doesn't emulate 100% of the operations correctly, so we have had to disable a few of our unit tests in this case, but it works well enough that it can pick up major mistakes.

CI functional testing

We have added functional testing of some of our thunks in our CI system. Specifically we are testing our OpenGL and Vulkan thunks to ensure they don't break. Since this is the beginning of functional testing, we currently only run vulkaninfo and glxinfo.
Soon we will be expanding this functional testing to encompass more features which will likely capture even more problems if they come up.

Map ELF files more like the kernel

The kernel has an interesting behaviour around how it maps ELF files in memory. It will always load the dynamic linker at around the highest address
it can. The primary ELF file will be loaded roughly in the middle of the address space with a bit of ASLR bias. We now emulate the same behaviour in
FEX to help with problems when running WINE. While not all the issues are sorted out, this is a good step towards making it more stable.

Fix LLVM ASAN

We had an issue with our ELF loading where LLVM ASAN was breaking due to mixing multiple mmaps in the same space. Simple bug with a simple fix. ASAN
all the things!

SMC deadlock fix

There was a fix to prevent a potential deadlock in our Self-Modifying-Code detection routines. Thanks to the developer that found this!

Lots of misc fixes this month

It would be hard to list all of the misc other fixes that happened this month. Find out more in our raw release notes!

Raw Changes

  • Arm64
    • Fixes SVE VectorImm (ad8526852)
    • Centralize location for register defines (169cfbbee)
  • VectorOps
    • Make use of static predicate registers (0fee355ff)
  • CI
    • Fixes struct verifier on Ubuntu 20.04 (f97a4afd8)
    • Adds support for flakes (96fecfd7c)
  • CMake
    • Add toolchain file for 32-bit cross-compiler (1ed3ecb40)
    • Extend AArch64 check to include arm64 (a583ebe59)
  • Docs
    • Update Release docs (c987e1ef4)
  • ELFCodeLoader
    • Map primary ELF more like the kernel (b44b3401b)
    • Fixes dynamic non-interpreter ELFs (edca52860)
    • Map interpreter first (71f7ff510)
  • ELFCodeloader
    • Map once and then use MAP_FIXED to overwrite (d68b84bc2)
  • FEXConfig
    • Ensure APP_CONFIG_NAME isn't stored in json (8d69f539a)
  • FEXLinuxTests
    • Adds missing pthread_cancel flake status (c262362a0)
    • Migrate to Catch2 (8f70137b1)
    • Build 32-bit and 64-bit test variants separately (3448c8343)
    • Use the build system instead of setting up compile flags via source-code annotations (d2138694b)
  • FEXServer
    • Fix waiting on kernel version older than 5.3 (8f9d79934)
  • FHU
    • Convert to a interface target (dee85f14f)
  • IR
    • Handle 256-bit VSMul/VUMul (23dd056b6)
    • Handle 256-bit VRev64 (c412d073b)
    • Handle 256-bit VShlI/VUShlI/VUShrI (c2b6aef6f)
    • Handle 256-bit VSShrS/VUShlS/VUShrS (51214d1be)
    • Handle 256-bit VFCMPORD/VFCMPUNO (7b4b9a80f)
    • Handle 256-bit VFCMPLT/VFCMPGT/VFCMPLE (25a8a0077)
    • Handle 256-bit VFCMPEQ/VFCMPNEQ (a67f7422b)
    • Handle 256-bit VCMPGT/VCMPGTZ/VCMPLTZ (ed8150cfb)
    • Handle 256-bit VCMPEQ/VCMPEQZ (462a163ba)
    • Handle 256-bit VBSL (6374175a6)
    • Handle 256-bit VSMax/VUMax (8d8b02928)
    • Handle 256-bit VSMin/VUMin (aa6a49932)
    • Handle 256-bit VNot (64c4fdccf)
    • Handle 256-bit VFNeg (d715ffbc8)
    • Handle 256-bit VNeg (1799d4c67)
    • Handle 256-bit VFRSqrt (dacd96cab)
    • Handle 256-bit VFSqrt (ca4d3bf64)
    • Handle 256-bit VFRecp (ea38b043c)
    • Handle 256-bit VFMax (a39746df2)
    • Handle 256-bit VFMin (2367a8e50)
    • Handle 256-bit VAddP (cb121d7f1)
    • Handle 256-bit VFDiv (50eba4066)
    • Handle 256-bit VFMul (447226576)
    • Handle 256-bit VFSub (3f8b872f1)
    • Handle 256-bit VFAddP (e573ddc2d)
    • Handle 256-bit VFAdd (eedbde6f1)
    • Handle 256-bit VPopcount (4e441e5a0)
    • Handle 256-bit VAbs (3e287a36c)
    • Removes Mov IR op (46bde401b)
    • Removes VExtractElement (01beac495)
    • Removes unnecessary VBitcast IR op (fcd981e6b)
    • Removes SplatVector{2,4} (2b9cc9666)
    • Removes VInsScalarElement (82eba2229)
  • Interpreter
    • Handle 256-bit VSShr/VUShl/VUShr (4d6e15d7a)
    • Use constant for AVX register size where applicable (808e1c033)
    • Handle 256-bit VMov (412793c21)
    • Handle 256-bit VAnd/VBic/VOr/VXor (b5cb42924)
  • JITs
    • Handle spilling/filling 256-bit vectors (6742e0c37)
    • Expand max spill slot size to 32 bytes (0d0d116bd)
  • SMC
    • Fix possible deadlock (8da9ebc2e)
  • Scripts
    • Updates DefinitionExtract (3977e1f29)
  • StructVerifier
    • Fixes CI failure (d4b5bf0f7)
  • ThunkLibs
    • X11/Xext: Removes two functions that don't exist on 32-bit (cc4c705fc)
  • Thunks
    • Add support for building with clang (2b1ef9735)
    • Adds dependency on linker script (eaddf7f1a)
    • Implement the Thunk IR op for 32-bit mode (1ea00f68a)
    • Adds functional thunk testing to CI (a59097763)
  • Host
    • Adds bool operator to fex_guest_function_ptr (3237de308)
  • gen
    • Use fmt for writing formatted output (704afed97)
  • libvulkan
    • Fixes print for 32-bit (d8c2a8271)
  • VDSO
    • Fix vsyscall (5cf59408a)
  • VectorOps
    • Handle 256-bit VURAvg (977d6dd24)
    • Handle 256-bit VUMinV (0261ed353)
    • Extend VSQAdd/VSQSub/VUQAdd/VUQSub (f34f1309a)
    • Extend VAdd/VSub (0ad52b7d1)
  • Misc
    • Add opencl thunk db (b693112c8)
    • 32-bit VDSO support (6f6f3c9dc)
    • Update vixl external (832a320e2)
    • Move thunk generator logic from ASTVisitor to ASTFrontendAction (54915f87c)
    • Add support for the vixl simulator (b36ec152d)
  • cmake
    • Adds uninstall target (6adf22761)
  • unittests
    • Disable gvisor pselect test (acddc0323)

FEX - FEX-2209

Published by Sonicadvance1 about 2 years ago

Read the blog post at FEX-Emu's Site!

A lot of miscellaneous work this month that isn't directly user facing. We do still have some interesting topics this month that some people will be
interested in.

Simplify StealMemory functions

A fairly significant change this month is reducing the time it takes FEX to set up its memory upon load. FEX needs to do an initial setup of the memory when an application loads
because between x86-64, x86, and AArch64 the memory layouts are significantly different.

Depending on the architecture of the application, FEX needs to allocate a large amount of memory to emulate the x86/x86-64 memory behaviour.

On 32-bit x86

  • We need to allocate all memory above 32-bit memory space
    • This is because we emulate 32-bit applications as a 64-bit AArch64 application

On 64-bit x86-64

  • We need to allocate all memory in the 48-bit virtual address space
    • This is because AArch64 supports the full 48-bit space for the user
    • x86-64 userspace only receives 47-bit
    • Applications rely on not receiving 48-bit pointers!
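
The arithmetic behind the two cases can be sketched as below. This is our reading of the lists above, not FEX's StealMemory code: the function name is hypothetical, ranges are half-open (start, end) pairs, and we assume the region to reserve is everything between the top of what the guest may see and the top of the host's user address space.

```python
def ranges_to_reserve(guest_is_64bit: bool, host_va_bits: int = 48) -> list:
    """Return address ranges the emulator must keep away from the guest."""
    host_end = 1 << host_va_bits
    # x86-64 userspace tops out at 47 bits, 32-bit x86 at 32 bits.
    guest_end = 1 << (47 if guest_is_64bit else 32)
    if guest_end >= host_end:
        return []  # the host can't hand out anything the guest can't handle
    return [(guest_end, host_end)]


for is_64bit in (False, True):
    for start, end in ranges_to_reserve(is_64bit):
        print(f"64-bit guest={is_64bit}: reserve {start:#x}..{end:#x}")
```

A real implementation also has to work around mappings that already exist in that range, which is where most of the setup time goes.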

From this graph showing the amount of CPU time spent in each routine, we can see a significant reduction in time to execute.
For 32-bit and 64-bit specific operations this results in a ~70x and ~181x reduction in execution time!

How well does this improve execution time in practice though?

This graph shows the total time it takes to run applications fully through. The smallest test applications have shaved around 75% - 85% off their
execution time. The biggest improvement comes from Proton setting up its execution environment. Proton's underlying execution environment is called
pressure-vessel, which executes hundreds of background applications while setting up. This is one of the worst cases for FEX since each independent
application execution needs to JIT new code and handle all of its state setup. This case reduces the execution time from around 21 seconds down to
around 17 seconds! This can really be felt when executing back-to-back Proton instances when testing games!

While this is a significant step in the right direction, FEX still has a ways to go to hit the native execution time of pressure-vessel which can take
as little as one second.

More AVX work

A bunch more work has gone into supporting AVX emulation. This is still preliminary backend work for now.

  • HostRunner

    • Handle upper YMM lanes in sigsegv handler
  • InterpreterOps

    • Extend SSAData size to accommodate 256-bit operations
  • VectorOps

    • Extend VAnd/VBic/VOr/VXor
    • Extend VMov
    • Extend VectorImm
    • Extend VectorZero

Thunks

X11

Some fairly minor changes here that improve usability of thunks with Proton. We added more Xlibint functions to the thunks which fixes X11 thunking
with DXVK. X11 is required for both Vulkan and OpenGL thunking so having this working is necessary when running those games.

Another necessary change for supporting thunks with Wine/Proton is more aggressively supporting X11 functions which require variadic arguments. There
are quite a few of these functions sprinkled around that require this. While we supported these functions with open-coded support up to 7 arguments,
we need to support at least up to 14 arbitrary arguments in some instances. We now have some assembly code in place which can support an arbitrary
number of arguments by packing these in memory the expected way. While this only works for 64-bit integers, it's all that we need for X11.
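
The packing idea can be illustrated with a short sketch. FEX does this in assembly and its exact memory layout isn't described here, so the layout below (consecutive little-endian 64-bit slots, one per argument) and the `pack_variadic_args` name are assumptions made for illustration.

```python
import struct


def pack_variadic_args(args: list) -> bytes:
    """Pack integer arguments into consecutive 64-bit slots (assumed layout)."""
    packed = bytearray()
    for arg in args:
        # Mask so that negative values are stored as two's complement.
        packed += struct.pack("<Q", arg & 0xFFFFFFFFFFFFFFFF)
    return bytes(packed)


# Fourteen arguments, the largest count mentioned above, need 14 * 8 bytes.
blob = pack_variadic_args(list(range(14)))
print(len(blob))
```

Because every slot has the same size, the receiving side can forward any number of arguments without knowing the callee's signature, which is why restricting this to 64-bit integers keeps the approach simple.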

With both of these features implemented, both OpenGL and Vulkan thunking work with Proton.

VDSO

While this is implemented as a thunk on the FEX side, it behaves slightly differently than normal thunks. It will always be enabled as long as FEX
can load the VDSO-host.so library installed on the system. Due to the nature of VDSO, all applications have a VDSO region provided by the kernel at
all times. FEX wants to provide fast emulation of this "library" since applications abuse it heavily for performance. This was noticed when running
Proton games, which abuse clock_gettime very heavily and were causing significant CPU overhead. Applications were calling this VDSO syscall hundreds
or thousands of times a second. This now significantly lowers the amount of time spent in the kernel for timing functions.

getdents syscall emulation

AArch64 doesn't support this syscall but in most cases applications don't use it. This is because there is a much more modern syscall called
getdents64 that everything uses now. Applications compiled against older toolchains are likely to use the classic syscall. Since AArch64 doesn't have
the classic version, we now emulate it entirely using getdents64, which fixes running applications from CentOS 7.
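
The conversion at the heart of such an emulation can be sketched as below. This is not FEX's code: the helper names are hypothetical, and the sketch assumes the 64-bit guest record layouts documented in the getdents(2) man page. In linux_dirent64 the d_type byte sits right after d_reclen, while the classic linux_dirent stores it in the last byte of each record.

```python
import struct

DIRENT64_HEADER = struct.Struct("<QQHB")  # d_ino, d_off, d_reclen, d_type
DIRENT_HEADER = struct.Struct("<QQH")     # d_ino, d_off, d_reclen


def make_dirent64(ino: int, off: int, d_type: int, name: bytes) -> bytes:
    """Build one getdents64-style record (test helper, hypothetical)."""
    reclen = (DIRENT64_HEADER.size + len(name) + 1 + 7) & ~7
    record = bytearray(reclen)
    DIRENT64_HEADER.pack_into(record, 0, ino, off, reclen, d_type)
    start = DIRENT64_HEADER.size
    record[start:start + len(name)] = name
    return bytes(record)


def dirent64_to_dirent(buf: bytes) -> bytes:
    """Rewrite a buffer of linux_dirent64 records as classic linux_dirent."""
    out = bytearray()
    pos = 0
    while pos < len(buf):
        d_ino, d_off, d_reclen, d_type = DIRENT64_HEADER.unpack_from(buf, pos)
        if d_reclen < DIRENT64_HEADER.size:
            raise ValueError("corrupt record length")
        name_field = buf[pos + DIRENT64_HEADER.size:pos + d_reclen]
        name = name_field.split(b"\0", 1)[0]
        # Header, name, NUL terminator, padding, then d_type as the last byte.
        reclen = (DIRENT_HEADER.size + len(name) + 2 + 7) & ~7
        record = bytearray(reclen)
        DIRENT_HEADER.pack_into(record, 0, d_ino, d_off, reclen)
        start = DIRENT_HEADER.size
        record[start:start + len(name)] = name
        record[-1] = d_type
        out += record
        pos += d_reclen
    return bytes(out)


source = make_dirent64(42, 1, 8, b"a.txt") + make_dirent64(43, 2, 4, b"dir")
converted = dirent64_to_dirent(source)
print(len(source), len(converted))
```

A full emulation also has to deal with the guest buffer being too small for the converted records and with 32-bit guests, whose classic records use 32-bit d_ino and d_off fields. Both are left out of this sketch.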

Misc

  • Fix compiling without jemalloc
    • Thunks are unsupported without jemalloc but we need to keep it compiling
  • Consolidate generated files to one file per platform
    • Nice code cleanup for developers
  • Minor cleanups for signature-based function pointer thunking
  • Support direct thunk config in configuration files
    • This improves the user experience with enabling thunks for application configurations
    • No need for two files to describe one thing now

Raw Changes