Linux Subsystem for FreeBSD (😈 on 🐧)
APACHE-2.0 License
Emulates FreeBSD on Linux. Designed to be extensible to support other Unix-like OS personalities too.
Tested on Ubuntu 22.04 (kernel 5.15). Requires Linux kernel 5.6 or later.
(linux)$ docker build -t lsf .
(linux)$ docker run -it --rm --security-opt seccomp=unconfined lsf
# file /bin/sh
/bin/sh: ELF 64-bit LSB pie executable, x86-64, version 1 (FreeBSD), dynamically linked, interpreter /libexec/ld-elf.so.1, for FreeBSD 13.1, FreeBSD-style, stripped
# uname -a
FreeBSD 177f2177ddab 13.1-RELEASE-p1 FreeBSD 13.1-RELEASE-p1 LSF amd64
⚠️ Running LSF outside a container is highly discouraged, and may result in breaking the host Linux.
make
install _output/bin/lsf ~/bin/
mkdir -p ~/freebsd/rootfs
curl -SL http://ftp.freebsd.org/pub/FreeBSD/releases/amd64/13.1-RELEASE/base.txz | tar CxJ ~/freebsd/rootfs
cd ~/freebsd/rootfs
export LD_LIBRARY_PATH=$(pwd)/lib
lsf -- libexec/ld-elf.so.1 usr/bin/uname -a
POC.
Retry docker run several times if you see Error: input/output error.
Use docker run -e LSF_DEBUG=1 to enable debug output.
Use docker exec -it <CONTAINERID> /lsf -- /bin/sh to open another shell.

Surprisingly, the Linux kernel does not validate the OSABI of ELF binaries on execve().
So, LSF can "just" load ELFOSABI_FREEBSD binaries without cooking up the PROT_EXEC pages by itself.
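
To see which OSABI a binary declares, it is enough to look at one byte of the ELF identification header. The following is a minimal sketch; the helper names `elf_osabi` and `is_freebsd_elf` are made up for illustration and are not part of LSF.

```python
EI_OSABI = 7            # index of the OSABI byte within e_ident
ELFOSABI_SYSV = 0
ELFOSABI_LINUX = 3
ELFOSABI_FREEBSD = 9


def elf_osabi(header: bytes) -> int:
    """Return the EI_OSABI byte from the first 16 bytes of an ELF file."""
    if len(header) < 16 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    return header[EI_OSABI]


def is_freebsd_elf(path: str) -> bool:
    """True if the file at `path` declares ELFOSABI_FREEBSD."""
    with open(path, "rb") as f:
        return elf_osabi(f.read(16)) == ELFOSABI_FREEBSD
```

Running `is_freebsd_elf("bin/sh")` inside the extracted rootfs should agree with what `file` reports for the same binary.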
Syscalls are trapped using the plain old PTRACE_SYSCALL.
Unlike UML, LSF does not use PTRACE_SYSEMU, which reduces the ptrace overhead when the trapped syscall rarely needs to be executed, because in the case of LSF most syscalls can simply be passed through to the Linux kernel with different register values, such as the syscall number in the RAX register.
Syscall User Dispatch is not used either.
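
The pass-through idea can be sketched with a plain dictionary standing in for the tracee's registers. This is only an illustration: the table is a tiny excerpt, `rewrite_syscall_entry` is a hypothetical name, and a real tracer would read and write the registers with PTRACE_GETREGS and PTRACE_SETREGS (where the x86-64 syscall number is exposed as `orig_rax` in `user_regs_struct`).

```python
# FreeBSD syscall number -> Linux x86-64 syscall number (tiny excerpt).
FREEBSD_TO_LINUX = {
    1: 60,   # exit
    3: 0,    # read
    4: 1,    # write
    6: 3,    # close
    20: 39,  # getpid
}


def rewrite_syscall_entry(regs: dict) -> dict:
    """Return a copy of `regs` whose syscall number is the Linux one.

    Only "rax" changes; the argument registers are passed through as-is.
    Raises KeyError for a FreeBSD syscall that has no entry in the table.
    """
    out = dict(regs)
    out["rax"] = FREEBSD_TO_LINUX[regs["rax"]]
    return out
```

For example, a FreeBSD `write(1, buf, 5)` arrives with RAX = 4 and leaves with RAX = 1, while RDI, RSI, and RDX stay untouched.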
The syscall ABI is almost the same across Linux and FreeBSD:
The syscall number is stored in the RAX register, and the syscall arguments are stored in the RDI, RSI, RDX, R10, R8, and R9 registers.
This is similar to the System V AMD64 ABI calling convention for userspace (RDI, RSI, RDX, RCX, R8, R9).
However, for syscalls the fourth argument is stored in R10, not RCX, because the syscall instruction (0F 05) clobbers RCX.
The return value is stored back in the RAX register. On Linux, an errno is stored in the RAX register too, but as a negative value.
In addition, FreeBSD processes expect the CF flag of the RFLAGS register to be set on an error.
LSF sets the CF flag using PTRACE_SETREGS.
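
A rough sketch of the return-value fix-up follows. It assumes the FreeBSD amd64 convention of a positive errno in RAX with CF set, and it does not translate errno numbers, some of which differ between the two systems (EAGAIN, for instance). The function name is hypothetical.

```python
CF = 1 << 0            # carry flag is bit 0 of RFLAGS
MAX_ERRNO = 4095       # Linux reports errors as -1 .. -4095 in RAX
MASK64 = (1 << 64) - 1


def to_signed64(value: int) -> int:
    """Interpret a 64-bit register value as a signed integer."""
    value &= MASK64
    return value - (1 << 64) if value >> 63 else value


def freebsd_syscall_result(linux_rax: int, rflags: int):
    """Map a Linux syscall result to the (rax, rflags) pair FreeBSD expects."""
    ret = to_signed64(linux_rax)
    if -MAX_ERRNO <= ret < 0:
        # Error: positive errno in RAX, carry flag set (assumed convention).
        return -ret, rflags | CF
    # Success: value unchanged, carry flag cleared.
    return linux_rax & MASK64, rflags & ~CF
```

The tracer would apply this at the syscall-exit stop and write both values back in a single PTRACE_SETREGS call.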
Some syscalls can't simply be passed through by changing the register values, because the corresponding syscall is missing in Linux, or because the syscall takes an incompatible argument such as a struct with different members:
int fstat(int fd, struct stat *buf);
In such a case, LSF rewrites the syscall number in the RAX register to a "NOP" syscall number (getpid()), and handles the original syscall arguments in userspace when the "NOP" syscall exits.
The userspace handler uses pidfd_getfd() to fetch the file descriptors, translates the struct definitions, and calls Linux syscalls to emulate the requested FreeBSD syscall.
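
The entry/exit flow described above can be modelled without ptrace at all. The sketch below is a simulation under several assumptions: the `Tracer` class is hypothetical, registers are a plain dictionary, only one tracee is tracked, and 551 is used as the FreeBSD fstat number on the assumption that it matches the post-ino64 syscall table.

```python
LINUX_GETPID = 39        # Linux x86-64 getpid(), used as the "NOP" syscall
FREEBSD_FSTAT = 551      # assumed FreeBSD 12+ fstat number
ARG_REGS = ("rdi", "rsi", "rdx", "r10", "r8", "r9")


class Tracer:
    def __init__(self, passthrough, emulated):
        self.passthrough = passthrough   # FreeBSD nr -> Linux nr
        self.emulated = emulated         # FreeBSD nr -> handler(*args)
        self.pending = None              # (handler, args) saved at entry

    def on_entry(self, regs):
        """Syscall-entry stop: pick pass-through or the NOP detour."""
        nr = regs["rax"]
        if nr in self.emulated:
            args = tuple(regs[r] for r in ARG_REGS)
            self.pending = (self.emulated[nr], args)
            regs["rax"] = LINUX_GETPID   # let the kernel run a harmless call
        else:
            regs["rax"] = self.passthrough[nr]

    def on_exit(self, regs):
        """Syscall-exit stop: replace getpid()'s result with the emulation."""
        if self.pending is not None:
            handler, args = self.pending
            self.pending = None
            regs["rax"] = handler(*args)
```

A handler registered for `FREEBSD_FSTAT` would be the place to fetch the descriptor, call the Linux fstat, and write a FreeBSD-layout struct back into the tracee's memory.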
The pidfd_getfd() syscall has been available since Linux kernel 5.6, but is disabled in Docker's default seccomp profile.
So, running LSF inside Docker needs --security-opt seccomp=unconfined, or at least a custom seccomp profile that allows pidfd_getfd().
Enabling pidfd_getfd() does NOT require acquiring the CAP_SYS_PTRACE capability.
Instead of using pidfd_getfd(), LSF could alternatively just use the symlinks under /proc/<PID>/fd/ and the position information under /proc/<PID>/fdinfo/ to create yet another descriptor with similar internal state, but this approach is not as robust as pidfd_getfd(), and is very unlikely to work with descriptors of non-regular files.
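
For comparison, the /proc-based alternative might look like the sketch below. It is not what LSF does; `reopen_via_proc` is a hypothetical helper that copies only the access mode and the file offset, which is part of why the approach is fragile.

```python
import os


def reopen_via_proc(pid: int, fd: int) -> int:
    """Open a new descriptor for the file behind /proc/<pid>/fd/<fd>.

    Copies the access mode and offset found in /proc/<pid>/fdinfo/<fd>.
    Intended for regular files only.
    """
    pos, flags = 0, os.O_RDONLY
    with open(f"/proc/{pid}/fdinfo/{fd}") as f:
        for line in f:
            key, _, value = line.partition(":")
            if key == "pos":
                pos = int(value)
            elif key == "flags":
                flags = int(value, 8)    # fdinfo prints the flags in octal
    # Opening the magic symlink yields a fresh open file description,
    # so the offset has to be restored by hand.
    new_fd = os.open(f"/proc/{pid}/fd/{fd}", flags & os.O_ACCMODE)
    os.lseek(new_fd, pos, os.SEEK_SET)
    return new_fd
```

Unlike a descriptor obtained with pidfd_getfd(), the result does not share its offset or status flags with the original, and sockets or pipes generally cannot be reopened this way.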
FreeBSD processes expect the TLS pointer (FSBASE) to be initialized by the kernel, while the Linux kernel does not provide it.
LSF uses PTRACE_POKETEXT to inject the syscall instruction (0F 05) into the code of the FreeBSD process for allocating the TLS with brk(), and after single-stepping the syscall instruction, LSF restores the code and rewinds the instruction pointer to the original position.
The TLS is initialized with the .tdata and .tbss sections of the ELF.
At the end of the TLS, there is the TLS pointer that points to itself.
The FSBASE register is set to this pointer.
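
The resulting memory layout can be sketched as a pure function that builds the bytes to write into the freshly allocated region. The name `build_tls` and the 16-byte alignment are assumptions made for this illustration; a real loader would take the alignment from the PT_TLS program header.

```python
import struct


def build_tls(base_addr: int, tdata: bytes, tbss_size: int, align: int = 16):
    """Return (image, fsbase) for a TLS block placed at `base_addr`.

    Layout: .tdata contents, zero-filled .tbss, padding up to `align`,
    then an 8-byte pointer that holds its own address.
    """
    body = tdata + bytes(tbss_size)
    body += bytes(-(base_addr + len(body)) % align)   # align the pointer slot
    fsbase = base_addr + len(body)
    image = body + struct.pack("<Q", fsbase)          # self-referencing pointer
    return image, fsbase
```

The tracer would copy `image` into the tracee at `base_addr` and then set FSBASE to the returned `fsbase`.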
The initial register values differ between Linux and FreeBSD, so LSF modifies them using PTRACE_SETREGS.

| | RSP | RDI | FSBASE |
|---|---|---|---|
| Linux | stack | - | - |
| FreeBSD | stack (aligned) | stack | end of TLS |
The stack layout is similar.
The stack begins with argc, argv, envp, and auxv, but auxv is slightly incompatible across Linux and FreeBSD.
FreeBSD processes expect the AT_BASE element in the auxv to always be provided with a non-zero value, but the Linux kernel sets AT_BASE to zero when the ELF interpreter (/libexec/ld-elf.so.1) is executed directly.
In such a case, LSF modifies the AT_BASE value on the stack to be the base address parsed from /proc/<PID>/maps.
Also, some of the auxv elements are incompatible and nullified.
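
The AT_BASE fix-up can be illustrated on plain bytes and text. The sketch below assumes that the base address is the lowest mapping backed by the interpreter file and that the auxv has already been read out of the tracee's stack; both helper names are hypothetical.

```python
import struct

AT_NULL = 0
AT_BASE = 7


def parse_base_from_maps(maps_text: str, interp_path: str):
    """Lowest start address of any mapping backed by `interp_path`."""
    starts = []
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) == 6 and fields[5].strip() == interp_path:
            starts.append(int(fields[0].split("-")[0], 16))
    return min(starts) if starts else None


def patch_at_base(auxv: bytes, base: int) -> bytes:
    """Return a copy of `auxv` with a zero AT_BASE entry set to `base`."""
    out = bytearray(auxv)
    for off in range(0, len(out) - 15, 16):
        a_type, a_val = struct.unpack_from("<QQ", out, off)
        if a_type == AT_NULL:            # end of the auxiliary vector
            break
        if a_type == AT_BASE and a_val == 0:
            struct.pack_into("<QQ", out, off, AT_BASE, base)
    return bytes(out)
```

The patched bytes would then be written back to the same stack address before the tracee is resumed.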
FreeBSD (and others) on Linux:
Darwin on Linux:
SunOS and Solaris on Linux:
CONFIG_SUNOS_EMUL (for SunOS 4/Solaris 1) and CONFIG_SOLARIS_EMUL (for SunOS 5/Solaris 2) were natively present in the Linux kernel for the SPARC architecture until
System V derivatives on Linux:
SIGSEGV handlers for trapping syscalls, while LSF uses ptrace. Also, ibcs-us needs CAP_SYS_RAWIO while LSF does not.
Windows on Linux:
Linux on FreeBSD:
Linux on Darwin:
Linux on Solaris:
Linux on System V derivatives:
SIGSEGV handlers for trapping Linux syscalls on SCO OpenServer, UnixWare, and Solaris,
Linux on Windows:
int80.sys,
int80.sys for trapping syscalls,