2.11 Elf Binary Mangling Pt. 4: Limit Break
~ netspooky
CONTENTS
0. Introduction
1. Background
2. Read No Longer Implies Exec
3. Exploring Other Overlays
3.1 The 0x38 Overlay
3.2 The 0x31 Overlay
4. Tracing the ELF Loader
5. Limited Addition
6. There's Levels To This Chip
7. write() Or Die
8. ./xit
0. Introduction
It's been about three years since I started playing with tiny executable files and released the first part of the ELF Binary Mangling series.
In part 2, we established a limit of 84 bytes for 64 bit ELFs. This was achieved by starting the program header at offset 0x1C within the ELF header, and carefully crafting each field to make it work within the required values.
Sometime in the summer of 2020, the ELF Binary Golf world was rocked by a change in the kernel's ELF loader that would render this trick useless.
This writeup is a deep dive into what caused the 84 byte ELF to stop working, how the kernel parses ELFs, what truly matters in the ELF and Program headers, and a working 64 bit ELF that is even smaller than the previous record.
1. Background
Unlike PEs on Windows, ELF files don't have a hard limit on how small they can be. This has allowed people to create impossibly small ELFs that can still be loaded and executed, as long as they follow the spec [1].
In EBM Part 2 [2], a program header overlay was introduced that started the program header table at 0x1C in the file. This set the p_flags field to 0x1C, because it overlapped with the e_phoff field in the ELF header. When I wrote this, I only had a limited amount of experience with kernel code. Reading through the loader's logic, I had mistakenly thought that the flags in p_flags (in practice) were mapped to the kernel memory manager's own protection flags [3], which look like this:
#define PROT_READ 0x1 /* page can be read */
#define PROT_WRITE 0x2 /* page can be written */
#define PROT_EXEC 0x4 /* page can be executed */
Following this logic, 0x1C would have the PROT_EXEC flag set:
PROT_READ 1 00000001 Read
PROT_WRITE 2 00000010 Write
PROT_EXEC 4 00000100 Execute
p_flags 1Ch 00011100
└── PROT_EXEC is set
I honestly didn't think much of it. I was just happy that it executed and chalked up the rest to kernel unknowns. Narrator: _This was not the case._
SIDE NOTE: What even is a program header?
A program header is a structure that describes a segment, or other info, that the system needs to prepare the program for execution. An ELF needs at least one PT_LOAD segment to be loaded into memory and executed.
REFS:
[1] https://refspecs.linuxfoundation.org/elf/elf.pdf
[2] https://n0.lol/ebm/2.html
[3] https://elixir.bootlin.com/linux/latest/source/include/uapi/asm-generic/mman-common.h#L12
2. Read No Longer Implies Exec
Behind the scenes, the segment permissions set by the program headers DO rely on the permission flags [1] defined by the ELF spec. Here, the bit flags for READ and EXECUTE are swapped relative to the kernel mmap flags:
PF_X 1 00000001 Execute
PF_W 2 00000010 Write
PF_R 4 00000100 Read
p_flags 1Ch 00011100
└── PF_R is set
In the 0x1C overlay ELFs, the only permission bit set in p_flags was PF_R, which maps the segment as read only. This worked because of a backward compatibility feature in the loader for older x86_64 processors. On x86, a personality [2] flag called READ_IMPLIES_EXEC was set by default for each new process. This flag asserts that any memory mapping made with PROT_READ is also marked PROT_EXEC, giving read AND execute permissions to the memory area.
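The two readings of 0x1C can be compared side by side. This is a minimal sketch: the PF_* and PROT_* values are the ones listed above, while the `decode` helper is a hypothetical name used only for illustration.

```python
# PF_* values come from the ELF spec, PROT_* values are the mmap constants.
PF = {"X": 0x1, "W": 0x2, "R": 0x4}
PROT = {"R": 0x1, "W": 0x2, "X": 0x4}

def decode(value, table):
    """Return the permission letters whose bit is set in value."""
    return "".join(name for name in "RWX" if value & table[name])

p_flags = 0x1C
print("read as PROT_*:", decode(p_flags, PROT))  # the mistaken reading
print("read as PF_*:  ", decode(p_flags, PF))    # the reading the loader uses
```

Bits 3 and 4 of 0x1C have no meaning in either table, so only bit 2 matters here.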
The flag was set because of a header type known as PT_GNU_STACK which determines whether or not the stack is executable. When this header was missing, it defaulted to setting READ_IMPLIES_EXEC, to support platforms that don't or can't have an NX bit [3]. This is a "fail-open" scenario that makes sure code will run as intended.
So when there was no PT_GNU_STACK header (due to us only having one program header), our read flag failed open, allowing the program's memory page to be mapped with RX permissions by the kernel. This is what enabled the overlay to work, and the binaries executed just fine.
On April 20, 2020, a Linux kernel commit was made that would end this trick:
x86/elf: Disable automatic READ_IMPLIES_EXEC on 64-bit [4]
This made the lack of a PT_GNU_STACK header in a 64 bit ELF turn off the READ_IMPLIES_EXEC [5] flag. The change turned "exec-all" into "exec-none" (fail closed).
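As a rough illustration of the before and after, here is a simplified model of how a segment's permissions come out. It is not the kernel's code: `segment_perms` is a hypothetical helper, and it only covers the case discussed here (a 64 bit ELF with no PT_GNU_STACK header).

```python
PF_X, PF_W, PF_R = 0x1, 0x2, 0x4

def segment_perms(p_flags, read_implies_exec):
    """Simplified model: translate PF_* bits, then apply the personality flag."""
    r = bool(p_flags & PF_R)
    w = bool(p_flags & PF_W)
    x = bool(p_flags & PF_X)
    if read_implies_exec and r:
        x = True  # a readable mapping is also made executable
    return ("r" if r else "-") + ("w" if w else "-") + ("x" if x else "-")

# p_flags = 0x1C, as produced by the 0x1C overlay
print("READ_IMPLIES_EXEC set:  ", segment_perms(0x1C, read_implies_exec=True))
print("READ_IMPLIES_EXEC clear:", segment_perms(0x1C, read_implies_exec=False))
```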
(Big shoutout to @lichtman_ben for locating this commit!)
So what happens when you run a 0x1C ELF now? This is what the EBM2 binary looks like when executed:
$ strace ./xit
execve("./xit", ["./xit"], 0x7ffdfd471e80 /* 47 vars */) = 0
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x100000004} ---
+++ killed by SIGSEGV (core dumped) +++
Segmentation fault (core dumped)
If you run dmesg, you can see these errors as well:
[22056.363534] xit[31789]: segfault at 100000004 ip 0000000100000004 sp 00007ffc8ad06dd0
error 15 in xit[100000000+1000]
[22056.363584] Code: Unable to access opcode bytes at RIP 0xffffffda.
error 15 [6] tells us exactly what we need to know: something tried to execute from a mapped memory area that isn't executable. (The kernel prints this code in hex, so it is really 0x15.)
Since we know that the PF_X flag wasn't set, and that READ no longer implies EXEC, this points us to the core of the issue.
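The error code is a bit field, so it can be decoded by hand. The sketch below follows the x86 page fault error code layout as I understand it (bit 0 protection, bit 1 write, bit 2 user, bit 3 reserved, bit 4 instruction fetch). `decode_fault` is a hypothetical helper, not something the kernel exposes.

```python
# (mask, meaning when set, meaning when clear)
BITS = [
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set in page table", None),
    (0x10, "instruction fetch", None),
]

def decode_fault(code):
    """Describe an x86 page fault error code given as an integer."""
    out = []
    for mask, if_set, if_clear in BITS:
        if code & mask:
            out.append(if_set)
        elif if_clear:
            out.append(if_clear)
    return out

# dmesg prints the code in hex, so "error 15" is 0x15.
print(decode_fault(0x15))
print(decode_fault(0x14))
```

0x15 is a user-mode instruction fetch that hit a present but non-executable page. 0x14 is the same fetch against a page that isn't there at all.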
We can also confirm the mapping issue by viewing the memory permissions of the loaded process.
Here's what it looks like when it works, and READ_IMPLIES_EXEC is set:
$ rizin -b 64 -d xit
...
[0x100000004]> dm
0x0000000100000000 - 0x0000000100001000 * usr 4K s r-x /tmp/xit /tmp/xit ; map.tmp_xit.r_x
0x00007fff8793e000 - 0x00007fff8795f000 - usr 132K s rwx [stack] [stack] ; map.stack_.rwx
0x00007fff879b0000 - 0x00007fff879b3000 - usr 12K s r-- [vvar] [vvar] ; map.vvar_.r
0x00007fff879b3000 - 0x00007fff879b5000 - usr 8K s r-x [vdso] [vdso] ; map.vdso_.r_x
0xffffffffff600000 - 0xffffffffff601000 - usr 4K s r-x [vsyscall] [vsyscall] ; map.vsyscall_.r_x
This is what it looks like now, on a kernel that contains this patch:
$ rizin -b 64 -d xit
...
[0x100000004]> dm
0x0000000100000000 - 0x0000000100001000 * usr 4K s r-- /tmp/xit /tmp/xit ; map.tmp_xit.r
0x00007fffb88cd000 - 0x00007fffb88ee000 - usr 132K s rw- [stack] [stack] ; map.stack_.rw
0x00007fffb896e000 - 0x00007fffb8972000 - usr 16K s r-- [vvar] [vvar] ; map.vvar_.r
0x00007fffb8972000 - 0x00007fffb8974000 - usr 8K s r-x [vdso] [vdso] ; map.vdso_.r_x
0xffffffffff600000 - 0xffffffffff601000 - usr 4K s --x [vsyscall] [vsyscall] ; map.vsyscall_.__x
Notice that in the top line, the r-x (Read, Execute) permission changes to r-- (Read). The stack, which previously had rwx (Read, Write, Execute) permissions, is also changed to rw- (Read, Write) by the patch.
Fun Fact: When none of the permission bits are set (such as p_flags = 0x8), you get `error 14: attempt to execute code from an unmapped area`.
[12571.119831] tester[21040]: segfault at 50000001078 ip 0000050000001078 sp 00007ffff2d7fdc0
error 14 in tester[50000001000+1000]
[12571.119848] Code: Unable to access opcode bytes at RIP 0x5000000104e.
Additionally, other errors can be triggered that don't even give the "Unable to access opcode bytes" error, and just dump the program header starting at p_offset's 6th byte.
[12490.529762] tester[20913]: segfault at 50000001078 ip 0000050000001078 sp 00007ffdf405a300
error 15 in tester[50000001000+1000]
[12490.529776] Code: 00 00 00 10 00 00 00 05 00 00 00 00 00 40 00 38 00 01 00 01 00 00 00 00 00
00 00 01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 <b0> 3c 66 bf 06 00 0f 05 00
00 00 00 00 00 00 00 00 00 00 00 00 00
Unrelated, but you can also cause a SIGBUS by making p_offset larger than the file itself [7].
REFS:
[1] https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/elf.h#L239
[2] https://man7.org/linux/man-pages/man2/personality.2.html
[3] https://en.wikipedia.org/wiki/NX_bit
[4] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9fccc5c0c99f238aa1b0460fccbdb30a887e7036
[5] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/include/asm/elf.h#n280
[6] https://utcc.utoronto.ca/~cks/space/blog/linux/KernelSegfaultErrorCodes
[7] https://twitter.com/netspooky/status/1409686118962909188
3. Exploring Other Overlays
When I started testing new overlays, I began with test.asm [1]. This is a vanilla tiny ELF that is 128 bytes. There is no overlay, and the header size is 120. This is followed by 8 bytes of code that calls exit(6).
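For readers who would rather poke at this layout from a script, the same 64 + 56 + 8 byte structure can be packed with Python's struct module. This is a sketch, not a copy of test.asm: the base address 0x400000 and the field values other than the code bytes are my own assumptions for a conventional header.

```python
import struct

BASE = 0x400000  # hypothetical load address, not taken from test.asm
CODE = bytes.fromhex("b03c66bf06000f05")  # mov al,0x3c; mov di,6; syscall

ehdr = struct.pack(
    "<16sHHIQQQIHHHHHH",
    b"\x7fELF" + bytes([2, 1, 1]) + bytes(9),  # e_ident: 64 bit, LSB, version 1
    2,            # e_type = ET_EXEC
    0x3E,         # e_machine = x86-64
    1,            # e_version
    BASE + 0x78,  # e_entry: code starts right after both headers
    0x40,         # e_phoff
    0,            # e_shoff
    0,            # e_flags
    0x40,         # e_ehsize
    0x38,         # e_phentsize
    1,            # e_phnum
    0x40, 0, 0,   # e_shentsize, e_shnum, e_shstrndx
)
phdr = struct.pack(
    "<IIQQQQQQ",
    1,       # p_type = PT_LOAD
    5,       # p_flags = PF_R | PF_X
    0,       # p_offset
    BASE,    # p_vaddr
    BASE,    # p_paddr
    0x80,    # p_filesz
    0x80,    # p_memsz
    0x1000,  # p_align
)
elf = ehdr + phdr + CODE
print(len(ehdr), len(phdr), len(elf))
```

Writing `elf` to a file and marking it executable should give a starting point equivalent in shape to the 128 byte baseline.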
I experimented with placing data in different locations, and wrote a small fuzzer to increment bytes and then attempt to execute it and read the return value. This process was painfully slow, and it didn't actually give me much useful information.
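A fuzzer of this kind can be sketched in a few lines. This is not the original tool, just one possible shape for it, and all three function names are hypothetical. It overwrites one byte at a time, runs the result, and keeps the offsets where the exit status is still the expected 6.

```python
import os
import stat
import subprocess
import tempfile

def patch(blob, offset, value):
    """Return a copy of blob with one byte replaced."""
    out = bytearray(blob)
    out[offset] = value
    return bytes(out)

def exit_status(blob, timeout=2):
    """Write blob to a temp file, execute it, return its exit status (or None)."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(blob)
        os.chmod(path, stat.S_IRWXU)
        try:
            return subprocess.run([path], timeout=timeout).returncode
        except (OSError, subprocess.TimeoutExpired):
            return None  # refused by the loader, or it hung
    finally:
        os.unlink(path)

def overwritable(blob, expected=6, value=0xFF):
    """Offsets whose byte can be set to value without changing the exit status."""
    return [off for off in range(len(blob))
            if exit_status(patch(blob, off, value)) == expected]
```

Spawning a process per byte is exactly why this approach is slow, and a single test value per offset says nothing about partial constraints like alignment.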
In discussions with others, we'd talked about z3 solvers and other methods to determine what fields and specific bytes actually mattered. This too would prove to be more difficult and have a lot of edge cases to account for, so it made more sense to try to find the maximum value for each field by hand.
The wonderful thing about nasm's raw binary output option is that you can control every byte. There are also macros and other useful functionality that make things easier to test, rather than calculating every address and offset by hand.
I started out slow, and put 0xFF bytes wherever I could. Some overwritable fields were already known from previous research, so the rest was simply trying to fill in the gaps. The following code is the result of this process.
;-- 0xFFtactics.asm ------------------------------------------------------------
; build:
; $ nasm -f bin 0xFFtactics.asm -o 0xFFtactics
BITS 64
org 0x4FFFFFFFF000 ; Base Address
;-----------------------------+------+-------------+----------+-----------------
; ELF Header struct | OFFS | ELFHDR | PHDR | ASSEMBLY OUTPUT
;-----------------------------+------+-------------+----------+-----------------
db 0x7F, "ELF" ; 0x00 | e_ident | A
db 0xFE ; 0x04 | ei_class | B
db 0xFF ; 0x05 | ei_data | C
db 0xFF ; 0x06 | ei_version | D
db 0xFF ; 0x07 | | E
dq 0xFFFFFFFFFFFFFFFF ; 0x08 | e_padding | F
dw 0x02 ; 0x10 | e_type | G
dw 0x3e ; 0x12 | e_machine | H
dd 0xFFFFFFFF ; 0x14 | e_version | I
dq 0x4FFFFFFFF078 ; 0x18 | e_entry | J
dq phdr - $$ ; 0x20 | e_phoff | K
dq 0xFFFFFFFFFFFFFFFF ; 0x28 | e_shoff | L
dd 0xFFFFFFFF ; 0x30 | e_flags | M
dw 0xFFFF ; 0x34 | e_ehsize | N
dw 0x38 ; 0x36 | e_phentsize | O
dw 1 ; 0x38 | e_phnum | P
dw 0xFFFF ; 0x3A | e_shentsize | Q
dw 0xFFFF ; 0x3C | e_shnum | R
dw 0xFFFF ; 0x3E | e_shstrndx | S
;-----------------------------+------+-------------+----------+-----------------
; Program Header Begin | OFFS | ELFHDR | PHDR | ASSEMBLY OUTPUT
;-----------------------------+------+-------------+----------+-----------------
phdr: dd 1 ; 0x40 | PA | p_type |
dd 0xFFFFFFFF ; 0x44 | PB | p_flags |
dq 0 ; 0x48 | PC | p_offset |
dq $$ ; 0x50 | PD | p_vaddr |
dq 0xFFFFFFFFFFFFFFFF ; 0x58 | PE | p_paddr |
dq 0x7FFFFFF00 ; 0x60 | PF | p_filesz |
dq 0x7FFFFFF00 ; 0x68 | PG | p_memsz |
dq 0xFFFFFFFFFFFFFFFE ; 0x70 | PH | p_align |
_start: mov al,0x3c ; exit syscall | b0 3c
mov di, 6 ; return value 6 | 66 bf 06 00
syscall ; call the kernel | 0f 05
;-- END ------------------------------------------------------------------------
Using this, I constructed a table detailing each ELF header field [2] and whether it can be overwritten while still allowing the ELF to execute properly.
──────────────┬──────┬────┬─────┬───────────────────────────────────────────────
Name │ OFFS │ SZ │ OW? │ Note
──────────────┼──────┼────┼─────┼───────────────────────────────────────────────
EI_MAG0 │ 0x00 │ 1 │ NO │ '\x7F', Part of the magic value.
EI_MAG1 │ 0x01 │ 1 │ NO │ 'E', Part of the magic value.
EI_MAG2 │ 0x02 │ 1 │ NO │ 'L', Part of the magic value.
EI_MAG3 │ 0x03 │ 1 │ NO │ 'F', Part of the magic value.
EI_CLASS │ 0x04 │ 1 │ YES │ Values 1 (32 Bit) and 2 (64 Bit) are valid
EI_DATA │ 0x05 │ 1 │ YES │ Values 1 (LSB) and 2 (MSB) are expected
EI_VERSION │ 0x06 │ 1 │ YES │ Only "1" is defined, not checked
EI_OSABI │ 0x07 │ 1 │ YES │ This might actually be deprecated?
EI_ABIVERSION │ 0x08 │ 1 │ YES │ This might actually be deprecated?
EI_PAD │ 0x09 │ 7 │ YES │ Free real estate ;)
E_TYPE │ 0x10 │ 2 │ NO │ The type of object file, ET_EXEC, ET_DYN etc.
E_MACHINE │ 0x12 │ 2 │ NO │ This is the CPU arch
E_VERSION │ 0x14 │ 4 │ YES │ Not checked, version 1 is the only version
E_ENTRY │ 0x18 │ 8 │ NO │ Entrypoint
E_PHOFF │ 0x20 │ 8 │ NO │ Program header offset.
E_SHOFF │ 0x28 │ 8 │ YES │ Only if no section headers are defined
E_FLAGS │ 0x30 │ 4 │ YES │ Processor specific flags
E_EHSIZE │ 0x34 │ 2 │ YES │ ELF Header Size. Can be 0
E_PHENTSIZE │ 0x36 │ 2 │ NO │ Size of a program header, actually matters
E_PHNUM │ 0x38 │ 2 │ NO │ Number of program headers
E_SHENTSIZE │ 0x3A │ 2 │ YES │ Section Header size
E_SHNUM │ 0x3C │ 2 │ YES │ Number of section headers
E_SHSTRNDX │ 0x3E │ 2 │ YES │ Section header string table index
──────────────┴──────┴────┴─────┴───────────────────────────────────────────────
Visual representation of what can be overwritten in the ELF Header, indicated by _:
00000000: 7f45 4c46 ____ ____ ____ ____ ____ ____ .ELF............
00000010: 0300 3e00 ____ ____ 5058 0000 0000 0000 ..>.....PX......
00000020: 4000 0000 0000 0000 ____ ____ ____ ____ @...............
00000030: ____ ____ ____ 3800 0100 ____ ____ ____ ....@.8...@.....
Out of the 64 bytes in the ELF header, 36 of them can be used for whatever. You can also do some fun tricks with the known offsets in the header, like use e_entry to store some code and a short jump, or use \x7F\x45 to do a `jg 0x47`, or use EI_MAG* as a constant at 0x0 in the process memory :)
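Both numbers in that paragraph are easy to re-derive. The sketch below sums the OW? column of the table and computes the target of the two-byte short jump formed by \x7F\x45 at address 0. `short_jump_target` is a hypothetical helper.

```python
# (name, size in bytes, overwritable?) as listed in the ELF header table
FIELDS = [
    ("EI_MAG0", 1, False), ("EI_MAG1", 1, False),
    ("EI_MAG2", 1, False), ("EI_MAG3", 1, False),
    ("EI_CLASS", 1, True), ("EI_DATA", 1, True), ("EI_VERSION", 1, True),
    ("EI_OSABI", 1, True), ("EI_ABIVERSION", 1, True), ("EI_PAD", 7, True),
    ("E_TYPE", 2, False), ("E_MACHINE", 2, False), ("E_VERSION", 4, True),
    ("E_ENTRY", 8, False), ("E_PHOFF", 8, False), ("E_SHOFF", 8, True),
    ("E_FLAGS", 4, True), ("E_EHSIZE", 2, True), ("E_PHENTSIZE", 2, False),
    ("E_PHNUM", 2, False), ("E_SHENTSIZE", 2, True), ("E_SHNUM", 2, True),
    ("E_SHSTRNDX", 2, True),
]
total = sum(size for _, size, _ in FIELDS)
free = sum(size for _, size, ow in FIELDS if ow)
print(free, "of", total, "bytes are free")

def short_jump_target(addr, disp):
    """Target of a 2-byte short jump at addr with an 8-bit signed displacement."""
    if disp >= 0x80:
        disp -= 0x100
    return addr + 2 + disp  # relative to the next instruction

print(hex(short_jump_target(0x0, 0x45)))  # 7F 45 at the start of the file
```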
Now we can do the same thing with the program header too:
─────────┬──────┬────┬─────┬───────────────────────────────────────────────
Name │ OFFS │ SZ │ OW? │ Note
─────────┼──────┼────┼─────┼───────────────────────────────────────────────
P_TYPE │ 0x00 │ 4 │ NO │ The first one needs to be 1, SIGSEGV otherwise
P_FLAGS │ 0x04 │ 4 │ PRT │ Only the bottom byte is needed
P_OFFSET │ 0x08 │ 8 │ NO │ Pretty much must be 0 for the first PT_LOAD
P_VADDR │ 0x10 │ 8 │ NO │ This is required
P_PADDR │ 0x18 │ 8 │ YES │ This seems to be largely ignored, but will need more testing
P_FILESZ │ 0x20 │ 8 │ PRT │ Can exceed the actual file size, as long as p_memsz >= p_filesz
P_MEMSZ │ 0x28 │ 8 │ PRT │ Can exceed the actual file size, as long as p_memsz >= p_filesz
P_ALIGN │ 0x30 │ 8 │ PRT │ Must be a power of 2
─────────┴──────┴────┴─────┴───────────────────────────────────────────────
Visual representation of what can be overwritten in the Program Header, indicated by _:
00000040: 0100 0000 !!__ ____ 0000 0000 0000 0000 ................
00000050: 00_0 ____ __!_ 0000 ____ ____ ____ ____ ..@.......@.....
00000060: !!__ ____ 0!00 0000 !!__ ____ 0!00 0000 ................
00000070: _!__ ____ ____ ____ ........
This one is a little more complicated. All of the !'s represent a nibble (4 bits) that has its own limitations; these belong to the fields marked PRT in the table. This is due to some hard limits set elsewhere in the kernel (more on that later) that you have to abide by to make your binary work. All told, roughly 32 of the 56 bytes in this header can be used.
Compare this to the program header of 0xFFtactics to see _some_ of the limits of this header.
00000040: 0100 0000 ffff ffff 0000 0000 0000 0000 ................
00000050: 00f0 ffff ff4f 0000 ffff ffff ffff ffff .....O..........
00000060: 00ff ffff 0700 0000 00ff ffff 0700 0000 ................
00000070: feff ffff ffff ffff ........
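To read the dump above field by field, the 56 bytes can be unpacked as an Elf64_Phdr. A small sketch, using the hex exactly as shown:

```python
import struct

dump = """
0100 0000 ffff ffff 0000 0000 0000 0000
00f0 ffff ff4f 0000 ffff ffff ffff ffff
00ff ffff 0700 0000 00ff ffff 0700 0000
feff ffff ffff ffff
"""
raw = bytes.fromhex("".join(dump.split()))

names = ("p_type", "p_flags", "p_offset", "p_vaddr",
         "p_paddr", "p_filesz", "p_memsz", "p_align")
# Elf64_Phdr: two 32-bit fields followed by six 64-bit fields, little endian
phdr = dict(zip(names, struct.unpack("<IIQQQQQQ", raw)))
for name, value in phdr.items():
    print(f"{name:9} {value:#x}")
```

The values line up with the assembly listing: p_vaddr is the 0x4FFFFFFFF000 base, and p_filesz and p_memsz are both 0x7FFFFFF00.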
An interesting side effect of this, even if you aren't golfing, is that it can really mess up a lot of parsers, including pretty much all of the binutils.
$ readelf -a 0xFFtactics
ELF Header:
Magic: 7f 45 4c 46 fe ff ff ff ff ff ff ff ff ff ff ff
Class: <unknown: fe>
Data: <unknown: ff>
Version: 255 <unknown>
OS/ABI: <unknown: ff>
ABI Version: 255
Type: EXEC (Executable file)
Machine: Advanced Micro Devices X86-64
Version: 0xffffffff
Entry point address: 0xfffff078
Start of program headers: 20479 (bytes into file)
Start of section headers: 64 (bytes into file)
Flags: 0x0
Size of this header: 65535 (bytes)
Size of program headers: 65535 (bytes)
Number of program headers: 65535
Size of section headers: 65535 (bytes)
Number of section headers: 65535
Section header string table index: 65535 <corrupt: out of range>
readelf: Warning: The e_shentsize field in the ELF header is larger than the size of an ELF
section header
readelf: Error: Reading 4294836225 bytes extends past end of file for section headers
readelf: Error: Section headers are not available!
readelf: Error: Too many program headers - 0xffff - the file is not that big
There is no dynamic section in this file.
readelf: Error: Too many program headers - 0xffff - the file is not that big
For more info on generating ELFs with data stored in the headers like this, check out the paper @TheXcellerator did in tmp.0ut:1 - Dead Bytes [3] and the libgolf library [4].
Now we should get into some of the overlays that are possible.
REFS:
[1] https://n0.lol/ebm/test.asm
[2] https://refspecs.linuxfoundation.org/elf/gabi4+/ch4.eheader.html
[3] https://tmpout.sh/1/1.html
[4] https://www.github.com/xcellerator/libgolf
3.1 The 0x38 Overlay
In the previous part of this section, we determined that P_TYPE has to be 1, E_PHNUM must contain the correct number of Program Headers, and the fields after can be overwritten. This allows for a program header overlay at 0x38, shrinking the combined header size from 120 to 112.
In @subvisor's write up about this trick [1], e_ehsize was set to 0x38 to reflect the total size of the ELF header. The fuzzing attempted earlier showed that this can actually be any number, because the ELF loader assumes that you're following the spec. Check out their post for more info about this and other fun things!
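The size arithmetic behind an overlay is simple enough to write down. In this sketch, `header_bytes` is a hypothetical helper. The last two lines show why 0x38 in particular works: e_phnum = 1 followed by a zeroed e_shentsize reads back as a 32 bit p_type of 1.

```python
import struct

EHDR_SIZE = 0x40  # size of Elf64_Ehdr
PHDR_SIZE = 0x38  # size of Elf64_Phdr

def header_bytes(phoff):
    """Bytes spanned by both headers when the program header starts at phoff."""
    return max(EHDR_SIZE, phoff + PHDR_SIZE)

for phoff in (0x40, 0x38):
    print(hex(phoff), "->", header_bytes(phoff), "bytes")

# e_phnum (2 bytes, value 1) + e_shentsize (2 bytes, value 0) seen as p_type
overlap = struct.pack("<HH", 1, 0)
print(struct.unpack("<I", overlap)[0])
```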
REFS:
[1] https://ftp.lol/posts/small-elf.html
3.2 The 0x31 Overlay
This was first publicly demonstrated by Twitter user @f1ac5, who posted a binary [1] that printed their handle. It came to my attention after subvisor had posted some of their own ELF experiments. The code is clever, and it does its overlay in an interesting way.
This is the layout, with the program header fields highlighted:
$ xxd f1ac5.bin
00000000: 7f45 4c46 0a6a 016a 065a 5889 c7eb 1900 .ELF.j.j.ZX.....
00000010: 0200 3e00 0f05 eb49 0500 0100 0000 0000 ..>....I........
00000020: 3100 0000 0000 0000 be49 0001 00eb e500 1........I......
p_type p_flags p_offs
┌────────┐┌────────┐┌────────────────
00000030: 0001 0000 0005 3800 0100 0000 0000 0000 ......8.........
p_vaddr p_paddr
─┐┌──────────────────┐┌────────────────
00000040: 0000 0001 0000 0000 0066 6c61 6373 0a00 .........flacs..
p_filesz p_memsz
─┐┌──────────────────┐┌────────────────
00000050: 0068 0000 0000 0000 0068 0000 0000 0000 .h.......h......
p_align
─┐┌──────────────────┐
00000060: 006a 3c58 89df 0f05 0000 .j<X......
Base64 Version
f0VMRgpqAWoGWliJx+sZAAIAPgAPBetJBQABAAAAAAAxAAAAAAAAAL5JAAEA6+UAAAEAAAAFOAAB
AAAAAAAAAAAAAAEAAAAAAGZsYWNzCgAAaAAAAAAAAABoAAAAAAAAAGo8WInfDwUAAA==
There's actually a good amount of code in here too. Note that the virtual address is set to 0x10000.
0x10005: 6a01 push 0x1
0x10007: 6a06 push 0x6
0x10009: 5a pop rdx ; buffer length
0x1000A: 58 pop rax ; write syscall
0x1000B: 89c7 mov edi,eax
┌0x1000D: eb19 jmp 0x10028
│...
┌│0x10014: 0f05 syscall
┌││0x10016: eb49 jmp 0x10061
│││...
││└0x10028: be49000100 mov esi, 0x10049 ; The buffer
│└─0x1002D: ebe5 jmp 0x10014
│ ...
└──0x10061: 6a3c push 0x3c
0x10063: 58 pop rax
0x10064: 89df mov edi,ebx
0x10066: 0f05 syscall
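The overlapping fields can also be pulled straight out of the Base64 version above. This sketch reads e_entry and e_phoff from the ELF header, then unpacks the first four program header fields from wherever e_phoff points.

```python
import base64
import struct

B64 = (
    "f0VMRgpqAWoGWliJx+sZAAIAPgAPBetJBQABAAAAAAAxAAAAAAAAAL5JAAEA6+UAAAEAAAAFOAAB"
    "AAAAAAAAAAAAAAEAAAAAAGZsYWNzCgAAaAAAAAAAAABoAAAAAAAAAGo8WInfDwUAAA=="
)
blob = base64.b64decode(B64)

# e_entry and e_phoff sit back to back at offset 0x18 of the ELF header
e_entry, e_phoff = struct.unpack_from("<QQ", blob, 0x18)
p_type, p_flags, p_offset, p_vaddr = struct.unpack_from("<IIQQ", blob, e_phoff)

print("size:", len(blob), "e_entry:", hex(e_entry), "e_phoff:", hex(e_phoff))
print("p_type:", p_type, "perm bits:", hex(p_flags & 0x7),
      "p_offset:", hex(p_offset), "p_vaddr:", hex(p_vaddr))
print("buffer:", blob[0x49:0x4F])
```

Only the low three bits of p_flags are shown, since the upper bytes of that field double as e_phentsize and e_phnum.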
This was thought to be the smallest overlay that could be done, now that 0x1C didn't work, and there were few viable locations for another.
After seeing this, the hunt was on to find something even smaller. During this process, there were some interesting failures (some for future writeups). To understand them, we need to look deeper into the kernel, to understand how ELF files are turned into processes that execute what we want.
REFS:
[1] https://twitter.com/f1ac5/status/1241925791514071040
4. Tracing the ELF Loader
There's a lot of really good writeups about how the ELF loader (and processes in general) work that are far beyond the scope of this article. If you want to really understand this on an atomic scale, I'd recommend reading them, and following along with kernel source.
LWN How programs get run: ELF binaries https://lwn.net/Articles/631631/
Linux Insides - Chapter 4.4 - How does the Linux kernel run a program https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-4.html
The Linux Programming Interface - Chapter 24 - Process Creation https://nostarch.com/tlpi
The main thing we'll be focusing on to understand why different overlays don't work is binfmt_elf.c, which contains much of the code used to parse ELF files and assist in setting up processes. It does a number of checks to see if the file is valid, and then does various tasks such as mapping memory, and locating the interpreter for the program if one is required.
Tracing the functions called by `load_elf_binary` can show us (almost) exactly what is happening on a kernel level as our binaries are loaded. To track that our syscalls are called correctly, we use strace. But syscalls are implemented by the kernel, so what do you use if you need to track those?
The answer: ftrace!
The ftrace tool [1] is a framework of utilities used to track what happens during each process from the kernel's perspective. There's a lot of documentation, utilities, and frontends for it, which I encourage you to play around with. The way I used it here was relatively simple, and there are much more interesting things you can do!
When you run ftrace and record a process, a lot of output is generated. There are so many functions that it can be hard to contextualize them, and a lot of redundant things that make it confusing. The approach I took was to use my known good ELF from earlier, test.bin, as a baseline to compare other binaries to. Because of how complex nearly every other ELF is in terms of how it is loaded and parsed, working with a simple binary makes things a lot less confusing.
I used trace-cmd to record all of the function calls of the working binary, and then all of the ones I had been working on too. I piped each report into a text file so I could read it later.
$ trace-cmd record -p function -F ./test.bin
$ trace-cmd report > report.test.txt
NOTE: The output from this is very W I D E, so I'm truncating a bunch of the spaces when I need to.
The first thing I did was grep for the load_elf_binary function to make sure that it was actually called. It was present in all of my tests.
$ grep -n load_elf_binary report.*
report.test.txt:1103: test-30856 [001] 21296.075952: function: load_elf_binary
report.x19.txt:803: x19-31112 [001] 21511.641779: function: load_elf_binary
report.x1A.txt:807: x1A-31697 [000] 21990.537442: function: load_elf_binary
report.x1B.txt:802: x1B-31738 [003] 22014.264920: function: load_elf_binary
report.xit.txt:924: xit-31789 [004] 22056.369197: function: load_elf_binary
I had a few errors that I wanted to track down, such as ENOMEM and EINVAL, but there are a lot of places where these errors can happen. By working backwards from the end of the function, you can compare where some binaries failed while others succeeded.
The original 84 byte ELF somehow passed all of the checks I had been looking at. I found a function unique to it in the trace output, bad_area_access_error.
5223 xit-31789 [004] 22056.370823: function: find_vma
5224 xit-31789 [004] 22056.370823: function: vmacache_find
* 5225 xit-31789 [004] 22056.370824: function: bad_area_access_error
5226 xit-31789 [004] 22056.370824: function: up_read
...
5378 xit-31789 [004] 22056.370904: function: force_sig_fault
Looking for references to bad_area_access_error, I was only able to find one that made sense [2], within arch/x86/mm/fault.c. The function do_user_addr_fault was called by something else, but it was hard to track down, because do_user_addr_fault [3] is marked with NOKPROBE_SYMBOL(), which means we can't trace this function the normal way. There are actually quite a few spots [4] where functions are hidden from kprobes [5], which makes debugging certain things very challenging.
This can ultimately be traced back to the lack of executable permissions, and the error 15 we saw earlier. For the others, things fail for different reasons.
REFS:
[1] https://www.kernel.org/doc/Documentation/trace/ftrace.txt
[2] https://elixir.bootlin.com/linux/latest/C/ident/bad_area_access_error
[3] https://elixir.bootlin.com/linux/latest/source/arch/x86/mm/fault.c#L1447
[4] https://elixir.bootlin.com/linux/latest/C/ident/NOKPROBE_SYMBOL
[5] https://lwn.net/Articles/132196/
5. Limited Addition
For new overlays, I focused on the offsets just below 0x1C: 0x19, 0x1A, and 0x1B. These three seemed possible to use without interfering with the 8 bytes of 0's required for the p_offset field, or the 4 bytes required for p_type.
0x19 seemed like the best candidate, because e_phoff would still be in the p_flags field, staying within our boundaries. I built a POC and tested it.
$ strace ./x19
execve("./x19", ["./x19"], 0x7ffd5ecf1610 /* 47 vars */) = -1 EINVAL (Invalid argument)
+++ killed by SIGSEGV +++
Segmentation fault (core dumped)
Comparing ftrace output to the test.bin output, I found that the last common function called by load_elf_binary was elf_map [1]. This function maps the memory segment described by the program header, and does some checks to make sure it makes sense before running.
$ grep -n elf_map report.test.txt
5669: test-30856 [001] 21296.077624: function: elf_map
$ grep -n elf_map report.x19.txt
4638: x19-31112 [001] 21511.643201: function: elf_map
I took a look at the subsequent calls and found that it failed in vm_mmap, called by elf_map.
$ less -N report.x19.txt
...
4638: elf_map
4639: vm_mmap ; [2]
4640: kfree
4641: kfree
4642: kfree
4643: _raw_read_lock
4644: module_put
4645: force_sigsegv ; [3]
Specifically here, in vm_mmap:
if (unlikely(offset + PAGE_ALIGN(len) < offset))
return -EINVAL;
if (unlikely(offset_in_page(offset)))
return -EINVAL;
The PAGE_ALIGN macro is what caught my attention. (Like `unlikely()`, it is a macro, so it doesn't show up in ftrace -_-)
In my source for x19, the base virtual address I set was 0x50000000100, because of how the headers overlapped with the entrypoint.
┌ e_entry ────────┐
│ │
│ p_type p_flags
│ ┌────────┐┌──────>
00000010: 0200 3e00 0100 0000 0801 0000 0005 0000 ..>.............
Unfortunately, this violates the required page alignment, which is 0x1000 (4096 bytes): with p_offset fixed at 0, p_vaddr has to be a multiple of the page size, and 0x50000000100 is not. Seeing as there was no way to change p_type without increasing the size of the binary, there was no path forward here.
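The alignment failure can be reproduced with plain integer math. This sketch is based on my reading of elf_map, which (as far as I can tell) hands the mapping call a file offset of p_offset minus the in-page part of p_vaddr. The helper names are hypothetical and the model ignores everything else the function does.

```python
PAGE_SIZE = 0x1000
MASK64 = (1 << 64) - 1

def mmap_offset(p_offset, p_vaddr):
    """File offset used for the mapping: p_offset minus p_vaddr's in-page part."""
    return (p_offset - (p_vaddr & (PAGE_SIZE - 1))) & MASK64

def offset_ok(p_offset, p_vaddr):
    """Mirror of the offset_in_page() check: the offset must be page aligned."""
    return mmap_offset(p_offset, p_vaddr) % PAGE_SIZE == 0

print(offset_ok(0, 0x50000000100))  # x19 base address
print(offset_ok(0, 0x100000000))    # base address of the old 0x1C overlay
```

With p_offset at 0, any p_vaddr with nonzero low 12 bits makes the offset wrap around to an unaligned value, which is where the EINVAL comes from.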
The next overlay I tried was 0x1B, which surely would give enough room for page alignment, and then some. I made my POC, set the base address to a ridiculous 0x0500000001000000, and ran it.
$ strace ./x1B
execve("./x1B", ["./x1B"], 0x7ffe9a0fa4d0 /* 47 vars */) = -1 ENOMEM (Cannot allocate memory)
+++ killed by SIGSEGV +++
Segmentation fault (core dumped)
Great, a whole new issue! Looking through my reports, I saw that this in fact does make it through vm_mmap, but hits a force_sigsegv right after entering `arch_get_unmapped_area_topdown`[4]. Looking at this function, there's an ENOMEM error that suddenly makes so much sense:
/* requested length too big for entire address space */
if (len > mmap_end - mmap_min_addr)
return -ENOMEM;
Of course - there's a limit that I hadn't thought about. With 48 bit virtual addresses, user space only gets the lower half of the address space, so any virtual address at or above 0x800000000000 (2^47) will fail. This means that overlays at 0x1B and 0x1A won't work, because there's no possible way to fit within this limit without sacrificing a key field that is required in the program header.
Per reference [6] from section 2:
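The bits that each overlay pins in e_entry can be computed from the offsets alone. The sketch below assumes p_type must be 1 and the low byte of p_flags must be 5 (PF_R | PF_X), and treats 2^47 as the user-space ceiling. `forced_entry_bits` is a hypothetical helper.

```python
E_ENTRY = 0x18  # offset of e_entry in the ELF header

def forced_entry_bits(overlay):
    """Bits of e_entry pinned when the program header starts at overlay."""
    value = 0
    # p_type (= 1) sits at overlay, p_flags (low byte = 5) at overlay + 4
    for field_off, field_val in ((overlay, 1), (overlay + 4, 5)):
        shift = 8 * (field_off - E_ENTRY)
        if 0 <= shift < 64:
            value |= field_val << shift
    return value

USER_LIMIT = 1 << 47  # assumed ceiling with 48 bit virtual addresses

for overlay in (0x19, 0x1A, 0x1B):
    bits = forced_entry_bits(overlay)
    print(hex(overlay), hex(bits),
          "below limit:", bits < USER_LIMIT,
          "page aligned:", bits % 0x1000 == 0)
```

0x19 stays under the limit but can't be page aligned, while 0x1A and 0x1B are aligned but land above it.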
> On 64-bit x86 Linux, generally any faulting address over 0x7fffffffffff will be reported as having a mapping and so you'll get error codes 5, 7, or 15 respective for read, write, and attempt to execute. These are always wild or corrupted pointers (or addresses more generally), since you never have valid user space addresses up there.
This limit isn't even just from an OS perspective, this is tied directly to the CPU architecture:
$ grep address /proc/cpuinfo
address sizes : 39 bits physical, 48 bits virtual
At this point, it felt like all hope was lost. If the CPU itself enforces these rules, what's left to be done?
REFS:
[1] https://elixir.bootlin.com/linux/latest/source/fs/binfmt_elf.c#L1141
[2] https://elixir.bootlin.com/linux/latest/source/mm/util.c#L529
[3] https://elixir.bootlin.com/linux/v4.4/source/kernel/signal.c#L1447
[4] https://elixir.bootlin.com/linux/latest/source/mm/mmap.c#L2207
6. There's Levels To This Chip
I was now disheartened, reading kernel documentation, as people do, when I found something I hadn't heard of. In kernel 4.14, a patch [1] was introduced to allow for 5-level paging [2][3], which is an extension of the size of virtual addresses from 48 bits to 57 bits. This increased the size from 256TB to 128PB, which is becoming necessary due to the increasing power of VPSes and hypervisors.
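The sizes quoted here follow directly from the bit widths. The sketch below also checks the 0x1A and 0x1B base addresses against the larger space, assuming user space still gets only the lower half (2^56) under 5-level paging. `human` is a hypothetical helper.

```python
def human(n):
    """Format a byte count using binary units."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"):
        if n < 1024:
            return f"{n} {unit}"
        n //= 1024

for bits in (48, 57):
    print(bits, "bits:", human(1 << bits),
          "| user half ends below", hex(1 << (bits - 1)))

USER_LIMIT_57 = 1 << 56  # assumed user-space ceiling with 57 bit addresses
print(0x0005000000010000 < USER_LIMIT_57)  # base forced by the 0x1A overlay
print(0x0500000001000000 < USER_LIMIT_57)  # base used for the 0x1B overlay
```

Under these assumptions the 0x1A base fits in the larger address space, while the 0x1B base is still too high.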
It seems to be enabled by default in Ubuntu 20.04.
You can check if 5-level paging is enabled like so:
$ grep CONFIG_X86_5LEVEL /boot/config-$yourkernelversion
CONFIG_X86_5LEVEL=y
After some research, it seemed like only the Intel Ice Lake line of CPUs actually implemented this. Unable to locate anything that used an Ice Lake processor, I asked around. A friend from the tmp.0ut chat, @David3141593, informed me that qemu was able to emulate 5-level paging, which meant that I could actually test my 0x1A overlay. They got to it before I did, and were able to confirm that it indeed worked.
I ended up compiling qemu myself [4] because I wasn't sure what version it was added to, and could only get older ones from my package manager. This is what I ran on a freshly compiled qemu 6.0.0:
$ ./qemu-system-x86_64 -accel tcg -cpu qemu64,+la57 -m 4096 /home/u/ISO/ubuntu-20.04.iso
You have to connect via VNC, and it somehow takes 30 minutes to actually get to the desktop, but it was confirmed to work! An 82 byte ELF64 that requires 5-level paging to run.
See here: https://twitter.com/netspooky/status/1412222077441040387
Here is the POC:
$ base64 x1A
f0VMRgIBAQCwPGa/BgAPBQIAPgABAAAACAABAAAABQAaAAAAAAAAAAAAAAABAAAABQAAAEAAOAAB
AAAQAAAAAAAAABAAAAAAAAAAEAAAAAAAAA==
$ sha256sum x1A
2da9f0607057e798a38bacad64d5e38c58af18689be519949a9f28fe3f43a925 x1A
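Decoding the POC shows where everything landed. This sketch unpacks the headers from the base64 above and checks the load address against the two limits, assuming user space gets the lower half of the address space in both paging modes.

```python
import base64
import struct

B64 = (
    "f0VMRgIBAQCwPGa/BgAPBQIAPgABAAAACAABAAAABQAaAAAAAAAAAAAAAAABAAAABQAAAEAAOAAB"
    "AAAQAAAAAAAAABAAAAAAAAAAEAAAAAAAAA=="
)
blob = base64.b64decode(B64)

e_entry, e_phoff = struct.unpack_from("<QQ", blob, 0x18)
(p_type, p_flags, p_offset, p_vaddr,
 p_paddr, p_filesz, p_memsz, p_align) = struct.unpack_from("<IIQQQQQQ", blob, e_phoff)

print("size:", len(blob), "e_phoff:", hex(e_phoff))
print("e_entry:", hex(e_entry), "p_vaddr:", hex(p_vaddr))
print("fits in 48 bit user space:", p_vaddr < (1 << 47))
print("fits in 57 bit user space:", p_vaddr < (1 << 56))
print("code:", blob[e_entry - p_vaddr:e_entry - p_vaddr + 8].hex())
```

The program header ends exactly at the last byte of the file, and the entry point is 8 bytes into the image, inside e_ident.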
This POC only returns 6, the original value from test.asm. I wanted it to actually do something, like print a message, to show that it was more than just a little trick...
REFS:
[1] http://lkml.iu.edu/hypermail/linux/kernel/1612.1/00383.html
[2] https://www.kernel.org/doc/html/v5.9/x86/x86_64/5level-paging.html
[3] https://en.wikipedia.org/wiki/Intel_5-level_paging
[4] https://wiki.qemu.org/Hosts/Linux
7. write() Or Die
The final POC for this is an 82 byte ELF 64 that prints the word "ELF" and exits cleanly. Tested in qemu 6.0.0 on Ubuntu 20.04 with kernel 5.8.0 using the same configuration as above.
$ base64 p82.3
f0VMRv/A/8exBUjB4TDrBAIAPgCyA+sdBAABAAAABQAaAAAAAAAAAAAAAAABAAAABQAAsDy+OAAB
AEgBzusLAAAASAHO6wsAAABIg+43DwXr4Q==
$ sha256sum p82.3
308a30c9a47cd2665701f30397d5b744d26cf23e6c556485aea8eeb01691a581 p82.3
Here is the layout for p82.3:
ELFHEADER A───────┐ B┐C┐ D┐E┐ F─────────────────┐
00000000: 7f45 4c46 ffc0 ffc7 b105 48c1 e130 eb04 .ELF.......H..0.
│ │ │ │ │ │
7f45 4c46─│────│────│────│─────────│──────────────────── db 0x7F, "ELF"
ffc0─│────│────│─────────│──────────── _start: inc eax ; write syscall
ffc7─│────│─────────│──────────────────── inc edi ; fd = STDOUT
b105─│─────────│──────────────────── mov cl,0x5
48c1 e130─│──────────────────── shl rcx,0x30
eb04───────────────── jmp ev
ELFHEADER G──┐ H──┐ I───────┐ J─────────────────┐
PRGHEADER │ │ │ │ │ │ │ p_type p_flags
│ │ │ │ │ │ │ ┌───────┐ ┌──>
00000010: 0200 3e00 b203 eb1d 0400 0100 0000 0500 ..>.............
│ │ │ │ │
0200─│────│────│────│──────────────────────────────────── dw 0x2
3e00─│────│────│──────────────────────────────────── dw 0x3e
b203─│────│──────────────────────────────── ev: mov dl, 0x3 ; length
eb1d─│──────────────────────────────────── jmp ehs
0400 0100 0000 0500────────────────── dq 0x05000000010004
ELFHEADER K─────────────────┐ L─────────────────┐
PRGHEADER │ p_offs │ │ p_vaddr │
>──┐ ┌─────────────────┐ ┌────────────>
00000020: 1a00 0000 0000 0000 0000 0000 0100 0000 ................
│ │
1a00 0000 0000 0000──────│─────────────────────────────── dq 0x1A
0000 0000 0100 0000────────────────── dq 0x100000000
ELFHEADER M───────┐ N──┐ O──┐ P──┐ Q──┐ R──┐ S──┐
PRGHEADER │ p_paddr│ │ │ │ │ │p_filesz │ │
>──┐ ┌─────────────────┐ ┌────────────>
00000030: 0500 00b0 3cbe 3800 0100 4801 ceeb 0b00 ....<.8...H.....
│ │ │ │ │ │ │
0500─│─│────│────────────│──────│────│─────────────────── dw 5
00│────│────────────│──────│────│─────────────────── db 0
b0 3c│────────────│──────│────│────────────── xit: mov al, 0x3C ; exit syscall
be 3800 0100─│──────│────│────────────── ehs: mov esi,0x10038
4801 ce│────│─────────────────── add rsi, rcx
eb 0b│─────────────────── jmp pal
00────────────────── db 0
PRGHEADER p_memsz p_align
>──┐ ┌─────────────────┐ ┌────────────>
00000040: 0000 4801 ceeb 0b00 0000 4883 ee37 0f05 ..H.......H..7..
│ │ │ │ │ │ │
0000─│──────│────│──│────│─────────│───────────────────── dw 0
4801 ce│────│──│────│─────────│───────────────────── add rsi, rcx
eb 0b│──│────│─────────│───────────────────── jmp 0xb
00─│────│─────────│───────────────────── db 0
0000─│─────────│───────────────────── dw 0
4883 ee37─│──────────────── pal: sub rsi, 0x37 ; *buf
0f05────────────────── syscall
>──┐
00000050: ebe1 ..
│
ebe1───────────────────────────────────────────────────── jmp xit
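The control flow in the layout above hops between the few header bytes that are free to hold code. As a sanity check, this sketch resolves each short jump in p82.3 and redoes the pointer arithmetic that builds the write() buffer address. It assumes the jump offsets and the load address (p_vaddr 0x5000000010000, p_offset 0) shown in the listing. The names (short_jmp_target, JUMPS, LOAD_BASE) are made up for illustration.

```python
import base64

# p82.3, copied from the base64 dump above.
P82_B64 = (
    "f0VMRv/A/8exBUjB4TDrBAIAPgCyA+sdBAABAAAABQAaAAAAAAAAAAAAAAABAAAABQAAsDy+OAAB"
    "AEgBzusLAAAASAHO6wsAAABIg+43DwXr4Q=="
)
blob = base64.b64decode(P82_B64)


def short_jmp_target(data, off):
    """Resolve an x86 short jump (opcode EB, signed 8-bit displacement)."""
    assert data[off] == 0xEB, "not a short jmp"
    rel = int.from_bytes(data[off + 1:off + 2], "little", signed=True)
    return off + 2 + rel  # displacement is relative to the next instruction


# File offsets of the jumps that are actually taken, from the listing.
JUMPS = {"jmp ev": 0x0E, "jmp ehs": 0x16, "jmp pal": 0x3D, "jmp xit": 0x50}
targets = {name: short_jmp_target(blob, off) for name, off in JUMPS.items()}
for name, target in targets.items():
    print(f"{name:8} -> file offset {target:#04x}")

# Buffer pointer: mov cl,5 / shl rcx,0x30 / mov esi,0x10038 /
# add rsi,rcx / sub rsi,0x37 (assuming rcx starts out as zero).
rcx = 5 << 0x30
rsi = 0x10038 + rcx - 0x37
LOAD_BASE = 0x5000000010000  # p_vaddr, with p_offset == 0
buf_off = rsi - LOAD_BASE
message = blob[buf_off:buf_off + 3]  # dl = 3 bytes
print(f"buf = {rsi:#x} (file offset {buf_off}) -> {message!r}")
```

The buffer ends up one byte past the load address, so the three bytes written are the "ELF" from the file's own magic.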
8. ./xit
It seems like this is the end of the line for ELF64 Binary Golf on x86_64 in terms of how small something can actually be. 105 bytes is the size limit for CPUs with 48 bit virtual addresses, and 82 bytes is the limit for CPUs with 57 bit virtual addresses.
I made this table covering ELF header offsets 0 through 0x38, describing why a program header overlay does or doesn't work at each one. If anyone does manage to make something smaller, or craft a POC, let me know!
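The split between those two cases comes down to whether the entry point fits in user space. Here is a rough sketch of that check. It assumes the x86_64 kernel's user address limit is (1 << shift) - PAGE_SIZE, with shift 47 under 4-level paging and 56 under 5-level paging. That is my reading of TASK_SIZE_MAX, so treat the constants as assumptions. The user half gets one bit less than the full 48 or 57 bit address width.

```python
PAGE_SIZE = 0x1000


def task_size_max(virtual_mask_shift):
    """Assumed upper bound of user space for a given address width."""
    return (1 << virtual_mask_shift) - PAGE_SIZE


LIMITS = {
    "4-level": task_size_max(47),  # 48 bit virtual addresses
    "5-level": task_size_max(56),  # 57 bit virtual addresses
}

# e_entry of the 82 byte binary, forced by the 0x1A overlay.
ENTRY_P82 = 0x0005000000010004


def fits(addr, limit):
    """True if addr lies below the user space limit."""
    return addr < limit


for mode, limit in LIMITS.items():
    print(f"{mode}: limit {limit:#x}, entry fits: {fits(ENTRY_P82, limit)}")
```

The 0x05 that the overlay forces into the upper bytes of e_entry pushes the address past the 4-level limit, which is why the 82 byte file only loads with la57 enabled.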
Note that this only applies to x86_64 ELFs.
─────┬───┬────────────────────────────────────────────────────────────────────────────────
OFFS │ ? │ Description
─────┼───┼────────────────────────────────────────────────────────────────────────────────
0x00 │ . │ ELF signature interferes with p_type
0x01 │ . │ ELF signature interferes with p_type
0x02 │ . │ ELF signature interferes with p_type
0x03 │ . │ ELF signature interferes with p_type
0x04 │ . │ e_type and e_machine interfere with p_offset
0x05 │ . │ e_type and e_machine interfere with p_offset
0x06 │ . │ e_type and e_machine interfere with p_offset
0x07 │ . │ e_type and e_machine interfere with p_offset
0x08 │ . │ e_type and e_machine interfere with p_offset
0x09 │ . │ e_machine interferes with p_offset
0x0A │ . │ e_machine interferes with p_offset
0x0B │ . │ Needs the entrypoint to be 0, also can't exec the ELF sig without setting flags
0x0C │ . │ e_type is 0002, so PF_X in p_flags won't be set. Same entrypoint issue as above
0x0D │ . │ e_type interferes with p_type, also same entrypoint issue as above
0x0E │ . │ Interference with p_type and p_offset
0x0F │ . │ Interference with p_type and p_offset
0x10 │ . │ Interference with p_type and p_offset
0x11 │ . │ Interference with p_type and p_offset
0x12 │ . │ Interference with p_type and p_offset
0x13 │ . │ Interference with p_type and p_offset
0x14 │ . │ Interference with p_type and p_offset
0x15 │ . │ Interference with p_type and p_offset
0x16 │ . │ Interference with p_type and p_offset
0x17 │ . │ Interference with p_type and p_offset
0x18 │ . │ e_phoff will interfere with p_offset
0x19 │ . │ The required entrypoint addr is not page aligned
0x1A │ Y │ Needs 5-level paging. Binary size is 82
0x1B │ . │ The entrypoint addr would be beyond even 56 bits
0x1C │ . │ Doesn't work because PF_X is not set
0x1D │ . │ e_phoff interferes with p_type
0x1E │ . │ e_phoff interferes with p_type
0x1F │ . │ e_phoff interferes with p_type
0x20 │ . │ p_type interferes with e_phoff
0x21 │ . │ p_type interferes with e_phoff
0x22 │ . │ p_type interferes with e_phoff
0x23 │ . │ p_type interferes with e_phoff
0x24 │ . │ p_type interferes with e_phoff
0x25 │ . │ p_type interferes with e_phoff
0x26 │ . │ p_type interferes with e_phoff
0x27 │ . │ p_type interferes with e_phoff, e_phentsize interferes with p_offset
0x28 │ . │ e_phentsize interferes with p_offset
0x29 │ . │ e_phentsize and e_phnum interfere with p_offset
0x2A │ . │ e_phentsize and e_phnum interfere with p_offset
0x2B │ . │ e_phentsize and e_phnum interfere with p_offset
0x2C │ . │ e_phentsize and e_phnum interfere with p_offset
0x2D │ . │ e_phentsize and e_phnum interfere with p_offset
0x2E │ . │ e_phentsize and e_phnum interfere with p_offset
0x2F │ . │ e_phentsize and e_phnum interfere with p_offset
0x30 │ . │ e_phentsize and e_phnum interfere with p_offset
0x31 │ Y │ Does work, binary size is 105
0x32 │ . │ e_phentsize is not an odd number, so PF_X in p_flags isn't set.
0x33 │ . │ e_phentsize interferes with p_type
0x34 │ . │ e_phentsize interferes with p_type
0x35 │ . │ e_phentsize and e_phnum interfere with p_type
0x36 │ . │ e_phentsize and e_phnum interfere with p_type
0x37 │ . │ e_phnum interferes with p_type and p_type interferes with e_phentsize
0x38 │ Y │ Does work, binary size is 112
─────┴───┴────────────────────────────────────────────────────────────────────────────────
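The rows in this table can be cross-checked mechanically. The sketch below lists, for a given overlay offset, which ELF header fields share bytes with each program header field. It assumes the standard Elf64_Ehdr and Elf64_Phdr layouts. The helper name overlaps and the two field tables are made up for this example.

```python
# (name, offset, size) for Elf64_Ehdr
EHDR_FIELDS = [
    ("e_ident", 0x00, 16), ("e_type", 0x10, 2), ("e_machine", 0x12, 2),
    ("e_version", 0x14, 4), ("e_entry", 0x18, 8), ("e_phoff", 0x20, 8),
    ("e_shoff", 0x28, 8), ("e_flags", 0x30, 4), ("e_ehsize", 0x34, 2),
    ("e_phentsize", 0x36, 2), ("e_phnum", 0x38, 2), ("e_shentsize", 0x3A, 2),
    ("e_shnum", 0x3C, 2), ("e_shstrndx", 0x3E, 2),
]

# (name, offset, size) for Elf64_Phdr
PHDR_FIELDS = [
    ("p_type", 0x00, 4), ("p_flags", 0x04, 4), ("p_offset", 0x08, 8),
    ("p_vaddr", 0x10, 8), ("p_paddr", 0x18, 8), ("p_filesz", 0x20, 8),
    ("p_memsz", 0x28, 8), ("p_align", 0x30, 8),
]


def overlaps(overlay_off):
    """Map each Phdr field to the Ehdr fields it shares bytes with."""
    result = {}
    for pname, poff, psize in PHDR_FIELDS:
        start = overlay_off + poff
        end = start + psize
        result[pname] = [
            ename for ename, eoff, esize in EHDR_FIELDS
            if eoff < end and start < eoff + esize
        ]
    return result


for off in (0x1A, 0x1C, 0x32):
    shared = overlaps(off)
    print(f"{off:#04x}: p_type~{shared['p_type']} p_flags~{shared['p_flags']}")

# With the overlay at 0x1C, the low byte of p_flags is the low byte of
# e_phoff, which must be 0x1C. Bit 0 (PF_X) is clear.
print("PF_X at 0x1C:", bool(0x1C & 0x1))
```

At 0x32, p_flags lines up with e_phentsize and e_phnum, and since e_phentsize has to be 0x38, the low bit is clear there too. That matches the table's note for that row.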
All in all, this was a lot of fun. Everything was slow for a while, and suddenly was very fast. The Binary Golf Grand Prix 2021 is still happening as of this writing, and I have been in the golfing mood since then.
I want to thank everyone who tested stuff with me and created their own POCs over the years. It's been a joy working with people and learning about some deep kernel and processor things.
There are still so many interesting things to explore and mysteries to solve here. Don't be afraid to play around, even if people tell you that it's silly or impossible, or you feel like you'll never understand. What you learn becomes more real when you experiment and find things out for yourself.
Be on the lookout for more binary weirdness in tmp.0ut Volume 2!
Shoutouts: tmp.0ut, TCHQ, VXUG, TCPD, xcellerator, subvisor, retr0id, iximeow, dnz, hermit, Ian C, a certain fox, and everyone who worked on the BGGP.