Developing Tilck, a Tiny Linux-compatible kernel

Developing Vladislav K. Valtchev (2022)

What Tilck is  A project consisting on:  A
monolithic kernel written in C and assembly  A bootloader working both on UEFI and legacy BIOS systems  Several test suites and a powerful CMake-based build system  Buildroot-like scripts for downloading & building 3rd party software  Partially compatible with Linux at binary level  Uniprocessor, but fully preemptable  Educational, with potential to be more than that (see testing etc.)  Runs only on i686, at the moment (will be ported to ARM, RISC-V etc.)  Open source, distributed under the BSD 2-clause license Vladislav K. Valtchev (2022)

What Tilck is NOT  An attempt to replace Linux
 An attempt to be yet another desktop operating system  An attempt to be a large-scale server operating system  A real-time OS, but it might become one in the future  A OS running on NOMMU machines, but (probably) will in the future  Ready for production use: it still lacks features as storage, networking etc. Vladislav K. Valtchev (2022)

Why the binary compatibility with Linux?  It’s cool being
able to test the same “bits” both on Linux and Tilck  Robustness: Tilck can empirically show robustness and correctness by running 3rd party software never written for it  Didn’t want to design a whole new syscall interface from scratch  Didn’t want to implement a whole libc too  Didn’t want to build a custom GCC toolchain. I wanted to use the pre-built toolchains from: https://toolchains.bootlin.com/  Increase the likelihood the project to get more interest from the community?  Porting pre-existing software to Tilck will require a little or no effort at all. Vladislav K. Valtchev (2022)

Core values & goals  Minimal memory footprint  Ultra
low-latency  Deterministic behavior  Extra robustness  Portability  Simplicity  Partial compatibility with Linux  Must work on real (modern) hardware  Exceptional developer experience: building & testing the project should be as easy as technologically possible Vladislav K. Valtchev (2022)

Live demo Because a demo is worth more than a
thousand words Vladislav K. Valtchev (2022)

Funny stories & interesting challenges Vladislav K. Valtchev (2022)

My latest bug [1/6]  I have a test (fork_oom)
that: 1. Estimates the amount of committed memory that can be used 2. Allocates and commits more than half of that 3. Calls fork() 4. In the child, tries to commit all of that memory 5. Expects the child to be killed by the kernel Vladislav K. Valtchev (2022) (and its 2-char fix)

that: 1. Estimates the amount of committed memory that can be used 2. Allocates and commits more than half of that 3. Calls fork() 4. In the child, tries to commit all of that memory 5. Expects the child to be killed by the kernel  I just found that it fails on real HW machines Vladislav K. Valtchev (2022) (and its 2-char fix)

that: 1. Estimates the amount of committed memory that can be used 2. Allocates and commits more than half of that 3. Calls fork() 4. In the child, tries to commit all of that memory 5. Expects the child to be killed by the kernel  I just found that it fails on real HW machines  Quickly, I discovered that it fails on VMs too but only when they have significantly more RAM. That’s weird. Mmm… Vladislav K. Valtchev (2022) (and its 2-char fix)

that: 1. Estimates the amount of committed memory that can be used 2. Allocates and commits more than half of that 3. Calls fork() 4. In the child, tries to commit all of that memory 5. Expects the child to be killed by the kernel  I just found that it fails on real HW machines  Quickly, I discovered that it fails on VMs too but only when they have significantly more RAM. That’s weird. Mmm…  I had to debug that. Vladislav K. Valtchev (2022) (and its 2-char fix)

My latest bug [2/6] Vladislav K. Valtchev (2022) That’s fine…
How could we commit so much memory? 262 MB x 2 = 524 MB > 501 MB [usable] (ehm, we don’t have swap) That means trying to free a page not allocated in the heap, during munmap().

My latest bug [3/6] Vladislav K. Valtchev (2022) So, I
started debugging the CoW page-fault logic… After committing a few MBs in the child, we end up here!

My latest bug [4/6] Vladislav K. Valtchev (2022) I realized
I had ASSERTs disabled in that build! So, after turning them on… Aha, gotcha! You’re really trying to free the zero page!

My latest bug [5/6] Vladislav K. Valtchev (2022) Let’s look
at this limit case… Allocating 255 MB works…

My latest bug [6/6] Vladislav K. Valtchev (2022) That means
only one thing…

only one thing… That’s the problem: a 16-bit ref-count

only one thing… That’s the problem: a 16-bit ref-count It wraps around after 65,535 pages, meaning that the kernel cannot support 256 MB or more of uncommited memory!

Making the framebuffer console fast Vladislav K. Valtchev (2022)

Making the framebuffer console fast  Premise: why implement a
framebuffer console?  Text mode was almost completely dead even 5 years ago  Pure-UEFI machines don’t support text mode  Text mode is a x86 thing: Raspberry PI and other machines don’t support it Vladislav K. Valtchev (2022)

framebuffer console?  Text mode was almost completely dead even 5 years ago  Pure-UEFI machines don’t support text mode  Text mode is a x86 thing: Raspberry PI and other machines don’t support it  Why speed matters so much? Just mark the pages as WC and it will be reasonably fast. Vladislav K. Valtchev (2022)

framebuffer console?  Text mode was almost completely dead even 5 years ago  Pure-UEFI machines don’t support text mode  Text mode is a x86 thing: Raspberry PI and other machines don’t support it  Why speed matters so much? Just mark the pages as WC and it will be reasonably fast.  I didn’t know about WC (write-combining) at the time Vladislav K. Valtchev (2022)

framebuffer console?  Text mode was almost completely dead even 5 years ago  Pure-UEFI machines don’t support text mode  Text mode is a x86 thing: Raspberry PI and other machines don’t support it  Why speed matters so much? Just mark the pages as WC and it will be reasonably fast.  I didn’t know about WC (write-combining) at the time  Therefore, I implemented a series of optimizations before discovering WC Vladislav K. Valtchev (2022)

PSF fonts: a bitfield per each glyph Vladislav K. Valtchev
(2022) 8 bit 8 bit 32 bit 8 bit 16 bit

The simplest draw function (failsafe) Vladislav K. Valtchev (2022)

Performance? Too slow, in particular on the modern machine (left)
Intel Core i7-7500U Kaby Lake  1,124,773 RDTSC cycles / char (avg.) [~385.7 μs] Intel Atom N270 Diamondville (32-bit)  297,287 RDTSC cycles / char (avg.) [~186.3 μs] Vladislav K. Valtchev (2022) 16x8 font, 800x600 32x16 font, 3200x1800  7,416,012 RDTSC cycles / char (avg.) [~2543.2 μs] Scrolling the whole screen takes several seconds!!

A naïve optimization: loop unrolling Vladislav K. Valtchev (2022)

Benefits? Nah. Intel Core i7-7500U Kaby Lake Before (avg.) 385.72
μs / char After (avg.) 384.44 μs / char Speed up 0.3% faster Intel Atom N270 Diamondville (32-bit) Before (avg.) 186.27 μs / char After (avg.) 175.30 μs / char Speed up 6.2% faster Vladislav K. Valtchev (2022) Old school optimizations work better on old school machines!

Intuition 1: rendering glyphs pixel by pixel is too slow
Vladislav K. Valtchev (2022)

Solution 1: pre-rendering!  But… is pre-rendering every glyph in
the font even feasible? Vladislav K. Valtchev (2022) framebuffer

Pre-rendering! (font 16x8) 16 x 8 x 4 x 256
x 16 x 16 = Vladislav K. Valtchev (2022) Height x Width Bytes per pixel # glyphs FG colors BG colors 32 MB: unfeasible!

Pre-rendering! (font 32x16) 32 x 16 x 4 x 256
x 16 x 16 = Vladislav K. Valtchev (2022) Height x Width Bytes per pixel # glyphs FG colors BG colors 128 MB: pure madness!

A better idea: pre-render all the possible 8-bit “scanlines” (=
glyph rows) Vladislav K. Valtchev (2022) 28 x 4 x 8 x 16 x 16 = All scanlines Bytes per pixel FG colors BG colors Scanline length 2 MB Still expensive, but affordable!

It works on 32x16 fonts too! Vladislav K. Valtchev (2022)
8 bit 8 bit Scanline 00000011 Scanline 00111100

The pre-render code Vladislav K. Valtchev (2022)

Intuition 2: copying 4 bytes at a time is too
slow!  Pre-rendering the glyphs or the just the “scanlines” is not enough  The x86 rep movsl instruction copies just 4 bytes (= 1 pixel) at a time Vladislav K. Valtchev (2022)

Solution 2: use the FPU  Introduce something like fpu_memcpy()
 Write a whole row at a time during scrolling  Only this way, we could offset the cost of saving/restoring the FPU registers Vladislav K. Valtchev (2022)

Vladislav K. Valtchev (2022) Flag: during IRQ, we cannot use
the FPU Scanlines for the given FG/BG colors Copy 256 bit (32 bytes) the fastest way possible Jump to the same address during the whole loop

The FPU code [1/2] Vladislav K. Valtchev (2022)

The FPU code [2/2] Vladislav K. Valtchev (2022)

The moment of truth Vladislav K. Valtchev (2022) Core i7-7500U
Kaby Lake, AVX 2, 256-bit regs Before (avg.) 385.72 μs / char After (avg.) 67.42 μs / char Speed up 5.72x faster Atom N270 Diamondville (32-bit), SSSE 3, 128-bit regs Before (avg.) 186.27 μs / char After (avg.) 94.82 μs / char Speed up 1.96x faster Font 16x8, resolution 800x600, default memory type*, not WC Not bad at all! Smaller impact, but smaller regs here * Typically that means UC (uncacheable) set through MTRRs

The moment of truth Vladislav K. Valtchev (2022) Core i7-7500U
Kaby Lake, AVX 2, 256-bit regs Before (avg.) 385.72 μs / char After (avg.) 67.42 μs / char Speed up 5.72x faster Atom N270 Diamondville (32-bit), SSSE 3, 128-bit regs Before (avg.) 186.27 μs / char After (avg.) 94.82 μs / char Speed up 1.96x faster Font 16x8, resolution 800x600, default memory type*, not WC * Typically that means UC (uncacheable) set through MTRRs Font 32x16, resolution 3200x1800, default memory type*, not WC Before (avg.) 2543.21 μs / char After (avg.) 371.54 μs / char Speed up 6.84x faster Wow, that’s close to the max 8x improvement! (From 32 bit/write to 256 bit/write) Still not fast enough, though

The writing combining memory type (WC)  Allows data to
combined, temporarily stored in a buffer (WCB) and then released in burst mode  Cannot be used most of the time because offers just weak ordering  Can be set using PAT or MTRRs  It’s perfect for frame buffers Vladislav K. Valtchev (2022)

Performance: the full picture [modern machine] Vladislav K. Valtchev (2022)
32.9x faster! Just 12.5% faster Intel Core i7-7500U Kaby Lake (AVX 2, 256-bit fpu regs) Font 16x8, resolution 800x600, 32 bbp

Performance: the full picture [older machine] Vladislav K. Valtchev (2022)
Intel Atom N270 Diamondville (32-bit, SSSE 3, 128 bit fpu regs) Font 16x8, resolution 800x600, 32 bbp 2.04x faster No difference at all!

Performance on native res [modern machine] Vladislav K. Valtchev (2022)
101.26x faster! 2.63x faster Intel Core i7-7500U Kaby Lake (AVX 2, 256-bit fpu regs) Font 32x16, resolution 3200x1800, 32 bbp Not bad! 6.84x faster

Performance vs Linux [modern machine] Vladislav K. Valtchev (2022) Font
32x16, resolution 3200x1800, 32 bbp Kernel 5.4.0 (Ubuntu 20.04.4 LTS) Commit a858f229, release build

Performance vs Linux [modern machine]  9.55 μs / char
Vladislav K. Valtchev (2022) Font 32x16, resolution 3200x1800, 32 bbp Kernel 5.4.0 (Ubuntu 20.04.4 LTS) Commit a858f229, release build

 56.40 μs / char Vladislav K. Valtchev (2022) Font 32x16, resolution 3200x1800, 32 bbp Kernel 5.4.0 (Ubuntu 20.04.4 LTS) Commit a858f229, release build

 56.40 μs / char Vladislav K. Valtchev (2022) Font 32x16, resolution 3200x1800, 32 bbp Kernel 5.4.0 (Ubuntu 20.04 LTS) Commit a858f229, release build 5.9x faster!

Performance vs Linux [modern machine] Vladislav K. Valtchev (2022) Font
32x16, resolution 3200x1800, 32 bbp Kernel 5.4.0 (Ubuntu 20.04 LTS) Commit a858f229, release build 56.40 μs – Linux 5.4.0 25.09 μs – Tilck failsafe + WC 9.55 μs – Tilck’s best OPT ?

The benchmark code

Making libmusl applications to work Vladislav K. Valtchev (2022)

Why libmusl?  It made no sense to write a
custom libc.  Libmusl produces the smallest binaries (~13 KB for “hello world”)  It’s actively maintained and widely used in the Embedded Linux world  It’s supported by https://toolchains.bootlin.com/  Uclibc-ng is more customizable but:  Typically produces larger binaries  Using a pre-built toolchain means no customization anyway  Dietlibc is not well-maintained and has no pre-built toolchains Vladislav K. Valtchev (2022)

Libmusl requires TLS support Vladislav K. Valtchev (2022)  TLS
requires set_thread_area()

Libmusl requires TLS support Vladislav K. Valtchev (2022)  TLS
requires set_thread_area()  Can we cheat by returning –ENOSYS ? ☺

Sometimes cheating works Vladislav K. Valtchev (2022)

Sometimes cheating works  Sometimes it doesn’t. Vladislav K. Valtchev
(2022)

Sometimes cheating works  Sometimes it doesn’t.  Can we
try returning 0 instead and see what happens? Vladislav K. Valtchev (2022)

In EDX we’re supposed to have now the entry number
in the GDT. Clearly -1 is invalid. So now we got an invalid selector now in EDX

And, of course, here we get a GPF

What if we returned 0 and set a valid GDT
entry number in user_desc, without doing anything else? Vladislav K. Valtchev (2022)

Now EDX contains a valid GDT selector, 0x23, already used
for userspace data

We passed __init_tls(aux)!!

We reached main()!!

Ehm.. I don’t believe we’re going to pass that far
indirect call…

Yep, page fault.

Vaddr is clearly just 0x10 because the GDT selector 0x23
has offset = 0 (flat segmentation)

Lesson learned  Often, we cannot cheat. Vladislav K. Valtchev
(2022)

Lesson learned  Often, we cannot cheat.  Even basic
I/O functions use TLS variables. Vladislav K. Valtchev (2022)

Lesson learned  Often, we cannot cheat.  Even basic
I/O functions use TLS variables.  Had to provide a fully-functional implementation for set_thread_area(), in order run even single- threaded libmusl applications. Vladislav K. Valtchev (2022)

That was quite some code, but it’s not enough. We
need a ref-count for GDT entries as well. Why? Think about fork(). What happens if the parent dies before the child and we free the GDT slots?

ACPICA & AcpiOsWaitSemaphore()  ACPICA requires the OSL to provide
a counting semaphore implementation capable of waiting and signaling N units.  That is weird requirement.  It could be trivially implemented on the top of a regular counting semaphore, but that would be extremely inefficient.  I implemented such a semaphore in Tilck. Vladislav K. Valtchev (2022)

Classic semaphore New semaphore [1/2] Vladislav K. Valtchev (2022)

Classic semaphore New semaphore [2/2] Vladislav K. Valtchev (2022)

But.. how Linux did implement the counting semaphore to make
ACPICA happy? Vladislav K. Valtchev (2022)

ACPICA happy? It didn’t ☺ Vladislav K. Valtchev (2022)

ACPICA happy? Vladislav K. Valtchev (2022)

ACPICA happy? Vladislav K. Valtchev (2022) Sometimes cheating works.

Thank you! https://github.com/vvaltchev/tilck Vladislav K. Valtchev (2022)

Developing Tilck, a Tiny Linux-compatible kernel

Developing Tilck, a Tiny Linux-compatible kernel

More Decks by Kernel Recipes

Featured

Transcript