$30 off During Our Annual Pro Sale. View Details »

Developing Tilck, a Tiny Linux-compatible kernel

Kernel Recipes
June 09, 2024
13

Developing Tilck, a Tiny Linux-compatible kernel

One of the ways to get a solid understanding of operating systems is to write one, from scratch. This is what Vlad has been doing in his free time since 2016. It all started as an experiment to see what would it take to write an OS that could just “run a shell” and learn a bit of x86 hardware, kernels and bootloaders in the meanwhile. A few months later, he decided that it would be super cool if his toy-kernel could run natively i686-linux programs and that opened Pandora’s box: being binary-compatible with Linux and running non-trivial 3rd-party apps introduced plenty of unexpected challenges. Fast forward a few years, Tilck can now boot on both BIOS and UEFI machines (using its own bootloader or GRUB), run Busybox, Vim, fbDOOM, Micropython and several other console applications compiled for i686-linux with a gcc-musl toolchain.

This talk is about the story of how the Tilck kernel and its bootloader went from “Hello world” to where it is today, with focus on some of the most interesting challenges and subtle bugs its author had to go through.

Vlasdislav K. Valtchev, VmWare

Kernel Recipes

June 09, 2024
Tweet

Transcript

  1. What Tilck is  A project consisting on:  A

    monolithic kernel written in C and assembly  A bootloader working both on UEFI and legacy BIOS systems  Several test suites and a powerful CMake-based build system  Buildroot-like scripts for downloading & building 3rd party software  Partially compatible with Linux at binary level  Uniprocessor, but fully preemptable  Educational, with potential to be more than that (see testing etc.)  Runs only on i686, at the moment (will be ported to ARM, RISC-V etc.)  Open source, distributed under the BSD 2-clause license Vladislav K. Valtchev (2022)
  2. What Tilck is NOT  An attempt to replace Linux

     An attempt to be yet another desktop operating system  An attempt to be a large-scale server operating system  A real-time OS, but it might become one in the future  A OS running on NOMMU machines, but (probably) will in the future  Ready for production use: it still lacks features as storage, networking etc. Vladislav K. Valtchev (2022)
  3. Why the binary compatibility with Linux?  It’s cool being

    able to test the same “bits” both on Linux and Tilck  Robustness: Tilck can empirically show robustness and correctness by running 3rd party software never written for it  Didn’t want to design a whole new syscall interface from scratch  Didn’t want to implement a whole libc too  Didn’t want to build a custom GCC toolchain. I wanted to use the pre-built toolchains from: https://toolchains.bootlin.com/  Increase the likelihood the project to get more interest from the community?  Porting pre-existing software to Tilck will require a little or no effort at all. Vladislav K. Valtchev (2022)
  4. Core values & goals  Minimal memory footprint  Ultra

    low-latency  Deterministic behavior  Extra robustness  Portability  Simplicity  Partial compatibility with Linux  Must work on real (modern) hardware  Exceptional developer experience: building & testing the project should be as easy as technologically possible Vladislav K. Valtchev (2022)
  5. Live demo Because a demo is worth more than a

    thousand words Vladislav K. Valtchev (2022)
  6. My latest bug [1/6]  I have a test (fork_oom)

    that: 1. Estimates the amount of committed memory that can be used 2. Allocates and commits more than half of that 3. Calls fork() 4. In the child, tries to commit all of that memory 5. Expects the child to be killed by the kernel Vladislav K. Valtchev (2022) (and its 2-char fix)
  7. My latest bug [1/6]  I have a test (fork_oom)

    that: 1. Estimates the amount of committed memory that can be used 2. Allocates and commits more than half of that 3. Calls fork() 4. In the child, tries to commit all of that memory 5. Expects the child to be killed by the kernel  I just found that it fails on real HW machines Vladislav K. Valtchev (2022) (and its 2-char fix)
  8. My latest bug [1/6]  I have a test (fork_oom)

    that: 1. Estimates the amount of committed memory that can be used 2. Allocates and commits more than half of that 3. Calls fork() 4. In the child, tries to commit all of that memory 5. Expects the child to be killed by the kernel  I just found that it fails on real HW machines  Quickly, I discovered that it fails on VMs too but only when they have significantly more RAM. That’s weird. Mmm… Vladislav K. Valtchev (2022) (and its 2-char fix)
  9. My latest bug [1/6]  I have a test (fork_oom)

    that: 1. Estimates the amount of committed memory that can be used 2. Allocates and commits more than half of that 3. Calls fork() 4. In the child, tries to commit all of that memory 5. Expects the child to be killed by the kernel  I just found that it fails on real HW machines  Quickly, I discovered that it fails on VMs too but only when they have significantly more RAM. That’s weird. Mmm…  I had to debug that. Vladislav K. Valtchev (2022) (and its 2-char fix)
  10. My latest bug [2/6] Vladislav K. Valtchev (2022) That’s fine…

    How could we commit so much memory? 262 MB x 2 = 524 MB > 501 MB [usable] (ehm, we don’t have swap) That means trying to free a page not allocated in the heap, during munmap().
  11. My latest bug [3/6] Vladislav K. Valtchev (2022) So, I

    started debugging the CoW page-fault logic… After committing a few MBs in the child, we end up here!
  12. My latest bug [4/6] Vladislav K. Valtchev (2022) I realized

    I had ASSERTs disabled in that build! So, after turning them on… Aha, gotcha! You’re really trying to free the zero page!
  13. My latest bug [5/6] Vladislav K. Valtchev (2022) Let’s look

    at this limit case… Allocating 255 MB works…
  14. My latest bug [6/6] Vladislav K. Valtchev (2022) That means

    only one thing… That’s the problem: a 16-bit ref-count
  15. My latest bug [6/6] Vladislav K. Valtchev (2022) That means

    only one thing… That’s the problem: a 16-bit ref-count It wraps around after 65,535 pages, meaning that the kernel cannot support 256 MB or more of uncommited memory!
  16. Making the framebuffer console fast  Premise: why implement a

    framebuffer console?  Text mode was almost completely dead even 5 years ago  Pure-UEFI machines don’t support text mode  Text mode is a x86 thing: Raspberry PI and other machines don’t support it Vladislav K. Valtchev (2022)
  17. Making the framebuffer console fast  Premise: why implement a

    framebuffer console?  Text mode was almost completely dead even 5 years ago  Pure-UEFI machines don’t support text mode  Text mode is a x86 thing: Raspberry PI and other machines don’t support it  Why speed matters so much? Just mark the pages as WC and it will be reasonably fast. Vladislav K. Valtchev (2022)
  18. Making the framebuffer console fast  Premise: why implement a

    framebuffer console?  Text mode was almost completely dead even 5 years ago  Pure-UEFI machines don’t support text mode  Text mode is a x86 thing: Raspberry PI and other machines don’t support it  Why speed matters so much? Just mark the pages as WC and it will be reasonably fast.  I didn’t know about WC (write-combining) at the time Vladislav K. Valtchev (2022)
  19. Making the framebuffer console fast  Premise: why implement a

    framebuffer console?  Text mode was almost completely dead even 5 years ago  Pure-UEFI machines don’t support text mode  Text mode is a x86 thing: Raspberry PI and other machines don’t support it  Why speed matters so much? Just mark the pages as WC and it will be reasonably fast.  I didn’t know about WC (write-combining) at the time  Therefore, I implemented a series of optimizations before discovering WC Vladislav K. Valtchev (2022)
  20. PSF fonts: a bitfield per each glyph Vladislav K. Valtchev

    (2022) 8 bit 8 bit 32 bit 8 bit 16 bit
  21. Performance? Too slow, in particular on the modern machine (left)

    Intel Core i7-7500U Kaby Lake  1,124,773 RDTSC cycles / char (avg.) [~385.7 μs] Intel Atom N270 Diamondville (32-bit)  297,287 RDTSC cycles / char (avg.) [~186.3 μs] Vladislav K. Valtchev (2022) 16x8 font, 800x600 32x16 font, 3200x1800  7,416,012 RDTSC cycles / char (avg.) [~2543.2 μs] Scrolling the whole screen takes several seconds!!
  22. Benefits? Nah. Intel Core i7-7500U Kaby Lake Before (avg.) 385.72

    μs / char After (avg.) 384.44 μs / char Speed up 0.3% faster Intel Atom N270 Diamondville (32-bit) Before (avg.) 186.27 μs / char After (avg.) 175.30 μs / char Speed up 6.2% faster Vladislav K. Valtchev (2022) Old school optimizations work better on old school machines!
  23. Solution 1: pre-rendering!  But… is pre-rendering every glyph in

    the font even feasible? Vladislav K. Valtchev (2022) framebuffer
  24. Pre-rendering! (font 16x8) 16 x 8 x 4 x 256

    x 16 x 16 = Vladislav K. Valtchev (2022) Height x Width Bytes per pixel # glyphs FG colors BG colors 32 MB: unfeasible!
  25. Pre-rendering! (font 32x16) 32 x 16 x 4 x 256

    x 16 x 16 = Vladislav K. Valtchev (2022) Height x Width Bytes per pixel # glyphs FG colors BG colors 128 MB: pure madness!
  26. A better idea: pre-render all the possible 8-bit “scanlines” (=

    glyph rows) Vladislav K. Valtchev (2022) 28 x 4 x 8 x 16 x 16 = All scanlines Bytes per pixel FG colors BG colors Scanline length 2 MB Still expensive, but affordable!
  27. It works on 32x16 fonts too! Vladislav K. Valtchev (2022)

    8 bit 8 bit Scanline 00000011 Scanline 00111100
  28. Intuition 2: copying 4 bytes at a time is too

    slow!  Pre-rendering the glyphs or the just the “scanlines” is not enough  The x86 rep movsl instruction copies just 4 bytes (= 1 pixel) at a time Vladislav K. Valtchev (2022)
  29. Solution 2: use the FPU  Introduce something like fpu_memcpy()

     Write a whole row at a time during scrolling  Only this way, we could offset the cost of saving/restoring the FPU registers Vladislav K. Valtchev (2022)
  30. Vladislav K. Valtchev (2022) Flag: during IRQ, we cannot use

    the FPU Scanlines for the given FG/BG colors Copy 256 bit (32 bytes) the fastest way possible Jump to the same address during the whole loop
  31. The moment of truth Vladislav K. Valtchev (2022) Core i7-7500U

    Kaby Lake, AVX 2, 256-bit regs Before (avg.) 385.72 μs / char After (avg.) 67.42 μs / char Speed up 5.72x faster Atom N270 Diamondville (32-bit), SSSE 3, 128-bit regs Before (avg.) 186.27 μs / char After (avg.) 94.82 μs / char Speed up 1.96x faster Font 16x8, resolution 800x600, default memory type*, not WC Not bad at all! Smaller impact, but smaller regs here * Typically that means UC (uncacheable) set through MTRRs
  32. The moment of truth Vladislav K. Valtchev (2022) Core i7-7500U

    Kaby Lake, AVX 2, 256-bit regs Before (avg.) 385.72 μs / char After (avg.) 67.42 μs / char Speed up 5.72x faster Atom N270 Diamondville (32-bit), SSSE 3, 128-bit regs Before (avg.) 186.27 μs / char After (avg.) 94.82 μs / char Speed up 1.96x faster Font 16x8, resolution 800x600, default memory type*, not WC * Typically that means UC (uncacheable) set through MTRRs Font 32x16, resolution 3200x1800, default memory type*, not WC Before (avg.) 2543.21 μs / char After (avg.) 371.54 μs / char Speed up 6.84x faster Wow, that’s close to the max 8x improvement! (From 32 bit/write to 256 bit/write) Still not fast enough, though
  33. The writing combining memory type (WC)  Allows data to

    combined, temporarily stored in a buffer (WCB) and then released in burst mode  Cannot be used most of the time because offers just weak ordering  Can be set using PAT or MTRRs  It’s perfect for frame buffers Vladislav K. Valtchev (2022)
  34. Performance: the full picture [modern machine] Vladislav K. Valtchev (2022)

    32.9x faster! Just 12.5% faster Intel Core i7-7500U Kaby Lake (AVX 2, 256-bit fpu regs) Font 16x8, resolution 800x600, 32 bbp
  35. Performance: the full picture [older machine] Vladislav K. Valtchev (2022)

    Intel Atom N270 Diamondville (32-bit, SSSE 3, 128 bit fpu regs) Font 16x8, resolution 800x600, 32 bbp 2.04x faster No difference at all!
  36. Performance on native res [modern machine] Vladislav K. Valtchev (2022)

    101.26x faster! 2.63x faster Intel Core i7-7500U Kaby Lake (AVX 2, 256-bit fpu regs) Font 32x16, resolution 3200x1800, 32 bbp Not bad! 6.84x faster
  37. Performance vs Linux [modern machine] Vladislav K. Valtchev (2022) Font

    32x16, resolution 3200x1800, 32 bbp Kernel 5.4.0 (Ubuntu 20.04.4 LTS) Commit a858f229, release build
  38. Performance vs Linux [modern machine]  9.55 μs / char

    Vladislav K. Valtchev (2022) Font 32x16, resolution 3200x1800, 32 bbp Kernel 5.4.0 (Ubuntu 20.04.4 LTS) Commit a858f229, release build
  39. Performance vs Linux [modern machine]  9.55 μs / char

     56.40 μs / char Vladislav K. Valtchev (2022) Font 32x16, resolution 3200x1800, 32 bbp Kernel 5.4.0 (Ubuntu 20.04.4 LTS) Commit a858f229, release build
  40. Performance vs Linux [modern machine]  9.55 μs / char

     56.40 μs / char Vladislav K. Valtchev (2022) Font 32x16, resolution 3200x1800, 32 bbp Kernel 5.4.0 (Ubuntu 20.04 LTS) Commit a858f229, release build 5.9x faster!
  41. Performance vs Linux [modern machine] Vladislav K. Valtchev (2022) Font

    32x16, resolution 3200x1800, 32 bbp Kernel 5.4.0 (Ubuntu 20.04 LTS) Commit a858f229, release build 56.40 μs – Linux 5.4.0 25.09 μs – Tilck failsafe + WC 9.55 μs – Tilck’s best OPT ?
  42. Why libmusl?  It made no sense to write a

    custom libc.  Libmusl produces the smallest binaries (~13 KB for “hello world”)  It’s actively maintained and widely used in the Embedded Linux world  It’s supported by https://toolchains.bootlin.com/  Uclibc-ng is more customizable but:  Typically produces larger binaries  Using a pre-built toolchain means no customization anyway  Dietlibc is not well-maintained and has no pre-built toolchains Vladislav K. Valtchev (2022)
  43. Libmusl requires TLS support Vladislav K. Valtchev (2022)  TLS

    requires set_thread_area()  Can we cheat by returning –ENOSYS ? ☺
  44. Sometimes cheating works  Sometimes it doesn’t.  Can we

    try returning 0 instead and see what happens? Vladislav K. Valtchev (2022)
  45. In EDX we’re supposed to have now the entry number

    in the GDT. Clearly -1 is invalid. So now we got an invalid selector now in EDX
  46. What if we returned 0 and set a valid GDT

    entry number in user_desc, without doing anything else? Vladislav K. Valtchev (2022)
  47. Lesson learned  Often, we cannot cheat.  Even basic

    I/O functions use TLS variables. Vladislav K. Valtchev (2022)
  48. Lesson learned  Often, we cannot cheat.  Even basic

    I/O functions use TLS variables.  Had to provide a fully-functional implementation for set_thread_area(), in order run even single- threaded libmusl applications. Vladislav K. Valtchev (2022)
  49. That was quite some code, but it’s not enough. We

    need a ref-count for GDT entries as well. Why? Think about fork(). What happens if the parent dies before the child and we free the GDT slots?
  50. ACPICA & AcpiOsWaitSemaphore()  ACPICA requires the OSL to provide

    a counting semaphore implementation capable of waiting and signaling N units.  That is weird requirement.  It could be trivially implemented on the top of a regular counting semaphore, but that would be extremely inefficient.  I implemented such a semaphore in Tilck. Vladislav K. Valtchev (2022)
  51. But.. how Linux did implement the counting semaphore to make

    ACPICA happy? Vladislav K. Valtchev (2022)
  52. But.. how Linux did implement the counting semaphore to make

    ACPICA happy? It didn’t ☺ Vladislav K. Valtchev (2022)
  53. But.. how Linux did implement the counting semaphore to make

    ACPICA happy? Vladislav K. Valtchev (2022)
  54. But.. how Linux did implement the counting semaphore to make

    ACPICA happy? Vladislav K. Valtchev (2022) Sometimes cheating works.