Fearsome File Formats

Presented at 38C3 in Hamburg on the 28th December 2024.

Video recording: https://media.ccc.de/v/38c3-fearsome-file-formats

With so many open-source parsers being tested and fuzzed, and widely available specs,
what could go wrong with file formats nowadays? Nothing to fear, right?

Let's explore even darker corners of their landscape!
Even extreme simplicity can be misleading and lead to unexpected challenges.
And at the other end of the spectrum, new complex constructs have appeared over the years:
near-polyglots, timecryption, hashquines… Even AI is an element of the game now.

Let's play FileCraft, and enjoy the ride!

Ange Albertini

December 28, 2024

Transcript

  1. About the author: - Looking at hex editors for 35 years. - Malware analyst for 20 years: Symantec, Avira, Google. - Corkami: posters, PoCs, tools, tutorials (15k ⭐). - CPS2Shock, PoC||GTFO…

    github / angea / pocorgtfo, cps2shock.emu-france.info. My own views and opinions. Pixel art by Squiblydoo (2023).
  2. Can an empty file be useful? Besides: - crashing code in production, - stopping malware installation, - shutting down botnets, …

    Can one find purpose in emptiness? 🤔
  3. /bin/true used to be empty: an empty shell script, standard in every system. It always works, and saves space. It even became copyrighted despite its empty payload. Nowadays, /bin/true is an ELF binary. VOID IS WIN

    $ touch test
    $ chmod +x test
    $ ./test
    $ echo $?
    0
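The shell session above can be reproduced programmatically. This is a minimal sketch, assuming a POSIX `sh` is available on the PATH; the helper name `run_empty_script` is made up for illustration. The file is handed to `sh` directly, so the result does not depend on the execute bit or on how the calling shell reacts when the kernel refuses to execute a zero-byte file.

```python
import subprocess
import tempfile
from pathlib import Path


def run_empty_script() -> int:
    """Create a zero-byte file, let `sh` interpret it, return the exit status."""
    with tempfile.TemporaryDirectory() as tmp:
        empty = Path(tmp) / "test"
        empty.touch()  # same effect as `touch test`
        assert empty.stat().st_size == 0
        # An empty script has no command to fail, so the shell exits with 0.
        return subprocess.run(["sh", str(empty)]).returncode


if __name__ == "__main__":
    print(run_empty_script())
```

An exit status of 0 is exactly what a caller of `true` expects, which is why an empty file could play that role.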
  4. In Doom WADs, empty files are used as map indexes in the archive table: E1M1, … VOID IS HERE
  5. Under IBM PC-DOS 1.0 and CP/M, launching an empty file will just re-run the last one: the memory wasn't cleared between executions. MS-DOS 1.25 added a size check in 1982. VOID IS LAST

    The IBM Personal Computer DOS
    Version 1.00 (C)Copyright IBM Corp 1981
    A>DEBUG EMPTY.COM
    File not found
    -w
    -q
    A>DIR EMPTY.COM
    EMPTY COM 0 01-01-80
    A>_

    A>DIR TIME.COM
    TIME COM 250 08-04-81
    A>TIME
    Current time is 18:24:41.81
    Enter new time:
    A>DIR EMPTY.COM
    EMPTY COM 0 01-01-80
    A>EMPTY
    Current time is 18:24:53.27
    Enter new time:
    A>_

    CP/M 2.2 - Amstrad Consumer Electronics plc
    A>ED EMPTY.COM
    NEW FILE
    : *e
    A>STAT EMPTY.COM
    Recs Bytes Ext Acc
    0 0k 1 R/W A:EMPTY.COM
    Bytes Remaining On A: 5k
    A>EMPTY EMPTY.COM
    Recs Bytes Ext Acc
    0 0k 1 R/W A:EMPTY.COM
    Bytes Remaining On A: 5k
    A>█
  6. So the empty file is…🤯 - A standard system shell script that always executes successfully. - An index in Doom archives. - A commercial (!) DOS executable that repeats the last command. (among possibly many other things)
  7. The type, context and purpose are already unclear. A file is more than its contents: context and metadata can be critical. No content, and yet…
  8. What are 'files'? Let's look at something else… New Game / Continue
  9. When storage was too expensive, games relied on long passwords to save your data! Some aren't even in text!

    Examples: Narpas sword0, GAMETEAM. Oh No! More Lemmings! Level 11-Crazy: No Problemming! Password: LCAMTUFPBR
  10. Saving games in 1986: hardcoded offsets in SRAM. 00-09: 0-9 10-24: A-Z

    00: 16 19 03 .$ .$ .$ .$ .$
    08: 19 17 10 .$ .$ .$ .$ .$
    10: 19 0D 0F .$ .$ .$ .$ .$
    18: 00 00 00 00 00 00 00 00
    20: 00 00 00 00 00 00 00 00
    28: 00 00 00 00 00 00 00 00
    30: 22 FF 00 00 00 00 00 00
    38: 00 00 00 00 00 08 00 00
    40: 00 00 00 00 00 00 00 00
    48: 00 00 00 00 00 00 00 00
    50: 00 00 00 00 00 00 00 00
    58: 22 FF 00 00 00 00 00 00
    60: 00 00 00 00 00 08 00 00
    68: 00 00 00 00 00 00 00 00
    70: 00 00 00 00 00 00 00 00
    78: 00 00 00 00 00 00 00 00
    80: 22 FF 00 00 00 00 00 00
    88: 00 00 00 00 00 08 00 00
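The three name slots at the top of the dump can be decoded with a small lookup. This is a sketch under assumptions that are not stated on the slide: digits occupy values 0 to 9, letters start at decimal 10 (0x0A), and 0x24 (shown as `.$`) is blank padding. The function name `decode_name` is hypothetical.

```python
def decode_name(raw: bytes) -> str:
    """Decode one 8-byte name slot (mapping assumed, see lead-in)."""
    chars = []
    for value in raw:
        if value <= 0x09:
            chars.append(chr(ord("0") + value))         # 0x00-0x09 -> '0'-'9'
        elif value <= 0x23:
            chars.append(chr(ord("A") + value - 0x0A))  # 0x0A-0x23 -> 'A'-'Z'
        elif value == 0x24:
            chars.append(" ")                           # assumed padding
        else:
            chars.append("?")                           # outside the assumed table
    return "".join(chars).rstrip()


# The three slots at offsets 0x00, 0x08 and 0x10 of the dump above.
SLOTS = [
    bytes.fromhex("1619032424242424"),
    bytes.fromhex("1917102424242424"),
    bytes.fromhex("190D0F2424242424"),
]

if __name__ == "__main__":
    for slot in SLOTS:
        print(decode_name(slot))
```

Under that mapping the three save slots read MP3, PNG and PDF, which suggests the assumption matches the data shown.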
  11. From “player” to “file”. 1986: The Legend of Zelda. 1987: The Adventure of Link. 1991: A Link to the Past. 1993: Link's Awakening. 1998: Ocarina of Time. 2001: Oracles of... 2004: The Minish Cap.
  12. What's a file without a format? What does that even mean? Why would you do that?
  13. …didn't use a file format! The whole memory page was saved as a file… …with whatever else in memory! Who needs standardization when you're just on your own? It was just faster to snapshot the memory range. HELLOW.DOC (512 bytes for 12 bytes of text!) A sign of the times!

    0000: 31 BE 00 00 00 AB 00 00 00 00 00 00 00 00 8C 00
    0010: 00 00 03 00 04 00 04 00 04 00 04 00 04 00 4E 4F  NO
    0020: 52 4D 41 4C 2E 53 54 59 00 00 00 00 00 00 00 00  RMAL.STY
    0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0080: 48 65 6C 6C 6F 20 57 6F 72 6C 64 21 64 2E 3E 00  Hello World!d.>
    0090: 80 00 46 80 76 61 72 69 61 6E 74 3A 20 20 63 68  Ç FÇvariant:  ch
    00A0: 6F 6F 73 65 20 61 20 6C 65 74 74 65 72 20 6F 72  oose a letter or
    00B0: 20 6E 75 6D 62 65 72 20 74 6F 20 69 64 65 6E 74   number to ident
    00C0: 69 66 79 20 74 68 69 73 20 73 74 79 6C 65 20 61  ify this style a
    00D0: 73 20 61 20 75 6E 69 71 75 65 46 44 82 76 61 72  s a uniqueFDévar
    00E0: 69 61 74 69 6F 6E 20 6F 66 20 75 73 61 67 65 20  iation of usage.
    00F0: 6E 61 6D 65 2E 20 50 72 65 73 73 20 61 20 64 69  name. Press a di
    0100: 80 00 00 00 8C 00 00 00 FF FF 00 00 00 00 00 00
    0110: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0120: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0130: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 16 F6
    0140: 03 00 F0 07 00 00 05 00 EA F6 00 00 22 AE 01 00
    0150: 17 00 C2 9C 00 00 80 00 F2 F6 06 80 03 04 00 00
    0160: 80 00 80 00 FF 00 17 00 2C 9C 00 00 0C F7 03 00
    0170: 32 05 34 03 17 00 16 00 18 00 C2 9C C2 9C 06 01
    0180: 80 00 00 00 8D 00 00 00 FF FF 00 00 00 00 00 00
    0190: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    01A0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    01B0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 16 F6
    01C0: 03 00 F0 07 00 00 05 00 EA F6 00 00 22 AE 01 00
    01D0: 17 00 C2 9C 00 00 80 00 F2 F6 06 80 03 04 00 00
    01E0: 80 00 80 00 FF 00 17 00 2C 9C 00 00 0C F7 03 00
    01F0: 32 05 34 03 17 00 16 00 18 00 C2 9C C2 9C 06 01
  14. … "What a mess!" How do you reliably handle such files?

    mustangstromboneheadlinefeedbackhandrailroadsideshowdownturnoverbookcaseworkshop
  15. File format landscape 101: Full of nasty surprises, exceptions, oddities, for historical or technical reasons. We need to preserve file formats in a better way… What's the MAME of file formats? 🤔
  16. Some things haven't changed… Ambiguous files (aka werewolves aka parser differentials aka schizophrenic files) are still there. No reference parser, no test corpus. Expensive specifications? -> devs don't pay for them! And often, no real/serious specifications. A simple example…
  17. "Ange": How do you pronounce this name? Anje, Enn-ji, An-gé, Anzu (杏), Enn-ré, Ąż… Male or female? How many names are 'unpronounceable'? Without references, things quickly get messy.
  18. CPIO (1977): Concatenation still works! A duplicate file entry in a CPIO archive was used to hack cars over the air, in 2024.
  19. Polyglots, a.k.a. multi-type / chameleon files: 1. Concatenation (appended data) 2. Parasite (comment) 3. Zipper (mutual comments) 4. Chimera (shared data)
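The first strategy, concatenation, can be sketched with the standard library alone. The example below is an illustration, not how Mitra does it: a shell script that ends with `exit 0` is the host, and a ZIP archive is appended to it. The shell stops reading at `exit`, while ZIP readers locate the archive from the end of the file. The helper names and the file name `hello.txt` are made up for the example.

```python
import io
import zipfile

# Host: a shell script that exits before the appended bytes are reached.
SHELL_HOST = b"#!/bin/sh\necho hello from the script\nexit 0\n"


def make_zip(name: str, payload: bytes) -> bytes:
    """Build a small in-memory ZIP holding a single member."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(name, payload)
    return buffer.getvalue()


def concatenate(host: bytes, appended: bytes) -> bytes:
    """Concatenation polyglot: host bytes first, second file appended."""
    return host + appended


def read_member(polyglot: bytes, name: str) -> bytes:
    """Open the polyglot as a ZIP; the reader tolerates the prepended script."""
    with zipfile.ZipFile(io.BytesIO(polyglot)) as archive:
        return archive.read(name)


if __name__ == "__main__":
    blob = concatenate(SHELL_HOST, make_zip("hello.txt", b"hello from the zip"))
    print(read_member(blob, "hello.txt"))
```

This works because the two formats anchor on opposite ends of the file: the script is read from offset zero, the ZIP from its end-of-central-directory record.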
  20. ClickMe (.PDF.EXE.HTM.DCM.RAR.ISO.7Z.APK.SMC)

    > unrar v clickme1.pdf.exe.htm.dcm.rar.iso.7z.apk.smc
    UNRAR 5.40 beta 2 x64 freeware Copyright (c) 1993-2016 Alexander Roshal
    Archive: clickme1.pdf.exe.htm.dcm.rar.iso.7z.apk.smc
    Details: RAR 4, SFX
    Attributes Size Packed Ratio Date Time Checksum Name
    ----------- --------- -------- ----- ---------- ----- -------- ----
    ..A.... 4 4 100% 2020-01-18 19:08 982134A1 rar4.txt
    ----------- --------- -------- ----- ---------- ----- -------- ----
    4 4 100% 1

    >clickme1.pdf.exe.htm.dcm.rar.iso.7z.apk.smc.exe
    32-bit PE
  21. Mitra (https://github.com/corkami/mitra): named after Mithridates (a famous polyglot). Identify file types, make space, combine and adjust data. It should keep the files valid: no deep parsing, just the minimum.

    $ mitra.py dicom.dcm png.png
    dicom.dcm
    File 1: DICOM / Digital Imaging and Communications in Medicine
    png.png
    File 2: PNG / Portable Network Graphics
    Zipper Success!
    Zipper: interleaving of File1 (type DCM) and File2 (type PNG)
  22. Generating weird files (diagram labels): Polymocks (ID bypass), Wrappend, Normalize, Embedding, Collisions, Near polyglots (AngeCryption, TimeCryption), Ambiguity, Sequences (train), Stacked boxes, Pointers (book), Concatenation, Formats features, Tricks, Cavity, Parasite, Start offset, Appended data, Magic, Formats structures, Combination strategies, Polyglots (type bypass), Abuses, Chains (towed boats), Cavity, Parasite, Zipper, Mitra.
  23. Embedding payloads: a valid PNG file with a working JavaScript payload.

    $ mitra.py png.png script.js -f
    png.png
    File 1: PNG / Portable Network Graphics
    script.js
    File 2: binary blob
    Stack: concatenation of File1 (type PNG) and File2 (type BIN)
    Parasite: hosting of File2 (type BIN) in File1 (type PNG)

    Resulting file, chunk by chunk (offsets 000-1A0):
    89 P N G \r \n ^Z \n
    00 00 01 38 c O M M [parasite code, listed below] 2E DA DC 65
    00 00 00 0D I H D R 00 00 00 0D 00 00 00 07 01 03 00 00 00 E9 BE 55 59
    00 00 00 06 P L T E FF FF FF 00 00 00 55 C2 D3 7E
    00 00 00 1B I D A T 08 1D 63 00 82 54 03 86 70 07 86 F4 02 06 F7 00 06 57 03 06 06 06 00 21 1A 03 10 32 6A 0B 48
    00 00 00 00 I E N D AE 42 60 82

    Parasite code:
    -->
    <div id='mypage'>
    <h1>HTML page</h1>
    <script language=javascript type="text/javascript">
    document.documentElement.innerHTML = document.getElementById('mypage').innerHTML;
    document.title = 'HTML title';
    alert("JavaScript payload");
    console.log("JavaScript payload");
    </script>
    </div>
    <!--
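The chunk layout in the dump (length, type, data, CRC) can be rebuilt by hand. The following is a sketch of the parasite idea only, not Mitra's implementation: it inserts one extra `cOMM` chunk right after the PNG signature, as in the dump, and computes the CRC over type plus data. The helper names and the tiny 1x1 host image are assumptions made for the example.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Serialize one PNG chunk: length, type, data, CRC32(type + data)."""
    crc = zlib.crc32(chunk_type + data)
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def add_parasite(png: bytes, payload: bytes, chunk_type: bytes = b"cOMM") -> bytes:
    """Host `payload` in an extra chunk placed right after the signature."""
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    return PNG_SIGNATURE + make_chunk(chunk_type, payload) + png[len(PNG_SIGNATURE):]


def list_chunks(png: bytes):
    """Return (type, data) pairs for every chunk, without validating CRCs."""
    chunks, pos = [], len(PNG_SIGNATURE)
    while pos + 12 <= len(png):
        (length,) = struct.unpack_from(">I", png, pos)
        chunks.append((png[pos + 4:pos + 8], png[pos + 8:pos + 8 + length]))
        pos += 12 + length
    return chunks


def tiny_png() -> bytes:
    """A 1x1 grayscale host image, built only for this example."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b"\x00\x00")  # one filter byte, one pixel
    return (PNG_SIGNATURE + make_chunk(b"IHDR", ihdr)
            + make_chunk(b"IDAT", idat) + make_chunk(b"IEND", b""))


if __name__ == "__main__":
    hosted = add_parasite(tiny_png(), b"-->\r\n<h1>HTML page</h1>\r\n<!--")
    print([ctype for ctype, _ in list_chunks(hosted)])
```

The lowercase first letter of `cOMM` marks the chunk as ancillary, so decoders that do not know it are expected to skip it; how strictly a given decoder treats a chunk placed before IHDR is not covered by this sketch.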
  24. Polymocks (ID bypass): using Mocky to bypass file identification. Add any possible signature with Mocky: many detected file types.

    $ mocky.py --combined input/jpg.jpg
    Filetype: JFIF / JPEG File Interchange Format
    Parasite-combined sig(s): unicos / Symbian / snd / wdk / SoundFont / icc / VICAR / netbsd_ktraceS / SoundFX / VirtualBox / ScreamTracker / Plot84 / ezd / dicom / Tar(checksum) / ds / CCP4 / DRDOS / pif / mbr
    25676 > Combined Mock: mA-jpg.jpg

    $ file mA-jpg.jpg
    mA-jpg.jpg: tar archive
    <- FILE sees it as a TAR file! (valid TAR signature + checksum)

    $ identify -verbose ./mA-jpg.jpg
    Image:
    Filename: ./mA-jpg.jpg
    Format: JPEG (Joint Photographic Experts Group JFIF format)
    Mime type: image/jpeg
    Class: PseudoClass
    Geometry: 104x56+0+0
    Resolution: 36x36
    Print size: 2.88889x1.55556
    Units: PixelsPerCentimeter
    Colorspace: Gray
    [...]
    Still a perfectly valid JPEG! (with an extra COMment segment stuffed with signatures)

    $ file mA-jpg.jpg --keep-going --raw
    mA-jpg.jpg: tar archive
    - DR-DOS executable (COM)
    - JPEG image data, baseline, precision 8, 104x56, components 1
    - Windows Program Information File for acsp`
    - VICAR label file
    - DOS/MBR boot sector
    - Nintendo DS ROM image: "�����" (SNDH, Rev.107) (homebrew)
    - Plot84 plotting file
    - DOS/MBR boot sector
    - sfArk compressed Soundfont
    - Old EZD Electron Density Map
    - Symbian installation file
    - Scream Tracker Sample mono 8bit
    - SNDH Atari ST music
    - SoundFX Module sound file
    - DICOM medical imaging data
    - CCP4 Electron Density Map
    - VirtualBox Disk Image (�����), 5715999566798081280 bytes
    - unicos (cray) executable
    - data
  25. A polymock - a 190-in-1 yet empty file. The file is mostly empty! It only contains magics to fake file types. Many magics are at the start of the file. https://github.com/corkami/pocs/tree/master/polymocks

    Output from file --keep-going:
    multi: Windows Program Information File for \030(o\001
    - MAR Area Detector Image,
    - Linux kernel x86 boot executable RW-rootFS,
    - ReiserFS V3.6
    - Files-11 On-Disk Structure (ODS-52); volume label is ' '
    - DOS/MBR boot sector
    - Game Boy ROM image (Rev.00) [ROM ONLY], ROM: 256Kbit
    - Plot84 plotting file
    - DOS/MBR boot sector
    - DOSFONT2 encrypted font data
    - Kodak Photo CD image pack file , landscape mode
    - SymbOS executable v., name: HNRO0\334\247\304\375]\034\236\243
    - ISO 9660 CD-ROM filesystem data (raw 2352 byte sectors)
    - Nero CD image at 0x4B000 ISO 9660 CD-ROM filesystem data
    - High Sierra CD-ROM filesystem data
    - Old EZD Electron Density Map
    - Apple File System (APFS), blocksize 24061976
    - Zoo archive data, modify: v78.88+
    - Symbian installation file
    - 4-channel Fasttracker module sound data Title: "MZ`\352\210\360'\315!"
    - Scream Tracker Sample adlib drum mono 8bit unpacked
    - Poly Tracker PTM Module Title: "MZ`\352\210\360'\315!"
    - SNDH Atari ST music
    - SoundFX Module sound file
    - D64 Image
    - Nintendo Wii disc image: "NXSB\030(o\001" (MZ`\35, Rev.205)
    - Nintendo 3DS File Archive (CFA) (v0, 0.0.0)
    - Unix Fast File system [v1] (little-endian), last mounted on , ...
    - Unix Fast File system [v2] (little-endian) last mounted on , ...
    - Unix Fast File system [v2] (little-endian) last mounted on , …
    - ISO 9660 CD-ROM filesystem data (DOS/MBR boot sector)
    - F2FS filesystem, UUID=00000000-0000-0000-0000-000000000000, volume name ""
    - DICOM medical imaging data
    - Linux kernel ARM boot executable zImage (little-endian)
    - CCP4 Electron Density Map
    - Ultrix core file from 'X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVI...
    - VirtualBox Disk Image (MZ`\352\210\360'\315!), 5715999566798081280 bytes
    - MS Compress archive data
    - AMUSIC Adlib Tracker MS-DOS executable, MZ for MS-DOS COM executable for DOS
    - JPEG 2000 image
    - ARJ archive data
    - unicos (cray) executable
    - IBM OS/400 save file data
    - data

    This file is simultaneously detected as:
    - DOS EXE, COM and MBR
    - Zoo, ARJ, VirtualBox, MS Compress, 3DS
    - ISO, RAW ISO, Nero, PhotoCD
    - FastTracker, ScreamTracker, Adlib tracker, Polytracker, SoundFX
    - Apple, IBM, HP, Linux, Ultrix, Raid, ODS, Nintendo, Kodak
    - EZD, CCP4, Plot84, MAR, Dicom ...

    Output (150 sigs) from Binwalk:
    0 0x0 Gameboy ROM,, [ROM ONLY], ROM: 256Kbit
    80 0x50 RAR archive data, version 5.x
    88 0x58 lrzip compressed data
    89 0x59 rzip compressed data - version 76.79...
    114 0x72 xz compressed data
    120 0x78 LZ4 compressed data
    ...

    00: .M .Z 60 EA .j .P 01 07 19 04 00 10 .S .N .D .H
    10: .N .R .O .0 DC A7 C4 FD 5D 1C 9E A3 .R .E .~ .^
    20: .N .X .S .B 18 28 6F 01 .P .K 03 04 .P .T .M .F
    30: .S .y .m .E .x .e .7 .z BC AF 27 1C .S .O .N .G
    40: 7F 10 DA BE 00 00 CD 21 .P .K 01 02 .S .C .R .S
    50: .R .a .r .! ^Z 07 01 00 .L .R .Z .I .P .L .O .T
    60: .% .% .8 .4 .R .a .r .! ^Z 07 00 00 00 .M .A .P
    70: 20 .( FD .7 .z .X .Z 00 04 22 4D 18 03 21 4C 18
    80: .D .I .C .M .% .P .D .F .- .1 .. .4 20 .o .b .j
    …
  26. Each format characteristic enables more possibilities.

    [Compatibility matrix of 44 formats; row totals of X marks:] Zip 41, 7Z 41, Arj 41, RAR 41, PDF 41, ISO 41, DCM 37, TAR 30, PS 8, MP4 8, AR 8, BMP 7, BZ2 7, CAB 8, CPIO 8, EBML 6, ELF 7, FLV 8, Flac 8, GIF 7, GZ 8, ICC 6, ICO 8, ID3v2 8, ILDA 8, JP2 8, JPG 8, NES 7, OGG 8, PSD 8, LNK 6, PE 7, PNG 8, RIFF 8, RTF 8, TIFF 8, WAD 8, BPG 8, Java 7, PCAP 8, PCAPNG 8, WASM 8, ID3v1 0, XZ 0.

    Row groups: Magic signatures at offset zero. Formats with cavities (->zippers). Valid at any offset. Formats enforcing magics at offset zero. Footers.
  27. A custom binary lasagna: Abusing line comments and interleave PDF statements w/ arbitrary data.

    000: 2031 3233 3435 3637 3839 3031 3233 3435   123456789012345
    010: 0a25 5044 462d 312e 3425 2020 2020 2020  .%PDF-1.4%
    020: 3031 3233 3435 3637 3839 3031 3233 3435  0123456789012345
    030: 0a31 2030 206f 626a 3c3c 2520 2020 2020  .1 0 obj<<%
    040: 3031 3233 3435 3637 3839 3031 3233 3435  0123456789012345
    050: 0a2f 5479 7065 2f43 6174 616c 6f67 2520  ./Type/Catalog%
    060: 3031 3233 3435 3637 3839 3031 3233 3435  0123456789012345
    070: 0a2f 5061 6765 7320 3220 3020 5225 2020  ./Pages 2 0 R%
    080: 3031 3233 3435 3637 3839 3031 3233 3435  0123456789012345
    090: 0a3e 3e65 6e64 6f62 6a0a 2520 2020 2020  .>>endobj.%
    ...
    640: 3031 3233 3435 3637 3839 3031 3233 3435  0123456789012345
    650: 0a74 7261 696c 6572 203c 3c25 2020 2020  .trailer <<%
    660: 3031 3233 3435 3637 3839 3031 3233 3435  0123456789012345
    670: 0a2f 526f 6f74 2031 2030 2052 3e3e 2520  ./Root 1 0 R>>%
    680: 3031 3233 3435 3637 3839 3031 3233 3435  0123456789012345
  28. Duplicity in ZIPs: 4 names for the same archived file via older structures. FileCraft

    00: 50 4B 03 04 00 00 00 08 00 00 00 00 00 00 95 19  PK
    10: 85 1B 0C 00 00 00 0C 00 00 00 08 00 2E 00 4C 46  . LF
    20: 48 20 4E 61 6D 65 75 70 11 00 01 BE A1 2C A5 55  H Nameup , U
    30: 6E 69 63 6F 64 65 20 4E 61 6D 65 05 26 15 00 5A  nicode Name & Z
    40: 50 49 54 08 4D 61 63 20 4E 61 6D 65 5A 49 50 20  PIT Mac NameZIP
    50: 53 49 54 78 48 65 6C 6C 6F 20 77 6F 72 6C 64 21  SITxHello world!
    60: 50 4B 01 02 00 00 00 00 00 00 00 00 00 00 00 00  PK
    70: 95 19 85 1B 0C 00 00 00 0C 00 00 00 09 00 00 00
    80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 43 44  CD
    90: 46 48 20 4E 61 6D 65 50 4B 05 06 00 00 00 00 01  FH NamePK
    a0: 00 01 00 37 00 00 00 60 00 00 00 00 00  7 `
  29. Near polyglots: Non-working parasites with data to be replaced. The smaller that data, the better (ex: overlapping magics). An external operation will swap the overlapping data.

    [Table: near-polyglot support for the hosts PS, PE and JPG against parasite formats, with columns grouped by minimal start offset, variable offset and unsupported offset. Legend: A: automated, ?: likely possible, M: manual, !: unknown.]
  30. From near-polyglots to crypto-polyglots: Swap the overlap via [cryptographic] operations. En-/de-cryption with specific parameters (IV, nonce): bruteforcing may be required. Each payload is [partially] hidden when the other is in clear.
  31. 89 P N G \r \n ^Z \r 00 00

    00 2C c O M M 00 00 0D 00 07 00 01 00 01 00 FF FF FF 00 00 00 00 00 00 00 65 40 00 00 55 40 00 00 67 60 00 00 57 50 00 00 65 60 00 00 00 00 00 00 00 00 00 00 1D 44 05 DC 00 00 00 0D I H D R 00 00 00 0D 00 00 00 07 01 03 00 00 00 E9 BE 55 59 00 00 00 06 P L T E FF FF FF 00 00 00 55 C2 D3 7E 00 00 00 1B I D A T 08 1D 63 00 82 54 03 86 70 07 86 F4 02 06 F7 00 06 57 03 06 06 06 00 21 1A 03 10 32 6A 0B 48 00 00 00 00 I E N D AE 42 60 82 00: 10: 20: 30: 40: 50: 60: 70: 80: 90: A0: B M 3C 00 00 00 00 00 00 00 20 00 00 00 0C 00 A BMP/PNG near polyglot, with 16 bytes of overlap. B M 3C 00 00 00 00 00 00 00 20 00 00 00 0C 00 89 P N G \r \n ^Z \n 00 00 00 2C c O M M mitra.py bmp.bmp png.png --overlap Generates O(10-40)-PNG[BMP]{424D3C00000000000000200000000C00}.1965e270.png.bmp 40
  32. When AES(☢)=☠ B M 3C 00 00 00 00 00

    00 00 20 00 00 00 0C 00 00 00 0D 00 07 00 01 00 01 00 FF FF FF 00 00 00 00 00 00 00 65 40 00 00 55 40 00 00 67 60 00 00 57 50 00 00 65 60 00 00 00 00 00 00 00 00 00 00 00 A1 3B E2 E0 64 F0 A7 AE 5E 21 64 BC 44 5F 09 E3 67 D3 10 19 AF 09 F1 99 1A 33 B3 BF 28 EF 9E 71 3D 87 79 EC 73 A9 60 82 74 1B EB 08 B4 4E B7 E5 9E 16 A9 CE BC 1B 71 99 E7 F8 E8 FA 8C C0 6C 6B 85 4B 56 73 7D 22 BD 46 DE AC 3F BF EE 8B 96 AB 74 55 5F 21 B7 10 1B D6 96 18 45 6E E5 B0 3C 7C 22 99 87 EA FE 1F 4D FF C8 52 C0 24 C7 AD A8 00: 10: 20: 30: 40: 50: 60: 70: 80: 90: A0: 89 P N G \r \n ^Z \n 00 00 00 30 c O M M 71 2F D8 C7 79 C1 EB CF 63 B0 22 2B 0A 6D E3 2D 24 49 57 B1 9B BB C2 FA 94 8A 8C 53 9E A1 30 63 30 C9 41 75 EA AF 75 EE 95 7C 57 E9 16 4F F7 3B 1D 44 05 DC 00 00 00 0D I H D R 00 00 00 0D 00 00 00 07 01 03 00 00 00 E9 BE 55 59 00 00 00 06 P L T E FF FF FF 00 00 00 55 C2 D3 7E 00 00 00 1B I D A T 08 1D 63 00 82 54 03 86 70 07 86 F4 02 06 F7 00 06 57 03 06 06 06 00 21 1A 03 10 32 6A 0B 48 00 00 00 00 I E N D AE 42 60 82 00 00 00 00 00 00 00 00 00 00 00 00 00 00 A valid BMP is AES-CBC encrypted as a PNG with a special IV to encrypt the first block as expected (AngeCryption). AES-CBC mitra/utils/cbc$ angecrypt.py "O(10-40)-PNG[BMP]{424D3C00000000000000200000000C00}.1965e270.png.bmp" bmp-png.cbc 41 AngeCryption works with ECB, CBC, CFB, OFB
  33. A BMP/PS near polyglot with 3 bytes of overlap. /

    { ( 00 00 00 00 00 00 00 20 00 00 00 0C 00 00 00 0D 00 07 00 01 00 01 00 FF FF FF 00 00 00 00 00 00 00 65 40 00 00 55 40 00 00 67 60 00 00 57 50 00 00 65 60 00 00 00 00 00 00 ) } % ! P S \r \n / N i m b u s S a n s - R e g u l a r 1 0 0 s e l e c t f o n t \r \n 7 5 4 0 0 m o v e t o \r \n ( P o s t S c r i p t ) s h o w \r \n s h o w p a g e \r \n s t o p \r \n 00 00 00 00 00 00 B M 3C 00: 10: 20: 30: 40: 50: 60: 70: 80: 90: / { ( B M 3C mitra.py postscript.ps bmp.bmp --overlap Generates O(3-3c)-PS[BMP]{424D3C}.209881aa.ps.bmp 42
  34. Both files are decrypted via GCM from the same ciphertext

    but via different keys. The nonce is bruteforced to generate the right overlap with either key. B M 3C 00 00 00 00 00 00 00 20 00 00 00 0C 00 00 00 0D 00 07 00 01 00 01 00 FF FF FF 00 00 00 00 00 00 00 65 40 00 00 55 40 00 00 67 60 00 00 57 50 00 00 65 60 00 00 00 00 00 00 B7 EB 32 E8 16 D6 9E 76 AC 20 9C 8C 9F 06 6F 55 3F 96 0E 09 04 24 41 5D 22 7C A6 E5 0E AC ED 1C 04 65 BE E6 E8 AB E4 D2 C6 B6 CD 9F AB 85 E1 CE 03 C5 A5 85 70 B5 09 EB EB CB D1 2F 7C 4D B0 09 35 38 D9 B7 82 31 BB 87 96 22 C8 4E C0 EC 89 C3 CB 97 63 D3 A0 28 47 5B 71 C2 95 EC 12 E2 52 B0 6F B1 EE 61 09 6A B5 E0 C7 B5 D7 41 55 9B DA 24 3B E2 13 B4 / { ( 07 3A 14 40 E5 3E EC AE A2 AD 87 AA 38 11 C4 5D 5A 35 2D EB EC 47 CC A7 B5 63 22 90 B7 5F D7 41 7B FD 6D 53 DB 78 9F AA A6 2B 22 61 AD BB 38 48 4A 5C A7 D5 E4 63 4F 4D 7B ) } % ! P S \r \n / N i m b u s S a n s - R e g u l a r 1 0 0 s e l e c t f o n t \r \n 7 5 4 0 0 m o v e t o \r \n ( P o s t S c r i p t ) s h o w \r \n s h o w p a g e \r \n s t o p \r \n 00 00 00 00 00 00 C8 4D 88 94 64 F9 8B F5 70 5D 1F 16 C0 63 50 A0 PostScript 00: 10: 20: 30: 40: 50: 60: 70: 80: 90: A0: mitra/utils/gcm$ meringue.py "O(3-3c)-PS[BMP]{424D3C}.209881aa.ps.bmp" bmp-ps.gcm 43 TimeCryption works with CTR, OFB, GCM, GCM-SIV, OCB3 ciphertext Key 2 Key 1
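The nonce bruteforce described above can be illustrated with a toy stream cipher. This sketch makes several loud assumptions: SHAKE128 over key plus nonce stands in for the AES-CTR/GCM keystream, there is no authentication tag, and the overlap is reduced to 2 bytes (`BM` versus `/{`) so the search finishes quickly. All function names are hypothetical, and the keys are the ASCII form of the two hex keys shown in the viewer example further down.

```python
import hashlib


def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream: SHAKE128(key || nonce). A stand-in, not AES-CTR."""
    return hashlib.shake_128(key + nonce).digest(length)


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def find_nonce(key1, key2, head1, head2, max_tries=1 << 22) -> bytes:
    """Search a nonce for which both keys decrypt the same bytes to their own header."""
    wanted = xor(head1, head2)  # required difference between the two keystreams
    for counter in range(max_tries):
        nonce = counter.to_bytes(4, "big")
        delta = xor(keystream(key1, nonce, len(wanted)),
                    keystream(key2, nonce, len(wanted)))
        if delta == wanted:
            return nonce
    raise RuntimeError("no suitable nonce found")


def craft(key1, key2, file1, file2, overlap):
    """file1 owns bytes [0:len(file1)]; file2 owns the overlap and the tail."""
    split = len(file1)
    nonce = find_nonce(key1, key2, file1[:overlap], file2[:overlap])
    ks1 = keystream(key1, nonce, len(file2))
    ks2 = keystream(key2, nonce, len(file2))
    ciphertext = xor(file1, ks1[:split]) + xor(file2[split:], ks2[split:])
    return nonce, ciphertext


def decrypt(key, nonce, ciphertext):
    return xor(ciphertext, keystream(key, nonce, len(ciphertext)))


KEY1, KEY2 = b"Now?", b"L4t3r!!!"
FILE1 = b"BM" + b"<bitmap payload>"
FILE2 = b"/{" + bytes(len(FILE1) - 2) + b"<postscript payload>"

if __name__ == "__main__":
    nonce, ct = craft(KEY1, KEY2, FILE1, FILE2, overlap=2)
    print(decrypt(KEY1, nonce, ct)[:len(FILE1)])
    print(decrypt(KEY2, nonce, ct)[len(FILE1):])
```

With the second key, the region owned by the first file decrypts to noise, which is why the second format needs to tolerate a parasite there. Each extra byte of overlap multiplies the expected search by 256, so a smaller overlap is cheaper.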
  35. Under the hood - check Mitra & KeyCom.

    [Diagram labels: Keys, CipherTexts, Keystreams, Overlap?, Polyglot, File1 📝, File2 📝, swap offsets, 🔑, Nonce, AuthData, tag, Encryption, Combine, Correction, Authenticated Decryption, Block index, Bruteforce, Xor, Slice, CipherText, File Format FIX, Authentication, Corrected CipherText.]
  36. Our PDF article file is also a PDF viewer executable! Via authenticated encryption.

    $ wget https://eprint.iacr.org/2020/1456.pdf
    [...]
    $ openssl enc -in 1456.pdf -out crypted \
        -aes-128-ctr -iv 00000000000000000000e7c600000002 \
        -K 4e6f773f000000000000000000000000
    $ openssl enc -in crypted -out viewer.exe \
        -aes-128-ctr -iv 00000000000000000000e7c600000002 \
        -K 4c347433722121210000000000000000
    $ wine viewer.exe 1456.pdf
  37. 👼 TIMECRYPTION: Without key commitment, a ciphertext can be crafted to decrypt, with valid authentication, to different payloads. Vulnerabilities @ Facebook, Amazon, Google… With key management: friendly today, evil tomorrow.
  38. A hierarchy of weird files:

    Same format? ✓ -> Ambiguous
    Same format? ✗, Full format? ✗ (just magic) -> PolyMock
    Same format? ✗, Full format? ✓, Overlap? ✗ -> Polyglot
    Same format? ✗, Full format? ✓, Overlap? ✓ -> Near polyglot
  39. File format challenges in 2024: Private information can leak at cloud scale. Leaked credentials are abused within minutes. Keys, login/passwords, cookies. A single "minor" bug can affect billions of users!
  40. PNG has been clearly defined since 1996, and yet…

    89 P N G \r \n ^Z \r
    00 00 00 0D I H D R 00 00 00 0D 00 00 00 07 01 03 00 00 00 E9 BE 55 59
    00 00 00 06 P L T E FF FF FF 00 00 00 55 C2 D3 7E
    00 00 00 1B I D A T 08 1D 63 00 82 54 03 86 70 07 86 F4 02 06 F7 00 06 57 03 06 06 06 00 21 1A 03 10 32 6A 0B 48
    00 00 00 00 I E N D AE 42 60 82
  41. aCropalypse (2023), by Simon Aarons and David Buchanan: Just standard PNG files, cropped by the user. The smaller cropped file is kept, with leftovers of the original image as trailing data. -> major leak of information for users. aCropalypse - Wikipedia
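Checking a PNG for such leftovers only needs a walk over the chunk list. This is a minimal sketch, not a tool from the talk: it follows chunk lengths until `IEND` and returns whatever bytes follow. It does not validate CRCs or attempt to recover image data, and the function name `trailing_data` is made up.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def trailing_data(png: bytes) -> bytes:
    """Return the bytes found after the IEND chunk (empty if there are none)."""
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    while pos + 12 <= len(png):
        (length,) = struct.unpack_from(">I", png, pos)
        chunk_type = png[pos + 4:pos + 8]
        pos += 12 + length  # length field + type + data + CRC
        if chunk_type == b"IEND":
            return png[pos:]
    raise ValueError("no IEND chunk found")


# Structure-only stub: signature plus an empty IEND chunk (not a viewable image).
STUB = PNG_SIGNATURE + bytes.fromhex("0000000049454e44ae426082")

if __name__ == "__main__":
    print(len(trailing_data(STUB)))
    print(len(trailing_data(STUB + b"leftover image data")))
```

A non-empty result is not proof of a leak, since appended data can also be deliberate (see the concatenation polyglots earlier), but it is a cheap first filter.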
  42. SQLite files in the wild…

    Adobe In-Product Messaging:
    00: S Q L i t e  f o r m a t  3 00  (SQLite3 magic)
    ...
    40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  (AppId: unset)
    ...
    F70-F90: esCREATE TABLE MessageProperties(msgID INTEGER  (sqlite_schema table)
    SELECT Content FROM IPMMessage WHERE Content LIKE '%expire%';
    Your subscription is about to expire
    Your subscription has expired

    Bitcoin Wallet:
    00: S Q L i t e  f o r m a t  3 00  (SQLite3 magic)
    ...
    40: 00 00 00 00 F9 BE B4 D9 00 00 00 00 00 00 00 00  (AppId: F9 BE B4 D9, Bitcoin Main Network)
    ...
    860-880: tablemainmainCREATE TABLE main(key BLOB PRIMARY  (sqlite_schema table)
  43. SQLite: data leaks in plain sight. Magic: "SQLite format 3\0" (16 bytes) -> very strong identification. But… no easy subtype identification: the Application ID is rarely used. Is it a standard assets storage? A mountable filesystem? Cookies / web history / credit cards / bitcoin wallet? -> Identification tool: sqlbuddy.py
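Reading the Application ID takes only the first 100 bytes of the file. The sketch below is not sqlbuddy.py; it relies on the SQLite header layout, where the ID is a 4-byte big-endian integer at offset 68 (0x44, the position of `F9 BE B4 D9` in the wallet dump). The helper names and the demo value 1234 are arbitrary choices for the example.

```python
import os
import sqlite3
import struct
import tempfile
from typing import Optional

SQLITE_MAGIC = b"SQLite format 3\x00"


def read_application_id(path: str) -> Optional[int]:
    """Return the Application ID from the header, or None if not SQLite 3."""
    with open(path, "rb") as handle:
        header = handle.read(100)
    if len(header) < 100 or not header.startswith(SQLITE_MAGIC):
        return None
    return struct.unpack_from(">I", header, 68)[0]


def demo() -> Optional[int]:
    """Create a throwaway database with an Application ID, then read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")
        con = sqlite3.connect(path)
        con.execute("PRAGMA application_id = 1234")
        con.execute("CREATE TABLE kv (key BLOB PRIMARY KEY)")
        con.commit()
        con.close()
        return read_application_id(path)


if __name__ == "__main__":
    print(demo())
```

A result of 0 means the ID is unset, which per the slide is the common case, so a subtype still has to be guessed from the schema.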
  44. Collisions in 2024? Who needs cryptographic hashes for collisions? Some AVs detect the EICAR file by CRC32!

    X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*
    or
    DpVRUX<=EICAR
    Same CRC32. CRC collision? Use Shake128/Kangaroo12/Blake3 instead! Script: mycar.sh
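Why CRC32 collisions are cheap can be shown in a few lines. This is a sketch of one generic approach, not the method used by mycar.sh: for a fixed prefix, the CRC is an affine function of the last 4 bytes over GF(2), so the suffix that yields any chosen CRC can be solved by Gaussian elimination. The function name `forge_crc32_suffix` and the sample prefix are made up for the example.

```python
import zlib


def forge_crc32_suffix(prefix: bytes, target: int) -> bytes:
    """Return 4 bytes such that crc32(prefix + suffix) == target."""
    base = zlib.crc32(prefix + b"\x00" * 4)
    pivots = {}  # highest set bit of a CRC delta -> (delta, suffix bits)
    for bit in range(32):
        combo = 1 << bit
        # Effect on the CRC of flipping this single suffix bit.
        delta = zlib.crc32(prefix + combo.to_bytes(4, "little")) ^ base
        # Reduce against known pivots; keep the (delta, combo) pairing intact.
        while delta:
            high = delta.bit_length() - 1
            if high not in pivots:
                pivots[high] = (delta, combo)
                break
            pivot_delta, pivot_combo = pivots[high]
            delta ^= pivot_delta
            combo ^= pivot_combo
    # Express the wanted CRC change as a XOR of pivot deltas.
    wanted, solution = target ^ base, 0
    while wanted:
        pivot_delta, pivot_combo = pivots[wanted.bit_length() - 1]
        wanted ^= pivot_delta
        solution ^= pivot_combo
    return solution.to_bytes(4, "little")


EICAR = rb"X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*"

if __name__ == "__main__":
    prefix = b"harmless text "
    forged = prefix + forge_crc32_suffix(prefix, zlib.crc32(EICAR))
    print(hex(zlib.crc32(forged)), hex(zlib.crc32(EICAR)))
```

The forged suffix is usually not printable; getting a short printable collision such as the one on the slide takes extra constraints that this sketch does not attempt.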
  45. Trophy wall. 2017: BlackHat, RWC, Crypto. 2019: PtS, Hack.lu. 2019 (workshop): PtS, Hack.lu, BA… https://github.com/corkami/collisions docs, precomputed prefixes, scripts, pocs… (MIT licence)
  46. Detecting collisions w/ signatures: DetectColl can detect any MD5 or SHA1 hash collision. Github / corkami / collisions / README.md#signatures

    Flame's unique collision:
    $ detectcoll_unsafe flame.der | ./logparse.py
    flame.der block: 11, collision: Flame

    Newest SHA1's: Shambles
    $ detectcoll 13-shambles1.bin | ./logparse.py
    13-shambles1.bin block: 9, collision: SHAttered/Shambles
  47. Retr0id's hashquine archive (2023): A generic tar file contains a hash list with the hash of the whole archive. The file is "building" a Tar header via 653 MD5 collisions abusing ZStandard frames. Explanations on github / corkami / collisions / hashquines

    $ tar -xvf self.tar.zst
    x hash.md5
    x hello.txt
    $ md5sum -c hash.md5
    self.tar.zst: OK
    hello.txt: OK
  48. "If it's not broken in practice… …it must be good enough!"

    Deprecating Insecure Practices in RADIUS (23 October 2023, Expires: 25 April 2024): "While MD5 has been broken, it is a testament to the design of RADIUS that there have been (as yet) no attacks on RADIUS Authenticator signatures which are stronger than brute-force." https://www.ietf.org/archive/id/draft-dekok-radext-deprecating-radius-05.txt

    And yet… New MD5 attack - June 2024
  49. TextColl (2024) - 1 bit-difference - Not 64 bytes rounded! - Custom alphabet: alphanum –> test your password! Github / cr-marcstevens / hashclash / tree / textcoll

    TEXTCOLLBYfGiJUETHQ4hAcKSMd5zYpgqf1YRDhkmxHkhPWptrkoyz28wnI9V0aHeAuaKnak
    TEXTCOLLBYfGiJUETHQ4hEcKSMd5zYpgqf1YRDhkmxHkhPWptrkoyz28wnI9V0aHeAuaKnak

    TEXTCOLLBYfGiJUETHQ4hAcKSMd5zYpgqf1YRDhkmxHkhPWptrkoyz28wnI9V0aHmSZaAAAA()(()()(()((((((()((()((()())))()(()))))())(())))))()(()
    TEXTCOLLBYfGiJUETHQ4hEcKSMd5zYpgqf1YRDhkmxHkhPWptrkoyz28wnI9V0aHmSZaAAAA()(()()(()((((((()((()((()())))()(()))))())(())))))()(()

    BASEX64G5MM2g4CpNnoHBeERiMZ3J5P2YsP7wIlz4Kfh+JGOxOiptV+pvQZ0whAt1q3Jt+la2CKWqu5H9bzDIxBaNrzCkij91ZB9M5DlPne5sUir5TZ6yQGfGKtaX0BG
    BASEX64G5MM2g4CpNnoHBiERiMZ3J5P2YsP7wIlz4Kfh+JGOxOiptV+pvQZ0whAt1q3Jt+la2CKWqu5H9bzDIxBaNrzCkij91ZB9M5DlPne5sUir5TZ6yQGfGKtaX0BG
  50. Formats are complex. Files are layered. Archive, stack, encapsulation, compressions… One format may be robust against some attacks, but its inner/outer format or side format might make the whole system vulnerable.
  51. Shattered: First SHA1 collision on PDF files. It wasn't a PDF collision as PDF parsers can't be reliably collided with SHA1. -> Abuse JPG in PDF as JPG can be collided reliably. This would likely work in any format based on JPG.
  52. Inside Out. Abusing .docx: - XML can't tolerate collision blocks. - ZIPs can't be collided generically. -> Abuse XML in Zips via Zip structures. Abusing .tar.gz: (Tar can't be collided generically) -> Abuse Gzip structure to show different Tar contents.
  53. Tar hashquines: Tar can't be collided generically: -> Abuse GZip/LZ4/Zstandard archive structure to present different archived file contents to external parsers.
  54. File formats… and AI? "Who needs AI to check a magic signature?" "It won't catch polyglots anyway." But… What about source files? Or any kind of attachments? …
  55. Magika: 200+ formats: text and binary. Small model: runs on CPU, needs 1 MB. Fast: <5ms per file. Used in production on hundreds of billions of files weekly. Used in 150+ projects. Open-source: Python, Go, Rust, JavaScript. https://github.com/google/magika Paper (ICSE 25) https://arxiv.org/abs/2409.13768 Non-generative AI: no copyright infringement, just a detection verdict.
  56. The 'Magika in production' effect… How many file formats overall? Who knows… 🤯 Each community has its weirdnesses, overlaps, do's and don'ts. "What a mess 🤌"
  57. No silver bullet: It doesn't scan the whole file (only the first & last 2 KB). Not enough samples to train on many formats. Standard AI limitations: no editing / omitting. May fail on weird files 😉 May catch corrupted/spoofed files: -> useful for carving, recovery-abuse. -> Remove the first 16 bytes, then re-scan.
  58. Magika on corrupted files A ZIP with invalid signatures:

    An invalid file recovered by applications. -> scanning bypass.
    00: .B .K \3 \4 0a 00 00 00 00 00 00 00 00 00 23 8e
    10: 5a 6b 05 00 00 00 05 00 00 00 07 00 00 00 .z .i
    20: .p .. .t .x .t .Z .I .P \r \n .B .K 01 02 1f 00
    30: 0a 00 00 00 00 00 00 00 00 00 23 8e 5a 6b 05 00
    40: 00 00 05 00 00 00 07 00 00 00 00 00 00 00 00 00
    50: 00 00 00 00 00 00 00 00 .z .i .p .. .t .x .t .B
    60: .K 05 06 00 00 00 00 01 00 01 00 35 00 00 00 2a
    70: 00 00 00 00 00
    $ file badsigs.zip
    badsigs.zip: data
    $ magika badsigs.zip -s
    badsigs.zip: Zip archive data (archive) 98% 72
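    A sketch of why such a file can still be recovered: the first 0x2A bytes of the dump above follow the standard 30-byte ZIP local file header layout, so a parser that trusts the structure and ignores the signature still finds the file name and data. The byte values are transcribed from the slide; the function name and the `check_signature` switch are made up for this example.

```python
import struct

# Local file record transcribed from the slide: "BK\x03\x04" replaces "PK\x03\x04".
SLIDE_BYTES = bytes.fromhex(
    "424b0304"                     # (invalid) signature
    "0a00" "0000" "0000"           # version needed, flags, compression method
    "0000" "0000"                  # modification time, date
    "238e5a6b"                     # CRC-32 field as shown on the slide
    "05000000" "05000000"          # compressed size, uncompressed size
    "0700" "0000"                  # file name length, extra field length
) + b"zip.txt" + b"ZIP\r\n"


def parse_local_record(data: bytes, check_signature: bool = True):
    """Parse one ZIP local file record; optionally ignore a wrong signature."""
    fields = struct.unpack_from("<4sHHHHHIIIHH", data, 0)
    sig, _ver, _flags, method, _time, _date, _crc, csize, _usize, nlen, xlen = fields
    if check_signature and sig != b"PK\x03\x04":
        raise ValueError("bad local file header signature")
    name = data[30:30 + nlen]
    start = 30 + nlen + xlen
    return name, method, data[start:start + csize]


if __name__ == "__main__":
    try:
        parse_local_record(SLIDE_BYTES)
    except ValueError as err:
        print("strict parser:", err)
    print("lenient parser:", parse_local_record(SLIDE_BYTES, check_signature=False))
```

    A signature-first tool stops at the first four bytes, while a structure-driven one reads the stored entry without complaint.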
  59. Magika is new & different, and useful in its own

    way. Planning to make a new engine? -> Investigate all existing ones, then give a talk on the topic -> 73
  60. Some formats give you full control over the first X

    bytes. Most make it possible to insert exploitable contents early.
    Use Mitra to insert 1 kb of free space in your file:
    mitra.py <inputfile> /dev/null --pad 1 -f
    Use Mocky to insert dummy signatures:
    mocky.py <inputfile> --combined
    Mocky & Mitra @ Github corkami/mitra
    Fool AI identification? 74
  61. In 2024… Many old tricks still work. Specifications can still

    be naive or laughable. No reference code, no test cases. No incentive to fix anything if it's not a security bug. -> back to the eternal: "let's check Wikipedia…" ? 76 Does he bite? "Specs are enough" No, but he can hurt you in other ways
  62. From funky PoCs to fearsome tools. Working at scale with

    new tools: - 100s of collision possibilities - 1000s of polyglot combinations - 100s of billions of scanned files by AI. 77
  63. AI & file formats - Many AI formats are

    vulnerable. - Magika brings something new to file format processing. - Mitra can be used to inject arbitrary data in formats (and fool AI). 78
  64. Room for improvement 🍺 - Specifications writing and updating. -

    Sample crafting and sharing. - Format identification and heuristics. - Format classifying and rating. 79
  65. Give a man a fish and you feed him for

    a day. Teach a man to fish and you feed him for a lifetime.
    Commandments of a good file format:
    - Magic at offset zero: fast identification, no bypass
    - Clear chunk structure: forward compatibility, easy parsing/cleanup
    - Version number: forward thinking
    - No duplicity: duplicity → discrepancy ☠
    - No "constant" variables: ossification → hardcoding
    - Up-to-date specs: reflect reality
    - Samples set: theory isn't enough
    - Extensibility: your format will evolve in unknown ways
    - Keep the spirit: don't reuse formats for different intent without trivial distinction
    - Perfect is the enemy of good: shortcuts will be taken to avoid over-complexity. 80
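    A toy sketch of the first three commandments: magic at offset zero, a version number, and a clear chunk structure whose unknown chunks can be skipped. The format is entirely hypothetical: the `TOYF` magic, the chunk names and the field sizes are invented for this example.

```python
import struct

MAGIC = b"TOYF"   # hypothetical 4-byte magic, always at offset zero
VERSION = 1       # hypothetical version number


def build(chunks) -> bytes:
    """Serialize (type, payload) pairs as: magic, version, then typed and sized chunks."""
    out = MAGIC + struct.pack(">H", VERSION)
    for ctype, payload in chunks:
        out += struct.pack(">4sI", ctype, len(payload)) + payload
    return out


def parse(data: bytes, known=(b"DATA", b"META")):
    """Return (version, known chunks, skipped chunk types)."""
    if data[:4] != MAGIC:
        raise ValueError("magic not found at offset zero")
    if len(data) < 6:
        raise ValueError("truncated header")
    (version,) = struct.unpack_from(">H", data, 4)
    pos, chunks, skipped = 6, [], []
    while pos < len(data):
        if pos + 8 > len(data):
            raise ValueError("truncated chunk header")
        ctype, length = struct.unpack_from(">4sI", data, pos)
        pos += 8
        if pos + length > len(data):
            raise ValueError("truncated chunk payload")
        if ctype in known:
            chunks.append((ctype, data[pos:pos + length]))
        else:
            skipped.append(ctype)  # unknown chunk: its size lets us step over it
        pos += length
    return version, chunks, skipped


if __name__ == "__main__":
    blob = build([(b"META", b"v1"), (b"NEWX", b"from the future"), (b"DATA", b"hello")])
    print(parse(blob))
```

    Because every chunk declares its own size, an old parser can step over a chunk type introduced later instead of failing on it.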
  66. Thanks for your attention! Acknowledgements: Marc Stevens, Philippe Teuwen, Stefan

    Kölbl, Atul Luykx, Daniel Bleichenbacher, David Buchanan, Sophie Schmieg, Yanick Fratantonio, and the Fabianis. 81
  67. COM programs (DOS) (executables under 64 KB) No structure whatsoever ->

    The whole file is copied in memory and blindly executed. Just a maximum file size (64 KB). Called "Transient commands" under CP/M 83
  68. Polyglot storage IRL: An aperture card: punch card + microfilm

    An analog picture with digital indexing. 84
  72. $ file selfmd5-release.zip selfmd5-release.zip: Sega Mega Drive / Genesis ROM

    image: "TOY MD5 COLLIDER" (GM 00000000-00, (C) MAKO 2017 ) $
    2964F721 7EEEF375 983F0420 725976C2 60101938 18BDD53D 332E8131 25244205
    04D9B9CE 80FF0958 EB01DAD4 9A4DAA18 AD894BEB A3A824B2 C94DB974 378499C2
    478D436C 255C79F3 A7B2A523 CBA811FB D7D0C870 1F1C6B5F 6EEBDFDF 4BA0AD41
    31D8B06A 020B9399 B897DB50 499C7713 879C2E0B DB0267DD FE27A567 DDA5487C
    2964F721 7EEEF375 983F0420 725976C2 601019B8 18BDD53D 332E8131 25244205
    04D9B9CE 80FF0958 EB01DAD4 9ACDAA18 AD894BEB A3A824B2 C94DB9F4 378499C2
    478D436C 255C79F3 A7B2A523 CBA811FB D7D0C8F0 1F1C6B5F 6EEBDFDF 4BA0AD41
    31D8B06A 020B9399 B897DB50 491C7713 879C2E0B DB0267DD FE27A5E7 DDA5487C
    4CFB0E37 5E7078A2 31260B95 4550524A
    Mako's “Toy MD5 Collider” for the Mega Drive dd49d7eb...
    Computing MD5 collisions… …on a Mega Drive
    1988: Sega Mega Drive, 16 bits @ 7.6 MHz
    1992: MD5 88
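    A short sketch that compares the two 128-byte blocks shown on this slide, byte by byte. The hex is transcribed from the slide and the helper name is made up. The script only reports where the blocks differ; it does not recompute the collision, which would also need the hashing state the blocks were computed for.

```python
# The two 128-byte blocks from the slide, transcribed as printed.
BLOCK_A = bytes.fromhex("""
2964F721 7EEEF375 983F0420 725976C2 60101938 18BDD53D 332E8131 25244205
04D9B9CE 80FF0958 EB01DAD4 9A4DAA18 AD894BEB A3A824B2 C94DB974 378499C2
478D436C 255C79F3 A7B2A523 CBA811FB D7D0C870 1F1C6B5F 6EEBDFDF 4BA0AD41
31D8B06A 020B9399 B897DB50 499C7713 879C2E0B DB0267DD FE27A567 DDA5487C
""")
BLOCK_B = bytes.fromhex("""
2964F721 7EEEF375 983F0420 725976C2 601019B8 18BDD53D 332E8131 25244205
04D9B9CE 80FF0958 EB01DAD4 9ACDAA18 AD894BEB A3A824B2 C94DB9F4 378499C2
478D436C 255C79F3 A7B2A523 CBA811FB D7D0C8F0 1F1C6B5F 6EEBDFDF 4BA0AD41
31D8B06A 020B9399 B897DB50 491C7713 879C2E0B DB0267DD FE27A5E7 DDA5487C
""")


def xor_diff(a: bytes, b: bytes) -> list:
    """Return (offset, a_byte XOR b_byte) for every differing position."""
    return [(i, x ^ y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


if __name__ == "__main__":
    for offset, delta in xor_diff(BLOCK_A, BLOCK_B):
        print(f"offset {offset:3d} (0x{offset:02X}): xor 0x{delta:02X}")
```

    Only six bytes differ, each by its top bit, three per 64-byte half. That looks consistent with the classic differential path for MD5 collisions, though that reading is an interpretation and not something stated on the slide.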
  73. Quite Ok Image format (2021) A fixed header. No room

    for any metadata. It defines an End Marker for data. -> So metadata can be appended? but there's no shortcut to quickly jump to it. -> A great concise data format, not a good file format. -> Hacks will be created (like for MP3). 89
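    A minimal sketch of the fixed QOI layout, following the public QOI specification: a 14-byte header (magic `qoif`, big-endian width and height, channels, colorspace) and an 8-byte end marker (seven 0x00 bytes, then 0x01). The function names are made up for this example, and the sample files contain no pixel chunks, so they are not displayable images.

```python
import struct

QOI_MAGIC = b"qoif"
QOI_END = b"\x00" * 7 + b"\x01"       # end marker defined by the QOI spec
QOI_HEADER = struct.Struct(">4sIIBB")  # magic, width, height, channels, colorspace


def parse_qoi_header(data: bytes) -> dict:
    """Decode the fixed 14-byte QOI header."""
    if len(data) < QOI_HEADER.size:
        raise ValueError("file too short for a QOI header")
    magic, width, height, channels, colorspace = QOI_HEADER.unpack_from(data)
    if magic != QOI_MAGIC:
        raise ValueError("not a QOI file")
    return {"width": width, "height": height,
            "channels": channels, "colorspace": colorspace}


def ends_with_marker(data: bytes) -> bool:
    """True when nothing follows the end marker."""
    return data.endswith(QOI_END)


if __name__ == "__main__":
    bare = QOI_HEADER.pack(QOI_MAGIC, 2, 3, 4, 0) + QOI_END
    print(parse_qoi_header(bare), ends_with_marker(bare))
    print(ends_with_marker(bare + b"appended metadata"))
```

    The header stores no data length or offset, so finding where appended metadata starts means decoding the chunk stream up to the end marker.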
  74. A genuine PrintFox file: avanger.gb (G = Gesamtbild, German for "full image") 91

    +0 +1 +2 +3 +4 +5 +6 +7 +8 +9 +A +B +C +D +E +F
    .G 9B 4F 00 FF FE 9B 07 00 FF 0F 9B 8A 00 FF F9 .. ..
    .G: Signature
    9B | 4F 00 | FF: RLE Marker (9B), 4F Length, FF Repeated value
    9B | 07 00 | FF: RLE Marker (9B), 07 Length, FF Repeated value
    9B | 8A 00 | FF: RLE Marker (9B), 8A Length, FF Repeated value
  75. PrintFox FP via TrID A C64 image format from the

    1980s. The file structure is just a single-letter signature, then pure RLE data. Cf. C64-Wiki. A bad structure, but a sign of the times. -> many false positives (FPs): 1.8 M files on VirusTotal, yet only a handful of actual PrintFox files. 92