Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Deconstructing an Abstraction to Reconstruct an...

Deconstructing an Abstraction to Reconstruct an Outage (SREcon EMEA 2023 edition)

We all rely on abstractions to build the applications we use day-to-day. It's easy for those abstractions to feel like impenetrable walls, hiding scary low-level parts of the system - especially for a complex piece of software like a database. That needn't be the case!

In this talk, we'll explore the aftermath of a complex outage in a Postgres cluster. We'll retrace the steps we took to reliably reproduce the failure in a local environment and pull out lessons about debugging complex systems along the way. At one point, we'll dive into the depths of how Postgres represents data on disk and realise that even unfamiliar layers of a system don't need to be scary.

Chris Sinjakli

October 12, 2023
Tweet

More Decks by Chris Sinjakli

Other Decks in Programming

Transcript

  1. Hi

  2. 1. RAID array loses disks 2. Kernel sets fi lesystem

    read-only 3. Pacemaker detects primary failure 4. Synchronous replica crash 5. Suspicious log on synchronous replica
  3. 1. RAID array loses disks 2. Kernel sets fi lesystem

    read-only 3. Pacemaker detects primary failure 4. Synchronous replica crash 5. Suspicious log on synchronous replica
  4. 1. RAID array loses disks 2. Kernel sets fi lesystem

    read-only 3. Pacemaker detects primary failure 4. Synchronous replica crash 5. Suspicious log on synchronous replica
  5. 1. RAID array loses disks 2. Kernel sets fi lesystem

    read-only 3. Pacemaker detects primary failure 4. Synchronous replica crash 5. Suspicious log on synchronous replica
  6. 1. RAID array loses disks 2. Kernel sets fi lesystem

    read-only 3. Pacemaker detects primary failure 4. Synchronous replica crash 5. Suspicious log on synchronous replica
  7. 2023-02-24 17:23:01 GMT LOG: restored log file "000000020000000000000003" from archive

    2023-02-24 17:23:02 GMT LOG: invalid record length at 0/3000180 Suspicious log on synchronous replica
  8. 1. RAID array loses disks 2. Kernel sets fi lesystem

    read-only 3. Pacemaker detects primary failure 4. Synchronous replica crash 5. Suspicious log on synchronous replica
  9. 1. RAID array loses disks 2. Kernel sets fi lesystem

    read-only 3. Pacemaker detects primary failure 4. Synchronous replica crash 5. Suspicious log on synchronous replica
  10. 1. RAID array loses disks 2. Kernel sets fi lesystem

    read-only 3. Pacemaker detects primary failure 4. Synchronous replica crash 5. Suspicious log on synchronous replica
  11. KILL(1) General Commands Manual KILL(1) NAME kill – terminate or

    signal a process SYNOPSIS kill [-s signal_name] pid ... kill -l [exit_status] kill -signal_name pid ... kill -signal_number pid ... DESCRIPTION The kill utility sends a signal to the processes specified by the pid operands. Only the super-user may send signals to other users' processes. The options are as follows:
  12. # on primary - hard kill kill -SIGKILL <main_pid> #

    on synchronous replica - subprocess crash kill -SIGABRT <subprocess_pid>
  13. # on primary - hard kill kill -SIGKILL <main_pid> #

    on synchronous replica - subprocess crash kill -SIGABRT <subprocess_pid>
  14. 1. RAID array loses disks 2. Kernel sets fi lesystem

    read-only 3. Pacemaker detects primary failure 4. Synchronous replica crash 5. Suspicious log on synchronous replica
  15. 1. RAID array loses disks 2. Kernel sets fi lesystem

    read-only 3. Pacemaker detects primary failure 4. Synchronous replica crash 5. Suspicious log on synchronous replica
  16. 2023-02-24 17:23:01 GMT LOG: restored log file "000000020000000000000003" from archive

    2023-02-24 17:23:02 GMT LOG: invalid record length at 0/3000180 Suspicious log on synchronous replica
  17. [2023-02-26 23:02:37Z] GET / - 200 [2023-02-26 23:02:49Z] GET /favicon.ico

    - 200 [2023-02-26 23:02:52Z] POST /login - 200 [2023-02-26 23:33:52Z] POST /posts - 201 [2023-02-26 23:33:57Z] GET /posts/binary-logs—talk - 200 What we normally mean by logs
  18. INSERT INTO users VALUES ('codd'); INSERT INTO users VALUES ('lovelace');

    INSERT INTO users VALUES ('turing'); Some extremely boring SQL
  19. INSERT INTO users VALUES ('codd'); INSERT INTO users VALUES ('lovelace');

    INSERT INTO users VALUES ('turing'); ‑ Wrote 'codd' into table 'users' Wrote 'lovelace' into table 'users' Wrote 'turing' into table 'users' A different kind of logs 
 (if they were textual)
  20. Index Table id 1 2 💥 . id username 1

    codd 2 lovelace 3 turing
  21. We can replay this operation INSERT INTO users VALUES ('codd');

    INSERT INTO users VALUES ('lovelace'); INSERT INTO users VALUES ('turing'); ‑ Wrote 'codd' into table 'users' Wrote 'lovelace' into table 'users' Wrote 'turing' into table 'users'
  22. 2023-02-24 17:23:01 GMT LOG: restored log file "000000020000000000000003" from archive

    2023-02-24 17:23:02 GMT LOG: invalid record length at 0/3000180 Suspicious log on synchronous replica
  23. 2023-02-24 17:23:01 GMT LOG: restored log file "000000020000000000000003" from archive

    2023-02-24 17:23:02 GMT LOG: invalid record length at 0/3000180 Suspicious log on synchronous replica
  24. $ git checkout REL9_4_26 # we were running 9.4 $

    git grep -n "invalid record length" src/backend/access/transam/xlogreader.c:295: [...] src/backend/access/transam/xlogreader.c:604: [...] src/backend/access/transam/xlogreader.c:678: [...] Let's fi nd the error
  25. src/backend/access/transam/xlogreader.c:291-300: { /* XXX: more validation should be done here

    */ if (total_len < SizeOfXLogRecord) { report_invalid_record(state, "invalid record length at %X/%X", (uint32) (RecPtr >> 32), (uint32) RecPtr); goto err; } gotheader = false; } Let's fi nd the error
  26. src/backend/access/transam/xlogreader.c:291-300: { /* XXX: more validation should be done here

    */ if (total_len < SizeOfXLogRecord) { report_invalid_record(state, "invalid record length at %X/%X", (uint32) (RecPtr >> 32), (uint32) RecPtr); goto err; } gotheader = false; } Let's fi nd the error
  27. src/backend/access/transam/xlogreader.c:291-300: { /* XXX: more validation should be done here

    */ if (total_len < SizeOfXLogRecord) { report_invalid_record(state, "invalid record length at %X/%X", (uint32) (RecPtr >> 32), (uint32) RecPtr); goto err; } gotheader = false; } Let's fi nd the error
  28. src/include/access/xlog.h:41-56: typedef struct XLogRecord { uint32 xl_tot_len; /* total len

    of entire record */ TransactionId xl_xid; /* xact id */ uint32 xl_len; /* total len of rmgr data */ uint8 xl_info; /* flag bits, see below */ RmgrId xl_rmid; /* resource manager for this record */ /* 2 bytes of padding here, initialize to zero */ XLogRecPtr xl_prev; /* ptr to previous record in log */ pg_crc32 xl_crc; /* CRC for this record */ /* If MAXALIGN==8, there are 4 wasted bytes here */ /* ACTUAL LOG DATA FOLLOWS AT END OF STRUCT */ } XLogRecord; Let's fi nd the error
  29. src/include/access/xlog.h:41-56: typedef struct XLogRecord { uint32 xl_tot_len; /* total len

    of entire record */ TransactionId xl_xid; /* xact id */ uint32 xl_len; /* total len of rmgr data */ uint8 xl_info; /* flag bits, see below */ RmgrId xl_rmid; /* resource manager for this record */ /* 2 bytes of padding here, initialize to zero */ XLogRecPtr xl_prev; /* ptr to previous record in log */ pg_crc32 xl_crc; /* CRC for this record */ /* If MAXALIGN==8, there are 4 wasted bytes here */ /* ACTUAL LOG DATA FOLLOWS AT END OF STRUCT */ } XLogRecord; Let's fi nd the error
  30. src/backend/access/transam/xlogreader.c:291-300: { /* XXX: more validation should be done here

    */ if (total_len < SizeOfXLogRecord) { report_invalid_record(state, "invalid record length at %X/%X", (uint32) (RecPtr >> 32), (uint32) RecPtr); goto err; } gotheader = false; } What was that check doing?
  31. What was that check doing? Size the record says it

    is Smallest possible size it can be src/backend/access/transam/xlogreader.c:291-300: { /* XXX: more validation should be done here */ if (total_len < SizeOfXLogRecord) { report_invalid_record(state, "invalid record length at %X/%X", (uint32) (RecPtr >> 32), (uint32) RecPtr); goto err; } gotheader = false; }
  32. INSERT INTO users VALUES ('codd'); INSERT INTO users VALUES ('lovelace');

    INSERT INTO users VALUES ('turing'); ‑ Wrote 'codd' into table 'users' Wrote 'lovelace' into table 'users' Wrote 'turing' into table 'users' A different kind of logs 
 (if they were textual)
  33. INSERT INTO users VALUES ('codd'); INSERT INTO users VALUES ('lovelace');

    INSERT INTO users VALUES ('turing'); Some extremely boring SQL
  34. Decimal Hexadecimal Character 62 3E > 63 3F ? 64

    40 @ 65 41 A 66 42 B Same data, rendered differently
  35. INSERT INTO repro VALUES ('A'); INSERT INTO repro VALUES ('AB');

    INSERT INTO repro VALUES ('ABC'); INSERT INTO repro VALUES ('ABCD'); INSERT INTO repro VALUES ('ABCDE'); ... Some even more boring SQL
  36. Notice anything? Decimal Hexadecimal Character 63 3F ? 64 40

    @ 65 41 A The data we inserted Familiar characters
  37. wal_file_name = ARGV[0] puts wal_file_name wal_contents = IO.read(wal_file_name, encoding: "BINARY")

    hex = wal_contents.unpack("H*").first replaced = hex.gsub(/3f(000000.+41424300)/, "00\\1") bindata = [replaced].pack("H*") File.write(wal_file_name + ".broken", bindata) break_wal.rb
  38. wal_file_name = ARGV[0] puts wal_file_name wal_contents = IO.read(wal_file_name, encoding: "BINARY")

    hex = wal_contents.unpack("H*").first replaced = hex.gsub(/3f(000000.+41424300)/, "00\\1") # Replaces 'ABC' size bindata = [replaced].pack("H*") File.write(wal_file_name + ".broken", bindata) break_wal.rb
  39. 2023-02-28 19:24:11 GMT LOG: restored log file "000000020000000000000003" from archive

    2023-02-28 19:24:11 GMT LOG: invalid record length at 0/3000148 We reproduced the error!
  40. 1. RAID array loses disks 2. Kernel sets fi lesystem

    read-only 3. Pacemaker detects primary failure 4. Synchronous replica crash 5. Suspicious log on synchronous replica
  41. 1. RAID array loses disks 2. Kernel sets fi lesystem

    read-only 3. Pacemaker detects primary failure 4. Synchronous replica crash 5. Suspicious log on synchronous replica 6. ...
  42. 1. RAID array loses disks 2. Kernel sets fi lesystem

    read-only 3. Pacemaker detects primary failure 4. Synchronous replica crash 5. Suspicious log on synchronous replica 6. Backup VIP on synchronous replica
  43. 1. RAID array loses disks 2. Kernel sets fi lesystem

    read-only 3. Pacemaker detects primary failure 4. Synchronous replica crash 5. Suspicious log on synchronous replica 6. Backup VIP on synchronous replica
  44. "We noticed a problem." "We fi xed the problem." "We'll

    make sure the problem doesn't happen again."
  45. Image credits • Programmer's Laptop - Wall Boat - Public

    Domain - https://www. fl ickr.com/photos/ wallboat/36819065315/ • Pouring Latte Art - Craft Coffee Spot - CC-BY - https://www. fl ickr.com/photos/ 195403219@N08/52200966448/ • microscope - Milosz1 - CC-BY - https://www. fl ickr.com/photos/mikolski/3269906279 • Hard Disk Guts - CC-BY - https://www. fl ickr.com/photos/mattandkim/97533589/ • Corsair ForceGT 180GB - CC-BY - https://www. fl ickr.com/photos/ruocaled/8173124575/
  46. Image credits • Server - The Noun Project (via WikiMedia)

    - CC0 - https://commons.wikimedia.org/wiki/ File:Server_-_The_Noun_Project.svg • Rope - Robo Android - CC-BY - https://www. fl ickr.com/photos/ 49140926@N07/6798304070/ • Stargazing - Max Delaquis - CC-BY - https://www. fl ickr.com/photos/ 115000114@N07/28861043652