Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Lessons Learned the Hard Way: Postgres in Produ...

Lessons Learned the Hard Way: Postgres in Production at GoCardless (PGDay UK Edition)

Talk video: https://www.youtube.com/watch?v=Tu-cf-Jki60

From the beginning, GoCardless has used Postgres for the stuff that matters. It’s the primary store behind our APIs and dashboards.

After 5 years, it’s fair to say we’ve seen some things. From spikes in latency as resources run out, through the weird contortions ORMs put your database through, and on to the perils of running a high-availability cluster - this talk has it all.

In the talk, we’ll run through some war stories from the last two years at GoCardless. By gathering them together, we’ll spot some recurring themes, and hopefully avoid fighting the same battles again!

Chris Sinjakli

July 05, 2016
Tweet

More Decks by Chris Sinjakli

Other Decks in Programming

Transcript

  1. -- Slow query (AccessShare) SELECT DISTINCT(customer_id) FROM payments; -- Forces

    this to queue (AccessExclusive) ALTER TABLE payments ADD COLUMN refunded boolean;
  2. -- Slow query (AccessShare) SELECT DISTINCT(customer_id) FROM payments; -- Forces

    this to queue (AccessExclusive) ALTER TABLE payments ADD COLUMN refunded boolean; -- Which blocks these (AccessShare) SELECT * FROM payments WHERE id = 123;
  3. … LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp23325.11166", size 1245184 LOG: temporary

    file: path "base/pgsql_tmp/pgsql_tmp23325.11165", size 8675328 LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp10729.11129", size 47734784 …
  4. ALTER SYSTEM SET temp_file_limit = -1; # disable CREATE INDEX

    pay_merch ON payments (merchant_id); ALTER SYSTEM SET temp_file_limit = …; # re-enable
  5. SET client_encoding TO "utf8"; SET client_min_messages TO "panic"; SET standard_conforming_strings

    = on; SET client_min_messages TO "warning"; SET time zone "UTC";
  6. SET client_encoding TO "utf8"; SET client_min_messages TO "panic"; SET standard_conforming_strings

    = on; SET client_min_messages TO "warning"; SET time zone "UTC";
  7. $ git show a456acb2f2 commit a456acb2f2af8365eb9151c7cd2d5a10c189d191 Author: Harry Marr <[email protected]>

    Date: Wed Oct 28 16:30:02 2015 +0000 Avoid disabling postgres errors # Enable standard-conforming strings if available. def set_standard_conforming_strings - old, self.client_min_messages = client_min_messages, 'panic' - execute('SET standard_conforming_strings = on', 'SCHEMA') rescue nil - ensure - self.client_min_messages = old + execute(<<-SQL, 'SCHEMA') + UPDATE pg_settings + SET setting = 'on' + WHERE name = 'standard_conforming_strings' + SQL end
  8. 70 3 1 26 3 1 Normal Incident App Unused

    Superuser (non-reserved) Superuser (reserved) App Superuser (rejected) Superuser (reserved) 97
  9. su postgres -c 'psql -c "select now()"' Change to the

    Postgres user Run a simple query Connect to the database
  10. # strace su postgres -c 'echo "hello"' getuid() = 0

    # ltrace su postgres -c 'echo "hello"' getlogin() = "chris" getpwnam_r(…) # blocks
  11. # strace su postgres -c 'echo "hello"' getuid() = 0

    # ltrace su postgres -c 'echo "hello"' getlogin() = "chris" getpwnam_r(…) # blocks !?
  12. the user name associated by the login activity with the

    controlling terminal of the current process
  13. the user name associated by the login activity with the

    controlling terminal of the current process
  14. # ps aux | grep '[p]acemakerd' | awk '{print $2}'

    9001 # cat /proc/9001/loginuid 1234
  15. # ps aux | grep '[p]acemakerd' | awk '{print $2}'

    9001 # cat /proc/9001/loginuid 1234 # getent passwd 1234 | cut -d':' -f1 chris
  16. uid = getuid() login = getlogin() passwd = passwd_from_name(login) if

    (passwd.uid == uid) return passwd 0 (root) chris {…} 0 == 1234
  17. uid = getuid() login = getlogin() passwd = passwd_from_name(login) if

    (passwd.uid == uid) return passwd else return passwd_from_uid(uid) 0 (root) chris {…} 0 == 1234
  18. Image credits • Heart emoji - https://github.com/mozilla/fxemoji/blob/ 9f68ca9c5bc51521f9ffe284e00ba8b7308e2c41/svgs/ FirefoxEmoji/u2764-redheart.svg •

    Elephants - https://www.flickr.com/photos/makeitkenya/ 22047623331/ • Campfire - https://www.flickr.com/photos/csakkarin/ 15499732208/ • Detour - https://www.flickr.com/photos/wwward0/8257660963/ • Kitten - https://www.flickr.com/photos/aigle_dore/7787096102
  19. References (Incident 1) • Safe Operations For High Volume PostgreSQL

    (Braintree) - https://www.braintreepayments.com/blog/safe-operations-for- high-volume-postgresql/ • Zero-downtime Postgres migrations - the hard parts - https:// gocardless.com/blog/zero-downtime-postgres-migrations-the- hard-parts/ • Zero-downtime Postgres migrations - a little help - https:// gocardless.com/blog/zero-downtime-postgres-migrations-a-little- help/
  20. References (Incident 1) • Postgres 9.5 Client Connection Defaults (lock_timeout,

    statement_timeout) - https://www.postgresql.org/docs/9.5/ static/runtime-config-client.html • Postgres 9.6 Client Connection Defaults (idle_in_transaction_session_timeout) - https:// www.postgresql.org/docs/9.5/static/runtime-config-client.html • Postgres 9.5 Error Reporting and Logging (log_lock_waits) - https://www.postgresql.org/docs/9.5/static/runtime-config- logging.html
  21. References (Incident 2) • Postgres 9.5 Resource Consumption (temp_file_limit) -

    https:// www.postgresql.org/docs/current/static/runtime-config- resource.html
  22. References (Incident 3) • PgBouncer feature matrix for pooling modes

    - https:// wiki.postgresql.org/wiki/ PgBouncer#Feature_matrix_for_pooling_modes • The Troubleshooting Tales: issues scaling Postgres connections - https:// gocardless.com/blog/the-troubleshooting-tales-issues-scaling-postgres- connections/ • Rails commit that fixed client_min_messages issue - https://github.com/ rails/rails/commit/a456acb2f2af8365eb9151c7cd2d5a10c189d191 • Rick Branson's tweet about the same problem in Django - https:// twitter.com/rbranson/status/675005104701870080
  23. References (Incident 4) • Postgres 9.5 Connections and Authentication (superuser_reserved_connections)

    - https:// www.postgresql.org/docs/9.5/static/runtime-config- connection.html
  24. References (Incident 5) • Julia Evans' strace zine - http://jvns.ca/blog/2015/04/14/

    strace-zine/ • Game Day Exercises at Stripe: Learning from `kill -9` - https:// stripe.com/blog/game-day-exercises-at-stripe • Scott Andreas' tweet about investigating things - https:// twitter.com/cscotta/status/535898821914419200
  25. References (Incident 5) • Current code in `su` to look

    up user - https://github.com/ shadow-maint/shadow/blob/ ef45bb2496182b5df90ad0323bef75d1a5d69887/contrib/ pwdauth.c#L127 • Much older similar code in `su` (1994 or earlier) - https:// github.com/freebsd/freebsd/blob/ b5c3fb9427806b740f7df3e43a1513e1f5fa840b/usr.bin/su/ su.c#L255
  26. References (Incident 5) • getuid manual - http://pubs.opengroup.org/onlinepubs/ 009695399/functions/getuid.html •

    getlogin manual - http://pubs.opengroup.org/onlinepubs/ 009695399/functions/getlogin.html • getpwnam manual - http://pubs.opengroup.org/onlinepubs/ 9699919799/functions/getpwnam.html • getpwuid manual - http://pubs.opengroup.org/onlinepubs/ 009695399/functions/getpwuid.html
  27. References (Misc) • Practical Postmortems at Etsy - https://www.infoq.com/articles/ postmortems-etsy

    • Scott Andreas' tweet about investigating things - https:// twitter.com/cscotta/status/535898821914419200