Postgres load balancing is secretly broken-the cancellation problem | Uptime 2022 | Jelte Fennema

Postgres load balancing is secretly broken: The cancellation problem Jelte
Fennema (@JelteF) Developing Citus and Postgres at Microsoft & PgBouncer maintainer 2022-09-14

©Microsoft Corporation Azure What am I going to talk about?
• Load balancing across Postgres servers • Read replicas • Citus • Low level details about cancelling queries • PgBouncer

©Microsoft Corporation Azure What is load balancing? • Sharing workload
across servers • Different servers handle independent requests

©Microsoft Corporation Azure Why load balancing for Postgres? • Performance
• Scaling reads with Postgres read replicas • Scaling writes with the Citus extension for Postgres

©Microsoft Corporation Azure Scaling reads or writes? Why? Clients

©Microsoft Corporation Azure What are read replicas? Client Client Clients

©Microsoft Corporation Azure What is Citus?

©Microsoft Corporation Azure How to do load balancing? • Option
1: Client-side load balancing Client knows about all servers and connects to all of them • Option 2: Server-side load balancing Client only knows a single logical server, that server “secretly” forwards requests to multiple servers

©Microsoft Corporation Azure Client-side load balancing explained Common approaches: •
Hardcoded list of IPs or domains used by client • Multiple DNS records for the same domain, one for each server • With smart client • Or with round-robin resolver Downsides: • Every client needs to know about all the servers • Every client needs logic to choose a server • Caching: what happens when you add a server

©Microsoft Corporation Azure Client-side load balancing with Postgres • JDBC
(Java) and Npgsql (C#) support • PgBouncer released client-side load balancing last March in v1.17.0 • libpq does not support client-side load balancing natively I submitted a patch to add support for this: https://commitfest.postgresql.org/39/3679/ • Many DNS resolvers always return a single fixed result So, a single client does not load balance

©Microsoft Corporation Azure Server-side load balancing explained Common approaches: •
Proxy server • Software defined networking (SDN) Problems: • Proxy server introduces latency • Costs extra money Big advantage: • Clients don’t need special support and/or configuration

©Microsoft Corporation Azure Server-side load balancing with Postgres Options: 1.
A dedicated PgBouncer server configured in client-side load balancing mode • Can use transaction or session load balancing • Extra network hop introduces latency • Single threaded 2. Off the shelf TCP load balancer • Each TCP stream is assigned to a different server • Probably has “fancy” things like health-checks • No extra network hop in case of SDN based load balancer

©Microsoft Corporation Azure Decision seems simple Off the shelf TCP
based load balancer it is

©Microsoft Corporation Azure Not so fast • Important assumption: Different
TCP streams are independent

©Microsoft Corporation Azure How Postgres cancellation requests work Client

©Microsoft Corporation Azure How Postgres cancellation requests work Client CANCEL
SECRET-TOKEN-123

©Microsoft Corporation Azure So what happens with cancellations and a
load balancer Client Load balancer

load balancer Client Load balancer CANCEL SECRET-TOKEN-123

load balancer Client Load balancer CANCEL SECRET-TOKEN-123 CANCEL SECRET-TOKEN-123

load balancer Client Load balancer CANCEL SECRET-TOKEN-123 CANCEL SECRET-TOKEN-123 I don’t know SECRET-TOKEN-123. This is probably a hacker. Let’s ignore this request

©Microsoft Corporation Azure Not so fast (continued) • Important assumption:
Different TCP streams are independent • Correct: Query on same Postgres session == same TCP stream • Correct: New session == new TCP stream • Wrong: Query cancellation request == new TCP stream

©Microsoft Corporation Azure The cancellation problem • Query cancellations end
up at the wrong server most of the time • Workaround: Trigger cancellations multiple time at the client side • Solutions: None exist so far

©Microsoft Corporation Azure Idea: Make Postgres servers know about each
other Client Load balancer

other Client Load balancer CANCEL SECRET-TOKEN-123

other Client Load balancer CANCEL SECRET-TOKEN-123 CANCEL SECRET-TOKEN-123

other Client Load balancer CANCEL SECRET-TOKEN-123 CANCEL SECRET-TOKEN-123 I don’t know SECRET-TOKEN-123 This was probably meant for another server. Let me ask Server A

other Client Load balancer CANCEL SECRET-TOKEN-123 CANCEL SECRET-TOKEN-123

©Microsoft Corporation Azure How to make this idea work? •
Postgres does not support this out of the box • What can we do to make it work?

©Microsoft Corporation Azure Option 1: Postgres extension • No extension
hooks exist for cancellation related code  • Getting hooks (or built-in functionality) in Postgres will take a long time and won’t work on old Postgres versions • Need to build cancellation forwarding support

©Microsoft Corporation Azure Option 2: Modify PgBouncer • PgBouncer can
run in front of any Postgres server version • PgBouncer already has cancellation forwarding code to forward cancellations to Postgres • This code only needs to be modified to send to other PgBouncer servers • PR for this can be found here: https://github.com/pgbouncer/pgbouncer/pull/666

©Microsoft Corporation Azure Option 2: Modify PgBouncer Client Load balancer
PgBouncer B PgBouncer A Postgres A Postgres B

CANCEL SECRET-TOKEN-123 PgBouncer B PgBouncer A Postgres A Postgres B

CANCEL SECRET-TOKEN-123 CANCEL SECRET-TOKEN-123 PgBouncer B PgBouncer A Postgres A Postgres B

CANCEL SECRET-TOKEN-123 CANCEL SECRET-TOKEN-123 PgBouncer B PgBouncer A Postgres A Postgres B I don’t know SECRET-TOKEN-123 This was probably meant for another server. Let me ask Server A

CANCEL SECRET-TOKEN-123 CANCEL SECRET-TOKEN-123 PgBouncer B PgBouncer A Postgres A Postgres B

©Microsoft Corporation Azure Making PgBouncer performant • Single-threaded design becomes
a bottleneck on high throughput servers • PgBouncer supports multi-process by using so_reuseport=1 • The Linux kernel then load-balances TCP streams across all processes • So, what we end up with is a multi-layered load balancer

©Microsoft Corporation Azure What we end up with Client Load
balancer PgBouncer B1 PgBouncer A1 Postgres A Postgres B PgBouncer A2 Linux kernel PgBouncer B2 Linux kernel

©Microsoft Corporation Azure What we end up with Client Load
balancer CANCEL SECRET-TOKEN-123 PgBouncer B1 PgBouncer A1 Postgres A Postgres B PgBouncer A2 Linux kernel PgBouncer B2 Linux kernel

©Microsoft Corporation Azure The final problem • To which one
should PgBouncer forward it? • Sending the cancellation to all servers is quite heavy

©Microsoft Corporation Azure The final solution • Encode an identifier
in the cancellation token: e.g. SECRET-TOKEN-123-A1

Postgres load balancing is secretly broken-the ...

Postgres load balancing is secretly broken-the cancellation problem | Uptime 2022 | Jelte Fennema

More Decks by Citus Data

Other Decks in Technology

Featured

Transcript