
Linux at Cloudflare

majek04
March 22, 2019

Transcript

  1. Edge Network - software management
    • Uniform configuration everywhere
    • No virtualization, no containers, raw metal
    • Thousands of IP addresses/subnets (anycast)
    • Multiple applications
      ◦ HTTP (HTTP, TLS 1.3, HTTP2, QUIC)
      ◦ DNS (Auth, Resolver)
      ◦ Other
  2. DDoS mitigation
    • iptables: xt_bpf, connlimit, hashlimits, ipsets
    • syn cookies
    • SO_FILTER
    • XDP (locks!)
    • https://lists.openwall.net/netdev/2019/02/22/87
    • https://patchwork.ozlabs.org/cover/998940/
    • https://speakerdeck.com/majek04/lessons-from-defending-the-indefensible
    • https://netdevconf.org/2.1/papers/Gilberto_Bertin_XDP_in_practice.pdf
  3. DDoS mitigation - XDP
    • Implementing token buckets without locking is hard
    • More concurrency primitives (cmpxchg16b?)
    • "Traffic policing in eBPF: applying token bucket algorithm"
      ◦ http://vger.kernel.org/lpc-bpf.html#session-9
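The token bucket named on slide 3 can be sketched as follows. This is a minimal single-threaded Python illustration, not the XDP/eBPF code from the talk; the class name `TokenBucket` and its fields are hypothetical. It shows that every packet requires a read-modify-write of two fields together (`tokens` and `last`). Under XDP that update would have to be atomic across CPUs without a lock, which is plausibly why a 16-byte compare-and-swap is raised on the slide.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Hypothetical single-threaded token bucket (illustration only)."""
    rate: float          # tokens added per second
    burst: float         # maximum number of tokens the bucket can hold
    tokens: float = 0.0  # tokens currently available
    last: float = 0.0    # timestamp of the previous update, in seconds

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill for the time elapsed since the previous packet, capped at burst.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        # Both fields above changed together; a concurrent, lock-free version
        # would need to publish them as one atomic unit.
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With `rate=10` and `burst=5`, a full bucket admits five back-to-back packets and drops the sixth until time passes and tokens are refilled.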
  4. Socket dispatch - DoS considerations
    • The case of 30k UDP sockets
    • Solution: eBPF token bucket in SO_FILTER
    • Diagram labels: 192.0.2.2:53, 192.0.2.1:53, 192.0.2.0:53, 0.0.0.0:53 + SO_FILTER
  5. Socket dispatch - zero downtime restart for QUIC
    • Connected UDP sockets for zero-downtime server restart
    • Diagram labels: listening sockets on :443; connected sockets 10.0.0.1 --> 192.0.2.0:443, 10.0.0.2 --> 192.0.2.0:443, 10.2.0.9 --> 192.0.2.0:443, 10.3.0.8 --> 192.0.2.0:443
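The connected-socket idea on slide 5 can be illustrated with a small loopback experiment. This is a sketch under assumptions, not the setup from the talk: it assumes Linux UDP socket lookup (a connected socket is preferred over an unconnected one bound to the same address and port), uses `SO_REUSEADDR` so both server-side sockets can share the port, and the helper names `server_socket`, `client_socket` and `demo` are hypothetical.

```python
import socket


def server_socket(local, peer=None):
    """UDP socket on `local`; if `peer` is given, pin it to that one flow."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Lets the listener and the per-flow socket share the same address:port.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(local)
    if peer is not None:
        s.connect(peer)  # from now on only this peer's packets are delivered here
    s.settimeout(1.0)
    return s


def client_socket():
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind(("127.0.0.1", 0))
    return s


def demo():
    listener = server_socket(("127.0.0.1", 0))  # unconnected: catches new flows
    addr = listener.getsockname()
    client_a, client_b = client_socket(), client_socket()
    # Per-flow socket for client_a, bound to the same address:port as the listener.
    flow_a = server_socket(addr, peer=client_a.getsockname())
    try:
        client_a.sendto(b"from-a", addr)
        client_b.sendto(b"from-b", addr)
        got_flow = flow_a.recv(64)
        got_listener, _ = listener.recvfrom(64)
        return got_flow, got_listener
    finally:
        for s in (listener, flow_a, client_a, client_b):
            s.close()
```

One reading of the slide is that established QUIC flows stay on connected sockets like `flow_a` while a restarted server takes over the unconnected listener for new flows. Note that packets arriving between `bind()` and `connect()` can still land on the wrong socket, a race this sketch does not address.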
  6. Socket dispatch - AnyIP
    • heavy user of AnyIP https://blog.cloudflare.com/how-we-built-spectrum/
    • SO_BINDTOPREFIX http://patchwork.ozlabs.org/patch/602916/
    • TPROXY https://blog.cloudflare.com/how-we-built-spectrum/
    • TPROXY UDP
    • Table (columns: Single IP / Many subnets):
      ◦ Single port: bind() / SO_BINDTOPREFIX
      ◦ Many ports: TPROXY / TPROXY
  7. Transmission path - TPROXY UDP send()
    • Sending packets is hard
    • Table (columns: src IP / src port / dst IP / dst port):
      ◦ connected socket: - / - / - / -
      ◦ bind(INADDR_ANY): auto / - / selected / selected
      ◦ bind(INADDR_ANY) + IP_PKTINFO: PKTINFO / - / selected / selected
      ◦ bind(127.0.0.1, X) + IP_PKTINFO: PKTINFO / in bind / selected / selected
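The "bind(INADDR_ANY) + IP_PKTINFO" row of slide 7 can be sketched as below: the source address is chosen per packet through an `IP_PKTINFO` control message, while the source port stays whatever the socket was bound to. This is a Linux-only illustration under assumptions, not the talk's code: the helper names `send_from` and `demo` are hypothetical, the fallback value 8 for `IP_PKTINFO` is the Linux constant (used when the Python build does not export it), and the demo relies on 127.0.0.2 being a local address on the loopback interface.

```python
import socket
import struct

# Linux value of IP_PKTINFO; some Python versions do not export the constant.
IP_PKTINFO = getattr(socket, "IP_PKTINFO", 8)


def send_from(sock, payload, src_ip, dst):
    """Send `payload` to `dst`, asking the kernel to use `src_ip` as source."""
    # struct in_pktinfo { int ipi_ifindex; in_addr ipi_spec_dst; in_addr ipi_addr; }
    # ipi_ifindex = 0 leaves interface selection to routing; ipi_spec_dst is the
    # source address to use when sending.
    pktinfo = struct.pack(
        "@i4s4s", 0, socket.inet_aton(src_ip), socket.inet_aton("0.0.0.0")
    )
    return sock.sendmsg([payload], [(socket.IPPROTO_IP, IP_PKTINFO, pktinfo)], 0, dst)


def demo():
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 0))
    receiver.settimeout(1.0)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender.bind(("0.0.0.0", 0))  # INADDR_ANY: source port fixed here, source IP not
    try:
        send_from(sender, b"hello", "127.0.0.2", receiver.getsockname())
        data, (src_ip, src_port) = receiver.recvfrom(64)
        return data, src_ip, src_port == sender.getsockname()[1]
    finally:
        receiver.close()
        sender.close()
```

On a Linux host the receiver should see the packet arrive from the address passed to `send_from` rather than from the address routing would have picked, with the source port taken from the sender's `bind()`.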
  8. Tuning for the Internet
    • Custom initcwnd, BPF_SOCK_OPT
    • BBR
    • TCP_NOTSENT_LOWAT
    • TCP Fast Open
    • ECN
    • TCP Multipath
    • QUIC - tuning for UDP, like UDP GSO
    • More introspection - Listen Drops
    • https://blog.cloudflare.com/http-2-prioritization-with-nginx/
  9. Prometheus - ebpf_exporter
    • Backend for Prometheus
    • Allowing more detailed event views - like histograms for block I/O
    • https://blog.cloudflare.com/introducing-ebpf_exporter/
    • Matt Bostock's SREcon17 talk
  10. Upgrading kernel
    • LTS, inertia
    • out-of-tree drivers age fast (we were stuck on 3.18)
      ◦ fusion-io, sfc / ixgbe, netmap, glb-redirect
      ◦ custom patches (EPOLL_RR, SO_BINDTOPREFIX, XDP features)
    • hardware issues
      ◦ microcode bugs
      ◦ driver regressions
    • regressions
      ◦ SO_FILTER https://www.spinics.net/lists/netdev/msg555565.html
      ◦ nf_conncount https://www.spinics.net/lists/netfilter-devel/msg57316.html
      ◦ nf_nat_cleanup_conntrack https://bugzilla.kernel.org/show_bug.cgi?id=196821
      ◦ systemd disabling TSO https://blog.cloudflare.com/tracing-system-cpu-on-debian-stretch/
  11. BPF is everywhere
    • XDP for DDoS
    • XDP for load balancing
    • xt_bpf on iptables
    • SO_FILTER for application DDoS
    • SOCKMAP + kTLS within application
    • ebpf_exporter for metrics
    • future: BPF_SOCK_OPTS
    • future: cgroups
    • future: SO_BINDTOBPF ?
  12. Kernel bypass for application
    • Rarely kernel-bound
    • Kernel features
      ◦ iptables for DDoS
      ◦ SYN cookies
      ◦ RFC4821 tcp_mtu_probing
      ◦ BBR
      ◦ Kernel Debuggability
      ◦ tcpdump, sampling
  13. Core
    • A couple of large locations
    • Marathon + Mesos (custom load balancers, discovery)
    • Kubernetes (vxlan)
    • Kafka, ClickHouse, Ceph, HBase, PostgreSQL / CitusDB
    • https://blog.cloudflare.com/http-analytics-for-6m-requests-per-second-using-clickhouse/
  14. What about NIC tuning, MTU and QoS
    • Table (columns: traffic type / MTU / QoS: saturation):
      ◦ north (public) anycast: eyeball requests / 1500 / inbound: attack, outbound: traffic spike
      ◦ south (public) origin: origin pulls / 1500 / -
      ◦ east-west - L4LB: inbound requests / 1544 / -
      ◦ east-west - cache: cache traffic / jumbo / hot assets
    • MTU is hard
    • LRO can get disabled on large MSS
    • tc qdisc work better on physical device