talks about the importance of being a generalist
• I think specializing is fine (and normal as your career advances), but it's VITAL to keep a generalist perspective
• Disaster porn!
taxis
  ◦ Web booking integration for taxi fleets
  ◦ In-car payment hardware (PIM)
• What's a PIM?
  ◦ Passenger Information Monitor
  ◦ 7" HD touchscreen
  ◦ Credit card swipe
  ◦ Wired into cab hardware and dispatch system
  ◦ Uses cellular communication to talk to TM
  ◦ Regular GPS events over UDP
  ◦ Payment transactions over HTTPS
drivers in Los Angeles begin reporting failures when swiping CCs
• Embedded hardware team recalls a few cabs and investigates local log files
• Reports problems during SSL handshake to RideCharge servers
• Tech Ops team remaps httpd to the same libcrypto.so and libssl.so version as the PIM using libmap.conf(5)
• Problem vanishes! HOORAY!!! Beer!
of failing CC swipes across the entire SoCal region
• Hardware team pulls more vehicles and notices the same SSL handshake problem
• Tech Ops team is unable to correlate this to a drop in traffic
• Furthermore, Tech Ops is still seeing regular GPS updates from ALL active cabs!
any problems
• (Sound familiar to anyone?)
• I start running the standard toolkit looking for patterns
  ◦ tcpdump
  ◦ traceroute
  ◦ NMAP
• NMAP is giving me some inconsistent results
TCP connection?
  ◦ SYN (Hey, you there?)
  ◦ SYN/ACK (Yeah, what's up?)
  ◦ ACK (Cool, let's talk!)
• What happens if you connect to a port that doesn't have a service bound to it?
  ◦ SYN (Hey, you there?)
  ◦ RST (Leave me alone!)
• So why am I only getting a RST every now and then? Why do I see timeouts instead?
• This is starting to smell like a routing problem
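The closed-port behavior above can be checked from any host with a plain TCP connect. The following is a minimal sketch, not the tooling used in the talk: `probe` and its return labels are names made up for illustration. It relies on the fact that a RST in reply to a SYN surfaces in Python as `ConnectionRefusedError`, while a silently dropped SYN surfaces as a timeout.

```python
import socket

def probe(host, port, timeout=2.0):
    """Attempt one TCP connect and classify the result.

    'open'    -> handshake completed (SYN, SYN/ACK, ACK)
    'closed'  -> peer answered the SYN with a RST
    'timeout' -> nothing came back before the timeout expired
    'error'   -> any other socket-level failure (e.g. no route)
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect((host, port))
        return "open"
    except ConnectionRefusedError:
        return "closed"
    except socket.timeout:
        return "timeout"
    except OSError:
        return "error"
    finally:
        s.close()
```

On a healthy path, probing a port with no service bound should come back `closed` every time; a mix of `closed` and `timeout` for the same host is the inconsistency described above.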
updates over UDP from all the cabs, I can use this to identify the IP of a cab and its location at a point in time
• We know the expected behavior when attempting a connection to a closed port
• Let's run some tests and gather some data
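One way to organize that data gathering is to tally probe outcomes per cab IP and flag any cab that ever timed out. This is a hedged sketch under assumptions: the `(cab_ip, outcome)` record shape and the function names `tally` and `suspicious` are hypothetical, and the outcome labels are assumed to be strings such as `"closed"` and `"timeout"`.

```python
from collections import Counter, defaultdict

def tally(probes):
    """Group probe outcomes by cab IP.

    probes: iterable of (cab_ip, outcome) pairs (hypothetical record shape).
    Returns a dict mapping cab_ip -> Counter of outcomes.
    """
    by_cab = defaultdict(Counter)
    for cab_ip, outcome in probes:
        by_cab[cab_ip][outcome] += 1
    return by_cab

def suspicious(by_cab):
    """Return the sorted cab IPs that timed out at least once.

    A closed port is expected to answer every SYN with a RST,
    so any timeout is worth a closer look.
    """
    return sorted(ip for ip, counts in by_cab.items() if counts["timeout"] > 0)
```

Joining the flagged IPs against the GPS events for the same time window would then show where the affected cabs were located.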
dozen calls to the ISP and as many "escalations", we landed on a conference call with some lead network engineers
• After 6 hours on this conference call, reiterating the problem and showing the data, one engineer asks us to "hold tight"
• Things get very quiet...
• Like magic, all of my tests start succeeding!
SoCal region to a new data center in Anaheim. This was an epic failure and they rolled back
• On June 12th, the ISP migrated again to Anaheim "successfully"
• Cell traffic is pooled by connection, and one of the pools was routing asymmetrically
• Asymmetric routing + stateful firewalls = BAD
• Updating the routing tables fixed everything
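Why asymmetric routing and stateful firewalls mix badly can be shown with a toy model. This is an illustrative sketch only, not the ISP's actual topology: the `StatefulFirewall` class and the `"cab"` / `"server"` labels are made up. A stateful firewall admits a reply only if it saw the matching outbound packet, so a reply that returns through a different firewall finds no state and is dropped.

```python
class StatefulFirewall:
    """Toy firewall that tracks flows it has seen leave (hypothetical model)."""

    def __init__(self):
        self.flows = set()

    def outbound(self, src, dst):
        # Record state for the flow so that replies can be matched later.
        self.flows.add((src, dst))
        return True

    def inbound(self, src, dst):
        # Admit a reply only if this firewall saw the original outbound packet.
        return (dst, src) in self.flows


fw_a = StatefulFirewall()
fw_b = StatefulFirewall()

# The SYN leaves through firewall A.
fw_a.outbound("cab", "server")

# Symmetric path: the SYN/ACK returns through A and is admitted.
symmetric_ok = fw_a.inbound("server", "cab")

# Asymmetric path: the SYN/ACK returns through B, which has no state for it.
asymmetric_ok = fw_b.inbound("server", "cab")
```

Here `symmetric_ok` is `True` and `asymmetric_ok` is `False`. From the sender's side a dropped reply looks like a timeout, which would be consistent with TCP handshakes failing while one-way UDP GPS updates kept arriving.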
Understanding the full stack means being able to troubleshoot problems at all layers
• Fluid communication between sysadmins, developers, hardware engineers, and network engineers requires generalists
• Having fewer people in the war room results in faster problem solving
• This saves time and money and makes your team more valuable to the business