Beyond 200 OK: What Happens to Your API Responses After They Leave Your Server

Co-presented with Jignesh Patel at CFCamp 2026.

Server says 200, logs are clean, dashboards are green, yet users face crashes or slowdowns. Back end success isn’t user success. Learn how to close the observability gap by extending telemetry to mobile and web, so “works on my server” truly means “works for the user.”

Most observability talks focus on the server because that’s where the tooling is mature. However, for a growing number of applications, the API consumer isn't a browser on a fiber connection; it’s a mid-range mobile device on a train, switching cell towers. When things break there, your server-side monitoring won't tell you a thing.

This session covers the practical side of extending observability beyond the server. We will explore:

- Real-world examples where valid JSON crashed mobile apps, or where payloads were too heavy for low-memory devices in developing markets.
- How to handle thousands of OS combinations, "store-and-forward" telemetry for offline-first apps, and clients killed by the OS mid-request.
- Closing the loop: the practical implementation of distributed tracing. We’ll discuss the current state of OpenTelemetry (OTel) for mobile, and where vendor SDKs still fill the gaps and do a better job.
- Moving from uptime monitoring to smooth sessions: redefining SLOs to connect backend reliability targets to actual user experience, while keeping telemetry costs under control.

This isn’t a talk about learning mobile development; it’s a talk about knowing whether the APIs you build are actually delivering a good experience.

Kai Koenig

June 19, 2026

Transcript

  1. Beyond 200 OK. What happens to your API responses after they leave your server

    CFCAMP 2026 · MUNICH
    Jignesh Patel · Kai Koenig
  2. Two people - two angles of the same problem

    Beyond 200 OK · CFCamp 2026 · INTRO

    • CLIENT / MOBILE: Jignesh Patel, Solution Architect, Enrich Technolabs (Mobile · Client · Product)
    • BACKEND / INFRA: Kai Koenig, Software Architect, Ventego Creative (Backend · APIs · Infra · JVM / CFML)
    • OVERLAP: distributed systems · observability
  3. PART ONE: It works on my server

    The discrepancy between server monitoring and real user monitoring
  4. Your dashboard is green. Your user is not.

    On your server
    • 200 OK
    • Latency: 87ms
    • Payload: valid JSON
    • All probes passing

    On your user's phone
    • App crashed
    • 11s to first paint
    • OOM on Android 9
    • 1-star review incoming
  5. We monitor the tip.

    Our users live in the mass below the waterline, invisible to everything behind our API gateway.
  6. Three responses that were 'fine'

    All three returned 200 OK. All three made users miserable.

    • The valid JSON that actually wasn't valid: Backend serialized a long as 9007199254740993. Client on Android used a JS bridge. Number became 9007199254740992. Order ID off by one. Crashes.
    • The 87ms response that took 11 seconds: Server logged 87ms. User on the train between Munich and Salzburg. TCP retransmits, tower handoff, TLS renegotiation. 200 OK arrived after the user gave up.
    • The payload that broke a Pixel 4a in Lagos: 847 KB JSON response. Fine on Wi-Fi in Munich. On a 2GB-RAM device with 14 Chrome tabs open, the parser hit an OutOfMemoryError. Silent death of the user experience.
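The off-by-one order ID can be reproduced without a phone. The sketch below assumes the JS bridge turns every JSON number into an IEEE 754 double, which is how JavaScript numbers behave; the function names and the "serialize as string" rule are illustrative, not taken from the talk.

```kotlin
// Largest integer that a double (and therefore a JavaScript Number) represents exactly: 2^53 - 1.
const val MAX_SAFE_INTEGER: Long = 9_007_199_254_740_991L

// Simulates what a JS bridge does to a 64-bit ID: the value passes through a double.
fun throughJsNumber(value: Long): Long = value.toDouble().toLong()

// True when the ID survives the round trip and can safely be sent as a JSON number.
fun isSafeAsJsonNumber(value: Long): Boolean =
    value in -MAX_SAFE_INTEGER..MAX_SAFE_INTEGER

// Hypothetical serializer rule: emit unsafe IDs as JSON strings instead of numbers.
fun serializeId(value: Long): String =
    if (isSafeAsJsonNumber(value)) value.toString() else "\"$value\""
```

Calling `throughJsNumber(9007199254740993L)` yields `9007199254740992`, the value from the slide. A contract test on the backend that rejects numeric IDs outside the safe range would catch this before a client does.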
  7. If you can't see past your API gateway, how do you know your API is actually working?
  8. PART TWO: The gap between server and user

    Why the client side breaks rules your backend never had to think about
  9. Two worlds, one request

    Your server
    • 1–10 OS / runtime versions
    • Stable network, predictable latency
    • RAM and CPU you provisioned
    • You control restarts
    • Logs you can grep

    Your user's device
    • Thousands of OS × hardware combos
    • Connections that drop, switch, throttle
    • RAM the user is fighting you for
    • OS kills you mid-request
    • Logs? You hope they're collected and uploaded. To… somewhere?
  10. Store-and-forward: Telemetry's dirty secret

    On the client, telemetry is not real-time. It's eventually-uploaded.

    • T+0: Event happens
    • T+0.1s: Buffered locally
    • T+5s: Network down
    • T+30s: App backgrounded
    • T+8h: Uploaded?

    Implication: Your client telemetry pipeline must assume offline-first, out-of-order arrival, and partial loss. Server-side log aggregation patterns do not transfer.
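A minimal sketch of the store-and-forward idea, under stated assumptions: the buffer is in memory only (a real client would persist to disk to survive process death), and `StoreAndForwardBuffer`, `TelemetryEvent` and the `upload` callback are hypothetical names, not part of any SDK. Events carry a device-side timestamp and a sequence number so the backend can reorder late arrivals and detect gaps.

```kotlin
// One buffered event. The timestamp is taken on the device when the event happens,
// because the upload may occur hours later.
data class TelemetryEvent(val seq: Long, val occurredAtMs: Long, val name: String)

class StoreAndForwardBuffer(private val capacity: Int) {
    private val queue = ArrayDeque<TelemetryEvent>()
    private var nextSeq = 0L

    // Number of events discarded because the buffer was full (partial loss, made visible).
    var dropped = 0L
        private set

    val size: Int get() = queue.size

    fun record(name: String, nowMs: Long) {
        if (queue.size == capacity) {
            queue.removeFirst() // Bounded buffer: drop the oldest event and count the loss
            dropped++
        }
        queue.addLast(TelemetryEvent(nextSeq++, nowMs, name))
    }

    // Tries to upload everything buffered. `upload` returns true on success.
    // On failure the events stay queued for the next attempt.
    fun flush(upload: (List<TelemetryEvent>) -> Boolean): Boolean {
        if (queue.isEmpty()) return true
        val batch = queue.toList()
        if (!upload(batch)) return false
        repeat(batch.size) { queue.removeFirst() }
        return true
    }
}
```

Dropping the oldest event is one possible policy; some teams would rather keep the first events of a session and drop the newest. Either way, reporting `dropped` alongside the batch lets the backend know the data is incomplete.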
  11. Invisible to your backend

    • Network reality: End-to-end latency, packet loss, TLS handshake on the user's actual connection
    • Render performance: Time from response received to pixels on screen, often longer than the network call
    • Client-side errors: Parser crashes, null pointer exceptions, OOMs caused by your payload shape
    • User abandonment: Users who closed the app before your beautiful response arrived
  12. The complexity of the modern stack

    Beyond 200 OK · CFCamp 2026 · CLOSING THE LOOP

    Stages: CLIENT → EDGE / CDN → ENTRY & ROUTING → COMPUTE → DATA & INTEGRATIONS

    Components along the path, in order: Browser / Mobile app · DNS lookup · CDN cache & static · WAF TLS / DDoS · Load balancer · API gateway / reverse proxy · Containers K8s / Cloud Run · Serverless functions · Microservices · SQL DB · Cache · Object storage · Queue · External APIs
  13. One trace, end to end

    A trace is a tree of spans linked by a shared trace ID. The client owns the root span.

    Waterfall (time axis 0ms to 4000ms):
    • 📱 user_tap_checkout: 4200ms
    • 📱 fetch /api/checkout: 3800ms
    • 🌐 dns + tls: 620ms
    • ☕ checkout() controller: 1070ms
    • 🗄 SELECT cart, user: 500ms
    • 💳 payments.charge: 500ms
    • 📱 parse + render: 280ms

    ← server logs only see this slice (the checkout() controller and the spans below it)
  14. The glue: W3C trace context

    A header carries the trace ID from the client tap to your back end handler.

        GET /api/checkout HTTP/1.1
        Host: api.example.com
        traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

    • version (00-...): spec version
    • trace-id (4bf9...4736): same across the whole call tree
    • parent-id (00f0...02b7): the span that just made the call
    • trace-flags (...-01): tracing instruction flag; 01 means the caller sampled this trace
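To make the four fields concrete, here is a small parser sketch for the version-00 layout shown on the slide. `TraceParent` and `parseTraceParent` are illustrative names; in practice an OpenTelemetry propagator does this for you, so treat it as a way to read the header, not as a replacement for the SDK.

```kotlin
data class TraceParent(
    val version: String,
    val traceId: String,
    val parentId: String,
    val sampled: Boolean,
)

// version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - trace-flags (2 hex), lowercase only
private val TRACEPARENT = Regex("^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

// Returns null for anything that does not match the four-field layout.
fun parseTraceParent(header: String): TraceParent? {
    val match = TRACEPARENT.matchEntire(header.trim()) ?: return null
    val (version, traceId, parentId, flags) = match.destructured
    if (version == "ff") return null             // Version ff is forbidden by the spec
    if (traceId.all { it == '0' }) return null   // All-zero trace-id is invalid
    if (parentId.all { it == '0' }) return null  // All-zero parent-id is invalid
    val sampled = (flags.toInt(16) and 0x01) == 0x01 // Lowest bit of trace-flags = sampled
    return TraceParent(version, traceId, parentId, sampled)
}
```

Parsing the header from the slide gives trace-id `4bf92f3577b34da6a3ce929d0e0e4736`, parent-id `00f067aa0ba902b7` and `sampled = true`.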
  15. Following the trace across services

    One trace-id is created at the edge and copied onto every hop. The parent-id is re-stamped at each call to name the span that made it.

    1 Mobile App → API Gateway
      traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
      Trace is born. This trace-id now tags everything downstream.
    2 API Gateway → Checkout Service
      traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-b9c7e5d4f3a21a2b-01
      Same trace-id. parent-id swapped to the Gateway's span.
    3 Checkout Service → Payment Service
      traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-c3d4a1b2e7f85e6f-01
      Same trace-id. parent-id swapped to Checkout's span.
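The re-stamping at each hop can be sketched in a few lines. This is a simplified illustration with hypothetical helper names (`newSpanId`, `nextHop`); a real service would let the OTel agent or SDK do it, and would also validate the incoming header first.

```kotlin
import java.security.SecureRandom

private val random = SecureRandom()

// Generates a new 8-byte span ID as 16 lowercase hex characters.
// An all-zero span ID is invalid, so that (very unlikely) draw is retried.
fun newSpanId(): String {
    val bytes = ByteArray(8)
    var hex: String
    do {
        random.nextBytes(bytes)
        hex = bytes.joinToString("") { "%02x".format(it) }
    } while (hex.all { it == '0' })
    return hex
}

// Builds the header for an outgoing call: same version, trace-id and flags,
// with parent-id swapped to the span that is making the call.
fun nextHop(incoming: String, callerSpanId: String = newSpanId()): String {
    val parts = incoming.split("-")
    require(parts.size == 4) { "expected version-traceid-parentid-flags" }
    return listOf(parts[0], parts[1], callerSpanId, parts[3]).joinToString("-")
}
```

Feeding hop 1 from the slide into `nextHop(..., "b9c7e5d4f3a21a2b")` produces exactly the header shown for hop 2: only the third field changes.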
  16. OpenTelemetry

    Open, vendor-neutral standard for generating and shipping telemetry - a CNCF graduated project.

    • Traces (STABLE): The journey of one request across services, built on the W3C trace context (traceparent) you just saw.
    • Metrics (STABLE): Aggregated measurements over time: latency, throughput, error rate, saturation.
    • Logs (STABLE): Timestamped event records, auto-correlated to the trace that produced them.
  17. Data flows in OpenTelemetry

    Your app speaks OTLP to a Collector. The Collector decides where telemetry data goes.

    Your services (OTel SDK, auto-instrumentation) → OTLP → OpenTelemetry Collector → exporters → backends

    OpenTelemetry Collector
    • Receive: OTLP, Prometheus, Jaeger in
    • Process: batch, filter, sample, enrich
    • Export: fan out to any backend

    Backends
    • Open source: Jaeger · Tempo · Prometheus · Loki · Grafana
    • Commercial: FusionReactor · Datadog · Honeycomb · New Relic · Grafana Cloud · Dynatrace · Elastic

    One protocol in, any backend out. The Collector is the bridge between your code and where telemetry lives.
    Fan out or swap freely. Send to several backends at once, or replace one tomorrow.
  18. What does this mean for a CFML back end?

    You don't have to rewrite much. You have to instrument the JVM you're already running.

    1 Attach the OTel Java agent
      Lucee, Adobe ColdFusion and BoxLang all run on the JVM. There is an OTel Java agent and SDK for homegrown setups, or you can use FusionReactor.
    2 Propagate trace context in CFCs
      If you make outbound HTTP calls (cfhttp, REST), the agent injects traceparent at the JVM/servlet level. For background tasks, propagate the context manually.
    3 Add custom spans for business ops
      Wrap critical CFCs (checkout, payment, search) in spans with attributes: tenant, plan, region. This is where business-aware tracing pays off.
  19. OpenTelemetry on mobile - a mixed story

    Verify exact GA status before you use any of these libraries; this ecosystem moves fast.

    • Android: GA core, instrumentation maturing (maturity ~75%). Java/Kotlin SDK stable. Auto-instrumentation for OkHttp, Retrofit. ANR detection improving.
    • iOS / Swift: Stable SDK, instrumentation gaps (maturity ~65%). Solid manual instrumentation. Auto-instrumentation behind Android. URLSession support good.
    • Flutter / RN: Community SDKs, vendor-led (maturity ~45%). OTel coverage exists but is patchy. Most teams use vendor SDKs (Embrace, Sentry, Raygun) that bridge to OTel.
  20. An actual look at instrumentation on mobile devices

    Android · Kotlin · OkHttp interceptor and ~15 lines of your own code

        import io.opentelemetry.api.GlobalOpenTelemetry

        // Assumes an OpenTelemetry SDK has been registered globally at app start
        val tracer = GlobalOpenTelemetry.getTracer("app.checkout")

        fun onCheckoutTap() {
            val span = tracer.spanBuilder("user.checkout_tap").startSpan()
            span.setAttribute("user.id", userId)
            span.setAttribute("network.type", connectivity.activeType)
            try {
                // Scope is AutoCloseable; use {} closes it when the block exits
                span.makeCurrent().use {
                    // OkHttp interceptor injects traceparent automatically
                    api.checkout(cartId) // → Call to back end
                }
            } finally {
                span.end()
            }
        }
  21. OTel or vendor SDK? Both, probably.

    | | Pure OpenTelemetry | Vendor Mobile SDK (Raygun, Embrace, Sentry, Datadog RUM) |
    |---|---|---|
    | Network instrumentation | Solid | Solid + richer context |
    | Crash reporting | Limited | Mature, symbolicated, deduped |
    | ANR / freeze detection | Improving | Production-grade |
    | Session replay | Not in scope | Often included |
    | Vendor lock-in | None | Some, but most emit OTLP |
    | Cost predictability | DIY | Per-seat/MAU pricing |

    Verdict: use a vendor SDK that exports OTLP. You get crash quality + portability.
  22. Correlation in practice

    From a 1-star review to the offending back end controller in 30 seconds

    1 User reports: 1-star, 'Checkout broken on Pixel'
    2 Find session: Filter RUM: device=Pixel, action=checkout, error=true
    3 Open trace: trace_id from session → full waterfall
    4 Land on span: checkout() span shows DB timeout at 12.3s
    5 Fix forward: Connection pool was misconfigured for one region
  23. PART FOUR: Case studies

    Two companies that lived this. One scaled the fix. One paid for not having it.
  24. CASE STUDY · 01 Shopify: green dashboards and slow app

    The problem
    • Backend APIs returning 200 OK on time
    • Mobile app screens still felt sluggish
    • ScrollViews drawing 20 items when 3 were visible
    • Cache misses adding network lag to P75
    • Server-side dashboards: all green

    What changed
    • Instrumented client-side render performance
    • Defined SLOs measured on the device, not the server
    • Pre-warmed cache for critical screens (90% cache hits)
    • Replaced ScrollView with virtualized lists

    RESULT: App launch P75 44% faster · Screen load P75 59% faster

    Source: Shopify Engineering Blog · 'Improving Shopify App's Performance'
  25. CASE STUDY · 02 DoorDash 2021: the cascading outage nobody saw

    How the outage unfolded
    • T+0: Payment service latency rises. Backend slow-down; by itself, recoverable.
    • T+~30s: Mobile clients time out, retry hard. App SDKs retry aggressively. No exponential backoff.
    • T+~2min: Retry storm amplifies load. Server now handling 5–10× normal traffic on the slow path.
    • T+~5min: Cascade across dependent services. Other microservices waiting on payment also degrade.
    • T+2h: Site-wide outage resolved. Public postmortem documents the timeline.

    LESSON: Client retry behavior is part of your reliability model. If you can't see it, you can't bound it.

    Source: DoorDash Engineering · 'Inside DoorDash's Service Mesh Journey, Part 1'
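The missing piece in the timeline above is exponential backoff. A minimal sketch of one common variant ("full jitter") follows; the defaults of 500 ms base and 30 s cap, and the names `backoffDelayMs` and `retryWithBackoff`, are assumptions for illustration and not what any particular app SDK ships.

```kotlin
import kotlin.random.Random

// Full-jitter exponential backoff: the ceiling doubles per attempt up to capMs,
// and the actual delay is drawn uniformly from 0..ceiling.
fun backoffDelayMs(
    attempt: Int,              // 0 for the first retry
    baseMs: Long = 500L,
    capMs: Long = 30_000L,
    random: Random = Random.Default,
): Long {
    require(attempt >= 0) { "attempt must not be negative" }
    val shift = minOf(attempt, 20)              // Bound the exponent to avoid overflow
    val ceiling = minOf(capMs, baseMs shl shift)
    return random.nextLong(ceiling + 1)         // Uniform in 0..ceiling inclusive
}

// Retries `call` up to maxAttempts times; `sleep` is injectable so it can be tested.
fun <T> retryWithBackoff(
    maxAttempts: Int,
    sleep: (Long) -> Unit = { Thread.sleep(it) },
    call: () -> T,
): T {
    var lastError: Exception? = null
    repeat(maxAttempts) { attempt ->
        try {
            return call()
        } catch (e: Exception) {
            lastError = e
            // Wait before the next attempt, but not after the final one
            if (attempt < maxAttempts - 1) sleep(backoffDelayMs(attempt))
        }
    }
    throw lastError ?: IllegalArgumentException("maxAttempts must be at least 1")
}
```

The jitter matters as much as the exponent: without it, thousands of clients that timed out together retry together. Recording the attempt number as a span attribute would make the retry behavior visible in the same traces as the backend latency.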
  26. PART FIVE: From uptime to smooth sessions

    Why your SLOs should measure user experience, not server liveness
  27. The SLO has to leave the data center

    YESTERDAY: Server uptime ≥ 99.95%
    • Measured at the load balancer
    • Says nothing about the user
    • Green even when users churn

    TODAY: 95% of checkouts complete in < 3s on the user's device
    • Measured at the user's screen
    • Tied to revenue / retention
    • Red when users actually suffer
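The "TODAY" objective is easy to compute once client telemetry exists. The sketch below assumes each checkout arrives as one device-side measurement; `CheckoutSample` and the decision to count abandoned sessions against the SLO are illustrative choices, not something the slides prescribe.

```kotlin
// One checkout as measured on the device: tap to usable screen, or abandoned.
data class CheckoutSample(val durationMs: Long, val completed: Boolean)

// Fraction of checkouts that completed within the threshold.
// Abandoned or failed sessions count against the SLO.
fun sloCompliance(samples: List<CheckoutSample>, thresholdMs: Long = 3_000L): Double {
    if (samples.isEmpty()) return 1.0 // No traffic, nothing violated (a policy choice)
    val good = samples.count { it.completed && it.durationMs < thresholdMs }
    return good.toDouble() / samples.size
}

// True when the measured fraction reaches the target, 95% by default.
fun meetsSlo(samples: List<CheckoutSample>, target: Double = 0.95): Boolean =
    sloCompliance(samples) >= target
```

A server that answers in 87ms contributes nothing to this number by itself; a session that took 11 seconds on a train, or was abandoned, counts as a miss.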
  28. Sampling: signal vs bulk

    Tracing every request is complete but expensive. Sampling keeps a representative slice.

    • No sampling · keep 100%: every request stored & queryable ($$$$)
    • Head sampling · keep 10%: 1 in 10 kept · the rest dropped ($)

    Keep 10% ≈ 90% less to ingest & store. At 1M requests/day that's 100k traces kept instead of 1M. Cost scales with the spans you keep.
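One way to make a head-sampling decision is to derive it from the trace-id, so every hop reaches the same verdict without coordination. The sketch below is written in the spirit of ratio-based samplers but is not a copy of any SDK's algorithm; `keepTrace` and `tracesKept` are hypothetical names.

```kotlin
// Keeps a trace when the low 56 bits of its trace-id fall below ratio * 2^56.
// Deterministic: every service that sees the same trace-id makes the same decision.
fun keepTrace(traceId: String, ratio: Double): Boolean {
    require(traceId.length == 32) { "trace-id must be 32 hex characters" }
    require(ratio in 0.0..1.0) { "ratio must be between 0 and 1" }
    val low56 = traceId.takeLast(14).toLong(16)   // 14 hex characters = 56 bits
    val threshold = (ratio * (1L shl 56)).toLong()
    return low56 < threshold
}

// Expected number of traces stored per day at a given sampling ratio.
fun tracesKept(requestsPerDay: Long, ratio: Double): Long =
    Math.round(requestsPerDay * ratio)
```

`tracesKept(1_000_000, 0.10)` gives the 100k from the slide. Head sampling decides before the outcome is known, so a 10% ratio also drops 90% of the failing checkouts; keeping all error traces requires a tail-based decision in the Collector or an always-sample rule for errors.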
  29. Telemetry meets GDPR

    You are now collecting personal data on the user's device. Treat it as such.

    • Minimize: No URLs with tokens. No PII in span attributes. Hash user IDs at the SDK layer.
    • Consent: Telemetry SDK must respect the user's tracking choice. Default to less, not more.
    • Locality: EU users → EU collectors. Most vendors offer EU-only ingest. Verify, don't assume.
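A sketch of the "Minimize" point, assuming the hashing happens in a thin wrapper before attributes reach the telemetry SDK. The function names are hypothetical, and the keyed hash (HMAC-SHA256) is one reasonable choice rather than the method the speakers use; a plain unsalted hash of a short user ID is easy to reverse by guessing.

```kotlin
import javax.crypto.Mac
import javax.crypto.spec.SecretKeySpec

// Replaces a raw user ID with a keyed hash (HMAC-SHA256) before it is attached to a span.
// The key is an app-level secret; without it the hash cannot be rebuilt from a guessed ID.
fun pseudonymizeUserId(userId: String, key: ByteArray): String {
    val mac = Mac.getInstance("HmacSHA256")
    mac.init(SecretKeySpec(key, "HmacSHA256"))
    return mac.doFinal(userId.toByteArray(Charsets.UTF_8))
        .joinToString("") { "%02x".format(it) }
}

// Drops the query string and fragment so tokens in URLs never reach telemetry.
fun scrubUrl(url: String): String = url.substringBefore('?').substringBefore('#')
```

The same user still maps to the same value, so sessions stay correlatable. A pseudonymized ID is generally still treated as personal data, so consent and locality continue to apply.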
  30. PART SIX: Where this is headed

    Five years from now, server-only observability will look as quaint as paper logs
  31. Catching bad releases earlier

    Today's loop
    • Ship release
    • Wait for crash reports
    • Wait for 1-star reviews
    • Hotfix (1-3 days)
    • Damage already done

    What's emerging
    • Anomaly detection on RUM signals
    • AI flags regression on canary cohort
    • Auto-rollback before 5% rollout
    • AI-driven hotfix in minutes or hours, not days
    • Most users never see the bug
  32. Five things to do from here

    None of these necessarily require a mobile team.

    1 Look into OTel for your backend.
    2 Make sure your inbound HTTP layer accepts and continues W3C trace context (traceparent).
    3 Pick one critical user journey. Define one user-experience SLO for it.
    4 Stand up an OTel collector with EU-located ingest. Start sampled.
    5 Talk to the team that owns your mobile or web client. Have one coffee.
  33. 200 OK is a story your server tells itself.

    Your job is to find out whether the user agrees. Thank you. Questions?
  34. Thank you

    • Kai Koenig, Ventego Creative: linkedin.com/in/kaikoenig · ventego-creative.co.nz
    • Jignesh Patel, Enrich Technolabs: linkedin.com/in/jigneshworld · linktr.ee/JigneshWorld

    Slides, questions and follow-ups - be in touch