Beyond 200 OK: What Happens to Your API Responses After They Leave Your Server

CFCAMP 2026 · MUNICH
Beyond 200 OK. What happens to your API responses after they leave your server
Jignesh Patel · Kai Koenig

Two people - two angles on the same problem
Beyond 200 OK · CFCamp 2026 INTRO
CLIENT / MOBILE: JP · Jignesh Patel · Solution Architect, Enrich Technolabs · Mobile · Client · Product
BACKEND / INFRA: KK · Kai Koenig · Software Architect, Ventego Creative · Backend · APIs · Infra · JVM / CFML
OVERLAP: distributed systems · observability

PART ONE
It works on my server
The discrepancy between server monitoring and real user monitoring

Your dashboard is green. Your user is not.
On your server
• 200 OK
• Latency: 87ms
• Payload: valid JSON
• All probes passing
On your user's phone
• App crashed
• 11s to first paint
• OOM on Android 9
• 1-star review incoming

We monitor the tip. Our users live in the mass below the waterline, invisible to everything behind our API gateway.

Three responses that were 'fine'
All three returned 200 OK. All three made users miserable.
• The valid JSON that actually wasn't valid: Backend serialized a long as 9007199254740993. The Android client used a JS bridge, where every number is an IEEE 754 double. The number became 9007199254740992. Order ID off by one. Crashes.
• The 87ms response that took 11 seconds: Server logged 87ms. User on the train between Munich and Salzburg. TCP retransmits, tower handoff, TLS renegotiation. 200 OK arrived after the user gave up.
• The payload that broke a Pixel 4a in Lagos: 847 KB JSON response. Fine on Wi-Fi in Munich. On a 2GB-RAM device with 14 Chrome tabs open, the parser hit an OutOfMemoryError. Silent death of the user experience.
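
The off-by-one order ID can be reproduced in a few lines. This is a minimal sketch, not the code from the incident: `throughDoubleBridge`, `isSafeForJs` and `encodeIdForJson` are hypothetical helper names, and the bridge is simulated by a plain Long-to-Double round trip.

```kotlin
// Largest integer a JavaScript number (IEEE 754 double) represents exactly: 2^53 - 1.
const val MAX_SAFE_INTEGER = 9_007_199_254_740_991L

/** Simulates a long crossing a double-based bridge (hypothetical helper). */
fun throughDoubleBridge(id: Long): Long = id.toDouble().toLong()

/** True when the value lies inside JavaScript's safe integer range. */
fun isSafeForJs(id: Long): Boolean = id in -MAX_SAFE_INTEGER..MAX_SAFE_INTEGER

/** Encodes an ID for JSON: a bare number when safe, a quoted string otherwise. */
fun encodeIdForJson(id: Long): String =
    if (isSafeForJs(id)) id.toString() else "\"$id\""
```

With these helpers, `throughDoubleBridge(9007199254740993L)` yields 9007199254740992, the value from the slide. One common mitigation, assumed here rather than taken from the talk, is to send large IDs as strings, which is what `encodeIdForJson` does.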

If you can't see past your API gateway, how do you know your API is actually working?

PART TWO
The gap between server and user
Why the client side breaks rules your backend never had to think about

Two worlds, one request
Your server
• 1–10 OS / runtime versions
• Stable network, predictable latency
• RAM and CPU you provisioned
• You control restarts
• Logs you can grep
Your user's device
• Thousands of OS × hardware combos
• Connections that drop, switch, throttle
• RAM the user is fighting you for
• OS kills you mid-request
• Logs? You hope they're collected and uploaded. To… somewhere?

Store-and-forward: telemetry's dirty secret
On the client, telemetry is not real-time. It's eventually-uploaded.
T+0     Event happens
T+0.1s  Buffered locally
T+5s    Network down
T+30s   App backgrounded
T+8h    Uploaded?
Implication: your client telemetry pipeline must assume offline-first operation, out-of-order arrival, and partial loss. Server-side log aggregation patterns do not transfer.
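
A minimal sketch of what offline-first, out-of-order and lossy can look like in code. Everything here is an assumption for illustration: `TelemetryEvent`, `StoreAndForwardBuffer` and `orderByOccurrence` are hypothetical names, the buffer lives in memory only, and a real SDK would persist to disk and batch by size.

```kotlin
/** One telemetry event, stamped on the device when it happened, not when it was uploaded. */
data class TelemetryEvent(val name: String, val occurredAtMs: Long)

/** Bounded offline buffer: when full, the oldest event is dropped (partial loss). */
class StoreAndForwardBuffer(private val capacity: Int) {
    private val queue = ArrayDeque<TelemetryEvent>()

    /** Number of events discarded because the buffer was full. */
    var dropped = 0
        private set

    val size: Int get() = queue.size

    fun record(event: TelemetryEvent) {
        if (queue.size == capacity) {
            queue.removeFirst()
            dropped++
        }
        queue.addLast(event)
    }

    /** Tries to upload everything; events stay buffered when the upload fails. */
    fun flush(upload: (List<TelemetryEvent>) -> Boolean): Boolean {
        if (queue.isEmpty()) return true
        val ok = upload(queue.toList())
        if (ok) queue.clear()
        return ok
    }
}

/** Receiving side: order by device timestamp, because batches arrive late and out of order. */
fun orderByOccurrence(received: List<TelemetryEvent>): List<TelemetryEvent> =
    received.sortedBy { it.occurredAtMs }
```

The design choice worth noting is that the event carries its own timestamp: arrival time at the collector says little when an upload can lag by hours.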

Invisible to your backend
• Network reality: end-to-end latency, packet loss, TLS handshake on the user's actual connection
• Render performance: time from response received to pixels on screen, often longer than the network call
• Client-side errors: parser crashes, null pointer exceptions, OOMs caused by your payload shape
• User abandonment: users who closed the app before your beautiful response arrived

PART THREE
Closing the loop
How distributed traces follow the request all the way home

The complexity of the modern stack
CLIENT → EDGE / CDN → ENTRY & ROUTING → COMPUTE → DATA & INTEGRATIONS
• Client: Browser / Mobile app
• Edge / CDN: DNS lookup · CDN cache & static · WAF · TLS / DDoS
• Entry & routing: Load balancer · API gateway / reverse proxy
• Compute: Containers (K8s / Cloud Run) · Serverless functions · Microservices
• Data & integrations: SQL DB · Cache · Object storage · Queue · External APIs

One trace, end to end
A trace is a tree of spans linked by a shared trace ID. The client owns the root span.
Waterfall (time axis 0ms to 4000ms):
📱 user_tap_checkout · 4200ms
📱 fetch /api/checkout · 3800ms
🌐 dns + tls · 620ms
☕ checkout() controller · 1070ms
🗄 SELECT cart, user · 500ms
💳 payments.charge · 500ms
📱 parse + render · 280ms
Server logs only see the back-end slice: the checkout() controller and the spans beneath it.
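
To make the size of that blind spot concrete, the waterfall can be modelled as a small tree. This is a sketch under assumptions: the `Span` class and `serverVisibleMs` are hypothetical, and the parent/child nesting is inferred from the slide, which only lists names and durations.

```kotlin
/** A span in the trace tree; onServer marks spans the back end can log itself. */
data class Span(
    val name: String,
    val durationMs: Long,
    val onServer: Boolean = false,
    val children: List<Span> = emptyList(),
)

/** Durations from the slide; the nesting is an assumption for illustration. */
val checkoutTrace = Span("user_tap_checkout", 4200, children = listOf(
    Span("fetch /api/checkout", 3800, children = listOf(
        Span("dns + tls", 620),
        Span("checkout() controller", 1070, onServer = true, children = listOf(
            Span("SELECT cart, user", 500, onServer = true),
            Span("payments.charge", 500, onServer = true),
        )),
    )),
    Span("parse + render", 280),
))

/** Time covered by the topmost server-side spans, i.e. what server logs can account for. */
fun serverVisibleMs(span: Span): Long =
    if (span.onServer) span.durationMs
    else span.children.sumOf { serverVisibleMs(it) }
```

For this trace the server accounts for 1070ms of a 4200ms interaction, roughly a quarter of what the user waited for.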

The glue: W3C trace context
A header carries the trace ID from the client tap to your back-end handler.
    GET /api/checkout HTTP/1.1
    Host: api.example.com
    traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
• version (00-...): spec version
• trace-id (4bf9...4736): same across the whole call tree
• parent-id (00f0...02b7): the span that just made the call
• trace-flags (...-01): tracing instruction flag; 01 means the trace is sampled
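
The four fields can be pulled apart with a small parser. This is a simplified sketch that only accepts the version 00 layout shown above; `TraceParent` and `parseTraceParent` are hypothetical names, not part of any OpenTelemetry API, and in practice the SDK or agent does this for you.

```kotlin
/** The four dash-separated fields of a traceparent header. */
data class TraceParent(
    val version: String,
    val traceId: String,
    val parentId: String,
    val flags: String,
) {
    /** Lowest bit of the flags byte is the sampled flag. */
    val sampled: Boolean get() = (flags.toInt(16) and 0x01) == 0x01

    fun toHeader(): String = "$version-$traceId-$parentId-$flags"
}

// 2 hex chars version, 32 hex chars trace-id, 16 hex chars parent-id, 2 hex chars flags.
val TRACEPARENT_PATTERN = Regex("^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

/** Returns null for anything that is not a well-formed header in this layout. */
fun parseTraceParent(header: String): TraceParent? {
    val match = TRACEPARENT_PATTERN.matchEntire(header.trim()) ?: return null
    val (version, traceId, parentId, flags) = match.destructured
    if (version == "ff") return null                       // ff is reserved as invalid
    if (traceId.all { it == '0' }) return null             // all-zero trace-id is invalid
    if (parentId.all { it == '0' }) return null            // all-zero parent-id is invalid
    return TraceParent(version, traceId, parentId, flags)
}
```

Returning null instead of throwing reflects a common convention: a malformed header should start a new trace, not fail the request.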

Following the trace across services
One trace-id is created at the edge and copied onto every hop. The parent-id is re-stamped at each call to name the span that made it.
1 Mobile App → API Gateway
    traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
  Trace is born. This trace-id now tags everything downstream.
2 API Gateway → Checkout Service
    traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-b9c7e5d4f3a21a2b-01
  Same trace-id. parent-id swapped to the Gateway's span.
3 Checkout Service → Payment Service
    traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-c3d4a1b2e7f85e6f-01
  Same trace-id. parent-id swapped to Checkout's span.
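
The re-stamping at each hop is mechanically simple. The sketch below assumes the four-field header shown above; `newSpanId` and `restamp` are hypothetical helpers, and an instrumented HTTP client normally does this without any hand-written code.

```kotlin
import kotlin.random.Random

/** Generates a random span ID: 8 bytes as 16 lowercase hex characters, never all zeros. */
fun newSpanId(random: Random = Random.Default): String =
    generateSequence { random.nextBytes(8).joinToString("") { "%02x".format(it) } }
        .first { id -> id.any { it != '0' } }

/**
 * Builds the outbound header for the next hop: version, trace-id and flags are
 * copied, the parent-id is replaced by the span ID of the caller making the request.
 */
fun restamp(incoming: String, currentSpanId: String): String {
    val parts = incoming.split("-")
    require(parts.size == 4) { "expected version-traceid-parentid-flags" }
    require(currentSpanId.length == 16) { "span ID must be 16 hex characters" }
    return listOf(parts[0], parts[1], currentSpanId, parts[3]).joinToString("-")
}
```

Calling `restamp` on the header from hop 1 with the Gateway's span ID `b9c7e5d4f3a21a2b` produces the header shown for hop 2.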

OpenTelemetry
Open, vendor-neutral standard for generating and shipping telemetry - a CNCF graduated project.
• Traces (STABLE): the journey of one request across services, built on the W3C trace context (traceparent) you just saw.
• Metrics (STABLE): aggregated measurements over time: latency, throughput, error rate, saturation.
• Logs (STABLE): timestamped event records, auto-correlated to the trace that produced them.

Data flows in OpenTelemetry
Your app speaks OTLP to a Collector. The Collector decides where telemetry data goes.
Your services (OTel SDK, auto-instrumentation) → OTLP → OpenTelemetry Collector → exporters → backends
OpenTelemetry Collector
• Receive: OTLP, Prometheus, Jaeger in
• Process: batch, filter, sample, enrich
• Export: fan out to any backend
Backends
• Open source: Jaeger · Tempo · Prometheus · Loki · Grafana
• Commercial: FusionReactor · Datadog · Honeycomb · New Relic · Grafana Cloud · Dynatrace · Elastic
One protocol in, any backend out. The Collector is the bridge between your code and where telemetry lives.
Fan out or swap freely. Send to several backends at once, or replace one tomorrow.

What does this mean for a CFML back end?
You don't have to rewrite much. You have to instrument the JVM you're already running.
1 Attach the OTel Java agent
  Lucee, Adobe ColdFusion and BoxLang all run on the JVM. You can attach the OTel Java agent, use the OTel Java SDK for homegrown systems, or use FusionReactor.
2 Propagate trace context in CFCs
  If you make outbound HTTP calls (cfhttp, REST), the agent injects traceparent at the JVM/servlet level. For background tasks, propagate the context manually.
3 Add custom spans for business ops
  Wrap critical CFCs (checkout, payment, search) in spans with attributes: tenant, plan, region. This is where business-aware tracing pays off.

OpenTelemetry on mobile - a mixed story
Verify exact GA status before you use any of these libraries; this ecosystem moves fast.
• Android: GA core, instrumentation maturing (maturity ~75%). Java/Kotlin SDK stable. Auto-instrumentation for OkHttp, Retrofit. ANR detection improving.
• iOS / Swift: Stable SDK, instrumentation gaps (maturity ~65%). Solid manual instrumentation. Auto-instrumentation behind Android. URLSession support good.
• Flutter / RN: Community SDKs, vendor-led (maturity ~45%). OTel coverage exists but is patchy. Most teams use vendor SDKs (Embrace, Sentry, Raygun) that bridge to OTel.

An actual look at instrumentation on mobile devices
Android · Kotlin · OkHttp interceptor and ~15 lines of your own code
    import io.opentelemetry.api.GlobalOpenTelemetry

    // userId, connectivity, api and cartId are defined elsewhere in the app.
    val tracer = GlobalOpenTelemetry.getTracer("app.checkout")

    fun onCheckoutTap() {
        val span = tracer.spanBuilder("user.checkout_tap").startSpan()
        span.setAttribute("user.id", userId)
        span.setAttribute("network.type", connectivity.activeType)
        try {
            // makeCurrent() returns a Scope; use { } closes it when the block exits
            span.makeCurrent().use {
                // OkHttp interceptor injects traceparent automatically
                api.checkout(cartId) // → call to back end
            }
        } finally {
            span.end()
        }
    }

OTel or vendor SDK? Both, probably.
                          Pure OpenTelemetry    Vendor Mobile SDK (Raygun, Embrace, Sentry, Datadog RUM)
Network instrumentation   Solid                 Solid + richer context
Crash reporting           Limited               Mature, symbolicated, deduped
ANR / freeze detection    Improving             Production-grade
Session replay            Not in scope          Often included
Vendor lock-in            None                  Some, but most emit OTLP
Cost predictability       DIY                   Per-seat/MAU pricing
Verdict: use a vendor SDK that exports OTLP. You get crash quality + portability.

Correlation in practice
From a 1-star review to the offending back-end controller in 30 seconds
1 User reports: 1-star, 'Checkout broken on Pixel'
2 Find session: filter RUM by device=Pixel, action=checkout, error=true
3 Open trace: trace_id from session → full waterfall
4 Land on span: checkout() span shows DB timeout at 12.3s
5 Fix forward: connection pool was misconfigured for one region

PART FOUR
Case studies
Two companies that lived this. One scaled the fix. One paid for not having it.

CASE STUDY · 01
Shopify: green dashboards and slow app
The problem
• Backend APIs returning 200 OK on time
• Mobile app screens still felt sluggish
• ScrollViews drawing 20 items when 3 were visible
• Cache misses adding network lag to P75
• Server-side dashboards: all green
What changed
• Instrumented client-side render performance
• Defined SLOs measured on the device, not the server
• Pre-warmed cache for critical screens (90% cache hits)
• Replaced ScrollView with virtualized lists
RESULT: App launch P75 44% faster · Screen load P75 59% faster
Source: Shopify Engineering Blog · 'Improving Shopify App's Performance'

CASE STUDY · 02
DoorDash 2021: the cascading outage nobody saw
How the outage unfolded
T+0      Payment service latency rises. Backend slow-down; by itself, recoverable.
T+~30s   Mobile clients time out, retry hard. App SDKs retry aggressively. No exponential backoff.
T+~2min  Retry storm amplifies load. Server now handling 5–10× normal traffic on the slow path.
T+~5min  Cascade across dependent services. Other microservices waiting on payment also degrade.
T+2h     Site-wide outage resolved. Public postmortem documents the timeline.
LESSON: Client retry behavior is part of your reliability model. If you can't see it, you can't bound it.
Source: DoorDash Engineering · 'Inside DoorDash's Service Mesh Journey, Part 1'
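
For contrast with "retry hard", here is a minimal sketch of capped exponential backoff with full jitter on the client. It is not taken from the DoorDash postmortem; the function names and the defaults (500ms base, 30s cap, 4 attempts) are assumptions chosen for illustration.

```kotlin
import kotlin.random.Random

/** Upper bound for the wait before retry number `attempt` (0-based): base * 2^attempt, capped. */
fun backoffCapMs(attempt: Int, baseMs: Long = 500, maxMs: Long = 30_000): Long {
    val shift = attempt.coerceIn(0, 20) // bound the shift so the multiplication cannot overflow
    return minOf(maxMs, baseMs shl shift)
}

/** Full jitter: a random wait in [0, cap], so clients do not retry in lockstep. */
fun backoffWithJitterMs(attempt: Int, random: Random = Random.Default): Long =
    random.nextLong(backoffCapMs(attempt) + 1)

/** Retries `call` up to maxAttempts times; `sleep` is injectable so tests need not wait. */
fun <T> retryWithBackoff(
    maxAttempts: Int = 4,
    sleep: (Long) -> Unit = { Thread.sleep(it) },
    call: () -> T,
): T {
    var last: Exception? = null
    repeat(maxAttempts) { attempt ->
        try {
            return call()
        } catch (e: Exception) {
            last = e
        }
        // No sleep after the final failed attempt.
        if (attempt < maxAttempts - 1) sleep(backoffWithJitterMs(attempt))
    }
    throw last ?: IllegalStateException("maxAttempts must be greater than 0")
}
```

The bounded attempt count matters as much as the delay: it puts a ceiling on how much extra load one struggling client can add.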

PART FIVE
From uptime to smooth sessions
Why your SLOs should measure user experience, not server liveness

The SLO has to leave the data center
YESTERDAY: Server uptime ≥ 99.95%
• Measured at the load balancer
• Says nothing about the user
• Green even when users churn
TODAY: 95% of checkouts complete in < 3s on the user's device
• Measured at the user's screen
• Tied to revenue / retention
• Red when users actually suffer
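
Evaluating the "today" SLO is plain arithmetic over device-measured durations. A minimal sketch, assuming the durations have already been collected from the client; `percentUnder` and `meetsSlo` are hypothetical names.

```kotlin
/** Whole-number percentage of samples that finished strictly under the threshold. */
fun percentUnder(durationsMs: List<Long>, thresholdMs: Long): Int {
    require(durationsMs.isNotEmpty()) { "need at least one sample" }
    return durationsMs.count { it < thresholdMs } * 100 / durationsMs.size
}

/** True when at least targetPercent of checkouts completed in under thresholdMs. */
fun meetsSlo(
    durationsMs: List<Long>,
    thresholdMs: Long = 3_000,
    targetPercent: Int = 95,
): Boolean {
    require(durationsMs.isNotEmpty()) { "need at least one sample" }
    // Integer comparison avoids floating-point edge cases exactly at the target.
    return durationsMs.count { it < thresholdMs } * 100 >= targetPercent * durationsMs.size
}
```

With 19 checkouts at 1.2s and one at 11s the SLO is just met (95%); a second 11s checkout out of 20 breaks it.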

Sampling: signal vs bulk
Tracing every request is complete but expensive. Sampling keeps a representative slice.
• No sampling · keep 100%: every request stored & queryable · cost $$$$
• Head sampling · keep 10%: 1 in 10 kept, the rest dropped · cost $
Keep 10% ≈ 90% less to ingest & store. At 1M requests/day that's 100k traces kept instead of 1M. Cost scales with the spans you keep.
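
One way to make a head-sampling decision is to derive it from the trace-id, so every service reaches the same verdict for the same trace. The sketch below is an illustration of that idea, not the algorithm of any particular SDK; `shouldSample` and `expectedKept` are hypothetical names.

```kotlin
/**
 * Deterministic head sampling: keep the trace when the low 8 bytes of its
 * trace-id, read as a non-negative number, fall below ratio * Long.MAX_VALUE.
 */
fun shouldSample(traceId: String, ratio: Double): Boolean {
    require(traceId.length == 32) { "trace-id must be 32 hex characters" }
    require(ratio in 0.0..1.0) { "ratio must be between 0 and 1" }
    if (ratio >= 1.0) return true
    val low = traceId.substring(16).toULong(16).toLong()
    val threshold = (Long.MAX_VALUE * ratio).toLong()
    // Mask the sign bit so the value compares as a non-negative number.
    return (low and Long.MAX_VALUE) < threshold
}

/** Traces kept per day at a given sampling percentage. */
fun expectedKept(requestsPerDay: Long, percent: Int): Long = requestsPerDay * percent / 100
```

`expectedKept(1_000_000, 10)` gives the 100k from the slide. The trade-off is that head sampling decides before the outcome is known, so a 10% sample also drops about 90% of the failing requests.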

Telemetry meets GDPR
You are now collecting personal data on the user's device. Treat it as such.
• Minimize: No URLs with tokens. No PII in span attributes. Hash user IDs at the SDK layer.
• Consent: Telemetry SDK must respect the user's tracking choice. Default to less, not more.
• Locality: EU users → EU collectors. Most vendors offer EU-only ingest. Verify, don't assume.
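
The "Minimize" point can be sketched with the JDK alone. The helper names `pseudonymize` and `stripQuery` are hypothetical, the salt handling is left out, and whether a salted hash is sufficient for your legal context is a question for your privacy team, not something this snippet settles.

```kotlin
import java.security.MessageDigest

/** Replaces a user ID with a salted SHA-256 digest before it is attached to a span. */
fun pseudonymize(userId: String, salt: String): String =
    MessageDigest.getInstance("SHA-256")
        .digest((salt + userId).toByteArray(Charsets.UTF_8))
        .joinToString("") { "%02x".format(it) }

/** Drops query string and fragment so tokens in URLs never reach telemetry. */
fun stripQuery(url: String): String =
    url.substringBefore('?').substringBefore('#')
```

For example, `stripQuery("https://api.example.com/checkout?token=abc")` returns `https://api.example.com/checkout`. Doing this at the SDK layer means the raw values never leave the device.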

PART SIX
Where this is headed
Five years from now, server-only observability will look as quaint as paper logs

Catching bad releases earlier
Today's loop
• Ship release
• Wait for crash reports
• Wait for 1-star reviews
• Hotfix (1-3 days)
• Damage already done
What's emerging
• Anomaly detection on RUM signals
• AI flags regression on canary cohort
• Auto-rollback before 5% rollout
• AI-driven hotfix in minutes or hours, not days
• Most users never see the bug

Five things to do from here
None of these necessarily require a mobile team.
1 Look into OTel for your backend.
2 Make sure your inbound HTTP layer accepts and continues W3C-standard tracing.
3 Pick one critical user journey. Define one user-experience SLO for it.
4 Stand up an OTel collector with EU-located ingest. Start sampled.
5 Talk to the team that owns your mobile or web client. Have one coffee.

200 OK is a story your server tells itself. Your job is to find out whether the user agrees.
Thank you. Questions?

Thank you
KK · Kai Koenig · Ventego Creative · linkedin.com/in/kaikoenig · ventego-creative.co.nz
JP · Jignesh Patel · Enrich Technolabs · linkedin.com/in/jigneshworld · linktr.ee/JigneshWorld
Slides, questions and follow-ups - be in touch
