DevOps Lisbon
September 14, 2020

[2020.09 Meetup] [Talk] Pranjal Deo - Engineering Reliable Mobile Applications

Pranjal Deo, Engineering Program Manager at Google, gave a brilliant talk in which she shared her lessons learned on mobile engineering, reliability, and the future of SRE for mobile!

Transcript

  1. DevOps Lisbon, Sep 2020: Engineering Reliable Mobile Applications

    Pranjal Deo, Program Manager, Client Infrastructure SRE and Firebase SRE
  2. Proprietary + Confidential A little bit about me

    • Site Reliability Engineering (SRE) Program Manager at Google
    • External Engagements
      ◦ Blameless Postmortem Chapter in the Site Reliability Workbook
        ▪ DevopsDays Stockholm, Istanbul + Keynote Speaker @DevopsDays Portugal
      ◦ Mobile reliability publication
        ▪ This talk!
    • Previous
      ◦ Test automation / software engineering / DevOps at Brightidea Inc.
      ◦ Electrical Engineer
      ◦ Dance instructor
    • Passions
      ◦ My pup (Teddi, a Golden Retriever boy)
      ◦ Travel (25 countries and counting)
      ◦ Food
  3. Agenda

    • SRE for Mobile
    • Challenges
      ◦ Scale
      ◦ Monitoring
      ◦ Control
      ◦ Change Management
    • Strategies for developing resilient native mobile applications
    • Case Studies: Google Doodle outage, Search app outage, Thundering Herd problem
    • Key takeaways
  4. Traditional SRE

    • Availability
    • Latency
    • Efficiency
    • Emergency response
    • Change management
    • Monitoring
    • Capacity planning
    • etc.

    1 SRE = Job role + mindset
    2 Hope is not a strategy
    3 Whole service lifecycle
    4 Healthy services
    5 Horizontal projects
  5. Users perceive reliability of our services through the clients (devices).

    What’s the point of five 9s of server availability if your mobile application cannot access it?
  6. SRE for Mobile

    Focusing on the server-side does not entirely capture user experience anymore.
    • Monitoring
    • Rollouts
    • Incident management & resolution
    • Catch & fix/rollback issues in production fast
    • Affect as few users as possible

    1 Deliver code to users’ devices
    2 Make sure it works well
    3 Things may only happen on a client
    4 Hope is not a mobile strategy either
  7. Challenge #1 Scale

    • Billions of devices
    • Thousands of device models
    • Hundreds of applications
    • Multiple versions of applications
  8. Challenge #2 Monitoring

    • Metrics have many dimensions because of scale
    • Logging / monitoring has a tangible cost to the end user
  9. Challenge #4 Change Management

    • No rollbacks
    • Power lies with the user
    • This is very important!
  10. App Availability

    Examples of unavailability
    • Tap icon, app about to load, then it immediately vanished
    • Message saying “application has stopped” or “application not responding”
    • App made no sign of responding to your tap
    • Empty screen displayed
    • Screen with old results, and you had to refresh
    • Eventually abandoned by clicking the back button

    Crash reports - Critical to monitor and triage.
  11. Realtime Monitoring

    • Reduce mean time to resolution (MTTR)
      ◦ Faster problem detection, quicker investigation
    • Get quick feedback on production fixes
    • Typical server side fixes: Resolution time driven by humans
    • Extra for Mobile: How fast can fixes be pushed to devices?
      ◦ Polling oriented mobile experimentation and configuration
      ◦ Uptake rate varies
      ◦ Constrain view of error metrics to devices using your fix

    Monitor metrics exposed by app internals
    Run UI test probes for user journeys
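
The last mobile-specific point on this slide, constraining error metrics to devices that already have the fix, can be sketched in a few lines of Python. The `Session` record, its `config_version` field, and the sample data are hypothetical and only illustrate the idea; they are not taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    device_id: str
    config_version: int  # hypothetical: config version the device had fetched
    crashed: bool


def crash_rate(sessions, min_config_version=None):
    """Fraction of sessions that crashed.

    If min_config_version is given, only sessions from devices at or above
    that config version are counted. Returns None when nothing matches.
    """
    selected = [
        s for s in sessions
        if min_config_version is None or s.config_version >= min_config_version
    ]
    if not selected:
        return None
    return sum(s.crashed for s in selected) / len(selected)


# Made-up sample: config version 42 is assumed to carry the fix.
sessions = [
    Session("a", 41, True),
    Session("b", 41, True),
    Session("c", 42, False),
    Session("d", 42, False),
    Session("e", 41, False),
    Session("f", 42, True),
]

overall = crash_rate(sessions)        # mixes fixed and unfixed devices
with_fix = crash_rate(sessions, 42)   # only devices that picked up the fix
```

Because uptake varies, the overall rate can stay high long after a good fix ships; the filtered rate is the one that tells you whether the fix itself works.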
  12. Performance & Efficiency

    • Mobile apps on a device share precious resources e.g. battery, network, storage, CPU, memory
    • Particularly important for lower end devices
    • Block launches that hamper user happiness
  13. Change Management

    • Problems found in production can be irrecoverable
    • Take extra care when releasing client changes!
    • Staged rollouts
      ◦ Gradually gather production feedback
      ◦ Diversify pool of users and devices
    • Experimentation
      ◦ Reduce bias caused by better network / devices
      ◦ Release changes via experiments
      ◦ A/B analysis over staged rollout
      ◦ Randomized control and experiment groups
    • Feature flags
      ◦ Release code through binary releases and control user set via feature flags
      ◦ Rollback shouldn’t break the app
    • Upgrade side effects and noise
      ◦ Placebo binaries
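
One common way to combine a staged rollout with a feature flag is deterministic hash bucketing, sketched below in Python. This is an illustrative assumption, not the mechanism described in the talk; the function names, the flag name `new_ui`, and the bucket count are made up for the example.

```python
import hashlib

BUCKETS = 10_000  # gives 0.01% granularity


def bucket(user_id: str, salt: str) -> int:
    """Map a user to a stable bucket; salting per flag decorrelates flags."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def in_rollout(user_id: str, flag_name: str, percent: float) -> bool:
    """True if the user falls inside the first `percent` of buckets."""
    return bucket(user_id, flag_name) < percent * BUCKETS / 100


def feature_enabled(user_id: str, flag_name: str, flags: dict) -> bool:
    """`flags` maps flag name to rollout percent; unknown flags stay off."""
    return in_rollout(user_id, flag_name, flags.get(flag_name, 0.0))


users = [f"user-{i}" for i in range(2000)]
at_1_percent = {u for u in users if in_rollout(u, "new_ui", 1)}
at_10_percent = {u for u in users if in_rollout(u, "new_ui", 10)}
```

Because the bucket depends only on the user and the flag, raising the percentage only ever adds users, and setting it back to 0 turns the feature off for everyone without shipping a new binary.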
  14. Support Horizons

    • How many app versions can SRE meaningfully support?
    • Older app versions can never really go away
    • Trade-off between reliability and business decisions
  15. Server-Side Impact

    • Client changes to apps impact servers
    • Global events can suddenly overwhelm servers
    • Client releases can cause unintended consequences
  16. #1 Android Google Search App (AGSA) Doodle Crashes

    What happened?
    • Bad Doodle configuration caused crashes in AGSA whenever users were shown a SERP (Search Engine Results Page)
    • Triggered as the Doodle rolled out in each timezone
    • A fix was submitted for this particular issue (both configuration and binary fix) but the same issue happened again!
    • Affected older versions without the fix
  17. #1 Android Google Search App (AGSA) Doodle Crashes

    Key Takeaways
    • Client-only fixes may not fix everything (e.g. users may not update to the version with the fix); always include server-side fixes when possible
    • Know your dependencies (especially if you have many feature teams contributing)
  18. #2 Search broken for certain versions of AGSA

    What happened?
    • AGSA started crash looping on five older versions - a near miss of a massive outage
    • A simple four-character change to a config caused a crash at app startup
    • Unable to fetch the rolled back config before crashing
    • Only recovery: notify users to upgrade or clear app data
  19. #2 Search broken for certain versions of AGSA

    Key takeaways
    • Lots of older app versions in the wild
    • “Apply” before “Commit”: always validate and exercise the new config before committing (i.e. caching)
    • Regularly expire cached configuration in a reliable manner
    • Detect and self-recover from crash loops
    • Don’t rely on recovery external to the app
    • Sending notifications for manual recovery has limited utility
    • Monitor crash recovery
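
A minimal Python sketch of two of the takeaways above: validating a config before caching it ("apply" before "commit"), and recovering from a crash loop inside the app. Everything here is hypothetical: the config keys, the `ConfigStore` class, and the threshold are invented for illustration, and the in-memory attributes stand in for state that a real app would persist on disk.

```python
import json

DEFAULT_CONFIG = {"search_endpoint": "/search", "max_results": 10}
CRASH_LOOP_THRESHOLD = 3  # assumed value, tune per app


def validate(config):
    """Exercise the config the way the app would use it; raise on problems."""
    if not isinstance(config, dict):
        raise ValueError("config must be a JSON object")
    endpoint = config.get("search_endpoint")
    if not isinstance(endpoint, str) or not endpoint.startswith("/"):
        raise ValueError("search_endpoint must be a path")
    max_results = config.get("max_results")
    if not isinstance(max_results, int) or max_results <= 0:
        raise ValueError("max_results must be a positive integer")


class ConfigStore:
    def __init__(self):
        self.cached = dict(DEFAULT_CONFIG)  # stands in for the on-disk cache
        self.unfinished_launches = 0        # stands in for a persisted counter

    def update(self, raw_json: str) -> bool:
        """Apply (parse and validate) first; commit to the cache only on success."""
        try:
            candidate = json.loads(raw_json)  # JSONDecodeError is a ValueError
            validate(candidate)
        except ValueError:
            return False  # keep the last known-good config
        self.cached = candidate
        return True

    def begin_startup(self) -> bool:
        """Call first thing at launch. Returns True if the cache was reset."""
        recovered = False
        if self.unfinished_launches >= CRASH_LOOP_THRESHOLD:
            # Several launches in a row never finished: drop the cached config.
            self.cached = dict(DEFAULT_CONFIG)
            self.unfinished_launches = 0
            recovered = True
        self.unfinished_launches += 1
        return recovered

    def finish_startup(self) -> None:
        """Call once the app is up and usable."""
        self.unfinished_launches = 0
```

The counter is incremented before the config is used, so a config that crashes the app at startup cannot prevent the recovery path from running on a later launch.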
  20. #3 Thundering Herd problem

    What happened?
    • A GMSCore (Google Play Services) update caused devices to register for Firebase Cloud Messaging (FCM) notifications at install time
    • FCM is not scaled to support 2B devices updating at GMSCore's update rate, so it throttled all GMSCore registrations globally
    • This could easily have been a global outage
  21. #3 Thundering Herd problem

    Key Takeaways
    • Don't make service calls during upgrades
    • Server calls should be an app release qualification criterion
    • App release rates are probably not well correlated with server capacity management
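
The effect behind this case study can be illustrated with a small Python simulation. It compares every device calling the server at install time with each device deferring the call by a random delay. Spreading calls with random jitter is a common mitigation and an assumption here, not something the talk states was done; the device count and one-hour window are arbitrary.

```python
import random
from collections import Counter


def peak_requests_per_second(delays):
    """Largest number of calls landing in any one-second bucket."""
    per_second = Counter(int(d) for d in delays)
    return max(per_second.values())


def jittered_delays(device_count, window_seconds, seed=0):
    """One uniformly random delay per device, within the given window."""
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    return [rng.uniform(0.0, window_seconds) for _ in range(device_count)]


DEVICES = 10_000

# Every device calls the server the moment the update is installed.
at_install = [0.0] * DEVICES

# Each device waits a random time within one hour before calling.
spread_out = jittered_delays(DEVICES, window_seconds=3600.0)

print(peak_requests_per_second(at_install))
print(peak_requests_per_second(spread_out))
```

In this simplified model all installs happen at the same instant, which overstates the spike, but it shows why the server's peak load follows the client's release rate unless the client deliberately decouples the two.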
  22. Hope is not a Mobile strategy

    • Roll out changes in a controlled, metric-driven way
    • Monitor apps in production by measuring critical user interactions and key health metrics
    • Prepare for app’s impact on servers
    • Create incident management processes specific to client side
    • Make client reliability a part of your mission!