Ty Smith
September 06, 2023

Balancing Speed and Reliability: The Double-Edged Sword of Third-Party Libraries

Using third-party libraries in your apps can be a great way to save engineering time and move faster, but it can also bring significant risk. If a library malfunctions and causes an outage, it may take days or weeks to get the fix out to all your users. Apps have long update cycles and don’t have the luxury of hotfixes when something goes wrong. Because people rely on the Uber app to earn their income, get to the doctor, or commute to work, reliability is our top priority. Learn how Uber decides when mobile libraries are safe to include and when they should be avoided.

We’ll review how Uber analyzes external libraries to reduce risk, walk through some horror stories from times when things went wrong, and cover techniques that can help preserve reliability for your users when the worst does happen. You’ll walk away with a tactical framework for evaluating libraries in your own apps.


Transcript

  1. Outage Timeline

    Days shown: Thursday, Friday, Saturday, Sunday
    - 11:30 - Incident detected
    - 14:40 - Google Rollback Started
    - 04:00 - Google releases Android fix #1
    - 10:00 - Enable Uber Maps in US, CA, MX
    - 14:00 - Release Android hotfix #1
    - 19:40 - Google releases iOS fix
    - 22:30 - Release iOS hotfix
    - 11:30 - Google releases Android fix #2
    - 14:30 - Release Android hotfix #2
    - 07:30 - Enable Uber Maps In Remaining Areas
    - 4 day outage
    - Several rotating incident commanders
    - Teams from every org
  5. Impact

    • Largest mobile outage in Uber’s history
    • Millions of users blocked
    • Millions of $ lost
    • Thousands of hours of lost employee productivity
  6. Aftermath

    • Executive review of postmortem
    • New Intercompany legal agreements
    • Improved library governance process
    • Improved crash protection
    • Improved crash recovery
  7. Third Party Code

    ✅ Modern platform
    ✅ Available Features
    ✅ Faster development
    ✅ Free maintenance and updates
  8. Third Party Code

    ✅ Faster development
    ✅ Available features
    ✅ Modern platform
    ✅ Free maintenance and updates
    🆇 Crashes
    🆇 Security Vulnerabilities
    🆇 Government Compliance
    🆇 Legal Risk
    🆇 Implicit Permissioning
    🆇 Performance Degradation
    🆇 Memory Leaks
    🆇 Transitive Dependency Conflicts
    🆇 Less control
  9. Library Governance

    “The process of managing and controlling the use of software libraries, including acquisition, deployment, use, and maintenance.” - Bard
  10. Seed startup: No policy. Use what’s the fastest.

    Small scale-up: Tech lead or sr eng best judgement. Bias towards speed.
    Medium Sized Co: Bespoke. “If you want to add a new library, come talk to Mobile Platform”
    Large Enterprise: Well defined set of criteria and a responsible team for approval.
  11. Setting up Library Governance

    • Define business priorities
    • Define library requirements
    • Define governance body
    • Define review process
    • Define exception process
    • Define upgrade process
  12. Business priorities

    • Speed to market
    • Developer velocity & staffing
    • App quality and reliability
    • Long term foundation & scale
  13. Uber’s priorities and acceptable risk

    1. App quality and reliability
    2. Long term foundation/scale
    3. Speed to market
    4. Developer velocity & staffing
  14. Third Party Library Requirements (a non-exhaustive list)

    • License
    • Secure
    • Private
    • Stable
    • Mature
    • Maintained
    • Small
    • Industry Standard
    • Testable
    • High Quality
    • Owned internally
    • Category (Platform/Feature)
  15. Upgrades

    • Greenkeeping
    • Similar risk as new libraries
    • Intentional Updates
    • Organizational Cost
  16. Coil ✅

    ✅ Appropriate license (Apache 2.0)
    ✅ Compelling Business Use-case
    ✅ No additional permissions needed
    ✅ Low binary size impact < 50kb
    ✅ Low method count < 200
    ✅ Transitive Deps all in use or reasonable.
    ✅ Standard for Compose image loading
    ✅ Reasonable API that can be flagged
    ✅ No known vulnerabilities
    ✅ Highly used by peer companies
    ✅ Good tests
    ✅ Stable
    ✅ No outside servers or dynamic behavior
    ✅ Regularly maintained
    ✅ No unexpected network or battery effect
    ✅ Reasonable memory profile
  17. Facebook Auth SDK ❌

    ✅ Compelling Business Use-case
    ✅ Security Checks Pass
    ✅ Well Tested
    🆇 Proprietary License
    🆇 Outside infrastructure and APIs
    🆇 Complex Client Side Code
    🆇 Web alternative is feasible
  18. Twilio Video SDK ❌

    ✅ Compelling Business Use-case
    🆇 Closed source
    🆇 High Binary Size > 5 mb
    🆇 Alternative costly
  19. Twilio Video SDK ✅

    ◻ Closed source ✅ → Met with Twilio & Organized clean-room analysis
    ◻ High Binary Size > 5mb ✅ → Dynamic Feature Module + Feature Flag
  20. Google Ads SDK ❌

    ✅ Very compelling Business Use-case
    🆇 Closed source
    🆇 Uncatchable startup code
    🆇 High Binary Size > 1mb
    🆇 Dynamic updates & Internal XP
    🆇 No feasible alternative
  21. Google Ads SDK ✅

    ◻ Closed source ✅ → Met with Google Ads team to understand internals
    ◻ Uncatchable startup code ✅ → Disable startup code with manifest flag
    ◻ High Binary Size > 1mb ✅ → Dynamic Feature Module + Feature Flag
    ◻ Dynamic updates & Internal XP ✅ → Runtime fallback via API check
  22. Life of a commit

    Active Development Week 1 1. Build Train Release 2. Release Testing 3. Employee rollout 0 → 100% 4. Beta rollout 0 → 100% Week 2 Prod rollout 0 → 100% 40% adoption Week 3 65% adoption Week 4 80% adoption Week 5 90% adoption Week 6
  23. Defense Gates

    Stages: Develop (Build), Review, Deploy (app update), Production Rollout
    Gates: Design (PRD/ERD/Figma), Library Governance, Repackaging, Integration Testing, Soak Testing, E2E Testing, Library Abstractions, Monitoring, Feature Flags, Delayed Initialization, Dependency Scanning, Dynamic Features, Employee Testing, Linters
  24. Feature Flags

          class MainActivity {
              fun useSdk() {
                  val useSdk = FeatureFlags.get("UseSdk")
                  if (useSdk) {
                      Sdk.doSomething()
                  } else {
                      // Fallback Experience
                  }
              }
          }
  25. Delayed Initialization

          class MyApp : Application() {
              override fun onCreate() {
                  super.onCreate()
                  Sdk.init()
                  // Continue App setup...
              }
          }
  26. Delayed Initialization

          class MyApp : Application() {
              override fun onCreate() {
                  super.onCreate()
                  // Only initialize the SDK when the flag is on
                  val useSdk = FeatureFlags.get("UseSDK")
                  if (useSdk) {
                      Sdk.init()
                  }
                  // Continue App setup...
              }
          }
  27. Delayed Initialization

          class MyApp : Application() {
              override fun onCreate() {
                  super.onCreate()
                  FeatureFlags.get("UseSDK", Dispatcher.IO) { useSdk ->
                      if (useSdk) {
                          Sdk.init()
                      }
                  }
                  // Continue App setup...
              }
          }
  28. Delayed Initialization

          class SdkFeatureActivity : Activity() {
              override fun onCreate(savedInstanceState: Bundle?) {
                  super.onCreate(savedInstanceState)
                  // Initialize only when the feature that needs the SDK is opened
                  FeatureFlags.get("UseSDK", Dispatcher.IO) { useSdk ->
                      if (useSdk) {
                          Sdk.init()
                      }
                  }
                  // Continue Activity setup...
              }
          }
  29. Bundled Code

    • Broadcast Receivers
    • Intent Filters
    • Content Providers
    • Native Callbacks
    • AIDLs
  30. Play Services

    • Opaque
    • System level permissions
    • Dynamic behavior outside app’s release cadence
    • XP and feature flags in your app
  31. Play Services

    Diagram labels: MyApp.apk; Ads, MLKit, Pay, Recaptcha; Play Services (Business Logic, IPC, Updates, Feature Flag, Experimentation, Dynamic Loading)
  32. Play Services + Dynamic Features

    Diagram labels: MyApp.aab; Onboarding, ML Feature, ML Feature, Ads Feature; Ads, MLKit, Pay, Recaptcha; Play Services (Business Logic, IPC, Updates, Feature Flag, Experimentation, Dynamic Loading)
  33. Dynamic Features

          val installSdk = FeatureFlags.get("InstallSDK")
          val initSdk = FeatureFlags.get("InitSDK")
          if (installSdk) {
              // Download the dynamic feature module that contains the SDK
              SplitInstallManagerFactory.create(context)
                  .startInstall(request)
                  .addOnSuccessListener {
                      if (initSdk) {
                          Sdk.init()
                      }
                  }
                  .addOnFailureListener { exception -> ... }
          }
  34. Jar Shading

          dependencies {
              compile jarjar.repackage {
                  from 'io.reactivex.rxjava2:rxjava:2.1.0'
                  classRename "io.reactivex.rxjava2.**" "com.uber.internal.rxjava2.@1"
              }
          }
  35. Jar Shading

    🆇 Increased App Size
    🆇 Nested Dep Complexity
    🆇 Maintenance
    ✅ Dependency Stability
    ✅ Support multiple versions
    🚨 Use as last resort, prioritize updating all code to single version first!
  36. Library Abstractions

    • Local Abstractions
      ◦ Useful for local utilities with unstable APIs
      ◦ Can enable better testability and feature flagging
      ◦ Replace heavy SDKs with small client REST APIs
    • Server Abstractions
      ◦ Use server side integration instead of client side
  37. Linters

    • Ban known dangerous APIs
    • Shift runtime exceptions left into build time exceptions
  38. Linters

          class Picasso {
              fun load(path: String?): RequestCreator {
                  ...
                  require(path.isNotBlank()) { "Path must not be empty." }
                  return load(Uri.parse(path))
              }
          }
  39. Linters

          fun Picasso.loadSafely(url: String?): RequestCreator {
              if (url != null && url.isEmpty()) {
                  Lumber.monitor("picasso").e("empty strings are not allowed by picasso")
                  return this.load(null as String?)
              }
              return this.load(url)
          }
  40. Linters

          /**
           * Methods that should not be used at all.
           */
          @JvmStatic
          val methods = mapOf(
              "com.squareup.picasso.Picasso.load(kotlin.String?)" to
                  "Empty strings can trigger crashes, use the loadSafely extension.",
          )
  41. Incident Golden Path

    • On-call alert
    • Triage bug
    • Rollback feature flag
    • Monitor
    • Post-mortem
  42. What if that doesn’t work…

    • Automated Crash Recovery
    • Push Based Recovery
    • Multiprocess Agent
    • Hotfixes and Force Upgrade
  43. Automated Crash Recovery

    Flowchart labels: App Start; Boot file Present? (No / Yes); Create Boot File; Startup Steps (Step 1, Step 2, Step N); Remove Boot File; Blackswan (Recovery 1, Recovery 2, Recovery N)
  44. Automated Crash Recovery: Blackswan

    1: Retry
    2: Clear Cache
    3: Clear XPs
    4: Clear Data
    5: Webview Fallback
  45. Server Based Rules

    • Pushed Feature Flags
    • Blackswan Custom Recovery Actions
    • DNS + Firebase Remote Config
  46. Multiprocess Agent (*Future opportunity)

    Diagram labels: Uber App; App Process (Startup Steps, App Runtime); Recovery Process (Blackswan, Feature Flags, DNS + Remote Config); IPC
  47. Hotfixes and Force Upgrades

    • Realtime mitigations are much faster
    • Hotfix introduces additional risk
    • Force upgrades cause user attrition
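
The boot-file flow named on the Automated Crash Recovery slides can be sketched in plain Kotlin. This is a minimal illustration under stated assumptions, not Uber's implementation: `BootGuard`, `Recovery`, and the idea of storing a failed-launch counter inside the boot file are all hypothetical, and the escalation order simply follows the five Blackswan levels listed in the transcript.

```kotlin
import java.io.File

// Hypothetical recovery ladder, least to most disruptive, mirroring the
// five levels on the Blackswan slide.
enum class Recovery { RETRY, CLEAR_CACHE, CLEAR_XPS, CLEAR_DATA, WEBVIEW_FALLBACK }

// Hypothetical helper: guards app startup with a marker ("boot") file.
class BootGuard(private val bootFile: File) {

    /**
     * Runs [startup]. If the previous launch never finished startup, the boot
     * file is still present, so [recover] is called first with the next
     * recovery level. Returns the level applied, or null for a clean start.
     */
    fun start(recover: (Recovery) -> Unit, startup: () -> Unit): Recovery? {
        // Assumption: the file stores how many consecutive launches failed.
        val failedBoots = if (bootFile.exists()) {
            bootFile.readText().trim().toIntOrNull() ?: 1
        } else {
            0
        }

        // Escalate one level per consecutive failure, capped at the last level.
        val levels = Recovery.values()
        val level = if (failedBoots > 0) levels[minOf(failedBoots, levels.size) - 1] else null
        level?.let(recover)

        bootFile.writeText((failedBoots + 1).toString()) // mark "boot in progress"
        startup()                                        // a crash here leaves the file behind
        bootFile.delete()                                // startup finished, clear the marker
        return level
    }
}
```

On a real device the marker would live in app-private storage and `recover` would map each level to an actual action such as clearing caches or cached experiment values. Keeping the counter in the same file means the guard needs no other state to survive a crash.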
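
The Library Abstractions slide notes that a local abstraction can improve testability and feature flagging. A rough sketch of that idea follows. Every name here (`ImageLoader`, `VendorImageLoader`, `GuardedImageLoader`, `PlaceholderImageLoader`) is hypothetical, the vendor SDK is stubbed as a lambda, and catching exceptions to fall back is an assumption of this sketch rather than something the talk prescribes.

```kotlin
// App-owned interface: feature code depends on this, never on the vendor SDK.
interface ImageLoader {
    fun load(url: String): String
}

// Adapter: the only class that would touch the third-party API (stubbed here).
class VendorImageLoader(private val vendorLoad: (String) -> String) : ImageLoader {
    override fun load(url: String): String = vendorLoad(url)
}

// Fallback experience that has no third-party dependency.
class PlaceholderImageLoader : ImageLoader {
    override fun load(url: String): String = "placeholder"
}

// Routes calls through a feature flag and falls back if the SDK throws.
class GuardedImageLoader(
    private val sdkEnabled: () -> Boolean,
    private val primary: ImageLoader,
    private val fallback: ImageLoader,
) : ImageLoader {
    override fun load(url: String): String {
        if (!sdkEnabled()) return fallback.load(url)
        return try {
            primary.load(url)
        } catch (e: Exception) {
            fallback.load(url) // keep the screen usable when the SDK misbehaves
        }
    }
}
```

Because callers only see `ImageLoader`, tests can substitute a fake, and turning the flag off removes the vendor code path without an app release.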