Applied Science Fiction: Operating a Research-Led Product

Presented at SRECon 2022.

Noah Kantrowitz

March 14, 2022

Transcript

  1. APPLIED SCIENCE FICTION: Operating a Research-Led Product » Noah Kantrowitz - @kantrn - SRECon 2022 - March 14
  2. NOAH KANTROWITZ » He/him » coderanger.net / @kantrn » Kubernetes and Python » SRE/Platform for Geomagical Labs, a division of IKEA » We do CV/AR for the home
  3. DEV SCI OPS » Machine Learning++ » Traditional CV algorithms, new research papers, everything » A lot of experimentation, but also running in real-time » Needs: autonomy, stability, performance » Wants: "play with cool new science", iterate fast
  4. WHY RESEARCHER LED? » ~1 platform engineer : 4 researchers » SRE is me, I am SRE, it me » Post deployment testing » (I think ops should be a service provider anyway)
  5. BUILDING GUARDRAILS » (Most) researchers are not engineers » And don't want to be engineers » Build safe paths to move forward » Strong mutual respect for specialization » Progress with confidence and quality
  6. CI » Nightly -> per-commit » Buildkite for custom pre-processing » Testing with GPUs: Kubernetes Jobs
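One way the "Testing with GPUs: Kubernetes Jobs" bullet could look in practice is a CI step that builds a Job manifest requesting a GPU and hands it to the cluster. The sketch below is an illustration only: the function name, image, and test command are hypothetical, and it assumes NVIDIA's device plugin, which exposes GPUs as the `nvidia.com/gpu` resource. The talk does not say how the real jobs are defined.

```python
import json


def gpu_test_job(name: str, image: str, command: list[str], gpus: int = 1) -> dict:
    """Build a batch/v1 Job manifest that runs a test command on a GPU node."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,  # fail fast; let CI decide whether to rerun
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "tests",
                            "image": image,
                            "command": command,
                            # Assumes the NVIDIA device plugin is installed.
                            "resources": {"limits": {"nvidia.com/gpu": gpus}},
                        }
                    ],
                }
            },
        },
    }


# Hypothetical job name, image, and test command.
manifest = gpu_test_job(
    "mymod-tests-abc123",
    "registry.example/mymod:abc123",
    ["pytest", "tests/functional"],
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be piped to `kubectl apply -f -`, since kubectl accepts JSON manifests as well as YAML.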
  7. WRITING TESTS? » Functional tests » Easy harness » Numeric metrics » Unit tests, as applicable and comfortable
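A minimal sketch of what an "easy harness" with numeric metrics might look like: the researcher supplies a function and reference data, and the harness reduces the comparison to one number checked against a threshold. All names here (`MetricCheck`, `run_functional_test`, the choice of mean absolute error) are assumptions for illustration, not the harness described in the talk.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class MetricCheck:
    name: str
    value: float
    threshold: float

    @property
    def passed(self) -> bool:
        # Lower is better for an error metric.
        return self.value <= self.threshold


def mean_abs_error(predicted: Sequence[float], expected: Sequence[float]) -> float:
    if not expected or len(predicted) != len(expected):
        raise ValueError("predicted and expected must be non-empty and equal length")
    return sum(abs(p - e) for p, e in zip(predicted, expected)) / len(expected)


def run_functional_test(
    step: Callable[[Sequence[float]], Sequence[float]],
    inputs: Sequence[float],
    expected: Sequence[float],
    max_error: float,
) -> MetricCheck:
    """Run one step end to end and score its output against reference data."""
    return MetricCheck("mean_abs_error", mean_abs_error(step(inputs), expected), max_error)


# Hypothetical stand-in for a research algorithm.
def double(values):
    return [2.0 * v for v in values]


check = run_functional_test(double, [1.0, 2.0, 3.0], [2.0, 4.0, 6.1], max_error=0.05)
print(check.name, round(check.value, 4), "PASS" if check.passed else "FAIL")
```

Reporting the metric value rather than only pass/fail lets a reviewer see whether a change moved quality, even when the test still passes.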
  8. CODE REVIEW » Deputize the most technical » Code quality, obvious performance issues, test coverage » Science review: I stay out of it
  9. STATIC ANALYSIS » Pre-Commit: tool runner » Black: code formatter » Isort: more code formatting » Flake8: code quality » Overall well received
  10. VERSIONING » SemVer in our hearts » mymod-latest » Branch main » mymod-x.y » Branch release-x.y for hotfixes
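Read literally, the slide pairs each branch with an instance name: `main` feeds `mymod-latest`, and `release-x.y` feeds `mymod-x.y`. The sketch below encodes that reading as a small mapping function. The function name and the choice to return `None` for other branches are assumptions.

```python
import re
from typing import Optional

RELEASE_BRANCH = re.compile(r"^release-(\d+)\.(\d+)$")


def instance_name(module: str, branch: str) -> Optional[str]:
    """Map a git branch to the instance it deploys to, or None if it deploys nowhere."""
    if branch == "main":
        return f"{module}-latest"
    match = RELEASE_BRANCH.match(branch)
    if match:
        major, minor = match.groups()
        return f"{module}-{major}.{minor}"
    return None  # assumption: feature branches do not get an instance


print(instance_name("mymod", "main"))
print(instance_name("mymod", "release-1.4"))
print(instance_name("mymod", "feature/new-science"))
```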
  11. PARALLEL INSTANCES » New release, new instance » Latest builds update in place » Old versions are left time-locked » Unless there's a critical issue » Pipeline definitions pin module versions
  12. AUTOSCALING » Running every version forever is $$$ » We already have a queue driven architecture » Autoscaling! » ??? » Profit
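With a queue per version, the scaling decision can be as simple as sizing each worker pool from its queue depth, so idle old versions drop to zero workers. This is a sketch of that idea only: the function name, the jobs-per-worker ratio, and the worker cap are made-up parameters, and the talk does not say which autoscaler is used.

```python
import math


def desired_workers(queue_depth: int, jobs_per_worker: int = 4, max_workers: int = 10) -> int:
    """Return how many workers a version's queue needs right now."""
    if queue_depth <= 0:
        return 0  # nothing queued: an old version costs nothing while idle
    return min(max_workers, math.ceil(queue_depth / jobs_per_worker))


for depth in (0, 1, 9, 1000):
    print(depth, "->", desired_workers(depth))
```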
  13. LATEST » Work collisions » Like an unpinned dependency » Needed for corpus runs » Overall positive
  14. PIPELINE DEFINITIONS » Declarative is nice » DSL is more realistic » Compilable DSL is best
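To illustrate "compilable DSL": pipelines are written with a small Python builder, then compiled into a plain declarative structure that pins each module version and has been checked to be a valid DAG. The `Pipeline` class, step names, and module names below are hypothetical. This sketches the general pattern, not the DSL from the talk.

```python
from graphlib import CycleError, TopologicalSorter


class Pipeline:
    """Tiny builder DSL that compiles to a declarative, ordered step list."""

    def __init__(self, name):
        self.name = name
        self.steps = {}

    def step(self, name, module, version, after=()):
        self.steps[name] = {"module": module, "version": version, "after": list(after)}
        return name  # returned so later steps can reference it

    def compile(self):
        graph = {name: set(spec["after"]) for name, spec in self.steps.items()}
        unknown = {dep for deps in graph.values() for dep in deps} - set(graph)
        if unknown:
            raise ValueError(f"unknown dependencies: {sorted(unknown)}")
        # static_order() raises CycleError if the steps do not form a DAG.
        order = list(TopologicalSorter(graph).static_order())
        return {
            "pipeline": self.name,
            "steps": [{"name": name, **self.steps[name]} for name in order],
        }


# Hypothetical pipeline with pinned module versions.
p = Pipeline("room-scan-v7")
depth = p.step("depth", "depthmod", "2.3")
segment = p.step("segment", "segmod", "1.8")
p.step("fuse", "fusemod", "4.0", after=[depth, segment])
compiled = p.compile()
print([s["name"] for s in compiled["steps"]])
```

Compiling up front means a typo in a dependency or an accidental cycle fails at definition time instead of halfway through a run.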
  15. RELEASE FLOW » Develop locally » Make a PR, code review, etc » Merge to main » Test -latest in staging » Tag a release » Make a new pipeline » Test new pipeline in staging » Propose new pipeline for prod » Team approvals » Mark pipeline as default » Smoke test in prod » Repeat
  16. RUNTIME SUPPORT » Helper libraries » Format conversion » Logging and metrics » Common algorithms » Asset sync, in and out » Retries!
  17. NO REALLY, RETRIES! » CELERY_ACKS_LATE » Queue based retries » Pipeline based retries
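`CELERY_ACKS_LATE` is the older uppercase name of Celery's `task_acks_late` setting, which acknowledges a message after the task finishes rather than before it starts, so work held by a crashed worker is redelivered. That covers the queue layer. For the pipeline layer, a minimal sketch of a retry wrapper with exponential backoff might look like the following. The function name and parameters are assumptions, and it only makes sense for steps that are safe to run more than once.

```python
import time


def run_with_retries(step, *, attempts=3, base_delay=0.5, retry_on=(Exception,), sleep=time.sleep):
    """Call step() until it succeeds or attempts run out, backing off between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))


# Hypothetical flaky step: fails twice, then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("asset store unavailable")
    return "ok"


delays = []  # record the delays instead of actually sleeping
result = run_with_retries(flaky, sleep=delays.append)
print(result, calls["count"], delays)
```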
  18. TIMEOUTS » Trust with retries but verify with timeouts » Nested timeouts » Work unit » Individual steps » If it has a timeout, it should also have a metric
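A sketch of nested timeouts, assuming steps are async callables: each step gets its own timeout, capped by whatever remains of the enclosing work unit's budget, and every timeout increments a counter. The `TIMEOUTS` counter stands in for a real metrics client, and all names are hypothetical.

```python
import asyncio
import time
from collections import Counter

TIMEOUTS = Counter()  # stand-in for a real metrics client


async def run_step(name, step, step_timeout, deadline):
    """Run one step under the tighter of its own timeout and the work unit deadline."""
    remaining = deadline - time.monotonic()
    if remaining <= 0:
        TIMEOUTS[("work_unit", name)] += 1
        raise asyncio.TimeoutError(f"work unit deadline passed before {name}")
    step_is_tighter = step_timeout <= remaining
    try:
        return await asyncio.wait_for(step(), timeout=min(step_timeout, remaining))
    except asyncio.TimeoutError:
        # Record which level of the nesting actually fired.
        TIMEOUTS[("step" if step_is_tighter else "work_unit", name)] += 1
        raise


async def run_work_unit(steps, unit_timeout):
    deadline = time.monotonic() + unit_timeout
    return [await run_step(name, step, timeout, deadline) for name, step, timeout in steps]


async def fast():
    await asyncio.sleep(0.01)
    return "fast done"


async def slow():
    await asyncio.sleep(1.0)
    return "slow done"


try:
    asyncio.run(run_work_unit([("fast", fast, 0.5), ("slow", slow, 0.05)], unit_timeout=2.0))
    timed_out = False
except asyncio.TimeoutError:
    timed_out = True
print(timed_out, dict(TIMEOUTS))
```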
  19. ASIDE: STRUCTURED LOGGING » You already know this is good » But really, it is » logfmt is a nice mix of parsing but also humans » Context variables that attach to every log line » ts=2022-... level=INFO msg="Hello world" run_id=6362
  20. TESTING WEEK » No new features or fixes » All quality all the time » Great place to introduce new tools » Gets everyone talking about tests/quality
  21. PLATFORM TOOLS » Django admin » View customization‽ » Grafana and Loki » Domo » Slack notifications
  22. CORPUS MANAGEMENT » Production edge cases? » Personal info (PII) » Easy copy tools w/ anonymization » Paid testers whenever possible
  23. PLANNING » Cross-team OKRs » Team-level plans » Two week discussion cycles » Talk to each other!
  24. BAD ERROR MESSAGES » Pip and Poetry » Docker » The usual suspects » "Could you look at this build failure?"
  25. LICENSING » What is safe to use » Experiments vs. production » I am not a lawyer, talk to your legal team » Internal training
  26. ON CALL » Just hasn't come up yet » Do it when needed » Likely unpopular
  27. » Engineering tools work for research too » As long as you build them well » DAGs are your friend » New versions as new deploys » Retry failures everywhere » Observability is for everyone » Plan together, succeed together
  28. » Intros » DevSciOps » ML++ » Needs » Wants » Building Guardrails