Applied Science Fiction: Operating a Research-Led Product

Presented at SRECon 2022.

Noah Kantrowitz

March 14, 2022

Transcript

  1. APPLIED SCIENCE FICTION » Operating a Research-Led Product » Noah Kantrowitz - @kantrn - SRECon 2022 - March 14
  2. NOAH KANTROWITZ » He/him » coderanger.net / @kantrn » Kubernetes and Python » SRE/Platform for Geomagical Labs, a division of IKEA » We do CV/AR for the home
  3. DEV SCI OPS » Machine Learning++ » Traditional CV algorithms, new research papers, everything » A lot of experimentation, but also running in real-time » Needs: autonomy, stability, performance » Wants: "play with cool new science", iterate fast
  4. WHY RESEARCHER LED? » ~1 platform engineer : 4 researchers » SRE is me, I am SRE, it me » Post deployment testing » (I think ops should be a service provider anyway)
  5. BUILDING GUARDRAILS » (Most) researchers are not engineers » And don't want to be engineers » Build safe paths to move forward » Strong mutual respect for specialization » Progress with confidence and quality
  6. CI » Nightly -> per-commit » Buildkite for custom pre-processing » Testing with GPUs: Kubernetes Jobs
  7. WRITING TESTS? » Functional tests » Easy harness » Numeric metrics » Unit tests, as applicable and comfortable
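
One way a functional test with numeric metrics might look is a small threshold check over the numbers a run produces. This is only a sketch: the function name `check_metrics`, the metric names (`iou`, `runtime_s`), and the threshold format are all hypothetical and do not come from the talk.

```python
def check_metrics(metrics, thresholds):
    """Compare measured metrics against thresholds.

    `thresholds` maps a metric name to ("min" | "max", limit).
    Returns a list of human-readable failures; empty means the run passed.
    """
    failures = []
    for name, (kind, limit) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif kind == "min" and value < limit:
            failures.append(f"{name}: {value} < {limit}")
        elif kind == "max" and value > limit:
            failures.append(f"{name}: {value} > {limit}")
    return failures


# Hypothetical thresholds for one test case in a corpus.
THRESHOLDS = {"iou": ("min", 0.8), "runtime_s": ("max", 30.0)}
```

A harness could call the pipeline on a known input, collect its metrics into a dict, and fail the test when `check_metrics` returns anything.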
  8. CODE REVIEW » Deputize the most technical » Code quality, obvious performance issues, test coverage » Science review: I stay out of it
  9. STATIC ANALYSIS » Pre-Commit: tool runner » Black: code formatter » Isort: more code formatting » Flake8: code quality » Overall well received
  10. VERSIONING » SemVer in our hearts » mymod-latest » Branch main » mymod-x.y » Branch release-x.y for hotfixes
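
Reading the slide as "branch main builds mymod-latest, branch release-x.y builds mymod-x.y" (an interpretation, since the transcript flattens the bullets), the mapping could be sketched as below. The function name `instance_name` and the exact branch pattern are assumptions for illustration.

```python
import re
from typing import Optional

# Assumed branch naming: "release-<major>.<minor>", as suggested by the slide.
RELEASE_BRANCH = re.compile(r"^release-(\d+)\.(\d+)$")


def instance_name(module: str, branch: str) -> Optional[str]:
    """Map a git branch to the instance it would deploy, or None if it deploys nothing."""
    if branch == "main":
        return f"{module}-latest"
    match = RELEASE_BRANCH.match(branch)
    if match:
        major, minor = match.groups()
        return f"{module}-{major}.{minor}"
    return None
```

Under this reading, `instance_name("mymod", "release-1.4")` gives `mymod-1.4`, and feature branches deploy nothing.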
  11. PARALLEL INSTANCES » New release, new instance » Latest builds update in place » Old versions are left time-locked » Unless there's a critical issue » Pipeline definitions pin module versions
  12. AUTOSCALING » Running every version forever is $$$ » We already have a queue driven architecture » Autoscaling! » ??? » Profit
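
The talk does not say how the autoscaler decides; one common shape for queue-driven scaling is to size each version's worker pool from its queue depth and let idle versions drop to zero. A minimal sketch, where `desired_workers` and its parameters are hypothetical:

```python
import math


def desired_workers(queue_depth: int, jobs_per_worker: int = 4, max_workers: int = 10) -> int:
    """Pick a worker count for one module version from its queue depth.

    An empty queue scales to zero, so old time-locked versions cost nothing
    until a pipeline that pins them actually enqueues work.
    """
    if queue_depth <= 0:
        return 0
    return min(max_workers, math.ceil(queue_depth / jobs_per_worker))
```

The trade-off of scaling to zero is cold-start latency for rarely used versions, which may or may not be acceptable for real-time work.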
  13. LATEST » Work collisions » Like an unpinned dependency » Needed for corpus runs » Overall positive
  14. PIPELINE DEFINITIONS » Declarative is nice » DSL is more realistic » Compilable DSL is best
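
To illustrate what a "compilable DSL" could mean here, the sketch below defines steps in Python and compiles them into a plain declarative dict, checking pins and dependencies along the way. None of these names (`Step`, `compile_pipeline`, the `queue` field) come from the talk; they are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    module: str
    version: str        # a pinned "x.y", or "latest"
    after: tuple = ()   # names of steps that must finish first


def compile_pipeline(name, steps, allow_latest=False):
    """Validate the steps and emit a declarative dict form of the pipeline.

    Steps must be listed after everything they depend on, which also
    guarantees the result is a DAG (no cycles are expressible).
    """
    known = set()
    compiled = []
    for step in steps:
        if step.version == "latest" and not allow_latest:
            raise ValueError(f"step {step.name!r} must pin a module version")
        missing = [dep for dep in step.after if dep not in known]
        if missing:
            raise ValueError(f"step {step.name!r} depends on unknown steps: {missing}")
        known.add(step.name)
        compiled.append({
            "name": step.name,
            "queue": f"{step.module}-{step.version}",
            "after": list(step.after),
        })
    return {"pipeline": name, "steps": compiled}
```

The point of compiling is that mistakes such as an unpinned module or a dangling dependency surface when the definition is built, not when a work unit is halfway through production.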
  15. RELEASE FLOW » Develop locally » Make a PR, code review, etc » Merge to main » Test -latest in staging » Tag a release » Make a new pipeline » Test new pipeline in staging » Propose new pipeline for prod » Team approvals » Mark pipeline as default » Smoke test in prod » Repeat
  16. RUNTIME SUPPORT » Helper libraries » Format conversion » Logging and metrics » Common algorithms » Asset sync, in and out » Retries!
  17. NO REALLY, RETRIES! » CELERY_ACKS_LATE » Queue based retries » Pipeline based retries
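
`CELERY_ACKS_LATE` (named `task_acks_late` in Celery's newer lowercase settings) makes a worker acknowledge a message only after the task finishes, so a task interrupted by a dying worker can be redelivered; this only works well when tasks are safe to run twice. The pipeline-level layer is not described in the talk, so the following is a stdlib-only sketch of what such a retry wrapper might look like. `run_with_retries` and `StepFailed` are hypothetical names.

```python
import time


class StepFailed(Exception):
    """Raised when a step has used up all of its attempts."""


def run_with_retries(step, *, attempts=3, base_delay=0.0, sleep=time.sleep):
    """Call `step()` until it succeeds or `attempts` runs out.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts.
    `sleep` is injectable so tests do not have to wait.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except Exception as exc:  # a real pipeline would likely narrow this
            last_error = exc
            if attempt < attempts:
                sleep(base_delay * 2 ** (attempt - 1))
    raise StepFailed(f"gave up after {attempts} attempts") from last_error
```

Layering the two means the queue covers infrastructure failures (a worker disappears) while the pipeline covers step failures (the code raised).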
  18. TIMEOUTS » Trust with retries but verify with timeouts » Nested timeouts » Work unit » Individual steps » If it has a timeout, it should also have a metric
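
A rough sketch of nested timeouts with a metric per timeout, assuming each step gets the smaller of its own timeout and whatever remains of the work unit's budget. The names `run_unit` and `TIMEOUTS` are hypothetical, and the `Counter` is a stand-in for a real metrics client. Note that a thread-based timeout like this stops waiting for the step but does not kill it; a production system would more likely enforce limits at the process or task-queue level.

```python
import time
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

TIMEOUTS = Counter()  # stand-in for a metrics client: one counter per timeout scope


def run_unit(steps, unit_timeout):
    """Run (name, func, step_timeout) tuples under one work-unit deadline."""
    deadline = time.monotonic() + unit_timeout
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        results = {}
        for name, func, step_timeout in steps:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                TIMEOUTS["work_unit"] += 1
                raise TimeoutError("work unit deadline exceeded")
            # The inner timeout can never outlive the outer one.
            budget = min(step_timeout, remaining)
            future = pool.submit(func)
            try:
                results[name] = future.result(timeout=budget)
            except FutureTimeout:
                scope = f"step:{name}" if step_timeout <= remaining else "work_unit"
                TIMEOUTS[scope] += 1
                raise TimeoutError(f"{name} timed out after {budget:.2f}s")
        return results
    finally:
        pool.shutdown(wait=False)  # do not block on a step that is still running
```

Recording which scope fired makes it possible to tell a single slow step apart from a work unit that is slow overall.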
  19. ASIDE: STRUCTURED LOGGING » You already know this is good » But really, it is » logfmt is a nice mix of parsing but also humans » Context variables that attach to every log line » ts=2022-... level=INFO msg="Hello world" run_id=6362
  20. TESTING WEEK » No new features or fixes » All quality all the time » Great place to introduce new tools » Gets everyone talking about tests/quality
  21. PLATFORM TOOLS » Django admin » View customization‽ » Grafana and Loki » Domo » Slack notifications
  22. CORPUS MANAGEMENT » Production edge cases? » Personal info (PII) » Easy copy tools w/ anonymization » Paid testers whenever possible
  23. PLANNING » Cross-team OKRs » Team-level plans » Two week discussion cycles » Talk to each other!
  24. BAD ERROR MESSAGES » Pip and Poetry » Docker » The usual suspects » "Could you look at this build failure?"
  25. LICENSING » What is safe to use » Experiments vs. production » I am not a lawyer, talk to your legal team » Internal training
  26. ON CALL » Just hasn't come up yet » Do it when needed » Likely unpopular
  27. » Engineering tools work for research too » As long as you build them well » DAGs are your friend » New versions as new deploys » Retry failures everywhere » Observability is for everyone » Plan together, succeed together
  28. » Intros » DevSciOps » ML++ » Needs » Wants » Building Guardrails