State of MLOps in 2022

Asei Sugiyama
February 24, 2023

A summary of MLOps in 2022. This deck was used for a Money Forward internal event. Opinions are my own.

This deck summarizes MLOps efforts in 2022. A companion deck also exists; it is in Japanese but contains some non-overlapping content: https://speakerdeck.com/asei/mlops-nokoremadetokorekara-eb9ed3f9-3635-4709-8b48-5ccf201c4ae7 . This deck was prepared for an internal study session on MLOps held at Money Forward.

Note: although this deck discusses the activities of various organizations, the views expressed are the author's own. They are based on materials publicly available at the time of writing and do not represent the official positions of those organizations.

Transcript

  1. TL;DR: There was a lot of progress on the technical side; for example, many players now build their own ML pipelines. That technical progress has exposed problems with machine learning, especially fairness, security, and transparency. And although it is a well-established field with a long history, statistical measurement of the impact of ML still has unsolved problems. This year, governments are planning to define standards and regulations (e.g., the EU AI Act). Their final shape is still unclear, but we must keep watching this activity.
  2. TOC: MLOps in 2022 (Technical <-, Measurement, Process); MLOps in 2023 (Measurement, Process, Culture, Regulations & Standards)
  3. The traditional fairness problem: unfairness based on sensitive features (race, gender, etc.), measured by the difference of conditional probabilities (a sketch follows below). Fairness-Aware Machine Learning and Data Mining, p. 20
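
To make the measurement concrete (my illustration, not from the deck): one common fairness metric compares the conditional probabilities of a positive prediction across groups. A minimal NumPy sketch, where `y_pred` and `group` are hypothetical arrays:

```python
import numpy as np

def demographic_parity_difference(y_pred, group):
    """Max difference of P(y_hat = 1 | group) across groups."""
    rates = [y_pred[group == g].mean() for g in np.unique(group)]
    return max(rates) - min(rates)

# Hypothetical binary predictions for two groups (0 and 1).
y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0])
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(demographic_parity_difference(y_pred, group))  # 0.75 - 0.25 = 0.5
```
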
  4. Unfairness in generative AI is a well-known problem: if the dataset is biased with respect to a sensitive feature, a generative model trained on it may generate biased images. Note: the image on the right was not generated naturally but by forcing the model to create a cute-girl image. "I claimed that a world-famous picture is a picture of a girl to AI." https://youtu.be/MZUtn9EvoRo
  5. Gender bias: illustration-generation models tend to generate girls. This is caused by bias in the training dataset. 珠洲ノらめる on Twitter: "Hakushu personified...!!!!! This is amazing, AI-chan!!!? (๑ °⌓ °๑ )" https://twitter.com/piyo_rameru/status/1615156487801950211?s=20&t=lZhYKkGcaAeyW7aXERvlpw
  6. Unfairness of rewards: the developers of generative models can gain wealth, while the creators of the original images cannot earn any money. Getty Images is suing the creators of AI art tool Stable Diffusion for scraping its content - The Verge https://www.theverge.com/2023/1/17/23558516/ai-art-copyright-stable-diffusion-getty-images-lawsuit
  7. Awful AI Award: 2022. daviddao/awful-ai: Awful AI is a curated list to track current scary usages of AI - hoping to raise awareness. https://github.com/daviddao/awful-ai
  8. Who gained from OpenAI? A huge investment flows from Microsoft to OpenAI; there is no news of any comparable investment from Microsoft to the creators of the training dataset. Inside the structure of OpenAI's looming new investment from Microsoft and VCs | Fortune https://fortune.com/2023/01/11/structure-openai-investment-microsoft/
  9. A similar problem: annotation. "I see companies get this wrong all the time: their in-house annotation teams are left in the dark about the impact that their work is having on daily or long-term goals. That's disrespectful to the people doing the work and will lead to poor motivation, high churn, and low quality annotations. So, it doesn't help anyone." Human-in-the-Loop Machine Learning https://www.manning.com/books/human-in-the-loop-machine-learning
  10. Fairness: recap. Before the rise of Stable Diffusion, the unfairness problem was framed around bias in the dataset. We now face a new unfairness problem: unfairness in the data collection process and in the business model. Checking the dataset for bias is not enough; we also have to examine the business model behind the ML model.
  11. Real Attackers Don't Compute Gradients. "real-world evidence suggests that actual attackers use simple tactics to subvert ML-driven systems, and as a result security practitioners have not prioritized adversarial ML defenses." "Real Attackers Don't Compute Gradients": Bridging the Gap Between Adversarial ML Research and Practice https://arxiv.org/abs/2212.14315
  12. Case 1. Facebook (1/2). An attacker attempts to spread spam on Facebook. For example, they want to post a pornographic image with some text, which may lure a user into clicking an embedded URL. The attacker, aware of the existence of the ML system, tries to evade the detector by perturbing the content and/or changing their behavior. "Real Attackers Don't Compute Gradients": Bridging the Gap Between Adversarial ML Research and Practice https://arxiv.org/abs/2212.14315
  13. Case 1. Facebook (2/2). Multi-layered security: Automation (bot detection), Access (deny illegitimate access), Activity (spam detection), Application (hate speech classifier, nudity detector). The first three layers are standard security practice. "Real Attackers Don't Compute Gradients": Bridging the Gap Between Adversarial ML Research and Practice https://arxiv.org/abs/2212.14315
  14. Case 2. Phishing webpage detection (1/2). A phishing webpage detector (image classification, input form detection, etc.). Attackers try to slip past the phishing detector by masking, cropping, blurring, etc. "Real Attackers Don't Compute Gradients": Bridging the Gap Between Adversarial ML Research and Practice https://arxiv.org/abs/2212.14315
  15. Case 2. Phishing webpage detection (2/2). "Real Attackers Don't Compute Gradients": Bridging the Gap Between Adversarial ML Research and Practice https://arxiv.org/abs/2212.14315
  16. The reality of ML security (my assumptions): a flood of poor-quality attack attempts; full adversarial attacks seem too expensive for attackers. Even users without malicious intent tend to try to change the system's behavior ("Please don't forget to like and subscribe to my channel!"). Malicious users may not behave in the same manner as regular users; for example, they make a huge number of attempts.
  17. Machine Learning Lens (best practices from AWS). MLSEC-10: Protect against data poisoning threats. "Protect against data injection and data manipulation that pollutes the training dataset. Data injections add corrupt training data that will result in incorrect model and outputs. Data manipulations change existing data (for example labels) that can result in inaccurate and weak predictive models. Identify and address corrupt data and inaccurate models using security methods and anomaly detection algorithms." (A sketch of such screening follows below.) MLSEC-10: Protect against data poisoning threats - Machine Learning Lens https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/mlsec-10.html
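
One way to act on this recommendation (my sketch, not AWS's code): screen new training data with an off-the-shelf anomaly detector before it joins the training pool. `X_trusted` and `X_new` are hypothetical feature matrices:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_trusted = rng.normal(0, 1, size=(1000, 8))         # vetted historical data
X_new = np.vstack([rng.normal(0, 1, size=(95, 8)),   # plausible new samples
                   rng.normal(6, 1, size=(5, 8))])   # injected outliers

detector = IsolationForest(random_state=0).fit(X_trusted)
is_inlier = detector.predict(X_new) == 1             # -1 marks anomalies
print(f"kept {is_inlier.sum()} of {len(X_new)} samples; review the rest")
```
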
  18. Machine Learning Lens (best practices from AWS). MLSEC-11: Protect against adversarial and malicious activities. "Add protection inside and outside of the deployed code to detect malicious inputs that might result in incorrect predictions. Automatically detect unauthorized changes by examining the inputs in detail. Repair and validate the inputs before they are added back to the pool." MLSEC-11: Protect against adversarial and malicious activities - Machine Learning Lens https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/mlsec-11.html
  19. Metamorphic testing: add "noise" to the test dataset; equivalent to augmentation (a sketch follows below). Metamorphic Testing of Machine-Learning Based Systems | by Teemu Kanstrén | Towards Data Science https://towardsdatascience.com/metamorphic-testing-of-machine-learning-based-systems-e1fe13baf048
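
A minimal sketch of the technique, assuming a hypothetical `model` with a scikit-learn-style `predict`: apply a label-preserving perturbation and check that predictions barely change.

```python
import numpy as np

def metamorphic_noise_test(model, X, noise_scale=0.01, min_agreement=0.95):
    """Metamorphic relation: small input noise should not flip predictions."""
    rng = np.random.default_rng(0)
    y_base = model.predict(X)
    y_noisy = model.predict(X + rng.normal(0, noise_scale, size=X.shape))
    agreement = float(np.mean(y_base == y_noisy))
    assert agreement >= min_agreement, f"only {agreement:.1%} of predictions stable"
```
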
  20. Security: recap. Traditional ML security work has mainly focused on research settings. As machine learning systems spread much more broadly, we can observe more realistic attacks ("Real Attackers Don't Compute Gradients"). Use augmentation and metamorphic testing to build robust (secure) models.
  21. Mechanical Turk as a holy grail of annotation: crowdsourcing, benchmarking & other cool things. https://www.image-net.org/static_files/papers/ImageNet_2010.pdf
  22. Reproducibility crisis: a well-known crisis in psychology; the same problem exists in ML. "we found 20 reviews across 17 scientific fields that find errors in a total of 329 papers that use ML-based science." Reproducibility workshop https://sites.google.com/princeton.edu/rep-workshop
  23. Too Good to Be True: Bots and Bad Data From Mechanical Turk. "I summarize my own experience with MTurk and how I deduced that my sample was—at best—only 2.6% valid, by my estimate" Too Good to Be True: Bots and Bad Data From Mechanical Turk - Margaret A. Webb, June P. Tangney, 2022 https://journals.sagepub.com/doi/10.1177/17456916221120027
  24. Too Good to Be True: Bots and Bad Data From Mechanical Turk - Margaret A. Webb, June P. Tangney, 2022 https://journals.sagepub.com/doi/10.1177/17456916221120027
  25. Eligibility criteria (529 -> 336): target age 18-24 years old; an MTurk filter of 18 to 25 years plus an additional question about age. Consent quiz (336 -> 200): a quiz about their right to end participation, their right to confidentiality, and researchers' ability to contact them (threshold: 2/3). Completion (200 -> 140): some participants didn't finish the 45-min survey. Too Good to Be True: Bots and Bad Data From Mechanical Turk - Margaret A. Webb, June P. Tangney, 2022 https://journals.sagepub.com/doi/10.1177/17456916221120027
  26. Attention checks (140 -> 124): participants who selected another option even when the item read "1 – Select this option" were dropped (140 participants -> 124 participants). Unrealistic response time (124 -> 77): finished too quickly (less than 20 min) or took too long (several hours).
  27. Examination of qualitative responses (77 -> 14): answers to the following prompts had to be consistent: "Who are you?"; "Write ten sentences below, describing yourself as you are today."; "Who will you be? Think about 1 week [1 year/10 years] from today. Write ten sentences below, describing yourself as you imagine you will be in 1 week." If a participant answers "a great man" to one question and "a great woman" to another, they fail this filter. (A sketch of the whole funnel follows below.)
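
For illustration only (my reconstruction, not the paper's code), the whole funnel from slides 25-27 can be written as successive filters over a hypothetical table of responses; every column name here is invented:

```python
import pandas as pd

df = pd.read_csv("responses.csv")              # hypothetical raw MTurk export
df = df[df["age"].between(18, 24)]             # eligibility:        529 -> 336
df = df[df["consent_quiz_score"] >= 2]         # consent quiz (2/3): 336 -> 200
df = df[df["completed_survey"]]                # finished 45 min:    200 -> 140
df = df[df["attention_checks_passed"]]         # attention checks:   140 -> 124
df = df[df["minutes_taken"].between(20, 120)]  # plausible duration: 124 -> 77
df = df[df["qualitative_consistent"]]          # consistent text:     77 -> 14
```
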
  28. Can transparency solve the reproducibility crisis? Transparency through documentation is not enough. ImageNet describes how its dataset was collected and annotated, which is what transparency requires. Yet if we hire MTurk to annotate our dataset, we cannot expect to obtain the same labeled data even when we follow strictly the same workflow; according to the paper, we cannot rely on the quality of the labels. We should build a well-skilled team to address the reproducibility crisis: in-house specialists, or outsourcing to an annotation vendor.
  29. Can outsourcing be the next MTurk? "Outsourcing itself is nothing new. Outsourced workers are the fastest growing workforce for data annotation. Finally, not all outsourced workers should be considered low skilled!" Human-in-the-Loop Machine Learning https://www.manning.com/books/human-in-the-loop-machine-learning
  30. Recap: transparency. In the data science field, many players create their datasets using Mechanical Turk. The reproducibility crisis: if we hire MTurk, we may not be able to create a dataset that is reproducible. We should consider building an in-house or outsourced specialist team.
  31. TOC: MLOps in 2022 (Technical, Measurement <-, Process); MLOps in 2023 (Measurement, Process, Culture, Regulations & Standards)
  32. Measurement: Four Keys; metrics for MLOps; capability of ML (ML Test Score); health check (percentage of toil); business impact (experiment & agreement).
  33. Four Keys: a well-defined set of metrics with thresholds, developed by DORA. Use Four Keys metrics like change failure rate to measure your DevOps performance | Google Cloud Blog https://cloud.google.com/blog/products/devops-sre/using-the-four-keys-to-measure-your-devops-performance?hl=en
  34. Machine Learning Operations (MLOps): Overview, Definition, and Architecture. A comprehensive research paper on MLOps based on a survey and interviews: "we furnish a definition of MLOps and highlight open challenges in the field." https://arxiv.org/abs/2205.02302
  35. Metrics for MLOps (1/3): no best practice yet; the current definition of MLOps lacks a "Measurement" principle. Machine Learning Operations (MLOps): Overview, Definition, and Architecture https://arxiv.org/abs/2205.02302
  36. Metrics for MLOps (2/3). Requirements for metrics of an ML team: 1. Observable: well-defined, easy to measure. 2. Comparable: thresholds (good / bad), benchmarks. 3. Actionable: make it possible to understand what we should do.
  37. Metrics for MLOps (3/3). Consider three metrics for three targets: 1. Capability of the team: ML Test Score. 2. Health check of relationships: toil. 3. Business impact: agreements & experiments.
  38. Capability of ML: ML Test Score. A rubric of 28 tests to assess an ML team: 0.5 points for running a test manually, 1.0 point for running it automatically (a toy tally follows below). Right: average scores of Google's ML teams. The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction https://research.google/pubs/pub46555/
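
A toy tally of the scoring rule, assuming a hypothetical dict of test statuses (note the paper's final score takes the minimum across its four test sections; this sketch only sums points):

```python
# Hypothetical status per rubric test: "none", "manual", or "automated".
tests = {
    "feature_distributions_match_expectations": "automated",
    "model_beats_simple_baseline": "manual",
    "training_is_reproducible": "none",
}
POINTS = {"none": 0.0, "manual": 0.5, "automated": 1.0}
score = sum(POINTS[status] for status in tests.values())
print(f"points: {score} of {len(tests)}")  # 1.5 of 3 here
```
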
  39. Health check: percentage of toil. Measuring and managing the amount of toil helps an MLOps team focus on its own tasks, and avoid spending too much effort on other teams' tasks. "Our SRE organization has an advertised goal of keeping operational work (i.e., toil) below 50% of each SRE's time." Google - Site Reliability Engineering https://sre.google/sre-book/eliminating-toil/
  40. Definition of toil: manual, repetitive, automatable, tactical, no enduring value, O(n) with service growth (a sketch of tracking the toil percentage follows below). Google - Site Reliability Engineering https://sre.google/sre-book/eliminating-toil/
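
A minimal sketch of tracking the percentage of toil against the 50% goal, from a hypothetical weekly time log classified with the criteria above:

```python
# Hypothetical weekly time log: (task, hours, is_toil).
time_log = [
    ("retrain model by hand",         6, True),
    ("answer repeated data requests", 4, True),
    ("design new feature pipeline",  14, False),
    ("build automated eval harness",  8, False),
]
toil_hours = sum(hours for _, hours, is_toil in time_log if is_toil)
total_hours = sum(hours for _, hours, _ in time_log)
print(f"toil: {100 * toil_hours / total_hours:.0f}% (goal: below 50%)")  # 31%
```
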
  41. Business impact: experiment & agreement. Define the relationships between ML metrics and business metrics, and measure them through experiments: experimental design, A/B testing, single-case experiments.
  42. Business impact does not yet satisfy the requirements:

    Requirement   ML Test Score   Toil (%)   Business impact
    Observable    ✓               ✓
    Comparable    ✓               ✓
    Actionable    ✓               ✓

    Experiments are required to define the business metric, compare its impact against a baseline, and decide what we should do.
  43. A/B testing. A/B testing is based on the statistical RCT (randomized controlled trial), introduced in 1948 by Hill (left side). The methodology of A/B testing (including causal inference) is still being discussed (a sketch follows below). Use of randomisation in the Medical Research Council's clinical trial of streptomycin in pulmonary tuberculosis in the 1940s - PMC https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1114162/ ; Trustworthy Online Controlled Experiments https://www.amazon.com/dp/1108724264
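
A minimal sketch of evaluating a conversion-rate A/B test with a two-proportion z-test via statsmodels (the counts are hypothetical):

```python
from statsmodels.stats.proportion import proportions_ztest

conversions = [120, 150]  # hypothetical conversions in A (control), B (treatment)
visitors = [2400, 2380]   # users exposed to each variant

z_stat, p_value = proportions_ztest(conversions, visitors)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")  # small p suggests a real difference
```
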
  44. Single-case experiment (1/2): one of the quasi-experimental designs (experiments without a control group). Example: DeepMind AI reduces energy used for cooling Google data centers by 40% https://blog.google/outreach-initiatives/environment/deepmind-ai-reduces-energy-used-for/
  45. Single-case experiment (2/2). Recommended: an A/B/A or A/B/A/B design (A: control, B: treatment; the image on the right is A/B/A). Introduce the new feature (or service) for a limited time span (a sketch follows below). DeepMind AI reduces energy used for cooling Google data centers by 40% https://blog.google/outreach-initiatives/environment/deepmind-ai-reduces-energy-used-for/
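
A minimal sketch of summarizing an A/B/A design: compare the treatment phase against the two surrounding control phases (the daily metric values below are hypothetical, loosely echoing the data-center example):

```python
import numpy as np

# Hypothetical daily cooling-energy metric across an A/B/A schedule.
phase_a1 = np.array([100, 102, 99, 101])  # A: control, feature off
phase_b  = np.array([62, 60, 63, 61])     # B: treatment, feature on
phase_a2 = np.array([98, 101, 100, 103])  # A: control again

baseline = np.concatenate([phase_a1, phase_a2]).mean()
effect = (phase_b.mean() - baseline) / baseline
print(f"estimated effect: {effect:+.0%}")  # about -39% here
```
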
  46. TOC: MLOps in 2022 (Technical, Measurement, Process <-); MLOps in 2023 (Measurement, Process, Culture, Regulations & Standards)
  47. Process. As we saw in the technical section (transparency), we have come to realize the importance of the annotation process. Section 7 of Human-in-the-Loop Machine Learning is a great resource for understanding annotation best practices. https://www.manning.com/books/human-in-the-loop-machine-learning
  48. Three types of workforce: 1. crowdsourcing, 2. BPO (outsourcing), 3. in-house specialists. If the annotation task requires deep expertise and high confidence, combine BPO and in-house specialists. Human-in-the-Loop Machine Learning, fig. 8.21 https://www.manning.com/books/human-in-the-loop-machine-learning
  49. Three principles (1/2). Salary - fair pay: annotators should be paid as much as other workers, including data scientists. Job security - pay regularly: "(data scientists should) structure the amount of work available to be as consistent as possible". Human-in-the-Loop Machine Learning https://www.manning.com/books/human-in-the-loop-machine-learning
  50. Three principles (2/2). Ownership - provide transparency: "The best way to make any repetitive task interesting is to make it clear how important that work is. An annotator who spends 400 hours annotating data that powers a new application should feel as much ownership as an engineer who spends 400 hours coding it." Human-in-the-Loop Machine Learning https://www.manning.com/books/human-in-the-loop-machine-learning
  51. Three tips. In-house experts: always run in-house annotation sessions. Outsourced workers: talk to your outsourced workers. Crowdsourcing: create a path to secure work and career advancement. Human-in-the-Loop Machine Learning https://www.manning.com/books/human-in-the-loop-machine-learning
  52. TOC: MLOps in 2022 (Technical, Measurement, Process); MLOps in 2023 <- (Measurement, Process, Culture, Regulations & Standards)
  53. Challenges of A/B testing: measuring network effects <-; managing real-world dynamism; supporting diverse lines of business; supporting our culture of experimentation. Challenges in Experimentation | by John Kirn | Lyft Engineering https://eng.lyft.com/challenges-in-experimentation-be9ab98a7ef4
  54. Measuring network effects (2/2). Three kinds of randomization: (a) alternating time intervals (one hour); (b) randomized coarse and fine spatial units; (c) randomized user sessions. The best randomization for measuring the effect of Prime Time was (a) (a sketch follows below). Challenges in Experimentation | by John Kirn | Lyft Engineering https://eng.lyft.com/challenges-in-experimentation-be9ab98a7ef4
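
A minimal sketch of scheme (a) (my illustration, not Lyft's code): assign whole one-hour intervals to an arm, so every user active in the same hour shares the same experience and marketplace-level network effects are captured.

```python
from datetime import datetime

def assign_interval(ts: datetime) -> str:
    """Alternate one-hour intervals between control and treatment."""
    return "treatment" if ts.hour % 2 == 0 else "control"

print(assign_interval(datetime(2022, 5, 1, 14, 30)))  # treatment (hour 14)
print(assign_interval(datetime(2022, 5, 1, 15, 10)))  # control (hour 15)
```
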
  55. PROCESS (1/3). MLOE-02: Establish ML roles and responsibilities. Establish cross-functional teams with roles and responsibilities. "An ML project typically consists of multiple roles, with defined tasks and responsibilities for each. In many cases, the separation of roles and responsibilities is not clear and there is overlap." MLOE-02: Establish ML roles and responsibilities - Machine Learning Lens https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/mloe-02.html
  56. PROCESS (2/3). 13 ML roles defined by AWS. MLOE-02: Establish ML roles and responsibilities - Machine Learning Lens https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/mloe-02.html
  57. PROCESS (3/3). Experiment tracking: integrate business metrics and ML metrics into one dashboard, so that all stakeholders can hold discussions based on a single source of truth.
  58. CULTURE: Netflix. "Netflix has a strong culture of experimentation, and results from A/B tests, or other applications of the scientific method, are generally expected to inform decisions about how to improve our product and deliver more joy to members." Netflix: A Culture of Learning https://netflixtechblog.com/netflix-a-culture-of-learning-394bc7d0f94c
  59. AI Act: regulations & standards for AI. The AI Act is a European legal effort to establish AI law. Similar to the GDPR, it would act as a global regulation (not only within the EU). The second half of 2024 is the earliest the regulation could become applicable. Overview of the EU's proposed AI regulation https://www.soumu.go.jp/main_content/000826707.pdf ; Regulatory framework proposal on artificial intelligence https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai
  60. Recap. There was a lot of progress on the technical side; for example, many players now build their own ML pipelines. That technical progress has exposed problems with machine learning, especially fairness, security, and transparency. And although it is a well-established field with a long history, statistical measurement of the impact of ML still has unsolved problems. This year, governments are planning to define standards and regulations (e.g., the EU AI Act). Their final shape is still unclear, but we must keep watching this activity.