Josh Poduska, Chief Data Scientist at Domino Data Lab, and I teamed up to deliver the talk “Managing Data Science in the Enterprise” at the 2018 Strata Data Conference in New York.
Agenda
• Motivation: why this matters
• Common challenges to managing data science in the enterprise
• Guiding principles and framework
• Process (breakout exercise: project pre-flight checklist)
• Break
• People (breakout exercise: team-building plan)
• Managing technology and X-factors
• Summary
Data science creates business value in two ways:
1. Top-line growth: new revenue streams, expansion into new markets, creation and delivery of new products.
2. Operational efficiency gains that compound through constant incremental improvement.
From Jeff Bezos’s Annual Letter to Shareholders: “At Amazon, we’ve been engaged in the practical application of machine learning for many years now. Some of this work is highly visible: our autonomous Prime Air delivery drones; the Amazon Go convenience store that uses machine vision to eliminate checkout lines; and Alexa, our cloud-based AI assistant. But much of what we do with machine learning happens beneath the surface. Machine learning drives our algorithms for demand forecasting, product search ranking, product and deals recommendations, merchandising placements, fraud detection, translations, and much more.”
[Diagram: “How most organizations operate.” Data science leaders, data scientists, risk and compliance, business stakeholders and IT system owners, business leaders, and CIOs/IT leaders are connected by a slow, unclear path: work disappears into a “black hole” and the wild west of desktop data science.]
[Diagram: “How model-driven organizations operate.” The same stakeholders (data science leaders, data scientists, risk and compliance leaders, business stakeholders and IT system owners, business leaders, and CIOs/IT leaders) collaborate through an explicit lifecycle of model development, validation & review, and monitoring & feedback.]
Most companies struggle to get model-backed products into production. And when those products do make it into production, companies struggle to measure their impact and drive subsequent improvement.
Mindsets of the most effective data science organizations:
• Process and culture > any one tool or piece of technology
• Reusable knowledge > producing an answer
A framework for managing data science as a capability:
• People: attract, hire, onboard, retain, and organize world-class talent
• Process: deliver measurable, reliable, scalable outcomes
• Technology: use productivity tooling and best practices to enable scale
• X-factors: manage model liability and navigate organizational politics
Deciding what we do: engage the business
• Common method: Data -> Analysis -> Product Development -> KPI
  • Common pitfalls: scope creep, loss of stakeholder enthusiasm, no crisp measure of success
• Better method: Problem -> Relevant KPIs -> Product Requirements -> Analysis Necessary -> Data
  • Result: greater focus, lower risk
• Build a business process map
• Educate stakeholders on what is possible (avoid the perception of magic)
• Allow all stakeholders to submit ideas
• Publish the project list monthly to all stakeholders, and re-prioritize at least quarterly
Project Prioritization
• Estimate value at stake
  • Order-of-magnitude value capture ($100K, $1M, $10M, etc.)
  • How much improvement is realistic?
• Estimate effort
  • Order-of-magnitude cost estimate (1 hour, 1 day, 1 week, 1 quarter, 1 year)
• Forecast risks
  • Barriers to adoption
  • Potential consequences of errors or performance degradation
• Plot projects on value, effort, and risk axes: high-value, low-effort, low-risk projects are clear “do”s; low-value, high-effort, high-risk projects are clear “don’t”s.
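To make the trade-off concrete, here is a minimal scoring sketch in Python. This is my own illustration, not something from the talk: the project names, dollar values, effort estimates, and risk ratings are all hypothetical order-of-magnitude inputs.

```python
# A toy prioritization score: order-of-magnitude value per effort-day,
# discounted by a subjective risk rating. Illustrative only.
from dataclasses import dataclass

@dataclass
class Project:
    name: str
    value_usd: float    # order-of-magnitude value at stake
    effort_days: float  # order-of-magnitude effort estimate
    risk: float         # subjective rating, 1 (low) to 5 (high)

def priority_score(p: Project) -> float:
    """Higher is better: value per effort-day, divided by risk."""
    return p.value_usd / (p.effort_days * p.risk)

projects = [
    Project("churn prediction", value_usd=10_000_000, effort_days=20, risk=2),
    Project("fraud classifier", value_usd=1_000_000, effort_days=65, risk=4),
]
for p in sorted(projects, key=priority_score, reverse=True):
    print(f"{p.name}: {priority_score(p):,.0f}")
```

The point is the discipline of writing the estimates down and comparing them, not the precision of the score.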
Prioritization Pitfalls
• Never-ending science projects
• Overlooking linkages between model insight and business action
• Focusing on what’s easy or clever instead of what’s valuable
• Cost estimates that fail to consider integration, maintenance, and retraining
Project kick-off
• “We don’t fail because of the math… we fail because we don’t anticipate how the math will be used.”
• Time saved here pays back 10x in development and 100x in production
• “Product management” principles apply to data science projects just as much as to engineering projects
ROI Math Example

| | Project #1: Churn Prediction | Project #2: Fraud Classifier |
| --- | --- | --- |
| Value at stake | 100,000 customers × $1,000 ARR × 10% current churn = $10M problem | 50,000 applications × 1% fraud rate × $2,000 avg. resolution cost = $1M problem |
| Potential for improvement | Low (churn is already quite low) | High (doing nothing today, no headcount) |
| Dependencies | Enough support staff to act? | App dev team integration |
| Level of effort | 1 month | 1 quarter |
| Risk of false positive | Low (extra support outreach) | High (bad customer experience) |
| Risk of false negative | Medium (lost revenue) | High (more lost revenue) |
| Re-training requirements | Medium (marketing mix changes slowly) | High (adversarial domain) |
| Change management requirements | Low (educate a support team that currently relies on randomness / intuition) | High (modify real-time application flow) |
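The value-at-stake arithmetic in the table reduces to a single multiplication, which is worth encoding once and reusing across the portfolio. A minimal sketch (the function name and framing are mine):

```python
def value_at_stake(units: int, incident_rate: float, unit_value: float) -> float:
    """Annual dollars exposed to the problem a model would address."""
    return units * incident_rate * unit_value

# Churn: 100,000 customers x 10% current churn x $1,000 ARR each
print(f"${value_at_stake(100_000, 0.10, 1_000):,.0f}")  # $10,000,000
# Fraud: 50,000 applications x 1% fraud rate x $2,000 avg. resolution cost
print(f"${value_at_stake(50_000, 0.01, 2_000):,.0f}")   # $1,000,000
```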
Stakeholder Mapping
• Define responsible parties from each group: data science, business, DevOps, application development, compliance, etc.
• Common pitfalls:
  • Lacking empathy with the goals of the actual end user
  • Throwing results “over the fence” to IT with no context
Data Availability
• Leverage existing sources first to build a baseline
• Create synthetic data with realistic characteristics
• Track engagement with datasets to automatically discover experts
• Common pitfalls:
  • Waiting for “perfect” data
  • Buying external data without a clear onboarding plan
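As one way to act on the synthetic-data point, here is a minimal sketch of generating a realistic-looking table for early prototyping. It is my own example, not from the talk; the distributions and the tenure-churn relationship are assumptions you would replace with what you know about your domain.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000

customers = pd.DataFrame({
    # Tenure skews young: many recent sign-ups, a long tail of loyal customers.
    "tenure_months": rng.exponential(scale=18, size=n).round().astype(int),
    # Spend is roughly log-normal, as real spend distributions often are.
    "monthly_spend": rng.lognormal(mean=4.0, sigma=0.5, size=n).round(2),
})
# Encode the relationship you expect: churn probability falls with tenure.
p_churn = 1 / (1 + np.exp(0.15 * customers["tenure_months"] - 1.5))
customers["churned"] = rng.random(n) < p_churn

print(customers.head())
print(f"Synthetic churn rate: {customers['churned'].mean():.1%}")
```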
Prior Art Review
• Review the state of the art — internally and externally
• Common pitfalls:
  • A culture of NIH (“not invented here”)
  • Nose-to-the-ground mindsets
  • No single source of truth
Model Delivery Plan
• Design multiple mock-ups of different form factors
• Designate approvers in advance (IT, data science, business)
• Create a process flow that shows precisely where the model will have impact
• Consider an agile approach
• Common pitfalls:
  • Failing to educate end users
  • Over-engineering relative to the requirements of the use case
Success Measures
• Pre-emptively answer “how will we know if this worked?”
• Frame success in terms of business KPIs, not statistical measures
• Define needs for holdout groups, A/B testing, etc.
• Common pitfalls:
  • Not knowing when the model is “good enough”
  • Failing to establish testing infrastructure and culture
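To illustrate the holdout-group point, here is a minimal sketch of reporting a pilot as a business KPI comparison rather than a statistical measure. The counts are hypothetical, and the design (one treated group, one holdout) is the simplest possible.

```python
# Compare churn between a holdout (control) group and a treated group
# that received model-driven outreach. All counts are hypothetical.
from scipy.stats import chi2_contingency

treated = {"churned": 80, "retained": 920}    # 8.0% churn with the model
control = {"churned": 100, "retained": 900}   # 10.0% churn without it

table = [
    [treated["churned"], treated["retained"]],
    [control["churned"], control["retained"]],
]
chi2, p_value, dof, expected = chi2_contingency(table)

treated_rate = treated["churned"] / sum(treated.values())
control_rate = control["churned"] / sum(control.values())
print(f"Churn reduced by {control_rate - treated_rate:.1%} (p = {p_value:.2f})")
```

“Churn fell two points in the pilot” lands better with stakeholders than any AUC figure.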
Risk Mitigation
• Consider the consequences of errors (e.g., false positives / false negatives)
• State likely biases in the training data
• Track ongoing usage to prevent inappropriate consumers
• Common pitfalls:
  • Assuming today’s absence of regulation will last
  • Conflating model interpretability with model provenance
  • Model misuse
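One way to make “consider the consequences of errors” concrete is to price them. A minimal sketch with hypothetical counts and unit costs:

```python
# Price a model's mistakes per review period. Counts and costs are
# hypothetical; the point is to write them down explicitly.
def error_cost(fp: int, fn: int, cost_fp: float, cost_fn: float) -> float:
    return fp * cost_fp + fn * cost_fn

# Fraud classifier: a false positive inconveniences a good customer;
# a false negative lets a ~$2,000 fraud case through.
print(f"${error_cost(fp=30, fn=5, cost_fp=150, cost_fn=2_000):,.0f}")  # $14,500
```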
Wrapping up projects
• Defend the scientific method
• Store positive and negative results
• Preserve the synthesis, intermediate results, code, data, and environment
• Common pitfalls:
  • Repeated quiet failures
  • Old analyses that no longer run
Checklist: Questions to Ask

Business case
• What’s the desired outcome of the project, in terms of a business metric?
• What are the linkages from your project to impacting that ultimate business metric?
• What is the order-of-magnitude ROI and cost?

Stakeholders
• Lead data scientist, proposed validator, data engineers, product manager, business executive, business end user (internal or external), application developer, DevOps engineer, compliance

Technology
• What compute hardware and software infrastructure do you anticipate being necessary?
• Would your project benefit from specialized or parallelized computing?

Data
• What relevant data exists today? Who are the subject matter experts?
• What other data would we potentially want to capture, create, or buy externally?

Prior art
• Who has worked on this business topic before (internally and externally)?
• Who are the relevant experts in the techniques I will likely use?

Model delivery
• Who will consume this, and what form factor will the final product take (report, app, API)?
• What dependencies or resources will you require to deliver work this way (e.g., IT)?
• What other delivery mechanisms are possible, especially ones that are lighter weight or easier to test first?
• What user training is necessary to ensure adoption?

Success measures
• How will you know if it’s working as expected, or otherwise get feedback?
• What’s your “monitoring” plan, even if it’s manual and subjective?

Risk mitigation
• How could this model be misused by end users?
• Are there any constraints on the modeling approach (e.g., interpretability requirements)?
A note for early teams
• Don’t get overwhelmed into paralysis by complex process
• Look for low-hanging fruit to buy political capital for more headcount and riskier projects
• Find a senior sponsor
• Most important takeaway: engage the business as partners, early and often
Challenges and business impact: bottlenecks and pitfalls have a quantified negative impact

1. Inconsistent project prioritization and kickoff
• X% of projects have little / no impact
• Y weeks lost by employees identifying what projects have been done before and understanding that work
• Duplication of work wastes time and slows progress
• Inability to leverage past work and customize across locations
• Scope creep and loss of stakeholder enthusiasm

2. No on-demand access to technology
• <Subsidiary> approval for required infrastructure takes weeks per project
• Insufficient infrastructure prevents differentiated innovation
• 4-6 week delays for resource requests, spread between approvals and implementation
• X time wasted replacing data scientists

3. No ability to easily deploy results to the business
• Data scientists waste time on mundane tasks to expose models
• Business stakeholders complain about lengthy delays to business value
• Z time lost by employees setting up dashboard servers

4. Failure to preserve knowledge upon completion
• Lack of documentation and reproducibility of code hurts iteration
• Projects just fade away, so null results aren’t known to future collaborators
• Model iteration velocity slowed by an average of one month
Why focus on people?
• Talent is commonly cited as the obstacle to becoming model-driven
• Typical tenure is under 2 years, with a 3+ month ramp
• Hiring managers are overwhelmed by resumes, underwhelmed by output
Framework for People
• Attract – how to lure the best talent
• Assess – hire systematically
• Train – focus on mindset, not just skills
• Retain – build community and mentorship
• Organize – define optimal roles and structure
Attracting the best and brightest
• Have a differentiated offering and strategy
• Advertise projects, not just the company
• Offer modern tools and a commitment to open source
• Common pitfalls:
  • Writing unrealistic job descriptions
  • Seeking PhDs when you need hackers (or vice versa)
Assessment
• Be systematic: identify required attributes and design assessments for each
• Be analytical: track interviewer and interview-type efficacy
• Include EQ and non-technical assessments
• Sell while assessing: simulate real work
• Common pitfalls:
  • Over-relying on tech screens
Training
• Reinforce mindsets, not just skills
• Develop a culture of reuse and compounding
• Reward community-enhancing behavior
• Provide “soft” skills training
• Common pitfalls:
  • A “not built here” mentality
Set expectations on time allocation upfront
• Emphasize listening to stakeholders
• Compensate the team on new and existing work, not just current projects
• Common pitfalls:
  • Employee churn from flawed expectations
(Source: Max Shron, Warby Parker)
Metrics of managing data science
• Share accountability for the business’s KPIs
• Focus on iteration velocity
• Systematically capture stakeholder feedback and engagement
• Common pitfalls:
  • Measuring everyone but yourself
  • Over-indexing on any one project vs. overall factory performance
Roles of data science

| Role | Priorities |
| --- | --- |
| Data Scientist | Generating and communicating insights, understanding the strengths and risks |
| Data Storyteller | Creating engaging visual and narrative journeys for analytical solutions |
| Data Infrastructure Engineer | Building scalable pipelines and infrastructure that make the higher levels of need possible |
| Data Product Manager | Articulating the business problem, translating it to day-to-day work, ensuring ongoing engagement |
| Business Stakeholder | Vetting the prioritization and ROI, providing ongoing feedback |
Organizational Design Dilemmas
• Solve prioritization and delivery problems first
• Bridge silos with cross-cutting platforms
• Common pitfalls:
  • Failing to evolve the structure as the organization matures
  • Confining teams to ivory-tower innovation labs

Decentralization
• Pros: stronger alignment with business processes and priorities; easier change management
• Cons: less technical knowledge compounding; harder to codify best practices; risk of IT governance issues

Centralization
• Pros: community and mentorship; easier transparency for managers and IT; more passive technical knowledge sharing
• Cons: isolation on a data science island; loss of credibility with the business; frustrated data scientists
Which Structure is Right For You?
[Diagram: four organizational models, showing where data science (DS) sits relative to IT and lines of business (LoB): standalone centralized, centralized under IT/engineering, federated, and hub-and-spoke.]
• Prioritize stakeholder proximity early if your use cases are internal
• Tie to engineering if primarily building model-driven, external-facing products
• Develop a hub-and-spoke model as you scale
Team-Building Plan Template: Questions to Ask

Attracting talent
• What’s your differentiated value proposition for candidate data scientists? List three things that make the opportunity unique, that you think will resonate with your target candidate pool.
• What are 1-3 risks that might make the opportunity less appealing than competing opportunities? How can you mitigate or get ahead of them?

Hiring process
• What are the three most important attributes for your candidates? What is your assessment plan for each?

Onboarding
• What outcomes need to have been achieved in the first 30, 60, and 90 days?
• What are the most important pieces of “tribal knowledge” your new hire needs to know, and how will she learn them? Examples include data sources, project methodologies, stakeholder dynamics, and notable wins / losses.

Retention and management
• What skills do you hope this candidate develops over the first year?
• What metrics will determine the success of this candidate after a year? Examples include certain business metrics, community contributions, number of insights produced, or project iteration velocity.
Drive best practices “bottom up”
• Data scientists: “I’m more productive!”
  • Test ideas faster
  • Deploy and share work easily
• Leaders: “Centralized work!”
  • Powerful collaboration features
  • Version control & reproducibility
How we approached this
• Drive business impact:
  • Deploy models as APIs
  • Deploy apps (e.g., Shiny) and reports to non-technical stakeholders
  • Schedule jobs for ETL, reporting, and model retraining
• Entice data scientists with:
  • Vertically and horizontally scalable infrastructure
  • DevOps automation
  • A computational lab notebook to track results
• Centralize work to make it possible to find, reuse, reproduce, and discuss past work
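As an illustration of the “deploy models as APIs” pattern, here is a minimal Flask sketch. This is my own example, not Domino’s deployment mechanism; the model file and feature layout are hypothetical.

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical: a previously trained scikit-learn model serialized to disk.
with open("churn_model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body like {"features": [[12, 88.5], [40, 31.0]]}.
    features = request.get_json()["features"]
    preds = model.predict_proba(features)[:, 1]  # probability of churn
    return jsonify({"churn_probability": preds.tolist()})

if __name__ == "__main__":
    app.run(port=8080)
```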
Model liability
• A growing concern at later maturity
• Track and guardrail model usage
• Document risks and trade-offs made in flight, not post hoc
• Pre-emptively establish validation, monitoring, and compliance controls
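As a sketch of what “pre-emptively establish monitoring” can look like in code (my own illustration, with assumed score distributions), the idea is to compare live prediction scores against the validation-time baseline and alert on drift:

```python
import numpy as np
from scipy.stats import ks_2samp

def drifted(baseline_scores, live_scores, alpha: float = 0.01) -> bool:
    """Kolmogorov-Smirnov test between validation-time and live score
    distributions; True means the shift is statistically significant."""
    _, p_value = ks_2samp(baseline_scores, live_scores)
    return p_value < alpha

rng = np.random.default_rng(0)
baseline = rng.beta(2.0, 5.0, size=5_000)  # scores at validation time
live = rng.beta(2.8, 5.0, size=5_000)      # scores observed this week
if drifted(baseline, live):
    print("Alert: score distribution drifted; trigger review / retraining.")
```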
Summary
• Data science success is not adding up individual successes; it’s an organizational capability
• Alignment and partnership with the business is critical
• Process: enforce a pre-flight checklist
• People: develop hiring and onboarding plans
• Technology: leverage technology to increase productivity and reinforce best-practice processes
• X-factors: navigate politics and risk
Struggling with your own lifecycle?
• Ask about Domino’s Data Science Lifecycle and Value Assessment offerings
• Tailored analysis of existing processes, gaps, and tangible best practices
• Leverage our ROI analysis templates across your portfolio
Want to learn more? Questions?
• Download this content for more information:
  • The Practical Guide to Managing Data Science at Scale
  • Data Science Management Survey Report
• Stop by our booth #1403