
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Airflow and Beyond | Budapest Data + ML Forum 2024

Navigating today's data landscape isn't just about managing workflows; it's about strategically propelling your business forward. Apache Airflow has stood out as the benchmark in this arena, driving data orchestration forward since its early days. As we dive into the complexities of our current data-rich environment, where the sheer volume of information and its timely, accurate processing are crucial for AI and ML applications, the role of Airflow has never been more critical.

In my journey as the Senior Engineering Director and a pivotal member of Apache Airflow's Project Management Committee (PMC), I've witnessed Airflow transform data handling, making agility and insight the norm in an ever-evolving digital space. At Astronomer, our collaboration with leading AI & ML teams worldwide has not only tested but also proven Airflow's mettle in delivering data reliably and efficiently—data that now powers not just insights but core business functions.

This session is a deep dive into the essence of Airflow's success. We'll trace its evolution from a budding project to the backbone of data orchestration it is today, constantly adapting to meet the next wave of data challenges, including those brought on by Generative AI. It's this forward-thinking adaptability that keeps Airflow at the forefront of innovation, ready for whatever comes next.

The ever-growing demands of AI and ML applications have ushered in an era where sophisticated data management isn't a luxury—it's a necessity. Airflow's innate flexibility and scalability are what make it indispensable for managing the intricate workflows of today, especially those involving Large Language Models (LLMs).

This talk isn't just a rundown of Airflow's features; it's about harnessing these capabilities to turn your data workflows into a strategic asset. Together, we'll explore how Airflow remains at the cutting edge of data orchestration, ensuring your organization is not just keeping pace but setting the pace in a data-driven future.

Session in https://budapestdata.hu/2024/04/kaxil-naik-astronomer-io/ | https://dataml24.sessionize.com/session/667627

Kaxil Naik

June 11, 2024


Transcript

  1. Kaxil Naik, Apache Airflow Committer & PMC Member, Senior Director of Engineering @ Astronomer. @kaxil
  2. Agenda: • Orchestrator – The What & Why? • What is Apache Airflow? ◦ Why is Airflow the Industry Standard for Data Professionals? ◦ Evolution of Airflow • Today’s Data Workflow Challenges ◦ How Airflow addresses them – Real-world case studies • The Future of Airflow
  3. Orchestration in Engineering! • Workflow Orchestrator: automates and manages interconnected tasks across various systems to streamline complex business processes, e.g. running a bash script every day to update packages on a laptop. • Data Orchestrator: automates and manages interconnected tasks that deal with data across various systems to streamline complex business processes, e.g. ETL for a BI dashboard.
  4. What is Apache Airflow? A workflow orchestrator, most commonly used for data orchestration. Official definition: a platform to programmatically author, schedule and monitor workflows.
  5. Key Features of Airflow • Python Native: the language of data professionals (data engineers & scientists). DAGs are defined in code, allowing more flexibility & observability of code changes when used with git. • Pluggable Compute: GPUs, Kubernetes, EC2, VMs, etc. • Integrates with Your Toolkit: all data sources, all Python libraries, TensorFlow, SageMaker, MLflow, Spark, Ray, etc. • Common Interface: between data engineering, data science, ML engineering and operations. • Data Agnostic: but data aware. • Cloud Native: but cloud neutral. • Monitoring & Alerting: built-in features for logging, monitoring and alerting to external systems. • Extensible: standardize custom operators and templates for common DS tasks across the organization.
  6. Conference & Meetups Attendees: Online Edition (2020–2022): 10k • In-person (2023+): 500+ • 15 local groups across the globe with 11k members
  7. Use cases for Airflow (Source: 2023 Apache Airflow Survey, n=797) • Ingestion and ETL/ELT related to analytics: 90% • Ingestion and ETL/ELT related to business operations: 68% • Training, serving, or generally managing MLOps: 28% • Spinning up and spinning down infrastructure: 13% • Other: 3%. 90% of Apache Airflow usage is dedicated to ingestion and ETL/ELT tasks associated with analytics, followed by 68% for business operations. Additionally, there’s growing adoption for MLOps (28%) and infrastructure management (13%), highlighting Airflow’s versatility across various data workflow tasks.
  8. Timeline: Major Milestones • Oct 2014: created at Airbnb • June 2015: open sourced • March 2016: donated to the Apache Software Foundation (ASF) as an incubating project • Dec 2018: graduated as a top-level project • July 2020: first Airflow Summit • Dec 2020: Airflow 2.0 released • Mar–Apr 2025 (planned): Airflow 3.0 release
  9. Timeline: 2.x Minor Releases • 2.1: 2021-05 • 2.2: 2021-11 • 2.3: 2022-05 • 2.4: 2022-09 • 2.5: 2022-11 • 2.6: 2023-04 • 2.7: 2023-08 • 2.8: 2023-12 • 2.9: 2024-04
  10. Today’s Data Workflow Challenges • Increasing Data Volumes: businesses generate more data than ever; handling this data & its quality is critical. • Need for Near Real-time Processing: data workflows are being used to drive critical business decisions in near real-time, and hence require reliability & performance guarantees. • Complexity in Data Workflows: modern workflows must handle data from multiple sources, which requires managing complex dependencies & dynamic schedules. • Intelligent Infrastructure: infrastructure must be elastic & flexible to optimize for modern workloads.
  11. Today’s Data Workflow Challenges • Additional Interfaces: net-new teams, from ML to AI, want to get the best out of Airflow without learning a new framework. • Licensing & Security in OSS: OSS projects owned by a single company have changed licenses too often in the recent past. • Platform Governance: visibility, auditability & lineage across a data platform is a need-to-have. • Cost Reduction: tight budgets have pushed teams to utilize resources efficiently to drive operational costs down.
  12. Case Study: Texas Rangers • Company: a professional baseball team in Major League Baseball (MLB), based in Arlington, Texas. The Rangers won their first World Series championship in 2023. • Goal: use data to gain an unfair advantage, Moneyball style! Data to be collected: real-time game data streaming, comprehensive player health reporting, predictive analytics of everything from pitch spin to hit trajectory, and more. • Challenge: scalability issues due to the volume & unprecedented rate of data, and an infrastructure bottleneck in their live game analytics pipeline. This impacted the timely delivery of analytics to their team and affected their competitive edge.
  13. Case Study: Texas Rangers • Solution: use Airflow’s worker queues to create dedicated worker pools for CPU-intensive tasks while other tasks run on cheaper workers. Using data-aware scheduling, they were able to start their DAGs when data became available instead of on a time-based schedule. • Result: Improved scalability — using worker queues, DAG completion time reduced by 80% (from 20 mins to 3 mins). Increased efficiency — optimizing compute resources allowed processing of 4 additional DAGs in parallel, enabling immediate post-game analytics delivery for a competitive edge.
  14. Case Study: Bloomberg • Company: Bloomberg is a leading source of financial & economic data — equities, bonds, indices, mortgages, currencies, etc. Founded in 1981, with subscribers in 170+ countries. • Goal: deliver a diverse array of information, news & analytics to facilitate decision-making. • Challenge: maintaining custom pipelines for diverse datasets across different domains is expensive & time-consuming. Their engineers lacked the domain knowledge to aggregate data into client insights, and their domain experts lacked the skills to maintain data pipelines in production.
  15. Case Study: Bloomberg • Solution: a configuration-driven ETL platform leveraging Airflow & dynamic DAGs. User-defined configs are translated into dynamic DAGs that determine tasks & their dependencies, with success/failure actions. • Result: the data platform team now supports 1600+ DAGs, 700+ datasets, 200+ users, 11 different product teams, and 10k+ weekly file ingestions. Source: https://airflowsummit.org/sessions/2023/airflow-at-bloomberg-leveraging-dynamic-dags-for-data-ingestion/
  16. Case Study: FanDuel • Company: FanDuel Group is a sports betting company that lives on data, with approximately 17 million customers. • Goal: business growth led to higher daily data volumes, which fueled demand for new sources and richer analytics. • Challenge: the 2022 NFL season was fast approaching, and FanDuel wanted a robust data architecture in anticipation of the company’s busiest time in terms of daily data volume.
  17. Case Study: FanDuel • Solution: they worked with the Astro professional services team to replace operators with more efficient deferrable operators, along with Astro’s auto-scaling features. • Result: the average number of running worker nodes decreased by 35%, resulting in immediate infrastructure cost savings, and average tasks per worker increased by 305%.
  18. Other Interesting Case Studies • Grindr has saved $600,000 in Snowflake costs by monitoring their Snowflake usage across the organization with Airflow. • Condé Nast has reduced costs by 54% by using deferrable operators. • Airline: a tool powered by Airflow, built by Astronomer’s Customer Reliability Engineering (CRE) team, that monitors Airflow deployments and proactively sends alerts when issues arise.
  19. Other Interesting Case Studies • King uses ‘data reliability engineering as code’ tools such as Soda Core within Airflow pipelines to detect, diagnose and report data issues — creating coverage, improving quality & accuracy, and helping eliminate data downtime. • Laurel.ai: a pioneering AI company that automates time and billing for professional services. It uses multiple domain-specific LLMs to create billing timesheets from users’ footprints across their workflows & tools (Zoom, MS Teams, etc.). Airflow orchestrates their entire GenAI lifecycle: data extraction, model tuning & feedback loops. • Ask Astro: an end-to-end example of a Q&A LLM application used to answer questions about Apache Airflow and Astronomer.
  20. Airflow 3 • Goal: make Airflow the foundation for data, ML, and GenAI orchestration for the next 5 years. 1. Enable secure remote task execution across network boundaries. 2. Integrate the data awareness needed for governance and compliance. 3. Enable non-Python tasks, for integration with any language. 4. Enable versioning of DAGs and datasets. 5. Single-command local install for learning and experimentation.
  21. Thank You • A friendly reminder to RSVP to Airflow Summit 2024: celebrating 10 years of Airflow • Sept 10th–12th • The Westin St. Francis, San Francisco, CA • @kaxil • Airflow Summit Discount Code: 15DISC_MEETUP