Discover the Latest Innovations in Apache Airflow 3.0, Designed to Enhance Data Orchestration for Teams of All Sizes

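Purpose: Introduce the major updates in Airflow 3.0 and highlight the benefits for both current and new users.
Contents:
- An introduction to Airflow and its role in data pipelines.
- Key features of Airflow 3.0: a modern UI, DAG versioning, task isolation, and multi-language support.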

Transcript

  1. Discover the latest innovations in Apache Airflow 3.0, designed to enhance data orchestration for teams of all sizes.

    LINE Taiwan & EC Data: Jason Lai, Echo Lee
  2. Agenda

    - What’s Airflow?
    - Airflow 3.0 Architecture
    - Flexible Timetable
    - UI Modernization
    - DAG Versioning
    - Improved Backfill
    - Data Assets & Asset-Aware Scheduling
    - Event-driven scheduling
    - Scenario
    - Conclusion
  3. Airflow DAG

    You define each task as a Python function or an operator, then organize these tasks into a Directed Acyclic Graph (DAG) to manage dependencies and execution order. You can set a scheduled time, similar to a cron job.
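
The idea on this slide (tasks as Python functions, ordered by a dependency graph) can be sketched with the standard library alone. This is a toy model, not Airflow code: the task names `extract`, `transform`, `load` and the helpers `execution_order` and `run_pipeline` are hypothetical, and `graphlib.TopologicalSorter` stands in for the scheduler's dependency resolution.

```python
from graphlib import TopologicalSorter

# Hypothetical task functions; the names are illustrative only.
def extract():
    return [1, 2, 3]

def transform(rows):
    return [r * 10 for r in rows]

def load(rows):
    return sum(rows)

# Map each task to the set of tasks it depends on (its upstream tasks).
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

def execution_order(deps):
    """Return one valid run order; a cycle raises graphlib.CycleError."""
    return list(TopologicalSorter(deps).static_order())

def run_pipeline():
    """Run the three toy tasks in dependency order and collect their results."""
    results = {}
    for name in execution_order(dependencies):
        if name == "extract":
            results[name] = extract()
        elif name == "transform":
            results[name] = transform(results["extract"])
        elif name == "load":
            results[name] = load(results["transform"])
    return results
```

The "acyclic" part matters because a cycle leaves no valid starting task; in this sketch a cyclic `dependencies` mapping makes `execution_order` raise instead of returning an order.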
  4. Airflow Task Log

    Task’s detailed logs: if you want to see the detailed logs of a task, simply click on the log to view it. Airflow will then open the full log view, so you can quickly understand what happened.
  5. Airflow 3 Architecture: Airflow 2 vs. Airflow 3

    Diagram labels: read; write data; create a new DagRun; push job; run the actual code; update task status.
    Callouts:
    - The system turns the code into metadata.
    - limit scalability and security
    - cache frequent queries, easing database load, boosting overall throughput
    - reduce the chance of misuse or attacks
    - These nodes can live outside the cluster.
    Overall, Airflow 3 puts the API in the center, separates components, reduces database load, and supports remote or cloud-native deployment.
  6. Flexible Timetable

    - MultipleCronTriggerTimetable: accepts multiple cron expressions and schedules a DAG run whenever any of the expressions matches the current time.
    - AssetOrTimeSchedule: a specialized timetable that allows DAGs to be scheduled on both time-based schedules and asset events.
    https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/timetable.html#multiplecrontriggertimetable
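
The "any expression matches" rule can be illustrated without Airflow. The sketch below is a simplified, hypothetical matcher (`field_matches`, `cron_matches`, `any_cron_matches` are not Airflow functions): it handles only `*`, `*/n`, single numbers, and comma lists, and it requires every field to match, which ignores the special day-of-month/day-of-week rule of real cron.

```python
from datetime import datetime

def field_matches(field, value):
    """Check one cron field; supports '*', '*/n', single numbers and comma lists."""
    for part in field.split(","):
        if part == "*":
            return True
        if part.startswith("*/"):
            if value % int(part[2:]) == 0:
                return True
        elif int(part) == value:
            return True
    return False

def cron_matches(expr, when):
    """Match 'minute hour day-of-month month day-of-week' against a datetime."""
    minute, hour, dom, month, dow = expr.split()
    # Cron counts Sunday as 0; datetime.isoweekday() counts Sunday as 7.
    cron_dow = when.isoweekday() % 7
    return (
        field_matches(minute, when.minute)
        and field_matches(hour, when.hour)
        and field_matches(dom, when.day)
        and field_matches(month, when.month)
        and field_matches(dow, cron_dow)
    )

def any_cron_matches(exprs, when):
    """A run is due when at least one of the expressions matches."""
    return any(cron_matches(expr, when) for expr in exprs)

# Example schedules (assumed for illustration): Mondays 09:00 and Fridays 18:30.
SCHEDULES = ["0 9 * * 1", "30 18 * * 5"]
```

With a single cron expression, "Mondays at 09:00 and Fridays at 18:30" cannot be written without also firing at unwanted times; combining several expressions with a logical OR is what makes such irregular schedules expressible.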
  7. Airflow 3.0 UI Modernization

    Overview (the new UI is built on React and FastAPI). Differences:
    - Airflow 2.x: only the DAGs list
    - Airflow 3.0: Dag Runs, Task Instances, Asset Events
  8. Data Assets & Asset-Aware Scheduling

    Data-aware scheduling concept:
    - Upstream: set asset outlets.
    - Downstream: schedule on the assets.
    Code example (DAG example): when the upstream DAG outputs the asset, it triggers asset_consumes.
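
The slide's code is not preserved in this transcript, so the following is only a toy model of the concept, not the Airflow API. The class `ToyScheduler`, its methods, and the asset URI are hypothetical; only the DAG name `asset_consumes` comes from the slide. An upstream task declares the assets it writes (its outlets), a downstream DAG declares the asset it is scheduled on, and a successful upstream task triggers every DAG registered for that asset.

```python
from collections import defaultdict

class ToyScheduler:
    """Toy model of asset-aware scheduling; not the Airflow implementation."""

    def __init__(self):
        self.consumers = defaultdict(list)  # asset name -> DAG ids scheduled on it
        self.triggered = []                 # (dag_id, asset) pairs, in trigger order

    def schedule_on(self, dag_id, asset):
        """Downstream side: register a DAG to run when the asset is updated."""
        self.consumers[asset].append(dag_id)

    def task_succeeded(self, outlets):
        """Upstream side: a task finished and updated the assets in its outlets."""
        for asset in outlets:
            for dag_id in self.consumers.get(asset, []):
                self.triggered.append((dag_id, asset))

# Hypothetical asset identifier, used only for this illustration.
ORDERS = "s3://example-bucket/orders.csv"

scheduler = ToyScheduler()
scheduler.schedule_on("asset_consumes", ORDERS)
scheduler.task_succeeded(outlets=[ORDERS])
```

The point of the design is that the consumer names the data it depends on rather than the producer DAG, so either side can be changed without editing the other.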
  9. Scenario - AssetOrTimeSchedule

    Project: Data Curation
    - Before (Asset): runs were only triggered by asset.
    - After (AssetOrTimeSchedule): runs can now be triggered by either schedule time or asset.
  10. Scenario - Event-driven scheduling

    Project: CRM
    - Before (Schedule): runs were only triggered by schedule time.
    - After (Event-driven): runs can be triggered by Kafka events.
  11. Conclusion

    - Developer Productivity: Flexible Timetables; DAG Versioning; Improved Backfill
    - Cost Efficiency: Event-driven scheduling; UI improvements and parallel DAG runs
    - Better Data Governance: Assets and asset-driven scheduling; AssetWatcher & AssetEvent