A Newcomer's Guide To Airflow's Architecture
Andrew Godwin
July 12, 2021
A talk I gave at Airflow Summit 2021.
Transcript
A NEWCOMER'S GUIDE TO AIRFLOW'S ARCHITECTURE
ANDREW GODWIN // @andrewgodwin
Hi, I’m Andrew Godwin
• Principal Engineer at
• Also a Django core developer, ASGI author
• Using Airflow since March 2021
High-Level Concepts: What exactly is going on?
The Good and the Bad: Or, How I Learned To Stop Worrying And Love The Scheduler
Problems, Fixes & The Future: Where we go from here
Differences from things I have worked on? (An eclectic variety of web and backend systems)
"Real-time" versus batch: The availability versus consistency tradeoff is different!
Simple concepts, hard to master: In Django, it's the ORM. In Airflow, scheduling.
It's all still distributed systems: Which is fortunate, after fifteen years of doing them
Airflow grew organically: It started off as an internal ETL tool
DAG ➡ DagRun: One per scheduled run, as the run starts
Operator ➡ Task: When you call an operator in a DAG
Task ➡ TaskInstance: When a Task needs to run as part of a DagRun
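The three mappings above can be sketched with plain dataclasses. This is a minimal illustration only: the class names mirror the slide, but the fields, the `create_dag_run` helper, and the `scheduled__` run id format are assumptions made for this example and are not Airflow's actual classes. For brevity it creates every TaskInstance up front when the run is created.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Task:
    """Stand-in for what calling an operator inside a DAG produces."""
    task_id: str


@dataclass
class Dag:
    dag_id: str
    tasks: list[Task] = field(default_factory=list)


@dataclass
class TaskInstance:
    """One Task, to be run as part of one particular DagRun."""
    task_id: str
    run_id: str
    state: str = "scheduled"


@dataclass
class DagRun:
    """One run of a DAG."""
    dag_id: str
    run_id: str
    task_instances: list[TaskInstance] = field(default_factory=list)


def create_dag_run(dag: Dag, logical_date: datetime) -> DagRun:
    # Hypothetical helper: one DagRun per scheduled run, with one
    # TaskInstance for each Task in the DAG.
    run_id = f"scheduled__{logical_date.isoformat()}"
    run = DagRun(dag_id=dag.dag_id, run_id=run_id)
    run.task_instances = [TaskInstance(t.task_id, run_id) for t in dag.tasks]
    return run


dag = Dag("etl", [Task("extract"), Task("load")])
run = create_dag_run(dag, datetime(2021, 7, 12))
```

The same `Dag` can be passed to `create_dag_run` again with a different date, which yields a separate `DagRun` with its own set of `TaskInstance` objects.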
Scheduler: Works out what TaskInstances need to run
Executor: Runs TaskInstances and records the results
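The split between the two roles can be shown with a toy loop. This is a sketch under simplifying assumptions, not Airflow's scheduler: the state names, the `find_runnable` and `execute` functions, and the in-memory dictionaries are all hypothetical, and everything runs in one process.

```python
def find_runnable(states, upstream):
    """Scheduler role: pending tasks whose upstream tasks all succeeded."""
    return [
        task_id
        for task_id, state in states.items()
        if state == "pending"
        and all(states[dep] == "success" for dep in upstream.get(task_id, []))
    ]


def execute(task_id, callables, states):
    """Executor role: run one task and record the result."""
    try:
        callables[task_id]()
        states[task_id] = "success"
    except Exception:
        states[task_id] = "failed"


log = []
callables = {
    "extract": lambda: log.append("extract"),
    "transform": lambda: log.append("transform"),
    "load": lambda: log.append("load"),
}
# Each task lists the tasks it depends on.
upstream = {"transform": ["extract"], "load": ["transform"]}
states = {task_id: "pending" for task_id in callables}

# Keep asking the scheduler for work until nothing more can run.
while True:
    runnable = find_runnable(states, upstream)
    if not runnable:
        break
    for task_id in runnable:
        execute(task_id, callables, states)
```

If a task fails, its downstream tasks never become runnable and the loop ends with them still marked `pending`.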
[Diagram: Scheduler, LocalExecutor, Webserver, Database, DAG Files]
[Diagram: Scheduler, CeleryExecutor, Webserver, Database, DAG Files, Redis/Queue, Workers]
The Executor runs inside the Scheduler: Its logic, at least, and the tasks too for local ones
Everything talks to the database: It's the single central point of coordination
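One way to picture the database as the coordination point is three functions that never call each other and only share a table. This is a rough sketch using an in-memory SQLite database; the `task_instance` table layout, the state names, and the function names are assumptions for illustration and do not reflect Airflow's real schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE task_instance (task_id TEXT PRIMARY KEY, state TEXT)")


def scheduler_queue(task_id):
    # Scheduler side: record that a task instance should run.
    db.execute("INSERT INTO task_instance VALUES (?, 'queued')", (task_id,))
    db.commit()


def worker_finish(task_id, ok=True):
    # Worker side: record the outcome of running the task.
    state = "success" if ok else "failed"
    db.execute(
        "UPDATE task_instance SET state = ? WHERE task_id = ?", (state, task_id)
    )
    db.commit()


def webserver_view():
    # Webserver side: read the current state for display.
    return dict(db.execute("SELECT task_id, state FROM task_instance"))


scheduler_queue("extract")
before = webserver_view()
worker_finish("extract")
after = webserver_view()
```

Each component sees the others' work only through the rows it reads, which is why the database ends up as the single place where state is agreed on.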
Scheduler, Workers, Webserver: All can be run in a high-availability pattern
Scheduler: Works out what TaskInstances need to run
Executor: Runs TaskInstances and records the results
Timing, Dependencies, Retries, Concurrency, Callbacks, ...
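Several of the concerns listed above can be read as checks the scheduler makes before a task instance is allowed to run. The sketch below is a hypothetical illustration: the `Candidate` fields, the `can_run` function, and the order of the checks are assumptions, and callbacks are left out entirely.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Candidate:
    not_before: datetime   # timing: earliest moment the task may start
    upstream_states: list  # dependencies: states of the tasks it waits on
    try_number: int = 0    # retries: attempts made so far
    max_tries: int = 3     # retries: attempts allowed in total


def can_run(candidate, now, running, max_running):
    """Return (decision, reason) for one candidate task instance."""
    if now < candidate.not_before:
        return False, "timing"
    if any(state != "success" for state in candidate.upstream_states):
        return False, "dependencies"
    if candidate.try_number >= candidate.max_tries:
        return False, "retries"
    if running >= max_running:
        return False, "concurrency"  # no free slot right now
    return True, "ok"


now = datetime(2021, 7, 12, 12, 0)
ready = Candidate(not_before=now - timedelta(minutes=5), upstream_states=["success"])
decision = can_run(ready, now, running=1, max_running=2)
```

Returning the reason alongside the decision makes it easier to see which concern is holding a task back.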
Scheduler: Works out what TaskInstances need to run
Executor: Runs TaskInstances and records the results
Celery or Kubernetes: Our two main options, currently
[Diagram: Scheduler, CeleryExecutor, Webserver, Database, DAG Files, Redis/Queue, Workers]
[Diagram: Scheduler, KubernetesExecutor, Webserver, Database, DAG Files, Kubernetes, Task Pods]
Tasks are the core part of the model: DAGs are more of a grouping/trigger mechanism
Very flexible runtime environments: Airflow's strength, and its weakness
Airflow doesn't know what you're running: This is both an advantage and a disadvantage.
What can we improve? Let's talk about The Future
More Async & Eventing: Anything that involves waiting!
[Diagram: Scheduler, CeleryExecutor, Webserver, Database, DAG Files, Redis/Queue, Workers, Triggerer]
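The appeal of async for "anything that involves waiting" is that a single process can hold many waits at once instead of tying up one worker per wait. The following is a minimal `asyncio` sketch of that idea only; the `wait_for_event` and `run_triggers` names are hypothetical and this is not how Airflow's Triggerer is implemented.

```python
import asyncio


async def wait_for_event(name, delay, fired):
    # Stand-in for waiting on something external (a time, a file, an API).
    await asyncio.sleep(delay)
    fired.append(name)


async def run_triggers(delays):
    # All waits share one event loop, so none of them blocks the others.
    fired = []
    await asyncio.gather(
        *(wait_for_event(name, delay, fired) for name, delay in delays.items())
    )
    return fired


fired = asyncio.run(run_triggers({"slow": 0.05, "fast": 0.01}))
```

Because the waits overlap, the total time is close to the longest single delay rather than the sum of all of them.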
Removing Database Connections: APIs scale a lot better!
I do like the database, though: There's a lot of benefit in proven technology
Software Engineering is not just coding: Any large-scale project needs documentation, architecture, and coordination
Maintenance & compatibility is crucial: Anyone can write a tool - supporting it takes effort
Airflow is forged by people like you. Coding, documentation, triage, QA, support - it all needs doing.
Thanks.
Andrew Godwin
@andrewgodwin