Apache Airflow in the Cloud: Programmatically orchestrating workloads with Python - PyData London 2018

Introducing the basics of Airflow and how to orchestrate workloads on Google Cloud Platform (GCP).

Kaxil Naik

April 27, 2018

Transcript

  1. Basic instructions
     • Join the chat room: https://tlk.io/pydata-london-airflow (all necessary links are in this chat room)
       ◦ Please fill in the Google form (for those who haven't filled it in prior to this tutorial): this is needed to add you to our Google Cloud Platform project
       ◦ Download the JSON file for Google Cloud access
       ◦ Clone the GitHub repo for this tutorial
       ◦ Follow the link for Airflow installation
       ◦ Link to this slide deck
     • Clone the repository
       ◦ GitHub link: https://github.com/DataReplyUK/airflow-workshop-pydata-london-2018
  2. Agenda
     • Who we are and why we are here
     • Different workflow management systems
     • What is Apache Airflow?
     • Basic concepts and UI walkthrough
     • Tutorial 1: Basic workflow
     • Tutorial 2: Dynamic workflow
     • GCP introduction
     • Tutorial 3: Workflow in GCP
  3. Who we are!
     Hello! I am Kaxil Naik, Data Engineer at Data Reply and an Airflow contributor.
     Hi! I am Satyasheel, Data Engineer at Data Reply.
  4. Bad data can make everything fail: from the internet to the local computer to service APIs.
  5. Oozie
     Pros:
     ◦ Used by thousands of companies
     ◦ Web API, Java API, CLI and HTML support
     ◦ Oldest among all
     Cons:
     ◦ XML
     ◦ Significant effort in managing
     ◦ Difficult to customize
     Image source: https://i.pinimg.com/originals/31/f9/bc/31f9bc4099f1ab2c553f29b8d95274c7.jpg
  6. Luigi
     Pros:
     ◦ Pythonic way to write a DAG
     ◦ Pretty stable
     ◦ Huge community
     ◦ Came from Spotify's engineering team
     Cons:
     ◦ Workflows have to be scheduled externally
     ◦ The open-source Luigi UI is hard to use
     ◦ No built-in monitoring or alerting
     Image source: https://cdn.vectorstock.com/i/1000x1000/88/00/car-sketch-vector-98800.jpg
  7. Airflow
     Pros:
     ◦ Python code base
     ◦ Active community
     ◦ Trigger rules
     ◦ Cool web UI and rich CLI
     ◦ Queues & pools
     ◦ Zombie cleanup
     ◦ Easily extendable
     Cons:
     ◦ No role-based access control
     ◦ Minor issues (deleting DAGs is not straightforward)
     ◦ Umm!!!
     Image source: https://s3.envato.com/files/133677334/Preview_Image_Set/GRID_06.jpg
  10. What is Airflow? Airflow is a platform to programmatically author, schedule and monitor workflows (a.k.a. DAGs).
  11. What is Airflow?
     • Open-source ETL workflow management tool written purely in Python
     • It's the glue that binds your data ecosystem together
     • It can handle failures
     • Alerts on failures (email, Slack)
     • Monitors performance of tasks over time
     • Scales!
     • Developed by Airbnb
     • Inspired by Facebook's Dataswarm
     • It is production-ready
     • It ships with:
       ◦ DAG scheduler
       ◦ Web application UI
       ◦ Powerful CLI
  18. Similar to cron: this is how it looks once you start running your DAGs…
  19. (Screenshot callouts) No. of successfully completed tasks; no. of queued tasks; no. of failed tasks; status of recent DAG runs
  20. (Screenshot callouts) A DAG can be paused by switching it off; no. of successful DAG runs; no. of DAG instances running; no. of times the DAG failed to run
  21. Web UI: Tree View. A tree representation of the DAG that spans across time (task run history).
  22. Workflow as code: the Python code is the DAG (workflow); it contains the DAG definition, the DAG default configurations, the tasks, and the task dependencies.
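To make the "workflow as code" idea concrete, here is a minimal sketch of such a DAG file, assuming Airflow 1.x (the series current at the time of this talk); the DAG id, schedule and bash commands are placeholders:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# DAG default configurations, applied to every task unless overridden
default_args = {
    "owner": "airflow",
    "start_date": datetime(2018, 4, 1),
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

# DAG definition
dag = DAG(
    dag_id="example_workflow",      # placeholder name
    default_args=default_args,
    schedule_interval="@daily",
)

# Tasks
extract = BashOperator(task_id="extract", bash_command="echo extracting", dag=dag)
load = BashOperator(task_id="load", bash_command="echo loading", dag=dag)

# Task dependencies: extract runs before load
extract >> load
```

Dropping a file like this into $AIRFLOW_HOME/dags is all it takes for the scheduler to pick the workflow up.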
  23. Concepts: DAG
     • DAG: Directed Acyclic Graph
     • Define workflow logic as the shape of the graph
     • It is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies.
  24. Concepts: OPERATORS
     • Workflows are composed of operators
     • While DAGs describe how to run a workflow, operators determine what actually gets done
  25. Concepts: OPERATORS
     • 3 main types of operators:
       ◦ Sensors are a type of operator that keeps running until a certain criterion is met
       ◦ Action operators perform an action, or tell another system to perform an action
       ◦ Transfer operators move data from one system to another
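A sketch of the three flavours in code, assuming Airflow 1.x and the `dag` object from the example above; import paths moved between 1.x releases, so adjust them to your installed version:

```python
from datetime import timedelta

from airflow.operators.bash_operator import BashOperator
from airflow.operators.sensors import TimeDeltaSensor  # Airflow 1.9-style import path

# Sensor: keeps running (poking) until a criterion is met
# (here, simply until some time has passed after the scheduled run)
wait_a_bit = TimeDeltaSensor(
    task_id="wait_a_bit",
    delta=timedelta(minutes=5),
    dag=dag,
)

# Action operator: performs an action, or tells another system to perform one
run_job = BashOperator(
    task_id="run_job",
    bash_command="echo running the job",  # placeholder command
    dag=dag,
)

# Transfer operators (e.g. GoogleCloudStorageToBigQueryOperator, used in
# Tutorial 3) move data from one system to another.
wait_a_bit >> run_job
```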
  26. Concepts: TASKS
     • A parameterized instance of an operator. Once an operator is instantiated, it is referred to as a "task". The instantiation defines specific values when calling the abstract operator.
     • The parameterized task becomes a node in a DAG
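For example, the same operator class can be instantiated several times with different parameters, each instantiation producing a distinct task (a sketch, reusing the `dag` object and placeholder commands from above):

```python
from airflow.operators.bash_operator import BashOperator

load_users = BashOperator(
    task_id="load_users",               # every task needs its own task_id
    bash_command="echo loading users",  # placeholder command
    dag=dag,
)

load_orders = BashOperator(
    task_id="load_orders",
    bash_command="echo loading orders",
    dag=dag,
)

# Each parameterized task becomes a node in the DAG
load_users >> load_orders
```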
  27. Concepts: TASK INSTANCE
     • Represents a specific run of a task
     • Characterized as the combination of a DAG, a task, and a point in time
     • Task instances also have an indicative state, which could be "running", "success", "failed", "skipped", "up for retry", etc.
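A running task can inspect its own task instance through the template context; a minimal sketch assuming Airflow 1.x and the `dag` object from earlier:

```python
from airflow.operators.python_operator import PythonOperator

def report(**context):
    ti = context["ti"]  # the TaskInstance object for this particular run
    # DAG id + task id + execution date uniquely identify the task instance
    print(ti.dag_id, ti.task_id, context["execution_date"], ti.state)

report_task = PythonOperator(
    task_id="report",
    python_callable=report,
    provide_context=True,  # Airflow 1.x: pass the runtime context into the callable
    dag=dag,
)
```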
  28. Architecture
     Airflow comes with 5 main types of built-in execution modes:
     • Sequential (runs on a single machine)
     • Local (runs on a single machine)
     • Celery (out of the box; runs on a distributed system)
     • Mesos (community driven; runs on a distributed system)
     • Kubernetes (community driven; runs on a distributed system)
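The execution mode is picked in airflow.cfg (or via the AIRFLOW__CORE__EXECUTOR environment variable). A sketch of the relevant settings; note that the Local and Celery executors also need a real metadata database instead of SQLite, and the connection string below is only an example:

```ini
# $AIRFLOW_HOME/airflow.cfg (excerpt)
[core]
# SequentialExecutor (default), LocalExecutor, CeleryExecutor, ...
executor = LocalExecutor
# Local/Celery executors need a proper database backend, e.g. Postgres
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow
```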
  29. Architecture: Sequential Executor
     • Default mode
     • Minimal setup; works with SQLite as well
     • Processes 1 task at a time
     • Good for demo purposes
  30. Architecture: Local Executor
     • Spawned by the scheduler process
     • Vertical scaling
     • Production grade
     • Does not need a broker or any other negotiator
  31. Architecture: Celery Executor
     • Vertical & horizontal scaling
     • Production grade
     • Can be monitored (via Flower)
     • Supports pools and queues
  32. Instructions to start Airflow
     • Set up the Airflow installation directory
       $ export AIRFLOW_HOME=~/airflow
     • Initialise the Airflow database
       $ source airflow_workshop/bin/activate
       $ airflow initdb
     • Start the web server (default port is 8080)
       $ airflow webserver -p 8080
     • Start the scheduler (in another terminal)
       $ source airflow_workshop/bin/activate
       $ airflow scheduler
     • Visit localhost:8080 in the browser to see the Airflow web UI
  33. Copy DAGs
     • Clone the Git repo
       $ git clone https://github.com/DataReplyUK/airflow-workshop-pydata-london-2018.git
     • Copy the dags folder into AIRFLOW_HOME
       $ cp -r airflow-workshop-pydata-london-2018/dags $AIRFLOW_HOME/dags
  34. Concepts: XCOM
     • Abbreviation of "cross-communication"
     • Means of communication between task instances
     • Saved in the database as a pickled object
     • Best suited for small pieces of data (IDs, etc.)
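A minimal XCom sketch, assuming Airflow 1.x and the `dag` object from the earlier examples; task names and the pushed value are placeholders:

```python
from airflow.operators.python_operator import PythonOperator

def push_id(**context):
    # Explicit push; a value returned from the callable is also pushed
    # automatically under the default key ("return_value").
    context["ti"].xcom_push(key="record_id", value=42)

def pull_id(**context):
    record_id = context["ti"].xcom_pull(task_ids="push_task", key="record_id")
    print("got record id:", record_id)

push_task = PythonOperator(task_id="push_task", python_callable=push_id,
                           provide_context=True, dag=dag)
pull_task = PythonOperator(task_id="pull_task", python_callable=pull_id,
                           provide_context=True, dag=dag)

push_task >> pull_task
```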
  35. Concepts: Trigger Rules
     • Condition under which a task is triggered, based on the state of its upstream (parent) tasks
     • Each operator has a 'trigger_rule' argument
     • The different trigger rules are:
       ◦ all_success: (default) all parents have succeeded
       ◦ all_failed: all parents are in a failed or upstream_failed state
       ◦ all_done: all parents are done with their execution
       ◦ one_failed: fires as soon as at least one parent has failed; it does not wait for all parents to be done
       ◦ one_success: fires as soon as at least one parent succeeds; it does not wait for all parents to be done
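For example, a clean-up task that should run no matter how its parents finished can be declared like this (a sketch, reusing the `dag` object and placeholder tasks from earlier):

```python
from airflow.operators.bash_operator import BashOperator

cleanup = BashOperator(
    task_id="cleanup",
    bash_command="echo cleaning up",  # placeholder command
    trigger_rule="all_done",          # run once all parents are done, even if they failed
    dag=dag,
)

# e.g. make it downstream of the earlier placeholder tasks:
# load_users >> cleanup
# load_orders >> cleanup
```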
  36. Concepts: Variables
     • Generic way to store and retrieve arbitrary content
     • Can be used for storing settings as a simple key-value store
     • Variables can be created, updated, deleted and exported into a JSON file from the UI
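Inside a DAG file, Variables are read and written through airflow.models.Variable; a sketch using the bq_destination_dataset_table variable that Tutorial 3 imports later (the fallback value is a placeholder):

```python
from airflow.models import Variable

# Plain string value, with a fallback if the variable has not been created yet
destination_table = Variable.get(
    "bq_destination_dataset_table",
    default_var="pydata_airflow.firstname_lastname_usa_names",  # placeholder
)

# Variables can also hold JSON blobs:
# settings = Variable.get("my_settings", deserialize_json=True)

# And they can be set from code as well as from the UI or CLI
Variable.set("last_run_note", "set from code")
```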
  37. Concepts: Branching
     • Branches the workflow based on a condition
     • The condition can be defined using the BranchPythonOperator
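A branching sketch, assuming Airflow 1.x and the `dag` object from earlier; the branch callable returns the task_id of the path to follow, and the condition here (weekday vs. weekend) is just an illustration:

```python
from airflow.operators.python_operator import BranchPythonOperator
from airflow.operators.dummy_operator import DummyOperator

def choose_path(**context):
    # Branch on the day of the week of the execution date
    if context["execution_date"].weekday() < 5:
        return "weekday_path"
    return "weekend_path"

branch = BranchPythonOperator(
    task_id="branch",
    python_callable=choose_path,
    provide_context=True,
    dag=dag,
)

weekday_path = DummyOperator(task_id="weekday_path", dag=dag)
weekend_path = DummyOperator(task_id="weekend_path", dag=dag)

branch >> weekday_path
branch >> weekend_path
```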
  38. GCP: Introduction
     • A cloud computing service offered by Google
     • Popular products that we are going to use today:
       ◦ Google Cloud Storage: file and object storage
       ◦ BigQuery: large-scale analytics data warehouse
       ◦ And many more…
     • Click here to access our GCP project
       ◦ Google Cloud Storage - link
       ◦ BigQuery - link
  39. Tutorial 3: Create Connection
     • Click on Admin ↠ Connections
     • Or visit http://localhost:8080/admin/connection/
  40. Tutorial 3: Create Connection
     • Click on Create and enter the following details:
       ◦ Conn Id: airflow-service-account
       ◦ Conn Type: Google Cloud Platform
       ◦ Project ID: pydata2018-airflow
       ◦ Keyfile Path: PATH_TO_YOUR_JSON_FILE
         ▪ e.g. "/Users/kaxil/Desktop/Service_Account_Keys/sb01-service-account.json"
       ◦ Keyfile JSON:
       ◦ Scopes: https://www.googleapis.com/auth/cloud-platform
     • Click on Save
  41. Tutorial 3: Import Variables
     • Click on Admin ↠ Variables
     • Or visit http://localhost:8080/admin/variable/
  42. Tutorial 3: Import Variables
     • Click on Choose file and select variables.json (found in the directory where you cloned the Git repo)
     • Click on the Import Variables button
     • Edit the bq_destination_dataset_table variable to enter "pydata_airflow.kaxil_usa_names", replacing kaxil with your firstname_lastname
  43. Tutorial 3
     • Objective:
       ◦ Wait for a file to be uploaded to Google Cloud Storage
       ◦ Once the files are uploaded, a BigQuery table is created and the data from GCS is imported into it
     • Visit:
       ◦ Folder where files would be uploaded: click here
       ◦ Dataset where the table would be created: click here
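The actual DAG for this tutorial lives in the workshop repository; the sketch below only illustrates the shape of such a workflow with the GCP contrib operators shipped in Airflow 1.x. Bucket, object, schema and table names are placeholders, and the exact operator and parameter names vary a little between 1.x releases:

```python
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.contrib.sensors.gcs_sensor import GoogleCloudStorageObjectSensor
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

dag = DAG(
    dag_id="gcs_to_bigquery_example",   # placeholder; not the workshop DAG id
    start_date=datetime(2018, 4, 1),
    schedule_interval=None,             # trigger manually for the tutorial
)

# Wait for the file to appear in the GCS bucket
wait_for_upload = GoogleCloudStorageObjectSensor(
    task_id="wait_for_upload",
    bucket="pydata2018-airflow-bucket",              # placeholder bucket name
    object="uploads/usa_names.csv",                  # placeholder object path
    google_cloud_conn_id="airflow-service-account",  # argument name varies across 1.x versions
    dag=dag,
)

# Create the BigQuery table and load the uploaded file into it
load_to_bq = GoogleCloudStorageToBigQueryOperator(
    task_id="load_to_bq",
    bucket="pydata2018-airflow-bucket",
    source_objects=["uploads/usa_names.csv"],
    destination_project_dataset_table=Variable.get("bq_destination_dataset_table"),
    schema_fields=[                                  # placeholder schema
        {"name": "name", "type": "STRING", "mode": "NULLABLE"},
        {"name": "count", "type": "INTEGER", "mode": "NULLABLE"},
    ],
    source_format="CSV",
    write_disposition="WRITE_TRUNCATE",
    google_cloud_storage_conn_id="airflow-service-account",
    bigquery_conn_id="airflow-service-account",
    dag=dag,
)

wait_for_upload >> load_to_bq
```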
  44. Summary
     • Airflow = workflow as code
     • Integrates seamlessly into the "pythonic" data science stack
     • Easily extensible
     • Clean management of workflow metadata
     • Different alerting systems (email, Slack)
     • Huge community and under active development
     • Proven in real-world projects