La Kopi & WDA4 - Developing a data pipeline on cloud

Developing a data pipeline on cloud
By Kamolphan Liwprasert (Fon)

Presented at the La Kopi Cloud event hosted by Google Developer Group (GDG) on Jul 21st, 2021.

WDA4: https://youtu.be/jtwDGufzFxo
La Kopi: https://youtu.be/J48mHBh5DyY

Kamolphan Liwprasert

July 21, 2021

Transcript

  1. Developing a data pipeline on cloud, for analytics / ML projects. Fon Liwprasert, ML Engineer at Sertis.
  2. Kamolphan Liwprasert (Fon), Machine Learning Engineer at Sertis. I'm an ML engineering enthusiast with a data engineering background. I love to develop solutions on cloud. 6x GCP certified.
  3. Agenda
     • Why do we need a data pipeline?
     • What are the options on cloud? Storage / Compute / Pipelines
     • Introducing Apache Airflow
     • Reference architecture + demo code
  4. Purposes of a data pipeline
     • Ingest data from multiple sources
     • Transform or clean the data to ensure data quality
     • Automate the process
  5. Choosing compute options
     • Google Compute Engine (GCE), IaaS (Infrastructure as a Service): virtual machines
     • Google Kubernetes Engine (GKE), CaaS (Container as a Service): managed Kubernetes cluster
     • Cloud Run, CaaS: serverless containers
     • Google App Engine (GAE), PaaS (Platform as a Service): serverless application platform
     • Google Cloud Functions (GCF), FaaS (Function as a Service): serverless functions
  6. Choosing data processing options
     1. Serverless options: Cloud Functions or Cloud Run for processing data, with Cloud Scheduler (and optionally Cloud Workflows) as workflow and scheduler
     2. Cloud Dataproc for Spark or Hadoop data processing, or Cloud Dataflow for unified pipelines with Apache Beam
     3. Cloud Composer as workflow and scheduler
  7. Option 1: Low-cost & serverless option. Processing data (light workload) with Cloud Functions or Cloud Run; workflow and scheduler via Cloud Scheduler calling a REST API (or Pub/Sub for Cloud Functions).
     ✓ Serverless: easy & fast
     ✓ Low-cost solution
     ✓ Suitable for light workloads
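     A minimal sketch of option 1, assuming an HTTP-triggered Cloud Function that Cloud Scheduler calls on a schedule; the function name and payload shape are illustrative, not from the talk:

        import json

        def process_data(request):
            """HTTP entry point; Cloud Scheduler calls this URL on its schedule.
            `request` is the Flask Request the Cloud Functions runtime passes in."""
            payload = request.get_json(silent=True) or {}
            # Light transformation: keep only records with a non-null value.
            records = [r for r in payload.get("records", []) if r.get("value") is not None]
            # A real pipeline would write the cleaned records to storage here.
            return json.dumps({"processed": len(records)}), 200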
  8. Option 2: Big data solution. Processing data (big data workload) with Cloud Dataproc for Spark or Hadoop data processing, or Cloud Dataflow for unified data pipelines with Apache Beam; workflow and scheduler via Cloud Scheduler and a REST API.
     ✓ Big data frameworks: Spark, Apache Beam, Flink
     ✓ Scalability and reliability
     ✓ Open-source solutions
  9. Option 3: Cloud Composer (Airflow). A managed service built on Kubernetes Engine + Cloud SQL.
     ✓ Easier maintenance
     ✓ Scalability and reliability
     ✓ Suitable for a large number of jobs that require workers
  10. Why Apache Airflow?
     • Popular open-source project for ETL and data pipeline orchestration.
     • All code is in Python: easy to learn and use.
     • Can also run locally for development environments.
  11. Apache Airflow basic components
     • Sensor: waits on an event, e.g. poking for a file
     • Operator: runs an action, e.g. PythonOperator
     • Hook: interface to an external service or system
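     A minimal sketch of the three components, assuming Airflow 2 with the SFTP provider installed; the DAG name, file paths, and connection are illustrative:

        from datetime import datetime

        from airflow import DAG
        from airflow.operators.python import PythonOperator     # Operator: runs an action
        from airflow.providers.sftp.hooks.sftp import SFTPHook  # Hook: talks to an external system
        from airflow.sensors.filesystem import FileSensor       # Sensor: waits on an event

        def list_remote_files():
            # A Hook is used inside a task to reach an external service (here, SFTP).
            hook = SFTPHook()  # uses the default "sftp_default" connection
            return hook.list_directory("/upload")

        with DAG("component_demo", start_date=datetime(2021, 7, 1), schedule_interval=None) as dag:
            # Sensor: poke for a local file every 60 seconds until it appears.
            wait_for_file = FileSensor(task_id="wait_for_file", filepath="/data/input.csv", poke_interval=60)
            # Operator: run the Python callable above as a task.
            list_files = PythonOperator(task_id="list_files", python_callable=list_remote_files)
            wait_for_file >> list_files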
  12. Reference architecture: batch ingestion from an SFTP server to BigQuery via Cloud Storage, orchestrated by Cloud Composer (SFTPSensor, SFTPToGCSOperator, GCSToBigQueryOperator), serving analytics workloads and BI dashboards.
  13. DAG overview: SFTPSensor checks if a file is available, SFTPToGCSOperator uploads that file to GCS, and GCSToBigQueryOperator loads the file from GCS into BigQuery.
  14. Simple data pipeline using Airflow (1): import the necessary components and initialize the DAG with a DAG name and schedule. Demo code: bit.ly/airflow_gcp_demo
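     A minimal sketch of this step, with an illustrative DAG name and schedule; the later steps reuse these imports and this `dag` object:

        from datetime import datetime

        from airflow import DAG
        from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
        from airflow.providers.google.cloud.transfers.sftp_to_gcs import SFTPToGCSOperator
        from airflow.providers.sftp.sensors.sftp import SFTPSensor

        # Initialize the DAG: name, schedule, start date.
        dag = DAG(
            dag_id="sftp_to_bigquery_demo",  # DAG name (illustrative)
            schedule_interval="@daily",      # schedule
            start_date=datetime(2021, 7, 1),
            catchup=False,
        )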
  15. Simple data pipeline using Airflow (2): an SFTPSensor (wait_for_file, task ID check-for-file) waits for a file to become available, configured with an SFTP connection and the path to the file. bit.ly/airflow_gcp_demo
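     A sketch of the sensor task, assuming the `dag` from step (1); the connection ID and file path are illustrative:

        # Sensor: wait until the file shows up on the SFTP server.
        wait_for_file = SFTPSensor(
            task_id="check-for-file",
            sftp_conn_id="sftp_default",  # SFTP connection
            path="/upload/data.csv",      # path to file (illustrative)
            poke_interval=60,
            dag=dag,
        )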
  16. Simple data pipeline using Airflow (3): an SFTPToGCSOperator (upload_file_to_gcs, task ID upload-file-from-sftp) uploads file(s) from SFTP to GCS, configured with a GCP connection, the path to the file, and the GCS destination. bit.ly/airflow_gcp_demo
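     A sketch of the transfer task, again assuming the `dag` from step (1); the bucket and paths are illustrative:

        # Upload the file from the SFTP server into a GCS bucket.
        upload_file_to_gcs = SFTPToGCSOperator(
            task_id="upload-file-from-sftp",
            sftp_conn_id="sftp_default",
            gcp_conn_id="google_cloud_default",   # GCP connection
            source_path="/upload/data.csv",       # path to file
            destination_bucket="my-demo-bucket",  # GCS destination (illustrative)
            destination_path="data/data.csv",
            dag=dag,
        )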
  17. Simple data pipeline using Airflow (4): a GCSToBigQueryOperator (load_to_bigquery, task ID load-to-bigquery) loads data into BigQuery from GCS source file(s), configured with the GCS source, a JSON schema (dictionary), and the source format (CSV, Avro, or Parquet). bit.ly/airflow_gcp_demo
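     A sketch of the load task; the table name and schema fields are illustrative:

        # Load the file from GCS into a BigQuery table.
        load_to_bigquery = GCSToBigQueryOperator(
            task_id="load-to-bigquery",
            bucket="my-demo-bucket",                                   # source GCS
            source_objects=["data/data.csv"],
            destination_project_dataset_table="my_dataset.my_table",  # illustrative
            schema_fields=[                                            # JSON schema as dictionaries
                {"name": "id", "type": "INTEGER", "mode": "REQUIRED"},
                {"name": "value", "type": "STRING", "mode": "NULLABLE"},
            ],
            source_format="CSV",                                       # CSV, AVRO, or PARQUET
            write_disposition="WRITE_TRUNCATE",
            dag=dag,
        )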
  18. Simple data pipeline using Airflow (5): put everything together and create the DAG! wait_for_file (SFTPSensor, check-for-file) >> upload_file_to_gcs (SFTPToGCSOperator, upload-file-from-sftp) >> load_to_bigquery (GCSToBigQueryOperator, load-to-bigquery). bit.ly/airflow_gcp_demo
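     Chaining the three tasks defined in the sketches above completes the pipeline:

        # Wire the tasks together: sense the file, upload it, then load it.
        wait_for_file >> upload_file_to_gcs >> load_to_bigquery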
  19. Key takeaways 🔑 Choosing the right solution for a data pipeline depends on requirements and workload.
  20. Thank you 😃 Let's connect! Fon Liwprasert, ML Engineer at Sertis. linkedin.com/in/fonylew / fonylew