Building a Machine Learning Pipeline with Cloud Composer

2kyym
August 21, 2020

This deck contains the material presented at Discovery DataScience Meet up (DsDS) #0, with several slides added.
https://scramble.connpass.com/event/171602/

Transcript

  1. 

  2.  Pipeline (DAG) definition

          import datetime
          import logging

          from airflow.models import DAG
          from airflow.operators.bash_operator import BashOperator
          from airflow.operators.python_operator import PythonOperator


          def greeting():
              logging.info("Hello World!")


          DEFAULT_ARGS = {
              "start_date": datetime.datetime(2018, 1, 1),
              "retries": 5,
          }

          dag = DAG(
              dag_id="test_dag",
              schedule_interval=datetime.timedelta(days=1),
              default_args=DEFAULT_ARGS,
          )

          # Create tasks from operators.
          hello_python = PythonOperator(task_id="hello", python_callable=greeting, dag=dag)
          goodbye_bash = BashOperator(task_id="bye", bash_command="echo Goodbye.", dag=dag)

          # Define the task execution order.
          hello_python >> goodbye_bash
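  3.  How to pass information to tasks

      Variables
        - Something like environment variables, shared across the entire Composer environment.
        - Use them to store static values. Take care not to use the same key in different DAGs.
        - Examples: GCS input/output paths, GKE cluster names, GCR image digests.

      XCom (cross communication)
        - A dictionary shared only within a DagRun and passed from one task to another.
        - Use it to pass values that change on every run. It is accessible from the TaskInstance object.
        - Examples: the aggregation period of the training data, a hash value appended to file names.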
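  4.  Adding a Variable

          def set_env_variables(c, key, value):
              # Set an Airflow Variable through the Composer environment's Airflow CLI.
              # `c` is presumably an Invoke/Fabric context that runs shell commands.
              c.run(
                  f"gcloud --project {PROJECT} composer environments run {COMPOSER_NAME} --location {LOCATION} \
                  variables -- -s {key} {value}"
              )

      Referencing a Variable

          from airflow.models import Variable

          VALUE = Variable.get(key)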
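The command on this slide interpolates the key and value straight into a shell string. As a minimal sketch, assuming one would rather build the same invocation as an argument list, the helper below reproduces it; `build_variable_set_command` and the sample values are hypothetical, and the `variables -- -s` form is copied from the slide rather than checked against a particular Airflow version.

```python
import shlex


def build_variable_set_command(project, composer_name, location, key, value):
    """Return the gcloud invocation from the slide as an argument list."""
    return [
        "gcloud", "--project", project,
        "composer", "environments", "run", composer_name,
        "--location", location,
        "variables", "--", "-s", key, value,
    ]


# Hypothetical sample values; the value contains a space on purpose.
cmd = build_variable_set_command(
    "dummy-gcp", "dummy-composer", "asia-northeast1", "bucket_name", "my bucket"
)

# shlex.join quotes each argument, so the space in the value survives a shell.
print(shlex.join(cmd))
```

An argument list can be handed to `subprocess.run(cmd, check=True)` without any shell quoting concerns; the quoted string form is only needed when the command goes through a shell, as with `c.run`.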
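  5.  Example of adding an XCom value

          from datetime import timedelta


          def create_args(**kwargs):
              # The execution date and the task instance arrive as keyword arguments.
              # Note: Airflow 1.10.x passes them only when provide_context=True is set.
              execution_date = kwargs["execution_date"]
              preprocess_start_datetime = execution_date - timedelta(days=PREPROCESS_DIFF)
              # "ti" is short for task instance (task_instance.xcom_push).
              kwargs["ti"].xcom_push(
                  key="preprocess_start_datetime",
                  value=preprocess_start_datetime.strftime("%Y-%m-%dT%H:%M:%S"),
              )


          # Push the value in the first task of the DAG.
          create_args_task = PythonOperator(
              task_id="create_args", python_callable=create_args, dag=dag
          )

      When PythonOperator runs the function, the execution date and the task instance can be read from the variable-length keyword arguments.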
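The push logic above can be exercised without an Airflow installation. The following is a minimal sketch under stated assumptions: `FakeTaskInstance` is a hypothetical stand-in for Airflow's TaskInstance that only records pushed values, and `PREPROCESS_DIFF = 7` is an illustrative value, not the one used in the talk.

```python
from datetime import datetime, timedelta

PREPROCESS_DIFF = 7  # hypothetical look-back window in days


class FakeTaskInstance:
    """Hypothetical stand-in for Airflow's TaskInstance; records pushed values."""

    def __init__(self):
        self.pushed = {}

    def xcom_push(self, key, value):
        self.pushed[key] = value


def create_args(**kwargs):
    # Same logic as on the slide: derive the start datetime and push it as a string.
    execution_date = kwargs["execution_date"]
    start = execution_date - timedelta(days=PREPROCESS_DIFF)
    kwargs["ti"].xcom_push(
        key="preprocess_start_datetime",
        value=start.strftime("%Y-%m-%dT%H:%M:%S"),
    )


ti = FakeTaskInstance()
create_args(execution_date=datetime(2020, 8, 21), ti=ti)
print(ti.pushed)  # {'preprocess_start_datetime': '2020-08-14T00:00:00'}
```

Pushing a formatted string rather than a datetime object keeps the value directly usable as a command-line argument in downstream tasks.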
  6.  Factoring out the common parts and writing one Operator wrapper per task keeps the DAG file tidy.

      main_dag.py

          create_args_task = PythonOperator(
              task_id="create_args", python_callable=create_args, dag=dag
          )

          profiler_task = profiler_operator.create_operator(dag)
          preprocess_a_task = preprocess_operator.create_operator(dag, "a")
          preprocess_b_task = preprocess_operator.create_operator(dag, "b")
          train_task = train_operator.create_operator(dag)

          create_args_task >> profiler_task >> [
              preprocess_a_task,
              preprocess_b_task,
          ] >> train_task

      preprocess_operator.py (imported by main_dag.py)

          def create_operator(dag, task_id, create_args_task_id):
              container_arguments = [
                  "--bucket_name",
                  BUCKET_NAME,
                  "preprocess",
                  "--start_datetime",
                  "{{ ti.xcom_pull(task_ids='"
                  + create_args_task_id
                  + "', key='preprocess_start_datetime') }}",
                  "--bq_dataset_name",
                  BQ_DATASET_NAME,
                  "--gcs_path",
                  GCS_PATH,
              ]
              operator = GKEPodOperator(
                  task_id=task_id,
                  project_id=PROJECT,
                  location=CLUSTER_LOCATION,
                  cluster_name=CLUSTER_NAME,
                  namespace="default",
                  image=IMAGE,
                  arguments=container_arguments,
                  dag=dag,
              )
              return operator
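To make the fan-out and fan-in in the `>>` chain concrete, here is a minimal sketch that models the same dependencies with the standard library's `graphlib` (Python 3.9+). It is not how Airflow schedules tasks; it only shows which tasks become runnable together. The task names are illustrative labels taken from the variable names on the slide.

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks it depends on (its upstream tasks).
upstream = {
    "profiler": {"create_args"},
    "preprocess_a": {"profiler"},
    "preprocess_b": {"profiler"},
    "train": {"preprocess_a", "preprocess_b"},
}

sorter = TopologicalSorter(upstream)
sorter.prepare()

# Collect tasks in "waves": every task in a wave has all of its upstream tasks done.
waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())
    waves.append(ready)
    sorter.done(*ready)

print(waves)
# [['create_args'], ['profiler'], ['preprocess_a', 'preprocess_b'], ['train']]
```

The two preprocess tasks land in the same wave, which is what the list in the middle of the `>>` chain expresses: both wait for the profiler, and training waits for both.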
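  7.  Doing everything with GKEPodOperator

      Reasons
        - We want the ML implementation and the pipeline implementation to be as independent as possible.
        - We do not want the data scientists to have to think about the pipeline implementation.
        - Because of the constraints of PythonOperator (described later) and similar issues, the Composer environment is ill-suited to complex processing.

      Concretely
        - Create a separate repository and manage the ML algorithm implementations there.
        - Preprocessing, training, and all other miscellaneous processing are encapsulated in a Docker image.
        - Even when BigQuery is used, the SQL and the job-submission logic live in that same image.
        - This approach is also the better choice once analysis, experiments, and local development are taken into account.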
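  8.  Constraints of PythonOperator

      PythonOperator
        - Probably the best option for simple processing that needs no PyPI packages.
        - Passing values with Variable get/set is very easy, and XCom is easy to use as well.
        - On the other hand, processing that needs PyPI packages pollutes the Composer environment.
        - Reproducibility is hard to maintain, for example because of conflicts with Airflow's own package dependencies.

      PythonVirtualenvOperator
        - It looked like the best of both worlds but turned out to be the worst of both.
        - Neither Variables nor XCom can be used, so it is less convenient than GKEPodOperator.
        - Everything has to be packed into a single callable, which makes it very hard to use.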
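  9.  Caveats when doing everything with GKEPodOperator

      Drawbacks
        - Variables cannot be read from the code at task execution time.
        - xcom_push and xcom_pull cannot be used either.

      Solutions
        - Pass everything in through the container arguments of GKEPodOperator.
        - Installing the gcloud SDK and kubectl in the Dockerfile makes almost anything possible.
        - If permissions for another environment are needed, pass an encrypted service account key file.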
  10.  Passing arguments to GKEPodOperator

      Make input/output paths, aggregation periods, and similar settings all controllable through command-line arguments.
      Note: this is outside the scope of the talk, but Python Fire or Invoke (Fabric) makes this easy.

          container_arguments = [
              "preprocess",
              "--bucket_name",
              BUCKET_NAME,
              "--start_datetime",
              PREPROCESS_START_DATETIME,
              "--bq_dataset_name",
              BQ_DATASET_NAME,
              "--gcs_export_path",
              GCS_EXPORT_PATH,
          ]
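On the container side, something has to parse these arguments. The slide suggests Python Fire or Invoke; the sketch below swaps in the standard library's `argparse` so that it runs without extra packages. The program name `runner`, the helper `build_parser`, and the sample values are hypothetical; only the subcommand and flag names come from the slide.

```python
import argparse


def build_parser():
    """Build a CLI with a `preprocess` subcommand matching the flags on the slide."""
    parser = argparse.ArgumentParser(prog="runner")
    subparsers = parser.add_subparsers(dest="command", required=True)

    preprocess = subparsers.add_parser("preprocess")
    preprocess.add_argument("--bucket_name", required=True)
    preprocess.add_argument("--start_datetime", required=True)
    preprocess.add_argument("--bq_dataset_name", required=True)
    preprocess.add_argument("--gcs_export_path", required=True)
    return parser


# Hypothetical values standing in for the Variables and XCom values in the DAG.
container_arguments = [
    "preprocess",
    "--bucket_name", "dummy-bucket",
    "--start_datetime", "2020-08-14T00:00:00",
    "--bq_dataset_name", "dummy_dataset",
    "--gcs_export_path", "gs://dummy-bucket/export",
]

args = build_parser().parse_args(container_arguments)
print(args.command, args.start_datetime)
```

Because every input arrives as an argument, the same image can be run locally with `docker run` and different values, which is the independence from the pipeline that the talk aims for.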
  11.  Passing arguments to GKEPodOperator

      Fetch the required Variables in advance.

          BUCKET_NAME = Variable.get("bucket_name")
          CLUSTER_NAME = Variable.get("cluster_name")
          CLUSTER_LOCATION = Variable.get("cluster_location")
          IMAGE = f"gcr.io/{PROJECT}/test-image@{Variable.get('test_image_digest')}"
          BQ_PROFILE_DATASET_NAME = Variable.get("bq_dataset_name")
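The `IMAGE` line pins the container by digest instead of by tag, so the pipeline runs exactly the image that was registered. A minimal sketch of that string construction with a sanity check follows; `build_image_reference` is a hypothetical helper, and the check assumes the digest is a `sha256:` value with 64 hexadecimal characters.

```python
import re

# Assumption: digests look like "sha256:" followed by 64 lowercase hex characters.
DIGEST_PATTERN = re.compile(r"sha256:[0-9a-f]{64}")


def build_image_reference(project, image_name, digest):
    """Return a digest-pinned image reference such as gcr.io/<project>/<name>@sha256:..."""
    if not DIGEST_PATTERN.fullmatch(digest):
        raise ValueError(f"unexpected digest format: {digest!r}")
    return f"gcr.io/{project}/{image_name}@{digest}"


# Hypothetical digest used only for illustration.
example_digest = "sha256:" + "0" * 64
image = build_image_reference("dummy-gcp", "test-image", example_digest)
print(image)
```

Failing early on a malformed value is useful here because the digest comes from a Variable that an external CI job overwrites.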
  12.  XCom values can be referenced through Jinja templates.

          def create_operator(dag, task_id, create_args_task_id):
              container_arguments = [
                  "preprocess",
                  "--bucket_name",
                  BUCKET_NAME,
                  "--start_datetime",
                  # The Jinja template below resolves to the XCom value.
                  "{{ ti.xcom_pull(task_ids='"
                  + create_args_task_id
                  + "', key='preprocess_start_datetime') }}",
                  "--bq_dataset_name",
                  BQ_DATASET_NAME,
                  "--gcs_path",
                  GCS_PATH,
              ]
              operator = GKEPodOperator(
                  task_id=task_id,
                  project_id=PROJECT,
                  location=CLUSTER_LOCATION,
                  cluster_name=CLUSTER_NAME,
                  namespace="default",
                  image=IMAGE,
                  # The arguments parameter is subject to template substitution.
                  arguments=container_arguments,
                  dag=dag,
              )
              return operator

      Note: strings enclosed in curly braces are template-substituted just before the task runs.
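The template string is assembled by concatenation, which is easy to get wrong. As a small sketch, a helper can build the expression in one place; `xcom_pull_template` is a hypothetical name, and the result is only a string that Airflow would render later, not a rendered value.

```python
def xcom_pull_template(task_id, key):
    """Build the Jinja expression that pulls `key` from the XCom of `task_id`."""
    return "{{ ti.xcom_pull(task_ids='" + task_id + "', key='" + key + "') }}"


template = xcom_pull_template("create_args", "preprocess_start_datetime")
print(template)
# {{ ti.xcom_pull(task_ids='create_args', key='preprocess_start_datetime') }}
```

Plain concatenation is used instead of an f-string because literal curly braces would otherwise need to be doubled.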
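  13.  Automatically reflecting ML algorithm updates

      Goal
        - Ideally, a change in the ML repository requires no change on the pipeline side.
        - That is, a state in which updating the image run by GKEPodOperator is all that is needed.

      Concretely
        - When a change is merged into the master branch of the ML repository, CircleCI builds the image automatically.
        - The digest of the built GCR image is set as a Variable with gcloud composer (variables set).
        - The update is then picked up automatically at the next pipeline run.
        - When command-line arguments change or features are added, the pipeline has to be modified after all.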
  14.  config.yml for CircleCI, placed in the ML repository. The last command registers the GCR image digest in Variables.

          build_dev:
            docker:
              - image: google/cloud-sdk
            environment:
              GCP_PROJECT: dummy-gcp
              COMPOSER_NAME: dummy-composer
              IMAGE_TAG: dummy-tag
            steps:
              - checkout
              - setup_remote_docker:
                  docker_layer_caching: true
              - attach_workspace:
                  at: .
              - run:
                  name: build
                  command: &build |
                    TAG=gcr.io/${GCP_PROJECT}/test-image:${IMAGE_TAG}
                    docker build -t ${TAG} -f images/runner/Dockerfile .
                    docker push ${TAG}
                    IMAGE_DIGEST=$(gcloud container images describe gcr.io/${GCP_PROJECT}/test-image:${IMAGE_TAG} --format='value(image_summary.digest)')
                    gcloud composer environments run ${COMPOSER_NAME} --location asia-northeast1 variables -- -s pipeline_image_digest ${IMAGE_DIGEST}
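  15.  Model evaluation and automatic deployment

      Automatic model deployment
        - It is enough to overwrite the Variable that holds the image digest used for inference.
        - In other words, this is almost the same as updating the pipeline image from CircleCI on deploy.
        - The final task compares the new and old models and overwrites the Variable.

      Model evaluation
        - Details are omitted, but the final model evaluation is done with a taxi-driving simulator.
        - Batch inference and simulation are run in parallel for both the new model and the existing model.
        - Model update criterion: the new model must outperform the existing model for two consecutive weeks.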
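The update criterion on this slide (beating the existing model for two consecutive weeks) can be sketched as a small predicate. This is an illustration under assumptions that the talk does not state: one score per day for each model, higher is better, and a strict comparison. `should_update_model` and the sample scores are hypothetical.

```python
def should_update_model(new_scores, existing_scores, required_days=14):
    """Return True if the new model beat the existing one on each of the last `required_days` days."""
    if len(new_scores) != len(existing_scores):
        raise ValueError("score series must have the same length")
    if len(new_scores) < required_days:
        # Not enough history yet to satisfy the criterion.
        return False
    recent = zip(new_scores[-required_days:], existing_scores[-required_days:])
    return all(new > old for new, old in recent)


# Hypothetical daily scores: the new model wins on all 14 days.
existing = [0.80] * 14
new = [0.82] * 14
print(should_update_model(new, existing))  # True

# A single losing day inside the window blocks the update.
new_with_dip = [0.82] * 13 + [0.79]
print(should_update_model(new_with_dip, existing))  # False
```

In the pipeline described above, a check like this would run in the final task, and only a `True` result would lead to overwriting the Variable that holds the inference image digest.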