Case studies of Spark Extension development by data scientists.

Takahiro Yoshinaga
LINE Data Science Team 4 Data Scientist
https://linedevday.linecorp.com/jp/2019/sessions/S1-17

LINE DevDay 2019

November 20, 2019

Transcript

  1. 2019 DevDay: Case Studies of Spark Extension Development by Data Scientists

    > Takahiro Yoshinaga
    > LINE Data Science Team 4 Data Scientist
  2. Self Introduction

    > 2018-02: Joined LINE Corporation as a Data Scientist
    > Responsible for data analysis and development of services for corporations
    > Takahiro Yoshinaga, Ph.D. (Science)
  3. OASIS: Spark Application in LINE Corporation

    - Web application in a notebook format
    - Create / Execute query
    - Visualize and Share easily
    - Access to R and Python (SparkR, PySpark)
    - Extension
    - Create UDF to make data analysis convenient
    - Add stand-alone JAR in spark-submit options
  4. Extension Is Convenient, but It’s a Hassle…

    - Upload JAR file manually
    - Create Build Environment (Scala) in local/prod
    - Manage JAR file (Versioning) manually
  5. Our Team Uses CI / CD

    - Upload JAR file automatically
    - Create Build Environment (Scala) in Docker
    - Manage JAR file (Versioning) in GitHub
    - Drone.io
  6. As Is / To Be

    As Is:
    - Write Scala code
    - Build on local machine
    - sbt test
    - Upload JAR to HDFS
    - Check it on OASIS
    - Review & Merge on GitHub
    - Re-build and versioning
    - Upload to OASIS

    To Be:
    - Write Scala code
    - <deleted>
    - <deleted>
    - <deleted>
    - git push & check it on OASIS
    - Review & Merge on GitHub
    - <deleted>
    - <deleted>
  7. Case Studies

    Register Metrics and Automate Group By
    - Before: Ad hoc aggregation for each requirement
    - After: Automatic calculation of frequently used metrics

    Mapping
    - Before: a long, long CASE WHEN …
    - After: only one UDF
  8. Summary

    - Our team utilizes CI / CD in our development.
    - LINE has an environment in which data scientists can develop in a modern way.
    - We achieve high performance in data analysis thanks to this development workflow.
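
The two case studies on slide 7 can be illustrated with a small sketch. It is not the team's actual extension: the object name, the country-to-region table, the `Event` type, and the metric names are all hypothetical and chosen only for illustration. It uses plain Scala collections rather than Spark so that it runs on its own. The first half shows a lookup table standing in for a long CASE WHEN chain, and the second half shows metrics registered once and then calculated for every group.

```scala
// Hypothetical sketch: none of these names or values come from the talk.
object CaseStudySketch {

  // Mapping: a single lookup table replaces a long CASE WHEN chain.
  // The country codes and region labels are made-up example values.
  val regionByCountry: Map[String, String] = Map(
    "JP" -> "East Asia",
    "TW" -> "East Asia",
    "TH" -> "Southeast Asia",
    "ID" -> "Southeast Asia"
  )

  // Null-safe, because a column value handed to a UDF may be null.
  // Unknown or missing codes fall back to "Other", like an ELSE branch.
  def toRegion(country: String): String =
    Option(country).flatMap(regionByCountry.get).getOrElse("Other")

  // Illustrative record type standing in for a row of a table.
  final case class Event(country: String, amount: Double)

  // Registered metrics: each name maps to an aggregation over one group.
  val metrics: Map[String, Seq[Event] => Double] = Map(
    "events"       -> ((es: Seq[Event]) => es.size.toDouble),
    "total_amount" -> ((es: Seq[Event]) => es.map(_.amount).sum)
  )

  // Automated group-by: group events by region, then apply every
  // registered metric to each group.
  def summarize(events: Seq[Event]): Map[String, Map[String, Double]] =
    events.groupBy(e => toRegion(e.country)).map { case (region, es) =>
      region -> metrics.map { case (name, f) => name -> f(es) }
    }
}
```

In a Spark session, a function like `toRegion` could presumably be exposed to SQL with `spark.udf.register("to_region", CaseStudySketch.toRegion _)`, assuming a `SparkSession` named `spark`. Adding a new metric would then mean adding one entry to the `metrics` map instead of writing another ad hoc aggregation.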