Spark Summit 2017 SF - Spark on Kubernetes

Apache Spark on Kubernetes
Anirudh Ramanathan, Software Engineer (Kubernetes)
Timothy Chen, Co-founder & CTO (HyperPilot)

Agenda
• Kubernetes & Containers
• Motivation
• Design
• Demo
• Deep Dive
• Roadmap

What is Kubernetes?

Kubernetes
Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications.

‘Containerized’

Containers
[Diagram: several apps, each packaged with its own libs, running on a shared kernel]
• Repeatable Builds and Workflows
• Application Portability
• High Degree of Control over Software
• Faster Development Cycle
• Reduced dev-ops load
• Improved Infrastructure Utilization

Kubernetes
• Large OSS Community - 1200+ contributors and 45k+ commits
• Ecosystem and Partners - 100+ organizations involved
• One of the top 100 projects overall on GitHub - 23k+ stars

Overview

At a Glance
[Diagram: users reach the master through the UI, CLI, and API; the master runs the apiserver, scheduler, controllers, and etcd; each node runs a kubelet]

Nodes and Pods
[Diagram: a Node hosting two Pods; each Pod holds Containers listening on ports 80 and 8080 and a Volume]
• A pod is a set of co-located containers
• Created by a declarative specification supplied to the master
• Each pod has its own IP address
• Volumes can be local or network-attached
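
To make the "declarative specification" bullet concrete, here is a minimal sketch that builds a Pod manifest as a plain Python dict and prints it as JSON. The pod name, container names, image names, and mount path are hypothetical placeholders, not taken from the talk; the field names follow the core v1 Pod schema. The two containers mirror the ports 80 and 8080 shown in the diagram and share one local (emptyDir) volume.

```python
import json


def make_pod(name, containers, volume_name="shared-data"):
    """Build a minimal Pod manifest as a plain dict.

    containers: list of (container_name, image, port) tuples.
    All names and images passed in are illustrative placeholders.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # One node-local scratch volume shared by every container in the pod.
            "volumes": [{"name": volume_name, "emptyDir": {}}],
            "containers": [
                {
                    "name": cname,
                    "image": image,
                    "ports": [{"containerPort": port}],
                    "volumeMounts": [
                        {"name": volume_name, "mountPath": "/data"}
                    ],
                }
                for cname, image, port in containers
            ],
        },
    }


pod = make_pod(
    "example-pod",
    [
        ("web", "example/web:1.0", 80),           # hypothetical image
        ("sidecar", "example/sidecar:1.0", 8080),  # hypothetical image
    ],
)
print(json.dumps(pod, indent=2))
```

The printed JSON is the kind of document a user would hand to the master; the cluster, not the user, then decides which node runs the pod.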

Motivation

Why Spark on Kubernetes?
• Resource sharing between batch, serving and stateful workloads
  – Streamlined developer experience
  – Reduced operational costs
  – Improved infrastructure utilization
• Kubernetes and the Container Ecosystem
  – Lots of add-on services: third-party logging, monitoring, and security tools
  – For example, the Istio project, announced May 24, 2017 by IBM, Google and Lyft, provides a service mesh for authenticating, authorizing, tracing, timing, and rate-limiting container-to-container communication, and more.

Design

Spark, meet Kubernetes!
[Diagram: GraphX, SparkSQL, MLlib, and Streaming sit on Spark Core, which runs on Kubernetes alongside Standalone, YARN, and Mesos]

Spark, meet Kubernetes!
[Diagram: Spark Core drives the Kubernetes Scheduler Backend, which sends "new executors", "remove executors", and "configuration" to the Kubernetes Cluster]
• Resource Requests
• Authnz
• Communication with K8s
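
The "new executors" and "remove executors" arrows can be read as a reconciliation step: compare the number of executors Spark wants with the ones currently running, then ask the cluster for the difference. The function below is only an illustrative sketch of that idea, not the scheduler backend's actual code; the function name, the executor ids, and the "remove the newest first" policy are all assumptions made for the example.

```python
def reconcile(desired, running):
    """Return (to_create, to_remove) so the running set matches `desired`.

    desired: target number of executors.
    running: list of currently running executor ids, oldest first.
    """
    if desired < 0:
        raise ValueError("desired must be non-negative")
    if desired > len(running):
        # Scale up: request the missing executors, remove none.
        return desired - len(running), []
    # Scale down (or no change): keep the oldest `desired` executors.
    # Removing the newest first is an arbitrary policy chosen for this sketch.
    return 0, running[desired:]


# Hypothetical executor ids, for illustration only.
print(reconcile(3, ["exec-1"]))
print(reconcile(1, ["exec-1", "exec-2", "exec-3"]))
```

In a real backend the two results would turn into pod create and delete requests against the Kubernetes API; here they stay as plain values so the logic is easy to inspect.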

Kubernetes, meet Spark!
[Diagram: a Kubernetes Cluster with a File Staging Server, a Shuffle Service, and a SparkJob API Endpoint]
• Staging server: component to stage local files
• Spark Shuffle service: component to store shuffle data for dynamic allocation
• ThirdParty/CustomResources: extend Kubernetes API with Spark Knowledge

Kubernetes Integration
Dependencies
• Container images with dependencies baked in
• Files from GCS/S3/HDFS/HTTP
• Staged files and JARs (File Staging Server)
Several ways of running Spark Jobs along with their dependencies on Kubernetes
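
One way to picture the three dependency routes is to sort a dependency URI by its scheme. The sketch below is an assumption-laden illustration, not the project's resolution logic: the category names and the scheme-to-category mapping are made up for this example. Spark does use the `local://` scheme for files already present inside the container image.

```python
from urllib.parse import urlparse

# Assumed mapping from the slide's "GCS/S3/HDFS/HTTP" to URI schemes.
REMOTE_SCHEMES = {"gs", "s3", "s3a", "hdfs", "http", "https"}


def classify_dependency(uri):
    """Classify a dependency URI into one of the three routes on the slide."""
    scheme = urlparse(uri).scheme
    if scheme == "local":
        # File already exists inside the container image.
        return "baked-in"
    if scheme in REMOTE_SCHEMES:
        # Fetched from remote storage when the job runs.
        return "remote"
    if scheme in ("", "file"):
        # Lives on the submitting machine, so it has to be staged first.
        return "staged"
    raise ValueError(f"unrecognized scheme: {scheme!r}")


# Hypothetical paths, for illustration only.
for dep in [
    "local:///opt/spark/jars/app.jar",
    "hdfs://namenode/libs/dep.jar",
    "/home/me/build/app.jar",
]:
    print(dep, "->", classify_dependency(dep))
```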

Administration
[Diagram labels: Namespaces, Resource Accounting, Logging, Monitoring, Resource Quota, Pluggable Authorization, Admission Control, RBAC]
• Launch Spark Jobs as a particular user into a specific namespace
• RBAC and Namespace-level resource quotas
• Audit logging for clusters
• Several monitoring solutions to see node, cluster and pod-level statistics
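
As a rough illustration of launching a job into a specific namespace, the sketch below assembles a `spark-submit` argument list in Python. The API server address, namespace, and jar path are hypothetical placeholders. The `k8s://` master prefix, `--deploy-mode cluster`, `spark.kubernetes.namespace`, and `spark.executor.instances` are taken to match the project's documentation, but option names have changed between releases, so check the docs for the version in use. A real submission also needs container image settings, which are omitted here for that reason.

```python
import shlex


def build_submit_command(api_server, namespace, main_class, app_jar, executors=2):
    """Assemble a spark-submit argument list for a cluster-mode job.

    Container image settings are deliberately left out: their config keys
    differ between releases and should be taken from the matching docs.
    """
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--class", main_class,
        "--conf", f"spark.kubernetes.namespace={namespace}",
        "--conf", f"spark.executor.instances={executors}",
        app_jar,
    ]


cmd = build_submit_command(
    "https://k8s.example.com:443",  # hypothetical API server address
    "analytics",                    # hypothetical namespace
    "org.apache.spark.examples.SparkPi",
    "local:///opt/spark/examples/jars/spark-examples.jar",  # hypothetical path
)
# Print a copy-pasteable shell command.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Because the namespace is just a configuration value on the submission, namespace-level quotas and RBAC rules apply to the driver and executor pods like they would to any other workload.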

Focus Areas
[Figure: word cloud of the command-line options we added to spark-submit on Kubernetes]

Deep Dive

Deep Dive
[Diagram: spark-submit talks to the apiserver of the Kubernetes cluster; the scheduler places the driver pod; the Spark driver creates executor pods; the scheduler places the executors]
• Spark Submit submits job to K8s
• K8s schedules the driver for job
• Driver requests executors as needed
• Executors scheduled and created
• Executors run tasks
• Driver “completes” job and persists logs

Roadmap

Spark Roadmap
Timeline:
• Nov 2016: Design
• Dec 2016: Development Began
• Mar 2017: Alpha Release
• June 2017: Beta Release
Features: Cluster Mode, Java/Scala Support, Dynamic Allocation, Local File Staging, High Availability, Client Mode, Python/R support, Spark Shell, Spark Streaming, Spark SQL, GraphX, MLlib
[Legend markers on the slide: supported but untested; not yet supported]

We’re just getting started...
• Kubernetes CustomResources
• Priorities and Preemption for Pods
• Batch Scheduling and Resource Sharing
• Cluster Federation and Multi-cloud deployments
• Ecosystem: Kafka, Cassandra, HDFS, etc.

Contributors
Organizations (alphabetically):
• Google
• Haiwen
• Hyperpilot
• Intel
• Palantir
• Pepperdata
• Red Hat
Links:
• Spark 2.2.0 Documentation
• https://issues.apache.org/jira/browse/SPARK-18278
• https://github.com/kubernetes/kubernetes/issues/34377

Try it out today! https://github.com/apache-spark-on-k8s/spark

Thank You.
• HDFS on Kubernetes - Lessons Learned: June 7 at 11:00 AM in Room 2003
• Join us Wednesdays at 10am PT at the SIG BigData meeting: https://github.com/kubernetes/community/
