Introduction to "Spark Connect" and Implementation of Managed Spark as a Service

Data Platform Engineering
Mark Lee, 이승, 李承

Agenda
- Apache Spark Connect
- Apache Spark as a Service
- DEMO: Spark as a Service
- DEMO: Spark Client
- Recap

Spark Connect

Apache Spark
Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
Key Features
- Batch/streaming data
- SQL analytics
- Data science at scale
- Machine learning
https://spark.apache.org/

Spark Cluster https://spark.apache.org/docs/latest/cluster-overview.html

Challenges of traditional Spark
Tight coupling of Client and Driver
- Single Point of Failure
  - Limited Fault Tolerance
  - Resource Bottlenecks
- Network Dependency
  - Network partitions or instability disrupt execution
  - Clients Need Direct Access
  - Scheduling Restrictions
- Difficult Multi-Tenancy
  - Cluster endpoint must be protected

About Spark Connect
Decoupling layer between Client and Driver
- Remote connectivity to Spark clusters
  - by DataFrame API and unresolved logical plans
  - via gRPC with Apache Arrow encoded row batches
- Can be embedded everywhere
  - application servers, IDEs, notebooks and programming languages
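
As a rough illustration of the remote connectivity described above, a thin client only needs the address of the Spark Connect endpoint, which is written as a connection string of the form `sc://host:port/;param=value`. The sketch below builds such a string with the Python standard library only. The host name and token value are made-up placeholders, and `build_connect_url` is a hypothetical helper, not part of PySpark.

```python
from urllib.parse import quote


def build_connect_url(host, port=15002, params=None):
    """Build a Spark Connect connection string: sc://host:port/;key=value;...

    15002 is the default Spark Connect server port. `params` holds optional
    connection parameters such as `use_ssl`, `token` or `user_id`.
    """
    url = f"sc://{host}:{port}/"
    for key, value in (params or {}).items():
        # Percent-encode keys and values so ';' or '=' cannot break the string.
        url += f";{quote(str(key), safe='')}={quote(str(value), safe='')}"
    return url


# Placeholder host and token, for illustration only.
url = build_connect_url(
    "spark.example.internal",
    params={"use_ssl": "true", "token": "abc123"},
)
print(url)

# With PySpark installed, the string would typically be used as:
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.remote(url).getOrCreate()
```

The client never talks to executors or the cluster manager directly; everything it needs is reachable through this single gRPC endpoint.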

Spark Connect architecture https://spark.apache.org/docs/latest/spark-connect-overview.html

Spark Connect Workflow https://spark.apache.org/docs/latest/spark-connect-overview.html

Spark vs Spark Connect

| | Spark | Spark Connect |
|---|---|---|
| client location | in cluster | out of cluster |
| client requirement | full Spark, cluster access, and JVM-based | thin client: minimal gRPC client with DataFrame library |
| network topology | must have direct network access to the cluster | remote access via gRPC |
| scalability | driver + client handles all scheduling and coordination | driver runs on cluster; clients are stateless |
| fault tolerance | client crash causes job loss | client can disconnect and reconnect |

Spark vs Spark Connect (cont.)

| | Spark | Spark Connect |
|---|---|---|
| language | limited to Spark-supported languages: JVM, PySpark, R | any, e.g. JavaScript, Go, C#, Rust |
| cloud native | tight coupling and driver requirements limit flexibility | designed for cloud-native, containerized and serverless |
| security | exposing cluster endpoints can raise security concerns | can secure gRPC channel with centralized control |
| versioning | client and server must be tightly version-matched | client and server can evolve independently |

Supported APIs
- PySpark: DataFrame, Functions and Column
- Scala: Dataset, functions, Column, Catalog and KeyValueGroupedDataset
- User Defined Functions
- Streaming APIs: DataStreamReader, DataStreamWriter, StreamingQuery and StreamingQueryListener
https://spark.apache.org/docs/latest/spark-connect-overview.html#what-is-supported

Unsupported APIs
- Private Fields of DataFrame, Column, SparkSession, etc.
- Execution APIs: RDD
- Environment Manipulation APIs: SparkContext, Spark Configuration
https://spark.apache.org/docs/latest/spark-connect-overview.html#how-spark-connect-client-applications-differ-from-classic-spark-application

Use Cases
- Extensibility
  - Spark-as-a-Service in Cloud Environment
  - Multi-language Clients: Rust, Go, JavaScript, etc.
- Interactivity
  - Notebooks: Jupyter, VSCode
  - Integrated Development Environment
- Minimality
  - Web UI or BI Tools
  - Mobile or Microservices
  - MCP servers

Spark as a Service
powered by Spark Connect

Requirements
- Integrate with in-house platforms
  - Access Control
- Customize Spark client per user/session
  - Python/R/Scala libraries
  - client & driver specs: CPU, Memory, GPU, etc.
- Customize Spark itself
  - in-house patches
  - fine-tune Spark properties
  - remote shuffle services

Spark Connect vs Alternatives

| | Spark Connect | Livy | Kyuubi |
|---|---|---|---|
| abstraction | low | medium | high |
| protocol | gRPC | REST | JDBC/ODBC |
| language | any | Python, Scala, Java, R | SQL |
| Spark version | 3.4+ | 2.x – 3.x (standalone) | 2.x – 3.x (integrated) |

Spark Multitenancy

| | Spark Connect | Livy | Kyuubi |
|---|---|---|---|
| Built-in | No | Partial | Yes |
| Session Isolation | Yes | Yes | Yes |
| Session Management | No | Basic | Full |
| User Authentication | Limited | Pluggable | Strong |
| Resource Quota | No | No | Yes |

Spark Extensibility

| | Spark Connect | Livy | Kyuubi |
|---|---|---|---|
| Language | Any | JVM, Python, R, SQL | SQL |
| Custom Spark | Yes | Limited | No |
| Custom Session | Yes | Limited | No |
| Custom User Authentication | Yes | Limited | No |
| Custom Resource Quota | Yes | Limited | No |

Requirement 1: Runtime
- Serve Spark Driver instance with Spark Connect Addon
- Support Spark Credentials: Kerberos keytab
- Support resource quotas
- Support role management
- Support audit logging
- Horizontally Scalable
➔ Kubernetes with multiple namespaces + Envoy ADS (CDS, EDS)
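
To make the Envoy ADS (CDS, EDS) choice above a little more concrete, the following sketch shows one way a control plane could turn a list of running driver instances into cluster and endpoint entries. It is only an illustration under several assumptions: one Envoy cluster per driver, one Kubernetes namespace per tenant, and plain dictionaries with simplified field names rather than the real xDS wire format. `Driver` and `build_snapshot` are hypothetical names, not part of Envoy or of the platform described in this deck.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Driver:
    namespace: str      # Kubernetes namespace (assumed: one per tenant)
    name: str           # driver instance name
    ip: str             # pod IP of the driver
    port: int = 15002   # default Spark Connect port


def build_snapshot(drivers):
    """Return simplified CDS-like and EDS-like entries for the given drivers.

    Assumption: each driver becomes its own cluster, so a proxy can route
    one user session to exactly one driver.
    """
    clusters, endpoints = [], []
    # Sort so that the same set of drivers always yields the same snapshot.
    for d in sorted(drivers, key=lambda d: (d.namespace, d.name)):
        cluster_name = f"{d.namespace}/{d.name}"
        # Spark Connect speaks gRPC, so the upstream must use HTTP/2.
        clusters.append({"name": cluster_name, "type": "EDS", "http2": True})
        endpoints.append(
            {"cluster_name": cluster_name, "address": d.ip, "port": d.port}
        )
    return {"clusters": clusters, "endpoints": endpoints}


snapshot = build_snapshot([
    Driver("team-b", "driver-1", "10.0.1.7"),
    Driver("team-a", "driver-9", "10.0.0.3"),
])
print([c["name"] for c in snapshot["clusters"]])
```

Rebuilding the snapshot whenever a driver starts or stops is what would let the proxy layer pick up new instances without a restart, which matches the horizontal scalability requirement.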

Requirement 2: Auth
While Spark Connect does not have built-in authentication, it is designed to work seamlessly with your existing authentication infrastructure. Its gRPC HTTP/2 interface allows for the use of authenticating proxies, which makes it possible to secure Spark Connect without having to implement authentication logic in Spark directly.
https://spark.apache.org/docs/latest/spark-connect-overview.html#what-is-supported
➔ Envoy xDS ExtAuthz filter
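
The authenticating-proxy idea above can be sketched as a small decision function of the kind an external authorization service might run for each request the proxy forwards to it. This is a minimal sketch, not the service used in this deck: the token table, the `x-user-id` header and the `authorize` function are hypothetical. It assumes the client presents a bearer token in the `authorization` header, and it follows the convention of Envoy's HTTP ext_authz service in which a 200 response allows the request and any other status denies it.

```python
import hmac

# Hypothetical token table; a real deployment would consult an identity
# provider or a credential store instead of a hard-coded dictionary.
TOKENS = {"s3cr3t-token-for-alice": "alice"}


def authorize(headers):
    """Decide whether one request may reach the Spark driver.

    Returns (http_status, headers_to_add): 200 allows, 401 means missing or
    malformed credentials, 403 means an unknown token.
    """
    auth = headers.get("authorization", "")
    scheme, _, token = auth.partition(" ")
    if scheme.lower() != "bearer" or not token:
        return 401, {}
    for known, user in TOKENS.items():
        # Constant-time comparison avoids leaking token prefixes via timing.
        if hmac.compare_digest(known.encode(), token.encode()):
            # Hypothetical header telling the upstream who the caller is.
            return 200, {"x-user-id": user}
    return 403, {}


print(authorize({"authorization": "Bearer s3cr3t-token-for-alice"}))
print(authorize({}))
```

Keeping this logic outside Spark means credentials can be rotated or revoked without touching the driver.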

Envoy xDS

Requirement 3: Operation
- Self-manage Spark Driver instances
- Self-manage credentials to access Spark Driver instances
➔ Basic Website

Management Console

Hibana Architecture

DEMO: Spark as a Service

Hibana

DEMO: Spark Client

Jupyter

Spark Connect
- Spark Connect is a gRPC-based client-server protocol that enables remote, interactive access to Apache Spark using the native DataFrame and SQL APIs.
- Spark Connect can be used with an external job tracker, job scheduler and user/session management.
