Introduction to "Spark Connect" and Implementation of Managed Spark as a Service

Introduces the "Spark Connect" protocol and shares the implementation architecture and usage examples of an in-house managed Spark as a Service built on Spark Connect.

More Decks by LINEヤフーTech (LY Corporation Tech)

Transcript

  1. Introduction to "Spark Connect" and Implementation of Managed Spark as a Service

    Data Platform Engineering
    Mark Lee, 이승, 李承
  2. Agenda

    - Apache Spark Connect
    - Apache Spark as a Service
    - DEMO: Spark as a Service
    - DEMO: Spark Client
    - Recap
  3. Apache Spark

    Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.

    Key Features
    - Batch/streaming data
    - SQL analytics
    - Data science at scale
    - Machine learning

    https://spark.apache.org/
  4. Challenges of traditional Spark: tight coupling of Client and Driver

    - Single Point of Failure
      - Limited Fault Tolerance
      - Resource Bottlenecks
    - Network Dependency
      - Network partitions or instability disrupts execution
      - Clients Need Direct Access
      - Scheduling Restrictions
    - Difficult Multi-Tenancy
      - Cluster endpoint must be protected
  5. About Spark Connect: a decoupling layer between Client and Driver

    - Remote connectivity to Spark clusters
      - by DataFrame API and unresolved logical plans
      - via gRPC with Apache Arrow encoded row batches
    - Can be embedded everywhere
      - application servers, IDEs, notebooks and programming languages
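
The remote connectivity described on this slide can be sketched from the client side. In PySpark 3.4+, a client attaches with `SparkSession.builder.remote(...)` and an `sc://host:port/;key=value` connection string. The helper below only assembles that string. The host name and user are placeholders rather than values from the talk, and percent-encoding of parameter values is an assumption.

```python
from urllib.parse import quote


def build_remote_url(host, port=15002, **params):
    """Build a Spark Connect connection string: sc://host:port/;key=value;..."""
    url = f"sc://{host}:{port}/"
    for key, value in params.items():
        # Parameters follow the path and are separated by ';' (not '?' and '&').
        # Percent-encoding the value is an assumption made for this sketch.
        url += f";{key}={quote(str(value), safe='')}"
    return url


def connect(url):
    """Open a remote session; needs pyspark[connect] 3.4+ and a reachable server."""
    # Imported lazily so build_remote_url works without PySpark installed.
    from pyspark.sql import SparkSession

    return SparkSession.builder.remote(url).getOrCreate()


if __name__ == "__main__":
    # "spark.example.internal" and "alice" are placeholders.
    url = build_remote_url("spark.example.internal", user_id="alice")
    print(url)
    # spark = connect(url)
    # spark.range(5).filter("id % 2 = 0").show()  # plan built locally, executed on the server
```

With a reachable server, the commented lines would build a DataFrame plan on the client and send it over gRPC for execution; nothing in the client process needs a JVM.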
  6. Spark vs Spark Connect

    | | Spark | Spark Connect |
    |---|---|---|
    | client location | in cluster | outside the cluster |
    | client requirement | full Spark, cluster access and JVM-based | thin client: minimal gRPC client with DataFrame library |
    | network topology | must have direct network access to the cluster | remote access via gRPC |
    | scalability | driver + client handle all scheduling and coordination | driver runs on cluster; clients are stateless |
    | fault tolerance | a client crash loses the job | client can disconnect and reconnect |
  7. Spark vs Spark Connect (cont.)

    | | Spark | Spark Connect |
    |---|---|---|
    | language | limited to Spark-supported languages: JVM, PySpark, R | any, e.g. JavaScript, Go, C#, Rust |
    | cloud native | tight coupling and driver requirements limit flexibility | designed for cloud-native, containerized and serverless |
    | security | exposing cluster endpoints can raise security concerns | can secure gRPC channel with centralized control |
    | versioning | client and server must be tightly version-matched | client and server can evolve independently |
  8. Supported APIs

    - PySpark: DataFrame, Functions and Column
    - Scala: Dataset, functions, Column, Catalog and KeyValueGroupedDataset
    - User Defined Functions
    - Streaming APIs: DataStreamReader, DataStreamWriter, StreamingQuery and StreamingQueryListener

    https://spark.apache.org/docs/latest/spark-connect-overview.html#what-is-supported
  9. Unsupported APIs

    - Private Fields of DataFrame, Column, SparkSession, etc.
    - Execution APIs: RDD
    - Environment Manipulation APIs: SparkContext, Spark Configuration

    https://spark.apache.org/docs/latest/spark-connect-overview.html#how-spark-connect-client-applications-differ-from-classic-spark-application
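
One way to act on this list before moving existing PySpark code to Spark Connect is a quick static scan for the affected attributes. The sketch below is a hypothetical helper, not part of Spark: the attribute set is an illustrative, incomplete mapping of the categories above onto common PySpark names, and it matches on attribute names only.

```python
import ast

# Attribute names corresponding to the unsupported categories (illustrative, incomplete).
UNSUPPORTED_ATTRS = {
    "rdd": "RDD execution API",
    "sparkContext": "SparkContext",
    "_jvm": "private JVM handle",
    "_jdf": "private DataFrame field",
    "_sc": "private SparkContext field",
}


def find_unsupported(source):
    """Return sorted (line, attribute, reason) tuples for flagged attribute accesses."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Attribute) and node.attr in UNSUPPORTED_ATTRS:
            hits.append((node.lineno, node.attr, UNSUPPORTED_ATTRS[node.attr]))
    return sorted(hits)


if __name__ == "__main__":
    sample = (
        "counts = df.rdd.map(lambda r: r[0]).count()\n"
        "sc = spark.sparkContext\n"
        "ok = df.groupBy('k').count()\n"
    )
    for line, attr, reason in find_unsupported(sample):
        print(f"line {line}: .{attr} ({reason})")
```

Because the scan is purely name-based, it can flag unrelated objects that happen to expose an attribute called `rdd`; it is a starting point for review rather than a compatibility guarantee.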
  10. Use Cases

    - Extensibility
      - Spark-as-a-Service in Cloud Environment
      - Multi-language Clients: Rust, Go, JavaScript, etc.
    - Interactivity
      - Notebooks: Jupyter, VSCode
      - Integrated Development Environment
    - Minimality
      - Web UI or BI Tools
      - Mobile or Microservices
      - MCP servers
  12. Requirements

    - Integrate with in-house platforms
      - Access Control
    - Customize Spark client per user/session
      - Python/R/Scala libraries
      - client & driver specs: CPU, Memory, GPU, etc.
    - Customize Spark itself
      - in-house patches
      - fine-tune Spark properties
      - remote shuffle services
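
The per-user/session customization above could be modelled as a small request object that is translated into Spark properties when a driver is launched. This is a minimal sketch under assumptions: `SessionSpec` and its field names are hypothetical, the talk does not show its actual schema, and only standard Spark property names are emitted.

```python
from dataclasses import dataclass, field


@dataclass
class SessionSpec:
    """Hypothetical per-user session request; field names are illustrative."""
    user: str
    namespace: str
    image: str                 # container image carrying the custom Spark build and libraries
    driver_cores: int = 1
    driver_memory: str = "2g"
    executor_instances: int = 2
    executor_memory: str = "4g"
    executor_gpus: int = 0
    extra_conf: dict = field(default_factory=dict)  # fine-tuned Spark properties


def to_spark_conf(spec):
    """Translate a SessionSpec into standard Spark properties."""
    conf = {
        "spark.kubernetes.namespace": spec.namespace,
        "spark.kubernetes.container.image": spec.image,
        "spark.driver.cores": str(spec.driver_cores),
        "spark.driver.memory": spec.driver_memory,
        "spark.executor.instances": str(spec.executor_instances),
        "spark.executor.memory": spec.executor_memory,
    }
    if spec.executor_gpus > 0:
        # A real deployment also needs GPU vendor and discovery settings.
        conf["spark.executor.resource.gpu.amount"] = str(spec.executor_gpus)
    conf.update(spec.extra_conf)  # user overrides win over the defaults above
    return conf


def to_submit_args(conf):
    """Render properties as repeated --conf key=value arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args
```

Keeping the request as data makes it straightforward to validate against quotas before any driver is started, and the rendered `--conf` list can be handed to whichever launcher starts the Spark Connect server.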
  13. Spark Connect vs Alternatives

    | | Spark Connect | Livy | Kyuubi |
    |---|---|---|---|
    | abstraction | low | medium | high |
    | protocol | gRPC | REST | JDBC/ODBC |
    | language | any | Python, Scala, Java, R | SQL |
    | Spark version | 3.4+ | 2.x – 3.x (standalone) | 2.x – 3.x (integrated) |
  14. Spark Multitenancy

    | | Spark Connect | Livy | Kyuubi |
    |---|---|---|---|
    | Built-in | No | Partial | Yes |
    | Session Isolation | Yes | Yes | Yes |
    | Session Management | No | Basic | Full |
    | User Authentication | Limited | Pluggable | Strong |
    | Resource Quota | No | No | Yes |
  15. Spark Extensibility

    | | Spark Connect | Livy | Kyuubi |
    |---|---|---|---|
    | Language | Any | JVM, Python, R, SQL | SQL |
    | Custom Spark | Yes | Limited | No |
    | Custom Session | Yes | Limited | No |
    | Custom User Authentication | Yes | Limited | No |
    | Custom Resource Quota | Yes | Limited | No |
  17. Requirement 1: Runtime

    - Serve Spark Driver instance with Spark Connect Addon
    - Support Spark Credentials: Kerberos keytab
    - Support resource quotas
    - Support role management
    - Support audit logging
    - Horizontally Scalable

    ➔ Kubernetes with multiple namespaces + Envoy ADS (CDS, EDS)
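
To make the Envoy side of this runtime concrete: with ADS, a control plane can publish one CDS cluster per Spark driver and one EDS assignment pointing at that driver's pod. The talk does not show its control plane, so the sketch below is an assumption throughout. It only builds the two resources as plain dictionaries using Envoy v3 field names, and the naming scheme and default port 15002 are illustrative choices.

```python
def cluster_name(namespace, session_id):
    """Hypothetical naming scheme: one Envoy cluster per driver session."""
    return f"spark-connect.{namespace}.{session_id}"


def make_cluster(namespace, session_id):
    """CDS resource: an upstream cluster whose endpoints are resolved via EDS over ADS."""
    return {
        "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
        "name": cluster_name(namespace, session_id),
        "type": "EDS",
        "connect_timeout": "5s",
        "eds_cluster_config": {"eds_config": {"ads": {}, "resource_api_version": "V3"}},
        # gRPC needs HTTP/2 towards the upstream driver.
        "typed_extension_protocol_options": {
            "envoy.extensions.upstreams.http.v3.HttpProtocolOptions": {
                "@type": "type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions",
                "explicit_http_config": {"http2_protocol_options": {}},
            }
        },
    }


def make_endpoints(namespace, session_id, pod_ip, port=15002):
    """EDS resource: the driver pod's address for the cluster above."""
    return {
        "@type": "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment",
        "cluster_name": cluster_name(namespace, session_id),
        "endpoints": [{
            "lb_endpoints": [{
                "endpoint": {
                    "address": {"socket_address": {"address": pod_ip, "port_value": port}}
                }
            }]
        }],
    }
```

Publishing these two resources when a driver pod becomes ready, and withdrawing them when it stops, is what would let the proxy tier scale horizontally without static configuration.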
  19. Requirement 2: Auth

    While Spark Connect does not have built-in authentication, it is designed to work seamlessly with your existing authentication infrastructure. Its gRPC HTTP/2 interface allows for the use of authenticating proxies, which makes it possible to secure Spark Connect without having to implement authentication logic in Spark directly.

    https://spark.apache.org/docs/latest/spark-connect-overview.html#what-is-supported

    ➔ Envoy xDS ExtAuthz filter
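
The decision an authorization service behind an ExtAuthz filter has to make can be sketched as a pure function. This assumes clients present a bearer token in the `authorization` header and that the service answers with an HTTP status, where 200 allows the request and 401 or 403 denies it. The token store and the way the target session is identified are hypothetical; the talk does not describe them.

```python
import hmac


def check(headers, session_tokens, session_id):
    """Return an HTTP status: 200 allows the request, 401/403 denies it.

    headers        -- request headers with lower-cased names
    session_tokens -- hypothetical store mapping session id -> issued token
    session_id     -- the driver session the request is routed to
    """
    auth = headers.get("authorization", "")
    scheme, _, presented = auth.partition(" ")
    if scheme.lower() != "bearer" or not presented:
        return 401  # no usable credential
    expected = session_tokens.get(session_id)
    if expected is None:
        return 403  # unknown session
    # Constant-time comparison avoids leaking token prefixes through timing.
    if not hmac.compare_digest(presented.encode(), expected.encode()):
        return 403
    return 200
```

Keeping the check outside Spark matches the quoted guidance: the driver never sees unauthenticated traffic, and the same policy applies to every client language.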
  21. Requirement 3: Operation

    - Self-manage Spark Driver instances
    - Self-manage credentials to access Spark Driver instances

    ➔ Basic Website
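
The two self-service operations above, managing driver instances and managing the credentials that reach them, could sit behind a very small backend. The class below is an in-memory stand-in written for illustration. Its name, methods, and ownership rule are assumptions, and a real backend would also start and stop driver pods.

```python
import secrets


class SessionRegistry:
    """In-memory stand-in for the self-service site's backend (hypothetical)."""

    def __init__(self):
        self._sessions = {}  # session id -> {"owner": ..., "token": ...}

    def create(self, owner):
        """Register a new driver session and issue its access token."""
        session_id = secrets.token_hex(8)
        token = secrets.token_urlsafe(32)
        self._sessions[session_id] = {"owner": owner, "token": token}
        return session_id, token

    def rotate(self, session_id, owner):
        """Replace the token; only the owner may do so."""
        session = self._sessions[session_id]
        if session["owner"] != owner:
            raise PermissionError("only the owner can rotate credentials")
        session["token"] = secrets.token_urlsafe(32)
        return session["token"]

    def delete(self, session_id, owner):
        """Remove the session; a real backend would also stop the driver pod."""
        if self._sessions[session_id]["owner"] != owner:
            raise PermissionError("only the owner can delete the session")
        del self._sessions[session_id]

    def list_for(self, owner):
        """Return the owner's session ids in sorted order."""
        return sorted(sid for sid, s in self._sessions.items() if s["owner"] == owner)
```

A website in front of such a registry only needs create, rotate, delete, and list actions per user, which is consistent with the slide's description of it as basic.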
  22. IDE

  23. MCP

  25. Spark Connect

    - Spark Connect is a gRPC-based client-server protocol that enables remote, interactive access to Apache Spark using the native DataFrame and SQL APIs.
    - Spark Connect can be used with an external job tracker, job scheduler and user/session management.