Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Chaudhri at Big Data Spain 2017

In this presentation, attendees will see how to speed up existing Hadoop and Spark deployments by just making Apache Ignite responsible for RAM utilization. No code modifications, no new architecture from scratch!

https://www.bigdataspain.org/2017/talk/boost-hadoop-and-spark-with-in-memory-technologies

Big Data Spain 2017
November 16th - 17th Kinépolis Madrid

Big Data Spain

December 05, 2017

Transcript

  1. Agenda
     • Introduction to Apache Ignite
     • Hadoop Acceleration
     • Spark Acceleration
     • Demos
     • Q&A
  2. Apache Ignite in one slide
     • Memory-centric platform
       – that is strongly consistent
       – and highly-available
       – with powerful SQL
       – key-value and processing APIs
     • Designed for
       – Performance
       – Scalability
  3. Apache Ignite
     • Data source agnostic
     • Fully fledged compute engine and durable storage
     • OLAP and OLTP
     • Fully ACID transactions across memory and disk
     • In-memory SQL support
     • Early ML libraries
     • Growing community
  4. Hadoop Acceleration
     • In-memory Hadoop Execution
     • Alternative job tracker
       – Faster MapReduce
     • Built on Ignite File System (IGFS)
     • Secondary File System
       – Read-through and Write-through
  5. Ignite In-Memory File System
     • Distributed in-memory file system
     • Implements HDFS API
     • Can be transparently plugged into Hadoop or Spark deployments
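
A minimal sketch of what "plugging in" IGFS might look like from client code, assuming the Ignite Hadoop accelerator jars are on the classpath. The two property names and class names follow the Ignite Hadoop accelerator conventions of that era. The IGFS name, host, and object name in the code are hypothetical and not taken from the talk.

```scala
import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object IgfsConfigSketch {
  // Maps the igfs:// scheme to Ignite's Hadoop file system implementations.
  def igfsConf(): Configuration = {
    val conf = new Configuration()
    conf.set("fs.igfs.impl",
      "org.apache.ignite.hadoop.fs.v1.IgniteHadoopFileSystem")
    conf.set("fs.AbstractFileSystem.igfs.impl",
      "org.apache.ignite.hadoop.fs.v2.IgniteHadoopFileSystem")
    conf
  }

  def main(args: Array[String]): Unit = {
    // Assumption: an IGFS instance named "igfs" is reachable on localhost.
    val fs = FileSystem.get(new URI("igfs://igfs@localhost/"), igfsConf())
    // Existing code keeps using the ordinary Hadoop FileSystem API.
    println(fs.exists(new Path("/")))
  }
}
```

In a real deployment the same two properties would normally live in core-site.xml rather than in code, which is what lets existing jobs run unmodified.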
  6. MapReduce
     • Parallelize processing of data in HDFS
     • Eliminate Hadoop JobTracker and TaskTracker overhead
     • Low-Latency distributed processing
     • Minimal configuration change
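
The "minimal configuration change" above can be sketched as two Hadoop properties that route job submission to Ignite instead of the standard job tracker. This is an illustration under assumptions: the address `localhost:11211` assumes an Ignite node on the local machine with its default connector port, and the object and job names are hypothetical.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job

object IgniteMapReduceConfigSketch {
  // Returns a configuration that selects Ignite as the MapReduce framework.
  def igniteJobConf(): Configuration = {
    val conf = new Configuration()
    conf.set("mapreduce.framework.name", "ignite")
    // Assumption: an Ignite node listens on the default connector port.
    conf.set("mapreduce.jobtracker.address", "localhost:11211")
    conf
  }

  def main(args: Array[String]): Unit = {
    // The job itself is defined as usual; only the configuration differs.
    val job = Job.getInstance(igniteJobConf(), "hypothetical-job")
    println(job.getConfiguration.get("mapreduce.framework.name"))
  }
}
```

As with the file system settings, these values would usually be placed in mapred-site.xml so that existing job code stays untouched.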
  7. Spark Acceleration
     • Long running applications
       – Passing state between jobs
     • Disk File System
       – Convert RDDs to disk files and back
     • Share RDDs in-memory
       – Native Spark API
       – Native Spark transformations
  8. Ignite for Spark
     • Spark RDD abstraction
     • Shared in-memory view on data across different Spark jobs, workers or applications
     • Implemented as a view over a distributed Ignite cache
  9. IgniteContext
     • Main entry-point to Spark-Ignite integration
     • SparkContext plus either one of
       – IgniteConfiguration()
       – Path to XML configuration file
     • Optional Boolean client argument
       – true => Shared deployment
       – false => Embedded deployment
  10. IgniteContext examples

      val igniteContext = new IgniteContext(sparkContext, () => new IgniteConfiguration())

      val igniteContext = new IgniteContext(sparkContext, "examples/config/spark/example-shared-rdd.xml")
  11. IgniteRDD
      • Implementation of Spark RDD representing a live view of an Ignite cache
      • Mutable (unlike native RDDs)
        – All changes in Ignite cache will be visible to RDD users immediately
      • Provides partitioning information to Spark executor
      • Provides affinity information to Spark so that RDD computations can use data locality
  12. Write to Ignite
      • Ignite caches operate on key-value pairs
      • Spark tuple RDD for key-value pairs and savePairs method
        – RDD partitioning, store values in parallel if possible
      • Value-only RDD and saveValues method
        – IgniteRDD generates a unique affinity-local key for each value stored into the cache
  13. Write code example

      val conf = new SparkConf().setAppName("SparkIgniteWriter")
      val sc = new SparkContext(conf)
      val ic = new IgniteContext(sc, "examples/config/spark/example-shared-rdd.xml")
      val sharedRDD: IgniteRDD[Int, Int] = ic.fromCache("sharedRDD")
      sharedRDD.savePairs(sc.parallelize(1 to 100000, 10).map(i => (i, i)))
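
The slides show only the savePairs path. A comparable sketch for the value-only saveValues method described on slide 12 might look like the following. The cache name `sharedValues` and the application name are hypothetical, and the example assumes that cache is defined in the XML configuration or can be created on demand. The key type is left as `AnyRef` because Ignite generates the keys itself.

```scala
import org.apache.ignite.spark.{IgniteContext, IgniteRDD}
import org.apache.spark.{SparkConf, SparkContext}

object SparkIgniteValueWriter {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SparkIgniteValueWriter")
    val sc = new SparkContext(conf)
    val ic = new IgniteContext(sc, "examples/config/spark/example-shared-rdd.xml")

    // Hypothetical cache; keys are generated by Ignite, so only the
    // value type (Int) matters to the caller.
    val valuesRDD: IgniteRDD[AnyRef, Int] = ic.fromCache("sharedValues")

    // Store a plain RDD[Int]; no (key, value) tuples are needed.
    valuesRDD.saveValues(sc.parallelize(1 to 1000, 10))

    println("Stored values: " + valuesRDD.count())
    sc.stop()
  }
}
```

Because the keys are opaque, this form suits data that is later scanned or filtered by value rather than looked up by key.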
  14. Read from Ignite
      • IgniteRDD is a live view of an Ignite cache
        – No need to explicitly load data to Spark application from Ignite
        – All RDD methods are available to use right away after an instance of IgniteRDD is created
  15. Read code example

      val conf = new SparkConf().setAppName("SparkIgniteReader")
      val sc = new SparkContext(conf)
      val ic = new IgniteContext(sc, "examples/config/spark/example-shared-rdd.xml")
      val sharedRDD: IgniteRDD[Int, Int] = ic.fromCache("sharedRDD")
      val greaterThanFiftyThousand = sharedRDD.filter(_._2 > 50000)
      println("The count is " + greaterThanFiftyThousand.count())
  16. Any Questions?
      Thank you for joining us. Follow the conversation.
      http://ignite.apache.org