[OracleCode SF] In-memory Analytics with Spark and Hazelcast

Apache Spark is a distributed computation framework optimized to work in-memory, and heavily influenced by concepts from functional programming languages.

Hazelcast, an open-source in-memory data grid capable of amazing feats of scale, provides a wide range of distributed computing primitives, including ExecutorService, MapReduce (M/R), and Aggregations frameworks.

The nature of data exploration and analysis requires that data scientists be able to ask questions that weren't planned in advance, and get an answer fast!

In this talk, Viktor will explore Spark and see how it works together with Hazelcast to provide a robust in-memory open-source big data analytics solution!

Viktor Gamov

March 01, 2017

Transcript

@gamussa @hazelcast #oraclecode IN-MEMORY ANALYTICS with APACHE SPARK and HAZELCAST
Who am I? Solutions Architect, Developer Advocate. @gamussa in internetz. Please, follow me on Twitter, I’m very interesting ©
What’s Apache Spark? Lightning-Fast Cluster Computing
Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
When to use Spark? Data Science Tasks: when questions are unknown. Data Processing Tasks: when you have too much data. You’re tired of Hadoop.
Spark Architecture
RDD
Resilient Distributed Datasets (RDD) are the primary abstraction in Spark: a fault-tolerant collection of elements that can be operated on in parallel
RDD Operations
Operations on RDDs: transformations and actions
Transformations are lazy (not computed immediately). By default, the transformed RDD is recomputed each time an action is run on it.
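The lazy-evaluation behaviour described on this slide can be illustrated without a Spark cluster. The sketch below is an analogy only: it uses plain `java.util.stream`, not Spark, and the class name `LazyPipelineDemo` and its method are hypothetical. It assumes that a `Supplier<Stream<…>>` is a fair stand-in for an uncached RDD lineage: defining the `map` step runs nothing, and each terminal operation (the analogue of an action) re-runs the whole pipeline.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyPipelineDemo {

    /**
     * Returns how many times the map function had run at three checkpoints:
     * after defining the pipeline, after the first "action", after the second.
     */
    static List<Integer> logSizesAtCheckpoints() {
        List<String> log = new ArrayList<>();

        // "Transformation": only describes the work, nothing is executed yet.
        Supplier<Stream<Integer>> transformed = () -> Stream.of(1, 2, 3)
                .map(x -> {
                    log.add("map(" + x + ")");
                    return x * 10;
                });

        List<Integer> sizes = new ArrayList<>();
        sizes.add(log.size()); // pipeline defined, map has not run

        // First "action": forces the map step to run for every element.
        List<Integer> collected = transformed.get().collect(Collectors.toList());
        sizes.add(log.size());

        // Second "action": the pipeline is recomputed from scratch.
        int sum = transformed.get().reduce(0, Integer::sum);
        sizes.add(log.size());

        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(logSizesAtCheckpoints()); // prints [0, 3, 6]
    }
}
```

In Spark itself, the usual way to avoid the second recomputation is to persist (cache) the transformed RDD before running several actions on it.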
RDD Transformations
RDD Actions
RDD Fault Tolerance
RDD Construction
Parallelized collections: take an existing Scala collection and run functions on it in parallel
Hadoop datasets: run functions on each record of a file in the Hadoop Distributed File System or any other storage system supported by Hadoop
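The idea behind a parallelized collection (split a local collection into slices, then run a function on each slice in parallel) can be sketched in plain Java. This is not Spark code: `ParallelizedCollectionDemo`, `slice`, `mapPartitions`, and `sumInParallel` are hypothetical names, and a local thread pool stands in for cluster executors.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelizedCollectionDemo {

    /** Splits data into numSlices contiguous partitions of near-equal size. */
    static <T> List<List<T>> slice(List<T> data, int numSlices) {
        List<List<T>> slices = new ArrayList<>();
        int n = data.size();
        for (int i = 0; i < numSlices; i++) {
            int from = (int) ((long) i * n / numSlices);
            int to = (int) ((long) (i + 1) * n / numSlices);
            slices.add(data.subList(from, to));
        }
        return slices;
    }

    /** Applies fn to every slice on a thread pool; results come back in slice order. */
    static <T, R> List<R> mapPartitions(List<T> data, int numSlices, Function<List<T>, R> fn) {
        ExecutorService pool = Executors.newFixedThreadPool(numSlices);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (List<T> part : slice(data, numSlices)) {
                Callable<R> task = () -> fn.apply(part);
                futures.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> f : futures) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("partition task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    /** Sums each slice in parallel, then combines the partial sums. */
    static int sumInParallel(List<Integer> data, int numSlices) {
        List<Integer> partials = mapPartitions(data, numSlices,
                part -> part.stream().mapToInt(Integer::intValue).sum());
        return partials.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 1; i <= 10; i++) data.add(i);
        System.out.println(sumInParallel(data, 3)); // prints 55
    }
}
```

With real Spark, the equivalent entry point is `JavaSparkContext.parallelize`, which also accepts a slice count; the per-slice work then runs on the cluster rather than on local threads.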
What’s Hazelcast IMDG? The Fastest In-memory Data Grid
Hazelcast IMDG is an operational, in-memory, distributed computing platform that manages data using in-memory storage, and performs parallel execution for breakthrough application speed and scale
High-Density Caching, In-Memory Data Grid, Web Session Clustering, Microservices Infrastructure
What’s Hazelcast IMDG? In-memory Data Grid; Apache v2 Licensed; Distributed Caches (IMap, JCache); Java Collections (IList, ISet, IQueue); Messaging (Topic, RingBuffer); Computation (ExecutorService, M-R)
Green Primary, Green Backup, Green Shard
final SparkConf sparkConf = new SparkConf()
    .set("hazelcast.server.addresses", "localhost")
    .set("hazelcast.server.groupName", "dev")
    .set("hazelcast.server.groupPass", "dev-pass")
    .set("hazelcast.spark.readBatchSize", "5000")
    .set("hazelcast.spark.writeBatchSize", "5000")
    .set("hazelcast.spark.valueBatchingEnabled", "true");

final JavaSparkContext jsc = new JavaSparkContext("spark://localhost:7077", "app", sparkConf);
final HazelcastSparkContext hsc = new HazelcastSparkContext(jsc);

final HazelcastJavaRDD<Object, Object> mapRdd = hsc.fromHazelcastMap("movie");
final HazelcastJavaRDD<Object, Object> cacheRdd = hsc.fromHazelcastCache("my-cache");
Demo
LIMITATIONS
DATA SHOULD NOT BE UPDATED WHILE READING FROM SPARK
WHY?
MAP EXPANSION SHUFFLES THE DATA INSIDE THE BUCKET
CURSOR DOESN’T POINT TO CORRECT ENTRY ANYMORE, DUPLICATE OR MISSING ENTRIES COULD OCCUR
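The limitation above can be reproduced with a toy model. The sketch below is not Hazelcast code: `ToyTable` is a hypothetical hash table where key `k` lives in bucket `k % capacity`, the reader pages through it with a positional cursor in batches of 4, and the table doubles its bucket count once it holds more than two keys per bucket. These rules are assumptions chosen only to make the effect visible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class CursorDriftDemo {

    /** Toy hash table (hypothetical, not Hazelcast internals). */
    static final class ToyTable {
        private int capacity;
        private final Set<Integer> keys = new TreeSet<>();

        ToyTable(int capacity) {
            this.capacity = capacity;
        }

        void put(int key) {
            keys.add(key);
            // Expansion: double the bucket count above 2 keys per bucket.
            if (keys.size() > 2 * capacity) {
                capacity *= 2;
            }
        }

        /** Iteration order: bucket by bucket, ascending key inside a bucket. */
        List<Integer> iterationOrder() {
            List<Integer> order = new ArrayList<>();
            for (int bucket = 0; bucket < capacity; bucket++) {
                for (int k : keys) {
                    if (k % capacity == bucket) {
                        order.add(k);
                    }
                }
            }
            return order;
        }

        /** Returns up to batchSize keys starting at position cursor in the current order. */
        List<Integer> readBatch(int cursor, int batchSize) {
            List<Integer> order = iterationOrder();
            int from = Math.min(cursor, order.size());
            int to = Math.min(cursor + batchSize, order.size());
            return new ArrayList<>(order.subList(from, to));
        }
    }

    /** Scans keys 0..7 in batches of 4; a writer inserts key 8 after the first batch. */
    static List<Integer> scanWithConcurrentUpdate() {
        ToyTable table = new ToyTable(4);
        for (int k = 0; k < 8; k++) {
            table.put(k);
        }
        List<Integer> seen = new ArrayList<>();
        int cursor = 0;
        boolean updated = false;
        while (true) {
            List<Integer> batch = table.readBatch(cursor, 4);
            if (batch.isEmpty()) {
                break;
            }
            seen.addAll(batch);
            cursor += batch.size();
            if (!updated) {
                table.put(8); // triggers expansion from 4 to 8 buckets
                updated = true;
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(scanWithConcurrentUpdate()); // prints [0, 4, 1, 5, 3, 4, 5, 6, 7]
    }
}
```

In this run keys 4 and 5 are read twice and key 2 is never read, because the expansion reshuffled entries across buckets while the cursor kept its old position.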
github.com/hazelcast/hazelcast-spark
THANKS! Any questions? You can find me at @gamussa [email protected]