Storm - an overview

Into the Storm
An introduction to and overview of Apache Storm
Oliver Hall, Engineer, MetaBroadcast

What is Storm?
• "free and open source distributed realtime computation platform"
• tasks made from nodes, spread over multiple physical hosts
• at-least-once guarantee for message processing
• fault tolerant

Who uses it?
• Twitter
• Groupon
• Ooyala
• Taobao
• Alibaba
• and, of course... MetaBroadcast

What are we using it for?
• labelling
• statistics
• impressions counting
• and potentially much more...

History
• developed by Nathan Marz and BackType
• BackType acquired by Twitter
• initial release in September 2011
• currently at version 0.8, still under development

So, what is it?

Overview
• runs on a cluster of machines
• consists of topologies (continuously running processing tasks)
• cluster is a series of nodes
  ◦ master
  ◦ one or more workers

Master Node
• master node runs Nimbus
• distributes code around the cluster
• monitors for failures

Worker Node
• runs a Supervisor
• listens for work assigned to its machine
• starts / stops worker processes as necessary
• each worker process runs a subset of a topology
• a running topology consists of multiple worker processes spread across several machines

Zookeeper
• Nimbus and Supervisors communicate via Zookeeper cluster

Topologies
• a graph of computation nodes
• links show data flow

Spouts
• source of data
• reliable or unreliable
• can emit tuples to one or more streams
[Diagram: data source, data source, ..., data source]
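
The difference between a reliable and an unreliable spout comes down to whether the spout remembers what it has emitted. The sketch below is plain Java with no Storm dependency; PendingTracker and all of its methods are hypothetical names that only mimic the ack / fail callbacks a reliable spout receives. A message stays pending until it is acked and is queued again when it fails, which is where an at-least-once guarantee can come from.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a reliable spout's bookkeeping (not a Storm class).
class PendingTracker {
    private final Deque<String> queue = new ArrayDeque<>();    // messages waiting to be emitted
    private final Map<Long, String> pending = new HashMap<>(); // emitted but not yet acked
    private long nextId = 0;

    void offer(String message) { queue.addLast(message); }

    // Emit the next message and remember it under a fresh message id.
    // Returns the id, or -1 when there is nothing to emit.
    long emit() {
        String message = queue.pollFirst();
        if (message == null) return -1;
        long id = nextId++;
        pending.put(id, message);
        return id;
    }

    // Fully processed downstream: forget the message.
    void ack(long id) { pending.remove(id); }

    // Processing failed or timed out: put the message back so it is emitted again.
    void fail(long id) {
        String message = pending.remove(id);
        if (message != null) queue.addFirst(message);
    }

    int pendingCount() { return pending.size(); }
    int queuedCount() { return queue.size(); }

    public static void main(String[] args) {
        PendingTracker tracker = new PendingTracker();
        tracker.offer("the cow jumped over the moon");
        long first = tracker.emit();
        tracker.fail(first);          // simulated downstream failure
        long retry = tracker.emit();  // same message, new id
        tracker.ack(retry);
        System.out.println("pending=" + tracker.pendingCount()
                + " queued=" + tracker.queuedCount()); // prints pending=0 queued=0
    }
}
```

An unreliable spout would simply skip the pending map: nothing is remembered, so nothing can be replayed.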

Bolts
• where all Storm processing occurs
• filters, aggregations, functions, database calls, and more
[Diagram: input tuples -> processing -> output tuples, ..., output tuples]

Topologies (again)
• topologies tell bolts and spouts where to send their data
• can parallelize every step
N.B. you can define spouts, bolts and topologies in many languages, including non-JVM languages

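    TopologyBuilder builder = new TopologyBuilder();

    // spout with a parallelism hint of 5
    builder.setSpout("spout", new RandomSentenceSpout(), 5);

    // "split" receives sentences from "spout", distributed randomly
    builder.setBolt("split", new SplitSentence(), 8)
           .shuffleGrouping("spout");

    // "count" receives words from "split"; equal words go to the same task
    builder.setBolt("count", new WordCount(), 12)
           .fieldsGrouping("split", new Fields("word"));

    Config conf = new Config();
    conf.setMaxTaskParallelism(3);

    // run the topology in-process rather than on a real cluster
    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("word-count", conf, builder.createTopology());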
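
The two groupings in the topology above behave differently: shuffleGrouping spreads tuples across tasks, while fieldsGrouping sends tuples with the same field value to the same task, which is what lets each WordCount task keep its own counts. A minimal sketch of that idea follows, assuming a simple hash-modulo scheme; Storm's actual partitioning code may differ, and GroupingSketch and taskFor are hypothetical names.

```java
// Hypothetical illustration of fields grouping (not Storm's implementation).
class GroupingSketch {
    // Pick a task index for a field value: equal values always map to the same task.
    // floorMod keeps the result in [0, numTasks) even for negative hash codes.
    static int taskFor(String fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 12; // matches the parallelism hint of the "count" bolt
        for (String word : new String[] {"the", "cow", "the", "moon", "cow"}) {
            System.out.println(word + " -> task " + taskFor(word, numTasks));
        }
    }
}
```

Both occurrences of "the" land on the same task index, so a single task sees every count update for that word.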

    public class RandomSentenceSpout extends BaseRichSpout {
        ...
        @Override
        public void nextTuple() {
            Utils.sleep(100);
            String[] sentences = new String[] {
                "the cow jumped over the moon",
                "an apple a day keeps the doctor away",
                "four score and seven years ago",
                "snow white and the seven dwarfs",
                "i am at two with nature"};
            // emit one randomly chosen sentence per call
            String sentence = sentences[_rand.nextInt(sentences.length)];
            _collector.emit(new Values(sentence));
        }
        ...
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // the single output field is named "word", although each value is a whole sentence
            declarer.declare(new Fields("word"));
        }
    }

    public static class WordCount extends BaseBasicBolt {
        // running totals, held in memory by this bolt instance
        Map<String, Integer> counts = new HashMap<String, Integer>();

        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String word = tuple.getString(0);
            Integer count = counts.get(word);
            if (count == null) count = 0;
            count++;
            counts.put(word, count);
            // emit the updated total for this word
            collector.emit(new Values(word, count));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word", "count"));
        }
    }

Trident
• a high-level abstraction on top of Storm
• gives high throughput with stateful stream processing (think Pig or Cascading)
• exactly-once message semantics

Only-once
• tuples are processed in small batches
• each batch has a unique batch id
• state updates are ordered
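
These three properties are enough to make a state update safe to retry. The sketch below is plain Java, not Trident's State API; TransactionalCount is a hypothetical name. It assumes batches are applied in batch-id order and that a replayed batch carries the same id and the same contents as the original attempt. The id of the last applied batch is stored next to the value, so a replay is recognised and skipped.

```java
// Hypothetical illustration of an id-guarded state update (not a Trident class).
class TransactionalCount {
    private long count = 0;
    private long lastTxId = -1; // id of the last batch applied; -1 means none yet

    // Apply a batch's partial count. A replayed batch has the same id as the
    // last applied one and is skipped, so retries do not double count.
    void applyBatch(long txId, long batchCount) {
        if (txId == lastTxId) return;
        count += batchCount;
        lastTxId = txId;
    }

    long count() { return count; }

    public static void main(String[] args) {
        TransactionalCount state = new TransactionalCount();
        state.applyBatch(1, 3);
        state.applyBatch(1, 3); // replay of batch 1: ignored
        state.applyBatch(2, 2);
        System.out.println(state.count()); // prints 5
    }
}
```

Without the ordering guarantee a single stored id would not be enough, since an older batch could arrive after a newer one had overwritten it.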

Trident Topologies
• a stream is channelled through a number of processing stages
• each stage can be a filter, aggregation, function, or other similar process
• sounds familiar?
• individual steps are combined into spouts / bolts at runtime

What are functions?
• basic building blocks in Trident
[Diagram: input tuples -> processing -> output tuples]

    public class Split extends BaseFunction {
        public void execute(TridentTuple tuple, TridentCollector collector) {
            String sentence = tuple.getString(0);
            for (String word : sentence.split(" ")) {
                collector.emit(new Values(word));
            }
        }
    }

Example Trident Topology

    TridentTopology topology = new TridentTopology();
    TridentState wordCounts =
        topology.newStream("spout1", spout)
            .each(new Fields("sentence"), new Split(), new Fields("word"))
            .groupBy(new Fields("word"))
            .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
            .parallelismHint(6);
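
To see what this topology computes, it can help to express the same three stages over a single finite batch using ordinary Java streams. This is only an analogy under that assumption, not Trident code: WordCountSketch and countWords are hypothetical names, and a real topology runs continuously and keeps its counts in the configured state.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical single-batch analogy of the Trident word count (not Trident code).
class WordCountSketch {
    static Map<String, Long> countWords(List<String> sentences) {
        return sentences.stream()
                // like each(..., new Split(), ...): one word per output tuple
                .flatMap(sentence -> Arrays.stream(sentence.split(" ")))
                // like groupBy("word") followed by Count(); TreeMap keeps keys sorted
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> batch = Arrays.asList("the cow jumped over the moon", "the moon");
        System.out.println(countWords(batch));
        // prints {cow=1, jumped=1, moon=2, over=1, the=3}
    }
}
```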

DRPC
• Distributed Remote Procedure Calls
• executes an RPC across a Storm Cluster
• query transformed into tuple, then flows through topology

Trident State
• means of persistence, either in memory or in store such as Cassandra
• state updates are idempotent in the face of retries or failures
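
One way a state can stay idempotent even when a retried batch contains different tuples is to keep the previous value alongside the current one. The sketch below shows that idea in plain Java; it is not Trident's State API, OpaqueCount is a hypothetical name, and it assumes batches arrive in batch-id order. On a retry the current value is rebuilt from the previous value instead of being added to.

```java
// Hypothetical illustration of an update that tolerates changed retries (not a Trident class).
class OpaqueCount {
    private long value = 0;     // current total
    private long prevValue = 0; // total before the last applied batch
    private long lastTxId = -1; // id of the last applied batch; -1 means none yet

    void applyBatch(long txId, long batchCount) {
        if (txId == lastTxId) {
            // Retry of the last batch: its contents may have changed,
            // so recompute from the value that preceded it.
            value = prevValue + batchCount;
        } else {
            // New batch: remember the old total, then apply.
            prevValue = value;
            value += batchCount;
            lastTxId = txId;
        }
    }

    long value() { return value; }

    public static void main(String[] args) {
        OpaqueCount state = new OpaqueCount();
        state.applyBatch(1, 3);
        state.applyBatch(1, 4); // retry of batch 1 with different contents
        state.applyBatch(2, 2);
        System.out.println(state.value()); // prints 6
    }
}
```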

Trident Spouts
• can be one of three types
  ◦ Non-transactional
  ◦ Transactional
  ◦ Opaque Transactional

Achieving exactly-once semantics

                        State
Spout                   Non-transactional   Transactional   Opaque Transactional
Non-transactional       No                  No              No
Transactional           No                  Yes             Yes
Opaque Transactional    No                  No              Yes
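
The table can also be written as a small predicate. This sketch reads each row of the table as a spout type and each column as a state type; ExactlyOnceMatrix, Kind and exactlyOnce are hypothetical names used only for illustration.

```java
// Hypothetical encoding of the spout / state compatibility table.
class ExactlyOnceMatrix {
    enum Kind { NON_TRANSACTIONAL, TRANSACTIONAL, OPAQUE_TRANSACTIONAL }

    static boolean exactlyOnce(Kind spout, Kind state) {
        // Anything non-transactional on either side rules out exactly-once.
        if (spout == Kind.NON_TRANSACTIONAL || state == Kind.NON_TRANSACTIONAL) return false;
        // An opaque spout pairs only with opaque state.
        if (spout == Kind.OPAQUE_TRANSACTIONAL) return state == Kind.OPAQUE_TRANSACTIONAL;
        // A transactional spout works with transactional or opaque state.
        return true;
    }

    public static void main(String[] args) {
        for (Kind spout : Kind.values()) {
            for (Kind state : Kind.values()) {
                System.out.println(spout + " spout + " + state + " state: "
                        + (exactlyOnce(spout, state) ? "Yes" : "No"));
            }
        }
    }
}
```

Read this way, opaque transactional state is the most tolerant choice, since it is the only state type that gives a "Yes" with both transactional and opaque transactional spouts.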

Issues
• Storm does have some negative points
  ◦ lack of documentation
  ◦ logging issues
  ◦ no testing framework for Trident
  ◦ rapidly changing
• however, it is still early days

Summary
• Storm is a realtime, scalable, resilient computation platform
• Trident offers strong (exactly-once) message guarantees
• still an evolving technology - much may change

Thank you! Any questions?
images from the Storm Tutorial - https://github.com/nathanmarz/storm/wiki/Tutorial
