Stream Processing: Philosophy, Concepts, and Technologies

Given at PhillyETE 2013: Stream processing has emerged in recent years as a very fast-growing paradigm in data science infrastructure. This rise can be partly attributed to factors external to system design, such as business demands for near-realtime data or the inability of hardware to manage an ever-growing data set. However, this paradigm also possesses many inherent strengths, and there is good reason for it to be embraced, not simply tolerated. In this talk I’ll discuss some high-level advantages of processing data in streams, such as fault tolerance, horizontal scalability, and composability. I’ll then introduce NSQ, Bitly’s open source queueing system, and discuss how it provides us with these advantages and how it approaches the tradeoffs inherent in designing distributed systems. I’ll also discuss some of the burdens that NSQ places on developers, such as idempotent operations, and why they are necessary. Finally, I’ll discuss some new technologies that aim to abstract away the mechanism of communication between streaming programs, and talk about the powerful opportunities and risks that they offer.

Dan Frank

April 03, 2013

Transcript

  4. What did I just sign up for?

    • Stream processing as a tool for decomposition and modularity
    • Stream processing composition building blocks
    • Stream processing in your distributed web application
    • NSQ, Bitly’s distributed messaging framework
    • The future now: stream processing within your programs, and technologies to do it
  5. A QUICK NOTE ON

    • Hadoop is a dominant framework for doing batch tasks: tasks that operate on a fully populated dataset and just need to be done “later”. Offline
    • Stream processing is basically the opposite of this: operating as new data comes in, computation happens online. No concept of “complete” dataset
    • BUT, using the two as complementary data analysis components is very effective
  8. NAÏVE “ARCHITECTURE”

        for line in lines:
            new_line = do_something(line)
            newer_line = do_something_else(new_line)
            # ...
            outputs.append(newest_line)

    Composition of our functions is static, built into our program
    Error handling? Uhh
  10. Unix Solution: Pipes

        < lines do_something | do_something_else | ...

    Composition happens outside the application code
    Errors are printed to stderr, execution continues. It’ll do...
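
The pipe idea on this slide can be sketched in Python. Each stage is a generator that knows nothing about its neighbors, and the composition lives in one place outside the stage functions. The names `do_something` and `do_something_else` come from the slides; their bodies and the `stage` and `pipeline` helpers are assumptions made up for illustration.

```python
import sys
from functools import reduce


def stage(fn, err=sys.stderr):
    """Wrap a per-line function as a stream stage (hypothetical helper).

    A failing line is reported on `err` and skipped, so the stream keeps
    flowing, much like a Unix filter that logs to stderr and carries on.
    """
    def run(lines):
        for line in lines:
            try:
                yield fn(line)
            except Exception as exc:
                print(f"skipping {line!r}: {exc}", file=err)
    return run


def pipeline(*stages):
    """Compose stages left to right, like `a | b | c` in a shell."""
    return lambda lines: reduce(lambda stream, s: s(stream), stages, lines)


# Placeholder bodies: the slides do not say what these functions do.
def do_something(line):
    return line.strip()


def do_something_else(line):
    if not line:
        raise ValueError("empty line")
    return line.upper()


# The composition is data, not control flow baked into a loop body.
process = pipeline(stage(do_something), stage(do_something_else))

if __name__ == "__main__":
    for out in process(["  click\n", "\n", "view\n"]):
        print(out)
```

Reordering or swapping stages only touches the `pipeline(...)` call, which is the property the slide attributes to shell pipes.
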
  12. ASIDE ON MODULARITY

    • Modularity in code
        • Logically simpler functions, more easily grokked + tested
        • Smaller functions more easily reused throughout program, DRY
    • Modularity in architecture
        • Fine grained scaling of individual components
        • Isolate failures
        • All of the above
  14. “QUEUEREADER” applications consume messages generated as outlined above

    • May modify messages and send further downstream
    • May update some sort of database
    • Probably a good idea to do some archival as well
  15. ARCHIVAL GOODIES

    • Backfill new systems
    • Repair busted systems
    • Ripe for batch processing
    • Include timestamps in your messages!
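
A minimal sketch of why timestamps matter for archival: if every message carries one, an archive can be replayed from any point to backfill a new system or repair a broken one. The envelope layout (`timestamp`, `event`), the newline-delimited JSON format, and the function names are all assumptions for illustration, not anything the talk prescribes.

```python
import io
import json
import time


def envelope(event, now=time.time):
    """Attach a timestamp so archived messages can be replayed by time."""
    return {"timestamp": now(), "event": event}


def archive(messages, sink):
    """Append each message to `sink` as one JSON document per line."""
    for msg in messages:
        sink.write(json.dumps(msg) + "\n")


def replay(source, since=0.0):
    """Yield archived messages with timestamp >= since, e.g. for a backfill."""
    for line in source:
        msg = json.loads(line)
        if msg["timestamp"] >= since:
            yield msg


if __name__ == "__main__":
    log = io.StringIO()                  # stands in for a file on disk
    ticks = iter([100.0, 200.0, 300.0])  # fake clock for a repeatable demo
    archive((envelope(e, lambda: next(ticks)) for e in ["a", "b", "c"]), log)
    log.seek(0)
    print([m["event"] for m in replay(log, since=200.0)])
```
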
  16. Pubsub / Multicast Model

    [Diagram: Producer → PS → ConsumerA and ConsumerB, with a copy of msg on each link]
    Messages duplicated to multiple consumers
    Decouple independent stream operations
  17. Distribution Model

    [Diagram: Producer → Q → ConsumerA, ConsumerA; messages m1 and m2 each go to one consumer]
    Messages distributed among consumers
    Horizontally scale workers to achieve desired throughput
  18. Distribution Model

    [Diagram: Producer → Q → Consumer, Consumer; messages m1, m2]
    Fault Tolerance: In face of consumer failure, other consumers (try to) pick up the slack
  19. Buffered Model

    [Diagram: Producer → Q → Consumer, Consumer; messages m1, m2, m3]
    Buffering: If consumers cannot keep up with producers, the queue is able to hold onto messages so they can be processed later
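
The distribution and buffered models can be sketched together with the standard library: one shared queue holds whatever the producer has sent, and each message is taken by exactly one of several workers. This is an in-process toy, assuming threads as stand-ins for consumers; `run_workers` and its parameters are hypothetical names.

```python
import queue
import threading


def run_workers(messages, n_workers, handle):
    """Distribute `messages` among `n_workers` threads via one shared queue.

    The queue buffers everything the producer has sent until a worker is
    free, and each message is taken by exactly one worker.
    """
    q = queue.Queue()
    for msg in messages:          # the producer can run ahead of consumers
        q.put(msg)

    results = []
    lock = threading.Lock()

    def worker(name):
        while True:
            try:
                msg = q.get_nowait()
            except queue.Empty:
                return            # backlog drained
            out = handle(msg)
            with lock:
                results.append((name, out))

    threads = [threading.Thread(target=worker, args=(f"consumer-{i}",))
               for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    done = run_workers(["m1", "m2", "m3"], n_workers=2, handle=str.upper)
    print(sorted(out for _, out in done))
```

Raising `n_workers` is the horizontal-scaling knob from the slide: throughput changes, the producer and the handler do not.
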
  21. MAKE IT WEBSCALE!!!

    what does this have to do with my webapp?
    Web requests are serialized as event messages
    Messages make up a stream that can be processed elsewhere in your distributed application
  23. ASYNC DATA FLOW

    [Diagram: App with four numbered steps (❶ to ❹), labeled: incoming request; sync persist data; send response; async queue message]
    Downstream processing decoupled from request / response
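
A rough sketch of this flow, assuming an in-memory dict for the data store and a `queue.Queue` for the message queue. The handler persists synchronously, hands an event to the queue, and responds; anything slow happens later in a separate consumer. The function names and message fields are hypothetical, and the relative order of the enqueue and the response is a guess, since the transcript does not preserve which step carries which number.

```python
import json
import queue

db = {}                  # stands in for the synchronous data store
events = queue.Queue()   # stands in for the message queue


def handle_request(user_id, url):
    """Handle one web request: persist, enqueue an event, respond."""
    db[user_id] = url                                        # sync: persist
    events.put(json.dumps({"user": user_id, "url": url}))    # async: queue
    return {"status": 200}                                   # send response


def drain(process):
    """Downstream consumer: runs later, outside the request/response cycle."""
    handled = 0
    while not events.empty():
        process(json.loads(events.get()))
        handled += 1
    return handled


if __name__ == "__main__":
    print(handle_request("u1", "http://example.com"))
    print(drain(print))
```
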
  24. IT’S NICE BUT

    • Stringing together queues and pubsubs implementing these models a pain
    • Single conduit for messages a SPOF
    • Single queue leads to rigid dependencies between services
  27. TYPICAL (OLD) ARCHITECTURE

    [Diagram: Host A (API, simplequeue, queuereader); Host B (pubsub); Host C (simplequeue, queuereader, ps_to_http)]
    Annotations: SPOF, SPOF, COMPLEX, ANARCHY
  28. NSQ Core Features

    Queue daemon facilitates multicast, distribution, and buffering
    Lookup service simplifies configuration and allows topology to change dynamically
    Fully distributed and decentralized
  43. MULTICAST AND BUFFERING, YOU SAY? NSQ CONCEPTS AND MESSAGE FLOW

    • a topic is a distinct stream of messages (a single nsqd instance can have multiple topics)
    • a channel is an independent queue for a topic (a topic can have multiple channels)
    • consumers discover producers by querying nsqlookupd (a discovery service for topics)
    • topics and channels are created at runtime (just start publishing/subscribing)
    [Diagram: nsqd with topic “clicks” and channels “metrics”, “spam_analysis”, “archive”; consumers on separate hosts; messages A and B each shown three times]
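
The topic and channel semantics in these bullets can be modeled in a few lines. This is a toy in-memory model, not nsqd's implementation or client API: every channel gets its own copy of each message (multicast), a channel hands each message to exactly one of its consumers (distribution), and a channel without consumers holds messages (buffering). The round-robin choice of consumer is an assumption made to keep the sketch deterministic.

```python
from collections import defaultdict, deque


class Channel:
    """An independent queue for one topic, shared by its consumers."""

    def __init__(self):
        self.backlog = deque()   # held until some consumer is attached
        self.consumers = []
        self._turn = 0

    def subscribe(self, consumer):
        self.consumers.append(consumer)
        self.flush()

    def put(self, msg):
        self.backlog.append(msg)
        self.flush()

    def flush(self):
        # Distribution: each buffered message goes to exactly one consumer.
        while self.backlog and self.consumers:
            consumer = self.consumers[self._turn % len(self.consumers)]
            self._turn += 1
            consumer(self.backlog.popleft())


class Topic:
    """A distinct stream of messages, fanned out to its channels."""

    def __init__(self):
        self.channels = defaultdict(Channel)  # created at runtime on first use

    def publish(self, msg):
        # Multicast: every channel receives its own copy of the message.
        for channel in self.channels.values():
            channel.put(msg)


if __name__ == "__main__":
    clicks = Topic()
    metrics_a, metrics_b, archived = [], [], []
    clicks.channels["metrics"].subscribe(metrics_a.append)
    clicks.channels["metrics"].subscribe(metrics_b.append)
    clicks.channels["archive"].subscribe(archived.append)
    for msg in ["A", "B"]:
        clicks.publish(msg)
    print(metrics_a, metrics_b, archived)  # ['A'] ['B'] ['A', 'B']
```

The two "metrics" consumers split the stream between them, while "archive" independently sees every message.
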
  47. DISCOVERY

    remove the need for publishers and consumers to know about each other
    [Diagram: producer, nsqd, two nsqlookupd instances; persistent TCP connections]
    ❶ publish msg (specifying topic)
    ➋ IDENTIFY
    ➌ REGISTER (topic/channel)
  50. DISCOVERY (CLIENT)

    remove the need for publishers and consumers to know about each other
    [Diagram: consumer, two nsqlookupd instances; HTTP requests]
    ➊ regularly poll for topic producers
    ➋ connect to all producers
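
The discovery steps on the two slides above can be sketched as a toy registry. Here `Lookupd` is an in-memory stand-in for one nsqlookupd instance, not its real TCP or HTTP interface, and the addresses are made up. The point is the shape of the protocol: producers register topics, and a consumer takes the union of what every reachable lookup instance reports.

```python
class Lookupd:
    """Toy stand-in for one lookup instance: topic -> producer addresses."""

    def __init__(self):
        self.producers = {}

    def register(self, topic, address):
        self.producers.setdefault(topic, set()).add(address)

    def lookup(self, topic):
        return set(self.producers.get(topic, ()))


def discover(lookupds, topic):
    """Union the answers of every reachable lookup instance."""
    found = set()
    for lookupd in lookupds:
        try:
            found |= lookupd.lookup(topic)
        except ConnectionError:
            continue      # one lookup instance being down is not fatal
    return found


if __name__ == "__main__":
    first, second = Lookupd(), Lookupd()
    first.register("clicks", "host-a:4150")    # hypothetical addresses
    second.register("clicks", "host-a:4150")
    second.register("clicks", "host-b:4150")
    # A consumer would repeat this poll on a timer, then connect to each
    # producer it has not seen before.
    print(sorted(discover([first, second], "clicks")))
```

Because the lookup instances never talk to each other, losing one only shrinks the set of answers a consumer can merge.
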
  56. ELIMINATE ALL THE SPOF

    [Diagram: three nsqd instances, two consumers]
    • easily enable distributed and decentralized topologies
    • no brokers
    • consumers connect to all producers
    • messages are pushed to consumers
    • nsqlookupd instances are independent and require no coordination (run a few for HA)
  61. EXAMPLE NSQ ARCHITECTURE

    [Diagram: three API + NSQD pairs, two consumers, two nsqlookupd instances]
    Steps: PUBLISH, REGISTER, DISCOVER, SUBSCRIBE
  62. A WORD ON ERRORS

    • If a reader does not reply to confirm completion of a message within a timeout, the message is requeued.
    • Abandoned after configurable number of requeues
    • Allows for recovery in face of transient problems without getting hung up on bad messages
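
The timeout-and-requeue behavior described above can be modeled as a small state machine: a message is ready, then in flight with a deadline, then either finished, requeued, or abandoned. This is a toy model under stated assumptions, not NSQ's implementation: the class, its method names, the explicit `now` argument, and counting deliveries via `max_attempts` are all choices made for the sketch.

```python
from collections import deque


class RequeueingChannel:
    """Toy channel: a message stays in flight until the reader confirms it."""

    def __init__(self, timeout=60.0, max_attempts=3):
        self.timeout = timeout            # seconds a reader has to confirm
        self.max_attempts = max_attempts  # deliveries before giving up
        self.ready = deque()              # (msg_id, body, attempts so far)
        self.in_flight = {}               # msg_id -> (body, attempts, deadline)
        self.abandoned = []

    def put(self, msg_id, body):
        self.ready.append((msg_id, body, 0))

    def get(self, now):
        """Hand the next message to a reader and start its timeout clock."""
        if not self.ready:
            return None
        msg_id, body, attempts = self.ready.popleft()
        self.in_flight[msg_id] = (body, attempts + 1, now + self.timeout)
        return msg_id, body

    def finish(self, msg_id):
        """Reader confirms completion; the message is done."""
        self.in_flight.pop(msg_id, None)

    def expire(self, now):
        """Requeue timed-out messages, abandoning those out of attempts."""
        for msg_id, (body, attempts, deadline) in list(self.in_flight.items()):
            if now < deadline:
                continue
            del self.in_flight[msg_id]
            if attempts >= self.max_attempts:
                self.abandoned.append((msg_id, body))
            else:
                self.ready.append((msg_id, body, attempts))


if __name__ == "__main__":
    ch = RequeueingChannel(timeout=10, max_attempts=2)
    ch.put("id1", "bad message")
    ch.get(now=0)        # delivered, never confirmed
    ch.expire(now=10)    # first timeout: requeued
    ch.get(now=10)       # delivered again
    ch.expire(now=20)    # second timeout: abandoned
    print(ch.abandoned)
```

The attempt cap is what keeps a message that always fails from circulating forever, while a transient failure simply gets another try.
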
  63. OTHER NSQ NICETIES

    • Admin interface: server-side channel pausing, admin action notifications
    • Configurable high-water mark on memory usage
    • Ephemeral channels for stream sampling
  65. DISTRIBUTED MESSAGING CAVEATS

    • Messages in order? Fuggedaboudit!*
    • NSQ protocol guarantees delivery at least once - idempotence is a must! (_ids help)
    • Try not to be shocked by effortless recovery from node failure
    *See http://bit.ly/life_beyond_transactions
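
With at-least-once delivery, the same message may arrive twice, so a handler has to produce the same result either way. One common approach, sketched here, is to remember message ids and ignore repeats. The `_id` and `url` field names and the click-counting task are assumptions for illustration; a real deployment would also need the set of seen ids to survive restarts.

```python
class IdempotentCounter:
    """Counts clicks per URL, ignoring redelivered messages."""

    def __init__(self):
        self.seen_ids = set()   # in-memory only in this sketch
        self.counts = {}

    def handle(self, message):
        """Apply a message once; return False for a duplicate delivery."""
        msg_id = message["_id"]
        if msg_id in self.seen_ids:
            return False        # already applied: no further effect
        self.seen_ids.add(msg_id)
        url = message["url"]
        self.counts[url] = self.counts.get(url, 0) + 1
        return True


if __name__ == "__main__":
    counter = IdempotentCounter()
    click = {"_id": "abc123", "url": "http://example.com"}
    counter.handle(click)
    counter.handle(click)       # redelivery after a timeout, say
    print(counter.counts)
```
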
  67. STREAM PROCESSING: WHY NOW?

    • Cheap node distribution: EC2 etc
    • Moore’s law, Amdahl’s law, battered deceased equines...
    • Taking advantage of CPU parallelism the way forward for program efficiency - good thing we just went over a paradigm for distributing tasks among parallel workers!
  68. • Channels allow synchronized passage of messages between two goroutines

    • Goroutine independence (through synchronization) allows stream-like architecture:
    • “Don’t communicate by sharing memory, share memory by communicating”
    • Golang scheduler can parallelize between cores (GOMAXPROCS)
    • Channels act like queues. Multicast not really an option
    • Queuereader applications are a particularly good fit for goroutine concurrency
  69. [Diagram: Q feeds two ConsumerA instances; within a consumer, a Golang Channel passes m1, m2, m3 to Goroutine 1, Goroutine 2, Goroutine 3, spread over CPU 1 and CPU 2]

    • Within each consumer, messages distributed among goroutines
    • Goroutines, when possible, parallelized across CPUs
    • OK to have more goroutines than CPUs - golang scheduler will give them CPU time when another goroutine is idle (e.g. waiting on network)
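
The slide is about Go, but the same shape can be approximated in Python, the language of the deck's earlier pseudocode. This is an analogy only: a thread pool stands in for goroutines, and `time.sleep` stands in for waiting on the network. It illustrates the last bullet, that having more workers than CPUs is fine when most of them are idle at any moment. The `handle` and `consume` names are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def handle(msg):
    time.sleep(0.05)   # stands in for a network call; the thread is idle here
    return msg.upper()


def consume(messages, workers=8):
    """Process messages on `workers` threads; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(handle, messages))


if __name__ == "__main__":
    start = time.perf_counter()
    print(consume([f"m{i}" for i in range(8)]))
    print(f"elapsed: {time.perf_counter() - start:.2f}s")
```

Unlike goroutines under the Go scheduler, CPython threads generally do not run CPU-bound Python code in parallel, so this sketch only mirrors the I/O-bound case.
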
  70. ZMQ FEATURES

    • Networking library that provides building blocks discussed earlier
    • Unlike golang channels, does support many more complex patterns
    • Transport layer abstracted out: same application can connect multiple threads or multiple machines
    • Can start by distributing among processes, and scale up to several boxes. Application code doesn’t need to know about it!
    • All the rage among the webscale set, but unclear what the hell is going on in the community
  71. WHAT HAVE WE SEEN HERE?

    • Stream processing paradigm is a great tool for writing composed, modular applications
    • Fault tolerance and horizontal scalability come in the box
    • Your web application is probably better suited to this design than you think
    • NSQ is the tool we use to write distributed stream processing applications and it kicks ass at it
    • These same paradigms can aid in writing performant applications making use of multicore computer architecture, so you should plan on seeing a lot more of this stuff in the near future, whether you like it or not