DB Data App App App Faster Larger Database M. Stonebraker and U. Çetintemel, “‘One Size Fits All’: An Idea Whose Time Has Come and Gone,” in ICDE ’05 A. Jacobs, “The Pathologies of Big Data,” Communications of the ACM, 52(8):36–44, Aug. 2009 15
DB Data App App App Faster Larger Database M. Stonebraker and U. Çetintemel, “‘One Size Fits All’: An Idea Whose Time Has Come and Gone,” in ICDE ’05 A. Jacobs, “The Pathologies of Big Data,” Communications of the ACM, 52(8):36–44, Aug. 2009 15
DB Data App App App Faster Larger Database M. Stonebraker and U. Çetintemel, “‘One Size Fits All’: An Idea Whose Time Has Come and Gone,” in ICDE ’05 A. Jacobs, “The Pathologies of Big Data,” Communications of the ACM, 52(8):36–44, Aug. 2009 15
DB Data App App App Faster Larger Database M. Stonebraker and U. Çetintemel, “‘One Size Fits All’: An Idea Whose Time Has Come and Gone,” in ICDE ’05 A. Jacobs, “The Pathologies of Big Data,” Communications of the ACM, 52(8):36–44, Aug. 2009 15
—2011 —2013 Aurora STREAM Borealis SPC SPADE Storm S4 1st generation 2nd generation 3rd generation Abadi et al., “Aurora: a new model and architecture for data stream management,” VLDB Journal, 2003 Arasu et al., “STREAM: The Stanford Data Stream Management System,” Stanford InfoLab, 2004. Abadi et al., “The Design of the Borealis Stream Processing Engine,” in CIDR ’05 Amini et al., “SPC: A Distributed, Scalable Platform for Data Mining,” in DMSSP ’06 Gedik et al., “SPADE: The System S Declarative Stream Processing Engine,” in SIGMOD ’08 Neumeyer et al., “S4: Distributed Stream Computing Platform,” in ICDMW ’10 http://storm-project.net Samza http://samza.incubator.apache.org
PE2 PE3 PE4 RawStatus null text="Int..." EV KEY VAL Topic topic="S4" count=1 EV KEY VAL Topic topic="stream" count=1 EV KEY VAL Topic reportKey="1" topic="S4", count=4 EV KEY VAL TopicExtractorPE (PE1) extracts hashtags from status.text TopicCountAndReportPE (PE2-3) keeps counts for each topic across all tweets. Regularly emits report event if topic count is above a configured threshold. TopicNTopicPE (PE4) keeps counts for top topics and outputs top-N topics to external persister 18
You Have is a Hammer, Throw Away Everything That’s Not a Nail!” [J. Lin, in Big Data, 1(1):28–37, 2013] “Data whose characteristics forces us to look beyond the traditional methods that are prevalent at the time” [A. Jacobs, in ACM Queue, 7(6):10,2009] 19
high speed of arrival Change over time (concept drift) Approximation algorithms (small error with high probability) Single pass, one data item at a time Sub-linear space and time per data item 21
Decision Tree Algorithm,” The Journal of Machine Learning Research, vol. 11, pp. 849–872, Mar. 2010. 33 Stats Stats Stats Stream Histograms Model Instances Model Updates
A Platform for Mining Big Data Streams Runs on existing DSPEs (Storm, Samza, S4) Algorithms for classification, regression, clustering Available and open-source http://samoa-project.net A platform for collaboration and research on distributed stream mining 43
+ crowdsourcing Millions of classes (e.g., Wikipedia pages) Multi-target learning System issues (load balancing, communication) Programming paradigms and abstractions 44