The rate at which the world produces data is growing steadily, creating ever larger streams of continuously evolving data. However, the current de facto standard solutions for big data analysis are not designed to mine evolving streams. Big data streams are only starting to be studied systematically; they are the next frontier for data analytics. As such, best practices in this context are not yet settled, and the landscape is changing rapidly: it’s a wild west.
In this talk, we present a core set of solutions for stream analysis that constitutes an initial foray into this uncharted territory. In doing so, we introduce Apache SAMOA, an open-source platform for mining big data streams (http://samoa.incubator.apache.org). Apache SAMOA provides a collection of distributed streaming algorithms for data mining tasks such as classification, regression, and clustering. It features a pluggable architecture that allows it to run on several distributed stream processing engines, such as Storm, S4, Samza, and Flink.
As a case study, we present one of SAMOA's main algorithms for classification, the Vertical Hoeffding Tree (VHT). Then, we propose a framework for online performance evaluation of streaming classifiers. We conclude by highlighting the issue of load balancing from a distributed systems perspective and describing a generalizable solution.