Recap • Introduction • Use case and demo
Acknowledgement: Some of the content in this slide deck is adapted from Matthew Rocklin’s talk, Streaming Processing with Dask, at PyData 2017. Thank you ☺
(20 April 2021) Share your real-world use cases for databases, data warehouses and data lakes. How do you or your company use them to manage data? What problems have you or your company faced along the way?
Streaming data is one of the common forms of big data (volume, velocity) • Streaming data is: • Unbounded/infinite: we might receive data continuously, forever • Timely: we care about responding quickly (near real-time) • Used in: • Web server logs • Financial time series (trading) • Network data • IoT sensors
When dealing with streaming data, we often need only a small subset of the data for analysis. • It is also common not to store the data at all, i.e. to analyse it and output the results on the fly.
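The two points above (keep only a subset, analyse without storing) can be sketched in plain Python with generators. This is an illustrative sketch, not code from the deck: the `sensor_readings` source, its sensor ids and the choice of a running mean are all hypothetical.

```python
import random

def sensor_readings(n, seed=0):
    """Hypothetical source: yields (sensor_id, value) pairs one at a time."""
    rng = random.Random(seed)
    for _ in range(n):
        yield rng.choice(["a", "b", "c"]), rng.uniform(0.0, 100.0)

def running_mean(values):
    """Yield the mean of everything seen so far, keeping only two numbers of state."""
    count, total = 0, 0.0
    for value in values:
        count += 1
        total += value
        yield total / count

# Keep only the subset we care about (sensor "a"); everything else is dropped at once.
subset = (value for sensor_id, value in sensor_readings(1000) if sensor_id == "a")

latest = None
for latest in running_mean(subset):
    pass  # a real pipeline would print each update or push it to a dashboard here
```

No reading is ever kept in a list: each one is consumed, folded into `count` and `total`, and discarded.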
Streamz: a Python library for dealing with streaming data • Pythonic • Simple in simple cases • Flexible enough for complex cases • Integrates well with Python libraries (Jupyter, Pandas etc.) • Other Python libraries for streaming data? Not many. • Scikit-multiflow: Python machine learning library for streaming data (incremental learning)
Problems • Network traffic analysis is critical for network administrators to manage their networks efficiently. • Network traffic is continuous in nature. • Network traffic payloads are large, but the headers are much smaller. • Can we tap the traffic stream to produce near real-time analysis?
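The header-versus-payload point can be illustrated with a small sketch that reads only the fixed 20-byte IPv4 header of each packet and updates a per-source byte count on the fly. This is not the demo code from the deck. The packets are synthetic, `make_packet` and `update` are hypothetical helpers, and IPv4 options are assumed absent.

```python
import socket
import struct
from collections import Counter

# Fixed 20-byte part of an IPv4 header, in network byte order:
# version/IHL, TOS, total length, id, flags/fragment, TTL, protocol, checksum, src, dst
IPV4_HEADER = struct.Struct("!BBHHHBBH4s4s")

def parse_ipv4_header(packet: bytes) -> dict:
    """Read only the first 20 bytes; the payload is never inspected."""
    (ver_ihl, _tos, total_length, _ident, _frag,
     _ttl, protocol, _checksum, src, dst) = IPV4_HEADER.unpack_from(packet)
    return {
        "version": ver_ihl >> 4,
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
        "protocol": protocol,
        "total_length": total_length,
    }

def make_packet(src: str, dst: str, payload: bytes, protocol: int = 6) -> bytes:
    """Build a synthetic packet for this sketch (checksum left at 0)."""
    header = IPV4_HEADER.pack(0x45, 0, 20 + len(payload), 0, 0, 64, protocol, 0,
                              socket.inet_aton(src), socket.inet_aton(dst))
    return header + payload

bytes_per_source = Counter()

def update(packet: bytes) -> None:
    """Fold one packet into the running per-source totals, then forget it."""
    header = parse_ipv4_header(packet)
    bytes_per_source[header["src"]] += header["total_length"]

for pkt in [make_packet("10.0.0.1", "10.0.0.9", b"x" * 100),
            make_packet("10.0.0.2", "10.0.0.9", b"y" * 40),
            make_packet("10.0.0.1", "10.0.0.9", b"z" * 10)]:
    update(pkt)
```

Only a few integers per source address are retained, so memory use does not grow with traffic volume. The `update` function is the kind of callback that could be attached to a stream sink.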
• Streaming data is valuable and abundant. • Where storing the data is expensive (for some), computing on the fly is preferable. • Near real-time analysis assists rapid decision making. • Python offers an accessible way to customize stream processing with manageable complexity.