Managing your Big Data: Examples and Use Cases

WQD 7007 Guest Lecture.

Faiz Zaki

April 20, 2021

Transcript

  1. Guest Lecture Name WQD 7007 - Big Data Management
    Managing your big data: examples and use cases
    Faiz Zaki, Network Analytics Lab, Universiti Malaya
    20 April 2021
  2. Introduction
    • Big Data is... big.
    • The 3Vs: volume, velocity, variety.
    • The term Big Data started growing in popularity in the last decade (from around 2012 onward).
    • Hadoop, created by Doug Cutting and Mike Cafarella, had appeared a few years earlier (2006).
    • Do we really have or need big data? Why?
  3. Introduction
    • According to IDC, as much as 90% of digital data is unstructured.
    • Structured data fits nicely in a table: names, addresses etc.
    • Unstructured data exists in its raw form: text, social media posts, emails, videos etc.
  4. Introduction
    • All of this data leads to another issue.
    • How do you store and use it?
    • Databases, data warehouses (ETL) versus data lakes (ELT).
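
The difference between ETL (transform before loading, typical of data warehouses) and ELT (load raw data first, transform later inside the store, typical of data lakes) can be sketched with Python's built-in `sqlite3` module. This is only a toy illustration: the records, the table names `warehouse`, `lake_raw` and `lake_clean`, and the cleaning rule are all hypothetical and not taken from the lecture.

```python
import sqlite3

# Hypothetical raw records with inconsistent spacing and capitalization.
RAW_ROWS = [("  alice ", "kuala lumpur"), ("BOB", "Petaling Jaya ")]


def clean(row):
    """Transform step: trim spaces and lowercase both fields."""
    name, city = row
    return name.strip().lower(), city.strip().lower()


def etl(conn):
    """ETL: transform in application code, then load only the cleaned rows."""
    conn.execute("CREATE TABLE warehouse (name TEXT, city TEXT)")
    conn.executemany(
        "INSERT INTO warehouse VALUES (?, ?)", [clean(r) for r in RAW_ROWS]
    )
    return conn.execute("SELECT name, city FROM warehouse ORDER BY name").fetchall()


def elt(conn):
    """ELT: load the raw rows untouched, then transform inside the store."""
    conn.execute("CREATE TABLE lake_raw (name TEXT, city TEXT)")
    conn.executemany("INSERT INTO lake_raw VALUES (?, ?)", RAW_ROWS)
    # The transformation is a view, so the raw data stays available.
    conn.execute(
        "CREATE VIEW lake_clean AS "
        "SELECT LOWER(TRIM(name)) AS name, LOWER(TRIM(city)) AS city FROM lake_raw"
    )
    return conn.execute("SELECT name, city FROM lake_clean ORDER BY name").fetchall()


print(etl(sqlite3.connect(":memory:")))
print(elt(sqlite3.connect(":memory:")))
```

Both paths return the same cleaned rows. The practical difference is that the ELT path keeps the raw table, so a different transformation can be applied later without re-ingesting the data.
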
  5. Tools
    • Hierarchical Data Format (HDF5)
      • File-directory-like structure, i.e. groups and datasets
      • A single file
    • Vaex
      • Lazy, out-of-core processing
    • Dask
      • Parallel multi-core processing
      • Built on top of pandas
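
The idea behind lazy, out-of-core processing can be sketched with the standard library alone: read the data in small chunks and keep only running totals in memory. The sketch below does not use the Vaex or Dask APIs; the function names, the column name `bytes` and the sample data are hypothetical, and an in-memory string stands in for a file too large to fit in RAM.

```python
import csv
import io
from itertools import islice


def read_chunks(lines, chunk_size):
    """Lazily yield lists of at most chunk_size parsed rows."""
    reader = csv.DictReader(lines)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def streaming_mean(lines, column, chunk_size=2):
    """Compute a column mean while holding only one chunk in memory."""
    total = 0.0
    count = 0
    for chunk in read_chunks(lines, chunk_size):
        total += sum(float(row[column]) for row in chunk)
        count += len(chunk)
    return total / count if count else float("nan")


# Hypothetical sample; in practice `lines` would be an open file handle.
sample = io.StringIO("host,bytes\na,100\nb,300\nc,200\nd,400\ne,500\n")
print(streaming_mean(sample, "bytes"))
```

Libraries such as Vaex and Dask apply the same principle at scale, and Dask additionally spreads the chunks across CPU cores.
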
  6. Tools
    • Hadoop
      • MapReduce
      • HDFS
        • Name Node
        • Data Node
      • YARN
    • Hortonworks, Cloudera, Azure HDInsight and Amazon EMR simplify deployment of Hadoop clusters.
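
The MapReduce programming model can be illustrated with the classic word-count example. The following is a single-process sketch of the map, shuffle and reduce phases in plain Python, not Hadoop code; on a real cluster the map and reduce tasks would run in parallel on different nodes over data stored in HDFS. The function names and input strings are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    """Run the three phases in sequence over all input splits."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle_phase(pairs))


# Each string stands in for one input split.
splits = ["big data is big", "data lakes store raw data"]
print(word_count(splits))
```
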
  7. Tools
    Cloud Data Warehouses
    • Google BigQuery
    • Snowflake
    Data Mining (ETL)
    • Xplenty
    • RapidMiner
  8. Use Case: Problems
    • Pusat Teknologi Maklumat (PTM), UM has been facing constant cyber attacks. MyCERT often identified UM as a source of botnet attacks. As PTM implemented dynamic IP address allocation for all authenticated users, it became difficult to trace the origin of any attack.
    • PTM's Palo Alto firewalls were underutilized.
    • Traffic logs amounted to approximately 50 GB daily, excluding payloads.
    • There was no centralized system (SIEM) to monitor the network.
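
To get a feel for the scale, the 50 GB daily figure can be turned into a yearly volume and an average ingest rate. This is back-of-the-envelope arithmetic only: it assumes decimal units (1 TB = 1000 GB) and a constant daily volume, neither of which is stated in the lecture.

```python
DAILY_GB = 50  # from the slide: roughly 50 GB of traffic logs per day
SECONDS_PER_DAY = 24 * 60 * 60


def yearly_tb(daily_gb, days=365):
    """Yearly volume in TB, assuming a constant daily volume."""
    return daily_gb * days / 1000


def average_mb_per_second(daily_gb):
    """Average ingest rate in MB/s if the logs arrived evenly over the day."""
    return daily_gb * 1000 / SECONDS_PER_DAY


print(f"{yearly_tb(DAILY_GB):.2f} TB per year")
print(f"{average_mb_per_second(DAILY_GB):.2f} MB/s on average")
```
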
  9. Use Case: Solution
    • Palo Alto firewalls, Active Directory (AD) servers and other sources send their logs to a SIEM built on the ELK stack, which acts as the data lake.
    • Logstash ingests the logs and Elasticsearch indexes them for rapid retrieval.
    • The system maps users to IP addresses at a fixed interval for accurate identification.
    • Kibana visualizes the logs.
    • Search Guard manages authentication to ELK.
    • Skedler sends out automated reports to PTM.
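
The user-to-IP mapping step can be sketched as a time-indexed lookup. With dynamic IP allocation the same address belongs to different users over time, so each log event has to be matched against the most recent mapping snapshot taken at or before the event. The sketch below is an assumption about how such a lookup could work, not PTM's actual implementation; the snapshot format, the 300-second interval, the addresses and the user names are all hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(snapshots):
    """Group (timestamp, ip, user) snapshots by IP, ordered by time."""
    index = defaultdict(list)
    for ts, ip, user in sorted(snapshots):
        index[ip].append((ts, user))
    return index


def user_at(index, ip, ts):
    """Return the user from the latest snapshot of `ip` at or before `ts`."""
    history = index.get(ip, [])
    times = [t for t, _ in history]
    pos = bisect_right(times, ts)
    return history[pos - 1][1] if pos else None


# Hypothetical snapshots taken every 300 seconds.
snapshots = [
    (0, "10.0.0.5", "alice"),
    (300, "10.0.0.5", "bob"),
    (300, "10.0.0.9", "alice"),
]
index = build_index(snapshots)
print(user_at(index, "10.0.0.5", 120))
print(user_at(index, "10.0.0.5", 450))
```

One limitation of this approach is that an address reassigned between two snapshots is attributed to the previous user until the next snapshot, so a shorter interval gives more accurate identification at the cost of more mapping data.
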
  10. Use Case: Outcomes
    • Saved (a lot of) money.
    • Able to map IP addresses to specific users.
    • Identified the source of attacks.
    • Built network profiles.
  11. Take-home messages
    • You might have big data if you constantly deal with both structured and unstructured data.
    • More often than not, standalone tools such as Python and its libraries are capable of handling big data.
    • Various point-and-click tools are available.
    • Big data technologies have abstracted away so much of the complexity of handling big data that we are beginning to view it as just normal data.