Distributed Adaptive Model Rules for Mining Big Data Streams

Decision rules are among the most expressive data mining models. We propose the first distributed streaming algorithm to learn decision rules for regression tasks. The algorithm is available in SAMOA (SCALABLE ADVANCED MASSIVE ONLINE ANALYSIS), an open-source platform for mining big data streams. It uses a hybrid of vertical and horizontal parallelism to distribute Adaptive Model Rules (AMRules) on a cluster. The decision rules built by AMRules are comprehensible models, where the antecedent of a rule is a conjunction of conditions on the attribute values, and the consequent is a linear combination of the attributes. Our evaluation shows that this implementation is scalable in relation to CPU and memory consumption. On a small commodity Samza cluster of 9 nodes, it can handle a rate of more than 30000 instances per second, and achieve a speedup of up to 4.7x over the sequential version.

Transcript

  1. Distributed Adaptive Model Rules for Mining Big Data Streams

    Anh Thu Vu, Gianmarco De Francisci Morales, Joao Gama, Albert Bifet
  2. Motivation

    Regression: fundamental machine learning task
    • Predict how much rain tomorrow
    Applications
    • Trend prediction
    • Click-through rate prediction
  3. Regression

    Input: training examples with numeric label
    Output: model that predicts value of unlabeled instance x
    $\hat{y} = f(x)$
    Minimize error: $\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
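
As a minimal sketch of the error measure on this slide, the mean squared error can be computed as below. The function name `mean_squared_error` and the sample values are illustrative assumptions, not taken from the deck.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between labels and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / len(y_true)


# Hypothetical rainfall labels (mm) and model predictions, for illustration only.
labels = [0.0, 2.0, 5.0]
predictions = [1.0, 2.0, 3.0]
mse = mean_squared_error(labels, predictions)
```

A regression learner tries to pick the model that makes this value as small as possible on the instances it sees.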
  4. SAMOA

    Data Mining
    • Distributed, Batch: Hadoop, Mahout
    • Distributed, Stream: Storm, S4, Samza (SAMOA)
    • Non Distributed, Batch: R, WEKA,…
    • Non Distributed, Stream: MOA
    G. De Francisci Morales, A. Bifet: “SAMOA: Scalable Advanced Massive Online Analysis”. JMLR (2014) http://samoa-project.net
  5. Rules

    Rules: self-contained, modular, easy to interpret, no need to cover universe
    A rule keeps sufficient statistics to:
    • make predictions
    • expand the rule
    • detect changes and anomalies
  6. AMRules: Adaptive Model Rules

    Rule sets
    Ruleset: ensemble of rules
    Rule prediction: mean, linear model
    Ruleset prediction:
    • Weighted avg. of predictions of rules covering instance x
    • Weights inversely proportional to error
    • Default rule covers uncovered instances
    Predicting with a rule set, e.g. x = [4, 1, 1, 2]:
    $\hat{f}(x) = \sum_{R_l \in S(x)} \theta_l \hat{y}_l$
    where S(x) is the set of rules covering x, $\hat{y}_l$ is the prediction of rule $R_l$ and $\theta_l$ is its weight.
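
The ruleset prediction described on this slide can be sketched as follows. This is an illustration under stated assumptions, not SAMOA's implementation: the `Rule` class, the `ruleset_predict` function, the small `eps` guard and the example rules are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Rule:
    covers: Callable[[Sequence[float]], bool]    # antecedent: conditions on attribute values
    predict: Callable[[Sequence[float]], float]  # consequent: mean or linear model
    error: float                                 # current error estimate of the rule


def ruleset_predict(rules: List[Rule], default: Rule, x: Sequence[float]) -> float:
    """Weighted average over covering rules, weights inversely proportional to error."""
    covering = [r for r in rules if r.covers(x)]
    if not covering:
        # The default rule handles instances that no other rule covers.
        return default.predict(x)
    eps = 1e-9  # assumption: avoids division by zero for a rule with zero error
    inverse_errors = [1.0 / (r.error + eps) for r in covering]
    total = sum(inverse_errors)
    return sum(w / total * r.predict(x) for w, r in zip(inverse_errors, covering))


# Hypothetical rules for illustration only.
rules = [
    Rule(covers=lambda x: x[0] > 3, predict=lambda x: 10.0, error=1.0),
    Rule(covers=lambda x: x[3] <= 2, predict=lambda x: 20.0, error=3.0),
    Rule(covers=lambda x: x[1] > 5, predict=lambda x: 99.0, error=0.5),
]
default_rule = Rule(covers=lambda x: True, predict=lambda x: 15.0, error=2.0)

covered = ruleset_predict(rules, default_rule, [4, 1, 1, 2])
uncovered = ruleset_predict(rules, default_rule, [0, 0, 0, 9])
```

For x = [4, 1, 1, 2] the first two rules cover the instance. The rule with error 1.0 receives three times the weight of the rule with error 3.0, so the result is pulled towards its prediction.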
  7. AMRules

    Rule sets (from: Ensembles of Adaptive Model Rules from High-Speed Data Streams)

    Algorithm 1: Training AMRules
    Input: S: stream of examples
    begin
        R ← {}, D ← 0
        foreach (x, y) ∈ S do
            foreach Rule r ∈ S(x) do
                if ¬IsAnomaly(x, r) then
                    if PHTest(error_r, λ) then
                        Remove the rule from R
                    else
                        Update sufficient statistics L_r
                        ExpandRule(r)
            if S(x) = ∅ then
                Update L_D
                ExpandRule(D)
                if D expanded then
                    R ← R ∪ D
                    D ← 0
        return (R, L_D)

    Rule Induction
    • Rule creation: default rule expansion
    • Rule expansion: split on attribute maximizing σ reduction
    • Hoeffding bound ε: $\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$
    • Expand when σ1st/σ2nd < 1 - ε
    • Evict rule when drift is detected (Page-Hinckley test error large)
    • Detect and explain local anomalies
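
The two statistical tools named on this slide can be sketched as below. The Hoeffding bound follows the formula on the slide. The Page-Hinckley detector is the textbook form for detecting an increase in a signal. The names `hoeffding_bound`, `should_expand` and `PageHinckley`, the `alpha` tolerance and all numeric values are assumptions made for illustration, and how σ1st and σ2nd are computed is left out.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n)) after n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_expand(sigma_1st: float, sigma_2nd: float, eps: float) -> bool:
    """Expansion test as written on the slide: sigma_1st / sigma_2nd < 1 - eps."""
    if sigma_2nd == 0:
        return False  # assumption: no decision when the ratio is undefined
    return sigma_1st / sigma_2nd < 1.0 - eps


class PageHinckley:
    """Signals drift when the cumulative deviation of the error grows too large."""

    def __init__(self, threshold: float, alpha: float = 0.005):
        self.threshold = threshold  # lambda in Algorithm 1
        self.alpha = alpha          # assumed tolerance for small fluctuations
        self.n = 0
        self.mean = 0.0
        self.cumulative = 0.0
        self.minimum = 0.0

    def update(self, error: float) -> bool:
        self.n += 1
        self.mean += (error - self.mean) / self.n
        self.cumulative += error - self.mean - self.alpha
        self.minimum = min(self.minimum, self.cumulative)
        return self.cumulative - self.minimum > self.threshold


eps_200 = hoeffding_bound(value_range=1.0, delta=1e-7, n=200)
eps_2000 = hoeffding_bound(value_range=1.0, delta=1e-7, n=2000)

detector = PageHinckley(threshold=5.0)
stable_alarms = [detector.update(0.1) for _ in range(100)]  # steady error
drift_alarms = [detector.update(1.0) for _ in range(20)]    # error jumps
```

The bound shrinks as more instances are observed, so a rule needs less separation between candidate splits before it expands. The detector stays silent while the error is steady and raises an alarm shortly after the error jumps, which is the condition under which a rule is evicted.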
  8. DSPEs

    [Diagram: live streams (Stream 1, Stream 2, Stream 3) enter a graph of PEs connected by event routing; results go to an External Persister (Output 1, Output 2)]
  9. Example

    status.text: "Introducing #S4: a distributed #stream processing system"
    Events (EV, KEY, VAL):
    • EV RawStatus, KEY null, VAL text="Int..."
    • EV Topic, KEY topic="S4", VAL count=1
    • EV Topic, KEY topic="stream", VAL count=1
    • EV Topic, KEY reportKey="1", VAL topic="S4", count=4
    PEs:
    • TopicExtractorPE (PE1) extracts hashtags from status.text
    • TopicCountAndReportPE (PE2-3) keeps counts for each topic across all tweets. Regularly emits report event if topic count is above a configured threshold.
    • TopicNTopicPE (PE4) keeps counts for top topics and outputs top-N topics to external persister
  10. Groupings

    • Key Grouping (hashing)
    • Shuffle Grouping (round-robin)
    • All Grouping (broadcast)
    [Diagram: PEs routing events to PE instances (PEIs)]
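
The three groupings can be sketched as routing functions that return the indices of the PE instances an event is sent to. This is a simplified illustration, not the API of any particular engine: the function names are hypothetical, and `zlib.crc32` is used only as a deterministic stand-in hash.

```python
import itertools
import zlib
from typing import Callable, List


def key_grouping(key: str, n_instances: int) -> List[int]:
    """Hash the key so that events with the same key reach the same instance."""
    return [zlib.crc32(key.encode("utf-8")) % n_instances]


def make_shuffle_grouping(n_instances: int) -> Callable[[], List[int]]:
    """Round-robin: each call returns the next instance in turn."""
    counter = itertools.cycle(range(n_instances))

    def route() -> List[int]:
        return [next(counter)]

    return route


def all_grouping(n_instances: int) -> List[int]:
    """Broadcast: every instance receives the event."""
    return list(range(n_instances))


n = 4
same_key_targets = [key_grouping("S4", n) for _ in range(3)]
shuffle = make_shuffle_grouping(n)
shuffle_targets = [shuffle() for _ in range(5)]
broadcast_targets = all_grouping(n)
```

Key grouping keeps all state for one key on one instance, shuffle grouping spreads load evenly without regard to content, and all grouping replicates an event everywhere.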
  20. VAMR: Vertical AMRules

    Model: rule body + head
    • Target mean updated continuously with covered instances for predictions
    • Default rule (creates new rules)
    [Diagram: the Model Aggregator receives Instances and emits Predictions; it exchanges New Rules and Rule Updates with Learner 1, Learner 2, …, Learner p]
  21. VAMR

    Learner: statistics
    • Vertical: Learner tracks statistics of independent subset of rules
    • One rule tracked by only one Learner
    • Model -> Learner: key grouping on rule ID
    [Diagram: the Model Aggregator receives Instances and emits Predictions; it exchanges New Rules and Rule Updates with Learner 1, Learner 2, …, Learner p]
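
The vertical partitioning on this slide can be sketched as below, assuming integer rule IDs and a modulo on the rule ID as a stand-in for key grouping. The `Learner` class, the `learner_for` function and the statistics kept (count and running target mean) are hypothetical simplifications; the real learners keep richer sufficient statistics.

```python
from collections import defaultdict


class Learner:
    """Tracks statistics only for the rules routed to it."""

    def __init__(self):
        self.stats = defaultdict(lambda: {"n": 0, "mean": 0.0})

    def update(self, rule_id: int, y: float) -> None:
        s = self.stats[rule_id]
        s["n"] += 1
        s["mean"] += (y - s["mean"]) / s["n"]  # incremental target mean


def learner_for(rule_id: int, parallelism: int) -> int:
    """Stand-in for key grouping on rule ID: one rule always maps to one learner."""
    return rule_id % parallelism


learners = [Learner() for _ in range(3)]

# Hypothetical (rule_id, target) pairs forwarded by the model for covered instances.
events = [(0, 1.0), (1, 4.0), (0, 3.0), (4, 10.0), (1, 6.0)]
for rule_id, y in events:
    learners[learner_for(rule_id, len(learners))].update(rule_id, y)

rules_per_learner = [set(learner.stats) for learner in learners]
```

Because the mapping depends only on the rule ID, the subsets of rules held by different learners never overlap, which is what lets them update their statistics independently.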
  22. HAMR: Hybrid AMRules (Vertical + Horizontal)

    VAMR single model is bottleneck
    Shuffle among multiple Models for parallelism
    [Diagram: Instances are shuffled among Model Aggregator 1, Model Aggregator 2, …, Model Aggregator r, which emit Predictions and exchange New Rules and Rule Updates with the Learners]
  23. HAMR

    Problem: distributed default rule decreases performance
    Separate dedicated Learner for default rule
    [Diagram: the Model Aggregators receive Instances and emit Predictions; they exchange Rule Updates with the Learners; the Default Rule Learner supplies New Rules]
  24. Task Overview

    Instances, Rules, Predictions
    Double line = broadcast
    Source -> Model = shuffle grouping
    Model -> Learner = key grouping
    [Diagram: Source, Model Aggregator, Learner, Default Rule Learner, Evaluator]
  25. Experiments

    10-node Samza cluster + Kafka
    2 VCPUs, 4 GB RAM
    Throughput, Accuracy, Memory usage
    Compare with sequential algorithm in MOA

    Dataset       # instances   # attributes
    Airlines      5.8M          10
    Electricity   2M            12
    Waveform      1M            40
  26. Throughput (Airlines)

    [Fig. 6: Throughput of distributed AMRules with airlines. x-axis: Parallelism Level (1, 2, 4, 8); y-axis: Throughput (thousands instances/second), 0 to 35; series: MAMR, VAMR, HAMR-1, HAMR-2]
  27. Throughput (Electricity)

    [Fig. 5: Throughput of distributed AMRules with electricity. x-axis: Parallelism Level (1, 2, 4, 8); y-axis: Throughput (thousands instances/second), 0 to 35; series: MAMR, VAMR, HAMR-1, HAMR-2]
  28. Throughput (Waveform)

    [Fig. 7: Throughput of distributed AMRules with waveform. x-axis: Parallelism Level (1, 2, 4, 8); y-axis: Throughput (thousands instances/second), 0 to 35; series: MAMR, VAMR, HAMR-1, HAMR-2]
  29. Throughput / Message Size

    [Fig. 8: Maximum throughput of HAMR vs message size. x-axis: Result message size (B), with ticks at 500, 1000 and 2000; y-axis: Throughput (thousands instances/second), 0 to 50; series: Reference, Max throughput; the Airlines, Electricity and Waveform datasets are marked on the plot]
  30. Accuracy (Airlines)

    [Figure, panel (b) RMSE. x-axis: Parallelism Level (1, 2, 4, 8); y-axis: RMSE/(Max-Min), 0 to 0.02; series: MAMR, VAMR, HAMR-1, HAMR-2]
  31. Accuracy (Electricity)

    [Figure, panel (b) RMSE. x-axis: Parallelism Level (1, 2, 4, 8); y-axis: RMSE/(Max-Min), 0 to 0.35; series: MAMR, VAMR, HAMR-1, HAMR-2]
  32. Accuracy (Waveform)

    [Figure, panel (b) RMSE. x-axis: Parallelism Level (1, 2, 4, 8); y-axis: RMSE/(Max-Min), 0 to 0.4; series: MAMR, VAMR, HAMR-1, HAMR-2]
  33. Memory Usage

    TABLE III: Memory consumption of VAMR for different datasets and parallelism levels.

                                Memory consumption (MB)
                                Model Aggregator      Learner
    Dataset       Parallelism   Avg.     Std. Dev.    Avg.     Std. Dev.
    Electricity   1             266.5    5.6          40.1     4.3
                  2             264.9    2.8          23.8     3.9
                  4             267.4    6.6          20.1     3.2
                  8             273.5    3.9          34.7     29
    Airlines      1             337.6    2.8          83.6     4.1
                  2             338.1    1.0          38.7     1.8
                  4             337.3    1.0          38.8     7.1
                  8             336.4    0.8          31.7     0.2
    Waveform      1             286.3    5.0          171.7    2.5
                  2             286.8    4.3          119.5    10.4
                  4             289.1    5.9          46.5     12.1
                  8             287.3    3.1          33.8     5.7
  34. Memory Usage

    TABLE II: Memory consumption of MAMR for different datasets.

                  Memory consumption (MB)
    Dataset       Avg.     Std. Dev.
    Electricity   52.4     2.1
    Airlines      120.7    51.1
    Waveform      223.5    8
  35. Memory Usage (Learner)

    [Figure: Memory Usage of Learner. y-axis: Average Memory Usage (MB), 0 to 200; groups: Airlines, Electricity, Waveform; series: P=1, P=2, P=4, P=8]
  36. Conclusions

    • Distributed streaming algorithm for regression
    • Runs on top of distributed stream processing engines
    • Up to ~5x increase in throughput
    • Accuracy comparable with sequential algorithm
    • Scalable memory usage