Unbalanced data: Same algorithms different techniques by Eric Martín at Big Data Spain 2017

Unbalanced data, where one class is far rarer than the others, appears commonly in nature. Applying machine learning techniques to this kind of data is difficult, and the problem is usually addressed with techniques that reduce the imbalance.

https://www.bigdataspain.org/2017/talk/unbalanced-data-same-algorithms-different-techniques

Big Data Spain 2017
November 16th - 17th Kinépolis Madrid

Big Data Spain

December 04, 2017

Transcript

  1. ALGORITHMS POINT OF VIEW
    ▪ Accuracy
    ▪ 1,000,000 total transactions (TRX)
    ▪ 10 fraud TRX = 99.999% accuracy when every TRX is labelled as non-fraud
    Recall, F1 score, detection probability
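The arithmetic behind this slide can be sketched in a few lines. The snippet assumes a hypothetical baseline that labels every transaction as non-fraud; the function name and the baseline are illustrative and do not come from the talk.

```python
def accuracy_and_recall(tp, fp, tn, fn):
    """Return (accuracy, recall) from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # Recall is undefined when there are no positives; report 0.0 in that case.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall


TOTAL_TRX = 1_000_000  # figures taken from the slide
FRAUD_TRX = 10

# Assumed baseline: predict "non-fraud" for every transaction,
# so all 10 frauds become false negatives.
accuracy, recall = accuracy_and_recall(
    tp=0, fp=0, tn=TOTAL_TRX - FRAUD_TRX, fn=FRAUD_TRX
)
print(f"accuracy = {accuracy:.5f}, recall = {recall:.1f}")
```

Accuracy lands just below 1 while recall is 0, which is why the slide points to recall, F1 score and detection probability as the metrics to watch.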
  2. UNDERSTANDING THE PROBLEM
    ▪ Scattering (confusion) matrix: rows Real 0 / Real 1, columns Pred. 0 / Pred. 1
    LESS ACCURACY!
    Trading, illness detection: a second matrix with the same rows and columns
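A confusion matrix of this shape (real class by row, predicted class by column) can be built by simple counting. The sketch below uses hypothetical toy labels, since the cell values of the slide's matrices are not in the transcript, and derives precision, recall and F1 from the counts.

```python
def confusion_matrix(y_real, y_pred):
    """2x2 matrix of counts: rows are real classes, columns are predicted classes."""
    matrix = [[0, 0], [0, 0]]
    for real, pred in zip(y_real, y_pred):
        matrix[real][pred] += 1
    return matrix


def precision_recall_f1(matrix):
    """Metrics for class 1, guarding against empty denominators."""
    tp, fp, fn = matrix[1][1], matrix[0][1], matrix[1][0]
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy labels: 8 negatives and 2 positives.
y_real = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

matrix = confusion_matrix(y_real, y_pred)
precision, recall, f1 = precision_recall_f1(matrix)
print(matrix, precision, recall, f1)
```

On these toy labels 8 of 10 predictions are correct, yet only half of the positives are found, so the per-class metrics tell a different story from accuracy.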
  3. MOST COMMON PRACTICES
    ▪ Dimensionality reduction:
      ▫ SMOTE
      ▫ Synthetic samples creation
    [Figure: two panels, each with legend Y = 0, Y = 1]
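Synthetic sample creation in the spirit of SMOTE can be illustrated by interpolating between a minority-class point and one of its nearest minority neighbours. This is a simplified sketch of that idea only, not the reference algorithm or any library's API; `smote_like`, its parameters and the toy points are hypothetical.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points on segments between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority points to the chosen base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical minority-class points (Y = 1) with two features each.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(minority, n_new=6)
print(new_points)
```

Every synthetic point lies on a segment between two existing minority points, so the minority class grows without duplicating rows exactly.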
  4. SAME ALGORITHMS DIFFERENT TECHNIQUES
    ▪ If you expect different results you have to do different things
    ▪ Exploit all the data you have
    ▪ Bagging algorithm: first step, Random Forest
  5. RANDOM FOREST
    #   F1   F2  F3     …  FN     Y
    1   1.2  25  True   …  0.185  1
    2   3.4  55  False  …  0.211  1
    3   2.2  58  True   …  0.171  0
    4   4.0  34  True   …  0.132  1
    5   1.1  63  True   …  0.652  0
    6   0.7  61  False  …  0.153  0
    7   3.3  12  False  …  0.477  1
    8   3.1  23  True   …  0.311  1
    9   1.2  29  False  …  0.171  1
    10  3.4  45  True   …  0.132  0
    11  2.1  55  True   …  0.652  1
    12  1.7  19  False  …  0.189  0
    13  3.3  12  False  …  0.477  1
    14  3.1  23  True   …  0.311  1
    15  1.2  29  False  …  0.171  1
    16  2.2  58  True   …  0.171  0
    …
  6. RANDOM FOREST
    New sample:
      F1   F2  F3     …  FN     Y
      1.5  25  False  …  0.185  ???
    Tree votes: 1, 1, 0
    MAJORITY VOTE: 1
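The majority vote on this slide can be written out directly. A minimal sketch, assuming each tree returns a class label and the forest picks the most frequent one; `majority_vote` is an illustrative helper, not a function from the talk.

```python
from collections import Counter


def majority_vote(votes):
    """Return the class predicted by most trees (the first seen wins a tie)."""
    return Counter(votes).most_common(1)[0][0]


tree_votes = [1, 1, 0]  # the three votes shown on the slide
prediction = majority_vote(tree_votes)
print(prediction)  # prints 1
```

With an odd number of trees and two classes a tie cannot occur, so the tie rule only matters for other configurations.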
  7. EM FOREST
    [Table: the same F1, F2, F3, …, FN, Y rows as in slide 5]
  8. EM FOREST: Transforming the problem
    [Table: the same F1, F2, F3, …, FN, Y rows as in slide 5, mapped to per-tree predictions]
    #        Tree1  Tree2  Tree3  Y
    1        1      1      0      1
    2 to 18  (row numbers only, no values shown)
  9. EM FOREST: The new problem
    #   Tree1  Tree2  Tree3  Y
    1   1      1      0      1
    2   1      0      1      1
    3   1      1      1      0
    4   0      1      0      1
    5   0      0      0      0
    6   1      0      1      0
    7   0      1      0      1
    8   0      1      0      1
    9   1      0      1      1
    10  1      1      0      0
    11  0      1      0      1
    12  0      0      1      0
    13  1      0      1      1
    14  1      1      0      1
    15  1      1      0      1
    16  0      0      1      0
    17  0      1      0      1
    18  1      0      0      0
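One way to read "the new problem" is that the tree predictions become the features of a second learning task. The transcript does not say which model is trained on them, so the sketch below assumes the simplest possible second stage: a lookup table mapping each vote vector to its most frequent Y in the 18 rows of the slide. `fit_vote_table` and `majority_vote` are hypothetical helpers.

```python
from collections import Counter, defaultdict

# (Tree1, Tree2, Tree3, Y) for rows 1..18, as listed on the slide.
ROWS = [
    (1, 1, 0, 1), (1, 0, 1, 1), (1, 1, 1, 0), (0, 1, 0, 1), (0, 0, 0, 0),
    (1, 0, 1, 0), (0, 1, 0, 1), (0, 1, 0, 1), (1, 0, 1, 1), (1, 1, 0, 0),
    (0, 1, 0, 1), (0, 0, 1, 0), (1, 0, 1, 1), (1, 1, 0, 1), (1, 1, 0, 1),
    (0, 0, 1, 0), (0, 1, 0, 1), (1, 0, 0, 0),
]


def fit_vote_table(rows):
    """Assumed second stage: map each vote vector to its most frequent Y."""
    counts = defaultdict(Counter)
    for *votes, y in rows:
        counts[tuple(votes)][y] += 1
    return {votes: c.most_common(1)[0][0] for votes, c in counts.items()}


def majority_vote(votes):
    """Plain Random Forest rule: 1 if more than half of the trees vote 1."""
    return int(sum(votes) * 2 > len(votes))


table = fit_vote_table(ROWS)
votes = (0, 1, 0)
print(majority_vote(votes), table[votes])  # prints: 0 1
```

For the vote vector (0, 1, 0) the majority vote answers 0, while all five rows with that vector on the slide have Y = 1, so a model trained on the votes can recover information that the majority rule discards. The table is fitted and queried on the same toy rows, so this shows the mechanism only and says nothing about generalisation.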
  10. EM FOREST: The new possibilities
    ▪ Vector vs. Aggregated
    #  Tree1  Tree2  Tree3  Y      #  Agg  Y
    1  1      1      0      1      1  2    1
    2  1      0      1      1      2  2    1
    3  1      1      1      0      3  3    0
    4  0      1      0      1      4  1    1
    5  0      0      0      0      5  0    0
    6  1      0      1      0      6  2    0
    7  0      1      0      1      7  1    1
    8  0      1      0      1      8  1    1
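The two encodings on this slide can be derived from the same tree votes. The sketch assumes that the aggregated value is the number of trees voting 1; the transcript does not define it, and `encode` and `labels_by_key` are hypothetical helpers.

```python
from collections import defaultdict

# (Tree1, Tree2, Tree3, Y) for rows 1..8, as listed on the slide.
ROWS = [
    (1, 1, 0, 1), (1, 0, 1, 1), (1, 1, 1, 0), (0, 1, 0, 1),
    (0, 0, 0, 0), (1, 0, 1, 0), (0, 1, 0, 1), (0, 1, 0, 1),
]


def encode(rows):
    """Return (vector features, aggregated features, labels)."""
    vectors = [tuple(r[:3]) for r in rows]
    # Assumption: the aggregate is the count of trees that voted 1.
    aggregated = [sum(v) for v in vectors]
    labels = [r[3] for r in rows]
    return vectors, aggregated, labels


def labels_by_key(keys, labels):
    """Group the labels observed for each distinct feature value."""
    grouped = defaultdict(list)
    for key, y in zip(keys, labels):
        grouped[key].append(y)
    return dict(grouped)


vectors, aggregated, labels = encode(ROWS)
by_vector = labels_by_key(vectors, labels)
by_aggregate = labels_by_key(aggregated, labels)
print(by_vector)
print(by_aggregate)
```

The vector form keeps track of which tree voted for which class, while the aggregated form only keeps how many did: the vectors (1, 1, 0) and (1, 0, 1) both collapse to the aggregate 2.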
  11. EM FOREST: The new results
    ▪ Result improvement: better score (at least the same) than Random Forest
    ▪ Result flexibility: better in balanced and unbalanced data (trading and illness detection)
  12. EM FOREST: Use cases
    ▪ Real projects: credit card usage trends
    ▪ Demo projects: bank fraud, alcohol in students dataset