
Know Thy Neighbor: Scikit and the K-Nearest Neighbor Algorithm

This presentation will give a brief overview of machine learning, the k-nearest neighbor algorithm, and scikit-learn. Sometimes developers need to make decisions even when they don't have all of the required information. Machine learning attempts to solve this problem by using known data (a training sample) to make predictions about the unknown. For example, a user usually doesn't tell Amazon explicitly what type of book they want to read, but based on the user's purchase history and demographics, Amazon is able to infer what the user might like to read.

Scikit-learn implements the k-nearest neighbor algorithm and allows developers to make such predictions. Using training data, one could infer which type of food, TV show, or music a user prefers. In this presentation we will introduce the k-nearest neighbor algorithm and discuss when one might use it.

PyCon 2014

April 13, 2014

Transcript

  1. Know Thy Neighbor: An Introduction to Scikit-Learn and k-NN
     Portia Burton, PLB Analytics, www.github.com/pkafei
  2. About Me:
     • Organizer of the Portland Data Science group
     • Volunteer at HackOregon
     • Founder of PLB Analytics
  3. What We Will Cover Today:
     1. Brief intro to machine learning
     2. Go over scikit-learn
     3. Explain the k-nearest neighbor algorithm
     4. Demo of scikit-learn and k-NN
  4. What is Machine Learning?
     Algorithms use data to…
     • Create predictive models
     • Classify unknown entities
     • Discover patterns
  5. 70%: Clean and standardize data
     20%: Preprocess, train, validate
     10%: Analyze and visualize
  6. What is scikit-learn?
     • Python machine learning package
     • Great documentation
     • Has built-in datasets (e.g. the Boston housing market)
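
As a minimal sketch of what loading one of those bundled datasets looks like (here the iris data used later in this deck; the Boston housing data mentioned on the slide was exposed as load_boston in older scikit-learn releases):

    # Minimal sketch: load one of scikit-learn's bundled datasets.
    from sklearn.datasets import load_iris

    iris = load_iris()
    print(iris.data.shape)      # (150, 4) feature matrix
    print(iris.target.shape)    # (150,) class labels
    print(iris.feature_names)   # sepal/petal length and width, in cm
    print(iris.target_names)    # ['setosa' 'versicolor' 'virginica']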
  7. (image-only slide)

  8. Are You a Recipe? Yum.
     • Distinguishes 'recipe' notes from 'work' notes
     • Suggesting notebooks is a classification problem
     • Implements the naïve Bayes classification algorithm
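
The slide credits a naïve Bayes classifier rather than k-NN for this feature. As a rough, hypothetical sketch of that kind of text classifier in scikit-learn (the notes and labels below are invented for illustration, not the product's actual code):

    # Hypothetical toy example: tag short notes as 'recipe' or 'work'
    # with a bag-of-words naive Bayes model.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    notes = [
        "whisk two eggs with flour and sugar",
        "simmer the tomato sauce for twenty minutes",
        "send the quarterly report to the finance team",
        "schedule sprint planning for Monday",
    ]
    labels = ["recipe", "recipe", "work", "work"]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(notes)      # word-count feature matrix

    clf = MultinomialNB()
    clf.fit(X, labels)

    new_note = ["bake the cake for thirty minutes"]
    print(clf.predict(vectorizer.transform(new_note)))  # -> ['recipe'] on this toy data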
  9. Unsupervised Learning
     Data points are not labeled with outcomes.
     Patterns are found by the algorithm.
  10. k-NN
      • k-Nearest Neighbor algorithm
        – The simplest machine learning algorithm
        – It is a lazy algorithm: it doesn't run computations on the dataset
          until you give it a new data point to test
        – Our example uses k-NN for supervised learning
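
A minimal sketch of that laziness with scikit-learn's KNeighborsClassifier: fit() essentially just stores the training points, and the distance computations happen when predict() is called (the toy points below are invented):

    from sklearn.neighbors import KNeighborsClassifier

    # Toy 2-D training points and their class labels (invented for illustration).
    X_train = [[1.0, 1.1], [1.2, 0.9], [4.0, 4.2], [4.1, 3.9]]
    y_train = [0, 0, 1, 1]

    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X_train, y_train)          # mostly just stores the data

    # Distances to the stored points are computed here, at query time.
    print(knn.predict([[1.1, 1.0]]))   # -> [0]
    print(knn.predict([[3.9, 4.0]]))   # -> [1]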
  11. Majority Vote
      • Equal weight: each k-NN neighbor has an equal vote
      • Distance weight: each k-NN neighbor's vote is weighted by its distance
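
These two voting schemes correspond to the weights parameter of scikit-learn's KNeighborsClassifier; a small sketch with invented 1-D data where the two schemes disagree:

    from sklearn.neighbors import KNeighborsClassifier

    # Invented 1-D toy data where the two voting schemes disagree.
    X_train = [[0.0], [1.0], [2.0], [10.0]]
    y_train = [0, 0, 1, 1]

    # 'uniform': each of the k neighbors gets one vote.
    # 'distance': each neighbor's vote is weighted by 1 / distance.
    for scheme in ("uniform", "distance"):
        knn = KNeighborsClassifier(n_neighbors=3, weights=scheme)
        knn.fit(X_train, y_train)
        print(scheme, knn.predict([[1.8]]))
    # uniform  -> [0]  (two of the three neighbors are class 0)
    # distance -> [1]  (the very close class-1 neighbor dominates the vote)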
  12. Downsides of kNN
      • Since there is minimal training, there is a high computational cost
        when testing new data
      • Correlation can be falsely high (data points can be given too much weight)
  13. Our Data Set:
      • Typical!
      • Multivariate data set created in 1936
      • Analyzed by Sir Ronald Fisher
      • Collected by Edgar Anderson
  14. The plot from the use case
      (scatter plot of sepal length vs. sepal width, in cm, showing training
      data and test data)
  15. Example data points for each iris species
      Sepal length (x-axis)   Sepal width (y-axis)   Species
      5.1                     3.5                    I. setosa
      5.5                     2.3                    I. versicolor
      6.7                     2.5                    I. virginica
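
A sketch along the lines of the use case above: fit k-NN to just the two sepal measurements, hold out some test points, and score the classifier. The split, k, and resulting accuracy here are illustrative choices, not necessarily those used in the original demo; note that in 2014-era scikit-learn the split helper lived in sklearn.cross_validation rather than sklearn.model_selection.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    iris = load_iris()
    X = iris.data[:, :2]   # sepal length and sepal width only, as in the plot
    y = iris.target        # 0 = setosa, 1 = versicolor, 2 = virginica

    # Hold out 30% of the points as test data (illustrative choice).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)

    print(knn.score(X_test, y_test))   # accuracy on the held-out test points
    print(knn.predict([[5.1, 3.5]]))   # the setosa-like point from the table above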
  16. References:
      http://www.solver.com/xlminer/help/k-nearest-neighbors-prediction-example
      http://saravananthirumuruganathan.wordpress.com/2010/05/17/a-detailed-introduction-to-k-nearest-neighbor-knn-algorithm/
      http://scikit-learn.org/stable/modules/neighbors.html
      http://peekaboo-vision.blogspot.com/2013/01/machine-learning-cheat-sheet-for-scikit.html
      http://stackoverflow.com/questions/1832076/what-is-the-difference-between-supervised-learning-and-unsupervised-learning
      http://stackoverflow.com/questions/2620343/what-is-machine-learning
  17. Theoretical data model for unsupervised learning
      (diagram) The "outcomes" are our observations; this is what is given to
      the algorithm. Some variables are unknown to us. Output of the algorithm:
      relationships among the "outcomes", e.g. clusters of data points.
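
The slide doesn't name a specific unsupervised algorithm; as one hedged illustration of this model, k-means clustering in scikit-learn takes unlabeled observations and returns a grouping of them, the "relationships among the outcomes":

    from sklearn.cluster import KMeans
    from sklearn.datasets import load_iris

    # Unlabeled observations: the iris measurements with the species labels dropped.
    X = load_iris().data

    # Ask for three clusters; the algorithm discovers the grouping on its own.
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
    cluster_ids = kmeans.fit_predict(X)

    print(cluster_ids[:10])          # cluster assignment for the first ten points
    print(kmeans.cluster_centers_)   # the discovered cluster centers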