Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Mali Akmanalp - Other people's messy data (and ...

Mali Akmanalp - Other people's messy data (and how not to hate it!)

Have you ever viscerally hated a dataset? Do you want to just get data cleaning out of the way? Are you always left wondering how it consumes most of your time? Whether you work in the sciences, work with government data or scrape websites, data cleaning is a necessary evil. We'll share our woes and check out state of the art in day to day data cleaning tools and strategies.

https://us.pycon.org/2015/schedule/presentation/469/

PyCon 2015

April 18, 2015
Tweet

More Decks by PyCon 2015

Other Decks in Programming

Transcript

  1. i m p o r t p a n d

    a s a s p d
  2. I n [ 1 ] : p d . r

    e a d _ < t a b > p d . r e a d _ c l i p b o a r d p d . r e a d _ f w f p d . r e a d _ h t m l p d . r e a d _ p i c k l e p d . r e a d _ s q l _ t a b l e p d . r e a d _ c s v p d . r e a d _ g b q p d . r e a d _ j s o n p d . r e a d _ s q l p d . r e a d _ s t a t a p d . r e a d _ e x c e l p d . r e a d _ h d f p d . r e a d _ m s g p a c k p d . r e a d _ s q l _ q u e r y p d . r e a d _ t a b l e
  3. I n [ 5 ] : p d . r

    e a d _ s t a t a ( " s t a t e _ n a m e s . d t a " ) O u t [ 5 ] : r r _ n a m e 0 1 A g u a s c a l i e n t e s . . . 1 4 1 5 M à © x i c o 1 5 1 6 M i c h o a c à ¡ n d e O c a m p o
  4. I n [ 6 ] : p d . r

    e a d _ s t a t a ( " s t a t e _ n a m e s . d t a " , e n c o d i n g = " u t f - 8 " ) O u t [ 6 ] : r r _ n a m e 0 1 A g u a s c a l i e n t e s . . . 1 4 1 5 M é x i c o 1 5 1 6 M i c h o a c á n d e O c a m p o
  5. > > > i m p o r t c

    h a r d e t > > > c h a r d e t . d e t e c t ( r a w d a t a ) { ' e n c o d i n g ' : ' E U C - J P ' , ' c o n f i d e n c e ' : 0 . 9 9 }
  6. $ f i l e b o s t o

    n _ p y t h o n _ i s _ a w e s o m e . t s v b o s t o n _ p y t h o n _ i s _ a w e s o m e . t s v : U T F - 8 U n i c o d e E n g l i s h t e x t
  7. > > > f r o m u n i

    d e c o d e i m p o r t u n i d e c o d e > > > p r i n t u ' H \ x e b \ x e 4 v y M \ x e b t \ x e 4 l ' H ë ä v y M ë t ä l > > > u n i d e c o d e ( u ' H \ x e b \ x e 4 v y M \ x e b t \ x e 4 l ' ) ' H e a v y M e t a l '
  8. I n [ 8 9 ] : d f =

    p d . r e a d _ e x c e l ( " 1 0 3 3 - p r o g r a m - f o i a - m a y - 2 0 1 4 . x l s x " )
  9. I n [ 2 4 ] : d f .

    c o l u m n s O u t [ 2 4 ] : I n d e x ( [ u ' S t a t e ' , u ' C o u n t y ' , u ' N S N ' , u ' I t e m N a m e ' , u ' Q u a n t i t y ' , u ' U I ' , u ' A c q u i s i t i o n C o s t ' , u ' S h i p D a t e ' ] , d t y p e = ' o b j e c t ' )
  10. I n [ 6 ] : d f . d

    e s c r i b e ( ) O u t [ 6 ] : Q u a n t i t y A c q u i s i t i o n C o s t c o u n t 7 3 0 2 8 . 0 0 0 0 0 0 7 3 0 2 8 . 0 0 0 0 0 0 m e a n 1 5 . 0 0 6 7 9 2 7 9 6 7 . 5 7 5 4 9 0 s t d 3 8 4 . 6 2 3 9 3 0 1 9 7 2 9 3 . 2 4 3 3 5 6 m i n 1 . 0 0 0 0 0 0 0 . 0 0 0 0 0 0 2 5 % 1 . 0 0 0 0 0 0 5 8 . 7 1 0 0 0 0 5 0 % 1 . 0 0 0 0 0 0 2 0 0 . 0 0 0 0 0 0 7 5 % 5 . 0 0 0 0 0 0 4 9 9 . 0 0 0 0 0 0 m a x 9 1 0 0 0 . 0 0 0 0 0 0 1 8 0 0 0 0 0 0 . 0 0 0 0 0 0
  11. I n [ 9 0 ] : d f .

    i n f o ( ) < c l a s s ' p a n d a s . c o r e . f r a m e . D a t a F r a m e ' > I n t 6 4 I n d e x : 7 3 0 2 8 e n t r i e s , 0 t o 7 3 0 2 7 D a t a c o l u m n s ( t o t a l 8 c o l u m n s ) : S t a t e 7 3 0 2 8 n o n - n u l l o b j e c t C o u n t y 7 3 0 2 8 n o n - n u l l o b j e c t N S N 7 2 9 8 3 n o n - n u l l o b j e c t I t e m N a m e 7 1 7 3 2 n o n - n u l l o b j e c t Q u a n t i t y 7 3 0 2 8 n o n - n u l l i n t 6 4 U I 7 3 0 2 8 n o n - n u l l o b j e c t A c q u i s i t i o n C o s t 7 3 0 2 8 n o n - n u l l f l o a t 6 4 S h i p D a t e 7 3 0 2 8 n o n - n u l l o b j e c t d t y p e s : f l o a t 6 4 ( 1 ) , i n t 6 4 ( 1 ) , o b j e c t ( 6 )
  12. Adding nulls make int columns floats! I n [ 3

    2 ] : x = p d . r e a d _ c s v ( S t r i n g I O ( " a , b \ n 5 , 6 . 0 \ n , " ) ) I n [ 3 3 ] : x O u t [ 3 3 ] : a b 0 5 6 1 N a N N a N I n [ 3 4 ] : x . i n f o ( ) < c l a s s ' p a n d a s . c o r e . f r a m e . D a t a F r a m e ' > I n t 6 4 I n d e x : 2 e n t r i e s , 0 t o 1 D a t a c o l u m n s ( t o t a l 2 c o l u m n s ) : a 1 n o n - n u l l f l o a t 6 4 b 1 n o n - n u l l f l o a t 6 4 d t y p e s : f l o a t 6 4 ( 2 ) m e m o r y u s a g e : 4 8 . 0 b y t e s
  13. Converting types p d . r e a d _

    c s v ( . . . , d t y p e = { " c o l u m n _ 1 " : i n t , " c o l u m n _ 2 " : o b j e c t } ) # o r d f . c o l u m n = d f . c o l u m n . a s t y p e ( i n t )
  14. Ordinals (Also not strings!) low, medium, high ★, ★★, ★★★,

    ★★★★ high school, undergrad, graduate
  15. Converting custom N/A values p d . r e a

    d _ c s v ( . . . , n a _ v a l u e s = [ " N / A " , " U n k n o w n " ] ) # o r d f . r e p l a c e ( " N / A " , N o n e )
  16. Dropping nulls O u t [ 6 5 ] :

    a b c d 0 1 2 N a N N a N 1 N a N 3 N a N N a N 2 4 5 N a N N a N
  17. I n [ 6 8 ] : d f .

    d r o p n a ( a x i s = 1 ) O u t [ 6 8 ] : b 0 2 1 3 2 5
  18. I n [ 6 9 ] : d f .

    d r o p n a ( a x i s = 1 , h o w = " a l l " ) O u t [ 6 9 ] : a b 0 1 2 1 N a N 3 2 4 5
  19. I n [ 8 6 ] : d f .

    f i l l n a ( m e t h o d = " b f i l l " ) O u t [ 8 6 ] : a b c d 0 1 2 N a N N a N 1 4 3 N a N N a N 2 4 5 N a N N a N
  20. I n [ 9 0 ] : d f .

    i n t e r p o l a t e ( ) O u t [ 9 0 ] : a b c d 0 1 . 0 2 N a N N a N 1 2 . 5 3 N a N N a N 2 4 . 0 5 N a N N a N
  21. Selecting columns I n [ 5 0 ] : d

    f . S t a t e O u t [ 5 0 ] : 0 A K 1 A K 2 A K . . . I n [ 5 1 ] : d f [ " I t e m N a m e " ] O u t [ 5 1 ] : 0 R I F L E , 5 . 5 6 M I L L I M E T E R 1 R I F L E , 5 . 5 6 M I L L I M E T E R 2 R I F L E , 5 . 5 6 M I L L I M E T E R 3 R I F L E , 5 . 5 6 M I L L I M E T E R 4 R I F L E , 5 . 5 6 M I L L I M E T E R . . .
  22. Selecting multiple columns I n [ 5 4 ] :

    d f [ [ " S t a t e " , " C o u n t y " ] ] O u t [ 5 4 ] : S t a t e C o u n t y 0 A K A N C H O R A G E 1 A K A N C H O R A G E 2 A K A N C H O R A G E
  23. Queries I n [ 5 8 ] : d f

    [ " A c q u i s i t i o n C o s t " ] > 1 0 0 0 0 0 O u t [ 5 8 ] : 0 F a l s e 1 F a l s e 2 F a l s e 3 F a l s e 4 F a l s e 5 F a l s e . . .
  24. Queries I n [ 5 9 ] : d f

    [ d f [ " A c q u i s i t i o n C o s t " ] > 1 0 0 0 0 0 ] O u t [ 5 9 ] : S t a t e C o u n t y N S N \ 1 4 6 A K A N C H O R A G E 2 3 5 5 - D S - C O M - B T V 2 5 1 5 A L B A L D W I N 2 3 2 0 - 0 1 - 0 4 7 - 8 7 5 4 6 6 3 A L B L O U N T 2 3 2 0 - 0 1 - 0 4 7 - 8 7 5 4 6 7 4 A L B L O U N T 2 3 2 0 - 0 1 - 2 3 0 - 0 3 0 4 6 9 3 A L B L O U N T 2 3 5 5 - 0 1 - 5 5 5 - 0 9 0 8 . . . I n [ 6 0 ] : d f [ d f [ " A c q u i s i t i o n C o s t " ] > 1 0 0 0 0 0 ] . c o u n t ( ) O u t [ 6 0 ] : S t a t e 6 4 6 C o u n t y 6 4 6 N S N 6 4 5 I t e m N a m e 6 4 1 . . .
  25. What were the highest cost items? I n [ 3

    8 ] : d f . s o r t ( " A c q u i s i t i o n C o s t " , a s c e n d i n g = F a l s e ) \ [ [ " I t e m N a m e " , " A c q u i s i t i o n C o s t " ] ] O u t [ 3 8 ] : I t e m N a m e A c q u i s i t i o n C o s t 6 5 7 5 4 A I R C R A F T , R O T A R Y W I N G 1 8 0 0 0 0 0 0 6 5 7 5 3 A I R C R A F T , R O T A R Y W I N G 1 8 0 0 0 0 0 0 6 5 7 6 0 A I R C R A F T , R O T A R Y W I N G 1 8 0 0 0 0 0 0 6 5 7 5 9 A I R C R A F T , R O T A R Y W I N G 1 8 0 0 0 0 0 0 6 5 7 5 8 A I R C R A F T , R O T A R Y W I N G 1 8 0 0 0 0 0 0 6 5 7 5 7 A I R C R A F T , R O T A R Y W I N G 1 8 0 0 0 0 0 0 6 5 7 5 6 A I R C R A F T , R O T A R Y W I N G 1 8 0 0 0 0 0 0 6 5 7 5 5 A I R C R A F T , R O T A R Y W I N G 1 8 0 0 0 0 0 0 4 8 0 7 4 A I R C R A F T , R O T A R Y W I N G 6 5 0 0 0 0 0 1 9 5 2 4 A I R P L A N E , C A R G O - T R A N S P O R T 5 3 4 0 0 0 0 6 5 6 7 8 A I R C R A F T , F I X E D W I N G 4 3 1 7 0 6 7 2 1 5 0 R A D A R S U R V E I L L A N C E C E N T R A L 1 5 2 6 7 0 3
  26. What were the largest quantities of items? I n [

    1 5 ] : d f . s o r t ( " Q u a n t i t y " , a s c e n d i n g = F a l s e ) \ [ [ " I t e m N a m e " , " Q u a n t i t y " , " U I " ] ] O u t [ 1 5 ] : I t e m N a m e Q u a n t i t y U I 5 2 5 3 0 W I R E , E L E C T R I C A L 9 1 0 0 0 F o o t 3 6 2 2 1 S C R E W , C A P , S O C K E T H E A D 4 3 8 2 2 E a c h 5 2 3 9 9 S T R A P , T I E D O W N , E L E C T R I C A L C O M P O N E N T S 6 0 0 0 E a c h 5 2 5 3 6 C A B L E C O A X 6 0 0 0 F T 3 9 1 8 9 R U B B E R S H E E T , S O L I D 6 0 0 0 E a c h
  27. value_counts() is awesome I n [ 1 6 ] :

    d f . U I . v a l u e _ c o u n t s ( ) O u t [ 1 6 ] : E a c h 5 1 5 8 1 E A 1 3 3 7 0 P a i r 1 3 8 1 P R 1 2 8 5 U n k n o w n 9 8 2 K i t 7 3 2 S e t 4 6 3 K T 4 4 1 S E 3 6 0
  28. I n [ 1 7 ] : d f =

    d f . r e p l a c e ( { " U I " : [ " E A " , " E A C H " , " P R " . . . ] } , { " U I " : [ " E a c h " , " E a c h " , " P a i r " . . . ] } ) I n [ 1 8 ] : d f = d f . r e p l a c e ( { " U I " : " U n k n o w n " } , { " U I " : n p . N a N } )
  29. I n [ 2 2 ] : d f .

    U I . v a l u e _ c o u n t s ( ) [ 2 0 : ] O u t [ 2 2 ] : A s s o r t m e n t 1 0 G L 9 C a n 8 O T 7 B D 7 . . . J R 1 E 4 1 S k e i n 1 U U 1 S P 1 L e n g t h : 5 4 , d t y p e : i n t 6 4
  30. d o n t c a r e = d

    f . U I . v a l u e _ c o u n t s ( ) [ 2 0 : ] d f = d f [ ~ d f . U I . i s i n ( d o n t c a r e . i n d e x ) ]
  31. I n [ 4 6 ] : d f .

    s o r t ( " Q u a n t i t y " , a s c e n d i n g = F a l s e ) \ . . . . : [ [ " I t e m N a m e " , " Q u a n t i t y " , " U I " ] ] O u t [ 4 6 ] : I t e m N a m e Q u a n t i t y U I 5 2 5 3 0 W I R E , E L E C T R I C A L 9 1 0 0 0 F o o t 3 6 2 2 1 S C R E W , C A P , S O C K E T H E A D 4 3 8 2 2 E a c h 5 2 3 9 9 S T R A P , T I E D O W N , E L E C T R I C A L C O M P O N E N T S 6 0 0 0 E a c h 5 2 5 3 6 C A B L E C O A X 6 0 0 0 F o o t
  32. Regexes and other string functions > > > d f

    . c o l u m n [ d f . c o l u m n . s t r . c o n t a i n s ( " ( \ d { 4 } - \ d { 2 } - \ d { 2 } ) " ) ] > > > d f [ " I t e m N a m e " ] . s t r . l o w e r ( )
  33. Pivoting I n [ 1 ] : d f O

    u t [ 1 ] : d a t e v a r i a b l e v a l u e 0 2 0 0 0 - 0 1 - 0 3 A 0 . 4 6 9 1 1 2 1 2 0 0 0 - 0 1 - 0 4 A - 0 . 2 8 2 8 6 3 2 2 0 0 0 - 0 1 - 0 5 A - 1 . 5 0 9 0 5 9 3 2 0 0 0 - 0 1 - 0 3 B - 1 . 1 3 5 6 3 2 4 2 0 0 0 - 0 1 - 0 4 B 1 . 2 1 2 1 1 2 5 2 0 0 0 - 0 1 - 0 5 B - 0 . 1 7 3 2 1 5 . . . .
  34. Pivoting I n [ 3 ] : d f .

    p i v o t ( i n d e x = ' d a t e ' , c o l u m n s = ' v a r i a b l e ' , v a l u e s = ' v a l u e ' ) O u t [ 3 ] : v a r i a b l e A B C D d a t e 2 0 0 0 - 0 1 - 0 3 0 . 4 6 9 1 1 2 - 1 . 1 3 5 6 3 2 0 . 1 1 9 2 0 9 - 2 . 1 0 4 5 6 9 2 0 0 0 - 0 1 - 0 4 - 0 . 2 8 2 8 6 3 1 . 2 1 2 1 1 2 - 1 . 0 4 4 2 3 6 - 0 . 4 9 4 9 2 9 2 0 0 0 - 0 1 - 0 5 - 1 . 5 0 9 0 5 9 - 0 . 1 7 3 2 1 5 - 0 . 8 6 1 8 4 9 1 . 0 7 1 8 0 4
  35. Transforming with functions I n [ 1 2 8 ]

    : d f 4 O u t [ 1 2 8 ] : o n e t h r e e t w o a - 0 . 6 2 6 5 4 4 N a N - 0 . 3 5 1 5 8 7 b - 0 . 1 3 8 8 9 4 - 0 . 1 7 7 2 8 9 1 . 1 3 6 2 4 9 c 0 . 0 1 1 6 1 7 0 . 4 6 2 2 1 5 - 0 . 4 4 8 7 8 9 d N a N 1 . 1 2 4 4 7 2 - 1 . 1 0 1 5 5 8 I n [ 1 2 9 ] : f = l a m b d a x : l e n ( s t r ( x ) ) I n [ 1 3 0 ] : d f 4 [ ' o n e ' ] . m a p ( f ) O u t [ 1 3 0 ] : a 1 4 b 1 5 c 1 5 d 3 N a m e : o n e , d t y p e : i n t 6 4
  36. Bane of everyoneʹs existence Make sure the types match Check

    dataframe size before and after Try with how=inner and how=outer
  37. I n [ 8 3 ] : a g e

    s O u t [ 8 3 ] : a g e c a t s 0 1 s i m o n 1 2 p h o e b e 2 3 n o r m a n I n [ 8 4 ] : w e i g h t s O u t [ 8 4 ] : c a t s w e i g h t 0 s i m o n 1 1 n o r m a n 2
  38. I n [ 8 6 ] : p d .

    m e r g e ( a g e s , w e i g h t s , l e f t _ o n = " c a t s " , r i g h t _ o n = " c a t s " ) O u t [ 8 6 ] : a g e c a t s w e i g h t 0 1 s i m o n 1 1 3 n o r m a n 2
  39. I n [ 8 8 ] : p d .

    m e r g e ( a g e s , w e i g h t s , l e f t _ o n = " c a t s " , r i g h t _ o n = " c a t s " , h o w = " o u t e r " ) O u t [ 8 8 ] : a g e c a t s w e i g h t 0 1 s i m o n 1 1 2 p h o e b e N a N 2 3 n o r m a n 2
  40. > > > f r o m f u z

    z y w u z z y i m p o r t f u z z > > > f u z z . r a t i o ( " t h i s i s a t e s t " , " t h i s i s a t e s t ! " ) 9 6
  41. Be conscious about what you load into memory p d

    . r e a d _ c s v ( . . . , u s e c o l s = [ " b l a h " ] ) p d . r e a d _ c s v ( . . . , i t e r a t o r = T r u e , c h u n k s i z e = 1 0 0 0 0 0 )