mongodb
July 13, 2012

MongoDB Sao Paulo 2012: Big Blog Analysis: Map Reduce and Sharding with MongoDB

Henrique Dias, IT Analyst at Federal University of Rio Grande do Sul
Data mining on blogs is a difficult task because of the large volume of data involved. MongoDB is a great solution for distributing the data across shards within a cluster of computers and for analyzing the information with MapReduce jobs over 30 million posts collected from Brazilian users.

Transcript

  1. Virtual Shards

    Diagram labels: VM (… GB, … vP) ×3; Mongo Shard ×6; Mongo Config ×1.
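As a configuration sketch of how a collection might be distributed over shards like those in the diagram, the mongo shell commands below could be run against a `mongos` router. The host names, the `blogs` database name, and the `blogID` shard key are assumptions made for illustration; the slides do not state which shard key was used.

```javascript
// Run in the mongo shell while connected to a mongos router.
// Host names are hypothetical placeholders.
sh.addShard("vm1.example.net:27018");
sh.addShard("vm2.example.net:27018");
sh.addShard("vm3.example.net:27018");

// Assumed database name; the slides only show the collection as db.posts.
sh.enableSharding("blogs");

// Assumed shard key: keeps all posts of one blog on the same chunk range.
sh.shardCollection("blogs.posts", { blogID: 1 });

// Print the resulting shard and chunk layout.
sh.status();
```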
  2. Post JSON

      {
        _id: ObjectId("…"),
        authorID: …,
        blogID: …,
        postID: …,
        published: ISODate("…T…Z"),
        title: "Quis autem vel eum",
        content: "Lorem ipsum dolor sit amet",
        tags: ["voluptatem", "accusantium"],
        comments: [
          {
            commentID: …,
            authorID: …,
            published: ISODate("…T…Z"),
            content: "Neque porro quisquam est"
          }
        ]
      }
  3. MapReduce

    Diagram: … documents (politics, health, cars, movies, fashion, soccer, books, movies) → MAP (health, soccer, politics) → Reduce (health, soccer, politics).
  4. MapReduce: MongoDB JavaScript Functions

      > map = function() {
          this.content.split(" ").forEach(function(word) {
            emit(word, 1);
          });
        }
      > reduce = function(key, values) {
          var count = 0;
          values.forEach(function(value) { count += value; });
          return count;
        }
      > db.posts.mapReduce(map, reduce, {out: {inline: 1}})
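To try the word-count job without a MongoDB server, the map and reduce functions can be exercised with a small in-memory stand-in. This is only a sketch: `runMapReduce` and `samplePosts` are hypothetical names introduced here, and the helper imitates just the grouping step of `mapReduce`, not sharding or output collections.

```javascript
// Hypothetical local stand-in for db.posts.mapReduce(map, reduce, {out: {inline: 1}}).
function runMapReduce(docs, map, reduce) {
  const groups = new Map();
  // MongoDB provides emit() as a global inside map; install one temporarily.
  globalThis.emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  docs.forEach((doc) => map.call(doc)); // "this" is the current document
  delete globalThis.emit;

  const results = {};
  for (const [key, values] of groups) {
    // As in MongoDB, reduce is skipped for keys that have a single value.
    results[key] = values.length === 1 ? values[0] : reduce(key, values);
  }
  return results;
}

// Word-count functions in the shape shown on the slide.
const map = function () {
  this.content.split(" ").forEach(function (word) {
    emit(word, 1);
  });
};
const reduce = function (key, values) {
  let count = 0;
  values.forEach(function (value) { count += value; });
  return count;
};

// Made-up sample documents.
const samplePosts = [
  { content: "soccer health soccer" },
  { content: "politics health" },
];

const wordCounts = runMapReduce(samplePosts, map, reduce);
console.log(wordCounts);
```

Because the reduce function only sums its inputs, it can safely be re-applied to partial results, which is what allows the real job to combine counts coming from different shards.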
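  5. TF-IDF

    TF (Term Frequency): $\mathrm{tf}(t, d) = \frac{|t|}{|T|}$
    IDF (Inverse Document Frequency): $\mathrm{idf}(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|}$
    TF x IDF: $\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D)$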
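The definitions on this slide can be sketched in a few lines of plain JavaScript. The sketch assumes that tf is the number of occurrences of the term in the document divided by the document's total number of terms, that the logarithm is natural (the base is not legible on the slide), and that the term occurs in at least one document. The function names and the tiny corpus are made up for illustration.

```javascript
// A document is represented as an array of terms; a corpus is an array of documents.

// tf(t, d): occurrences of t in d divided by the number of terms in d (assumed reading).
function tf(term, doc) {
  const hits = doc.filter((word) => word === term).length;
  return hits / doc.length;
}

// idf(t, D) = log(|D| / |{d in D : t in d}|), natural log assumed.
// Assumes the term occurs in at least one document (otherwise the result is Infinity).
function idf(term, corpus) {
  const containing = corpus.filter((doc) => doc.includes(term)).length;
  return Math.log(corpus.length / containing);
}

// tfidf(t, d, D) = tf(t, d) * idf(t, D)
function tfidf(term, doc, corpus) {
  return tf(term, doc) * idf(term, corpus);
}

// Made-up corpus of three short documents.
const corpus = [
  ["health", "water", "health"],
  ["soccer", "goal", "water"],
  ["politics", "president"],
];

console.log(tfidf("health", corpus[0], corpus)); // (2/3) * ln(3)
console.log(tfidf("water", corpus[0], corpus));  // (1/3) * ln(3/2)
```

A term that appears in every document gets idf = log(1) = 0, so very common words drop out of the ranking regardless of how often they occur.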
  6. TF-IDF

    Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
  12. TF-IDF

    Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

    N = …, D = …
    tf(magna, d) = …, idf(magna, D) = …, tfidf(magna, d, D) = …
    tf(in, d) = …, idf(in, D) = …, tfidf(in, d, D) = …
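The numeric values on this slide did not survive extraction. As a rough check of the term-frequency part only, the sketch below counts "magna" and "in" in the sample paragraph, assuming tf is occurrences divided by total words and that words are compared in lower case as whole words. The idf values cannot be recomputed because the rest of the corpus D is not shown, and the results here may differ from the figures on the original slide. The names `terms` and `termFrequency` are introduced for illustration.

```javascript
const text =
  "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod " +
  "tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, " +
  "quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo " +
  "consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse " +
  "cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non " +
  "proident, sunt in culpa qui officia deserunt mollit anim id est laborum.";

// Lower-case the text and split on anything that is not a letter.
const terms = text.toLowerCase().split(/[^a-z]+/).filter(Boolean);

// Occurrences of a whole word, and its share of all words in the paragraph.
function termFrequency(term) {
  const hits = terms.filter((word) => word === term).length;
  return { hits, total: terms.length, tf: hits / terms.length };
}

console.log(termFrequency("magna")); // 1 occurrence out of 69 words
console.log(termFrequency("in"));    // 3 occurrences out of 69 words
```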
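  13. TF-IDF Job: Map

      function() {
        var tags = this.tags;
        this.content.split(" ").forEach(function(sWord) {
          tags.forEach(function(sTag) {
            // The emitted value is not legible in the extracted slide; 1 is assumed.
            emit({tag: sTag, word: sWord}, 1);
          });
        });
      }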
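The map step above emits one composite (tag, word) key per word and tag of a post. A minimal local sketch of how such keys could be grouped and counted follows. The slides do not show the reduce function of this job, so the summing reduce, the emitted value of 1, the `runMapReduce` helper, and the sample posts are all assumptions.

```javascript
// Hypothetical in-memory stand-in for a mapReduce run with object keys.
function runMapReduce(docs, map, reduce) {
  const groups = new Map();
  globalThis.emit = (key, value) => {
    const id = JSON.stringify(key); // composite keys are grouped by value
    if (!groups.has(id)) groups.set(id, { key, values: [] });
    groups.get(id).values.push(value);
  };
  docs.forEach((doc) => map.call(doc));
  delete globalThis.emit;
  return [...groups.values()].map(({ key, values }) => ({
    _id: key,
    value: values.length === 1 ? values[0] : reduce(key, values),
  }));
}

// Map step in the shape shown on the slide.
const map = function () {
  const tags = this.tags;
  this.content.split(" ").forEach(function (sWord) {
    tags.forEach(function (sTag) {
      emit({ tag: sTag, word: sWord }, 1);
    });
  });
};

// Assumed reduce: total occurrences of a word under a tag.
const reduce = (key, values) => values.reduce((sum, value) => sum + value, 0);

// Made-up sample posts.
const posts = [
  { tags: ["health"], content: "water skin water" },
  { tags: ["health", "soccer"], content: "water goal" },
];

const pairCounts = runMapReduce(posts, map, reduce);
console.log(pairCounts);
```

Counts per (tag, word) pair are the raw material for the term-frequency side of TF-IDF when each tag is treated as one large document.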
  14. TF-IDF Result

    Health: health, water, disease, skin, body, symptoms, animals, food, kids, cells
    Politics: deputy, president, government, against, Obama, minister, Ministry, State, politics, chamber
    Soccer: goal, team, soccer, coach, fan, player, Cup, against, round, match
  15. PageRank

    (Graph diagram.)
  16. PageRank Job

      Query: {tags: "health"}

      Map:
      function() {
        var idAuthor = this.authorID;
        this.comments.forEach(function(comment) {
          if (comment.userID != idAuthor) {
            emit(comment.userID, [idAuthor]);
          }
        });
      }
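This first PageRank job turns comments into graph edges: each commenter points to the authors whose posts they commented on, and comments on one's own post are skipped. The sketch below runs that map step on made-up posts. The slide does not show the matching reduce function, so the concatenating reduce is an assumption, as are the `buildLinks` helper and the sample data.

```javascript
// Hypothetical helper: builds commenter -> [authors] lists in memory.
function buildLinks(posts, map, reduce) {
  const groups = new Map();
  globalThis.emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  posts.forEach((post) => map.call(post));
  delete globalThis.emit;

  const links = {};
  for (const [key, values] of groups) {
    links[key] = values.length === 1 ? values[0] : reduce(key, values);
  }
  return links;
}

// Map step in the shape shown on the slide.
const map = function () {
  const idAuthor = this.authorID;
  this.comments.forEach(function (comment) {
    if (comment.userID !== idAuthor) {
      emit(comment.userID, [idAuthor]);
    }
  });
};

// Assumed reduce: merge the per-comment author lists (duplicates are kept).
const reduce = (key, values) => [].concat(...values);

// Made-up posts: "a" also comments on their own post, which is ignored.
const posts = [
  { authorID: "a", comments: [{ userID: "b" }, { userID: "a" }, { userID: "c" }] },
  { authorID: "b", comments: [{ userID: "c" }] },
];

const links = buildLinks(posts, map, reduce);
console.log(links);
```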
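  17. PageRank Job: Map

      function() {
        var prK = this.value.pr / this.value.outL.length;
        this.value.outL.forEach(function(authorID) {
          emit(authorID, {pr: prK, outL: [], prOld: 0});
        });
        // The operator between "length" and "prK" and the numeric literals are
        // not legible in the extracted slide; "&&" and 0 are assumed here.
        if (this.value.outL.length && prK > 0)
          emit(this._id, {pr: 0, outL: this.value.outL, prOld: this.value.pr});
      }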
  18. PageRank Job: Reduce

      function(key, values) {
        var result = {pr: 0, outL: [], prOld: 0};
        values.forEach(function(value) {
          result.pr += value.pr;
          result.outL = result.outL.concat(value.outL);
          result.prOld += value.prOld;
        });
        return result;
      }

    Execute iterations until PageRank converges.
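The iterate-until-convergence loop can be sketched locally by folding the map and reduce steps into one function and repeating it until the ranks stop changing. This is a simplified in-memory sketch, not the deck's actual driver: the three-author graph, the tolerance, the round limit, and the names `pageRankRound` and `ranks` are assumptions, each node re-emits its own record unconditionally, and no damping factor is applied because none appears on the slides.

```javascript
// One PageRank round: a map pass followed by a reduce pass, both in memory.
// `nodes` maps an author id to {pr, outL, prOld}.
function pageRankRound(nodes) {
  const groups = new Map();
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };

  // Map: share the current rank equally among outgoing links.
  for (const [id, value] of Object.entries(nodes)) {
    const prK = value.pr / value.outL.length;
    value.outL.forEach((authorID) => emit(authorID, { pr: prK, outL: [], prOld: 0 }));
    // Re-emit the node itself so its link list and previous rank survive the round.
    emit(id, { pr: 0, outL: value.outL, prOld: value.pr });
  }

  // Reduce: sum incoming rank and merge link lists.
  const next = {};
  for (const [key, values] of groups) {
    const result = { pr: 0, outL: [], prOld: 0 };
    values.forEach((value) => {
      result.pr += value.pr;
      result.outL = result.outL.concat(value.outL);
      result.prOld += value.prOld;
    });
    next[key] = result;
  }
  return next;
}

// Made-up graph: a -> b, a -> c, b -> c, c -> a, with uniform starting ranks.
let ranks = {
  a: { pr: 1 / 3, outL: ["b", "c"], prOld: 0 },
  b: { pr: 1 / 3, outL: ["c"], prOld: 0 },
  c: { pr: 1 / 3, outL: ["a"], prOld: 0 },
};

let rounds = 0;
let delta = Infinity;
while (delta > 1e-9 && rounds < 200) {
  ranks = pageRankRound(ranks);
  // Largest change between this round's rank and the previous one.
  delta = Math.max(...Object.values(ranks).map((node) => Math.abs(node.pr - node.prOld)));
  rounds++;
}

console.log(rounds, ranks.a.pr, ranks.b.pr, ranks.c.pr);
```

Storing `prOld` next to `pr` is what makes the convergence test possible after each round: the driver only needs to compare the two fields of every node. Note that in this undamped form a node without outgoing links passes its rank to nobody, so the total rank can shrink on real data.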
  19. PageRank Result

    Health: refugiad, stardoll, coturnon, welovefr, fcoelies, angelori, forbidde
    Politics: bigbost, militar, polibio, fator, tribodo, hempada, blogdop
    Soccer: medob, valcabra, gesptec, hugogoes, novoblog, bigbothe, blogdoma
  20. Crawler

    Diagram labels: Mongo Shard ×6; Mongo Config ×1; … x mongos; Crawlers.