
Social Media Intelligence: Text, Network Mining and Predictive Analytics Combined

Talk by Phil Winters, Data Whisperer at KNIME, given at the Data Science London (@ds_ldn) meetup on 12/02/2013

Data Science London

February 18, 2013

Transcript

  1. Social Media Intelligence: Text, Network Mining and Predictive Analytics Combined

    Phil Winters, Customer Perspective Champion / Data Whisperer [email protected] www.knime.com
  2. NOTE: Examples, workflows (i.e., the complete programs) as well as

    white papers are available for download from www.knime.com 2
  3. KNIME Selected Node Highlights Over 1000 native and embedded

    nodes included: • Statistics • Data Mining • Time Series • Image Processing • Neighborgrams • Web Analytics • Text Mining • Network Analysis • Social Media Analysis • WEKA • R • Database Support • ETL • Text Processing • Data Generation • XML Read/Write • PMML Read/Write • Business Intelligence • Community Nodes • 3rd Party Nodes • Advanced Visualization 4
  4. KNIME rated #1 in satisfaction for open source analytics platforms

    Copyright © 2012 by KNIME.com AG All Rights Reserved - Confidential
  5. Social Media Analysis Water, water everywhere, and not a drop

    to drink. Approaches and Challenges: • Cloud-based Approach: no access to data • In-House Dashboard: no analytics • In-House Text Mining: sentiment but no relevance • In-House Network Mining: relevance but no sentiment 8
  6. Case Study: Major European Telco Very rich new data sources

    about customers! Combine: • Text Mining • Network Analysis • Classic Predictive Analytics (modeling, clustering, time series, etc.) Combining with internal data makes the text "relevant": • Include product names/categories • Exclude staff members • Include the number of web hits per page... • Include existing marketing positioning • Include major campaign information 9
  7. Our Goal in Social Media Analysis 11 • Text Mining for

    Sentiment • Drill Down on special cases • Network Mining for Relevance • Analytics for Prediction
  8. Case Study Example: Slashdot Data "News for Nerds, Stuff that

    Matters" 12 Basic Facts: • 24,532 users • 491 threads with 15–843 responses from 12–507 users • 113,505 posts (text mining on posts) • 60 main topics
  9. Text Mining workflow: remove anonymous users and group by PostID;

    tag words as positive or negative (MPQA Corpus word lists); build a bag of words (BoW); apply a standard Named Entity Filter; compute word frequencies; bin users; draw a word cloud for selected users
  10. Slashdot – Text Mining • List of negative and positive words

    (MPQA Opinion Corpus) • Tag positive and negative words • Count words in posts • Aggregate over users into a negative + positive score per user • Most positive user: dada21 (2838 positive / 1725 negative words) • Most negative user: pNutz (43 positive / 109 negative words) • 16,016 positive users, 7,107 negative users • Which topics do positive users have in common? Government, People, Law/s, Money, Market
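The per-user counting on this slide can be sketched in plain Python. The mini-lexicons below are illustrative stand-ins for the full MPQA Opinion Corpus, and the `attitude` helper name is hypothetical:

```python
from collections import Counter

# Tiny stand-in lexicons; the talk uses the full MPQA Opinion Corpus.
POSITIVE = {"good", "great", "love", "useful"}
NEGATIVE = {"bad", "broken", "hate", "useless"}

def attitude(posts):
    """Count positive and negative lexicon hits across one user's posts."""
    counts = Counter()
    for post in posts:
        for raw in post.lower().split():
            word = raw.strip(".,!?")  # drop trailing punctuation before lookup
            if word in POSITIVE:
                counts["pos"] += 1
            elif word in NEGATIVE:
                counts["neg"] += 1
    return counts["pos"], counts["neg"]

attitude(["This patch is great", "the old driver was bad, really bad"])
# → (1, 2)
```

Aggregating these (pos, neg) pairs over all of a user's posts yields the "most positive" / "most negative" rankings shown on the slide.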
  11. Slashdot – Text Mining: positive vs. negative word frequency per

    user (scatter plot, axes: positive word count vs. negative word count, highlighting dada21, 99BottlesOfBeer, dbIII, and pNutz)
  12. Hubs & Authorities 23 • Hubs = Followers • Authorities

    = Leaders • Filter out anonymous users and create the network • Use a centrality index to define hub weight and authority weight • Output: users with hub and authority weights and other features
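The slide does not name the exact centrality index; the hub/authority terminology suggests an iterative computation in the style of Kleinberg's HITS algorithm, so this is a sketch under that assumption. Edges are hypothetical (replier, author) pairs, so a reply makes its writer a hub pointing at the post's author:

```python
def hits(edges, iters=50):
    """Iterative hub/authority scoring over a directed edge list.

    edges: list of (source, target) pairs, e.g. (replier, post_author).
    Returns two dicts: hub weight and authority weight per node."""
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iters):
        # Authority: sum of hub weights of nodes pointing at you.
        auth = {n: 0.0 for n in nodes}
        for u, v in edges:
            auth[v] += hub[u]
        norm = sum(a * a for a in auth.values()) ** 0.5
        auth = {n: a / norm for n, a in auth.items()}
        # Hub: sum of authority weights of nodes you point at.
        new_hub = {n: 0.0 for n in nodes}
        for u, v in edges:
            new_hub[u] += auth[v]
        norm = sum(h * h for h in new_hub.values()) ** 0.5
        hub = {n: h / norm for n, h in new_hub.items()}
    return hub, auth

# Two users replying to a third: the third gains authority, the repliers hub weight.
hub, auth = hits([("alice", "carol"), ("bob", "carol")])
```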
  13. Combining Text and Network Mining 25 • Network Analysis:

    Hub and Authority Score per User • Text Analysis: Attitude Level per User
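Joining the two per-user score tables can be sketched as an inner join on user name; the dict shapes and the `combine` name are assumptions for illustration:

```python
def combine(network_scores, attitudes):
    """Inner-join per-user (hub, authority) scores with attitude levels.

    network_scores: {user: (hub, authority)}; attitudes: {user: attitude}.
    Users missing from either side are dropped."""
    return {
        u: network_scores[u] + (attitudes[u],)
        for u in network_scores.keys() & attitudes.keys()
    }

# Only "a" appears in both tables, so only "a" survives the join.
combine({"a": (0.1, 0.9), "b": (0.5, 0.2)}, {"a": 3, "c": -1})
# → {"a": (0.1, 0.9, 3)}
```

The resulting (hub, authority, attitude) triple per user is the feature vector the later clustering slides operate on.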
  14. Hubs, Authorities & Attitudes from the WSJ 26 (network graph

    highlighting Carl Bialik, dada21, Doc Ruby, 99BottlesOfBeerInMyF, WebHosting Guy, pNutz, Tube Steak, and Catbeller)
  15. What we have found ... • The positive leaders •

    The neutral leaders • The negative leaders • The inactive users 27 What identifies each group? How do I identify a new user? How do I handle each user?
  16. Why Clustering? • No a priori knowledge (not even on

    a subset of users) • Prediction and interpretation capabilities required 28 Chosen approach: the k-Means algorithm
  17. Normalization 29 • (Authority score, Hub score) in [0,1] ×

    [0,1] • Attitude level in [-66, 1113]
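To keep the attitude level from dominating the Euclidean distance used by k-Means, it can be rescaled into [0, 1] alongside the authority and hub scores. A minimal min-max sketch, assuming the observed range from the slide (the talk does not specify the exact normalization method):

```python
def minmax(values, lo, hi):
    """Rescale values from the observed [lo, hi] range into [0, 1]."""
    return [(v - lo) / (hi - lo) for v in values]

# Attitude levels in the case study span roughly [-66, 1113]:
scaled = minmax([-66, 0, 1113], -66, 1113)
# scaled[0] == 0.0, scaled[-1] == 1.0
```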
  18. Number of Clusters Users with a negative attitude are hard

    to catch! • K=30: 10 clusters with more than 1000 users; 2 clusters with a clearly negative attitude (< 0.4) • K=20: 5 clusters with more than 1000 users; 2 clusters with a negative attitude (< 0.4) • K=10: 2 clusters with more than 5000 users and no cluster with a negative attitude anymore 31
  19. Additional Discoveries • There are very few real leaders!

    Authority and hub scores identify active participants rather than leaders. • Superfans can be found in cluster_3. • Negative and (sigh!) active users are collected in cluster_1. • Neutral users are usually inactive (cluster_2, cluster_7, and cluster_8). • Positive users with different degrees of activity are scattered across the remaining clusters. 32
  20. Lessons Learned Data manipulation is the key; the decision

    science flows from that. Sentiment analysis is all about the corpus! 35 Network Analysis + Sentiment Analysis
  21. NOTE: Examples, workflows (i.e., the complete programs) as well as

    white papers are available for download from www.knime.com 38
  22. Copyright © 2013 by KNIME.com AG All Rights Reserved -

    Confidential Mark Your Calendars: KNIME's 6th User Group Meeting, 6–7 March 2013, Zurich, Switzerland www.KNIME.com 39