– User interest modelling, retrieval, event detection, summarization
• Image understanding
  – Low-level features (e.g., SIFT) do not work well due to the semantic gap
  – How about visual objects?
    • Not sufficient for microblog images
10/18/2016
post’s text
• We focus on conflating the variants of hashtags
  – #icebucket, #ALSIceBucketChallenge
  – 14.3% of image tweets have multi-word hashtags
Microsoft Word Breaker API [Wang et al. NAACL’10]
  – #icebucket → ice bucket
  – #ALSIceBucketChallenge → ALS ice bucket challenge
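A minimal sketch of the word-breaking idea, assuming a tiny hand-written vocabulary. It is not the Microsoft Word Breaker API, which relies on large-scale language-model statistics; the names `VOCAB` and `break_hashtag` are hypothetical and exist only for illustration.

```python
from functools import lru_cache

# Hypothetical toy vocabulary; a real word breaker uses web-scale
# language-model statistics rather than a fixed word list.
VOCAB = {"ice", "bucket", "challenge", "als"}


def break_hashtag(tag, vocab=VOCAB):
    """Split a hashtag into known words, preferring the fewest words.

    Returns None when the hashtag cannot be fully segmented.
    """
    text = tag.lstrip("#").lower()

    @lru_cache(maxsize=None)
    def best(start):
        # Best segmentation of text[start:], as a tuple of words.
        if start == len(text):
            return ()
        candidates = []
        for end in range(start + 1, len(text) + 1):
            word = text[start:end]
            if word in vocab:
                rest = best(end)
                if rest is not None:
                    candidates.append((word,) + rest)
        return min(candidates, key=len) if candidates else None

    words = best(0)
    return list(words) if words is not None else None
```

Under this sketch both hashtag variants share the words "ice" and "bucket" after splitting, which is what lets them be conflated.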
tool (Google Tesseract) to extract text from images
• 26.4% of the images have at least one recognized textual word
[Figure: example image tweets; visible text: “Coming soon!!! imdb.to/IGxE9f”, “Pretty much”]
images in microblogs are user generated
  – Used in other places with a similar context
Google Image Search
  - Named entity
  - Best guess
  - Pages that contain the image
  - 76.0% of images have been indexed by Google
Hashtag enhanced text (via Word Breaker)
2. Text in Image: OCR Tool
3. External web page: URLs
4. Search Result: Google Image Search (best guess, named entity, pages that contain the image)
Text quality: Hashtag > External pages > OCR text > Search results
text (94.8%): Text from post + enhanced hashtags (14.3%)
[Flowchart: fallback order for acquiring contextual text]
• URLs? Yes (14.4%) → Basic + Text from External Pages; No → OCR Text?
• OCR Text? Yes (23.5%) → Basic + OCR Text; No → Search Engine?
• Search Engine? Yes (48.9%) → Basic + Text from Search Result; No → Basic text
• Reduce contextual text acquisition cost by 18%
Text quality: Hashtag > External pages > OCR text > Search results
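The fallback order can be sketched as a short cascade that stops at the first source that yields text, so the later sources are never queried unnecessarily. This is an illustrative sketch only: `acquire_context` and its three callable arguments are hypothetical names, and each callable is assumed to return an empty string when its source has nothing to offer.

```python
def acquire_context(basic_text, fetch_external, run_ocr, search_image):
    """Return (text, source) following the quality-ordered fallback.

    Each fetch argument is a hypothetical zero-argument callable that
    returns a string, or "" when that source has no text. A later source
    is only called when every earlier one came back empty.
    """
    for source, fetch in (
        ("external_pages", fetch_external),
        ("ocr", run_ocr),
        ("search_results", search_image),
    ):
        extra = fetch()
        if extra:
            return f"{basic_text} {extra}".strip(), source
    # No additional source produced text: fall back to the basic text.
    return basic_text, "basic"
```

Stopping at the first hit is the design choice that plausibly saves acquisition cost: an image tweet with a usable URL never triggers OCR or a search query.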
[Figure: user-item retweet matrix, users U1–U4 × items I1–I4; recovered entries: 1 0 0 0 1 1 0 1 1 0 ?, where “?” is the unknown (U4, I4) cell]
Matrix Factorization (MF)
- The state-of-the-art collaborative filtering algorithm
- Learn a vector representation for each user and item in a latent space
Will U4 retweet I4?
[Figure labels: user’s latent factor, item’s latent factor]
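In plain MF, the predicted preference of a user for an item is the inner product of their latent vectors. A minimal sketch follows; the 2-dimensional factor values are made up for illustration and are not taken from the talk.

```python
def predict(user_factor, item_factor):
    """MF score: inner product of the user and item latent vectors."""
    return sum(p * q for p, q in zip(user_factor, item_factor))


# Hypothetical latent factors, invented for this example.
user_factors = {"U4": [0.9, 0.1]}
item_factors = {"I3": [0.1, 0.8], "I4": [0.8, 0.2]}

# Score every candidate item for U4 and rank them, best first.
scores = {
    item: predict(user_factors["U4"], q) for item, q in item_factors.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
```

With these made-up vectors, I4 ranks above I3 for U4 because its vector points in nearly the same direction as the user's.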
types of features into users’ interest modeling
• Not susceptible to cold start
[Equation labels: N types of features (e.g., CITING text, visual objects); a feature’s latent factor; item’s latent factor; user’s latent factor]
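A minimal sketch of how a user-feature decomposition might score an item. The combination rule used here (average the feature vectors within each type, then sum across types) is an assumption made for illustration, not necessarily the formula from the talk; `item_vector` and `score` are hypothetical names.

```python
def item_vector(features_by_type, feature_factors, dim):
    """Build an item's latent vector from its features' latent vectors.

    Assumed form: average the vectors within each feature type, then
    sum the per-type averages. Features without a learned vector are
    skipped.
    """
    vec = [0.0] * dim
    for ftype, feats in features_by_type.items():
        table = feature_factors.get(ftype, {})
        known = [table[f] for f in feats if f in table]
        if not known:
            continue
        for k in range(dim):
            vec[k] += sum(v[k] for v in known) / len(known)
    return vec


def score(user_factor, features_by_type, feature_factors):
    """Inner product of the user vector and the feature-built item vector."""
    q = item_vector(features_by_type, feature_factors, len(user_factor))
    return sum(p * x for p, x in zip(user_factor, q))
```

Because the item vector is assembled from feature vectors, a brand-new image tweet with no retweet history can still be scored as long as its words or visual objects were seen during training.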
has a better rank than negative ones (non-retweets)
  – Bayesian Personalized Ranking [Rendle et al. 2009]
  – Minimize loss function
• Infer the parameters via stochastic gradient descent (SGD)
[Equation label: regularization term]
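The pairwise objective can be sketched with the standard BPR loss for one (user, retweet, non-retweet) triple: the negative log-sigmoid of the score difference plus an L2 regularization term. The sketch below applies it to plain MF vectors for brevity, which is a simplification; the learning rate and regularization weight are arbitrary illustrative values.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def bpr_loss(p_u, q_i, q_j, reg):
    """-ln sigmoid(score_i - score_j) plus an L2 regularization term."""
    x = dot(p_u, q_i) - dot(p_u, q_j)
    l2 = dot(p_u, p_u) + dot(q_i, q_i) + dot(q_j, q_j)
    return math.log1p(math.exp(-x)) + 0.5 * reg * l2


def bpr_sgd_step(p_u, q_i, q_j, lr=0.1, reg=0.01):
    """One in-place SGD update for a (user, positive, negative) triple."""
    x = dot(p_u, q_i) - dot(p_u, q_j)
    g = 1.0 / (1.0 + math.exp(x))  # sigmoid(-x): size of the ranking error
    for k in range(len(p_u)):
        pu, qi, qj = p_u[k], q_i[k], q_j[k]  # use pre-update values
        p_u[k] += lr * (g * (qi - qj) - reg * pu)
        q_i[k] += lr * (g * pu - reg * qi)
        q_j[k] += lr * (-g * pu - reg * qj)
```

Repeated steps on the same triple push the positive item's score above the negative one's, while the regularization term keeps the vectors from growing without bound.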
sample negative instances based on the time of retweets
[Figure: timeline showing the retweet time, Non-retweet 1, and Non-retweet 2]
Non-retweet 2 is more likely to be a real negative instance
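One way to sketch time-aware negative sampling is to weight each non-retweeted tweet by how close its posting time is to a known retweet, on the assumption that the user was active then and probably saw the tweet but chose not to retweet it. The exponential decay, the scale `tau`, and the function names are all assumptions made for this sketch; the slide does not specify the weighting.

```python
import math
import random


def negative_weights(retweet_time, candidates, tau=3600.0):
    """Weight each non-retweeted tweet by time proximity to the retweet.

    candidates maps tweet id -> posting time in seconds. The exponential
    decay and the scale tau are assumptions of this sketch.
    """
    return {
        tid: math.exp(-abs(retweet_time - t) / tau)
        for tid, t in candidates.items()
    }


def sample_negative(retweet_time, candidates, rng=random, tau=3600.0):
    """Draw one negative instance, favouring tweets near the retweet time."""
    weights = negative_weights(retweet_time, candidates, tau)
    ids = list(weights)
    return rng.choices(ids, weights=[weights[i] for i in ids], k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible.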
a test set
• Evaluation metrics
  – Mean Average Precision (MAP)
  – Average Precision at top ranks

           Users   Retweets   All Tweets   Ratings
Training   926     174,765    1,316,645    1,592,837
Test               9,021      77,061       82,743
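For reference, a minimal sketch of the ranking metrics using their textbook definitions. `precision_at_k` is a plain precision at a cutoff and may not match exactly what the slide calls "Average Precision at top ranks"; the function names and input shapes are illustrative assumptions.

```python
def average_precision(ranked_items, relevant):
    """AP of one ranked list; `relevant` is the set of retweeted items."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked_items, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant position
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(rankings, relevant_by_user):
    """Mean of per-user AP; both arguments are dicts keyed by user id."""
    users = list(rankings)
    return sum(
        average_precision(rankings[u], relevant_by_user.get(u, set()))
        for u in users
    ) / len(users)


def precision_at_k(ranked_items, relevant, k):
    """Fraction of the top-k ranked items that are relevant."""
    return sum(1 for item in ranked_items[:k] if item in relevant) / k
```

For example, a ranking `["a", "b", "c"]` in which `a` and `c` were retweeted has AP = (1/1 + 2/3) / 2.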
– Other contexts: geo-location, time, author
– Other fusion approaches, e.g., learn weights of each contextual source
CITING framework to model image tweets
- Hashtag enhanced text
- OCR text
- External pages
- Search results
Feature-aware MF to recommend image tweets
- Decompose user-item interaction to user-feature interaction
- Alleviate cold-start problem