most similar #hashtags Similarity based on co-occurrence [Feng2015], corrects for popularity
Topic as a means to express their opinion. Using a single hashtag may thus miss part of th relevant posts. To address this limitation, we extend the definition of topic to be more encompassin Given a seed hashtag, we define a topic as a set of related hashtags, which co-occur wit the seed hashtag. To find related hashtags, we employ (and improve upon) a recen clustering algorithm tailored for the purpose [Feng et al. 2015]. Feng et al. [2015] develop a simple measure to compute the similarity between tw hashtags, which relies on co-occurring words and hashtags. The authors then use th similarity measure to find closely related hashtags and define clusters. However, th simple approach presents one drawback, in that very popular hashtags such as #ff o #follow co-occur with a large number of hashtags. Hence, directly applying the origina approach results in extremely noisy clusters. Since the quality of the topic affect critically the entire pipeline, we want to avert this issue and ensure minimal noise introduced in the expanded set of hashtags. Therefore, we improve the basic approach by taking into account and normalizin for the popularity of the hashtags. Specifically, we compute the document frequenc of all hashtags on a random 1% sample of the Twitter stream, and normalize th original similarity score between two hashtags by the inverse document frequency. Th similarity score is formally defined as sim ( h s , h t) = 1 1 + log( df ( h t)) ( ↵ cos( W s , W t) + (1 ↵ ) cos( H s , H t)) , (1 where h s is the seed tag, h t is the candidate tag, W x and H x are the sets of word and hashtags that co-occur with hashtag h x , respectively, cos is the cosine similarit between two vectors, df is the document frequency of a tag, and ↵ is a parameter tha