Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Experimental work on the use of topic models to implement and improve some common tasks of
Information Retrieval and Word Sense Disambiguation.
Context
A superabundant amount of digital unstructured information continues to grow at an astonishing rate (it doubles every two years); humans cannot manage it: information overload.
Problems: crawling, representing, storing, summarizing, clustering, searching, ...
General rule: every problem is an opportunity.
Opportunity: automatically extract value from chaos. What value? How to do it?
Goals
The value that we want to extract is clusters of semantically related documents. Our purposes are:
[1] the unsupervised clustering of a text dataset
[2] the implementation of information retrieval procedures that exploit the representation of documents at the topic level
[3] the modeling of the ability to computationally identify the meaning of words in context (word sense disambiguation)

Dataset
Our document collection is a partition of the Associated Press dataset: ~2300 English textual news items (dating back to the '90s). A characteristic of any text document is that it is often messy, with flaws and noise: we need to clean the data, and we need a structured representation of it.
Pre-Processing
Google Refine [ link ]
[1] replacement of abbreviations and common entities with expressions that normalize them (e.g., {dlrs, dlr, $, ...} → {dollar}, {mln, mlns, ...} → {million})
[2] adjustment of flaws
[3] stripping of metadata entities through regular expressions
MALLET [ link ]
[1] lowercasing of all characters
[2] tokenization
[3] stop-word removal
[4] vocabulary proportional cut-off, with threshold 0.03
[5] term-frequency representation of each document
The corpus is a single file, with one document per line. Results: |W| = 32349 token types, 241908 words. (A sketch of these steps follows below.)
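To make the pipeline concrete, here is a minimal Python sketch of the normalization, tokenization, stop-word removal, cut-off and term-frequency steps. The regular expressions, the toy stop-word list, and the reading of the 0.03 threshold as a corpus-wide relative-frequency bound are our assumptions, not the exact rules used in this work.

import re
from collections import Counter

# Hypothetical normalization rules in the spirit of the Google Refine step.
NORMALIZATIONS = [
    (re.compile(r"\bdlrs?\b", re.IGNORECASE), "dollar"),
    (re.compile(r"\$"), "dollar"),
    (re.compile(r"\bmlns?\b", re.IGNORECASE), "million"),
]

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it"}  # toy list

def preprocess(doc):
    doc = doc.lower()                                   # [1] lowercasing
    for pattern, repl in NORMALIZATIONS:
        doc = pattern.sub(repl, doc)
    tokens = re.findall(r"[a-z]+", doc)                 # [2] tokenization
    return [t for t in tokens if t not in STOPWORDS]    # [3] stop-word removal

def build_corpus(docs, cutoff=0.03):
    tokenized = [preprocess(d) for d in docs]
    totals = Counter(t for doc in tokenized for t in doc)
    n_words = sum(totals.values())
    # [4] proportional cut-off: drop terms above the relative-frequency threshold
    keep = {t for t, c in totals.items() if c / n_words <= cutoff}
    # [5] term-frequency representation of each document
    return [Counter(t for t in doc if t in keep) for doc in tokenized]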
Topic Models
Probabilistic generative models for uncovering the underlying semantic structure of a document collection, based on a Bayesian analysis of the original texts [Blei, 2003].
Goal: discover patterns of word use and connect documents that exhibit similar patterns.
Idea: documents are mixtures of topics (assignments), and each topic is a multinomial probability distribution over words (the mixture is written out below).
Which topics have generated the given corpus of documents with maximum likelihood? We have to infer three latent variables:
[1] the word distribution over topics: Φ(j) = P(W | Z = j)
[2] the topic distribution over documents: Θ(d) = P(Z | D = d)
[3] the word-topic assignments: P(Z | W)
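In this notation the mixture assumption reads (the standard formulation, using the document's symbols):

P(w | d) = Σ_{j=1..T} P(w | Z = j) · P(Z = j | D = d) = Σ_{j=1..T} Φ(j)_w · Θ(d)_j

i.e., the probability of a word in a document is obtained by marginalizing over the T topics.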
Topic Models (cont.)
The Latent Dirichlet Allocation (LDA) model associates with [2] and [1] two smoothing hyper-parameters, α and β:
α_j indicates the number of times topic j has been selected for a document before observing any word (α_1, ..., α_T are the parameters of a Dirichlet prior);
β is the parameter of a Dirichlet prior which indicates the count of words extracted from a topic (before observing any corpus document).
To estimate the assignments we can use different methods (e.g., Gibbs sampling); once the count matrices are available, it is possible to compute the distributions Φ and Θ directly from them.
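As an illustration, a minimal NumPy sketch (variable names are ours) of the standard smoothed point estimates of Φ and Θ from the Gibbs-sampling count matrices:

import numpy as np

def estimate_phi_theta(n_wt, n_td, alpha, beta):
    # n_wt : (W, T) matrix, n_wt[w, j] = times word w was assigned to topic j
    # n_td : (T, D) matrix, n_td[j, d] = times topic j was assigned in document d
    # alpha: (T,) vector of Dirichlet parameters (symmetric or asymmetric)
    # beta : scalar Dirichlet parameter of the topic-word prior
    W = n_wt.shape[0]
    # Phi[w, j] = P(w | Z = j), smoothed by beta
    phi = (n_wt + beta) / (n_wt.sum(axis=0, keepdims=True) + W * beta)
    # Theta[j, d] = P(Z = j | d), smoothed by alpha
    theta = (n_td + alpha[:, None]) / (n_td.sum(axis=0, keepdims=True) + alpha.sum())
    return phi, theta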
Tuning
Which are the best values for the hyper-parameters? Usually α = 50/T and β = 0.01 are those that give the best results [Steyvers and Griffiths, 2007].
Which is the optimal number of topics T? And the number of iterations I? It depends on the specific problem; it is an open problem. We have set T = 35 and T = 40.
There are topic evaluation techniques that try to face this problem; we have used one of them (the topic coherence metric, which evaluates the semantic coherence of a topic, sketched below) to compare two model configurations: symmetric α versus asymmetric α.
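The text does not specify which coherence variant was used; a common document co-occurrence formulation is the UMass coherence [Mimno et al., 2011], sketched here for a single topic:

import numpy as np

def umass_coherence(top_words, doc_sets):
    # top_words: the M most probable words of the topic, in rank order
    # doc_sets : dict mapping each word to the set of documents containing it
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            w_m, w_l = top_words[m], top_words[l]
            co = len(doc_sets[w_m] & doc_sets[w_l])      # co-document frequency
            score += np.log((co + 1) / len(doc_sets[w_l]))
    return score  # higher (closer to 0) means a more coherent topic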
Symmetric α versus Asymmetric α
An asymmetric configuration (AS) of the α hyper-parameters serves to calibrate the degree of topic sparseness with more flexibility. It has been empirically demonstrated that optimizing the Dirichlet hyper-parameters (α_1, ..., α_T) of the topic-document distribution makes a huge difference: topics are not dominated by very common words, and they are more stable as their number increases [Wallach, 2009].
This was not confirmed by our experimentation: the average topic coherence for the AS configuration was worse than for the symmetric (SS) configuration. Why? Perhaps because in our corpus there isn't a topic that tends to occur in each document (or the optimal value of T may be greater, or simply the answer is more trivial ...).
Post-Processing - Information Retrieval
Why should we use topic models to improve information retrieval tasks?
[1] we can cluster queries according to the extracted topics
[2] two documents which share no common words can still be measured as similar
The query likelihood model is a basic approach to information retrieval: in this context (a generative model) we can evaluate how well a document matches a query by specifying how the words of the query may have been generated by a language model. We derive a language model for each document (a mixture of topics), so the relevant documents will be those whose topic distribution is likely to have generated the set of words contained in (or associated with) the query → document similarity (see the sketch below).
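A minimal sketch of query-likelihood scoring under the topic-mixture language model, reusing the phi and theta estimates above (function names are ours; query words are assumed independent given the document):

import numpy as np

def query_log_likelihood(query_ids, d, phi, theta):
    # log P(q | d), with P(w | d) = sum_j Phi[w, j] * Theta[j, d]
    p_w_given_d = phi[query_ids, :] @ theta[:, d]
    return float(np.log(p_w_given_d).sum())

def rank_documents(query_ids, phi, theta):
    # rank all documents by query likelihood, best first
    scores = [query_log_likelihood(query_ids, d, phi, theta)
              for d in range(theta.shape[1])]
    return np.argsort(scores)[::-1]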
Document Similarity
Two approaches to compute the similarity between documents:
[1] the probabilistic query approach
[2] the comparison of the topic distributions of the documents
How? Through divergence metrics (e.g., symmetrised Kullback-Leibler, Jensen-Shannon), as sketched below.
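Both divergences, in a minimal NumPy sketch (the smoothed Θ estimates guarantee the strictly positive distributions that KL requires):

import numpy as np

def kl(p, q):
    # Kullback-Leibler divergence KL(p || q); assumes p, q > 0 everywhere
    return float(np.sum(p * np.log(p / q)))

def symmetrised_kl(p, q):
    return kl(p, q) + kl(q, p)

def jensen_shannon(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Usage: two documents are similar when the divergence of their topic
# distributions is small, e.g. jensen_shannon(theta[:, d1], theta[:, d2]).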
Similar documents for the query "forest fire"
AP880727-0015: "Firefighting helicopters were dispatched to Yellowstone National Park on Tuesday to help protect the Old Faithful geyser area from a 6,000-acre blaze ..."
Post-Processing - Word Sense Disambiguation
The ability to identify the meaning of words in context in a computational manner is usually referred to as Word Sense Disambiguation. Four elements:
[1] selection of word senses (i.e., the classes)
[2] use of external knowledge sources
[3] representation of context
[4] selection of an automatic classification method
Input: a user-specified context document d_c that contains the word w_x to be disambiguated.
[1] → given the s most similar words for w_x, for each of these we build a sense document capturing synsets, glosses, example phrases, and other relevant relations from WordNet (a sketch of the construction follows below)
[2] → WordNet as the external knowledge source used to create the sense documents d_s
[3] → the topical and the semantic features
[4] → comparison of document d_c with each of the s d_s documents (with one of the two approaches presented): the most similar will be the sense of word w_x in context d_c
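A minimal sketch of the sense-document construction using WordNet through NLTK (an illustrative reconstruction: the exact set of relations collected in this work may differ; requires the WordNet data, nltk.download('wordnet')):

from nltk.corpus import wordnet as wn

def build_sense_documents(word):
    # one pseudo-document per WordNet sense of `word`
    sense_docs = {}
    for synset in wn.synsets(word):
        parts = list(synset.lemma_names())        # the synset's lemmas
        parts.append(synset.definition())         # the gloss
        parts.extend(synset.examples())           # example phrases
        for related in synset.hypernyms() + synset.hyponyms():
            parts.extend(related.lemma_names())   # other relevant relations
        sense_docs[synset.name()] = " ".join(parts)
    return sense_docs

# Each sense document is then preprocessed like any corpus document, its topic
# distribution is inferred, and the sense whose distribution is most similar
# to that of the context document d_c is selected.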
Word Similarity
Two possible approaches to compute the similarity between words:
[1] the associative relation
[2] the comparison of the (topic-word) P(Z | W) distributions (sketched below)
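A minimal sketch of approach [2]: P(Z | W = w) is obtained from Φ by Bayes' rule (the uniform topic prior is our simplifying assumption), and two words are compared by the divergence of these distributions, reusing jensen_shannon from the document-similarity sketch:

import numpy as np

def topic_given_word(phi, w, topic_priors=None):
    # P(Z | W = w) ∝ Phi[w, :] * P(Z); uniform P(Z) if none is given
    priors = topic_priors if topic_priors is not None else np.ones(phi.shape[1])
    p = phi[w, :] * priors
    return p / p.sum()

def word_similarity(phi, w1, w2):
    # smaller divergence → more similar words
    return -jensen_shannon(topic_given_word(phi, w1), topic_given_word(phi, w2))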
Future Work
Topic modeling →
• train an LDA model with asymmetric α for increasing values of T and evaluate the resulting quality of the topics
• train an LDA model with asymmetric α on a vocabulary on which no proportional cut-off has been performed
• investigate a possible implementation of a multiple-chain model to obtain more stable topics
• use other topic evaluation metrics
Information retrieval →
• assess and fine-tune the prior probability of a document in the query likelihood model
• use other divergence metrics (e.g., α-skew) for the comparison of distributions
Word sense disambiguation →
• implement and evaluate other methods to compare the context document with the sense documents (e.g., compute P(d_c, d_s) under the assumption that they are conditionally independent, given the topic variable)
• refine the mechanism of sense selection (e.g., choosing the s most similar words from distinct probability intervals, in order to minimize the risk that all of them refer to strictly correlated meanings)