The smoothed Dirichlet distribution: Understanding cross -entropy ranking in information retrieval
Unigram Language modeling is a successful probabilistic framework for Information Retrieval (IR) that uses the multinomial distribution to model documents and queries. An important feature in this approach is the usage of the empirically successful cross-entropy function between the query model and document models as a document ranking function. However, this function does not follow directly from the underlying models and as such there is no justification available for its usage till date.
Another related and interesting observation is that the naïve Bayes model for text classification uses the same multinomial distribution to model documents but in contrast, employs document-log-likelihood that follows directly from the model, as a scoring function. Curiously, the document-log-likelihood closely corresponds to cross entropy, but to an asymmetric counterpart of the function used in language modeling. It has been empirically demonstrated that the version of cross entropy used in IR is a better performer than document-log-likelihood, but this interesting phenomenon remains largely unexplained.
One of the main objectives of this work is to develop a theoretical understanding of the reasons for the success of the version of cross entropy function used for ranking in IR. We also aim to construct a likelihood based generative model that directly corresponds to this cross-entropy function. Such a model, if successful, would allow us to view IR essentially as a machine learning problem. A secondary objective is to bridge the gap between the generative approaches used in IR and text classification through a unified model.
In this work we show that the cross entropy ranking function corresponds to the log-likelihood of documents w.r.t. the approximate Smoothed-Dirichlet (SD) distribution, a novel variant of the Dirichlet distribution. We also empirically demonstrate that this new distribution captures term occurrence patterns in documents much better than the multinomial, thus offering a reason behind the superior performance of the cross entropy ranking function compared to the multinomial document-likelihood.
Our experiments in text classification show that a classifier based on the Smoothed Dirichlet performs significantly better than the multinomial based naïve Bayes model and on par with the Support Vector Machines (SVM), confirming our reasoning. In addition, this classifier is as quick to train as the naïve Bayes and several times faster than the SVMs owing to its closed form maximum likelihood solution, making it ideal for many practical IR applications. We also construct a well-motivated generative classifier for IR based on SD distribution that uses the EM algorithm to learn from pseudo-feedback and show that its performance is equivalent to the Relevance model (RM), a state-of-the-art model for IR in the language modeling framework that uses the same cross-entropy as its ranking function. In addition, the SD based classifier provides more flexibility than RM in modeling documents owing to a consistent generative framework. We demonstrate that this flexibility translates into a superior performance compared to RM on the task of topic tracking, an online classification task.
0723: Information systems