Cluster-based retrieval from a language modeling perspective

2008

Abstract (summary)

The standard approach to document retrieval assumes that the relevance of each document can be assessed independently: the fact that one document is relevant does not help predict the relevance of a closely related document. Cluster-based retrieval, on the other hand, assumes that the probability of relevance of a document depends on the relevance of other similar documents to the same query. The goal is to find the best group of documents.

The most common approach to cluster-based retrieval, proposed in the 1970s, is to retrieve one or more clusters in their entirety in response to a query. Research in this area has suggested that "optimal" clusters exist that, if retrieved, would yield very large improvements in effectiveness relative to document retrieval. However, no real retrieval strategy has achieved this result. Except for precision-oriented searches on very small data sets, document retrieval has generally been found to be more effective. The past few years have seen a resurgence of research in cluster-based retrieval, including our own efforts in this area. The general approach is to use clusters as a form of document smoothing. Studies have shown that clusters can indeed improve retrieval performance automatically on modern test collections, and that the language modeling framework is an effective probabilistic retrieval framework for studying this type of problem.

This thesis revisits the problem of retrieving the best group of documents from the language modeling perspective. We study both cluster smoothing and cluster retrieval. We analyze the advantages and disadvantages of a range of representation techniques, derive features that characterize good document clusters, and develop new probabilistic representations that capture the identified features. An extensive empirical evaluation is provided for the techniques proposed in this work. We find that whether good document clusters can be successfully identified or used by an IR system depends largely on how they are represented. Both the CBDM model for cluster smoothing and the geometric mean representation for cluster retrieval are shown to be effective approaches for cluster-based retrieval.
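The two ideas named in the abstract, cluster smoothing and a geometric mean cluster representation, can be illustrated with a minimal Python sketch. This is not taken from the thesis. It assumes unigram language models, a simple linear interpolation of document, cluster, and collection estimates, and a cluster model built as the geometric mean of its members' smoothed term probabilities. The function names, the weights lam and beta, and the toy documents are all hypothetical.

```python
import math
from collections import Counter

def mle(tokens):
    """Maximum-likelihood unigram model: term -> relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def smoothed_prob(word, doc_lm, cluster_lm, coll_lm, lam=0.5, beta=0.5):
    """Interpolate document, cluster, and collection estimates.

    lam and beta are illustrative weights, not values from the thesis.
    """
    background = beta * cluster_lm.get(word, 0.0) + (1 - beta) * coll_lm.get(word, 0.0)
    return lam * doc_lm.get(word, 0.0) + (1 - lam) * background

def geometric_mean_prob(word, member_probs):
    """Geometric mean of the members' smoothed probabilities for one word.

    Each entry of member_probs is a function word -> probability > 0.
    """
    log_sum = sum(math.log(p(word)) for p in member_probs)
    return math.exp(log_sum / len(member_probs))

def query_log_likelihood(query, prob):
    """Score a query under a model; terms must occur in the collection."""
    return sum(math.log(prob(w)) for w in query)

# Toy data (hypothetical): one cluster containing two short documents.
docs = [["cluster", "retrieval", "model"], ["cluster", "smoothing", "model"]]
collection = docs[0] + docs[1] + ["language", "query"]
coll_lm = mle(collection)
cluster_lm = mle(docs[0] + docs[1])
doc_lms = [mle(d) for d in docs]

# Cluster-smoothed document models, one per cluster member.
smoothed = [lambda w, d=d: smoothed_prob(w, d, cluster_lm, coll_lm) for d in doc_lms]

# Cluster-level model for cluster retrieval under the geometric mean assumption.
def cluster_prob(word):
    return geometric_mean_prob(word, smoothed)

doc_score = query_log_likelihood(["cluster", "smoothing"], smoothed[0])
cluster_score = query_log_likelihood(["cluster", "smoothing"], cluster_prob)
```

In this sketch the first document never contains "smoothing", yet it still receives a nonzero probability for that term because its cluster does. The geometric mean gives less weight than an arithmetic mean would to a term that only one member of the cluster supports.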

Indexing (details)

Subject: Computer science
Classification: 0984: Computer science
Identifier / keyword: Applied sciences; Cluster representation; Cluster-based retrieval; Geometric mean; Language modeling; Smoothing; Topic models
Title: Cluster-based retrieval from a language modeling perspective
Author: Liu, Xiaoyong
Source: DAI-B 69/07, Dissertation Abstracts International
Place of publication: Ann Arbor
Country of publication: United States
Advisor: Croft, W. Bruce
Committee member: Allan, James; Clifton, Charles; Diao, Yanlei
University/institution: University of Massachusetts Amherst
Department: Computer Science
University location: United States -- Massachusetts
Source type: Dissertations & Theses
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.