Tandem learning: A learning framework for document categorization

2007

Abstract (summary)

Supervised machine learning techniques rely on the availability of ample training data in the form of labeled instances. In text, however, users often have a strong intuition about the relevance of features, that is, words that are indicative of a topic. In this work we show that users' prior knowledge about features is useful for text classification, a domain with many irrelevant and redundant features. The benefit of feature selection is more pronounced when the objective is to learn a classifier with as few training examples as possible. We demonstrate the role of feature feedback in quickly training a classifier to suitable performance. We find that aggressive feature feedback is necessary to focus the classifier during the early stages of active learning by mitigating the Hughes phenomenon. We describe an algorithm for tandem learning that begins with a couple of labeled instances and then, at each iteration, recommends features and instances for a user to label. The algorithm includes methods for incorporating feature feedback into Support Vector Machines. We design an oracle to estimate an upper bound on tandem learning performance. Tandem learning with an oracle performs much better than learning from only features or only instances. We find that humans can emulate the oracle to an extent that results in performance (accuracy) comparable to that of the oracle. Our unique experimental design helps factor out system error from human error, leading to a better understanding of when and why interactive feature selection works from a user perspective. We also design a set of difficulty measures that capture the inherent instance and feature complexity of a problem. We verify the robustness of our measures by showing that instance and feature complexity are highly correlated. Our complexity measures serve as a tool for understanding when tandem learning is beneficial for text classification.

Indexing (details)

Computer science
0984: Computer science
Identifier / keyword
Applied sciences; Document categorization; Feature feedback; Learning; Machine learning
Tandem learning: A learning framework for document categorization
Raghavan, Hema
Number of pages
Publication year
Degree date
School code
DAI-B 68/07, Dissertation Abstracts International
Place of publication
Ann Arbor
Country of publication
United States
Allan, James
University of Massachusetts Amherst
Computer Science
University location
United States -- Massachusetts
Source type
Dissertations & Theses
Document type
Dissertation/thesis number
ProQuest document ID
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.
Document URL