Classification and variable selection for high dimensional multivariate binary data: Adaboost based new methods and a theory for the plug-in rule

2006


Abstract (summary)

We consider theoretically a classification problem in which all the covariates are independent Bernoulli random variables X_{ji}, 1 ≤ i ≤ n and j = 0, 1, i.e., each variable takes the value 0 or 1, recording the presence or absence of an event. The Bernoulli parameters are estimated by maximum likelihood and plugged into the optimal Bayes rule, yielding what is called the plug-in rule. This rule was applied to real DNA fingerprint data as well as simulations in Wilbur et al. [2002] and shown to classify well even when the independence assumption does not hold. The asymptotic performance of the plug-in rule is the primary object of this study.
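As a concrete illustration, the plug-in rule under the independence model can be sketched as follows. This is a minimal sketch: the function names, the equal-prior assumption, and the small clipping constant used to avoid log 0 are our choices for illustration, not details taken from the dissertation.

```python
import numpy as np

def fit_plug_in(X0, X1, eps=1e-6):
    """MLE of the Bernoulli parameters p_{ji} for each class j.

    X0, X1: (samples x variables) 0/1 arrays for classes 0 and 1.
    eps is an illustrative clip to keep log-likelihoods finite.
    """
    p0 = np.clip(X0.mean(axis=0), eps, 1 - eps)
    p1 = np.clip(X1.mean(axis=0), eps, 1 - eps)
    return p0, p1

def plug_in_classify(x, p0, p1):
    """Assign x to the class with the larger log-likelihood under
    independence (equal class priors assumed for simplicity)."""
    ll0 = np.sum(x * np.log(p0) + (1 - x) * np.log(1 - p0))
    ll1 = np.sum(x * np.log(p1) + (1 - x) * np.log(1 - p1))
    return 1 if ll1 > ll0 else 0
```

The plug-in step is simply replacing the unknown p_{ji} in the Bayes rule with their sample means, so the classifier is a naive-Bayes-type rule for multivariate Bernoulli data.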

Since the number of variables, and hence the number of Bernoulli parameters, depends on the sample size n (so that increasingly complex models are needed as n grows), the usual notion of consistency, i.e., convergence of estimates to fixed parameter values, is not applicable. We introduce triangular arrays and a suitably modified definition of consistency, called persistence, based on how close the performance of the plug-in rule is to that of the classifier with known parameters p_{ji}, 1 ≤ i ≤ n and j = 0, 1. We present various cases where the plug-in rule is or is not persistent. Under a sparsity condition, we show that the plug-in rule with well-chosen variables may overcome non-persistence. This shows that variable selection can be effective for high dimensional data under a sparsity condition.

We also discuss the convergence rate of the plug-in rule for Sobolev-ball-type parameter spaces. We show that the plug-in rule with selected variables can improve the convergence rate, which shows that a simpler model may achieve better performance than the full model. Just as Bickel and Levina [2004] showed that a naive Bayes model performs better than the full model, our results underpin the well-known practical finding that a model with well-chosen variables may achieve a better rate in prediction than the full model, especially for high dimensional data.

In addition to the theoretical study of the plug-in rule, we propose and study a new methodology for classification and variable selection based on AdaBoost. Our applications to real and simulated data suggest that the new methods perform considerably better than the plug-in rule. A theoretical study of the new methods is yet to be done.
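The abstract does not describe the AdaBoost-based procedure itself, so the following is only a generic AdaBoost with one-variable decision stumps on 0/1 features (labels coded as ±1), sketched to fix ideas; the choice of stumps, the number of rounds, and all names here are our assumptions, not the dissertation's method. Variable selection can then be read off from which coordinates the chosen stumps use.

```python
import numpy as np

def adaboost_stumps(X, y, n_rounds=20):
    """Generic AdaBoost with one-variable stumps on 0/1 features.

    X: (n x d) 0/1 array; y: labels in {-1, +1}.
    Illustration only -- not the dissertation's specific method.
    """
    n, d = X.shape
    w = np.full(n, 1.0 / n)          # example weights
    learners = []                     # (variable index, sign, alpha)
    for _ in range(n_rounds):
        best = None
        for i in range(d):
            for s in (1, -1):         # stump: predict s if X[:, i] == 1, else -s
                pred = s * (2 * X[:, i] - 1)
                err = np.sum(w * (pred != y))
                if best is None or err < best[0]:
                    best = (err, i, s, pred)
        err, i, s, pred = best
        err = min(max(err, 1e-10), 1 - 1e-10)   # avoid division by zero / log 0
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * pred)          # upweight misclassified examples
        w /= w.sum()
        learners.append((i, s, alpha))
    return learners

def adaboost_predict(x, learners):
    """Sign of the alpha-weighted vote of the selected stumps."""
    score = sum(alpha * s * (2 * x[i] - 1) for i, s, alpha in learners)
    return 1 if score > 0 else -1
```

Because each round selects a single coordinate, the set {i : (i, s, alpha) in learners} acts as an implicitly selected subset of variables, which is one natural way boosting and variable selection connect.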

Indexing (details)

Subject: 0463: Statistics
Identifier / keyword: Pure sciences; Adaboost; Binary data; Classification; High-dimensional; Plug-in rule; Variable selection
Author: Park, Junyong
Advisor: Ghosh, Jayanta K.
School: Purdue University
University location: United States -- Indiana
Source: DAI-B 68/02, Dissertation Abstracts International
Place of publication: Ann Arbor, United States
Source type: Dissertations & Theses