Statistical models for removing microarray batch effects and analyzing genome tiling microarrays
This work presents novel statistical methods for the preprocessing and downstream analysis of microarray data. The first topic is a method for adjusting microarray data for non-biological variation, or batch effects, which are commonly observed across multiple batches of microarray experiments. Combining microarray data sets allows researchers to increase statistical power in studies where logistical considerations restrict sample size or require the sequential hybridization of arrays. Parametric and nonparametric empirical Bayes frameworks are presented for adjusting data for batch effects; these frameworks are robust to outliers in small sample sizes. The method is illustrated on example data sets, which show that it is justifiable and useful in practice.
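As a rough illustration of the empirical Bayes idea, the sketch below shrinks each gene's estimated batch offset toward the average offset of that batch across all genes, then subtracts the shrunken offset. It is a simplified, location-only version with a normal prior estimated by moments; it is not necessarily the formulation developed in this work, and the function name `shrink_batch_means` and its arguments are hypothetical.

```python
from statistics import mean, pvariance

def shrink_batch_means(data, batches):
    """Remove shrunken batch location effects (illustrative sketch only).

    data: list of genes, each a list of expression values, one per sample.
    batches: list of batch labels, one per sample.
    Returns a new list of lists; the input is not modified.
    """
    labels = sorted(set(batches))
    idx = {b: [j for j, lab in enumerate(batches) if lab == b] for b in labels}

    raw = []        # per-gene dict of raw batch offsets from the gene's grand mean
    resid_var = []  # per-gene residual variance after removing batch means
    for row in data:
        grand = mean(row)
        offsets = {b: mean(row[j] for j in idx[b]) - grand for b in labels}
        raw.append(offsets)
        res = [row[j] - grand - offsets[batches[j]] for j in range(len(row))]
        resid_var.append(max(pvariance(res), 1e-8))  # floor avoids division by zero

    adjusted = [list(row) for row in data]
    for b in labels:
        n = len(idx[b])
        g_hat = [r[b] for r in raw]
        # Prior for this batch, estimated by pooling offsets across genes.
        prior_mean = mean(g_hat)
        prior_var = max(pvariance(g_hat), 1e-8)
        for g, row in enumerate(adjusted):
            # Posterior mean under a normal-normal model: a precision-weighted
            # average of the gene's own estimate and the batch-wide prior mean.
            w = n * prior_var / (n * prior_var + resid_var[g])
            post = w * g_hat[g] + (1 - w) * prior_mean
            for j in idx[b]:
                row[j] -= post
    return adjusted
```

The weight `w` approaches 1 when a batch has many samples or the gene is measured precisely, so shrinkage matters most in small batches, where a single outlying sample could otherwise dominate a gene's estimated offset.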
The other focus of this work is the development of methods for preprocessing and analyzing data from one- and two-color genome tiling microarrays. Commercial tiling array platforms have been developed that tile the non-repetitive genomes of many organisms. These experiments produce massive, correlated data sets full of experimental artifacts, presenting challenges that require innovative analysis methods and efficient computational algorithms. This work presents a two-step, model-based approach for analyzing tiling microarray data from one- and two-color platforms. In the first step, the data are preprocessed using a method for single-array normalization and background adjustment, called standardization, that uses probe sequence to remove a large portion of the variation in the data attributable to sample or probe bias. The second step, the localization of active transcripts or protein binding regions, is accomplished using either moving-window scan statistics or a doubly stochastic latent variable Bayesian analysis method. The latter uses a continuous-time hidden Markov model that accounts for genomic distance between probes and is robust to cross-hybridized and non-responsive probes. The methods are illustrated on simulated and real data, showing that they are powerful and can be applied to a single sample without control experiments, thus defraying some of the substantial overhead cost of tiling array experiments.
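To make the moving-window idea concrete, here is a minimal sketch of a scan over standardized probe scores. It assumes a windowed median as the scan statistic and a fixed threshold, both chosen for illustration rather than taken from this work; the function name `window_scan` and its parameters are hypothetical.

```python
from statistics import median

def window_scan(positions, scores, half_width, threshold):
    """Return (start, end) genomic intervals whose windowed median exceeds threshold.

    positions: probe genomic coordinates, sorted in ascending order.
    scores: standardized probe intensities, aligned with positions.
    half_width: window extends this many bases on each side of a probe.
    """
    n = len(positions)
    flagged = []
    lo = hi = 0
    for i in range(n):
        # Slide the window edges so it covers probes within half_width of probe i.
        while positions[i] - positions[lo] > half_width:
            lo += 1
        while hi + 1 < n and positions[hi + 1] - positions[i] <= half_width:
            hi += 1
        flagged.append(median(scores[lo:hi + 1]) > threshold)

    # Merge runs of consecutive flagged probes into intervals.
    regions = []
    start = None
    for i, is_hit in enumerate(flagged):
        if is_hit and start is None:
            start = i
        elif not is_hit and start is not None:
            regions.append((positions[start], positions[i - 1]))
            start = None
    if start is not None:
        regions.append((positions[start], positions[-1]))
    return regions
```

Because the window is defined in genomic distance rather than in probe count, it adapts to uneven probe spacing, and a median tolerates an isolated non-responsive probe inside an otherwise active region.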