Algorithms and inference for mixture models with application to protein sequence analysis
Mixture model-based clustering is a commonly used statistical tool. The first part of my dissertation describes new search algorithms for finding the partition that maximizes a criterion function, and new Markov chain Monte Carlo algorithms for drawing partitions from a target distribution. These algorithms are based on a neighborhood pruning technique that incorporates bottom-up hierarchical clustering methods. The second part of my dissertation gives a new estimator of mixture order for multivariate categorical data. The estimator is related to finding mixture order via Bayes factors. Its finite sample performance is good; its large sample behavior can be analyzed using rate distortion theory, and it is conjectured not to overestimate mixture order asymptotically. The third part of my dissertation uses a Bayesian mixture profile hidden Markov model to find the subfamilies in a protein family. Applications to simulated and real datasets show that meaningful partitions with the correct numbers of components can be identified. As subfamilies usually differ in their functions, valuable insights can be gained through this cluster analysis.