Machine learning approaches to understanding the genetic basis of complex traits
Humans differ in many observable qualities, termed 'phenotypes', ranging from appearance to disease susceptibility. Many phenotypes are largely determined by each individual's specific 'genotype', stored in the 3.2 billion bases of his or her genome sequence. Deciphering the genome sequence by finding which sequence variations affect a certain phenotype would have a great impact on human life. The recent advent of high-throughput genotyping methods has enabled retrieval of an individual's sequence information on a genome-wide scale. Classical approaches have focused on finding a significant correlation between a sequence variation S and a particular phenotype P in genotype and phenotype data. However, it is difficult to infer a causal relationship between S and P directly from limited data, because of (1) the complexity of the cellular mechanisms through which S causes P, and (2) environmental factors that are not necessarily measurable.
In this dissertation, we present machine learning approaches that address these challenges by explicitly modeling an intermediate process between the genotype and phenotype. More specifically, we model the genetic regulatory mechanisms that are induced by sequence variations and that lead to the phenotype, and we learn the model from genome-wide mRNA expression measurements. Using the learned model, we aim to generate a finer-grained hypothesis such as: a sequence variation S induces regulatory interactions R, which lead to changes in the phenotype P.
To achieve this goal, our approach uses machine learning techniques that can robustly select the relevant biological interactions from a large number of candidate interactions and can efficiently solve the resulting optimization problem on large amounts of data. For example, our 'meta-prior algorithm' learns the regulatory potential of each sequence variation from its intrinsic characteristics, which helps to identify the true causal sequence variation among a large number of variations in the same chromosomal region. Our approaches have led to novel insights into sequence variations, and some of the resulting hypotheses have been validated through biological experiments. Some of the machine learning techniques developed for these biological problems are also applicable to a wide range of other applications, such as collaborative filtering and natural language processing.
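As a rough illustration of what selecting a few relevant interactions from many candidates can look like, the sketch below fits an L1-regularized linear regression by cyclic coordinate descent on a toy data set. This is a generic stand-in chosen for illustration and is not claimed to be the dissertation's model or its meta-prior algorithm. The function names, the +1/-1 marker coding, the effect sizes, and the penalty weight `lam` are all hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimize (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by cyclic updates."""
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            rho = 0.0    # covariance of feature j with the partial residual
            scale = 0.0  # squared norm of feature j
            for i in range(n):
                partial = sum(X[i][k] * coef[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
                scale += X[i][j] ** 2
            coef[j] = soft_threshold(rho / n, lam) / (scale / n)
    return coef


# Toy data (assumed): 8 individuals, 4 binary markers coded +1/-1.
# The marker columns are mutually orthogonal, which keeps the example easy to check.
m0 = [1, -1, 1, -1, 1, -1, 1, -1]
m1 = [1, 1, -1, -1, 1, 1, -1, -1]
m2 = [1, -1, -1, 1, 1, -1, -1, 1]
m3 = [1, 1, 1, 1, -1, -1, -1, -1]
X = list(zip(m0, m1, m2, m3))

# Hypothetical trait: strong effects of markers 0 and 3, a weak effect of marker 2.
y = [2.0 * a + 0.1 * c - 1.5 * d for a, _, c, d in X]

coef = lasso_coordinate_descent(X, y, lam=0.5)
selected = [j for j, b in enumerate(coef) if b != 0.0]
print("coefficients:", coef)
print("selected markers:", selected)
```

With this penalty, the weak effect of marker 2 and the null marker 1 are shrunk exactly to zero, while markers 0 and 3 are retained with shrunken coefficients. That is the sparsity behavior that makes this kind of penalty attractive when candidates far outnumber true effects.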
0800: Artificial intelligence
0984: Computer science