Machine learning approaches to understanding the genetic basis of complex traits

2009

Abstract (summary)

Humans differ in many observable qualities, termed 'phenotypes', ranging from appearance to disease susceptibility. Many phenotypes are largely determined by each individual's specific 'genotype', stored in the 3.2 billion bases of his or her genome sequence. Deciphering the genome sequence by finding which sequence variations affect a certain phenotype would have a great impact on human life. The recent advent of high-throughput genotyping methods has enabled retrieval of an individual's sequence information on a genome-wide scale. Classical approaches have focused on finding a significant correlation between a sequence variation S and a particular phenotype P in the genotype and phenotype data. However, it is difficult to infer a causal relationship between S and P directly from limited data, because of (1) the complexity of the cellular mechanisms through which S causes P, and (2) environmental factors that are not necessarily measurable.
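As an illustration of the classical approach described above, the following is a minimal sketch of a per-variant correlation scan on simulated data. It is not taken from the dissertation: the function names, the simulated sample sizes, the effect size, and the choice of Pearson correlation as the association statistic are all assumptions made for this example.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def association_scan(genotypes, phenotype):
    """Correlate every variant (column) with the phenotype, one at a time."""
    n_variants = len(genotypes[0])
    return [
        pearson([row[j] for row in genotypes], phenotype)
        for j in range(n_variants)
    ]


# Simulated data (assumption): 200 individuals, 50 variants coded as
# allele counts 0/1/2, with variant 7 as the only one affecting the phenotype.
random.seed(0)
N_INDIVIDUALS, N_VARIANTS, CAUSAL = 200, 50, 7
genotypes = [
    [random.randint(0, 2) for _ in range(N_VARIANTS)]
    for _ in range(N_INDIVIDUALS)
]
phenotype = [1.5 * row[CAUSAL] + random.gauss(0, 1) for row in genotypes]

scores = association_scan(genotypes, phenotype)
top_variant = max(range(N_VARIANTS), key=lambda j: abs(scores[j]))
```

In this idealized setting the scan ranks the simulated causal variant first. Real data are harder for the reasons the abstract gives: the variants here are independent, the effect is large, and there are no unmeasured environmental factors.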

In this dissertation, we present machine learning approaches that address these challenges by explicitly modeling an intermediate process between the genotype and phenotype. More specifically, we model the genetic regulatory mechanisms that are induced by sequence variations and that lead to the phenotype, and we learn the model from genome-wide mRNA expression measurements. Using the learned model, we aim to generate a finer-grained hypothesis such as: a sequence variation S induces regulatory interactions R, which lead to changes in the phenotype P.
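The idea of an intermediate process between S and P can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the dissertation's model: a single variant S shifts a single expression level R, R alone drives the phenotype P, and partial correlation is used here as a simple stand-in for testing whether R accounts for the S-P association. All names and noise levels are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation of x and y after removing the linear effect of z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Simulated chain S -> R -> P (assumption): the variant affects the
# phenotype only through the expression level R.
random.seed(1)
N = 500
S = [random.randint(0, 2) for _ in range(N)]   # allele counts
R = [s + random.gauss(0, 0.5) for s in S]      # expression level
P = [r + random.gauss(0, 1.0) for r in R]      # phenotype

r_sp = pearson(S, P)
r_sr = pearson(S, R)
r_rp = pearson(R, P)
r_sp_given_r = partial_correlation(r_sp, r_sr, r_rp)
```

S and P are clearly correlated on their own, but the partial correlation given R is close to zero, which is the pattern expected when the expression level mediates the effect. This is the kind of finer-grained statement ("S acts on P through R") that a plain S-P correlation cannot provide.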

To achieve this goal, our approach utilizes sophisticated machine learning techniques that can robustly select relevant biological interactions from a large number of possible interactions and can efficiently solve the resulting optimization problem on a large amount of data. For example, our 'meta-prior algorithm' can learn the regulatory potential of each sequence variation based on its intrinsic characteristics, and this improvement helps to identify a true causal sequence variation among a large number of variations in the same chromosomal region. Our approaches have led to novel insights into sequence variations, and some of the hypotheses have been validated through biological experiments. Some of the machine learning techniques developed for biological problems are generally applicable to a wide range of applications, such as collaborative filtering and natural language processing.
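The abstract does not specify how relevant interactions are selected, so the sketch below uses a generic, widely known technique as a stand-in: L1-regularized linear regression (the lasso) fitted by coordinate descent, which drives the coefficients of irrelevant candidates to exactly zero. It is not the 'meta-prior algorithm'. The function names, the simulated data, and the penalty value `lam` are assumptions made for this example.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimize (1/2n) * ||y - X b||^2 + lam * ||b||_1 by coordinate descent."""
    n, m = len(X), len(X[0])
    columns = [[row[j] for row in X] for j in range(m)]
    z = [sum(v * v for v in col) / n for col in columns]
    beta = [0.0] * m
    residual = list(y)  # y - X @ beta, kept up to date after each change
    for _ in range(n_sweeps):
        for j, col in enumerate(columns):
            if z[j] == 0.0:
                continue
            # Covariance of column j with the residual that excludes j's term.
            rho = sum(c * r for c, r in zip(col, residual)) / n + z[j] * beta[j]
            new_beta = soft_threshold(rho, lam) / z[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                residual = [r - delta * c for r, c in zip(residual, col)]
                beta[j] = new_beta
    return beta


# Simulated data (assumption): 30 candidate regulators, of which only
# candidates 3 and 11 truly influence the target.
random.seed(2)
N, M = 150, 30
X = [[random.gauss(0, 1) for _ in range(M)] for _ in range(N)]
y = [2.0 * row[3] - 1.5 * row[11] + random.gauss(0, 0.5) for row in X]

beta = lasso_coordinate_descent(X, y, lam=0.3)
selected = [j for j, b in enumerate(beta) if abs(b) > 1e-8]
```

The L1 penalty is what produces a sparse set of selected candidates. The nonzero coefficients are shrunk toward zero relative to the simulated effects, which is the usual trade-off of this penalty.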

Indexing (details)

Subject: Artificial intelligence; Computer science
Classification: 0715: Bioinformatics; 0800: Artificial intelligence; 0984: Computer science
Identifier / keyword: Applied sciences; Biological sciences; Complex traits; Computational biology; Gene regulation; Machine learning; Sequence variation
Title: Machine learning approaches to understanding the genetic basis of complex traits
Author: Lee, Su-In
Source: DAI-B 70/01, Dissertation Abstracts International
Place of publication: Ann Arbor
Country of publication: United States
University/institution: Stanford University
University location: United States -- California
Source type: Dissertations & Theses
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.