Content area
Abstract
This dissertation discusses computational aspects of a comparative sequence analysis technique called phylogenetic footprinting, aimed at identifying regulatory elements in DNA sequences. We formulate a new algorithmic problem, where we ask to identify motifs that are highly conserved across a set of orthologous sequences. A motif's conservation is measured using its parsimony score on a given phylogenetic tree. We describe a practical algorithm to solve this NP-complete problem exactly. We also generalize it to allow us to identify regions that are conserved in only a subset of the input sequences. We investigate issues related to estimating the statistical significance of the motifs found. We show that our approach outperform other approaches that have been used (but not designed for) phylogenetic footprinting. Our algorithm is applied to a large number of sets of orthologous regulatory regions and we identify several known regulatory elements, as well as many well conserved regions whose functions are unknown. Finally, we study an application of our algorithms to the problem of identify regions of an RNA sequence whose secondary structure is conserved across species.