Reconstructing signaling pathways from high-throughput data
Many bioinformatics problems can be approached afresh from a network perspective. Taking into account the network constraints on gene interactions, we propose a series of logically coherent approaches for reconstructing signaling pathways from high-throughput expression profiling data. These approaches proceed in three consecutive steps: construction of a co-expression network with controlled biological and statistical significance, network-constrained clustering, and reconstruction of the order of pathway components.
The first step relies on detecting pairwise co-expression of genes. We address this problem from both frequentist and Bayesian perspectives. We designed and implemented a frequentist two-stage co-expression detection algorithm that controls both the statistical significance (False Discovery Rate, FDR) and the biological significance (Minimum Acceptable Strength, MAS) of the discovered co-expressions. To regularize the variance of the correlation estimates in small-sample scenarios, we also designed and implemented a Bayesian hierarchical model in which the correlation parameters are assumed to be exchangeable and drawn from a common parent Gaussian distribution. Using simulated data and the galactose metabolism data, we demonstrated the advantages of our approaches and compared them with one another.
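As an illustration of how statistical and biological significance can be screened in two stages, the following is a minimal sketch and not the algorithm as implemented in this work. It assumes Pearson correlation with Fisher z-transformed p-values, a Benjamini-Hochberg step-up rule for the first stage, and FDR-adjusted confidence intervals (at level 1 - Rq/M, where R pairs out of M were selected) compared against the MAS in the second stage. The function name `two_stage_screen` and the parameters `q` and `mas` are hypothetical.

```python
import math
from statistics import NormalDist

import numpy as np


def two_stage_screen(X, q=0.05, mas=0.5):
    """Screen gene pairs in a (genes x samples) matrix X.

    Hypothetical sketch: returns a list of (i, j, r) for pairs that pass
    both the FDR stage and the MAS stage. Requires more than 3 samples.
    """
    n_genes, n_samples = X.shape
    corr = np.corrcoef(X)
    iu, ju = np.triu_indices(n_genes, k=1)
    # Clip to keep the Fisher transform finite for near-perfect correlations.
    r = np.clip(corr[iu, ju], -0.999999, 0.999999)
    z = np.arctanh(r)                       # Fisher z-transform
    se = 1.0 / math.sqrt(n_samples - 3)     # approximate standard error of z
    norm = NormalDist()
    pvals = np.array([2.0 * (1.0 - norm.cdf(abs(v) / se)) for v in z])

    # Stage 1 (assumed): Benjamini-Hochberg step-up at FDR level q.
    m = len(pvals)
    order = np.argsort(pvals)
    passed = pvals[order] <= q * np.arange(1, m + 1) / m
    if not passed.any():
        return []
    n_sel = int(np.nonzero(passed)[0].max()) + 1
    selected = order[:n_sel]

    # Stage 2 (assumed): adjusted confidence intervals for the selected
    # pairs; keep a pair only if its interval lies entirely beyond the MAS.
    alpha = n_sel * q / m
    half = norm.inv_cdf(1.0 - alpha / 2.0) * se
    kept = []
    for k in selected:
        lo, hi = math.tanh(z[k] - half), math.tanh(z[k] + half)
        if lo > mas or hi < -mas:
            kept.append((int(iu[k]), int(ju[k]), float(r[k])))
    return kept
```

In this sketch, raising `mas` discards pairs whose correlation is statistically detectable but too weak to be of biological interest, while `q` bounds the expected fraction of false discoveries.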
The second problem considered is distance-based clustering that accounts for "network constraints" extracted from the Giant Connected Component (GCC) of the network discovered from the data. The clustering is performed using a "hybrid" distance matrix composed of the direct distance between adjacent genes and the "shortest-path" distance between non-adjacent genes in the network. The third problem considered is the reconstruction of the order of pathway components. We applied a first-order Markov model, originally developed for and applied to a network tomography problem in telecommunication networks, to reconstruct three well-known signaling pathways from unordered pathway components. We suggest that the methods proposed here can also be applied to other high-throughput data analysis problems.
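The "hybrid" distance matrix used in the clustering step can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: it assumes a symmetric pairwise distance matrix `D` (for example, one minus the absolute correlation), a boolean adjacency matrix `A` for the GCC, and shortest-path lengths defined as the sum of edge distances along the path, computed here with the Floyd-Warshall algorithm. The function name `hybrid_distance` is hypothetical.

```python
import numpy as np


def hybrid_distance(D, A):
    """Build a hybrid distance matrix (hypothetical sketch).

    D: (n x n) symmetric matrix of direct pairwise distances.
    A: (n x n) symmetric adjacency matrix of the network.
    Adjacent genes keep their direct distance; non-adjacent genes get the
    shortest-path distance through the network.
    """
    D = np.asarray(D, dtype=float)
    A = np.asarray(A, dtype=bool)
    n = D.shape[0]

    # Start with edge distances only; non-edges are unreachable (inf).
    H = np.where(A, D, np.inf)
    np.fill_diagonal(H, 0.0)

    # Floyd-Warshall: allow paths through each intermediate node k in turn.
    for k in range(n):
        H = np.minimum(H, H[:, [k]] + H[[k], :])

    # Adjacent pairs keep the direct distance even if a detour is shorter.
    H[A] = D[A]
    np.fill_diagonal(H, 0.0)
    return H
```

Because the network is restricted to a single connected component, every entry of the result is finite; on a disconnected network, pairs in different components would remain at infinity. The resulting matrix can be passed to any distance-based clustering routine.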