Comparative Genomics


Common gene clusters

Various genetic events during natural evolution shape the landscape of genomes. While mutations occur at the nucleotide level, most other events, such as tandem repeats and genetic exchange, affect larger segments of the chromosome. The consequence of these events is the shuffling or amplification of segments of the genome. Are there, then, tell-tale 'patterns' on the chromosomes that serve as proxies for these genetic events and can be identified? One model-less approach is to work on the chromosomes at a coarser granularity, i.e., to operate not on nucleotides but on large segments or genes made up of hundreds or thousands of nucleotides. This conveniently separates two interrelated problems: identifying sequence-similar patterns, such as orthologous or paralogous genes, and identifying composition-similar patterns, such as clusters of genes. Here we provide tools for the second class of problems. Common gene clusters (CgC) are usually functionally related and hence of particular interest to geneticists.
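To make the idea of a composition-similar pattern concrete, here is a minimal brute-force sketch in Python. It is not the discovery algorithm from the publications listed on this page. It assumes each genome is given as an ordered list of gene identifiers in which every gene appears exactly once, and the function name `common_clusters` and the gene labels are made up for illustration. A cluster is reported when the same set of genes occupies a contiguous window in both genomes, regardless of the order inside the window.

```python
def common_clusters(genome_a, genome_b, max_size=None):
    """Return gene sets that fill a contiguous window in both gene orders.

    Assumption: every gene occurs exactly once per genome.
    """
    def windows(genome, size):
        # All sets of `size` adjacent genes, ignoring their internal order.
        return {frozenset(genome[i:i + size])
                for i in range(len(genome) - size + 1)}

    limit = max_size or min(len(genome_a), len(genome_b))
    found = []
    for size in range(2, limit + 1):
        shared = windows(genome_a, size) & windows(genome_b, size)
        found.extend(sorted(sorted(cluster) for cluster in shared))
    return found


# Hypothetical gene orders for two organisms.
genome_a = ["g1", "g2", "g3", "g4", "g5"]
genome_b = ["g3", "g1", "g2", "g5", "g4"]
clusters = common_clusters(genome_a, genome_b)
```

On this toy input the result contains {g1, g2} as well as the larger {g1, g2, g3} that includes it, which is a small instance of the nesting that a naive enumeration reports as separate patterns.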

Mathematical Model

Two issues set the problem of CgC discovery apart from the well-studied and well-understood problem of sequence-similar patterns. First, there is a problem of overcounting associated with the patterns. Recognizing and eliminating this overcounting is vital not only for quantifying the statistical significance of clusters but also for intelligently restricting the output volume. Second, because the elements within a cluster may be permuted, the problem does not respect the Apriori property, which makes pattern discovery algorithmically difficult. It turns out that a mathematical structure called the PQ tree (see the adjoining figure) tackles the overcounting problem through maximality. Since the architecture is a tree, the common genes also have a nested structure, as shown. A PQ tree has two kinds of internal nodes: P, shown as an ellipse, and Q, shown as a rectangle. The PQ tree model thus addresses two problems at once: overcounting and the statistical significance of clusters.
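The following Python sketch shows how a PQ tree can stand for a whole family of gene orders. It relies on the standard definition of PQ trees, which this page does not spell out: the children of a P node may appear in any order, while the children of a Q node may appear only in the given order or its reverse. The `Node` class, the `frontiers` function and the gene labels are hypothetical names chosen for illustration, not part of the tools described here.

```python
from dataclasses import dataclass, field
from itertools import permutations, product


@dataclass
class Node:
    kind: str                                     # "P", "Q", or "leaf"
    children: list = field(default_factory=list)
    gene: str = ""                                # set only for leaves


def leaf(gene):
    return Node("leaf", gene=gene)


def frontiers(node):
    """Return every left-to-right gene order the tree can be rearranged into."""
    if node.kind == "leaf":
        return {(node.gene,)}
    if node.kind == "P":
        # P node: children may be permuted arbitrarily.
        child_orders = permutations(node.children)
    else:
        # Q node: children keep their order or are reversed as a whole.
        child_orders = [tuple(node.children), tuple(reversed(node.children))]
    orders = set()
    for arrangement in child_orders:
        for parts in product(*(frontiers(child) for child in arrangement)):
            orders.add(tuple(g for part in parts for g in part))
    return orders


# Q node over a, (P node over b and c), d.
tree = Node("Q", [leaf("a"), Node("P", [leaf("b"), leaf("c")]), leaf("d")])
all_orders = sorted(frontiers(tree))
```

In this toy tree the inner P node plays the role of a cluster {b, c} nested inside the cluster {a, b, c, d}, so one tree summarizes both the nesting and the four gene orders in which the cluster can appear.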

In the following illustrations, a line between the genomes of two organisms denotes the positions of a common gene cluster in the respective genomes, and its color denotes its statistical significance: red is the highest and blue the lowest.

Example of three-way comparison: Comparison of Arabidopsis (top) with poplar (bottom) genomes via grape chromosome 1. The number of maximal clusters is much smaller and they are generally more statistically significant.

Related Publications

  1. Discovering Patterns in Gene Order, Laxmi Parida and Niina Haiminen. In Evolutionary genomics: statistical and computational methods (Methods in Molecular Biology), Editor: Maria Anisimova (Series editor: John Walker) Publisher: Springer Humana, 2012.
  2. Gapped Permutation Pattern Discovery for Gene Order Comparisons, Laxmi Parida, Journal of Computational Biology, vol 14, No 1, pp 46-56, 2007.
  3. Using PQ Structures for Genomic Rearrangement Phylogeny, Laxmi Parida, Journal of Computational Biology, 13(10), pp 1685-1700, 2006.
  4. Automatic Discovery of Gapped Permutation Patterns with Size Constraints, Laxmi Parida, Dagstuhl Proceedings of Series 06201 Combinatorial and Algorithmic Foundations of Pattern and Association Discovery, May 15-20, 2006.
  5. Using Permutation Patterns for Content-Based Phylogeny, Enam Karim, Laxmi Parida, Arun Lakhotia, Pattern Recognition in Bioinformatics, LNBI 4146, pp 115-125, 2006.
  6. Gapped Permutation Patterns for Comparative Genomics, Laxmi Parida, Proceedings of WABI, Algorithms in Bioinformatics, LNBI 4175, pp 376-387, 2006.
  7. A PQ Framework for Reconstructions of Common Ancestors & Phylogeny, Laxmi Parida, Proceedings of RECOMB-CG, Comparative Genomics, LNBI 4205, pp 141-155, 2006.
  8. Gene Proximity Analysis Across Whole Genomes via PQ Trees, G M Landau, L Parida, O Weimann, Journal of Computational Biology, 12(10), pp 1289-1306, 2005.
  9. Malware Phylogeny Generation Using Permutations of Code, M E Karim, A Walenstein, A Lakhotia, L Parida, Journal in Computer Virology, 2005.
  10. Using PQ Trees for Comparative Genomics, G M Landau, L Parida, O Weimann, CPM 2005 Jeju Island, South Korea, LNCS 3537 Springer 2005, ISBN 3-540-26201-6, pp 128-143, June 19-21, 2005.
  11. Malware Phylogeny Using Maximal pi-Patterns, Arun Lakhotia, Md Enamul Karim, Andrew Walenstein, Laxmi Parida, EICAR 2005, Malta, April 30-May 3, 2005.
  12. Permutation Pattern Discovery in Biosequences, R Eres, G M Landau, L Parida, Journal of Computational Biology, vol 11, No 6, pp 1050-1060, 2004.
  13. A Combinatorial Approach to Automatic Discovery of Cluster Patterns, Revital Eres, Gad M Landau, Laxmi Parida, Proceedings of WABI 2003, LNBI vol 2812, pp 139-150, September 15-20, 2003.