Combinatorial Patterns - overview
Extensible Pattern Discovery
The discovery of motifs in biosequences is frequently torn between the rigidity of the model on the one hand and the abundance of candidates on the other. In particular, the variety of motifs described by strings that include ‘don’t care’ (dot) patterns escalates exponentially with the length of the motif, and this gets only worse if a dot is allowed to stretch up to some prescribed maximum length. This is unfortunate, as it seems to preclude precisely those massive analyses that have become conceivable with the increasing availability of massive genomic and protein data. Although a part of the problem is endemic, another part of it seems rooted in the various characterizations offered for the notion of a motif, that are typically based either on syntax or on statistics alone. It seems worthwhile to consider alternatives that result from a prudent combination of these two aspects in the model.
We show that a combination of appropriate saturation conditions (expressed in terms of minimum number of dots compatible with a given list of occurrences) and the monotonicity of probabilistic scores over regions of constant frequency afford us significant parsimony in the generation and testing of candidate over-represented motifs.
The merits of Varun are documented by the results it obtained, which specifically targeted protein sequence families. In all cases tested, the motif reported in PROSITE as the most important in terms of functional/structural relevance emerges among the top 30 extensible motifs returned by Varun, often right at the top. Of equal importance seems the fact that the sets of all surprising motifs returned in each experiment are extracted. faster and come in much more manageable sizes than would be obtained in the absence of saturation constrains.
Extensible patterns: GTP-binding elongation factors signature [Q01698(EFTU_THEAQ)]; Heat shock protein Hsp70 [P19120 (HSP7C_BOVIN)]; Phosphoglycerate kinase signature [P36204 (PGKT_THEMA].
The 32-bit Linux version is downloadable here.
- VARUN: Discovering Extensible Motifs under Saturation Constraints, Alberto Apostolico, Matteo Comin, Laxmi Parida, IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 4, pp. 752-762, Oct.-Dec. 2010, doi:10.1109/TCBB.2008.123.
- Detection of Subtle Variations as Consensus Motifs, Matteo Comin, Laxmi Parida, Theoretical Computer Science, 395(2-3), pp 158-170, May, 2008.
- Subtle Motif Discovery for Detection of DNA regulatory sites, Matteo Comin, Laxmi Parida, Series on Advances in Bioinformatics and Computational Biology, as proceedings of Asia Pacific Bioinformatics Conference (APBC2007), vol 5, Hong Kong, pp 27--36, Jan 14-17, 2007.
- Motif Patterns in 2D, Alberto Apostolico, Laxmi Parida, Simona E. Rombo, Theoretical Computer Science, vol 390, N0 1, pp 40-55, 22 January 2008.
- Mining, Compressing and Classifying with Extensible Motifs, A. Apostolico, M. Comin, L. Parida, Algorithms for Molecular Biology.
- Off-line Compression by Extensible Motifs, A. Apostolico, M. Comin, L. Parida, Data Compression Conference (DCC), Snowbird, Utah, March 29-31, 2005.
- Conservative Extraction of Over-represented Motifs, A. Apostolico, M. Comin, L. Parida, ISMB 2005 Michigan, pp 9-18, June 25-29, 2005. abstract, PDF.
- Motifs in Ziv-Lempel-Welch Clef, A. Apostolico, M. Comin, L. Parida, IEEE Proceedings of Data Compression Conference (DCC), Snowbird, Utah, pp 72--81, March 23-25, 2004.
- Incremental Paradigms of Motif Discovery, Alberto Apostolico, Laxmi Parida, Journal of Computational Biology, vol 11, No 1, pp 15-25, 2004.
- An inexact suffix tree based algorithm for extensible pattern discovery, Abhijit Chattaraj, Laxmi Parida, Theoretical Computer Science, 335:1, pp 3-14, 2005.
- Bridging Lossy and Lossless Compression by Motif Pattern Discovery, A. Apostolico, M. Comin, L. Parida, General Theory of Information Transfer and Combinatorics, Vol II, (editors R Ahlswede, L Baumer, N Cai), 2004.
- Compression and the Wheel of Fortune, Alberto Apostolico, Laxmi Parida, IEEE Proceedings of Data Compression Conference (DCC), Snowbird, Utah, pp 143--152, March 25-27, 2003. Work also presented at LSD 03, ( London Stringology Day 03 ), King's College, London, February 26, 2003.
- Some Results on Flexible-pattern Discovery, Laxmi Parida, Combinatorial Pattern Matching (CPM 2000), LNCS vol 1848, pp 33--45, 2000.
- Pattern Discovery on character sets and real-valued data: linear bound on irredundant motifs and polynomial time algorithms, Laxmi Parida, Isidore Rigoutsos, Aris Floratos, Dan Platt, Yuan Gao, Proceedings of the eleventh ACM-SIAM Symposium on Discrete Algorithms (SODA 2000), January 2000, pp 297--308.
- MUSCA: An Algorithm for Constrained Alignment of Multiple Data Sequences, Laxmi Parida, Aris Floratos and Isidore Rigoutsos, Genome Informatics 1998, No 9, 112-119, Universal Academy Press, 1998.
- An Approximation Algorithm for Alignment of Mulitple Sequences using Motif Discovery, Laxmi Parida, Aris Floratos and Isidore Rigoutsos, Journal of Combinatorial Optimization, Vol 3 No 2/3, August 1999, pp 247--275.