
Dictionary building via unsupervised hierarchical motif discovery in the sequence space of natural proteins

✍ Scribed by Isidore Rigoutsos; Aris Floratos; Christos Ouzounis; Yuan Gao; Laxmi Parida


Publisher: John Wiley and Sons
Year: 1999
Tongue: English
Weight: 306 KB
Volume: 37
Category: Article
ISSN: 0887-3585


✦ Synopsis


Using TEIRESIAS, a pattern discovery method that identifies all motifs present in any given set of protein sequences without requiring alignment or explicit enumeration of the solution space, we have explored the GenPept sequence database and built a dictionary of all sequence patterns with two or more instances. The entries of this dictionary, henceforth named seqlets, cover 98.12% of all amino acid positions in the input database and in essence provide a comprehensive finite set of descriptors for protein sequence space. As such, seqlets can be effectively used to describe almost every naturally occurring protein. In fact, seqlets can be thought of as building blocks of protein molecules that are a necessary (but not sufficient) condition for function or family equivalence memberships. Thus, seqlets can either define conserved family signatures or cut across molecular families and uncover previously undetected sequence signals deriving from functional convergence. Moreover, we show that seqlets can also capture structurally conserved motifs. The availability of a dictionary of seqlets that has been derived in such an unsupervised, hierarchical manner is generating new opportunities for addressing problems that range from reliable classification and the correlation of sequence fragments with functional categories to faster and more sensitive engines for homology searches, evolutionary studies, and protein structure prediction.
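
To make the coverage figure in the synopsis concrete, the following is a minimal sketch of how the fraction of amino acid positions covered by a pattern dictionary could be computed. It is not TEIRESIAS and does not discover patterns; it only scores patterns that are already given. The function names (`find_instances`, `coverage`), the toy sequences, and the use of `.` as a single-residue wildcard written in regular-expression form are all assumptions made for illustration.

```python
import re


def find_instances(sequences, pattern):
    """Return (sequence_index, start, end) for every match of `pattern`.

    `pattern` is treated as a regular expression in which '.' matches any
    single residue (an assumed notation, chosen for this sketch).
    A lookahead is used so that overlapping matches are also reported.
    """
    rx = re.compile(f"(?=({pattern}))")
    hits = []
    for idx, seq in enumerate(sequences):
        for m in rx.finditer(seq):
            start = m.start(1)
            hits.append((idx, start, start + len(m.group(1))))
    return hits


def coverage(sequences, patterns, min_instances=2):
    """Fraction of residues covered by patterns with enough instances.

    Only patterns with at least `min_instances` matches across the whole
    input are counted, mirroring the "two or more instances" criterion.
    """
    covered = [[False] * len(seq) for seq in sequences]
    for pattern in patterns:
        hits = find_instances(sequences, pattern)
        if len(hits) < min_instances:
            continue  # pattern is too rare to enter the dictionary
        for idx, start, end in hits:
            for pos in range(start, end):
                covered[idx][pos] = True
    total = sum(len(seq) for seq in sequences)
    marked = sum(sum(row) for row in covered)
    return marked / total if total else 0.0


if __name__ == "__main__":
    # Toy input: two short made-up sequences and two hypothetical patterns.
    toy_sequences = ["ACDEFG", "XACDEY"]
    toy_patterns = ["AC.E", "FG"]
    print(coverage(toy_sequences, toy_patterns))
```

In this toy input, `AC.E` occurs once in each sequence and so qualifies, while `FG` occurs only once overall and is skipped under the default threshold. A real dictionary built from a database the size of GenPept would need a far more efficient index than this per-pattern scan.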