Saturating representation of loop conformational fragments in structure databanks

Type: Article
Original language: English
Article number: 3488
Pages (from-to): 15
Journal: BMC Structural Biology
Issue number: 15
Publication status: Published - 04 Jul 2006


Short fragments of proteins are fundamental starting points in various structure prediction applications, such as fragment-based loop modeling methods and various full-structure build-up procedures. The applicability and performance of these approaches depend on the availability of short fragments in structure databanks.

We studied the representation of protein loop fragments up to 14 residues in length. All possible query fragments found in sequence databases (Sequence Space) were clustered and cross-referenced with the structural fragments available in the Protein Data Bank (Structure Space). We found that the expansion of the PDB in the last few years has resulted in a dense coverage of loop conformational fragments. For each loop of length 8 in the current Sequence Space there is at least one loop in Structure Space with 50% or higher sequence identity. By correlating sequence and structure clusters of loops we found that 50% sequence identity generally guarantees structural similarity. Coverage at the 50% sequence identity cutoff drops to 96, 94, 68, 53, 33 and 13% for loops of length 9, 10, 11, 12, 13, and 14, respectively. Every loop in the current Sequence Space, at any length up to 14 residues, is matched by a conformational segment that shares at least 20% sequence identity. This minimum observed identity is 40% for loops of 12 residues or shorter and is as high as 50% for loops of 10 residues or shorter. We also assessed the impact of rapidly growing sequence databanks on the estimated number of new loop conformations and found that, although the number of unique sequence segments increased about sixfold during the last five years, there are almost no unique conformational segments among these fragments of up to 12 residues.
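The coverage figures above can be illustrated with a minimal sketch. It assumes that sequence identity is counted position by position between ungapped fragments of equal length, which may differ from the clustering procedure actually used in the study. The function names and the toy loop sequences are hypothetical and serve only as an illustration.

```python
from typing import Sequence


def sequence_identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length fragments."""
    if len(a) != len(b):
        raise ValueError("fragments must have the same length")
    if not a:
        raise ValueError("fragments must be non-empty")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def best_identity(query: str, structure_loops: Sequence[str]) -> float:
    """Highest identity between a query and any structural loop of the same length.

    Returns 0.0 when no structural loop of that length exists.
    """
    same_length = [s for s in structure_loops if len(s) == len(query)]
    return max((sequence_identity(query, s) for s in same_length), default=0.0)


def coverage(sequence_loops: Sequence[str],
             structure_loops: Sequence[str],
             cutoff: float = 0.5) -> float:
    """Fraction of query loops matched by a structural loop at or above the cutoff."""
    if not sequence_loops:
        raise ValueError("at least one query loop is required")
    covered = sum(
        1 for q in sequence_loops
        if best_identity(q, structure_loops) >= cutoff
    )
    return covered / len(sequence_loops)


# Toy data (hypothetical, not taken from the study).
toy_sequence_space = ["GSGDKTPE", "AAGNPQLS", "WWWWWWWW"]
toy_structure_space = ["GSGDKAPE", "AAGNRRRR", "GSG"]

if __name__ == "__main__":
    print(coverage(toy_sequence_space, toy_structure_space, cutoff=0.5))
    # The minimum of the best identities corresponds to the
    # "minimum observed identity" reported for each loop length.
    print(min(best_identity(q, toy_structure_space) for q in toy_sequence_space))
```

In this toy case two of the three query loops have a structural match at 50% identity or higher, so the coverage is two thirds. Running the same calculation separately for each loop length would give one coverage value per length, analogous to the percentages reported above.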

The results suggest that fragment-based prediction approaches are no longer limited by the completeness of fragments in databanks but rather by the effectiveness of the scoring and search algorithms used to locate them. The current favorable coverage and the observed trends will be further accentuated by the progress of the Protein Structure Initiative, which targets new protein folds and ultimately aims at providing an exhaustive coverage of the structure space.

Functional characterization of proteins is one of the most frequent problems in biology. While sequences provide valuable information, their high plasticity frequently makes it impossible to identify functionally relevant residues. For instance, it is estimated that 75% of homologous enzymes share less than 30% identical positions[1]. Meanwhile, less than 30% of related protein pairs above 50% sequence identity have entirely identical EC numbers[2]. Functional characterization of a protein is usually facilitated by its three-dimensional (3D) structure[3]. These structures can be obtained by experiments, such as X-ray crystallography, NMR spectroscopy, or cryo-electron microscopy, or by computation. It has been recognized that technically complicated, time-consuming, and expensive experimental 3D approaches will not catch up with the millions of sequences that are emerging from high-throughput projects around the world[4]. Two major computational alternatives are available[5]. The first comprises the template-based approaches (comparative modeling, threading) that have been employed in the Protein Structure Initiative (PSI)[6]. PSI efforts are expected to experimentally solve ~5000 carefully selected protein folds that could serve as seed templates for comparative modeling, providing useful three-dimensional models for the remaining hundreds of thousands of sequences. While the resulting comparative models will be accurate for most of the structure, some of the most critical parts of the proteins may not be modeled accurately. For instance, the overall accuracy of a comparative model for a protein that belongs to one of the superfolds[7] can be very high, because many high-resolution structures sharing the same general fold are available as templates. However, the variable regions of these structures differ. The variable regions are often unique to each of these proteins, and define the function and specificity of the molecules.
For these unique structural segments, which are often found on the surface of the proteins and lack any translational symmetry (i.e., loops), comparative modeling techniques cannot generally be applied; loop segments in the target may be missing in the template or structurally divergent, resulting in inaccurate parts in the model. Short fragments of proteins also play a critical role in full-structure build-up approaches. Some of the most accurate methods available assemble full protein structures by locating short segments in the databanks and packing them together using a minimization protocol such as Monte Carlo simulation[8, 9]. These approaches have proved useful in providing reasonable structures (within 4–8 Å RMSD of the experimental solution) for small proteins, typically fewer than 100 residues[10]. To improve the accuracy of models that are already subject to computational modeling, and to extend the applicability of whole-structure build-up methods to more sequences, it is critical to have a good selection of short building blocks in the structure databases.