Bioinformatics: Using predictions to improve predictions of membrane proteins

Machine learning methods have a long history within bioinformatics. Sequence-based machine learning methods can roughly be divided into two classes, local and structural. The local methods are trained on information from a fixed-length sequence window surrounding a residue. Various information from this window, primarily sequence and evolutionary, is gathered and fed into a machine learning algorithm, typically a support vector machine or an artificial neural network, to predict some feature of the central residue. This approach has been successful for many applications, including the prediction of secondary structure of proteins.
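
As a rough illustration of the local approach, the sketch below cuts a sequence into fixed-length windows, one per residue, and turns each window into a numeric vector that a support vector machine or neural network could take as input. The function names, the padding symbol, the window size and the plain one-hot encoding are assumptions made for this example; real predictors typically also use evolutionary profiles.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "-"  # assumed padding symbol for window positions beyond the sequence ends


def extract_windows(sequence, half_width=7):
    """Return one fixed-length window per residue, centred on that residue."""
    padded = PAD * half_width + sequence + PAD * half_width
    size = 2 * half_width + 1
    return [padded[i:i + size] for i in range(len(sequence))]


def one_hot(window):
    """Encode a window as a flat 0/1 vector with 20 entries per position.

    Padding positions match no amino acid and are encoded as all zeros.
    """
    vector = []
    for residue in window:
        vector.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return vector


# Toy sequence (made up); each feature vector describes one central residue.
windows = extract_windows("MKTLLVA", half_width=2)
features = [one_hot(w) for w in windows]
```

Each entry of `features` would be paired with a label for the central residue, for example its secondary structure state, when training a classifier.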

The structural machine learning algorithms used in sequence-based bioinformatics are based on some type of graph structure describing the protein sequence. This structure allows for windows of variable length in the training procedure, as a sequence can be generated by different paths through the graph. Typically hidden Markov models, Bayesian networks or conditional random fields are applied here. They have successfully been applied to the prediction of secondary structure (or topology) of membrane proteins and other problems. One limitation of these algorithms is that they cannot take into account correlations between different sites in the sequence. To tackle this problem we are developing methods based on a combination of local and structural machine learning algorithms. So far we have successfully applied them to membrane protein topology predictions of both alpha and beta types.

The next step in membrane protein structure prediction is to aim for full 3D predictions. One way to do this is to use “direct information” to predict interacting residues in a protein and then use these predicted contacts to model the structure. Contact prediction using “direct information” is based on the observation that two residues that are close in space often co-vary. For instance, if residue A is large and residue B is small, then if A is replaced by a small residue it is quite likely that B is replaced by a large residue. This type of correlated mutation has been used to predict contacts in proteins for many years, with limited success. However, recently it has been shown that methods based on “direct information” can improve the predictions drastically. The idea is that if A is in contact with B and B is in contact with C, strong co-variation may also be seen between A and C, but it is possible to disentangle the direct contacts from the indirect ones.
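
To make the notion of co-variation concrete, the sketch below computes the mutual information between two columns of a multiple sequence alignment. This is the older correlated-mutation type of score, not “direct information”: it does not separate direct from indirect couplings. The function name and the toy alignment are made up for the example.

```python
from collections import Counter
from math import log


def mutual_information(msa, i, j):
    """Mutual information (in nats) between columns i and j of an alignment."""
    n = len(msa)
    col_i = Counter(seq[i] for seq in msa)
    col_j = Counter(seq[j] for seq in msa)
    pairs = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n
        mi += p_ab * log(p_ab / ((col_i[a] / n) * (col_j[b] / n)))
    return mi


# Toy alignment: columns 0 and 1 swap a large (W) and a small (G) residue
# together, while column 2 varies independently of column 0.
toy_msa = ["WGA", "GWA", "WGL", "GWL"]

coupled = mutual_information(toy_msa, 0, 1)      # high: the columns co-vary
independent = mutual_information(toy_msa, 0, 2)  # zero: no co-variation
```

In a real alignment a chain of contacts A-B and B-C also raises the score for the pair A-C, which is the indirect effect that “direct information” methods aim to remove.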

The origin of “direct information” is work done by Lapedes in the late 1990s. However, only in the last few years has it been shown to provide a significant improvement, as it needs thousands of aligned homologous sequences to reach high accuracy. When a sufficient number of sequences is available, the performance of these methods is excellent. Our latest implementation, which combines plmDCA (developed by Erik Aurell at KTH), PSICOV (developed by David Jones in the UK) and our own post-processing filter (PconsC), has 77% accuracy among the top L predictions, where L is the length of the protein. Earlier methods rarely reached over 20%. We are now implementing these methods in our structure prediction pipelines.
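
The accuracy among the top L predictions can be read as the fraction of the L highest-scoring residue pairs that are real contacts. The sketch below shows one way to compute it; the function name, the dictionary layout and the toy numbers are assumptions for illustration and are not taken from PconsC. Published evaluations usually also require a minimum sequence separation between the two residues, which is left out here.

```python
def top_l_precision(scores, true_contacts, length):
    """Fraction of the `length` highest-scoring pairs that are true contacts."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:length]
    if not ranked:
        return 0.0
    hits = sum(1 for pair in ranked if pair in true_contacts)
    return hits / len(ranked)


# Toy protein of length 4 with made-up scores for all six residue pairs.
scores = {
    (0, 2): 0.9, (0, 3): 0.8, (1, 3): 0.7,
    (0, 1): 0.4, (1, 2): 0.2, (2, 3): 0.1,
}
true_contacts = {(0, 2), (1, 3), (2, 3)}

precision = top_l_precision(scores, true_contacts, length=4)
```

Here two of the four top-ranked pairs are true contacts, so `precision` is 0.5.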

References
Hayat, S. and Elofsson, A. (2012) Ranking models of transmembrane beta-barrel proteins using Z-coordinate predictions. Bioinformatics 28 (12) : i90-i96.
Tsirigos, K.D., Hennerdal, A., Kall, L. and Elofsson, A. (2012) A guideline to proteome-wide alpha-helical membrane protein topology predictions. Proteomics 12 (14) : 2282-2294.
Hayat, S. and Elofsson, A. (2012) BOCTOPUS: improved topology prediction of transmembrane beta barrel proteins. Bioinformatics 28 (4) : 516-522.
Shu, N. and Elofsson, A. (2011) KalignP: Improved multiple sequence alignments using position specific gap penalties in Kalign2. Bioinformatics 27 (12) : 1702-1703.
Hennerdal, A. and Elofsson, A. (2011) Rapid membrane protein topology prediction. Bioinformatics 27 (9) : 1322-1323.
Larsson, P., Skwark, M.J., Wallner, B. and Elofsson, A. (2011) Improved predictions by Pcons.net using multiple templates. Bioinformatics 27 (3) : 426-427.
Illergard, K., Kauko, A. and Elofsson, A. (2011) Why are polar residues within the membrane core evolutionary conserved? Proteins 79 (1) : 79-91.