Bioinformatics: Using machine learning potentials in conformational sampling and selection of protein structures

Modeling of protein structure is a central challenge in structural bioinformatics. It holds the promise not only of identifying classes of structure, but also of providing detailed information about the specific structure and biological function of molecules. This is critically important for guiding and interpreting experimental studies: it enables prediction of binding, simulation, and design for the large set of proteins whose structures have not yet been determined experimentally (or cannot be obtained), and it is a central part of contemporary drug development.

The accuracy of protein structure prediction has increased tremendously over the last decade, and today it is frequently possible to build models with 2-3 Å resolution even when only distantly related templates are available. However, a long-standing problem in the structure prediction field is that even when a method is able to produce a high-accuracy model, the method itself frequently fails to recognize it as such. In other words, if a method is asked to rank five models it has produced, it will almost never produce a correct ranking. This not only limits the quality of the highest-ranked model from a given method; it is also quite likely that the inability to distinguish right from wrong during structure generation limits the quality of all generated models.
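The cost of a wrong ranking can be made concrete as a selection loss: the gap between the true quality of the best model a method generated and the true quality of the model it ranked first. The following is a minimal sketch under stated assumptions, not part of the project's code. The function name `selection_loss`, the convention that higher values mean better quality, and the five example values are all hypothetical and chosen only for illustration.

```python
from typing import Sequence


def selection_loss(predicted: Sequence[float], true_quality: Sequence[float]) -> float:
    """Return the true quality lost by trusting the method's own top-ranked model.

    Assumption: for both inputs, a higher value means a better model.
    """
    if len(predicted) != len(true_quality) or len(predicted) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    # Index of the model the method itself believes is best.
    chosen = max(range(len(predicted)), key=lambda i: predicted[i])
    # Difference between the best model actually available and the chosen one.
    return max(true_quality) - true_quality[chosen]


# Hypothetical values for five models of one target (not real data).
predicted = [0.71, 0.64, 0.80, 0.55, 0.62]     # the method's own quality estimates
true_quality = [0.52, 0.78, 0.49, 0.40, 0.61]  # quality measured against the native structure

loss = selection_loss(predicted, true_quality)
```

In this made-up case the method picks the third model although the second is the most accurate, so the loss is greater than zero. A loss of zero means the top-ranked model really was the best one available.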

In this project we will develop methods that tackle the problem of conformational sampling and selection by combining our expertise from two different areas: machine learning in structural evaluation, and conformational sampling and structure prediction within the Rosetta modeling suite. Rosetta is currently the best program for modeling protein structure. It is also the program in the protein structure prediction field with the most scientific developers (more than 100) worldwide, and it has a steadily increasing user base.

The project is divided into three parts:

  1. Method development: In this part we will develop the methods that will be used in the later parts. This includes predictors of local and global model quality, optimized for both membrane and globular proteins.
  2. Implementation & Benchmark: This part involves implementing the methods as new scoring (energy) functions in the Rosetta modeling suite and benchmarking their ranking performance on all models that the BAKER-ROSETTASERVER group generated for CASP10 (~3.5M models).
  3. Complete integration with Rosetta modeling protocols: While the second part deals with using the methods for final ranking of models, this part involves integrating the new scoring functions into the protocols used for generating structures, in particular the ab initio and membrane ab initio protocols and the relax protocol used for structure refinement.
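One common way to benchmark ranking performance, as in part 2 above, is to compute a rank correlation per target between the scores assigned to the models and their true quality. The sketch below uses Spearman's rank correlation in plain Python. It is an illustration under assumptions, not the project's benchmark code: the helper names, the per-target data layout, and the target labels are hypothetical, and it assumes that higher scores mean better models (a Rosetta-style energy, where lower is better, would need to be negated first).

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined when all values are tied")
    return cov / sqrt(vx * vy)


def benchmark(
    targets: Dict[str, Tuple[Sequence[float], Sequence[float]]]
) -> Dict[str, float]:
    """Map each target name to the rank correlation of (scores, true quality)."""
    return {name: spearman(scores, quality) for name, (scores, quality) in targets.items()}


# Hypothetical data: two targets with three models each (not real CASP results).
results = benchmark({
    "target_A": ([0.2, 0.5, 0.9], [0.3, 0.6, 0.8]),  # perfectly ordered
    "target_B": ([0.9, 0.5, 0.2], [0.3, 0.6, 0.8]),  # perfectly reversed
})
```

A value near 1 for a target means the scoring function orders its models almost as the true quality does, a value near 0 means the ordering is uninformative, and a negative value means the scoring function prefers the worse models.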

To date, (1) and the implementation part of (2) are complete, and we are currently working on the benchmark and the protocol integration.

Investigators