In genomics, we routinely work with multi-TB datasets. We will use machine learning to attack the protein structure prediction problem in molecular biology. Improving on state-of-the-art prediction models would have implications for many practical applications in structural biology and could provide a means of understanding disease mechanisms, enabling us to design drugs that rewire protein-protein interaction networks.

In this project, we will develop deep learning methods using biological data. In particular, we will address the protein structure prediction problem, which involves predicting the structure from the amino acid sequence, predicting interactions with other proteins and peptides, evaluating model quality, and predicting amino acid contacts. We will evaluate different deep learning methods that predict the structure of a protein either directly or by first predicting contacts, from which a structure can then be modeled. We will also investigate how best to represent a protein structure in a deep neural network (DNN), considering LSTMs (long short-term memory networks), CNNs (convolutional neural networks), 3D embeddings, RGNs (recurrent geometric networks), and CRFs (conditional random fields). Interactions between proteins are also important, since proteins are social molecules that function by interacting with partners. Many diseases are caused by malfunctioning protein-protein interaction networks or by pathogens interacting with host proteins during infection. Thus, we want to apply deep learning to understand the details of these networks and interactions and to suggest potential cures.
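
To make the notion of amino acid contacts concrete, the sketch below derives a contact map from known residue coordinates; a map of this kind is what a contact predictor would try to reproduce from the sequence alone. It is only an illustration under stated assumptions: the 8 angstrom cutoff is a common convention rather than a choice made in this project, and the function name `contact_map` and the toy coordinates are hypothetical.

```python
import math

# Assumption: two residues are "in contact" when their representative atoms
# (for example C-beta) lie within 8 angstroms. This cutoff is a common
# convention, not something fixed by this project.
CONTACT_THRESHOLD = 8.0


def contact_map(coords, threshold=CONTACT_THRESHOLD, min_separation=1):
    """Return a symmetric boolean matrix of residue-residue contacts.

    coords: list of (x, y, z) tuples, one per residue, in angstroms.
    min_separation: minimum sequence distance |i - j| for a pair to count.
    """
    n = len(coords)
    contacts = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) < threshold:
                contacts[i][j] = contacts[j][i] = True
    return contacts


# Hypothetical toy chain of four residues laid out along the x axis.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
toy_map = contact_map(toy_coords)
```

In practice `min_separation` is often raised so that trivially close sequence neighbours are excluded, leaving only the medium- and long-range contacts that constrain the fold.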

A current computational limitation is the memory available on GPUs: larger models tend to give better results, but model size is bounded by the memory of a single GPU. In collaboration with DigDeeper, we will investigate techniques for distributing models across many GPUs.
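
One possible way to distribute a model is to place consecutive layers on different GPUs so that no single device has to hold the whole network. The sketch below shows only the placement step, as a greedy partition of layers under a per-device memory budget. It is a minimal illustration, not the technique this project will adopt: the function `partition_layers`, the layer footprints, and the 16 GB budget are all hypothetical, and activation memory and communication cost are ignored.

```python
def partition_layers(layer_sizes, device_capacity):
    """Greedily assign consecutive layers to devices.

    layer_sizes: memory footprint of each layer (any consistent unit).
    device_capacity: memory available per device (same unit).
    Returns a list of stages; each stage is a list of layer indices
    intended to live on one device.
    """
    stages, current, used = [], [], 0
    for index, size in enumerate(layer_sizes):
        if size > device_capacity:
            raise ValueError(f"layer {index} does not fit on a single device")
        # Start a new stage when the next layer would overflow this device.
        if used + size > device_capacity:
            stages.append(current)
            current, used = [], 0
        current.append(index)
        used += size
    if current:
        stages.append(current)
    return stages


# Hypothetical layer footprints in GB and a hypothetical 16 GB device.
stages = partition_layers([6, 6, 6, 10, 4, 12], device_capacity=16)
```

Here the number of stages is the number of GPUs the model would need under this simple scheme; a real system would also have to balance compute across stages and schedule the data flowing between them.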