AI after AlphaFold
New preprint describes a novel parameter-free geometric transformer of atomic coordinates to predict biological interfaces in proteins
And it runs so fast that it can even scan large ensembles of protein structures to search for interaction-prone amino acids.
Machine learning, artificial neural networks, and other mathematical methods based on “artificial intelligence” (I don’t really like the term, but it sticks!) have been applied to scientific problems for decades. But as we all know, they are now having unprecedented applications, changing sciences like chemistry and biology in radical ways.
Probably one of the most striking applications of modern AI is in predicting protein structures, which began around 5–10 years ago and reached a peak with AlphaFold 2, presented in late 2020 and 2021. I discussed AlphaFold 2 in detail in previous articles.
Scientists working on protein structure prediction at first reacted negatively to the impact of AlphaFold, because it disrupted their fields by achieving many of their long-sought goals. However, after a short period of mourning they embraced AlphaFold and actually capitalized on it to make new discoveries and develop new tools, many of which I have already discussed.
After all, AlphaFold 2 didn’t solve all the relevant problems in molecular and structural biology. In fact, it solved only one part of a huge puzzle (which doesn’t make the achievement small at all!). What AlphaFold 2 kind of solved (I say “kind of” because even this problem isn’t fully solved, and I keep stressing AlphaFold 2 because the first version wasn’t that good at it yet) is predicting the so-called “tertiary structures” of proteins, which essentially means how their constituent atoms are arranged in 3D space.
But protein structures have several levels of complexity. Proteins are long linear chains of amino acids that fold in 3D space into tertiary structures, and these can in turn form higher-order structures, i.e. complexes between multiple proteins or between proteins and other biological macromolecules such as nucleic acids (DNA and RNA), or with membranes, ions, small molecules, etc. In most cases, in fact, the biological function of a protein resides in these complexes or is modulated by them in a physiologically relevant way.
When a protein interacts with another protein we talk about a protein-protein complex, and AlphaFold 2 can predict some of these interactions (especially in its “AlphaFold Multimer” flavor) but it isn’t very good at it yet. And if we consider the other kinds of interactions that proteins can establish, AlphaFold is out of the game. It is just not designed to predict interactions between proteins and other kinds of molecules: DNA, RNA, ions, small molecules (amino acids, metabolic intermediates, cell signaling molecules, etc.), or biological membranes and their constituent lipids.
Modeling these other interactions is the next step on the road to modeling biological structures, interactions, and functions at the atomic level, and many groups have been working on this for years. It wouldn’t surprise me if DeepMind itself now moves on to tackle some of these other interactions that proteins can engage in. In particular, predicting small molecule binding is of huge relevance to pharma, because most compounds in clinical use are themselves small molecules that interact with specific proteins.
To learn more about the next routes for AI in protein structure prediction and in structural biology and structural bioinformatics in general, you can check a recent article I wrote.
Predicting what a protein will interact with by using a parameter-free geometric transformer
A new preprint from the lab where I work has now tackled this exact question using a novel formulation:
Given a protein’s structure or model, predict what interfaces it can form to bind other proteins, or nucleic acids, or lipids, or ions, or other kinds of small molecules.
The doctoral student leading this work developed a geometric transformer that reads and processes the 3D coordinates of the input protein and produces residue-specific scores that predict how likely each amino acid of the protein is to be part of an interface with other proteins, with nucleic acids, with ions, etc. The method, called PeSTo after Protein Structure Transformer, has very high accuracy, barely confuses interface types, and has a few clear advantages over alternative methods:
- Running the model doesn’t involve any calculation of the input protein’s surface, which most alternative methods require. Surface calculations are slow and very sensitive to errors in the 3D structures.
- The model runs in milliseconds, including its loading time, which means you can process large numbers of structures in a short amount of time. In fact, it is so fast that it can process whole molecular dynamics trajectories within seconds, which turns out to be useful for identifying transient interfaces that are only accessible when the protein moves, as we show. We were also able to process the whole human proteome, discovering new biology.
- The model doesn’t rely on any parametrization or even classification of atoms, as it is trained solely on atomic elements and positions in space. Thus, although we applied PeSTo to proteins and their C, N, O atoms, it should be easily re-trainable for other purposes, for example in materials science.
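To make the "elements and positions only" idea concrete, here is a minimal sketch of how atoms could be described by their element alone, with no radius or charge attached. It is not taken from PeSTo's code: the one-hot encoding, the `ELEMENTS` vocabulary, and the `one_hot_elements` helper are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical element vocabulary (an assumption, not PeSTo's actual list).
# One extra column at the end catches any element not listed here.
ELEMENTS = ["C", "N", "O", "S"]


def one_hot_elements(symbols):
    """Return an (n_atoms, len(ELEMENTS) + 1) one-hot matrix from element symbols."""
    index = {element: i for i, element in enumerate(ELEMENTS)}
    features = np.zeros((len(symbols), len(ELEMENTS) + 1), dtype=np.float32)
    for row, symbol in enumerate(symbols):
        # Unknown elements fall into the last ("other") column.
        col = index.get(symbol.strip().upper(), len(ELEMENTS))
        features[row, col] = 1.0
    return features


# A backbone-like fragment plus a zinc ion, which lands in the "other" column.
features = one_hot_elements(["N", "C", "C", "O", "Zn"])
```

Because nothing here is specific to proteins, swapping the vocabulary for another set of elements is all it would take to describe a different kind of system in the same way.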
A new geometric transformer for atomic coordinates
Let me mention a few key points about how PeSTo works. For more details, you can refer to the preprint.
PeSTo treats protein structures as clouds of point atoms, representing the geometry through pairwise distances and relative displacement vectors that guarantee translation invariance. Each point atom is described using only its element name, with none of the numerical parameters that other methods use, such as radius or charge. Each atom is encoded through a geometric transformer that accounts for its local neighborhood through scalar and vector states and through distances computed from the surrounding atoms at increasing distances. Upon query, this descriptor is propagated through the network, producing atom-specific outputs through a multi-head attention operation. The atom-based outputs are then gathered for each protein residue by two additional modules that end up predicting whether each residue of the protein is likely to be at an interface or not.
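The geometric part of this description can be sketched in a few lines of NumPy. This is only an illustration of the idea, not PeSTo's implementation: the function names are hypothetical, and selecting a local neighborhood as the `k` nearest atoms is an assumption. The check at the end shows why distances and displacement vectors are translation invariant: shifting every atom by the same vector leaves both unchanged.

```python
import numpy as np


def geometry_features(coords):
    """Pairwise distances and unit displacement vectors for an (n_atoms, 3) array."""
    disp = coords[None, :, :] - coords[:, None, :]   # disp[i, j] = x_j - x_i
    dist = np.linalg.norm(disp, axis=-1)             # shape (n_atoms, n_atoms)
    safe = np.where(dist > 0, dist, 1.0)             # avoid dividing by zero on the diagonal
    unit = disp / safe[..., None]                    # shape (n_atoms, n_atoms, 3)
    return dist, unit


def nearest_neighbors(dist, k):
    """Indices of the k nearest atoms to each atom (assumes no two atoms coincide)."""
    order = np.argsort(dist, axis=1)
    return order[:, 1:k + 1]                         # column 0 is the atom itself


# Random stand-in coordinates; a real input would come from a structure file.
rng = np.random.default_rng(0)
coords = rng.normal(size=(6, 3))

dist, unit = geometry_features(coords)
shifted_dist, shifted_unit = geometry_features(coords + np.array([10.0, -5.0, 3.0]))

# Translating the whole structure changes neither distances nor displacement vectors.
translation_invariant = np.allclose(dist, shifted_dist) and np.allclose(unit, shifted_unit)
neighbors = nearest_neighbors(dist, k=3)
```

Note that the full pairwise matrix grows quadratically with the number of atoms, so for large structures one would restrict the computation to local neighborhoods rather than build it in full.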
Based on a dataset derived from the Protein Data Bank, we trained the model to output residue-wise probabilities of engagement in protein-protein, protein-nucleic acid, protein-ion, protein-ligand, and protein-lipid interfaces.
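As a rough picture of what such an output looks like downstream, the sketch below turns a table of residue-wise probabilities into lists of predicted interface residues. The scores are made up, and the 0.5 cutoff, the column order, and the `interface_residues` helper are assumptions for illustration, not values or code from the preprint.

```python
import numpy as np

# Column order is an assumption made for this example.
INTERFACE_TYPES = ["protein", "nucleic acid", "ion", "ligand", "lipid"]


def interface_residues(probabilities, threshold=0.5):
    """Map each interface type to the residue indices scoring above the threshold."""
    probabilities = np.asarray(probabilities)
    return {
        name: np.flatnonzero(probabilities[:, col] > threshold).tolist()
        for col, name in enumerate(INTERFACE_TYPES)
    }


# Made-up scores for four residues (rows) and five interface types (columns).
scores = np.array([
    [0.05, 0.91, 0.10, 0.02, 0.01],
    [0.10, 0.85, 0.60, 0.03, 0.02],
    [0.02, 0.20, 0.75, 0.01, 0.01],
    [0.01, 0.03, 0.02, 0.01, 0.01],
])

predicted = interface_residues(scores)
```

Each interface type is thresholded independently here, so a single residue (residue 1 in this toy table) can be flagged for more than one kind of interface.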
Webserver implementation and a concrete example
The preprint includes some selected examples. I’ll show you here one specific example I ran on the webserver implementation at https://pesto.epfl.ch
When you access the website you are given the option to carry out predictions on:
- A protein structure from the PDB, entered with its 4-character ID
- A protein model pre-computed in AlphaFold-EBI’s database, entered as a UniProt ID
- A protein structure/model that you upload.
Let’s try out here a structure from the PDB, as this allows me to introduce another feature of the input page:
I chose 4ITQ on purpose. This is an X-ray structure of a protein bound to DNA. The biological assembly as annotated in the PDB marks one specific protein-DNA surface, but solution NMR experiments I did on a related protein for another study revealed a more extensive DNA-interacting surface. What does PeSTo predict?
Let’s first look at all of PeSTo’s predictions for this protein:
PeSTo predicts a rather large surface of residues that could be involved in binding nucleic acids, and also a loop that could bind ions. It predicts nothing for binding other proteins, lipids, or ligands other than ions.
The predicted DNA-binding interface is very large and very consistent with the NMR results in my previous paper, which indicated binding through at least two interfaces and not just one as proposed by the X-ray structure. Moreover, AFM experiments in that paper showed that this protein introduces strong loops and kinks in DNA, presumably because it forces DNA to wrap around it, unwinding it to achieve some as yet unclear biological function.
Applications to discovering interfaces in molecular dynamics simulations and in foldomes
PeSTo runs so fast that we can apply it to very large numbers of structures. For our preprint we tried it as a tool to identify interfaces in proteins subjected to molecular dynamics simulations, and we also applied it to the collection of structures of the full human proteome.
Applied to molecular dynamics simulations, PeSTo is very useful because it can automatically detect interfaces that might not be obvious in the structure used to start the simulation but become exposed as the protein moves. This could be especially powerful for discovering so-called cryptic pockets, i.e. small pockets on the protein surface that appear and disappear as the protein moves and hence might be missed in static X-ray structures.
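One simple way to look for such transient interfaces, once every frame of a trajectory has been scored, is sketched below. The per-frame scores are made up, and the `transient_interface_residues` helper, the 0.5 cutoff, and the 10% occupancy criterion are assumptions for illustration rather than the analysis used in the preprint.

```python
import numpy as np


def transient_interface_residues(frame_scores, threshold=0.5, min_fraction=0.1):
    """Residues below the threshold in the first frame but above it in at least
    min_fraction of all frames."""
    frame_scores = np.asarray(frame_scores)       # shape (n_frames, n_residues)
    above = frame_scores > threshold
    hidden_at_start = ~above[0]                   # not predicted in the starting structure
    fraction_exposed = above.mean(axis=0)         # share of frames with a positive prediction
    return np.flatnonzero(hidden_at_start & (fraction_exposed >= min_fraction)).tolist()


# Made-up interface scores for 5 frames (rows) and 4 residues (columns).
frame_scores = np.array([
    [0.9, 0.1, 0.2, 0.1],
    [0.8, 0.2, 0.7, 0.1],
    [0.9, 0.1, 0.8, 0.2],
    [0.9, 0.3, 0.3, 0.1],
    [0.8, 0.1, 0.6, 0.1],
])

transient = transient_interface_residues(frame_scores)
```

In this toy trajectory, residue 0 is already predicted in the starting frame and residues 1 and 3 never cross the cutoff, so only residue 2 is reported as a candidate that shows up during the dynamics.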