DeepMind AlphaFold2: Highly accurate protein structure prediction


The key principle of the building block of the network, named Evoformer, is to view protein structure prediction as a graph inference problem in 3-D space where the edges of the graph are defined by residues in proximity.

Source: AlphaFold2 paper

The Evoformer is responsible for encoding and reasoning about the 3-D protein graph. In the MSA representation produced above, the columns encode the individual residues of the input protein sequence while the rows encode the aligned sequences in which those residues appear. One of the upgrades over AlphaFold1 is that the MSA update operation is applied within every block of the network rather than only once; this allows continuous communication between the building blocks and suits their iterative design.
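To make the shapes concrete, here is a minimal NumPy sketch of the two representations the Evoformer carries and of the per-block update pattern. The sizes and the update functions are placeholders of my own, not AlphaFold2's actual modules:

```python
import numpy as np

# Toy sizes for illustration only (not AlphaFold2's real dimensions).
n_seq, n_res, c_msa, c_pair = 8, 64, 32, 16

# MSA representation: rows index aligned sequences, columns index residues.
msa = np.random.randn(n_seq, n_res, c_msa)

# Pair representation: one feature vector per residue pair (the graph edges).
pair = np.random.randn(n_res, n_res, c_pair)

def evoformer_block(msa, pair):
    """Placeholder for one Evoformer block: both representations are
    updated (and exchange information) inside every block."""
    msa = msa + 0.1 * np.tanh(msa)      # stand-in for the MSA update
    pair = pair + 0.1 * np.tanh(pair)   # stand-in for the pair update
    return msa, pair

# The MSA update is applied within every block, not just once up front.
for _ in range(4):                       # the real network stacks 48 blocks
    msa, pair = evoformer_block(msa, pair)
```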

To properly process a 3-D protein structure, many constraints must be satisfied throughout the network; one of them is the triangle inequality on distances (the distance from point A to point B via a third point C can never be shorter than the direct distance from A to B). To satisfy this condition, they add an extra logit bias to the attention mechanism to account for a triangle’s “missing edge”, and they define a “triangle multiplicative update” operation that uses two edges of a triangle to update the missing third one.
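As a rough illustration of the “outgoing edges” flavor of that multiplicative update, here is a NumPy sketch. The projection weights and sizes are hypothetical, and the real module also layer-normalizes its input and output, which I omit here:

```python
import numpy as np

rng = np.random.default_rng(0)
n_res, c = 16, 8                               # toy sizes

# Pair representation: one feature vector per directed edge (i, j).
z = rng.standard_normal((n_res, n_res, c))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical projection weights (randomly initialized, not trained).
Wa, Wb, Wga, Wgb, Wo = (0.1 * rng.standard_normal((c, c)) for _ in range(5))

def triangle_update_outgoing(z):
    """Update edge (i, j) from the two 'outgoing' edges (i, k) and (j, k),
    so every triangle i-j-k informs its third edge."""
    a = sigmoid(z @ Wga) * (z @ Wa)            # gated left-edge features
    b = sigmoid(z @ Wgb) * (z @ Wb)            # gated right-edge features
    # Sum over the shared third vertex k: out[i, j] = sum_k a[i, k] * b[j, k]
    out = np.einsum('ikc,jkc->ijc', a, b)
    return z + out @ Wo                        # residual update

z = triangle_update_outgoing(z)
```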

They also use a variant of attention in the Evoformer known as axial attention, which attends along one tensor axis at a time and thus naturally aligns with the multiple dimensions of tensors in both the encoding and decoding settings (source: Papers With Code).
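A minimal sketch of that row-wise and column-wise attention pattern over the MSA representation, assuming a bare single-head attention with no learned projections:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(x):
    """Bare single-head self-attention over the second-to-last axis.
    No learned projections or multiple heads; illustration only."""
    scores = softmax(np.einsum('...qc,...kc->...qk', x, x) / np.sqrt(x.shape[-1]))
    return np.einsum('...qk,...kc->...qc', scores, x)

n_seq, n_res, c = 8, 64, 32
msa = np.random.randn(n_seq, n_res, c)

# Axial attention: attend along one axis at a time instead of over all
# n_seq * n_res positions at once, which keeps the cost manageable.
row_out = attend(msa)                                        # within each sequence
col_out = attend(msa.transpose(1, 0, 2)).transpose(1, 0, 2)  # within each residue column
```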

Model Training

For their training data, they used a mixture of labeled and unlabeled data, which I guess makes this a semi-supervised model (if only it were fully unsupervised, that would have been even more impressive). They used an approach similar to noisy-student self-distillation. In essence, they first predict the structures of an unlabeled portion of the dataset, then build a new labeled dataset from those predictions by filtering them down to a high-confidence subset. They then train the same architecture again from scratch on a mixture of the original labeled dataset and this new subset. This makes effective use of the unlabeled data and improves the accuracy of the resulting network.
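Here is a toy sketch of that self-distillation loop, with a linear least-squares model and a made-up confidence score standing in for AlphaFold2 and its actual structure-confidence filter:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the real pipeline: 'train' fits a linear model and
# 'confidence' is a placeholder score (the paper instead filters predicted
# structures to a high-confidence subset).
def train(X, y):
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def predict(w, X):
    return X @ w

def confidence(w, X):
    return -np.abs(predict(w, X))        # placeholder: higher is better

# Labeled set (experimental structures) and unlabeled set (sequences only).
X_lab, y_lab = rng.standard_normal((100, 5)), rng.standard_normal(100)
X_unlab = rng.standard_normal((1000, 5))

# 1. Train a first model on the labeled data only.
w = train(X_lab, y_lab)

# 2. Pseudo-label the unlabeled data and keep a high-confidence subset.
pseudo = predict(w, X_unlab)
scores = confidence(w, X_unlab)
keep = scores > np.quantile(scores, 0.7)

# 3. Retrain the same architecture from scratch on the mixture.
X_mix = np.vstack([X_lab, X_unlab[keep]])
y_mix = np.concatenate([y_lab, pseudo[keep]])
w = train(X_mix, y_mix)
```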

Structure Prediction

Their prediction module relies on the 3-D backbone structure defined above, which builds on the MSA and pair representations and therefore also benefits from unlabeled sequence data. They also randomly mask residues within the MSA and use a BERT-style objective to predict the masked elements of the MSA sequences. This motivates the network to reason about the evolutionary relationships found within the protein sequences.
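The masking step itself is simple. A sketch, assuming a 21-token vocabulary and BERT's usual ~15% masking rate (both choices here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_seq, n_res = 8, 64
VOCAB = 21          # e.g. 20 amino acids + a gap token (illustrative choice)
MASK = VOCAB        # an extra token id reserved for masking

msa_tokens = rng.integers(0, VOCAB, size=(n_seq, n_res))

# Randomly mask ~15% of positions; the network is then trained to
# recover them from the rest of the alignment (a BERT-style objective).
mask = rng.random((n_seq, n_res)) < 0.15
corrupted = np.where(mask, MASK, msa_tokens)

# The masked positions and their true identities are the training targets.
targets = msa_tokens[mask]
```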

One of the sections that is becoming increasingly significant in bioinformatics ML papers is model interpretability. They didn’t include many details in this section, but my main takeaway was that training a separate network for interpretability is starting to become a convention (in their case, they trained a separate Structure Module, which is only one part of the network). I believe that for medical projects, this is well worth the effort.

Model limitations

They note some situations in which the model doesn’t perform well. The first is that accuracy drops greatly when the mean alignment depth is less than ~30 sequences; they argue that the MSA information is needed to pin down the correct structure within the early stages of the network. They also found that depths beyond ~100 sequences don’t lead to great improvements, so the ideal range seems to be roughly 30–100 sequences. Finally, they found that the network performs worse on some types of proteins that have few intra-chain contacts, for example when a chain’s structure is largely determined by interactions with other chains.

Conclusion

Of course, I find this all quite impressive and fascinating, but the fact is that this is still in a research/experimental phase, so I wouldn’t say the protein-folding problem has actually been solved yet. A lot of models give impressive performances in labs and competitions and then fail in real-world scenarios; hopefully, AlphaFold2 isn’t one of them. But at least now that they have released the code here, the community can perform all sorts of tests and benchmarks.
