
 COMPUTATIONAL BIOLOGY 

Solving the protein folding problem with artificial intelligence


By Matt Warren

16th March 2020


An insight into AlphaFold and how DeepMind are advancing scientific discovery.

Image source: DeepMind.com

Back in 2016, artificial intelligence (AI) company DeepMind embarked on their first big science project, developing a system to address the “protein folding problem” — an age-old challenge at the heart of biology. They made significant contributions to the field, demonstrating once again the power of machine-aided scientific discovery and their capabilities as a world-leading centre for AI research.

​

With their results and methods recently published in the scientific journals Nature and Proteins, this article takes a deeper look at their system, AlphaFold, and what this work could mean for the future of scientific research. Along the way we’ll discuss what the protein folding problem is and why it’s important, followed by a brief introduction to machine learning and how AlphaFold actually works.


 So what is the protein folding problem? 

​

Let's start with proteins. These macromolecules are essential for all known forms of life, performing a huge range of functions. They make up the structure of cells in our bodies; carry out all the reactions essential for our metabolism; and enable communication and coordination between our 37 trillion cells, to name just a few of their many functions.

 

Proteins consist of amino acids (also called residues) linked together in chains, which then coil and fold into a three-dimensional structure, like how an earphone cord can form a tangled bundle in your pocket. But unlike these tangles, proteins don’t fold randomly: assuming nothing goes wrong, the same string of amino acids will fold into the same structure time and time again.

​

This is important because the structure of almost every protein is determined by, and is essential for, its function. Take haemoglobin, the oxygen-delivery protein in our blood. To absorb and release oxygen in the right places, each haemoglobin unit undergoes a series of small, precise changes in its structure upon binding its cargo. A single error in one of its 574 amino acids, an otherwise subtle alteration to its overall shape, and its transport function is lost.

Protein structure and levels of organisation. A string of amino acids (primary structure) can arrange into coils and sheets through hydrogen bonds (secondary structure), which then interact to 'fold' into a 3D shape (tertiary structure). Some proteins also exist as complexes made up of more than one amino acid chain (quaternary structure).

​

​

The exact sequence of amino acids that makes up a given protein is contained in the genetic information stored in DNA, but the instructions for how it must fold are seemingly missing. The 3D structures of proteins are in fact determined intrinsically by the interactions between their residues, for example between positively and negatively charged atoms, and by interactions with water and other molecules that surround them.

 

A protein will fold in a way to minimise the energy of these interactions, leading to what’s called a thermodynamically stable, or native, state. Predicting how the chains will fold into their native 3D structure based on their amino acid sequence is what’s known as the protein folding problem.


 Why is it a problem at all? 

​

Because proteins typically range in size from hundreds to thousands of amino acids, where each amino acid in the chain can twist and turn relative to the next, the number of different shapes that a protein could adopt is astounding.

 

In fact, the number of possible configurations is more than the number of atoms in the universe. In other words, even if we could calculate each configuration in a fraction of a second, using all of the computing resources on earth, the calculations would still take billions of years to complete. Brute force is not an option.
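The arithmetic behind this claim is easy to reproduce. As a back-of-the-envelope sketch (the numbers here are illustrative assumptions: real backbones have more than three conformations per residue, and 150 residues is a small protein):

```python
# Combinatorial explosion in protein folding (Levinthal's paradox),
# with deliberately conservative, made-up numbers.
conformations_per_residue = 3
n_residues = 150

total = conformations_per_residue ** n_residues

evaluations_per_second = 1e18          # a generous exascale computing budget
seconds_per_year = 3.15e7
years = total / evaluations_per_second / seconds_per_year

print(f"{total:.2e} configurations; ~{years:.2e} years to enumerate them")
```

Even under these forgiving assumptions, the enumeration time dwarfs the age of the universe, which is why brute force is ruled out.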

​

Understanding this code, or otherwise developing a computational tool for protein structure prediction, is therefore the central challenge of the protein folding problem, and one that DeepMind and scores of other researchers have spent years working to address.

 

​

 Why is it so important? 

​

As the shape of a protein is closely related to its function, the more we know about protein structures, the more we understand about their biological roles and relevance in disease.

 

Most of the drugs humans take to control or cure disease work by interacting with proteins, for example by blocking the site of the molecule where an unwanted chemical reaction might otherwise take place. A knowledge of protein structures can therefore translate to improved medicines and an understanding of how they work.

 

Protein misfolding has also been implicated in a variety of diseases including Alzheimer’s and diabetes. In these cases, understanding how the folding process works should also provide insights into what happens when it goes wrong.

 

Last but not least, solving the protein folding problem will also be a boon for synthetic biology: the field of research associated with the (re)engineering of proteins with useful applications. These include, for example, designing a protein that could break down plastics in the environment or improving the accuracy of gene editing proteins which could help cure inherited diseases (see our article on CRISPR).

 

Currently there are multiple ways to determine a protein's structure experimentally, such as X-ray crystallography and cryo-electron microscopy. While these methods have been remarkably effective (there are currently over 150,000 structures deposited in the Protein Data Bank), they can be very costly, time-consuming and, for certain proteins, extremely challenging.

 

So, while the problem is a tough one to crack, the potential impacts are huge and there is ultimately little disagreement on the bottom line: if the structure of a given protein could be determined accurately from nothing more than its amino acid sequence, this would enable unprecedented advances in scientific research and medicine.


 So how is this problem being addressed? 

​

Given these benefits, it’s not surprising that this problem has received considerable attention from many researchers over the last few decades. In a situation rather unusual in academic research, there even exists a formal competition wherein scientists go head-to-head to compare their latest structure prediction methods.

 

CASP - short for the Critical Assessment of Structure Prediction - was founded in 1994 by John Moult, a computational biologist at the University of Maryland, and involves a series of experiments which now represent the gold standard for assessing these prediction techniques: the "protein olympics", if you will.

​

The CASP organisers select proteins with unpublished structures and then challenge the teams to predict them from their amino acid sequences. As the structures are unpublished, but known to the organisers, the competition is “blind”, with all entrants starting with the same information. Once submitted, each prediction is compared to the experimentally determined structure and given a score based on how accurately it matches the “real” thing.

 

The competition last took place in 2018, its 13th edition (CASP13), with over 100 teams competing across a number of different categories. The event doesn’t usually attract much publicity, but this edition drew significant coverage when AlphaFold, the system built by DeepMind, swept the competition.

 

Their models were able to generate more accurate predictions than any other research group, topping the tables by an impressive margin, in what the organisers called “unprecedented progress in the ability of computational methods to predict protein structure”.

​

To understand how they won, the next section will take a look at how their system works and how it's different from the rest. Keep in mind that this is a complex topic, and although this article isn’t intended to go too far beyond the basics, you can skip to the next section if you're more interested in the results.


 How does AlphaFold work? 

 

Fundamentally, AlphaFold consists of a neural network and gradient descent algorithms which generate and optimise potential protein structures from their amino acid sequence and coevolutionary data. So, what does this all mean?

 

Neural networks are a form of machine learning - algorithms which can ‘learn’, or infer knowledge from, (often complex) data in an iterative, self-improving way. As with many machine learning algorithms, this works by first ‘training’ the network on a set of labelled data, where the ‘label’ corresponds to the conclusion that should be drawn.

 

For example, if we wanted to build a model to predict the price of a car, we might provide it with descriptive information such as the make, age and mileage, while the label would be the price of the car itself. During the training stage, the algorithm attempts to quantify any relationship between these descriptors and the price. If suitably trained, the model could then be used to predict the price when given the descriptions alone.
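This training-then-predicting pattern can be sketched in a few lines. All of the numbers below are invented, and a simple least-squares linear model stands in for a neural network; the principle of fitting descriptors to labels is the same:

```python
import numpy as np

# Toy labelled data: each row holds descriptors [age in years, mileage in
# thousands of miles]; the 'label' is the sale price. Values are invented.
X = np.array([[1.0, 10.0], [3.0, 40.0], [5.0, 70.0], [8.0, 120.0], [10.0, 150.0]])
y = np.array([18000.0, 14000.0, 10500.0, 6500.0, 4000.0])

# 'Training': find the weights that best map descriptors to labels
# (here in one shot by least squares; a neural network does it iteratively).
A = np.column_stack([X, np.ones(len(X))])   # append a bias column
weights, *_ = np.linalg.lstsq(A, y, rcond=None)

# Prediction from the descriptors alone: a 4-year-old car with 55k miles.
price = float(np.array([4.0, 55.0, 1.0]) @ weights)
print(round(price))
```

The predicted price falls, as expected, between those of the most similar training examples.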

 

The name “neural network” comes from the fact that their architecture loosely resembles the network of biological neurons in the brain, with thousands or even millions of interconnected nodes. Information is transmitted between the nodes via connections, each of which has an associated weight; these weights are continually updated during training and represent the model’s ‘memory’.
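A single node of such a network, and one training step of the kind that updates its weights, can be sketched as follows (all numbers are arbitrary, and the update rule is deliberately simplified):

```python
import math

def neuron(inputs, weights, bias):
    """One node: a weighted sum of its inputs squashed by a sigmoid."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))

weights, bias = [0.5, -0.3], 0.1
x, target = [1.0, 2.0], 1.0

before = neuron(x, weights, bias)

# A simplified training step: nudge each weight against the error.
# The updated weights are the network's 'memory' of what it has seen.
error = before - target
learning_rate = 0.1
weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]

after = neuron(x, weights, bias)
print(before, after)   # the output moves towards the target
```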
 

Like a lot of machine learning algorithms, neural networks have been around for a while, but have seen a recent resurgence thanks in part to improvements in computer processing hardware. They are now the method of choice for analysing complex input data, such as in image recognition and language processing.

 

The neural network at the heart of AlphaFold is a convolutional neural network, similar to those used in image recognition tasks. The network was trained using protein sequence data and then asked to predict the distances between the residues (and the angles that connect them) in their fully folded form.
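The convolution operation at the core of such networks is simple: a small kernel is slid across a grid of features. Here is a minimal sketch over a toy L×L map of pairwise residue features (the map and kernel values are invented; a real network stacks many such layers with learned kernels):

```python
import numpy as np

rng = np.random.default_rng(0)
L = 6                                   # number of residues
feature_map = rng.random((L, L))        # a toy pairwise feature map
kernel = np.full((3, 3), 1 / 9)         # a simple 3x3 averaging kernel

# Slide the kernel over the map: each output value summarises a local
# 3x3 neighbourhood, which is how convolutional layers pick up patterns.
out = np.zeros((L - 2, L - 2))
for i in range(L - 2):
    for j in range(L - 2):
        out[i, j] = np.sum(feature_map[i:i + 3, j:j + 3] * kernel)

print(out.shape)
```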

 

The ability to predict 3D structural information in this way relies on a concept in biology known as molecular coevolution. Briefly, this simple but powerful idea suggests that if we compare the sequences of similar proteins (e.g. haemoglobin from humans, pigs and mice) and find residues that co-vary (i.e. those that have evolved together in a way that they seem dependent on one another), they are likely to interact with one another chemically in the folded state, and must therefore be in close physical proximity.

 

To extract these co-evolutionary couplings, the sequence data fed into the model includes features from so-called ‘multiple sequence alignments’, which are just the sequences of many similar proteins (often 10,000-100,000 sequences) lined up together so that their differences, and thus co-evolved residues, can be mapped.
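To make the idea concrete, here is a toy alignment (the sequences are invented) with a crude covariation score that compares how often residue pairs co-occur against what independence between columns would predict:

```python
from collections import Counter

# Toy multiple sequence alignment: rows are homologous sequences,
# columns are residue positions. Columns 1 and 3 always mutate together.
msa = ["AKLE", "AKLE", "ARLD", "ARLD", "AKLE", "ARLD"]

def covariation(i, j):
    """Crude co-evolution signal between alignment columns i and j."""
    n = len(msa)
    pairs = Counter((s[i], s[j]) for s in msa)
    col_i = Counter(s[i] for s in msa)
    col_j = Counter(s[j] for s in msa)
    # Sum how far observed pair frequencies deviate from independence.
    return sum(abs(c / n - (col_i[a] / n) * (col_j[b] / n))
               for (a, b), c in pairs.items())

print(covariation(1, 3))   # 0.5 - coupled columns give a strong signal
print(covariation(0, 2))   # 0.0 - invariant columns give no signal
```

Real methods use far more sophisticated statistics on vastly larger alignments, but the underlying signal is the same.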

 

This clever approach was suggested decades ago, and thanks to the exponential increase in available sequence data it is now employed in most protein structure models. AlphaFold, however, took it a step further by using the coevolutionary data to predict not only whether two residues were in contact (i.e. a “yes/no” output) but the actual distances between them.

 

Not only this, but they then represented each prediction as a smooth function of distance which, together with all the other pairwise distance functions, could be optimised to ‘fold’ the protein and produce a final structure using a mathematical technique known as gradient descent.
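This final folding step can be mimicked with a toy example: place three points in 2D so that their pairwise distances match a set of 'predicted' targets, by repeatedly stepping downhill on the squared error. The target distances are invented here, and real systems optimise torsion angles over thousands of residues, but the principle is the same:

```python
import random

# Target pairwise distances standing in for predicted residue-residue
# distances (invented; a consistent triangle with sides 1, 1 and 1.5).
targets = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 1.5}

random.seed(0)
coords = [[random.random(), random.random()] for _ in range(3)]

def loss_and_grads(coords):
    """Squared distance error and its gradient w.r.t. each coordinate."""
    loss, grads = 0.0, [[0.0, 0.0] for _ in coords]
    for (i, j), target in targets.items():
        dx = coords[i][0] - coords[j][0]
        dy = coords[i][1] - coords[j][1]
        d = max((dx * dx + dy * dy) ** 0.5, 1e-9)
        diff = d - target
        loss += diff ** 2
        for axis, delta in ((0, dx), (1, dy)):
            g = 2 * diff * delta / d
            grads[i][axis] += g
            grads[j][axis] -= g
    return loss, grads

# Gradient descent: step each point downhill on the loss until the
# 'structure' satisfies the predicted distances.
for _ in range(2000):
    loss, grads = loss_and_grads(coords)
    for point, grad in zip(coords, grads):
        point[0] -= 0.05 * grad[0]
        point[1] -= 0.05 * grad[1]

print(f"final loss: {loss:.2e}")   # close to zero: distances satisfied
```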

 

And that’s all there is to it, at least in principle. There are a number of technical details which we’ve glossed over here, but in keeping with our aim of making this article accessible, we’d direct the interested reader to this peer-reviewed article written by another competition entrant, which should fill in the gaps.


AlphaFold architecture. An overview of how the AlphaFold system generates protein structures from primary sequence data. Image: DeepMind.com

​

 

 How did AlphaFold do? 

 

AlphaFold performed very well indeed, scoring considerably higher than all other competitors, both in this year’s competition and in those before it.

 

CASP assesses the accuracy of each predicted structure using a global distance test (GDT) score, which can be thought of as the percentage of amino acids whose predicted positions are in good agreement with the experimentally determined positions.
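A simplified version of this scoring can be sketched as follows. The real assessment first superimposes the two structures and uses cutoffs of 1, 2, 4 and 8 ångströms for the GDT_TS variant; the per-residue deviations below are invented for illustration:

```python
# Per-residue deviations between predicted and experimental positions,
# in angstroms (invented values).
deviations = [0.5, 0.9, 1.5, 3.0, 5.0, 12.0]

def gdt_ts(deviations, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """GDT_TS-style score: the fraction of residues within each distance
    cutoff, averaged over the cutoffs and scaled to 0-100."""
    n = len(deviations)
    fractions = [sum(d <= c for d in deviations) / n for c in cutoffs]
    return 100 * sum(fractions) / len(fractions)

print(round(gdt_ts(deviations), 1))   # 58.3
```

A perfect prediction would place every residue inside even the tightest cutoff and score 100.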

 

In CASP13, AlphaFold achieved a median GDT score of 68.3 across all targets, compared to just 48.2 for the team in second place. If that sounds a long way short of perfect predictions (i.e. 100), well, it is, but a little more context might be needed.

 

Historically, progress in CASP has ebbed and flowed, with GDT scores never exceeding 50. AlphaFold broke this trend, roughly doubling the rate of advancement seen in previous years. AlphaFold was also very successful at predicting structures for targets rated "most difficult" by the organisers – typically proteins which share very little sequence similarity with any known structures, making prediction even more challenging.


Competition results. Bars show the mean GDT scores of the first and second ranked entrants across the previous four competitions, with the cyan bar showing AlphaFold's result. Note: rankings are based on the CASP assessors' formula and are not always equal to mean GDT score.

 

 

Does this mean the protein folding problem is solved? Not yet, but a convincing solution might not be far off. Following the new trajectory that DeepMind have set, we could see computational methods replacing conventional techniques in a matter of years, at least for all but the most challenging of structures. This is why this year’s results carry such significance.

 

Maintaining this pace would likely require a number of conceptual breakthroughs, which are by no means guaranteed but are equally not out of the question, especially given the resources that DeepMind has at its disposal. Ultimately, their entry in CASP13 was an upward departure from previous rates of progress, and with CASP14 on the horizon, it is hard not to feel they are the team to watch.


Update (December 2020): DeepMind’s new system, AlphaFold 2, has not only won CASP14 but achieved an astonishing median GDT score of 92.4. We intend to post a follow-up article detailing this result and its implications. Stay tuned - the protein folding problem may have just been solved.
