Conceptual illustration of computational chemistry
Discovery | Khaled Abdel-Maksoud

AI’s Red Carpet Year

How the pioneering work of five Nobelists is reshaping science … and society 

It’s no exaggeration to say that the work of last year’s Physics Nobelists played a central role in shaping what AI can do today. John J. Hopfield of Princeton University and Geoffrey Hinton of the University of Toronto laid the foundation for the machine learning that powers many of today’s AI-based products and applications.

With the advent of large language models, anyone can now access the remarkable tools made possible by their fundamental discoveries. The most sophisticated of these models include ChatGPT, which now clocks an astounding 200 million weekly active users worldwide, as well as Claude and Llama, all of which can converse with a user on almost any topic. We can use these tools to refine ideas and, more often than not, generate whatever we ask for: entire files of code for a software package, a chapter of a novel, even recipes for dinner.

Hopfield and Hinton’s work also paved the way for the Chemistry Nobel Prize winners. David Baker of the University of Washington and the Howard Hughes Medical Institute, together with Demis Hassabis and John Jumper, both of Google DeepMind, used AI to develop computational methods for predicting and designing proteins. For Hassabis and Jumper, the breakthrough was Google DeepMind’s AlphaFold, the first method able to generate the fully folded structure of (almost) any protein from its amino acid sequence. Baker and his team developed the Rosetta program, which builds novel protein structures from scratch using very little additional data from the Protein Data Bank (PDB). Rosetta has since been refined and augmented with modern machine learning methods to produce RoseTTAFold, which can be used alongside the latest iterations of AlphaFold to generate a folded structure from a given sequence and predict whether the designed structure is likely to fold correctly.

About that protein folding problem… 

Focusing on AlphaFold, the method proved a turning point in protein structure determination when it first caught the attention of biophysicists at the CASP13 competition in 2018. There, AlphaFold showed it could give traditional methods a run for their money, delivering the best overall performance of any method evaluated. A closer look at the results from the A7D team (the name under which the AlphaFold team was registered for the competition) showed that the method performed very well overall but fell short in cases where it lacked sufficient data to work effectively.

AlphaFold2 arrived with significantly improved results over its predecessor, competing against other computational structure prediction methods at CASP14 in 2020. This time around, it blew the competition out of the water. The result that earned AlphaFold2 mainstream headlines claiming it had “solved the protein folding problem” was its ability to predict roughly two thirds of all targets to within experimental error. With AlphaFold2 in hand, Hassabis, Jumper, and their team went on to generate a database of predicted structures for all publicly available protein sequences. At the time of this writing, the AlphaFold database contains more than 200 million structures. Thus, if you have a gene sequence for a protein, it’s likely that an AlphaFold2-generated structure already exists for it in this database, and there’s a good chance it was predicted with high accuracy.
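As an illustration of just how accessible these predictions are, the sketch below builds the public download URL for a predicted structure from a UniProt accession. The URL pattern and model version suffix are assumptions based on the database’s file naming at the time of writing, not an official API guarantee; check the AlphaFold database site for the current scheme.

```python
def alphafold_model_url(uniprot_accession: str, version: int = 4) -> str:
    """Build a download URL for a predicted structure in the AlphaFold
    database, given a UniProt accession.

    The "AF-<accession>-F1-model_v<N>.pdb" pattern is an assumption based
    on the database's file naming at the time of writing.
    """
    return (
        "https://alphafold.ebi.ac.uk/files/"
        f"AF-{uniprot_accession}-F1-model_v{version}.pdb"
    )

# Example: human haemoglobin alpha subunit (UniProt accession P69905).
print(alphafold_model_url("P69905"))
```

From there, any standard structural biology toolkit can load the downloaded PDB file for inspection.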

With this year’s latest release, AlphaFold3, the underlying “learning” method has been overhauled. The new iteration has been rebuilt from the ground up around a “diffusion model” architecture, allowing predictions to go beyond monomeric proteins. As a result, AlphaFold3 can predict structures for a wide range of biomolecules, from multimeric proteins with complex quaternary structures to proteins in complex with ligands, peptides, other proteins, or nucleic acids.

While this is no doubt impressive, a key shortcoming remains that limits the impact AlphaFold can currently make in drug discovery. Generating a structure for almost any protein of interest with a known sequence is remarkable, but building a structural model is only the first of many steps in molecular design. The next step, and the one that AlphaFold and similar structure determination methods struggle with, is understanding the often-subtle changes in the shape and dynamics of a protein as a molecule binds, and how those changes correlate with the energetics of binding. This relationship matters because it lets us design molecules more rationally: for example, what kind of chemistry can be added or removed, and how would that affect binding? Questions like these remain out of reach for these methods, because answering them requires modelling very small but important side-chain movements in the binding site, a level of detail the current models cannot capture.

Nevertheless, these limitations shouldn’t detract from the incredible achievement that AlphaFold has been and continues to be. In a very short time, our perspective on biomolecular structure determination has changed completely for the better. The fact that almost any protein structure one might need can feasibly be constructed with this method or a similar one is a giant leap in access to high-quality biomolecular structures. Protein design methods go even further, proposing structures for novel proteins not found in nature. And with the latest iteration of AlphaFold now publicly available to academic groups, the leaps in progress are set to continue.

What’s next for AI?

Owing to how quickly the field is developing, many of the key players in foundational AI/ML research and development have very large ambitions for 2025. For the biggest among them, including OpenAI, Google via its DeepMind team, and Anthropic to name just a few, a single milestone has come into focus: achieving what’s known as “artificial general intelligence”, or AGI for short. AGI is, in essence, a modelled intelligent agent that can perform most tasks at the level of a technical expert in the relevant domain. Judging what it will take to get from today’s state-of-the-art models to AGI is in equal parts the most challenging and most fascinating thing to watch in AI development this year, not least because the research groups themselves disagree on how it can be reached. A prevailing theory is the so-called “scaling law”: the idea that making a model larger is all that’s needed to make it smarter. Thus far, this has held true, and it has been key to how quickly large language models have become the new best-in-class for natural language processing tasks.
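The intuition behind the scaling law can be sketched as a simple power law in which loss falls smoothly as parameter count grows. The constants below are invented purely for illustration; real scaling-law curves are fitted from large empirical studies, not chosen by hand.

```python
def power_law_loss(n_params: float, a: float = 10.0, alpha: float = 0.07) -> float:
    """Toy scaling-law curve: predicted loss falls as a power of model size.

    The constants `a` and `alpha` are made up for illustration only;
    they are not fitted values from any published study.
    """
    return a * n_params ** (-alpha)

# Under this toy curve, every 10x increase in size lowers the predicted loss.
for n in (1e8, 1e9, 1e10):
    print(f"{n:.0e} params -> predicted loss {power_law_loss(n):.3f}")
```

The key property the sketch captures is monotonicity: under a power law, a bigger model is always predicted to do better, which is exactly the bet the “scaling law” camp is making.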

Scaling this back (pardon the bad pun), what would this supposed “race for AGI” mean for AI/ML in drug discovery? The most important consequence of these rapid developments will, of course, be continuous improvement in the kinds of biomolecular structures that AlphaFold, RoseTTAFold, and similar models can predict. Another key development outside the drug discovery space that has helped build the sophistication of more general models is “multimodality”, or “multimodal models”. These are models composed of several smaller “specialist” models, each trained to perform a specific task in some domain. A multimodal model is structured less like a single “genius” model trained to perform every task imaginable and more like a head “manager” model trained to decide which of its specialists should process a given input. It isn’t hard to imagine that the next AlphaFold version could be multimodal: some models specialised in predicting the overall structure of an input protein sequence, and others specialised in the finer details that the current model often predicts poorly. Whether the “holy grail” of AGI is reached in 2025 or not, this is going to be a very exciting year for AI in drug discovery.
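The manager-and-specialists idea can be sketched in a few lines. This is a deliberately crude, rule-based router: in a real multimodal model both the routing decision and the specialists themselves would be learned networks, and every name below is hypothetical.

```python
from typing import Callable, Dict

class Manager:
    """Toy 'manager' model that delegates each input to a domain specialist.

    Routing here is a plain dictionary lookup; in an actual multimodal
    model the routing decision itself would be learned, not hard-coded.
    """

    def __init__(self, specialists: Dict[str, Callable[[str], str]]):
        self.specialists = specialists

    def __call__(self, domain: str, data: str) -> str:
        # Delegate the input to whichever specialist owns this domain.
        return self.specialists[domain](data)

# Hypothetical specialists: one for the overall fold, one for the fine
# side-chain detail that coarser models often get wrong.
router = Manager({
    "fold": lambda seq: f"coarse structure for {seq}",
    "side_chains": lambda seq: f"refined side chains for {seq}",
})

print(router("fold", "MKTAYIAK"))         # coarse structure for MKTAYIAK
print(router("side_chains", "MKTAYIAK"))  # refined side chains for MKTAYIAK
```

The design point is that each specialist can stay small and focused, while the manager only needs to learn which specialist to trust for a given kind of input.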

Khaled Abdel-Maksoud is a Scientist in Computer-aided Drug Design (CADD) and Machine Learning at Charles River. Khaled has a background in computational chemistry, focusing on how statistical methods and machine learning approaches can be used alongside traditional CADD techniques to accelerate drug discovery. Khaled graduated from King's College London where he studied chemistry with biomedicine. He then completed an MSc in theoretical chemistry at the University of Oxford, and studied toward his PhD at the University of Southampton under the supervision of Professor Jonathan Essex. Before joining Charles River, Khaled worked as a postdoctoral researcher at the University of Southampton, focusing on the development of methods for free energy calculations.