In July of this year, DeepMind, the developer of AlphaFold, announced that it had expanded its collection of predicted protein structures from about 1 million to 220 million. The collection no longer covers only human proteins; it also includes proteins from plants, bacteria, animals and other organisms, and it spans nearly every known protein in DNA databases.
And now another tech giant, Meta, is filling in the dark matter of the protein universe.
Meta researchers used artificial intelligence (AI) to predict the structures of more than 600 million proteins from bacteria, viruses and other as yet uncharacterized microbes.
The ESM Metagenomic Atlas database contains structural predictions for 617 million proteins.
The Meta AI protein team generated these structural predictions using a “large language model,” and described the results in a preprint published on Nov. 1.
Alexander Rives, head of research on the Meta AI protein team, said that these proteins, which come from microbes in soil, the ocean and the human body, are the structures we know least about. They are very mysterious, he said, and offer the potential for great insight into biology.
A “large language model” is an artificial intelligence (AI) model that predicts text from a few letters or words; such models are usually trained on large amounts of text. To apply the approach to protein structure prediction, the research team trained the model on known protein sequences, which are chains of 20 different amino acids, each represented by a letter. The model then learned to “autocomplete” protein sequences in which a proportion of the amino acids had been obscured.
Protein sequence “autocompletion”
Alexander Rives said this training gave the model an intuitive understanding of protein sequences, which contain information about the shape of a protein’s structure. In a second step, inspired by AlphaFold, DeepMind’s pioneering protein structure tool, the team combined this insight with information about the relationship between known protein structures and sequences, so that the model can generate predicted structures from protein sequences.
Meta’s research team said in a report published this summer that the resulting protein structure prediction tool, ESMFold, was not as accurate as AlphaFold but was about 60 times faster, meaning that structure prediction could be scaled up to much larger databases.
As a test case, they decided to apply the model to a large database of “metagenomic” DNA, sequenced in bulk from environments including soil, seawater, the human gut, skin and other microbial habitats. The vast majority of the DNA sequences that encode potential proteins come from organisms that have never been cultured and are unknown to science.
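To make the “autocomplete” idea concrete, here is a minimal Python sketch of how a training example of this kind might be prepared: some letters of a protein sequence are hidden, and the hidden letters become the answers a model would be asked to recover. This is an illustration only, not Meta’s code. The function names, the `_` placeholder, the 15 percent masking fraction and the example sequence are all assumptions made for the sketch, and no actual model is trained here.

```python
import random

# The 20 standard amino acids, each written as a single letter.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Placeholder for a hidden residue (a hypothetical choice for this sketch).
MASK = "_"


def mask_sequence(sequence, mask_fraction=0.15, seed=0):
    """Hide a random fraction of residues.

    Returns the masked string and a dict mapping each hidden
    position to the letter that was removed from it.
    The 15% default is an assumption, not a figure from the article.
    """
    if not sequence or any(letter not in AMINO_ACIDS for letter in sequence):
        raise ValueError("expected a non-empty string of one-letter amino acid codes")
    rng = random.Random(seed)
    n_masked = max(1, round(len(sequence) * mask_fraction))
    positions = sorted(rng.sample(range(len(sequence)), n_masked))
    residues = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = residues[pos]
        residues[pos] = MASK
    return "".join(residues), targets


def fill_in(masked, predictions):
    """Put predicted letters back into the masked positions."""
    residues = list(masked)
    for pos, letter in predictions.items():
        residues[pos] = letter
    return "".join(residues)


# A made-up sequence, used only to show the shape of the data.
example = "MKTAYIAKQRQISFVKSHFSRQ"
masked, targets = mask_sequence(example)
print(masked)   # the sequence with a few letters hidden
print(targets)  # the hidden letters a model would be trained to predict
```

A language model trained this way is scored on how well it recovers the letters in `targets` given only `masked`; a perfect set of predictions passed to `fill_in` would reproduce the original sequence.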
In total, Meta’s team has predicted the structures of more than 617 million proteins. The work only took two weeks.
Alexander Rives said the predictive model is freely available for anyone to use, as is the model’s underlying code.
Of the 617 million predicted protein structures, the model considered more than a third to be of high quality, so researchers can be confident that the overall shape of the protein is correct and, in some cases, can discern finer atomic detail. Many of these structures are completely new, unlike anything in databases of experimentally determined protein structures or in the AlphaFold database of predictions from known organisms.
Martin Steinegger, a computational biologist at Seoul National University, said that a large part of the AlphaFold database consists of structures that are nearly identical to one another, whereas the “metagenome” database should cover a large part of the protein universe that has never been seen before. That, he said, is a great opportunity to shed light on more of these dark proteins.
But Harvard evolutionary biologist Sergey Ovchinnikov is skeptical of many of ESMFold’s hundreds of millions of predictions. Some of the predicted proteins may lack a definite structure, while others may be non-coding DNA that has been mistaken for protein-coding sequence. It seems, he said, that more than half of protein space is still something we know nothing about.
Burkhard Rost, a computational biologist at the Technical University of Munich in Germany, was impressed with the speed and accuracy of ESMFold’s predictions. But he also wonders whether it really has an advantage over AlphaFold, given AlphaFold’s accuracy, when predicting proteins from metagenomic databases. Prediction methods based on language models are better suited to quickly determining how mutations change protein structure, which AlphaFold cannot do. Structure prediction will become leaner, simpler and cheaper, which will open the door to new things, he said.
A DeepMind representative said the company currently has no plans to include metagenomic structure predictions in its database, but did not rule out the possibility of doing so in the future.
But Steinegger says he and his collaborators have used AlphaFold to predict the structures of about 30 million metagenomic proteins, and they hope to discover new RNA virus species among them. The obvious next step for such predictive tools, he argues, is to study the dark matter of biology, and he expects that we will soon see an explosion in the analysis of these metagenomic structures.