Limitless Possibilities – AI Technology Generates Original Proteins From Scratch


According to the researchers, this new technology has the potential to surpass directed evolution and could energize the field of protein engineering by hastening the creation of new proteins for various purposes, such as therapeutics and plastic degradation.

A natural language model has jumpstarted the process of protein design by creating active enzymes.

Researchers have developed an AI system that can generate artificial enzymes from scratch. In laboratory experiments, some of these enzymes demonstrated efficacy comparable to natural enzymes, even when their artificially created amino acid sequences greatly deviated from any known natural protein.

The experiment shows that natural language processing, initially created for reading and writing language text, can grasp certain fundamental concepts of biology. The AI program, known as ProGen, was developed by Salesforce Research and employs next-token prediction to construct artificial proteins from amino acid sequences.
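To make "next-token prediction" concrete, here is a minimal sketch of how a sequence can be built one amino acid at a time. It is not ProGen's code: the function `toy_next_token_probs` is a hypothetical stand-in for a trained model (it simply returns uniform probabilities plus a growing chance of stopping), and the `END` token and all names are assumptions made for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes for the 20 standard amino acids
END = "$"  # hypothetical end-of-sequence token (an assumption for this sketch)


def toy_next_token_probs(prefix):
    """Hypothetical stand-in for a trained language model.

    Returns a probability for every possible next token given the
    sequence generated so far. A real model would compute these from
    learned parameters; here they are uniform over amino acids, with
    the chance of stopping rising linearly until it reaches 1 at length 50.
    """
    p_end = min(1.0, len(prefix) / 50)
    p_aa = (1.0 - p_end) / len(AMINO_ACIDS)
    probs = {aa: p_aa for aa in AMINO_ACIDS}
    probs[END] = p_end
    return probs


def generate(rng, max_len=300):
    """Sample a sequence token by token until END is drawn or max_len is hit."""
    seq = ""
    while len(seq) < max_len:
        probs = toy_next_token_probs(seq)
        tokens = list(probs)
        weights = [probs[t] for t in tokens]
        token = rng.choices(tokens, weights=weights, k=1)[0]
        if token == END:
            break
        seq += token
    return seq


sequence = generate(random.Random(0))
print(sequence)
```

The loop is the whole idea: ask the model for a distribution over the next token, sample from it, append, and repeat. Everything that makes a generated protein plausible would live inside the model that replaces the toy function.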

Scientists said the new technology could become more powerful than directed evolution, the Nobel Prize-winning protein design technology, and it will energize the 50-year-old field of protein engineering by speeding the development of new proteins that can be used for almost anything from therapeutics to degrading plastic.

“The artificial designs perform much better than designs that were inspired by the evolutionary process,” said James Fraser, Ph.D., professor of bioengineering and therapeutic sciences at the UCSF School of Pharmacy, and an author of the work, which was recently published in Nature Biotechnology. A previous version of the paper has been available on the preprint server bioRxiv since July of 2021, where it garnered several dozen citations before being published in a peer-reviewed journal.


“The language model is learning aspects of evolution, but it’s different than the normal evolutionary process,” Fraser said. “We now have the ability to tune the generation of these properties for specific effects. For example, an enzyme that’s incredibly thermostable or likes acidic environments or won’t interact with other proteins.”

To create the model, scientists simply fed the amino acid sequences of 280 million different proteins of all kinds into the machine learning model and let it digest the information for a couple of weeks. Then, they fine-tuned the model by priming it with 56,000 sequences from five lysozyme families, along with some contextual information about these proteins.

The model quickly generated a million sequences, and the research team selected 100 to test, based on how closely they resembled the sequences of natural proteins, as well as how naturalistic the AI proteins’ underlying amino acid “grammar” and “semantics” were.

Out of this first batch of 100 proteins, which were screened in vitro by Tierra Biosciences, the team made five artificial proteins to test in cells and compared their activity to an enzyme found in the whites of chicken eggs, known as hen egg white lysozyme (HEWL). Similar lysozymes are found in human tears, saliva, and milk, where they defend against bacteria and fungi.

Two of the artificial enzymes were able to break down the cell walls of bacteria with activity comparable to HEWL, yet their sequences were only about 18% identical to one another. The two sequences were at most about 90% and 70% identical, respectively, to any known protein.


Just one mutation in a natural protein can make it stop working, but in a different round of screening, the team found that the AI-generated enzymes showed activity even when as little as 31.4% of their sequence resembled any known natural protein.
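The identity percentages quoted above compare sequences position by position. As a rough illustration, and not the method used in the paper, the sketch below computes percent identity for two sequences that are assumed to be already aligned, with `-` marking gaps. The function name, the example strings, and the choice to divide by the number of gap-free positions are all assumptions; other definitions of the denominator are in common use.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of gap-free aligned positions where both residues match.

    Assumes the inputs are already aligned to equal length, using '-'
    for gaps. Positions where either sequence has a gap are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # gap in one sequence: not counted
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Made-up toy sequences, not real lysozymes.
identity = percent_identity("MKVLAAGI-T", "MKILSAG-WT")
print(f"{identity:.1f}% identical")
```

In this toy pair, eight positions are gap-free and six of them match, giving 75%. An AI-generated enzyme that works at roughly 31% identity differs from its nearest natural relative at most of its positions.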

The AI was even able to learn how the enzymes should be shaped, simply from studying the raw sequence data. Measured with X-ray crystallography, the atomic structures of the artificial proteins looked just as they should, although the sequences were like nothing seen before.


Salesforce Research developed ProGen in 2020, based on a kind of natural language processing their researchers originally developed to generate English language text.

They knew from their previous work that the AI system could teach itself grammar and the meaning of words, along with other underlying rules that make writing well-composed.

“When you train sequence-based models with lots of data, they are really powerful in learning structure and rules,” said Nikhil Naik, Ph.D., Director of AI Research at Salesforce Research, and the senior author of the paper. “They learn what words can co-occur, and also compositionality.”

With proteins, the design choices were almost limitless. Lysozymes are small as proteins go, with up to about 300 amino acids. But with 20 possible amino acids, there are an enormous number (20^300) of possible combinations. That’s greater than taking all the humans who lived throughout time, multiplied by the number of grains of sand on Earth, multiplied by the number of atoms in the universe.
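The size of that number, 20 raised to the 300th power, is easy to check directly. The short sketch below counts the possible sequences for a protein of exactly 300 amino acids drawn from 20 options per position; the variable names are illustrative, and shorter proteins are ignored for simplicity.

```python
import math

n_amino_acids = 20  # choices at each position
length = 300        # positions in the chain

# Python integers have arbitrary precision, so this is exact.
n_sequences = n_amino_acids ** length

n_digits = len(str(n_sequences))
log10_value = length * math.log10(n_amino_acids)

print(f"20^300 has {n_digits} decimal digits")
print(f"log10(20^300) is about {log10_value:.1f}")
```

The result has 391 digits, or roughly 2 followed by 390 zeros, which is why exhaustively testing every candidate sequence is out of the question.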

Given the limitless possibilities, it’s remarkable that the model can so easily generate working enzymes.

“The capability to generate functional proteins from scratch out-of-the-box demonstrates we are entering into a new era of protein design,” said Ali Madani, Ph.D., founder of Profluent Bio, a former research scientist at Salesforce Research, and the paper’s first author. “This is a versatile new tool available to protein engineers, and we’re looking forward to seeing the therapeutic applications.”


Reference: “Large language models generate functional protein sequences across diverse families” by Ali Madani, Ben Krause, Eric R. Greene, Subu Subramanian, Benjamin P. Mohr, James M. Holton, Jose Luis Olmos Jr., Caiming Xiong, Zachary Z. Sun, Richard Socher, James S. Fraser and Nikhil Naik, 26 January 2023, Nature Biotechnology.
DOI: 10.1038/s41587-022-01618-2

Please see the paper for a complete author and funding list. A comprehensive codebase for the methods described in the paper is publicly available at