Foundation Models in Biology represent one of the most important advances in bioinformatics in recent years. These models differ significantly from classical approaches in the field, as they are rooted in concepts originally developed in other artificial intelligence (AI) areas. As a result, even experienced bioinformaticians may find them challenging to understand. This article provides a summary of what you need to know about Foundation Models in Biology and their potential applications.
Many readers may be familiar with Large Language Models such as ChatGPT, which are trained on massive amounts of text data and are capable of generating responses that mimic human language. In a way, Foundation Models for Biology try to accomplish a similar objective, but using the language of biology instead of human text.

When I first heard about Evo, a Foundation Model trained on prokaryotic nucleotide data, I tried to chat with it, as if it were ChatGPT. Its answer was “ACACACTGGA…” 😊
New models are published almost weekly, and keeping a list up to date is difficult. The table below shows some examples of the models mentioned elsewhere in this text. For a more complete list, check this repository: https://github.com/HICAI-ZJU/Scientific-LLM-Survey (I hope they accept my pull requests adding some recent ones). I also like following https://kiinai.substack.com/ for recent news; it is a great way to stay on top of this fast-moving field.
Training a Foundation Model is very expensive and requires large quantities of data. However, once a model has been pre-trained, it can be reused and fine-tuned for other tasks. This means we can take advantage of models trained on big atlases (provided the model is made public) and apply them to smaller datasets, without having to download all the original data and train on it from scratch.
Suppose you are developing a model to predict whether a patient is affected by a disease. You have access to a dataset of gene expression in cases/controls from a previous study.
A traditional bioinformatics approach might involve creating a data frame of gene expression values across all samples and training a model to differentiate between cases and controls. Popular methods include random forests, XGBoost, or logistic regression. In this setup, the predicted variable would be "disease," and the gene expression values would serve as inputs. Alternatively, you could perform a Differential Expression analysis to identify key genes or conduct a gene set enrichment analysis to determine which pathways are most affected.
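To make this concrete, here is a minimal sketch of that traditional setup. It assumes a hypothetical samples-by-genes matrix stored in expression.csv with a 0/1 "disease" column; the file name and column names are placeholders, not a real dataset.

```python
# Traditional approach: train a classifier directly on raw gene expression values.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("expression.csv")      # hypothetical file: rows = samples, columns = genes + "disease"
y = df["disease"]                        # 1 = case, 0 = control
X = df.drop(columns=["disease"])         # raw gene expression values as features

clf = RandomForestClassifier(n_estimators=500, random_state=0)
scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print(f"Mean cross-validated AUC: {scores.mean():.2f}")
```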
However, these approaches often overlook the intricate relationships between genes. For example, Gene A might belong to the same pathway as Gene B, frequently being co-expressed. Gene C could act as a transcription factor for Gene D, but only in the presence of Gene E. Biology is incredibly complex, and many of these relationships remain poorly understood.
If your training dataset is sufficiently large, your model could potentially learn these gene relationships independently. You might also apply feature engineering to simplify the training process. Unfortunately, in most real-world scenarios, training datasets are far too small—often limited to cohorts of 10 or 20 samples, or even hundreds in the best-case scenario, which still might not suffice.
This is where Foundation Models come into play. By leveraging a model pre-trained on extensive gene expression datasets, such as the Gene Expression Atlas, you can transform your small dataset into embeddings. These embeddings encapsulate the knowledge of gene interactions and relationships acquired during pre-training. By using these embeddings instead of raw gene expression values, your predictions become more accurate and biologically informed, integrating the wealth of insights the Foundation Model has learned.
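Here is a sketch of what that workflow could look like. The foundation model encoder is represented by a placeholder function (a random projection stand-in), because every model (e.g. Geneformer or scGPT) ships its own loading and tokenisation code; the point is simply that the downstream classifier is trained on embeddings rather than on raw expression values.

```python
# Embedding-based approach: replace raw expression values with foundation-model embeddings.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("expression.csv")               # same hypothetical dataset as above
y = df["disease"]
X_raw = df.drop(columns=["disease"]).to_numpy()

def embed_samples(expression_matrix):
    """Placeholder for the pre-trained encoder: maps each sample's expression
    profile to a fixed-length embedding. Replace the body with the actual
    foundation model's own API call."""
    rng = np.random.default_rng(0)
    projection = rng.normal(size=(expression_matrix.shape[1], 256))  # stand-in only
    return expression_matrix @ projection

X_emb = embed_samples(X_raw)                      # shape: samples x embedding_dim

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X_emb, y, cv=5, scoring="roc_auc")
print(f"Mean cross-validated AUC on embeddings: {scores.mean():.2f}")
```

Note that the downstream classifier does not have to change: only the features do, swapping raw counts for embeddings that already encode what the Foundation Model learned about gene relationships during pre-training.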