Foundation Models in Biology represent one of the most important advances in bioinformatics in recent years. These models differ significantly from classical approaches in the field, as they are rooted in concepts originally developed in other artificial intelligence (AI) areas. As a result, even experienced bioinformaticians may find them challenging to understand. This article provides a summary of what you need to know about Foundation Models in Biology and their potential applications.
Many readers may be familiar with Large Language Models such as ChatGPT, which are trained on massive amounts of text data and are capable of generating responses that mimic human language. In a way, Foundation Models for Biology try to accomplish a similar objective, but using the language of biology instead of human text.

When I first heard about Evo, a Foundation Model trained on prokaryotic nucleotide data, I tried to chat with it, as if it were ChatGPT. Its answer was “ACACACTGGA…” 😊
New models are published almost weekly, and keeping a list up to date is difficult. The table below shows some examples of the models mentioned elsewhere in this text. For a more complete list, check this repository: https://github.com/HICAI-ZJU/Scientific-LLM-Survey (I hope they accept my pull requests adding some recent ones). I also like following https://kiinai.substack.com/ for recent news; it is a great way to stay on top of this fast-moving field.
Training a Foundation Model is very expensive and requires large quantities of data. However, once a model has been pre-trained, it can be reused and fine-tuned for other tasks. This means we can take advantage of models trained on big atlases (provided the model is made public) and apply them to smaller datasets, without having to download all the original data and train on it from scratch.
Suppose you are developing a model to predict whether a patient is affected by a disease. You have access to a dataset of gene expression in cases/controls from a previous study.
A traditional bioinformatics approach might involve creating a data frame of gene expression values across all samples and training a model to differentiate between cases and controls. Popular methods include random forests, XGBoost, or logistic regression. In this setup, the predicted variable would be "disease," and the gene expression values would serve as inputs. Alternatively, you could perform a Differential Expression analysis to identify key genes or conduct a gene set enrichment analysis to determine which pathways are most affected.
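To make this concrete, here is a minimal sketch of that traditional setup. It assumes a hypothetical samples-by-genes matrix stored in expression.csv with a 0/1 "disease" column; the file name and column names are placeholders, not a real dataset.

```python
# Traditional approach: train a classifier directly on raw gene expression values.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("expression.csv")      # hypothetical file: rows = samples, columns = genes + "disease"
y = df["disease"]                        # 1 = case, 0 = control
X = df.drop(columns=["disease"])         # raw gene expression values as features

clf = RandomForestClassifier(n_estimators=500, random_state=0)
scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print(f"Mean cross-validated AUC: {scores.mean():.2f}")
```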
However, these approaches often overlook the intricate relationships between genes. For example, Gene A might belong to the same pathway as Gene B, frequently being co-expressed. Gene C could act as a transcription factor for Gene D, but only in the presence of Gene E. Biology is incredibly complex, and many of these relationships remain poorly understood.
If your training dataset is sufficiently large, your model could potentially learn these gene relationships independently. You might also apply feature engineering to simplify the training process. Unfortunately, in most real-world scenarios, training datasets are far too small—often limited to cohorts of 10 or 20 samples, or even hundreds in the best-case scenario, which still might not suffice.
This is where Foundation Models come into play. By leveraging a model pre-trained on extensive gene expression datasets, such as the Gene Expression Atlas, you can transform your small dataset into embeddings. These embeddings encapsulate the knowledge of gene interactions and relationships acquired during pre-training. By using these embeddings instead of raw gene expression values, your predictions become more accurate and biologically informed, integrating the wealth of insights the Foundation Model has learned.
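Here is a sketch of what that workflow could look like. The foundation model encoder is represented by a placeholder function (a random projection stand-in), because every model (e.g. Geneformer or scGPT) ships its own loading and tokenisation code; the point is simply that the downstream classifier is trained on embeddings rather than on raw expression values.

```python
# Embedding-based approach: replace raw expression values with foundation-model embeddings.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("expression.csv")               # same hypothetical dataset as above
y = df["disease"]
X_raw = df.drop(columns=["disease"]).to_numpy()

def embed_samples(expression_matrix):
    """Placeholder for the pre-trained encoder: maps each sample's expression
    profile to a fixed-length embedding. Replace the body with the actual
    foundation model's own API call."""
    rng = np.random.default_rng(0)
    projection = rng.normal(size=(expression_matrix.shape[1], 256))  # stand-in only
    return expression_matrix @ projection

X_emb = embed_samples(X_raw)                      # shape: samples x embedding_dim

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X_emb, y, cv=5, scoring="roc_auc")
print(f"Mean cross-validated AUC on embeddings: {scores.mean():.2f}")
```

Note that the downstream classifier does not have to change: only the features do, swapping raw counts for embeddings that already encode what the Foundation Model learned about gene relationships during pre-training.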