1.2 An Overview of AI Models
1.2.1 First things first, what is a “model”?
A model is simply an algorithm (a function) that has been trained on data in order to recognize patterns and make predictions. The data used to train a model typically consists of examples. Each example may contain:
an input (e.g., a protein sequence, microscopy image, gene expression profile)
optionally a label describing what is being predicted (e.g., the protein’s function, the cell type in an image, or whether a mutation is pathogenic)
Algorithm vs Model
An algorithm is a precise set of instructions or rules for solving a specific problem or task, often written in mathematical form.
A model is the result of applying a specific algorithm to a dataset.
During training, the model examines many such examples and gradually learns statistical relationships between the inputs and outputs. Once trained, the model can be used to make predictions on new data it has never seen before. For instance, it might predict the function of a newly discovered protein or identify structures in a microscopy image.
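The distinction between an algorithm and a trained model can be illustrated with a minimal sketch. It assumes a toy dataset and uses ordinary least squares as the algorithm; the names `fit_line` and `make_model` and the data values are hypothetical and chosen only for illustration.

```python
# Hypothetical toy training data: each example is an input x with a numeric label y.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [1.0, 3.0, 5.0, 7.0, 9.0]


def fit_line(xs, ys):
    """The algorithm: ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def make_model(slope, intercept):
    """The model: a function with parameters fixed by training."""
    def predict(x):
        return slope * x + intercept
    return predict


# Training: apply the algorithm to the dataset to obtain a model.
slope, intercept = fit_line(train_x, train_y)
model = make_model(slope, intercept)

# Inference: predict for an input that was not in the training data.
prediction = model(10.0)
print(slope, intercept, prediction)
```

The same `fit_line` algorithm applied to a different dataset would yield a different model, which is the sense in which a model is an algorithm applied to data.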
It is also useful to distinguish between supervised and self-supervised training approaches.
Supervised learning: the training data includes both inputs and labels. The model learns by comparing its predictions to the correct answers (labels) and adjusting itself to reduce errors. For example, a dataset might contain protein sequences along with their gene ontology (GO) terms, and the model learns to predict the GO terms given a sequence.
Self-supervised learning: the model learns patterns from data without explicit labels. Instead, part of the data is hidden or modified and the model learns to predict the missing information. For instance, a model trained on protein sequences might learn to predict masked amino acids within sequences. By solving this task across millions of sequences, the model learns useful representations of proteins that can later be applied to many biological problems.
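The difference between the two approaches shows up in how a training example is built. The sketch below is illustrative only: the protein sequence, the GO label, the mask symbol `?`, the 15% mask fraction, and the helper `make_masked_example` are all assumptions, not part of any specific library or published method.

```python
import random

MASK = "?"  # assumed placeholder symbol for a hidden amino acid


def make_masked_example(sequence, mask_fraction=0.15, rng=None):
    """Hide a fraction of residues; the hidden residues become the targets."""
    rng = rng or random.Random(0)
    n_mask = max(1, round(len(sequence) * mask_fraction))
    positions = sorted(rng.sample(range(len(sequence)), n_mask))
    masked = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = sequence[pos]  # what the model must predict
        masked[pos] = MASK            # what the model actually sees
    return "".join(masked), targets


sequence = "MKTAYIAKQRQISFVKSHFSRQ"  # hypothetical protein sequence

# Supervised example: the label comes from an external annotation.
supervised_example = {"input": sequence, "label": "GO:0003677"}  # placeholder label

# Self-supervised example: the "label" is derived from the data itself.
masked_sequence, targets = make_masked_example(sequence, rng=random.Random(42))
print(masked_sequence, targets)
```

In the supervised case someone must provide the label, whereas in the self-supervised case any unlabeled sequence can be turned into a training example.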
You may refer to Andrew White’s Deep Learning for Molecules & Materials [Whi21] for an in-depth overview of machine learning.
1.2.2 Relationship between AI, ML, DL and GenAI
Fig. 3 Figure 1.2.1: AI is a superclass of learning algorithms that encompasses machine learning (ML), deep learning (DL) and generative AI (GenAI). (Created with gpt-image-1.5-2025-12-16)
Artificial Intelligence (AI): The broad field of creating systems that perform tasks requiring human-like intelligence (reasoning, perception, decision-making).
Machine Learning (ML): A subset of AI where systems learn patterns from data instead of being explicitly programmed.
Deep Learning (DL): A subset of ML that uses multi-layer neural networks to learn complex patterns, especially from large datasets.
Generative AI (GenAI): A type of deep learning model designed to generate new content (e.g., text, images, proteins) by learning patterns from large datasets.
1.2.3 An overview of the models we see in science
AI models and computational models are foundational to modern biology and science, spanning a continuum from first-principles physics-based simulations to data-driven machine learning. Let’s take a look at the major categories of AI and computational models encountered in biological research.
Physics-Based Models: encode known physical laws, conservation principles, and mechanistic understanding directly into mathematical formulations. They are the oldest class of computational models. E.g., ordinary and partial differential equations (ODEs/PDEs), molecular dynamics.
Traditional ML Models: a general starting point for many tasks. They are faster to develop, easier to interpret, and often appropriate when datasets are modest in size. E.g., support vector machines (SVMs), random forest (RF) models.
Deep Learning Models: automatically learn hierarchical representations from raw data. They are most advantageous when large datasets, many features, or highly structured inputs are available. E.g., graph neural networks (GNNs), recurrent neural networks (RNNs).
Foundation Models: represent a paradigm shift in computational biology. They are large-scale AI systems trained on vast, often unlabeled datasets using self-supervised or generative objectives. Large language models belong to this category.
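To make the two ends of this continuum concrete, the sketch below contrasts a mechanistic model with a purely data-driven one. It assumes logistic population growth as the physical law, explicit Euler integration, and a 1-nearest-neighbor predictor as a stand-in for a traditional ML method; all function names and parameter values are hypothetical.

```python
def simulate_logistic_growth(n0, rate, capacity, dt, steps):
    """Physics-based: integrate dN/dt = rate * N * (1 - N / capacity) with explicit Euler."""
    trajectory = [n0]
    n = n0
    for _ in range(steps):
        n = n + dt * rate * n * (1.0 - n / capacity)
        trajectory.append(n)
    return trajectory


def nearest_neighbor_predict(train_times, train_sizes, query_time):
    """Data-driven: return the size observed at the closest training time."""
    idx = min(range(len(train_times)), key=lambda i: abs(train_times[i] - query_time))
    return train_sizes[idx]


# The mechanistic model needs only the governing equation and its parameters.
trajectory = simulate_logistic_growth(n0=10.0, rate=0.5, capacity=1000.0, dt=0.1, steps=400)
times = [i * 0.1 for i in range(len(trajectory))]

# The data-driven model needs observations; here they are sampled from the
# simulation purely for illustration (every 50th time point).
observed_times = times[::50]
observed_sizes = trajectory[::50]
estimate = nearest_neighbor_predict(observed_times, observed_sizes, query_time=12.3)
print(trajectory[-1], estimate)
```

The mechanistic model extrapolates from known laws, while the data-driven model can only interpolate from what it has observed, which is one reason the choice between them depends on how much data and prior knowledge are available.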
| Model Category | Methods/Architectures | Key Characteristics | Primary Applications |
|---|---|---|---|
| Physics-Based Models | Ordinary/Partial Differential Equations (ODEs/PDEs), reaction–diffusion, continuum mechanics | Deterministic time/space dynamics; handles feedback, stiffness, diffusion and transport; extendable to multiscale couplings | Gene regulatory and signaling networks; tissue/organ-level diffusion–reaction and biomechanics [MW07, ZY24] |
| Physics-Based Models | Molecular dynamics (atomistic, coarse-grained), molecular mechanics force fields | Newtonian/Langevin dynamics with bonded/non-bonded forces; atomistic detail to coarse-grained scales; ensemble/dynamics focus | Protein folding and conformational dynamics; protein–ligand binding; free-energy and mechanistic studies [HBD+19, SDS+08] |
| Physics-Based Models | Agent-based/multicellular (cellular automata, Cellular Potts, center-based/off-lattice); hybrid discrete–continuum | Explicit single-cell agents with mechanical/biochemical rules; often coupled to PDE transport; stochasticity supported | Tumor growth, angiogenesis, cell–cell and cell–ECM interactions; virtual tissues and therapy testing [MWHM19] |
| Traditional Machine Learning | Random Forest (RF), Support Vector Machines (SVM), k-Nearest Neighbors (kNN), Logistic/Regularized Regression | Supervised learners for tabular/omics; RF provides feature importance; SVM handles non-linear kernels; kNN is simple baseline; linear models offer interpretability | Disease/phenotype prediction; SNP–trait and GWAS interaction discovery; sequence/site classification; proteomics/microarray sample classification [AAH+13, GKMJ22, Qi12, TCC+07, TBO+13] |
| Deep Learning | Convolutional Neural Networks (CNNs), Recurrent NNs (RNNs/LSTM/GRU) | CNNs capture local spatial patterns (images, 1D motifs); RNNs model sequences/time; improved with residual/attention add-ons | Histopathology and biomedical imaging; DNA/RNA motif discovery; protein secondary structure; temporal gene expression [GKMJ22] |
| Deep Learning | Generative models (GANs, VAEs), Autoencoders | Synthetic data generation; latent-space modeling; denoising/imputation for noisy omics | Synthetic biology (sequence/design), single-cell denoising/imputation, metabolomics feature learning |
| Deep Learning | Graph Neural Networks (GNN/GCN), Transformers | GNNs operate on molecular/interaction graphs; Transformers capture long-range dependencies in sequences and structures | Drug–target/response prediction, PPI networks, single-cell graphs; protein contacts/structures; regulatory genomics [HGO+25, KGA+23] |
| Foundation Models | Protein language models (e.g., ESM-2) and structure FMs (e.g., ESMFold, AlphaFold) | Large-scale self-supervised pre-training on sequences/MSAs; zero-/few-shot transfer; sequence-to-structure heads | Protein property/function prediction; atomic-level structure prediction; design scaffolding [GGL+25, LHW+24, LAR+22] |
| Foundation Models | Genomic foundation models (DNABERT, DNABERT-2, Enformer, HyenaDNA, Nucleotide Transformer) | Masked LM, long-context and supervised pre-training for regulatory signals; model long-range genomic interactions | Promoter/enhancer and variant-effect prediction; gene expression forecasting; regulatory annotation [LHW+24] |
| Foundation Models | Single-cell foundation models (scGPT, Geneformer, scFoundation/LCMs) | Pre-trained on millions of cells; mitigate batch effects; model gene–gene interactions and perturbations | Cell-type annotation; expression imputation; perturbation response; multi-omics integration [GGL+25, LHW+24] |
| Hybrid/Integrative | Physics-Informed Neural Networks (PINNs) | Embed ODE/PDE residuals and physical constraints into NN loss; solve forward/inverse problems with sparse/noisy data | Tumor growth curves; gene expression kinetics; epidemiology (SIR) with parameter inference [FYHES25] |
| Hybrid/Integrative | Differentiable biology (differentiable programming with simulators + deep nets) | End-to-end autodiff of mechanistic + learnable components; integrate priors (symmetries, simulators) with data-driven modules | Molecular mechanism modeling; differentiable dynamics; end-to-end folding/energy landscapes; uncertainty-aware inference [AS21] |
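The idea behind the PINN row, embedding an ODE residual into the loss, can be sketched without any deep learning library. The example below is a simplification under stated assumptions: the governing equation is exponential decay, du/dt = -k u; a two-parameter function stands in for the neural network; and a central finite difference replaces automatic differentiation. The names `candidate` and `physics_informed_loss` are hypothetical.

```python
import math


def candidate(t, params):
    """Stand-in for a neural network: u(t) = a * exp(-b * t)."""
    a, b = params
    return a * math.exp(-b * t)


def physics_informed_loss(params, data, collocation_times, decay_rate, weight=1.0, h=1e-4):
    """Data misfit plus the squared residual of du/dt + decay_rate * u = 0."""
    # Data term: mean squared error against the (sparse) observations.
    data_loss = sum((candidate(t, params) - u) ** 2 for t, u in data) / len(data)

    # Physics term: how badly the candidate violates the ODE at collocation points.
    residuals = []
    for t in collocation_times:
        du_dt = (candidate(t + h, params) - candidate(t - h, params)) / (2 * h)
        residuals.append(du_dt + decay_rate * candidate(t, params))
    physics_loss = sum(r ** 2 for r in residuals) / len(residuals)

    return data_loss + weight * physics_loss


decay_rate = 0.7
data = [(0.0, 2.0), (1.0, 2.0 * math.exp(-0.7))]   # two noiseless observations
collocation_times = [0.25 * i for i in range(13)]  # points where the ODE is enforced

good = physics_informed_loss((2.0, 0.7), data, collocation_times, decay_rate)
bad = physics_informed_loss((2.0, 0.2), data, collocation_times, decay_rate)
print(good, bad)
```

Parameters that satisfy both the observations and the governing equation give a loss close to zero, while parameters that violate either are penalized. In an actual PINN this loss would be minimized by gradient descent over network weights.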
1.2.4 Additional materials
References
Ashfaq Ahmed, K. S. Aljahdali, S. Hussain, M. Kumari, Sunila Godara, and A. Kinoshita. Comparative prediction performance with support vector machine and random forest classification techniques. International Journal of Computer Applications, 69:12–16, May 2013. URL: https://doi.org/10.5120/11885-7922, doi:10.5120/11885-7922.
Mohammed AlQuraishi and Peter K. Sorger. Differentiable biology: using deep learning for biophysics-based and data-driven modeling of molecular mechanisms. Nature Methods, 18:1169–1180, Oct 2021. URL: https://doi.org/10.1038/s41592-021-01283-4, doi:10.1038/s41592-021-01283-4.
Amer Farea, Olli Yli-Harja, and Frank Emmert-Streib. Using physics-informed neural networks for modeling biological and epidemiological dynamical systems. Mathematics, 13:1664, May 2025. URL: https://doi.org/10.3390/math13101664, doi:10.3390/math13101664.
Joe G. Greener, Shaun M. Kandathil, Lewis Moffat, and David T. Jones. A guide to machine learning for biologists. Nature Reviews Molecular Cell Biology, 23:40–55, Sep 2022. URL: https://doi.org/10.1038/s41580-021-00407-0, doi:10.1038/s41580-021-00407-0.
Fei Guo, Renchu Guan, Yaohang Li, Qi Liu, Xiaowo Wang, Can Yang, and Jianxin Wang. Foundation models in bioinformatics. National Science Review, Jan 2025. URL: https://doi.org/10.1093/nsr/nwaf028, doi:10.1093/nsr/nwaf028.
Zaw Myo Hein, Dhanyashri Guruparan, Blaire Okunsai, Che Mohd Nasril Che Mohd Nassir, Muhammad Danial Che Ramli, and Suresh Kumar. AI and machine learning in biology: from genes to proteins. Biology, 14:1453, Oct 2025. URL: https://doi.org/10.3390/biology14101453, doi:10.3390/biology14101453.
David J. Huggins, Philip C. Biggin, Marc A. Dämgen, Jonathan W. Essex, Sarah A. Harris, Richard H. Henchman, Syma Khalid, Antonija Kuzmanic, Charles A. Laughton, Julien Michel, Adrian J. Mulholland, Edina Rosta, Mark S. P. Sansom, and Marc W. van der Kamp. Biomolecular simulations: from dynamics and mechanisms to computational assays of biological activity. Wiley Interdisciplinary Reviews: Computational Molecular Science, Sep 2019. URL: https://doi.org/10.1002/wcms.1393, doi:10.1002/wcms.1393.
Suresh Kumar, Dhanyashri Guruparan, Pavithren Aaron, Philemon Telajan, Kavinesh Mahadevan, Dinesh Davagandhi, and Ong Xin Yue. Deep learning in computational biology: advancements, challenges, and future outlook. ArXiv, Oct 2023. URL: https://doi.org/10.48550/arxiv.2310.03086, doi:10.48550/arxiv.2310.03086.
Qing Li, Zhihang Hu, Yixuan Wang, Lei Li, Yimin Fan, Irwin King, Le Song, and Yu Li. Progress and opportunities of foundation models in bioinformatics. Briefings in Bioinformatics, Feb 2024. URL: https://doi.org/10.48550/arxiv.2402.04286, doi:10.48550/arxiv.2402.04286.
Z Lin, H Akin, R Rao, B Hie, Z Zhu, and W Lu. Language models of protein sequences at the scale of evolution enable accurate structure prediction. bioRxiv, 2022. URL: https://doi.org/10.1101/2022.07.20.500902v1, doi:10.1101/2022.07.20.500902v1.
Wayne Materi and David S. Wishart. Computational systems biology in drug discovery and development: methods and applications. Drug Discovery Today, 12(7-8):295–303, Apr 2007. URL: https://doi.org/10.1016/j.drudis.2007.02.013, doi:10.1016/j.drudis.2007.02.013.
John Metzcar, Yafei Wang, Randy W. Heiland, and P. Macklin. A review of cell-based computational modeling in cancer biology. JCO Clinical Cancer Informatics, pages 1–13, Dec 2019. URL: https://doi.org/10.1200/cci.18.00069, doi:10.1200/cci.18.00069.
Yanjun Qi. Random forest for bioinformatics. ArXiv, pages 307–323, Jan 2012. URL: https://doi.org/10.1007/978-1-4419-9326-7_11, doi:10.1007/978-1-4419-9326-7_11.
Jeanette P. Schmidt, Scott L. Delp, Michael A. Sherman, Charles A. Taylor, Vijay S. Pande, and Russ B. Altman. The simbios national center: systems biology in motion. Aug 2008. URL: https://doi.org/10.1109/jproc.2008.925454, doi:10.1109/jproc.2008.925454.
Adi L Tarca, Vincent J Carey, Xue-wen Chen, Roberto Romero, and Sorin Drăghici. Machine learning and its applications to biology. PLoS Computational Biology, 3:e116, Jun 2007. URL: https://doi.org/10.1371/journal.pcbi.0030116, doi:10.1371/journal.pcbi.0030116.
W. G. Touw, J. R. Bayjanov, L. Overmars, L. Backus, J. Boekhorst, M. Wels, and S. A. F. T. van Hijum. Data mining in the life sciences with random forest: a walk in the park or lost in the jungle? Briefings in Bioinformatics, 14:315–326, Jul 2013. URL: https://doi.org/10.1093/bib/bbs034, doi:10.1093/bib/bbs034.
Andrew D White. Deep learning for molecules and materials. Living Journal of Computational Molecular Science, 3(1):1499, 2021. URL: https://dmol.pub, doi:10.33011/livecoms.3.1.1499.
Jiayao Zhou and Shudan Yan. Dynamic modeling in systems biology: from pathway analysis to whole-cell simulations. Computational Molecular Biology, Jan 2024. URL: https://doi.org/10.5376/cmb.2024.14.0016, doi:10.5376/cmb.2024.14.0016.