Machine learning for
adaptive immune receptors

An open-source platform for reproducible machine learning analysis of AIRR data — covering classification, clustering, generative modeling, simulation, and visualization.

Open source AGPL-3.0
AIRR-C standard Interoperable

What is AIRR?

Adaptive immune receptor repertoires (AIRRs) are the full set of T- and B-cell receptor sequences in an individual — a molecular record of past and ongoing immune responses to pathogens, vaccines, and disease. Understanding the patterns encoded in AIRR data opens opportunities for diagnostic, prognostic, and therapeutic applications.

Why immuneML?

immuneML handles the infrastructure — data import, preprocessing, subsampling, encoding, hyperparameter optimization, and reporting — so that ML researchers can focus on developing new methods, and immunologists on biological questions. Analyses are defined in shareable YAML specification files, ensuring full reproducibility.

Unsupervised ML analysis

The latest immuneML release extends the platform to cover unsupervised ML — clustering, generative modeling, protein language model embeddings, and dimensionality reduction — alongside the existing supervised classification workflows.

New
Clustering Workflows
Systematic clustering with stability-based model selection, internal and external validation indices, and held-out dataset validation — following established best practices for unsupervised ML evaluation.
New
Generative Modeling
Train and compare different generative models on receptor sequences, with built-in visualizations to compare generated and original sequence distributions.
New
PLM Embeddings
Protein language model embeddings integrated directly into workflows: ProtT5, TCR-BERT, and ESMC — alongside the existing Word2Vec-based encoding — for richer receptor sequence representations.
New
Dimensionality Reduction
PCA, t-SNE, and UMAP as first-class components, combinable with any encoding for data visualization, confounder exploration, and interpretation of clustering and generative model results.

What immuneML supports

immuneML provides components for the full ML analysis lifecycle, from data import to result reporting, in a unified and extensible framework.

Supervised ML — Train & Assess
Train classifiers for repertoire, receptor, or sequence classification. Benchmark multiple ML methods (logistic regression, SVM, CNN, DeepRC) and encodings using nested cross-validation to select the optimal model.
Documentation →
Apply Trained Models
Export trained models and encoding parameters, then apply them to new data without re-training — enabling external validation on independent cohorts or datasets.
Documentation →
New in 2026
Clustering Analysis
Systematic clustering workflow with stability-based model selection and result/method-based validation on held-out datasets, following established best practices for unsupervised evaluation.
Documentation →
New in 2026
Generative Models
Train LSTM, VAE, SoNNia, and ProGen2 generative models. Built-in visualizations compare generated vs. original distributions and assess epitope specificity and novelty using domain-specific criteria.
Documentation →
New in 2026
PLM Embeddings & Dim. Reduction
ProtT5, TCR-BERT, and ESMC protein language model embeddings, combinable with PCA, t-SNE, or UMAP for visualization, confounder detection, and downstream classification or clustering.
Documentation →
Exploratory Analysis
Examine dataset properties — sequence length distributions, gene usage, diversity metrics, label overlaps, and dataset visualizations — to inform model choices and identify potential confounders before training.
Documentation →
Simulation with LIgO
Generate synthetic AIR(R) datasets with fully annotated ground-truth properties using the integrated LIgO tool — for benchmarking ML methods and validating pipelines under controlled conditions.
Documentation →
Extend the Platform
Integrate new ML methods, encodings, generative models, or reports by implementing a documented abstract class. immuneML handles data loading, preprocessing, subsampling, and result reporting automatically.
Developer docs →
AIRR Community Standards
Full compliance with AIRR-C software and sequence annotation standards, ensuring interoperability with MiXCR, Immcantation, immunarch, iReceptor, VDJdb, and other tools in the AIRR ecosystem.
Documentation →

Install immuneML

Available as a Python package, Docker image, and conda package. Full documentation with tutorials, YAML examples, and developer guides at docs.immuneml.uio.no.

PyPI

pip install immuneML

Conda

conda install -c bioconda immuneml

Docker

docker pull milenapavlovic/immuneml

GitHub

github.com/uio-bmi/immuneML

Reproducible workflows

Every immuneML analysis is defined in a human-readable YAML specification file. Running it produces a structured HTML report along with exported models, raw data, and all intermediate results.

01 / Input
AIRR dataset + YAML spec
Provide your AIRR-format data and a specification file defining datasets, encodings, ML methods, and instructions. Supports most common AIRR data formats.
AIRR format MiXCR YAML
02 / Analysis
immuneML analysis
immuneML runs the specified instruction — classification, clustering, generative modeling, simulation, or exploratory analysis — with automatic preprocessing and result collection.
TrainMLModel Clustering TrainGenModel ExploratoryAnalysis
03 / Output
HTML report + results
A structured HTML report with plots, performance metrics, and model interpretations. Trained models, raw result data, and a resolved specification file are exported alongside.
HTML report Exported models Reproducible
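The three steps above are driven by a single YAML file. The sketch below illustrates the general shape of a supervised classification spec — all dataset, encoding, method, and label names (my_dataset, disease_status, etc.) are placeholders, parameter values are illustrative, and exact keys should be checked against the immuneML documentation:

```yaml
definitions:
  datasets:
    my_dataset:                      # placeholder dataset name
      format: AIRR
      params:
        path: data/                  # folder with AIRR-format files (assumption)
        metadata_file: metadata.csv  # per-repertoire labels (assumption)
  encodings:
    my_kmer_encoding:
      KmerFrequency:
        k: 3
  ml_methods:
    my_logistic_regression: LogisticRegression
instructions:
  my_training_instruction:
    type: TrainMLModel
    dataset: my_dataset
    labels: [disease_status]         # placeholder label name
    settings:                        # encoding/method combinations to benchmark
      - encoding: my_kmer_encoding
        ml_method: my_logistic_regression
    assessment:                      # outer loop of nested cross-validation
      split_strategy: random
      split_count: 1
      training_percentage: 0.7
    selection:                       # inner loop for model selection
      split_strategy: k_fold
      split_count: 5
    optimization_metric: balanced_accuracy
```

Running the analysis is then a single command along the lines of `immune-ml specs.yaml results/`, which produces the HTML report, exported models, and resolved specification described above.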

Unsupervised ML in practice

Three use cases from the 2026 preprint demonstrate clustering, generative modeling, and exploratory analysis workflows on both synthetic and experimental AIRR datasets.

01 / Generative models
Comparing generative models of epitope-specific TCR sequences
Three generative models (LSTM, VAE, PWM) were trained on a LIgO-simulated dataset with five k-mer motifs determining epitope specificity. immuneML's built-in reports quantified epitope specificity and novelty — LSTM produced the highest fraction of epitope-specific sequences (∼99%), while VAE generated a higher proportion of novel sequences (∼73%).
LSTM VAE LIgO simulation Epitope specificity
02 / Clustering
Exploring biological structure of epitope-specific TCRβ sequences
Using ∼48,000 human TCRβ sequences from IEDB, multiple clustering approaches were evaluated — including ProtT5, ESMC, TCR-BERT, k-mer, and tcrdist-based representations. Stability analysis showed tcrdist with hierarchical clustering best captured epitope and HLA structure, validated on an independent held-out dataset.
IEDB dataset tcrdist Stability analysis Epitope specificity
03 / Confounder analysis
Unsupervised analysis of confounders in an experimental AIRR dataset
BCR data from 166 IBD patients and healthy controls was used to demonstrate immuneML's exploratory analysis workflow. Dimensionality reduction and clustering revealed sequencing batch effects; stability analysis showed high uncertainty in clustering, suggesting batch-related sequence patterns do not strongly dominate repertoire-level similarity.
IBD cohort BCR repertoires Batch effects Exploratory analysis

How to cite

If you use immuneML in your research, please cite the relevant paper(s) below.

2026

Unsupervised machine learning for adaptive immune receptors with immuneML

Pavlović M. et al.
bioRxiv preprint · doi:10.64898/2026.04.15.718648v1
View preprint →
2021

The immuneML ecosystem for machine learning analysis of adaptive immune receptor repertoires

Pavlović M., Scheffer L., et al.
Nature Machine Intelligence, 3, 936–944
View paper →

Contact

Questions, bug reports, or contributions — we're happy to hear from you.