Machine learning for
adaptive immune receptors

An open-source platform for reproducible machine learning analysis of AIRR data — covering classification, clustering, generative modeling, simulation, and visualization.

Open source AGPL-3.0
AIRR-C standard Interoperable

What is AIRR?

Adaptive immune receptor repertoires (AIRRs) are the full set of T- and B-cell receptor sequences in an individual — a molecular record of past and ongoing immune responses to pathogens, vaccines, and disease. Understanding the patterns encoded in AIRR data opens opportunities for diagnostic, prognostic, and therapeutic applications.

Why immuneML?

immuneML handles the infrastructure — data import, preprocessing, subsampling, encoding, hyperparameter optimization, and reporting — so that ML researchers can focus on developing new methods, and immunologists on biological questions. Analyses are defined in shareable YAML specification files, ensuring full reproducibility.

Unsupervised ML analysis

The latest immuneML release extends the platform to cover unsupervised ML — clustering, generative modeling, protein language model embeddings, and dimensionality reduction — alongside the existing supervised classification workflows.

New
Clustering Workflows
Systematic clustering with stability-based model selection, internal and external validation indices, and held-out dataset validation — following established best practices for unsupervised ML evaluation.
New
Generative Modeling
Train and compare different generative models on receptor sequences, with built-in visualizations to compare generated and original sequence distributions.
New
PLM Embeddings
Protein language model embeddings integrated directly into workflows: ProtT5, TCR-BERT, and ESMC — alongside the existing Word2Vec-based encoding — for richer receptor sequence representations.
New
Dimensionality Reduction
PCA, t-SNE, and UMAP as first-class components, combinable with any encoding for data visualization, confounder exploration, and interpretation of clustering and generative model results.

What immuneML supports

immuneML provides components for the full ML analysis lifecycle, from data import to result reporting, in a unified and extensible framework.

Supervised ML — Train & Assess
Train classifiers for repertoire, receptor, or sequence classification. Benchmark multiple ML methods (logistic regression, SVM, CNN, DeepRC) and encodings using nested cross-validation to select the optimal model.
Documentation →
Apply Trained Models
Export trained models and encoding parameters, then apply them to new data without re-training — enabling external validation on independent cohorts or datasets.
Documentation →
New in 2026
Clustering Analysis
Systematic clustering workflow with stability-based model selection and result/method-based validation on held-out datasets, following established best practices for unsupervised evaluation.
Documentation →
New in 2026
Generative Models
Train LSTM, VAE, SoNNia, and ProGen2 generative models. Built-in visualizations compare generated vs. original distributions and assess epitope specificity and novelty using domain-specific criteria.
Documentation →
New in 2026
PLM Embeddings & Dim. Reduction
ProtT5, TCR-BERT, and ESMC protein language model embeddings, combinable with PCA, t-SNE, or UMAP for visualization, confounder detection, and downstream classification or clustering.
Documentation →
Exploratory Analysis
Examine dataset properties — sequence length distributions, gene usage, diversity metrics, label overlaps, and dataset visualizations — to inform model choices and identify potential confounders before training.
Documentation →
Simulation with LIgO
Generate synthetic AIR(R) datasets with fully annotated ground-truth properties using the integrated LIgO tool — for benchmarking ML methods and validating pipelines under controlled conditions.
Documentation →
Extend the Platform
Integrate new ML methods, encodings, generative models, or reports by implementing a documented abstract class. immuneML handles data loading, preprocessing, subsampling, and result reporting automatically.
Developer docs →
AIRR Community Standards
Full compliance with AIRR-C software and sequence annotation standards, ensuring interoperability with MiXCR, Immcantation, immunarch, iReceptor, VDJdb, and other tools in the AIRR ecosystem.
Documentation →

Install immuneML

Available as a Python package, Docker image, and conda package. Full documentation with tutorials, YAML examples, and developer guides at docs.immuneml.uio.no.

PyPI

pip install immuneML

Conda

conda install -c bioconda immuneml

Docker

docker pull milenapavlovic/immuneml

GitHub

github.com/uio-bmi/immuneML

Reproducible workflows

Every immuneML analysis is defined in a human-readable YAML specification file. Running it produces a structured HTML report along with exported models, raw data, and all intermediate results.

01 / Input
AIRR dataset + YAML spec
Provide your AIRR-format data and a specification file defining datasets, encodings, ML methods, and instructions. Supports most common AIRR data formats.
AIRR format MiXCR YAML
02 / Analysis
immuneML analysis
immuneML runs the specified instruction — classification, clustering, generative modeling, simulation, or exploratory analysis — with automatic preprocessing and result collection.
TrainMLModel Clustering TrainGenModel ExploratoryAnalysis
03 / Output
HTML report + results
A structured HTML report with plots, performance metrics, and model interpretations. Trained models, raw result data, and a resolved specification file are exported alongside.
HTML report Exported models Reproducible
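The three steps above are driven by a single YAML file. The sketch below illustrates the general shape of a supervised classification spec — all dataset, encoding, method, and label names (my_dataset, disease_status, etc.) are placeholders, parameter values are illustrative, and exact keys should be checked against the immuneML documentation:

```yaml
definitions:
  datasets:
    my_dataset:                      # placeholder dataset name
      format: AIRR
      params:
        path: data/                  # folder with AIRR-format files (assumption)
        metadata_file: metadata.csv  # per-repertoire labels (assumption)
  encodings:
    my_kmer_encoding:
      KmerFrequency:
        k: 3
  ml_methods:
    my_logistic_regression: LogisticRegression
instructions:
  my_training_instruction:
    type: TrainMLModel
    dataset: my_dataset
    labels: [disease_status]         # placeholder label name
    settings:                        # encoding/method combinations to benchmark
      - encoding: my_kmer_encoding
        ml_method: my_logistic_regression
    assessment:                      # outer loop of nested cross-validation
      split_strategy: random
      split_count: 1
      training_percentage: 0.7
    selection:                       # inner loop for model selection
      split_strategy: k_fold
      split_count: 5
    optimization_metric: balanced_accuracy
```

Running the analysis is then a single command along the lines of `immune-ml specs.yaml results/`, which produces the HTML report, exported models, and resolved specification described above.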

Unsupervised ML in practice

Three use cases from the 2026 preprint demonstrate clustering, generative modeling, and exploratory analysis workflows on both synthetic and experimental AIRR datasets.

01 / Generative models
Comparing generative models of epitope-specific TCR sequences
Three generative models (LSTM, VAE, PWM) were trained on a LIgO-simulated dataset with five k-mer motifs determining epitope specificity. immuneML's built-in reports quantified epitope specificity and novelty — LSTM produced the highest fraction of epitope-specific sequences (∼99%), while VAE generated a higher proportion of novel sequences (∼73%).
LSTM VAE LIgO simulation Epitope specificity
02 / Clustering
Exploring biological structure of epitope-specific TCRβ sequences
Using ∼48,000 human TCRβ sequences from IEDB, multiple clustering approaches were evaluated — including ProtT5, ESMC, TCR-BERT, k-mer, and tcrdist-based representations. Stability analysis showed tcrdist with hierarchical clustering best captured epitope and HLA structure, validated on an independent held-out dataset.
IEDB dataset tcrdist Stability analysis Epitope specificity
03 / Confounder analysis
Unsupervised analysis of confounders in an experimental AIRR dataset
BCR data from 166 IBD patients and healthy controls was used to demonstrate immuneML's exploratory analysis workflow. Dimensionality reduction and clustering revealed sequencing batch effects; stability analysis showed high uncertainty in clustering, suggesting batch-related sequence patterns do not strongly dominate repertoire-level similarity.
IBD cohort BCR repertoires Batch effects Exploratory analysis

How to cite

If you use immuneML in your research, please cite the relevant paper(s) below.

2026

Unsupervised machine learning for adaptive immune receptors with immuneML

Pavlović M. et al.
bioRxiv preprint · doi:10.64898/2026.04.15.718648v1
View preprint →
2021

The immuneML ecosystem for machine learning analysis of adaptive immune receptor repertoires

Pavlović M., Scheffer L., et al.
Nature Machine Intelligence, 3, 936–944
View paper →

Contact

Questions, bug reports, or contributions — we're happy to hear from you.