This page is intended for researchers, data scientists, and IT professionals who need to analyse sensitive data distributed across multiple institutions without centralising it. This includes those working with health data, genomic information, or any other privacy‑sensitive datasets where regulatory constraints prevent data sharing.
Federated learning (FL) is a paradigm for building statistical models without centralising sensitive data. Instead of shipping records to a single repository, the learning algorithm is dispatched to each participating site and only aggregated model updates are exchanged. This decentralised approach preserves data sovereignty and allows hospitals, biobanks and other organisations to collaborate on joint models while keeping raw data local. FL was first deployed at Google for on‑device keyboard prediction, where simulations involved ≈ 1.5 million phones [1]; in healthcare case studies, federations typically comprise between five and roughly three hundred sites, depending on governance constraints. However, open challenges such as communication cost and fairness remain active research topics [2].
Traditional centralised training often conflicts with privacy legislation because it requires data to leave its origin. FL overcomes this constraint by bringing computation to the data and exchanging only summary statistics. As a result, researchers can pool statistical power across sites while complying with the EU General Data Protection Regulation (GDPR) and ethical frameworks such as the Five Safes.
Data may be partitioned across organisations in different ways, and the partitioning influences algorithm choice and security requirements. In a horizontal (sample‑wise) scenario, each site holds the same features but different cohorts; for example, multiple hospitals may collect identical clinical measurements for different patients. Vertical (feature‑wise) partitioning occurs when participating organisations share individuals but collect different variables, such as genetic data at one site and clinical data at another. Understanding how the data is split helps select appropriate federated algorithms and security mechanisms.
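The sketch below (Python, with hypothetical column names) illustrates the two partitioning schemes on a toy patient table: horizontal splits keep all variables but divide patients across hospitals, whereas vertical splits keep all patients but divide variables across organisations.

```python
# Illustrative only: a toy patient table (hypothetical column names) split
# horizontally (same features, different cohorts) and vertically
# (same individuals, different features).
import pandas as pd

cohort = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "age":        [54, 61, 47, 70],
    "bmi":        [27.1, 31.4, 22.8, 25.0],
    "variant_x":  [0, 1, 1, 0],
})

# Horizontal (sample-wise): two hospitals record the same variables
# for different patients.
hospital_a = cohort.iloc[:2]
hospital_b = cohort.iloc[2:]

# Vertical (feature-wise): a clinical site and a genomics site hold the same
# patients but different variables, linked only by a shared identifier.
clinical_site = cohort[["patient_id", "age", "bmi"]]
genomics_site = cohort[["patient_id", "variant_x"]]
```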
Beyond FedAvg, vertical federations can use SplitNN [3] or PyVertical [4] to train deep models in which each party holds disjoint features, while FedSVD [5] offers a statistical alternative: a federated singular‑value‑decomposition (SVD) algorithm that has already been applied to genome‑wide association studies (GWAS) and other high‑dimensional omics analyses.
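The following is a minimal sketch of the split‑learning idea underlying SplitNN, not the PyVertical API: the feature‑holding party runs the first network layers locally, and only the intermediate activations ("smashed data") and their gradients cross the institutional boundary. Model sizes, data and hyperparameters are invented for illustration.

```python
# Minimal sketch of the split-learning idea behind SplitNN (illustrative; this
# is not the PyVertical API). The feature holder runs the first layers; the
# label holder runs the rest; only activations ("smashed data") and their
# gradients cross the boundary, never raw features or labels.
import torch
import torch.nn as nn

torch.manual_seed(0)

client_net = nn.Sequential(nn.Linear(20, 16), nn.ReLU())  # at the feature-holding party
server_net = nn.Sequential(nn.Linear(16, 1))              # at the label-holding party
opt_client = torch.optim.SGD(client_net.parameters(), lr=0.1)
opt_server = torch.optim.SGD(server_net.parameters(), lr=0.1)
loss_fn = nn.BCEWithLogitsLoss()

x = torch.randn(32, 20)                    # raw features never leave party A
y = torch.randint(0, 2, (32, 1)).float()   # labels never leave party B

for _ in range(5):
    opt_client.zero_grad()
    opt_server.zero_grad()
    smashed = client_net(x)                        # activations sent A -> B
    smashed_b = smashed.detach().requires_grad_()  # B treats them as an input
    loss = loss_fn(server_net(smashed_b), y)
    loss.backward()                                # B computes its gradients
    smashed.backward(smashed_b.grad)               # cut-layer gradient sent B -> A
    opt_client.step()
    opt_server.step()
```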
Differences in data collection protocols can cause site‑to‑site variability. Before launching a federated study, establish a common data model or phenotype dictionary to align variable names and units. Perform quality control to detect outliers, missing values and batch effects, and apply common pre‑processing pipelines (such as normalisation or imaging correction) across sites. Tools such as runcrate [6], a command‑line utility for manipulating Workflow Run RO‑Crate packages, can be used to package metadata and ensure provenance.
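As an illustration of harmonised pre‑processing, the hedged sketch below derives a single normalisation from per‑site summary statistics (counts, sums and sums of squares), so every site applies the same transformation without exchanging record‑level data; all variable names and numbers are invented for the example.

```python
# Hedged sketch: deriving one shared normalisation from per-site summary
# statistics so all sites apply the same pre-processing without sharing
# record-level data. Counts, sums and variable names are invented.
import numpy as np

# Each site reports only (n, sum, sum of squares) for a given variable.
site_summaries = [
    {"n": 1200, "sum": 64_800.0, "sum_sq": 3_560_000.0},
    {"n": 800,  "sum": 44_000.0, "sum_sq": 2_470_000.0},
]

n_total = sum(s["n"] for s in site_summaries)
mean = sum(s["sum"] for s in site_summaries) / n_total
variance = sum(s["sum_sq"] for s in site_summaries) / n_total - mean ** 2
std = np.sqrt(variance)

def standardise(values):
    """Applied locally at every site with the shared mean and std."""
    return (np.asarray(values, dtype=float) - mean) / std
```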
The OMOP Common Data Model (CDM) [7] is widely adopted for observational health data and maps well onto federated SQL back‑ends.
OMOP CDM serves as the default table schema for observational data networks, enabling standardised queries across federated sites. Implementation follows a systematic approach: (1) extract source data into staging tables, (2) transform the data using OHDSI tools to map local vocabularies to standard concepts, and (3) validate completeness and quality with the OHDSI Data Quality Dashboard. See the OHDSI collaborative protocol for implementation guidance and general OHDSI CDM resources [8].
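To make the idea of standardised federated queries concrete, here is a hedged sketch in which each site runs the same OMOP CDM cohort‑count query against its local database and returns only the aggregate count; the database path and the concept ID are placeholders.

```python
# Hedged sketch: each site executes the same standardised query against its
# local OMOP CDM database and returns only an aggregate count. The database
# path and the concept ID passed by the caller are placeholders; any DB-API
# driver pointed at the local CDM could replace sqlite3.
import sqlite3

COHORT_COUNT_SQL = """
SELECT COUNT(DISTINCT co.person_id)
FROM condition_occurrence AS co
JOIN person AS p ON p.person_id = co.person_id
WHERE co.condition_concept_id = :concept_id      -- standard OMOP concept
  AND co.condition_start_date >= :index_date
"""

def local_cohort_count(db_path: str, concept_id: int, index_date: str) -> int:
    """Run at each site; only the count crosses the institutional boundary."""
    with sqlite3.connect(db_path) as conn:
        row = conn.execute(
            COHORT_COUNT_SQL,
            {"concept_id": concept_id, "index_date": index_date},
        ).fetchone()
    return row[0]

# Example call at one site (placeholder file name and concept ID):
# local_cohort_count("omop_cdm.sqlite", concept_id=201826, index_date="2020-01-01")
```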
For phenotypic data exchange, GA4GH Phenopackets provide a structured JSON representation, for example `{"subject": {"id": "patient1"}, "phenotypicFeatures": []}`. Refer to the GA4GH Phenopackets specification for complete schema definitions and validation rules.
To let external analysts discover which federated shards exist, expose a read‑only endpoint using Beacon v2 (yes/no genomic presence queries) [9] or the GA4GH Search/Data‑Connect API for richer tabular filters [10].
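The hedged sketch below shows what such a discovery query might look like from the analyst's side: a boolean Beacon v2 allele‑presence request over HTTPS with a bearer token. The base URL and token are placeholders, and parameter and response field names should be checked against the deployed beacon's documentation.

```python
# Hedged sketch of a boolean Beacon v2 allele-presence query from an external
# analyst. The base URL and token are placeholders; parameter and response
# field names follow the Beacon v2 documentation [9] but should be checked
# against the deployed beacon.
import requests

BEACON_URL = "https://beacon.example.org/api/g_variants"  # placeholder endpoint
TOKEN = "eyJ..."  # e.g. an OIDC access token carrying GA4GH Passport visas

params = {
    "assemblyId": "GRCh38",
    "referenceName": "17",
    "start": 43045700,
    "referenceBases": "G",
    "alternateBases": "A",
}
response = requests.get(
    BEACON_URL,
    params=params,
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
response.raise_for_status()
# Only a yes/no presence indicator is returned, never record-level data.
print(response.json()["responseSummary"]["exists"])
```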
The FAIR principles map onto federated learning assets as follows:

| FAIR Principle | Implementation in FL | Example |
| --- | --- | --- |
| Findable | Assign persistent identifiers (PIDs) to models and datasets | DataCite DOIs for FL model versions |
| Accessible | Use standardised protocols for model access | HTTPS APIs with OIDC authentication |
| Interoperable | Apply common data models and vocabularies | OMOP CDM for clinical data, GA4GH Phenopackets |
| Reusable | Package with rich metadata and clear licences | RO‑Crate with Model Cards, CC‑BY/Apache‑2.0 |
Federated learning relies on a layered security stack. At the network level, use Transport Layer Security (TLS) or Virtual Private Networks (VPNs) to encrypt communications between the coordinator and clients.
Authentication and authorisation can be handled via OpenID Connect (OIDC) and token‑based access control; GA4GH Passport ‘Visa’ tokens can carry researchers’ data‑access authorisations between sites [11].
Aggregated model updates should be computed using secure aggregation protocols, such as SecAgg or SecAgg+, in which each client masks its update so that the server can recover only the sum. Differential privacy and noise addition further reduce the risk of re‑identification, and a threat model should guide the choice of protections.
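As a concrete illustration of the last point, the sketch below clips each client's model update to a norm bound and adds Gaussian noise before it leaves the site. The clipping bound and noise multiplier are illustrative; a real deployment would calibrate them against a formal (ε, δ) privacy budget and combine them with secure aggregation.

```python
# Hedged sketch of client-side protections applied before aggregation: clip
# each model update to a norm bound, then add Gaussian noise. The clipping
# bound and noise multiplier are illustrative; a real deployment would
# calibrate them to a formal (epsilon, delta) budget and combine them with
# secure aggregation.
import numpy as np

rng = np.random.default_rng(42)

def privatise_update(update: np.ndarray,
                     clip_norm: float = 1.0,
                     noise_multiplier: float = 1.1) -> np.ndarray:
    """Clip the update's L2 norm and add calibrated Gaussian noise."""
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / (norm + 1e-12))
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=update.shape)
    return clipped + noise

# The coordinator averages the privatised (ideally securely aggregated) updates.
client_updates = [rng.normal(size=10) for _ in range(5)]
global_step = np.mean([privatise_update(u) for u in client_updates], axis=0)
```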
Research note: LightSecAgg reduces bandwidth and copes with client drop‑outs, enabling asynchronous FL. A reference implementation is available [12], but it is not yet merged into Flower core and still requires manual integration.
The original LightSecAgg design [13] details bandwidth savings compared to traditional secure aggregation protocols.
The UK Data Service description of the Five Safes [14] provides a structured approach to ethical and secure data use. Its components can be mapped to federated workflows:
Log provenance: each Beacon or Search query is captured as a DataDownload entity inside the Five‑Safes RO‑Crate so auditors can trace who accessed which variant count [9].
A full JSON profile and example crates are available [15].
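A hedged sketch of what such a provenance entry could look like when appended to the crate's metadata file: a schema.org DataDownload entity recording the query URL and timestamp. Property values are illustrative, and the exact required fields and the way the requesting agent is linked are defined by the Five Safes RO‑Crate profile [15].

```python
# Hedged sketch: appending a provenance entry for one Beacon query to the
# crate's metadata file as a schema.org DataDownload entity. Property values
# are illustrative; the required fields and how the requesting agent is
# linked are defined by the Five Safes RO-Crate profile [15].
import json

query_log_entity = {
    "@id": "#beacon-query-0001",
    "@type": "DataDownload",
    "name": "Beacon v2 allele-presence query result",
    "contentUrl": "https://beacon.example.org/api/g_variants?referenceName=17&start=43045700",
    "encodingFormat": "application/json",
    "dateCreated": "2025-01-15T10:30:00Z",
}

with open("ro-crate-metadata.json") as fh:   # the crate's standard metadata file
    crate = json.load(fh)
crate["@graph"].append(query_log_entity)
with open("ro-crate-metadata.json", "w") as fh:
    json.dump(crate, fh, indent=2)
```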
Environmental sustainability – carbon footprint monitoring and green AI practices
The following supporting tools and services complement FL frameworks:
| Tool | Purpose | Security features | License |
| --- | --- | --- | --- |
| runcrate | Command‑line toolkit for creating and manipulating Workflow Run RO‑Crate packages, useful for packaging federated training runs and preserving provenance | — | Apache‑2.0 |
| EUCAIM federated processing API | RESTful interface that orchestrates federated computation across secure nodes within the EUCAIM platform | Kubernetes isolation, secure nodes | Apache‑2.0 (core services) |
| Evidently AI | Open‑source ML monitoring framework for drift detection, bias dashboards and model performance tracking in production FL deployments | — | Apache‑2.0 |
RDMKit automatically displays relevant training materials from ELIXIR TeSS above. The following resources provide hands-on tutorials for specific FL frameworks:
Hard, Andrew, Rao, Kanishka, Mathews, Rajiv, Ramaswamy, Swaroop, Beaufays, Françoise, Augenstein, Sean, Eichner, Hubert, Kiddon, Chloé, Ramage, Daniel (2018). Federated Learning for Mobile Keyboard Prediction. arXiv preprint arXiv:1811.03604. Available at: https://arxiv.org/abs/1811.03604
Li, Tian, Sahu, Anit Kumar, Talwalkar, Ameet, Smith, Virginia (2020). Federated Learning: Challenges, Methods, and Future Directions. IEEE Signal Processing Magazine, 37, 50-60. DOI: 10.1109/MSP.2020.2975749
Gupta, Otkrist, Raskar, Ramesh (2018). Distributed learning of deep neural network over multiple agents. Journal of Network and Computer Applications, 116, 1-8. DOI: 10.1016/j.jnca.2018.05.003
OpenMined Community (2020). PyVertical – Vertical Federated Learning in PyTorch. https://github.com/OpenMined/PyVertical.
Hartebrodt, Anne, Röttger, Richard, Blumenthal, David B (2024). Federated singular value decomposition for high-dimensional data. Data Mining and Knowledge Discovery, 38, 938–975. DOI: 10.1007/s10618-023-00983-z
ResearchObject.org (2023). runcrate CLI. https://github.com/ResearchObject/runcrate.
Johns Hopkins Medicine (2024). OMOP on PMAP – Standardising patient information for global research. https://pm.jh.edu/discover-data-stream/epic-emr-clinical-data/omop-on-pmap/.
Observational Health Data Sciences and Informatics (OHDSI) (2024). Standardized Data: The OMOP Common Data Model. https://www.ohdsi.org/data-standardization/.
Global Alliance for Genomics and Health (2024). Beacon v2 Specification. https://docs.genomebeacons.org/.
Global Alliance for Genomics and Health (2023). GA4GH Search and Data Connect API Specification. https://www.ga4gh.org/product/data-connect/.
Global Alliance for Genomics and Health (2024). GA4GH Passports Specification. https://www.ga4gh.org/product/ga4gh-passports/.
So, Jinhyun (2022). LightSecAgg – Reference implementation. https://github.com/LightSecAgg/MLSys2022_anonymous.
So, Jinhyun, He, Chaoyang, Yang, Chien-Sheng, Li, Songze, Yu, Qian, Ali, Ramy E., Guler, Basak, Avestimehr, Salman (2022). LightSecAgg: a lightweight and versatile design for secure aggregation in federated learning. Proceedings of Machine Learning and Systems, 4, 694–720.
UK Data Service (2023). What is the Five Safes framework? https://ukdataservice.ac.uk/help/secure-lab/what-is-the-five-safes-framework/.
Soiland-Reyes, Stian, Wheater, Stuart (2023). Five Safes RO-Crate profile. https://trefx.uk/5s-crate/0.4/.
European Data Protection Supervisor (2025). TechDispatch #1/2025 - Federated Learning. https://www.edps.europa.eu/data-protection/our-work/publications/techdispatch/2025-06-10-techdispatch-12025-federated-learning_en.
UK Information Commissioner’s Office (2024). DPIA template. https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/childrens-information/childrens-code-guidance-and-resources/age-appropriate-design-a-code-of-practice-for-online-services/annex-d-dpia-template/.
Commission Nationale de l’Informatique et des Libertés (CNIL) (2024). Privacy Impact Assessment (PIA). https://www.cnil.fr/en/privacy-impact-assessment-pia.
Flower Labs (2025). How-to run simulations. https://flower.ai/docs/framework/how-to-run-simulations.html.
Martí-Bonmatí, Luis, Blanquer, Ignacio, Tsiknakis, Manolis, Tsakou, Gianna, Martinez, Ricard, Capella-Gutierrez, Salvador, Zullino, Sara, Meszaros, Janos, Bron, Esther E, Gelpi, Jose Luis, et al. (2025). Empowering cancer research in Europe: the EUCAIM cancer imaging infrastructure. Insights into Imaging, 16, 47. DOI: 10.1186/s13244-025-01913-x
This page is an example of RDMKit-compliant documentation created by Jorge Miguel Silva.
Original repository: federated_learning_page