Good operational practices are vital for reliable federated systems. Maintain audit logs at both the coordinator and client sides to record training events, authentication attempts and errors.
Use drift detection to monitor whether local distributions diverge from global assumptions, and integrate these checks into CI alongside unit tests and simulated federated runs. When possible, visualise round-level metrics (loss/accuracy) as aggregates only, without revealing per-site performance. For production, implement data-/concept-drift alarms with tools such as Evidently and scrape system/application metrics via Prometheus exporters.
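As a concrete example, a client site could run a local data-drift check before each round and raise an alarm (or sit the round out) when drift is detected. A minimal sketch, assuming Evidently's 0.4.x `Report`/`DataDriftPreset` API and two placeholder pandas frames holding the site's reference and current features (file paths are hypothetical):

```python
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Hypothetical inputs: the features the global model was validated against vs.
# the site's most recent local batch (both stay on-site).
reference_df = pd.read_parquet("reference_features.parquet")
current_df = pd.read_parquet("current_features.parquet")

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference_df, current_data=current_df)
report.save_html("drift_report.html")  # artefact for the local audit log

# The first preset metric summarises dataset-level drift; the exact dict layout
# may differ between Evidently versions.
result = report.as_dict()["metrics"][0]["result"]
if result.get("dataset_drift"):
    print("Data drift detected: flag this site before the next round")
```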
Reliable FL deployments track three core SLIs — coordinator availability, client-participation rate, and per-round latency — because each has a proven impact on convergence and user experience:
| SLI | Why it matters | How to measure |
|---|---|---|
| Coordinator availability | A single coordinator outage halts every training round; Google’s SRE handbook recommends continuous success-probes to detect silent failures [1]. | Expose periodic HTTP/gRPC health probes and alert on SLO burn rate rather than single failures. |
| Client-participation rate | Model quality degrades sharply when too few clients report; large-scale studies see ≥ 8 pp accuracy loss at ≈ 10 % participation [2][3]. | Export `fl_clients_participation_ratio = n_active / n_selected` each round. Flower and NVFLARE expose this metric via Prometheus [4][5]. |
| Per-round latency (p95) | Latency spikes are early warnings of network congestion or stragglers; the Site Reliability Workbook advises alerting when p95 exceeds the agreed SLO budget [6]. | Track `fl_round_duration_seconds` as a histogram and fire an alert when p95 breaches the budget for three consecutive windows. |
Implementation tip – Prometheus dashboards shipped with Flower ≥ 1.8 and NVFLARE 2.x already publish `server_uptime_seconds`, `client_participation_ratio`, and `round_duration_seconds`, so all three SLIs can be monitored without additional coding [4][5].
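If your coordinator is custom-built and does not ship those dashboards, the same SLIs can be exported by hand with the Python `prometheus_client` library. A minimal sketch (metric names follow the table above; `record_round` would be called from your own round loop):

```python
from prometheus_client import Gauge, Histogram, start_http_server

# Fraction of selected clients that actually reported in the current round
participation = Gauge(
    "fl_clients_participation_ratio",
    "Active clients divided by selected clients, per round",
)
# Wall-clock duration of each round; p95 is derived from the histogram in Prometheus
round_duration = Histogram(
    "fl_round_duration_seconds",
    "Duration of a federated training round in seconds",
    buckets=(30, 60, 120, 300, 600, 1200),
)

start_http_server(9100)  # expose /metrics for the Prometheus scraper

def record_round(n_active: int, n_selected: int, duration_s: float) -> None:
    """Call once per completed round from the coordinator loop."""
    participation.set(n_active / max(n_selected, 1))
    round_duration.observe(duration_s)
```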
The open-source Flower framework [7] and several other frameworks implement FL, each with different programming languages, maturity levels and security features.

Flower ships secure aggregation as `secagg_mod`/`secaggplus_mod` on the client and `SecAgg{,+}Workflow` on the server (≥ v1.8); launch with `flwr run --mods secaggplus_mod` (≥ Flower 1.8) [8]. A lightweight alternative, the LightSecAgg protocol, offers dropout-resilient secure aggregation for asynchronous FL; it is still research-grade and not yet merged into Flower core. See the official secure-aggregation notebook [9] for a minimal working example, sketched below.

When choosing a framework, consider compatibility with your existing code, support for secure aggregation and the maturity of the community.
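As an illustration of how the pieces fit together, the sketch below is adapted from the official example [9] (Flower ≥ 1.8 assumed); `client_fn`, the strategy and all numeric parameters are placeholders, and exact signatures may differ between Flower releases:

```python
# client_app.py: attach the SecAgg+ mod so updates are masked before upload
from flwr.client import ClientApp
from flwr.client.mod import secaggplus_mod

# client_fn: your existing Flower client factory (assumed to be defined elsewhere)
app = ClientApp(client_fn=client_fn, mods=[secaggplus_mod])
```

```python
# server_app.py: run the fit phase through the SecAgg+ workflow
from flwr.server import LegacyContext, ServerApp, ServerConfig
from flwr.server.strategy import FedAvg
from flwr.server.workflow import DefaultWorkflow, SecAggPlusWorkflow

app = ServerApp()

@app.main()
def main(driver, context):
    # Wrap the incoming context for strategy-based execution (see [9] for details)
    context = LegacyContext(
        context=context,
        config=ServerConfig(num_rounds=3),  # illustrative round count
        strategy=FedAvg(),
    )
    workflow = DefaultWorkflow(
        fit_workflow=SecAggPlusWorkflow(
            num_shares=3,                # illustrative: mask shares per client
            reconstruction_threshold=2,  # illustrative: shares needed to unmask
        )
    )
    workflow(driver, context)
```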
Declare a licence (e.g. Apache-2.0, CC-BY-4.0) in model metadata to clarify reuse terms. Include a LICENSE file and SPDX metadata in every crate to ensure legal clarity. Package each run with runcrate and publish results via Zenodo.

Use the Data Stewardship Wizard (DSW), the ELIXIR-CONVERGE–supported DMP wizard, to create a machine-actionable DMP for federated studies. Select an ELIXIR/CONVERGE knowledge model, answer the guided questions, and export the plan (JSON/PDF) for inclusion in your project records and RO-Crate. DSW complements funder templates and can be used alongside institutional tools such as DMPonline.
Specific questions to cover include:
Example (excerpt) from the wizard `dmp.json`:
{
"federated_storage_location": "TRE‑Portuguese Node",
"model_doi": "10.5281/zenodo.9999999",
"retention_policy_days": 90,
"secure_aggregation": true
}
Access the wizard at the Data Stewardship Wizard (DSW) website (https://ds-wizard.org/), or use your institution’s DMPonline service (https://dmponline.dcc.ac.uk/). See also the RDMKit guidance on DMPs [18].
Example model-card YAML snippet:
model_details:
  name: "Federated MNIST Classifier"
  version: "1.0"
  training_algorithm: "FedAvg"
  rounds: 100
  participants: 5
intended_use:
  primary_use: "Handwritten digit classification"
  out_of_scope: "Medical imaging, document analysis"
performance:
  metric: "accuracy"
  global_model: 0.98
  fairness_assessment: "Evaluated across demographic groups"
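To catch missing or renamed fields before a model is released, a lightweight check can run in CI. A minimal sketch, assuming PyYAML is installed and that the card follows the field names in the snippet above (the file name `model_card.yaml` is hypothetical):

```python
import yaml

REQUIRED = {
    "model_details": {"name", "version", "training_algorithm", "rounds", "participants"},
    "intended_use": {"primary_use", "out_of_scope"},
    "performance": {"metric", "global_model", "fairness_assessment"},
}

def validate_model_card(path: str) -> None:
    """Raise ValueError if any required section or field is missing."""
    with open(path) as fh:
        card = yaml.safe_load(fh) or {}
    for section, fields in REQUIRED.items():
        missing = fields - set(card.get(section, {}))
        if missing:
            raise ValueError(f"{section}: missing fields {sorted(missing)}")

# Usage (hypothetical file name):
# validate_model_card("model_card.yaml")
```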
Include environment lock files (e.g. conda-lock) inside the crate for full environment capture. The Workflow-Run RO-Crate (Process Run Crate) profile [?] formalises provenance metadata capture for computational workflow executions. A minimal ro-crate-metadata.json for a federated training run is:
{
"@context": [
"https://w3id.org/ro/crate/1.1/context",
"https://w3id.org/ro/terms/workflow-run/context"
],
"@graph": [
{
"@id": "ro-crate-metadata.json",
"@type": "CreativeWork",
"conformsTo": { "@id": "https://w3id.org/ro/crate/1.1" },
"about": { "@id": "./" }
},
{
"@id": "./",
"@type": "Dataset",
"name": "Federated training run",
"conformsTo": { "@id": "https://w3id.org/ro/wfrun/process/0.5" },
"hasPart": [
{ "@id": "model.pkl" },
{ "@id": "metrics.json" }
],
"mentions": { "@id": "#Training_1" }
},
{
"@id": "https://flower.ai/",
"@type": "SoftwareApplication",
"name": "Flower"
},
{
"@id": "#Training_1",
"@type": "CreateAction",
"name": "Federated training",
"instrument": { "@id": "https://flower.ai/" },
"result": { "@id": "model.pkl" }
},
{ "@id": "model.pkl", "@type": "File" },
{ "@id": "metrics.json", "@type": "File" }
]
}
If your run was orchestrated by a workflow engine (e.g., CWL/Galaxy), use the Workflow-Run Crate profile (change the conformsTo URI accordingly) [?]. Full implementations for secure TRE contexts are available in the Five Safes RO-Crate record (Zenodo) [19].
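If you prefer to generate the crate programmatically rather than writing the JSON by hand, the ro-crate-py library can produce an equivalent structure. A minimal sketch, assuming ro-crate-py is installed and the file names match the example above (profile/conformsTo wiring is left to your pipeline):

```python
from rocrate.rocrate import ROCrate
from rocrate.model.contextentity import ContextEntity

crate = ROCrate()
crate.add_file("model.pkl")      # aggregated global model
crate.add_file("metrics.json")   # round-level metrics, aggregate only

# Describe the training run as a CreateAction linked to its result
crate.add(ContextEntity(crate, "#Training_1", properties={
    "@type": "CreateAction",
    "name": "Federated training",
    "result": {"@id": "model.pkl"},
}))

crate.write("federated-run-crate")  # emits ro-crate-metadata.json next to the files
```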
Follow the DOME‑ML recommendations [20] for reproducible machine learning validation:
✓ Data
✓ Optimisation
✓ Model
✓ Evaluation
Immediate actions:
Emerging techniques:
For a comprehensive survey on certified removal, see [23].
Follow DOME-ML reproducibility protocols for systematic checkpoint management and backup strategies across distributed sites. Document all recovery procedures and test failover scenarios regularly.
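A minimal sketch of per-round checkpointing with a bounded retention window (paths and the retention depth are illustrative; mirroring to an off-site backup is left to your infrastructure):

```python
import json
import shutil
from pathlib import Path

CKPT_DIR = Path("checkpoints")  # illustrative local coordinator path
KEEP_LAST = 5                   # illustrative retention depth

def save_round_checkpoint(round_num: int, model_bytes: bytes, metrics: dict) -> Path:
    """Persist the aggregated model and round metrics, then prune old checkpoints."""
    CKPT_DIR.mkdir(exist_ok=True)
    path = CKPT_DIR / f"round_{round_num:05d}"
    path.mkdir(exist_ok=True)
    (path / "model.bin").write_bytes(model_bytes)
    (path / "metrics.json").write_text(json.dumps(metrics))
    # Keep only the most recent KEEP_LAST rounds on local disk
    for old in sorted(CKPT_DIR.glob("round_*"))[:-KEEP_LAST]:
        shutil.rmtree(old)
    return path
```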
The EDPS commentary emphasizes that data subjects retain erasure rights even in federated settings, requiring coordinated deletion mechanisms across all participating nodes.
Use group‑fairness metrics (demographic parity, equal opportunity) to audit both global and per‑site models. Mitigation strategies include re‑weighting, constrained optimisation and fairness‑aware FedAvg variants. Follow DOME-ML fairness evaluation guidelines for systematic bias assessment across federated model performance.
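A minimal sketch of the two group-fairness metrics named above, computed per site on held-out predictions (binary labels and a binary protected attribute are assumed; only the aggregate gaps need to leave the site):

```python
import numpy as np

def demographic_parity_diff(y_pred, group):
    """Absolute gap in positive-prediction rate between two groups (0/1 encoded)."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    return abs(y_pred[group == 1].mean() - y_pred[group == 0].mean())

def equal_opportunity_diff(y_true, y_pred, group):
    """Absolute gap in true-positive rate between two groups (0/1 encoded)."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    tpr_g1 = y_pred[(group == 1) & (y_true == 1)].mean()
    tpr_g0 = y_pred[(group == 0) & (y_true == 1)].mean()
    return abs(tpr_g1 - tpr_g0)

# Each site reports only these aggregate gaps for the global and local models,
# never raw labels or predictions.
```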
Beyer, Betsy, Jones, Chris, Petoff, Jennifer, Murphy, Niall Richard (2016). Site reliability engineering: how Google runs production systems. O’Reilly Media, Inc.. Available at: https://sre.google/sre-book/
Bonawitz, Keith, Eichner, Hubert, Grieskamp, Wolfgang, Huba, Dzmitry, Ingerman, Alex, Ivanov, Vladimir, Kiddon, Chloé, Konečný, Jakub, Mazzocchi, Stefano, McMahan, Brendan, Van Overveldt, Timon, Petrou, David, Ramage, Daniel, Roselander, Jason (2019). Towards Federated Learning at Scale: System Design. In Proceedings of Machine Learning and Systems, pp. 374–388. Available at: https://proceedings.mlsys.org/paper_files/paper/2019/file/7b770da633baf74895be22a8807f1a8f-Paper.pdf
Lai, Fan, Dai, Yinwei, Zhu, Xiangfeng, Madhyastha, Harsha V., Chowdhury, Mosharaf (2021). FedScale: Benchmarking Model and System Performance of Federated Learning. In Proceedings of the First Workshop on Systems Challenges in Reliable and Secure Federated Learning, pp. 1–3. Association for Computing Machinery. DOI: 10.1145/3477114.3488760
Flower Labs (2023). Monitoring Simulation in Flower. https://flower.ai/blog/2023-02-06-monitoring-simulation-in-flower.
NVIDIA (2024). System Monitoring — NVFLARE User Guide. https://nvflare.readthedocs.io/en/2.6/user_guide/monitoring.html.
Beyer, Betsy, Murphy, Niall Richard, Rensin, David K, Kawahara, Kent, Thorne, Stephen (2018). The site reliability workbook: practical ways to implement SRE. O’Reilly Media, Inc.. Available at: https://sre.google/workbook/alerting-on-slos/
Beutel, Daniel J., Topal, Taner, Mathur, Akhil, Qiu, Xinchi, Fernandez-Marques, Javier, Gao, Yan, Sani, Lorenzo, Li, Kwing Hei, Parcollet, Titouan, de Gusmão, Pedro P. B., Lane, Nicholas D. (2022). Flower: A Friendly Federated Learning Research Framework. arXiv preprint arXiv:2007.14390. Available at: https://arxiv.org/abs/2007.14390
Flower Labs (2025). Secure Aggregation Protocols. https://flower.ai/docs/framework/contributor-ref-secure-aggregation-protocols.html.
Flower Labs (2025). Secure aggregation with Flower (the SecAgg+ protocol). https://flower.ai/docs/examples/flower-secure-aggregation.html.
Federated AI Technology Enabler (2024). FATE documentation. https://fate.readthedocs.io/en/develop/.
NVIDIA Corporation (2025). NVIDIA FLARE: Federated Learning Application Runtime Environment. https://github.com/NVIDIA/NVFlare.
Owkin, Linux Foundation AI (2025). Substra: open-source federated learning software. https://github.com/substra.
Jahns, Kevin (2024). Yjs: Shared data types for building collaborative software. https://github.com/yjs/yjs.
Flower Labs (2025). Flower Community Slack Server. https://friendly-flower.slack.com/.
FATE Project (2025). FATE User Mailing List. https://lists.lfaidata.foundation/g/Fate-FedAI.
Federated AI Technology Enabler (FATE) (2025). FATE-Community GitHub organisation. https://github.com/FederatedAI/FATE-Community.
ELIXIR Europe (2025). ELIXIR Federated Human Data Community. https://elixir-europe.org/communities/human-data.
ELIXIR Europe (2025). Data Management Plan (RDMKit task page). https://rdmkit.elixir-europe.org/data_management_plan.
Soiland-Reyes, Stian, Wheater, Stuart (2023). Five Safes RO-Crate profile. https://trefx.uk/5s-crate/0.4/.
Walsh, Christopher J., Ross, Kenneth N., Mills, James G., et al. (2021). DOME: recommendations for supervised machine learning validation in biology. Nature Methods, 18, 1122–1127. DOI: 10.1038/s41592-021-01205-4
Iterative, Inc. (2025). Data Version Control User Guide (v3.1). Available at: https://dvc.org/doc/user-guide
Metz, Cade (2023). Now That Machines Can Learn, Can They Unlearn?. https://www.wired.com/story/machines-can-learn-can-they-unlearn/.
Bourtoule, Lucas, Chandrasekaran, Varun, Choquette-Choo, Christopher A., Jia, Hengrui, Travers, Adelin, Zhang, Baiwu, Lie, David, Papernot, Nicolas (2021). Machine Unlearning. In 2021 IEEE Symposium on Security and Privacy (SP), pp. 141-159. DOI: 10.1109/SP40001.2021.00019
This page is an example of RDMKit-compliant documentation created by Jorge Miguel Silva.
Original repository: federated_learning_page