Vol-3194/paper38

id          Vol-3194/paper38
wikidataid  Q117344950
title       Decentralized Federated Learning and Network Topologies: an Empirical Study on Convergence
pdfUrl      https://ceur-ws.org/Vol-3194/paper38.pdf
dblpUrl     https://dblp.org/rec/conf/sebd/KavalionakCDFMC22
volume      Vol-3194

Decentralized Federated Learning and Network Topologies: an Empirical Study on Convergence

Hanna Kavalionak1, Emanuele Carlini1, Patrizio Dazzi3, Luca Ferrucci1,
Matteo Mordacchini2 and Massimo Coppola1
1 Institute of Information Science and Technologies (ISTI), National Research Council (CNR), Pisa, Italy
2 Institute of Informatics and Telematics (IIT), National Research Council (CNR), Pisa, Italy
3 Department of Computer Science, University of Pisa, Italy


Abstract
Federated Learning is a well-known learning paradigm that allows the distributed training of machine
learning models. Federated Learning keeps data in the source devices and communicates only the model’s
coefficients to a centralized server. This paper studies the decentralized flavor of Federated Learning: a
peer-to-peer network replaces the centralized server, and nodes exchange the model’s coefficients directly.
In particular, we look for empirical evidence on the effect of different network topologies and communication
parameters on the convergence in the training of distributed models. Our observations suggest that
small-world networks converge faster for small numbers of nodes, while scale-free networks are more
suitable for larger setups.

Keywords
Federated Learning, Peer-to-Peer, Distributed Systems




1. Introduction
Federated Learning (FL) [1] is one of the most popular trends in research communities dealing
with distributed machine learning [2, 3] and decentralized intelligence [4, 5]. At its core, FL is a
distributed framework in which machine learning models are trained over a set of devices, with
each device having a small subset of the whole training data. Each node then runs a learning
process locally. The coefficients computed during the learning phase are periodically sent to a
central aggregation entity, e.g., a remote server. The server collects the model coefficients from
all nodes of the system and aggregates them into the working model, which is then pushed
back to the nodes. FL can find application in many fields [1], including mobile applications [6],
healthcare [7], and autonomous vehicles [8].
   FL offers a series of out-of-the-box benefits that make it attractive to academia and industry,
including: (i) Models constantly improve using nodes' data with no need to aggregate data in a


SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy
hanna.kavalionak@isti.cnr.it (H. Kavalionak); emanuele.carlini@isti.cnr.it (E. Carlini); patrizio.dazzi@unipi.it
(P. Dazzi); luca.ferrucci@isti.cnr.it (L. Ferrucci); matteo.mordacchini@iit.cnr.it (M. Mordacchini);
massimo.coppola@isti.cnr.it (M. Coppola)
0000-0002-8852-3062 (H. Kavalionak); 0000-0003-3643-5404 (E. Carlini); 0000-0001-8504-1503 (P. Dazzi);
0000-0003-4159-0644 (L. Ferrucci); 0000-0002-1406-828X (M. Mordacchini); 0000-0002-7937-4157 (M. Coppola)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org
centralized server; (ii) Source data are kept close to the source, enabling privacy-aware learning;
(iii) FL also fits situations where the network is limited in bandwidth and latency.
    Recently, research communities have devoted much effort to decentralizing the learning process.
Nevertheless, the study of the benefits of fully decentralized against centralized solutions is still
a challenge and is of particular interest [9]. The decentralized learning process assumes nodes
(i.e., peers) to be part of an unstructured peer-to-peer network (i.e., overlay) that can be used
to directly communicate with a subset of the other nodes in the network. The training is an
iterative process: after performing the training of the model locally, the nodes send their model
to their neighbours using point-to-point communications and then aggregate the coefficients.
Due to the nature of decentralized communication, the model can converge to stable values
only after a certain number of iterations. Decentralized FL retains all the benefits of traditional
FL, but also allows a distributed learning process without the need to set up and operate a
centralized entity to aggregate model coefficients. In this paper, we are interested in empirically
measuring how sensitive the convergence of a decentralized FL system is to network topologies
and communication parameters.

The content of this discussion paper is based on another manuscript [10] already published by
the same authors, which contains a more detailed discussion of related work and the results of
large-scale experiments.


2. Related work
Hegedűs et al. [11] propose an empirical comparison of gossip-based learning and federated
learning. The authors evaluate the effectiveness of both approaches on three available datasets.
In the paper, the authors consider a network with a fixed number of neighbors. Lian et al. [12]
and Tang et al. [13] introduce the idea of applying gossip algorithms and evaluate their
effectiveness in comparison with centralized solutions. Koloskova et al. [14] improve the
algorithms proposed by Lian et al. [12] with arbitrary gradient compression. Lalitha et al. [15]
propose a formal description and an algorithm for a distributed federated learning problem.
The work of Savazzi et al. [16] studies gossip-based distributed solutions for learning models,
focusing on the way data is distributed between the nodes of the system. Another work that
relies on data segmentation between nodes is that of Hu et al. [17], whose focus is on bandwidth
optimization rather than the speed of accuracy convergence. These works propose a solid
theoretical basis for the concept of decentralized learning. Most of them emphasize how
decentralization impacts the quality of training and, consequently, the quality of the prediction.
However, just a few explore the decentralized system’s behavior in terms of the characteristics
of the overlay and its convergence speed.
3. System model
3.1. Federated Learning
Federated Learning comes in many flavors (see [18, 6]) and is extensively used in many real-world
applications [1]. In this paper, we focus on the so-called horizontal FL, in which data
on each node (or device) shares the same feature space but is composed of different, unique
samples. A typical horizontal federated-learning system considers 𝑁 nodes that collaboratively
train a machine learning model.
    More formally, each node 𝑖 has a local dataset 𝒟𝑖 , such that the whole dataset 𝒟 = 𝒟1 ∪
· · · ∪ 𝒟𝑁 and 𝒟𝑖 ∩ 𝒟𝑖′ = ∅ for 𝑖 ≠ 𝑖′ . The training process of horizontal FL is agnostic with
respect to the specific learning model used. Commonly used models in FL exploit a gradient-
descent strategy to minimize the value of a loss function 𝑙(·) (a function that gives a measure of
the error), defined on the parameters vector 𝑤, i.e. 𝑙(𝑤).
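
In the horizontal setting sketched above, the global objective can be written in the usual form (a standard formulation given here for reference, not quoted from the paper; weighting each node by its share of the data is one common choice):

    \min_{w} \; l(w) \;=\; \sum_{i=1}^{N} \frac{|\mathcal{D}_i|}{|\mathcal{D}|}\, l_i(w),
    \qquad
    l_i(w) \;=\; \frac{1}{|\mathcal{D}_i|} \sum_{(x,y)\in\mathcal{D}_i} \ell\big(f_w(x),\, y\big),

where l_i(w) is the local loss computed on 𝒟𝑖 , ℓ is the per-sample loss (e.g., hinge or logistic), and f_w is the model with parameters 𝑤.
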
    Procedurally, the training phase of the model is usually composed of the following steps
[1]: (i) the nodes train a local instance of the machine learning model by using their data.
Afterward, the coefficients of the model are sent to a centralized server; (ii) the server performs
an aggregation of the coefficients received by the nodes to create an "aggregated" model; (iii)
the server sends back the aggregated model’s coefficients to the nodes; (iv) nodes update their
model with the aggregated one received. These four steps continue until the training step
is completed (typically until the loss function of the aggregated model reaches convergence).
Therefore, the ultimate objective of an FL system is to maximize the quality of the model at
the centralized server, which is then sent to all the nodes in the system. FL comes with many
security implications, especially in terms of data privacy and integrity. In this paper, we do not
consider these security aspects. In reality, nodes would employ cryptography schemes, such as
homomorphic encryption [19], to protect their privacy and avoid data leakage. While privacy is
essential in the actual implementation of FL systems, our analysis is not affected by the (non)
presence of privacy-preserving mechanisms.
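
As a concrete illustration of steps (i)-(iv), the following is a minimal Python sketch of one centralized round, assuming a plain coordinate-wise average as the aggregation; the node objects and their local_train / set_coefficients methods are hypothetical, not the paper's implementation.

    import numpy as np

    def centralized_fl_round(global_w, nodes):
        # (i) every node trains locally starting from the current global
        #     coefficients and sends back its own coefficient vector
        local_ws = [node.local_train(global_w) for node in nodes]
        # (ii) the server aggregates the received coefficients
        #      (here: a simple coordinate-wise average)
        aggregated_w = np.mean(local_ws, axis=0)
        # (iii) + (iv) the aggregated coefficients are pushed back and each
        #       node replaces its local model with the aggregated one
        for node in nodes:
            node.set_coefficients(aggregated_w)
        return aggregated_w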

3.2. Decentralized Federated Learning
Compared with centralized FL, the general objective of decentralized learning is to
minimize the average of the loss functions for all the nodes in the system. Indeed, the core
difference between centralized and decentralized federated learning is, as the name implies,
that the latter does not require a centralized server to work. Instead, the training phase and
model distribution are performed only with peer-to-peer communications between nodes.
   We model the communication network with an undirected graph of 𝑁 nodes. We define the
neighborhood of a user 𝑖, denoted as 𝒩𝑖 , as the set of nodes 𝑗 that have an edge in the network
going from 𝑖 to 𝑗. We also consider the edge to be bi-directional, i.e. if 𝑗 ∈ 𝒩𝑖 then 𝑖 ∈ 𝒩𝑗 .
Nodes communicate in rounds, with each round having the same length for each node, Δ𝑔 . In
our model, 𝒩𝑖 does not change over rounds and remains static.
   The training procedure is the following: (i) Each node 𝑖 initially trains its model on the local
𝒟𝑖 and obtains 𝑤𝑖 ; (ii) for each Δ𝑔 , the 𝑤𝑖 is communicated to a subset of 𝒩𝑖 , according to
the available bandwidth. Here, we assume that each neighbour 𝑗 also communicates back its
own 𝑤𝑗 ; (iii) upon the reception of a 𝑤, nodes update their model by means of an aggregation
function 𝒜(·). The aggregation function is a crucial element when considering the quality of a
model. Since our focus is not on the quality, here we consider a simple aggregation function
that performs an average of the coefficients [9].

                           Pendigits    HAR        full-HAR
    Size                   10992        10000      499276
    Number of features     16           93         93
    Number of labels       10           4          4
    Label distribution     balanced     4:3:2:1    4:3:2:1
Table 1
Datasets properties
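
A minimal sketch of one decentralized round under the assumptions above (static undirected overlay, symmetric exchange with a subset of neighbours, averaging as 𝒜(·)); the names and data structures are illustrative, not the authors' code.

    import numpy as np

    def decentralized_round(overlay, w, contacts):
        # overlay : static undirected NetworkX graph (the peer-to-peer overlay)
        # w       : dict node -> current coefficient vector (numpy array)
        # contacts: dict node -> neighbours contacted this round (at most c of
        #           them, chosen by the scheduling policy, e.g. round-robin)
        received = {i: [w[i]] for i in overlay.nodes}   # each node keeps its own w_i
        exchanged = set()
        for i in overlay.nodes:
            for j in contacts[i]:
                pair = frozenset((i, j))
                if pair in exchanged:
                    continue                    # this pair already exchanged this round
                exchanged.add(pair)
                received[j].append(w[i])        # i sends w_i to j ...
                received[i].append(w[j])        # ... and j replies with w_j
        # aggregation function A(.): coordinate-wise average of all received vectors
        return {i: np.mean(received[i], axis=0) for i in overlay.nodes}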


4. Experimental Setup
All the experiments have been run in Python 3.8 on a single workstation machine. The Scikit-
learn1 library was used for the classifiers. The networks were generated with the NetworkX2
library.

Datasets. For the experiments, we have considered two datasets prepared for multi-label
classification tasks. The properties of the datasets are listed in Table 1. The first dataset is
the Pendigits3 , in which each sample contains features taken from handwritten digits from
250 different subjects. The task is to classify each image with the correct digit. In the second
dataset, HAR4 , four human activities (inactive, active, driving, walking) were recorded with
a mobile app and put in relation with several sensors measurements in the phone. The task
is to classify the activities using data from the accelerometer, gyroscope, magnetometer, and
GPS. For most of the experiments, we randomly selected 10K samples from the original dataset
(which consists of around 500K samples) to be similar in size to the Pendigits dataset.
   10% of each dataset is reserved for testing, and the remaining 90% is distributed
to the nodes for training. The training data has been divided among clients proportionally, i.e.,
each client receives roughly the same amount of samples. This means that when testing larger
networks, the number of samples per node is lower: this is the cause of the lower general quality
of the prediction power of larger networks. Data does not change during the simulation.
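
One possible way to reproduce this partitioning with Scikit-learn and NumPy (a sketch under the assumptions above: 10% held out for testing, the rest split evenly and disjointly among the nodes; function and variable names are illustrative):

    import numpy as np
    from sklearn.model_selection import train_test_split

    def partition_dataset(X, y, n_nodes, seed=0):
        # X, y are numpy arrays holding the whole dataset
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.10, random_state=seed)
        # shuffle the training indices and cut them into n_nodes equal shards
        order = np.random.default_rng(seed).permutation(len(X_tr))
        shards = [(X_tr[idx], y_tr[idx]) for idx in np.array_split(order, n_nodes)]
        return shards, (X_te, y_te)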

Models. We used two machine learning classifiers in our experiments. The first one is a
stochastic gradient descent (SGD) model that trains a linear Support Vector Machine using the
hinge loss function. The second one is a Logistic Regressor classifier. Since logistic regressors are
used for binary classification problems, we use the one-vs-rest (OvR) strategy for the multi-class
classification. OvR subdivides the original multi-class problem into multiple binary
sub-problems (one for each class) and trains a model for each sub-problem. The model that
obtains the best results is used for the prediction.

   1 https://scikit-learn.org/
   2 https://networkx.org/
   3 http://archive.ics.uci.edu/ml/datasets/pen-based+recognition+of+handwritten+digits
   4 https://lbd.udc.es/research/real-life-HAR-dataset/
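
The two classifiers can be instantiated with Scikit-learn roughly as follows (a sketch; the explicit OvR wrapper is one way to realize the strategy described above, and the hyper-parameters shown are defaults or illustrative choices, not the ones reported in the paper):

    from sklearn.linear_model import SGDClassifier, LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier

    # SGD-trained linear classifier: the hinge loss makes SGDClassifier
    # equivalent to a linear Support Vector Machine
    sgd_svm = SGDClassifier(loss="hinge")

    # Logistic regression wrapped in an explicit one-vs-rest scheme:
    # one binary sub-model per class, the most confident one wins
    ovr_logreg = OneVsRestClassifier(LogisticRegression(max_iter=1000))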

Networks. We use three different types of networks derived from graph theory to build the
overlay between nodes. For each overlay, we assume a bidirectional communication channel that
corresponds to an undirected graph. Nodes have no self-edges or parallel edges. All networks
are also connected, i.e., every node can reach every other node.

    • Regular random graph. Every node has the same degree. Neighbors are chosen randomly.
    • Small-world graph. It is built with the Watts–Strogatz technique5 . It has short average
      path lengths, i.e., the distance between two randomly chosen nodes is proportional to
      the logarithm of the graph’s size. Some social networks and biological networks can be
      modeled with a small-world graph.
    • Scale-free graph. It is built with the Barabási–Albert preferential attachment technique.
      These graphs have nodes, called hubs, with a disproportionately large degree compared
      with other nodes. Many networks can be modeled as scale-free graphs, including the
      Internet and many social networks.

   These networks are characterized by two parameters. The first one is the size of the network
𝑁 = {32, 64, 128, 256, 512}. The second parameter 𝐾 = {5, 10, 15, 20} controls how nodes
connect to each other. The semantics of this parameter differs depending on the network
considered: in the random graph, it defines the degree of the nodes; in the small-world graph,
it defines the number of nearest neighbors that each node connects to in the ring topology;
in the scale-free graph, it defines the number of edges to attach from a new node to existing nodes.
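
With NetworkX, the three overlays can be generated along these lines (a sketch; footnote 5 suggests the Newman-Watts-Strogatz variant, where extra edges are added with probability 0.2 rather than rewired, but the exact generator used in the paper is an assumption here):

    import networkx as nx

    N, K = 128, 5    # illustrative values of the two parameters

    # Regular random graph: every node has degree K, neighbours chosen at random
    regular = nx.random_regular_graph(d=K, n=N)

    # Small-world graph: ring of K nearest neighbours, plus extra random
    # edges added with probability 0.2 per existing edge (see footnote 5)
    small_world = nx.newman_watts_strogatz_graph(n=N, k=K, p=0.2)

    # Scale-free graph: Barabasi-Albert preferential attachment, each new
    # node attaches K edges, preferentially to high-degree nodes (hubs)
    scale_free = nx.barabasi_albert_graph(n=N, m=K)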

Communication. The communication among nodes is divided into synchronous rounds, or
iterations, in which each node has a chance to communicate and exchange the model coefficients
with another node. Each node can communicate only with its neighbors on the network. An
iteration terminates when all nodes have had the chance to communicate. Communication
between nodes is affected in two ways. First, we simulate connections between nodes not work-
ing properly by dropping some communications with various probabilities, 𝑑 = {0, 0.1, 0.2, 0.3}.
When a communication is dropped, the corresponding model information is lost. Second, we
define the maximum number of node communications per round, 𝑐 = {1, 2, 4, 8}. For exam-
ple, when 𝑐 = 1, each node can communicate with only one neighbor at each iteration. The
selection of neighbors is made according to a round-robin algorithm. When |𝒩𝑖 | < 𝑐, node 𝑖
communicates with all its neighbours at each iteration.
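
A sketch of the per-round communication constraints described above (round-robin neighbour selection and message drops); the exact scheduling used by the authors is not spelled out, so this is one plausible reading with illustrative names:

    import random

    def neighbours_this_round(neighbour_list, round_idx, c):
        # fewer neighbours than c: the node contacts all of them
        if len(neighbour_list) <= c:
            return list(neighbour_list)
        # otherwise pick the next c neighbours, wrapping around the fixed list
        start = (round_idx * c) % len(neighbour_list)
        return [neighbour_list[(start + k) % len(neighbour_list)] for k in range(c)]

    def message_delivered(d):
        # a communication is dropped with probability d; the coefficients it
        # carried are simply lost for this round
        return random.random() >= d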


5. Evaluation
The system is evaluated by considering how fast each experiment converges to a value for the
model accuracy, i.e., the fraction of correctly classified items. We have also measured other
quality metrics such as precision, zero-one-loss, and f1 score, and we observed that the results
are comparable with the accuracy.

   5 The probability of adding a new edge for each edge is set to 0.2

Figure 1: Accuracy over time with different networks in relation to degree. The dotted horizontal line is
the accuracy obtained with the centralized FL. The vertical lines indicate when a given series (indicated
by the color) reached convergence. The plots are generated with 𝑐 = 1, 𝑑 = 0.3, 𝑛 = 128, and the
Pendigits dataset.

Figure 2: Accuracy over time with different networks in relation to network size. The dotted horizontal
line is the accuracy obtained with the centralized version and 𝑛 = 32. The vertical lines indicate when a
given series (indicated by the color) reached convergence. The plots are generated with 𝑐 = 1, 𝑑 = 0.3,
𝑘 = 5, and the Pendigits dataset.

The system accuracy is the average of the accuracy values of all nodes in the network. It is
computed at the end of every iteration. The convergence is measured by computing the difference
between consecutive measurements of the system accuracy. We consider the system to have
converged if, for three consecutive times, the accuracy is not lower than the previous value and
the difference is less than 0.001.
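
The convergence test described above (three consecutive non-decreasing accuracy values with increments below 0.001) can be expressed as a small check over the history of system-accuracy measurements (a sketch; names are illustrative):

    def has_converged(acc_history, eps=1e-3, streak=3):
        # converged when, for `streak` consecutive measurements, the accuracy
        # did not decrease and the increase stayed below `eps`
        if len(acc_history) < streak + 1:
            return False
        recent = acc_history[-(streak + 1):]
        return all(0.0 <= b - a < eps for a, b in zip(recent, recent[1:]))
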
   Figure 1 shows the influence of the degree parameter 𝐾 on the convergence and the overall
accuracy of the decentralized learning. On the random graph, the degree has a high impact
in terms of overall classification accuracy. In particular, a 5-degree network converges to an
accuracy slightly above 86%, whereas the 20-degree network converges to 88%. The convergence
speed is also affected, with the 10-degree network converging earliest at the 13th
iteration. The small-world network obtains similar values for the final accuracy. There is a
marginal difference between the various degree values, with a similar trend in all cases.
Similar to the small-world network, the scale-free network shows marginal differences in all
cases. We can notice significant "steps" in the increment of accuracy due to those iterations in
which high-degree nodes are updated with better models.
   Figure 2 shows how the accuracy varies over time with the size of the network 𝑁 . It is worth
noticing that the lower maximum accuracy obtained by larger networks is because the amount
of data per node is much lower, skewing the global accuracy drastically. However, we can notice
that smaller networks reach convergence faster than larger ones, but no significant differences
can be seen for different kinds of networks.
    Dataset     Model      Network         Accuracy   Convergence (iteration)
    HAR         Log. Reg   Scale-free      0.87       10.03
                           Random Graph    0.86        7.62
                           Small-world     0.87        6.33
                SGD        Scale-free      0.88       11.33
                           Random Graph    0.87       10.11
                           Small-world     0.88        8.22
    Pendigits   Log. Reg   Scale-free      0.86       12.59
                           Random Graph    0.86        9.11
                           Small-world     0.86        7.12
                SGD        Scale-free      0.89       14.50
                           Random Graph    0.89       11.17
                           Small-world     0.89        9.73

Table 2
Accuracy and Convergence speed for different datasets, models, and network type.


   Table 2 reports aggregated results comparing all experiments, varying multiple parameters
with a network of fixed size (𝑁 = 128). The global system accuracy is only slightly influenced
by the various parameter configurations, with better results for the SGD model. In terms
of convergence, it is clear that topology has a relevant impact, as the small-world networks
converge more rapidly than the other networks on average.


6. Conclusion
Although decentralized federated learning is gaining momentum, only a few works analyze
its behavior in relation to the network topology. This paper evaluated the impact of different
network characteristics on the convergence speed of a decentralized federated learning system.
In particular, we have empirically evaluated how sensitive the training process is to the various
network characteristics and parameters.
   Our results suggest that clustered networks such as scale-free and small-world look more
suitable to support decentralized training than other kinds of networks. In particular, small-world
networks seem to converge faster for a small setup, when the amount of training samples per
node is relatively low. By comparison, scale-free networks obtained better results in the large-
scale test. That could indicate that having a hierarchical organization of the network in which
hubs (or super peers) can collect the aggregated models and redistribute them could increase
the convergence speed. Naturally, this is a trade-off, as hierarchical systems also increase the
load on specific nodes and are sensitive to single points of failure (if a super-peer becomes
unavailable, much valuable information is lost). We plan to study these trade-offs in future
work, also considering dynamic networks, i.e., networks that change during the training phase.


Acknowledgments
This work has been partially supported by the European Union’s Horizon 2020 Research and Innovation
program, under the project TEACHING (Grant agreement ID: 871385).
References
 [1] Q. Yang, Y. Liu, T. Chen, Y. Tong, Federated machine learning: Concept and applications, ACM
     Transactions on Intelligent Systems and Technology (TIST) 10 (2019) 1–19.
 [2] P. Kairouz, et al., Advances and open problems in federated learning, CoRR abs/1912.04977 (2019).
 [3] K. Niwa, N. Harada, G. Zhang, W. B. Kleijn, Edge-Consensus Learning: Deep Learning on P2P
     Networks with Nonhomogeneous Data, Association for Computing Machinery, New York, NY,
     USA, 2020, p. 668–678.
 [4] H. Kavalionak, et al., Edge-based video surveillance with embedded devices, in: Proc. of the 28th
     Italian Symp. on Advanced Database Systems, Villasimius, Italy, June 21-24, volume 2646 of CEUR
     Workshop Proc., CEUR-WS.org, 2020.
 [5] L. Ferrucci, M. Mordacchini, M. Coppola, E. Carlini, H. Kavalionak, P. Dazzi, Latency preserving
     self-optimizing placement at the edge, FRAME ’21, Association for Computing Machinery, New
     York, NY, USA, 2020, p. 3–8.
 [6] W. Y. B. Lim, et al., Federated learning in mobile edge networks: A comprehensive survey, IEEE
     Comm. Surveys & Tutorials 22 (2020).
 [7] N. Rieke, et al., The future of digital health with federated learning, NPJ digital medicine 3 (2020)
     1–7.
 [8] S. R. Pokhrel, J. Choi, A decentralized federated learning approach for connected autonomous
     vehicles, in: 2020 IEEE Wireless Comm. and Networking Conf. Workshops (WCNCW), IEEE, 2020,
     pp. 1–6.
 [9] Q. Xia, W. Ye, Z. Tao, J. Wu, Q. Li, A survey of federated learning for edge computing: Research
     problems and solutions, High-Confidence Computing (2021).
[10] H. Kavalionak, et al., Impact of network topology on the convergence of decentralized federated
     learning systems, in: 2021 IEEE Symp. on Comp. and Comm. (ISCC), IEEE, 2021.
[11] I. Hegedűs, G. Danner, M. Jelasity, Decentralized learning works: An empirical comparison of
     gossip learning and federated learning, Journal of Parallel and Dist. Comp. 148 (2021).
[12] X. Lian, C. Zhang, H. Zhang, C. Hsieh, W. Zhang, J. Liu, Can decentralized algorithms outperform
     centralized algorithms? A case study for decentralized parallel stochastic gradient descent, in:
     Advances in Neural Inf. Processing Systems 30: Annual Conf. on Neural Inf. Processing Systems
     2017, December 4-9, 2017, Long Beach, CA, USA, 2017.
[13] H. Tang, X. Lian, M. Yan, C. Zhang, J. Liu, 𝑑2 : Decentralized training over decentralized data, in:
     Proc. of the 35th Int. Conf. on Machine Learning, volume 80 of Proc. of Machine Learning Research,
     PMLR, 2018, pp. 4848–4856.
[14] A. Koloskova, S. Stich, M. Jaggi, Decentralized stochastic optimization and gossip algorithms with
     compressed communication, in: Proc. of the 36th Int. Conf. on Machine Learning, volume 97 of
     Proc. of Machine Learning Research, PMLR, 2019, pp. 3478–3487.
[15] A. Lalitha, S. Shekhar, T. Javidi, F. Koushanfar, Fully decentralized federated learning, in: Third
     workshop on Bayesian Deep Learning (NeurIPS), 2018.
[16] S. Savazzi, M. Nicoli, V. Rampa, Federated learning with cooperating devices: A consensus approach
     for massive iot networks, IEEE Internet Things J. 7 (2020) 4641–4654.
[17] C. Hu, J. Jiang, Z. Wang, Decentralized federated learning: A segmented gossip approach, 2019.
     arXiv:1908.07782.
[18] T. Li, A. K. Sahu, A. Talwalkar, V. Smith, Federated learning: Challenges, methods, and future
     directions, IEEE Signal Processing Magazine 37 (2020) 50–60.
[19] A. Acar, H. Aksu, A. S. Uluagac, M. Conti, A survey on homomorphic encryption schemes: Theory
     and implementation, ACM Comp. Surveys (CSUR) 51 (2018) 1–35.