Paper ID | MLSP-33.2
Paper Title | RESPIPE: RESILIENT MODEL-DISTRIBUTED DNN TRAINING AT EDGE NETWORKS
Authors | Pengzhen Li, Erdem Koyuncu, Hulya Seferoglu, University of Illinois at Chicago, United States
Session | MLSP-33: Optimization Methods
Location | Gather.Town
Session Time | Thursday, 10 June, 15:30 - 16:15
Presentation Time | Thursday, 10 June, 15:30 - 16:15
Presentation | Poster
Topic | Machine Learning for Signal Processing: [MLR-DFED] Distributed/Federated Learning
Abstract | The traditional approach to distributed deep neural network (DNN) training is data-distributed learning, which partitions the data and distributes it to workers. Although this approach has good convergence properties, it incurs a high communication cost, which puts a particular strain on edge systems and increases delay. An emerging alternative is model-distributed learning, in which the training model itself is distributed across workers. Model-distributed learning is a promising approach to reducing communication and storage costs, which is crucial for edge systems. In this paper, we design ResPipe, a novel resilient model-distributed DNN training mechanism that is robust against delayed/failed workers. We analyze the communication cost of ResPipe and demonstrate the trade-off between resiliency and communication cost. We implement ResPipe on a real testbed consisting of Android-based smartphones and show that it improves the convergence rate and accuracy of training for convolutional neural networks (CNNs).
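To make the model-distributed idea concrete, the sketch below partitions a sequential CNN into consecutive stages, one per worker, so that only stage-boundary activations (and their gradients) need to be communicated rather than the full model or dataset. This is a minimal illustration of model partitioning under assumed names (`split_model`, the toy CNN), not the authors' ResPipe mechanism, which additionally handles delayed/failed workers.

```python
# Minimal sketch of model-distributed (pipeline-style) partitioning.
# Hypothetical helper names; NOT the paper's ResPipe implementation.
import torch
import torch.nn as nn

def split_model(model: nn.Sequential, num_workers: int):
    """Partition a sequential model into num_workers consecutive stages."""
    layers = list(model.children())
    per_stage = (len(layers) + num_workers - 1) // num_workers  # ceil division
    return [nn.Sequential(*layers[i:i + per_stage])
            for i in range(0, len(layers), per_stage)]

# A toy CNN; in a real edge deployment each stage would live on a
# different device, exchanging only activations at stage boundaries.
cnn = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 10),
)
stages = split_model(cnn, num_workers=3)

# Forward pass chained across the stages (in-process here for illustration).
x = torch.randn(8, 1, 28, 28)
for stage in stages:
    x = stage(x)
print(x.shape)  # torch.Size([8, 10])
```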