Communication-efficient and Fault-tolerant Algorithms for Distributed Machine Learning

Author: Farzin Haddadpour

Publisher:

Published: 2021

Total Pages:

ISBN-13:

Distributed computing over multiple nodes is increasingly common in practical systems. Compared to classical single-node computation, it offers higher computing speed over large data. However, the computation delay of the overall distributed system is governed by its slowest nodes, the so-called straggler nodes, and when iterative algorithms such as gradient-descent-based methods are run, communication cost becomes a bottleneck. It is therefore important to design coded strategies that are resilient to stragglers while also being communication-efficient. Recent work has developed coding-theoretic approaches that add redundancy to distributed matrix-vector multiplication with the goal of speeding up the computation by mitigating the straggler effect.

First, we consider the case where the matrix comes from a small (e.g., binary) alphabet, for which a variant of the popular "Four-Russians method" is known to have significantly lower computational complexity than the usual matrix-vector multiplication algorithm. We develop novel code constructions applicable to binary matrix-vector multiplication via a variant of the Four-Russians method called the Mailman algorithm. Specifically, in our constructions the encoded matrices have a small alphabet, which ensures lower computational complexity as well as good straggler tolerance. We also present a trade-off between the communication and computation costs of distributed coded matrix-vector multiplication for general, possibly non-binary, matrices.

Second, we provide novel coded computation strategies, called MatDot, for distributed matrix-matrix products that outperform the recent "Polynomial code" constructions in recovery threshold, i.e., the required number of successful workers, at the cost of higher computation per worker and higher communication from each worker to the fusion node. We also demonstrate a novel coding technique for multiplying n matrices (n ≥ 3) using ideas from MatDot codes; a minimal illustrative sketch of the MatDot construction is given below.

Third, we introduce the idea of cross-iteration coded computing, an approach to reducing communication costs for a large class of distributed iterative algorithms involving linear operations, including gradient descent and accelerated gradient descent for quadratic loss functions. The state-of-the-art approach for these iterative algorithms performs one iteration of the algorithm per round of communication among the nodes. In contrast, our approach performs multiple iterations of the underlying algorithm in a single round of communication by introducing some redundancy in storage and computation. Our algorithm works in the master-worker setting: the workers store carefully constructed linear transformations of the input matrices and use these matrices in the iterative algorithm, while the master node inverts the effect of the linear transformations. In addition to reducing communication costs, a simple generalization of our algorithm also provides resilience to stragglers, failures, and Byzantine worker nodes. We further show a special case of our algorithm that trades off communication against computation; its degree of redundancy can be tuned according to the communication budget and the straggler resilience required. Moreover, we describe a variant of the algorithm that flexibly recovers the results based on the degree of straggling in the worker nodes, allowing performance to degrade gracefully as the number of successful (non-straggling) workers decreases.
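To make the MatDot idea above concrete, here is a minimal NumPy sketch (an illustration only; the names matdot_encode/matdot_decode, the evaluation points, and the toy dimensions are my own choices, not the thesis code): A is split into p column blocks and B into p row blocks, each worker multiplies one pair of encoded blocks, and the product A·B is recovered from any 2p - 1 worker results by polynomial interpolation.

```python
import numpy as np

def matdot_encode(A, B, p, xs):
    """Encode A (m x n) and B (n x q) into one coded pair per worker.

    A is split into p column blocks A_i and B into p row blocks B_j; worker w
    receives pA(x_w) = sum_i A_i x_w^(i-1) and pB(x_w) = sum_j B_j x_w^(p-j)."""
    A_blocks = np.split(A, p, axis=1)
    B_blocks = np.split(B, p, axis=0)
    tasks = []
    for x in xs:
        pA = sum(A_blocks[i] * x**i for i in range(p))
        pB = sum(B_blocks[j] * x**(p - 1 - j) for j in range(p))
        tasks.append((pA, pB))
    return tasks

def matdot_decode(results, xs, p):
    """Interpolate the matrix polynomial from any 2p - 1 worker products and
    read off the coefficient of x^(p-1), which equals A @ B."""
    k = 2 * p - 1                                  # recovery threshold
    pts = np.array(xs[:k])
    evals = np.stack(results[:k])                  # (k, m, q) worker products
    V = np.vander(pts, k, increasing=True)         # Vandermonde system V c = evals
    coeffs = np.linalg.solve(V, evals.reshape(k, -1)).reshape(k, *evals.shape[1:])
    return coeffs[p - 1]

# Toy run: 4 workers, any 3 of which suffice (here worker 2 "straggles").
p, m, n = 2, 4, 6
rng = np.random.default_rng(0)
A, B = rng.normal(size=(m, n)), rng.normal(size=(n, m))
xs = [1.0, 2.0, 3.0, 4.0]
tasks = matdot_encode(A, B, p, xs)
done = [0, 1, 3]                                   # indices of non-straggling workers
results = [tasks[w][0] @ tasks[w][1] for w in done]
C = matdot_decode(results, [xs[w] for w in done], p)
assert np.allclose(C, A @ B)
```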
Communication overhead is one of the key challenges that hinder the scalability of distributed optimization algorithms for training large neural networks. In recent years there has been a great deal of research on alleviating communication cost by compressing the gradient vector or by using local updates with periodic model averaging. The next direction in this thesis is to advocate the use of redundancy towards communication-efficient distributed stochastic algorithms for non-convex optimization. In particular, we show, both theoretically and empirically, that by properly infusing redundancy into the training data and combining it with model averaging, the number of communication rounds can be reduced significantly. More precisely, we show that redundancy reduces the residual error of local averaging, so the same level of accuracy is reached with fewer rounds of communication than in previous algorithms. Empirical studies on the CIFAR-10, CIFAR-100, and ImageNet datasets in a distributed environment complement our theoretical results; they show that our algorithms have additional beneficial properties, including tolerance to failures and greater gradient diversity.

Next, we study local distributed SGD, where data is partitioned among computation nodes and each node performs local updates, periodically exchanging the model with the other workers for averaging. While local SGD has empirically been shown to give promising results, a theoretical understanding of its performance remains open. We strengthen the convergence analysis of local SGD and show that it can be far less expensive and applied far more generally than current theory suggests. Specifically, we show that for loss functions satisfying the Polyak-Łojasiewicz (PL) condition, O((pT)^(1/3)) rounds of communication suffice to achieve a linear speedup, that is, an error of O(1/(pT)), where p is the number of workers and T is the total number of model updates at each worker. This contrasts with previous work, which required more communication rounds and was limited to strongly convex loss functions for similar asymptotic performance. We also develop an adaptive synchronization scheme that provides a general condition for linear speedup, and we validate the theory with experiments on AWS EC2 clouds and an internal GPU cluster.
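The local-update/periodic-averaging pattern described above can be illustrated with a short sketch (my own toy example on a noiseless least-squares problem; names such as local_sgd and the synchronization period tau are assumptions, not the thesis implementation): each worker takes tau local SGD steps between model-averaging rounds, so T local updates per worker cost only T/tau communication rounds.

```python
import numpy as np

def local_sgd(worker_data, w0, lr=0.1, tau=8, rounds=25):
    """Each worker runs tau local SGD steps on its own shard, then all workers
    average their models; every averaging step is one communication round."""
    p = len(worker_data)
    workers = [w0.copy() for _ in range(p)]
    rng = np.random.default_rng(0)
    for _ in range(rounds):                        # communication rounds
        for k, (X, y) in enumerate(worker_data):
            w = workers[k]
            for _ in range(tau):                   # local updates, no communication
                i = rng.integers(len(y))
                grad = (X[i] @ w - y[i]) * X[i]    # gradient of 0.5 * (x_i . w - y_i)^2
                w = w - lr * grad
            workers[k] = w
        avg = np.mean(workers, axis=0)             # periodic model averaging
        workers = [avg.copy() for _ in range(p)]
    return workers[0]

# Toy run: a least-squares problem sharded across 4 workers.
rng = np.random.default_rng(1)
w_true = rng.normal(size=5)
X = rng.normal(size=(400, 5))
y = X @ w_true
shards = [(X[i::4], y[i::4]) for i in range(4)]
w_hat = local_sgd(shards, np.zeros(5))
print(np.linalg.norm(w_hat - w_true))              # error should shrink toward 0 (noiseless data)
```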
In the final section, we focus on federated learning, where communication cost is often a critical bottleneck in scaling distributed optimization algorithms that collaboratively learn a model from millions of devices with potentially unreliable or limited communication and heterogeneous data distributions. Two notable approaches to the communication overhead of federated algorithms are gradient compression and local computation with periodic communication. Despite many attempts, characterizing the relationship between these two approaches has proven elusive. We address this by proposing a set of algorithms with periodic compressed (quantized or sparsified) communication and analyzing their convergence in both homogeneous and heterogeneous local data distribution settings. For the homogeneous setting, our analysis improves existing bounds by providing tighter convergence rates for both strongly convex and non-convex objective functions. To mitigate data heterogeneity, we introduce a local gradient tracking scheme and obtain sharp convergence rates that match the best-known communication complexities without compression in the convex, strongly convex, and non-convex settings. We complement our theoretical results by demonstrating the effectiveness of the proposed methods on real-world datasets.
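As a complement, here is a minimal sketch of the periodic compressed-communication idea (illustrative only; top-k sparsification of the local model delta is one common compressor, and the helper names are assumptions, not the thesis algorithms): once per round, each worker uplinks a sparsified update and the server averages the compressed deltas into the global model.

```python
import numpy as np

def topk(v, k):
    """Top-k sparsification: keep the k largest-magnitude entries, zero the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def compressed_round(global_w, worker_ws, k):
    """One communication round: each worker sends a compressed model delta,
    and the server averages the deltas into the global model."""
    deltas = [topk(w - global_w, k) for w in worker_ws]   # sparse uplink messages
    return global_w + np.mean(deltas, axis=0)

# Example: 3 workers, 10-dimensional model, only 2 coordinates sent per worker.
rng = np.random.default_rng(2)
g = np.zeros(10)
local_models = [g + rng.normal(size=10) for _ in range(3)]
g_new = compressed_round(g, local_models, k=2)
```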


Fault-tolerant Message-passing Distributed Systems

Author: Michel Raynal

Publisher:

Published: 2018

Total Pages: 459

ISBN-13: 9783319941424

This book presents the most important fault-tolerant distributed programming abstractions and their associated distributed algorithms, in particular in terms of reliable communication and agreement, which lie at the heart of nearly all distributed applications. These programming abstractions, distributed objects or services, allow software designers and programmers to cope with asynchrony and the most important types of failures such as process crashes, message losses, and malicious behaviors of computing entities, widely known under the term "Byzantine fault-tolerance". The author introduces these notions in an incremental manner, starting from a clear specification, followed by algorithms which are first described intuitively and then proved correct. The book also presents impossibility results in classic distributed computing models, along with strategies, mainly failure detectors and randomization, that allow us to enrich these models. In this sense, the book constitutes an introduction to the science of distributed computing, with applications in all domains of distributed systems, such as cloud computing and blockchains. Each chapter comes with exercises and bibliographic notes to help the reader approach, understand, and master the fascinating field of fault-tolerant distributed computing.


Communication and Agreement Abstractions for Fault-Tolerant Asynchronous Distributed Systems

Author: Michel Raynal

Publisher: Springer Nature

Published: 2022-06-01

Total Pages: 251

ISBN-13: 3031020006

Understanding distributed computing is not an easy task. This is due to the many facets of uncertainty one has to cope with and master in order to produce correct distributed software. Considering the uncertainty created by asynchrony and process crash failures in the context of message-passing systems, the book focuses on the main abstractions that one has to understand and master in order to be able to produce software with guaranteed properties. These fundamental abstractions are communication abstractions that allow the processes to communicate consistently (namely the register abstraction and the reliable broadcast abstraction), and the consensus agreement abstraction that allows them to cooperate despite failures. As they give a precise meaning to the words "communicate" and "agree" despite asynchrony and failures, these abstractions allow distributed programs to be designed with properties that can be stated and proved. Impossibility results are associated with these abstractions. Hence, in order to circumvent these impossibilities, the book relies on the failure detector approach, and, consequently, that approach to fault-tolerance is central to the book. Table of Contents: List of Figures / The Atomic Register Abstraction / Implementing an Atomic Register in a Crash-Prone Asynchronous System / The Uniform Reliable Broadcast Abstraction / Uniform Reliable Broadcast Abstraction Despite Unreliable Channels / The Consensus Abstraction / Consensus Algorithms for Asynchronous Systems Enriched with Various Failure Detectors / Constructing Failure Detectors


Scalable and Distributed Machine Learning and Deep Learning Patterns

Author: J. Joshua Thomas

Publisher: IGI Global

Published: 2023-08-25

Total Pages: 315

ISBN-13: 1668498057

Scalable and Distributed Machine Learning and Deep Learning Patterns is a practical guide that provides insights into how distributed machine learning can speed up the training and serving of machine learning models, reduce time and costs, and address bottlenecks in the system during concurrent model training and inference. The book covers various topics related to distributed machine learning such as data parallelism, model parallelism, and hybrid parallelism. Readers will learn about cutting-edge parallel techniques for serving and training models such as parameter server and all-reduce, pipeline input, intra-layer model parallelism, and a hybrid of data and model parallelism. The book is suitable for machine learning professionals, researchers, and students who want to learn about distributed machine learning techniques and apply them to their work. This book is an essential resource for advancing knowledge and skills in artificial intelligence, deep learning, and high-performance computing. The book is suitable for computer, electronics, and electrical engineering courses focusing on artificial intelligence, parallel computing, high-performance computing, machine learning, and its applications. Whether you're a professional, researcher, or student working on machine and deep learning applications, this book provides a comprehensive guide to building distributed machine learning systems, including multi-node systems, using Python. By the end of the book, readers will have the knowledge and abilities necessary to construct and implement a distributed data processing pipeline for machine learning model inference and training, all while saving time and costs.


Fault-tolerant Agreement in Synchronous Message-passing Systems

Author: Michel Raynal

Publisher: Springer Nature

Published: 2022-06-01

Total Pages: 167

ISBN-13: 3031020014

Understanding distributed computing is not an easy task. This is due to the many facets of uncertainty one has to cope with and master in order to produce correct distributed software. A previous book, Communication and Agreement Abstractions for Fault-Tolerant Asynchronous Distributed Systems (published by Morgan & Claypool, 2010), was devoted to the problems created by crash failures in asynchronous message-passing systems. The present book focuses on the way to cope with the uncertainty created by process failures (crashes, omission failures, and Byzantine behavior) in synchronous message-passing systems (i.e., systems whose progress is governed by the passage of time). To that end, the book considers fundamental problems that distributed synchronous processes have to solve. These fundamental problems concern agreement among processes (if processes are unable to agree in one way or another in the presence of failures, no non-trivial problem can be solved). They are consensus, interactive consistency, k-set agreement, and non-blocking atomic commit. Being able to solve these basic problems efficiently with provable guarantees allows application designers to give a precise meaning to the words "cooperate" and "agree" despite failures, and to write distributed synchronous programs with properties that can be stated and proved. Hence, the aim of the book is to present a comprehensive view of agreement problems, algorithms that solve them, and associated computability bounds in synchronous message-passing distributed systems. Table of Contents: List of Figures / Synchronous Model, Failure Models, and Agreement Problems / Consensus and Interactive Consistency in the Crash Failure Model / Expedite Decision in the Crash Failure Model / Simultaneous Consensus Despite Crash Failures / From Consensus to k-Set Agreement / Non-Blocking Atomic Commit in Presence of Crash Failures / k-Set Agreement Despite Omission Failures / Consensus Despite Byzantine Failures / Byzantine Consensus in Enriched Models


Mastering Distributed Algorithms

Author: Roger Wattenhofer

Publisher:

Published: 2020-03-23

Total Pages: 262

ISBN-13:

About the book: The Internet is a distributed system, but so are wireless communication, cloud or parallel computing, multi-core systems, and mobile networks. An ant colony, a brain, or even human society can also be modeled as a distributed system. In this book we highlight common themes and techniques. In particular, we study some of the fundamental issues underlying the design of distributed systems, for example communication, coordination, fault-tolerance, locality, parallelism, symmetry breaking, synchronization, and uncertainty.

About the author: Roger Wattenhofer is a professor at ETH Zurich. Before joining ETH Zurich, he was at Brown University and Microsoft Research. His research interests include fault-tolerant distributed systems, efficient network algorithms, and cryptocurrencies such as Bitcoin. He has published more than 300 scientific articles. In 2017, he published the book Blockchain Science.


Wireless Algorithms, Systems, and Applications

Author: Zhe Liu

Publisher: Springer Nature

Published: 2021-09-08

Total Pages: 635

ISBN-13: 3030859282

The three-volume set LNCS 12937-12939 constitutes the proceedings of the 16th International Conference on Wireless Algorithms, Systems, and Applications, WASA 2021, held during June 25-27, 2021, in Nanjing, China. The 103 full and 57 short papers presented in these proceedings were carefully reviewed and selected from 315 submissions. The following topics are covered in Part I of the set: network protocols, signal processing, wireless telecommunication systems, blockchain, IoT and edge computing, artificial intelligence, computer security, distributed computer systems, machine learning, and others.


Machine Learning and Wireless Communications

Author: Yonina C. Eldar

Publisher: Cambridge University Press

Published: 2022-06-30

Total Pages: 560

ISBN-13: 1108967736

How can machine learning help the design of future communication networks – and how can future networks meet the demands of emerging machine learning applications? Discover the interactions between two of the most transformative and impactful technologies of our age in this comprehensive book. First, learn how modern machine learning techniques, such as deep neural networks, can transform how we design and optimize future communication networks. Accessible introductions to concepts and tools are accompanied by numerous real-world examples, showing you how these techniques can be used to tackle longstanding problems. Next, explore the design of wireless networks as platforms for machine learning applications – an overview of modern machine learning techniques and communication protocols will help you to understand the challenges, while new methods and design approaches will be presented to handle wireless channel impairments such as noise and interference, to meet the demands of emerging machine learning applications at the wireless edge.