Coding for Distributed Computing (in Machine Learning and Data Analytics)

Modern distributed computing frameworks play a critical role in applications such as large-scale machine learning and big data analytics, which require processing large volumes of data at high throughput. To achieve high performance, a distributed computing job whose input is a large dataset is commonly run on a large-scale distributed infrastructure, e.g., in the cloud. By dividing a job into multiple tasks, distributed computing frameworks such as MPI, Horovod, MapReduce, and Spark can process data at the scale of tens of terabytes or more on a cluster of nodes with limited CPU power and memory. However, it is well known that nodes in a distributed infrastructure are subject to various faulty behaviors. For example, a node may experience temporary performance degradation due to resource contention or load imbalance. Even worse, a node may fail to complete a task due to failures inside the distributed infrastructure. Therefore, when the computation is distributed onto multiple nodes, its progress may be significantly delayed by stragglers, i.e., tasks running on such faulty nodes.

Although the adverse effects of stragglers in the distributed infrastructure can be mitigated by running replicated tasks on multiple nodes, such replicated tasks consume a significant amount of resources. Instead of adding replicated tasks, my research focuses on coded computing, which runs additional tasks that process parity data encoded from the dataset. Compared to simply running duplicated tasks on more nodes, coded computing tolerates potential stragglers with far fewer resources. Jointly considering the various resources in the distributed infrastructure, including computation, bandwidth, and storage, my research pushes forward the performance of coded computing and achieves flexible tradeoffs between different resources.
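As a minimal illustration of this idea (a generic sketch of coded computing, not a scheme from my publications; the splitting and task names are illustrative), the snippet below computes y = A x with two original tasks and one parity task. Any two of the three task results recover y, so one straggler is tolerated with 50% extra computation instead of the 100% extra required by full replication.

```python
import numpy as np

# Toy sketch: split A into row blocks A1, A2; a third "parity" task
# computes (A1 + A2) @ x.  Any two of the three results recover y = A @ x.
A = np.random.randn(4, 3)
x = np.random.randn(3)
A1, A2 = np.split(A, 2, axis=0)

tasks = {
    "t1": A1 @ x,            # original task 1
    "t2": A2 @ x,            # original task 2
    "t3": (A1 + A2) @ x,     # coded (parity) task
}

# Suppose task "t2" straggles: recover its result from the other two tasks.
y2 = tasks["t3"] - tasks["t1"]
assert np.allclose(np.concatenate([tasks["t1"], y2]), A @ x)
```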

Communication-aware coding for distributed matrix multiplication

Matrix multiplication is a fundamental building block of various machine learning algorithms, including linear regression, logistic regression, and deep neural networks. When the input matrices come from a large dataset, the multiplication can be split into multiple tasks that multiply submatrices on different nodes. In order to tolerate stragglers, coded distributed matrix multiplication has been proposed, in which each task multiplies matrices encoded from the original input matrices, and the result of the multiplication can be decoded from a subset of such tasks. So far, various coding schemes have been proposed that split the input matrices differently and lead to different recovery thresholds, i.e., the number of completed tasks required to recover the result of the job.
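The sketch below gives a concrete instance of this approach, assuming a basic polynomial-code construction (an illustrative example; the function names and evaluation points are my own): A is split into m row blocks and B into n column blocks, each worker multiplies one coded block of A by one coded block of B, and the results from any m*n workers suffice to decode C = A @ B, i.e., the recovery threshold is m*n.

```python
import numpy as np

def encode(A, B, m, n, xs):
    """Encode A and B into one coded block pair per worker (one evaluation point each)."""
    A_blocks = np.split(A, m, axis=0)          # row blocks A_0..A_{m-1}
    B_blocks = np.split(B, n, axis=1)          # column blocks B_0..B_{n-1}
    coded = []
    for x in xs:
        A_x = sum(A_blocks[i] * x**i for i in range(m))
        B_x = sum(B_blocks[j] * x**(j * m) for j in range(n))
        coded.append((A_x, B_x))
    return coded

def decode(results, xs, m, n):
    """Interpolate the degree-(m*n - 1) product polynomial from any m*n worker results."""
    k = m * n
    V = np.vander(np.array(xs[:k]), k, increasing=True)   # Vandermonde system
    stacked = np.stack(results[:k])
    coeffs = np.linalg.solve(V, stacked.reshape(k, -1))   # per-entry interpolation
    blocks = coeffs.reshape(k, *stacked.shape[1:])
    # The coefficient of x^(i + j*m) is the block A_i @ B_j.
    return np.block([[blocks[i + j * m] for j in range(n)] for i in range(m)])

if __name__ == "__main__":
    m, n, workers = 2, 2, 6                    # 6 workers, threshold 4: tolerate 2 stragglers
    A, B = np.random.randn(4, 3), np.random.randn(3, 4)
    xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    coded = encode(A, B, m, n, xs)
    done = [1, 2, 4, 5]                        # suppose workers 0 and 3 straggle
    results = [coded[i][0] @ coded[i][1] for i in done]
    C_hat = decode(results, [xs[i] for i in done], m, n)
    assert np.allclose(C_hat, A @ B)
```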

As large-scale matrix multiplication typically has to run on a large-scale distributed infrastructure, it is common to run its tasks on different nodes hosted in the cloud. However, as resources in the cloud are shared by multiple tenants, the performance of tasks can change frequently over time. Although splitting the input matrices into smaller partitions lowers the computational complexity of each task, the corresponding recovery threshold becomes larger, leading to higher communication overhead, as the results of such tasks need to be uploaded to a single node for decoding. Therefore, it becomes challenging to choose the coding scheme and its parameters when resources are dynamic.
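To make this tradeoff concrete, the back-of-the-envelope calculation below assumes the parameterization of entangled polynomial codes, where the inner dimension is additionally split into p parts and the recovery threshold is p*m*n + p - 1; the matrix sizes and function names are illustrative, and the exact numbers depend on the scheme actually used.

```python
def per_task_flops(a, s, b, m, n, p):
    """Multiply-adds per task: one (a/m x s/p) by (s/p x b/n) product."""
    return (a // m) * (s // p) * (b // n)

def decoder_traffic(a, b, m, n, p):
    """Entries gathered at the decoding node: recovery threshold * result size."""
    threshold = p * m * n + p - 1
    return threshold * (a // m) * (b // n)

a = s = b = 12_000
for p in (1, 2, 4):
    print(p, per_task_flops(a, s, b, m=2, n=2, p=p), decoder_traffic(a, b, m=2, n=2, p=p))
# Larger p: each task does proportionally less computation, but the recovery
# threshold grows, so more data must be uploaded to the decoding node.
```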

In my research, I explore a flexible tradeoff between computational overhead and communication overhead. First, I propose dual entangled polynomial codes, which achieve a lower recovery threshold by increasing the computational complexity of each task. While conventionally there is only one matrix multiplication in each task, dual entangled polynomial codes allow each task to execute two matrix multiplications, reducing the recovery threshold by 25% and thus saving communication overhead as well. To achieve a more general tradeoff between computation and communication, I further propose a coding framework that supports changing the coding scheme of a task dynamically. As the performance of resources in the cloud changes frequently, it is challenging to choose the optimal coding scheme and its parameters in advance. However, existing coding schemes all require their parameters to be chosen in advance and cannot change them over time. Conventionally, if the coding scheme needs to be changed, all the tasks must be re-encoded on a centralized node, consuming a significant amount of computation and communication resources. In my ongoing work, I propose a coding framework that allows the coding scheme of a task to be changed locally on its node without obtaining additional data, and that further supports flexibly changing the parameters of a coding scheme. Therefore, when the performance of resources changes, the coding scheme of a task can be changed with marginal overhead, so that the job can be completed in significantly less time.

Furthermore, my recent work develops a coding scheme that can calculate multiple matrix multiplications concurrently within a single job. For heterogeneous nodes, I have also proposed a coding framework that utilizes partially completed tasks, so that the workload can be dynamically allocated across different nodes.

Related publications:

Parallelism-aware coding for distributed data analytics

Another important application supported by distributed computing frameworks is big data analytics. The tasks running in a distributed data analytics job typically take their input data from a distributed storage system, which stores partitions of the input file on multiple nodes. For example, it is common for a job running in Spark or Hadoop to take its input from the Hadoop Distributed File System (HDFS). In order to take advantage of data locality, tasks in the data analytics framework are typically co-located on the nodes that store the corresponding input partitions in the distributed storage system.

As the nodes hosting the distributed storage system are also subject to faulty behaviors, erasure coding is supported by most existing distributed storage systems. By adding parity partitions on additional nodes, the data loss caused by faulty nodes can be tolerated with low storage overhead. However, such parity partitions typically cannot be taken as input by the tasks of distributed data analytics. Therefore, the data parallelism, i.e., the number of partitions that can be read by different tasks simultaneously, is limited by the erasure codes currently deployed in distributed storage systems.

In my research, I propose a coding framework that converts representative erasure codes for distributed storage systems, such as Reed-Solomon codes and locally repairable codes, into linearly equivalent codes that offer a much higher level of data parallelism. After such a conversion, the new code encodes data into partitions that contain both original data and parity data. Hence, the original data can be sequentially embedded into all partitions, instead of just some of them, and therefore data can be read and processed in parallel by all tasks, achieving higher overall throughput. Moreover, since the new code is linearly equivalent to the original code, it maintains the desirable properties that tolerate faulty nodes in the distributed storage system. Therefore, by adding more parity data in the partitions, an even higher level of data parallelism can be achieved. This framework has been further extended to support nodes with heterogeneous hardware/software configurations, where the amount of original data in each task/node can be arbitrarily determined, leading to a flexible tradeoff between computational overhead and storage overhead.
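The toy sketch below illustrates the general idea on a single-parity (3, 2) code with two stripes (a simplified stand-in for the actual conversion of Reed-Solomon or locally repairable codes; the layouts and block names are illustrative): rotating the information positions across stripes embeds original blocks into every partition, raising data parallelism from two nodes to three while still tolerating the loss of any single node.

```python
import numpy as np

# Four file blocks d0..d3 stored in two stripes of a single-parity (3, 2) code.
rng = np.random.default_rng(0)
d0, d1, d2, d3 = (rng.integers(0, 256, 4, dtype=np.uint8) for _ in range(4))

# Systematic layout: node2 holds only parity, so only two nodes can feed
# original data to co-located analytics tasks.
systematic = {
    "node0": [d0, d2],
    "node1": [d1, d3],
    "node2": [d0 ^ d1, d2 ^ d3],
}

# Rearranged layout: stripe 2 places its original blocks on node2 and node0,
# so every node now stores some original data (parallelism 3 instead of 2).
rearranged = {
    "node0": [d0, d3],
    "node1": [d1, d2 ^ d3],
    "node2": [d0 ^ d1, d2],
}

# Fault tolerance is preserved: if node0 is lost in the rearranged layout,
# both of its blocks can be rebuilt from the surviving nodes.
assert np.array_equal(rearranged["node1"][0] ^ rearranged["node2"][0], d0)  # d1 ^ (d0^d1)
assert np.array_equal(rearranged["node1"][1] ^ rearranged["node2"][1], d3)  # (d2^d3) ^ d2
```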

Related publications: