2D Tensor Parallelism
Author: Zhengda Bian, Yongbin Li
Prerequisite
Example Code
Related Paper
Introduction
1D tensor parallelism does not partition activations, which can also consume a large amount of memory in large-scale models. To distribute the computation and memory load evenly, an efficient 2D tensor parallelism algorithm was introduced based on SUMMA (Scalable Universal Matrix Multiplication Algorithm).
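Let's still take a linear layer $Y = XA$ as an example. Given $P = q \times q$ processors (necessary condition), e.g. $q = 2$, we split both the input $X$ and weight $A$ into

$$
\begin{bmatrix} X_{00} & X_{01} \\ X_{10} & X_{11} \end{bmatrix}
\quad \text{and} \quad
\begin{bmatrix} A_{00} & A_{01} \\ A_{10} & A_{11} \end{bmatrix}.
$$

The calculation includes $q$ steps. When $t = 1$, $X_{i0}$ is broadcasted in its row, and $A_{0j}$ is broadcasted in its column. So, we have

$$
\begin{bmatrix} X_{00}, A_{00} & X_{00}, A_{01} \\ X_{10}, A_{00} & X_{10}, A_{01} \end{bmatrix}.
$$

Then we multiply $X_{i0}$ and $A_{0j}$ on each processor $(i, j)$ as

$$
\begin{bmatrix} X_{00}A_{00} & X_{00}A_{01} \\ X_{10}A_{00} & X_{10}A_{01} \end{bmatrix} \quad (1).
$$

Similarly, when $t = 2$, $X_{i1}$ is broadcasted in its row, $A_{1j}$ is broadcasted in its column, and we multiply them as

$$
\begin{bmatrix} X_{01}A_{10} & X_{01}A_{11} \\ X_{11}A_{10} & X_{11}A_{11} \end{bmatrix} \quad (2).
$$

By adding $(1)$ and $(2)$ up, we have

$$
Y = XA = \begin{bmatrix} X_{00}A_{00} + X_{01}A_{10} & X_{00}A_{01} + X_{01}A_{11} \\ X_{10}A_{00} + X_{11}A_{10} & X_{10}A_{01} + X_{11}A_{11} \end{bmatrix}.
$$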
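
The broadcast-multiply-accumulate steps above can be checked with a small single-process simulation. The sketch below is not ColossalAI code: it assumes the linear layer is written as Y = XA (input X, weight A, names chosen here for illustration), and the helpers `matmul`, `add`, `split_blocks`, `summa`, and `merge_blocks` are hypothetical. The "broadcasts" are plain list lookups standing in for real collective communication, and steps are counted from 0 rather than 1.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def split_blocks(m, q):
    """Split matrix m into a q x q grid of equally sized blocks."""
    rows, cols = len(m), len(m[0])
    assert rows % q == 0 and cols % q == 0, "dimensions must be divisible by q"
    br, bc = rows // q, cols // q
    return [[[row[j * bc:(j + 1) * bc] for row in m[i * br:(i + 1) * br]]
             for j in range(q)] for i in range(q)]


def summa(X, A, q):
    """Simulate SUMMA on a q x q grid; returns the grid of output blocks."""
    Xb, Ab = split_blocks(X, q), split_blocks(A, q)
    br, bc = len(X) // q, len(A[0]) // q
    # Yb[i][j] is the partial result held by (simulated) processor (i, j).
    Yb = [[[[0] * bc for _ in range(br)] for _ in range(q)] for _ in range(q)]
    for t in range(q):                      # q steps in total
        for i in range(q):
            for j in range(q):
                x_recv = Xb[i][t]           # block broadcast along row i
                a_recv = Ab[t][j]           # block broadcast along column j
                Yb[i][j] = add(Yb[i][j], matmul(x_recv, a_recv))
    return Yb


def merge_blocks(blocks):
    """Reassemble a grid of blocks into one matrix."""
    out = []
    for block_row in blocks:
        for r in range(len(block_row[0])):
            out.append([v for blk in block_row for v in blk[r]])
    return out


X = [[1, 2, 3, 4], [5, 6, 7, 8]]        # 2 x 4 input
A = [[1, 0], [0, 1], [2, 1], [1, 3]]    # 4 x 2 weight
Y = merge_blocks(summa(X, A, q=2))
assert Y == matmul(X, A)                # block-wise result matches the full product
print(Y)                                # [[11, 17], [27, 37]]
```

In a real distributed run each processor would hold only its own blocks of the input and weight, and the two lookups inside the loop would be row-wise and column-wise broadcasts.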
Efficiency
Given $P = q \times q$ processors, we present the theoretical computation and memory cost, as well as the communication cost based on the ring algorithm in both the forward and backward pass of 2D tensor parallelism.
Computation | Memory (parameters) | Memory (activations) | Communication (bandwidth) | Communication (latency) |
---|---|---|---|---|
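
As a rough illustration of the memory side, the sketch below computes the per-device block shapes that follow from splitting the input, weight, and output on a q x q grid, as described in the Introduction. It is derived from that partitioning only, not from the cost table; the function `shard_shapes_2d` and the example dimensions are hypothetical.

```python
def shard_shapes_2d(batch, in_features, out_features, q):
    """Per-device block shapes when input, weight and output are split on a q x q grid."""
    for dim in (batch, in_features, out_features):
        if dim % q != 0:
            raise ValueError("each dimension must be divisible by q")
    return {
        "input": (batch // q, in_features // q),
        "weight": (in_features // q, out_features // q),
        "output": (batch // q, out_features // q),
    }


# Hypothetical layer sizes, chosen only for illustration.
shapes = shard_shapes_2d(batch=8, in_features=1024, out_features=4096, q=2)
full_params = 1024 * 4096
local_params = shapes["weight"][0] * shapes["weight"][1]

print(shapes["weight"])              # (512, 2048)
print(local_params / full_params)    # 0.25, i.e. 1 / q**2
```

Under this partitioning each device stores 1/q² of the weight and of each activation, in contrast to 1D tensor parallelism, which leaves activations unpartitioned.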
Usage
Currently, the newest version of ColossalAI does not support 2D tensor parallelism, but this feature will be integrated into Shardformer in future releases.
For more details about the ideas and usage of Shardformer, please refer to the Shardformer Doc.
For users of older versions of ColossalAI, please refer to ColossalAI-Examples - 2D Tensor Parallelism.