
2D Tensor Parallelism

Author: Zhengda Bian, Yongbin Li

Prerequisite

Example Code

Related Paper

Introduction

1D tensor parallelism does not partition activations, which can also consume a large amount of memory in large-scale models. To distribute the computation and memory load evenly, an efficient 2D tensor parallelism algorithm was introduced, based on SUMMA (Scalable Universal Matrix Multiplication Algorithm).

As before, let's take a linear layer $Y = XA$ as an example. Given $P = q \times q$ processors (a necessary condition), e.g. $q = 2$, we split both the input $X$ and the weight $A$ into

$$
\left[\begin{matrix} X_{10} & X_{11} \\ X_{00} & X_{01} \end{matrix} \right] \text{~and~} \left[\begin{matrix} A_{10} & A_{11} \\ A_{00} & A_{01} \end{matrix} \right].
$$

The calculation takes $q$ steps. When $t = 1$, $X_{i0}$ is broadcast in its row, and $A_{0j}$ is broadcast in its column. So, we have

$$
\left[\begin{matrix} X_{10},A_{00} & X_{10},A_{01} \\ X_{00},A_{00} & X_{00},A_{01} \end{matrix} \right].
$$

Then we multiply $X_{i0}$ and $A_{0j}$ on each processor $(i, j)$ as

$$
\left[\begin{matrix} X_{10}A_{00} & X_{10}A_{01} \\ X_{00}A_{00} & X_{00}A_{01} \end{matrix} \right] (1).
$$

Similarly, when $t = 2$, $X_{i1}$ is broadcast in its row, $A_{1j}$ is broadcast in its column, and we multiply them as

$$
\left[\begin{matrix} X_{11}A_{10} & X_{11}A_{11} \\ X_{01}A_{10} & X_{01}A_{11} \end{matrix} \right] (2).
$$

By adding $(1)$ and $(2)$ up, we have

$$
Y = XA = \left[\begin{matrix} X_{10}A_{00}+X_{11}A_{10} & X_{10}A_{01}+X_{11}A_{11} \\ X_{00}A_{00}+X_{01}A_{10} & X_{00}A_{01}+X_{01}A_{11} \end{matrix} \right].
$$
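The steps above can be checked with a small single-process simulation. The following is a minimal sketch that uses only plain Python lists, not the Colossal-AI implementation: the helper names (`matmul`, `add`, `split_blocks`, `summa`, `assemble`) are hypothetical, the step counter `t` is 0-based here, and block row 0 is taken to be the first rows of the matrix. Each "processor" $(i, j)$ accumulates $X_{it}A_{tj}$ over the $q$ steps, and the assembled result is compared against an ordinary matrix product.

```python
def matmul(x, a):
    """Multiply two dense matrices stored as lists of rows."""
    return [[sum(x[i][k] * a[k][j] for k in range(len(a)))
             for j in range(len(a[0]))] for i in range(len(x))]


def add(x, y):
    """Element-wise sum of two equally shaped matrices."""
    return [[p + r for p, r in zip(row_x, row_y)] for row_x, row_y in zip(x, y)]


def split_blocks(m, q):
    """Split matrix m into a q x q grid of equally sized blocks."""
    rows, cols = len(m) // q, len(m[0]) // q
    return [[[row[j * cols:(j + 1) * cols] for row in m[i * rows:(i + 1) * rows]]
             for j in range(q)] for i in range(q)]


def summa(x, a, q):
    """Simulate SUMMA on a q x q grid and return the grid of output blocks."""
    xb, ab = split_blocks(x, q), split_blocks(a, q)
    yb = [[None] * q for _ in range(q)]
    for t in range(q):  # one broadcast and one local multiply per step
        for i in range(q):
            for j in range(q):
                # Processor (i, j) receives X_{it} along its row
                # and A_{tj} along its column, then accumulates the product.
                partial = matmul(xb[i][t], ab[t][j])
                yb[i][j] = partial if yb[i][j] is None else add(yb[i][j], partial)
    return yb


def assemble(blocks):
    """Stitch a grid of blocks back into one matrix."""
    return [sum((blk[r] for blk in block_row), [])
            for block_row in blocks for r in range(len(block_row[0]))]


x = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
a = [[2, 0, 1, 3], [1, 1, 0, 2], [0, 4, 1, 1], [3, 2, 2, 0]]
y_blocks = summa(x, a, q=2)
print(assemble(y_blocks) == matmul(x, a))  # True
```

In a real distributed run, the inner two loops disappear: each device executes only its own $(i, j)$ iteration, and the block lookups `xb[i][t]` and `ab[t][j]` are replaced by row and column broadcasts.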

Efficiency

Given $P = q \times q$ processors, we present the theoretical computation and memory cost, as well as the communication cost based on the ring algorithm, in both the forward and backward pass of 2D tensor parallelism.

| Computation | Memory (parameters) | Memory (activations) | Communication (bandwidth) | Communication (latency) |
| :-: | :-: | :-: | :-: | :-: |
| $O(1/q^2)$ | $O(1/q^2)$ | $O(1/q^2)$ | $O(6(q-1)/q)$ | $O(6(q-1))$ |
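To get a feel for how these terms scale, the expressions inside the $O(\cdot)$ can be evaluated for a given grid side $q$. The helper below is a hypothetical illustration, not part of Colossal-AI, and it returns only the relative factors from the table (hidden constants and problem sizes are dropped).

```python
def cost_2d(q: int) -> dict:
    """Evaluate the table's per-device cost factors for a q x q grid."""
    if q < 1:
        raise ValueError("q must be a positive integer")
    return {
        "computation": 1 / q**2,
        "memory_parameters": 1 / q**2,
        "memory_activations": 1 / q**2,
        "communication_bandwidth": 6 * (q - 1) / q,
        "communication_latency": 6 * (q - 1),
    }


for q in (2, 4, 8):
    print(q, cost_2d(q))
```

Under these expressions, computation and memory per device shrink quadratically in $q$, the bandwidth term stays below 6, and the latency term grows linearly in $q$.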

Usage

To enable 2D tensor parallelism for our model, e.g. on 4 GPUs, we need to configure the parallelism setting as below.

CONFIG = dict(parallel=dict(
    data=1,
    pipeline=1,
    tensor=dict(size=4, mode='2d'),
))

Then Colossal-AI will automatically apply 2D parallelism to all the layers from colossalai.nn.
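Because 2D tensor parallelism requires $P = q \times q$ processors, the tensor-parallel `size` in the configuration has to be a perfect square. A minimal sketch of a pre-launch check is shown below; `grid_side` is a hypothetical helper written for this tutorial, and the library may perform its own validation.

```python
import math


def grid_side(tensor_parallel_size: int) -> int:
    """Return q such that q * q == tensor_parallel_size, or raise ValueError."""
    if tensor_parallel_size < 1:
        raise ValueError("tensor parallel size must be positive")
    q = math.isqrt(tensor_parallel_size)
    if q * q != tensor_parallel_size:
        raise ValueError(
            f"2D tensor parallelism needs a square number of devices, "
            f"got {tensor_parallel_size}")
    return q


print(grid_side(4))  # 2
```

For example, sizes 4, 9 and 16 pass this check, while 8 is rejected.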

Let's define a model that consists of a two-layer multi-layer perceptron (MLP) as below.

import colossalai
import colossalai.nn as col_nn
import torch
from colossalai.utils import print_rank_0

class MLP(torch.nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        intermediate_dim = dim * 4
        self.dense_1 = col_nn.Linear(dim, intermediate_dim)
        print_rank_0(f'Weight of the first linear layer: {self.dense_1.weight.shape}')
        self.activation = torch.nn.GELU()
        self.dense_2 = col_nn.Linear(intermediate_dim, dim)
        print_rank_0(f'Weight of the second linear layer: {self.dense_2.weight.shape}')
        self.dropout = col_nn.Dropout(0.1)

    def forward(self, x):
        x = self.dense_1(x)
        print_rank_0(f'Output of the first linear layer: {x.shape}')
        x = self.activation(x)
        x = self.dense_2(x)
        print_rank_0(f'Output of the second linear layer: {x.shape}')
        x = self.dropout(x)
        return x

Launch Colossal-AI on 4 GPUs and build the model.

parser = colossalai.get_default_parser()
# parse the launch arguments (rank, world size, host, port, ...)
args = parser.parse_args()
colossalai.launch(config=CONFIG,
                  rank=args.rank,
                  world_size=args.world_size,
                  local_rank=args.local_rank,
                  host=args.host,
                  port=args.port)

m = MLP()

We will see the shapes of the partitioned parameters (e.g. weights) in the MLP model.

Weight of the first linear layer: torch.Size([128, 512])
Weight of the second linear layer: torch.Size([512, 128])

The complete weight of the first linear layer has the shape [256, 1024]. After 2D partitioning, each GPU holds a [128, 512] block of it. Similarly, the second layer partitions its [1024, 256] weight into [512, 128] blocks.
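The arithmetic behind these shapes is a division of each dimension by the grid side $q$. The following sketch makes that explicit; `local_shape_2d` is a hypothetical helper, and it assumes both dimensions are evenly divisible by $q$.

```python
def local_shape_2d(full_shape, q):
    """Shape of the block one device holds when a 2D tensor is split q x q."""
    rows, cols = full_shape
    if rows % q or cols % q:
        raise ValueError("both dimensions must be divisible by q")
    return (rows // q, cols // q)


print(local_shape_2d((256, 1024), q=2))  # (128, 512)
print(local_shape_2d((1024, 256), q=2))  # (512, 128)
```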

We can run the model with some random inputs.

from colossalai.context import ParallelMode
from colossalai.core import global_context as gpc
from colossalai.utils import get_current_device

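x = torch.randn((16, 256), device=get_current_device())
# make sure every rank starts from the same full input
torch.distributed.broadcast(x, src=0)
# partition input: split the rows by the local rank in the column group,
# then split the features by the local rank in the row group
x = torch.chunk(x, 2, dim=0)[gpc.get_local_rank(ParallelMode.PARALLEL_2D_COL)]
x = torch.chunk(x, 2, dim=-1)[gpc.get_local_rank(ParallelMode.PARALLEL_2D_ROW)]
print_rank_0(f'Input: {x.shape}')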

x = m(x)

Then we can see the shapes of activation results.

Input: torch.Size([8, 128])
Output of the first linear layer: torch.Size([8, 512])
Output of the second linear layer: torch.Size([8, 128])

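The activation tensors in 2D parallelism are split along both rows and columns. For example, the full input of shape [16, 256] becomes [8, 128] on each GPU, the output of the first linear layer has the local shape [8, 512] (full shape [16, 1024]), and the output of the second layer has the local shape [8, 128] (full shape [16, 256]).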
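To see which part of the full input each device keeps, the two `torch.chunk` calls from the input-partitioning step can be mirrored with plain index arithmetic. This is a sketch under the assumption that both dimensions divide evenly by $q$; `local_slice_2d` is a hypothetical helper, and its `col_rank` and `row_rank` arguments stand in for the local ranks in `PARALLEL_2D_COL` and `PARALLEL_2D_ROW`.

```python
def local_slice_2d(full_shape, q, col_rank, row_rank):
    """Return ((row_start, row_stop), (col_start, col_stop)) of the local block.

    Rows (dim 0) are selected by col_rank and the last dimension by row_rank,
    matching the order of the two chunk calls in the tutorial.
    """
    rows, cols = full_shape
    block_rows, block_cols = rows // q, cols // q
    return ((col_rank * block_rows, (col_rank + 1) * block_rows),
            (row_rank * block_cols, (row_rank + 1) * block_cols))


for col_rank in range(2):
    for row_rank in range(2):
        print(col_rank, row_rank, local_slice_2d((16, 256), 2, col_rank, row_rank))
```

Each of the four index pairs selects a distinct [8, 128] block, and together the blocks cover the whole [16, 256] input exactly once.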