
Configure Parallelization

Author: Shenggui Li, Siqi Mai

Prerequisite:

Introduction

Colossal-AI supports multiple forms of parallelism. In our codebase, hybrid parallelism refers to the combination of data parallelism, pipeline parallelism and tensor parallelism (1D, 2D, 2.5D, 3D).

Each type of parallelism requires a different network topology and therefore initializes different process groups. You can initialize the corresponding process groups by setting parallel in the config file. The parallel configuration must follow the format below. The data parallel size is inferred automatically from the pipeline and tensor parallel sizes you specify. colossalai.launch initializes these distributed process groups automatically based on your configuration.

Some sample configurations are shown below:

# sample format (a schema: int and Any are type placeholders, not literal values)
parallel = dict(
    pipeline=dict(size=int),
    tensor=dict(size=int, mode='1d' or '2d' or '2.5d' or '3d', kwargs=Any)
)

# this is ok
parallel = dict(
    pipeline=dict(size=2),
    tensor=dict(size=4, mode='2d')
)

# this is ok: pipeline may be given as a plain integer
parallel = dict(
    pipeline=2,
    tensor=dict(size=4, mode='2d')
)

# this is not ok
# as you need to specify the mode for tensor parallelism
parallel = dict(
    pipeline=2,
    tensor=4
)

# this is also ok: tensor will default to size 1
# and mode None
parallel = dict(
    pipeline=2
)

# this is also ok: pipeline will default to size 1
parallel = dict(
    tensor=dict(size=4, mode='2d')
)

The key size sets the parallel size of the corresponding parallelism dimension. For example, a pipeline size of 2 means there will be 2 pipeline stages. The key mode in the tensor parallel config selects which tensor parallelism method will be initialized.

You can omit 'parallel' from your configuration, in which case both pipeline and tensor default to size 1.

The total number of GPUs must equal data parallel size * tensor parallel size * pipeline parallel size.
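The defaulting rules above can be sketched in plain Python. This is a minimal illustration, not Colossal-AI's own implementation: normalize_parallel_config is a hypothetical helper introduced here, and it only mirrors the rules stated in the sample configurations (pipeline accepts an int or a dict, tensor must be a dict with a mode, and both default to size 1).

```python
from typing import Any, Dict, Optional


def normalize_parallel_config(
    parallel: Optional[Dict[str, Any]] = None,
) -> Dict[str, Dict[str, Any]]:
    """Hypothetical helper: expand a `parallel` config into its fully specified form."""
    parallel = parallel or {}

    # pipeline may be an int or dict(size=...); it defaults to size 1
    pipeline = parallel.get("pipeline", 1)
    if isinstance(pipeline, int):
        pipeline = dict(size=pipeline)

    # tensor defaults to size 1 and mode None; otherwise it must be a dict with a mode
    tensor = parallel.get("tensor", dict(size=1, mode=None))
    if not isinstance(tensor, dict) or "mode" not in tensor:
        raise ValueError("tensor must be a dict that specifies both size and mode")

    return dict(pipeline=dict(pipeline), tensor=dict(tensor))


print(normalize_parallel_config(dict(pipeline=2)))
```

Under these assumptions, a configuration such as dict(pipeline=2, tensor=4) is rejected because the tensor entry carries no mode.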
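As a worked example of this constraint, the data parallel size can be derived by dividing the GPU count by the product of the other two sizes. The function below is a hypothetical helper written for illustration only; it is not part of the Colossal-AI API.

```python
def infer_data_parallel_size(
    world_size: int, pipeline_size: int = 1, tensor_size: int = 1
) -> int:
    """Hypothetical helper: derive the data parallel size from the GPU count."""
    model_parallel_size = pipeline_size * tensor_size
    if world_size % model_parallel_size != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by "
            f"pipeline size * tensor size = {model_parallel_size}"
        )
    return world_size // model_parallel_size


# 16 GPUs with pipeline size 2 and tensor size 4 leave a data parallel size of 2
print(infer_data_parallel_size(16, pipeline_size=2, tensor_size=4))  # 2
```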

Data Parallel

Data parallelism is the most common way to distribute a training task: the data is split into several shards and each device trains on a single shard. The configuration for data parallelism is detected automatically and set for you, so you do not have to set it explicitly in your configuration. There are two ways to handle the all-reduce for data parallelism in Colossal-AI.

  1. If you specify gradient handlers, gradients will be all-reduced according to the gradient handlers
  2. Otherwise, PyTorch DistributedDataParallel will be used

In most cases, you will be using the second mode unless you have complex handling of the gradients.

1D, 2D, 2.5D and 3D Parallel

To enable hybrid parallelism, we provide an array of tensor parallelism methods. We provide the list of papers which match each tensor parallel method. These parallel modes need to work with the distributed layers provided by Colossal-AI.

# 1D parallel
parallel = dict(
    tensor=dict(size=4, mode='1d')
)

# 2D parallel
parallel = dict(
    tensor=dict(size=4, mode='2d')
)

# 2.5D parallel
parallel = dict(
    tensor=dict(size=8, mode='2.5d', depth=2)
)

# 3D parallel
parallel = dict(
    tensor=dict(size=8, mode='3d')
)

Once you specify the tensor parallel mode in your configuration, you can proceed to use its corresponding distributed operators. For example, if your mode is '2d', you can use colossalai.nn.Linear2D in your model construction.
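The sizes used in the sample configurations are consistent with the device grids these methods are usually described with: a q x q grid for 2D, a depth x q x q grid for 2.5D, and a q x q x q cube for 3D. That grid interpretation is an assumption taken from the tensor parallelism methods themselves and is not stated on this page; is_valid_tensor_size is a hypothetical helper, not a Colossal-AI function.

```python
import math
from typing import Optional


def is_valid_tensor_size(size: int, mode: str, depth: Optional[int] = None) -> bool:
    """Hypothetical check that `size` fits the grid shape assumed for `mode`."""
    if mode == "1d":
        return size >= 1
    if mode == "2d":
        # assumed q x q grid
        q = math.isqrt(size)
        return q * q == size
    if mode == "2.5d":
        # assumed depth x q x q grid
        if depth is None or depth < 1 or size % depth != 0:
            return False
        q = math.isqrt(size // depth)
        return q * q == size // depth
    if mode == "3d":
        # assumed q x q x q cube
        q = round(size ** (1 / 3))
        return q ** 3 == size
    raise ValueError(f"unknown tensor parallel mode: {mode!r}")


print(is_valid_tensor_size(8, "2.5d", depth=2))  # True
print(is_valid_tensor_size(8, "2d"))             # False
```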

Pipeline Parallel

Pipeline parallelism splits the model into several partitions by layer. For example, assume we have a simple model which consists of two linear layers. With two GPUs, we can allocate the first linear layer to the first GPU and the second layer to the second GPU.
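The idea of splitting by layer can be sketched without any framework. The helper below is hypothetical and assigns contiguous layers to stages as evenly as possible; how Colossal-AI actually partitions a model may differ.

```python
from typing import List


def partition_layers(layers: List[str], num_stages: int) -> List[List[str]]:
    """Hypothetical helper: assign contiguous layers to pipeline stages."""
    if num_stages < 1 or num_stages > len(layers):
        raise ValueError("need between 1 and len(layers) pipeline stages")

    # earlier stages take one extra layer when the split is uneven
    base, extra = divmod(len(layers), num_stages)
    stages, start = [], 0
    for stage in range(num_stages):
        end = start + base + (1 if stage < extra else 0)
        stages.append(layers[start:end])
        start = end
    return stages


# two linear layers on two GPUs: one layer per stage
print(partition_layers(["linear_1", "linear_2"], 2))  # [['linear_1'], ['linear_2']]
```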

You can set the number of pipeline stages in your configuration file. When the pipeline size is larger than 1, Colossal-AI will automatically create the pipeline schedule, which defines the forward and backward steps.

parallel = dict(
    pipeline=dict(size=4),  # number of pipeline stages
)

Sequence Parallel

Sequence parallelism supports long-sequence modelling, such as document-level text understanding and medical imaging. This method was proposed in Sequence Parallelism: Making 4D Parallelism Possible. Set the mode to 'sequence' to initialize its process group.

parallel = dict(
    tensor=dict(size=4, mode='sequence')
)