Author: Chuanrui Wang, Shenggui Li, Siqi Mai
As mentioned in the tutorials listed in the prerequisites, you need to initialize the distributed environment
for Colossal-AI after your config file is prepared.
We call this process launch.
In this tutorial, you will learn how to launch Colossal-AI on your server, be it a small one or a big one.
In Colossal-AI, we provide several launch methods to initialize the distributed backend.
In most cases, you can use colossalai.launch and
colossalai.get_default_parser to pass the
parameters via the command line.
If you happen to use launchers such as SLURM, OpenMPI or the PyTorch launch utility,
we also provide several launching helper methods that read the rank and world size directly from the environment variables
set by these launchers.
In this tutorial, we will cover the following ways to launch Colossal-AI and initialize the distributed backend:
- Launch with colossalai.launch
- Launch with Colossal-AI CLI
- Launch with SLURM
- Launch with OpenMPI
Launch Distributed Environment
In order to launch Colossal-AI, we need two types of arguments:
- config file
- distributed settings
The config file is always required regardless of the launch method but distributed settings can vary. The config file can be a path to the configuration file or a Python dictionary. The distributed settings can be passed via command line or multi-process launchers.
Command Line Parser
Before we jump to
launch, we first need to understand which parameters are required for initialization.
As stated in the
Basic Concepts in Distributed Training section of Distributed Training,
the important parameters are:
- host
- port
- rank
- world_size
- backend
In Colossal-AI, we provide a command line parser with these arguments already added. You can get this parser by calling
colossalai.get_default_parser(). This parser is usually used with
colossalai.launch.
# add these lines in your train.py
import colossalai
# get default parser
parser = colossalai.get_default_parser()
# if you want to add your own arguments, call parser.add_argument(...) here
# parse arguments
args = parser.parse_args()
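To see what the parsed arguments look like without a cluster at hand, the snippet below is a minimal stand-in built on the standard argparse module. It is not the Colossal-AI parser: build_parser is a hypothetical helper that only registers the launch arguments used on the command line in this tutorial, the argument types and the nccl default are assumptions, and --batch_size is an invented example of a user-defined argument.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for the default parser; not the Colossal-AI implementation."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--host", type=str, help="address of the master node")
    parser.add_argument("--port", type=int, help="port on the master node")
    parser.add_argument("--rank", type=int, help="global rank of this process")
    parser.add_argument("--world_size", type=int, help="total number of processes")
    # assumed default; check the parser of your installed version
    parser.add_argument("--backend", type=str, default="nccl", help="communication backend")
    return parser


parser = build_parser()
# a user-defined argument (invented for illustration)
parser.add_argument("--batch_size", type=int, default=32)

# parse an explicit list instead of sys.argv so the sketch runs anywhere
args = parser.parse_args(
    ["--host", "127.0.0.1", "--port", "29500", "--rank", "0", "--world_size", "1"]
)
print(args.host, args.port, args.rank, args.world_size, args.backend, args.batch_size)
```

Passing an explicit argument list to parse_args is only a convenience for experimenting; in a real train.py you would call parse_args() with no arguments so that the values come from the terminal.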
Then in your terminal, you can pass in these arguments:
python train.py --host <host> --rank <rank> --world_size <world_size> --port <port> --backend <backend>
backend is optional and the default value is nccl.
To initialize the distributed environment, we provide a general
colossalai.launch API. The
colossalai.launch function takes in the parameters
listed above and creates a default process group in the communication network. This function is often used with the default
parser for convenience.
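import colossalai
# parse arguments
args = colossalai.get_default_parser().parse_args()
# launch distributed environment
colossalai.launch(config=args.config,
                  rank=args.rank,
                  world_size=args.world_size,
                  host=args.host,
                  port=args.port,
                  backend=args.backend)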
Launch with Colossal-AI CLI
To enable easy launching on both single and multiple nodes, we have implemented a launcher for Colossal-AI. This launcher is a wrapper of the torch distributed launch utility, enhanced with the capability of launching multi-node jobs easily.
First, we need to set the launch method in our code. As this is a wrapper of the torch distributed launch utility, we will
use colossalai.launch_from_torch. The arguments required for the distributed environment, such as rank, world size, host and port, are all set by the PyTorch
launcher and can be read from the environment variables directly.
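As a rough illustration of what reading these settings from the environment involves, the sketch below collects the variables that the PyTorch launcher exports (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT). TorchLaunchEnv and read_torch_env are hypothetical names introduced here; this is not the body of colossalai.launch_from_torch.

```python
import os
from typing import Mapping, NamedTuple


class TorchLaunchEnv(NamedTuple):
    rank: int
    local_rank: int
    world_size: int
    host: str
    port: int


def read_torch_env(env: Mapping[str, str] = os.environ) -> TorchLaunchEnv:
    """Collect the settings exported by the PyTorch launcher (hypothetical helper)."""
    return TorchLaunchEnv(
        rank=int(env["RANK"]),
        local_rank=int(env["LOCAL_RANK"]),
        world_size=int(env["WORLD_SIZE"]),
        host=env["MASTER_ADDR"],
        port=int(env["MASTER_PORT"]),
    )


# values a launcher could export for the second of four processes on one node
example_env = {
    "RANK": "1",
    "LOCAL_RANK": "1",
    "WORLD_SIZE": "4",
    "MASTER_ADDR": "localhost",
    "MASTER_PORT": "29500",
}
settings = read_torch_env(example_env)
print(settings)
```

A plain dictionary is passed in so the sketch runs outside a launcher; inside a launched process, calling read_torch_env() with no argument reads the real environment.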
Next, we can easily start multiple processes with
colossalai run in the terminal. Below is an example that runs the code
on a single node with 4 GPUs. You can change the number of GPUs with
nproc_per_node and the default port with
master_port.
# run on the local node with 4 GPUs (default port: 29500)
colossalai run --nproc_per_node 4 train.py
# run on the local node with 4 GPUs with a different port
colossalai run --nproc_per_node 4 --master_port 29505 test.py
If you are in a cluster and want to launch multi-node training, the CLI can help you start processes on different nodes with one simple command. There are two ways you can launch multi-node jobs.
- Run with --host
This is suitable when you only have a few nodes. Let's say we have two nodes, namely
host1 and host2. We can start
multi-node training with the following command. Compared to single-node training, you must specify the
master_addr
option, which is automatically set to localhost when running on a single node only.
master_addr cannot be localhost when running on multiple nodes; it should be the hostname or IP address of a node.
# run on these two nodes
colossalai run --nproc_per_node 4 --host host1,host2 --master_addr host1 test.py
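For the two-node command above, the resulting process layout can be worked out by hand. The sketch below assumes the contiguous rank assignment conventionally used by the PyTorch launcher (global rank = node index × processes per node + local rank) and assumes that nodes are numbered in the order given to --host; global_rank is a hypothetical helper, and the actual ordering on your cluster may differ.

```python
def global_rank(node_rank: int, local_rank: int, nproc_per_node: int) -> int:
    """Global rank under the assumed contiguous assignment."""
    return node_rank * nproc_per_node + local_rank


hosts = "host1,host2".split(",")  # value passed to --host
nproc_per_node = 4                # value passed to --nproc_per_node

# total number of processes across all nodes
world_size = len(hosts) * nproc_per_node

# global ranks hosted by each node
layout = {
    host: [global_rank(node_rank, local_rank, nproc_per_node)
           for local_rank in range(nproc_per_node)]
    for node_rank, host in enumerate(hosts)
}
print(world_size, layout)
```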
- Run with --hostfile
This method is suitable when you have a lot of nodes. The host file is a simple text file listing the available nodes.
The list of nodes is commonly provided by cluster managers such as SLURM and PBS Pro. For example, you can get the list
of nodes allocated to you via the environment variable
SLURM_NODELIST in SLURM and
PBS_NODEFILE in PBS Pro.
Run echo $SLURM_NODELIST or
cat $PBS_NODEFILE to check it out. If you do not have such a cluster manager, you can
manually create a host file for your own use.
The host file given to the Colossal-AI launcher must be in the following format, where each line is the host name of a node:
host1
host2
With the host file ready, we can launch multi-node jobs with the following commands. Just like using
--host, you also
need to specify the
master_addr option. Some extra options are provided for
--hostfile as listed below:
- --include: specify the hosts to include for multi-node jobs. For example, if your host file has 8 nodes but you only want to run on 6 of them, you can add
--include host1,host2,host3,...,host6 so that the job will only be launched on those 6 nodes.
- --exclude: specify the hosts to exclude for multi-node jobs. This is useful when some nodes are faulty. For example, if the GPU on host1 has some problems and you wish to run on all nodes except host1, you can add
--exclude host1 so that the job will only be launched on the remaining nodes.
# run with a hostfile
colossalai run --nproc_per_node 4 --hostfile ./hostfile --master_addr host1 test.py
# only include certain hosts to execute commands
# this is used to manually select nodes to run
colossalai run --nproc_per_node 4 --hostfile ./hostfile --master_addr host1 --include host1 test.py
# exclude certain hosts to execute commands
# this can be used when certain nodes are faulty
colossalai run --nproc_per_node 4 --hostfile ./hostfile --master_addr host1 --exclude host2 test.py
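The effect of --include and --exclude on a host file can be pictured with a small filter. The sketch below is illustrative only: parse_hostfile and select_hosts are hypothetical helpers, not the launcher's code, and treating the two options as mutually exclusive is an assumption made here to keep the example simple.

```python
from typing import List, Optional


def parse_hostfile(text: str) -> List[str]:
    """Return one host name per non-empty line of a host file."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def select_hosts(hosts: List[str],
                 include: Optional[str] = None,
                 exclude: Optional[str] = None) -> List[str]:
    """Filter hosts by a comma-separated include or exclude list."""
    if include and exclude:
        # assumption of this sketch: only one of the two options is given
        raise ValueError("use either include or exclude, not both")
    if include:
        wanted = set(include.split(","))
        unknown = wanted - set(hosts)
        if unknown:
            raise ValueError(f"hosts not in host file: {sorted(unknown)}")
        return [h for h in hosts if h in wanted]
    if exclude:
        unwanted = set(exclude.split(","))
        return [h for h in hosts if h not in unwanted]
    return list(hosts)


hosts = parse_hostfile("host1\nhost2\n\nhost3\n")
only_host1 = select_hosts(hosts, include="host1")
without_host2 = select_hosts(hosts, exclude="host2")
print(hosts, only_host1, without_host2)
```

Keeping the original host order in the result means the relative order of the remaining nodes does not change when a faulty node is dropped.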
Launch with SLURM
If you are on a system managed by the SLURM scheduler, you can also rely on the
srun launcher to kickstart your Colossal-AI scripts.
We provide the helper function
launch_from_slurm for compatibility with the SLURM scheduler.
launch_from_slurm will automatically read the rank and world size from the environment variables
and use them to start the distributed backend.
Do this in your training script:
import colossalai
colossalai.launch_from_slurm(config=<CONFIG>, host=args.host, port=args.port)
You can initialize the distributed environment by running this command in the terminal.
srun python train.py --host <master_node> --port 29500
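To make the role of the environment variables concrete, the sketch below reads the rank and world size the way a SLURM-aware helper might. SLURM_PROCID and SLURM_NPROCS are variables that srun exports for each task; whether launch_from_slurm reads exactly these two is an assumption, and read_slurm_env is a hypothetical name.

```python
import os
from typing import Mapping, Tuple


def read_slurm_env(env: Mapping[str, str] = os.environ) -> Tuple[int, int]:
    """Return (rank, world_size) from SLURM task variables (hypothetical helper)."""
    rank = int(env["SLURM_PROCID"])        # index of this task within the job step
    world_size = int(env["SLURM_NPROCS"])  # total number of tasks in the job step
    return rank, world_size


# values srun could export for the third of eight tasks
slurm_rank, slurm_world_size = read_slurm_env({"SLURM_PROCID": "2", "SLURM_NPROCS": "8"})
print(slurm_rank, slurm_world_size)
```

The host and port are not part of this sketch because, as the srun command above shows, they are passed on the command line rather than taken from the environment.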
Launch with OpenMPI
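If you are more familiar with OpenMPI, you can use
launch_from_openmpi instead.
launch_from_openmpi will automatically read the local rank, global rank and world size from the environment variables
OMPI_COMM_WORLD_LOCAL_RANK, OMPI_COMM_WORLD_RANK and OMPI_COMM_WORLD_SIZE respectively and
use them to start the distributed backend.
Do this in your train.py:
colossalai.launch_from_openmpi(config=<CONFIG>, host=args.host, port=args.port)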
A sample command to launch multiple processes with OpenMPI would be:
mpirun --hostfile <my_hostfile> -np <num_process> python train.py --host <node name or ip> --port 29500
- --hostfile: use this option to specify a list of hosts on which to run
- -np: set the total number of processes (GPUs) to launch. For example, with -np 4, 4 Python processes will be started to run train.py.
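The variables that Open MPI exports to each process started by mpirun can be inspected in the same spirit. The sketch below reads OMPI_COMM_WORLD_LOCAL_RANK, OMPI_COMM_WORLD_RANK and OMPI_COMM_WORLD_SIZE; read_openmpi_env is a hypothetical helper for illustration, not the implementation of launch_from_openmpi.

```python
import os
from typing import Dict, Mapping


def read_openmpi_env(env: Mapping[str, str] = os.environ) -> Dict[str, int]:
    """Return the ranks and world size exported by Open MPI (hypothetical helper)."""
    return {
        "local_rank": int(env["OMPI_COMM_WORLD_LOCAL_RANK"]),  # rank within this node
        "rank": int(env["OMPI_COMM_WORLD_RANK"]),              # rank across all nodes
        "world_size": int(env["OMPI_COMM_WORLD_SIZE"]),        # total process count
    }


# values mpirun could export for global rank 5 of 8, the second process on its node
ompi_settings = read_openmpi_env({
    "OMPI_COMM_WORLD_LOCAL_RANK": "1",
    "OMPI_COMM_WORLD_RANK": "5",
    "OMPI_COMM_WORLD_SIZE": "8",
})
print(ompi_settings)
```

The local rank is typically what selects the GPU on a node, while the global rank and world size identify the process within the whole job.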