Cluster Utilities
Author: Hongxin Liu
Prerequisite:
Introduction
We provide a utility class, colossalai.cluster.DistCoordinator, to coordinate distributed training. It's useful for getting information about the cluster, such as the number of nodes and the number of processes per node.
API Reference
class
colossalai.cluster.DistCoordinator
- rank (int) -- the rank of the current process
- world_size (int) -- the total number of processes
- local_rank (int) -- the rank of the current process on the current node
This class is used to coordinate distributed training. It is a singleton class, which means that there is only one instance of this class in the whole program.
There are some terms that are used in this class:
- rank: the rank of the current process
- world size: the total number of processes
- local rank: the rank of the current process on the current node
- master: the process with rank 0
- node master: the process with local rank 0 on the current node
from colossalai.cluster.dist_coordinator import DistCoordinator
coordinator = DistCoordinator()
if coordinator.is_master():
    do_something()
coordinator.print_on_master('hello world')
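The terms listed above can be made concrete with a small, library-free sketch. It assumes a homogeneous launch in which every node runs the same number of processes and global ranks are assigned node by node, which is a common layout but not something this page guarantees. ProcessInfo and describe_cluster are hypothetical names used only for illustration; they are not part of colossalai.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ProcessInfo:
    """Hypothetical record of one process's position in the cluster."""

    rank: int        # global rank across all nodes
    local_rank: int  # rank among the processes on the same node
    world_size: int  # total number of processes

    @property
    def is_master(self) -> bool:
        # master: the process with rank 0
        return self.rank == 0

    @property
    def is_node_master(self) -> bool:
        # node master: the process with local rank 0 on its node
        return self.local_rank == 0

    @property
    def is_last_process(self) -> bool:
        # last process: rank equals world size - 1
        return self.rank == self.world_size - 1


def describe_cluster(num_nodes: int, procs_per_node: int) -> List[ProcessInfo]:
    """Enumerate every process, assuming ranks are assigned node by node."""
    world_size = num_nodes * procs_per_node
    return [
        ProcessInfo(
            rank=node * procs_per_node + local,
            local_rank=local,
            world_size=world_size,
        )
        for node in range(num_nodes)
        for local in range(procs_per_node)
    ]


if __name__ == "__main__":
    for info in describe_cluster(num_nodes=2, procs_per_node=4):
        print(info, info.is_master, info.is_node_master, info.is_last_process)
```

Under these assumptions there is exactly one master in the whole job, but one node master per node.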
function
block_all
- process_group (ProcessGroup, optional) -- process group to block. Defaults to None, which refers to the default process group.
Block all processes in the process group.
function
destroy
- process_group (ProcessGroup, optional) -- process group to destroy. Defaults to None, which refers to the default process group.
Destroy the distributed process group.
function
is_last_process
- process_group (ProcessGroup, optional) -- process group to use for the last rank check. Defaults to None, which refers to the default process group.
bool: True if the current process is the last process, False otherwise
Check if the current process is the last process (rank is world size - 1). It can accept a sub process group, in which case the last rank is checked with respect to that process group.
function
is_master
- process_group (ProcessGroup, optional) -- process group to use for the rank 0 check. Defaults to None, which refers to the default process group.
bool: True if the current process is the master process, False otherwise
Check if the current process is the master process (rank is 0). It can accept a sub process group, in which case rank 0 is checked with respect to that process group.
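To illustrate what checking a rank "with respect to a process group" means for is_master and is_last_process, here is a minimal sketch that models a process group as a plain list of global ranks. It assumes that a process's rank inside a group is its position in that list, and that a process outside the group has group rank -1. The helpers group_rank, is_group_master and is_group_last are hypothetical and do not call colossalai or torch.distributed.

```python
from typing import Optional, Sequence


def group_rank(global_rank: int, group: Optional[Sequence[int]], world_size: int) -> int:
    """Rank of `global_rank` inside `group`, or -1 if it is not a member.

    `group=None` stands for the default group, which contains every process.
    """
    members = list(range(world_size)) if group is None else list(group)
    return members.index(global_rank) if global_rank in members else -1


def is_group_master(global_rank: int, group: Optional[Sequence[int]], world_size: int) -> bool:
    """True if the process has rank 0 within the group."""
    return group_rank(global_rank, group, world_size) == 0


def is_group_last(global_rank: int, group: Optional[Sequence[int]], world_size: int) -> bool:
    """True if the process has the highest rank within the group."""
    group_size = world_size if group is None else len(group)
    return group_rank(global_rank, group, world_size) == group_size - 1


if __name__ == "__main__":
    # Global rank 4 is not the master of the default group of 8 processes ...
    print(is_group_master(4, None, world_size=8))
    # ... but it is the master of the sub group made of ranks 4 to 7.
    print(is_group_master(4, [4, 5, 6, 7], world_size=8))
```

The same process can therefore be the master of one group and an ordinary member of another, which is why these checks take an optional process group.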
function
is_node_master
bool: True if the current process is the master process on the current node, False otherwise
Check if the current process is the master process on the current node (local rank is 0).
function
on_master_only
A function wrapper that only executes the wrapped function on the master process (rank 0).
from colossalai.cluster import DistCoordinator
dist_coordinator = DistCoordinator()
@dist_coordinator.on_master_only()
def print_on_master(msg):
    print(msg)
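The wrapper pattern described above can be sketched without any distributed setup. The decorator factory below takes the current rank as an explicit argument instead of querying a process group. That is an assumption made to keep the example self-contained, and run_only_on_rank_zero is a hypothetical name, not the library's implementation.

```python
import functools
from typing import Any, Callable


def run_only_on_rank_zero(current_rank: int) -> Callable:
    """Build a decorator that runs the wrapped function only when current_rank is 0."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            if current_rank == 0:
                return func(*args, **kwargs)
            # On every other rank the call is skipped and yields None.
            return None

        return wrapper

    return decorator


def double(x: int) -> int:
    return x * 2


# Simulate the same function as seen from rank 0 and from rank 3.
double_on_rank_0 = run_only_on_rank_zero(0)(double)
double_on_rank_3 = run_only_on_rank_zero(3)(double)

if __name__ == "__main__":
    print(double_on_rank_0(21))
    print(double_on_rank_3(21))
```

One consequence of this design is that callers on non-master ranks get None back, so code that depends on the return value has to handle that case.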
function
print_on_master
- msg (str) -- message to print
- process_group (ProcessGroup, optional) -- process group to use for the rank 0 check. Defaults to None, which refers to the default process group.
Print message only from rank 0.
function
print_on_node_master
- msg (str) -- message to print
Print message only from local rank 0. Local rank 0 refers to the 0th process running on the current node.
function
priority_execution
- executor_rank (int) -- the rank of the process allowed to execute without blocking; all other processes are blocked
- process_group (ProcessGroup, optional) -- process group to use for the executor rank check. Defaults to None, which refers to the default process group.
This context manager is used to allow one process to execute while blocking all other processes in the same process group. This is often useful when downloading is required as we only want to download in one process to prevent file corruption.
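from torchvision.datasets import CIFAR10
from colossalai.cluster import DistCoordinator
dist_coordinator = DistCoordinator()
# The executor process runs this block first; the others wait, then run it and find the files already downloaded.
with dist_coordinator.priority_execution():
    dataset = CIFAR10(root='./data', download=True)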
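The blocking pattern behind this context manager can be illustrated with a small simulation. The sketch below uses threads in place of processes and threading.Barrier in place of a distributed barrier. Both substitutions are assumptions made so the example runs on a single machine, and priority_execution_sketch and simulate are hypothetical names rather than DistCoordinator's actual code.

```python
import threading
from contextlib import contextmanager
from typing import Iterator, List


@contextmanager
def priority_execution_sketch(
    rank: int, barrier: threading.Barrier, executor_rank: int = 0
) -> Iterator[None]:
    """Let `executor_rank` run the body first; every other rank waits for it."""
    should_block = rank != executor_rank
    if should_block:
        # Non-executors wait here until the executor reaches the barrier.
        barrier.wait()
    try:
        yield
    finally:
        if not should_block:
            # The executor reaches the barrier after its body, releasing the rest.
            barrier.wait()


def simulate(world_size: int, executor_rank: int = 0) -> List[int]:
    """Run one thread per simulated rank and record the order of execution."""
    barrier = threading.Barrier(world_size)
    order: List[int] = []
    lock = threading.Lock()

    def worker(rank: int) -> None:
        with priority_execution_sketch(rank, barrier, executor_rank):
            with lock:
                order.append(rank)

    threads = [threading.Thread(target=worker, args=(rank,)) for rank in range(world_size)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return order


if __name__ == "__main__":
    # The executor rank always appears first; the order of the others may vary.
    print(simulate(world_size=4, executor_rank=0))
```

In this sketch the barrier is only released once the executor has finished its body, so the remaining ranks always run after it. This is the property that makes the pattern suitable for one-time work such as downloading a dataset.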