Skip to main content

Cluster Utilities

Author: Hongxin Liu

Prerequisite:

Introduction

We provide a utility class colossalai.cluster.DistCoordinator to coordinate distributed training. It's useful to get various information about the cluster, such as the number of nodes, the number of processes per node, etc.

API Reference

class
 

colossalai.cluster.DistCoordinator

(*args, **kwargs)
Parameters
  • rank (int) -- the rank of the current process
  • world_size (int) -- the total number of processes
  • local_rank (int) -- the rank of the current process on the current node
Description

This class is used to coordinate distributed training. It is a singleton class, which means that there is only one instance of this class in the whole program.

There are some terms that are used in this class:

  • rank: the rank of the current process
  • world size: the total number of processes
  • local rank: the rank of the current process on the current node
  • master: the process with rank 0
  • node master: the process with local rank 0 on the current node
Example
from colossalai.cluster.dist_coordinator import DistCoordinator
coordinator = DistCoordinator()

if coordinator.is_master():
    do_something()

coordinator.print_on_master('hello world')
function
 

block_all

(process_group: ProcessGroup = None)
Parameters
  • process_group (ProcessGroup, optional) -- process group to block. Defaults to None, which refers to the default process group.
Description

Block all processes in the process group.

function
 

destroy

(process_group: ProcessGroup = None)
Parameters
  • process_group (ProcessGroup, optional) -- process group to destroy. Defaults to None, which refers to the default process group.
Description

Destroy the distributed process group.

function
 

is_last_process

(process_group: ProcessGroup = None)
Parameters
  • process_group (ProcessGroup, optional) -- process group to use for the last rank check. Defaults to None, which refers to the default process group.
Returns

bool: True if the current process is the last process, False otherwise

Description

Check if the current process is the last process (rank is world size - 1). It can accept a sub process group to check the last rank with respect to the process.

function
 

is_master

(process_group: ProcessGroup = None)
Parameters
  • process_group (ProcessGroup, optional) -- process group to use for the rank 0 check. Defaults to None, which refers to the default process group.
Returns

bool: True if the current process is the master process, False otherwise

Description

Check if the current process is the master process (rank is 0). It can accept a sub process group to check the rank 0 with respect to the process.

function
 

is_node_master

()
Returns

bool: True if the current process is the master process on the current node, False otherwise

Description

Check if the current process is the master process on the current node (local rank is 0).

function
 

on_master_only

(process_group: ProcessGroup = None)
Description

A function wrapper that only executes the wrapped function on the master process (rank 0).

Example
from colossalai.cluster import DistCoordinator
dist_coordinator = DistCoordinator()

@dist_coordinator.on_master_only()
def print_on_master(msg):
    print(msg)
function
 

print_on_master

(msg: str, process_group: ProcessGroup = None)
Parameters
  • msg (str) -- message to print
  • process_group (ProcessGroup, optional) -- process group to use for the rank 0 check. Defaults to None, which refers to the default process group.
Description

Print message only from rank 0.

function
 

print_on_node_master

(msg: str)
Parameters
  • msg (str) -- message to print
Description

Print message only from local rank 0. Local rank 0 refers to the 0th process running the current node.

function
 

priority_execution

(executor_rank: int = 0, process_group: ProcessGroup = None)
Parameters
  • executor_rank (int) -- the process rank to execute without blocking, all other processes will be blocked
  • process_group (ProcessGroup, optional) -- process group to use for the executor rank check. Defaults to None, which refers to the default process group.
Description

This context manager is used to allow one process to execute while blocking all other processes in the same process group. This is often useful when downloading is required as we only want to download in one process to prevent file corruption.

Example
from colossalai.cluster import DistCoordinator
dist_coordinator = DistCoordinator()
with dist_coordinator.priority_execution():
    dataset = CIFAR10(root='./data', download=True)