ZeroBubble Pipeline Parallelism
Authors: Junwen Duan, Hongxin Liu
Related Paper
Introduction
ZeroBubble (V Schedule): Crucially, ZeroBubble splits the backward pass B into two stages, the activation-gradient computation (B) and the weight-gradient computation (W). A schedule such as 1F1B1W can then further reduce the pipeline bubble compared to the 1F1B schedule used in earlier work.
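The B/W split can be illustrated with a minimal, purely illustrative scalar "layer" (a sketch, not ColossalAI code): B propagates the activation gradient to the previous stage right away, while W computes the weight gradient and can be deferred to fill pipeline bubbles.

```python
# Illustrative sketch (not ColossalAI code): for y = w * x, the backward pass
# splits into B (activation gradient, needed by the previous stage immediately)
# and W (weight gradient, which can be scheduled later, as in 1F1B1W).

def forward(x, w):
    return x * w

def backward_b(w, dy):
    # B step: dx = dy * dy_dx = dy * w, sent upstream immediately
    return dy * w

def backward_w(x, dy):
    # W step: dw = dy * dy_dw = dy * x, can be deferred to fill bubbles
    return dy * x

x, w, dy = 2.0, 3.0, 1.0
dx = backward_b(w, dy)  # 3.0
dw = backward_w(x, dy)  # 2.0
```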
Hands-On Practice
We now demonstrate how to use the ZeroBubble schedule through the booster API with 4 GPUs.
step 1. Import libraries
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.testing import assert_close
from transformers.models.llama.configuration_llama import LlamaConfig
from transformers.models.llama.modeling_llama import LlamaModel
import colossalai
from colossalai.booster.booster import Booster
from colossalai.booster.plugin import HybridParallelPlugin
from colossalai.pipeline.schedule.v_schedule import PipelineGraph
from colossalai.pipeline.schedule.zero_bubble_pp import ZeroBubbleVPipeScheduler
step 2. Initialize the Distributed Environment and Parallelism Groups
colossalai.launch(rank=rank, world_size=world_size, host="localhost", port=port, backend="nccl")
step 3. Initialize Module, Optimizer, and Pipeline Schedule
Build the model and optimizer. We create a Llama model with 8 decoder layers, then initialize the PipelineGraph and pipeline schedule with the get_v_schedule() function.
# Global Param
NUM_BATCH = 8
NUM_TOK_PER_BATCH = 4
NUM_LAYERS = 8
HIDDEN_SIZE_PER_HEAD = 4
NUM_HEADS = 4
# Init Llama from huggingface
configuration = LlamaConfig(
hidden_size=HIDDEN_SIZE_PER_HEAD * NUM_HEADS,
intermediate_size=HIDDEN_SIZE_PER_HEAD * NUM_HEADS * 2,
num_hidden_layers=NUM_LAYERS,
num_attention_heads=NUM_HEADS,
num_key_value_heads=NUM_HEADS,
attn_implementation="flash_attention_2",
)
model = LlamaModel(configuration).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1)
step 4. Initialize the PipelineGraph and Pipeline Schedule
Next, create the PipelineGraph and pipeline schedule using the get_v_schedule() function. The PipelineGraph is initialized with the following parameters: x_cost is the runtime consumed by operation x of each model chunk, and x_mem is the amount of memory consumed by operation x of each model chunk. These values are estimated and filled in before the pipeline starts; better results can be obtained by measuring the actual runtime and memory cost of the model. In the example below, we assume that the forward (F), activation-gradient backward (B), and weight-gradient backward (W) computations each cost 1, and that the p2p communication also costs 1.
# Init schedule
pp_size, num_microbatches = 4, 4
h, a, s = configuration.hidden_size, configuration.num_attention_heads, 1024
mem_f = 34 * h + 5 * a * s
mem_w = -32 * h
mem_b = -mem_w - mem_f
graph = PipelineGraph(
n_stage=pp_size,
n_micro=num_microbatches,
f_cost=1,
b_cost=1,
w_cost=1,
c_cost=1,
f_mem=mem_f,
b_mem=mem_b,
w_mem=mem_w,
)
zbv_schedule = graph.get_v_schedule()
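As a sanity check on the memory coefficients above (a standalone sketch with example values, not ColossalAI code): mem_b is defined as -(mem_w + mem_f), so one complete F + B + W cycle frees exactly the memory it allocated.

```python
# Standalone check of the memory bookkeeping used for the schedule: mem_b is
# defined as -(mem_w + mem_f), so a full F + B + W cycle nets to zero memory.
h, a, s = 16, 4, 1024        # example hidden size, attention heads, sequence length
mem_f = 34 * h + 5 * a * s   # memory held after a forward pass
mem_w = -32 * h              # memory released once the weight gradient is done
mem_b = -mem_w - mem_f       # memory released by the activation-gradient step
assert mem_f + mem_b + mem_w == 0
```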
step 5. Initialize the Booster
Pass pp_style="zbv" when initializing the plugin to use the ZeroBubble pipeline.
plugin = HybridParallelPlugin(
pp_size=4,
num_microbatches=4,
tp_size=1,
sp_size=1,
zero_stage=1,
initial_scale=1,
find_unused_parameters=True,
pp_style="zbv",
scheduler_nodes=zbv_schedule,
num_model_chunks=2,
)
dp_size = plugin.dp_size
booster = Booster(plugin=plugin)
# Boost the model and optimizer before running the pipeline
model, optimizer, _, _, _ = booster.boost(model, optimizer)
step 6. Train Your Model
steps = 10
for step in range(steps):
input_embeddings = torch.rand(
NUM_BATCH, NUM_TOK_PER_BATCH, HIDDEN_SIZE_PER_HEAD * NUM_HEADS, requires_grad=True
).cuda()
dist.all_reduce(
input_embeddings, group=plugin.pp_group
)
data_iter = iter([{"inputs_embeds": input_embeddings}])
output = booster.execute_pipeline(
data_iter,
model,
lambda x, y: x.last_hidden_state.mean(),
optimizer,
return_loss=True,
return_outputs=True,
)
optimizer.step()
optimizer.zero_grad()
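For intuition on the configuration above (a standalone arithmetic sketch; the actual layer assignment is decided internally by ColossalAI): with pp_size=4 and num_model_chunks=2, each pipeline rank holds two model chunks, so the 8 decoder layers divide into one layer per chunk.

```python
# Standalone sketch: how many layers each model chunk holds under the V
# schedule in the example above (ColossalAI decides the actual assignment).
NUM_LAYERS, pp_size, num_model_chunks = 8, 4, 2
layers_per_chunk = NUM_LAYERS // (pp_size * num_model_chunks)
print(layers_per_chunk)  # 1
```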
Advanced Practice
In ColossalAI, you can achieve better training performance by combining ZeroBubble with the metadata cache and hybrid parallelism.
1. Use MetaCache with ZeroBubble
Pass enable_metadata_cache=True when initializing the plugin to use the metadata cache with the ZeroBubble pipeline.
plugin = HybridParallelPlugin(
pp_size=2,
num_microbatches=4,
tp_size=2,
sp_size=2,
zero_stage=1,
initial_scale=1,
enable_metadata_cache=True,
find_unused_parameters=True,
pp_style="zbv",
scheduler_nodes=zbv_schedule,
num_model_chunks=2,
)
2. HybridParallel with ZeroBubble
Pass pp_size, tp_size, and sp_size when initializing the plugin to combine hybrid parallelism with the ZeroBubble pipeline.
plugin = HybridParallelPlugin(
pp_size=2,
num_microbatches=2,
tp_size=2,
sp_size=2,
zero_stage=1,
initial_scale=1,
find_unused_parameters=True,
pp_style="zbv",
scheduler_nodes=zbv_schedule,
num_model_chunks=2,
)
Performance Benchmark
| HybridParallel Strategy | Pipeline Parallel | Sequence Parallel + Pipeline Parallel | Data Parallel + Pipeline Parallel |
|---|---|---|---|
| With 1F1B | 15.27 samples/sec | 17.22 samples/sec | 14.06 samples/sec |
| With Zero Bubble | 17.36 samples/sec | 18.38 samples/sec | 14.44 samples/sec |
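The relative gains implied by the benchmark above can be computed directly (a quick arithmetic sketch, not part of the benchmark itself):

```python
# Relative throughput gain of ZeroBubble over 1F1B, from the benchmark table.
pairs = {
    "Pipeline Parallel": (15.27, 17.36),
    "Sequence Parallel + Pipeline Parallel": (17.22, 18.38),
    "Data Parallel + Pipeline Parallel": (14.06, 14.44),
}
gains = {name: (zb - base) / base * 100 for name, (base, zb) in pairs.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain:.1f}%")
```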
3. Fine-tuning Scheduler Parameters
Model compatibility
| Shardformer/Model | Bert | Blip2 | Bloom | Chatglm2 | Command | Deepseek | Falcon | GPT2 | Gptj | Llama | Mistral | Opt | Qwen2 | Sam | T5 | Vit | Whisper |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ZeroBubble | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | |
API Reference
class
colossalai.pipeline.ZeroBubbleVPipeScheduler
- stage_manager (PipelineStageManager) -- If using pipeline parallelism, it's necessary to specify a pipeline stage manager for inter-process communication in pipeline parallelism. Defaults to None, which means not using pipeline parallelism.
- schedule (List[ScheduledNode]) -- Schedule for ZeroBubbleVPipe.
- num_model_chunks (int) -- The number of model chunks on a device.
- num_microbatch (Optional[int]) -- The number of microbatches.
- microbatch_size (Optional[int]) -- The size of each microbatch.
- enable_metadata_cache (bool) -- Whether to enable the metadata cache to accelerate communication.
- overlap_p2p (bool) -- Whether to overlap p2p communication with computation.
ZeroBubbleVPipeScheduler
function
backward_b_step
- model_chunk (ModuleList or Module) -- Model Chunk to be run;
- model_chunk_id (int) -- The current model chunk idx;
- optimizer (OptimizerWrapper) -- Optimizer to update the model
- input_obj (Optional[Tuple[dict]]) -- The stage input x, as (microbatch, input_obj).
- output_obj (Union[dict, torch.Tensor]) -- The stage output y.
- output_obj_grad (dict) -- The gradient of the output, dy.
Optional[dict]: The gradient of the input, dx.
function
backward_w_step
- model_chunk (ModuleList or Module) -- Model Chunk to be run;
- model_chunk_id (int) -- The current model chunk idx;
- optimizer (OptimizerWrapper) -- Optimizer to update the model
- output_obj (Union[dict, torch.Tensor]) -- The stage output y.
- output_obj_grad (dict) -- The gradient of the output, dy.
: Nothing is returned; this step only computes dw and then updates w.
function
forward_backward_step
- model_chunk (ModuleList or Module) -- Model Chunk to be trained. Original interleaved uses a module list whereas shardformer uses entire model + layer specification
- data_iter (Iterable) -- Data iterator.
- criterion (Callable[[Any, Any], Tensor]) -- Criterion to be used. It should take two arguments: model outputs and inputs, and returns loss tensor.
- optimizer (OptimizerWrapper, optional) -- Optimizer to be used. Can be None when only forward is executed. Defaults to None.
- return_loss (bool, optional) -- Whether to return loss. Defaults to False.
- return_outputs (bool, optional) -- Whether to return model outputs. Defaults to False.
dict: A dict with keys: 'loss' and 'outputs'.
function
forward_step
- model_chunk (ModuleList or Module) -- Model Chunk to be run;
- model_chunk_id (int) -- The current model chunk idx;
- input_obj (Optional[dict]) -- x;
- criterion (Callable) -- loss function;
- accum_loss (Optional[torch.Tensor], optional) -- Accumulated loss. Defaults to None.
- outputs (Optional[List[Any]], optional) -- List to store the output of the last stage (final output). Defaults to None.
Union[torch.Tensor, dict]: The intermediate output (dict) of the current stage. If it is the last stage, the output is the loss (Tensor).
function
get_model_chunk_id
- microbatch_id (int) -- the current microbatch idx
- forward (bool) -- Whether it is the forward pass.
int: The model chunk idx of the input microbatch_id
function
load_batch
- data_iter (Iterable) -- Data iterator.
- device (Optional[torch.device], optional) -- Target device. Defaults to None.
function
load_micro_batch
- microbatch_id (int) -- the current microbatch idx.
Any: Micro batch.
function
recv_backward
- model_chunk_id (int) -- The current model chunk idx.
- next_rank (int, optional) -- The rank of the source of the tensor.
Any: The input gradient tensor or gradient tensor list, plus the wait handles for the communication.
function
recv_forward
- model_chunk_id (int) -- The current model chunk idx.
- prev_rank (int, optional) -- The rank of the source of the tensor.
Any: The input tensor or input tensor list, plus the wait handles for the communication.
function
run_forward_backward
Runs the ZeroBubble schedule, with communication between pipeline stages.
function
schedule_b
- scheduled_node (ScheduledNode) -- The scheduled node to execute.
- model_chunk (ModuleList or Module) -- Model Chunk to be run;
- model_chunk_id (int) -- The current model chunk idx;
: Nothing.
function
schedule_f
- scheduled_node (ScheduledNode) -- The scheduled node to execute.
- model_chunk (ModuleList or Module) -- Model Chunk to be run;
- model_chunk_id (int) -- The current model chunk idx;
- criterion (Callable) -- loss function;
- accum_loss (Optional[torch.Tensor], optional) -- Accumulated loss. Defaults to None.
- outputs (Optional[List[Any]], optional) -- List to store the output of the last stage (final output). Defaults to None.
: Nothing.
function
schedule_w
- scheduled_node (ScheduledNode) -- The scheduled node to execute.
- model_chunk (ModuleList or Module) -- Model Chunk to be run;
- model_chunk_id (int) -- The current model chunk idx;
: Nothing.
function
send_backward
- model_chunk_id (int) -- The current model chunk idx.
- prev_rank (int, optional) -- The rank of the recipient of the tensor.
Any: The wait handles for the communication.
function
send_forward
- model_chunk_id (int) -- The current model chunk idx.
- next_rank (int, optional) -- The rank of the recipient of the tensor.
Any: The wait handles for the communication.