Distributed Optimizers
Author: Wenxuan Tan, Junwen Duan, Renjie Mao
Related Papers
- Adafactor: Adaptive Learning Rates with Sublinear Memory Cost
- CAME: Confidence-guided Adaptive Memory Efficient Optimization
- GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
- Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
Introduction
Apart from the widely adopted Adam and SGD, many modern optimizers require layer-wise statistics to update parameters effectively, so they cannot be directly applied in parallel settings where model layers are sharded across multiple devices. We provide optimized distributed implementations of these optimizers, and they integrate seamlessly with Tensor Parallel, DDP, and ZeRO via plugins.
Optimizers
Adafactor is an Adam variant that first adopted non-negative matrix factorization (NMF) to reduce memory usage. CAME improves on the NMF approximation by introducing a confidence matrix. GaLore further reduces memory usage by projecting gradients into a low-rank space and applying 8-bit block-wise quantization. Lamb enables huge batch sizes without loss of accuracy via layer-wise adaptive updates bounded by the inverse of their Lipschitz constants.
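To make the memory saving of the NMF-based optimizers concrete, here is a minimal sketch (an illustration only, not Colossal-AI code) of the factored second-moment estimate that Adafactor keeps instead of Adam's full second-moment matrix:

```python
import torch

# Squared gradients of an (n, m) weight matrix.
grad_sq = torch.rand(256, 512) ** 2

# Adafactor stores only row and column statistics: O(n + m) memory
# instead of the O(n * m) second moment kept by Adam.
# (In the real optimizer these are exponential moving averages.)
row = grad_sq.sum(dim=1)  # shape (n,)
col = grad_sq.sum(dim=0)  # shape (m,)

# Rank-1 reconstruction of the second moment used to scale the update.
v_hat = torch.outer(row, col) / row.sum()
```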
Usage
Now we demonstrate how to use Distributed Adafactor with the booster API, combining Tensor Parallel and ZeRO 2. Even if you don't use a distributed optimizer, the plugin will automatically convert your optimizer into its distributed version for convenience.
step 1. Import the libraries
```python
from transformers import LlamaModel, LlamaConfig
from colossalai.nn.optimizer.distributed_adafactor import DistributedAdaFactor
from colossalai.booster import Booster
from colossalai.booster.plugin import HybridParallelPlugin
import colossalai
import torch
```
step 2. Initialize the distributed environment
We first need to initialize the distributed environment. For demonstration purposes, we use `colossalai run --nproc_per_node 4`. For more initialization methods, please refer to Launch Colossal-AI.
```python
colossalai.launch_from_torch()
```
step 3. Initialize the model and optimizer
```python
configuration = LlamaConfig()
model = LlamaModel(configuration).cuda()
criterion = lambda x: x.mean()
dist_optim = DistributedAdaFactor(model.parameters())
```
step 4. Initialize the booster and plugin
```python
plugin = HybridParallelPlugin(tp_size=2, zero_stage=2, pp_size=1, enable_all_optimization=True)
booster = Booster(plugin=plugin)
# You should also pass in your own dataset.
model, dist_optim, criterion, dataloader, _ = booster.boost(model, dist_optim, criterion)
```
step 5. Train the model
```python
steps = 10
for step in range(steps):
    input_ids = torch.ones(1, 100, device="cuda", dtype=torch.int)
    attention_mask = input_ids.clone()
    outputs = model(input_ids.cuda(), attention_mask.cuda())
    loss = criterion(outputs.last_hidden_state)
    booster.backward(loss, dist_optim)
    dist_optim.step()
    dist_optim.zero_grad()
```
Special initialization for GaLore
For GaLore, we need to specify the projection rank for each parameter group, as well as the quantization and paged-optimizer parameters. For details on quantization, please refer to bitsandbytes.
```python
from colossalai.nn.optimizer.galore import get_galore_param_groups
from colossalai.nn.optimizer import DistGaloreAwamW

# Example hyperparameters; replace them with your own values.
lr, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-6

optim = DistGaloreAwamW(
    get_galore_param_groups(model, decay=1e-2, rank=8),
    lr=lr,
    betas=(beta1, beta2),
    eps=eps,
    nbits=8,
    percentile_clipping=100,
    block_wise=True,
    min_8bit_size=4096,
)
```
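The GaLore optimizer is then boosted just like in step 4. A short sketch, reusing the `plugin`, `booster`, `model`, and `criterion` defined above:

```python
# Boost as in step 4: the plugin wraps the optimizer into its distributed
# version and sets up the tensor/data parallel process groups.
model, optim, criterion, dataloader, _ = booster.boost(model, optim, criterion)
```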
Compatibility
| Optimizer/Plugin | Hybrid Parallel Plugin | Low Level Zero Plugin | Torch DDP Plugin | Gemini Plugin | Moe Hybrid Plugin |
| --- | --- | --- | --- | --- | --- |
| Lamb | ✔️ | ✔️ | ✔️ | ❌ | ❌ |
| GaLore | ✔️ | ✔️ | ✔️ | ❌ | ❌ |
| Adafactor | ✔️ | ✔️ | ✔️ | ❌ | ❌ |
| CAME | ✔️ | ✔️ | ✔️ | ❌ | ❌ |
API Reference
class colossalai.nn.DistributedAdaFactor

function setup_distributed

- tp_group -- The devices group for tensor parallel
- dp_group -- The devices group for data parallel
- shard_to_working_param (Dict) -- ZeRO 2 feeds the optimizer a sharded param view as grads are sharded. This maps from id(view) to working params used in forward & backward.
- padding_map -- An empty interface placeholder
- use_zero -- Whether or not to use ZeRO
function step

- closure (callable, optional) -- A closure that reevaluates the model and returns the loss.

Performs a single optimization step.
class colossalai.nn.DistributedLamb
- params (iterable) -- iterable of parameters to optimize or dicts defining parameter groups
- lr (float, optional) -- learning rate (default: 1e-3)
- betas (Tuple[float, float], optional) -- coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
- eps (float, optional) -- term added to the denominator to improve numerical stability (default: 1e-8)
- weight_decay (float, optional) -- weight decay (L2 penalty) (default: 0)
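A minimal construction sketch using the defaults listed above (the import path `colossalai.nn.optimizer` is an assumption; in practice you would pass the optimizer to `booster.boost` as in the Usage section):

```python
import torch.nn as nn
from colossalai.nn.optimizer import DistributedLamb  # assumed import path

model = nn.Linear(1024, 1024)
optim = DistributedLamb(
    model.parameters(),
    lr=1e-3,             # learning rate (default: 1e-3)
    betas=(0.9, 0.999),  # running-average coefficients (default: (0.9, 0.999))
    eps=1e-8,            # numerical-stability term (default: 1e-8)
    weight_decay=0.0,    # L2 penalty (default: 0)
)
```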
function setup_distributed

- tp_group (dist.ProcessGroup) -- Tensor Parallel process group
- dp_group (dist.ProcessGroup) -- ZeRO 2 process group
- shard_to_working_param (Dict) -- ZeRO 2 feeds the optimizer a sharded param view as grads are sharded. This maps from id(view) to working params used in forward & backward.
- padding_map -- An empty interface placeholder
- is_zero (bool) -- Whether to use ZeRO 2.
function step
- closure (callable, optional) -- A closure that reevaluates the model and returns the loss.
class colossalai.nn.DistGaloreAwamW
- params (iterable) -- iterable of parameters to optimize or dicts defining parameter groups.
- lr (float, optional) -- learning rate. (default: 1e-3)
- betas (Tuple[float, float], optional) -- coefficients used for computing running averages of gradient and its norm. (default: (0.9, 0.999))
- eps (float, optional) -- term added to the denominator to improve numerical stability. (default: 1e-6)
- weight_decay (float, optional) -- weight decay (L2 penalty) (default: 0.01)
- nbits -- Number of bits for quantization optim states. Only 32 and 8 are supported.
- min_8bit_size (int, defaults to 4096) -- The minimum number of elements of the parameter tensors for 8-bit optimization.
- percentile_clipping (int, defaults to 100) -- Adapts clipping threshold automatically by tracking the last 100 gradient norms and clipping the gradient at a certain percentile to improve stability.
- block_wise (bool, defaults to True) -- Whether to independently quantize each block of tensors to reduce outlier effects and improve stability.
- is_paged (bool, defaults to False) -- Whether the optimizer is a paged optimizer (handle memory spike via CPU-GPU transfer) or not.
- args (dict, optional) -- quantization-related arguments. If passed, will override all quantization args above.
function setup_distributed
- tp_group (dist.ProcessGroup) -- Tensor Parallel process group
- dp_group (dist.ProcessGroup) -- ZeRO 2 process group
- shard_to_working_param (Dict) -- ZeRO 2 feeds the optimizer a sharded param view as grads are sharded. This maps from id(view) to working params used in forward & backward.
- padding_map (Dict) -- Padding size of each param from ZeRO's param store. Required if ZeRO is used.
- is_zero (bool) -- Whether to use ZeRO 2.
function step
- closure (callable, optional) -- A closure that reevaluates the model and returns the loss.
function to_master_shape
class colossalai.nn.DistributedCAME
- params (iterable) -- iterable of parameters to optimize or dicts defining parameter groups
- lr (float, optional) -- external learning rate (default: None)
- eps (tuple[float, float]) -- regularization constants for square gradient and instability respectively (default: (1e-30, 1e-16))
- clip_threshold (float) -- threshold of root-mean-square of final gradient update (default: 1.0)
- betas (tuple[float, float, float]) -- coefficients used for computing running averages of update, square gradient and instability (default: (0.9, 0.999, 0.9999))
- weight_decay (float, optional) -- weight decay (L2 penalty) (default: 0)
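A minimal construction sketch using the parameters listed above (the import path `colossalai.nn.optimizer` and the learning rate are assumptions for illustration; in practice you would pass the optimizer to `booster.boost` as in the Usage section):

```python
import torch.nn as nn
from colossalai.nn.optimizer import DistributedCAME  # assumed import path

model = nn.Linear(1024, 1024)
optim = DistributedCAME(
    model.parameters(),
    lr=2e-4,                     # example external learning rate (default: None)
    eps=(1e-30, 1e-16),          # regularization constants (default: (1e-30, 1e-16))
    clip_threshold=1.0,          # RMS threshold of the final update (default: 1.0)
    betas=(0.9, 0.999, 0.9999),  # EMAs of update, squared gradient, instability
    weight_decay=0.0,            # L2 penalty (default: 0)
)
```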
function setup_distributed

- tp_group -- The devices group for tensor parallel
- dp_group -- The devices group for data parallel
- shard_to_working_param (Dict) -- ZeRO 2 feeds the optimizer a sharded param view as grads are sharded. This maps from id(view) to working params used in forward & backward.
- padding_map -- Interface placeholder
- use_zero -- Whether or not to use ZeRO

Injects distributed features into the optimizer.
function step
- closure (callable, optional) -- A closure that reevaluates the model and returns the loss.