RaNNC (Rapid Neural Network Connector)

RaNNC is automatic parallelization middleware for training very large-scale neural networks. Modern networks often have billions of parameters and therefore do not fit in the memory of a single GPU. RaNNC automatically partitions such a network using model parallelism and computes it on multiple GPUs.

Existing frameworks such as Megatron-LM and Mesh-TensorFlow require users to implement the partitioning of a given network themselves. RaNNC instead partitions a PyTorch network automatically, without any modification to its definition. In addition, RaNNC places essentially no restriction on the network architecture, whereas those frameworks are applicable only to Transformer-based networks.

The code below shows basic usage of RaNNC. The only line you need to add is the one that wraps the model in RaNNCModule (marked with #####).

import torch
import torch.optim as optim
import pyrannc

model = Net()                  # Define a network
model.to(torch.device("cuda")) # Move parameters to a CUDA device
optimizer = optim.Adam(model.parameters(), lr=0.01) # Define an optimizer
model = pyrannc.RaNNCModule(model, optimizer)  ##### Wrap by RaNNCModule #####
loss = model(input)            # Run a forward pass
loss.backward()                # Run a backward pass
optimizer.step()               # Update parameters

Models used with RaNNC (Net in the example above) need neither special operators for distributed computation nor annotations for partitioning (see our examples: the model for the tutorial and enlarged versions of BERT and ResNet). RaNNC automatically partitions a model into subcomponents so that each subcomponent fits in GPU memory and high training throughput is achieved.
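To make the memory constraint concrete, here is a minimal sketch that greedily groups consecutive layers so that each group fits on one GPU. It is not RaNNC's partitioning algorithm (which is described in the paper and also accounts for throughput). The function name `partition_layers`, the per-layer memory figures, and the 16 GiB GPU size are all hypothetical values chosen for illustration.

```python
from typing import List


def partition_layers(layer_mem_gib: List[float], gpu_mem_gib: float) -> List[List[int]]:
    """Greedily group consecutive layers so each group fits in one GPU's memory.

    Hypothetical helper for illustration only; it ignores compute balance and
    communication cost, which a real partitioner has to consider.
    """
    stages: List[List[int]] = []
    current: List[int] = []
    used = 0.0
    for idx, mem in enumerate(layer_mem_gib):
        if mem > gpu_mem_gib:
            raise ValueError(f"layer {idx} needs {mem} GiB, more than one GPU holds")
        # Start a new stage when adding this layer would exceed the budget.
        if current and used + mem > gpu_mem_gib:
            stages.append(current)
            current, used = [], 0.0
        current.append(idx)
        used += mem
    if current:
        stages.append(current)
    return stages


# Assumed numbers: six layers of 6 GiB each, GPUs with 16 GiB.
layer_mem = [6.0, 6.0, 6.0, 6.0, 6.0, 6.0]
stages = partition_layers(layer_mem, gpu_mem_gib=16.0)
print(stages)  # [[0, 1], [2, 3], [4, 5]]
```

Each inner list holds the layer indices assigned to one GPU. A memory-only split like this says nothing about whether the stages take similar time to compute, which is one reason choosing a good partition by hand is difficult.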

Megatron-LM, by contrast, requires special operators such as ColumnParallelLinear and RowParallelLinear (see an example in Transformer). Implementing a model with such operators is very hard even for experts, because the user must account for the model's computational load, memory usage, and communication overhead. In addition, some existing frameworks, including Megatron-LM, are applicable only to Transformer-family networks.
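The idea behind column-parallel and row-parallel linear layers can be sketched in plain Python. This is not Megatron-LM's implementation or API. The helpers `matmul`, `split_columns`, and `split_rows` are hypothetical names, and the two "devices" are simulated by list slices.

```python
from typing import List

Matrix = List[List[int]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(m: Matrix, parts: int) -> List[Matrix]:
    """Split a matrix into `parts` equal column blocks."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m: Matrix, parts: int) -> List[Matrix]:
    """Split a matrix into `parts` equal row blocks."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


x = [[1, 2, 3, 4], [5, 6, 7, 8]]        # 2 x 4 input
w = [[1, 0], [0, 1], [2, 1], [1, 2]]    # 4 x 2 weight
reference = matmul(x, w)                # result on a single device

# Column-parallel: each device holds a column block of w and sees the whole
# input; the per-device outputs are concatenated along the column axis.
col_partials = [matmul(x, shard) for shard in split_columns(w, 2)]
col_result = [sum((p[r] for p in col_partials), []) for r in range(len(x))]

# Row-parallel: each device holds a row block of w and the matching column
# block of the input; the per-device outputs are summed element-wise.
row_partials = [matmul(xs, ws) for xs, ws in zip(split_columns(x, 2), split_rows(w, 2))]
row_result = [[sum(vals) for vals in zip(*rows)] for rows in zip(*row_partials)]

print(col_result == reference, row_result == reference)
```

Both schemes reproduce the single-device result, but they differ in what must be communicated: the column-parallel version gathers output slices, while the row-parallel version needs a sum across devices. Deciding which layers to split which way is the kind of manual design work the paragraph above refers to.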

We confirmed that RaNNC can train a BERT model with approximately 100 billion parameters without any manual modification or optimization of the network definition for model partitioning.
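A rough estimate shows why a model of this size must be partitioned. The sketch below assumes 16 bytes per parameter (fp32 weights, fp32 gradients, and two fp32 Adam moment buffers), ignores activations entirely, and assumes GPUs with 32 GiB of memory. These figures and the function names are illustrative assumptions, not measurements from RaNNC.

```python
import math


def training_memory_gib(num_params: int, bytes_per_param: int = 16) -> float:
    """Memory for weights, gradients, and Adam state, in GiB.

    Assumes 4 bytes each for the weight, the gradient, and the two Adam
    moment buffers (16 bytes per parameter); activations are not counted.
    """
    return num_params * bytes_per_param / 2**30


def min_gpus(num_params: int, gpu_mem_gib: float, bytes_per_param: int = 16) -> int:
    """Lower bound on the number of GPUs needed just to hold the training state."""
    return math.ceil(training_memory_gib(num_params, bytes_per_param) / gpu_mem_gib)


params = 100 * 10**9                      # approximately 100 billion parameters
total = training_memory_gib(params)
gpus = min_gpus(params, gpu_mem_gib=32.0)  # 32 GiB per GPU is an assumption
print(f"{total:.0f} GiB of training state, at least {gpus} GPUs")
```

Under these assumptions the training state alone is roughly 1,490 GiB, so at least 47 such GPUs are needed before any activation memory is counted.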

The initial ideas of RaNNC were published at IPDPS 2021. See reference 1 for RaNNC’s partitioning algorithm and performance comparisons with other frameworks (a preprint is also available).

Reference

1: Masahiro Tanaka, Kenjiro Taura, Toshihiro Hanawa and Kentaro Torisawa. Automatic Graph Partitioning for Very Large-scale Deep Learning. In Proceedings of the 35th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2021), pp. 1004-1013, May 2021.