FAQs

Does RaNNC work with Apex AMP?

Yes. Convert your model with amp.initialize() and pass the resulting model to RaNNCModule using use_amp_master_params=True.
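A minimal sketch of this setup, assuming Apex is installed and that `MyModel`, the optimizer choice, and the `opt_level` are illustrative placeholders (not prescribed by RaNNC):

```python
import torch
from apex import amp
import pyrannc

model = MyModel()  # your torch.nn.Module (placeholder name)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Convert with Apex AMP first, then wrap the converted model with RaNNC.
model, optimizer = amp.initialize(model, optimizer, opt_level="O2")
model = pyrannc.RaNNCModule(model, optimizer, use_amp_master_params=True)
```

The order matters: `amp.initialize()` runs first so that RaNNC wraps the AMP-converted model.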

How can I save/load a RaNNC module?

Use state_dict() of the RaNNC module. The returned state_dict can be saved and loaded just as with a plain PyTorch module. Make sure state_dict() is called from all ranks; otherwise the call blocks, because RaNNC gathers parameters across all ranks.
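A sketch of saving from a wrapped module, assuming `rannc_model` is an already-created RaNNCModule and that ranks are identified via torch.distributed (the file name is illustrative):

```python
import torch
import torch.distributed as dist

# state_dict() must be called on EVERY rank, since RaNNC gathers
# parameters across all ranks inside this call.
sd = rannc_model.state_dict()

# Only one rank needs to write the gathered result to disk.
if dist.get_rank() == 0:
    torch.save(sd, "model.pt")
```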

RaNNCModule also modifies the optimizer's state_dict(). To collect states from all ranks, pass from_global=True. Note that the returned state_dict can be loaded only after RaNNC has partitioned the model. load_state_dict() also requires the keyword argument from_global=True. You can find typical usages in examples.
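A sketch of the optimizer round trip, assuming `optimizer` has been passed to RaNNCModule and ranks are identified via torch.distributed (file names are illustrative):

```python
import torch
import torch.distributed as dist

# Gather optimizer states from all ranks; call this on every rank.
opt_sd = optimizer.state_dict(from_global=True)
if dist.get_rank() == 0:
    torch.save(opt_sd, "optimizer.pt")

# Later: load only after RaNNC has partitioned the model
# (i.e. after at least one forward/backward pass), again on every rank.
optimizer.load_state_dict(torch.load("optimizer.pt"), from_global=True)
```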

Can I use gradient accumulation?

Yes.

By default, RaNNC implicitly performs an allreduce (sum) of gradients across all ranks after each backward pass. To prevent this allreduce, use pyrannc.delay_grad_allreduce(True).

After a specified number of forward/backward steps, you can explicitly perform the allreduce by calling allreduce_grads on your RaNNCModule.
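Putting the two calls together, a gradient-accumulation loop might look like the sketch below; `rannc_model`, `criterion`, `loader`, and the accumulation count of 4 are illustrative assumptions:

```python
import pyrannc

# Disable the implicit allreduce that normally follows every backward pass.
pyrannc.delay_grad_allreduce(True)

ACC_STEPS = 4  # accumulate gradients over this many micro-steps (illustrative)
for step, (x, y) in enumerate(loader):
    loss = criterion(rannc_model(x), y)
    loss.backward()  # gradients accumulate locally; no allreduce here
    if (step + 1) % ACC_STEPS == 0:
        rannc_model.allreduce_grads()  # explicit allreduce (sum) across ranks
        optimizer.step()
        optimizer.zero_grad()
```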

My model takes too long before partitioning is determined

By setting save_deployment=true, RaNNC writes the deployment state to the file given by deployment_file after partitioning is determined. You can load the deployment file by setting load_deployment=true. This greatly saves time when you rerun a program using RaNNC with similar settings, e.g. with only the learning rate changed. (See also Configurations)

When you are unsure whether the partitioning process is still running or has failed, you can change the log level of the partitioning modules. Changing the log levels of MLPartitioner and DPStaging shows the progress of the partitioning process. (See also Logging)

How can I check partitioning results?

You can save a partitioning result using options SAVE_DEPLOYMENT=true and DEPLOYMENT_FILE=[PATH]. When these options are enabled, RaNNC outputs the partitioning result to the specified path. (See also Configurations)
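Assuming these options are passed as environment variables at launch (see Configurations for the exact mechanism; the path and process count below are illustrative), a run might look like:

```shell
# Hypothetical invocation; option names follow the Configurations section.
SAVE_DEPLOYMENT=true DEPLOYMENT_FILE=/tmp/deployment.bin \
    mpirun -np 4 python train.py
```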

Then you can read the output file and display the partitioning result using pyrannc.show_deployment(PATH, batch_size). The second argument is the expected batch size; the output shows the partitioned subgraphs and the micro-batch sizes used for pipeline parallelism.

A one-liner using this API is:

python -c 'import pyrannc; pyrannc.show_deployment("/.../PATH_TO_DEPLOYMENT_FILE", 64)'