Distributed Multi-Node Tasks
Overview
In this section, we will cover how to launch multi-node tasks.
Currently, only multi-node jobs are supported. Support for multi-node services is coming soon. If you need multi-node services, send us an email at support@komodo.io.
Config
To make a job config multi-node, add the following to your YAML config:
For example, a job with 16 A100s distributed across 2 nodes can be launched like this:
Environment Variables
The following environment variables are provided to help you coordinate your distributed jobs:
- NUM_NODES: the number of nodes that are part of the task
- NODE_RANK: the rank of the node executing the task
- NODE_IPS: a string of IP addresses of the nodes that are part of the task, where each line contains one IP address
- NUM_GPUS_PER_NODE: the number of GPUs available on each node
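As a minimal sketch of how a task might consume these variables, the Python snippet below parses them into a small structure. The `ClusterInfo` class and `read_cluster_info` helper are hypothetical names introduced for illustration, and treating the first entry of NODE_IPS as the rank-0 (head) node is an assumption, not something the guide states.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping


@dataclass
class ClusterInfo:
    """Hypothetical container for the multi-node environment variables."""
    num_nodes: int
    node_rank: int
    node_ips: List[str]
    gpus_per_node: int

    @property
    def world_size(self) -> int:
        # Total number of GPU workers across all nodes.
        return self.num_nodes * self.gpus_per_node

    @property
    def head_ip(self) -> str:
        # Assumption: the first IP listed belongs to the rank-0 node.
        return self.node_ips[0]


def read_cluster_info(env: Mapping[str, str] = os.environ) -> ClusterInfo:
    """Parse NUM_NODES, NODE_RANK, NODE_IPS and NUM_GPUS_PER_NODE."""
    # NODE_IPS holds one IP address per line; skip any blank lines.
    ips = [line.strip() for line in env["NODE_IPS"].splitlines() if line.strip()]
    return ClusterInfo(
        num_nodes=int(env["NUM_NODES"]),
        node_rank=int(env["NODE_RANK"]),
        node_ips=ips,
        gpus_per_node=int(env["NUM_GPUS_PER_NODE"]),
    )


# Made-up values for illustration; inside a real task, call
# read_cluster_info() with no arguments to read os.environ.
sample_env = {
    "NUM_NODES": "2",
    "NODE_RANK": "1",
    "NODE_IPS": "10.0.0.1\n10.0.0.2",
    "NUM_GPUS_PER_NODE": "8",
}
info = read_cluster_info(sample_env)
print(info.world_size, info.head_ip)
```

Passing the environment in as a mapping keeps the parsing logic testable outside a running task.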
PyTorch Example
Here is an example of a job config for a multi-node training job using torchrun:
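Separately from the job config itself, the torchrun invocation that each node would run can be sketched in Python from the environment variables described above. This is an illustration under stated assumptions: `train.py` is a hypothetical script name, port 29500 is an arbitrary choice that must be reachable between nodes, and the first address in NODE_IPS is assumed to be the rank-0 node.

```python
import shlex
from typing import List, Mapping


def build_torchrun_command(
    env: Mapping[str, str],
    script: str = "train.py",   # hypothetical training script
    master_port: int = 29500,   # assumed free and reachable on the head node
) -> List[str]:
    """Build a torchrun argument list from the multi-node variables."""
    # NODE_IPS holds one IP per line; assume the first is the rank-0 node.
    ips = [ip.strip() for ip in env["NODE_IPS"].splitlines() if ip.strip()]
    return [
        "torchrun",
        f"--nnodes={env['NUM_NODES']}",
        f"--nproc_per_node={env['NUM_GPUS_PER_NODE']}",
        f"--node_rank={env['NODE_RANK']}",
        f"--master_addr={ips[0]}",
        f"--master_port={master_port}",
        script,
    ]


# Made-up values for illustration; in a real task pass os.environ instead.
sample_env = {
    "NUM_NODES": "2",
    "NODE_RANK": "0",
    "NODE_IPS": "10.0.0.1\n10.0.0.2",
    "NUM_GPUS_PER_NODE": "8",
}
cmd = build_torchrun_command(sample_env)
print(shlex.join(cmd))
```

Every node runs the same command; only the value of NODE_RANK differs between them.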