Distributed Training

Multi-node H100 clusters.
Bring your NCCL jobs.

Fast interconnects: NVLink within nodes, InfiniBand between them. The topology your distributed training job actually needs, not whatever neighbors a shared cloud happens to give you.

Provision a cluster · Read the docs
What Distributed Training Requires

Fast intra-node fabric

NVLink 4.0 between GPUs within a node. 900 GB/s bidirectional, essential for tensor parallelism and gradient synchronization at scale (back-of-envelope sketch after this list).

Low-latency inter-node

InfiniBand HDR (200 Gb/s) between nodes. NCCL all-reduce over IB is an order of magnitude faster than Ethernet-based alternatives.

Low-latency storage

Checkpoint throughput and dataset streaming demand high-bandwidth, low-latency storage. NVMe-backed object storage mounted close to compute.
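Why those numbers matter: under a standard ring all-reduce, each GPU transfers roughly 2(n-1)/n times the gradient size per synchronization, so sync time is governed by the slowest link. A back-of-envelope sketch, with model size and effective bandwidths as illustrative assumptions (NCCL's real algorithms are hierarchical and do better):

```python
# Ring all-reduce rule of thumb: each rank moves 2*(n-1)/n * S bytes,
# so t ~= 2*(n-1)/n * S / B for per-GPU link bandwidth B.
# All numbers below are illustrative assumptions, not measurements.

def allreduce_seconds(params, n_gpus, bw_bytes_per_s, bytes_per_grad=2):
    size = params * bytes_per_grad                      # fp16 gradients
    return 2 * (n_gpus - 1) / n_gpus * size / bw_bytes_per_s

# 7B parameters, 8 GPUs in one node; NVLink 4.0's 900 GB/s bidirectional
# is roughly 450 GB/s per direction:
print(f"intra-node: {allreduce_seconds(7e9, 8, 450e9):.3f} s")   # ~0.054 s
# Same gradients across 4 nodes; HDR InfiniBand is 25 GB/s per NIC:
print(f"inter-node: {allreduce_seconds(7e9, 32, 25e9):.3f} s")   # ~1.085 s
```

The gap is why the inter-node fabric, not GPU count, usually sets the ceiling on data-parallel scaling.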

Cluster Topology

How multi-node works on the platform.

Clusters are provisioned as single-tenant groups. All nodes in your cluster are allocated together — no partial placements on shared racks.

You get a job launch script and the hostfile. SSH between nodes works out of the box. NCCL environment variables are pre-set. Bring torchrun or deepspeed and run.
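As a smoke test on a fresh cluster, here is a minimal sketch of an NCCL job that runs under torchrun. The file name and tensor contents are illustrative, not platform-specific:

```python
# minimal_ddp.py (hypothetical name): one all-reduce across every GPU
# in the cluster, exercising NVLink within nodes and InfiniBand between them.
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes its rank id; the all-reduce sums them.
    x = torch.tensor([float(dist.get_rank())], device="cuda")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    if dist.get_rank() == 0:
        n = dist.get_world_size()
        print(f"all-reduce sum: {x.item()} (expected {n * (n - 1) / 2})")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launch it on each node with something like torchrun --nnodes=2 --nproc_per_node=8 --rdzv_backend=c10d --rdzv_endpoint=<head-node>:29500 minimal_ddp.py; in practice the provided launch script and hostfile fill in the rendezvous details.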

Cluster Specs

GPU: NVIDIA H100 SXM5 80GB
Intra-node: NVLink 4.0 · 900 GB/s
Inter-node: InfiniBand HDR · 200 Gb/s
Min cluster: 4× H100 (1 node)
Max cluster: 128× H100 (16 nodes) · contact us
Storage: NVMe-backed, low-latency
Access: SSH + torchrun hostfile
Checkpoint Strategy

Spot training with auto-checkpoint.
Resume, don't restart.

Spot clusters are preemptible. That's how you get 40–60% off. The platform handles the rest: periodic checkpoint writes to your private bucket, clean shutdown on preemption signal, automatic requeue when capacity returns.

Your training job resumes from the last checkpoint — not from step 0.

01

Checkpoint interval

Configure how often weights + optimizer state are written. Every N steps or every N minutes.

02

Preemption notice

Platform sends SIGTERM 30 seconds before reclaim. Your job hook flushes a final checkpoint (sketch after these steps).

03

Storage write

Checkpoint lands in your private object bucket. No shared NFS, no risk of corruption from other jobs.

04

Auto-requeue

The job re-queues automatically. On the next start, it reads the latest checkpoint and continues from that step.
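Put together, the four steps above reduce to a small amount of code. A minimal sketch, with the mount path, naming scheme, and interval as assumptions; in a multi-rank job, only rank 0 should write:

```python
# Lifecycle sketch: periodic checkpoints (01), SIGTERM hook for the
# 30-second preemption notice (02), bucket write (03), resume on requeue (04).
import os
import signal
import torch

CKPT_DIR = "/mnt/checkpoints"   # hypothetical mount of your private bucket
CKPT_EVERY = 500                # write every N steps, per step 01

preempted = False

def on_sigterm(signum, frame):
    # Step 02: the platform sends SIGTERM ~30 s before reclaim.
    global preempted
    preempted = True

signal.signal(signal.SIGTERM, on_sigterm)

def save_ckpt(step, model, optimizer):
    # Step 03: write-then-rename so a partial write never shadows
    # a good checkpoint.
    tmp = os.path.join(CKPT_DIR, f"step-{step:08d}.pt.tmp")
    torch.save({"step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, tmp)
    os.replace(tmp, tmp[:-4])

def latest_ckpt():
    # Step 04: on requeue, resume from the highest completed step.
    ckpts = sorted(f for f in os.listdir(CKPT_DIR) if f.endswith(".pt"))
    return os.path.join(CKPT_DIR, ckpts[-1]) if ckpts else None

def train(model, optimizer, total_steps):
    start = 0
    if (path := latest_ckpt()) is not None:
        state = torch.load(path)
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        start = state["step"]            # resume, don't restart
    for step in range(start, total_steps):
        ...  # forward / backward / optimizer.step()
        if preempted or (step + 1) % CKPT_EVERY == 0:
            save_ckpt(step + 1, model, optimizer)
            if preempted:
                return                   # clean shutdown before reclaim
```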

Supported Frameworks

PyTorch DDP

Standard multi-GPU. Works on single-node 8× H100 without modification.

PyTorch FSDP

Fully Sharded Data Parallel. Model sharding for 200B+ parameter runs without pipeline parallelism (wrap sketch after this list).

DeepSpeed

ZeRO-1/2/3 + offloading. Megatron-DeepSpeed for large pre-training.

Megatron-LM

Tensor + pipeline parallelism for trillion-parameter scale. Requires InfiniBand — we have it.

JAX / XLA

Multi-host JAX works. Initialize the distributed runtime against the provided host topology, the same pattern as a TPU pod.

Custom MPI

OpenMPI pre-installed. Bring any MPI-based distributed workload.
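The FSDP card above amounts to a one-line wrap in practice. A minimal sketch under torchrun, with a toy network standing in for the real model:

```python
# Minimal FSDP wrap (see the PyTorch FSDP card above). Toy model only;
# run each process under torchrun as usual.
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Stand-in network; in practice this is the multi-billion-parameter model.
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))
model = FSDP(model.cuda())  # parameters and gradients sharded across ranks

# Created after wrapping, so optimizer state is sharded as well.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```

DeepSpeed reaches the same end via a ZeRO stage in its JSON config; Megatron-LM layers tensor and pipeline parallelism on top.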

Cost Comparison

Spot vs on-demand for a 4×H100 training run.

Indicative numbers. Actual spot pricing varies with supply. Spot preemption rate on Trusted tier is typically below 5% per hour for well-resourced clusters.

On-Demand

$X.XX / GPU·hr

$XXX for 48hr run

Instant availability
No preemptions
Priority queue
Spot · ~55% savings

$X.XX / GPU·hr

$XXX for 48hr run

Same H100 hardware
Auto-checkpoint
Occasional preemption

Rates shown are directional placeholders. Contact us for current GPU pricing.
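To make the placeholders concrete, a directional calculation with a hypothetical rate (the page intentionally lists $X.XX; current pricing comes from sales):

```python
# Directional cost sketch for the 4x H100, 48-hour comparison above.
# on_demand_rate is a hypothetical placeholder, not a quoted price.
GPUS, HOURS = 4, 48
on_demand_rate = 4.00                    # $/GPU-hr, hypothetical
spot_rate = on_demand_rate * 0.45        # ~55% savings, per the page

on_demand = GPUS * HOURS * on_demand_rate
spot = GPUS * HOURS * spot_rate
print(f"on-demand ${on_demand:,.0f}  spot ${spot:,.0f}  "
      f"saved ${on_demand - spot:,.0f}")
# At a <5%/hr preemption rate, a 48-hour run sees roughly 2 preemptions
# in expectation; auto-checkpoint makes each one a resume, not a restart.
```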

Bring your job.
We'll bring the cluster.

Multi-node H100 clusters on-demand. NVLink + InfiniBand topology. Spot with auto-checkpoint for long runs.

Get Started · Talk to Sales