Fast interconnects: NVLink within nodes, InfiniBand between nodes. The topology your distributed training job actually needs, not best-effort placement next to noisy shared-cloud neighbors.
NVLink 4.0 between GPUs within a node: 900 GB/s of bidirectional bandwidth per GPU, essential for tensor parallelism and gradient synchronization at scale.
InfiniBand HDR (200 Gb/s per link) between nodes. NCCL all-reduce over InfiniBand is typically an order of magnitude faster than commodity Ethernet alternatives.
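A back-of-envelope for that gap: a ring all-reduce moves roughly 2(n-1)/n times the buffer size per GPU, so wall-clock time is dominated by link bandwidth. A hedged sketch, ignoring latency terms; the helper function and the 14 GB gradient buffer are illustrative, not platform measurements:

```python
def allreduce_time_s(size_bytes: float, n_gpus: int, link_gbytes_s: float) -> float:
    """Bandwidth-only estimate for a ring all-reduce: each rank sends and
    receives 2*(n-1)/n * size bytes; latency terms are ignored."""
    traffic = 2 * (n_gpus - 1) / n_gpus * size_bytes
    return traffic / (link_gbytes_s * 1e9)

# 7B-parameter model with fp16 gradients = 14 GB, reduced across 16 ranks
grads = 14e9
ib = allreduce_time_s(grads, 16, 25.0)    # HDR InfiniBand: 200 Gb/s ≈ 25 GB/s
eth = allreduce_time_s(grads, 16, 1.25)   # 10 GbE ≈ 1.25 GB/s
print(f"IB: {ib:.2f}s  Ethernet: {eth:.2f}s  ratio: {eth / ib:.0f}x")
```

The ratio tracks the raw bandwidth gap (here 20x), which is where the order-of-magnitude claim comes from.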
Checkpoint throughput and dataset streaming demand high-bandwidth, low-latency storage. NVMe-backed object storage mounted close to compute.
Clusters are provisioned as single-tenant groups. All nodes in your cluster are allocated together — no partial placements on shared racks.
You get a job launch script and a hostfile. SSH between nodes works out of the box, and NCCL environment variables are pre-set. Bring torchrun or deepspeed and run.
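With the hostfile in hand, the per-node launch reduces to a torchrun invocation. A minimal sketch that builds the command, assuming the first host serves as the rendezvous endpoint; the node names, port, and train.py entrypoint are hypothetical:

```python
import shlex

def torchrun_cmd(hosts, nproc_per_node=8, script="train.py", master_port=29500):
    """Build the per-node torchrun command for a multi-node job.
    Convention assumed here: hosts[0] is the rendezvous master."""
    return [
        "torchrun",
        f"--nnodes={len(hosts)}",
        f"--nproc_per_node={nproc_per_node}",
        "--rdzv_backend=c10d",
        f"--rdzv_endpoint={hosts[0]}:{master_port}",
        script,
    ]

hosts = ["node-0", "node-1", "node-2", "node-3"]  # read from the provided hostfile
print(shlex.join(torchrun_cmd(hosts)))
```

The same command runs on every node; c10d rendezvous handles rank assignment.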
Spot clusters are preemptible. That's how you get 40–60% off. The platform handles the rest: periodic checkpoint writes to your private bucket, clean shutdown on preemption signal, automatic requeue when capacity returns.
Your training job resumes from the last checkpoint — not from step 0.
Configure how often weights + optimizer state are written. Every N steps or every N minutes.
Platform sends SIGTERM 30 seconds before reclaim. Your job hook flushes final checkpoint.
Checkpoint lands in your private object bucket. No shared NFS, no risk of corruption from other jobs.
Job re-queues automatically. On next start, reads latest checkpoint and continues from that step.
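The preemption flow above can be sketched end to end. A minimal stand-in, assuming a local directory in place of the private object bucket and a JSON step counter in place of real weights and optimizer state; class and file names are hypothetical:

```python
import json, os, signal, tempfile

class CheckpointedLoop:
    """Sketch of the spot lifecycle: save every N steps, flush on SIGTERM,
    resume from the latest checkpoint on restart."""

    def __init__(self, ckpt_dir, save_every=100):
        self.ckpt_dir = ckpt_dir
        self.save_every = save_every
        self.preempted = False
        signal.signal(signal.SIGTERM, self._on_sigterm)
        self.step = self._load_latest()          # resume point, 0 on first run

    def _on_sigterm(self, signum, frame):
        # Platform sends SIGTERM ~30s before reclaim: stop and flush.
        self.preempted = True

    def _path(self, step):
        return os.path.join(self.ckpt_dir, f"ckpt-{step:08d}.json")

    def _load_latest(self):
        ckpts = sorted(f for f in os.listdir(self.ckpt_dir) if f.startswith("ckpt-"))
        if not ckpts:
            return 0
        with open(os.path.join(self.ckpt_dir, ckpts[-1])) as f:
            return json.load(f)["step"]

    def _save(self):
        # Real jobs write weights + optimizer state; a step counter stands in.
        with open(self._path(self.step), "w") as f:
            json.dump({"step": self.step}, f)

    def run(self, total_steps):
        while self.step < total_steps and not self.preempted:
            self.step += 1                       # one training step
            if self.step % self.save_every == 0:
                self._save()
        self._save()                             # final flush (done or preempted)
        return self.step

workdir = tempfile.mkdtemp()
job = CheckpointedLoop(workdir, save_every=100)
job.run(250)                                     # saves at 100, 200, final flush at 250
restarted = CheckpointedLoop(workdir, save_every=100)
print(restarted.step)                            # 250: resumes here, not step 0
```

The requeue itself is platform-side; the job only needs the handler and the latest-checkpoint lookup.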
Standard multi-GPU data parallelism (PyTorch DDP). Works on single-node 8× H100 without modification.
Fully Sharded Data Parallel. Shards parameters, gradients, and optimizer state for 200B+ models without pipeline parallelism.
ZeRO-1/2/3 + offloading. Megatron-DeepSpeed for large pre-training.
Tensor + pipeline parallelism for trillion-parameter scale. Requires InfiniBand — we have it.
Multi-host JAX works. Coordinate via the provided TPU-style host topology.
OpenMPI pre-installed. Bring any MPI-based distributed workload.
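Why sharding is mandatory at 200B+: mixed-precision Adam training holds roughly 16 bytes per parameter (fp16 params and grads, fp32 master weights plus two Adam moments), so unsharded state for a 200B model is about 3.2 TB, far beyond one 80 GB H100. A back-of-envelope sketch; the GPU counts are illustrative:

```python
def per_gpu_state_gb(params: float, n_gpus: int, bytes_per_param: float = 16.0) -> float:
    """Model + optimizer state per GPU under full (ZeRO-3/FSDP) sharding.
    16 bytes/param assumes fp16 params and grads with fp32 Adam state;
    activations and temporary buffers are extra on top of this."""
    return params * bytes_per_param / n_gpus / 1e9

print(per_gpu_state_gb(200e9, 1))    # unsharded: 3200 GB, fits no single GPU
print(per_gpu_state_gb(200e9, 256))  # sharded over 256 GPUs: 12.5 GB each
```

Fully sharded, the per-GPU footprint drops by the world size, leaving headroom for activations on an 80 GB card.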
Indicative numbers; actual spot pricing varies with supply. Spot preemption on the Trusted tier typically stays below 5% per hour for well-provisioned clusters.
$X.XX / GPU·hr
$XXX for 48hr run
$X.XX / GPU·hr
$XXX for 48hr run
Rates shown are directional placeholders. Contact us for current GPU pricing.
Multi-node H100 clusters on-demand. NVLink + InfiniBand topology. Spot with auto-checkpoint for long runs.