Spot instances are preemptible — they can be interrupted when capacity is needed elsewhere. For training jobs with checkpointing, that's a minor inconvenience, not a loss. You resume from your last checkpoint. The rest of the run costs 40–60% less.
Use on-demand when you need guaranteed uptime for a time-critical run. Use spot for everything else.
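To make the tradeoff concrete, here is a rough cost comparison. The dollar rates, discount, and preemption overhead below are illustrative assumptions, not Aircloud quotes; the function name is hypothetical:

```python
def spot_run_cost(on_demand_rate, spot_discount, compute_hours,
                  preemptions, overhead_hours_each):
    """Estimate the cost of a checkpointed spot run.

    Only hours actually billed while an instance is up count;
    queue wait time after a preemption costs nothing.
    """
    spot_rate = on_demand_rate * (1 - spot_discount)
    # Each preemption re-bills a little work: the drain window plus
    # any progress since the last checkpoint.
    billed_hours = compute_hours + preemptions * overhead_hours_each
    return spot_rate * billed_hours

# A 100-hour job at a hypothetical $2.00/hr on-demand rate, 50% spot
# discount, 3 preemptions costing ~0.5 re-billed hours each:
spot = spot_run_cost(2.00, 0.50, 100, 3, 0.5)  # 1.00/hr * 101.5 hr = 101.50
on_demand = 2.00 * 100                          # 200.00
```

Even with several preemptions, the overhead hours are small next to the discount, which is why checkpointed training is the canonical spot workload.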
Spot preemptions are handled gracefully. Your job checkpoints on notice, requeues automatically, and resumes from the last saved state. You keep the spot discount; the only cost is queue wait time.
Aircloud sends a SIGTERM to your container with a configurable drain window — default 30 seconds — before the instance is reclaimed.
Your training script saves model weights, optimizer state, and step count to persistent storage. Aircloud provides a native checkpoint SDK, or you can use your own.
The job is automatically returned to the queue. When capacity is available, it restarts from the last checkpoint — no manual intervention.
Your container loads the checkpoint and continues training from where it stopped. Total overhead: the drain window plus queue wait time.
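The lifecycle above can be sketched framework-agnostically. This is a minimal illustration, assuming a plain pickle checkpoint on a persistent path; a real job would save framework state (for example via `torch.save`) or use the Aircloud checkpoint SDK, whose API is not shown here. The class and path names are hypothetical:

```python
import os
import pickle
import signal
import tempfile

# In production this would live on persistent storage, not tmp.
CKPT_PATH = os.path.join(tempfile.gettempdir(), "aircloud_demo_ckpt.pkl")

class CheckpointingLoop:
    """Minimal training loop that checkpoints when SIGTERM arrives."""

    def __init__(self, path=CKPT_PATH):
        self.path = path
        self.state = {"weights": [0.0], "optimizer": {"lr": 1e-3}, "step": 0}
        self.stop = False
        # The drain-window notice arrives as SIGTERM.
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        # Must finish well inside the drain window (default 30 s).
        self.save()
        self.stop = True

    def save(self):
        # Write-then-rename so a mid-write preemption never leaves
        # a torn checkpoint behind.
        tmp = self.path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(self.state, f)
        os.replace(tmp, self.path)

    def resume(self):
        # On restart, pick up from the last saved state if one exists.
        if os.path.exists(self.path):
            with open(self.path, "rb") as f:
                self.state = pickle.load(f)
        return self.state["step"]

    def run(self, total_steps):
        start = self.resume()
        for step in range(start, total_steps):
            if self.stop:
                break
            self.state["step"] = step + 1  # stand-in for one training step
```

When the requeued job restarts, constructing the loop and calling `run` again continues from the saved step; the only work repeated is whatever happened after the last checkpoint.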
Checkpoint storage is included. Bring your own checkpoint logic or use the Aircloud checkpoint SDK. On-demand jobs are never preempted.
Distributed training jobs are treated as first-class workloads, not bolted on. Cluster provisioning, inter-node networking, and teardown are handled automatically. You write the training code; we handle the infrastructure.
When spot preemption hits a distributed job, the entire cluster is checkpointed before reclamation. The job requeues, a new cluster is provisioned, and training resumes from the checkpoint.
Submit jobs that span up to 512 GPUs across multiple nodes. Aircloud handles cluster provisioning, networking, and teardown.
Standard NCCL communication between nodes. Your existing distributed training code works without modification.
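"Works without modification" usually comes down to the launcher exporting the standard rendezvous environment variables on every worker. The names below are the torchrun/PyTorch conventions, assumed here rather than confirmed Aircloud specifics:

```python
import os

def distributed_env():
    """Collect the torchrun-style rendezvous variables a multi-node
    launcher typically exports on each worker (assumed convention;
    check your launcher's docs for the exact names)."""
    return {
        "rank": int(os.environ.get("RANK", "0")),
        "local_rank": int(os.environ.get("LOCAL_RANK", "0")),
        "world_size": int(os.environ.get("WORLD_SIZE", "1")),
        "master_addr": os.environ.get("MASTER_ADDR", "127.0.0.1"),
        "master_port": int(os.environ.get("MASTER_PORT", "29500")),
    }

# torch.distributed.init_process_group("nccl") reads these same
# variables, which is why unmodified DDP or DeepSpeed scripts can
# run as-is once the platform sets them.
```

Your script never needs to know how nodes were provisioned; it only reads its rank and world size at startup.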
NVLink within nodes and InfiniBand between them, available on H100 clusters. Network bandwidth is matched to your GPU tier.
Start with a smaller cluster and scale up mid-run if your budget allows, or scale down when the heavy lifting is done.
PyTorch DDP, DeepSpeed, Megatron-LM, JAX. Bring your framework and parallelism strategy.
Jobs queue when capacity is constrained. Assign priority levels across your organization's projects.
H100 clusters for pre-training and large fine-tunes. L40S and A40 for cost-sensitive LoRA jobs. Spot pricing available on all GPU types — use the cheapest hardware that meets your compute requirements.
See GPU pricing