Product 03

Train faster.
Spend less.

Async batch training jobs at scale. Use spot instances at up to 60% off on-demand pricing — with automatic checkpointing so a preemption is a pause, not a loss. Submit once, let it run.

Spot discount: Up to 60%
Auto-checkpoint: Yes
Max cluster size: 512 GPUs
Job queuing: Priority-based
Spot vs On-Demand

Training jobs rarely need
guaranteed uptime.

Spot instances are preemptible — they can be interrupted when capacity is needed elsewhere. For training jobs with checkpointing, that's a minor inconvenience, not a loss. You resume from your last checkpoint. The rest of the run costs 40–60% less.

Use on-demand when you need guaranteed uptime for a time-critical run. Use spot for everything else.
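To make the trade-off concrete, here is a rough back-of-the-envelope cost model. The rates, discount, and rework overhead below are illustrative placeholders, not Aircloud pricing:

```python
def run_cost(gpu_hours, on_demand_rate, spot_discount=0.5,
             preemptions=0, rework_hours=0.1):
    """Rough cost of a training run (illustrative numbers only).

    After each preemption, spot reruns the work done since the last
    checkpoint (rework_hours), but every hour is billed at the
    discounted rate.
    """
    spot_hours = gpu_hours + preemptions * rework_hours
    spot = spot_hours * on_demand_rate * (1 - spot_discount)
    on_demand = gpu_hours * on_demand_rate
    return {"spot": spot, "on_demand": on_demand}

# A 100 GPU-hour run at a $2.00/hr on-demand rate, even with three
# preemptions, still costs roughly half the on-demand price.
costs = run_cost(100, 2.00, spot_discount=0.5, preemptions=3)
```

With frequent checkpoints, rework per preemption stays small, so the discount dominates the overhead.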

Spot (recommended for training)
40–60% off on-demand
Same hardware as on-demand
Preemptible with notice
Auto-checkpoint on interrupt
Auto-resume on restart
Priority queue available
On-Demand
Full rate
Guaranteed uptime
No preemption risk
Same auto-checkpoint support
Priority allocation
Instant provisioning
Auto-Checkpointing

A preemption is a pause,
not a loss.

Spot preemptions are handled gracefully. Your job checkpoints on notice, requeues automatically, and resumes from the last saved state. You keep the spot discount; the only cost is queue wait time.

01

Preemption notice sent

Aircloud sends a SIGTERM to your container with a configurable drain window — default 30 seconds — before the instance is reclaimed.

02

Checkpoint written

Your training script saves model weights, optimizer state, and step count to persistent storage. Aircloud provides a native checkpoint SDK, or you can use your own.

03

Job requeued automatically

The job is automatically returned to the queue. When capacity is available, it restarts from the last checkpoint — no manual intervention.

04

Training resumes

Your container loads the checkpoint and continues training from where it stopped. Total overhead: the drain window plus queue wait time.

Checkpoint storage is included. Bring your own checkpoint logic or use the Aircloud checkpoint SDK. On-demand jobs are never preempted.

Distributed Training

Multi-node from day one.

Distributed training jobs are treated as first-class workloads, not bolted on. Cluster provisioning, inter-node networking, and teardown are handled automatically. You write the training code; we handle the infrastructure.

Spot preemption on distributed jobs checkpoints the entire cluster before reclamation. The job requeues, a new cluster is provisioned, and training resumes from the checkpoint.

Multi-node clusters

Submit jobs that span up to 512 GPUs across multiple nodes. Aircloud handles cluster provisioning, networking, and teardown.

NCCL-compatible

Standard NCCL communication between nodes. Your existing distributed training code works without modification.
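Concretely, "works without modification" holds because standard launchers such as torchrun publish the rendezvous through well-known environment variables, which torch.distributed and NCCL read to form the process group. A minimal reader of that convention (pure stdlib, no torch dependency) looks like:

```python
import os

def rendezvous_info():
    """Read the standard torch.distributed rendezvous environment.

    torchrun (and most cluster schedulers) export these variables on
    every node; NCCL-backed process groups are formed from them, so an
    unmodified DDP script runs anywhere they are set. Defaults below
    match a single-process local run.
    """
    return {
        "master_addr": os.environ.get("MASTER_ADDR", "127.0.0.1"),
        "master_port": int(os.environ.get("MASTER_PORT", "29500")),
        "rank": int(os.environ.get("RANK", "0")),
        "world_size": int(os.environ.get("WORLD_SIZE", "1")),
        "local_rank": int(os.environ.get("LOCAL_RANK", "0")),
    }
```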

High-bandwidth interconnects

InfiniBand and NVLink available on H100 clusters. Network bandwidth matched to your GPU tier.

Elastic scaling

Start with a smaller cluster, scale up mid-run if your budget allows. Or scale down when the heavy lifting is done.

Framework agnostic

PyTorch DDP, DeepSpeed, Megatron-LM, JAX. Bring your framework and parallelism strategy.

Job queuing and priority

Jobs queue when capacity is constrained. Assign priority levels across your organization's projects.
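The scheduler's internals aren't documented here, but a useful mental model for priority-based queuing is a heap keyed on (priority, submission order): higher-priority projects drain first, and ties go FIFO. A toy sketch:

```python
import heapq
import itertools

class PriorityJobQueue:
    """Toy model of priority-based job queuing (not the real scheduler).

    A lower priority number is scheduled sooner; jobs at the same
    priority level run in submission (FIFO) order.
    """
    def __init__(self):
        self._heap = []
        self._order = itertools.count()

    def submit(self, job_name, priority):
        heapq.heappush(self._heap, (priority, next(self._order), job_name))

    def next_job(self):
        return heapq.heappop(self._heap)[2]

q = PriorityJobQueue()
q.submit("lora-sweep", priority=2)   # job names and levels are illustrative
q.submit("pretrain-7b", priority=0)
q.submit("eval-run", priority=2)
```

Here "pretrain-7b" is dispatched first despite being submitted second, and the two priority-2 jobs keep their submission order.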

GPU Options for Training

Match hardware
to your workload.

H100 clusters for pre-training and large fine-tunes. L40S and A40 for cost-sensitive LoRA jobs. Spot pricing available on all GPU types — use the cheapest hardware that meets your compute requirements.

See GPU pricing
GPU         Best for                          VRAM         Interconnect
H100 SXM5   Large model training              80GB HBM3    NVLink + InfiniBand
H100 PCIe   High throughput, cost-effective   80GB HBM2e   PCIe + InfiniBand
A100 SXM4   Proven for distributed training   80GB HBM2e   NVLink + InfiniBand
A100 PCIe   Mid-scale fine-tuning             40GB HBM2    PCIe
L40S        Cost-effective fine-tuning        48GB GDDR6   PCIe
A40         Good for smaller models           48GB GDDR6   PCIe

Submit your
first training job.

Spot pricing with auto-checkpoint. No infra to manage. Scale from one GPU to a 512-GPU cluster on the same API. Early access is open.

Get Started
Read the docs