Spot instances are preemptible — they can be interrupted when capacity is needed elsewhere. For training jobs with checkpointing, that's a minor inconvenience, not a loss. You resume from your last checkpoint. The rest of the run costs 40–60% less.
Use on-demand when you need guaranteed uptime for a time-critical run. Use spot for everything else.
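To make the tradeoff concrete, here is a rough cost comparison. The dollar rates, discount, and preemption overhead below are illustrative assumptions, not Aircloud quotes; the function name is hypothetical:

```python
def spot_run_cost(on_demand_rate, spot_discount, compute_hours,
                  preemptions, overhead_hours_each):
    """Estimate the cost of a checkpointed spot run.

    Only hours actually billed while an instance is up count;
    queue wait time after a preemption costs nothing.
    """
    spot_rate = on_demand_rate * (1 - spot_discount)
    # Each preemption re-bills a little work: the drain window plus
    # any progress since the last checkpoint.
    billed_hours = compute_hours + preemptions * overhead_hours_each
    return spot_rate * billed_hours

# A 100-hour job at a hypothetical $2.00/hr on-demand rate, 50% spot
# discount, 3 preemptions costing ~0.5 re-billed hours each:
spot = spot_run_cost(2.00, 0.50, 100, 3, 0.5)  # 1.00/hr * 101.5 hr = 101.50
on_demand = 2.00 * 100                          # 200.00
```

Even with several preemptions, the overhead hours are small next to the discount, which is why checkpointed training is the canonical spot workload.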
Spot preemptions are handled gracefully. Your job checkpoints on notice, requeues automatically, and resumes from the last saved state. You keep the spot discount; the only cost is queue wait time.
Aircloud sends a SIGTERM to your container with a configurable drain window — default 30 seconds — before the instance is reclaimed.
Your training script saves model weights, optimizer state, and step count to persistent storage. Aircloud provides a native checkpoint SDK, or you can use your own.
The job is automatically returned to the queue. When capacity is available, it restarts from the last checkpoint — no manual intervention.
Your container loads the checkpoint and continues training from where it stopped. Total overhead: the drain window plus queue wait time.
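The lifecycle above can be sketched framework-agnostically. This is a minimal illustration, assuming a plain pickle checkpoint on a persistent path; a real job would save framework state (for example via `torch.save`) or use the Aircloud checkpoint SDK, whose API is not shown here. The class and path names are hypothetical:

```python
import os
import pickle
import signal
import tempfile

# In production this would live on persistent storage, not tmp.
CKPT_PATH = os.path.join(tempfile.gettempdir(), "aircloud_demo_ckpt.pkl")

class CheckpointingLoop:
    """Minimal training loop that checkpoints when SIGTERM arrives."""

    def __init__(self, path=CKPT_PATH):
        self.path = path
        self.state = {"weights": [0.0], "optimizer": {"lr": 1e-3}, "step": 0}
        self.stop = False
        # The drain-window notice arrives as SIGTERM.
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        # Must finish well inside the drain window (default 30 s).
        self.save()
        self.stop = True

    def save(self):
        # Write-then-rename so a mid-write preemption never leaves
        # a torn checkpoint behind.
        tmp = self.path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(self.state, f)
        os.replace(tmp, self.path)

    def resume(self):
        # On restart, pick up from the last saved state if one exists.
        if os.path.exists(self.path):
            with open(self.path, "rb") as f:
                self.state = pickle.load(f)
        return self.state["step"]

    def run(self, total_steps):
        start = self.resume()
        for step in range(start, total_steps):
            if self.stop:
                break
            self.state["step"] = step + 1  # stand-in for one training step
```

When the requeued job restarts, constructing the loop and calling `run` again continues from the saved step; the only work repeated is whatever happened after the last checkpoint.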
Checkpoint storage is included. Bring your own checkpoint logic or use the Aircloud checkpoint SDK. On-demand jobs are never preempted.
Distributed training jobs are treated as first-class workloads, not bolted on. Cluster provisioning, inter-node networking, and teardown are handled automatically. You write the training code; we handle the infrastructure.
When spot preemption hits a distributed job, the entire cluster is checkpointed before reclamation. The job requeues, a new cluster is provisioned, and training resumes from the checkpoint.
Submit jobs that span up to 512 GPUs across multiple nodes. Aircloud handles cluster provisioning, networking, and teardown.
Standard NCCL communication between nodes. Your existing distributed training code works without modification.
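"Works without modification" usually comes down to the launcher exporting the standard rendezvous environment variables on every worker. The names below are the torchrun/PyTorch conventions, assumed here rather than confirmed Aircloud specifics:

```python
import os

def distributed_env():
    """Collect the torchrun-style rendezvous variables a multi-node
    launcher typically exports on each worker (assumed convention;
    check your launcher's docs for the exact names)."""
    return {
        "rank": int(os.environ.get("RANK", "0")),
        "local_rank": int(os.environ.get("LOCAL_RANK", "0")),
        "world_size": int(os.environ.get("WORLD_SIZE", "1")),
        "master_addr": os.environ.get("MASTER_ADDR", "127.0.0.1"),
        "master_port": int(os.environ.get("MASTER_PORT", "29500")),
    }

# torch.distributed.init_process_group("nccl") reads these same
# variables, which is why unmodified DDP or DeepSpeed scripts can
# run as-is once the platform sets them.
```

Your script never needs to know how nodes were provisioned; it only reads its rank and world size at startup.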
NVLink within nodes and InfiniBand between them, available on H100 clusters. Network bandwidth is matched to your GPU tier.
Start with a smaller cluster and scale up mid-run if your budget allows, or scale down when the heavy lifting is done.
PyTorch DDP, DeepSpeed, Megatron-LM, JAX. Bring your framework and parallelism strategy.
Jobs queue when capacity is constrained. Assign priority levels across your organization's projects.
H100 clusters for pre-training and large fine-tunes. L40S and A40 for cost-sensitive LoRA jobs. Spot pricing available on all GPU types — use the cheapest hardware that meets your compute requirements.
See GPU pricing