Push any Docker image — CUDA environment, your model, dependencies included. Or use one of the pre-built runtime templates.
A GPU is provisioned on your first request. Cold start is under 90 seconds. Your workload runs on dedicated hardware — no resource contention from neighbors.
When your workload finishes or the endpoint goes idle, the GPU is released. Billing stops to the second. No idle charges, no minimums, no lingering costs.
From H100s for maximum throughput to RTX PRO 6000s for cost-sensitive batch work. Every GPU type is available at the tier that matches your security requirements.
New GPU types added regularly. Check the docs for live availability.
Lambda, Modal, and most 'serverless' GPU products keep warm pods and call it serverless. We don't. True scale-to-zero means zero idle cost. Your billing stops when your job stops.
Your container runs on a GPU allocated to you. Other jobs don't share your VRAM or memory bandwidth. The cold start cost is real — the isolation is too.
Need 500 GPUs for a batch run? Submit the job. We handle provisioning across the supply network. Scale-out without capacity planning or pre-negotiated quotas.
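A minimal sketch of what that could look like, assuming a hypothetical gpucloud Python SDK (the client, method names, and parameters are illustrative, not the documented API):

```python
# Hypothetical sketch: "gpucloud", its Client, and these parameter
# names are illustrative assumptions, not the documented API.
import gpucloud

client = gpucloud.Client(api_key="...")  # key elided

# One call requests 500 GPUs; provisioning across the supply network
# happens server-side, with no pre-negotiated quota.
job = client.jobs.submit(
    image="registry.example.com/team/batch-embed:v3",
    gpu_type="H100",
    tier="community",
    replicas=500,
)

job.wait()  # blocks until all replicas finish; GPUs release as they do
```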
REST API, Python SDK, Kubernetes operator. No proprietary runtimes or lock-in. Bring your Docker image and go.
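As one illustration, a bare REST call from Python might look like the sketch below. The endpoint URL, headers, and payload shape are assumptions for illustration; the real schema lives in the API reference.

```python
# Minimal REST sketch using the standard "requests" library.
# The URL and JSON fields are hypothetical placeholders.
import requests

resp = requests.post(
    "https://api.example.com/v1/endpoints/my-model/invoke",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"input": {"prompt": "hello"}},
    timeout=120,  # headroom for a sub-90-second cold start on first request
)
resp.raise_for_status()
print(resp.json())
```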
Billed from first request to last byte written. No minimum session length, no rounding up to the minute. A 45-second job costs 45 seconds of GPU time.
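To make that concrete, here is the arithmetic with a hypothetical hourly rate (real rates are on the pricing page):

```python
import math

rate_per_hour = 2.40               # hypothetical rate, $/GPU-hour
rate_per_second = rate_per_hour / 3600

job_seconds = 45

# Per-second billing: pay for exactly the seconds used.
cost_exact = job_seconds * rate_per_second                          # $0.03

# Per-minute rounding (common elsewhere): 45 s bills as a full minute.
cost_rounded = math.ceil(job_seconds / 60) * 60 * rate_per_second   # $0.04

print(f"per-second: ${cost_exact:.4f}, rounded-up: ${cost_rounded:.4f}")
```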
Run serverless workloads on Trusted infrastructure for compliance-sensitive tasks, or on Community nodes for batch experiments. Same API, different isolation guarantees.
No infrastructure management. Specify your image, GPU type, and tier. We handle provisioning, scheduling, and teardown. Python SDK and Kubernetes operator also available.
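Concretely, a deployment could look like the sketch below, again using the hypothetical gpucloud SDK with illustrative parameter names:

```python
# Hypothetical SDK sketch: deploy an endpoint by naming an image,
# a GPU type, and a supply tier. Names are illustrative, not the
# real API surface.
import gpucloud

endpoint = gpucloud.Client(api_key="...").endpoints.create(
    name="my-model",
    image="registry.example.com/team/my-model:latest",  # any Docker image
    gpu_type="RTX-PRO-6000",   # or "H100" for maximum throughput
    tier="trusted",            # "trusted" or "community" isolation tier
)

# First request triggers provisioning; idle endpoints scale to zero.
print(endpoint.url)
```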
Full API reference

GPU time billed from first request. No session minimums. Rates vary by GPU model and supply tier: Community is cheapest, Trusted reflects hyperscaler infrastructure costs. Full rate card on the pricing page.
No credit card required to explore