Push any Docker image — CUDA environment, your model, dependencies included. Or use one of the pre-built runtime templates.
A GPU is provisioned on your first request. Cold start is under 90 seconds. Your workload runs on dedicated hardware — no resource contention from neighbors.
When your workload finishes or the endpoint goes idle, the GPU is released. Billing stops to the second. No idle charges, no minimums, no lingering costs.
From H100s for maximum throughput to RTX PRO 6000s for cost-sensitive batch work. Every GPU type is available at the tier that matches your security requirements.
New GPU types added regularly. Check the docs for live availability.
Lambda, Modal, and most 'serverless' GPU products keep warm pods and call it serverless. We don't. True scale-to-zero means zero idle cost. Your billing stops when your job stops.
Your container runs on a GPU allocated to you. Other jobs don't share your VRAM or memory bandwidth. The cold start cost is real — the isolation is too.
Need 500 GPUs for a batch run? Submit the job. We handle provisioning across the supply network. Scale-out without capacity planning or pre-negotiated quotas.
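A minimal sketch of what that could look like, assuming a hypothetical gpucloud Python SDK (the client, method names, and parameters are illustrative, not the documented API):

```python
# Hypothetical sketch: "gpucloud", its Client, and these parameter
# names are illustrative assumptions, not the documented API.
import gpucloud

client = gpucloud.Client(api_key="...")  # key elided

# One call requests 500 GPUs; provisioning across the supply network
# happens server-side, with no pre-negotiated quota.
job = client.jobs.submit(
    image="registry.example.com/team/batch-embed:v3",
    gpu_type="H100",
    tier="community",
    replicas=500,
)

job.wait()  # blocks until all replicas finish; GPUs release as they do
```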
REST API, Python SDK, Kubernetes operator. No proprietary runtimes or lock-in. Bring your Docker image and go.
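As one illustration, a bare REST call from Python might look like the sketch below. The endpoint URL, headers, and payload shape are assumptions for illustration; the real schema lives in the API reference.

```python
# Minimal REST sketch using the standard "requests" library.
# The URL and JSON fields are hypothetical placeholders.
import requests

resp = requests.post(
    "https://api.example.com/v1/endpoints/my-model/invoke",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"input": {"prompt": "hello"}},
    timeout=120,  # headroom for a sub-90-second cold start on first request
)
resp.raise_for_status()
print(resp.json())
```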
Billed from first request to last byte written. No minimum session length, no rounding up to the minute. A 45-second job costs 45 seconds of GPU time.
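To make that concrete, here is the arithmetic with a hypothetical hourly rate (real rates are on the pricing page):

```python
import math

rate_per_hour = 2.40               # hypothetical rate, $/GPU-hour
rate_per_second = rate_per_hour / 3600

job_seconds = 45

# Per-second billing: pay for exactly the seconds used.
cost_exact = job_seconds * rate_per_second                          # $0.03

# Per-minute rounding (common elsewhere): 45 s bills as a full minute.
cost_rounded = math.ceil(job_seconds / 60) * 60 * rate_per_second   # $0.04

print(f"per-second: ${cost_exact:.4f}, rounded-up: ${cost_rounded:.4f}")
```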
Run serverless workloads on Trusted infrastructure for compliance-sensitive tasks, or on Community nodes for batch experiments. Same API, different isolation guarantees.
No infrastructure management. Specify your image, GPU type, and tier. We handle provisioning, scheduling, and teardown. Python SDK and Kubernetes operator also available.
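Concretely, a deployment could look like the sketch below, again using the hypothetical gpucloud SDK with illustrative parameter names:

```python
# Hypothetical SDK sketch: deploy an endpoint by naming an image,
# a GPU type, and a supply tier. Names are illustrative, not the
# real API surface.
import gpucloud

endpoint = gpucloud.Client(api_key="...").endpoints.create(
    name="my-model",
    image="registry.example.com/team/my-model:latest",  # any Docker image
    gpu_type="RTX-PRO-6000",   # or "H100" for maximum throughput
    tier="trusted",            # "trusted" or "community" isolation tier
)

# First request triggers provisioning; idle endpoints scale to zero.
print(endpoint.url)
```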
Full API reference

GPU time billed from first request. No session minimums. Rates vary by GPU model and supply tier: Community is cheapest, Trusted reflects hyperscaler infrastructure costs. Full rate card on the pricing page.
No credit card required to explore