Dedicated GPU endpoints for your models. Consistent throughput, sub-100ms time-to-first-token, and zero contention with other tenants' traffic.
On shared-GPU infra, every other tenant's request competes for the same CUDA cores, memory bandwidth, and PCIe lanes. Your p99 latency is held hostage.
KV cache is finite. When multiple models pack a single GPU, you're constantly evicting and reloading. A 200B-parameter model's weights alone overflow an H100's 80 GB, even at FP8; it cannot share a GPU with anyone.
Tokens-per-second degrades with load. Shared clouds hide this behind aggregate metrics. Your SLA-critical workload sees spikes you can't explain.
One endpoint, one GPU — or a dedicated multi-GPU slice. No sharing, ever. Memory is yours.
Model weights load once at startup. No re-paging from shared storage. Cold-starts happen once, not per request.
With no competing traffic, tokens-per-second is stable and predictable. Benchmark once; hold that number in production.
vLLM-compatible continuous batching out of the box. Maximize GPU utilization across your own request volume.
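Continuous batching only pays off when the endpoint has many requests in flight at once. A minimal client-side sketch for fanning out prompts with bounded concurrency (the `fan_out` helper and `request_fn` are illustrative, not part of the Aircloud API):

```python
import asyncio

async def fan_out(request_fn, prompts, max_concurrency=32):
    """Run request_fn over all prompts with bounded concurrency,
    so the endpoint's continuous batcher always has work queued."""
    sem = asyncio.Semaphore(max_concurrency)

    async def one(prompt):
        async with sem:
            return await request_fn(prompt)

    # Results come back in the same order as the input prompts.
    return await asyncio.gather(*(one(p) for p in prompts))
```

With `openai.AsyncOpenAI`, `request_fn` would be a coroutine that awaits `client.chat.completions.create(...)` for a single prompt.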
Time-to-first-token is the metric that makes or breaks interactive LLM products. Below 100ms feels instant. Above 200ms and users notice. Above 500ms and they leave. Dedicated GPU removes the variable that causes most shared-cloud overruns.
Benchmark Reference
Indicative. Final benchmarks vary by model, quantization, and batch config.
Model: Multi-node. MoE routing activates ~32B params per token, keeping TTFT fast, but all 1T weights must live in VRAM.
Model: 4-GPU node at FP8. The MoE architecture keeps per-token compute low, sustaining high tokens/sec on large batches.
Model: MoE, 32–40B active params per token. A 4-GPU node covers the 355B variant at FP8; the 744B needs 8 GPUs.
Model: Single GPU, with plenty of KV cache headroom. Best cost/quality for high-volume endpoints.
Aircloud inference endpoints are OpenAI API-compatible. Point your existing client at your endpoint URL and swap the API key. No SDK changes, no prompt rewrites, no migration effort.
import openai

client = openai.OpenAI(
    base_url="https://api.aircloud.com/v1/endpoints/{id}",
    api_key="tb_sk_...",
)

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2-Instruct",
    messages=[{"role": "user", "content": "Explain FSDP."}],
    stream=True,
)

for chunk in response:
    # The final chunk's delta may carry no content; guard against printing None.
    print(chunk.choices[0].delta.content or "", end="", flush=True)
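To check the sub-100ms time-to-first-token claim against your own endpoint, you can time the stream directly. This helper is a sketch (not part of any SDK) that works on any iterable of text deltas:

```python
import time

def time_to_first_token(deltas):
    """Consume a stream of text deltas; return (ttft_seconds, full_text).

    TTFT is measured from the start of consumption to the first
    non-empty delta, which is what users perceive as responsiveness.
    """
    start = time.perf_counter()
    ttft = None
    parts = []
    for delta in deltas:
        if delta:
            if ttft is None:
                ttft = time.perf_counter() - start
            parts.append(delta)
    return ttft, "".join(parts)
```

Against the streaming response above, you would pass `(chunk.choices[0].delta.content for chunk in response)`.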
Spin up a private inference endpoint in minutes. Bring your own weights or deploy from HuggingFace Hub.