Use Cases/LLM Inference

Serve large language models.
Without the shared-tenant tax.

Dedicated GPU endpoints for your models. Consistent throughput, sub-100ms time-to-first-token, and zero contention with other tenants' traffic.

Deploy an endpoint Read the docs
The LLM Inference Problem

Noisy neighbors

On shared-GPU infra, every other tenant's request competes for the same CUDA cores, memory bandwidth, and PCIe lanes. Your p99 latency is held hostage.

GPU memory contention

KV cache is finite. When multiple models are packed onto a single GPU, weights and cache are constantly evicted and reloaded. A 200B model cannot share an H100 with anyone.

Unpredictable throughput

Tokens-per-second degrades with load. Shared clouds hide this behind aggregate metrics. Your SLA-critical workload sees spikes you can't explain.

How Private Inference Endpoints Solve This

Dedicated GPU

One endpoint, one GPU — or a dedicated multi-GPU slice. No sharing, ever. Memory is yours.

Your weights only

Model weights load once at startup. No re-paging from shared storage. Cold-starts happen once, not per request.

Consistent throughput

With no competing traffic, tokens-per-second is stable and predictable. Benchmark once; hold that number in production.

Continuous batching

vLLM-compatible continuous batching out of the box. Maximize GPU utilization across your own request volume.

Latency Profile

What sub-100ms means
for LLM serving.

Time-to-first-token is the metric that makes or breaks interactive LLM products. Below 100ms feels instant. Above 200ms and users notice. Above 500ms and they leave. Dedicated GPU removes the variable that causes most shared-cloud overruns.
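TTFT can be measured directly from any streaming response: start the clock at request time, stop it at the first streamed token. A minimal helper (illustrative; `measure_ttft` is not part of any SDK, and the fake stream stands in for a real chunk iterator):

```python
import time

def measure_ttft(stream):
    """Return (seconds_to_first_token, all_tokens) for any token iterator.
    With an OpenAI-compatible client, `stream` would be the chunk iterator
    returned by a stream=True chat completion."""
    start = time.perf_counter()
    it = iter(stream)
    first = next(it)                      # blocks until the first token arrives
    ttft = time.perf_counter() - start
    return ttft, [first, *it]

# Simulated stream: ~50 ms to first token, then instant decode.
def fake_stream():
    time.sleep(0.05)
    yield from ["Hello", ",", " world"]

ttft, tokens = measure_ttft(fake_stream())
```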

Benchmark Reference

TTFT (time-to-first-token): < 80 ms
Kimi 2.5 on 8× H100 SXM5 (~32B active params)

Throughput (output tokens/s): ~2,800 tok/s
H100, batch size 32

Throughput (output tokens/s): ~900 tok/s
A100, batch size 16

p99 latency variance: < 5%
Dedicated endpoint, no neighbors

Indicative. Final benchmarks vary by model, quantization, and batch config.
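End-to-end response latency decomposes into TTFT plus decode time (output tokens ÷ throughput). Plugging in the indicative H100 numbers above, a sanity-check calculation:

```python
def e2e_latency_s(ttft_s, output_tokens, tok_per_s):
    # Total response time = time to first token + steady-state decode time.
    return ttft_s + output_tokens / tok_per_s

# A 500-token response against the indicative figures: 80 ms TTFT, 2,800 tok/s.
latency = e2e_latency_s(0.080, 500, 2800)   # ≈ 0.26 s end to end
```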

GPU Recommendations by Model Size

Model

~1T MoE (Kimi 2.5)

GPU: 8–16× H100 SXM5 80GB
VRAM req.: ~800GB+ FP8
Precision: FP8 / INT4

Multi-node. MoE routing means ~32B active params per token — fast TTFT — but all 1T weights must live in VRAM.

Model

~230B MoE (MiniMax 2.7)

GPU: 4× H100 SXM5 80GB
VRAM req.: ~230GB FP8
Precision: FP8 / BF16

4-GPU node at FP8. MoE architecture keeps per-token compute low. High tokens/sec on large batches.

Model

~355–744B MoE (GLM 5.1)

GPU: 4–8× H100 SXM5 80GB
VRAM req.: ~180–380GB FP8
Precision: FP8 / BF16

MoE. 32–40B active params per token. 4-GPU node covers the 355B variant at FP8; 8 GPUs for the 744B.

Model

27–32B (Gemma 4 27B, GLM 5.1 32B)

GPU: H100 SXM5 80GB
VRAM req.: ~54–64GB BF16
Precision: BF16

Single GPU. Plenty of KV cache headroom. Best cost/quality for high-volume endpoints.
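The VRAM figures above are roughly weight size alone: parameter count × bytes per parameter (1 for FP8, 2 for BF16), before KV cache and activation overhead. A back-of-envelope helper (assumption: weights dominate; real footprints run higher):

```python
BYTES_PER_PARAM = {"fp8": 1, "bf16": 2, "int4": 0.5}

def weight_vram_gb(params_billion, precision):
    # Weight memory only; KV cache, activations, and CUDA context add more.
    return params_billion * BYTES_PER_PARAM[precision]

gemma = weight_vram_gb(27, "bf16")    # 54 GB — fits one H100 80GB with headroom
minimax = weight_vram_gb(230, "fp8")  # 230 GB — needs a 4× H100 node
```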

API Integration

OpenAI-compatible.
Swap one URL.

Aircloud inference endpoints are OpenAI API-compatible. Point your existing client at your endpoint URL and swap the API key. No SDK changes, no prompt rewrites, no migration effort.

inference.py
import openai

client = openai.OpenAI(
    base_url="https://api.aircloud.com/v1/endpoints/{id}",
    api_key="tb_sk_...",
)

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2-Instruct",
    messages=[{"role": "user", "content": "Explain FSDP."}],
    stream=True,
)

for chunk in response:
    # Guard against chunks with no content (e.g. the final stop chunk).
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Your model.
Your GPU. No one else.

Spin up a private inference endpoint in minutes. Bring your own weights or deploy from HuggingFace Hub.

Get Started Talk to Sales