The difference matters most when your model handles sensitive data, when latency SLAs are contractual, or when your model weights represent significant IP.
Every Private Inference endpoint runs on the tier you specify. Mix tiers across endpoints. Upgrade or downgrade when your requirements change.
VM-level isolation on hyperscaler infrastructure.
GPU containers provisioned on major cloud providers. Hardware-enforced VM boundaries, provider-backed SLAs, and the highest available isolation level. Use this for workloads with compliance requirements or contractual obligations around data handling.
Verified partner hardware in audited facilities.
Known operators' hardware in certified colocation facilities with physical security controls. Aircloud monitoring, contractual SLAs, and operator vetting. The right balance of cost and assurance for demanding workloads that don't require full hyperscaler overhead.
Best prices. Community-monitored reliability.
Independent operators with community reputation scoring. Docker hardening, basic contractual protections, and uptime monitoring. The lowest cost tier — suitable for internal tooling, development, and workloads without data sensitivity requirements.
Pay for exactly the compute you use. Endpoints billed from first request, down to the second.
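Per-second billing is easy to reason about: cost is GPU-seconds times a per-second rate. A minimal sketch of that arithmetic — the rate below is illustrative, not a published price:

```python
from datetime import datetime, timedelta

# Hypothetical per-GPU-second rate (illustrative only, not a published price).
RATE_PER_GPU_SECOND = 0.00125  # ≈ $4.50 per GPU-hour

def endpoint_cost(start: datetime, end: datetime, gpus: int = 1) -> float:
    """Cost from first request to shutdown, billed to the second."""
    seconds = (end - start).total_seconds()
    return round(seconds * gpus * RATE_PER_GPU_SECOND, 4)

start = datetime(2025, 1, 1, 9, 0, 0)
end = start + timedelta(minutes=90)        # a 90-minute batch job
print(endpoint_cost(start, end, gpus=4))   # 4 GPUs × 5,400 s → 27.0
```

Because billing starts at the first request, an endpoint that serves a short burst and scales away costs seconds of compute, not a full hour.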
Your dedicated GPU is loaded with your model. No shared cache eviction, no cold starts triggered by other users.
Connect your inference endpoint directly to your private VPC. Traffic never crosses the public internet.
Bring your own Docker image with any CUDA version, framework, or serving stack. No locked-in runtime.
Set min/max GPU replicas. Traffic-based autoscaling without manual intervention.
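Min/max replica bounds reduce scaling decisions to a clamp: size the fleet to current traffic, never below the floor or above the ceiling. A sketch of that logic, assuming an illustrative per-replica throughput figure (the real number depends on your model and hardware):

```python
import math

def desired_replicas(current_rps: float, rps_per_replica: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Size GPU replicas to traffic, clamped to the configured min/max."""
    needed = math.ceil(current_rps / rps_per_replica)
    return max(min_replicas, min(max_replicas, needed))

# Assume each replica sustains ~50 req/s (illustrative):
print(desired_replicas(340, 50, min_replicas=2, max_replicas=8))   # scales up to 7
print(desired_replicas(10, 50, min_replicas=2, max_replicas=8))    # floor holds at 2
print(desired_replicas(2000, 50, min_replicas=2, max_replicas=8))  # capped at 8
```

The floor keeps latency flat through traffic troughs; the ceiling keeps a traffic spike from becoming a billing spike.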
Per-request latency, throughput, and error metrics exported to your observability stack.
Weights stored encrypted on Trusted and Secure tiers. Your IP protected at the storage layer.
Split traffic between model versions. Validate new deployments without full cutover.
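One common way to implement a traffic split is deterministic hash-based routing, so a given caller always lands on the same version during the test window. The sketch below illustrates the idea only — the version names are hypothetical and the platform's actual routing mechanism isn't specified here:

```python
import hashlib

def route(request_id: str, canary_weight: float) -> str:
    """Deterministically assign a request to a version by hashing its ID."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 10_000
    return "v2-canary" if bucket < canary_weight * 10_000 else "v1-stable"

# Send ~10% of traffic to the new model version (names are illustrative).
counts = {"v1-stable": 0, "v2-canary": 0}
for i in range(10_000):
    counts[route(f"req-{i}", canary_weight=0.10)] += 1
print(counts)  # roughly a 90/10 split
```

Hashing rather than random sampling means the split is stable: replaying a request during validation hits the same version it hit in production.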
Bring weights from HuggingFace Hub or your own storage. Any model that runs on vLLM works here. Below are a few that teams are deploying today.
MoE · ~1T total
Moonshot AI. ~1T total params, ~32B active per token. Multi-node. Fast TTFT despite scale — MoE routing keeps active compute low.
MoE · 355B–744B
Zhipu AI. MoE with 32–40B active params. 355B variant fits a 4-GPU node at FP8; 744B needs 8 GPUs. Best-in-class on reasoning and agentic tasks.
MoE · ~230B total
MiniMax MoE. ~230B total params. 4-GPU node at FP8. MoE architecture keeps per-token compute low — high throughput per dollar.
Dense · 27B
Google DeepMind. Best single-GPU option — fits on one H100 with ample KV cache headroom. Strong at instruction following and code.
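The single-GPU claim for the 27B dense model follows from simple weight-memory arithmetic: parameter count times bytes per parameter, with whatever remains available for KV cache. A back-of-envelope sketch — it uses 1 GB = 10⁹ bytes, assumes an 80 GB H100, and ignores activation and runtime overhead, so treat it as a rough budget, not a capacity guarantee:

```python
def weight_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes; overhead ignored)."""
    return params_billions * bytes_per_param

H100_GB = 80

gemma_weights = weight_gb(27, 2.0)   # 27B dense at BF16 (2 bytes/param)
print(gemma_weights)                 # 54.0 GB of weights
print(H100_GB - gemma_weights)       # 26.0 GB left for KV cache

print(weight_gb(355, 1.0))           # 355B MoE at FP8: 355.0 GB → multi-GPU node
```

The same arithmetic explains the node sizes above: once weights alone exceed a single GPU's memory, the model must shard across a node, and the trillion-parameter MoEs across several.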
Fraud detection, credit scoring, and risk models running on dedicated infrastructure. Per-request isolation, no co-mingling of financial signals across tenant boundaries. Audit logs on every request for compliance.
Medical imaging models, clinical NLP, and diagnostic workflows on HIPAA-eligible infrastructure. Patient data never touches shared hardware. PHI stays within your network boundary through VPC peering.
Classified and sensitive workloads requiring maximum isolation. Trusted tier provides VM-level boundaries and provider-backed SLAs. Custom security review available for teams with specific compliance requirements.
If you're building an AI API or embedding a model into a product, shared inference adds unpredictable latency and co-tenant risk. Dedicated endpoints give you consistent performance and the ability to sign DPAs with your own customers.