Products/Private Inference
Product 02

Your model.
Your hardware.
Nobody else's.

Dedicated GPU endpoints for production model serving. Single-tenant infrastructure — your weights never touch shared hardware, your requests never compete for resources with other tenants.

This isn't a shared inference API with a privacy checkbox. It's a dedicated compute environment scoped entirely to you.

Deploy an endpoint · Read the docs
What Single-Tenant Actually Means
Shared inference API
Your request joins a shared queue
GPU resources contended across tenants
Model weights cached alongside other users' models
Latency spikes when neighbors burst
Provider controls the runtime environment
No visibility into co-tenants
Private Inference — Aircloud
Dedicated GPU allocated to your endpoint
Zero resource contention — hardware is yours
Your weights loaded in isolated VRAM
Latency is yours to control
Your Docker image, your runtime
Contractual data isolation guarantees

The difference matters most when your model handles sensitive data, when latency SLAs are contractual, or when your model weights represent significant IP.

Isolation Tiers

Choose the isolation level
your workload requires.

Every Private Inference endpoint runs on the tier you specify. Mix tiers across endpoints. Upgrade or downgrade when your requirements change.
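As a sketch, a per-endpoint spec that pins an isolation tier might look like the following. The field names (`tier`, `gpu_type`, `min_replicas`) are illustrative, not Aircloud's published API:

```python
# Hypothetical endpoint spec -- field names are illustrative only.
endpoint_spec = {
    "name": "contracts-summarizer",
    "model": "moonshotai/Kimi-K2-Instruct",
    "tier": "trusted",          # "trusted" | "secure" | "community"
    "gpu_type": "H100-SXM5-80GB",
    "min_replicas": 1,
    "max_replicas": 4,
}

def validate_tier(spec: dict) -> str:
    """Reject specs that name an unknown isolation tier."""
    tiers = {"trusted", "secure", "community"}
    if spec["tier"] not in tiers:
        raise ValueError(f"unknown tier: {spec['tier']}")
    return spec["tier"]
```

Because tiers are set per endpoint, a team can run production on Trusted while its staging copy of the same spec points at Secure or Community.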

Trusted · Hyperscaler

VM-level isolation on hyperscaler infrastructure.

GPU containers provisioned on major cloud providers. Hardware-enforced VM boundaries, provider-backed SLAs, and the highest available isolation level. Use this for workloads with compliance requirements or contractual obligations around data handling.

Isolation boundary: VM-level (hypervisor)
SLA: Provider-backed
Audit: SOC 2 / ISO 27001 eligible
Best for: Production · HIPAA · Finance
Secure · Colocation

Verified partner hardware in audited facilities.

Known operators' hardware in certified colocation facilities with physical security controls. Aircloud monitoring, contractual SLAs, and operator vetting. The right balance of cost and assurance for demanding workloads that don't require full hyperscaler overhead.

Isolation boundary: Container + contractual
SLA: Aircloud + operator SLA
Verification: Operator-vetted + audited
Best for: Staging · Regulated workloads
Community · Open Network

Best prices. Community-monitored reliability.

Independent operators with community reputation scoring. Docker hardening, basic contractual protections, and uptime monitoring. The lowest cost tier — suitable for internal tooling, development, and workloads without data sensitivity requirements.

Isolation boundary: Docker hardening
SLA: Community reputation + monitoring
Hosts: 1,000+ nodes globally
Best for: Dev · Internal tooling · R&D
Features

Per-second billing

Pay for exactly the compute you use. Endpoints billed from first request, down to the second.

No cold-start noise

Your dedicated GPU is loaded with your model. No shared cache eviction, no cold starts triggered by other users.

VPC peering

Connect your inference endpoint directly to your private VPC. Traffic never crosses the public internet.

Custom runtimes

Bring your own Docker image with any CUDA version, framework, or serving stack. No locked-in runtime.

Auto-scaling endpoints

Set min/max GPU replicas. Traffic-based autoscaling without manual intervention.
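The min/max clamp can be sketched as a pure function — a simplified stand-in for whatever controller actually runs, assuming each replica handles a fixed requests-per-second budget:

```python
import math

def desired_replicas(rps: float, rps_per_replica: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Scale replica count to traffic, clamped to configured bounds."""
    needed = math.ceil(rps / rps_per_replica) if rps > 0 else 0
    return max(min_replicas, min(max_replicas, needed))
```

At zero traffic the endpoint sits at `min_replicas`, and a traffic spike can never push it past `max_replicas`, so the billing ceiling stays predictable.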

Request logging and metrics

Per-request latency, throughput, and error metrics exported to your observability stack.

Model weight encryption at rest

Weights are stored encrypted on the Trusted and Secure tiers, protecting your IP at the storage layer.

Canary deployments

Split traffic between model versions. Validate new deployments without full cutover.
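One common way to implement such a split is stable hashing, sketched below: the same request ID always routes to the same version, so retries don't flip between deployments. The 10% canary weight is just an example:

```python
import hashlib

def route_version(request_id: str, canary_percent: int = 10) -> str:
    """Deterministic split: a request_id always maps to one version."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_percent else "stable"
```

Ramping the rollout is then a matter of raising `canary_percent` until it hits 100, at which point the canary becomes the new stable version.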

Compatible Models

Modern open models,
deployed privately.

Bring weights from HuggingFace Hub or your own storage. Any model that runs on vLLM works here — these are a few that teams are deploying today.

inference.py
import openai

client = openai.OpenAI(
    base_url="https://api.aircloud.com/v1/endpoints/{id}",
    api_key="tb_sk_...",
)

# Works with any model on your endpoint
response = client.chat.completions.create(
    model="moonshotai/Kimi-K2-Instruct",
    messages=[{"role": "user", "content": "Summarize this contract."}],
    stream=True,
)

for chunk in response:
    # The final chunk's delta.content can be None, so guard the print
    print(chunk.choices[0].delta.content or "", end="", flush=True)

MoE · ~1T total

Kimi 2.5

GPU: 8–16× H100 SXM5 80GB
VRAM: ~800GB+ FP8

Moonshot AI. ~1T total params, ~32B active per token. Multi-node. Fast TTFT despite scale — MoE routing keeps active compute low.

MoE · 355B–744B

GLM 5.1

GPU: 4–8× H100 SXM5 80GB
VRAM: ~180–380GB FP8

Zhipu AI. MoE with 32–40B active params. 355B variant fits a 4-GPU node at FP8; 744B needs 8 GPUs. Best-in-class on reasoning and agentic tasks.

MoE · ~230B total

MiniMax 2.7

GPU: 4× H100 SXM5 80GB
VRAM: ~230GB FP8

MiniMax MoE. ~230B total params. 4-GPU node at FP8. MoE architecture keeps per-token compute low — high throughput per dollar.

Dense · 27B

Gemma 4 27B

GPU: H100 SXM5 80GB
VRAM: ~54GB BF16

Google DeepMind. Best single-GPU option — fits on one H100 with ample KV cache headroom. Strong at instruction following and code.
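The VRAM figures above follow from simple weight arithmetic — parameter count times bytes per parameter, ignoring KV cache and activation overhead:

```python
def weight_vram_gb(params_billions: float, bytes_per_param: float) -> float:
    """Memory for weights alone; KV cache and activations add more."""
    return round(params_billions * bytes_per_param, 1)

# Gemma's 27B params at BF16 (2 bytes/param) need ~54 GB for weights,
# which is why it fits a single 80 GB H100 with cache headroom.
gemma_gb = weight_vram_gb(27, 2.0)

# The same arithmetic gives MiniMax's ~230 GB at FP8 (1 byte/param).
minimax_gb = weight_vram_gb(230, 1.0)
```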

Use Cases
Fintech

Model inference on financial data

Fraud detection, credit scoring, and risk models running on dedicated infrastructure. Per-request isolation, no co-mingling of financial signals across tenant boundaries. Audit logs on every request for compliance.

SOC 2 eligible · VPC peering · Audit logging
Healthcare

Clinical AI without data exposure

Medical imaging models, clinical NLP, and diagnostic workflows on HIPAA-eligible infrastructure. Patient data never touches shared hardware. PHI stays within your network boundary through VPC peering.

HIPAA eligible · PHI isolation · Trusted tier
Defense-adjacent AI

Air-gap compatible compute

Classified and sensitive workloads requiring maximum isolation. Trusted tier provides VM-level boundaries and provider-backed SLAs. Custom security review available for teams with specific compliance requirements.

VM-level isolation · Custom SLAs · Security review
AI Products

Serving your model as a product

If you're building an AI API or embedding a model into a product, shared inference adds unpredictable latency and co-tenant risk. Dedicated endpoints give you consistent performance and the ability to sign DPAs with your own customers.

Consistent latency · SLA-backed · DPA-ready

Isolated by default.
Not by checkbox.

Deploy a private inference endpoint in minutes. Choose your isolation tier. Per-second billing, no minimums. Early access is open.

Deploy an Endpoint · Talk to Sales