The difference matters most when your model handles sensitive data, when latency SLAs are contractual, or when your model weights represent significant IP.
Every Private Inference endpoint runs on the tier you specify. Mix tiers across endpoints. Upgrade or downgrade when your requirements change.
VM-level isolation on hyperscaler infrastructure.
GPU containers provisioned on major cloud providers. Hardware-enforced VM boundaries, provider-backed SLAs, and the highest available isolation level. Use this for workloads with compliance requirements or contractual obligations around data handling.
Verified partner hardware in audited facilities.
Known operators' hardware in certified colocation facilities with physical security controls. Aircloud monitoring, contractual SLAs, and operator vetting. The right balance of cost and assurance for demanding workloads that don't require full hyperscaler overhead.
Best prices. Community-monitored reliability.
Independent operators with community reputation scoring. Docker hardening, basic contractual protections, and uptime monitoring. The lowest cost tier — suitable for internal tooling, development, and workloads without data sensitivity requirements.
Pay for exactly the compute you use. Endpoints billed from first request, down to the second.
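Per-second billing is easy to reason about: cost is GPU-seconds times a per-second rate. A minimal sketch of that arithmetic — the rate below is illustrative, not a published price:

```python
from datetime import datetime, timedelta

# Hypothetical per-GPU-second rate (illustrative only, not a published price).
RATE_PER_GPU_SECOND = 0.00125  # ≈ $4.50 per GPU-hour

def endpoint_cost(start: datetime, end: datetime, gpus: int = 1) -> float:
    """Cost from first request to shutdown, billed to the second."""
    seconds = (end - start).total_seconds()
    return round(seconds * gpus * RATE_PER_GPU_SECOND, 4)

start = datetime(2025, 1, 1, 9, 0, 0)
end = start + timedelta(minutes=90)        # a 90-minute batch job
print(endpoint_cost(start, end, gpus=4))   # 4 GPUs × 5,400 s → 27.0
```

Because billing starts at the first request, an endpoint that serves a short burst and scales away costs seconds of compute, not a full hour.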
Your dedicated GPU is loaded with your model. No shared cache eviction, no cold starts triggered by other users.
Connect your inference endpoint directly to your private VPC. Traffic never crosses the public internet.
Bring your own Docker image with any CUDA version, framework, or serving stack. No locked-in runtime.
Set min/max GPU replicas. Traffic-based autoscaling without manual intervention.
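Min/max replica bounds reduce scaling decisions to a clamp: size the fleet to current traffic, never below the floor or above the ceiling. A sketch of that logic, assuming an illustrative per-replica throughput figure (the real number depends on your model and hardware):

```python
import math

def desired_replicas(current_rps: float, rps_per_replica: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Size GPU replicas to traffic, clamped to the configured min/max."""
    needed = math.ceil(current_rps / rps_per_replica)
    return max(min_replicas, min(max_replicas, needed))

# Assume each replica sustains ~50 req/s (illustrative):
print(desired_replicas(340, 50, min_replicas=2, max_replicas=8))   # scales up to 7
print(desired_replicas(10, 50, min_replicas=2, max_replicas=8))    # floor holds at 2
print(desired_replicas(2000, 50, min_replicas=2, max_replicas=8))  # capped at 8
```

The floor keeps latency flat through traffic troughs; the ceiling keeps a traffic spike from becoming a billing spike.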
Per-request latency, throughput, and error metrics exported to your observability stack.
Weights stored encrypted on Trusted and Secure tiers. Your IP protected at the storage layer.
Split traffic between model versions. Validate new deployments without full cutover.
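One common way to implement a traffic split is deterministic hash-based routing, so a given caller always lands on the same version during the test window. The sketch below illustrates the idea only — the version names are hypothetical and the platform's actual routing mechanism isn't specified here:

```python
import hashlib

def route(request_id: str, canary_weight: float) -> str:
    """Deterministically assign a request to a version by hashing its ID."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 10_000
    return "v2-canary" if bucket < canary_weight * 10_000 else "v1-stable"

# Send ~10% of traffic to the new model version (names are illustrative).
counts = {"v1-stable": 0, "v2-canary": 0}
for i in range(10_000):
    counts[route(f"req-{i}", canary_weight=0.10)] += 1
print(counts)  # roughly a 90/10 split
```

Hashing rather than random sampling means the split is stable: replaying a request during validation hits the same version it hit in production.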
Bring weights from HuggingFace Hub or your own storage. Any model that runs on vLLM works here. Below are a few that teams are deploying today.
MoE · ~1T total
Moonshot AI. ~1T total params, ~32B active per token. Multi-node. Fast TTFT despite scale — MoE routing keeps active compute low.
MoE · 355B–744B
Zhipu AI. MoE with 32–40B active params. 355B variant fits a 4-GPU node at FP8; 744B needs 8 GPUs. Best-in-class on reasoning and agentic tasks.
MoE · ~230B total
MiniMax MoE. ~230B total params. 4-GPU node at FP8. MoE architecture keeps per-token compute low — high throughput per dollar.
Dense · 27B
Google DeepMind. Best single-GPU option — fits on one H100 with ample KV cache headroom. Strong at instruction following and code.
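The single-GPU claim for the 27B dense model follows from simple weight-memory arithmetic: parameter count times bytes per parameter, with whatever remains available for KV cache. A back-of-envelope sketch — it uses 1 GB = 10⁹ bytes, assumes an 80 GB H100, and ignores activation and runtime overhead, so treat it as a rough budget, not a capacity guarantee:

```python
def weight_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes; overhead ignored)."""
    return params_billions * bytes_per_param

H100_GB = 80

gemma_weights = weight_gb(27, 2.0)   # 27B dense at BF16 (2 bytes/param)
print(gemma_weights)                 # 54.0 GB of weights
print(H100_GB - gemma_weights)       # 26.0 GB left for KV cache

print(weight_gb(355, 1.0))           # 355B MoE at FP8: 355.0 GB → multi-GPU node
```

The same arithmetic explains the node sizes above: once weights alone exceed a single GPU's memory, the model must shard across a node, and the trillion-parameter MoEs across several.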
Fraud detection, credit scoring, and risk models running on dedicated infrastructure. Per-request isolation, no co-mingling of financial signals across tenant boundaries. Audit logs on every request for compliance.
Medical imaging models, clinical NLP, and diagnostic workflows on HIPAA-eligible infrastructure. Patient data never touches shared hardware. PHI stays within your network boundary through VPC peering.
Classified and sensitive workloads requiring maximum isolation. Trusted tier provides VM-level boundaries and provider-backed SLAs. Custom security review available for teams with specific compliance requirements.
If you're building an AI API or embedding a model into a product, shared inference adds unpredictable latency and co-tenant risk. Dedicated endpoints give you consistent performance and the ability to sign DPAs with your own customers.