Embeddings & RAG

Batch-embed millions of documents.
Production-grade RAG pipelines.

Cost per embedding at scale is a throughput problem. Cold-start latency breaks production retrieval. Both are solvable with the right GPU configuration.

Run a batch job Read the docs
The Embeddings Problem at Scale

Cost per token adds up

API-based embedding at scale is expensive. At 10M documents × 512 tokens, that is roughly 5 billion tokens, and your embedding bill rivals your inference bill. You need your own GPU.

Throughput bottlenecks

CPU-based embedding is too slow for large corpus ingestion. A single A100 can embed millions of documents per hour; CPUs cannot come close.

Cold-start latency

Serverless embedders cold-start whenever the endpoint has scaled to zero. For interactive RAG, that adds 2–5 seconds to the first query. Dedicated endpoints don't cold-start.

Batch Inference for Embeddings

Async jobs. Cost-optimized spot.
Throughput over latency.

Corpus ingestion is a batch workload. You don't need low latency — you need to process 50M chunks as cheaply and quickly as possible.

Submit an async job, configure batch size, point it at your storage bucket. Spot pricing applies. Output lands back in your bucket when done.

Async job submission

POST your job spec and dataset location. No blocking call. Webhooks or polling for completion.
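As a minimal sketch of the submit-and-poll flow: the base URL, job-spec fields, and routes below are hypothetical stand-ins, not the documented Aircloud API, so check the docs for the real schema.

```python
import time
import requests

# Hypothetical base URL, routes, and job-spec fields -- illustrative only.
API = "https://api.aircloud.example/v1"
HEADERS = {"Authorization": "Bearer <AIRCLOUD_API_KEY>"}

job_spec = {
    "model": "BAAI/bge-large-en-v1.5",
    "input_uri": "s3://my-bucket/corpus/chunks/",        # your chunked documents
    "output_uri": "s3://my-bucket/corpus/embeddings/",   # where results should land
    "pricing": "spot",
}

# Non-blocking: the POST returns a job id immediately.
job = requests.post(f"{API}/jobs/embeddings", json=job_spec, headers=HEADERS).json()

# Poll until done; a webhook registered in the job spec works the same way.
while True:
    status = requests.get(f"{API}/jobs/{job['id']}", headers=HEADERS).json()["status"]
    if status in ("succeeded", "failed"):
        break
    time.sleep(30)
```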

Auto-batching

The platform packs input sequences into optimal batches for your GPU. BGE-large-en processes thousands of sequences per second on an A100.
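For a rough idea of what the auto-batcher does on your behalf, here is a simplified token-budget packing loop. The real scheduler is model- and GPU-aware; this sketch only illustrates the concept.

```python
from typing import Iterable, Iterator

def pack_batches(chunks: Iterable[tuple[int, str]],
                 max_tokens_per_batch: int = 16384) -> Iterator[list[str]]:
    """Greedy token-budget batching: sort (token_count, text) pairs by length,
    then fill each batch up to a token budget so the GPU stays saturated
    without running out of memory."""
    batch: list[str] = []
    used = 0
    for n_tokens, text in sorted(chunks, key=lambda pair: pair[0]):
        if batch and used + n_tokens > max_tokens_per_batch:
            yield batch
            batch, used = [], 0
        batch.append(text)
        used += n_tokens
    if batch:
        yield batch
```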

Spot pricing

Batch embedding is checkpointable by design. Spot instances cut cost significantly, and interrupted jobs resume from the last completed shard.
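A minimal sketch of shard-level checkpointing, assuming the corpus is split into named shards; in practice the checkpoint would live in your bucket so it survives a spot interruption. The embedding helper is a hypothetical placeholder.

```python
import json
from pathlib import Path

# Illustrative shard checkpointing: record each finished shard so a
# preempted job can skip completed work on restart.
CHECKPOINT = Path("embed_checkpoint.json")

def load_done() -> set[str]:
    return set(json.loads(CHECKPOINT.read_text())) if CHECKPOINT.exists() else set()

def mark_done(shard_id: str, done: set[str]) -> None:
    done.add(shard_id)
    CHECKPOINT.write_text(json.dumps(sorted(done)))

done = load_done()
for shard_id in (f"shard-{i:04d}" for i in range(500)):
    if shard_id in done:
        continue                          # already embedded before the interruption
    # embed_shard_and_upload(shard_id)    # your embedding + upload step (hypothetical)
    mark_done(shard_id, done)
```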

Output to your bucket

Embeddings are written directly to your own S3-compatible private bucket. No shared staging, no egress through our storage.
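Reading the results back is ordinary S3 access. The sketch below assumes the job wrote NumPy .npy shards under a known prefix; the endpoint, bucket, and output format are assumptions to substitute with your own.

```python
import io

import boto3
import numpy as np

# Endpoint URL, bucket, prefix, and .npy shard format are illustrative assumptions.
s3 = boto3.client("s3", endpoint_url="https://s3.my-cloud.example")
bucket, prefix = "my-bucket", "corpus/embeddings/"

shards = []
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
        shards.append(np.load(io.BytesIO(body)))

embeddings = np.concatenate(shards)   # shape: (num_chunks, embedding_dim)
```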

RAG Production Stack

Embedding generation, vector DB, inference — all on private GPU.

A production RAG pipeline has three GPU-adjacent components. Aircloud handles two of them; you bring your vector DB of choice.

01

Embedding generation

Batch-embed your corpus on a dedicated GPU endpoint running BGE, E5, Nomic, or your custom fine-tuned embedder. The same private endpoint serves live queries with consistent latency.
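If the endpoint exposes an OpenAI-compatible embeddings API (an assumption here, as are the URL and model name), live query embedding is a few lines:

```python
from openai import OpenAI

# Assumes an OpenAI-compatible /v1/embeddings route on your private endpoint.
embedder = OpenAI(base_url="https://my-embedder.example.com/v1", api_key="<API_KEY>")

def embed_query(text: str) -> list[float]:
    resp = embedder.embeddings.create(model="BAAI/bge-large-en-v1.5", input=text)
    return resp.data[0].embedding
```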

02

Vector DB integration

Write embeddings to Pinecone, Weaviate, Qdrant, pgvector, or any vector store. Aircloud handles compute; you keep control of your vector index.
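For example, with Qdrant (the other stores follow the same pattern), loading the batch output looks roughly like this; `embeddings` and `chunks` are assumed to have been read back from your bucket.

```python
import numpy as np
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

def index_corpus(embeddings: np.ndarray, chunks: list[str],
                 collection: str = "docs") -> None:
    """Upsert batch-embedded chunks into a Qdrant collection you control."""
    client = QdrantClient(url="http://localhost:6333")   # or your managed cluster
    client.create_collection(
        collection_name=collection,
        vectors_config=VectorParams(size=embeddings.shape[1], distance=Distance.COSINE),
    )
    client.upsert(
        collection_name=collection,
        points=[
            PointStruct(id=i, vector=vec.tolist(), payload={"text": text})
            for i, (vec, text) in enumerate(zip(embeddings, chunks))
        ],
    )
```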

03

Inference endpoint

Retrieved chunks go into a private LLM endpoint: GLM 5.1, Gemma 4, Kimi 2.5, or your fine-tuned model. Single-tenant, consistent time-to-first-token, no shared-GPU variability.
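Tying it together, a minimal retrieve-then-generate call, again assuming an OpenAI-compatible chat API on the private endpoint; the URL and model name are placeholders.

```python
from openai import OpenAI

# Assumes an OpenAI-compatible chat completions route on your private LLM endpoint.
llm = OpenAI(base_url="https://my-llm.example.com/v1", api_key="<API_KEY>")

def answer(question: str, retrieved_chunks: list[str]) -> str:
    """Stuff retrieved chunks into the prompt and generate an answer."""
    context = "\n\n".join(retrieved_chunks)
    resp = llm.chat.completions.create(
        model="my-fine-tuned-model",   # placeholder model name
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```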

Commonly Used Embedding Models

Model

BGE-large-en-v1.5

Embedding dims: 1024
Recommended GPU: L40S 48GB or A100

Strong English retrieval; among the top open models on BEIR. An L40S handles 4K+ sequences/sec.

Model

E5-mistral-7b-instruct

Embedding dims: 4096
Recommended GPU: A100 80GB

Instruction-tuned, with strong multilingual retrieval. Needs an A100 for high-throughput batch work.

Model

Nomic-embed-text-v1.5

Embedding dims: 768
Recommended GPU: L40S 48GB

Open source, with Matryoshka embeddings that truncate cleanly to smaller dimensions. Fast inference; fits comfortably on an L40S.

Model

Custom fine-tuned embedder

Embedding dims: Varies
Recommended GPU: Matched to model size

Bring your own model, trained on your domain. Deploy as a private endpoint the same way.

Embed your corpus.
At production cost.

Batch jobs on spot. Dedicated endpoints for live queries. Both run on isolated GPUs, priced by compute rather than per token.

Get Started Talk to Sales