Question 1

What is Cumulus Labs?

Accepted Answer

Cumulus Labs is the unified inference platform for production AI workloads. It consolidates eight subsystems — OpenAI-compatible gateway, per-workflow router, prompt and KV cache, observability, continuous evaluation, LoRA fine-tuning, custom open-weight hosting, and the Ion inference engine — behind one API and one client. Cumulus is backed by Y Combinator (W26) and the NVIDIA Inception Program.

Question 2

How does the Cumulus drop-in API work?

Accepted Answer

Cumulus is OpenAI-compatible. Point any OpenAI, Anthropic, LangChain, LlamaIndex, or Vercel AI SDK client at api.cumuluslabs.io/v1 and your existing code continues to work. The Cumulus Router applies declared routing rules per workflow to choose the right model, provider, and infrastructure for each request — including failover, caching, and shadow evaluation, with zero code changes.

Question 3

What is the Ion inference engine?

Accepted Answer

Ion is the Cumulus inference engine. It runs on our own NVIDIA Grace and Blackwell fleet with custom attention kernels. On the same hardware Ion delivers 30 to 50% more throughput than stock vLLM and SGLang, which means more requests per second and lower cost per token. Ion is the runtime behind custom hosting and LoRA-served workloads on Cumulus.

Question 4

How does Cumulus handle observability, evaluation, and quality?

Accepted Answer

Every request through Cumulus is logged with input, output, model, latency, cost, and quality signals — a replayable audit log and a real-time dashboard. The Evaluation subsystem goes beyond LLM-judge consensus: it generates rubrics from golden examples, synthesizes test data, runs deterministic heuristic checks, and shadow-evaluates candidate models against live production traffic. Teams approve safer or cheaper swaps one workflow at a time.

Question 5

How does Cumulus reduce token cost and improve latency?

Accepted Answer

Cumulus stacks three approaches. The Cache subsystem combines exact-match, prefix, and optional semantic caches to cut input tokens by 40 to 70% on real workloads. The Fine-tune subsystem identifies candidate workflows from production traffic and trains LoRAs that serve the same task on smaller, faster models. The Ion engine delivers 30 to 50% more throughput than vLLM and SGLang on NVIDIA Grace. Combined, these directly reduce cost per token and end-to-end latency.

Question 6

Does Cumulus support multi-provider routing and failover?

Accepted Answer

Yes. The Cumulus Router applies declared, deterministic, traceable rules to pick the model and provider for each request. Providers are health-checked continuously across reasoning, summarization, classification, vision, and other modalities. When a provider degrades or drops, traffic reroutes automatically before users notice — eliminating bespoke if-else failover logic and untested failover paths.

Cumulus
Labs

Change one line.
Keep your code.

Workflows optimized
top down.

Production-grade,
not demo-grade.

Custom kernels on NVIDIA Grace.

Provider outages routed around.

More than judge consensus.

Production is brittle.
Five vendors. Five blind spots.

Eight subsystems.
One platform.

Where it lands.

Routing logic in if-else statements.

Frontier-only by default.

Frontier latency too high. Open-weight too slow.

Four tools across three clouds.

CumulusLabs

Change one line.Keep your code.

Workflows optimizedtop down.

Production-grade,not demo-grade.