Why we self-host our LLM workloads (and when we don't)

The bill from your frontier-model provider this month was higher than last month. Again. And one of your customers asked, in writing, where their data ends up.

If you've felt either of those, this post is for you.

The three options

In 2026 you have, roughly, three options for running LLM workloads in production:

Frontier APIs. Claude, GPT-5, Gemini. Best models, fastest time-to-first-result, highest per-token cost, opaque to your compliance team.
Hosted open-weight. Fireworks, Together, Groq, Anyscale. Open-weight models on someone else's GPUs. Cheaper than frontier, fewer compliance promises than frontier.
Self-hosted open-weight. Llama, Qwen, Mistral, DeepSeek on your own infrastructure. Most control, most engineering overhead.

Every Wenable client engagement uses some mix of the three. There is no universal answer, but there are useful rules of thumb.

When we choose frontier APIs

The task is hard and the volume is low. A monthly executive report that runs once and matters a lot? Pay the frontier rate. You're buying brain, not throughput.
Time-to-first-customer matters more than per-token cost. Early product validation, where the model behind the API isn't a load-bearing decision yet.
The task is creative or open-ended. Frontier models still win on the long tail. Customer-support agents in unconstrained domains, brainstorming UIs, writing assistants.

When we choose hosted open-weight

High volume, moderate task complexity. Document classification. Semantic-search re-ranking. Structured extraction. The bread and butter of enterprise AI.
The customer accepts data residency in a specific region. Most hosted providers offer regional inference, and that's enough for most compliance regimes.
You want to switch model versions without re-procuring infrastructure. This is the underrated middle ground.

When we choose self-hosted

Data sovereignty is non-negotiable. Pharma, healthcare, financial services, defense. Sometimes the contract names the rack the data can live on.
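You're running more than 100M tokens/day on a stable workload. At that volume the unit economics flip: a fixed GPU bill spread over that many tokens can undercut per-token pricing, as long as the hardware stays busy.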
You need a fine-tuned model. It's far easier to fine-tune, eval, and roll out an open-weight model you own than to keep a fine-tune working through frontier-API version churn.
You need offline or edge deployment. Trucking-fleet ECU. Hospital LAN. Manufacturing floor where the network drops at 4pm every Tuesday.
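
To make the volume rule of thumb concrete, here is a back-of-the-envelope break-even sketch. Every number in it is a placeholder assumption, not a figure from our engagements: the API price, the per-GPU daily cost, and the per-GPU throughput are hypothetical inputs to replace with your own quotes and measured numbers. It also ignores engineering time, which is often the larger cost.

```python
import math


def self_host_daily_cost(tokens_per_day, gpu_cost_per_day, tokens_per_gpu_per_day):
    """Daily GPU bill: whole GPUs needed to serve the volume, times cost per GPU."""
    gpus = max(1, math.ceil(tokens_per_day / tokens_per_gpu_per_day))
    return gpus * gpu_cost_per_day


def api_daily_cost(tokens_per_day, api_price_per_mtok):
    """Daily per-token bill at a flat blended price per million tokens."""
    return tokens_per_day / 1_000_000 * api_price_per_mtok


# Placeholder inputs (assumptions, replace with your own numbers).
API_PRICE_PER_MTOK = 1.00             # USD per million tokens
GPU_COST_PER_DAY = 60.00              # USD per GPU per day, all-in
TOKENS_PER_GPU_PER_DAY = 200_000_000  # sustained throughput of one GPU

for volume in (10_000_000, 100_000_000, 1_000_000_000):
    api = api_daily_cost(volume, API_PRICE_PER_MTOK)
    own = self_host_daily_cost(volume, GPU_COST_PER_DAY, TOKENS_PER_GPU_PER_DAY)
    print(f"{volume:>13,} tokens/day  api=${api:,.2f}  self-hosted=${own:,.2f}")
```

With these placeholder inputs the API is cheaper at 10M tokens/day and self-hosting is cheaper at 100M and above. The shape of the comparison is the point, not the specific crossover: the self-hosted cost moves in whole-GPU steps, so low or bursty utilization pushes the break-even volume up quickly.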

What we run

The self-hosted stack we standardized on:

Serving — vLLM with continuous batching for most workloads, SGLang for structured-output workloads where constrained decoding matters.
Routing — a thin Go service that picks the right model for the request based on prompt shape and cost class.
Quantization — FP8 in production for most workloads. Quality loss is negligible against our evals; throughput is 2–3x higher.
Observability — OpenTelemetry traces from every inference call, including the prompt hash, model version, and quantization config.
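
As an illustration of the observability item, here is a minimal sketch of the attributes one inference call might carry. The attribute names (`llm.prompt_hash` and so on), the version label, and the shape of the quantization config are assumptions made for this example, not a description of our actual schema. The sketch uses only the standard library.

```python
import hashlib
import json


def inference_span_attributes(prompt, model_version, quantization):
    """Build flat span attributes for one inference call.

    Only a SHA-256 digest of the prompt is recorded, never the prompt text.
    """
    prompt_hash = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return {
        "llm.prompt_hash": prompt_hash,
        "llm.model_version": model_version,
        # Serialized with sorted keys so identical configs compare equal as strings.
        "llm.quantization": json.dumps(quantization, sort_keys=True),
    }


attrs = inference_span_attributes(
    "Classify this invoice.",
    model_version="example-model-v1",  # hypothetical version label
    quantization={"weights": "fp8"},   # hypothetical config shape
)
print(attrs)
```

With the OpenTelemetry Python API, a dict like this can be attached to the current span with `span.set_attributes(attrs)`. Hashing the prompt lets you group identical requests in traces without storing customer text in the telemetry backend.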

We don't run training infrastructure ourselves. For fine-tuning runs we rent H100s for the duration of the run, then bring the weights home and serve them on our own boxes.

The honest tradeoff

Self-hosting is not free. You'll spend engineer-weeks on GPU drivers, batch tuning, prompt-cache hit rates, and OOMs at 3am. Make sure the workload is worth that investment before you sign up.

But once you've done it, you have something a competitor running a thin wrapper on a frontier API does not: control over cost, latency, data, and the future shape of the system.

Most enterprises we work with land in a hybrid: frontier for the long tail, hosted open-weight for the workhorse, self-hosted for the high-volume regulated workloads. The trick is to make the routing decision invisible to the rest of the application — so when the pricing or privacy story shifts, you change routing rules, not the product.
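
One way to keep that decision out of application code is to express the routing rules as data. The sketch below is a simplified illustration in Python, not our production router: the `Request` fields, the cost-class labels, and the backend names are all hypothetical, and it leaves out prompt-shape inspection entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    prompt: str
    cost_class: str  # hypothetical labels: "bulk" or "premium"
    regulated: bool  # True if the data must stay on our own infrastructure


# Ordered rules: the first predicate that matches decides the backend.
# A change in pricing or privacy policy means editing this table only.
ROUTING_RULES = [
    (lambda r: r.regulated, "self_hosted"),
    (lambda r: r.cost_class == "bulk", "hosted_open_weight"),
    (lambda r: r.cost_class == "premium", "frontier_api"),
]
DEFAULT_BACKEND = "hosted_open_weight"


def pick_backend(request, rules=ROUTING_RULES, default=DEFAULT_BACKEND):
    """Return the name of the backend tier that should serve this request."""
    for predicate, backend in rules:
        if predicate(request):
            return backend
    return default


print(pick_backend(Request("Summarize this chart.", "premium", regulated=True)))
print(pick_backend(Request("Classify this ticket.", "bulk", regulated=False)))
```

Callers only ever see `pick_backend`. Because the regulated check comes first, a regulated request goes to the self-hosted tier even when its cost class would otherwise send it to a frontier API, and reordering or rewriting the table changes behavior without touching the product.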

If that sounds like a problem you have, we should talk.