Local and Open-Weight Models
Last updated: 2026-04-06
Quick answer: Self-hosting buys control and better unit economics at the cost of GPU ops, release management, and safety testing you must own end to end.
Definition
Open-weight models are models whose weights you can download and run on infrastructure you control (on-prem, VPC, or dedicated host). Local usually means inference close to the user or inside your network boundary, not necessarily on a laptop. This sits alongside hosted model APIs in the broader toolchain.
Why it matters
Data residency requirements, air-gapped environments, and predictable long-run cost can favor self-hosting. Without mature ops, teams end up running shadow-IT GPU clusters with uneven evals and weak rollback.
When to use
Choose local/open-weight when policy requires data not to leave your boundary, when token volume makes API pricing dominant, or when you need offline or fixed-capacity inference.
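The token-volume argument can be made concrete with a break-even calculation. The sketch below compares linear per-token API pricing against the step-function cost of self-hosted GPU nodes; every price and throughput figure here is an illustrative assumption, not a real vendor quote.

```python
import math

# All constants are illustrative assumptions, not real vendor prices.
API_PRICE_PER_M_TOKENS = 2.00    # USD per million tokens (assumed)
GPU_NODE_MONTHLY_COST = 6000.00  # USD per node: amortized hardware + power + ops (assumed)
NODE_CAPACITY_M_TOKENS = 5000    # million tokens/month one node can serve (assumed)

def monthly_cost_api(m_tokens: float) -> float:
    """Hosted API cost scales linearly with token volume."""
    return m_tokens * API_PRICE_PER_M_TOKENS

def monthly_cost_self_hosted(m_tokens: float) -> float:
    """Self-hosting is a step function: you pay for whole nodes."""
    nodes = max(1, math.ceil(m_tokens / NODE_CAPACITY_M_TOKENS))
    return nodes * GPU_NODE_MONTHLY_COST

def break_even_m_tokens() -> float:
    """Volume at which the first node pays for itself vs. API pricing."""
    return GPU_NODE_MONTHLY_COST / API_PRICE_PER_M_TOKENS

print(f"Break-even at {break_even_m_tokens():.0f}M tokens/month")
```

Under these assumed numbers, self-hosting only wins past 3,000M tokens/month, and the step-function shape means low-utilization clusters are strictly worse than the API.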
When not to use
Prefer hosted APIs for fastest iteration, elastic scale, and when your team lacks capacity for model versioning, GPU health, and security patching.
Failure modes
Declaring “we deployed a model” without regression tests, without monitoring for output-distribution drift, and without parity checks against a known-good hosted baseline.
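One cheap version of the drift monitoring described above is to track a coarse statistic of model outputs (here, response length) against a saved baseline with a population stability index. This is a minimal stdlib-only sketch; the bucket edges and the 0.2 alert threshold are common rules of thumb, not a standard.

```python
import math

def length_histogram(outputs: list[str], buckets=(50, 150, 400, 1000)) -> list[float]:
    """Bucket outputs by character length; return normalized frequencies."""
    counts = [0] * (len(buckets) + 1)
    for text in outputs:
        counts[sum(len(text) > b for b in buckets)] += 1
    total = max(1, len(outputs))
    return [c / total for c in counts]

def psi(baseline: list[float], current: list[float], eps: float = 1e-6) -> float:
    """Population stability index; > 0.2 is commonly read as meaningful drift."""
    return sum(
        (c - b) * math.log((c + eps) / (b + eps))
        for b, c in zip(baseline, current)
    )

# Toy data: recent outputs are markedly longer than the baseline set.
baseline_outputs = ["ok"] * 80 + ["x" * 300] * 20
recent_outputs = ["ok"] * 30 + ["x" * 300] * 70

score = psi(length_histogram(baseline_outputs), length_histogram(recent_outputs))
print(f"PSI = {score:.3f}")  # a large score should trigger a parity check vs. the hosted baseline
```

The same comparison run against a hosted model's outputs on an identical prompt set gives the parity check: if the self-hosted distribution drifts while the hosted one does not, the problem is in your deployment, not the traffic.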
Related pages
Choosing LLM tiers and families · Engineering code review case study · LLMs in agentic systems · Categories