Gigabox Apps · SovereignLive

Your own GPUs, one API endpoint.

Sovereign runs DeepSeek V4 Flash on self-hosted H200 GPUs and exposes an OpenAI-compatible API. When GPUs are offline, requests fall back to OpenRouter transparently. Apps swap one env var and never think about routing again.

sovereign.gigabox.ai/v1/chat/completions
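
A minimal sketch of a raw request against that endpoint using httpx. The model id and the Bearer auth header format are assumptions; the sv- key prefix comes from the key-management section below.

    # Minimal request sketch against the Sovereign endpoint (illustrative).
    import httpx

    resp = httpx.post(
        "https://sovereign.gigabox.ai/v1/chat/completions",
        headers={"Authorization": "Bearer sv-your-key-here"},  # sv- key, header format assumed
        json={
            "model": "deepseek-v4-flash",  # assumed model id
            "messages": [{"role": "user", "content": "Hello"}],
        },
        timeout=60,
    )
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])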

What you get

An OpenAI-compatible inference endpoint backed by your own GPUs, with automatic cloud fallback and full usage visibility.

GPU-First Routing

Requests hit your self-hosted vLLM instance first. If the GPU pod is down or unreachable, the proxy falls back to OpenRouter automatically. Your apps see one endpoint.
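
A minimal sketch of that routing decision, not the shipped proxy code; the upstream URLs, timeouts, and error handling are assumptions.

    # Sketch of GPU-first routing with OpenRouter fallback (illustrative only).
    import httpx

    GPU_URL = "https://gpu-pod.example:8000/v1/chat/completions"      # assumed vLLM upstream
    OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

    async def route_completion(payload: dict, client: httpx.AsyncClient) -> dict:
        try:
            # Try the self-hosted vLLM instance first.
            r = await client.post(GPU_URL, json=payload, timeout=10.0)
            r.raise_for_status()
            return {"source": "gpu", "body": r.json()}
        except (httpx.TransportError, httpx.HTTPStatusError):
            # Pod down or unreachable: fall back to OpenRouter transparently.
            r = await client.post(OPENROUTER_URL, json=payload, timeout=30.0)
            r.raise_for_status()
            return {"source": "fallback", "body": r.json()}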

OpenAI-Compatible API

Drop-in replacement for OpenAI and OpenRouter. Apps swap one env var (base_url) and keep the same SDK calls, streaming, and tool use.
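
For example, with the OpenAI Python SDK only base_url and the key change; the model id shown is an assumption.

    # Drop-in usage with the OpenAI SDK: swap base_url, keep the same calls.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://sovereign.gigabox.ai/v1",
        api_key="sv-your-key-here",
    )

    stream = client.chat.completions.create(
        model="deepseek-v4-flash",  # assumed model id
        messages=[{"role": "user", "content": "Summarize this release note."}],
        stream=True,                # streaming works unchanged
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)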

API Key Management

Create scoped API keys with an sv- prefix. Each key tracks usage independently: requests, tokens, cost, and fallback percentage.
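
An illustration of the sv- key format; the generation and storage details here are assumptions, not Sovereign's actual implementation.

    # Sketch of minting an sv- prefixed key (illustrative).
    import hashlib
    import secrets

    def mint_api_key() -> tuple[str, str]:
        """Return (plaintext key for the caller, sha256 digest to store server-side)."""
        key = "sv-" + secrets.token_urlsafe(32)
        digest = hashlib.sha256(key.encode()).hexdigest()
        return key, digest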

Usage Tracking

Every request is metered by source (GPU vs fallback), model, prompt tokens, and completion tokens. Usage is aggregated hourly and queryable via the management API.
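
A sketch of a per-request usage record and an hourly rollup key covering the fields described above; the names and schema are assumptions.

    # Usage metering sketch: one record per request, aggregated by hour (illustrative).
    from dataclasses import dataclass
    from datetime import datetime, timezone

    @dataclass
    class UsageRecord:
        api_key_id: str
        source: str            # "gpu" or "fallback"
        model: str
        prompt_tokens: int
        completion_tokens: int
        created_at: datetime

    def hourly_bucket(rec: UsageRecord) -> tuple[str, str, str, str]:
        """Aggregation key: api key, source, model, timestamp truncated to the hour."""
        hour = rec.created_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:00Z")
        return (rec.api_key_id, rec.source, rec.model, hour)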

Transparent Fallback

When GPU pods are stopped or crash, the proxy routes to OpenRouter with the correct model mapping. Apps never see an error — just a different billing source.
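
A sketch of the model mapping applied before a fallback request; the OpenRouter slug shown is an assumption, not a confirmed identifier.

    # Map the locally served model name to its OpenRouter slug (illustrative).
    MODEL_MAP = {
        "deepseek-v4-flash": "deepseek/deepseek-v4-flash",  # assumed OpenRouter slug
    }

    def map_model_for_fallback(payload: dict) -> dict:
        fallback = dict(payload)
        fallback["model"] = MODEL_MAP.get(payload["model"], payload["model"])
        return fallback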

Pod Lifecycle Control

Start and stop RunPod GPU instances on demand. Volumes persist across stops so the model cache survives — restarts load from disk, not from HuggingFace.
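
A lifecycle sketch assuming the runpod-python SDK's resume_pod / stop_pod helpers; the pod id, GPU count, and exact signatures are assumptions.

    # Pod lifecycle control sketch (illustrative; SDK call signatures assumed).
    import os
    import runpod

    runpod.api_key = os.environ["RUNPOD_API_KEY"]
    POD_ID = "your-pod-id"  # placeholder

    def start_gpu() -> None:
        # Resume the stopped pod; the attached volume keeps the model cache,
        # so vLLM reloads weights from disk rather than HuggingFace.
        runpod.resume_pod(POD_ID, gpu_count=2)

    def stop_gpu() -> None:
        # Stop GPU billing while preserving the volume.
        runpod.stop_pod(POD_ID)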

How it's built

A FastAPI proxy on the VM routes to vLLM on RunPod GPUs, with OpenRouter as the fallback layer.

Proxy: FastAPI + httpx (SSE streaming; see the sketch below)
GPU: 2x NVIDIA H200 SXM on RunPod
Model: DeepSeek V4 Flash (284B MoE, FP4+FP8)
Serving: vLLM 0.20 (tensor parallel, enforce-eager)
Fallback: OpenRouter (same model, cloud pricing)
Infra: GCE + nginx + systemd + acme.sh SSL
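
A minimal sketch of the SSE passthrough named in the Proxy row, using FastAPI and httpx; the upstream URL is a placeholder and auth, fallback, and metering are omitted.

    # SSE passthrough sketch: stream upstream bytes through unchanged (illustrative).
    import httpx
    from fastapi import FastAPI, Request
    from fastapi.responses import StreamingResponse

    app = FastAPI()
    UPSTREAM = "https://gpu-pod.example:8000/v1/chat/completions"  # assumed upstream

    @app.post("/v1/chat/completions")
    async def chat(request: Request):
        payload = await request.json()

        async def relay():
            # Relay the upstream server-sent events without re-parsing them.
            async with httpx.AsyncClient(timeout=None) as client:
                async with client.stream("POST", UPSTREAM, json=payload) as upstream:
                    async for chunk in upstream.aiter_bytes():
                        yield chunk

        return StreamingResponse(relay(), media_type="text/event-stream")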

Why we built it

Every Gigabox app — EHR, OpenClaw, Hermes — routes LLM calls through OpenRouter. That works, but it means our entire intelligence layer depends on a single vendor. We wanted to own the inference path without forcing every app to handle failover logic.

Sovereign is a proxy that sits between our apps and the models. It tries self-hosted GPUs first (RunPod H200 pods running vLLM with DeepSeek V4 Flash) and falls back to OpenRouter when GPUs are offline. Apps change one env var and get GPU-grade inference when available, cloud pricing when not.

The entire platform — proxy, pod management, API key system, usage tracking, and deployment — was built and deployed by AI in a single session. DeepSeek V4 Flash serves at ~10 tokens per second per request on 2x H200 GPUs, scaling to ~36 tokens per second aggregate under concurrent load.