Self-hosted LLMs: architecture, inference and cost
Why finance, healthcare and government must self-host, and a production path for shipping a 70B model in 6 weeks.
Not every enterprise can ship data to the cloud. This article lays out a private-deployment playbook validated by 6 customers: hardware sizing, inference framework selection (vLLM / TGI / TensorRT-LLM), quantization, capacity planning, cost modelling, and a 6-week engineering roadmap.
Hardware: minimum viable setup for a 70B model
Under INT4 quantization, the weights of a 70B model take roughly 35 GB, so the model fits on 1×H100 (80 GB) or 2×A100 (80 GB). For production, we recommend 4×H100 + vLLM: ~2000 tokens/s of throughput, serving 200+ concurrent users. When VRAM is tight, INT8 + TensorRT-LLM is the more aggressive option.
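The sizing figures above can be sanity-checked with back-of-envelope arithmetic. The sketch below is illustrative only: it counts weight memory (parameters × bits ÷ 8) and ignores KV cache, activations and framework overhead, all of which need extra headroom in practice. The function names are hypothetical; the 2000 tokens/s and 200-user inputs are the figures quoted in this section.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes), weights only."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


def per_user_tokens_per_s(total_tokens_per_s: float, concurrent_users: int) -> float:
    """Average generation rate per user if throughput is shared evenly."""
    return total_tokens_per_s / concurrent_users


if __name__ == "__main__":
    for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
        print(f"70B @ {label}: {weight_memory_gb(70, bits):.0f} GB of weights")
    # Figures quoted above for 4xH100 + vLLM.
    print(f"Per-user rate: {per_user_tokens_per_s(2000, 200):.0f} tokens/s")
```

At INT4 the weights come to about 35 GB, which leaves room on a single 80 GB card for KV cache; at INT8 they come to about 70 GB, which leaves very little. Spread evenly, 2000 tokens/s across 200 users works out to roughly 10 tokens/s per user.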
Inference frameworks: how to choose vLLM / TGI / TensorRT-LLM
vLLM fits fast iteration, frequent model swaps, and multi-GPU scaling. TGI fits deep HuggingFace integration. TensorRT-LLM fits a fixed model + peak performance. We pick vLLM in 80% of private deployments.
Quantization: INT4 is production-ready
Under AWQ INT4, a 70B model loses less than 2% on MT-Bench, while latency drops by 50% and VRAM use halves. For production, we recommend INT4 by default, with INT8 calibration for critical capabilities. Quantizing after SFT fine-tuning is more stable.
Cost: 6 weeks, one 8×H100 node, ~350K RMB
The figure includes hardware depreciation, power and ops. Buying the hardware outright instead of renting cloud GPUs pays back in about 9 months. If monthly volume is below 500M tokens, a cloud API is the more cost-effective choice.
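The payback reasoning can be expressed as a simple break-even calculation. This is a minimal sketch under stated assumptions: the function names are hypothetical, the break-even inputs in the example are placeholder values in arbitrary currency units (not figures from this article), and the utilization check reuses the ~2000 tokens/s quoted earlier for 4×H100 with an assumed 30-day month.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumes a 30-day month


def break_even_months(upfront_cost: float, monthly_cloud_cost: float,
                      monthly_self_host_cost: float) -> float:
    """Months until the monthly saving over cloud has repaid the upfront cost."""
    monthly_saving = monthly_cloud_cost - monthly_self_host_cost
    if monthly_saving <= 0:
        raise ValueError("self-hosting never pays back at these rates")
    return upfront_cost / monthly_saving


def utilization(monthly_tokens: float, tokens_per_s: float) -> float:
    """Fraction of a month's peak token capacity that is actually used."""
    return monthly_tokens / (tokens_per_s * SECONDS_PER_MONTH)


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to illustrate the formula.
    print(break_even_months(upfront_cost=900, monthly_cloud_cost=150,
                            monthly_self_host_cost=50))  # 9.0
    # 500M tokens/month against ~2000 tokens/s of sustained capacity.
    print(f"{utilization(500e6, 2000):.1%}")  # 9.6%
```

One way to read the 500M-token threshold: at that volume a deployment capable of ~2000 tokens/s would sit below 10% utilization, so most of the hardware cost would be paying for idle capacity.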
A vLLM launch + quantized load snippet
# Launch vLLM with the AWQ INT4-quantized private-70b
# 4×H100 (80 GB), ~2000 tokens/s throughput
# --ipc=host lets the tensor-parallel workers use the host's shared memory;
# a 1g --shm-size is likely too small for 4-way tensor parallelism
docker run -d --gpus all --ipc=host \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  your-vllm-runtime:latest \
  --model ouryun/private-70b-int4 \
  --quantization awq \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 32768 \
  --enable-prefix-caching \
  --served-model-name ouryun-private-70b

# Health check: lists the served models
curl -s http://localhost:8000/v1/models | jq .

# Call (Chat Completions compatible protocol)
# "model" must match --served-model-name above
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "ouryun-private-70b",
    "messages": [{"role": "user", "content": "Summarize the key private-deployment decisions in 3 sentences"}]
  }'
}'