
Self-hosted LLMs: architecture, inference and cost

Why finance, healthcare and government must self-host, and a production path to ship a 70B model in 6 weeks.

5/22/2026 · 15 min read · Finance / healthcare / public sector / manufacturing
Headline figure: 70B (production-grade private model size)

Not every enterprise can ship data to the cloud. This article lays out a private-deployment playbook validated by 6 customers: hardware sizing, inference framework selection (vLLM / TGI / TensorRT-LLM), quantization, capacity planning, cost modelling, and a 6-week engineering roadmap.

01

Hardware: minimum viable setup for a 70B model

Under INT4 quantization, the weights of a 70B model take roughly 35 GB (70B parameters × 0.5 bytes each), so the model fits on 1×H100 (80 GB) or 2×A100 (80 GB). For production, we recommend 4×H100 with vLLM: ~2,000 tokens/s of throughput, serving 200+ concurrent users. When VRAM is tight, INT8 + TensorRT-LLM is the more aggressive option.
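The sizing arithmetic above can be reproduced with a few lines of Python. This is a rough sketch, not a capacity planner: the helper names are our own, the estimate covers weights only (KV cache, activations and quantization metadata need extra headroom), and the per-user figure simply divides the quoted aggregate throughput by the quoted concurrency.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Weight-only memory in GB (1 GB = 1e9 bytes).

    params_billion * 1e9 weights * (bits / 8) bytes, divided by 1e9,
    simplifies to params_billion * bits / 8.
    """
    return params_billion * bits_per_weight / 8


def per_user_tokens_per_s(total_tokens_per_s: float, concurrent_users: int) -> float:
    """Average decode rate each user sees if throughput is shared evenly."""
    return total_tokens_per_s / concurrent_users


if __name__ == "__main__":
    for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
        print(f"{label}: ~{weight_memory_gb(70, bits):.0f} GB of weights")
    # 2,000 tokens/s shared by 200 concurrent users
    print(f"Per user: {per_user_tokens_per_s(2000, 200):.0f} tokens/s")
```

At INT4 the weights leave more than half of an 80 GB card free for KV cache, which is what makes the single-GPU minimum viable.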

02

Inference frameworks: how to choose vLLM / TGI / TensorRT-LLM

vLLM fits fast iteration, frequent model swaps, and multi-GPU scaling. TGI fits teams that want deep Hugging Face integration. TensorRT-LLM fits a fixed model where peak performance matters most. We pick vLLM in 80% of private deployments.

03

Quantization: INT4 is production-ready

With AWQ INT4, a 70B model loses less than 2% quality on MT-Bench, while latency drops by 50% and VRAM use halves. For production, we recommend INT4, with INT8 calibration for critical capabilities. Quantizing after SFT fine-tuning is more stable than quantizing before it.
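One way to make the "< 2% quality loss" budget enforceable is a small regression gate that runs after each quantization. The sketch below is an assumption about how such a gate could look, not a description of our pipeline: the function names are hypothetical and the scores in the usage lines are made-up placeholders, not measured MT-Bench results.

```python
def relative_drop(baseline_score: float, quantized_score: float) -> float:
    """Fraction of the baseline score lost after quantization."""
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive")
    return (baseline_score - quantized_score) / baseline_score


def passes_quality_gate(
    baseline_score: float, quantized_score: float, max_drop: float = 0.02
) -> bool:
    """True when the quantized model stays within the allowed quality loss."""
    return relative_drop(baseline_score, quantized_score) < max_drop


if __name__ == "__main__":
    # Placeholder scores for illustration only.
    print(passes_quality_gate(8.0, 7.9))  # 1.25% drop, inside the 2% budget
    print(passes_quality_gate(8.0, 7.8))  # 2.5% drop, outside the 2% budget
```

Running the same gate per capability (for example, separate scores for reasoning and extraction) is a natural way to decide which capabilities need the INT8 treatment.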

04

Cost: 6 weeks, one 8×H100 node, ~350K RMB

This figure includes hardware depreciation, power, and ops. Compared with cloud rental, buying the hardware outright pays back in ~9 months. If monthly token volume is below 500M, using a cloud API is more cost-effective.
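The payback and break-even claims follow from two simple formulas, sketched below. The function names are our own and every number in the usage lines is a placeholder chosen only so the arithmetic lands on the ~9 months and 500M tokens quoted above; they are not the actual prices behind those figures.

```python
def payback_months(
    purchase_cost: float, monthly_rental: float, monthly_owned_opex: float
) -> float:
    """Months until buying beats renting: purchase cost / monthly saving."""
    monthly_saving = monthly_rental - monthly_owned_opex
    if monthly_saving <= 0:
        return float("inf")  # owning never pays back
    return purchase_cost / monthly_saving


def breakeven_tokens_per_month(
    monthly_self_hosted_cost: float, api_price_per_million_tokens: float
) -> float:
    """Monthly token volume at which a cloud API costs as much as self-hosting."""
    return monthly_self_hosted_cost / api_price_per_million_tokens * 1_000_000


if __name__ == "__main__":
    # Placeholder inputs in arbitrary currency units.
    print(payback_months(900, 150, 50))             # 9.0 months
    print(breakeven_tokens_per_month(50_000, 100))  # 500000000.0 tokens
```

Below the break-even volume the fixed monthly cost of owned hardware is spread over too few tokens, which is why the cloud API wins for small workloads.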

A vLLM launch + quantized load snippet (bash)

# Launch vLLM with the AWQ INT4-quantized private-70b
# 4xH100 (80G), ~2000 token/s throughput
# --ipc=host lets the tensor-parallel workers use host shared memory;
# a 1 GB --shm-size can be too small for multi-GPU inference
docker run -d --gpus all --ipc=host \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  your-vllm-runtime:latest \
  --model ouryun/private-70b-int4 \
  --quantization awq \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 32768 \
  --enable-prefix-caching \
  --served-model-name ouryun-private-70b

# Health check (-s silences curl's progress meter)
curl -s http://localhost:8000/v1/models | jq .

# Call (Chat Completions compatible protocol)
# "model" must match the value passed to --served-model-name
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "ouryun-private-70b",
    "messages": [{"role": "user", "content": "Summarize the key private-deployment decisions in 3 sentences"}]
  }'
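
For application code, the same call can be made from Python with only the standard library. This is a minimal sketch that assumes the server started above is reachable and returns the usual Chat Completions response shape; the helper names `build_chat_payload` and `chat` are our own, not part of vLLM.

```python
import json
import urllib.request


def build_chat_payload(model: str, prompt: str) -> dict:
    """Request body for the Chat Completions compatible endpoint."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


def chat(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """POST one user message and return the assistant's reply text."""
    body = json.dumps(build_chat_payload(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    # Assumes the standard response shape: choices[0].message.content
    return reply["choices"][0]["message"]["content"]
```

Call it as `chat("http://localhost:8000", served_name, "...")`, where `served_name` is the value passed to `--served-model-name` in the launch command. Because the endpoint is local, the prompt and the reply never leave the perimeter.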

Outcomes

Private LLM deployment · quantified outcomes

6 weeks: kickoff to production
70B: max supported model size
Zero: data leaving the perimeter