Two years ago the LLM debate was "ChatGPT vs Claude vs Gemini". In 2026 the banking question is different: "can I run the model inside my own VPC, and at what cost?" As KVKK, BDDK and the upcoming EU AI Act tighten, sending customer data to a third-party provider requires explicit consent per request. The answer: on-prem LLM. Three engines have matured for it.
Ollama — developer ergonomics
Best for mixed CPU + GPU workloads. Single binary, GGUF model format, cold start in seconds. With settings such as OLLAMA_KV_CACHE_TYPE=q8_0, OLLAMA_MAX_LOADED_MODELS=3 and OLLAMA_KEEP_ALIVE=60m, small budgets remain workable. Ideal for pilots and narrow production: at one bank, an 8-vCPU / 24 GB RAM VM with no GPU, running Qwen2.5 7B q4_K_M, produces a narrated answer in 8-15 seconds. Not full automation, but a viable analyst assistant when paired with banking-tr few-shots.
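The settings above map to Ollama server environment variables. A minimal sketch, with the values cited in the text; tune them for your host:

```shell
# Ollama server tuning for a small CPU/GPU host (values from the pilot above)
export OLLAMA_KV_CACHE_TYPE=q8_0     # quantise the KV cache to 8-bit
export OLLAMA_MAX_LOADED_MODELS=3    # keep up to three models resident
export OLLAMA_KEEP_ALIVE=60m         # hold a model in memory for an hour after last use
ollama serve &
ollama pull qwen2.5:7b-instruct-q4_K_M
```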
vLLM — production throughput
By far the highest throughput on GPU. PagedAttention and continuous batching run 5-10× more concurrent requests on the same card. Typical bank-grade production: 1× RTX 4090 or 1× L40S running Qwen2.5 14B q4 + nomic-embed-text 768. The math: planner + narrator + embedding service fit inside 24 GB VRAM, sustaining ~50 concurrent queries on a single card.
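A single-card deployment of the kind described can be sketched with the vLLM CLI. The model ID and limits below are illustrative assumptions, not a prescription; a lower GPU-memory fraction leaves VRAM headroom for a co-located embedding worker:

```shell
# vLLM on one 24 GB-class GPU; AWQ 4-bit quantisation keeps the 14B model in budget
vllm serve Qwen/Qwen2.5-14B-Instruct-AWQ \
  --quantization awq \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192 \
  --max-num-seqs 64
```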
TEI — text embeddings inference
HuggingFace’s open-source embedding server. Serves nomic-embed-text-v1.5, bge-m3, e5-mistral and friends in an optimised process. Keeping embeddings on a separate worker stops them from contending with the planner / narrator. In a Copilot architecture like CentraQL, planner + narrator run on vLLM and embeddings on TEI, so no single process becomes the point of contention.
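Running TEI as that separate worker is a one-liner with the official container. The port and model are assumptions; TEI exposes a simple REST API for embedding:

```shell
# TEI as a dedicated embedding worker (container listens on port 80 internally)
docker run -d --gpus all -p 8081:80 \
  ghcr.io/huggingface/text-embeddings-inference:latest \
  --model-id nomic-ai/nomic-embed-text-v1.5

# Embed a text via the REST API
curl http://localhost:8081/embed \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "loan portfolio risk summary"}'
```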
Compliance dimension
The EU AI Act’s high-risk category (credit scoring, biometric identification) mandates audit trails, explainability and human-in-the-loop. The KVKK March-2026 guidance crystallised the explicit-consent requirement for personal data flowing to LLM providers. Every cloud LLM call is therefore an egress event that must be controlled: each request has to justify itself in the audit log. CentraQL’s ComplianceProfile + EgressGuard pulls that control into runtime: when the profile is "RegulatedFinance", any cloud call is blocked at the request boundary.
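CentraQL’s actual EgressGuard API is not documented here; the following is a hypothetical sketch of the idea, blocking known cloud endpoints at the request boundary when the compliance profile demands it. All names and the deny-list are invented for illustration:

```python
# Hypothetical sketch of a compliance-profile egress guard. The class name
# follows the text; the API and the host deny-list are invented.
from urllib.parse import urlparse

CLOUD_LLM_HOSTS = {"api.anthropic.com", "api.openai.com"}  # example deny-list

class EgressGuard:
    def __init__(self, profile: str):
        self.profile = profile

    def check(self, url: str) -> bool:
        """Return True if the outbound call is allowed under the profile."""
        host = urlparse(url).hostname or ""
        if self.profile == "RegulatedFinance" and host in CLOUD_LLM_HOSTS:
            return False  # blocked at the request boundary
        return True

guard = EgressGuard("RegulatedFinance")
print(guard.check("https://api.anthropic.com/v1/messages"))           # False
print(guard.check("http://vllm.internal:8000/v1/chat/completions"))   # True
```

In a real deployment the deny decision would also emit the audit-log entry the regulation asks for.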
Cost picture
For a typical 3 M-queries-per-year banking analytics workload:
- Cloud (Anthropic Claude Sonnet): ~$15-25K/year on tokens, plus monthly VPC overhead.
- On-prem (1× RTX 4090 + vLLM): ~$2K amortised for the card plus ~1.2 kW of sustained power draw. Break-even at ~18 months; from year 3 onwards the cost is noise.
- Hybrid: planner + narrator on-prem, embeddings from the cloud. Rarely worth it: embedding is already cheap, and the data still leaves the perimeter.
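The break-even claim can be checked back-of-envelope. Everything below except the figures quoted in the text (~$20K/yr cloud midpoint, ~$2K card, ~1.2 kW draw) is an assumption, notably the electricity price and the server chassis cost:

```python
# Back-of-envelope payback calculator for on-prem vs cloud token spend.
def break_even_months(capex_usd: float, cloud_monthly_usd: float,
                      onprem_monthly_usd: float) -> float:
    """Months until cumulative cloud spend exceeds capex plus on-prem opex."""
    saving_per_month = cloud_monthly_usd - onprem_monthly_usd
    if saving_per_month <= 0:
        return float("inf")  # on-prem never pays back
    return capex_usd / saving_per_month

power_kw = 1.2
usd_per_kwh = 0.15                                  # assumed electricity price
power_monthly = power_kw * 24 * 30 * usd_per_kwh    # roughly $130/month
capex = 2000 + 3000                                 # card + assumed server chassis
cloud_monthly = 20000 / 12                          # ~$1,667/month in tokens

print(round(break_even_months(capex, cloud_monthly, power_monthly), 1))
```

With power as the only opex, the payback is a handful of months; the ~18-month figure above presumably also prices in ops, rack space and datacentre overhead.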
Practical recommendation
- Pilots: Ollama (CPU may suffice; start by experimenting).
- Production: vLLM (GPU required; throughput multiplies).
- Embedding: TEI (separate process).
- Domain language: LoRA fine-tune, with the banking-tr / banking-en domain pack as the starting dataset.
CentraQL has an adapter abstraction for all three engines; behind an OpenAI-compatible endpoint you can wire in whichever you want. The live demo currently runs on Ollama on a CPU-only 8-vCPU host; an RTX 4090 migration is planned for Q1 2027.
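The practical upshot of the OpenAI-compatible contract is that swapping engines is a base-URL change. A minimal stdlib sketch, where the URLs and model names are placeholders rather than CentraQL configuration:

```python
# Build a chat request for any OpenAI-compatible endpoint (Ollama, vLLM, ...).
# Base URLs and model names are placeholders for illustration.
import json
from urllib.request import Request

def chat_request(base_url: str, model: str, prompt: str) -> Request:
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

# Same code, different engine: only base_url and model change.
req = chat_request("http://localhost:11434", "qwen2.5:7b", "Summarise Q3 NPL trend")
print(req.full_url)  # http://localhost:11434/v1/chat/completions
```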
