
On-Prem LLM Strategy for Banks: Ollama, vLLM and TEI

In 2026 the LLM debate inside banks has shifted from cloud to on-prem. A practical comparison of three engines, GPU budget and compliance.

BIART Team · 3 min read

Two years ago the LLM debate was "ChatGPT vs Claude vs Gemini". In 2026 the banking question is different: "can I run the model inside my own VPC, and at what cost?" As KVKK, BDDK and the upcoming EU AI Act tighten, sending customer data to a third-party provider requires explicit consent per request. The answer: on-prem LLM. Three engines have matured for it.

Ollama — developer ergonomics

Best for mixed CPU + GPU workloads. Single binary, GGUF model format, cold start in seconds. With settings like OLLAMA_KV_CACHE_TYPE=q8_0, OLLAMA_MAX_LOADED_MODELS=3 and OLLAMA_KEEP_ALIVE=60m, small budgets remain workable. Ideal for pilots and narrow production: in one bank, an 8-vCPU / 24 GB RAM VM with no GPU, running Qwen2.5 7B q4_K_M, narrates a query result in 8-15 seconds. Not full automation, but a viable analyst assistant when paired with banking-tr few-shots.
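
What that pilot looks like from the application side is just a REST call to the local Ollama daemon. A minimal sketch, assuming the default port and an illustrative model tag (the OLLAMA_* tuning lives in the daemon's environment, not in the request):

```python
import requests

# Assumed local Ollama daemon on its default port; host and model tag are
# illustrative. KV-cache and keep-alive tuning is set on the daemon's environment
# (OLLAMA_KV_CACHE_TYPE=q8_0, OLLAMA_KEEP_ALIVE=60m), not per request.
OLLAMA_URL = "http://localhost:11434/api/generate"

resp = requests.post(
    OLLAMA_URL,
    json={
        "model": "qwen2.5:7b-instruct-q4_K_M",   # assumed tag for the quantised 7B
        "prompt": "Summarise yesterday's failed EFT transfers by branch.",
        "stream": False,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["response"])
```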

vLLM — production throughput

By far the highest throughput on GPU. PagedAttention and continuous batching serve 5-10× more concurrent requests on the same card. Typical bank-grade production: 1× RTX 4090 or 1× L40S running Qwen2.5 14B q4 plus nomic-embed-text (768-dim). The math: planner, narrator and embedding service fit inside 24 GB VRAM, sustaining ~50 concurrent queries on a single card.
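
A minimal serving sketch with vLLM's Python API, assuming an AWQ-quantised 14B checkpoint so the model fits a 24 GB card next to the embedding service; the model ID and memory settings below are illustrative, not a validated production config. Continuous batching happens inside generate(), so a list of prompts is scheduled together rather than one after another:

```python
from vllm import LLM, SamplingParams

# Assumed 4-bit AWQ checkpoint; gpu_memory_utilization leaves headroom for the
# rest of the stack on the same 24 GB card.
llm = LLM(
    model="Qwen/Qwen2.5-14B-Instruct-AWQ",   # assumed HF model ID
    quantization="awq",
    gpu_memory_utilization=0.80,
    max_model_len=8192,
)

params = SamplingParams(temperature=0.2, max_tokens=256)

# PagedAttention + continuous batching: both prompts share the card concurrently.
prompts = [
    "Plan a SQL query for the monthly NPL ratio by segment.",
    "Explain the variance in card spend between Q3 and Q4.",
]
for out in llm.generate(prompts, params):
    print(out.outputs[0].text.strip())
```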

TEI — text embeddings inference

HuggingFace’s open-source embedding server. Serves nomic-embed-text-v1.5, bge-m3, e5-mistral and friends in an optimised process. Keeping embedding on a separate worker stops it from contending with planner / narrator. In a Copilot architecture like CentraQL, planner + narrator run on vLLM and embeddings on TEI — single-point-of-contention avoided.
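
Talking to a TEI worker is a single HTTP call. A minimal sketch, assuming the worker was launched with nomic-embed-text-v1.5 on a hypothetical in-VPC host; /embed returns one vector per input:

```python
import requests

# Hypothetical in-VPC TEI endpoint, deliberately a separate worker so embedding
# traffic never contends with planner/narrator GPU time.
TEI_URL = "http://tei.internal:8080/embed"

resp = requests.post(
    TEI_URL,
    json={"inputs": ["IBAN validation rules", "monthly NPL ratio by segment"]},
    timeout=10,
)
resp.raise_for_status()
vectors = resp.json()                 # one embedding per input
print(len(vectors), len(vectors[0]))  # e.g. 2 x 768 for nomic-embed-text-v1.5
```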

Compliance dimension

The EU AI Act high-risk category (credit scoring, biometric identification) mandates audit, explainability and human-in-the-loop. The KVKK March-2026 guidance crystallised explicit consent for personal data flowing to LLM providers. A cloud LLM call is an egress that has to be controlled: every request must justify itself in the audit log. CentraQL’s ComplianceProfile + EgressGuard pulls that control into runtime: when the profile is "RegulatedFinance", any cloud call is blocked at the request boundary.
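
The mechanics of such a guard are easy to sketch. The snippet below is purely illustrative and does not reflect CentraQL's actual API; the class names, host lists and profile flag are hypothetical. The point is that the decision happens once, at the request boundary, before anything leaves the VPC:

```python
from dataclasses import dataclass

# Hypothetical sketch of an egress guard; names do not mirror CentraQL's real API.
CLOUD_HOSTS = {"api.anthropic.com", "api.openai.com"}          # illustrative deny-list
ONPREM_HOSTS = {"vllm.internal", "tei.internal", "localhost"}  # illustrative allow-list


@dataclass
class ComplianceProfile:
    name: str
    allow_cloud_llm: bool


class EgressBlocked(Exception):
    pass


def guard_llm_call(profile: ComplianceProfile, host: str) -> None:
    """Raise before any LLM request leaves the VPC under a regulated profile."""
    if host in ONPREM_HOSTS:
        return
    if host in CLOUD_HOSTS and not profile.allow_cloud_llm:
        raise EgressBlocked(f"profile {profile.name!r} forbids cloud egress to {host}")


regulated = ComplianceProfile(name="RegulatedFinance", allow_cloud_llm=False)
guard_llm_call(regulated, "vllm.internal")      # fine: stays in the VPC
guard_llm_call(regulated, "api.anthropic.com")  # raises EgressBlocked
```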

Cost picture

For a typical 3 M-queries-per-year banking analytics workload:

  • Cloud (Anthropic Claude Sonnet): ~$15-25K/year on tokens, plus monthly VPC overhead.
  • On-prem (1× RTX 4090 + vLLM): ~$2K card amortisation + ~1.2 kW power. Break-even at ~18 months (see the sketch after this list); year 3 onwards the cost is noise.
  • Hybrid: planner + narrator on-prem, embedding from the cloud. Rarely worth it — embedding is already cheap, and the data still leaks.
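
The break-even arithmetic is simple enough to write down. Every figure below is a placeholder assumption (the article only quotes the card price and the cloud token spend); the ~18-month result only falls out if the upfront number covers a full server and integration effort rather than the GPU alone:

```python
# Toy break-even sketch; all inputs are placeholder assumptions, not measured figures.
cloud_per_year = 20_000          # assumed midpoint of the $15-25K/year token spend
upfront_capex = 27_000           # assumed: server + RTX 4090 + integration labour
power_kw, hours_per_year, usd_per_kwh = 1.2, 8_760, 0.15
onprem_opex_per_year = power_kw * hours_per_year * usd_per_kwh   # ~$1.6K electricity

monthly_saving = (cloud_per_year - onprem_opex_per_year) / 12
print(f"break-even after ~{upfront_capex / monthly_saving:.0f} months")  # ~18 with these inputs
```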

Practical recommendation

For pilots — Ollama (CPU may suffice; start by experimenting). For production — vLLM (GPU required; throughput multiplies). For embedding — TEI (separate process). For domain language — LoRA fine-tune with the banking-tr / banking-en domain pack as the starting dataset.
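
On the last point, a LoRA starting point can be as small as the sketch below, using the PEFT library. The base checkpoint and target modules are assumptions, and feeding it the banking-tr / banking-en pack as the training set is the idea rather than a tested recipe:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Assumed base model; swap in whichever checkpoint is actually served.
base = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Low-rank adapters on the attention projections: cheap to train, easy to
# hot-swap per domain (e.g. one adapter trained on the banking-tr pack).
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # typically well under 1% of the base weights
```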

CentraQL has an adapter abstraction for all three engines; behind an OpenAI-compatible endpoint you can wire in whichever one you want. The live demo runs on Ollama on a CPU-only 8-vCPU VM today; an RTX 4090 migration is planned for Q1 2027.
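
In practice that means switching engines is mostly a base_url change on the client: vLLM speaks the OpenAI API natively and Ollama exposes compatible /v1 routes. A minimal sketch with hypothetical in-VPC hostnames:

```python
from openai import OpenAI

# Hypothetical in-VPC endpoints; only base_url changes between pilot and production.
VLLM_URL = "http://vllm.internal:8000/v1"
OLLAMA_URL = "http://localhost:11434/v1"

client = OpenAI(base_url=VLLM_URL, api_key="unused-on-prem")  # key is ignored locally

reply = client.chat.completions.create(
    model="Qwen/Qwen2.5-14B-Instruct-AWQ",   # assumed name as registered on the server
    messages=[{"role": "user", "content": "List last month's top 5 branches by new deposits."}],
    temperature=0.2,
)
print(reply.choices[0].message.content)
```

Point the client at OLLAMA_URL during the pilot and at VLLM_URL in production; the application code does not change.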
