A founder I was speaking with last month runs a thirty-person consultancy with about ten years of shared-drive history sitting behind her. Policy documents, client proposals, project handovers, the lot. She had two quotes in front of her for putting an AI assistant over all of it. One vendor wanted to drop the whole policy folder into a single prompt against a frontier model and call it done. Another wanted to build a proper retrieval pipeline with a vector database, chunking, and a managed RAG framework, at roughly twenty times the monthly cost.
Both quotes were defensible. Both vendors thought the other was being reckless. Neither of them was wrong, which is part of why the decision feels so hard.
What changed in 2025 and 2026 is that the architecture call is now a real choice rather than an obvious one. Until very recently, anything beyond a single document needed a vector database and a RAG pipeline because the models could not hold enough text in working memory to be useful. That constraint has lifted, and the right answer for a small firm now depends on volume, latency, security, and the regulatory geography you operate in. The cluster of small posts already in the catalogue on data readiness, naming conventions, and SOPs covers the upstream housekeeping. This post takes up the architectural question that comes next.
What actually changed in 2025 and 2026?
Frontier models grew context windows by two orders of magnitude while managed vector services dropped in price by about one. GPT-5.5 now holds a million tokens at roughly five US dollars per million input tokens, and Claude Sonnet 4.6 sits in a similar range. The combination means the old default of building a vector stack first no longer fits the economics for a typical small firm.
Pinecone, Weaviate, Qdrant, and pgvector all now expose serverless or low-end tiers that put a small index well under a hundred pounds a month. Mainstream RAG frameworks such as LlamaIndex and LangChain have converged on standardised patterns. So both ends of the spectrum, no retrieval at all and a full retrieval pipeline, have got cheaper and easier at the same time, and the interesting question is which one fits which firm.
What are the three archetypes worth considering?
For a 5 to 50 person firm, three architectural patterns cover the ground. Long-context only puts the relevant documents directly into the prompt with no retrieval layer. Hosted vector plus a RAG framework combines a service such as Pinecone serverless with orchestration from LlamaIndex or LangChain. Self-hosted runs pgvector or Qdrant on infrastructure you already control. Each pattern trades cost, control, and complexity in different directions.
Long-context only
Best when your total corpus fits inside a million tokens, which for many SMEs means a curated policy folder, a contracts library, or a year of proposals. A staff handbook of two hundred pages is well under that limit. The setup cost is hours, not weeks, because you are building a prompt template, not a system. The running cost is per-query and scales with the tokens you send each time, so a thousand questions a month at ten thousand tokens each lands around fifty pounds, while sending a full million-token corpus with every question would cost about a hundred times that. Latency is the trade. A million-token prompt takes a few seconds to process, which is fine for a research assistant but irritating for a chat experience.
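The per-query arithmetic is easy to check for your own numbers. The sketch below is a rough estimate only: it uses the five US dollars per million input tokens quoted earlier, ignores output tokens and any caching discounts, and leaves the result in dollars because the sterling figure depends on the exchange rate on the day. The function name and the million-token comparison case are illustrative assumptions, not anything a vendor publishes.

```python
def monthly_input_cost_usd(queries_per_month: int,
                           tokens_per_query: int,
                           usd_per_million_tokens: float) -> float:
    """Return the monthly input-token cost in US dollars."""
    total_tokens = queries_per_month * tokens_per_query
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Figures from the text: 1,000 questions a month, 10,000 tokens each,
# at roughly 5 US dollars per million input tokens.
trimmed_prompt = monthly_input_cost_usd(1_000, 10_000, 5.0)

# Hypothetical comparison: the same 1,000 questions, but each one
# carries a full million-token corpus in the prompt.
full_corpus_prompt = monthly_input_cost_usd(1_000, 1_000_000, 5.0)

print(f"Trimmed prompt:     ${trimmed_prompt:,.2f} a month")
print(f"Full-corpus prompt: ${full_corpus_prompt:,.2f} a month")
```

The gap between the two lines is the reason a long-context setup usually sends a curated folder rather than everything on the drive.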
Hosted vector plus a RAG framework
Best when your corpus is too large for long context, when you need sub-second retrieval, or when several users need different views of the same data. Independent comparisons in 2026 put Pinecone’s serverless tier as the easiest fast start for a non-engineering team, with Weaviate strong on hybrid search and Qdrant on fine-grained filtering. LlamaIndex has led recent benchmarks on retrieval accuracy at around 92 per cent, with LangChain offering the widest integration ecosystem. Expect a monthly bill of fifty to three hundred pounds for the storage and queries, plus the model API costs on top, and a setup of one to three weeks if you are paying a competent contractor.
Self-hosted pgvector or Qdrant
Best when you already run PostgreSQL, when data residency matters, or when you have an in-house engineer who can keep the lights on. pgvector is recommended in vendor-neutral analyses up to roughly five million vectors, which is well past the volume of a typical SME knowledge base. Qdrant gives you more performance headroom and better filtering at the cost of running another service. The running cost can be as little as the server you are already paying for, plus model API calls. The trade is operational. Backups, upgrades, and the occasional outage become your problem.
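To see how far a typical SME sits from that five-million-vector guideline, a back-of-envelope count is enough. The sketch below assumes one vector per chunk and no overlap between chunks. The document count, average length, and chunk size are hypothetical placeholders to swap for your own.

```python
import math

# "Roughly five million vectors" is the guideline quoted in the text.
PGVECTOR_GUIDELINE_VECTORS = 5_000_000


def estimate_vectors(num_documents: int,
                     avg_tokens_per_document: int,
                     chunk_tokens: int) -> int:
    """Estimate the index size, assuming one vector per chunk and no overlap."""
    chunks_per_document = math.ceil(avg_tokens_per_document / chunk_tokens)
    return num_documents * chunks_per_document


# Hypothetical firm: 5,000 documents averaging 4,000 tokens each,
# split into 500-token chunks.
vectors = estimate_vectors(5_000, 4_000, 500)
headroom = PGVECTOR_GUIDELINE_VECTORS / vectors

print(f"Estimated vectors: {vectors:,}")
print(f"Headroom before the guideline: {headroom:.0f}x")
```

On those assumed figures the index is a small fraction of the guideline, which is the point the paragraph above makes in words.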
Where do the EU AI Act and UK GDPR change the call?
The architectural choice is now a data-protection choice. The EU AI Act becomes fully applicable for the bulk of its obligations in August 2026, with documentation and logging duties for higher-risk systems. Few SME knowledge assistants will be classified as high-risk, but the surrounding expectations on traceability, data residency, and vendor accountability flow down through contracts and sector guidance regardless.
Embeddings derived from personal data are themselves personal data under GDPR, so the storage limitation principle applies to your vector index just as it applies to the underlying CRM record. ICO guidance to small organisations is consistent on this point. You must be able to justify how long an embedding is kept and have a defined deletion or anonymisation process when the purpose ends.
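A defined deletion process can start as something very small: a record of which embedding came from which source and when its purpose ended, plus a routine that lists what is overdue. The sketch below is one possible shape, not a compliance tool. The field names, the thirty-day grace period, and the sample records are all hypothetical, and the actual delete call depends on whichever vector store you use.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class EmbeddingRecord:
    vector_id: str                  # ID of the vector in your index
    source_record: str              # where the underlying personal data lives
    purpose_ended: Optional[date]   # None while the purpose is still live


def due_for_deletion(records: List[EmbeddingRecord],
                     today: date,
                     grace: timedelta = timedelta(days=30)) -> List[str]:
    """Return IDs of embeddings whose purpose ended more than `grace` ago."""
    return [
        r.vector_id
        for r in records
        if r.purpose_ended is not None and today - r.purpose_ended > grace
    ]


# Hypothetical index metadata.
records = [
    EmbeddingRecord("v1", "crm/client-104", None),
    EmbeddingRecord("v2", "crm/client-221", date(2026, 1, 10)),
    EmbeddingRecord("v3", "crm/client-305", date(2026, 3, 1)),
]

overdue = due_for_deletion(records, today=date(2026, 3, 15))
print(overdue)
```

Whatever period you choose, the useful part is that the justification and the deletion trigger are written down in one place.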
In practical terms, a US-region serverless vector store holding embeddings of UK client emails creates legal exposure that an EU-region or self-hosted alternative does not. Specialist legal commentary on the interaction between GDPR and the AI Act underlines that deletion of personal data must come first, with only non-personal documentation and properly anonymised records retained for the longer Act-mandated periods.
When is each pattern the right answer?
A rough sorting rule helps. If your full corpus fits inside a million tokens and your users will tolerate a few seconds of latency, start with long-context only. The cost per query is predictable, there is nothing to maintain, and you can test the value of the assistant against your real documents in days rather than weeks.
If your corpus is larger, your query volume is high, or you need different users to see different slices of the index, a hosted vector service with a RAG framework is the proportionate next step. Reach for self-hosted pgvector or Qdrant when data residency, cost control at scale, or integration with an existing PostgreSQL stack makes the operational overhead worth carrying.
Avoid the failure mode where a vendor sells you a full RAG pipeline for a thirty-person firm with a few thousand documents and a handful of daily questions. That is over-engineering, and it will quietly cost you ten thousand pounds a year you do not need to spend.
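The sorting rule above can be written down as a few lines of logic, which is a useful test of whether a vendor's recommendation follows from your actual numbers. This is a simplified sketch. The field names are hypothetical, the order in which the conditions are checked is my own reading of the rule, and a real decision would weigh cost and contracts as well.

```python
from dataclasses import dataclass

LONG_CONTEXT_LIMIT_TOKENS = 1_000_000  # the million-token threshold from the text


@dataclass
class FirmProfile:
    corpus_tokens: int
    tolerates_seconds_of_latency: bool
    high_query_volume: bool
    needs_per_user_views: bool
    data_residency_required: bool
    runs_postgresql: bool


def recommend_archetype(p: FirmProfile) -> str:
    """Apply the rough sorting rule; precedence of checks is an assumption."""
    fits_in_context = p.corpus_tokens <= LONG_CONTEXT_LIMIT_TOKENS
    needs_retrieval = p.high_query_volume or p.needs_per_user_views
    if fits_in_context and p.tolerates_seconds_of_latency and not needs_retrieval:
        return "long-context only"
    if p.data_residency_required or p.runs_postgresql:
        return "self-hosted pgvector or Qdrant"
    return "hosted vector plus a RAG framework"


# Hypothetical firms.
small_firm = FirmProfile(400_000, True, False, False, False, False)
busy_firm = FirmProfile(8_000_000, True, True, True, False, False)
regulated_firm = FirmProfile(8_000_000, False, False, False, True, True)

for firm in (small_firm, busy_firm, regulated_firm):
    print(recommend_archetype(firm))
```

If a quote lands you in a different branch from the one your own profile produces, that is the conversation to have with the vendor.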
What related concepts are worth understanding next?
A few terms sit close enough to this decision that they are worth a working grasp before you commission anything. Embeddings are the numerical representations of text that a vector database stores and queries against. Chunking is the process of breaking documents into retrievable units. Hybrid search combines keyword and vector retrieval. Agentic patterns layer multi-step tool use on top of retrieval, and they raise the security stakes considerably.
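Chunking is the easiest of these to picture in code. The sketch below splits on whitespace and counts words as a stand-in for tokens, which is an assumption for readability. Real pipelines usually count model tokens and often respect headings or paragraphs. The overlap parameter repeats the tail of one chunk at the head of the next so that a sentence cut at a boundary still appears whole somewhere.

```python
def chunk_words(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into chunks of `chunk_size` words, sharing `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = "one two three four five six seven eight nine ten"
chunks = chunk_words(sample, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk would then be turned into an embedding and stored as one vector in the index.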
The choice of embedding model shapes retrieval quality more than many teams realise, and the size and overlap of your chunks affects whether the assistant returns useful answers or muddled paraphrases. Hybrid search frequently outperforms either keyword or vector alone on real corpora. Agentic systems matter because a poisoned document can issue instructions the agent then follows, which turns a retrieval problem into a security one.
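One common way to combine keyword and vector results is reciprocal rank fusion, sketched below. This is an illustration of the general idea rather than how any particular product does it: each document scores one over a constant plus its rank in every list it appears in, and the totals decide the final order. The constant of 60 is a conventional default, and the document names are hypothetical.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document IDs into one list, best first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first; ties broken alphabetically for stability.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for the same question from two retrievers.
keyword_hits = ["handbook", "expenses", "leave"]
vector_hits = ["leave", "handbook", "onboarding"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents that both retrievers rank well rise to the top, which is the intuition behind hybrid search doing better than either method alone.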
Each of these has its own post in the Plain-English AI cluster. The architecture decision sets the frame within which all of them get used.
If you are weighing two quotes that disagree by an order of magnitude on what your firm needs, the answer is rarely either of the extremes. It is almost always the simplest pattern that genuinely fits your volume, latency, and compliance posture, with a clear set of triggers for when to promote to the next step. Book a conversation if you want a peer view on which archetype actually fits your firm before you sign anything.



