RAG for SMEs in 2026: long-context models vs vector databases

TL;DR

For many 5 to 50 person firms in 2026, the right way to put AI on top of your documents is no longer a default vector-database stack. With million-token context windows in GPT-5.5 and Claude Sonnet 4.6, and commoditised managed vector services, the correct architecture is a proportionate blend of long-context prompts, lightweight RAG, and selective vector storage, chosen on cost, latency, security, and regulatory fit.

Key takeaways

- Million-token context windows in GPT-5.5 and Claude Sonnet 4.6 mean many SME knowledge bases now fit inside a single prompt, removing the need for a vector index in some use cases.
- Managed vector services such as Pinecone serverless, Qdrant, Weaviate, and pgvector have commoditised, so a small RAG setup can cost tens of pounds a month rather than thousands.
- The three archetypes worth considering are long-context only, hosted vector plus a RAG framework, and self-hosted pgvector or Qdrant, each with different cost, latency, and control trade-offs.
- Under GDPR and the EU AI Act, the question of where embeddings live and which jurisdiction processes them is a data-protection decision, not just a technical one.
- Start at the simplest pattern that actually works, then promote to a vector index only when volume, latency, or fine-grained access control demand it.

A founder I was speaking with last month runs a thirty-person consultancy with about ten years of shared-drive history sitting behind her. Policy documents, client proposals, project handovers, the lot. She had two quotes in front of her for putting an AI assistant over all of it. One vendor wanted to drop the whole policy folder into a single prompt against a frontier model and call it done. Another wanted to build a proper retrieval pipeline with a vector database, chunking, and a managed RAG framework, at roughly twenty times the monthly cost.

Both quotes were defensible. Both vendors thought the other was being reckless. Neither of them was wrong, which is part of why the decision feels so hard.

What changed in 2025 and 2026 is that the architecture call is now a real choice rather than an obvious one. Until very recently, anything beyond a single document needed a vector database and a RAG pipeline because the models could not hold enough text in working memory to be useful. That constraint has lifted, and the right answer for a small firm now depends on volume, latency, security, and the regulatory geography you operate in. The cluster of small posts already in the catalogue on data readiness, naming conventions, and SOPs covers the upstream housekeeping. This post takes the architectural question that comes next.

What actually changed in 2025 and 2026?

Frontier models grew context windows by two orders of magnitude while managed vector services dropped in price by about one order of magnitude. GPT-5.5 now holds a million tokens at roughly five US dollars per million input tokens, and Claude Sonnet 4.6 sits in a similar range. The combination means the old default of building a vector stack first no longer fits the economics for a typical small firm.

Pinecone, Weaviate, Qdrant, and pgvector all now expose serverless or low-end tiers that put a small index well under a hundred pounds a month. Mainstream RAG frameworks such as LlamaIndex and LangChain have converged on standardised patterns. So both ends of the spectrum, no retrieval at all and a full retrieval pipeline, have got cheaper and easier at the same time, and the interesting question is which one fits which firm.

What are the three archetypes worth considering?

For a 5 to 50 person firm, three architectural patterns cover the ground. Long-context only puts the relevant documents directly into the prompt with no retrieval layer. Hosted vector plus a RAG framework combines a service such as Pinecone serverless with orchestration from LlamaIndex or LangChain. Self-hosted runs pgvector or Qdrant on infrastructure you already control. Each pattern trades cost, control, and complexity in different directions.

Long-context only

Best when your total corpus fits inside a million tokens, which for many SMEs means a curated policy folder, a contracts library, or a year of proposals. A staff handbook of two hundred pages is well under that limit. The setup cost is hours not weeks because you are building a prompt template, not a system. The running cost is per-query, so a thousand questions a month at ten thousand tokens each lands around fifty pounds. Latency is the trade. A million-token prompt takes a few seconds to process, which is fine for a research assistant but irritating for a chat experience.
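The per-query arithmetic can be checked in a few lines. This is a rough sketch only: the default price is the roughly five US dollars per million input tokens quoted earlier, the function name is made up for illustration, and output tokens and currency conversion are ignored.

```python
def monthly_input_cost(queries_per_month: int,
                       tokens_per_query: int,
                       usd_per_million_tokens: float = 5.0) -> float:
    """Estimate monthly input-token spend in US dollars (assumed flat rate)."""
    total_tokens = queries_per_month * tokens_per_query
    return total_tokens / 1_000_000 * usd_per_million_tokens


# The example from the text: 1,000 questions at 10,000 tokens each.
print(monthly_input_cost(1_000, 10_000))     # 50.0

# The same 1,000 questions if every prompt carried a full million tokens.
print(monthly_input_cost(1_000, 1_000_000))  # 5000.0
```

The second figure is a reminder that cost scales with prompt size, so sending only the folder a question actually needs matters as much as the headline price.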

Hosted vector plus a RAG framework

Best when your corpus is too large for long context, when you need sub-second retrieval, or when several users need different views of the same data. Independent comparisons in 2026 put Pinecone’s serverless tier as the easiest fast start for a non-engineering team, with Weaviate strong on hybrid search and Qdrant on fine-grained filtering. LlamaIndex has led recent benchmarks on retrieval accuracy at around 92 per cent, with LangChain offering the widest integration ecosystem. Expect a monthly bill of fifty to three hundred pounds for the storage and queries, plus the model API costs on top, and a setup of one to three weeks if you are paying a competent contractor.

Self-hosted pgvector or Qdrant

Best when you already run PostgreSQL, when data residency matters, or when you have an in-house engineer who can keep the lights on. pgvector is recommended in vendor-neutral analyses up to roughly five million vectors, which is well past the volume of a typical SME knowledge base. Qdrant gives you more performance headroom and better filtering at the cost of running another service. The running cost can be as little as the server you are already paying for, plus model API calls. The trade is operational. Backups, upgrades, and the occasional outage become your problem.

Where do the EU AI Act and UK GDPR change the call?

The architectural choice is now a data-protection choice. The EU AI Act becomes fully applicable for the bulk of its obligations in August 2026, with documentation and logging duties for higher-risk systems. Few SME knowledge assistants will be classified as high-risk, but the surrounding expectations on traceability, data residency, and vendor accountability flow down through contracts and sector guidance regardless.

Embeddings derived from personal data are themselves personal data under GDPR, so the storage limitation principle applies to your vector index just as it applies to the underlying CRM record. ICO guidance to small organisations is consistent on this point. You must be able to justify how long an embedding is kept and have a defined deletion or anonymisation process when the purpose ends.
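A defined deletion process can be as simple as a scheduled job that lists embeddings past their retention period. The sketch below is illustrative, not a compliance tool: the record fields, the two-year period, and the document ids are all hypothetical, and the actual delete call depends on whichever vector store you use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class EmbeddingRecord:
    doc_id: str
    contains_personal_data: bool
    created_at: datetime


def expired_ids(records, retention: timedelta, now: datetime) -> list[str]:
    """Return ids of personal-data embeddings older than the retention period."""
    cutoff = now - retention
    return [r.doc_id for r in records
            if r.contains_personal_data and r.created_at < cutoff]


utc = timezone.utc
records = [
    EmbeddingRecord("client-email-001", True, datetime(2023, 1, 10, tzinfo=utc)),
    EmbeddingRecord("client-email-002", True, datetime(2026, 3, 2, tzinfo=utc)),
    EmbeddingRecord("staff-handbook", False, datetime(2021, 5, 5, tzinfo=utc)),
]

# Hypothetical two-year retention period, evaluated on 1 June 2026.
to_delete = expired_ids(records, timedelta(days=730),
                        now=datetime(2026, 6, 1, tzinfo=utc))
print(to_delete)  # ['client-email-001']
```

Flagging which records contain personal data at ingestion time is the design choice that makes this possible. Without that flag, the only safe policy is to treat every embedding as personal data.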

In practical terms, a US-region serverless vector store holding embeddings of UK client emails creates legal exposure that an EU-region or self-hosted alternative does not. Specialist legal commentary on the interaction between GDPR and the AI Act underlines that deletion of personal data must come first, with only non-personal documentation and properly anonymised records retained for the longer Act-mandated periods.

When is each pattern the right answer?

A rough sorting rule helps. If your full corpus fits inside a million tokens and your users will tolerate a few seconds of latency, start with long-context only. The cost per query is predictable, there is nothing to maintain, and you can test the value of the assistant against your real documents in days rather than weeks.
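Whether a corpus fits is easy to estimate before asking for any quotes. The sketch below assumes plain-text exports and a rule of thumb of about four characters per token for English, which is an approximation rather than a real tokenizer count. The function names and the 80 per cent headroom figure are illustrative.

```python
from pathlib import Path

CHARS_PER_TOKEN = 4        # rough rule of thumb for English; an assumption
CONTEXT_LIMIT = 1_000_000  # the million-token window discussed in this post


def estimate_tokens(text: str) -> int:
    """Approximate token count from character length."""
    return len(text) // CHARS_PER_TOKEN


def corpus_tokens(folder: str, pattern: str = "*.txt") -> int:
    """Sum estimated tokens across matching files under a folder."""
    return sum(
        estimate_tokens(p.read_text(encoding="utf-8", errors="ignore"))
        for p in Path(folder).rglob(pattern)
    )


def fits_in_context(tokens: int, headroom: float = 0.8) -> bool:
    """Leave room for instructions, the question, and the answer."""
    return tokens <= CONTEXT_LIMIT * headroom
```

Calling `fits_in_context(corpus_tokens("policies"))` on a hypothetical `policies` folder gives a first yes or no. For a precise figure, count with the tokenizer of the model you intend to use.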

If your corpus is larger, your query volume is high, or you need different users to see different slices of the index, a hosted vector service with a RAG framework is the proportionate next step. Reach for self-hosted pgvector or Qdrant when data residency, cost control at scale, or integration with an existing PostgreSQL stack makes the operational overhead worth carrying.
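The sorting rule can be written down as a small function, which makes the promotion triggers explicit. This is a sketch of the rule as described here, not a definitive decision tool: the field names are invented for illustration and the example firm's figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Firm:
    corpus_tokens: int
    tolerates_seconds_of_latency: bool
    high_query_volume: bool
    needs_per_user_views: bool
    needs_data_residency_control: bool
    runs_postgres_already: bool


def suggest_archetype(f: Firm, context_limit: int = 1_000_000) -> str:
    """Start at the simplest pattern and promote only when a trigger fires."""
    needs_retrieval = (
        f.corpus_tokens > context_limit
        or not f.tolerates_seconds_of_latency
        or f.high_query_volume
        or f.needs_per_user_views
    )
    if not needs_retrieval:
        return "long-context only"
    if f.needs_data_residency_control or f.runs_postgres_already:
        return "self-hosted pgvector or Qdrant"
    return "hosted vector plus RAG framework"


# Hypothetical thirty-person consultancy with a curated policy folder.
consultancy = Firm(400_000, True, False, False, False, False)
print(suggest_archetype(consultancy))  # long-context only
```

The order of the checks is the point: retrieval is only considered once a concrete trigger fires, and self-hosting only once retrieval is already justified.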

Avoid the failure mode where a vendor sells you a full RAG pipeline for a thirty-person firm with a few thousand documents and a handful of daily questions. That is over-engineering, and it will quietly cost you ten thousand pounds a year you do not need to spend.

A few terms sit close enough to this decision that they are worth a working grasp before you commission anything. Embeddings are the numerical representations of text that a vector database stores and queries against. Chunking is the process of breaking documents into retrievable units. Hybrid search combines keyword and vector retrieval. Agentic patterns layer multi-step tool use on top of retrieval, and they raise the security stakes considerably.
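To make chunking concrete, here is a minimal sketch that splits text into fixed-size word windows with some overlap between neighbours. The word-based splitting and the default sizes are assumptions chosen for readability. Production pipelines usually split on tokens, sentences, or headings instead.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping chunks of at most chunk_size words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(str(i) for i in range(10))
print(chunk_words(sample, chunk_size=4, overlap=1))
# ['0 1 2 3', '3 4 5 6', '6 7 8 9']
```

Each chunk repeats the tail of the previous one, so a sentence that straddles a boundary still appears whole in at least one retrievable unit.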

The choice of embedding model shapes retrieval quality more than many teams realise, and the size and overlap of your chunks affects whether the assistant returns useful answers or muddled paraphrases. Hybrid search frequently outperforms either keyword or vector alone on real corpora. Agentic systems matter because a poisoned document can issue instructions the agent then follows, which turns a retrieval problem into a security one.
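One common way to combine keyword and vector results is reciprocal rank fusion, sketched below. This illustrates the general idea and is not a claim about how any of the services named here implement hybrid search. The document ids are hypothetical, and `k = 60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each document scores 1 / (k + rank) per list."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest combined score first; ties broken alphabetically for stability.
    return sorted(scores, key=lambda d: (-scores[d], d))


keyword_hits = ["expenses-policy", "travel-policy", "handbook"]
vector_hits = ["travel-policy", "handbook", "expenses-policy"]
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
# ['travel-policy', 'expenses-policy', 'handbook']
```

A document that ranks well in both lists beats one that tops only a single list, which is the behaviour that tends to help on mixed corpora of names, codes, and free text.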

Each of these has its own post in the Plain-English AI cluster. The architecture decision sets the frame within which all of them get used.

If you are weighing two quotes that disagree by an order of magnitude on what your firm needs, the answer is rarely either of the extremes. It is almost always the simplest pattern that genuinely fits your volume, latency, and compliance posture, with a clear set of triggers for when to promote to the next step. Book a conversation if you want a peer view on which archetype actually fits your firm before you sign anything.

Sources

- European Commission (2026). Regulatory framework for AI. Phased application dates, documentation duties, and treatment of general-purpose AI models used to ground the regulatory section. https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai
- Information Commissioner's Office (2026). Data storage advice for small organisations. UK guidance on storage limitation, retention periods, and secure destruction or anonymisation, used to ground the embeddings-retention point. https://ico.org.uk/for-organisations/advice-for-small-organisations/information-security/data-storage-advice/
- OpenAI (2026). Introducing GPT-5.5. Official release notes for context length and per-token pricing referenced in the long-context economics. https://openai.com/index/introducing-gpt-5-5/
- TechGDPR (2026). Reconciling the regulatory clock. Specialist analysis of how GDPR storage limitation interacts with AI Act documentation retention, used to justify the embeddings-as-personal-data framing. https://techgdpr.com/blog/reconciling-the-regulatory-clock/
- Core.cz (2026). Vector databases 2026. Independent comparison of Pinecone, Weaviate, Qdrant, and pgvector on hybrid search, self-hosting, and scale thresholds, used to anchor the three archetypes. https://core.cz/en/blog/2026/vector-databases-2026/
- RankSquire (2026). Vector database pricing comparison 2026. Per-vector storage, query, and node-based pricing references used to ground the cost ranges quoted. https://ranksquire.com/2026/03/04/vector-database-pricing-comparison-2026/
- Atlan (2026). Enterprise RAG platforms comparison. Retrieval accuracy metrics and integration counts for LlamaIndex, LangChain, Vectara, and cloud-native offerings used to size the framework decision. https://atlan.com/know/enterprise-rag-platforms-comparison/
- Dotz Law (2026). Anthropic 2026, Code with Claude. Coverage of Claude Sonnet 4.6 context length and pricing used alongside the OpenAI source to triangulate the long-context market. https://dotzlaw.com/insights/anthropic-2026-code-with-claude/
- Intuitive Operations (2025). AI regulations and SMEs in 2025, what is changing. UK and EU regulatory commentary that frames how Act and GDPR duties land on smaller firms. https://intuitive-operations.com/2025/12/15/ai-regulations-smes-2025-impact/

Frequently asked questions

Do I still need a vector database if my documents fit inside a single prompt?

Not necessarily. If your full set of policies, contracts, or proposals fits inside a million-token window and you can tolerate a few seconds of latency, a long-context prompt against GPT-5.5 or Claude Sonnet 4.6 will give you a working assistant without any retrieval layer at all. Vector databases earn their keep when your corpus is too large to fit, when you need sub-second retrieval at high query volume, or when you need fine-grained access control on what each user can see.

How much does a small RAG setup actually cost in 2026?

A serverless vector tier from Pinecone or a self-hosted pgvector instance for a 5 to 50 person firm typically lands between £20 and £200 a month, plus model API costs of roughly five US dollars per million input tokens on the major frontier models. The bigger cost is the time to set up chunking, embeddings, and a sensible retrieval pipeline. Hosted RAG services such as AWS Bedrock Knowledge Bases or Azure AI Search collapse that setup but charge a premium on storage and queries.

Where do the EU AI Act and UK GDPR change my architecture choice?

The Act becomes fully applicable for most obligations in August 2026, and under GDPR embeddings derived from personal data are themselves personal data. That means the jurisdiction of your vector store and the retention rules for what sits inside it matter legally, not just commercially. ICO guidance on storage limitation still applies, so embeddings of client records need a defined retention period and a deletion process. Picking an EU-region vector store, or keeping the index on infrastructure you already control, removes a category of legal risk before it starts.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
