The Emerging LLM Stack for RAG

Every time a new technology bursts on the scene, there is a scramble among thought leaders, analysts, investors, and start-ups to figure out the implications and how to get one or two steps ahead of the early adopters by starting to build out the "new platform" needed to enable the new technology for a wider audience. The search for the "platform" or the "stack" or "infrastructure" or "middleware" or "core libraries" or "service components" or "APIs" is a well-worn path in the history of Silicon Valley and the pursuit of the new-new thing.

So far, Generative AI has been confounding a lot of experts: where is the software platform play? What will the new stack look like? Is a new stack for generative AI even needed? Arguably, unlike the emergence of any other similarly disruptive technology, generative AI is happening mostly in open source and in a fully 'peak Cloud' era. Most of the core model IP is published in research papers and fully open-sourced, including model architecture, training recipes, and even the trained weights (and often multiple checkpoints). Where the models are not open source, most of the leading 'proprietary' players offer fully Cloud-based APIs: no messy UIs, application layers, workflow components, integration points, or even a persistent state with its own data model. Call the API, get a response, transaction over. In an open-source, API-first, cloud era, is there a need for an enterprise software stack around the LLM? In other words, where is the 'ware (e.g., software or middleware) for LLM-based applications?

The contours of the LLM application stack are just emerging, and like most technology disruptions, the requirements of the "stack" sneak up on you until they become obvious. What looks at first like a few small tactical 'one-off' requirements emerges as something more strategic and fundamental once a business moves beyond the easy cases and starts to build real applications using LLMs.

I will frame the discussion around our experiences with a number of clients and partners over the last year, and the journey they have taken in their initial steps with LLMs.

The first discussions after the ChatGPT launch, in late 2022 or early 2023, usually started with breathless, excited talk about the future of AI, and then, once we got down to actions, a simple "thanks for the offer to help, but I think we are good; we are experimenting with (or waiting for) {{ OpenAI | Cohere | Anthropic | Google | Microsoft }} and the API seems to do everything we need." In fact, after the initial instruct-generation models were launched in 2022 and early 2023, most of us had the same reaction to the huge leap in quality from the earlier 'text generation' GPT models: with models this powerful, what is left for the rest of us to do besides just call the LLM API?

For initial experimentation, most people focused on "low hanging fruit" scenarios with "limited integration", such as content generation or creating a chat window into an existing application. Pretty quickly after the initial experimentation, however, came the business-value "use case" question of how to integrate the LLM into meaningful enterprise workflows in ways that were both practical and useful, e.g., how to use LLMs to create new value for the business in terms of productivity, quality, and new automation. This quickly led to the realization that most enterprise workflows require the integration of LLMs with private enterprise contextual knowledge, for several reasons:

(1) reducing hallucination risk to an acceptable level,

(2) providing an “evidence-based” way to assess and “audit” the LLM output, and

(3) performing real-world enterprise work almost always involves interaction with some form of ‘private’ information — internal knowledge-bases, customer support logs, vendor contracts, invoices, training modules, due diligence materials, regulatory documents, sales presentations, product collateral, etc. (As an aside, you could even make the argument that most enterprise processes exist by virtue of the need to read, write and interact with enterprise private knowledge of some form.)

Several months ago, one of the most common answers I heard to this knowledge integration question was that model fine-tuning would solve the problem, and that through a fine-tuning process you could "inject" knowledge into the LLM. I was (and remain) pretty skeptical of this approach, although I recognize that there are some really interesting experimental research paths around call-outs to knowledge tools, integrating embeddings, and other advancements in model architecture. In the short term, however, none of these approaches can address the traceability and auditability of the information, and they will still result in hallucinations and inexplicable fabrications of answers, and potentially in some very brittle and expensive bespoke solutions for a particular use case. As many leading commentators have noted, LLMs were not designed for the structured conceptual encoding of knowledge.

I believe the growing recognition that this knowledge integration pattern could not be solved exclusively within the LLM led to the first major step on the path towards an LLM-based software platform: the answer is not just about the LLM. Additional capabilities need to be wrapped around the LLM to enable it to do meaningful work in the enterprise.

In principle, this knowledge integration with the LLM is not a complicated process:

Step 1 — retrieve some information;

Step 2 — pass the information to the LLM;

Step 3 — ask the LLM to read the information, answer some questions, and do some analysis;

Step 4 — get the results, verify, link to sources, and pass to the end user;

Step 5 — repeat across different information sources and analyses.

Over the last few months, these five steps have increasingly been referred to as "retrieval-augmented generation" (RAG), which has emerged as the leading paradigm for addressing knowledge integration with LLMs. It is a common-sense approach that intuitively mirrors how people do research and analysis: search and select your materials, read the materials in depth, and then prepare and deliver the analysis.
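To make the five steps concrete, here is a minimal sketch of the loop in Python. The `retrieve_passages` and `call_llm` callables, and the shape of the passage records, are hypothetical placeholders rather than any particular library's API.

```python
# Illustrative five-step RAG loop; `retrieve_passages` and `call_llm` are hypothetical callables.
def rag_answer(question, knowledge_sources, retrieve_passages, call_llm):
    results = []
    for source in knowledge_sources:                        # Step 5: repeat across sources
        passages = retrieve_passages(question, source)      # Step 1: retrieve some information
        context = "\n\n".join(p["text"] for p in passages)  # Step 2: package and pass to the LLM
        prompt = (                                          # Step 3: ask the LLM to read and answer
            "Answer the question using ONLY the context below. "
            "If the answer is not in the context, reply 'Not Found'.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}"
        )
        answer = call_llm(prompt)
        results.append({                                    # Step 4: keep answer + sources for review
            "source": source,
            "answer": answer,
            "evidence": [p.get("doc_id") for p in passages],
        })
    return results
```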

Each of these steps, however, starts to introduce complexities in the architecture:

Step 1 — LLM-focused retrieval system. This is the hardest part, because the retrieval needs to be optimized for an LLM-based process:

— Most high-value knowledge is in documents which need to be parsed, extracted, chunked and indexed — consistently across multiple formats with quality and scale.

— For some use cases, 'traditional' text index search is a practical option, while for most 'state of the art' natural language use cases, a semantic index must be created; for many use cases, some combination of both should be applied. This requires a text collection index, a vector database, and a semantic embedding model.
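As an illustration of the semantic side of Step 1, here is a simplified sketch that chunks documents, embeds the chunks with an open-source embedding model, and indexes them in a vector store. It assumes the `sentence-transformers` and `faiss-cpu` packages are installed; the naive word-window chunking and the tiny in-memory corpus are stand-ins for a real parsing pipeline, and the keyword/text index side is omitted.

```python
# Simplified sketch: chunk documents and build a semantic index with FAISS.
# Assumes `pip install sentence-transformers faiss-cpu`; chunking here is a naive word window.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

def chunk(text, size=200, overlap=50):
    # Split into overlapping windows of `size` words.
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]

documents = {"contract_001": "This Master Services Agreement is entered into ..."}  # placeholder corpus

chunks, chunk_ids = [], []
for doc_id, text in documents.items():
    for n, c in enumerate(chunk(text)):
        chunks.append(c)
        chunk_ids.append(f"{doc_id}#{n}")

embedder = SentenceTransformer("all-MiniLM-L6-v2")           # small open-source embedding model
vectors = embedder.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(vectors.shape[1])                  # inner product == cosine on normalized vectors
index.add(np.asarray(vectors, dtype="float32"))

def semantic_search(query, k=3):
    q = embedder.encode([query], normalize_embeddings=True)
    _, hits = index.search(np.asarray(q, dtype="float32"), k)
    return [(chunk_ids[i], chunks[i]) for i in hits[0] if i != -1]
```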

Step 2 — text packaging and batching pipeline. This means batching the information to fit the LLM context window, creating an "assembly line" of retrieving, packaging, and passing information to the model. The pipeline is trivial for a single small piece of information, but quickly becomes complex to generalize across large-scale document libraries and retrieval processes.
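A minimal sketch of that batching step, using a crude characters-per-token estimate as a stand-in for the target model's real tokenizer:

```python
# Illustrative batching: pack retrieved chunks into context-window-sized batches.
# The 4-characters-per-token estimate is a rough proxy; use the target model's tokenizer in practice.
def estimate_tokens(text):
    return len(text) // 4

def batch_for_context(chunks, max_context_tokens=3000, reserved_for_prompt=500):
    budget = max_context_tokens - reserved_for_prompt
    batches, current, used = [], [], 0
    for c in chunks:
        t = estimate_tokens(c)
        if current and used + t > budget:   # current batch is full -> start a new one
            batches.append(current)
            current, used = [], 0
        current.append(c)
        used += t
    if current:
        batches.append(current)
    return batches
```

Each batch then becomes one LLM call in the assembly line, with the batch's source chunks carried alongside for the post-processing step.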

Step 3 — prompt management. This covers packaging the information in model-specific ways, applying prompt instructions for each model, addressing questions about data privacy and model accuracy, handling "not found" and other common classification issues, and providing fall-back and error handling if models are not available.
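A sketch of what basic prompt management can look like: per-model prompt templates, an explicit "not found" instruction, and a simple fallback chain when a model is unavailable. The model names and the `call_model` wrapper are hypothetical placeholders.

```python
# Illustrative prompt management: per-model templates, "not found" handling, and fallback.
PROMPT_TEMPLATES = {
    "model_a": "### Instruction:\nUse only the context to answer. If unsure, reply 'Not Found'.\n"
               "### Context:\n{context}\n### Question:\n{question}\n### Response:\n",
    "model_b": "[CONTEXT]\n{context}\n[QUESTION]\n{question}\n"
               "Answer from the context only; reply 'Not Found' if the answer is not present.\n",
}

def run_with_fallback(question, context, models, call_model):
    """models: ordered list of model names; call_model(name, prompt) -> text, or raises on failure."""
    for name in models:
        prompt = PROMPT_TEMPLATES[name].format(context=context, question=question)
        try:
            answer = call_model(name, prompt)
        except Exception:
            continue                      # model unavailable -> fall back to the next one
        if answer.strip().lower().startswith("not found"):
            return {"model": name, "answer": None, "status": "not_found"}
        return {"model": name, "answer": answer, "status": "ok"}
    return {"model": None, "answer": None, "status": "all_models_failed"}
```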

Step 4 — post-processing. This involves capturing and tracking the LLM transaction in its entirety, with the source information, metadata, and usage, and then building a set of utilities to apply fact-checking, integrate the output into a larger work-product deliverable, and create reports and analyses for a second-level final human review.
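A sketch of the transaction-capture side of Step 4, logging the full exchange with its evidence so that fact-checking and second-level human review are possible. The numeric "unsupported fact" check is only a placeholder for real verification logic.

```python
# Illustrative LLM transaction capture with a naive evidence check for post-processing review.
import json, re, time, uuid

def capture_transaction(model, prompt, answer, evidence_chunks, log_path="llm_transactions.jsonl"):
    # Naive check: flag numbers in the answer that never appear in the retrieved evidence.
    evidence_text = " ".join(evidence_chunks)
    unsupported = [n for n in re.findall(r"\d[\d,.]*", answer) if n not in evidence_text]
    record = {
        "id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model": model,
        "prompt": prompt,
        "answer": answer,
        "evidence": evidence_chunks,
        "unsupported_numbers": unsupported,   # surfaced for human fact-checking
        "needs_review": bool(unsupported),
    }
    with open(log_path, "a") as f:            # append-only log for audit, analytics, and fin-ops
        f.write(json.dumps(record) + "\n")
    return record
```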

Step 5 — repeat. This means going beyond a single use case and workflow to support an indefinite number of processes, documents, and analytics.

Taking an inventory of these requirements, we usually start to see many new components that need to be introduced to wrap around the LLM:

— Semantic embedding model;

— Vector Database;

— Text search index with metadata;

— Document Parsing and Text Chunking;

— Post Processing fact-checking and verification;

— LLM transaction capture for audit, analytics, fin-ops and continuous improvement;

— Output management to stitch the LLM-generated output into larger work products, analytics and reports;

— Workflow and data pipelines that tie these processes together;

— Abstraction layers that properly encapsulate these components and are designed to enable "mix-and-match" across them (a minimal sketch follows this list).
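That last item, the abstraction layer, is what makes "mix-and-match" possible in practice. A minimal sketch using Python protocols: each component type gets a small interface, and the pipeline is written against the interfaces rather than any specific vendor.

```python
# Minimal "mix-and-match" abstraction: the pipeline depends on interfaces, not vendors.
from typing import List, Protocol

class Embedder(Protocol):
    def embed(self, texts: List[str]) -> List[List[float]]: ...

class VectorStore(Protocol):
    def add(self, ids: List[str], vectors: List[List[float]]) -> None: ...
    def search(self, vector: List[float], k: int) -> List[str]: ...

class GenerativeModel(Protocol):
    def generate(self, prompt: str) -> str: ...

class RagPipeline:
    def __init__(self, embedder: Embedder, store: VectorStore, llm: GenerativeModel):
        self.embedder, self.store, self.llm = embedder, store, llm

    def answer(self, question: str, chunks: dict) -> str:
        # `chunks` maps chunk id -> chunk text; the store returns matching ids.
        hits = self.store.search(self.embedder.embed([question])[0], k=3)
        context = "\n\n".join(chunks[i] for i in hits)
        return self.llm.generate(f"Context:\n{context}\n\nQuestion: {question}")
```

Swapping one vendor's generative model for an open-source model, or one vector database for another, then becomes a constructor argument rather than a rewrite.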

And this is where it gets complicated. While some of these requirements are straightforward in principle and can be solved in a "5 minutes to do X" tutorial on a particular vendor's website, assembling these pieces in a scalable way for a real-life use case is complex and novel.

Usually, when we go back to see that same client several months later, they have gone down some version of this learning curve and share with us either a spaghetti chart on a whiteboard or a really nice box chart on a PowerPoint slide that sets forth all of these components and the choices they have made to build out an initial POC, which usually looks something like the following components (restated below as a simple configuration sketch):

— LLM: OpenAI or Google

— Embedding Model: OpenAI, Google, or Open Source

— Vector DB: Pinecone, Milvus, or FAISS

— Search Index: Elasticsearch or MongoDB (or another Lucene-based document collection datastore).

— Document Parsing and Chunking: Open Source and ad hoc; usually a work in progress with some challenges

— Post-processing and other utilities: custom, and still being worked out

— Repeatability: have not gotten there yet.
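For illustration, that same POC "box chart" can be written down as a simple declarative configuration, which is usually the form it takes once the team starts wanting to swap pieces; the values below simply restate the list above and are not recommendations.

```python
# The typical first-POC stack, written as swappable configuration rather than hard-coded choices.
POC_STACK = {
    "llm": {"provider": "openai"},                                  # or Google
    "embedding_model": {"provider": "openai"},                      # or Google / open source
    "vector_db": {"provider": "milvus"},                            # or Pinecone / FAISS
    "search_index": {"provider": "elasticsearch"},                  # or MongoDB
    "parsing_and_chunking": {"provider": "custom", "status": "work in progress"},
    "post_processing": {"provider": "custom", "status": "being worked out"},
    "repeatability": {"status": "not started"},
}
```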

Usually, at this point, the client says that they feel they have a better understanding, but still have many bridges to cross before deploying the solution at greater scale. After the initial confidence in the architecture, as the conversation progresses, a whole set of "big picture" issues starts to pour out, with questions about the ultimate architecture:

— Public cloud vs. private cloud?

— Open source vs. Proprietary?

— Custom fine-tuned vs. Out of the Box vs. "Build your own model"?

— Single model vendor (generative + embedding) vs. Multi-vendor?

— Speed of deployment vs. Long-term cost lock-ins?

— Multi-cloud vs. single cloud provider?

As these wider architectural considerations come to the forefront, they generally point to the critical need for a "loosely coupled" architecture in which different components of the RAG system are modular and "swappable" to support multiple processes and workflows, and in which elements of the process are abstracted and separated from the individual components.

After the discussion widens into these architectural issues, we usually then begin to contemplate an even larger set of important lifecycle considerations:

— How to evaluate the performance of different LLMs side-by-side (a simple evaluation harness is sketched after this list), and swap in different models as required for performance, use case, and cost?

— How to manage the long-term cost implications and avoid potential “lock-ins” that could prove to be onerous at scale?

— What is the right balance between fine-tuning and customization vs. leveraging “out of the box” innovation investments from model providers?

— How to ensure the scalable private ingestion and alignment of internal knowledge bases (which tend to be fragmented and difficult to manage)?

— What is the long-term data governance and security model and how will this fit into existing enterprise systems?
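The first of these questions, side-by-side model evaluation, lends itself to a simple harness: run the same question-and-context test cases through each candidate model and compare accuracy and latency (and, by extension, cost). The `call_model` wrapper and the test-case format below are hypothetical placeholders.

```python
# Illustrative side-by-side evaluation harness; `call_model(name, prompt)` is a placeholder wrapper.
import time

def evaluate_models(model_names, test_cases, call_model):
    """test_cases: list of dicts with 'question', 'context', and 'expected' (list of substrings)."""
    scores = {}
    for name in model_names:
        correct, latency = 0, 0.0
        for case in test_cases:
            prompt = f"Context:\n{case['context']}\n\nQuestion: {case['question']}"
            start = time.time()
            answer = call_model(name, prompt)
            latency += time.time() - start
            if all(fact.lower() in answer.lower() for fact in case["expected"]):
                correct += 1
        scores[name] = {
            "accuracy": correct / len(test_cases),
            "avg_latency_sec": latency / len(test_cases),
        }
    return scores
```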

Usually, at this point of the client conversation, we are all equally energized and exhausted, and as we come up to minute 59 of the call, everyone needs to run to their next Zoom call. We agree to review the materials and come back to the discussion in a follow-up call. These are hard issues — and as of October 2023, there are still a lot more questions than answers for most businesses.

Where’s the “ware” for LLMs?

As more businesses move down this learning curve in late 2023 and 2024, there will increasingly be a convergence and a growing understanding of the need for an integrated framework that assembles all of the components described above and enables enterprises to roll out LLM-based applications at scale. We will be talking about the "LLM stack" or "LLM middleware": the essential behind-the-scenes tools that unlock the potential of the LLM for enterprise knowledge-based automation.

When we look back years from now, it will probably seem obvious, but as we live through it and try to grapple with these issues, it can be a challenge for all of us to figure out how all of the pieces of the puzzle fit into an integrated solution. That is the beauty of living through a technology revolution — we all have the chance to participate and try to figure it out together!

For the fastest RAG implementation, find our LLMware open-source project at https://github.com/llmware-ai/llmware and our Hugging Face models at https://huggingface.co/llmware.

Article by Darren Oberst, CEO and Founder

Published on Aug 3, 2023