One of the most important facts about generative AI is often overlooked: GPT-based models are auto-regressive causal decoders that generate output one token at a time. A GPT model takes a “context passage” as input, converts the words into a sequence of numerical tokens, propagates the activations across multiple transformer layers inside the model, and then outputs a single token. More precisely, it outputs an array the size of the vocabulary, assigning a score to every candidate token, which is converted into a probability distribution for that single next token.
Back in the early days of these models (way back in 2020–2021), it was very common for introductory tutorials to include the ‘generation sampling’ loop required to do something interesting with the model, which in pseudo-code looks something like:
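Here is a minimal, runnable sketch of that loop, using the Hugging Face transformers library with GPT-2 as an assumed stand-in model (the model choice, prompt, and greedy token selection are illustrative assumptions, not a specific recipe):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 used purely as a small, convenient stand-in model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "The most overlooked part of generative AI is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

max_new_tokens = 50
with torch.no_grad():
    for _ in range(max_new_tokens):
        # one forward pass -> scores over the whole vocabulary for the NEXT token only
        logits = model(input_ids).logits[:, -1, :]
        # pick a token (greedy here; sampling / top-k / temperature would slot in here)
        next_token = torch.argmax(logits, dim=-1, keepdim=True)
        # the loop, not the model, carries the state forward by appending the token
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        # the loop, not the model, decides when to stop
        if next_token.item() == tokenizer.eos_token_id:
            break

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))
```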
For users of the Hugging Face transformers library, or of a client library for llama.cpp, you will see this core logic in the generation loops, enhanced with many configuration options for sampling, stopping criteria, applying grammars and other ‘biases’ that manipulate or adjust the logits (e.g., repetition penalties), and ‘beam search’ strategies that run multiple parallel output streams before picking the best one to return to the user.
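For reference, a roughly equivalent call through the library's built-in generate() method, continuing with the model, tokenizer, and input_ids from the sketch above, might look like the following (the parameter values are illustrative only, not recommendations):

```python
# Roughly the same loop, driven by transformers' generate(), with a few of the
# knobs mentioned above exposed as configuration (values are illustrative only)
output_ids = model.generate(
    input_ids,
    max_new_tokens=100,                   # stopping criterion
    do_sample=True,                       # sample instead of greedy argmax
    temperature=0.7,                      # reshape the output distribution
    top_p=0.9,                            # nucleus sampling cut-off
    repetition_penalty=1.1,               # a logit 'bias' against repeats
    eos_token_id=tokenizer.eos_token_id,  # stop when end-of-text is produced
    pad_token_id=tokenizer.eos_token_id,  # gpt2 has no pad token by default
)
# (setting num_beams > 1 instead would switch on the 'beam search' strategy)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```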
Of course, behind OpenAI, Anthropic, and other API-based calls, there is a similar generation loop (in fact, likely a far more sophisticated one), but we rarely think about it, since our experience with those models is holistic: provide input text and get output text “from the model.” Properly speaking, though, the output text is the product of a combination of the model AND a generation loop, both of which run inside the API call.
Even though this is a basic and well-known fact, we are all often guilty of conflating the output of the system (i.e., model + generation loop) with the output of the model itself, which has potentially profound implications for our intuitions about AGI and for how we can most effectively leverage generative AI:
1. LLMs do not have a state or update mechanism. Each time we execute a forward pass on the model, we must ‘manually’ update the set of input tokens: we pass a new string of n+1 tokens, and the model then produces the next incremental token, n+2. Said differently, the LLM does not have a facility for updating its own state. To simulate the appearance of a complex multi-token response to a user query, we repeatedly call the model in an explicit, rules-based, hard-coded generation loop in which we select a token based on the model’s output probability distribution, update the input context, and send the updated context back to the model; otherwise, the model would literally be stuck producing the first token over and over again, as shown in the first sketch following this list. (Anyone who has debugged a generation loop with a bug in the past-key-value cache has experienced this before!)
2. Stopping Conditions. The generation loop, not the model itself, decides when to stop producing output. We have all received truncated outputs from an LLM generation (e.g., stopping at 20 tokens, 100 tokens, etc.). That cut-off came from the generation loop, not the model. Generally, LLMs are trained to reliably produce some form of an <|endoftext|> token as the ‘next token’ when an input context is substantially complete, but if you have a bug in your generation loop (or you are just curious!) and you continue to pass the updated context to the model into the netherworld beyond <|endoftext|>, the model will happily continue to generate incremental tokens, and usually in very unpredictable ways (see the second sketch following this list). Sometimes it will produce a chain of <|endoftext|> (or other whitespace or special characters), but oftentimes the model will seemingly randomly depart into a completely different and unrelated topic, or continue with a dialog that is only tangentially related; it gives the surreal impression of the model continuing a dialog with itself “behind the curtains” long after we have departed. Last week, while debugging a generation loop with a leading foundation chat model, we disabled the end-of-text stopping condition, and in the post-<|endoftext|> context our model, working on a topic about the banking industry, started recounting episodes of Everybody Loves Raymond with no apparent connection (although probably some accidental artifact of its original training). There is a beautiful analogy in this clip about “knowing when to put your pens down” as the key to creating great art (https://youtu.be/QzzcHhFa66Q?si=RMI0bY1aYUEp3QA4). Does the LLM really have an understanding of what it means to stop? Oftentimes, the quality of the output response is determined by the generation loop knowing when to stop (with a helpful nudge from post-processing rules), not by the model itself.
3. No Insights to the Right. The state of the growing output text, which in the aggregate often forms a sequence of words that seems to represent complex analysis and sophisticated expression, lives *outside* of the model; it lives in the ‘generation loop’ that is calling the model. Properly speaking, the model does not have any ability to anticipate or project generation beyond the current token (further to the “right” in the text), nor does it have any state mechanism to know whether it will be invoked again to ‘complete’ a thought, or whether the next token is the last output that will be produced. Can conceptual articulation of a complex thought occur without any anticipation of the words that will follow the current one?
4. Minion Model Thought Experiment. As a thought experiment, and a relatively easy practical experiment, using a small local LLM such as tiny-llama, we could re-write the generation script so that, instead of repeatedly invoking one single model, each step of the generation calls a different model, responsible for producing one new incremental token in the chain:
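A minimal sketch of what that script could look like is below. It uses a small pool of independently instantiated copies of a small model (the TinyLlama chat model id is an assumed example), cycling through them so that a different model instance produces each incremental token; in principle, one fresh copy per token would behave identically, and a small pool is used here only to keep memory manageable:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# assumed example of a small local model; any causal LM would work the same way
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# a pool of independently loaded "minion" models
minions = [AutoModelForCausalLM.from_pretrained(model_name).eval() for _ in range(3)]

prompt = "The main drivers of growth in the banking industry are"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    for step in range(100):
        minion = minions[step % len(minions)]        # a different model at each step
        logits = minion(input_ids).logits[:, -1, :]  # that model sees the full context
        next_token = torch.argmax(logits, dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        if next_token.item() == tokenizer.eos_token_id:
            break

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))
```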
In this script, the output of the system would be substantially the same as if a single model were being repeatedly called, with each new model being passed the updated context from the previous step and, in effect, ‘completing the sentence’ of the previous model.
What this makes explicit is that when we read a thoughtful 300-token analysis of a subject generated by this system, the model was only aware of producing one ‘good’ token at a time, in a completely stateless manner and independent of any other tokens that may follow in the chain of thought, and that, in principle, 300 separate and distinct “minion models”, each receiving the incrementally updated context and producing a single incremental token, would reproduce the same result. If 300 individual models, operating completely independently and each producing only a single token, can reproduce the same ‘complex’ result as a single model, can we say that the single model is generating a complex thought?
5. We inject the randomness, not the model. The stochastic nature of the process is not really happening inside the model; rather, we inject it through probabilistic sampling in the generation loop, which, in its own way, is a form of “human preference” for more varied and expressive use of language. While there is theoretically (and perhaps in reality) the potential for some approximation error or variation in a model forward pass, the model is generally a deterministic mathematical function with defined operations and parameters, such that, given exactly the same input, it will produce exactly the same logit output distribution every time. (Note: this leaves aside a very complex and well-known issue of models generating different probability distributions for inputs that are *almost exactly* the same…) By sampling probabilistically, we add a randomized element to create more varied and interesting text output, and potentially inject a sense of “human” interaction into the model’s response; a small sketch illustrating this follows this list. In a string of 300 tokens, if at each step we add X% of randomness, then over the full 300 steps we open up a wide range of potential outcomes. If we adjust the incremental input, then of course the model will generate a different output, but the levers controlling this sit outside the model, and we could run this same experiment with our minion models with the same result. (As a side note, check out these two videos; we have found that fact-based RAG generations are *more accurate* with sampling turned OFF completely: https://youtu.be/7oMTGhSKuNY?si=AaSaIFjNqvKCg_Gy and https://youtu.be/iXp1tj-pPjM?si=2iKiel5xhh2WqaIt)
6. Next Token Prediction works really well, perhaps unreasonably well, given its superficial simplicity. The training objective of next-token prediction is empirically, spectacularly effective, especially in conjunction with large-scale attention-based architectures (1B+ parameters) and massive-scale training data (1T+ tokens). The ability of the system, meaning the model together with its generation loop, to produce human-calibre output is oftentimes astonishing. However, when we carefully delineate the role of the model from the role of the system, it is hard to make the intuitive case that the model itself is engaging in anything resembling conceptual thought, as it is only able to produce a single token at a time. Still, it is potentially a profound insight that complex and cogent paragraphs can be generated one token at a time, subject to the limitations highlighted in points 1–5 above.
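To make point 1 above concrete, here is a minimal sketch (again using GPT-2 as an assumed stand-in, with an illustrative prompt) of a ‘broken’ loop that never appends the selected token; because the forward pass is deterministic and the model holds no state of its own, it simply produces the same first token on every call:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
input_ids = tokenizer("The quarterly results for the bank were", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(5):
        logits = model(input_ids).logits[:, -1, :]
        next_token = torch.argmax(logits, dim=-1)
        print(repr(tokenizer.decode(next_token)))
        # input_ids is never updated, so there is no 'state' anywhere to move
        # the generation forward; the same token is printed five times
```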
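For point 2, here is a sketch of a loop that deliberately ignores the end-of-text token and keeps feeding the growing context back to the model (sampling is used because the post-<|endoftext|> wandering is easiest to see with it; the prompt and settings are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
input_ids = tokenizer("A short note on the banking industry:", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(200):
        logits = model(input_ids).logits[:, -1, :]
        probs = torch.softmax(logits / 0.8, dim=-1)        # temperature sampling
        next_token = torch.multinomial(probs, num_samples=1)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        # no check against tokenizer.eos_token_id here, so generation simply
        # continues past any <|endoftext|> the model produces

# keep special tokens visible to see where <|endoftext|> appeared
print(tokenizer.decode(input_ids[0]))
```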
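And for point 5, a small sketch showing that the forward pass itself is deterministic, with the variation coming entirely from how the loop samples from the returned distribution (model and prompt again illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
input_ids = tokenizer("The key risk factors for the banking industry include", return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits[:, -1, :]    # same input -> same logits, every run

# deterministic choice: the argmax never changes between runs
greedy = tokenizer.decode(torch.argmax(logits, dim=-1))

# stochastic choice: the loop injects the randomness via sampling
probs = torch.softmax(logits / 0.8, dim=-1)       # temperature = 0.8
samples = [tokenizer.decode(torch.multinomial(probs, num_samples=1)[0]) for _ in range(5)]

print("greedy:", repr(greedy))
print("sampled:", [repr(s) for s in samples])
```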
There is a famous argument in philosophy, developed by John Searle, a long-time professor at UC Berkeley, known as the “Chinese Room” argument (first proposed in a paper in 1980). The basic premise of the thought experiment is to picture a closed room with a person inside who speaks only English and does not understand Mandarin, but who has a set of translation materials (e.g., well-designed “flash cards”) that show how to mechanically convert Mandarin into English and vice versa. There is a small mail slot in the room, and a person outside the room can pass a Mandarin text through the mail slot. Using the purely ‘syntactic’ translation materials, the person inside the room can produce a translation of the Mandarin text even though he or she does not understand it at all. The user outside the room, unaware of what is really going on inside, receives a response through the mail slot that appears to demonstrate that the person inside has successfully translated the document, and therefore must “understand” Mandarin. Searle framed this argument to refute Turing-style “strong AI” claims that, if an observer cannot tell the difference between the results of the “black box” room system and those of a human with genuine understanding, then the system constitutes AGI.
The “Chinese Room” argument has been one of the most famous thought experiments in academic philosophy for more than four decades. One of the most common counter-arguments is known as the “Systems” reply, which essentially concedes that while the person in the room does not understand Mandarin, it can fairly be said that the entire system, comprising the person, the room, and the syntactic translation materials, does have something that resembles an understanding of Mandarin.
When we talk about “models” and “AGI” in 2024, we should remember that the generation sampling loop is a core part of the system, and that without that generation loop, the model is very much like the person in Searle’s room: it generates one token at a time, but it is difficult to argue that it has real understanding. Whether you draw a conclusion skeptical of AGI, or simply make a “systems” argument that AGI requires more than just a model, we believe these are important insights for thinking about the direction of generative AI and the best use cases for leveraging it. Over the last two years, there has been a lot of obsession over the “model,” but we believe that the key to unlocking many powerful use cases is the combination of the model with associated data pipelines and workflows, whether in the retrieval pipeline, the management and tracking of metadata, the application of ontologies, rules and other knowledge bases, or the generation loop, as we have been highlighting in this piece.
To check out some of our small, specialized fine-tuned models (none of which claim to be AGI, but which humbly aspire to be really useful fact-based business tools when used in conjunction with sound generation loops and well-designed data pipelines), please go to our repo home page on HuggingFace: LLMWare RAG Instruct Models.
For more information about llmware, please check out our main github repo at llmware-ai/llmware/.
Please also check out video tutorials at: youtube.com/@llmware.