Background
During 2020–2021, AI researchers discovered a powerful new model recipe, launching the “LLM” (Large Language Model) era. This recipe consisted of at least three major elements:
1. GPT: a decoder-only, attention-based transformer architecture;
2. Internet-scale datasets, with a self-supervised predict-next-token training objective (sketched below); and
3. 100B+ parameters: an unprecedented scale of model size.
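For readers who want to see what the “self-supervised predict-next-token” objective looks like in code, here is a minimal sketch using the Hugging Face transformers library; the model name and toy sentence are illustrative placeholders, not the actual GPT-3 training setup.

```python
# Minimal sketch of the self-supervised next-token (causal LM) objective.
# The model name and text below are illustrative placeholders.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "EleutherAI/pythia-410m"  # any small decoder-only model works for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

text = "Large language models are trained to predict the next token."
inputs = tokenizer(text, return_tensors="pt")

# With labels == input_ids, the library shifts the targets internally so that
# each position is trained to predict the *next* token; no human labels are needed.
outputs = model(**inputs, labels=inputs["input_ids"])
print(f"next-token cross-entropy loss: {outputs.loss.item():.3f}")
```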
This new recipe was heralded by the “Language Models are Few-Shot Learners” paper from OpenAI in May 2020, which introduced GPT-3 and crystallized the power of this formula with the revolutionary observation of “few-shot” learning, an unexpected “emergent” behavior of models that follow this recipe. This profound insight, in turn, has fueled two major trends in LLM training over the last 24 months:
AI Trend #1: the “Bigger is Better Arms Race.” This was the biggest story in 2021 and 2022, as major AI R&D shops trained truly mega models from scratch, at great cost, in the 100B+ parameter range, with several rumored to be in the 1T+ range, in an arms race to train larger and larger models. This trend seems to have cooled, presumably due to the incredible cost of both training and deploying these models, and perhaps a recognition that, at least with current model architectures, there are diminishing marginal returns above a certain model size.
AI Trend #2: “Instruct training became a thing.” While the ascendance of instruct training has not commanded the same headlines, we would argue that the formalization of instruction training, and of methods for building high-quality instruct datasets, is an even more fundamental change in unlocking generative AI. It has led to a fourth standard element in the LLM recipe: after a “base” training layer at Internet scale (300 billion to 2 trillion tokens), fine-tune with a high-quality “instruct” layer, with the instruct training driving most of the sophisticated target behavior for specific use cases.
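As a rough illustration of what this fourth “instruct” layer involves, the sketch below formats a couple of instruction/response pairs into a simple prompt template and continues causal-LM training on them with Hugging Face transformers; the base model, template, and hyperparameters are assumptions chosen for illustration, not a recommended recipe.

```python
# Rough sketch of instruct fine-tuning: take a pretrained base decoder and
# continue causal-LM training on formatted instruction/response pairs.
# The base model, prompt template, and hyperparameters are illustrative only.
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from datasets import Dataset

base = "EleutherAI/pythia-1b"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

pairs = [
    {"instruction": "What is the capital of France?", "response": "Paris"},
    {"instruction": "Summarize: The meeting was moved to Friday.", "response": "The meeting is now on Friday."},
]

def format_and_tokenize(example):
    # Wrap each pair in a simple human/bot template and train on the whole sequence.
    text = f"<human>: {example['instruction']}\n<bot>: {example['response']}{tokenizer.eos_token}"
    tokens = tokenizer(text, truncation=True, max_length=512, padding="max_length")
    tokens["labels"] = tokens["input_ids"].copy()  # a production setup would mask padding with -100
    return tokens

dataset = Dataset.from_list(pairs).map(format_and_tokenize, remove_columns=["instruction", "response"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="instruct-ft", per_device_train_batch_size=1, num_train_epochs=1),
    train_dataset=dataset,
)
trainer.train()
```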
This formalization of instruct fine-tune training as a “thing” in its own right has led to a remarkable proliferation of high-quality, permissively licensed (mostly Apache 2.0) 7B–100B+ GPT base models, such as Llama2, Pythia, Falcon, Mistral, Cerebras, and Together’s RedPajama, serving as the “cake,” and an ever-growing market of builders, makers, consultants, companies, and start-ups tuning instruct layers as the “icing” of the LLM. If you have any doubt about the energy this has created across the open source developer community, check out the Open LLM Leaderboard on HuggingFace. The evolution of this formula, especially throughout 2023, is arguably one of the greatest examples of open, democratized innovation we have ever seen in the tech space, and we are still in the early stages...
Research Objective: The Interplay between Model Size and Instruct Training
At Ai Bloks, and in our open source research arm llmware, our focus is building enterprise LLM applications on top of a “retrieval augmented generation” (RAG) foundation. Over the last 18 months, as we have been training our own LLMs for our commercial offerings, we have been experimenting with different facets of this four-layer LLM formula. We have largely taken the first two elements, attention-based decoder architectures and Internet-scale next-token causal training, as a given. Our interest is in the interplay of the latter two elements, model size and instruct training, and in asking a few questions:
What are the smallest models that can demonstrate meaningful instruction following behavior?
To what extent can targeted high-quality instruction training offset the obvious benefit of larger model size?
Do “hallucinations” and other aberrant behaviors occur at lower, higher, or the same frequencies in instruct-trained smaller LLMs?
How do practical constraints, such as smaller context windows, a narrower domain scope, and a focused instruction set, improve the instruction-following performance of smaller LLMs?
Do smaller models have a viable role to play in retrieval augmented generation (RAG) scenarios, or will larger models always be the right answer for production use cases?
Our premise is two-fold:
Lessons learned on the smallest possible models will pay meaningful dividends when applied back upstream to larger models, by sharpening and differentiating training objectives and datasets; and
CPU-based models are very useful in local testing. Especially with RAG, most high-value use cases involve confidential enterprise information, and the ability to develop workflows, test, and rapidly build POCs with a local CPU-based model is a very useful asset, even if that model is ultimately “swapped out” when moving into production (see the sketch below).
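As an example, the kind of local, CPU-only smoke test we have in mind looks roughly like the sketch below; the specific model name, prompt wrapper, and passage are illustrative assumptions rather than a prescribed llmware workflow.

```python
# Hypothetical local CPU smoke test: answer a question from a supplied passage
# with a small instruct model. The model name and prompt wrapper are assumptions.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "llmware/bling-1b-0.1"  # any small, CPU-friendly instruct model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)  # runs on CPU by default

passage = (
    "The services agreement has a term of 24 months, beginning on January 1, 2023, "
    "and may be renewed by mutual consent of the parties."
)
question = "What is the term of the agreement?"

# Simple context-plus-question wrapper (the template here is illustrative).
prompt = f"<human>: {passage}\n{question}\n<bot>:"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=50, do_sample=False)
answer = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer.strip())
```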
Introducing BLING Model Series
To research these questions, we launched the BLING (Best Little Instruct-following No-GPU) model series on HuggingFace (link: https://huggingface.co/llmware).
We started with Apache 2.0 licensed, high-quality decoder model bases that had not yet had any instruct fine-tuning and that were deployable on a standard laptop CPU (i.e., fewer than 3 billion parameters) without any special quantization techniques.
Currently, we are training on three base GPT models (Pythia, Falcon, and Cerebras) and have focused so far on finding the smallest possible decoder model that shows consistent question-answering instruct-following behavior. We have been training and testing models extensively in the 100M to 3B parameter range, with a primary focus on models in the 1.0–1.5B parameter range.
We are experimenting with different filtered, bespoke training datasets with varying combinations of fact-based question-answering, key-value extraction, Boolean question-answering (yes/no), recognition of “not found”, long-form summarization (e.g., summarize with bullet points), and short-form “x-summarization” (e.g., summarize in 20 words or less), as illustrated below.
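To make that dataset composition concrete, the samples below are hypothetical illustrations of these task categories; they are not drawn from our actual training data.

```python
# Hypothetical examples of the instruct task categories described above
# (illustrative only; not drawn from the actual BLING training datasets).
samples = [
    {   # fact-based question-answering over a context passage
        "context": "The lease term is 36 months with a monthly rent of $4,500.",
        "instruction": "What is the monthly rent?",
        "response": "$4,500",
    },
    {   # key-value extraction
        "context": "Invoice #4482 was issued to Acme Corp on March 3, 2023.",
        "instruction": "Extract the invoice number.",
        "response": "4482",
    },
    {   # Boolean question-answering (yes/no)
        "context": "The agreement may be terminated with 30 days written notice.",
        "instruction": "Can the agreement be terminated? (yes/no)",
        "response": "Yes",
    },
    {   # recognition of "not found"
        "context": "The warranty covers parts and labor for one year.",
        "instruction": "What is the governing law of the contract?",
        "response": "Not Found",
    },
    {   # short-form "x-summarization" with a length constraint
        "context": ("The parties agreed to extend the delivery deadline by two weeks "
                    "due to supply chain delays affecting the primary vendor."),
        "instruction": "Summarize in 20 words or less.",
        "response": "Delivery deadline extended by two weeks because of vendor supply chain delays.",
    },
]
```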
Over the last two weeks, we have launched the first four BLING models on HuggingFace, with many more to come.
If you are interested in the topic of instruct training on smaller open source models, please reach out; we always welcome collaboration and the sharing of ideas and best practices.
Next Steps
Please keep reading in the next blog post (“Evaluating LLM Performance in RAG Instruct Use Cases”), which rolls out a new RAG testing framework and evaluates BLING model performance.
If you are interested in learning more about implementing LLMs for RAG use cases, please check out our LLMware open-source project at https://github.com/llmware-ai/llmware and our Hugging Face models at https://huggingface.co/llmware.