LLM Finetuning Best Practices
How many of you have had the following interaction with a client, partner, developer or friend over the last six months?
Conversation #1 - "(exuberant) we are going to fine-tune our own LLM models from open source - we have just picked up {{Llama | Falcon | Mistral | RedPajamas}} and it looks pretty straightforward."
Conversation #2 - "(discouraged) yeah, we tried that, and it didn't work out. Training models is hard."
We are big proponents of open source models - and believe that model fine-tuning is truly the "secret sauce" to deploy enterprise-grade open source models. Having said that, taking base foundation models and fine-tuning them for a particular domain or task is not a trivial process, and it requires a lot of expertise and skill to achieve the target model behavior.
Most "hello world" model training tutorials show how superficially easy it is to build a simple training loop and run a few forward and backward passes through a model. Oftentimes, these tutorials are platform-specific, designed to show the "ML Ops" lifecycle power of a particular tool. Usually the code samples are remarkably simple and straightforward.
Surprisingly, there are very few tutorials with practical guidance on LLM fine-tuning best practices. It is not hard to put cookies in the oven and burn them - what is hard is to make the perfect batch of cookies!
We would like to fill that gap in tutorials with a few best practices that we have learned the hard way as we fine-tune LLMs. Unfortunately, there is no single shortcut or perfect universal approach - generally, producing a good fine-tuned model takes a lot of hard work, multiple training runs, iterations on the datasets, hyper-parameter adjustments, and sometimes a little bit of luck to find a winning recipe. It can be a humbling, and at times frustrating, activity to work through iterations, debug issues, and keep moving a model towards the target behavior.
Here are a few areas of attention that, in our experience, are most critical:
#1 - Start with a Clear, Targeted Training Objective
It is easy to skip over this step as boilerplate, but this is actually the first and most important step to a successful fine-tuning. Models are not "mind readers" and training is not "magic." Before kicking off a fine-tuning initiative, it is important to define the goals - what exactly is the desired behavior that we are looking to see in the model? Are the objectives realistic, and do they map to a specific fine-tuning dataset?
What makes a good training objective?
Perhaps to state the obvious, LLMs are great language pattern learners. This is the single most important insight for good model fine-tuning. (Whether this represents real "intelligence" or just the appearance of "intelligence" is a topic for another day.)
Think of fine-tuning as teaching the model a specialized transformation function with an input and a target output based on that input. This is the core insight of "multiple shot" learning: if you give a model a few examples of a "transformation pattern," it is quite effective at emulating it on new examples. Fine-tuning is generally the process of providing one or more new specialized transformation patterns, usually with hundreds to thousands of examples of each specific pattern, that enable the model to fully adapt to that pattern and replicate it.
The most common pitfall that we see is viewing fine-tuning as a way to impart knowledge to a model, rather than as a way to teach a specific transformation pattern or combination of language patterns.
As an example, our work centers on retrieval-augmented generation (RAG) workflows in financial services and legal use cases, and our training objectives are usually focused on:
· Specific industry domain, e.g., financial, regulatory, legal;
· Specific source material document type, e.g., complex business, financial, and regulatory documents, and contracts; and
· Specific tasks/skills, e.g., critical reading comprehension, sophisticated extraction, fact-based analysis and summarization.
Within this combination of domain, document type, and task definition, a particular project may have a very specific training objective, such as multiple-choice classification of contract terms, sorting a list from largest to smallest, or providing answers in a targeted number of words or bullets.
These examples are illustrative, but our main point is: be specific and clear about what you are trying to achieve, and the more focused, the better.
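To make "specific" concrete, below is a minimal sketch of what narrowly targeted training samples might look like for a fact-based extraction objective in a RAG workflow. The schema (context / question / answer) and the sample content are illustrative assumptions rather than a prescribed format; the point is that every sample demonstrates the same transformation pattern.

```python
# Illustrative fine-tuning samples for a narrow, fact-based extraction objective.
# The schema and content are hypothetical -- what matters is that every sample
# exercises the same input-to-output transformation pattern.

training_samples = [
    {
        "context": "The Services Agreement is effective as of March 1, 2023, "
                   "and has an initial term of twenty-four (24) months.",
        "question": "What is the initial term of the agreement?",
        "answer": "24 months",
    },
    {
        "context": "Total revenue for the quarter was $4.2 million, an increase "
                   "of 12% over the prior-year period.",
        "question": "What was the total revenue for the quarter?",
        "answer": "$4.2 million",
    },
    # ... hundreds to thousands of samples, all exercising the same pattern
]
```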
At this point, it is also worth asking the threshold question: do you need to fine-tune the model, or is the behavior "out of the box" sufficiently aligned to the intended use case? If you are struggling to define the specific training objective and strategy, it may be a sign that fine-tuning is not really necessary - and that it is better to skip fine-tuning and focus on other facets of the LLM RAG workflow.
#2 - The Fine-tuning Dataset is the Value Creation

While it may go without saying, it is an important reminder: model fine-tuning is all about the dataset.
This is where the value will be created.
This is the heavy lifting.
This is the hard part.
And there is no substitute for rolling up your sleeves and getting your hands dirty with building, curating, cleaning, and reviewing the fine-tuning samples.
The dataset is the set of instructions that will be translated into adjustments to the model's parameters as it learns to minimize the loss on the intended training objective. Subject the training samples to scrutiny: do they map to the training objective and cover a wide range of expected potential scenarios? This is a trial-and-error process, and it can be extremely time-consuming.
At the outset, if the fine-tuning dataset is not well-designed at both a high level and in the details - and with sufficient breadth and depth of examples - it will be impossible to compensate with other steps in the process. It is also critical that there is alignment between Step #1 and Step #2, and usually iteration, as the training objectives should be narrowed and clarified based on the availability of applicable datasets. You can think of these two steps as the master chef getting the right ingredients together: if you don't have good, high-quality ingredients, it is very difficult to move ahead in the process. Most of the time in the fine-tuning lifecycle should be spent on this step.
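Because the dataset is where the value is created, it pays to script basic sanity checks before every training run. Below is a minimal sketch, assuming the samples are stored as one JSON object per line with the hypothetical context/question/answer fields from the earlier example; adapt the checks and the file name to your own schema.

```python
# Minimal dataset sanity checks before a training run -- a sketch, not a full
# data-quality pipeline. Assumes one JSON object per line with "context",
# "question", and "answer" fields (a hypothetical schema).

import json

REQUIRED_FIELDS = ("context", "question", "answer")

def review_dataset(path: str) -> None:
    seen = set()
    problems = 0
    with open(path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            sample = json.loads(line)
            # Every sample should have all fields, and none should be empty.
            if any(not str(sample.get(field, "")).strip() for field in REQUIRED_FIELDS):
                print(f"line {i}: missing or empty field")
                problems += 1
            # Exact duplicates usually signal a problem in dataset assembly.
            key = (sample.get("context", ""), sample.get("question", ""))
            if key in seen:
                print(f"line {i}: duplicate context/question pair")
                problems += 1
            seen.add(key)
    print(f"review complete: {problems} potential issues found")

review_dataset("train_samples.jsonl")  # hypothetical file name
```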
#3 - Model Hyper-Parameters - Learning Rate is the Key

Once you finally have your plan defined and ingredients assembled in steps 1 and 2, it is time for training!
There are a lot of things that can go wrong in a training process, and many best-practices guides to consult. While some may disagree, in our experience, getting the learning rate wrong is the most common way that fine-tuning runs go off track. Intuitively, the learning rate defines the size of the step that is taken when applying the "learning" from the gradient and back-propagating it through the model. We like to think of this as analogous to the oven temperature when cooking - if the temperature is too high, you will likely burn the food; if the temperature is too low, the food may not get the texture you expect, or may look good on the outside but be soft and undercooked on the inside.
We find that most "base foundation model" papers provide fairly unclear (and sometimes inaccurate) learning rate guidelines for subsequent downstream fine-tunings of their base models. Sometimes this is because their base training was done with huge batch sizes and parallelization, and sometimes because there simply was not a lot of experimentation done on how the model would be used in fine-tuning.
A few useful "rules of thumb" for fine-tuning LR settings for GPT-based decoder models (a short scheduling sketch follows the list):
· LR - for fine-tune training, set the peak learning rate (LR) in a range around 1.0 x 1e-5, subject to a warm-up and gradual step-down decay over the training lifecycle. We rarely see a model that responds well to fine-tuning above 1.5-2.0 x 1e-5 at the high end, and similarly the lower end of the range is ~0.5 x 1e-5. This can be a trial-and-error experiment with most models, but this is the range where we typically see the best results. (Of course, there are exceptions, usually discovered through trial and error, which can potentially be offset by other hyper-parameter adjustments.) This also assumes relatively small batch sizes compared to the base training, e.g., 2-16 per batch/gradient accumulation step per GPU.
· Warm-up - usually 3-10% of the training steps. Different base models have different receptivity to "foreign" fine-tuning materials, often with different "wrappers," and without a warm-up period it is easy to blow up the model and cause some form of catastrophic forgetting. A few hundred steps of warm-up usually does the trick, but even this can require some experimentation, with different models needing more or less warm-up for optimal training.
· Trouble-shooting - watch the model closely during training while at peak LR; if you see steady increases in loss during this period, it is a good sign that the peak LR is too high and needs to be adjusted downwards.
· Decay / step-down through training - we have not found any special difference between particular decay formulas; in fact, we often use a simple linear step-down at specific training steps. The important part is that the LR declines over time. We usually decay down to 0.4-0.5 x 1e-5, but not much lower than that, to try to squeeze a little incremental optimization out of the latter part of the fine-tuning dataset.
· Gradient clipping - especially important with some models, and not for others.
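Putting the rules of thumb above together, here is a minimal sketch of a warm-up plus step-down schedule with a floor, written with PyTorch and Hugging Face transformers. The model name, step counts, and thresholds are illustrative assumptions, and a continuous linear decay stands in for the stepwise decay described above.

```python
# A sketch of the LR recipe above: linear warm-up to a peak around 1e-5,
# then a linear decay down to a floor around 0.5 x 1e-5, with gradient clipping.
# The model name and all specific values are illustrative assumptions.

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")  # assumed base model

total_steps = 3000      # total optimizer steps in the fine-tune
warmup_steps = 150      # ~5% of total steps
peak_lr = 1.0e-5        # peak LR after warm-up
floor_lr = 0.5e-5       # do not decay much below this
max_grad_norm = 1.0     # gradient clipping threshold

optimizer = AdamW(model.parameters(), lr=peak_lr)

def lr_lambda(step: int) -> float:
    # Multiplier applied to peak_lr: warm up linearly, then decay linearly to the floor.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    floor = floor_lr / peak_lr
    return max(floor, 1.0 - progress * (1.0 - floor))

scheduler = LambdaLR(optimizer, lr_lambda)

# Inside the training loop, after loss.backward():
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```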
#4 - Training Passes - Only Use 1 Epoch

Don't use the same data sample more than once in a training cycle. We remember training CNNs in the relatively old days, when it was commonplace to run the same sample potentially dozens of times across 10-20 epochs (or more) in a particular training cycle. The power of the transformer is its ability to grasp and learn patterns incredibly quickly. As a rule of thumb, if you need 2-3 full passes of your training set to start to see the targeted behavior, it is a good sign that you are on the right path, but that the training dataset is too small. Once the training data is the right size, a single training pass should do it, with the model "seeing" each sample only once. Some people might disagree - oftentimes, there are recommendations to do 2-4 passes over each sample - but in our experience, this is a good guideline to avoid over-fitting and to assess whether the training data is sufficiently large relative to the training objective. It is also a good test of whether the training objective is sufficiently well-defined and realistic. Especially in a fine-tuning process, where each sample carries a relatively large amount of influence, quality, not quantity, is critical.
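If you use the Hugging Face Trainer rather than a hand-rolled loop, the single-epoch guideline and the hyper-parameters above map onto a configuration roughly like the sketch below. All specific values are illustrative assumptions, and note that the built-in linear schedule decays to zero rather than to a floor.

```python
# A rough Trainer configuration reflecting the guidelines above: one epoch,
# small batches, ~5% warm-up, peak LR around 1e-5, and gradient clipping.
# All specific values are illustrative assumptions for a hypothetical run.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="finetune-checkpoints",
    num_train_epochs=1,                # each sample is seen only once
    per_device_train_batch_size=4,     # small batches relative to base training
    gradient_accumulation_steps=4,
    learning_rate=1e-5,                # peak LR
    warmup_ratio=0.05,                 # ~5% of steps for warm-up
    lr_scheduler_type="linear",        # simple linear step-down (decays to zero)
    max_grad_norm=1.0,                 # gradient clipping
    logging_steps=10,
    save_strategy="steps",
    save_steps=500,                    # keep checkpoints to compare later
)
```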
#5 - A Real Held-Out Testing Dataset

Machine learning conventional wisdom from the beginning of time is that you should hold out test and validation sets from the total dataset. Our approach is to develop a formal testing set from scratch that is similar in principle to the training dataset, but was not prepared in the same process. When samples are prepared in the same process and from the same source materials, there will often be an "implicit" set of similarities that can inflate the results and lead to "test" results that are substantially better than the "real" results seen in the wild. We would recommend building a (small, but well-designed) test dataset that purposefully includes a couple of adjacent areas that should be natural extensions of the training, but that are not formally in the training dataset. As an example, we may use invoices as a testing set for a training set that uses financial tables, as we are looking to see whether the model can "generalize" to a related domain with a very similar set of patterns.

As you work on a fine-tuning project, you will likely train the model multiple times, and it is extremely helpful to have a test benchmark to compare results and determine which checkpoint of the model is performing best overall. We were surprised recently when we thought a particular checkpoint was under-performing because we were focused on a very specific question, only to realize that the model was failing on that one single test, but performing much better than any of our other checkpoints across the full testing set.
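It also helps to automate the checkpoint comparison, so that no single question dominates the judgment. Below is a minimal sketch; the checkpoint paths, the test-file schema, and the simple containment-based scoring are illustrative assumptions, and real evaluation usually benefits from a more forgiving scorer.

```python
# Compare fine-tuned checkpoints against the same held-out test set.
# A sketch: paths, schema, and the containment-based metric are illustrative
# assumptions; adapt the prompt format and scoring to your own setup.

import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

CHECKPOINTS = [
    "finetune-checkpoints/checkpoint-500",
    "finetune-checkpoints/checkpoint-1000",
]

def score_checkpoint(checkpoint_path: str, test_file: str) -> float:
    tokenizer = AutoTokenizer.from_pretrained(checkpoint_path)
    model = AutoModelForCausalLM.from_pretrained(checkpoint_path)
    model.eval()
    correct, total = 0, 0
    with open(test_file, "r", encoding="utf-8") as f:
        for line in f:
            sample = json.loads(line)
            prompt = f"{sample['context']}\n{sample['question']}\n"
            inputs = tokenizer(prompt, return_tensors="pt")
            with torch.no_grad():
                output = model.generate(**inputs, max_new_tokens=50)
            # Decode only the newly generated tokens.
            new_tokens = output[0][inputs["input_ids"].shape[1]:]
            answer = tokenizer.decode(new_tokens, skip_special_tokens=True)
            correct += int(sample["answer"].lower() in answer.lower())
            total += 1
    return correct / max(1, total)

for ckpt in CHECKPOINTS:
    print(f"{ckpt}: {score_checkpoint(ckpt, 'heldout_test.jsonl'):.3f}")
```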
#6 - Iterate, Iterate, Iterate

To bake the perfect batch of cookies takes a lot of training runs. Keep iterating through steps 1-5 and review every single element in detail to keep optimizing and improving the results. Most big progress comes from very small adjustments - cleaning up the formatting of the training samples, removing a few problematic samples, minor adjustments to the learning rate, etc.
We hope that you have picked up a couple of useful nuggets to aid you in your next fine-tuning project - Happy Finetuning!
To check out some of our fine-tuned models, please go to our repo home page on HuggingFace - LLMWare RAG Instruct Models.
For more information about llmware, please check out our main github repo at llmware-ai/llmware/.
Please also check out video tutorials at: youtube.com/@llmware.