LLM Fine-Tuning

Ankit Kumar · Oct 19, 2024

LLMs Series, Part 3


Introduction

In Part 2 of this LLMs series, we discussed prompt engineering and saw that it has some critical issues:

  1. It doesn't work well for smaller language models.
  2. Including more examples takes up context-window space that could otherwise be used to provide other useful information.

So, in this article, we will discuss different ways of performing LLM fine-tuning and their tradeoffs.

1. Full Fine-Tuning

Full fine-tuning updates all of the model’s parameters using a task-specific dataset. It modifies the entire set of weights, enabling the model to specialize in particular tasks or domains.

a. Single-Task Fine-Tuning

Description: The model is fine-tuned on a single task. All the model’s parameters are adjusted based on the specific task, such as text classification, summarization, or translation.

Benefits:

  • Achieves high performance on the specific task because the model’s parameters are fully optimized for that task.
  • Can adapt to specialized tasks where a generic language model may not perform well.

Tradeoffs:

  • Overfitting: The model may become too specialized for the task and lose its generalization ability.
  • High computational cost: Requires substantial computing resources, especially for large models.
  • Catastrophic Forgetting: The model may “forget” knowledge from other domains not related to the fine-tuning data.
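
As a concrete illustration, here is a minimal single-task full fine-tuning sketch using the Hugging Face Transformers Trainer. The model name, dataset, and hyperparameters are illustrative assumptions rather than recommendations; the key point is that no parameters are frozen, so every weight is updated.

```python
# Minimal full fine-tuning sketch: every parameter of the base model is updated.
# Assumes the Hugging Face `transformers` and `datasets` libraries; model name,
# dataset, and hyperparameters are illustrative only.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"          # small model chosen for the example
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

dataset = load_dataset("imdb")                  # single task: sentiment classification

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="full-ft",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    learning_rate=2e-5,                         # small LR to limit catastrophic forgetting
)

# No parameters are frozen, so all weights receive gradient updates.
Trainer(model=model, args=args,
        train_dataset=dataset["train"].shuffle(seed=42).select(range(2000)),
        eval_dataset=dataset["test"].select(range(500))).train()
```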

b. Multi-Task Fine-Tuning

Description: The model is fine-tuned on multiple tasks simultaneously. The training dataset consists of a mix of different tasks to expose the model to various types of data and prompts.

Benefits:

  • Improves the model’s generalization across multiple tasks due to exposure to a variety of contexts.
  • Reduces the risk of overfitting to a single task.

Tradeoffs:

  • Conflicting objectives: Different tasks might have conflicting requirements, which can reduce performance on individual tasks.
  • Computational intensity: Training on multiple tasks requires substantial data and computational resources.
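
Below is a minimal sketch of how a multi-task training mixture can be assembled, assuming an instruction-style format where every task is converted into (prompt, target) pairs. The task templates and examples are invented for illustration; in practice each task would come from its own dataset and the mixture ratios would be tuned.

```python
# Sketch of building a mixed, multi-task training set in the instruction-tuning style.
# Task names, templates, and examples are illustrative placeholders.
import random

classification_examples = [
    {"text": "The movie was fantastic.", "label": "positive"},
    {"text": "I would not recommend it.", "label": "negative"},
]
summarization_examples = [
    {"document": "Long article about climate policy ...", "summary": "Climate policy overview."},
]

def format_classification(ex):
    return {"prompt": f"Classify the sentiment: {ex['text']}\nSentiment:", "target": ex["label"]}

def format_summarization(ex):
    return {"prompt": f"Summarize: {ex['document']}\nSummary:", "target": ex["summary"]}

# Every example becomes the same (prompt, target) format, so one model can be
# trained on all tasks at once; shuffling interleaves tasks within each batch.
mixed = ([format_classification(e) for e in classification_examples]
         + [format_summarization(e) for e in summarization_examples])
random.shuffle(mixed)

for example in mixed:
    print(example["prompt"], "->", example["target"])
```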

2. Parameter-Efficient Fine-Tuning (PEFT)

Instead of updating all parameters, PEFT techniques aim to fine-tune a smaller subset of parameters. This approach helps reduce computational costs and mitigate the problem of catastrophic forgetting.


a. Selective Fine-Tuning

Description: Only a subset of layers (e.g., top layers, attention heads) is fine-tuned while freezing the rest of the model.

Benefits:

  • Reduces computational requirements and memory usage since only a fraction of the parameters is being updated.
  • Retains more of the original knowledge compared to full fine-tuning, reducing the risk of catastrophic forgetting.

Tradeoffs:

  • Limited specialization: The model might not achieve the same level of task-specific optimization as full fine-tuning.
  • Layer selection challenge: Identifying the optimal layers to fine-tune requires experimentation, and different tasks might benefit from tuning different parts of the network.
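
Here is a minimal PyTorch sketch of selective fine-tuning, assuming a DistilBERT classifier where everything is frozen and only the last transformer block and the classification head are unfrozen. Which layers to unfreeze is an illustrative assumption and would normally be found by experimentation.

```python
# Selective fine-tuning sketch: freeze the whole model, then unfreeze only the last
# transformer block and the classification head. The model name and the choice of
# layers to unfreeze are illustrative assumptions.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

for param in model.parameters():               # freeze everything first
    param.requires_grad = False

for name, param in model.named_parameters():   # unfreeze only the selected parts
    if "transformer.layer.5" in name or "classifier" in name or "pre_classifier" in name:
        param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} / {total:,}")
```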

b. Reparameterization Methods (e.g., LoRA)

i. Low-Rank Adaptation (LoRA)

Fig 3: LoRA reparameterization, where only the low-rank matrices A and B are trained.

Description: Introduces trainable low-rank matrices alongside the original weight matrices. Only these low-rank matrices are fine-tuned, while the original parameters are kept frozen.

Benefits:

  • Efficient fine-tuning: Significantly reduces memory and computational costs.
  • Maintains general knowledge: Since the original weights remain unchanged, the model retains its general capabilities while adapting to specific tasks.

Tradeoffs:

  • Complexity: The approach adds extra parameters to the model, which can complicate training.
  • Performance gap: May not reach the same level of performance as full fine-tuning for some complex tasks, although LoRA remains one of the most effective PEFT methods so far.
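
To make the idea concrete, here is a from-scratch sketch that wraps a single frozen linear layer with a low-rank update, so the effective weight becomes W + (alpha / r) · B·A and only A and B are trained. This is an illustration of the mechanic, not the `peft` library's implementation; the rank, scaling, and layer sizes are assumptions.

```python
# Minimal LoRA sketch: a frozen nn.Linear plus trainable low-rank matrices A and B.
# Illustrative only; dimensions, rank, and scaling are example choices.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad = False           # original weights stay frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # r x k, Gaussian init
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # d x r, zero init
        self.scaling = alpha / r

    def forward(self, x):
        # Frozen path plus low-rank update; only A and B receive gradients.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(nn.Linear(768, 768), r=8)
out = layer(torch.randn(4, 768))
print(out.shape)  # torch.Size([4, 768])
```

In practice, libraries such as Hugging Face `peft` apply this kind of wrapping automatically to selected weight matrices of a pretrained model (commonly the attention projections), so it rarely needs to be written by hand.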

c. Additive Methods

These methods add new parameters to the existing model architecture without modifying the original weights directly, allowing task-specific adaptation while preserving the original model’s capabilities.

i. Soft Prompts

Description: Instead of hand-crafting fixed input prompts, learnable embeddings (soft prompts) are prepended to the input sequences during training. These embeddings guide the model’s output for specific tasks.

Benefits:

  • Low computational overhead: Only a small number of parameters are introduced, making the approach lightweight.
  • Task-specific adaptation: This can be used for prompt-based tasks, improving performance without full fine-tuning.

Tradeoffs:

  • Limited effectiveness for complex tasks: Soft prompts may not provide enough task-specific adaptation for more complex tasks.
  • Dependent on original model performance: The success of soft prompts is often contingent on the quality of the pre-trained model.
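
A minimal sketch of the soft-prompt idea follows: a small matrix of learnable "virtual token" embeddings is prepended to the (frozen) token embeddings before they enter the model. The number of virtual tokens and the embedding size are illustrative assumptions.

```python
# Soft-prompt (prompt-tuning) sketch: learnable virtual-token embeddings are
# prepended to the input embeddings; the base model itself stays frozen.
# Dimensions and the number of virtual tokens are illustrative.
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, n_virtual_tokens: int = 20, embed_dim: int = 768):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_virtual_tokens, embed_dim) * 0.02)

    def forward(self, input_embeds):
        # input_embeds: (batch, seq_len, embed_dim) from the frozen embedding layer
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)   # prepend virtual tokens

soft_prompt = SoftPrompt()
token_embeds = torch.randn(2, 32, 768)         # stand-in for frozen embedding output
extended = soft_prompt(token_embeds)
print(extended.shape)                           # torch.Size([2, 52, 768])
```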

ii. Addition of Layers (Adapters)

Description: Extra layers (adapters) are inserted into the network between existing layers. These new layers are fine-tuned while the original model parameters are frozen.

Benefits:

  • Modularity: Adapters can be added for different tasks without affecting the original model.
  • Reduced memory footprint: Only the adapter parameters need to be stored and fine-tuned, making it suitable for large models.

Tradeoffs:

  • Increased model size: The addition of layers can increase the model’s size, potentially affecting inference speed.
  • Requires careful tuning: Finding the right configuration for adapters (e.g., size, placement) can be challenging.
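
Below is a minimal sketch of an adapter module in the common bottleneck style: a down-projection, a nonlinearity, an up-projection, and a residual connection. The hidden and bottleneck sizes are illustrative; in practice one such module would be inserted after each frozen transformer sub-layer.

```python
# Adapter sketch: a small bottleneck with a residual connection, inserted after
# a frozen transformer sub-layer. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        nn.init.zeros_(self.up.weight)          # start as a near-identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden_states):
        # Residual connection keeps the frozen model's output intact at initialization.
        return hidden_states + self.up(torch.relu(self.down(hidden_states)))

adapter = Adapter()
frozen_layer_output = torch.randn(2, 32, 768)   # stand-in for a frozen layer's output
print(adapter(frozen_layer_output).shape)       # torch.Size([2, 32, 768])
```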

Tradeoffs Summary

  • Full Fine-Tuning: Provides maximum task-specific adaptation but is computationally expensive and can lead to overfitting or catastrophic forgetting.
  • Parameter-Efficient Fine-Tuning: Balances computational cost and task adaptation by updating only a subset of parameters, often with tradeoffs in performance for complex tasks.
  • Additive Methods: Offer modularity and flexibility, with lower resource requirements but may not achieve the same level of adaptation as full fine-tuning.
