LLAMA Series
— technical evolution of LLAMA models
Introduction
The LLAMA (Large Language Model Meta AI) series developed by Meta has seen significant advancements with each iteration, from LLAMA-1 to LLAMA-3. The models have evolved through innovations in architecture, training techniques, data processing, and fine-tuning strategies for human-aligned response generation. Here, we delve into the technical details of each version, exploring the changes in architecture, optimization, fine-tuning, and data processing that drive their differences.
LLAMA-1: Establishing the Foundation
Tokenizer
- Algorithm: LLAMA-1 uses byte-pair encoding (BPE) through SentencePiece, a common tokenization strategy for handling subword units.
- Special Handling: It splits numbers into individual digits and falls back to byte-level encoding for unknown UTF-8 characters, which improves handling of non-standard characters and rare tokens (a tokenizer-training sketch follows this list).
- Training Data: The training dataset consists of approximately 1.4 trillion tokens, processed from diverse sources such as books, web data, and Wikipedia. Each token is seen once, except for Wikipedia and books, which undergo two epochs.
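To make the tokenizer setup concrete, here is a minimal sketch of training a comparable tokenizer with the SentencePiece Python API. The corpus path is a placeholder, and the flags mirror the behaviors described above rather than Meta's exact configuration; the 32k vocabulary matches the size used by the LLAMA models.

```python
import sentencepiece as spm

# Train a 32k-vocab BPE tokenizer in the spirit of LLAMA-1's setup.
# "corpus.txt" is a placeholder path; the flag values reflect the
# behaviors described in the paper, not Meta's exact settings.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="llama_bpe",
    model_type="bpe",
    vocab_size=32000,
    split_digits=True,   # "1234" is tokenized as "1", "2", "3", "4"
    byte_fallback=True,  # unknown UTF-8 sequences decompose into raw bytes
)

sp = spm.SentencePieceProcessor(model_file="llama_bpe.model")
print(sp.encode("Price: 1234€", out_type=str))
```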
Architecture
- Transformer Base: LLAMA-1 is based on the standard Transformer architecture (decoder-only), with the following enhancements:
- Pre-normalization: Instead of normalizing the output of each sub-layer, it normalizes the input to each transformer sub-layer using RMSNorm, which helps improve training stability.
- SwiGLU Activation Function: Replaces ReLU with the SwiGLU activation, a more expressive gated non-linearity. The feed-forward hidden dimension is set to (2/3) × 4d, which keeps the parameter count comparable to a standard 4d FFN while balancing computational efficiency with expressiveness.
- Rotary Positional Embeddings (RoPE): Uses rotary embeddings instead of absolute positional encodings. RoPE applies position-dependent rotations to the query and key representations, improving positional understanding on long sequences without the drawbacks of fixed embeddings. A minimal sketch of these three components follows this list.
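The following is a minimal PyTorch sketch of the three components above, written for clarity rather than performance. The dimensions and the RoPE base of 10,000 follow common LLAMA-style conventions and are illustrative, not Meta's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Pre-normalization without mean-centering: x / rms(x) * g."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """Gated feed-forward block; hidden width follows the (2/3) * 4d rule."""
    def __init__(self, dim: int):
        super().__init__()
        # Real implementations typically round this up to a hardware-friendly multiple.
        hidden = int(2 * 4 * dim / 3)
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

def rope(x, theta: float = 10000.0):
    """Rotary embeddings: rotate each (even, odd) feature pair by a
    position-dependent angle. x: (batch, seq, dim) with even dim."""
    b, s, d = x.shape
    pos = torch.arange(s, dtype=torch.float32).unsqueeze(-1)             # (s, 1)
    freqs = theta ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)   # (d/2,)
    angles = pos * freqs                                                  # (s, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```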
Training and Optimization
- Optimizer: Training uses the AdamW optimizer with hyperparameters β₁ = 0.9, β₂ = 0.95, weight decay of 0.1, and a cosine learning-rate schedule whose final learning rate is 10% of the maximum (a configuration sketch follows this list).
- Efficient Implementation: The training employs optimizations to reduce memory usage and computational cost, such as causal multi-head attention that avoids storing unnecessary attention weights.
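As an illustration, the optimizer setup above maps directly onto standard PyTorch APIs. The peak learning rate and step budget below are placeholders (the paper varies these by model size), and warmup is omitted for brevity.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(4096, 4096)   # stand-in for the transformer
max_lr = 3e-4                         # illustrative peak LR; varies by model size
total_steps = 100_000                 # placeholder step budget

optimizer = AdamW(model.parameters(), lr=max_lr,
                  betas=(0.9, 0.95), weight_decay=0.1)
# eta_min = 0.1 * max_lr decays the LR to 10% of the peak, as described above.
scheduler = CosineAnnealingLR(optimizer, T_max=total_steps, eta_min=0.1 * max_lr)

for step in range(total_steps):
    ...  # forward pass, loss.backward(), optimizer.step(), optimizer.zero_grad()
    scheduler.step()
```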
Strengths and Limitations
- Strengths: The open nature and efficient design make LLAMA-1 suitable for a wide range of research tasks. It provides a solid baseline for experimenting with large-scale language models.
- Limitations: It lacks some architectural innovations present in newer models, which results in performance gaps in tasks requiring longer context management and fine-grained reasoning.
LLAMA-2: Scaling and Refining the Model
LLAMA-2, released in July 2023, introduces several significant improvements over LLAMA-1. It comes in configurations of 7B, 13B, and 70B parameters and introduces a specialized variant, LLAMA-2-Chat, optimized for dialogue applications.
Tokenizer and Training Data
- Tokenizer: LLAMA-2 continues to use the same SentencePiece-based tokenizer as LLAMA-1, retaining the BPE approach, digit splitting, and byte-level fallback. The vocabulary remains at 32k tokens.
- Expanded Training Dataset: The pretraining corpus is increased by 40%, encompassing a wider range of high-quality sources, resulting in a total of 2 trillion tokens. This increase aims to enhance generalization and reduce hallucinations by up-sampling factual sources.
- Data Processing: New data filtering techniques were introduced to curate a cleaner dataset, removing low-quality content and balancing different domains (a toy illustration follows this list).
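The paper does not publish the exact filtering pipeline, but the flavor of such curation is easy to sketch. Everything below is illustrative: real pipelines add language identification, heuristic quality rules, and fuzzy deduplication on top of these basics.

```python
import hashlib

def clean_corpus(docs, min_words: int = 50):
    """Toy quality filter: drop very short documents and exact duplicates."""
    seen = set()
    for doc in docs:
        text = " ".join(doc.split())        # normalize whitespace
        if len(text.split()) < min_words:   # heuristic length filter
            continue
        digest = hashlib.sha256(text.encode()).hexdigest()
        if digest in seen:                  # exact-duplicate removal
            continue
        seen.add(digest)
        yield text
```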
Architectural Improvements
- Context Length Doubling: LLAMA-2 doubles the maximum context length from 2,048 to 4,096 tokens, improving the model's ability to handle longer input sequences effectively.
- Grouped-Query Attention (GQA): LLAMA-2 introduces GQA, in which groups of query heads share a single key/value head. This shrinks the key-value cache and reduces memory overhead during inference, making the model more scalable; it is applied to the larger variants (a sketch follows this list).
- Core Features Retained: Pre-normalization with RMSNorm, the SwiGLU activation, and RoPE positional embeddings are carried over from LLAMA-1.
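To see how GQA trims the key-value cache, here is a self-contained sketch. The head counts and weight shapes are illustrative; in a production implementation, k and v would be cached at the smaller n_kv_heads width, which is where the memory saving comes from.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(x, wq, wk, wv, n_q_heads: int, n_kv_heads: int):
    """Minimal GQA: n_q_heads query heads share n_kv_heads key/value heads.
    x: (batch, seq, dim); wk/wv project to the smaller KV width."""
    b, s, d = x.shape
    head_dim = d // n_q_heads
    group = n_q_heads // n_kv_heads  # query heads per KV head

    q = (x @ wq).view(b, s, n_q_heads, head_dim).transpose(1, 2)   # (b, hq, s, hd)
    k = (x @ wk).view(b, s, n_kv_heads, head_dim).transpose(1, 2)  # (b, hkv, s, hd)
    v = (x @ wv).view(b, s, n_kv_heads, head_dim).transpose(1, 2)

    # Broadcast each KV head to its group of query heads.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)

    scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
    mask = torch.triu(torch.full((s, s), float("-inf")), diagonal=1)  # causal mask
    attn = F.softmax(scores + mask, dim=-1)
    return (attn @ v).transpose(1, 2).reshape(b, s, d)

# Usage with toy dimensions: 8 query heads sharing 2 KV heads.
d, n_q, n_kv = 512, 8, 2
hd = d // n_q
x = torch.randn(2, 16, d)
wq = torch.randn(d, d) * 0.02
wk = torch.randn(d, n_kv * hd) * 0.02
wv = torch.randn(d, n_kv * hd) * 0.02
out = grouped_query_attention(x, wq, wk, wv, n_q, n_kv)  # (2, 16, 512)
```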
Fine-Tuning and Optimization Techniques
- Instruction Tuning: LLAMA-2 incorporates instruction tuning, where the model is fine-tuned on a set of high-quality instructions to improve its capability to follow prompts. This process includes supervised fine-tuning (SFT) with approximately 27,540 annotations.
- Iterative Fine-Tuning with RLHF: Reinforcement Learning from Human Feedback (RLHF) is used to fine-tune LLAMA-2-Chat. Two main RL algorithms are employed:
- Proximal Policy Optimization (PPO): A policy gradient method that updates the model iteratively based on feedback.
- Rejection Sampling: For the largest 70B variant, multiple outputs are sampled per prompt, ranked by a reward model, and the best ones are used for further fine-tuning. Smaller models are then fine-tuned on samples from the 70B model, distilling its capabilities (a sketch of the sampling loop follows this list).
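The rejection-sampling loop described above is simple to sketch. Here `policy.generate` and `reward_model.score` are hypothetical interfaces standing in for the actual generation and reward-scoring machinery; the selected pairs would feed a subsequent supervised fine-tuning step.

```python
import torch

def rejection_sample(policy, reward_model, prompts, k: int = 4):
    """For each prompt, draw k candidate responses from the current policy,
    score them with the reward model, and keep the highest-reward one."""
    selected = []
    for prompt in prompts:
        candidates = [policy.generate(prompt) for _ in range(k)]
        rewards = torch.tensor(
            [reward_model.score(prompt, c) for c in candidates]
        )
        selected.append((prompt, candidates[rewards.argmax().item()]))
    return selected  # (prompt, best_response) pairs for fine-tuning
```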
Strengths and Limitations
- Strengths: LLAMA-2 shows marked improvements in inference scalability, contextual understanding, and safety. The introduction of LLAMA-2-Chat enables robust dialogue handling.
- Limitations: While the use of GQA and context scaling provides scalability benefits, the largest configurations (70B) still pose computational challenges for real-time applications.
LLAMA-3: Cutting-Edge Advancements
LLAMA-3, released in April 2024, further builds on the architectural and training optimizations introduced in LLAMA-2. It aims to push open-source LLMs closer to the capabilities of models like GPT-4, while still emphasizing efficiency and accessibility.
Architectural Enhancements
- Grouped-Query Attention at All Scales: Where LLAMA-2 reserved GQA for its larger variants, LLAMA-3 adopts GQA in both the 8B and 70B models, improving inference efficiency and memory use for longer sequences.
- Larger Tokenizer Vocabulary: LLAMA-3 replaces the 32K SentencePiece vocabulary with a 128K-token tokenizer that encodes text substantially more efficiently, improving throughput and effective context usage.
- Rotary Embeddings Optimization: RoPE is retained with a much larger base frequency (θ = 500,000), which preserves positional resolution over longer contexts and improves handling of long-range dependencies (a short comparison follows this list).
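A quick way to see the effect of the larger base: the slowest-rotating RoPE frequency shrinks, so far-apart positions remain distinguishable at longer context lengths. This snippet just compares the two bases; the head dimension of 128 is illustrative.

```python
import torch

def rope_freqs(dim: int, theta: float):
    """Per-pair rotation frequencies used by rotary embeddings."""
    return theta ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)

# A larger base slows the rotation of the low-frequency pairs,
# stretching their wavelength to cover longer contexts.
print(rope_freqs(128, 10_000.0)[-1])   # base used by earlier LLAMA models
print(rope_freqs(128, 500_000.0)[-1])  # larger base for longer contexts
```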
Training Data and Techniques
- Increased Training Dataset Size: LLAMA-3's pretraining corpus grows to over 15 trillion tokens, roughly seven times more than LLAMA-2, with stronger quality filtering (including model-based classifiers) and about four times more code data to improve performance on technical and domain-specific tasks.
- Multi-Phase Training Regime: The training process is divided into phases that focus on different data distributions, with the final phase concentrating on high-quality and challenging tasks. This helps the model to generalize better and reduces overfitting.
Fine-Tuning Strategies
- Advanced Instruction Fine-Tuning: LLAMA-3 improves on LLAMA-2’s instruction tuning by using larger datasets and a more diverse set of instructions. Multiple rounds of iterative fine-tuning and human feedback are employed.
- Direct Preference Optimization (DPO): Alongside SFT, rejection sampling, and PPO, LLAMA-3's post-training pipeline adds DPO, which optimizes the model directly on human preference pairs without a separate reward-sampling loop, leading to more balanced and robust fine-tuning (a loss sketch follows this list).
- Safety and Alignment Improvements: LLAMA-3 uses enhanced safety-specific datasets and adversarial training to mitigate harmful outputs. Iterative red-teaming continues to refine the model’s responses.
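The core DPO objective is compact enough to sketch directly. The inputs are assumed to be summed per-response token log-probabilities under the policy and a frozen reference model, and beta = 0.1 is a typical but illustrative choice.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    """Direct Preference Optimization: push the policy's log-likelihood
    margin between chosen and rejected responses above the reference
    model's margin. Inputs are summed token log-probs per response."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()
```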
Strengths and Limitations
- Strengths: LLAMA-3 demonstrates cutting-edge performance across NLP tasks, especially in handling long contexts, complex reasoning, and domain-specific knowledge. It narrows the gap between open-source and proprietary models.
- Limitations: Despite optimizations, LLAMA-3’s largest configurations still require substantial computational resources. The trade-off between speed and accuracy remains a challenge for deployment in resource-constrained environments.
Comparative Analysis
1. Architectural Differences
- LLAMA-1: Standard Transformer with basic optimizations (RMSNorm, SwiGLU, RoPE).
- LLAMA-2: Adds GQA and doubles context length, improving scalability and long-sequence handling.
- LLAMA-3: Extends GQA to all model sizes, enlarges the tokenizer vocabulary to 128K, and raises the RoPE base frequency, optimizing for longer contexts and more efficient computation.
2. Training Data Evolution
- LLAMA-1: 1.4T tokens, limited epochs for most data sources.
- LLAMA-2: 2T tokens, with increased quality and more diverse data.
- LLAMA-3: 15T+ tokens, with multi-phase training to enhance generalization and performance.
3. Fine-Tuning and Safety
- LLAMA-1: No instruction tuning or RLHF.
- LLAMA-2: Introduces instruction tuning and RLHF with PPO and rejection sampling.
- LLAMA-3: Expands post-training with DPO and includes advanced safety measures, such as adversarial red-teaming.
4. Efficiency and Inference Optimization
- LLAMA-1: Basic efficient implementations for training.
- LLAMA-2: GQA and memory-efficient attention techniques.
- LLAMA-3: GQA at every model size and a more efficient 128K tokenizer for long-context management.
Conclusion
The evolution of LLAMA models from LLAMA-1 to LLAMA-3 reflects Meta’s commitment to pushing the boundaries of open-source large language models. Each iteration has introduced significant architectural and training improvements, making LLAMA models increasingly competitive with closed-source models like GPT-4. Understanding these technical differences helps developers and researchers leverage the strengths of each model version, selecting the most suitable one for their specific applications and deployment requirements.
References:
— LLAMA 1: Touvron et al., "LLaMA: Open and Efficient Foundation Language Models," arXiv:2302.13971 (2023).
— LLAMA 2: Touvron et al., "Llama 2: Open Foundation and Fine-Tuned Chat Models," arXiv:2307.09288 (2023).
— LLAMA 3: Meta AI, "Introducing Meta Llama 3" (April 2024).