New Paradigms in LLM Pre-training and Post-training: Innovations by Alibaba, Apple, Google, and Meta AI

October 25, 2024

Introduction

The field of artificial intelligence has come a long way, particularly in the development of large language models (LLMs). From early models such as GPT-3 to the sophisticated open-weight models available today, LLMs have evolved significantly, and so have the methodologies used to train them. Initially, the focus was almost entirely on pre-training, but modern pipelines pair pre-training with a post-training phase. This shift, popularized by models like ChatGPT, aims to better align models with human preferences through supervised instruction fine-tuning and alignment techniques such as reinforcement learning from human feedback (RLHF).

Qwen 2 by Alibaba

Overview

Alibaba's Qwen 2 models are available in sizes ranging from 0.5 billion to 72 billion parameters. These models exhibit strong multilingual capabilities across 30 languages and boast a large 151,642-token vocabulary. While most of the models were trained on 7 trillion tokens, the 0.5 billion parameter model was trained on an impressive 12 trillion tokens.

Pre-training

Qwen 2 employs a two-stage pre-training process. The models first undergo standard pre-training, followed by long-context training that extends the context length from 4,096 to 32,768 tokens. To enhance in-context learning and instruction-following abilities, the training data also includes synthetic examples generated by earlier Qwen models.
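
To make the long-context stage a little more concrete, the sketch below shows rotary position embeddings (RoPE) with an adjustable base frequency; raising the base during a long-context training stage is a common way to keep attention well behaved at longer sequence lengths. The function names and the specific base values here are illustrative assumptions, not Qwen 2's actual configuration.

```python
# Minimal RoPE sketch with an adjustable base frequency (illustrative only).
import torch

def rope_angles(seq_len: int, head_dim: int, base: float) -> torch.Tensor:
    """Return the (seq_len, head_dim/2) matrix of rotation angles used by RoPE."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    return torch.outer(positions, inv_freq)  # angle[p, i] = p * inv_freq[i]

def apply_rope(x: torch.Tensor, base: float = 10_000.0) -> torch.Tensor:
    """Rotate query/key vectors x of shape (seq_len, head_dim) by their position."""
    seq_len, head_dim = x.shape
    angles = rope_angles(seq_len, head_dim, base)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, : head_dim // 2], x[:, head_dim // 2 :]
    return torch.cat([x1 * cos - x2 * sin, x2 * cos + x1 * sin], dim=-1)

# With a larger base, distant positions receive smaller rotation angles,
# which helps long-range attention stay well behaved at 32k tokens.
q = torch.randn(32_768, 128)
short_ctx = apply_rope(q, base=10_000.0)
long_ctx = apply_rope(q, base=1_000_000.0)
print(short_ctx.shape, long_ctx.shape)
```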

Post-training

The post-training phase for Qwen 2 involves supervised fine-tuning on roughly 500,000 examples. Direct Preference Optimization (DPO) is then used to align the model with human preferences. This alignment occurs in two stages: an offline stage that uses a pre-existing preference dataset and an online stage that forms preference pairs in real time during training.
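
At its core, DPO is a logistic loss on the log-probability ratios of a preferred and a rejected response under the policy and a frozen reference model. Below is a minimal sketch of that loss; the tensor names and the beta value are illustrative, not Qwen 2's training code.

```python
# Minimal sketch of the DPO objective for offline preference alignment.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Each argument is a tensor of summed log-probabilities of the chosen /
    rejected response under the policy or the frozen reference model."""
    # Log-ratio of policy to reference for each response in the preference pair.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Maximize the margin between chosen and rejected via a logistic loss.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy usage with random log-probabilities for a batch of 4 preference pairs.
logps = [torch.randn(4) for _ in range(4)]
print(dpo_loss(*logps).item())
```

The online variant follows the same objective; the difference is that the preference pairs are produced by sampling from the current policy and scoring the candidates during training rather than reading them from a fixed dataset.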

Apple Intelligence Foundation Models (AFM)

Overview

Apple has developed two versions of its foundation models (AFM): a roughly 3-billion-parameter model designed for on-device use and a larger model, whose size has not been disclosed, intended for server use.

Pre-training

Apple's pre-training process is conducted in three stages. The core pre-training stage trains the server model on 6.3 trillion tokens. This is followed by continued pre-training on a smaller, higher-quality data mixture that up-weights math and code. The final stage extends the context length to 32,768 tokens using synthetic long-context Q&A data.
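
The staged approach can be pictured as a change in the sampling mixture between stages. The toy sketch below draws training documents from different domain mixtures for the core and continued stages; the domain names and weights are invented for illustration, since Apple has not published its exact data mix.

```python
# Toy sketch of stage-wise data-mixture re-weighting (weights are invented).
import random

CORE_MIX      = {"web": 0.80, "code": 0.10, "math": 0.05, "books": 0.05}
CONTINUED_MIX = {"web": 0.40, "code": 0.30, "math": 0.25, "books": 0.05}

def sample_domain(mixture: dict[str, float]) -> str:
    """Pick the data domain for the next training document according to the mixture."""
    domains, weights = zip(*mixture.items())
    return random.choices(domains, weights=weights, k=1)[0]

# Core pre-training draws mostly from web text; continued pre-training shifts
# probability mass toward code and math, mirroring the idea of a smaller,
# higher-quality, domain-focused second stage.
print([sample_domain(CORE_MIX) for _ in range(5)])
print([sample_domain(CONTINUED_MIX) for _ in range(5)])
```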

Post-training

For post-training, Apple follows a two-step process: supervised instruction fine-tuning and several rounds of reinforcement learning with human feedback (RLHF). Apple introduces two new algorithms in this phase: Rejection Sampling Fine-tuning with Teacher Committee (iTeC) and RLHF with Mirror Descent Policy Optimization.
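
The "rejection sampling with a teacher committee" idea can be sketched as: sample several candidate responses, score them with a committee of models, and keep the best one for the next round of fine-tuning. The snippet below is a conceptual sketch with placeholder generate and scoring functions; Apple's actual iTeC implementation is not public.

```python
# Conceptual sketch of rejection sampling with a committee of scorers.
import random
from typing import Callable

def rejection_sample(prompt: str,
                     generate: Callable[[str], str],
                     committee: list[Callable[[str, str], float]],
                     num_candidates: int = 8) -> str:
    """Draw several candidate responses and keep the one the committee scores highest."""
    candidates = [generate(prompt) for _ in range(num_candidates)]

    def committee_score(response: str) -> float:
        return sum(score(prompt, response) for score in committee) / len(committee)

    return max(candidates, key=committee_score)

# Toy stand-ins: a "model" that emits random responses and two "reward models".
toy_generate = lambda prompt: f"response-{random.randint(0, 99)}"
toy_committee = [lambda p, r: random.random(), lambda p, r: random.random()]

best = rejection_sample("Explain RoPE in one sentence.", toy_generate, toy_committee)
print(best)  # The selected response would be added to the next fine-tuning dataset.
```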

Google's Gemma 2

Overview

Google's Gemma 2 models are available in sizes of 2 billion, 9 billion, and 27 billion parameters, each with a sizable vocabulary of 256k tokens.

Pre-training

Google emphasizes the quality of the pre-training data over its sheer size. The 27B model is trained from scratch on 13 trillion tokens, while the smaller 9B and 2B versions are trained with knowledge distillation from a larger teacher model rather than on one-hot next-token targets alone.
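
In its simplest token-level form, knowledge distillation trains the student to match the teacher's full next-token distribution rather than only the ground-truth token. The sketch below shows that loss; the temperature, shapes, and weighting are illustrative, not Gemma 2's exact recipe.

```python
# Minimal sketch of token-level knowledge distillation.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL divergence between teacher and student next-token distributions.
    Both logits tensors have shape (batch, seq_len, vocab_size)."""
    s_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s_logprobs, t_probs, reduction="batchmean") * temperature**2

# Toy example with random logits; a real vocabulary would be far larger.
student = torch.randn(2, 8, 1000)
teacher = torch.randn(2, 8, 1000)
print(distillation_loss(student, teacher).item())
```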

Post-training

The post-training phase involves the usual supervised fine-tuning and RLHF. The reward model used in RLHF is reported to be ten times larger than the policy model. Additionally, a model-averaging method called WARP (Weight Averaged Rewarded Policies), a successor to WARM (Weight Averaged Reward Models), is employed.
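
The basic building block of the WARM/WARP family is plain parameter averaging across models that share an architecture. The sketch below shows that operation in isolation; the real WARP procedure interleaves averaging with RLHF updates and uses additional interpolation steps that are omitted here.

```python
# Toy sketch of model weight averaging, the core operation behind WARM/WARP-style methods.
import torch
import torch.nn as nn

def average_state_dicts(state_dicts: list[dict]) -> dict:
    """Element-wise mean of the parameters of several same-architecture models."""
    keys = state_dicts[0].keys()
    return {k: torch.stack([sd[k].float() for sd in state_dicts]).mean(dim=0)
            for k in keys}

# Toy usage: three small "policies" with identical architecture.
policies = [nn.Linear(16, 4) for _ in range(3)]
merged = nn.Linear(16, 4)
merged.load_state_dict(average_state_dicts([p.state_dict() for p in policies]))
print(merged.weight.shape)
```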

Llama 3.1 by Meta AI

Meta AI's Llama 3.1, released in 8-billion-, 70-billion-, and 405-billion-parameter sizes, is another notable part of this wave of releases. The flagship 405B model was pre-trained on roughly 15.6 trillion tokens, and Meta's post-training pipeline relies on supervised fine-tuning, rejection sampling, and DPO rather than RLHF with PPO.

Impact of Vocabulary Size in Multilingual Contexts

Both Qwen 2 and Gemma 2 have notably large vocabularies. In multilingual settings, a larger vocabulary lets the tokenizer represent text in many languages with fewer subword pieces, so the model sees shorter sequences of more meaningful tokens. This matters for applications that require robust multilingual support, such as global customer service and international content generation. Because modern subword tokenizers rarely produce true out-of-vocabulary failures, the practical benefit is reduced fragmentation of non-English text, which improves both efficiency and the quality of the model's responses.
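
One way to see this effect is to tokenize the same non-English sentence with tokenizers of different vocabulary sizes and compare the token counts. The snippet below is a quick illustration; the model identifiers are assumptions chosen for contrast, and the tokenizers are downloaded from the Hugging Face Hub on first use.

```python
# Compare how many tokens tokenizers with different vocabulary sizes need
# for the same non-English sentence.
from transformers import AutoTokenizer

sentence = "Les grands modèles de langage transforment le service client international."

for name in ["gpt2", "Qwen/Qwen2-7B"]:  # ~50k vs ~151k vocabulary entries
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok.encode(sentence)
    print(f"{name}: vocab={tok.vocab_size}, tokens={len(ids)}")

# A larger vocabulary generally splits the sentence into fewer tokens, which
# means shorter sequences and lower inference cost for non-English text.
```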

Implications of Apple's iTeC and RLHF with Mirror Descent Policy Optimization

Apple's introduction of iTeC (rejection sampling fine-tuning with a teacher committee) and RLHF with mirror descent policy optimization marks a notable shift in post-training methodology. These algorithms refine how candidate responses are sampled and how reward signals are used, with the goal of aligning model outputs more closely with human preferences. Mirror descent policy optimization, in particular, is an alternative to the more commonly used PPO and, according to Apple, proved more effective in their experiments.

Knowledge Distillation Techniques: Apple vs. Google

Knowledge distillation is a technique employed by both Apple and Google, albeit in different ways. Apple distills knowledge from a larger model into its compact on-device model so that it can run on phone-class hardware while retaining much of the larger model's quality.

Google, on the other hand, uses knowledge distillation to create scalable solutions. By training a large 27B model and then distilling knowledge into smaller 9B and 2B models, Google ensures that even the smaller models perform exceptionally well. This approach allows for a range of model sizes to suit different use cases while maintaining high levels of accuracy and performance. The primary benefit of these techniques is the creation of models that are both powerful and efficient, capable of being deployed across various platforms and devices.

Conclusion

The evolution of LLM training methodologies underscores the importance of both pre-training quality and effective post-training alignment with human preferences. Synthetic data and optimized data pipelines play crucial roles in improving model performance. The diverse approaches taken by companies like Alibaba, Apple, Google, and Meta AI highlight the varied strategies to achieve efficient and capable LLMs. As these techniques continue to evolve, we can expect even more sophisticated models that offer enhanced performance and better alignment with human needs.

FAQs

  • What are the main innovations in LLM pre-training and post-training? The main innovations include advanced pre-training techniques, synthetic data generation, supervised instruction fine-tuning, and reinforcement learning with human feedback.
  • How do large vocabularies benefit multilingual models? Large vocabularies enhance the ability of models to understand and generate text in multiple languages, reducing out-of-vocabulary issues and improving accuracy.
  • What is the significance of Apple's iTeC and RLHF with Mirror Descent Policy Optimization? These algorithms improve model alignment with human preferences and optimize learning efficiency, leading to better user experiences.
  • How do Apple and Google differ in their use of knowledge distillation? Apple focuses on creating lightweight, on-device models, while Google aims for scalable solutions with a range of model sizes for different use cases.
