Stanford CS25: V4 I Behind the Scenes of LLM Pre-training: StarCoder Use Case

Training Large Language Models (LLMs)

  • Recent progress in open-source LLMs has led to performance comparable to closed models like GPT-4.
  • Lack of transparency in training details hinders the understanding of how to train effective LLMs.
  • Key factors in training LLMs include model architecture, GPUs, and data.
  • Data is crucial for LLM performance, and understanding data requirements is essential.
  • Scaling laws guide the allocation of resources between model size and data size.
  • The Chinchilla scaling law suggests scaling data and model size in roughly equal proportion (about 20 training tokens per parameter) for compute-optimal performance; see the sketch after this list.
  • Compute-optimal training may not always be optimal due to inference costs.
  • Recent models like LLaMA and PaLM 2 have demonstrated the benefits of training smaller models on larger datasets for cost-effectiveness.
  • Scaling laws don't consider inference costs, which are important for practical use cases.
  • Smaller models trained longer can be more cost-effective during inference.
  • Repeating data up to four times can achieve similar performance to using unique tokens, especially for domains with limited data.
  • Scaling behavior depends on data quality, so it's important to use appropriate scaling laws for the specific dataset.
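
As a concrete illustration of the compute-optimal rule of thumb, here is a minimal sketch that splits a fixed FLOP budget between parameters and tokens. It assumes the common C ≈ 6·N·D approximation for training compute and the Chinchilla heuristic of roughly 20 tokens per parameter; the paper's fitted coefficients differ slightly, so treat the numbers as order-of-magnitude guidance only.

```python
# A rough sketch of Chinchilla-style compute allocation (illustrative only).
# Assumptions: training compute C ~= 6 * N * D FLOPs, and the Chinchilla rule of
# thumb of roughly 20 training tokens per parameter (D ~= 20 * N).
import math

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Return (params, tokens) that roughly exhaust the compute budget."""
    # C = 6 * N * D with D = tokens_per_param * N  =>  N = sqrt(C / (6 * tokens_per_param))
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a 1e23 FLOP budget suggests roughly 29B parameters on roughly 580B tokens.
n, d = chinchilla_optimal(1e23)
print(f"~{n / 1e9:.1f}B params trained on ~{d / 1e9:.0f}B tokens")
```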

Data for Training LLMs

  • Large volumes of data for training LLMs can be obtained from the web and GitHub code.
  • The FineWeb dataset provides 15 trillion tokens of filtered web text and outperforms other publicly available web datasets.
  • The Stack is the largest dataset of open-source code, built from permissively licensed repositories with an opt-out option for GitHub users (a loading sketch follows this list).
  • Synthetic data has become increasingly important for LLM training and can match or outperform models trained on web data.
  • Popular LLMs like Claude 3 and Phi-3 incorporate synthetic data in their pre-training mix.
  • Hugging Face released Cosmopedia, a large dataset of synthetic textbooks and stories generated with an open-source model.
  • Its prompts were seeded from roughly 80% web samples and 20% curated sources such as Stanford courses and WikiHow.
  • Filtering techniques were applied to remove low-quality samples, duplicates, and near-duplicates.
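
All three corpora mentioned above are published on the Hugging Face Hub, so they can be streamed rather than downloaded in full. Below is a minimal sketch with the `datasets` library; the dataset IDs and config names (`HuggingFaceFW/fineweb`, `bigcode/the-stack-dedup`, `HuggingFaceTB/cosmopedia`) are the public repositories at the time of writing, some of which are gated and require accepting their terms first.

```python
from datasets import load_dataset

# Stream instead of downloading: these corpora are terabytes in size.
fineweb = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
the_stack = load_dataset("bigcode/the-stack-dedup",   # gated: accept the terms on the Hub first
                         data_dir="data/python", split="train", streaming=True)
cosmopedia = load_dataset("HuggingFaceTB/cosmopedia", "stanford",
                          split="train", streaming=True)

# Peek at one record from each stream.
print(next(iter(fineweb))["text"][:200])
print(next(iter(the_stack))["content"][:200])
print(next(iter(cosmopedia))["text"][:200])
```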

Filtering and Curating Data

  • The authors of the FineWeb dataset ran over 200 ablation experiments to find the best filtering techniques.
  • For the StarCoder dataset, the authors filtered out low-quality files, autogenerated content, and languages that are no longer maintained.
  • They also performed data-quality inspection and near-deduplication, which gave the largest performance boost (see the deduplication sketch after this list).
  • The speaker discusses the process of filtering and curating The Stack dataset for training code language models.
  • They explain the importance of removing duplicates, personally identifiable information (PII), and benchmark and test-set data from the training data.
  • The speaker also highlights the formatting applied to the data, such as adding tokens that indicate the repository name, file name, and GitHub star count (a formatting sketch follows the deduplication example below).
  • They introduce repository awareness in StarCoder 2, which helps the model learn the relationships between files within the same repository.
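
The near-deduplication step is worth making concrete. Below is a minimal MinHash-LSH sketch using the `datasketch` library; the actual BigCode pipeline uses its own, more elaborate implementation (shingling choices, thresholds, and clustering differ), so this is only an illustration of the idea.

```python
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    """Hash 5-token shingles of a file into a MinHash signature."""
    m = MinHash(num_perm=num_perm)
    tokens = text.split()
    for i in range(max(len(tokens) - 4, 1)):
        m.update(" ".join(tokens[i:i + 5]).encode("utf-8"))
    return m

def near_dedup(files: dict, threshold: float = 0.7) -> list:
    """Keep the first copy in each near-duplicate cluster, drop the rest."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for key, content in files.items():
        sig = minhash(content)
        if lsh.query(sig):        # a similar file is already indexed -> drop
            continue
        lsh.insert(key, sig)
        kept.append(key)
    return kept

# Toy usage: b.py is a.py with one extra line, so it is dropped as a near-duplicate.
base = "\n".join(f"x{i} = {i} * 2" for i in range(40))
files = {"a.py": base, "b.py": base + "\nprint(x0)", "c.py": "def greet():\n    return 'hi'"}
print(near_dedup(files))          # -> ['a.py', 'c.py']
```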

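And a sketch of the metadata formatting itself. The token strings below (`<reponame>`, `<filename>`, `<gh_stars>`, `<|endoftext|>`) follow the StarCoder paper; StarCoder 2's repository-aware format uses its own separator tokens, so `format_repo` only illustrates the grouping idea, and the released tokenizer should be checked for the exact special tokens.

```python
def format_file(repo: str, path: str, stars: int, code: str) -> str:
    """One training document per file, prefixed with repository metadata."""
    return f"<reponame>{repo}<filename>{path}<gh_stars>{stars}\n{code}<|endoftext|>"

def format_repo(repo: str, stars: int, files: list) -> str:
    """Repository-aware variant: concatenate all files of one repo into a single
    document so the model sees cross-file context (the StarCoder 2 idea, separators simplified)."""
    body = "".join(f"<filename>{path}\n{code}" for path, code in files)
    return f"<reponame>{repo}<gh_stars>{stars}{body}<|endoftext|>"

print(format_file("octocat/hello", "src/add.py", 42, "def add(a, b):\n    return a + b\n"))
```
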
Open-Source Code Language Models

  • The speaker provides an overview of the progress made in training open-source code language models and highlights notable models such as Codex, StarCoder, Code Llama, DeepSeek Coder, Granite, CodeGen, and StableCode.
  • They discuss the significance of open and responsible releases of language models, including data transparency, opt-out tools, removal of PII, reproducibility, and comprehensive documentation.
  • StarCoder 15B and StarCoder 2 15B were state-of-the-art open code models upon release, outperforming other open models on various benchmarks.
  • It is important to use multiple evaluation benchmarks to guard against contamination effects and to fully understand a model's behavior.
  • StarCoder models ship with tooling, including a VS Code extension with a membership test that checks whether generated code appears in the training data, for attribution.
  • Fine-tuning large code models such as StarCoder on a personal codebase is possible with parameter-efficient fine-tuning, as sketched below.
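
A minimal sketch of that parameter-efficient route, using the `transformers` and `peft` libraries. The checkpoint id, target modules, and LoRA hyperparameters are illustrative assumptions rather than a recipe from the talk.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "bigcode/starcoder2-3b"   # assumed public checkpoint; swap in your own
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# LoRA adapters on the attention projections; only these small matrices are trained.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # typically well under 1% of the full model
```

From here, a standard supervised fine-tuning loop (for example `transformers.Trainer` or TRL's `SFTTrainer`) over the curated personal codebase completes the adaptation.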

Challenges and Considerations

  • Training AI models on AI-generated synthetic data can reinforce biases and cause contamination, raising concerns about overfitting to benchmarks.
  • Synthetic data may not fully represent the natural distribution of language, and a mix of synthetic and natural data is often beneficial.
  • RLHF (Reinforcement Learning from Human Feedback) data is not necessarily more important than unsupervised pre-training data; instruction tuning without RL, and methods like DPO or ORPO, can achieve good results.
  • Multimodal grounding with images and videos alongside text has not been extensively explored, but text still seems to play a significant role in multimodal language models.
  • Training text versus code models differs primarily in the training data, but there may be other differences that require further research.
  • The architecture used for training is similar to other large language models (LLMs), with a focus on long context and fast inference.
  • For fine-tuning with limited compute, it is recommended to use curated datasets, quantized models, and PEFT (parameter-efficient fine-tuning) techniques to enable fine-tuning on a single GPU (see the sketch after this list).
  • The optimal amount of training data depends on the domain and task; scaling laws have been observed to hold across domains like English and code, but how they shift between domains remains underexplored.
  • Tokenization for code differs from general-purpose text in how numbers are split, and the tokenizer is trained on a code-specific dataset (a tokenizer check appears at the end of this section).
  • For fine-tuning, data preparation differs from pre-training, with a focus on the specific languages of interest and heavier filtering.
  • When publishing large datasets, considerations include providing tools for filtering and documentation, respecting licenses and copyrights, and potentially adding a gating mechanism for sensitive information.
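
For the single-GPU case mentioned above, a common pattern is to combine 4-bit quantization with the LoRA setup sketched in the previous section (QLoRA-style). The sketch below uses `bitsandbytes` via `transformers`; the checkpoint and hyperparameters are again illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the matmuls in bf16
)
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder2-3b",                # assumed checkpoint; swap in your own
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))
model.print_trainable_parameters()
```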

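To see the number-splitting behaviour for yourself, the tokenizer of any public code checkpoint can be inspected directly; the model id below is an assumption, and the exact splits depend on the tokenizer that checkpoint ships with.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bigcode/starcoder2-3b")
for text in ["12345", "x = 3.14159", "return 2**10"]:
    print(text, "->", tok.tokenize(text))
```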