These models tokenize text, embed tokens into vector representations, process them through a neural network (like a Transformer), and output a probability distribution over potential next tokens. (00:08:45)
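A minimal sketch of that pipeline in PyTorch, with toy sizes and illustrative names (causal masking and training are omitted for brevity):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 50_000, 512            # assumed toy hyperparameters

embed = nn.Embedding(vocab_size, d_model)    # token ids -> vectors
layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
transformer = nn.TransformerEncoder(layer, num_layers=4)
lm_head = nn.Linear(d_model, vocab_size)     # vectors -> logits over vocab

token_ids = torch.tensor([[17, 292, 4521]])  # pretend output of a tokenizer
hidden = transformer(embed(token_ids))       # contextual representations
next_token_probs = lm_head(hidden[:, -1]).softmax(-1)
print(next_token_probs.shape)                # torch.Size([1, 50000])
```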
Tokenization in LLMs
Tokenization is crucial because it handles variations in language, such as typos, by breaking down words into smaller units, ensuring that even misspelled words can be processed by the model. (00:10:50)
Tokenization methods: Subword tokenization is currently favored over character-by-character or byte-by-byte processing because existing architectures handle long sequences poorly. If future architectures remove this limitation, the field may move away from tokenizers. (00:17:40)
Drawbacks of tokenization: One major drawback is the handling of numbers in math. Tokenizing numbers as single units prevents models from understanding and generalizing mathematical concepts effectively. (00:17:56)
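To make both points concrete, here is how a real BPE tokenizer splits a misspelled word and a number, using the open-source tiktoken library (the exact splits depend on the tokenizer, so the behavior is illustrative, not guaranteed):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the GPT-4 tokenizer

for text in ["language", "langauge", "1234567"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r} -> {pieces}")

# A misspelling like "langauge" still maps onto valid subword pieces, so the
# model can process it; a number like "1234567" is split into arbitrary digit
# chunks, which is one reason arithmetic generalizes poorly.
```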
Evaluating Large Language Models
Evaluating large language models (LLMs): Perplexity, a measure of how well a model predicts a sequence of words, was commonly used for evaluating LLMs. However, it is no longer favored in academic settings due to its dependence on tokenizers and data. (00:21:06)
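For reference, perplexity is the exponentiated average negative log-likelihood of a sequence. A minimal sketch with made-up per-token probabilities:

```python
import math

# probability the model assigned to each token that actually occurred
token_probs = [0.25, 0.10, 0.50, 0.05]
nll = [-math.log(p) for p in token_probs]
perplexity = math.exp(sum(nll) / len(nll))
print(perplexity)  # ~6.3: "as uncertain as" a uniform choice over ~6 tokens
```

Because the average is taken per token, the value depends on how the tokenizer splits the text, which is one reason it fell out of favor for comparing models.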
AlpacaEval, an alternative evaluation method, uses an LLM as an automatic judge to compare responses from different models; its rankings show a 98% correlation with human evaluations on Chatbot Arena. (01:31:49)
After undesirable content is removed, data is further refined by removing duplicated headers, footers, and URLs, as well as frequently duplicated paragraphs from sources like books. (00:31:50)
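A minimal sketch of one such step, exact-match deduplication by hashing paragraphs (real pipelines also use fuzzy matching such as MinHash, not shown here):

```python
import hashlib

def dedup_paragraphs(paragraphs):
    seen, kept = set(), []
    for p in paragraphs:
        digest = hashlib.sha256(p.strip().lower().encode()).hexdigest()
        if digest not in seen:  # keep only the first occurrence
            seen.add(digest)
            kept.append(p)
    return kept

docs = ["All rights reserved.", "Chapter 1 ...", "All rights reserved."]
print(dedup_paragraphs(docs))  # the duplicated footer is kept only once
```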
Scaling laws demonstrate that increasing the amount of data and the size of language models leads to predictably better performance as a function of compute, data, and parameters. (00:40:55)
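These laws are typically power laws, e.g. loss as a function of compute L(C) ≈ a · C^(−b), which become straight lines in log-log space. A minimal sketch of fitting and extrapolating one (the data points are made up for illustration):

```python
import numpy as np

compute = np.array([1e18, 1e19, 1e20, 1e21])  # hypothetical training FLOPs
loss = np.array([3.10, 2.65, 2.27, 1.94])     # hypothetical final losses

# a power law is a straight line in log-log space
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
a, b = np.exp(intercept), -slope
print(f"L(C) = {a:.1f} * C^(-{b:.3f})")

# extrapolate to 10x more compute than any run we actually trained
print(a * 1e22 ** -b)
```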
There are two machine learning pipelines for choosing hyperparameters: an older one, in which hyperparameters are tuned directly on large models trained for a short time, and a newer one, in which a scaling recipe is found by training smaller models of different sizes and extrapolating the results to larger models. (00:45:08)
Scaling laws can be used to determine the optimal allocation of training resources, such as whether to train a larger model on less data or a smaller model on more data. (00:49:27)
The Chinchilla paper demonstrated the use of scaling laws to determine the optimal allocation of training resources by varying the number of tokens and model size while keeping the amount of compute constant, arriving at a compute-optimal ratio of roughly 20 tokens per parameter. (00:49:47)
Once inference costs are considered, a higher ratio of approximately 150 tokens per parameter is more practical, since smaller models are cheaper to run at inference time. (00:52:35)
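A quick sketch of this trade-off using the standard approximation C ≈ 6·N·D (training FLOPs ≈ 6 × parameters × tokens): fixing compute C and a tokens-per-parameter ratio r gives N = √(C / 6r). The budget below is hypothetical:

```python
import math

def allocate(compute_flops, tokens_per_param):
    # C = 6 * N * D with D = r * N  =>  N = sqrt(C / (6 * r))
    n_params = math.sqrt(compute_flops / (6 * tokens_per_param))
    return n_params, tokens_per_param * n_params

C = 1e24  # hypothetical compute budget in FLOPs
for r in (20, 150):  # Chinchilla-optimal vs. inference-aware ratio
    n, d = allocate(C, r)
    print(f"ratio {r:>3}: {n / 1e9:5.1f}B params, {d / 1e12:4.1f}T tokens")
# ratio  20:  91.3B params,  1.8T tokens
# ratio 150:  33.3B params,  5.0T tokens
```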
Computational Costs and Environmental Impact
Estimating the computational cost of training large language models, such as Llama 3 400B, involves considering factors like the number of parameters (405 billion in this case) and the total number of tokens used in training (15.6 trillion for Llama 3 400B). (00:55:24)
Training the model required approximately 26 million GPU hours over 70 days and cost an estimated $75 million, including compute costs, salaries, and other expenses. (00:57:02)
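The estimate can be reproduced with the same C ≈ 6·N·D approximation. The GPU throughput and hourly price below are assumptions for illustration, not figures from the source:

```python
n_params = 405e9
n_tokens = 15.6e12
flops = 6 * n_params * n_tokens               # ~3.8e25 FLOPs

# assume ~40% utilization of an H100's ~1e15 bf16 FLOP/s peak
effective_flops_per_gpu = 0.4 * 1e15
gpu_hours = flops / effective_flops_per_gpu / 3600
print(f"{gpu_hours / 1e6:.0f}M GPU-hours")    # ~26M, matching the estimate

price_per_gpu_hour = 2.0                      # assumed $/GPU-hour
print(f"${gpu_hours * price_per_gpu_hour / 1e6:.0f}M")  # compute cost alone
```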
The carbon emissions from training the model, estimated at around 4,000 tons of CO2 equivalent, are significant but considered relatively small compared to what future models will require. (00:58:40)
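A rough sanity check on that figure; the per-GPU power draw and grid carbon intensity are assumptions, not numbers from the source:

```python
gpu_hours = 26e6
kw_per_gpu = 0.7           # assumed H100 draw incl. cooling/host overhead
kg_co2e_per_kwh = 0.2      # assumed grid carbon intensity
tons = gpu_hours * kw_per_gpu * kg_co2e_per_kwh / 1000
print(tons)                # ~3,640 tCO2e, in the ballpark of ~4,000
```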
Supervised Fine-Tuning and Alignment
Supervised fine-tuning (SFT) does not require a large amount of data, as demonstrated by the LIMA paper, which showed that scaling data from 2,000 to 32,000 examples did not significantly improve results. (01:05:31)
While synthetic data generation using LLMs is a promising area of research, it is not as crucial for SFT; a smaller set of high-quality human-generated data (around 2,000 examples) might be sufficient. (01:07:20)
While humans may be better at distinguishing between good and bad outputs, they may not be the best at generating ideal responses, limiting the effectiveness of behavioral cloning in SFT. (01:10:38)
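For concreteness, SFT is behavioral cloning: cross-entropy on human-written responses, with the prompt tokens masked out of the loss. A minimal PyTorch sketch with illustrative names:

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, token_ids, prompt_len):
    """logits: (seq_len, vocab); token_ids: (seq_len,) prompt + response."""
    pred, target = logits[:-1], token_ids[1:]  # position t predicts token t+1
    loss = F.cross_entropy(pred, target, reduction="none")
    # train only on the response tokens, not on the prompt
    mask = (torch.arange(len(target)) >= prompt_len - 1).float()
    return (loss * mask).sum() / mask.sum()
```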
DPO, a simplification of PPO-based RLHF, proposes maximizing the probability of generating desired outputs and minimizing the probability of generating undesired ones, essentially maximizing the "green" (preferred) and minimizing the "red" (dispreferred) responses in human preference data. (01:20:04)
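A minimal sketch of the DPO loss itself. Given total log-probabilities of the preferred and dispreferred responses under the policy and under the frozen reference model, DPO takes a logistic loss on the implied reward margin (names and beta value are illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    # -log sigmoid of the margin: pushes "green" up and "red" down
    return -F.logsigmoid(chosen_reward - rejected_reward)

print(dpo_loss(torch.tensor(-12.0), torch.tensor(-15.0),
               torch.tensor(-13.0), torch.tensor(-14.0)))  # tensor(0.5981)
```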
Perplexity, a common metric for evaluating language models, becomes problematic when applied to post-trained LLMs because, after steps like instruction tuning and RLHF, these models are no longer trained to maximize likelihood. (01:29:50)