Stanford CS25: V4 I Demystifying Mixtral of Experts

Mixture of Experts (MoE) Models

  • Albert Jiang, an AI scientist at Mistral AI and a PhD student at the University of Cambridge, gave a talk demystifying mixture-of-experts (MoE) language models.
  • MoE layers have been used to scale Transformers to trillions of parameters.
  • In the Mistral implementation of the MoE layer, a router assigns gating weights to each token, the top two experts are selected to process it, and the experts' outputs are weighted by those gating weights and summed to produce the result (see the sketch after this list).
  • Sparse MoE models sit near the cost-performance Pareto frontier: each token only passes through the parameters that are active for that token, giving faster inference than large dense models such as Llama 2 70B.
  • The sparse MoE model released by Mistral under Apache 2.0 outperforms Llama 2 70B with roughly five times faster inference and supports a context of up to 32k tokens.
  • Mixture-of-experts models can be compressed to a size smaller than four gigabytes.
  • The gating layers in mixture-of-experts models provide an opportunity for interpreting the model by giving sparse gating signals.
  • Sparse mixture-of-experts models use sparsity to store more knowledge in their extra parameters while remaining efficient at inference.
  • Mixture of experts techniques are likely to be incorporated into other large foundation models, especially when efficiency gains can be achieved at scale.
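
Below is a minimal sketch of the top-two routing described above, written in PyTorch. The class and parameter names (`SparseMoELayer`, `n_experts`, `top_k`) and the SwiGLU expert shape are illustrative assumptions based on the publicly documented Mixtral dimensions, not Mistral's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Illustrative top-2 sparse MoE layer: a router scores the experts for each
    token, the two best experts process the token, and their outputs are summed
    after weighting by the renormalized gating weights."""

    def __init__(self, dim=4096, hidden_dim=14336, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, n_experts, bias=False)
        # Each expert is a SwiGLU-style MLP, matching the Mistral block shape.
        self.experts = nn.ModuleList([
            nn.ModuleDict({
                "w1": nn.Linear(dim, hidden_dim, bias=False),
                "w2": nn.Linear(hidden_dim, dim, bias=False),
                "w3": nn.Linear(dim, hidden_dim, bias=False),
            })
            for _ in range(n_experts)
        ])

    def forward(self, x):                        # x: (n_tokens, dim)
        logits = self.router(x)                  # (n_tokens, n_experts)
        weights, idx = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e         # tokens whose slot-th choice is expert e
                if mask.any():
                    t = x[mask]
                    h = expert["w2"](F.silu(expert["w1"](t)) * expert["w3"](t))
                    out[mask] += weights[mask, slot].unsqueeze(-1) * h
        return out
```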

Mixtral 8x7B Model

  • Mixtral 8x7B outperforms Mistral 7B on knowledge-heavy tasks and on reasoning and comprehension tasks.
  • Mixtral 8x7B has 12.9 billion active parameters per token, significantly fewer than comparable dense models, yet performs better, especially in the knowledge category.
  • Making the MLP layers of a language model into mixture-of-experts layers can provide a huge boost in knowledge.
  • There are actually 32 × 8 = 256 experts in total in Mixtral 8x7B (eight per layer across 32 layers), not just eight, and the experts are relatively independent across layers.
  • Mixtral 8x7B has 46.7 billion parameters in total, not 56 billion, and each token sees only 12.9 billion active parameters (see the parameter-count sketch after this list).
  • The cost of serving a mixture-of-experts model is not proportional to the number of active parameters.
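
As a rough back-of-the-envelope check of the active vs. total counts above, the snippet below plugs in the publicly documented Mixtral 8x7B dimensions (model dim 4096, expert hidden dim 14336, 32 layers, 8 experts, top-2 routing); the bookkeeping for embeddings, attention, and the router is approximate.

```python
# Approximate parameter counts for a Mixtral-8x7B-shaped model.
dim, hidden, layers = 4096, 14336, 32
n_experts, top_k, vocab = 8, 2, 32000
kv_dim = 1024  # 8 KV heads x head_dim 128 (grouped-query attention)

expert_mlp = 3 * dim * hidden                          # w1, w2, w3 of one SwiGLU expert
attn = 2 * dim * dim + 2 * dim * kv_dim                # Wq, Wo plus Wk, Wv per layer
router = dim * n_experts                               # gating layer per layer
shared = layers * (attn + router) + 2 * vocab * dim    # non-expert weights + embeddings/head

total = shared + layers * n_experts * expert_mlp
active = shared + layers * top_k * expert_mlp

print(f"total  ~ {total / 1e9:.1f} B")   # ~ 46.7 B
print(f"active ~ {active / 1e9:.1f} B")  # ~ 12.9 B
```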

Fine-tuning and Efficiency

  • Fine-tuning large language models can be challenging, especially when working with multiple models.
  • Fine-tuning open-source language models provides more control over the process compared to closed-source models like ChatGPT and PaLM 2.
  • Translating images into text format is necessary for fine-tuning language models on visual tasks since they are not designed to understand images directly.
  • Mixture of experts models improve performance by storing more knowledge in wider MLP layers and enhance inference efficiency by selecting relevant parameters for each token.
  • The mixture-of-depths model is an example of adaptive computation, where different numbers of parameters are engaged when computing different tokens (see the sketch after this list).
  • Communication cost in mixture of experts models depends on the number of experts and their size, and it becomes more expensive when scaling beyond a single node.
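
The sketch below illustrates the mixture-of-depths idea mentioned above: a per-token router decides which tokens a block processes, and the rest skip it through the residual path. The wrapper name, capacity fraction, and sigmoid scaling are illustrative assumptions, not the published mixture-of-depths implementation.

```python
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    """Illustrative mixture-of-depths wrapper: only the highest-scoring fraction
    of tokens is processed by the inner block; the rest pass through unchanged."""

    def __init__(self, block, dim=4096, capacity=0.5):
        super().__init__()
        self.block = block           # e.g. a full Transformer layer
        self.router = nn.Linear(dim, 1, bias=False)
        self.capacity = capacity     # fraction of tokens that receive computation

    def forward(self, x):                      # x: (batch, seq, dim)
        scores = self.router(x).squeeze(-1)    # (batch, seq)
        k = max(1, int(self.capacity * x.shape[1]))
        top, idx = torch.topk(scores, k, dim=-1)
        gather_idx = idx.unsqueeze(-1).expand(-1, -1, x.shape[-1])
        picked = torch.gather(x, 1, gather_idx)
        # Scale by the router score so the routing decision stays differentiable.
        processed = self.block(picked) * torch.sigmoid(top).unsqueeze(-1)
        out = x.clone()
        out.scatter_add_(1, gather_idx, processed)   # residual add for routed tokens only
        return out
```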

Design and Training Considerations

  • MoE architecture is a safe choice for designing high-performing models.
  • MoE layers can be arranged in random orders, such as attention first, MLP first, etc., and still perform well.
  • The eight experts in the MoE model are trained on roughly the same datasets.
  • Learning to route tokens to experts may not be dramatically better than a random token-to-expert mapping, though the learned gating still provides more informed expert selection.
  • Consider inference needs and GPU memory footprint when designing the model architecture.
  • Scaling laws should be analyzed before training to ensure optimal performance-cost ratio.
  • The inference runtime and GPU memory footprint for the 7B vs. 8x7B models depend on implementation details and optimization techniques.
  • The geometric mean of the active and total parameter counts can be a good rule of thumb for approximating a sparse model's capability, though it also depends on training quality and token efficiency (see the worked example after this list).
  • Exploring the possibility of trimming the 8x7B model back to 7B by removing specific experts could be an interesting experiment.
  • The reason for the 8x7B model's significant improvement in reasoning over the 7B model is speculative; it may come from increased knowledge or from ambiguity in how benchmarks define reasoning.
  • MoE models may pose challenges in serving due to higher GPU memory consumption, but they offer increased throughput at high batch sizes.
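
As a quick worked instance of the geometric-mean rule of thumb mentioned in this list, using Mixtral 8x7B's active and total parameter counts (this is only a heuristic, as the talk stresses):

```python
import math

# Rule of thumb: an MoE model with A active and T total parameters behaves
# very roughly like a dense model of sqrt(A * T) parameters.
active, total = 12.9, 46.7                     # billions of parameters (Mixtral 8x7B)
dense_equivalent = math.sqrt(active * total)
print(f"~ {dense_equivalent:.1f} B dense-equivalent")   # ~ 24.5 B
```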

Additional Notes

  • Expert specialization is not as straightforward as one might think, and there's a lot of research to be done in architecture and interpretability.
  • Dense models are better suited to devices with limited memory, while sparse mixture-of-experts models pay off when memory is plentiful and fast inference is the priority.
  • Mixture-of-experts models may not outperform domain-specific models at their respective tasks; domain-specific models with continued pre-training and fine-tuning are hard to beat.
  • Whether expert mixing works best at early or deep layers requires further study; in traditional neural-network ensembles, late fusion tends to perform better.
  • Load-balancing losses, which introduce discontinuities into the loss function, are used to ensure that each expert handles a similar number of tokens during training (a common formulation is sketched after this list).
  • Mixture-of-experts models can be used for RAG (retrieval-augmented generation), and swapping one expert out for a domain-specific expert is possible but requires additional training.
  • Training Mixtral 8x7B is roughly as expensive as training a dense 13B model (matching its roughly 12.9 billion active parameters), but incurs some extra communication cost.
  • Serving very large models with many experts (e.g., 128 experts) can be challenging due to implementation difficulties and increased communication costs.
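
The function below sketches a standard auxiliary load-balancing loss in the style of Switch Transformers/GShard; whether Mixtral was trained with exactly this formulation is not stated in the talk, so treat the formula and the names as assumptions.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, top_k=2):
    """Auxiliary loss that pushes the router toward uniform expert usage.
    It is minimized when every expert receives the same fraction of tokens
    and the same share of routing probability.

    router_logits: (n_tokens, n_experts) raw gating scores for one layer.
    """
    n_tokens, n_experts = router_logits.shape
    probs = F.softmax(router_logits, dim=-1)                  # routing probabilities
    _, idx = torch.topk(router_logits, top_k, dim=-1)         # chosen experts per token
    dispatch = F.one_hot(idx, n_experts).float().sum(dim=1)   # (n_tokens, n_experts), 0/1
    f = dispatch.mean(dim=0) / top_k   # fraction of assignments going to each expert
    p = probs.mean(dim=0)              # mean routing probability per expert
    return n_experts * torch.sum(f * p)
```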
