Introduction
Large language models (LLMs) have transformed how we interact with data, enabling chatbots to converse with users seamlessly. These models excel at a wide variety of tasks: summarization, creative writing, coding, mathematics, and more. However, while the size of these models has grown drastically, can we say the same about their performance across every task?
The Mixture of Experts (MoE) architecture for LLMs shifts the paradigm by employing multiple smaller, specialized models that work together to complete a given task, rather than relying on a single large, general-purpose model.
Think of it like a hospital! Doctors, nurses, and surgeons all have a medical background to treat patients but also specialize in a specific field—cardiology, neurology, orthopedics—and collaborate as experts to provide the best, most comprehensive care to their patients. Similarly, in MoE, separate models with unique specializations work together to, for example, understand and annotate a large code block.
What is the MoE Architecture?
Mixture of Experts (MoE) is an advanced neural network architecture designed to enhance the scalability and efficiency of Large Language Models (LLMs). It achieves this by selectively activating a subset of specialized "experts" for each input, reducing computational overhead while maintaining model performance. A gating network determines which experts are most suitable for the task and routes the input accordingly.
MoE composes the overarching model from multiple specialized expert networks, each responsible for particular kinds of input, rather than from a single monolithic network, making the model more versatile. The key benefits include:
- Scalability: Models can grow significantly by adding new experts without requiring dense computational resources, as only a subset of parameters is activated for any given input.
- Efficiency: By activating only a few experts per task, MoE minimizes computational costs, optimizing resource usage while retaining complexity and capability.
- Specialization: MoE enhances versatility by employing experts that each capture different patterns in the data. In practice, these experts aren't domain-specific; they tend to specialize in fine-grained patterns such as punctuation, idioms, or numerical reasoning.
How MoE Architecture Works
The Mixture of Experts (MoE) architecture is a neural network design that dynamically routes input data to specialized components, or "experts," enabling efficient computation and scalability. The architecture comprises three main components:
- Experts: Independent sub-networks, such as feedforward layers, each trained to specialize in specific tasks or data patterns.
- Gating Network: Determines which experts are most relevant for a given input by producing a probability distribution and selecting a small subset—typically two experts, as seen in models like Mixtral.
- Combiner: Aggregates the outputs of the selected experts, weighting them based on the gating network's decisions, and passes the result to the next layer in a traditional transformer model.
Once input is routed, the selected experts process it independently and in parallel. Their outputs are combined and weighted according to the gating network's probabilities before being passed to subsequent layers.
In essence, the MoE architecture enhances traditional transformer networks by replacing a single dense feedforward network with a group of feedforward networks managed by a router.
A standout feature of MoE is its sparse activation, where only a small subset of experts is activated for each input. This approach significantly reduces computational costs while maintaining the model's capacity to handle diverse and complex tasks. Sparse activation enables MoE models to scale to trillions of parameters without requiring dense computation for every input, making them ideal for large-scale AI applications.
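To make the routing concrete, here is a minimal PyTorch sketch of a sparse MoE layer. It is a simplified illustration under assumed dimensions and a top-2 router, not Mixtral's actual implementation; the Expert and SparseMoELayer classes are hypothetical names used here for clarity.

```python
# Minimal sparse Mixture-of-Experts layer (illustrative sketch, not Mixtral's code).
# Assumptions: each expert is a small feedforward network, the router picks the
# top-2 experts per token, and outputs are combined with the router's softmax weights.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A simple feedforward sub-network (one 'expert')."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x):
        return self.net(x)


class SparseMoELayer(nn.Module):
    """Routes each token to its top-k experts and combines their weighted outputs."""
    def __init__(self, d_model: int = 512, d_hidden: int = 2048,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            [Expert(d_model, d_hidden) for _ in range(num_experts)]
        )
        self.gate = nn.Linear(d_model, num_experts)  # the gating network / router
        self.top_k = top_k

    def forward(self, x):
        # x: (num_tokens, d_model)
        logits = self.gate(x)                                # (tokens, experts)
        top_vals, top_idx = logits.topk(self.top_k, dim=-1)  # pick the top-k experts
        weights = F.softmax(top_vals, dim=-1)                # combining weights

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e                 # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


# Example: route 4 tokens through the layer.
tokens = torch.randn(4, 512)
layer = SparseMoELayer()
print(layer(tokens).shape)  # torch.Size([4, 512])
```

In a model like Mixtral, a layer of this kind replaces the dense feedforward block inside each transformer layer, with the router selecting two of eight experts per token.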

Open Source Mixture of Experts Models - Mixtral
Mistral AI offers open-source foundational AI models designed for community use. Their popular Mistral models have been widely adopted for local LLM projects and include MoE models like Mixtral 8x7B and 8x22B.
- Mixtral 8x7B: The name refers to eight experts within a roughly 7B-parameter architecture. The model totals 46.7 billion parameters rather than 56 billion because only the feedforward (expert) layers are replicated; the attention layers, embeddings, and gating network are shared. Its modest size makes it accessible for hobbyists using commodity hardware, and it matches or outperforms larger models like GPT-3.5 on most standard benchmarks while being more cost-efficient.
- Mixtral 8x22B: Scaling the same design to eight experts in a roughly 22B-parameter architecture, this model totals 141 billion parameters. It delivers superior performance compared to its smaller counterpart and rivals the capabilities of much larger dense models.
While the total parameter counts for the Mixtral models are 46.7B and 141B, their active parameter counts are much lower due to sparsity. Mixtral 8x7B activates only about 13B parameters per token, and Mixtral 8x22B about 39B, yet both perform as well as or better than larger dense models such as LLaMA 2 70B and GPT-3.5, which activate all of their parameters during inference.
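To see where these numbers come from, here is a rough back-of-the-envelope calculation. It assumes top-2 routing and derives the per-expert and shared parameter counts from the published total and active figures; the split is an approximation for intuition, not an official breakdown from Mistral AI.

```python
# Back-of-the-envelope split of Mixtral's parameters (approximation, not official).
# Assumes: total = shared + 8 * per_expert and active = shared + 2 * per_expert,
# i.e. eight experts with top-2 routing, where "shared" covers attention layers,
# embeddings, and the gating network.
NUM_EXPERTS, TOP_K = 8, 2

def split_params(total_b: float, active_b: float):
    """Solve the two equations above for per-expert and shared parameters (billions)."""
    per_expert_b = (total_b - active_b) / (NUM_EXPERTS - TOP_K)
    shared_b = total_b - NUM_EXPERTS * per_expert_b
    return per_expert_b, shared_b

for name, total_b, active_b in [("Mixtral 8x7B", 46.7, 13.0), ("Mixtral 8x22B", 141.0, 39.0)]:
    per_expert_b, shared_b = split_params(total_b, active_b)
    print(f"{name}: ~{per_expert_b:.1f}B per expert, ~{shared_b:.1f}B shared, "
          f"~{active_b:.0f}B active out of {total_b}B total")
```

The estimate lands at roughly 5-6B per expert for the 8x7B model, which is why the full 46.7B never runs at once: each token only ever touches the shared layers plus two experts.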
Hardware to Run MoE like Mixtral
For those looking to run an open-source model like Mixtral, tools such as Ollama make local deployment simple. Let's check out some hardware, and a configurable solution, for running your models.
For the smaller yet highly performant Mixtral 8x7B, we recommend a single NVIDIA RTX 4090 paired with an Intel Core i7 or i9, or an AMD Ryzen 7 or 9; this setup delivers local inference speeds comparable to ChatGPT. You can further augment Mixtral by incorporating RAG features. We prefer to stay with NVIDIA GPUs due to their widespread adoption for AI and their CUDA architecture.

If you choose to run the larger Mixtral 8x22B, a multi-GPU deployment with considerably more GPU memory is required. This is why we recommend the RTX 5000 Ada or RTX 6000 Ada for multi-GPU workstations. These GPUs not only offer more memory but also use the standard double-wide form factor, fitting up to 4x GPUs in a workstation chassis or up to 8x in a server chassis.
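Once the hardware is in place, serving Mixtral locally with Ollama is straightforward. The sketch below queries a locally running Ollama server over its REST API; it assumes Ollama is installed, its server is listening on the default port, and the model has already been pulled (for example with `ollama pull mixtral`). Check the Ollama documentation for the exact model tags and API details for your version.

```python
# Minimal sketch: query a locally running Mixtral model through Ollama's REST API.
# Assumes Ollama is installed, the server is running on its default port (11434),
# and the model has been pulled beforehand with `ollama pull mixtral`.
import requests

payload = {
    "model": "mixtral",          # use the larger variant's tag for Mixtral 8x22B
    "prompt": "Explain the Mixture of Experts architecture in two sentences.",
    "stream": False,             # return one complete response instead of a stream
}

resp = requests.post("http://localhost:11434/api/generate", json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["response"])
```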

If you have any questions about hardware for running AI on a workstation or server, contact us today! Or configure your ideal platform and get a quote.
