Mixture of Experts (MoE)

LLM & Language Models

An architecture where a model consists of multiple specialized sub-networks (experts), with a routing mechanism that activates only the relevant experts for each input.

Mixture of Experts is an architecture that makes large models more efficient. Instead of one giant dense network processing every input, MoE uses multiple smaller "expert" networks and a router (a small learned gating network) that selects which experts to activate for each token.
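The routing idea above can be sketched in a few lines. This is a minimal illustration, not any production implementation: the "experts" here are plain weight matrices, and the router is a single learned projection followed by a softmax and a top-k selection.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2

# Each "expert" is a small feed-forward weight matrix (kept square for brevity).
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
router_w = rng.standard_normal((d_model, n_experts))  # router projection

def moe_layer(x):
    """Route a single token vector x to its top-k experts and mix their outputs."""
    logits = x @ router_w                          # one score per expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                           # softmax over experts
    chosen = np.argsort(probs)[-top_k:]            # indices of the top-k experts
    weights = probs[chosen] / probs[chosen].sum()  # renormalize over the chosen
    # Only the chosen experts run; the others are skipped entirely this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

token = rng.standard_normal(d_model)
out = moe_layer(token)
```

The key point is in the last line of `moe_layer`: compute never touches the unselected experts, which is where the efficiency comes from.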

GPT-4 is widely reported (though never confirmed by OpenAI) to use MoE with around 8 experts, activating 2 per token. By those reports, the total model has roughly 1.8 trillion parameters, but only a fraction are active for any given input, making inference far cheaper than a dense model of the same total size.

MoE enables models to be very large (lots of total knowledge) without being proportionally expensive to run. Mixtral (Mistral's MoE model) demonstrated that an open-source MoE model could compete with much larger dense models at a fraction of the compute cost.
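The cost saving is simple arithmetic over total versus active parameters. A rough back-of-the-envelope, using the approximate figures Mistral published for Mixtral 8x7B (assumed here: about 46.7B total parameters, about 12.9B active per token):

```python
# Approximate public figures for Mixtral 8x7B (treated as assumptions here).
total_params = 46.7e9   # all experts' weights combined
active_params = 12.9e9  # weights actually used for any single token (top-2 experts)

active_fraction = active_params / total_params
print(f"Active per token: {active_fraction:.0%} of total parameters")
```

Per-token compute scales with the active parameters, so the model stores the knowledge of a ~47B-parameter network while running closer to a ~13B dense model.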

Real-World Example

GPT-4 reportedly uses a Mixture of Experts architecture — it has many specialized sub-networks but only activates a few for each response. This makes it efficient despite its massive size.

FAQ

What is Mixture of Experts (MoE)?

An architecture where a model consists of multiple specialized sub-networks (experts), with a routing mechanism that activates only the relevant experts for each input.

How is Mixture of Experts (MoE) used in practice?

GPT-4 reportedly uses a Mixture of Experts architecture — it has many specialized sub-networks but only activates a few for each response. This makes it efficient despite its massive size.

What concepts are related to Mixture of Experts (MoE)?

Key related concepts include LLM (Large Language Model), Transformer, Parameters, and Inference. Understanding these together gives a more complete picture of how Mixture of Experts (MoE) fits into the AI landscape.