
Towards DeepSeek - Introduction to DeepSeek-V3-Base

Authors
  • Jan Hardtke

Lately, DeepSeek AI, a spin-off from the hedge fund High-Flyer, has disrupted the LLM landscape with the release of DeepSeek-R1, an open-source state-of-the-art reasoning model. Remarkably, its performance is comparable to OpenAI's o1, yet DeepSeek reports that training the underlying base model cost only about $5.6 million, a surprisingly low figure for such an advanced model.

Their success stems from key innovations in the widely used Mixture of Experts (MoE) Transformer architecture, alongside a novel reinforcement learning technique called Group Relative Policy Optimization (GRPO), which enables end-to-end learning of the reasoning process, leading to cutting-edge results in complex tasks.


In this post, we will cover the main architectural aspects described in the DeepSeek-V3 Technical Report, i.e. the base model that is later used to train R1-Zero and R1 using GRPO.

For now, we will concentrate on the architectural innovations that DeepSeek achieved in their MoE base model, which we can summarize in three steps:

  • DeepSeekMoE: Introducing a new approach for the Mixture of Experts model in conjunction with a method for Auxiliary-Loss-Free Load Balancing.
  • MLA (Multi-Head Latent Attention): A nearly loss-free alternative to GQA.
  • Multi-Token Prediction (MTP): Enhancing training efficiency by predicting multiple future tokens simultaneously.

Prerequisites

To understand the innovations in the MoE setting and in Multi-Head Latent Attention (MLA), we will quickly (re)introduce the necessary background, starting with the standard Mixture of Experts architecture.

Mixture of Experts (MoE)

In our standard transformer block, recall that the FFN sublayer comes after the MHA sublayer, each preceded by RMSNorm in pre-norm architectures such as Llama. The MoE architecture, which gained wide popularity after its use in GShard, replaces precisely this FFN in a decoder block. The overall idea is to replace the single large FFN with multiple smaller FFNs, called experts, each of which processes only the tokens routed to it according to a learned probability distribution. The network that produces this distribution is known as the router.

*Figure: an MoE layer, where a router distributes tokens across expert FFNs. (Image source)*

Mathematically, this can be expressed as follows:
Let $s$ and $d$ denote the sequence length and hidden dimension, respectively.
We consider a set of experts $f_i: \mathbb{R}^{d} \to \mathbb{R}^{d}$ for $i \in \{1,\dots,n\}$ and
a gating (routing) function $w:\mathbb{R}^{d} \to \mathbb{R}^{n}$.
Then, for each token embedding $x_j \in \mathbb{R}^{d}$,
the output of the Mixture of Experts layer is computed as a weighted sum of the expert outputs:

$$\text{MoE}(x_j) = \sum_{i=1}^{n} w(x_j)_{i} \cdot f_i(x_j), \quad \forall j \in \{1,\dots,s\}.$$

Aggregating over the sequence, we obtain

$$\text{MoE}(x) = [\,\text{MoE}(x_1);\, \text{MoE}(x_2);\, \dots;\, \text{MoE}(x_s)\,] \in \mathbb{R}^{s \times d},$$

where $w(x_j)_{i}$ denotes the gating weight corresponding to expert $f_i$, with $\sum_{i=1}^{n} w(x_j)_{i} = 1$ for all $j \in \{1,\dots,s\}$. The reason this architecture has gained so much traction is that if $w(x_j)$ is sparse, only a small subset of experts is ever active for a given token. Consequently, we can scale the parameter count to trillions while only activating a small subset at any given time, which keeps the compute per token low. This approach leverages the network's ability to learn which subset of experts to use in a given context without having to evaluate a massive dense network.

In practice, our gating function $w$ is also modeled via an FFN.
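
To make the formulation above concrete, here is a minimal sketch of a dense MoE layer in PyTorch. The expert and router modules are plain stand-ins with toy dimensions, not DeepSeek's actual FFNs.

```python
import torch
import torch.nn as nn

# Toy dense MoE layer following the formulation above: every expert processes
# every token and the outputs are combined with softmax gating weights.
# Dimensions and module choices are illustrative only.
d, n_experts, seq_len = 16, 4, 8

experts = nn.ModuleList([
    nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
    for _ in range(n_experts)
])
router = nn.Linear(d, n_experts)  # the gating network w

x = torch.randn(seq_len, d)                                # token embeddings x_1..x_s
w = torch.softmax(router(x), dim=-1)                       # (s, n) gating weights, rows sum to 1
expert_outs = torch.stack([f(x) for f in experts], dim=1)  # (s, n, d)
moe_out = (w.unsqueeze(-1) * expert_outs).sum(dim=1)       # (s, d) weighted sum over experts
print(moe_out.shape)  # torch.Size([8, 16])
```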

Top-K Sampling

To make the gating vector $w(x_j)$ sparse, both GShard and the Switch Transformer popularized top-$K$ sampling. Instead of summing over all the weighted expert outputs, we only keep the $K$ highest gating values for any given $w(x_j)$ and disregard the other experts. The Switch Transformer pushed this to its limit by setting $K=1$, which was previously thought to be infeasible. Mathematically, we can express this as:

$$\begin{aligned} h_j &= \sum_{i=1}^{N} g_{j,i}\,\mathrm{FFN}_i\left(x_j\right) + x_j, \\ g_{j,i} &= \begin{cases} s_{j,i}, & \text{if } s_{j,i} \in \mathrm{Topk}\left(\{ s_{j,l} \mid 1 \le l \le N \},\, K\right), \\ 0, & \text{otherwise}, \end{cases} \\ s_{j,i} &= \mathrm{Softmax}_i\left(\mathrm{FFN}_{\text{route}}(x_j)\right), \end{aligned}$$

where $h_j$ is the final output of the decoder block and $\mathrm{FFN}_{\text{route}}$ implements the gating network, which is often just a simple perceptron.
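
A sketch of this top-$K$ routing, under the same toy assumptions as before: for readability every expert still runs on every token and the selection is applied as a mask, whereas real implementations dispatch only the selected tokens to each expert.

```python
import torch
import torch.nn as nn

# Top-K gating: softmax scores s_{j,i}, keep only the K largest per token,
# zero the rest, combine expert outputs, and add the residual x_j.
d, n_experts, K, seq_len = 16, 8, 2, 4
experts = nn.ModuleList([nn.Linear(d, d) for _ in range(n_experts)])  # stand-in FFNs
ffn_route = nn.Linear(d, n_experts)

x = torch.randn(seq_len, d)
s = torch.softmax(ffn_route(x), dim=-1)                     # (s, n) routing scores
topk_vals, topk_idx = torch.topk(s, K, dim=-1)              # K best experts per token
g = torch.zeros_like(s).scatter_(-1, topk_idx, topk_vals)   # sparse gates g_{j,i}

expert_outs = torch.stack([f(x) for f in experts], dim=1)   # (s, n, d)
h = (g.unsqueeze(-1) * expert_outs).sum(dim=1) + x          # output of the block
print(h.shape)  # torch.Size([4, 16])
```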

Auxiliary Loss & Expert Capacity

When training such MoE models, as described above, we often encounter several issues. One common problem is the over-utilization of one or just a few experts. For example, due to chance during the early stages of training, one expert might yield a slightly lower loss, which causes the gating network to over-rely on that expert. This imbalance means that the other experts receive little to no training, reinforcing the problem and leading to suboptimal overall performance.

The solution to this issue is the introduction of an additional loss term called the auxiliary loss. This loss is used to encourage the network to evenly distribute its selections across all experts during training. We define this loss as:

$$\mathcal{L}_{\text{aux}} = \alpha \cdot \sum_{i=1}^{N} t_i\, p_i,$$

where $t_i$ is the fraction of tokens routed to expert $i$ and $p_i$ is the average routing probability assigned to expert $i$ over the batch. Since $t_i$ itself is not differentiable, the gradient flows through $p_i$, pushing the router toward a more uniform distribution. The scaling factor $\alpha$ is a hyper-parameter. To prevent a single expert from being overloaded, we additionally define a hard limit on how many tokens an expert can handle per batch, called the expert capacity. While the exact definition varies from paper to paper, the one introduced in the Switch Transformer is:

$$C = \left(\frac{\text{tokens per batch}}{\text{number of experts}}\right) \cdot \text{capacity factor},$$

where the capacity factor is again a hyper-parameter. Tokens that exceed the capacity limit $C$ are often dropped, meaning their computation in this layer is skipped and their representation is passed on unchanged through the residual connection; later methods have experimented with dynamically redistributing those tokens to underutilized experts.
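
The following sketch computes the auxiliary loss and the expert capacity for a toy batch under the definitions above, assuming top-1 routing as in the Switch Transformer; `alpha` and `capacity_factor` are illustrative values.

```python
import torch

# Auxiliary load-balancing loss and expert capacity for a toy batch.
alpha, capacity_factor = 0.01, 1.25
n_experts, tokens_per_batch = 8, 1024

scores = torch.softmax(torch.randn(tokens_per_batch, n_experts), dim=-1)  # router probabilities
assignment = scores.argmax(dim=-1)                                        # top-1 expert per token

t = torch.bincount(assignment, minlength=n_experts).float() / tokens_per_batch  # t_i
p = scores.mean(dim=0)                                                          # p_i
aux_loss = alpha * (t * p).sum()   # some papers additionally scale by n_experts

capacity = int(tokens_per_batch / n_experts * capacity_factor)  # 160 tokens per expert here
overflowed = torch.bincount(assignment, minlength=n_experts) > capacity
print(aux_loss.item(), capacity, overflowed.any().item())
```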

DeepSeek-V3-Base Architecture

Now that we have covered the required prerequisites, let's take a closer look at the overall architecture of DeepSeek-V3. We'll begin with the new MoE layer, DeepSeekMoE, which was originally introduced in the DeepSeekMoE paper and refined for DeepSeek-V3.

DeepSeekMoE

DeepSeekMoE introduces several changes to the standard MoE architecture. One key innovation is what the authors call Fine-Grained Expert Segmentation. In this approach, the number of experts is increased by a factor of $m$, while the hidden dimension of each expert is scaled down by a factor of $m$. As a result, the top-$K$ selection is adjusted to a new value of $K' = mK$.

*Figure: DeepSeekMoE architecture, from conventional top-2 routing (a) to fine-grained expert segmentation (b) and shared expert isolation (c). (Image source)*

As illustrated above (in part (b)), when we set $m=2$ and hence $K' = 2K$ (for example, if $K=2$ then $K'=4$), the rationale is to increase the combinatorial complexity of the activated experts. For instance, with $N=16$ experts and a top-2 routing strategy, there are

$$\binom{16}{2} = 120$$

different combinations of experts. However, if we set $m=4$, then the effective number of experts becomes $16 \times 4 = 64$, and with a top routing value of $K' = 8$, we obtain

$$\binom{64}{8} \approx 4.426 \times 10^9$$

different combinations of active experts—all while keeping the overall parameter count roughly the same.
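
The combinatorial claim is easy to verify:

```python
from math import comb

# The combinatorial effect of fine-grained expert segmentation: a similar
# parameter budget, but far more possible expert subsets per token.
print(comb(16, 2))   # 120
print(comb(64, 8))   # 4426165368  (~4.426e9)
```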

Additionally, DeepSeekMoE introduces the concept of shared experts (part (c)). The idea is that every expert may need to learn some common knowledge, and if each expert learns it individually, a lot of redundancy ends up in their parameters. To model this shared information more efficiently, we introduce a set of shared experts whose goal is to capture this common knowledge, which increases the parameter efficiency of the remaining routed experts. For this to work, the shared experts are excluded from the routing mechanism, so every token passes through every shared expert. Given that $N_s$ denotes the number of shared experts and $N_r$ the number of routed experts, DeepSeek-V3 expresses its MoE layer as follows:

$$\begin{aligned} h_j &= x_j + \sum_{i=1}^{N_s} \mathrm{FFN}^{(s)}_i(x_j) + \sum_{i=1}^{N_r} g_{j,i}\, \mathrm{FFN}^{(r)}_i(x_j), \\ g_{j,i} &= \frac{g'_{j,i}}{\sum_{l=1}^{N_r} g'_{j,l}}, \\ g'_{j,i} &= \begin{cases} s_{j,i}, & \text{if } s_{j,i} \in \mathrm{Topk}\bigl(\{ s_{j,l} \mid 1 \le l \le N_r \},\, K_r\bigr), \\ 0, & \text{otherwise}, \end{cases} \\ s_{j,i} &= \mathrm{Sigmoid}\bigl( x_j^{\top} e_i \bigr), \end{aligned}$$

where $e_i$ is a learnable weight vector (the centroid of routed expert $i$). Lastly, DeepSeek-V3 introduces an auxiliary-loss-free load-balancing method that ensures an even distribution of expert utilization without adding a separate loss term. In this approach, an additional bias $b_i$ is added to the affinity scores, but only for the top-$K$ selection in $g'_{j,i}$; the gating value itself is still derived from the original $s_{j,i}$:

$$g'_{j,i} = \begin{cases} s_{j,i}, & \text{if } s_{j,i} + b_i \in \mathrm{Topk}\bigl(\{ s_{j,l} + b_l \mid 1 \le l \le N_r \},\, K_r\bigr), \\ 0, & \text{otherwise}. \end{cases}$$

The bias $b_i$ is then dynamically adjusted during training: it is decreased by $\gamma$ if expert $i$ is considered overloaded and increased by $\gamma$ if it is underloaded. Although the paper does not explicitly specify the criteria for these states, they are likely determined relative to the balanced load, which can be estimated as $\frac{K_r T}{N_r}$ (with $T$ being the total number of tokens, each routed to $K_r$ experts). Although we overall refer to this MoE model as auxiliary-loss-free, the authors do introduce one additional loss term, the Complementary Sequence-Wise Auxiliary Loss, which enforces a balanced expert load within each sequence. This is particularly beneficial during inference, as it helps ensure that the experts, and consequently the GPU resources, are used evenly.

$$\begin{aligned} \mathcal{L}_{\text{CSA}} &= \alpha \cdot \sum_{i=1}^{N_r} f_i \, P_i, \\ f_i &= \frac{N_r}{K_r}\,\frac{1}{T} \sum_{t=1}^{T} 1\bigl\{ s_{t,i} \in \mathrm{Topk}\bigl(\{ s_{t,l} \mid 1 \le l \le N_r \},\, K_r\bigr) \bigr\}, \\ s'_{t,i} &= \frac{s_{t,i}}{\sum_{l=1}^{N_r} s_{t,l}}, \\ P_i &= \frac{1}{T} \sum_{t=1}^{T} s'_{t,i}, \end{aligned}$$

where $T$ is the number of tokens in the sequence. We can see that $f_i$ represents the fraction of tokens in the sequence routed to expert $i$, scaled by the ratio of the total number of routed experts to the number of active experts. This is then weighted by the average probability that expert $i$ is chosen within the sequence and summed over all routed experts. Note that this formulation is essentially the same as the auxiliary loss above, but it uses empirical averages over intra-sequence statistics. To ensure that the influence of this term remains small, leaving most of the load-balancing work to the bias term, we set $\alpha \ll 1$.
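
Putting the sigmoid routing, the bias-based balancing, and the sequence-wise statistics together, here is a sketch for a single sequence. The expert centroids, the details of the bias update, and all dimensions are illustrative assumptions, not the exact training procedure.

```python
import torch

# DeepSeek-V3-style routing for one sequence: sigmoid affinities, bias-adjusted
# top-K *selection only*, normalized gates, a simple version of the bias update,
# and the sequence-wise balance statistics f_i and P_i.
T, d, N_r, K_r = 16, 32, 8, 2
alpha, gamma = 1e-4, 1e-3

e = torch.randn(N_r, d)                    # expert centroids e_i (assumed random here)
b = torch.zeros(N_r)                       # load-balancing bias b_i
x = torch.randn(T, d)                      # token embeddings

s = torch.sigmoid(x @ e.T)                 # affinities s_{j,i}, shape (T, N_r)
_, idx = torch.topk(s + b, K_r, dim=-1)    # bias influences the selection only
mask = torch.zeros_like(s).scatter_(-1, idx, 1.0)
g_prime = s * mask                         # gate values come from the original s
g = g_prime / g_prime.sum(dim=-1, keepdim=True)

# bias update sketch: push each expert's load toward the balanced value K_r*T/N_r
load = mask.sum(dim=0)
b = b - gamma * torch.sign(load - K_r * T / N_r)

# sequence-wise auxiliary loss L_CSA
f = (N_r / K_r) * mask.mean(dim=0)                    # scaled routed fraction per expert
P = (s / s.sum(dim=-1, keepdim=True)).mean(dim=0)     # mean normalized affinity per expert
L_csa = alpha * (f * P).sum()
print(g.shape, load.tolist(), L_csa.item())
```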

KV-Caching

As a refresher, we will briefly revisit key-value caching, which we already covered in a previous post.

One of the most important inference optimizations used in Llama-style architectures is the so-called key-value cache. Its purpose becomes clear if we look at the figure below:

*Figure: the attention matrix during autoregressive decoding; to predict the next token, only the last row is needed. (Image source)*

As you know, during inference we sample the next token and append it to the sequence before we feed this new sequence into the transformer to predict the next token. But as you can see in the figure, to predict token 5, we only need query token 4 to multiply with the keys. This means that to predict token 5, we only need the last row of the attention matrix. Thus, instead of feeding in the entire sequence of tokens of length $n$ to predict token $n+1$, we just feed in the $n$-th token. For the attention scores and the subsequent multiplication with $V$, however, we still need the keys and values of all previous tokens. This is exactly where the key-value cache (KV-cache) comes in.

*Figure: the KV-cache mechanism, where keys and values of previous tokens are stored and reused. (Image source)*

For every token we process, we store its key and value in the KV-cache for later use. By doing this, we save a significant number of attention computations, as we only need to compute the last row of the attention matrix! Nice!
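
A minimal single-head decoding loop with a KV-cache might look like the following sketch; the projection matrices are random stand-ins, and batching, multiple heads, and positional embeddings are omitted.

```python
import torch

# At each decoding step we compute q, k, v only for the newest token, append
# k and v to the cache, and attend the new query against all cached keys and
# values, i.e. only the last row of the attention matrix.
d = 16
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
k_cache, v_cache = [], []

def decode_step(h_new):                      # h_new: (d,) embedding of the latest token
    q = W_q @ h_new
    k_cache.append(W_k @ h_new)
    v_cache.append(W_v @ h_new)
    K = torch.stack(k_cache)                 # (t, d) all cached keys
    V = torch.stack(v_cache)                 # (t, d) all cached values
    attn = torch.softmax(K @ q / d ** 0.5, dim=0)   # last row of the attention matrix
    return attn @ V                          # context vector for the new token

for _ in range(5):                           # five autoregressive steps
    out = decode_step(torch.randn(d))
print(out.shape)  # torch.Size([16])
```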

Multi-Head Latent Attention (MLA)

Multi-Head Latent Attention was first introduced in DeepSeek-V2 and is an approach that drastically reduces the size of the KV-cache during inference. Let's, for example, consider the architecture of DeepSeek-V3 and calculate its memory requirements during inference when using a 100K-token context window.

For DeepSeek-V3, we have a head dimension of $d_h = 128$, with $n_h = 128$ heads per attention layer. In total, there are $l = 61$ such layers. If we now use FP16 precision for each cached value and the full context window $n = 10^5$, we arrive at a KV-cache size of

$$\frac{2 \times n \times d_h \times n_h \times l \times 2}{10^9} \approx 400.$$

This means the size of our KV-cache for a full context window is approximately 400 GB! This is tremendous, and therefore there have been multiple suggestions over the years to save space in the KV-cache. One of these, which found application in the architecture of Llama 3, is GQA (Grouped Query Attention), where multiple query heads share a single key and value head, although this comes at some cost in accuracy.
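
The arithmetic behind the 400 GB figure:

```python
# KV-cache size: 2 (keys and values) * n tokens * d_h * n_h * l layers * 2 bytes (FP16).
n, d_h, n_h, l = 100_000, 128, 128, 61
bytes_per_value = 2  # FP16
kv_cache_bytes = 2 * n * d_h * n_h * l * bytes_per_value
print(kv_cache_bytes / 1e9)  # ~399.8 GB
```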

The idea of MLA is now to compress the incoming token embeddings $X \in \mathbb{R}^{s \times d}$, where $s$ is the sequence length and $d = d_h n_h$ the embedding dimension, with a learned weight matrix $W^{DKV} \in \mathbb{R}^{d_c \times d}$, to obtain

$$c_t^{KV} = W^{DKV}h_t,$$

where $c_t^{KV} \in \mathbb{R}^{d_c}$, with $d_c \ll d_h n_h$, is the compressed latent representation of the keys and values of token $t$. To obtain our keys and values, we define $W^{UK} \in \mathbb{R}^{(d_h n_h) \times d_c}$ and $W^{UV} \in \mathbb{R}^{(d_h n_h) \times d_c}$, which learn to up-project our compressed representation $c_t^{KV}$:

$$k_t^C = W^{UK} c_t^{KV}, \qquad v_t^C = W^{UV} c_t^{KV}.$$

After this, we resume with our standard attention mechanism. Now, before covering the genius of this approach, note that through this low-rank factorization of the traditional $W^K$ as $W^{UK}W^{DKV}$, we achieve a smaller parameter count because $d_c \ll d_h n_h$. This is very similar to what we do when fine-tuning with LoRA to minimize the number of parameters we have to tune.

One might be tempted to think that we have just traded reduced memory requirements for increased computational demand by introducing two new matrices. However, this is exactly where the ingenuity of MLA lies. We only have to learn the additional matrices $W^{UK}$ and $W^{UV}$ during training. During inference, where the KV-cache normally comes into play, we only need to cache $c_t^{KV}$, because $W^{UK}$ can be absorbed into the query projection and $W^{UV}$ into the output projection. The following relations hold:

$$\mathbf{q}_t = W^{Q} \mathbf{h}_t, \quad \mathbf{k}_t = W^{UK} c_t^{KV} = W^{UK} ( W^{DKV} \mathbf{h}_t ), \quad \mathbf{v}_t = W^{UV} c_t^{KV} = W^{UV} ( W^{DKV} \mathbf{h}_t ).$$

From this, we can conclude that the standard attention mechanism will look like

$$\mathrm{Attn}(\mathbf{h}_t, H) = \mathrm{softmax}\!\left( \frac{(W^{Q}\mathbf{h}_t)\,(W^{UK} W^{DKV} H)^{\top}} {\sqrt{d_h}} \right) (W^{UV} W^{DKV} H).$$

We can rewrite this by regrouping the terms $W^{Q}$ and $W^{UK}$, resulting in

$$\mathrm{Attn}(\mathbf{h}_t, H) = \mathrm{softmax}\!\left( \frac{\bigl((W^{UK})^{\top} W^{Q}\mathbf{h}_t\bigr)\,(W^{DKV} H)^{\top}} {\sqrt{d_h}} \right) (W^{UV} W^{DKV} H).$$

As we can see, during inference we can precompute $(W^{UK})^{\top} W^{Q} \in \mathbb{R}^{d_c \times d}$ for every layer and use the cached $c^{KV}$ for the attention calculation, thereby not incurring any additional computational cost with this approach.

Similarly, we can precompute $W^{O} W^{UV} \in \mathbb{R}^{d \times d_c}$. In multi-head attention, we combine the results from all heads and project them using $W^{O}$:

$$\mathbf{o}_t = W^{O} \left( \mathrm{softmax}\!\left( \cdots \right) (W^{UV} W^{DKV} H) \right).$$

Using associativity, we finally end up with

$$\mathbf{o}_t = (W^{O} W^{UV}) \left( \mathrm{softmax}\!\left( \cdots \right) (W^{DKV} H) \right).$$
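
We can sanity-check this absorption argument numerically with random matrices, in a single-head, column-vector convention (toy dimensions, per-head structure omitted): attending with materialized keys and values gives the same output as attending with the absorbed matrices against only the cached latent $W^{DKV}H$.

```python
import torch

# Verify that the naive MLA path and the absorbed-weights path agree.
torch.manual_seed(0)
d, d_c, d_h, s = 32, 8, 32, 5                 # model dim, latent dim, head dim, seq len
scale = 0.1                                   # keep magnitudes small for a clean comparison

W_Q   = scale * torch.randn(d_h, d)
W_DKV = scale * torch.randn(d_c, d)
W_UK  = scale * torch.randn(d_h, d_c)
W_UV  = scale * torch.randn(d_h, d_c)
W_O   = scale * torch.randn(d, d_h)

H   = torch.randn(d, s)                       # all token embeddings (columns)
h_t = H[:, -1]                                # current query token

# naive path: materialize keys and values
C = W_DKV @ H                                 # cached latents, (d_c, s)
K, V = W_UK @ C, W_UV @ C                     # (d_h, s)
attn = torch.softmax((W_Q @ h_t) @ K / d_h ** 0.5, dim=0)
out_naive = W_O @ (V @ attn)

# absorbed path: precompute (W_UK)^T W_Q and W_O W_UV, use only the cached C
W_q_abs = W_UK.T @ W_Q                        # (d_c, d)
W_o_abs = W_O @ W_UV                          # (d, d_c)
attn2 = torch.softmax((W_q_abs @ h_t) @ C / d_h ** 0.5, dim=0)
out_absorbed = W_o_abs @ (C @ attn2)

print(torch.allclose(out_naive, out_absorbed, atol=1e-5))  # True
```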

However, MLA adds one more step to handle the Rotary Positional Embedding (RoPE).
The complete MLA formulation from the paper is given as

$$\begin{aligned} c_t^{KV} &= W^{DKV} h_t, \\ [k_{t,1}^{C};\, k_{t,2}^{C};\, \dots;\, k_{t,n_h}^{C}] = k_t^{C} &= W^{UK} c_t^{KV}, \\ k_t^{R} &= \mathrm{RoPE}(W^{KR} h_t), \\ k_{t,i} &= [k_{t,i}^{C};\, k_t^{R}], \\ [v_{t,1}^{C};\, v_{t,2}^{C};\, \dots;\, v_{t,n_h}^{C}] = v_t^{C} &= W^{UV} c_t^{KV}. \end{aligned}$$

It is important to note that $t$ denotes the index of the $t$-th token and $n_h$ the number of attention heads.
As we can see, the formulation introduces an additional term $k_t^{R}$, which carries the positional information of the token at position $t$.

Here, $W^{KR} \in \mathbb{R}^{d_h^{R} \times d}$ is a projection matrix, where $d_h^{R}$ specifies the number of dimensions used to down-project $h_t$ before applying the Rotary Positional Embedding (RoPE).
Finally, we concatenate $k_{t,i}^{C}$ and $k_t^{R}$ to form the complete key vector $k_{t,i}$ for token $t$ and attention head $i$. During inference, we only need to store $c_t^{KV}$ and $k_t^{R}$, both of which are significantly smaller than the full key and value vectors for each token at their original embedding dimension.
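
To get a feeling for the savings, here is a back-of-the-envelope comparison of the per-token cache size, assuming $d_c = 512$ and $d_h^R = 64$ (the values reported for MLA in DeepSeek-V2; treat them as assumptions here) together with the DeepSeek-V3 attention dimensions from above.

```python
# Per-token, per-layer KV-cache comparison (FP16, 2 bytes per value).
# Standard MHA caches full keys and values; MLA caches only c^KV and k^R.
d_h, n_h, l, n = 128, 128, 61, 100_000
d_c, d_h_rope = 512, 64            # assumed MLA dimensions (DeepSeek-V2 values)

mha_per_token = 2 * d_h * n_h * 2          # keys + values, in bytes
mla_per_token = (d_c + d_h_rope) * 2       # compressed latent + decoupled RoPE key

print(mha_per_token / mla_per_token)                      # ~56.9x smaller
print(mla_per_token * l * n / 1e9, "GB for 100K tokens")  # ~7.0 GB vs ~400 GB
```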

Additionally, they introduce a low-rank compression for the queries. Unlike the key/value latents, the query latents are not cached; the compression is used solely because it reduces activation memory during training.

$$\begin{aligned} c_t^{Q} &= W^{DQ} h_t, \\ [q_{t,1}^{C};\, q_{t,2}^{C};\, \dots;\, q_{t,n_h}^{C}] = q_t^{C} &= W^{UQ} c_t^{Q}, \\ [q_{t,1}^{R};\, q_{t,2}^{R};\, \dots;\, q_{t,n_h}^{R}] = q_t^{R} &= \mathrm{RoPE}(W^{QR} c_t^{Q}), \\ q_{t,i} &= [q_{t,i}^{C};\, q_{t,i}^{R}]. \end{aligned}$$

Again, we introduce a down-projection matrix $W^{DQ} \in \mathbb{R}^{d_c' \times d}$,
where, as before, $d_c' \ll d_h n_h$.

When we need $q_{t,i}^{C}$ again, we can recompute it via $W^{UQ} c_t^{Q}$, with $W^{UQ} \in \mathbb{R}^{(d_h n_h) \times d_c'}$.

They further use $W^{QR} \in \mathbb{R}^{(d_h^{R} n_h) \times d_c'}$ to encode the compressed latent representation $c_t^{Q}$ with RoPE, and finally concatenate $q_{t,i}^{C}$ and $q_{t,i}^{R}$ to obtain the complete query vector for token $t$ and head $i$:

$$q_{t,i} = [q_{t,i}^{C};\, q_{t,i}^{R}].$$

This way, we do not need to store $q_{t,i}$ during training but can simply recompute it during the backward pass, thereby saving precious GPU memory.
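
As a shape-level summary of the query and key construction with the decoupled RoPE part, here is a sketch for a single token; the RoPE rotation itself is left out (identity placeholder), and all dimensions are small toy values rather than DeepSeek-V3's.

```python
import torch

# Shape flow of the MLA projections for one token h_t: compressed latents c^KV
# and c^Q, up-projections, a decoupled RoPE part, and per-head concatenation.
d, n_h, d_h, d_c, d_c_q, d_h_rope = 64, 4, 16, 8, 12, 4

W_DKV = torch.randn(d_c, d);         W_UK = torch.randn(n_h * d_h, d_c)
W_UV  = torch.randn(n_h * d_h, d_c); W_KR = torch.randn(d_h_rope, d)
W_DQ  = torch.randn(d_c_q, d);       W_UQ = torch.randn(n_h * d_h, d_c_q)
W_QR  = torch.randn(n_h * d_h_rope, d_c_q)

h_t = torch.randn(d)
c_kv, c_q = W_DKV @ h_t, W_DQ @ h_t                  # cached KV latent / query latent
k_c = (W_UK @ c_kv).view(n_h, d_h)                   # per-head content keys
v_c = (W_UV @ c_kv).view(n_h, d_h)                   # per-head values
k_r = W_KR @ h_t                                     # shared RoPE key (one per token)
q_c = (W_UQ @ c_q).view(n_h, d_h)
q_r = (W_QR @ c_q).view(n_h, d_h_rope)               # per-head RoPE queries

k = torch.cat([k_c, k_r.expand(n_h, -1)], dim=-1)    # k_{t,i} = [k^C_{t,i}; k^R_t]
q = torch.cat([q_c, q_r], dim=-1)                    # q_{t,i} = [q^C_{t,i}; q^R_{t,i}]
print(q.shape, k.shape, v_c.shape)                   # (n_h, d_h + d_h_rope) x2, (n_h, d_h)
```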

Multi-Token Prediction

Another important step in the training of DeepSeek-V3-Base was the introduction of MTP. Instead of calculating the loss solely from the cross-entropy between the predicted next-token distribution and the ground truth, the model is additionally trained, at each prediction depth $k$, to predict tokens further into the future. The authors justify this approach by arguing that it may both densify the training signal and help the model better plan the prediction of future tokens.

*Figure: the Multi-Token Prediction (MTP) setup, where additional MTP modules predict further future tokens on top of the main model. (Image source)*

Each additional future token is predicted using an MTP block (see the equations below). These blocks consist of the embedding layer and output head shared with the main model, together with a Transformer block ($\mathrm{TRM}_k$) and a linear projection matrix $M_k \in \mathbb{R}^{d \times 2d}$. As shown in the figure, we take the output of the $i$-th token from the $(k-1)$-th MTP block (for $k=1$, from the main model) and normalize it using RMSNorm. In addition, we compute the embedding of the $(i+k)$-th token using the shared embedding layer. Both representations are then concatenated and projected through $M_k$, yielding

$$h'^{(k)}_{i} = M_{k} \bigl[ \mathrm{RMSNorm}(h^{(k-1)}_{i});\ \mathrm{RMSNorm}(\mathrm{Emb}(t_{i+k})) \bigr].$$

We now use the representations $h'^{(k)}_{i}$ as inputs to the Transformer block:

$$h^{(k)}_{1:T-k} = \mathrm{TRM}_k\left(h'^{(k)}_{1:T-k}\right).$$

Note that the Transformer at depth $k$ only processes the first $T-k$ positions. As illustrated in the figure, the embeddings of the future tokens are incorporated beforehand, during the concatenation and projection step; the Transformer itself operates solely on $h'^{(k)}_{1:T-k}$.

The output of the Transformer block, $h^{(k)}_{i}$, is then fed into the shared output head to predict the probability distribution $P^{(k)}_{i+k+1}$ for the token $t_{i+k+1}$:

$$P^{(k)}_{i+k+1} = \mathrm{OutHead}\left(h^{(k)}_{i}\right).$$

For the loss calculation, we use the following setup. Recall that for the first MTP block ($k = 1$), we aim to predict the token at position $i + k + 1$. Therefore, for the first input token $t_1$, the prediction target becomes $t_3$. Accordingly, the loss formulation for an MTP module is given by

$$L^{(k)}_{\mathrm{MTP}} = \mathrm{CrossEntropy}\!\left(P^{(k)}_{2+k:T+1},\, t_{2+k:T+1}\right) = -\frac{1}{T} \sum_{i=2+k}^{T+1} \log P^{(k)}_{i}[t_i].$$

Finally, we average these losses over all prediction depths $D$ and weight them by a scaling factor $\lambda$:

$$L_{\mathrm{MTP}} = \frac{\lambda}{D} \sum_{k=1}^{D} L^{(k)}_{\mathrm{MTP}}.$$
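
A sketch of one MTP depth and its loss, following the equations above. The Transformer block is a generic `nn.TransformerEncoderLayer` stand-in (causal masking omitted), the shared embedding and output head are plain modules, all sizes are toy values, and the loss here averages over the predicted positions rather than dividing by $T$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One MTP depth k: concatenate h^{(k-1)}_i with Emb(t_{i+k}), project with M_k,
# run a Transformer block, and compute the cross-entropy against t_{i+k+1}.
V, d, T, k = 100, 32, 12, 1          # vocab size, hidden dim, sequence length, depth

embed = nn.Embedding(V, d)           # shared with the main model
out_head = nn.Linear(d, V)           # shared output head
M_k = nn.Linear(2 * d, d, bias=False)
trm_k = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
rms = nn.RMSNorm(d)                  # needs a recent PyTorch; LayerNorm would also do here

tokens = torch.randint(0, V, (1, T + 1))       # t_1 .. t_{T+1}
h_prev = torch.randn(1, T, d)                  # h^{(k-1)}: stand-in for main-model outputs

# combine h^{(k-1)}_i with the embedding of token t_{i+k}, for positions 1..T-k
h_in = M_k(torch.cat([rms(h_prev[:, : T - k]),
                      rms(embed(tokens[:, k : T]))], dim=-1))
h_k = trm_k(h_in)                              # h^{(k)}_{1:T-k}
logits = out_head(h_k)                         # predicts t_{2+k} .. t_{T+1}

targets = tokens[:, 1 + k : T + 1]             # shifted by k+1 relative to the inputs
loss_mtp_k = F.cross_entropy(logits.reshape(-1, V), targets.reshape(-1))
print(loss_mtp_k.item())
```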

Note that this entire multi-token prediction procedure is applied solely during training to enhance the model's ability to anticipate future tokens. During inference, the MTP blocks are simply discarded (the paper notes they could alternatively be repurposed for speculative decoding).