Clarification Needed: Description of Gradient Accumulation's Peak Memory Impact Seems Incorrect

#122
by XiaoBanni - opened

Hi everyone,

I came across the following text describing gradient accumulation, and its conclusion is confusing me:

"Using gradient accumulation means we need to keep buffers where we accumulate gradients that persist throughout a training step, whereas without gradient accumulation, in the backward pass gradients are computed while freeing the activation memory, which means lower peak memory use."

This text concludes that training without gradient accumulation leads to lower peak memory use.

This seems to be the exact opposite of my understanding and the primary purpose of the technique.

My understanding is that gradient accumulation is a "time-for-space" trade-off specifically designed to reduce peak memory usage, allowing us to train with larger effective batch sizes on memory-constrained hardware.

My reasoning is:

  1. Peak Memory Bottleneck: The main driver of peak memory is not the gradient buffers, but the activations saved during the forward pass, which are needed for the backward pass.
  2. Without Gradient Accumulation (Standard Training): To process a large batch of size $B$, the model must compute and store activations for all $B$ samples simultaneously. Peak Activation Memory $\propto B$.
  3. With Gradient Accumulation: We process a small micro-batch of size $b$ (where $b \ll B$) at a time. The model only needs to store activations for $b$ samples. After its micro-backward pass, these $b$ activations are freed. Peak Activation Memory $\propto b$.

Therefore, gradient accumulation should result in significantly lower peak memory use.
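For concreteness, here is a minimal sketch of the loop I have in mind (a PyTorch toy example; the model, sizes, learning rate, and random data are placeholders I made up, not taken from any particular codebase):

```python
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Toy model and optimizer, purely for illustration.
model = nn.Linear(1024, 1024).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

full_batch = 64                           # effective batch size B
micro_batch = 8                           # micro-batch size b
accum_steps = full_batch // micro_batch

optimizer.zero_grad()
for _ in range(accum_steps):
    # Only `micro_batch` samples' activations are alive here; they are
    # freed as soon as this micro backward pass completes.
    x = torch.randn(micro_batch, 1024, device=device)
    y = torch.randn(micro_batch, 1024, device=device)
    loss = loss_fn(model(x), y) / accum_steps  # scale so gradients average over B
    loss.backward()                            # gradients accumulate into param.grad

# A single optimizer step for the whole effective batch of size B.
optimizer.step()
optimizer.zero_grad()
```

The `param.grad` buffers persist across all `accum_steps` micro-backward passes (which is the extra cost the quoted text mentions), but the activation footprint at any instant is proportional to `micro_batch`, not `full_batch`.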

Am I misunderstanding something fundamental, or is the conclusion in the quoted text misleading?

Thanks in advance for any clarification!

Yes, the statement is misleading. Gradient accumulation makes the gradient buffers persist slightly longer, but it dramatically reduces activation memory requirements, so overall peak memory usage decreases, not increases.
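To put it roughly (ignoring optimizer state and framework overhead, purely as an illustrative approximation), writing $M_{\text{params}}$, $M_{\text{grads}}$, and $M_{\text{act}}(\cdot)$ for the parameter, gradient, and activation footprints: peak memory without accumulation is about $M_{\text{params}} + M_{\text{grads}} + M_{\text{act}}(B)$, while with accumulation it is about $M_{\text{params}} + M_{\text{grads}} + M_{\text{act}}(b)$. The gradient term is proportional to the parameter count and identical in both cases; only the activation term changes, shrinking by roughly a factor of $B/b$, and since activations usually dominate the peak, the total goes down.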
