GradCache implementation?
Hi @MrLight ,
Your implementation in the Tevatron library only uses gradient accumulation, and GradCache isn't supported. Is gradient accumulation good enough to enable a large batch size instead of GradCache? Thanks.
Hi,
GradCache is not used in the original implementation, as the current GradCache does not support DeepSpeed yet.
Gradient accumulation would be good enough here.
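For context, gradient accumulation just sums gradients over several micro-batches before taking one optimizer step, so the effective batch size grows without extra memory. A minimal sketch in plain PyTorch (the model, data, and step counts are toy placeholders, not Tevatron's trainer code):

```python
import torch

# Toy setup -- illustrative only, not Tevatron's actual trainer.
model = torch.nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()
accumulation_steps = 4  # effective batch = micro-batch size * accumulation_steps

optimizer.zero_grad()
for step in range(16):
    x = torch.randn(2, 8)                      # one micro-batch
    y = torch.zeros(2, dtype=torch.long)
    loss = loss_fn(model(x), y) / accumulation_steps  # scale so the sum matches a full-batch mean
    loss.backward()                            # gradients accumulate in .grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                       # update once per effective batch
        optimizer.zero_grad()
```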
Xueguang
Thank you @MrLight. Btw, in the forward function of the RankLlama model, the target is set to zero:
Shouldn't this method accept a labels parameter that is then used to calculate the loss? As far as I can see, there isn't a way to signal "positive" vs. "negative" pairs to the model. Am I missing something?
Hi @serialcoder ,
`ranker_logits.view(self.train_batch_size, -1)`
The reranker logits are reshaped so that the first score in each group belongs to the positive pair; the target is therefore set to index 0 for each group.
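In other words, the data collator places the positive passage first in each query's group, so after reshaping, column 0 of every row holds the positive score, and cross-entropy with an all-zero target trains the positive to outscore the sampled negatives. A hedged sketch of that loss computation (shapes and variable names are illustrative, not the exact Tevatron code):

```python
import torch
import torch.nn as nn

train_batch_size = 2   # queries per batch (illustrative)
group_size = 4         # 1 positive + 3 negatives per query, positive first

# ranker_logits: one relevance score per (query, passage) pair,
# flattened as it comes out of the model head.
ranker_logits = torch.randn(train_batch_size * group_size, 1)

# Reshape so each row holds one query's group of scores;
# column 0 is the positive passage's score.
grouped_logits = ranker_logits.view(train_batch_size, -1)

# Target index 0 for every group: cross-entropy pushes the positive's
# score above the scores of the negatives in the same group.
target = torch.zeros(train_batch_size, dtype=torch.long)
loss = nn.CrossEntropyLoss()(grouped_logits, target)
print(loss)
```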