GradCache implementation?
Hi @MrLight ,
Your implementation in the Tevatron library only uses gradient accumulation, and GradCache isn't supported. Is gradient accumulation good enough to enable a large batch size instead of GradCache? Thanks.
Hi,
GradCache is not used in the original implementation, as the current GradCache does not support DeepSpeed yet.
Gradient accumulation would be good enough here.
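For context, gradient accumulation just sums gradients over several micro-batches before taking one optimizer step, so the effective batch size grows without extra memory. A minimal sketch in plain PyTorch (the model, data, and step counts are toy placeholders, not Tevatron's trainer code):

```python
import torch

# Toy setup -- illustrative only, not Tevatron's actual trainer.
model = torch.nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()
accumulation_steps = 4  # effective batch = micro-batch size * accumulation_steps

optimizer.zero_grad()
for step in range(16):
    x = torch.randn(2, 8)                      # one micro-batch
    y = torch.zeros(2, dtype=torch.long)
    loss = loss_fn(model(x), y) / accumulation_steps  # scale so the sum matches a full-batch mean
    loss.backward()                            # gradients accumulate in .grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                       # update once per effective batch
        optimizer.zero_grad()
```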
Xueguang
Thank you @MrLight. Btw, in the forward function of the RankLlama model, the target is set to zero:
Shouldn't this method accept a labels parameter that is then used to calculate the loss? As far as I can see, there isn't a way to signal "positive" vs. "negative" pairs to the model. Am I missing something?
Hi @serialcoder ,
`ranker_logits.view(self.train_batch_size, -1)`
The reranker logits are reshaped so that the first score in each group belongs to the positive pair; the target is therefore set to index 0 for each group.
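In other words, the data collator places the positive passage first in each query's group, so after reshaping, column 0 of every row holds the positive score, and cross-entropy with an all-zero target trains the positive to outscore the sampled negatives. A hedged sketch of that loss computation (shapes and variable names are illustrative, not the exact Tevatron code):

```python
import torch
import torch.nn as nn

train_batch_size = 2   # queries per batch (illustrative)
group_size = 4         # 1 positive + 3 negatives per query, positive first

# ranker_logits: one relevance score per (query, passage) pair,
# flattened as it comes out of the model head.
ranker_logits = torch.randn(train_batch_size * group_size, 1)

# Reshape so each row holds one query's group of scores;
# column 0 is the positive passage's score.
grouped_logits = ranker_logits.view(train_batch_size, -1)

# Target index 0 for every group: cross-entropy pushes the positive's
# score above the scores of the negatives in the same group.
target = torch.zeros(train_batch_size, dtype=torch.long)
loss = nn.CrossEntropyLoss()(grouped_logits, target)
print(loss)
```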