
ddp training problem (NCCL during evaluation) #61

Open
SlenderMongoose opened this issue Oct 5, 2024 · 1 comment
SlenderMongoose commented Oct 5, 2024

  1. Remove the words "YES" and "NO" from product titles because of the fragile evaluation process, or use

return logits[:, 1][-1:], gold[-1:]

in the function preprocess_logits_for_metrics.

  2. Important: ensure that the tokenized prompt remains shorter than the cutoff length; otherwise, the RS label will be lost during evaluation!
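Fix 1 can be done with a simple preprocessing pass over product titles before building prompts. This is a minimal sketch; the `clean_title` helper and the case-insensitive word pattern are my own illustration, not code from the TALLRec repo:

```python
import re

# Hypothetical helper: strip standalone "yes"/"no" words (any case) from a
# product title so the tokenizer never emits the Yes/No label tokens for it.
def clean_title(title: str) -> str:
    # \b also matches in "No.1", since "." is a word boundary
    cleaned = re.sub(r"\b(yes|no)\b", "", title, flags=re.IGNORECASE)
    # collapse the whitespace left behind
    return re.sub(r"\s+", " ", cleaned).strip()

print(clean_title("No.1 Wireless Earbuds"))  # -> ".1 Wireless Earbuds"
print(clean_title("Say YES to savings"))     # -> "Say to savings"
```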

During multi-GPU DDP validation, the run hangs because it either cannot find a yes/no token or finds several of them, so the measures above are necessary. The once-and-for-all fix is to rewrite the validation: the root cause of this error is that the author does not carry the yes/no label into validation, but instead digs yes and no out of the original prompt after validation has already started.

So for problem 1 you can use the return statement above, which assumes the last yes/no occurrence is the CTR label.

This error can be triggered whenever part of a word tokenizes to yes or no (very common in e-commerce data; for example, a product named "No.1 blah blah..." already contributes one occurrence of "No").
A multi-GPU crash usually happens once the cumulative count exceeds three, i.e., the occurrences of yes or no in the product titles of the preference history add up to more than 3. This is tied to the author's original validation logic (I don't know why every third element is taken as the label, but that is the original logic, and that position should hold the one and only CTR label).
Even when nothing crashes or hangs, the CTR label the original code extracts can still be flipped by a yes or no appearing in some product title.
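The hang mechanism can be illustrated without torch: each DDP rank extracts one entry per yes/no match, so a stray "No" in a title gives that rank an extra element and the collective op blocks on mismatched shapes. A minimal sketch with made-up per-rank token sequences (8241 = "Yes", 3782 = "No" in the LLaMA tokenizer):

```python
YES, NO = 8241, 3782

# Hypothetical per-rank label sequences after tokenization.
# Rank 0's prompt is clean; rank 1's contains "No.1 ..." in a title,
# which tokenizes an extra "No" before the real label.
rank0_labels = [101, 205, 999, YES]
rank1_labels = [101, NO, 205, 999, NO]

def extract_label_positions(labels):
    # mirrors torch.argwhere(labels == YES | labels == NO)
    return [i for i, t in enumerate(labels) if t in (YES, NO)]

print(len(extract_label_positions(rank0_labels)))  # 1
print(len(extract_label_positions(rank1_labels)))  # 2
# NCCL collectives during evaluation expect equal-sized tensors on every
# rank; 1 vs 2 entries means the gather never completes and DDP hangs.
```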

The second fix is mandatory; there is nothing to debate here: after cutoff, the label of an over-long user profile is guaranteed to be lost.

In TALLRec fine-tuning, the training stage uses autoregressive labels, while the validation stage uses the recommendation CTR label.
The test stage does not have this problem; you only need to keep the prompt under LLaMA-1's 2048-token limit, because testing pulls the label out first before entering.
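A guard for fix 2 can be sketched as follows. `count_tokens` is a stand-in for the real LLaMA tokenizer (an assumption; actual code would count `tokenizer(prompt)["input_ids"]`), and 2048 is the LLaMA-1 context limit mentioned above:

```python
CUTOFF_LEN = 2048  # LLaMA-1 context window

def count_tokens(prompt: str) -> int:
    # Stand-in for the real tokenizer; whitespace splitting only
    # approximates the true subword count.
    return len(prompt.split())

def fits_before_cutoff(prompt: str, cutoff_len: int = CUTOFF_LEN) -> bool:
    # Leave headroom for the "Yes"/"No" label token so truncation can
    # never drop it during evaluation.
    return count_tokens(prompt) + 1 <= cutoff_len

assert fits_before_cutoff("a short prompt")
assert not fits_before_cutoff("w " * 3000)
```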

SlenderMongoose commented Oct 6, 2024

The author selects labels at every third step to enable batch validation; however, the issue mentioned above still persists. Thus, the following modification is suggested.

import torch

# For batch validation, try this one.
def preprocess_logits_for_metrics(logits, labels):
    def filter_last_indices(labels_index):
        # Keep only the last "Yes"/"No" occurrence in each batch row,
        # so stray yes/no tokens inside product titles are ignored.
        unique_values, indices = torch.unique(labels_index[:, 0], return_inverse=True)
        max_indices = torch.zeros(len(unique_values), dtype=torch.long)
        for i in range(len(unique_values)):
            # reshape(-1) keeps `group` 1-D even when a row has a single match
            group = torch.nonzero(indices == i, as_tuple=False).reshape(-1)
            max_in_group = torch.argmax(labels_index[group, 1])
            max_indices[i] = group[max_in_group]
        return labels_index[max_indices]

    # 8241 = "Yes", 3782 = "No" in the LLaMA tokenizer
    labels_index = torch.argwhere(torch.bitwise_or(labels == 8241, labels == 3782))
    labels_index = filter_last_indices(labels_index)
    gold = torch.where(labels[labels_index[:, 0], labels_index[:, 1]] == 3782, 0, 1)
    # Shift one step left: the logits at position t predict the token at t
    labels_index[:, 1] = labels_index[:, 1] - 1
    logits = logits.softmax(dim=-1)
    # Renormalize over the two candidate tokens ("No", "Yes")
    logits = torch.softmax(logits[labels_index[:, 0], labels_index[:, 1]][:, [3782, 8241]], dim=-1)
    return logits[:, 1], gold  # yes prob, yes label
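The key step above, filter_last_indices, keeps only the last yes/no position per batch row. Its effect can be checked in plain Python; the (row, col) pairs below are made-up stand-ins for torch.argwhere output:

```python
def filter_last_indices(pairs):
    # pairs: (row, col) positions of every yes/no token, as torch.argwhere
    # would produce them; keep only the right-most column per row.
    last = {}
    for row, col in pairs:
        if row not in last or col > last[row]:
            last[row] = col
    return sorted(last.items())

# Row 0 has a stray "No" at col 3 plus the real label at col 9;
# row 1 has a single label at col 7.
print(filter_last_indices([(0, 3), (0, 9), (1, 7)]))  # [(0, 9), (1, 7)]
```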
