
bugfix #6103

Open
wants to merge 1 commit into main

Conversation

2877992943

What does this PR do?

Fixes # (neat packing data-loss bug)

In the method preprocess_packed_supervised_dataset, this line

index = length2indexes[length].pop()

loses data and leads to degraded accuracy when more than one sample is keyed by the same length. This PR therefore adds preprocess_packed_supervised_dataset_fullDataGroup, which uses 'noDegenerateGroups' to keep all samples while packing, so the complete dataset is trained.
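For context, here is a minimal sketch (with hypothetical sample lengths) of how length2indexes is built: when two samples share a length, both of their indexes are stored under the same key, which is the situation the PR description is concerned about.

```python
from collections import defaultdict

# Hypothetical tokenized sample lengths; note that 50 appears twice,
# so key 50 will hold two indexes.
lengths = [50, 30, 50, 20]

length2indexes = defaultdict(list)
for idx, length in enumerate(lengths):
    length2indexes[length].append(idx)

print(dict(length2indexes))  # {50: [0, 2], 30: [1], 20: [3]}
```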

@hiyouga
Owner

hiyouga commented Nov 21, 2024

I cannot see why length2indexes would cause sample dropping; the number of indexes in the knapsacks should be equal to the number of samples.

knapsacks = greedy_knapsack(lengths, data_args.cutoff_len - 1)  # reserved for the padding token
for knapsack in knapsacks:
    packed_input_ids, packed_attention_masks, packed_labels = [], [], []
    packed_images, packed_videos = [], []
    for i, length in enumerate(knapsack):
        index = length2indexes[length].pop()
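The maintainer's point can be checked with a small sketch. Assuming (hypothetically) that greedy_knapsack returns knapsacks whose lengths, taken together, form exactly the same multiset as the input lengths, repeated pop() calls across all knapsacks consume every index exactly once:

```python
from collections import defaultdict

# Hypothetical data and a hand-written packing standing in for
# greedy_knapsack: the lengths across all knapsacks match the inputs.
lengths = [50, 30, 50, 20]
knapsacks = [[50, 30], [50, 20]]  # assumed capacity ~80

length2indexes = defaultdict(list)
for idx, length in enumerate(lengths):
    length2indexes[length].append(idx)

popped = []
for knapsack in knapsacks:
    for length in knapsack:
        popped.append(length2indexes[length].pop())

# Every index is consumed exactly once; nothing is dropped.
assert sorted(popped) == [0, 1, 2, 3]
assert all(len(v) == 0 for v in length2indexes.values())
```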

@2877992943
Author

It seems that .pop() yields only one sample, and the rest of length2indexes[length] is lost.

@2877992943
Author

Besides, when switching to preprocess_packed_supervised_dataset_fullDataGroup, the number of training steps increases, which suggests that preprocess_packed_supervised_dataset loses data when packing.

@hiyouga
Owner

hiyouga commented Nov 22, 2024

There are no remaining samples in the length2indexes dict, so all the input examples are joined into the result list.

def preprocess_packed_supervised_dataset(
    examples: Dict[str, List[Any]],
    template: "Template",
    tokenizer: "PreTrainedTokenizer",
    processor: Optional["ProcessorMixin"],
    data_args: "DataArguments",
) -> Dict[str, List[Any]]:
    ...
    model_inputs = defaultdict(list)
    knapsacks = greedy_knapsack(lengths, data_args.cutoff_len - 1)  # reserved for the padding token
    for knapsack in knapsacks:
        packed_input_ids, packed_attention_masks, packed_labels = [], [], []
        packed_images, packed_videos = [], []
        for i, length in enumerate(knapsack):
            index = length2indexes[length].pop()
            packed_input_ids += batch_input_ids[index]
            packed_labels += batch_labels[index]
            packed_images += batch_images[index]
            packed_videos += batch_videos[index]
            if data_args.neat_packing:
                packed_attention_masks += [i + 1] * len(batch_input_ids[index])  # start from 1
            else:
                packed_attention_masks += [1] * len(batch_input_ids[index])

        if len(packed_input_ids) < data_args.cutoff_len:
            pad_length = data_args.cutoff_len - len(packed_input_ids)
            packed_input_ids += [tokenizer.pad_token_id] * pad_length
            packed_labels += [IGNORE_INDEX] * pad_length
            if data_args.neat_packing:
                packed_attention_masks += [0] * pad_length
            else:
                packed_attention_masks += [1] * pad_length  # more efficient flash_attn

        if len(packed_input_ids) != data_args.cutoff_len:
            raise ValueError("The length of packed example should be identical to the cutoff length.")

        model_inputs["input_ids"].append(packed_input_ids)
        model_inputs["attention_mask"].append(packed_attention_masks)
        model_inputs["labels"].append(packed_labels)
        model_inputs["images"].append(packed_images or None)
        model_inputs["videos"].append(packed_videos or None)

    print("length2indexes:", length2indexes)

    return model_inputs
Running tokenizer on dataset (num_proc=16):   0%|                                                      | 0/1091 [00:00<?, ? examples/s]
length2indexes: defaultdict(<class 'list'>, {56: [], 57: [], 55: [], 51: [], 50: [], 47: [], 54: [], 46: [], 64: [], 59: [], 70: [], 72: [], 66: [], 73: [], 69: [], 62: [], 67: [], 60: [], 101: [], 52: [], 63: [], 107: [], 49: [], 71: [], 68: [], 65: [], 78: []})
Running tokenizer on dataset (num_proc=16):   6%|██▊                                         | 69/1091 [00:00<00:09, 108.80 examples/s]
length2indexes: defaultdict(<class 'list'>, {70: [], 69: [], 63: [], 72: [], 61: [], 71: [], 66: [], 78: [], 68: [], 73: [], 74: [], 67: [], 426: [], 50: [], 361: [], 79: [], 128: [], 151: [], 381: [], 64: [], 296: [], 286: [], 499: [], 125: [], 354: [], 200: [], 367: [], 166: [], 403: [], 130: [], 174: [], 139: [], 95: [], 328: [], 347: [], 145: [], 306: [], 382: [], 48: [], 263: [], 138: [], 236: [], 398: [], 326: [], 170: [], 262: [], 135: [], 167: [], 185: [], 273: [], 288: []})
Running tokenizer on dataset (num_proc=16):  38%|████████████████▏                          | 411/1091 [00:01<00:01, 500.59 examples/s]
length2indexes: defaultdict(<class 'list'>, {232: [], 218: [], 189: [], 195: [], 323: [], 68: [], 204: [], 491: [], 77: [], 60: [], 177: [], 444: [], 339: [], 209: [], 52: [], 102: [], 56: [], 79: [], 251: [], 278: [], 139: [], 157: [], 46: [], 66: [], 548: [], 300: [], 203: [], 65: [], 67: [], 224: [], 62: [], 383: [], 54: [], 135: [], 80: [], 295: [], 63: [], 105: [], 334: [], 294: [], 342: [], 527: [], 170: [], 61: [], 164: [], 247: [], 58: [], 283: [], 78: [], 55: [], 215: [], 71: [], 75: [], 72: [], 336: [], 121: [], 51: []})
Running tokenizer on dataset (num_proc=16):  56%|████████████████████████▏                  | 615/1091 [00:01<00:00, 586.28 examples/s]
length2indexes: defaultdict(<class 'list'>, {383: [], 62: [], 139: [], 70: [], 78: [], 683: [], 149: [], 346: [], 172: [], 143: [], 58: [], 77: [], 65: [], 76: [], 123: [], 122: [], 119: [], 306: [], 373: [], 55: [], 193: [], 53: [], 204: [], 129: [], 253: [], 212: [], 284: [], 54: [], 283: [], 344: [], 167: [], 574: [], 132: [], 136: [], 64: [], 128: [], 224: [], 92: [], 157: [], 163: [], 104: [], 137: [], 205: [], 56: [], 90: [], 61: [], 80: [], 264: [], 411: [], 66: [], 416: [], 210: [], 84: [], 445: [], 215: [], 289: [], 183: [], 79: [], 400: []})
Running tokenizer on dataset (num_proc=16):  94%|███████████████████████████████████████▍  | 1023/1091 [00:02<00:00, 669.34 examples/s]
length2indexes: defaultdict(<class 'list'>, {88: [], 155: [], 211: [], 70: [], 335: [], 156: [], 129: [], 145: [], 378: [], 61: [], 57: [], 237: [], 73: [], 395: [], 119: [], 136: [], 82: [], 354: [], 169: [], 247: [], 430: [], 53: [], 163: [], 289: [], 209: [], 353: [], 309: [], 245: [], 69: [], 175: [], 121: [], 176: [], 432: [], 68: [], 77: [], 338: [], 407: [], 193: [], 134: [], 317: [], 186: [], 191: [], 234: [], 124: [], 72: [], 304: [], 92: [], 249: [], 146: [], 60: [], 187: [], 267: [], 375: [], 302: [], 91: [], 80: [], 139: [], 140: [], 62: [], 103: [], 123: [], 450: [], 83: [], 58: []})
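As an aside, the neat_packing branch in the function above distinguishes the packed samples by giving each one its own group id in the attention mask, with 0 for padding. A minimal sketch of that mask construction, with hypothetical sample lengths:

```python
# Sketch of the neat-packing attention mask: each sample in a packed
# sequence gets a distinct group id (starting from 1), padding gets 0.
sample_lengths = [3, 2]  # hypothetical token counts of two packed samples
cutoff_len = 7

packed_attention_masks = []
for i, n in enumerate(sample_lengths):
    packed_attention_masks += [i + 1] * n  # start from 1

pad_length = cutoff_len - len(packed_attention_masks)
packed_attention_masks += [0] * pad_length

print(packed_attention_masks)  # [1, 1, 1, 2, 2, 0, 0]
```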

@2877992943
Author

Yes, there are remaining samples; the example below shows index [13] being left over.

for knapsack in knapsacks:
    packed_input_ids, packed_attention_masks, packed_labels = [], [], []
    packed_images, packed_videos = [], []
    packed_posid = []  # 1 group
    for i, length in enumerate(knapsack):
        index = length2indexes[length].pop()
        if len(length2indexes[length]) > 0:
            print('leftover...', length2indexes[length])

[INFO|2024-11-22 18:38:44] llamafactory.data.loader:157 >> Loading dataset glaive_toolcall_en_demo.json...
Converting format of dataset: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [00:00<00:00, 4756.81 examples/s]
Running tokenizer on dataset: 0%| | 0/300 [00:00<?, ? examples/s]leftover... [13]

@hiyouga
Owner

hiyouga commented Nov 22, 2024

You must print length2indexes after the knapsacks loop has ended; an index that looks left over inside the loop is still popped by a later knapsack iteration.
