
bugfix #6103

Open
wants to merge 1 commit into main

Conversation

2877992943

What does this PR do?

Fixes # (neat packing data-loss bug)

In the method preprocess_packed_supervised_dataset, this line

index = length2indexes[length].pop()

loses data and leads to degraded accuracy when more than one sample is keyed by the same length. This PR therefore adds preprocess_packed_supervised_dataset_fullDataGroup, which uses 'noDegenerateGroups' to keep all samples while packing, so the complete dataset is trained.
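For context, here is a minimal sketch (with hypothetical sample lengths) of how length2indexes is built: when two samples share a length, both of their indexes are stored under the same key, which is the situation the PR description is concerned about.

```python
from collections import defaultdict

# Hypothetical tokenized sample lengths; note that 50 appears twice,
# so key 50 will hold two indexes.
lengths = [50, 30, 50, 20]

length2indexes = defaultdict(list)
for idx, length in enumerate(lengths):
    length2indexes[length].append(idx)

print(dict(length2indexes))  # {50: [0, 2], 30: [1], 20: [3]}
```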

@hiyouga
Owner

hiyouga commented Nov 21, 2024

I cannot see why length2indexes would cause sample dropping; the number of indexes in the knapsacks should be equal to the number of samples.

knapsacks = greedy_knapsack(lengths, data_args.cutoff_len - 1)  # reserved for the padding token
for knapsack in knapsacks:
    packed_input_ids, packed_attention_masks, packed_labels = [], [], []
    packed_images, packed_videos = [], []
    for i, length in enumerate(knapsack):
        index = length2indexes[length].pop()
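The maintainer's point can be checked with a small sketch. Assuming (hypothetically) that greedy_knapsack returns knapsacks whose lengths, taken together, form exactly the same multiset as the input lengths, repeated pop() calls across all knapsacks consume every index exactly once:

```python
from collections import defaultdict

# Hypothetical data and a hand-written packing standing in for
# greedy_knapsack: the lengths across all knapsacks match the inputs.
lengths = [50, 30, 50, 20]
knapsacks = [[50, 30], [50, 20]]  # assumed capacity ~80

length2indexes = defaultdict(list)
for idx, length in enumerate(lengths):
    length2indexes[length].append(idx)

popped = []
for knapsack in knapsacks:
    for length in knapsack:
        popped.append(length2indexes[length].pop())

# Every index is consumed exactly once; nothing is dropped.
assert sorted(popped) == [0, 1, 2, 3]
assert all(len(v) == 0 for v in length2indexes.values())
```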

@2877992943
Author

It seems that .pop() yields only one sample, and the rest of length2indexes[length] is lost.

@2877992943
Author

Besides, when switching to preprocess_packed_supervised_dataset_fullDataGroup, the number of training steps increases, which suggests that preprocess_packed_supervised_dataset loses data when packing.

@hiyouga
Owner

hiyouga commented Nov 22, 2024

There are no remaining samples in the length2indexes dict, so all the input examples are joined into the result list.

def preprocess_packed_supervised_dataset(
    examples: Dict[str, List[Any]],
    template: "Template",
    tokenizer: "PreTrainedTokenizer",
    processor: Optional["ProcessorMixin"],
    data_args: "DataArguments",
) -> Dict[str, List[Any]]:
    ...
    model_inputs = defaultdict(list)
    knapsacks = greedy_knapsack(lengths, data_args.cutoff_len - 1)  # reserved for the padding token
    for knapsack in knapsacks:
        packed_input_ids, packed_attention_masks, packed_labels = [], [], []
        packed_images, packed_videos = [], []
        for i, length in enumerate(knapsack):
            index = length2indexes[length].pop()
            packed_input_ids += batch_input_ids[index]
            packed_labels += batch_labels[index]
            packed_images += batch_images[index]
            packed_videos += batch_videos[index]
            if data_args.neat_packing:
                packed_attention_masks += [i + 1] * len(batch_input_ids[index])  # start from 1
            else:
                packed_attention_masks += [1] * len(batch_input_ids[index])

        if len(packed_input_ids) < data_args.cutoff_len:
            pad_length = data_args.cutoff_len - len(packed_input_ids)
            packed_input_ids += [tokenizer.pad_token_id] * pad_length
            packed_labels += [IGNORE_INDEX] * pad_length
            if data_args.neat_packing:
                packed_attention_masks += [0] * pad_length
            else:
                packed_attention_masks += [1] * pad_length  # more efficient flash_attn

        if len(packed_input_ids) != data_args.cutoff_len:
            raise ValueError("The length of packed example should be identical to the cutoff length.")

        model_inputs["input_ids"].append(packed_input_ids)
        model_inputs["attention_mask"].append(packed_attention_masks)
        model_inputs["labels"].append(packed_labels)
        model_inputs["images"].append(packed_images or None)
        model_inputs["videos"].append(packed_videos or None)

    print("length2indexes:", length2indexes)

    return model_inputs
Running tokenizer on dataset (num_proc=16):   0%|                                                      | 0/1091 [00:00<?, ? examples/s]
length2indexes: defaultdict(<class 'list'>, {56: [], 57: [], 55: [], 51: [], 50: [], 47: [], 54: [], 46: [], 64: [], 59: [], 70: [], 72: [], 66: [], 73: [], 69: [], 62: [], 67: [], 60: [], 101: [], 52: [], 63: [], 107: [], 49: [], 71: [], 68: [], 65: [], 78: []})
Running tokenizer on dataset (num_proc=16):   6%|██▊                                         | 69/1091 [00:00<00:09, 108.80 examples/s]
length2indexes: defaultdict(<class 'list'>, {70: [], 69: [], 63: [], 72: [], 61: [], 71: [], 66: [], 78: [], 68: [], 73: [], 74: [], 67: [], 426: [], 50: [], 361: [], 79: [], 128: [], 151: [], 381: [], 64: [], 296: [], 286: [], 499: [], 125: [], 354: [], 200: [], 367: [], 166: [], 403: [], 130: [], 174: [], 139: [], 95: [], 328: [], 347: [], 145: [], 306: [], 382: [], 48: [], 263: [], 138: [], 236: [], 398: [], 326: [], 170: [], 262: [], 135: [], 167: [], 185: [], 273: [], 288: []})
Running tokenizer on dataset (num_proc=16):  38%|████████████████▏                          | 411/1091 [00:01<00:01, 500.59 examples/s]
length2indexes: defaultdict(<class 'list'>, {232: [], 218: [], 189: [], 195: [], 323: [], 68: [], 204: [], 491: [], 77: [], 60: [], 177: [], 444: [], 339: [], 209: [], 52: [], 102: [], 56: [], 79: [], 251: [], 278: [], 139: [], 157: [], 46: [], 66: [], 548: [], 300: [], 203: [], 65: [], 67: [], 224: [], 62: [], 383: [], 54: [], 135: [], 80: [], 295: [], 63: [], 105: [], 334: [], 294: [], 342: [], 527: [], 170: [], 61: [], 164: [], 247: [], 58: [], 283: [], 78: [], 55: [], 215: [], 71: [], 75: [], 72: [], 336: [], 121: [], 51: []})
Running tokenizer on dataset (num_proc=16):  56%|████████████████████████▏                  | 615/1091 [00:01<00:00, 586.28 examples/s]
length2indexes: defaultdict(<class 'list'>, {383: [], 62: [], 139: [], 70: [], 78: [], 683: [], 149: [], 346: [], 172: [], 143: [], 58: [], 77: [], 65: [], 76: [], 123: [], 122: [], 119: [], 306: [], 373: [], 55: [], 193: [], 53: [], 204: [], 129: [], 253: [], 212: [], 284: [], 54: [], 283: [], 344: [], 167: [], 574: [], 132: [], 136: [], 64: [], 128: [], 224: [], 92: [], 157: [], 163: [], 104: [], 137: [], 205: [], 56: [], 90: [], 61: [], 80: [], 264: [], 411: [], 66: [], 416: [], 210: [], 84: [], 445: [], 215: [], 289: [], 183: [], 79: [], 400: []})
Running tokenizer on dataset (num_proc=16):  94%|███████████████████████████████████████▍  | 1023/1091 [00:02<00:00, 669.34 examples/s]
length2indexes: defaultdict(<class 'list'>, {88: [], 155: [], 211: [], 70: [], 335: [], 156: [], 129: [], 145: [], 378: [], 61: [], 57: [], 237: [], 73: [], 395: [], 119: [], 136: [], 82: [], 354: [], 169: [], 247: [], 430: [], 53: [], 163: [], 289: [], 209: [], 353: [], 309: [], 245: [], 69: [], 175: [], 121: [], 176: [], 432: [], 68: [], 77: [], 338: [], 407: [], 193: [], 134: [], 317: [], 186: [], 191: [], 234: [], 124: [], 72: [], 304: [], 92: [], 249: [], 146: [], 60: [], 187: [], 267: [], 375: [], 302: [], 91: [], 80: [], 139: [], 140: [], 62: [], 103: [], 123: [], 450: [], 83: [], 58: []})
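As an aside, the neat_packing branch in the function above distinguishes the packed samples by giving each one its own group id in the attention mask, with 0 for padding. A minimal sketch of that mask construction, with hypothetical sample lengths:

```python
# Sketch of the neat-packing attention mask: each sample in a packed
# sequence gets a distinct group id (starting from 1), padding gets 0.
sample_lengths = [3, 2]  # hypothetical token counts of two packed samples
cutoff_len = 7

packed_attention_masks = []
for i, n in enumerate(sample_lengths):
    packed_attention_masks += [i + 1] * n  # start from 1

pad_length = cutoff_len - len(packed_attention_masks)
packed_attention_masks += [0] * pad_length

print(packed_attention_masks)  # [1, 1, 1, 2, 2, 0, 0]
```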

@2877992943
Author

Yes, there are remaining samples; the example below shows index [13] being left over.

for knapsack in knapsacks:
    packed_input_ids, packed_attention_masks, packed_labels = [], [], []
    packed_images, packed_videos = [], []
    packed_posid = []  # 1 group
    for i, length in enumerate(knapsack):
        index = length2indexes[length].pop()
        if len(length2indexes[length]) > 0:
            print('leftover...', length2indexes[length])

[INFO|2024-11-22 18:38:44] llamafactory.data.loader:157 >> Loading dataset glaive_toolcall_en_demo.json...
Converting format of dataset: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [00:00<00:00, 4756.81 examples/s]
Running tokenizer on dataset: 0%| | 0/300 [00:00<?, ? examples/s]leftover... [13]

@hiyouga
Owner

hiyouga commented Nov 22, 2024

You must print length2indexes after the knapsacks loop has ended; an index that looks left over inside the loop is still popped by a later knapsack iteration.
