The errors on multiprocessing may be related to the dataset "syntext1_96voc" #48

Open
YuMJie opened this issue Oct 11, 2023 · 2 comments
YuMJie commented Oct 11, 2023

When I run `python tools/train_net.py --config-file configs/R_50/CTW1500/finetune_96voc_50maxlen.yaml --num-gpus 4`, some errors occur. However, it runs correctly if I either set `--num-gpus 1` or change the `DATASETS` entry in `configs/R_50/CTW1500/pretrain_96voc_50maxlen.yaml` to

```yaml
DATASETS:
  TRAIN: ("ic13_train_96voc","totaltext_train_96voc")
  TEST: ("ctw1500_test",)
```

The error occurs when I set `TRAIN: ("syntext1_96voc","ic13_train_96voc","totaltext_train_96voc")`:

```
[10/11 08:19:10 adet.data.dataset_mapper]: Cropping used in training: RandomCropWithInstance(crop_type='relative_range', crop_size=[0.1, 0.1], crop_instance=False)
[10/11 08:19:11 adet.data.datasets.text]: Loaded 229 images in COCO format from /dataset/ic13/train_96voc.json
[10/11 08:19:46 adet.data.datasets.text]: Loading /dataset/syntext1/annotations/train_96voc.json takes 35.33 seconds.
[10/11 08:19:47 adet.data.datasets.text]: Loaded 94723 images in COCO format from /dataset/syntext1/annotations/train_96voc.json
[10/11 08:24:02 d2.data.build]: Removed 0 images with no usable annotations. 94950 images left.
[10/11 08:24:02 d2.data.build]: Using training sampler TrainingSampler
[10/11 08:24:03 d2.data.common]: Serializing 94950 elements to byte tensors and concatenating them all ...
Traceback (most recent call last):
  File "train_net.py", line 304, in <module>
    launch(
  File "/usr/local/lib/python3.8/dist-packages/detectron2/engine/launch.py", line 67, in launch
    mp.spawn(
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 130, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL
```
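A `SIGKILL` arriving right after "Serializing 94950 elements to byte tensors" is commonly the Linux OOM killer: detectron2 pickles every dataset record into one large byte buffer, and with `mp.spawn` each GPU process can end up holding its own copy. A minimal sketch of how to estimate that footprint before launching (the `estimate_serialized_size` helper and the toy records are hypothetical, not part of detectron2; the real serialization lives in `d2.data.common`):

```python
import pickle

def estimate_serialized_size(dataset_dicts):
    """Sum the pickled size of each record, approximating how detectron2
    serializes a dataset list into one byte buffer.  With --num-gpus 4
    and the spawn start method, budget for several copies of this."""
    return sum(len(pickle.dumps(d, protocol=pickle.HIGHEST_PROTOCOL))
               for d in dataset_dicts)

# Toy stand-ins for COCO-format text records (hypothetical fields).
toy = [{"file_name": f"img_{i}.jpg",
        "annotations": [{"bbox": [0.0, 0.0, 10.0, 10.0],
                         "rec": [96] * 25}] * 30}
       for i in range(2000)]
mb = estimate_serialized_size(toy) / 1e6
print(f"~{mb:.1f} MB for {len(toy)} toy records")
```

If the estimate for the real 94,950-image annotation list (times the number of processes) approaches the machine's free RAM, the silent `SIGKILL` is explained; `dmesg` output would confirm an OOM kill.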

The directory tree of my dataset is as follows:

.
├── ArT
│   ├── art_train.json
│   └── rename_artimg_train
├── CTW1500
│   ├── test.json
│   ├── test_images
│   ├── train_96voc.json
│   ├── train_images
│   ├── weak_voc_new.txt
│   └── weak_voc_pair_list.txt
├── ChnSyntext
│   ├── chn_syntext.json
│   └── syn_130k_images
├── LSVT
│   ├── annotations
│   ├── lsvt_train.json
│   └── rename_lsvtimg_train
├── ReCTS
│   ├── ReCTS_test_images
│   ├── ReCTS_train_images
│   ├── ReCTS_val_images
│   ├── rects_test.json
│   ├── rects_train.json
│   └── rects_val.json
├── evaluation
│   ├── gt_ctw1500.zip
│   ├── gt_icdar2015.zip
│   ├── gt_inversetext.zip
│   └── gt_totaltext.zip
├── ic13
│   ├── train_37voc.json
│   ├── train_96voc.json
│   └── train_images
├── ic15
│   ├── GenericVocabulary.txt
│   ├── GenericVocabulary_new.txt
│   ├── GenericVocabulary_pair_list.txt
│   ├── ch4_test_vocabulary.txt
│   ├── ch4_test_vocabulary_new.txt
│   ├── ch4_test_vocabulary_pair_list.txt
│   ├── ic15_test.json
│   ├── ic15_train.json
│   ├── new_strong_lexicon
│   ├── strong_lexicon
│   ├── test.json
│   ├── test_images
│   ├── train_37voc.json
│   ├── train_96voc.json
│   └── train_images
├── inversetext
│   ├── inversetext_lexicon.txt
│   ├── inversetext_pair_list.txt
│   ├── test.json
│   └── test_images
├── mlt2017
│   ├── train_37voc.json
│   ├── train_96voc.json
│   └── train_images
├── syntext1
│   ├── annotations
│   ├── train.json
│   └── train_images
├── syntext2
│   ├── annotations
│   ├── train.json
│   ├── train_37voc.json
│   ├── train_96voc.json
│   └── train_images
├── textocr
│   ├── train_37voc_1.json
│   ├── train_37voc_2.json
│   └── train_images
└── totaltext
    ├── test.json
    ├── test_images
    ├── train.json
    ├── train_37voc.json
    ├── train_96voc.json
    ├── train_images
    ├── weak_voc_new.txt
    └── weak_voc_pair_list.txt
ymy-k (Collaborator) commented Oct 12, 2023

Try to reduce the number of workers? Maybe it's a memory issue?
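If memory pressure is indeed the cause, the worker count can be lowered either in the config file or as a command-line override. A sketch assuming detectron2's standard `DATALOADER.NUM_WORKERS` key (the value `2` is an arbitrary example):

```yaml
DATALOADER:
  NUM_WORKERS: 2
```

The same override works inline, e.g. `python tools/train_net.py --config-file <config> --num-gpus 4 DATALOADER.NUM_WORKERS 2`. Note this only shrinks dataloader-worker overhead; it does not reduce the serialized dataset held by each spawned GPU process.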

YuMJie (Author) commented Oct 12, 2023

> Try to reduce the number of workers? Maybe it's a memory issue?

I tried reducing the number of workers, but the error still occurs.
Interestingly, it works if I set

```yaml
DATASETS:
  TRAIN: ("totaltext_train_96voc")
  TEST: ("ctw1500_test",)
```

and I can correctly run

```
python tools/train_net.py --config-file configs/R_50/pretrain/150k_tt_mlt_13_15.yaml --num-gpus 4
python tools/train_net.py --config-file configs/R_50/TotalText/finetune_150k_tt_mlt_13_15.yaml --num-gpus 4
python tools/train_net.py --config-file configs/R_50/IC15/finetune_150k_tt_mlt_13_15.yaml --num-gpus 4
```

but it cannot run with the datasets `syntext1_96voc` and `syntext2_96voc`.
Thanks for your reply!
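One side note on the config quoted above: in Python, `("totaltext_train_96voc")` is a plain parenthesized string, not a one-element tuple; the trailing comma is what makes the tuple, as in `TEST: ("ctw1500_test",)`. A quick demonstration:

```python
# Parentheses alone don't create a tuple; the trailing comma does.
a = ("totaltext_train_96voc")   # just a parenthesized string
b = ("totaltext_96voc" and "totaltext_train_96voc",)  # one-element tuple

a = ("totaltext_train_96voc")   # str
b = ("totaltext_train_96voc",)  # tuple

print(type(a).__name__)  # -> str
print(type(b).__name__)  # -> tuple
```

Whether detectron2's config parser tolerates the string form or not, writing the trailing comma keeps `TRAIN` consistent with the `TEST` entry.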
