
training loss not converge #13

Open
ilyaskhanbey opened this issue Jun 26, 2021 · 5 comments
Comments

@ilyaskhanbey

Hello, I tried your training code with the Adobe, Distinctions-646, and RealWorld datasets. I kept only solid objects and removed all transparent objects like glasses.

The composition and Laplacian losses do not seem to converge after the warmup steps (itr > 5000):
[9770/500000], REC: 0.0121, COMP: 0.2987, LAP: 0.1311, lr: 0.001000

I removed the test step while training; I think it has no impact on training. Am I wrong?

Should I wait for more iterations, or should I adjust some hyperparameters when training with more data than you used?

@yucornetto
Owner

Hi, could you give more info about your problem? E.g., how do you put all datasets together?

The test step should have no impact on training. Besides, it seems to me your loss looks normal?

@ilyaskhanbey
Author

Hello, I think I found the issue and why the loss does not converge.
I think it is the composition loss when real-world augmentation is enabled (adding noise to the input image).
Currently, the training does something like this:

```
alpha_pred = mg_net(image_noise)
loss_comp = image_noise - (alpha_pred * fg + (1 - alpha_pred) * bg)
```

I think it should be:

```
alpha_pred = mg_net(image_noise)
loss_comp = image_with_no_noise - (alpha_pred * fg + (1 - alpha_pred) * bg)
```

In the first configuration the loss is more impacted by the noise from the augmentation.
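To make it concrete, here is a minimal sketch of the two variants (an illustration with my own names such as composition_loss and target_image, not the repo's actual code):

```python
import torch

def composition_loss(alpha_pred, fg, bg, target_image):
    """L1 composition loss: re-composite with the predicted alpha and compare
    against a target image. The choice of target is the whole point: comparing
    against the clean image avoids penalizing noise the network cannot reproduce."""
    comp = alpha_pred * fg + (1.0 - alpha_pred) * bg
    return torch.abs(comp - target_image).mean()

# current behaviour: composite compared against the noisy input
# loss_comp = composition_loss(alpha_pred, fg, bg, image_noise)
# proposed fix: composite compared against the clean image
# loss_comp = composition_loss(alpha_pred, fg, bg, image_with_no_noise)
```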

To test this, I trained on only one image (it should converge in a few iterations). After the fix, here are the losses:

Image tensor shape: torch.Size([2, 3, 512, 512]). Trimap tensor shape: torch.Size([2, 3, 512, 512])
[66/500000], REC: 0.0203, COMP: 0.0745, LAP: 0.1510, lr: 0.000100

Without the fix, the comp and lap losses always hover between 0.15 and 0.7:

Image tensor shape: torch.Size([2, 3, 512, 512]). Trimap tensor shape: torch.Size([2, 3, 512, 512])
[69/500000], REC: 0.0241, COMP: 0.4077, LAP: 0.2623, lr: 0.000100

I think that when the composition loss does not converge, it impacts the Laplacian loss too.

@yucornetto
Owner

Thanks for the explanation! The composition loss will be affected if real-world noise is introduced; as you can see in https://github.com/yucornetto/MGMatting/blob/main/code-base/config/MGMatting-RWP-100k.toml, we disable the composition loss when real-world-aug is enabled.

As for the lap loss, I am somewhat confused, since fg/image should not be involved in the lap loss computation, see https://github.com/yucornetto/MGMatting/blob/main/code-base/trainer.py#L224
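For reference, a Laplacian pyramid loss only compares the predicted and ground-truth alphas across scales, roughly like the simplified sketch below (my own illustration, not the exact code in trainer.py):

```python
import torch
import torch.nn.functional as F

def gauss_kernel(channels, device):
    # depthwise 5x5 Gaussian kernel (one copy per channel)
    k = torch.tensor([1., 4., 6., 4., 1.], device=device)
    k = k[:, None] * k[None, :]
    k = k / k.sum()
    return k.view(1, 1, 5, 5).repeat(channels, 1, 1, 1)

def laplacian_pyramid(x, levels=5):
    kernel = gauss_kernel(x.shape[1], x.device)
    pyr, current = [], x
    for _ in range(levels):
        blurred = F.conv2d(F.pad(current, (2, 2, 2, 2), mode='reflect'),
                           kernel, groups=current.shape[1])
        down = blurred[:, :, ::2, ::2]
        up = F.interpolate(down, size=current.shape[-2:], mode='bilinear',
                           align_corners=False)
        pyr.append(current - up)   # band-pass detail at this scale
        current = down
    return pyr

def lap_loss(alpha_pred, alpha_gt, levels=5):
    # only the alphas are involved -- no fg / bg / image
    pyr_pred = laplacian_pyramid(alpha_pred, levels)
    pyr_gt = laplacian_pyramid(alpha_gt, levels)
    # the per-level weights (2 ** i) are one common choice, not necessarily the repo's
    return sum((2 ** i) * F.l1_loss(p, g)
               for i, (p, g) in enumerate(zip(pyr_pred, pyr_gt)))
```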

I am not sure about your point that the comp loss not converging also affects the lap loss. But my suggestion would be to simply set comp_loss_weight to 0 when real-world-aug is enabled.
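Concretely, the idea is something like the helper below when assembling the total loss (illustrative only, with made-up names such as weighted_total_loss; the actual weights come from the TOML config):

```python
def weighted_total_loss(loss_rec, loss_comp, loss_lap, real_world_aug,
                        rec_w=1.0, comp_w=1.0, lap_w=1.0):
    # illustrative sketch: in the repo the weights come from the TOML config
    if real_world_aug:
        comp_w = 0.0   # the composition target is unreliable under noise augmentation
    return rec_w * loss_rec + comp_w * loss_comp + lap_w * loss_lap
```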

@ilyaskhanbey
Author

Thank you for your answer, I didn't notice that you do not use the comp loss when real-world-aug is enabled.
I tested your pretrained RWP model and it seems to work much better on real data than the DIM pretrained model.
I will try to retrain the RWP model with the comp loss enabled, plus the comp loss fix I made, and I will let you know if performance is better.

Last question: is it a good idea to use a batch size of 40, as mentioned in your article? Marco Forte, in his FBA paper, said the batch size should be between 6 and 16 (with BN) for alpha prediction.

Thank you for your great work

@yucornetto
Owner

yucornetto commented Jun 27, 2021

Thanks for offering to run the experiments!

Regarding the FBA matting paper, I think they use batch size = 1 + WS + GN? I am not sure about the 6 to 16 part. My personal experience is that if BN is used, a relatively large batch size usually leads to better performance, but I did not run experiments to verify this on the matting task.
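For context, the "WS + GN" combination replaces BatchNorm with GroupNorm plus weight-standardized convolutions, which is what makes very small batch sizes (even 1) workable, since GroupNorm statistics do not depend on the batch. A rough sketch of such a block (my own illustration, not FBA's or this repo's code):

```python
import torch.nn as nn
import torch.nn.functional as F

class WSConv2d(nn.Conv2d):
    """Conv2d with weight standardization: each output filter is normalized
    to zero mean / unit std before the convolution."""
    def forward(self, x):
        w = self.weight
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        std = w.std(dim=(1, 2, 3), keepdim=True) + 1e-5
        return F.conv2d(x, (w - mean) / std, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

def conv_gn_relu(in_ch, out_ch, groups=32):
    # GroupNorm statistics are computed per sample, unlike BatchNorm,
    # so this block behaves the same at batch size 1 or 40
    return nn.Sequential(
        WSConv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.GroupNorm(groups, out_ch),
        nn.ReLU(inplace=True),
    )
```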
