
Igor lig 4447 w mse benchmark #1474

Open
wants to merge 13 commits into
base: master

Conversation

IgorSusmelj
Contributor

Changes

  • Adds WMSE ImageNet benchmark
  • Adds missing projection head


codecov bot commented Jan 11, 2024

Codecov Report

Attention: 6 lines in your changes are missing coverage. Please review.

Comparison is base (2b215aa) 85.50% compared to head (7ad8205) 85.45%.
Report is 1 commit behind head on master.

File                              Patch %   Missing Lines
lightly/loss/wmse_loss.py         64.28%    5 ⚠️
lightly/models/modules/heads.py   66.66%    1 ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1474      +/-   ##
==========================================
- Coverage   85.50%   85.45%   -0.05%     
==========================================
  Files         135      135              
  Lines        5657     5672      +15     
==========================================
+ Hits         4837     4847      +10     
- Misses        820      825       +5     


@IgorSusmelj
Contributor Author

Seems to be training.

IgorSusmelj marked this pull request as ready for review on January 12, 2024 at 07:52

# we use a projection head with output dimension 64
# and w_size of 128 to support a batch size of 256
self.projection_head = WMSEProjectionHead(output_dim=64)
Contributor

I think the output dimension is wrong here. From the paper:

"Finally, we use an embedding size of 64 for CIFAR-10 and CIFAR-100, and an embedding size of 128 for STL-10 and Tiny ImageNet. For ImageNet-100 we use a configuration similar to the Tiny ImageNet experiments, and 240 epochs of training. Finally, in the ImageNet experiments (Tab. 3), we use the implementation and the hyperparameter configuration of (Chen et al., 2020b) (same number of layers in the projection head, etc.) based on their open-source implementation, the only difference being the learning rate and the loss function (respectively, 0.075 and the contrastive loss in (Chen et al., 2020b) vs. 0.1 and Eq. 6 in W-MSE 4)."

So they're using a SimCLR v2 projection head.
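
If that reading is right, the benchmark could simply reuse lightly's SimCLRProjectionHead. A minimal sketch, assuming SimCLR-style dimensions (2048 → 2048 → 128); these values are not confirmed anywhere in this PR:

from lightly.models.modules.heads import SimCLRProjectionHead

# Sketch only: the dimensions are assumptions based on the SimCLR recipe the
# paper refers to, not values taken from this PR.
projection_head = SimCLRProjectionHead(
    input_dim=2048,   # ResNet-50 backbone feature dimension
    hidden_dim=2048,
    output_dim=128,
)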

Contributor

And most likely the embedding dim is the same as the one for SimCLR v2.

self.projection_head = WMSEProjectionHead(output_dim=64)

self.criterion_WMSE4loss = WMSELoss(
    w_size=128, embedding_dim=64, num_samples=4, gather_distributed=True
)
Contributor

For ImageNet they probably use w_size=256:

"For CIFAR-10 and CIFAR-100, the slicing sub-batch size is 128, for Tiny ImageNet and STL-10, it is 256."
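
If that reading carries over to ImageNet, the criterion above would change roughly like this; w_size=256 follows the quote, while embedding_dim is only assumed to match the projection head output:

from lightly.loss.wmse_loss import WMSELoss

# Sketch only: w_size=256 is the sub-batch size quoted for Tiny ImageNet/STL-10;
# using it for ImageNet, and embedding_dim=128, are assumptions.
criterion = WMSELoss(
    w_size=256,
    embedding_dim=128,
    num_samples=4,            # W-MSE 4 variant
    gather_distributed=True,
)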

"weight_decay": 0.0,
},
],
lr=0.1 * math.sqrt(self.batch_size_per_device * self.trainer.world_size),
Contributor

The denominator is missing here.
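
The intended formula is not spelled out in the thread. A hedged sketch, assuming square-root scaling relative to a reference batch size of 256 (that reference value is an assumption, not something the reviewer states):

import math

batch_size_per_device = 256  # example values, not taken from the PR config
world_size = 1

# Sketch only: the missing denominator is assumed to be a reference batch
# size of 256, giving a sqrt-scaled learning rate.
lr = 0.1 * math.sqrt(batch_size_per_device * world_size / 256)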

@@ -59,10 +61,18 @@ def forward(self, x: torch.Tensor) -> torch.Tensor:

f_cov_shrinked = (1 - self.eps) * f_cov + self.eps * eye

# get the dtype of f_cov_shrinked and temporarily convert to full precision
# to support Cholesky decomposition
f_cov_shrinked_type = f_cov_shrinked.dtype
Contributor

This looks super duper hacky. Why is it necessary?

Contributor Author

Yes, as written in the comment. The original code is not using half precision.
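
For context, a rough sketch of the precision workaround being discussed, assuming torch.linalg.cholesky is the decomposition in question; the dummy tensors and the cast back are illustrative, not the exact PR code:

import torch

# Illustrative stand-ins for the shrunk covariance from the diff above.
eps = 0.01
f_cov = 2.0 * torch.eye(4, dtype=torch.float16)
f_cov_shrinked = (1 - eps) * f_cov + eps * torch.eye(4, dtype=torch.float16)

# Cholesky factorization does not support half precision, so temporarily
# upcast to float32, factorize, then cast the result back to the original dtype.
f_cov_shrinked_type = f_cov_shrinked.dtype
chol = torch.linalg.cholesky(f_cov_shrinked.to(torch.float32))
chol = chol.to(f_cov_shrinked_type)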

if self.gather_distributed and dist.is_initialized():
    world_size = dist.get_world_size()
    if world_size > 1:
        input = torch.cat(gather(input), dim=0)
Contributor

Are you sure this is correct? Intuitively I think there could be problems because now every device computes the exact same loss, right?

Contributor Author

I removed it, but I will add it again. That seems like the easiest and cleanest way to support multi-GPU training. I'll make sure we divide the loss by the number of devices to make runs comparable across different multi-GPU setups.
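
A hedged sketch of how that could look, assuming lightly's gather utility (the import path is an assumption) and that the whole loss is divided by the world size; this is not the final PR code:

from typing import Tuple

import torch
import torch.distributed as dist

# Assumed import path for the gather used in the diff above.
from lightly.utils.dist import gather


def distributed_wmse_input(input: torch.Tensor, gather_distributed: bool) -> Tuple[torch.Tensor, int]:
    # Hypothetical helper (not PR code): gather embeddings from all ranks so
    # every device whitens over the full global batch, and return the world
    # size so the caller can divide the loss by it.
    world_size = 1
    if gather_distributed and dist.is_initialized():
        world_size = dist.get_world_size()
        if world_size > 1:
            input = torch.cat(gather(input), dim=0)
    return input, world_size

# Usage inside the loss (sketch): divide the final loss by world_size so
# multi-GPU runs stay comparable to single-GPU runs.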

@@ -699,6 +699,27 @@ def __init__(
)


class WMSEProjectionHead(SimCLRProjectionHead):
Contributor

I don't think we need this. We should be able to use the SimCLR projection head instead.

Contributor Author

@guarin, we should make sure things are consistent. I'm not sure what we agreed on. AFAIK, the same goes for the transforms.

Contributor

Aren't the default values different?

In any case, I'd prefer it if all components of the WMSE model are called WMSESomething. Mixing components from different models is always confusing and makes the components harder to discover in the code. If two models have the same head, we can just subclass from the first model and update the docstring.

        num_layers: int = 2,
        batch_norm: bool = True,
    ):
        super(WMSEProjectionHead, self).__init__(
Contributor

Suggested change:
- super(WMSEProjectionHead, self).__init__(
+ super().__init__(

In general the class should not be passed to the super method.
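
Applied to the class from the diff, the suggestion would look roughly like this; the keyword names follow the diff, while the default dimensions and the assumption that SimCLRProjectionHead accepts num_layers and batch_norm are mine, not confirmed by the PR:

from lightly.models.modules.heads import SimCLRProjectionHead


class WMSEProjectionHead(SimCLRProjectionHead):
    """Projection head for W-MSE, reusing the SimCLR head architecture.

    Sketch of the suggested style only; default values are assumptions.
    """

    def __init__(
        self,
        input_dim: int = 2048,
        hidden_dim: int = 2048,
        output_dim: int = 64,
        num_layers: int = 2,
        batch_norm: bool = True,
    ):
        # Python 3 style: no need to pass the class and instance to super().
        super().__init__(
            input_dim=input_dim,
            hidden_dim=hidden_dim,
            output_dim=output_dim,
            num_layers=num_layers,
            batch_norm=batch_norm,
        )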
