During training, train_loss suddenly increases while val_dice drops to 0 #1623
Unanswered
overwhelmedyy asked this question in Q&A
Replies: 1 comment 1 reply
Hi @yy4551,
I was using the Jupyter notebook tutorial code "spleen_segmentation_3d_lightning.ipynb" to work on a pancreas dataset provided by my tutor. The changes I made were: learning rate from 1e-4 to 8e-4, max_epochs from 600 to 100, and CacheDataset to PersistentDataset. Other than these, I believe the training process is unchanged.
Below are the code pieces, starting from setting up the PersistentDataset:
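For context, a minimal sketch of the CacheDataset-to-PersistentDataset swap described above, assuming MONAI's `monai.data` API; `train_files`, `train_transforms`, and the `cache_dir` path are placeholders, not the poster's actual values:

```python
from monai.data import DataLoader, PersistentDataset

# PersistentDataset caches the deterministically pre-processed samples on
# disk instead of in RAM, so the first epoch is slower but memory use stays
# low and the cache survives across runs.
train_ds = PersistentDataset(
    data=train_files,               # list of {"image": ..., "label": ...} dicts (placeholder)
    transform=train_transforms,     # the tutorial's transform chain (placeholder)
    cache_dir="./persistent_cache", # assumed cache location
)
train_loader = DataLoader(train_ds, batch_size=2, shuffle=True, num_workers=4)
```

Note that only the deterministic transforms ahead of the first random transform are cached, so this swap by itself should not change the training dynamics.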
100 epochs is very few and the learning rate is quite large; basically my intention was just to check whether the code is functioning.
However, during training, just when the Dice metric had pretty much hit its best score, the train_loss suddenly started oscillating and then went way up:
and the val_metric dropped to zero:
and the val_loss looked like this:
This is reproducible and I've seen it a few times; it seems the larger the learning rate, the earlier the loss increase (and Dice drop) happens. But for the first 50 epochs it works pretty well for me. So confusing...
I'm really a beginner and didn't find a similar problem raised by others. Could anyone give me an idea or suggestion? I would be more than grateful.
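The "larger learning rate, earlier divergence" pattern is characteristic of a step size that exceeds the stability threshold of the loss surface. A toy illustration with plain gradient descent on f(w) = w², where the update multiplies w by (1 - 2·lr), so any lr with |1 - 2·lr| > 1 makes the loss oscillate and blow up (a generic sketch, not the poster's training code):

```python
# Gradient descent on f(w) = w^2 (gradient 2w): a minimal demo of how
# too large a step size makes the loss oscillate in sign and then grow
# without bound, while a small step size converges.
def descend(lr, steps=20, w=1.0):
    losses = []
    for _ in range(steps):
        w -= lr * 2 * w       # gradient step: w <- w * (1 - 2*lr)
        losses.append(w * w)  # loss after the step
    return losses

small = descend(0.1)  # |1 - 0.2| = 0.8 < 1: losses shrink toward 0
large = descend(1.1)  # |1 - 2.2| = 1.2 > 1: losses grow every step
print(small[-1] < small[0], large[-1] > large[0])  # → True True
```

On a real network the threshold also depends on the local curvature, which changes as training progresses, so a run can look healthy for 50 epochs and then diverge once it reaches a sharper region.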
So many thanks!