Question on the Value of Training Loss for DiffuLoss with MAR and Causal Methods #20
Thanks for the great work!
Thanks for your interest. This repo should produce a loss curve similar to the one above (around 0.33), though different data can give slightly different loss values. This kind of quickly converging loss curve is commonly observed for diffusion loss (e.g., Figure 13 in DiT). However, after the initial rapid decrease, the loss keeps decreasing slowly, as shown in the figure above, and therefore keeps improving generation performance. The absolute loss value is affected by many factors: tokenizer, dataset, model capacity, noise scheduling, etc., so it can vary a lot. For example, DiT's loss is around 0.15, which corresponds to a pixel distance of around 0.4. This is because the denoising function is very hard to learn, especially when the noise level is high. Moreover, since we use a very large masking ratio (randomly sampled between 0.7 and 1.0) during MAR's training, our loss is even larger than DiT's.
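To make the loss numbers concrete, here is a minimal sketch (not the repo's exact implementation) of a masked noise-prediction loss with MAR's masking ratio sampled uniformly in [0.7, 1.0]; the `denoiser` callable and the toy noise schedule below are placeholders:

```python
import torch

def diffusion_loss(denoiser, x0, mask_ratio_range=(0.7, 1.0)):
    """Noise-prediction MSE on masked tokens only (illustrative sketch).

    An MSE of ~0.15 corresponds to an average per-element error of
    ~sqrt(0.15) ≈ 0.4, i.e. the "pixel distance" mentioned above.
    """
    B, L, D = x0.shape

    # MAR samples a large masking ratio uniformly in [0.7, 1.0], so most
    # tokens must be predicted from little context, which raises the loss.
    mask_ratio = torch.empty(B, 1).uniform_(*mask_ratio_range)
    mask = (torch.rand(B, L) < mask_ratio).float()        # 1 = masked

    # A toy noising step; the real noise schedule differs.
    t = torch.rand(B, 1, 1)                               # noise level
    noise = torch.randn_like(x0)
    x_t = (1 - t).sqrt() * x0 + t.sqrt() * noise
    pred = denoiser(x_t, t, mask)                         # predicts the noise

    per_token = ((pred - noise) ** 2).mean(-1)            # (B, L)
    return (per_token * mask).sum() / mask.sum()          # masked tokens only
```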
Thanks for the answer.
How about the loss of the causal method?
@Robootx here is the loss for the random-order causal method (with teacher-forcing language-modeling loss):
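For readers unfamiliar with the setup, here is a hedged sketch of what "random-order causal with teacher forcing" means in code; `causal_model` and `token_loss` are placeholders rather than the repo's actual modules:

```python
import torch

def random_order_causal_loss(causal_model, token_loss, tokens):
    """tokens: (B, L, D) continuous image tokens (illustrative sketch)."""
    B, L, _ = tokens.shape

    # A fresh random ordering per sample.
    order = torch.argsort(torch.rand(B, L), dim=1)                    # (B, L)
    shuffled = torch.gather(tokens, 1, order.unsqueeze(-1).expand_as(tokens))

    # Teacher forcing: positions 0..L-2 are inputs, 1..L-1 are targets,
    # and causal attention inside `causal_model` prevents peeking ahead.
    inputs, targets = shuffled[:, :-1], shuffled[:, 1:]
    hidden = causal_model(inputs)

    # `token_loss` can be the same per-token diffusion loss as in MAR.
    return token_loss(hidden, targets)
```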
Thank you very much.
Could you please show me some images generated by a causal method? |
I wonder how the model performs at different training stages, e.g., how many training steps it takes to be able to generate the shape of an object? |
Could you share the learning-rate (lr) details for mar_huge training? It would help reproduce the results.
@Juhywcy the learning rate schedule and value are the same for all models.
Thanks for your reply!! Constant at 1e-4?
Yes, also with linear warmup.
Thanks for your fast reply! Have a good day!
Thanks for your great work! |
I just want to confirm: in the paper, the lr is 8e-4?
@LingweiMeng we scale the final learning rate according to the total batch size divided by 256. 1e-4 is the "base learning rate" before scaling. |
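To make the rule concrete, here is a small sketch of the schedule described in this thread: a base lr of 1e-4 scaled by total batch size / 256, with linear warmup and then a constant value. The batch size and warmup length below are example values, not taken from this thread (2048 is simply the batch size that would yield the 8e-4 mentioned above):

```python
base_lr = 1e-4
total_batch_size = 2048                  # example; 2048 / 256 * 1e-4 = 8e-4
lr = base_lr * total_batch_size / 256

warmup_steps = 10_000                    # placeholder warmup length

def lr_at_step(step: int) -> float:
    """Linear warmup to `lr`, then constant."""
    if step < warmup_steps:
        return lr * (step + 1) / warmup_steps
    return lr
```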
Thank you.
Thanks for your great work!
I am currently working on a project that involves the DiffuLoss, and I am curious about the convergence behavior of the training loss. Specifically, to what value does the training loss ultimately converge for the MAR and causal methods, and which of the two converges faster?
Again, awesome work! I look forward to your response!