
Inconsistency in Reported Experimental Results for ViTPose and ViTPose++ Across Papers #129

Janus-Shiau opened this issue Jan 19, 2024 · 2 comments

Comments

Janus-Shiau commented Jan 19, 2024

Hello ViTPose Team,

Firstly, I'd like to express my admiration for your work on ViTPose and ViTPose++. These are significant achievements in the field of human pose estimation, with impressively robust performance.

However, I am reaching out to inquire about an inconsistency in the experimental results reported across two of your papers: ViTPose and ViTPose++.

In the initial release of the ViTPose++ paper, I noticed that ViTPose appeared to perform better on the OCHuman dataset compared to ViTPose++. However, in the updated ViTPose++ paper from December 2023, new data was included for ViTPose in the OCHuman results, which differed from the original ViTPose data.

Specifically, when comparing Table 11 from the ViTPose paper with Table 15 from the ViTPose++ paper, there is a discrepancy of over 25 AP in the ViTPose numbers, even though the other SOTA results remain consistent.

[Image: Table 11 from the ViTPose paper]

[Image: Table 15 from the ViTPose++ paper]

Clarifying these differences is important for others in the field and would greatly enhance the understanding and application of your valuable work. Could you please provide some insight into this discrepancy?

Thank you for your time and effort.

Annbless (Collaborator) commented:
Hi, thanks for pointing this out.

In the ViTPose paper, the results are reported under the multi-task training setting. In the ViTPose++ paper (Table 15), the ViTPose results refer to the model trained on a single dataset (MS COCO in this case).

Janus-Shiau (Author) commented Jan 19, 2024

Hi @Annbless,

Thank you for your prompt and clear explanation. I see now that I overlooked this detail when comparing the tables. This clarification will surely be helpful for anyone delving into the nuances of your work.

Thanks again for your guidance and contributions to the field.

Reference

To assist others who might have similar questions in the future, I've extracted the relevant text from both papers.

From the ViTPose paper, section A:

"Please note that the ViTPose variants are trained under the multi-dataset training setting and tested directly without further finetuning on the specific training dataset, to keep the whole pipeline as simple as possible."

From the ViTPose++ paper, section 4.5.2:

"Note that the ViTPose++ models are trained with the combination of all the datasets and directly tested on the target dataset without further fine-tuning, which keeps the whole pipeline as simple as possible. For each dataset, we use the corresponding FFN, decoder, and prediction head in ViTPose for prediction. We also provide the ViTPose baseline results. It’s worth highlighting that, despite using the same number of parameters for inference, ViTPose++ utilizes much fewer parameters during training compared with training individual ViTPose models for each dataset."
