Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#586 done broke me #600

Closed
SteveMacenski opened this issue May 4, 2023 · 12 comments · Fixed by #601
Closed

#586 done broke me #600

SteveMacenski opened this issue May 4, 2023 · 12 comments · Fixed by #601

Comments

@SteveMacenski
Copy link
Contributor

SteveMacenski commented May 4, 2023

A full thread can be found osrf/ros2_test_cases#1042 but during Iron Testing Day, it was identified that Nav2 is totally busted and jumping around. Upon further investigation it was isolated to geometry2, upon further investigation again it was found to be the cause of 2 commit

I cannot tell you why in particular these are broken, but they should either be reverted for Iron or the issue resolved before release. Until such a time, the Nav2 stack does some wacky stuff that its hard to indicate to you clearly how these commits are causing it.

My best description of what I see, but you can reproduce it yourself easily by launching the default Nav2 system and watching the wacky wavey inflatable balloon man dance across your rviz display:

We're seeing Nav2 receiving AMCL localization service requests, but not honoring the content of them (and seems to randomly just bounce around) & Gazebo Classic appears to be producing data that doesn't seems sensible for where the robot truly is in the environment + reporting that the robot is outside of the local costmap bounds for update.

CC @roncapat

@roncapat
Copy link
Contributor

roncapat commented May 4, 2023

Hi Steve, I'm sorry for that. But I'm a bit confused about why this can happen at all.

Those changes deal only with the exception paths of TF2, so it might be needed to understand if/how/where those exceptions are handled by Nav2 in the case you mentioned.

Could you please share/point the standard nav2 configuration launch scenario you mentioned, so that I can have a look? I use Nav2 with a different setup, no AMCL for example.

Does Nav2 breaks also in other test cases/configurations?

@clalancette
Copy link
Contributor

Yes, I have the same questions as @roncapat ; these changes seem innocuous, so I'm surprised they are the cause of the issue. Having a way to reproduce so we can look into the problem would be most helpful.

I'd prefer to try to debug it and fix it forward first. If we can't make progress there, then I'm not opposed to reverting #586 and #592.

@SteveMacenski
Copy link
Contributor Author

SteveMacenski commented May 4, 2023

Nav2 launch starts here: https://github.com/ros-planning/navigation2/blob/main/nav2_bringup/launch/tb3_simulation_launch.py. Every system test we have involving simulation is breaking - so its not just end-users are seeing it, our CI is picking it up too. Example: https://app.circleci.com/pipelines/github/ros-planning/navigation2/9261/workflows/128a3a96-5c73-49b6-b2e2-966572dc3225/jobs/29858 (first 3 are unrelated, see the other 10).

I don't think (but I could be wrong) its at the Nav2 level, since we always have the base TransformException in try/excepts that log to terminal error messages of any sort via TF (for example), unless we missed somewhere. As such, I would expect to see some kind of logging but I see none.

I'll try a few things. I can tell you its not https://github.com/ros2/geometry2/pull/592/files since its broken before it.

The 2 things that draw my eye are

      return tf2::TF2Error::TF2_EXTRAPOLATION_ERROR;
      return error_code;

and the new exceptions

      case tf2::TF2Error::TF2_BACKWARD_EXTRAPOLATION_ERROR:
        throw BackwardExtrapolationException(error_string);
      case tf2::TF2Error::TF2_FORWARD_EXTRAPOLATION_ERROR:
        throw ForwardExtrapolationException(error_string);
      case tf2::TF2Error::TF2_NO_DATA_FOR_EXTRAPOLATION_ERROR:
        throw NoDataForExtrapolationException(error_string);

My speculation is that one of two things are happening: (1) something consuming exceptions further up the stack just haven't been programmed to use them so they're not being handled properly (2) the error code setting isn't working properly, so return error_code is just garbage in one or more places.

I'm putting my money on (2) since (1) I feel would give me error messages somewhere, somehow. The silence is illustrative.

I'm not sure I'll be able to give you a patch for an exact fix, but I hope to be able to follow up later today at least with where specifically its happening within that commit and leave it to @roncapat to diagnose further about how that change could have messed things up, if he's game 😄. If the issue is actually more things like #592 that are missing, I won't be able to find that, but I should be able to isolate the other 2 options.

@SteveMacenski
Copy link
Contributor Author

Fun stuff, found it, I was correct its

-      return tf2::TF2Error::TF2_EXTRAPOLATION_ERROR;
+      return error_code;

Reverting those 2 instances, things work again. Something's up with populating error_code

@roncapat
Copy link
Contributor

roncapat commented May 4, 2023

I'm in... Tomorrow evening I'll give an in depth look, but of course everything you can find or narrow down it's a nice to have (while I was writing, I already saw you found something, so thank you!).

@SteveMacenski
Copy link
Contributor Author

Its actually just one of those lines, L461 changing it back fixes the problem. I spent about 3 minutes trying to understand why, and I didn't come up with an answer, but that's the targeted problem.

@roncapat
Copy link
Contributor

roncapat commented May 4, 2023

Now I'm from mobile but... I noticed that on line

https://github.com/ros2/geometry2/blob/rolling/tf2/src/cache.cpp#L124

I set an error but not an error string. A bit over the lines you reverted, there is an "if(error_string) to detect the presence of an error.

Edit: no, this is not the cause of the issue.

@SteveMacenski
Copy link
Contributor Author

Maybe this is an easy fix then!

@roncapat
Copy link
Contributor

roncapat commented May 6, 2023

I'm 90% sure I found it in code. There's one possible overwrite/reset of error_code... Will test and post the patch soon hopefully.

@roncapat
Copy link
Contributor

roncapat commented May 6, 2023

@SteveMacenski coud you try this patch?. the rationale is that we have extrapolation_error_string potentially holding a different message than error_string. So same must be for error_code (but wasn't).

In principle, since before the introduction of error_code in the API the only thing carried around was the string (describing an error only in a lexical way), now it's just a matter of bringing around together both string and code representation of the error being careful not to mix/decouple them.

@SteveMacenski
Copy link
Contributor Author

That fixes it! Can you submit the PR?

@clalancette what's the procedure for getting a patch into the distribution during testing (anything I can / should do)?

@clalancette
Copy link
Contributor

clalancette commented May 7, 2023

@clalancette what's the procedure for getting a patch into the distribution during testing (anything I can / should do)?

CI needs to run, then we need to backport to Iron, and then we need to do package releases. Not much to do here, we should be able to get this all in the next couple of days.

@clalancette clalancette linked a pull request May 7, 2023 that will close this issue
clalancette pushed a commit that referenced this issue May 8, 2023
* error_code shall follow same path of its error string

* Lint
mergify bot pushed a commit that referenced this issue May 8, 2023
* error_code shall follow same path of its error string

* Lint

(cherry picked from commit ac2479e)
clalancette pushed a commit that referenced this issue May 8, 2023
* error_code shall follow same path of its error string

* Lint

(cherry picked from commit ac2479e)

Co-authored-by: Patrick Roncagliolo <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants