-
We already print out the trace to the location of unsupported operators. In the case that you require full compilation, these will be printed as errors. In the case that you are using partial compilation, you will see these messages as Debug/Info messages.

I guess you are likely talking about issues beyond that. We could try to formalize this a bit so we can summarize the context in the converter, maybe something like:
-
Yeah, when fallback occurs due to unsupported operators, this already prints out the unsupported operator as you mentioned, with the file and line number (eg. #803). For this request, I was thinking more along the lines of the failures seen in something like #922:
Having better printouts here would be helpful in pinpointing the offending op in large models, if only for error isolation and bug reporting. I believe this is in line with your statement above.

Regarding debug-level messaging, I am opposed to using this as the default solution for addressing issues in general, since large and complex graphs (where Torch-TensorRT will frequently be used) will output massive amounts of unnecessary information that is irrelevant to the root issue; having to parse this adds additional friction to the debugging process, when frequently the user is really only looking for one offending op. It is also easy to change the lowered graph when searching for a problematic op, or combination of ops, especially when attempting to isolate errors, which may hide certain failures, so being able to print out a failing op adds an additional layer of sanity checking. Additionally, the output here frequently looks successful and correct, so it is not apparent at a glance (if at all) where the issue lies. That said, I would still like to see your suggestion of formalizing the printout here realized, as this may indeed already be sufficient (and I'm just not reading the debug output correctly or well enough).

In summary, I'm in favor of giving these failures the same printout treatment as what we currently have for unsupported ops, since the end result is that the user must either 1. explicitly tell Torch-TensorRT to ignore the op, 2. modify the graph to remove the op, or 3. report the op conversion failure as a bug or feature request. I guess I would frame this as raising the most basic information (ie. the op, file, and line number, which may already be present in the lowered/debug output) to regular output in all situations where a warning or error appears, because this is usually the best place to start debugging.
-
Yeah, in general I am not against raising the severity of messages and providing more context, but I do want to be purposeful in what we report to the user in these error messages and ideally make our errors actionable, i.e. "here is an issue we detected, here is a potential solution." What I want to avoid is dumping a ton of information on people in cases where it's misleading or there is no action to take; this would include your example.

Totally agree for cases like this: https://github.com/NVIDIA/Torch-TensorRT/blob/ba9f730fe049ffd31cd1d30e7f99a74c82086342/core/conversion/evaluators/aten.cpp#L220 A backtrace would be super useful there.

In cases like this: https://github.com/NVIDIA/Torch-TensorRT/blob/ba9f730fe049ffd31cd1d30e7f99a74c82086342/core/partitioning/shape_analysis.cpp#L67 this does not seem actionable by the user. If there is a failure here, I am not sure what options a user has. Perhaps dumping context at a lower level may help in reporting, but just throwing out the trace here obfuscates the issue.

Philosophically, I would prefer logging to be as close as possible to how people expect compiler logging to work: warnings imply a potential risk, and errors imply some incompatibility between the source code and the expectations of the compiler. In both cases the action is on the user to accept the risk or modify their code. The issue is that, unlike GCC, Torch-TRT is still fairly new, so hitting bugs in the compiler happens more often, and in those cases the action is really on us; the user's code is fine. One thing that might help here is better distinguishing between issues caused by the user's code and failures that are really compiler bugs.
TL;DR: totally agree that we can provide more information to make it easier to narrow down where failures are happening, but we want to be careful not to send people off on wild bug hunts for things they are not expected to have to fix. I think it's worth taking a full pass over the logging system and adding more information in general.
-
Hmm, yeah, that's fair; I'm in agreement with what's stated here, especially with regard to not sending users off on a wild goose chase. That said, would you say that just logging an op and its file location (not the full stack trace) is still too much here? Even if the underlying issue is with the conversion process (ie. the user's code is fine), this still seems like a good, and easier, way for the user to confirm that their op usage is correct (because we're saying this op is failing, but on inspection its usage is correct) and that it's the compilation that's failing, which also makes it easier to report the bug.
-
Bump @narendasan ^
-
Converting this to a discussion, as it is not a specific request but a general request for improvement.
-
Is your feature request related to a problem? Please describe.
By default, when running Torch-TensorRT from Python, most errors encountered during compilation will print a stack trace originating from either the Python compilation command (eg. printing something like torch_tensorrt_pypi/torch_tensorrt/ts/_compiler.py", line 119, in compile compiled_cpp_mod = _C.compile_graph(module._c, _parse_compile_spec(spec)). RuntimeError: ....) or the CPP file backing the implementation. Warnings during runtime may be printed without any identifying information.
The current stack traces and output by themselves do not provide enough information to diagnose or pinpoint which PyTorch op is causing the offending behavior. This is especially egregious when attempting to apply Torch-TensorRT to large and complex models, where conversion failures are currently difficult to identify without unwinding and stepping through the model op by op to locate the root cause.
Describe the solution you'd like
Ideally, when an error or warning occurs, we should print output identifying the offending op (at minimum), as well as the file and line number at which the op occurs, in addition to other stack traces, errors, and warnings.
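As a purely hypothetical illustration of the kind of message I have in mind (the op name and file below are made up, and the exact format doesn't matter), something along the lines of:

```
ERROR: Unable to convert node: %5 : Tensor = aten::some_op(%input)
       (op: aten::some_op, defined at my_model.py:42)
```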
Describe alternatives you've considered
Currently, I've tried log printing at the graph level to better identify errors during the conversion process. However, there are issues with this approach:
Additional context
I think this ask is potentially complicated but somewhat possible... On the one hand, Torch-TensorRT converts using the JIT output, which means obtaining metadata could be more complex than just reading the model directly. On the other hand, looking at the Interpreting Graphs section of the documentation, it appears file and line number information should be accessible somewhere...
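For reference, a minimal sketch of where that metadata appears to live, assuming the standard TorchScript Python bindings (the Net module below is just a throwaway example): each node in a scripted graph exposes a sourceRange() that points back to the original Python file and line when it was recorded, so in principle this is the information that could be surfaced next to the op kind when a conversion fails.

```python
import torch

# Hypothetical toy module, just to have a graph to inspect.
class Net(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x) + 1

scripted = torch.jit.script(Net())

for node in scripted.graph.nodes():
    # node.kind() is the op name (eg. "aten::relu"); sourceRange() reports the
    # original Python source location for the node, when one was recorded.
    print(node.kind())
    print(node.sourceRange())
```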