-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable the huge_tree
option for the lxml
parser
#3365
base: main
Are you sure you want to change the base?
Conversation
@KwisatzHaderach Are you able to somehow test this change and let me know if it resolves your problem? The processing of huge files using LXML can take a lot of resources (esp. CPU). We can always add an option/condition to completely bypass the LXML processing and use just Jinja2. The LXML is used to just check the XML schema for non-custom JUnit flavors and for prettifying the XML output. |
@seberm fine with me, the XML is not something user can easily inject I assume, if he wants to DOS us he has a lot of other options directly from the tests anyway. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
didn't get a chance to test it yet, but approving since it looks good and we need this sooner rather than later
So I gave it an another round of tests. Let's take following test as an example:
As you can see from the results below, enabling the 1) LXML completely bypassed (uninstalled), using just Jinja to generate the JUnitTakes around 43s on my machine to generate, but it works.
2) LXML enabled (schema validation + pretty print),
|
2a7c422
to
5317bc7
Compare
huge_tree
option for the lxml
parser
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand it correctly, we want large depth (above 256) to require huge_tree.
We played with it locally with:
DEPTH=257
OUTPUT_FILE="deep.xml"
# Start the XML file with a root tag
echo "<root>" > "$OUTPUT_FILE"
# Generate nested tags
for i in $(seq 1 $DEPTH); do
printf "%0.s " >> "$OUTPUT_FILE" # Indent for readability (optional)
echo "<tag$i>" >> "$OUTPUT_FILE"
done
# Close the nested tags in reverse order
for i in $(seq $DEPTH -1 1); do
printf "%0.s " >> "$OUTPUT_FILE" # Indent for readability (optional)
echo "</tag$i>" >> "$OUTPUT_FILE"
done
# Close the root tag
echo "</root>" >> "$OUTPUT_FILE"
And then I thought we could catch the exception and only use huge_tree, when it's necessary?
from lxml import etree
def parse_with_huge_tree(file_path):
try:
tree = etree.parse(file_path)
print("XML parsed successfully (default settings).")
return tree
except etree.LxmlError as e:
print(f"LxmlError encountered: {e}")
print("Retrying with huge_tree=True...")
try:
parser = etree.XMLParser(huge_tree=True)
tree = etree.parse(file_path, parser)
print("XML parsed successfully with huge_tree=True.")
return tree
except etree.LxmlError as retry_error:
print(f"Retry failed with huge_tree=True: {retry_error}")
except Exception as retry_exception:
print(f"Unexpected error during retry: {retry_exception}")
except Exception as e:
print(f"Unexpected error: {e}")
# Usage
parse_with_huge_tree('./deep.xml')
Make sense?
(and maybe print a warning that it's quite large and we are being forced to use huge_tree, security vulnerability, et cetera)
Hi @mhoyer, Thanks for testing the deep trees. The deep XML trees in the tmt junit plugin can happen only with Regarding trying with |
41204e8
to
517971a
Compare
Yep, just informing users. I haven't looked into what it actually does, so I'm not sure how beneficial it is to try it without the flag first |
@martinhoyer FYI, I've modified the code to enable the I did more testing and discovered this should not prolong the time spent on LXML operations. Most of time seems to be spent in the Jinja2 |
Hello @psss,
So, I've re-checked the code and I think the removal of I intend to execute the tests only once (and not to remove data in Right now, it could still happen that the subsequent runs will start in in-progress workdirs due to the for loop over tmt execute methods: But because the test uses only a tmt execute method, everything should work just fine. I think we could remove the for loop over the execute methods because it currently seems unnecessary. |
It seems I've found where the bottleneck is: |
/packit test |
…e user" This reverts commit 21cfe4d.
21cfe4d
to
4b5bc05
Compare
Hello @martinhoyer , The reason is it's hard to detect the "huge text node" or "deep tree" errors between various lxml and libxml2 versions. The error messages and message types are a little bit different. Sometimes the error message is only in To sum it up, I don't think it's worth complicating the code just to print a warning. Let's just keep the |
Absolutely agree, simpler is usually better, tested the current implementation, along with #3382, and it is working fine on data from TF run where I encountered the error previously. BTW why #3382 is very important, when I was rerunning the report step for my run, without it took over 6 hours, with it it's under a minute. |
This PR should resolve the problem with the LXML limit on the size of text nodes it can handle:
For more info about a
huge_tree
option, please see:Especially this LXML FAQ section about security concerns:
TODOs:
Pull Request Checklist: