Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable the huge_tree option for the lxml parser #3365

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

seberm
Copy link
Collaborator

@seberm seberm commented Nov 18, 2024

This PR should resolve the problem with the LXML limit on the size of text nodes it can handle:

For more info about a huge_tree option, please see:

huge_tree - disable security restrictions and support very deep trees and very long text content (only affects libxml2 2.7+)

Especially this LXML FAQ section about security concerns:

Is lxml vulnerable to XML bombs?

This has nothing to do with lxml itself, only with the parser of libxml2. Since libxml2 version 2.7, the parser imposes hard security limits on input documents to prevent DoS attacks with forged input data. Since lxml 2.2.1, you can disable these limits with the huge_tree parser option if you need to parse really large, trusted documents. All lxml versions will leave these restrictions enabled by default.

Note that libxml2 versions of the 2.6 series do not restrict their parser and are therefore vulnerable to DoS attacks.

TODOs:

  • Try to test the test with huge output without lxml (with just jinja) to check if bottleneck is really lxml (schema or pretty print)
  • Enable the "huge output" test in the CI? -> yes
  • Add a test for XML deep trees (using a custom junit flavor)

Pull Request Checklist:

  • implement the feature
  • write the documentation
  • extend the test coverage
  • update the specification
  • adjust plugin docstring
  • modify the json schema
  • mention the version
  • include a release note

@seberm seberm added plugin | junit The junit report plugin plugin | reportportal The reportportal report plugin labels Nov 18, 2024
@seberm seberm added this to the 1.40 milestone Nov 18, 2024
@seberm seberm added the ci | full test Pull request is ready for the full test execution label Nov 18, 2024
@seberm
Copy link
Collaborator Author

seberm commented Nov 18, 2024

Hello @thrix, @happz, @psss ,
Do you have any objections or concerns about enabling the huge_tree option, especially from the TestingFarm perspective?

I think we should be perfectly fine with enabling it.

Thanks!

@seberm
Copy link
Collaborator Author

seberm commented Nov 18, 2024

@KwisatzHaderach Are you able to somehow test this change and let me know if it resolves your problem?

The processing of huge files using LXML can take a lot of resources (esp. CPU). We can always add an option/condition to completely bypass the LXML processing and use just Jinja2. The LXML is used to just check the XML schema for non-custom JUnit flavors and for prettifying the XML output.

@thrix
Copy link
Collaborator

thrix commented Nov 19, 2024

Hello @thrix, @happz, @psss , Do you have any objections or concerns about enabling the huge_tree option, especially from the TestingFarm perspective?

I think we should be perfectly fine with enabling it.

Thanks!

@seberm fine with me, the XML is not something user can easily inject I assume, if he wants to DOS us he has a lot of other options directly from the tests anyway.

Copy link
Collaborator

@KwisatzHaderach KwisatzHaderach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

didn't get a chance to test it yet, but approving since it looks good and we need this sooner rather than later

@seberm
Copy link
Collaborator Author

seberm commented Nov 19, 2024

So I gave it an another round of tests. Let's take following test as an example:

require:
  - python
result: fail
test: python -c "print((('a' * 1023) + '\n') * 1024 * 10)"

As you can see from the results below, enabling the huge_tree=True option makes sense, and I think it will help solve the #3363 issue.

1) LXML completely bypassed (uninstalled), using just Jinja to generate the JUnit

Takes around 43s on my machine to generate, but it works.

$ python -m tmt -r ~/tmt-learning/huge-out run -a report --how junit --file x    43.01s user 2.08s system 71% cpu 1:03.41 total

2) LXML enabled (schema validation + pretty print), huge_tree=False

Takes around 44s on my machine, it crashes on huge text node error.

$ python -m tmt -r ~/tmt-learning/huge-out run -a report --how junit --file x    44.30s user 2.47s system 69% cpu 1:06.87 total

3) LXML enabled (schema validation + pretty print), huge_tree=True

Takes also around 44s on my machine, it works without error.

$ python -m tmt -r ~/tmt-learning/huge-out run -a report --how junit --file x    44.22s user 2.20s system 71% cpu 1:04.93 total

(another?) problem I was facing

If you change the test to the following command (similar amount of data as in previous test but no line ending):

test: python -c "print('a' * (10 * 1024 * 1024 + 1))"

The tmt is still "working" on generating the report and it seems like it will never finish. This behavior is the same with or without the LXML installed.

I would appreciate it if someone could help me to profile this test in the tmt, so we could decide if this is actually a problem that needs to be solved. Perhaps not? Any ideas?

@seberm
Copy link
Collaborator Author

seberm commented Nov 20, 2024

@psss @happz Do you think it's a good idea to enable the test mentioned above (test: python -c "print((('a' * 1023) + '\n') * 1024 * 10)") in the CI? I am not sure if it's a good idea to slow down the CI in this way. Thanks.

@seberm seberm force-pushed the feature/enable-huge-tree-for-lxml branch from 2a7c422 to 5317bc7 Compare November 21, 2024 14:42
@seberm seberm requested a review from thrix November 21, 2024 14:42
@psss psss changed the title Enable the huge_tree option for lxml parser Enable the huge_tree option for the lxml parser Nov 21, 2024
Copy link
Collaborator

@martinhoyer martinhoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand it correctly, we want large depth (above 256) to require huge_tree.

We played with it locally with:

DEPTH=257
OUTPUT_FILE="deep.xml"

# Start the XML file with a root tag
echo "<root>" > "$OUTPUT_FILE"

# Generate nested tags
for i in $(seq 1 $DEPTH); do
    printf "%0.s " >> "$OUTPUT_FILE" # Indent for readability (optional)
    echo "<tag$i>" >> "$OUTPUT_FILE"
done

# Close the nested tags in reverse order
for i in $(seq $DEPTH -1 1); do
    printf "%0.s " >> "$OUTPUT_FILE" # Indent for readability (optional)
    echo "</tag$i>" >> "$OUTPUT_FILE"
done

# Close the root tag
echo "</root>" >> "$OUTPUT_FILE"

And then I thought we could catch the exception and only use huge_tree, when it's necessary?

from lxml import etree

def parse_with_huge_tree(file_path):
    try:
        tree = etree.parse(file_path)
        print("XML parsed successfully (default settings).")
        return tree
    except etree.LxmlError as e:
        print(f"LxmlError encountered: {e}")
        print("Retrying with huge_tree=True...")
        try:
            parser = etree.XMLParser(huge_tree=True)
            tree = etree.parse(file_path, parser)
            print("XML parsed successfully with huge_tree=True.")
            return tree
        except etree.LxmlError as retry_error:
            print(f"Retry failed with huge_tree=True: {retry_error}")
        except Exception as retry_exception:
            print(f"Unexpected error during retry: {retry_exception}")
    except Exception as e:
        print(f"Unexpected error: {e}")

# Usage
parse_with_huge_tree('./deep.xml')

Make sense?

(and maybe print a warning that it's quite large and we are being forced to use huge_tree, security vulnerability, et cetera)

@seberm
Copy link
Collaborator Author

seberm commented Nov 22, 2024

Hi @mhoyer,
it's eighter "large tag depth" or/and "big text content in tag data".

Thanks for testing the deep trees. The deep XML trees in the tmt junit plugin can happen only with custom flavors. I will add a test for it. In the default flavor or when using the polarion jinja template for generating XUnit, this should never happen.

Regarding trying with huge_tree=False and retrying with huge_tree=True, considering that the huge_tree option will still be effectively enabled, I don't see a reason to complicate the code other than informing the user about potential security/DoS concerns or that it may consume more system resources and take longer. Or is there any more?

@seberm seberm force-pushed the feature/enable-huge-tree-for-lxml branch from 41204e8 to 517971a Compare November 22, 2024 11:25
@martinhoyer
Copy link
Collaborator

still be effectively enabled, I don't see a reason to complicate the code other than informing the user about potential security/DoS concerns or that it may consume more system resources and take longer. Or is there any more?

Yep, just informing users. I haven't looked into what it actually does, so I'm not sure how beneficial it is to try it without the flag first

@seberm seberm added plugin | polarion Plugins implementing the Polarion integration step | report Stuff related to the report step bug Something isn't working and removed plugin | reportportal The reportportal report plugin labels Nov 26, 2024
@seberm
Copy link
Collaborator Author

seberm commented Nov 26, 2024

@martinhoyer FYI, I've modified the code to enable the huge_tree option only if necessary and print the warning to the user:

I did more testing and discovered this should not prolong the time spent on LXML operations. Most of time seems to be spent in the Jinja2 render_template_file method, but it's a different issue.

@seberm
Copy link
Collaborator Author

seberm commented Nov 26, 2024

Hello @psss,

you've intentionally removed -r from the tmt run command here?

So, I've re-checked the code and I think the removal of -r is exactly what I wanted.

I intend to execute the tests only once (and not to remove data in --id $run_dir) and then run the report step multiple times with various junit arguments to check the report plugin arguments and functionality. Or am I still getting it wrong?

Right now, it could still happen that the subsequent runs will start in in-progress workdirs due to the for loop over tmt execute methods:

But because the test uses only a tmt execute method, everything should work just fine. I think we could remove the for loop over the execute methods because it currently seems unnecessary.

@seberm
Copy link
Collaborator Author

seberm commented Nov 26, 2024

I did more testing and discovered this should not prolong the time spent on LXML operations. Most of time seems to be spent in the Jinja2 render_template_file method, but it's a different issue.

It seems I've found where the bottleneck is:

@seberm
Copy link
Collaborator Author

seberm commented Nov 26, 2024

/packit test

@seberm seberm force-pushed the feature/enable-huge-tree-for-lxml branch from 21cfe4d to 4b5bc05 Compare November 27, 2024 14:30
@seberm
Copy link
Collaborator Author

seberm commented Nov 27, 2024

Hello @martinhoyer ,
just FYI, after multiple discussions and testing with @KwisatzHaderach,
I have reverted the changes for retrying with huge_tree=True and printing the warning to the user.

The reason is it's hard to detect the "huge text node" or "deep tree" errors between various lxml and libxml2 versions. The error messages and message types are a little bit different. Sometimes the error message is only in error.msg, sometimes it's in error.error_log list. And the exception type is still the same - etree.XMLSyntaxError.

To sum it up, I don't think it's worth complicating the code just to print a warning. Let's just keep the huge_tree option always enabled.

@KwisatzHaderach
Copy link
Collaborator

KwisatzHaderach commented Nov 27, 2024

Absolutely agree, simpler is usually better, tested the current implementation, along with #3382, and it is working fine on data from TF run where I encountered the error previously.

BTW why #3382 is very important, when I was rerunning the report step for my run, without it took over 6 hours, with it it's under a minute.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working ci | full test Pull request is ready for the full test execution plugin | junit The junit report plugin plugin | polarion Plugins implementing the Polarion integration step | report Stuff related to the report step
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Polarion XML fails to generate due to xmlSAX2Characters: huge text node
4 participants