Enable the `huge_tree` option for the `lxml` parser #3365

seberm · 2024-11-18T10:20:03Z

This PR should resolve the problem with the LXML limit on the size of text nodes it can handle:

Polarion XML fails to generate due to xmlSAX2Characters: huge text node #3363

For more info about a huge_tree option, please see:

https://lxml.de/api/lxml.etree.XMLParser-class.html

huge_tree - disable security restrictions and support very deep trees and very long text content (only affects libxml2 2.7+)

Especially this LXML FAQ section about security concerns:

https://lxml.de/FAQ.html#is-lxml-vulnerable-to-xml-bombs

Is lxml vulnerable to XML bombs?

This has nothing to do with lxml itself, only with the parser of libxml2. Since libxml2 version 2.7, the parser imposes hard security limits on input documents to prevent DoS attacks with forged input data. Since lxml 2.2.1, you can disable these limits with the huge_tree parser option if you need to parse really large, trusted documents. All lxml versions will leave these restrictions enabled by default.

Note that libxml2 versions of the 2.6 series do not restrict their parser and are therefore vulnerable to DoS attacks.

TODOs:

Try to test the test with huge output without lxml (with just jinja) to check if bottleneck is really lxml (schema or pretty print)
Enable the "huge output" test in the CI? -> yes
Add a test for XML deep trees (using a custom junit flavor)

Pull Request Checklist:

seberm · 2024-11-18T10:32:35Z

Hello @thrix, @happz, @psss ,
Do you have any objections or concerns about enabling the huge_tree option, especially from the TestingFarm perspective?

I think we should be perfectly fine with enabling it.

Thanks!

seberm · 2024-11-18T11:53:01Z

@KwisatzHaderach Are you able to somehow test this change and let me know if it resolves your problem?

The processing of huge files using LXML can take a lot of resources (esp. CPU). We can always add an option/condition to completely bypass the LXML processing and use just Jinja2. The LXML is used to just check the XML schema for non-custom JUnit flavors and for prettifying the XML output.

tests/report/junit/data/main.fmf

thrix · 2024-11-19T13:20:54Z

Hello @thrix, @happz, @psss , Do you have any objections or concerns about enabling the huge_tree option, especially from the TestingFarm perspective?

I think we should be perfectly fine with enabling it.

Thanks!

@seberm fine with me, the XML is not something user can easily inject I assume, if he wants to DOS us he has a lot of other options directly from the tests anyway.

KwisatzHaderach

didn't get a chance to test it yet, but approving since it looks good and we need this sooner rather than later

seberm · 2024-11-19T22:33:50Z

So I gave it an another round of tests. Let's take following test as an example:

require:
  - python
result: fail
test: python -c "print((('a' * 1023) + '\n') * 1024 * 10)"

As you can see from the results below, enabling the huge_tree=True option makes sense, and I think it will help solve the #3363 issue.

1) LXML completely bypassed (uninstalled), using just Jinja to generate the JUnit

Takes around 43s on my machine to generate, but it works.

$ python -m tmt -r ~/tmt-learning/huge-out run -a report --how junit --file x    43.01s user 2.08s system 71% cpu 1:03.41 total

2) LXML enabled (schema validation + pretty print), `huge_tree=False`

Takes around 44s on my machine, it crashes on huge text node error.

$ python -m tmt -r ~/tmt-learning/huge-out run -a report --how junit --file x    44.30s user 2.47s system 69% cpu 1:06.87 total

3) LXML enabled (schema validation + pretty print), `huge_tree=True`

Takes also around 44s on my machine, it works without error.

$ python -m tmt -r ~/tmt-learning/huge-out run -a report --how junit --file x    44.22s user 2.20s system 71% cpu 1:04.93 total

(another?) problem I was facing

If you change the test to the following command (similar amount of data as in previous test but no line ending):

test: python -c "print('a' * (10 * 1024 * 1024 + 1))"

The tmt is still "working" on generating the report and it seems like it will never finish. This behavior is the same with or without the LXML installed.

I would appreciate it if someone could help me to profile this test in the tmt, so we could decide if this is actually a problem that needs to be solved. Perhaps not? Any ideas?

seberm · 2024-11-20T15:59:16Z

@psss @happz Do you think it's a good idea to enable the test mentioned above (test: python -c "print((('a' * 1023) + '\n') * 1024 * 10)") in the CI? I am not sure if it's a good idea to slow down the CI in this way. Thanks.

martinhoyer

If I understand it correctly, we want large depth (above 256) to require huge_tree.

We played with it locally with:

DEPTH=257
OUTPUT_FILE="deep.xml"

# Start the XML file with a root tag
echo "<root>" > "$OUTPUT_FILE"

# Generate nested tags
for i in $(seq 1 $DEPTH); do
    printf "%0.s " >> "$OUTPUT_FILE" # Indent for readability (optional)
    echo "<tag$i>" >> "$OUTPUT_FILE"
done

# Close the nested tags in reverse order
for i in $(seq $DEPTH -1 1); do
    printf "%0.s " >> "$OUTPUT_FILE" # Indent for readability (optional)
    echo "</tag$i>" >> "$OUTPUT_FILE"
done

# Close the root tag
echo "</root>" >> "$OUTPUT_FILE"

And then I thought we could catch the exception and only use huge_tree, when it's necessary?

from lxml import etree

def parse_with_huge_tree(file_path):
    try:
        tree = etree.parse(file_path)
        print("XML parsed successfully (default settings).")
        return tree
    except etree.LxmlError as e:
        print(f"LxmlError encountered: {e}")
        print("Retrying with huge_tree=True...")
        try:
            parser = etree.XMLParser(huge_tree=True)
            tree = etree.parse(file_path, parser)
            print("XML parsed successfully with huge_tree=True.")
            return tree
        except etree.LxmlError as retry_error:
            print(f"Retry failed with huge_tree=True: {retry_error}")
        except Exception as retry_exception:
            print(f"Unexpected error during retry: {retry_exception}")
    except Exception as e:
        print(f"Unexpected error: {e}")

# Usage
parse_with_huge_tree('./deep.xml')

Make sense?

(and maybe print a warning that it's quite large and we are being forced to use huge_tree, security vulnerability, et cetera)

seberm · 2024-11-22T10:53:47Z

Hi @mhoyer,
it's eighter "large tag depth" or/and "big text content in tag data".

Thanks for testing the deep trees. The deep XML trees in the tmt junit plugin can happen only with custom flavors. I will add a test for it. In the default flavor or when using the polarion jinja template for generating XUnit, this should never happen.

Regarding trying with huge_tree=False and retrying with huge_tree=True, considering that the huge_tree option will still be effectively enabled, I don't see a reason to complicate the code other than informing the user about potential security/DoS concerns or that it may consume more system resources and take longer. Or is there any more?

martinhoyer · 2024-11-22T13:19:19Z

still be effectively enabled, I don't see a reason to complicate the code other than informing the user about potential security/DoS concerns or that it may consume more system resources and take longer. Or is there any more?

Yep, just informing users. I haven't looked into what it actually does, so I'm not sure how beneficial it is to try it without the flag first

seberm · 2024-11-26T12:59:16Z

@martinhoyer FYI, I've modified the code to enable the huge_tree option only if necessary and print the warning to the user:

21cfe4d

I did more testing and discovered this should not prolong the time spent on LXML operations. Most of time seems to be spent in the Jinja2 render_template_file method, but it's a different issue.

seberm · 2024-11-26T13:04:09Z

Hello @psss,

you've intentionally removed -r from the tmt run command here?

So, I've re-checked the code and I think the removal of -r is exactly what I wanted.

I intend to execute the tests only once (and not to remove data in --id $run_dir) and then run the report step multiple times with various junit arguments to check the report plugin arguments and functionality. Or am I still getting it wrong?

Right now, it could still happen that the subsequent runs will start in in-progress workdirs due to the for loop over tmt execute methods:

https://github.com/teemtee/tmt/pull/3365/files#diff-76f4c6cc642bd53f7744efaa0fca41ce533ea57e6dcc6d47383c4e21604fafa0R12

But because the test uses only a tmt execute method, everything should work just fine. I think we could remove the for loop over the execute methods because it currently seems unnecessary.

seberm · 2024-11-26T13:56:41Z

I did more testing and discovered this should not prolong the time spent on LXML operations. Most of time seems to be spent in the Jinja2 render_template_file method, but it's a different issue.

It seems I've found where the bottleneck is:

Do not save the logs into memory using an auxiliary variable #3382

seberm · 2024-11-26T20:07:38Z

/packit test

…e user" This reverts commit 21cfe4d.

seberm · 2024-11-27T14:46:47Z

Hello @martinhoyer ,
just FYI, after multiple discussions and testing with @KwisatzHaderach,
I have reverted the changes for retrying with huge_tree=True and printing the warning to the user.

The reason is it's hard to detect the "huge text node" or "deep tree" errors between various lxml and libxml2 versions. The error messages and message types are a little bit different. Sometimes the error message is only in error.msg, sometimes it's in error.error_log list. And the exception type is still the same - etree.XMLSyntaxError.

To sum it up, I don't think it's worth complicating the code just to print a warning. Let's just keep the huge_tree option always enabled.

KwisatzHaderach · 2024-11-27T15:49:36Z

Absolutely agree, simpler is usually better, tested the current implementation, along with #3382, and it is working fine on data from TF run where I encountered the error previously.

BTW why #3382 is very important, when I was rerunning the report step for my run, without it took over 6 hours, with it it's under a minute.

seberm added plugin | junit The junit report plugin plugin | reportportal The reportportal report plugin labels Nov 18, 2024

seberm added this to the 1.40 milestone Nov 18, 2024

seberm requested review from psss, lukaszachy, happz, thrix and janhavlin as code owners November 18, 2024 10:20

seberm added the ci | full test Pull request is ready for the full test execution label Nov 18, 2024

seberm linked an issue Nov 18, 2024 that may be closed by this pull request

Polarion XML fails to generate due to xmlSAX2Characters: huge text node #3363

Open

thrix reviewed Nov 19, 2024

View reviewed changes

tests/report/junit/data/main.fmf Outdated Show resolved Hide resolved

KwisatzHaderach approved these changes Nov 19, 2024

View reviewed changes

seberm force-pushed the feature/enable-huge-tree-for-lxml branch from 2a7c422 to 5317bc7 Compare November 21, 2024 14:42

seberm requested a review from thrix November 21, 2024 14:42

psss changed the title ~~Enable the huge_tree option for lxml parser~~ Enable the huge_tree option for the lxml parser Nov 21, 2024

martinhoyer reviewed Nov 21, 2024

View reviewed changes

seberm force-pushed the feature/enable-huge-tree-for-lxml branch from 41204e8 to 517971a Compare November 22, 2024 11:25

seberm added plugin | polarion Plugins implementing the Polarion integration step | report Stuff related to the report step bug Something isn't working and removed plugin | reportportal The reportportal report plugin labels Nov 26, 2024

seberm added 11 commits November 27, 2024 15:26

Enable the huge_tree option for lxml parser

477c2f8

Add disabled test which generates huge output into JUnit file

54e16d7

Enable and change the test command to a working one

c362c93

Improve the test the way it executes the execute step only once

fe5f29d

Use rlRun_LOG instead of teeing the output

8f77e34

Add test for XML deep trees

e97a28c

Do not remove output files one by one

e33a495

Enable huge_tree only when necessary, print the warning to the user

69dd8f7

Revert "Enable huge_tree only when necessary, print the warning to th…

f97acfb

…e user" This reverts commit 21cfe4d.

Re-add assert after the revert

0c46859

Small refactor of comments to save some space

4b5bc05

seberm force-pushed the feature/enable-huge-tree-for-lxml branch from 21cfe4d to 4b5bc05 Compare November 27, 2024 14:30

Improve the error message

cbdb31c

martinhoyer approved these changes Nov 27, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enable the `huge_tree` option for the `lxml` parser #3365

Enable the `huge_tree` option for the `lxml` parser #3365

seberm commented Nov 18, 2024 •

edited

Loading

seberm commented Nov 18, 2024

seberm commented Nov 18, 2024

thrix commented Nov 19, 2024 •

edited

Loading

KwisatzHaderach left a comment

seberm commented Nov 19, 2024

seberm commented Nov 20, 2024

martinhoyer left a comment •

edited

Loading

seberm commented Nov 22, 2024

martinhoyer commented Nov 22, 2024

seberm commented Nov 26, 2024

seberm commented Nov 26, 2024

seberm commented Nov 26, 2024

seberm commented Nov 26, 2024

seberm commented Nov 27, 2024 •

edited

Loading

KwisatzHaderach commented Nov 27, 2024 •

edited

Loading

Enable the huge_tree option for the lxml parser #3365

Are you sure you want to change the base?

Enable the huge_tree option for the lxml parser #3365

Conversation

seberm commented Nov 18, 2024 • edited Loading

TODOs:

seberm commented Nov 18, 2024

seberm commented Nov 18, 2024

thrix commented Nov 19, 2024 • edited Loading

KwisatzHaderach left a comment

Choose a reason for hiding this comment

seberm commented Nov 19, 2024

1) LXML completely bypassed (uninstalled), using just Jinja to generate the JUnit

2) LXML enabled (schema validation + pretty print), huge_tree=False

3) LXML enabled (schema validation + pretty print), huge_tree=True

(another?) problem I was facing

seberm commented Nov 20, 2024

martinhoyer left a comment • edited Loading

Choose a reason for hiding this comment

seberm commented Nov 22, 2024

martinhoyer commented Nov 22, 2024

seberm commented Nov 26, 2024

seberm commented Nov 26, 2024

seberm commented Nov 26, 2024

seberm commented Nov 26, 2024

seberm commented Nov 27, 2024 • edited Loading

KwisatzHaderach commented Nov 27, 2024 • edited Loading

Enable the `huge_tree` option for the `lxml` parser #3365

Enable the `huge_tree` option for the `lxml` parser #3365

seberm commented Nov 18, 2024 •

edited

Loading

thrix commented Nov 19, 2024 •

edited

Loading

2) LXML enabled (schema validation + pretty print), `huge_tree=False`

3) LXML enabled (schema validation + pretty print), `huge_tree=True`

martinhoyer left a comment •

edited

Loading

seberm commented Nov 27, 2024 •

edited

Loading

KwisatzHaderach commented Nov 27, 2024 •

edited

Loading