Clarify using CDATA in HTML context #10801

Open
SetTrend opened this issue Nov 26, 2024 · 12 comments
Labels
clarification (Standard could be clearer), document conformance

Comments

@SetTrend

SetTrend commented Nov 26, 2024

What is the issue with the HTML Standard?

Currently, the HTML standard is not clear on whether a <![CDATA[ ]]> section may be used in HTML context:

https://html.spec.whatwg.org/multipage/syntax.html#cdata-sections

There is just an example that – speaking only for the example itself – claims "CDATA sections can only be used in foreign content (MathML or SVG)."

Is this statement true for HTML? Then it should be moved outside the example heading.

@domenic
Member

domenic commented Nov 27, 2024

Per spec it's a conformance error; see https://html.spec.whatwg.org/multipage/parsing.html#markup-declaration-open-state.

I agree that this should probably be stated somewhere near where you list, instead of just inside the parser. I'm not really sure what the best conventions are for this sort of duplicate conformance requirement, but I know we have a variety of them.

I guess this is already implicit in https://html.spec.whatwg.org/#elements-2 actually?

The contents of the element must be placed between just after the start tag (which might be implied, in certain cases) and just before the end tag (which again, might be implied in certain cases). The exact allowed contents of each individual element depend on the content model of that element, as described earlier in this specification.

and no content models allow CDATA sections.

But yeah, your idea of just moving this sentence outside of the example might be reasonable.

@domenic added the "clarification" (Standard could be clearer) and "document conformance" labels Nov 27, 2024
@annevk
Member

annevk commented Nov 27, 2024

I think it could be a note instead of being part of the example, but we probably wouldn't want to restate it normatively as it indeed already follows from where it is referenced?

@domenic
Member

domenic commented Nov 27, 2024

@annevk, I thought you were the one who generally argued for the duplicate-normative-conformance-requirements approach, per whatwg/url#704 (comment) ?

@annevk
Member

annevk commented Nov 27, 2024

I'm okay with separate requirements for "parsing" and "writing", but here we are talking about duplicating a "writing" requirement, no?

@annevk
Member

annevk commented Nov 27, 2024

Only foreign elements are defined as allowing CDATA sections:

Foreign elements whose start tag is marked as self-closing can't have any contents (since, again, as there's no end tag, no content can be put between the start tag and the end tag). Foreign elements whose start tag is not marked as self-closing can have text, character references, CDATA sections, other elements, and comments, but the text must not contain the character U+003C LESS-THAN SIGN (<) or an ambiguous ampersand.

@domenic
Member

domenic commented Nov 27, 2024

I guess it depends on whether you find it clear that "foreign elements can have CDATA sections" implies "non-foreign elements cannot have CDATA sections". I think that's probably technically how it is written, but kind of confusing.

@annevk
Member

annevk commented Nov 27, 2024

The writing section is kind of written that way. It starts with a document and goes downward from there.

@SetTrend
Author

From my perspective, a statement like "foreign elements can have CDATA sections" is neither exhaustive nor exclusive enough. It's similar to saying "1 > 0". That wouldn't exclude 2 from also being greater than 0.

@annevk
Member

annevk commented Nov 27, 2024

Sure, but coupled with the next paragraph (and other text in that section) it's quite clear though:

Normal elements can have text, character references, other elements, and comments, but the text must not contain the character U+003C LESS-THAN SIGN (<) or an ambiguous ampersand. Some normal elements also have yet more restrictions on what content they are allowed to hold, beyond the restrictions imposed by the content model and those described in this paragraph. Those restrictions are described below.

@SetTrend
Author

I cannot find the text you are referring to in the above-mentioned document near the CDATA sections section.

@sideshowbarker
Contributor

I agree it’s worth adding some clarification in the CDATA sections section itself, and I think we could do that with just this:

-  <p><dfn data-x="syntax-cdata">CDATA sections</dfn> must consist of the following components, in
-  this order:</p>
+  <p><dfn data-x="syntax-cdata">CDATA sections</dfn> can only be used in foreign content (MathML
+  or SVG), and must consist of the following components, in this order:</p>

I don’t think it’s necessary to normatively restate the requirement anywhere; instead just that “can” there is sufficient — given that the actual normative document-conformance requirements are stated in the places Anne cited.

(And for the record here: the normativity follows from the fact that the only place where the spec references the “CDATA sections” definition is in the enumeration of what foreign elements are limited to consisting of — which explicitly includes CDATA sections; while the corresponding enumeration of what normal elements are limited to consisting of explicitly does not include CDATA sections — so the spec already states that CDATA sections are explicitly not allowed in normal elements.)
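
The two enumerations quoted earlier in the thread can be compared side by side. The following is only an illustrative sketch in Python: the names `FOREIGN_ELEMENT_CONTENTS`, `NORMAL_ELEMENT_CONTENTS`, and `may_contain` are hypothetical and do not appear in the spec, and the sets merely transcribe the quoted spec sentences (for foreign elements, the case where the start tag is not marked as self-closing).

```python
# Hypothetical names; the sets transcribe the two spec paragraphs quoted above.
FOREIGN_ELEMENT_CONTENTS = frozenset({
    "text",
    "character references",
    "CDATA sections",
    "other elements",
    "comments",
})

NORMAL_ELEMENT_CONTENTS = frozenset({
    "text",
    "character references",
    "other elements",
    "comments",
})


def may_contain(element_kind: str, construct: str) -> bool:
    """Return True if the quoted enumeration for element_kind lists construct."""
    allowed = {
        "foreign": FOREIGN_ELEMENT_CONTENTS,
        "normal": NORMAL_ELEMENT_CONTENTS,
    }[element_kind]
    return construct in allowed


# The only construct listed for foreign elements but not for normal elements.
difference = FOREIGN_ELEMENT_CONTENTS - NORMAL_ELEMENT_CONTENTS
```

Read this way, the difference between the two lists is exactly "CDATA sections", which is the implicit exclusion being discussed.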

@dmsnell
Contributor

dmsnell commented Nov 27, 2024

When writing WordPress’ HTML parser I found the terminology confusing and think there’s room to improve the communication around CDATA sections. Specifically, I find it’s confusing for a human looking at the syntax.

<![CDATA[]]>

What is this fragment of syntax? I presume most people will look at that and say, “it’s a CDATA section.” The HTML specification, however, must know the context around the fragment and will either say, “it’s a CDATA section,” or more likely, “it’s a syntax error that creates a bogus comment.”

I know in discussions with others this has been hard to communicate, as we often ask the question, “What should happen when a CDATA section appears within HTML elements?” The cheap answer is how I feel the specification words it: this can’t happen - the question is invalid.

So somehow it might be clarifying to expand on this where CDATA is mentioned at first:

CDATA sections can only be used in foreign content (MathML or SVG).
Everywhere else that they appear to exist is considered invalid HTML and the token transforms into [a bogus comment](link to tokenizing step handling this).
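
The context dependence described above can be sketched in a few lines of Python. This is a simplified illustration, not the spec's tokenizer: the function name `tokenize_cdata_like` and the `in_foreign_content` flag are assumptions made for this example, the input must begin with `<![CDATA[`, and anything after the first construct is returned as one undifferentiated text token rather than being tokenized further.

```python
def tokenize_cdata_like(markup: str, in_foreign_content: bool):
    """Tokenize input starting with '<![CDATA[' into (kind, data) tuples.

    Simplified sketch: only the first construct is interpreted, and any
    remaining input is returned as a single text token.
    """
    prefix = "<![CDATA["
    if not markup.startswith(prefix):
        raise ValueError("input must start with '<![CDATA['")

    tokens = []
    if in_foreign_content:
        # Foreign content (MathML or SVG): a real CDATA section whose
        # characters, up to the first "]]>", are emitted as text.
        start = len(prefix)
        end = markup.find("]]>", start)
        data = markup[start:] if end == -1 else markup[start:end]
        rest = "" if end == -1 else markup[end + 3:]
        if data:
            tokens.append(("text", data))
    else:
        # HTML content: a parse error. A comment token is created whose
        # data begins with "[CDATA[" and the bogus comment runs up to the
        # first ">" (not "]]>").
        start = 2  # skip "<!"
        end = markup.find(">", start)
        data = markup[start:] if end == -1 else markup[start:end]
        rest = "" if end == -1 else markup[end + 1:]
        tokens.append(("comment", data))

    if rest:
        tokens.append(("text", rest))
    return tokens
```

Under these assumptions, `tokenize_cdata_like("<![CDATA[x > y]]>", True)` yields the single text token `x > y`, while the same input with `False` yields a comment `[CDATA[x ` followed by the text ` y]]>`, because the bogus comment ends at the first `>`.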


5 participants