Language-specific frontends #429

Open
DemiMarie opened this issue Nov 13, 2022 · 4 comments

Comments

@DemiMarie

The frontend is currently not aware of the specific programming language. This is a significant problem, as it can cause code to be misparsed. I don’t have any examples yet, though.

@skvadrik
Owner

This is a valid concern, especially if we are going to support more language backends.

There is one example I have encountered where the parser has to be aware of the language: unpaired single quotes in Rust (for example in lifetimes such as 'a). Normally re2c parses a single quote as the beginning of a literal and looks for a matching quote to end it. Currently the parser has a bit of Rust-specific code to deal with this.
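To make the single-quote problem concrete, here is a minimal sketch of one possible disambiguation heuristic. It is not re2c's actual Rust-specific code; the names `CHAR_LITERAL` and `classify_quote` are hypothetical, and the escape handling is only an approximation of Rust's rules.

```python
import re

# Hypothetical heuristic, not re2c's implementation: a Rust char literal is
# a quote, then either an escape sequence or one ordinary character, then a
# closing quote. Anything else that starts with a quote is assumed to be a
# lifetime or loop label such as 'a or 'outer.
CHAR_LITERAL = re.compile(
    r"'(?:\\(?:x[0-9a-fA-F]{2}|u\{[0-9a-fA-F_]+\}|.)|[^'\\\n])'"
)


def classify_quote(src, pos):
    """Classify the token that starts at the single quote src[pos].

    Returns (kind, end): kind is "char" or "lifetime", and end is the
    index just past the token.
    """
    assert src[pos] == "'"
    m = CHAR_LITERAL.match(src, pos)
    if m:
        return "char", m.end()
    # No closing quote in the right place: consume an identifier instead.
    end = pos + 1
    while end < len(src) and (src[end].isalnum() or src[end] == "_"):
        end += 1
    return "lifetime", end
```

With this, the quote in `fn f<'a>` is consumed as a short lifetime token instead of starting a search for a closing quote further down the file.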

Full parsing of all supported languages is out of the question; this is not feasible, and it would make re2c unnecessarily complicated. There may be a need for more language-specific support in the parser, for example awareness of all kinds of string literals allowed in some language.

If you have other ideas or problematic examples, you are welcome to share them.

@DemiMarie
Author

Some of the cases that come to mind:

  1. C++/Rust/Go raw string literals

  2. Certain C preprocessor directives should be treated as comments (#pragma, #error, #warning come to mind)

  3. Comment nesting (IIRC Rust comments nest, while C and C++’s comments definitely do not.)

  4. C preprocessor macro abuse

  5. C line continuation:

    //       \
     this is still commented
    /\
    * this is also a comment */
    "\\
    "is still in the string literal"
    R"ab\
    c(a raw string literal)abc"

  6. C trigraphs (yuck)
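Cases 1 and 3 from the list above can be sketched as two small skipping routines. This is only an illustration under stated assumptions: `skip_block_comment` and `skip_cpp_raw_string` are hypothetical names, the `nested` flag stands in for whatever per-language configuration the frontend would use, and the raw-string routine does not validate the delimiter (length limit, forbidden characters).

```python
def skip_block_comment(src, pos, nested):
    """Return the index just past the block comment starting at src[pos].

    nested=True models Rust-style nesting, nested=False models C and C++.
    Returns -1 if the comment is unterminated.
    """
    assert src.startswith("/*", pos)
    depth = 1
    i = pos + 2
    while i < len(src):
        if src.startswith("*/", i):
            depth -= 1
            i += 2
            if depth == 0:
                return i
        elif nested and src.startswith("/*", i):
            depth += 1
            i += 2
        else:
            i += 1
    return -1


def skip_cpp_raw_string(src, pos):
    """Return the index just past the C++ raw string whose R is at src[pos].

    Returns -1 if no matching terminator is found.
    """
    assert src.startswith('R"', pos)
    open_paren = src.find("(", pos + 2)
    if open_paren < 0:
        return -1
    delim = src[pos + 2:open_paren]
    # The literal ends at the first )delim" after the opening parenthesis.
    close = src.find(")" + delim + '"', open_paren + 1)
    return -1 if close < 0 else close + len(delim) + 2
```

The same input `/* a /* b */ c */` ends at a different offset depending on `nested`, which is exactly the kind of divergence a language-unaware frontend cannot resolve.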

For 5 and 6, I suggest treating any occurrences of the bad cases (continued line comment, escaped newline in block comment delimiter, escaped newline after backslash in string or char literal, trigraph that could impact parsing) as syntax errors. They are all considered bad practice anyway (to the point that compilers issue warnings about them), so rejecting them should be okay.
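A rough sketch of what such a rejection pass could look like, assuming it runs over the raw source text before any other lexing. The function name `find_rejected_constructs` is hypothetical. The check is deliberately crude: it flags every trigraph, treats any `//` earlier on the line as a line comment (so a `//` inside a string literal is a false positive), and does not cover the escaped newline after a backslash in a string or char literal.

```python
import re

# The nine C trigraphs: ??= ??/ ??' ??( ??) ??! ??< ??> ??-
TRIGRAPH = re.compile(r"\?\?[=/'()!<>-]")
# A backslash immediately followed by a line break.
CONTINUATION = re.compile(r"\\\r?\n")


def find_rejected_constructs(src):
    """Return sorted (offset, reason) pairs for constructs to reject."""
    problems = []
    for m in TRIGRAPH.finditer(src):
        problems.append((m.start(), "trigraph"))
    for m in CONTINUATION.finditer(src):
        line_start = src.rfind("\n", 0, m.start()) + 1
        before = src[line_start:m.start()]
        prev_ch = src[m.start() - 1] if m.start() > 0 else ""
        next_ch = src[m.end()] if m.end() < len(src) else ""
        if "//" in before:
            problems.append((m.start(), "continued line comment"))
        elif prev_ch + next_ch in ("/*", "*/", "//"):
            # The continuation sits in the middle of a comment delimiter.
            problems.append((m.start(), "split comment delimiter"))
    return sorted(problems)
```

Ordinary multi-line macro definitions that end lines with a backslash are not flagged, since neither condition applies to them.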

@skvadrik
Owner

Another problematic case that came to mind is numeric literals that use a single quote as a thousands separator (12'345).
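One way to keep such a quote from being mistaken for the start of a character literal is to consume the whole numeric token first. The sketch below assumes a simplified pattern loosely modelled on preprocessing numbers; `NUMBER` and `skip_number` are hypothetical names, and signed exponents such as `1e+5` are not handled.

```python
import re

# A digit, then any run of alphanumerics, underscores or dots, where a
# single quote is accepted only if another such character follows it.
NUMBER = re.compile(r"\d(?:'?[0-9A-Za-z_.])*")


def skip_number(src, pos):
    """Return the index just past the numeric literal at src[pos].

    Returns pos unchanged if no numeric literal starts there.
    """
    m = NUMBER.match(src, pos)
    return m.end() if m else pos
```

In `12'345 + 'x'` the first quote is absorbed into the number, so only the quotes around `x` are left to be treated as a character literal.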

It should be noted that re2c handles code outside of blocks differently from the code inside of blocks (that is, user-defined semantic actions). Although semantic actions are not parsed precisely, re2c is able to recognize comments, strings, etc., as it searches for the closing curly brace. But the code between blocks is treated more or less like a stream of raw characters.

Any effort to change this should be conservative, meaning that if re2c is unable to recognize a precise lexeme (e.g. a string, a preprocessor directive, etc.) then it should fall back to the "raw stream of characters" logic.
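The conservative approach can be illustrated with a brace-matching loop that skips a lexeme only when it is recognized completely, and otherwise advances one raw character. This is a sketch of the idea, not re2c's code; `skip_string` and `find_closing_brace` are hypothetical names, and only `//` comments, `/* */` comments and double-quoted strings are recognized.

```python
def skip_string(src, pos):
    """Return the index just past the string starting at src[pos], or -1."""
    i = pos + 1
    while i < len(src):
        if src[i] == "\\":
            i += 2              # skip the escaped character
        elif src[i] == '"':
            return i + 1
        elif src[i] == "\n":
            return -1           # unterminated on this line
        else:
            i += 1
    return -1


def find_closing_brace(src, pos):
    """Return the index of the '}' matching the '{' at src[pos], or -1."""
    assert src[pos] == "{"
    depth = 0
    i = pos
    while i < len(src):
        end = -1
        if src.startswith("//", i):
            nl = src.find("\n", i)
            end = len(src) if nl < 0 else nl
        elif src.startswith("/*", i):
            close = src.find("*/", i + 2)
            end = -1 if close < 0 else close + 2
        elif src[i] == '"':
            end = skip_string(src, i)
        if end >= 0:
            i = end             # a precise lexeme was recognized: skip it
            continue
        # Fallback: treat src[i] as a raw character.
        if src[i] == "{":
            depth += 1
        elif src[i] == "}":
            depth -= 1
            if depth == 0:
                return i
        i += 1
    return -1
```

An unterminated string or comment does not abort the scan here; the opening character is simply consumed as raw input and matching continues.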

@DemiMarie
Author

> Any effort to change this should be conservative, meaning that if re2c is unable to recognize a precise lexeme (e.g. a string, a preprocessor directive, etc.) then it should fall back to the "raw stream of characters" logic.

I recommend issuing a warning in this case. BTW there are certain cases (such as unterminated string literals) that are undefined behavior (!!!!!) if I recall correctly. re2c can just reject those outright.

One other thought I had is to actually pipe the output of re2c through the C preprocessor, then inspect the preprocessor’s output to make sure that what re2c thought were balanced { and } actually were. I suspect this would only be viable with build system integration, or on *nix where the syntax for invoking the C compiler is mostly standardized.
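A minimal sketch of that cross-check, assuming a POSIX-style compiler driver is available as `cc` and that `-E` stops after preprocessing. Both function names are hypothetical. The balance check skips string and character literals but knows nothing about digit separators or raw strings, so it would need the same language awareness discussed earlier in this thread.

```python
import subprocess


def preprocess(path, cc="cc"):
    """Run the C preprocessor on a file and return its output.

    Assumption: `cc -E <file>` works on this system.
    """
    result = subprocess.run(
        [cc, "-E", path], capture_output=True, text=True, check=True
    )
    return result.stdout


def braces_balanced(preprocessed):
    """Check that { and } are balanced outside string and char literals."""
    depth = 0
    i = 0
    n = len(preprocessed)
    while i < n:
        ch = preprocessed[i]
        if ch in "\"'":
            # Skip to the matching quote, stepping over escaped characters.
            i += 1
            while i < n and preprocessed[i] != ch:
                i += 2 if preprocessed[i] == "\\" else 1
            i += 1
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False    # a closing brace with no opener
        i += 1
    return depth == 0
```

A build step could then call `braces_balanced(preprocess("lexer.c"))` on the generated file (the file name is just an example) and emit a warning when the result disagrees with what re2c assumed.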
