Language-specific frontends #429
Comments
This is a valid concern, especially if we are going to support more language backends. There is one example I have encountered where the parser has to be aware of the language: unpaired single quotes in Rust. Normally re2c parses a single quote as the beginning of a literal and looks for a matching quote to end the literal. Currently the parser has a bit of Rust-specific code to deal with it. Full parsing of all supported languages is out of the question; it is not feasible, and it would make re2c unnecessarily complicated. There may be a need for more language-specific support in the parser, for example awareness of all kinds of string literals allowed in some language. If you have other ideas or problematic examples, you are welcome to share them.
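To make the single-quote problem concrete, here is a minimal sketch of a brace matcher that skips quoted literals and has one Rust-specific branch. It is not re2c's actual code: `find_closing_brace`, `skip_quoted`, `is_rust_char_literal` and the `lang` parameter are all hypothetical names, and the test for "is this a char literal" is a rough heuristic.

```python
def skip_quoted(code: str, i: int, quote: str) -> int:
    """Return the index just past the literal opened at code[i], or -1 if unterminated."""
    j = i + 1
    while j < len(code):
        if code[j] == "\\":
            j += 2  # skip the escaped character
        elif code[j] == quote:
            return j + 1
        else:
            j += 1
    return -1


def is_rust_char_literal(code: str, i: int) -> bool:
    """Heuristic (assumption): 'x' or '\\...' is a char literal, anything else is a lifetime or label."""
    return code[i + 1:i + 2] == "\\" or code[i + 2:i + 3] == "'"


def find_closing_brace(code: str, start: int, lang: str = "c") -> int:
    """Return the index of the '}' matching the '{' at code[start], or -1 if none is found."""
    depth = 0
    i = start
    while i < len(code):
        ch = code[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return i
        elif ch == '"' or ch == "'":
            if ch == "'" and lang == "rust" and not is_rust_char_literal(code, i):
                i += 1  # unpaired quote (lifetime or label): treat it as an ordinary character
                continue
            i = skip_quoted(code, i, ch)
            if i < 0:
                return -1  # unterminated literal: give up
            continue
        i += 1
    return -1
```

With the language-agnostic rule, the three lifetime quotes in `{ fn f<'a>(x: &'a str) -> &'a str { x } }` get paired up as literals and the last one is left unterminated, so the closing brace is never found. With `lang="rust"` the same input is matched correctly.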
Some of the cases that come to mind:
For 5 and 6, I suggest treating any occurrences of the bad cases (continued line comment, escaped newline in block comment delimiter, escaped newline after backslash in string or char literal, trigraph that could impact parsing) as syntax errors. They are all considered bad practice anyway (to the point that compilers issue warnings about them), so rejecting them should be okay.
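A rough sketch of what rejecting two of these cases could look like: a line comment continued by a trailing backslash, and any of the nine C trigraphs. This is only an illustration, not anything re2c implements. `find_rejected_constructs` is a hypothetical name, and the check is deliberately naive (it looks at lines as plain text and does not tell a `//` inside a string literal apart from a real comment).

```python
# The nine trigraph sequences defined by the C standard.
TRIGRAPHS = ("??=", "??/", "??'", "??(", "??)", "??!", "??<", "??>", "??-")


def find_rejected_constructs(source: str):
    """Return a list of (line_number, reason) for constructs to report as errors."""
    problems = []
    for lineno, line in enumerate(source.split("\n"), start=1):
        # A line that has a '//' and ends in a backslash continues the comment
        # onto the next line (naive: ignores '//' inside string literals).
        if "//" in line and line.endswith("\\"):
            problems.append((lineno, "continued line comment"))
        for tri in TRIGRAPHS:
            if tri in line:
                problems.append((lineno, "trigraph " + tri))
    return problems
```

A caller could turn each returned pair into an error or, if rejecting is considered too strict, into a warning.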
Another problematic case that came to mind is numeric literals with a single quote used as a thousands separator (
It should be noted that re2c handles code outside of blocks differently from the code inside of blocks (that is, user-defined semantic actions). Although semantic actions are not parsed precisely, re2c is able to recognize comments, strings, etc., as it searches for the closing curly brace. But the code between blocks is treated more or less like a stream of raw characters. Any effort to change this should be conservative, meaning that if re2c is unable to recognize a precise lexeme (e.g. a string, a preprocessor directive, etc.) then it should fall back to the "raw stream of characters" logic.
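The conservative approach can be sketched as "try a few precise lexeme patterns, otherwise emit one raw character and move on". The code below is an illustration under assumptions, not re2c's scanner: `scan_between_blocks` and the `LEXEMES` table are hypothetical, and the number pattern assumes digit groups separated by single quotes (as in C++14 digit separators).

```python
import re

# Precise lexemes the scanner knows about. Anything else is passed through raw.
LEXEMES = [
    # A double-quoted string on one line, with backslash escapes.
    ("string", re.compile(r'"(?:[^"\\\n]|\\.)*"')),
    # Assumption: decimal digits, optionally grouped with single quotes.
    ("number", re.compile(r"\d+(?:'\d+)*")),
]


def scan_between_blocks(text: str):
    """Split text into (kind, lexeme) pairs, falling back to single raw characters."""
    out = []
    i = 0
    while i < len(text):
        for kind, pattern in LEXEMES:
            m = pattern.match(text, i)
            if m:
                out.append((kind, m.group()))
                i = m.end()
                break
        else:
            # No precise lexeme starts here: fall back to the raw character stream.
            out.append(("raw", text[i]))
            i += 1
    return out
```

An unterminated string such as `"oops` never matches the string pattern, so its opening quote and the characters after it come out as raw characters instead of causing an error.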
I recommend issuing a warning in this case. BTW there are certain cases (such as unterminated string literals) that are undefined behavior (!!!!!) if I recall correctly. re2c can just reject those outright. One other thought I had is to actually pipe the output of re2c through the C preprocessor, then inspect the preprocessor’s output to make sure that what re2c thought were balanced |
The frontend is currently not aware of the specific programming language. This is a significant problem, as it can cause code to be misparsed. I don’t have any examples yet, though.