[Parser] Fix bug with non-ASCII characters in the new parser #948
Conversation
Excellent, thanks for working on this issue!

For me, the file is available at
I sorted it out. We have to specify the file's name instead of the whole path.

@Thirumalai-Shaktivel, can you try to debug this on Linux? I don't know what's breaking the CI. It works fine on my computer.
src/lpython/parser/tokenizer.re

```diff
@@ -268,7 +270,7 @@ int Tokenizer::lex(Allocator &al, YYSTYPE &yylval, Location &loc, diag::Diagnost
 int_bin = "0"[bB][01]+;
 int_hex = "0"[xX][0-9a-fA-F]+;
 int_dec = digit+ (digit | "_" digit)*;
-char = [a-zA-Z_];
+char = L | "_";
 name = char (char | digit)*;
```
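To give a sense of what the new `L` class pulls in, here is a rough Python sketch; the assumption (mine, not from this PR) is that `L` covers the Unicode letter general categories Lu, Ll, Lt, Lm, and Lo, as in the standard \p{L} property:

```python
import sys
import unicodedata

# Hypothetical reconstruction: enumerate the code points in the Unicode
# "letter" general categories, which the re2c `L` class is assumed to cover.
LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo"}
letters = [cp for cp in range(sys.maxunicode + 1)
           if unicodedata.category(chr(cp)) in LETTER_CATEGORIES]
print(len(letters))  # well over 100,000 code points
```

That size is the source of the generated-code-size and benchmark concerns discussed further down.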
Would this accept a Hindi numeral? It seems it would only accept digits 0-9, but not Hindi numerals.
I just added support for Unicode numerals in the latest commit.
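To make the numeral question concrete, a small Python check (standard library only):

```python
import unicodedata

# '१' (U+0967, DEVANAGARI DIGIT ONE) is a decimal digit in Unicode
# (general category Nd), but it is outside the ASCII class [0-9] used
# by the tokenizer's `digit` definition.
print(unicodedata.category("१"))  # 'Nd'
print("१".isdigit())              # True
print("x१".isidentifier())        # True: CPython allows it after the first character
print("१x".isidentifier())        # False: an identifier cannot start with a digit
```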
This is a relatively small change, so we can do this after:

The Linux failure is
The Windows failure is

Why don't we include these categories by hand from
It seems we need this:
The change looks good. Ping me after the tests pass; then we need to benchmark carefully. We don't want to lose any speed. :)
I haven't compared it time-wise with the
Yes, Unicode support is over 3x larger than all the rest of our tokenizer together... and all for ensuring that numerals in other languages (other than 0-9) are not allowed at the beginning. Why not just allow any Unicode character for `char` (yes, including numerals in other languages)? If somebody really wants their code checked, we could do it at the ASR level with an ASR pass that they can optionally run. That way we are not paying this hefty Unicode price. Try to implement it in a separate PR; I am going to benchmark this one now.
Should I do something like this? `char = [^\x00-\x7F]|[a-zA-Z_];`
Yes, I would try it (in a separate PR).
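For illustration, here is the behavior of that catch-all class, sketched with Python's `re` module rather than re2c; the character class itself is copied from the comment above:

```python
import re

# The proposed catch-all: any non-ASCII character, plus ASCII letters
# and underscore. Note that it also matches '१', a Devanagari digit,
# which is exactly the case an optional ASR-level pass would have to catch.
char = re.compile(r"[^\x00-\x7F]|[a-zA-Z_]")
print(bool(char.fullmatch("a")))  # True
print(bool(char.fullmatch("अ")))  # True: a Devanagari letter
print(bool(char.fullmatch("१")))  # True: a Devanagari digit slips through
print(bool(char.fullmatch("9")))  # False: ASCII digits stay in the `digit` rule
```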
So main:

```
$ time lpython --show-ast --new-parser a.py --no-color > a.txt
lpython --show-ast --new-parser a.py --no-color > a.txt  0.12s user 0.02s system 97% cpu 0.148 total
```

and this PR:

```
$ time lpython --show-ast --new-parser a.py --no-color > a.txt
lpython --show-ast --new-parser a.py --no-color > a.txt  0.12s user 0.02s system 97% cpu 0.149 total
```

Although on average this PR seems slightly slower (more around 156 ms), overall it is not too bad. I noticed we got slightly slower on that benchmark; we started at 139 ms. That is expected as the parser becomes full featured. If your other PR can pass your tests (from this PR), then I would rather use that, provided the generated code is about as large as in master.
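For reproducibility, a hypothetical repeat-run harness; it assumes `lpython` is on the PATH with the same `a.py` input, since single `time` invocations can vary by a few milliseconds:

```python
import statistics
import subprocess
import time

# Run the same command several times and report the spread, to smooth
# out the few-millisecond noise seen in one-off `time` runs.
runs = []
for _ in range(10):
    t0 = time.perf_counter()
    subprocess.run(
        ["lpython", "--show-ast", "--new-parser", "a.py", "--no-color"],
        stdout=subprocess.DEVNULL,
        check=True,
    )
    runs.append(time.perf_counter() - t0)
print(f"min {min(runs) * 1000:.1f} ms, "
      f"median {statistics.median(runs) * 1000:.1f} ms")
```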
The one in #951 generates even less code than
Closing in favor of #951.
Partially fixes #916 (New parser only)
CPython supports Unicode characters in identifier names, but the old definition restricted them to the ASCII character set:
lpython/src/lpython/parser/tokenizer.re, lines 271 to 272 in d40c2c1
However, we cannot simply fix this by adding all the non-ASCII characters to the old `char` definition, like this:

```
char = [^\x00-\x7F]|[a-zA-Z_];
```
This is because CPython does not allow the first character of an identifier to be a numeral, and this holds for numerals in any language. For example, १ is a Hindi numeral, so CPython throws an error if it is used as the first character of an identifier.

I came across the article https://www.regular-expressions.info/unicode.html, which suggests there are well-defined Unicode categories that can be used in a regular expression.
We require all the characters in the `letter` category for `char`. In re2c, we can import these categories by including a header comment like this:

`/*!include:re2c "re2c-2.2/include/unicode_categories.re" */`
I tried this, but it did not recognize the file's path for some reason. So instead, I just copied the regex for the letter category from the file mentioned in skvadrik/re2c#235 (comment).
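As a sanity check of the rule the description arrives at (letter or underscore first, then letters, underscores, or digits), here is a sketch using the third-party `regex` module, which supports Unicode property classes. Note that CPython's actual rule is based on the XID_Start and XID_Continue properties, which include a few more categories (such as combining marks), so this only agrees on simple cases:

```python
import regex  # third-party module: pip install regex

# name = char (char | digit)*  with  char = letter | "_",
# written with Unicode property classes:
# \p{L} = letters, \p{Nd} = decimal digits in any script.
name = regex.compile(r"[\p{L}_][\p{L}_\p{Nd}]*")
for ident in ["अब", "x१", "_tmp", "१x"]:
    print(ident, bool(name.fullmatch(ident)), ident.isidentifier())
# अब True True, x१ True True, _tmp True True, १x False False
```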