Add script to regen unicode file #425

ccleve · 2022-11-02T15:24:55Z

Addresses #235, #423

This seems to work. The format of the new file is a little different: the char classes are sorted, and single-char ranges are replaced by a single char. For example:

[\u1234-\u1234] -> [\u1234]
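
The normalization described above can be sketched in a few lines of Python. This is only an illustration, not the code from generate_unicode_files.py: the helper name `normalize_class` is hypothetical, and the sketch assumes classes that contain only 4-digit `\uXXXX` escapes.

```python
import re

# One item inside a class: either "\uXXXX-\uXXXX" or a lone "\uXXXX".
# Assumption: only 4-digit \u escapes occur (no \UXXXXXXXX, no literals).
ITEM = re.compile(r"\\u([0-9A-Fa-f]{4})(?:-\\u([0-9A-Fa-f]{4}))?")

def normalize_class(cls):
    # Collect every item as an inclusive (first, last) pair of code points.
    ranges = []
    for lo, hi in ITEM.findall(cls):
        first = int(lo, 16)
        last = int(hi, 16) if hi else first
        ranges.append((first, last))
    # Sort by first code point, then by last.
    ranges.sort()
    # Write single-character ranges as one character.
    parts = []
    for first, last in ranges:
        if first == last:
            parts.append("\\u%04x" % first)
        else:
            parts.append("\\u%04x-\\u%04x" % (first, last))
    return "[" + "".join(parts) + "]"
```

For instance, `normalize_class(r"[\u1234-\u1234]")` gives `[\u1234]`. Note that the sketch writes hex digits in lowercase, which may differ from the casing of the input.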

skvadrik · 2022-11-03T07:17:28Z

include/generate_unicode_files.py

@@ -0,0 +1,94 @@
+#!/usr/bin/python


#!/usr/bin/env python

skvadrik

Thanks for taking care of this! A few things:

Please rebase to get rid of the merge commit (should be trivial).
We need to make sure the script is not installed (I think it currently is, as part of the include subdirectory).
Looking at the changes in the generated files, they seem to be more substantial than just sorting and single-character classes:
- The L_ category (whatever it was) has disappeared; this breaks backwards compatibility.
- I looked at the beginning of the L category and I see that the range \u0561-\u0587 is replaced by \u0560-\u0588. There may be many more such changes. We need to understand how many there are and whether they are correct. I suspect they may already be fixed in the Haskell Data.Charset library that I used to generate the old unicode_categories.re. Therefore, let me first regenerate and commit the files with the Haskell script (I will also sort and fix single-character classes), and then let's move to your script. I'll try to do this by the end of today.
Ideally, it would also be good to generate the test files with the same Python library as the categories and get rid of the Haskell script (not that it doesn't work, but it adds an extra dependency and is generally more difficult for people to run than Python). But I can take care of that later.
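
One way to check range changes like the one mentioned above is to compute the ranges of a general category from Python's standard `unicodedata` module and compare them with the old ones. The following is a minimal sketch, not the PR's script; the helper names `category_ranges` and `diff_ranges` are hypothetical, and the result depends on the Unicode version bundled with the Python interpreter (`unicodedata.unidata_version`).

```python
import unicodedata

def category_ranges(prefix, limit=0x10000):
    # Inclusive (first, last) ranges of code points in [0, limit) whose
    # general category starts with `prefix`, e.g. "L" or "Lu".
    ranges = []
    start = None
    for cp in range(limit):
        inside = unicodedata.category(chr(cp)).startswith(prefix)
        if inside and start is None:
            start = cp
        elif not inside and start is not None:
            ranges.append((start, cp - 1))
            start = None
    if start is not None:
        ranges.append((start, limit - 1))
    return ranges

def diff_ranges(old, new):
    # Ranges present only in `old`, and ranges present only in `new`.
    return sorted(set(old) - set(new)), sorted(set(new) - set(old))

if __name__ == "__main__":
    print("Unicode version:", unicodedata.unidata_version)
    print(category_ranges("Lu", 0x80))
```

A difference such as \u0561-\u0587 versus \u0560-\u0588 is what one would expect to see if the two files were generated from different Unicode versions, which is why printing `unidata_version` alongside the ranges is useful.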

skvadrik · 2022-11-04T08:29:42Z

I have regenerated unicode_categories.re and the tests with the old Haskell script: e3ec259, and I can already see that the nontrivial changes to the character ranges are the same as in the Python script. What remains is to sort and fix single-character classes. I'll try to do that ASAP, but I'm traveling for the next few days, so it may have to wait a bit.

ccleve · 2022-11-06T16:39:04Z

I'm really struggling to figure out how to rebase and get rid of the extra commit. Apparently something I did closed this PR. Git is such a walking disaster...

skvadrik · 2022-11-07T07:48:03Z

> I'm really struggling to figure out how to rebase and get rid of the extra commit. Apparently something I did closed this PR. Git is such a walking disaster...

It can be very confusing. GitHub auto-closed the PR because you pushed a commit saying "Merge pull request ccleve#1 from skvadrik/master" (it's a GitHub feature, not a git one). I wonder if this can be configured in the settings (to stop GitHub from being "smart" and closing PRs / bugs based on keywords).

So what you need to do now to get nice linear history without merge commits is:

git rebase -i HEAD~3
In the editor, you will see the three latest commits: 1) 0a8aa3b, 2) Merge pull request #1 from skvadrik/master and 3) Merge branch 'master' of https://github.com/ccleve/re2c.
You need to squash commits 2 and 3 into 1.
To do that, replace the word "pick" in front of commits 2 and 3 with "fixup" (it is the same as "squash", but it does not try to merge commit messages and uses the first one).
Save and exit the editor (git should say "successfully rebased...").
Then git push -f (force-push, as you are rewriting history, which is not allowed by default as it is a destructive operation).

And after I merge my remaining work (sort + fix dupes), you will need to rebase your work on top of mine:

git pull --rebase skvadrik master

ccleve · 2022-11-07T17:49:10Z

Thanks, but I'm not seeing the same commits after the git rebase command, and I can't take the time to figure this out.

I recommend just closing or deleting this PR, and then copying and pasting my script into your repo. Life's too short to deal with git nonsense.

On another matter: it seems that it's not easy to get Python to spit out other unicode character properties, like Word Break or Script. I haven't found a good module that can do them, and I don't want to parse unicodedata.txt myself. I ended up writing code in Java to generate these files because Java and ICU4J support is really, really good. I'm happy to contribute Java code to generate the files, although it will be a lot harder for users to use Java than Python.
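
For reference, the property files in the Unicode Character Database (for example Scripts.txt and WordBreakProperty.txt) share a simple line layout: a code point or `first..last` range, a semicolon, the property value, and an optional `#` comment. A minimal parsing sketch under that assumption follows; the function name `parse_ucd_property_file` and the `SAMPLE` excerpt are illustrative only, and this is neither the Java/ICU4J approach mentioned above nor part of the submitted script.

```python
def parse_ucd_property_file(lines):
    # Map each property value to a sorted list of inclusive
    # (first, last) code point ranges.
    result = {}
    for line in lines:
        # Drop the trailing comment and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) < 2:
            continue
        cps, value = fields[0], fields[1]
        if ".." in cps:
            first, last = cps.split("..")
        else:
            first = last = cps
        result.setdefault(value, []).append((int(first, 16), int(last, 16)))
    for ranges in result.values():
        ranges.sort()
    return result

# Illustrative excerpt in the layout of Scripts.txt (not the full file).
SAMPLE = """\
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0061..007A    ; Latin # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
00AA          ; Latin # Lo       FEMININE ORDINAL INDICATOR
0370..0373    ; Greek # L&   [4] ...
"""

if __name__ == "__main__":
    for value, ranges in parse_ucd_property_file(SAMPLE.splitlines()).items():
        print(value, ranges)
```

The resulting ranges could then be fed to the same code that writes the character classes, at the cost of having to download the data files for the desired Unicode version.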

skvadrik · 2022-11-07T21:50:06Z

> I recommend just closing or deleting this PR, and then copying and pasting my script into your repo. Life's too short to deal with git nonsense.

Ok, I'll see what I can do. Thanks for the script!

> On another matter: it seems that it's not easy to get Python to spit out other unicode character properties, like Word Break or Script. I haven't found a good module that can do them, and I don't want to parse unicodedata.txt myself. I ended up writing code in Java to generate these files because Java and ICU4J support is really, really good. I'm happy to contribute Java code to generate the files, although it will be a lot harder for users to use Java than Python.

Right, a Java-based script has the same problem as the Haskell script: it may be nontrivial to run (depending on the developer's environment).

ccleve and others added 2 commits November 2, 2022 09:58

Added script to regen unicode file

0a8aa3b

Merge pull request #1 from skvadrik/master

fc35c2a

Add missing header include.

skvadrik reviewed Nov 3, 2022

skvadrik requested changes Nov 3, 2022

ccleve closed this Nov 6, 2022

ccleve force-pushed the master branch from fc35c2a to e3ec259 Compare November 6, 2022 16:20

Merge branch 'master' of https://github.com/ccleve/re2c

43fd0be

ccleve reopened this Nov 6, 2022

skvadrik closed this Nov 7, 2022
