Add script to regen unicode file #425
Add missing header include.
@@ -0,0 +1,94 @@
#!/usr/bin/python
#!/usr/bin/env python
Thanks for taking care of this! A few things:
- Please rebase to get rid of the merge commit (should be trivial).
- We need to make sure the script is not installed (I think it currently is, as part of the include subdirectory).
- Looking at the changes in the generated files, they seem to be more substantial than just sorting and single-character classes.
  - The L_ category (whatever it was) has disappeared; this breaks backwards compatibility.
  - I looked at the beginning of the L category and I see that the range \u0561-\u0587 is replaced by \u0560-\u0588. There may be many more such changes. We need to understand how many there are and whether they are correct. I suspect they may already be fixed in the Haskell Data.Charset library that I used to generate the old unicode_categories.re. Therefore let me first regenerate and commit the files with the Haskell script (I will also sort and fix single-character classes), and then let's move to your script. I'll try to do this by the end of today.
- Ideally, it would also be good to generate test files with the same Python library as the categories and get rid of the Haskell script (not that it doesn't work, but it adds an extra dependency and is generally more difficult for people to run than Python). But I can take care of it later.
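To check range differences like the one in the L category, something along these lines might help. It is only a rough sketch, not the script from this PR: `category_ranges` is a hypothetical helper name, and it assumes the standard `unicodedata` module and only scans the BMP by default. The ranges it reports depend on the Unicode database bundled with the interpreter, so two generators built on different Unicode versions can legitimately disagree.

```python
import unicodedata


def category_ranges(prefix, limit=0x10000):
    """Collect inclusive (lo, hi) ranges of code points whose general
    category starts with `prefix` (e.g. "L" or "Lu").

    Hypothetical helper for illustration; `limit` is exclusive and
    defaults to the end of the BMP.
    """
    ranges = []
    start = None
    for cp in range(limit):
        match = unicodedata.category(chr(cp)).startswith(prefix)
        if match and start is None:
            start = cp                       # a new range begins here
        elif not match and start is not None:
            ranges.append((start, cp - 1))   # the range ended at cp - 1
            start = None
    if start is not None:
        ranges.append((start, limit - 1))    # range runs up to the limit
    return ranges


if __name__ == "__main__":
    # Unicode version of the database this interpreter ships with.
    print(unicodedata.unidata_version)
    # The range of category L that contains U+0561, if any.
    print([r for r in category_ranges("L") if r[0] <= 0x0561 <= r[1]])
```

Running this under interpreters with different `unicodedata.unidata_version` values and diffing the output would show how many ranges changed between versions.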
I have regenerated unicode_categories.re and the tests with the old Haskell script: e3ec259, and I can already see that the nontrivial changes to the character ranges are the same as in the Python script. What remains is to sort and fix single-character classes. I'll try to do that ASAP, but I'm traveling in the next few days so it may have to wait a bit.
I'm really struggling to figure out how to rebase and get rid of the extra commit. Apparently something I did closed this PR. Git is such a walking disaster...
It can be very confusing. GitHub auto-closed the PR because you pushed a commit saying "Merge pull request ccleve#1 from skvadrik/master" (it's a GitHub feature, not a git one). I wonder if this can be configured in the settings (to stop GitHub from being "smart" and closing PRs / bugs based on keywords). So what you need to do now to get a nice linear history without merge commits is:
And after I merge my remaining work (sort + fix dupes), you will need to rebase your work on top of mine:
Thanks, but I'm not seeing the same commits after the git rebase command, and I can't take the time to figure this out. I recommend just closing or deleting this PR, and then copying and pasting my script into your repo. Life's too short to deal with git nonsense. On another matter: it seems that it's not easy to get Python to spit out other Unicode character properties, like Word Break or Script. I haven't found a good module that can do them, and I don't want to parse UnicodeData.txt myself. I ended up writing code in Java to generate these files because Java and ICU4J support is really, really good. I'm happy to contribute Java code to generate the files, although it will be a lot harder for users to use Java than Python.
Ok, I'll see what I can do. Thanks for the script!
Right, a Java script has the same problem as a Haskell script: it may be nontrivial to run (depending on the developer's environment).
Addresses #235, #423
This seems to work. The format of the new file is a little different: the char classes are sorted, and single-char ranges are replaced by just a single char. For example,
[\u1234-\u1234] -> [\u1234]
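As a minimal sketch of that formatting rule (not the actual code from the script; `format_char_class` is a hypothetical name, and it assumes BMP code points written as lowercase `\uXXXX` escapes), sorting and collapsing could look like this:

```python
def format_char_class(ranges):
    """Render inclusive (lo, hi) code point pairs as a character class.

    Ranges are sorted, and a range whose endpoints coincide is written
    as a single character instead of lo-hi.
    """
    def esc(cp):
        # Assumption: BMP code points only, rendered as \uXXXX.
        return "\\u%04x" % cp

    parts = []
    for lo, hi in sorted(ranges):
        parts.append(esc(lo) if lo == hi else esc(lo) + "-" + esc(hi))
    return "[" + "".join(parts) + "]"


if __name__ == "__main__":
    print(format_char_class([(0x1234, 0x1234)]))                    # [\u1234]
    print(format_char_class([(0x0561, 0x0587), (0x0041, 0x005A)]))  # [\u0041-\u005a\u0561-\u0587]
```

Sorting the pairs before rendering keeps the generated file stable from run to run, which makes diffs between regenerated files easier to review.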