Add script to regen unicode file #425

Closed
ccleve wants to merge 3 commits


@ccleve ccleve commented Nov 2, 2022

Addresses #235, #423

This seems to work. The format of the new file is a little different: the char classes are sorted, and single-char ranges are replaced by just a single char. For example,

[\u1234-\u1234] -> [\u1234]
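
For illustration, the normalization described above (sorting the ranges and collapsing single-character ranges) could look roughly like this in Python. This is only a sketch, not the script from this PR: `escape` and `format_class` are hypothetical helpers, and the `\uXXXX` / `\UXXXXXXXX` escape widths are assumptions about the output format.

```python
def escape(cp):
    """Escape one code point (assumed format: \\uXXXX, or \\UXXXXXXXX above the BMP)."""
    if cp <= 0xFFFF:
        return "\\u%04x" % cp
    return "\\U%08x" % cp


def format_class(ranges):
    """Render inclusive (lo, hi) code point pairs as a sorted character class."""
    parts = []
    for lo, hi in sorted(ranges):
        if lo == hi:
            # A single-character range collapses to just that character.
            parts.append(escape(lo))
        else:
            parts.append(escape(lo) + "-" + escape(hi))
    return "[" + "".join(parts) + "]"


print(format_class([(0x1234, 0x1234)]))  # [\u1234]
```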

@@ -0,0 +1,94 @@
#!/usr/bin/python
Owner

#!/usr/bin/env python

Owner

@skvadrik skvadrik left a comment

Thanks for taking care of this! A few things:

  • Please rebase to get rid of the merge commit (should be trivial)

  • We need to make sure the script is not installed (I think it currently is, as part of the include subdirectory).

  • Looking at the changes in the generated files, they seem to be more substantial than just sorting and single-character classes.

    • The L_ category (whatever it was) has disappeared; this breaks backwards compatibility.
    • I looked at the beginning of the L category and I see that the range \u0561-\u0587 is replaced by \u0560-\u0588. There may be many more such changes. We need to understand how many there are and whether they are correct. I suspect they may already be fixed in the Haskell Data.Charset library that I used to generate the old unicode_categories.re. Therefore let me first regenerate and commit the files with the Haskell script (I will also sort and fix single-character classes), and then let's move to your script. I'll try to do this by the end of today.
  • Ideally, it would also be good to generate the test files with the same Python library as the categories and get rid of the Haskell script (not that it doesn't work, but it adds an extra dependency and is generally more difficult for people to run than Python). But I can take care of it later.
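
To make range differences like the one above easier to check, here is a minimal sketch of deriving category ranges with Python's standard `unicodedata` module. It is not the script from this PR; `category_ranges` is a hypothetical helper. The result depends on the Unicode database bundled with the interpreter (reported by `unicodedata.unidata_version`), so two generators built against different Unicode versions can legitimately disagree on range boundaries.

```python
import sys
import unicodedata


def category_ranges(prefix):
    """Return sorted, inclusive (lo, hi) ranges of code points whose
    general category starts with `prefix` (e.g. "L" or "Lu")."""
    ranges = []
    start = None
    for cp in range(sys.maxunicode + 1):
        inside = unicodedata.category(chr(cp)).startswith(prefix)
        if inside and start is None:
            start = cp                      # a new range begins here
        elif not inside and start is not None:
            ranges.append((start, cp - 1))  # the range ended at the previous code point
            start = None
    if start is not None:
        ranges.append((start, sys.maxunicode))
    return ranges


print("Unicode database version:", unicodedata.unidata_version)
print("Number of ranges in L:", len(category_ranges("L")))
```

Comparing the output of such a helper against the ranges in an existing generated file, range by range, would show how many boundaries moved.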

Owner

skvadrik commented Nov 4, 2022

I have regenerated unicode_categories.re and the tests with the old Haskell script: e3ec259, and I can already see that the nontrivial changes to the character ranges are the same as in the Python script. What remains is to sort and fix single-character classes. I'll try to do that ASAP, but I'm traveling for the next few days so it may have to wait a bit.

Author

ccleve commented Nov 6, 2022

I'm really struggling to figure out how to rebase and get rid of the extra commit. Apparently something I did closed this PR. Git is such a walking disaster...

@ccleve ccleve reopened this Nov 6, 2022
Owner

skvadrik commented Nov 7, 2022

> I'm really struggling to figure out how to rebase and get rid of the extra commit. Apparently something I did closed this PR. Git is such a walking disaster...

It can be very confusing. GitHub auto-closed it because you pushed a commit saying "Merge pull request ccleve#1 from skvadrik/master" (it's a GitHub feature, not a git one). I wonder if this can be configured in the settings (to stop GitHub from being "smart" and closing PRs / bugs based on keywords).

So what you need to do now to get nice linear history without merge commits is:

And after I merge my remaining work (sort + fix dupes), you will need to rebase your work on top of mine:

  • git pull --rebase skvadrik master

Author

ccleve commented Nov 7, 2022

Thanks, but I'm not seeing the same commits after the git rebase command, and I can't take the time to figure this out.

I recommend just closing or deleting this PR, and then copying and pasting my script into your repo. Life's too short to deal with git nonsense.

On another matter: it seems that it's not easy to get Python to spit out other unicode character properties, like Word Break or Script. I haven't found a good module that can do them, and I don't want to parse unicodedata.txt myself. I ended up writing code in Java to generate these files because Java and ICU4J support is really, really good. I'm happy to contribute Java code to generate the files, although it will be a lot harder for users to use Java than Python.

Owner

skvadrik commented Nov 7, 2022

> I recommend just closing or deleting this PR, and then copying and pasting my script into your repo. Life's too short to deal with git nonsense.

Ok, I'll see what I can do. Thanks for the script!

> On another matter: it seems that it's not easy to get Python to spit out other unicode character properties, like Word Break or Script. I haven't found a good module that can do them, and I don't want to parse unicodedata.txt myself. I ended up writing code in Java to generate these files because Java and ICU4J support is really, really good. I'm happy to contribute Java code to generate the files, although it will be a lot harder for users to use Java than Python.

Right, a Java program has the same problem as a Haskell script: it may be nontrivial to run (depending on the developer's environment).
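
For reference, properties such as Script and Word Break are published by the Unicode Character Database as plain-text files (for example `Scripts.txt` and `WordBreakProperty.txt`) with lines of the form `code point or range ; value # comment`. A rough Python sketch of reading that line format is below. It is an assumption-laden illustration, not code from this PR: `parse_property_lines` is a hypothetical helper, and the sample lines are written in the style of `Scripts.txt` rather than taken from a specific Unicode version.

```python
import re
from collections import defaultdict

# One data line: "XXXX" or "XXXX..YYYY", then ";", then the property value.
LINE_RE = re.compile(r"^([0-9A-Fa-f]+)(?:\.\.([0-9A-Fa-f]+))?\s*;\s*(\w+)")


def parse_property_lines(lines):
    """Map each property value to a sorted list of inclusive (lo, hi) ranges."""
    props = defaultdict(list)
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue                          # blank or comment-only line
        m = LINE_RE.match(line)
        if m is None:
            continue                          # not a data line in the assumed format
        lo = int(m.group(1), 16)
        hi = int(m.group(2), 16) if m.group(2) else lo
        props[m.group(3)].append((lo, hi))
    return {name: sorted(ranges) for name, ranges in props.items()}


# Illustrative sample in the style of Scripts.txt.
SAMPLE = """\
# Sample lines
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0061..007A    ; Latin # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
00AA          ; Latin # Lo       FEMININE ORDINAL INDICATOR
0370..0373    ; Greek # L&   [4] GREEK CAPITAL LETTER HETA..GREEK SMALL LETTER ARCHAIC SAMPI
"""

for name, ranges in parse_property_lines(SAMPLE.splitlines()).items():
    print(name, ranges)
```

Whether a parser like this is preferable to a Java/ICU4J generator is a separate trade-off; it only avoids the extra runtime dependency, and the data files would still have to be downloaded for the desired Unicode version.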

@skvadrik skvadrik closed this Nov 7, 2022