« Wide character in die at -e line 624 » with some Unicode characters in outdir’s path #474
Comments
I've been able to reproduce this. I needed to be on linux in a directory that is on an ext4 file system. It appears that when biber tries to open the .blg file, the name it uses is in NFD instead of NFC, even when the directory name is specified in the NFC form. This causes exactly the error message shown when the directory name contains an accented character. The actual listing of the error message in the bug report ("Can't open résultats/test.blg") is in NFC, presumably because of something done by a pasting operation in a web browser.
On macOS with APFS (which is normalization insensitive, but normalization preserving), when the directory name does not contain an accented character, but the base name of the .tex file does contain an accented character, then the name of the .blg file is in NFD. In contrast, the .bbl filename is in NFC. This is given that the name of the .tex file is in NFC.
The version of biber is 2.20 (in TeXLive 2024). On combinations of OS and file system (e.g., macOS and APFS) that are insensitive to Unicode normalization of filenames, latexmk invoked as in the bug report does not raise an error. |
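For anyone who wants to reproduce the Linux failure mode in isolation, here is a minimal Perl sketch (not biber's code; the directory and file names are just examples). On ext4, the NFD spelling of an NFC-named directory refers to a directory that does not exist, so the open fails in the way described above:
use strict;
use warnings;
use utf8;
use Encode qw(encode);
use Unicode::Normalize qw(NFC NFD);

# Create a directory with an NFC name, as a user would normally type it.
my $dir = NFC('résultats');
mkdir encode('UTF-8', $dir) or die "mkdir failed: $!";

# Trying to open a file under the NFD spelling of the same name fails on ext4,
# because the NFC and NFD byte sequences name different directories.
my $blg = encode('UTF-8', NFD($dir) . '/test.blg');
open(my $fh, '>', $blg) or die "Can't open $blg: $!";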
Looks like I forgot to NFC the filename. |
I tried but I'm not familiar enough with Perl to be able to generate the executable to test from the sources! Sorry ! 😅 |
I tried the 2.21.beta version, and it worked, provided that the file and
directory names on linux were all NFC.
But on a Unicode-normalization-sensitive system, it now fails if the names
aren't NFC. That's unlikely to be a problem for most users in Western Europe,
since typical keyboard layouts produce pre-composed characters, i.e., NFC, so
they will create files with NFC names, at least if the files are created
within the linux system itself.
However, on macOS, suppose I have a file or directory whose name is NFC. Then
I rename the file in the Finder, without even touching the non-ASCII
characters. After the rename, the name is NFD! I've seen complaints about
that on the web. (Korean users seem to be particularly bothered.) Command
line commands (mv, etc) don't have this problem.
Luckily, at least by default, macOS and its file systems are
Unicode-normalization insensitive, so this issue doesn't seem to be too big a
deal for our purposes. But transferring the files to linux could cause all
kinds of interesting anomalies! It might be useful to have a little script to
rename all files and directories to have a particular normalization. Perhaps
one already exists.
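As a sketch of such a script (hypothetical, not an existing tool; it assumes the on-disk names are valid UTF-8), something along these lines would rename everything under a directory to NFC:
use strict;
use warnings;
use File::Find;
use Encode qw(decode encode);
use Unicode::Normalize qw(NFC);

my $root = @ARGV ? $ARGV[0] : '.';
# finddepth visits children before their parents, so directories can be
# renamed safely after their contents have been handled.
finddepth(sub {
    my $old = $_;    # basename, as bytes from the file system (cwd is the parent)
    my $new = encode('UTF-8', NFC(decode('UTF-8', $old)));
    if ($new ne $old) {
        rename $old, $new or warn "Cannot rename $old: $!\n";
    }
}, $root);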
Pdflatex, at least on TeXLive 2024, preserves Unicode normalization from the
.tex filename to the names of generated files, and the same applies to latexmk.
I haven't tried this with xelatex and lualatex, but I would conjecture they
have the same behavior.
Would it not be better for biber to preserve the normalization of what's on the
command line, since that better matches the behavior of the other programs
involved? (With latexmk I went through an initial phase of thinking the
internal use of NFD would be a good idea; there are recommendations that that
is the "correct" thing to do. But that led to a minefield of other
complications, so I abandoned that.) What problems would that changed behavior
lead to?
John
…On 3/29/24 11:43 AM, plk wrote:
Looks like I forgot to NFC the filename. |biber| is all NFD internally and it
should NFC everything on output but it looks like this was missed. Can you try
|biber| 2.21 DEV version from SF?
|
Well, you have to use NFD internally because there are lots of tricky things that have to be done with independent combining chars etc. I can, however, have a look at preserving filenames from the form of the .tex file. |
Please try 2.21 from SF again. |
On 4/13/24 2:19 PM, plk wrote:
> Well, you have to use NFD internally because there are lots of tricky things
> that have to be done with independent combining chars etc. I can, however,
> have a look at preserving filenames from the form of the .tex file.
For the textual content of things like the author fields in .bib files, I agree
that the internal use of NFD is suitable. That's because for ordinary text,
characters that differ by normalization are intended to be equivalent.
But for filenames, things are entirely different. The combinations of Windows
with NTFS and FAT32, and linux with ext4 (and IIRC FAT32) are all normalization
sensitive. E.g., in these cases it's perfectly possible to have two different
files whose names are identical except for Unicode normalization.
So preservation of Unicode normalization of filenames is compulsory, as far as
I can see. There are effectively two different worlds of strings: Those for
ordinary text and those for filenames.
Of course, if you are typing filenames in Windows and linux, you are probably
going to get only NFC filenames, at least with standard keyboard layouts for
many Western European languages.
But it's easily possible to get NFD filenames if you generate the files on
macOS and transfer by a normalization-preserving method (e.g., in a zip file
from macOS to unix). That's because GUIs in macOS coerce filenames to NFD.
That doesn't matter much on macOS, since by default it is insensitive to the
Unicode normalization of filenames. But once the files are on linux or
Windows, there are complications.
John
|
On 4/14/24 10:27 AM, plk wrote:
> Please try 2.21 from SF again.
Sorry, but it doesn't work. I see at least the following problems
1. When I run this version of biber with a bcf file named NFC-café.bcf (with
NFC coding), the bbl file has the name NFC-cafÃ©.bbl. I conjecture that what
has happened is that Perl's encode subroutine was applied to a string that was
already UTF-8 encoded.
I can reproduce this kind of situation in a Perl script if I do
use utf8;
use Encode qw(encode);
my $orig = 'NFCé';
my $enc1 = encode( 'UTF-8', $orig );   # correct UTF-8 byte string for 'NFCé'
my $enc2 = encode( 'UTF-8', $enc1 );   # wrong: re-encodes the already encoded bytes
The string $enc2 has content that is the UTF-8 encoding of 'NFCÃ©'. The string
$enc1 has the correct UTF-8 encoding of the original string.
2. The same error occurs in the .blg file for the strings giving the name of the
.bbl file. I've attached a zip file containing an example.
3. If the OS is linux, and the bcf file is in a directory named résultats,
this version of biber, just like 2.20, still tries to write a .blg file whose
name is the NFD version of what it should be writing. That gives a fatal
error, since there is no directory whose name is the NFD version of 'résultats'.
John
|
I will have a look - I suspect that the |
Can you please try 2.21 dev again from SF? |
Sorry for my delay in replying, and dropping the ball on this. I hadn't checked this site, which I should have done, and didn't see the Apr 27 message until Benjamin recently pointed it out. I've just tried the current 2.21 beta (as of 19 Sep 2024), and find the following:
In the simplest cases on Linux and Windows, files and directories have NFC names, in which case there is no problem. However, the situation with mixed normalization can arise in practice in a cross-OS situation:
I suspect it would be useful to have a little utility to rename files to be pure NFC (or pure NFD). |
I put in a fallback for 2.21 dev which selects an appropriate default for mixed form, depending on OS (NFD for mac, NFC for everything else). |
I've tried the new version. It still doesn't solve the problem. Even so, this version is an improvement on the 2.20 release, which doesn't work when the name of the bcf file contains an accented character coded as NFC. First, on macOS, there's actually no need to set a normalization form for file names, since the file systems are normalization insensitive. On Linux (with its default file system ext4 at least), there are two use cases for the command line to biber:
I've tried this on a file with the name NFC('NFC-résultats').'/'.NFD('NFD-café.bcf'), in Perl notation. This is a name that could well occur in practice (see my comment from yesterday). Here's the result
Both the .blg and .bbl files are written, but with the base name being NFC('NFD-café'). Because of the error, the .bbl file is incorrect (zero length). The line of the error message has exactly the same mixed coding as on the command line. (Note that the github web interface coerces that to NFC in this comment.)
The Perl error on the line above indicates that there is a coding problem in line 57 of Screen.pm. Presumably a decoded Unicode string was passed to print, but no coding system was specified for stdout, and the Unicode string contains one or more code points above 255. This conjecture matches the situation for the given filename, since the string contains a COMBINING ACUTE ACCENT, whose decimal code point is 769.
It's perhaps worth adding that there are some related issues with v. 2.19 of biber in all but the case with pure NFC filenames. In addition, 2.19 mangles the coding of some of its output to at least the screen and the .blg file. V. 2.21 (beta) appears not to do this, so it is an improvement. (I verified this just now by switching to TeX Live 2023.) |
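As an aside, the print-without-an-encoding-layer situation conjectured above is easy to demonstrate in isolation with a few lines of Perl (a sketch, not Screen.pm's actual code):
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFD);

my $name = NFD('NFD-café.bcf');      # contains COMBINING ACUTE ACCENT, code point 769
print "$name\n";                     # warns: Wide character in print

binmode STDOUT, ':encoding(UTF-8)';  # declare an encoding layer for stdout
print "$name\n";                     # now prints cleanly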
I think it is likely impossible to fix this - when you read in a file and the path has mixed NFD/NFC, you can't write the path in mixed mode (without a great deal of really hacky messing about). Right now, |
I don't understand what is meant by "you can't write the path in mixed mode". I know explicitly that in Perl, if I create a file whose name has mixed normalization using Perl's open function, the on-disk filename is exactly the one specified. This is at least true on:
In the Ubuntu and Windows 11 cases, I can have different files whose names only differ by normalization. (On macOS, with its older HFS+, the filename is coerced to NFD, independently of the string given to the open function.) The default situation for APFS on macOS is that it is (a) insensitive to both normalization and case, and (b) preserves both. This is my experience. The only reasonably official documentation I was able to find quickly is https://developer.apple.com/library/archive/documentation/FileManagement/Conceptual/APFS_Guide/FAQ/FAQ.html That document is an old one, which says it is "retired". But its statements about normalization-insensitivity match my current experience. That's for macOS. I remember reading about a different situation on iOS, but I've no experience, and I think it is irrelevant here. |
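A quick way to check which kind of file system one is dealing with is a small probe like the following (a sketch; the probe filename is arbitrary):
use strict;
use warnings;
use utf8;
use Encode qw(encode);
use Unicode::Normalize qw(NFC NFD);

my $nfc = encode('UTF-8', NFC('probe-café.tmp'));
my $nfd = encode('UTF-8', NFD('probe-café.tmp'));

open(my $fh, '>', $nfc) or die "Cannot create probe file: $!";
close $fh;
print -e $nfd
    ? "normalization-insensitive: the NFD name resolves to the NFC file\n"
    : "normalization-sensitive: the NFD name is a different (nonexistent) file\n";
unlink $nfc;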
It's definitely possible to fix the problem. Existence proofs:
Literally all the difficulties with Unicode and latexmk that I encountered were in dealing with the code page issue on Windows. That's only because of deficiencies in current versions of native Windows perl interpreters. (As you know with biber, the problems can be solved by setting the Windows system locale to use UTF-8.) [It may be useful to copy the relevant part of the latexmk solution for the code page issue into biber (and to copy into latexmk the things biber does with the use of the wide, i.e., Unicode, interface to the Windows file system).] |
A further comment about macOS, APFS and normalization: If you do a Google search for APFS and normalization, most of the hits you get contain statements that are quite misleading, if not wrong. I suspect there was a change in the implementation of APFS after the first version was released, and the comments seen refer to the old version. |
I've been able to incorporate a version of latexmk's treatment of Windows code pages, so that biber now works on Windows when filenames contain non-ASCII characters, independently of the setting of the System locale/code page. It continues to work on Linux (if file and directory names are NFC) and on macOS. I need to clean up my code before I submit it. |
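For the record, the general idea is roughly the following (only a sketch of the approach, not the code that was actually submitted; it assumes the standard Win32 module is available): query the ANSI code page that governs byte-string filenames, and switch console output to UTF-8.
use strict;
use warnings;

if ($^O eq 'MSWin32') {
    require Win32;
    my $acp = Win32::GetACP();          # ANSI code page used for byte filenames
    my $fs_encoding = "cp$acp";         # e.g. 'cp1252', or 'cp65001' for a UTF-8 locale
    Win32::SetConsoleOutputCP(65001);   # let the console accept UTF-8 output
    binmode STDOUT, ':encoding(UTF-8)';
    binmode STDERR, ':encoding(UTF-8)';
    print "File-system code page: $fs_encoding\n";
}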
Would you be interested in helping develop/support |
So, we should basically not touch any normalisation for filenames, just for file contents. Now, getting back into this, I've removed all the messing about from the filename (not file content) normalisation which should be better but I think that leaves us with a problem with the |
I'll merge my changes into the version you've made. That should solve a lot of problems. In addition to handling the Windows code page issue, I found the following: The calls to the file system for opening files tended to use decoded strings rather than encoded byte strings. Generally that gives problems, which become particularly visible on Windows. So I corrected all the cases I've found so far. Perhaps you should hold off on further changes until I send mine to the repository. (My changes affect biber itself, Biber/Config.pm, and Biber/Output/base.pm.) |
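To illustrate the decoded-versus-encoded point (a generic sketch, not the actual corrected biber code; UTF-8 is assumed as the file-system encoding):
use strict;
use warnings;
use utf8;
use Encode qw(encode);

my $name  = 'café.bbl';              # decoded (character) string
my $bytes = encode('UTF-8', $name);  # explicit byte string for file-system calls

# open(my $fh, '>', $name);          # risky: the on-disk bytes depend on Perl's
                                     # internal string representation and the platform
open(my $fh, '>', $bytes)            # predictable: the on-disk name is exactly these bytes
    or die "Cannot open $bytes: $!";
close $fh;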
@jccollins - I added you to |
I pushed my updates to the repository. In the testfiles directory I added a directory of files with non-trivial Unicode names. They are useful for tests. In the bin directory there's an extra executable that I use for exercising my code for handling codepage issues (by an added module CodePage.pm). Could you check how things are working? This is my first time uploading to a github repository. |
My new version worked for most of my tests. It failed only on linux when given the hard case of mixed-normalization filenames. (Probably the same issue would arise on Windows when the UTF-8 system locale is used.) I've tracked the problem down to Utils.pm, where some normalization of filenames is done. That file may have some other issues. |
I fixed a test issue and rearranged some things a bit - there's no need to put anything in comments about in-progress etc. in the files, we can just use github comments for that. The standard regression test suite passes now. I changed the version back to "2.21" as some install scripts rely on the format. I think you're right that we will have to look at the Utils.pm subs which mess about with filename normalisation. |
I've updated Utils.pm. Things now work in all the tests I've done. What I did:
I preferred to leave in some comments about remaining difficulties. I find it a lot easier, when I am working on a file, to have such remarks in the file. I also added a zip file of the Unicode-tests directory. That's better than the contents of the directory, whose normalization tends not to survive the round trip to and from github. A .tar file also won't work, because on macOS it doesn't always preserve normalization (as I found by trying it). Perhaps we should remove the Unicode-tests directory, since the normalization of the filenames tends not to survive the various kinds of processing. |
Feel free to remove the unicode files if there is a ZIP for testing - I pulled the |
Done. |
I've made some minor changes in CodePage.pm and Config.pm:
As far as I can tell, things are now working correctly for files specified on the command line to biber. There are the following restrictions:
The remaining problems I know of concern the names of .bib files specified in the .bcf file. The problematic situation is when the filename or glob pattern supplied to |
Good job for noticing this, which could easily happen to me. |
What I've diagnosed is that when biber reads the .bcf file it coerces the filenames and patterns for the datasources to NFD, instead of the NFC that is normal for Western European languages on Windows and Linux. This will need to be corrected, of course. Philip knows the relevant part of the code (I think it's in Biber/Input/file/bibtex.pm and Biber/Input/file/biblatexml.pl). Then what happens is that *latex writes the intended name correctly to the .bcf file, but biber in effect misreads it. So biber ends up looking for a differently named file on disk than the one the user intended. (On both Windows and Linux, it is unfortunately perfectly possible to have two or more distinct files with names that differ only in the normalization form for accented characters, so that the names are visually identical!) My general approach in this and similar cases is, if possible, to choose a glob pattern that doesn't contain any of the problematic characters, but which is chosen to uniquely identify the file. The current development version of biber treats that correctly. |
I've modified the subroutine glob_data_file in Utils.pm so that when it does a glob, it does a glob on the NFC and NFD variants of the glob pattern, as well as the original pattern. (But for each of the variants, it omits the NFC glob if the original pattern is already NFC, and similarly for NFD.) This matches the behavior of the file_exist_check. This change appears to cover the most important cases when the argument to
I've also added a directory containing the .bib and .tex files that I used for tests, together with a Perl script to do the tests. The subroutine glob_data_file still contains the statements for printing diagnostic messages about the strings involved. They make it easy to see that the filename argument to the subroutine is coerced to NFD, but that files with names of the opposite NF are now found. |
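In outline, the modified globbing looks something like this (a simplified sketch, not the actual glob_data_file code; the UTF-8 encoding of the pattern is an assumption):
use strict;
use warnings;
use Encode qw(encode);
use Unicode::Normalize qw(NFC NFD);
use File::Glob qw(bsd_glob);

sub glob_all_normalizations {
    my ($pattern) = @_;              # a decoded (character) glob pattern
    my @variants = ($pattern);
    push @variants, NFC($pattern) if NFC($pattern) ne $pattern;
    push @variants, NFD($pattern) if NFD($pattern) ne $pattern;
    my %seen;
    # Glob each variant and remove duplicate matches.
    return grep { !$seen{$_}++ }
           map  { bsd_glob(encode('UTF-8', $_)) } @variants;
}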
I've looked more into the problem about NFD being applied to the file/pattern names that were given to
What happens is that parse_ctrlfile slurps in the contents of the .bcf file, immediately converts it to NFD, and then parses the result. See lines 421--423:
For actual text fields, this is appropriate, I imagine, but not for filename fields. At this point I don't see a nice way of fixing things to preserve normalization of filenames while converting everything else to NFD. (Of course, there are ugly hacky ways.) In the big picture, it may be unimportant to do better than what's in the code at the moment. It handles the simplest cases that are likely to arise in practice, like the example that started this bug report. In any case, if a user has trouble with non-ASCII filenames, there is always the standard advice that for maximum portability, one should restrict file names to the ASCII characters a-z, 0-9 and -, and leave Unicode stuff to the contents of .tex and .bib files. |
We could simply get the filenames and save them before the NFD call? |
I've seen a better possibility. This is simply to read the .bcf file line by line, preserve the datasource lines as is, and apply NFD only to the other lines. Finally the lines are concatenated to get a single string of XML code. This completely avoids the use of File::Slurper, but, of course, at the expense of slower read time. To know whether the slower read time matters, I measured the processing times for reading and applying NFD to the 150 kB .bcf file for a 600+ page book of mine. The difference between using Slurper and Perl's usual line-by-line reading methods is in the millisecond range (on a modern MacBook Air). That is dwarfed by the 2+ sec total time for biber. I'll work on a fix. |
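In outline, the approach described here would look something like the following (a sketch only; the element name used to recognise datasource lines is an assumption, not necessarily biber's exact markup):
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

sub slurp_bcf_selective_nfd {
    my ($path) = @_;
    open(my $fh, '<:encoding(UTF-8)', $path) or die "Cannot open $path: $!";
    my $xml = '';
    while (my $line = <$fh>) {
        # Leave datasource lines untouched so filename normalization is preserved;
        # normalize everything else to NFD, as before.
        $xml .= ($line =~ /<bcf:datasource\b/) ? $line : NFD($line);
    }
    close $fh;
    return $xml;
}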
@jccollins - can we take out any raw "say" statements in |
Have a look at the new line in |
Your solution as a one-liner is nice, except that the regex for splitting the contents of the bcf file into lines needs correction. The original regex was
So I changed that to
As far as I can see, the result is correct. (I've done a few stress tests on Linux.) |
@plk About the raw "say" statements: The problem is that CodePage needs to do its thing with code pages etc. very early, before anything else needs to use the results. This includes knowledge of the encoding of filenames in interactions with the file system, and the setting of the console to use UTF-8 (i.e., CP 65001). I see a lot of initialization associated with the use of Log4Perl. So CodePage will run earlier than this initialization, and may therefore be unable to use Log4Perl directly. Correct me if I've misunderstood. On the other hand, the messages that CodePage writes are mostly informational. I've found them helpful so that I know what's happening and when. But now that I'm happy the code is working as intended, I think the actual writing of the messages by CodePage is no longer important. However, they can be important for debugging. So I propose that CodePage should simply save the messages in arrays: one for informational messages and one for warnings. Then after
In the rest of the biber code, I see several ways of sending messages via Log4Perl. What would be appropriate here? |
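A module-style sketch of that proposal (the package and subroutine names here are illustrative, not the actual CodePage.pm interface):
use strict;
use warnings;

package CodePageMessages;

my (@info, @warnings);

sub note_info { push @info,     @_ }
sub note_warn { push @warnings, @_ }

# Called once the Log::Log4perl logger has been set up.
sub flush_to_logger {
    my ($logger) = @_;
    $logger->info($_) for @info;
    $logger->warn($_) for @warnings;
    @info = @warnings = ();
}

1;
Biber's start-up code could then call CodePageMessages::flush_to_logger($logger) once the logger exists, so nothing is written raw to the console in the meantime.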
Should be fine about the |
Sorry, this is a long comment, but I think I need to explain some details. Unfortunately, the new version doesn't work correctly on Windows. It reports
The problem is simply that
and the
The initialization code in the present (and previous) versions of
What the initialization code does is:
I see a way of arranging delayed execution of the initialization code, in a way which would have a number of advantages. But there's still a difficulty with biber's start up code, which will need significant changes. The relevant steps in the current code are:
I think it would be simpler not to decode
I think it would be nice to avoid having
Another optional argument could indicate whether informational messages about code pages are to be given at all. One complication is that it is possible to inadvertently call routines like |
I've another idea for CodePage that I'm going to try out. |
I've changed
The relevant information is now put into the tracing information in the logger by other modules. I've corrected a bug in
As far as I can tell, the new version is working properly in my stress tests for non-ASCII file names on all of Windows, Linux and macOS. With my inexperience with git, I got into trouble with your earlier commits today. I think I've corrected the problems. But you should double-check. |
Looks fine - all regression tests pass. |
I've made a small change in
This avoids the possibility that the user's use or non-use of the
Does this sound right? |
Yes, I suspect you are right about this - the CodePage solution is a more general solution than the --winunicode option. Once you are happy with that, I can change the documentation to say this is automatic now. |
I'm happy with how it's working, so you can update the documentation. The only question I have is what should happen with the --winunicode option. |
By the way, there's a standard perl module Encode::Locale that finds the locale settings, but, as far as I can tell, more comprehensively than CodePage. As far as I know Encode::Locale is part of a standard perl installation. It may be worth changing CodePage to use Encode::Locale as much as possible. I.e., CodePage's purpose is now to provide extra functionality, like setting the Windows CP for console output to UTF-8, and providing convenient utility subroutines for use in biber. Encode::Locale finds the Windows code pages the same way as CodePage, but it is more general, since it also deals with non-UTF-8 locales on Unix systems. Of course, on linux and macOS, anything but a UTF-8 locale is surely unusual nowadays, unlike Windows. |
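For illustration, usage would be roughly as follows (a sketch; Encode::Locale is a CPAN module, and the aliases shown are the ones its documentation defines):
use strict;
use warnings;
use utf8;
use Encode::Locale;               # defines aliases 'locale', 'locale_fs', 'console_out', ...
use Encode qw(encode decode);

my $fs_bytes = encode('locale_fs', 'café.bib');   # bytes suitable for file-system calls
binmode STDOUT, ':encoding(console_out)';         # console-appropriate output encoding
print decode('locale_fs', $fs_bytes), "\n";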
But it's maybe best to leave well alone. |
Hello,
It seems that the latest version of biber has problems with some Unicode characters in the path (outdir of latexmk). Strangely, not all Unicode characters have this problem, and John Collins was unable to reproduce this behavior on his system.
I'm on Linux Manjaro, with the latest version of TeX Live 2024 (updated yesterday). The 2023 version, and the 2024 version at the very beginning of the year, did not have this problem, which appeared when I updated everything yesterday.