Unicode DeviceSQL strings in PDB are UTF-16-LE, not UTF-16-BE #24

Open
brunchboy opened this issue Jul 12, 2020 · 2 comments


@brunchboy

I heard from another person who is implementing PDB file parsing. While working with some Russian text, he discovered that we were wrong to think these strings are UTF-16BE: they are actually UTF-16LE. Here is what he said, which I validated by creating a playlist containing the same string in its name:

I could have something off, but here is what I'm seeing about the strings. My PDB includes a Russian song called Покинула чат ("left the chat"). The first letter here is U+041F. All the Cyrillic letters start with 04, but the space between the words is the same U+0020 as in English. Here's how the track name looks in the PDB in hex:

[Screenshot: hex dump of the track name in the PDB file]

If I skip the leading 00 byte and read little endian, I get back the desired "Покинула чат".

If I don't skip and read big endian, I get back the incorrect " окинулаРGат". It gets a lot of the letters right because there is usually a 04 every other byte, but the first letter (which turns out as U+001F, "Information Separator One") and the characters around the space get messed up (because of the momentary switch from a leading 04 to a leading 00).
English titles come out right either way, because the leading 00s for each ASCII character in UTF-16 make it forgiving.
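To make the effect described above easy to reproduce, here is a minimal Python sketch. It does not read a real PDB file: it assumes, based only on the description in this thread, that the text payload is a single 00 byte followed by UTF-16-LE code units. The helper names (`make_payload`, `decode_le_skipping_pad`, `decode_be_unskipped`) are made up for illustration.

```python
def make_payload(text: str) -> bytes:
    # Assumption from this thread: one 0x00 byte, then UTF-16-LE text.
    return b"\x00" + text.encode("utf-16-le")


def decode_le_skipping_pad(payload: bytes) -> str:
    # Skip the leading 0x00 byte and read little endian.
    return payload[1:].decode("utf-16-le")


def decode_be_unskipped(payload: bytes) -> str:
    # Do not skip anything and read big endian. The payload then has an
    # odd number of bytes, so the dangling last byte is dropped.
    even_length = len(payload) // 2 * 2
    return payload[:even_length].decode("utf-16-be")


russian = make_payload("Покинула чат")
print(repr(decode_le_skipping_pad(russian)))  # 'Покинула чат'
print(repr(decode_be_unskipped(russian)))     # '\x1fокинулаРGат'

english = make_payload("Hello")
print(repr(decode_be_unskipped(english)))     # 'Hello'
```

The misread pairs each character's high byte with the low byte of the next character, which is why only the first letter and the characters next to the space change, and why pure ASCII text survives.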

@brunchboy

And this may not be worth dealing with, but the same person who discovered the above also noticed that the first string pointer (which you call str_u1) in track rows is not always empty. When present, it holds the ISRC (International Standard Recording Code) that uniquely identifies the track.

"str_u1" / IndexedPioString(0), # empty

String 0 in the Track row string array is the ISRC (International Standard Recording Code). It showed up for me in some new music files I imported after buying them on iTunes. The string encoding here is a bit different: it starts like a 0x90 string with its uint16 length, but that is followed by 0x00 (maybe Mac-only again) and 0x03, and the text after that is actually ASCII-encoded. See the screenshot for USUM72004304:

[Screenshot: hex dump of the ISRC string USUM72004304 in a track row]
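For anyone who does want to handle this variant, the layout described above could be parsed roughly as follows. This is only a sketch under stated assumptions, not tested against a real export: it assumes the uint16 length is little endian and counts the four header bytes (0x90, the two length bytes, and the 0x00), and that a 0x03 byte right after the header marks ASCII text. None of that is confirmed in this thread beyond the byte order of the fields, and `decode_long_string` is a hypothetical helper name.

```python
import struct


def decode_long_string(data: bytes) -> str:
    # Header as described in this thread: 0x90, uint16 length, 0x00.
    kind, length, _pad = struct.unpack_from("<BHB", data, 0)
    if kind != 0x90:
        raise ValueError(f"not a 0x90 string: {kind:#04x}")
    # ASSUMPTION: length is little endian and includes the 4 header bytes.
    body = data[4:length]
    if body[:1] == b"\x03":
        # ASSUMPTION: 0x03 marks the ASCII variant seen with ISRCs.
        # Any trailing NUL bytes are stripped defensively.
        return body[1:].rstrip(b"\x00").decode("ascii")
    # Otherwise treat the body as ordinary UTF-16-LE text.
    return body.decode("utf-16-le")


# Synthetic samples built under the same assumptions (not real PDB bytes).
isrc_body = b"\x03" + b"USUM72004304"
isrc_sample = struct.pack("<BHB", 0x90, 4 + len(isrc_body), 0x00) + isrc_body
print(decode_long_string(isrc_sample))  # USUM72004304

title_body = "Покинула чат".encode("utf-16-le")
title_sample = struct.pack("<BHB", 0x90, 4 + len(title_body), 0x00) + title_body
print(decode_long_string(title_sample))  # Покинула чат
```

One caveat with using 0x03 as the clue: a genuine UTF-16-LE string whose first code unit has 0x03 as its low byte (for example U+0403) would be misdetected, so a real parser would probably need a stronger signal.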

I don’t think I’m going to try to do anything with this part myself; I don’t have any use for the ISRC information (which is not always there anyway), and trying to figure out how to deal with a 90 string that isn’t actually Unicode (maybe the 03 byte that comes before the first character is our clue? what a mess!) seems like too much effort for little value. 😄

@negimeister

I have noticed the same "every second character looks correct" issue with Japanese titles. There definitely is an issue with encoding.
