Enhancement / Question: How 4-byte utf-16 characters are handled in VersionInfo #435

Open
maxzhenzhera opened this issue Nov 14, 2024 · 0 comments

Given

  1. Strings in VersionInfo are encoded as UTF-16-LE
  2. To parse a string in VersionInfo, get_string_u_at_rva is used

    pefile/pefile.py

    Lines 6476 to 6517 in 4b3b1e2

    def get_string_u_at_rva(self, rva, max_length=2**16, encoding=None):
        """Get an Unicode string located at the given address."""
        if max_length == 0:
            return b""
        # If the RVA is invalid let the exception reach the callers. All
        # call-sites of get_string_u_at_rva() will handle it.
        data = self.get_data(rva, 2)
        # max_length is the maximum count of 16bit characters needs to be
        # doubled to get size in bytes
        max_length <<= 1
        requested = min(max_length, 256)
        data = self.get_data(rva, requested)
        # try to find null-termination
        null_index = -1
        while True:
            null_index = data.find(b"\x00\x00", null_index + 1)
            if null_index == -1:
                data_length = len(data)
                if data_length < requested or data_length == max_length:
                    null_index = len(data) >> 1
                    break
                # Request remaining part of data limited by max_length
                data += self.get_data(rva + data_length, max_length - data_length)
                null_index = requested - 1
                requested = max_length
            elif null_index % 2 == 0:
                null_index >>= 1
                break
        # convert selected part of the string to unicode
        uchrs = struct.unpack("<{:d}H".format(null_index), data[: null_index * 2])
        s = "".join(map(chr, uchrs))
        if encoding:
            return s.encode(encoding, "backslashreplace_")
        return s.encode("utf-8", "backslashreplace_")
  3. In the part that does the "decoding", the data is handled in 2-byte chunks, with one chr() call per chunk

    pefile/pefile.py

    Lines 6510 to 6512 in 4b3b1e2

    # convert selected part of the string to unicode
    uchrs = struct.unpack("<{:d}H".format(null_index), data[: null_index * 2])
    s = "".join(map(chr, uchrs))
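For context, here is a minimal sketch of how UTF-16-LE lays out characters as 16-bit code units. It is standalone and does not use pefile; `utf16le_code_units` is a hypothetical helper name chosen for illustration. A character in the Basic Multilingual Plane takes one unit, while a code point above U+FFFF takes two (a surrogate pair).

```python
import struct


def utf16le_code_units(text):
    """Return the 16-bit code units of text encoded as UTF-16-LE.

    Hypothetical helper for illustration, not part of pefile.
    """
    raw = text.encode("utf-16-le")
    return struct.unpack("<{:d}H".format(len(raw) // 2), raw)


# A BMP character occupies a single 16-bit unit (2 bytes).
print([hex(u) for u in utf16le_code_units("A")])

# U+1F600 lies outside the BMP, so it occupies two units (4 bytes).
print([hex(u) for u in utf16le_code_units("\U0001F600")])
```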

Problem

Therefore, if a VersionInfo string contains a 4-byte UTF-16 character (a surrogate pair), it is not handled properly.
It ends up as 2 separate Unicode characters, one forcibly converted from each 2-byte half.
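A small standalone reproduction of what I mean, outside of pefile. It applies the same unpack-and-chr() conversion as the quoted lines to a hand-built byte string (the test character U+1F600 is my own choice) and compares the result with a plain UTF-16-LE decode.

```python
import struct

# 4 bytes of UTF-16-LE: one surrogate pair encoding U+1F600.
raw = "\U0001F600".encode("utf-16-le")

# Same conversion as in the quoted lines: one chr() per 16-bit unit.
uchrs = struct.unpack("<{:d}H".format(len(raw) // 2), raw)
per_unit = "".join(map(chr, uchrs))

# Decoding the bytes as UTF-16-LE combines the pair into one code point.
decoded = raw.decode("utf-16-le")

# Print code points as hex, since lone surrogates cannot be printed safely.
print(len(per_unit), [hex(ord(c)) for c in per_unit])
print(len(decoded), hex(ord(decoded)))
```

The first result is two lone surrogate code points, the second is the single intended character.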

Question

Am I wrong, or am I missing something?
Or should this be fixed in pefile?

I understand that characters taking 4 bytes are probably rare in practice.
But at the end of the day, they are not handled.
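If it should be fixed, one possible direction is to decode the selected bytes with the UTF-16-LE codec instead of calling chr() per unit. This is only a sketch under my own assumptions: `decode_utf16le_units` is a hypothetical helper that does not exist in pefile, and the `"replace"` error handler for malformed input is my choice, so the maintainers may well prefer different handling.

```python
def decode_utf16le_units(data, unit_count, errors="replace"):
    """Decode the first unit_count 16-bit units of data as UTF-16-LE.

    Hypothetical helper, not part of pefile. With errors="replace",
    malformed sequences become U+FFFD instead of raising.
    """
    return data[: unit_count * 2].decode("utf-16-le", errors)


# Hand-built, null-terminated sample: 5 BMP units plus one surrogate pair.
raw = "v1.0 \U0001F600".encode("utf-16-le") + b"\x00\x00"
text = decode_utf16le_units(raw, 7)

# The surrogate pair is now a single code point and encodes cleanly.
print(text.encode("utf-8"))
```

The search for the b"\x00\x00" terminator would not need to change, because neither half of a surrogate pair is ever 0x0000.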
