Enhancement / Question: How 4-byte utf-16 characters are handled in VersionInfo #435

maxzhenzhera · 2024-11-14T19:20:07Z

Given

Strings in VersionInfo are encoded as UTF-16-LE.

To parse a string in VersionInfo, get_string_u_at_rva is used:

pefile/pefile.py

Lines 6476 to 6517 in 4b3b1e2

    
    def get_string_u_at_rva(self, rva, max_length=2**16, encoding=None):
        """Get an Unicode string located at the given address."""

        if max_length == 0:
            return b""

        # If the RVA is invalid let the exception reach the callers. All
        # call-sites of get_string_u_at_rva() will handle it.
        data = self.get_data(rva, 2)
        # max_length is the maximum count of 16bit characters needs to be
        # doubled to get size in bytes
        max_length <<= 1

        requested = min(max_length, 256)
        data = self.get_data(rva, requested)
        # try to find null-termination
        null_index = -1
        while True:
            null_index = data.find(b"\x00\x00", null_index + 1)
            if null_index == -1:
                data_length = len(data)
                if data_length < requested or data_length == max_length:
                    null_index = len(data) >> 1
                    break

                # Request remaining part of data limited by max_length
                data += self.get_data(rva + data_length, max_length - data_length)
                null_index = requested - 1
                requested = max_length

            elif null_index % 2 == 0:
                null_index >>= 1
                break

        # convert selected part of the string to unicode
        uchrs = struct.unpack("<{:d}H".format(null_index), data[: null_index * 2])
        s = "".join(map(chr, uchrs))

        if encoding:
            return s.encode(encoding, "backslashreplace_")

        return s.encode("utf-8", "backslashreplace_")

In the part that does the "decoding", we can see that the data is handled in 2-byte chunks, each converted to a character on its own:

pefile/pefile.py

Lines 6510 to 6512 in 4b3b1e2

    
    # convert selected part of the string to unicode
    uchrs = struct.unpack("<{:d}H".format(null_index), data[: null_index * 2])
    s = "".join(map(chr, uchrs))

Problem

Therefore, if a VersionInfo string contains a 4-byte UTF-16 character (a surrogate pair), it is not handled properly.
Each 16-bit half of the pair is passed to chr() separately, so the result is 2 separate surrogate code points instead of 1 Unicode character.
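
To illustrate the problem outside of pefile, here is a minimal standalone sketch. It reuses the struct.unpack / chr approach from the snippet above on a hand-built byte string (the emoji U+1F600 is just an example I picked, and the standard "backslashreplace" handler stands in for pefile's own "backslashreplace_" handler, whose exact output I am not assuming here).

```python
import struct

# U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 stores it
# as a surrogate pair: two 16-bit code units, 4 bytes in total.
data = "\U0001F600".encode("utf-16-le")

# Same approach as the quoted snippet: treat every 16-bit unit as a character.
units = struct.unpack("<{:d}H".format(len(data) // 2), data)
per_unit = "".join(map(chr, units))

# Decoding the bytes as UTF-16-LE combines the pair into one code point.
decoded = data.decode("utf-16-le")

print([hex(u) for u in units])       # ['0xd83d', '0xde00']
print(len(per_unit), len(decoded))   # 2 1

# Lone surrogates cannot be encoded as UTF-8, so the error handler kicks in
# (standard "backslashreplace" used here for illustration only).
print(per_unit.encode("utf-8", "backslashreplace"))  # b'\\ud83d\\ude00'
print(decoded.encode("utf-8"))                       # b'\xf0\x9f\x98\x80'
```

So the per-unit conversion ends up with two escaped surrogates rather than the UTF-8 bytes of the original character.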

Question

Am I wrong, or am I missing something?
Or should this be fixed in pefile?

I understand that characters taking 4 bytes are probably not encountered often.
But at the end of the day, they are not handled.
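
For discussion, a minimal sketch of one possible direction, assuming the selected bytes can simply be decoded as UTF-16-LE instead of being converted unit by unit. The helper name decode_utf16le_units is hypothetical and not part of pefile. The "surrogatepass" error handler is used on the assumption that unpaired surrogates in malformed input should still pass through as they do today, rather than raise.

```python
def decode_utf16le_units(data: bytes) -> str:
    """Decode UTF-16-LE bytes, combining surrogate pairs into one character.

    Hypothetical helper, not part of pefile. "surrogatepass" lets a lone
    (unpaired) surrogate through as-is instead of raising an error.
    """
    # Drop a trailing odd byte so only whole 16-bit code units are decoded.
    usable = len(data) - (len(data) % 2)
    return data[:usable].decode("utf-16-le", "surrogatepass")


raw = "v1.0 \U0001F600".encode("utf-16-le")
text = decode_utf16le_units(raw)
print(len(text))  # 6: five BMP characters plus one emoji
print(text.encode("utf-8"))
```

In get_string_u_at_rva this would replace only the struct.unpack / chr step; the null-termination search and the final encode call would stay as they are.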
