The XTF-8 codec encodes a hybrid string that may consists of UTF-8 sequences and binary bytes to a valid UTF-8 string, which can then be used in places that require UTF-8, e.g., JSON.
The encoded UTF-8 string can be decoded to the original data, provided that there was no collision (although very unlikely) in the encoding process.
The XTF-8 codec uses the Unicode range U+EF80..U+EFFF to encode any invalid UTF-8 sequence (assume to be binary bytes) in the input hybrid string. The chosen encoding area is located in the Unicode Private User Area (PUA), and has been registered with ConScript Unicode Registry by MirBSD for their OPTU encoding.
The encoding procedure is as follows:
- Decode the input string according to UTF-8.
- For every valid UTF-8 sequence:
- If the Unicode code point is inside the XTF8 encoding area:
- If error handler is "replace" (the default), the conflicting code point is replaced with U+FFFD (Replacement Character), and output the UTF-8 encoded sequence of U+FFFD;
- Otherwise, raise an error and abort the process.
- Otherwise, copy the sequence to the output;
- If the Unicode code point is inside the XTF8 encoding area:
- When come across an invalid UTF-8 sequence:
- Assume the bytes following the previous valid UTF-8 sequence to the current byte are binary data;
- Assert each byte is non-ASCII (i.e., 0x80..0xFF);
- Encode each byte to a Unicode code point within the XTF8
encoding area, i.e.,
codepoint = 0xEF80 | (byte & 0x7F)
; - Output the UTF-8 encoded sequence of the encoded code point.
- Assert the encoded output is a valid UTF-8 string.
The decoding procedure is as follows:
- Decode the input string according to UTF-8.
- For every valid UTF-8 sequence:
- If the Unicode code point is inside the XTF8 encoding area,
decode the code point to a non-ASCII byte, i.e.,
byte = (codepoint & 0x7F) | 0x80
; - Otherwise, copy the sequence to the output;
- If the Unicode code point is inside the XTF8 encoding area,
decode the code point to a non-ASCII byte, i.e.,
- When come across an invalid UTF-8 sequence:
- If error handler is "replace" (the default), the invalid sequence since the previous valid UTF-8 sequence is replaced with U+FFFD (Replacement Character), and output the UTF-8 encoded sequence of U+FFFD;
- Otherwise, raise an error and abort the process.
Security
- MUST NOT ignore the invalid bytes in decoding the input; abortion or replacement should be used instead.
- MUST NOT decode XTF8-encoded code points to ASCII characters (i.e., 0x00..0x7F), avoiding character smuggling.
Python PEP 383 presents another method to losslessly encode strings of mixed text and binary bytes. Every non-decodable byte (0x80..0xFF) will be encoded as lone surrogate code of Unicode range U+DC80..U+DCFF. However, this encoding area is disallowed in UTF-8 and thus the encoded results would still be invalid UTF-8 strings.
The MirBSD Project proposed the OPTU encoding to tackle the similar issue. It uses a block of PUA to encode the non-decodable bytes and thus makes the encoded results valid UTF-8 strings. This would probably cause the fewest security and interoperability problems. There is, however, some possibility of collision with other uses of the same PUA characters.
In reality, the XTF-8 codec implements the OPTU encoding.
This software makes use of the fast DFA-based UTF-8 decoder, developed by Björn Höhrmann and released under the MIT License.