UTF8 encoding #250

dtp555-1212 · 2019-05-22T19:59:16Z

It appears there is a bug in the UTF8 encoding (at least for some characters)...

In the attached file there is a 2-byte UTF-8 character which should be encoded as C3 A9 (if you copy/paste the character into a file by itself, then use od -t x1, you will see that it is indeed C3 A9). The generated parser gets the first byte, C3, right, but then emits 83 as the second target byte. I am using -8 on the command line. (If there is something I am doing wrong, or if there is a workaround, please let me know.)

skvadrik · 2019-05-22T21:05:03Z

Eh, it's a duplicate of #237. The problem is that the re2c -8 option does not give you source-level Unicode support: if you write characters like é in regexp definitions, re2c interprets them as a plain byte sequence (each byte as a single character), not as one Unicode symbol. You have to use "\u00e9" instead.

I realize this is very ugly, difficult to use, confusing and needs fixing.

What exactly happens in the case of é, and how re2c ends up with the C3 83 byte sequence, is explained in great detail in #237 (let me know if you need more clarification).
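For readers without #237 at hand, here is a minimal sketch of the effect described above. It is written in Python, not taken from re2c, and the helper names are made up for illustration. It treats each byte of the UTF-8 encoded literal as its own code point and encodes that code point as UTF-8 again, which is the reading of -8 mode given in this comment.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex, similar to `od -t x1`."""
    return " ".join(f"{b:02X}" for b in data)


def reencode_bytes_as_code_points(source_bytes: bytes) -> bytes:
    """Treat every input byte as a separate code point and UTF-8 encode it."""
    return "".join(chr(b) for b in source_bytes).encode("utf-8")


literal = "\u00e9".encode("utf-8")                # bytes of a literal é in the source file
misread = reencode_bytes_as_code_points(literal)  # each byte read as one code point

print(hex_bytes(literal))  # C3 A9
print(hex_bytes(misread))  # C3 83 C2 A9
```

The first two bytes of the misread result, C3 83, are the UTF-8 encoding of U+00C3, which matches the unexpected 83 from the original report.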

dtp555-1212 · 2019-05-23T00:11:31Z

Thanks for your reply. Obviously there are many Unicode values, not just the one I provided in my example. Do I understand you correctly, that I cannot provide the escaped hex byte sequence. I must use a unicode equivalent. (in this case \u00e9 for the two byte sequence C3 A9)... Is this understanding correct? Will this work for the 3 & 4 byte unicode values as well? (and not only match the character 'visually' but have the expected byte count for utf-8?)

With this understanding, it sounds like I will have to preprocess the input strings to substitute the appropriate unicode encoding prior to processing with re2c. Do you have a suggested tool for that?

Thanks again.

P.S. as you have acknowledged that this needs addressing, do you have a timeframe that it might be implemented?

skvadrik · 2019-05-23T16:41:29Z

> Do I understand you correctly, that I cannot provide the escaped hex byte sequence. I must use a unicode equivalent.

Yes, that's right: the escaped byte sequence won't work. If you try the regular expression \xC3\xA9 in -8 mode, re2c will interpret it as "code point C3 followed by code point A9", both of which translate into 2-byte code unit sequences in UTF-8. The same happens when instead of \xC3\xA9 you write é (except that re2c doesn't have to unescape the bytes).

> Will this work for the 3 & 4 byte unicode values as well?

Escaped sequences will work for all Unicode code points (re2c supports escapes with 2, 4 and 8 hex digits: \xhh, \uhhhh and \Uhhhhhhhh).
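As a quick check of the byte counts asked about, the following Python sketch (an illustration, not re2c output) prints the UTF-8 encoding of one sample code point per sequence length. The sample code points are arbitrary choices.

```python
def utf8_bytes(code_point: int) -> bytes:
    """Return the UTF-8 encoding of a single Unicode code point."""
    return chr(code_point).encode("utf-8")


# One arbitrary sample per UTF-8 sequence length (1 to 4 bytes).
samples = [0x41, 0xE9, 0x20AC, 0x1F600]

for cp in samples:
    encoded = utf8_bytes(cp)
    print(f"U+{cp:04X}: {len(encoded)} byte(s): {encoded.hex(' ').upper()}")
```

This prints 1, 2, 3 and 4 bytes respectively, so a single escaped code point in a rule corresponds to a multi-byte sequence in UTF-8 encoded input.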

> it sounds like I will have to preprocess the input strings to substitute the appropriate unicode encoding prior to processing with re2c. Do you have a suggested tool for that?

No, unfortunately I don't. In a similar issue #235 we ended up with a pre-defined set of Unicode categories, but it's not good enough for your case.
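In the absence of a ready-made tool, a preprocessing step along these lines could be sketched in a few lines of Python. This is a hypothetical helper, not something shipped with re2c. It assumes it is given only the text of a regexp literal (not surrounding C code or comments), and it uses only the \uhhhh and \Uhhhhhhhh escape forms mentioned in this thread.

```python
def re2c_escape(ch: str) -> str:
    """Return ch unchanged if ASCII, else a \\uhhhh or \\Uhhhhhhhh escape."""
    cp = ord(ch)
    if cp < 0x80:
        return ch
    if cp <= 0xFFFF:
        return f"\\u{cp:04x}"
    return f"\\U{cp:08x}"


def escape_non_ascii(regexp_text: str) -> str:
    """Replace every non-ASCII character in a regexp literal with an escape."""
    return "".join(re2c_escape(ch) for ch in regexp_text)


print(escape_non_ascii('"Ibarg\u00fcen"'))  # "Ibarg\u00fcen"
```

The fixed-width escapes avoid ambiguity when the next character of the literal happens to be a hex digit, as with the "e" after ü here.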

> P.S. as you have acknowledged that this needs addressing, do you have a timeframe that it might be implemented?

I might be able to fix this in a few days. I have a sketch of the fix already, but it requires some prerequisite work in order to make it more elegant. It's a matter of using -8 in re2c's own lexer (which is itself written in re2c) and switching between two different lexers (ASCII and UTF-8). The new behavior will be guarded by an option, something like --input-encoding <ascii | utf8>.

skvadrik · 2019-05-24T12:34:27Z

Pushed a fix: 29a6d01.

Now it is possible to use UTF-8 encoded strings in regular expressions (in string literals and character classes). The new behaviour is enabled with the option --input-encoding utf8. By default re2c assumes --input-encoding ascii; in the future the default may be flipped (if it keeps confusing people).

It was necessary to use a new option instead of reusing -8, because one may wish to generate multiple lexers with different output encoding from the same set of UTF-8 encoded rules. That is, one may need to combine --input-encoding utf8 with one of the options -u, -x, -w, etc., and not necessarily -8.
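To illustrate why input and output encoding are independent, here is a Python sketch (not re2c code) showing the code units that the same character yields under three output encodings. The mapping of -8, -x and -u to UTF-8, UTF-16 and UTF-32 follows the re2c manual as I understand it. The codec names and the helper are Python-side assumptions for this example only.

```python
def code_units(text: str, codec: str, unit_size: int) -> list:
    """Encode text and split the result into fixed-size big-endian code units."""
    data = text.encode(codec)
    return [data[i:i + unit_size].hex().upper()
            for i in range(0, len(data), unit_size)]


# (re2c flag, Python codec, bytes per code unit); assumed mapping, see above.
OUTPUT_ENCODINGS = [
    ("-8", "utf-8", 1),
    ("-x", "utf-16-be", 2),
    ("-u", "utf-32-be", 4),
]

rule_text = "\u00e9"  # é, written once in a UTF-8 encoded rules file

for flag, codec, unit_size in OUTPUT_ENCODINGS:
    print(flag, code_units(rule_text, codec, unit_size))
```

One UTF-8 encoded rule thus maps to two code units with -8 but a single code unit with -x or -u, which is the combination the new option is meant to allow.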

I deliberately chose a broad name for the new option (as opposed to a more precise --utf-8-literals or some such) so that it can be extended in the future, for example to support UTF-8 encoded variable names (I do not see any good in that so far, though).

skvadrik · 2019-05-24T12:42:21Z

@dtp555-1212 If you can, please send me your real-world test. If it's closed-source, I only need the grammar rules (though a working self-contained example is always great).

dtp555-1212 · 2019-05-24T13:52:22Z

Attached is a list of words that have UTF-8 chars in them; the other attachment is the rule to insert into the test program previously provided. Hope that helps. Thanks.

skvadrik · 2019-05-24T14:20:28Z

Thanks! I added a test (it returns 0 for all the names on the list): https://github.com/skvadrik/re2c/blob/a00dc4871106ea39ef84f47bb840a018b17cea25/test/encodings/utf8_names.i8--input-encoding(utf8).re

There is an error in the name Ibargüen, it has a strange C2 byte right before C3 BC representing ü. It doesn't look like valid UTF-8 to me. After deleting C2 from both places everything works fine.
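The stray byte can be confirmed with a strict UTF-8 decoder. The Python sketch below reconstructs the byte sequence from the description above (C2 immediately before C3 BC); the exact bytes of the attached file are not reproduced here, so treat the sample as an assumption.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


broken = b"Ibarg\xc2\xc3\xbcen"               # C2 directly before C3 BC
fixed = broken.replace(b"\xc2\xc3", b"\xc3")  # drop the stray C2

print(is_valid_utf8(broken))  # False
print(is_valid_utf8(fixed))   # True
print(fixed.decode("utf-8"))  # Ibargüen
```

C2 is a lead byte that must be followed by a continuation byte in the range 80 to BF; since C3 is not one, the sequence is rejected.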

terpstra · 2019-05-24T20:53:48Z

This is great! When can we expect the next re2c release? I can't wait to re2c:include a Unicode character classes library and define character classes with literal UTF8 strings in them!

skvadrik · 2019-05-24T21:30:05Z

Soon, soon, really soon! I know I said this a couple of times before, such a shame... /o\ Realistically, not earlier than in 2 weeks, not later than the end of July. Thanks for asking, it gives me the inspiration to start writing changelog. :)

2.0.3 (2020-08-22)
~~~~~~~~~~~~~~~~~~

- Fix issues when building re2c as a CMake subproject (`#302 <https://github.com/skvadrik/re2c/pull/302>`_:

- Final corrections in the SIMPA article "RE2C: A lexer generator based on lookahead-TDFA", https://doi.org/10.1016/j.simpa.2020.100027

2.0.2 (2020-08-08)
~~~~~~~~~~~~~~~~~~

- Enable re2go building by default.

- Package CMake files into release tarball.

2.0.1 (2020-07-29)
~~~~~~~~~~~~~~~~~~

- Updated version for CMake build system (forgotten in release 2.0).

- Added a short article about re2c for the Software Impacts journal.

2.0 (2020-07-20)
~~~~~~~~~~~~~~~~

- Added new code generation backend for Go and a new ``re2go`` program (`#272 <https://github.com/skvadrik/re2c/issues/272>`_: Go support). Added option ``--lang <c | go>``.

- Added CMake build system as an alternative to Autotools (`#275 <https://github.com/skvadrik/re2c/pull/275>`_: Add a CMake build system (thanks to ligfx), `#244 <https://github.com/skvadrik/re2c/issues/244>`_: Switching to CMake).

- Changes in generic API:

  + Removed primitives ``YYSTAGPD`` and ``YYMTAGPD``.
  + Added primitives ``YYSHIFT``, ``YYSHIFTSTAG``, ``YYSHIFTMTAG`` that allow to express fixed tags in terms of generic API.
  + Added configurations ``re2c:api:style`` and ``re2c:api:sigil``.
  + Added named placeholders in interpolated configuration strings.

- Changes in reuse mode (``-r, --reuse`` option):

  + Do not reset API-related configurations in each `use:re2c` block (`#291 <https://github.com/skvadrik/re2c/issues/291>`_: Defines in rules block are not propagated to use blocks).
  + Use block-local options instead of last block options.
  + Do not accumulate options from rules/reuse blocks in whole-program options.
  + Generate non-overlapping YYFILL labels for reuse blocks.
  + Generate start label for each reuse block in storable state mode.

- Changes in start-conditions mode (``-c, --start-conditions`` option):

  + Allow to use normal (non-conditional) blocks in `-c` mode (`#263 <https://github.com/skvadrik/re2c/issues/263>`_: allow mixing conditional and non-conditional blocks with -c, `#296 <https://github.com/skvadrik/re2c/issues/296>`_: Conditions required for all lexers when using '-c' option).
  + Generate condition switch in every re2c block (`#295 <https://github.com/skvadrik/re2c/issues/295>`_: Condition switch generated for only one lexer per file).

- Changes in the generated labels:

  + Use ``yyeof`` label prefix instead of ``yyeofrule``.
  + Use ``yyfill`` label prefix instead of ``yyFillLabel``.
  + Decouple start label and initial label (affects label numbering).

- Removed undocumented configuration ``re2c:flags:o``, ``re2c:flags:output``.

- Changes in ``re2c:flags:t``, ``re2c:flags:type-header`` configuration: filename is now relative to the output file directory.

- Added option ``--case-ranges`` and configuration ``re2c:flags:case-ranges``.

- Extended fixed tags optimization for the case of fixed-counter repetition.

- Fixed bugs related to EOF rule:

  + `#276 <https://github.com/skvadrik/re2c/issues/276>`_: Example 01_fill.re in docs is broken
  + `#280 <https://github.com/skvadrik/re2c/issues/280>`_: EOF rules with multiple blocks
  + `#284 <https://github.com/skvadrik/re2c/issues/284>`_: mismatched YYBACKUP and YYRESTORE (Add missing fallback states with EOF rule)

- Fixed miscellaneous bugs:

  + `#286 <https://github.com/skvadrik/re2c/issues/286>`_: Incorrect submatch values with fixed-length trailing context.
  + `#297 <https://github.com/skvadrik/re2c/issues/297>`_: configure error on ubuntu 18.04 / cmake 3.10

- Changed bootstrap process (require explicit configuration flags and a path to re2c executable to regenerate the lexers).

- Added internal options ``--posix-prectable <naive | complex>``.

- Added debug option ``--dump-dfa-tree``.

- Major revision of the paper "Efficient POSIX submatch extraction on NFA".

----
1.3x
----

1.3 (2019-12-14)
~~~~~~~~~~~~~~~~

- Added option: ``--stadfa``.

- Added warning: ``-Wsentinel-in-midrule``.

- Added generic API primitives:

  + ``YYSTAGPD``
  + ``YYMTAGPD``

- Added configurations:

  + ``re2c:sentinel = 0;``
  + ``re2c:define:YYSTAGPD = "YYSTAGPD";``
  + ``re2c:define:YYMTAGPD = "YYMTAGPD";``

- Worked on reproducible builds (`#258 <https://github.com/skvadrik/re2c/pull/258>`_: Make the build reproducible).

----
1.2x
----

1.2.1 (2019-08-11)
~~~~~~~~~~~~~~~~~~

- Fixed bug `#253 <https://github.com/skvadrik/re2c/issues/253>`_: re2c should install unicode_categories.re somewhere.

- Fixed bug `#254 <https://github.com/skvadrik/re2c/issues/254>`_: Turn off re2c:eof = 0.

1.2 (2019-08-02)
~~~~~~~~~~~~~~~~

- Added EOF rule ``$`` and configuration ``re2c:eof``.

- Added ``/*!include:re2c ... */`` directive and ``-I`` option.

- Added ``/*!header:re2c:on*/`` and ``/*!header:re2c:off*/`` directives.

- Added ``--input-encoding <ascii | utf8>`` option.

  + `#237 <https://github.com/skvadrik/re2c/issues/237>`_: Handle non-ASCII encoded characters in regular expressions
  + `#250 <https://github.com/skvadrik/re2c/issues/250>`_ UTF8 enoding

- Added include file with a list of definitions for Unicode character classes.

  + `#235 <https://github.com/skvadrik/re2c/issues/235>`_: Unicode character classes

- Added ``--location-format <gnu | msvc>`` option.

  + `#195 <https://github.com/skvadrik/re2c/issues/195>`_: Please consider using Gnu format for error messages

- Added ``--verbose`` option that prints "success" message if re2c exits without errors.

- Added configurations for options:

  + ``-o --output`` (specify output file)
  + ``-t --type-header`` (specify header file)

- Removed configurations for internal/debug options.

- Extended ``-r`` option: allow to mix multiple ``/*!rules:re2c*/``, ``/*!use:re2c*/`` and ``/*!re2c*/`` blocks.

  + `#55 <https://github.com/skvadrik/re2c/issues/55>`_: allow standard re2c blocks in reuse mode

- Fixed ``-F --flex-support`` option: parsing and operator precedence.

  + `#229 <https://github.com/skvadrik/re2c/issues/229>`_: re2c option -F (flex syntax) broken
  + `#242 <https://github.com/skvadrik/re2c/issues/242>`_: Operator precedence with --flex-syntax is broken

- Changed difference operator ``/`` to apply before encoding expansion of operands.

  + `#236 <https://github.com/skvadrik/re2c/issues/236>`_: Support range difference with variable-length encodings

- Changed output generation of output file to be atomic.

  + `#245 <https://github.com/skvadrik/re2c/issues/245>`_: re2c output is not atomic

- Authored research paper "Efficient POSIX Submatch Extraction on NFA" together with Dr Angelo Borsotti.

- Added experimental libre2c library (``--enable-libs`` configure option) with the following algorithms:

  + TDFA with leftmost-greedy disambiguation
  + TDFA with POSIX disambiguation (Okui-Suzuki algorithm)
  + TNFA with leftmost-greedy disambiguation
  + TNFA with POSIX disambiguation (Okui-Suzuki algorithm)
  + TNFA with lazy POSIX disambiguation (Okui-Suzuki algorithm)
  + TNFA with POSIX disambiguation (Kuklewicz algorithm)
  + TNFA with POSIX disambiguation (Cox algorithm)

- Added debug subsystem (``--enable-debug`` configure option) and new debug options:

  + ``-dump-cfg`` (dump control flow graph of tag variables)
  + ``-dump-interf`` (dump interference table of tag variables)
  + ``-dump-closure-stats`` (dump epsilon-closure statistics)

- Added internal options:

  + ``--posix-closure <gor1 | gtop>`` (switch between shortest-path algorithms used for the construction of POSIX closure)

- Fixed a number of crashes found by American Fuzzy Lop fuzzer:

  + `#226 <https://github.com/skvadrik/re2c/issues/226>`_, `#227 <https://github.com/skvadrik/re2c/issues/227>`_, `#228 <https://github.com/skvadrik/re2c/issues/228>`_, `#231 <https://github.com/skvadrik/re2c/issues/231>`_, `#232 <https://github.com/skvadrik/re2c/issues/232>`_, `#233 <https://github.com/skvadrik/re2c/issues/233>`_, `#234 <https://github.com/skvadrik/re2c/issues/234>`_, `#238 <https://github.com/skvadrik/re2c/issues/238>`_

- Fixed handling of newlines:

  + correctly parse multi-character newlines CR LF in ``#line`` directives
  + consistently convert all newlines in the generated file to Unix-style LF

- Changed default tarball format from .gz to .xz.

  + `#221 <https://github.com/skvadrik/re2c/issues/221>`_: big source tarball

- Fixed a number of other bugs and resolved issues:

  + `#2 <https://github.com/skvadrik/re2c/issues/2>`_: abort
  + `#6 <https://github.com/skvadrik/re2c/issues/6>`_: segfault
  + `#10 <https://github.com/skvadrik/re2c/issues/10>`_: lessons/002_upn_calculator/calc_002 doesn't produce a useful example program
  + `#44 <https://github.com/skvadrik/re2c/issues/44>`_: Access violation when translating the attached file
  + `#49 <https://github.com/skvadrik/re2c/issues/49>`_: wildcard state \000 rules makes lexer behave weard
  + `#98 <https://github.com/skvadrik/re2c/issues/98>`_: Transparent handling of #line directives in input files
  + `#104 <https://github.com/skvadrik/re2c/issues/104>`_: Improve const-correctness
  + `#105 <https://github.com/skvadrik/re2c/issues/105>`_: Conversion of pointer parameters into references
  + `#114 <https://github.com/skvadrik/re2c/issues/114>`_: Possibility of fixing bug 2535084
  + `#120 <https://github.com/skvadrik/re2c/issues/120>`_: condition consisting of default rule only is ignored
  + `#167 <https://github.com/skvadrik/re2c/issues/167>`_: Add word boundary support
  + `#168 <https://github.com/skvadrik/re2c/issues/168>`_: Wikipedia's article on re2c
  + `#180 <https://github.com/skvadrik/re2c/issues/180>`_: Comment syntax?
  + `#182 <https://github.com/skvadrik/re2c/issues/182>`_: yych being set by YYPEEK () and then not used
  + `#196 <https://github.com/skvadrik/re2c/issues/196>`_: Implicit type conversion warnings
  + `#198 <https://github.com/skvadrik/re2c/issues/198>`_: no match for ‘operator!=’ in ‘i != std::vector<_Tp, _Alloc>::rend() [with _Tp = re2c::bitmap_t, _Alloc = std::allocator<re2c::bitmap_t>]()’
  + `#210 <https://github.com/skvadrik/re2c/issues/210>`_: How to build re2c in windows?
  + `#215 <https://github.com/skvadrik/re2c/issues/215>`_: A memory read overrun issue in s_to_n32_unsafe.cc
  + `#220 <https://github.com/skvadrik/re2c/issues/220>`_: src/dfa/dfa.h: simplify constructor to avoid g++-3.4 bug
  + `#223 <https://github.com/skvadrik/re2c/issues/223>`_: Fix typo
  + `#224 <https://github.com/skvadrik/re2c/issues/224>`_: src/dfa/closure_posix.cc: pack() tweaks
  + `#225 <https://github.com/skvadrik/re2c/issues/225>`_: Documentation link is broken in libre2c/README
  + `#230 <https://github.com/skvadrik/re2c/issues/230>`_: Changes for upcoming Travis' infra migration
  + `#239 <https://github.com/skvadrik/re2c/issues/239>`_: Push model example has wrong re2c invocation, breaks guide
  + `#241 <https://github.com/skvadrik/re2c/issues/241>`_: Guidance on how to use re2c for full-duplex command & response protocol
  + `#243 <https://github.com/skvadrik/re2c/issues/243>`_: A code generated for period (.) requires 4 bytes
  + `#246 <https://github.com/skvadrik/re2c/issues/246>`_: Please add a license to this repo
  + `#247 <https://github.com/skvadrik/re2c/issues/247>`_: Build failure on current Cygwin, probably caused by force-fed c++98 mode
  + `#248 <https://github.com/skvadrik/re2c/issues/248>`_: distcheck still looks for README
  + `#251 <https://github.com/skvadrik/re2c/issues/251>`_: Including what you use is find, but not without inclusion guards

- Updated documentation and website.

skvadrik mentioned this issue May 24, 2019

Handle non-ASCII encoded characters in regular expressions. #237

Closed

skvadrik closed this as completed Jun 10, 2019
