UTF8 enoding #250
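Eh, it's a duplicate of #237. The problem is, re2c -8 option does not give you source-level Unicode support: if you write characters like é in regexp definitions, re2c interprets it as a plain byte sequence (each byte as a single character), not as one Unicode symbol. You have to use "\u00e9" instead.
I realize this is very ugly, difficult to use, confusing and needs fixing.
What exactly happens in case of é and how re2c ends up with C3 83 byte sequence is explained in great detail in #237 (let me know if you need more clarifications).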
Thanks for your reply... Obviously there are many Unicode values, not just the one I provided in my example. Do I understand you correctly that I cannot provide the escaped hex byte sequence, and must use the Unicode equivalent instead (in this case \u00e9 for the two-byte sequence C3 A9)? Will this work for the 3- and 4-byte Unicode values as well (not only matching the character 'visually', but also having the expected byte count for UTF-8)? With this understanding, it sounds like I will have to preprocess the input strings to substitute the appropriate Unicode escapes prior to processing with re2c. Do you have a suggested tool for that?
Thanks again
P.S. as you have acknowledged that this needs addressing, do you have a timeframe that it might be implemented?
…________________________________
From: Ulya Trofimovich <[email protected]>
Sent: Wednesday, May 22, 2019 3:05 PM
To: skvadrik/re2c
Cc: dtp555-1212; Author
Subject: Re: [skvadrik/re2c] UTF8 enoding (#250)
Eh, it's a duplicate of #237. The problem is, re2c -8 option does not give you source-level Unicode support: if you write characters like é in regexp definitions, re2c interprets it as a plain byte sequence (each byte as a single character), not as one Unicode symbol. You have to use "\u00e9" instead.
I realize this is very ugly, difficult to use, confusing and needs fixing.
What exactly happens in case of é and how re2c ends up with C3 83 byte sequence is explained in great detail in #237 (let me know if you need more clarifications).
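The C3 83 result can be reproduced outside re2c. The Python sketch below is only a model of the behaviour described in the quoted reply (each source byte taken as its own character and then UTF-8 encoded); it is not re2c's actual code, and the helper name `hex_bytes` is made up for the illustration.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes the way `od -t x1` would show them, in upper case."""
    return " ".join(f"{b:02X}" for b in data)

# The regexp source contains the character é, stored on disk as UTF-8.
source_bytes = "é".encode("utf-8")

# Assumed model of the old behaviour: every source byte is read as a
# separate code point (U+00C3 and U+00A9) ...
misread = "".join(chr(b) for b in source_bytes)

# ... and with -8 each of those code points is then encoded as UTF-8.
generated = misread.encode("utf-8")

print(hex_bytes(source_bytes))               # C3 A9
print(hex_bytes(generated))                  # C3 83 C2 A9
print(hex_bytes("\u00e9".encode("utf-8")))   # C3 A9
```

The first byte stays C3 only by coincidence: U+00C3 happens to encode as C3 83, which is why the generated parser looked half right.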
Yes, it won't work. If you try regular expression
Escaped sequences will work for all Unicode code points (re2c supports 2-byte, 4-byte and 8-byte syntax:
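On the byte-count question: the length of a UTF-8 sequence depends only on the code point, so an escaped code point has the same encoded length as the literal character. The Python check below illustrates this with one 2-byte, one 3-byte and one 4-byte code point. It shows standard UTF-8 lengths and assumes that re2c with -8 expands escapes to standard UTF-8; it does not run re2c.

```python
# Sample code points chosen for illustration: é, € and an emoji.
samples = ["\u00e9", "\u20ac", "\U0001F600"]

def utf8_hex(ch: str) -> str:
    """Return the UTF-8 encoding of one character as spaced hex bytes."""
    return " ".join(f"{b:02X}" for b in ch.encode("utf-8"))

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}: {len(encoded)} bytes: {utf8_hex(ch)}")
```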
No, unfortunately I don't. In a similar issue #235 we ended up with a pre-defined set of Unicode categories, but it's not good enough for your case.
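Until the fix is available, the preprocessing step asked about above could be a few lines of script. The sketch below is one hypothetical way to do it in Python: it rewrites every non-ASCII character as a \uXXXX escape (or \UXXXXXXXX above U+FFFF). The function name `escape_non_ascii` is invented here, and the 8-digit \U form is assumed to be accepted by re2c; only the \uXXXX form appears in this thread.

```python
def escape_non_ascii(text: str) -> str:
    """Replace each non-ASCII character with a \\uXXXX or \\UXXXXXXXX escape."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)                 # plain ASCII is kept as-is
        elif cp <= 0xFFFF:
            out.append(f"\\u{cp:04x}")     # 4 hex digits
        else:
            out.append(f"\\U{cp:08x}")     # 8 hex digits (assumed syntax)
    return "".join(out)

print(escape_non_ascii('"Beyoncé"'))       # "Beyonc\u00e9"
```

The input has to be decoded as UTF-8 before it is passed in; a name stored with combining accents would produce two escapes per accented letter and match only that decomposed form.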
I might be able to fix this in a few days. I have a sketch of the fix already, but it requires some pre-requisite work in order to make it more elegant. It's a matter of using
Pushed a fix: 29a6d01. Now it is possible to use UTF-8 encoded strings in regular expressions (in string literals and character classes). The new behaviour is enabled with option --input-encoding utf8. It was necessary to use a new option instead of reusing […]. I deliberately chose a broad name for the new option (as opposed to a more precise […]).
@dtp555-1212 If you can, please send me your real-world test. If it's closed-source, I only need the grammar rules (though a working self-contained example is always great).
Attached is a list of words that have utf8 chars in them, and the other would be the rule to insert into the test program previously provided.
Hope that helps
Thanks
Abadía
Åberg
Abián
Adám
Ádám
Adenízia
Áder
Adrián
Ágatha
Agustín
Ahouré
Aída
Aïda
Ajeé
Akgül
Alagía
Alarcón
Aléman
Álex
Alizé
Alizée
Álvarez
Álvaro
Amélie
Anaís
Anaïs
Anastasákis
Andéol
András
André
Andréanne
Andrée
Andrés
Andújar
Anél
Ángel
Ángela
Angélil
Aníbal
Aníta
Añor
Antónia
António
Aoás
Apolónia
Araújo
Arbeláez
Arcón
Arévalo
Áron
Ásdís
Auböck
Augé
Áurea
Aurélie
Aurélien
Ávila
Baláz
Balázs
Ballivián
Bárbara
Bård
Barnabé
Barré
Barták
Barteková
Baugé
Bäumer
Béatrice
Bécaud
Bédard
Bédié
Begoña
Béla
Bélanger
Belascoarán
Belén
Bělohlávek
Beltré
Benavídez
Bendegúz
Benítez
Benjámin
Benoît
Beresová
Bermúdez
Bernabéu
Bernárdez
Béryl
Beyoncé
Böckler
Boczkó
Boglárka
Bolaños
Bolívar
Bolükbasi
Borgström
Borlée
Böröcz
Botín
Briceño
Brücken
Brzobohatý
Bubeník
Bublé
Bühler
Búranová
Büsra
Büthe
Büyükakcay
Byström
Cabrnochová
Cáceres
Calderón
Cañadilla
Cañas
Cañavate
Canelón
Cánepa
Cantú
Capó
Cárdenas
Carlén
Carré
Casañas
Cassarà
Cássia
Castellaños
Cátia
Cazaubón
Cebrián
Cécile
Cécilia
Cédric
Célestin
Céline
Célio
Čepický
Cerén
César
Céspedes
Cézanne
Chacón
Chaunté
Chávez
Chihuán
Chloé
Chrétien
Cibrián
Cintrón
Cíosóig
Cissé
Clélia
Clémence
Clément
Clévenot
Colón
Compaoré
Conceição
Concepción
Condé
Córdoba
Cordón
Córdova
Cortés
Crépeau
Cristóbal
Cubillán
Cué
Cuétara
Cynné
Czaková
Czigány
Daabousová
Dallapé
Dániel
Danièle
Danté
Dávalos
Dávid
DawnCheré
Débora
Déborah
Déby
Décary
Delía
Dembélé
Dénes
Dépré
DerlisRamón
Dési
Desirée
Desrosières
Díaz
Diémé
Dièye
Dilmé
Djá
Djénébou
Dolínek
Domínguez
Donté
Dóra
Dorjsürengiin
Dostál
Duchonová
Ducó
Dueñas
Dukátová
Durán
Dvorák
Echávarri
Echevarría
Éder
Édgar
Ekateríni
Élodie
Elphége
Émane
Émile
Emilíana
Émilie
Épangue
Erdélyi
Ergüven
Érica
Érick
Érika
España
Espíndola
Étienne
Eugénie
Eurén
Éva
Éve
Évora
Fabián
Fábio
Fabíola
Fagúndez
Fältskog
Fariña
Felício
Félix
Ferencová
Fernández
Flávia
Flesjå
Flóra
Florenç
Flügel
Flüggen
Foldházi
François
Françoise
Frédéric
Frédérick
Frisé
Fürste
Gábor
Gádorfalvi
Gagné
Gáliková
Gándara
Garbiñe
García
Garrigós
Gascón
Gáspár
Gastón
Gaudí
Gélineau
Geneviève
Gérard
Germán
Gerónimo
Géroudet
Gévrise
Giménez
Ginóbili
Gnassingbé
Gomà
Gómez
Gonçalves
Göncz
González
Göran
Grátz
Grégory
Grévy
Grimké
Grimsbö
Grímsson
Grönberg
Grövdal
Guillén
Güldeniz
Gülec
Gulldén
Gümbel
Gündegmaa
Günes
Günther
Gutiérrez
Güvenc
Guzmán
György
Gyurcsány
Häfner
Háido
Håkan
Hambüchen
Hamchétou
Hárai
Härstedt
Håvard
Havlát
Héléna
Hélene
Hendrychová
Hernán
Hernández
Hernangómez
Hervé
Hidvégi
Higuaín
Hinriksdóttir
Hjálmsdóttir
Holingerová
Holló
Horváth
Hosnyánszky
Hosszú
Hrasnová
Hristóforos
Hrivnák
Hufnágel
Hultén
Hüseyin
Hypólito
Hyryläinen
Ibañez
Ibarg�üen
Idéhn
Ié
Illés
Inácio
Iñárritu
Inés
István
Iván
Jackée
Jágr
Jakubský
Jámison
Jämsä
Janatková
János
Járóka
Jaurès
Jeremiáš
Jérémie
Jérémy
Jérent
Jérome
Jéssica
Jesús
Jhené
Jiménez
Jiří
João
Joaquín
Joëlle
Jóhannsson
Jonatán
JonBenét
Jördis
Jorén
Josée
Josué
Jóźwiak
Juhász
Júlio
Júnior
Juppé
Jürgen
Jurinová
Kaboré
Kafétien
Kaká
Kalovský
Kapás
Karlström
Karolína
Kasó
Katarína
Kätlin
Kévin
Kemrová
Késely
Kévin
Khloé
Khüderbulga
Kléber
Kléberson
Klobucník
Klocová
Klöden
Kněžínková
Köbrich
Köhler
Kohlová
Koňařík
Kořán
Kovács
Kovágó
Kozák
Krejčí
Kristián
Krisztián
Krizsán
Krüger
Kühn
Kühne
Kylliäinen
Laanmäe
Labbé
Laferrière
Laprovíttola
Larrañaga
László
Lázaro
Léa
Léandre
Lefèvre
Leitón
Lemprière
León
Lepistö
Lerú
Lidström
Lillána
Listopadová
Liván
Lívia
Lloréns
Lluís
Löke
Longová
López
Lotiès
Lövnes
Lü
Lübeck
Lucía
Lückenkemper
Luís
Lukás
Lukáš
Lúthersdóttir
Madaí
Madarász
Mäe
Magallán
Mägi
Mahé
Maíla
Majdán
Mäkelä
Mandátová
Mané
Mangué
Marc-André's
Márcio
Maréchal
Marí
Mária
María
Marílson
Marín
Mariño
Mário
Márk
Marozsán
Márquez
Martí
Martín
Martínek
Martínez
Márton
Massó
Mätas
Máté
Matías
Matús
Maurício
Máximo
Meité
Mélanie
Mélina
Méline
Méndez
Meroúsis
Micheál
Michèle
Mihaíl
Mijaín
Miklós
Millán
Miltiádis
Moisés
Mokosová
Molnár
Mónaco
Monáe
Mónica
Mónika
Montaño
Morén
Mörk
Mörö
Müller
Muñiz
Muñoz
Murúa
Nádia
Naïm
Natália
Negrón
Németh
Néstor
Niccolò
Nicolás
Niinistö
Nóbrega
Noélie
Noémie
Nordén
Núbia
Nuñez
Ódorová
Öhrström
Ólafur
Óleo
Opatrný
Orbán
Ordóñez
Oréane
Ortíz
Óscar
Ozlü
Ozyüksel
Pääbo
Pabón
Padacké
Pádraig
Páez
Pajón
Pál
Palát
Panayióta
Pär
Paré
Pärt
Patiño
Patrícia
Patrocínio
Pattantyús
Pavón
Péché
Péchoux
Pelikán
Peña
Peñate
Pénélope
Péni
Pépin
Pérez
Perón
Pétain
Petchamé
Péter
Pétervári
Philémon
Phúc
Piétrus
Pinzón
Pité
Pitkämäki
Poésy
Pokorný
Polívka
Póta
Préval
Prokopová
Puigcercós
Pürevjargalyn
Putálová
Quiñones
Quiñonez
Quintillà
Quvenzhané
Rácz
Ramírez
Raúl
Řebíček
Récsei
Rédli
Réka
Rémi
Renáta
Rendón
René
Renée
Rénelle
Rentería
Repcík
Reséndiz
Rézola
Ribéry
Richárd
Ríga
Robenílson
Róbert
Róchez
Rocío
Rodríguez
Rogério
Rolfö
Román
Romová
Rónald
Rosário
Rubén
Rühr
Ruíz
Sá
Saborío
Sagardía
Sallói
Salomé
Salvadó
Samassékou
Sánchez
Sandé
Sándor
Sardá
Sárosi
Sátila
Saúl
Saunière
Savón
Scalamandré
Schächter
Schäfer
Schäuble
Schlögl
Schmiedlová
Schön
Schröder
Schüpbac
Schüssel
Schütze
Séamus
Seán
Sebastián
Sébastien
Sebestyén
Sélom
Sène
Senyürek
Seppälä
Sepúlveda
Sérgio
Shkëlzen
Sicília
Silfvén
Siljamäki
Sinéad
Sjåstad
Sjöberg
Sjödin
Sjöström
Skantár
Söderberg
Söderling
Sofía
Solé
Solís
Söllner
Somorácz
Sörenstam
Ståhl
Ståle
Stefanídi
Stéphane
Stéphanie
Strålman
Strömberg
Stübe
Studničková
Suárez
Šuláková
Süle
Süleyman
Švácha
Svärd
Svennerstål
Szabián
Szabó
Szász
Szilágyi
Szomolányi
Szücs
Szwarnóg
Taaramäe
Tabaré
Tainá
Takács
Támara
Tamás
Tarragó
Tazegül
Tcheuméo
Tchórz
Téa
Tentóglou
Teófilo
Teré
Tévez
Thaísa
Théo
Théophile
Thérèse
Théry
Thiéry
Tímea
Tió
Todenhöfer
Tomáš
Tomorkhüleg
Tõnu
Topolánek
Tormé
Tornéus
Törnroos
Török
Tórrez
Tórtola
Tóth
Touadéra
Tramèr
Traoré
Träsch
Trévor
Tsinopoúlou
Túñez
Türk
Tüvshinbat
Tüvshinbayar
Üitümen
Ünal
Ungvári
Urán
Úrsula
Üstündag
Václav
Valdés
Valentín
Valérian
Valériane
Välimäki
Vallée
Vámos
Vásquez
Vázquez
Velázquez
Veldáková
Venyercsán
Veréb
Verón
Verrasztó
Víctor
Victória
Viktória
Vilató
Villaécija
Villafría
Vinícius
Viñolas
Vitória
Vladimír
Wallén
Wálter
Wanyá
Wé
Wéverton
Wikström
Xénia
Yáñez
Younés
Zagré
Zalánki
Zelená
Zélia
Zoltán
Zságer
Zsófia
utfExamples = ( "Abadía"|
"Åberg"|
"Abián"|
"Adám"|
"Ádám"|
"Adenízia"|
"Áder"|
"Adrián"|
"Ágatha"|
"Agustín"|
"Ahouré"|
"Aída"|
"Aïda"|
"Ajeé"|
"Akgül"|
"Alagía"|
"Alarcón"|
"Aléman"|
"Álex"|
"Alizé"|
"Alizée"|
"Álvarez"|
"Álvaro"|
"Amélie"|
"Anaís"|
"Anaïs"|
"Anastasákis"|
"Andéol"|
"András"|
"André"|
"Andréanne"|
"Andrée"|
"Andrés"|
"Andújar"|
"Anél"|
"Ángel"|
"Ángela"|
"Angélil"|
"Aníbal"|
"Aníta"|
"Añor"|
"Antónia"|
"António"|
"Aoás"|
"Apolónia"|
"Araújo"|
"Arbeláez"|
"Arcón"|
"Arévalo"|
"Áron"|
"Ásdís"|
"Auböck"|
"Augé"|
"Áurea"|
"Aurélie"|
"Aurélien"|
"Ávila"|
"Baláz"|
"Balázs"|
"Ballivián"|
"Bárbara"|
"Bård"|
"Barnabé"|
"Barré"|
"Barták"|
"Barteková"|
"Baugé"|
"Bäumer"|
"Béatrice"|
"Bécaud"|
"Bédard"|
"Bédié"|
"Begoña"|
"Béla"|
"Bélanger"|
"Belascoarán"|
"Belén"|
"Bělohlávek"|
"Beltré"|
"Benavídez"|
"Bendegúz"|
"Benítez"|
"Benjámin"|
"Benoît"|
"Beresová"|
"Bermúdez"|
"Bernabéu"|
"Bernárdez"|
"Béryl"|
"Beyoncé"|
"Böckler"|
"Boczkó"|
"Boglárka"|
"Bolaños"|
"Bolívar"|
"Bolükbasi"|
"Borgström"|
"Borlée"|
"Böröcz"|
"Botín"|
"Briceño"|
"Brücken"|
"Brzobohatý"|
"Bubeník"|
"Bublé"|
"Bühler"|
"Búranová"|
"Büsra"|
"Büthe"|
"Büyükakcay"|
"Byström"|
"Cabrnochová"|
"Cáceres"|
"Calderón"|
"Cañadilla"|
"Cañas"|
"Cañavate"|
"Canelón"|
"Cánepa"|
"Cantú"|
"Capó"|
"Cárdenas"|
"Carlén"|
"Carré"|
"Casañas"|
"Cassarà"|
"Cássia"|
"Castellaños"|
"Cátia"|
"Cazaubón"|
"Cebrián"|
"Cécile"|
"Cécilia"|
"Cédric"|
"Célestin"|
"Céline"|
"Célio"|
"Čepický"|
"Cerén"|
"César"|
"Céspedes"|
"Cézanne"|
"Chacón"|
"Chaunté"|
"Chávez"|
"Chihuán"|
"Chloé"|
"Chrétien"|
"Cibrián"|
"Cintrón"|
"Cíosóig"|
"Cissé"|
"Clélia"|
"Clémence"|
"Clément"|
"Clévenot"|
"Colón"|
"Compaoré"|
"Conceição"|
"Concepción"|
"Condé"|
"Córdoba"|
"Cordón"|
"Córdova"|
"Cortés"|
"Crépeau"|
"Cristóbal"|
"Cubillán"|
"Cué"|
"Cuétara"|
"Cynné"|
"Czaková"|
"Czigány"|
"Daabousová"|
"Dallapé"|
"Dániel"|
"Danièle"|
"Danté"|
"Dávalos"|
"Dávid"|
"DawnCheré"|
"Débora"|
"Déborah"|
"Déby"|
"Décary"|
"Delía"|
"Dembélé"|
"Dénes"|
"Dépré"|
"DerlisRamón"|
"Dési"|
"Desirée"|
"Desrosières"|
"Díaz"|
"Diémé"|
"Dièye"|
"Dilmé"|
"Djá"|
"Djénébou"|
"Dolínek"|
"Domínguez"|
"Donté"|
"Dóra"|
"Dorjsürengiin"|
"Dostál"|
"Duchonová"|
"Ducó"|
"Dueñas"|
"Dukátová"|
"Durán"|
"Dvorák"|
"Echávarri"|
"Echevarría"|
"Éder"|
"Édgar"|
"Ekateríni"|
"Élodie"|
"Elphége"|
"Émane"|
"Émile"|
"Emilíana"|
"Émilie"|
"Épangue"|
"Erdélyi"|
"Ergüven"|
"Érica"|
"Érick"|
"Érika"|
"España"|
"Espíndola"|
"Étienne"|
"Eugénie"|
"Eurén"|
"Éva"|
"Éve"|
"Évora"|
"Fabián"|
"Fábio"|
"Fabíola"|
"Fagúndez"|
"Fältskog"|
"Fariña"|
"Felício"|
"Félix"|
"Ferencová"|
"Fernández"|
"Flávia"|
"Flesjå"|
"Flóra"|
"Florenç"|
"Flügel"|
"Flüggen"|
"Foldházi"|
"François"|
"Françoise"|
"Frédéric"|
"Frédérick"|
"Frisé"|
"Fürste"|
"Gábor"|
"Gádorfalvi"|
"Gagné"|
"Gáliková"|
"Gándara"|
"Garbiñe"|
"García"|
"Garrigós"|
"Gascón"|
"Gáspár"|
"Gastón"|
"Gaudí"|
"Gélineau"|
"Geneviève"|
"Gérard"|
"Germán"|
"Gerónimo"|
"Géroudet"|
"Gévrise"|
"Giménez"|
"Ginóbili"|
"Gnassingbé"|
"Gomà"|
"Gómez"|
"Gonçalves"|
"Göncz"|
"González"|
"Göran"|
"Grátz"|
"Grégory"|
"Grévy"|
"Grimké"|
"Grimsbö"|
"Grímsson"|
"Grönberg"|
"Grövdal"|
"Guillén"|
"Güldeniz"|
"Gülec"|
"Gulldén"|
"Gümbel"|
"Gündegmaa"|
"Günes"|
"Günther"|
"Gutiérrez"|
"Güvenc"|
"Guzmán"|
"György"|
"Gyurcsány"|
"Häfner"|
"Háido"|
"Håkan"|
"Hambüchen"|
"Hamchétou"|
"Hárai"|
"Härstedt"|
"Håvard"|
"Havlát"|
"Héléna"|
"Hélene"|
"Hendrychová"|
"Hernán"|
"Hernández"|
"Hernangómez"|
"Hervé"|
"Hidvégi"|
"Higuaín"|
"Hinriksdóttir"|
"Hjálmsdóttir"|
"Holingerová"|
"Holló"|
"Horváth"|
"Hosnyánszky"|
"Hosszú"|
"Hrasnová"|
"Hristóforos"|
"Hrivnák"|
"Hufnágel"|
"Hultén"|
"Hüseyin"|
"Hypólito"|
"Hyryläinen"|
"Ibañez"|
"Ibarg�üen"|
"Idéhn"|
"Ié"|
"Illés"|
"Inácio"|
"Iñárritu"|
"Inés"|
"István"|
"Iván"|
"Jackée"|
"Jágr"|
"Jakubský"|
"Jámison"|
"Jämsä"|
"Janatková"|
"János"|
"Járóka"|
"Jaurès"|
"Jeremiáš"|
"Jérémie"|
"Jérémy"|
"Jérent"|
"Jérome"|
"Jéssica"|
"Jesús"|
"Jhené"|
"Jiménez"|
"Jiří"|
"João"|
"Joaquín"|
"Joëlle"|
"Jóhannsson"|
"Jonatán"|
"JonBenét"|
"Jördis"|
"Jorén"|
"Josée"|
"Josué"|
"Jóźwiak"|
"Juhász"|
"Júlio"|
"Júnior"|
"Juppé"|
"Jürgen"|
"Jurinová"|
"Kaboré"|
"Kafétien"|
"Kaká"|
"Kalovský"|
"Kapás"|
"Karlström"|
"Karolína"|
"Kasó"|
"Katarína"|
"Kätlin"|
"Kévin"|
"Kemrová"|
"Késely"|
"Kévin"|
"Khloé"|
"Khüderbulga"|
"Kléber"|
"Kléberson"|
"Klobucník"|
"Klocová"|
"Klöden"|
"Kněžínková"|
"Köbrich"|
"Köhler"|
"Kohlová"|
"Koňařík"|
"Kořán"|
"Kovács"|
"Kovágó"|
"Kozák"|
"Krejčí"|
"Kristián"|
"Krisztián"|
"Krizsán"|
"Krüger"|
"Kühn"|
"Kühne"|
"Kylliäinen"|
"Laanmäe"|
"Labbé"|
"Laferrière"|
"Laprovíttola"|
"Larrañaga"|
"László"|
"Lázaro"|
"Léa"|
"Léandre"|
"Lefèvre"|
"Leitón"|
"Lemprière"|
"León"|
"Lepistö"|
"Lerú"|
"Lidström"|
"Lillána"|
"Listopadová"|
"Liván"|
"Lívia"|
"Lloréns"|
"Lluís"|
"Löke"|
"Longová"|
"López"|
"Lotiès"|
"Lövnes"|
"Lü"|
"Lübeck"|
"Lucía"|
"Lückenkemper"|
"Luís"|
"Lukás"|
"Lukáš"|
"Lúthersdóttir"|
"Madaí"|
"Madarász"|
"Mäe"|
"Magallán"|
"Mägi"|
"Mahé"|
"Maíla"|
"Majdán"|
"Mäkelä"|
"Mandátová"|
"Mané"|
"Mangué"|
"Marc-André's"|
"Márcio"|
"Maréchal"|
"Marí"|
"Mária"|
"María"|
"Marílson"|
"Marín"|
"Mariño"|
"Mário"|
"Márk"|
"Marozsán"|
"Márquez"|
"Martí"|
"Martín"|
"Martínek"|
"Martínez"|
"Márton"|
"Massó"|
"Mätas"|
"Máté"|
"Matías"|
"Matús"|
"Maurício"|
"Máximo"|
"Meité"|
"Mélanie"|
"Mélina"|
"Méline"|
"Méndez"|
"Meroúsis"|
"Micheál"|
"Michèle"|
"Mihaíl"|
"Mijaín"|
"Miklós"|
"Millán"|
"Miltiádis"|
"Moisés"|
"Mokosová"|
"Molnár"|
"Mónaco"|
"Monáe"|
"Mónica"|
"Mónika"|
"Montaño"|
"Morén"|
"Mörk"|
"Mörö"|
"Müller"|
"Muñiz"|
"Muñoz"|
"Murúa"|
"Nádia"|
"Naïm"|
"Natália"|
"Negrón"|
"Németh"|
"Néstor"|
"Niccolò"|
"Nicolás"|
"Niinistö"|
"Nóbrega"|
"Noélie"|
"Noémie"|
"Nordén"|
"Núbia"|
"Nuñez"|
"Ódorová"|
"Öhrström"|
"Ólafur"|
"Óleo"|
"Opatrný"|
"Orbán"|
"Ordóñez"|
"Oréane"|
"Ortíz"|
"Óscar"|
"Ozlü"|
"Ozyüksel"|
"Pääbo"|
"Pabón"|
"Padacké"|
"Pádraig"|
"Páez"|
"Pajón"|
"Pál"|
"Palát"|
"Panayióta"|
"Pär"|
"Paré"|
"Pärt"|
"Patiño"|
"Patrícia"|
"Patrocínio"|
"Pattantyús"|
"Pavón"|
"Péché"|
"Péchoux"|
"Pelikán"|
"Peña"|
"Peñate"|
"Pénélope"|
"Péni"|
"Pépin"|
"Pérez"|
"Perón"|
"Pétain"|
"Petchamé"|
"Péter"|
"Pétervári"|
"Philémon"|
"Phúc"|
"Piétrus"|
"Pinzón"|
"Pité"|
"Pitkämäki"|
"Poésy"|
"Pokorný"|
"Polívka"|
"Póta"|
"Préval"|
"Prokopová"|
"Puigcercós"|
"Pürevjargalyn"|
"Putálová"|
"Quiñones"|
"Quiñonez"|
"Quintillà"|
"Quvenzhané"|
"Rácz"|
"Ramírez"|
"Raúl"|
"Řebíček"|
"Récsei"|
"Rédli"|
"Réka"|
"Rémi"|
"Renáta"|
"Rendón"|
"René"|
"Renée"|
"Rénelle"|
"Rentería"|
"Repcík"|
"Reséndiz"|
"Rézola"|
"Ribéry"|
"Richárd"|
"Ríga"|
"Robenílson"|
"Róbert"|
"Róchez"|
"Rocío"|
"Rodríguez"|
"Rogério"|
"Rolfö"|
"Román"|
"Romová"|
"Rónald"|
"Rosário"|
"Rubén"|
"Rühr"|
"Ruíz"|
"Sá"|
"Saborío"|
"Sagardía"|
"Sallói"|
"Salomé"|
"Salvadó"|
"Samassékou"|
"Sánchez"|
"Sandé"|
"Sándor"|
"Sardá"|
"Sárosi"|
"Sátila"|
"Saúl"|
"Saunière"|
"Savón"|
"Scalamandré"|
"Schächter"|
"Schäfer"|
"Schäuble"|
"Schlögl"|
"Schmiedlová"|
"Schön"|
"Schröder"|
"Schüpbac"|
"Schüssel"|
"Schütze"|
"Séamus"|
"Seán"|
"Sebastián"|
"Sébastien"|
"Sebestyén"|
"Sélom"|
"Sène"|
"Senyürek"|
"Seppälä"|
"Sepúlveda"|
"Sérgio"|
"Shkëlzen"|
"Sicília"|
"Silfvén"|
"Siljamäki"|
"Sinéad"|
"Sjåstad"|
"Sjöberg"|
"Sjödin"|
"Sjöström"|
"Skantár"|
"Söderberg"|
"Söderling"|
"Sofía"|
"Solé"|
"Solís"|
"Söllner"|
"Somorácz"|
"Sörenstam"|
"Ståhl"|
"Ståle"|
"Stefanídi"|
"Stéphane"|
"Stéphanie"|
"Strålman"|
"Strömberg"|
"Stübe"|
"Studničková"|
"Suárez"|
"Šuláková"|
"Süle"|
"Süleyman"|
"Švácha"|
"Svärd"|
"Svennerstål"|
"Szabián"|
"Szabó"|
"Szász"|
"Szilágyi"|
"Szomolányi"|
"Szücs"|
"Szwarnóg"|
"Taaramäe"|
"Tabaré"|
"Tainá"|
"Takács"|
"Támara"|
"Tamás"|
"Tarragó"|
"Tazegül"|
"Tcheuméo"|
"Tchórz"|
"Téa"|
"Tentóglou"|
"Teófilo"|
"Teré"|
"Tévez"|
"Thaísa"|
"Théo"|
"Théophile"|
"Thérèse"|
"Théry"|
"Thiéry"|
"Tímea"|
"Tió"|
"Todenhöfer"|
"Tomáš"|
"Tomorkhüleg"|
"Tõnu"|
"Topolánek"|
"Tormé"|
"Tornéus"|
"Törnroos"|
"Török"|
"Tórrez"|
"Tórtola"|
"Tóth"|
"Touadéra"|
"Tramèr"|
"Traoré"|
"Träsch"|
"Trévor"|
"Tsinopoúlou"|
"Túñez"|
"Türk"|
"Tüvshinbat"|
"Tüvshinbayar"|
"Üitümen"|
"Ünal"|
"Ungvári"|
"Urán"|
"Úrsula"|
"Üstündag"|
"Václav"|
"Valdés"|
"Valentín"|
"Valérian"|
"Valériane"|
"Välimäki"|
"Vallée"|
"Vámos"|
"Vásquez"|
"Vázquez"|
"Velázquez"|
"Veldáková"|
"Venyercsán"|
"Veréb"|
"Verón"|
"Verrasztó"|
"Víctor"|
"Victória"|
"Viktória"|
"Vilató"|
"Villaécija"|
"Villafría"|
"Vinícius"|
"Viñolas"|
"Vitória"|
"Vladimír"|
"Wallén"|
"Wálter"|
"Wanyá"|
"Wé"|
"Wéverton"|
"Wikström"|
"Xénia"|
"Yáñez"|
"Younés"|
"Zagré"|
"Zalánki"|
"Zelená"|
"Zélia"|
"Zoltán"|
"Zságer"|
"Zsófia");
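A rule of this shape does not have to be typed by hand; it can be generated from the word list. The Python sketch below assumes one name per line and reproduces the layout of the rule above. The function name `make_rule` and the escaping of backslashes and double quotes are additions for the illustration, not part of the attached files.

```python
from typing import Iterable

def make_rule(name: str, words: Iterable[str]) -> str:
    """Build a re2c named definition that is an alternation of string literals."""
    literals = []
    for word in words:
        # Escape characters that would otherwise end or alter the literal.
        escaped = word.replace("\\", "\\\\").replace('"', '\\"')
        literals.append(f'"{escaped}"')
    return f"{name} = ( " + "|\n".join(literals) + ");"

print(make_rule("utfExamples", ["Abadía", "Åberg", "Abián"]))
```

The generated text contains literal UTF-8 characters, so it relies on re2c reading the source as UTF-8 rather than byte by byte.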
Thanks! I added a test (it returns 0 for all the names on the list): https://github.com/skvadrik/re2c/blob/a00dc4871106ea39ef84f47bb840a018b17cea25/test/encodings/utf8_names.i8--input-encoding(utf8).re There is an error in the name
This is great! When can we expect the next re2c release? I can't wait to re2c:include a Unicode character classes library and define character classes with literal UTF8 strings in them!
Soon, soon, really soon! I know I said this a couple of times before, such a shame...
2.0.3 (2020-08-22)
~~~~~~~~~~~~~~~~~~

- Fix issues when building re2c as a CMake subproject (`#302 <https://github.com/skvadrik/re2c/pull/302>`_:
- Final corrections in the SIMPA article "RE2C: A lexer generator based on lookahead-TDFA", https://doi.org/10.1016/j.simpa.2020.100027

2.0.2 (2020-08-08)
~~~~~~~~~~~~~~~~~~

- Enable re2go building by default.
- Package CMake files into release tarball.

2.0.1 (2020-07-29)
~~~~~~~~~~~~~~~~~~

- Updated version for CMake build system (forgotten in release 2.0).
- Added a short article about re2c for the Software Impacts journal.

2.0 (2020-07-20)
~~~~~~~~~~~~~~~~

- Added new code generation backend for Go and a new ``re2go`` program (`#272 <https://github.com/skvadrik/re2c/issues/272>`_: Go support). Added option ``--lang <c | go>``.
- Added CMake build system as an alternative to Autotools (`#275 <https://github.com/skvadrik/re2c/pull/275>`_: Add a CMake build system (thanks to ligfx), `#244 <https://github.com/skvadrik/re2c/issues/244>`_: Switching to CMake).
- Changes in generic API:

  + Removed primitives ``YYSTAGPD`` and ``YYMTAGPD``.
  + Added primitives ``YYSHIFT``, ``YYSHIFTSTAG``, ``YYSHIFTMTAG`` that allow to express fixed tags in terms of generic API.
  + Added configurations ``re2c:api:style`` and ``re2c:api:sigil``.
  + Added named placeholders in interpolated configuration strings.

- Changes in reuse mode (``-r, --reuse`` option):

  + Do not reset API-related configurations in each `use:re2c` block (`#291 <https://github.com/skvadrik/re2c/issues/291>`_: Defines in rules block are not propagated to use blocks).
  + Use block-local options instead of last block options.
  + Do not accumulate options from rules/reuse blocks in whole-program options.
  + Generate non-overlapping YYFILL labels for reuse blocks.
  + Generate start label for each reuse block in storable state mode.

- Changes in start-conditions mode (``-c, --start-conditions`` option):

  + Allow to use normal (non-conditional) blocks in `-c` mode (`#263 <https://github.com/skvadrik/re2c/issues/263>`_: allow mixing conditional and non-conditional blocks with -c, `#296 <https://github.com/skvadrik/re2c/issues/296>`_: Conditions required for all lexers when using '-c' option).
  + Generate condition switch in every re2c block (`#295 <https://github.com/skvadrik/re2c/issues/295>`_: Condition switch generated for only one lexer per file).

- Changes in the generated labels:

  + Use ``yyeof`` label prefix instead of ``yyeofrule``.
  + Use ``yyfill`` label prefix instead of ``yyFillLabel``.
  + Decouple start label and initial label (affects label numbering).

- Removed undocumented configuration ``re2c:flags:o``, ``re2c:flags:output``.
- Changes in ``re2c:flags:t``, ``re2c:flags:type-header`` configuration: filename is now relative to the output file directory.
- Added option ``--case-ranges`` and configuration ``re2c:flags:case-ranges``.
- Extended fixed tags optimization for the case of fixed-counter repetition.
- Fixed bugs related to EOF rule:

  + `#276 <https://github.com/skvadrik/re2c/issues/276>`_: Example 01_fill.re in docs is broken
  + `#280 <https://github.com/skvadrik/re2c/issues/280>`_: EOF rules with multiple blocks
  + `#284 <https://github.com/skvadrik/re2c/issues/284>`_: mismatched YYBACKUP and YYRESTORE (Add missing fallback states with EOF rule)

- Fixed miscellaneous bugs:

  + `#286 <https://github.com/skvadrik/re2c/issues/286>`_: Incorrect submatch values with fixed-length trailing context.
  + `#297 <https://github.com/skvadrik/re2c/issues/297>`_: configure error on ubuntu 18.04 / cmake 3.10

- Changed bootstrap process (require explicit configuration flags and a path to re2c executable to regenerate the lexers).
- Added internal options ``--posix-prectable <naive | complex>``.
- Added debug option ``--dump-dfa-tree``.
- Major revision of the paper "Efficient POSIX submatch extraction on NFA".

----
1.3x
----

1.3 (2019-12-14)
~~~~~~~~~~~~~~~~

- Added option: ``--stadfa``.
- Added warning: ``-Wsentinel-in-midrule``.
- Added generic API primitives:

  + ``YYSTAGPD``
  + ``YYMTAGPD``

- Added configurations:

  + ``re2c:sentinel = 0;``
  + ``re2c:define:YYSTAGPD = "YYSTAGPD";``
  + ``re2c:define:YYMTAGPD = "YYMTAGPD";``

- Worked on reproducible builds (`#258 <https://github.com/skvadrik/re2c/pull/258>`_: Make the build reproducible).

----
1.2x
----

1.2.1 (2019-08-11)
~~~~~~~~~~~~~~~~~~

- Fixed bug `#253 <https://github.com/skvadrik/re2c/issues/253>`_: re2c should install unicode_categories.re somewhere.
- Fixed bug `#254 <https://github.com/skvadrik/re2c/issues/254>`_: Turn off re2c:eof = 0.

1.2 (2019-08-02)
~~~~~~~~~~~~~~~~

- Added EOF rule ``$`` and configuration ``re2c:eof``.
- Added ``/*!include:re2c ... */`` directive and ``-I`` option.
- Added ``/*!header:re2c:on*/`` and ``/*!header:re2c:off*/`` directives.
- Added ``--input-encoding <ascii | utf8>`` option.

  + `#237 <https://github.com/skvadrik/re2c/issues/237>`_: Handle non-ASCII encoded characters in regular expressions
  + `#250 <https://github.com/skvadrik/re2c/issues/250>`_ UTF8 enoding

- Added include file with a list of definitions for Unicode character classes.

  + `#235 <https://github.com/skvadrik/re2c/issues/235>`_: Unicode character classes

- Added ``--location-format <gnu | msvc>`` option.

  + `#195 <https://github.com/skvadrik/re2c/issues/195>`_: Please consider using Gnu format for error messages

- Added ``--verbose`` option that prints "success" message if re2c exits without errors.
- Added configurations for options:

  + ``-o --output`` (specify output file)
  + ``-t --type-header`` (specify header file)

- Removed configurations for internal/debug options.
- Extended ``-r`` option: allow to mix multiple ``/*!rules:re2c*/``, ``/*!use:re2c*/`` and ``/*!re2c*/`` blocks.

  + `#55 <https://github.com/skvadrik/re2c/issues/55>`_: allow standard re2c blocks in reuse mode

- Fixed ``-F --flex-support`` option: parsing and operator precedence.

  + `#229 <https://github.com/skvadrik/re2c/issues/229>`_: re2c option -F (flex syntax) broken
  + `#242 <https://github.com/skvadrik/re2c/issues/242>`_: Operator precedence with --flex-syntax is broken

- Changed difference operator ``/`` to apply before encoding expansion of operands.

  + `#236 <https://github.com/skvadrik/re2c/issues/236>`_: Support range difference with variable-length encodings

- Changed output generation of output file to be atomic.

  + `#245 <https://github.com/skvadrik/re2c/issues/245>`_: re2c output is not atomic

- Authored research paper "Efficient POSIX Submatch Extraction on NFA" together with Dr Angelo Borsotti.
- Added experimental libre2c library (``--enable-libs`` configure option) with the following algorithms:

  + TDFA with leftmost-greedy disambiguation
  + TDFA with POSIX disambiguation (Okui-Suzuki algorithm)
  + TNFA with leftmost-greedy disambiguation
  + TNFA with POSIX disambiguation (Okui-Suzuki algorithm)
  + TNFA with lazy POSIX disambiguation (Okui-Suzuki algorithm)
  + TNFA with POSIX disambiguation (Kuklewicz algorithm)
  + TNFA with POSIX disambiguation (Cox algorithm)

- Added debug subsystem (``--enable-debug`` configure option) and new debug options:

  + ``-dump-cfg`` (dump control flow graph of tag variables)
  + ``-dump-interf`` (dump interference table of tag variables)
  + ``-dump-closure-stats`` (dump epsilon-closure statistics)

- Added internal options:

  + ``--posix-closure <gor1 | gtop>`` (switch between shortest-path algorithms used for the construction of POSIX closure)

- Fixed a number of crashes found by American Fuzzy Lop fuzzer:

  + `#226 <https://github.com/skvadrik/re2c/issues/226>`_, `#227 <https://github.com/skvadrik/re2c/issues/227>`_, `#228 <https://github.com/skvadrik/re2c/issues/228>`_, `#231 <https://github.com/skvadrik/re2c/issues/231>`_, `#232 <https://github.com/skvadrik/re2c/issues/232>`_, `#233 <https://github.com/skvadrik/re2c/issues/233>`_, `#234 <https://github.com/skvadrik/re2c/issues/234>`_, `#238 <https://github.com/skvadrik/re2c/issues/238>`_

- Fixed handling of newlines:

  + correctly parse multi-character newlines CR LF in ``#line`` directives
  + consistently convert all newlines in the generated file to Unix-style LF

- Changed default tarball format from .gz to .xz.

  + `#221 <https://github.com/skvadrik/re2c/issues/221>`_: big source tarball

- Fixed a number of other bugs and resolved issues:

  + `#2 <https://github.com/skvadrik/re2c/issues/2>`_: abort
  + `#6 <https://github.com/skvadrik/re2c/issues/6>`_: segfault
  + `#10 <https://github.com/skvadrik/re2c/issues/10>`_: lessons/002_upn_calculator/calc_002 doesn't produce a useful example program
  + `#44 <https://github.com/skvadrik/re2c/issues/44>`_: Access violation when translating the attached file
  + `#49 <https://github.com/skvadrik/re2c/issues/49>`_: wildcard state \000 rules makes lexer behave weard
  + `#98 <https://github.com/skvadrik/re2c/issues/98>`_: Transparent handling of #line directives in input files
  + `#104 <https://github.com/skvadrik/re2c/issues/104>`_: Improve const-correctness
  + `#105 <https://github.com/skvadrik/re2c/issues/105>`_: Conversion of pointer parameters into references
  + `#114 <https://github.com/skvadrik/re2c/issues/114>`_: Possibility of fixing bug 2535084
  + `#120 <https://github.com/skvadrik/re2c/issues/120>`_: condition consisting of default rule only is ignored
  + `#167 <https://github.com/skvadrik/re2c/issues/167>`_: Add word boundary support
  + `#168 <https://github.com/skvadrik/re2c/issues/168>`_: Wikipedia's article on re2c
  + `#180 <https://github.com/skvadrik/re2c/issues/180>`_: Comment syntax?
  + `#182 <https://github.com/skvadrik/re2c/issues/182>`_: yych being set by YYPEEK () and then not used
  + `#196 <https://github.com/skvadrik/re2c/issues/196>`_: Implicit type conversion warnings
  + `#198 <https://github.com/skvadrik/re2c/issues/198>`_: no match for ‘operator!=’ in ‘i != std::vector<_Tp, _Alloc>::rend() [with _Tp = re2c::bitmap_t, _Alloc = std::allocator<re2c::bitmap_t>]()’
  + `#210 <https://github.com/skvadrik/re2c/issues/210>`_: How to build re2c in windows?
  + `#215 <https://github.com/skvadrik/re2c/issues/215>`_: A memory read overrun issue in s_to_n32_unsafe.cc
  + `#220 <https://github.com/skvadrik/re2c/issues/220>`_: src/dfa/dfa.h: simplify constructor to avoid g++-3.4 bug
  + `#223 <https://github.com/skvadrik/re2c/issues/223>`_: Fix typo
  + `#224 <https://github.com/skvadrik/re2c/issues/224>`_: src/dfa/closure_posix.cc: pack() tweaks
  + `#225 <https://github.com/skvadrik/re2c/issues/225>`_: Documentation link is broken in libre2c/README
  + `#230 <https://github.com/skvadrik/re2c/issues/230>`_: Changes for upcoming Travis' infra migration
  + `#239 <https://github.com/skvadrik/re2c/issues/239>`_: Push model example has wrong re2c invocation, breaks guide
  + `#241 <https://github.com/skvadrik/re2c/issues/241>`_: Guidance on how to use re2c for full-duplex command & response protocol
  + `#243 <https://github.com/skvadrik/re2c/issues/243>`_: A code generated for period (.) requires 4 bytes
  + `#246 <https://github.com/skvadrik/re2c/issues/246>`_: Please add a license to this repo
  + `#247 <https://github.com/skvadrik/re2c/issues/247>`_: Build failure on current Cygwin, probably caused by force-fed c++98 mode
  + `#248 <https://github.com/skvadrik/re2c/issues/248>`_: distcheck still looks for README
  + `#251 <https://github.com/skvadrik/re2c/issues/251>`_: Including what you use is find, but not without inclusion guards

- Updated documentation and website.
It appears there is a bug in the UTF8 encoding (at least for some characters)...
utf8bug.zip
In the attached file... there is a 2 byte UTF character which should be encoded as C3 A9 ... (if you copy/paste the UTF char into a file by itself, then use od -t x1, you will see that it is indeed C3 A9). The C3 in the generated parser is correct, but then generates 83 as the second target byte. I am using -8 on the command line. (If there is something I am doing wrong, or if there is a workaround, please let me know)