Bilingual text not properly divided into chunks #42

Open
zackw opened this issue Oct 22, 2015 · 0 comments

zackw commented Oct 22, 2015

The following test program takes an English/Italian bilingual text and uses CLD2 to divide it into chunks. This should be an easy case for chunk detection, since the language consistently changes at paragraph boundaries. However, only the first chunk boundary is reliably detected; most of the subsequent chunk boundaries are off by one or more words, and (as a consequence, I think) CLD2 fails to identify the language of some of the chunks, reporting them as "un" (unknown).

#include <vector>
#include <stdio.h>
#include <cld2/public/encodings.h>
#include <cld2/public/compact_lang_det.h>

static const char text[] =
"In my younger and more vulnerable years my father gave me some advice\n"
"that I've been turning over in my mind ever since.\n"
"\n"
"Nei miei anni più giovani e più vulnerabili mio padre mi diede un\n"
"consiglio su cui da allora non ho più smesso di rimuginare.\n"
"\n"
"«Whenever you feel like criticizing any one,» he told me, «just\n"
"remember that all the people in this world haven't had the advantages\n"
"that you've had.»\n"
"\n"
"«Quando ti viene voglia di criticare qualcuno» mi disse «ricordati\n"
"solo che non tutti a questo mondo hanno avuto i vantaggi che hai avuto\n"
"tu».\n"
"\n"
"He didn't say any more, but we've always been unusually communicative\n"
"in a reserved way, and I understood that he meant a great deal more\n"
"than that.\n"
"\n"
"Non disse altro, ma siamo sempre stati straordinariamente comunicativi\n"
"senza tante parole e capii che voleva dire molto di più di questo.\n"
"\n"
"In consequence, I'm inclined to reserve all judgments, a habit that\n"
"has opened up many curious natures to me and also made me the victim\n"
"of not a few veteran bores.\n"
"\n"
"Di conseguenza, sono inclìne a evitare ogni giudizio, un'abitudine che\n"
"mi ha rivelato molti caratteri strani e mi ha anche reso vittima di\n"
"non pochi rompiscatole di lungo corso.\n"
"\n"
"The abnormal mind is quick to detect and attach itself to this quality\n"
"when it appears in a normal person, and so it came about that in\n"
"college I was unjustly accused of being a politician, because I was\n"
"privy to the secret griefs of wild, unknown men.\n"
"\n"
"La mente anormale è lesta a riconoscere e a aggrapparsi a questa\n"
"qualità quand'essa si manifesta in una persona normale, e così accadde\n"
"che all'università fui ingiustamente accusato di essere un\n"
"politicante, perché ero al corrente delle pene segrete di uomini\n"
"sregolati e sconosciuti.\n";

int main(void)
{
  // No hints: let CLD2 work from the text alone.
  CLD2::CLDHints hints;
  hints.content_language_hint = 0;
  hints.tld_hint = 0;
  hints.encoding_hint = CLD2::UNKNOWN_ENCODING;
  hints.language_hint = CLD2::UNKNOWN_LANGUAGE;

  CLD2::Language top3[3];
  int pct3[3];
  double score3[3];
  CLD2::ResultChunkVector chunks;
  int text_bytes;
  bool reliable;

  // Arguments: buffer, length (excluding the trailing NUL),
  // is_plain_text = true, hints, flags = 0, then the output parameters.
  CLD2::ExtDetectLanguageSummary(text, sizeof text - 1,
                                 true, &hints, 0,
                                 top3, pct3, score3, &chunks,
                                 &text_bytes, &reliable);

  puts("<!doctype html><meta charset=\"utf-8\"><body>");
  CLD2::DumpResultChunkVector(stdout, text, &chunks);
  puts("</body>");
  return 0;
}

And here's the program output for me:

<!doctype html><meta charset="utf-8"><body>
DumpResultChunkVector[12]<br>
[0]{0 122 en}  <span style="background:#FFFFF4;color:#000000;">
In my younger and more vulnerable years my father gave me some advice that I&apos;ve been turning over in my mind ever since.  </span><br>
[1]{122 117 it}  <span style="background:#E3FFD8;color:#000000;">
Nei miei anni più giovani e più vulnerabili mio padre mi diede un consiglio su cui da allora non ho più smesso di </span><br>
[2]{239 38 un}  <span style="background:#FFFFFF;color:#B0B0B0;">
rimuginare.  «Whenever you feel like </span><br>
[3]{277 144 en}  <span style="background:#FFFFF4;color:#000000;">
criticizing any one,» he told me, «just remember that all the people in this world haven&apos;t had the advantages that you&apos;ve had.»  «Quando ti </span><br>
[4]{421 139 un}  <span style="background:#FFFFFF;color:#B0B0B0;">
viene voglia di criticare qualcuno» mi disse «ricordati solo che non tutti a questo mondo hanno avuto i vantaggi che hai avuto tu».  He </span><br>
[5]{560 147 en}  <span style="background:#FFFFF4;color:#000000;">
didn&apos;t say any more, but we&apos;ve always been unusually communicative in a reserved way, and I understood that he meant a great deal more than that.  </span><br>
[6]{707 131 it}  <span style="background:#E3FFD8;color:#000000;">
Non disse altro, ma siamo sempre stati straordinariamente comunicativi senza tante parole e capii che voleva dire molto di più di </span><br>
[7]{838 178 en}  <span style="background:#FFFFF4;color:#000000;">
questo.  In consequence, I&apos;m inclined to reserve all judgments, a habit that has opened up many curious natures to me and also made me the victim of not a few veteran bores.  Di </span><br>
[8]{1016 75 un}  <span style="background:#FFFFFF;color:#B0B0B0;">
conseguenza, sono inclìne a evitare ogni giudizio, un&apos;abitudine che mi ha </span><br>
[9]{1091 102 it}  <span style="background:#E3FFD8;color:#000000;">
rivelato molti caratteri strani e mi ha anche reso vittima di non pochi rompiscatole di lungo corso.  </span><br>
[10]{1193 283 en}  <span style="background:#FFFFF4;color:#000000;">
The abnormal mind is quick to detect and attach itself to this quality when it appears in a normal person, and so it came about that in college I was unjustly accused of being a politician, because I was privy to the secret griefs of wild, unknown men.  La mente anormale è lesta a </span><br>
[11]{1476 261 it}  <span style="background:#E3FFD8;color:#000000;">
riconoscere e a aggrapparsi a questa qualità quand&apos;essa si manifesta in una persona normale, e così accadde che all&apos;università fui ingiustamente accusato di essere un politicante, perché ero al corrente delle pene segrete di uomini sregolati e sconosciuti. </span><br>
<br>
</body>
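
To check the reported offsets mechanically rather than by eye, something like the following could be used. It is only a sketch and is independent of CLD2: `paragraph_starts` and `misplaced_boundaries` are hypothetical helper names, and it assumes paragraphs are separated by exactly one blank line ("\n\n"), as in the test text above. The chunk offsets would be copied from the `{offset bytes lang}` triples in the dump.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Byte offsets at which each paragraph begins. Assumes paragraphs are
// separated by exactly one blank line ("\n\n"), as in the test text.
std::vector<std::size_t> paragraph_starts(const std::string& text)
{
  std::vector<std::size_t> starts;
  if (!text.empty())
    starts.push_back(0);
  std::size_t pos = 0;
  while ((pos = text.find("\n\n", pos)) != std::string::npos) {
    pos += 2;                      // skip past the blank line
    if (pos < text.size())
      starts.push_back(pos);
  }
  return starts;
}

// Chunk start offsets that do not coincide with any paragraph start.
std::vector<std::size_t> misplaced_boundaries(
    const std::vector<std::size_t>& chunk_offsets,
    const std::vector<std::size_t>& para_starts)
{
  std::vector<std::size_t> bad;
  for (std::size_t off : chunk_offsets) {
    if (std::find(para_starts.begin(), para_starts.end(), off)
        == para_starts.end())
      bad.push_back(off);
  }
  return bad;
}
```

Passing the test text to `paragraph_starts` and the offsets from the dump (0, 122, 239, 277, ...) to `misplaced_boundaries` should list the boundaries that fall inside a paragraph. Judging from the output above, offsets such as 239 ("rimuginare.") and 277 ("criticizing") would be among them, while 0 and 122 would not.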