Storing binary data in database is causing a proto to be changed #177
Comments
Of note: I also tried using only binary everywhere, but that gave me the same problem. I think this is because binary is translated to strin |
In ClickHouse, a String can contain an arbitrary sequence of bytes, so it can be used to store anything "binary". When we insert strings, the bytes are always stored as is, but when reading a String, we escape invalid UTF-8 byte sequences. This is because Elixir defines strings differently, as a sequence of Unicode characters, and libraries like Jason assume this, raising errors on invalid code points. Escaping invalid UTF-8 with � in Ch was the easiest fix at the time for this discrepancy in definitions. For more details and an example of how it works, please see https://github.com/plausible/ch#utf-8-in-rowbinary I'm open to changing this to allow reading String fields without any escaping. However, we need to decide where the escaping should occur instead: at which point in the chain from the ClickHouse SELECT through Ch/ecto_ch/Ecto/app to the Jason encode. |
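To make the mismatch concrete, here is a minimal sketch that uses only the Elixir standard library. The module name `Utf8Check` is hypothetical and is not part of Ch or ecto_ch. It only shows that an arbitrary byte sequence is a perfectly good Elixir binary while not being a valid `String`, which is the case where escaping on read applies.

```elixir
defmodule Utf8Check do
  # Hypothetical helper, for illustration only (not a Ch or ecto_ch API).
  # Classifies a binary as valid UTF-8 text or as opaque bytes.
  def classify(bin) when is_binary(bin) do
    if String.valid?(bin), do: :utf8_text, else: :opaque_bytes
  end
end

# "a", followed by an invalid (overlong) 4-byte sequence, followed by "b".
raw = <<0x61, 0xF0, 0x80, 0x80, 0x80, 0x62>>

# The bytes are a fine binary, but String.valid?/1 rejects them.
IO.inspect(byte_size(raw), label: "bytes")
IO.inspect(Utf8Check.classify(raw), label: "raw")
IO.inspect(Utf8Check.classify("ab"), label: "plain text")
```

A check like this could, in principle, run at whichever layer ends up owning the escaping; which layer that should be is the open question in this thread.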
Would it be possible to supply an option or a separate type for this, maybe? I worry that changing this in that chain could result in a breaking change for some users, and it would probably be easier to introduce a separate type. Actually, the documentation from Ch you linked for RowBinary is pretty much what I need; I'm just not sure how to use that in ecto_ch. When I looked through this repo I couldn't find any usages like that. Would I need to write a custom select method rather than using the default |
Sorry for the late reply... I've been thinking about this, and I think there is a way to skip UTF-8 escaping with type(t.bin, :binary):
TestRepo.query!("CREATE TABLE binary_test(bin String) ENGINE = Memory")
original_bin = "\x61\xF0\x80\x80\x80b"
# this is what happens in insert_all
TestRepo.query!(["INSERT INTO binary_test(bin) FORMAT RowBinary\n", byte_size(original_bin) | original_bin])
assert TestRepo.one(
from t in "binary_test",
select: %{default: t.bin, type_binary: type(t.bin, :binary), length: fragment("length(?)", t.bin)}
) == %{
default: _escaped = "a�b",
type_binary: original_bin, # right now this fails
length: byte_size(original_bin)
}
The hard part is extracting this type information; see ecto_ch/lib/ecto/adapters/clickhouse.ex, lines 327 to 357 in 6f4f750.
For the query in that test, it looks like this:
query_meta: %{
select: %{
take: [],
postprocess: {:map,
[
default: {:value, :any},
type_binary: {:value, :binary},
length: {:value, :any}
]},
from: :none,
assocs: [],
preprocess: []
},
sources: {{"binary_test", nil, nil}},
preloads: []
}
It's "hard" because this looks like an internal, undocumented API. I'll also ask for advice on the Ecto mailing list. Another approach is a PR into Ecto to move utf8 there into the loading of a |
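For reference, the INSERT in the test above works because RowBinary frames a String as an unsigned LEB128 (varint) length followed by the raw bytes. The sketch below is only an illustration of that framing; the module `RowBinarySketch` and its function names are hypothetical and are not the Ch implementation.

```elixir
defmodule RowBinarySketch do
  import Bitwise

  # Hypothetical helper: unsigned LEB128 encoding of a non-negative integer.
  # Values below 128 fit in a single byte.
  def varint(n) when is_integer(n) and n >= 0 and n < 128, do: <<n>>

  # Larger values emit the low 7 bits with the continuation bit set,
  # then recurse on the remaining bits.
  def varint(n) when is_integer(n) and n >= 128 do
    <<1::1, n::7, varint(n >>> 7)::bytes>>
  end

  # A RowBinary String is the varint byte length followed by the bytes as is.
  # Returns iodata, so the payload is not copied.
  def encode_string(bin) when is_binary(bin) do
    [varint(byte_size(bin)) | bin]
  end
end

original_bin = "\x61\xF0\x80\x80\x80b"
framed = IO.iodata_to_binary(RowBinarySketch.encode_string(original_bin))
IO.inspect(framed, binaries: :as_binaries)
```

Passing the bare `byte_size(original_bin)` integer into the iodata, as the test does, is equivalent only for strings shorter than 128 bytes, where the varint is a single byte.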
I am also looking to store binary data, and currently have to base64-encode it. |
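A minimal sketch of that base64 workaround, using only `Base` from the Elixir standard library. The module name `Base64Workaround` and its `dump`/`load` functions are hypothetical; this is not an ecto_ch type, just the round trip a custom type could wrap.

```elixir
defmodule Base64Workaround do
  # Hypothetical helpers, for illustration only.
  # dump/1 turns arbitrary bytes into ASCII text that survives UTF-8 escaping.
  def dump(bin) when is_binary(bin), do: Base.encode64(bin)

  # load/1 restores the original bytes; raises on malformed input.
  def load(text) when is_binary(text), do: Base.decode64!(text)
end

raw = <<0, 255, 0xF0, 0x80>>
stored = Base64Workaround.dump(raw)

IO.inspect(String.valid?(stored), label: "stored value is valid UTF-8")
IO.inspect(Base64Workaround.load(stored) == raw, label: "round trip preserved bytes")
```

The trade-off is size: base64 output is roughly a third larger than the input, and the stored value is no longer directly usable by ClickHouse byte-level functions.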
I have a custom type with which I'm attempting to serialize a protobuf to a binary and store it in the DB. I tried using the binary type, but moved to the string type, as that seemed a better fit for some of the examples I saw. I'm getting a really odd error, though, when I try to load from the DB. It appears as though the bytes stored in the database have been modified.
And I get this when trying to load:
I'd love to know what I'm doing wrong here, or whether the library is doing something specific to encode strings.