Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Storing binary data in database is causing a proto to be changed #177

Open
jaronoff97 opened this issue Jun 26, 2024 · 5 comments
Open

Storing binary data in database is causing a proto to be changed #177

jaronoff97 opened this issue Jun 26, 2024 · 5 comments

Comments

@jaronoff97
Copy link

jaronoff97 commented Jun 26, 2024

i have a custom type that i'm attempting to serialize a protobuf to a binary and store in the DB. I tried using the binary type, but moved to the string type as that seemed more fitting some of the examples I saw. I'm getting a really odd error though when i try to load from the DB. It appears as though the bytes stored in the database have been modified.


  use Ecto.Type
  @impl true
  ch_string = Ecto.ParameterizedType.init(Ch, type: "String")
  def type, do: unquote(Macro.escape(ch_string))

  @doc """
  Convert the map from the database back to
  the desired term.
  """
  @impl true
  def load(term) when is_binary(term) do
    IO.inspect(term, label: "Loaded binary term")
    IO.inspect(byte_size(term), label: "Loaded term byte size")

    case MyModule.MyObject.decode(term) do
      {:ok, decoded_term} ->
        IO.inspect(decoded_term, label: "Decoded term")
        IO.inspect(byte_size(decoded_term), label: "Decoded term byte size")
        {:ok, decoded_term}

      {:error, reason} ->
        IO.puts("Decoding error: #{reason}")
        {:error, reason}
    end
  end

  @doc """
  Converting the data structure to map for storage.
  """
  @impl true
  def dump(term) do
    encoded_term = MyModule.MyObject.encode(term)
    IO.inspect(term, label: "Term to be encoded")
    IO.inspect(encoded_term, label: "Encoded term")
    IO.inspect(byte_size(encoded_term), label: "Encoded term byte size")
    {:ok, encoded_term}
  end

And i get this when trying to load:

Encoded term: <<10, 136, 11, 10, 133, 11, 18, 130, 11, 10, 255, 10, 101, 120, 112, 111, 114,
  116, 101, 114, 115, 58, 10, 32, 32, 32, 32, 99, 108, 105, 99, 107, 104, 111,
  117, 115, 101, 58, 10, 32, 32, 32, 32, 32, 32, 32, 32, 99, 114, 101, ...>>
Encoded term byte size: 1419
Loaded binary term: <<10, 239, 191, 189, 11, 10, 239, 191, 189, 11, 18, 239, 191, 189, 11, 10, 239,
  191, 189, 10, 101, 120, 112, 111, 114, 116, 101, 114, 115, 58, 10, 32, 32, 32,
  32, 99, 108, 105, 99, 107, 104, 111, 117, 115, 101, 58, 10, 32, 32, 32, ...>>
Loaded term byte size: 1427
** (Protobuf.DecodeError) insufficient data decoding field config_map, expected <<10, 239, 191, 189, 11, 18, 239, 191, 189, 11, 10, 239, 191, 189, 10, 101, 120, 112, 111, 114, 116, 101, 114, 115, 58, 10, 32, 32, 32, 32, 99, 108, 105, 99, 107, 104, 111, 117, 115, 101, 58, 10, 32, 32, 32, 32, 32, 32, 32, 32, ...>> to be at least 24076271 bytes

I'd love to know what I'm doing wrong here, or if the library is doing something specific to encode strings?

@jaronoff97
Copy link
Author

jaronoff97 commented Jun 26, 2024

of note: i also tried using only binary everywhere, but that gave me the same problem. I think this is because binary is translated to strin

@ruslandoga
Copy link
Contributor

ruslandoga commented Jun 26, 2024

👋 @jaronoff97

In ClickHouse String can contain arbitrary sequence of bytes so they can be used to store anything "binary". When we insert strings, the bytes are always stored as is, but when reading a String, we escape invalid UTF-8 code points. This is because Elixir defines strings differently, as a sequence of Unicode characters, and libraries like Jason assume this, raising errors on invalid codepoints. Escaping invalid UTF-8 with � in Ch was the easiest fix at the time for this discrepancy in definitions. For more details and an example of how it works, please see https://github.com/plausible/ch#utf-8-in-rowbinary

I'm open to changing this to allow reading String fields without any escaping. However, we need to decide where the escaping should occur instead, at which point from ClickHouse SELECT through Ch/ecto_ch/Ecto/app to Jason encode.

@jaronoff97
Copy link
Author

Would it be possible to supply an option or a separate type for this maybe? I worry that changing this in that chain could result in a breaking change for some users and it would probably be easier to introduce a separate type.

Actually, looking at the documentation from Ch you linked for RowBinary is p much what I need, I guess i'm not sure how to use that in Ecto_Ch however. When i looked through this repo I couldn't find any usages like that. Would I need to write a custom select method rather than using the default Repo.All(MyModule)?

@ruslandoga
Copy link
Contributor

ruslandoga commented Jul 14, 2024

Sorry for a late reply...

I've been thinking about this and I think there is a way to skip utf8 escaping with type/2 macro in select. It doesn't work right now but I'll try to make this test pass next week: #183

TestRepo.query!("CREATE TABLE binary_test(bin String) ENGINE = Memory")

original_bin = "\x61\xF0\x80\x80\x80b"
# this is what happens in insert_all
TestRepo.query!(["INSERT INTO binary_test(bin) FORMAT RowBinary\n", byte_size(original_bin) | original_bin])

assert TestRepo.one(
  from t in "binary_test",
    select: %{default: t.bin, type_binary: type(t.bin, :binary), length: fragment("length(?)", t.bin)}
) == %{
  default: _escaped = "a�b",
  type_binary: original_bin, # right now this fails
  length: length(original_bin)
}

The hard part is extracting this type information from query_meta here

def execute(adapter_meta, query_meta, {:nocache, {operation, query}}, params, opts) do
sql = prepare_sql(operation, query, params)
opts =
case operation do
:all ->
[{:command, :select} | put_setting(opts, :readonly, 1)]
:delete_all ->
[{:command, :delete} | opts]
:alter_update_all ->
[{:command, :alter} | opts]
end
result = Ecto.Adapters.SQL.query!(adapter_meta, sql, params, put_source(opts, query_meta))
case operation do
:all ->
%{num_rows: num_rows, rows: rows} = result
{num_rows, rows}
:delete_all ->
# clickhouse doesn't give us any info on how many rows have been deleted
{0, nil}
:alter_update_all ->
# clickhouse doesn't give us any info on how many rows have been / will be altered
{0, nil}
end
end

For the query in that test it looks like this

  query_meta: %{
    select: %{
      take: [],
      postprocess: {:map,
       [
         default: {:value, :any},
         type_binary: {:value, :binary},
         length: {:value, :any}
       ]},
      from: :none,
      assocs: [],
      preprocess: []
    },
    sources: {{"binary_test", nil, nil}},
    preloads: []
  }

It's "hard" because this looks like internal, undocumented API. I'll also ask for advice on Ecto mailing list. Another approach is a PR into Ecto to move utf8 there into the loading of a :string type. Right now it allows any binary through.

@hkrutzer
Copy link
Contributor

I am also looking to store binary data, currently have to base64 encode it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants