Improve semantics and API for getting key columns from a table #25585

hiltontj · 2024-11-22T17:37:26Z

Problem

When working with a TableDefinition from the catalog, it is not clear how to get the columns that would constitute the series key for that table.

There is a series_key on the TableDefinition, but (right now) this has a special meaning in the context of data written with the new experimental v3 write API. This is a bit confusing because v1 tables do have a series key, but that field does not apply to v1 writes and their corresponding tables.

There are several places where we want to know the answer to the question: what are the columns that make up the series key for this table and what is their order?

The answer should be:

- for a v1 table, the tag columns in lexicographical order by tag name
- for a v3 table, the series_key defined on the table

There isn't a method that gives us this right now.

Proposed solution

Update to `/api/v3/write_lp` write path

For writes coming through the /api/v3/write_lp API, we store the series key for v1 tables on the TableDefinition directly in the series_key field, as we do for writes that come through the /api/v3/write API. The order in the former case would be determined from the first write made to the table.

So, if the first write contains:

cpu,region=us,host=a <field_set> <time>

Then the series key will be [region, host], as ordered in the write, and not [host, region], which would be the lexicographical order.

New tag columns are not allowed for writes made through the /api/v3/write_lp API, as is the case for the /api/v3/write API.
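To make the ordering rule and the new-tag restriction concrete, here is a minimal sketch. Both function names are hypothetical and are not taken from validator.rs, and the parsing is deliberately simplified: it assumes the line contains no escaped commas, spaces, or equals signs.

```rust
/// Extract tag keys, in the order they were written, from one line of
/// line protocol. Simplified: escape sequences are not handled.
fn tag_keys_in_write_order(line: &str) -> Vec<String> {
    // The first whitespace-delimited token is `measurement,tag1=v1,tag2=v2`.
    let head = line.split_whitespace().next().unwrap_or("");
    head.split(',')
        .skip(1) // skip the measurement name
        .filter_map(|pair| pair.split_once('=').map(|(k, _)| k.to_string()))
        .collect()
}

/// Hypothetical check: reject a write that carries a tag which is not
/// already a member of the table's series key.
fn validate_tags(series_key: &[String], line: &str) -> Result<(), String> {
    for key in tag_keys_in_write_order(line) {
        if !series_key.contains(&key) {
            return Err(format!("new tag column not allowed: {key}"));
        }
    }
    Ok(())
}
```

With the first write from the example above, `tag_keys_in_write_order` yields region followed by host. A later write that adds, say, a rack tag would fail `validate_tags`, while a write that lists the same tags in a different order would pass.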

Writes made through the `/api/[v1|v2]/write` APIs

These should not change and still need to accept writes containing new tag columns; rejecting them would be a breaking change in system behaviour.

It should still be possible to write to tables that have an explicit series_key via these APIs, as long as all the members of the key are present.
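The "all members present" condition could be checked along these lines. This is only a sketch: `missing_key_members` is a hypothetical helper, and tag names are passed as plain strings rather than catalog column IDs.

```rust
/// Returns the series key members that the write does not supply.
/// An empty result means every member of the key is present; extra
/// tags in the write are ignored by this check.
fn missing_key_members<'a>(series_key: &[&'a str], tags_in_write: &[&str]) -> Vec<&'a str> {
    series_key
        .iter()
        .copied()
        .filter(|member| !tags_in_write.contains(member))
        .collect()
}
```

A write would be accepted only when the returned list is empty.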

Updates to `TableDefinition` API

Add a method to the TableDefinition type, series_key_column_ids, that provides the series key for a table. If the table definition has a series_key specified, the method can return that directly; otherwise, it needs to derive the key by scanning for columns that are tags and sorting them into lexicographical order by name.
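A rough sketch of what such a method might look like follows. The types here are simplified stand-ins written for illustration; the real TableDefinition, its column storage, and its ID types in the catalog will differ, and only the name series_key_column_ids comes from this proposal.

```rust
use std::collections::BTreeMap;

// Illustrative stand-ins, not the real catalog types.
type ColumnId = u32;

#[derive(Clone, Copy, PartialEq)]
enum ColumnKind {
    Tag,
    Field,
    Time,
}

struct ColumnDefinition {
    name: String,
    kind: ColumnKind,
}

struct TableDefinition {
    columns: BTreeMap<ColumnId, ColumnDefinition>,
    /// Set when the table has an explicit series key.
    series_key: Option<Vec<ColumnId>>,
}

impl TableDefinition {
    /// Column IDs that make up the series key, in key order.
    fn series_key_column_ids(&self) -> Vec<ColumnId> {
        // An explicit series key is used as-is.
        if let Some(key) = &self.series_key {
            return key.clone();
        }
        // Otherwise, derive it: collect tag columns and sort by name.
        let mut tags: Vec<(&str, ColumnId)> = self
            .columns
            .iter()
            .filter(|(_, col)| col.kind == ColumnKind::Tag)
            .map(|(id, col)| (col.name.as_str(), *id))
            .collect();
        tags.sort();
        tags.into_iter().map(|(_, id)| id).collect()
    }
}
```

Returning an owned Vec keeps the two branches uniform. An implementation that wants to avoid the clone for the explicit-key case could return a borrowed or copy-on-write slice instead.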

Alternatives

We could still store the series key for tables that are written to via the v1/v2 APIs, but would need to update it when tag columns are added on new writes. If we do that, we may need a different way of differentiating "v3" tables other than the series_key field on the TableDefinition.

Additional context

The write path validation and parsing is currently handled in the validator.rs module where the handling of /api/v3/write writes branches from here, and everything else (including /api/v3/write_lp) branches from here.

The Schema type from influxdb3_core has the notion of a primary key, in which it scans the columns for tags, sorts them, then adds the time column. See here.

The Schema also can support the new series_key. See here. We likely should leverage that if possible for writes through the /api/v3/write_lp.

pauldix · 2024-11-23T18:45:10Z

I believe there are a bunch of places where it is assumed that the series key is fixed because of the sort specified on Parquet files. We don't keep the sort order of the parquet files anywhere so we always use the series key as that. If that changes, then what we say the sort order is will be wrong in the future.

Rather than supporting changing series keys (i.e. adding new tags), I'd rather just lock that down for now. So the v1 and v2 APIs should validate that the tags present are the same as what was in the first write sent that created the table.

If we don't do this, we'll have to store the series key for every Parquet file separately so it can be used at query and compaction time.

hiltontj · 2024-11-23T19:01:48Z

> Rather than supporting changing series keys (i.e. adding new tags), I'd rather just lock that down for now. So the v1 and v2 APIs should validate that the tags present are the same as what was in the first write sent that created the table.

I would be okay with that. The guidance for users that would want to add new tag columns after a table has been created is to either use string fields, or to drop the table and recreate.

mgattozzi · 2024-11-25T15:37:59Z

So what I'm getting from this is that, for now, we want to return an error in the HTTP API for v1/v2 and v3 if a new tag is added, and we can worry about maintaining compatibility later. Is that correct?

pauldix · 2024-11-25T19:02:34Z

@mgattozzi that's correct

hiltontj added the v3 label Nov 22, 2024

hiltontj changed the title ~~Semantics and API for getting _key_ columns from a table are unclear~~ Semantics and API for getting key columns from a table are unclear Nov 22, 2024

hiltontj changed the title ~~Semantics and API for getting key columns from a table are unclear~~ Improve semantics and API for getting key columns from a table Nov 22, 2024

hiltontj assigned hiltontj and mgattozzi and unassigned hiltontj Nov 25, 2024

hiltontj mentioned this issue Nov 27, 2024

Additional testing on the metadata cache #25564

Open
