Improve semantics and API for getting key columns from a table #25585

Open
hiltontj opened this issue Nov 22, 2024 · 4 comments
@hiltontj
Contributor

Problem

When working with a TableDefinition from the catalog, it is not clear how to get the columns that would constitute the series key for that table.

There is a series_key field on the TableDefinition, but (right now) it has a special meaning in the context of data written with the new experimental v3 write API. This is confusing because v1 tables do have a series key, but that field does not apply to v1 writes or their corresponding tables.

There are several places where we want to know the answer to the question: what are the columns that make up the series key for this table and what is their order?

The answer should be:

  • for a v1 table, the tag columns in lexicographical order by tag name
  • for a v3 table, the series_key defined on the table

There isn't a method that gives us this right now.

Proposed solution

Update to /api/v3/write_lp write path

For writes coming through the /api/v3/write_lp API, we would store the series key for v1 tables on the TableDefinition directly in the series_key field, as we do for writes that come through the /api/v3/write API. For /api/v3/write_lp, the order would be determined from the first write made to the table.

So, if the first write contains:

cpu,region=us,host=a <field_set> <time>

Then the series key will be [region, host], as ordered, and not [host, region], which would be the lexicographical order.
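
To illustrate the ordering, here is a minimal sketch of pulling tag keys out of a line protocol line in the order they were written. The function name `tag_keys_in_write_order` is hypothetical, and the parsing is deliberately simplified: it assumes no escaped commas, spaces, or equals signs, so it is not a stand-in for the real parser in validator.rs.

```rust
/// Returns the tag keys of one line protocol line, in the order written.
/// Hypothetical helper; does not handle escaped ',', ' ' or '='.
fn tag_keys_in_write_order(line: &str) -> Vec<String> {
    // The measurement name and tag set come before the first space.
    let head = line.split(' ').next().unwrap_or("");
    head.split(',')
        .skip(1) // the first element is the measurement name
        .filter_map(|pair| pair.split_once('='))
        .map(|(key, _value)| key.to_string())
        .collect()
}

/// The same keys sorted lexicographically, for comparison.
fn tag_keys_lexicographic(line: &str) -> Vec<String> {
    let mut keys = tag_keys_in_write_order(line);
    keys.sort();
    keys
}
```

For the line above, the first function yields region then host, while the sorted variant yields host then region.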

New tag columns are not allowed for writes made through the /api/v3/write_lp API, as is the case for the /api/v3/write API.

Writes made through the /api/[v1|v2]/write APIs

These should not change, and still need to accept writes containing new tag columns; otherwise, that would be a breaking change in system behaviour.

It should still be possible to write to tables that have an explicit series_key via these APIs, as long as all the members of the key are present.
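
As a rough sketch of that condition, the check could look like the following. The function name and the use of plain string slices for column names are assumptions made for illustration; the catalog works with its own column types.

```rust
use std::collections::HashSet;

/// Returns the members of the series key that the write does not provide.
/// An empty result means the write is acceptable under this rule.
/// Hypothetical helper; extra tags in the write are not rejected here.
fn missing_key_members<'a>(series_key: &[&'a str], write_tags: &[&str]) -> Vec<&'a str> {
    let present: HashSet<&str> = write_tags.iter().copied().collect();
    series_key
        .iter()
        .copied()
        .filter(|key| !present.contains(*key))
        .collect()
}
```

A write carrying host, region and rack against a key of [region, host] would pass, while a write carrying only host would report region as missing.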

Updates to TableDefinition API

Add an API to the TableDefinition type, series_key_column_ids, that provides the series key for a table. If the table definition has a series_key specified, it can use that directly; otherwise, it needs to derive the key by scanning for columns that are tags and sorting them into lexicographical order by name.
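
A minimal sketch of what such a method might look like, under stated assumptions: the types below (`ColumnId`, `ColumnKind`, `ColumnDefinition`, and the shape of `TableDefinition`) are simplified stand-ins invented for this example, not the actual catalog types. Only the method name comes from the proposal.

```rust
use std::collections::BTreeMap;

// Hypothetical, simplified stand-ins for the catalog types.
type ColumnId = u32;

#[derive(Clone, Copy, PartialEq)]
enum ColumnKind {
    Tag,
    Field,
    Time,
}

struct ColumnDefinition {
    name: String,
    kind: ColumnKind,
}

struct TableDefinition {
    columns: BTreeMap<ColumnId, ColumnDefinition>,
    /// Explicit series key, if one was defined on the table.
    series_key: Option<Vec<ColumnId>>,
}

impl TableDefinition {
    /// Column ids that make up the series key, in key order.
    fn series_key_column_ids(&self) -> Vec<ColumnId> {
        // An explicit series key is used as-is.
        if let Some(key) = &self.series_key {
            return key.clone();
        }
        // Otherwise derive it: tag columns sorted by name.
        let mut tags: Vec<(&str, ColumnId)> = self
            .columns
            .iter()
            .filter(|(_, col)| col.kind == ColumnKind::Tag)
            .map(|(id, col)| (col.name.as_str(), *id))
            .collect();
        tags.sort();
        tags.into_iter().map(|(_, id)| id).collect()
    }
}
```

Returning an owned `Vec` keeps the two branches uniform; a real implementation might prefer to cache the derived key rather than sort on every call.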

Alternatives

  • We could still store the series key for tables that are written to via the v1/v2 APIs, but would need to update it when tag columns are added on new writes. If we do that, we may need a different way of differentiating "v3" tables other than the series_key field on the TableDefinition.

Additional context

The write path validation and parsing is currently handled in the validator.rs module where the handling of /api/v3/write writes branches from here, and everything else (including /api/v3/write_lp) branches from here.

The Schema type from influxdb3_core has the notion of a primary key, in which it scans the columns for tags, sorts them, then adds the time column. See here.

The Schema can also support the new series_key. See here. We likely should leverage that if possible for writes through /api/v3/write_lp.

hiltontj added the v3 label Nov 22, 2024
hiltontj changed the title from "Semantics and API for getting _key_ columns from a table are unclear" to "Semantics and API for getting key columns from a table are unclear" Nov 22, 2024
hiltontj changed the title from "Semantics and API for getting key columns from a table are unclear" to "Improve semantics and API for getting key columns from a table" Nov 22, 2024
@pauldix
Member

pauldix commented Nov 23, 2024

I believe there are a bunch of places where it is assumed that the series key is fixed because of the sort specified on Parquet files. We don't keep the sort order of the Parquet files anywhere, so we always use the series key as the sort order. If the series key changes, then what we say the sort order is will be wrong in the future.

Rather than supporting changing series keys (i.e. adding new tags), I'd rather just lock that down for now. So the v1 and v2 APIs should validate that the tags present are the same as what was in the first write sent that created the table.

If we don't do this, we'll have to store the series key for every Parquet file separately so it can be used at query and compaction time.
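
A sketch of the lock-down described above, with assumptions called out: `WriteError`, `NewTagNotAllowed`, and `validate_tags_locked` are hypothetical names, and this version only rejects tags that were not part of the table's original key. Whether a write that omits one of the original tags should also be rejected is left open here.

```rust
/// Hypothetical error type for this example.
#[derive(Debug, PartialEq)]
enum WriteError {
    NewTagNotAllowed { tag: String },
}

/// Rejects a write that introduces a tag outside the table's fixed key.
/// `series_key` holds the tag names from the write that created the table.
fn validate_tags_locked(series_key: &[&str], write_tags: &[&str]) -> Result<(), WriteError> {
    for tag in write_tags {
        if !series_key.contains(tag) {
            return Err(WriteError::NewTagNotAllowed {
                tag: (*tag).to_string(),
            });
        }
    }
    Ok(())
}
```

The linear `contains` scan is fine for the handful of tags a table usually has; a set would be the obvious swap if keys grew large.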

@hiltontj
Contributor Author

> Rather than supporting changing series keys (i.e. adding new tags), I'd rather just lock that down for now. So the v1 and v2 APIs should validate that the tags present are the same as what was in the first write sent that created the table.

I would be okay with that. The guidance for users who want to add new tag columns after a table has been created is to either use string fields or drop the table and recreate it.

@mgattozzi
Contributor

So what I'm getting from this is that, for now, we want to return an error in the HTTP API for v1/v2 and v3 if a new tag is added, and we can worry about maintaining compatibility later. Is that correct?

@hiltontj hiltontj assigned hiltontj and mgattozzi and unassigned hiltontj Nov 25, 2024
@pauldix
Member

pauldix commented Nov 25, 2024

@mgattozzi that's correct
