Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support data source sampling with TABLESAMPLE #13563

Open
theirix opened this issue Nov 25, 2024 · 3 comments
Open

Support data source sampling with TABLESAMPLE #13563

theirix opened this issue Nov 25, 2024 · 3 comments
Assignees
Labels
enhancement New feature or request

Comments

@theirix
Copy link
Contributor

theirix commented Nov 25, 2024

Is your feature request related to a problem or challenge?

It is helpful to have sampling support for queries to ease the exploration of data.

Describe the solution you'd like

It should be supported on the SQL level (SAMPLE or TABLESAMPLE syntax). The sampling construct should be passed to the table source so the sampling is performed at the scan plan (e.g. in an optimised parquet reader).

This feature could be implemented in three sequential stages:

  1. Support additional SQL syntax but fail in the physical plan builder
  2. Transparently convert to WHERE RANDOM() < P filter
  3. For eligible data sources push the sampling to the table source

Describe alternatives you've considered

It is possible to use WHERE RANDOM() < 0.1 selection (see discussion #13268 ), but the support in SQL is clearer.

Existing query engines and databases already implement sampling, but it is not in ANSI standard. There are different flavours, but essentially, they allow for specific sampling methods and percentages (or sometimes a number of rows) TABLESAMPLE [SYSTEM | BERNOULLI] (PERCENTAGE | ROWS)

DuckDB:

SELECT * FROM tbl TABLESAMPLE SYSTEM (10%),

PostgreSQL and Trino:

SELECT * FROM tbl TABLESAMPLE SYSTEM (10),

Spark

SELECT * FROM tbl TABLESAMPLE SYSTEM (10 PERCENT)

Clickhouse is different:

SELECT * FROM tbl SAMPLE 0.1

Additional context

Also requested in #11554. The filter for sampling was refined in #13268.

@alamb
Copy link
Contributor

alamb commented Nov 25, 2024

I looked around in sqlparser-rs briefly and it seems this syntax is not yet supported

https://github.com/search?q=repo%3Aapache%2Fdatafusion-sqlparser-rs%20tablesample&type=code

(though the keyword is)

I suggest adding support for this clause in sqlparser first and then we can add a rewrite pass that converts the TABLESAMPLE clause into a where clause, similar to what we do for SHOW TABLES:

fn show_tables_to_plan(
&self,
extended: bool,
full: bool,
db_name: Option<Ident>,
filter: Option<ShowStatementFilter>,
) -> Result<LogicalPlan> {
if self.has_table("information_schema", "tables") {
// We only support the basic "SHOW TABLES"
// https://github.com/apache/datafusion/issues/3188
if db_name.is_some() || filter.is_some() || full || extended {
plan_err!("Unsupported parameters to SHOW TABLES")
} else {
let query = "SELECT * FROM information_schema.tables;";
let mut rewrite = DFParser::parse_sql(query)?;
assert_eq!(rewrite.len(), 1);
self.statement_to_plan(rewrite.pop_front().unwrap()) // length of rewrite is 1
}
} else {
plan_err!("SHOW TABLES is not supported unless information_schema is enabled")
}
}

@theirix
Copy link
Contributor Author

theirix commented Nov 25, 2024

Thank you for the initial analysis. I will first submit a PR to datafusion-sqlparser-rs with extended grammar.

@theirix
Copy link
Contributor Author

theirix commented Nov 25, 2024

take

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants