-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CSE shorthand alias #10868
CSE shorthand alias #10868
Conversation
@@ -1080,8 +1080,8 @@ query TT | |||
explain select a/2, a/2 + 1 from t | |||
---- | |||
logical_plan | |||
01)Projection: {t.a / Int64(2)|{Int64(2)}|{t.a}} AS t.a / Int64(2), {t.a / Int64(2)|{Int64(2)}|{t.a}} AS t.a / Int64(2) + Int64(1) | |||
02)--Projection: t.a / Int64(2) AS {t.a / Int64(2)|{Int64(2)}|{t.a}} | |||
01)Projection: #1 AS t.a / Int64(2), #1 AS t.a / Int64(2) + Int64(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Second projection is #1 + 1
, but the alias makes it hard to read. Can't figure out how to fix this, appreciate any feedback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But don't we need to keep the AS t.a / Int64(2)
alias in this case so as to keep the original top level column name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It must be kept for the first projection, yes.
Ideally, the second projection IMO should be #1 + 1 AS ...
.
9fb7f98
to
72e16a4
Compare
@@ -166,6 +166,15 @@ impl CommonSubexprEliminate { | |||
) -> Result<(Vec<Vec<Expr>>, LogicalPlan)> { | |||
let mut common_exprs = IndexMap::new(); | |||
|
|||
input.schema().iter().for_each(|(qualifier, field)| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should suppose that all extracted aliases remain in the plan:
- Let's suppose that we have a plan like this after some CSE optimization:
... Projection: #1 as c1, #1 as c2, #2 as c3, #2 as c4, (a + 2) as c5, (a + 1 + 1) as c6 Projection: (a + b) as "#1", (a + c) as "#2", a ...
- Then some other optimization rule prunes "c1" and "c2" from the plan because they turn out to be unnecessary:
... Projection: #2 as c3, #2 as c4, (a + 2) as c5, (a + 1 + 1) as c6 Projection: (a + c) as "#2", a ...
- And then some other rule creates new CSE possibilities:
... Projection: #2 as c3, #2 as c4, (a + 2) as c5, (a + 2) as c6 Projection: (a + c) as "#2", a ...
- CSE rule runs again but indexes here and at
build_common_expr_project_plan()
are out of sync...
IMO the best thing we can do is to choose a unique aliases for a common expressions in CommonSubexprRewriter
when we found the expression and store the alias in common_exprs
together with the expression. In that case we don't need to deal with index sync issues and don't get plans with unnecessary aliases like here: https://github.com/apache/datafusion/pull/10868/files#diff-351499880963d6a383c92e156e75019cd9ce33107724a9635853d7d4cd1898d0R1403
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both issues don't affect correctness.
One thing I'd like to point out is that adding unused columns (all input's columns) in intermediate projection is the behavior of current CSE, it's not introduced in this PR. You can try copying the new test and running it against main. You'll get this output.
let plan = LogicalPlanBuilder::from(table_scan.clone())
.project(vec![(col("a") + col("b")).alias("#1"), col("c")])?
.project(vec![
(col("c") + lit(2)).alias("c3"),
(col("c") + lit(2)).alias("c4"),
])?
.build()?;
Projection: {test.c + Int32(2)|{Int32(2)}|{test.c}} AS test.c + Int32(2) AS c3, {test.c + Int32(2)|{Int32(2)}|{test.c}} AS test.c + Int32(2) AS c4
Projection: test.c + Int32(2) AS {test.c + Int32(2)|{Int32(2)}|{test.c}}, #1, test.c
Projection: test.a + test.b AS #1, test.c
TableScan: test
Extra projections are removed by other rules, so the final plan doesn't contain these projections.
Also, you may have noticed that extra projections make the aliases "out-of-sync" and to be honest I don't mind the #2
instead of #1
(as long as it's not something ridiculous like #1023
for example), and I don't see a way to fix that without patching some hacky global state/counter or asking other rules to reuse aliases when removing the extra projections.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, what I meant by "idexes go out of sync" is that if your modified CSE rule runs on a plan that we got in the 3rd step (i.e. there is no #1
in the plan) e.g.:
let plan = LogicalPlanBuilder::from(table_scan.clone())
.project(vec![(col("a") + col("b")).alias("#2"), col("c")])?
.project(vec![
col("#2").alias("c1"),
col("#2").alias("c2"),
(col("c") + lit(2)).alias("c3"),
(col("c") + lit(2)).alias("c4"),
])?
.build()?;
then it produces an incorrect plan:
Projection: #2 AS c1, #2 AS c2, #2 AS c3, #2 AS c4 // The issue here is that `#2` gets aliased to `#1` below, but `#2` doesn't change here.
Projection: #2 AS #1, test.c + Int32(2) AS #2, test.c
Projection: test.a + test.b AS #2, test.c
TableScan: test
This is because you inject #2
into common_exprs
, but you don't inject it to expr_stats
(and others).
IMO modifying common_exprs
is hacky if you don't do it in CommonSubexprRewriter
, that's why I suggested the solution in my previous comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed index usage; now we keep the original alias inside the common_exprs
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now this starts to look as the suggested because we assign the unique aliases in CommonSubexprRewriter
and store it in common_exprs
together with the common expression.
But why do you still inject previous #
aliases to common_exprs
? I think you just need to find the biggest one here and pass that number to CommonSubexprRewriter
and simply start assigning new #
aliases in f_down()
from that number + 1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need a solution that can produce a unique alias fast. There is no problem with having gaps if we can do it constant time (vs. no gaps with linear time to the number of common expressions).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
n is usually really small. I don't think this is a big performance hit, and so filling the gaps is a good tradeoff IMO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But why do you want to fill the gaps? These are artifical aliases so having consecutive numbers has no use, all that matter is they are short, unique and easy to read. Also, if you don't inject anything into common_exprs
then the pontless #1 AS #1
aliases won't get added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH I don't think I'll be able to do that anytime soon.
If that's the only remaining issue, I can mark the PR as ready and a maintainer can push that change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please do it and let me try to open a PR with the fix to your PR tomorrow or during the weekend.
@@ -342,7 +346,7 @@ impl CommonSubexprEliminate { | |||
Aggregate::try_new(Arc::new(new_input), new_group_expr, new_aggr_expr) | |||
.map(LogicalPlan::Aggregate) | |||
} else { | |||
let mut expr_number = common_exprs.len(); | |||
let mut expr_number = common_exprs.values().map(|t| t.1).max().unwrap_or(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs a test case
The failing CI is a simple The last thread between me and @peter-toth mentions some possible improvements, but these should be a separate follow-up PR. I'm extremely sorry for not being able to push any further changes. And thanks @peter-toth for partaking in this long, boring review process and helping me understand this rule better. |
Thanks @MohamedAbdeen21 for the PR! I fixed the clippy issue and made some changes to alias generation as it turned out that there is an
|
Thanks for the heads up @peter-toth -- I can handle any conflicts if they arise. I am still trying to get improved performance with that PR |
Hey @peter-toth sorry for the late reply. Looks good, and I like the Alias generator. But I have a couple of comments:
I don't mind either options, a PR against this one keeps the entire review history and can make reviews easier, but a new PR can be easier to maintain as you'll be able to directly push commits. |
No problem, I've opened #10939.
|
Became part of #10939. |
Which issue does this PR close?
Closes #10280.
Rationale for this change
Shorten aliases generated by CSE for readability
What changes are included in this PR?
Use shorthand numeric aliases (
#1
,#2
,#3
, ...) for common subexpressions. Shouldn't cause same conflicts as #10333 as it still uses the same underlying expression identifier string.Are these changes tested?
Using existing tests + a couple of new tests
Are there any user-facing changes?
Better and more concise logical plans
cc @peter-toth