CSE shorthand alias #10868

MohamedAbdeen21 · 2024-06-11T10:19:33Z

Which issue does this PR close?

Rationale for this change

Shorten aliases generated by CSE for readability

What changes are included in this PR?

Use shorthand numeric aliases (#1, #2, #3, ...) for common subexpressions. This shouldn't cause the same conflicts as #10333, since it still uses the same underlying expression identifier string.
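As a rough illustration of the idea (not the code in this PR), the sketch below keeps the long identifier string as the lookup key and only shortens the name that is displayed. `ShorthandAliases` and `alias_for` are hypothetical names, and the numbering only considers a single optimizer run.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the CSE bookkeeping: the long expression
/// identifier string stays the key, only the generated alias is short.
#[derive(Default)]
pub struct ShorthandAliases {
    by_identifier: HashMap<String, usize>,
}

impl ShorthandAliases {
    /// Returns the alias for `identifier`, assigning the next number the
    /// first time an identifier is seen and reusing it afterwards.
    pub fn alias_for(&mut self, identifier: &str) -> String {
        let next = self.by_identifier.len() + 1;
        let n = *self
            .by_identifier
            .entry(identifier.to_string())
            .or_insert(next);
        format!("#{n}")
    }
}
```

Asking twice for the same identifier returns the same alias, so two occurrences of one common subexpression map to one column.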

Are these changes tested?

Using existing tests + a couple of new tests

Are there any user-facing changes?

Better and more concise logical plans

cc @peter-toth

MohamedAbdeen21 · 2024-06-11T10:21:23Z

datafusion/sqllogictest/test_files/subquery.slt

@@ -1080,8 +1080,8 @@ query TT
 explain select a/2, a/2 + 1 from t
 ----
 logical_plan
-01)Projection: {t.a / Int64(2)|{Int64(2)}|{t.a}} AS t.a / Int64(2), {t.a / Int64(2)|{Int64(2)}|{t.a}} AS t.a / Int64(2) + Int64(1)
-02)--Projection: t.a / Int64(2) AS {t.a / Int64(2)|{Int64(2)}|{t.a}}
+01)Projection: #1 AS t.a / Int64(2), #1 AS t.a / Int64(2) + Int64(1)


The second projection is #1 + 1, but the alias makes it hard to read. I can't figure out how to fix this; I'd appreciate any feedback.

But don't we need to keep the AS t.a / Int64(2) alias in this case so as to keep the original top level column name?

It must be kept for the first projection, yes.

Ideally, the second projection IMO should be #1 + 1 AS ....
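To make the two requirements in this thread concrete, here is a toy model (not DataFusion's `Expr`): the common subexpression is replaced by a column reference wherever it occurs, and the rewritten top-level expression is re-aliased to its original display name so the output column name is unchanged. The `Expr` enum, `replace_common` and `rewrite_top_level` are all hypothetical.

```rust
use std::fmt;

/// Toy expression tree, just enough to print plans like the ones above.
#[derive(Clone, Debug, PartialEq)]
pub enum Expr {
    Column(String),
    Literal(i64),
    Binary(Box<Expr>, char, Box<Expr>),
    Alias(Box<Expr>, String),
}

impl fmt::Display for Expr {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Expr::Column(name) => write!(f, "{name}"),
            Expr::Literal(v) => write!(f, "Int64({v})"),
            Expr::Binary(l, op, r) => write!(f, "{l} {op} {r}"),
            Expr::Alias(e, name) => write!(f, "{e} AS {name}"),
        }
    }
}

/// Replaces every occurrence of `common` inside `expr` with a column
/// reference named `alias`.
fn replace_common(expr: &Expr, common: &Expr, alias: &str) -> Expr {
    if expr == common {
        return Expr::Column(alias.to_string());
    }
    match expr {
        Expr::Binary(l, op, r) => Expr::Binary(
            Box::new(replace_common(l, common, alias)),
            *op,
            Box::new(replace_common(r, common, alias)),
        ),
        Expr::Alias(e, name) => {
            Expr::Alias(Box::new(replace_common(e, common, alias)), name.clone())
        }
        other => other.clone(),
    }
}

/// Rewrites a top-level projection expression and, if anything changed,
/// aliases the result to the original display name.
pub fn rewrite_top_level(expr: &Expr, common: &Expr, alias: &str) -> Expr {
    let original_name = expr.to_string();
    let rewritten = replace_common(expr, common, alias);
    if rewritten == *expr {
        rewritten
    } else {
        Expr::Alias(Box::new(rewritten), original_name)
    }
}
```

In this model, `t.a / Int64(2)` becomes `#1 AS t.a / Int64(2)` and `t.a / Int64(2) + Int64(1)` becomes `#1 + Int64(1) AS t.a / Int64(2) + Int64(1)`, which is the shape described as ideal above.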

peter-toth · 2024-06-12T15:17:59Z

datafusion/optimizer/src/common_subexpr_eliminate.rs

@@ -166,6 +166,15 @@ impl CommonSubexprEliminate {
    ) -> Result<(Vec<Vec<Expr>>, LogicalPlan)> {
        let mut common_exprs = IndexMap::new();

+        input.schema().iter().for_each(|(qualifier, field)| {


I don't think we should suppose that all extracted aliases remain in the plan:

Let's suppose that we have a plan like this after some CSE optimization:
...
  Projection: #1 as c1, #1 as c2, #2 as c3, #2 as c4, (a + 2) as c5, (a + 1 + 1) as c6
    Projection: (a + b) as "#1", (a + c) as "#2", a
      ...

Then some other optimization rule prunes "c1" and "c2" from the plan because they turn out to be unnecessary:
...
  Projection: #2 as c3, #2 as c4, (a + 2) as c5, (a + 1 + 1) as c6
    Projection: (a + c) as "#2", a
      ...

And then some other rule creates new CSE possibilities:
...
  Projection: #2 as c3, #2 as c4, (a + 2) as c5, (a + 2) as c6
    Projection: (a + c) as "#2", a
      ...

The CSE rule then runs again, but the indexes here and in build_common_expr_project_plan() are out of sync...

IMO the best thing we can do is to choose a unique alias for each common expression in CommonSubexprRewriter when we find the expression, and store the alias in common_exprs together with the expression. In that case we don't need to deal with index sync issues and don't get plans with unnecessary aliases like here: https://github.com/apache/datafusion/pull/10868/files#diff-351499880963d6a383c92e156e75019cd9ce33107724a9635853d7d4cd1898d0R1403

Neither issue affects correctness.

One thing I'd like to point out is that adding unused columns (all of the input's columns) to the intermediate projection is the behavior of the current CSE; it's not introduced in this PR. You can try copying the new test and running it against main. You'll get this output:

let plan = LogicalPlanBuilder::from(table_scan.clone())
    .project(vec![(col("a") + col("b")).alias("#1"), col("c")])?
    .project(vec![
        (col("c") + lit(2)).alias("c3"),
        (col("c") + lit(2)).alias("c4"),
    ])?
    .build()?;

Projection: {test.c + Int32(2)|{Int32(2)}|{test.c}} AS test.c + Int32(2) AS c3, {test.c + Int32(2)|{Int32(2)}|{test.c}} AS test.c + Int32(2) AS c4
  Projection: test.c + Int32(2) AS {test.c + Int32(2)|{Int32(2)}|{test.c}}, #1, test.c
    Projection: test.a + test.b AS #1, test.c
      TableScan: test

Extra projections are removed by other rules, so the final plan doesn't contain these projections.

Also, you may have noticed that extra projections make the aliases "out-of-sync" and to be honest I don't mind the #2 instead of #1 (as long as it's not something ridiculous like #1023 for example), and I don't see a way to fix that without patching some hacky global state/counter or asking other rules to reuse aliases when removing the extra projections.

No, what I meant by "indexes go out of sync" is that if your modified CSE rule runs on a plan like the one we got in the 3rd step (i.e. there is no #1 in the plan), e.g.:

let plan = LogicalPlanBuilder::from(table_scan.clone())
    .project(vec![(col("a") + col("b")).alias("#2"), col("c")])?
    .project(vec![
        col("#2").alias("c1"),
        col("#2").alias("c2"),
        (col("c") + lit(2)).alias("c3"),
        (col("c") + lit(2)).alias("c4"),
    ])?
    .build()?;

then it produces an incorrect plan:

Projection: #2 AS c1, #2 AS c2, #2 AS c3, #2 AS c4 // The issue here is that `#2` gets aliased to `#1` below, but `#2` doesn't change here.
  Projection: #2 AS #1, test.c + Int32(2) AS #2, test.c
    Projection: test.a + test.b AS #2, test.c
      TableScan: test

This is because you inject #2 into common_exprs, but you don't inject it to expr_stats (and others).

IMO modifying common_exprs is hacky if you don't do it in CommonSubexprRewriter, that's why I suggested the solution in my previous comment.

Removed index usage; now we keep the original alias inside the common_exprs.

Now this starts to look like what I suggested, because we assign the unique aliases in CommonSubexprRewriter and store them in common_exprs together with the common expression.

But why do you still inject previous # aliases to common_exprs? I think you just need to find the biggest one here and pass that number to CommonSubexprRewriter and simply start assigning new # aliases in f_down() from that number + 1.
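A minimal sketch of that suggestion, under the assumption that the input schema's field names are available as plain strings: scan them once for names of the form `#n`, take the largest `n`, and hand out new aliases starting from `n + 1`. `max_existing_alias` and `AliasCounter` are hypothetical helpers, not part of DataFusion.

```rust
/// Returns the largest `n` among field names of the form `#n`,
/// or 0 if no such field exists.
pub fn max_existing_alias<'a>(field_names: impl IntoIterator<Item = &'a str>) -> usize {
    field_names
        .into_iter()
        .filter_map(|name| name.strip_prefix('#'))
        .filter_map(|digits| digits.parse::<usize>().ok())
        .max()
        .unwrap_or(0)
}

/// Hands out `#n` aliases, starting one past the number it was seeded with.
pub struct AliasCounter {
    last: usize,
}

impl AliasCounter {
    pub fn starting_after(last: usize) -> Self {
        Self { last }
    }

    /// Constant-time: bump the counter and format it.
    pub fn next_alias(&mut self) -> String {
        self.last += 1;
        format!("#{}", self.last)
    }
}
```

With an input schema containing `#2` and `test.c`, the first new alias is `#3`, so nothing already in the plan has to be injected into common_exprs or re-aliased.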

We need a solution that can produce a unique alias fast. There is no problem with having gaps if we can do it in constant time (vs. no gaps, but in time linear in the number of common expressions).

n is usually really small. I don't think this is a big performance hit, and so filling the gaps is a good tradeoff IMO

But why do you want to fill the gaps? These are artificial aliases, so having consecutive numbers has no use; all that matters is that they are short, unique and easy to read. Also, if you don't inject anything into common_exprs then the pointless #1 AS #1 aliases won't get added.
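The trade-off being debated can be sketched as two small functions. Both are hypothetical and only model the numbering, not the rule itself: one fills gaps by scanning the used numbers, the other tolerates gaps and just goes one past the largest number seen so far.

```rust
use std::collections::BTreeSet;

/// Gap-filling: returns the smallest positive number not in `used`.
/// Walks the set in ascending order, so it is linear in the number of
/// aliases already taken.
pub fn smallest_free(used: &BTreeSet<usize>) -> usize {
    let mut candidate = 1;
    for &n in used {
        if n == candidate {
            candidate += 1;
        } else if n > candidate {
            break;
        }
    }
    candidate
}

/// Gap-tolerant: one past the largest number seen so far. Constant time
/// when the caller keeps track of the current maximum.
pub fn next_after_max(current_max: usize) -> usize {
    current_max + 1
}
```

With `#2` and `#5` in use, the first approach yields 1 and the second yields 6. Both are unique, which is all the plan needs.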

TBH I don't think I'll be able to do that anytime soon.

If that's the only remaining issue, I can mark the PR as ready and a maintainer can push that change.

Please do it and let me try to open a PR with the fix to your PR tomorrow or during the weekend.

MohamedAbdeen21 · 2024-06-13T11:40:49Z

datafusion/optimizer/src/common_subexpr_eliminate.rs

@@ -342,7 +346,7 @@ impl CommonSubexprEliminate {
            Aggregate::try_new(Arc::new(new_input), new_group_expr, new_aggr_expr)
                .map(LogicalPlan::Aggregate)
        } else {
-            let mut expr_number = common_exprs.len();
+            let mut expr_number = common_exprs.values().map(|t| t.1).max().unwrap_or(0);


This needs a test case
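A sketch of what such a test could check, using a std `HashMap` as a stand-in for the real map and assuming the counter is incremented before it is used. `CommonExprs`, `next_number_by_len` and `next_number_by_max` are hypothetical names that mirror the removed and added lines of the diff.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for `common_exprs`:
/// identifier -> (expression text, alias number).
pub type CommonExprs = HashMap<String, (String, usize)>;

/// Old behaviour in the diff: derive the next number from the map size.
pub fn next_number_by_len(common_exprs: &CommonExprs) -> usize {
    common_exprs.len() + 1
}

/// New behaviour in the diff: derive it from the largest number stored.
pub fn next_number_by_max(common_exprs: &CommonExprs) -> usize {
    common_exprs.values().map(|t| t.1).max().unwrap_or(0) + 1
}
```

If the map holds a single entry numbered 2 (because `#1` was pruned earlier), the size-based version proposes 2 again and collides, while the max-based version proposes 3.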

MohamedAbdeen21 · 2024-06-13T18:43:29Z

The failing CI is a simple clippy warning, I'd appreciate if it can be fixed before merging.

The last thread between me and @peter-toth mentions some possible improvements, but these should be a separate follow-up PR.

I'm extremely sorry for not being able to push any further changes. And thanks @peter-toth for partaking in this long, boring review process and helping me understand this rule better.

peter-toth · 2024-06-15T14:03:45Z

Thanks @MohamedAbdeen21 for the PR! I fixed the clippy issue and made some changes to alias generation as it turned out that there is an AliasGenerator available for optimizer rules. I also reverted some unnecessary changes.
https://github.com/peter-toth/arrow-datafusion/commits/cse-numeric-aliases/ contains your original commits, a merge commit from main and my suggestions in the last commit.
As far as I see we have 2 options:

I can open a PR targeting this PR and then you can review my suggestions and merge them into this PR.
Or I can open a new PR targeting main which will contain your commits too.

cc @alamb, as this PR might conflict with your #10835

alamb · 2024-06-15T15:47:01Z

Thanks for the heads up @peter-toth -- I can handle any conflicts if they arise. I am still trying to get improved performance with that PR

MohamedAbdeen21 · 2024-06-16T11:53:44Z

Hey @peter-toth sorry for the late reply.

Looks good, and I like the Alias generator. But I have a couple of comments:

Thought we agreed on the # prefix. Is there a reason for choosing the __cse_ prefix? DuckDB uses #XXX, SQLServer uses ExprXXX (can't verify this atm), Spark uses _expr#XXX, all of which look cleaner than __cse_XXX
Tiny nit, but can you use into_values instead of using into_iter then ignoring the key?
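For reference, the nit amounts to the difference below, shown on a std `BTreeMap` as a stand-in for the map used in the rule. Both helper functions are hypothetical and return the same values.

```rust
use std::collections::BTreeMap;

/// Consumes the map with `into_iter` and throws the key away by hand.
pub fn exprs_via_into_iter(common_exprs: BTreeMap<String, String>) -> Vec<String> {
    common_exprs.into_iter().map(|(_, expr)| expr).collect()
}

/// Same result, but `into_values` states the intent directly.
pub fn exprs_via_into_values(common_exprs: BTreeMap<String, String>) -> Vec<String> {
    common_exprs.into_values().collect()
}
```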

I don't mind either option: a PR against this one keeps the entire review history and can make reviews easier, but a new PR can be easier to maintain as you'll be able to directly push commits.

peter-toth · 2024-06-16T16:05:59Z

No problem, I've opened #10939.

Yes, and actually I'm fine with that as well, but I found a few AliasGenerator use cases in the existing code and wanted to follow the conventions of https://github.com/apache/datafusion/blob/main/datafusion/optimizer/src/decorrelate_predicate_subquery.rs#L249 and https://github.com/apache/datafusion/blob/main/datafusion/optimizer/src/scalar_subquery_to_join.rs#L245 with the __cse prefix.
I've also changed into_iter() to into_values() in 6b1d1e3. Thanks for the suggestion!
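For readers unfamiliar with the pattern, the sketch below shows the general shape of a shared, prefix-based alias generator that produces names like `__cse_1`. It is a simplified stand-in written for this thread, not DataFusion's actual `AliasGenerator`; the type name, the `prefix_id` format and the `__other` prefix are assumptions.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Simplified stand-in for a shared alias generator: one counter for the
/// whole optimizer run, so aliases stay unique across rules and passes.
pub struct SketchAliasGenerator {
    next_id: AtomicUsize,
}

impl SketchAliasGenerator {
    pub fn new() -> Self {
        Self {
            next_id: AtomicUsize::new(1),
        }
    }

    /// Returns `<prefix>_<id>` with a fresh id on every call.
    pub fn next(&self, prefix: &str) -> String {
        let id = self.next_id.fetch_add(1, Ordering::Relaxed);
        format!("{prefix}_{id}")
    }
}
```

Because the counter is shared, a second CSE pass never reuses a number from the first pass, and gaps appear naturally when other rules draw from the same generator.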

MohamedAbdeen21 · 2024-06-16T16:18:21Z

Became part of #10939.

github-actions bot added optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Jun 11, 2024

MohamedAbdeen21 commented Jun 11, 2024

peter-toth reviewed Jun 11, 2024

MohamedAbdeen21 closed this Jun 11, 2024

MohamedAbdeen21 reopened this Jun 11, 2024

MohamedAbdeen21 added 9 commits June 11, 2024 22:38

initial change

a3001ff

test renaming

0a1e1ce

use counter instead of indexmap

7184263

order slt tests

a4fceb5

change cse tests

00e5a05

restore slt tests

ae5e8b4

fix slt test

19d69e2

formatting

eef86f9

ensure no alias collision

72e16a4

MohamedAbdeen21 force-pushed the cse-numeric-aliases branch from 9fb7f98 to 72e16a4 Compare June 11, 2024 19:39

peter-toth reviewed Jun 12, 2024

MohamedAbdeen21 added 2 commits June 13, 2024 14:32

keep original alias numbers for collision

2ee4d9a

ensure no collision in aggregate cse

f76087c

MohamedAbdeen21 commented Jun 13, 2024

MohamedAbdeen21 marked this pull request as ready for review June 13, 2024 18:31

peter-toth mentioned this pull request Jun 16, 2024

Use shorter aliases in CSE #10939

Merged

MohamedAbdeen21 closed this Jun 16, 2024

MohamedAbdeen21 deleted the cse-numeric-aliases branch June 17, 2024 15:36
