-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use shorter aliases in CSE #10939
Use shorter aliases in CSE #10939
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @peter-toth -- I think this looks like a really nice improvement in readability.
I am going to run the planning benchmarks on this PR to see if it has any impact on performance, but even if not I think the readability improvements alone make it a valuable contribution
cc @waynexia as well
.expr_stats | ||
.entry(expr_id.clone()) | ||
.or_insert((0, data_type)); | ||
let count = self.expr_stats.entry(expr_id.clone()).or_insert(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be great to avoid this clone (which I think is a String) by using a slightly different API. Maybe something like
let count = if let Some(count) = self.expr_stats.get(expr_id) {
count
} else {
self.expr_stats.insert(expr_id.clone(), 0);
0
};
Which while less elegant and requires hashing twice doens't clone the string 🤔
There may be a more clever way to avoid cloning the id
Also, I see the current code clones expr_id
as well, so this change is no worse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason why I wouldn't change this in this PR is because my #10473 (https://github.com/apache/datafusion/pull/10473/files#diff-351499880963d6a383c92e156e75019cd9ce33107724a9635853d7d4cd1898d0R761) will completely eliminate string identifiers and remove these clones.
let expected = "Projection: UInt32(1) + test.a, UInt32(1) + {AVG({UInt32(1) + test.a|{test.a}|{UInt32(1)}})|{{UInt32(1) + test.a|{test.a}|{UInt32(1)}}}} AS col1, UInt32(1) - {AVG({UInt32(1) + test.a|{test.a}|{UInt32(1)}})|{{UInt32(1) + test.a|{test.a}|{UInt32(1)}}}} AS col2, {AVG({UInt32(1) + test.a|{test.a}|{UInt32(1)}} AS UInt32(1) + test.a)} AS AVG(UInt32(1) + test.a), UInt32(1) + {my_agg({UInt32(1) + test.a|{test.a}|{UInt32(1)}})|{{UInt32(1) + test.a|{test.a}|{UInt32(1)}}}} AS col3, UInt32(1) - {my_agg({UInt32(1) + test.a|{test.a}|{UInt32(1)}})|{{UInt32(1) + test.a|{test.a}|{UInt32(1)}}}} AS col4, {my_agg({UInt32(1) + test.a|{test.a}|{UInt32(1)}} AS UInt32(1) + test.a)} AS my_agg(UInt32(1) + test.a)\ | ||
\n Aggregate: groupBy=[[{UInt32(1) + test.a|{test.a}|{UInt32(1)}} AS UInt32(1) + test.a]], aggr=[[AVG({UInt32(1) + test.a|{test.a}|{UInt32(1)}}) AS {AVG({UInt32(1) + test.a|{test.a}|{UInt32(1)}})|{{UInt32(1) + test.a|{test.a}|{UInt32(1)}}}}, my_agg({UInt32(1) + test.a|{test.a}|{UInt32(1)}}) AS {my_agg({UInt32(1) + test.a|{test.a}|{UInt32(1)}})|{{UInt32(1) + test.a|{test.a}|{UInt32(1)}}}}, AVG({UInt32(1) + test.a|{test.a}|{UInt32(1)}} AS UInt32(1) + test.a) AS {AVG({UInt32(1) + test.a|{test.a}|{UInt32(1)}} AS UInt32(1) + test.a)}, my_agg({UInt32(1) + test.a|{test.a}|{UInt32(1)}} AS UInt32(1) + test.a) AS {my_agg({UInt32(1) + test.a|{test.a}|{UInt32(1)}} AS UInt32(1) + test.a)}]]\ | ||
\n Projection: UInt32(1) + test.a AS {UInt32(1) + test.a|{test.a}|{UInt32(1)}}, test.a, test.b, test.c\ | ||
let expected = "Projection: UInt32(1) + test.a, UInt32(1) + __cse_2 AS col1, UInt32(1) - __cse_2 AS col2, __cse_4 AS AVG(UInt32(1) + test.a), UInt32(1) + __cse_3 AS col3, UInt32(1) - __cse_3 AS col4, __cse_5 AS my_agg(UInt32(1) + test.a)\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That certainly looks much nicer
/// A map that contains the common expressions extracted during the second, rewriting | ||
/// traversal. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps we can also update the docstring here to mention that the map contains the alias
/// A map that contains the common expressions extracted during the second, rewriting | |
/// traversal. | |
/// A map that contains the common expressions and their alias extracted during the second, rewriting | |
/// traversal. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Fixed in be2e560 together with ExprStats
docstring.
Running benchmarks now... |
My benchmark runs show a small overall improvement (1% or so) so I feel like this PR is ready to go I plan to leave this PR open for another day or so to allow @MohamedAbdeen21 to comment if the would like to Benchmar Details
|
We did a thorough review of the changes in the old, now-closed, PR. FWIW, I'm happy with this change 👍 . |
Looking great! The CSE alias has changed many times, I think it's much closer to its final form now (I've skimmed #10473, and it also looks very 🆒!) One thing I'd like to discuss is the prefix name. Unlike (#10868 (comment)) |
…n expression aliases, remove `DataType` from `ExprStats` as not needed, store aliases in `CommonExprs`, revert unnecessary changes
a592e69
to
bfe814b
Compare
Sorry, I had rebase this PR. The commits remained the same as before, except the last one that changes the prefix to |
Thanks again @peter-toth @waynexia and @MohamedAbdeen21 |
Thank you all for the review! |
* initial change * test renaming * use counter instead of indexmap * order slt tests * change cse tests * restore slt tests * fix slt test * formatting * ensure no alias collision * keep original alias numbers for collision * ensure no collision in aggregate cse * use `AliasGenerator` to generate aliases, use `__cse` prefix in common expression aliases, remove `DataType` from `ExprStats` as not needed, store aliases in `CommonExprs`, revert unnecessary changes * use `into_values()` instead of `into_iter()` where possible * fix docstring of `ExprStats` and `CommonExprs` * use `__common_expr` prefix --------- Co-authored-by: Mohamed Abdeen <[email protected]>
Which issue does this PR close?
Closes #10280.
Rationale for this change
This PR is based on @MohamedAbdeen21's #10868 and shortens aliases generated by CSE for readability.
What changes are included in this PR?
Use shorthand numeric aliases (
__cse_1
,__cse_2
, ...) for common subexpressions.Are these changes tested?
Yes, with existing and new UTs.
Are there any user-facing changes?
Yes, better and more concise logical plans.