Add reference visitor `TreeNode` APIs, change `ExecutionPlan::children()` and `PhysicalExpr::children()` return references #10543
Thanks @peter-toth. After a quick look, I started thinking it might be better to use this new ref_visitor API instead of keeping also the original one. I'll take a closer look tomorrow to see if that makes sense. |
Sure. We can go that way, but the changes needed will be bigger, see the description of the PR for the details. |
I agree with @berkaysynnada in #10543 (comment) that in an ideal world we would change
Thus, I suggest we merge this PR as is, and file a follow on PR to potentially unify the API. While in an ideal world we wouldn't have both, I think this is a step in the right direction. Is that ok with you @berkaysynnada?
Thank you for this contribution @peter-toth 🙏
I think we should wait for @berkaysynnada's feedback before merging this PR, but from my perspective it is good.
Can we also please add a note about `visit_ref` / `apply_ref` in the description here:
/// [`apply`], [`visit`], [`exists`].
I think we can add the doc as a follow on PR too
Breaking the full migration into two PRs (with the latter one removing the old usage and migrating to the new one) sounds reasonable to me. It will probably make sense to do the follow-on quickly because we already will have one major API change (format options) in the next version. Instead of having two versions with two major API changes, it'd probably be better to have one version with these API changes lumped in together. @peter-toth, how big do you think this PR will be if we do all the changes at once? |
@ozankabak, let me check the changes required for the alternative today or tomorrow and come back to you. |
I'm still working on an alternative to this PR and will need a couple of more days to test a few different ideas... |
No worries. Will be happy to review and help iterate once you are ready |
What do we think about merging this PR and filing a follow on ticket to unify the APIs? |
I'm ok with merging the current state of the PR. But I was also thinking about how to improve it: As far as I see we have 3 options here:
So far I've been playing with 3. but it became very complex and still doesn't fully work. Also, I'm no longer sure that such an API makes sense as the API user can't specify what kind of references they want. E.g. in CSE (#10473) I need permanent references and can't do anything with other references whose lifetime doesn't match the root node's lifetime. So now I'm working on implementing 1., but it will be a pervasive change as there are 56 ( |
I've rebased the PR on the latest This PR doesn't modify the I've also updated the PR description. |
Thanks a lot, @peter-toth! This looks great to me. The new Perhaps we could add documentation with code snippets to exemplify the usage of methods and their purposes, to ease the experience for less experienced users. |
@@ -258,6 +259,10 @@ impl CaseExpr {
     }
 }

+lazy_static! {
We can't create temporary `Arc::new(NoOp::new())` objects in `children()`, but as `NoOp` is just a placeholder that we need in `with_new_children()`, we can use one global instance. As `Arc`s can't be const, the simplest solution seemed to be using `lazy_static!`, but if someone has a better idea please share.
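As a rough illustration of the "one global instance" idea, here is a minimal sketch that uses the standard library's `std::sync::OnceLock` rather than `lazy_static!` (a swapped-in technique, not what the PR does). The `PhysicalExpr` trait, `NoOp`, `Wrapper` and `no_op()` below are simplified, hypothetical stand-ins, not DataFusion's real types.

```rust
use std::sync::{Arc, OnceLock};

/// Hypothetical, simplified stand-in for a physical expression trait.
trait PhysicalExpr: Send + Sync {
    fn name(&self) -> &str;
}

/// Hypothetical placeholder expression, similar in spirit to `NoOp`.
struct NoOp;

impl PhysicalExpr for NoOp {
    fn name(&self) -> &str {
        "NoOp"
    }
}

/// Returns a reference to a single, lazily created global placeholder.
/// The instance lives in a `static`, so the reference is `'static`.
fn no_op() -> &'static Arc<dyn PhysicalExpr> {
    static NO_OP: OnceLock<Arc<dyn PhysicalExpr>> = OnceLock::new();
    NO_OP.get_or_init(|| {
        let placeholder: Arc<dyn PhysicalExpr> = Arc::new(NoOp);
        placeholder
    })
}

/// Hypothetical expression with one optional child.
struct Wrapper {
    child: Option<Arc<dyn PhysicalExpr>>,
}

impl Wrapper {
    /// Returns references only. A missing child is represented by the
    /// global placeholder, so no temporary `Arc` has to be created here.
    fn children(&self) -> Vec<&Arc<dyn PhysicalExpr>> {
        vec![self.child.as_ref().unwrap_or_else(|| no_op())]
    }
}

fn main() {
    let w = Wrapper { child: None };
    let children = w.children();
    println!("first child: {}", children[0].name());
}
```

The point of the sketch is only that a reference into a `static` outlives any `&self` borrow, which is what lets `children()` return it alongside references to real fields.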
Can we skip pushing for the None cases? The user would know which indices refer to which expressions if needed. My concern is that this behavior could spread to expressions that have a flexible number of children.
We need to be able to restore a `CaseExpr` in `CaseExpr::with_new_children()` from a `children: Vec<Arc<dyn PhysicalExpr>>`. If we don't push anything for the 2 optional `CaseExpr.expr` and `CaseExpr.else_expr` fields, then I don't think there is a way to tell if the first element of `children` belongs to the optional `CaseExpr.expr` or to `CaseExpr.when_then_expr`.
`with_new_children()` of `CaseExpr` says that: `// For physical CaseExpr, we do not allow modifying children size`.
That sentence tells me that we also should not change which optional expressions exist. Can we just insert the new children into the `Some()` ones in element order?
That's true, but please check how `NoOp`s are used there. If we don't have those `NoOp`s in `children`, then how can we restore `expr`, `else_expr` and `when_then_expr`? The problem is that we have 2 optional fields...
In a world where `NoOp`s are never used, I don't see a reason why this implementation wouldn't work.
// The optional base expression, if present, is always the first child.
let expr = if self.expr().is_some() {
    Some(children[0].clone())
} else {
    None
};
// The optional ELSE expression, if present, is always the last child.
let else_expr = if self.else_expr().is_some() {
    Some(children[children.len() - 1].clone())
} else {
    None
};
// Everything between the optional first and last child is a WHEN/THEN pair.
let branches = match (&expr, &else_expr) {
    (Some(_), Some(_)) => children[1..children.len() - 1].to_vec(),
    (Some(_), None) => children[1..children.len()].to_vec(),
    (None, Some(_)) => children[0..children.len() - 1].to_vec(),
    (None, None) => children[0..children.len()].to_vec(),
};
let mut when_then_expr: Vec<WhenThen> = vec![];
for (prev, next) in branches.into_iter().tuples() {
    when_then_expr.push((prev, next));
}
Of course, `NoOp`s are safer and don't require any assumptions, but I'm just curious about what I might be missing.
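The index-based restoration discussed in this thread can be tried out in isolation. The following is a minimal sketch, assuming the number of children and their roles never change between `children()` and `with_new_children()`; `Case` is a hypothetical model that uses plain `String`s instead of `Arc<dyn PhysicalExpr>` and is not DataFusion's `CaseExpr`.

```rust
/// Hypothetical, simplified model of a CASE expression.
#[derive(Debug, PartialEq)]
struct Case {
    expr: Option<String>,
    when_then: Vec<(String, String)>,
    else_expr: Option<String>,
}

impl Case {
    /// Flattens the node into a child list; absent optional fields are
    /// skipped, so no placeholder entries are pushed.
    fn children(&self) -> Vec<&String> {
        let mut out = Vec::new();
        out.extend(self.expr.iter());
        for (when, then) in &self.when_then {
            out.push(when);
            out.push(then);
        }
        out.extend(self.else_expr.iter());
        out
    }

    /// Rebuilds the node from a new child list. This relies on `self` to
    /// decide which optional fields exist, which is only valid under the
    /// assumption that the child count and roles are unchanged.
    fn with_new_children(&self, children: Vec<String>) -> Case {
        assert_eq!(children.len(), self.children().len());
        let mut it = children.into_iter();
        let expr = self.expr.as_ref().and_then(|_| it.next());
        let mut when_then = Vec::with_capacity(self.when_then.len());
        for _ in 0..self.when_then.len() {
            let when = it.next().expect("missing WHEN child");
            let then = it.next().expect("missing THEN child");
            when_then.push((when, then));
        }
        let else_expr = self.else_expr.as_ref().and_then(|_| it.next());
        Case { expr, when_then, else_expr }
    }
}

fn main() {
    let case = Case {
        expr: None,
        when_then: vec![("a > 0".to_string(), "1".to_string())],
        else_expr: Some("0".to_string()),
    };
    // Transform every child, then restore the original shape.
    let upper: Vec<String> = case
        .children()
        .into_iter()
        .map(|c| c.to_uppercase())
        .collect();
    println!("{:?}", case.with_new_children(upper));
}
```

The trade-off mirrors the discussion: without placeholders the child list is shorter, but the restoring side has to trust the existing node to know which positions are optional.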
Ah ok, I got it now. I will fix it today...
If we can assume that the number of children and their role can't change then probably we can do this: b7eaa47
I hope to review this PR later today or tomorrow |
Thank you @peter-toth and @berkaysynnada for the review
I think this looks amazing -- really nice 👌
What I think we should do is merge this PR and file some follow on tickets (e.g. to add an example of how to use this API, apply the same treatment to `apply_with_subqueries`, etc.).
I'll plan to file those follow on tickets in the next day or two and merge this PR in unless anyone else would like more time to review.
Again, really nice work @peter-toth
@@ -169,7 +169,7 @@ impl ExecutionPlan for WorkTableExec {
         &self.cache
     }

-    fn children(&self) -> Vec<Arc<dyn ExecutionPlan>> {
+    fn children(&self) -> Vec<&Arc<dyn ExecutionPlan>> {
After reading through this I think it is also an API change (as it changes the API for `ExecutionPlan::children`) -- however I think it is for the best.
Though if we are going to be changing the signature anyway, I wonder if we should consider something that doesn't require an allocation, like
fn children(&self) -> [&Arc<dyn ExecutionPlan>] {
instead of
fn children(&self) -> Vec<&Arc<dyn ExecutionPlan>> {
🤔
I don't think we can return an array if its size is not fixed.
Maybe `&[&Arc<dyn ExecutionPlan>]` was meant?
I think we could return a slice, but the current `Vec` is consistent with how other implementations usually return children, like `LogicalPlan::inputs()` or `ConcreteTreeNode::children()`.
Or shall we change those too?
Actually, I don't think we can return a slice of references. Returning an empty slice here would be ok, but at other places where there are children to return (e.g. in `BinaryExpr`) we need to build a temporary container (vec or array) to store the references of children and then return a slice of the container, but who will own the container?
I opened a PR as returning a slice in `ConcreteTreeNode` is possible: #10666
But it will not work for `LogicalPlan` or `ExecutionPlan` or `PhysicalExpr`.
I've tried to return `&[Arc<dyn ExecutionPlan>]` and, unsurprisingly, it fails unless we store all children in a vector like `Union` does:
fn children(&self) -> &[Arc<dyn ExecutionPlan>] {
    self.inputs.as_slice()
}
Otherwise we cannot return a temporary slice.
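To make the ownership argument in this thread concrete, here is a minimal sketch with hypothetical types (`Plan`, `Leaf`, `Join`, `Union` are illustrative stand-ins, not DataFusion's). A node that keeps its children in separate fields has no contiguous storage to borrow a slice from, so it collects references into a freshly allocated `Vec`; a node that already stores its inputs in a `Vec` can hand out a slice directly.

```rust
use std::sync::Arc;

/// Hypothetical, simplified plan trait.
trait Plan {
    fn name(&self) -> String;
    /// Returns references to the children; the `Vec` is owned by the caller.
    fn children(&self) -> Vec<&Arc<dyn Plan>>;
}

struct Leaf(&'static str);

impl Plan for Leaf {
    fn name(&self) -> String {
        self.0.to_string()
    }
    fn children(&self) -> Vec<&Arc<dyn Plan>> {
        vec![]
    }
}

/// Children live in two separate fields, so there is no `[Arc<dyn Plan>]`
/// inside `self` that a returned slice could point to.
struct Join {
    left: Arc<dyn Plan>,
    right: Arc<dyn Plan>,
}

impl Plan for Join {
    fn name(&self) -> String {
        "join".to_string()
    }
    fn children(&self) -> Vec<&Arc<dyn Plan>> {
        vec![&self.left, &self.right]
    }
}

/// Children are already stored contiguously, so a slice can be returned.
struct Union {
    inputs: Vec<Arc<dyn Plan>>,
}

impl Union {
    fn inputs_slice(&self) -> &[Arc<dyn Plan>] {
        self.inputs.as_slice()
    }
}

impl Plan for Union {
    fn name(&self) -> String {
        "union".to_string()
    }
    fn children(&self) -> Vec<&Arc<dyn Plan>> {
        self.inputs.iter().collect()
    }
}

/// Helper that builds a leaf already coerced to a trait object.
fn leaf(name: &'static str) -> Arc<dyn Plan> {
    Arc::new(Leaf(name))
}

/// Collects the names of the direct children of any plan.
fn child_names(plan: &dyn Plan) -> Vec<String> {
    plan.children().iter().map(|c| c.name()).collect()
}

fn main() {
    let join = Join { left: leaf("a"), right: leaf("b") };
    let union = Union { inputs: vec![leaf("x"), leaf("y"), leaf("z")] };
    println!("{:?} {:?}", child_names(&join), child_names(&union));
    println!("{} inputs as slice", union.inputs_slice().len());
}
```

Returning `Vec<&Arc<dyn Plan>>` costs one small allocation per call but works uniformly for both storage layouts, which is the compromise the thread settles on.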
Awesome! I plan to merge this PR tomorrow unless anyone else would like time to review. |
Thank you again @peter-toth @ozankabak and @berkaysynnada -- I am working on some doc examples for this as well
I made these two PRs to add examples of using TreeNode:
Also a bonus to improve: #10685 |
Thanks all for the review! @alamb, those API example PRs look great! |
* deps: update datafusion to 39.0.0, pyo3 to 0.21, and object_store to 0.10.1 `datafusion-common` also depends on `pyo3`, so they need to be upgraded together.
* feat: remove GetIndexField datafusion replaced Expr::GetIndexField with a FieldAccessor trait. Ref apache/datafusion#10568 Ref apache/datafusion#10769
* feat: update ScalarFunction The field `func_name` was changed to `func` as part of removing `ScalarFunctionDefinition` upstream. Ref apache/datafusion#10325
* feat: incorporate upstream array_slice fixes Fixes #670
* update ExecutionPlan::children impl for DatasetExec Ref apache/datafusion#10543
* update value_interval_daytime Ref apache/arrow-rs#5769
* update regexp_replace and regexp_match Fixes #677
* add gil-refs feature to pyo3 This silences pyo3's deprecation warnings for its new Bounds api. It's the 1st step of the migration, and should be removed before merge. Ref https://pyo3.rs/v0.21.0/migration#from-020-to-021
* fix signature for octet_length Ref apache/datafusion#10726
* update signature for covar_samp AggregateUDF expressions now have a builder API design, which removes arguments like filter and order_by Ref apache/datafusion#10545 Ref apache/datafusion#10492
* convert covar_pop to expr_fn api Ref: https://github.com/apache/datafusion/pull/10418/files
* convert median to expr_fn api Ref apache/datafusion#10644
* convert variance sample to UDF Ref apache/datafusion#10667
* convert first_value and last_value to UDFs Ref apache/datafusion#10648
* checkpointing with a few todos to fix remaining compile errors
* impl PyExpr::python_value for IntervalDayTime and IntervalMonthDayNano
* convert sum aggregate function to UDF
* remove unnecessary clone on double reference
* apply cargo fmt
* remove duplicate allow-dead-code annotation
* update tpch examples for new pyarrow interval Fixes #665
* marked q11 tpch example as expected fail Ref #730
* add default stride of None back to array_slice
…n()` and `PhysicalExpr::children()` return references (apache#10543)
* add reference visitor APIs
* use stricter references in apply() and visit()
* avoid where clause
* remove NO_OP
* remove assert after removing NO_OP
---------
Co-authored-by: Andrew Lamb <[email protected]>
Which issue does this PR close?
Part of #10121, required for #10505 and #10426.
Rationale for this change
The current `TreeNode` visitor APIs (`TreeNode::visit()` and `TreeNode::apply()`) have a limitation: the lifetimes of the `TreeNode` references passed to `TreeNodeVisitor::f_down()`, `TreeNodeVisitor::f_up()` and the `f` closure of `apply()` don't match the lifetime of the root `TreeNode` reference on which the APIs are called.
This restriction means that we can't build up data structures that contain references to descendant treenodes.
E.g. the following code snippet to collect how many times subexpressions occur in an expression tree doesn't work:
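A minimal sketch of the idea, using a hypothetical stand-alone `Node` type rather than DataFusion's `Expr` and `TreeNode` (all names below are illustrative assumptions). The `apply` shown here already carries the explicit lifetime `'n`, so the counting compiles; the comment notes the looser signature under which it would be rejected.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified expression tree.
#[derive(Debug, PartialEq, Eq, Hash)]
enum Node {
    Leaf(String),
    Add(Box<Node>, Box<Node>),
}

impl Node {
    fn children(&self) -> Vec<&Node> {
        match self {
            Node::Leaf(_) => vec![],
            Node::Add(l, r) => vec![l.as_ref(), r.as_ref()],
        }
    }

    /// Pre-order traversal. The explicit `'n` ties every reference passed
    /// to `f` to the borrow of the root. With a looser signature such as
    /// `fn apply<F: FnMut(&Node)>(&self, f: &mut F)`, the closure would get
    /// references of an arbitrary, shorter lifetime and could not store
    /// them in a map that outlives the call.
    fn apply<'n, F: FnMut(&'n Node)>(&'n self, f: &mut F) {
        f(self);
        for child in self.children() {
            child.apply(f);
        }
    }
}

/// Counts how often each subexpression occurs, keyed by reference.
fn count_subexpressions(root: &Node) -> HashMap<&Node, usize> {
    let mut counts = HashMap::new();
    root.apply(&mut |n| *counts.entry(n).or_insert(0) += 1);
    counts
}

fn leaf(name: &str) -> Node {
    Node::Leaf(name.to_string())
}

fn add(l: Node, r: Node) -> Node {
    Node::Add(Box::new(l), Box::new(r))
}

fn main() {
    // (a + b) + (a + b)
    let root = add(add(leaf("a"), leaf("b")), add(leaf("a"), leaf("b")));
    let counts = count_subexpressions(&root);
    println!("distinct subexpressions: {}", counts.len());
}
```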
This PR changes the `TreeNode` visitor APIs to make sure the lifetime of references matches the lifetime of the root `TreeNode` reference so the above example will work.
Please note:
The `LogicalPlan::apply_with_subqueries()` and `LogicalPlan::visit_with_subqueries()` APIs, that are similar to `TreeNode`'s base APIs but provide subquery support, can't be made stricter easily. This is because in `LogicalPlan::apply_expressions()` and `LogicalPlan::apply_subqueries()` we create temporary `Expr::eq`, `Expr::Column` and `LogicalPlan::Subquery` objects that are not compatible with the root treenode's lifetime.
What changes are included in this PR?
`TreeNode::apply()`, `TreeNode::visit()` and `TreeNode::apply_children()` APIs.
`TreeNode` implementations (`Expr`, `LogicalPlan`, `ConcreteTreeNode` and `DynTreeNode`) are amended to be able to implement the stricter APIs.
Are these changes tested?
Yes, with new UTs.
Are there any user-facing changes?
No.