Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop copying Exprs and LogicalPlans so much during Common Subexpression Elimination #9873

Closed
alamb opened this issue Mar 30, 2024 · 16 comments · Fixed by #10835
Closed

Stop copying Exprs and LogicalPlans so much during Common Subexpression Elimination #9873

alamb opened this issue Mar 30, 2024 · 16 comments · Fixed by #10835
Assignees
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Mar 30, 2024

Is your feature request related to a problem or challenge?

The common subexpression elimination pass copies many Exprs around. You can see this performance impact this has by looking at the screenshot from #9637 (comment)

While we will fix the copying plan problem in #9637 I think there is more work to be done in the common sub expression code itself, which copies a significant number of Exprs and Strings around

312892336-1c8214cc-09ec-41b2-aa5d-f52a5dfa4226

Describe the solution you'd like

Figure out how to avoid cloneing Exprs in the https://github.com/apache/arrow-datafusion/blob/main/datafusion/optimizer/src/common_subexpr_eliminate.rs

  1. The Exprs themselves
  2. Avoid creating Strings for Identifier

We should see a significant improvement in the sql_planner benchmarks:

cargo bench --bench sql_planner

Describe alternatives you've considered

No response

Additional context

I noticed this while reviewing #9871

@peter-toth
Copy link
Contributor

I'm happy to take this issue, but I just noticed that there is a draft PR already: #10067. @alamb would you like to finish that PR?

@alamb
Copy link
Contributor Author

alamb commented Apr 19, 2024

I'm happy to take this issue, but I just noticed that there is a draft PR already: #10067. @alamb would you like to finish that PR?

I am on vacation this week (well, partial vacation as it is school break week) so I likely won't have time to work on #10067 until next week. If you have time to do so that would be great. Otherwise I will pick it up the following week

CSE is one of the passes that shows up in the planning benchmarks when I last profiled it, so I think this particular change will be very impactful to planning performance

@alamb
Copy link
Contributor Author

alamb commented Apr 19, 2024

(Thank you @peter-toth 🙏 -- it is much fun working with you)

@peter-toth
Copy link
Contributor

peter-toth commented Apr 19, 2024

Ok, got it. I'm happy to look into it, but not sure I can do it before Monday. Enjoy your partial vacation! 😄

@peter-toth
Copy link
Contributor

I've started working on this, but it will surely take some time...

@alamb alamb changed the title Stop copying Exprs so much during Common Subexpression Elimination Stop copying Exprs and LogicalPlans so much during Common Subexpression Elimination Apr 24, 2024
@peter-toth
Copy link
Contributor

@alamb, just a quick update that I'm still working on this. Trying out different ideas... Will try to open a PR this week or next.

@erratic-pattern
Copy link
Contributor

Hey I am interested in helping with this. Maybe @peter-toth and I can divide our efforts here? Let me know what you've worked on so far, and I can figure out how to help.

One thing I see in particular that's not directly cloning the LogicalPlans and Exprs, but may be putting pressure on the global allocator, is the Identifiers in the ExprSet which are represented as Strings produced by Displaying the Exprs.

Assuming that the new zero-copy implementation will continue using the ExprSet, maybe I could look into efficiently hashing subexpressions to produce numeric identifiers that are easier to copy.

@erratic-pattern
Copy link
Contributor

Actually it looks like ExprSet needs to be completely redesigned because it relies on cloning the Exprs.

I've started a branch for my work. Here is a basic commit that simply swaps the existing code over to using the rewrite API.

@peter-toth
Copy link
Contributor

peter-toth commented May 6, 2024

I've just opened a PR #10396. I wanted to continue @alamb's #10067 but then I ran into a few different issues with the rule so I think we should address those first.

2 notes:

@alamb
Copy link
Contributor Author

alamb commented May 7, 2024

#10067 is orthogonal to my PR it would be great if someone could continue that effort as unfortunately I don't have much time to work on DataFusion lately.

I am happy to do this. Thank you @peter-toth

@alamb alamb self-assigned this May 7, 2024
@erratic-pattern
Copy link
Contributor

erratic-pattern commented May 10, 2024

@peter-toth

the issue of String identifiers are explained in my PR. I have a solution for the issue, but #10396 is already huge and the implementation will be a bit tricky...

I am thinking a Hash implementation for Expr would fix this problem, and is something I plan to work on soon. Would be interested to know what approach you think would work best.

Having a Hash impl would also allow us to refactor other optimization rules which make heavy use of display_name to generate HashMap keys for Exprs

UPDATE: It looks like Expr already derives Hash. Is there a reason we're not using that instead of string keys?

@alamb
Copy link
Contributor Author

alamb commented May 11, 2024

UPDATE: It looks like Expr already derives Hash. Is there a reason we're not using that instead of string keys?

I believe CSE may predate the Hash impl for Expr

@peter-toth
Copy link
Contributor

UPDATE: It looks like Expr already derives Hash. Is there a reason we're not using that instead of string keys?

I forgot to comment on this thread that my detailed answer is on the other: #10426 (comment)

@alamb
Copy link
Contributor Author

alamb commented May 14, 2024

BTW I am purposely procrastinating on this issue until #10426 / #10473 are done to avoid conflicts

Though I may do some preliminary work in smaller PRs

@peter-toth
Copy link
Contributor

peter-toth commented May 17, 2024

@alamb, I think you shouldn't wait for my #10473, as there is still some preliminary work required I need to finish (#10543) and I'm still not sure about a few details of #10473.
Feel free to proceed with refactoring the rule in your #10067 to use rewrite() and I will rebase my PR on top of yours and resolve the possible conflicts.

@alamb
Copy link
Contributor Author

alamb commented May 17, 2024

Thanks @peter-toth

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
3 participants