Performance improvements to bool replacement #35
Comments
This is a good point, thanks for bringing it up. Yeah, go ahead and submit a pull request and let's see what we can do.
Can you provide a sample schema of the DataFrame you are loading that takes "roughly 15 seconds"? I.e. the output of
I came across this and tried to recreate it. I've got a snippet which exhibits a similar time profile; the following runs in 12s on my M1 MacBook Pro:
I created another df before this which had 8 int columns and 1 million rows, and that took about 0.5 seconds. It seems the issue is caused by empty ndarrays (if I give them a non-zero size then the
Of course this example isn't that realistic, so getting @AaronCritchley's original df format would be handy.
Using the above example, replacing
with
brings the time down to 0.2s on the same M1 MacBook Pro.
Current behaviour is to run
df.replace({True: 1, False: 0})
on all DataFrames passed. This can have quite a noticeable performance impact on large DataFrames. For example, on a 1.25M row, 8 column, 150MB DataFrame, running the replace takes roughly 15 seconds.
Some options to avoid this that seem sensible:
I think the first point is probably the most pragmatic, but happy to raise an MR with whatever change you see fit - this change would save quite a lot of time when loading large data sets into the database.
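For illustration, here is a minimal sketch of one way the full-frame replace could be avoided, assuming only columns with a real `bool` dtype need converting. `bools_to_ints` is a hypothetical helper name, not something in this project, and this is not necessarily one of the options listed above. Note that, unlike `df.replace({True: 1, False: 0})`, it would not touch Python bools stored inside `object` columns.

```python
import numpy as np
import pandas as pd


def bools_to_ints(df: pd.DataFrame) -> pd.DataFrame:
    """Cast bool-dtype columns to int8, leaving every other column untouched.

    Hypothetical helper for illustration only.
    """
    bool_cols = df.select_dtypes(include="bool").columns
    if len(bool_cols) == 0:
        # Nothing to convert, so return the frame as-is and skip any copy.
        return df
    # astype with a column -> dtype mapping only converts the named columns.
    return df.astype({col: np.int8 for col in bool_cols})
```

Because this only inspects dtypes and casts the matching columns, its cost should scale with the number of bool columns rather than with every value in the frame, though I haven't benchmarked it against the DataFrame described above.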