Performance improvements to bool replacement #35

Open
AaronCritchley opened this issue Jul 27, 2021 · 4 comments

@AaronCritchley

The current behaviour is to run df.replace({True: 1, False: 0}) on every DataFrame passed in, which can have quite a noticeable performance impact on large DataFrames.

For example, on a 1.25M row, 8 column, 150MB DataFrame, running the replace takes roughly 15 seconds.

Some options to avoid this that seem sensible:

  • Only run replace on bool columns (or perhaps Object columns too?)
  • Allow the user to specify if they would like the replacement to go ahead via an argument or flag
  • Allow the user to specify which columns should be replaced, defaulting to the current behaviour

I think the first point is probably the most pragmatic (roughly the sketch below), but I'm happy to raise an MR with whatever change you see fit - this change would save quite a lot of time when loading large data sets into the database.
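
For the first option, something like this untested sketch (exact column selection to be worked out in the MR):

    # Only run the bool -> int replacement on columns that are actually bool dtype;
    # everything else is left untouched.
    bool_cols = df.select_dtypes(include="bool").columns
    if len(bool_cols) > 0:
        df[bool_cols] = df[bool_cols].replace({True: 1, False: 0})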

@yehoshuadimarsky
Owner

This is a good point, thanks for bringing it up. Yeah, go ahead and submit a pull request, let's see what we can do

@yehoshuadimarsky
Owner

Can you provide a sample schema of the DataFrame you are loading that takes "roughly 15 seconds"? I.e., the output of df.info()

@jamesprinc3

jamesprinc3 commented Apr 13, 2022

I came across this and tried to recreate it. I've got a snippet which exhibits a similar time profile; the following runs in about 12s on my M1 MacBook Pro:


import time
import numpy as np
import pandas as pd

def test_bool_replace():
    length = 1_000_000

    # One genuine bool column plus an object column of empty ndarrays,
    # which seems to be what triggers the slow path in .replace.
    df = pd.DataFrame({
        'bool': [bool(x) for x in np.random.randint(0, 2, size=length)],
        'emptyNDArray': [np.random.rand(0, 0) for _ in range(length)],
    })

    start = time.time()
    df = df.replace({True: 1, False: 0})
    end = time.time()
    print(end - start)

test_bool_replace()

I created another df before this which had 8 int columns and 1 million rows; that took about 0.5 seconds.

It seems the issue is caused by empty ndarrays (if I give them a non-zero size then the .replace seems to fail - this is perhaps a bug to be fixed too!).

Of course this example isn't that realistic, so getting @AaronCritchley's original df format would be handy.
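
For reference, a minimal sketch of the failure mode with non-empty ndarrays (I haven't pinned down the exact exception, and it may differ between pandas versions):

    import numpy as np
    import pandas as pd

    # Same shape of data as above, but the ndarray cells are non-empty.
    df = pd.DataFrame({
        'bool': [True, False],
        'ndarray': [np.random.rand(2, 2) for _ in range(2)],
    })

    try:
        df.replace({True: 1, False: 0})
    except Exception as exc:
        # The failure shows up here, presumably because replace compares each
        # cell against True/False and an ndarray comparison doesn't reduce to
        # a single bool.
        print(type(exc).__name__, exc)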

@jamesprinc3

Using the above example, replacing

    df = df.replace({True: 1, False: 0})

with

    bool_df = df.select_dtypes("bool")
    df[bool_df.columns] = bool_df.replace({True: 1, False: 0})

brings the time down to 0.2s on the same M1 MacBook Pro.
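
This also sidesteps the object column entirely, which, going by the numbers above, is where both the slowdown and the ndarray failure come from.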
