Performance improvements to bool replacement #35

Open
AaronCritchley opened this issue Jul 27, 2021 · 4 comments

@AaronCritchley

The current behaviour is to run df.replace({True: 1, False: 0}) on every DataFrame passed in, which can have quite a noticeable performance impact on large DataFrames.

For example, on a 1.25M row, 8 column, 150MB DataFrame, running the replace takes roughly 15 seconds.

Some options to avoid this that seem sensible:

  • Only run replace on bool columns (or perhaps Object columns too?)
  • Allow the user to specify if they would like the replacement to go ahead via an argument or flag
  • Allow the user to specify which columns should be replaced, defaulting to the current behaviour

I think the first point is probably the most pragmatic (roughly the sketch below), but I'm happy to raise an MR with whatever change you see fit - this change would save quite a lot of time when loading large data sets into the database.
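
For the first option, something like this untested sketch (exact column selection to be worked out in the MR):

    # Only run the bool -> int replacement on columns that are actually bool dtype;
    # everything else is left untouched.
    bool_cols = df.select_dtypes(include="bool").columns
    if len(bool_cols) > 0:
        df[bool_cols] = df[bool_cols].replace({True: 1, False: 0})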

@yehoshuadimarsky
Owner

This is a good point, thanks for bringing it up. Yeah, go ahead and submit a pull request, let's see what we can do

@yehoshuadimarsky
Owner

Can you provide a sample schema of the DataFrame you are loading that takes "roughly 15 seconds"? I.e., the output of df.info()

@jamesprinc3

jamesprinc3 commented Apr 13, 2022

I came across this and tried to recreate it. I've got a snippet which exhibits a similar time profile; the following runs in about 12s on my M1 MacBook Pro:


import time
import numpy as np
import pandas as pd

def test_bool_replace():
    length = 1_000_000

    # One genuine bool column plus an object column of empty ndarrays,
    # which seems to be what triggers the slow path in .replace.
    df = pd.DataFrame({
        'bool': [bool(x) for x in np.random.randint(0, 2, size=length)],
        'emptyNDArray': [np.random.rand(0, 0) for _ in range(length)],
    })

    start = time.time()
    df = df.replace({True: 1, False: 0})
    end = time.time()
    print(end - start)

test_bool_replace()

I created another df before this which had 8 int columns and 1 million rows; that took about 0.5 seconds.

It seems the issue is caused by empty ndarrays (if I give them a non-zero size then the .replace seems to fail - this is perhaps a bug to be fixed too!).

Of course this example isn't that realistic, so getting @AaronCritchley's original df format would be handy.
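
For reference, a minimal sketch of the failure mode with non-empty ndarrays (I haven't pinned down the exact exception, and it may differ between pandas versions):

    import numpy as np
    import pandas as pd

    # Same shape of data as above, but the ndarray cells are non-empty.
    df = pd.DataFrame({
        'bool': [True, False],
        'ndarray': [np.random.rand(2, 2) for _ in range(2)],
    })

    try:
        df.replace({True: 1, False: 0})
    except Exception as exc:
        # The failure shows up here, presumably because replace compares each
        # cell against True/False and an ndarray comparison doesn't reduce to
        # a single bool.
        print(type(exc).__name__, exc)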

@jamesprinc3

Using the above example, replacing

    df = df.replace({True: 1, False: 0})

with

    bool_df = df.select_dtypes("bool")
    df[bool_df.columns] = bool_df.replace({True: 1, False: 0})

brings the time down to 0.2s on the same M1 MacBook Pro.
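
This also sidesteps the object column entirely, which, going by the numbers above, is where both the slowdown and the ndarray failure come from.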
