group_by >> summarize on an empty df #467

nathanjmcdougall · 2023-01-09T20:38:30Z

Consider the following:

from pandas import DataFrame
from siuba import _, group_by, summarize
DataFrame.from_dict(dict(x=[], y=[])) >> group_by(_.x) >> summarize(z=_.y.sum())

This doesn't add the column z:

x	y

I would have expected

x	z


machow · 2023-01-10T18:03:03Z

Thanks for reporting. Digging a bit into dplyr, it seems to handle this case carefully:

it runs the given operation on the empty data
it sets the resulting array to be the correct type
if the operation would return a non-empty value, it discards the value
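
A rough pandas sketch of those three steps, for illustration only. `empty_summary_column` is a hypothetical helper (not part of siuba or dplyr), and it assumes the summary expression returns a single scalar:

```python
import pandas as pd


def empty_summary_column(df, func):
    """Hypothetical helper: build a 0-length column typed like func's result."""
    value = func(df)                   # step 1: run the operation on the empty data
    dtype = pd.Series([value]).dtype   # step 2: work out the type of the result
    return pd.Series([], dtype=dtype)  # step 3: keep the type, discard the value


# Explicit dtypes are an assumption here, mirroring tibble(a = integer(), b = integer())
df = pd.DataFrame({
    "a": pd.Series([], dtype="int64"),
    "b": pd.Series([], dtype="int64"),
})

int_col = empty_summary_column(df, lambda d: 1)
dbl_col = empty_summary_column(df, lambda d: 1.2)
sum_col = empty_summary_column(df, lambda d: d["a"].sum())

print(int_col.dtype, dbl_col.dtype, sum_col.dtype)
```

In each case the computed value only determines the column type; the returned column has no rows.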

For example:

library(dplyr)

df <- tibble(a = integer(), b = integer())

# in all the examples below, the value is discarded (e.g. 1, 1.2 get thrown away)

# c is an int
df %>% group_by(a) %>% summarize(c = 1)

# c is a dbl
df %>% group_by(a) %>% summarize(c = 1.2)

# c is an int, since sum(a) is 0
df %>% group_by(a) %>% summarize(c = sum(a))

machow · 2023-01-10T18:13:46Z

Note also that the experimental behavior of summarize being able to return 0 or > 1 rows is deprecated (and a new function tentatively called reframe will handle that behavior!).

It seems like the code above still works on the main branch of dplyr, but this case now prints a warning:

df %>% group_by(a) %>% summarize(c = integer())

output:

Warning message:
Returning more (or less) than 1 row per `summarise()` group was deprecated in dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()` always returns an
  ungrouped data frame and adjust accordingly.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

nathanjmcdougall · 2023-01-11T04:23:35Z

Ah, this is quite an interesting way of looking at it.

"A grouped summarise always return 1 row per group"
But what if there are no groups? Does this violate the 1 row per group rule? I would argue that the answer is no rather than yes.

Regarding this process:

it runs the given operation on the empty data
it sets the resulting array to be the correct type
if the operation would return a non-empty value, it discards the value

It seems to me that there are no groups to group by, so there is no empty data to summarize with a function like sum, and no resulting array to set to a correct type, etc. Rather than passing an empty list of values to sum and returning 0, we don't even need to run any summarization, because there are no groups.

It seems that most summarizing methods in pandas like sum, all, mean etc. all accept vacuous/empty inputs and will return 0, True, NaN respectively, i.e. one value, not zero. This means that in most cases I would need to explicitly handle the empty dataframes separately to ensure that the result of a group_by operation has the same column structure at the end of the process as for non-empty dataframes.
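
To make the point about empty inputs concrete, here is a small pandas-only check. The float64 dtype is an assumption chosen for the example:

```python
import pandas as pd

# An empty Series with an explicit dtype (assumption for this example)
empty = pd.Series([], dtype="float64")

total = empty.sum()        # the additive identity
everything = empty.all()   # vacuously true
average = empty.mean()     # undefined, so NaN

print(total, everything, average)
```

Each reduction returns exactly one value even though there were no rows to reduce.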

If siuba needs to match dplyr behaviour on this point, would it be possible to add an optional argument to the summarize function, like __fail_empty: bool = True? Or some other workaround? In any case, I feel like an explicit warning would be helpful when this existing functionality kicks in.
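
For reference, one possible workaround outside siuba is to do the grouped sum directly in pandas. This is only a sketch: `grouped_sum` is a hypothetical helper, not a siuba function, and the explicit dtypes are an assumption (with `dict(x=[], y=[])` the columns would be object dtype):

```python
import pandas as pd


def grouped_sum(df, by, col, out):
    """Hypothetical helper: sum `col` per `by` group, naming the result `out`."""
    # With zero groups this yields an empty Series, so the output keeps
    # the same columns (by, out) whether or not the input has rows.
    return df.groupby(by)[col].sum().rename(out).reset_index()


empty = pd.DataFrame({
    "x": pd.Series([], dtype="int64"),
    "y": pd.Series([], dtype="float64"),
})
full = pd.DataFrame({"x": [1, 1, 2], "y": [1.0, 2.0, 3.0]})

empty_result = grouped_sum(empty, "x", "y", "z")
full_result = grouped_sum(full, "x", "y", "z")

print(list(empty_result.columns), list(full_result.columns))
```

Both results have the columns x and z, so downstream code would not need a separate branch for the empty case.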

machow mentioned this issue Jan 10, 2023

Support a 0-length array result in summarize, when working on an empty DataFrame tidyverse/dplyr#6637

Closed
