[BUG] ops.GroupBy after ops.Filter fails to group correctly, and produces unexpected NaNs #1886

matib99 · 2024-11-27T15:07:49Z

Describe the bug

Applying ops.GroupBy(...) after ops.Filter(...) produces incorrect results: some rows are filled with lists of NaNs, and rows are not grouped correctly. The problem appears to be related to row indexes.

A bug related to #1767

Steps/Code to reproduce bug
Sample code:

import pandas as pd
import nvtabular as nvt

# dummy data
_event_id = [0, 1, 2, 3]
_session = ["a", "a", "a", "b"]
_category = ["x", "x", "x", "y"]
_event_type = ["start", "start", "stop", "start"]
input_df = pd.DataFrame(
    {"event_id": _event_id, "session": _session, "category": _category, "event_type": _event_type}
)
print(input_df.head())

# graph
cat_feats = ["category"] >> nvt.ops.Categorify()

features = ["event_id", "session", "event_type"] + cat_feats

features = features >> nvt.ops.Filter(f=lambda df: df["event_type"] == "start")

groupby_features = features >> nvt.ops.Groupby(
    groupby_cols=["session"],
    aggs={
        "event_id": "list",
        "category": ["list", "count"],
        "event_type": ["list"],
    },
)

processor = nvt.Workflow(groupby_features)
dataset = nvt.Dataset(input_df)

output_df = processor.fit_transform(dataset)
print(output_df.head())

input_df looks like this:

   event_id session category event_type
0         0       a        x      start
1         1       a        x      start
2         2       a        x       stop
3         3       b        y      start

And output_df (after filter and groupby):

  session    event_id_list    category_list        event_type_list  category_count
0       a  [0.0, 1.0, 3.0]  [3.0, 3.0, 4.0]  [start, start, start]               3
1       b            [nan]            [nan]                 [None]               0

Expected behavior
Expected output_df should look like this:

  session event_id_list category_list       event_type_list  category_count
0       a        [0, 1]        [3, 3]        [start, start]               2
1       b           [3]           [4]               [start]               1

The event with event_id == 3 should be assigned to session b, not a.
The event_id_list and category_list columns should contain lists of ints, not floats.
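
For reference, the expected result can be reproduced with plain pandas, without NVTabular. This is only a sketch of the intended semantics: `category_codes` is a hypothetical stand-in for ops.Categorify, with the codes 3 and 4 copied from the output above.

```python
import pandas as pd

input_df = pd.DataFrame(
    {
        "event_id": [0, 1, 2, 3],
        "session": ["a", "a", "a", "b"],
        "category": ["x", "x", "x", "y"],
        "event_type": ["start", "start", "stop", "start"],
    }
)

# Hypothetical stand-in for ops.Categorify (codes taken from the report).
category_codes = {"x": 3, "y": 4}
encoded = input_df.assign(category=input_df["category"].map(category_codes))

# Same predicate as the ops.Filter step in the reproduction code.
filtered = encoded[encoded["event_type"] == "start"]

# Same aggregations as the ops.Groupby step.
expected = (
    filtered.groupby("session")
    .agg(
        event_id_list=("event_id", list),
        category_list=("category", list),
        event_type_list=("event_type", list),
        category_count=("category", "count"),
    )
    .reset_index()
)
print(expected)
```

Here the filtered frame keeps its original index labels (0, 1, 3), and pandas still groups the rows by their `session` value, so the event with event_id == 3 ends up in session b.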

Environment details (please complete the following information):

Environment location: docker container (from nvidia/cuda:11.8.0-devel-ubi8)
Method of NVTabular install: mamba
nvtabular version: 23.8.0

Additional context

The related issue #1767 was about a TypeError. In output_df you can see that the category_list column contains lists of floats (categories should be ints after ops.Categorify), so they were converted in order to avoid the TypeError.

I believe that only a symptom of the bug was fixed there, not the cause. I think the TypeError was an indirect result of the bug I describe in this issue: since GroupBy causes some rows to be NaNs, there was a type conflict between the original values (ints) and the NaNs (floats). The real problem is that GroupBy after Filter messes up indexing and creates some empty rows.
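
To illustrate the kind of index problem I suspect, here is a small pandas-only sketch. It does not use NVTabular, and I have not confirmed that the Groupby op does anything like this internally; it only shows how a filtered column that keeps its old index labels can produce NaNs and an int-to-float conversion when it is combined with a freshly numbered index.

```python
import pandas as pd

event_id = pd.Series([0, 1, 2, 3], name="event_id")
is_start = pd.Series([True, True, False, True])

# After filtering, the index labels are 0, 1, 3 (label 2 is gone).
filtered = event_id[is_start]

# A frame with a fresh positional index: labels 0, 1, 2.
target = pd.DataFrame(index=pd.RangeIndex(len(filtered)))

# Assignment aligns on index labels, not on position:
# label 2 has no match, so it becomes NaN and the column is upcast to float.
target["misaligned"] = filtered

# Resetting the index first keeps the values and the integer dtype.
target["aligned"] = filtered.reset_index(drop=True)

print(target)
print(target.dtypes)
```

The `misaligned` column loses the value 3 and turns into floats, which resembles the NaNs and float lists in output_df above.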

matib99 added the bug Something isn't working label Nov 27, 2024
