[BUG] ops.GroupBy after ops.Filter fails to group correctly, and produces unexpected NaNs #1886

Open
matib99 opened this issue Nov 27, 2024 · 0 comments
Labels
bug Something isn't working

Describe the bug

Applying ops.Groupby(...) after ops.Filter(...) produces incorrect output: some rows are filled with lists of NaNs, and rows are not grouped correctly. The problem appears to be related to indexes.

This bug is related to #1767.

Steps/Code to reproduce bug
Sample code:

import pandas as pd
import nvtabular as nvt

# dummy data
_event_id = [0, 1, 2, 3]
_session = ["a", "a", "a", "b"]
_category = ["x", "x", "x", "y"]
_event_type = ["start", "start", "stop", "start"]
input_df = pd.DataFrame(
    {"event_id": _event_id, "session": _session, "category": _category, "event_type": _event_type}
)
print(input_df.head())

# graph
cat_feats = ["category"] >> nvt.ops.Categorify()

features = ["event_id", "session", "event_type"] + cat_feats

features = features >> nvt.ops.Filter(f=lambda df: df["event_type"] == "start")

groupby_features = features >> nvt.ops.Groupby(
    groupby_cols=["session"],
    aggs={
        "event_id": "list",
        "category": ["list", "count"],
        "event_type": ["list"],
    },
)

processor = nvt.Workflow(groupby_features)
dataset = nvt.Dataset(input_df)

output_df = processor.fit_transform(dataset)
print(output_df.head())

input_df looks like this:

   event_id session category event_type
0         0       a        x      start
1         1       a        x      start
2         2       a        x       stop
3         3       b        y      start

And output_df (after filter and groupby):

  session    event_id_list    category_list        event_type_list     category_count  
0       a  [0.0, 1.0, 3.0]  [3.0, 3.0, 4.0]  [start, start, start]                  3 
1       b            [nan]            [nan]                 [None]                  0 

Expected behavior
Expected output_df should look like this:

  session event_id_list category_list       event_type_list  category_count
0       a        [0, 1]        [3, 3]        [start, start]               2
1       b           [3]           [4]               [start]               1

The event with event_id == 3 should be assigned to session b, not a.
The event_id_list and category_list columns should contain lists of ints, not floats.
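
For reference, the expected grouping can be reproduced with plain pandas. This is only a sketch of the intended semantics, not NVTabular code: the Categorify encoding (3 and 4 in the expected table) is omitted, and the output column names are chosen here to mirror the ones Groupby produces.

```python
import pandas as pd

# Same dummy data as in the reproduction above.
input_df = pd.DataFrame(
    {
        "event_id": [0, 1, 2, 3],
        "session": ["a", "a", "a", "b"],
        "category": ["x", "x", "x", "y"],
        "event_type": ["start", "start", "stop", "start"],
    }
)

# Keep only "start" events, then collect per-session lists and counts.
filtered = input_df[input_df["event_type"] == "start"]
expected = (
    filtered.groupby("session")
    .agg(
        event_id_list=("event_id", list),
        event_type_list=("event_type", list),
        category_count=("category", "count"),
    )
    .reset_index()
)
print(expected)
```

Here session a keeps events 0 and 1, session b keeps event 3, and the event ids stay integers.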

Environment details:

  • Environment location: docker container (from nvidia/cuda:11.8.0-devel-ubi8)
  • Method of NVTabular install: mamba
  • nvtabular version: 23.8.0

Additional context

Related issue #1767 was about a TypeError. In output_df above, the category_list column contains lists of floats (categories should be ints after ops.Categorify), so the values were evidently converted to floats to avoid the TypeError.

I believe only a symptom of the bug was fixed there, not the cause. I think the TypeError was an indirect result of the bug described in this issue: since Groupby causes some rows to be NaNs, there was a type conflict between the original values (ints) and the NaNs (floats). The real problem is that Groupby after Filter messes up the indexing and creates empty rows.
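
The hypothesis can be illustrated in plain pandas. This is an analogue only, and it is an assumption that something similar happens inside NVTabular: a filtered frame keeps its original index labels, and realigning it to a fresh positional range introduces a NaN row, which in turn upcasts the integer column to float.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "event_id": [0, 1, 2, 3],
        "event_type": ["start", "start", "stop", "start"],
    }
)

# Filtering keeps the original labels, so the index now has a gap.
filtered = df[df["event_type"] == "start"]
print(filtered.index.tolist())  # [0, 1, 3]

# Realigning to 0..len-1 looks up label 2, which no longer exists:
# that row becomes NaN and event_id is upcast from int to float.
realigned = filtered.reindex(range(len(filtered)))
print(realigned["event_id"].dtype)  # float64

# Resetting the index right after the filter avoids both effects.
clean = filtered.reset_index(drop=True)
print(clean["event_id"].dtype)
```

If this is indeed the mechanism, resetting the index after the filter step would be the place to look for a fix. I have not verified this against the NVTabular code.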
