character or factor in dataset #600

Open
clarkliming opened this issue Aug 3, 2023 · 8 comments

Comments

@clarkliming
Contributor

clarkliming commented Aug 3, 2023

In rtables, both character and factor variables are allowed. However, there are cases that lead to an error like the following:

Error: Error applying analysis function (var - RACE): Number of rows generated by analysis function do not match across all columns.

This error message is not clear to end users.
So the question is: should chevron always require factors, or should it also allow characters?

Option 1: chevron requires factors, and preprocessing converts all necessary variables to factor.
pros:

  1. easier to use (minimal requirements for the data)

cons:

  1. preprocessing becomes much more complicated
  2. custom preprocessing requires more user effort

Option 2: chevron accepts factors or characters. Preprocessing converts only some of the variables to factor (only those used with analyze_vars). @Melkiades, are there any other situations in which this error could occur?
pro:

  1. easier to use (minimal requirement for the data)
  2. minimal changes needed

cons:

  1. preprocessing becomes more complicated, though only in a few places

Option 3: chevron still allows characters (as rtables supports them), and rtables/tern improve their handling of this situation.

pros:

  1. users are able to understand the error even if they are not using chevron
  2. chevron stays simple (and robust to data) from the user's side in most cases; the issue occurs only when different arms have different levels

cons:

  1. users still need to adjust the data manually (convert to factor) if rtables/tern only emit warnings and do not provide a way to make it work
  2. eliminating the error may be incompatible with the current design

@Melkiades, may I ask whether there are any differences between factor and character in rtables?

@clarkliming
Contributor Author

Or, @Melkiades, do you think it would be possible to allow results of different lengths? Would using "" as a placeholder be fine?

@clarkliming
Contributor Author

related to #675

@barnett11
Contributor

Just bumping this again @clarkliming - we really need to resolve this before more users start onboarding into this space, as it's quite fundamental.

@BFalquet
Contributor

I think Admiral/Oak should do it. At our stage, converting to factor will drop the missing levels and it will be difficult to track.

@clarkliming
Contributor Author

How about we only allow factors? @BFalquet, maybe you can have a look and summarize:

  1. how many variables allow factor only, character only, or both
  2. what the impact would be if we only allowed factors

@Melkiades
Contributor

I think rtables transforms everything into factors, which is why we use df_explicit_na. Otherwise, the NAs would be eaten up by the transformation. I have been meaning to update how we deal with NAs for a while, but I still haven't had the time to work on that proactively.
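
The point about NAs being "eaten up" can be illustrated with base R alone. This is only a sketch under assumptions, not what rtables or df_explicit_na do internally: the vector `race` and the label `"<Missing>"` are made up for illustration.

```r
# Hypothetical character variable with two missing values.
race <- c("ASIAN", NA, "WHITE", NA)

# Plain conversion: NA does not become a level, so a count over the
# levels silently omits the missing values.
f_plain <- factor(race)
levels(f_plain)
table(f_plain)

# Recoding NA to an explicit label before converting keeps the missing
# values visible as their own level.
race_explicit <- ifelse(is.na(race), "<Missing>", race)
f_explicit <- factor(race_explicit, levels = c("ASIAN", "WHITE", "<Missing>"))
table(f_explicit)
```

With the plain conversion the counts cover only two of the four records; with the explicit level all four are accounted for.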

@Melkiades
Contributor

By the way, sorry for missing this. The error seems unrelated to the discussion about factor vs character. I need to see why it happens, but it seems that some analyze calls output more or fewer statistics than others. Looking at the error, I think this is not allowed except with pruning or when rbind-ing multiple tables. I need to take a closer look at the code producing the error.

@clarkliming
Contributor Author

> btw sorry for missing this out. The error seems unrelated to the discussion about factor vs character. I need to see why it happens but it seems that some analyze outputs more or less statistics than others. Looking at the error, I think this is not allowed if not with pruning or multiple tables' rbind. I need to take a closer look at the code producing the error

Thank you @Melkiades. There are multiple causes of this issue, but character variables can lead to it. There is even an extra message along the lines of "converting something to factor". It happens when the unique character values do not match across columns and we want a summary table counting the occurrences of each level.
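
The mechanism can be reproduced with base R alone. The following is a minimal sketch, not the rtables implementation: the data frame `adsl_toy`, its values and the helper `count_by_arm` are hypothetical and only mimic an analysis function that is applied once per column.

```r
# Hypothetical toy data: arm "B" contains no "WHITE" subjects.
adsl_toy <- data.frame(
  ARM = c("A", "A", "A", "B", "B"),
  RACE = c("ASIAN", "WHITE", "ASIAN", "ASIAN", "ASIAN"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count the values of `var` separately within each arm.
count_by_arm <- function(df, var) {
  lapply(split(df[[var]], df$ARM), table)
}

# Character input: each arm only knows the values it contains,
# so the number of rows produced differs between arms.
n_chr <- sapply(count_by_arm(adsl_toy, "RACE"), length)

# Factor input: the levels travel with every subset, so each arm
# produces one row per level, including zero counts.
adsl_fct <- adsl_toy
adsl_fct$RACE <- factor(adsl_fct$RACE)
n_fct <- sapply(count_by_arm(adsl_fct, "RACE"), length)

print(n_chr)
print(n_fct)
```

With character input, arm A yields two rows and arm B only one; with factor input, both arms yield two rows.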
