Selections.jl + DataFrames.jl #1936
Comments
Hi, thank you for your interest and willingness to contribute. The way we approached column selection in DataFrames.jl is the following:
None of this is carved in stone, so please feel free to comment. Given that your package has some overlap with the current (and different) way we handle similar things, I would recommend (if you are willing) splitting your requests into a series of atomic proposals (i.e. what exactly you propose to change or add to the functionality we have now).
@Drvi, very cool stuff! @bkamins and I have actually discussed a bit moving the DataFrames selection code out into its own package, so it's great to see someone do this! One thing we talked about was making the selection code Tables.jl-based instead of DataFrames.jl-specific. Have you looked into this at all? It'd be great if we could make all the logic just work on the result of doing
Thanks @quinnj and @bkamins. So if I understand correctly, the best approach would be to make Selections.jl work with Tables.jl so DataFrames can opt in to it later, once the package is ready. That makes a lot of sense. My bigger plan for Selections was to extend the way people can select, rename, order and even transform columns, i.e. to be able to refer to columns not only by their name but also by their properties (like eltype or some statistic based on the actual values). The API I have in mind was
My original goal was to emulate the If my understanding is correct, Selections are more general than JuliaDB's selectors (except I don't have a special selection for primary keys of a table) as you can combine many of them with
Yes, relying on Tables.jl would be great, but I'm not sure if it supports all the operations I need:
I'd appreciate any guidance here, as I'm not that familiar with what Tables.jl can and cannot do. If there is no way to do this with Tables.jl, people would have to define these to opt in.
Interesting. AFAICT, we could integrate Selections.jl with DataFrames and Tables.jl by defining an
@nalimilan Making the Selections.jl just spitting out pairs of
Do we really need Tables.jl to support renaming? As long as the abstract type exists, we can do whatever we want using selections in DataFrames. Generic functions that work on all table types can come later.
Hi everyone. I finally got some time to finish the overhaul of Selections.jl. Please see the README.md for an introduction. The highlights of this new version:
I'd love some feedback from anyone interested! cc: @nalimilan @quinnj @bkamins.
Interesting, thanks! Sounds very powerful. A few remarks:
Thank you for your comments! There were some excellent points.
This is something I need to think about more deeply. IIUC, it is a generalization of what Selections are currently doing, because currently the
Args & kwargs:
select(df,
s1 => r1 => t1, # All the queries in `args...` are chained together.
s2 => r2 => t2, # First the `selections` are evaluated to identify the columns to retain
s3 => r3 => t3, # and what functions to apply to them (and how to apply them).
s4 => r4 => t4, # E.g. when the set of selected columns is empty,
... # no `transforms` would be applied.
;
col1 = S1 => T1, ## When the `args...` are done, add `col1` to the modified table
col2 = S2 => T2, ## Add `col2` to the modified table with `col1` already present
...
)
Args only:
select(df,
s1 => r1 => t1, # Chain the first two selection queries and materialize them, only
s2 => r2 => t2, # the selected columns (after their corresponding transforms) are available
S1 => T1 => col1, ## Adds/overwrites column `col1` of the newly materialized table.
s3 => r3 => t3, # Keep chaining until `S2 => T2 => col2` is met, then
s4 => r4 => t4 # materialize again. For this phase, `col1` is available
...
;
kwargs ## are up for grabs
)
I think that the relative position of
select(
select(
select(df,
s1 => r1 => t1,
s2 => r2 => t2;
col1 = S1 => T1),
s3 => r3 => t3,
s4 => r4 => t4
...;
col2 = S2 => T2),
...)
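For comparison, here is a minimal sketch of the same "materialize, then keep selecting" idea written as nested calls. It assumes a DataFrames.jl release whose `select` accepts `source => function => target` pairs and `ByRow`, which is a different argument order from the `selection => rename => transform` pairs proposed in this thread. The data and the column names `total` and `double_total` are made up for illustration.

```julia
using DataFrames

# Toy table; the columns are purely illustrative.
df = DataFrame(a = [1, 2, 3], b = [4, 5, 6], c = ["x", "y", "z"])

# Phase 1: keep `a` and `b` and add a derived column `total`.
step1 = select(df, :a, :b, [:a, :b] => ByRow(+) => :total)

# Phase 2: the inner result is already materialized, so `total`
# can be used as an input here, while `c` is no longer available.
step2 = select(step1, :a, :total => ByRow(x -> 2x) => :double_total)
```

Each nested call only sees the columns produced by the call inside it, which is the behavior the nested `select` expression above spells out.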
If you want to pass all columns into the function, you can use Using broadcasting on the pair constructor ( This is a summary of my current approach:
The If what I'm currently doing is the behavior you'd expect for
You could use I hope that in time, I'll develop macro alternatives that would behave differently based on the input function signature (at least I hope that it is possible:-)), so that
That makes a lot of sense, throwing an error by default seems like a better idea.
Yes, this is indeed inspired by R. I liked that I didn't have to wrap integers into Ad
Yes, this is a very similar situation. I'm afraid that
Ad Ad |
Yes. The difference between
The idea is just that
I guess what I find weird is that Anyway this discussion sounds like the same problem as #1952, so maybe we can find a common solution or pattern.
Yes, that's certainly doable, that could even be just
Well that's probably something to consider: with InvertedIndices's
Yes, that would be a possibility.
Actually I wasn't suggesting to have
Given the current functionality of
I would still love to abstract all the selection/transform stuff into a Tables.jl-based package someday, probably once things settle down more.
Sure - the design is mostly independent of
It could be interesting to reconsider how Selections.jl fits in the new system with
Closing this as the discussion did not have a follow-up.
Hi!
I've put together a package, Selections.jl, that implements quite powerful column selection and renaming capabilities for DataFrames.jl, and I would love to see it incorporated into DataFrames.jl.
You can select columns based on their names, positions, ranges and regular expressions, just like DataFrames does. Apart from that, one can select columns by boolean indexing and by applying predicate functions to column names, values, or both; so you can (de)select columns that have more than 60% missing values and whose names are all caps and contain the string "ID" like this:
& and | in order to create quite complex selection rules. Please see the README.md for a more comprehensive description of the package.
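As a rough illustration of the kind of rule described above, the following sketch uses only plain DataFrames.jl functions (`names`, `select`, `Not`) and is not the Selections.jl API. The helper names `is_caps_id` and `mostly_missing` and the toy data are hypothetical, and string column indexing assumes a reasonably recent DataFrames.jl release.

```julia
using DataFrames

# Toy table; the columns are purely illustrative.
df = DataFrame(
    USER_ID  = [1, missing, missing, missing, missing],
    ORDER_ID = [10, 20, 30, 40, 50],
    name     = ["a", "b", "c", "d", "e"],
)

# Predicate on the column name: all caps and containing "ID".
is_caps_id(name) = occursin("ID", name) && name == uppercase(name)

# Predicate on the column values: more than 60% missing.
mostly_missing(col) = count(ismissing, col) / length(col) > 0.6

# Combine both predicates with `&&` and collect the matching column names.
to_drop = [n for n in names(df) if is_caps_id(n) && mostly_missing(df[!, n])]

# Deselect the matching columns.
result = select(df, Not(to_drop))
```

Here only `USER_ID` satisfies both predicates, so `result` keeps `ORDER_ID` and `name`.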
Currently Selections exports both the select and rename functions, which conflicts with DataFrames exports. So my question is: would you like this functionality to be a part of DataFrames? I'd be more than happy to make the necessary changes (e.g., make the API compliant with DataAPI) and iron out the API if you think there is room for improvement. In any case, I'd love to get some feedback on the package so that it can be useful for the community.
Thank you for reading this. :)