add sortrows and sortcols to unstack #3395

Open
wants to merge 1 commit into
base: main
2 changes: 2 additions & 0 deletions NEWS.md
Expand Up @@ -13,6 +13,8 @@
columns only to a subset of the columns specified by the `cols`
keyword argument
([#3386](https://github.com/JuliaData/DataFrames.jl/pull/3386))
* Add `sortrows` and `sortcols` keyword arguments to `unstack`
([#3395](https://github.com/JuliaData/DataFrames.jl/pull/3395))

## Bug fixes

Expand Down
48 changes: 26 additions & 22 deletions src/abstractdataframe/reshape.jl
Expand Up @@ -215,18 +215,19 @@ end
"""
unstack(df::AbstractDataFrame, rowkeys, colkey, value;
renamecols::Function=identity, allowmissing::Bool=false,
combine=only, fill=missing, threads::Bool=true)
combine=only, fill=missing, threads::Bool=true,
sortrows=false, sortcols=false)
unstack(df::AbstractDataFrame, colkey, value;
renamecols::Function=identity, allowmissing::Bool=false,
combine=only, fill=missing, threads::Bool=true)
combine=only, fill=missing, threads::Bool=true,
sortrows=false, sortcols=false)
unstack(df::AbstractDataFrame;
renamecols::Function=identity, allowmissing::Bool=false,
combine=only, fill=missing, threads::Bool=true)
combine=only, fill=missing, threads::Bool=true,
sortrows=false, sortcols=false)

Unstack data frame `df`, i.e. convert it from long to wide format.

Row and column keys are ordered in the order of their first appearance.

# Positional arguments
- `df` : the AbstractDataFrame to be unstacked
- `rowkeys` : the columns with a unique key for each row, if not given, find a
Expand Down Expand Up @@ -259,6 +260,14 @@ Row and column keys are ordered in the order of their first appearance.
time). Whether or not tasks are actually spawned and their number are
determined automatically. Set to `false` if `combine` requires serial
execution or is not thread-safe.
- `sortrows`: the order of rows in the output table; all values accepted by

Suggested change
- `sortrows`: the order of rows in the output table; all values accepted by
- `sortrows`: the order of rows in the resulting table; all values accepted by

I would prefer a word other than "output"; I find it a little ambiguous.

`sort` keyword argument in `groupby` passed the `rowkeys` for grouping are supported;

What does "passed" mean here? I am confused.

`false` by default (rows are ordered following the first appearance order).
- `sortcols`: the order of columns in the output table; all values accepted by

Suggested change
- `sortcols`: the order of columns in the output table; all values accepted by
- `sortcols`: the order of columns in the resulting table; all values accepted by

I would prefer a word other than "output"; I find it a little ambiguous.

`sort` keyword argument in `groupby` passed `colkey` for grouping are supported;

What does "passed" mean here? I am confused.

`false` by default (columns are ordered following the first appearance order).
Note that the ordering is done on the source data (not on column final column names
that can be potentially changed by the function passed in the `renamecols` keyword argument).
Comment on lines +269 to +270

This is also confusing.
"column final column names"
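
For context on the ordering this docstring describes, here is a minimal sketch that uses only the existing API. The data frame and its column names (`id`, `key`, `value`) are made up for illustration. The `sortrows` and `sortcols` keywords are only proposed in this PR, so they appear in a comment rather than in executed code.

```julia
using DataFrames

# Hypothetical long-format data: keys deliberately appear out of sorted order.
df = DataFrame(id=[2, 2, 1, 1], key=["b", "a", "b", "a"], value=1:4)

# Existing behavior: row and column keys follow their first appearance.
wide = unstack(df, :id, :key, :value)
wide_names = names(wide)   # "id", then "b", then "a"
wide_ids = wide.id         # 2, then 1

# The `sort` keyword of `groupby` is what the proposed keywords forward to.
gd = groupby(df, :key, sort=true)
sorted_keys = [k.key for k in keys(gd)]   # "a", then "b"

# Proposed in this PR (not executed here, since it may not be available):
# unstack(df, :id, :key, :value, sortrows=true, sortcols=true)
```

Because the sorting is delegated to `groupby`, it applies to the values in the `colkey` column of the source data, not to the names produced by `renamecols`.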


Metadata: table-level `:note`-style metadata and column-level `:note`-style
metadata for row keys columns are preserved.
Expand Down Expand Up @@ -420,7 +429,8 @@ julia> unstack(df, :cols, :values, combine=sum)
function unstack(df::AbstractDataFrame, rowkeys, colkey::ColumnIndex,
values::ColumnIndex; renamecols::Function=identity,
allowmissing::Bool=false, allowduplicates::Bool=false,
combine=only, fill=missing, threads::Bool=true)
combine=only, fill=missing, threads::Bool=true,
sortrows=false, sortcols=false)
if allowduplicates
Base.depwarn("allowduplicates keyword argument is deprecated. " *
"Pass `combine=last` instead of `allowduplicates=true`.", :unstack)
Expand Down Expand Up @@ -472,17 +482,18 @@ function unstack(df::AbstractDataFrame, rowkeys, colkey::ColumnIndex,
noduplicates = false
end

g_rowkey = groupby(df_op, rowkeys)
g_colkey = groupby(df_op, colkey)
# if sorting is set to false we use fast aggregation, as we later fix the order
g_rowkey = groupby(df_op, rowkeys, sort=sortrows)
g_colkey = groupby(df_op, colkey, sort=sortcols)
valuecol = df_op[!, values_out]
return _unstack(df_op, index(df_op)[rowkeys], index(df_op)[colkey], g_colkey,
valuecol, g_rowkey, renamecols, allowmissing, noduplicates, fill)
end

function unstack(df::AbstractDataFrame, colkey::ColumnIndex, values::ColumnIndex;
renamecols::Function=identity, allowmissing::Bool=false,
allowduplicates::Bool=false, combine=only, fill=missing,
threads::Bool=true)
allowduplicates::Bool=false, combine=only, fill=missing,
threads::Bool=true, sortrows=false, sortcols=false)
if allowduplicates
Base.depwarn("allowduplicates keyword argument is deprecated. " *
"Pass `combine=last` instead of allowduplicates=true.", :unstack)
Expand All @@ -492,20 +503,21 @@ function unstack(df::AbstractDataFrame, colkey::ColumnIndex, values::ColumnIndex
value_int = index(df)[values]
return unstack(df, Not(colkey_int, value_int), colkey_int, value_int,
renamecols=renamecols, allowmissing=allowmissing,
combine=combine,
fill=fill, threads=threads)
combine=combine, fill=fill, threads=threads,
sortrows=sortrows, sortcols=sortcols)
end

function unstack(df::AbstractDataFrame; renamecols::Function=identity,
allowmissing::Bool=false, allowduplicates::Bool=false,
combine=only, fill=missing, threads::Bool=true)
combine=only, fill=missing, threads::Bool=true,
sortrows=false, sortcols=false)
if allowduplicates
Base.depwarn("allowduplicates keyword argument is deprecated. " *
"Pass `combine=last` instead of allowduplicates=true.", :unstack)
combine = last
end
unstack(df, :variable, :value, renamecols=renamecols, allowmissing=allowmissing,
combine=combine, fill=fill, threads=threads)
combine=combine, fill=fill, threads=threads, sortrows=sortrows, sortcols=sortcols)
end

# we take into account the fact that idx, starts and ends are computed lazily
Expand Down Expand Up @@ -590,10 +602,6 @@ function _unstack(df::AbstractDataFrame, rowkeys::AbstractVector{Int},
copycols=false)

@assert length(col_group_row_idxs) == ncol(df2)
# avoid reordering when col_group_row_idxs was already ordered
if !issorted(col_group_row_idxs)
df2 = df2[!, sortperm(col_group_row_idxs)]
end

if !isempty(intersect(_names(df1), _names(df2)))
throw(ArgumentError("Non-unique column names produced. " *
Expand All @@ -604,10 +612,6 @@ function _unstack(df::AbstractDataFrame, rowkeys::AbstractVector{Int},
res_df = hcat(df1, df2, copycols=false)

@assert length(row_group_row_idxs) == nrow(res_df)
# avoid reordering when row_group_row_idxs was already ordered
if !issorted(row_group_row_idxs)
res_df = res_df[sortperm(row_group_row_idxs), :]
end

# only table-level :note-style metadata needs to be copied
# as column-level :note-style metadata is already correctly set
Expand Down
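
The two blocks removed from `_unstack` in this file restored first-appearance order after grouping. A minimal sketch of that technique in plain Julia follows. It assumes that `col_group_row_idxs` holds, for each group, the index of the first source row in which the group appears (the diff does not show its definition); the labels and indices are made up.

```julia
# Hypothetical group labels, in the arbitrary order a grouping step might return them.
labels = ["a", "c", "b"]

# Assumed meaning: index of the first source row belonging to each group.
first_row_idxs = [5, 1, 3]

# Same guard as in the removed lines: skip the permutation when already ordered.
perm = issorted(first_row_idxs) ? collect(eachindex(labels)) : sortperm(first_row_idxs)

# Groups rearranged so that the one seen first in the source comes first.
ordered = labels[perm]
```

With these blocks removed, `_unstack` keeps whatever order the `GroupedDataFrame` objects passed to it already have.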