Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated Basic Usage of Manipulation Functions #3360

Merged
merged 37 commits into from
Dec 13, 2024

Conversation

nathanrboyer
Copy link
Contributor

@nathanrboyer nathanrboyer commented Jul 18, 2023

Replaces PR #2907 since it was too out of sync with the main branch.

@nathanrboyer nathanrboyer changed the title Initial commit Updated Basic Usage of Manipulation Functions Jul 18, 2023
docs/src/man/basics.md Outdated Show resolved Hide resolved
docs/src/man/basics.md Outdated Show resolved Hide resolved
@nathanrboyer
Copy link
Contributor Author

Friendly bump 🙂

@bkamins
Copy link
Member

bkamins commented Sep 15, 2023

Agreed. But first we need to make a decision on #3361. I will have a look at it and comment there.

| ------------ | -------------------------------- | -------------------------------------------- | ------------------------------------------------- |
| `transform` | Creates a new data frame. | Retains both source and manipulated columns. | Retains same number of rows as source data frame. |
| `transform!` | Modifies an existing data frame. | Retains both source and manipulated columns. | Retains same number of rows as source data frame. |
| `select` | Creates a new data frame. | Retains only manipulated columns. | Retains same number of rows as source data frame. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"manipulated" or "created"?

| `transform!` | Modifies an existing data frame. | Retains both source and manipulated columns. | Retains same number of rows as source data frame. |
| `select` | Creates a new data frame. | Retains only manipulated columns. | Retains same number of rows as source data frame. |
| `select!` | Modifies an existing data frame. | Retains only manipulated columns. | Retains same number of rows as source data frame. |
| `subset` | Creates a new data frame. | Retains only source columns. | Number of rows is determined by the manipulation. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe better say that by condition (to differentiate it from combine)

docs/src/man/basics.md Outdated Show resolved Hide resolved
### Constructing Operation Pairs
All of the functions above use the same syntax which is commonly
`manipulation_function(dataframe, operation)`.
The `operation` argument is a `Pair` which defines the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add that Pair is constructed using =>?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also - it does not have to be the pair, so maybe say it is usually a pair?
(your first example is not a pair below)

These rules are typically called transformation mini-language.

Let us move to the examples of application of these rules
## Basic Usage of Manipulation Functions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A larger question - maybe create a separate page for this tutorial?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would be better. Maybe "Manipulation Functions" under "User Guide" before "Split-apply-combine"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As commented below - I would put it as a "top level" with a name something along "A gentle introduction to manipulation functions" (so that we clearly signal that this material is less formal than the rest of the manual).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would you think of making the new Top Level section something like "Beginner's Guide" or "User's Guide for Beginners" and then placing "Manipulation Functions" at a second level under that? I'm not volunteering to rewrite the entire User's Guide, but it could leave room for others to add similar "gentle" content to the documentation. It would also make the sidebar look cleaner by splitting up the current long name.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. The other section that could go there is https://dataframes.juliadata.org/stable/man/basics/ as it has the same objective.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, but Basics contains the existing "Basic Usage of Manipulation Functions". I don't know how to differentiate it from this new section if they live next to each other.

I initially intended to just clarify some topics within that section, but now the scope has grown.

I can maybe overwrite that section if I add these topics that I don't currently cover:

  • "Note that this time we use string column selectors because some of the column names have spaces in them."
  • "The benefit of select or combine over indexing is that it is easier to get the union of several column selectors."
  • "It is important to note that select always returns a data frame, even if a single column selected as opposed to indexing syntax."
  • "By default select copies columns of a passed source data frame. In order to avoid copying, pass the copycols=false keyword argument."

The other sections under Basics use the German dataset, but I think it is easier to understand what is going on with smaller data frames where you know all the data values.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can maybe overwrite that section

Yes - I think it is OK just to expand that section (especially that it is top-level now already)

I think it is easier to understand what is going on with smaller data frames

Agreed. just please use different variable names than these already used there so that using different dataframes does not lead to confusion.

Thank you! (sorry for so many comments, but - unfortunately - writing documentation is hard)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done (sort of). I did not use new variables names though. I had been and continued to just overwrite the definition of df. My data frames are so small and frequent that coming up with a new name each time would be a pain.

!!! Note
The Julia parser sometimes prevents `:` from being used by itself.
`ERROR: syntax: whitespace not allowed after ":" used for quoting`
means your `:` must be wrapped in either `(:)` or `Cols(:)`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I typically recommend All() which is easy to understand I think.

docs/src/man/basics.md Outdated Show resolved Hide resolved
docs/src/man/basics.md Outdated Show resolved Hide resolved
@mortenpi
Copy link
Contributor

mortenpi commented Dec 6, 2023

@mortenpi - how could @nathanrboyer reference e.g. this section

Sorry, missed this earlier. This actually looks like a bug. The fact that it's level 3 heading shouldn't matter. As a workaround, you might be able to use @id-s in the headers though.

@nathanrboyer
Copy link
Contributor Author

I think I fixed my errors, but I am still getting some errors below which seem unrelated to me.

I get this error on main with julia --project make.jl:

ERROR: LoadError: ArgumentError: makedocs() got passed invalid keyword arguments:
  strict = true

I get this error on this branch with julia --project make.jl:

[ Info: Doctest: running doctests.
┌ Error: doctest failure in C:\Users\nboyer.AIP\.julia\packages\DataFrames\58MUJ\src\abstractdataframe\iteration.jl:38-73
│ 
│ ```jldoctest
│ julia> df = DataFrame(x=1:4, y=11:14)
│ 4×2 DataFrame
│  Row │ x      y
│      │ Int64  Int64
│ ─────┼──────────────
│    1 │     1     11
│    2 │     2     12
│    3 │     3     13
│    4 │     4     14
│ 
│ julia> eachrow(df)
│ 4×2 DataFrameRows
│  Row │ x      y
│      │ Int64  Int64
│ ─────┼──────────────
│    1 │     1     11
│    2 │     2     12
│    3 │     3     13
│    4 │     4     14
│ 
│ julia> copy.(eachrow(df))
│ 4-element Vector{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
│  (x = 1, y = 11)
│  (x = 2, y = 12)
│  (x = 3, y = 13)
│  (x = 4, y = 14)
│ 
│ julia> eachrow(view(df, [4, 3], [2, 1]))
│ 2×2 DataFrameRows
│  Row │ y      x
│      │ Int64  Int64
│ ─────┼──────────────
│    1 │    14      4
│    2 │    13      3
│ ```
│ 
│ Subexpression:
│ 
│ copy.(eachrow(df))
│ 
│ Evaluated output:
│ 
│ 4-element Vector{@NamedTuple{x::Int64, y::Int64}}:
│  (x = 1, y = 11)
│  (x = 2, y = 12)
│  (x = 3, y = 13)
│  (x = 4, y = 14)
│ 
│ Expected output:
│ 
│ 4-element Vector{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
│  (x = 1, y = 11)
│  (x = 2, y = 12)
│  (x = 3, y = 13)
│  (x = 4, y = 14)
│ 
│   diff =
│    4-element Vector{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
│     Vector{@NamedTuple{x::Int64, y::Int64}}:
│     (x = 1, y = 11)
│     (x = 2, y = 12)
│     (x = 3, y = 13)
│     (x = 4, y = 14)
└ @ Documenter C:\Users\nboyer.AIP\.julia\packages\DataFrames\58MUJ\src\abstractdataframe\iteration.jl:38
ERROR: LoadError: `makedocs` encountered a doctest error. Terminating build

@mortenpi
Copy link
Contributor

mortenpi commented May 30, 2024

strict = true

If this is using Documenter 1.0+, then that keyword was removed (and strict=true is now the default). It looks like it got correctly removed on master though: https://github.com/JuliaData/DataFrames.jl/pull/3416/files#diff-4aae2d1c783cade58bd2cb13748da956e568b7f2aed5fafd9e2a46fb97daf613L45

The doctest failures seem to be due to different printing between Julia versions (i.e. make sure you use the same Julia version locally as is being used for the docs CI).

@nathanrboyer
Copy link
Contributor Author

I thought that I already was in sync from this: 7614fc3
but I found another button to sync on my repository page which fixed the strict line.

I am still getting the second error, and I am using the juliaup release branch which seems to match the ci.yml.

version:
- '1.6'
- '1' # automatically expands to the latest stable 1.x release of Julia

@nathanrboyer
Copy link
Contributor Author

I am going to need help figuring out the build issue here. The tests were all passing until I merged main into this branch again today.

The failing doctest is one that appeared earlier on a docstring that is not part of this PR. I attempted to fix the issue in #3472, which worked on my local computer but apparently did not work when CI ran the same tests here. I'm not sure what to do now.

@bkamins
Copy link
Member

bkamins commented Dec 11, 2024

Yes - let us hope that after StableRNGs.jk PR merge this one goes through cleanly. Thank you!

docs/src/man/basics.md Outdated Show resolved Hide resolved
docs/src/man/basics.md Outdated Show resolved Hide resolved
docs/src/man/basics.md Outdated Show resolved Hide resolved
@bkamins bkamins merged commit dc59622 into JuliaData:main Dec 13, 2024
6 checks passed
@bkamins
Copy link
Member

bkamins commented Dec 13, 2024

Thank you!
Can you please double check the documentation on-line after it is deployed?

@nathanrboyer
Copy link
Contributor Author

Reviewing now ...
The definition list doesn't seem to be supported. 😞
VSCode:
image
Online:
image

@nathanrboyer
Copy link
Contributor Author

nathanrboyer commented Dec 13, 2024

I also want to remind you that the terminology we decided on in #2907 (comment) still needs to be propagated to the rest of the manual and docstrings.

nathanrboyer added a commit to nathanrboyer/DataFrames.jl that referenced this pull request Dec 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants