how can i conditionally apply, parse dates
ritchie46 committed Dec 13, 2020
1 parent ce5890f commit 050c984
Showing 10 changed files with 90 additions and 6 deletions.
3 changes: 3 additions & 0 deletions Makefile
@@ -21,10 +21,13 @@ data/: .venv


run: data
	@mkdir -p book/src/outputs
	$(PYTHON) -m micro_bench.plot_results
	$(PYTHON) -m book.src.examples.lazy_chapter.data_head
	$(PYTHON) -m book.src.examples.lazy_chapter.predicate_pushdown_0
	$(PYTHON) -m book.src.examples.lazy_chapter.predicate_pushdown_1
	$(PYTHON) -m book.src.examples.lazy_chapter.projection_pushdown_0
	$(PYTHON) -m book.src.examples.how_can_i.groupby
	$(PYTHON) -m book.src.examples.how_can_i.aggregate
	$(PYTHON) -m book.src.examples.how_can_i.parse_dates
	$(PYTHON) -m book.src.examples.how_can_i.conditionally_apply
2 changes: 2 additions & 0 deletions book/src/SUMMARY.md
@@ -9,3 +9,5 @@
- [How can I?](how_can_i/intro.md)
    * [GroupBy](how_can_i/groupby.md)
    * [Aggregate](how_can_i/aggregate.md)
    * [Conditionally apply](how_can_i/conditionally_apply.md)
    * [Parse dates](how_can_i/parse_dates.md)
5 changes: 2 additions & 3 deletions book/src/examples/how_can_i/aggregate.py
@@ -1,9 +1,8 @@
import pypolars as pl
from pypolars.lazy import *

reddit = (
    pl.scan_csv("data/reddit.csv")
    .select([pl.sum("comment_karma"), pl.min("link_karma")])
reddit = pl.scan_csv("data/reddit.csv").select(
    [pl.sum("comment_karma"), pl.min("link_karma")]
)

if __name__ == "__main__":
16 changes: 16 additions & 0 deletions book/src/examples/how_can_i/conditionally_apply.py
@@ -0,0 +1,16 @@
import pypolars as pl
from pypolars.lazy import *
import numpy as np

df = pl.DataFrame({"range": np.arange(10), "left": ["foo"] * 10, "right": ["bar"] * 10})

out = df.lazy().with_column(
    when(col("range") >= 5)
    .then(col("left"))
    .otherwise(col("right"))
    .alias("foo_or_bar")
)

if __name__ == "__main__":
    with open("book/src/outputs/how_can_i_conditionally_apply.txt", "w") as f:
        f.write(str(out.collect()))
16 changes: 16 additions & 0 deletions book/src/examples/how_can_i/parse_dates.py
@@ -0,0 +1,16 @@
import pypolars as pl
from pypolars.lazy import *

df = pl.DataFrame(
{"date": ["2020-01-02", "2020-01-03", "2020-01-04"], "index": [1, 2, 3]}
)

parsed = df.lazy().with_column(
col("date").str_parse_date(pl.datatypes.Date32, "%Y-%m-%d")
)

if __name__ == "__main__":
    with open("book/src/outputs/how_can_i_parse_dates_0.txt", "w") as f:
        f.write(str(df))
    with open("book/src/outputs/how_can_i_parse_dates_1.txt", "w") as f:
        f.write(str(parsed.collect()))
3 changes: 2 additions & 1 deletion book/src/how_can_i/aggregate.md
@@ -1,9 +1,10 @@
# How can I aggregate?

Aggregations can be done in a `.select` or a `.with_column` method.
Aggregations can be done in a `.select` or a `.with_column`/`.with_columns` method.

If you want to do a specific aggregation on all columns, you can use the wildcard expression: `.select(col("*").sum())`
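
A rough sketch of what that could look like on a small made-up DataFrame (rather than the reddit data used below); the data and column names here are only for illustration:

```python
import pypolars as pl
from pypolars.lazy import *

# Small made-up DataFrame purely for illustration.
df = pl.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Sum every column at once with the wildcard expression.
totals = df.lazy().select(col("*").sum())
print(totals.collect())
```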

## Examples
```python
{{#include ../examples/how_can_i/aggregate.py:1:8}}
reddit.collect()
17 changes: 17 additions & 0 deletions book/src/how_can_i/conditionally_apply.md
@@ -0,0 +1,17 @@
# How can I conditionally apply?

You often want to modify or add a column to a DataFrame based on some condition/predicate. This is what the
`when().then().otherwise()` expressions are for. Because they read almost like a full English sentence, they need little
further explanation.


## Examples

```python
{{#include ../examples/how_can_i/conditionally_apply.py:1:10}}
print(out.collect())
```

```text
{{#include ../outputs/how_can_i_conditionally_apply.txt}}
```
4 changes: 3 additions & 1 deletion book/src/how_can_i/groupby.md
@@ -1,10 +1,12 @@
# How can I groupby?

The groupby operations is done with the `.groupby` method following by `.agg` method.
The groupby operation is done with the `.groupby` method followed by the `.agg` method.
In the `.agg` method you can do as many aggregations on as many columns as you want.

If you want to do a specific aggregation on all columns, you can use the wildcard expression: `.agg(col("*").sum())`
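
A rough sketch of that on a small made-up DataFrame (this assumes `.groupby` accepts a single column name; the data and column names are only for illustration):

```python
import pypolars as pl
from pypolars.lazy import *

# Small made-up DataFrame purely for illustration.
df = pl.DataFrame({"group": ["a", "a", "b"], "x": [1, 2, 3], "y": [10, 20, 30]})

# Sum every remaining column per group with the wildcard expression.
out = df.lazy().groupby("group").agg(col("*").sum())
print(out.collect())
```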

## Examples

```python
{{#include ../examples/how_can_i/groupby.py:1:8}}
reddit.collect()
28 changes: 28 additions & 0 deletions book/src/how_can_i/parse_dates.md
@@ -0,0 +1,28 @@
# Date parsing

Polars has two date data types:

* Date32
    - a naive date, represented as the number of days since the Unix epoch as a 32-bit signed integer.
    - Use this for Date objects.
* Date64
    - a naive datetime, represented as the number of milliseconds since the Unix epoch as a 64-bit signed integer.
    - Use this for DateTime objects.

Utf8 types can be parsed as one of the two date data types. You can let Polars try to parse the date(time) implicitly, or
apply your own `fmt` rule. Some examples are:

* `"%Y-%m-%d"` for `"2020-12-31"`
* `"%Y/%B/%d"` for `"2020/December/31"`
* `"%B %y"` for `"December 20"`

## Examples

```python
{{#include ../examples/how_can_i/parse_dates.py:4:10}}
print(parsed.collect())
```

```text
{{#include ../outputs/how_can_i_parse_dates_1.txt}}
```
2 changes: 1 addition & 1 deletion book/src/lazy_polars/intro.md
@@ -1,5 +1,5 @@
# Lazy Polars
We directly skip the eager API and dive into the lazy API of Polars. We will be exploring it's functionality by exploring
We directly skip the eager API and dive into the lazy API of Polars. We will explore its functionality using
two medium-large datasets of usernames: the [reddit usernames dataset](https://www.reddit.com/r/datasets/comments/9i8s5j/dataset_metadata_for_69_million_reddit_users_in/),
containing 69+ million rows, and a [runescape username dataset](https://github.com/RuneStar/name-cleanup-2014), containing
55+ million rows.
