how can i use custom functions
ritchie46 committed Dec 13, 2020
1 parent 050c984 commit 2bf132c
Showing 7 changed files with 131 additions and 0 deletions.
3 changes: 3 additions & 0 deletions Makefile
@@ -31,3 +31,6 @@ run: data
	$(PYTHON) -m book.src.examples.how_can_i.aggregate
	$(PYTHON) -m book.src.examples.how_can_i.parse_dates
	$(PYTHON) -m book.src.examples.how_can_i.conditionally_apply
	$(PYTHON) -m book.src.examples.how_can_i.use_custom_functions
	$(PYTHON) -m book.src.examples.how_can_i.use_custom_functions_1
	$(PYTHON) -m book.src.examples.how_can_i.use_custom_functions_2
2 changes: 2 additions & 0 deletions book/src/SUMMARY.md
@@ -11,3 +11,5 @@
* [Aggregate](how_can_i/aggregate.md)
* [Conditionally apply](how_can_i/conditionally_apply.md)
* [Parse dates](how_can_i/parse_dates.md)
* [Use custom functions](how_can_i/use_custom_functions.md)
- [Reference guide](reference.md)
11 changes: 11 additions & 0 deletions book/src/examples/how_can_i/use_custom_functions.py
@@ -0,0 +1,11 @@
import pypolars as pl

my_map = {1: "foo", 2: "bar", 3: "ham", 4: "spam", 5: "eggs"}

s = pl.Series("a", [1, 2, 3, 4, 5])
s = s.apply(lambda x: my_map[x])


if __name__ == "__main__":
    with open("book/src/outputs/how_can_i_use_custom_functions_0.txt", "w") as f:
        f.write(str(s))
22 changes: 22 additions & 0 deletions book/src/examples/how_can_i/use_custom_functions_1.py
@@ -0,0 +1,22 @@
import pypolars as pl
from pypolars.lazy import *
import numpy as np

np.random.seed(1)

df = pl.DataFrame({"foo": np.arange(10), "bar": np.random.rand(10)})

# create a udf
def my_custom_func(s: Series) -> Series:
    return np.exp(s) / np.log(s)


# a simple wrapper that takes a function and sets the output type
my_udf = udf(my_custom_func, output_type=pl.datatypes.Float64)

# run query with udf
out = df.lazy().filter(col("bar").apply(my_udf) > -1)

if __name__ == "__main__":
    with open("book/src/outputs/how_can_i_use_custom_functions_1.txt", "w") as f:
        f.write(str(out.collect()))
21 changes: 21 additions & 0 deletions book/src/examples/how_can_i/use_custom_functions_2.py
@@ -0,0 +1,21 @@
import pypolars as pl
from pypolars.lazy import *

my_map = {1: "foo", 2: "bar", 3: "ham", 4: "spam", 5: "eggs"}

df = pl.DataFrame({"foo": [1, 2, 3, 4, 5]})

# create a udf
def my_custom_func(s: Series) -> Series:
    return s.apply(lambda x: my_map[x])


# a simple wrapper that takes a function and sets the output type
my_udf = udf(my_custom_func, output_type=pl.datatypes.Utf8)

# run query with udf
out = df.lazy().with_column(col("foo").apply(my_udf).alias("mapped"))

if __name__ == "__main__":
    with open("book/src/outputs/how_can_i_use_custom_functions_2.txt", "w") as f:
        f.write(str(out.collect()))
66 changes: 66 additions & 0 deletions book/src/how_can_i/use_custom_functions.md
@@ -0,0 +1,66 @@
# How can I use custom functions?
There will always be an operation so sketchy, so dirty, so grotesque, that you cannot do it with the public API of Polars.
Luckily we provide UDFs (User Defined Functions). This means you can define a Python function or lambda and pass it to the
logical plan. You can use custom functions in both the eager API and the lazy API.

## Examples

Let's start with eager. Let's say we want to apply a map to a Series. This could be done as shown below.

### Eager

```python
{{#include ../examples/how_can_i/use_custom_functions.py:3:8}}
print(s)
```

```text
{{#include ../outputs/how_can_i_use_custom_functions_0.txt}}
```

There are a few gotchas, however. A Polars Series can only contain a single datatype (_storing custom Python objects is being worked on_).
In the `.apply` method above we didn't specify the datatype the Series should contain. Polars tries to infer the output
datatype beforehand by calling the provided function itself. If it later gets a value whose datatype does not match the
initially inferred type, that value is marked as missing: `null`. If you already know the output datatype you need,
it's recommended to provide this information to Polars.

```python
s.apply(lambda x: my_map[x], dtype_out=pl.datatypes.Utf8)
```
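
To make the inference rule concrete, here is a minimal pure-Python sketch. It is a toy model only: `toy_apply` and `mixed_map` are hypothetical names, a plain list stands in for a Series, `None` stands in for `null`, and the type is inferred from the first result, which is an assumption rather than a description of Polars internals.

```python
from typing import Any, Callable, List, Optional


def toy_apply(
    values: List[Any],
    func: Callable[[Any], Any],
    dtype_out: Optional[type] = None,
) -> List[Any]:
    """Toy model of an element-wise apply that allows one output type.

    Hypothetical helper for illustration; this is not how Polars is implemented.
    """
    results = [func(v) for v in values]
    if dtype_out is None and results:
        # No type given: infer it from the first result (an assumption).
        dtype_out = type(results[0])
    # Results of any other type are replaced by None (the stand-in for null).
    return [r if type(r) is dtype_out else None for r in results]


# A map with one value of a different type than the others.
mixed_map = {1: "foo", 2: "bar", 3: 3.0}

inferred = toy_apply([1, 2, 3], lambda x: mixed_map[x])
as_float = toy_apply([1, 2, 3], lambda x: mixed_map[x], dtype_out=float)

print(inferred)  # ['foo', 'bar', None]
print(as_float)  # [None, None, 3.0]
```

In this model, stating the output type up front decides which values survive, instead of leaving that to whatever the first element happens to return.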

### Lazy
In lazy you can also apply custom functions. Note that there are differences with eager. In the eager API
the function in `.apply` works on a single-element level. The `lambda` we used above got an `int` as input and returned a `str`
after finding the right key in the dictionary.
In lazy, `.apply` gets a whole `Series` as input and must return a new `Series`. The output type must also be provided,
because the optimizer needs to know the schema of the query at all times to be able to do optimizations.
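
The difference in calling convention can be sketched without Polars at all. In the snippet below a plain list stands in for a Series, and `eager_style_apply` and `lazy_style_apply` are hypothetical helpers, not Polars functions. They only show who does the looping: the library (element-wise) or your own function (whole column).

```python
from typing import Callable, List

my_map = {1: "foo", 2: "bar", 3: "ham"}
column = [1, 2, 3]  # a plain list standing in for a Series


def eager_style_apply(values: List[int], func: Callable[[int], str]) -> List[str]:
    # The function is called once per element.
    return [func(v) for v in values]


def lazy_style_apply(
    values: List[int], func: Callable[[List[int]], List[str]]
) -> List[str]:
    # The function is called once and receives the whole column.
    return func(values)


per_element = eager_style_apply(column, lambda x: my_map[x])
whole_column = lazy_style_apply(column, lambda col: [my_map[x] for x in col])

print(per_element)   # ['foo', 'bar', 'ham']
print(whole_column)  # ['foo', 'bar', 'ham']
```

Both calls give the same result here. What changes is the signature your function must have.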

```python
{{#include ../examples/how_can_i/use_custom_functions_1.py:1:19}}
print(out.collect())
```

```text
{{#include ../outputs/how_can_i_use_custom_functions_1.txt}}
```

Above we've defined our own function and added it to the lazy query, and it got executed during execution of the physical plan.
This of course greatly increases the flexibility of a query, and when needed you are definitely encouraged to do so. This is, however,
not without cost. Even though we only use vectorized code in this example (numpy functions and Polars comparisons), this query
may still be slower than a fully Polars-native query. This is due to the Python `GIL`. As mentioned before, Polars tries to parallelize
the query execution on the available cores of your machine. However, in Python only one thread at a time may modify Python objects.
So if you have many UDFs, they have to wait in line until they are allowed their GIL time.
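
The effect of the GIL can be observed with the standard library alone. The sketch below is independent of Polars, and `slow_python_udf` is a hypothetical stand-in for a UDF written in pure Python. It runs the same work sequentially and on four threads and prints both timings. On a standard CPython build with the GIL, the threaded run is usually not much faster, but exact numbers depend on your machine and Python build.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_python_udf(n: int) -> int:
    # Pure-Python loop: holds the GIL while it runs on a standard CPython build.
    total = 0
    for i in range(n):
        total += i * i
    return total


inputs = [200_000] * 4

# Run the four calls one after another.
start = time.perf_counter()
sequential = [slow_python_udf(n) for n in inputs]
sequential_seconds = time.perf_counter() - start

# Run the same four calls on four threads.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(slow_python_udf, inputs))
threaded_seconds = time.perf_counter() - start

print(f"sequential: {sequential_seconds:.3f}s, threaded: {threaded_seconds:.3f}s")
```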


### Double apply
In the lazy UDF you can always use the eager custom lambdas as well. To go back to our first example, *applying a dictionary map*:

```python
{{#include ../examples/how_can_i/use_custom_functions_2.py:1:18}}
print(out.collect())
```

```text
{{#include ../outputs/how_can_i_use_custom_functions_2.txt}}
```
6 changes: 6 additions & 0 deletions book/src/reference.md
@@ -0,0 +1,6 @@
# Reference guide

Need to see all available methods and functions of Polars? The reference guide is your best bet.

* [Python](https://ritchie46.github.io/polars/pypolars/index.html)
* [Rust](https://ritchie46.github.io/polars/polars/index.html)
