From 04bf93f260b18fe2df8b0c06a0d818b75073582f Mon Sep 17 00:00:00 2001
From: ritchie46
Date: Sun, 13 Dec 2020 11:42:19 +0100
Subject: [PATCH] numpy introp

---
 book/src/micro_benchmarks.md | 15 +++++++++++++++
 book/src/numpy.md            | 12 ++++++++++++
 2 files changed, 27 insertions(+)
 create mode 100644 book/src/micro_benchmarks.md
 create mode 100644 book/src/numpy.md

diff --git a/book/src/micro_benchmarks.md b/book/src/micro_benchmarks.md
new file mode 100644
index 000000000..567669a0b
--- /dev/null
+++ b/book/src/micro_benchmarks.md
@@ -0,0 +1,15 @@
+# Micro benchmarks
+Below are some micro benchmarks comparing Polars and Pandas. Note that these are just micro benchmarks and nothing
+more. An optimization can increase performance in a single benchmark and lead to regressions somewhere else.
+
+To make a meaningful performance comparison, we should at least look at the macro level of a query.
+
+## CSV parsing
+![](../img/csv.png)
+
+## Joins
+![](../img/join_80_000.png)
+
+## Groupby
+![](../img/groupby10_.png)
+![](../img/groupby10_mem.png)
diff --git a/book/src/numpy.md b/book/src/numpy.md
new file mode 100644
index 000000000..290d36418
--- /dev/null
+++ b/book/src/numpy.md
@@ -0,0 +1,12 @@
+# NumPy interoperability
+
+Polars Series support NumPy's [universal functions](https://numpy.org/doc/stable/reference/ufuncs.html).
+That means that NumPy's elementwise functions like `np.exp`, `np.cos`, `np.divide`, etc. all work with almost zero overhead.
+There are a few gotchas, however. Missing values are ignored during function application. They are preserved, as they
+live in a separate bitmask. However, any function that depends on previous/next elements, e.g. a sum or a convolution,
+should be avoided or used with caution.
+
+## Conversion
+You can convert a `Series` to a NumPy array with the `.to_numpy` method. Missing values will be replaced by `NaN` during
+the conversion. If the Series doesn't have missing values, or you don't care about them, you can use the `.view` method.
+This provides a zero-copy NumPy array of the `Series` data.
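
The missing-value caveat described in `numpy.md` can be illustrated with a small sketch. It assumes a simplified column made of a values buffer plus a separate validity mask. The names `values`, `validity`, `apply_elementwise` and `to_python` are hypothetical, and this is plain Python rather than Polars' actual implementation. It only shows why an elementwise function is safe while an aggregate over the raw buffer is not.

```python
import math

# Hypothetical stand-in for a nullable column: a values buffer plus a
# separate validity mask (True = value present). Not Polars' real layout.
values = [0.0, 1.0, 99.0, 2.0]          # slot 2 is "missing"; 99.0 is leftover data
validity = [True, True, False, True]


def apply_elementwise(func, values, validity):
    """Apply func to every slot; the mask is carried over unchanged."""
    return [func(v) for v in values], list(validity)


def to_python(values, validity):
    """Materialise the column, showing missing slots as None."""
    return [v if ok else None for v, ok in zip(values, validity)]


# Elementwise: the missing slot is computed on, but stays masked out.
new_values, new_validity = apply_elementwise(math.exp, values, validity)
elementwise_result = to_python(new_values, new_validity)

# Aggregate over the raw buffer: the value under the missing slot leaks in.
naive_sum = sum(values)
# Aggregate that respects the mask.
masked_sum = sum(v for v, ok in zip(values, validity) if ok)
```

Here `naive_sum` is `102.0` while `masked_sum` is `3.0`, which is the kind of discrepancy the "use with caution" note warns about. The elementwise result still reports the third slot as missing.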