Data frame: Copy-overheads and copy-on-write mechanism? #215

SimonHeybrock · 2019-01-28T14:29:11Z

When working with a single array (e.g., an xt::xarray) or an xf::xvariable it is relatively easy to avoid unnecessary copies. My feeling is that for a data frame such as xarray::Dataset or xf::xframe avoiding copies and memory-size overheads is more involved.

I will try to explain based on a couple of examples:

1. Several data frames with common coordinates.

At first sight coordinates should take very little space since typically they have fewer dimensions than the data. In practice this is not necessarily true however:

- Axes with more than one dimension #213 shows an example of coordinates with multiple dimensions. To give a real-world example, we may have several million positions (detector pixels), each with a different time/wavelength-axis of length 1000-10000 (in extreme cases), i.e., we are looking at something like 1e6*[1e3..1e4]*8 Byte = 8 GByte to 80 GByte per frame.
- We may have relatively large auxiliary information (such as detector geometry) for every position (pixel). This can typically be some hundreds of Byte per pixel, i.e., we can also easily exceed 1 GByte of total size. Typically this auxiliary information is the same for all data frames. A central (in-memory) "database" for this type of information would certainly be an alternative, but would require linking from a data frame to the database, which complicates the overall design.
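
The size estimate for a multi-dimensional coordinate can be checked with a few lines of arithmetic. This is only a sketch: `coord_bytes` is a hypothetical helper, and the 8 bytes per value assume double precision.

```cpp
#include <cstdint>

// Hypothetical helper: bytes needed for a coordinate that stores
// `axis_length` values (8 bytes each by default) for every position.
constexpr std::uint64_t coord_bytes(std::uint64_t positions,
                                    std::uint64_t axis_length,
                                    std::uint64_t bytes_per_value = 8)
{
    return positions * axis_length * bytes_per_value;
}

// 1e6 pixels with axes of length 1e3 and 1e4, as in the example above.
static_assert(coord_bytes(1000000ULL, 1000ULL) == 8000000000ULL, "8 GByte");
static_assert(coord_bytes(1000000ULL, 10000ULL) == 80000000000ULL, "80 GByte");
```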

2. Operations that modify only a subset of variables.

2a. In-place operation

Assuming that algorithms should be available in Python, operations need to be methods of the xframe, I think? Is a "natural" (to a Python user) syntax still possible?
In C++ we can use std::move in many cases?

2b. Operations returning a new data frame

In practice not all operations would be done in-place:

Users may try different things, e.g., in a Jupyter Notebook, and want to go back one step, i.e., want to keep intermediate data. If we consider that many operations might just touch, e.g., 10% of the actual data in a frame, copying everything would add a major overhead to any non-in-place operation.
Error handling is more difficult for in-place operation: If an algorithm fails, in-place operations would often break the input data frame, so all checks need to be done ahead of time before modifying any data.
Linking a bit to the type erasure discussed in Data frame: Typed frame vs. type erasure in practice? #214: Suppose we want to do something like
```
 frame = smooth(frame, Dim::X) - frame
```
This is Python, so I suppose expression templates do not help here. Depending on Data frame: Typed frame vs. type erasure in practice? #214, they may not help in C++ either. We need both frame and data derived from frame, i.e., it cannot be done in-place. The data frame returned by smooth would also contain any other data from frame, such as coordinates or attributes, which can be large, as discussed above.

3. Data-variables linked to more than one data-frame

This could be solved in different ways, but one convenience provided by a data frame / dataset type is the ability to store arbitrary variables: This could be used, e.g., for storing a reference measurement, which may be used to normalize the data at a later point. The reference measurement may be identical for a number of different data sets that are processed. This can be considered as a similar issue as described above regarding "auxiliary" information: A central "registry" for this type of information would be possible, but complicates both the data frame as well as the client code.

I think a copy-on-write mechanism would solve most of the issues I mentioned (while of course adding some other complications/headaches). I will wait for an answer before discussing my thoughts on copy-on-write in more detail.
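
To make the idea concrete, here is a minimal single-threaded sketch of what a copy-on-write handle could look like. Everything in it (`cow_ptr`, `frame`, `scaled`) is hypothetical and built on `std::shared_ptr` only; it is not an existing xframe or xtensor API. An operation that returns a new frame copies just the handles and detaches only the variable it writes to.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical copy-on-write handle (single-threaded sketch).
template <class T>
class cow_ptr
{
public:
    explicit cow_ptr(T value) : m_data(std::make_shared<T>(std::move(value))) {}

    // Read access never copies.
    const T& read() const { return *m_data; }

    // Write access copies the payload first if another handle shares it.
    T& write()
    {
        if (m_data.use_count() > 1)
        {
            m_data = std::make_shared<T>(*m_data);
        }
        return *m_data;
    }

    bool shares_with(const cow_ptr& other) const { return m_data == other.m_data; }

private:
    std::shared_ptr<T> m_data;
};

// Hypothetical frame with one (large) coordinate and one data variable.
struct frame
{
    cow_ptr<std::vector<double>> coord;
    cow_ptr<std::vector<double>> data;
};

// Returns a new frame: `data` is detached and modified, `coord` stays shared.
inline frame scaled(const frame& in, double factor)
{
    frame out = in;  // cheap: copies two shared_ptr handles
    for (double& x : out.data.write())
    {
        x *= factor;
    }
    assert(out.coord.shares_with(in.coord));
    return out;
}
```

In this sketch the input frame stays intact, so a user can go back one step, while the untouched coordinate is never duplicated.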

JohanMabille · 2019-01-29T09:53:04Z

Thanks for opening this. Before answering specifically the questions, I will give an overview of the current design and what it allows, that will help having a better overview.

0. Global overview

0.1. xvariable

The C++ equivalent of DataArray is xvariable. It is declared as follows:

```
template <class CCT, class ECT>
class xvariable_container;

// XFRAME_DEFAULT_DATA_CONTAINER is typically xt::xarray<T>
template <class T, class CCT>
using xvariable = xvariable_container<CCT, XFRAME_DEFAULT_DATA_CONTAINER(T)>;
```

The parameters CCT and ECT are closure types on coordinates and expressions. A good description of what we call closure semantics can be found here; the short version is that we can store coordinates by reference or by value in a variable (and the same for the data). So different variables may share the same coordinate object:

```
using coordinate_type = something;
using variable_type = xvariable<double, const coordinate_type&>;

coordinate_type my_coord(....);
// my_coord is never copied: both variables hold a const reference to it
variable_type v1(std::move(some_data1), my_coord);
variable_type v2(std::move(some_data2), my_coord);
```

Same thing for the data (in that case we have to use xvariable_container with xt::xarray<T>& as second template parameter for instance).
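
The mechanism can be illustrated without xframe. The following is a sketch only: `mini_variable` is a hypothetical stand-in for `xvariable_container`, and `std::vector<double>` stands in for both the coordinate and the data container. The template parameters are stored exactly as given, so a reference type shares the object and a value type owns a copy.

```cpp
#include <cassert>
#include <utility>
#include <vector>

using coordinate_type = std::vector<double>;  // stand-in for a real coordinate
using data_type = std::vector<double>;        // stand-in for xt::xarray<double>

// Hypothetical miniature of xvariable_container: CCT and ECT are stored
// as given, so `const coordinate_type&` shares and `coordinate_type` owns.
template <class CCT, class ECT>
class mini_variable
{
public:
    template <class C, class E>
    mini_variable(C&& coord, E&& data)
        : m_coord(std::forward<C>(coord)), m_data(std::forward<E>(data))
    {
    }

    const coordinate_type& coord() const { return m_coord; }
    const data_type& data() const { return m_data; }

private:
    CCT m_coord;
    ECT m_data;
};

using sharing_variable = mini_variable<const coordinate_type&, data_type>;
using owning_variable = mini_variable<coordinate_type, data_type>;

inline void closure_example()
{
    coordinate_type my_coord{0.0, 1.0, 2.0};
    sharing_variable v1(my_coord, data_type{1.0, 2.0, 3.0});
    sharing_variable v2(my_coord, data_type{4.0, 5.0, 6.0});
    owning_variable v3(my_coord, data_type{7.0, 8.0, 9.0});
    assert(&v1.coord() == &my_coord);  // shared, never copied
    assert(&v2.coord() == &my_coord);  // shared, never copied
    assert(&v3.coord() != &my_coord);  // private copy
}
```

With the reference closure type, the caller has to keep `my_coord` alive for as long as `v1` and `v2` exist.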

0.2. xframe

An unaligned xframe will hold xvariable or references on xvariable, and a coordinate object which is the "union" of the coordinates of the xvariables.

An aligned xframe will hold a single coordinate object (which can even be a reference on a coordinate object already existing somewhere else) and many variables that will share this coordinate object as described above. So converting an unaligned xframe to an aligned xframe means copying (or taking a reference on) the coordinate object of the unaligned frame and creating variables with a reference on this coordinate object. Whether the data will be moved, referenced or copied in the aligned xframe will be an option specified by the user and will depend on the data itself (aligning data requires resizing, and thus a copy of the data). By default, aligning a frame will involve copying the data.

1. Several data frames with common coordinates.

As explained above, sharing a coordinate object between variables, and even between frames, is possible. We don't handle coordinates with multiple dimensions yet, this would require some refactoring of the current xcoordinate system which assumes that the axes are independent.

Like the coordinate object, a data variable may be shared between different xframe objects. Besides, this variable could be backed by a database (since a variable can theoretically hold any kind of xtensor expression as long as it provides the expected API, we can imagine an adapter that sends requests to a database).

2. Operations that modify only a subset of variables.

The methods can also be free functions accepting xframe expressions / objects. The idea is to return unevaluated expressions as long as possible (this is for pure C++) and avoid allocating temporary objects. Exposing these lazy functions to Python is not possible, so most of the time we want to expose an explicit instantiation of the function. Nothing prevents us from adding a small API function that takes its argument by const ref, makes a copy, and calls the in-place algorithm on the copy before returning it:

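```
// In-place algorithm working on any xframe expression.
// It has its own name so that the call in my_function below cannot
// resolve to my_function itself (the exact-match overload would win).
template <class E>
void my_function_inplace(xframe_expression<E>& e)
{
    // do something
}

// Copying wrapper: leaves the input frame untouched
xframe_type my_function(const xframe_type& fr)
{
    xframe_type res(fr);
    my_function_inplace(res);
    return res;
}
```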

Also notice that the API we provide to Python may be different from the one in C++ (we may design API classes that embed free C++ functions as methods and expose them to Python).

Regarding what we expose to Python, if we want to expose arithmetic operators AND keep the lazy evaluation, then we have no choice: we need to use type erasure and a dynamic expression system.
We have been discussing such a system at QuantStack, but we have not come up with a ready-to-implement solution yet. Besides, that requires implementing type erasure for variable and frame expressions. I will give more details about type erasure in xframe in the dedicated issue.

3. Data-variables linked to more than one data-frame

Just as it is possible to share coordinates between different variables, it is possible to share the same variable object between different xframe objects. For an unaligned xframe, the type would be something like xvariable_container<coordinate_type, data_type&>; for an aligned xframe, that would be xvariable_container<const coordinate_type&, data_type&>. That would be the responsibility of the user to ensure the data lifecycle is consistent with the ones of the xframe objects.

SimonHeybrock · 2019-01-29T12:02:52Z

@JohanMabille Thanks for the detailed answer. I have a couple of follow-up questions:

0.1

If a coordinate is stored by reference, it looks like there is no lifetime management? That is if the coordinate goes out of scope the variable is broken. Likewise, the coordinate is not deleted if all variables that reference it go out of scope?

0.2

For an aligned xframe, if variables hold the coordinates by reference, does this imply that:
- Removing the coordinate from the frame implies walking all variables and updating (removing) those references?
- Copying a variable from one frame to another needs explicit handling to update the reference?
- Copying a frame needs to update the references in all variables?
Can updating these references be done without copying the data?

1.

I assume that sharing coordinates between frames implies that reference-coordinates are always const?
- To modify a coordinate of a frame, you make a copy, modify it, replace the old coordinate, and update the references in all variables?
If data variables can also be shared (via a reference), my question above regarding lifetime management also applies.
If the sharing is done using a non-owning pointer/reference, I suppose this furthermore implies that any modification to data requires making a copy, since it is not possible to tell whether there is more than one owner? Essentially data variables are also treated as const?

2.

With the heavy reliance on lazy evaluation, how are conflicts handled when creating more than one expression that depends on the same data (I assume this would be quite common in practice?)? Do you simply avoid in-place operations altogether?

3.

> That would be the responsibility of the user to ensure the data lifecycle is consistent with the ones of the xframe objects.

Except for specialized/static applications that might be taken to imply that we should almost always use variables (and coordinates) stored by value (not by reference), since life-cycle management is not so trivial? Are you (considering) supporting something like

```
using coordinate_type = std::vector<double>;
using variable_type = xvariable<double, std::shared_ptr<const coordinate_type>>;
```

such that life-cycle management becomes more feasible?

JohanMabille · 2019-01-31T12:44:35Z

0.1

The lifetime management is up to the user in this case. The coordinate object will be deleted when it goes out of scope, and the remaining variables will be broken. However, the main use case for this feature is to share the coordinate object in an xframe, so we are pretty sure that the coordinate object will have the correct lifetime.

There is also the possibility (not implemented yet) to store a shared coordinate whose life cycle can be totally automatic (with a wrapper on a shared pointer for instance); we already do that for shared expressions in xtensor.
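
A rough sketch of such automatic lifetime management, using only the standard library, could look as follows. `shared_coord_variable` and `make_variable` are hypothetical names, and `std::vector<double>` stands in for a real coordinate object; the actual xframe wrapper may look quite different.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

using coordinate_type = std::vector<double>;  // stand-in for a real coordinate

// Hypothetical variable that co-owns an immutable coordinate.
class shared_coord_variable
{
public:
    shared_coord_variable(std::vector<double> data,
                          std::shared_ptr<const coordinate_type> coord)
        : m_data(std::move(data)), m_coord(std::move(coord))
    {
        assert(m_coord);  // a variable always needs a coordinate
    }

    const coordinate_type& coord() const { return *m_coord; }
    const std::vector<double>& data() const { return m_data; }

private:
    std::vector<double> m_data;
    std::shared_ptr<const coordinate_type> m_coord;
};

// The returned variable keeps the coordinate alive after this scope ends.
inline shared_coord_variable make_variable()
{
    std::shared_ptr<const coordinate_type> coord =
        std::make_shared<coordinate_type>(coordinate_type{0.0, 0.5, 1.0});
    return shared_coord_variable({1.0, 2.0, 3.0}, coord);
}
```

The coordinate is destroyed when the last variable referring to it goes away, and copies of a variable share it without copying the coordinate values.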

0.2

I'm not sure why we would remove a coordinate from a frame, this would prevent accessing data in the variables (even if updated).

Regarding the copy of variables from one frame to another, the variable in the destination frame will be created from a copy of the data of the source variable, and a reference on the coordinate of the destination frame. If you copy one frame to another, first the coordinate is copied, then the variables are copied as described above.

It's also possible to move a variable from a frame to another one, in that case a variable is created in the destination frame from the moved data of the source variable and a reference on the coordinate of the destination frame. The source variable is then removed from the source frame.
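
As an illustration of the move case, here is a simplified sketch with hypothetical types (`mini_frame`, `move_variable`). The frame owns its coordinate and stores plain data buffers by name, so a moved variable implicitly belongs to the coordinate of the destination frame; real xframe variables hold an explicit reference instead.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical aligned frame: one coordinate, many named data buffers.
struct mini_frame
{
    std::vector<double> coord;
    std::map<std::string, std::vector<double>> variables;
};

// Moves the data buffer of `name` from `src` to `dst` without copying it,
// then removes the variable from `src`.
inline void move_variable(mini_frame& src, mini_frame& dst, const std::string& name)
{
    auto it = src.variables.find(name);
    if (it == src.variables.end())
    {
        return;  // nothing to move
    }
    // The data must already match the shape of the destination coordinate.
    assert(it->second.size() == dst.coord.size());
    dst.variables[name] = std::move(it->second);
    src.variables.erase(it);
}
```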

1.
1.1. Yes, otherwise it would be impossible to detect changes in the coordinate object and update the variables. The frame will provide a reshape method that takes a coordinate object. If the object is compatible (i.e. does not imply reshaping the data), it is simply assigned to the stored coordinate object, nothing has to be done in the variables.

1.2. Same rule as above, the lifetime management is the responsibility of the user.

1.3. Modifying the values in the variables should be possible, only the shape / coordinates should be const.

2.

Expressions are created on variables, not on the data themselves. We already have a sharing mechanism in xtensor that we can reuse in xframe. Usually, unevaluated expressions are const expressions, so there should not be conflicts. In-place operations that could modify data are done on containers or variables backed by a database or a file on the system.

3.

The coordinate_type template parameter of the variable can be anything that provides the interface of a coordinate object. So we can (and actually will) support a wrapper around a shared pointer on coordinates.

SimonHeybrock · 2019-01-31T13:16:31Z

Thanks a lot, that all makes sense! Just two more small questions:

> I'm not sure why we would remove a coordinate from a frame, this would prevent accessing data in the variables (even if updated).

Essentially because in practice we may have more than a single coordinate for a specific dimension. Only one of them would be what xarray calls "dimension coordinate", but there may be other auxiliary coordinates. Example: Dimension "position" might have a "position" coordinate (say latitude + longitude), but also a "label" coordinate (say a town name). At a later point, e.g., the labels may become irrelevant and could get removed.

> Expressions are created on variables, not on the data themselves. We already have a sharing mechanism in xtensor that we can reuse in xframe. Usually, unevaluated expressions are const expressions, so there should not be conflicts. In-place operations that could modify data are done on containers or variables backed by a database or a file on the system.

Could you clarify this based on a concrete example? Consider a multiplication with uncertainties:

```
xt::xarray<double> a = {1, 2, 3};
xt::xarray<double> a_err = {1, 2, 3};
xt::xarray<double> b = {1, 2, 3};
xt::xarray<double> b_err = {1, 2, 3};

// Not sure this is a good way of doing this with xtensor, but this is the gist of the operation:
a_err = a_err * b * b + b_err * a * a;
a *= b;
```

With in-place operation (such as a *= b), if the resulting a_err would not be evaluated immediately we would get wrong results. More generally speaking, doesn't this imply that there may be conflicts between expressions, if any of the inputs is replaced by evaluation of another expression? I am not really sure what my actual question is here, maybe just a lack of understanding how lazy evaluation (in particular including support to store expressions in an xframe) would be used in practice.

JohanMabille · 2019-01-31T22:19:54Z

Regarding the coordinates, I haven't planned support for many coordinate objects yet. The idea for now is to replace the current coordinate object with a reshape or reindex method that accepts a new coordinate object. Since variables have references on the coordinate object of the frame, nothing has to be done, they will be automatically "updated". Supporting many coordinate objects should not be too difficult; in that case we can add a "current_coordinate" data member that points to the right coordinate object. Again, updating the coordinate object won't require any walk through the variables.
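
The following sketch shows why no walk through the variables is needed when the coordinate is assigned in place. `ref_variable`, `ref_frame` and its `reindex` method are hypothetical simplifications with `std::vector<double>` as the coordinate; they only mirror the idea described above, not the xframe implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

using coordinate_type = std::vector<double>;  // stand-in for a real coordinate

// Hypothetical variable that refers to the coordinate owned by its frame.
struct ref_variable
{
    const coordinate_type& coord;
    std::vector<double> data;
};

// Hypothetical aligned frame. Copying is disabled because the variables
// hold references to m_coord, which a default copy would not rebind.
class ref_frame
{
public:
    explicit ref_frame(coordinate_type coord) : m_coord(std::move(coord)) {}
    ref_frame(const ref_frame&) = delete;
    ref_frame& operator=(const ref_frame&) = delete;

    void add(std::vector<double> data)
    {
        assert(data.size() == m_coord.size());
        m_variables.push_back(ref_variable{m_coord, std::move(data)});
    }

    const ref_variable& variable(std::size_t i) const { return m_variables.at(i); }

    // Accepts only a coordinate of the same length, so no data is reshaped.
    void reindex(const coordinate_type& new_coord)
    {
        if (new_coord.size() != m_coord.size())
        {
            throw std::invalid_argument("coordinate would require reshaping the data");
        }
        m_coord = new_coord;  // assigned in place, references stay valid
    }

private:
    coordinate_type m_coord;
    std::vector<ref_variable> m_variables;
};
```

After `reindex`, every variable sees the new coordinate values through its reference, without being touched.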

Regarding lazy evaluation, evaluation is triggered upon access or assign. Let's illustrate this with a simple example:

```
xt::xarray<double> a = {1, 2, 3};
xt::xarray<double> b = {1, 2, 3};

auto f = a + b; // f is xfunction<xt::plus, xt::xarray<double>, xt::xarray<double>>, nothing is evaluated yet
double d = f(0);  // only a(0) + b(0) is computed
xt::xarray<double> c = f; // triggers the evaluations of all elements of f and copies them in c
```

There is still a potential consistency problem, but that requires doing more complicated things:

```
auto f = b_err * a * a;
a *= b;
a_err = f + a_err * b * b;
```

However, this is not a really natural way to write code. This kind of inconsistency is inherent to expression templates with lazy evaluation, but we rarely encounter such code.
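
The effect can be reproduced without xtensor. In the sketch below, a lambda that captures its operands by reference plays the role of an unevaluated expression; `array_t`, `evaluate`, `error_then_multiply` and `multiply_then_error` are hypothetical names, and the whole error expression is deferred rather than only one term.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using array_t = std::vector<double>;  // stand-in for xt::xarray<double>

// Evaluates a lazy element-wise expression f(i) into a concrete array.
template <class F>
array_t evaluate(const F& f, std::size_t n)
{
    array_t out(n);
    for (std::size_t i = 0; i < n; ++i)
    {
        out[i] = f(i);
    }
    return out;
}

inline void multiply_in_place(array_t& a, const array_t& b)
{
    assert(a.size() == b.size());
    for (std::size_t i = 0; i < a.size(); ++i)
    {
        a[i] *= b[i];
    }
}

// Error term evaluated before a *= b: it uses the original a.
inline array_t error_then_multiply(array_t a, const array_t& a_err,
                                   const array_t& b, const array_t& b_err)
{
    auto f = [&](std::size_t i) { return a_err[i] * b[i] * b[i] + b_err[i] * a[i] * a[i]; };
    array_t result = evaluate(f, a.size());
    multiply_in_place(a, b);
    return result;
}

// Same expression, evaluated only after a *= b: it reads the updated a.
inline array_t multiply_then_error(array_t a, const array_t& a_err,
                                   const array_t& b, const array_t& b_err)
{
    auto f = [&](std::size_t i) { return a_err[i] * b[i] * b[i] + b_err[i] * a[i] * a[i]; };
    multiply_in_place(a, b);
    return evaluate(f, a.size());
}
```

With all four inputs equal to {1, 2, 3}, the first function yields {2, 16, 54} while the second yields {2, 40, 270}, because the deferred expression sees the already multiplied `a`.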

SimonHeybrock · 2019-02-04T11:24:32Z

While not all aspects discussed here are entirely clear to me at this point, I think the original topic of the discussion (copy-on-write) is settled:

We have now evaluated the costs and benefits of a copy-on-write mechanism for our applications (the analysis may be specific to what we are doing, just linking it here for future reference in case anyone else has similar questions). The outcome is that we do not see a justification for copy-on-write any more with a dataset/dataframe object. There may be performance gains in some cases, but especially when considering also non-performance-related aspects of the overall design, copy-on-write is quite clearly not optimal.

I suggest that any of the other discussion that came up here are continued elsewhere (if required) --- at least from my side the copy-on-write discussion can be considered done, i.e., feel free to close this one.

JohanMabille · 2019-02-04T12:58:56Z

> I suggest that any of the other discussion that came up here are continued elsewhere (if required)

I agree, it's easier to follow a discussion focused on a specific topic rather than browse in a lot of messages potentially not related.

JohanMabille added the Discussion label Jan 29, 2019

JohanMabille closed this as completed Feb 4, 2019
