Data frame: Copy-overheads and copy-on-write mechanism? #215
Thanks for opening this. Before answering the questions specifically, I will give an overview of the current design and what it allows, which should make the answers easier to follow.
0. Global overview
0.1. xvariable
The C++ equivalent of DataArray is
template <class CCT, class ECT>
class xvariable_container;
// XFRAME_DEFAULT_DATA_CONTAINER is typically xt::xarray<T>
template <class T, class CCT>
using xvariable = xvariable_container<CCT, XFRAME_DEFAULT_DATA_CONTAINER(T)>;
The parameters
using coordinate_type = something;
using variable_type = xvariable<double, const coordinate_type&>;
coordinate_type my_coord(....);
// my_coord is never copied
variable_type v1(std::move(some_data1), my_coord);
variable_type v2(std::move(some_data2), my_coord);
Same thing for the data (in that case we have to use
0.2. xframe
An unaligned
An aligned
1. Several data frames with common coordinates.
As explained above, sharing a coordinate object between variables, and even between frames, is possible. We don't handle coordinates with multiple dimensions yet; this would require some refactoring of the current
As with the coordinate object, a data / variable may be shared between different
2. Operations that modify only a subset of variables.
The methods can also be free functions accepting
template <class E>
void my_function(xframe_expression<E>& e)
{
    // do something
}
xframe_type my_function(const xframe_type& fr)
{
    xframe_type res(fr);
    my_function(res);
    return res;
}
Also notice that the API we provide to Python may be different from the one in C++ (we may design API classes that embed free C++ functions as methods and expose them to Python). Regarding what we expose to Python, if we want to expose arithmetic operators AND keep the lazy evaluation, then we have no choice: we need to use type erasure and a dynamic expression system.
3. Data-variables linked to more than one data-frame
As it is possible to share coordinates between different variables, it is possible to share the same variable object between different
@JohanMabille Thanks for the detailed answer. I have a couple of follow-up questions: 0.1
0.2
1.
2.
3.
Except for specialized/static applications, that might be taken to imply that we should almost always use variables (and coordinates) stored by value (not by reference), since life-cycle management is not so trivial? Are you (considering) supporting something like
using coordinate_type = std::vector<double>;
using variable_type = xvariable<double, std::shared_ptr<const coordinate_type>>;
such that life-cycle management becomes more feasible?
0.1 The lifetime management is up to the user in this case. The coordinate object will be deleted when it goes out of scope, and the remaining variables will be broken. However the main use case for this feature is to share the coordinate object in a
There is also the possibility (not implemented yet) to store a shared coordinate whose life cycle can be totally automatic (with a wrapper on a shared pointer for instance); we already do that for shared expressions in xtensor.
0.2 I'm not sure why we would remove a coordinate from a frame; this would prevent accessing data in the variables (even if updated). Regarding the copy of variables from one frame to another, the variable in the destination frame will be created from a copy of the data of the source variable and a reference to the coordinate of the destination frame. If you copy one frame to another, first the coordinate is copied, then the variables are copied as described above. It's also possible to move a variable from one frame to another; in that case a variable is created in the destination frame from the moved data of the source variable and a reference to the coordinate of the destination frame. The source variable is then removed from the source frame.
1. 1.2. Same rule as above, the lifetime management is the responsibility of the user.
1.3. Modifying the values in the variables should be possible, only the shape / coordinates should be const.
2. Expressions are created on variables, not on the data themselves. We already have a sharing mechanism in xtensor that we can reuse in xframe. Usually, unevaluated expressions are const expressions, so there should not be conflicts. In-place operations that could modify data are done on containers or variables backed by a database or a file on the system.
3. The coordinate_type template parameter of the variable can be anything that provides the interface of a coordinate object. So we can (and actually will) support a wrapper around a shared pointer on coordinates.
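To make the idea of a shared-pointer wrapper on coordinates concrete, here is a minimal standard-library sketch. It is not xframe code: `shared_coordinate`, `toy_variable` and `make_two_variables` are hypothetical names chosen for illustration, and the coordinate is assumed to be a plain `std::vector<double>`.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a coordinate; not an xframe type.
using coordinate_type = std::vector<double>;

// Wrapper that shares ownership of an immutable coordinate object.
class shared_coordinate
{
public:
    explicit shared_coordinate(coordinate_type c)
        : m_ptr(std::make_shared<const coordinate_type>(std::move(c)))
    {
    }

    const coordinate_type& get() const { return *m_ptr; }
    long use_count() const { return m_ptr.use_count(); }

private:
    std::shared_ptr<const coordinate_type> m_ptr;
};

// Toy variable: owns its data, shares its coordinate.
struct toy_variable
{
    std::vector<double> data;
    shared_coordinate coord;
};

// Builds two variables on one coordinate. The local handle `coord`
// is destroyed on return, but the coordinate itself stays alive
// because the returned variables still hold it.
inline std::pair<toy_variable, toy_variable> make_two_variables()
{
    shared_coordinate coord(coordinate_type{0.0, 0.5, 1.0});
    toy_variable v1{{1.0, 2.0, 3.0}, coord};
    toy_variable v2{{4.0, 5.0, 6.0}, coord};
    return {std::move(v1), std::move(v2)};
}
```

With such a wrapper the coordinate is stored once, and neither variable is broken when the scope that created the coordinate ends, at the cost of one reference count per coordinate.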
Thanks a lot, that all makes sense! Just two more small questions:
Essentially because in practice we may have more than a single coordinate for a specific dimension. Only one of them would be what
Could you clarify this based on a concrete example? Consider a multiplication with uncertainties:
xt::xarray<double> a = {1, 2, 3};
xt::xarray<double> a_err = {1, 2, 3};
xt::xarray<double> b = {1, 2, 3};
xt::xarray<double> b_err = {1, 2, 3};
// Not sure this is a good way of doing this with xtensor, but this is the gist of the operation:
a_err = a_err * b * b + b_err * a * a;
a *= b;
With in-place operation (such as
Regarding the coordinates, I haven't planned support for many coordinate objects yet. The idea for now is to replace the current coordinate object with a
Regarding lazy evaluation, evaluation is triggered upon access or assignment. Let's illustrate this with a simple example:
xt::xarray<double> a = {1, 2, 3};
xt::xarray<double> b = {1, 2, 3};
auto f = a + b; // f is xfunction<xt::plus, xt::xarray<double>, xt::xarray<double>>, nothing is evaluated yet
double d = f(0); // only a(0) + b(0) is computed
xt::xarray<double> c = f; // triggers the evaluation of all elements of f and copies them into c
There is still a potential consistency problem, but that requires doing more complicated things:
auto f = b_err * a * a;
a *= b;
a_err = f + a_err * b * b;
However, this is not a really natural way to write code. This kind of inconsistency is inherent to expression templates with lazy evaluation, but we hardly ever encounter such code.
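The consistency issue with unevaluated expressions can be reproduced without xtensor. The following is a toy sketch using only the standard library; `lazy_product`, `evaluate` and `multiply_in_place` are hypothetical helpers, and the lazy expression is assumed to hold its operands by reference, as expression templates typically do.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy lazy product: stores references to its operands and computes
// each element only when it is accessed.
struct lazy_product
{
    const std::vector<double>& lhs;
    const std::vector<double>& rhs;

    double operator()(std::size_t i) const { return lhs[i] * rhs[i]; }
};

// Evaluates every element of the expression into a new container.
inline std::vector<double> evaluate(const lazy_product& e)
{
    std::vector<double> out(e.lhs.size());
    for (std::size_t i = 0; i < out.size(); ++i)
    {
        out[i] = e(i);
    }
    return out;
}

// In-place element-wise multiplication, the equivalent of a *= b.
inline void multiply_in_place(std::vector<double>& a, const std::vector<double>& b)
{
    for (std::size_t i = 0; i < a.size(); ++i)
    {
        a[i] *= b[i];
    }
}
```

If `lazy_product f{a, a}` is built first and `multiply_in_place(a, b)` runs before `evaluate(f)`, the result is computed from the updated `a`, not from the values `a` held when `f` was created. Evaluating `f` into a container before the in-place operation avoids the problem.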
While not all aspects discussed here are entirely clear to me at this point, I think the original topic of the discussion (copy-on-write) is: We have now evaluated the costs and benefits of a copy-on-write mechanism for our applications (the analysis may be specific to what we are doing, just linking it here for future reference in case anyone else has similar questions). The outcome is that we no longer see a justification for copy-on-write with a dataset/dataframe object. There may be performance gains in some cases, but especially when also considering non-performance-related aspects of the overall design, copy-on-write is quite clearly not optimal. I suggest that any other discussions that came up here be continued elsewhere (if required); at least from my side the copy-on-write discussion can be considered done, i.e., feel free to close this one.
I agree, it's easier to follow a discussion focused on a specific topic than to browse through a lot of potentially unrelated messages.
When working with a single array (e.g., an xt::xarray) or an xf::xvariable it is relatively easy to avoid unnecessary copies. My feeling is that for a data frame such as xarray::Dataset or xf::xframe avoiding copies and memory-size overheads is more involved.
I will try to explain based on a couple of examples:
1. Several data frames with common coordinates.
At first sight coordinates should take very little space since typically they have fewer dimensions than the data. In practice this is not necessarily true however:
1e6*[1e3..1e4]*8 Byte = 8 GByte...80 GByte per frame.
1 GByte of total size. Typically this auxiliary information is the same for all data frames. A central (in-memory) "database" for this type of information would certainly be an alternative, but would require linking from a data frame to the database, which complicates the overall design.
2. Operations that modify only a subset of variables.
2a. In-place operation
xframe, I think? Is a "natural" (to a Python user) syntax still possible?
std::move in many cases?
2b. Operations returning a new data frame
In practice not all operations would be done in-place:
frame and data derived from frame, i.e., it cannot be done in-place. The data frame returned by smooth would also contain any other data from frame, such as coordinates or attributes, which can be large, as discussed above.
3. Data-variables linked to more than one data-frame
This could be solved in different ways, but one convenience provided by a data frame / dataset type is the ability to store arbitrary variables: This could be used, e.g., for storing a reference measurement, which may be used to normalize the data at a later point. The reference measurement may be identical for a number of different data sets that are processed. This can be considered as a similar issue as described above regarding "auxiliary" information: A central "registry" for this type of information would be possible, but complicates both the data frame as well as the client code.
I think a copy-on-write mechanism would solve most of the issues I mentioned (while of course adding some other complications/headaches). I will wait for an answer before discussing my thoughts on copy-on-write in more detail.
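For reference, the kind of copy-on-write mechanism meant here can be sketched in a few lines of standard C++. This is only an illustration under simplifying assumptions (single-threaded use, a hypothetical `cow_handle` class that exists in neither xframe nor xtensor), not a proposal for the actual design.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical copy-on-write handle; not part of xframe or xtensor.
// Assumes single-threaded use: use_count() is not a safe uniqueness
// test when other threads may copy the handle concurrently.
template <class T>
class cow_handle
{
public:
    explicit cow_handle(T value)
        : m_ptr(std::make_shared<T>(std::move(value)))
    {
    }

    // Read access never copies the payload.
    const T& read() const { return *m_ptr; }

    // Write access clones the payload first if another handle shares it.
    T& write()
    {
        if (m_ptr.use_count() > 1)
        {
            m_ptr = std::make_shared<T>(*m_ptr);
        }
        return *m_ptr;
    }

    // True if both handles currently point at the same payload.
    bool shares_with(const cow_handle& other) const { return m_ptr == other.m_ptr; }

private:
    std::shared_ptr<T> m_ptr;
};
```

Copying a handle is cheap and shares the payload, so a reference measurement or a large coordinate could be attached to many data frames without duplication. The deep copy happens only on the first `write()` through a handle whose payload is still shared.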