While running some benchmarks with @jhale, we noticed a performance regression introduced in #645.
More specifically, before that PR any blocked Coefficient (e.g. a vector-valued material property) was first copied into a temporary outside the quadrature loop, so that it was accessed with unit stride in the hot loop; see the logic removed in #645.
For higher-order quadrature rules (but otherwise simple scalar trees) this can lead to a >30% performance regression.
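To illustrate the difference, here is a minimal sketch of the two access patterns. It is not the code FFCx actually generates: the function names, the sizes (`ndofs`, `bs`, `nq`) and the flat `phi` table are all assumptions made up for this example.

```cpp
#include <array>
#include <cstddef>

// Hypothetical sizes, chosen only for illustration.
constexpr std::size_t ndofs = 4; // dofs per component
constexpr std::size_t bs = 3;    // block size of the coefficient
constexpr std::size_t nq = 6;    // number of quadrature points

// Strided variant: component c of the interleaved coefficient w
// (XYZXYZ...) is read with stride bs inside the quadrature loop.
double integrate_strided(const double* w, const double* phi,
                         const double* weights, std::size_t c)
{
  double acc = 0.0;
  for (std::size_t iq = 0; iq < nq; ++iq)
  {
    double wq = 0.0;
    for (std::size_t i = 0; i < ndofs; ++i)
      wq += w[i * bs + c] * phi[iq * ndofs + i]; // stride-bs read of w
    acc += weights[iq] * wq;
  }
  return acc;
}

// Copy-first variant: gather component c into a contiguous temporary
// once, outside the quadrature loop, so the hot loop reads unit stride.
double integrate_copied(const double* w, const double* phi,
                        const double* weights, std::size_t c)
{
  std::array<double, ndofs> w_c;
  for (std::size_t i = 0; i < ndofs; ++i)
    w_c[i] = w[i * bs + c];

  double acc = 0.0;
  for (std::size_t iq = 0; iq < nq; ++iq)
  {
    double wq = 0.0;
    for (std::size_t i = 0; i < ndofs; ++i)
      wq += w_c[i] * phi[iq * ndofs + i]; // unit-stride read of w_c
    acc += weights[iq] * wq;
  }
  return acc;
}
```

Both functions compute the same value; only the memory access pattern in the inner loop differs, and the one-off copy is amortised over all quadrature points.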
I can add the logic back, but I think that the best place to do this is when packing coefficients in dolfinx.
Otherwise we repeat the packing step (pack coefficients in a given order, then pack them again in a different order).
More specifically, I think FFCx kernels should receive data in XXXXXYYYYYZZZZZ format and not XYZXYZXYZ...
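To make the two layouts concrete, here is a small sketch of the reordering. The function name `interleaved_to_blocked` is hypothetical and not an existing dolfinx or FFCx function.

```cpp
#include <cstddef>
#include <vector>

// Reorder an interleaved array (XYZXYZ...) into a component-major
// one (XX...YY...ZZ...). bs is the block size (number of components);
// in.size() is assumed to be a multiple of bs.
std::vector<double> interleaved_to_blocked(const std::vector<double>& in,
                                           std::size_t bs)
{
  const std::size_t n = in.size() / bs; // entries per component
  std::vector<double> out(in.size());
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t c = 0; c < bs; ++c)
      out[c * n + i] = in[i * bs + c];
  return out;
}
```

With this layout, all values of one component are contiguous, so a kernel looping over the dofs of a single component reads memory with unit stride.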
Agreed, this should be done in dolfinx when packing coefficients. I do not remember whether coefficient packing is templated on the block size; if it is not, there could be a small performance drawback, because the packing loop would run over a block size that is not a compile-time constant.
I'd suggest keeping this issue open, since it will also require changes to the coefficient access in FFCx.
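As a rough sketch of what I mean by templating on the block size: the names and signatures below are made up for illustration and are not the actual dolfinx packing code.

```cpp
#include <cstddef>

// Gather the coefficient values of one cell from an interleaved global
// array into a component-major cell array. BS is a compile-time block
// size, so the inner loop has a known trip count and can be unrolled.
template <int BS>
void pack_cell(const double* global, const int* cell_dofs,
               std::size_t ndofs, double* out)
{
  for (std::size_t i = 0; i < ndofs; ++i)
    for (int c = 0; c < BS; ++c)
      out[c * ndofs + i] = global[cell_dofs[i] * BS + c];
}

// Fallback for block sizes that are only known at run time.
void pack_cell_runtime(const double* global, const int* cell_dofs,
                       std::size_t ndofs, int bs, double* out)
{
  for (std::size_t i = 0; i < ndofs; ++i)
    for (int c = 0; c < bs; ++c)
      out[c * ndofs + i] = global[cell_dofs[i] * bs + c];
}

// Dispatch common block sizes to the templated version.
void pack_cell_dispatch(const double* global, const int* cell_dofs,
                        std::size_t ndofs, int bs, double* out)
{
  switch (bs)
  {
  case 1: pack_cell<1>(global, cell_dofs, ndofs, out); break;
  case 2: pack_cell<2>(global, cell_dofs, ndofs, out); break;
  case 3: pack_cell<3>(global, cell_dofs, ndofs, out); break;
  default: pack_cell_runtime(global, cell_dofs, ndofs, bs, out);
  }
}
```

Whether the templated path is measurably faster than the runtime fallback would need to be benchmarked; the sketch only shows how the dispatch could look.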