# Add architecture design document #41

**Merged** · 2 commits · Aug 28, 2023

**docs/developer/architecture-and-design/index.md** (+9, -0)

# Architecture and Design

> **Member:** Why not "Architecture decision records"? This would make it easier to find given the Scipp docs and it allows us to add more if need be.

> **Member Author:** I had considered adding this as a subsection. I thought we might have other documentation here as well, not just decision records?

> **Member:** ok


```{toctree}
---
maxdepth: 2
---

workflow-design
```

**docs/developer/architecture-and-design/workflow-design.md** (+228, -0)

# Workflow Design

## Context

This document discusses the problems and requirements in data reduction for neutron-scattering experiments that have led to the design of Sciline.
The terminology used in the examples is specific to neutron-scattering experiments, but the concepts are general and can be applied to other domains.

## Introduction

### Traditional data-reduction workflows

Traditionally, we have supplied users with a toolbox of algorithms and optionally a reduction script or a notebook that uses those algorithms.
Conceptually this looks similar to the following:

```python
# Define parameters
sample_run_id = 12345
background_run_id = 12300
direct_beam_filename = 'direct_beam.h5'
wav_bins = linspace(...)
Q_bins = linspace(...)

# Load data
sample = load_sample(run=sample_run_id)
background = load_background(run=background_run_id)
direct_beam = load_direct_beam(run=direct_beam_filename)

# Process
mask_detectors(sample)
mask_detectors(background)
sample_monitors = preprocess_monitors(sample, wav_bins)
background_monitors = preprocess_monitors(background, wav_bins)

sample_transmission = transmission_fraction(**sample_monitors)
sample_iofq = compute_i_of_q(
    sample,
    direct_beam,
    sample_transmission,
    Q_bins)

background_transmission = transmission_fraction(**background_monitors)
background_iofq = compute_i_of_q(
    background,
    direct_beam,
    background_transmission,
    Q_bins)

iofq = subtract_background(sample_iofq, background_iofq)
```

This is an *imperative workflow*, where the user specifies the order of operations and the dependencies between them.
This is not ideal for a number of reasons:

- The user has to know the order of operations and the dependencies between them.
- The user has to know which algorithms to use.
- The user has to know which parameters to use for each algorithm.
- The user has to know which data to use for each algorithm.
- The user can easily introduce mistakes into a workflow, e.g., by using the wrong order of operations, or by overwriting data.
This is especially problematic in Jupyter notebooks, where the user can easily run cells out of order.

Our most basic programming models provide little help to the user.
For example, we typically write components of reduction workflows as functions of `scipp.Variable` or `scipp.DataArray` objects:

```python
def transmission_fraction(
    incident_monitor: sc.DataArray,
    transmission_monitor: sc.DataArray,
    """
    Compute transmission fraction from incident and transmission monitors.

    Parameters
    ----------
    incident_monitor:
        Incident monitor.
    transmission_monitor:
        Transmission monitor.
    """
) -> sc.DataArray:
    return transmission_monitor / incident_monitor
```

> **Member:** The docstring has slipped in between the arguments and return annotation.

Here, we rely on naming of function parameters as well as docstrings to convey the meaning of the parameters, and it is up to the user to pass the correct inputs.
While good practices such as keyword-only arguments can help, this is still far from a scalable and maintainable solution.

As an improvement, we could adopt an approach with more specific domain types, e.g.,

```python
def transmission_fraction(
incident_monitor: IncidentMonitor,
transmission_monitor: TransmissionMonitor,
) -> TransmissionFraction:
return transmission_monitor / incident_monitor
```

We could now run [mypy](https://mypy-lang.org/) on reduction scripts to ensure that the correct types are passed to each function.
However, this is not practical with dynamic workflows, i.e., when users modify workflows in Jupyter notebooks on the fly.
Aside from this, such an approach would still not help with several of the other issues listed above.


### High-level summary of proposed approach

We propose an architecture combining *domain-driven design* with *dependency injection*.
Dependency injection aids in building a declarative workflow.
We define domain-specific concepts that are meaningful to the (instrument) scientist.
Simple functions provide workflow components that define relations between these domain concepts.

Concretely, we propose to define specific domain types, such as `IncidentMonitor`, `TransmissionMonitor`, and `TransmissionFraction` in the example above.
However, instead of the user having to pass these to functions, we use dependency injection to provide them to the functions.
In essence this will build a workflow's task graph.
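
To make this concrete, here is a minimal sketch of how a dependency graph can be assembled and resolved from type hints alone. The resolver and the toy float-based domain types are illustrative only, not the eventual implementation:

```python
from typing import NewType, get_type_hints

IncidentMonitor = NewType('IncidentMonitor', float)
TransmissionMonitor = NewType('TransmissionMonitor', float)
TransmissionFraction = NewType('TransmissionFraction', float)


def transmission_fraction(
    incident_monitor: IncidentMonitor,
    transmission_monitor: TransmissionMonitor,
) -> TransmissionFraction:
    return TransmissionFraction(transmission_monitor / incident_monitor)


def resolve(target, providers, values):
    # A value given directly (a parameter) short-circuits the search.
    if target in values:
        return values[target]
    # Otherwise find the provider whose return type matches the request
    # and recurse into its argument types.
    provider = providers[target]
    hints = get_type_hints(provider)
    hints.pop('return')
    kwargs = {name: resolve(tp, providers, values) for name, tp in hints.items()}
    return provider(**kwargs)


# Index providers by their return type.
providers = {get_type_hints(f)['return']: f for f in [transmission_fraction]}
values = {
    IncidentMonitor: IncidentMonitor(2.0),
    TransmissionMonitor: TransmissionMonitor(1.0),
}
resolve(TransmissionFraction, providers, values)  # -> 0.5
```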

From the [Guice documentation](https://github.com/google/guice/wiki/MentalModel#injection) (Guice is a dependency injection framework for Java):

> "This is the essence of dependency injection. If you need something, you don't go out and get it from somewhere, or even ask a class to return you something. Instead, you simply declare that you can't do your work without it, and rely on Guice to give you what you need.
>
> This model is backwards from how most people think about code: it's a more *declarative model* rather than an *imperative one*. This is why dependency injection is often described as a kind of inversion of control (IoC)."
> (emphasis added)

> **Member:** Move to the Dependency injection section.


### Domain-driven design

Domain-Driven Design (DDD) is an approach to software development that aims to make software more closely match the domain it is used in.
The obvious benefit of this is that it makes it easier for domain experts to understand and modify the software.

How should we define the domain for the purpose of data reduction?
Looking at, e.g., [Mantid](https://www.mantidproject.org/), we see that the domain is defined as "data reduction for any type of neutron scattering experiment".
This has led to more than 1000 algorithms, making it hard for users to know how to use them.
Furthermore, while algorithms provide some sort of domain-specific language, the data types are generic.

What we propose here is to define the domain more narrowly, highly specific to a technique or even specific to an instrument or experiment.
This will reduce the scope to cover in the domain-specific language.
By making data types specific to the domain, we provide nouns for the domain-specific language.


### Dependency injection

Dependency injection is a common technique for implementing the [inversion of control](https://en.wikipedia.org/wiki/Inversion_of_control) principle.
It makes components of a system more loosely coupled, and makes it easier to replace components, including for testing purposes.
Dependency injection can be performed manually, but there are also frameworks that can help with this.
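
As a minimal illustration of manual (constructor) injection, with class names made up for this sketch, a component receives its dependency instead of constructing it:

```python
class FileLoader:
    """Loads data files; stands in for a real component."""

    def load(self, filename: str) -> None:
        ...


class Workflow:
    # The loader is injected via the constructor instead of being created
    # inside the class, so a test can substitute a fake loader.
    def __init__(self, loader: FileLoader) -> None:
        self.loader = loader


workflow = Workflow(loader=FileLoader())
```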

## Architecture

### In a nutshell

1. The user will define building blocks of a workflow using highly specific domain types for the type-hints, such as `IncidentMonitor`, `TransmissionMonitor`, and `TransmissionFraction`, e.g.,

   ```python
   def transmission_fraction(
       incident_monitor: IncidentMonitor,
       transmission_monitor: TransmissionMonitor,
   ) -> TransmissionFraction:
       return transmission_monitor / incident_monitor
   ```

   > **Member:** Suggested change:
   >
   > ```python
   > return TransmissionFraction(transmission_monitor / incident_monitor)
   > ```
   >
   > It's unfortunately verbose but should be part of the design discussion.

2. The user passes a set of building blocks to the system, which assembles a dependency graph based on the type-hints.
3. The user requests a specific output from the system using one of the domain types.
This may be computed directly, or the system may construct a `dask` graph to compute the output.

Depending on the user's level of expertise and the control they need, step 1, or steps 1 and 2, may be omitted, since pre-defined building blocks and sets of building blocks can be provided in domain-specific support libraries for common use cases. A sketch of what steps 2 and 3 could look like is shown below.
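
As an illustration of steps 2 and 3, a hypothetical usage sketch; `Pipeline`, `WavelengthBins`, `QBins`, and `BackgroundSubtractedIofQ` are placeholder names for this sketch, not an implemented API:

```python
# Step 2: pass building blocks (providers) and parameters to the system.
pipeline = Pipeline(
    providers=[transmission_fraction, compute_i_of_q, subtract_background],
    params={WavelengthBins: wav_bins, QBins: q_bins},
)
# Step 3: request an output by its domain type; the system may compute it
# directly or first construct a dask graph.
iofq = pipeline.compute(BackgroundSubtractedIofQ)
```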

### Parameter handling

Generally, the user must provide configuration parameters to a workflow.
In many cases there are defaults that can be used.
In either case, these parameters must be associated with the correct step in the workflow.
This is complicated by the non-linear nature of the workflow.

> **Member:** Suggested change: "This *gets* complicated by the non-linear nature of the workflow." ?

A flat list of parameters has been used traditionally, relying entirely on parameter naming.
This is problematic for two reasons:
First, certain basic workflow steps may be used in multiple places.
Second, workflows frequently contain nested steps, which may have the same parameters (or not).
This makes the process of setting parameters somewhat opaque and error-prone.
Furthermore, it relies on a hand-written higher-level workflow to set parameters for nested steps, mapping between globally-uniquely-named parameters and the parameters of the nested steps.
Such mappings, in turn, require complicated testing.

A hierarchical parameter system could provide an alternative, but makes it harder to set "global" parameters.
For example, we may want to use the same wavelength-binning for all steps in the workflow.

We propose to handle parameters as dependencies of workflow steps.
That is, the dependency-injection system is used to provide parameters to workflow steps.
Parameters are identified via their type, i.e., we will require defining a domain-specific type for each parameter, such as `WavelengthBinning`.

For nested workflows, we can use a child injector, which provides a scope for parameters.
Parent-scopes can be searched for parameters that are not found in the child-scope, providing a mechanism for "global" parameters.

*Note:
The idea of using child injectors was discarded during implementation for a number of reasons.
Parameters in a "nested" scope can now be realized using the mechanism of generic providers.*
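
To illustrate, a sketch of a parameter handled as a dependency; the type and provider names are assumptions for this sketch:

```python
from typing import NewType

import scipp as sc

RawMonitor = NewType('RawMonitor', sc.DataArray)
WavelengthMonitor = NewType('WavelengthMonitor', sc.DataArray)
# The parameter gets its own domain type, just like data:
WavelengthBins = NewType('WavelengthBins', sc.Variable)


def preprocess_monitor(
    monitor: RawMonitor, bins: WavelengthBins
) -> WavelengthMonitor:
    # The binning parameter is requested by type and injected like any other
    # dependency; no globally-unique parameter name is required.
    return WavelengthMonitor(monitor.hist(wavelength=bins))
```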

### Metadata handling

There have been a number of discussions around metadata handling.
One example is the support (or non-support) of an arbitrary `attrs` dict as part of `scipp.Variable` and `scipp.DataArray`.
Furthermore, we may have metadata that is part of the data-catalog, which may partially overlap with the metadata that is part of the data itself.
The current conclusion is that any attempt to handle metadata in a generic and automatic way will not work.
Therefore, if a user wants to provide metadata for a workflow result, they must do so *explicitly* by specifying functions that can assemble that metadata.
As with regular results, this can be done by injecting the input metadata into the function that computes the result's metadata.
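
For example, a sketch of an explicit metadata provider; all type names are illustrative, not part of the design:

```python
from typing import NewType

SampleRunID = NewType('SampleRunID', int)
BackgroundRunID = NewType('BackgroundRunID', int)
IofQMetadata = NewType('IofQMetadata', dict)


def iofq_metadata(
    sample_run: SampleRunID, background_run: BackgroundRunID
) -> IofQMetadata:
    # Metadata for a result is assembled by an ordinary provider; its inputs
    # are injected exactly like the inputs of a data provider.
    return IofQMetadata(
        {'sample_run': sample_run, 'background_run': background_run}
    )
```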

### Domain-specific types

The system will use domain-specific types to identify workflow steps and parameters.
The system does not require subclassing or use of decorators.
Domain-specific types can be defined as regular classes or subclasses of existing types.
In many cases, a simple alias will be sufficient.

`typing.NewType` can be used as a simple way of creating a domain-specific type alias for type-checking.
For example, we can use it to create a type for a `scipp.DataArray` that represents a transmission monitor.
This avoids a more complex solution such as creating a wrapper class or a subclass:

```python
import typing

import scipp

TransmissionMonitor = typing.NewType('TransmissionMonitor', scipp.DataArray)
```

Note that this does not create a new type, but rather a new name for an existing type.
That is, `isinstance(monitor, TransmissionMonitor)` does not work, since `TransmissionMonitor` is not a type.
Furthermore, operations will revert to the underlying type, e.g., `monitor * 2` will return a `scipp.DataArray`.
For this application, this is actually the desired behavior:
Applying an operation to a domain type will generally result in a different type, so falling back to the underlying type is correct and forces the user to be explicit about the type of the result.
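
A short demonstration of this fallback behavior; `NormalizedMonitor` is an illustrative type for this sketch:

```python
from typing import NewType

import scipp as sc

TransmissionMonitor = NewType('TransmissionMonitor', sc.DataArray)
NormalizedMonitor = NewType('NormalizedMonitor', sc.DataArray)

monitor = TransmissionMonitor(
    sc.DataArray(sc.array(dims=['wavelength'], values=[1.0, 2.0, 3.0]))
)
result = monitor * 2  # type(result) is sc.DataArray; the domain type is lost
# Wrapping the result in a domain type must be done explicitly:
normalized = NormalizedMonitor(monitor / monitor.sum())
```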

## Notes

- Earlier versions of this design document included a detailed discussion on how to handle nested workflows using child-injectors.
During initial implementation efforts this idea was discarded, as it was found to be too complicated and not necessary.
  > **Member:** Duplicate of l. 181f.

  > **Member Author:** Um, what is duplicate?

  > **Member:** Maybe not a full duplicate. But the other section also mentions that nested workflows were considered but not implemented.

  > **Member Author:** Isn't the section talking about nested parameters, as opposed to nested workflows?

- Earlier versions of this design document included considerations on the use of the [injector](https://injector.readthedocs.io/en/latest/) Python library.
During early implementation efforts it was found that this library did not provide advantages beyond the very basic features (which can be implemented in a few lines of code).
For example, we attempted to use `injector`'s `multibind` and child-injector mechanisms to handle nested workflows as well as multiple runs.
This turned out to be far too complicated.
By using a dedicated home-grown solution, in particular generic providers and parameter tables, we found a more straightforward approach and furthermore avoided introducing a dependency (a sketch of the generic-provider idea follows below).
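
A sketch of the generic-provider idea in plain `typing`, not Sciline's actual API; all names are illustrative:

```python
from typing import Generic, NewType, TypeVar

SampleRun = NewType('SampleRun', int)
BackgroundRun = NewType('BackgroundRun', int)
RunType = TypeVar('RunType', SampleRun, BackgroundRun)


class RawData(Generic[RunType]):
    def __init__(self, value) -> None:
        self.value = value


class CleanData(Generic[RunType]):
    def __init__(self, value) -> None:
        self.value = value


def mask_detectors(raw: RawData[RunType]) -> CleanData[RunType]:
    # A single provider definition; the framework can instantiate it once per
    # concrete run type, e.g., RawData[SampleRun] and RawData[BackgroundRun].
    return CleanData(raw.value)
```
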
**docs/developer/index.md** (+1, -2)

```{toctree}
---
caption: About
maxdepth: 2
hidden:
---

getting-started
coding-conventions
dependency-management
architecture-and-design/index
```