Draft: Feat/convert storage #60

Open · JensZack wants to merge 7 commits into main
Conversation

@JensZack (Collaborator) commented Dec 6, 2024

Adds a feature to convert between different time_series_manager storage types:

  • copies raw arrays from one storage to another
  • does not affect metadata
  • returns a new storage object if replace=False; otherwise the underlying storage is updated to the new type and the old storage is discarded by the time_series_manager (see the usage sketch after this list)
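
For illustration, a rough usage sketch of the feature as described above. The import paths, the enum member name, and the keyword arguments are assumptions; the exact convert_storage signature is still being discussed later in this PR.

# Rough usage sketch; import paths, the enum member, and keyword arguments
# are assumptions, not the final API from this PR.
from infrasys import System  # assumed import path
from infrasys.time_series_models import TimeSeriesStorageType  # assumed module path

system = System(name="example-system")  # illustrative constructor arguments
# ... add components and time series (stored in memory by default) ...

# Convert the underlying time series storage to Arrow files on disk,
# replacing the old storage in place.
system.convert_storage(
    time_series_storage_type=TimeSeriesStorageType.ARROW,  # assumed member name
    replace=True,
)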

index, length = metadata.get_range(start_time=start_time, length=length)
data = base_ts[metadata.variable_name][index : index + length]
data = base_ts[index : index + length] # TODO use uuid col

JensZack (author):

Will remove this TODO; it's now obsolete.

def add_time_series(self, metadata: TimeSeriesMetadata, time_series: TimeSeriesData) -> None:
    if metadata.time_series_uuid not in self._arrays:
        self._arrays[metadata.time_series_uuid] = time_series
        time_series_array = time_series.data

JensZack (author):

will update to time_series.data_array

JensZack (author):

@daniel-thom, this is where I needed to reference .data. There is no other way to access the data from TimeSeriesData without enforcing a data attribute on the base class. This method could be changed to take SingleTimeSeries as an input instead of TimeSeriesData.

JensZack (author):

But I really think it makes sense to add data as an attribute to the base class. I was experimenting with generics so that we could create SingleTimeSeries[pint.Quantity], or something like that, to type the data attribute in any child class. I don't see a case where a child class of TimeSeriesData wouldn't have a data attribute.
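
For reference, a minimal sketch of that generics idea using pydantic generic models; this is not how infrasys defines these classes today, and the names and bases are simplified for illustration.

from typing import Generic, TypeVar

import numpy as np
from pydantic import BaseModel, ConfigDict

DataT = TypeVar("DataT")

class TimeSeriesData(BaseModel, Generic[DataT]):
    """Simplified base class exposing a typed data attribute to every subclass."""
    model_config = ConfigDict(arbitrary_types_allowed=True)
    data: DataT

class SingleTimeSeries(TimeSeriesData[np.ndarray]):
    """Concrete subclass whose data is a plain numpy array."""

ts = SingleTimeSeries(data=np.arange(24, dtype=float))
assert isinstance(ts.data, np.ndarray)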

Collaborator:

You could do something similar to what is done in get_time_series and throw an error if the type is not SingleTimeSeries.

I agree that all subclasses of TimeSeriesData will have some data attribute, but they could look very different and have other supporting attributes. They could have a different number of dimensions. It could look like this Sienna type. The exact type should be in the subclass. The main point is that any method that operates on data should know what the time series type is. I'm not sure any code can do anything meaningful with TimeSeriesData.data. Do we want a contract that requires subclasses to use the exact field name data just for debugging? I suppose there is some value in that.
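
A minimal sketch of that guard pattern, reusing the simplified classes from the sketch above; the function name and exception type are illustrative, not the actual infrasys API.

import numpy as np

def get_data_array(time_series: TimeSeriesData) -> np.ndarray:
    # Raise if the concrete type is unknown, similar to what get_time_series does.
    if not isinstance(time_series, SingleTimeSeries):
        msg = f"Unsupported time series type: {type(time_series)}"
        raise NotImplementedError(msg)
    # Safe to touch .data now that the concrete type is known.
    return np.asarray(time_series.data)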

otherwise, the new storage object is returned

"""
return self._time_series_mgr.convert_storage(**kwargs, replace=replace)

JensZack (author):

Would it make sense to return a copy of time_series_mgr with the new storage? Or the storage itself? Or only implement this method for System with replace=True?

Collaborator:

The more I think about this, the more I think there is no use case for replace=False. Current thoughts:

  1. Replace the storage object. Return nothing. Leave the data in the tmp dir or maybe delete it.
  2. Leverage this feature in the deserialize workflow. User starts with in-memory, serializes to disk (gets Arrow storage), then deserializes and wants it converted back to memory. The user never sees or has access to the intermediate Arrow storage.

What do you think?

JensZack (author):

I agree. The replace=False option seems like an attempt at future-proofing, but we don't need it now and probably never will.
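
For discussion, a minimal sketch of the replace-only shape on the time series manager. The attribute name (self._storage) and the helpers (_make_storage, add_raw_time_series) are assumptions, not code from this PR.

def convert_storage(self, time_series_storage_type) -> None:
    """Convert the underlying storage in place; the old storage is discarded."""
    new_storage = _make_storage(time_series_storage_type)  # assumed factory helper
    for uuid in self._storage.iter_time_series_uuids():
        raw = self._storage.get_raw_time_series(uuid)
        new_storage.add_raw_time_series(uuid, raw)  # assumed counterpart method
    self._storage = new_storage  # nothing is returned; old storage is dropped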

@@ -50,25 +51,33 @@ def create_with_permanent_directory(cls, directory: Path) -> "ArrowTimeSeriesSto
def get_time_series_directory(self) -> Path:
    return self._ts_directory

def iter_time_series_uuids(self) -> Iterable[UUID]:

Collaborator:

Suggested change:
- def iter_time_series_uuids(self) -> Iterable[UUID]:
+ def iter_time_series_uuids(self) -> Generator[UUID, None, None]:

JensZack (author):

I don't think we need to enforce that the return type is a generator, unless that is how we want this method to be constructed. In the in-memory case I am just returning dict.keys(); I could create a generator from that iterable if it is better for the program's memory usage, though.
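
Both options sketched, assuming the in-memory storage keeps its arrays in a dict keyed by UUID; only one of these would actually exist on the class, and they are given different names here just for comparison.

from collections.abc import Generator, Iterable
from uuid import UUID

def iter_time_series_uuids(self) -> Iterable[UUID]:
    # Option 1: return the dict key view directly; no copy, satisfies Iterable[UUID].
    return self._arrays.keys()

def iter_time_series_uuids_as_generator(self) -> Generator[UUID, None, None]:
    # Option 2: lazily yield keys to match the suggested Generator annotation.
    yield from self._arrays.keys()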

with pa.memory_map(str(fpath), "r") as source:
    base_ts = pa.ipc.open_file(source).get_record_batch(0)
    logger.trace("Reading time series from {}", fpath)
    return base_ts[str(time_series_uuid)].to_numpy()

Collaborator:

It looks like the old code only reads length entries of data from a memory-mapped object. This new code fully materializes the array into memory. Is that correct?

JensZack (author):

The main reason I cast this to numpy is to be consistent with the return type of get_raw_time_series. This does, however, mean that the normal get_time_series now materializes numpy arrays, because I refactored that function to use get_raw_time_series. I could restore the original function so that it returns Arrow arrays (or tables?). For get_raw_time_series, the main use case is the storage conversion, so I don't think materializing the data into memory is an issue there; that is required anyway to write to the new storage.

Collaborator:

Returning a numpy array seems good. Could you pass index and length to this function as optional parameters? If they are None, read the whole file. Then the original behavior is preserved.
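
A sketch of that suggestion against the snippet above; the function name is illustrative, and the pyarrow calls mirror the ones already in the diff.

import numpy as np
import pyarrow as pa

def read_raw_time_series(fpath, time_series_uuid, index=None, length=None) -> np.ndarray:
    with pa.memory_map(str(fpath), "r") as source:
        base_ts = pa.ipc.open_file(source).get_record_batch(0)
        column = base_ts[str(time_series_uuid)]
        if index is not None and length is not None:
            # Slice the memory-mapped column before materializing it,
            # preserving the old read-only-what-you-need behavior.
            column = column.slice(index, length)
        # If index/length are None, the whole column is materialized.
        return column.to_numpy()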


@@ -54,6 +54,7 @@ class TimeSeriesStorageType(str, Enum):
class TimeSeriesData(InfraSysBaseModelWithIdentifers, abc.ABC):
    """Base class for all time series models"""

    data: Any  # TODO update to generic?

Collaborator:

Isn't this defined in the child classes?

JensZack (author):

The type annotations in the in-memory storage classes refer to TimeSeriesData objects, which don't enforce a data attribute. I think it is probably fair to move the data attribute into this base class, as I don't think it would make sense for a child class not to provide a data attribute.

Collaborator:

We can let each child class set the name and type of its data attribute. My understanding is that no current code accesses an attribute called data if the type is generically TimeSeriesData. It should only access that attribute if it knows the object is a SingleTimeSeries. Is that right?

@@ -105,6 +106,7 @@ def check_data(cls, data) -> NDArray | pint.Quantity:  # Standarize what object

if isinstance(data, pint.Quantity):
    if not isinstance(data.magnitude, np.ndarray):
        # Why cant this use pint.Quantity instead of type(data)?

Collaborator:

I think we might need to run through one of the custom constructors.

JensZack (author):

that makes sense

JensZack (author):

Actually, I don't think that makes sense. Data is typed as np.ndarray or pint.Quantity, and this branch only handles data of type pint.Quantity. I didn't see any place where we create child classes from the class of the data attribute.
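
For context, a small illustration of the branch under discussion; the values and the reconstruction line are illustrative, not the actual check_data implementation.

import numpy as np
import pint

ureg = pint.UnitRegistry()
data = ureg.Quantity(5.0, "kW")  # magnitude is a scalar, not an ndarray

if isinstance(data, pint.Quantity) and not isinstance(data.magnitude, np.ndarray):
    # The debated line: type(data) keeps whatever concrete Quantity subclass the
    # value came in with (e.g. a registry-bound or custom subclass), whereas a
    # bare pint.Quantity would rebuild with the default/application registry.
    data = type(data)(np.asarray(data.magnitude), data.units)

assert isinstance(data.magnitude, np.ndarray)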

"""Get the raw time series data for a given uuid."""

@abc.abstractmethod
def iter_time_series_uuids(self) -> Iterable[UUID]:

Collaborator:

Suggested change:
- def iter_time_series_uuids(self) -> Iterable[UUID]:
+ def iter_time_series_uuids(self) -> Generator[UUID, None, None]:

@codecov-commenter commented:

Codecov Report

Attention: Patch coverage is 83.51648% with 15 lines in your changes missing coverage. Please review.

Project coverage is 94.72%. Comparing base (60995ae) to head (27c49cc).

Files with missing lines                         Patch %   Missing
src/infrasys/in_memory_time_series_storage.py    54.16%    11
src/infrasys/arrow_storage.py                    88.46%     3
src/infrasys/time_series_manager.py              93.33%     1
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #60      +/-   ##
==========================================
- Coverage   95.30%   94.72%   -0.58%     
==========================================
  Files          33       33              
  Lines        2766     2824      +58     
==========================================
+ Hits         2636     2675      +39     
- Misses        130      149      +19     

