Draft: Feat/convert storage #60

Open · JensZack wants to merge 7 commits into main
Conversation

@JensZack (Collaborator) commented Dec 6, 2024

Adds a feature to convert between different time_series_manager storage types:

  • copies raw arrays from one storage to another
  • does not affect metadata
  • returns a new storage object if replace=False; otherwise the underlying storage is updated to the new type and the old storage is discarded by the time_series_manager (see the usage sketch after this list)
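
For illustration, a rough usage sketch of the feature as described above. The import paths, the enum member name, and the keyword arguments are assumptions; the exact convert_storage signature is still being discussed later in this PR.

# Rough usage sketch; import paths, the enum member, and keyword arguments
# are assumptions, not the final API from this PR.
from infrasys import System  # assumed import path
from infrasys.time_series_models import TimeSeriesStorageType  # assumed module path

system = System(name="example-system")  # illustrative constructor arguments
# ... add components and time series (stored in memory by default) ...

# Convert the underlying time series storage to Arrow files on disk,
# replacing the old storage in place.
system.convert_storage(
    time_series_storage_type=TimeSeriesStorageType.ARROW,  # assumed member name
    replace=True,
)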

index, length = metadata.get_range(start_time=start_time, length=length)
data = base_ts[metadata.variable_name][index : index + length]
data = base_ts[index : index + length] # TODO use uuid col

JensZack (author):

Will remove this TODO; it's now obsolete.

def add_time_series(self, metadata: TimeSeriesMetadata, time_series: TimeSeriesData) -> None:
    if metadata.time_series_uuid not in self._arrays:
        self._arrays[metadata.time_series_uuid] = time_series
        time_series_array = time_series.data

JensZack (author):

will update to time_series.data_array

JensZack (author):

@daniel-thom, this is where I needed to reference .data. There is no other way to access the data from TimeSeriesData without enforcing a data attribute on the base class. This method could be changed to take SingleTimeSeries as an input instead of TimeSeriesData.

JensZack (author):

But I really think it makes sense to add data as an attribute to the base class. I was experimenting with generics so that we could create SingleTimeSeries[pint.Quantity], or something like that, to type the data attribute in any child class. I don't see a case where a child class of TimeSeriesData wouldn't have a data attribute.
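
For reference, a minimal sketch of that generics idea using pydantic generic models; this is not how infrasys defines these classes today, and the names and bases are simplified for illustration.

from typing import Generic, TypeVar

import numpy as np
from pydantic import BaseModel, ConfigDict

DataT = TypeVar("DataT")

class TimeSeriesData(BaseModel, Generic[DataT]):
    """Simplified base class exposing a typed data attribute to every subclass."""
    model_config = ConfigDict(arbitrary_types_allowed=True)
    data: DataT

class SingleTimeSeries(TimeSeriesData[np.ndarray]):
    """Concrete subclass whose data is a plain numpy array."""

ts = SingleTimeSeries(data=np.arange(24, dtype=float))
assert isinstance(ts.data, np.ndarray)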

Collaborator:

You could do something similar to what is done in get_time_series and throw an error if the type is not SingleTimeSeries.

I agree that all subclasses of TimeSeriesData will have some data attribute, but they could look very different and have other supporting attributes. They could have a different number of dimensions. It could look like this Sienna type. The exact type should be in the subclass. The main point is that any method that operates on data should know what the time series type is. I'm not sure any code can do anything meaningful with TimeSeriesData.data. Do we want a contract that requires subclasses to use the exact field name data just for debugging? I suppose there is some value in that.
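
A minimal sketch of that guard pattern, reusing the simplified classes from the sketch above; the function name and exception type are illustrative, not the actual infrasys API.

import numpy as np

def get_data_array(time_series: TimeSeriesData) -> np.ndarray:
    # Raise if the concrete type is unknown, similar to what get_time_series does.
    if not isinstance(time_series, SingleTimeSeries):
        msg = f"Unsupported time series type: {type(time_series)}"
        raise NotImplementedError(msg)
    # Safe to touch .data now that the concrete type is known.
    return np.asarray(time_series.data)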

otherwise, the new storage object is returned

"""
return self._time_series_mgr.convert_storage(**kwargs, replace=replace)

JensZack (author):

Would it make sense to return a copy of time_series_mgr with the new storage? Or the storage itself? Or only implement this method for System with replace=True?

Collaborator:

The more I think about this, the more I think there is no use case for replace=False. Current thoughts:

  1. Replace the storage object. Return nothing. Leave the data in the tmp dir or maybe delete it.
  2. Leverage this feature in the deserialize workflow. User starts with in-memory, serializes to disk (gets Arrow storage), then deserializes and wants it converted back to memory. The user never sees or has access to the intermediate Arrow storage.

What do you think?

JensZack (author):

I agree. The replace=False option seems like an attempt at future-proofing, but we don't need it now and probably never will.
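
For discussion, a minimal sketch of the replace-only shape on the time series manager. The attribute name (self._storage) and the helpers (_make_storage, add_raw_time_series) are assumptions, not code from this PR.

def convert_storage(self, time_series_storage_type) -> None:
    """Convert the underlying storage in place; the old storage is discarded."""
    new_storage = _make_storage(time_series_storage_type)  # assumed factory helper
    for uuid in self._storage.iter_time_series_uuids():
        raw = self._storage.get_raw_time_series(uuid)
        new_storage.add_raw_time_series(uuid, raw)  # assumed counterpart method
    self._storage = new_storage  # nothing is returned; old storage is dropped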

@@ -50,25 +51,33 @@ def create_with_permanent_directory(cls, directory: Path) -> "ArrowTimeSeriesSto
def get_time_series_directory(self) -> Path:
    return self._ts_directory

def iter_time_series_uuids(self) -> Iterable[UUID]:

Collaborator:

Suggested change:
- def iter_time_series_uuids(self) -> Iterable[UUID]:
+ def iter_time_series_uuids(self) -> Generator[UUID, None, None]:

JensZack (author):

I don't think we need to enforce that the return type is a generator, unless that is how we want this method to be constructed. In the in-memory case I am just returning dict.keys(); I could create a generator from that iterable if it is better for the program's memory usage, though.
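
Both options sketched, assuming the in-memory storage keeps its arrays in a dict keyed by UUID; only one of these would actually exist on the class, and they are given different names here just for comparison.

from collections.abc import Generator, Iterable
from uuid import UUID

def iter_time_series_uuids(self) -> Iterable[UUID]:
    # Option 1: return the dict key view directly; no copy, satisfies Iterable[UUID].
    return self._arrays.keys()

def iter_time_series_uuids_as_generator(self) -> Generator[UUID, None, None]:
    # Option 2: lazily yield keys to match the suggested Generator annotation.
    yield from self._arrays.keys()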

with pa.memory_map(str(fpath), "r") as source:
    base_ts = pa.ipc.open_file(source).get_record_batch(0)
    logger.trace("Reading time series from {}", fpath)
    return base_ts[str(time_series_uuid)].to_numpy()

Collaborator:

It looks like the old code only reads length entries of data from a memory-mapped object. This new code fully materializes the array into memory. Is that correct?

JensZack (author):

The main reason I cast this to numpy is to be consistent with the return type of get_raw_time_series. This does, however, mean that the normal get_time_series now materializes numpy arrays, because I refactored that function to use get_raw_time_series. I could restore the original function so that it returns Arrow arrays (or tables?). For get_raw_time_series, the main use case is the storage conversion, so I don't think materializing the data into memory is an issue there; that is required anyway to write to the new storage.

Collaborator:

Returning a numpy array seems good. Could you pass index and length to this function as optional parameters? If they are None, read the whole file. Then the original behavior is preserved.
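
A sketch of that suggestion against the snippet above; the function name is illustrative, and the pyarrow calls mirror the ones already in the diff.

import numpy as np
import pyarrow as pa

def read_raw_time_series(fpath, time_series_uuid, index=None, length=None) -> np.ndarray:
    with pa.memory_map(str(fpath), "r") as source:
        base_ts = pa.ipc.open_file(source).get_record_batch(0)
        column = base_ts[str(time_series_uuid)]
        if index is not None and length is not None:
            # Slice the memory-mapped column before materializing it,
            # preserving the old read-only-what-you-need behavior.
            column = column.slice(index, length)
        # If index/length are None, the whole column is materialized.
        return column.to_numpy()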


@@ -54,6 +54,7 @@ class TimeSeriesStorageType(str, Enum):
class TimeSeriesData(InfraSysBaseModelWithIdentifers, abc.ABC):
    """Base class for all time series models"""

    data: Any  # TODO update to generic?

Collaborator:

Isn't this defined in the child classes?

JensZack (author):

The type annotations in the in-memory storage classes refer to TimeSeriesData objects, which don't enforce a data attribute. I think it is probably fair to move the data attribute into this base class, as I don't think it would make sense for a child class not to provide a data attribute.

Collaborator:

We can let each child class set the name and type of its data attribute. My understanding is that no current code accesses an attribute called data if the type is generically TimeSeriesData. It should only access that attribute if it knows the object is a SingleTimeSeries. Is that right?

@@ -105,6 +106,7 @@ def check_data(cls, data) -> NDArray | pint.Quantity:  # Standarize what object

if isinstance(data, pint.Quantity):
    if not isinstance(data.magnitude, np.ndarray):
        # Why cant this use pint.Quantity instead of type(data)?

Collaborator:

I think we might need to run through one of the custom constructors.

JensZack (author):

that makes sense

JensZack (author):

Actually, I don't think that makes sense. Data is typed as np.ndarray or pint.Quantity, and this branch only handles data of type pint.Quantity. I didn't see any place where we create child classes from the class of the data attribute.
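
For context, a small illustration of the branch under discussion; the values and the reconstruction line are illustrative, not the actual check_data implementation.

import numpy as np
import pint

ureg = pint.UnitRegistry()
data = ureg.Quantity(5.0, "kW")  # magnitude is a scalar, not an ndarray

if isinstance(data, pint.Quantity) and not isinstance(data.magnitude, np.ndarray):
    # The debated line: type(data) keeps whatever concrete Quantity subclass the
    # value came in with (e.g. a registry-bound or custom subclass), whereas a
    # bare pint.Quantity would rebuild with the default/application registry.
    data = type(data)(np.asarray(data.magnitude), data.units)

assert isinstance(data.magnitude, np.ndarray)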

"""Get the raw time series data for a given uuid."""

@abc.abstractmethod
def iter_time_series_uuids(self) -> Iterable[UUID]:

Collaborator:

Suggested change:
- def iter_time_series_uuids(self) -> Iterable[UUID]:
+ def iter_time_series_uuids(self) -> Generator[UUID, None, None]:

@codecov-commenter commented:

Codecov Report

Attention: Patch coverage is 83.51648% with 15 lines in your changes missing coverage. Please review.

Project coverage is 94.72%. Comparing base (60995ae) to head (27c49cc).

Files with missing lines                         Patch %   Missing
src/infrasys/in_memory_time_series_storage.py    54.16%    11
src/infrasys/arrow_storage.py                    88.46%     3
src/infrasys/time_series_manager.py              93.33%     1
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #60      +/-   ##
==========================================
- Coverage   95.30%   94.72%   -0.58%     
==========================================
  Files          33       33              
  Lines        2766     2824      +58     
==========================================
+ Hits         2636     2675      +39     
- Misses        130      149      +19     

