-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft: Feat/convert storage #60
base: main
Are you sure you want to change the base?
Conversation
index, length = metadata.get_range(start_time=start_time, length=length) | ||
data = base_ts[metadata.variable_name][index : index + length] | ||
data = base_ts[index : index + length] # TODO use uuid col |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will remove this TODO, it's now obsolete
def add_time_series(self, metadata: TimeSeriesMetadata, time_series: TimeSeriesData) -> None: | ||
if metadata.time_series_uuid not in self._arrays: | ||
self._arrays[metadata.time_series_uuid] = time_series | ||
time_series_array = time_series.data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will update to time_series.data_array
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@daniel-thom, this is where I needed to reference .data. There is no other way to access the data from TimeSeriesData without enforcing data on the base class. This method could be changed to take SingleTimeSeries as an input instead of TimeSeriesData
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I really think it makes sense to add data as an attribute to the base class, I was messing with generics so that we could create SingleTimeSeries[pint.Quantity] or something like that to type the data attribute in any child class. I don't think I understand the case where a child class of TimeSeriesData wouldn't have a data attribute
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could do something similar to what is done in get_time_series
and throw an error if the type is not SingleTimeSeries.
I agree that all subclasses of TimeSeriesData will have some data
attribute. But they could look very different and have other supporting attributes. They could have a different number of dimensions. It could look like this Sienna type. The exact type should be in the subclass. The main point is that any method that operates on data should know what the time series type is. I'm not sure that any code can do anything meaningful with TimeSeriesData.data
. Do we want to do something that requires subclasses to have the exact field name data
for debugging. I suppose there is some value in that.
otherwise, the new storage object is returned | ||
|
||
""" | ||
return self._time_series_mgr.convert_storage(**kwargs, replace=replace) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it makes sense to return a copy of time_series_mgr with new storage? or the storage itself? or only implement this method for System with replace=True?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The more I think about this the more I think that there is no use case for replace = False. Current thoughts:
- Replace the storage object. Return nothing. Leave the data in the tmp dir or maybe delete it.
- Leverage this feature in the deserialize workflow. User starts with in-memory, serializes to disk (gets Arrow storage), then deserializes and wants it converted back to memory. The user never sees or has access to the intermediate Arrow storage.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. The replace=False option seems like it's trying to be future-proof, but it also seems like we really don't, and probably won't ever need it.
@@ -50,25 +51,33 @@ def create_with_permanent_directory(cls, directory: Path) -> "ArrowTimeSeriesSto | |||
def get_time_series_directory(self) -> Path: | |||
return self._ts_directory | |||
|
|||
def iter_time_series_uuids(self) -> Iterable[UUID]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def iter_time_series_uuids(self) -> Iterable[UUID]: | |
def iter_time_series_uuids(self) -> Generator[UUID, None, None]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to enforce the return type being a generator, unless that is how we want this method to be constructed. In the in_memory case I am just returning the dict.keys(), I could create a generator from this iterable if that is better for the memory usage of the program though.
with pa.memory_map(str(fpath), "r") as source: | ||
base_ts = pa.ipc.open_file(source).get_record_batch(0) | ||
logger.trace("Reading time series from {}", fpath) | ||
return base_ts[str(time_series_uuid)].to_numpy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the old code only reads length
entries of data from a memory-mapped object. This new code fully materializes the array into memory. Is that correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main reason I cast this to numpy is to be consistent in the return type of get_raw_time_series. This does however have the implication that the normal get_time_series is materializing numpy arrays because I refactored that function to use get_raw_timeseries. I could return the original function to return arrow arrays (or tables?). For the get_raw_data, the main use case is for the storage conversion, so I don't think it's an issue to materialize the data into memory as that is required to write to the new storage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returning a numpy array seems good. Could you pass index and length to this function as optional parameters? If None, read the whole the file. Then the original behavior is preserved.
otherwise, the new storage object is returned | ||
|
||
""" | ||
return self._time_series_mgr.convert_storage(**kwargs, replace=replace) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The more I think about this the more I think that there is no use case for replace = False. Current thoughts:
- Replace the storage object. Return nothing. Leave the data in the tmp dir or maybe delete it.
- Leverage this feature in the deserialize workflow. User starts with in-memory, serializes to disk (gets Arrow storage), then deserializes and wants it converted back to memory. The user never sees or has access to the intermediate Arrow storage.
What do you think?
@@ -54,6 +54,7 @@ class TimeSeriesStorageType(str, Enum): | |||
class TimeSeriesData(InfraSysBaseModelWithIdentifers, abc.ABC): | |||
"""Base class for all time series models""" | |||
|
|||
data: Any # TODO update to generic? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this defined in the child classes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The typing in the memory classes refers to TimeSeriesData objects, which don't enforce a data attribute. I think it is probably fair to move the data attribute into this base class, as I don't think it would make sense to have a child class not provide a data attribute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can let the child class set the name and type of their data attribute. My understanding is that no current code accesses an attribute called data
if the type is generically TimeSeriesData
. It should only be accessing that attribute if it knows that it is of type SingleTimeSeries
. Is that right?
@@ -105,6 +106,7 @@ def check_data(cls, data) -> NDArray | pint.Quantity: # Standarize what object | |||
|
|||
if isinstance(data, pint.Quantity): | |||
if not isinstance(data.magnitude, np.ndarray): | |||
# Why cant this use pint.Quantity instead of type(data)? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we might need to run through one of the custom constructors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that makes sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I don't think that makes sense. Data is typed as np.array or pint.Quantity, and this branch is only considering data of type pint.Quantity, I don't think I saw any place where we are creating child classes from the data attribute class.
"""Get the raw time series data for a given uuid.""" | ||
|
||
@abc.abstractmethod | ||
def iter_time_series_uuids(self) -> Iterable[UUID]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def iter_time_series_uuids(self) -> Iterable[UUID]: | |
def iter_time_series_uuids(self) -> Generator[UUID, None, None]: |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #60 +/- ##
==========================================
- Coverage 95.30% 94.72% -0.58%
==========================================
Files 33 33
Lines 2766 2824 +58
==========================================
+ Hits 2636 2675 +39
- Misses 130 149 +19 ☔ View full report in Codecov by Sentry. |
adding feature to convert between different time_series_manager storage types.