whether to support streaming dataset? #487

Open
Mddct opened this issue Jun 27, 2024 · 1 comment

Mddct commented Jun 27, 2024

Does grain provide an iterable dataset, similar to torch.utils.data.IterableDataset? When the original index files are large in total, such as 4T, they are difficult to load directly into memory.

Contributor

jimlinntu commented Oct 4, 2024

At the moment, you can implement a custom IterDataset.

For example:

from typing import Any

import grain.python as grain


class _YourCustomDatasetIterator(grain.DatasetIterator):

  def __init__(self, filename: str):
    super().__init__()
    self._reader = "<define your reader>"
    self._offset = "<get reader's offset>"

  def __next__(self):
    record = next(self._reader)
    # Track the offset in the same attribute that get_state() returns.
    self._offset = "<get reader's offset>"
    return record

  def get_state(self) -> dict[str, Any]:
    return {"offset": self._offset}

  def set_state(self, state):
    self._offset = state["offset"]
    self._reader.Seek(self._offset)  # Seeks to the correct offset


class YourCustomIterDataset(grain.IterDataset):

  def __init__(self, filename: str):
    super().__init__()
    self._filename = filename

  def __iter__(self):
    return _YourCustomDatasetIterator(self._filename)

You should then be able to call .map or .filter on this custom IterDataset.
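
To make the get_state / set_state contract above concrete, here is a minimal sketch that runs without grain installed. It is an assumption-laden illustration, not grain's implementation: LineRecordIterator is a hypothetical stand-in for a grain.DatasetIterator subclass, the records are assumed to be newline-delimited lines in a single file, and the saved state is assumed to be a byte offset.

```python
import os
import tempfile
from typing import Any


class LineRecordIterator:
    """Hypothetical stand-in for a grain.DatasetIterator subclass."""

    def __init__(self, filename: str):
        # Binary mode keeps tell()/seek() in plain byte offsets.
        self._reader = open(filename, "rb")
        self._offset = 0

    def __iter__(self):
        return self

    def __next__(self) -> str:
        line = self._reader.readline()
        if not line:
            raise StopIteration
        # Remember where the next unread record starts.
        self._offset = self._reader.tell()
        return line.decode("utf-8").rstrip("\n")

    def get_state(self) -> dict[str, Any]:
        return {"offset": self._offset}

    def set_state(self, state: dict[str, Any]) -> None:
        self._offset = state["offset"]
        self._reader.seek(self._offset)

    def close(self) -> None:
        self._reader.close()


def demo() -> tuple[list[str], list[str]]:
    """Reads two records, saves state, then resumes in a fresh iterator."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "records.txt")
        with open(path, "wb") as f:
            f.write(b"a\nb\nc\nd\n")

        first = LineRecordIterator(path)
        head = [next(first), next(first)]
        state = first.get_state()
        first.close()

        resumed = LineRecordIterator(path)
        resumed.set_state(state)
        tail = list(resumed)
        resumed.close()
        return head, tail
```

Only one line is held in memory at a time, which is the point of streaming a source that is too large to load whole. A real reader would replace the file handle with whatever format-specific reader you use.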

If the source supports random access, you can do:

import grain.python as grain

data = [1, 2, 3]
dataset = grain.MapDataset.source(data)
dataset: grain.IterDataset = dataset.to_iter_dataset()
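
As a rough intuition for why random access makes this easy, an iterator over an indexable source only needs to remember the next index in order to resume. The sketch below is purely conceptual: IndexedIterator is a hypothetical class written for illustration and is not how grain implements to_iter_dataset().

```python
from typing import Any, Sequence


class IndexedIterator:
    """Hypothetical iterator over a random-access source; not grain's code."""

    def __init__(self, source: Sequence[Any]):
        self._source = source
        self._next_index = 0

    def __iter__(self):
        return self

    def __next__(self) -> Any:
        if self._next_index >= len(self._source):
            raise StopIteration
        record = self._source[self._next_index]
        self._next_index += 1
        return record

    def get_state(self) -> dict[str, Any]:
        # The whole checkpoint is a single integer.
        return {"next_index": self._next_index}

    def set_state(self, state: dict[str, Any]) -> None:
        self._next_index = state["next_index"]


it = IndexedIterator([1, 2, 3])
first = next(it)
state = it.get_state()

resumed = IndexedIterator([1, 2, 3])
resumed.set_state(state)
rest = list(resumed)
```

A streaming source has no such index, which is why the custom iterator has to track a reader offset instead.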

We are planning to support something similar for IterDataset as well, i.e. grain.IterDataset.source(...).

For now, you will have to implement a custom IterDataset.
