Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Toloka pipelines for start, validate and download #64

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

Alcray
Copy link
Collaborator

@Alcray Alcray commented Jun 3, 2024

No description provided.

@Alcray Alcray requested a review from erastorgueva-nv July 15, 2024 11:18
Copy link
Collaborator

@erastorgueva-nv erastorgueva-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your PR. Please see comments.

sdp/processors/toloka/__init__.py Outdated Show resolved Hide resolved
sdp/processors/toloka/create_pool.py Outdated Show resolved Hide resolved
# Attempt to load the audio file to check if it is corrupted
torchaudio.load(entry[self.audio_filepath_key])
entries.append(entry) # File is good, append to entries list
except Exception as e:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What kind of error do corrupted files raise? Can we make it more specific than just a general "Exception"?

Comment on lines +829 to +990
else:
os.remove(entry[self.audio_filepath_key])
print(f"Deleted corrupted file: {entry[self.audio_filepath_key]}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we maybe remove this? I would like to avoid deleting files.
I am worried about what if a user points the processor to a directory containing non-audio files, e.g. their code. Torchaudio will not be able to open them. Code files will get deleted :(

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I could add a parameter to solve this issue? The thing is that user should not have code or system files in manifest.

Maybe I should require user to provide corrupted_audio_dir? In that case we will only move the files, not delete them. The inference time was very long with corrupted files, moreover I was getting errors some times. Since this processor was written for files from toloka, probably there will be corrupted files

return [DataEntry(data=data_entry)]


class ExcelToJsonConverter(BaseProcessor):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this processor is not used in this PR, do we need it?

torchaudio.load(entry[self.audio_filepath_key])
entries.append(entry) # File is good, append to entries list
except Exception as e:
print(f"Failed to load {entry[self.audio_filepath_key]}: {e}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's change all the printing to logging using the logger here: sdp/logging.py?

sdp/processors/__init__.py Outdated Show resolved Hide resolved
self.project_name = project_name
self.project_description = project_description
self.project_instructions = project_instructions
self.save_api_key_to_config = save_api_key_to_config # Initialize the parameter
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is not good practice, can we remove this feature?

**kwargs,
):
super().__init__(**kwargs)
self.API_KEY = API_KEY or os.getenv('TOLOKA_API_KEY')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just use os.getev for all these processors? And not read from the config?

@Alcray Alcray changed the title Toloka pipelines for start, validate and download without docs Toloka pipelines for start, validate and download Nov 15, 2024
@karpnv
Copy link
Collaborator

karpnv commented Nov 28, 2024

need to add some tests

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants