-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Toloka pipelines for start, validate and download #64
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your PR. Please see comments.
# Attempt to load the audio file to check if it is corrupted | ||
torchaudio.load(entry[self.audio_filepath_key]) | ||
entries.append(entry) # File is good, append to entries list | ||
except Exception as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What kind of error do corrupted files raise? Can we make it more specific than just a general "Exception"?
else: | ||
os.remove(entry[self.audio_filepath_key]) | ||
print(f"Deleted corrupted file: {entry[self.audio_filepath_key]}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we maybe remove this? I would like to avoid deleting files.
I am worried about what if a user points the processor to a directory containing non-audio files, e.g. their code. Torchaudio will not be able to open them. Code files will get deleted :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I could add a parameter to solve this issue? The thing is that user should not have code or system files in manifest.
Maybe I should require user to provide corrupted_audio_dir? In that case we will only move the files, not delete them. The inference time was very long with corrupted files, moreover I was getting errors some times. Since this processor was written for files from toloka, probably there will be corrupted files
return [DataEntry(data=data_entry)] | ||
|
||
|
||
class ExcelToJsonConverter(BaseProcessor): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this processor is not used in this PR, do we need it?
torchaudio.load(entry[self.audio_filepath_key]) | ||
entries.append(entry) # File is good, append to entries list | ||
except Exception as e: | ||
print(f"Failed to load {entry[self.audio_filepath_key]}: {e}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's change all the printing to logging using the logger here: sdp/logging.py
?
self.project_name = project_name | ||
self.project_description = project_description | ||
self.project_instructions = project_instructions | ||
self.save_api_key_to_config = save_api_key_to_config # Initialize the parameter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is not good practice, can we remove this feature?
**kwargs, | ||
): | ||
super().__init__(**kwargs) | ||
self.API_KEY = API_KEY or os.getenv('TOLOKA_API_KEY') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just use os.getev for all these processors? And not read from the config?
d6cc329
to
6883d74
Compare
Signed-off-by: Alexan <[email protected]>
Signed-off-by: Alexan <[email protected]>
Signed-off-by: Alexan <[email protected]>
Signed-off-by: Alexan <[email protected]>
Signed-off-by: Alexan <[email protected]>
Signed-off-by: Alexan <[email protected]>
Signed-off-by: Alexan <[email protected]>
Signed-off-by: Alexan <[email protected]>
1203ae5
to
9e247f2
Compare
need to add some tests |
No description provided.