Unable to Upload and Process Audio Using Multimodal Models (GPT Audio Preview, Gemini 1.5 Pro) #11567

Open
5 tasks done
ottoradiologia opened this issue Dec 11, 2024 · 2 comments
Labels
💪 enhancement New feature or request tts issues related to audio and tts

Comments

@ottoradiologia

Self Checks

  • This is only for bug reports; if you would like to ask a question, please head to Discussions.
  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
  • Please do not modify this template :) and fill in all the required fields.

Dify version

13.2

Cloud or Self Hosted

Cloud

Steps to reproduce

I’m experiencing an issue where audio files cannot be properly uploaded and processed by the multimodal models (GPT audio preview and Gemini 1.5 Pro) through the Dify platform. When I attempt to upload an audio file, it successfully generates a URL, but in subsequent follow-up messages, no output or processed content is returned. The input field appears empty, and the audio is never processed.

✔️ Expected Behavior

After uploading the audio file and providing the generated URL in a follow-up request, the multimodal models should transcribe or otherwise process the audio, returning the corresponding output (e.g., transcription text).

❌ Actual Behavior

The response is empty, and no audio processing occurs, despite the upload being successful.

Additional Information:

  • The issue persists across different audio file formats.
  • Re-attempting with different files does not resolve the problem.
  • Other non-audio inputs work as expected in the follow-up messages.

Request for Assistance:

Could you please investigate this issue and advise on whether it is a known bug, a configuration error, or if additional permissions are required to enable audio processing on the Dify platform?

@dosubot dosubot bot added tts issues related to audio and tts 🐞 bug Something isn't working labels Dec 11, 2024

dosubot bot commented Dec 11, 2024

The issue you're experiencing, where audio files are uploaded successfully but not processed, could be due to several factors. Here are some potential causes and solutions:

  1. File Format and Size: Ensure that the audio files are in a supported format. Dify supports formats like flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, and webm. Also, check that the file size does not exceed the upload limit [1][2][3].

  2. S3 Upload and URL Generation: The process involves generating a URL and uploading the file to an S3 bucket. If the URL is generated but the file is not processed, there might be an issue with the S3 upload or the subsequent steps. Ensure that the S3 bucket permissions and configurations are correct [4].

  3. Transcription Job: After uploading, a transcription job is initiated. If this job fails, it could be due to incorrect parameters or configuration issues. Check the logs for any specific error messages related to the transcription job [4].

  4. Error Messages: Review any error messages or logs generated during the process. Common errors include NoAudioUploadedError, AudioTooLargeError, and UnsupportedAudioTypeError, among others [5][6].

  5. AWS SageMaker Endpoint: If you're using AWS for processing, ensure that the SageMaker endpoint is correctly set up and accessible. The endpoint processes the audio file and returns the transcribed text [4][7].

  6. Configuration and Permissions: Verify that all necessary configurations and permissions are in place for audio processing. This includes ensuring that any required AWS credentials and permissions are correctly configured.
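
The format and size check from item 1 can be sketched as a local pre-check to run before uploading. This is only an illustration: the function name `check_audio_file` and the 15 MB limit are assumptions made for the example, not Dify's actual code or its configured upload limit. The extension list is the one given in item 1.

```python
from pathlib import Path

# Formats listed in item 1 above.
SUPPORTED_AUDIO_EXTENSIONS = {
    "flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm",
}

# Hypothetical limit for illustration only; the real upload limit
# depends on how the Dify deployment is configured.
ASSUMED_MAX_BYTES = 15 * 1024 * 1024


def check_audio_file(filename: str, size_bytes: int,
                     max_bytes: int = ASSUMED_MAX_BYTES) -> list[str]:
    """Return a list of problems found; an empty list means the file looks OK."""
    problems = []

    # Compare the lowercased extension (without the dot) to the supported set.
    ext = Path(filename).suffix.lower().lstrip(".")
    if ext not in SUPPORTED_AUDIO_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")

    # Reject empty files and files above the assumed limit.
    if size_bytes <= 0:
        problems.append("file is empty")
    elif size_bytes > max_bytes:
        problems.append(f"file too large: {size_bytes} > {max_bytes} bytes")

    return problems
```

For example, `check_audio_file("note.WAV", 2048)` returns an empty list, while a file named `note.aiff` would be reported as having an unsupported extension. A check like this only rules out the first cause; it says nothing about whether the model actually receives the audio.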

If these steps do not resolve the issue, it might be helpful to check for any updates or patches for Dify version 13.2 that address known bugs related to audio processing.


@crazywoola crazywoola added 💪 enhancement New feature or request and removed 🐞 bug Something isn't working labels Dec 12, 2024
@hjlarry
Contributor

hjlarry commented Dec 12, 2024

Hi, I don't know why gpt-audio-preview does not work, but #11497 will add support for Gemini models understanding audio files.
