Unable to Upload and Process Audio Using Multimodal Models (GPT Audio Preview, Gemini 1.5 Pro) #11567

ottoradiologia · 2024-12-11T18:20:01Z

Self Checks

This is only for bug report, if you would like to ask a question, please head to Discussions.
I have searched for existing issues, including closed ones.
I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
Please do not modify this template :) and fill in all the required fields.

Dify version

13.2

Cloud or Self Hosted

Cloud

Steps to reproduce

I’m experiencing an issue where audio files cannot be properly uploaded and processed by the multimodal models (GPT audio preview and Gemini 1.5 Pro) through the Dify platform. When I attempt to upload an audio file, it successfully generates a URL, but in subsequent follow-up messages, no output or processed content is returned. The input field appears empty, and the audio is never processed.

✔️ Expected Behavior

After uploading the audio file and providing the generated URL in a follow-up request, the multimodal models should transcribe or otherwise process the audio, returning the corresponding output (e.g., transcription text).

❌ Actual Behavior

The response is empty, and no audio processing occurs, despite the upload being successful.

Additional Information:

The issue persists across different audio file formats.
Re-attempting with different files does not resolve the problem.
Other non-audio inputs work as expected in the follow-up messages.

Request for Assistance:

Could you please investigate this issue and advise on whether it is a known bug, a configuration error, or if additional permissions are required to enable audio processing on the Dify platform?

dosubot · 2024-12-11T18:22:57Z

The issue you're experiencing, where audio files are uploaded successfully but not processed, could be due to several factors. Here are some potential causes and solutions:

File Format and Size: Ensure that the audio files are in a supported format. Dify supports formats like flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, and webm. Also, check that the file size does not exceed the upload limit ^[1]^[2]^[3].
S3 Upload and URL Generation: The process involves generating a URL and uploading the file to an S3 bucket. If the URL is generated but the file is not processed, there might be an issue with the S3 upload or the subsequent steps. Ensure that the S3 bucket permissions and configurations are correct ^[4].
Transcription Job: After uploading, a transcription job is initiated. If this job fails, it could be due to incorrect parameters or configuration issues. Check the logs for any specific error messages related to the transcription job ^[4].
Error Messages: Review any error messages or logs generated during the process. Common errors include NoAudioUploadedError, AudioTooLargeError, and UnsupportedAudioTypeError, among others ^[5]^[6].
AWS SageMaker Endpoint: If you're using AWS for processing, ensure that the SageMaker endpoint is correctly set up and accessible. The endpoint processes the audio file and returns the transcribed text ^[4]^[7].
Configuration and Permissions: Verify that all necessary configurations and permissions are in place for audio processing. This includes ensuring that any required AWS credentials and permissions are correctly configured.
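
The format and size checks from the first point can be run locally before uploading. The following is a minimal sketch, not Dify's own validation code: the extension set is copied from the list above, while `check_audio_file` and its `max_size_bytes` parameter are hypothetical names, and the actual upload limit has to be supplied by the caller because it is not stated in this thread.

```python
from pathlib import Path

# Formats listed in the comment above; this set is not read from Dify itself.
SUPPORTED_AUDIO_EXTENSIONS = {
    "flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm",
}


def check_audio_file(filename: str, size_bytes: int, max_size_bytes: int) -> list[str]:
    """Return a list of problems found; an empty list means the file passed.

    max_size_bytes is an assumption supplied by the caller, since the
    real upload limit depends on the deployment.
    """
    problems = []

    # Compare the extension case-insensitively and without the leading dot.
    extension = Path(filename).suffix.lower().lstrip(".")
    if not extension:
        problems.append("file has no extension")
    elif extension not in SUPPORTED_AUDIO_EXTENSIONS:
        problems.append(f"unsupported extension: {extension}")

    if size_bytes <= 0:
        problems.append("file is empty")
    elif size_bytes > max_size_bytes:
        problems.append(f"file is {size_bytes} bytes, limit is {max_size_bytes}")

    return problems
```

If this returns an empty list for a file that still produces an empty response, the format and size are unlikely to be the cause, which points toward the model integration rather than the upload itself.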

If these steps do not resolve the issue, it might be helpful to check for any updates or patches for Dify version 13.2 that address known bugs related to audio processing.

To continue talking to Dosu, mention @dosu.


hjlarry · 2024-12-12T03:52:52Z

Hi, I don't know why gpt-audio-preview does not work, but #11497 will add support for Gemini models to understand audio files.

dosubot bot added the "tts" (issues related to audio and tts) and "🐞 bug" (Something isn't working) labels on Dec 11, 2024

crazywoola added the "💪 enhancement" (New feature or request) label and removed the "🐞 bug" (Something isn't working) label on Dec 12, 2024

hjlarry mentioned this issue Dec 15, 2024

The LLM Node cannot read file with a custom field type of "File" when using a multimodal model. #11655

Open

5 tasks
