
feat(document_extractor): integrate unstructured API for PPTX extraction #10180

Merged

Conversation

laipz8200
Member

Checklist:

Important

Please review the checklist below before submitting your pull request.

  • Please open an issue before creating a PR or link to an existing issue
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I ran dev/reformat (backend) and cd web && npx lint-staged (frontend) to appease the lint gods

Description

  • Added support for using the unstructured API for PPTX text extraction when available.
  • Falls back to existing method if API credentials are not configured.
  • Ensures flexibility and potentially enhanced performance or accuracy in text extraction.

#9995

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update, included: Dify Document
  • Improvement, including but not limited to code refactoring, performance optimization, and UI/UX improvement
  • Dependency upgrade

Testing Instructions

Please describe the tests that you ran to verify your changes and provide instructions so we can reproduce them. Please also list any relevant details of your test configuration.

  • Test A
  • Test B

@dosubot dosubot bot added size:S This PR changes 10-29 lines, ignoring generated files. 📚 feat:datasource Data sources like web, Notion, Logseq, Lark, Docs labels Nov 1, 2024
@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Nov 1, 2024
@crazywoola crazywoola merged commit 53a7cb0 into main Nov 1, 2024
9 checks passed
@crazywoola crazywoola deleted the feat/integrate-unstructured-API-for-PPTX-extraction branch November 1, 2024 15:19
AlwaysBluer pushed a commit to AlwaysBluer/dify that referenced this pull request Nov 2, 2024
idonotknow pushed a commit to AceDataCloud/Dify that referenced this pull request Nov 16, 2024
@fdb02983rhy
Contributor

Could you advise me on how to use this? I have set the env variables in docker .env, but it didn't work.
@laipz8200
Member Author

> Could you advise me on how to use this? I have set the env variables in docker .env, but it didn't work.

Thank you for the report, link to #10886

@fdb02983rhy
Contributor

fdb02983rhy commented Nov 21, 2024

1. The knowledge chunking works fine after the fix (fix10953).

2. However, it isn't functioning as expected for PPTX extraction:

api-1         |   File "/app/api/.venv/lib/python3.10/site-packages/gunicorn/workers/base_async.py", line 115, in handle_request
api-1         |     for item in respiter:
api-1         |   File "/app/api/.venv/lib/python3.10/site-packages/werkzeug/wsgi.py", line 256, in __next__
api-1         |     return self._next()
api-1         |   File "/app/api/.venv/lib/python3.10/site-packages/werkzeug/wrappers/response.py", line 32, in _iter_encoded
api-1         |     for item in iterable:
api-1         |   File "/app/api/.venv/lib/python3.10/site-packages/flask/helpers.py", line 113, in generator
api-1         |     yield from gen
api-1         |   File "/app/api/libs/helper.py", line 186, in generate
api-1         |     yield from response
api-1         |   File "/app/api/core/app/features/rate_limiting/rate_limit.py", line 115, in __next__
api-1         |     return next(self.generator)
api-1         |   File "/app/api/core/app/apps/base_app_generate_response_converter.py", line 25, in _generate_full_response
api-1         |     for chunk in cls.convert_stream_full_response(response):
api-1         |   File "/app/api/core/app/apps/advanced_chat/generate_response_converter.py", line 67, in convert_stream_full_response
api-1         |     for chunk in stream_response:
api-1         |   File "/app/api/core/app/apps/advanced_chat/generate_task_pipeline.py", line 187, in _to_stream_response
api-1         |     for stream_response in generator:
api-1         |   File "/app/api/core/app/apps/advanced_chat/generate_task_pipeline.py", line 218, in _wrapper_process_stream_response
api-1         |     for response in self._process_stream_response(tts_publisher=tts_publisher, trace_manager=trace_manager):
api-1         |   File "/app/api/core/app/apps/advanced_chat/generate_task_pipeline.py", line 319, in _process_stream_response
api-1         |     workflow_node_execution = self._handle_workflow_node_execution_failed(event)
api-1         |   File "/app/api/core/app/task_pipeline/workflow_cycle_manage.py", line 339, in _handle_workflow_node_execution_failed
api-1         |     WorkflowNodeExecution.process_data: json.dumps(event.process_data) if event.process_data else None,
api-1         |   File "/usr/local/lib/python3.10/json/__init__.py", line 231, in dumps
api-1         |     return _default_encoder.encode(obj)
api-1         |   File "/usr/local/lib/python3.10/json/encoder.py", line 199, in encode
api-1         |     chunks = self.iterencode(o, _one_shot=True)
api-1         |   File "/usr/local/lib/python3.10/json/encoder.py", line 257, in iterencode
api-1         |     return _iterencode(o, 0)
api-1         |   File "/app/api/.venv/lib/python3.10/site-packages/frozendict/__init__.py", line 32, in default
api-1         |     return BaseJsonEncoder.default(
api-1         |   File "/usr/local/lib/python3.10/json/encoder.py", line 179, in default
api-1         |     raise TypeError(f'Object of type {o.__class__.__name__} '
api-1         | TypeError: Object of type File is not JSON serializable
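The root cause in this trace is that `event.process_data` contains a `File` object, which Python's stock `json` encoder cannot serialize. The snippet below is not Dify's actual fix (that was handled in a follow-up issue); it is only a minimal, self-contained illustration of the failure mode and the generic `default=` workaround, using a hypothetical `File` stand-in.

```python
import json


class File:
    """Stand-in for the non-serializable File object in the traceback above."""

    def __init__(self, filename: str):
        self.filename = filename


def safe_default(obj):
    """Fallback serializer passed to json.dumps via `default=`.

    The stock encoder raises TypeError for unknown types (as seen above);
    this converts a File to a plain dict and anything else to its repr.
    """
    if isinstance(obj, File):
        return {"type": "file", "filename": obj.filename}
    return repr(obj)


process_data = {"attachment": File("slides.pptx")}
serialized = json.dumps(process_data, default=safe_default)
print(serialized)  # {"attachment": {"type": "file", "filename": "slides.pptx"}}
```

Without `default=safe_default`, the `json.dumps` call raises exactly the `TypeError: Object of type File is not JSON serializable` shown in the log.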

@laipz8200
Member Author

Could you please open an issue for this and share your version and DSL so we can reproduce this problem?

@fdb02983rhy
Contributor


Sure #10956
