
feat(document_extractor): integrate unstructured API for PPTX extraction #10180

Merged

Conversation

laipz8200
Member

Checklist:

Important

Please review the checklist below before submitting your pull request.

  • Please open an issue before creating a PR or link to an existing issue
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I ran dev/reformat (backend) and cd web && npx lint-staged (frontend) to appease the lint gods

Description

  • Added support for using the unstructured API for PPTX text extraction when available.
  • Falls back to existing method if API credentials are not configured.
  • Ensures flexibility and potentially enhanced performance or accuracy in text extraction.

#9995

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update, included: Dify Document
  • Improvement, including but not limited to code refactoring, performance optimization, and UI/UX improvement
  • Dependency upgrade

Testing Instructions

Please describe the tests that you ran to verify your changes and provide instructions so we can reproduce them. Please also list any relevant details of your test configuration.

  • Test A
  • Test B

@dosubot dosubot bot added size:S This PR changes 10-29 lines, ignoring generated files. 📚 feat:datasource Data sources like web, Notion, Logseq, Lark, Docs labels Nov 1, 2024
@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Nov 1, 2024
@crazywoola crazywoola merged commit 53a7cb0 into main Nov 1, 2024
9 checks passed
@crazywoola crazywoola deleted the feat/integrate-unstructured-API-for-PPTX-extraction branch November 1, 2024 15:19
AlwaysBluer pushed a commit to AlwaysBluer/dify that referenced this pull request Nov 2, 2024
idonotknow pushed a commit to AceDataCloud/Dify that referenced this pull request Nov 16, 2024
@fdb02983rhy
Contributor

Could you advise me on how to use this? I have set the env variables in docker .env, but it didn't work.
@laipz8200
Member Author

> Could you advise me on how to use this? I have set the env variables in docker .env, but it didn't work.

Thank you for the report, link to #10886

@fdb02983rhy
Contributor

fdb02983rhy commented Nov 21, 2024

1. The knowledge chunking works fine after the fix (fix10953).

2. However, it isn't functioning as expected for PPTX extraction:

api-1         |   File "/app/api/.venv/lib/python3.10/site-packages/gunicorn/workers/base_async.py", line 115, in handle_request
api-1         |     for item in respiter:
api-1         |   File "/app/api/.venv/lib/python3.10/site-packages/werkzeug/wsgi.py", line 256, in __next__
api-1         |     return self._next()
api-1         |   File "/app/api/.venv/lib/python3.10/site-packages/werkzeug/wrappers/response.py", line 32, in _iter_encoded
api-1         |     for item in iterable:
api-1         |   File "/app/api/.venv/lib/python3.10/site-packages/flask/helpers.py", line 113, in generator
api-1         |     yield from gen
api-1         |   File "/app/api/libs/helper.py", line 186, in generate
api-1         |     yield from response
api-1         |   File "/app/api/core/app/features/rate_limiting/rate_limit.py", line 115, in __next__
api-1         |     return next(self.generator)
api-1         |   File "/app/api/core/app/apps/base_app_generate_response_converter.py", line 25, in _generate_full_response
api-1         |     for chunk in cls.convert_stream_full_response(response):
api-1         |   File "/app/api/core/app/apps/advanced_chat/generate_response_converter.py", line 67, in convert_stream_full_response
api-1         |     for chunk in stream_response:
api-1         |   File "/app/api/core/app/apps/advanced_chat/generate_task_pipeline.py", line 187, in _to_stream_response
api-1         |     for stream_response in generator:
api-1         |   File "/app/api/core/app/apps/advanced_chat/generate_task_pipeline.py", line 218, in _wrapper_process_stream_response
api-1         |     for response in self._process_stream_response(tts_publisher=tts_publisher, trace_manager=trace_manager):
api-1         |   File "/app/api/core/app/apps/advanced_chat/generate_task_pipeline.py", line 319, in _process_stream_response
api-1         |     workflow_node_execution = self._handle_workflow_node_execution_failed(event)
api-1         |   File "/app/api/core/app/task_pipeline/workflow_cycle_manage.py", line 339, in _handle_workflow_node_execution_failed
api-1         |     WorkflowNodeExecution.process_data: json.dumps(event.process_data) if event.process_data else None,
api-1         |   File "/usr/local/lib/python3.10/json/__init__.py", line 231, in dumps
api-1         |     return _default_encoder.encode(obj)
api-1         |   File "/usr/local/lib/python3.10/json/encoder.py", line 199, in encode
api-1         |     chunks = self.iterencode(o, _one_shot=True)
api-1         |   File "/usr/local/lib/python3.10/json/encoder.py", line 257, in iterencode
api-1         |     return _iterencode(o, 0)
api-1         |   File "/app/api/.venv/lib/python3.10/site-packages/frozendict/__init__.py", line 32, in default
api-1         |     return BaseJsonEncoder.default(
api-1         |   File "/usr/local/lib/python3.10/json/encoder.py", line 179, in default
api-1         |     raise TypeError(f'Object of type {o.__class__.__name__} '
api-1         | TypeError: Object of type File is not JSON serializable
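The root cause in this trace is that `event.process_data` contains a `File` object, which Python's stock `json` encoder cannot serialize. The snippet below is not Dify's actual fix (that was handled in a follow-up issue); it is only a minimal, self-contained illustration of the failure mode and the generic `default=` workaround, using a hypothetical `File` stand-in.

```python
import json


class File:
    """Stand-in for the non-serializable File object in the traceback above."""

    def __init__(self, filename: str):
        self.filename = filename


def safe_default(obj):
    """Fallback serializer passed to json.dumps via `default=`.

    The stock encoder raises TypeError for unknown types (as seen above);
    this converts a File to a plain dict and anything else to its repr.
    """
    if isinstance(obj, File):
        return {"type": "file", "filename": obj.filename}
    return repr(obj)


process_data = {"attachment": File("slides.pptx")}
serialized = json.dumps(process_data, default=safe_default)
print(serialized)  # {"attachment": {"type": "file", "filename": "slides.pptx"}}
```

Without `default=safe_default`, the `json.dumps` call raises exactly the `TypeError: Object of type File is not JSON serializable` shown in the log.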

@laipz8200
Member Author

Could you please open an issue for this and share your version and DSL so we can reproduce this problem?

@fdb02983rhy
Contributor


Sure #10956
