refactor: improve handling of leading punctuation removal #10761

zandko · 2024-11-15T19:27:02Z

Summary

This refactor improves the logic for trimming leading punctuation from text content. It replaces the startswith checks with a regular expression, which supports a broader range of punctuation and symbols. It also calls .strip() on the result, which removes leading and trailing whitespace for cleaner output.

Motivation and Context:

Enhances flexibility for handling various punctuation cases, including multilingual and mixed-symbol text inputs.
Improves code readability, maintainability, and robustness in preprocessing logic.

Dependencies:
No additional dependencies were introduced with this change.

Checklist

Important

Please review the checklist below before submitting your pull request.

This change requires a documentation update, included: Dify Document
I understand that this PR may be closed in case there was no previous discussion or issues. (This doesn't apply to typos!)
I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
I've updated the documentation accordingly.
I ran dev/reformat(backend) and cd web && npx lint-staged(frontend) to appease the lint gods

crazywoola

LGTM

crazywoola · 2024-11-16T06:22:37Z

Please fix the lint errors

zandko · 2024-11-16T07:24:18Z

> Please fix the lint errors

Thank you for pointing that out. I have addressed all the lint errors flagged during the review and committed the fixes.

bowenliang123 · 2024-11-16T13:35:16Z

The logic is not as readable as before, so for maintainability I would suggest extracting it into a common util method. Proper unit tests are also very important for this.

To be honest, I accept the purpose of this PR, but I don't understand the regex in the change and how it satisfies the goal.

zandko · 2024-11-16T13:52:58Z

> The logic is not as readable as before, so for maintainability I would suggest extracting it into a common util method. Proper unit tests are also very important for this.
>
> To be honest, I accept the purpose of this PR, but I don't understand the regex in the change and how it satisfies the goal.

Thank you for your feedback! Here's an explanation of the regex and its purpose:

Regex Explanation:
- ^[\p{P}\p{S}]+: This pattern matches one or more punctuation (\p{P}) or symbol (\p{S}) characters at the start of the string. For example:
  - Input: !@#Hello
  - Match: !@#
- In re.match, this checks if the string starts with punctuation or symbols.
- In re.sub, it removes those leading punctuation or symbols from the string.

The purpose of this logic is to clean up the string so it doesn't begin with any unnecessary symbols or punctuation.

If you feel this approach works, I can extract the logic into a reusable utility method (e.g., remove_leading_symbols) to improve readability and maintainability, as you suggested. I'll also add proper unit tests to cover edge cases.
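
One practical caveat for the explanation above: Python's built-in re module does not accept \p{P} or \p{S} (the third-party regex package does). As a minimal sketch of the same idea using only the standard library, the Unicode general category of each leading character can be checked with unicodedata. The function name strip_leading_punct_and_symbols is hypothetical and is not taken from the PR.

```python
import unicodedata


def strip_leading_punct_and_symbols(text: str) -> str:
    # Hypothetical helper: approximates "leading punctuation or symbols"
    # by checking the first letter of each character's general category
    # (P* = punctuation, S* = symbol).
    index = 0
    while index < len(text) and unicodedata.category(text[index])[0] in ("P", "S"):
        index += 1
    return text[index:]


print(strip_leading_punct_and_symbols("!@#Hello"))  # prints: Hello
```

This walks the string once and stops at the first character that is neither punctuation nor a symbol, so trailing punctuation such as the "!" in "Hello!" is left alone.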

bowenliang123 · 2024-11-16T15:38:30Z

Thanks for the explanation, SGTM. Please make sure unit tests are provided; covering at least the minimum cases is fine with me.

zandko · 2024-11-17T07:48:55Z

> Thanks for the explanation, SGTM. Please make sure unit tests are provided; covering at least the minimum cases is fine with me.

This commit finalizes the logic for removing leading punctuation and symbols by adopting a Unicode range regex pattern. The new approach improves compatibility and avoids reliance on advanced regex features such as \p{}.

Changes:

- Updated the regex pattern to explicitly cover Unicode punctuation and symbols:
  ^[\u2000-\u206F\u2E00-\u2E7F\u3000-\u303F!\"#$%&'()*+,\-./:;<=>?@\[\]^_{|}~]+
  It matches leading punctuation and symbols, including CJK-specific marks.
- Ensured robust handling of edge cases through unit tests.

This update simplifies the implementation while maintaining the intended behavior. All related unit tests pass, ensuring correctness.
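
A minimal sketch of how such a range-based pattern could be wrapped in the remove_leading_symbols utility proposed earlier in this thread. The function body is an assumption built from the character ranges listed in the comment above, not a copy of the merged code.

```python
import re

# Character ranges as listed in the comment above: General Punctuation
# (U+2000-206F), Supplemental Punctuation (U+2E00-2E7F), CJK Symbols and
# Punctuation (U+3000-303F), plus ASCII punctuation.
LEADING_SYMBOLS = re.compile(
    r"^[\u2000-\u206F\u2E00-\u2E7F\u3000-\u303F!\"#$%&'()*+,\-./:;<=>?@\[\]^_{|}~]+"
)


def remove_leading_symbols(text: str) -> str:
    # Remove one or more leading characters that fall in the class above.
    return LEADING_SYMBOLS.sub("", text)


print(remove_leading_symbols("!@#Hello"))  # prints: Hello
```

Because the pattern is anchored with ^, only the leading run is removed; punctuation later in the string is untouched.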

api/tests/unit_tests/utils/test_text_processing.py

yihong0618 · 2024-12-25T08:06:13Z

Why do we need to remove leading characters like {, [, } and so on?
The old logic only removed . and 。, which makes sense.
The new logic may cause #11868.
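
To make the concern concrete, here is a rough comparison of the two behaviours. The old_trim function is a guess at the previous startswith-based logic, reconstructed from the descriptions in this thread, and new_trim reuses the range pattern from the Nov 17 comment. Neither is copied from the repository, and the sample inputs are made up.

```python
import re

NEW_PATTERN = re.compile(
    r"^[\u2000-\u206F\u2E00-\u2E7F\u3000-\u303F!\"#$%&'()*+,\-./:;<=>?@\[\]^_{|}~]+"
)


def old_trim(text: str) -> str:
    # Assumed old behaviour: drop a single leading "." or "。" only.
    if text.startswith(".") or text.startswith("。"):
        return text[1:].strip()
    return text


def new_trim(text: str) -> str:
    # Assumed new behaviour: drop the whole leading run of symbols.
    return NEW_PATTERN.sub("", text).strip()


for sample in ['{"key": 1}', "[docs](https://example.com)", "。正文"]:
    print(repr(old_trim(sample)), "->", repr(new_trim(sample)))
```

Under these assumptions, content that legitimately starts with a bracket or quote (JSON, Markdown links) loses its opening characters with the new pattern, while the old check leaves it intact.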

yihong0618 · 2024-12-25T08:08:32Z

> > The logic is not as readable as before, so for maintainability I would suggest extracting it into a common util method. Proper unit tests are also very important for this.
> > To be honest, I accept the purpose of this PR, but I don't understand the regex in the change and how it satisfies the goal.
>
> Thank you for your feedback! Here's an explanation of the regex and its purpose:
>
> Regex Explanation:
> - ^[\p{P}\p{S}]+: This pattern matches one or more punctuation (\p{P}) or symbol (\p{S}) characters at the start of the string. For example:
>   - Input: !@#Hello
>   - Match: !@#
> - In re.match, this checks if the string starts with punctuation or symbols.
> - In re.sub, it removes those leading punctuation or symbols from the string.
>
> The purpose of this logic is to clean up the string so it doesn't begin with any unnecessary symbols or punctuation.
>
> If you feel this approach works, I can extract the logic into a reusable utility method (e.g., remove_leading_symbols) to improve readability and maintainability, as you suggested. I'll also add proper unit tests to cover edge cases.

Why do we need to remove !@#? Can you explain that?
Thanks.

refactor: improve handling of leading punctuation removal

4318d18

dosubot bot added the size:S and 💪 enhancement labels Nov 15, 2024

crazywoola requested a review from JohnJyong November 16, 2024 06:21

crazywoola reviewed Nov 16, 2024

fix(lint): resolve lint issues in punctuation trimming refactor

e3061f1

refactor: finalize leading symbol removal using Unicode range

6bc1c37

dosubot bot added the size:M label and removed the size:S label Nov 17, 2024

crazywoola self-assigned this Nov 18, 2024

JohnJyong reviewed Nov 18, 2024

api/tests/unit_tests/utils/test_text_processing.py (resolved)

crazywoola approved these changes Nov 18, 2024

dosubot bot added the lgtm label Nov 18, 2024

crazywoola merged commit 14f3d44 into langgenius:main Nov 18, 2024
6 checks passed
