refactor: improve handling of leading punctuation removal #10761

Merged
merged 3 commits into from
Nov 18, 2024

Contributor

@zandko zandko commented Nov 15, 2024

Summary

This refactor improves the logic for trimming leading punctuation from text content. It replaces the startswith checks with a regular expression, which covers a broader range of punctuation and symbols. Additionally, .strip() is now applied to the result, removing leading and trailing whitespace for cleaner output.
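
The difference between the two approaches can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the prefix tuple, the character class, and both function names are hypothetical, chosen only to show why a regex handles a run of leading characters where a prefix check handles one.

```python
import re

# Hypothetical prefixes for illustration; the actual pre-refactor checks are not shown here.
OLD_PREFIXES = (".", ",")


def trim_with_startswith(text: str) -> str:
    """Remove at most one known leading character, then trim whitespace."""
    for prefix in OLD_PREFIXES:
        if text.startswith(prefix):
            text = text[len(prefix):]
            break
    return text.strip()


# Illustrative character class, not the pattern used in the PR.
LEADING_PUNCT = re.compile(r"^[.,;:!?]+")


def trim_with_regex(text: str) -> str:
    """Remove the whole run of leading punctuation, then trim whitespace."""
    return LEADING_PUNCT.sub("", text).strip()


print(trim_with_startswith("..Hello "))  # one dot is left behind
print(trim_with_regex("..Hello "))       # the full run of dots is removed
```

Note that str.strip() trims whitespace on both sides, so a space left after the removed punctuation is dropped as well.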

Motivation and Context:

  • Enhances flexibility for handling various punctuation cases, including multilingual and mixed-symbol text inputs.
  • Improves code readability, maintainability, and robustness in preprocessing logic.

Dependencies:
No additional dependencies were introduced with this change.

Checklist

Important

Please review the checklist below before submitting your pull request.

  • This change requires a documentation update, included: Dify Document
  • I understand that this PR may be closed in case there was no previous discussion or issues. (This doesn't apply to typos!)
  • I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
  • I've updated the documentation accordingly.
  • I ran dev/reformat(backend) and cd web && npx lint-staged(frontend) to appease the lint gods

@dosubot dosubot bot added size:S This PR changes 10-29 lines, ignoring generated files. 💪 enhancement New feature or request labels Nov 15, 2024
@crazywoola crazywoola requested a review from JohnJyong November 16, 2024 06:21
Member

@crazywoola crazywoola left a comment

LGTM

@crazywoola
Member

Please fix the lint errors

@zandko
Contributor Author

zandko commented Nov 16, 2024

Thank you for pointing that out. I have addressed all the lint errors flagged during the review and committed the fixes.

@bowenliang123
Contributor

bowenliang123 commented Nov 16, 2024

Since the logic is less readable than before, I would suggest extracting it into a common utility method for maintainability. Proper unit tests are also very important here.

To be honest, I accept the purpose of this PR, but I don't understand the regex in the change and how it satisfies the goal.

@zandko
Contributor Author

zandko commented Nov 16, 2024

Thank you for your feedback! Here's an explanation of the regex and its purpose:

  • Regex Explanation:
    • ^[\p{P}\p{S}]+: This pattern matches one or more punctuation (\p{P}) or symbol (\p{S}) characters at the start of the string. For example:
      • Input: !@#Hello
      • Match: !@#
    • In re.match, this checks if the string starts with punctuation or symbols.
    • In re.sub, it removes those leading punctuation or symbols from the string.

The purpose of this logic is to clean up the string so it doesn't begin with any unnecessary symbols or punctuation.
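
One caveat worth noting: \p{P} and \p{S} are Unicode property classes, which Python's built-in re module does not accept (the third-party regex package does). As a rough sketch of the same idea using only the standard library, the general category of each leading character can be checked with unicodedata. The function name below is hypothetical and this is not the code from the PR.

```python
import unicodedata


def strip_leading_punct_and_symbols(text: str) -> str:
    """Drop leading characters whose Unicode general category is P* or S*.

    Hypothetical helper: approximates ^[\\p{P}\\p{S}]+ without property classes.
    """
    i = 0
    # Categories starting with "P" are punctuation, "S" are symbols.
    while i < len(text) and unicodedata.category(text[i])[0] in ("P", "S"):
        i += 1
    return text[i:]


print(strip_leading_punct_and_symbols("!@#Hello"))  # Hello
print(strip_leading_punct_and_symbols("Hello!"))    # Hello! (trailing punctuation is kept)
```

Only the leading run is removed; punctuation later in the string is left alone, which matches the anchored ^ in the pattern.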

If you feel this approach works, I can extract the logic into a reusable utility method (e.g., remove_leading_symbols) to improve readability and maintainability, as you suggested. I'll also add proper unit tests to cover edge cases.

@bowenliang123
Contributor

Thanks for the explanation, SGTM. Please make sure unit tests are provided; covering at least the minimal cases is fine with me.

@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:S This PR changes 10-29 lines, ignoring generated files. labels Nov 17, 2024
@zandko
Contributor Author

zandko commented Nov 17, 2024

This commit finalizes the logic for removing leading punctuation and symbols by adopting a regex pattern built from explicit Unicode ranges. The new approach improves compatibility and avoids relying on \p{} property classes, which Python's built-in re module does not support.

Changes:

  • Updated the regex pattern to explicitly cover Unicode punctuation and symbols:
    • ^[\u2000-\u206F\u2E00-\u2E7F\u3000-\u303F!\"#$%&'()*+,\-./:;<=>?@\[\]^_`{|}~]+
    • Matches leading punctuation and symbols, including CJK-specific marks.
  • Ensured robust handling of edge cases through unit tests.

This update simplifies the implementation while maintaining the intended behavior. All related unit tests pass, ensuring correctness.
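
A minimal sketch of such a utility, together with the kind of minimal test cases asked for above, might look like this. The function name follows the remove_leading_symbols suggestion made earlier in the thread; the body and the test inputs are assumptions rather than the merged code, and the backtick in the ASCII part of the character class is assumed to belong to the pattern.

```python
import re

# Explicit ranges: General Punctuation, Supplemental Punctuation,
# CJK Symbols and Punctuation, plus ASCII punctuation.
LEADING_SYMBOLS = re.compile(
    r"^[\u2000-\u206F\u2E00-\u2E7F\u3000-\u303F!\"#$%&'()*+,\-./:;<=>?@\[\]^_`{|}~]+"
)


def remove_leading_symbols(text: str) -> str:
    """Remove the run of punctuation/symbol characters at the start of text."""
    return LEADING_SYMBOLS.sub("", text)


# Minimal cases in the spirit of the requested unit tests (inputs are illustrative).
assert remove_leading_symbols("!@#Hello") == "Hello"
assert remove_leading_symbols("\u3002\u4f60\u597d") == "\u4f60\u597d"  # leading CJK full stop
assert remove_leading_symbols("Hello!") == "Hello!"  # trailing punctuation untouched
assert remove_leading_symbols("") == ""
```

Compiling the pattern once at module level keeps the helper cheap to call repeatedly during preprocessing.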

@crazywoola crazywoola self-assigned this Nov 18, 2024
@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Nov 18, 2024
@crazywoola crazywoola merged commit 14f3d44 into langgenius:main Nov 18, 2024
6 checks passed
@yihong0618
Contributor

Why do we need to remove leading characters like { [ } and so on?
With the old logic we only removed . and which makes sense.
The new logic may cause

#11868
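
To make the concern concrete, here is a small sketch of how a pattern like the one described in this PR treats inputs that legitimately begin with a bracket or symbol. The sample inputs are illustrative assumptions and are not taken from the linked issue.

```python
import re

# Same shape of pattern as described in this PR (backtick in the ASCII set is assumed).
LEADING_SYMBOLS = re.compile(
    r"^[\u2000-\u206F\u2E00-\u2E7F\u3000-\u303F!\"#$%&'()*+,\-./:;<=>?@\[\]^_`{|}~]+"
)

samples = [
    '{"key": 1}',                   # JSON object
    "[docs](https://example.com)",  # Markdown link
    "#include <stdio.h>",           # C preprocessor line
    "@user",                        # mention
]

# Map each input to what remains after the leading run is removed.
stripped = {s: LEADING_SYMBOLS.sub("", s) for s in samples}
for original, result in stripped.items():
    print(f"{original!r} -> {result!r}")
```

In each case the opening character is meaningful, so stripping it changes the content rather than cleaning it.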

@yihong0618
Contributor

yihong0618 commented Dec 25, 2024

Why do we need to remove !@#? Can you explain that?
Thanks.

Labels
💪 enhancement New feature or request lgtm This PR has been approved by a maintainer size:M This PR changes 30-99 lines, ignoring generated files.
5 participants