Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

encodings: decode utf-8 with errors='replace' when confident #421

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

Rongronggg9
Copy link
Contributor

@Rongronggg9 Rongronggg9 commented Dec 24, 2023

"Confident" means "metadata of the document explicitly indicates that the encoding is UTF-8".

Background of the patch

When a UTF-8 feed has a few invalid characters but the rest is fine, feedparser will only parse it as iso-8859-2 (or other encodings detected by chardet, if installed), even if both the HTTP and XML headers explicitly indicate that its encoding is utf-8.

To handle it better, we should decode the feed as UTF-8 with errors='replace'.

Date: Sun, 24 Dec 2023 16:23:48 GMT
Server: Apache/2.0.59 (Win32) PHP/5.1.6
X-Powered-By: PHP/5.1.6
Cache-Control: no-store, no-cache, must-revalidate
Expires: Sun, 24 Dec 2023 16:38:48 GMT
Set-Cookie: REDACTED
P3P: REDACTED
Access-Control-Allow-Origin: *
Transfer-Encoding: chunked
Content-Type: application/rss+xml; charset=utf-8

@Rongronggg9 Rongronggg9 force-pushed the fix/encoding-confidence branch 3 times, most recently from dd2d6bf to 750ca5f Compare December 26, 2023 18:29
@Rongronggg9 Rongronggg9 marked this pull request as ready for review December 27, 2023 01:43
@butaford
Copy link

Please accept "Pull requests". Everything works as it should with him!

"Confident" means "metadata of the document explicitly indicates that
the encoding is UTF-8". This prevents feedparser from falling back to
other encodings when there are only tiny errors.
@Rongronggg9 Rongronggg9 force-pushed the fix/encoding-confidence branch from 750ca5f to 5fc7ed2 Compare December 15, 2024 15:58
@Rongronggg9
Copy link
Contributor Author

Hi @kurtmckee, could you take a look at this? I've just rebased my patch.

Nowadays, non-UTF-8 web resources are rare. If the feed declares its encoding as UTF-8, it is almost impossible to be other encodings.

The problem with the current methodology in feedparser is that iso-8859-2 is always a "catch-all" option, making any feeds with just tiny mistakes fall back to it. This behavior could mess things up in most scenarios.

UTF-8 is a self-synchronizing code. It is guaranteed that any tiny error in a UTF-8 document never messes up the whole document. Thus, it is safe to decode it with errors='replace'.

My patch aims to adhere to the encoding declaration when it is UTF-8. This should make UTF-8 feeds with tiny mistakes being parsed less painfully. Non-UTF-8 encoding declarations are not considered because their presence is probably related to misconfiguration. Most non-UTF-8 encodings are not self-synchronizing so that's another reason for the patch to consider UTF-8 only.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants