encodings: decode utf-8 with errors='replace' when confident #421

Rongronggg9 · 2023-12-24T20:29:37Z

"Confident" means "metadata of the document explicitly indicates that the encoding is UTF-8".

Background of the patch

When a UTF-8 feed contains a few invalid bytes but is otherwise fine, feedparser fails to decode it as UTF-8 and falls back to iso-8859-2 (or to whatever encoding chardet detects, if it is installed), even when both the HTTP headers and the XML declaration explicitly state that the encoding is utf-8.

To handle this better, feedparser should decode the feed as UTF-8 with errors='replace', so that only the invalid bytes are replaced and the rest of the text is kept intact.
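A minimal sketch of the difference, using a made-up payload rather than the actual feed (the sample string and the position of the stray byte are assumptions for illustration only):

```python
# A valid UTF-8 payload with one stray invalid byte (0xFF) inserted.
text = "Привет, мир"
raw = text.encode("utf-8")
# "Привет" is 6 Cyrillic letters of 2 bytes each, so byte 12 is a character boundary.
broken = raw[:12] + b"\xff" + raw[12:]

# Strict UTF-8 decoding rejects the whole document because of one bad byte.
try:
    broken.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Python's iso-8859-2 codec maps every byte value to some character,
# so it never fails, but it turns all the non-ASCII text into mojibake.
fallback = broken.decode("iso-8859-2")

# errors='replace' keeps the valid text and substitutes U+FFFD for the bad byte.
replaced = broken.decode("utf-8", errors="replace")
```

Here `replaced` still contains the original words with a single U+FFFD where the stray byte was, while `fallback` contains none of the original Cyrillic text.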

I ran into the problem in Rongronggg9/RSS-to-Telegram-Bot#391 ("On the same site, different recognition of encoding"):
- Feed URL: http://iptvin.ru/component/jcomments/?task=rss&object_id=1000707&object_group=com_content&tmpl=component
- Snapshot of the feed: iptvin.xml.gz
- Snapshot of HTTP headers:

Date: Sun, 24 Dec 2023 16:23:48 GMT
Server: Apache/2.0.59 (Win32) PHP/5.1.6
X-Powered-By: PHP/5.1.6
Cache-Control: no-store, no-cache, must-revalidate
Expires: Sun, 24 Dec 2023 16:38:48 GMT
Set-Cookie: REDACTED
P3P: REDACTED
Access-Control-Allow-Origin: *
Transfer-Encoding: chunked
Content-Type: application/rss+xml; charset=utf-8

butaford · 2024-01-23T08:24:51Z

Please accept this pull request. Everything works as it should with it!


Rongronggg9 · 2024-12-15T16:22:17Z

Hi @kurtmckee, could you take a look at this? I've just rebased my patch.

Nowadays, non-UTF-8 web resources are rare. If a feed declares its encoding as UTF-8, it is very unlikely to actually be in another encoding.

The problem with feedparser's current approach is that iso-8859-2 always acts as a "catch-all" option, so any feed with even a tiny mistake falls back to it. In most scenarios this garbles the text.

UTF-8 is a self-synchronizing code: a decoder can always find the next character boundary after an invalid byte. A tiny error in a UTF-8 document therefore damages only the affected characters and never the whole document. Thus, it is safe to decode it with errors='replace'.
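To illustrate the self-synchronizing property, here is a small sketch that deletes a single byte from the same text in UTF-8 and, for contrast, in UTF-16-LE (the sample string and the choice of UTF-16-LE as the counterexample are mine, not part of the patch):

```python
text = "日本語"

# UTF-8: each of these characters takes 3 bytes; drop the middle byte of the second one.
utf8 = bytearray(text.encode("utf-8"))
del utf8[4]
utf8_result = bytes(utf8).decode("utf-8", errors="replace")

# UTF-16-LE: each of these characters takes 2 bytes; drop the first byte of the second one.
utf16 = bytearray(text.encode("utf-16-le"))
del utf16[2]
utf16_result = bytes(utf16).decode("utf-16-le", errors="replace")

# In UTF-8 only the damaged character is lost; its neighbours decode normally.
print(ascii(utf8_result))
# In UTF-16-LE every character after the missing byte is misaligned.
print(ascii(utf16_result))
```

In the UTF-8 case the first and last characters survive around a replacement character, whereas in the UTF-16-LE case the final character is no longer recoverable because the decoder has no way to resynchronize.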

My patch aims to honor the encoding declaration when it is UTF-8, which should make UTF-8 feeds with tiny mistakes much less painful to parse. Non-UTF-8 encoding declarations are not considered, because their presence is probably the result of misconfiguration. Most non-UTF-8 encodings are also not self-synchronizing, which is another reason for the patch to handle UTF-8 only.
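The policy described above could be sketched roughly as follows. This is not the code in the patch: `decode_feed`, its parameters, and the simplified fallback order are hypothetical, and treating either declaration alone as enough to be "confident" is an assumption made for the example.

```python
import codecs
from typing import Optional, Tuple


def _declares_utf8(label: Optional[str]) -> bool:
    """Return True if an encoding label (e.g. from HTTP or XML metadata) names UTF-8."""
    if not label:
        return False
    try:
        return codecs.lookup(label.strip()).name == "utf-8"
    except LookupError:
        # Unknown or misspelled encoding label.
        return False


def decode_feed(
    data: bytes,
    http_charset: Optional[str],
    xml_encoding: Optional[str],
) -> Tuple[str, str]:
    """Hypothetical helper: decode feed bytes, returning (text, encoding used)."""
    # "Confident" here means the metadata explicitly says UTF-8 (assumed: either source).
    confident = _declares_utf8(http_charset) or _declares_utf8(xml_encoding)
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        if confident:
            # Trust the declaration and replace only the invalid bytes.
            return data.decode("utf-8", errors="replace"), "utf-8"
    # No UTF-8 declaration: keep the catch-all fallback (simplified to one step).
    return data.decode("iso-8859-2"), "iso-8859-2"
```

For example, `decode_feed(b"caf\xc3\xa9\xff", "utf-8", None)` would stay in UTF-8 with one replacement character, while the same bytes without any UTF-8 declaration would still go through the iso-8859-2 fallback.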

Rongronggg9 force-pushed the fix/encoding-confidence branch 3 times, most recently from dd2d6bf to 750ca5f Compare December 26, 2023 18:29

Rongronggg9 marked this pull request as ready for review December 27, 2023 01:43

Rongronggg9 mentioned this pull request Sep 24, 2024 in "Title Strange Characters issue when reading RSS XML files not encoded in utf-8" #478 (Closed)

encodings: decode utf-8 with errors='replace' when confident

5fc7ed2

"Confident" means "metadata of the document explicitly indicates that the encoding is UTF-8". This prevents feedparser from falling back to other encodings when there are only tiny errors.

Rongronggg9 force-pushed the fix/encoding-confidence branch from 750ca5f to 5fc7ed2 Compare December 15, 2024 15:58
