perf: Remove references to external resources before Markdown parsing
Prevent Pandoc from trying to download them. Test Markdown files are updated accordingly.
clemlesne committed Sep 12, 2024
1 parent 652bf5f commit 86f200a
Showing 5 changed files with 238 additions and 228 deletions.
20 changes: 19 additions & 1 deletion app/scrape.py
@@ -926,11 +926,20 @@ def _network_used_callback(size_bytes: int) -> None:
# Extract text content
# TODO: Make it async with a wrapper
try:
# Remove "src" attributes to avoid downloading external resources
full_html_minus_resources = full_html
for attribute in ("src", "srcset"):
full_html_minus_resources = re.sub(
rf"{attribute}=\"[^\"].*?\"", # Match attribute
f'{attribute}=""', # Replace with empty string
full_html_minus_resources,
)

# Convert HTML to Markdown
full_markdown = convert_text(
format="html", # Input is HTML
sandbox=True, # Enable sandbox mode, we don't know what we are scraping
source=full_html,
source=full_html_minus_resources,
to="markdown-fenced_divs-native_divs-raw_html-bracketed_spans-native_spans-link_attributes-header_attributes-inline_code_attributes",
extra_args=[
"--embed-resources=false",
@@ -954,9 +963,18 @@ def _network_used_callback(size_bytes: int) -> None:
full_markdown,
)

# Remove empty images
full_markdown = full_markdown.replace("![]()", "")

# Remove empty links
full_markdown = full_markdown.replace("[]()", "")

# Clean up by removing double newlines
full_markdown = re.sub(r"\n\n+", "\n\n", full_markdown)

# Strip
full_markdown = full_markdown.strip()

except (
RuntimeError
) as e: # pypandoc raises a RuntimeError if Pandoc returns one
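
The two steps this hunk adds (blanking `src`/`srcset` before conversion, then cleaning the resulting Markdown) can be sketched in isolation. This is a minimal illustration that reuses the pattern and replacements from the diff; the function names `strip_resource_attributes` and `clean_markdown` are hypothetical, and the Pandoc conversion between the two steps is left out.

```python
import re


def strip_resource_attributes(html: str) -> str:
    """Blank out src/srcset values so a converter has nothing to fetch.

    Hypothetical helper; the pattern is copied from the diff above.
    """
    for attribute in ("src", "srcset"):
        html = re.sub(
            rf"{attribute}=\"[^\"].*?\"",  # attribute with a non-empty, double-quoted value
            f'{attribute}=""',  # keep the attribute, empty its value
            html,
        )
    return html


def clean_markdown(markdown: str) -> str:
    """Post-conversion cleanup, in the same order as the diff (hypothetical helper)."""
    markdown = markdown.replace("![]()", "")  # empty images first
    markdown = markdown.replace("[]()", "")  # then empty links
    markdown = re.sub(r"\n\n+", "\n\n", markdown)  # collapse runs of blank lines
    return markdown.strip()


html = '<p>Hi</p><img alt="Logo" src="https://example.com/logo.png" srcset="a.png 1x, b.png 2x">'
stripped = strip_resource_attributes(html)
cleaned = clean_markdown("![]()\n\n\nHello\n\n\n\n[]()World\n")
```

Three properties of this sketch are worth noting. Images with alt text, such as `![Google]()`, survive the cleanup, which matches the updated fixture. The empty-image replacement has to run before the empty-link one, otherwise `![]()` would be reduced to a stray `!`. The pattern only covers double-quoted values, and because it is not anchored to a word boundary it also empties attributes whose name ends in `src`, such as `data-src`.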
4 changes: 2 additions & 2 deletions tests/scrape.py
@@ -80,7 +80,7 @@ async def test_scrape_page_website(
mode="r",
) as f:
expected = await f.read()
assert page.content == expected.rstrip(), "Markdown content should match"
assert page.content == expected.strip(), "Markdown content should match"


async def test_scrape_page_links(browser: Browser) -> None:
@@ -166,7 +166,7 @@ async def test_scrape_page_paragraphs(browser: Browser) -> None:
mode="r",
) as f:
expected = await f.read()
assert page.content == expected.rstrip(), "Content should match"
assert page.content == expected.strip(), "Content should match"

# Check title
assert page.title == "Complex paragraph example", "Title should match"
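
The switch from `rstrip()` to `strip()` in these assertions presumably follows from the scraper now calling `.strip()` on its output: a fixture that starts with whitespace would no longer match under `rstrip()` alone. A minimal sketch with made-up fixture text (not taken from the repository):

```python
# Hypothetical fixture content with leading and trailing newlines
expected_file_content = "\n# Title\n\nBody\n"

# What a strip()-ed scraper result would look like for the same document
page_content = "# Title\n\nBody"

matches_with_rstrip = page_content == expected_file_content.rstrip()
matches_with_strip = page_content == expected_file_content.strip()
```

Here `matches_with_rstrip` is false because the leading newline is kept, while `matches_with_strip` is true.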
22 changes: 7 additions & 15 deletions tests/websites/google.html.md
@@ -6,7 +6,7 @@ Skip to main content[Accessibility help](https://support.google.com/websearch/an

Accessibility feedback

[![Google](./google/googlelogo_light_color_92x30dp.png)](https://www.google.com/webhp?hl=en&sa=X&ved=0ahUKEwjqwcCgg4mIAxVqcKQEHWSHCUYQPAgI "Go to Google Home")
[![Google]()](https://www.google.com/webhp?hl=en&sa=X&ved=0ahUKEwjqwcCgg4mIAxVqcKQEHWSHCUYQPAgI "Go to Google Home")

Press / to jump to the search box

@@ -98,7 +98,7 @@ About 140,000,000 results (0.21 seconds)

Ctrl+Shift+X to select

![Google](https://fonts.gstatic.com/s/i/productlogos/googleg/v6/24px.svg)
![Google]()

# Search settings

@@ -168,8 +166,6 @@ What color are bananas naturally?

4:28

![](https://fonts.gstatic.com/s/i/productlogos/youtube/v9/192px.svg)

Mashed

YouTube·
@@ -180,7 +178,7 @@ Aug 28, 2021

# The Real Difference Between Red And Yellow Bananas

![](https://www.gstatic.com/images/branding/product/1x/youtube_32dp.png)YouTube·Mashed·Aug 28, 2021
YouTube·Mashed·Aug 28, 2021

In this video

@@ -245,8 +243,6 @@ https://www.healthline.com › nutrition › green-bananas-\...

Search for: [Is it better to eat bananas green or yellow?](/search?sca_esv=d3224e15adc24c7c&q=Is+it+better+to+eat+bananas+green+or+yellow%3F&sa=X&ved=2ahUKEwjqwcCgg4mIAxVqcKQEHWSHCUYQzmd6BAgkEAY)

![](//www.gstatic.com/ui/v1/activityindicator/loading_24.gif)

Feedback

[\
@@ -408,22 +404,23 @@ YouTube · Scientific American

1. [](https://www.youtube.com/watch?v=0WCErY3OYng&t=40)


From 00:40

Why do bananas turn brown?
2. [](https://www.youtube.com/watch?v=0WCErY3OYng&t=57)


From 00:57

Melanin is part of a banana\'s defense system
3. [](https://www.youtube.com/watch?v=0WCErY3OYng&t=73)


From 01:13

Triggers of Ripening

![](https://fonts.gstatic.com/s/i/productlogos/youtube/v9/192px.svg)

Scientific American

YouTube·
@@ -434,7 +431,7 @@ Apr 13, 2018

# Why Do Bananas Change Color?

![](https://www.gstatic.com/images/branding/product/1x/youtube_32dp.png)YouTube·Scientific American·Apr 13, 2018
YouTube·Scientific American·Apr 13, 2018

In this video

@@ -451,8 +448,6 @@ In this video

Triggers of Ripening

![](https://i.ytimg.com/vi/0WCErY3OYng/mqdefault.jpg?sqp=-oaymwEFCJQBEFM&rs=AMzJL3npWGNkEMCAQ2mRSduxEreZPZH0Fw)

[\
](https://specialtyproduce.com/produce/Yellow_Bananas_919.php)

@@ -485,8 +480,6 @@ Mar 15, 2023 --- Banana chips are yellow in color because *they are made from ba

People also ask

![](//www.gstatic.com/ui/v1/activityindicator/loading_24.gif)

Feedback

People also search for
@@ -550,4 +543,3 @@ Updating location...
[Consumer information](https://support.google.com/websearch?p=fr_consumer_info&hl=en-FR&fg=1)[Report inappropriate content](https://support.google.com/legal/answer/3110420?hl=en-FR&fg=1)

Google apps
