Metadata fetching fails on Youtube #5208
You should fill in the issue template. Anyway, I can reproduce the problem. It happens because of #4957, which fetches only a limited amount of data for previews in order to reduce CPU usage. The problem is that YouTube in particular includes a huge amount of JavaScript in the HTML file, so the OpenGraph tags appear much later than 64 kB. I was able to fetch YouTube metadata with a higher limit. cc @phiresky
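For context, the optimization works roughly like this (only a sketch, not the actual collect_bytes_until_limit implementation; the helper name and error handling are simplified assumptions, and reqwest's bytes_stream needs the "stream" feature):

```rust
use futures_util::StreamExt;

// Sketch: stream the response body and stop collecting once `limit` bytes
// have been gathered, even if the server would send several MB more.
async fn collect_bytes_until_limit_sketch(
  response: reqwest::Response,
  limit: usize,
) -> Result<Vec<u8>, reqwest::Error> {
  let mut bytes = Vec::with_capacity(limit);
  let mut stream = response.bytes_stream();
  while let Some(chunk) = stream.next().await {
    let chunk = chunk?;
    // Only take as many bytes as are still needed to reach the limit.
    let remaining = limit - bytes.len();
    bytes.extend_from_slice(&chunk[..chunk.len().min(remaining)]);
    if bytes.len() >= limit {
      break;
    }
  }
  Ok(bytes)
}
```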
I don't think that makes the optimization entirely useless, as it still avoids downloading multiple MB.
I'd be fine with increasing that, although it's frustrating that YouTube decides to insert a megabyte of JavaScript before the usable OpenGraph tags.
Checking a random YouTube video, the og:* tags are at around 500 kB. It's ridiculous how much they insert before them.
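If anyone wants to check other videos, a throwaway snippet like this works (hypothetical, assumes tokio and reqwest as dependencies; it just reports the byte offset of the first og: meta tag):

```rust
// Usage: cargo run -- "https://www.youtube.com/watch?v=<video-id>"
#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
  let url = std::env::args().nth(1).expect("pass a URL as the first argument");
  let body = reqwest::get(url).await?.text().await?;
  match body.find("property=\"og:") {
    // Report where the first OpenGraph tag starts.
    Some(offset) => println!("first og: tag at byte {offset} (~{} kB)", offset / 1024),
    None => println!("no og: tags found in the fetched body"),
  }
  Ok(())
}
```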
Possible patch:

diff --git a/crates/api_common/src/request.rs b/crates/api_common/src/request.rs
index 36010f760..f2c2aeecc 100644
--- a/crates/api_common/src/request.rs
+++ b/crates/api_common/src/request.rs
@@ -32,10 +32,11 @@ use reqwest::{
};
use reqwest_middleware::ClientWithMiddleware;
use serde::{Deserialize, Serialize};
-use tracing::info;
+use tracing::{info, warn};
use url::Url;
use urlencoding::encode;
use webpage::HTML;
+use std::time::Instant;
pub fn client_builder(settings: &Settings) -> ClientBuilder {
let user_agent = format!("Lemmy/{VERSION}; +{}", settings.get_protocol_and_hostname());
@@ -51,9 +52,8 @@ pub fn client_builder(settings: &Settings) -> ClientBuilder {
#[tracing::instrument(skip_all)]
pub async fn fetch_link_metadata(url: &Url, context: &LemmyContext) -> LemmyResult<LinkMetadata> {
info!("Fetching site metadata for url: {}", url);
- // We only fetch the first 64kB of data in order to not waste bandwidth especially for large
- // binary files
- let bytes_to_fetch = 64 * 1024;
+ // We fetch the first 1MB of data to ensure we get all metadata
+ let bytes_to_fetch = 1024 * 1024;
let response = context
.client()
.get(url.as_str())
@@ -91,9 +91,24 @@ pub async fn fetch_link_metadata(url: &Url, context: &LemmyContext) -> LemmyResult<LinkMetadata> {
// only take first bytes regardless of how many bytes the server returns
let html_bytes = collect_bytes_until_limit(response, bytes_to_fetch).await?;
- extract_opengraph_data(&html_bytes, url)
+
+ // Track parsing time
+ let parse_start = Instant::now();
+ let result = extract_opengraph_data(&html_bytes, url)
.map_err(|e| info!("{e}"))
- .unwrap_or_default()
+ .unwrap_or_default();
+ let parse_duration = parse_start.elapsed();
+
+ // Log warning if parsing took longer than 100ms
+ if parse_duration.as_millis() > 100 {
+ warn!(
+ "HTML parsing for {} took {}ms, which exceeds the 100ms threshold",
+ url,
+ parse_duration.as_millis()
+ );
+ }
+
+ result
}
};
Ok(LinkMetadata {
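A small regression test along these lines might also be worth adding (just a sketch; the call matches how extract_opengraph_data is used above, but the `title` field on the returned data is a guess and may need adjusting):

```rust
#[test]
fn opengraph_tags_after_large_js_prefix() {
  // Simulate a YouTube-style page that buries its og: tags behind roughly
  // 500 kB of inline JavaScript filler.
  let filler = "<script>/* filler */</script>".repeat(20_000);
  let html = format!(
    "<html><head>{filler}<meta property=\"og:title\" content=\"Example\"></head><body></body></html>"
  );
  let url = Url::parse("https://example.com/watch?v=abc").unwrap();
  let data = extract_opengraph_data(html.as_bytes(), &url).expect("parsing should succeed");
  // NOTE: the field name `title` is an assumption about OpenGraphData.
  assert_eq!(data.title.as_deref(), Some("Example"));
}
```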
If I'm reading the Mastodon source correctly, there is also a 1 MiB limit there for fetching HTML for OpenGraph metadata. Seems like a reasonable limit.
@phiresky feel free to do a PR w/ that patch when you get a chance.
Recently, YouTube URLs are not autofilling post titles anymore. It would be great if you could bring back this feature.