Track technology adoption and share #591
Comments
If slicing by new sites, we probably want to avoid the long tail of sites that drop in and out of our dataset depending on traffic that month but aren’t really new, just low-traffic sites. We could exclude any new sites in the largest 10m rank, and only look at new sites in the top 1m or 100k that either haven’t appeared at all before or have only appeared in the top 10m previously. |
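A rough sketch of that "new to top 1m" filter, in case it helps the discussion. The input shape (a dict of month label to `{origin: rank bucket}`), the function name `new_to_top`, and the bucket values are all assumptions made for illustration, not the actual dataset schema.

```python
from typing import Dict, Set

# Hypothetical input shape: month label ("YYYY-MM") -> {origin: rank bucket}.
# Rank buckets such as 100_000, 1_000_000 and 10_000_000 are assumed here.
History = Dict[str, Dict[str, int]]


def new_to_top(
    history: History,
    month: str,
    top: int = 1_000_000,
    long_tail: int = 10_000_000,
) -> Set[str]:
    """Origins within `top` this month that were previously unseen,
    or seen only in the long-tail bucket."""
    # ISO-style month labels sort chronologically as plain strings.
    earlier = [m for m in sorted(history) if m < month]
    result: Set[str] = set()
    for origin, rank in history[month].items():
        if rank > top:
            continue  # still long tail this month, so not of interest
        previous = [history[m][origin] for m in earlier if origin in history[m]]
        # all() over an empty list is True, so never-seen origins qualify.
        if all(r >= long_tail for r in previous):
            result.add(origin)
    return result


history = {
    "2023-01": {"a.example": 10_000_000, "b.example": 1_000_000},
    "2023-02": {
        "a.example": 1_000_000,   # was long tail only, now top 1m
        "b.example": 1_000_000,   # already top 1m last month
        "c.example": 100_000,     # never seen before
        "d.example": 10_000_000,  # new, but only in the long tail
    },
}
new_sites = new_to_top(history, "2023-02")
```

With this sample data only `a.example` and `c.example` qualify. Sites that launched a while ago but only recently gained traffic would be counted too, which matches the "new to top million" framing rather than "truly new".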
@tunetheweb I was actually hoping we could find a source for truly new sites; sites that are just hitting the web. |
Not aware of any, to be honest. We could use meta dates but they are notoriously unreliable. “New to top million” or similar is the best way I can think of to measure this. It would then also include sites that launched maybe a few months ago but are only now getting serious traffic/traction. Maybe, once we figure out the algorithm to measure this, we can become that source 😁 |
Am I oversimplifying or can we just check to see if the website had ever been in the dataset? |
@rviscomi ok, can I be really cheeky? I was hoping to “add” a bit to the dataset, so “on top”, not “within”. I think a certain search engine would know about some sites new to them? |
I think we can only assume we're able to work with the data already publicly available to us. Beyond "have we seen this URL before" we could also look at resource freshness data like the @tomvangoethem or @nrllh might also be interested in this problem from a research perspective. |
Perhaps worth forking the "new site" dimension from the technology adoption report for now. |
From what I understand, with the "new site" dimension you're mainly interested in sites that were created/developed recently? How about using Certificate Transparency logs for that? Should be feasible to determine when a site's first certificate was issued (or, given that domains expire and get reused: the last time that the site did not have a valid certificate for a certain period of time). Accessing CT logs might be a bit tricky though; depending on the number of sites to test, it might be feasible using the crt.sh or censys.io APIs. Censys also provides access to their data on BigQuery for research purposes (not sure if that would fall under "publicly available to us"?). Ingesting CT logs into the HTTP Archive dataset might also be an interesting option. Perhaps there's some other data sources that I don't know about? |
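To make the "last time the site did not have a valid certificate" idea concrete, here is a minimal sketch. It assumes the certificate validity intervals for a domain have already been fetched from some CT source; the fetching step is deliberately left out, and `current_site_start` and the 90-day `max_gap` default are hypothetical choices, not anything crt.sh or Censys define.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

# (not_before, not_after) for each certificate seen for one domain.
Validity = Tuple[datetime, datetime]


def current_site_start(
    certs: List[Validity],
    max_gap: timedelta = timedelta(days=90),  # assumed threshold
) -> Optional[datetime]:
    """Start of the most recent run of certificate coverage, where a run is
    broken by any uncovered period longer than `max_gap`."""
    if not certs:
        return None
    ordered = sorted(certs)
    start, covered_until = ordered[0]
    for not_before, not_after in ordered[1:]:
        # A long uncovered period suggests the domain lapsed and was reused.
        if not_before - covered_until > max_gap:
            start = not_before
        covered_until = max(covered_until, not_after)
    return start


certs = [
    (datetime(2015, 1, 1), datetime(2016, 1, 1)),
    (datetime(2015, 12, 1), datetime(2016, 12, 1)),
    # several years without a valid certificate
    (datetime(2020, 6, 1), datetime(2020, 9, 1)),
    (datetime(2020, 8, 15), datetime(2020, 11, 15)),
]
site_start = current_site_start(certs)
```

For the sample intervals the result is 1 June 2020, the first certificate after the multi-year gap. A site that was served over plain HTTP for years before getting its first certificate would look newer than it is, so this is a proxy at best.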
Add a new report that tracks the adoption and share of detected technologies.
Reports currently fall into timeseries and histograms, so we may need a new report template that handles more custom ways to explore and visualize this data.
The primary use case for this feature is to track CMS adoption, but it would be good to build this in a way that supports any given technology category and users can filter it down however they want.
Similar to the CWV Technology Report, it could be useful to apply dimensions to the stats, like ranking and country. @jdevalk also suggested slicing by "new" sites.
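As a starting point for discussion, one possible reading of "adoption and share" is sketched below: adoption as the number of sites on which a technology is detected, and share as that count divided by all sites using any technology in the same category. Both definitions, the `Detection` record, and the `max_rank` filter are assumptions for illustration; the issue does not pin them down.

```python
from typing import Dict, Iterable, NamedTuple, Optional, Set


class Detection(NamedTuple):
    """Hypothetical record: one detected technology on one site."""
    origin: str
    category: str
    technology: str
    rank: int


def adoption_and_share(
    detections: Iterable[Detection],
    category: str,
    max_rank: Optional[int] = None,
) -> Dict[str, Dict[str, float]]:
    """Per-technology adoption (site count) and share within one category,
    optionally restricted to sites at or above a rank cutoff."""
    sites_by_tech: Dict[str, Set[str]] = {}
    for d in detections:
        if d.category != category:
            continue
        if max_rank is not None and d.rank > max_rank:
            continue
        sites_by_tech.setdefault(d.technology, set()).add(d.origin)
    # Denominator: sites using at least one technology in this category.
    all_sites: Set[str] = set().union(*sites_by_tech.values())
    return {
        tech: {"adoption": len(sites), "share": len(sites) / len(all_sites)}
        for tech, sites in sites_by_tech.items()
    }


detections = [
    Detection("a.example", "CMS", "WordPress", 1_000),
    Detection("b.example", "CMS", "WordPress", 1_000_000),
    Detection("c.example", "CMS", "Drupal", 1_000),
    Detection("a.example", "Analytics", "SomeAnalytics", 1_000),
]
cms_all = adoption_and_share(detections, "CMS")
cms_top = adoption_and_share(detections, "CMS", max_rank=1_000)
```

Other dimensions such as country, or the "new site" slice, could be added as further filters in the same way as `max_rank`.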