Add Link Checker #61

Open
kaijli opened this issue Dec 6, 2024 · 3 comments
Labels: enhancement (New feature or request)

Comments

@kaijli
Contributor

kaijli commented Dec 6, 2024

Ticket for copying the Lychee link checker setup from nmdc-schema and implementing it in this repo.

kaijli self-assigned this Dec 6, 2024
@eecavanna
Collaborator

Posting for reference (no need to reply).

I learned today that the command I currently use to build the Runtime docs, mkdocs build, does some link checking. Here is an example of its output from a recent GitHub Actions workflow run:

WARNING -  Doc file 'howto-guides/claim-and-run-jobs.md' contains a link 'guide-create-triggers.md', but the target 'howto-guides/guide-create-triggers.md' is not found among documentation files.
WARNING -  Doc file 'howto-guides/jobs/gold-translation-etl.md' contains a link '../tutorials/tutorial-metadata-in.md', but the target 'howto-guides/tutorials/tutorial-metadata-in.md' is not found among documentation files.

I think it only checks for (and reports) broken links between the Markdown documents that make up the website; I don't think it checks links that point to other websites. For reference, Lychee does check those.
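
For anyone who wants to collect these warnings programmatically (for example, in a CI step), here is a minimal Python sketch that extracts the broken links from build output. The regular expression is an assumption based only on the two warning lines quoted above; the wording may differ between MkDocs versions, and `find_broken_links` is a hypothetical helper, not part of MkDocs.

```python
import re

# Pattern for the broken-link warnings quoted above. The exact wording is an
# assumption based on that sample and may differ between MkDocs versions.
WARNING_RE = re.compile(
    r"WARNING\s+-\s+Doc file '(?P<source>[^']+)' contains a link "
    r"'(?P<link>[^']+)', but the target '(?P<target>[^']+)' is not found"
)


def find_broken_links(build_output: str) -> list[dict[str, str]]:
    """Return one dict (source, link, target) per broken-link warning."""
    return [m.groupdict() for m in WARNING_RE.finditer(build_output)]


# The two warning lines from the workflow run quoted above.
sample_output = """\
WARNING -  Doc file 'howto-guides/claim-and-run-jobs.md' contains a link 'guide-create-triggers.md', but the target 'howto-guides/guide-create-triggers.md' is not found among documentation files.
WARNING -  Doc file 'howto-guides/jobs/gold-translation-etl.md' contains a link '../tutorials/tutorial-metadata-in.md', but the target 'howto-guides/tutorials/tutorial-metadata-in.md' is not found among documentation files.
"""

broken = find_broken_links(sample_output)
for entry in broken:
    print(f"{entry['source']}: {entry['link']}")
```

If the goal is simply to fail the build on such warnings, `mkdocs build --strict` may be enough on its own, since strict mode aborts the build on warnings.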

@ssarrafan
Contributor

Adding this issue to the next sprint to try to meet the January deadline. If you'd prefer to have it in a later sprint, let me know.

@eecavanna
Collaborator

eecavanna commented Dec 19, 2024

FYI: I learned today that lychee's link checking is not recursive. That hasn't affected our use of lychee in our repos, since we always scan static files on a filesystem that lychee has access to, as opposed to scanning a live website. It did affect my attempt to use lychee to scan a live website today, though. Although (in my mind) this ticket is not about scanning a live website, I wanted to document this here for reference.
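
To illustrate why recursion matters for a live site, here is a rough Python sketch of the two behaviours. It is not how lychee or any other tool is actually implemented; `check_links`, `fake_fetch`, and the two-page `FAKE_SITE` are all hypothetical and run entirely in memory. A non-recursive check only tests the links found on the starting page, while a recursive one also opens same-host pages and tests the links it finds there.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def check_links(start_url, fetch, recursive):
    """Return a sorted list of broken URLs found starting from start_url.

    fetch(url) is supplied by the caller and returns the page HTML, or None
    if the URL is broken. With recursive=False, only links on the start page
    are tested; with recursive=True, same-host pages are crawled as well.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = [start_url]
    broken = set()
    while queue:
        page_url = queue.pop(0)
        html = fetch(page_url)
        if html is None:
            broken.add(page_url)
            continue
        # Only look for further links on the start page, or (when recursive)
        # on other pages served by the same host.
        same_host = urlparse(page_url).netloc == host
        if page_url != start_url and not (recursive and same_host):
            continue
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.hrefs:
            target = urljoin(page_url, href)
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return sorted(broken)


# Hypothetical site: the home page links to a guide, and the guide links to
# a page that does not exist.
FAKE_SITE = {
    "https://example.com/": '<a href="guide.html">Guide</a>',
    "https://example.com/guide.html": '<a href="missing.html">Missing</a>',
}


def fake_fetch(url):
    return FAKE_SITE.get(url)


print(check_links("https://example.com/", fake_fetch, recursive=False))
print(check_links("https://example.com/", fake_fetch, recursive=True))
```

In this toy setup, only the recursive run reaches the broken link on the second page, which matches the situation described above when pointing a non-recursive checker at a live site's home page.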

To scan the live website, I ended up using a different link checker tool called linkcheck, which is recursive.

Here's the command I used to run it:

docker run --rm tennox/linkcheck https://example.com

For this docs site, if we eventually want to scan the live site, here's a command we could run:

docker run --rm tennox/linkcheck https://docs.microbiomedata.org

Currently, the tool respects our robots.txt file, which tells search engines and other web crawlers not to crawl our website (that is our stance until we're ready to launch the site). As a result, the tool outputs the following:

Crawling...

Access to these URLs denied by robots.txt, so we couldn't check them:
- https://docs.microbiomedata.org/_static/_sphinx_javascript_frameworks_compat.js?v=2cd50e6c
- https://docs.microbiomedata.org/_static/css/custom.css?v=418f5f0c
- https://docs.microbiomedata.org/_static/css/theme.css?v=e59714d7
- https://docs.microbiomedata.org/_static/doctools.js?v=9bcbadda
- https://docs.microbiomedata.org/_static/documentation_options.js?v=2709fde1
- https://docs.microbiomedata.org/_static/favicon.ico
- https://docs.microbiomedata.org/_static/jquery.js?v=5d32c60e
- https://docs.microbiomedata.org/_static/js/index.js?v=09cb6ca5
- https://docs.microbiomedata.org/_static/js/theme.js
- https://docs.microbiomedata.org/_static/nmdc-logo-bg-white.png
- https://docs.microbiomedata.org/_static/pygments.css?v=80d5e7a1
- https://docs.microbiomedata.org/_static/sphinx_highlight.js?v=dc90522c
- https://docs.microbiomedata.org/explanation/community_conversations.html
- https://docs.microbiomedata.org/explanation/fair_data.html
- https://docs.microbiomedata.org/genindex.html
- https://docs.microbiomedata.org/howto_guides/api_gui.html
- https://docs.microbiomedata.org/howto_guides/data_plan.html
- https://docs.microbiomedata.org/howto_guides/globus.html
- https://docs.microbiomedata.org/howto_guides/portal_guide.html
- https://docs.microbiomedata.org/howto_guides/run_workflows.html
- https://docs.microbiomedata.org/howto_guides/submit2nmdc.html
- https://docs.microbiomedata.org/overview/nmdc_overview.html
- https://docs.microbiomedata.org/reference/data_portal.html
- https://docs.microbiomedata.org/runtime.html
- https://docs.microbiomedata.org/search.html
- https://docs.microbiomedata.org/tutorials/nav_data_portal.html
- https://docs.microbiomedata.org/tutorials/run_workflows.html
- https://docs.microbiomedata.org/tutorials/submission_portal.html
- https://docs.microbiomedata.org/workflows.html


Stats:
      99 links
       1 destination URLs
      34 URLs ignored
       0 warnings
       0 errors
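
The robots.txt behaviour described above can be reproduced offline with Python's standard `urllib.robotparser` module. This is only a sketch: the `ROBOTS_TXT` content below is a hypothetical "disallow everything" file (the site's actual robots.txt is not shown in this thread), and `is_allowed` is an illustrative helper.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that disallows all crawling. The site's actual
# file is not shown in this thread, so this content is an assumption.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


url = "https://docs.microbiomedata.org/runtime.html"
print(is_allowed(ROBOTS_TXT, url))
```

A crawler that honours such a file would skip every page on the site, which is consistent with the list of ignored URLs in the output above.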
