Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lyrics: Refactor Genius, Google backends, and consolidate common functionality #5474

Open
wants to merge 23 commits into
base: fix-lrclib-lyrics
Choose a base branch
from

Conversation

snejus
Copy link
Member

@snejus snejus commented Oct 19, 2024

Description

I've made several updates to the lyrics plugin to enhance its functionality and address a few bugs. Here's a quick overview of what has changed:

Bug Fixes

Plugin Enhancements

  • Session Management: Introduced a TimeoutSession to enable connection pooling and maintain consistent configuration across requests.
  • Error Handling: Centralized error handling logic in a new RequestsHandler class, which includes methods for retrieving either HTML text or JSON data.
  • Logging: Added methods to ensure the backend name is included in log messages.

Configuration Changes

  • Added a new dist_thresh field to the configuration, allowing users to control the maximum tolerable mismatch between the artist and title of the lyrics search result and their item. Interestingly, this field was previously available (though undocumented) and used in the Tekstowo backend. Now, this threshold has also been applied to Genius and Google search logic.

Backend Updates

  • All backends that perform searches now validate each result against the configured dist_thresh.

Genius

  • Removed the need to scrape HTML tags for lyrics; instead, lyrics are now parsed from the JSON data embedded in the HTML. This change should reduce our vulnerability to Genius' frequent alterations in their HTML structure.
  • Documented the structure of their search JSON data.

Google

  • Typed the response data returned by the Google Custom Search API.
  • Excluded certain pages under https://letras.mus.br that do not contain lyrics.
  • Excluded all results from MusiXmatch, as we cannot access their pages.
  • Improved parsing of URL titles (used for matching item/lyrics artist/title):
    • Handled results from long search queries where URL titles are truncated with an ellipsis.
    • Enhanced URL title cleanup logic.
    • Added functionality to determine (or rather, guess) not only the track title but also the artist from the URL title.
  • Similar to Fix lrclib lyrics #5406, search results are now compared to the original item and sorted by distance. Results exceeding the configured dist_thresh value are discarded. The previous functionality simply selected the first result containing the track's title in its URL, which often led to returning lyrics for the wrong artist, particularly for short track titles.
  • Since we now fetch lyrics confidently, redundant checks for valid lyrics and credits cleanup have been removed.

HTML Cleanup

  • Organized regex patterns into a new Html class.
  • Adjusted patterns to ensure new lines between blocks of lyrics text scraped from letras.mus.br and musica.com.
  • Modified patterns to scrape missing lyrics text on paroles.net and lacoccinelle.net. See the diff in test/plugins/lyrics_page.py.

To Do

  • Documentation: If you've added a new command-line flag, please find the appropriate page under docs/ to describe it.
  • Changelog: Add an entry to docs/changelog.rst at the bottom of one of the lists near the top of the document.
  • Tests: While tests are highly encouraged, they are not strictly required.

Note: If you find this PR too large to manage, I am happy to split it up!

This commit introduces a distance threshold mechanism for the Genius and
Google backends.

- Create a new `SearchBackend` base class with a method `check_match`
  that performs checking.
- Start using undocumented `dist_thresh` configuration option for good,
  and mention it in the docs. This controls the maximum allowable
  distance for matching artist and title names.

These changes aim to improve the accuracy of lyrics matching, especially
when there are slight variations in artist or title names, see #4791.
Improve requests performance with requests.Session which uses connection
pooling for repeated requests to the same host.

Additionally, this centralizes request configuration, making sure that
we use the same timeout and provide beets user agent for all requests.
Having removed it I fuond that only the Genius lyrics changed: it had en
extra new line. Thus I defined a function 'collapse_newlines' which now
gets called for the Genius lyrics.
Tidy up 'Google.is_page_candidate' method and remove 'Google.sluggify'
method which was a duplicate of 'slug'.

Since 'GeniusFetchTest' only tested whether the artist name is cleaned
up (the rest of the functionality is patched), remove it and move its
test cases to the 'test_slug' test.
* Type the response data that Google Custom Search API return.
* Exclude some 'letras.mus.br' pages that do not contain lyric.
* Exclude results from Musixmatch as we cannot access their pages.
* Improve parsing of the URL title:
  - Handle long URL titles that get truncated (end with ellipsis) for
    long searches
  - Remove domains starting with 'www'
  - Parse the title AND the artist. Previously this would only parse the
    title, and fetch lyrics even when the artist did not match.
* Remove now redundant credits cleanup and checks for valid lyrics.
Additionally, improve HTML pre-processing:

* Ensure a new line between blocks of lyrics text from letras.mus.br.
* Parse a missing last block of lyrics text from lacocinelle.net.
* Parse a missing last block of lyrics text from paroles.net.
* Fix encoding issues with AZLyrics by setting response encoding to
  None, allowing `requests` to handle it.
If we get caught by Cloudfare, it forwards our request somewhere else
and returns some validation text response. To make sure that this text
does not get assumed for lyrics, we can disable redirects for the Google
backend, check the response code and raise if there's a redirect
attempt. This source will then be skipped and the backend continues with
the next one.
I think we can make our life easier by removing these checks assuming
that users follow the instructions in the docs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

lyrics: In Genius backend, tolerate artist disambiguation markers
1 participant