PICARD-2466: Lookup function by using Catalog Number and other identifying info #2538

koraynilay · 2024-08-26T05:17:12Z

Summary

This is a…
- Bug fix
- Feature addition
- Refactoring
- Minor / simple change (like a typo)
- Other
Describe this change in 1-2 sentences: Try to match by catalog number, barcode and asin when using the Lookup function on clusters.

Problem

When using the Lookup (Ctrl+L) function on clusters, Picard won't try to match using available release identifiers (like catalogue number or barcode), this PR aims to fix this.
The problem still applies to unclustered files (I haven't checked to them yet).

I searched online but couldn't find anything similar to this (except for the mentioned JIRA tickets) so I apologize if there already is something similar.

JIRA ticket (optional): PICARD-2466, PICARD-2793

Solution

Make a dictionary containing a Counter for each tag, iterating through it and using most_common() gives us the most common value in all files in the cluster for each tag specified in tags_to_update (Method shamelessly copied from FileCluster's artist and title). This way we can populate the cluster's metadata with additional useful info.
This PR also adds those identifiers to Metadata.compare() and Metadata.compare_to_release_parts() (using their respective weights defined in CLUSTER_COMPARISON_WEIGHTS)

Action

Additional actions required:

Update Picard documentation (please include a reference to this PR)
Other (please specify below)

This is the first commit for a possible solution for PICARD-2466 also applies to PICARD-2793

@kokon191

Thanks to @kokon191 for the catno comparison code Cluster weights are sensible defaults which work for some of my tested files, but they will probably need some tweaking. The Metadata weights are sensible defaults guessed from the other weights. I haven't added them to File yet.

Sophist-UK

Also, I am not sure that this actually resolves anything that is part of PICARD-2793.

Sophist-UK · 2024-08-26T10:10:47Z

picard/cluster.py

@@ -72,6 +72,9 @@
    'releasecountry': 2,
    'releasetype': 10,
    'totalalbumtracks': 5,
+    'catalognumber': 60,
+    'asin': 65,
+    'barcode': 68,


I am unclear how these weights were chosen or why they are so high.

I chose these in the same way as the ones in metadata.py but after trying to match ~80 albums (all files tagged with only catalognumber), with 'catalognumber': 60, it was able to correctly identify almost all of them (except for ~4), while with a lower value it would identify quite a few less.

But all these were guesses and as I am not familiar with the codebase yet they may very well be wrong (even tho they worked for my specific use case).

Sophist-UK · 2024-08-26T10:10:56Z

picard/metadata.py

@@ -163,6 +163,9 @@ class Metadata(MutableMapping):
        ('totaltracks', 5),
        ('discnumber', 5),
        ('totaldiscs', 4),
+        ('catalognumber', 30),
+        ('barcode', 33),
+        ('asin', 32),


I am unclear how these weights were chosen or why they are so high.

I chose these values so they are a bit higher than the highest one (title), since they more uniquely identify releases/recordings/etc.

I have not had any experience with determining optimum search weightings in Picard, but I have designed complex weighted search algorithms in other search engines - and I know that getting the right weights is an art, and if you get them wrong then all sorts of funny results come up first.

I don't know whether @zas or @phw can help with determining whether these weights are good or bad.

I also believe these weights are too high. This can easily overrule better matches based on title and artist. It's also one reason why I think we should go a different route.

Sophist-UK · 2024-08-26T10:12:57Z

picard/cluster.py

+                        metadata_counters[tag] = Counter()
+                    metadata_counters[tag][value] += 1
+        for tag in metadata_counters:
+            self.metadata[tag] = metadata_counters[tag].most_common(1)[0][0]


What is the purpose of this additional code (which is only for these tags)?

And why has genre been added to these tags?

What is the purpose of this additional code (which is only for these tags)?

This is in order to add those tags to the cluster so they can be used in the lookup.

And why has genre been added to these tags?

Because I thought it would make sense for a cluster to also have the most common genre in its files, there may also be other tags which makes sense to include (like releasecountry or comment for release disambiguation maybe?) and all these can be quickly added/removed to the lookup (which could also lead to PICARD-2793).

Additional code - I haven't read the existing code in detail, but I surely code to include these tags in the cluster should be co-located with the code that puts the existing tags into the cluster.

Genres should not IMO be added because they are highly subjective and IMO the MB implementation is far too free-form i.e. lacking in hierarchy and unstructured.

Hmm, I had searched for that a bit after trying to just add catno=self.metadata['catalognumber'], to the find_releases call in Cluster.lookup_metadata() (which didn't work) but I couldn't find anything.

Now I tried (on master) printing self.metadata right before the find_releases call and the log just reads

store: {'album': ['Do the Dive [Limited Edition]'], 'albumartist': ['Call of Artemis'], 'totaltracks': ['4'], '~totalalbumtracks': ['4']} deleted: set() images: ["TagCoverArtImage from '/I/Raccolte/Musica/D4DJ Collection/Call of Artemis - Do the Dive [Limited Edition] (2022) [FLAC+CUE+LOG] {BRMM-10544}/01. Do the Dive.flac' of type front"] length: 1066065

(all 4 files have catalognumber set)
So it doesn't seem to get populated with the catalog number.
Unless I didn't understand something and those tag shouldn't be in self.metadata

~~Now that I think about it tho, in the UI I can see the cluster files' metadata, I'm gonna search if I can find where it gets them from.~~
Update: seems like the metadata box just gets the tags by iterating through the files

I think generally this tag merging for clusters makes sense, and we should do something like this for some relevant tags. It makes the clusters far more useful if we add such lookups. Whether this includes genres or not is a different question, and currently I don't have any strong opinion on this either way.

I'm a bit concerned about performance here if this runs for every individual file added. But this would need to be measured. Optimization would be to ensure this runs only after all files have been added when clustering multiple files at once (i.e. usually when creating the cluster).

I'm a bit concerned about performance here if this runs for every individual file added. But this would need to be measured. Optimization would be to ensure this runs only after all files have been added when clustering multiple files at once

I was also concerned about that, but from what I saw, self.update() gets called only once after all files have been added (after for file in files loop) in cluster.add_files(), so it should run only one time for each cluster (as you said it needs to be tested tho).

(i.e. usually when creating the cluster).

Presumably this code would therefore be better placed inside wherever the cluster is created.

Actually my first approach was to do this, by passing the metadata objects of the various files to FileCluster[1] and have a FileCluster.metadata property returning a metadata object with the most common tags, but that would require quite a bit of changes, also in the tests (as it would change the arguments of FileCluster.add() and Cluster's constructor).

I have pushed these changes to koraynilay:identifiers-lookup-filecluster, if you want to see them (they should work, but the tests didn't all pass and there are still some problems (see 742b9d2's commit message for an example)).

[1] like this

# in cluster.py, Cluster.cluster() @staticmethod def cluster(files): # ... for file in files: # ... token = tokenize(album) if token: # Basically pass `metadata` here - cluster_list[token].add(album, artist or various_artists, file) + cluster_list[token].add(metadata, file)

koraynilay · 2024-08-26T17:20:44Z

Also, I am not sure that this actually resolves anything that is part of PICARD-2793.

Hmm, yes, you're right, it doesn't relate directly, I included it because this change would also be needed for PICARD-2793's main use case (see from kokonut191's comments on JIRA).

I can remove it from the PR description tho, since it actually is a separate issue.

Also thanks for reviewing this.

phw

Thanks for opening this PR. I'm however not too sure about including this in the matching algorithm, especially not with such high weights. A mismatch in cat. no. or barcode would quickly lead to no match at all. IMHO if we use those identifiers in the matching they should have a low weight, so they will be used to push the result in favor of the releases with matching identifier, but don't spoil the search result completely.

Because questions about looking up by some identifier come up from time to time, but with different identifiers such as cat. no., barcode, disc ID, ISRC etc., I had been thinking about a different approach: We could add a context menu to clusters and files for looking up specifically by certain identifiers. This could be context aware and only offer lookups if the relevant tag is present.

The lookup would do an explicit search for the identifier in question on MB. It could either match the results automatically as it is done for "Lookup" and "Scan" or maybe show a search result window.

I think this would be the more flexible approach and would better fulfill the requirement of looking up by barcode, disc ID, cat. no., ISRC etc.

phw · 2024-08-28T06:43:34Z

picard/metadata.py

+
+            if 'barcode' in self and 'barcode' in weights:
+                score = 1.0 if self['barcode'] == release['barcode'] else 0.0
+                parts.append((score, weights['barcode']))


Proper comparison of barcodes likely should be a bit more complex then exact string match. Because of UPC vs EAN-13 it is e.g. not uncommon that UPCs are written with a leading zero. Compare also e.g. https://github.com/kellnerd/harmony/blob/1744a1a87ee8a8f8a0a59cbf8208e8f01ee9bae0/utils/gtin.ts#L66

Right, I did see that when I was looking at the docs for the search API, but it didn't come to mind when writing this code.

phw · 2024-08-28T06:44:40Z

picard/metadata.py

@@ -163,6 +163,9 @@ class Metadata(MutableMapping):
        ('totaltracks', 5),
        ('discnumber', 5),
        ('totaldiscs', 4),
+        ('catalognumber', 30),
+        ('barcode', 33),
+        ('asin', 32),


I also believe these weights are too high. This can easily overrule better matches based on title and artist. It's also one reason why I think we should go a different route.

phw · 2024-08-28T06:48:04Z

picard/cluster.py

+                        metadata_counters[tag] = Counter()
+                    metadata_counters[tag][value] += 1
+        for tag in metadata_counters:
+            self.metadata[tag] = metadata_counters[tag].most_common(1)[0][0]


I think generally this tag merging for clusters makes sense, and we should do something like this for some relevant tags. It makes the clusters far more useful if we add such lookups. Whether this includes genres or not is a different question, and currently I don't have any strong opinion on this either way.

I'm a bit concerned about performance here if this runs for every individual file added. But this would need to be measured. Optimization would be to ensure this runs only after all files have been added when clustering multiple files at once (i.e. usually when creating the cluster).

Sophist-UK · 2024-08-28T09:20:59Z

Optimization would be to ensure this runs only after all files have been added when clustering multiple files at once (i.e. usually when creating the cluster).

Presumably this code would therefore be better placed inside wherever the cluster is created.

koraynilay · 2024-08-28T14:54:39Z

Thanks for opening this PR. I'm however not too sure about including this in the matching algorithm, especially not with such high weights. A mismatch in cat. no. or barcode would quickly lead to no match at all. IMHO if we use those identifiers in the matching they should have a low weight, so they will be used to push the result in favor of the releases with matching identifier, but don't spoil the search result completely.

Thanks for reviewing it. Yeah that makes sense, unfortunately it didn't occur to me that there could be files tagged with a wrong identifier, as while trying to tag my files I was just thinking "if the catno matches, that means it's correct, whatever the other tags may be".

Because questions about looking up by some identifier come up from time to time, but with different identifiers such as cat. no., barcode, disc ID, ISRC etc., I had been thinking about a different approach: We could add a context menu to clusters and files for looking up specifically by certain identifiers.

I agree this would be the better solution, so even if the matching doesn't automatically find the right release, it can still be found explicitly.

The lookup would do an explicit search for the identifier in question on MB. It could either match the results automatically as it is done for "Lookup" and "Scan" or maybe show a search result window.

I'd prefer for it to match the results automatically, so that it can be done for a lot of clusters/files (as in my case, I could just select all ~80 clusters and press the "Lookup by catalogue number")

This could be context aware and only offer lookups if the relevant tag is present.

IMO it would be better to always offer those lookups, even if the tag isn't present (or show it if it's present in at least one file), because consider if the user has e.g.

10 album/clusters all tagged with catno
15 album/clusters without catno
20 album/clusters with some files having the correct catno and some files having no catno at all
5 unclustered files with catno
7 unclustered files without catno

selecting everything and pressing "lookup by catno" should match 1., 3. and 4., then pressing the normal lookup should match 2. and 5..
(catno here can be changed to barcode or another identifier)

phw · 2024-09-03T07:42:21Z

I added PICARD-2961 for the feature of a "Lookup by..." menu. This could work well for many similar feature requests for lookups, I linked the relevant tickets there.

Let us know if this is something you would like to work on. If not I might actually look into this sometime in the future because I personally would like to have the lookup from disc IDs in tags handled by this. But currently this will take quite some time until I can get to this.

Yeah that makes sense, unfortunately it didn't occur to me that there could be files tagged with a wrong identifier, as while trying to tag my files I was just thinking "if the catno matches, that means it's correct, whatever the other tags may be".

That's indeed always a tricky part with different data quality. A lot of Picard's matching is build on the fact that existing data might be bad. This is only a main reason why a lot of identifiers are not directly used.

Unfortunately we don't really have a good solution for this. All of the matching is usually done with local testing. I have a my own collection and a folder of test stuff, but it is by no means comprehensive. What we would really need is all kinds of test collection data of varying quality, together with the intended result. Then those weights could be tuned to give best results. This is likely a project we should tackle in some way, but this is really a separate topic on its own.

I'd prefer for it to match the results automatically, so that it can be done for a lot of clusters/files (as in my case, I could just select all ~80 clusters and press the "Lookup by catalogue number")

Yes, I tend to agree.

IMO it would be better to always offer those lookups, even if the tag isn't present (or show it if it's present in at least one file), because consider if the user has e.g.

Also makes sense.

koraynilay · 2024-09-04T00:15:48Z

Let us know if this is something you would like to work on. If not I might actually look into this sometime in the future because I personally would like to have the lookup from disc IDs in tags handled by this. But currently this will take quite some time until I can get to this.

Unfortunately I have basically never worked with GUIs and I don't have a lot of time rn to understand how everything works, but if I start working on it I will comment here to let you guys know.

That's indeed always a tricky part with different data quality. A lot of Picard's matching is build on the fact that existing data might be bad. This is only a main reason why a lot of identifiers are not directly used.

Unfortunately we don't really have a good solution for this. All of the matching is usually done with local testing. I have a my own collection and a folder of test stuff, but it is by no means comprehensive. What we would really need is all kinds of test collection data of varying quality, together with the intended result. Then those weights could be tuned to give best results. This is likely a project we should tackle in some way, but this is really a separate topic on its own.

Maybe we can add a prompt when using "Lookup"/"Lookup by..." (or global option in the settings) like "trust these files are tagged correctly" and then use exact[1] tag (also usable for title/artist/album) matching on them.

[1] would be better if it could also work with romanized tags (e.g. Russian or Japanese) and match the correct (prioritize romanized, but match also the non-romanized version), but I'm not sure how that would work (something like the artist translation field?)

kokon191 · 2024-10-07T09:03:22Z

I'm +1 for generic lookup patterns since that's what I looked for in my ticket (PICARD-2793). Maybe even generic weights.

Maybe we can add a prompt when using "Lookup"/"Lookup by..." (or global option in the settings) like "trust these files are tagged correctly" and then use exact[1] tag (also usable for title/artist/album) matching on them.

I would prefer it being a strict flag checkbox that when enabled, uses only exact string matching, instead of a prompt that pops up every time. Though my vote is on configurable weights and let users pump some numbers up if they prefer so (like personally I set catno weight to 100 since it takes priority over everything for my library).

On the question of performance, I don't think it's any slower in the grand scheme of things. Or at least, that's what happens in my use case when I tried with a few thousand files. The slowest part has always been the network part/query with musicbrainz for me, and it easily outweighs anything else.

koraynilay added 3 commits August 26, 2024 05:56

Add additional metadata (like identifiers) to clusters

ce17b88

This is the first commit for a possible solution for PICARD-2466 also applies to PICARD-2793

Add catno, barcode and asin to cluster lookup_metadata

1274479

zas requested review from phw and zas August 26, 2024 06:21

Sophist-UK reviewed Aug 26, 2024

View reviewed changes

phw reviewed Aug 28, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PICARD-2466: Lookup function by using Catalog Number and other identifying info #2538

PICARD-2466: Lookup function by using Catalog Number and other identifying info #2538

koraynilay commented Aug 26, 2024

Sophist-UK left a comment

Sophist-UK Aug 26, 2024

koraynilay Aug 26, 2024 •

edited

Loading

Sophist-UK Aug 26, 2024

koraynilay Aug 26, 2024 •

edited

Loading

Sophist-UK Aug 26, 2024

phw Aug 28, 2024

Sophist-UK Aug 26, 2024

koraynilay Aug 26, 2024 •

edited

Loading

Sophist-UK Aug 26, 2024

koraynilay Aug 26, 2024 •

edited

Loading

phw Aug 28, 2024

koraynilay Aug 28, 2024 •

edited

Loading

koraynilay commented Aug 26, 2024 •

edited

Loading

phw left a comment

phw Aug 28, 2024

koraynilay Aug 28, 2024

phw Aug 28, 2024

phw Aug 28, 2024

Sophist-UK commented Aug 28, 2024

koraynilay commented Aug 28, 2024

phw commented Sep 3, 2024

koraynilay commented Sep 4, 2024

kokon191 commented Oct 7, 2024 •

edited

Loading

PICARD-2466: Lookup function by using Catalog Number and other identifying info #2538

Are you sure you want to change the base?

PICARD-2466: Lookup function by using Catalog Number and other identifying info #2538

Conversation

koraynilay commented Aug 26, 2024

Summary

Problem

Solution

Action

Sophist-UK left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

koraynilay Aug 26, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

koraynilay Aug 26, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

koraynilay Aug 26, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

koraynilay Aug 26, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

koraynilay Aug 28, 2024 • edited Loading

Choose a reason for hiding this comment

koraynilay commented Aug 26, 2024 • edited Loading

phw left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Sophist-UK commented Aug 28, 2024

koraynilay commented Aug 28, 2024

phw commented Sep 3, 2024

koraynilay commented Sep 4, 2024

kokon191 commented Oct 7, 2024 • edited Loading

koraynilay Aug 26, 2024 •

edited

Loading

koraynilay Aug 26, 2024 •

edited

Loading

koraynilay Aug 26, 2024 •

edited

Loading

koraynilay Aug 26, 2024 •

edited

Loading

koraynilay Aug 28, 2024 •

edited

Loading

koraynilay commented Aug 26, 2024 •

edited

Loading

kokon191 commented Oct 7, 2024 •

edited

Loading