-
-
Notifications
You must be signed in to change notification settings - Fork 385
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PICARD-2466: Lookup function by using Catalog Number and other identifying info #2538
base: master
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -72,6 +72,9 @@ | |
'releasecountry': 2, | ||
'releasetype': 10, | ||
'totalalbumtracks': 5, | ||
'catalognumber': 60, | ||
'asin': 65, | ||
'barcode': 68, | ||
} | ||
|
||
|
||
|
@@ -166,6 +169,17 @@ def remove_file(self, file, new_album=True): | |
|
||
def update(self, signal=True): | ||
self.metadata['~totalalbumtracks'] = self.metadata['totaltracks'] = len(self.files) | ||
metadata_counters = {} | ||
tags_to_update = ['catalognumber', 'barcode', 'asin', 'genre'] | ||
for file in self.files: | ||
for tag in tags_to_update: | ||
if tag in file.metadata: | ||
value = file.metadata[tag] | ||
if tag not in metadata_counters: | ||
metadata_counters[tag] = Counter() | ||
metadata_counters[tag][value] += 1 | ||
for tag in metadata_counters: | ||
self.metadata[tag] = metadata_counters[tag].most_common(1)[0][0] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What is the purpose of this additional code (which is only for these tags)? And why has genre been added to these tags? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
This is in order to add those tags to the cluster so they can be used in the lookup.
Because I thought it would make sense for a cluster to also have the most common genre in its files, there may also be other tags which makes sense to include (like There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Additional code - I haven't read the existing code in detail, but I surely code to include these tags in the cluster should be co-located with the code that puts the existing tags into the cluster. Genres should not IMO be added because they are highly subjective and IMO the MB implementation is far too free-form i.e. lacking in hierarchy and unstructured. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hmm, I had searched for that a bit after trying to just add Now I tried (on
(all 4 files have
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think generally this tag merging for clusters makes sense, and we should do something like this for some relevant tags. It makes the clusters far more useful if we add such lookups. Whether this includes genres or not is a different question, and currently I don't have any strong opinion on this either way. I'm a bit concerned about performance here if this runs for every individual file added. But this would need to be measured. Optimization would be to ensure this runs only after all files have been added when clustering multiple files at once (i.e. usually when creating the cluster). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I was also concerned about that, but from what I saw,
Actually my first approach was to do this, by passing the metadata objects of the various files to I have pushed these changes to koraynilay:identifiers-lookup-filecluster, if you want to see them (they should work, but the tests didn't all pass and there are still some problems (see 742b9d2's commit message for an example)). [1] like this # in cluster.py, Cluster.cluster()
@staticmethod
def cluster(files):
# ...
for file in files:
# ...
token = tokenize(album)
if token:
# Basically pass `metadata` here
- cluster_list[token].add(album, artist or various_artists, file)
+ cluster_list[token].add(metadata, file) |
||
if signal and self.ui_item: | ||
self.ui_item.update() | ||
|
||
|
@@ -286,6 +300,9 @@ def lookup_metadata(self): | |
artist=self.metadata['albumartist'], | ||
release=self.metadata['album'], | ||
tracks=str(len(self.files)), | ||
catno=self.metadata['catalognumber'], | ||
barcode=self.metadata['barcode'], | ||
asin=self.metadata['asin'], | ||
limit=config.setting['query_limit']) | ||
|
||
def clear_lookup_task(self): | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -163,6 +163,9 @@ class Metadata(MutableMapping): | |
('totaltracks', 5), | ||
('discnumber', 5), | ||
('totaldiscs', 4), | ||
('catalognumber', 30), | ||
('barcode', 33), | ||
('asin', 32), | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am unclear how these weights were chosen or why they are so high. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I chose these values so they are a bit higher than the highest one ( There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I have not had any experience with determining optimum search weightings in Picard, but I have designed complex weighted search algorithms in other search engines - and I know that getting the right weights is an art, and if you get them wrong then all sorts of funny results come up first. I don't know whether @zas or @phw can help with determining whether these weights are good or bad. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I also believe these weights are too high. This can easily overrule better matches based on title and artist. It's also one reason why I think we should go a different route. |
||
] | ||
|
||
__date_match_factors = { | ||
|
@@ -232,6 +235,8 @@ def compare(self, other, ignored=None): | |
ia = a | ||
ib = b | ||
score = 1.0 - (int(ia != ib)) | ||
elif name in {'catalognumber', 'barcode', 'asin'}: | ||
score = 1.0 if a == b else 0.0 | ||
else: | ||
score = similarity2(a, b) | ||
parts.append((score, weight)) | ||
|
@@ -289,6 +294,26 @@ def compare_to_release_parts(self, release, weights): | |
except (ValueError, KeyError): | ||
pass | ||
|
||
if 'catalognumber' in self and 'catalognumber' in weights: | ||
try: | ||
found = 0.0 | ||
for _ in release['label-info']: | ||
if self['catalognumber'] == _['catalog-number']: | ||
print(_['catalog-number']) | ||
found += 1.0 | ||
break | ||
parts.append((found, weights['catalognumber'])) | ||
except (KeyError, IndexError): | ||
pass | ||
|
||
if 'barcode' in self and 'barcode' in weights: | ||
score = 1.0 if self['barcode'] == release['barcode'] else 0.0 | ||
parts.append((score, weights['barcode'])) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Proper comparison of barcodes likely should be a bit more complex then exact string match. Because of UPC vs EAN-13 it is e.g. not uncommon that UPCs are written with a leading zero. Compare also e.g. https://github.com/kellnerd/harmony/blob/1744a1a87ee8a8f8a0a59cbf8208e8f01ee9bae0/utils/gtin.ts#L66 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Right, I did see that when I was looking at the docs for the search API, but it didn't come to mind when writing this code. |
||
|
||
if 'asin' in self and 'asin' in weights: | ||
score = 1.0 if self['asin'] == release['asin'] else 0.0 | ||
parts.append((score, weights['asin'])) | ||
|
||
# Date Logic | ||
date_match_factor = 0.0 | ||
if 'date' in weights: | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am unclear how these weights were chosen or why they are so high.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I chose these in the same way as the ones in
metadata.py
but after trying to match ~80 albums (all files tagged with onlycatalognumber
), with'catalognumber': 60,
it was able to correctly identify almost all of them (except for ~4), while with a lower value it would identify quite a few less.But all these were guesses and as I am not familiar with the codebase yet they may very well be wrong (even tho they worked for my specific use case).