Fastest way to get an element #1651
Replies: 3 comments 1 reply
-
If you want to find your item in all nested sequences, you have no choice but to iterate over all items of the dataset to find sequences - I see no way around this. The problem with `walk()` is that it converts every raw data element into a `DataElement`, which is the expensive part; iterating the raw elements and only converting the ones you actually need avoids most of that cost:

```python
from pydicom import Dataset
from pydicom.tag import BaseTag


def handle_tag(ds: Dataset, tag: BaseTag):
    for raw_element in ds.elements():
        if raw_element.tag == tag:
            element = ds[raw_element.tag]  # converts the raw element to a DataElement
            # do anonymization
        elif raw_element.VR == "SQ":
            element = ds[raw_element.tag]
            sequence = element.value
            for dataset in sequence:
                handle_tag(dataset, tag)
```
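As a quick illustration (not from the original reply; the file name and tag are placeholders), this could be called on a dataset read without the pixel data, e.g. for PatientName:

```python
from pydicom import dcmread
from pydicom.tag import Tag

ds = dcmread("file.dcm", stop_before_pixels=True)  # skip the (large) pixel data
handle_tag(ds, Tag(0x0010, 0x0010))  # anonymize PatientName wherever it occurs, nested or not
```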
-
Firstly, if you haven't already, I would strongly recommend you profile your code to learn what the speed bottlenecks actually are. We can speculate, but most likely only one or two specific things will make a substantial difference, and the profiling would point to those. Since you mention big files, I suspect that reading time is dominated by disk I/O, not so much by the access to specific tags. If so, and it is image files you are mainly dealing with, then are you using `stop_before_pixels=True` when reading?
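As a rough sketch of such profiling (not from the original reply; `anonymise_file` stands in for whatever your entry point is):

```python
import cProfile
import pstats

# profile a single anonymisation run and dump the stats to a file
cProfile.run("anonymise_file('large_file.dcm')", "anon.prof")

# show the 20 call paths with the largest cumulative time
pstats.Stats("anon.prof").sort_stats("cumulative").print_stats(20)
```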
@mrbean-bremen's answer seems quite good - once in memory, the access can be quite fast, except for the decoding to native Python types, which @mrbean-bremen noted and worked around by iterating the raw elements.

The only other thought I have is that if you know which tags and sequences might appear, then you can just go directly to them, rather than iterating all data elements:

```python
tags_to_change = [0x100010, 0x100020, ...]

for tag in tags_to_change:
    try:
        ds[tag].value = ...  # change as needed
    except KeyError:
        pass
```

Then the above would have to be repeated with any known sequences and the tags that could be in them.
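A rough sketch of what that repetition for known sequences might look like (the tags below are only placeholders, e.g. OtherPatientIDsSequence containing PatientID):

```python
# sequence tag -> tags to change inside each item of that sequence (placeholder values)
sequences_to_change = {0x00101002: [0x00100020]}

for seq_tag, item_tags in sequences_to_change.items():
    try:
        items = ds[seq_tag].value
    except KeyError:
        continue  # this sequence is not present in the dataset
    for item in items:
        for tag in item_tags:
            try:
                item[tag].value = ...  # change as needed
            except KeyError:
                pass
```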
-
For some reason I lost track of your response. The dataset is loaded into memory without the pixel data. We do not process pixel data, so for large files (like fMRI) it is absolutely worth it to use stop_before_pixels to read the files. Since we do not process every attribute in the dataset, we had to design a way to only access the tags in the dataset that require processing. Now the only performance issue we have is large datasets that require a lot of processing. Code snippets are:

```python
def create_mapped_dataset(self, dataset):
    logger.debug('Converting dicom to json')
    try:
        json_file = dataset.to_json()
    except Exception as e:
        logger.error(f'Unable to convert object to JSON file {e}')
        return False
    elements_dict = json.loads(json_file)
    logger.debug('Looping over JSON Dict to create anonimisation mapping')
    self.anonimisation_mapping = self.loop_over_json(dicom_dict=elements_dict)
    return True
```
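For readers unfamiliar with the DICOM JSON model that `to_json()` produces: the resulting dict looks roughly like the sketch below (keys are 8-character hex tag strings, values carry `"vr"` and usually `"Value"`; the actual values here are made up). This is the structure `loop_over_json` relies on.

```python
elements_dict = {
    "00100010": {"vr": "PN", "Value": [{"Alphabetic": "Doe^Jane"}]},
    "00100020": {"vr": "LO", "Value": ["12345"]},
    # for a sequence (VR "SQ"), "Value" is a list of items, each again a tag -> element dict
    "00081110": {"vr": "SQ", "Value": [
        {"00081155": {"vr": "UI", "Value": ["1.2.3.4.5"]}},
    ]},
}
```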
and then:

```python
def loop_over_json(self, dicom_dict):
    return_set = {}
    for element in dicom_dict:
        try:
            vr = dicom_dict[element]["vr"]
        except KeyError:
            vr = 'N/A'
        if vr == 'SQ':
            sequence_return = [self.loop_over_json(dicom_dict=item)
                               for item in dicom_dict[element]["Value"]]
            try:
                if sequence_return[0]:
                    return_set[element] = sequence_return
            except IndexError as e:
                logger.debug(f'An index error occurred: {e}')
        else:
            if str(string_to_tag(element)) in self.processing_list:
                return_set[element] = dicom_dict[element]
    if return_set:
        return return_set
```

In the second one, the `self.processing_list` value contains all tags we would like to access and process.
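To illustrate (values made up), the mapping returned by `loop_over_json` keeps only the matching plain tags and, for sequences, a list with one nested mapping per item; the `isinstance(..., list)` check in the next snippet uses exactly that distinction to recurse into the real dataset's sequences:

```python
anonimisation_mapping = {
    "00100010": {"vr": "PN", "Value": [{"Alphabetic": "Doe^Jane"}]},  # plain tag, kept as-is
    "00081110": [                                                     # sequence -> list of per-item mappings
        {"00081155": {"vr": "UI", "Value": ["1.2.3.4.5"]}},
    ],
}
```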
```python
def iterate_mapped_sequences(self, dataset, mapped_dataset):
    logger.debug('Modifying existing elements with mapped sequences')
    # initiate the class in 'Processing Functions.py' so we can easily execute
    # the functions to modify the listed elements
    modify_elements = Process()
    # loop over each attribute in each dataset
    pydicom.config.convert_wrong_length_to_UN = True
    logger.debug(f'Working with mapped_dataset: {mapped_dataset}')
    for mapped_tag in mapped_dataset:
        logger.debug(f'Accessing mapped tag {mapped_tag}')
        # process the elements that are part of the mapped_dataset
        if isinstance(mapped_dataset[mapped_tag], list):
            logger.debug(f'{mapped_tag} is a sequence.')
            for item in dataset[string_to_tag(mapped_tag)].value:
                self.iterate_mapped_sequences(dataset=item,
                                              mapped_dataset=mapped_dataset[mapped_tag][0])
```
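For completeness, a rough sketch of how these pieces might be wired together (the `Anonymiser` wrapper class and the file names are hypothetical, not taken from the code above):

```python
import pydicom

# read without pixel data, since only header attributes are processed
ds = pydicom.dcmread("large_file.dcm", stop_before_pixels=True)

anon = Anonymiser()  # hypothetical class holding the methods shown above
if anon.create_mapped_dataset(ds):
    anon.iterate_mapped_sequences(dataset=ds, mapped_dataset=anon.anonimisation_mapping)

# note: pixel data was skipped on read, so it would need to be handled separately before saving
ds.save_as("anonymised.dcm")
```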
Any suggestions on how to loop over large files that have many tags in sequences we have to access?
-
We have created a script that will perform anonymisation of the content.
We use the function walk() to walk over the full dataset and test if the value requires modification.
There are extremely big files with hundreds of tags that can take more than 30 minutes to walk over.
Is there a faster way to get all attributes of a certain tag - especially the nested ones - without knowing where the tags can be and which/how many sequences there are?
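For context, the walk()-based approach described here looks roughly like this (needs_modification and modify are stand-ins for the script's own logic, not real functions):

```python
import pydicom


def anonymise_callback(dataset, data_element):
    # called once for every data element, including elements nested in sequences
    if needs_modification(data_element):                   # stand-in: does this value require modification?
        data_element.value = modify(data_element.value)    # stand-in: apply the modification


ds = pydicom.dcmread("big_file.dcm")
ds.walk(anonymise_callback)  # walk() recurses into sequence items by default
```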