Marker genes in JSON file - Correlation analysis mode #11

Open
mitsiaskonstanto opened this issue Dec 19, 2023 · 5 comments

@mitsiaskonstanto

Hi @danielsf,

I'm creating a new thread here, so we can continue our conversation regarding the marker genes that are used during the mapping process.

I followed your instructions in #10 in order to access those marker genes.

(1) In the hierarchical analysis mode I get multiple marker gene lists, each used to discriminate between the children of one parent in the taxonomy tree, as well as a 'None' element that indicates the root of the taxonomy tree. That is clear in general.
(2) In the correlation analysis mode, though, I only get a 'None' element. Is that reasonable?

Cheers,
Dimitris

@danielsf
Collaborator

danielsf commented Dec 19, 2023

Hi Dimitris,

Yes, that is an expected result.

When you run correlation mapping, the first step the code takes is to flatten the taxonomy tree, i.e. reduce it to a single level so that the cell type clusters are all direct children of the root node. The marker gene lookup table is similarly flattened. There is now only one parent node in the tree ('None'), so all of the markers are attached to that node.
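To illustrate the flattening described above, here is a minimal sketch. It is not the package's code: the function name `flatten_markers`, the parent keys, and the gene names are all hypothetical, and it assumes the flattened lookup is simply the union of every per-parent marker list.

```python
def flatten_markers(marker_lookup):
    """Merge per-parent marker lists into a single list under 'None'.

    Assumption: flattening takes the union of all per-parent lists.
    """
    merged = set()
    for genes in marker_lookup.values():
        merged.update(genes)
    # After flattening, the root ('None') is the only parent left.
    return {"None": sorted(merged)}


# Hypothetical hierarchical lookup: one marker list per parent node.
hierarchical = {
    "None": ["geneA", "geneB"],
    "class/c1": ["geneB", "geneC"],
    "class/c2": ["geneD"],
}

flat = flatten_markers(hierarchical)
print(flat)
```

With this toy input the result has a single 'None' key holding all four genes, which matches what the correlation-mode JSON output looks like.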

Cheers,

Scott

@mitsiaskonstanto
Author

Hi Scott,

Great, thank you for clarifying.

A follow-up question regarding the hierarchical mapping marker genes this time:

The JSON file includes entries like this:

hierarchical_mapping_json$marker_genes["CCN20230722_SUBC/CS20230722_SUBC_022"]
$`CCN20230722_SUBC/CS20230722_SUBC_022`
   [1] "ENSMUSG00000051951" "ENSMUSG00000002459" "ENSMUSG00000033774" "ENSMUSG00000033740" "ENSMUSG00000067879"
   [6] "ENSMUSG00000042501" "ENSMUSG00000048960" "ENSMUSG00000016918" "ENSMUSG00000025776" "ENSMUSG00000025931"
  [11] "ENSMUSG00000026141" "ENSMUSG00000026058" "ENSMUSG00000026077" "ENSMUSG00000050967" "ENSMUSG00000026065"
  [16] "ENSMUSG00000026062" "ENSMUSG00000045515" "ENSMUSG00000008136" "ENSMUSG00000026042" "ENSMUSG00000018417" 

As I was examining the genes included in each "SUBC", I observed that a large fraction of them appear in the marker list of every predicted "SUBCLASS". And if I look for markers that are unique to a single subclass, I end up with very few subclasses that have any.
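For anyone who wants to reproduce this kind of overlap check, a rough sketch follows. The function `marker_overlap`, the parent names, and the gene names are hypothetical; the only assumption about the real data is that the marker lookup is a mapping from parent node to a list of gene identifiers, as in the JSON entry shown above.

```python
from collections import Counter


def marker_overlap(marker_lookup):
    """Return genes shared by every parent and genes unique to one parent."""
    # Count in how many parents each gene appears (once per parent).
    counts = Counter(
        gene for genes in marker_lookup.values() for gene in set(genes)
    )
    n_parents = len(marker_lookup)
    shared = sorted(g for g, c in counts.items() if c == n_parents)
    unique = {
        parent: sorted(g for g in set(genes) if counts[g] == 1)
        for parent, genes in marker_lookup.items()
    }
    return shared, unique


# Hypothetical marker lists for three parents.
lookup = {
    "subclass_1": ["geneA", "geneB", "geneC"],
    "subclass_2": ["geneA", "geneB"],
    "subclass_3": ["geneA", "geneD"],
}

shared, unique = marker_overlap(lookup)
print(shared)
print(unique)
```

In this toy input only `geneA` is shared by all three parents, and `subclass_2` has no unique markers at all.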

Nevertheless, this does not seem to be a problem for the mapper, since the results I get make sense.
Still, it would be great to know how (and which of) these markers drive the distinction between these subclasses, or even supertypes. Why is one subclass label preferred over another when they are defined by very similar sets of marker genes?
Is there a level of "importance" for each marker in each "SUBC" that the algorithm takes into account when choosing which labels to assign?

Thank you in advance,
Dimitris

@danielsf
Collaborator

The marker genes used by the on-line MapMyCells app are the product of another research team, so I'm going to have to ask around to see if there is an answer to your question. With the onset of the end-of-year holidays, I probably won't be able to properly respond to this until early 2024. Sorry I can't give you anything more helpful now.

@danielsf
Collaborator

danielsf commented Jan 2, 2024

@mitsiaskonstanto

I just read over your question again and realized I can answer it. There is no importance score that the algorithm uses when assigning classes, subclasses, etc. The data is simply subsampled to include only the marker genes and then correlated against the average gene expression profiles of the clusters in the reference data (again, using only the marker genes). The cluster with the highest correlation coefficient is chosen (i.e. all marker genes are considered equal).
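As a rough illustration of that selection step (not the package's actual code), the sketch below restricts a cell and each cluster's mean profile to the marker genes, correlates them, and picks the cluster with the highest coefficient. The function names, cluster names, gene names, and expression values are hypothetical, and Pearson correlation is an assumption, since the description above only says "correlation coefficient".

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def assign_cluster(cell, cluster_means, marker_genes):
    """Pick the cluster whose mean profile correlates best with the cell.

    Only marker genes are used, and every marker carries equal weight.
    """
    cell_vec = [cell[g] for g in marker_genes]
    scores = {
        name: pearson(cell_vec, [profile[g] for g in marker_genes])
        for name, profile in cluster_means.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical data: geneD is not a marker, so it is ignored.
markers = ["geneA", "geneB", "geneC"]
cluster_means = {
    "cluster_1": {"geneA": 5.0, "geneB": 1.0, "geneC": 0.5, "geneD": 9.0},
    "cluster_2": {"geneA": 0.5, "geneB": 4.0, "geneC": 6.0, "geneD": 9.0},
}
cell = {"geneA": 4.0, "geneB": 0.8, "geneC": 1.0, "geneD": 0.0}

best, scores = assign_cluster(cell, cluster_means, markers)
print(best, scores)
```

This also shows why heavily overlapping marker lists are not a problem in themselves: the label depends on the expression pattern across the markers, not on which genes happen to be in the list.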

The documentation for the cell type assignment algorithm can now be found here.

@mitsiaskonstanto
Author

Hi Scott,

Happy new year and thank you very much for your response.

Ok, that's totally reasonable then.

I will go through the documentation you have created and let you know if everything is clear.

Cheers,

Dimitris
