Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Visualization scheme for gene maps #41

Open
angadps opened this issue Mar 30, 2022 · 7 comments
Open

Visualization scheme for gene maps #41

angadps opened this issue Mar 30, 2022 · 7 comments

Comments

@angadps
Copy link
Contributor

angadps commented Mar 30, 2022

Hello,

This is mostly a question to better understand the current visualizations and if and how they could be configured. Hope this hasn't been already discussed.

I've run xfuse on a couple of dozen samples so far and I'm trying to investigate about a dozen genes in each of the samples. Based on the raw Visium gene expression data some of the plots aren't making sense to me. For instance, one of the genes is expressed in < 10 spots in the Visium data, still I see bright spots throughout the tissue in the gene maps analysis. Maybe having a better understanding of the visuals might help.

  1. Is the data being normalized by default when visualizing the "xfused" gene expression? If so, the relative scaling can cause some of the regions to "light up", even though the absolute predicted expression values might be low. This would not require a scale/legend as there isn't one right now. However, if the visualization is that of the absolute values, having a scale would be useful.

  2. Is the visualization happening on the the original scale, or on a log-scale? Some of the genes are both highly expressed throughout the tissue and have large variation in the expression levels. Visualizing using a log scale can help catch the high expression values throughout along with capturing the differences.

Alternatively, are there configurable options for the above which I can use to display my plots?

Thanks,
--Angad.

@ludvb
Copy link
Owner

ludvb commented Mar 30, 2022

Hi Angad,

The gene maps use original scale data. By default, the data for each gene is normalized to the range of the colormap. There is no pixel-wise normalization. I agree that a scale would be very useful.

The data that is shown is the estimated posterior mean and not the observed expression. In case of low-expressed genes, the posterior mean will likely be much smaller than 1 and should not be taken to imply actual expression. It is also likely that the prediction has a high degree of uncertainty, as there may be insufficient data to reliably estimate expression levels.

Here are all the config options for the gene maps analysis with brief comments:

[analyses.analysis-gene_maps.options]
gene_regex = ".*"     # Regex specifying which genes to predict
num_samples = 10      # Number of Monte Carlo samples to draw from the inferred posterior
genes_per_batch = 10  # Number of genes per batch (use a lower value if you get OOM errors)
predict_mean = true   # Whether to draw samples of the posterior mean or the emitted negative binomial count values
normalize = false     # Whether to use pixel-wise normalization
mask_tissue = true    # Whether to mask out regions outside the tissue
scale = 1.0           # A rescaling factor in the range (0, 1). A lower value speeds up the analysis and results in smaller output files (at the cost of lower resolution).
writer = "image"      # Output format (can be either "image" or "tensor")

I think a good idea may be to set the writer argument to "tensor". This will save the expression maps as torch tensors, which can be loaded in Python using torch.load. The tensors will have the shape (num_samples, height, width) and contain the unnormalized samples. It would then be possible to normalize the data in the way most suitable for your analysis and to get a better sense of predictive dispersion/uncertainty.

Let me know if you have any questions!

@angadps
Copy link
Contributor Author

angadps commented Mar 31, 2022

Hi,

Thanks for the detailed explanation. That does explain a lot of my observations. In fact I tried writing out the (num_samples, height, width) maps. As a sanity check, I first read in these new tensors and try plotting them without doing anything else. Incidentally, I notice that the values that I read in have "inverted" masks. For instance, an individual samples from a tensor look something like this:

array([[NZ, NZ, NZ,... NZ, NZ, NZ, ... NZ, NZ, NZ],
[NZ, 0, 0,... 0, 0, 0, ... 0, 0, NZ],
[NZ, 0, 0,... 0, 0, 0, ... 0, 0, NZ],
[NZ, 0, 0,... 0, 0, 0, ... 0, 0, NZ],
[NZ, 0, 0,... 0, 0, 0, ... 0, 0, NZ],
[NZ, NZ, NZ,... NZ, NZ, NZ, ... NZ, NZ, NZ]], dtype=float32)

Here, NZ is any non-zero floating point value. My tissue sample is placed in the center of the slide. How should I interpret the zero and non-zero values here? Should I not expect to see the zero-values for masked out regions and non-zero values for regions containing tissue? Replacing the writer to "image" shows me the posterior means as expected (without inversion). I might be making an incorrect assumption here regarding the tensor object?

Thanks for the help!

@ludvb
Copy link
Owner

ludvb commented Mar 31, 2022

That does indeed look like the mask has been inverted! I think this may be a bug. Can you try to modify this line:

samples[:, tissue_mask] = 0.0

by adding a tilde before tissue_mask:

samples[:, ~tissue_mask] = 0.0

and see if that fixes it?

@angadps
Copy link
Contributor Author

angadps commented Apr 4, 2022

Yes, that fixes it. Thank you!
The tensor allows me to normalize and combine expression across multiple genes as well.
Sample masking though does not always work but that's for another thread.

@ludvb
Copy link
Owner

ludvb commented Apr 6, 2022

Feel free to PR a fix for this if you like!

@angadps
Copy link
Contributor Author

angadps commented Apr 12, 2022

Thanks, I just sent you a PR. I've tested the changes over your latest master commit and it looks good.

I also want to update you that I'm able to use the tensor objects and create visualizations suited better for my needs.

Essentially I've copied you visualization code under analyze/gene_maps.py and added a "pseudo coverage mean normalization". Ideally I would calculate the the total of posterior means across all genes and divide the posterior mean for each individual gene by that number. But obviously it is not feasible to generate the tensors for all genes. Instead, I'm using the spatial gene expression data (as proxy) to do the same. That along with a log scale helps visualize a wide range of expression values across many samples and genes.

Thanks for your help!

@ludvb
Copy link
Owner

ludvb commented Apr 13, 2022

Thanks, much appreciated! :) Glad to hear that you have found a good way to do the gene maps analysis. Could be an interesting workflow to implement as an option!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants