In this tutorial we will go over the analysis of a panel of samples using Conos. Conos objects can be used to identify clusters of corresponding cells across panels of samples from similar or dissimilar sources, with different degrees of cell type overlap. Here we will identify corresponding clusters across a panel of bone marrow (BM) and cord blood (CB) by generating a joint graph with the cells from all the samples. We will use the graph to propagate labels from a single labelled sample to other samples and finally perform differential expression between BM and CM samples.
First, let’s load Conos library to start with:
library(conos)
Next we will load a previously prepared panel of samples. This panel was made up of 16 cord blood and bone marrow samples, but here we look at a smaller subset of just 4 samples. All samples have been subset to exactly 3000 cells. Note: when starting with your own panel, it’s recommended to filter out low-count / poor /dying cells.
panel <- readRDS(file.path(find.package('conos'),'extdata','panel.rds'))
Let’s take a look at the panel. The panel is a named list of sparse matrices (type dgCMatrix).
str(panel,1)
## List of 4
## $ MantonBM1_HiSeq_1:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## $ MantonBM2_HiSeq_1:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## $ MantonCB1_HiSeq_1:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## $ MantonCB2_HiSeq_1:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
Before we continue it is very important to make sure that cells in our panel are uniquely named. No two cells (even in different samples) should be named identically. In this case the cells have been prefixed by sample id, so there will not be any collisions. However in most cases you will have to prefix the cells before continuing.
head(colnames(panel[[1]]))
## [1] "MantonBM1_HiSeq_1-TCTATTGGTCTCTCGT-1"
## [2] "MantonBM1_HiSeq_1-GAATAAGTCACGCATA-1"
## [3] "MantonBM1_HiSeq_1-ACACCGGTCTAACTTC-1"
## [4] "MantonBM1_HiSeq_1-TCATTTGGTACGCTGC-1"
## [5] "MantonBM1_HiSeq_1-TATTACCCAAAGGAAG-1"
## [6] "MantonBM1_HiSeq_1-CGCCAAGCATCTGGTA-1"
To quickly check that the cell names will be unique, we can run:
any(duplicated(unlist(lapply(panel,colnames))))
## [1] FALSE
Conos is focused on integration, and relies on pagoda2 or Seurat to perform dataset pre-processing.
We will generate pagoda2 apps for poorly-expressed genes from each
individual sample using basicP2proc
helper function for quick
processing. As the datasets will be compared to each other we will turn
off automated dropping of low-expressed genes (min.cells.per.gene=0
),
and lower the numbers of local PCs estimated for faster processing.
(note: you could run the outer loop in parallel using mclapply, however
if ran within RStudio this sometimes causes multi-threading problems;
also, multiprocessing must be disabled in order to obtain exactly the
same individual sample embeddings from one run to another: this can be
done by using set.seed(1)
and specifying n.cores=1
in the command
below).
library(pagoda2)
##
panel.preprocessed <- lapply(panel, basicP2proc, n.cores=4, min.cells.per.gene=0, n.odgenes=2e3, get.largevis=FALSE, make.geneknn=FALSE)
## 3000 cells, 33694 genes; normalizing ... using plain model winsorizing ... log scale ... done.
## calculating variance fit ... using gam 171 overdispersed genes ... 171 persisting ... done.
## running PCA using 2000 OD genes .... done
## running tSNE using 4 cores:
## 3000 cells, 33694 genes; normalizing ... using plain model winsorizing ... log scale ... done.
## calculating variance fit ... using gam 159 overdispersed genes ... 159 persisting ... done.
## running PCA using 2000 OD genes .... done
## running tSNE using 4 cores:
## 3000 cells, 33694 genes; normalizing ... using plain model winsorizing ... log scale ... done.
## calculating variance fit ... using gam 248 overdispersed genes ... 248 persisting ... done.
## running PCA using 2000 OD genes .... done
## running tSNE using 4 cores:
## 3000 cells, 33694 genes; normalizing ... using plain model winsorizing ... log scale ... done.
## calculating variance fit ... using gam 166 overdispersed genes ... 166 persisting ... done.
## running PCA using 2000 OD genes .... done
## running tSNE using 4 cores:
Let’s look at the output of our processing. We now have a named list of pagoda2 objects, which is the starting point for the analysis with Conos.
str(panel.preprocessed,1)
## List of 4
## $ MantonBM1_HiSeq_1:Reference class 'Pagoda2' [package "pagoda2"] with 14 fields
## ..and 34 methods, of which 20 are possibly relevant
## $ MantonBM2_HiSeq_1:Reference class 'Pagoda2' [package "pagoda2"] with 14 fields
## ..and 34 methods, of which 20 are possibly relevant
## $ MantonCB1_HiSeq_1:Reference class 'Pagoda2' [package "pagoda2"] with 14 fields
## ..and 34 methods, of which 20 are possibly relevant
## $ MantonCB2_HiSeq_1:Reference class 'Pagoda2' [package "pagoda2"] with 14 fields
## ..and 34 methods, of which 20 are possibly relevant
The alternative, Seurat, pre-processing can be done in a similar way
using an analogous basicSeuratProc
helper function. Alternatively, if
you already have a set of Seurat objects (one per dataset), you can just
skip this step and feed them directly to Conos$new()
as shown below.
library(Seurat)
panel.preprocessed <- lapply(panel, basicSeuratProc)
We note that sample pre-processing steps can be used to filter/adjust the data in custom ways. For instance, one can reduce the impact of the cell cycle contribution by omitting cycle-annotated genes from the matrices prior to the pre-processing. Similarly, if it is deemed appropriate, one can regress out certain signatures using standard techniques.
We will now construct a Conos object for this panel of samples. At this point we haven’t calculated anything. We have just generated an object that contains the samples. Note that at this step we also set the n.cores parameter. The graph generation with Conos can take advantage of parallel processing, so use as many physical cores as you have available here.
con <- Conos$new(panel.preprocessed, n.cores=4)
Our original pagoda2 (or Seurat) objects are now saved in the Conos object (if you are short of memory you can go ahead and delete the originals).
str(con$samples,1)
## List of 4
## $ MantonBM1_HiSeq_1:Reference class 'Pagoda2' [package "pagoda2"] with 14 fields
## ..and 34 methods, of which 20 are possibly relevant
## $ MantonBM2_HiSeq_1:Reference class 'Pagoda2' [package "pagoda2"] with 14 fields
## ..and 34 methods, of which 20 are possibly relevant
## $ MantonCB1_HiSeq_1:Reference class 'Pagoda2' [package "pagoda2"] with 14 fields
## ..and 34 methods, of which 20 are possibly relevant
## $ MantonCB2_HiSeq_1:Reference class 'Pagoda2' [package "pagoda2"] with 14 fields
## ..and 34 methods, of which 20 are possibly relevant
We can now plot a panel of these samples using the clusters we identified by examining each sample on its own. We note that each sample has an independent set of clusters that bears no relationship to clusters in other sample (for example note cluster 9).
con$plotPanel(clustering="multilevel", use.local.clusters=T, title.size=6)
## Warning: The `printer` argument is deprecated as of rlang 0.3.0.
## This warning is displayed once per session.
Next we will build the joint graph that encompasses all the samples. We do that by pairwise projecting samples onto a common space and establishing kNN of mNN pairs between the samples. We then append within-sample kNN neighbours to the graph to ensure that all the cell are included in the graph.
- We use ‘PCA’ space here which is very fast and will yield good integration in most cases.
- CPCA space should provide more accurate alignment under greater dataset-specific distortions.
- CCA space optimizes conservation of correlation between datasets and can give yield very good alignments in low-similarity cases (e.g. large evolutionary distances).
- If your datasets were all measured on the same platform you may also want to consider “genes” space which can give better resolution in such (simpler) cases.
Other parameters passed to the buildGraph()
function below are all
default values - so are shown just for
information.
con$buildGraph(k=30, k.self=5, space='PCA', ncomps=30, n.odgenes=2000, matching.method='mNN', metric='angular', score.component.variance=TRUE, verbose=TRUE)
## found 0 out of 6 cached PCA space pairs ... running 6 additional PCA space pairs done
## inter-sample links using mNN done
## local pairs local pairs done
## building graph ..done
Note: as pairwise comparisons may take a while, Conos will cache results
for each space. If you want to recalculate, for instance PCA, pairings
with different set of parameters (e.g. more components, different number
of starting over-dispersed genes), clear the cache first by doing
con$pairs$PCA <- NULL
.
In the $buildGraph()
invokation above, we specified
score.component.variance=TRUE
which estimates amount of variance
explained by successive PCs (by default this option is off to save
time). We can visualize the results using:
plotComponentVariance(con, space='PCA')
When
using ‘angular’ distance measure (default), it is NOT recommended to
reduce the number of components to a bare minimum indicated by the
“elbow” inflection point - include 10-20 more (typcally 30 components
work well). For ‘L2’ distance, using fewer components (i.e. at ‘elbow’
value) is sometimes better. (note: remember that if you want to
recalcualte projections, clear the cache for that space, i.e.
con$pairs$PCA <- NULL
).
We next use the graph we identified to get global clusters. Here we use Leiden community detection method to obtain clusters. Increasing the value of the resolution parameter will result in more fine-grained clusters, while decreasing it will return coarser clustering.
con$findCommunities(method=leiden.community, resolution=1)
We can now plot the clusters we obtained. Note that the cluster numbers between different samples now correspond to the same cell type. Also not the presence of cluster 5 in BM samples only, but not in CB.
con$plotPanel(font.size=4)
A convenience function can be used to examine the composition of the clusters in terms of samples, sample entropy (middle), and cluster size (bottom):
plotClusterBarplots(con, legend.height = 0.1)
Check an expression pattern of a specific gene across all the individual embeddings.
con$plotPanel(gene = 'GZMK')
Next we embed and visualize the complete joint graph.
Note: embedding estimation will run the first time around. Please see
$embedGraph()
function for additional embedding options.
Note 2: both functions $plotGraph
and $plotPanel
are based on the
function conos::embeddingPlot
and forward all visualization parameters
to this function. So, to get full list of the possible parameters see
?conos::embeddingPlot
and examples below.
con$plotGraph(alpha=0.1)
## Estimating embeddings.
We note that the graph captures the population structure irrespectively of the sample of origin of each cell.
con$plotGraph(color.by='sample', mark.groups=F, alpha=0.1, show.legend=T)
We can also visualize gene expression on this joint graph embedding:
con$plotGraph(gene='GZMK', title='GZMK expression')
Other community detection methods can provide a more sensitive and hierarchical view of the subpopulation structure. Here we run walktrap community detection method on the same joint graph:
con$findCommunities(method = igraph::walktrap.community, steps=7)
Note: it is recommended to use higher number of steps (e.g. 8-10,
however these calculations take much longer). Here we’ll get a lot of
smaller clusters. Note: different clustering results are kept as a
simple list under con$clusters
.
Visualize new clusters:
con$plotPanel(clustering='walktrap',font.size=4)
New clustering, as viewed on a joint graph:
con$plotGraph(clustering='walktrap')
At the moment, Conos is able to use two methods of graph embedding: largeVis (default) and UMAP. The UMAP takes a bit longer to estimate, but generally gives better quality of the embedding. Though sometime UMAP makes even slightest difference (which is not detected by either largeVis or even clustering algorithms) looking perfectly distinguishable. It’s best to examine both types of embeddings.
For the description of largeVis parameters please look at
conos::projectKNNs
function. The most influential are alpha
and
sgd_batched
. Decreasing alpha results in less compressed clusters, and
increasing sgd_batches often helps to avoid cluster intersections and
spread out the clusters. Here we take alpha to a very low value, for the
sake of example:
con$embedGraph(alpha=0.001, sgd_batched=1e8)
## Estimating embeddings.
con$plotGraph(clustering='walktrap', size=0.1)
UMAP embedding supports all parameters, described in the
uwot package. Two most important
ones are spread
and min.dist
, which together control how tight the
clusters are. According to the python
manual:
- min.dist: The effective minimum distance between embedded points. Smaller values will result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values will result on a more even dispersal of points. The value should be set relative to the spread value, which determines the scale at which embedded points will be spread out.
- spread: The effective scale of embedded points. In combination with min_dist this determines how clustered/clumped the embedded points are.
con$embedGraph(method="UMAP", min.dist=0.01, spread=15, n.cores=4)
## Convert graph to adjacency list...
## Done
## Estimate nearest neighbors and commute times...
## Estimating hitting distances: 18:21:33.
## Done.
## Estimating commute distances: 18:21:46.
## Hashing adjacency list: 18:21:46.
## Done.
## Estimating distances: 18:21:47.
## Done
## Done.
## All done!: 18:21:51.
## Done
## Estimate UMAP embedding...
## 18:21:51 Read 12000 rows and found 1 numeric columns
## 18:21:51 Commencing smooth kNN distance calibration using 4 threads
## 18:21:52 Initializing from normalized Laplacian + noise
## 18:21:53 Commencing optimization for 1000 epochs, with 368044 positive edges using 4 threads
## 18:22:14 Optimization finished
## Done
con$plotGraph(clustering='walktrap', size=0.1)
In the example above, UMAP layout makes even many of the very small subpopulations called by walktrap apparent.
Walktrap clustering generates a hierarchical community structure.
We can get a cut of the top dendrogram and visualize it. Here we’ll cut to get 40 top clusters.
fc <- greedyModularityCut(con$clusters$walktrap$result,40);
The cut determines a finer clustering (likely overclustering) of the dataset on its leafs:
con$plotGraph(groups=fc$groups, size=0.1)
Let’s look at the hierarchical structure of these clusters:
# fc$hc is an hclust structure ... here we will convert it to a dendrogram
dend <- as.dendrogram(fc$hc)
plot(dend)
We can modify the dendrogram to show various properties. For instance, alter the width of the edges to reflect how many samples are contributing to it (normalized entropy). To do so, let’s first define a factor specifying which samples different samples came from:
samf <- con$getDatasetPerCell()
str(samf)
## Factor w/ 4 levels "MantonBM1_HiSeq_1",..: 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "names")= chr [1:12000] "MantonBM1_HiSeq_1-TCTATTGGTCTCTCGT-1" "MantonBM1_HiSeq_1-GAATAAGTCACGCATA-1" "MantonBM1_HiSeq_1-ACACCGGTCTAACTTC-1" "MantonBM1_HiSeq_1-TCATTTGGTACGCTGC-1" ...
Now we’ll use dendSetWidthByBreadth()
function to calculate the
entropies of each edge and set the width
accordingly:
dend <- dendSetWidthByBreadth(dend,samf,fc$leafContent, min.width=1, max.width=4)
plot(dend)
Similarly, we can find a factor that labels cells by the tissue they are from (in this case BM or CB). To define the factor for this simple dataset, we’ll simply parse the cell names:
tissue.factor <- as.factor(setNames(ifelse(grepl('BM',names(samf)),'BM','CB'), names(samf)))
str(tissue.factor)
## Factor w/ 2 levels "BM","CB": 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "names")= chr [1:12000] "MantonBM1_HiSeq_1-TCTATTGGTCTCTCGT-1" "MantonBM1_HiSeq_1-GAATAAGTCACGCATA-1" "MantonBM1_HiSeq_1-ACACCGGTCTAACTTC-1" "MantonBM1_HiSeq_1-TCATTTGGTACGCTGC-1" ...
Now, let’s color the edges according to the tissue mixture:
dend <- dendSetColorByMixture(dend, tissue.factor, fc$leafContent)
plot(dend)
An alternative way to explore this the hierarchical community structure using an interactive app. The app also allows to visualize tissue composition and sample similarities:
conosShinyApp(con,N=30)
One of the uses of this graph is to propagate labels. For example in some cases we will only have information about the cell types in one of the samples and we want to automatically label the other samples.
We’ll load annotation from a simple text file (first column giving cell name, second - cell type), and make a named factor out of it:
cellannot <- read.table(file.path(find.package('conos'),'extdata','cellannot.txt'),header=F,sep='\t')
cellannot <- setNames(cellannot[,2], cellannot[,1])
Next we plot our panel with the annotations we made. This is to verify that the annotated cells are indeed in only one sample and that the other samples are unlabelled.
con$plotPanel(groups = cellannot)
Next let’s propagates the labels from the one annotated sample to the other samples.
new.label.probabilities <- con$propagateLabels(labels = cellannot, verbose=T)
This function returns probabilities of each cell belonging to each group. These probabilities can be used to estimate uncertainty of the labeling:
con$plotGraph(colors=(1 - apply(new.label.probabilities, 1, max)), show.legend=T, legend.title="Uncertainty", legend.pos=c(1, 0))
Now we can assign each cell to the the the cell type with the highest probability. Though we can see that one cluster has high uncertainty and possibly represent a cell type, not presented in the initial annotation.
new.annot <- setNames(colnames(new.label.probabilities)[apply(new.label.probabilities,1,which.max)], rownames(new.label.probabilities))
head(new.annot)
## MantonBM1_HiSeq_1-CGATTGACACCTCGGA-1 MantonBM2_HiSeq_1-CTGATAGAGCGTTCCG-1
## "NK" "NK"
## MantonBM1_HiSeq_1-AGGTCCGTCTCTGCTG-1 MantonBM2_HiSeq_1-GGAAAGCCAGACGCCT-1
## "NK" "NK"
## MantonBM2_HiSeq_1-GGGTTGCGTAGCTGCC-1 MantonBM2_HiSeq_1-GACAGAGGTCACAAGG-1
## "NK" "NK"
We now see that all our samples have been labelled automatically!
con$plotPanel(groups = new.annot)
The same effect can be achieved with parameter return.distribution=F
(though it’s recommended to always examine
uncertainty):
new.annot2 <- con$propagateLabels(labels = cellannot, verbose=F, return.distribution=F)
all(new.annot2 == new.annot)
## [1] TRUE
By default, label propagation affects initial labels as well:
sum(new.annot[names(cellannot)] != cellannot)
## [1] NA
Even though here the effect on this data is not that pronounced,
sometimes we trust initial labeling completely and don’t want to change
it. In this case, option fixed.initial.labels=T
should be
used:
new.annot <- con$propagateLabels(labels = cellannot, verbose=F, return.distribution=F, fixed.initial.labels=T)
all(new.annot[names(cellannot)] == cellannot)
## [1] NA
con$plotPanel(groups = new.annot)
The first step we can do to understand meaning of the dataset is to look at the cluster cell markers:
de.info <- con$getDifferentialGenes(groups=new.annot)
## Estimating marker genes per sample
## Aggregating marker genes
## All done!
head(de.info$Bcells)
## NULL
cowplot::plot_grid(con$plotGraph(groups=new.annot), con$plotGraph(gene="CD74"))
Next, given a joint clustering of cells that captures cell relationships between samples, we can want to ask what is different between the cells of these populations between specific samples types, in this case CB and BM samples. Conos provides routines to be able to do that.
The general approach we suggest for differential expression analysis, is
to first pool all the data associated with each cluster (forming a
meta-cell that is analogous bulk RNA-seq measurement of the cells within
each cluster), and then use standard differential expression packages,
such as DESeq2 or limma to compare these “bulk-like” meta-cell samples,
using appropriate design models. In this section we show a convenience
routine (getPerCellTypeDE
) that enables one type of comparison (same
cluster, between sample groups), but if more advanced models are desired
(e.g. additional model variables, etc.) getClusterCountMatrices
command can be used to obtain the meta-cell counts:
str( con$getClusterCountMatrices() , 1)
## List of 4
## $ MantonBM1_HiSeq_1: num [1:33694, 1:15] 0 0 0 1 0 0 0 0 38 4 ...
## ..- attr(*, "dimnames")=List of 2
## $ MantonBM2_HiSeq_1: num [1:33694, 1:15] 0 0 0 0 0 0 0 0 66 4 ...
## ..- attr(*, "dimnames")=List of 2
## $ MantonCB1_HiSeq_1: num [1:33694, 1:15] 0 0 0 0 0 0 0 0 73 5 ...
## ..- attr(*, "dimnames")=List of 2
## $ MantonCB2_HiSeq_1: num [1:33694, 1:15] 0 0 0 0 0 0 0 0 136 17 ...
## ..- attr(*, "dimnames")=List of 2
The list above, returns pooled count matrix for each sample, where the rows are genes, and columns are clusters. Different groups parameter can be supplied.
Back to DE analysis of the cluster states between groups of samples. First we need to define our sample groups
samplegroups <- list(
bm = c("MantonBM1_HiSeq_1","MantonBM2_HiSeq_1"),
cb = c("MantonCB1_HiSeq_1","MantonCB2_HiSeq_1")
)
We can then run differential expression between cells in these sample groups:
de.info <- getPerCellTypeDE(con, groups=as.factor(new.annot), sample.groups = samplegroups, ref.level='bm', n.cores=4)
…and examine the output
str(de.info[1:3], 2)
## List of 3
## $ B cells :List of 3
## ..$ res :'data.frame': 33694 obs. of 6 variables:
## ..$ cm :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## ..$ sample.groups:List of 2
## $ DC : logi NA
## $ dying cells:List of 3
## ..$ res :'data.frame': 33694 obs. of 6 variables:
## ..$ cm :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## ..$ sample.groups:List of 2
Let’s look at the results for the B cells
res <- de.info[['B cells']]$res
head(res[order(res$padj,decreasing = FALSE),])
## baseMean log2FoldChange lfcSE stat pvalue
## IGLL1 150.9805 -5.521811 0.7013839 -7.872737 3.469662e-15
## JCHAIN 342.3496 -3.328696 0.4428750 -7.516106 5.643192e-14
## IGHG1 558.5807 -10.827820 1.4442727 -7.497075 6.525766e-14
## IGLC2 3351.0226 -3.506702 0.5181883 -6.767236 1.312659e-11
## RP11-386I14.4 362.2495 2.771804 0.4179236 6.632323 3.304455e-11
## IGKC 2621.3799 -2.514947 0.3824759 -6.575439 4.850972e-11
## padj
## IGLL1 6.064968e-11
## JCHAIN 3.802346e-10
## IGHG1 3.802346e-10
## IGLC2 5.736319e-08
## RP11-386I14.4 1.155238e-07
## IGKC 1.413250e-07
As can be seen from the sample distribution plot, different samples, in
particular those reprsenting different tissues (BM or CB) form separate
subclusters within the clusters of major cell types. Conos allows to
force better alignment through i) adjustment of the alignment.strength parameter
, and ii) through rebalancing of edge weights based on a
specific factor (e.g. tissue to which the cell belongs) using
balance.edge.weights
parameter.
con$buildGraph(k=15, k.self=5, alignment.strength=0.3, space='PCA', ncomps=30, n.odgenes=2000, matching.method='mNN', metric='angular', score.component.variance=TRUE, verbose=TRUE)
## found 6 out of 6 cached PCA space pairs ... done
## inter-sample links using mNN done
## local pairs local pairs done
## building graph ..done
re-generate the embedding and visualize the sample distribution again:
con$embedGraph()
## Estimating embeddings.
con$plotGraph(color.by='sample', mark.groups=F, alpha=0.1, show.legend=T)
We can also check the entropy:
con$findCommunities()
plotClusterBarplots(con, legend.height = 0.1)