Graph input for clonal embeddings construction#

Some methods for batch correction or multimodal data representations don’t give you a latent embedding that you can use to find nearest clonally labelled cells. For example, the result of multimodal integration with Weighted Nearest Neighbors algorithm is a graph, as well as the result of batch correction with batch-balanced kNN. It can be unclear how to build a matrix of clonal nearest neighbors in this case.

[1]:

import scanpy as sc
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import clone2vec as c2v
import scanpy.external as sce

sc.set_figure_params(dpi=80)
sc.settings.verbosity = 3
sns.set_style("ticks")

In this example, we’re going to use a NSCLC cohort from Liu et al. In the original paper, authors used batch-balanced kNN to deal with the batch effect (in the clone2vec manuscript, we re-process the dataset with Harmony), so we can try to obtain similar to the original results and use it to build a clonal embedding.

[2]:

adata = c2v.datasets.Liu_NSCLC_CD8(embedding_type="gex")

using gene expression embedding

[3]:

sce.pp.bbknn(adata, use_rep="X_pca", n_pcs=30, batch_key="sample")

computing batch balanced neighbors
WARNING: consider updating your call to make use of `computation`
        finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:52)

[4]:

sc.tl.umap(adata, min_dist=0.3)

computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm)
    'umap', UMAP parameters (adata.uns) (0:05:07)

[5]:

ax = sc.pl.umap(adata, color="cluster", frameon=False,
                title="Authors' cell types", show=False)

c2v.pl.embedding_axis(ax, label="Cells")

And now we will use the key trick – we’re going to create a vector representation of the graph input using Leplacian eigenmaps.

[6]:

c2v.utils.laplacian_eigenmaps(adata)

computing Laplacian eigenmaps:
    using .obsp['connectivities'], n_components=20
    finished (0:00:16): added
     .obsm['X_laplacian'] Laplacian eigenmaps (20 components)
     .uns['laplacian_eigenmaps'] parameters

To create a clonal embedding, we will use a column 'full_clone' from adata.obs – it contains also a site where clone was found, and a timepoint, so we will treat each timepoint:site:clone combination individually.

[7]:

clones = c2v.pp.clones_adata(adata, obs_name="full_clone", min_size=3)

creating clone-level AnnData
    selected 1591 clones (>= 3)
    finished (0:00:01): created clones AnnData with
     .X float matrix of proportions (clones × categories)
     .layers['proportions'] float matrix with fate proportions
     .layers['counts'] integer matrix with fate counts
     .obs['n_cells'] integer vector with number of cells per clone
     .obs['n_fates'] integer vector with number of fates per clone
     .var['n_clones'] integer vector with number of clones per fate
     .uns['fill_obs'] string label

[8]:

c2v.tl.clonal_nn(adata, clones, use_rep="X_laplacian")

computing clone-to-clone adjacency graph
    finished (0:00:38): added
     .obsp['gex_adjacency'] clone adjacency graph.
     .uns['clonal_nn'] parameters.

[9]:

c2v.tl.clone2vec(clones)

fitting clone2vec embeddings
    SG epochs:  13%|█▎        | 63/500 [06:03<42:00,  5.77s/it, loss=5.0829, Δ=4.31e-05]
    early stopping at epoch 64
    finished (0:06:05): added
     .obsm['clone2vec'] embedding matrix
     .uns['clone2vec'] training details

[10]:

sc.pp.neighbors(clones, use_rep="clone2vec")

computing neighbors
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:01)

[11]:

sc.tl.umap(clones)

computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm)
    'umap', UMAP parameters (adata.uns) (0:00:02)

Now let’s take a look at each clone’s identity.

[12]:

c2v.pp.transfer_annotation(adata, clones, annotation_obs_adata=["patient", "biospy_site", "treatment_hx"])

transferring annotations between cell and clone objects
    finished (0:00:00): added to clonal AnnData
     .obs['gex_patient'] categorical labels
     .obs['gex_biospy_site'] categorical labels
     .obs['gex_treatment_hx'] categorical labels

[14]:

ax = sc.pl.umap(clones, color=["gex_patient"], frameon=False,
                title="Patient", show=False)

c2v.pl.embedding_axis(ax, label="Clones")

We see some minor propagation of the batch effect from the patient 19. We will take a closer look to it in the next notebook, for now let’s leave it and find clonal clusters.

[15]:

sc.tl.leiden(clones, resolution=0.2, flavor="igraph", n_iterations=2)

running Leiden clustering
    finished: found 3 clusters and added
    'leiden', the cluster labels (adata.obs, categorical) (0:00:00)

[16]:

ax = sc.pl.umap(clones, color="leiden", title="Clonal clusters",
                frameon=False, show=False)

c2v.pl.fancy_legend(ax, textsize=18, center_loc=True, fontweight="bold")

We see three clonal clusters here – let’s take a look at their projections on the gene expression embedding.

[17]:

c2v.pp.transfer_annotation(adata, clones, annotation_obs_clones="leiden")

transferring annotations between cell and clone objects
    finished (0:00:00): added to clonal AnnData
     .obs['c2v_leiden'] categorical labels

[18]:

c2v.pl.group_kde(adata, groupby="c2v_leiden", groups=["0", "1", "2"], bw_method=0.2)

[19]:

c2v.pl.group_scatter(adata, groupby="c2v_leiden", groups=["0", "1", "2"], s=5)

Finally, we can check if markers discovered in the paper are consistend with the clonal clusters constructed based on the bbknn-based gene expression embedding.

[20]:

clones = c2v.pp.transfer_expression(adata, clones)

summarizing expression at clone level:
    using `adata.X` as expression source
    finished (0:00:07): created clone-level AnnData with
     .X aggregated expression matrix (clones × genes)
     .uns['transfer_expression'] dict with 'strategy'
     .uns['mask_key'] string with obsm key for boolean mask of clones with non-zero cells
     .obsm['proportions'] dataframe of features
     .obsm['counts'] dataframe of features

[21]:

axes = sc.pl.umap(clones, color=["CXCL13", "IL7R", "GZMK"], frameon=False,
                  show=False, cmap=c2v.pl.Reds)
c2v.pl.small_cbar(axes)
c2v.pl.embedding_axis(axes[0], label="Clones")

[23]:

axes = c2v.pl.scaled_dotplot(
    clones,
    groupby="leiden",
    var_names=["CXCL13", "GZMK", "IL7R"],
    vmax=1,
    vmin=-1,
    size_title="Fraction of clones\nin group (%)",
)

/home/sergey/miniconda3/envs/sl/lib/python3.10/functools.py:889: UserWarning: zero-centering a sparse array/matrix densifies it.
  return dispatch(args[0].__class__)(*args, **kw)

[24]:

clones.write_h5ad("Liu_CD8_bbknn_c2v.h5ad")