Beyond clonal data: spatial clustering of genes#

As we show in the manuscript, clone2vec is in general a model that provides a robust distance measure between discrete distributions. On top of that, there is no requirement for all elements to exist in the same space. In other word, we can use neighbourhood matrices from completely different experiments if they consists of the same classes of elements.

To illustrate the ability of clone2vec to be used in other settings, we will build a clustering of genes depending on their spatial distribution in single cells using SeqFISH+ data. It’s important to keep in mind that we have to assume that every single cell is from the same cell type (which is true for the data we work with), otherwise there is a chance that we can see clusters of genes that are associated with one or another cell type.

[1]:
import scanpy as sc
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.sparse as sp
import matplotlib.pyplot as plt
import clone2vec as c2v

sc.set_figure_params(dpi=80)
sc.settings.verbosity = 3
sns.set_style("ticks")

Step 1: Downloading and reading the data#

Here we will create an AnnData object where instead of cells we will keep every detected RNA molecule as rows.

[ ]:
!wget https://github.com/jadexq/ELLA/releases/download/v0.0.1/seqfish_data_dict.pkl

data = pd.read_pickle("seqfish_data_dict.pkl")
adata = sc.AnnData(
    X=sp.csc_matrix(np.zeros(len(data["expr"]))[:, None]),
    obs=pd.DataFrame({
        "gene": data["expr"].gene.values,
        "cell": data["expr"].cell.values,
    }, index=data["expr"].index.astype(str)),
    obsm={"X_spatial": data["expr"][["x", "y"]].values},
)

Step 2: Building a k-NN graphs#

Now we need to build a kNN graph between molecules in single cells. To translate it onto a “clonal” language, now cells are individual measurments of RNA molecules, and clones are gene labels of these molecules. Out goal will be to find a genes that spatially co-exist together.

[6]:
genes = c2v.pp.clones_adata(adata, obs_name="gene", fill_obs="cell", min_size=10)

# Putting gene x cell matrix in genes.X
genes.X = sp.csr_matrix(genes.layers["counts"])
genes.layers = {}

# Putting the number of
genes.var["n_genes"] = genes.X.sum(axis=0).A[0]
genes = genes[:, genes.var.n_genes > 0].copy()
creating clone-level AnnData
    selected 6125 clones (>= 10)
    finished (0:00:09): created clones AnnData with
     .X float matrix of proportions (clones × categories)
     .layers['proportions'] float matrix with fate proportions
     .layers['counts'] integer matrix with fate counts
     .obs['n_cells'] integer vector with number of cells per clone
     .obs['n_fates'] integer vector with number of fates per clone
     .var['n_clones'] integer vector with number of clones per fate
     .uns['fill_obs'] string label

To build graph for each cell independently, we have to add a split_by argument.

[5]:
c2v.tl.clonal_nn(adata, genes, split_by="cell", use_rep="X_spatial")
computing clone-to-clone adjacency graph
    split-by 'cell': 100%|██████████| 171/171 [22:29<00:00,  7.89s/it]
    finished (0:22:36): added
     .obsp['gex_adjacency'] clone adjacency graph.
     .uns['clonal_nn'] parameters.

Step 3: Building an embedding#

Now, when we have a matrix of nearest neighbors of each gene, we can create an embedding of these genes. In this case, we will use our fast Poisson model to save a computational time: Poisson model’s complexity is \(\mathcal{O}(n_\mathrm{genes})\), while classical multinomial model’s complexity is \(\mathcal{O}(n_\mathrm{molecules})\).

[76]:
c2v.tl.clone2vec_Poi(genes, tol=1e-6, init="random")
fitting clone2vec_Poi embeddings
    GLM-PCA epochs:   9%|▉         | 45/500 [00:29<04:58,  1.52it/s, Log-Likelihood=107853896.0000, Δ=2.97e-07, LR=3.12e-02]
WARNING: obsm_key 'clone2vec_Poi' already exists in clones.obsm. Overwriting.
WARNING: uns_key 'clone2vec_Poi' already exists in clones.uns. Overwriting.
    finished (0:00:29): added
     .obsm['clone2vec_Poi'] embedding matrix
     .uns['clone2vec_Poi'] training details
[8]:
c2v.pl.loss_history(genes)
WARNING: Neither clone2vec nor clone2vec[loss_history] found in clones.uns. Using clone2vec_Poi[loss_history] instead.
_images/Spatial_classification_mRNA_10_1.png
[78]:
sc.pp.neighbors(genes, use_rep="clone2vec_Poi")
computing neighbors
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:00)
[79]:
sc.tl.umap(genes)
computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm)
    'umap', UMAP parameters (adata.uns) (0:00:08)
[80]:
sc.tl.leiden(
    genes,
    resolution=0.25,
    flavor="igraph",
    n_iterations=2,
)
running Leiden clustering
    finished: found 3 clusters and added
    'leiden', the cluster labels (adata.obs, categorical) (0:00:00)
[10]:
ax = sc.pl.umap(
    genes,
    color="leiden",
    frameon=False,
    title="Groups of genes",
    show=False,
)

c2v.pl.fancy_legend(ax, center_loc=False, textsize=18, fontweight="bold")
c2v.pl.embedding_axis(ax, label="Genes")
_images/Spatial_classification_mRNA_14_0.png

Here we got three clusters of genes. Let’s check what are these genes.

[82]:
top_n = 10

print(f"Top-{top_n} genes from")
for leiden in ["0", "1", "2"]:
    top_genes = genes[genes.obs.leiden == leiden].obs.sort_values("n_cells", ascending=False).index[:top_n]
    print(f"   cluster {leiden}: {', '.join(top_genes)}")
Top-10 genes from
   cluster 0: Col1a1, Fn1, Bgn, Emp1, Lox, Fbln2, Flna, P4hb, Col6a2, Nid1
   cluster 1: Myh9, Hnrnpf, Tagln2, Ddb1, Actn1, Pcbp1, Fscn1, Kpnb1, Hnrnpl, Ppp1ca
   cluster 2: Cyb5r3, Kctd10, Dynll2, Sh3pxd2a, Ddr2, Palld, Naa50, Dync1li2, Kif1c, Zcchc24

Genes from cluster 0 look like genes that cell is usually secreting into intercellular space, therefore they’re usually processed in the EPR and contain SRP signal. Genes from cluster 1 are house-keeping genes, and their localization should not be linked to EPR (further away from the nuclei). And the genes from cluster 2 are genes that usually are being transported to protrusions.

Let’s validate it visually.

Step 4: Plotting of groups of genes#

[11]:
genes_mapping = dict(genes.obs.leiden)
adata.obs["group"] = [genes_mapping[i] if i in genes_mapping else "NA" for i in adata.obs.gene]
[12]:
cell = "5-0"

c2v.pl.group_kde(
    adata[adata.obs.cell == cell],
    groupby="group",
    groups=["0", "1", "2"],
    basis="X_spatial",
    bw_method=0.2,
)
/home/sergey/miniconda3/envs/sl/lib/python3.10/site-packages/anndata/_core/anndata.py:1138: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
  df[key] = c
_images/Spatial_classification_mRNA_20_1.png

Here we see that our initial guess was correct, and localization of these groups of genes matches our prediction. It’s also consistent with the results that authors obtained in the original paper.

Step 5: Checking SRP#

Additionally, we can check an SRP signal distribution across clusters based on the prediction of SRP via DeepSig.

Step 5.1: Downloading and filtering proteome#

Here we will download fasta-file with mouse proteome and keep only proteins from the analysis.

mkdir deepsig
cd deepsig
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M38/gencode.vM38.pc_translations.fa.gz
gunzip gencode.vM38.pc_translations.fa.gz
mv gencode.vM38.pc_translations.fa mm_proteins.fa
[31]:
# Helpers to work with fasta

def fasta_to_dict(path):
    seqs = {}
    current_id = None
    current_seq = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if current_id is not None:
                    seqs[current_id] = "".join(current_seq)
                current_id = line[1:]
                current_seq = []
            else:
                current_seq.append(line)
        if current_id is not None:
            seqs[current_id] = "".join(current_seq)
    return seqs

def dict_to_fasta(seq_dict, path, line_width=80):
    with open(path, "w") as f:
        for seq_id, seq in seq_dict.items():
            f.write(f">{seq_id}\n")
            for i in range(0, len(seq), line_width):
                f.write(seq[i:(i + line_width)] + "\n")
[29]:
fasta = fasta_to_dict("deepsig/mm_proteins.fa")

filtered_fasta = {}
for record in fasta:
    gene_name = record.split("|")[-2]
    if gene_name in genes.obs_names:
        if gene_name not in filtered_fasta:
            filtered_fasta[gene_name] = fasta[record]
        elif len(filtered_fasta[gene_name]) < len(fasta[record]):
            filtered_fasta[gene_name] = fasta[record]

dict_to_fasta(filtered_fasta, "deepsig/mm_filtered.fa")

Strp 5.2: DeepSig#

docker pull bolognabiocomp/deepsig
docker run -v $(pwd):/data/ bolognabiocomp/deepsig -f mm_filtered.fa -o mm.out -k euk
[65]:
deepsig = pd.read_csv(
    "deepsig/mm.out",
    sep="\t",
    names=["gene", "0", "prediction", "1", "2", "3", "4", "5", "6"],
    usecols=["gene", "prediction"],
)

srp_genes = set(deepsig[deepsig.prediction == "Signal peptide"].gene.values)
all_genes = set(deepsig.gene.values)

genes.obs["srp"] = [
    "True" if i in srp_genes
    else "False" if i in all_genes
    else np.nan for i in genes.obs_names
]

Plotting these genes on UMAP.

[16]:
ax = sc.pl.umap(
    genes,
    color="srp",
    groups=["True"],
    legend_loc=None,
    frameon=False,
    title="Genes with predicted SRP",
    show=False,
    s=40,
    palette={
        "True": sns.color_palette()[2],
        "False": sns.color_palette()[3],
    },
)

c2v.pl.embedding_axis(ax, label="Genes")
_images/Spatial_classification_mRNA_30_0.png
[13]:
composition = genes.obs[genes.obs.srp != np.nan].groupby(
    ["leiden", "srp"], observed=True
).size().unstack()

fig, ax = plt.subplots(figsize=(3, 4))
(composition.T / composition.sum(axis=1)).T.plot(
    kind="bar",
    stacked=True,
    ax=ax,
    edgecolor="black",
    width=0.9,
    color={
        "True": sns.color_palette()[2],
        "False": sns.color_palette()[3],
    },
)
ax.grid(alpha=0.3)
ax.set_xlabel("Cluster of genes")
ax.set_ylabel("Proportion of the cluster")
ax.set_title("SRP in genes")
ax.legend(loc=(1.03, 0.72), edgecolor="white", title="SRP")

plt.show()
_images/Spatial_classification_mRNA_31_0.png

We see a significant enrichment of genes with predicted SRP in the cluster 0, which further validates our guess.

Saving the object#

[14]:
genes.write_h5ad("seqfish_embedding.h5ad")