10X单细胞空间联合分析之再次解读cell2location

到了年根了，其实想总结一些之前的东西，其中cell2location很想再次分享一下，好好研读一下。

Cell2location 是一种有principled Bayesian model，可以解析空间转录组数据中的 fine-grained细胞类型，并创建不同组织的综合细胞图。

Given cell type annotation for each cell, the corresponding reference cell type signatures (g_{f,g}), which represent the average mRNA count of each gene (g) in each cell type (f), can be estimated from sc/snRNA-seq data using 2 provided methods (see below). Cell2location needs untransformed unnormalised spatial mRNA counts as input. You also need to provide cell2location with the expected average cell abundance per location which is used as a prior to guide estimation of absolute cell abundance. This value depends on the tissue and can be estimated by counting nuclei for a few locations in the paired histology image but can be approximate (see paper methods for more guidance).

provide 2 methods for estimating reference cell type signatures from scRNA-seq data:

一种基于负二项式回归的统计方法。通常建议使用 NB 回归，它允许跨技术和批次稳健地组合数据，从而提高空间映射精度。
单个基因的每个cluster平均 mRNA 计数的硬编码计算 (cell2location.cluster_averages.compute_cluster_averages)。当批次效应较小时，这种更快的硬编码计算每个集群平均值的方法提供了类似的高准确度。

代码部分也很值得多多学习

Loading packages

import sys

#if branch is stable, will install via pypi, else will install from source
branch = "github"
user = "BayraktarLab"
IN_COLAB = "google.colab" in sys.modules

if IN_COLAB and branch == "stable":
    !pip install --quiet cell2location
elif IN_COLAB and branch != "stable":
    !pip install --quiet --upgrade jsonschema
    !pip install --quiet git+https://github.com/BayraktarLab/cell2location#egg=cell2location[tutorials]

import scanpy as sc
import anndata
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

import cell2location
import scvi

from matplotlib import rcParams
rcParams['pdf.fonttype'] = 42 # enables correct plotting of text
import seaborn as sns

First, let’s define where we save the results of our analysis:

results_folder = './results/lymph_nodes_analysis/'

# create paths and names to results folders for reference regression and cell2location models
ref_run_name = f'{results_folder}/reference_signatures'
run_name = f'{results_folder}/cell2location_map'

这里的用法我很喜欢， f'{results_folder}/reference_signatures'

Loading Visium and scRNA-seq reference data

首先从 10X Space Ranger 输出中读取空间 Visium 数据。可以使用 scanpy 方便地下载和导入此数据集。

adata_vis = sc.datasets.visium_sge(sample_id="V1_Human_Lymph_Node")
adata_vis.obs['sample'] = list(adata_vis.uns['spatial'].keys())[0]

# rename genes to ENSEMBL
adata_vis.var['SYMBOL'] = adata_vis.var_names
adata_vis.var_names = adata_vis.var['gene_ids']
adata_vis.var_names.name = None

这里对scanpy读入的对象的信息添加很值得学习。

注意！线粒体编码的基因（基因名称以前缀 mt- 或 MT- 开头）与空间映射无关，因为它们的表达代表了单细胞和细胞核数据中的技术产物，而不是线粒体的生物学丰度。然而，这些基因在每个位置构成了 15-40% 的 mRNA。因此，为了避免映射伪影，我们强烈建议去除线粒体基因。

# find mitochondria-encoded (MT) genes
adata_vis.var['MT_gene'] = [gene.startswith('MT-') for gene in adata_vis.var['SYMBOL']]

# remove MT genes for spatial mapping (keeping their counts in the object)
adata_vis.obsm['MT'] = adata_vis[:, adata_vis.var['MT_gene'].values].X.toarray()
adata_vis = adata_vis[:, ~adata_vis.var['MT_gene'].values]

在单细胞参考中包含了跨越淋巴结、脾脏和扁桃体的 scRNA-seq 数据集，以确保我们捕获了空间转录组数据集中可能存在的免疫细胞状态的全部多样性。

# Download data if not already here
import os
if not os.path.exists('./data/sc.h5ad'):
    !cd ./data/ && wget https://cell2location.cog.sanger.ac.uk/paper/integrated_lymphoid_organ_scrna/RegressionNBV4Torch_57covariates_73260cells_10237genes/sc.h5ad

# Read data
adata_ref = sc.read(f'./data/sc.h5ad')

# Use ENSEMBL as gene IDs to make sure IDs are unique and correctly matched
adata_ref.var['SYMBOL'] = adata_ref.var.index
adata_ref.var.index = adata_ref.var['GeneID-2'].copy()
adata_ref.var_names = adata_ref.var['GeneID-2'].copy()
adata_ref.var.index.name = None
adata_ref.raw.var['SYMBOL'] = adata_ref.raw.var.index
adata_ref.raw.var.index = adata_ref.raw.var['GeneID-2'].copy()
adata_ref.raw.var.index.name = None
# before we estimate the reference cell type signature we recommend to perform very permissive genes selection
# in this 2D histogram orange rectangle lays over excluded genes.
# In this case, the downloaded dataset was already filtered using this method,
# hence no density under the orange rectangle
from cell2location.utils.filtering import filter_genes
selected = filter_genes(adata_ref, cell_count_cutoff=5, cell_percentage_cutoff2=0.03, nonz_mean_cutoff=1.12)

# filter the object
adata_ref = adata_ref[:, selected].copy()

图片.png

Estimation of reference cell type signatures (NB regression)

The signatures are estimated from scRNA-seq data, accounting for batch effect, using a Negative binomial regression model.(这里需要考虑批次效应)

# prepare anndata for the regression model
scvi.data.setup_anndata(adata=adata_ref,
                        # 10X reaction / sample / batch
                        batch_key='Sample',
                        # cell type, covariate used for constructing signatures
                        labels_key='Subset',
                        # multiplicative technical effects (platform, 3' vs 5', donor effect)
                        categorical_covariate_keys=['Method']
                       )
scvi.data.view_anndata_setup(adata_ref)

图片.png

# create and train the regression model
from cell2location.models import RegressionModel
mod = RegressionModel(adata_ref)

# Use all data for training (validation not implemented yet, train_size=1)
mod.train(max_epochs=250, batch_size=2500, train_size=1, lr=0.002, use_gpu=True)

# plot ELBO loss history during training, removing first 20 epochs from the plot
mod.plot_history(20)

图片.png

# In this section, we export the estimated cell abundance (summary of the posterior distribution).
adata_ref = mod.export_posterior(
    adata_ref, sample_kwargs={'num_samples': 1000, 'batch_size': 2500, 'use_gpu': True}
)

# Save model
mod.save(f"{ref_run_name}", overwrite=True)

# Save anndata object with results
adata_file = f"{ref_run_name}/sc.h5ad"
adata_ref.write(adata_file)

Examine QC plots

1、重建准确性以评估推理是否存在任何问题。
2、由于批次效应，估计的表达特征不同于每个cluster中的平均表达。对于不受批效应影响的 scRNA-seq 数据集（该数据集有），可以使用聚类平均表达而不是用模型估计特征。当此图与对角线图非常不同时（例如，Y 轴上的值非常低，到处都是密度），则表明特征估计存在问题。

mod.plot_QC()

图片.png

The model and output h5ad can be loaded later like this:

mod = cell2location.models.RegressionModel.load(f"{ref_run_name}", adata_ref)
adata_file = f"{ref_run_name}/sc.h5ad"
adata_ref = sc.read_h5ad(adata_file)

# export estimated expression in each cluster
if 'means_per_cluster_mu_fg' in adata_ref.varm.keys():
    inf_aver = adata_ref.varm['means_per_cluster_mu_fg'][[f'means_per_cluster_mu_fg_{i}'
                                    for i in adata_ref.uns['mod']['factor_names']]].copy()
else:
    inf_aver = adata_ref.var[[f'means_per_cluster_mu_fg_{i}'
                                    for i in adata_ref.uns['mod']['factor_names']]].copy()
inf_aver.columns = adata_ref.uns['mod']['factor_names']
inf_aver.iloc[0:5, 0:5]

图片.png

Cell2location: spatial mapping

# find shared genes and subset both anndata and reference signatures
intersect = np.intersect1d(adata_vis.var_names, inf_aver.index)
adata_vis = adata_vis[:, intersect].copy()
inf_aver = inf_aver.loc[intersect, :].copy()

# prepare anndata for cell2location model
scvi.data.setup_anndata(adata=adata_vis, batch_key="sample")
scvi.data.view_anndata_setup(adata_vis)

图片.png

Note! While you can often use the default value of detection_alpha hyperparameter, it is useful to adapt the expected cell abundance N_cells_per_location to every tissue. This value can be estimated from paired histology images and as described in the note above. Change the value presented in this tutorial (N_cells_per_location=30) to the value observed in your your tissue.

# create and train the model
mod = cell2location.models.Cell2location(
    adata_vis, cell_state_df=inf_aver,
    # the expected average cell abundance: tissue-dependent
    # hyper-prior which can be estimated from paired histology:
    N_cells_per_location=30,
    # hyperparameter controlling normalisation of
    # within-experiment variation in RNA detection (using default here):
    detection_alpha=200
)

mod.train(max_epochs=30000,
          # train using full data (batch_size=None)
          batch_size=None,
          # use all data points in training because
          # we need to estimate cell abundance at all locations
          train_size=1,
          use_gpu=True)

# plot ELBO loss history during training, removing first 100 epochs from the plot
mod.plot_history(1000)
plt.legend(labels=['full data training']);

图片.png

# In this section, we export the estimated cell abundance (summary of the posterior distribution).
adata_vis = mod.export_posterior(
    adata_vis, sample_kwargs={'num_samples': 1000, 'batch_size': mod.adata.n_obs, 'use_gpu': True}
)

# Save model
mod.save(f"{run_name}", overwrite=True)

# mod = cell2location.models.Cell2location.load(f"{run_name}", adata_vis)

# Save anndata object with results
adata_file = f"{run_name}/sp.h5ad"
adata_vis.write(adata_file)

Visualising cell abundance in spatial coordinates

# add 5% quantile, representing confident cell abundance, 'at least this amount is present',
# to adata.obs with nice names for plotting
adata_vis.obs[adata_vis.uns['mod']['factor_names']] = adata_vis.obsm['q05_cell_abundance_w_sf']

# select one slide
from cell2location.utils import select_slide
slide = select_slide(adata_vis, 'V1_Human_Lymph_Node')

# plot in spatial coordinates
with mpl.rc_context({'axes.facecolor':  'black',
                     'figure.figsize': [4.5, 5]}):

    sc.pl.spatial(slide, cmap='magma',
                  # show first 8 cell types
                  color=['B_Cycling', 'B_GC_LZ', 'T_CD4+_TfH_GC', 'FDC',
                         'B_naive', 'T_CD4+_naive', 'B_plasma', 'Endo'],
                  ncols=4, size=1.3,
                  img_key='hires',
                  # limit color scale at 99.2% quantile of cell abundance
                  vmin=0, vmax='p99.2'
                 )

图片.png

# Now we use cell2location plotter that allows showing multiple cell types in one panel
from cell2location.plt import plot_spatial

# select up to 6 clusters
clust_labels = ['T_CD4+_naive', 'B_naive', 'FDC']
clust_col = ['' + str(i) for i in clust_labels] # in case column names differ from labels

slide = select_slide(adata_vis, 'V1_Human_Lymph_Node')

with mpl.rc_context({'figure.figsize': (15, 15)}):
    fig = plot_spatial(
        adata=slide,
        # labels to show on a plot
        color=clust_col, labels=clust_labels,
        show_img=True,
        # 'fast' (white background) or 'dark_background'
        style='fast',
        # limit color scale at 99.2% quantile of cell abundance
        max_color_quantile=0.992,
        # size of locations (adjust depending on figure size)
        circle_diameter=6,
        colorbar_position='right'
    )

图片.png

Downstream analysis

Identifying discrete tissue regions by Leiden clustering

通过使用由 cell2location 估计的细胞丰度对位置进行聚类来识别细胞组成不同的组织区域。

我们通过使用每种细胞类型的估计细胞丰度对 Visium 点进行聚类来找到组织区域。我们构建了一个 K-nearest neigbour (KNN) 图，表示估计细胞丰度中位置的相似性，然后应用 Leiden 聚类。 KNN 邻居的数量应适应数据集的大小和解剖学定义区域的大小（即海马区域相当小，因此可能被大型 n_neighbors 掩盖）。这可以针对范围 KNN 邻居和 Leiden 聚类分辨率完成，直到获得与组织解剖结构匹配的聚类。

聚类是在所有 Visium 部分/批次中联合完成的，因此区域身份是直接可比的。当多个批次之间存在很强的技术影响时（这里不是这种情况），原则上可以使用 sc.external.pp.bbknn 来解释 KNN 构建过程中的这些影响。

The resulting clusters are saved in adata_vis.obs['region_cluster'].

# compute KNN using the cell2location output stored in adata.obsm
sc.pp.neighbors(adata_vis, use_rep='q05_cell_abundance_w_sf',
                n_neighbors = 15)

# Cluster spots into regions using scanpy
sc.tl.leiden(adata_vis, resolution=1.1)

# add region as categorical variable
adata_vis.obs["region_cluster"] = adata_vis.obs["leiden"].astype("category")

# compute UMAP using KNN graph based on the cell2location output
sc.tl.umap(adata_vis, min_dist = 0.3, spread = 1)

# show regions in UMAP coordinates
with mpl.rc_context({'axes.facecolor':  'white',
                     'figure.figsize': [8, 8]}):
    sc.pl.umap(adata_vis, color=['region_cluster'], size=30,
               color_map = 'RdPu', ncols = 2, legend_loc='on data',
               legend_fontsize=20)
    sc.pl.umap(adata_vis, color=['sample'], size=30,
               color_map = 'RdPu', ncols = 2,
               legend_fontsize=20)

# plot in spatial coordinates
with mpl.rc_context({'axes.facecolor':  'black',
                     'figure.figsize': [4.5, 5]}):
    sc.pl.spatial(adata_vis, color=['region_cluster'],
                  size=1.3, img_key='hires', alpha=0.5)

图片.png

Identifying cellular compartments / tissue zones using matrix factorisation (NMF)(这部分应该是新的内容)

在这里，我们使用 cell2location 映射结果来识别细胞类型的空间共现，以便更好地了解组织组织并预测细胞相互作用。我们对来自 cell2location 的细胞类型丰度估计进行了非负矩阵分解（NMF）。与将 NMF 应用于传统 scRNA-seq 的既定好处类似，附加 NMF 分解产生了一组空间细胞类型丰度曲线，将其分组为捕获共定位细胞类型的组件。这种基于 NMF 的分解自然地解释了这样一个事实，即多种细胞类型和微环境可以在相同的 Visium 位置共存，同时跨组织区域（例如单个生发中心）共享信息。

提示在实践中，最好针对一系列因子 (R={5, .., 30}) 训练 NMF，并选择 (R) 作为捕获精细组织区域和拆分已知区室之间的平衡。如果您想找到几个最明显的细胞隔室，请使用少量因子。如果您想找到非常强的协同定位信号并假设大多数细胞类型不协同定位，请使用很多因子（> 30 - 此处使用）。

Below we show how to perform this analysis. To aid this analysis, we wrapped the analysis shown the notebook on advanced downstream analysis into a pipeline that automates training of the NMF model with varying number of factors:

from cell2location import run_colocation
res_dict, adata_vis = run_colocation(
    adata_vis,
    model_name='CoLocatedGroupsSklearnNMF',
    train_args={
      'n_fact': np.arange(11, 13), # IMPORTANT: use a wider range of the number of factors (5-30)
      'sample_name_col': 'sample', # columns in adata_vis.obs that identifies sample
      'n_restarts': 3 # number of training restarts
    },
    export_args={'path': f'{run_name}/CoLocatedComb/'}
)

For every factor number, the model produces the following list of folder outputs:

cell_type_fractions_heatmap/: a dot plot of the estimated NMF weights of cell types (rows) across NMF components (columns)
cell_type_fractions_mean/: the data used for dot plot
factor_markers/: tables listing top 10 cell types most speficic to each NMF factor
models/: saved NMF models
predictive_accuracy/: 2D histogram plot showing how well NMF explains cell2location output
spatial/: NMF weights across locatinos in spatial coordinates
location_factors_mean/: the data used for the plot in spatial coordiantes
stability_plots/: stability of NMF weights between training restarts

检查的关键输出是 cell_type_fractions_heatmap/ 中的文件，它们显示了与细胞隔室相对应的 NMF 组件（列）中细胞类型（行）的估计 NMF 权重的点图。显示的是相对权重，对每种细胞类型的组件进行了标准化。

# Here we plot the NMF weights (Same as saved to `cell_type_fractions_heatmap`)
res_dict['n_fact12']['mod'].plot_cell_type_loadings()

图片.png

Advanced use

Estimate cell-type specific expression of every gene in the spatial data

For this, we adapt the approach of estimating conditional expected expression proposed by RCTD (Cable et al) method. With cell2location, we can look at the posterior distribution rather than just point estimates of cell type specific expression (see mod.samples.keys() and next section on using full distribution).

# Compute expected expression per cell type
expected_dict = mod.module.model.compute_expected_per_cell_type(
    mod.samples["post_sample_q05"], mod.adata
)

# Add to anndata layers
for i, n in enumerate(mod.factor_names_):
    adata_vis.layers[n] = expected_dict['mu'][i]

# Save anndata object with results
adata_file = f"{run_name}/sp.h5ad"
adata_vis.write(adata_file)
adata_file

# Look at cell type specific expression in spatial coordinates,
# Here we highlight CD3D, pan T-cell marker expressed by
# 2 subtypes of T cells in distinct locations but not expressed by co-located B cells
ctypes = ['T_CD4+_TfH_GC', 'T_CD4+_naive', 'B_GC_LZ']
genes = ['CD3D', 'CR2']

with mpl.rc_context({'axes.facecolor':  'black'}):
    # select one slide
    slide = select_slide(adata_vis, 'V1_Human_Lymph_Node')

    from tutorial_utils import plot_genes_per_cell_type
    plot_genes_per_cell_type(slide, genes, ctypes);

图片.png

就是想回顾一下，当然了，软件也更新了很多内容

生活很好，有你更好

最后编辑于：2021.12.30 13:39:35

禁止转载，如需转载请通过简信或评论联系作者。

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 158,233评论 4赞 360
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 67,013评论 1赞 291
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 108,030评论 0赞 241
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 43,827评论 0赞 204
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 52,221评论 3赞 286
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 40,542评论 1赞 216
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 31,814评论 2赞 312
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 30,513评论 0赞 198
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 34,225评论 1赞 241
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 30,497评论 2赞 244
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 31,998评论 1赞 258
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 28,342评论 2赞 253
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 32,986评论 3赞 235
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 26,055评论 0赞 8
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 26,812评论 0赞 194
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 35,560评论 2赞 271
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 35,461评论 2赞 266

10X单细胞空间联合分析之再次解读cell2location

到了年根了，其实想总结一些之前的东西，其中cell2location很想再次分享一下，好好研读一下。

provide 2 methods for estimating reference cell type signatures from scRNA-seq data:

代码部分也很值得多多学习

Loading packages

First, let’s define where we save the results of our analysis:

Loading Visium and scRNA-seq reference data

在单细胞参考中包含了跨越淋巴结、脾脏和扁桃体的 scRNA-seq 数据集，以确保我们捕获了空间转录组数据集中可能存在的免疫细胞状态的全部多样性。

Estimation of reference cell type signatures (NB regression)

The signatures are estimated from scRNA-seq data, accounting for batch effect, using a Negative binomial regression model.(这里需要考虑批次效应)

Examine QC plots

The model and output h5ad can be loaded later like this:

Cell2location: spatial mapping

Visualising cell abundance in spatial coordinates

Downstream analysis

Identifying discrete tissue regions by Leiden clustering

The resulting clusters are saved in adata_vis.obs['region_cluster'].

Identifying cellular compartments / tissue zones using matrix factorisation (NMF)(这部分应该是新的内容)

Below we show how to perform this analysis. To aid this analysis, we wrapped the analysis shown the notebook on advanced downstream analysis into a pipeline that automates training of the NMF model with varying number of factors:

For every factor number, the model produces the following list of folder outputs:

Advanced use

Estimate cell-type specific expression of every gene in the spatial data

就是想回顾一下，当然了，软件也更新了很多内容

推荐阅读更多精彩内容