tascCODA - Tree-aggregated compositional analysis of high-throughput sequencing data¶
This notebook is a tutorial on how to use tascCODA Ostner et al., 2021 for tree-aggregated compositional analysis of high-throughput sequencing (HTS) data.
For this example, we use single-cell RNA sequencing data. However, there are no limitations to use tascCODA with other HTS data, such as 16S rRNA sequencing.
The particular dataset for this analysis was generated by Smillie et al, 2019. It contains samples from two different regions in the small intestine of mice - Epithelium and Lamina Propria - and three different inflammation conditions - healthy, non-inflamed and inflamed. In total, we have 365.492 cells from 51 cell types in 133 samples.
This tutorial is designed to be executed on a standard computer (any operating system) in a Python environment with tascCODA, Jupyter notebook and all their dependencies installed. Running the tutorial takes about 5 minutes on a 2020 Apple MacBook Pro (16GB RAM).
[3]:
# Setup
import numpy as np
import pandas as pd
import re
import toytree as tt
from tasccoda import tree_utils as util
from tasccoda import tree_ana as ana
from tasccoda import tree_results as tr
from tasccoda import tree_agg_model_sslasso as mod
import tasccoda.datasets as dat
import anndata as ad
from sccoda.util import data_visualization as viz
import toyplot
import toyplot.locator
import statsmodels as sm
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.patches as mpatches
import toyplot.svg
import os
import importlib
[4]:
import tasccoda
tasccoda.__version__
[4]:
'0.1.dev12+g8baba74.d20220628'
Data setup¶
First, we read in the per-cell data and convert it to a count table
[5]:
# read in per-cell data
smillie_data = dat.smillie()
smillie_data["Cluster"] = [str.replace(x, " ", "") for x in smillie_data["Cluster"]]
smillie_data
[5]:
| NAME | Subject | Sample | Location | Replicate | Health | Cluster | nGene | nUMI | Major_l1 | Major_l2 | Major_l3 | Major_l4 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | N10.EpiA.AAACATACAACCAC | N10 | EpiA | Epi | A | Healthy | EnterocyteProgenitors | 425 | 968 | Epithelial | Epithelial | Absorptive | Immature cells |
| 1 | N10.EpiA.AAACATACAGGCGA | N10 | EpiA | Epi | A | Healthy | CyclingTA | 1695 | 7273 | Epithelial | Epithelial | Cycling TA3 | Cycling TA4 |
| 2 | N10.EpiA.AAACATACCACTAG | N10 | EpiA | Epi | A | Healthy | ImmatureGoblet | 391 | 1190 | Epithelial | Epithelial | Secretory | Progenitor cells |
| 3 | N10.EpiA.AAACATACCCTTTA | N10 | EpiA | Epi | A | Healthy | SecretoryTA | 1327 | 5620 | Epithelial | Epithelial | Secretory | Progenitor cells |
| 4 | N10.EpiA.AAACATACTGCAAC | N10 | EpiA | Epi | A | Healthy | ImmatureEnterocytes2 | 1383 | 4676 | Epithelial | Epithelial | Absorptive | Immature cells |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 365487 | N9.LPB.TTTATCCTAACGAA | N9 | LPB | LP | B | Inflamed | Enterocytes | 2768 | 18811 | Epithelial | Epithelial | Absorptive | Absorptive Mature cells |
| 365488 | N9.LPB.TTTATCCTGTAAAG | N9 | LPB | LP | B | Inflamed | Plasma | 1392 | 27685 | Immune | Lymphoid | B cells | Plasma4 |
| 365489 | N9.LPB.TTTATCCTGTCGTA | N9 | LPB | LP | B | Inflamed | Plasma | 574 | 5478 | Immune | Lymphoid | B cells | Plasma4 |
| 365490 | N9.LPB.TTTCAGTGGCGTTA | N9 | LPB | LP | B | Inflamed | Macrophages | 1437 | 5698 | Immune | Myeloid | Monocytes | Macrophages4 |
| 365491 | N9.LPB.TTTGCATGGATACC | N9 | LPB | LP | B | Inflamed | CD8+LP | 544 | 1348 | Immune | Lymphoid | T cells | CD8+ T |
365492 rows × 13 columns
[6]:
# Convert to count table
counts = pd.crosstab(
[smillie_data["Sample"], smillie_data["Subject"], smillie_data["Location"], smillie_data["Health"], smillie_data["Replicate"]], smillie_data["Cluster"])
counts
obs = pd.DataFrame(index=counts.index).reset_index()
counts
[6]:
| Cluster | Best4+Enterocytes | CD4+ActivatedFos-hi | CD4+ActivatedFos-lo | CD4+Memory | CD4+PD1+ | CD69+Mast | CD69-Mast | CD8+IELs | CD8+IL17+ | CD8+LP | ... | Stem | TA1 | TA2 | Tregs | Tuft | WNT2B+Fos-hi | WNT2B+Fos-lo1 | WNT2B+Fos-lo2 | WNT5B+1 | WNT5B+2 | ||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sample | Subject | Location | Health | Replicate | |||||||||||||||||||||
| EpiA | N10 | Epi | Healthy | A | 93 | 1 | 2 | 11 | 5 | 2 | 8 | 66 | 1 | 4 | ... | 55 | 354 | 272 | 0 | 24 | 0 | 0 | 0 | 0 | 1 |
| N106 | Epi | Non-inflamed | A | 3 | 0 | 0 | 0 | 0 | 2 | 0 | 3 | 0 | 1 | ... | 0 | 47 | 2 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | |
| N11 | Epi | Healthy | A | 93 | 1 | 0 | 8 | 0 | 0 | 0 | 65 | 0 | 2 | ... | 35 | 461 | 182 | 0 | 24 | 0 | 0 | 0 | 0 | 0 | |
| N110 | Epi | Non-inflamed | A | 221 | 0 | 0 | 1 | 0 | 5 | 0 | 3 | 0 | 9 | ... | 22 | 1260 | 348 | 0 | 17 | 0 | 0 | 0 | 0 | 0 | |
| N12 | Epi | Non-inflamed | A | 18 | 0 | 0 | 0 | 0 | 0 | 3 | 5 | 0 | 1 | ... | 0 | 91 | 31 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| LPB2 | N111 | LP | Inflamed | B2 | 1 | 138 | 15 | 717 | 24 | 136 | 0 | 15 | 13 | 278 | ... | 9 | 132 | 33 | 322 | 2 | 8 | 14 | 21 | 14 | 21 |
| N58 | LP | Inflamed | B2 | 0 | 5 | 1 | 1 | 0 | 21 | 0 | 0 | 0 | 7 | ... | 0 | 2 | 0 | 21 | 0 | 1 | 0 | 1 | 0 | 0 | |
| N661 | LP | Inflamed | B2 | 0 | 82 | 38 | 1431 | 0 | 73 | 0 | 92 | 0 | 478 | ... | 0 | 0 | 0 | 199 | 0 | 9 | 6 | 50 | 0 | 0 | |
| LPB2a | N52 | LP | Inflamed | B2a | 1 | 119 | 404 | 348 | 0 | 66 | 0 | 28 | 0 | 245 | ... | 6 | 23 | 17 | 79 | 0 | 68 | 27 | 26 | 35 | 38 |
| LPB2b | N52 | LP | Inflamed | B2b | 1 | 125 | 404 | 338 | 0 | 67 | 0 | 35 | 0 | 262 | ... | 6 | 25 | 19 | 81 | 0 | 66 | 28 | 26 | 39 | 34 |
133 rows × 51 columns
For preprocessing, we need to extract the tree-structured cell lineage information (Lineages Major_l1 - Major_l4) from the data, generate a Newick string through tasccoda.tree_utils.df2newick() and convert it into a toytree object.
[7]:
# Get cell lineage information for all 51 cell types
vars = smillie_data.groupby("Cluster").agg({
"Major_l1": "first",
"Major_l2": "first",
"Major_l3": "first",
"Major_l4": "first",
})
# Convert lineage information into a newick string
importlib.reload(util)
tree_levels = ["Major_l1", "Major_l2", "Major_l3", "Major_l4", "Cluster"]
newick = util.df2newick(smillie_data.loc[:, tree_levels].reset_index(drop=True), tree_levels)
# Build tree
tree = tt.tree(newick=newick, tree_format=8)
tree.draw(tip_labels_align=True, node_sizes=10, node_labels='name', width=800, node_markers="r10x1.25", node_colors="lightblue")
[7]:
(<toyplot.canvas.Canvas at 0x7fd42667a370>,
<toyplot.coordinates.Cartesian at 0x7fd42667afd0>,
<toytree.Render.ToytreeMark at 0x7fd426682820>)
Now, we can put everything together into an anndata object that we use in tascCODA:
[8]:
data = ad.AnnData(X=counts.reset_index(drop=True), var=vars, obs=obs, uns={"newick": newick, "phylo_tree": tree})
/Users/johannes.ostner/opt/anaconda3/envs/scCODA_tf2.8/lib/python3.9/site-packages/anndata/_core/anndata.py:120: ImplicitModificationWarning: Transforming to str index.
warnings.warn("Transforming to str index.", ImplicitModificationWarning)
For this tutorial, we focus on finding changes between healthy and non-inflamed tissue in the Lamina Propria. Therefore, we subset the data accordingly, and end up with 48 samples:
[9]:
data_LP_hn = data[(data.obs["Health"].isin(["Healthy", "Non-inflamed"])) & (data.obs["Location"] == "LP")]
data_LP_hn
[9]:
View of AnnData object with n_obs × n_vars = 48 × 51
obs: 'Sample', 'Subject', 'Location', 'Health', 'Replicate'
var: 'Major_l1', 'Major_l2', 'Major_l3', 'Major_l4'
uns: 'newick', 'phylo_tree'
A quick plot of the data with the sccoda.util.data_visualization module shows us that there are some changes in relative abundance in the data:
[10]:
viz.boxplots(data_LP_hn, feature_name="Health", figsize=(20, 8))
plt.show()
Run tascCODA¶
Running tascCODA on our data is now simply a matter of running the HMC sampling process. The model creation and inference works analogous to scCODA (see the tutorial here)
As hyperparameters, we have to specify (see the tascCODA paper for further explanations): - The reference cell type (which is assumed to be unchanged between the conditions): We use the automatic setting here, which chooses a reference that induces minimal compositional effects (NK cells) - The model formula (R-style formula string, just as in scCODA) - The aggregation bias \(\phi\). We go with an unbiased aggregation (pen_args={"phi": 0}) here.
THen, we run HMC sampling via sample_hmc_da with the default settings of 20.000 samples, of which 5.000 are discarded as burn-in
[11]:
importlib.reload(ana)
tree_mod= ana.CompositionalAnalysisTree(
data_LP_hn.copy(),
reference_cell_type="automatic",
formula="Health",
reg="scaled_3",
pen_args={"phi": 0}
)
tree_res = tree_mod.sample_hmc_da()
Automatic reference selection! Reference cell type set to NKs
Zero counts encountered in data! Added a pseudocount of 0.5.
WARNING:tensorflow:@custom_gradient grad_fn has 'variables' in signature, but no ResourceVariables were used on the forward pass.
WARNING:tensorflow:@custom_gradient grad_fn has 'variables' in signature, but no ResourceVariables were used on the forward pass.
WARNING:tensorflow:@custom_gradient grad_fn has 'variables' in signature, but no ResourceVariables were used on the forward pass.
WARNING:tensorflow:@custom_gradient grad_fn has 'variables' in signature, but no ResourceVariables were used on the forward pass.
WARNING:tensorflow:@custom_gradient grad_fn has 'variables' in signature, but no ResourceVariables were used on the forward pass.
WARNING:tensorflow:@custom_gradient grad_fn has 'variables' in signature, but no ResourceVariables were used on the forward pass.
WARNING:tensorflow:@custom_gradient grad_fn has 'variables' in signature, but no ResourceVariables were used on the forward pass.
100%|██████████| 20000/20000 [03:11<00:00, 104.58it/s]
MCMC sampling finished. (256.026 sec)
Acceptance rate: 87.7%
/Users/johannes.ostner/opt/anaconda3/envs/scCODA_tf2.8/lib/python3.9/site-packages/arviz/data/base.py:216: UserWarning: More chains (15000) than draws (1). Passed array should have shape (chains, draws, *shape)
warnings.warn(
Result analysis¶
Calling summary() on the results object, we can see the most relevant information for further analysis:
[12]:
tree_res.summary()
Compositional Analysis summary:
Data: 48 samples, 51 cell types
Reference index: 29
Formula: Health
Intercepts:
Final Parameter Expected Sample
Cell Type
Tregs 0.643 69.265072
MT-hi -0.755 17.114752
CD4+ActivatedFos-hi 1.491 161.732019
CD4+PD1+ -0.962 13.914629
CD4+ActivatedFos-lo 1.429 152.009157
CD4+Memory 1.281 131.097427
CD8+IL17+ -0.837 15.767341
CyclingT -0.740 17.373408
CD8+LP 1.524 167.158216
CD8+IELs -0.149 31.372784
Plasma 3.489 1192.659259
Follicular 0.808 81.690749
GC -0.907 14.701371
CyclingB -0.484 22.442147
ILCs -0.882 15.073538
NKs -0.403 24.335611
DC1 -0.682 18.410861
DC2 0.256 47.037293
CyclingMonocytes -0.672 18.595893
InflammatoryMonocytes -0.612 19.745799
Macrophages 1.456 156.169314
CD69-Mast -1.383 9.133425
CD69+Mast 0.551 63.177029
ImmatureEnterocytes1 -1.127 11.798127
ImmatureEnterocytes2 -0.866 15.316654
EnterocyteProgenitors -0.949 14.096701
Enterocytes -1.248 10.453541
Best4+Enterocytes -1.209 10.869283
TA2 -0.090 33.279472
TA1 0.080 39.446320
Tuft -1.473 8.347322
Goblet -1.247 10.464000
Enteroendocrine -1.285 10.073828
SecretoryTA -0.889 14.968392
ImmatureGoblet -0.283 27.438325
Mcells -1.434 8.679299
Stem -0.828 15.909887
CyclingTA 0.064 38.820201
WNT2B+Fos-lo2 0.476 58.612077
RSPO3+ -0.696 18.154905
WNT2B+Fos-lo1 1.002 99.180435
WNT2B+Fos-hi 0.630 68.370453
WNT5B+1 0.252 46.849520
WNT5B+2 0.680 71.875882
Myofibroblasts -0.042 34.915846
InflammatoryFibroblasts -0.968 13.831392
Post-capillaryVenules -0.014 35.907305
Endothelial 0.094 40.002452
Microvascular -0.356 25.506690
Pericytes -0.680 18.447720
Glia -0.167 30.813126
Effects:
Effect Expected Sample \
Covariate Cell Type
Health[T.Non-inflamed] Tregs -0.618 62.216345
MT-hi -0.618 15.373077
CD4+ActivatedFos-hi -0.618 145.273439
CD4+PD1+ -0.618 12.498614
CD4+ActivatedFos-lo -0.618 136.540019
CD4+Memory -0.618 117.756361
CD8+IL17+ -0.618 14.162785
CyclingT -0.618 15.605412
CD8+LP -0.618 150.147441
CD8+IELs -0.618 28.180148
Plasma -0.848 851.175005
Follicular -0.758 63.791360
GC -0.758 11.480130
CyclingB -0.758 17.524813
ILCs 0.000 25.118830
NKs 0.000 40.553324
DC1 -0.408 20.401690
DC2 -0.408 52.123596
CyclingMonocytes -0.408 20.606731
InflammatoryMonocytes -0.408 21.880980
Macrophages -0.408 173.056435
CD69-Mast -0.207 12.374251
CD69+Mast -0.207 85.594220
ImmatureEnterocytes1 0.143 22.683046
ImmatureEnterocytes2 0.143 29.447757
EnterocyteProgenitors 0.143 27.102277
Enterocytes 0.143 20.097949
Best4+Enterocytes 0.143 20.897254
TA2 0.143 63.983021
TA1 0.143 75.839385
Tuft 0.143 16.048538
Goblet 0.143 20.118057
Enteroendocrine 0.143 19.367914
SecretoryTA 0.143 28.778188
ImmatureGoblet 0.143 52.752848
Mcells 0.143 16.686796
Stem 0.143 30.588305
CyclingTA 0.143 74.635611
WNT2B+Fos-lo2 -0.321 70.853740
RSPO3+ -0.321 21.946721
WNT2B+Fos-lo1 -0.321 119.895167
WNT2B+Fos-hi -0.321 82.650242
WNT5B+1 -0.321 56.634467
WNT5B+2 -0.321 86.887811
Myofibroblasts -0.321 42.208337
InflammatoryFibroblasts -0.321 16.720203
Post-capillaryVenules -0.321 43.406871
Endothelial -0.321 48.357327
Microvascular -0.321 30.833993
Pericytes -0.321 22.300693
Glia -0.321 37.248726
log2-fold change
Covariate Cell Type
Health[T.Non-inflamed] Tregs -0.154834
MT-hi -0.154834
CD4+ActivatedFos-hi -0.154834
CD4+PD1+ -0.154834
CD4+ActivatedFos-lo -0.154834
CD4+Memory -0.154834
CD8+IL17+ -0.154834
CyclingT -0.154834
CD8+LP -0.154834
CD8+IELs -0.154834
Plasma -0.486654
Follicular -0.356812
GC -0.356812
CyclingB -0.356812
ILCs 0.736751
NKs 0.736751
DC1 0.148132
DC2 0.148132
CyclingMonocytes 0.148132
InflammatoryMonocytes 0.148132
Macrophages 0.148132
CD69-Mast 0.438113
CD69+Mast 0.438113
ImmatureEnterocytes1 0.943057
ImmatureEnterocytes2 0.943057
EnterocyteProgenitors 0.943057
Enterocytes 0.943057
Best4+Enterocytes 0.943057
TA2 0.943057
TA1 0.943057
Tuft 0.943057
Goblet 0.943057
Enteroendocrine 0.943057
SecretoryTA 0.943057
ImmatureGoblet 0.943057
Mcells 0.943057
Stem 0.943057
CyclingTA 0.943057
WNT2B+Fos-lo2 0.273646
RSPO3+ 0.273646
WNT2B+Fos-lo1 0.273646
WNT2B+Fos-hi 0.273646
WNT5B+1 0.273646
WNT5B+2 0.273646
Myofibroblasts 0.273646
InflammatoryFibroblasts 0.273646
Post-capillaryVenules 0.273646
Endothelial 0.273646
Microvascular 0.273646
Pericytes 0.273646
Glia 0.273646
Nodes:
Final Parameter Is credible
Covariate Node
Health[T.Non-inflamed]_node Mcells 0.000 False
Stem 0.000 False
CyclingTA 0.000 False
ImmatureEnterocytes1 0.000 False
ImmatureEnterocytes2 0.000 False
... ... ...
Lymphoid 0.000 False
Myeloid -0.207 True
Epithelial 0.143 True
Stromal -0.321 True
Immune 0.000 False
[74 rows x 2 columns]
Model properties
First, the summary shows an overview over the model properties: * Number of samples/cell types * The reference cell type. * The formula used
The model has three types of parameters that are relevant for analysis - intercepts, feature-level effects and node-wise effects. These can be interpreted like in a standard regression model: Intercepts show how the cell types are distributed without any active covariates, effects show how the covariates influence the cell types.
Intercepts
The first column of the intercept summary shows the parameters determined by the MCMC inference.
The “Expected sample” column gives some context to the numerical values. If we had a new sample (with no active covariates) with a total number of cells equal to the mean sampling depth of the dataset, then this distribution over the cell types would be most likely.
Feature-level Effects
For the feature-level effect summary, the first column again shows the inferred parameters for all combinations of covariates and cell types, as sums over node-level effects on all parent nodes. Most important is the distinction between zero and non-zero entries A value of zero means that no statistically credible effect was detected. For a value other than zero, a credible change was detected. A positive sign indicates an increase, a negative sign a decrease in abundance.
Since the numerical values of the “Effect” column are not straightforward to interpret, the “Expected sample” and “log2-fold change” columns give us an idea on the magnitude of the change. The expected sample is calculated for each covariate separately (covariate value = 1, all other covariates = 0), with the same method as for the intercepts. The log-fold change is then calculated between this expected sample and the expected sample with no active covariates from the intercept section. Since the data is compositional, cell types for which no credible change was detected, will still change in abundance as well, as soon as a credible effect is detected on another cell type due to the sum-to-one constraint. If there are no credible effects for a covariate, its expected sample will be identical to the intercept sample, therefore the log2-fold change is 0.
Node-level effects
These parameters are the most important ones. They describe, at which points in the tree a credible change in abundance was detected. The data frame just has two columns: The “Final Parameter” column shows the effect values, the “Is credible” column simply depicts whether the inferred effect is different from 0, i.e. credible. In a normal tascCODA analysis, we are interested in which subtrees (i.e. nodes) have a nonzero effect, which we can easily extract:
[13]:
print(tree_res.node_df[tree_res.node_df["Is credible"] == True])
Final Parameter Median HDI 3% \
Covariate Node
Health[T.Non-inflamed]_node Plasma -0.090 -0.090 -0.538
Tcells -0.618 -0.618 -0.794
Bcells -0.758 -0.758 -1.014
Monocytes -0.201 -0.201 -0.610
Myeloid -0.207 -0.207 -0.545
Epithelial 0.143 0.143 -0.013
Stromal -0.321 -0.321 -0.490
HDI 97% SD Delta Is credible
Covariate Node
Health[T.Non-inflamed]_node Plasma 0.033 0.188 0.065919 True
Tcells -0.432 0.097 0.065919 True
Bcells -0.380 0.174 0.065919 True
Monocytes 0.030 0.223 0.065919 True
Myeloid 0.025 0.196 0.065919 True
Epithelial 0.319 0.103 0.065919 True
Stromal 0.004 0.133 0.065919 True
Interpretation
We see that most credible effects are on intermediate nodes. The only slight increase in abundance is on all Epithelial cell types, while there are decreases on Stromal and some Immune cell types. The most important decreases (i.e. with the largest effect sizes) can be found on B- and T-cells. Plasma cells have an additional decrease, meaning that they change even stronger in abundance than the rest of the B-cell lineage
We can also easily plot the credible effects as nodes on the tree for better visualization. Perfectly aligning and styling the plot might require some handwork though:
[15]:
tree_res.draw_tree_effects(tree=data.uns["phylo_tree"], node_labels='name', width=800, node_labels_style={"font-size": "11px", "-toyplot-anchor-shift":"30px"}, tip_labels_align=True,)