tascCODA - Tree-aggregated compositional analysis of high-throughput sequencing data¶

This notebook is a tutorial on how to use tascCODA Ostner et al., 2021 for tree-aggregated compositional analysis of high-throughput sequencing (HTS) data.

For this example, we use single-cell RNA sequencing data. However, there are no limitations to use tascCODA with other HTS data, such as 16S rRNA sequencing.

The particular dataset for this analysis was generated by Smillie et al, 2019. It contains samples from two different regions in the small intestine of mice - Epithelium and Lamina Propria - and three different inflammation conditions - healthy, non-inflamed and inflamed. In total, we have 365.492 cells from 51 cell types in 133 samples.

This tutorial is designed to be executed on a standard computer (any operating system) in a Python environment with tascCODA, Jupyter notebook and all their dependencies installed. Running the tutorial takes about 5 minutes on a 2020 Apple MacBook Pro (16GB RAM).

[3]:

# Setup
import numpy as np
import pandas as pd
import re
import toytree as tt
from tasccoda import tree_utils as util
from tasccoda import tree_ana as ana
from tasccoda import tree_results as tr
from tasccoda import tree_agg_model_sslasso as mod
import tasccoda.datasets as dat
import anndata as ad
from sccoda.util import data_visualization as viz
import toyplot
import toyplot.locator
import statsmodels as sm
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.patches as mpatches
import toyplot.svg
import os

import importlib

[4]:

import tasccoda
tasccoda.__version__

[4]:

'0.1.dev12+g8baba74.d20220628'

Data setup¶

First, we read in the per-cell data and convert it to a count table

[5]:

# read in per-cell data
smillie_data = dat.smillie()
smillie_data["Cluster"] = [str.replace(x, " ", "") for x in smillie_data["Cluster"]]
smillie_data

[5]:

	NAME	Subject	Sample	Location	Replicate	Health	Cluster	nGene	nUMI	Major_l1	Major_l2	Major_l3	Major_l4
0	N10.EpiA.AAACATACAACCAC	N10	EpiA	Epi	A	Healthy	EnterocyteProgenitors	425	968	Epithelial	Epithelial	Absorptive	Immature cells
1	N10.EpiA.AAACATACAGGCGA	N10	EpiA	Epi	A	Healthy	CyclingTA	1695	7273	Epithelial	Epithelial	Cycling TA3	Cycling TA4
2	N10.EpiA.AAACATACCACTAG	N10	EpiA	Epi	A	Healthy	ImmatureGoblet	391	1190	Epithelial	Epithelial	Secretory	Progenitor cells
3	N10.EpiA.AAACATACCCTTTA	N10	EpiA	Epi	A	Healthy	SecretoryTA	1327	5620	Epithelial	Epithelial	Secretory	Progenitor cells
4	N10.EpiA.AAACATACTGCAAC	N10	EpiA	Epi	A	Healthy	ImmatureEnterocytes2	1383	4676	Epithelial	Epithelial	Absorptive	Immature cells
...	...	...	...	...	...	...	...	...	...	...	...	...	...
365487	N9.LPB.TTTATCCTAACGAA	N9	LPB	LP	B	Inflamed	Enterocytes	2768	18811	Epithelial	Epithelial	Absorptive	Absorptive Mature cells
365488	N9.LPB.TTTATCCTGTAAAG	N9	LPB	LP	B	Inflamed	Plasma	1392	27685	Immune	Lymphoid	B cells	Plasma4
365489	N9.LPB.TTTATCCTGTCGTA	N9	LPB	LP	B	Inflamed	Plasma	574	5478	Immune	Lymphoid	B cells	Plasma4
365490	N9.LPB.TTTCAGTGGCGTTA	N9	LPB	LP	B	Inflamed	Macrophages	1437	5698	Immune	Myeloid	Monocytes	Macrophages4
365491	N9.LPB.TTTGCATGGATACC	N9	LPB	LP	B	Inflamed	CD8+LP	544	1348	Immune	Lymphoid	T cells	CD8+ T

365492 rows × 13 columns

[6]:

# Convert to count table
counts = pd.crosstab(
    [smillie_data["Sample"], smillie_data["Subject"], smillie_data["Location"], smillie_data["Health"], smillie_data["Replicate"]], smillie_data["Cluster"])
counts

obs = pd.DataFrame(index=counts.index).reset_index()
counts

[6]:

				Cluster	Best4+Enterocytes	CD4+ActivatedFos-hi	CD4+ActivatedFos-lo	CD4+Memory	CD4+PD1+	CD69+Mast	CD69-Mast	CD8+IELs	CD8+IL17+	CD8+LP	...	Stem	TA1	TA2	Tregs	Tuft	WNT2B+Fos-hi	WNT2B+Fos-lo1	WNT2B+Fos-lo2	WNT5B+1	WNT5B+2
Sample	Subject	Location	Health	Replicate
EpiA	N10	Epi	Healthy	A	93	1	2	11	5	2	8	66	1	4	...	55	354	272	0	24	0	0	0	0	1
	N106	Epi	Non-inflamed	A	3	0	0	0	0	2	0	3	0	1	...	0	47	2	0	2	0	0	0	0	0
	N11	Epi	Healthy	A	93	1	0	8	0	0	0	65	0	2	...	35	461	182	0	24	0	0	0	0	0
	N110	Epi	Non-inflamed	A	221	0	0	1	0	5	0	3	0	9	...	22	1260	348	0	17	0	0	0	0	0
	N12	Epi	Non-inflamed	A	18	0	0	0	0	0	3	5	0	1	...	0	91	31	0	1	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
LPB2	N111	LP	Inflamed	B2	1	138	15	717	24	136	0	15	13	278	...	9	132	33	322	2	8	14	21	14	21
	N58	LP	Inflamed	B2	0	5	1	1	0	21	0	0	0	7	...	0	2	0	21	0	1	0	1	0	0
	N661	LP	Inflamed	B2	0	82	38	1431	0	73	0	92	0	478	...	0	0	0	199	0	9	6	50	0	0
LPB2a	N52	LP	Inflamed	B2a	1	119	404	348	0	66	0	28	0	245	...	6	23	17	79	0	68	27	26	35	38
LPB2b	N52	LP	Inflamed	B2b	1	125	404	338	0	67	0	35	0	262	...	6	25	19	81	0	66	28	26	39	34

133 rows × 51 columns

For preprocessing, we need to extract the tree-structured cell lineage information (Lineages Major_l1 - Major_l4) from the data, generate a Newick string through tasccoda.tree_utils.df2newick() and convert it into a toytree object.

[7]:

# Get cell lineage information for all 51 cell types
vars = smillie_data.groupby("Cluster").agg({
    "Major_l1": "first",
    "Major_l2": "first",
    "Major_l3": "first",
    "Major_l4": "first",
})

# Convert lineage information into a newick string
importlib.reload(util)
tree_levels = ["Major_l1", "Major_l2", "Major_l3", "Major_l4", "Cluster"]
newick = util.df2newick(smillie_data.loc[:, tree_levels].reset_index(drop=True), tree_levels)

# Build tree
tree = tt.tree(newick=newick, tree_format=8)
tree.draw(tip_labels_align=True, node_sizes=10, node_labels='name', width=800, node_markers="r10x1.25", node_colors="lightblue")

[7]:

(<toyplot.canvas.Canvas at 0x7fd42667a370>,
 <toyplot.coordinates.Cartesian at 0x7fd42667afd0>,
 <toytree.Render.ToytreeMark at 0x7fd426682820>)

Now, we can put everything together into an anndata object that we use in tascCODA:

[8]:

data = ad.AnnData(X=counts.reset_index(drop=True), var=vars, obs=obs, uns={"newick": newick, "phylo_tree": tree})

/Users/johannes.ostner/opt/anaconda3/envs/scCODA_tf2.8/lib/python3.9/site-packages/anndata/_core/anndata.py:120: ImplicitModificationWarning: Transforming to str index.
  warnings.warn("Transforming to str index.", ImplicitModificationWarning)

For this tutorial, we focus on finding changes between healthy and non-inflamed tissue in the Lamina Propria. Therefore, we subset the data accordingly, and end up with 48 samples:

[9]:

data_LP_hn = data[(data.obs["Health"].isin(["Healthy", "Non-inflamed"])) & (data.obs["Location"] == "LP")]
data_LP_hn

[9]:

View of AnnData object with n_obs × n_vars = 48 × 51
    obs: 'Sample', 'Subject', 'Location', 'Health', 'Replicate'
    var: 'Major_l1', 'Major_l2', 'Major_l3', 'Major_l4'
    uns: 'newick', 'phylo_tree'

A quick plot of the data with the sccoda.util.data_visualization module shows us that there are some changes in relative abundance in the data:

[10]:

viz.boxplots(data_LP_hn, feature_name="Health", figsize=(20, 8))
plt.show()

Run tascCODA¶

Running tascCODA on our data is now simply a matter of running the HMC sampling process. The model creation and inference works analogous to scCODA (see the tutorial here)

As hyperparameters, we have to specify (see the tascCODA paper for further explanations): - The reference cell type (which is assumed to be unchanged between the conditions): We use the automatic setting here, which chooses a reference that induces minimal compositional effects (NK cells) - The model formula (R-style formula string, just as in scCODA) - The aggregation bias \(\phi\). We go with an unbiased aggregation (pen_args={"phi": 0}) here.

THen, we run HMC sampling via sample_hmc_da with the default settings of 20.000 samples, of which 5.000 are discarded as burn-in

[11]:

importlib.reload(ana)
tree_mod= ana.CompositionalAnalysisTree(
            data_LP_hn.copy(),
            reference_cell_type="automatic",
            formula="Health",
            reg="scaled_3",
            pen_args={"phi": 0}
        )

tree_res = tree_mod.sample_hmc_da()

Automatic reference selection! Reference cell type set to NKs
Zero counts encountered in data! Added a pseudocount of 0.5.
WARNING:tensorflow:@custom_gradient grad_fn has 'variables' in signature, but no ResourceVariables were used on the forward pass.
WARNING:tensorflow:@custom_gradient grad_fn has 'variables' in signature, but no ResourceVariables were used on the forward pass.
WARNING:tensorflow:@custom_gradient grad_fn has 'variables' in signature, but no ResourceVariables were used on the forward pass.
WARNING:tensorflow:@custom_gradient grad_fn has 'variables' in signature, but no ResourceVariables were used on the forward pass.
WARNING:tensorflow:@custom_gradient grad_fn has 'variables' in signature, but no ResourceVariables were used on the forward pass.
WARNING:tensorflow:@custom_gradient grad_fn has 'variables' in signature, but no ResourceVariables were used on the forward pass.
WARNING:tensorflow:@custom_gradient grad_fn has 'variables' in signature, but no ResourceVariables were used on the forward pass.

100%|██████████| 20000/20000 [03:11<00:00, 104.58it/s]

MCMC sampling finished. (256.026 sec)
Acceptance rate: 87.7%

/Users/johannes.ostner/opt/anaconda3/envs/scCODA_tf2.8/lib/python3.9/site-packages/arviz/data/base.py:216: UserWarning: More chains (15000) than draws (1). Passed array should have shape (chains, draws, *shape)
  warnings.warn(

Result analysis¶

Calling summary() on the results object, we can see the most relevant information for further analysis:

[12]:

tree_res.summary()

Compositional Analysis summary:

Data: 48 samples, 51 cell types
Reference index: 29
Formula: Health

Intercepts:
                         Final Parameter  Expected Sample
Cell Type
Tregs                              0.643        69.265072
MT-hi                             -0.755        17.114752
CD4+ActivatedFos-hi                1.491       161.732019
CD4+PD1+                          -0.962        13.914629
CD4+ActivatedFos-lo                1.429       152.009157
CD4+Memory                         1.281       131.097427
CD8+IL17+                         -0.837        15.767341
CyclingT                          -0.740        17.373408
CD8+LP                             1.524       167.158216
CD8+IELs                          -0.149        31.372784
Plasma                             3.489      1192.659259
Follicular                         0.808        81.690749
GC                                -0.907        14.701371
CyclingB                          -0.484        22.442147
ILCs                              -0.882        15.073538
NKs                               -0.403        24.335611
DC1                               -0.682        18.410861
DC2                                0.256        47.037293
CyclingMonocytes                  -0.672        18.595893
InflammatoryMonocytes             -0.612        19.745799
Macrophages                        1.456       156.169314
CD69-Mast                         -1.383         9.133425
CD69+Mast                          0.551        63.177029
ImmatureEnterocytes1              -1.127        11.798127
ImmatureEnterocytes2              -0.866        15.316654
EnterocyteProgenitors             -0.949        14.096701
Enterocytes                       -1.248        10.453541
Best4+Enterocytes                 -1.209        10.869283
TA2                               -0.090        33.279472
TA1                                0.080        39.446320
Tuft                              -1.473         8.347322
Goblet                            -1.247        10.464000
Enteroendocrine                   -1.285        10.073828
SecretoryTA                       -0.889        14.968392
ImmatureGoblet                    -0.283        27.438325
Mcells                            -1.434         8.679299
Stem                              -0.828        15.909887
CyclingTA                          0.064        38.820201
WNT2B+Fos-lo2                      0.476        58.612077
RSPO3+                            -0.696        18.154905
WNT2B+Fos-lo1                      1.002        99.180435
WNT2B+Fos-hi                       0.630        68.370453
WNT5B+1                            0.252        46.849520
WNT5B+2                            0.680        71.875882
Myofibroblasts                    -0.042        34.915846
InflammatoryFibroblasts           -0.968        13.831392
Post-capillaryVenules             -0.014        35.907305
Endothelial                        0.094        40.002452
Microvascular                     -0.356        25.506690
Pericytes                         -0.680        18.447720
Glia                              -0.167        30.813126


Effects:
                                                Effect  Expected Sample  \
Covariate              Cell Type
Health[T.Non-inflamed] Tregs                    -0.618        62.216345
                       MT-hi                    -0.618        15.373077
                       CD4+ActivatedFos-hi      -0.618       145.273439
                       CD4+PD1+                 -0.618        12.498614
                       CD4+ActivatedFos-lo      -0.618       136.540019
                       CD4+Memory               -0.618       117.756361
                       CD8+IL17+                -0.618        14.162785
                       CyclingT                 -0.618        15.605412
                       CD8+LP                   -0.618       150.147441
                       CD8+IELs                 -0.618        28.180148
                       Plasma                   -0.848       851.175005
                       Follicular               -0.758        63.791360
                       GC                       -0.758        11.480130
                       CyclingB                 -0.758        17.524813
                       ILCs                      0.000        25.118830
                       NKs                       0.000        40.553324
                       DC1                      -0.408        20.401690
                       DC2                      -0.408        52.123596
                       CyclingMonocytes         -0.408        20.606731
                       InflammatoryMonocytes    -0.408        21.880980
                       Macrophages              -0.408       173.056435
                       CD69-Mast                -0.207        12.374251
                       CD69+Mast                -0.207        85.594220
                       ImmatureEnterocytes1      0.143        22.683046
                       ImmatureEnterocytes2      0.143        29.447757
                       EnterocyteProgenitors     0.143        27.102277
                       Enterocytes               0.143        20.097949
                       Best4+Enterocytes         0.143        20.897254
                       TA2                       0.143        63.983021
                       TA1                       0.143        75.839385
                       Tuft                      0.143        16.048538
                       Goblet                    0.143        20.118057
                       Enteroendocrine           0.143        19.367914
                       SecretoryTA               0.143        28.778188
                       ImmatureGoblet            0.143        52.752848
                       Mcells                    0.143        16.686796
                       Stem                      0.143        30.588305
                       CyclingTA                 0.143        74.635611
                       WNT2B+Fos-lo2            -0.321        70.853740
                       RSPO3+                   -0.321        21.946721
                       WNT2B+Fos-lo1            -0.321       119.895167
                       WNT2B+Fos-hi             -0.321        82.650242
                       WNT5B+1                  -0.321        56.634467
                       WNT5B+2                  -0.321        86.887811
                       Myofibroblasts           -0.321        42.208337
                       InflammatoryFibroblasts  -0.321        16.720203
                       Post-capillaryVenules    -0.321        43.406871
                       Endothelial              -0.321        48.357327
                       Microvascular            -0.321        30.833993
                       Pericytes                -0.321        22.300693
                       Glia                     -0.321        37.248726

                                                log2-fold change
Covariate              Cell Type
Health[T.Non-inflamed] Tregs                           -0.154834
                       MT-hi                           -0.154834
                       CD4+ActivatedFos-hi             -0.154834
                       CD4+PD1+                        -0.154834
                       CD4+ActivatedFos-lo             -0.154834
                       CD4+Memory                      -0.154834
                       CD8+IL17+                       -0.154834
                       CyclingT                        -0.154834
                       CD8+LP                          -0.154834
                       CD8+IELs                        -0.154834
                       Plasma                          -0.486654
                       Follicular                      -0.356812
                       GC                              -0.356812
                       CyclingB                        -0.356812
                       ILCs                             0.736751
                       NKs                              0.736751
                       DC1                              0.148132
                       DC2                              0.148132
                       CyclingMonocytes                 0.148132
                       InflammatoryMonocytes            0.148132
                       Macrophages                      0.148132
                       CD69-Mast                        0.438113
                       CD69+Mast                        0.438113
                       ImmatureEnterocytes1             0.943057
                       ImmatureEnterocytes2             0.943057
                       EnterocyteProgenitors            0.943057
                       Enterocytes                      0.943057
                       Best4+Enterocytes                0.943057
                       TA2                              0.943057
                       TA1                              0.943057
                       Tuft                             0.943057
                       Goblet                           0.943057
                       Enteroendocrine                  0.943057
                       SecretoryTA                      0.943057
                       ImmatureGoblet                   0.943057
                       Mcells                           0.943057
                       Stem                             0.943057
                       CyclingTA                        0.943057
                       WNT2B+Fos-lo2                    0.273646
                       RSPO3+                           0.273646
                       WNT2B+Fos-lo1                    0.273646
                       WNT2B+Fos-hi                     0.273646
                       WNT5B+1                          0.273646
                       WNT5B+2                          0.273646
                       Myofibroblasts                   0.273646
                       InflammatoryFibroblasts          0.273646
                       Post-capillaryVenules            0.273646
                       Endothelial                      0.273646
                       Microvascular                    0.273646
                       Pericytes                        0.273646
                       Glia                             0.273646


Nodes:
                                                  Final Parameter  Is credible
Covariate                   Node
Health[T.Non-inflamed]_node Mcells                          0.000        False
                            Stem                            0.000        False
                            CyclingTA                       0.000        False
                            ImmatureEnterocytes1            0.000        False
                            ImmatureEnterocytes2            0.000        False
...                                                           ...          ...
                            Lymphoid                        0.000        False
                            Myeloid                        -0.207         True
                            Epithelial                      0.143         True
                            Stromal                        -0.321         True
                            Immune                          0.000        False

[74 rows x 2 columns]

Model properties

First, the summary shows an overview over the model properties: * Number of samples/cell types * The reference cell type. * The formula used

The model has three types of parameters that are relevant for analysis - intercepts, feature-level effects and node-wise effects. These can be interpreted like in a standard regression model: Intercepts show how the cell types are distributed without any active covariates, effects show how the covariates influence the cell types.

Intercepts

The first column of the intercept summary shows the parameters determined by the MCMC inference.

The “Expected sample” column gives some context to the numerical values. If we had a new sample (with no active covariates) with a total number of cells equal to the mean sampling depth of the dataset, then this distribution over the cell types would be most likely.

Feature-level Effects

For the feature-level effect summary, the first column again shows the inferred parameters for all combinations of covariates and cell types, as sums over node-level effects on all parent nodes. Most important is the distinction between zero and non-zero entries A value of zero means that no statistically credible effect was detected. For a value other than zero, a credible change was detected. A positive sign indicates an increase, a negative sign a decrease in abundance.

Since the numerical values of the “Effect” column are not straightforward to interpret, the “Expected sample” and “log2-fold change” columns give us an idea on the magnitude of the change. The expected sample is calculated for each covariate separately (covariate value = 1, all other covariates = 0), with the same method as for the intercepts. The log-fold change is then calculated between this expected sample and the expected sample with no active covariates from the intercept section. Since the data is compositional, cell types for which no credible change was detected, will still change in abundance as well, as soon as a credible effect is detected on another cell type due to the sum-to-one constraint. If there are no credible effects for a covariate, its expected sample will be identical to the intercept sample, therefore the log2-fold change is 0.

Node-level effects

These parameters are the most important ones. They describe, at which points in the tree a credible change in abundance was detected. The data frame just has two columns: The “Final Parameter” column shows the effect values, the “Is credible” column simply depicts whether the inferred effect is different from 0, i.e. credible. In a normal tascCODA analysis, we are interested in which subtrees (i.e. nodes) have a nonzero effect, which we can easily extract:

[13]:

print(tree_res.node_df[tree_res.node_df["Is credible"] == True])

                                        Final Parameter  Median  HDI 3%  \
Covariate                   Node
Health[T.Non-inflamed]_node Plasma               -0.090  -0.090  -0.538
                            Tcells               -0.618  -0.618  -0.794
                            Bcells               -0.758  -0.758  -1.014
                            Monocytes            -0.201  -0.201  -0.610
                            Myeloid              -0.207  -0.207  -0.545
                            Epithelial            0.143   0.143  -0.013
                            Stromal              -0.321  -0.321  -0.490

                                        HDI 97%     SD     Delta  Is credible
Covariate                   Node
Health[T.Non-inflamed]_node Plasma        0.033  0.188  0.065919         True
                            Tcells       -0.432  0.097  0.065919         True
                            Bcells       -0.380  0.174  0.065919         True
                            Monocytes     0.030  0.223  0.065919         True
                            Myeloid       0.025  0.196  0.065919         True
                            Epithelial    0.319  0.103  0.065919         True
                            Stromal       0.004  0.133  0.065919         True

Interpretation

We see that most credible effects are on intermediate nodes. The only slight increase in abundance is on all Epithelial cell types, while there are decreases on Stromal and some Immune cell types. The most important decreases (i.e. with the largest effect sizes) can be found on B- and T-cells. Plasma cells have an additional decrease, meaning that they change even stronger in abundance than the rest of the B-cell lineage

We can also easily plot the credible effects as nodes on the tree for better visualization. Perfectly aligning and styling the plot might require some handwork though:

[15]:

tree_res.draw_tree_effects(tree=data.uns["phylo_tree"], node_labels='name', width=800, node_labels_style={"font-size": "11px", "-toyplot-anchor-shift":"30px"}, tip_labels_align=True,)