PIPET:
Phenotypic Information Based on Bulk Data Predicts Relevant Subpopulations in Single Cell Data

XINJIA RUAN

2024-03-02

Introduction

PIPET can be used to predict relevant subpopulations in single-cell data from phenotypic information in bulk data. You can use known feature vectors of phenotypic information to preform PIPET() or create feature vectors via the function in the PIPET. This package also provides commonly used downstream analysis for creating customizable visualization results. The workflow of PIPET is shown in the following Figure:

Installation

To install this package, start R (version >= 4.3.1) and enter:

if (!require("devtools")) 
    install.packages("devtools")
devtools::install_github("ruan2ruan/PIPET")

If errors occur during installation, we recommend that you first try to install all R dependencies by referring to the Imports in the DESCRIPTION file, and then install PIPET.

After installation, you could load the PIPET package in your R session:

library(PIPET)

In this tutorial, our single cell dataset is derived from the Peripheral Blood Mononuclear Cells (PBMC) freely available from 10X Genomics. Data has been preprocessed and annotated of cell types can be downloaded from SeuratData1. Therefore, before loading Seurat2 object data, we also need to load some related packages (for specific installation methods, please refer to the website of the relevant packages):

library(tidyverse)
library(patchwork)
library(Seurat)
library(SeuratData)

PIPET example

Here, we demonstrate key steps for PIPET prediction of relevant subpopulations in single-cell data and provide some downstream visualization results. The inputs of PIPET require a single-cell expression matrix and a set of feature vectors containing phenotypic information. This package contains a 571-gene panel for classification of AML and preprocessed TPM expression data and survival data of TCGA-LAML, and now we will show how to perform PIPET in a practical application.

Import data

We load in the 3k PBMC single cell data, which has been preprocessed and annotated of cell types. It has 13714 features and 2638 samples.

#InstallData("pbmc3k")
pbmc3k.final <- LoadData("pbmc3k", type = "pbmc3k.final")
SC = UpdateSeuratObject(object = pbmc3k.final)
SC
#> An object of class Seurat 
#> 13714 features across 2638 samples within 1 assay 
#> Active assay: RNA (13714 features, 2000 variable features)
#>  3 layers present: counts, data, scale.data
#>  2 dimensional reductions calculated: pca, umap

Then we loaded the signature data for AML classification (571 genes identified three main clusters c1, c2, c3). Mo et al.3 generated these signature genes and demonstrated their good performance in classifying clinically relevant subtypes of AML.

markers <- read.table("../data/AML_Subtypes_signature.txt",sep = "\t",check.names = F,stringsAsFactors = F,header = T,row.names = NULL)
table(markers$class)
#> 
#>  c1  c2  c3 
#> 111 160 300
head(markers,5)

Using PIPET to identify cell subpopulations

Based on the above input data, we can use PIPET to identify the phenotype-associated cell subpopulations. The calculation may take several minutes, depending on the amount of data you have and the number of cores you set.

set.seed(123)
tmp <- PIPET(SC_data=SC, markers=markers, gene_col= "genes", class_col= "class",
             nCores=4)
#> 
#> Attaching package: 'matrixStats'
#> The following object is masked from 'package:dplyr':
#> 
#>     count
#> 
#> Attaching package: 'Matrix'
#> The following objects are masked from 'package:tidyr':
#> 
#>     expand, pack, unpack
#> 364 genes are in the final feature vector.
#> The classification of markers is:
#> 
#>  c1  c2  c3 
#>  54 141 169
#> Calculating......
#> Finished!

We merged the PIPET-predicted cell subpopulation results with the original single-cell data.

tmp$ID <- rownames(tmp)
metadata <- SC@meta.data
metadata$ID <- rownames(metadata)
meta <- merge(metadata,tmp,by="ID")
rownames(meta) <- meta$ID
SC <- AddMetaData(SC, metadata = meta)

After filtering out significantly related cell subpopulations (P value less than 0.05), PIPET identified 66 PIPET_c1 cells related to AML-c1 Subtype, 517 PIPET_c2 cells related to AML-c2 Subtype, and 34 PIPET_c3 cells related to AML-c3 Subtype.

SC1 <- subset(x = SC, subset= (Pvalue < 0.05))
table(SC1$prediction)
#> 
#>  c1  c2  c3 
#>  66 517  34

Visualize all cells using UMAP

We first visualized all cells included in the analysis using the UMAP plot. And different cell types are distinguished by colors.

DimPlot(SC, reduction = "umap", group.by = 'seurat_annotations', label = F) +
  ggtitle("pbmc3k-Celltypes")

Visualize predicted cells using UMAP

We then plot only the PIPET annotated cells.

metadata <- SC@meta.data
temp <- data.frame(ID=SC1$ID,type=SC1$prediction)
meta <- merge(metadata,temp,by="ID",all = TRUE)
meta$type <- as.character(meta$type)
meta$type  <- ifelse(is.na(meta$type), "Undefined", meta$type)
rownames(meta) <- meta$ID
meta <- meta[SC$ID,]
SC <- AddMetaData(SC, metadata = meta)

library(RColorBrewer)
colors<-brewer.pal(n = 8, name = "Set1")

PIPET_DotPlot(SC,colors,group="type",title="AML-Subtypes")

For a more intuitive comparison, we plot the two results at the same time.

p1 <- DimPlot(SC, reduction = "umap", group.by = 'seurat_annotations', label = F) +
  ggtitle("pbmc3k-Celltypes")
p2 <- PIPET_DotPlot(SC,colors,group="type",title="AML-Subtypes")
p1 | p2

Proportion plots

We could see that the cells in each PIPET subpopulation are derived from different cell types. PIPET_BarPlot() can display the main cell compositions of each cell subpopulation through stacked histograms.

table(SC1$seurat_annotations,SC1$prediction)
#>               
#>                 c1  c2  c3
#>   Naive CD4 T   12   0   9
#>   Memory CD4 T   6   0   8
#>   CD14+ Mono    24 374   0
#>   B              9   0   4
#>   CD8 T         10   0   2
#>   FCGR3A+ Mono   1 128   0
#>   NK             4   0   0
#>   DC             0  15   0
#>   Platelet       0   0  11
PIPET_BarPlot(SC1,group="seurat_annotations",prediction="prediction",ylim=530)
#> The levels of seurat_annotations are:
#> [1] "Naive CD4 T"  "Memory CD4 T" "CD14+ Mono"   "B"            "CD8 T"       
#> [6] "FCGR3A+ Mono" "NK"           "DC"           "Platelet"
#> If the order of levels is not appropriate, please re-customize seurat_annotations.

DE genes

We then found differentially expressed features of PIPET_c1, PIPET_c2 and PIPET_c3. And top markers were selected to visualize expression.

DefaultAssay(SC1) <- "RNA"
Idents(SC1)="prediction"

#find markers for every cluster compared to all remaining cells, report only the positive ones
SC.markers <- FindAllMarkers(SC1, only.pos = TRUE,
                             min.pct = 0.25, logfc.threshold = 0.25)
#> Calculating cluster c1
#> Calculating cluster c2
#> Calculating cluster c3
biomarkers <- SC.markers %>%
  dplyr::filter(p_val_adj<0.05) %>%
  group_by(cluster) %>%
  slice_max(n = 20000, order_by = avg_log2FC)
top <- biomarkers %>%
  group_by(cluster) %>%
  slice_head(n = 4)
top

Multi-Violin Plot

We used PIPET_MultiViolin() to create a stacked violin plot, which could display violin plots of multiple cell populations or cell types in the same figure to compare gene expression changes between them.

features <- top$gene
PIPET_MultiViolin(SC1, features, preprocess=FALSE)
#> 
#> Attaching package: 'cowplot'
#> The following object is masked from 'package:patchwork':
#> 
#>     align_plots
#> The following object is masked from 'package:lubridate':
#> 
#>     stamp

Prepare Data

We load the preprocessed TCGA-LAML bulk expression matrix and the corresponding clinical survival data. The FPKM data of TCGA-LAML and the corresponding clinical data were downloaded from the Xena public platform (https://xena.ucsc.edu/public-hubs). We converted FPKM expression data to TPM and matched samples with expression data and clinical data.

data("TCGA_LAML_TPM")
TPM <- TCGA_LAML_TPM
data("TCGA_LAML_Surv")
surv.info <- TCGA_LAML_Surv

In this TPM expression matrix, each row represents a gene and each column represents a sample. A total of 49038 genes and 130 samples were included in subsequent analyses.

dim(TPM)
#> [1] 49038   130

We only selected the clinical data of survival information, including sample ID, outcome event (1 means dead, 0 means alive) and survival time (in days).

head(surv.info,3)

A multivariate Cox regression analysis

A multivariate Cox regression model was established using the selected genes that were significantly differentially expressed in PIPET_c3 (c3 subtype is the subtype with the worst prognosis).

c3 <- biomarkers %>%
  filter(cluster=="c3") %>%
  filter(avg_log2FC>=2)

TPM.norm <- log2(TPM + 1)
TPM.norm <- data.frame(t(na.omit(TPM.norm[c3$gene,])))
genes <- colnames(TPM.norm)

tmp <-TPM.norm[rownames(surv.info),]
survival_info_df <- cbind(tmp,surv.info)

library(survival)
multi_COX<-coxph(Surv(futime, fustat) ~ ., data=survival_info_df)

Plot Kaplan–Meier curve

PIPET_KM() calculated the risk score of each sample based on the coefficients obtained from the multivariate Cox regression model. And then we divided the samples into high/low risk groups according to the median value of the risk score. Finally, the KM survival curve is drawn based on the risk group. As shown in the figure below, patients in the high-risk group have worse prognosis than patients in the low-risk group.

PIPET_KM(data=survival_info_df, genes=genes,coxph=multi_COX, surv_cols=c("fustat","futime"),
         xlim=100,legend.title="TCGA-LAML", legend.labs = c("HRisk", "LRisk"),
         pval = TRUE, conf.int = TRUE,  palette = "simpsons")

Summary

Above, we introduced the main functions included in the PIPET package. It can help you integrate information from bulk data and single-cell data, and perform multi-omics analysis. The package PIPET can not only be used to predict relevant cell subpopulations in single cells, but also provides several visualization functions for enriching downstream analyses. If you have any questions about PIPET or suggestions for improving the package, please posted them to the Issue section of GitHub at https://github.com/ruan2ruan/PIPET/issues, or email them to . Any useful suggestions will help us further improve PIPET and provide more functions.

Session Information

sessionInfo()
#> R version 4.3.1 (2023-06-16 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19045)
#> 
#> Matrix products: default
#> 
#> 
#> locale:
#> [1] LC_COLLATE=Chinese (Simplified)_China.utf8 
#> [2] LC_CTYPE=Chinese (Simplified)_China.utf8   
#> [3] LC_MONETARY=Chinese (Simplified)_China.utf8
#> [4] LC_NUMERIC=C                               
#> [5] LC_TIME=Chinese (Simplified)_China.utf8    
#> 
#> time zone: Asia/Shanghai
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] parallel  stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] survminer_0.4.9         ggpubr_0.6.0            survival_3.5-5         
#>  [4] cowplot_1.1.1           RColorBrewer_1.1-3      Matrix_1.6-4           
#>  [7] matrixStats_1.0.0       pbmc3k.SeuratData_3.1.4 SeuratData_0.2.2       
#> [10] SeuratObject_5.0.1      Seurat_4.3.0            patchwork_1.1.2        
#> [13] PIPET_0.1.0             lubridate_1.9.2         forcats_1.0.0          
#> [16] stringr_1.5.0           dplyr_1.1.2             purrr_1.0.1            
#> [19] readr_2.1.4             tidyr_1.3.0             tibble_3.2.1           
#> [22] ggplot2_3.4.4           tidyverse_2.0.0        
#> 
#> loaded via a namespace (and not attached):
#>   [1] RcppAnnoy_0.0.21            splines_4.3.1              
#>   [3] later_1.3.1                 bitops_1.0-7               
#>   [5] polyclip_1.10-4             lifecycle_1.0.3            
#>   [7] rstatix_0.7.2               globals_0.16.2             
#>   [9] lattice_0.21-8              MASS_7.3-60                
#>  [11] backports_1.4.1             magrittr_2.0.3             
#>  [13] limma_3.56.2                plotly_4.10.2              
#>  [15] sass_0.4.7                  rmarkdown_2.23             
#>  [17] jquerylib_0.1.4             yaml_2.3.7                 
#>  [19] httpuv_1.6.11               sctransform_0.3.5          
#>  [21] spam_2.9-1                  sp_2.0-0                   
#>  [23] spatstat.sparse_3.0-2       reticulate_1.30            
#>  [25] pbapply_1.7-2               abind_1.4-5                
#>  [27] zlibbioc_1.46.0             Rtsne_0.16                 
#>  [29] GenomicRanges_1.52.0        BiocGenerics_0.46.0        
#>  [31] RCurl_1.98-1.12             rappdirs_0.3.3             
#>  [33] GenomeInfoDbData_1.2.10     IRanges_2.34.1             
#>  [35] KMsurv_0.1-5                S4Vectors_0.38.1           
#>  [37] ggrepel_0.9.3               irlba_2.3.5.1              
#>  [39] listenv_0.9.0               spatstat.utils_3.0-3       
#>  [41] goftest_1.2-3               spatstat.random_3.1-5      
#>  [43] fitdistrplus_1.1-11         parallelly_1.36.0          
#>  [45] commonmark_1.9.0            leiden_0.4.3               
#>  [47] codetools_0.2-19            DelayedArray_0.26.7        
#>  [49] xml2_1.3.5                  ggtext_0.1.2               
#>  [51] prettydoc_0.4.1             tidyselect_1.2.0           
#>  [53] farver_2.1.1                stats4_4.3.1               
#>  [55] spatstat.explore_3.2-1      jsonlite_1.8.7             
#>  [57] ellipsis_0.3.2              progressr_0.13.0           
#>  [59] ggridges_0.5.4              tools_4.3.1                
#>  [61] ica_1.0-3                   Rcpp_1.0.11                
#>  [63] glue_1.6.2                  gridExtra_2.3              
#>  [65] xfun_0.39                   DESeq2_1.40.2              
#>  [67] MatrixGenerics_1.12.3       GenomeInfoDb_1.36.1        
#>  [69] withr_2.5.0                 fastmap_1.1.1              
#>  [71] fansi_1.0.4                 digest_0.6.33              
#>  [73] timechange_0.2.0            R6_2.5.1                   
#>  [75] mime_0.12                   colorspace_2.1-0           
#>  [77] scattermore_1.2             tensor_1.5                 
#>  [79] markdown_1.8                spatstat.data_3.0-1        
#>  [81] ggsci_3.0.0                 utf8_1.2.3                 
#>  [83] generics_0.1.3              data.table_1.14.8          
#>  [85] httr_1.4.6                  htmlwidgets_1.6.2          
#>  [87] S4Arrays_1.0.5              uwot_0.1.16                
#>  [89] pkgconfig_2.0.3             gtable_0.3.3               
#>  [91] lmtest_0.9-40               XVector_0.40.0             
#>  [93] survMisc_0.5.6              htmltools_0.5.5            
#>  [95] carData_3.0-5               dotCall64_1.0-2            
#>  [97] scales_1.2.1                Biobase_2.60.0             
#>  [99] png_0.1-8                   knitr_1.43                 
#> [101] km.ci_0.5-6                 rstudioapi_0.15.0          
#> [103] tzdb_0.4.0                  reshape2_1.4.4             
#> [105] nlme_3.1-162                cachem_1.0.8               
#> [107] zoo_1.8-12                  KernSmooth_2.23-22         
#> [109] miniUI_0.1.1.1              pillar_1.9.0               
#> [111] grid_4.3.1                  vctrs_0.6.3                
#> [113] RANN_2.6.1                  promises_1.2.0.1           
#> [115] car_3.1-2                   xtable_1.8-4               
#> [117] cluster_2.1.4               evaluate_0.21              
#> [119] cli_3.6.1                   locfit_1.5-9.8             
#> [121] compiler_4.3.1              rlang_1.1.1                
#> [123] crayon_1.5.2                future.apply_1.11.0        
#> [125] ggsignif_0.6.4              labeling_0.4.2             
#> [127] plyr_1.8.8                  stringi_1.7.12             
#> [129] viridisLite_0.4.2           deldir_1.0-9               
#> [131] BiocParallel_1.34.2         munsell_0.5.0              
#> [133] lazyeval_0.2.2              spatstat.geom_3.2-4        
#> [135] hms_1.1.3                   future_1.33.0              
#> [137] shiny_1.7.4.1               highr_0.10                 
#> [139] SummarizedExperiment_1.30.2 ROCR_1.0-11                
#> [141] gridtext_0.1.5              igraph_1.5.0.1             
#> [143] broom_1.0.5                 bslib_0.5.0

References

  1. Stoeckius, M., Hafemeister, C., Stephenson, W. et al. Simultaneous epitope and transcriptome measurement in single cells. Nat Methods. 14, 865–868 (2017).

  2. Hao, Y., Hao, S., Andersen-Nissen, E. et al. Integrated analysis of multimodal single-cell data. Cell. 13, 3573-3587 (2021).

  3. Mo, Q., Yun, S., Sallman, D.A. et al. Integrative molecular subtypes of acute myeloid leukemia. Blood Cancer J. 13, 71 (2023).