1 Majora

Step 1 Assignment Aaron

Version Changes

Revised. Amendments from Version 1

This version of the workflow contains a number of improvements based on the referees' comments. We have re-compiled the workflow using the latest packages from Bioconductor release 3.4, and stated more explicitly the dependence on these package versions. We have added a reference to the Bioconductor workflow page, which provides user-friendly instructions for installation and execution of the workflow. We have also moved cell cycle classification before gene filtering as this provides more precise cell cycle phase classifications. Some minor rewording and elaborations have also been performed in various parts of the article.


Single-cell RNA sequencing (scRNA-seq) is widely used to profile the transcriptome of individual cells. This provides biological resolution that cannot be matched by bulk RNA sequencing, at the cost of increased technical noise and data complexity. The differences between scRNA-seq and bulk RNA-seq data mean that the analysis of the former cannot be performed by recycling bioinformatics pipelines for the latter. Rather, dedicated single-cell methods are required at various steps to exploit the cellular resolution while accounting for technical noise. This article describes a computational workflow for low-level analyses of scRNA-seq data, based primarily on software packages from the open-source Bioconductor project. It covers basic steps including quality control, data exploration and normalization, as well as more complex procedures such as cell cycle phase assignment, identification of highly variable and correlated genes, clustering into subpopulations and marker gene detection. Analyses were demonstrated on gene-level count data from several publicly available datasets involving haematopoietic stem cells, brain-derived cells, T-helper cells and mouse embryonic stem cells. This will provide a range of usage scenarios from which readers can construct their own analysis pipelines.

Keywords: Single cell, RNA-seq, bioinformatics, Bioconductor, workflow


Single-cell RNA sequencing (scRNA-seq) is widely used to measure the genome-wide expression profile of individual cells. From each cell, mRNA is isolated and reverse transcribed to cDNA for high-throughput sequencing ( Stegle et al., 2015). This can be done using microfluidics platforms like the Fluidigm C1 ( Pollen et al., 2014), protocols based on microtiter plates like Smart-seq2 ( Picelli et al., 2014), or droplet-based technologies like inDrop ( Klein et al., 2015; Macosko et al., 2015). The number of reads mapped to each gene is then used to quantify its expression in each cell. Alternatively, unique molecular identifiers (UMIs) can be used to directly measure the number of transcript molecules for each gene ( Islam et al., 2014). Count data are analyzed to detect highly variable genes (HVGs) that drive heterogeneity across cells in a population, to find correlations between genes and cellular phenotypes, or to identify new subpopulations via dimensionality reduction and clustering. This provides biological insights at a single-cell resolution that cannot be achieved with conventional bulk RNA sequencing of cell populations.

Strategies for scRNA-seq data analysis differ markedly from those for bulk RNA-seq. One technical reason is that scRNA-seq data are much noisier than bulk data ( Brennecke et al., 2013; Marinov et al., 2014). Reliable capture (i.e., conversion) of transcripts into cDNA for sequencing is difficult with the low quantity of RNA in a single cell. This increases the frequency of drop-out events where none of the transcripts for a gene are captured. Dedicated steps are required to deal with this noise during analysis, especially during quality control. In addition, scRNA-seq data can be used to study cell-to-cell heterogeneity, e.g., to identify new cell subtypes, to characterize differentiation processes, to assign cells into their cell cycle phases, or to identify HVGs driving variability across the population ( Fan et al., 2016; Trapnell et al., 2014; Vallejos et al., 2015). This is simply not possible with bulk data, meaning that custom methods are required to perform these analyses.

This article describes a computational workflow for basic analysis of scRNA-seq data, using software packages from the open-source Bioconductor project (release 3.4) ( Huber et al., 2015). Starting from a count matrix, this workflow contains the steps required for quality control to remove problematic cells; normalization of cell-specific biases, with and without spike-ins; cell cycle phase classification from gene expression data; data exploration to identify putative subpopulations; and finally, HVG and marker gene identification to prioritize interesting genes. The application of different steps in the workflow will be demonstrated on several public scRNA-seq datasets involving haematopoietic stem cells, brain-derived cells, T-helper cells and mouse embryonic stem cells, generated with a range of experimental protocols and platforms ( Buettner et al., 2015; Kolodziejczyk et al., 2015; Wilson et al., 2015; Zeisel et al., 2015). The aim is to provide a variety of modular usage examples that can be applied to construct custom analysis pipelines.

Analysis of haematopoietic stem cells


To introduce most of the concepts of scRNA-seq data analysis, we use a relatively simple dataset from a study of haematopoietic stem cells (HSCs) ( Wilson et al., 2015). Single mouse HSCs were isolated into microtiter plates and libraries were prepared for 96 cells using the Smart-seq2 protocol. A constant amount of spike-in RNA from the External RNA Controls Consortium (ERCC) was also added to each cell’s lysate prior to library preparation. High-throughput sequencing was performed and the expression of each gene was quantified by counting the total number of reads mapped to its exonic regions. Similarly, the quantity of each spike-in transcript was measured by counting the number of reads mapped to the spike-in reference sequences. Counts for all genes/transcripts in each cell were obtained from the NCBI Gene Expression Omnibus (GEO) as a supplementary file under the accession number GSE61533 ( http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE61533).

For simplicity, we forego a description of the read processing steps required to generate the count matrix, i.e., read alignment and counting into features. These steps have been described in some detail elsewhere ( Chen et al., 2016; Love et al., 2015), and are largely the same for bulk and single-cell data. The only additional consideration is that the spike-in information must be included in the pipeline. Typically, spike-in sequences can be included as additional FASTA files during genome index building prior to alignment, while genomic intervals for both spike-in transcripts and endogenous genes can be concatenated into a single GTF file prior to counting. For users favouring an R-based approach to read alignment and counting, we suggest using the methods in the Rsubread package ( Liao et al., 2013; Liao et al., 2014). Alternatively, rapid quantification of expression with alignment-free methods such as kallisto ( Bray et al., 2016) or Salmon ( Patro et al., 2015) can be performed using the functions and in the scater package.

Count loading

The first task is to load the count matrix into memory. In this case, some work is required to retrieve the data from the Gzip-compressed Excel format. Each row of the matrix represents an endogenous gene or a spike-in transcript, and each column represents a single HSC. For convenience, the counts for spike-in transcripts and endogenous genes are stored in a object from the scater package ( McCarthy et al., 2016).

library(R.utils)gunzip("GSE61533_HTSEQ_count_results.xls.gz",remove=FALSE,overwrite=TRUE)library(gdata)all.counts <-read.xls(’GSE61533_HTSEQ_count_results.xls’,sheet=1,header=TRUE,row.names=1)library(scater)sce <-newSCESet(countData=all.counts)dim(sce)## Features Samples ## 38498 96

We identify the rows corresponding to ERCC spike-ins and mitochondrial genes. For this dataset, this information can be easily extracted from the row names. In general, though, identifying mitochondrial genes from standard identifiers like Ensembl requires extra annotation (this will be discussed later in more detail).

is.spike <-grepl("^ERCC",rownames(sce))is.mito <-grepl("^mt-",rownames(sce))

For each cell, we calculate quality control metrics such as the total number of counts or the proportion of counts in mitochondrial genes or spike-in transcripts. These are stored in the of the for future reference.

sce <-calculateQCMetrics(sce,feature_controls=list(ERCC=is.spike,Mt=is.mito))head(colnames(pData(sce)))## [1] "total_counts" "log10_total_counts" "filter_on_total_counts" ## [4] "total_features" "log10_total_features" "filter_on_total_features"

We need to explicitly indicate that the ERCC set is, in fact, a spike-in set. This is necessary as spike-ins require special treatment in some downstream steps such as variance estimation and normalization. We do this by supplying the name of the spike-in set to .

library(scran)isSpike(sce) <-"ERCC"

Quality control on the cells

Low-quality cells need to be removed to ensure that technical effects do not distort downstream analysis results. Two common measures of cell quality are the library size and the number of expressed features in each library. The library size is defined as the total sum of counts across all features, i.e., genes and spike-in transcripts. Cells with relatively small library sizes are considered to be of low quality as the RNA has not been efficiently captured (i.e., converted into cDNA and amplified) during library preparation. The number of expressed features in each cell is defined as the number of features with non-zero counts for that cell. Any cell with very few expressed genes is likely to be of poor quality as the diverse transcript population has not been successfully captured. The distributions of both of these metrics are shown in Figure 1.

Figure 1.

Histograms of library sizes (left) and number of expressed genes (right) for all cells in the HSC dataset.

par(mfrow=c(1,2))hist(sce$total_counts/1e6,xlab="Library sizes (millions)",main="",breaks=20,col="grey80",ylab="Number of cells")hist(sce$total_features,xlab="Number of expressed genes",main="",breaks=20,col="grey80",ylab="Number of cells")

Picking a threshold for these metrics is not straightforward as their absolute values depend on the protocol and biological system. For example, sequencing to greater depth will lead to more reads, regardless of the quality of the cells. To obtain an adaptive threshold, we assume that most of the dataset consists of high-quality cells. We remove cells with log-library sizes that are more than 3 median absolute deviations (MADs) below the median log-library size. (A log-transformation improves resolution at small values, especially when the MAD of the raw values is comparable to or greater than the median.) We also remove cells where the log-transformed number of expressed genes is 3 MADs below the median.

libsize.drop <-isOutlier(sce$total_counts,nmads=3,type="lower",log=TRUE)feature.drop <-isOutlier(sce$total_features,nmads=3,type="lower",log=TRUE)

Another measure of quality is the proportion of reads mapped to genes in the mitochondrial genome. High proportions are indicative of poor-quality cells ( Ilicic et al., 2016; Islam et al., 2014), possibly because of increased apoptosis and/or loss of cytoplasmic RNA from lysed cells. Similar reasoning applies to the proportion of reads mapped to spike-in transcripts. The quantity of spike-in RNA added to each cell should be constant, which means that the proportion should increase upon loss of endogenous RNA in low-quality cells. The distributions of mitochondrial and spike-in proportions across all cells are shown in Figure 2.

Figure 2.

Histogram of the proportion of reads mapped to mitochondrial genes (left) or spike-in transcripts (right) across all cells in the HSC dataset.

par(mfrow=c(1,2))hist(sce$pct_counts_feature_controls_Mt,xlab="Mitochondrial proportion (%)",ylab="Number of cells",breaks=20,main="",col="grey80")hist(sce$pct_counts_feature_controls_ERCC,xlab="ERCC proportion (%)",ylab="Number of cells",breaks=20,main="",col="grey80")

Again, the ideal threshold for these proportions depends on the cell type and the experimental protocol. Cells with more mitochondria or more mitochondrial activity may naturally have larger mitochondrial proportions. Similarly, cells with more endogenous RNA or that are assayed with protocols using less spike-in RNA will have lower spike-in proportions. If we assume that most cells in the dataset are of high quality, then the threshold can be set to remove any large outliers from the distribution of proportions. We use the MAD-based definition of outliers to remove putative low-quality cells from the dataset.

mito.drop <-isOutlier(sce$pct_counts_feature_controls_Mt,nmads=3,type="higher")spike.drop <-isOutlier(sce$pct_counts_feature_controls_ERCC,nmads=3,type="higher")

Subsetting by column will retain only the high-quality cells that pass each filter described above. We examine the number of cells removed by each filter as well as the total number of retained cells. Removal of a substantial proportion of cells (> 10%) may be indicative of an overall issue with data quality. It may also reflect genuine biology in extreme cases (e.g., low numbers of expressed genes in erythrocytes) for which the filters described here are inappropriate.

sce <- sce[,!(libsize.drop | feature.drop | mito.drop | spike.drop)]data.frame(ByLibSize=sum(libsize.drop),ByFeature=sum(feature.drop),ByMito=sum(mito.drop),BySpike=sum(spike.drop),Remaining=ncol(sce))## ByLibSize ByFeature ByMito BySpike Remaining ## Samples 2 2 6 3 86

An alternative approach to quality control is to perform a principal components analysis (PCA) based on the quality metrics for each cell, e.g., the total number of reads, the total number of features and the proportion of mitochondrial or spike-in reads. Outliers on a PCA plot may be indicative of low-quality cells that have aberrant technical properties compared to the (presumed) majority of high-quality cells. In Figure 3, no obvious outliers are present, which is consistent with the removal of suspect cells in the preceding quality control steps.

Figure 3.

PCA plot for cells in the HSC dataset, constructed using quality metrics.

fontsize <-theme(axis.text=element_text(size=12),axis.title=element_text(size=16))plotPCA(sce,pca_data_input="pdata") + fontsize

Methods like PCA-based outlier detection and support vector machines can provide more power to distinguish low-quality cells from high-quality counterparts ( Ilicic et al., 2016). This is because they are able to detect subtle patterns across many quality metrics simultaneously. However, this comes at some cost to interpretability, as the reason for removing a given cell may not always be obvious. Thus, for this workflow, we will use the simple approach whereby each quality metric is considered separately. Users interested in the more sophisticated approaches are referred to the scater and cellity packages.

Classification of cell cycle phase

We use the prediction method described by Scialdone et al. (2015) to classify cells into cell cycle phases based on the gene expression data. Using a training dataset, the sign of the difference in expression between two genes was computed for each pair of genes. Pairs with changes in the sign across cell cycle phases were chosen as markers. Cells in a test dataset can then be classified into the appropriate phase, based on whether the observed sign for each marker pair is consistent with one phase or another. This approach is implemented in the function using a pre-trained set of marker pairs for mouse data. The result of phase assignment for each cell in the HSC dataset is shown in Figure 4. (Some additional work is necessary to match the gene symbols in the data to the Ensembl annotation in the pre-trained marker set.)

Figure 4.

Cell cycle phase scores from applying the pair-based classifier on the HSC dataset, where each point represents a cell.

mm.pairs <-readRDS(system.file("exdata","mouse_cycle_markers.rds",package="scran"))library(org.Mm.eg.db)anno <-select(org.Mm.eg.db,keys=rownames(sce),keytype="SYMBOL",column="ENSEMBL") ensembl <- anno$ENSEMBL[match(rownames(sce), anno$SYMBOL)] assignments <-cyclone(sce, mm.pairs,gene.names=ensembl)plot(assignments$score$G1, assignments$score$G2M,xlab="G1 score",ylab="G2/M score",pch=16)

Cells are classified as being in G1 phase if the G1 score is above 0.5 and greater than the G2/M score; in G2/M phase if the G2/M score is above 0.5 and greater than the G1 score; and in S phase if neither score is above 0.5. Here, the vast majority of cells are classified as being in G1 phase. We will focus on these cells in the downstream analysis. Cells in other phases are removed to avoid potential confounding effects from cell cycle-induced differences. Alternatively, if a non-negligible number of cells are in other phases, we can use the assigned phase as a blocking factor in downstream analyses. This protects against cell cycle effects without discarding information.

sce <- sce[,assignments$phases=="G1"]

Pre-trained classifiers are available in scran for human and mouse data. While the mouse classifier used here was trained on data from embryonic stem cells, it is still accurate for other cell types ( Scialdone et al., 2015). This may be due to the conservation of the transcriptional program associated with the cell cycle ( Bertoli et al., 2013; Conboy et al., 2007). The pair-based method is also a non-parametric procedure that is robust to most technical differences between datasets. However, it will be less accurate for data that are substantially different from those used in the training set, e.g., due to the use of a different protocol. In such cases, users can construct a custom classifier from their own training data using the function. This will also be necessary for other model organisms where pre-trained classifiers are not available.

Filtering out low-abundance genes

Low-abundance genes are problematic as zero or near-zero counts do not contain enough information for reliable statistical inference ( Bourgon et al., 2010). In addition, the discreteness of the counts may interfere with downstream statistical procedures, e.g., by compromising the accuracy of continuous approximations. Here, low-abundance genes are defined as those with an average count below a filter threshold of 1. These genes are likely to be dominated by drop-out events ( Brennecke et al., 2013), which limits their usefulness in later analyses. Removal of these genes mitigates discreteness and reduces the amount of computational work without major loss of information.

ave.counts <-rowMeans(counts(sce))keep <- ave.counts >=1sum(keep)## [1] 13965

To check whether the chosen threshold is suitable, we examine the distribution of log-means across all genes ( Figure 5). The peak represents the bulk of moderately expressed genes while the rectangular component corresponds to lowly expressed genes. The filter threshold should cut the distribution at some point along the rectangular component to remove the majority of low-abundance genes.

Figure 5.

Histogram of log-average counts for all genes in the HSC dataset.

hist(log10(ave.counts),breaks=100,main="",col="grey80",xlab=expression(Log[10] ~"average count"))abline(v=log10(1),col="blue",lwd=2,lty=2)

We also look at the identities of the most highly expressed genes ( Figure 6). This should generally be dominated by constitutively expressed transcripts, such as those for ribosomal or mitochondrial proteins. The presence of other classes of features may be cause for concern if they are not consistent with expected biology. For example, a top set containing many spike-in transcripts suggests that too much spike-in RNA was added during library preparation, while the absence of ribosomal proteins and/or the presence of their pseudogenes are indicative of suboptimal alignment.

Figure 6.

Percentage of total counts assigned to the top 50 most highly-abundant features in the HSC dataset.

plotQC(sce,type ="highest-expression",n=50) + fontsize

An alternative approach to gene filtering is to select genes that have non-zero counts in at least n cells. This provides some more protection against genes with outlier expression patterns, i.e., strong expression in only one or two cells. Such outliers are typically uninteresting as they can arise from amplification artifacts that are not replicable across cells. (The exception is for studies involving rare cells where the outliers may be biologically relevant.) An example of this filtering approach is shown below for n set to 10, though smaller values may be necessary to retain genes expressed in rare cell types.

numcells <-nexprs(sce,byrow=TRUE)alt.keep <- numcells >=10sum(alt.keep)## [1] 11988

The relationship between the number of expressing cells and the mean is shown in Figure 7. The two statistics tend to be well-correlated so filtering on either should give roughly similar results.

Figure 7.

Number of expressing cells against the log-mean expression for each gene in the HSC dataset.

smoothScatter(log10(ave.counts), numcells,xlab=expression(Log[10] ~"average count"), ylab="Number of expressing cells")is.ercc <-isSpike(sce,type="ERCC")points(log10(ave.counts[is.ercc]), numcells[is.ercc],col="red",pch=16,cex=0.5)

In general, we prefer the mean-based filter as it tends to be less aggressive. A gene will be retained as long as it has sufficient expression in any subset of cells. Genes expressed in fewer cells require higher levels of expression in those cells to be retained, but this is not undesirable as it avoids selecting uninformative genes (with low expression in few cells) that contribute little to downstream analyses, e.g., HVG detection or clustering. In contrast, the “at least n” filter depends heavily on the choice of n. With n = 10, a gene expressed in a subset of 9 cells would be filtered out, regardless of the level of expression in those cells. This may result in the failure to detect rare subpopulations that are present at frequencies below n. While the mean-based filter will retain more outlier-driven genes, this can be handled by choosing methods that are robust to outliers in the downstream analyses.

Thus, we apply the mean-based filter to the data by subsetting the object as shown below. This removes all rows corresponding to endogenous genes or spike-in transcripts with abundances below the specified threshold.

sce <- sce[keep,]

Normalization of cell-specific biases

Using the deconvolution method to deal with zero counts. Read counts are subject to differences in capture efficiency and sequencing depth between cells ( Stegle et al., 2015). Normalization is required to eliminate these cell-specific biases prior to downstream quantitative analyses. This is often done by assuming that most genes are not differentially expressed (DE) between cells. Any systematic difference in count size across the non-DE majority of genes between two cells is assumed to represent bias and is removed by scaling. More specifically, “size factors” are calculated that represent the extent to which counts should be scaled in each library.

Size factors can be computed with several different approaches, e.g., using the function in the DESeq2 package ( Anders & Huber, 2010; Love et al., 2014), or with the function ( Robinson & Oshlack, 2010) in the edgeR package. However, single-cell data can be problematic for these bulk data-based methods due to the dominance of low and zero counts. To overcome this, we pool counts from many cells to increase the count size for accurate size factor estimation ( Lun et al., 2016). Pool-based size factors are then “deconvolved” into cell-based factors for cell-specific normalization.

sce <-computeSumFactors(sce,sizes=c(20,40,60,80))summary(sizeFactors(sce))## Min. 1st Qu. Median Mean 3rd Qu. Max.## 0.4161 0.8055 0.9434 1.0000 1.1890 1.8410

In this case, the size factors are tightly correlated with the library sizes for all cells ( Figure 8). This suggests that the systematic differences between cells are primarily driven by differences in capture efficiency or sequencing depth. Any DE between cells would yield a non-linear trend between the total count and size factor, and/or increased scatter around the trend. This does not occur here as strong DE is unlikely to exist within a homogeneous population of cells.

Figure 8.

Size factors from deconvolution, plotted against library sizes for all cells in the HSC dataset.

plot(sizeFactors(sce), sce$total_counts/1e6,log="xy",ylab="Library size (millions)",xlab="Size factor")

Computing separate size factors for spike-in transcripts. Size factors computed from the counts for endogenous genes are usually not appropriate for normalizing the counts for spike-in transcripts. Consider an experiment without library quantification, i.e., the amount of cDNA from each library is not equalized prior to pooling and multiplexed sequencing. Here, cells containing more RNA have greater counts for endogenous genes and thus larger size factors to scale down those counts. However, the same amount of spike-in RNA is added to each cell during library preparation. This means that the counts for spike-in transcripts are not subject to the effects of RNA content. Attempting to normalize the spike-in counts with the gene-based size factors will lead to over-normalization and incorrect quantification of expression. Similar reasoning applies in cases where library quantification is performed. For a constant total amount of cDNA, any increases in endogenous RNA content will suppress the coverage of spike-in transcripts. As a result, the bias in the spike-in counts will be opposite to that captured by the gene-based size factor.

To ensure normalization is performed correctly, we compute a separate set of size factors for the spike-in set. For each cell, the spike-in-specific size factor is defined as the total count across all transcripts in the spike-in set. This assumes that none of the spike-in transcripts are differentially expressed, which is reasonable given that the same amount and composition of spike-in RNA should have been added to each cell. (See below for a more detailed discussion on spike-in normalization.) These size factors are stored in a separate field of the object by setting in . This ensures that they will only be used with the spike-in transcripts but not the endogenous genes.

sce <-computeSpikeFactors(sce,type="ERCC",general.use=FALSE)

Applying the size factors to normalize gene expression. The count data are used to compute normalized log-expression values for use in downstream analyses. Each value is defined as the log-ratio of each count to the size factor for the corresponding cell, after adding a prior count of 1 to avoid undefined values at zero counts. Division by the size factor ensures that any cell-specific biases are removed. If spike-in-specific size factors are present in , they will be automatically applied to normalize the spike-in transcripts separately from the endogenous genes.

sce <-normalize(sce)

The log-transformation provides some measure of variance stabilization ( Law et al., 2014), so that high-abundance genes with large variances do not dominate downstream analyses. The computed values are stored as an matrix in addition to the other assay elements.

Checking for important technical factors

We check whether there are technical factors that contribute substantially to the heterogeneity of gene expression. If so, the factor may need to be regressed out to ensure that it does not inflate the variances or introduce spurious correlations. For this dataset, the simple experimental design means that there are no plate or batch effects to examine. Instead, we use the (log-transformed) total count for the spike-in transcripts as a proxy for the relative bias in each sample. This bias is purely technical in origin, given that the same amount of spike-in RNA should have been added to each cell. Thus, any association of gene expression with this factor is not biologically interesting and should be removed.

For each gene, we calculate the percentage of the variance of the expression values that is explained by the spike-in totals ( Figure 9). The percentages are generally small (1–3%), indicating that the expression profiles of most genes are not strongly associated with this factor. This result is consistent with successful removal of cell-specific biases by scaling normalization. Thus, the spike-in total does not need to be explicitly modelled in our downstream analyses.

Figure 9.

Density plot of the percentage of variance explained by the (log-transformed) total spike-in counts across all genes in the HSC dataset.

plotExplanatoryVariables(sce,variables=c("counts_feature_controls_ERCC","log10_counts_feature_controls_ERCC")) + fontsize

Note that the use of the spike-in total as an accurate proxy for the relative technical bias assumes that no library quantification is performed. Otherwise, the coverage of the spike-in transcripts would be dependent on the total amount of endogenous RNA in each cell. (Specifically, if the same amount of cDNA is used for sequencing per cell, any increase in the amount of endogenous RNA will suppress the coverage of the spike-in transcripts.) This means that the spike-in totals could be confounded with genuine biological effects associated with changes in RNA content.

Identifying HVGs from the normalized log-expression

We identify HVGs to focus on the genes that are driving heterogeneity across the population of cells. This requires estimation of the variance in expression for each gene, followed by decomposition of the variance into biological and technical components. HVGs are then identified as those genes with the largest biological components. This avoids prioritizing genes that are highly variable due to technical factors such as sampling noise during RNA capture and library preparation.

Ideally, the technical component would be estimated by fitting a mean-variance trend to the spike-in transcripts using the function. Recall that the same set of spike-ins was added in the same quantity to each cell. This means that the spike-in transcripts should exhibit no biological variability, i.e., any variance in their counts should be technical in origin. Given the mean abundance of a gene, the fitted value of the trend can be used as an estimate of the technical component for that gene. The biological component of the variance can then be calculated by subtracting the technical component from the total variance of each gene with the function.

In practice, this strategy is compromised by the small number of spike-in transcripts, the uneven distribution of their abundances and (for low numbers of cells) the imprecision of their variance estimates. This makes it difficult to accurately fit a complex mean-dependent trend to the spike-in variances. An alternative approach is to fit the trend to the variance estimates of the endogenous genes, using the setting as shown below. This assumes that the majority of genes are not variably expressed, such that the technical component dominates the total variance for those genes. The fitted value of the trend is then used as an estimate of the technical component. Obviously, this is the only approach that can be used if no spike-ins were added in the experiment.

var.fit <-trendVar(sce,trend="loess",use.spikes=FALSE,span=0.2)var.out <-decomposeVar(sce, var.fit)

We assess the suitability of the trend fitted to the endogenous variances by examining whether it is consistent with the spike-in variances ( Figure 10). The trend passes through or close to most of the spike-in variances, indicating that our assumption (that most genes have low levels of biological variability) is valid. This strategy exploits the large number of endogenous genes to obtain a stable trend, with the spike-in transcripts used as diagnostic features rather than in the trend fitting itself. However, if our assumption did not hold, we would instead fit the trend directly to the spike-in variances with the default . This sacrifices stability to reduce systematic errors in the estimate of the biological component for each gene. (In such cases, tinkering with the trend fitting parameters may yield a more stable curve – see for more details.)

Figure 10.

Variance of normalized log-expression values for each gene in the HSC dataset, plotted against the mean log-expression.

plot(var.out$mean, var.out$total,pch=16,cex=0.6,xlab="Mean log-expression",ylab="Variance of log-expression")o <-order(var.out$mean)lines(var.out$mean[o], var.out$tech[o],col="dodgerblue",lwd=2) cur.spike <-isSpike(sce)points(var.out$mean[cur.spike], var.out$total[cur.spike],col="red",pch=

Step One of AA: The Journey Begins

The first step of anything is a beginning, so the first step of the Alcoholics Anonymous 12 steps is the beginning of your recovery process. It’s actually really exciting, because it’s the first day of a new life. This is where the healing starts.

Doing the 12 steps is also referred to as “working” the steps, because it requires willingness, effort and action.  It is said the 12 steps of AA is compared to markers put out lovingly on a path by those who preceded us, to direct us on our journey. The journey can seem daunting from the perspective of a person at the beginning but fortunately all we are asked to do is to take one step at a time.

Step One: We admitted we were powerless over alcohol—that our lives had become unmanageable.

If lucky, our journey has taken us to arriving at a point of surrender. For some people the road they traveled getting to the first step in AA has been more than enough to convince them that unconditional surrender is the only option for recovery.

For a lot of people in recovery, walking into a treatment center or an AA meeting the first time is a major part of “working” step one. Your simple and humble act of asking for help is effectively an admission of powerlessness and unmanageability.

Most addicts are filled with guilt, shame, remorse, and self-loathing when they come into the rooms of AA. They’ve also gotten very used to keeping secrets from pretty much everyone, so opening up about the nature and extent of your alcoholic behavior is going against the grain. It may even feel completely unnatural and you probably don’t want to do it. But sharing your experience and the unmanageability lifts the burden of lugging them around in secret. Letting go of your secrets frees you up to move forward with a different, better life. For many people, the act of sharing Step One in an AA meeting is the true start of recovery.

However, becoming abstinent from alcohol will also be a requirement for starting to work the first step. The first step is all about looking at the effects of alcoholism in your life and for what is needed to be clean: to find a way to stop the behaviors with a perspective that isn’t clouded by alcohol. If you’ve been clean for a while, then the first step is about powerlessness over behaviors that make your life unmanageable.

Step One: Doing The “Work”

There are a lot of things alcoholics can do to fully work Step One. Most of the work is designed to unearth your complete history of use and abuse.

Inventories are a great way to work the steps—even starting with Step One. You can make a few lists:

  • A Consequences List: The easiest way to break through the fog of addiction is to create a list of consequences related to the behavior.
  • Powerlessness List: Go for as many examples of your powerlessness over your addictive behavior as possible. Be as fearlessly honest as you can, starting with early examples and ending with the most recent. (A note on “Powerlessness” this is used to exemplify the cravings in an alcoholic [or any addict] that are so intense that the ability to resist is almost impossible. Once an alcoholic takes a drink, a chemical reaction occurs within that body, setting off an intense craving for more.)
  • Unmanageability List: Write out the ways in which your addiction has created chaos and destruction in your life.

Here are some other really great questions to ask yourself while doing Step One:

  • What does the disease of addiction mean to me?
  • How has my disease affected me physically? Mentally? Spiritually? Emotionally? Financially?
  • How does the self-centered part of my disease affect my life and the life of those around me?
  • Have I blamed other people for my behavior?
  • Have I compared my addiction with other people’s addictions?
  • What does unmanageability mean to me?
  • What troubles have been caused because of my addiction?
  • Have I used alcohol or drugs to change or suppress my feelings?
  • What reservations am I still holding onto?
  • Do I accept that I’ll never regain “control” over drinking, even after a long period without use?
  • What could my life be like if I surrendered completely?
  • Am I WILLING: to follow a sponsor’s direction, go to meetings regularly and give recovery my best effort?
  • Have I made peace with the fact that I’m an alcoholic and that I’ll have to do things to stay clean?

Responsibility & Acceptance in AA

For each and every one of the millions of success stories in AA you will hear repeatedly about responsibility. It is our responsibility to stay involved in sobriety and follow our sponsor’s suggestions. It is our responsibility to actively cultivate and grow willingness. It’s important to grasp that you are not “powerless” completely: you do have the power to engage in a program of recovery, the power to choose not to abuse substances….but you are powerless over drugs and alcohol if you put them in your body.

Acceptance comes when we feel a profound sense of hope and peace in coming to terms with our addiction and recovery. We don’t dread a future of meeting attendance, sponsor contact and step work; instead we begin to see recovery is a precious gift.

It has been my experience that doing the steps has brought me serenity and the welcome realization that AA is not just a program where sick people get well—it is a way of living that is rewarding in and of itself.

Lot’s of people find that once they do Step One, that all manner of help appears! I feel grateful to have a program that expands as I grow. Having a firm foundation in recovery through steps has also given me a welcoming fellowship to accompany me in my journey.

Leave a Comment


Your email address will not be published. Required fields are marked *