Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

Diffusion maps for high-dimensional single-cell analysis of differentiation data

Diffusion maps for high-dimensional single-cell analysis of differentiation data Motivation: Single-cell technologies have recently gained popularity in cellular differentiation stud- ies regarding their ability to resolve potential heterogeneities in cell populations. Analyzing such high-dimensional single-cell data has its own statistical and computational challenges. Popular multi- variate approaches are based on data normalization, followed by dimension reduction and clustering to identify subgroups. However, in the case of cellular differentiation, we would not expect clear clus- ters to be present but instead expect the cells to follow continuous branching lineages. Results: Here, we propose the use of diffusion maps to deal with the problem of defining differenti- ation trajectories. We adapt this method to single-cell data by adequate choice of kernel width and inclusion of uncertainties or missing measurement values, which enables the establishment of a pseudotemporal ordering of single cells in a high-dimensional gene expression space. We expect this output to reflect cell differentiation trajectories, where the data originates from intrinsic diffusion-like dynamics. Starting from a pluripotent stage, cells move smoothly within the tran- scriptional landscape towards more differentiated states with some stochasticity along their path. We demonstrate the robustness of our method with respect to extrinsic noise (e.g. measurement noise) and sampling density heterogeneities on simulated toy data as well as two single-cell quantitative polymerase chain reaction datasets (i.e. mouse haematopoietic stem cells and mouse embryonic stem cells) and an RNA-Seq data of human pre-implantation embryos. We show that diffusion maps perform considerably better than Principal Component Analysis and are advanta- geous over other techniques for non-linear dimension reduction such as t-distributed Stochastic Neighbour Embedding for preserving the global structures and pseudotemporal ordering of cells. Availability and implementation: The Matlab implementation of diffusion maps for single-cell data is available at https://www.helmholtz-muenchen.de/icb/single-cell-diffusion-map. Contact: fbuettner.phys@gmail.com, fabian.theis@helmholtz-muenchen.de Supplementary information: Supplementary data are available at Bioinformatics online. 1 Introduction reveal the properties of the heterogeneous population of cells at vari- ous stages of development. Purifying for a certain cell type or syn- The advantages of single-cell measurements to various biological re- chronizing cells is experimentally challenging. Moreover, stem cell search fields have become obvious in recent years. One example is populations that have been functionally characterized often show the stem cell studies for which population measurements fail to V The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com 2989 2990 L.Haghverdi et al. heterogeneity in their cellular and molecular properties (Dykstra heterogeneities and addresses the issues relating to missing values et al., 2007; Huang, 2009; Stingl et al., 2006). To overcome these and uncertainties of measurement, barriers, on the one hand researchers conduct continuous single-cell propose a criterion for selecting the scale parameter in a diffusion observation using time-lapse microscopy (Park et al., 2014; Rieger map, et al., 2009; Schroeder, 2011), accompanied by single-cell tracking evaluate the performance of the diffusion map and its robustness and analysis tools. However, this approach is still limited as the ex- to noise and density heterogeneities using a toy model that pression of very few genes (typically one to three) could be observed. mimics the dynamics of differentiation, On the other hand, with the advent of new technologies, such as sin- apply the adapted diffusion map algorithm to two typical qPCR gle-cell qPCR (Wilhelm and Pingoud, 2003) or RNA-Seq (Chu and and one RNA-Seq datasets and show that it captures the differ- Corey, 2012) and flow or mass cytometry (Bandura et al., 2009; entiation dynamics more accurately than other algorithms. Chattopadhyay et al., 2006), it is now possible to measure hundreds to thousands of genes from thousands of single cells at different spe- 2 Methods cific experimental time points (time-course experiments). However, several single cells measured at the same experimental time point 2.1 Diffusion maps may be at different developmental stages. Therefore, there is a need Let n be the number of cells and let G be the number of genes meas- for computational methods which resolve the hidden temporal order ured for each cell. Denote the set of all measured cells by X.We that reflects the ordering of developmental stages of differentiating allow each cell x to diffuse around its measured position x 2 R cells. through an isotropic Gaussian wave function, While differentiation has to be regarded as a non-linear continu- 1=4 2 ous process (Bendall et al., 2014; Buettner and Theis, 2012), stand- 2 jjx  xjj Y ðx Þ¼ exp  (1) 2 2 ard methods used for the analysis of high-dimensional gene pr r expression data are either based on linear methods such as principal 0 2 0 0 The normalization of Y ðx Þ is such that Y ðx Þdx ¼ 1. The component analysis (PCA) and independent components analysis x Gaussian width r determines the length scale over which each cell (ICA) [e.g. used as part of the monocle algorithm (Trapnell et al., can randomly diffuse. The transition probability from cell x to cell y 2014)] or they use clustering techniques that groups cells according is then defined by the interference of the two wave functions Y and to specific properties. Hierarchical clustering methods as used in Y . One can easily show that this interference product is another SPADE (Qiu et al., 2011) and t-SNE (Van der Maaten and Hinton, Gaussian (all prefactors cancel out): 2008) as used in viSNE (Amir et al., 2013) are examples of cluster- ing methods. However, as these methods are designed to detect dis- ! jjx  yjj 0 0 0 crete subpopulations, they usually do not preserve the continuous Y ðx ÞY ðx Þdx ¼ exp  (2) x y 2r trajectories of differentiation data. A more recently proposed algorithm Wanderlust (Bendall et al., 2014) incorporates the non- Hence, we can construct the n  n Markovian transition probability linearity and continuity concepts but provides a pseudotemporal matrix P for all pairs of cells in X as follows: ordering of cells only if the data comprise a single branch. Furthermore, in gene expression measurement techniques, there is 2 1 jjx  yjj P ¼ exp  (3) usually a detection limit at which lower expression levels and non- xy ZðxÞ 2r expressed genes are all reported at the same value. Buettner et al. (2014) suggested the use of a censoring noise model for PCA, jjx  yjj whereas for the other methods it is unclear how these uncertain or ZðxÞ¼ exp  (4) 2r y2X missing values are to be treated. A variety of other manifold learning methods including (Hessian) locally linear embedding (HLLE) At the position of each cell, ZðxÞ is the partition function which pro- (Donoho and Grimes, 2003) and Isomap (Tenenbaum et al., 2000) vides an estimate of the number of neighbours of x in a certain vol- exist in the machine-learning community and are discussed in detail ume defined by r. Hence, it can be interpreted as the density of cells in the discussion and conclusion section. at that proximity. Consequently, we redefine the density normalized Here, we propose diffusion maps (Coifman et al., 2005) as a tool transition probability matrix P as: for analyzing single-cell differentiation data. Diffusion maps use a jjxyjj distance metric (usually referred to as diffusion distance) conceptu- exp 1 2r ~ ~ P ¼ ; P ¼ 0 (5) ally relevant to how differentiation data is generated biologically, as xy xx ZðxÞZðyÞ ZðxÞ cells follow noisy diffusion-like dynamics in the course of taking sev- eral differentiation lineage paths. Diffusion maps preserve the non- 2 jjxyjj exp X 2 2r linear structure of data as a continuum and are robust to noise. ZðxÞ¼ (6) ZðxÞZðyÞ Furthermore, with density normalization, diffusion maps are resist- y2X=x ant to sampling density heterogeneities and can capture rare as well Because we are only interested in the transition probabilities be- as abundant populations. As a non-linear dimension-reduction tool, tween cells and not the on-cell potentials imposed by local densities, diffusion maps can be applied on single-cell omics data to perform we set the diagonal of P to zero and exclude y ¼ x from the sum in dimension-reduction and ordering of cells along the differentiation ~ ~ the partition function Z. For a large enough r, the matrix P defines path in a single step, thus providing insight to the dynamics of differ- an ergodic Markovian diffusion process on the data and has n entiation (or any other concept with continuous dynamics). In this ordered eigenvalues k ¼ 1 > k :::k with corresponding right 0 1 n1 article, we eigenvectors w :::w . 0 n1 propose an adaptation of diffusion maps for the analysis of The t-th power of P will present the transition probabilities be- single-cell data which is less affected by sampling density tween cells in a diffusion (random walk) process of length t. Noting Diffusion maps for high-dimensional single-cell analysis 2991 ~ ~ that P has the same eigenvectors as P, one can show that this transi- Markovian transition probability matrix (DC1 and DC2) are then tion probability can be represented as follows: used for low-dimensional representation and visualization of data. n1 t t ~ ~ P ¼ k w ðxÞw ðyÞZðyÞ (7) i i xy i i¼0 2.2 Accounting for missing and uncertain values The data generated from qPCR, RNA-Seq or cytometry experiments Each row of P can be viewed as a vector, which we represent as are very often prone to imperfections such as missing values or de- p ðx; Þ and consider as the feature representation (Shawe-Taylor tection limit thresholds. It is important to properly treat such uncer- and Cristianini, 2004) for each cell x. By computing the weighted L tainties of data (Buettner et al., 2014; McDavid et al., 2013). Our distance in the feature space, the diffusion distance D between two probabilistic interpretation of diffusion maps allows a straightfor- cells x and y is defined as follows: ward mechanism of handling missing and uncertain data measure- t t 2 ~ ~ X ments. First, we have to decompose the kernel into G components. ðP  P Þ xz yz 2 t t 2 D ðx; yÞ¼jjp ðx; Þ  p ðy; Þjj ¼ (8) t 1=Z Then, instead of a Gaussian, we can use any other wave function ZðzÞ that best represents our prior knowledge on the probability distribu- This diffusion distance can be expressed in terms of the eigenvectors tion of the missing or uncertain values, which then should be of P such that: square-normalized to ensure equal contribution of the G compo- nents. For example, for missing values and non-detects (measure- n1 2 2t 2 ments below the limit of detection), one might choose a uniform D ðx; yÞ¼ k ðw ðxÞ w ðyÞÞ (9) i i t i i¼1 distribution over the whole range of possible values. In the following, we describe how to account for the uncertainty The corresponding eigenvector to the largest eigenvalue k is a con- of non-detect measurements in qPCR data. The statistical subtleties stant vector w ¼ 1. Therefore, it only contributes a zero term to D of non-detect values in qPCR experiments have been systematically and is excluded from the spectral decomposition of D in studied by McDavid et al. (2013) for univariate models. In addition, Equation (9). That means the Euclidean distance of the cells in the for a multivariate PCA analysis, Buettner et al. (2014) proposed that firstleigenvector space represents an approximation of their diffu- different kernels be allowed in each dimension. For the diffusion sion distance D . Moreover, the eigenvalues of P determine the dif- map implementation, we assume any value between the detection fusion coefficients in the direction of the corresponding eigenvector. limit (M ) and a completely non-expressed (off) state of genes valued As real data usually lie on a lower dimensional manifold than the en- as M , is equally possible for the non-detect measurements. tire dimensions of space G, these diffusion coefficients drop to a Considering the kernel width formulated in the diffusion map wave noise level other than a few first (l) prominent directions. Therefore, functions, we assume an indicator wave function between M  r if there is a significant gap between the l-th and ðl þ 1Þ-th eigen- 1=2 and M þ r normalized by ðM  M þ 2rÞ . Thus, we have to 1 1 0 value, the sum up to the l-th term usually determines a good ap- calculate three different kinds of interference of wave functions: proximation for diffusion distances. Thus, for data visualization we The interference of two cells with definite measured values for select these eigenvectors and instead of the mathematical notation gene g is the standard Gaussian kernel (see Section 2.1): w, we call them diffusion components (DCs). Figure 1 presents a summary of diffusion map embedding. ð ðx  y Þ g g 0 0 0 Each cell is represented by a Gaussian wave function in the G- Y ðx ÞY ðx Þdx ¼ exp  ; x y g g g 2r dimensional gene space. On an adequate Gaussian width, the wave functions of neighbouring cells interfere with each other and form the interference of two cells both with non-detect values for gene g the diffusion paths along the (non-linear) data manifold in the high- is 1 (due to the square-normalization constraint): dimensional space. Hence, we construct the Markovian transition probability matrix, the elements of which are the transition proba- 0 0 0 Y ðx ÞY ðx Þdx ¼ 1; x y g g g bilities between all pairs of cells. The eigenfunctions of the Fig. 1. Schematic overview of diffusion maps embedding. (A) The n  G matrix representation of single-cell data consisting of four different cell types. The last column on the right side of the matrix (colour band) indicates the cell type for each cell. (B) Representation of each cell by a Gaussian in the G-dimensional gene space. Diffusion paths (continuous paths with relatively high-probability density) form on the data manifold as a result of interference of the Gaussians. The Probability density function is shown in the heat map. (C) The n  n Markovian transition probability matrix. (D) Data embedding on the first two eigenvectors of the Markovian transition matrix (DC1 and DC2) which correspond to the largest diffusion coefficients of the data manifold. The embedding shows the continuous flow of cells across four cell types; however, it does not suggest the putative time direction 2992 L.Haghverdi et al. the interference of a missing (non-detect) value to a definite meas- where we compute the average of log ðZðxÞÞ with consideration of ured value x is: density heterogeneities such that: ðlog ðZðxÞÞ  ð1=ZðxÞÞÞ 0 0 0 Y ðx ÞY ðx Þdx ¼ x x y g g g hlog ðZðxÞÞi ¼ X (12) ð1=ZðxÞÞ M1þr 1 2 ðx0  x Þ 1=4 g g pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffið Þ exp  dx 2 2 It is worth noting that this average density underestimates the real M  M þ 2r pr r M r 1 0 dimensionality of the structure due to the contribution of the cells 1=4 1 pr lying on the surface of the manifold. However, this does not affect ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi M  M þ 2r 8 1 0 our heuristic since the variation of hdi is our main interest rather than hdi itself. M  r  x M þ r  x 0 g 1 g erfc  erfc : Each time hdi reaches its maximum and starts to decrease, one r r can deduce that an intrinsically lower dimensional structure is For data with missing or uncertain values, we need to check the pair- emerging from the noise-enriched distributed cells in the original wise interference of the wave functions for each gene. The computa- high-dimensional space. Therefore several characteristic length tion time is thus proportional to the number of genes G for a fixed scales of the data manifold (i.e. width of its linear parts, radius of its number of cells n. Therefore, it might be preferable (especially in the curves, etc.) give rise to several local maxima in hdi. Such character- case of large G) to choose the wave function of the missing (or un- istic scales indeed make our choice for the Gaussian width r since certain) value also in the form of a Gaussian such that the multipli- they indicate the scale at which the Euclidean distances used in the cation of the G components of interference can be expressed as the Gaussian kernel are sensible in an assumed Euclidean tangent space sum of the exponents and the exponentiation step can be performed to the manifold. Although Euclidean distances are also valid for only once at the end of the algorithm for computation of the transi- smaller rs than the characteristic length scale, they are not recom- tion matrix. An implementation of this fast version of the censoring mended because smaller kernel width would mean less connectivity algorithm is also provided in the codes package. Supplementary in the cells graph which in turn results in an increased sensitivity to Figure S1 provides an illustration of our approach for accounting noise. Supplementary Figures S2 and S3 illustrate the resulting diffu- for missing and uncertain values. sion map on optimal kernel width and several other kernel widths values for a U-shaped toy data. Also the performance of diffusion map at the optimal kernel width when there is no distinguishable 2.3 Determination of Gaussian kernel width pattern in the data (e.g. normally distributed data in all dimensions The parameter r in Equation (1) determines the scale at which we or sparse data) is illustrated in the Supplementary Figure S4. visualize the data. If r is extremely small, most elements of the tran- sition probability matrix P will tend to be zero and we do not get an overall view of a connected graph structure. In fact, when r is too 2.4 Toy model for differentiation small, the number of degenerate eigenvectors with eigenvalue equal As toggle switches are known to play a role in differentiation to one, indicates the number of disconnected segments that P defines branching processes (Orkin and Zon, 2008), we designed a regula- on the data. For too large r however, the transition probability sen- tory network of three pairs of toggle genes to evaluate the perform- sitivity on the distance between the cells blurs. There is a certain ance of our method on a toy dataset that mimics a differentiation range of r variations for which P defines an ergodic diffusion pro- tree (Krumsiek et al., 2011). Assuming a genetic regulatory module cess on the data as a connected graph and still the diffusion distances as presented in Figure 2A, we simulated the stochastic differenti- between the cells are informative. ation process by the Gillespie algorithm (Gillespie, 1977) with the The un-normalized density at each cell (ZðxÞ in Equation (3))is reactions as shown in Figure 2B and C (Strasser et al., 2012). More proportional to the number of cells in a fixed volume in its neigh- details about the chemical reactions and the reaction rates used in bourhood and depends on r. At scales of r close to zero, cells do not the Gillespie algorithm model can be found in the supplement have any neighbours and their average density is 1 (because of the 1s (Supplementary Figs S5A and B). Genes G and G are antagonistic 1 2 on the diagonal of P). By increasing r, the average density gradually to each other through an inhibiting Hill function. Therefore, starting increases as more cells find other cells in their neighbourhood. There from an initial undifferentiated state where G and G are both in a 1 2 is a density saturation point where r reaches the system size and all very low expression level, single samples may end up in either of the cells form part of one neighbourhood. At this point, for every cell states where G or G is exclusively expressed. At this stage, the 1 2 x 2 X, the density ZðxÞ will be equal to the entire system size n. next pair of toggle genes in the differentiation hierarchy is activated Assuming that the density gradient is not extremely sharp along (through an activating binding Hill function), which are again an- the data manifold, the number of neighbours of cell x in the neigh- tagonistic to each other. This model generates four different types of bourhood r will be proportional to the volume of a hypersphere of fully differentiated cells in the six-dimensional space of genes. radius r, hence: To establish a steady state in the cell population, once a cell hits dðx;rÞ ZðxÞ/ r (10) the end of each branch, we remove it from the population and initi- ate a new cell at the original undifferentiated state. This approach where dðx; rÞ is the dimensionality of data manifold at the position maintains the population size of cells. After an extended simulation of cell x and at the scale r. By differentiating both sides with respect run, the steady state of the population is established and resembles to log ðrÞ, we find that the average dimensionality of the manifold the haemostatic state of (e.g. hematopoietic) stem cells in natural can be estimated by the slope of the log–log plot of the number of organisms. neighbours versus the length scale: We sampled cells from this toy model in two different sets, a bal- @hlog ðZðxÞÞi anced toy dataset, wherein 600 samples serve as a snapshot of the hdðrÞi ¼ (11) @log ðrÞ steady state of the system with no additional extrinsic noise, and an Diffusion maps for high-dimensional single-cell analysis 2993 Fig. 2. (A) Toy model of a differentiation regulatory network consisting of three pairs of antagonistic genes simulated by the Gillespie algorithm. The arrows show activation or inhibition interactions between genes. The toy model employs two classes of gene regulation: (B) G is connected to an inhibitor, its production 0 0 00 rate a is proportional to a Hill function of the concentration of the inhibitor Protein (C) G is connected to an inhibitor G and an activator G , its production rate i i i i i a is proportional to product of an inhibiting and an activating Hill function. The degradation rate c is constant for all proteins imbalanced toy dataset, wherein 1800 sample are derived from a to the performance of two other dimension-reduction methods PCA non-steady-state density distribution with heavier sampling density and t-SNE. Data embeddings with several other methods including þ þ on the G G branch. We also added an extrinsic Gaussian noise ICA, SPADE, kernel-PCA (Scho ¨ lkopf et al., 1998), isomap and 1 3 with a variance of 25% maximum expression to each gene. The Hessian Locally linear embedding (HLLE) are provided in the gene expression plot for a simulated single cell as it proceeds from Supplementary Figures S16–S20. the initial pluripotent state to a fully differentiated state is presented in the supplement (Supplementary Figs S5C and D). 3.1 Diffusion maps cope with high noise level and sampling density heterogeneity for toy data 2.5 Experimental data 3.1.1 Gaussian width determination of the toy data 2.5.1 qPCR data of mouse haematopoietic stem cells. We demonstrate the heuristic determination of r on balanced and We calculated a diffusion map embedding for the haematopoietic imbalanced toy datasets. The average dimensionality of the structure and progenitor stem cells dataset from Moignard et al. (2013). In this of some chosen characteristic length scale can be estimated by experiment, 597 cells from five different haematopoietic cell types, Equation (11). Figure 3 shows the average dimensionality hdi for namely, haematopoietic stem cell (HSC), lymphoid-primed multipo- balanced toy data (red) and imbalanced toy data (black) as a func- tent progenitor (LMPP), megakaryocyte-erythroid progenitor tion of log ðrÞ. The balanced set exhibits two maxima. The first one (PreMegE), common lymphoid progenitor (CLP) and granulocyte– arises at the length scale of the thickness of the differentiation monocyte progenitor (GMP) were gated by Fluorescence-activated branches which include only a few cells. At this r several subpopula- cell sorting (FACS) sorting. Single-cell qPCR expression level meas- tions form at the more densely populated stages of the steady state. urement was then performed for 24 genes. Housekeeping genes were The second maximum appears at a larger length scale when several only used for cell-cycle normalization, where for each cell, all expres- subpopulations become visible to each other and the continuous sion values were divided by the average expression of its housekeep- branches form. We picked the r at the second maximum for visual- ing genes. Furthermore we excluded the five housekeeping genes, as ization (data visualization at the first maximum is provided in the well as c-Kit, which is a stem-cell receptor factor expressed on the Supplementary Fig. S6). For the imbalanced set, however, due to the surface of all analyzed cells, from the diffusion map analysis. high noise level, the first maximum vanished and we only detected one maximum which we then used for the visualization. 2.5.2 qPCR data of mouse stem cells from zygote to blastocyst To understand the earliest cell fate decision in a developing mouse embryo, Guo et al. (2010) conducted a qPCR experiment for 3.1.2 Performance of the diffusion map on the toy data as compared 48 genes in seven different developmental time points. The gene ex- to the other methods pression levels were normalized to the endogenous controls Actb Definition of diffusion distance (Equation (8)) based on probability and Gapdh. The authors also identified four cell types, namely, of transition between cells through several paths renders diffusion inner cell mass (ICM), trophectoderm (TE), primitive endoderm maps very robust to noise. Figure 4 presents a comparison between (PE) and epiblast (EPI) using characteristic markers. The total num- the performance of the diffusion map and the other two methods ber of single cells used in the diffusion map analysis was 429. PCA and t-SNE on the balanced toy dataset. The eigenvalues of the diffusion map (Fig. 4D) suggest that there are four leading dimen- 2.5.3 RNA-Seq of human preimplantation embryos sions that explain the data structure and the higher dimensions pre- For the dataset published by Yan et al. (2013), RNA-Seq analysis sent noise rather than the intrinsic structure of the data manifold. was performed on 90 individual cells from 20 oocytes and embryos. The complete set of two-by-two projections up to the fourth eigen- The sequenced embryos were picked at seven crucial stages of pre- vector can be found in the supplementary Figure S7. PCA of this implantation: metaphase II oocyte, zygote, 2-cell, 4-cell, 8-cell, mor- dataset generated results that were similar to the diffusion map, ula and late blastocyst at the hatching stage. where all four branches of the data could be distinguished. However, standard t-SNE did not preserve the data structure con- tinuity. Visualization using t-SNE with non-standard perplexity val- 3 Results ues are also provided in the Supplementary Figure S8. To determine In this section, we evaluate the performance of the diffusion map on how additional extrinsic noise and density heterogeneities affect each of the datasets described in the Methods section and compare it each method, we also applied the three methods on imbalanced toy 2994 L.Haghverdi et al. data (Fig. 5). The eigenvalues plot of the diffusion map in this figure normalization. Supplementary Figure S12 demonstrates how density suggests the same order of significance for the third and fourth normalization improves the quality of the diffusion map. The third eigenvectors as k almost equals k and that the higher order eigen- refinement that we used in our implementation of diffusion maps is 4 3 functions mostly present noise. We chose two projections accounting for missing and non-detect values (Section 2.2). (DC, DC2, and DC3) and (DC1, DC2, and DC4) for illustration Generally speaking as the proportion of missing and non-detect val- in Figure 5. The complete set of two-by-two projection can be found ues increases, there is a decrease in the quality of the diffusion map. in the Supplementary Figure S9. From Figure 5A, one can infer the However the magnitude of this effect depends highly on the archi- same size for all four branches of differentiation despite different tecture of the gene regulatory network and the role of the corres- sampling densities. This figure also suggests that the diffusion map ponding gene in the network. For example, for a toggle switch, low clearly shows four branches of the imbalanced toy data, whereas expression of a gene would always imply high expression of the PCA and t-SNE produce noisier visualization and represent the two other gene. Therefore, increasing the detection threshold (i.e. rarer branches as smaller. For additional t-SNE visualizations with increasing number of non-detects) does not have a major influence non-standard perplexity values for the imbalanced toy data see on the analysis, as the information is still present in the other gene Supplementary Figure S10. with high expression. We evaluate the performance of diffusion map in several proportions of missing values for the balanced toy data in Supplementary Figure S13. 3.1.3 Refinement of the transition matrix by density normalization, zero diagonal and accounting for missing values In order to adapt the standard diffusion map algorithm to the prop- 3.2 Diffusion map allows identification of differentiation erties of single-cell gene expression parameters, we refined the tran- trajectories on experimental data sition matrix in different ways. First, we set the diagonal of the 3.2.1 Performance on haematopoietic stem cells qPCR data as transition matrix to zero (Equation (5)) since the (non-zero version) compared to the other methods diagonal carries information about local sampling densities. Unlike The diffusion map embedding for the haematopoietic stem cells many other applications where the information about local densities (Fig. 6A) indicates a major branching of HSCs to PreMegE and has some value, the sampling density in the context of single-cell LMPP cell types and a further branching of LMPPs to CLP and GMP data is somewhat arbitrary (e.g. only specific cell types are moni- cells. The branching structures are less clear in the PCA plot (Fig. 6B). tored, different proliferation rates in several stages of differentiation Moreover, PCA produces artificial planes of data in the embedding alters the sampling density, outlier cells show lower density, etc.). because of the non-detect measurements in the qPCR data. The t-SNE For a demonstration of how zero diagonal improves the quality of plot (Fig. 6C) almost separated the cell types (except for LMPPs) into the diffusion map see Supplementary Figure S11. Second, we refined different clusters. However, the notion of temporal progress is less the Markovian transition matrix by density normalization clear compared to the diffusion map embedding. In addition, since (Equation (5)) since the number of diffusion paths between two cells uncertainties in the values of non-detects were not considered, a wid- depends on the density of cells connecting them and more densely ening within the clusters is observed. Detailed visualization using the sampled regions of the data would seem to have smaller diffusion three methods and the Gaussian width determination for diffusion distance to each other on a diffusion map without density map embedding are provided in the Supplementary Figure S14.The ordered eigenvalues plot for the diffusion map and PCA are shown in Figures 6D and E. The ordered eigenvalues plot of the diffusion map suggests that there is no clear separation between the eigenvectors of the diffusion map that captures the intrinsic low-dimensional data manifold and those characterizing noise for this dataset. However, what makes the diffusion map embedding of this dataset more plaus- ible is the concordance between the branching structure as suggested by the diffusion map and the recently established hierarchy of haem- atopoietic cell types (Arinobu et al.,2007; Moignard et al.,2013) illustrated in Figure 6F. 3.2.2 Performance of the diffusion map on mouse embryonic stem cells qPCR data as compared to the other methods Fig. 3. The average dimensionality of the data hdi as a function of log ðrÞ for For the mouse embryonic stem cells, diffusion map visualization the balanced and imbalanced toy datasets using the first three eigenvectors indicated a branching at the early Fig. 4. Visualization of the balanced toy data on (A) the first three eigenvectors of the diffusion map, (B) PCA and (C) t-SNE. The colours (heat map of blue to red) indicate the maximum expression among all genes. Eigenvalues sorted in decreasing order for (D) diffusion map and (E) PCA Diffusion maps for high-dimensional single-cell analysis 2995 Fig. 5. Visualization of the imbalanced toy data on (A) the first three eigenvectors of the diffusion map, (B) the first, second and fourth eigenvectors of the diffusion map, (C) the first three components of the PCA (D) the first, second and fourth components of PCA and (E) t-SNE. The colours (heat map of blue to red) indicate the maximum expression among all genes. Eigenvalues sorted in a decreasing order for (F) diffusion map and (G) PCA Fig. 6. Visualization of haematopoietic stem cells data on the first three eigenvectors of (A) diffusion map, (B) PCA and (C) t-SNE. Eigenvalues sorted in a decreas- ing order for (D) diffusion map and (E) PCA. (F) The hierarchy of haematopoietic cell types 16-cell stage to the ICM and TE cell types, and further branching of to high sequencing costs. A low number of sampled cells could not the ICM at the late 32-cell stage into the EPI and PE (Fig. 7A). The meaningfully indicate a complex structure. Hence, PCA and t-SNE branching structure is unclear in the PCA and t-SNE plots (Figs 7B performance is almost as good as that of the diffusion map. and C). The ordered eigenvalues plot for the diffusion map and PCA However, with the expected development of new and cheaper RNA are shown in Figures 7D and 7E. The branching structure indicated sequencing technologies, we propose a diffusion map that could be by the diffusion map is in agreement with the results of previous used as a powerful dimension-reduction tool the computation time studies on this dataset (Buettner and Theis, 2012; Guo et al., 2010), of which is only linear with respect to the number of genes. which suggests a branching into the two cell types, ICM and TE, after the 8-cell stage and further branching of the ICM into EPI and PE cells (Fig. 7F). More information on Gaussian width determin- 4 Discussion and conclusion ation and two-dimensional projections of data on each pair of the In this manuscript, we have demonstrated the capabilities of diffu- first to fourth eigenvectors of the diffusion map are provided in the sion maps for the analysis of continuous dynamic processes, in par- Supplementary Figure S15. ticular, differentiation data in a toy dataset and a few experimental datasets. Using a biologically relevant distance metric (i.e. diffusion 3.2.3 Performance on human pre-implantation embryos RNA-Seq distance), the adapted diffusion map method outperforms other di- data compared with other methods mension-reduction methods in pseudotemporal ordering of cells The performance of the diffusion map on this RNA-Seq dataset is along the differentiation paths and could capture the expected dif- comparable (although slightly sharper with respect to pseudotime ferentiation structure in all cases. Table 1 provides a general com- ordering) to the other two methods, PCA and t-SNE (Fig. 8). The parison of several dimension-reduction methods, detailing number of single cells measured in RNA-Seq is currently limited due capabilities and limitations in application to single-cell omics data. 2996 L.Haghverdi et al. Fig. 7 Visualization of mouse embryonic stem cells on (A) the first three eigenvectors of diffusion map, (B) PCA and (C) t-SNE. Eigenvalues sorted in a decreasing order for (D) diffusion map and (E) PCA. (F) The hierarchy of cells for mouse embryonic stem cells Fig. 8. Visualization of human preimplantation embryos data on (A) the first three eigenvectors of the diffusion map, (B) PCA and (C) t-SNE. Eigenvalues sorted in a decreasing order for (D) the diffusion map and (E) PCA Among these methods, isomap and (H)LLE have not been applied compute the average intrinsic dimensionality and hence the average for the analysis of single-cell differentiation data and pseudotime characteristic length scale. However, when density heterogeneities ordering so far, mainly because they do not meet the specific re- are extremely large, or the data manifold has many sharp changes quirements for the analysis of such data including capability to han- and several scales, a single r may not provide a globally optimal dle high levels of technical noise, sampling density heterogeneities, scale for data embedding. Therefore, implementation of an efficient detection limits and missing values. Supplementary Figures S19 and and cost-effective method for several locally valid rs determinations, S20 in the supplement demonstrate the poor performance of these instead of a single global value is of interest. methods for finding the differentiation manifold in presence of noise It is worth noting that the mathematical ergodicity in diffusion and density heterogeneity for our toy dataset as well as the three ex- maps reached by adequate kernel width selection does not necessar- perimental single-cell datasets. For any dataset, it is important to ily imply biological ergodicity. If there appears a trace of transitory consider the advantages and disadvantages of each method with re- cells between two clusters, we conclude the two clusters are also bio- spect to the data properties and the purpose of the analysis, in order logically connected to each other in an ergodic sense. However this to make a suitable choice for applying to that dataset. trace might be not present if the transition is too fast or switch-like In our diffusion maps implementation, by performing density abrupt, so that no transitory cells have been caught in the finite set normalization and setting the diagonal of the transition probability of sampled cells of snapshot data. Thus it has to be proven with matrix to zero, we propose a mapping technique wherein the close- dedicated biological experiments (e.g. as used by Buganim et al. ness of cells in the diffusion metric is unaffected by density heteroge- (2012) and Takahashi and Yamanaka (2006)) whether the data is neities in data sampling (see Supplementary Figs S9 and S10). This biologically ergodic or not. feature can be crucial for the detection of rare populations, which is A possible strategy for enhancing the capacity of capturing de- one of the main challenges in the analysis of differentiation data. tails of the structure of rare populations using diffusion maps is to By breaking the diffusion kernel (Mohri et al., 2012) to its multi- limit the transition possibility of each cell only to its closest neigh- plicand wave functions, we also propose a method in accommodat- bours. In this scenario, we could render the diffusion map more local ing the uncertainties of measurement and missing values into the by building the transition matrix P in Equation (6) for k nearest wave function. Consequently, we have successfully addressed uncer- neighbours only. This method, however, might end up with several tainties in the value of non-detects in qPCR data. disconnected sub-graphs of cells when the sampling density along Tuning the scale parameter r is also important for generating in- the intrinsic data manifold is extremely heterogeneous. sights into the structure of the data, for which we proposed a criter- Furthermore, P (without the row normalization) will not be sym- ion on the basis of the characteristic length scales of the data metric any more and we cannot ensure real eigenvalues for the tran- manifold. Because of computational limitations, for our criterion we sition probability matrix. However, as long as the graph is Diffusion maps for high-dimensional single-cell analysis 2997 Table 1. Comparison of several dimension-reduction algorithms in the view of single-cell omics data application Reference Methodology Linear/ Structure Robustness No. of dims Handles missing/un- Keeps Clustering/ Tuning Best performance non-linear faithfulness to noise/density needed for certain values? single-cell keeping parameters heterogeneities embedding resolution? continuity PCA Hotteling (1993) Orthogonal Linear Global þ/ Depends on þ (Buettner et al., þ/þ None Linear data transformation eigenvalues 2014) subspace ICA Stone (2004) Orthogonal Linear Global þ/ Arbitrary þ /þ None Linear data sub- transformation space, known no. of sources SPADE Qui et al. (2011) Agglomerative/k Non-linear Local and /þ 2D  þ/þ -Outlier density Low noise, desired means clustering, (weak) -Target density no. of clusters minimum span- global -Desired no. of %O(2d ) ning trees clusters t-SNE Van der Maaten Attraction/repulsion Non-linear Local þ/ þþ 2or3D þ þ/ Perplexity Clustering to sep- and Hinton balance arate groups, (2008) presence of noise and dens- ity heterogeneities Kernel Scholkopf et al. Kernel methods Non-linear Global þ/ Depends on þ (Buettner et al., þþ/þ Depends on the Physically relevant PCA (1998) eigenvalues 2014) used kernel kernel Isomap Tenenbaum et al. Spectral clustering, Non-linear Global /þ Depends on þ /þ No. of nearest Low noise or a (2000) geodesic distance eigenvalues neighbours priory known geodesics (H)LLE Donoho and Weighted linear Non-linear Global / Arbitrary þ /þ No. of nearest Continuous data Grimes (2003) combination of neighbours manifold, low nearest noise, uniform neighbours sampling Diffusion Coifman et al. Spectral clustering, Non-linear Global þþ/þ Depends on þ (Our þ/þ Kernel width Continuous data map (2005) diffusion distance eigenvalues implementation) manifold, pres- ence of noise and density heterogeneity d is the intrinsic dimensionality of the data manifold. 2998 L.Haghverdi et al. Buettner,F. et al. (2014). Probabilistic PCA of censored data: accounting for connected and eigenvalues are real, we can benefit from a more lo- uncertainties in the visualisation of high-throughput single-cell qPCR data. cally detailed map. 2 Bioinformatics. One caveat in the current version of diffusion map is the n  G Buganim,Y. et al. (2012). Single-cell expression analyses during cellular computation time which can be prohibitive for large cell numbers reprogramming reveal an early stochastic and a late hierarchic phase. Cell, (> 10 ) as generated from cytometry experiments. Choosing the k 150, 1209–1222. nearest neighbours version of diffusion map can therefore be a solu- Chattopadhyay,P.K. et al. (2006). Quantum dot semiconductor nanocrystals tion to this problem. Diffusion distances are based on a robust con- for immunophenotyping by polychromatic flow cytometry. Nat. Med., 12, nectivity measure between cells which is calculated over all possible 972–977. paths of a certain length between the cells. Thus, a diffusion mapping Chu,Y. and Corey,D.R. (2012). RNA sequencing: platform selection, experi- obtained by accounting for a smaller fraction of all possible paths mental design, and data interpretation. Nucleic Acid Therapeutics, 22, 271– (namely those going through each cells’ nearest neighbours) can still Coifman,R.R. et al. (2005). Geometric diffusions as a tool for harmonic ana- provide a good approximation of the diffusion distance between the lysis and structure definition of data: Diffusion maps. Proc. Natl. Acad. Sci. cells and yet avoid computing all n elements of the transition prob- USA, 102, 7426–7431. ability matrix. With such modifications, diffusion maps prevail as a Donoho,D.L. and Grimes,C. (2003). Hessian eigenmaps: Locally linear promising method for the analysis of large cell numbers omics data. embedding techniques for high-dimensional data. Proc. Natl. Acad. Sci. Another issue is the number of embedding dimensions. The num- USA, 100, 5591–5596. ber of significant dimensions of the diffusion map is determined Dykstra,B. et al. (2007). Long-term propagation of distinct hematopoietic dif- where a remarkable gap occurs in its sorted eigenvalues plot. This is ferentiation programs in vivo. Cell Stem Cell, 1, 218–229. not intrinsically bound to the conventional visualizable dimensions Gillespie,D.T. (1977). Exact stochastic simulation of coupled chemical reac- two or three. In contrast, for some other methods such as t-SNE, tions. J. Phys. Chem., 81, 2340–2361. Guo,G. et al. (2010). Resolution of cell fate decisions revealed by single-cell one can pre-determine the number of visualization dimensions for gene expression analysis from zygote to blastocyst. Dev. Cell, 18, 675–685. the embedding optimization to two or three dimensions. Huang,S. (2009). Non-genetic heterogeneity of cells in development: more We conclude that diffusion maps are appropriate and powerful than just noise. Development, 136, 3853–3862. for the dimension-reduction of single-cell qPCR and RNA-Seq cell dif- Krumsiek,J. et al. (2011). Hierarchical differentiation of myeloid progenitors ferentiation data as they are able to handle high noise levels, sampling is encoded in the transcription factor network. PloS One, 6, e22649. density heterogeneities, and missing and uncertain values. As a result McDavid,A. et al. (2013). Data exploration, quality control and testing in single- diffusion maps can organize single cells along the non-linear and com- cell qPCR-based gene expression experiments. Bioinformatics, 29, 461–467. plex branches of differentiation, maintain the global structure of the Mohri,M. et al. (2012). Foundations of Machine Learning. MIT Press, differentiation dynamics and detect rare populations as well. Cambridge, MA, USA. Moignard,V. et al. (2013). Characterization of transcriptional networks in blood stem and progenitor cells using high-throughput single-cell gene ex- pression analysis. Nature Cell Biol., 15, 363–372. Acknowledgements Orkin,S.H. and Zon,L.I. (2008). Hematopoiesis: an evolving paradigm for We thank Michael Strasser (Institute for Computational Biology, Helmholtz stem cell biology. Cell, 132, 631–644. Centre Munich), Victoria Moignard (Cambridge Institute of Medical Park,H.Y. et al. (2014). Visualization of dynamics of single endogenous mrna Research), Berthold Goettgens (Cambridge Institute of Medical Research) labeled in live mouse. Science, 343, 422–424. and Mauro Maggioni (Department of Mathematics, Duke University) for Qiu,P. et al. (2011). Extracting a cellular hierarchy from high-dimensional helpful advice and discussions. cytometry data with SPADE. Nat. Biotechnol., 29, 886–891. Rieger,M.A. et al. (2009). Hematopoietic cytokines can instruct lineage choice. Science, 325, 217–218. Scho ¨ lkopf,B. et al. (1998). Nonlinear component analysis as a kernel eigen- Funding value problem. Neural Comput., 10, 1299–1319. Schroeder,T. (2011). Long-term single-cell imaging of mammalian stem cells. This study has been funded by The Bavarian Research Network for Nat. Methods, 8, S30–S35. Molecular Biosystems (BioSysNet) and the European Research Council (ERC Shawe-Taylor,J. and Cristianini,N. (2004). Kernel Methods for Pattern starting grant LatentCauses). Analysis. Cambridge University Press. Conflict of Interest: none declared. Stingl,J. et al. (2006). Purification and unique properties of mammary epithe- lial stem cells. Nature, 439, 993–997. Strasser,M. et al. (2012). Stability and multiattractor dynamics of a toggle References switch based on a two-stage model of stochastic gene expression. Biophys. J., 102, 19–29. Amir,E.-a. D. et al. (2013). viSNE enables visualization of high dimensional Takahashi,K. and Yamanaka,S. (2006). Induction of pluripotent stem cells single-cell data and reveals phenotypic heterogeneity of leukemia. Nat. from mouse embryonic and adult fibroblast cultures by defined factors. Cell, Biotechnol., 31, 545–552. 126, 663–676. Arinobu,Y. et al. (2007). Reciprocal activation of GATA-1 and PU. 1 marks Tenenbaum,J.B. et al. (2000). A global geometric framework for nonlinear initial specification of hematopoietic stem cells into myeloerythroid and dimensionality reduction. Science, 290, 2319–2323. myelolymphoid lineages. Cell Stem Cell, 1, 416–427. Trapnell,C. et al. (2014). The dynamics and regulators of cell fate decisions Bandura,D.R. et al. (2009). Mass cytometry: technique for real time single cell are revealed by pseudotemporal ordering of single cells. Nature Biotechnol., multitarget immunoassay based on inductively coupled plasma time-of- 32, 381–386. flight mass spectrometry. Anal. Chem., 81, 6813–6822. Van der Maaten,L. and Hinton,G. (2008). Visualizing data using t-SNE. J. Bendall,S.C. et al. (2014). Single-cell trajectory detection uncovers progression Mach. Learn. Res., 9, 2579–2605. and regulatory coordination in human B cell development. Cell, 157, 714– Wilhelm,J. and Pingoud,A. (2003). Real-time polymerase chain reaction. Chembiochem, 4, 1120–1128. Buettner,F. and Theis,F.J. (2012). A novel approach for resolving differences Yan,L. et al. (2013). Single-cell RNA-Seq profiling of human preimplantation in single-cell gene expression patterns from zygote to blastocyst. embryos and embryonic stem cells. Nat. Struct. Mol. Biol., 20, 1131–1139. Bioinformatics, 28, i626–i632. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bioinformatics Oxford University Press

Diffusion maps for high-dimensional single-cell analysis of differentiation data

Bioinformatics , Volume 31 (18): 10 – May 21, 2015

Loading...

Page 2

Loading...

Page 3

Loading...

Page 4

Loading...

Page 5

Loading...

Page 6

Loading...

Page 7

Loading...

Page 8

Loading...

Page 9

Loading...

Page 10

 
/lp/oxford-university-press/diffusion-maps-for-high-dimensional-single-cell-analysis-of-07d30TqWiL

References (34)

Publisher
Oxford University Press
Copyright
© The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
ISSN
1367-4803
eISSN
1460-2059
DOI
10.1093/bioinformatics/btv325
pmid
26002886
Publisher site
See Article on Publisher Site

Abstract

Motivation: Single-cell technologies have recently gained popularity in cellular differentiation stud- ies regarding their ability to resolve potential heterogeneities in cell populations. Analyzing such high-dimensional single-cell data has its own statistical and computational challenges. Popular multi- variate approaches are based on data normalization, followed by dimension reduction and clustering to identify subgroups. However, in the case of cellular differentiation, we would not expect clear clus- ters to be present but instead expect the cells to follow continuous branching lineages. Results: Here, we propose the use of diffusion maps to deal with the problem of defining differenti- ation trajectories. We adapt this method to single-cell data by adequate choice of kernel width and inclusion of uncertainties or missing measurement values, which enables the establishment of a pseudotemporal ordering of single cells in a high-dimensional gene expression space. We expect this output to reflect cell differentiation trajectories, where the data originates from intrinsic diffusion-like dynamics. Starting from a pluripotent stage, cells move smoothly within the tran- scriptional landscape towards more differentiated states with some stochasticity along their path. We demonstrate the robustness of our method with respect to extrinsic noise (e.g. measurement noise) and sampling density heterogeneities on simulated toy data as well as two single-cell quantitative polymerase chain reaction datasets (i.e. mouse haematopoietic stem cells and mouse embryonic stem cells) and an RNA-Seq data of human pre-implantation embryos. We show that diffusion maps perform considerably better than Principal Component Analysis and are advanta- geous over other techniques for non-linear dimension reduction such as t-distributed Stochastic Neighbour Embedding for preserving the global structures and pseudotemporal ordering of cells. Availability and implementation: The Matlab implementation of diffusion maps for single-cell data is available at https://www.helmholtz-muenchen.de/icb/single-cell-diffusion-map. Contact: fbuettner.phys@gmail.com, fabian.theis@helmholtz-muenchen.de Supplementary information: Supplementary data are available at Bioinformatics online. 1 Introduction reveal the properties of the heterogeneous population of cells at vari- ous stages of development. Purifying for a certain cell type or syn- The advantages of single-cell measurements to various biological re- chronizing cells is experimentally challenging. Moreover, stem cell search fields have become obvious in recent years. One example is populations that have been functionally characterized often show the stem cell studies for which population measurements fail to V The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com 2989 2990 L.Haghverdi et al. heterogeneity in their cellular and molecular properties (Dykstra heterogeneities and addresses the issues relating to missing values et al., 2007; Huang, 2009; Stingl et al., 2006). To overcome these and uncertainties of measurement, barriers, on the one hand researchers conduct continuous single-cell propose a criterion for selecting the scale parameter in a diffusion observation using time-lapse microscopy (Park et al., 2014; Rieger map, et al., 2009; Schroeder, 2011), accompanied by single-cell tracking evaluate the performance of the diffusion map and its robustness and analysis tools. However, this approach is still limited as the ex- to noise and density heterogeneities using a toy model that pression of very few genes (typically one to three) could be observed. mimics the dynamics of differentiation, On the other hand, with the advent of new technologies, such as sin- apply the adapted diffusion map algorithm to two typical qPCR gle-cell qPCR (Wilhelm and Pingoud, 2003) or RNA-Seq (Chu and and one RNA-Seq datasets and show that it captures the differ- Corey, 2012) and flow or mass cytometry (Bandura et al., 2009; entiation dynamics more accurately than other algorithms. Chattopadhyay et al., 2006), it is now possible to measure hundreds to thousands of genes from thousands of single cells at different spe- 2 Methods cific experimental time points (time-course experiments). However, several single cells measured at the same experimental time point 2.1 Diffusion maps may be at different developmental stages. Therefore, there is a need Let n be the number of cells and let G be the number of genes meas- for computational methods which resolve the hidden temporal order ured for each cell. Denote the set of all measured cells by X.We that reflects the ordering of developmental stages of differentiating allow each cell x to diffuse around its measured position x 2 R cells. through an isotropic Gaussian wave function, While differentiation has to be regarded as a non-linear continu- 1=4 2 ous process (Bendall et al., 2014; Buettner and Theis, 2012), stand- 2 jjx  xjj Y ðx Þ¼ exp  (1) 2 2 ard methods used for the analysis of high-dimensional gene pr r expression data are either based on linear methods such as principal 0 2 0 0 The normalization of Y ðx Þ is such that Y ðx Þdx ¼ 1. The component analysis (PCA) and independent components analysis x Gaussian width r determines the length scale over which each cell (ICA) [e.g. used as part of the monocle algorithm (Trapnell et al., can randomly diffuse. The transition probability from cell x to cell y 2014)] or they use clustering techniques that groups cells according is then defined by the interference of the two wave functions Y and to specific properties. Hierarchical clustering methods as used in Y . One can easily show that this interference product is another SPADE (Qiu et al., 2011) and t-SNE (Van der Maaten and Hinton, Gaussian (all prefactors cancel out): 2008) as used in viSNE (Amir et al., 2013) are examples of cluster- ing methods. However, as these methods are designed to detect dis- ! jjx  yjj 0 0 0 crete subpopulations, they usually do not preserve the continuous Y ðx ÞY ðx Þdx ¼ exp  (2) x y 2r trajectories of differentiation data. A more recently proposed algorithm Wanderlust (Bendall et al., 2014) incorporates the non- Hence, we can construct the n  n Markovian transition probability linearity and continuity concepts but provides a pseudotemporal matrix P for all pairs of cells in X as follows: ordering of cells only if the data comprise a single branch. Furthermore, in gene expression measurement techniques, there is 2 1 jjx  yjj P ¼ exp  (3) usually a detection limit at which lower expression levels and non- xy ZðxÞ 2r expressed genes are all reported at the same value. Buettner et al. (2014) suggested the use of a censoring noise model for PCA, jjx  yjj whereas for the other methods it is unclear how these uncertain or ZðxÞ¼ exp  (4) 2r y2X missing values are to be treated. A variety of other manifold learning methods including (Hessian) locally linear embedding (HLLE) At the position of each cell, ZðxÞ is the partition function which pro- (Donoho and Grimes, 2003) and Isomap (Tenenbaum et al., 2000) vides an estimate of the number of neighbours of x in a certain vol- exist in the machine-learning community and are discussed in detail ume defined by r. Hence, it can be interpreted as the density of cells in the discussion and conclusion section. at that proximity. Consequently, we redefine the density normalized Here, we propose diffusion maps (Coifman et al., 2005) as a tool transition probability matrix P as: for analyzing single-cell differentiation data. Diffusion maps use a jjxyjj distance metric (usually referred to as diffusion distance) conceptu- exp 1 2r ~ ~ P ¼ ; P ¼ 0 (5) ally relevant to how differentiation data is generated biologically, as xy xx ZðxÞZðyÞ ZðxÞ cells follow noisy diffusion-like dynamics in the course of taking sev- eral differentiation lineage paths. Diffusion maps preserve the non- 2 jjxyjj exp X 2 2r linear structure of data as a continuum and are robust to noise. ZðxÞ¼ (6) ZðxÞZðyÞ Furthermore, with density normalization, diffusion maps are resist- y2X=x ant to sampling density heterogeneities and can capture rare as well Because we are only interested in the transition probabilities be- as abundant populations. As a non-linear dimension-reduction tool, tween cells and not the on-cell potentials imposed by local densities, diffusion maps can be applied on single-cell omics data to perform we set the diagonal of P to zero and exclude y ¼ x from the sum in dimension-reduction and ordering of cells along the differentiation ~ ~ the partition function Z. For a large enough r, the matrix P defines path in a single step, thus providing insight to the dynamics of differ- an ergodic Markovian diffusion process on the data and has n entiation (or any other concept with continuous dynamics). In this ordered eigenvalues k ¼ 1 > k :::k with corresponding right 0 1 n1 article, we eigenvectors w :::w . 0 n1 propose an adaptation of diffusion maps for the analysis of The t-th power of P will present the transition probabilities be- single-cell data which is less affected by sampling density tween cells in a diffusion (random walk) process of length t. Noting Diffusion maps for high-dimensional single-cell analysis 2991 ~ ~ that P has the same eigenvectors as P, one can show that this transi- Markovian transition probability matrix (DC1 and DC2) are then tion probability can be represented as follows: used for low-dimensional representation and visualization of data. n1 t t ~ ~ P ¼ k w ðxÞw ðyÞZðyÞ (7) i i xy i i¼0 2.2 Accounting for missing and uncertain values The data generated from qPCR, RNA-Seq or cytometry experiments Each row of P can be viewed as a vector, which we represent as are very often prone to imperfections such as missing values or de- p ðx; Þ and consider as the feature representation (Shawe-Taylor tection limit thresholds. It is important to properly treat such uncer- and Cristianini, 2004) for each cell x. By computing the weighted L tainties of data (Buettner et al., 2014; McDavid et al., 2013). Our distance in the feature space, the diffusion distance D between two probabilistic interpretation of diffusion maps allows a straightfor- cells x and y is defined as follows: ward mechanism of handling missing and uncertain data measure- t t 2 ~ ~ X ments. First, we have to decompose the kernel into G components. ðP  P Þ xz yz 2 t t 2 D ðx; yÞ¼jjp ðx; Þ  p ðy; Þjj ¼ (8) t 1=Z Then, instead of a Gaussian, we can use any other wave function ZðzÞ that best represents our prior knowledge on the probability distribu- This diffusion distance can be expressed in terms of the eigenvectors tion of the missing or uncertain values, which then should be of P such that: square-normalized to ensure equal contribution of the G compo- nents. For example, for missing values and non-detects (measure- n1 2 2t 2 ments below the limit of detection), one might choose a uniform D ðx; yÞ¼ k ðw ðxÞ w ðyÞÞ (9) i i t i i¼1 distribution over the whole range of possible values. In the following, we describe how to account for the uncertainty The corresponding eigenvector to the largest eigenvalue k is a con- of non-detect measurements in qPCR data. The statistical subtleties stant vector w ¼ 1. Therefore, it only contributes a zero term to D of non-detect values in qPCR experiments have been systematically and is excluded from the spectral decomposition of D in studied by McDavid et al. (2013) for univariate models. In addition, Equation (9). That means the Euclidean distance of the cells in the for a multivariate PCA analysis, Buettner et al. (2014) proposed that firstleigenvector space represents an approximation of their diffu- different kernels be allowed in each dimension. For the diffusion sion distance D . Moreover, the eigenvalues of P determine the dif- map implementation, we assume any value between the detection fusion coefficients in the direction of the corresponding eigenvector. limit (M ) and a completely non-expressed (off) state of genes valued As real data usually lie on a lower dimensional manifold than the en- as M , is equally possible for the non-detect measurements. tire dimensions of space G, these diffusion coefficients drop to a Considering the kernel width formulated in the diffusion map wave noise level other than a few first (l) prominent directions. Therefore, functions, we assume an indicator wave function between M  r if there is a significant gap between the l-th and ðl þ 1Þ-th eigen- 1=2 and M þ r normalized by ðM  M þ 2rÞ . Thus, we have to 1 1 0 value, the sum up to the l-th term usually determines a good ap- calculate three different kinds of interference of wave functions: proximation for diffusion distances. Thus, for data visualization we The interference of two cells with definite measured values for select these eigenvectors and instead of the mathematical notation gene g is the standard Gaussian kernel (see Section 2.1): w, we call them diffusion components (DCs). Figure 1 presents a summary of diffusion map embedding. ð ðx  y Þ g g 0 0 0 Each cell is represented by a Gaussian wave function in the G- Y ðx ÞY ðx Þdx ¼ exp  ; x y g g g 2r dimensional gene space. On an adequate Gaussian width, the wave functions of neighbouring cells interfere with each other and form the interference of two cells both with non-detect values for gene g the diffusion paths along the (non-linear) data manifold in the high- is 1 (due to the square-normalization constraint): dimensional space. Hence, we construct the Markovian transition probability matrix, the elements of which are the transition proba- 0 0 0 Y ðx ÞY ðx Þdx ¼ 1; x y g g g bilities between all pairs of cells. The eigenfunctions of the Fig. 1. Schematic overview of diffusion maps embedding. (A) The n  G matrix representation of single-cell data consisting of four different cell types. The last column on the right side of the matrix (colour band) indicates the cell type for each cell. (B) Representation of each cell by a Gaussian in the G-dimensional gene space. Diffusion paths (continuous paths with relatively high-probability density) form on the data manifold as a result of interference of the Gaussians. The Probability density function is shown in the heat map. (C) The n  n Markovian transition probability matrix. (D) Data embedding on the first two eigenvectors of the Markovian transition matrix (DC1 and DC2) which correspond to the largest diffusion coefficients of the data manifold. The embedding shows the continuous flow of cells across four cell types; however, it does not suggest the putative time direction 2992 L.Haghverdi et al. the interference of a missing (non-detect) value to a definite meas- where we compute the average of log ðZðxÞÞ with consideration of ured value x is: density heterogeneities such that: ðlog ðZðxÞÞ  ð1=ZðxÞÞÞ 0 0 0 Y ðx ÞY ðx Þdx ¼ x x y g g g hlog ðZðxÞÞi ¼ X (12) ð1=ZðxÞÞ M1þr 1 2 ðx0  x Þ 1=4 g g pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffið Þ exp  dx 2 2 It is worth noting that this average density underestimates the real M  M þ 2r pr r M r 1 0 dimensionality of the structure due to the contribution of the cells 1=4 1 pr lying on the surface of the manifold. However, this does not affect ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi M  M þ 2r 8 1 0 our heuristic since the variation of hdi is our main interest rather than hdi itself. M  r  x M þ r  x 0 g 1 g erfc  erfc : Each time hdi reaches its maximum and starts to decrease, one r r can deduce that an intrinsically lower dimensional structure is For data with missing or uncertain values, we need to check the pair- emerging from the noise-enriched distributed cells in the original wise interference of the wave functions for each gene. The computa- high-dimensional space. Therefore several characteristic length tion time is thus proportional to the number of genes G for a fixed scales of the data manifold (i.e. width of its linear parts, radius of its number of cells n. Therefore, it might be preferable (especially in the curves, etc.) give rise to several local maxima in hdi. Such character- case of large G) to choose the wave function of the missing (or un- istic scales indeed make our choice for the Gaussian width r since certain) value also in the form of a Gaussian such that the multipli- they indicate the scale at which the Euclidean distances used in the cation of the G components of interference can be expressed as the Gaussian kernel are sensible in an assumed Euclidean tangent space sum of the exponents and the exponentiation step can be performed to the manifold. Although Euclidean distances are also valid for only once at the end of the algorithm for computation of the transi- smaller rs than the characteristic length scale, they are not recom- tion matrix. An implementation of this fast version of the censoring mended because smaller kernel width would mean less connectivity algorithm is also provided in the codes package. Supplementary in the cells graph which in turn results in an increased sensitivity to Figure S1 provides an illustration of our approach for accounting noise. Supplementary Figures S2 and S3 illustrate the resulting diffu- for missing and uncertain values. sion map on optimal kernel width and several other kernel widths values for a U-shaped toy data. Also the performance of diffusion map at the optimal kernel width when there is no distinguishable 2.3 Determination of Gaussian kernel width pattern in the data (e.g. normally distributed data in all dimensions The parameter r in Equation (1) determines the scale at which we or sparse data) is illustrated in the Supplementary Figure S4. visualize the data. If r is extremely small, most elements of the tran- sition probability matrix P will tend to be zero and we do not get an overall view of a connected graph structure. In fact, when r is too 2.4 Toy model for differentiation small, the number of degenerate eigenvectors with eigenvalue equal As toggle switches are known to play a role in differentiation to one, indicates the number of disconnected segments that P defines branching processes (Orkin and Zon, 2008), we designed a regula- on the data. For too large r however, the transition probability sen- tory network of three pairs of toggle genes to evaluate the perform- sitivity on the distance between the cells blurs. There is a certain ance of our method on a toy dataset that mimics a differentiation range of r variations for which P defines an ergodic diffusion pro- tree (Krumsiek et al., 2011). Assuming a genetic regulatory module cess on the data as a connected graph and still the diffusion distances as presented in Figure 2A, we simulated the stochastic differenti- between the cells are informative. ation process by the Gillespie algorithm (Gillespie, 1977) with the The un-normalized density at each cell (ZðxÞ in Equation (3))is reactions as shown in Figure 2B and C (Strasser et al., 2012). More proportional to the number of cells in a fixed volume in its neigh- details about the chemical reactions and the reaction rates used in bourhood and depends on r. At scales of r close to zero, cells do not the Gillespie algorithm model can be found in the supplement have any neighbours and their average density is 1 (because of the 1s (Supplementary Figs S5A and B). Genes G and G are antagonistic 1 2 on the diagonal of P). By increasing r, the average density gradually to each other through an inhibiting Hill function. Therefore, starting increases as more cells find other cells in their neighbourhood. There from an initial undifferentiated state where G and G are both in a 1 2 is a density saturation point where r reaches the system size and all very low expression level, single samples may end up in either of the cells form part of one neighbourhood. At this point, for every cell states where G or G is exclusively expressed. At this stage, the 1 2 x 2 X, the density ZðxÞ will be equal to the entire system size n. next pair of toggle genes in the differentiation hierarchy is activated Assuming that the density gradient is not extremely sharp along (through an activating binding Hill function), which are again an- the data manifold, the number of neighbours of cell x in the neigh- tagonistic to each other. This model generates four different types of bourhood r will be proportional to the volume of a hypersphere of fully differentiated cells in the six-dimensional space of genes. radius r, hence: To establish a steady state in the cell population, once a cell hits dðx;rÞ ZðxÞ/ r (10) the end of each branch, we remove it from the population and initi- ate a new cell at the original undifferentiated state. This approach where dðx; rÞ is the dimensionality of data manifold at the position maintains the population size of cells. After an extended simulation of cell x and at the scale r. By differentiating both sides with respect run, the steady state of the population is established and resembles to log ðrÞ, we find that the average dimensionality of the manifold the haemostatic state of (e.g. hematopoietic) stem cells in natural can be estimated by the slope of the log–log plot of the number of organisms. neighbours versus the length scale: We sampled cells from this toy model in two different sets, a bal- @hlog ðZðxÞÞi anced toy dataset, wherein 600 samples serve as a snapshot of the hdðrÞi ¼ (11) @log ðrÞ steady state of the system with no additional extrinsic noise, and an Diffusion maps for high-dimensional single-cell analysis 2993 Fig. 2. (A) Toy model of a differentiation regulatory network consisting of three pairs of antagonistic genes simulated by the Gillespie algorithm. The arrows show activation or inhibition interactions between genes. The toy model employs two classes of gene regulation: (B) G is connected to an inhibitor, its production 0 0 00 rate a is proportional to a Hill function of the concentration of the inhibitor Protein (C) G is connected to an inhibitor G and an activator G , its production rate i i i i i a is proportional to product of an inhibiting and an activating Hill function. The degradation rate c is constant for all proteins imbalanced toy dataset, wherein 1800 sample are derived from a to the performance of two other dimension-reduction methods PCA non-steady-state density distribution with heavier sampling density and t-SNE. Data embeddings with several other methods including þ þ on the G G branch. We also added an extrinsic Gaussian noise ICA, SPADE, kernel-PCA (Scho ¨ lkopf et al., 1998), isomap and 1 3 with a variance of 25% maximum expression to each gene. The Hessian Locally linear embedding (HLLE) are provided in the gene expression plot for a simulated single cell as it proceeds from Supplementary Figures S16–S20. the initial pluripotent state to a fully differentiated state is presented in the supplement (Supplementary Figs S5C and D). 3.1 Diffusion maps cope with high noise level and sampling density heterogeneity for toy data 2.5 Experimental data 3.1.1 Gaussian width determination of the toy data 2.5.1 qPCR data of mouse haematopoietic stem cells. We demonstrate the heuristic determination of r on balanced and We calculated a diffusion map embedding for the haematopoietic imbalanced toy datasets. The average dimensionality of the structure and progenitor stem cells dataset from Moignard et al. (2013). In this of some chosen characteristic length scale can be estimated by experiment, 597 cells from five different haematopoietic cell types, Equation (11). Figure 3 shows the average dimensionality hdi for namely, haematopoietic stem cell (HSC), lymphoid-primed multipo- balanced toy data (red) and imbalanced toy data (black) as a func- tent progenitor (LMPP), megakaryocyte-erythroid progenitor tion of log ðrÞ. The balanced set exhibits two maxima. The first one (PreMegE), common lymphoid progenitor (CLP) and granulocyte– arises at the length scale of the thickness of the differentiation monocyte progenitor (GMP) were gated by Fluorescence-activated branches which include only a few cells. At this r several subpopula- cell sorting (FACS) sorting. Single-cell qPCR expression level meas- tions form at the more densely populated stages of the steady state. urement was then performed for 24 genes. Housekeeping genes were The second maximum appears at a larger length scale when several only used for cell-cycle normalization, where for each cell, all expres- subpopulations become visible to each other and the continuous sion values were divided by the average expression of its housekeep- branches form. We picked the r at the second maximum for visual- ing genes. Furthermore we excluded the five housekeeping genes, as ization (data visualization at the first maximum is provided in the well as c-Kit, which is a stem-cell receptor factor expressed on the Supplementary Fig. S6). For the imbalanced set, however, due to the surface of all analyzed cells, from the diffusion map analysis. high noise level, the first maximum vanished and we only detected one maximum which we then used for the visualization. 2.5.2 qPCR data of mouse stem cells from zygote to blastocyst To understand the earliest cell fate decision in a developing mouse embryo, Guo et al. (2010) conducted a qPCR experiment for 3.1.2 Performance of the diffusion map on the toy data as compared 48 genes in seven different developmental time points. The gene ex- to the other methods pression levels were normalized to the endogenous controls Actb Definition of diffusion distance (Equation (8)) based on probability and Gapdh. The authors also identified four cell types, namely, of transition between cells through several paths renders diffusion inner cell mass (ICM), trophectoderm (TE), primitive endoderm maps very robust to noise. Figure 4 presents a comparison between (PE) and epiblast (EPI) using characteristic markers. The total num- the performance of the diffusion map and the other two methods ber of single cells used in the diffusion map analysis was 429. PCA and t-SNE on the balanced toy dataset. The eigenvalues of the diffusion map (Fig. 4D) suggest that there are four leading dimen- 2.5.3 RNA-Seq of human preimplantation embryos sions that explain the data structure and the higher dimensions pre- For the dataset published by Yan et al. (2013), RNA-Seq analysis sent noise rather than the intrinsic structure of the data manifold. was performed on 90 individual cells from 20 oocytes and embryos. The complete set of two-by-two projections up to the fourth eigen- The sequenced embryos were picked at seven crucial stages of pre- vector can be found in the supplementary Figure S7. PCA of this implantation: metaphase II oocyte, zygote, 2-cell, 4-cell, 8-cell, mor- dataset generated results that were similar to the diffusion map, ula and late blastocyst at the hatching stage. where all four branches of the data could be distinguished. However, standard t-SNE did not preserve the data structure con- tinuity. Visualization using t-SNE with non-standard perplexity val- 3 Results ues are also provided in the Supplementary Figure S8. To determine In this section, we evaluate the performance of the diffusion map on how additional extrinsic noise and density heterogeneities affect each of the datasets described in the Methods section and compare it each method, we also applied the three methods on imbalanced toy 2994 L.Haghverdi et al. data (Fig. 5). The eigenvalues plot of the diffusion map in this figure normalization. Supplementary Figure S12 demonstrates how density suggests the same order of significance for the third and fourth normalization improves the quality of the diffusion map. The third eigenvectors as k almost equals k and that the higher order eigen- refinement that we used in our implementation of diffusion maps is 4 3 functions mostly present noise. We chose two projections accounting for missing and non-detect values (Section 2.2). (DC, DC2, and DC3) and (DC1, DC2, and DC4) for illustration Generally speaking as the proportion of missing and non-detect val- in Figure 5. The complete set of two-by-two projection can be found ues increases, there is a decrease in the quality of the diffusion map. in the Supplementary Figure S9. From Figure 5A, one can infer the However the magnitude of this effect depends highly on the archi- same size for all four branches of differentiation despite different tecture of the gene regulatory network and the role of the corres- sampling densities. This figure also suggests that the diffusion map ponding gene in the network. For example, for a toggle switch, low clearly shows four branches of the imbalanced toy data, whereas expression of a gene would always imply high expression of the PCA and t-SNE produce noisier visualization and represent the two other gene. Therefore, increasing the detection threshold (i.e. rarer branches as smaller. For additional t-SNE visualizations with increasing number of non-detects) does not have a major influence non-standard perplexity values for the imbalanced toy data see on the analysis, as the information is still present in the other gene Supplementary Figure S10. with high expression. We evaluate the performance of diffusion map in several proportions of missing values for the balanced toy data in Supplementary Figure S13. 3.1.3 Refinement of the transition matrix by density normalization, zero diagonal and accounting for missing values In order to adapt the standard diffusion map algorithm to the prop- 3.2 Diffusion map allows identification of differentiation erties of single-cell gene expression parameters, we refined the tran- trajectories on experimental data sition matrix in different ways. First, we set the diagonal of the 3.2.1 Performance on haematopoietic stem cells qPCR data as transition matrix to zero (Equation (5)) since the (non-zero version) compared to the other methods diagonal carries information about local sampling densities. Unlike The diffusion map embedding for the haematopoietic stem cells many other applications where the information about local densities (Fig. 6A) indicates a major branching of HSCs to PreMegE and has some value, the sampling density in the context of single-cell LMPP cell types and a further branching of LMPPs to CLP and GMP data is somewhat arbitrary (e.g. only specific cell types are moni- cells. The branching structures are less clear in the PCA plot (Fig. 6B). tored, different proliferation rates in several stages of differentiation Moreover, PCA produces artificial planes of data in the embedding alters the sampling density, outlier cells show lower density, etc.). because of the non-detect measurements in the qPCR data. The t-SNE For a demonstration of how zero diagonal improves the quality of plot (Fig. 6C) almost separated the cell types (except for LMPPs) into the diffusion map see Supplementary Figure S11. Second, we refined different clusters. However, the notion of temporal progress is less the Markovian transition matrix by density normalization clear compared to the diffusion map embedding. In addition, since (Equation (5)) since the number of diffusion paths between two cells uncertainties in the values of non-detects were not considered, a wid- depends on the density of cells connecting them and more densely ening within the clusters is observed. Detailed visualization using the sampled regions of the data would seem to have smaller diffusion three methods and the Gaussian width determination for diffusion distance to each other on a diffusion map without density map embedding are provided in the Supplementary Figure S14.The ordered eigenvalues plot for the diffusion map and PCA are shown in Figures 6D and E. The ordered eigenvalues plot of the diffusion map suggests that there is no clear separation between the eigenvectors of the diffusion map that captures the intrinsic low-dimensional data manifold and those characterizing noise for this dataset. However, what makes the diffusion map embedding of this dataset more plaus- ible is the concordance between the branching structure as suggested by the diffusion map and the recently established hierarchy of haem- atopoietic cell types (Arinobu et al.,2007; Moignard et al.,2013) illustrated in Figure 6F. 3.2.2 Performance of the diffusion map on mouse embryonic stem cells qPCR data as compared to the other methods Fig. 3. The average dimensionality of the data hdi as a function of log ðrÞ for For the mouse embryonic stem cells, diffusion map visualization the balanced and imbalanced toy datasets using the first three eigenvectors indicated a branching at the early Fig. 4. Visualization of the balanced toy data on (A) the first three eigenvectors of the diffusion map, (B) PCA and (C) t-SNE. The colours (heat map of blue to red) indicate the maximum expression among all genes. Eigenvalues sorted in decreasing order for (D) diffusion map and (E) PCA Diffusion maps for high-dimensional single-cell analysis 2995 Fig. 5. Visualization of the imbalanced toy data on (A) the first three eigenvectors of the diffusion map, (B) the first, second and fourth eigenvectors of the diffusion map, (C) the first three components of the PCA (D) the first, second and fourth components of PCA and (E) t-SNE. The colours (heat map of blue to red) indicate the maximum expression among all genes. Eigenvalues sorted in a decreasing order for (F) diffusion map and (G) PCA Fig. 6. Visualization of haematopoietic stem cells data on the first three eigenvectors of (A) diffusion map, (B) PCA and (C) t-SNE. Eigenvalues sorted in a decreas- ing order for (D) diffusion map and (E) PCA. (F) The hierarchy of haematopoietic cell types 16-cell stage to the ICM and TE cell types, and further branching of to high sequencing costs. A low number of sampled cells could not the ICM at the late 32-cell stage into the EPI and PE (Fig. 7A). The meaningfully indicate a complex structure. Hence, PCA and t-SNE branching structure is unclear in the PCA and t-SNE plots (Figs 7B performance is almost as good as that of the diffusion map. and C). The ordered eigenvalues plot for the diffusion map and PCA However, with the expected development of new and cheaper RNA are shown in Figures 7D and 7E. The branching structure indicated sequencing technologies, we propose a diffusion map that could be by the diffusion map is in agreement with the results of previous used as a powerful dimension-reduction tool the computation time studies on this dataset (Buettner and Theis, 2012; Guo et al., 2010), of which is only linear with respect to the number of genes. which suggests a branching into the two cell types, ICM and TE, after the 8-cell stage and further branching of the ICM into EPI and PE cells (Fig. 7F). More information on Gaussian width determin- 4 Discussion and conclusion ation and two-dimensional projections of data on each pair of the In this manuscript, we have demonstrated the capabilities of diffu- first to fourth eigenvectors of the diffusion map are provided in the sion maps for the analysis of continuous dynamic processes, in par- Supplementary Figure S15. ticular, differentiation data in a toy dataset and a few experimental datasets. Using a biologically relevant distance metric (i.e. diffusion 3.2.3 Performance on human pre-implantation embryos RNA-Seq distance), the adapted diffusion map method outperforms other di- data compared with other methods mension-reduction methods in pseudotemporal ordering of cells The performance of the diffusion map on this RNA-Seq dataset is along the differentiation paths and could capture the expected dif- comparable (although slightly sharper with respect to pseudotime ferentiation structure in all cases. Table 1 provides a general com- ordering) to the other two methods, PCA and t-SNE (Fig. 8). The parison of several dimension-reduction methods, detailing number of single cells measured in RNA-Seq is currently limited due capabilities and limitations in application to single-cell omics data. 2996 L.Haghverdi et al. Fig. 7 Visualization of mouse embryonic stem cells on (A) the first three eigenvectors of diffusion map, (B) PCA and (C) t-SNE. Eigenvalues sorted in a decreasing order for (D) diffusion map and (E) PCA. (F) The hierarchy of cells for mouse embryonic stem cells Fig. 8. Visualization of human preimplantation embryos data on (A) the first three eigenvectors of the diffusion map, (B) PCA and (C) t-SNE. Eigenvalues sorted in a decreasing order for (D) the diffusion map and (E) PCA Among these methods, isomap and (H)LLE have not been applied compute the average intrinsic dimensionality and hence the average for the analysis of single-cell differentiation data and pseudotime characteristic length scale. However, when density heterogeneities ordering so far, mainly because they do not meet the specific re- are extremely large, or the data manifold has many sharp changes quirements for the analysis of such data including capability to han- and several scales, a single r may not provide a globally optimal dle high levels of technical noise, sampling density heterogeneities, scale for data embedding. Therefore, implementation of an efficient detection limits and missing values. Supplementary Figures S19 and and cost-effective method for several locally valid rs determinations, S20 in the supplement demonstrate the poor performance of these instead of a single global value is of interest. methods for finding the differentiation manifold in presence of noise It is worth noting that the mathematical ergodicity in diffusion and density heterogeneity for our toy dataset as well as the three ex- maps reached by adequate kernel width selection does not necessar- perimental single-cell datasets. For any dataset, it is important to ily imply biological ergodicity. If there appears a trace of transitory consider the advantages and disadvantages of each method with re- cells between two clusters, we conclude the two clusters are also bio- spect to the data properties and the purpose of the analysis, in order logically connected to each other in an ergodic sense. However this to make a suitable choice for applying to that dataset. trace might be not present if the transition is too fast or switch-like In our diffusion maps implementation, by performing density abrupt, so that no transitory cells have been caught in the finite set normalization and setting the diagonal of the transition probability of sampled cells of snapshot data. Thus it has to be proven with matrix to zero, we propose a mapping technique wherein the close- dedicated biological experiments (e.g. as used by Buganim et al. ness of cells in the diffusion metric is unaffected by density heteroge- (2012) and Takahashi and Yamanaka (2006)) whether the data is neities in data sampling (see Supplementary Figs S9 and S10). This biologically ergodic or not. feature can be crucial for the detection of rare populations, which is A possible strategy for enhancing the capacity of capturing de- one of the main challenges in the analysis of differentiation data. tails of the structure of rare populations using diffusion maps is to By breaking the diffusion kernel (Mohri et al., 2012) to its multi- limit the transition possibility of each cell only to its closest neigh- plicand wave functions, we also propose a method in accommodat- bours. In this scenario, we could render the diffusion map more local ing the uncertainties of measurement and missing values into the by building the transition matrix P in Equation (6) for k nearest wave function. Consequently, we have successfully addressed uncer- neighbours only. This method, however, might end up with several tainties in the value of non-detects in qPCR data. disconnected sub-graphs of cells when the sampling density along Tuning the scale parameter r is also important for generating in- the intrinsic data manifold is extremely heterogeneous. sights into the structure of the data, for which we proposed a criter- Furthermore, P (without the row normalization) will not be sym- ion on the basis of the characteristic length scales of the data metric any more and we cannot ensure real eigenvalues for the tran- manifold. Because of computational limitations, for our criterion we sition probability matrix. However, as long as the graph is Diffusion maps for high-dimensional single-cell analysis 2997 Table 1. Comparison of several dimension-reduction algorithms in the view of single-cell omics data application Reference Methodology Linear/ Structure Robustness No. of dims Handles missing/un- Keeps Clustering/ Tuning Best performance non-linear faithfulness to noise/density needed for certain values? single-cell keeping parameters heterogeneities embedding resolution? continuity PCA Hotteling (1993) Orthogonal Linear Global þ/ Depends on þ (Buettner et al., þ/þ None Linear data transformation eigenvalues 2014) subspace ICA Stone (2004) Orthogonal Linear Global þ/ Arbitrary þ /þ None Linear data sub- transformation space, known no. of sources SPADE Qui et al. (2011) Agglomerative/k Non-linear Local and /þ 2D  þ/þ -Outlier density Low noise, desired means clustering, (weak) -Target density no. of clusters minimum span- global -Desired no. of %O(2d ) ning trees clusters t-SNE Van der Maaten Attraction/repulsion Non-linear Local þ/ þþ 2or3D þ þ/ Perplexity Clustering to sep- and Hinton balance arate groups, (2008) presence of noise and dens- ity heterogeneities Kernel Scholkopf et al. Kernel methods Non-linear Global þ/ Depends on þ (Buettner et al., þþ/þ Depends on the Physically relevant PCA (1998) eigenvalues 2014) used kernel kernel Isomap Tenenbaum et al. Spectral clustering, Non-linear Global /þ Depends on þ /þ No. of nearest Low noise or a (2000) geodesic distance eigenvalues neighbours priory known geodesics (H)LLE Donoho and Weighted linear Non-linear Global / Arbitrary þ /þ No. of nearest Continuous data Grimes (2003) combination of neighbours manifold, low nearest noise, uniform neighbours sampling Diffusion Coifman et al. Spectral clustering, Non-linear Global þþ/þ Depends on þ (Our þ/þ Kernel width Continuous data map (2005) diffusion distance eigenvalues implementation) manifold, pres- ence of noise and density heterogeneity d is the intrinsic dimensionality of the data manifold. 2998 L.Haghverdi et al. Buettner,F. et al. (2014). Probabilistic PCA of censored data: accounting for connected and eigenvalues are real, we can benefit from a more lo- uncertainties in the visualisation of high-throughput single-cell qPCR data. cally detailed map. 2 Bioinformatics. One caveat in the current version of diffusion map is the n  G Buganim,Y. et al. (2012). Single-cell expression analyses during cellular computation time which can be prohibitive for large cell numbers reprogramming reveal an early stochastic and a late hierarchic phase. Cell, (> 10 ) as generated from cytometry experiments. Choosing the k 150, 1209–1222. nearest neighbours version of diffusion map can therefore be a solu- Chattopadhyay,P.K. et al. (2006). Quantum dot semiconductor nanocrystals tion to this problem. Diffusion distances are based on a robust con- for immunophenotyping by polychromatic flow cytometry. Nat. Med., 12, nectivity measure between cells which is calculated over all possible 972–977. paths of a certain length between the cells. Thus, a diffusion mapping Chu,Y. and Corey,D.R. (2012). RNA sequencing: platform selection, experi- obtained by accounting for a smaller fraction of all possible paths mental design, and data interpretation. Nucleic Acid Therapeutics, 22, 271– (namely those going through each cells’ nearest neighbours) can still Coifman,R.R. et al. (2005). Geometric diffusions as a tool for harmonic ana- provide a good approximation of the diffusion distance between the lysis and structure definition of data: Diffusion maps. Proc. Natl. Acad. Sci. cells and yet avoid computing all n elements of the transition prob- USA, 102, 7426–7431. ability matrix. With such modifications, diffusion maps prevail as a Donoho,D.L. and Grimes,C. (2003). Hessian eigenmaps: Locally linear promising method for the analysis of large cell numbers omics data. embedding techniques for high-dimensional data. Proc. Natl. Acad. Sci. Another issue is the number of embedding dimensions. The num- USA, 100, 5591–5596. ber of significant dimensions of the diffusion map is determined Dykstra,B. et al. (2007). Long-term propagation of distinct hematopoietic dif- where a remarkable gap occurs in its sorted eigenvalues plot. This is ferentiation programs in vivo. Cell Stem Cell, 1, 218–229. not intrinsically bound to the conventional visualizable dimensions Gillespie,D.T. (1977). Exact stochastic simulation of coupled chemical reac- two or three. In contrast, for some other methods such as t-SNE, tions. J. Phys. Chem., 81, 2340–2361. Guo,G. et al. (2010). Resolution of cell fate decisions revealed by single-cell one can pre-determine the number of visualization dimensions for gene expression analysis from zygote to blastocyst. Dev. Cell, 18, 675–685. the embedding optimization to two or three dimensions. Huang,S. (2009). Non-genetic heterogeneity of cells in development: more We conclude that diffusion maps are appropriate and powerful than just noise. Development, 136, 3853–3862. for the dimension-reduction of single-cell qPCR and RNA-Seq cell dif- Krumsiek,J. et al. (2011). Hierarchical differentiation of myeloid progenitors ferentiation data as they are able to handle high noise levels, sampling is encoded in the transcription factor network. PloS One, 6, e22649. density heterogeneities, and missing and uncertain values. As a result McDavid,A. et al. (2013). Data exploration, quality control and testing in single- diffusion maps can organize single cells along the non-linear and com- cell qPCR-based gene expression experiments. Bioinformatics, 29, 461–467. plex branches of differentiation, maintain the global structure of the Mohri,M. et al. (2012). Foundations of Machine Learning. MIT Press, differentiation dynamics and detect rare populations as well. Cambridge, MA, USA. Moignard,V. et al. (2013). Characterization of transcriptional networks in blood stem and progenitor cells using high-throughput single-cell gene ex- pression analysis. Nature Cell Biol., 15, 363–372. Acknowledgements Orkin,S.H. and Zon,L.I. (2008). Hematopoiesis: an evolving paradigm for We thank Michael Strasser (Institute for Computational Biology, Helmholtz stem cell biology. Cell, 132, 631–644. Centre Munich), Victoria Moignard (Cambridge Institute of Medical Park,H.Y. et al. (2014). Visualization of dynamics of single endogenous mrna Research), Berthold Goettgens (Cambridge Institute of Medical Research) labeled in live mouse. Science, 343, 422–424. and Mauro Maggioni (Department of Mathematics, Duke University) for Qiu,P. et al. (2011). Extracting a cellular hierarchy from high-dimensional helpful advice and discussions. cytometry data with SPADE. Nat. Biotechnol., 29, 886–891. Rieger,M.A. et al. (2009). Hematopoietic cytokines can instruct lineage choice. Science, 325, 217–218. Scho ¨ lkopf,B. et al. (1998). Nonlinear component analysis as a kernel eigen- Funding value problem. Neural Comput., 10, 1299–1319. Schroeder,T. (2011). Long-term single-cell imaging of mammalian stem cells. This study has been funded by The Bavarian Research Network for Nat. Methods, 8, S30–S35. Molecular Biosystems (BioSysNet) and the European Research Council (ERC Shawe-Taylor,J. and Cristianini,N. (2004). Kernel Methods for Pattern starting grant LatentCauses). Analysis. Cambridge University Press. Conflict of Interest: none declared. Stingl,J. et al. (2006). Purification and unique properties of mammary epithe- lial stem cells. Nature, 439, 993–997. Strasser,M. et al. (2012). Stability and multiattractor dynamics of a toggle References switch based on a two-stage model of stochastic gene expression. Biophys. J., 102, 19–29. Amir,E.-a. D. et al. (2013). viSNE enables visualization of high dimensional Takahashi,K. and Yamanaka,S. (2006). Induction of pluripotent stem cells single-cell data and reveals phenotypic heterogeneity of leukemia. Nat. from mouse embryonic and adult fibroblast cultures by defined factors. Cell, Biotechnol., 31, 545–552. 126, 663–676. Arinobu,Y. et al. (2007). Reciprocal activation of GATA-1 and PU. 1 marks Tenenbaum,J.B. et al. (2000). A global geometric framework for nonlinear initial specification of hematopoietic stem cells into myeloerythroid and dimensionality reduction. Science, 290, 2319–2323. myelolymphoid lineages. Cell Stem Cell, 1, 416–427. Trapnell,C. et al. (2014). The dynamics and regulators of cell fate decisions Bandura,D.R. et al. (2009). Mass cytometry: technique for real time single cell are revealed by pseudotemporal ordering of single cells. Nature Biotechnol., multitarget immunoassay based on inductively coupled plasma time-of- 32, 381–386. flight mass spectrometry. Anal. Chem., 81, 6813–6822. Van der Maaten,L. and Hinton,G. (2008). Visualizing data using t-SNE. J. Bendall,S.C. et al. (2014). Single-cell trajectory detection uncovers progression Mach. Learn. Res., 9, 2579–2605. and regulatory coordination in human B cell development. Cell, 157, 714– Wilhelm,J. and Pingoud,A. (2003). Real-time polymerase chain reaction. Chembiochem, 4, 1120–1128. Buettner,F. and Theis,F.J. (2012). A novel approach for resolving differences Yan,L. et al. (2013). Single-cell RNA-Seq profiling of human preimplantation in single-cell gene expression patterns from zygote to blastocyst. embryos and embryonic stem cells. Nat. Struct. Mol. Biol., 20, 1131–1139. Bioinformatics, 28, i626–i632.

Journal

BioinformaticsOxford University Press

Published: May 21, 2015

There are no references for this article.