Which bootstrap for principal axes methods?
Adapted and updated from a draft version of the paper:
L. Lebart (2007) Which bootstrap for principal axes methods? In: Selected Contributions in Data Analysis and Classification, P. Brito et al. Editors, Springer, 581 – 588.
Abstract:
This paper deals with validation techniques in the context of multivariate descriptive techniques involving singular value decomposition (SVD), namely: principal components analysis (PCA) and simple and multiple correspondence analysis (CA and MCA). We briefly show that, according to the concerns of the users, at least five types of resampling techniques can be carried out to assess the quality of the obtained visualisations: a) Partial bootstrap, which treats the replicates as supplementary variables, without diagonalization of the replicated moment-product matrices. b) Total bootstrap type 1, which implies a new diagonalization for each replicate, with corrections limited to possible changes of sign of the axes. c) Total bootstrap type 2, which adds to the preceding one a correction for possible permutations of axes. d) Total bootstrap type 3, which applies Procrustean transformations to all the replicates, striving to take into account both rotations and permutations of axes. e) Specific bootstrap, implying resampling at a different level (case of a hierarchy of statistical units). An example is presented for each type of resampling.
Key words: Principal components analysis, correspondence analysis, bootstrap, assessment.
1. Introduction
Our aim is to assess the results of principal axes methods (PAM), i.e. multivariate descriptive techniques involving singular value decomposition (SVD), such as principal components analysis (PCA) and simple and multiple correspondence analysis (CA and MCA). These methods provide useful data visualisations, but their outputs (parameter estimates, graphical displays) are difficult to assess. Computer-intensive techniques allow us to go far beyond the criterion of “interpretability” of the results that was frequently used during the first phases of the upsurge of data analytic methods thirty years ago (see, e.g., Diday and Lebart, 1977).
To compute the precision of estimates, the classical analytical approach is both unrealistic and analytically complex. The bootstrap (Efron, 1979), on the contrary, makes almost no assumption about the underlying distributions, and gives the possibility to master every statistical computation for each sample replication and therefore to deal with parameters computed through the most complex algorithms.
2. Basic principles of the bootstrap: a reminder
The nonparametric bootstrap consists in drawing, with replacement, a sample of size n out of the n statistical units. Parameter estimates such as means, variances and eigenvectors are then computed on the newly obtained “sample”. This two-fold phase is repeated K times. A common value of K is 200 (Efron and Tibshirani, 1993), but it can vary from 10 to several thousand according to the type of application. We have at this stage K samples (the replicates) drawn from a new “theoretical population” defined by the empirical distribution of the original data set and, as a consequence, K estimates of the parameters of interest. Briefly, and under rather general assumptions, it has been proved that we can estimate the variance (and other statistical properties) of these parameters directly from the set of their K replicates.
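The two-fold scheme above can be sketched in a few lines. This is a minimal illustration on an artificial sample (the data, estimator and sample sizes are chosen only for the example; the paper itself applies the scheme to eigenvectors rather than means):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_replicates(data, estimator, K=200):
    """Draw K samples of size n with replacement from the n rows of
    `data`, and apply `estimator` to each replicate."""
    n = data.shape[0]
    return np.array([estimator(data[rng.integers(0, n, size=n)])
                     for _ in range(K)])

# Toy example: estimate the standard error of the mean from K = 200
# replicates of an artificial sample of size 200
x = rng.normal(loc=10.0, scale=2.0, size=200)
reps = bootstrap_replicates(x, np.mean, K=200)
print(reps.std(ddof=1))   # close to the theoretical 2 / sqrt(200)
```

The same `bootstrap_replicates` loop applies unchanged when `estimator` returns eigenvectors or principal coordinates instead of a mean.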
In the PCA case, variants of the bootstrap exist for active and supplementary variables, both continuous and nominal. In the case of numerous homogeneous variables, a bootstrap on variables has also been proposed, with examples of application to semiometric data (Lebart et al., 2003).
Numerous papers have contributed to the selection of the relevant number of axes and have proposed confidence intervals for points in the subspace spanned by the principal axes. The s-th eigenvector of a replicated correlation matrix is not necessarily homologous to the s-th eigenvector of the original matrix, because of possible rotations, permutations or changes of sign of the axes. In addition, the expectations of the eigenvalues of the replicated matrices are not the original eigenvalues (see, e.g., Alvarez et al., 2004; Lebart, 2006). Several procedures have been proposed to overcome these difficulties (Chateau and Lebart, 1996): partial replications using supplementary elements (partial bootstrap), use of a three-way analysis to process the whole set of replicates simultaneously (Holmes, 1989), and filtering techniques involving reordering of axes and Procrustean rotations (Markus, 1994; Milan and Whittaker, 1995; Gower and Dijksterhuis, 2004).
3. The illustrative example
An open-ended question was included in a multinational survey conducted in seven countries (Japan, France, Germany, Italy, the Netherlands, United Kingdom, USA) in the late 1980s (Hayashi et al., 1992). The respondents were asked: "What is the single most important thing in life for you?".
The illustrative example is limited to the British sample (sample size: 1043). The counts for the first phase of numeric coding are as follows: out of 1043 responses, there are 13 669 occurrences (tokens), with 1 413 distinct words (types). When the words appearing at least 16 times are selected, there remain 10 357 occurrences of these words (tokens), with 135 distinct words (types). The same questionnaire also had a number of closed-end questions (among them, the socio-demographics of the respondents). In this example we focus on a partitioning of the sample into 9 categories, obtained by cross-tabulating age (3 categories) with educational level (3 categories). All the following figures contain an excerpt (only four words) of the principal plane obtained from a CA of the contingency table cross-tabulating the 135 words (rows) appearing at least 16 times and the 9 categories of respondents (columns). The entry (i, j) of such a table is the number of occurrences of word i in the responses of individuals belonging to category j (see: Lebart et al., 1998).
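The numeric coding step described above (token counts, a frequency threshold, then a words-by-categories contingency table) can be sketched on a toy corpus. The responses, category labels and threshold below are invented stand-ins, not the survey data; the paper's actual threshold is 16 occurrences:

```python
from collections import Counter

# Toy stand-in for the survey: (response text, respondent category) pairs
responses = [
    ("family health family", "young_low"),
    ("health money",         "young_low"),
    ("family peace",         "old_high"),
    ("health family peace",  "old_high"),
]
min_count = 2   # the paper keeps words appearing at least 16 times

# Count occurrences (tokens) per distinct word (type), keep frequent types
totals = Counter(w for text, _ in responses for w in text.split())
kept = sorted(w for w, c in totals.items() if c >= min_count)
cats = sorted({c for _, c in responses})

# Contingency table: entry (i, j) = occurrences of word i in category j
table = [[0] * len(cats) for _ in kept]
for text, cat in responses:
    j = cats.index(cat)
    for w in text.split():
        if w in kept:
            table[kept.index(w)][j] += 1
print(kept, table)
```

The resulting table plays the role of the 135 x 9 lexical table analysed by CA in the figures that follow.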
4. Partial bootstrap
The partial bootstrap makes use of projections of replicated elements onto the original reference subspace provided by the eigen-decomposition of the observed covariance matrix. It has several advantages. From a descriptive standpoint, this initial subspace is better than any subspace undergoing a perturbation by random noise. In fact, unlike the eigenvalues, this subspace is the expectation of all the replicated subspaces having undergone perturbations. The plane spanned by the first two axes, for instance, provides an optimal point of view on the data set. In this context, to apply the partial bootstrap to PCA, one may project the K replicates of variable-points onto the common reference subspace and compute confidence regions (ellipses or convex hulls) for the locations of these replicates. Gifi (1980) and Greenacre (1984) addressed the problem in the context of CA and MCA.
Then, for each variable-point and each pair of principal axes, a confidence ellipse is derived from a PCA of the two-dimensional cloud of the K replicates. The lengths of the two principal diameters of these ellipses are normatively fixed at four standard deviations; the corresponding ellipses then contain approximately 90% of the replicates. Empirical evidence suggests that 30 is an acceptable value for K, the number of replicates. Confidence ellipses may also be replaced by convex hulls. Both ways of visualizing the uncertainty around each variable-point are complementary: ellipses take into account the density of the cloud of replicated points, whereas convex hulls pinpoint peripheral points and possible outliers.
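A minimal sketch of this partial bootstrap for PCA follows, on artificial data. Replicated variable-points are obtained here by projecting the columns of each replicated covariance matrix onto the original axes, with no new diagonalization; this schematic projection and the toy data are illustrative choices, while the K = 30 replicates and the four-standard-deviation diameters come from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, K = 300, 5, 30                 # 30 replicates, as suggested above

X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))   # toy data
Xc = X - X.mean(axis=0)

# Original reference subspace: two leading eigenvectors of the covariance
C = Xc.T @ Xc / n
vals, vecs = np.linalg.eigh(C)
axes = vecs[:, ::-1][:, :2]          # p x 2 matrix of leading axes

# Partial bootstrap: replicated matrices are NOT re-diagonalized; each
# replicated variable-point is projected onto the original axes
var_idx = 0                          # follow one variable-point
cloud = np.empty((K, 2))
for k in range(K):
    idx = rng.integers(0, n, size=n)
    Xk = X[idx] - X[idx].mean(axis=0)
    Ck = Xk.T @ Xk / n
    cloud[k] = (Ck @ axes)[var_idx]  # projection, no new diagonalization

# Confidence ellipse: PCA of the 2-D cloud of the K replicates, with
# principal diameters fixed at four standard deviations (~90% coverage)
cov2 = np.cov(cloud.T)
w, _ = np.linalg.eigh(cov2)
diameters = 4 * np.sqrt(w)
print(diameters)
```

The two entries of `diameters` are the principal diameters of the ellipse drawn around this variable-point in the principal plane.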
Figure 1 shows confidence ellipses for four row-points (words). The words corresponding to strongly overlapping ellipses cannot be deemed significantly distinct as regards their distributions among the nine categories that are the columns of the lexical contingency table. Thus, the words church and mind, despite their distinct locations, correspond to the same profile of respondents (profile described by the nine categories). Such a profile is significantly distinct from those of the words nothing and things.
Figure 1. Partial bootstrap: Four confidence ellipses for word-points in the principal plane of a CA [contingency table crossing 135 words and 9 categories of respondents]
5. The total bootstrap and its three options
The total bootstrap consists in performing a new PAM for each replicate. Evidently, the absence of a common subspace of reference may induce a pessimistic view of the variances of the coordinates of the replicates on the principal axes. The most obvious change concerns the signs of the coordinates on the axes, which are a mere by-product of the diagonalization algorithm. We can also observe permutations of axes from one replicate to another, as well as rotations of the axes (cf. Milan and Whittaker, 1995).
We then have to perform a series of transformations to identify the homologous axes during the successive diagonalizations of the K replicated covariance matrices Ck (Ck being the covariance matrix computed on the k-th replicate).
Three types of transformations lead to three tests for the stability of the observed structure:
1. Total bootstrap type 1 (very conservative): simple change (when necessary) of the signs of the axes found to be homologous (merely to remedy the arbitrariness of the signs of the axes). A simple scalar product between homologous original and replicated axes suffices for this elementary transformation.
2. Total bootstrap type 2 (rather conservative): correction for possible permutations of axes. Replicated axes are sequentially assigned to the original axes with which their correlation (in absolute value) is maximum. The signs of the axes are then altered, if needed, as previously.
Figure 2. Total bootstrap type 1: Confidence ellipses for the same word-points in the same original principal plane.
Figure 3. Total bootstrap type 2: Confidence ellipses for the same word-points in the same original principal plane, with correction of possible permutations of axes.
3. Total bootstrap type 3 (possibly lenient if the Procrustean rotation is performed in a space spanned by many axes): a Procrustean rotation (see: Gower and Dijksterhuis, 2004) aims at superimposing, as far as possible, the original and replicated axes.
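The three corrections can be sketched as follows. This is a minimal implementation on orthonormal axis matrices (original axes as columns of `V0`, replicated axes as columns of `Vk`); the function names and the toy check are illustrative, not the paper's code:

```python
import numpy as np

def sign_correction(V0, Vk):
    """Type 1: flip the sign of each replicated axis whose scalar
    product with the homologous original axis is negative."""
    s = np.sign(np.sum(V0 * Vk, axis=0))
    s[s == 0] = 1.0
    return Vk * s

def reorder_axes(V0, Vk):
    """Type 2: sequentially assign to each original axis the replicated
    axis with the largest absolute correlation, then fix the signs."""
    corr = np.abs(V0.T @ Vk)              # axes are unit vectors
    order, free = [], list(range(Vk.shape[1]))
    for i in range(V0.shape[1]):
        j = max(free, key=lambda j: corr[i, j])
        order.append(j)
        free.remove(j)
    return sign_correction(V0, Vk[:, order])

def procrustes(V0, Vk):
    """Type 3: orthogonal Procrustes rotation superimposing the
    replicated axes on the original ones as far as possible."""
    U, _, Wt = np.linalg.svd(V0.T @ Vk)
    return Vk @ (U @ Wt).T                # rotated replicate

# Toy check: a replicate with swapped, sign-flipped axes is recovered
V0 = np.eye(4)[:, :2]                     # two original axes in R^4
Vk = np.column_stack([-V0[:, 1], V0[:, 0]])   # swapped and flipped
print(np.allclose(reorder_axes(V0, Vk), V0))  # True
```

For a replicate that is only rotated within the subspace, `reorder_axes` would fail while `procrustes` still recovers the original axes, which is why type 3 validates a whole subspace rather than individual axes.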
Total bootstrap type 1 ignores possible permutations and rotations of axes. It allows for the validation of stable and robust structures: each replication is expected to reproduce the original axes with the same ranks (order of the eigenvalues).
Total bootstrap type 2 is ideally suited to the validation of axes considered as latent variables, without paying attention to the order of the eigenvalues.
Total bootstrap type 3 allows for the validation of a whole subspace. If, for instance, the subspace spanned by the first four replicated axes coincides with the original four-dimensional subspace, one can find a rotation that brings the homologous axes into coincidence. The situation is then very similar to that of the partial bootstrap.
Figure 2 shows the case of total bootstrap type 1: evidently, the ellipses are much larger. Figure 3 introduces the corrections implied by possible permutations of axes. The pattern observed in Figure 1 reappears, albeit less clearly. This improvement means that some axis permutations were responsible for the perturbations of Figure 2: some stable dimensions may exist, but their order of appearance (order of the corresponding eigenvalues) can vary from one replicate to another.
Figure 4 is similar to Figure 1 as far as the sizes of the ellipses are concerned. In fact, the Procrustean transformations depend on the number of axes taken into consideration. They have been performed here in a 5-dimensional space, and the original space can be retrieved without difficulty, leading to a procedure similar to the partial bootstrap. Lack of space does not allow for displaying all the other possibilities.
Figure 4. Total bootstrap type 3: Confidence ellipses for the same word-points in the same original principal plane, with correction of possible permutations of axes and of possible rotations (Procrustean transformations).
6. Specific (or hierarchical) bootstrap
When dealing with textual data, these resampling techniques can help to solve the problem of the plurality of statistical units (see, in the case of responses to open questions: Tuzzi and Tweedie, 2000). In fact, two (or more) levels of statistical units coexist in textual data analysis. On the one hand, observations or individuals (with their usual meaning in statistics) could be respondents (case of sample surveys) or, e.g., web users (Web mining). On the other hand, within the same textual corpus, other types of observations or individuals could be occurrences (tokens), words, lemmas, or phrases.
The principle of replication can be customized to both of these levels, leading to conclusions adapted to the expected inference. Owing to the discrepancies in text sizes, a pattern could be significant when the statistical unit is the word, yet not relevant when the statistical unit is the respondent or the web user.
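The two resampling levels can be contrasted on a toy corpus. The respondent identifiers and words below are invented for illustration; the point is only that drawing respondents keeps each respondent's tokens together, whereas drawing tokens ignores the respondent level:

```python
import random

random.seed(0)

# Toy corpus: each respondent contributes a variable number of tokens
corpus = {
    "r1": ["family", "health", "family"],
    "r2": ["money"],
    "r3": ["peace", "health"],
}

def resample_respondents(corpus):
    """Level 1: draw respondents with replacement; each drawn
    respondent brings all of his or her tokens."""
    ids = list(corpus)
    drawn = [random.choice(ids) for _ in ids]
    return [w for r in drawn for w in corpus[r]]

def resample_tokens(corpus):
    """Level 2: pool all tokens and draw tokens with replacement,
    ignoring the respondent level."""
    pool = [w for ws in corpus.values() for w in ws]
    return [random.choice(pool) for _ in pool]

print(resample_respondents(corpus))   # variable length: 3 to 9 tokens
print(resample_tokens(corpus))        # always 6 tokens
```

Respondent-level replicates vary in total length and perturb the lexical table more strongly, which is exactly the effect discussed for Figure 5 below.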
Figure 5 shows the same set of four word-points after a specific partial bootstrap, consisting in drawing the 1043 respondents with replacement and projecting the replicates as supplementary variables.
If we compare the ellipses with those of Figure 1, we observe, for example, that the location of the word “things” is now more ambiguous: this is due to the fact that some respondents use the word several times. Consequently, a drawing of respondents induces a larger perturbation of the data. The specific bootstrap is nevertheless the right procedure to carry out if we want to infer the results to the parent population of respondents.
Figure 5. Specific two-level partial bootstrap: bootstrapping the observations (i.e. respondents) instead of the words. This figure should be compared only with Figure 1 (both use the partial bootstrap).
7. Conclusion
The bootstrap stipulates that the observed sample can serve as an approximation of the population. It takes into account the multivariate nature of the observations and involves simultaneously all the axes. We can then compare and analyze the proximities between pairs of categories, without reference to a specific axis.
Bootstrapping can also be used to process weighted data (a circumstance occurring in most sample surveys) and to draw confidence intervals around the locations of supplementary continuous or nominal variables in PAM. In the case of multilevel samples (for example: a sample of respondents, and samples of words within the responses), the replications can involve the different levels separately, allowing one to study the different components of the observed variance.
From a practitioner’s standpoint, principal axes techniques are particularly profitable when the principal space spanned by the first dimensions is considered as a predictive map which purports to receive all the remaining information contained in the data file (set of supplementary questions). In fact, that approach, closely related to regression, is widely used in practice.
In all these cases, assessment procedures are difficult to carry out in a classical statistical framework. Bootstrap techniques can now confer a scientific status on the obtained visualizations.
References
Alvarez R., Bécue M., Valencia O. (2004). Etude de la stabilité des valeurs propres de l’AFC d’un tableau lexical au moyen de procédures de rééchantillonnage, in: « Le poids des mots », Purnelle G., Fairon C., Dister A. (eds), Louvain : PUL, 42-51.
Chateau F., Lebart L. (1996). Assessing sample variability in visualization techniques related to principal component analysis: bootstrap and alternative simulation methods. In : COMPSTAT96, A. Prats (ed), Physica Verlag, Heidelberg, 205-210.
Diday E., Lebart L. (1977). L’analyse des données. La Recherche, 74, 15-25.
Efron B., Tibshirani R. J. (1993). An Introduction to the Bootstrap. New York: Chapman and Hall.
Efron B. (1979). Bootstrap methods: another look at the jackknife. Annals of Statistics, 7, 1-26.
Gifi A. (1990). Non Linear Multivariate Analysis, Chichester: J. Wiley [updated version of: Gifi A. (1980) (same title), Department of Data theory, University of Leiden].
Gower J. C., Dijksterhuis G. B. (2004) Procrustes Problems, Oxford Univ. Press, Oxford.
Greenacre M. (1984). Theory and Applications of Correspondence Analysis. London: Academic Press.
Hastie T., Tibshirani R., Friedman J. (2001). The Elements of Statistical Learning, New York: Springer.
Hayashi C., Suzuki T., Sasaki M. (1992). Data Analysis for Comparative Social Research: International Perspectives. North-Holland, Amsterdam.
Holmes S. (1989). Using the bootstrap and the RV coefficient in the multivariate context. In: Data Analysis, Learning Symbolic and Numeric Knowledge, E. Diday (ed.), Nova Science, New York, 119-132.
Lebart L., Piron M., Morineau A. (2006) Statistique exploratoire multidimensionnelle, Validation et Inférence en fouilles de données. Dunod, Paris.
Lebart L., Salem A., Berry L. (1998). Exploring Textual Data. Kluwer, Dordrecht, Boston.
Lebart L. (2006). Validation techniques in multiple correspondence analysis. In: Multiple Correspondence Analysis and Related Methods, Greenacre M. and Blasius J. (eds), Chapman and Hall, 179-196.
Markus M. Th. (1994). Bootstrap confidence regions for homogeneity analysis: the influence of rotation on coverage percentages. In: COMPSTAT 1994, Dutter R. and Grossmann W. (eds), Physica Verlag, Heidelberg, 337-342.
Milan L., Whittaker J. (1995). Application of the parametric bootstrap to models that incorporate a singular value decomposition. Applied Statistics, 44, 1, 31-49.
Tuzzi A., Tweedie F. J. (2000). The best of both worlds: comparing Mocar and Mcdisp. In: JADT2000 (Cinquièmes Journées Internationales sur l’Analyse des Données Textuelles), Rajman M., Chappelier J.-C. (eds), EPFL, Lausanne, 271-276.