Updated version of "Multivariate Descriptive Statistical Analysis", by L. Lebart, A. Morineau, K. Warwick, John Wiley, New York, 1984.

 

 

Validity of Results, Assessment of Visualizations

 

 

In this chapter we attempt to answer the following questions:

1. Which data matrices should we analyze? How do we construct these matrices?

2. What can we expect from multivariate descriptive statistical analysis (MDSA) techniques?

3. How do we evaluate the quality of the configurations we obtain?

We limit ourselves to the methods presented in the preceding chapters, that is, to the techniques of descriptive principal components analysis, correspondence analysis and its extensions, and clustering techniques. A fourth section will be devoted to bootstrap techniques that play a central role in the assessment of visualizations.

 

1. Which data matrices should we analyze?

 

These methods are most appropriate in the following situation: we want to describe large, homogeneous matrices (of measurements, ratings, or codes) about which very little is known a priori. Three conditions should exist:

 

1. The matrix must be so large that visual inspection or elementary statistical analyses cannot reveal its structure.

 

2. It must be homogeneous, so that it is appropriate to calculate statistical distances between its rows and its columns, and so that these distances can be meaningfully interpreted.

 

3. It must be amorphous, a priori; this means that applying these methods  is most useful when the structure of the matrix is unknown or only partially understood.

The property of homogeneity needs further elaboration. It is usually understood as homogeneity of the texture of the matrix; the coding of the matrix must allow the rows or columns to be comparable; for example, quantities expressed in grams and in meters should not be mixed.

 

Textural homogeneity can generally be achieved through analytical transformations or appropriate coding. Thus normed principal components analysis allows us to analyze heterogeneous measurements (with disparate scales) by standardizing the original variables. In the same vein, transformation into ranks (provided the initial variables have more or less continuous distributions, with few ties) further increases the homogeneity of the matrix to be analyzed.
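As a minimal illustration of these two transformations, the following sketch (assuming NumPy and SciPy, applied to a purely hypothetical data matrix) standardizes the columns for a normed PCA and converts them into ranks:

    import numpy as np
    from scipy.stats import rankdata

    # Hypothetical heterogeneous measurements (e.g., grams, meters, counts).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 4)) * np.array([1000.0, 0.01, 5.0, 50.0])

    # Normed PCA works on standardized variables: zero mean, unit variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)

    # Rank transformation, column by column; appropriate when the variables
    # are roughly continuous with few ties.
    R = np.apply_along_axis(rankdata, 0, X)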

 

Binary coding (cf. The case of MCA) allows correspondence analysis to simultaneously analyze nominal variables (such as region or socio-professional category) and continuous variables (age, income) that have previously been coded into classes.
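The following sketch (assuming pandas, with invented variables and class boundaries) illustrates this binary, or complete disjunctive, coding of a nominal variable and of a continuous variable previously cut into classes:

    import pandas as pd

    # Hypothetical respondents: one nominal variable and one continuous variable.
    df = pd.DataFrame({
        "region": ["north", "south", "south", "west"],
        "age":    [23, 47, 35, 62],
    })

    # The continuous variable is coded into classes, then both variables are
    # expanded into 0/1 indicator columns (complete disjunctive coding).
    df["age_class"] = pd.cut(df["age"], bins=[0, 30, 50, 120],
                             labels=["age<30", "age30-50", "age>50"])
    Z = pd.get_dummies(df[["region", "age_class"]])
    print(Z)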

However, textural homogeneity is generally not sufficient. For a clear interpretation, it is important that the material being analyzed (i.e., the set of active variables) be homogeneous in its substance, or rather its content, thus respecting the principle of relevance recommended by linguists: out of the heterogeneous mass of facts, retain only those facts that are related to one point of view.

 

This supplementary condition is not mandatory, but it often makes interpretation easier and clearer. In practice, this requirement leads us to identify several groups of variables, some of which have an active role in the construction of typologies, while others have the role of illustrative variables (also known as supplementary variables).

 

The difference between analyzed variables and illustrative variables is a fundamental one. We have already encountered this in previous chapters.

 

The final location of a variable that did not participate in the analysis is, in a sense, a validity check; since it did not contribute to the construction of the principal axes, the interpretation of its correlation with a principal axis is all the more significant.

 

Of course, the illustrative variables of one analysis may become the active variables of another analysis, provided they are a homogeneous group of variables; then the formerly active variables become supplementary. This process sometimes yields a more complete interpretation.

 

2. What can we expect from multivariate descriptive techniques (MDT)?

 

Our experience has been mainly in economics, social science, and marketing. Therefore we limit ourselves to discussing applications of the methods in these particular fields. There are certainly good opportunities for these methods in the natural sciences, psychology, and other disciplines. It is too early to assess the real impact of MDT on these fields; our evaluation is thus only a partial and temporary one. However, we make a clear distinction between the technical advantages and the more fundamental advantages of MDT.

 

2.1 Technical Advantages

 

(a) Gain in Productivity in Survey Data Processing.

 

By means of the techniques of MDT, tasks can be ordered rationally in time, thus avoiding confusion among steps. Most of the steps can be illustrated by maps.

Finally, information that was formerly inaccessible now becomes available.

MDT also allows us to perform tests of consistency and error detection, as discussed in the next section. In fact, these procedures simultaneously provide a gain in productivity and an improvement in the quality of the results.

 

(b) Tests of Consistency of Data and Error Detection.

 

Detecting outlying or erroneous values is an ancillary result with which statisticians who use principal axes analysis or related techniques are familiar: outliers are often found on the hyperplanes spanned by the first axes.

On the other hand, we can perform a real evaluation of the data by positioning "marker variables" on the maps. Just as we can illustrate a map by adding certain characteristics of the individuals that are intrinsic to the survey, we can also use variables that characterize the way the information is gathered: individuals interviewed by the same interviewer, time of interview, interviewer's comments (properly coded), and so on. We can also perform an evaluation of the questionnaire itself, by looking at the locations of "non-responses" with respect to the positions of the actual responses.

 

 (c) Construction of Artificial Indices.

 

The first axis is the linear combination of the variables having the largest variance. It constitutes an excellent index for discriminating among individuals. In many applications, this artificial index has a great deal of descriptive power and a meaningful interpretation: it is, for example, the general aptitude factor found in psychological studies in the early twentieth century. It also allows us to replace large batteries of nominal variables by numerical variables, which are much easier to analyze.
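A minimal sketch of such an index (assuming NumPy and a hypothetical matrix of ratings): the eigenvector associated with the largest eigenvalue of the correlation matrix defines the linear combination of maximal variance, and the corresponding scores provide a one-dimensional index for the individuals.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 6))                    # hypothetical ratings
    Z = (X - X.mean(axis=0)) / X.std(axis=0)         # standardize (normed PCA)

    # The eigenvector with the largest eigenvalue of the correlation matrix
    # defines the linear combination of the variables with maximal variance.
    eigval, eigvec = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
    u1 = eigvec[:, -1]                               # first principal axis
    index = Z @ u1                                   # artificial index, one value per individual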

 

2.2. Fundamental Advantages

 

(a) New Fields of observation.

 

The possibility of simultaneously treating numerous pieces of information eliminates the need for trimming down a priori the variables to be analyzed in a data set.

Statisticians have long concerned themselves with numerous observations on a small number of variables, whether the purpose is to validate a particular dependency model or to test a hypothesis. The statistician's concern has always been with handling too many observations, not too many variables. However, the convergence of certain estimates in principal axes analysis, when the number of variables increases indefinitely, was already mentioned by Hotelling (1933) (see also Wachter, 1978).

Now that the computational obstacle has been removed, "sampling" can be done on both dimensions. In fact, correspondence analysis, which was initially intended for contingency tables, treats these two dimensions symmetrically. The possibility of exploring the "variable dimension" is an innovation whose consequences are as yet relatively unexplored. In a discussion concerning economics, Benzecri (1974) doubts that purely analytical data reduction (i.e., explanation of complex phenomena by simple phenomena) is possible, as it is in physics. In economics as in other branches of the social sciences, "the order of the composite phenomenon is worth more than the elementary properties of its components." Data analysis makes it possible to observe complex multidimensional universes, albeit in a still rudimentary fashion, and to treat globally information that previously had to be partitioned to become analyzable.

 

(b) New Analytical Tools.

 

Presenting the results as maps is in itself a methodological innovation, although the rules for reading these maps are more complicated than they appear to be. In fact, common language, by its linear and sequential character, makes it easy to express non-symmetric relationships such as implications, whereas the symmetric relationship of covariance is more difficult to express in words without implying a causal relationship.

This is why the two-dimensional pictures that represent the principal planes are very useful tools for analysis and communication.  It is possible to obtain a general overview of large data matrices that is not purely subjective. Thus two researchers who have collected similar data can, in a few words (most often by examining the first two principal axes) grasp the similarities and differences between the two data matrices.

 

 

 3. How do we evaluate the quality of the configurations?

 

The results of MDT raise several questions:

Are we really observing something that exists? Do the data have a structure? Or, on the contrary, do random errors or sampling fluctuations alone account for the values obtained for the eigenvalues and the explained variances?

Do the percentages of explained variance represent a genuine share of the information?

Are the configurations obtained stable, given what we know about the precision of the data, the nature of the coding, and the relative importance of the different variables?

The following sections attempt to answer each one of these questions.

 

3.1. Hypothesis of Independence

 

The hypothesis of independence of the rows and columns of a matrix is generally too strict to be realistic. It is highly improbable that a matrix being analyzed would be similar to a matrix of random numbers.

Although it is an extreme case with limited applicability, the hypothesis of independence allows us to define thresholds of significance for the eigenvalues and the percentages of explained variance, which can serve as guidelines for the user.

In the cases of the analysis of ranks (chapter 2) and of the correspondence analysis of contingency tables (chapter 4), the distributions of the eigenvalues under independence do not depend on unknown parameters. In these favourable circumstances it was possible to obtain approximate tabulations and to draw summary nomograms.

 

(a) The Case of Principal Components Analysis.

 

In some applications the classical correlation coefficients are replaced by Spearman's correlation coefficients in principal components analysis.

Under the hypothesis of independence, the distribution of Spearman's coefficient depends only on the number of observations n. Similarly, the distribution of the eigenvalues of the correlation matrix of ranks depends only on the parameters n and p, where p is the number of variables. It is possible to construct an approximate tabulation by simulation. We thus obtain an idea of the "degree of significance" of the different percentages of variance under the hypothesis that the p variables are pairwise independent and have continuous distributions (regardless of the shape of these distributions).
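Such a tabulation can be approximated along the following lines (a sketch assuming NumPy and SciPy; the sample size, number of variables and number of trials are arbitrary): simulate p independent continuous variables, replace them by their ranks, and record the share of variance of the first eigenvalue of the resulting correlation matrix. The upper quantiles of the simulated shares then serve as guideline thresholds.

    import numpy as np
    from scipy.stats import rankdata

    def simulated_first_share(n=100, p=10, trials=500, seed=0):
        """Simulated distribution of the percentage of variance of the first
        axis when the p variables are independent and continuous (working on
        ranks makes the result distribution-free)."""
        rng = np.random.default_rng(seed)
        shares = np.empty(trials)
        for t in range(trials):
            X = rng.normal(size=(n, p))               # any continuous distribution would do
            R = np.apply_along_axis(rankdata, 0, X)   # replace values by ranks
            eig = np.linalg.eigvalsh(np.corrcoef(R, rowvar=False))
            shares[t] = eig[-1] / p                   # first eigenvalue over total variance
        return np.quantile(shares, [0.90, 0.95, 0.99])

    print(simulated_first_share())                    # guideline thresholds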

This does not entirely solve the problem of the number of significant axes to retain in the analysis, because the eigenvalues themselves are not independent.

Under the hypothesis of independence and of multinormality, the covariance matrix has a Wishart distribution.

This distribution was established by Fisher (1915) in the case  p = 2,  then by Wishart (1928), in the general case.

 

If the n rows of a matrix X of order (n, p) are independent realisations of a multinormal vector with zero means and non-singular theoretical covariance matrix Σ, then the matrix S = X'X (which contains p(p+1)/2 distinct elements) is distributed according to a Wishart distribution W(p, n, Σ), whose density function f(S) is given by the following equation:

\[
f(S) = c \, |S|^{(n-p-1)/2} \exp\left( -\tfrac{1}{2} \, \mathrm{tr}\!\left(\Sigma^{-1} S\right) \right),
\]

the constant c in the right-hand side being:

\[
c = \left[ 2^{np/2} \, \pi^{p(p-1)/4} \, |\Sigma|^{n/2} \prod_{j=1}^{p} \Gamma\!\left( \frac{n+1-j}{2} \right) \right]^{-1}.
\]

We check that, for Σ = I (unit matrix) and p = 1, denoting s = x'x, we find again the classical χ² distribution with n degrees of freedom:

\[
f(s) = c \, s^{(n-2)/2} \, e^{-s/2}, \qquad \text{with} \quad c = \frac{1}{2^{n/2} \, \Gamma(n/2)}.
\]

 

The probability density of the eigenvalues extracted from a Wishart matrix was formulated by Fisher (1939), Girshick (1939), Hsu (1939), and Roy (1939), and then by Mood (1951). The proof is found in Anderson (1958).

 

When Σ = I, the density of the Wishart distribution is written as a function of the trace and of the determinant of S, that is, of the sum and the product of the eigenvalues λk:

\[
f(S) = c \left( \prod_{k=1}^{p} \lambda_k \right)^{(n-p-1)/2} \exp\left( -\tfrac{1}{2} \sum_{k=1}^{p} \lambda_k \right).
\]

The density of the eigenvalues (λ1 ≥ λ2 ≥ ... ≥ λp ≥ 0) is written:

\[
f(\lambda_1, \dots, \lambda_p) = c' \prod_{k=1}^{p} \lambda_k^{(n-p-1)/2} \exp\left( -\tfrac{1}{2} \sum_{k=1}^{p} \lambda_k \right) \prod_{k<k'} (\lambda_k - \lambda_{k'}).
\]

The constant c' is:

\[
c' = \frac{\pi^{p^2/2}}{2^{np/2} \, \Gamma_p(n/2) \, \Gamma_p(p/2)},
\qquad \text{where} \quad
\Gamma_p(a) = \pi^{p(p-1)/4} \prod_{j=1}^{p} \Gamma\!\left( a - \frac{j-1}{2} \right).
\]

 

Integration of this rather complex density function has given rise to several publications; among them are Pillai (1965), and Krishnaiah and Chang (1971), based on the work of the physicist Mehta (1967).

 

Tables of thresholds corresponding to the two extreme eigenvalues were published: by Choudary Hanumara and Thompson (1968) for matrices whose smaller dimension p is less than 10; by Pillai and Chang (1970) and by Clemm, Krishnaiah, and Waikar (1973) for p < 21.

 

Unfortunately, the hypothesis of independence is unrealistic for most applications to real data.

The problem of significance is therefore less pressing than the problem of assessing the stability of observed structures.

 

(b) The Case of Correspondence Analysis.

 

The distribution of eigenvalues extracted from a correspondence analysis under the hypothesis of independence has given rise to a number of erroneous publications. Thus in Kendall and Stuart (1961) the eigenvalues are said to follow a chi-squared distribution, as does the total variance. Lancaster (1963) refuted this result by showing that the expected value of the first eigenvalue is always greater than the value derived from Kendall and Stuart's assertions. References concerning other approximations can be found in the work of Kshirsagar (1972), where it is suggested that the eigenvalues, being canonical correlation coefficients calculated on binary variables, might follow a distribution very close to that of the same coefficients calculated on Gaussian variables. Simulations have also shown that this approximation is unsatisfactory.

 

As a matter of fact, it can be shown (Lebart, 1975b, 1976; Corsten, 1976; O'Neil, 1978) that the distribution is related to the Wishart distribution in the following sense: if λα is the α-th eigenvalue produced by the correspondence analysis of a table K of order (n, p), with total sum k, then the distribution of kλα is approximately that of the α-th eigenvalue of a Wishart matrix with parameters n - 1 and p - 1 (the "Fisher-Hsu" distribution).
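This correspondence can be checked empirically. The sketch below (assuming NumPy; the table dimensions, grand total and number of trials are arbitrary, the grand total being taken large enough to avoid empty rows or columns) compares kλ1 for CAs of tables simulated under independence with the largest eigenvalue of Wishart matrices with parameters n - 1 and p - 1.

    import numpy as np

    def ca_eigenvalues(K):
        """Eigenvalues (squared singular values) of the CA of a contingency table K."""
        P = K / K.sum()
        r, c = P.sum(axis=1), P.sum(axis=0)
        S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
        return np.linalg.svd(S, compute_uv=False) ** 2

    rng = np.random.default_rng(2)
    n, p, k = 10, 6, 2000                     # table dimensions and grand total
    probs = np.full(n * p, 1.0 / (n * p))     # cell probabilities under independence

    ca_first, wishart_first = [], []
    for _ in range(300):
        K = rng.multinomial(k, probs).reshape(n, p)
        ca_first.append(k * ca_eigenvalues(K)[0])
        X = rng.normal(size=(n - 1, p - 1))   # X'X is Wishart with parameters n-1, p-1
        wishart_first.append(np.linalg.eigvalsh(X.T @ X)[-1])

    print(np.mean(ca_first), np.mean(wishart_first))   # should be of comparable magnitude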

 

Let us summarize by recalling that, in both the simple and the multiple correspondence analysis cases, most of the theoretical results provided by mathematical statistics are not applicable.

 

(c) Conclusion of Section 3.1

 

The hypothesis of independence between rows and columns of a matrix is often too strict to be realistic. Under this hypothesis, in two-way correspondence analysis applied to an (n, p) contingency table, the eigenvalues are asymptotically those obtained from a Wishart matrix. As a consequence, still under the hypothesis of independence, the relative values of the eigenvalues (percentages of variance) are statistically independent of their sum, which follows the usual chi-square distribution with (n-1)(p-1) degrees of freedom. In the case of MCA, or more generally in the case of binary data, the distribution of the eigenvalues is more complex, their sum does not have the same meaning as in CA, and the percentages of variance are misleading measures of information (Lebart et al., 1984).

 

The delta method, one of the classical methods of asymptotic statistics (Gifi, 1990), allows us to observe the consequences of perturbations of the data on eigenvalues and eigenvectors under more realistic hypotheses. In practice, the versatile technique of bootstrap (section 4, below) will be the most useful validation tool.

 

3.2. Percentage of Variance and Information: some counterexamples

 

Outside the correspondence analysis of real contingency tables, the utility of percentages of variance is very limited. A few counterexamples show that these coefficients are not suitable for characterizing the quality of a description.

 

Case of the Analysis of a Matrix Associated with a Symmetric Graph.

 

Although it is clear that CA is appropriate for count or binary data and PCA for real-valued measurements, the user of the latter, much more widespread, method may legitimately ask what the risks of misleading results are when applying it to count or binary data (see Lebart et al., 1998). Since it seems natural to calibrate visualization tools on artificial data sets provided with an a priori structure, we present below a comparison of the two methods applied to the same binary data matrix, associated with a "chessboard-shaped" graph (figure 1). In this figure, a line (an edge) drawn between two vertices means that the vertices are adjacent.

M is the symmetric binary sparse matrix associated with the graph. Its general entry (i, j) has the value 1 if the edge (i, j) exists, and the value 0 otherwise.
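For concreteness, here is a sketch of the construction of M (assuming NumPy, with the 25 vertices of the 5 x 5 grid of figure 1 numbered row by row):

    import numpy as np

    # Adjacency matrix M of the 5 x 5 "chessboard" grid graph: two vertices
    # are adjacent when they are horizontal or vertical neighbours on the grid.
    side = 5
    n = side * side
    M = np.zeros((n, n), dtype=int)
    for i in range(side):
        for j in range(side):
            v = i * side + j
            if j + 1 < side:                       # right neighbour
                M[v, v + 1] = M[v + 1, v] = 1
            if i + 1 < side:                       # neighbour below
                M[v, v + side] = M[v + side, v] = 1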

 

Principal components analysis of matrix M

 

In a first step, principal components analysis is applied to data matrix M. Such an analysis can be performed using either the covariance matrix or the correlation matrix.

The numerical results appear to be similar in both cases, the obtained visualizations being almost identical. Thus the analysis involving the correlation matrix is presented here. Figure 2 shows a visualization of the locations of the 25 vertices in the plane spanned by the first two principal axes.

 

These axes correspond to two identical eigenvalues (λ1 = λ2 = 3.98), together explaining 31.86 % of the total variance. The vertices adjacent in the original graph have been joined by an edge to highlight the initial structure.

 

 

 

Figure 1. Graph G associated with a "chessboard" (square lattice grid)

 

 

Figure 2. Visualization of graph G through principal components analysis

(plane spanned by the first two principal axes)

 

 

The symmetry with respect to vertex number 13 is reconstituted. The relative locations of the vertices vis-à-vis their neighbours are generally taken into account by the display, with the exception of the four vertices corresponding to the corners of the grid (vertices 1, 5, 21, 25), which are folded back toward the center. The changes in the lengths of some edges are noticeable; they are characterized by a dilation of the four most central cycles of the graph.

 

Correspondence analysis of matrix M

 

Correspondence analysis is then applied to the same data matrix M. Figure 3 shows a visualization of the locations of the 25 vertices in the plane spanned by the first two principal axes. These axes also correspond to two identical eigenvalues (λ1 = λ2 = 0.814), together explaining 32.24 % of the total variance.

 

Although the graph in figure 1 is somewhat conventional (it can be drawn in several possible ways), the display in figure 3 satisfactorily reconstitutes both the relative positions of the vertices and an acceptable order of magnitude for the lengths of the various edges. This ability of CA to produce legible maps out of such data matrices can be extended to binary matrices describing various planar graphs.
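The whole comparison can be reproduced numerically with the short sketch below (assuming NumPy and rebuilding the matrix M constructed above); it prints the two leading eigenvalues and the percentage of variance of the first plane for the PCA (correlation matrix) and for the CA of M.

    import numpy as np

    # Rebuild the 5 x 5 grid adjacency matrix M (same graph as above).
    side = 5
    idx = np.arange(side * side).reshape(side, side)
    M = np.zeros((side * side, side * side))
    M[idx[:, :-1].ravel(), idx[:, 1:].ravel()] = 1     # horizontal edges
    M[idx[:-1, :].ravel(), idx[1:, :].ravel()] = 1     # vertical edges
    M = M + M.T

    # PCA: eigenvalues of the correlation matrix of the columns of M.
    pca_eig = np.linalg.eigvalsh(np.corrcoef(M, rowvar=False))[::-1]
    print(pca_eig[:2], 100 * pca_eig[:2].sum() / pca_eig.sum())

    # CA: squared singular values of the matrix of standardized residuals.
    P = M / M.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    ca_eig = np.linalg.svd(S, compute_uv=False) ** 2
    print(ca_eig[:2], 100 * ca_eig[:2].sum() / ca_eig.sum())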

 

Figure 3. Visualization of graph G through correspondence analysis

(plane spanned by the first two principal axes)

 

 

We note that the percentage of explained variance (32.24 %) is relatively modest compared with the quality of the reconstitution of the original structure in the corresponding plane. In CA (and in PCA as well), this phenomenon often occurs when dealing with binary data. In this context, the percentages of variance explained by the principal axes always give a pessimistic view of the extracted information.

 

Such empirical evidence favours the use of CA to visualize regular planar graphs known through their associated matrices.

 

In the case of simple graphs (chains, cycles), an analytical calculation is possible without the help of a computer. It can be shown, for instance, that the percentage of variance relating to the first principal plane is a decreasing function of the number of nodes of the graph. Even though a perfect representation of the graph is obtained, the proportion of explained variance can be smaller than the inverse of the number of nodes.

 

The preceding counterexamples show that the percentages of explained variance are extremely conservative measures of the quality of an analysis. This is in contrast to multiple correlation coefficients, for example, which are generous measures of the quality of a regression. The initial raw information is not an adequate frame of reference; thus we are often not justified in referring to percentages of explained variance as "parts of information".

 

3.3. Stability of the Patterns: an empirical approach

 

Calculations of stability and sensitivity are probably the most convincing validation procedures. These calculations basically consist of verifying the stability of the configurations obtained after modifying the initial matrix in various ways.

What are the elements that can influence the quality of the results of a principal axes analysis?

 

We can name three of them:

a. Measurement errors.

b. Choice and weight of variables.

c. Coding of variables.

 

The fundamental problem of sampling fluctuations will be dealt with in section 4 (bootstrap techniques).

 

Each one of these sources of disturbance produces alterations in the initial data matrix, which should not affect the configurations if they are stable. In some cases, a single simulation may be enough, since the purpose is to verify the stability of the initial configuration.

 

(a) Measurement Errors. The order of magnitude of these errors and their approximate distribution in the population must be specified by the user as a function of his or her own knowledge of the field under study. For example, in the classical case of ordinal responses of the type: "completely disagree," "somewhat disagree," "agree somewhat," "completely agree," we can assume that there is one chance out of two that the respondent answered exactly the way he or she felt, and one chance out of four (except at the extremes) that the answer category was right next to the way the respondent felt.

Computational programs generally allow us to simulate a great variety of situations whose analytical interpretation would be impossible. Because of this, the hypotheses we test may be much better adapted to real situations and to users' actual problems than are classical hypotheses of mathematical statistics. On the other hand, a certain amount of programming is required in order to perform these validations.
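As a sketch of the perturbation described above (assuming NumPy, with hypothetical four-level ordinal codes), each response is kept as is with probability 1/2 and shifted to an adjacent category with probability 1/4 on each side, clipping at the extreme categories; the analysis is then rerun on the perturbed matrix and the resulting configuration compared with the original one.

    import numpy as np

    def perturb_ordinal(X, n_levels=4, seed=None):
        """Shift ordinal codes (1..n_levels) by -1, 0 or +1 with probabilities
        1/4, 1/2, 1/4, clipping at the extreme categories."""
        rng = np.random.default_rng(seed)
        shift = rng.choice([-1, 0, 1], size=X.shape, p=[0.25, 0.5, 0.25])
        return np.clip(X + shift, 1, n_levels)

    # Rerun the analysis on X_pert and check that the principal plane is
    # essentially unchanged; a single perturbed copy may be enough.
    rng = np.random.default_rng(3)
    X = rng.integers(1, 5, size=(1000, 8))     # hypothetical four-level survey codes
    X_pert = perturb_ordinal(X, seed=4)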

 

(b) Choice and Weight of Variables. This problem arises when the statistician is able to "sample" within the variable space, which is not always the case. "Random samplings" may be performed among variables, to test the sensitivity of the results with respect to the composition of the variable set. The bootstrap (section 4) is also a valuable alternative.

 

(c) Coding of the Variables.

By coding, we mean the preliminary controlled transformation of the raw data before performing an analysis. We feel that this is a basically empirical operation, because it is inextricably linked to the contents of the data. Like data analysis itself, coding is intended to increase the practical value of the data. The purpose is not to make the data more easily understood by the analyst, but to make them better adapted to the method.

On the other hand, coding can be a source of disturbance in the ratings, scales, and rankings (for example, in analysis of preferences). It is important, then, to verify that the obtained configurations are resistant to monotone transformations of the data (exponential, logarithm, logistic curve, etc.). The order of the ratings should be more important than the particular metric properties of the scale used.
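One simple way to check this (a sketch assuming NumPy and SciPy, on hypothetical positive data) is to compare the principal plane of the raw data with that of a monotone transform of the data, for instance through a Procrustes disparity, which should be close to zero when the configuration is stable.

    import numpy as np
    from scipy.spatial import procrustes

    def principal_plane(X):
        """Row coordinates on the first two axes of a normed PCA of X."""
        Z = (X - X.mean(axis=0)) / X.std(axis=0)
        eigval, eigvec = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
        return Z @ eigvec[:, [-1, -2]]

    rng = np.random.default_rng(5)
    X = rng.lognormal(size=(300, 6))           # hypothetical positive ratings
    plane_raw = principal_plane(X)
    plane_log = principal_plane(np.log(X))     # a monotone transformation of the data

    # A Procrustes disparity close to 0 indicates a stable configuration.
    _, _, disparity = procrustes(plane_raw, plane_log)
    print(disparity)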

 

4. Bootstrap techniques

 

For many principal axes techniques such as principal component analysis (PCA), simple (CA) or multiple (MCA) correspondence analysis, bootstrap resampling techniques (Efron, 1979) are used to produce confidence regions on two-dimensional displays. The bootstrap replication scheme allows one to draw confidence ellipses or convex hulls for both active and supplementary categories, as well as supplementary continuous variables.

To compute the precision of estimates, the bootstrap method is very useful because:

- the classical approach is both unrealistic and analytically complex,

- the bootstrap makes almost no assumption about the underlying distributions,

- it gives the possibility of controlling every statistical computation for each replicated sample, and therefore of dealing with parameters computed through the most complex algorithms.

 

4.1 Basic idea of the bootstrap

 

The first phase consists of drawing a sample of size n, with replacement, from the n statistical units (the rows of the data matrix Z), and of computing the parameter estimates of interest, such as means, variances, or eigenvectors, on the new "sample" thus obtained. A simple uniform pseudo-random generator provides n independent drawings of one integer between 1 and n, from which the n "bootstrap weights" are easily derived. This phase is repeated K times. The value of K can vary from 10 to several thousand according to the type of application (see Efron and Tibshirani, 1993). We have at this stage K samples (the replicates) drawn from a new "theoretical population" defined by the empirical distribution of the original data set, and, as a consequence, K estimates of the parameters of interest. Under rather general assumptions, it has been proved that we can estimate the variance of these parameters (or other summary statistics) directly from the set of their K replicates.
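The replication scheme itself requires only a few lines of code. The sketch below (assuming NumPy; the data, the statistic and the number K of replications are arbitrary) draws K bootstrap samples of the rows and collects the replicated values of a statistic, here the first eigenvalue of the correlation matrix.

    import numpy as np

    def bootstrap_replicates(Z, statistic, K=200, seed=0):
        """Draw K bootstrap samples of the n rows of Z (with replacement) and
        return the K replicated values of `statistic`."""
        rng = np.random.default_rng(seed)
        n = Z.shape[0]
        reps = []
        for _ in range(K):
            idx = rng.integers(0, n, size=n)   # n draws among the n row indices
            reps.append(statistic(Z[idx]))
        return np.array(reps)

    # Example: variability of the first eigenvalue of a correlation matrix.
    rng = np.random.default_rng(6)
    Z = rng.normal(size=(500, 5))
    first_eig = lambda X: np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[-1]
    reps = bootstrap_replicates(Z, first_eig)
    print(reps.mean(), reps.std())             # bootstrap estimate of its dispersion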

 

4.2 Context of PCA

 

In the case of PCA, variants of the bootstrap exist for active variables and supplementary variables, both continuous and nominal. In the case of numerous homogeneous variables, a bootstrap on variables has also been proposed, with examples of application to semiometric data (Lebart et al., 2003). Numerous papers have contributed to selecting the relevant number of axes, and have proposed confidence intervals for points in the subspace spanned by the principal axes. These parameters are computed after the realization of each replicated sample and involve constraints that depend on these samples. The s-th eigenvector of a replicated correlation matrix is not necessarily homologous to the s-th eigenvector of the original matrix, because of possible rotations, permutations or changes of sign. In addition, the expectations of the eigenvalues of the replicated matrices are distinct from the original eigenvalues. This is exemplified by a classical result: suppose that the theoretical covariance matrix is the unit matrix I (that is, all theoretical eigenvalues take the value 1 and all theoretical covariances take the value 0). Nevertheless, the largest eigenvalue from the PCA of any finite sample drawn from that population will be greater than 1. Similarly, in the context of the bootstrap, the expectation of the first eigenvalue of the PCA of the replicated matrices is markedly greater than its "theoretical" counterpart, calculated on the observed covariance matrix.

 

Some procedures have been proposed to overcome these difficulties (Chateau and Lebart, 1996): partial replications using supplementary elements (partial bootstrap), use of a three-way analysis to process simultaneously the whole set of replications, filtering techniques involving reordering of axes and Procrustean rotations (Markus, 1994; Milan and Whittaker, 1995).

 

The partial bootstrap makes use of projections of replicated elements on the original reference subspace provided by the eigen-decomposition of the observed covariance matrix. It has several advantages. From a descriptive standpoint, that initial subspace is better than any subspace undergoing a perturbation by a random noise. In fact, unlike the eigenvalues, this subspace is the expectation of all the replicated subspaces having undergone perturbations. The plane spanned by the first two axes, for instance, provides an optimal point of view on the data set. In this context, to apply the usual bootstrap to PCA, one may project the K replicates of variable-points in the common reference subspace, and compute confidence regions (ellipses or convex hulls) for the locations of these replicates (Greenacre, 1984).
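A sketch of this partial bootstrap for variable-points (assuming NumPy, on a hypothetical data matrix): the eigenvectors of the observed correlation matrix define the fixed reference axes, and each replicated correlation matrix is projected onto them, which amounts to treating the replicated variables as supplementary elements.

    import numpy as np

    def variable_coords(C, eigvec, eigval, axes=(-1, -2)):
        """Coordinates of variable-points on the chosen axes, obtained by
        projecting a correlation matrix C onto fixed eigenvectors."""
        return np.column_stack([C @ eigvec[:, a] / np.sqrt(eigval[a]) for a in axes])

    rng = np.random.default_rng(7)
    X = rng.normal(size=(400, 6))              # hypothetical data matrix
    C0 = np.corrcoef(X, rowvar=False)
    eigval, eigvec = np.linalg.eigh(C0)        # original (fixed) reference axes

    # Partial bootstrap: each replicated correlation matrix is projected onto
    # the ORIGINAL axes, giving a cloud of replicates around each variable-point.
    K, n = 30, X.shape[0]
    clouds = []
    for _ in range(K):
        idx = rng.integers(0, n, size=n)
        Ck = np.corrcoef(X[idx], rowvar=False)
        clouds.append(variable_coords(Ck, eigvec, eigval))
    clouds = np.array(clouds)                  # shape (K, p, 2): K replicates per variable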

 

4.3 Context of CA and MCA

 

Gifi (1990) and Greenacre (1984) did pioneering work in addressing the problem in the context of simple CA and MCA. As mentioned previously in the case of PCA, it is easier to assess eigenvectors than eigenvalues, whose replicates are biased estimates of the theoretical ones (Alvarez et al., 2004). In fact, as far as bootstrapping is concerned, the context of MCA is identical to that of PCA, and all that we have said above about total and partial bootstrap applies to MCA. A specific replication can be generated by a drawing with replacement of the n individuals (rows of the indicator matrix Z). Each replication k leads to a Burt contingency table Ck whose rows (or columns) can be projected as supplementary elements onto the initial principal axes (partial bootstrap). K replicates (j1, j2, ..., jk, ..., jK) are obtained for each category-point j.

Then p PCAs are performed, one per category-point, in the two-dimensional space spanned by a chosen pair of axes, in order to draw the p confidence ellipses. The lengths of the two principal radii of these ellipses are conventionally set to two standard deviations (i.e., twice the square roots of the eigenvalues of each of these PCAs). Empirical evidence suggests that K = 30 is an acceptable number of replicates, and that in such a case the corresponding ellipses contain approximately 86% of the replicates.

Alternatively, the confidence ellipses can be replaced by convex hulls. Note that the two ways of visualizing the uncertainty around each category-point (ellipses or convex hulls) are complementary, the former taking into account the density of the replicated points, the latter pinpointing the peripheral points.
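The ellipse of one category can be drawn from its cloud of K replicated positions in the chosen plane with a small two-dimensional PCA, as in the following sketch (assuming NumPy; the two-standard-deviation convention is the one mentioned above).

    import numpy as np

    def ellipse_points(cloud, n_std=2.0, n_points=100):
        """Two-standard-deviation ellipse of a (K, 2) cloud of replicated
        positions: a small PCA of the cloud gives the two principal radii
        and the orientation of the ellipse."""
        center = cloud.mean(axis=0)
        eigval, eigvec = np.linalg.eigh(np.cov(cloud, rowvar=False))
        t = np.linspace(0.0, 2.0 * np.pi, n_points)
        circle = np.column_stack([np.cos(t), np.sin(t)])
        radii = n_std * np.sqrt(eigval)        # lengths of the two principal radii
        return center + (circle * radii) @ eigvec.T

    # `cloud` would contain the K replicated coordinates of one category-point
    # in the chosen plane (e.g. clouds[:, j, :] in the sketch of section 4.2).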

 

4.4  Example of CA validation 

 

In 1985, Gary Taylor discovered a poem of nine stanzas containing 258 distinct words (429 occurrences) that might be attributable to Shakespeare. This gave Thisted and Efron (1987) the opportunity to design a complex statistical model, developed in the paper cited above (their model does not use the bootstrap). As a precaution, the authors simultaneously analyzed seven other Elizabethan poems whose attribution is not in doubt, four of which are Shakespeare's.

 

                     Table 1.   Eight Elizabethan poems

 

Author                               Title

Ben Jonson                      An Elegy

C. Marlowe                      Four poems

J. Donne                           The Ecstasy

     --------

Shakespeare                    Cymbeline (excerpts)

Shakespeare                    A Midsummer Night's Dream (excerpts)

Shakespeare                    The Phoenix and Turtle

Shakespeare                    Sonnets (excerpts)

Shakespeare (?)              Taylor's Poem

 

Table 2. Distribution of words in the eight poems according to their frequency

of appearance in Shakespeare's work.

Rows: frequency of words in the work of Shakespeare (defining 12 categories = 12 rows).

Columns: number of words of these 12 categories found in each of the eight poems.

Example: The poem of Ben Jonson (column BJon) contains 8 words that have never been used by Shakespeare (first row: Freq = 0), 2 words that have been used once by Shakespeare, …,148 words that have been used more than 100 times by Shakespeare.

 

 

Freq.  BJon Marl Donn Cymb Mids  Phoe  Sonn Tayl Total

 

0       8   10   17    7    1   14     7     9     73

1       2    8    5    4    4    5     8     7     43

2       1    8    6    3    0    5     1     5     29

3-4     6   16    5    5    3    9     5     8     57

5-9     9   22   12   13    9    8    16    11    100

10-19   9   20   17   17    6   18    14    10    111

20-29  12   13   14    9    9   13    12    21    103

30-39  12    9    6   12    4    7    13    16     79

40-59  13   14   12   17    5   13    12    18    104

60-79  10    9    3    4    9    8    13     8     64

80-99  13    5   10    4    3    5     8     5     53

+100  148  138  145  120  103  111   155   140   1060

                                                    

Total 243  272  252  215  156  216   264   258   1876

 

Figure 4 describes the positions of poems and frequency ranges in a principal axes plane of a CA of table 2. The controversial poem Taylor is near the origin, therefore near the average profile. Shakespeare's poems are lined up along a line that connects The Phoenix with Midsummer. The Phoenix, which was rejected as non-Shakespearean by other tests based on the Poisson model, is indeed close to the periphery, between Marlowe and Donne, in a zone that is abundant in new words (i.e. words whose frequency is "0" in Shakespeare) and in frequencies "2" and "10-19". Conversely, the Midsummer extract is abnormally devoid of new (exclusive) words.

 


 

Figure 4.   Partial bootstrap confidence ellipses for the eight poems in CA principal plane.

 

The poems definitely attributed to Shakespeare are located along the principal diagonal. The controversial Taylor poem is in the centre of the display. It is not possible to reject the assumption of a Shakespearian authorship.

 

Each of these bootstrap confidence ellipses is drawn from a PCA of the 30 bootstrap replications of the corresponding poem. In this case of partial bootstrap, each replication of a column is projected onto the plane as a "supplementary column". For the sake of clarity, the replications are not visible here, but they will be represented below in figures 5 and 6.

 

A replicated table is obtained through 1876 drawings with replacement from a fictitious urn itself containing 1876 balls of 96 (8 x 12) different colours, corresponding to the 96 cells of table 2.
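A sketch of this resampling of the table itself (assuming NumPy): the cell counts define the composition of the urn, and one replicate is a multinomial redraw of the grand total over the cells.

    import numpy as np

    def bootstrap_table(K_table, seed=None):
        """One bootstrap replicate of a contingency table: redraw its grand
        total with replacement among the cells, each cell playing the role of
        one colour of the fictitious urn."""
        rng = np.random.default_rng(seed)
        total = int(K_table.sum())
        probs = (K_table / K_table.sum()).ravel()
        return rng.multinomial(total, probs).reshape(K_table.shape)

    # Each column of the replicated table can then be projected as a
    # supplementary column onto the principal plane of the CA of the
    # original table (partial bootstrap).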

 

4.5   Example of MCA validation

 

4.5.1 Data

 

The data set is the British section of a multinational survey conducted in seven countries in the late 1980s (Hayashi et al., 1992); it is the example entitled "TDA2" distributed with the software DTM. The data set consists of the responses of n = 1043 individuals and comprises objective characteristics of the respondents or their households (age, status, gender, facilities). Other questions relate to attitudes or opinions.

4.5.2 Active questions, global analysis

In this example we focus on a set of Q = 6 attitudinal questions, with a total of p = 24 response categories, which constitutes the set of active variables. There are n = 1043 respondents (rows). The first two eigenvalues from the MCA of the binary disjunctive table are 0.342 and 0.260, and account respectively for 12.1 % and 9.2 % of the total inertia. It is widely recognised that these values give a pessimistic idea of the quality of the description provided by the MCA.

4.5.3  Partial bootstrap in MCA for active variables

The most positive and optimistic responses are grouped in the upper left side of Figure 5 (answers “much better”  to three questions about “standard of living”, answers “people happier”, “peace of mind increases”, “more freedom”). The most negative and pessimistic responses occupy the upper right side of the display, in which the three “very much worse” items relating to the three first questions form a cluster markedly  separated from the remaining categories. All neutral responses (“the same”, “no change”) are located in the lower part of the display, together with some moderate responses such as “little better”.

 

 


 

Figure 5.    Partial bootstrap confidence ellipses for 8 active categories-points in MCA principal plane

5 categories concern question 2 (Change in your personal standard of living last years -  identifiers beginning with “SL.pers”) and 3 categories concern question 4  (Will people be happier in years to come? -  identifiers beginning with “People”)

 

While the first axis can be considered as defining (from right to left) a scale of "optimism, satisfaction, happiness", the second axis appears to oppose moderate responses (lower side) to both extremely positive and extremely negative responses. Such a pattern of responses suggests what is known in the literature as the "Guttman effect".

 

Figure 5 is an example of partial bootstrap applied to the previous MCA. To obtain a legible display, we have selected a subset of all the possible confidence areas. Over this representation are drawn confidence ellipses corresponding to the categories of the active variables “Change in your personal standard of living last years” (categories:  much better, little better, the same, little worse, much worse) and “Will people be happier in years to come” (categories: happier, less happy, the same). The smaller the confidence area, the more precise is the location of the corresponding category. The sizes of the ellipses (as measured by the radius of a circle having the same area) are approximately proportional to the inverse square root of the frequencies.

 

Only K = 30 replications of the original sample have been used. In practice, 10 replications would suffice to produce ellipses having similar shapes and sizes. Each replication involves n = 1043 drawings with replacement of the n respondents and leads to a new positioning of the categories. Most of these replications are visible as dots within (and sometimes around) the confidence zones of each category on the display.

 


 

Figure 6.  Total bootstrap confidence ellipses for 8 active categories-points in MCA principal plane (same categories as Figure 5. Note that the scale has changed, owing to the larger dispersion of the replicates)

 

The response categories belonging to a specific question appear to be significantly distinct. Although they are not all displayed here, the confidence areas of the homologous categories of the first three questions have similar sizes: most sets of identical items relating to distinct questions overlap. For example, the three categories "very much worse" relating to the first three questions are characterised by largely overlapping ellipses: their locations are not significantly distinct.

 

4.5.4  Total bootstrap in MCA

 

The total bootstrap can be considered as a conservative validation procedure. In this context, each replication leads to a separate MCA, but the absence of a common subspace of reference may induce a pessimistic view of the variances of the coordinates of the replicates on the principal axes. To remedy the arbitrariness of the signs of the axes, the orientations of the replicated axes can be changed a posteriori (if necessary) to maintain a positive correlation with the axes of the original MCA having the same rank.
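A sketch of this sign correction (assuming NumPy, with category coordinates stored column-wise per axis): each replicated axis is flipped whenever its coordinates correlate negatively with the homologous axis of the original analysis.

    import numpy as np

    def align_signs(coords_orig, coords_rep):
        """Flip the sign of each replicated axis whenever its coordinates are
        negatively correlated with the homologous axis of the original MCA."""
        aligned = coords_rep.copy()
        for s in range(coords_orig.shape[1]):
            r = np.corrcoef(coords_orig[:, s], coords_rep[:, s])[0, 1]
            if r < 0:
                aligned[:, s] = -aligned[:, s]
        return aligned

    # coords_orig and coords_rep: (p, 2) arrays of category coordinates on the
    # first two axes of the original MCA and of one replicated MCA.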

No permutations of the first two axes have been observed for this example. The occurrence of a rotation would have been considered a sign of instability of the axes, and no Procrustean correction has been applied. Figure 6 shows the first principal plane that superimposes, after the possible changes of sign mentioned above, the principal planes of the 30 MCAs performed on the replicated data. Evidently, the ellipses are markedly larger than those of the partial bootstrap, and the scale of the display had to be modified accordingly. Note that the stability of the underlying structure is nevertheless clearly established, even under the overly strict trial of the total bootstrap.

 

4.6 Conclusion of section 4

 

The bootstrap assumes that the observed sample can serve as an approximation of the population: it takes into account the multivariate nature of the observations and involves all the axes simultaneously. We can then compare and analyze the proximities between pairs of categories, without reference to a specific axis.

 

In both the CA and MCA cases, validation procedures are difficult to carry out within a classical statistical framework. Bootstrap techniques are flexible enough to adapt to such complex situations, and therefore to extend the use of inferential tools to them.

 

 

End of Chapter 7:  Validation