English translation of the statistical appendix of the book "La Sémiométrie" (L. Lebart, M. Piron, J.-F. Steiner, Dunod, Paris, 2003), including some translated developments from the book (in French): "Statistique Exploratoire Multidimensionnelle", L. Lebart, M. Piron, A. Morineau, Dunod, 2006.

 

Some elements of Multivariate Descriptive Statistical Analysis



The exploratory statistical methods provided in the software DtmVic aim at describing large data sets, extracting their underlying structures and validating those structures. They belong to exploratory multivariate statistics, data analysis, or data mining; those three expressions are roughly equivalent. We sometimes use the expression structural statistics to highlight the emphasis on the phase of validation of the structures. These methods generalize classical descriptive statistics, using fairly intuitive mathematical tools that are nevertheless more complex than the means, variances and empirical correlation coefficients of elementary descriptive statistics.


This elementary text is inspired by the statistical appendix of the book "La Sémiométrie" (L. Lebart, M. Piron, J.-F. Steiner, Dunod, Paris, 2003): principal component analysis is the basic principal axes technique for the applications in Semiometry. It is supplemented by more recent work on validation methods (in particular bootstrap techniques), on Kohonen maps, and on less frequently used techniques of analysis such as logarithmic analysis.

 

A1.1 A reminder of the principles of exploratory multivariate methods

 
The exploratory multivariate methods cover a large number of techniques that aim to describe and synthesize the information contained in large data tables.


A1.1.1 Geometric representations and scattering of points

 

Initially, the data are in the form of large rectangular tables, denoted X. The rows (i = 1, ..., n) of the table represent the n individuals, for example the subjects or respondents surveyed, and the columns (j = 1, ..., m) represent the m variables, which can be measurements or scores observed on those individuals.

To understand the principle of exploratory multivariate statistical methods, it is useful to represent geometrically the set of n individuals (n lines) and the set of m variables (m columns) as two clouds of points, each set being described by the other one. We then define, for the two clouds, the distances between line points and between column points that reflect the statistical associations between individuals (rows) and between variables (columns).

 

In the case of Semiometry, a word (variable) is a point whose coordinates are the scores given by the n individuals (respondents): the cloud of m words is embedded within an n-dimensional space. Similarly, an individual is a point whose coordinates are the scores for the m words, the cloud of n individuals is embedded within an m-dimensional space.

 

Reminder of the wording of the question: On the following pages you will find a list of words. We would like you to mark them on a scale of 1 to 7 according to how they make you feel, from very negative (1) to very positive (7). The scale allows you to express the extent to which each word makes you feel negative or positive.

Table A1.1: Example of a table X of scores (1 to 7) given to m = 7 words by n = 12 respondents

 

respondents    tree   gift   danger   morality   storm   politeness   sensual
R01               7      4        2          2       3            1         6
R02               6      3        1          2       4            1         7
R03               4      5        3          4       3            4         3
R04               5      5        1          7       2            7         1
R05               4      5        2          7       1            6         2
R06               5      7        1          5       2            6         5
R07               4      2        1          3       5            3         6
R08               4      1        5          4       5            4         7
R09               6      6        2          4       7            5         5
R10               6      6        3          5       3            6         6
R11               7      7        6          7       7            6         7
R12               2      2        1          2       1            3         2


Figures A1.1 and A1.2 illustrate, using Table A1.1 (the scores given to 7 words by 12 respondents), the representation of these two intrinsically linked clouds of points.

 

The cloud of word points is built within the space of the respondents, here reduced to only two individuals, R04 and R08, since two dimensions make it possible to draw the graph in a plane (see Figure A1.1).

 

 

respondents    tree   gift   danger   morality   storm   politeness   sensual
R04               5      5        1          7       2            7         1
R08               4      1        5          4       5            4         7

 

 


 

Figure A1.1: Representation of the cloud of words
in the space of two respondents "R04" and "R08"

 


Similarly, the cloud of the 12 respondents is built in the space of the variables, here reduced to two words, Morals and Sensual, that is to say within a two-dimensional space (see Figure A1.2 and the corresponding columns of Table A1.1).

 

 


Figure A1.2: Representation of clouds of respondents in
the space of words "Sensual" and "Morals"

For each cloud, the mean point, called the centre of gravity, is shown: G is the centre of gravity of the cloud of words scored by the two respondents (see Figure A1.1), and G' that of the cloud of respondents who rated the two words Sensual and Morality (Figure A1.2).

 

A1.1.2 Principle and methods of analysis

 

While it is always possible to calculate distances between the rows and between the columns of a table X, it is not possible to visualize them directly (the associated geometric representations usually involve spaces of more than two or three dimensions): transformations and approximations are then necessary to obtain a flat (planar) representation.

 

The tables of distances associated with these geometric representations (simple in principle, but complex because of the large number of dimensions of the spaces concerned) can be described by two main families of methods: factorial methods (or principal axes methods) and clustering. The first consists in finding the main directions along which the points deviate most from the mean point. The second consists in searching for groups or clusters of individuals that are as homogeneous as possible (Figure A1.3).

 

 


 

           Factorial method

(search for principal directions)

        Clustering method

(search for homogeneous groups)

 

Figure A1.3: Two major families of methods

 

These methods often involve, in the same way, individuals (rows) and variables (columns).

The confrontation of spaces of individuals and spaces of variables enriches the interpretations.

 

 

 

 

A1.2  Factorial methods: technical aspects

 

In the French statistical literature of the last thirty years, the term factorial methods covers all the techniques of representation using principal axes: principal component analysis, simple and multiple correspondence analysis, and factor analysis into common and specific factors. All these techniques can simultaneously handle large amounts of data and their system of correlations. Through a kind of data compression, they bring out the internal structure of the data, especially in the form of planar graphical displays.

 

- Search for factorial (or: principal) subspaces

 

The goal is to find subspaces of smaller dimension (between three and ten, for example) that best fit the cloud of individual points and the cloud of variable points, so that the proximities measured in these subspaces reflect as far as possible the actual proximities. This provides a representation space, the factorial space, defined by the principal axes of inertia. It is possible to represent the points of the cloud in this system of axes (see Figure A1.4). These axes achieve the best fit of all the points according to the classical least squares criterion, which involves minimizing the sum of the squared distances between the points and the axes.

 


 

Figure A1.4: Fit of the cloud of individual points in the word space

 

The first of these axes corresponds to the direction of maximum elongation of the cloud of points; the second axis maximizes the same criterion while being constrained to be orthogonal to the first, and so on for the following axes, which are in the end all mutually orthogonal. This orthogonality implies the absence of correlation between pairs of factors.

 

X is the data table after the preliminary transformations (centred and reduced variables, for example), and X' is its transpose.

If u1 denotes the unit vector characterizing the first axis, then u1 is the eigenvector of the matrix X'X corresponding to the largest eigenvalue λ1 [SEM 2006].

More generally, the subspace of dimension q that best fits the cloud (in the least squares sense) is generated by the first q eigenvectors of the matrix X'X, corresponding to the q largest eigenvalues.
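As a rough illustration of this search for principal axes, here is a minimal numpy sketch (not DtmVic code, and with arbitrary simulated data) that extracts the first q eigenvectors of X'X from a centred table X and checks that they are orthonormal.

import numpy as np

def principal_axes(X, q):
    """Return the q largest eigenvalues of X'X and the corresponding unit eigenvectors."""
    eigenvalues, eigenvectors = np.linalg.eigh(X.T @ X)   # symmetric matrix
    order = np.argsort(eigenvalues)[::-1]                 # sort in decreasing order
    return eigenvalues[order][:q], eigenvectors[:, order][:, :q]

# Small illustration with random data (n = 100 individuals, m = 7 variables)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 7))
X = X - X.mean(axis=0)                    # centre the columns

values, U = principal_axes(X, q=3)
print(values)                             # decreasing inertias of the first three axes
print(np.round(U.T @ U, 6))               # identity matrix: the axes are orthonormal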

 

The adjustment procedure is exactly the same for both clouds. We can then find simple relationships linking the axes calculated in both spaces, one for the individuals and one for the variables (transition relationships).

 

The vector of the coordinates of the points on each axis, called a factor, is a linear combination of the initial variables. We denote by ψα and φα the factors of rank α relating, respectively, to the cloud of the n individuals (embedded in Rm, the space whose coordinates are the m words) and to the cloud of the m words (embedded in Rn, the space whose coordinates are the n individuals).

 

The two clouds of points, that of the words and that of the respondents, are intrinsically linked and, in fact show two facets of the same structures: in one case, the factors describe the correlations between words; in the other, the associations between respondents.

 

Each of the factorial planes of visualization corresponds to a pair of factors.

 

The elements (words or individuals) involved in the computation of the axes are the active elements. We also introduce, in the analysis, some supplementary elements (or illustrative elements) that do not participate in the computation of the axes but are projected subsequently onto the factorial planes. These elements can be of utmost importance in the interpretation of the planes (see Section A1.2.4).

 

- Basic techniques and derived methods

 

The nature of the information, its coding in the data table and the specific application domain introduce variations in the factorial methods.

Those used here are in fact derived from two basic techniques, principal component analysis and correspondence analysis.

 

Principal component analysis applies to a table of numerical measurements; within the framework of Semiometry, it is used to process the table of scores.

 

The examples of textual data analysis (examples 4, 5 and 6 presented in Tutorial A) are based on correspondence analysis applied to lexical tables (contingency tables cross-tabulating words and texts).

 

A1.3 Principal Component Analysis: technical aspects


Principal Component Analysis (Hotelling, 1933) applies to variables with numerical values (measurements, rates, scores, etc.), represented as a rectangular table of measures R whose general term is rij, whose columns are the variables, and whose rows represent the individuals on whom these variables are measured. In Semiometry, the variables are the words, the rows are the respondents, and the numerical entries of the table are the scores.


A1.3.1 Geometric Interpretations


Geometric representations of the rows and the columns of the data table allow one to visualize the proximities between individuals and between variables, respectively (see Figures A1.1 and A1.2 above).

 

In Rm, two individual points are very similar if, overall, their m coordinates are very close. The two respondents concerned are then characterized by almost equal values for each variable. The distance used is the usual Euclidean distance.

 

In  Rn, if the values taken on by two particular variables are very close for all respondents, these variables will be represented by two very close points in this space. This may mean that these variables measure the same thing or they are bound by a particular relationship.

But the units of measurement of the variables can be very different, which makes it necessary to transform the data table beforehand.


A1.3.2 Problems of scale of measurement and data transformation


We want the distance between two individuals to be independent of the units of the variables, so that each variable plays a similar role. For this, we give every variable j the same dispersion by dividing each of its values by its standard deviation sj, with:

s_j = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (r_{ij} - \bar{r}_j)^2 }


Furthermore, we are interested in how individuals differ from the mean. We therefore place the origin at the centre of gravity of the cloud of individuals. The coordinates of this mean point are the average values of the variables, noted:

\bar{r}_j = \frac{1}{n} \sum_{i=1}^{n} r_{ij}

Taking this point as the origin amounts to subtracting from each variable its mean \bar{r}_j.

 

In this way, we correct the scales by transforming the data table R into a new table X whose general term is:

x_{ij} = \frac{r_{ij} - \bar{r}_j}{s_j \sqrt{n}}

The variables thus centred and reduced become comparable: each has a null mean, and the centred-reduced values (r_{ij} - \bar{r}_j)/s_j all have the same variance, equal to 1 (the extra factor 1/\sqrt{n} simply ensures that X'X is exactly the correlation matrix used below). Other preliminary transformations are possible (see Section 2.5 of Chapter 2).

 

A1.3.3 Analysis of the cloud of n respondents


The data transformation leads to a translation of the origin to the centre of gravity of the cloud and, in the case of the analysis we call normalized, to a change of scale on the different axes.

To carry out the analysis of the cloud of respondent points in Rm, the matrix X'X to be diagonalized in this space is the correlation matrix, whose general term is:

c_{jj'} = \sum_{i=1}^{n} x_{ij}\, x_{ij'} = \frac{1}{n} \sum_{i=1}^{n} \frac{(r_{ij} - \bar{r}_j)(r_{ij'} - \bar{r}_{j'})}{s_j\, s_{j'}}

cjj' is the correlation coefficient between the variables j and j'.

The coordinates of the n individual points on the factorial axis α are the n components of the vector ψα = X uα.
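As an illustration, the following numpy sketch (a simplified stand-in for what DtmVic computes, not the DtmVic implementation itself) applies this coding and diagonalization to the data of Table A1.1: it builds X with the 1/sqrt(n) factor, recovers the correlation matrix as X'X, and computes the coordinates Xu of the 12 respondents.

import numpy as np

# Table A1.1: 12 respondents x 7 words (tree, gift, danger, morality, storm, politeness, sensual)
R = np.array([
    [7, 4, 2, 2, 3, 1, 6],   # R01
    [6, 3, 1, 2, 4, 1, 7],   # R02
    [4, 5, 3, 4, 3, 4, 3],   # R03
    [5, 5, 1, 7, 2, 7, 1],   # R04
    [4, 5, 2, 7, 1, 6, 2],   # R05
    [5, 7, 1, 5, 2, 6, 5],   # R06
    [4, 2, 1, 3, 5, 3, 6],   # R07
    [4, 1, 5, 4, 5, 4, 7],   # R08
    [6, 6, 2, 4, 7, 5, 5],   # R09
    [6, 6, 3, 5, 3, 6, 6],   # R10
    [7, 7, 6, 7, 7, 6, 7],   # R11
    [2, 2, 1, 2, 1, 3, 2],   # R12
], dtype=float)

n, m = R.shape
# Centring and reduction, with the extra 1/sqrt(n) factor so that X'X is the correlation matrix
X = (R - R.mean(axis=0)) / (R.std(axis=0) * np.sqrt(n))

C = X.T @ X                                 # correlation matrix of the 7 words
eigenvalues, U = np.linalg.eigh(C)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, U = eigenvalues[order], U[:, order]

psi = X @ U                                 # coordinates of the 12 respondents on the axes
print(np.round(C, 2))                       # compare with the correlation matrix printed below
print(np.round(psi[:, 1:3], 2))             # coordinates in the plane (2, 3)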

 

Figure A1.5 below illustrates the representation of the cloud of respondents, in the principal plane (2, 3), for the table of 12 respondents having rated 7 words already presented in Section A1.1. [The plane (2, 3) has been considered the primary plane of Semiometry, given the special nature of the first axis, an axis responsible for a "size effect"; cf. "La Sémiométrie", op. cit., Chapter 3.]

Respondents R01 and R02 have scored the words in the same, strongly contrasted way, giving high marks to Tree and Sensual and low marks to Morals and Politeness; they are therefore close to each other in the plane and opposed to respondents R04 and R05, who expressed themselves in the opposite way on these words. Respondent R08 is distinguished by having rated Danger highly without giving a high score to the other words, while R11 rated all the words highly.

 

A1.3.4 Analysis of variables cloud (words)

 

The factorial coordinates of the variable-points on the axis α are the components of the vector:

\varphi_\alpha = \sqrt{\lambda_\alpha}\, u_\alpha

and we have:

\varphi_\alpha = \frac{1}{\sqrt{\lambda_\alpha}}\, X' \psi_\alpha

The coordinate φαj of a variable-point j on the axis α is none other than the correlation coefficient of this variable with the factor ψα (a linear combination of the initial variables), considered as an artificial variable whose coordinates are made up of the n projections of the individuals on this axis.


Since the factorial axes are mutually orthogonal, we obtain a series of artificial variables, uncorrelated with each other, called principal components, which summarize the correlations of all the initial variables. [Note that principal component analysis captures only linear relationships between variables: a low correlation coefficient between two variables only means that they are not linearly related, whereas a nonlinear relationship may still exist.]
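The property stated above can be checked numerically. The short, self-contained numpy sketch below (using arbitrary random scores in place of a real table) verifies that the coordinate of a variable-point on an axis coincides with its correlation coefficient with the corresponding factor, and that each variable-point lies at distance 1 from the origin (a property discussed further below).

import numpy as np

rng = np.random.default_rng(1)
R = rng.integers(1, 8, size=(12, 7)).astype(float)       # arbitrary scores 1..7
n, m = R.shape
X = (R - R.mean(axis=0)) / (R.std(axis=0) * np.sqrt(n))  # normalized PCA coding

lam, U = np.linalg.eigh(X.T @ X)
order = np.argsort(lam)[::-1]
lam, U = np.clip(lam[order], 0, None), U[:, order]        # clip tiny negative round-off

psi = X @ U                                  # factors: coordinates of the individuals
phi = U * np.sqrt(lam)                       # coordinates of the variable-points

a = 1                                        # axis 2 (0-based index)
corr = np.array([np.corrcoef(X[:, j], psi[:, a])[0, 1] for j in range(m)])
print(np.allclose(phi[:, a], corr))          # True: coordinate = correlation with the factor
print(np.round((phi ** 2).sum(axis=1), 6))   # all equal to 1: points on the unit sphere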

 

In Figure A1.5-b, as in the corresponding correlation matrix, Politeness and Morals are highly correlated and, to a lesser extent, Storm and Sensual. We can relate this to the behaviour of the respondents: R01 and R02 lie in the direction of high raters of the two words Tree and Sensual and low raters of Morals and Politeness, in contrast to respondents R04 and R05.

Variables strongly correlated with an axis contribute to the definition of that axis [this small example is obviously not representative enough for the plane to be truly interpretable; it is only intended to relate the data table to the results]. This correlation can be read directly on the chart, since it is the coordinate of the point j on the axis.

 
Table of scores (1-7) given to 7 words by 12 respondents (Reminder)

 

 

               tree   gift   danger   morals   storm   politeness   sensual
R01               7      4        2        2       3            1         6
R02               6      3        1        2       4            1         7
R03               4      5        3        4       3            4         3
R04               5      5        1        7       2            7         1
R05               4      5        2        7       1            6         2
R06               5      7        1        5       2            6         5
R07               4      2        1        3       5            3         6
R08               4      1        5        4       5            4         7
R09               6      6        2        4       7            5         5
R10               6      6        3        5       3            6         6
R11               7      7        6        7       7            6         7
R12               2      2        1        2       1            3         2

 

Correlation matrix

 

         tree   gift danger morals storm  polite  sensual

tree    1.00

gift     .55   1.00

dang     .29    .14   1.00

mora     .16    .62    .36   1.00

stor     .51    .09    .54   -.01   1.00

poli     .00    .63    .23    .91   -.05   1.00

sens     .56   -.08    .45   -.30    .68   -.37   1.00


 

 

 


 

 

A1.5-a: Representation of respondents

             in the plane (2,3)

 

 

A1.5-b: Representation of

            words in the plane (2,3)

 

Figure A1.5 : Principal Component Analysis in the table of scores of 7 words by 12 respondents

 

We should focus especially on the variables with the highest coordinates; the principal components can then be interpreted from the grouping of some of these variables and from their opposition to others.


Note that, in the original space, all the variable-points lie on a sphere of radius 1, centred at the origin of the axes. [The analysis of the cloud of variable-points in Rn is not performed with respect to the centre of gravity of the cloud (unlike that of the individual points), but with respect to the origin. The distance of a variable j to the origin O is given by:

d^2(j, O) = \sum_{i=1}^{n} x_{ij}^2 = 1 ]

The fitting planes cut this sphere along great circles (of radius 1), the correlation circles, within which the variable-points are positioned. In DtmVic plots, these circles of radius 1 are not drawn in the factorial planes representing the words, for better readability of the labels (boxing the factorial planes would have led to a significant reduction of scale).

 

 

A1.4 Correspondence Analysis

 

Correspondence analysis (CA) was studied systematically as a flexible technique for the exploratory analysis of multidimensional data by J.-P. Benzécri (1973). It has several precursors, in particular L. Guttman (1941), C. Burt (1950) and C. Hayashi (1956), whose contributions remained scattered and mutually independent. CA applies primarily to a contingency table K (cross-tabulation), with n rows and p columns, which describes the distribution of a population according to two qualitative (or categorical) variables with n and p categories. The rows and the columns thus play similar roles.

 

- Notation

Let k be the sum of all the elements kij of the contingency table K.

We note f_{ij} = k_{ij}/k the relative frequencies, with \sum_{i}\sum_{j} f_{ij} = 1.

We note f_{i.} = \sum_{j} f_{ij} and f_{.j} = \sum_{i} f_{ij} the marginal relative frequencies.

The contingency table K is transformed into both a table of row profiles, f_{ij}/f_{i.}, and a table of column profiles, f_{ij}/f_{.j}.

Point i of Rp has the coordinates f_{ij}/f_{i.}, for every j ≤ p.

Likewise, point j of Rn has the coordinates f_{ij}/f_{.j}, for every i ≤ n.

We can observe a significant difference between correspondence analysis and principal component analysis: the transformations on the table in both spaces are identical (because the row set and the column set play similar roles).

 

- Chi-square distance (χ²) and distributional equivalence

 

The distances between two line points i and i', on the one hand, and between two column points j and j', on the other, are given by the following equations:

d^2(i, i') = \sum_{j} \frac{1}{f_{.j}} \left( \frac{f_{ij}}{f_{i.}} - \frac{f_{i'j}}{f_{i'.}} \right)^2
\qquad
d^2(j, j') = \sum_{i} \frac{1}{f_{i.}} \left( \frac{f_{ij}}{f_{.j}} - \frac{f_{ij'}}{f_{.j'}} \right)^2

 

The χ² distance offers the advantage of satisfying the principle of distributional equivalence, which ensures the robustness of the results of correspondence analysis with respect to an arbitrary division into categories of the nominal variables. It is expressed as follows: if two rows (resp. columns) of the contingency table have the same profile (i.e. are proportional), then their aggregation does not affect the distances between the columns (resp. rows). This aggregation simply produces a new row point (resp. column point) with the same profile, to which is assigned the sum of the frequencies of the two row points (resp. column points).


This property is important because it guarantees a certain invariance of the results vis-à-vis the nomenclature chosen for the construction of the categories of a qualitative variable.
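Both the χ² distance and the distributional equivalence property can be checked numerically. The hedged numpy sketch below uses a small, entirely artificial contingency table in which rows 0 and 2 are proportional; aggregating them leaves the distances between the column profiles unchanged.

import numpy as np

# Rows 0 and 2 are proportional (row 2 = 2 * row 0), hence have the same profile
K = np.array([[20, 10,  5, 15],
              [ 8, 12,  6,  4],
              [40, 20, 10, 30]], dtype=float)

def col_chi2_distances(K):
    """Matrix of squared chi-square distances between the column profiles of K."""
    F = K / K.sum()                   # relative frequencies f_ij
    fi = F.sum(axis=1)                # row margins f_i.
    fj = F.sum(axis=0)                # column margins f_.j
    P = F / fj                        # column profiles f_ij / f_.j (each column sums to 1)
    m = K.shape[1]
    D = np.zeros((m, m))
    for j in range(m):
        for jp in range(m):
            D[j, jp] = np.sum((P[:, j] - P[:, jp]) ** 2 / fi)
    return D

D_before = col_chi2_distances(K)
K_merged = np.vstack([K[0] + K[2], K[1]])     # aggregate the two proportional rows
D_after = col_chi2_distances(K_merged)
print(np.allclose(D_before, D_after))          # True: distributional equivalence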

 

 


A1.5  Logarithmic analysis

 

Logarithmic analysis, proposed by Kazmierczak (1985), achieves the distributional equivalence property of correspondence analysis on tables that are not necessarily contingency tables. Kazmierczak uses and generalizes Yule's principle, which states that one does not change the distances between the rows or between the columns of a table by replacing its rows and columns with any other rows and columns proportional to them (this is in fact a generalization of the principle of distributional equivalence).

 

Logarithmic analysis involves taking the logarithm of the data (after adding a constant if some values are negative or zero). Then, after centring both the rows and the columns, the table is submitted to a non-normalized principal component analysis, which coincides here with a singular value decomposition [SEM 2002].

 

Note that if R is an (n, m) data table and if A and B are two diagonal matrices of dimensions (n, n) and (m, m) respectively, with positive diagonal elements, the matrix ARB gives rise to the same logarithmic analysis as the matrix R. This invariance property has had the effect of suppressing the first semiometric axis (size effect) without altering the following axes.
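As a small check of both the procedure and the invariance property just stated, here is a hedged numpy sketch (the table and the diagonal rescalings are arbitrary illustrative choices, not semiometric data): logarithms, double centring of rows and columns, then a plain singular value decomposition.

import numpy as np

rng = np.random.default_rng(0)
R = rng.integers(1, 8, size=(12, 7)).astype(float)     # strictly positive scores

def log_analysis_table(R):
    """Log transform followed by centring of the rows and then of the columns."""
    L = np.log(R)                                      # add a constant first if zeros occur
    L = L - L.mean(axis=1, keepdims=True)              # centre the rows
    L = L - L.mean(axis=0, keepdims=True)              # centre the columns
    return L

L = log_analysis_table(R)
U, s, Vt = np.linalg.svd(L, full_matrices=False)       # non-normalized PCA = SVD here
print(np.round(s ** 2, 3))                             # inertias of the successive axes

# Yule invariance: rescaling rows and columns (A R B) leaves the analysis unchanged
A = np.diag(rng.uniform(0.5, 2.0, size=12))
B = np.diag(rng.uniform(0.5, 2.0, size=7))
print(np.allclose(L, log_analysis_table(A @ R @ B)))   # True: same doubly centred table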

 


A1.6  Factor analysis (into common and specific factors)

 
Factor analysis (or analysis into common and specific factors) is probably the oldest model of latent variables.
In econometrics, we usually distinguish between functional models, or fixed-effect models (such as multiple regression and the linear model as a whole), and structural models or random effect models (models of latent variables).

The original principles of the method date back to Spearman (1904) for the univariate analysis, and to Garnett (1919) and Thurstone (1947) for the multivariate analysis. These models were mainly developed by psychologists and psychometricians, and the developments to which they give rise are complex and diverse. On this point, one can consult the classic books of Harman (1967) and Mulaik (1972). We should also mention the work of Anderson and Rubin (1956) and Lawley and Maxwell (1963), who placed factor analysis in a classical inferential framework.


- The model of factor analysis

 

This model proposes to reconstruct, from a small number q of factors, the correlations between the m observed variables. We assume the existence of an a priori model:

x_i = G f_i + e_i

In this expression, x_i represents the i-th observed vector of the m variables; G is an (m, q) table of unknown coefficients (with q < m); f_i is the i-th value of the unobservable random vector of the q common factors; and e_i is the i-th value of the unobservable vector of residuals, the latter representing the combined effect of the specific factors and of a random disturbance.

We denote by X the (n, m) table whose i-th row is x_i'. Likewise, F denotes the unobservable (n, q) table whose i-th row is f_i', and E the unobservable (n, m) table whose i-th row is e_i'. The model linking all the observations to the hypothetical factors is written:

X = F G' + E

In this expression, only X is observable. As such, the model is indeterminate.

 

The identification of this model and the estimation of parameters raise complex problems. A series of additional a priori assumptions allows that identification.
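To make the notation concrete, the following numpy simulation (with arbitrary loadings G, q = 2 common factors and m = 6 variables, chosen purely for illustration) generates data according to X = FG' + E and compares the empirical covariance matrix with the covariance implied by the model, GG' plus the diagonal matrix of specific variances.

import numpy as np

rng = np.random.default_rng(0)
n, m, q = 2000, 6, 2

G = np.array([[0.9, 0.0],      # loadings of the m variables on the q common factors
              [0.8, 0.1],
              [0.7, 0.2],
              [0.1, 0.8],
              [0.0, 0.9],
              [0.2, 0.7]])
F = rng.normal(size=(n, q))                        # unobservable common factors
E = rng.normal(scale=0.5, size=(n, m))             # specific factors + random disturbance
X = F @ G.T + E                                    # the observed (n, m) table

implied = G @ G.T + 0.25 * np.eye(m)               # covariance implied by the model
empirical = np.cov(X, rowvar=False)
print(np.round(implied, 2))
print(np.round(empirical, 2))                      # close to the implied covariance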

 

A1.7 Methods of hierarchical clustering


Automatic clustering techniques are designed to produce clusters of objects or individuals described by a number of variables or characters. Clustering is a branch of data analysis and a fundamental step in many scientific disciplines.
It has given rise to numerous and diverse publications, including Sokal and Sneath (1963) and Benzécri (1973). The circumstances of use are substantially the same as for the descriptive factorial methods presented in the previous sections. In Chapter 3, the clustering is performed on the 210 words from the coordinates of these words on the principal axes.

There are several families of clustering algorithms: hierarchical algorithms, which provide a hierarchy of partitions of the objects, and algorithms that lead directly to partitions, such as the methods of aggregation around mobile centres (also known as the k-means algorithm). The principles common to the various techniques of ascending hierarchical clustering are simple: at each step of the algorithm, a new partition is obtained by aggregating the two closest elements.

 
- The algorithm of hierarchical clustering

 

The basic algorithm of ascending (bottom up) hierarchical clustering produces a hierarchy starting from the partition in which each element to be classified constitutes a class, leading to the partition consisting of one single class containing all the elements.


For n elements to be classified, it is composed of n steps. At the first stage, there are therefore n elements to classify. We construct the distance matrix between the n elements and we look for the two closest, which are aggregated into a new element.


We construct a new distance matrix resulting from the aggregation, by calculating the distances between the new element and the remaining elements. We are now in the same conditions as in step 1, but with only (n-1) elements to be classified.


We look again for the two closest elements among the new ones, and aggregate them. The process is reiterated until only one class remains, containing all the objects: this is the final partition.

 

Figure A1.6: Dendrogram or hierarchical tree


The algorithm does not provide a single partition of the n objects into q classes, but a hierarchy of partitions, which takes the form of a tree, also called a dendrogram, containing n - 1 nested partitions (see Figure A1.6). The value of these trees is that they can give an idea of the number of classes actually present in the population. Each cut of the dendrogram provides a partition.
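In practice, this aggregation loop is rarely re-implemented by hand. The sketch below uses scipy's hierarchical clustering routines (assuming scipy is available) on an artificial two-group point cloud, then cuts the resulting tree into two classes; it stands in for, and is not identical to, the procedure used in DtmVic.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(loc=0.0, size=(10, 3)),
                    rng.normal(loc=4.0, size=(10, 3))])   # two artificial groups

Z = linkage(points, method='ward')    # the n-1 aggregation steps (the dendrogram)
print(Z.shape)                        # (n-1, 4): merged pair, distance, cluster size
labels = fcluster(Z, t=2, criterion='maxclust')   # cut the tree into 2 classes
print(labels)                         # class label of each of the 20 points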

 


A1.8 The self-organizing (Kohonen) maps


The goal of self-organizing maps is to classify a set of observations while maintaining the initial topology of the space in which they are described. Like the neural networks to which they are related, these maps achieve good performance in pattern recognition. Introduced by Teuvo Kohonen in 1981, they have given rise to many applications such as text analysis, medical diagnosis, industrial process control, and robotics.


- The principle

 

Kohonen maps seek to represent, in a space with two (sometimes three) dimensions, the rows or the columns of a table in accordance with the notion of neighborhood in the space of the elements to be classified. As in PCA, it is useful to picture the data (the words) at the start as a cloud of points in a high-dimensional space (that of the individuals, or respondents).


The principle is to consider the map as a rectangular (sometimes hexagonal) grid. This grid, once unfolded, fits the shape of the cloud of points as well as possible. The nodes of the grid are the neurons of the map. Each point of the original cloud is projected onto the node that is closest to it. Thus each point, first described in a multidimensional space, is represented in the end by the two coordinates giving the position of its node on the map: the space is reduced. The points assigned to a single neuron are close to one another in the original space.


We define a priori a notion of neighborhood between classes. These neighborhoods can be chosen in various ways, but the neighbors of a node are generally taken to be the directly contiguous nodes of the rectangular grid (8 neighbors for a neuron).

Neighboring observations in the space of variables of dimension q belong, after clustering, to the same class or to neighboring classes.


- The algorithm

The learning algorithm for classifying m points is iterative. The initialization consists in associating with each class k (node) a provisional centre Ck (with q components), chosen at random in the q-dimensional space containing the m words to be classified.

 

At each step, a word i is chosen at random and compared to all the provisional centres, and it is assigned to the closest centre Ck(i) (closest in the sense of a distance given a priori). The centre Ck(i) is then moved nearer to the word i, together with the neighboring centres, which is expressed at step t by:

C_k(t+1) = C_k(t) + \varepsilon \,\big( i(t+1) - C_k(t) \big)

where i(t+1) is the word presented at stage t+1 and ε is an adaptation parameter, positive and less than 1. This update only concerns the winning centre Ck(i) and its neighbors.


This algorithm is similar to the k-means algorithm but, in the latter case, there is no notion of neighborhood between classes, and only the position of the winning centre Ck(i) is modified at each step.

Like the k-means algorithm, this algorithm is suitable for applications in which data are abundant and do not need to be held entirely in the computer's main memory.
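The update rule quoted above can be written in a few lines of numpy. The sketch below trains a small 4 x 4 map on simulated points and counts how many points fall on each neuron; the grid size, ε, the neighborhood radius and the number of steps are arbitrary illustrative choices (and ε is kept constant here, although in practice it is usually decreased over time).

import numpy as np

rng = np.random.default_rng(0)
words = rng.normal(size=(200, 5))          # m points to classify, in dimension q = 5

rows, cols = 4, 4                          # 4 x 4 rectangular grid of neurons
centres = rng.normal(size=(rows * cols, 5))
grid = np.array([(r, c) for r in range(rows) for c in range(cols)])

epsilon, radius, n_steps = 0.3, 1, 4000
for t in range(n_steps):
    x = words[rng.integers(len(words))]                      # pick a word at random
    winner = np.argmin(((centres - x) ** 2).sum(axis=1))     # closest provisional centre
    # neighbors: nodes at grid (Chebyshev) distance <= radius, i.e. the winner and its 8 neighbors
    neighbors = np.max(np.abs(grid - grid[winner]), axis=1) <= radius
    centres[neighbors] += epsilon * (x - centres[neighbors]) # C_k(t+1) = C_k(t) + eps (x - C_k(t))

assignments = np.argmin(((words[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2), axis=1)
print(np.bincount(assignments, minlength=rows * cols))       # number of words per neuron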

 

 

A1.9 Validation Tools

 

Test values (Section A1.9.1) are tools of elementary statistical inference, but they are very versatile and useful, especially if the user is aware of the multiple-comparison problems that often arise (Section A1.9.2).

 

The technique of supplementary categories (Section A1.9.3) is a fundamental validation tool for principal axes methods. It allows for an external validation of the results, which is both a test of consistency and an enrichment of interpretation.

 

The two other validation tools used in this work are the confidence intervals of Anderson and the bootstrap re-sampling procedure.

 

A1.9.1 What is a test value?

A test value is a criterion for assessing quickly whether a category of a nominal variable (i.e. a group of respondents) occupies a significant position on an axis. For this, we test the hypothesis that the group of individuals corresponding to a given category of an external categorical variable (the category Female of the variable Gender, for example) can be considered as drawn at random, without replacement, from the sample under consideration.


In the case of a truly random selection, the mean point (or centre of gravity) of the sub-cloud representing the group (i.e. the category) departs only slightly from the centre of gravity of the overall cloud corresponding to the entire sample.

We then convert the coordinate of this category on the axis into a test value which, under this hypothesis of independence, is the realization of a standard normal variable. In other words, if the category is randomly distributed on the axis, the corresponding test value has a 95% chance of lying in the interval [-1.96, +1.96].


We then consider as occupying a significant position the categories whose test values are greater than 1.96 in absolute value, which corresponds approximately to the usual 5% probability threshold.


Often the test values are well above this threshold. They are then used to sort the categories from the most significant to the least significant. The test value systematizes the notion of t-value often used in the literature.

 

Suppose a category j concerns nj individuals. If these nj individuals are drawn at random (this is the null hypothesis H0) among the n individuals analyzed (drawing without replacement), the mean of nj coordinates drawn at random from the finite set of the n values ψαi (coordinates of the respondents i on the axis α) is a random variable:

X_{\alpha j} = \frac{1}{n_j} \sum_{i \in I(j)} \psi_{\alpha i}

with expectation E(Xαj) = 0 and variance:

\mathrm{var}(X_{\alpha j}) = \frac{n - n_j}{n - 1}\; \frac{\lambda_\alpha}{n_j}

This is the classical formula giving the variance of a mean in a drawing without replacement of nj objects among n; it involves the total variance λα, which, in the case of factorial coordinates, is also the eigenvalue corresponding to the axis α. In the formula giving Xαj, I(j) is the subset of respondents characterized by the category j. The coordinate of the category j on the axis is the mean point of its nj individuals and therefore behaves, under H0, like the random variable Xαj. The quantity:

v_{\alpha j} = \frac{X_{\alpha j}}{\sqrt{\dfrac{n - n_j}{n - 1}\; \dfrac{\lambda_\alpha}{n_j}}}

is a measure, in number of standard deviations, of the distance between the category j and the origin on the factorial axis. We call this quantity a "test value". According to the central limit theorem, its distribution tends towards a standard normal distribution.
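The computation of a test value fits in a few lines of numpy. In the sketch below, both the axis coordinates and the category are simulated; since the category really is drawn at random, the resulting test value will usually lie inside [-1.96, +1.96].

import numpy as np

rng = np.random.default_rng(0)
n = 300
psi = rng.normal(size=n)                 # coordinates of the n respondents on axis alpha
psi = psi - psi.mean()                   # the factors are centred
lam = psi.var()                          # total variance of the coordinates on the axis

category = rng.choice(n, size=40, replace=False)   # a category j with n_j = 40 individuals
n_j = len(category)

x_bar = psi[category].mean()                               # mean coordinate of the category
var_h0 = (n - n_j) / (n - 1) * lam / n_j                   # variance under random drawing
test_value = x_bar / np.sqrt(var_h0)
print(round(test_value, 2))              # |t| > 1.96 would indicate a significant position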

 

It should be noted that the test values are meaningful only for the supplementary categories (see next section), or for active variables with low absolute contributions, that is to say, variables behaving in fact as supplementary elements. The coordinates on an axis of the individuals corresponding to an active category cannot be considered as drawn at random, since that category has helped to build the axis.

 

Test values allow us to quickly identify the significant variables for the interpretation of an axis or a factorial plane.

 

A1.9.2 The problem of multiple comparisons


The simultaneous calculation of several test values or several probability thresholds comes up against the pitfall of multiple comparisons, well known to statisticians, cf. O'Neill and Wetherill (1971), Saville (1990), Westfall and Young (1993), Westfall et al. (1999), Hsu (1996).

 

Suppose we project 100 supplementary categories (see Section A1.9.3) that are truly drawn at random. The test values attached to these categories are then realizations of independent standard normal variables.

 
Under these conditions, on average, five out of 100 computed test values lie outside the interval [-1.96, +1.96] and will appear, misleadingly, to be significant. The 5% threshold is meaningful only for a single test, not for multiple tests.


[Test values can above all sort the supplementary categories in order of decreasing interest, which is invaluable for the interpretation of the factors.]

In practice, we solve this problem by choosing a more stringent threshold. The most stringent and most conservative threshold imaginable is the Bonferroni threshold, in which the initial threshold is divided by the number of tests (in the case of 210 tests: 0.05 / 210 = 2.4 x 10^-4). The corresponding one-sided critical test value is 3.49. This value provides a prudent safeguard against over-interpretation. [See, for example, Hochberg (1988), Perneger (1998).]
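The figures quoted above can be checked directly (assuming scipy is available):

from scipy.stats import norm

alpha = 0.05 / 210                    # Bonferroni threshold for 210 tests
print(alpha)                          # about 2.4e-4
print(round(norm.ppf(1 - alpha), 2))  # 3.49: one-sided critical test value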

 

In our example, however, the interdependence of the words does not allow us to apply the multiple-comparison corrections blindly. What can we conclude when several words with similar meanings simultaneously have test values of about 1.96? Taken one by one, they are not significant under the Bonferroni threshold, but they confirm and validate one another when considered simultaneously.

 

A pragmatic solution (in the multidimensional case): the bootstrap.

 

The bootstrap validation technique, which will be discussed later, makes a valuable contribution to the difficult problem of multiple comparisons. The bootstrap involves replications of samples that take into account all the variables simultaneously, and consequently the interdependence of variables.


A1.9.3 Usefulness of supplementary elements


Principal axes analyses produce representation subspaces for the individuals and/or for the variables. They rely on elements (individuals or variables) known as active elements.

 

It is possible to introduce other points, called supplementary points (or supplementary elements), that did not intervene in the construction of the axes but whose positions in the factorial space we may wish to know.

 

We can cite three reasons for making a point a supplementary one:  

- to enrich the interpretation of the axes with variables (different in theme or in nature from the active elements) that did not participate in their construction;

- to adopt a forecasting perspective by projecting supplementary variables onto the space of the individuals, so that they are "explained" by the active variables;

- to bring out the essential structure that could be obscured by the existence of active points of low mass which could distort the cloud.

 

These points are projected after the construction of the factorial axes into this new reference frame. The projection is very simple, using the so-called transition formulae, whether in principal component analysis or in correspondence analysis.

 

These criteria actually define groups of individuals, which are regarded either as categories of categorical variables or as sets of individuals, added as supplementary elements.

The centres of gravity of these groups are positioned in the space of the variables. The test value is used to assess the significance of the locations of these groups on the axes.

 

A1.9.4 Anderson Confidence Intervals

 

Anderson (1963) calculated the limiting distributions of the eigenvalues of a principal component analysis without necessarily assuming that the corresponding theoretical values are distinct.


The amplitude of the intervals gives an indication of the stability of an eigenvalue with respect to sampling fluctuations (assumed to be normal). Overlapping of the intervals of two consecutive eigenvalues therefore suggests the equality of these values; the user can thus avoid interpreting an axis that is unstable according to this criterion.

 

If the eigenvalues λα of the theoretical covariance matrix Σ are distinct, the eigenvalues \hat{\lambda}_\alpha of the empirical covariance matrix S asymptotically follow a normal distribution with expectation λα and variance 2\lambda_\alpha^2/(n-1), where n is the size of the sample.

We deduce the approximate confidence intervals at the 95% level:

\hat{\lambda}_\alpha \exp\!\left(-1.96\sqrt{\frac{2}{n-1}}\right) \;\le\; \lambda_\alpha \;\le\; \hat{\lambda}_\alpha \exp\!\left(+1.96\sqrt{\frac{2}{n-1}}\right)

Anderson confidence intervals in fact concern the eigenvalues of covariance matrices as well as those of correlation matrices. Simulations show that the confidence intervals obtained are generally conservative: the percentage of coverage of the true value is usually higher than the announced confidence level.
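A small helper implementing the interval as written above (a sketch tied to the form of the interval given in this section; the numerical values are illustrative):

import numpy as np

def anderson_interval(lam_hat, n, z=1.96):
    """Approximate 95% Anderson interval for an eigenvalue lam_hat observed on a sample of size n."""
    half_width = z * np.sqrt(2.0 / (n - 1))
    return lam_hat * np.exp(-half_width), lam_hat * np.exp(+half_width)

print(anderson_interval(2.5, n=1000))   # roughly (2.29, 2.73); non-overlap with the interval
                                        # of the next eigenvalue supports a stable axis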

 

In all cases, both the asymptotic nature of the results and the underlying assumption of normality lead us to consider these results as merely indicative. Note that Muirhead (1982) has shown that the existence of the first four moments of the theoretical distribution of the sample is sufficient to validate these intervals.

 

A1.9.5   Bootstrap techniques

 

With the results of principal axes techniques, some questions about the validity of the observed patterns naturally arise: are there criteria to test the stability of a structure and to validate it? What is the effect of dealing only with a sample of individuals? A more difficult question also arises: what are the consequences of the selection of the variables?

 

We can partially answer these questions using empirical validation methods, which perturb the original table by adding or withdrawing elements (individuals or variables) or by modifying them (weights, coding, etc.). The assumption is that if the perturbations of the samples do not affect the patterns observed in the subspaces, these patterns can be considered stable and the highlighted structure may be "significant".

 

Re-sampling methods propose to systematize this approach. These methods are computationally intensive techniques, based on simulations of samples arising from one single sample. Made possible by the increase of computational power, these techniques are substituted in some cases for more traditional procedures that rely on restrictive assumptions. They are the only possible procedures when the analytical complexity of the problem does not allow for classical inference.

The nonparametric bootstrap is well adapted to the problem of the validity of the patterns observed in a principal plane: based on simulations, it allows for calculating areas of confidence for the locations of line points and column points.

 

- Principle of the bootstrap

 

The bootstrap technique, introduced by Efron (1979), consists in simulating s samples (s is generally greater than 30) of the same size n as the original sample. They are obtained by drawing at random, with replacement, from the n individuals observed at the outset, each having the same probability 1/n of being chosen. Some individuals thus appear more than once and therefore receive a higher weight (2, 3, ...), while others are absent (zero weight).

 

This method is used to analyze the variability of simple statistical parameters by generating confidence intervals for these parameters. It can also be applied to many problems for which the variability of a parameter cannot be estimated analytically. This is the case for multidimensional methods, whose assumptions of multi-normality are rarely verified. Principal component analysis is an application domain that has given rise to a great deal of work using bootstrap re-sampling.

Take the example of estimating the correlation coefficient r between two variables, or between a variable and a factor (principal axis). The principle consists in computing the correlation coefficient for each replicated sample (drawn with replacement from the pairs of observations). The frequency distribution of the coefficient is then obtained (histogram of the s values of r corresponding to the s replications), from which the required confidence intervals are calculated. The bounds of the confidence interval can be estimated directly by the quantiles of the simulated distribution.

 

One obtains an estimate of the accuracy of the value of r on the base sample without assuming a normal distribution of data.
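The procedure just described fits in a few lines of numpy; the sketch below (with simulated data and s = 1000 replications) computes a percentile bootstrap confidence interval for a correlation coefficient.

import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 0.6 * x + 0.8 * rng.normal(size=n)        # two correlated variables

s = 1000
r_boot = np.empty(s)
for k in range(s):
    idx = rng.integers(0, n, size=n)          # draw n pairs with replacement
    r_boot[k] = np.corrcoef(x[idx], y[idx])[0, 1]

r_hat = np.corrcoef(x, y)[0, 1]
lo, hi = np.percentile(r_boot, [2.5, 97.5])   # percentile confidence interval
print(round(r_hat, 3), (round(lo, 3), round(hi, 3)))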

 

To estimate the factorial coordinates resulting from a principal component analysis, the principle is the same as for the correlation coefficient: a principal component analysis is performed on each simulated sample, and a frequency distribution is then drawn for each of the components. The bootstrap method gives in most cases a good picture of the statistical accuracy of an estimate made on a sample. Theoretical research, conducted by Efron in particular, shows that for many statistical parameters the confidence interval corresponding to the simulated bootstrap distribution and the one corresponding to the actual distribution generally have the same amplitude.

 

- Implementation and computation of confidence zones

 

There are several procedures for assessing the stability of factorial coordinates through bootstrap techniques. Gifi (1981), Meulman (1982) and Greenacre (1984) conducted early work in the context of simple or multiple correspondence analysis. In the case of principal component analysis, Diaconis and Efron (1983), Holmes (1989), Stauffer et al. (1985) and Daudin et al. (1988) have dealt with the problem of choosing the appropriate number of axes.

 

To take into account the replicates, we must refer to a common factorial space. Several variants are possible.

 

We can consider mainly two techniques called the total bootstrap and the partial bootstrap.

 

The total bootstrap consists in carrying out as many principal component analyses as there are replications, together with a series of transformations intended to match homologous axes across the successive diagonalizations of the s correlation matrices Ck of the replicates (Ck corresponds to the k-th replication). These transformations are changes of sign of the axes, rotations, or permutations of the axes. This method was proposed by Milan and Whittaker (1995).

 

In the partial bootstrap method, proposed by Greenacre (1984) in the case of correspondence analysis, it is not necessary to compute the eigenvalues and eigenvectors of all the replications: the principal axes calculated on the undisturbed original data play a special role (the initial correlation matrix C is indeed the expectation of the perturbed matrices Ck).

 

The partial bootstrap method is based on the projection as supplementary points of the replicated points on the subspaces of reference provided by the principal axes of the initial correlation matrix  C = X'X.

 

In the following transition formula, we replace the matrix X with the replicated matrix Xk to obtain the corresponding replicate uq(k):

u_q = \frac{1}{\sqrt{\lambda_q}}\, X' v_q

where uq and vq are respectively the q-th eigenvectors of X'X and XX' and λq is the associated eigenvalue.

More precisely, the projection of the k-th replication of the m variables (words) is given by the vector uq(k) such that:

u_q(k) = \frac{1}{\sqrt{\lambda_q}}\, X' D_k\, v_q

where Dk denotes the (n, n) diagonal matrix of the bootstrap weights associated with the k-th replication.

The projection of the bootstrap replications in the context of principal component analysis uses the fact that the coordinate of a variable on a factorial axis is none other than its correlation coefficient with the artificial variable "coordinates of the individuals on the axis". We therefore calculate the replications of this coefficient, which amounts to re-weighting, for each replication, the individuals with the bootstrap weights that characterize a drawing with replacement. We obtain, as a by-product, replications of the variance on the axis, which are obviously different from what replications of the eigenvalues would be.

 

 In the case of partial bootstraps, the analyses of the matrices Ck are by no means necessary, since the eigenvectors are obtained from the principal component analysis of matrix C.

 

Bootstrap variability is thus observed better on the original permanent referential, which is also the only one that has not been disturbed. This technique, tested empirically, largely meets the users' concerns in the case of principal component analysis.
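A hedged numpy sketch of this partial bootstrap for the variables is given below: the axes are computed once on the original table, and each replication only re-computes the correlation between every variable and the fixed axis coordinates, using the bootstrap counts as weights. The data, the number of replications and the helper function are illustrative choices, not DtmVic's own.

import numpy as np

def weighted_corr(x, y, w):
    """Correlation coefficient of x and y with (bootstrap) weights w."""
    w = w / w.sum()
    mx, my = np.sum(w * x), np.sum(w * y)
    cov = np.sum(w * (x - mx) * (y - my))
    return cov / np.sqrt(np.sum(w * (x - mx) ** 2) * np.sum(w * (y - my) ** 2))

rng = np.random.default_rng(0)
n, m = 100, 7
R = rng.normal(size=(n, m))
X = (R - R.mean(axis=0)) / (R.std(axis=0) * np.sqrt(n))

lam, U = np.linalg.eigh(X.T @ X)
order = np.argsort(lam)[::-1]
psi = X @ U[:, order]                         # coordinates of the individuals (fixed referential)

s, axis = 25, 0
replicates = np.empty((s, m))
for k in range(s):
    counts = np.bincount(rng.integers(0, n, size=n), minlength=n)   # bootstrap weights D_k
    for j in range(m):
        replicates[k, j] = weighted_corr(X[:, j], psi[:, axis], counts.astype(float))

print(np.round(replicates.std(axis=0), 3))    # bootstrap variability of each variable on axis 1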

 

 

-Bootstrap on variables

 

Replications are classically obtained by drawing the n individuals with replacement. To test the structural stability with respect to the set of words, we propose to replicate this set using the total bootstrap method.

 

Following again the semiometric example, we thus implicitly assume that the words of the questionnaire are a sample of m words randomly extracted from all of the "semiometrisable" words of the language under consideration.

This “sample of words” will undergo the same perturbations as the sample of individuals in the case of usual bootstrap.

 

In practice, the null bootstrap weights are replaced by infinitesimal weights, so that the variables missing from a given replication still appear, with the status of supplementary variables.

 

This validation test is obviously very conservative. Drawing with replacement implies giving up, on average, about one third of the elements (here, the words!) at each replication.



 

 

References

 

Anderson T. W., Rubin H. (1956), Statistical inference in factor analysis, Proc. of the 3rd Berkeley Symp. on Math. Statist., 5, p 111-150.

Anderson T. W. (1963), Asymptotic theory for principal component analysis, Ann. Math. Statist., 34, p 122-148.

Benzécri J.-P. (1989), Essai d’analyse des notes attribuées par un ensemble de sujets aux mots d’une liste, Cahiers de l’Analyse des Données, Vol XIV, 1, p 73-98.

Benzécri J-P. (1973), L'Analyse des Données, Tome 1 : La Taxinomie, Tome 2 : L'Analyse des Correspondances, Dunod, Paris.

Chateau F., Lebart L. (1996), Assessing sample variability in the visualization techniques related to principal component analysis: bootstrap and alternative simulation methods, in : COMPSTAT96, A. Prats (ed), Physica Verlag, Heidelberg, p 205-210.

Cottrell M., Rousset P. (1997), The Kohonen Algorithm: a powerful tool for analysing and representing multidimensional qualitative and quantitative data, in: Biological and Artificial Computation : From Neuroscience to Technology, J. Mira, R. Moreno-Diaz, J. Cabestany, (eds), Springer, p 861-871.

Daudin J.-J., Duby C., Trécourt P. (1988), Stability of principal components studied by the bootstrap method, Statistics, 19, p 241-258.

Diaconis P., Efron B. (1983), Computer intensive methods in statistics, Scientific American, 248, p 116-130.

Efron B. (1979), Bootstraps methods : another look at the Jackknife, Ann. Statist., 7, p 1-26.

Efron B. (1982), The Jackknife, the Bootstrap and other resampling plans, SIAM, 1982, p 116-130.

Garnett J.-C. (1919), General ability, cleverness and purpose, British J. of Psych., 9, p 345-366.

Gifi A. (1981), Non Linear Multivariate Analysis, Department of Data theory, University of Leiden, (updated version: 1990, same title, J. Wiley, Chichester).

Greenacre M. (1984), Theory and Applications of Correspondence Analysis, Academic Press, London.

Harman H.H. (1967), Modern Factor Analysis, Chicago University  Press, Chicago.

Hayashi C. (1956), Theory and examples of quantification, (II), Proc. of the Institute of Statistical Mathematics, 4, (2), p 19-30.

Hochberg, Y. (1988), A sharper Bonferroni procedure for multiple tests of significance, Biometrika, 75, p 800-803

Holmes S. (1989), Using the bootstrap and the RV coefficient in the multivariate context, in : Data Analysis, Learning Symbolic and Numeric Knowledge, E. Diday (ed.), Nova Science, New York, p 119-132.

Hotelling H. (1933), Analysis of a complex of statistical variables into principal components, J. Educ. Psy.  24, p 417-441, p 498-520.

Hsu, J. C. (1996), Multiple Comparisons: Theory and Methods, Chapman & Hall, London.

Kazmierczak J.-B. (1985), Analyse logarithmique : deux exemples d'application,  Revue de Statist. Appl., 33, (1), p 13-24.

Kohonen T. (1989), Self-Organization and Associative Memory, Springer-Verlag, Berlin.

Lawley D. N., Maxwell A. E. (1963), Factor Analysis as a Statistical Method, Methuen, London.

Lebart L., Morineau A., Warwick K. (1984), Multivariate Descriptive Statistical Analysis,  J. Wiley, New York.

Lebart L., Morineau A., Piron M. (2002), Statistique exploratoire multidimensionnelle, Dunod, Paris.

Lebart L., Salem A. (1994), Statistique textuelle. Dunod, Paris.

Lebart, L., Salem, A., Berry, L. (1998), Exploring Textual Data, Kluwer Academic Publishers, Dordrecht.

Meulman J. (1982), Homogeneity Analysis of Incomplete Data, DSWO Press, Leiden.

Milan L., Whittaker J. (1995), Application of the parametric bootstrap to models that incorporate a singular value decomposition, Appl. Statist. 44, 1, p 31-49.

Muirhead  R. J. (1982), Aspects of Multivariate Statistical Theory, J. Wiley, New York.

Mulaik S. A. (1972), The Foundation of Factor Analysis, McGraw Hill, New York.

O'Neill, R., and G. B. Wetherill. (1971), The present state of multiple comparison methods (with discussion), Journal of the Royal Statistical Society, Series B, 33, p 218-250.

Perneger T. V. (1998), What is wrong with Bonferroni adjustments, British Medical Journal, 316, p 1236-1238.

Saville, D. J. (1990), Multiple comparison procedures: The practical solution,  American Statistician, 44, p 174-180.

Sokal R. R., Sneath P. H. A. (1963), Principles of Numerical Taxonomy, Freeman and co., San-Francisco.

Spearman C. (1904), General intelligence, objectively determined and measured,  Amer. Journal of Psychology, 15, p 201-293.

Thiria S., Lechevallier Y., Gascuel O., Canu S. (1997), Statistique et méthodes neuronales, Dunod, Paris.

Thurstone L. L. (1947), Multiple Factor Analysis, The University of Chicago Press, Chicago.

Westfall, P. H., Young S. S. (1993), Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment, J. Wiley, New York.

Westfall P. H., Tobias R. D., Rom D., Wolfinger R. D., Hochberg Y. (1999), Multiple Comparisons and Multiple Tests Using the SAS System, SAS Institute.

Young G. A. (1994), Bootstrap: more than a stab in the dark, Statistical Science, 9, p 382-418.

 



End of "Some elements of Multivariate Descriptive Statistical Analysis"