Updated version of "Multivariate Descriptive Statistical Analysis", by L. Lebart, A. Morineau, K. Warwick, John Wiley, New York, 1984.
Multiple Correspondence Analysis
1. Introduction
The method of multiple correspondence analysis, which we present here, is a simple extension of two-way analysis. It is characterized by straightforward computations, interesting properties, and simple rules for interpreting the resulting maps.
The principles of this method actually stem from the works of the statistician C. Burt (1950), of L. Guttman (1941), and C. Hayashi (1956). [For an historical review, see Tenenhaus et al. (1985)].
2. Definitions, notations
Survey data generally include a number of responses to questions in complete disjunctive form: this means that the response categories of each question are mutually exclusive, and that exactly one category is chosen.
The letter s will designate the number of questions. A single question q consists of pq response categories.
The total number of response categories, p, contained in the questionnaire is:
$$p = \sum_{q=1}^{s} p_q \tag{1}$$
The number of individuals who responded to the questionnaire is called n.
H is the set whose elements consist of all the series of s categories, each of which is taken from a different question. The elements of H therefore comprise the sum-total of possible responses by the subjects.
Each element of H corresponds to one cell of the multiway contingency table cross-tabulating the s questions.
We must note, however, that this hypertable is in general almost empty. If 1000 individuals are given 12 questions, each of which has 10 categories, n = 1000, whereas the number of elements of H is $10^{12}$. Thus, at the most, only one cell out of one billion will be non-empty.
We denote by Z the matrix with n rows and p columns describing the response of the n individuals with binary coding.
Matrix Z is the juxtaposition of s submatrices (see figure 1 for s = 3) :
Z = [ Z1 , Z2 , ... Zq, ... Zs ] (2)
R (q = 1, 2, 3) | Z1 | Z2 | Z3
2 3 4 | 0 1 0 0 | 0 0 1 | 0 0 0 1
2 1 3 | 0 1 0 0 | 1 0 0 | 0 0 1 0
3 1 2 | 0 0 1 0 | 1 0 0 | 0 1 0 0
4 2 4 | 0 0 0 1 | 0 1 0 | 0 0 0 1
1 2 3 | 1 0 0 0 | 0 1 0 | 0 0 1 0
2 2 3 | 0 1 0 0 | 0 1 0 | 0 0 1 0
3 1 1 | 0 0 1 0 | 1 0 0 | 1 0 0 0
1 1 1 | 1 0 0 0 | 1 0 0 | 1 0 0 0
4 1 2 | 0 0 0 1 | 1 0 0 | 0 1 0 0
2 2 3 | 0 1 0 0 | 0 1 0 | 0 0 1 0
3 2 2 | 0 0 1 0 | 0 1 0 | 0 1 0 0
4 1 4 | 0 0 0 1 | 1 0 0 | 0 0 0 1
Figure 1. Example of homologous tables R and Z
Submatrix Zq (with n rows and pq columns) is such that its ith row contains pq - 1 times the value zero, and once the value 1 in the column corresponding to the category of question q chosen by subject i. In other words, matrix Zq describes the partition of the n individuals that is created by the responses to question q.
Finally, we define a matrix R with n rows and s columns. R is the condensed coding matrix of Z. Cell (i,q) contains the number riq of the category of question q chosen by subject i. It is obvious that riq ≤ pq (see figure 1).
The computational programs will use matrix R as input data, thus reducing considerably the volume of calculations.
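As an illustration of the two codings, the following sketch expands the condensed matrix R of figure 1 into the indicator matrix Z. The helper name `indicator_matrix` and the tuple `sizes` (the numbers of categories p1, p2, p3) are hypothetical names introduced here, not part of any particular program.

```python
import numpy as np

# Condensed coding R of figure 1: n = 12 individuals, s = 3 questions.
R = np.array([[2, 3, 4], [2, 1, 3], [3, 1, 2], [4, 2, 4],
              [1, 2, 3], [2, 2, 3], [3, 1, 1], [1, 1, 1],
              [4, 1, 2], [2, 2, 3], [3, 2, 2], [4, 1, 4]])
sizes = (4, 3, 4)  # p_q for each question, so p = 11


def indicator_matrix(R, sizes):
    """Expand 1-based category codes into Z = [Z1, ..., Zs] (0/1 coding)."""
    # Row k of the identity matrix is the indicator of category k + 1.
    blocks = [np.eye(p_q, dtype=int)[R[:, q] - 1]
              for q, p_q in enumerate(sizes)]
    return np.hstack(blocks)


Z = indicator_matrix(R, sizes)
print(Z.shape)         # (n, p)
print(Z.sum(axis=1))   # each individual chooses exactly one category per question
```

Storing R instead of Z requires s integers per individual rather than p binary values, which is the saving mentioned above.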
Burt's Table Associated with Z
Let us consider matrix Z such that Z = [ Z1 , Z2 , ... Zq, ... Zs ] . The square matrix:
B = Z'Z (3)
is called Burt's contingency table associated with Z, the matrix of responses.
Figure 2. Example of matrix B corresponding to the tables Z and R of figure 1.
Matrix B is made up of $s^2$ blocks:
The square block $Z_q' Z_q$ is a $(p_q, p_q)$ diagonal matrix, since two categories of the same question cannot be chosen simultaneously.
The block $Z_q' Z_{q'}$, whose index is $(q, q')$ with $q \neq q'$, is the contingency table that cross-tabulates the responses to the two questions q and q'.
The diagonal matrix D is a (p, p) matrix that has the same diagonal elements as B; these diagonal elements are the frequencies of each of the categories.
We may also consider matrix D as consisting of $s^2$ blocks. Only the s diagonal blocks are nonzero matrices.
The qth diagonal block, $D_q = Z_q' Z_q$, is the diagonal matrix whose diagonal terms are the frequencies corresponding to the various categories of question q.
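The block structure of B can be checked numerically on the data of figure 1. This is a minimal sketch; the helper `block` and the array `offsets` are hypothetical names used only for the illustration.

```python
import numpy as np

R = np.array([[2, 3, 4], [2, 1, 3], [3, 1, 2], [4, 2, 4],
              [1, 2, 3], [2, 2, 3], [3, 1, 1], [1, 1, 1],
              [4, 1, 2], [2, 2, 3], [3, 2, 2], [4, 1, 4]])
sizes = (4, 3, 4)
s = len(sizes)

Z = np.hstack([np.eye(p_q, dtype=int)[R[:, q] - 1]
               for q, p_q in enumerate(sizes)])

B = Z.T @ Z                      # Burt table, (p, p)
D = np.diag(np.diag(B))          # category frequencies on the diagonal
offsets = np.cumsum((0,) + sizes)


def block(q, q2):
    """Block (q, q2) of B, with 0-based question indices."""
    return B[offsets[q]:offsets[q + 1], offsets[q2]:offsets[q2 + 1]]


print(B)
print(block(0, 1))   # contingency table crossing questions 1 and 2
```

Each row of B sums to s times the corresponding category frequency, because every individual contributes once to each of the s blocks of that row.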
3. The case of two questions: two-way correspondence analysis
In the case of two questions, the response matrix Z is written
Z = [ Z1 , Z2 ] .
It is then equivalent, from the point of view of describing the relationships among categories, to perform any of the following:
1. A correspondence analysis of the (n,p) matrix Z.
2. A correspondence analysis of the (p,p) matrix B.
3. A correspondence analysis of the (p1 , p2 ) matrix Z1'Z2 .
4. A canonical analysis of the two column blocks Z1 and Z2 .
Equivalence between (2) and (1)
Let us show that analyses 1 and 2 provide the same factors of norm 1. The αth factor $\Phi_\alpha$ extracted from analysis 1 is such that
$$\frac{1}{s} D^{-1} Z'Z \, \Phi_\alpha = \lambda_\alpha \Phi_\alpha \tag{4}$$
since, using the notation we adopted for correspondence analysis, the matrix on the left-hand side of this equation would be written
$$D_p^{-1} F' D_n^{-1} F \tag{5}$$
where:
$F = \frac{1}{ns} Z$ (matrix of relative frequencies)
$D_p = \frac{1}{ns} D$ (margins of columns)
$D_n = \frac{1}{n} I_n$ (margins of rows; $I_n$ is the identity matrix)
We recognize equation (4) in the formula:
$$D_p^{-1} F' D_n^{-1} F \, \Phi_\alpha = \lambda_\alpha \Phi_\alpha$$
We have defined the matrix B such that B = Z'Z.
B is symmetric. Its row and column margins are the diagonal elements of the matrix sD.
In analyzing B, we have a new matrix of relative frequencies F:
$$F = \frac{1}{n s^2} B$$
The corresponding new matrices $D_p$ and $D_n$ are equal:
$$D_p = D_n = \frac{1}{ns} D$$
The matrix that is to be diagonalized in this case is written
$$\frac{1}{s^2} D^{-1} B' D^{-1} B$$
When we premultiply the two members of equation (4) by $\frac{1}{s} D^{-1} B$ (recall that $B = Z'Z$ and that $B' = B$), we immediately obtain:
$$\frac{1}{s^2} D^{-1} B' D^{-1} B \, \Phi_\alpha = \lambda_\alpha^2 \Phi_\alpha$$
The factors are therefore identical for both analyses.
Note that this equivalence (between (2) and (1) ) is valid even when there are more than two questions.
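The relation between the two sets of eigenvalues can be verified on the data of figure 1 (three questions). The sketch below uses a hypothetical helper, `ca_eigenvalues`, that computes the eigenvalues of a correspondence analysis through a singular value decomposition, without centering, so that the trivial eigenvalue 1 is included. It checks the eigenvalues only, not the factors themselves.

```python
import numpy as np

R = np.array([[2, 3, 4], [2, 1, 3], [3, 1, 2], [4, 2, 4],
              [1, 2, 3], [2, 2, 3], [3, 1, 1], [1, 1, 1],
              [4, 1, 2], [2, 2, 3], [3, 2, 2], [4, 1, 4]])
sizes = (4, 3, 4)
Z = np.hstack([np.eye(p_q)[R[:, q] - 1] for q, p_q in enumerate(sizes)])
B = Z.T @ Z


def ca_eigenvalues(N):
    """Eigenvalues of the correspondence analysis of table N
    (uncentered, so the trivial eigenvalue 1 comes first)."""
    P = N / N.sum()                        # relative frequencies
    r, c = P.sum(axis=1), P.sum(axis=0)    # row and column margins
    S = P / np.sqrt(np.outer(r, c))        # D_n^{-1/2} F D_p^{-1/2}
    return np.linalg.svd(S, compute_uv=False) ** 2


lam_Z = ca_eigenvalues(Z)   # analysis 1: indicator matrix
lam_B = ca_eigenvalues(B)   # analysis 2: Burt table
print(np.round(lam_Z, 4))
print(np.round(lam_B, 4))
```

The eigenvalues obtained from B are the squares of those obtained from Z, which is why the two maps differ only by a dilation along each axis.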
Equivalence between (1) and (3) :
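Let us now show that, for each pair of factors $(\varphi_\alpha, \psi_\alpha)$ relative to the same eigenvalue $\mu_\alpha$ extracted from the analysis of the contingency table $Z_1'Z_2$, with $\psi_\alpha$ in $\mathbb{R}^{p_1}$ and $\varphi_\alpha$ in $\mathbb{R}^{p_2}$, there is a corresponding factor $\Phi_\alpha$ from analyzing Z (or B) such that:
$$\Phi_\alpha = \begin{bmatrix} \psi_\alpha \\ \varphi_\alpha \end{bmatrix}$$
We have the notation $D_1 = Z_1'Z_1$, $D_2 = Z_2'Z_2$, and
$$D = \begin{bmatrix} D_1 & 0 \\ 0 & D_2 \end{bmatrix}$$
The diagonal elements of $D_1$ and $D_2$ are the row and column margins of the matrix $Z_1'Z_2$.
Analysis of this table leads us to the double transition equations:
$$\psi_\alpha = \frac{1}{\sqrt{\mu_\alpha}} D_1^{-1} Z_1'Z_2 \, \varphi_\alpha \tag{6}$$
$$\varphi_\alpha = \frac{1}{\sqrt{\mu_\alpha}} D_2^{-1} Z_2'Z_1 \, \psi_\alpha \tag{7}$$
These equations can be written as a system of equations:
$$\psi_\alpha + D_1^{-1} Z_1'Z_2 \, \varphi_\alpha = (1 + \sqrt{\mu_\alpha}) \, \psi_\alpha$$
$$D_2^{-1} Z_2'Z_1 \, \psi_\alpha + \varphi_\alpha = (1 + \sqrt{\mu_\alpha}) \, \varphi_\alpha$$
which can be written:
$$D^{-1} Z'Z \begin{bmatrix} \psi_\alpha \\ \varphi_\alpha \end{bmatrix} = (1 + \sqrt{\mu_\alpha}) \begin{bmatrix} \psi_\alpha \\ \varphi_\alpha \end{bmatrix}$$
This equation is written in a more condensed form after multiplying its two members by 1/2 (that is, 1/s, since s = 2):
$$\frac{1}{2} D^{-1} Z'Z \, \Phi_\alpha = \frac{1 + \sqrt{\mu_\alpha}}{2} \, \Phi_\alpha \tag{8}$$
The reader will recognize equation (4) with
$$\lambda_\alpha = \frac{1 + \sqrt{\mu_\alpha}}{2}$$
If $\mu_\alpha$ is the αth largest eigenvalue extracted from the analysis of $Z_1'Z_2$, then the previous equation gives the αth largest eigenvalue of the analysis of Z.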
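The link between the eigenvalues of the contingency table and those of the disjunctive table, namely that an eigenvalue μ of the first analysis yields the eigenvalue (1 + √μ)/2 of the second, can be checked on the first two questions of figure 1. This is a sketch under the same assumptions as before: `ca_eigenvalues` is a hypothetical helper and the analyses are left uncentered, so the trivial eigenvalue 1 is part of both lists.

```python
import numpy as np

R = np.array([[2, 3, 4], [2, 1, 3], [3, 1, 2], [4, 2, 4],
              [1, 2, 3], [2, 2, 3], [3, 1, 1], [1, 1, 1],
              [4, 1, 2], [2, 2, 3], [3, 2, 2], [4, 1, 4]])

Z1 = np.eye(4)[R[:, 0] - 1]      # question 1, p1 = 4
Z2 = np.eye(3)[R[:, 1] - 1]      # question 2, p2 = 3
Z = np.hstack([Z1, Z2])          # disjunctive table, s = 2
N = Z1.T @ Z2                    # (p1, p2) contingency table


def ca_eigenvalues(N):
    """Uncentered correspondence analysis eigenvalues, in decreasing order."""
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = P / np.sqrt(np.outer(r, c))
    return np.linalg.svd(S, compute_uv=False) ** 2


mu = ca_eigenvalues(N)           # min(p1, p2) = 3 values
lam = ca_eigenvalues(Z)          # p1 + p2 = 7 values
print(np.round(mu, 4))
print(np.round(lam, 4))
print(np.round((1 + np.sqrt(mu)) / 2, 4))   # compare with the leading values of lam
```

The remaining eigenvalues of the analysis of Z are of the form (1 - √μ)/2, plus values equal to 1/2 when the two questions do not have the same number of categories; they carry no additional information.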
Equivalence between (3) and (4) :
The equivalence between non-centered canonical analysis and correspondence analysis is the consequence of the fact that in both cases, the matrix to be diagonalised is the same :
$$S = (Z_1'Z_1)^{-1} Z_1'Z_2 \, (Z_2'Z_2)^{-1} Z_2'Z_1$$
Remember that Z'1Z1 and Z'2Z2 are diagonal, and that Z'1Z2 is nothing but the contingency table crossing the two questions.
Remark : In the analysis of the disjunctive matrix Z, the points that represent the various response categories of the two questions are elements of the same set, which is the set of Z's columns. On the other hand, in analyzing the contingency table Z'1Z2 , they are split up into row-points and column-points.
The fact that the maps obtained in the space of the first factors are identical (although they are dilated, since the eigenvalues are not the same) shows that it is legitimate to simultaneously represent the row-points and the column-points in correspondence analysis.
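Table 1. Equivalence between the analyses in the case of two questions
Table | Dimension | Axes | Eigenvalues
Contingency table $Z_1'Z_2$ | $(p_1, p_2)$ | $\psi$ in $\mathbb{R}^{p_1}$, $\varphi$ in $\mathbb{R}^{p_2}$ | $\mu$
Disjunctive table $Z = [Z_1, Z_2]$ | $(n, p)$, with $p = p_1 + p_2$ | $\Phi = [\psi ; \varphi]$ in $\mathbb{R}^{p}$ | $\lambda = (1 + \sqrt{\mu})/2$
Burt table $B = Z'Z$ | $(p, p)$ | $\Phi$ in $\mathbb{R}^{p}$ | $\lambda^2$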
4. The case of more than two questions
The generalization to the problem of more than two questions requires a reformulation of the two-way problem.
Matrix Z = ( Z1, Z2, ... Zq, ... , Zs ) has p columns, to which there correspond p points of Rn. Let us consider this space Rn. Each submatrix Zq generates a linear subspace Vq with pq dimensions.
These linear subspaces have in common at least the first bisector 1n (the vector all of whose components are equal to 1). The rank of matrix Z is therefore at the most equal to p - (s-1).
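The bound on the rank can be observed on the data of figure 1, where p = 11 and s = 3. A minimal sketch:

```python
import numpy as np

R = np.array([[2, 3, 4], [2, 1, 3], [3, 1, 2], [4, 2, 4],
              [1, 2, 3], [2, 2, 3], [3, 1, 1], [1, 1, 1],
              [4, 1, 2], [2, 2, 3], [3, 2, 2], [4, 1, 4]])
sizes = (4, 3, 4)
s, p = len(sizes), sum(sizes)

blocks = [np.eye(p_q)[R[:, q] - 1] for q, p_q in enumerate(sizes)]
Z = np.hstack(blocks)

# The columns of every block Zq add up to the vector 1n,
# so 1n belongs to each of the subspaces Vq.
for Zq in blocks:
    print(Zq.sum(axis=1))

rank = np.linalg.matrix_rank(Z)
print(rank, "<=", p - (s - 1))
```

Each additional block brings at most pq - 1 new dimensions, since one direction (the vector 1n) is already present in the first block.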
Let $\varphi_q$ be the vector whose $p_q$ components are the coordinates of a point $m_q$ of $V_q$ in the basis defined by the columns of $Z_q$.
The coordinates of $m_q$ in $\mathbb{R}^n$ are the components of
$$m_q = Z_q \varphi_q$$
The square of the distance of this point $m_q$ to the origin (its squared Euclidean norm) is:
$$\varphi_q' Z_q'Z_q \, \varphi_q = \varphi_q' D_q \, \varphi_q$$
The correspondence analysis of the contingency table that cross-tabulates the categories of two questions q and q' reduces to studying the relative positions of the subspaces Vq and Vq'. This leads to a canonical analysis of the matrix [Zq, Zq'].
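The double transition equations (6) and (7) are written as follows (we have omitted the index α in order to simplify the notation):
$$\varphi_q = \frac{1}{\sqrt{\mu}} D_q^{-1} Z_q' Z_{q'} \, \varphi_{q'}$$
$$\varphi_{q'} = \frac{1}{\sqrt{\mu}} D_{q'}^{-1} Z_{q'}' Z_q \, \varphi_q$$
From these equations we can deduce the following:
$$\sqrt{\mu} \, Z_q \varphi_q = Z_q D_q^{-1} Z_q' \, Z_{q'} \varphi_{q'}$$
$$\sqrt{\mu} \, Z_{q'} \varphi_{q'} = Z_{q'} D_{q'}^{-1} Z_{q'}' \, Z_q \varphi_q$$
namely,
$$\sqrt{\mu} \, m_q = P_q \, m_{q'} \tag{9}$$
$$\sqrt{\mu} \, m_{q'} = P_{q'} \, m_q \tag{10}$$
in which:
$$P_q = Z_q (Z_q'Z_q)^{-1} Z_q', \qquad P_{q'} = Z_{q'} (Z_{q'}'Z_{q'})^{-1} Z_{q'}'$$
The matrices $P_q$ and $P_{q'}$ are the projection operators on the subspaces $V_q$ and $V_{q'}$.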
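To make the role of these operators concrete, the sketch below builds the orthogonal projector onto the subspace generated by the first question of figure 1, using the standard form Zq (Zq'Zq)^{-1} Zq', and checks its basic properties. The vector `x` is an arbitrary illustration.

```python
import numpy as np

R = np.array([[2, 3, 4], [2, 1, 3], [3, 1, 2], [4, 2, 4],
              [1, 2, 3], [2, 2, 3], [3, 1, 1], [1, 1, 1],
              [4, 1, 2], [2, 2, 3], [3, 2, 2], [4, 1, 4]])
n = R.shape[0]

Zq = np.eye(4)[R[:, 0] - 1]                  # block of question 1
Dq = Zq.T @ Zq                               # diagonal matrix of frequencies
Pq = Zq @ np.linalg.inv(Dq) @ Zq.T           # projector on Vq, (n, n)

print(np.allclose(Pq @ Pq, Pq))              # idempotent
print(np.allclose(Pq, Pq.T))                 # symmetric, hence orthogonal projector
print(np.allclose(Pq @ np.ones(n), 1.0))     # 1n belongs to Vq

# Projecting a vector replaces each value by the mean of the values
# of the individuals who chose the same category.
x = np.arange(n, dtype=float)
print(Pq @ x)
```

Individuals 5 and 8 are the only ones who chose category 1 of the first question, so their projected values are both equal to the mean of their two original values.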
When canonical analysis is presented as a problem of finding the smallest angles between two subspaces, it does not lend itself to generalization to the problem of more than two questions.
However, the canonical analysis of the matrix [ Zq , Zq' ] can be also formulated in the following way:
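Find two points $m_q$ and $m_{q'}$ such that the mean of their squared distances to the origin is a fixed constant, denoted here by c:
$$\frac{1}{2} \left( \varphi_q' D_q \varphi_q + \varphi_{q'}' D_{q'} \varphi_{q'} \right) = c \tag{11}$$
and such that the distance to the origin of the point m such that $m = m_q + m_{q'}$ is maximum.
The square of this distance is
$$\|m\|^2 = (Z_q \varphi_q + Z_{q'} \varphi_{q'})' (Z_q \varphi_q + Z_{q'} \varphi_{q'})$$
which reads
$$\|m\|^2 = \varphi_q' D_q \varphi_q + \varphi_{q'}' D_{q'} \varphi_{q'} + 2 \, \varphi_q' Z_q' Z_{q'} \varphi_{q'}$$
To maximize $\|m\|^2$ with the constraint (11) or with the two constraints:
$$\varphi_q' D_q \varphi_q = c, \qquad \varphi_{q'}' D_{q'} \varphi_{q'} = c$$
leads to the same result, since the two Lagrange multipliers relative to the last two constraints are equal.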
With the single constraint (11), the problem is readily generalized to more than two questions.
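Let us designate by $\varphi_1, \dots, \varphi_q, \dots, \varphi_s$, respectively, the vectors of the components of s points $m_1, \dots, m_q, \dots, m_s$ in the bases $Z_1, \dots, Z_q, \dots, Z_s$, and let $m = m_1 + \dots + m_q + \dots + m_s$.
The quantity to be maximized is:
$$\|m\|^2 = \Big( \sum_{q=1}^{s} Z_q \varphi_q \Big)' \Big( \sum_{q=1}^{s} Z_q \varphi_q \Big)$$
under the constraint (c being a fixed positive constant):
$$\frac{1}{s} \sum_{q=1}^{s} \varphi_q' D_q \varphi_q = c$$
If $\Phi$ is the vector with p components defined by
$$\Phi = \begin{bmatrix} \varphi_1 \\ \vdots \\ \varphi_s \end{bmatrix}$$
the problem becomes maximizing:
$$\Phi' Z'Z \, \Phi = \Phi' B \, \Phi$$
under the constraint:
$$\frac{1}{s} \Phi' D \, \Phi = c$$
The factors $\Phi$ are therefore the eigenvectors of the matrix $D^{-1}B$ relative to the largest eigenvalues. They are proportional to those extracted from the correspondence analysis of matrix Z and coincide, as we have seen, with the factors (axes) extracted from analyzing matrix B, itself considered as a data table.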
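The eigenvectors of D^{-1}B can be obtained from a symmetric eigenproblem, which is numerically more convenient. The sketch below does this for the data of figure 1; the variable names are hypothetical, and the change of variable through D^{-1/2} is a standard device rather than a step taken from the text.

```python
import numpy as np

R = np.array([[2, 3, 4], [2, 1, 3], [3, 1, 2], [4, 2, 4],
              [1, 2, 3], [2, 2, 3], [3, 1, 1], [1, 1, 1],
              [4, 1, 2], [2, 2, 3], [3, 2, 2], [4, 1, 4]])
sizes = (4, 3, 4)
s = len(sizes)
Z = np.hstack([np.eye(p_q)[R[:, q] - 1] for q, p_q in enumerate(sizes)])
B = Z.T @ Z
d = np.diag(B)                              # diagonal of D

# Symmetric form D^{-1/2} B D^{-1/2}; its eigenvectors v give Phi = D^{-1/2} v.
A = B / np.sqrt(np.outer(d, d))
w, V = np.linalg.eigh(A)                    # eigenvalues in increasing order
Phi = V / np.sqrt(d)[:, None]               # eigenvectors of D^{-1} B

print(np.allclose(B @ Phi, (d[:, None] * Phi) * w))   # B Phi = w D Phi
print(np.round(w / s, 4))                   # eigenvalues of (1/s) D^{-1} B

# The ratio Phi' B Phi / Phi' D Phi never exceeds the largest eigenvalue.
rng = np.random.default_rng(0)
x = rng.normal(size=B.shape[0])
rayleigh = (x @ B @ x) / (x @ (d * x))
print(rayleigh <= w[-1])
```

The largest eigenvalue of D^{-1}B equals s and corresponds to the constant vector, that is, to the trivial factor; the factors of interest are those associated with the next largest eigenvalues.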
5. Properties of the representations
The pq category-points relating to one question q are centred
The s subsets of pq points corresponding to the s questions (or blocks Zq) have the same center of gravity, which is the general center of gravity of the p categories.
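With the previous notations, the coordinates in $\mathbb{R}^n$ of the $p_q$ points corresponding to question q are the columns of the block $Z_q D_q^{-1}$, the corresponding masses being the diagonal elements of $\frac{1}{n} D_q$.
The center of gravity $G_q$ of this subset has n coordinates, indexed by i, such that:
$$G_q(i) = \sum_{j \in q} \frac{n_j}{n} \cdot \frac{z_{ij}}{n_j} = \frac{1}{n} \sum_{j \in q} z_{ij} = \frac{1}{n}$$
where $n_j$ is the frequency of category j (a diagonal element of $D_q$), $z_{ij}$ is the general term of $Z_q$, and the symbol $\sum$ denotes a sum over the $p_q$ categories of question q.
$G_q$ is therefore independent of q.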
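This property is easy to check on the data of figure 1. In the sketch below, `points` and `masses` are hypothetical names for the columns of Zq Dq^{-1} and the diagonal of (1/n) Dq.

```python
import numpy as np

R = np.array([[2, 3, 4], [2, 1, 3], [3, 1, 2], [4, 2, 4],
              [1, 2, 3], [2, 2, 3], [3, 1, 1], [1, 1, 1],
              [4, 1, 2], [2, 2, 3], [3, 2, 2], [4, 1, 4]])
sizes = (4, 3, 4)
n = R.shape[0]

centers = []
for q, p_q in enumerate(sizes):
    Zq = np.eye(p_q)[R[:, q] - 1]
    freq = Zq.sum(axis=0)           # diagonal of Dq
    points = Zq / freq              # columns of Zq Dq^{-1}: the category-points
    masses = freq / n               # diagonal of (1/n) Dq
    centers.append(points @ masses) # weighted mean of the category-points

for G in centers:
    print(G)                        # the same vector for every question
```

Every center of gravity has all its coordinates equal to 1/n, whatever the question.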
In MCA, there are at the most (p - s ) non-zero eigenvalues.
We have seen that all the subspaces Vq generated by the blocks Zq have in common the vector 1n, all of whose components are equal to 1.
This vector is the "trivial" axis corresponding to the eigenvalue 1; it corresponds to the eigenvalue 0 when the analysis is performed with respect to the center of gravity.
Therefore, the maximum dimensionality of the configuration of points is:
( p1 - 1 ) + ( p2 - 1 ) + ... + ( ps - 1 ) = p - s
The usual software takes this property into account and diagonalizes a (p - s, p - s) matrix instead of a (p, p) matrix.
In particular, when all the questions have two categories (yes/no, for example), the matrix to be diagonalized is half the size.
It can be shown in this case (pq = 2 for all q) that multiple correspondence analysis is equivalent to a normalized principal component analysis performed on the (n, s) binary table containing one column (chosen among the two initial columns) for each question q.
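Both properties can be illustrated on simulated data. The sketch below draws a hypothetical table of s = 4 yes/no questions for n = 200 individuals (random data, used only for the check), counts the non-zero eigenvalues of the MCA once the trivial one is set aside, and compares them with the eigenvalues of the correlation matrix of the binary table divided by s.

```python
import numpy as np

rng = np.random.default_rng(0)
n, s = 200, 4
X = rng.integers(0, 2, size=(n, s))        # one 0/1 column per question

# Complete disjunctive coding: two columns (yes, no) per question, p = 2s.
Z = np.hstack([np.column_stack([X[:, q], 1 - X[:, q]]) for q in range(s)])
p = Z.shape[1]
d = Z.sum(axis=0)                          # category frequencies

# Symmetric form of (1/s) D^{-1} Z'Z; same eigenvalues.
A = (Z.T @ Z) / np.sqrt(np.outer(d, d)) / s
lam = np.sort(np.linalg.eigvalsh(A))[::-1]
nontrivial = lam[1:]                       # drop the trivial eigenvalue 1

n_nonzero = int(np.sum(nontrivial > 1e-10))
print(n_nonzero, "<=", p - s)

# Normalized PCA of the (n, s) binary table: eigenvalues of its correlation matrix.
pca = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
print(np.round(nontrivial[:s], 4))
print(np.round(pca / s, 4))
```

Whichever of the two columns is kept for each question, the correlation matrix changes only by the sign of some rows and columns, so its eigenvalues are unaffected.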
Test-values for supplementary categories
We have seen in the chapter relating to correspondence analysis how the transition formulas allow us to project supplementary elements onto the axes (calculated from the "active elements"). In the case of MCA, the interpretation is particularly easy, due to the binary coding of the data:
The coordinate of a supplementary category on axis α is the arithmetic mean (multiplied by $1/\sqrt{\lambda_\alpha}$) of the coordinates of the individuals concerned (the individuals having chosen this category as a response).
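The rule can be sketched as follows on the data of figure 1. The MCA is computed here through a singular value decomposition of the centered indicator matrix, which is one common way of obtaining the coordinates and not necessarily the procedure used by a given program; the function `category_coordinate` and the group `supp` are hypothetical and serve only as an illustration.

```python
import numpy as np

R = np.array([[2, 3, 4], [2, 1, 3], [3, 1, 2], [4, 2, 4],
              [1, 2, 3], [2, 2, 3], [3, 1, 1], [1, 1, 1],
              [4, 1, 2], [2, 2, 3], [3, 2, 2], [4, 1, 4]])
sizes = (4, 3, 4)
Z = np.hstack([np.eye(p_q)[R[:, q] - 1] for q, p_q in enumerate(sizes)])
n, p = Z.shape

# Correspondence analysis of Z with respect to the center of gravity.
P = Z / Z.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)
lam = sv ** 2                                # eigenvalues
psi = U * sv / np.sqrt(r)[:, None]           # coordinates of the individuals
phi = Vt.T * sv / np.sqrt(c)[:, None]        # coordinates of the active categories


def category_coordinate(members, axis):
    """Mean coordinate of the individuals in `members`, divided by sqrt(lambda)."""
    return psi[members, axis].mean() / np.sqrt(lam[axis])


# An active category is recovered exactly by the same rule ...
print(phi[0, 0], category_coordinate(Z[:, 0] == 1, 0))

# ... and the rule applies unchanged to a supplementary category,
# here a hypothetical group made of the first six individuals.
supp = np.array([True] * 6 + [False] * 6)
print(category_coordinate(supp, 0), category_coordinate(supp, 1))
```

Because the supplementary category plays no part in the decomposition, it can be positioned on the axes without altering them.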
This suggests the following statistical test:
Let a supplementary category j contain nj individuals or observations. The null hypothesis is that the nj observations are chosen at random (without replacement) among the n active observations.
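The coordinate $\varphi_{\alpha j}$ of the category j on axis α is consequently a random variable, equal to the product by the constant $1/\sqrt{\lambda_\alpha}$ of the arithmetic mean of $n_j$ values $\psi_{\alpha i}$ drawn at random, without replacement, from a finite set of n values.
We have: $E(\varphi_{\alpha j}) = E(\psi_{\alpha i}) = 0$ and
$$\mathrm{Var}(\varphi_{\alpha j}) = \frac{1}{\lambda_\alpha} \cdot \frac{\lambda_\alpha}{n_j} \cdot \frac{n - n_j}{n - 1} = \frac{n - n_j}{n_j \, (n - 1)}$$
since $\mathrm{var}(\psi_{\alpha i}) = \lambda_\alpha$.
The quantities $t_j = \varphi_{\alpha j} \left[ \dfrac{n_j \, (n - 1)}{n - n_j} \right]^{1/2}$ are standardized variables: they allow us to compare the respective significances of several supplementary categories. These quantities are named test-values.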
A test value greater than 2 or less than -2 indicates that the corresponding category has a significant location on the axis.
The test-values are generally printed beside the coordinates and the contributions in MCA listings.
6. Example of application
The data set is the British section of a multinational survey conducted in seven countries in the late 1980s (Hayashi et al., 1992).
In the set of examples “DtmVic_Examples” downloadable with DtmVic, this data set is used in the two examples in the directory: DtmVic_Examples_A_Start (examples: EX_A05.Text-Responses_1 and EX_A06.Text-Responses_2).
These examples deal mainly with textual data (responses to open-ended questions), but here we use only the closed questions described by the two files TDA_dat.txt (data) and TDA_dic.txt (dictionary).
The example deals with the responses of n = 1043 individuals and comprises objective characteristics of the respondent or his/her household (age, status, gender, facilities). Other questions relate to attitudes or opinions.
6.1 Active questions, global analysis
In this example we focus on a set of s = 6 attitudinal questions, with a total of p= 24 response categories, that constitutes the set of active variables. The external information (supplementary categories) is provided by a single categorical variable (9 categories).
Table 2. Frequencies, coordinates and contributions of active categories on axes 1 and 2
Active categories | Frequency | Coordinate, axis 1 | Coordinate, axis 2 | Contribution, axis 1 | Contribution, axis 2
1. Change in the global standard of living last years
Std.liv/much better | 223 | -.96 | .51 | 9.6 | 3.6
Std.liv/lit better | 417 | -.17 | -.30 | .6 | 2.3
Std.liv/the same | 164 | .34 | -.42 | .9 | 1.8
Std.liv/a lit worse | 156 | .78 | -.02 | 4.5 | .0
Std.liv/v.much worse | 83 | 1.30 | 1.00 | 6.5 | 5.1
2. Change in your personal standard of living last years
SL.pers/much better | 250 | -.94 | .47 | 10.4 | 3.4
SL.pers/lit better | 317 | -.07 | -.42 | .1 | 3.5
SL.pers/the same | 283 | .28 | -.23 | 1.0 | .9
SL.pers/a lit worse | 123 | .82 | .14 | 3.9 | .1
SL.pers/v.much worse | 70 | 1.11 | .93 | 4.0 | 3.7
3. Change in your personal standard of living next 5 years
SL.next/much better | 123 | -.57 | .56 | 2.0 | 2.5
SL.next/lit.better | 294 | -.42 | .35 | 2.5 | 2.3
SL.next/the same | 460 | .07 | -.49 | .1 | 6.8
SL.next/lit.worse | 134 | 1.16 | .34 | 9.1 | 1.0
SL.next/much worse | 32 | 1.21 | 1.00 | 4.1 | 3.0
4. Will people be happier in years to come?
People/happier | 188 | -1.02 | .76 | 9.2 | 6.7
People/less happy | 535 | .61 | .20 | 9.4 | 1.3
People/the same | 320 | -.42 | -.78 | 2.7 | 11.8
5. Will people peace of mind increase...
P.of.mind/increases | 180 | -.76 | .72 | 4.9 | 5.8
P.of.mind/decreases | 618 | .40 | .18 | 4.7 | 1.2
P.of.mind/no change | 245 | -.46 | -.97 | 2.4 | 14.3
6. Will people have more or less freedom...
More freedom | 443 | -.40 | .47 | 3.3 | 5.9
Less freedom | 336 | .70 | .15 | 7.8 | .5
Freedom/the same | 264 | -.23 | -.97 | .6 | 15.4
(total for contributions) | | | | 100.0 | 100.0
Table 2 gives the wording of the 6 active questions and the 24 corresponding categories, together with the basic results of the MCA of the indicator matrix Z crossing the 1043 respondents (rows) with the 24 categories (columns): frequency of the responses, coordinates on the first two principal axes and contributions (also called “absolute contributions”). The first two eigenvalues are respectively 0.342 and 0.260 and account for 12.1 % and 9.2 % of the total inertia. It is widely recognised that these percentages give a pessimistic idea of the quality of the description provided by the MCA. In fact, several corrected or adjusted formulas for these percentages have been suggested by Benzécri (1979) and Greenacre (1994).
Figure 3 is a rough sketch of the principal plane obtained from that MCA.
Fig.3 Plane of the axes 1 and 2 from the MCA of the 1043 x 24 indicator matrix.
The most positive and optimistic responses are grouped in the upper left side of figure 3 (answers “much better” to the three questions about “standard of living”, answers “people happier”, “peace of mind increases”, “more freedom”). The most negative and pessimistic responses occupy the upper right side of the display, in which the three “very much worse” items relating to the first three questions form a cluster markedly separated from the remaining categories. All neutral responses (“the same”, “no change”) are located in the lower part of the display, together with some moderate responses such as “little better”.
While the first (horizontal) axis can be considered as defining (from right to left) a particular scale of “optimism, satisfaction, happiness”, the second axis appears to oppose moderate responses (lower side) to both extremely positive and extremely negative responses. Such a pattern of responses suggests what is known in the literature as the “Guttman effect” or “horseshoe effect” (Van Rijckevorsel, 1987). However, we are not dealing here with a pure artefact of the MCA: we will see below that some particular categories of respondents are significantly inclined either to moderate responses or to systematic extreme responses.
Another possible pattern visible in figure 3 is the “battery effect”, often observed in survey analysis. When several questions have the same response items (which is the case for the first three questions), the respondent is tempted to choose identical answers. That could account for the tight grouping of the responses “very much worse” in the upper right part of the display.
6.2 Supplementary categories and test-values
The single supplementary variable, comprising 9 categories, is built by cross-tabulating two basic questions: age (3 categories: less than 30, between 30 and 55, over 55) and educational level (3 categories: low, medium, high).
Table 3 gives the identifiers of these 9 categories together with the frequencies of the responses, the corresponding coordinates and test-values.
Table 3. Frequencies, coordinates and test-values of supplementary categories on axes 1 and 2
Supplementary categories | Frequency | Coordinate, axis 1 | Coordinate, axis 2 | Test-value, axis 1 | Test-value, axis 2
Age and educational level
-30/low | 18 | -.08 | -.01 | -.3 | -.1
30-55/low | 226 | .08 | .21 | 1.4 | 3.6
+55/low | 237 | .38 | -.16 | 6.7 | -2.7
-30/medium | 187 | -.16 | .09 | -2.4 | 1.3
30-55/medium | 159 | -.35 | .08 | -4.8 | 1.0
+55/medium | 72 | .19 | -.24 | 1.6 | -2.1
-30/high | 61 | -.35 | .12 | -2.8 | 1.0
30-55/high | 61 | -.18 | -.24 | -1.4 | -2.0
+55/high | 22 | -.18 | -.60 | -.8 | -2.8
The test-values allow us to draw conclusions within the framework of classical statistical inference. The young educated respondents (age: less than 30, high level of education) are significantly optimistic in their answers to these attitudinal questions: their location is at -2.8 standard deviations (t = -2.8) from the mean point (origin of the axes). Likewise, the younger respondents with a medium level of education are on the optimistic side of the axis (t = -2.4). But the category that best characterises this optimistic side of the first axis is the category of respondents aged 30 to 55 with a medium level of education (t = -4.8).
On the opposite side, among pessimistic categories, the persons over 55 having a low level of education occupy a very significant location (t = 6.7).
On axis 2, the persons between 30 and 55 with a low level of education are located on the side of extreme responses, opposed to all the categories of respondents over 55.
Owing to the limited incidence of the multiple comparisons effect (only 9 tests), we have chosen a moderately conservative threshold here (|t| > 2.3). The incidence of the multiple comparisons effect will be even more limited in the case of bootstrap validation, which provides simultaneous confidence intervals for all the categories and takes into account the structure of correlations between the categories.
End of Chapter 5: MCA