Updated version of "Multivariate Descriptive Statistical Analysis", by L. Lebart, A. Morineau, K. Warwick, John Wiley, New York, 1984.

 

 

Multiple Correspondence Analysis

 

  

  1. Introduction

 

The method of multiple correspondence analysis, which we present here, is a simple extension of two-way analysis. It is characterized by straightforward computations, interesting properties, and simple rules for interpreting the resulting maps.

  The principles of this method actually stem from the works of the statistician C. Burt (1950), of L. Guttman (1941), and C. Hayashi (1956). [For an historical review, see Tenenhaus et al. (1985)].

   

  2. Definitions, notations

 

Survey data generally includes a number of responses to questions in complete disjunctive form : this means that the various response categories are mutually exclusive, and that only one category is chosen.

  The letter s will designate the number of questions. A single question q consists of $p_q$ response categories.

  The total number of response categories, p, contained in the questionnaire is:

$$ p = \sum_{q=1}^{s} p_q \tag{1} $$

The number of individuals who responded to the questionnaire  is called n.

H is the set whose elements consist of all the series of s categories, each of which is taken from a different question. The elements of H therefore comprise the sum-total of possible responses by the subjects.

  Each element of H corresponds to one cell of the multiway contingency table cross-tabulating the s questions.

We must note, however, that this hypertable is in general almost empty. If 1000 individuals are given 12 questions, each of which has 10 categories, n = 1000, whereas the number of elements of H is $10^{12}$. Thus, at most, only one cell out of one billion will be non-empty.

  We denote by Z the matrix with n rows and p columns describing the response of the n individuals with binary coding.

  Matrix Z is the juxtaposition of s submatrices (see figure 1 for s = 3) :

 

            Z  =  [ Z1 , Z2 , ... Zq, ... Zs  ]                                                                  (2)

 

          2 3 4                 0 1 0 0   0 0 1   0 0 0 1
          2 1 3                 0 1 0 0   1 0 0   0 0 1 0 
          3 1 2                 0 0 1 0   1 0 0   0 1 0 0 
          4 2 4                 0 0 0 1   0 1 0   0 0 0 1 
          1 2 3                 1 0 0 0   0 1 0   0 0 1 0 
   R=     2 2 3         Z =     0 1 0 0   0 1 0   0 0 1 0 
          3 1 1                 0 0 1 0   1 0 0   1 0 0 0 
          1 1 1                 1 0 0 0   1 0 0   1 0 0 0 
          4 1 2                 0 0 0 1   1 0 0   0 1 0 0 
          2 2 3                 0 1 0 0   0 1 0   0 0 1 0 
          3 2 2                 0 0 1 0   0 1 0   0 1 0 0 
          4 1 4                 0 0 0 1   1 0 0   0 0 0 1 

              Figure 1. Example of homologous tables R and Z


 

Submatrix $Z_q$ (with n rows and $p_q$ columns) is such that its ith row contains $p_q - 1$ times the value zero, and once the value 1 in the column corresponding to the category of question q chosen by subject i. In other words, matrix $Z_q$ describes the partition of the n individuals that is created by the responses to question q.

   Finally, we define a matrix R with n rows and s columns. R is the condensed coding matrix of Z. Cell (i,q) contains the number $r_{iq}$ of the category of question q chosen by subject i. It is obvious that $r_{iq} \le p_q$ (see figure 1).

  The computational programs will use matrix R as input data, thus reducing considerably the volume of calculations.
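
As an illustration of the two codings, the following minimal sketch expands the condensed matrix R of figure 1 into the indicator matrix Z. It assumes NumPy; the function name `indicator_matrix` and the list `n_categories` are hypothetical names introduced here, not part of any particular program.

```python
import numpy as np

# Condensed coding R from Figure 1: n = 12 subjects, s = 3 questions.
R = np.array([
    [2, 3, 4], [2, 1, 3], [3, 1, 2], [4, 2, 4],
    [1, 2, 3], [2, 2, 3], [3, 1, 1], [1, 1, 1],
    [4, 1, 2], [2, 2, 3], [3, 2, 2], [4, 1, 4],
])

def indicator_matrix(R, n_categories):
    """Expand condensed codes (1-based) into Z = [Z1, ..., Zs]."""
    n = R.shape[0]
    blocks = []
    for q, p_q in enumerate(n_categories):
        Zq = np.zeros((n, p_q), dtype=int)
        # One value 1 per row, in the column of the chosen category.
        Zq[np.arange(n), R[:, q] - 1] = 1
        blocks.append(Zq)
    return np.hstack(blocks)

n_categories = [4, 3, 4]            # p_1, p_2, p_3 as in Figure 1
Z = indicator_matrix(R, n_categories)
print(Z)
```

Each row of Z sums to s, since every subject chooses exactly one category per question.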

 

   Burt's Table Associated with  Z

 

Let us consider matrix Z such that Z  =  [ Z1 , Z2 , ... Zq, ... Zs  ] .     The square matrix:

                               B = Z'Z                                                                                                   (3)

  is called Burt's contingency table associated with Z, the matrix of responses.

 

 

Figure 2.  Example of matrix B corresponding to the tables Z and R of figure 1.

 

Matrix B is made up of $s^2$ blocks:

The qth diagonal block $Z_q' Z_q$ is a square $(p_q, p_q)$ diagonal matrix, since two categories of one question cannot be chosen simultaneously.

   The off-diagonal block $Z_q' Z_{q'}$, whose index is (q,q'), is the contingency table that cross-tabulates the responses to the two questions q and q'.

  The diagonal matrix D is a (p,p) matrix that has the same diagonal elements as B; these diagonal elements are the frequencies for each of the categories.

  We may also consider matrix D as consisting of $s^2$ blocks. Only the s diagonal blocks are nonzero matrices.

The qth diagonal block, $D_q = Z_q' Z_q$, is the diagonal matrix whose diagonal terms are the frequencies corresponding to the various categories of question q.
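
The block structure described above can be checked on the data of figure 1. The sketch below assumes NumPy and rebuilds Z from R with a one-hot shortcut; the variable names (`sizes`, `d`) are illustrative only.

```python
import numpy as np

# Condensed coding R of Figure 1 (12 subjects, s = 3 questions).
R = np.array([
    [2, 3, 4], [2, 1, 3], [3, 1, 2], [4, 2, 4],
    [1, 2, 3], [2, 2, 3], [3, 1, 1], [1, 1, 1],
    [4, 1, 2], [2, 2, 3], [3, 2, 2], [4, 1, 4],
])
sizes = [4, 3, 4]                    # p_q for each question
s = len(sizes)

# Indicator matrix Z = [Z1, Z2, Z3]: rows picked from identity matrices.
Z = np.hstack([np.eye(p_q, dtype=int)[R[:, q] - 1]
               for q, p_q in enumerate(sizes)])

B = Z.T @ Z                          # Burt table, equation (3)
d = np.diag(B)                       # category frequencies (diagonal of D)

print(B)
print("diagonal of D:", d)
# Row margins of B equal s times the category frequencies.
print("row margins of B:", B.sum(axis=1))
```

The first diagonal block `B[:4, :4]` is diagonal, and the block `B[:4, 4:7]` is the contingency table crossing questions 1 and 2.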

 

  3. The case of two questions: two-way correspondence analysis

 

In the case of two questions, the response matrix Z is written

 Z =  [ Z1 Z2 ]  .

 

 It is then equivalent, from the point of view of describing the relationships among categories, to perform any of the following:

1.  A correspondence analysis of the (n,p) matrix Z.

2.  A correspondence analysis of the (p,p) matrix B.

3.  A correspondence analysis of the (p1 , p2 ) matrix Z1'Z2 .

4.  A canonical analysis of the two column blocks Z1  and Z2 .

 

          Equivalence between  (2)  and   (1)

 

 Let us show that analyses 1 and 2 provide the same factors of norm 1. The αth factor $\Phi_\alpha$ extracted from analysis 1 is such that

$$ \frac{1}{s} D^{-1} Z'Z \, \Phi_\alpha = \lambda_\alpha \Phi_\alpha \tag{4} $$

since, using the notation we adopted for correspondence analysis, the matrix on the left-hand side of this equation would be written

$$ D_p^{-1} F' D_n^{-1} F \tag{5} $$

 where, here:

                      $F = \frac{1}{ns} Z$   (matrix of relative frequencies)

                      $D_p = \frac{1}{ns} D$   (margins of columns)

                      $D_n = \frac{1}{n} I_n$   (margins of rows)   ($I_n$ = identity matrix)

We recognize equation (4) in the formula:

$$ D_p^{-1} F' D_n^{-1} F \, \Phi_\alpha = \lambda_\alpha \Phi_\alpha $$

 

We have defined the matrix B such that B = Z'Z.

B is symmetric. Its row and column margins are the diagonal elements of the matrix   sD.

In analyzing B, we have a new matrix of relative frequencies F:

$$ F = \frac{1}{n s^2} B $$

  The corresponding new matrices $D_p$ and $D_n$ are equal:

$$ D_p = D_n = \frac{1}{ns} D $$

  The matrix that is to be diagonalized in this case is written

$$ \frac{1}{s^2} D^{-1} B' D^{-1} B $$

  When we premultiply the two members of equation (4) by $\frac{1}{s} D^{-1} B$, we immediately obtain (B being symmetric, $B' = B$):

$$ \frac{1}{s^2} D^{-1} B' D^{-1} B \, \Phi_\alpha = \lambda_\alpha^2 \Phi_\alpha $$

 

The factors are therefore identical for both analyses.

 

Note that this equivalence (between  (2) and  (1) ) is valid even when there are more than two questions.
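
A numerical check of this equivalence on the data of figure 1 (s = 3) may help. This is only a sketch, assuming NumPy; the eigenvectors are obtained through the symmetric matrix $D^{-1/2} B D^{-1/2} / s$, a computational convenience chosen here rather than a step taken from the text.

```python
import numpy as np

R = np.array([
    [2, 3, 4], [2, 1, 3], [3, 1, 2], [4, 2, 4],
    [1, 2, 3], [2, 2, 3], [3, 1, 1], [1, 1, 1],
    [4, 1, 2], [2, 2, 3], [3, 2, 2], [4, 1, 4],
])
sizes = [4, 3, 4]
s = len(sizes)
Z = np.hstack([np.eye(p_q)[R[:, q] - 1] for q, p_q in enumerate(sizes)])
B = Z.T @ Z
d = np.diag(B)

# Matrix of analysis 1, equation (4): (1/s) D^-1 Z'Z.
M_Z = (B / d[:, None]) / s
# Matrix of analysis 2 (Burt table): (1/s^2) D^-1 B' D^-1 B.
M_B = (B.T / d[:, None]) @ (B / d[:, None]) / s**2

# Eigen-decomposition via the symmetric form D^-1/2 B D^-1/2 / s.
lam, U = np.linalg.eigh(B / np.sqrt(np.outer(d, d)) / s)
Phi = U / np.sqrt(d)[:, None]        # factors Phi = D^-1/2 U (columns)

same_factors_Z = np.allclose(M_Z @ Phi, Phi * lam)
same_factors_B = np.allclose(M_B @ Phi, Phi * lam**2)
print(same_factors_Z, same_factors_B)
```

The same columns of `Phi` satisfy both eigen-equations, with eigenvalues $\lambda_\alpha$ for Z and $\lambda_\alpha^2$ for B.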

 

            Equivalence between  (1) and   (3) :

 

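Let us now show that, for each pair of factors $(\psi_\alpha, \varphi_\alpha)$ relative to the same eigenvalue $\mu_\alpha$ extracted from the analysis of the contingency table $Z_1'Z_2$ ($\psi_\alpha$ having $p_1$ components and $\varphi_\alpha$ having $p_2$ components), there is a corresponding factor $\Phi_\alpha$ from analyzing Z (or B) such that:

$$ \Phi_\alpha = \begin{pmatrix} \psi_\alpha \\ \varphi_\alpha \end{pmatrix} $$

We have the notation $D_1 = Z_1'Z_1$, $D_2 = Z_2'Z_2$, and

$$ D = \begin{pmatrix} D_1 & 0 \\ 0 & D_2 \end{pmatrix} $$

The diagonal elements of $D_1$ and $D_2$ are the row and column margins of the matrix $Z_1'Z_2$.

Analysis of this table leads us to the double transition equations:

$$ \psi_\alpha = \frac{1}{\sqrt{\mu_\alpha}} D_1^{-1} Z_1'Z_2 \, \varphi_\alpha \tag{6} $$

$$ \varphi_\alpha = \frac{1}{\sqrt{\mu_\alpha}} D_2^{-1} Z_2'Z_1 \, \psi_\alpha \tag{7} $$

These equations can be written as a system of equations:

$$ D_1 \psi_\alpha + Z_1'Z_2 \, \varphi_\alpha = (1 + \sqrt{\mu_\alpha}) \, D_1 \psi_\alpha $$

$$ Z_2'Z_1 \, \psi_\alpha + D_2 \varphi_\alpha = (1 + \sqrt{\mu_\alpha}) \, D_2 \varphi_\alpha $$

which can be written:

$$ \begin{pmatrix} D_1 & Z_1'Z_2 \\ Z_2'Z_1 & D_2 \end{pmatrix} \begin{pmatrix} \psi_\alpha \\ \varphi_\alpha \end{pmatrix} = (1 + \sqrt{\mu_\alpha}) \begin{pmatrix} D_1 & 0 \\ 0 & D_2 \end{pmatrix} \begin{pmatrix} \psi_\alpha \\ \varphi_\alpha \end{pmatrix} $$

that is, $B \Phi_\alpha = (1 + \sqrt{\mu_\alpha}) D \Phi_\alpha$.

This equation is written in a more condensed form after premultiplying its two members by $D^{-1}$ and multiplying them by 1/2 (that is, 1/s, since s = 2):

$$ \frac{1}{2} D^{-1} B \, \Phi_\alpha = \frac{1 + \sqrt{\mu_\alpha}}{2} \, \Phi_\alpha \tag{8} $$

The reader will recognize equation (4) with

$$ \lambda_\alpha = \frac{1 + \sqrt{\mu_\alpha}}{2} $$

 If $\mu_\alpha$ is the αth largest eigenvalue extracted from the analysis of $Z_1'Z_2$, then the previous equation gives the αth largest eigenvalue of the analysis of Z.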
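
The link between the two sets of eigenvalues, namely that the eigenvalue of the analysis of Z equals one half of (1 plus the square root of the eigenvalue of the contingency table), can be verified numerically. The sketch below assumes NumPy and uses only the first two questions of figure 1; the square roots of the eigenvalues of the contingency-table analysis are taken as the singular values of $D_1^{-1/2} Z_1'Z_2 D_2^{-1/2}$, which is a standard computational device and an assumption of this example.

```python
import numpy as np

# First two questions of Figure 1 (p1 = 4, p2 = 3 categories).
R = np.array([
    [2, 3], [2, 1], [3, 1], [4, 2], [1, 2], [2, 2],
    [3, 1], [1, 1], [4, 1], [2, 2], [3, 2], [4, 1],
])
Z1 = np.eye(4)[R[:, 0] - 1]
Z2 = np.eye(3)[R[:, 1] - 1]

# Analysis 3: correspondence analysis of the contingency table Z1'Z2.
N = Z1.T @ Z2
d1, d2 = N.sum(axis=1), N.sum(axis=0)
sqrt_mu = np.linalg.svd(N / np.sqrt(np.outer(d1, d2)), compute_uv=False)

# Analysis 1: eigenvalues of (1/2) D^-1 B, via its symmetric form.
Z = np.hstack([Z1, Z2])
B = Z.T @ Z
d = np.diag(B)
lam = np.linalg.eigvalsh(B / np.sqrt(np.outer(d, d)) / 2)[::-1]

print("sqrt(mu):        ", np.round(sqrt_mu, 4))
print("(1 + sqrt(mu))/2:", np.round((1 + sqrt_mu) / 2, 4))
print("largest lambdas: ", np.round(lam[:3], 4))
```

The first value of each list is the trivial eigenvalue 1. The remaining eigenvalues of the analysis of Z are the values $(1 - \sqrt{\mu_\alpha})/2$, together with 1/2 repeated $p_1 - p_2$ times when the two questions have different numbers of categories.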

    

 Equivalence between  (3) and  (4) :

 

  The equivalence between non-centered canonical analysis and correspondence analysis is the consequence of the fact that in both cases, the matrix to be diagonalised is the same :

 

$$ S = (Z_1'Z_1)^{-1} Z_1'Z_2 \, (Z_2'Z_2)^{-1} Z_2'Z_1 $$

 

 Remember that Z'1Z1  and Z'2Z2  are diagonal, and that Z'1Z2  is nothing but the contingency table crossing the two questions.

 

Remark : In the analysis of the disjunctive matrix Z, the points that represent the various response categories of the two questions are elements of the same set, which is the set of Z's columns.  On the other hand, in analyzing the contingency table Z'1Z2 , they are split up into row-points and column-points.

 

The fact that the maps obtained in the space of the first factors are identical (although they are dilated, since the eigenvalues are not the same) shows that it is legitimate to simultaneously represent the row-points and the column-points in correspondence analysis.

 

 

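  Table 1. Equivalence between the analyses in the case of two questions

 Table                              Dimension                  Axes                                              Eigenvalues

 Contingency table $Z_1'Z_2$        $(p_1, p_2)$               $\psi$ in $R^{p_1}$, $\varphi$ in $R^{p_2}$       $\mu$

 Disjunctive table $Z = [Z_1, Z_2]$ $(n, p)$, $p = p_1 + p_2$  $\Phi = (\psi, \varphi)$ in $R^p$                 $\lambda = (1 + \sqrt{\mu})/2$

 Burt table $B = Z'Z$               $(p, p)$                   $\Phi = (\psi, \varphi)$ in $R^p$                 $\lambda^2$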

 

 

 4. The case of more than two questions

 

The generalization to the problem of more than two questions requires a reformulation of the two-way problem.

 Matrix $Z = [Z_1, Z_2, \ldots, Z_q, \ldots, Z_s]$ has p columns, to which there correspond p points of $R^n$. Let us consider this space $R^n$. Each submatrix $Z_q$ generates a linear subspace $V_q$ with $p_q$ dimensions.

These linear subspaces have in common at least the first bisector $1_n$ (the vector all of whose components are equal to 1), since the columns of each $Z_q$ add up to $1_n$. The rank of matrix Z is therefore at most equal to $p - (s-1)$.
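
The bound on the rank can be observed on the indicator matrix of figure 1, for which p = 11 and s = 3. The following is a small sketch assuming NumPy; `matrix_rank` computes a numerical rank, so the check is illustrative rather than a proof.

```python
import numpy as np

R = np.array([
    [2, 3, 4], [2, 1, 3], [3, 1, 2], [4, 2, 4],
    [1, 2, 3], [2, 2, 3], [3, 1, 1], [1, 1, 1],
    [4, 1, 2], [2, 2, 3], [3, 2, 2], [4, 1, 4],
])
sizes = [4, 3, 4]
blocks = [np.eye(p_q, dtype=int)[R[:, q] - 1] for q, p_q in enumerate(sizes)]
Z = np.hstack(blocks)
n, p = Z.shape
s = len(sizes)

# The columns of every block Z_q add up to the vector 1_n,
# so 1_n belongs to each subspace V_q.
shares_bisector = all((Zq.sum(axis=1) == 1).all() for Zq in blocks)

rank = np.linalg.matrix_rank(Z)
print("rank of Z:", rank, " upper bound p - (s - 1):", p - (s - 1))
```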

Let $\varphi_q$ be the vector whose $p_q$ components are the coordinates of a point $m_q$ of $V_q$ in the basis defined by the columns of $Z_q$.

The coordinates of $m_q$ in $R^n$ are the components of

$$ m_q = Z_q \varphi_q $$

The square of the distance of this point $m_q$ to the origin (its squared Euclidean norm) is:

$$ \varphi_q' Z_q' Z_q \varphi_q = \varphi_q' D_q \varphi_q $$

 

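The correspondence analysis of the contingency table that cross-tabulates the categories of two questions q and q' reduces to studying the relative positions of the subspaces $V_q$ and $V_{q'}$. This leads to a canonical analysis of the matrix $[Z_q, Z_{q'}]$.

The double transition equations (6) and (7) are written (we have omitted the index α in order to simplify the notation):

$$ \varphi_q = \frac{1}{\sqrt{\mu}} D_q^{-1} Z_q' Z_{q'} \, \varphi_{q'} , \qquad \varphi_{q'} = \frac{1}{\sqrt{\mu}} D_{q'}^{-1} Z_{q'}' Z_q \, \varphi_q $$

From these equations we can deduce the following:

$$ Z_q \varphi_q = \frac{1}{\sqrt{\mu}} Z_q D_q^{-1} Z_q' \, Z_{q'} \varphi_{q'} , \qquad Z_{q'} \varphi_{q'} = \frac{1}{\sqrt{\mu}} Z_{q'} D_{q'}^{-1} Z_{q'}' \, Z_q \varphi_q $$

namely,

$$ m_q = \frac{1}{\sqrt{\mu}} P_q \, m_{q'} \tag{9} $$

$$ m_{q'} = \frac{1}{\sqrt{\mu}} P_{q'} \, m_q \tag{10} $$

in which:

$$ P_q = Z_q (Z_q' Z_q)^{-1} Z_q' = Z_q D_q^{-1} Z_q' , \qquad P_{q'} = Z_{q'} (Z_{q'}' Z_{q'})^{-1} Z_{q'}' = Z_{q'} D_{q'}^{-1} Z_{q'}' $$

  The matrices $P_q$ and $P_{q'}$ are the projection operators on the subspaces $V_q$ and $V_{q'}$.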

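  When canonical analysis is presented as a problem of finding the smallest angles between two subspaces, it does not lend itself to generalization to the problem of more than two questions.

However, the canonical analysis of the matrix $[Z_q, Z_{q'}]$ can also be formulated in the following way:

Find two points $m_q$ and $m_{q'}$ such that the mean of their squared distances to the origin is constant:

$$ \frac{1}{2} \left( \varphi_q' D_q \varphi_q + \varphi_{q'}' D_{q'} \varphi_{q'} \right) = 1 \tag{11} $$

and such that the distance to the origin of the point m such that $m = m_q + m_{q'}$ is maximum.

The square of this distance is

$$ \|m\|^2 = (Z_q \varphi_q + Z_{q'} \varphi_{q'})' (Z_q \varphi_q + Z_{q'} \varphi_{q'}) $$

which reads

$$ \|m\|^2 = \varphi_q' D_q \varphi_q + \varphi_{q'}' D_{q'} \varphi_{q'} + 2 \, \varphi_q' Z_q' Z_{q'} \varphi_{q'} $$

To maximize $\|m\|^2$ with the constraint (11) or with the two constraints:

$$ \varphi_q' D_q \varphi_q = 1 , \qquad \varphi_{q'}' D_{q'} \varphi_{q'} = 1 $$

leads to the same result, since the two Lagrange multipliers relative to the last two constraints are equal.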

With the single constraint (11), the problem is readily generalized to more than two questions.

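Let us designate by $\varphi_1, \ldots, \varphi_q, \ldots, \varphi_s$ respectively the vectors of the components of s points $m_1, \ldots, m_q, \ldots, m_s$ in the bases $Z_1, \ldots, Z_q, \ldots, Z_s$, and let $m = m_1 + \ldots + m_q + \ldots + m_s$.

The quantity to be maximized is:

$$ \|m\|^2 = \Big\| \sum_{q=1}^{s} Z_q \varphi_q \Big\|^2 $$

under the constraint:

$$ \frac{1}{s} \sum_{q=1}^{s} \varphi_q' D_q \varphi_q = 1 $$

If $\Phi$ is the vector with p components defined by

$$ \Phi' = (\varphi_1', \ldots, \varphi_q', \ldots, \varphi_s') $$

the problem becomes maximizing:

$$ \Phi' Z'Z \, \Phi = \Phi' B \, \Phi $$

under the constraint:

$$ \frac{1}{s} \Phi' D \, \Phi = 1 $$

The factors $\Phi$ are therefore the eigenvectors of the matrix $D^{-1} B$ (equivalently, of the matrix $\frac{1}{s} D^{-1} B$ of equation (4)) relative to the largest eigenvalues. They are proportional to those extracted from the correspondence analysis of matrix Z and coincide, as we have seen, with the factors (axes) extracted from analyzing matrix B, itself considered a data table.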

 

 

  5. Properties of the representations

 

      The $p_q$ category-points relating to one question q are centred

 

The s subsets of pq points corresponding to the s questions (or blocks Zq) have the same center of gravity, which is the general center of gravity of the p categories.

 

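With the previous notations, the coordinates in the space $R^n$ of the $p_q$ points corresponding to question q are the columns of the block $Z_q D_q^{-1}$, the corresponding masses being the diagonal elements of $\frac{1}{n} D_q$.

The center of gravity $G_q$ of this subset has n coordinates, indexed by i, such that:

$$ G_q(i) = \sum_{j \in q} \frac{d_{jj}}{n} \cdot \frac{z_{ij}}{d_{jj}} = \frac{1}{n} \sum_{j \in q} z_{ij} = \frac{1}{n} $$

where the symbol $\sum$ denotes a sum over the $p_q$ categories of question q, $z_{ij}$ is the general term of Z, and $d_{jj}$ is the jth diagonal element of D.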

Gq is therefore independent of q.
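
This property can be checked on the three questions of figure 1: for every block, the weighted mean of the category profiles is the vector whose n components all equal 1/n. The sketch below assumes NumPy; the names `profiles` and `masses` are introduced for the illustration.

```python
import numpy as np

R = np.array([
    [2, 3, 4], [2, 1, 3], [3, 1, 2], [4, 2, 4],
    [1, 2, 3], [2, 2, 3], [3, 1, 1], [1, 1, 1],
    [4, 1, 2], [2, 2, 3], [3, 2, 2], [4, 1, 4],
])
sizes = [4, 3, 4]
n = R.shape[0]

centers = []
for q, p_q in enumerate(sizes):
    Zq = np.eye(p_q)[R[:, q] - 1]
    dq = Zq.sum(axis=0)              # diagonal of D_q (category frequencies)
    profiles = Zq / dq               # columns of Z_q D_q^-1
    masses = dq / n                  # diagonal of (1/n) D_q, sums to 1
    centers.append(profiles @ masses)

for q, G in enumerate(centers, start=1):
    print(f"G_{q} =", np.round(G, 4))
```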

 

 

  In MCA,  there are at the most  (p  -  s ) non-zero eigenvalues.

 

We have seen that all the subspaces $V_q$ generated by the blocks $Z_q$ have in common the vector $1_n$, all of whose components are equal to 1.

This vector is the "trivial" axis corresponding to the eigenvalue 1; it corresponds to the eigenvalue 0 when the analysis is performed with respect to the center of gravity.

Therefore, the maximum dimensionality of the configuration of points is:

 

                 ( p1 - 1 )  +  ( p2 - 1 )  +  ...  +   ( ps - 1 )  =   p  -  s 

 

Standard software takes this property into account and diagonalizes a (p - s, p - s) matrix instead of a (p, p) matrix.

 In particular, when all the questions have two categories (yes - no, for example), the order of the matrix to be diagonalized is halved (s instead of 2s).

 

It can also be shown in this case ($p_q = 2$ for all q) that the multiple correspondence analysis is equivalent to a principal component analysis performed on the (n, s) binary table containing one column (chosen among the two initial columns) for each question q.
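
One way to see this equivalence numerically is to compare eigenvalues: with two categories per question, the non-trivial eigenvalues of the MCA equal the eigenvalues of the correlation matrix of the (n, s) binary table divided by s. The sketch below assumes NumPy and uses simulated yes/no answers (hypothetical data, not taken from the survey discussed later); "principal component analysis" is understood here as the normalized analysis based on the correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, s = 200, 4
X = rng.integers(0, 2, size=(n, s))  # simulated (n, s) binary table

# Complete disjunctive coding: two columns ("yes", "no") per question.
Z = np.empty((n, 2 * s))
Z[:, 0::2] = X
Z[:, 1::2] = 1 - X

# MCA eigenvalues: those of (1/s) D^-1 B, via the symmetric form.
B = Z.T @ Z
d = np.diag(B)
lam = np.linalg.eigvalsh(B / np.sqrt(np.outer(d, d)) / s)[::-1]

# Normalized PCA eigenvalues: those of the correlation matrix of X.
nu = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]

print("trivial eigenvalue:", round(lam[0], 6))
print("MCA eigenvalues:   ", np.round(lam[1:s + 1], 6))
print("PCA eigenvalues/s: ", np.round(nu / s, 6))
print("remaining (zero):  ", np.round(lam[s + 1:], 6))
```

Of the 2s eigenvalues, one is the trivial eigenvalue 1, s match the principal component analysis, and the remaining s - 1 are zero, in agreement with the bound p - s.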

 

Test-values for supplementary categories

 

We have seen in the chapter relating to correspondence analysis how the transition formulas allow us to project supplementary elements onto the axes (calculated from the "active elements"). In the case of MCA, the interpretation is particularly easy, due to the binary coding of the data:

The coordinate of a supplementary category on axis α is the arithmetic mean (multiplied by $1/\sqrt{\lambda_\alpha}$) of the coordinates of the individuals concerned (individuals having chosen this category as a response).

This suggests the following statistical test:

Let a supplementary category j contain $n_j$ individuals or observations. The null hypothesis is that the $n_j$ observations are chosen at random (without replacement) among the n active observations.

The coordinate $\varphi_{\alpha j}$ of the category j on axis α is consequently a random variable, the product by the constant $1/\sqrt{\lambda_\alpha}$ of the arithmetic mean of $n_j$ values $\psi_{\alpha i}$ drawn at random from a finite set of n values.

We have:

$$ E(\varphi_{\alpha j}) = E(\psi_{\alpha i}) = 0 $$

and, since $\mathrm{var}(\psi_{\alpha i}) = \lambda_\alpha$,

$$ \mathrm{var}(\varphi_{\alpha j}) = \frac{1}{\lambda_\alpha} \cdot \frac{n - n_j}{n - 1} \cdot \frac{\lambda_\alpha}{n_j} = \frac{n - n_j}{n_j \, (n - 1)} $$

The quantities

$$ t_{\alpha j} = \varphi_{\alpha j} \left[ \frac{n_j \, (n - 1)}{n - n_j} \right]^{1/2} $$

are standardized variables: they allow us to compare the respective significances of several supplementary categories. These quantities are named test-values.

A test-value greater than 2 or less than -2 indicates that the corresponding category has a significant location on the axis.

The test-values are generally printed beside the coordinates and the contributions in MCA listings.
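
The test-value is simple to compute from a coordinate and two counts. The sketch below uses only the Python standard library; the function name `category_test_value` is hypothetical. It is applied to two supplementary categories of Table 3 (section 6.2), whose published coordinates are rounded to two decimals, so the results agree with the printed test-values only approximately.

```python
import math

def category_test_value(coord, n_j, n):
    """Test-value of a supplementary category.

    coord : coordinate of the category on the axis
    n_j   : number of individuals in the category
    n     : number of active individuals
    """
    return coord * math.sqrt(n_j * (n - 1) / (n - n_j))

n = 1043
# '+55/low': frequency 237, coordinate .38 on axis 1 (printed t = 6.7)
t_old_low = category_test_value(0.38, 237, n)
# '30-55/medium': frequency 159, coordinate -.35 on axis 1 (printed t = -4.8)
t_mid_medium = category_test_value(-0.35, 159, n)

print(round(t_old_low, 2), round(t_mid_medium, 2))
```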

 

 

6. Example of application


The data set is the British section of a multinational survey conducted in seven countries in the late 1980s  (Hayashi et al., 1992).  

In the set of examples “DtmVic_Examples” downloadable with DtmVic, this data set is used in the two examples in the directory: DtmVic_Examples_A_Start (examples:  EX_A05.Text-Responses_1  and EX_A06.Text-Responses_2).


These examples deal mainly with textual data (responses to open-ended questions), but here we use only the closed questions described by the two files TDA_dat.txt (data) and TDA_dic.txt (dictionary).

The example deals with the responses of n = 1043 individuals and comprises objective characteristics of the respondent or his/her household (age, status, gender, facilities). Other questions relate to attitudes or opinions.


6.1 Active questions, global analysis

In this example we focus on a set of s = 6  attitudinal questions, with a total of p= 24 response categories, that constitutes the set of active variables. The external information (supplementary categories) is provided by a single categorical variable (9 categories).


Table 2. Frequencies, coordinates, contributions and squared cosines (relative contributions) of active categories on axes 1 and 2

                                            Coordinates        Contributions
   Active categories       Frequencies    Axis 1   Axis 2     Axis 1   Axis 2

 1 . Change in the global standard of living last years
  - Std.liv/much better        223         -.96      .51        9.6      3.6
  - Std.liv/lit better         417         -.17     -.30         .6      2.3
  - Std.liv/the same           164          .34     -.42         .9      1.8
  - Std.liv/a lit worse        156          .78     -.02        4.5       .0
  - Std.liv/v.much worse        83         1.30     1.00        6.5      5.1

 2 . Change in your personal standard of living last years
  - SL.pers/much better        250         -.94      .47       10.4      3.4
  - SL.pers/lit better         317         -.07     -.42         .1      3.5
  - SL.pers/the same           283          .28     -.23        1.0       .9
  - SL.pers/a lit worse        123          .82      .14        3.9       .1
  - SL.pers/v.much worse        70         1.11      .93        4.0      3.7

 3 . Change in your personal standard of living next 5 years
  - SL.next/much better        123         -.57      .56        2.0      2.5
  - SL.next/lit.better         294         -.42      .35        2.5      2.3
  - SL.next/the same           460          .07     -.49         .1      6.8
  - SL.next/lit.worse          134         1.16      .34        9.1      1.0
  - SL.next/much worse          32         1.21     1.00        4.1      3.0

 4 . Will people be happier in years to come?
  - People/happier             188        -1.02      .76        9.2      6.7
  - People/less happy          535          .61      .20        9.4      1.3
  - People/the same            320         -.42     -.78        2.7     11.8

 5 . Will people peace of mind increase...
  - P.of.mind/increases        180         -.76      .72        4.9      5.8
  - P.of.mind/decreases        618          .40      .18        4.7      1.2
  - P.of.mind/no change        245         -.46     -.97        2.4     14.3

 6 . Will people have more or less freedom...
  - More freedom               443         -.40      .47        3.3      5.9
  - Less freedom               336          .70      .15        7.8       .5
  - Freedom/the same           264         -.23     -.97         .6     15.4

 (total for contributions)                                     100.0    100.0

 

 

Table 2 gives the wording of the 6 active questions and the 24 corresponding categories, together with the basic results of the MCA of the indicator matrix Z crossing the 1043 respondents (rows) with the 24 categories (columns): frequencies of the responses, coordinates on the first two principal axes and contributions (also called “absolute contributions”). The first two eigenvalues are respectively 0.342 and 0.260 and account for 12.1 % and 9.2 % of the total inertia. It is widely recognised that these percentages give a pessimistic idea of the quality of the description provided by the MCA. In fact, several corrected or adjusted formulas for these percentages have been suggested by Benzécri (1979) and Greenacre (1994).
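
The contributions of Table 2 can be recomputed approximately from the frequencies, the coordinates and the eigenvalues. The sketch below assumes the usual definition of the (absolute) contribution in correspondence analysis, namely mass times squared coordinate divided by the eigenvalue, with mass $n_j/(ns)$ for a category of frequency $n_j$; the chapter does not spell this formula out, and the function name `contribution` is hypothetical. Since the printed coordinates are rounded to two decimals, the recomputed values match the table only to about one decimal.

```python
def contribution(freq, coord, eigenvalue, n, s):
    """Contribution (in %) of a category to the inertia of an axis."""
    mass = freq / (n * s)            # relative weight of the category
    return 100.0 * mass * coord**2 / eigenvalue

n, s = 1043, 6                       # respondents and active questions
eig1, eig2 = 0.342, 0.260            # first two eigenvalues

# Category 'Std.liv/much better': frequency 223, coordinates -.96 and .51
ctr1 = contribution(223, -0.96, eig1, n, s)   # printed value: 9.6
ctr2 = contribution(223, 0.51, eig2, n, s)    # printed value: 3.6
print(round(ctr1, 1), round(ctr2, 1))
```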

Figure 3 is a rough sketch of the principal plane obtained from that MCA.



 

Fig.3 Plane of the axes 1 and 2 from the MCA of the  1043 x 24  indicator matrix.

 

The most positive and optimistic responses are grouped in the upper left side of figure 3 (answers “much better” to the three questions about “standard of living”, answers “people happier”, “peace of mind increases”, “more freedom”). The most negative and pessimistic responses occupy the upper right side of the display, in which the three “very much worse” items relating to the first three questions form a cluster markedly separated from the remaining categories. All neutral responses (“the same”, “no change”) are located in the lower part of the display, together with some moderate responses such as “little better”.

While the first (horizontal) axis can be considered as defining (from right to left) a particular scale of “optimism, satisfaction, happiness”, the second axis appears to oppose moderate responses (lower side) to both extremely positive and extremely negative responses. Such a pattern of responses suggests what is known in the literature as the “Guttman effect” or “horseshoe effect” (Van Rijckevorsel, 1987). However, we are not dealing here with a pure artefact of the MCA: we will see below that some particular categories of respondents are significantly inclined either to moderate responses or to systematically extreme responses.


Another possible pattern visible in figure 3 is the “battery effect”, often observed in survey analysis. When several questions have the same response items (which is the case for the first three questions), the respondent is tempted to choose identical answers. That could account for the tight grouping of the responses “very much worse” in the upper right part of the display.


6.2   Supplementary categories and test-values

 

The single supplementary variable, comprising 9 categories, is built by cross-tabulating two basic questions: age (3 categories: less than 30, between 30 and 55, over 55) and educational level (3 categories: low, medium, high).

Table 3 gives the identifiers of these 9 categories together with the frequencies of the responses, the corresponding coordinates and test-values.

 

Table 3. Frequencies, coordinates and test-values of supplementary categories on axes 1 and 2

                                     Coordinates         Test-values
   Supplementary categories   Freq.  Axis 1   Axis 2     Axis 1   Axis 2

  Age and educational level
  - -30/low                     18    -.08     -.01        -.3      -.1
  - 30-55/low                  226     .08      .21        1.4      3.6
  - +55/low                    237     .38     -.16        6.7     -2.7
  - -30/medium                 187    -.16      .09       -2.4      1.3
  - 30-55/medium               159    -.35      .08       -4.8      1.0
  - +55/medium                  72     .19     -.24        1.6     -2.1
  - -30/high                    61    -.35      .12       -2.8      1.0
  - 30-55/high                  61    -.18     -.24       -1.4     -2.0
  - +55/high                    22    -.18     -.60        -.8     -2.8

 

 

The test-values allow us to draw conclusions within the framework of classical statistical inference. The young educated respondents (age: less than 30, high level of education) are significantly optimistic in their answers to these attitudinal questions: their location on the first axis is at -2.8 standard deviations (t = -2.8) from the mean point (origin of the axes). Likewise, the younger respondents with a medium level of education are on the optimistic side of the axis (t = -2.4). But the category that best characterises this optimistic side of the first axis is the category (from 30 to 55; medium level of education) (t = -4.8).

On the opposite side, among pessimistic categories, the persons over 55 having a low level of education occupy a very significant location (t = 6.7).

On axis 2, the persons between 30 and 55 with a low level of education are located on the side of extreme responses (t = 3.6), opposed to all the categories of respondents over 55.

Owing to the limited incidence of the multiple comparisons effect (only 9 tests), we have chosen a moderately conservative threshold here (|t| > 2.3). The incidence of the multiple comparisons effect will be even more limited in the case of bootstrap validation, which provides simultaneous confidence intervals for all the categories and takes into account the structure of correlations between the categories.



End of Chapter 5: MCA