Updated version of "Multivariate Descriptive Statistical Analysis", by L. Lebart, A. Morineau, K. Warwick, John Wiley, New York, 1984.
Descriptive Principal Components Analysis
1. Introduction
There are several different ways of looking at principal components analysis. The classical statistician considers principal components analysis to be the determination of the major axes of an ellipsoid derived from a multivariate normal distribution, the axes being estimated from a sample. It was in this context that Harold Hotelling (1933) originally described principal components analysis. Most classical texts on multivariate analysis present the topic in this fashion (see Anderson, 1958; Kendall and Stuart, 1968; Dempster, 1969).
Psychologists or, more precisely, psychometricians consider principal components analysis to be one specific type of factor analysis. Factor analysis, in its many forms, makes a variety of assumptions for dealing with communalities (shared variance), specific variance, and error variance. Detailed accounts of psychometric thinking about factor analysis can be found in Horst (1965), Lawley and Maxwell (1970), Mulaik (1972), Harman (1976), and Cattell (1978).
More recently, data analysts have taken an entirely different point of view about principal components analysis. They have used principal components analysis without making any assumptions about distributions or an underlying statistical model. Rather, they have used it as a technique for describing a set of data in which certain algebraic or geometric criteria are optimized. The availability of high speed digital computers has accelerated the growth of this approach, although it is, in fact, the original method outlined by Karl Pearson (1901). One way of looking at principal components analysis is implicit in Pearson's original formulation, but Rao's (1964) pioneering review article presents the topic in a way most closely allied to the philosophy described in this course.
2. Scope of the Method
The data analyst is frequently faced with the task of analyzing a rectangular matrix in which the columns represent variables (e.g., anthropometric measurements, psychological test scores, or economic indicators) and the rows represent measurements of specific objects or individuals on these variables.
For example, in biometry it is not unusual to collect a variety of measurements on a group of animals, or a sample of organs. Similarly, in economics, data may be collected on a variety of measures related to household expenditures or corporate activity. This data, when displayed in tabular form, is often very unwieldy and difficult to comprehend. The problem is that the information is not readily absorbed because of its sheer volume.
The methods to be described in this course are designed to summarize the information contained in tables like those described above and simultaneously to provide the analyst with a clear visual or geometric representation of the information.
Principal components analysis in the context of this course differs from correspondence analysis in one important respect. In principal components analysis the columns of the data matrix are generally a set of variables or measurements, whereas the rows are a relatively homogeneous sample of objects or individuals on which measurements have been made. Correspondence analysis, on the other hand, treats the rows and columns of the data matrix in a symmetric fashion (the leading case being a contingency table).
Principal components analysis can be considered a theoretically appropriate method for the analysis of data that derives from a multivariate normal distribution. Correspondence analysis, in contrast, is theoretically the more appropriate method for the analysis of data in the form of a contingency table. In succeeding chapters we discuss the analysis of data matrices that do not precisely meet these requirements.
3. Applying Principal Components Analysis
We now show how to adapt the general analysis presented previously to a situation where the problem is to describe rather than to simply summarize large data sets.
Let us consider matrix R whose columns are 300 responses or measurements taken on 2000 individuals. The order of the matrix is then: n = 2000, p = 300. Specifically, let us assume that the data consist of values of 300 types of annual expenditures made by 2000 individuals.
We would like to understand how the 300 expenditures are related to one another as well as whether similarities exist among the behaviour patterns of the individuals.
Figure 1. Location of the cloud of points in Rp
3.1. Analysis in Rp
In the variable space we attempt to fit the set of n points, first in a one-dimensional, then in a two-dimensional subspace so as to obtain, on a graphical display, the most faithful possible visual representation of the distances between the individuals with respect to their p expenditures.
The problem is no longer one of maximizing the sum of squares of the distances of the projected points from the origin. What now has to be maximized is the sum of squares of the distances between all pairs of projected individuals. In other words, the best-fitting line H1 is no longer required to pass through the origin, as was H0 in the previous chapter.
Figure 2. Projection onto H1.
Let hi and hj represent the values of the projections of two individual points i and j on H1. We have the following equation:

Σi Σj (hi - hj)² = 2n Σi hi² - 2 (Σi hi)²

with:

Σi hi = n h̄

The summation extends over all i and j, and h̄ designates the mean of the projections of the individuals, which is also the projection of the mean point G of the individuals in the full space, whose coordinates are given by:

r̄j = (1/n) Σi rij    (j = 1, ..., p)

Thus:

Σi Σj (hi - hj)² = 2n Σi (hi - h̄)²
If the origin is placed in G, the quantity to be maximized becomes again the sum of squares of the distances to the origin, which brings us to the problem discussed above (Chapter 0).
The required subspace is obtained by first transforming the data matrix R into a matrix X whose general term is

xij = rij - r̄j

and then performing the analysis on X.
The distance between two individuals k and k' is then written, in Rp:

d²(k, k') = Σj (rkj - rk'j)²
In this summation there might be some values of the index j for which the corresponding variables are of very different orders of magnitude: for example, expenditures on stamps versus expenditures on rent. It may be necessary in some cases, particularly when the units of measurement are different, to give each variable the same weight in defining the distances among individuals. Normalized principal components analysis is used in this situation. The measurement scales are standardized by using the following distance measure:

d²(k, k') = Σj (rkj - rk'j)² / sj²

In this formula, sj is the standard deviation of variable j:

sj² = (1/n) Σi (rij - r̄j)²
Finally, we note that the normalized analysis in Rp of the raw data matrix R is also the general analysis of X, whose general term is:

xij = (rij - r̄j) / (sj √n)
In this space we must therefore diagonalize the matrix C = X'X, whose generic term is written:

cjj' = Σi xij xij'

That is:

cjj' = (1/n) Σi (rij - r̄j)(rij' - r̄j') / (sj sj')

cjj' is the correlation coefficient between variables j and j'. The coordinates of the n individual points on the principal axis ua (the ath eigenvector of matrix C) are the n components of the vector Xua.

The abscissa of the individual point i on this axis a is explicitly written as:

(Xua)i = Σj uaj xij = Σj uaj (rij - r̄j) / (sj √n)
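These computations can be illustrated by a short numerical sketch. The example below is a minimal sketch assuming NumPy, with simulated data standing in for the expenditure table; it builds X from a raw table R, checks that C = X'X is the correlation matrix, and obtains the individuals' coordinates Xua. Actual programs may scale or orient the axes differently.

import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 6                                        # stand-ins for the n individuals and p expenditures
R = rng.normal(size=(n, p)) * [1, 10, 100, 2, 5, 50]  # variables with very different scales

r_bar = R.mean(axis=0)                                # coordinates of the mean point G
s = R.std(axis=0)                                     # standard deviations sj (divisor n)
X = (R - r_bar) / (s * np.sqrt(n))                    # general term xij = (rij - r_bar_j) / (sj sqrt(n))

C = X.T @ X                                           # generic term cjj' = sum_i xij xij'
assert np.allclose(C, np.corrcoef(R, rowvar=False))   # C is the correlation matrix

eigval, eigvec = np.linalg.eigh(C)                    # eigenvalues returned in ascending order
order = np.argsort(eigval)[::-1]
lam, U = eigval[order], eigvec[:, order]              # lambda_a and ua, a = 1, ..., p

coord_ind = X @ U                                     # abscissas of the n individuals on the principal axes
print(lam[:3], coord_ind[:3, :2])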
3.2. Analysis in Rn
The general analysis developed in chapter 1 shows that fitting a set of points in one space implies fitting the other set in the other space as well.
The purpose of transforming the initial matrix R was twofold: first, to obtain a fit that accounts in the best possible way for the distances between individual points; second, to give equal weights to each of the variables in defining the distances among the individuals.
We note that the equation transforming R into X:

xij = (rij - r̄j) / (sj √n)
does not treat the rows and columns of the initial matrix R in a symmetric fashion.
What is the meaning of the distance between two variables j and j' in Rn if the columns of the transformed matrix X are used as coordinates for the variables?
Let us calculate the Euclidean distance between two variables j and j':

d²(j, j') = Σi (xij - xij')²

that is:

d²(j, j') = Σi xij² + Σi xij'² - 2 Σi xij xij'
When we substitute for xij its value taken from the equation transforming R into X, taking into account the relationship:

sj² = (1/n) Σi (rij - r̄j)²

we see that:

Σi xij² = 1

(All the variable points are located on a sphere of radius 1, whose centre is at the origin of the axes.) We also see that:

Σi xij xij' = cjj'

We thus obtain the relationship between the distance of two variable points j and j' and these variables' correlation coefficient cjj':

d²(j, j') = 2 (1 - cjj')

and therefore:

0 ≤ d²(j, j') ≤ 4
The distances between the transformed variables have the following characteristics (see figure 3):
1. Two variables that are highly correlated are located either very near one another (cjj' close to 1.0) or as far away as possible from one another (cjj' close to -1.0), depending on whether they are positively or negatively correlated.
2. Two variables that are orthogonal (uncorrelated) (cjj'= 0) are at moderate distance from one another.
cjj' ≈ 1          cjj' ≈ 0          cjj' ≈ -1
d(j, j') ≈ 0      d(j, j') ≈ √2     d(j, j') ≈ 2
Figure 3. Correlations and distances between “variable-points”
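The relation d²(j, j') = 2 (1 - cjj') is easy to verify numerically. The following minimal sketch (assuming NumPy, with arbitrary simulated data) checks it for one pair of transformed variables and confirms that all variable points lie on the unit sphere.

import numpy as np

rng = np.random.default_rng(1)
n = 100
R = rng.normal(size=(n, 4))
X = (R - R.mean(axis=0)) / (R.std(axis=0) * np.sqrt(n))   # transformed variables

C = X.T @ X                                      # correlation matrix
d2 = np.sum((X[:, 0] - X[:, 1]) ** 2)            # squared distance between variable points 0 and 1
assert np.isclose(d2, 2 * (1 - C[0, 1]))         # d^2(j, j') = 2 (1 - cjj')
assert np.allclose(np.sum(X ** 2, axis=0), 1)    # every variable point is at distance 1 from the origin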
Note that the analysis in the space Rn is not performed relative to the centroid (mean point) of the variable points (whereas in the space Rp, the analysis is performed relative to the mean point of the individuals).
We have seen that it is unnecessary to diagonalize the matrix XX', of order (n, n), once the eigenvalues λa and the eigenvectors ua of matrix C = X'X are known. The reason is that the vector

va = Xua / √λa

is the unit eigenvector of XX' associated with the eigenvalue λa.
The abscissas of the variable points on axis a are the components of the vector

X'va = √λa ua

and are by-products of the computations that have already been performed in the space Rp.
Note. The cosine of the angle between two variable vectors in the space Rn is the correlation coefficient between the two variables. If the two variables are located at a distance of 1 from the origin (i.e., if they have unit variance), the cosine is also their scalar product.
Thus the p correlation coefficients between the coordinates of the individual points on axis a and the p variables are precisely the p components of the previous vector √λa ua.
The abscissa of a variable point on an axis is the correlation coefficient between this variable and the new variable (the linear combination of the initial variables) formed by the coordinates of the individuals on the corresponding axis.
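This property can also be checked numerically. The sketch below (NumPy assumed, simulated data) compares the components of √λa ua with the correlations between each initial variable and the individuals' coordinates on axis a.

import numpy as np

rng = np.random.default_rng(2)
n, p = 150, 6
R = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))     # correlated variables

X = (R - R.mean(axis=0)) / (R.std(axis=0) * np.sqrt(n))
lam, U = np.linalg.eigh(X.T @ X)
lam, U = lam[::-1], U[:, ::-1]                             # decreasing eigenvalues

a = 0
coord_ind_a = X @ U[:, a]                                  # coordinates of the individuals on axis a
coord_var_a = np.sqrt(lam[a]) * U[:, a]                    # coordinates of the variable points on axis a
corr = [np.corrcoef(R[:, j], coord_ind_a)[0, 1] for j in range(p)]
print(np.allclose(corr, coord_var_a))                      # True: the abscissas are correlations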
It is important to note that the interpretation of the equation transforming R into X differs greatly in the two spaces Rn and Rp.
Consider, for instance, the operation of centering the variables:

xij = rij - r̄j

1. In Rp the transformation is equivalent to translating the origin of the axes to the centre of gravity (or centroid) of the points.

2. In Rn the transformation is a projection parallel to the first bisector of the axes. (The general term of the (n, n) matrix P associated with this transformation is pii' = δii' - 1/n, the classical idempotent matrix associated with the operation of centering.)
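A small sketch (NumPy assumed) makes the second reading concrete: the centering operator P with general term δii' - 1/n is idempotent, and applying it to the columns of R is the same as subtracting the column means.

import numpy as np

rng = np.random.default_rng(3)
n, p = 8, 3
R = rng.normal(size=(n, p))

P = np.eye(n) - np.ones((n, n)) / n             # general term: delta_ii' - 1/n
assert np.allclose(P @ P, P)                    # P is idempotent: it is a projection
assert np.allclose(P @ R, R - R.mean(axis=0))   # centering the columns of R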
4. Supplementary Variables and Supplementary Individuals
It often happens, in practice, that additional information is available that might be added to matrix R. We might have some data on the n individuals that could be added to the p variables being analyzed but that is of a somewhat different nature than the p variables. For example, we might wish to add income level and age of the individuals to a data set consisting of food consumption variables.
On the other hand, we may have data on the p variables for a number of additional individuals who belong to a control group and who therefore cannot be included in the sample.
Matrix R thus may have additional rows and columns:
1. A matrix R⁺ with n rows and pₛ columns, appended to its columns.
2. A matrix R₊ with nₛ rows and p columns, appended to its rows.
Figure 4. Supplementary rows and columns
The matrix R⁺₊, in which both individuals and variables are supplementary, is unnecessary for the analysis; see Figure 4.
Matrices R⁺ and R₊ are transformed, respectively, into matrices X⁺ and X₊ in order to make the new columns and rows comparable to those of X.
(a) In Rn, we have pₛ supplementary variable points. In order to remain consistent in interpreting inter-variable distances in terms of correlations, the following transformation must be performed (normalized principal components analysis):

x⁺ij = (r⁺ij - r̄⁺j) / (s⁺j √n)
We simply compute the new means and the new standard deviations relative to the supplementary variables, in order to place these supplementary variables on the sphere of unit radius.
The abscissas of the pₛ supplementary variables on axis a are therefore the pₛ components of the vector X⁺'va.
(b) In Rp, the placement of the supplementary individuals in relation to the others consists of positioning them relative to the centroid of the active points (which has already been calculated) and then dividing the coordinates by the standard deviations of the variables (which have already been calculated for the n active individuals). Therefore the following transformation is performed:

x₊ij = (r₊ij - r̄j) / (sj √n)

where r̄j and sj are the mean and standard deviation computed on the n active individuals.
The projection operator on axis a of Rp is still the unit vector ua.
The variables or individuals involved directly in fitting the data are called active elements (active variables or active individuals). The supplementary elements (variables or individuals) are called either illustrative or supplementary elements.
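The positioning of supplementary elements can be sketched as follows (NumPy assumed, simulated data; names such as R_sup_ind and R_sup_var are purely illustrative). Supplementary individuals are centered and scaled with the means and standard deviations of the active individuals and projected on the axes ua; supplementary variables are standardized with their own means and standard deviations and projected on the unit vectors va = Xua / √λa.

import numpy as np

rng = np.random.default_rng(4)
n, p = 120, 5
R = rng.normal(size=(n, p))                                # active table

r_bar, s = R.mean(axis=0), R.std(axis=0)
X = (R - r_bar) / (s * np.sqrt(n))
lam, U = np.linalg.eigh(X.T @ X)
lam, U = lam[::-1], U[:, ::-1]
V = (X @ U) / np.sqrt(lam)                                 # unit eigenvectors va of XX'

# supplementary individuals (rows): the active means and standard deviations are reused
R_sup_ind = rng.normal(size=(4, p))
X_sup_ind = (R_sup_ind - r_bar) / (s * np.sqrt(n))
coord_sup_ind = X_sup_ind @ U                              # abscissas on the axes of Rp

# supplementary variables (columns): their own means and standard deviations are used
R_sup_var = rng.normal(size=(n, 2))
X_sup_var = (R_sup_var - R_sup_var.mean(axis=0)) / (R_sup_var.std(axis=0) * np.sqrt(n))
coord_sup_var = X_sup_var.T @ V                            # components of X+' va

print(coord_sup_ind[:, :2], coord_sup_var[:, :2])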
5. Nonparametric Analysis
The only difference between nonparametric PCA methods and what we have already described is that they require a preliminary transformation of the data. These techniques should be used when the data are heterogeneous. They provide extremely robust results, and also lend themselves to statistical interpretation.
5.1. Analysis of Ranks
Let us transform the original data matrix into a matrix of ranks: in this matrix, the value of observation i on variable j is replaced by qij, the rank of observation i among the n observations of variable j. Under these circumstances the distance between two variables j and j' is defined by the formula:

d²(j, j') = (6 / (n(n² - 1))) Σi (qij - qij')²

The reader will recognize that (1 - d²(j, j')) is Spearman's rank-correlation coefficient.
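A quick numerical check of this identity (SciPy assumed, simulated data, and using the distance defined above) is given below; with continuous data there are no ties, so the equality is exact.

import numpy as np
from scipy.stats import rankdata, spearmanr

rng = np.random.default_rng(5)
n = 50
R = rng.normal(size=(n, 2))

Q = np.apply_along_axis(rankdata, 0, R)                    # qij: ranks within each column
d2 = 6.0 * np.sum((Q[:, 0] - Q[:, 1]) ** 2) / (n * (n ** 2 - 1))
rho, _ = spearmanr(R[:, 0], R[:, 1])
print(1 - d2, rho)                                         # the two values coincide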
Rank-order analysis is best applied in the following situations:
(1) The basic data consists of ranks, in which case there is no choice.
(2) The variables' scales of measurement are so different from one another that the centering and standardization operations of principal components analysis remain inadequate. However, this procedure is not a remedy for asymmetric distributions. Finally, analyzing a set of ranks is easier to justify than analyzing an extremely heterogeneous set of measurements.
(3) The implicit assumptions applied to the measurements are weak, and consequently less arbitrary. The distribution of the distances is nonparametric; confidence levels, as in any nonparametric test, are dependent only upon the hypothesis that the observations are distributed continuously, which is more plausible in this context than the assumption of normality.
(4) Finally, this method is robust, in the sense that it is insensitive to the presence of outliers, often an appreciable advantage.
The results are interpreted in the same way as in a regular principal components analysis, since a regular principal components analysis is performed after the transformation into ranks. Note that standardizing the variables is unnecessary here, since all of the ranks have the same variance. The distance between two variables is interpreted in terms of correlation: two variables are close to one another if their ranks are similar across all the observations. Conversely, two variables are distant from one another if their rankings are almost totally opposite. Two observations are close when their ranks are similar for each of the variables. When variables and observations are mapped simultaneously, the relative location of a variable vis-à-vis all of the observations gives an idea of the configuration of the observations' ranks for this variable.
Finally, the nonparametric nature of the results allows us to perform some tests of validity on the eigenvalues. This is because the distribution of the eigenvalues of a matrix of ranks depends only on the parameters n and p, the numbers of rows and columns of the matrix. It is therefore possible to construct tables to determine the significance levels of the eigenvalues.
5.2. Robust Axes Analysis
The least squares criterion is best applied to a normal distribution. When the distribution is uniform, excessive weight is given to observations at both extremes. In this case, the analysis becomes more robust if we apply a transformation that "normalizes" the uniform distribution of the ranks.
Let us consider the kth observation among the n ranked observations.
Let F be the distribution function of the standard normal distribution. We substitute for the observation of rank k the value yk obtained from the inverse of this distribution function:

yk = F⁻¹(k / (n + 1))
Figure 5. Transformation of a uniform distribution into a normal distribution
This type of transformation can be found in Fisher and Yates (1949); see Figure 5.
For a large n, the transformation is equivalent to replacing the kth observation with the expected value of the kth observation in a ranked sample of n normal observations.
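A minimal sketch of this normal-scores transformation (SciPy assumed; the argument k/(n + 1) is the usual approximation to the expected normal order statistic) is:

import numpy as np
from scipy.stats import norm, rankdata

rng = np.random.default_rng(6)
n = 30
x = rng.exponential(size=n)                 # a markedly non-normal variable

k = rankdata(x)                             # ranks k = 1, ..., n
y = norm.ppf(k / (n + 1))                   # yk = F^{-1}(k / (n + 1)): approximately normal scores
print(np.round(np.sort(y)[:5], 3))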
6. Example of Outputs for Principal Components Analysis
Data set: example "EX_A01" in the set of examples DtmVic_Examples_A_Start attached to the software DtmVic.
The CESP (Centre d'Étude des Supports de Publicité, a joint industry committee for the audit of audience measurements) carried out a time-budget sample survey in France during the years 1991-1992. Sample size: 17,665.
Variables: measurements of media contacts (radio, television, press: magazines and daily newspapers) and the durations of 60 activities on the day before the interview.
Description of Table 1
Columns: 16 active continuous variables (minutes per day; identifiers in French).
Rows: 27 fictitious individuals that are in fact groupings of respondents according to age, gender, educational level, and size of town:
Identifiers of the 27 rows:
- The first character is the age (1 = young, 2 = medium, 3 = old).
- The second character (gender) is always 1 (active males in this sub-sample).
- The third character is the educational level (1 = primaire, 2 = secondaire, 3 = supérieur).
- The fourth character is the type of town (1 = rural areas; 2 = small towns; 3 = large towns; 4 = Paris and suburbs; 5, 6, 7 = other categories).
Table 1. Daily time-budget: p = 16 activities (columns) for n = 27 groups of active males (rows).
IDENT Somm Repo Reps Repr Trar Ména Visi Jard Lois Disq Lect Cour Prom A pi Voit Fréq
1111 463.8 23.8 107.3 4.8 300.0 21.3 51.0 82.3 10.0 1.2 .0 41.3 6.9 7.1 52.1 135.8
1115 515.6 58.5 102.7 10.4 208.8 41.9 30.0 32.9 2.1 4.6 .6 33.7 8.3 24.6 29.4 225.8
1121 463.3 34.2 84.8 17.1 298.3 18.1 37.8 55.8 18.4 5.9 2.6 30.7 5.9 8.8 56.7 135.8
1122 456.4 43.1 74.2 21.9 239.0 26.0 51.2 59.7 18.4 3.6 4.6 52.2 9.5 10.8 72.7 142.3
1123 478.0 44.2 76.7 15.2 212.3 22.3 42.0 43.7 18.4 2.3 6.4 48.3 14.7 15.5 72.8 167.7
1124 465.1 41.6 85.2 23.7 226.0 37.0 42.5 16.3 10.7 8.7 9.4 44.3 13.7 19.8 59.0 145.1
1136 458.4 47.4 94.7 15.1 314.3 25.3 39.1 42.4 16.9 .9 16.7 34.5 4.6 6.4 61.5 103.4
1133 457.2 30.7 82.0 26.2 269.8 52.1 37.6 35.6 25.6 6.0 8.0 42.8 10.4 12.0 81.4 107.6
1134 465.2 40.2 78.6 31.1 268.6 36.3 21.6 4.0 19.4 6.0 14.8 46.9 10.7 21.9 48.3 82.4
2111 449.0 42.1 86.2 7.9 312.5 15.1 16.1 112.9 15.4 .0 2.2 32.1 7.6 8.1 60.1 153.9
2112 450.2 63.1 86.7 9.8 249.6 40.4 55.6 83.3 3.0 2.2 .0 45.0 9.4 10.4 61.9 145.4
2117 455.2 47.4 95.6 9.0 250.8 30.4 13.5 57.3 7.9 2.9 7.0 52.2 15.1 15.7 49.1 194.8
2121 461.9 39.3 90.3 8.5 323.5 14.9 21.7 81.8 15.4 1.2 5.3 26.0 3.8 7.4 59.6 130.8
2122 453.7 44.7 97.5 18.7 269.0 23.1 39.6 93.5 3.1 3.4 12.1 42.0 12.1 10.6 62.4 129.1
2123 433.1 49.8 91.7 12.6 283.7 22.4 21.0 62.9 13.1 6.2 7.3 38.1 11.6 11.7 47.6 168.6
2124 438.3 32.8 102.3 11.1 338.3 28.0 6.5 64.8 13.8 1.4 19.8 34.9 7.4 14.1 53.2 130.5
2131 457.7 44.0 87.9 6.9 313.0 24.4 23.2 63.8 9.2 .6 11.8 30.0 7.3 7.5 69.7 108.3
2132 455.0 47.0 78.9 31.6 380.6 23.9 7.5 40.0 13.0 .0 10.3 23.3 1.4 9.4 59.4 100.0
2133 467.3 37.5 86.9 21.9 264.0 40.8 27.6 33.4 11.9 1.6 10.8 45.3 6.7 10.7 72.8 135.2
2134 433.5 35.6 76.1 17.1 355.0 34.1 13.4 31.7 12.6 3.2 13.2 37.5 8.5 22.3 57.5 96.5
3116 473.0 51.5 99.3 6.3 356.3 21.2 27.6 82.1 8.6 .0 1.5 35.7 13.4 7.1 40.6 107.7
3117 461.9 60.0 103.7 9.1 240.5 35.3 14.5 83.4 1.4 2.0 7.4 46.1 5.7 16.6 53.3 183.7
3121 453.4 45.6 86.2 7.8 358.7 12.9 18.5 54.4 4.2 .0 4.9 34.3 3.3 10.3 48.7 143.1
3122 485.1 53.5 86.0 .3 222.4 24.7 23.2 91.9 8.5 .0 3.7 52.9 7.1 9.9 75.3 166.3
3123 456.7 43.2 94.6 12.1 265.3 30.5 23.7 61.1 9.1 2.3 11.6 50.1 17.6 13.2 46.3 185.3
3136 444.2 53.6 90.7 7.2 302.4 31.7 16.4 97.6 4.7 2.4 4.3 38.8 13.6 11.4 61.8 127.2
3137 438.4 50.7 81.0 11.2 306.6 19.3 23.8 10.5 13.6 .0 18.4 67.6 8.3 18.6 63.1 143.3
Table 2. Summary (number of "individuals": 27)
IDEN - Label                                    Mean   Std. dev.   Minimum   Maximum
Active Variables
Somm — Sommeil (sleep) 458.91 16.47 433.10 515.60
Repo — Repos (rest) 44.63 8.90 23.80 63.10
Reps - Repas chez soi (meals at home) 89.18 8.90 74.20 107.30
Repr - Repas restaurant (restaurant) 13.87 7.82 .30 31.60
Trar - Travail rémunéré (work) 286.27 46.75 208.80 380.60
Ména - Ménage (housework) 27.90 9.29 12.90 52.10
Visi - Visite à amis (visiting friends) 27.64 13.26 6.50 55.60
Jard - Jardinage, Bricolage (gardening, DIY) 58.49 27.39 4.00 112.90
Lois - Loisirs extérieur (leisure) 11.42 5.95 1.40 25.60
Disq - Disque cassette (records) 2.54 2.32 .00 8.70
Lect - Lecture livre (reading) 7.95 5.47 .00 19.80
Cour - Courses démarches (errands) 40.99 9.47 23.30 67.60
Prom - Promenade (stroll) 9.06 3.88 1.40 17.60
A pi - Déplacement à pied (walk) 12.66 5.01 6.40 24.60
Voit - Déplacement en Voiture (ride) 58.38 11.29 29.40 81.40
Fréq - Fréquentation Média (media contacts) 140.58 32.56 82.40 225.80
______________________________________________________________________
Supplementary numerical variables
Autr — Autres activités (other) 12.71 5.70 2.10 25.90
Domi — Total Domicile (total home) 928.73 49.92 826.00 1034.00
Tdep — Total Déplacement (total transport) 88.45 14.65 67.50 122.10
IDEN - Label                                    Mean   Std. dev.   Minimum   Maximum
Supplementary numerical variables (continuation)
Habitudes Cinema (cinema) .14 .14 .00 .60
Habitudes Radio (radio) 1.92 .23 1.49 2.64
Habitudes Télévision (television) 3.20 .37 2.13 3.90
Habitudes Presse Quotidienne (daily press) .18 .14 .03 .53
Habitudes Presse magazine (magazine) 3.56 .74 2.00 5.31
Habitudes Hebdomadaires News (news) .31 .18 .00 .67
Table 3. Correlation matrix and eigenvalues
Sommeil | 1.00 Correlation matrix
Repos | .21 1.00
Repas c.| .21 .10 1.00
Repas r.| -.08 -.30 -.53 1.00
Travail | -.52 -.28 -.02 -.01 1.00
Ménage | .20 .08 -.01 .39 -.46 1.00
Visites | .27 -.08 -.07 .10 -.47 .15 1.00
Jardin. | -.09 .19 .43 -.64 .08 -.37 -.02 1.00
Loisirs | -.17 -.61 -.55 .52 .10 -.01 .12 -.39 1.00
Disques | .07 -.17 -.15 .52 -.46 .50 .30 -.42 .25 1.00
Lecture | -.44 -.21 -.15 .38 .24 .08 -.36 -.51 .27 -.01 1.00
Courses | -.04 .18 -.17 -.03 -.56 .23 .24 -.24 -.01 .08 .18 1.00
Promen. | .00 .09 .04 -.02 -.45 .27 .18 -.01 -.05 .40 -.03 .48 1.00
A pied | .17 .15 -.14 .28 -.38 .49 -.18 -.62 -.09 .48 .27 .37 .30 1.00
Voiture | -.19 -.22 -.55 .21 -.15 .10 .27 .03 .44 -.09 .15 .23 -.11 -.33 1.00
Fréq.med| .40 .42 .37 -.44 -.62 .05 .01 .18 -.45 .07 -.38 .30 .28 .28 -.33 1.00
+——————+—————————+———————+———————+ Histogram of the 16 eigenvalues
|NUMBER| Eigen |PERCEN |PERCEN |
| | value | TAGES |(CUMU) |
+——————+—————————+———————+———————+—————————————————————————————————————————————————————————————
| 1 | 3.871 | 24.20 | 24.20| ******************************************************//*****
| 2 | 3.660 | 22.88 | 47.07| *************************************************//********
| 3 | 2.006 | 12.54 | 59.61| ******************************************
| 4 | 1.514 | 9.47 | 69.08| ********************************
| 5 | 1.126 | 7.04 | 76.12| ************************
| 6 | .837 | 5.23 | 81.35| ******************
| 7 | .766 | 4.79 | 86.15| ****************
| 8 | .596 | 3.73 | 89.87| *************
| 9 | .444 | 2.78 | 92.65| **********
| 10 | .374 | 2.34 | 94.99| ********
| 11 | .246 | 1.54 | 96.53| ******
| 12 | .222 | 1.39 | 97.92| *****
| 13 | .161 | 1.01 | 98.93| ****
| 14 | .114 | .72 | 99.64| ***
| 15 | .037 | .23 | 99.88| *
| 16 | .019 | .12 |100.00| *
Table 4. Coordinates of active variables (axes 1 to 3)
Variables Coordinates Unitary axes
1 2 3 1 2 3
Sommeil .22 -.52 .18 .11 -.27 .13
(sleep)
Repos .46 -.40 -.17 .23 -.21 -.12
(rest)
Repas chez soi .67 -.15 -.23 .34 -.08 -.17
(meals at home)
Repas restaurant -.84 .00 -.07 -.43 .00 -.05
(restaurant)
Travail rémunéré .05 .88 -.34 .03 .46 -.24
(work)
Ménage -.40 -.57 -.08 -.20 -.30 -.06
(house work)
Visite à amis -.13 -.33 .73 -.07 -.17 .52
(calling friends)
Jardinage, Bricolage .76 .22 .35 .39 .11 .25
(gardening, DIY)
Loisirs extérieur -.72 .30 .30 -.37 .16 .21
(leisure out)
Disque cassette -.53 -.53 .01 -.27 -.27 .01
(records, cassettes)
Lecture livre -.54 .24 -.50 -.27 .12 -.36
(reading books)
Courses démarches -.21 -.54 .11 -.11 -.28 .08
(Errands)
Promenade -.10 -.58 .04 -.05 -.30 .03
(walk, stroll, ride)
A pied -.37 -.62 -.57 -.19 -.33 -.40
(walk)
En Voiture -.41 .22 .65 -.21 .11 .46
(ride)
Fréquentation Média .49 -.68 -.05 .25 -.36 -.03
(contacts media)
Figure 6. The 16 active variables in the plane spanned by axes 1 and 2
Table 5. Coordinates of supplementary (or illustrative) variables on axes 1 to 3
VARIABLES Coordinates
1 2 3
Autres activités .08 .16 .04
Total Domicile .67 -.50 -.21
Total Déplacement -.72 .05 .14
Habitudes Cinema -.87 -.11 -.14
Habitudes Radio -.27 -.57 .07
Habitudes Télévision .04 -.55 .34
Habitudes Presse Quot -.39 .01 -.70
Habitudes Presse mag -.24 -.38 -.26
Habitudes Hebdo-News -.46 .20 -.48
Figure 7. Plot of the supplementary variables (same plane as Figure 6)
Table 6. Coordinates, contributions and squared cosines of individuals on axes 1 and 2
INDIVIDUAL COORDINATES CONTRIBUT. SQUARED COS.
IDENTIF. DISTO 1 2 1 2 1 2
1111 19.89 2.01 .85 3.8 .7 .20 .04
1115 47.51 2.26 -5.11 4.9 26.4 .11 .55
1121 10.55 -.71 1.01 .5 1.0 .05 .10
1122 13.29 -1.86 -.64 3.3 .4 .26 .03
1123 14.49 -1.28 -1.81 1.6 3.3 .11 .23
1124 19.06 -2.72 -2.93 7.1 8.7 .39 .45
1136 10.68 -.56 1.97 .3 3.9 .03 .36
1133 27.04 -4.21 -.30 17.0 .1 .66 .00
1134 25.35 -4.29 -.91 17.6 .8 .73 .03
2111 12.86 1.91 2.12 3.5 4.5 .28 .35
2112 17.27 1.43 -1.68 2.0 2.8 .12 .16
2117 10.89 1.03 -2.16 1.0 4.7 .10 .43
2121 10.96 1.27 2.55 1.5 6.6 .15 .59
2122 7.92 .62 -.21 .4 .0 .05 .01
2123 8.33 .30 -.33 .1 .1 .01 .01
2124 15.54 -.12 2.06 .0 4.3 .00 .27
2131 7.39 .55 2.03 .3 4.2 .04 .56
2132 24.45 -1.17 3.53 1.3 12.6 .06 .51
2133 7.85 -1.63 -.11 2.5 .0 .34 .00
2134 17.19 -2.54 1.36 6.2 1.9 .37 .11
3116 16.19 2.68 .96 6.9 .9 .45 .06
3117 15.96 2.43 -1.84 5.7 3.4 .37 .21
3121 13.00 1.90 2.11 3.4 4.5 .28 .34
3122 17.31 2.12 -.95 4.3 .9 .26 .05
3123 10.26 .56 -1.74 .3 3.1 .03 .30
3136 9.09 1.56 .09 2.3 .0 .27 .00
3137 21.68 -1.55 .08 2.3 .0 .11 .00
Table 7. Test-values and coordinates of supplementary categories on axes 1 and 2
Categories Test_values Coordinates
Wording Number Axis 1 Axis 2 Axis 1 Axis 2
. Age
A-35 - Jeunes 9 -2.3 -1.6 -1.26 -.87
A+35 - Age-Moy 11 .3 1.8 .15 .83
A+50 - Ages 7 2.1 -.3 1.39 -.18
. Niveau d'éducation (educational level)
prim - primaire 7 3.0 -1.5 1.96 -.98
seco - secondaire 11 .0 -.2 .01 -.08
supe - superieur 9 -2.8 1.6 -1.54 .86
. Agglomération (size of town; excerpts)
AGG1 - - de 20 000 6 1.6 2.5 1.15 1.78
AGG2 - de 20 a 100 000 5 .3 .0 .23 .01
AGG3 - Plus de 100 000 5 -1.5 -1.1 -1.25 -.86
AGG4 - Paris 4 -2.6 -.1 -2.42 -.11
Figure 8. Location of individuals (symbols with 4 digits) and categorical variables in the principal plane of the PCA (axes 1 and 2)
End of Chapter 2: Principal Components Analysis