Example A.1: EX_A01.PrinCompAnalysis (PCA)

(Principal Components Analysis)


Example A.1 aims at describing a set of continuous variables through PCA. The principal axes visualization is complemented with a clustering, including an automatic description of the clusters. The importance of the dichotomy Active variables - Supplementary variables is stressed.

The data are an excerpt from a “Multimedia time budget sample survey” carried out by the CESP in 1992 [about the CESP, see: www.cesp.org]. They consist of the average responses of a (small) subset of 96 groups of respondents to 44 questions.

The 18 000 original respondents are grouped according to some combinations of five nominal (categorical) variables: gender (2 categories), age (3 categories), activity (2 categories), educational level (3 categories) and size of town (8 categories). Our “fictitious respondents” are in fact these 96 groups.

The 39 questions corresponding to numerical variables (from V6 to V44) concern the “Time spent on various activities, including sleep, meals, reading, working, etc.” (expressed in minutes per day, measured for the day preceding the interview).

The 5 questions corresponding to nominal (or categorical) variables (from V1 to V5) are: gender, age, activity, educational level, size of town.


1) Looking at the two files: dictionary and data.


To look at these files, use your text editor outside DtmVic (Notepad, Notepad++, Ultraedit, TotalEdit) or simply a text editor within DtmVic: button "Open an existing command file" of the main menu.
(See also the Introduction of Tutorial D for comments about the internal formats of DtmVic for dictionary and data)


1.1) Dictionary file:

To have a look at the internal DtmVic format for the dictionary, locate the example directory DtmVic_Examples_A_Start and, in that directory, open the directory of example A.1, named “EX_A01.PrinCompAnalysis”.

Then open the dictionary file “PCA_dic_Eng.txt”. Do not use a word processor (such as “Word”). (For a dictionary in French, open “PCA_dic_Fr.txt”.)

The dictionary file “PCA_dic_Eng.txt” contains the identifiers of the 44 variables. In this internal format, the identifiers of categories must begin at column 6 [a fixed-width font (also known as a teletype font) such as Courier makes this kind of layout easier to check]. Such a dictionary can be imported from a spreadsheet format (Excel ® for instance, see Tutorial D: “Importation”). The identifier of a categorical variable is preceded by the number N of its categories (columns 1 to 5); the N following lines identify the N response items, each with an optional “short identifier” in columns 1 to 5. A numerical variable has 0 categories.
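
The fixed-column layout described above can be read with a few lines of code. The following Python sketch is only an illustration and is not part of DtmVic. It assumes that a variable line carries the number of categories in columns 1 to 5 and the identifier from column 6 onward, and that each category line carries an optional short identifier in columns 1 to 5. The function name `parse_dictionary` and the sample lines are hypothetical.

```python
def parse_dictionary(lines):
    """Parse fixed-column dictionary lines into a list of variables.

    Assumed layout (a reading of the description above, not an official
    specification): columns 1-5 hold the number N of categories on a
    variable line, or an optional short identifier on a category line;
    the identifier itself starts at column 6.
    """
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    variables = []
    i = 0
    while i < len(rows):
        n_categories = int(rows[i][:5])        # columns 1 to 5
        name = rows[i][5:].strip()             # column 6 onward
        categories = []
        for cat_line in rows[i + 1 : i + 1 + n_categories]:
            short_id = cat_line[:5].strip()    # optional short identifier
            label = cat_line[5:].strip()
            categories.append((short_id, label))
        variables.append({"name": name, "categories": categories})
        i += 1 + n_categories                  # skip to the next variable
    return variables


# Hypothetical sample lines, not copied from PCA_dic_Eng.txt.
sample = [
    "    2 Gender_V1",
    "MALE Male",
    "FEMA Female",
    "    0 Sleep_V6",
]
parsed = parse_dictionary(sample)
```

A variable whose list of categories is empty is numerical; any other variable is categorical.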



1.2) Data file:

In a similar fashion, open the data file “PCA_dat.txt”.

The data file “PCA_dat.txt” comprises 96 rows and 45 columns (identifier of rows [between quotes] + 44 values [corresponding either to numerical variables or to item numbers of categorical variables] separated by at least one blank space).

Note that in this particular case, the identifier of each group happens to be a summary of the characteristics of the group: the first digit (<= 6) describes the cross-tabulation “gender – age”, the second digit (<= 2) the activity, the third digit (<= 3) the educational level, and the fourth and last digit the size of town (or category of agglomeration).
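
As an illustration of this layout, the Python sketch below splits one data row into its quoted identifier and its 44 values, then breaks the identifier into its four digits. It is a sketch under stated assumptions: the function names, the dictionary keys and the sample row are hypothetical, and the digits are returned as raw codes because the mapping from codes to categories is given by the dictionary, not by this example.

```python
import shlex


def parse_data_row(line):
    """Split one data line into (identifier, list of numeric values)."""
    fields = shlex.split(line)          # handles the quoted identifier
    identifier = fields[0]
    values = [float(v) for v in fields[1:]]
    return identifier, values


def decode_identifier(identifier):
    """Return the four single-digit codes packed into a group identifier."""
    gender_age, activity, education, town_size = (int(c) for c in identifier)
    return {
        "gender_age": gender_age,       # cross-tabulation gender x age (<= 6)
        "activity": activity,           # <= 2
        "education": education,         # <= 3
        "town_size": town_size,         # size of town
    }


# Hypothetical row with placeholder values, not copied from PCA_dat.txt.
row = "'3125'  " + " ".join(["1", "2", "1", "2", "5"] + ["0"] * 39)
identifier, values = parse_data_row(row)
codes = decode_identifier(identifier)
```

Using `shlex.split` rather than a plain `split` keeps the identifier intact whether it is written between single or double quotes, and tolerates any number of blanks between values.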



2) Generation of a command file (or: “parameter file”)

Open DtmVic.


2.1) Click the button: “Create” of the main menu: Basic Steps, line “Command File”.

A window “Choosing among some basic analyses” appears.


2.2) Click then the button: “PCA_Principal Components Analysis”

This button is located in the paragraph “Numerical data”.


2.3) Click the button: “Open a dictionary (Dtm format)”

To open the dictionary, locate the example directory DtmVic_Examples_A_Start and, in that directory, open the directory of example A.1, named “EX_A01.PrinCompAnalysis”. Then open the dictionary file “PCA_dic_Eng.txt” (or “PCA_dic_Fr.txt” for a French version of the dictionary).

The DtmVic dictionary file is displayed in a window. Another window indicates the status of each variable (numerical or categorical).


2.4) Click the button: “Open a data file (Dtm format)”

Open the data file “PCA_dat.txt”.

As shown before, the data file “PCA_dat.txt” comprises 96 rows and 45 columns (identifier of rows [between quotes] + 44 values [corresponding either to numerical variables or to item numbers of categorical variables] separated by at least one blank space).


2.5) Click the button: “Continue (select active and supplementary elements)”.

A new window is displayed, allowing for the selection of active variables.

We suggest selecting the following set of numerical variables as active variables [the reader is free to select another set of numerical variables].

Suggested set of active numerical variables

We suggest selecting the set ranging from variable V6 (duration of sleep) to variable V32 (time spent watching TV):


6 . Sleep_V6
7 . Rest_V7
8 . Wash_V8
9 . Meal_V9
10 . Breakfast_V10
11 . Meal_home_V11
12 . Meal_rest_V12
13 . Work_V13
14 . Work_H_V14
15 . Children_V15
16 . Housework_V16
17 . Contacts_V17
18 . Call_friends_V1
19 . Leisure_V19
20 . Game_V20
21 . Gardening_V21
22 . Ext_leisure_V22
23 . Records_V23
24 . Reading_V24
25 . Books_V25

26 . Errands_V26
27 . Ambling_V27
28 . Errand2_V28
29 . Moving_V29
30 . Mov_Walk_V30
31 . Mov_Car_V31
32 . TV_V32

Suggested set of supplementary variables (socio-demographic characteristics): We will characterize a posteriori the respondents by some socio-demographics:


1 . Gender_V1
2 . Age_V2
3 . Activity_V3
4 . Education_V4
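
To make the distinction between active and supplementary elements concrete, the Python sketch below computes a first principal axis from the active variables only, then positions the categories of a supplementary variable afterwards as the mean coordinate of the individuals they contain. It is a minimal sketch and not DtmVic's implementation: it assumes a normalized PCA (analysis of the correlation matrix), uses a simple power iteration instead of a full eigen-decomposition, and runs on made-up values. All function names are hypothetical.

```python
import math


def standardize(column):
    """Center a column and scale it to unit variance (divisor n)."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
    return [(x - mean) / sd for x in column]


def first_axis(active_columns, iterations=200):
    """Return (eigenvalue, axis, coordinates of individuals) for axis 1."""
    z = [standardize(col) for col in active_columns]
    p, n = len(z), len(z[0])
    # Correlation matrix of the active variables only.
    corr = [[sum(a * b for a, b in zip(z[j], z[k])) / n for k in range(p)]
            for j in range(p)]
    # Power iteration; the uneven start vector avoids starting exactly
    # orthogonal to the first axis in simple examples.
    u = [1.0 / (j + 1) for j in range(p)]
    for _ in range(iterations):
        v = [sum(corr[j][k] * u[k] for k in range(p)) for j in range(p)]
        norm = math.sqrt(sum(x * x for x in v))
        u = [x / norm for x in v]
    eigenvalue = sum(u[j] * corr[j][k] * u[k]
                     for j in range(p) for k in range(p))
    coords = [sum(z[j][i] * u[j] for j in range(p)) for i in range(n)]
    return eigenvalue, u, coords


def supplementary_centroids(coords, labels):
    """Place each supplementary category at the mean of its individuals."""
    groups = {}
    for coord, label in zip(coords, labels):
        groups.setdefault(label, []).append(coord)
    return {label: sum(v) / len(v) for label, v in groups.items()}


# Made-up values for four groups (minutes per day), not from PCA_dat.txt.
sleep = [420.0, 450.0, 480.0, 510.0]
tv = [100.0, 130.0, 120.0, 170.0]
gender = ["male", "female", "male", "female"]   # supplementary, unused for the axis

eigenvalue, axis, coords = first_axis([sleep, tv])
centroids = supplementary_centroids(coords, gender)
```

The supplementary variable never enters `first_axis`: changing the `gender` labels moves the category points but leaves the axis and the coordinates of the individuals untouched, which is the point of the dichotomy.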

2.6) Click the button: “Continue”

A new window devoted to the selection of active observations (rows) is displayed.

Click on the button: “All the observations will be active”.

The window “Create a starting parameter file” is displayed.


2.6.1 Click on: “1) Select some options”.

A new window entitled “Options Bootstrap and/or clustering of observations” is displayed. Click “yes” for the “Bootstrap validation”, and then click “Enter” to confirm the default number of replicates (25). Ignore the other suggested bootstrap options.

Select then the number of clusters (we suggest 7 clusters).

Click on: “Enter” and on: “Continue”.

Back to the previous window.
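
For readers unfamiliar with bootstrap replication, the Python sketch below shows the textbook resampling scheme: each replicate draws as many rows as the original table, with replacement. How DtmVic draws its own replicates is not detailed in this example, so the sketch is only an illustration; the function name and the `seed` parameter are hypothetical.

```python
import random


def bootstrap_replicates(n_rows, n_replicates=25, seed=0):
    """Return n_replicates lists of row indices drawn with replacement."""
    rng = random.Random(seed)           # fixed seed for reproducible draws
    return [[rng.randrange(n_rows) for _ in range(n_rows)]
            for _ in range(n_replicates)]


# 96 groups and the default of 25 replicates, as in this example.
replicates = bootstrap_replicates(96)
```

Each list of indices selects the rows of one replicated table; some rows appear several times in a replicate and others not at all.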


2.6.2 Click on: “2) Create a parameter file for PCA”.

A parameter file is displayed in the memo [It can be edited by the advanced users. It allows for performing again the same analysis later on, if needed].


Important: The parameter file is saved as “Param_PCA.txt” in the current directory.

If you wish, you can now exit from DtmVic and, later on, use the main-menu button “Open an existing command file” (line: “Command file”) to open the file “Param_PCA.txt” directly. This brings you back to this point of the process, after which you can use the “Execute” command of the main menu.


2.6.3 Click then on: “3) Execute”.

This step will run the basic computation steps present in the command file: archiving data and dictionary, selection of active elements, principal components analysis of the selected data, bootstrap replications of the table, brief description of the axes, clustering procedure, thorough description of clusters. After the execution has taken place, a small window summarizes the different steps of computation.


3) Basic numerical results

Click: “Basic numerical results” button

The button opens a created (and saved) html file named “imp.html”, which contains the main results of the previous basic computation steps. After perusing these numerical results, return to the main menu. Note that this file is also saved under another name, in which “imp” is followed by the date and time of the analysis (continental notation): “imp_08.07.09_14.45.html” means July 8th, 2009, at 2:45 p.m. That file serves as an archive of the main numerical results, whereas the file “imp.html” is replaced by each new analysis performed in the same directory.

This file is also saved in a simple text format, under the name “imp.txt”, and likewise under a name including the date and time of execution.
Return.
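
The archive naming convention described in this section can be decoded programmatically, for instance to sort archived results by date. The Python sketch below assumes the pattern shown in the example name (day, month, two-digit year, then hour and minute, all separated by dots) and assumes that the text archive follows the same pattern with a different extension. The function name is hypothetical.

```python
import os
from datetime import datetime


def parse_archive_name(filename):
    """Return the date and time encoded in an archive file name."""
    stem, _extension = os.path.splitext(filename)   # drop .html or .txt
    # Continental notation: day.month.year, then hour.minute.
    return datetime.strptime(stem, "imp_%d.%m.%y_%H.%M")


stamp = parse_archive_name("imp_08.07.09_14.45.html")
```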



4) Steps VIC (Visualization, Inference, Classification)

4.1) Click “AxeView” button

... and follow the sub-menus. In fact, only three tabs are relevant for this example: “Active variables”, “Individuals (observations)” and “supplementary categories”. After clicking on “View”, the set of principal coordinates along each axis is obtained.

Clicking on a column header produces a ranking of all the rows according to the values of that column. In this particular example, this is somewhat redundant with the printed results of the step “DEFAC”. Evidently, the use of the AxeView menu is justified when the data set is very large.

Note that for the tab “Individuals (observations)”, the procedure may help to detect possible outliers.
Return.


4.2) Click “PlaneView Research” button ... and follow the sub-menus.

In this example, six items of the menu are relevant: “Active columns (variables or categories)”, “Supplementary categories”, “Active rows (individuals, observations)”, “Active columns + Active rows”, “Active individuals (density)” and “Active columns + Supplementary categories”. The graphical display of selected pairs of axes is then produced.

In the “Active individuals (density)” display, the identifiers of individuals are replaced by a single character [useful for very large sets of individuals]. This display shows mainly the shape of the cloud of individuals, but the original identifiers can be produced by clicking the right button of the mouse. All the displays concern the planes spanned by the chosen pairs of axes.

In the case of PCA, the first menu item “Active columns (variables or categories)” contains in fact both active numerical variables (in black) and supplementary numerical variables (in red). The item “individuals (rows)” contains our “individuals”, which are, in this particular example, groups of respondents.

The roles of the different buttons are straightforward, except perhaps the button “Rank”, which is useful only for very intricate displays (which is far from being the case here!): this button converts the two coordinates of the current display into ranks. For instance, the n values of the abscissa are converted into n integers, from 1 to n, having the same order as the original values. The two distributions are thus uniform, and the identifiers overlap much less and become more legible (at the cost of a substantial distortion of the display).
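
The rank conversion can be written in a few lines. The Python sketch below is an illustration of the idea, not DtmVic's code; in particular, ties are broken here by order of appearance, which is an assumption, and the coordinates are made up.

```python
def to_ranks(values):
    """Replace n values by the integers 1..n in the same order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


# Made-up coordinates of four points on a plane.
abscissa = [0.12, -1.50, 0.13, 2.40]
ordinate = [3.00, 2.90, -0.40, 2.95]
ranked_points = list(zip(to_ranks(abscissa), to_ranks(ordinate)))
```

Two points that nearly coincide on an axis (0.12 and 0.13 above) end up one full unit apart after the conversion, which is why the labels overlap less.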


4.3) Click “BootstrapView” button.

This button opens the “DtmVic: Bootstrap - Validation - Stability – Inference” window.

4.3.1 Click on: “LoadData”. In this case (partial bootstrap), the two files of replicated coordinates that can be opened are named “ngus_var_boot.txt” and “ngus_sup_cat_boot.txt” (see the panel below the menu bar, which recalls the names of the relevant files). The file ngus_var_boot.txt contains only active variables. The file ngus_sup_cat_boot.txt contains only supplementary categories, for which the bootstrap procedure is all the more meaningful.

4.3.2 Click on: “Confidence Areas”, submenu, and choose the pair of axes to be displayed (select axes 1 and 2 [default option] to begin with) [enlarge the window if necessary].

4.3.3 In the window that then appears, displaying the dictionaries of variables, tick the white boxes of the elements whose location should be assessed, and press the button “Select”.

4.3.4 Click on: “Confidence Ellipses” to obtain the graphical display of the active variable points (if the file ngus_var_boot.txt has been loaded), or of the supplementary category points (if the file ngus_sup_cat_boot.txt has been loaded).

[Note that the ellipses are large because of the small number of individuals involved (recall that, in this example, the “individuals” are in fact groups of respondents). Using the bootstrap in this case leads to pessimistic confidence zones for the points. In a real application, the original file of individuals (comprising thousands of individuals) should be replicated before the individuals are grouped, which would lead to much smaller confidence ellipses.]

4.3.5 Close the display window (Return), and press “Convex hulls”. The ellipses are now replaced by the convex hulls of the replicates for each point. The convex hulls take into account the peripheral points, whereas the ellipses are drawn using the density of the clouds of replicates. The two pieces of information are complementary.
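
To see why a convex hull reflects only the peripheral replicates, consider the Python sketch below. It computes the hull of a small set of made-up replicate coordinates with Andrew's monotone chain algorithm, which is used here purely for illustration; the method used internally by DtmVic is not specified in this example.

```python
def cross(o, a, b):
    """Cross product of vectors o->a and o->b (positive for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return the hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:                       # build the lower chain
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):             # build the upper chain
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]      # endpoints are shared, drop duplicates


# Made-up replicate coordinates of one point on the plane (axis 1, axis 2).
replicate_points = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
hull = convex_hull(replicate_points)
```

The interior replicate (1, 1) has no influence on the hull, whereas an ellipse based on the density of the cloud would be affected by every replicate.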

Go back to the main menu.

4.4) Click “ClusterView”

4.4.1 Choose the axes (1 and 2 to begin with), and “Continue”.

4.4.2 Click on: “View”. The centroids of the 7 clusters appear on the first principal plane.

4.4.3 Activate the button “Categorical” and, pointing with the mouse at a specific cluster, press the right button of the mouse. A description of the cluster involving the most characteristic response items appears. This description is somewhat redundant with that of the step DECLA (see the files “imp.html” or “imp.txt”, using the button “Basic numerical results” of the main menu). But here we have in front of us the pattern of clusters and their relative locations. One can easily imagine the usefulness of the tool for a survey with thousands of individuals, hundreds of variables, and more than 20 clusters.

4.4.4 Activate the button “Numerical”. We can then observe the link between the numerical variables of the data file (both active and supplementary) and the 7 clusters. Owing to the small number of individuals, some clusters do not produce significant results.

In the context of this example, the other items of the main menu are not relevant.

General remark.

As you can see from the content of the example’s directory, several files have been created and saved [these files are briefly described in the memo “Help about files” in the toolbar of the main menu]. If you need to use the buttons of the paragraph VIC of the main menu again after having closed DtmVic, just click on the button “Open an existing command file” from the line “Command file”, select and open the saved command file “Param_PCA.txt”, and close it. It is not necessary to click on “Execute” again. You can then continue your investigation (axes views, graphs, maps, etc.).

Advanced users can also edit the parameter file “Param_PCA.txt” (using the memo “Help about parameters” in the toolbar of the text editor activated by the button “Open”) to perform a new analysis in which the parameters are given new values. All the intermediate files will be replaced (except the files “imp_date_time.txt” and “imp_date_time.html”, which are the only saved archives).



End of example A.1 (PCA)