Example A.1: EX_A01.PrinCompAnalysis (PCA)
(Principal Components Analysis)
Example A.1 describes a set of continuous variables through PCA.
The visualization on the principal axes is complemented by a clustering, including an automatic
description of the clusters. The importance of the distinction between
active variables and supplementary variables is stressed.
The data are an excerpt from a “Multimedia time budget sample
survey” carried out by the CESP in 1992 [about the CESP, see:
www.cesp.org].
They deal with the average responses of a (small) subset of 96 groups
of respondents to 44 questions.
The 18 000 original respondents are grouped according to combinations of
five nominal (categorical) variables: gender (2 categories), age (3
categories), activity (2 categories), educational level (3 categories)
and size of town (8 categories). Our “fictitious respondents”
are in fact these 96 groups.
The 39 questions corresponding to numerical variables (from V6 to V44)
concern the “Time spent on various activities, including sleep,
meals, reading, working, etc.” (expressed in minutes
per day, measured for the day preceding the interview).
The 5 questions corresponding to nominal (or categorical) variables (from
V1 to V5) are: gender, age, activity, educational level, size of town.
1) Looking at the two files: dictionary and data.
To look at these files, use a text editor outside DtmVic (Notepad, Notepad++, UltraEdit,
TotalEdit) or simply the text editor within DtmVic: button “Open an existing command file” of the main menu.
(See also the Introduction of Tutorial D for comments about the internal formats of DtmVic for dictionary and data.)
1.1) Dictionary file:
To have a look at the internal DtmVic format for the dictionary, locate
the example directory DtmVic_Examples_A_Start,
and in that directory, open the directory of example A.1, named
“EX_A01.PrinCompAnalysis”.
Then open the dictionary file:
“PCA_dic_Eng.txt”.
Do not use a word processor (such as “Word”). (For a
dictionary in French, open “PCA_dic_Fr.txt”.)
The dictionary file “PCA_dic_Eng.txt”
contains the identifiers of the 44 variables. In this internal format of the dictionary,
the identifiers of categories must begin at column 6 [a fixed-width
font (also known as a teletype font) such as Courier
makes this kind of format easier to check]. Such a dictionary can
be imported from a spreadsheet format (Excel® for instance, see
Tutorial D: “Importation”). The identifier of a
categorical variable is preceded by the number N of its categories
(columns 1 to 5); the N following lines identify the N response
items. On these lines, an optional “short identifier” occupies columns 1
to 5. A numerical variable has 0 categories.
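The fixed-column layout described above can be illustrated with a small reader. This is only a sketch under stated assumptions: the function name `parse_dictionary` is hypothetical, the sample lines are invented for illustration (they are not copied from “PCA_dic_Eng.txt”), and the layout is the one inferred from the description (count in columns 1 to 5 on a variable line, short identifier in columns 1 to 5 on a category line, identifier from column 6 onward).

```python
def parse_dictionary(lines):
    """Read a DtmVic-style dictionary (layout assumed from the text above)."""
    variables = []
    i = 0
    while i < len(lines):
        line = lines[i].rstrip("\n")
        if not line.strip():              # skip blank lines
            i += 1
            continue
        n_categories = int(line[:5])      # columns 1 to 5: number of categories
        name = line[5:].strip()           # identifier, from column 6 onward
        categories = []
        for cat_line in lines[i + 1 : i + 1 + n_categories]:
            short_id = cat_line[:5].strip()   # optional short identifier
            label = cat_line[5:].strip()      # category identifier
            categories.append((short_id, label))
        variables.append({"name": name, "categories": categories})
        i += 1 + n_categories
    return variables


# Invented excerpt: one categorical variable (2 categories), one numerical.
sample = [
    "    2 Gender_V1",
    "male  Male",
    "fem   Female",
    "    0 Sleep_V6",
]

for variable in parse_dictionary(sample):
    kind = "categorical" if variable["categories"] else "numerical"
    print(variable["name"], kind, variable["categories"])
```

A variable with 0 categories is reported as numerical, which mirrors the status window that DtmVic displays after the dictionary is opened.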
1.2) Data file:
In a similar fashion, open the data file
“PCA_dat.txt”.
The data file “PCA_dat.txt”
comprises 96 rows and 45 columns (identifier of rows [between quotes]
+ 44 values [corresponding either to numerical variables or to item
numbers of categorical variables] separated by at least one blank
space).
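As an illustration of this row layout, the sketch below splits one line into its quoted identifier and its 44 values. It assumes that the standard-library `shlex` module handles the quoting used in the file; the function name `parse_row` and the sample row (values 1 to 44) are invented for the example.

```python
import shlex


def parse_row(line, n_values=44):
    """Split one data row into (identifier, list of numeric values)."""
    tokens = shlex.split(line)            # honours the quotes around the identifier
    identifier = tokens[0]
    values = [float(token) for token in tokens[1:]]
    if len(values) != n_values:
        raise ValueError(f"expected {n_values} values, found {len(values)}")
    return identifier, values


# Invented row: a quoted identifier followed by 44 blank-separated values.
row = '"1111"  ' + "  ".join(str(v) for v in range(1, 45))
identifier, values = parse_row(row)
print(identifier, len(values))
```

Checking the number of values per row is a cheap way to detect a truncated or misaligned line before running an analysis.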
Note that in this particular case, the identifier of each group
happens to be a summary of the characteristics of the group: the first digit
(<= 6) describes the cross-tabulation “gender – age”,
the second digit (<= 2) the activity, the third digit (<= 3)
the educational level, and the fourth and last digit the size of town
(or category of agglomeration).
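The coding of the identifier can be sketched as follows. The sketch assumes that each identifier is exactly four digits and that categories are numbered from 1; the upper bound of 8 for the last digit comes from the 8 categories of size of town mentioned earlier. The function name and the field names are hypothetical.

```python
# Assumed upper bound of each digit, in the order given in the text.
LIMITS = (("gender_age", 6), ("activity", 2), ("education", 3), ("town_size", 8))


def decode_group_id(identifier):
    """Split a four-digit group identifier into its four coded fields."""
    if len(identifier) != 4 or not identifier.isdigit():
        raise ValueError(f"not a four-digit identifier: {identifier!r}")
    decoded = {}
    for digit, (field, maximum) in zip(identifier, LIMITS):
        code = int(digit)
        if not 1 <= code <= maximum:
            raise ValueError(f"{field} code {code} outside 1..{maximum}")
        decoded[field] = code
    return decoded


print(decode_group_id("6238"))
```

How the six values of the first digit map onto the individual gender and age categories is not stated here, so the sketch leaves that digit undivided.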
2) Generation of a command file (or: “parameter file”)
Open DtmVic.
2.1) Click the button: “Create”
of the main menu: Basic Steps, line “Command File”.
A window “Choosing among some basic analyses” appears.
2.2) Click then the button: “PCA_Principal Components Analysis”
This button is located in the paragraph “Numerical data”.
2.3) Click the button: “Open a dictionary (Dtm format)”
To open the dictionary, locate the example directory
DtmVic_Examples_A_Start, and in that directory,
open the directory of example A.1, named
“EX_A01.PrinCompAnalysis”.
Then open the dictionary file: “PCA_dic_Eng.txt”
(or: “PCA_dic_Fr.txt”
for a French version of the dictionary).
The DtmVic dictionary file is displayed in a window. Another window indicates
the status of each variable (numerical or categorical).
2.4) Click the button: “Open a data file (Dtm format)”
Open the data file “PCA_dat.txt”.
As shown before, the data file
“PCA_dat.txt” comprises 96 rows and 45 columns
(identifier of rows [between quotes] + 44 values [corresponding
either to numerical variables or to item numbers of categorical
variables] separated by at least one blank space).
2.5) Click the button:
“Continue (select active and supplementary elements)”.
A new window is displayed, allowing for the selection of active variables.
We suggest selecting the following set of numerical
variables as active variables [the reader is free to select another
set of numerical variables].
Suggested set of active numerical variables
The set ranges from variable V6 (duration of
sleep) to variable V32 (time spent watching TV):
6 . Sleep_V6
7 . Rest_V7
8 . Wash_V8
9 . Meal_V9
10 . Breakfast_V10
11 . Meal_home_V11
12 . Meal_rest_V12
13 . Work_V13
14 . Work_H_V14
15 . Children_V15
16 . Housework_V16
17 . Contacts_V17
18 . Call_friends_V1
19 . Leisure_V19
20 . Game_V20
21 . Gardening_V21
22 . Ext_leisure_V22
23 . Records_V23
24 . Reading_V24
25 . Books_V25
26 . Errands_V26
27 . Ambling_V27
28 . Errand2_V28
29 . Moving_V29
30 . Mov_Walk_V30
31 . Mov_Car_V31
32 . TV_V32
Suggested set of supplementary variables (socio-demographic characteristics):
The respondents will be characterized a posteriori by some
socio-demographic variables:
1 . Gender_V1
2 . Age_V2
3 . Activity_V3
4 . Education_V4
2.6) Click the button: “Continue”
A new window devoted to the selection of
active observations (rows) is displayed.
Click on the button:
“All the observations will be active”.
The window
“Create a starting parameter file” is displayed.
2.6.1 Click on: “1) Select some options”.
A new window entitled “Options
Bootstrap and/or clustering of observations”
is displayed. Click “yes”
for the “Bootstrap validation”, and then click
“Enter”
to confirm the default number of replicates (25). Ignore the
other suggested bootstrap options.
Then select the number of clusters (we suggest 7 clusters).
Click on: “Enter”
and on: “Continue”.
This returns to the previous window.
2.6.2 Click on: “2) Create a parameter file for PCA”.
A parameter file is displayed in the memo [It can be
edited by advanced users. It makes it possible to run the same
analysis again later on, if needed].
Important:
The parameter file is saved as
“Param_PCA.txt”
in the current directory.
If you wish, you can now exit
from DtmVic and, later on, use the button of the main menu
“Open an existing command file” (line:
“Command file”)
to open the file “Param_PCA.txt” directly.
This brings you back to this point of the process; then use the
“Execute” command of the main menu.
2.6.3 Then click on: “3) Execute”.
This step
will run the basic computation steps present in the command file:
archiving data and dictionary, selection of active elements,
principal components analysis of the selected data, bootstrap
replications of the table, brief description of the axes, clustering
procedure, thorough description of clusters. After the execution has
taken place, a small window summarizes the different steps of
computation.
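To make the principal components step more concrete, here is a minimal sketch of the first principal axis of a small table. It is not DtmVic's implementation: it assumes a normalized PCA (variables standardized, analysis of the correlation matrix), it uses a plain power iteration instead of a full eigen-decomposition, and the toy table (5 rows, 3 variables) is invented.

```python
import math


def standardize(rows):
    """Center each column and divide it by its standard deviation."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) for j in range(p)]
    return [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]


def correlation_matrix(z):
    """Correlation matrix of already standardized columns."""
    n, p = len(z), len(z[0])
    return [[sum(r[a] * r[b] for r in z) / n for b in range(p)] for a in range(p)]


def first_axis(corr, iterations=500):
    """Largest eigenvalue and unit eigenvector, by power iteration."""
    p = len(corr)
    v = [1.0] + [0.0] * (p - 1)
    for _ in range(iterations):
        w = [sum(corr[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * corr[a][b] * v[b] for a in range(p) for b in range(p))
    return eigenvalue, v


# Invented toy table: minutes per day for three activities, five groups.
toy = [
    [480.0, 60.0, 120.0],
    [450.0, 75.0, 150.0],
    [510.0, 50.0, 90.0],
    [430.0, 80.0, 170.0],
    [495.0, 55.0, 110.0],
]
z = standardize(toy)
eigenvalue, axis = first_axis(correlation_matrix(z))
# Coordinates of the rows (individuals) on the first principal axis.
scores = [sum(r[j] * axis[j] for j in range(len(axis))) for r in z]
print(round(eigenvalue, 3), [round(s, 3) for s in scores])
```

The variance of the coordinates on the axis equals the eigenvalue, which is the quantity usually reported as the share of variance explained once divided by the number of active variables.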
3) Basic numerical results
Click the “Basic numerical results” button.
The button opens a created (and saved) html
file named “imp.html”
which contains the main results of the previous basic computation
steps. After perusing these numerical results, return to the main
menu. Note that this file is also saved under another name, in which
“imp”
is concatenated with the date and time of the analysis (continental
notation): “imp_08.07.09_14.45.html”
means July 8th, 2009, at 2:45 p.m. That file archives the main numerical
results, whereas the file “imp.html”
is replaced for each new analysis performed in the same directory.
This file is also saved in a simple text format, under the name
“imp.txt”,
and likewise with a name including the date and time of execution.
Return.
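The naming pattern of the archive can be reproduced with a date format string. This sketch only illustrates the day.month.year_hour.minute convention shown above; the function name `archive_name` is hypothetical and nothing here is taken from DtmVic itself.

```python
from datetime import datetime


def archive_name(when, extension="html"):
    """Build a name such as imp_08.07.09_14.45.html (day.month.year_hour.minute)."""
    return "imp_" + when.strftime("%d.%m.%y_%H.%M") + "." + extension


print(archive_name(datetime(2009, 7, 8, 14, 45)))          # imp_08.07.09_14.45.html
print(archive_name(datetime(2009, 7, 8, 14, 45), "txt"))   # imp_08.07.09_14.45.txt
```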
4) Steps VIC (Visualization, Inference, Classification)
4.1) Click the “AxeView” button
... and follow the sub-menus. In fact, only three tabs are relevant for
this example: “Active
variables”, “Individuals
(observations)” and “supplementary
categories”. After clicking on “View”,
the set of principal coordinates along each axis is obtained.
Clicking on a column header ranks all the rows
according to the values of that column. In this particular example, this is
somewhat redundant with the printed results of the step “DEFAC”.
The AxeView menu is mainly useful when the data set
is very large.
Note that for the tab “Individuals
(observations)”, the procedure may help to detect possible outliers.
Return.
4.2) Click the “PlaneView Research”
button ... and follow the sub-menus.
In this example, six items of the menu are relevant:
“Active columns (variables or categories)”,
“Supplementary categories”, “Active rows (individuals, observations)”,
“Active columns + Active rows”, “Active
individuals (density)” and “Active columns
+ Supplementary categories”.
The graphical display of selected pairs of axes is then produced.
In the “Active
individuals (density)” display, the identifiers of individuals are replaced
by a single character [useful for very large sets of individuals]. This display mainly shows the
shape of the cloud of individuals, but the original identifiers can
be obtained by clicking the right button of the mouse. All the
displays concern the planes spanned by the chosen pairs of axes.
In the case of PCA, the first menu item “Active
columns (variables or categories)”
contains in fact both active
numerical variables (in black) and supplementary numerical variables
(in red). The “individuals (rows)” items contain our
“individuals”, which are, in this particular example,
groups of respondents.
The roles of the different buttons are straightforward, except perhaps
the button “Rank”,
which is useful only in the case of very intricate displays (which
is far from being the case here!). This button converts the two
coordinates of the current display into ranks. For instance, the n
values of the abscissa are converted into n integers, from 1 to n,
having the same order as the original values. Thus the two
distributions are uniform, and the identifiers overlap much
less and become more legible (at the cost of a substantial
distortion of the display).
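The rank conversion described above can be sketched in a few lines. How DtmVic treats tied values is not stated; in this sketch ties keep their original order, and the function name `to_ranks` is hypothetical.

```python
def to_ranks(values):
    """Replace n values by the integers 1..n, preserving their order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


abscissas = [0.5, -1.2, 3.0, 0.1]
print(to_ranks(abscissas))   # [3, 1, 4, 2]
```

Applying the same conversion to the ordinates gives points spread evenly over an n by n grid, which explains both the improved legibility and the distortion of distances.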
4.3) Click the “BootstrapView” button.
This button opens the
“DtmVic: Bootstrap - Validation - Stability – Inference”
window.
4.3.1 Click on: “LoadData”.
In this case (partial bootstrap), the two files of replicated coordinates
that can be opened are named “ngus_var_boot.txt”
and “ngus_sup_cat_boot.txt”
(see the panel recalling the names of the relevant files below the
menu bar). The file ngus_var_boot.txt
contains only active variables. The file ngus_sup_cat_boot.txt
contains only supplementary categories, for which the bootstrap
procedure is all the more meaningful.
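The idea behind the replicated coordinates of a supplementary category can be sketched as follows. This is one simplified reading of a partial bootstrap, not DtmVic's procedure: the principal axes are kept fixed, the individuals are resampled with replacement, and the category point is recomputed each time as the centroid of the resampled individuals belonging to the category. The coordinates, the membership flags and the function name are invented.

```python
import random


def bootstrap_centroids(coords, members, n_replicates=25, seed=0):
    """Replicated centroids of one category on a fixed pair of axes."""
    rng = random.Random(seed)
    n = len(coords)
    replicates = []
    while len(replicates) < n_replicates:
        sample = [rng.randrange(n) for _ in range(n)]   # draw n rows with replacement
        picked = [coords[i] for i in sample if members[i]]
        if not picked:
            continue          # category absent from this resample: draw again
        cx = sum(p[0] for p in picked) / len(picked)
        cy = sum(p[1] for p in picked) / len(picked)
        replicates.append((cx, cy))
    return replicates


# Invented coordinates of six individuals on axes 1 and 2.
coords = [(-1.0, 0.2), (-0.8, -0.1), (0.1, 0.4), (0.9, -0.3), (1.2, 0.1), (0.7, 0.5)]
is_member = [True, True, False, False, False, True]
replicates = bootstrap_centroids(coords, is_member)
print(len(replicates), replicates[0])
```

The scatter of the 25 replicated points around the original category point is what the confidence areas of the next steps summarize.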
4.3.2 Click on the “Confidence Areas”
submenu, and choose the pair of axes to be displayed (select axes 1 and 2 [default
option] to begin with) [enlarge the window if necessary].
4.3.3 In the window that then appears,
displaying the dictionaries of variables, tick the white
boxes of the elements whose location should be
assessed, and press the button “Select”.
4.3.4 Click on: “Confidence
Ellipses” to obtain the
graphical display of the active variable points (if the file
ngus_var_boot.txt
has been loaded), or of the supplementary category points (if the
file ngus_sup_cat_boot.txt
has been loaded).
[Note that the ellipses are large
because of the small number of individuals involved
(recall that, in this example, “individuals” are in
fact groups of respondents). Using the bootstrap in this case leads to
pessimistic confidence zones for the points. In a real application,
the original individual file (comprising thousands of individuals)
should be replicated before carrying out the grouping of individuals,
leading then to much smaller confidence ellipses.]
4.3.5 Close the display window (Return), and press
“Convex hulls”. The ellipses are
now replaced by the convex hulls of the replicates for each point.
The convex hulls take into account the peripheral points, whereas the
ellipses are drawn using the density of the clouds of replicates. The
two pieces of information are complementary.
Go back to the main menu.
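To show why a convex hull depends only on the peripheral replicates, here is a sketch using Andrew's monotone chain algorithm. The algorithm is a standard choice made for this illustration; the source does not say how DtmVic computes its hulls, and the replicate coordinates below are invented.

```python
def convex_hull(points):
    """Convex hull of 2-D points, counter-clockwise (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # Positive when the turn o -> a -> b is counter-clockwise.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


# Invented replicates of one point: four peripheral points and two interior ones.
cloud = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1), (1, 0.5)]
print(convex_hull(cloud))   # [(0, 0), (2, 0), (2, 2), (0, 2)]
```

The two interior replicates do not affect the result, whereas a single outlying replicate would enlarge the hull; an ellipse based on the density of the cloud reacts in the opposite way.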
4.4) Click “ClusterView”
4.4.1 Choose the axes (1 and 2 to begin with), and click
“Continue”.
4.4.2 Click on: “View”.
The centroids of the 7 clusters appear on the first principal
plane.
4.4.3 Activate the button
“Categorical”
and, pointing with the mouse at a specific cluster, press the right
button of the mouse. A description of
the cluster involving the most characteristic response items appears.
This description is somewhat redundant with that of the step DECLA
(see files “imp.html”
or “imp.txt” using
the button “Basic numerical
results” of the main menu).
But we do have in front of us the pattern of clusters and their
relative locations. One can easily imagine the usefulness of the tool
for a survey with thousands of individuals, hundreds of variables,
and more than 20 clusters.
4.4.4 Activate the button “Numerical” .
We will observe the link between the numerical variables (both active
and supplementary variables) of the data file and the 7 clusters.
Owing to the small number of individuals, some clusters do not
produce significant results.
In the context of this example, the other items of the main menu
are not relevant.
General remark.
As you can observe when looking at the content of the example’s
directory, several files have been created and saved [these files are
briefly described in the memo “Help
about files” in the toolbar of
the main menu]. If you need to use the buttons of
the paragraph VIC of the main menu again after having closed DtmVic, just
click on the button “Open an existing command file”
from the line “Command file”,
select and open the saved command file
“Param_PCA.txt”,
and close it. It is not necessary to click on “Execute”
again. You can then continue your investigation (axes views, graphs, maps, etc.).
Advanced users can also edit the parameter file
“Param_PCA.txt”
(using the memo “Help
about parameters” in the
toolbar of the text editor activated by the button “Open”) to perform a new
analysis in which the parameters are given new values. All the intermediate files will be
replaced (except the files “imp_date_time.txt”
and “imp_date_time.html”, which are the only saved archives).
End of example A.1 (PCA)