Example A.6: EX_A06.Text-Responses_2. (VISURECA)
(Open questions in a sample survey: Direct analysis and link with closed-end questions)
Example A.6 aims at describing directly the responses to an open-ended question
in a sample survey, without prior agglomeration, in relation to a set of
categorical variables. The survey and the responses are the same as in Example A.5.
More explanation about this type of example and the corresponding
methodology can be found in the book: “Exploring Textual Data”
(L. Lebart, A. Salem, L. Berry; Kluwer Academic Publishers, 1998).
We can take advantage of the presence of closed-end questions to describe
the clusters, not only with characteristic words and responses, but also with categories, selected
after a step SELEC, and analysed through the step DECLA.
Another new step of the command file, POSIT, describes the location of these supplementary
categories in the plane spanned by the first principal axes.
1) Looking at the three files: data, dictionary and texts.
To have a look at the data, search for the directory:
DtmVic_Examples.
In this directory, open the sub-directory
DtmVic_Examples_A_Start.
In that directory, open the directory of Example A.6, named:
“EX_A06.Text-Responses_2”.
It is recommended to use one directory for each application, since
DtmVic produces a lot of intermediate txt-files related to the application.
At the outset, this directory must contain at least three files:
a) the data file,
b) the dictionary file,
c) the text file.
a) Data file: TDA_dat.txt
(same as that of Example A.5)
This file contains responses to questions which were included in the
multinational survey conducted in seven countries (Japan, France, Germany,
United Kingdom, USA, Netherlands, Italy) in the late nineteen eighties
(Hayashi et al., 1992).
It is the United Kingdom survey which is presented here.
It deals with the responses of 1043 individuals to 14 questions.
Some questions concern objective characteristics of the respondent
or his/her household (age, status, gender, facilities).
Other questions relate to attitudes or opinions.
The data file "TDA_dat.txt"
comprises 1043 rows and 15 columns (identifier of rows [between quotes]
+ 14 values [corresponding either to numerical variables or
to item numbers of categorical variables] separated by at least
one blank space).
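As a rough illustration of this layout, the following Python sketch parses rows made of a quoted identifier followed by blank-separated values. It is not DtmVic's own reader: the function name `parse_dtm_rows` and the sample rows are hypothetical, and the sketch only assumes the format described above.

```python
import shlex


def parse_dtm_rows(lines, n_values=14):
    """Parse rows of the form: 'identifier' v1 v2 ... vN (blank-separated)."""
    rows = []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip empty lines
        # shlex.split honours quotes, so a quoted identifier stays in one piece
        fields = shlex.split(line)
        identifier, values = fields[0], fields[1:]
        if len(values) != n_values:
            raise ValueError(
                f"line {line_no}: expected {n_values} values, got {len(values)}"
            )
        rows.append((identifier, [float(v) for v in values]))
    return rows


# Hypothetical sample rows (3 values instead of 14, to keep the example short)
sample = ["'0001'  1 2  35", "'0002'  2 1  61"]
rows = parse_dtm_rows(sample, n_values=3)
```

Applied to the lines of "TDA_dat.txt" with `n_values=14`, such a reader should return 1043 rows if the file matches the description above.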
b) Dictionary file: TDA_dic.txt
(same as that of Example A.5)
The dictionary file "TDA_dic.txt"
contains the identifiers of these 14 variables. In this version of DtmVic,
the identifiers of categories must begin at "column 6"
[a fixed interval font - also known as teletype font - such as
"courier" should be used to facilitate this kind of format].
c) Text file: TDA_TEX.txt
(same as that of Example A.5)
Let us recall its characteristics.
It contains the free responses of 1043 individuals to three open-ended questions.
Firstly, the following open-ended question was asked: "What is the
single most important thing in life for you?" It was followed by
the probe: "What other things are very important to you?".
A third question (not analysed here) was also asked:
“What does the culture of your own country mean to you?”
We refer to the previous example (example A.5) for comments about the data format.
2) Generation of a command file (or: “parameter file”)
2.1) Click the button: “Create a command file”
of the main menu: Basic Steps, line:
“Command File”.
A window:
“Choosing among some basic analyses”
appears.
2.2) Click then on the button:
“VISURECA analysis - (Visualization and Clustering of responses with
supplementary categorical data)”, located in the
section “Textual and numerical data”.
A window “Opening a text file”
is displayed.
2.3) Press the button: “Open the text file”,
then search for the directory:
“DtmVic_Examples_A_Start”.
In that directory, open the directory of Example A.6, named
“EX_A06.Text-Responses_2”.
Open then the text file:
“TDA_tex.txt”.
A message box indicates then that the corpus comprises 7329 lines,
1043 observations and 3 open questions.
2.4) Click on:
“Select Open questions and separators”
The next window allows for the selection of open questions and the selection
of separators of words (the default separators suffice in this example).
We will select questions 1 and 2 (which means that the two responses will be
merged). It is legitimate here to merge the two responses because question 2
is a probe for question 1.
2.5) Click directly on:
“Vocabulary and counts” .
The next window presents the vocabulary (alphabetic and frequency orders).
We must select a frequency threshold by selecting a line in the right-hand
memo (frequency order). Line number 397 [first column] corresponds to
the frequency 4 [second column]. (We took a threshold of 16 in the previous
Example A.5. Individual responses are lexically very poor, so more words must be
kept to avoid generating too many empty answers once the threshold is applied.)
We will keep the 397 most frequent words. After selecting that line, click on:
“Confirm”.
The frequency appears in a message box. Reply: "OK".
Then click on: “Continue” .
A window for the dictionary and data files
appears.
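The effect of the frequency threshold chosen in step 2.5 can be sketched in a few lines of Python. This is only an illustration of the principle (count words, keep those at or above a threshold, then count the responses left empty), not the procedure DtmVic runs internally. The function names and the toy responses are hypothetical, and words are split on blanks only.

```python
from collections import Counter


def select_vocabulary(responses, min_freq):
    """Return (word, count) pairs with count >= min_freq, most frequent first."""
    counts = Counter(word for text in responses for word in text.lower().split())
    kept = [(w, c) for w, c in counts.items() if c >= min_freq]
    # sort by decreasing frequency, then alphabetically for a stable order
    kept.sort(key=lambda wc: (-wc[1], wc[0]))
    return kept


def empty_after_filter(responses, vocabulary):
    """Count the responses that contain none of the retained words."""
    words = {w for w, _ in vocabulary}
    return sum(1 for text in responses if not words & set(text.lower().split()))


# Hypothetical toy responses, not taken from the survey
toy = ["family and health", "health", "my family", "money", "good health"]
vocab = select_vocabulary(toy, min_freq=2)
n_empty = empty_after_filter(toy, vocab)
```

Raising `min_freq` shortens the vocabulary and increases the number of empty responses, which is why a lower threshold is used here than in Example A.5.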
2.6) Click the button:
“Open a dictionary (Dtm format)”
Open then the dictionary file:
“TDA_dic.txt”.
The dictionary file:
TDA_dic.txt contains the identifiers
of the 14 variables.
The dictionary file is displayed in a window. Another window indicates the
status of each variable (numerical or categorical).
2.7) Press the button:
“Open a data file (Dtm format)”
Open the data file:
“TDA_dat.txt”
That data file comprises 1043 rows and 15 columns (identifier of rows
[between quotes] + 14 values [corresponding either to numerical variables
or to item numbers of categorical variables] separated by at least one
blank space).
A new window displays the data file.
2.8) Click the button: “Continue
(select active and supplementary variables)”.
A new window is displayed, allowing for the selection of active variables.
There is no active variable, since the responses to the 2 open questions are
active here. We actually chose the active variables by selecting the open-ended
questions 1 and 2.
All the remaining variables could be selected as supplementary elements.
They will serve to describe the categories of the active variable.
2.9) Click then on the button:
“Continue”
A new window devoted to the selection of active observations (rows)
is displayed. Click on the button:
“All the observations will be active”.
The window:
“Create a starting parameter file”
is displayed.
2.10) Then click directly on:
“Create a first parameter file”.
For this type of analysis, there is no bootstrap validation (yet).
The clustering is automatic, and the number of clusters is selected by default
depending on the number of responses (30 clusters in this case).
[This number of clusters can be changed by editing the command file
(or parameter file) before the execution. The parameters to be altered
belong to "STEP PARTI" and "STEP DECLA".]
A parameter file is displayed in the memo. [It can be edited by advanced
users. It allows the same analysis to be performed again later on, if needed.]
Important:
The parameter file is saved as
“Param_VISURECA.txt”
in the current directory. If you wish, you can now exit from DtmVic
and, later on, use the main menu button
“Open an existing command file” (Section:
“Command file”) to open the file
“Param_VISURECA.txt” directly
and, in so doing, return to this point of the process,
using the “Execute”
command of the main menu.
Let us recall that this set of commands comprises the following steps:
ARDAT (archiving data),
ARTEX (Archiving texts),
SELOX (selecting the open question),
NUMER (numerical coding of the text),
ASPAR (correspondence analysis of the [sparse]
contingency table “respondents - words”),
CLAIR (Brief description of factorial axes),
RECIP (Clustering using a hierarchical
classification of the clusters - reciprocal neighbours method),
PARTI (Cut of the dendrogram produced by
the previous step, and optimisation of the partition obtained),
MOTEX (crosstabulating the partition produced
by step PARTI with words: the obtained
contingency table is called a lexical table),
MOCAR (characteristic words, and
characteristics responses for each class of the partition),
SELEC (selecting active and supplementary elements),
DECLA (systematic description of the classes
of the partition produced by step PARTI using
the other relevant categorical variables),
POSIT (illustrating the principal spaces
of responses with supplementary categorical variables).
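The lexical table mentioned for step MOTEX can be pictured with a minimal Python sketch: each response is assigned to a cluster, and the retained words are counted per cluster. The function name `lexical_table`, the toy responses and the cluster labels are hypothetical; this shows the idea of the cross-tabulation only, not DtmVic's implementation.

```python
from collections import Counter, defaultdict


def lexical_table(responses, cluster_of, vocabulary):
    """Cross-tabulate clusters (rows) with retained words (columns)."""
    vocab = set(vocabulary)
    table = defaultdict(Counter)
    for text, cluster in zip(responses, cluster_of):
        for word in text.lower().split():
            if word in vocab:
                table[cluster][word] += 1
    # convert to plain dictionaries: {cluster: {word: count}}
    return {cluster: dict(counts) for cluster, counts in table.items()}


# Hypothetical toy data: four responses assigned to two clusters
toy_responses = ["family health", "health", "money job", "job"]
toy_clusters = [1, 1, 2, 2]
table = lexical_table(toy_responses, toy_clusters, ["family", "health", "job"])
```

Words that are unusually frequent in one row of such a table are the natural candidates for the characteristic words listed by step MOCAR.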
2.11) Click: “Execute”.
This step will run the basic computation steps present in the command file:
archiving data and text, characteristic words and responses, correspondence
analysis of the lexical table, thorough descriptions of clusters using both
words and categorical variables.
7) Click the button:
“Basic numerical results”
The button opens a created (and saved) html file named
“imp.html”
which contains the main results of the previous basic computation
steps. After perusing these numerical results, return to the main
menu. Note that this file is also saved under another name. The name
“imp.html”
is concatenated with the date and time of the analysis (continental
notation). That file keeps as an archive the main numerical results
whereas the file “imp.html”
is replaced for each new analysis performed in the same directory.
This file is also saved in a simple text format, under the name
“imp.txt” ,
and likewise with a name including the date and time of execution.
Perusing the complete list of words highlights some errors in the original
text file (inevitable in real-sized applications): for instance,
the symbol “]” was absent from the list of separators,
and creates some new “words”…
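The effect of a missing separator can be reproduced with a short Python sketch. The `tokenize` function and the two separator lists below are hypothetical (DtmVic's default separator list is not reproduced here); the sketch only shows how a character that is not declared as a separator stays attached to the neighbouring word.

```python
import re


def tokenize(text, separators):
    """Split text on any character listed in separators; drop empty tokens."""
    pattern = "[" + re.escape(separators) + "]+"
    return [tok for tok in re.split(pattern, text.lower()) if tok]


# Hypothetical separator lists: the first one lacks the symbol "]"
without_bracket = " .,;:!?()["
with_bracket = without_bracket + "]"

response = "my family] and health"
tokens_bad = tokenize(response, without_bracket)
tokens_good = tokenize(response, with_bracket)
```

With the first list, "family]" is counted as a word distinct from "family"; adding "]" to the separators merges the two forms again.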
8) At this stage, we click on one of the lower buttons of the basic steps
panel (Steps: “VIC”)
9) Click the button “AxeView”
... and follow the sub-menus. In fact, three tabs are relevant for this
example: “Active variables”
[= words in the case of the analysis "VISURECA"],
“Individuals (observations)”
[= respondents] and
“Supplementary Categories”.
After clicking on “View”
in each case, one obtains the set of principal coordinates along each axis.
Clicking on a column header produces a ranking of all the rows according to
the values of that column. In this particular example, this is somewhat
redundant with the printed results of the step
CLAIR.
Return.
10) Click the button “PlaneView Research”
and follow the sub-menus...
In this example, six items of the menu are relevant:
“Active columns (variables or categories)”
(principal coordinates of the active words),
“Supplementary categories”
(coordinates of the supplementary categories derived from the step
“POSIT”),
“Active rows (individuals, observations)”
(coordinates of the respondents),
“Active columns + Active rows”,
“Active individuals (density)”
and
“Active columns + Supplementary categories”.
The graphical displays of the chosen pairs of axes are then produced.
11) About the button:
“BootstrapView”...
In fact, the bootstrap is implicitly performed for the analyses VISURESP and VISURECA.
No parameter needs to be specified. The replicate file "ngus_dir_var_boot.txt" is created
using the so-called “specific bootstrap”.
Using the button “BootstrapView”, we have to load the file "ngus_dir_var_boot.txt"
and select the words whose confidence ellipses should be drawn.
The bootstrap replicates are in this case obtained after a drawing with replacement of
the respondents (or: rows, individuals, observations).
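The resampling principle described here (drawing respondents with replacement) can be sketched in Python as follows. This covers the resampling step only: how DtmVic builds the replicate file and derives the confidence ellipses is not shown, and the function name `bootstrap_word_counts`, the seed and the toy table are assumptions made for the illustration.

```python
import random


def bootstrap_word_counts(rows, n_replicates, seed=0):
    """Resample respondents with replacement and return, for each replicate,
    the total count of every word over the resampled respondents.

    rows: one word-count vector per respondent (all of the same length).
    """
    rng = random.Random(seed)
    n = len(rows)
    n_words = len(rows[0])
    replicates = []
    for _ in range(n_replicates):
        # draw n respondents with replacement
        sample = [rows[rng.randrange(n)] for _ in range(n)]
        totals = [sum(row[j] for row in sample) for j in range(n_words)]
        replicates.append(totals)
    return replicates


# Hypothetical toy table: 3 respondents x 2 words, one word per response
toy_rows = [[1, 0], [0, 1], [1, 0]]
replicates = bootstrap_word_counts(toy_rows, n_replicates=5, seed=1)
```

Each replicate gives a perturbed version of the word counts; the spread of the replicated positions of a word is what a confidence ellipse summarises.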
12) Click on “ClusterView”
12.1 Choose the axes (1 and 2 to begin with), and “Continue”.
12.2 Click on “View”.
The centroids of the 30 clusters (produced by the Step
PARTI) appear on the first principal plane.
12.3 Activate the button:
“Words”,
and, pointing with the mouse at a specific cluster, press the right
button of the mouse. A description of the cluster involving the most
characteristic words of the cluster appears. This description is
somewhat redundant with that of the Step MOCAR.
But this display exhibits the pattern of clusters and their
relative locations.
12.4 Activate the button “Texts”.
Pointing with the mouse at a specific cluster, and pressing the right
button of the mouse, we can read the most characteristic responses of
the selected cluster.
12.5 Activate the button:
“Categorical”.
Pointing with the mouse at a specific cluster, and pressing the
right button of the mouse, we can read the most characteristic
categories of the selected cluster. This description is somewhat
redundant with that provided in the results file (file “imp.txt”)
by the step DECLA. But we do have simultaneously in front of us the
pattern of clusters and their relative locations.
13) Click on “Kohonen map”
Select the type of coordinate.
13.1 Select:
“Variables (columns)” :
these active variables are the words in this example.
13.2 Select a (5 x 5) map, and continue.
13.3 After clicking on two small
check-boxes, press “Draw”
on the menu of the large green window entitled “Kohonen map”.
13.4 You can change the font size
(“Font”)
and dilate the obtained Kohonen map
(“Dilat.”)
to make it more legible. The words appearing in the same cell are
often associated in the same responses. This property holds, to a
lesser degree, for contiguous cells.
13.5 Pressing “AxeView”
and selecting one axis allows one to enrich the display with pieces
of information about a specified principal axis: large positive
coordinates in red, large negative coordinates in green, with
some transitional hues.
13.6 Go back to the main menu, click on
“Kohonen map” and choose the item
“Observations”
13.7 Select a (10 x 10) map, and redo the operations 13.3 to 13.5 for the
observations.
End of example A.6 (VISURECA)