Example A.5:
EX_A05.Text-Responses_1. ANALEX
(Textual Data Analysis: Open and Closed Questions)
Example A.5 aims to describe the responses to an open-ended
question in a sample survey in relation to the responses to a
specific closed-ended question.
After archiving dictionary, data and texts, the numerical coding of the
text allows us to build a lexical table cross-tabulating the words
with a selected categorical variable. A correspondence analysis is
then performed on that lexical table. Bootstrap confidence areas
(ellipses or convex hulls) can be drawn around words and categories.
Characteristic words and responses are computed for each category.
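As a rough illustration of what a lexical table is, the sketch below counts, for each word, how many times it occurs in the responses of each category. The function name, the simple tokenizer and the toy responses are assumptions made for the example; this is not DtmVic's own code.

```python
import re
from collections import Counter, defaultdict

def build_lexical_table(responses, categories):
    """Cross-tabulate words (rows) with categories (columns).

    responses  : list of response strings, one per respondent
    categories : list of category labels, same length as responses
    Returns (words, cats, table) where table[i][j] is the number of
    occurrences of words[i] in the responses of category cats[j].
    """
    counts = defaultdict(Counter)            # word -> Counter of categories
    for text, cat in zip(responses, categories):
        # Assumed tokenizer: lower-case runs of letters and apostrophes.
        for word in re.findall(r"[a-z']+", text.lower()):
            counts[word][cat] += 1
    words = sorted(counts)
    cats = sorted(set(categories))
    table = [[counts[w][c] for c in cats] for w in words]
    return words, cats, table

# Toy data, invented for the illustration.
responses = ["my family and my health", "good health", "family, job"]
categories = ["-30-low", "-30-medium", "-30-low"]
words, cats, table = build_lexical_table(responses, categories)
```

Each row of `table` is the profile of one word over the categories; it is a table of this kind that is submitted to correspondence analysis.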
About the data
The open questions were included in a multinational survey conducted in seven
countries (Japan, France, Germany, United Kingdom, USA, Netherlands,
Italy) in the late nineteen eighties (Hayashi et al., 1992).
The United Kingdom survey is the one presented here.
It deals with the responses of 1043 individuals to 14 closed-ended questions
and three open-ended questions. Some questions concern objective
characteristics of the respondent or his/her household (age, status, gender,
facilities). Other questions relate to attitude or opinions.
The first open-ended question was “What is the single
most important thing in life for you?”
It was followed by the probe:
“What other things are very important to you?”.
A third question (not analysed in this tutorial, but included in the
example data set) was also asked: “What
means to you the culture of your own country?”
We will
focus on the first open question and its probe. Being interested in
the relationships between these responses and both the age and
educational level of the respondent, we will use a specific
categorical variable to agglomerate the open responses: a variable
with nine categories cross-tabulating three categories of age with
three educational levels.
More explanations about this particular example and the corresponding
methodology can be found in the book: “Exploring Textual Data”
(L. Lebart, A. Salem, L. Berry; Kluwer Academic Publishers, 1998).
This example corresponds to the directory
“EX_A05.Text-Responses_1”
included in: “DtmVic_Examples_A_Start”.
1) Looking at the three files: data, dictionary and texts.
1.1) Data file: “TDA_dat.txt”
The data file comprises 1043 rows and 15 columns (identifier of rows [between quotes] +
14 values [corresponding either to numerical variables or to item numbers
of categorical variables] separated by at least one blank space).
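The row layout described above can be read with a few lines of Python. This is a minimal sketch: the function name and the sample row are invented, and the kind of quotes used around the identifier is an assumption (`shlex` accepts both single and double quotes).

```python
import shlex

def parse_data_row(line, n_values=14):
    """Split one row of the data file into (identifier, values).

    Assumes the identifier comes first, between quotes, followed by
    n_values numbers separated by one or more blanks.
    """
    fields = shlex.split(line)          # keeps a quoted identifier in one piece
    identifier, raw_values = fields[0], fields[1:]
    if len(raw_values) != n_values:
        raise ValueError(f"expected {n_values} values, got {len(raw_values)}")
    return identifier, [float(v) for v in raw_values]

# Invented row, for illustration only (not taken from TDA_dat.txt).
row_id, values = parse_data_row("'R 0001'  2 1 3 1 2 45 1 2 2 1 3 1 2 5")
```

Checking the number of values on each row is a cheap way to detect a misplaced blank or a missing quote before running an analysis.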
1.2) Dictionary file: “TDA_dic.txt”
The dictionary file “TDA_dic.txt”
contains the identifiers of the 14 variables. In this internal
version of the DtmVic dictionary, the identifiers of categories must
begin at column 6. The identifier of a categorical
variable is preceded by the number N of its categories (columns 1 to
5); the N following lines identify the N response items. An optional
“short identifier” can be located in columns 1 to 5. A
numerical variable (such as “age”) has 0 categories. Note
that blank spaces are not allowed within the identifiers
(about DtmVic formats, see the Introduction of Tutorial D).
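To make the fixed-column layout concrete, here is a hedged sketch of a reader for such a dictionary. It assumes that, on category lines, columns 1 to 5 hold the optional short identifier; the function name and the sample lines are hypothetical and are not copied from TDA_dic.txt.

```python
def parse_dictionary(lines):
    """Read a dictionary in the fixed-column layout described above.

    Variable line : columns 1-5 = number N of categories, identifier after.
    Category line : columns 1-5 = optional short identifier (assumption),
                    identifier starting at column 6.
    """
    variables = []
    rows = iter(l.rstrip("\n") for l in lines if l.strip())
    for line in rows:
        n_categories = int(line[:5])            # 0 means a numerical variable
        name = line[5:].strip()
        categories = []
        for _ in range(n_categories):
            cat_line = next(rows)
            categories.append((cat_line[:5].strip(), cat_line[5:].strip()))
        variables.append({"name": name, "categories": categories})
    return variables

# Hypothetical dictionary with one numerical and one categorical variable.
sample = [
    "    0 age",
    "    2 gender",
    "M    male",
    "F    female",
]
dictionary = parse_dictionary(sample)
```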
1.3) Text file: “TDA_tex.txt”
This file contains the free responses of 1043
individuals to the three open-ended questions mentioned earlier.
The DtmVic internal
format of the text file is very specific. Since the
responses may have very different lengths, separators are used to
distinguish between questions and between individuals (or:
respondents). Individuals are separated by the string of characters
“----” (starting in column 1), possibly followed by an
identifier. Within the data of each individual, the open questions are
separated by “++++” (column 1). The symbol “====”
indicates the end of the file. Like all the data files involved in
DtmVic as input files, that file is a raw text file (.txt). If the
text file comes from a word-processing
phase, it must be saved as a “.txt” file.
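The separator logic can be sketched as follows. The sketch assumes that each individual starts with a “----” line and that “++++” lines stand between consecutive answers; the function name and the sample lines are invented for the illustration.

```python
def parse_text_file(lines):
    """Return a list of (identifier, [answer_1, answer_2, ...]) tuples."""
    respondents = []
    ident, answers, current, started = None, [], [], False
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("===="):             # end of file
            break
        if line.startswith("----"):             # new individual
            if started:
                answers.append(" ".join(current))
                respondents.append((ident, answers))
            ident, answers, current, started = line[4:].strip(), [], [], True
        elif line.startswith("++++"):           # next open question
            answers.append(" ".join(current))
            current = []
        elif started and line.strip():
            current.append(line.strip())
    if started:                                 # flush the last individual
        answers.append(" ".join(current))
        respondents.append((ident, answers))
    return respondents

# Invented sample with two individuals and three open questions.
sample = [
    "----1", "my family", "++++", "health, money", "++++", "tradition",
    "----2", "good health", "++++", "++++",
    "====",
]
parsed = parse_text_file(sample)
# Merging questions 1 and 2, as done later in step 2.4:
merged = [" ".join(answers[:2]).strip() for _, answers in parsed]
```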
2) Generation of a command file (or: “parameter file”)
2.1) Click the button: “Create”
of the main menu: “Basic Steps”, line
“Command File”.
A window
“Choosing among some basic analyses”
appears.
2.2) Click then on the button: “ANALEX”
– located in the paragraph “Textual and numerical data”.
2.3) Press the button:
“Open a text file”,
then search for the directory:
“DtmVic_Examples_A_Start”.
In that directory, open the directory of example A.5, named
“EX_A05.Text-Responses_1”.
Open then the text file:
“TDA_tex.txt”.
A message box then indicates that the corpus comprises 7329 lines, 1043
observations and 3 open questions.
2.4) Click on:
“Select Open questions and separators”
The next window allows for the selection of open questions and the selection
of separators of words (the default list of separators suffices in this example).
We will select questions 1 and 2 (that means that the two responses will be
merged). It is legitimate here to merge the two responses because question 2 is
a probe for question 1.
2.5) Click directly on: “Vocabulary”.
The next window presents the vocabulary (in alphabetic and frequency orders). We
must select a frequency threshold by selecting a line in the right-hand side memo
(frequency order). Line number 135 corresponds to the frequency 16.
After selecting that line, click on:
“Confirm”.
Then click on:
“Continue”.
2.6) Click the button:
“Open a dictionary (Dtm format)”
Open then the dictionary file: “TDA_dic.txt”.
The dictionary file TDA_dic.txt
contains the identifiers of the 14 variables.
The dictionary file is displayed in a window. Another window indicates
the status of each variable (numerical or categorical).
2.7) Press the button:
“Open a data file (Dtm format)”
Open the data file: “TDA_dat.txt”.
That data file comprises 1043 rows and 15 columns (identifier of rows
[between quotes] + 14 values [corresponding either to numerical variables or to item
numbers of categorical variables] separated by at least one blank space).
A new window displays the data file.
2.8) Click the button: “Continue
(select active and supplementary elements)”.
A new window is displayed, allowing for the selection of active variables.
We suggest selecting categorical variable number 14 (age - education).
Only one active variable can be selected in the ANALEX case.
All the remaining variables could be selected as supplementary elements.
They will serve to describe the categories of the active variable.
2.9) Click then on the button:
“Continue”
A new window devoted to the selection of active observations (rows) is
displayed. Click on the button:
“All the observations will be active”.
The window
“Create a starting parameter file”
is displayed.
2.10) Click on:
“1-Select some options”.
A new window entitled “Options
Bootstrap and/or clustering of observations” is displayed.
Click “yes”
for the “Bootstrap validation”, and then click
“Enter” to confirm the
default number of replicates (25). Ignore the other suggested bootstrap options.
You are then back in the previous window.
2.11) Then click:
“2-Create a first parameter file”
A parameter file is displayed in the memo
[it can be edited by advanced users, and it allows the same analysis to be performed again
later on, if needed].
Important:
The parameter file is saved as
“Param_ANALEX.txt”
in the current directory. If you wish, you can now exit from DtmVic
and, later on, use the button of the main menu
“Open” (line:
“Command file”) to open directly the file
“Param_ANALEX.txt”
and, in so doing, reach directly this point of the process,
using the “Execute”
command of the main menu.
2.12) Click: “3-Execute”.
This step will run the basic computation steps present in the command file:
archiving data and text, characteristic words and responses, correspondence analysis
of the lexical table, thorough descriptions of categories using other variables.
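For readers who want a feel for what the correspondence analysis step works on, the sketch below computes the masses and the standardized residuals of a small contingency table. The total inertia is the sum of the squared residuals, and the principal axes come from the singular value decomposition of this residual matrix, which is not performed here. The function name and the toy tables are assumptions; DtmVic's internal computations are not reproduced.

```python
def ca_residuals(table):
    """Return (row_masses, col_masses, residuals, total_inertia).

    table is a list of rows of non-negative counts with no empty row
    or column. residuals[i][j] = (p_ij - r_i * c_j) / sqrt(r_i * c_j).
    """
    n = float(sum(map(sum, table)))
    row_mass = [sum(row) / n for row in table]
    col_mass = [sum(col) / n for col in zip(*table)]
    residuals = [[(x / n - r * c) / (r * c) ** 0.5
                  for x, c in zip(row, col_mass)]
                 for row, r in zip(table, row_mass)]
    inertia = sum(s * s for row in residuals for s in row)
    return row_mass, col_mass, residuals, inertia

# Two toy tables: complete association, then exact independence.
_, _, _, inertia_assoc = ca_residuals([[10, 0], [0, 10]])
_, _, _, inertia_indep = ca_residuals([[2, 4], [1, 2]])
```

The total inertia equals the chi-square statistic of the table divided by its grand total, so it vanishes when rows and columns are independent.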
3) Basic numerical results
Click on the “Basic numerical results”
button.
The button opens a created (and saved) HTML
file named “imp.html”,
which contains the main results of the previous basic computation
steps. After perusing these numerical results, return to the main
menu. Note that this file is also saved under another name. The name
“imp”
is concatenated with the date and time of the analysis (continental
notation):
“imp_08.07.09_14.45.html”
means July 8th, 2009, at 2:45 p.m. That file keeps the main numerical
results as an archive, whereas the file
“imp.html”
is replaced for each new analysis performed in the same directory.
The results are also saved in a simple text format, under the name
“imp.txt”,
and likewise under a name including the date and time of execution.
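When several archived result files accumulate in a directory, the date and time can be recovered from their names. The sketch below assumes the pattern day.month.year_hour.minute inferred from the single example given above; the function name is hypothetical.

```python
from datetime import datetime

def parse_archive_name(filename):
    """Extract the date and time from a name such as imp_08.07.09_14.45.html.

    Assumes the pattern imp_DD.MM.YY_HH.MM followed by an extension.
    """
    stem = filename.rsplit(".", 1)[0]       # drop the extension
    stamp = stem[len("imp_"):]              # "08.07.09_14.45"
    return datetime.strptime(stamp, "%d.%m.%y_%H.%M")

when = parse_archive_name("imp_08.07.09_14.45.html")
```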
From the step NUMER, we learn,
for instance, that we have 1043 responses, with a total number of words
(occurrences or tokens) of 13 919, involving 1 365 distinct words (or: types).
Using a frequency threshold of 16 [the same threshold is denoted 15 in the result file:
first neglected frequency], the total number of kept words reduces
to 10 738, whereas the number of distinct kept words reduces (more drastically) to 136.
The book “Exploring Textual Data” (op.
cit.) deals in detail with this pre-processing and with all
the results that follow.
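The counts reported by the step NUMER (tokens, types, and what remains after applying the frequency threshold) can be mimicked on a toy corpus as follows. The tokenizer and the function name are assumptions, so the figures produced on the real corpus could differ from DtmVic's.

```python
import re
from collections import Counter

def vocabulary_summary(responses, min_freq):
    """Count tokens and types, before and after a frequency threshold."""
    # Assumed tokenizer: lower-case runs of letters and apostrophes.
    freq = Counter(word for text in responses
                   for word in re.findall(r"[a-z']+", text.lower()))
    kept = {w: f for w, f in freq.items() if f >= min_freq}
    return {
        "tokens": sum(freq.values()),        # all occurrences
        "types": len(freq),                  # distinct words
        "kept_tokens": sum(kept.values()),
        "kept_types": len(kept),
        # Same threshold expressed as the "first neglected frequency".
        "first_neglected": min_freq - 1,
    }

summary = vocabulary_summary(
    ["health and family", "family and friends", "health"], min_freq=2)
```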
4) Steps VIC (Visualization, Inference, Classification)
4.1) Click the button: “AxeView”
... and follow the sub-menus.
In fact, two tabs are relevant for this example:
“Active variables”
[ = categories, in the case of ANALEX], and
“Individuals” [words].
After clicking on: “View”,
one obtains the set of principal coordinates along each axis.
Clicking on a column header produces a ranking of all the rows
according to the values of that column. Evidently, the use of the
AxeView menu is justified when the data set is large, which is the
case here.
Return.
4.2) Press the button:
“PlaneView”... and follow
the sub-menus...
In this example, three items of the menu are relevant:
“Active columns (variables or categories)”,
“Rows (Individuals)”,
and “Active columns + Rows”.
This last item concerns both rows and columns of the contingency
table (lexical table). The graphical display of the selected pair of
axes is then produced. The active categories (columns of the lexical
table) are printed in red, while the active words (rows) are printed
in blue.
The roles of the different buttons are straightforward, except perhaps
the button: “Rank”, which is useful only in the case of very
intricate displays (which is not the case here). (See comments in
the texts relating to examples A.1 and A.2).
4.3) Click on the button:
“BootstrapView”
This button opens the “DtmVic:
Bootstrap - Validation - Stability – Inference”
window.
4.3.1 Click on: “LoadData”.
In this case (partial bootstrap), the replicated coordinates file to
be opened is named:
“ngus_par_boot1.txt”.
(The set of possible files is given by the panel).
4.3.2 Click on the “Confidence
Areas” submenu, and choose the pair of axes to be
displayed (select axes 1 and 2 to begin with).
4.3.3 We obtain the list of the identifiers of active rows and columns
(the identifiers of columns [categories
age x education] are at the end
of the list). Since the column set is quite small, tick all the white
boxes to select all the categories, select also some words, and press
the button: “Select”.
– Click on: “Confidence Ellipses”
to obtain the graphical display of the chosen column points in red,
and of the row points (or individuals or observations) in
blue. We can see that some words, taken individually, have no significant
position. In this display, we learn for example that almost all the
age-education groups (column points) have distinct “lexical
profiles”, except the categories “-30-low” [less
than 30 years old, low level of education] and “-30-medium”
[less than 30 years old, medium level of education], whose confidence
areas largely overlap.
We can see, for instance, that some inflected forms of the verb "to be",
such as "is, be, are, being", may have locations that differ significantly.
– Close the display window, and, again in the blue window, press:
“Convex hulls”.
The ellipses are now replaced with the convex hulls of the replicates
for each point. The convex hulls take into account the peripheral points,
whereas the ellipses are drawn using the density of the clouds of replicates.
The two pieces of information are complementary.
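To illustrate what a convex hull of replicates is, the sketch below applies Andrew's monotone chain algorithm to a handful of points in the plane. The coordinates are invented and the choice of algorithm is an assumption; the source does not say how DtmVic computes its hulls.

```python
def convex_hull(points):
    """Convex hull of 2-D points, in counter-clockwise order.

    Andrew's monotone chain; collinear points on the border are dropped.
    """
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); positive for a left turn
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

# Invented replicate coordinates of one point on axes 1 and 2.
replicates = [(0, 0), (1, 0), (1, 1), (0, 1), (0.5, 0.5)]
hull = convex_hull(replicates)
```

A single outlying replicate enlarges the hull, which is why the hull reflects the peripheral points while an ellipse summarizes the bulk of the cloud.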
Return.
4.4) Click on: “ClusterView”
4.4.1 Choose the axes (1 and 2 to begin with), and
“Continue”.
4.4.2 Click on “View”.
The locations of the 9 categories (variable 14: age-education)
appear on the first principal plane. Up to a possible change
of sign for the axes, the display is the same as that provided by the
“PlaneView”
procedure.
4.4.3 Activate the button: “Words”,
and, pointing with the mouse on a specific category, press
the right button of the mouse. A description of the category involving
its most characteristic words appears.
This description is again redundant with
that of the step MOCAR (file “imp.txt”).
But here we can observe the pattern of categories and their relative
locations.
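One common way to decide whether a word is characteristic of a category is to compare its frequency inside the category with its frequency in the whole corpus through a hypergeometric tail probability. The sketch below shows that computation; whether the step MOCAR uses exactly this criterion is an assumption here, and the function name and figures are invented.

```python
from math import comb

def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) for a hypergeometric variable X.

    N : number of tokens in the whole corpus
    K : occurrences of the word in the whole corpus
    n : number of tokens in the category
    k : occurrences of the word in the category
    """
    total = comb(N, n)
    upper = min(K, n)
    favourable = sum(comb(K, x) * comb(N - K, n - x)
                     for x in range(k, upper + 1))
    return favourable / total

# Invented figures: a word seen 5 times overall, all 5 in a category
# that holds 5 of the 10 tokens of the corpus.
p_value = hypergeom_upper_tail(k=5, K=5, n=5, N=10)
```

A small probability means that the word is over-represented in the category compared with a random draw of the same size.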
4.4.4 Activate the button:
“Texts”.
Pointing with the mouse on a specific category, and pressing the
right button of the mouse, we can read the most characteristic
responses of the selected category.
More explanation about the corresponding methodology can
be found in the book: “Exploring Textual Data” (L. Lebart,
A. Salem, L. Berry; Kluwer Academic Publishers, 1998).
Return.
4.5) Click “Kohonen map”
4.5.1 Select: “variables
+ observations (rows + columns)”:
these active variables are the words and
the texts (categories) in this example.
4.5.2 Select a (5 x 5) map, and
“continue”.
4.5.3 Press “draw”
on the menu of the large green window entitled “Kohonen
map”.
4.5.4 You can change the font size
(“Font”)
and dilate the obtained Kohonen map (“Dilat.”)
to make it more legible. The words appearing in the same cell are
often associated with the same responses. This property holds, to a
lesser degree, for contiguous cells.
4.5.5 Note that we have obtained a simultaneous Kohonen
representation of rows and columns, owing to the use, as input file,
of the coordinates from the correspondence analysis of the lexical
table.
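The idea of feeding coordinates to a Kohonen map can be sketched with a very small self-organizing map. Everything below (function name, learning rate, neighbourhood schedule, toy coordinates) is an assumption made for illustration and does not describe DtmVic's implementation.

```python
import math
import random

def kohonen_cells(data, rows=5, cols=5, n_iter=500, seed=0):
    """Train a small self-organizing map and return the cell of each point.

    data is a list of coordinate lists (for instance rows and columns
    of a lexical table located on the first principal axes).
    """
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(r, c): [rng.uniform(-1, 1) for _ in range(dim)]
               for r in range(rows) for c in range(cols)}

    def best_cell(x):
        # cell whose weight vector is closest to x (squared distance)
        return min(weights, key=lambda cell: sum(
            (w - v) ** 2 for w, v in zip(weights[cell], x)))

    for t in range(n_iter):
        frac = t / n_iter
        rate = 0.5 * (1 - frac)                      # decreasing step
        radius = max(rows, cols) / 2 * (1 - frac) + 0.5
        x = rng.choice(data)
        br, bc = best_cell(x)
        for (r, c), w in weights.items():
            d2 = (r - br) ** 2 + (c - bc) ** 2       # distance on the grid
            h = math.exp(-d2 / (2 * radius ** 2))
            for k in range(dim):
                w[k] += rate * h * (x[k] - w[k])
    return [best_cell(x) for x in data]

# Invented coordinates; the first two points are identical on purpose.
points = [[0.9, 0.1], [0.9, 0.1], [-0.8, 0.2], [0.0, -0.7]]
cells = kohonen_cells(points)
```

Points with close coordinates tend to fall in the same cell or in neighbouring cells, which is what makes a joint map of words and categories readable.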
4.6) Click: “Seriation”
The aim of seriation techniques has
been briefly described in section 4.6 of example 4. Seriation
will be applied here to the lexical table cross-tabulating the 9 categories
of respondents and the selected words (words appearing at least 16 times in
the corpus). In this version of DtmVic, Seriation can be obtained only after
the three types of analysis: SCA, VISUTEX and ANALEX. All these approaches
involve Correspondence Analysis of contingency tables.
A new window named “Reordering”
appears.
Click on the button:
“Reordering the rows and the column of a word-text table”.
The reordered table cross-tabulating the 9 categories and the selected
words is then displayed. It can be seen that the first words of the
reordered list of words characterize (sometimes exclusively) the
first categories in the reordered list of categories. The last words
of the same list are either absent or rarely observed among these
categories. However, they are frequent among the last categories
(right hand side of the table).
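A reordering of this kind can be obtained by sorting rows and columns along the first axis of a correspondence analysis. The sketch below finds that axis by reciprocal averaging; using it as the seriation criterion is an assumption, as are the function name and the toy table.

```python
def seriate(table, n_iter=200):
    """Order rows and columns of a contingency table along the first
    correspondence analysis axis, found by reciprocal averaging.

    Assumes no empty row or column. Returns (row_order, col_order).
    """
    n_rows, n_cols = len(table), len(table[0])
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    total = float(sum(row_tot))
    col_score = [float(j) for j in range(n_cols)]    # arbitrary start
    for _ in range(n_iter):
        # weighted centering and scaling remove the trivial constant axis
        mean = sum(s * t for s, t in zip(col_score, col_tot)) / total
        col_score = [s - mean for s in col_score]
        norm = (sum(s * s * t for s, t in zip(col_score, col_tot)) / total) ** 0.5
        if norm == 0:
            break                                    # no structure left
        col_score = [s / norm for s in col_score]
        # each row gets the average score of its columns, and conversely
        row_score = [sum(x * s for x, s in zip(row, col_score)) / rt
                     for row, rt in zip(table, row_tot)]
        col_score = [sum(table[i][j] * row_score[i] for i in range(n_rows))
                     / col_tot[j] for j in range(n_cols)]
    row_order = sorted(range(n_rows), key=row_score.__getitem__)
    col_order = sorted(range(n_cols), key=col_score.__getitem__)
    return row_order, col_order

# A banded table whose rows and columns have been shuffled (invented).
band = [[5, 1, 0, 0], [1, 5, 1, 0], [0, 1, 5, 1], [0, 0, 1, 5]]
row_perm, col_perm = [2, 0, 3, 1], [1, 3, 0, 2]
shuffled = [[band[i][j] for j in col_perm] for i in row_perm]
row_order, col_order = seriate(shuffled)
```

Applying `row_order` and `col_order` to the shuffled table brings the large counts back near the diagonal, up to a reversal of both orders.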
General remark.
As you can observe when looking at the content of the example’s
directory, several files have been created and saved [these files are
briefly described in the memo “Help
about files” in the toolbar of the main menu].
If you need to resume using the buttons of
the paragraph VIC of the main menu after having closed DtmVic, just
click on the button “Open”
from the line “Command file”,
select and open the saved command file
“Param_ANALEX.txt”,
and close it. It is not necessary to click on
“Execute”
again. You can then continue your investigation (axes views, graphs,
maps, etc.).
Advanced users can also edit the parameter file
“Param_ANALEX.txt” (using the
memo “Help about parameters”
in the toolbar of the main menu) to perform a new analysis in which
the parameters are given new values.
It is advised to give it a new name (such as:
“Param_
ANALEX.txt” , for example).
All the intermediate files will be replaced (except the files
“imp_date_time.txt”
and “imp_date_time.html” which
are the only saved archives).
End of example A.5