Example A.5: EX_A05.Text-Responses_1. ANALEX

(Textual Data Analysis: Open and Closed Questions)




Example A.5 aims to describe the responses to an open-ended question in a sample survey in relation to the responses to a specific closed-end question.

After archiving the dictionary, data and texts, the numerical coding of the text allows us to build a lexical table cross-tabulating the words with a selected categorical variable. A correspondence analysis is then performed on that lexical table. Bootstrap confidence areas (ellipses or convex hulls) can be drawn around words and categories. Characteristic words and responses are computed for each category.
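To make the notion of a lexical table concrete, here is a minimal Python sketch of the cross-tabulation of words with a categorical variable. It is only an illustration and not DtmVic's own code: the function name `build_lexical_table` and the toy responses are invented for this example.

```python
from collections import Counter

def build_lexical_table(responses, categories, vocabulary):
    """Cross-tabulate words (rows) with categories (columns).

    responses  : list of token lists, one per respondent
    categories : list of category labels, one per respondent
    vocabulary : words kept for the analysis
    """
    columns = sorted(set(categories))
    counts = {word: Counter() for word in vocabulary}
    for tokens, category in zip(responses, categories):
        for token in tokens:
            if token in counts:
                counts[token][category] += 1
    # One row per word, one count per category, in column order.
    table = {word: [counts[word][c] for c in columns] for word in vocabulary}
    return columns, table

# Toy data, invented for illustration only.
responses = [["family", "health"], ["health", "money"], ["family", "family"]]
categories = ["-30-low", "-30-medium", "-30-low"]
columns, table = build_lexical_table(
    responses, categories, ["family", "health", "money"]
)
for word, row in table.items():
    print(word, row)
```

Each row of the table is the profile of a word over the categories; the correspondence analysis compares these profiles.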


About the data

The open questions were included in a multinational survey conducted in seven countries (Japan, France, Germany, United Kingdom, USA, Netherlands, Italy) in the late 1980s (Hayashi et al., 1992). Only the United Kingdom survey is presented here. It deals with the responses of 1043 individuals to 14 closed-end questions and three open-ended questions. Some questions concern objective characteristics of the respondent or of his/her household (age, status, gender, facilities). Other questions relate to attitudes or opinions. The first open-ended question was “What is the single most important thing in life for you?”

It was followed by the probe: “What other things are very important to you?”. A third question (not analysed in this tutorial, but included in the example data set) was also asked: “What means to you the culture of your own country?”

We will focus on the first open question and its probe. Since we are interested in the relationships between these responses and both the age and the educational level of the respondent, we will use a specific categorical variable to group the open responses: a variable with nine categories cross-tabulating three categories of age with three educational levels.

More explanations about this particular example and the corresponding methodology can be found in the book: “Exploring Textual Data” (L. Lebart, A. Salem, L. Berry; Kluwer Academic Publishers, 1998).

This example corresponds to the directory “EX_A05.Text-Responses_1” included in: “DtmVic_Examples_A_Start”.




1) Looking at the three files: data, dictionary and texts.


1.1) Data file: “TDA_dat.txt”

The data file comprises 1043 rows and 15 columns (identifier of rows [between quotes] + 14 values [corresponding either to numerical variables or to item numbers of categorical variables] separated by at least one blank space).
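As an illustration of this layout, the following Python sketch reads one row of such a file. It is a sketch under assumptions: the sample row is made up, and the use of `shlex.split` to handle the quoted identifier is our choice, not something prescribed by DtmVic.

```python
import shlex

def parse_data_line(line, n_values=14):
    """Split one data row into (identifier, list of numeric values)."""
    fields = shlex.split(line)  # keeps the quoted identifier as one field
    identifier, raw_values = fields[0], fields[1:]
    if len(raw_values) != n_values:
        raise ValueError(f"expected {n_values} values, got {len(raw_values)}")
    return identifier, [float(v) for v in raw_values]

# Hypothetical row: the identifier and the 14 values are invented.
sample = "'0001'  2 1 3 45 1 2 2 1 4 3 1 2 5 7"
identifier, values = parse_data_line(sample)
print(identifier, values)
```

Reading the whole file then amounts to applying `parse_data_line` to each of the 1043 rows.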


1.2) Dictionary file: “TDA_dic.txt”

The dictionary file “TDA_dic.txt” contains the identifiers of the 14 variables. In this internal version of the DtmVic dictionary, the identifiers of categories must begin at column 6. The identifier of a categorical variable is preceded by the number N of its categories (columns 1 to 5); the N following lines identify the N response items. An optional “short identifier” of a category can be located in columns 1 to 5. A numerical variable (such as “age”) has 0 categories. Note that blank spaces are not allowed within the identifiers (about DtmVic formats, see the Introduction of Tutorial D).
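The fixed-column layout described above can be read with a few lines of Python. This is a minimal sketch, assuming that columns 1 to 5 of a variable line hold the number of categories and that every identifier starts in column 6; the sample lines (apart from the variable name “age”) are invented.

```python
def parse_dictionary(lines):
    """Read variable and category identifiers from fixed-column lines."""
    variables = []
    i = 0
    while i < len(lines):
        line = lines[i]
        if not line.strip():
            i += 1
            continue
        n_categories = int(line[:5])   # columns 1 to 5
        name = line[5:].strip()        # identifier starts in column 6
        categories = []
        for cat_line in lines[i + 1 : i + 1 + n_categories]:
            short_id = cat_line[:5].strip()  # optional short identifier
            categories.append((short_id, cat_line[5:].strip()))
        variables.append({"name": name, "categories": categories})
        i += 1 + n_categories
    return variables

# Hypothetical dictionary fragment (one numerical, one categorical variable).
sample = [
    "    0age",
    "    2gender",
    "M    male",
    "F    female",
]
variables = parse_dictionary(sample)
for variable in variables:
    kind = "categorical" if variable["categories"] else "numerical"
    print(variable["name"], kind)
```

A variable with an empty list of categories is treated as numerical, which mirrors the “0 categories” convention.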


1.3) Text file: “TDA_tex.txt”

This file contains the free responses of the 1043 individuals to the three open-ended questions mentioned earlier.

The DtmVic internal format of the text file is very specific. Since the responses may have very different lengths, separators are used to distinguish between questions and between individuals (or: respondents). Individuals are separated by the character string “----” (starting in column 1), possibly followed by an identifier. Within each individual's data, the open questions are separated by “++++” (column 1). The symbol “====” indicates the end of the file. Like all the input files of DtmVic, this file is a raw text file (.txt). If the text comes from a word processor, it must be saved as a “.txt” file.
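The separators described above can be illustrated by a short Python sketch. It assumes that each individual starts with a “----” line and that “++++” lines sit between consecutive answers; the sample responses and identifiers are invented, and the function is not part of DtmVic.

```python
def parse_text_file(lines):
    """Return a list of (identifier, [answer to question 1, 2, ...])."""
    respondents = []
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("===="):        # end of file
            break
        if line.startswith("----"):        # new individual
            current = (line[4:].strip(), [[]])
            respondents.append(current)
        elif current is None:
            continue                        # ignore text before the first individual
        elif line.startswith("++++"):      # next open question
            current[1].append([])
        else:
            current[1][-1].append(line.strip())
    return [
        (ident, [" ".join(parts).strip() for parts in answers])
        for ident, answers in respondents
    ]

# Hypothetical file content, invented for illustration.
sample = [
    "----0001",
    "my family",
    "and my health",
    "++++",
    "good job",
    "++++",
    "tradition",
    "----0002",
    "happiness",
    "++++",
    "++++",
    "====",
]
respondents = parse_text_file(sample)
# Merging questions 1 and 2, as done later in this example.
merged = [" ".join(answers[:2]).strip() for _, answers in respondents]
print(merged)
```

An answer spread over several lines is joined into one string, and an empty answer is kept as an empty string so that the question numbering is preserved.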



2) Generation of a command file (or: “parameter file”)


2.1) Click the button: “Create” of the main menu: “Basic Steps”, line “Command File”.

A window “Choosing among some basic analyses” appears.


2.2) Click then on the button: “ANALEX” – located in the paragraph “Textual and numerical data”.


2.3) Press the button: “Open a text file”, then search for the directory “DtmVic_Examples_A_Start”. In that directory, open the directory of example A.5, named “EX_A05.Text-Responses_1”.

Then open the text file: “TDA_tex.txt”.

A message box indicates then that the corpus comprises 7329 lines, 1043 observations and 3 open questions.


2.4) Click on: “Select Open questions and separators”

The next window allows for the selection of open questions and the selection of separators of words (the default list of separators suffices in this example).

We will select questions 1 and 2 (this means that the two responses will be merged). It is legitimate to merge the two responses here because question 2 is a probe for question 1.


2.5) Click directly on: “Vocabulary”.

The next window presents the vocabulary (alphabetic and frequency orders). We must select a frequency threshold by selecting a line in the right-hand side memo (frequency order). Line number 135 corresponds to the frequency 16. After selecting that line, click on: “Confirm”.

Then click on: “Continue”.
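The effect of a frequency threshold on the vocabulary can be sketched as follows in Python. The tokenisation rule (lower-casing and keeping runs of letters and apostrophes) is an assumption made for this illustration and is not DtmVic's list of separators; the three texts are invented.

```python
import re
from collections import Counter

def vocabulary(texts, min_frequency):
    """Count all words, then keep those reaching the frequency threshold."""
    tokens = [t for text in texts for t in re.findall(r"[a-z']+", text.lower())]
    freq = Counter(tokens)
    kept = {word: n for word, n in freq.items() if n >= min_frequency}
    return freq, kept

# Toy corpus, invented for illustration.
texts = ["My family and my health", "Health, family, money", "My job"]
freq, kept = vocabulary(texts, min_frequency=2)

print("occurrences:", sum(freq.values()), "distinct words:", len(freq))
print("kept occurrences:", sum(kept.values()), "kept distinct words:", len(kept))
```

As in the tutorial, raising the threshold removes many distinct words but comparatively few occurrences, because the discarded words are the rare ones.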


2.6) Click the button: “Open a dictionary (Dtm format)”

Then open the dictionary file: “TDA_dic.txt”.

The dictionary file TDA_dic.txt contains the identifiers of the 14 variables.

The dictionary file is displayed in a window. Another window indicates the status of each variable (numerical or categorical).


2.7) Press the button: “Open a data file (Dtm format)”

Open the data file: “TDA_dat.txt” .

That data file comprises 1043 rows and 15 columns (identifier of rows [between quotes] + 14 values [corresponding either to numerical variables or to item numbers of categorical variables] separated by at least one blank space).

A new window displays the data file.


2.8) Click the button: “Continue (select active and supplementary elements)”.

A new window is displayed, allowing for the selection of active variables.

We suggest selecting the categorical variable number 14 (age - education). Only one active variable can be selected in the ANALEX case.

All the remaining variables could be selected as supplementary elements. They will serve to describe the categories of the active variable.


2.9) Click then on the button: “Continue”

A new window devoted to the selection of active observations (rows) is displayed. Click on the button: “All the observations will be active” .

The window “Create a starting parameter file” is displayed.


2.10) Click on: “1-Select some options”.

A new window entitled “Options Bootstrap and/or clustering of observations” is displayed. Click “yes” for the “Bootstrap validation”, then click “Enter” to confirm the default number of replicates (25). Ignore the other suggested bootstrap options.
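To give an idea of what a bootstrap replicate of a lexical table is, the following Python sketch redraws the occurrences of a small table with replacement. It is an assumption-laden illustration, not DtmVic's procedure: the function name and the toy table are invented, and the subsequent projection of the replicates onto the axes of the original analysis (which is what a partial bootstrap usually refers to) is not shown.

```python
import random

def bootstrap_tables(table, n_replicates=25, seed=0):
    """Redraw the occurrences of a contingency table with replacement."""
    rng = random.Random(seed)
    n_rows, n_cols = len(table), len(table[0])
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    weights = [table[i][j] for i, j in cells]
    total = sum(weights)
    replicates = []
    for _ in range(n_replicates):
        replicate = [[0] * n_cols for _ in range(n_rows)]
        # Each occurrence falls in a cell with probability count / total.
        for i, j in rng.choices(cells, weights=weights, k=total):
            replicate[i][j] += 1
        replicates.append(replicate)
    return replicates

# Toy lexical table (3 words x 2 categories), invented for illustration.
table = [[3, 0], [1, 1], [0, 1]]
replicates = bootstrap_tables(table)
print(replicates[0])
```

Each replicate has the same total number of occurrences as the original table; the scatter of a point across the 25 replicates is what the confidence areas summarise.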

Back to the previous window.


2.11) Then click: “2-Create a first parameter file”

A parameter file is displayed in the memo [It can be edited by the advanced users. It allows for performing again the same analysis later on, if needed].

Important: The parameter file is saved as “Param_ANALEX.txt” in the current directory. If you wish, you can now exit DtmVic and, later on, use the “Open” button of the main menu (line: “Command file”) to open the file “Param_ANALEX.txt” directly. You will then be back at this point of the process and can use the “Execute” command of the main menu.


2.12) Click: “3-Execute” .

This step will run the basic computation steps present in the command file: archiving data and text, characteristic words and responses, correspondence analysis of the lexical table, thorough descriptions of categories using other variables.



3) Basic numerical results


Click on the “Basic numerical results” button.

The button opens a newly created (and saved) HTML file named “imp.html”, which contains the main results of the previous basic computation steps. After perusing these numerical results, return to the main menu. Note that this file is also saved under another name, in which “imp” is concatenated with the date and time of the analysis (continental notation): “imp_08.07.09_14.45.html” means July 8th, 2009, at 2:45 p.m. That file keeps the main numerical results as an archive, whereas the file “imp.html” is replaced for each new analysis performed in the same directory.

This file is also saved in plain text format, under the name “imp.txt”, and likewise under a name including the date and time of execution.
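The naming convention of the archived files can be reproduced with a short Python sketch, which may help when looking for a given analysis in the directory. The pattern is inferred from the single example “imp_08.07.09_14.45.html”; applying the same pattern to the text version is an assumption, and `archive_name` is a name invented here.

```python
from datetime import datetime

def archive_name(moment, extension="html"):
    """Build a name of the form imp_DD.MM.YY_HH.MM.<extension>."""
    return moment.strftime(f"imp_%d.%m.%y_%H.%M.{extension}")

# July 8th, 2009, at 2:45 p.m., as in the example above.
moment = datetime(2009, 7, 8, 14, 45)
html_name = archive_name(moment)
text_name = archive_name(moment, "txt")
print(html_name)
print(text_name)
```

The day comes before the month, which is what “continental notation” refers to.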

From the step NUMER, we learn for instance that we have 1043 responses, with a total number of words (occurrences, or tokens) of 13 919, involving 1 365 distinct words (or: types). Using a frequency threshold of 16 [the same threshold is denoted 15 in the result file: first neglected frequency], the total number of kept words reduces to 10 738, whereas the number of distinct kept words reduces (more drastically) to 136.
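The contrast between the two reductions can be checked with a small computation, written here in Python using only the four counts quoted above (the variable names are ours).

```python
# Counts reported by the step NUMER.
tokens_total, types_total = 13919, 1365
tokens_kept, types_kept = 10738, 136

token_coverage = tokens_kept / tokens_total   # share of occurrences kept
type_coverage = types_kept / types_total      # share of distinct words kept

print(f"occurrences kept:    {token_coverage:.1%}")
print(f"distinct words kept: {type_coverage:.1%}")
```

About 77% of the occurrences are kept, against about 10% of the distinct words: the threshold discards many words, but these words are rare.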

The book “Exploring Textual Data” (op. cit.) deals in detail with this pre-processing and with all the results that follow.




4) Steps VIC (Visualization, Inference, Classification)



4.1) Click the button: “AxeView” ... and follow the sub-menus.

In fact, two tabs are relevant for this example: “Active variables” [= categories, in the case of ANALEX], and “Individuals” [words]. After clicking on: “View”, one obtains the set of principal coordinates along each axis. Clicking on a column header ranks all the rows according to the values of that column. The use of the AxeView menu is justified when the data set is large, which is the case here.
Return.


4.2) Press the button: “PlaneView”... and follow the sub-menus...

In this example, three items of the menu are relevant: “Active columns (variables or categories)”, “Rows (Individuals)”, and “Active columns + Rows”. This last item concerns both rows and columns of the contingency table (lexical table). The graphical display of the selected pair of axes is then produced. The active categories (columns of the lexical table) are printed in red, while the active words (rows) are printed in blue.

The roles of the different buttons are straightforward, except perhaps the button: “Rank”, which is useful only in the case of very intricate displays (which is not the case here). (See comments in the texts relating to examples A.1 and A.2).


4.3) Click on the button: “BootstrapView”

This button opens the “DtmVic: Bootstrap - Validation - Stability – Inference” window.


4.3.1 Click on: “LoadData” . In this case (partial bootstrap), the replicated coordinates file to be opened is named: “ngus_par_boot1.txt” . (The set of possible files is given by the panel).


4.3.2 Click on: “Confidence Areas” submenu, and choose the pair of axes to be displayed (select axes 1 and 2 to begin with).


4.3.3 We obtain the list of the identifiers of active rows and columns (the identifiers of columns [categories age x education] are at the end of the list). Since the column set is quite small, tick all the white boxes to select all the categories, select also some words, and press the button: “Select”.


– Click on: “Confidence Ellipses” to obtain the graphical display of the chosen column points in red, and of the row points (or individuals or observations) in blue. We can see that, taken individually, some words do not have a significant position. In this display, we learn for example that almost all the age-education groups (column points) have distinct “lexical profiles”, except the categories “-30-low” [less than 30 years old, low level of education] and “-30-medium” [less than 30 years old, medium level of education], whose confidence areas largely overlap.
We can see, for instance, that some inflections of the verb "to be", such as "is, be, are, being", may have locations that differ significantly.


– Close the display window, and, again in the blue window, press: “Convex hulls” . The ellipses are now replaced with the convex hulls of the replicates for each point. The convex hulls take into account the peripheral points, whereas the ellipses are drawn using the density of the clouds of replicates. The two pieces of information are complementary.
Return.
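The difference between the two kinds of confidence areas can be illustrated by a Python sketch applied to the replicates of one point in a plane. The convex hull is computed with the classical monotone chain algorithm. The ellipse is a 95% ellipse derived from the covariance matrix of the replicates, which is an assumption on our part and not necessarily the exact rule used by DtmVic. The five replicated points are invented.

```python
import math

def convex_hull(points):
    """Monotone chain: hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def confidence_ellipse(points, chi2_quantile=5.991):
    """Centre, semi-axes and orientation of a covariance-based ellipse.

    5.991 is the 95% quantile of the chi-square law with 2 degrees of freedom.
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    var_x = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    var_y = sum((y - my) ** 2 for _, y in points) / (n - 1)
    cov = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # Eigenvalues of the 2 x 2 covariance matrix.
    half_trace = (var_x + var_y) / 2
    radius = math.hypot((var_x - var_y) / 2, cov)
    lam1 = half_trace + radius
    lam2 = max(half_trace - radius, 0.0)
    angle = 0.5 * math.atan2(2 * cov, var_x - var_y)
    major = math.sqrt(chi2_quantile * lam1)
    minor = math.sqrt(chi2_quantile * lam2)
    return (mx, my), major, minor, angle

# Invented replicates of one point on axes 1 and 2.
replicated = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0), (1.0, 1.0)]
hull = convex_hull(replicated)
centre, major, minor, angle = confidence_ellipse(replicated)
print(hull)
print(centre, round(major, 3), round(minor, 3))
```

The hull is driven by the outermost replicates only, whereas the ellipse depends on the spread of the whole cloud, which is why the two displays complement each other.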




4.4) Click on: “ClusterView”


4.4.1 Choose the axes (1 and 2 to begin with), and “Continue” .


4.4.2 Click on “View”. The locations of the 9 categories (variable 14: age-education) appear on the first principal plane. Up to a possible change of sign of the axes, the display is the same as that provided by the “PlaneView” procedure.


4.4.3 Activate the button: “Words” , and, pointing with the mouse on a specific category, press the right button of the mouse. A description of the category involving the most characteristic words of the category appears. This description is again redundant with that of the Step MOCAR (file “imp.txt” ). But we can observe here the pattern of categories and their relative locations.


4.4.4 Activate the button: “Texts” . Pointing with the mouse on a specific category, and pressing the right button of the mouse, we can read the most characteristic responses of the selected category.

More explanations about the corresponding methodology can be found in the book: “Exploring Textual Data” (L. Lebart, A. Salem, L. Berry; Kluwer Academic Publishers, 1998).
Return.
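One common way of deciding that a word is characteristic of a category is to compare its frequency in the category with its frequency in the whole corpus through a hypergeometric probability. The Python sketch below follows that idea; it is an assumption for illustration, since this tutorial does not state the exact criterion used by the step MOCAR, and the counts in the example are invented.

```python
from math import comb

def over_representation_p_value(k, n, K, N):
    """Probability of observing at least k occurrences of a word.

    k : occurrences of the word in the category
    n : total occurrences (all words) in the category
    K : occurrences of the word in the whole corpus
    N : total occurrences (all words) in the whole corpus
    """
    upper = min(K, n)
    favourable = sum(
        comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1)
    )
    return favourable / comb(N, n)

# Invented counts: a word seen 3 times in the corpus, all 3 in one category.
p_value = over_representation_p_value(k=3, n=5, K=3, N=10)
print(round(p_value, 4))
```

The smaller the probability, the more the word is over-represented in the category; ranking the words of a category by this probability gives a list of candidate characteristic words.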


4.5) Click “Kohonen map”


4.5.1 Select: “variables + observations (rows + columns)” : these active variables are the words and the texts (categories) in this example.


4.5.2 Select a (5 x 5) map, and “continue” .


4.5.3 Press “draw” on the menu of the large green window entitled “Kohonen map”.


4.5.4 You can change the font size (“Font”) and dilate the obtained Kohonen map (“Dilat.”) to make it more legible. The words appearing in the same cell are often associated with the same responses. This property holds, to a lesser degree, for contiguous cells.


4.5.5 Note that we have obtained a simultaneous Kohonen representation of rows and columns, owing to the use, as input file, of the coordinates from the correspondence analysis of the lexical table.
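The principle of a Kohonen map fed with principal coordinates can be sketched in Python as follows. This is a generic, simplified self-organising map and not DtmVic's implementation: the learning-rate and neighbourhood schedules are arbitrary choices, and the coordinates of the words and categories are invented.

```python
import math
import random

def nearest_cell(weights, point):
    """Grid cell whose weight vector is closest to the point."""
    return min(
        weights,
        key=lambda cell: sum((w - x) ** 2 for w, x in zip(weights[cell], point)),
    )

def train_som(points, rows=5, cols=5, epochs=40, seed=0):
    """Train a small rectangular self-organising map on the points."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    weights = {cell: list(rng.choice(points)) for cell in cells}
    total_steps = epochs * len(points)
    step = 0
    for _ in range(epochs):
        for point in rng.sample(points, len(points)):
            progress = step / total_steps
            rate = 0.5 * (1.0 - progress)                  # decreasing step size
            radius = 0.5 + (max(rows, cols) / 2) * (1.0 - progress)
            winner = nearest_cell(weights, point)
            for cell in cells:
                grid_d2 = (cell[0] - winner[0]) ** 2 + (cell[1] - winner[1]) ** 2
                influence = math.exp(-grid_d2 / (2 * radius ** 2))
                for d in range(len(point)):
                    weights[cell][d] += rate * influence * (point[d] - weights[cell][d])
            step += 1
    return weights

# Invented coordinates of words and categories on axes 1 and 2.
coordinates = {
    "family": (-0.40, 0.10), "health": (-0.30, -0.20), "money": (0.50, 0.30),
    "job": (0.45, 0.25), "-30-low": (0.55, 0.35), "-30-medium": (0.40, 0.20),
}
weights = train_som(list(coordinates.values()))
placement = {label: nearest_cell(weights, xy) for label, xy in coordinates.items()}
print(placement)
```

Because words and categories are described by coordinates in the same space, they can be assigned to the cells of a single map, which is what the simultaneous representation relies on.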



4.6) Click: “Seriation”

The aim of seriation techniques has been briefly described in section 4.6 of example 4. Seriation will be applied here to the lexical table cross-tabulating the 9 categories of respondents and the selected words (words appearing at least 16 times in the corpus). In this version of DtmVic, seriation can be obtained only after three types of analysis: SCA, VISUTEX and ANALEX. All these approaches involve a correspondence analysis of contingency tables.


A new window named “Reordering” appears.


Click on the button: “Reordering the rows and the column of a word-text table” .

The reordered table cross-tabulating the 9 categories and the selected words is then displayed. It can be seen that the first words of the reordered list of words characterize (sometimes exclusively) the first categories in the reordered list of categories. The last words of the same list are either absent or rarely observed among these categories. However, they are frequent among the last categories (right hand side of the table).
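A reordering of this kind can be obtained by sorting the rows and the columns of the table along the first axis of a correspondence analysis. The Python sketch below computes that axis by reciprocal averaging. Whether DtmVic proceeds exactly in this way is an assumption; the function names and the small scrambled table are invented.

```python
import math

def first_axis_scores(table, iterations=200):
    """Row and column scores on the first axis, by reciprocal averaging."""
    n_rows, n_cols = len(table), len(table[0])
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(n_rows)) for j in range(n_cols)]
    total = sum(row_tot)
    col_scores = [float(j) for j in range(n_cols)]
    row_scores = [0.0] * n_rows
    for _ in range(iterations):
        # Centre and scale the column scores (removes the trivial solution).
        mean = sum(m * s for m, s in zip(col_tot, col_scores)) / total
        col_scores = [s - mean for s in col_scores]
        norm = math.sqrt(sum(m * s * s for m, s in zip(col_tot, col_scores)) / total)
        col_scores = [s / norm for s in col_scores]
        # Each row score is the weighted mean of the column scores, and conversely.
        row_scores = [
            sum(table[i][j] * col_scores[j] for j in range(n_cols)) / row_tot[i]
            for i in range(n_rows)
        ]
        col_scores = [
            sum(table[i][j] * row_scores[i] for i in range(n_rows)) / col_tot[j]
            for j in range(n_cols)
        ]
    return row_scores, col_scores

def seriate(table, row_labels, col_labels):
    """Sort rows and columns by their scores on the first axis."""
    row_scores, col_scores = first_axis_scores(table)
    rows = sorted(range(len(table)), key=lambda i: row_scores[i])
    cols = sorted(range(len(table[0])), key=lambda j: col_scores[j])
    reordered = [[table[i][j] for j in cols] for i in rows]
    return [row_labels[i] for i in rows], [col_labels[j] for j in cols], reordered

# Invented table whose rows and columns have been shuffled.
row_labels = ["wc", "wa", "wd", "wb"]
col_labels = ["c2", "c3", "c1"]
table = [[4, 3, 0], [1, 0, 5], [1, 5, 0], [4, 0, 3]]
row_order, col_order, reordered = seriate(table, row_labels, col_labels)
for label, row in zip(row_order, reordered):
    print(label, row)
```

After reordering, the large counts gather along the diagonal of the table, which is the pattern described above. The direction of the ordering is arbitrary, since the sign of an axis can be reversed.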



General remark.

As you can observe when looking at the content of the example’s directory, several files have been created and saved [these files are briefly described in the memo “Help about files” in the toolbar of the main menu]. If you need to use the buttons of the paragraph VIC of the main menu again after having closed DtmVic, just click on the button “Open” from the line “Command file”, select and open the saved command file “Param_ANALEX.txt”, and close it. It is not necessary to click on “Execute” again. You can then continue your investigation (axes views, graphs, maps, etc.).

The advanced users can also edit the parameter file: “Param_ANALEX.txt” , (using the memo “Help about parameters” in the toolbar of the main menu) to perform a new analysis in which the parameters are given new values. It is advised to give it a new name (such as: “Param_ ANALEX.txt” , for example). All the intermediate files will be replaced (except the files “imp_date_time.txt” and “imp_date_time.html” which are the only saved archives).





End of example A.5