Example A.6: EX_A06.Text-Responses_2. (VISURECA)

(Open questions in a sample survey: Direct analysis and link with closed-end questions)


Example A.6 aims to describe directly the responses to an open-ended question in a sample survey, without prior aggregation, in relation to a set of categorical variables. The survey and the responses are the same as in Example A.5.

Further explanations about this type of example and the corresponding methodology can be found in the book “Exploring Textual Data” (L. Lebart, A. Salem, L. Berry; Kluwer Academic Publishers, 1998).

We can take advantage of the presence of closed-end questions to describe the clusters not only with characteristic words and responses, but also with categories, selected after a step SELEC and analysed through the step DECLA.

Another new step of the command file, POSIT, describes the location of these supplementary categories in the plane spanned by the first principal axes.



1) Looking at the three files: data, dictionary and texts.


To have a look at the data, search for the directory: DtmVic_Examples.

In this directory, open the sub-directory DtmVic_Examples_B_Texts .

In that directory, open the directory of Example A.6, named: “EX_A06.Text-Responses_2”.

It is recommended to use one directory for each application, since DtmVic produces a lot of intermediate txt-files related to the application.

At the outset, this directory must contain at least three files:


a) the data file,
b) the dictionary file,
c) the text file,


a) Data file: TDA_dat.txt (same as that of Example A.5)

This file contains responses to questions included in a multinational survey conducted in seven countries (Japan, France, Germany, United Kingdom, USA, Netherlands, Italy) in the late 1980s (Hayashi et al., 1992). Only the United Kingdom survey is presented here.

It contains the responses of 1043 individuals to 14 questions. Some questions concern objective characteristics of the respondent or his/her household (age, status, gender, facilities). Other questions relate to attitudes or opinions. The data file "TDA_dat.txt" comprises 1043 rows and 15 columns (identifier of rows [between quotes] + 14 values [corresponding either to numerical variables or to item numbers of categorical variables] separated by at least one blank space).
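
As an illustration of this row layout, here is a minimal Python sketch that reads one such row. It relies only on what is stated above (an identifier between quotes, followed by values separated by one or more blanks). The function name parse_row and the sample row are hypothetical, and the sketch is not part of DtmVic.

```python
import shlex

def parse_row(line, n_values=14):
    """Split one data row into (identifier, list of numeric values)."""
    # shlex.split honours quotes, so a quoted identifier stays in one piece,
    # and any run of blanks counts as a single separator.
    fields = shlex.split(line)
    identifier, raw_values = fields[0], fields[1:]
    if len(raw_values) != n_values:
        raise ValueError(f"expected {n_values} values, got {len(raw_values)}")
    return identifier, [float(v) for v in raw_values]

# Hypothetical row: identifier between quotes, then 14 values.
sample = "'0001'  1 2 1 3 2 1 1 4 2 2 1 3 1 45"
ident, values = parse_row(sample)
```

Checking the number of values per row in this way is a cheap guard against a missing or extra blank in a hand-edited data file.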


b) Dictionary file: TDA_dic.txt (same as that of Example A.5)

The dictionary file "TDA_dic.txt" contains the identifiers of these 14 variables. In this version of DtmVic, the identifiers of categories must begin at "column 6" [a fixed-width font (also known as teletype font) such as "courier" should be used to facilitate this kind of format].


c) Text file: TDA_TEX.txt (same as that of Example A.5)

Let us recall its characteristics. It contains the free responses of 1043 individuals to three open-ended questions.

Firstly, the following open-ended question was asked: "What is the single most important thing in life for you?" It was followed by the probe: "What other things are very important to you?".

A third question (not analysed here) was also asked: “What means to you the culture of your own country”. We refer to the previous example (Example A.5) for comments about the data format.



2) Generation of a command file (or: “parameter file”)


2.1) Click the button: “Create a command file” of the main menu: Basic Steps, line: “Command File”.

A window: “Choosing among some basic analyses” appears.


2.2) Then click on the button: “VISURECA analysis - (Visualization and Clustering of responses with supplementary categorical data)” located in the section “Textual and numerical data”. A window “Opening a text file” is displayed.


2.3) Press the button: “Open the text file”, then search for the directory: “DtmVic_Examples_A_Start”. In that directory, open the directory of Example A.6, named “EX_A06.Text-Responses_2”.

Then open the text file: “TDA_tex.txt”.

A message box then indicates that the corpus comprises 7329 lines, 1043 observations and 3 open questions.


2.4) Click on: “Select Open questions and separators”

The next window allows for the selection of open questions and the selection of separators of words (the default separators suffice in this example).

We will select questions 1 and 2 (which means that the two responses will be merged). It is legitimate here to merge the two responses because question 2 is a probe for question 1.


2.5) Click directly on: “Vocabulary and counts” .

The next window presents the vocabulary (in alphabetic and in frequency order). We must select a frequency threshold by selecting a line in the right-hand side memo (frequency order). Line number 397 [first column] corresponds to the frequency 4 [second column]. (We took a threshold of 16 in the previous Example A.5. Individual responses are lexically very poor, so more words must be kept to avoid generating too many empty responses once the threshold is applied.) We will keep the 397 most frequent words. After selecting that line, click on: “Confirm”. The frequency appears in a message box. Reply: "OK".

Then click on: “Continue”. A window devoted to the dictionary and data files appears.
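
The effect of the frequency threshold chosen in step 2.5 can be pictured with the following Python sketch. It is only an illustration under simple assumptions (words separated by blanks, a toy set of invented responses); the function name select_vocabulary is hypothetical and does not reproduce the internal computations of DtmVic.

```python
from collections import Counter

def select_vocabulary(responses, min_freq):
    """Return (word, frequency) pairs with frequency >= min_freq,
    sorted by decreasing frequency (ties broken alphabetically)."""
    counts = Counter(word for response in responses for word in response.split())
    kept = [(w, c) for w, c in counts.items() if c >= min_freq]
    return sorted(kept, key=lambda wc: (-wc[1], wc[0]))

# Toy responses, invented for illustration (not taken from the survey file).
responses = [
    "family health",
    "health happiness",
    "family health money",
    "peace",
]
vocabulary = select_vocabulary(responses, min_freq=2)
kept_words = {w for w, _ in vocabulary}

# Responses left with no retained word become "empty" after thresholding.
empty = [r for r in responses if not kept_words & set(r.split())]
```

In this toy case the response "peace" contains no retained word. This is the reason why a lower threshold is preferred when the statistical units are short individual responses.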


2.6) Click the button: “Open a dictionary (Dtm format)”

Open then the dictionary file: “TDA_dic.txt“ .

The dictionary file: TDA_dic.txt contains the identifiers of the 14 variables.

The dictionary file is displayed in a window. Another window indicates the status of each variable (numerical or categorical).


2.7) Press the button: “Open a data file (Dtm format)”

Open the data file: “TDA_dat.txt”

That data file comprises 1043 rows and 15 columns (identifier of rows [between quotes] + 14 values [corresponding either to numerical variables or to item numbers of categorical variables] separated by at least one blank space).

A new window displays the data file.


2.8) Click the button: “Continue (select active and supplementary variables)”.

A new window is displayed, allowing for the selection of active variables. There is no active variable, since the responses to the 2 open questions are active here. We actually chose the active variables by selecting the open-ended questions 1 and 2.

All the remaining variables could be selected as supplementary elements. They will serve to describe the categories of the active variable.


2.9) Click then on the button: “Continue”

A new window devoted to the selection of active observations (rows) is displayed. Click on the button: “All the observations will be active” .

The window: “Create a starting parameter file” is displayed.


2.10) Then click directly on: “Create a first parameter file”.


For this type of analysis, there is not (yet) any bootstrap validation. The clustering is automatic, and the number of clusters is selected by default according to the number of responses (30 clusters in this case). [This number of clusters can be changed by editing the command file (or parameter file) before execution; the parameters to be altered belong to "STEP PARTI" and "STEP DECLA".]

A parameter file is displayed in the memo [It can be edited by the advanced users. It allows for performing again the same analysis later on, if needed].


Important: The parameter file is saved as “Param_VISURECA.txt” in the current directory. If you wish, you can now exit from DtmVic and, later on, use the button of the main menu “Open an existing command file” (Section: “Command file”) to open the file “Param_VISURECA.txt” directly. You will then reach this point of the process again and can use the “Execute” command of the main menu.


Let us recall that this set of commands comprises the following steps:


ARDAT (archiving data),
ARTEX (archiving texts),
SELOX (selecting the open question),
NUMER (numerical coding of the text),
ASPAR (correspondence analysis of the [sparse] contingency table “respondents - words”),
CLAIR (brief description of factorial axes),
RECIP (clustering using a hierarchical classification of the clusters - reciprocal neighbours method),
PARTI (cut of the dendrogram produced by the previous step, and optimisation of the partition obtained),
MOTEX (cross-tabulating the partition produced by step PARTI with words: the contingency table obtained is called a lexical table),
MOCAR (characteristic words and characteristic responses for each class of the partition),
SELEC (selecting active and supplementary elements),
DECLA (systematic description of the classes of the partition produced by step PARTI using the other relevant categorical variables),
POSIT (illustrating the principal spaces of responses with supplementary categorical variables).
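
To make the notion of lexical table (step MOTEX) more concrete, the following Python sketch cross-tabulates a partition of responses with a vocabulary. It is a toy illustration only: the function name lexical_table, the responses and the cluster labels are invented, and the code makes no claim about how DtmVic stores or computes this table.

```python
from collections import Counter, defaultdict

def lexical_table(responses, clusters, vocabulary):
    """Cross-tabulate clusters (rows) with retained words (columns)."""
    table = defaultdict(Counter)
    for text, cluster in zip(responses, clusters):
        for word in text.split():
            if word in vocabulary:
                table[cluster][word] += 1
    columns = sorted(vocabulary)  # fixed column order
    rows = {c: [table[c][w] for w in columns] for c in sorted(table)}
    return columns, rows

# Toy data, invented for illustration.
responses = ["family health", "health happiness", "family health money", "money job"]
clusters = [1, 1, 2, 2]                      # cluster label of each response
vocabulary = {"family", "health", "money"}   # words kept after thresholding

columns, rows = lexical_table(responses, clusters, vocabulary)
```

Each row of the table gives, for one cluster, the number of occurrences of each retained word in the responses of that cluster.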


2.11) Click: “Execute”.

This step will run the basic computation steps present in the command file: archiving data and text, characteristic words and responses, correspondence analysis of the lexical table, thorough descriptions of clusters using both words and categorical variables.



7) Click the button: “Basic numerical results”

This button opens an HTML file, created and saved by the program under the name “imp.html”, which contains the main results of the previous basic computation steps. After perusing these numerical results, return to the main menu. Note that this file is also saved under another name, in which “imp.html” is concatenated with the date and time of the analysis (continental notation). That file serves as an archive of the main numerical results, whereas the file “imp.html” is replaced for each new analysis performed in the same directory.

This file is also saved in a simple text format, under the name “imp.txt”, and likewise with a name including the date and time of execution.

Perusing the complete list of words highlights some errors in the original text file (inevitable in real-sized applications): for instance, the symbol “]” was absent from the list of separators, and creates some new “words”…
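
The effect of a missing separator can be reproduced with a short Python sketch. The tokenizer below is a hypothetical stand-in (DtmVic uses its own procedure), and the sample sentence is invented; the point is only to show how a character left out of the separator list ends up glued to a word.

```python
import re

def tokenize(text, separators):
    """Split text on any run of the given separator characters."""
    pattern = "[" + re.escape(separators) + "]+"
    return [tok for tok in re.split(pattern, text.lower()) if tok]

text = "My family [and friends] good health"

# "]" is missing from the separators: "friends]" becomes a new "word".
without_bracket = tokenize(text, " .,;[")

# With "]" declared as a separator, the expected word is recovered.
with_bracket = tokenize(text, " .,;[]")
```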


8) At this stage, we click on one of the lower buttons of the basic steps panel (Steps: “VIC”)


9) Click the button “AxeView”

... and follow the sub-menus. In fact, three tabs are relevant for this example: “Active variables” [= words in the case of the analysis “VISURECA”], “Individuals (observations)” [= respondents] and “Supplementary Categories”. After clicking on “View” in each case, one obtains the set of principal coordinates along each axis.

Clicking on a column header produces a ranking of all the rows according to the values of that column. In this particular example, this is somewhat redundant with the printed results of the step CLAIR.
Return.


10) Click the button “PlaneView Research” and follow the sub-menus...

In this example, six items of the menu are relevant: “Active columns (variables or categories)” (principal coordinates of the active words), “Supplementary categories” (coordinates of the supplementary categories derived from the step POSIT), “Active rows (individuals, observations)” (coordinates of the respondents), “Active columns + Active rows”, “Active individuals (density)” and “Active columns + Supplementary categories”. The graphical displays of the chosen pairs of axes are then produced.


11) About the button: “BootstrapView”...

In fact, the bootstrap is implicitly performed for the analyses VISURESP and VISURECA. No parameter needs to be specified. The replicate file "ngus_dir_var_boot.txt" is created using the so-called “specific bootstrap”. Using the button “BootstrapView”, we will have to load the file "ngus_dir_var_boot.txt" and select the words whose confidence ellipses should be drawn. In this case, the bootstrap replicates are obtained by drawing the respondents (or: rows, individuals, observations) with replacement.
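
The principle of drawing respondents with replacement can be sketched in Python as follows. This is a generic illustration of the resampling idea only: it does not reproduce the “specific bootstrap” of DtmVic nor the content of the replicate file, and the function name bootstrap_word_counts and the toy responses are hypothetical.

```python
import random
from collections import Counter

def bootstrap_word_counts(responses, n_replicates, seed=0):
    """For each replicate, draw len(responses) respondents with replacement
    and recount the words of the drawn responses."""
    rng = random.Random(seed)
    n = len(responses)
    replicates = []
    for _ in range(n_replicates):
        sample = [responses[rng.randrange(n)] for _ in range(n)]
        replicates.append(Counter(w for r in sample for w in r.split()))
    return replicates

# Toy responses, invented for illustration (two words each).
responses = ["family health", "health happiness", "money job", "family friends"]
replicates = bootstrap_word_counts(responses, n_replicates=5)
```

The variability of the word counts across replicates gives an idea of the stability of each word, which is what the confidence ellipses convey graphically.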


12) Click on “ ClusterView ”


12.1 Choose the axes (1 and 2 to begin with), and “Continue”.


12.2 Click on “View”. The centroids of the 30 clusters (produced by the step PARTI) appear on the first principal plane.


12.3 Activate the button: “Words” and, pointing with the mouse on a specific cluster, press the right button of the mouse. A description of the cluster involving the most characteristic words of the cluster appears. This description is somewhat redundant with that of the step MOCAR. But this display exhibits the pattern of clusters and their relative locations.


12.4 Activate the button “Texts” . Pointing with the mouse on a specific cluster, and pressing the right button of the mouse, we can read the most characteristic responses of the selected cluster.


12.5 Activate the button: “Categorical”. Pointing with the mouse on a specific cluster, and pressing the right button of the mouse, we can read the most characteristic categories of the selected cluster. This description is somewhat redundant with that provided in the results file (file “imp.txt”) by the step DECLA. But we do have simultaneously in front of us the pattern of clusters and their relative locations.


13) Click on “Kohonen map”

Select the type of coordinate.


13.1 Select: “Variables (columns)” : these active variables are the words in this example.


13.2 Select a (5 x 5) map, and continue.


13.3 After clicking on two small check-boxes, press “Draw” on the menu of the large green window entitled Kohonen map.


13.4 You can change the font size (“Font”) and dilate the obtained Kohonen map (“Dilat.”) to make it more legible. The words appearing in the same cell are often associated in the same responses. This property holds, to a lesser degree, for contiguous cells.


13.5 Pressing “AxeView” , and selecting one axis allows one to enrich the display with pieces of information about a specified principal axis : large positive coordinates in red colour, large negative coordinates in green, with some transitional hues.


13.6 Go back to the main menu, click on “Kohonen map” and choose the item “Observations”.


13.7 Select a (10 x 10) map, and redo the operations 13.3 to 13.5 for the observations.
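
The behaviour of the Kohonen maps used in step 13 (neighbouring items falling in the same or in contiguous cells) can be illustrated with a very small self-organising map written in plain Python. This is a textbook-style sketch under stated assumptions, not the algorithm implemented in DtmVic: the function names train_som and best_cell, the learning-rate and neighbourhood schedules, and the word coordinates are all hypothetical.

```python
import math
import random

def best_cell(grid, x):
    """Cell whose prototype is closest to x (squared Euclidean distance)."""
    return min(grid, key=lambda cell: sum((a - b) ** 2 for a, b in zip(grid[cell], x)))

def train_som(points, rows, cols, n_iter=2000, seed=0):
    """Train a small self-organising map; return {cell: prototype vector}."""
    rng = random.Random(seed)
    dim = len(points[0])
    # One prototype per cell, initialised on randomly chosen data points.
    grid = {(r, c): list(rng.choice(points)) for r in range(rows) for c in range(cols)}
    radius0 = max(rows, cols) / 2.0
    for t in range(n_iter):
        frac = t / n_iter
        rate = 0.5 * (1.0 - frac)                   # learning rate decreases
        radius = max(radius0 * (1.0 - frac), 0.5)   # neighbourhood shrinks
        x = rng.choice(points)
        best = best_cell(grid, x)
        for cell, proto in grid.items():
            d2 = (cell[0] - best[0]) ** 2 + (cell[1] - best[1]) ** 2
            h = math.exp(-d2 / (2.0 * radius ** 2))  # Gaussian neighbourhood weight
            for k in range(dim):
                proto[k] += rate * h * (x[k] - proto[k])
    return grid

# Hypothetical principal coordinates of four words on two axes.
words = {"family": (1.0, 0.9), "children": (0.9, 1.0),
         "money": (-1.0, -0.8), "job": (-0.9, -1.0)}
grid = train_som(list(words.values()), rows=2, cols=2)
cells = {w: best_cell(grid, xy) for w, xy in words.items()}
```

Each word is finally assigned to the cell whose prototype is nearest to its coordinates; because neighbouring cells are updated together during training, words with similar coordinates tend to land in the same or in adjacent cells.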





End of example A.6 (VISURECA)