The active categorical variables are in this case 13
questions (opinions and attitudes) about family, housing expenditure,
society, health problems, and anxiety.
A new window devoted to the selection of active observations (rows) is
displayed. Click on the button:
“All the observations will be active”.
A parameter file is displayed in the memo [such a
parameter file can be edited by advanced users; it allows
the same analysis to be performed again later, if needed].
This step will run the basic computation steps present in the command file:
archiving of the data and dictionary, selection of the active elements, multiple
correspondence analysis of the selected table, bootstrap replications
of the table, a brief description of the axes, and a clustering procedure
with a thorough description of the clusters.
After the execution has taken place, a small
window summarizes the different steps of computation.
Clicking on a column header produces a ranking of all the rows according to
the values of that column. In this particular example, this is
somewhat redundant with the printed results of the step “DEFAC”
(see the files “imp.txt” or “imp.html”, through
the button “Basic numerical results”). The use of
the AxeView menu pays off when the data set is very large.
Return.
The graphical displays of the selected pairs of axes are then produced.
The roles of the different buttons are straightforward, except perhaps
the button “Rank”, which is useful only in the case of very
intricate displays (which is far from being the case here!): this
button converts the two coordinates of the current display into
ranks. For instance, the n values of the abscissa are converted into
the n integers from 1 to n, in the same order as the original values.
Thus the two distributions become uniform, and the identifiers
overlap much less and are more legible (at the cost of a substantial
distortion of the display). This example is in fact a counterexample
of that property: an MCA derived from a few active categorical variables
produces many superimposed points, which coincide exactly
in the display of “individuals” and differ only slightly
in the display of ranks
(according to the option chosen here, they occupy consecutive or
neighbouring ranks).
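The rank conversion described above amounts to replacing each coordinate by its position in the sorted order. A minimal sketch (plain Python, not DtmVic code):

```python
def to_ranks(values):
    """Replace each coordinate by its integer rank (1..n), in the
    same order as the original values; ties are broken by position,
    so coincident points receive consecutive ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks

# The two coincident abscissae 0.5 get the consecutive ranks 2 and 3
print(to_ranks([0.5, -1.2, 0.5, 2.0]))   # [2, 1, 3, 4]
```

Applying the same conversion to the ordinates yields the two uniform distributions mentioned above.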
Return.
4.3.3 The window that then appears displays the dictionaries
of variables. Tick the chosen white boxes to select the elements
whose locations should be assessed, and press the button
“Select”.
In this display, we learn for example that in this
principal space (built as a “space of opinions”, owing to
the selection of the active questions), male and female [two
supplementary categories that did not participate in building the
axes] occupy distinct locations (their ellipses do not overlap at all).
To test such a hypothesis (independence between the
pattern of opinions and gender), it is convenient (i.e. more
legible) to tick only the two categories “male” and
“female”.
In the same vein, we can tick the age classes and
observe that the extreme categories (“under 30” and “over
60”) correspond to clearly separated confidence ellipses.
In the context of this example,
the other items of the main menu are not relevant.
As you can observe when looking
at the content of the example’s directory, several files have been
created and saved [these files are briefly described in the memo
“Help about files” in the toolbar of
the main menu]. If you need to use the buttons of the VIC paragraph
of the main menu again after having closed DtmVic, just
click on the button “Open an existing command file”
on the line “Command file”,
select and open the saved command file
“Param_MCA.txt”,
and close it. It is not necessary to click on
“Execute” again. You can then
continue your investigation (axes views, graphs, maps, etc.).
In this simple format, DtmVic can process up to 1000 texts, with no size
limit for each text. Our example corpus is thus a
“small-scale model”, illustrating only the functionalities
(but not the power) of DtmVic. The conversion to lower-case
characters is meant to avoid giving a special status to the first word
of each verse or sentence.
As mentioned in the previous examples, it is recommended to use one directory for each
application, since DtmVic produces many intermediate “txt” files related
to the application. At the outset, such a directory must contain at least one text file:
– The format is specific (DtmVic internal text format type 1. See the
tutorial D: “Importation”).
– Since the texts may have very different lengths, separators **** (at
the beginning of a line) are used to distinguish between texts.
– The length of the lines is limited to 200 characters (instead of 80, in the previous
version of DtmVic).
– The identifiers of texts must follow the separator “****”
after 4 blank spaces.
– The symbol “====” indicates the end of the file.
– Like all the data files involved in DtmVic as input
files, this file is a raw text file (.txt). If the text file comes
from a word processor, it must first be saved as a “.txt”
file.
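Under the assumptions listed above (“****” plus four blanks and an identifier opening each text, “====” closing the file), a reader for this format can be sketched as follows; the function name and sample identifiers are illustrative only:

```python
def read_dtm_texts(lines):
    """Parse the DtmVic internal text format (type 1): each text
    opens with '****' followed by four blanks and an identifier;
    '====' marks the end of the file."""
    texts = {}
    current = None
    for line in lines:
        if line.startswith("===="):
            break
        if line.startswith("****"):
            current = line[4:].strip()
            texts[current] = []
        elif current is not None:
            texts[current].append(line.rstrip("\n"))
    return texts

sample = [                      # identifiers are illustrative
    "****    sonnet_01",
    "Shall I compare thee to a summer's day?",
    "****    sonnet_02",
    "When forty winters shall besiege thy brow,",
    "====",
]
corpus = read_dtm_texts(sample)
print(sorted(corpus))           # ['sonnet_01', 'sonnet_02']
```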
Open DtmVic.
2.1) Click the button: “Create a command file”
of the main menu: Basic Steps, line “Command
File”
A window
“Choosing among some basic analyses”
appears.
2.2) Click then the button:
“VISUTEX – Visualization of Texts”
This button is located in the paragraph
“Textual and numerical data”.
2.3) Press the button: “Open the text file”,
then search for the directory:
“DtmVic_Examples_A_Start”.
In that directory, open the directory of example A.4, named:
“EX_A04.Text-poems”.
Open the file: “Sonnet_LowerCase.txt”.
A message box then indicates that the corpus contains 20 texts totalling 321 lines.
2.4) Click:
“Select open questions and separators”
The next window allows for the selection of open questions (not relevant
here) and of word separators (the default
separators suffice in this example).
2.5) Click directly “Vocabulary”.
The next window presents the vocabulary (in alphabetic and frequency
order). We must choose a frequency threshold by selecting a line
in the right-hand memo (frequency order). Line number 113
corresponds to the frequency 4 (a very small frequency,
suited to a very small corpus: this example is just an opportunity to
explore the sequence of commands, without meaningful linguistic
interpretation).
After selecting that line, click on: “Confirm”.
2.6) Then click on: “Continue (Create the parameter file)”.
Continuing our visit, we have to “select some options”.
Click “yes”
for the bootstrap validation, and “Enter”
to confirm the default number of replicates (25). Then click on “Continue”.
2.7) Then click: “Create a first parameter file”.
A parameter file is displayed in the memo [it can be edited by advanced users;
it allows the same analysis to be performed again later, if needed].
Important:
The parameter file is saved as “Param_VISUTEX.txt”
in the current directory.
If you wish, you could now exit from DtmVic and, later on, use the
main-menu button “Open an existing command file”
(line: “Command file”)
to open “Param_VISUTEX.txt” directly,
which brings you back to this point of the process. You can then use the
“Execute” command of the main menu.
2.8) Click on: “Execute”.
This step will run the basic computation steps present in the command file:
archiving data and text, characteristic words and responses,
correspondence analysis of the lexical table.
3) Basic numerical results
Click on the “Basic numerical results” button.
The button opens an html file (created and saved) named
“imp.html”,
which contains the main results of the previous basic computation
steps. After perusing these numerical results, return to the main
menu. Note that this file is also saved under another name:
“imp.html”
concatenated with the date and time of the analysis (continental
notation), so that “imp_08.07.09_14.45.html”
means July 8th, 2009, at 2:45 p.m. That file keeps the main numerical
results as an archive, whereas the file “imp.html”
is replaced by each new analysis performed in the same directory.
This file is also saved in a simple text format, under the name
“imp.txt”,
and likewise with a name including the date and time of execution.
From the step NUMER, we learn for instance that
we have 280 responses (lines), with a total number of words
(occurrences, or tokens) of 2321, involving 830 distinct words (or:
types). Using a frequency threshold of 3 (meaning here that the
words with a frequency above three are kept), the total number of kept words
reduces to 1384, whereas the number of distinct kept words reduces to
114. (Note a provisional notational difference:
the minimal selected frequency 4 corresponds to the frequency 3 in
the listing, meaning, equivalently, that all the words appearing
more than three times are kept.)
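The threshold step can be sketched as follows (the tiny token list and the helper name are illustrative; DtmVic performs this selection internally):

```python
from collections import Counter

def apply_threshold(tokens, min_freq):
    """Keep only word types whose corpus frequency is >= min_freq
    (a 'minimal selected frequency' of 4 keeps the words appearing
    more than three times); return kept token count and kept types."""
    freq = Counter(tokens)
    kept = {w for w, n in freq.items() if n >= min_freq}
    n_tokens_kept = sum(freq[w] for w in kept)
    return n_tokens_kept, kept

# Tiny illustrative token list (not the tutorial corpus)
tokens = ["the", "the", "the", "rose", "rose", "rose", "rose", "dew"]
n_kept, kept_words = apply_threshold(tokens, 4)   # only "rose" survives
```

On the real corpus this is the operation that reduces 2321 tokens to 1384 and 830 types to 114.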
Return.
4) Steps VIC (Visualization, Inference, Classification)
4.1) Click the button: “AxeView”
... and follow the sub-menus. In fact, only two tabs are relevant for
this example: “Active variables”
[= poems] and “Observations”
[= words]. After clicking on “View”,
the user obtains the set of principal coordinates along each axis.
Clicking on a column header produces a ranking of all the rows according to the
values of that column.
As mentioned in the previous examples, the use of the AxeView menu is
justified when the data set is large, which is not the case here.
Return.
4.2) Click the button: “PlaneView Research” ,
and follow the sub-menus...
In this example, only one item of the menu is relevant: “Active
columns + Rows”. This item concerns both the rows and the columns of the
contingency table (lexical table). The graphical displays of the selected pairs of axes are then
produced. Normally, the active categories (columns of the lexical
table) are printed in red, while the active words (rows) are printed
in blue.
The roles of the different buttons are straightforward, except perhaps the
button: “Rank”, which is useful only in the case of very
intricate displays (which is not the case here) (see comments in the
previous examples).
Return.
4.3) Click the button:
“BootstrapView”
This button opens the
“DtmVic: Bootstrap - Validation - Stability –
Inference” window.
4.3.1 Click on: “LoadData” .
In this case (partial bootstrap), the replicated coordinates file to be opened
is named “ngus_par_boot1.txt” .
(The set of possible files is given by a background panel).
4.3.2 Click on: “Confidence Areas”
submenu, and choose the pair of axes to be displayed (select axes 1 and 2 to begin with).
4.3.3 The window that appears (enlarge it if necessary) contains the list of identifiers
of active rows and columns (identifiers of columns [Sonnets in this case] are at
the end of the list). Tick some white boxes to select some poems, select also some
words, and press the button “Select” .
4.3.4 Click on: “Confidence Ellipses”
to obtain the graphical display of the chosen column points in red
and of the row points (here: words) in blue. We can see that many sonnets
occupy significant locations (several confidence ellipses do not overlap), whereas
the locations of the words are far from being as precise.
4.3.5 Close the display window, and, again in the blue window, press
“Convex hulls” . The ellipses are
now replaced by the convex hulls of the replicates for each point.
The convex hulls take into account the peripheral points, whereas the
ellipses are drawn using the density of the clouds of replicates. The
two pieces of information are complementary.
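The convex hull of a cloud of replicates can be computed with a standard algorithm such as Andrew's monotone chain; this is a generic sketch, not DtmVic's implementation:

```python
def convex_hull(points):
    """Andrew's monotone chain: return the vertices of the convex
    hull of a set of 2-D replicate points, counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

# The interior replicate (1, 1) is dropped; the peripheral ones remain
replicates = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
print(convex_hull(replicates))   # [(0, 0), (2, 0), (2, 2), (0, 2)]
```

This illustrates why the hull is sensitive to peripheral replicates while a density-based ellipse is not.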
4.4) Click “ClusterView” (in this case,
the clusters are the texts themselves)
4.4.1 Choose the axes (1 and 2 to begin with), and “Continue” .
4.4.2 Click on “View” .
The locations of the 20 categories (texts) appear on the first
principal plane. Thanks to some possible change of signs for the
axes, the display is the same as that provided by the
“PlaneView Research” procedure.
4.4.3 Activate the button “Words”
and, pointing with the mouse at a specific category, press the right mouse button.
A description of the category involving the most characteristic words
of the category appears. This description
is again redundant with that of the Step MOCAR (see files
“imp.txt” or
“imp.html” using the button
“Basic numerical results” ).
But we can appreciate here the pattern of categories and their relative
locations.
4.4.4 Activate the button “Texts”.
Pointing with the mouse at a specific category and pressing the
right mouse button, we can read the most characteristic lines
(verses) of the selected category. The concept of a characteristic line
is not obviously relevant in the case of poetry; it is in fact a
particular case of the concept of “characteristic responses”,
which is extremely useful in the case of open questions.
More explanation about the corresponding methodology can be found in the
already-quoted book “Exploring
Textual Data” (L. Lebart, A. Salem, L. Berry; Kluwer Academic
Publishers, 1998).
Return.
4.5) Click
“Kohonen map”
4.5.1 Select: “variables + observations (rows + columns)”:
the active elements are the words and the texts (poems)
in this example.
4.5.2 Select a (5 x 5) map, and “continue” .
4.5.3 Press “Draw”
on the menu of the large green window entitled “Kohonen
map”.
4.5.4 You can change the font size (“Font”)
and dilate the obtained Kohonen map (“Dilat.”)
to make it more legible. The words appearing in the same cell are
often associated in the same responses. This property holds, to a
lesser degree, for contiguous cells.
4.5.5 Note that we have obtained a simultaneous Kohonen
representation of rows and columns, owing to the use, as an input
file, of the coordinates from the correspondence analysis of the
lexical table.
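The idea of mapping points (here, the CA coordinates of words and texts) onto a grid of cells can be sketched with a minimal self-organizing map. This generic SOM is purely illustrative and is not DtmVic's implementation; the grid size, learning rate and neighbourhood schedule are arbitrary choices:

```python
import math
import random

def train_som(points, width=5, height=5, n_iter=500, seed=0):
    """Train a minimal Kohonen map on 2-D points (e.g. first-plane
    CA coordinates of words and texts) and return, for each point,
    the (row, col) of its winning cell on the width x height grid."""
    rng = random.Random(seed)
    cells = [[rng.uniform(-1, 1), rng.uniform(-1, 1)]
             for _ in range(width * height)]
    for t in range(n_iter):
        lr = 0.5 * (1 - t / n_iter)                 # decaying learning rate
        radius = max(1.0, (width / 2) * (1 - t / n_iter))
        x = rng.choice(points)                      # one random input point
        best = min(range(len(cells)),
                   key=lambda k: (cells[k][0] - x[0]) ** 2
                               + (cells[k][1] - x[1]) ** 2)
        br, bc = divmod(best, width)
        for k, w in enumerate(cells):               # pull neighbours toward x
            r, c = divmod(k, width)
            h = math.exp(-((r - br) ** 2 + (c - bc) ** 2) / (2 * radius ** 2))
            w[0] += lr * h * (x[0] - w[0])
            w[1] += lr * h * (x[1] - w[1])
    def cell_of(p):
        k = min(range(len(cells)),
                key=lambda k: (cells[k][0] - p[0]) ** 2
                            + (cells[k][1] - p[1]) ** 2)
        return divmod(k, width)
    return [cell_of(p) for p in points]

points = [(-1.0, -1.0), (-1.0, -1.0), (1.0, 1.0)]
cells = train_som(points)   # identical points necessarily share a cell
```

Points close in the CA plane tend to land in the same or neighbouring cells, which is the property exploited by the display.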
4.6) Click
“Seriation”
Seriation techniques, as well as block seriation techniques, are widely used by
practitioners. Seriation is based on simple permutations of the rows and
columns of the table under study; such permutations have the great practical
and cognitive advantage of showing the raw data to the user,
thereby sparing the user intricate
interpretation rules. These permutations can display homogeneous
blocks of high values or, on the contrary, of small or null values.
They can also pinpoint a continuous and progressive evolution of
profiles.
An optimal property of correspondence analysis is the
following: the first axis of a correspondence analysis provides
a ranking of the row points and of the column points. That
ranking can be used to sort the rows and columns of the analysed data
table; the reordered table has then undergone an optimal
seriation. Seriation will be applied here to the lexical table
cross-tabulating the 20 sonnets and the selected words (those
appearing at least 4 times in the corpus).
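One classical way to obtain the first-axis scores without a full eigendecomposition is reciprocal averaging, whose fixed point yields the scores of the first correspondence-analysis axis. The sketch below (illustrative, not DtmVic's code) orders the rows and columns of a small contingency table that way:

```python
def seriate(table, n_iter=200):
    """Order rows and columns of a small contingency table by
    reciprocal averaging: alternately setting row scores to the
    barycentre of the column scores (and vice versa) converges to
    the scores of the first correspondence-analysis axis."""
    n_rows, n_cols = len(table), len(table[0])
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(n_rows)) for j in range(n_cols)]
    grand = float(sum(row_tot))
    col_scores = [float(j) for j in range(n_cols)]   # arbitrary start
    for _ in range(n_iter):
        row_scores = [sum(table[i][j] * col_scores[j] for j in range(n_cols))
                      / row_tot[i] for i in range(n_rows)]
        col_scores = [sum(table[i][j] * row_scores[i] for i in range(n_rows))
                      / col_tot[j] for j in range(n_cols)]
        # project out the trivial (constant) axis, then renormalise
        m = sum(col_tot[j] * col_scores[j] for j in range(n_cols)) / grand
        col_scores = [s - m for s in col_scores]
        norm = max(abs(s) for s in col_scores) or 1.0
        col_scores = [s / norm for s in col_scores]
    row_order = sorted(range(n_rows), key=lambda i: row_scores[i])
    col_order = sorted(range(n_cols), key=lambda j: col_scores[j])
    return row_order, col_order

# Toy "lexical table": rows 0 and 2 share columns 0-1, row 1 uses columns 2-3
table = [[5, 4, 0, 0],
         [0, 0, 4, 5],
         [4, 5, 1, 0]]
rows, cols = seriate(table)
# rows 0 and 2 end up adjacent, opposed to row 1; columns group as {0,1} vs {2,3}
```

Sorting the real lexical table by these scores produces exactly the reordered display described below.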
A new window named “Reordering”
appears.
Click on the button: “Reordering the rows and the columns of
a word-text table”.
The reordered table cross-tabulating
the 20 sonnets and the selected words is then displayed. It can be
seen that the first words of the reordered list of words characterize
(sometimes exclusively) the first sonnets in the reordered list of
sonnets. The last words of the same list are either absent or rarely
observed among these sonnets. However, they are frequent among the
last sonnets (right hand side of the table).
That reordered printing of the raw data is a useful communication
tool for practitioners, since it can be interpreted
without prior knowledge of data-analysis techniques.
General remark.
As you can observe when looking at the content of the example’s
directory, several files have been created and saved [these files are
briefly described in the memo “Help
about files” in the toolbar of
the main menu]. If you need to use the buttons of the VIC paragraph
of the main menu again after having closed DtmVic, just
click on the button “Open an existing command file”
on the line “Command file”, select and open the
saved command file “Param_VISUTEX.txt”,
and close it. It is not necessary to click on “Execute”
again. You can then continue your investigation (axes views, graphs, maps, etc.).
Advanced users can also edit the parameter file
“Param_VISUTEX.txt”
(using the memo “Help about parameters”
in the toolbar of the main menu) to perform a new analysis in which the
parameters are given new values. It is advisable to give the edited file
a new name (such as “Param_VISUTEX2.txt”).
All the intermediate files will be replaced (except the
files “imp_date_time.txt”
and “imp_date_time.html”, which
are the only saved archives).
End of example A4
Example A.5:
EX_A05.Text-Responses_1
(Textual Data Analysis: Open and Closed Questions)
Example A.5 aims at describing the responses to an open-ended
question in a sample survey in relation to the responses to a
specific closed-end question.
After archiving the dictionary, data and texts, the numerical coding of the
text allows us to build a lexical table cross-tabulating the words
with a selected categorical variable. A correspondence analysis is
then performed on that lexical table. Bootstrap confidence areas
(ellipses or convex hulls) can be drawn around words and categories.
Characteristic words and responses are computed for each category.
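The lexical table itself is just a word-by-category contingency table; a minimal sketch (illustrative labels and naive tokenisation, not DtmVic's own coding):

```python
def lexical_table(responses, categories, vocabulary):
    """Cross-tabulate selected words with the categories of one
    categorical variable: cell (word, cat) counts the occurrences
    of the word in the responses of individuals of that category."""
    cats = sorted(set(categories))
    table = {w: {c: 0 for c in cats} for w in vocabulary}
    for text, cat in zip(responses, categories):
        for word in text.lower().split():   # naive tokenisation
            if word in table:
                table[word][cat] += 1
    return table

# Three illustrative respondents in two (hypothetical) age-education groups
responses = ["family and health", "work work family", "health"]
categories = ["-30-low", "60+-high", "-30-low"]
table = lexical_table(responses, categories, ["family", "health", "work"])
print(table["work"]["60+-high"])   # 2
```

Correspondence analysis is then applied to this table, exactly as described above.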
About the data
The open questions were included in a multinational survey conducted in seven
countries (Japan, France, Germany, United Kingdom, USA, Netherlands,
Italy) in the late nineteen-eighties (Hayashi et al., 1992).
The United Kingdom survey is presented here.
It deals with the responses of 1043 individuals to 14 closed-end questions
and three open-ended questions. Some questions concern objective
characteristics of the respondents or their households (age, status, gender,
facilities). Other questions relate to attitudes or opinions.
The first open-ended question was: “What is the single
most important thing in life for you?”
It was followed by the probe:
“What other things are very important to you?”.
A third question (not analysed in this tutorial, but included in the
example data set) was also asked: “What
does the culture of your own country mean to you?”
We will
focus on the first open question and its probe. Being interested in
the relationships between these responses and both the age and the
educational level of the respondents, we will use a specific
categorical variable to agglomerate the open responses: a variable
with nine categories, cross-tabulating three age categories with
three educational levels.
More explanations about this particular example and the corresponding
methodology can be found in the book “Exploring Textual Data”
(L. Lebart, A. Salem, L. Berry; Kluwer Academic Publishers, 1998).
This example corresponds to the directory
“EX_A05.Text-Responses_1”
included in: “DtmVic_Examples_A_Start”.
1) Looking at the three files: data, dictionary and texts.
1.1) Data file: “TDA_dat.txt”
The data file comprises 1043 rows and 15 columns (identifier of rows [between quotes] +
14 values [corresponding either to numerical variables or to item numbers
of categorical variables] separated by at least one blank space).
1.2) Dictionary file: “TDA_dic.txt”
The dictionary file
“TDA_dic.txt” contains the identifiers of the 14 variables. In this internal
version of the DtmVic dictionary, the identifiers of categories must
begin at column 6. The identifier of a categorical
variable is preceded by the number N of its categories (columns 1 to
5); the N following lines identify the N response items. An optional
“short identifier” can be located in columns 1 to 5. A
numerical variable (such as “age”) has 0 categories. Note
that blank spaces are not allowed within the identifiers
(about DtmVic formats, see the Introduction of Tutorial D).
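Assuming exactly the fixed-column layout just described (category count in columns 1 to 5, identifier from column 6 on), a rough reader can be sketched as follows; it is not DtmVic code, and the sample dictionary is illustrative:

```python
def read_dtm_dictionary(lines):
    """Parse a DtmVic dictionary: each variable line carries its
    number of categories N in columns 1-5 and its identifier from
    column 6 on; N > 0 category lines follow (N = 0: numerical)."""
    variables = []
    it = iter(lines)
    for line in it:
        n_cat = int(line[:5])
        name = line[5:].strip()
        cats = [next(it)[5:].strip() for _ in range(n_cat)]
        variables.append((name, cats))
    return variables

sample = [                 # illustrative two-variable dictionary
    "    2 gender",
    "      male",
    "      female",
    "    0 age",
]
print(read_dtm_dictionary(sample))
```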
1.3) Text file: “TDA_tex.txt”
This file contains the free responses of the 1043
individuals to the three open-ended questions mentioned earlier.
The DtmVic internal
format of the text file is very specific. Since the
responses may have very different lengths, separators are used to
distinguish between questions and between individuals (or:
respondents). Individuals are separated by the string
“----” (starting in column 1), possibly followed by an
identifier. Within each individual’s data, the open questions are
separated by “++++” (column 1). The symbol “====”
indicates the end of the file. Like all the data files involved in
DtmVic as input files, this file is a raw text file (.txt). If the
text file comes from a word processor,
it must first be saved as a “.txt” file.
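A reader for this response format (“----” per respondent, “++++” between that respondent's open questions, “====” at the end) can be sketched as follows; the function name and sample identifiers are illustrative:

```python
def read_responses(lines):
    """Parse the DtmVic open-response format: '----' (column 1)
    starts a new respondent, '++++' separates that respondent's
    open questions, '====' ends the file.  Returns, for each
    respondent, one list of text lines per open question."""
    respondents = []
    for line in lines:
        if line.startswith("===="):
            break
        if line.startswith("----"):
            respondents.append([[]])       # new respondent, question 1
        elif line.startswith("++++"):
            respondents[-1].append([])     # next open question
        elif respondents:
            respondents[-1][-1].append(line.rstrip("\n"))
    return respondents

sample = [                                 # illustrative identifiers
    "---- 0001",
    "family",
    "++++",
    "health, friends",
    "---- 0002",
    "my job",
    "++++",
    "money",
    "====",
]
answers = read_responses(sample)
print(len(answers))        # 2
```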
2) Generation of a command file (or: “parameter file”)
2.1) Click the button: “Create a command file”
of the main menu: "Basic Steps", line
“ Command File”.
A window
“Choosing among some basic analyses”
appears.
2.2) Click then on the button: “ANALEX”
– located in the paragraph “Textual and numerical data”.
2.3) Press the button:
“Open a text file”,
then search for the directory:
“DtmVic_Examples_A_Start”.
In that directory, open the directory of example A.5, named
“EX_A05.Text-Responses_1”.
Then open the text file:
“TDA_tex.txt”.
A message box then indicates that the corpus comprises 7329 lines, 1043
observations and 3 open questions.
2.4) Click on:
“Select Open questions and separators”
The next window allows for the selection of open questions and
of word separators (the default list of separators suffices in this example).
We will select questions 1 and 2, which means that the two responses will be
merged. Merging them is legitimate here because question 2 is
a probe of question 1.
2.5) Click directly on: “Vocabulary”.
The next window presents the vocabulary (in alphabetic and frequency order). We
must choose a frequency threshold by selecting a line in the right-hand memo
(frequency order). Line number 135 corresponds to the frequency 16.
After selecting that line, click on:
“Confirm”.
Then click on:
“Continue”.
2.6) Click the button:
“Open a dictionary (Dtm format)”
Then open the dictionary file
“TDA_dic.txt”,
which contains the identifiers of the 14 variables.
The dictionary file is displayed in a window. Another window indicates
the status of each variable (numerical or categorical).
2.7) Press the button:
“Open a data file (Dtm format)”
Open the data file: “TDA_dat.txt”.
That data file comprises 1043 rows and 15 columns (identifier of the rows
[between quotes] + 14 values [corresponding either to numerical variables or to item
numbers of categorical variables], separated by at least one blank space).
A new window displays the data file.
2.8) Click the button: “Continue
(select active and supplementary elements)”.
A new window is displayed, allowing for the selection of active variables.
We suggest selecting categorical variable number 14 (age - education).
Only one active variable can be selected in the ANALEX case.
All the remaining variables can be selected as supplementary elements;
they will serve to describe the categories of the active variable.
2.9) Click then on the button:
“Continue”
A new window devoted to the selection of active observations (rows) is
displayed. Click on the button:
“All the observations will be active”.
The window
“Create a starting parameter file”
is displayed.
2.10) Click on:
“1-Select some options”.
A new window entitled “Options:
Bootstrap and/or clustering of observations” is displayed.
Click “yes”
for the “Bootstrap validation”, then click
“Enter” to confirm the
default number of replicates (25). Ignore the other suggested bootstrap options.
You are then back in the previous window.
2.11) Then click:
“2-Create a first parameter file”
A parameter file is displayed in the memo
[it can be edited by advanced users; it allows the same analysis to be
performed again later, if needed].
Important:
The parameter file is saved as
“Param_ANALEX.txt”
in the current directory. If you wish, you could now exit from DtmVic
and, later on, use the main-menu button
“Open an existing command file” (line:
“Command file”) to open the file
“Param_ANALEX.txt” directly
and, in so doing, reach this point of the process again,
using the “Execute”
command of the main menu.
2.12) Click: “3-Execute”.
This step will run the basic computation steps present in the command file:
archiving data and text, characteristic words and responses, correspondence analysis
of the lexical table, thorough descriptions of categories using other variables.
3) Basic numerical results
Click on the “Basic numerical results” button.
The button opens an html file (created and saved) named
“imp.html”,
which contains the main results of the previous basic computation
steps. After perusing these numerical results, return to the main
menu. Note that this file is also saved under another name:
“imp.html”
concatenated with the date and time of the analysis (continental
notation), so that
“imp_08.07.09_14.45.html”
means July 8th, 2009, at 2:45 p.m. That file keeps the main numerical
results as an archive, whereas the file
“imp.html”
is replaced by each new analysis performed in the same directory.
This file is also saved in a simple text format, under the name
“imp.txt”,
and likewise with a name including the date and time of execution.
From the step NUMER, we learn
for instance that we have 1043 responses, with a total number of words
(occurrences, or tokens) of 13 919, involving 1 365 distinct words (or: types).
Using a frequency threshold of 16 [the same threshold is denoted 15 in the result file:
the first neglected frequency], the total number of kept words reduces
to 10 738, whereas the number of distinct kept words reduces (more drastically) to 136.
The book “Exploring Textual Data” (op.
cit.) deals in detail with this pre-processing and with all
the results that follow.
4) Steps VIC (Visualization, Inference, Classification)
4.1) Click the button: “AxeView”
... and follow the sub-menus.
In fact, two tabs are relevant for this example:
“Active variables”
[= categories, in the case of ANALEX] and
“Individuals” [= words].
After clicking on: “View”,
one obtains the set of principal coordinates along each axis.
Clicking on a column header produces a ranking of all the rows
according to the values of that column. Evidently, the use of the
AxeView menu is justified when the data set is large, which is the
case here.
Return.
4.2) Press the button:
“PlaneView Research”... and follow
the sub-menus...
In this example, three items of the menu are relevant:
“Active columns (variables or categories)”,
“Rows (Individuals)”,
and “Active columns + Rows”.
This last item concerns both the rows and the columns of the contingency
table (lexical table). The graphical displays of the selected pairs of
axes are then produced. The active categories (columns of the lexical
table) are printed in red, while the active words (rows) are printed
in blue.
The roles of the different buttons are straightforward, except perhaps
the button: “Rank”, which is useful only in the case of very
intricate displays (which is not the case here). (See comments in
the texts relating to examples A.1 and A.2).
4.3) Click on the button:
“BootstrapView”
This button opens the “DtmVic:
Bootstrap - Validation - Stability – Inference”
window.
4.3.1 Click on: “LoadData” .
In this case (partial bootstrap), the replicated coordinates file to
be opened is named:
“ngus_par_boot1.txt” .
(The set of possible files is given by the panel).
4.3.2 Click on: “Confidence
Areas” submenu, and choose the pair of axes to be
displayed (select axes 1 and 2 to begin with).
4.3.3 We obtain the list of the identifiers of the active rows and columns
(the identifiers of columns [categories
age x education] are at the end
of the list). Since the column set is quite small, tick all the white
boxes to select all the categories, select also some words, and press
the button: “Select”.
– Click on: “Confidence Ellipses”
to obtain the graphical display of the chosen column points in red
and of the row points (individuals, or observations) in
blue. We can see that, individually, some words have no significant
position. In this display, we learn for example that almost all the
age-education groups (column points) have distinct “lexical
profiles”, except the categories “-30-low” [less
than 30 years old, low level of education] and “-30-medium”
[less than 30 years old, medium level of education], whose confidence
areas largely overlap.
We can also see that some inflections of the verb "to be",
such as "is, be, are, being", may have locations that differ significantly.
– Close the display window, and, again in the blue window, press:
“Convex hulls” .
The ellipses are now replaced with the convex hulls of the replicates
for each point. The convex hulls take into account the peripheral points,
whereas the ellipses are drawn using the density of the clouds of replicates.
The two pieces of information are complementary.
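A covariance-based confidence ellipse for one point can be derived from its cloud of replicates; the sketch below is generic (the 95% scale factor is an illustrative choice, not DtmVic's exact rule):

```python
import math

def confidence_ellipse(replicates, scale=2.447):
    """From the cloud of bootstrap replicates of a single point,
    return (centre, semi-axis lengths, angle in radians) of a
    covariance-based ellipse.  scale ~ sqrt of the 95% chi-square
    quantile with 2 df (an illustrative choice)."""
    n = len(replicates)
    mx = sum(x for x, _ in replicates) / n
    my = sum(y for _, y in replicates) / n
    sxx = sum((x - mx) ** 2 for x, _ in replicates) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in replicates) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in replicates) / (n - 1)
    # eigenvalues of the 2 x 2 covariance matrix give the axis lengths,
    # the eigenvector direction gives the orientation of the ellipse
    tr, det = sxx + syy, sxx * syy - sxy * sxy
    disc = math.sqrt(max(tr * tr / 4 - det, 0.0))
    l1, l2 = tr / 2 + disc, tr / 2 - disc
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return ((mx, my),
            (scale * math.sqrt(l1), scale * math.sqrt(max(l2, 0.0))),
            angle)

# Replicates spread only along the x direction: the ellipse degenerates
centre, axes, angle = confidence_ellipse(
    [(0.0, 0.0), (2.0, 0.0), (0.0, 0.0), (2.0, 0.0)], scale=1.0)
```

Because the ellipse depends only on the mean and covariance of the cloud, it ignores isolated peripheral replicates, which is exactly where it differs from the convex hull.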
Return.
4.4) Click on: “ClusterView”
4.4.1 Choose the axes (1 and 2 to begin with), and
“Continue” .
4.4.2 Click on “View” .
The locations of the 9 categories (variable 14: age-education)
appear on the first principal plane. Thanks to a possible change
of sign for the axes, the display is the same as that provided by the
“PlaneView Research”
procedure.
4.4.3 Activate the button: “Words”,
and, pointing with the mouse at a specific category, press
the right mouse button. A description of the category involving
its most characteristic words appears.
This description is again redundant with
that of the step MOCAR (file “imp.txt”).
But we can observe here the pattern of categories and their relative
locations.
4.4.4 Activate the button:
“Texts”.
Pointing with the mouse at a specific category and pressing the
right mouse button, we can read the most characteristic
responses of the selected category.
More explanation about the corresponding methodology can
be found in the book “Exploring Textual Data” (L. Lebart,
A. Salem, L. Berry; Kluwer Academic Publishers, 1998).
Return.
4.5) Click “Kohonen map”
4.5.1 Select: “variables
+ observations (rows + columns)”:
the active elements are the words and
the texts (here: the categories) in this example.
4.5.2 Select a (5 x 5) map, and
“continue” .
4.5.3 Press “Draw”
on the menu of the large green window entitled “Kohonen
map”.
4.5.4 You can change the font size
(“Font”)
and dilate the obtained Kohonen map (“Dilat.”)
to make it more legible. The words appearing in the same cell are
often associated with the same responses. This property holds, to a
lesser degree, for contiguous cells.
4.5.5 Note that we have obtained a simultaneous Kohonen
representation of rows and columns, owing to the use, as input file,
of the coordinates from the correspondence analysis of the lexical
table.
4.6) Click: “Seriation”
The aim of seriation techniques has
been briefly described in section 4.6 of example A.4. Seriation
will be applied here to the lexical table cross-tabulating the 9 categories
of respondents and the selected words (those appearing at least 16 times in
the corpus). In this version of DtmVic, seriation can be obtained only after
the three types of analysis: SCA, VISUTEX and ANALEX. All these approaches
involve correspondence analysis of contingency tables.
A new window named “Reordering”
appears.
Click on the button:
“Reordering the rows and the columns of a word-text table”.
The reordered table cross-tabulating the 9 categories and the selected
words is then displayed. It can be seen that the first words of the
reordered list of words characterize (sometimes exclusively) the
first categories in the reordered list of categories. The last words
of the same list are either absent or rarely observed among these
categories. However, they are frequent among the last categories
(right hand side of the table).
General remark.
As you can observe when looking at the content of the example’s
directory, several files have been created and saved [these files are
briefly described in the memo “Help
about files” in the toolbar of the main menu].
If you need to use again the buttons of
the paragraph VIC of the main menu after having closed DtmVic, just
click on the button “Open an existing command file”
from the line “command file”,
select and open the saved command file:
“Param_ANALEX.txt” ,
and close it. It is not necessary to click on:
“execute”
again. You can then continue your investigation (axes views, graphs,
maps, etc.).
The advanced users can also edit the parameter file:
“Param_ANALEX.txt” , (using the
memo “Help about parameters”
in the toolbar of the main menu) to perform a new analysis in which
the parameters are given new values.
It is advised to save the edited file under a new name
(distinct from “Param_ANALEX.txt”).
All the intermediate files will be replaced (except the files
“imp_date_time.txt”
and “imp_date_time.html” which
are the only saved archives).
End of example A.5
Example A.6: EX_A06.Text-Responses_2.
(Open questions in a sample survey: Direct analysis and link with closed-end questions)
Example A.6 aims at describing directly the responses to an open-ended question
in a sample survey, without prior agglomeration, in relation to a set of
categorical variables. The survey and the responses are the same as in example A.5.
More explanation about this type of example and the corresponding
methodology can be found in the book: “Exploring Textual data”
(L. Lebart, A. Salem, L. Berry; Kluwer Academic Publisher, 1998).
We can take advantage of the presence of closed-end questions to describe
the clusters, not only with characteristic words and responses, but also with categories, selected
after a step SELEC, and analysed through the step DECLA.
Another new step of the command file, POSIT, describes the location of these supplementary
categories in the plane spanned by the first principal axes.
1) Looking at the three files: data, dictionary and texts.
To have a look at the data, search for the directory:
DtmVic_Examples.
In this directory, open the sub-directory
DtmVic_Examples_B_Texts .
In that directory, open the directory of Example A.6, named:
“EX_A06.Text-Responses_2”.
It is recommended to use one directory for each application, since
DtmVic produces a lot of intermediate txt-files related to the application.
At the outset, such a directory must contain at least 3 files:
a) the data file,
b) the dictionary file,
c) the text file,
a) Data file: TDA_dat.txt
(same as that of Example A.5)
This file contains responses to questions which were included in the
multinational survey conducted in seven countries (Japan, France, Germany,
United Kingdom, USA, Netherlands, Italy) in the late nineteen eighties
(Hayashi et al., 1992).
It is the United Kingdom survey which is presented here.
It deals with the responses of 1043 individuals to 14 questions.
Some questions concern objective characteristics of the respondent
or his/her household (age, status, gender, facilities).
Other questions relate to attitude or opinions.
The data file "TDA_dat.txt"
comprises 1043 rows and 15 columns (identifier of rows [between quotes]
+ 14 values [corresponding either to numerical variables or
to item numbers of categorical variables] separated by at least
one blank space).
b) Dictionary file: TDA_dic.txt
(same as that of Example A.5)
The dictionary file "TDA_dic.txt"
contains the identifiers of these 14 variables. In this version of Dtm-Vic,
the identifiers of categories must begin at: "column 6"
[a fixed-width font - also known as a teletype font - such as
"Courier" should be used to facilitate this kind of format].
c) Text file: TDA_TEX.txt
(same as that of examples A.5)
Let us recall its characteristics.
It contains the free responses of 1043 individuals to three open-ended questions.
Firstly, the following open-ended question was asked: "What is the
single most important thing in life for you?" It was followed by
the probe: "What other things are very important to you?".
A third question (not analysed here) was also asked:
“What does the culture of your own country mean to you?”.
We refer to the previous example (example A.5) for comments about the data format.
2) Generation of a command file (or: “parameter file”)
2.1) Click the button: “Create a command file”
of the main menu: Basic Steps, line:
“Command File”.
A window:
“Choosing among some basic analyses”
appears.
2.2) Click then on the button:
“VISURECA analysis - (Visualization and Clustering of responses with
supplementary categorical data)” – located in the
section “Textual and numerical data”.
A window “Opening a text file”
is displayed
2.3) Press the button: “Open the text file”,
then search for the directory:
“DtmVic_Examples_A_Start”.
In that directory, open the directory of example A.6, named
“EX_A06.Text-Responses_2” .
Then open the text file:
“TDA_tex.txt”.
A message box indicates then that the corpus comprises 7329 lines,
1043 observations and 3 open questions.
2.4) Click on:
“Select Open questions and separators”
The next window allows for the selection of open questions and the selection
of separators of words (the default separators suffice in this example).
We will select questions 1 and 2 (that means that the two responses will be
merged). It is licit here to merge the two responses because question 2
is a probe for question 1.
2.5) Click directly on:
“Vocabulary and counts” .
The next window presents the vocabulary (alphabetic and frequency orders).
We must select a frequency threshold by selecting a line in the right-hand
side memo (frequency order). The line number 397 [first column] corresponds to
the frequency 4 [second column]. (We took a threshold of 16 in the previous
example A.5. Individual responses are lexically very poor, so more words must
be kept here to avoid generating too many empty responses after thresholding.)
We'll keep the 397 most frequent words. After selecting that line, click on:
“Confirm”.
The selected frequency threshold appears in a message box. Reply: "OK".
Then click on: “Continue” .
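What this window computes can be mimicked directly: count the words of the corpus, list them in alphabetic order and in decreasing-frequency order, and keep those reaching the chosen threshold. A hedged sketch (the sample responses and the separator list are invented for illustration):

```python
from collections import Counter

def vocabulary(responses, separators=" .,;:()!?"):
    """Word counts in alphabetic order and in decreasing-frequency order."""
    table = str.maketrans({ch: " " for ch in separators})
    counts = Counter()
    for resp in responses:
        counts.update(resp.lower().translate(table).split())
    alpha = sorted(counts.items())
    freq = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return alpha, freq

def keep_words(freq, threshold):
    """Words whose frequency reaches the chosen threshold."""
    return {w for w, n in freq if n >= threshold}

responses = ["Family is most important.",
             "Health, family and friends.",
             "My health; my family."]
alpha, freq = vocabulary(responses)
kept = keep_words(freq, 2)   # keep the words occurring at least twice
```

Lowering the threshold keeps more distinct words, which is exactly the trade-off discussed above: a richer vocabulary, but a sparser table.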
A window “Dictionary and data files”
appears.
2.6) Click the button:
“Open a dictionary (Dtm format)”
Then open the dictionary file:
“TDA_dic.txt”.
The dictionary file:
TDA_dic.txt contains the identifiers
of the 14 variables.
The dictionary file is displayed in a window. Another window indicates the
status of each variable (numerical or categorical).
2.7) Press the button:
“Open a data file (Dtm format)”
Open the data file:
“TDA_dat.txt”
That data file comprises 1043 rows and 15 columns (identifier of rows
[between quotes] + 14 values [corresponding either to numerical variables
or to item numbers of categorical variables] separated by at least one
blank space).
A new window displays the data file.
2.8) Click the button: “Continue
(select active and supplementary variables)”.
A new window is displayed, allowing for the selection of active variables.
There is no active variable here, since the responses to the 2 open questions
are active: we actually chose the active material by selecting the open-ended
questions 1 and 2.
All the remaining variables can be selected as supplementary elements.
They will serve to describe the clusters of respondents.
2.9) Click then on the button:
“Continue”
A new window devoted to the selection of active observations (rows)
is displayed. Click on the button:
“All the observations will be active”
.
The window:
“Create a starting parameter file”
is displayed.
2.10) Then click directly on:
“Create a first parameter file”.
For this type of analysis, there is no bootstrap validation option at this
stage (the bootstrap is performed implicitly).
The clustering is automatic, and the number of clusters is selected by default
according to the number of responses (30 clusters in this case).
[This number of clusters can be changed by editing the command file
(or parameter file) before the execution; the parameters to be altered
belong to the "STEP PARTI" and "STEP DECLA"].
A parameter file is displayed in the memo [It can be edited by the advanced
users. It allows for performing again the same analysis later on, if needed].
Important:
The parameter file is saved as
“Param_VISURECA.txt”
in the current directory. If you wish, you could now exit from DtmVic,
and, later on, use the button of the main menu
“Open an existing command file” (Section:
“Command file”) to open directly the file
“Param_VISURECA.txt”,
and, in so doing, reach directly this point of the process,
using the “Execute”
command of the main menu.
Let us recall that this set of commands comprises 14 steps:
ARDAT (archiving data),
ARTEX (Archiving texts),
SELOX (selecting the open question),
NUMER (numerical coding of the text),
ASPAR (correspondence analysis of the [sparse]
contingency table “respondents - words”),
CLAIR (Brief description of factorial axes),
RECIP (Clustering using a hierarchical
classification of the clusters - reciprocal neighbours method),
PARTI (Cut of the dendrogram produced by
the previous step, and optimisation of the partition obtained),
MOTEX (crosstabulating the partition produced
by step PARTI with words: the obtained
contingency table is called a lexical table),
MOCAR (characteristic words, and
characteristic responses for each class of the partition),
SELEC (selecting active and supplementary elements),
DECLA (systematic description of the classes
of the partition produced by step PARTI using
the other relevant categorical variables),
POSIT (illustrating the principal spaces
of responses with supplementary categorical variables).
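Among these steps, MOTEX is the hinge between the clustering and the textual part: it cross-tabulates the partition of the respondents with the selected words. The resulting lexical table can be pictured with the toy sketch below (pure Python; the responses, cluster labels and word list are invented):

```python
from collections import Counter

def lexical_table(responses, clusters, words):
    """Cross-tabulate a partition of respondents with selected words:
    one row per cluster, one column per word (a MOTEX-style table)."""
    table = {k: Counter() for k in set(clusters)}
    for resp, k in zip(responses, clusters):
        for w in resp.lower().split():
            if w in words:
                table[k][w] += 1
    return [[table[k][w] for w in words] for k in sorted(table)]

responses = ["family health", "family family", "money work", "work money money"]
clusters  = [1, 1, 2, 2]
words = ["family", "health", "money", "work"]
T = lexical_table(responses, clusters, words)
# T[0] counts the words of cluster 1, T[1] those of cluster 2
```

It is this clusters-by-words contingency table that the subsequent steps (MOCAR, and the correspondence analysis of the lexical table) work on.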
2.11) Click: “Execute”.
This step will run the basic computation steps present in the command file:
archiving data and text, characteristic words and responses, correspondence
analysis of the lexical table, thorough descriptions of clusters using both
words and categorical variables.
7) Click the button:
“Basic numerical results”
The button opens a created (and saved) html file named
“imp.html”
which contains the main results of the previous basic computation
steps. After perusing these numerical results, return to the main
menu. Note that this file is also saved under another name. The name
“imp.html”
is concatenated with the date and time of the analysis (continental
notation). That file keeps as an archive the main numerical results
whereas the file “imp.html”
is replaced for each new analysis performed in the same directory.
This file is also saved in a simple text format, under the name
“imp.txt” ,
and likewise with a name including the date and time of execution.
Perusing the complete list of words highlights some errors in the original
text file (inevitable in real-sized applications): for instance,
the symbol “]” was absent from the list of separators,
and created some new “words”…
8) At this stage, we click on one of the lower buttons of the basic steps
panel (Steps: “VIC”)
9) Click the button “AxeView”
... and follow the sub-menus. In fact, three tabs are relevant for this
example: “Active variables”
[= words in the case of the analysis "VISURECA"],
“Individuals (observations)
[= respondents]” and
“Supplementary Categories”.
After clicking on “View”
in each case, one obtains the set of principal coordinates along each axis.
Clicking on a column header produces a ranking of all the rows according to
the values of that column. In this particular example, this is somewhat
redundant with the printed results of the step
“CLAIR”.
Return.
10) Click the button: PlaneView Research ,
and follow the sub-menus...
In this example, six items of the menu are relevant
“Active columns (variables or categories)”
(principal coordinates of the active words),
“Supplementary categories”
(coordinates of the supplementary categories derived from the step
“ POSIT ”),
“Active rows (individuals, observations)”
(coordinates of the respondents),
“Active columns + Active rows”,
“Active individuals (density)”
and
“Active columns + Supplementary categories” .
The graphical displays of the chosen pairs of axes are then produced.
11) About the button:
“BootstrapView”...
In fact, the bootstrap is implicitly performed for the analyses VISURESP and VISURECA.
No parameter needs to be specified. The replicate file "ngus_dir_var_boot.txt" is created
using the so-called “specific bootstrap”.
Using the button “BootstrapView”, we will have to load the file "ngus_dir_var_boot.txt",
and select the words whose confidence ellipses should be drawn.
The bootstrap replicates are in this case obtained after a drawing with replacement of
the respondents (or: rows, individuals, observations).
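The resampling behind this specific bootstrap can be sketched as follows: the respondents are drawn with replacement, and the word counts are recomputed for each replicate (in the full analysis, each replicate is then projected as supplementary points onto the original axes to draw the ellipses). The data and function names below are illustrative assumptions, not DtmVic code:

```python
import random
from collections import Counter

def bootstrap_word_counts(responses, n_replicates, seed=0):
    """Draw respondents with replacement and recount the words for each
    replicate (the resampling behind the 'specific bootstrap')."""
    rng = random.Random(seed)
    replicates = []
    for _ in range(n_replicates):
        sample = [responses[rng.randrange(len(responses))] for _ in responses]
        counts = Counter(w for resp in sample for w in resp.lower().split())
        replicates.append(counts)
    return replicates

responses = ["family health", "family money", "work money", "family work"]
reps = bootstrap_word_counts(responses, n_replicates=50)
# the count of any given word varies from one replicate to the next;
# this variability is what the confidence ellipses summarize
spread = {c["family"] for c in reps}
```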
12) Click on “ ClusterView ”
12.1 Choose the axes (1 and 2 to begin with), and “Continue”.
12.2 Click on “View” .
The centroids of the 30 clusters (produced by the step
PARTI) appear on the first principal plane.
12.3 Activate the button:
“Words” ,
and, pointing with the mouse on a specific cluster, press the right
button of the mouse. A description of the cluster involving the most
characteristic words of the cluster appears. This description is
somewhat redundant with that of the Step MOCAR .
But this display exhibits the pattern of clusters and their
relative locations.
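The characteristic words of a cluster are those clearly over-represented in it compared with the whole corpus. The sketch below uses a crude z-score on word frequencies as a stand-in for the test values actually computed by steps MOCAR and DECLA; the data are invented:

```python
import math
from collections import Counter

def characteristic_words(responses, clusters, target, top=3):
    """Rank words by over-representation in one cluster versus the whole
    corpus, using a crude z-score (a stand-in for DtmVic's test values)."""
    all_counts, in_counts = Counter(), Counter()
    for resp, k in zip(responses, clusters):
        for w in resp.lower().split():
            all_counts[w] += 1
            if k == target:
                in_counts[w] += 1
    N = sum(all_counts.values())             # corpus size
    n = sum(in_counts.values())              # cluster size
    scored = []
    for w, f in all_counts.items():
        p = f / N                            # global relative frequency
        sd = math.sqrt(n * p * (1 - p)) or 1.0
        scored.append(((in_counts[w] - n * p) / sd, w))
    return [w for s, w in sorted(scored, reverse=True)[:top]]

responses = ["family family health", "family health", "money work", "work money money"]
clusters  = [1, 1, 2, 2]
top = characteristic_words(responses, clusters, target=2, top=2)
```

Words with large negative scores would, symmetrically, be the "anti-characteristic" words of the cluster.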
12.4 Activate the button “Texts” .
Pointing with the mouse on a specific cluster, and pressing the right
button of the mouse, we can read the most characteristic responses of
the selected cluster.
12.5 Activate the button:
“Categorical” .
Pointing with the mouse on a specific cluster, and pressing the
right button of the mouse, we can read the most characteristic
categories of the selected cluster. This description is somewhat
redundant with that provided in the results file (file “imp.txt”)
by the step DECLA. But we do have simultaneously in front of us the
pattern of categories and their relative locations.
13) Click on “Kohonen map”
Select the type of coordinate.
13.1 Select:
“Variables (columns)” :
these active variables are the words in this example.
13.2 Select a (5 x 5) map, and continue.
13.3 After clicking on two small
check-boxes, press “Draw”
on the menu of the large green window entitled “Kohonen map”.
13.4 You can change the font size
(“Font” )
and dilate the obtained Kohonen map
( “Dilat.” )
to make it more legible. The words appearing in the same cell are
often associated with the same responses. This property holds, to a
lesser degree, for contiguous cells.
13.5 Pressing “AxeView” ,
and selecting one axis allows one to enrich the display with pieces
of information about a specified principal axis : large positive
coordinates in red colour, large negative coordinates in green, with
some transitional hues.
13.6 Go back to the main menu, click on
“Kohonen map” and choose the item
“Observations”
13.7 Select a (10 x 10) map, and redo the operations 13.3 to 13.5 for the
observations.
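For readers curious about the underlying algorithm: a Kohonen (self-organizing) map attaches a codebook vector to each grid cell; the best-matching cell, together with its grid neighbours, moves toward each presented point, so that nearby inputs end up in the same or contiguous cells. A minimal numpy sketch, under the assumption that the inputs are the first principal coordinates (the toy points are not DtmVic data):

```python
import numpy as np

def train_som(data, grid=(5, 5), n_iter=500, seed=0):
    """Minimal self-organizing map: one codebook vector per grid cell."""
    rng = np.random.default_rng(seed)
    n_cells = grid[0] * grid[1]
    W = rng.normal(size=(n_cells, data.shape[1]))
    # (row, col) position of every cell, used by the neighbourhood function
    pos = np.array([(i, j) for i in range(grid[0]) for j in range(grid[1])], float)
    for t in range(n_iter):
        x = data[rng.integers(len(data))]
        bmu = np.argmin(((W - x) ** 2).sum(axis=1))    # best-matching unit
        sigma = 2.0 * (1 - t / n_iter) + 0.3           # shrinking neighbourhood radius
        lr = 0.5 * (1 - t / n_iter) + 0.01             # shrinking learning rate
        d2 = ((pos - pos[bmu]) ** 2).sum(axis=1)       # grid distances to the winner
        h = np.exp(-d2 / (2 * sigma ** 2))             # neighbourhood weights
        W += lr * h[:, None] * (x - W)
    return W, pos

def assign_cells(data, W):
    """Index of the winning cell for every input point."""
    return [int(np.argmin(((W - x) ** 2).sum(axis=1))) for x in data]

# two tight groups of points: each group should land in the same or nearby cells
pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [3.0, 3.0], [3.1, 3.0], [3.0, 3.1]])
W, pos = train_som(pts)
cells = assign_cells(pts, W)
```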
End of example A.6
End
of tutorial A
DtmVic - Tutorial B
DtmVic and textual data
Three more examples to practise DtmVic with textual data
Unlike Tutorial A, Tutorial B contains examples which use existing
command files (or: parameter files). Each example corresponds to a directory included
in the directory “DtmVic_Examples_B_Texts” that has been
downloaded with DtmVic.
Application examples B.1—B.3
Example B.1. EX_B01.Text-Responses_Corda
(Open questions in a sample survey: First exploration)
First processing of the responses to an open-ended question. Examples of
modification of the frequency threshold for words. Example of
concordances (syntactic context) for some words. Numerical coding of the responses.
Correspondence Analysis (CA) of the sparse lexical table words x respondents,
clustering of the responses, and a description of the obtained clusters through
their characteristic words and responses. Kohonen map for words and for
respondents.
Example B.2.
EX_B02.Text-Responses_MCA
(Open questions and MCA in a sample survey)
Multiple Correspondence Analysis and
Clustering of respondents using closed questions. Processing
aggregated [and lemmatised] responses to open questions. Example B.2
illustrates another technique for grouping and processing responses
to an open question in a sample survey. In a
first phase, a multiple correspondence analysis is performed on a set
of selected categorical variables (i.e.: responses to closed-end
questions). The principal axes visualisation is complemented with a
clustering, followed by an automatic description of the clusters.
These clusters are then used to aggregate the responses to an open
question. The survey, the closed-end questions and the textual
responses are the same as those of previous examples A.5, A.6 and B.1.
Example B.3. EX_B03.Text-Semantic.
(Visualization
of the Semantic network of French verbs)
Visualization of the semantic links existing between 829 French verbs. Each verb
is described by a list of “synonyms”. This example is in
fact similar to example B.1 (Responses to an open question). The
“respondents” are here the 829 verbs. The (fictitious) open-ended
question is “Which are your synonyms ?”, and the textual
response is constituted by a list of synonyms.
Example B.1: EX_B01.Text-Responses_Corda
(Textual Data Analysis: A single open question)
Example B.1 aims at describing the responses to an open-ended question in a
sample survey. The principal axes visualization is complemented by a
clustering, with an automatic description of the clusters. Example of
modification of the frequency threshold for words. Example of
concordances (syntactic context) for some words. This is a
typical first outlook on the set of responses: to detect and describe
the main groupings of responses. Such an outlook is by no means a
complete analysis.
Example A.6, above, provided another point of view, making use of other
pieces of information about the respondents.
To have a look at the data, search for the directory
DtmVic_Examples.
In this directory, open the sub-directory
DtmVic_Examples_B_Texts.
In that directory, open the directory of Example B.1, named
“EX_B01.Text-Responses_Corda”
.
It is recommended to use one directory for each application, since DtmVic
produces a lot of intermediate txt-files related to the application.
At the outset, such a directory must contain 2 files:
- a) the text file,
- b) the command file.
(in this particular context, there is neither a data file nor a
dictionary file: only the open-ended questions are considered here,
leaving aside the closed-end questions)
a) Text file: TDA_TEX.txt
This file has already served as an example for Examples A.5 and A.6 of
Tutorial A. It contains the free responses of 1043 individuals to three
open-ended questions.
Firstly, the following open-ended question was asked:
"What is the single most important thing in life for you?"
It was followed by the probe: "What other things are very important
to you?".
A third question was also asked:
“What does the culture of your own country mean to you?”
We analyse here the responses to this third question.
See examples A.5 and A.6 for a description of both the data and the corresponding files.
b) Command file:
“EX_B01_Param.txt”
As shown in Tutorial A, a command file similar to
“EX_B01_Param.txt” can also be
generated by clicking on the button:
“Create a command file” of
the main menu (Basic Steps). A window “Choosing
among some basic analyses” appears.
Click in this case on the button:
“VISURESP– Visualization of Responses”
– located in the paragraph “Textual
data”, and follow the instructions as indicated in Tutorial A.
The computational phase of the analysis is decomposed
into "steps". Each step requires some parameters briefly described
in the main menu of DtmVic (button:
"Help about command parameters" ) and,
with more details, below (Appendix B.1).
Running the example B.1 and reading the results
1) Click on the button:
“Open an existing command file”
(panel DTM: Basic Steps of the main menu)
2) Then, search for the sub-directory:
DtmVic_Examples_B_Texts
in: DtmVic_Examples.
3) In that directory, open the directory of Example B.1:
“EX_B01.Text-Responses_Corda”.
4) Open the command file: “EX_B01_Param.txt”
After identifying the textual data file, 11 "steps" are
performed:
ARTEX (Archiving texts),
SELOX (selecting the open question),
NUMER (numerical coding of the text: now,
all the words are kept),
CORDA (concordance for some selected words),
SETEX (introducing a new threshold for the
frequencies of words),
ASPAR (correspondence analysis of the [sparse]
contingency table “respondents - words”),
CLAIR (Brief description of factorial axes),
RECIP (Clustering using a hierarchical
classification of the clusters - reciprocal neighbours method),
PARTI (Cut of the dendrogram produced by the
previous step, and optimisation of the partition obtained),
MOTEX (crosstabulating the partition produced
by step PARTI with words: the obtained contingency table is called a
lexical table),
MOCAR (characteristic words, and characteristic
responses for each class of the partition).
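The pair RECIP + PARTI can be pictured as: build a hierarchy by repeatedly merging the two closest clusters, then cut the tree at the desired number of classes. The sketch below uses plain centroid-linkage agglomeration instead of DtmVic's reciprocal-neighbours algorithm and omits the consolidation step; the points are invented:

```python
import numpy as np

def hierarchical_cut(points, n_clusters):
    """Agglomerative clustering: repeatedly merge the two clusters whose
    centroids are closest, stop when n_clusters remain, return labels."""
    clusters = [[i] for i in range(len(points))]
    cents = [points[i].astype(float) for i in range(len(points))]
    while len(clusters) > n_clusters:
        best, pair = None, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = float(np.sum((cents[a] - cents[b]) ** 2))
                if best is None or d < best:
                    best, pair = d, (a, b)
        merged = clusters[pair[0]] + clusters[pair[1]]
        cent = points[merged].mean(axis=0)
        clusters = [c for i, c in enumerate(clusters) if i not in pair] + [merged]
        cents = [c for i, c in enumerate(cents) if i not in pair] + [cent]
    labels = [0] * len(points)
    for k, members in enumerate(clusters):
        for i in members:
            labels[i] = k
    return labels

# three well-separated pairs of points -> three clusters
pts = np.array([[0, 0], [0.2, 0], [5, 5], [5.2, 5], [10, 0], [10, 0.2]])
labels = hierarchical_cut(pts, 3)
```

In DtmVic the points would be the respondents' coordinates on the first axes kept from the correspondence analysis, and the cut is followed by a partition-consolidation phase.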
We will comment later on this command file (see Appendix B.1), which
drives the basic computation steps. Instead of editing this file,
we directly go back to the main menu and execute the basic computation steps.
5) Return to the main menu (“return to execute”)
6) Click on the button: “Execute”
This step will run the basic computation steps present in the command file:
archiving text, correspondence analysis of the lexical table, brief
description of the axes, clustering procedure, thorough descriptions
of clusters using characteristic words and responses.
7) Click the button:
“Basic numerical results”
The button opens a created (and saved) html file named
“imp.html”
which contains the main results of the previous basic computation
steps. After perusing these numerical results, return to the main
menu. Note that this file is also saved under another name. The name
“imp.html”
is concatenated with the date and time of the analysis (continental
notation): “imp_08.07.09_14.45.html”
means July 8th, 2009, at 2:45 p.m. That file keeps as an archive
the main numerical results whereas the file:
“imp.html” is replaced for
each new analysis performed in the same directory.
This file is also saved in a simple text format, under the name
“imp.txt” ,
and likewise with a name including the date and time of execution.
From the step NUMER, we learn for instance that
we have 1043 responses, with a total number of words (occurrences, or
tokens) of 11559, involving 1629 distinct words (or: types). Using a
frequency threshold of 8 (see STEP SETEX in the command file below),
the total number of kept words reduces to 9148, whereas the number of distinct
kept words reduces (drastically) to 170.
From the step CORDA, we can observe the contexts of the selected words
(see the command file in Appendix B.1): life, money, love, museum, fish.
Note that from the button "Create a command file" of the main menu, we can build the command file
leading to the step "CORDA" (button: "Other analyses" and button "CORDA").
The book “Exploring textual data”
(L. Lebart, A. Salem, L. Berry; Kluwer, 1998) deals in detail with
this pre-processing and with all the processing steps that follow.
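A concordance (step CORDA) simply lists every occurrence of a word together with a few words of left and right context. A simplified key-word-in-context sketch (the identifiers and responses are invented):

```python
def concordance(responses, word, width=3):
    """Key-word-in-context listing: each occurrence of `word` with a few
    words of left and right context (a simplified CORDA)."""
    hits = []
    for ident, resp in responses:
        tokens = resp.lower().replace(",", " ").replace(".", " ").split()
        for i, tok in enumerate(tokens):
            if tok == word:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                hits.append((ident, left, tok, right))
    return hits

responses = [("0001", "Money is not the most important thing."),
             ("0002", "Family, money and health."),
             ("0003", "Good health above all.")]
hits = concordance(responses, "money")
```

Reading such listings is a quick way to check whether a frequent word is used with a single meaning before interpreting its position on the factorial displays.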
8) At this stage, we click on one of the lower buttons of the basic
steps panel (Steps: “VIC”)
9) Click the button:
“AxeView”
... and follow the sub-menus.
In fact, only two tabs are relevant for this example:
“Active variables”
[= words in the case of step ASPAR],
“Individuals (observations) [= respondents]” .
After clicking on “View”
in both cases, one obtains the set of principal coordinates along
each axis.
Clicking on a column header produces a ranking of all the rows according to the
values of that column. In this particular example, this is somewhat
redundant with the printed results of the step
“CLAIR” .
Evidently, the use of the AxeView menu is justified when the data set is large,
which is the case here.
Return.
10) Click the button:
“PlaneView Research”
and follow the sub-menus...
In this example, four items of the menu are relevant:
“Active columns (variables or categories)”,
“Active rows (individuals, observations)”,
“Active columns + Active rows”,
“Active individuals (density)”.
The graphical displays of the chosen pairs of axes are then produced.
The roles of the different buttons are straightforward, except perhaps
the button: “Rank” ,
which is useful only in the case of very intricate displays (which
is the case here). Since the set “individuals” has 1043
elements, it is possible to test, with this example, partial
printings of the individuals in two subsets of 50% or four subsets of
25%…(subsets randomly drawn without replacement).
Return.
11) About the button: “BootstrapView”.
In fact, the bootstrap is implicitly performed for the analyses VISURESP and VISURECA.
No parameter needs to be specified. The replicate file "ngus_dir_var_boot.txt" is created
using the so-called “specific bootstrap”.
Using the button “BootstrapView”, we will have to load the file "ngus_dir_var_boot.txt",
and select the words whose confidence ellipses should be drawn.
The bootstrap replicates are in this case obtained after a drawing with replacement of
the respondents (or: rows, individuals, observations).
12) Click on: “ClusterView”
12.1 Choose the axes (1 and 2 to begin with), and
“Continue” .
12.2 Click on: “View”.
The centroids of the 12 clusters (Step PARTI) appear on the first
principal plane.
12.3 Activate the button:
“Words” , and,
pointing with the mouse on a specific cluster, press the right button
of the mouse. A description of the cluster involving the most
characteristic words of the cluster appears. This description is
somewhat redundant with that of the Step MOCAR. But we do have in
front of us the pattern of clusters and their relative locations.
12.4 Activate the button:
“Texts” .
Pointing with the mouse on a specific cluster, and pressing the right
button of the mouse, we can read the most characteristic responses of
the selected cluster.
13) Click on: “Kohonen map”.
Select the type of coordinate.
13.1 Select: “Active
variables (columns)” :
these active variables are the words in this example.
13.2 Select a (5 x 5) map, and continue.
13.3 After clicking on two small check-boxes, press:
“Draw”
on the menu of the large green windows entitled Kohonen map.
13.4 You can change the font size
(“Font” )
and dilate the obtained Kohonen map:
( “Dilat.” )
to make it more legible. The words appearing in the same cell are
often associated with the same responses. This property holds, to a
lesser degree, for contiguous cells.
13.5 Pressing “AxeView” ,
and selecting one axis allows one to enrich the display with pieces
of information about a specified principal axis : large positive
coordinates in red colour, large negative coordinates in green, with
some transitional hues.
13.6 Go back to the main menu, click on “Kohonen
map” and choose the item:
“Active observations”
13.7 Select a (10 x 10) map, and redo the operations 13.3 to 13.5 for the
observations.
In the context of this example, the other items of the menu are not relevant.
Appendix B1: (for
advanced users)
The command file can be generated using the menu:
“Create_parameters”
Beginners may therefore skip this appendix.
The computational phase of the analysis is decomposed into "steps".
Each step requires some parameters briefly described in the main menu of
DtmVic (button: "Help about
parameters" ).
Now, we exhibit the command file that contains comments
(preceded by #).
As seen previously, comments are also allowed in the (mandatory) line
that immediately follows a statement "STEP xxxxx"
Command file “EX_B01_Param.txt”
# The Program DtmVic needs 2 files in this "open survey case"
# -------------------------------------------
# 1) The present file of commands, whatever its name.
# 2) The text file (NTEXZ).
# Syntax: ">"= continuation, "#"= comments
#----------------------------------------------------------
LISTP = yes, LISTF = no # Global parameters(leave as it is)
#
NTEXZ = 'TDA_tex.txt' # name of text file (free name)
#
STEP ARTEX
==== Archive - Texts or responses to open ended questions
ITYP=2 NBQT=3
#---------------------- Comments about step ARTEX
# - ITYP: type of textual data file NTEXZ
# ITYP = 2 ==> type of file = responses to open questions
# - NBQT: number of questions per respondent
# NBQT = 3 ==> there are 3 open questions
#----------------------------------------------------------
#
STEP SELOX
==== Selection of open questions (and of individuals)
NUMQ = LIST
3
#---------------------- Comments about step SELOX
# - NUMQ: index of the selected question
# if NUMQ = -1 or NUMQ = LIST : several questions
# will be merged (the list of question numbers
# must follow immediately next line)
# here: question 3 is selected
#----------------------------------------------------------
STEP NUMER
==== Numerical coding of words
NSEU = 0 LEDIT=TOT
weak -
strong . ; : ( ) ! ? ,
end
#---------------------- Comments about step NUMER
# - NSEU: frequency threshold of the kept words
# (here NSEU = 0: all the words are kept; the threshold
# is set later by step SETEX)
# - LEDIT: printing the words (0=no; 1=alphabetical order;
# 2=frequency order; 3= both 1 and 2).
# --- key-words:
# - weak (weak separators) followed by those separators
# [separators of words]
# - strong (strong separators) followed by those separators
# [separators of segments, for step SEGME]
# - end ... indicates the end of key-words statements.
#----------------------------------------------------------
STEP CORDA # concordances
==========
LEDIT = 1
FORME life money love museum fish
END
#---------------------- Comments about step CORDA
#LEDIT: printing identifiers of individuals
# (0 = no printing, 1 = identifiers of
# respondents are printed, default = 0)
# --- key-word of headings :
# FORME must be followed by the selected words
# END end of the selection.
#----------------------------------------------------------
#---- selecting a new threshold for words -
NSPB ='NSPB'
# the file NSPB created by SETEX is given the name: ‘NSPB’
STEP SETEX
============================
NSEU =8 NMOMI=0 NREMI=2 LEDIT =NEW
#---------------------- Comments about step SETEX
# NSEU: threshold of frequency for selecting words.
# nmomi: minimum number of letters of a kept word.
# nremi: minimum number of words of a kept response.
# ledit: printing the dictionaries (0=no, 1=new, 2=tot).
#----------------------------------------------------------
NSPA = 'NSPB'
#----- the file ‘NSPB’ created by SETEX is substituted for
# the file NSPA that was created by NUMER.
STEP ASPAR
==== Correspondence analysis of the table: Words X Responses
NAXE=8 LEDIT=0 NGRAF=5 NROWS=60 NPAGE=1 NBASE=12 NITER=20
#---------------------- Comments about step ASPAR
# - NAXE: number of requested principal coordinates
# - LEDIT: printing the responses
# (0 = no; 1 = coordinates of variables;
# 2 = 1 + coordinates of respondents)
# - NGRAF: number of requested printer graphics
# in the results file “imp.txt”
# NGRAF = 5 means that we will get the printouts of
# the planes spanned by the following pairs of axes:
# (1, 2), (2, 3), (3, 4), (4, 5), (5, 6).
# - NPAGE: number of pages of these graphics
# - NROWS: number of lines of these graphics
# The two following parameters concern an option
# for diagonalizing very large matrices: (if NBASE > 0)
# - NBASE: dimension of the approximation space
# (NBASE = 0: main core diagonalization)
# - NITER: number of iterations (if NBASE > 0)
#--------------------------------------------------
STEP CLAIR
==== Brief description of NAXE principal axes
NAXE=6 LIGN=no NMAX=40
#---------------------- Comments about step CLAIR
# - NAXE = ... number of axes to be described
# - LIGN = no means that lines (or rows, or individuals,
# or respondents) are excluded
# - NMAX = ... Maximum number of elements that will
# be sorted to describe each axis
#--------------------------------------------------
STEP RECIP
==== Clustering of respondents using reciprocal neighbours
NAXU=7 LDEND=DENSE NTERM=20 LDESC=no
#---------------------- Comments about step RECIP
# This step carries out a hierarchical clustering
# using the reciprocal neighbours technique (recommended
# when dealing with less than 1000 individuals).
# - naxu... number of axes kept from the
# previous MCA .
# - LDEND... printing dendrogram (0=no, 1=dense,
# 2=large).
# - nterm... number of kept terminal elements
# NTERM = TOT means that all the elements are kept.
# - LDESC... describing nodes of the tree (0=no, 1=yes).
#--------------------------------------------------
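To fix ideas, here is a minimal sketch of the reciprocal-neighbours idea: at each pass, every pair of clusters that are mutual nearest neighbours is merged. The sketch uses centroid linkage on invented toy points; DtmVic's actual criterion (typically Ward's, applied to the NAXU factor coordinates) and bookkeeping are more elaborate.

```python
import numpy as np

def reciprocal_nn_clustering(X):
    """Agglomerative clustering that, at each pass, merges every pair of
    clusters that are reciprocal (mutual) nearest neighbours.
    Centroid linkage, for simplicity of the sketch."""
    clusters = {i: (X[i], 1) for i in range(len(X))}   # id -> (centroid, size)
    merges = []
    next_id = len(X)
    while len(clusters) > 1:
        ids = list(clusters)
        # nearest neighbour of each current cluster
        nn = {}
        for i in ids:
            others = [j for j in ids if j != i]
            nn[i] = min(others,
                        key=lambda j: np.sum((clusters[i][0] - clusters[j][0]) ** 2))
        merged = set()
        for i in ids:
            j = nn[i]
            if i < j and nn[j] == i and i not in merged and j not in merged:
                (ci, ni), (cj, nj) = clusters.pop(i), clusters.pop(j)
                clusters[next_id] = ((ni * ci + nj * cj) / (ni + nj), ni + nj)
                merges.append((i, j, next_id))
                merged |= {i, j}
                next_id += 1
    return merges

X = np.array([[0., 0.], [0.1, 0.], [5., 5.], [5.1, 5.], [10., 0.]])
history = reciprocal_nn_clustering(X)
```

Since several reciprocal pairs can be merged in the same pass, the dendrogram is built faster than with classical one-merge-at-a-time agglomeration.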
STEP PARTI
==== Cut of the dendrogram to obtain 9 clusters
NITER=7 LEDIN=3
9 # number of classes of the partition
#---------------------- Comments about step PARTI
# - NITER... number of "consolidation" iterations (0=no).
# - LEDIN... printing the correspondences classes-
# individuals (3 = printing of the correspondence
# classes->individuals and the correspondence
# individuals-->classes).
# The line immediately following the command must
# contain the size of the desired final partition
# (here: 9).
#--------------------------------------------------
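The "consolidation" iterations of PARTI are reassignment passes in the style of k-means: each observation is moved to the nearest cluster centroid, then the centroids are recomputed. A sketch on toy data (illustrative only):

```python
import numpy as np

def consolidate(X, centroids, n_iter=7):
    """'Consolidation' of a partition: a few k-means iterations that
    reassign each observation to the nearest cluster centroid, then
    recompute the centroids."""
    centroids = centroids.copy()
    for _ in range(n_iter):
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for k in range(len(centroids)):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return labels, centroids

X = np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]])
labels, cent = consolidate(X, np.array([[0., 2.], [9., 9.]]))
```

In DtmVic the starting centroids come from cutting the dendrogram; NITER plays the role of `n_iter`.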
STEP MOTEX
==== cross-tabulating words and clusters
NVSEL=-1 LEDIT = 0
#---------------------- Comments about step MOTEX
# NVSEL: index of the categorical variable defining
# the groupings of texts
# the conventional value NVSEL = -1 means that
# the categorical variable coincides with the
# previously computed partition.
# LEDIT: parameter for printing the table words*texts
# (0=no, 1=yes).
#--------------------------------------------------
STEP MOCAR
==== Characteristic words for each cluster
NOMOT=10 NOREP=6
#---------------------- Comments about step MOCAR
# NOMOT: number of requested characteristic words for
# each text (i.e: for each cluster)
# NOREP: number of characteristic responses for each text.
# MOCAR considers as a characteristic response for a category
# a response containing as many characteristic words as possible
# (with penalties for anti-characteristic words).
#--------------------------------------------------
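A common way to select characteristic words, in the spirit of MOCAR's criterion, is a signed "test-value" comparing each word's frequency inside a cluster with the frequency expected under random allocation. The sketch below uses the normal approximation of the hypergeometric model; the exact criterion and penalty scheme used by MOCAR may differ, and the word counts are invented:

```python
import math

def characteristic_words(counts_in_cluster, counts_total, n_top=3):
    """Rank words by a signed 'test-value': the number of standard
    deviations between a word's count inside one cluster and the count
    expected under random allocation (normal approximation of the
    hypergeometric model)."""
    n_k = sum(counts_in_cluster.values())        # tokens in the cluster
    n = sum(counts_total.values())               # tokens in the corpus
    scores = {}
    for w, f in counts_total.items():
        k = counts_in_cluster.get(w, 0)
        expected = n_k * f / n
        var = n_k * (f / n) * (1 - f / n) * (n - n_k) / (n - 1)
        scores[w] = (k - expected) / math.sqrt(var) if var > 0 else 0.0
    return sorted(scores, key=scores.get, reverse=True)[:n_top]

corpus = {"work": 50, "family": 40, "money": 10}
cluster = {"work": 2, "family": 3, "money": 8}
top = characteristic_words(cluster, corpus, n_top=1)
```

A word over-represented in the cluster gets a large positive test-value; anti-characteristic words get negative ones.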
STOP
#-------------------------------------------------------
End of example B.1
Example B.2: EX_B02.Text-Responses_MCA
(Open questions and MCA in a sample survey)
Example B.2 illustrates another technique for grouping and processing
responses to an open question in a sample survey. In a first phase, a
multiple correspondence analysis is performed on a set of selected
categorical variables (i.e: responses to closed-end questions). The
principal axes visualisation is complemented with a clustering,
followed by an automatic description of the clusters. These clusters
are then used to aggregate the responses to an open question. The
survey, the closed-end questions and the textual responses are the
same as those of previous examples.
More explanation about this type of example and the corresponding
methodology can be found in the book: “Exploring Textual Data”
(L. Lebart, A. Salem, L. Berry; Kluwer Academic Publishers, 1998).
The sequence of steps is enriched by the following computations:
As in Example B.1, the numerical coding (step NUMER )
is performed with a frequency threshold of 0 : all the words (types)
are kept. We can then carry out the new step CORTE ,
allowing us to perform a “primary lemmatization” of the
text. (see also the procedure CORTEX
from the menu invoked by the button “Create a command file” of the main menu,
and the complementary command files whose names end with “TEX”).
We can now take advantage of the presence of both open-ended and
closed-end questions to describe the clusters, not only with
characteristic words and responses (as done previously in Example
B.1), but also with categories. Another new step:
POLEX , describes the location of the words
in the plane spanned by the first principal axes.
To have a look at the data, search for the directory:
DtmVic_Examples.
In this directory, open the sub-directory
DtmVic_Examples_B_Texts .
In that sub-directory, open the directory of Example B.2, named
“EX_B02.Text-Responses_MCA” .
It is recommended to use one directory for each application, since DtmVic
produces a lot of intermediate txt-files related to the application.
At the outset, such a directory must contain 4 files:
- a) the data file,
- b) the dictionary file,
- c) the text file,
- d) the command file.
a) Data file: TDA_dat.txt
(same as that of Examples A.5 and A.6)
This file contains responses to questions which were included in the
multinational survey [see also Examples A.5 and A.6] conducted in seven countries
(Japan, France, Germany, United Kingdom, USA, Netherlands, Italy) in the
late nineteen eighties (Hayashi et al., 1992).
It is the United Kingdom survey which is presented here.
It deals with the responses of 1043 individuals to 14
questions. Some questions concern objective characteristics of the
respondent or his/her household (age, status, gender, facilities).
Other questions relate to attitude or opinions.
The data file "TDA_dat.txt"
comprises 1043 rows and 15 columns (identifier of rows [between
quotes] + 14 values [corresponding either to numerical variables or
to item numbers of categorical variables] separated by at least one
blank space).
b) Dictionary file: TDA_dic.txt
(same as that of Examples A.5 and A.6)
The dictionary file "TDA_dic.txt"
contains the identifiers of these 14 variables. In this version of
DtmVic, the identifiers of categories must begin at: "column 6"
[using a fixed-width font such as "Courier"].
c) Text file: TDA_TEX.txt
(same as that of examples A.5, A.6, and B.1)
We refer to the previous examples for comments about the questionnaire
and the data format.
d) Command file: EX_B02_Param.txt
The computational phase of the analysis is decomposed into "steps".
Each step requires some parameters briefly described in the main menu of
DtmVic (button: "Help about command parameters"
) and, with more details, below.
Note that another command file, similar to
the command file “EX_B02_Param.txt”,
can also be generated by clicking on the button:
“Create a command file”
of the main menu (Basic Steps).
A window
“Choosing among some basic analysis”
appears. Click then on the button:
“MCA_Texts – Visualization of Responses”
located in the paragraph “textual and
numerical data” , and follow the instructions.
Running the example B.2 and reading the results
1) Click on the button:
“Open an existing command file”
(panel DTM: Basic Steps of the main menu)
2) Then, search for the sub-directory:
“DtmVic_Examples_B_Texts”
in “DtmVic_Examples”.
3) In that sub-directory, open the directory of Example B.2
named “EX_B02.Text-Responses_MCA”
4) Open the existing command file: EX_B02_Param.txt.
After identifying the textual data file, 16 "steps" are performed:
ARDAT (archiving data),
ARTEX (Archiving texts),
SELOX (selecting the open question),
NUMER (numerical coding of the text: now,
all the words are kept),
CORTE (deleting some function words [or empty words], declaring
inflected forms of the same lemma as equivalent),
SETEX (introducing a new threshold for the frequencies of words),
SELEC (selecting active and supplementary elements),
MULTM (Multiple correspondence analysis),
DEFAC (Brief description of factorial axes),
POLEX ( projecting the words of the responses as supplementary elements in
the principal planes),
RECIP (Clustering using a hierarchical classification of the clusters -
reciprocal neighbours method),
PARTI (Cut of the dendrogram produced by the previous step, and
optimisation of the partition obtained),
DECLA (systematic description of the
classes of the partition produced by step PARTI
using the other relevant categorical variables),
MOTEX (crosstabulating the partition produced by step PARTI
with words: the obtained contingency table is a “lexical table”),
MOCAR (characteristic words, and characteristic responses for each class
of the partition),
RECAR (characteristic responses for each
class of the partition using a different criterion of selection,
allowing for lengthy responses).
We will comment later on this command file (Appendix B.2 of this section),
which drives the basic computation steps. Instead of editing this
file, we will directly go back to the main menu and execute the basic
computation steps.
5) Return to the main menu (
“Return to execute” )
6) Click on the button: “Execute”
This step will run the basic computation steps present in the command file.
7) Click the button:
“Basic numerical results”
The button opens a newly created (and saved) HTML
file named “imp.html”
which contains the main results of the previous basic computation
steps. After perusing these numerical results, return to the main
menu. Note that this file is also saved under another name. The name
“imp.html”
is concatenated with the date and time of the analysis (continental
notation). That file keeps as an archive the main numerical results
whereas the file “imp.html”
is replaced for each new analysis performed in the same directory.
This file is also saved in a simple text format, under the name
“imp.txt” ,
and likewise with a name including the date and time of execution.
From the step NUMER ,
with the new threshold of “0”, we check for instance that
we still have 1043 responses, with a total number of words
(occurrences, or tokens) of 13 918, involving 1 368 distinct words
(or: types). In this version of DtmVic, the results of the new step
CORTE are confined to this “result file”.
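The token and type counts reported by NUMER can be reproduced on any small corpus: tokens are word occurrences, types are distinct word forms. A minimal sketch (whitespace tokenisation only; NUMER's handling of "weak" and "strong" separators is richer):

```python
def token_type_counts(responses):
    """Token and type counts in the style of step NUMER: tokens are
    word occurrences, types are distinct word forms."""
    tokens = [w for r in responses for w in r.lower().split()]
    return len(tokens), len(set(tokens))

# two invented toy responses
responses = ["my family and my work", "work and health"]
n_tokens, n_types = token_type_counts(responses)
```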
Return.
8) At this stage, we click on one of the lower buttons of the basic steps
panel (Steps: “VIC”)
9) Click the button “AxeView”
and ... follow the sub-menus. Here, four tabs are relevant for this
example: “Active variables”
[= categories in this MCA case], “Supplementary
categories”, “Individuals (observations)” [= respondents],
and “Supplementary lexical
units” (provided by step POLEX
= projections of words onto the axes of the MCA).
After clicking on “View”
in each case, one obtains the set of principal coordinates along
each axis.
Clicking on a column header produces a ranking of all the rows according to the
values of that column.
Return.
10) Click the button: “PlaneView Research”
and follow the sub-menus...
In this example, seven items of the menu are relevant:
“Active columns (variables or categories)”
(active categories of the Multiple Correspondence Analysis),
“Supplementary categories”
(supplementary categories of the same MCA),
“Active rows (individuals, observations)”,
“Active columns + Active rows”,
“Supplementary lexical units”
(projection of the words used by the respondents in their responses to
the open question, provided by step POLEX),
“Active individuals (density)”,
and “Active columns + Supplementary categories”.
The graphical displays of the chosen pairs of axes are then produced.
Return.
11) Click the button: “BootstrapView” [The Bootstrap concerns here the MCA]
This button opens the DtmVic-Bootstrap-Stability window.
11.1 Click: “LoadData ” .
In this case (partial bootstrap), the two replicated coordinate files
to be opened are named “ngus_var_boot.txt”
and “ngus_sup_cat_boot.txt”
(see the panel reminding the names of the relevant files
below the menu bar).
In fact, ngus_var_boot.txt
contains both active and supplementary categories. The file
ngus_sup_cat_boot.txt
contains only supplementary categories, for which the bootstrap
procedure is all the more meaningful.
11.2 Click on the “Confidence
areas” submenu, and choose the pair of axes to be
displayed (choose axes 1 and 2 to begin with).
11.3 Tick the chosen white boxes to select the elements the
location of which should be assessed, and press the button
“Select”.
Select, for instance, the supplementary elements “male, female,
less than 30 years old with high level of education, over 55 with
high, and also with low, level of education”.
11.4 Click on:
“Confidence Ellipses”
to obtain the graphical display of the active category points (in blue colour),
and of the supplementary category points (in red).
In this display, we learn for example that in this
principal space (built as a “space of opinions”, due to
the selection of active questions), male and female do not occupy
statistically distinct locations (ellipses overlapping). As shown by the
locations of other categories, age and education lead to distinct
patterns of opinions.
11.5 Close the display window, and press
“Convex hulls” .
The ellipses are now replaced by the convex hulls of the replicates for each point.
The convex hulls take into account the peripheral points, whereas the
ellipses are drawn using the density of the clouds of replicates. The
two pieces of information are complementary. Go back to the main
menu.
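The ellipses mentioned above are drawn from the density of the replicate cloud; concretely, the axes of a concentration ellipse come from the eigendecomposition of the 2 × 2 covariance matrix of the replicates. A sketch (on simulated replicates; the coverage constant assumes a Gaussian cloud, and DtmVic's exact drawing routine is its own):

```python
import numpy as np

def confidence_ellipse(replicates, coverage=2.45):
    """Axes of the concentration ellipse of a cloud of bootstrap
    replicates: eigendecomposition of the 2x2 covariance matrix.
    coverage ~ sqrt(chi2_{2,0.95}) ≈ 2.45 gives a ≈95% ellipse
    under a Gaussian assumption."""
    center = replicates.mean(axis=0)
    cov = np.cov(replicates.T)
    eigvals, eigvecs = np.linalg.eigh(cov)     # ascending eigenvalues
    half_axes = coverage * np.sqrt(eigvals)    # semi-axis lengths
    return center, half_axes, eigvecs

# simulated replicates of one category point
rng = np.random.default_rng(0)
reps = rng.normal([1.0, -0.5], [0.2, 0.05], size=(500, 2))
center, half_axes, axes = confidence_ellipse(reps)
```

The convex hull, by contrast, is determined by the peripheral replicates only, which is why the two displays are complementary.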
12) Click on: “ClusterView ”
12.1 Choose the axes (1 and 2 to begin with), and:
“Continue” .
12.2 Click on: “View” .
The centroids of the 7 clusters (produced by Step PARTI) appear on
the first principal plane.
12.3 Activate the button:
“Categorical”.
Pointing with the mouse on a specific cluster, and pressing the
right button of the mouse, we can read the most characteristic
categories of the selected cluster. This description is somewhat
redundant with that provided in the results file (file “imp.txt”)
by the step DECLA. But we do have simultaneously in front of us the
pattern of categories and their relative locations.
12.4 Activate the button “Words”
, and, pointing with the mouse on a specific cluster, press the right
button of the mouse. A description of the cluster involving the most
characteristic words of the cluster appears. This description is
somewhat redundant with that of the Step MOCAR. But, again, we do
have in front of us the pattern of clusters and their relative
locations.
12.5 Activate the button: “Texts”.
Pointing with the mouse on a specific cluster, and pressing the right
button of the mouse, we can read the most characteristic responses of
the selected cluster.
Return.
13) Click on “Kohonen map”
Select the type of coordinate.
13.1 Select:
“Columns (variables)” :
these active variables are the categories in this example.
13.2 Select a (4 x 4) map, and continue.
13.3 After clicking on some check-boxes, press:
“Draw”
on the menu of the large green windows entitled Kohonen map.
13.4 You can change the font size:
(“Font” )
and dilate the obtained Kohonen map:
( “Dilat.” )
to make it more legible. The categories appearing in the same cell are
often associated in the same responses. This property holds, to a
lesser degree, for contiguous cells.
13.5 Pressing “AxeView” ,
and selecting one axis allows one to enrich the display with pieces
of information about a specified principal axis: large positive
coordinates in red colour, large negative coordinates in green, with
some transitional hues.
13.6 Go back to the main menu, click on:
“Kohonen map” and choose
the item “Active observations”
13.7 Select a (12 x 12) map, and redo the previous operations for the
observations (the button “Dilat.”
is now indispensable).
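For reference, the map itself is produced by a self-organising (Kohonen) map algorithm: each input pulls its best-matching unit, and that unit's grid neighbours, towards it, with a radius and learning rate that shrink over time. A deliberately minimal sketch on toy data (DtmVic's own parameters and neighbourhood schedule are not documented here):

```python
import numpy as np

def train_som(X, grid=(4, 4), n_iter=200, seed=0):
    """Minimal self-organising (Kohonen) map: for each input, find the
    best-matching unit and pull it and its grid neighbours towards
    the input, with radius and learning rate shrinking over time."""
    rng = np.random.default_rng(seed)
    h, w = grid
    units = rng.normal(size=(h, w, X.shape[1]))
    coords = np.array([[i, j] for i in range(h) for j in range(w)]).reshape(h, w, 2)
    for t in range(n_iter):
        x = X[rng.integers(len(X))]
        d = ((units - x) ** 2).sum(axis=2)
        bi, bj = np.unravel_index(d.argmin(), d.shape)
        radius = max(h, w) / 2 * (1 - t / n_iter) + 0.5
        lr = 0.5 * (1 - t / n_iter) + 0.01
        grid_d = ((coords - np.array([bi, bj])) ** 2).sum(axis=2)
        influence = np.exp(-grid_d / (2 * radius ** 2))
        units += lr * influence[:, :, None] * (x - units)
    return units

# two toy groups of observations in 3 dimensions
X = np.vstack([np.zeros((20, 3)), np.ones((20, 3))])
units = train_som(X, grid=(4, 4))
```

Because neighbouring units are updated together, items mapped to the same or contiguous cells tend to be similar, which is exactly the reading rule given above.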
Appendix B.2
(for advanced users)
A similar (but not identical) command file can be generated using the menu
“Create_parameters”. Therefore, beginners could skip this
appendix.
The computational phase of the analysis is decomposed
into "steps". Each step requires some parameters briefly
described in the main menu of DtmVic (button: "Help about
parameters").
Command file: EX_B02_Param.txt
Now, we will exhibit the command file that contains
comments (preceded by #).
# ---------------- EX_B02_Param.txt : Textual Data Analysis ----
# The Program DtmVic needs 4 files in this "open survey case"
# -------------------------------------------------------------
# 1) The present file of commands, whatever its name.
# 2) The text file (NTEXZ).
# 3) The dictionary file (NDICZ).
# 4) The data file (NDONZ).
# Syntax: ">"= continuation, "#"= comments
#--------------------------------------------------------------
LISTP = yes, LISTF = no # leave as it is...
NTEXZ = 'TDA_tex.txt' # text file (same as in example TDA1)
NDICZ = 'TDA_dic.txt' # dictionary file
NDONZ = 'TDA_dat.txt' # data file
STEP ARDAT # Archiving data and dictionary
==========
NQEXA =14 , NIDI = 1, NIEXA =1043
# See Appendix B2 for the comments about this step
#--------------------------------------------------
STEP ARTEX # Archiving responses to 3 open questions
===========
ityp = 2 nbqt = 3 nlig=5
# See Appendix B1 for the comments about this step
# or the "Help about Command Parameters" (Main menu and Editor "Open").
#--------------------------------------------------
STEP SELOX # Selecting responses to questions 1 and 2
===========
NUMQ=LIST LDONA=1
1,2
# See Appendix B1 for the comments about this step
#--------------------------------------------------
STEP NUMER # extracting words : threshold= 0
===========
NSEU = 0, LEDIT = TOT NXMAX = 20000 coef = 10
weak -
strong . ? ; ( ) : , '
end
# See Appendix B1 for the comments about this step
#--------------------------------------------------
#---------------- example of pre-processing texts --
NSPC = 'NSPC'
# the file NSPC created by CORTE is given the name: ‘NSPC’
step CORTE
========== deletion and equivalence between words
LEDIT = 2
delet a an and at but by etc for from if in into of on or >
out over pp than the to up
equiv two 2
equiv be am m are re is been being was
equiv child children
equiv content contented
equiv can could
equiv would d
equiv do doing don
equiv enjoy enjoying
equiv family families
equiv get got getting
equiv go going
equiv have having ve
equiv help helping
equiv holiday holidays
equiv job jobs
equiv keep keeping
equiv live living
equiv look looking
equiv see seeing
equiv son sons
equiv sport sports
equiv thing things
equiv work working
equiv worry worries
end
#---------------------- Comments about step CORTE
# step CORTE (correction of texts) helps us to perform
# what we may term a manual lemmatisation.
# In fact, the frequency threshold NSEU should be “0”
# in the preceding step NUMER.
# The deletions concern mainly function words (or tool
# words, or auxiliary words, or grammatical words…).
# Many equivalences are found simply by looking at the
# alphabetical list of words provided by step NUMER.
# ledit: printing of words (0=no, 1=nspc, 2=tot).
# lclas: printing sorted words (0=no, 1=yes).
#
# CORTE uses 3 key-words whose meanings are straightforward:
# delet, equiv, end
#----------------------------------------------------------
# IMPORTANT NOTE
# The previous series of deletions and equivalences can be
# generated via the step CORTEX:
# Click on the button “Create” of the main
# menu (Basic Steps) and follow the proposed instructions
# (button: CORTEX, in the paragraph “Textual data”).
#----------------------------------------------------------
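The mechanics of "delet" and "equiv" are easy to mimic: deletions drop the listed words, and each member of an equivalence group is replaced by the group's head word. A sketch (on invented toy word lists, not the full lists above):

```python
def apply_corte(tokens, delete, equiv):
    """Primary lemmatisation in the spirit of step CORTE: drop the words
    listed after 'delet', and replace each inflected form by the head
    word of its 'equiv' group."""
    lemma = {form: group[0] for group in equiv for form in group}
    return [lemma.get(w, w) for w in tokens if w not in delete]

delete = {"a", "an", "and", "the", "of", "to"}
equiv = [["child", "children"], ["work", "working"], ["be", "am", "is", "was"]]
out = apply_corte("the children and I was working".split(), delete, equiv)
```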
NSPA = 'NSPC'
#----- the file ‘NSPC’ created by CORTE is substituted for
# the file NSPA that was created by NUMER.
#---- selecting a new threshold for words –
NSPB ='NSPB'
# the file NSPB created by SETEX is given the name: ‘NSPB’
STEP SETEX
============================
NSEU =15 NMOMI=0 NREMI=2 LEDIT =NEW
#---------------------- Comments about step SETEX
# NSEU: threshold of frequency for selecting words.
# nmomi: minimum number of letters of a kept word.
# nremi: minimum number of words of a kept response.
# ledit: printing the dictionaries (0=no, 1=new, 2=tot).
#----------------------------------------------------------
NSPA = 'NSPB'
#----- the file ‘NSPB’ created by SETEX is substituted for
# the file NSPA that was created by NUMER and modified by CORTE.
STEP SELEC
========== Selects active, supplementary variables and observations
LSELI = TOT, IMASS = UNIF, LZERO = REC, LEDIT = short
NOMI ILL 1 2 11 14
NOMI ACT 4--10
end
# See Appendix B2 for the comments about this step
#--------------------------------------------------
STEP MULTM
========== Multiple correspondence analysis
NAXE = 7, PCMIN = 2. , LBURT = TOT, LEDCO = yes NSIMU=10
#---------------------- Comments about step MULTM
# - NAXE = ... number of computed principal axes
# - PCMIN ... threshold for "cleaning" the active
# categories (in percent). This means that the low-
# frequency active categories (less than 2% in this
# case) are eliminated, and the corresponding
# individuals are dispatched at random among the
# other categories of the same variable (to remedy
# a well known weakness of the chi-square distance).
#
# - LBURT... printing the Burt contingency table
# (0=NO, 1=MASS, 2=TOT, 3=PROF).
# - LEDCO... printing the correlations variable-
# axes (0=no, 1=yes).
# - NSIMU... number of bootstrap replications (fewer than 30)
# (0 = no bootstrap)
#--------------------------------------------------
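The PCMIN "cleaning" described above can be sketched for a single categorical variable: categories held by fewer than `pcmin` percent of the individuals are removed, and their individuals are reassigned at random among the remaining categories. This is illustrative only (DtmVic applies the rule to each active variable in turn):

```python
import numpy as np

def clean_rare_categories(codes, pcmin=2.0, seed=0):
    """'Cleaning' of one categorical variable in the spirit of MULTM's
    PCMIN: categories held by fewer than pcmin percent of individuals
    are removed, and their individuals are reassigned at random among
    the remaining categories of the variable."""
    rng = np.random.default_rng(seed)
    codes = np.asarray(codes).copy()
    values, counts = np.unique(codes, return_counts=True)
    pct = 100.0 * counts / len(codes)
    keep = values[pct >= pcmin]
    rare = ~np.isin(codes, keep)
    codes[rare] = rng.choice(keep, size=rare.sum())
    return codes

codes = np.array([1] * 60 + [2] * 39 + [3] * 1)   # category 3 holds only 1%
cleaned = clean_rare_categories(codes, pcmin=2.0)
```

Removing near-empty categories avoids the excessive weight the chi-square distance would otherwise give them.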
STEP DEFAC # Description of factorial axes
========== Description of factorial axes
SEUIL = 40., LCRIM = VTEST, VTMIN = 2.0
VEC = 1--2 / MOD
end
#---------------------- Comments about step DEFAC
# SEUIL = ... Maximum number of elements that will
# be sorted to describe each axis
# LCRIM = ... Criterion for sorting the elements
# (here VTEST means “test-values”: signed numbers
# of standard deviations)
# VEC = ... list of axes to be described
# CONT = continuous variables , MOD = categories
# The key-word END indicates the end of the list.
#--------------------------------------------------
STEP POLEX
==== projecting supplementary words
ngraf = 2
#---------------------- Comments about step POLEX
# POLEX aims at positioning words on principal space
# (here: principal space provided by MCA of closed questions)
# ngraf = number of requested graphics (on file imp.txt)
#--------------------------------------------------
STEP RECIP
==== Clustering of respondents using reciprocal neighbours
NAXU=7 LDEND=DENSE NTERM=20 LDESC=no
# See Appendix B1 for the comments about this step
#--------------------------------------------------
STEP PARTI
==== Cut of the dendrogram to obtain 7 clusters
NITER=10 LEDIN=3
7 # number of classes of the partition
# See Appendix B1 for the comments about this step
#--------------------------------------------------
STEP DECLA
========== Systematic description of clusters
CMODA = 5.0, PCMIN = 2.0, LSUPR = no, CCONT = 5.0 >
LPNOM = no, EDNOM = no, EDCON = no
7 # list of numbers of classes of requested partitions
# See Appendix B2 for the comments about this step
#--------------------------------------------------
STEP MOTEX
=========== Cross-tabulating words and partition
NVSEL = -1, LEDIT = 1
#---------------------- Comments about step MOTEX
# See Appendix B1 for the comments about this step
#--------------------------------------------------
STEP MOCAR
==== Characteristic words for each cluster (criterion 1)
NOMOT=10 NOREP=6
# See Appendix B1 for the comments about this step
#--------------------------------------------------
STEP RECAR
=========== characteristic responses (criterion 2)
NOREP = 4
#---------------------- Comments about step RECAR
# NOREP: number of characteristic responses for each text.
# RECAR, for each cluster or category, computes the Chi-square
# distances between the responses and the mean-point of the category.
# Responses having the shortest distances are considered as
# characteristic of the category. This criterion is favourable
# to lengthy responses.
#--------------------------------------------------
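The RECAR criterion can be sketched directly from the comments above: compute the chi-square distance between each response's word profile and the mean profile of its cluster, and keep the closest responses. An illustrative NumPy sketch on an invented word-count table:

```python
import numpy as np

def characteristic_responses(response_profiles, labels, cluster, n_top=2):
    """Characteristic responses in the spirit of step RECAR: chi-square
    distance between each response's word profile and the mean profile
    of its cluster; the closest responses are the most characteristic."""
    P = response_profiles / response_profiles.sum(axis=1, keepdims=True)
    in_cluster = np.where(labels == cluster)[0]
    mean_profile = P[in_cluster].mean(axis=0)
    col_mass = P.mean(axis=0)                      # chi-square weights
    d2 = ((P[in_cluster] - mean_profile) ** 2 / col_mass).sum(axis=1)
    return in_cluster[np.argsort(d2)[:n_top]]

# toy word-count table: 5 responses x 3 words, two clusters
counts = np.array([[5., 1., 0.], [4., 2., 0.], [2., 4., 0.],
                   [0., 1., 5.], [1., 0., 6.]])
labels = np.array([0, 0, 0, 1, 1])
best = characteristic_responses(counts, labels, cluster=0, n_top=1)
```

Since the distance is computed on profiles of any length, this criterion indeed tends to favour lengthy (hence well-estimated) responses.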
STOP
#--------------------------------------------------
End of example B.2
Example B.3: EX_B03.Semantic
(Visualization of the Semantic network of French verbs)
Example B.3 provides a visualisation of the semantic links existing between
829 French verbs. Each verb is described by a list of synonyms. This
example is in fact very similar to Example B.1 (Responses to an open
question). The “respondents” are here the 829 verbs. The
fictitious open-ended question is “Which are your synonyms?”,
and the textual “response” is constituted by a list of
synonyms. The example is also similar to the “Japan Map”
example, pertaining to Example C.3 (Descriptions of graphs) from
Tutorial C.
The principal axes visualization is complemented by a clustering, with an
automatic description of the clusters. This is a typical first
outlook on the set of responses: to detect and describe the main
groupings of responses. Such outlook is by no means an achieved
processing.
For more information, please refer to the book:
“La sémiométrie”
(2003) [in French] by L. Lebart, M. Piron, J.F. Steiner; Publisher:
Dunod, Paris. (can be downloaded from
www.dtm-vic.com).
To have a look at the data, search for the directory
DtmVic_Examples.
In this directory, open the sub-directory
DtmVic_Examples_B_Texts .
In that directory, open the directory of Example B.03, named
“ EX_B03.Text-Semantic”.
It is recommended to use one directory
for each application, since DtmVic produces a lot of intermediate txt-files
related to the application. At the outset, such a directory must contain 2 files:
- a) the text file, synotex.txt
- b) the command file:
“syno_par.txt”
(in this particular context, there is neither a data file nor a dictionary file:
the fictitious questionnaire comprises one open-ended question,
without closed-end questions).
a) Text file: synotex.txt
The format is typical of responses to open questions (see examples A.5,
B.1, B.2). Since the “responses” (here: lists of synonym
verbs) may have different lengths, separators are used to distinguish
between these lists. Lists (in fact, responses) are separated by the
chain of characters “----” (starting in column 1), possibly
followed by an identifier. Like all the data files involved in DtmVic
as input files, that file is a raw text file (.txt). If the text
file comes from a text processing phase, it must be saved beforehand
as a “.txt file”.
b) Command file: EX_B03_Param.txt
The computational phase of the analysis is decomposed into "steps".
Each step requires some parameters briefly described in the main menu
of DtmVic (button: "Help about command parameters"
) and, with more details, below.
Note that this “command file”
“EX_B03_Param.txt”
can be also generated by clicking on the button
“Create a command file” of the main menu
(DTM: Basic Steps). A window “Choosing
among some basic analysis” appears. Click then on the button:
VISURESP
– Visualization of Responses – located in the paragraph
“Textual data” ,
and follow the instructions.
Running the example B.3 and reading the results
1) Click on the button:
“Open an existing command file”
(panel DTM: Basic Steps of the main menu)
2) Then, search for the sub-directory
DtmVic_Examples_B_Texts in:
DtmVic_Examples.
3) In that directory, open the directory of Example B.03, named
“EX_B03.Text-Semantic” .
4) Open then the command file: EX_B03_Param.txt
After identifying the textual data file, nine "steps" are
performed:
ARTEX (Archiving texts),
SELOX (selecting the open question),
NUMER (numerical coding of the text),
ASPAR (correspondence analysis of the [sparse] contingency table
“respondents - words”),
CLAIR (Brief description of factorial axes),
RECIP (Clustering using a hierarchical classification of the clusters -
reciprocal neighbours method),
PARTI (Cut of the dendrogram produced by the previous step, and
optimisation of the partition obtained),
MOTEX (cross-tabulating the partition produced by
step PARTI with words: the obtained contingency table is called a lexical table),
MOCAR (characteristic words, and characteristic responses for each class of
the partition).
Instead of editing this file, we go back directly to the main menu and execute
the basic computation steps.
Return to the main menu (“Return to execute”)
5) Click on the button: “Execute”
This step will run the basic computation steps present in the command file.
6) Click the button:
“Basic numerical results”
The button opens a newly created (and saved) HTML file named
“imp.html”
which contains the main results of the previous basic computation
steps. After perusing these numerical results, return to the main
menu. (Note that this file is also saved under another name.
See previous examples)
From the step NUMER, we learn for instance that we have
829 “responses”, with a total number of words (occurrences, or
tokens) of 17 446, involving 3 839 distinct words (or: types). Using a frequency
threshold of 12, the total number of kept words reduces to 5 013, whereas
the number of distinct kept words reduces (more drastically) to 280.
Return.
7) At this stage, we click on one of the lower buttons of the basic steps
panel (Steps: “VIC”)
8) Click the button:
“AxeView”
and ... follow the sub-menus. In fact, only two tabs are relevant for
this example:
“Active variables”
[= synonyms in the case of step ASPAR],
“Individuals (observations)” [= 829 initial words].
After clicking on “View”
in both cases, one obtains the set of principal coordinates along each axis.
Clicking
on a column header produces a ranking of all the rows according to the
values of that column. In this particular example, this is somewhat
redundant with the printed results of the step “CLAIR” .
Evidently, the use of the AxeView menu is justified when the data set
is large, which is the case here.
9) Click the button: PlaneView Research ...
and follow the sub-menus.
In this example, four items of the menu are relevant:
“Active columns (variables or categories)”
(= synonyms, here),
“Active rows (individuals, observations)” (= 829 original words),
“Active columns + Rows”,
“Individuals (density)”.
The graphical displays of the chosen pairs of axes are then produced.
The roles of the different buttons are straightforward, except perhaps
the button: “Rank” ,
which is useful only in the case of very intricate displays (which
is the case here). Since the set “individual” has 829
elements, it is possible to test, with this example, partial
printings of the individuals in two subsets of 50% or four subsets of
25% (subsets randomly drawn without replacement).
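The "Rank" transformation described above is simply a column-wise replacement of values by their ranks 1..n, which makes both marginal distributions uniform. A sketch:

```python
import numpy as np

def rank_transform(coords):
    """The 'Rank' button: replace each coordinate column by the ranks
    1..n of its values, so both marginal distributions become uniform
    and overlapping identifiers spread out."""
    ranks = np.empty_like(coords, dtype=int)
    for j in range(coords.shape[1]):
        order = np.argsort(coords[:, j])
        ranks[order, j] = np.arange(1, len(coords) + 1)
    return ranks

# three toy points whose abscissae nearly coincide
xy = np.array([[0.02, 1.5], [0.01, -0.3], [0.90, 0.1]])
rk = rank_transform(xy)
```

The order along each axis is preserved, at the cost of distorting the distances, which is why the display becomes more legible but less faithful.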
10) About the button: “BootstrapView”
In fact, the bootstrap is implicitly performed for the analyses VISURESP and VISURECA.
No parameter needs to be specified. The replicate file "ngus_dir_var_boot.txt" is created
using the so-called “specific bootstrap”.
Using the button “BootstrapView”, we will have to load the file "ngus_dir_var_boot.txt",
and select the words whose confidence ellipses should be drawn.
The bootstrap replicates are in this case obtained after a drawing with replacement of
the respondents (or: rows, individuals, observations).
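The resampling scheme described here (drawing respondents, i.e. rows, with replacement and re-running the analysis on each replicated table) can be sketched generically. The `analyse` callable below is a placeholder of our own (a trivial column-mean stand-in, not an actual correspondence analysis):

```python
import numpy as np

def bootstrap_column_coordinates(table, analyse, n_reps=5, seed=0):
    """Bootstrap in the spirit of the 'specific bootstrap': resample the
    respondents (rows) with replacement, rerun the analysis on each
    replicated table, and collect the replicated column coordinates.
    'analyse' is any function mapping a table to column coordinates."""
    rng = np.random.default_rng(seed)
    reps = []
    for _ in range(n_reps):
        rows = rng.integers(0, len(table), size=len(table))
        reps.append(analyse(table[rows]))
    return np.stack(reps)

# toy analysis: column means as 1-D "coordinates" (stand-in for CA axes)
table = np.arange(12.0).reshape(6, 2)
reps = bootstrap_column_coordinates(table, lambda t: t.mean(axis=0))
```

The spread of the replicated coordinates is what the confidence ellipses and convex hulls of BootstrapView summarise.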
11) Click on “ClusterView”
11.1 Choose the axes (1 and 2 to begin with), and
“Continue” .
11.2 Click on “View” .
The centroids of the 20 clusters (Step PARTI)
appear on the first principal plane.
11.3 Activate the button
“Words” , and,
pointing with the mouse on a specific cluster, press the right button
of the mouse. A description of the cluster involving the most
characteristic words of the cluster appears. This description is
somewhat redundant with that of the Step MOCAR .
But we do have in front of us the pattern of clusters and their
relative locations.
11.4 Activate the button “Texts” .
Pointing with the mouse on a specific cluster, and pressing the right
button of the mouse, we can read the most characteristic responses of
the selected cluster.
12) Click on: “Kohonen map”
Select the type of coordinate.
12.1 Select:
“Active variables (columns)” :
these active variables are the words in this example.
12.2 Select a (8 x 8) map, and continue.
12.3 After clicking on two small check-boxes, press
“Draw”
on the menu of the large green windows entitled Kohonen map.
12.4 You can change the font size
(“Font” )
and dilate the obtained Kohonen map
( “Dilat.” )
to make it more legible. The words appearing in the same cell are
often associated in the same verbs. This property holds, to a lesser
degree, for contiguous cells.
12.5 Pressing “AxeView” ,
and selecting one axis allows one to enrich the display with pieces
of information about a specified principal axis: large positive
coordinates in red colour, large negative coordinates in green, with
some transitional hues.
12.6 Go back to the main menu, click on
“Kohonen map” and choose the item
“Active observations”
12.7 Select a (10 x 10) map, and redo the operations 12.3 to 12.5 for the observations.
In the context of this example, the other items of the main menu are not relevant.
13 Click on
“Visualization”
A new window is displayed.
13.1 Click on: “Load coordinate”
In the corresponding sub-menu, choose the file:
“ngus_ind.txt” .
The principal coordinates of the individuals (rows) are selected.
13.2 Click then on “Select
or Create Partition”
In the corresponding sub-menu, choose
“no partition” .
13.3 Click on: “MST”
(Minimum Spanning Tree). Choose then the number of axes that will
serve to compute the Minimum Spanning Tree: full space (for example).
13.4 Click on: “ N.N.”
(search for Nearest Neighbours – limited to 20 NN).
13.5 Click on: “Graphics” .
Choose the axes 1 and 2 (default) in the small window “Description
of classes” and click on:
“Display” .
In the new window entitled “Visualisation-Graphics”
are displayed the individuals in the plane spanned by the selected axes.
A random colour is attributed to each cluster (if any).
The button “Change colour”
allows you to try a new set of colours.
About the window “Visualisation - Graphics”
(from the sub-menu Graphics)
On the vertical tool bar, you can press each button to activate it (red
colour), and press it again to cancel the activation (original colour)
The button “Density” ,
for the sake of legibility, replaces the identifiers of individuals by a
single character indicating the cluster (the identifier and the
cluster number can be obtained by clicking on the left button of the
mouse in the vicinity of each point).
The button “C.Hull”
(Convex hull) draws the convex hull of each cluster.
The button “MST”
(Minimum Spanning Tree) draws the minimum spanning tree.
The button “Ellipse”
performs a Principal Components
Analysis of each cluster within the two-dimensional sub-space of
visualisation and draws the corresponding ellipses (containing
roughly 95% of the points).
The button “N.N.”
(Nearest neighbours) joins each point to its nearest neighbours.
Pressing afterwards the button “N.N.
up” allows you to increment the number of neighbours up
to 20 nearest neighbours.
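The tree drawn by the “MST” button can be computed, for instance, with Prim's algorithm over the principal coordinates. The following Python sketch is illustrative only, not DtmVic's code:

```python
import math

def mst_edges(points):
    """Prim's algorithm: edges of the minimum spanning tree of a set of
    points in any dimension (one edge per vertex beyond the root)."""
    n = len(points)
    # best[i] = (distance to the growing tree, nearest tree vertex)
    best = {i: (math.dist(points[0], points[i]), 0) for i in range(1, n)}
    edges = []
    while best:
        i = min(best, key=lambda v: best[v][0])   # closest outside vertex
        _, j = best.pop(i)
        edges.append((j, i))                      # attach i to the tree
        for v in best:                            # update remaining distances
            d = math.dist(points[i], points[v])
            if d < best[v][0]:
                best[v] = (d, i)
    return edges
```

The n - 1 returned edges are exactly the segments the display draws between point identifiers.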
Appendix B.3
The steps and the command file of example B.3 are included in those of
Example B.1 (except for the name of the data file containing the
input text).
The reader should therefore refer to Appendix B.1 for the corresponding
comments.
Note that this command file can also be generated by clicking on the button
“Create a command file” of the main menu (DTM: Basic Steps),
and selecting the procedure: VISURESP
– Visualization of Responses – in the paragraph
“Textual data” .
End of example B3
End of tutorial B
DtmVic - Tutorial C
DtmVic and numerical data
Five more examples to practise DtmVic with numerical data:
Semiometry, Fisher’s Iris data, Graphs, Images.
Each example corresponds to a
directory included in the directory “DtmVic_Examples_C_NumData”
that has been downloaded with DtmVic.
Application examples C.1—C.4
Example C.1.
EX_C01.PCA_Semio
(Visualization in Principal Components Analysis)
Example C.1 aims at describing a set of numerical variables (an excerpt of
semiometric data) through Principal Components Analysis. The
principal axes visualisation is complemented with a clustering which includes an
automatic description of the clusters. Bootstrap procedures, Kohonen
maps are followed by the various tools of visualisation provided in
the sub-menu “ Visualization ”
of the phase “VIC” : visualisation of clusters (or
categories) using symbols or colours, convex hulls or density
ellipses for clusters, Minimum spanning tree, drawing of various
nearest neighbours graphs.
Example C.2.
EX_C02.PCA_Contiguity
(PCA and Contiguity Analysis on Fisher’s Iris Data)
Example C.2 aims at analysing a classical set of
numerical variables (The Iris data set of Anderson and Fisher)
through Principal Components Analysis, Classification, Contiguity
Analysis, Discriminant Analysis. The principal axes visualisation is
complemented by a clustering, with an automatic description of the
clusters.
At the outset, example C.2 is very similar to example
C.1: Principal components analysis and classification (clustering) of
a set of numerical data, with various tools of visualisation,
involving also a specific categorical variable. It then presents the
improvements provided by Contiguity Analysis and its particular case:
Linear Discriminant Analysis.
Example C.3.
EX_C03.Graphs
(Description of graphs through Correspondence Analysis)
Example C.3 aims at describing three
simple symmetrical planar graphs, mainly
through correspondence analysis. Unlike the previous examples, the
directory EX_C03.Graphs contains several
sub-directories and examples. The three graphs are planar graphs: a
chessboard-shaped graph, a cycle, and empirical graphs intended to
roughly represent maps of the regions of Japan and France.
The examples provide a bridge between distinct facets of DtmVic: the same
graph can lead to different input data: classical numerical data,
textual data, and a specific “external format”.
Example C.4.
EX_C04.Images
(Structural Compression of Images through SVD, CA and Discrete Fourier Transform)
Example C.4 could be viewed as a pedagogical appendix. It does not make use
of data in DtmVic format, since it deals with digitized images. A
simple rectangular array of integers suffices: there is no need for
identifiers of rows or columns. A specialized interface is provided
via the button “DtmVic Images” of the main menu.
Example C.1:
EX_C01.PCA_Semio
(Visualization in Principal Components Analysis)
Example C.1 aims at describing a set of numerical variables (an excerpt of
“semiometric data”) through Principal Components
Analysis. The principal axes visualisation is complemented by a
clustering, with an automatic description of the clusters. Bootstrap
procedures, Kohonen maps are followed by the various tools of
visualisation provided in the menu “Visualization” in the
sub-window “Visualization, Inference, Classification”:
visualisation of clusters (or categories) using symbols or colours,
convex hulls or density ellipses for clusters, Minimum spanning tree,
drawing of various nearest neighbours
graphs.
A new clustering of variables (or of observations/individuals) through a
simple k-means method can be obtained and visualized from the
sub-menu “Visualization”.
About Semiometric data:
In most surveys in the field of marketing research, it is customary to
include information about lifestyles and values. Such information is
generally obtained through a set of questions describing the
attitudes and opinions towards a list of sentences or statements.
"Semiometry" is a technique introduced by Jean-François
Steiner, a writer interested in marketing research, to tackle that
problem in a more general way.
The basic idea is to insert in the questionnaire a
series of questions consisting solely of words (a list of 210 words is
currently used, but we will be dealing here
with an abbreviated list containing a subset of 70 words). The
interviewees must rate these words on a seven-level scale,
the lowest level (mark = 1) relating to a "most disagreeable (or
unpleasant) feeling about the word", the highest level (mark = 7) relating
to a "most agreeable (or pleasant) feeling" about the word.
The processing of the filled questionnaires (mainly
through Principal Component Analysis) produces a stable pattern (up
to 8 stable principal axes). Very similar patterns are obtained in
ten different countries, despite the problems posed by the
translation of the list of words.
For more information, please refer to the book:
“La sémiométrie”
(2003) [in French] by L. Lebart, M. Piron, J.F. Steiner; Publisher:
Dunod, Paris. This book can be downloaded from the site
www.dtm-vic.com
(section “publication”).
Semiometric data files
To have a look at the data, search for the directory
DtmVic_Examples.
In this directory, open the sub-directory
DtmVic_Examples_C_NumData .
In that directory, open the directory of Example C.1, named
“EX_C01.PCA_Semio” .
It is recommended to use one directory
for each application, since DtmVic produces many intermediate text files
related to the application. At the outset, such a directory
must contain 3 files :
- a) the data file,
- b) the dictionary file,
- c) the command file.
a) Data file: “PCA_semio.dat.txt”
Our reduced-size example comprises 300 respondents (instead of the 1000
or 2000 usual in semiometric
survey samples) and 76 variables: 70 words (the marks given to the
words are considered here as numerical variables) and 6 categorical
variables describing the characteristics of the respondents.
The data file "PCA_semio.dat.txt" comprises 300 rows;
each row contains the row identifier [between quotes] followed by 76 values
[corresponding either to numerical variables or to item numbers of
categorical variables] separated by at least one blank space.
b) Dictionary file: “PCA_semio.dic.txt”
The dictionary file "PCA_semio.dic.txt" contains the
identifiers of these 76 variables. In this version of DtmVic, the
identifiers of categories must begin at "column 6" [a
fixed-width font - also known as teletype font - such as "Courier"
can be used to facilitate this kind of formatting].
c) Command file: “EX_C01_Param.txt”
The computational phase of the analysis is decomposed into "steps".
Each step requires some parameters briefly described in the main menu of DtmVic
(button: "Help about command parameters") or
in the editor (button "Open an existing command file") of the main menu.
Note that another command file, similar (but not
identical) to
“EX_C01_Param.txt” ,
can also be generated by clicking on the button
“Create a command file” of
the main menu (DTM: Basic Steps). Then proceed as shown in the first
example “EX_A01.PrinCompAnalysis” of Tutorial A.
Running the example C.1 and reading the results
1) Click on the button :
“Open an existing command file” (panel DTM: Basic Steps, line: Command file
of the main menu)
2) Search for the sub-directory
“ DtmVic_Examples_C_NumData”
in “ DtmVic_Examples”.
3) In that directory, open the directory of Example C.1:
“ EX_C01.PCA_Semio”
4) Open the command file:
“EX_C01_Param.txt”
After identifying the two data files, 10 "steps" are identified:
ARDAT (Archiving data),
SELEC (selecting active and supplementary elements),
STATS (some basic statistics),
PRICO (Principal components analysis),
DEFAC (Brief description of factorial axes),
RECIP (Clustering using a hierarchical classification of the clusters -
reciprocal neighbours method),
PARTI (Cut of the dendrogram produced by the previous step, and
optimisation of the obtained partition),
DECLA (Automatic description of the classes of the partition),
SELEC (selecting one categorical variable, in this case),
EXCAT (extracting one categorical variable - selected by step
SELEC - to be used in some graphical displays).
In this example, as in most applications, the step
SELEC plays a fundamental role, in deciding
which set of variables will be active, and which set will be illustrative or
supplementary.
In that command file, the step
RECIP performs a hierarchical clustering of
the elements using the “reciprocal neighbour algorithm”
and the step PARTI that follows cuts the obtained
tree according to the a priori fixed number of clusters.
PARTI optimizes afterwards the corresponding
partition through k-means iterations.
The methodology of this “hybrid algorithm” is
presented in “Multivariate Descriptive Statistical Analysis”
(L. Lebart, A. Morineau, K. Warwick; J. Wiley, New York, 1984). See
also [in French] “Statistique Exploratoire Multidimensionnelle”
(4th printing, L.Lebart, M. Piron, A. Morineau, Dunod, Paris, 2006).
We will comment later on this command file (see the Appendix of this
section), which commands the basic computation steps. Instead of editing
this file, we will go back to the main menu and execute the basic
computation steps.
5) Return to the main menu
(“return to execute”)
6) Click on the button: “Execute”
This step will run the basic computation steps present in the command file.
7) Click the button:
“Basic numerical results”
The button opens a created (and saved) html file named
“imp.html”
which contains the main results of the previous basic computation
steps. After perusing these numerical results, return to the main
menu. Note that this file is also saved under another name: the name
“imp.html”
concatenated with the date and time of the analysis (continental
notation). That file keeps the main numerical results as an archive,
whereas the file “imp.html”
is replaced at each new analysis performed in the same directory.
This file is also saved in simple text format, under the name
“imp.txt” ,
and likewise with a name including the date and time of execution.
8) At this stage, we click on one of the lower buttons of the basic steps
panel (Steps: “VIC”)
9) Click the button “AxeView”
... and follow the sub-menus. In fact, only three tabs are relevant for
this example: “Active variables”
[ = the selected words] ,
“Individuals (observations)” [ = respondents] and
"supplementary categories" .
After clicking on “View”
in each case, the set of principal coordinates along
each axis is displayed.
Clicking on a column header produces a ranking of all the rows according to
the values of that column. In this particular example, this is
redundant with the printed results of the step
“DEFAC” .
In the case of this particular example, in which the first axes appear
to be stable and to have an interpretation, the
AxeView procedure is useful to observe at a
glance the bundles of words occupying extreme locations along each axis.
Example for active variables and axis 2: opposition between the words
“sacred, God, perfection, soul” on the one hand, and
“sensual, adventurer, nudity, island, desire” on the other.
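The ranking produced by clicking a column header amounts to sorting the identifiers by their coordinate on the chosen axis. A minimal illustrative sketch (the variable names are invented for the example):

```python
def rank_on_axis(coords, axis):
    """Sort identifiers by decreasing coordinate on one principal axis,
    as the AxeView column-header click does.
    coords: dict mapping identifier -> list of principal coordinates."""
    return sorted(coords, key=lambda ident: coords[ident][axis], reverse=True)
```

The words at the two ends of the returned list are the "bundles of words occupying extreme locations" along that axis.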
The supplementary categories characterize the respondents. The bootstrap,
later on, will provide a validation of some aspects of the observed
structure.
10) Click the button:
“PlaneView” ... and follow the sub-menus.
In this example, the relevant items of the menu are:
“Active columns (variables or categories)” ,
"supplementary categories" ,
“Active rows (individuals, observations)” ,
“Active columns + Active rows” ,
“Active individuals (density)” and
“Active columns + Supplementary categories” .
The graphical displays of chosen pairs of axes are then produced.
In the case of semiometric data, the so-called “first semiometric
plane” is in fact the plane spanned by the axes 2 and 3. The
first axis is referred to as a “purely methodological axis”,
linked to a “size effect” common in many PCA applications
(a whole chapter of the [downloadable] book quoted previously: “La
sémiométrie” is devoted to this first axis).
In the case of PCA, the first menu item
“Active columns (variables or categories)”
may contain, in fact, both active numerical variables (in black) and
supplementary numerical variables (in red). We have only active
numerical variables in this particular example, but, later on, the
reader can edit the command file (step SELEC )
to withdraw some words from the active set and give them the status
of “supplementary (or illustrative) elements”.
He or she can also use the
“Create a command file” menu,
exemplified in Tutorial A, example A.1, to choose the procedure "PCA",
allowing then for selecting more comfortably the active and supplementary elements.
Go back to the “VIC” set of buttons.
11) Click the button:
“BootstrapView”
This button opens the DtmVic-Bootstrap-Stability window.
11.1 Click “LoadData” .
In this case (partial bootstrap), the two replicated coordinate files
to be opened are named “ngus_var_boot.txt”
and
“ngus_sup_cat_boot.txt” (see the panel reminding
the names of the relevant files below the menu bar).
The file “ngus_var_boot.txt”
contains only active variables. The file
“ngus_sup_cat_boot.txt”
contains only supplementary categories, for which the bootstrap
procedure is also meaningful.
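The partial bootstrap behind these files resamples the respondents with replacement and redoes the analysis on each replicate. Drawing the replicate samples can be sketched as follows (a generic sketch, not DtmVic's code; the function name is an assumption):

```python
import random

def bootstrap_replicates(n_rows, n_reps, seed=0):
    """Draw bootstrap samples of row indices: each replicate resamples the
    n_rows observations with replacement.  The analysis is then redone on
    each replicate to yield the replicated coordinates."""
    rng = random.Random(seed)
    return [[rng.randrange(n_rows) for _ in range(n_rows)]
            for _ in range(n_reps)]
```

The spread of each element's position across the replicated analyses is what the confidence areas summarize.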
11.2 Click on “Confidence areas”
submenu, and choose the pair of axes to be displayed
(choose axes 2 and 3, to begin with).
11.3 Click on “Loading”
in the blue window that then appears, to obtain the dictionaries of variables.
Tick the chosen white boxes to select the elements whose location
should be assessed, and press the button
“Select” .
Select, for instance among others, the categories Male and Female.
11.4 Click on “Confidence Ellipses”
to obtain the graphical display of the active variable points
(if the file ngus_var_boot.txt
has been loaded), or of the supplementary category points (if the file
ngus_sup_cat_boot.txt has been loaded).
11.5 Close the display window, and press
“Convex hulls” .
The ellipses are now replaced by the convex hulls of the replicates for
each point. The convex hulls take into account the peripheral points,
whereas the ellipses are drawn using the density of the clouds of replicates.
The two pieces of information are complementary.
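The convex hull of the replicates for one point can be computed, for example, with Andrew's monotone-chain algorithm. An illustrative sketch:

```python
def convex_hull(pts):
    """Andrew's monotone-chain convex hull of 2-D points, returned in
    counter-clockwise order; interior points are dropped."""
    pts = sorted(set(pts))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        # z-component of (a - o) x (b - o): > 0 for a left turn
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]
```

Because the hull keeps only peripheral replicates, it is more sensitive to outliers than the density-based ellipses, which is why the two displays are complementary.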
Go back to the “VIC” set of buttons.
12. Click on
“ClusterView ”
12.1 Choose the axes (2 and 3 to begin with), and
“Continue” .
12.2 Click on “View” .
The centroids of the 7 clusters of individuals (Step
PARTI ) appear on the first principal plane.
12.3 Activate the button
“Categorical” , and, pointing with the mouse on a
specific cluster, press the right button of the mouse. A description of the
cluster involving the most characteristic response items appears. This
description is similar to that of the Step
DECLA .
But we can watch on this display the pattern of clusters and their
relative locations. One can easily imagine the usefulness of the tool
for a survey with thousands of individuals, hundreds of variables,
and more clusters.
12.4 Activate the button
“Numerical” . We will
observe the link between the numerical variables (both active and
supplementary variables) of the data file and the 7 clusters. Due to
the small number of individuals, some clusters do not produce
significant results.
Go back to the “VIC” menu.
13) Click on “Kohonen map”
Select the type of coordinate.
13.1 Select: “Active
variables (columns)” :
these active variables are the 70 words in this example.
13.2 Select a (4 x 4) map, and continue.
13.3 After clicking on two small check-boxes, press
“Draw” on the menu of the
large green window entitled Kohonen map.
13.4 You can change the font size (
“Font” ) and dilate the obtained Kohonen map
( “Dilat.” )
to make it more legible. The words appearing in the same cell are
often associated in the same responses. This property holds, to a
lesser degree, for contiguous cells.
13.5 Pressing “AxeView” ,
and selecting one axis allows one to enrich the display with pieces
of information about a specified principal axis : large positive
coordinates in red colour, large negative coordinates in green, with
some transitional hues.
13.6 Go back to the main menu,
click on “Kohonen map”
and choose the item “Observations”
13.7 Select an (8 x 8) map, and redo the operations 13.3 to 13.5 for the
observations.
Go back to the “VIC” set of buttons.
14. Click on “Visualization”
A new window entitled “DTM-Visualization:
Loading files, Selecting axes” appears.
14.1 Click on “Load
coordinate”
In the corresponding sub-menu, choose the file:
“ngus_ind.txt” .
The principal coordinates of the individuals (rows) are selected.
14.2 Click then on “Load a partition file”
In the corresponding sub-menu, choose
“Select a partition” .
The partition obtained previously from the computation step must
then be loaded (its name:
“part_cla_ind.txt” ).
14.3 Click on “MST”
(Minimum Spanning Tree). Choose then the number of axes that will
serve to compute the Minimum Spanning Tree: 5 (for example).
14.4 Click on “N.N.”
(search for nearest neighbours – limited to 20 NN).
14.5 Click on “Graphics” .
Choose the axes 1 and 2 (default) in the small window
“Selection of Axes”
and click on “Display” .
In the new window entitled
“Visualisation, Graphics”
are displayed the individuals in the plane spanned by the selected
axes. A random colour is attributed to each cluster. The button
“Change colour” allows you to
try a new set of colours.
14.6 About the window
“Visualisation, Graphics”
14.6.1 -- On the vertical tool bar, you can press each button to activate it (red
colour), and press it again to cancel the activation (initial colour)
14.6.2 -- The button “Density” ,
for the sake of clarity, replaces the identifiers of individuals by a
single character indicating the cluster (the identifier and the
cluster number can be obtained by clicking on the left button of the
mouse in the vicinity of each point).
14.6.3 -- The button “C.Hull”
(Convex hull) draws the convex hull of each cluster.
14.6.4 -- The button “MST”
(Minimum Spanning Tree) draws
the minimum spanning tree.
14.6.5 -- The button “Ellipse”
performs a Principal Components Analysis of each cluster within the
two-dimensional sub-space of visualisation and draws the corresponding
ellipses (containing roughly 90% of the points).
14.6.6 -- The button “N.N.”
(Nearest neighbours) joins each point to its nearest neighbours.
Pressing afterwards the button
“N.N. up” allows you
to increment the number of neighbours up to 20 nearest neighbours.
14.6.7 -- We will see in section 16 below how to use the
lower buttons of the left side vertical bar “IterKM”,
“Mean”, “Clust” (useful to visualize the
iteration of a k-means partition).
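The ellipse of one cluster comes from the 2-D PCA of its points: the 2 x 2 covariance matrix is eigen-decomposed, and the semi-axes are scaled by a chi-square quantile (5.991 for roughly 95% coverage with 2 degrees of freedom; a smaller quantile gives roughly 90%). A self-contained sketch, not DtmVic's routine:

```python
import math

def cluster_ellipse(points, coverage=5.991):
    """Density ellipse of a 2-D cluster: eigen-decompose the covariance
    matrix; semi-axes are sqrt(eigenvalue * q), q being the chi-square
    quantile for the chosen coverage (5.991 ~ 95%, 2 d.f.).
    Returns (centre, semi-major, semi-minor, angle of major axis)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # eigenvalues of [[sxx, sxy], [sxy, syy]] via trace/determinant
    tr, det = sxx + syy, sxx * syy - sxy * sxy
    disc = math.sqrt(max(tr * tr / 4 - det, 0.0))
    l1, l2 = tr / 2 + disc, tr / 2 - disc
    angle = (math.atan2(l1 - sxx, sxy) if sxy
             else (0.0 if sxx >= syy else math.pi / 2))
    return (mx, my), math.sqrt(l1 * coverage), math.sqrt(max(l2, 0.0) * coverage), angle
```

The major axis of the ellipse is the first principal direction of the cluster in the plane of visualisation.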
Go back to the “VIC” menu.
15. Click again on
“Visualization”
15.1 We are going to redo the operations of paragraph 14, but instead of
loading a partition provided by a clustering algorithm, we will load
the partition induced by the categories of a specific categorical
variable. Such a partition corresponds to the variable number 76
(gender), selected and extracted through the steps
SELEC and EXCAT
(at the end of the command file, see below).
15.2 In the window entitled
“DTM-Visualization: loading files, selecting axes “ ,
click on “Load coordinate”
15.3 In the corresponding sub-menu, choose again
the file: “ngus_ind.txt” .
The principal coordinates of the individuals (rows) are selected.
15.4 Click then on
“Load or create Partition”
15.5 In the sub-menu
“Load or create Partition” choose
the file “part_cat.txt” .
The partition induced by the categories of variable number 76
(gender) is loaded.
After loading that partition, all the operations from 14.3 to 14.6 can be
carried out again.
Comment:
It is interesting to visualise the individuals in
the plane spanned by the axes 2 and 3.
The two categories Male and Female are significantly linked to axis
3 (as can be highlighted by looking at the bootstrap confidence
areas). But this link is hardly visible when we look directly at the
convex hulls of the two sub-clouds corresponding to these two
categories of respondents. This (almost) paradoxical result
exemplifies the difference between “statistically significant”
(which is the case here) and “obviously different” (which
is not the case).
16) Direct computation of a partition within the menu
“Visualization”
DtmVic now makes it possible to build interactively (i.e. outside the “command
file”) a “k-means partition” of variables
(or of individuals as well).
Click on
“Visualization”
16.1 A new window entitled “Visualization,
Loading files, Selecting axes” appears.
16.2 Click on “Load coordinate”
16.3 In the corresponding sub-menu, if you want a clustering of variables,
choose the file: “ngus_var_act.txt” .
The principal coordinates of the active variables
are selected. If you want a clustering of individuals, select the file:
“ngus_ind.txt” .
16.4 Click then on
“Load or create Partition” . In the corresponding
sub-menu, select the item
“Create a new k-means partition” .
16.5 You then have to select the
number of desired clusters , the
number of principal coordinates used to compute the distances, and
the maximum number of iterations
(generally 12 iterations suffice).
You can also choose to visualize the iterations (answer
yes to the question
“Do you want to look at the intermediate iterations?” ).
16.6 The obtained partition will be automatically loaded and visualized.
You can then redo the previous steps 14.3 to 14.6.
16.7 If you want to visualize the different steps, in the window
“Contiguity_Visualisation” ,
click on “IterKM” ,
then click alternately on “Means”
(computation of the centroids of the clusters) and on
“Clust” (assignment of the elements
to the new centroid) until the convergence is reached.
Note that the partition obtained through the classical k-means algorithm
generally will not coincide with the partition induced by the
parameters of the command file. In that command file, the step
RECIP performs a hierarchical clustering of the elements
using the “reciprocal neighbour algorithm” and the step
PARTI that follows cuts the obtained tree according
to the a priori fixed number of clusters. PARTI
optimizes afterwards the corresponding partition through k-means iterations.
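The alternation of “Means” (recompute centroids) and “Clust” (reassign elements) described above is exactly the k-means loop. A minimal illustrative sketch (in DtmVic the points would be principal coordinates):

```python
import math, random

def kmeans(points, k, iters=12, seed=0):
    """Plain k-means: alternate 'Clust' (assign each point to the nearest
    centroid) and 'Means' (recompute centroids) until convergence."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # k distinct starting points
    assign = None
    for _ in range(iters):
        new_assign = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]         # 'Clust' step
        if new_assign == assign:               # convergence reached
            break
        assign = new_assign
        for c in range(k):                     # 'Means' step
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members)
                                     for x in zip(*members))
    return assign, centroids
```

Twelve iterations usually suffice, which is why the dialogue of section 16.5 proposes that default.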
Appendix C.1 (for advanced users)
A similar (but not identical) command file can be generated using the menu
“Create a command file” .
Therefore, beginners may skip this appendix.
The computational phase of the analysis is decomposed into "steps".
Each step requires some parameters briefly described in the main menu
of DtmVic (button: "Help
about command parameters" ).
Command file: EX_C01_Param.txt
#-------------------------------------------------
# EX_C01_Param.txt (Example C.1 of principal component analysis)
#-------------------------------------------------
# continuation symbol = ">", Comments symbol = "#"
# title mandatory immediately after each line "STEP"
LISTP = yes, LISTF = no, LERFA = yes # Global Parameters
NDICZ = 'PCA_semio.dic.txt' # dictionary file
NDONZ = 'PCA_semio.dat.txt' # data file
STEP ARDAT #reading the dictionary
========== builds the Archive Dictionary
NQEXA = 76, NIDI = 1, NIEXA = 300
#---------------------- Comments about step ARDAT
# - NQEXA = ... number of questions (or variables)
# in both the dictionary and the data file
# - NIEXA = ... number of "individuals" (or rows)
# in the data file.
# - NIDI = 1...indicates the presence of an identifier
#--------------------------------------------------
STEP SELEC # Selection of variables
========== Selecting active and supplementary variables
LSELI = TOT, IMASS = UNIF, LZERO = NOREC, LEDIT = short
CONT ACT 1--70
NOMI ILL 71--76
END
#---------------------- Comments about step SELEC
# - LSELI = ... Parameter describing the selection
# of individuals, the value FILT (or: 2) means that
# the rows are selected through a “filter”.
# This filter is defined after the closing key-word “END”
# - IMASS = ... weight (or: mass) of the individuals
#(rows); the value 0 (or: UNIF) means "uniform"(same weights)
# - LZERO = REC ... means that the value 0, which
# indicates a missing value, will be recoded as an
# extra response item for the categorical variables.
# - CONT ILL (absent) means “continuous illustrative
# variable”. This key-word is followed by the list
# of the variables numbers.
# - NOMI ILL means illustrative nominal (or:
# categorical) variable. This key-word is followed
# by the list of the numbers of the variables.
# - CONT ACT means active variable. This key-word
# is followed by the list of the numbers of the variables.
# Note that, for example, 6-12 means: 6 7 8 9 10 11 12.
# The key-word END indicates the end of the list.
#---------------------------------------------------------
STEP PRICO
========== Principal component analysis
LCORR = 2, NAXE = 12, LEDIN = 0, NAVIR = 2 nsimu = 10 nboot = 1
#---------------------- Comments about step PRICO
# - LCORR = 1 means that the variables will be standardized
# (the correlation matrix will be diagonalized)
# - NAXE = ... number of computed principal axes
# - LEDIN = 1: the principal coordinates of individuals are printed
# - NSIMU...number of bootstrap replications (less than 30)
# (0 = no bootstrap)
# - NBOOT = 1 : partial bootstrap.
#--------------------------------------------------
STEP DEFAC # Description of factorial axes
========== Describing axes
SEUIL = 40., LCRIM = VTEST, VTMIN = 2.
VEC = 1--6 / CONT / MOD
END
#---------------------- Comments about step DEFAC
# SEUIL = ... Maximum number of elements that will
# be sorted to describe each axis
# LCRIM = ... Criterion for sorting the elements
# (here VTEST means “test-values”, i.e. signed numbers
# of standard deviations).
# VTMIN = elements with a test-value less than VTMIN are
# not printed.
# VEC = ... list of axes to be described
# CONT = continuous variables , MOD = categories
# The key-word END indicates the end of the list.
#--------------------------------------------------
STEP RECIP # hierarchical classification
==== Clustering of respondents using reciprocal neighbours
NAXU=7 LDEND=DENSE NTERM=20 LDESC=no
#---------------------- Comments about step RECIP
# This step carries out a hierarchical clustering
# using the reciprocal neighbours technique (recommended
# when dealing with less than 1000 individuals).
# - naxu... number of axes kept from the previous analysis
# - nterm... number of kept terminal elements
# TOT means that all the elements are kept.
# - LDESC... describing nodes of the tree (0=no, 1=yes).
# - LDEND... printing dendrogram (0=no, 1=dense, 2=large).
#--------------------------------------------------
STEP PARTI # partition
========== cutting the dendrogram and improving the partition
NITER = 10, LEDIN = 6
7 # list of the numbers of clusters required (here one cut: 7 clusters)
#---------------------- Comments about step PARTI
# - NITER... number of "consolidation" iterations (0=no).
# - LEDIN... printing the correspondences classes-
# individuals (3 = printing of the
# correspondence classes->individuals and the
# correspondence individuals->classes).
# The line immediately following the command must
# contain the sizes of the desired final partition
# (here: 7).
#--------------------------------------------------
STEP DECLA # Description of partitions
========== Systematic description of clusters
CMODA = 5.0, PCMIN = 2.0, LSUPR = yes, CCONT = 5.0 >
LPNOM = NO, EDNOM = NO, EDCON = NO
7 # list of partitions (characterised by their numbers of clusters)
#---------------------- Comments about step DECLA
# - CMODA... describing classes with categories
# (0=no; CMODA = 5.0 means a p-value less than 0.05 for
# the selection of characteristic categories).
# - PCMIN... minimum relative ( % ) weight for a
# category (categories whose relative weight is
# less than 2% are discarded).
# - LSUPR... characteristic category if
# %(cat./class) > %(cat./total) (0=no, 1=yes).
# (LSUPR = yes means that only characteristic
# elements will be printed)
# - CCONT... describing classes with numerical
# variables.(0=no; CCONT = 5.0 means a
# p-value less than 0.05 for the selection of
# characteristic variables).
# - LPNOM... describing partition with questions
# (i.e: whole set of categories)(0=no, 1=yes).
# - EDNOM... printing the tables crosstabulating
# (classes * questions) (0=no).
# - EDCON... describing partition globally with
# numerical variables (0=no, 1=yes).
#--------------------------------------------------
STEP SELEC
========== Selecting one categorical variable
LSELI = TOT, IMASS = UNIF, LZERO = NOREC, LEDIT = short
NOMI ACT 76
END
#-------------- Comments about this second call to SELEC
# The variable 76 is selected, to feed the following step.
#--------------------------------------------------
STEP EXCAT
========== Extracting categorical variable 76 (on file part_cat.txt)
#--------------------------------------------------
# This particular step without parameter serves to enrich some displays
# in the menu Visualization.
#
#--------------------------------------------------
STOP # End of command file.
#--------------------------------------------------
End of example C.1
Example C.2: EX_C02.PCA_Contiguity
(PCA and Contiguity Analysis on Fisher’s Iris Data)
Example C.2 aims at analysing a classical set of numerical variables
(The Iris data set of Anderson and Fisher) through Principal Components
Analysis, Classification, Contiguity Analysis, Discriminant Analysis.
As in previous examples, the principal axes visualisation is complemented
with a clustering, together with an automatic description of the clusters.
The first phases of Example C.2 are very similar to Example C.1:
principal components analysis and classification (clustering)
of a set of numerical data, with various tools of visualisation,
also involving a specific categorical variable.
Subsection 12 and the following subsections present the improvements provided
by Contiguity Analysis, with a comparison with Linear Discriminant Analysis.
Reminder about Contiguity Analysis
In Contiguity analysis, we consider the case of a set of multivariate
observations (n objects described by p variables, leading
to an (n, p) matrix X) having an a priori
graph structure. The n observations are the vertices of a symmetric
graph G, whose associated (n, n) matrix is M
(mii' = 1 if vertices i and i' are joined by an edge, mii' = 0
otherwise).
Such a situation occurs, for instance, when the vertices represent time points or geographic areas.
Contiguity Analysis, confronting local and global variances, provides a
straightforward generalization of Linear Discriminant Analysis.
It makes it possible to point out the level (local versus global)
responsible for the observed patterns.
In this example, we will deal with the situation in which
M and the graph structure are not external, but derived from the data
matrix X itself, M being for example the k-nearest-neighbours
graph derived from a distance between observations.
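The local/global confrontation can be sketched in a few lines of Python (our own illustrative code, not DtmVic's implementation; the six 2-D points below are invented):

```python
# Illustrative sketch (not DtmVic code): build the symmetrised k-nearest-
# neighbours graph M of a small 2-D data set, then compute the "local"
# covariance over the pairs joined by an edge - the "within" analogue
# that contiguity analysis confronts with the global covariance.
import math

def knn_graph(points, k):
    n = len(points)
    m = [[0] * n for _ in range(n)]
    for i in range(n):
        order = sorted(range(n), key=lambda j: math.dist(points[i], points[j]))
        for j in order[1:k + 1]:      # order[0] is the point itself
            m[i][j] = m[j][i] = 1     # symmetrise the graph
    return m

def local_covariance(points, m):
    # (1 / 2E) * sum over the E undirected edges of
    # (x_i - x_j)(x_i - x_j)^T.
    p, n = len(points[0]), len(points)
    cov = [[0.0] * p for _ in range(p)]
    pairs = 0
    for i in range(n):
        for j in range(n):
            if m[i][j]:
                pairs += 1            # each edge is seen twice: (i,j), (j,i)
                d = [points[i][a] - points[j][a] for a in range(p)]
                for a in range(p):
                    for b in range(p):
                        cov[a][b] += d[a] * d[b] / 2
    return [[c / pairs for c in row] for row in cov]

# Two well-separated clusters: the local covariance stays small even
# though the global variance is large.
pts = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
m = knn_graph(pts, 2)
print(local_covariance(pts, m))
```

With k = 2 the edges stay within each cluster, so the local covariance reflects only within-cluster dispersion.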
Some interesting possibilities for exploring data are sketched. Note that
the idea of deriving a metric likely to highlight the existence of clusters
dates back to the works of Art et al. (1982) and Gnanadesikan et al. (1982).
Some references
Art D., Gnanadesikan R., Kettenring J.R. (1982) Data Based Metrics for
Cluster Analysis, Utilitas Mathematica, 21 A, 75-99.
Burtschy B., Lebart L. (1991) Contiguity analysis and projection pursuit.
In : Applied Stochastic Models and Data Analysis, R.
Gutierrez and M.J.M. Valderrama, Eds, World Scientific, Singapore, 117-128.
Gnanadesikan R., Kettenring J.R., Landwehr J.M. (1982) Projection Plots for
Displaying Clusters, in Statistics and Probability, Essays in Honor of
C.R. Rao, G. Kallianpur, P.R. Krishnaiah, J.K.Ghosh, eds, North-Holland.
Lebart L. (1969) Analyse statistique de la contiguité.
Publications de l’ISUP. XVIII, 81-112.
Lebart L. (2000): Contiguity Analysis and Classification. In: W. Gaul,
O. Opitz and M. Schader (Eds): Data Analysis. Springer, Berlin,
233-244.
Lebart L. (2006): Assessing Self Organizing Maps via Contiguity Analysis.
Neural Networks, 19, 847-854.
Looking at the data
To have a look at the data, search for the directory
DtmVic_Examples.
In this directory, open the sub-directory
DtmVic_Examples_C_NumData.
In that directory, open the directory of Example C.2, named
“EX_C02.PCA_Contiguity”
It is recommended to use one directory for each application, since DtmVic
produces many intermediate text files related to the application.
At the outset, such a directory must contain 3 files:
- a) the data file,
- b) the dictionary file,
- c) the command file.
a) Data file: “iris_dat.txt”
Our example comprises 150 observations described by 5 variables: 4 measurements
(these numerical variables are the dimensions of various constituents of
the flowers: Sepal Length, Sepal Width, Petal Length, Petal
Width) and one categorical variable describing the characteristics of the
observations (three species of plants: setosa, versicolor, virginica).
The data file "iris_dat.txt" comprises 150 rows and 6 columns (the
identifier of rows [between quotes] + 5 values [corresponding to 4
numerical variables and one categorical variable] separated by at
least one blank space).
[Reference: Anderson, E. (1935). The irises of the Gaspe
Peninsula, Bulletin of the American Iris Society
59, 2–5.]
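As an illustration, a line of this free format can be parsed with a few lines of Python (our own sketch, not DtmVic code; the sample line and the function name are invented):

```python
# Hypothetical parser for the format described above: a quoted row
# identifier followed by 5 blank-separated values (4 numerical
# measurements, then the categorical species code).
import shlex

def parse_dtm_row(line):
    tokens = shlex.split(line)                 # honours the quoted identifier
    ident, values = tokens[0], tokens[1:]
    numeric = [float(v) for v in values[:4]]   # the 4 measurements
    species = values[4]                        # the categorical code
    return ident, numeric, species

ident, meas, cat = parse_dtm_row('"iri001" 5.1 3.5 1.4 0.2 1')
print(ident, meas, cat)
```

The identifier "iri001" and the values are made up; only the overall layout comes from the description above.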
b) Dictionary file: “iris_dic.txt”
The dictionary file "iris_dic.txt" contains the identifiers of
these 5 variables. In this version of the DtmVic dictionary, the
identifiers must begin in column 6 [a fixed-width
font - also known as teletype font - such as "Courier"
can be used to facilitate complying with this format].
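A minimal sketch of that fixed-column rule (the two dictionary lines below are invented, and the layout of the first five columns is our assumption; only the "identifier starts in column 6" constraint comes from the text above):

```python
# Hypothetical dictionary lines: the documented rule is only that the
# identifier begins in column 6 (1-based), i.e. at 0-based index 5.
lines = [
    "   1 Sepal Length",
    "   2 Sepal Width",
]
for ln in lines:
    label = ln[5:]        # identifier, column 6 onwards
    print(repr(label))
```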
c) Command file:
“EX_C02_Param.txt”
The computational phase of the analysis is decomposed into "steps".
Each step requires some parameters briefly described in the main menu
of DtmVic (button: "Help
about parameters" ) and below.
Note that another “command file”, similar
(but not identical) to the command file
“iris_par.txt”,
can also be generated by clicking on the button
“Create a command file”,
line "Command file" of the main
menu (DTM: Basic Steps). Proceed then as shown in the first
example “EX_A01.PrinCompAnalysis” of Tutorial A.
Running the example C.2 and reading the results
1) Click on the button :
“Open an existing command file” (panel DTM: Basic Steps
of the main menu)
2) Search for the sub-directory
“DtmVic_Examples_C_NumData”
in “ DtmVic_Examples”.
3) In that directory, open the directory of Example C.2 named
“EX_C02.PCA_Contiguity” .
4) Open the command file:
“EX_C02_Param.txt”
In that command file, we can read that after identifying the two files
(data and dictionary), 9 "steps" are performed:
ARDAT (Archiving data),
SELEC (selecting active and supplementary elements),
PRICO (Principal components analysis),
DEFAC (Brief description of factorial axes),
RECIP (Clustering using a hierarchical
classification - reciprocal neighbours method),
PARTI (Cut of the dendrogram produced by the
previous step, and optimisation of the obtained partition),
DECLA (Automatic description of the classes of
the partition),
SELEC (selecting, in this case, one
categorical variable),
EXCAT
(extracting one categorical variable: the species of iris - selected by step
SELEC - to be used in some graphical displays).
We will comment later on this command file (in the appendix of this
section), which drives the basic computation steps. Instead of editing this
file, we will go back to the main menu and execute the basic computation steps.
5) Return to the main menu (
“return to execute ” )
6) Click on the button: “Execute”
This step will run the basic computation steps present in the command file.
7) Click the button:
“Basic numerical results”
The button opens an html
file named “imp.html”,
created and saved by the run,
which contains the main results of the previous basic computation
steps. After perusing these numerical results, return to the main
menu. Note that this file is also saved under another name: the name
“imp.html”
concatenated with the date and time of the analysis (continental
notation). That file serves as an archive of the main numerical results,
whereas the file “imp.html”
is replaced by each new analysis performed in the same directory.
As usual, this file is also saved in simple text format, under the name
“imp.txt”,
and likewise with a name including the date and time of execution.
Return.
8) At this stage, we click on one of the buttons of the lower
panel of the main menu (Steps: “VIC”)
9) Click directly on the button:
“BootstrapView”
This button opens the DtmVic-Bootstrap-Stability window.
9.1 Click “LoadData”.
In this case (partial bootstrap), the two replicated-coordinate files
to be opened are named “ngus_var_boot.txt”
and “ngus_sup_cat_boot.txt”
(see the small panel reminding the names of the relevant files below
the menu bar). In fact, “ngus_var_boot.txt”
contains only active variables.
The file “ngus_sup_cat_boot.txt”
contains only supplementary categories, for which the bootstrap
procedure is also meaningful.
9.2 Click on “Confidence
Areas” submenu, and choose the pair of axes to be
displayed (choose axes 1 and 2, to begin with).
9.3 Tick the chosen white boxes to select the elements whose locations
should be assessed, and press the button
“Select”.
Select all four variables when you open the file
“ngus_var_boot.txt” ,
and, later on, the three species when you open the file
“ngus_sup_cat_boot.txt” .
9.4 Click on “Confidence Ellipses”
to obtain the graphical display of the active variable points
(if the file ngus_var_boot.txt
has been loaded), or of the supplementary category points (if the
file ngus_sup_cat_boot.txt
has been loaded). We can observe, for example, that the variable
"petal length" seems to be somewhat redundant with "petal width",
since their ellipses markedly overlap. We can also see that the three categories are
significantly distinct (that does not mean that they can be linearly
separated...).
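To fix ideas, here is a sketch (our own Python, with invented replicate coordinates, not DtmVic's implementation) of how the axes of such a confidence ellipse can be obtained from the cloud of bootstrap replicates of a single point:

```python
# Sketch: half-axis lengths of a ~95% confidence ellipse fitted to the
# 2-D bootstrap replicates of one point, via the eigenvalues of the
# 2 x 2 covariance matrix of the cloud.
import math

def ellipse_axes(points, scale=2.4477):   # sqrt(chi2_{2, 0.95}) ~ 2.4477
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Eigenvalues of the covariance matrix give the squared axis lengths.
    tr, det = sxx + syy, sxx * syy - sxy ** 2
    disc = math.sqrt(max(tr * tr / 4 - det, 0.0))
    l1, l2 = tr / 2 + disc, tr / 2 - disc
    return (mx, my), scale * math.sqrt(l1), scale * math.sqrt(max(l2, 0.0))

centre, a, b = ellipse_axes([(0.1, 0.2), (0.3, 0.1), (0.2, 0.4), (0.0, 0.3)])
print(centre, a, b)
```

Two points whose ellipses, so computed, overlap markedly (as "petal length" and "petal width" here) convey largely redundant information.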
9.5 Close the display window, and, again in the blue window, press
“Convex hulls” .
The ellipses are now replaced with the convex hulls of the replicates
for each point.
The convex hulls take into account the peripheral points, whereas the
ellipses are drawn using the density of the clouds of replicates. The
two pieces of information are complementary.
Go back to the “VIC” menu.
10. Click on “Visualization”
[Visualization of the three species]
A new window entitled “Visualization,
Loading files, Selecting axes” appears.
We are going to visualise the different species of flowers (categorical
variable n° 5) in the plane spanned by the first principal
components.
10.1 Click on “Load coordinate”
10.2 In the corresponding sub-menu, choose the file:
“ngus_ind.txt” .
The principal coordinates of the individuals (rows) are selected.
10.3 Click then on “Load
or create Partition”
10.4 In the corresponding sub-menu, choose
“Load a partition file” .
The partition obtained previously from the computation step must then
be loaded (its name: “part_cat.txt ”
).
The partition induced by the 3 categories of variable number 5
(species of iris) is loaded. This partition has been selected and extracted
through the steps SELEC
and EXCAT
(at the end of the command file, see below).
10.5 Click on “MST”
(Minimum Spanning Tree).
10.6 Choose then the number of axes that will
serve to compute the Minimum Spanning Tree: 5 (for example).
10.7 Click on “N.N.”
(search for nearest neighbours – limited to 20 NN).
10.8 Click on “Graphics”
.
10.9 Choose the axes 1 and 2 (default) in the small window
“Selection of axes”
and click on “Display” .
In the new window entitled
“Visualisation” are displayed the individuals in
the plane spanned by the selected axes. A random colour is attributed to
each species. The button “Change colour”
allows you to try a new set of colours.
On the vertical tool bar, you can press each button to activate it (red colour),
and press it again to cancel the activation (original colour)
-- The button “Density”,
for the sake of clarity, replaces the identifiers of individuals with a
single character indicating the cluster (the identifier of an individual and its
cluster number can be obtained by clicking the left button of the
mouse in the vicinity of the point).
-- The button “C.Hull”
(Convex hull) draws the convex hull for each cluster.
-- The button “MST”
(Minimum Spanning Tree) draws the minimum spanning tree.
-- The button “Ellipse”
performs a Principal Components
Analysis of each cluster within the two-dimensional sub-space of
visualisation and draws the corresponding ellipses (containing
roughly 95% of the points).
-- The button “N.N.”
(Nearest neighbours) joins each point to its nearest neighbours. Pressing
afterwards the button “N.N.up”
allows you to increment the number of neighbours up to the 20 nearest neighbours.
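As an aside, the minimum spanning tree drawn by the “MST” button can be sketched with Prim's algorithm on the Euclidean distances between principal coordinates; this is illustrative code with invented 2-D points, not DtmVic's implementation:

```python
# Sketch: Prim's algorithm for the minimum spanning tree of a small
# point cloud (in DtmVic the points would be the individuals'
# coordinates on the first few principal axes).
import math

def mst_edges(points):
    n = len(points)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        best = None
        for i in in_tree:
            for j in range(n):
                if j not in in_tree:
                    d = math.dist(points[i], points[j])
                    if best is None or d < best[0]:
                        best = (d, i, j)
        edges.append((best[1], best[2]))
        in_tree.add(best[2])
    return edges                      # n - 1 edges joining all points

pts = [(0, 0), (0, 1), (1, 1), (5, 5), (5, 6)]
print(mst_edges(pts))
```

The tree necessarily crosses the gap between the two groups of points exactly once, which is why the MST is a useful visual check of cluster separation.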
At this step, we have obtained a display of the 150 individuals, with either
the convex hulls or the ellipses corresponding to the three species.
This is a classical display of the Iris data in the principal plane of PCA, showing that
the first species (individuals numbered below 51) is well separated from species 2 and
3.
Go back to the “VIC” menu.
11. Click again on
“Visualization” [Visualization of the clusters]
We are going to redo the operations of paragraph 10, but instead of
loading the partition induced by the 3 categories of variable number 5
(species of iris), we will load a partition produced by a
clustering algorithm ignoring the species. Such a partition corresponds to the steps
RECIP and PARTI
(see the command file, below).
11.1 Click on “Load coordinate”
11.2 In the corresponding sub-menu, choose the file:
“ngus_ind.txt” .
The principal coordinates of the individuals (rows) are selected.
11.3 Click then on
“Load or create Partition”
11.4 In the corresponding sub-menu, choose
“Load a partition file” .
The partition obtained previously from the computation step must then
be loaded (its name: “part_cla_ind.txt”
).
This partition is derived from the steps RECIP
and PARTI .
After loading that partition, all the operations from 10.5 to 10.9
can be carried out again.
It is interesting to visualise the individuals in the
plane spanned by the axes 1 and 2.
As suspected, the partition obtained directly from the numerical
measurements, ignoring the species, does not fully recover the three
species. Only the species “setosa”, well separated from
the two other species, coincides with a cluster of the partition.
Go back to the “VIC” menu.
12. Click now on the button:
“Contiguity” [Contiguity Analysis of Iris data]
We are now going to perform a “Contiguity analysis” using a
“nearest neighbours graph” derived from the data.
The partition into species is no longer taken into account.
12.1 Click on
“Parameters/Edit”
Choose the item “Create”
We are going to enter the parameters needed by a contiguity analysis:
- In the first block entitled
“ncoord = input coordinate file” ,
tick “1” (File ngus_ind.txt :
coordinates of individuals). The contiguity analysis will use the
coordinates of individuals as input data.
- In the second block entitled
“npart = partition file” , tick:
“0” (no partition)
- In the third block entitled “meth
= method” , tick “2”
(Contiguity graph defined by nearest neighbours).
- Then we will have to enter the following numerical values :
npas = 2 (increment for the number
of nearest neighbours)
Min = 4 (minimum number of nearest neighbours)
Max = 8 (maximum number of nearest
neighbours)
Three contiguity analyses will be performed for the three (symmetrised)
graphs corresponding respectively to 4, 6, and 8 neighbours
(from Min = 4 up to
Max = 8, with an increment of
npas = 2).
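The three analyses follow directly from the Min/Max/npas parameters; a one-line sketch of the resulting contiguity levels:

```python
# Contiguity levels generated by the parameters above:
# one symmetrised k-nearest-neighbours graph per level.
kmin, kmax, npas = 4, 8, 2
levels = list(range(kmin, kmax + 1, npas))
print(levels)   # [4, 6, 8]
```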
Then: Click on: “Validate” .
A summary of the parameters appears.
12.2 In the upper bar of the window, Click on
“Execute”.
The computations are carried out.
The item “Results”
of that bar contains technical details about the computations
involved in Contiguity analysis.
12.3 Click on “Contiguity View”.
We are led to the same visualisation window as previously.
In the menu “Load coordinates”
, of the new window, choose the file:
ngus_contig.txt. Instead of using the
principal coordinates of PCA (ngus_ind.txt,
as done previously), we now use the result of the Contiguity Analysis,
ngus_contig.txt.
From the menu “Load or Create a Partition”
, choose the file: part_cat.txt .
(this file identifies the species)
We can compute neither the Minimum Spanning Tree nor the Nearest Neighbours
from the “ngus_contig.txt”
coordinates.
12.4 Click on “Graphics”.
Then choose the axes 1 and 2 (default values)
Choose (tick) the contiguity level number 2, which corresponds to 6 nearest
neighbours (level 1 corresponds to 4 nearest neighbours, and level 3
to 8 nearest neighbours).
Click on “Display”
Change the colours to obtain a good contrast between species.
Click on “Convex Hull”
(vertical bar)
The three species are now better separated.
That means that the (“symmetrised”) graph of 6 nearest neighbours
allows for computing a “local covariance matrix” that can act as a
“within covariance matrix”.
In this example, the principal plane
of a contiguity analysis is similar to the principal plane of a Fisher
Linear Discriminant Analysis.
We must keep in mind that the contiguity analysis did not use the
a priori knowledge about the species. It is an unsupervised method.
Go back to the “VIC” menu.
13. Click again on “Contiguity”
We are now going to perform a “Contiguity analysis” that
coincides exactly with a classical Linear Discriminant Analysis.
(Linear Discriminant Analysis is a particular case of Contiguity Analysis. In
such a case, the graph involved in this Contiguity Analysis is made of k cliques
(complete graphs) corresponding to the k classes of the Discriminant Analysis).
13.1 Click on “Parameters/Edit”
Choose the item “Create”
We are going to enter the parameters needed by a contiguity analysis:
In the first block entitled
“ncoord = input coordinate file”
, tick “1” (File:
ngus_ind.txt: coordinates of individuals). The contiguity
analysis will use the coordinates of individuals as input data.
In the second block entitled “npart
= partition file” , tick
“2” ( part_cat.txt ,
categorical) (this partition will now be used to derive a graph).
In the third block entitled “meth
= method” , tick “3”
(Classical Discriminant Analysis).
In this case, the following parameters are meaningless. DtmVic asks you
to skip them.
The contiguity analysis will be performed using the graph associated
with the partition into species (all pairs of individuals
belonging to the same species are joined by an edge; there is no edge between
individuals belonging to different species).
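The clique graph just described is easy to sketch (illustrative Python with an invented six-observation partition; whether the diagonal itself carries a 1 is a convention we leave aside here, the diagonal being set to 0 below):

```python
# Sketch of the graph that makes contiguity analysis coincide with
# linear discriminant analysis: one clique (complete subgraph) per
# class, no edge across classes.
def clique_graph(classes):
    n = len(classes)
    return [[1 if i != j and classes[i] == classes[j] else 0
             for j in range(n)] for i in range(n)]

m = clique_graph([1, 1, 2, 2, 3, 3])
for row in m:
    print(row)
```

Reordered by class, the matrix shows the diagonal blocks of "1" and "0" elsewhere mentioned in the results listing.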
Then: Click on: “Validate” .
A summary of the parameters appears.
13.2 In the upper bar of the window,
Click on “Execute” .
The computations are carried out.
The item “Results”
of that upper bar contains technical details about the computations
involved in Contiguity analysis. The matrix associated with the graph
with its three diagonal blocks of “1” and with the value
“0” elsewhere is visible in this listing of results.
13.3 Click on “Contiguity View”.
In the menu “Load coordinates”
, of the new window, choose the file:
ngus_contig.txt .
In the menu “Load or Create a partition”,
choose the file: part_cat.txt
(we will identify the “species of iris”).
We can compute neither the Minimum Spanning Tree nor the Nearest Neighbours
from the “ngus_contig.txt”
coordinates.
13.4 Click on “Graphics”.
Then choose the axes 1 and 2 (default values)
Click on
“Display”.
Change the colours of the display to obtain
a good contrast between classes, then lock the colours.
Click on “Convex Hull”
(vertical bar)
The three species of iris are well separated, too. But this is less of a
surprise, since the Linear (Fisher) Discriminant Analysis aims precisely at
separating the classes. We are here in a supervised case. The method
uses the a priori knowledge of the species of iris to exhibit the
coordinates (discriminant functions) that induce the best separation of the
classes.
Appendix C.2 (for advanced users)
A similar command file can be generated using the menu
“Create a command file ”
(see Tutorial A, example A.1). Therefore, beginners may skip this
appendix.
Since the steps are the same as those of Example C.1, the listing of
parameters will not be commented here.
Command file: EX_C02_Param.txt
#-------------------------------------------------
# Example C.2 of principal component/ Contiguity analysis
#-------------------------------------------------
# continuation symbol = ">", Comments symbol = "#"
# title mandatory immediately after each line "STEP"
#-------------------------------------------------
#
LISTF = NO, LERFA = yes # Global Parameters
NDICZ = 'iris_dic.txt' # dictionary file
NDONZ = 'iris_dat.txt' # data file
#------------------------------------------------------------
# Description of IRIS DATA through PCA, then, classification (unsupervised)
# into 3 groups (evidently, these groups will not coincide with the 3 real
# groups [species], as described by the categorical variable number 5).
#------------------------------------------------------------
# SEE COMMENTS IN APPENDIX C.1
#------------------------------------------------------------
STEP ARDAT #reading dictionary and data
========== builds the Archive Dictionary
NQEXA = 5, NIDI = 1, NIEXA = 150
STEP SELEC # Selection for PCA
========== Selects active, supplementary variables and observations
LSELI = 0, IMASS = UNIF, LZERO = REC, LEDIT = short
NOMI ILL 5
CONT ACT 1--4
end
STEP PRICO # Principal Components Analysis
==========
LCORR = 1, NAXE =4, LEDIN = 1, NSIMU = 10 nboot = 1
STEP DEFAC # Description of factorial axes
========== Principal Component Analysis
SEUIL = 40., LCRIM = VTEST, VTMIN = 2.
VEC = 1--2 / CONT / MOD
end
STEP RECIP # hierarchical classification
========== Clustering using reciprocal neighbours algorithm
NAXU = 4, NTERM = TOT, LDESC = NO, LDEND = DENSE
STEP PARTI # partition
========== Cut of the dendrogram and optimization
NITER = 4, LEDIN = 3
3 # list of the numbers of clusters required (here one cut, 3 clusters)
STEP DECLA # Description of partitions
========== Systematic description of clusters
CMODA = 5.0, PCMIN = 2.0, LSUPR = yes, CCONT = 5.0 >
LPNOM = NO, EDNOM = NO, EDCON = NO
3 # list of partitions (characterised by their numbers of clusters)
STEP SELEC # Selection of one nominal variable
========== Selects active, supplementary variables and observations
LSELI = 0, IMASS = UNIF, LZERO = REC, LEDIT = short
NOMI act 5
end
STEP EXCAT # Selection of the previous nominal variable
=========
# no parameter
STOP # End of command file.
End of example C.2
Example C.3 EX_C03.Graphs
(Description
of graphs)
Example C.3 aims at describing four
simple symmetrical planar graphs from their associated matrices, mainly
through correspondence analysis. Unlike the previous example
directories, the directory EX_C03.Graphs contains several
sub-directories and examples.
Section 1 : Overview of the different directories and files
1.1 Search for the examples directory DtmVic_Examples
1.2 In that directory, open the directory of Example C.3, named
EX_C03.Graphs .
This directory comprises three sub-directories.
The sub-directory named “Chessboard”
relates to the description of a “chessboard shaped graph”
(49 nodes corresponding to a square chessboard with 7 rows and 7
columns, the associated matrix being a 49 x 49 binary matrix).
The sub-directory named “Cycle”
similarly relates to the description of a “cycle shaped graph”
(49 nodes).
The sub-directory named “Geography”
concerns the description of graphs associated with geographical maps
(graphs of contiguous regions in Japan recorded under textual
form, graphs of contiguous “departments” of France recorded under both
textual form and “external form”).
1.3 Open the sub-directory named
“Chessboard” .
1.3.1 Open the sub-sub-directory
“Chessboard_numerical” .
The file: “Chessboard_7x7_dat.txt”
contains the data set representing the incidence matrix of the graph,
with 49 rows and 49 columns. Like any classical data set of DtmVic,
each row begins with its identifier. The entry cell
m(i, j) of such a matrix M has the value 1 if the nodes i and j are
joined by an edge, 0 otherwise. The identifiers of columns are to be found in the
associated dictionary file:
“Chessboard_7x7_dic.txt” .
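For concreteness, such a matrix can be reconstructed in a few lines, assuming the chessboard graph joins squares that share a side (our reading of “chessboard shaped”; illustrative code, not DtmVic's):

```python
# Sketch: the 49 x 49 adjacency matrix of the 7 x 7 chessboard (grid)
# graph, joining horizontally and vertically adjacent squares.
def chessboard_matrix(size=7):
    n = size * size
    m = [[0] * n for _ in range(n)]
    for r in range(size):
        for c in range(size):
            i = r * size + c
            for dr, dc in ((0, 1), (1, 0)):   # right and down neighbours
                rr, cc = r + dr, c + dc
                if rr < size and cc < size:
                    j = rr * size + cc
                    m[i][j] = m[j][i] = 1
    return m

m = chessboard_matrix()
print(sum(map(sum, m)) // 2)   # number of edges: 84 for a 7 x 7 grid
```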
That file will be analysed through Correspondence analysis (command file:
“Chessboard_CA.Param.txt”)
and also, through Principal Component Analysis (command file:
“Chessboard_PCA.Param.txt”)
for the sake of a comparison. The comparison is not favourable to PCA
in this particular case. [see, e.g.: Exploring Textual Data (1998),
by L. Lebart, A. Salem, L. Berry, Kluwer Academic Publishers].
Note that these command files can be generated from the button "Create a command file" of the
main menu, as exemplified in Tutorials A1 and A2 relating to Principal Components
Analysis and Correspondence Analysis.
1.3.2 Open the sub-sub-directory
“Chessboard_textual”.
The file: “Chessboard_textual_7x7.txt”
contains the same basic information under a quite distinct form: the
format relates to responses to open ended questions. Each node of the
graph is considered as a respondent, answering to the fictitious
open-ended question: “ Please, tell me which are your
neighbours ?”. Instead of a binary matrix M,
we are dealing here with a much smaller data matrix containing the
address (column numbers) of the “1” in the matrix M.
The command file “Chessboard_Textual.Param.txt”
leads to the same results as those from the correspondence analysis
of the previous paragraph, using however a quite distinct sequence of
DtmVic steps. It is a “pedagogical example” of the bridge
between the numerical and textual steps of DtmVic. In this type of data,
the numbers are not considered as numbers in the mathematical meaning
of the term, but as mere sequences of characters. [See below the
example of the maps of Japan and France ].
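A sketch of this textual coding (an invented 3-node path graph; the exact identifier layout after the “----” separator may differ from the real files):

```python
# Sketch: convert an adjacency matrix into the "textual" coding, each
# node listing its neighbours as a pseudo-response, nodes separated by
# "----" lines as in DtmVic's open-question format.
def to_textual(names, m):
    chunks = []
    for i, name in enumerate(names):
        neigh = [names[j] for j in range(len(names)) if m[i][j]]
        chunks.append("---- " + name + "\n" + " ".join(neigh))
    return "\n".join(chunks)

names = ["n1", "n2", "n3"]
m = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]   # a path: n1 - n2 - n3
print(to_textual(names, m))
```

Only the positions of the "1" entries are stored, which is why this coding is much smaller than the full binary matrix M.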
1.3.3 The file:
“Chessboard_Extern_7x7.txt” is present in both
preceding directories, numerical and textual. It is another possible coding of the
chessboard graph, similar to the previous textual file. But in this case, the
numbers are effectively read as integers, not as simple sequences of characters.
The first line of the data set contains the number of nodes (49), then the
length of the identifiers (4) and the maximum degree of the graph
(upper bound of the numbers of edges adjacent to a single node)
(10). Note that each row terminates with the dummy value 0.
Such a specific format, the most compact one, can lead directly to a description of
the graph through the sub-menu “Contiguity” of DtmVic, without any
command file.
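Here is a sketch of reading such a header and its rows (a tiny invented 3-node file; the real chessboard file has 49 nodes, identifier length 4, and maximum degree 10):

```python
# Sketch of the "external format" described above: a header line with
# the number of nodes, the identifier length and the maximum degree,
# then one row per node listing neighbour numbers, ending with a dummy 0.
data = """3 4 2
nod1 2 0
nod2 1 3 0
nod3 2 0"""

lines = data.splitlines()
n_nodes, id_len, max_deg = map(int, lines[0].split())
graph = {}
for ln in lines[1:]:
    parts = ln.split()
    node, neigh = parts[0], [int(x) for x in parts[1:]]
    assert neigh[-1] == 0          # the dummy terminator
    graph[node] = neigh[:-1]
print(n_nodes, graph)
```

The sample content is invented; only the header fields and the trailing dummy 0 come from the description above.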
1.4 Open the sub-directory named
“Cycle” .
This sub-directory is the counterpart of the previous one relating to the
chessboard graph. Only the shape of the graph is different. The
textual coding and the PCA command file are omitted in this case.
1.5 Open the sub-directory named “Geography”
.
The two sub-sub-directories are the counterparts of those
relating to the chessboard textual example. The directories
“Japan_text”
and “France_text”
exemplify the “textual coding” in the case of maps
describing the different regions of Japan and the departments of
France.
In the case of Japan, for example, the first two lines of the file
Japan_map_text.txt indicate that the provinces of Akita
and Iwate are contiguous to the province of Aomori,
etc. The file “Japan_text_param.txt”
is the corresponding command file. It is identical to the file
“Chessboard_Textual.Param.txt” ,
except for the name of the input data set.
Section 2: Running the example
“Chessboard_numerical”
Click on the button
“Open an existing command file” (panel
DTM: Basic Steps of the main menu)
2.1 Reach again the “sub-sub-directory”:
“Chessboard_numerical”.
We are in the framework of either a classical correspondence analysis or a
Principal Components Analysis.
a) Data file: “Chessboard_7x7_dat.txt”
b) Dictionary file: “Chessboard_7x7_dic.txt”.
c) Command file: “Chessboard_CA.Param.txt”
[Correspondence Analysis] or, later on,
“Chessboard_PCA.Param.txt”
[Principal Components Analysis]
Note again that other “command files”, similar to the previous ones,
can be easily generated by clicking on the button
“Create a command file”
of the main menu (Basic Steps).
A window “Choosing among some basic analyses”
appears. Click then either on the button SCA
(Simple Correspondence Analysis) or on the button
PCA (Principal Components Analysis), both
of them located in the paragraph “Numerical data”,
and follow the instructions as shown in Tutorial A.
We will start with correspondence analysis.
2.2 Open the command file: “Chessboard_CA.Param.txt”
After identifying the two data files, four "steps" are performed:
ARDAT
(Archiving data),
SELEC
(selecting active and supplementary elements),
AFCOR
(Correspondence analysis). (See, e.g., : Example A.2)
2.3 Return to the main menu ( “return
to execute” )
2.4 Click on the button: “Execute”
This step will run the basic computation steps present in the command file: archiving data
and dictionary, selection of active elements, correspondence analysis
of the selected table.
2.5 Click the button:
“Basic numerical results”
The button opens an html file named
“imp.html”,
created and saved by the run,
which contains the main results of the previous basic computation
steps. After perusing these numerical results, return to the main
menu. Note that this file is also saved under another name
(see previous examples).
2.6 At this stage, we click on one of the lower buttons of the basic steps
panel (Steps: “VIC”)
2.7 Click directly on the button: “Visualization”
(we skip here the buttons AxeView ,
PlaneView Research , etc.)
We are going to visualise the graph.
2.7.1 A new window named “DTM-Visualization:
Loading files, Selecting axes” appears.
2.7.2 Click on “Load coordinate”
2.7.3 In the corresponding sub-menu,
choose the file: “ngus_ind.txt”
(individuals).
The principal coordinates of the individuals (rows) are selected.
[Since the data matrix is symmetrical, it is equivalent to choose
“ngus_var_act.txt” ].
2.7.4 Click then on “Load
or create a Partition”
2.7.5 In the corresponding sub-menu, choose “No
Partition”.
2.7.6 Click on “MST”
(Minimum Spanning Tree). Choose then the number of axes that will
serve to compute the Minimum Spanning Tree: 8 (for example).
2.7.7 Click on “N.N.”
(search for nearest neighbours – limited to 20 NN).
2.7.8 Click on “Graphics” .
2.7.9 Choose the axes 1 and 2
(default) in the window “Selection
of axes” and
click on “Display” .
2.7.10 A new window entitled
“Visualisation, Graphics”
is displayed.
2.7.11 About the window
“Visualisation, Graphics”
In the window entitled
“Visualisation, Graphics”
are displayed the nodes in the plane spanned by the selected axes. A
random colour is attributed to the display. The button
“Change colour” allows you to
try a new set of colours.
On the vertical tool bar, you can press each button to activate it
( red colour), and press it again to cancel
the activation (initial colour)
– The button “Density”,
for the sake of clarity, replaces the identifiers of nodes with a single
character.
– The button “C.Hull”
(Convex hull) is irrelevant here.
– The button “MST”
(Minimum Spanning Tree) draws a possible minimum spanning tree.
– The button “Ellipse”
is not relevant here.
– The button “N.N.”
(Nearest neighbours) joins each point to its nearest neighbours.
Pressing afterwards the button
“N.N.up” allows you to increment the number of
neighbours up to the 20 nearest neighbours.
– The button “ExtG”
allows you to load the graph in "External format".
– The button “Graph”
(available only when an external graph has been loaded) allows you to draw the edges of the
graph (interesting to watch the distortions of the graph according to the
selected pairs of axes).
Important: in this particular application, the Minimum Spanning Tree and also
the nearest neighbours are computed from the coordinates of the nodes
in a space spanned by the first components.
2.8 Go back to the main menu.
2.9 Redo all the operations 2.2 to 2.7, opening now, during step 2.2, the
command file: “Chessboard_PCA.Param.txt”
(Principal Components Analysis).
It will be seen through this example that PCA is less faithful than CA
vis-à-vis the description of the graph structure.
Section 3 : Running the example “Chessboard_textual”
Click on the button
“Open an existing command file”
(panel Basic Steps of the main menu)
3.1 Open the “sub-sub-directory”:
“Chessboard_textual”:
We are in the framework of a textual analysis similar to that of the examples
aiming at describing the responses to an open-ended question in a sample survey
(examples A.5, A.6, B.1 to B.3).
We find in this directory the text file and the command file.
(in this particular context, there is neither a data file nor a dictionary file: the
questionnaire comprises one pseudo open-ended question, put to
each node: “Which are your neighbouring nodes?”)
3.1.1) Text file: Chessboard_textual_7x7.txt
The format is the same as in Example A.5. Since the
responses may have very different lengths, separators are used to
distinguish between individuals (or: respondents). Individuals (here:
nodes) are separated by the chain of characters “----“
(starting column 1) possibly followed by an identifier.
3.1.2) Command file: Chessboard_Textual.Param.txt
The computational phase of the analysis is decomposed into "steps".
Each step requires some parameters briefly described in the main menu
of DtmVic (button: "Help about command parameters" ).
3.2) Open the command file: Chessboard_Textual.Param.txt
After identifying the input textual data file, four "steps" are
performed:
ARTEX (Archiving texts),
SELOX (selecting the open question),
NUMER (numerical coding of the text),
ASPAR (correspondence analysis of the [sparse] contingency
table “respondents - words”).
We will not comment on
this command file, which drives the basic computation steps (see
Example B.1). Instead of editing this file, we will simply
go back to the main menu and execute the basic
computation steps.
Recall that such a command file can be generated by clicking
on the button “Create a command file” of the main menu
(DTM: Basic Steps).
A window “Choosing among some basic analysis”
appears. Then click on the button: VISURESP,
located in the paragraph “Textual data”,
and follow the instructions as shown in Tutorial A.
Note also that in this simple data case (only one fictitious “open
question”), it is possible to consider each response as a text.
In such a case, the response separators “----“ should be replaced
with the text separator “****”, as in example A.4 of Tutorial A.
Instead of the analysis “VISURESP”,
it is then necessary to perform the analysis
“VISUTEX”.
3.3) Return to the main menu (“return
to execute” )
3.4) Click "Execute"
This step will run the basic computation steps present in the command file:
archiving text, correspondence analysis of the lexical table.
3.5) Click the
“Basic numerical results” button
The button opens a saved html file named
“imp.html”
which contains the main results of the previous basic computation
steps. After perusing these numerical results, return to the main
menu. Note that this file is also saved under another name
(see, for example, Tutorial A).
From the step NUMER, we learn for instance that we have 49 responses,
with a total of 217 word occurrences (tokens; here: edges of the graph),
involving 49 distinct words (here: neighbours). Note that each node
has been considered as its own neighbour.
3.6 Click the
PlaneView Research button, and follow the sub-menus...
In this example, four items of the menu are relevant
“Active columns (variables or categories)”,
“Active rows (individuals, observations)”, “Active columns + Active
rows”, “Active individuals (density)”.
The graphical displays of the chosen pairs of axes are then produced.
3.7 Click on “Visualization”
All the steps of the previous section 2.7 could be carried out likewise.
Section 4 : Running the example “Chessboard_Extern”
There is neither a command file nor a dictionary file in
this directory, since the specific type of coding of the graph (“external
coding”) provides a direct entry into the “Contiguity”
menu.
In the menu “Visualization, Inference, Classification”,
click on the button: “Contiguity”.
4.1 Click on “Parameters/Edit”
Choose the item “Create”
We are going to enter the parameters needed for a graph description:
- In the first block entitled
“ncoord = input coordinate file”,
tick “0”: “No coordinate file (simple description
of an external graph)”.
- In the second block entitled
“npart = partition file” ,
tick “0” (No partition)
- In the third block entitled
“meth = method” , tick “4”
(External contiguity graph ).
Then: Click on: “Validate”
(as prompted by a message).
The parameter file should be in the same directory as the external graph file
(as suggested by a pop-up message).
4.2 In the upper bar of the window, Click on
“Execute”.
A new window appears, and you are asked to choose the external graph file.
It is in this example the file:
“Chessboard_Extern_7x7.txt”.
The computations are carried out.
The item “Results”
of that bar contains some technical details about the computations involved in the
correspondence analysis of the associated matrix M
(These results are saved in the file
“imp_contig.txt” ).
4.3 Click on “Visualisation”.
In the menu “Load coordinates” ,
of the new window, choose the file: anagraf.txt .
(graph view through Correspondence Analysis)
In the menu “Load or Create Partition” , choose the
item: No partition .
We can compute and load the Minimum Spanning Tree or the Nearest
Neighbours from the “anagraf.txt” file coordinates,
choosing for instance 12 axes (maximum number allowed in this
version = 30).
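As an illustration of what is computed at this step, a Minimum Spanning Tree on the Euclidean distances between node coordinates can be sketched with Prim's algorithm (our own minimal implementation, using numpy only; DtmVic's actual code may differ):

```python
import numpy as np

def minimum_spanning_tree(coords):
    """Prim's algorithm on the complete Euclidean graph of the points.

    coords: (n, k) array, e.g. the coordinates of the nodes on the
    first k factorial axes.  Returns the n-1 edges (i, j) of the MST.
    """
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    # Pairwise Euclidean distance matrix.
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    visited = np.zeros(n, dtype=bool)
    visited[0] = True
    best = dist[0].copy()            # cheapest distance from the tree to each node
    parent = np.zeros(n, dtype=int)  # tree node realizing that distance
    edges = []
    for _ in range(n - 1):
        # Pick the unvisited node closest to the current tree.
        j = int(np.argmin(np.where(visited, np.inf, best)))
        edges.append((int(parent[j]), j))
        visited[j] = True
        improved = dist[j] < best
        parent[improved] = j
        best = np.minimum(best, dist[j])
    return edges

# Four collinear points: the MST must chain them in order.
pts = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
mst = minimum_spanning_tree(pts)
```

Computing the tree in the space of the first k axes (here 12) rather than on the raw data is precisely what the menu option does with the "anagraf.txt" coordinates.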
4.4 Click on “Graphics”.
Then choose the axes 1 and 2 (default values)
Click on “Display”.
Change the colours if necessary.
Once again, all the steps of the previous section 2.7 could be carried out
likewise.
4.5 About the window
“Visualisation, Graphics”
To represent the edges of the original graph, click on the button
“ExtG”
(External Graph) of the vertical bar.
Then open again the file “Chessboard_Extern_7x7.txt”.
Click on the button “Graph”.
The button “Graph”
produces the original graph as recorded in the file.
This allows you to observe the distortion of the planar graph in the
spaces spanned by the axes 3 to 12. It is the multidimensional
Guttman effect [see Benzécri (1973), (in French) “L’analyse
des données”, Tome II B, Chapter 10, “Sur
l’analyse de la correspondance définie par un graphe”,
pp. 244-261].
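The harmonic structure behind this Guttman effect can be checked numerically. Assuming a cycle graph in which each node counts as its own neighbour (as in Section 3), the row-normalized adjacency matrix is circulant: its eigenvectors come in cos/sin pairs, the first pair drawing a circle in the first factorial plane and the later pairs its harmonics on the higher axes (a sketch of the principle, not DtmVic's code):

```python
import numpy as np

# Cycle graph with n nodes; each node is also its own neighbour.
n = 12
A = np.zeros((n, n))
for i in range(n):
    for j in (i - 1, i, i + 1):
        A[i, j % n] = 1

P = A / A.sum(axis=1, keepdims=True)   # row-stochastic, here symmetric

# Eigenvalues of this circulant matrix: (1 + 2*cos(2*pi*k/n)) / 3,
# appearing in pairs (k and n-k share the same value).  The paired
# cos/sin eigenvectors trace a circle on the first pair of axes and
# harmonics of it on the higher axes -- the multidimensional Guttman effect.
vals = np.sort(np.linalg.eigvalsh(P))[::-1]
expected = np.sort([(1 + 2 * np.cos(2 * np.pi * k / n)) / 3
                    for k in range(n)])[::-1]
```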
Section 5 : Running the example “Cycle_Numerical”
This section is identical to Section 2
(Running the example “Chessboard_Numerical”). The graph
now has the shape of a cycle, with the same number of nodes.
The homologues of the files
“Chessboard_7x7_dat.txt”,
“Chessboard_7x7_dic.txt” and
“Chessboard_CA_Param.txt”
are now respectively “Cycle_49_dat.txt”,
“Cycle_49_dic.txt” and
“Cycle_CA_Param.txt” . They can be found in the directory
“Cycle”.
Section 6 : Running the example “Cycle_Extern”
This section is identical to Section 4 (Running the example
“Chessboard_Extern”). The graph now
has the shape of a cycle, with the same number of nodes.
The homologue of the file
“Chessboard_7x7_Extern.txt”
is the file: “Cycle_Extern_49.txt” .
Section 7 : Running the example “Japan_map”
This section is identical to Section 3
(Running the example “Chessboard_Textual”). The graph is
now a sketch of a map of Japan, presented as a set of responses to
the open question “Which are your neighbouring regions”,
the “respondents” being the same regions of Japan…
The homologue of the directory
“Chessboard_Textual”
is : “Japan_map”
whereas the homologue files of
“Chessboard_textual_7x7.txt”
and “Chessboard_textual_Param.txt”
are respectively: “Japan_map_Textual.tex.txt”
and “Japan_map_Textual.Param.txt” .
Section 8 : Running the examples “France_map”
This section is identical to Section 3
(Running the example “Chessboard_Textual”). The graph is
now a sketch of a map of France, presented as responses to the open
question “Which are your neighbouring departements (= counties)”, the
“respondents” also being the departements of France…
The homologue of the directory
“Chessboard_Textual”
is : “France_map”
whereas the homologue files of “
Chessboard_textual_7x7.txt”
and “Chessboard_textual_Param.txt”
are respectively : “France.tex.txt”
and “France.Param.txt” .
The homologue of the file
“Chessboard_7x7_Extern.txt”
is the file: “France_Extern.txt” .
End of example C.3
Example C.4: EX_C04.Images
(Structural Compression of Images through SVD, CA and Discrete Fourier Transform)
The examples of C.4 are mainly pedagogical; they
serve as an illustration of the compression effect of principal
axes techniques (keeping a limited number of principal axes in Singular Value
Decomposition and Correspondence Analysis) in the domain of image analysis
(rather unexpected for most DtmVic users).
A comparison is made with the Discrete Fourier Transform (keeping a limited number
of terms of the expansion), which takes into account the relative locations
of the pixels.
These examples do not use data in internal DtmVic text format, since they deal with
digitized images. A simple rectangular array of integers suffices: there is
no need for identifiers of rows or columns.
In fact, three particular formats will be used: rectangular arrays of
levels of grey (simple text format), plain “pgm” format
(acronym derived from "Portable Gray
Map") and, for color images, plain “ppm” format
(acronym derived from "Portable Pixel Map").
A specialized interface is provided via the button
“DtmVic Images”
of the main menu.
1. About the data (some image formats)
To have a look at the data,
1.1 Search for the directory “ DtmVic_Examples”.
1.2 Search for the sub-directory
“DtmVic_Examples_C_NumData”
in “ DtmVic_Examples”.
1.3 In that directory, open the directory of Example C.4:
“EX_C04.Images” .
1.4 Four sub-directories correspond to four examples:
“1_Cheetah_txt”,
“2_Baalbeck_pgm”,
“3_Cardinal_ppm_color”,
“4_Extra_pgm_ppm”
All these files can be examined via a text editor (such as “notepad”,
included in Windows, or free software such as “notepad++”
or “TotalEdit”, etc.).
1.5 For greyscale (US: grayscale) images, two input formats are available:
1.5.1 Simple text format :
The data table contains positive integers <= 255 that are the values of
the level of grey for each pixel (no identifiers). This is the case of the image
“cheetah.txt” in the folder
“1_Cheetah_txt”
(adapted from "The Data Compression Book", Mark Nelson,
M&T Publishing Inc., 1992). Such a format, which does not contain
the size of the image explicitly, is the simplest one. Because of its
rusticity, this format is neither used nor provided by the usual image
processing software.
1.5.2 The "pgm" format :
(Portable Grayscale Map) (look at the example:
"2_Baalbeck.pgm", using a text editor or a notepad)
The PGM format is a simple and transparent greyscale file format.
The plain format differs in that a file contains a single image
(the general pgm format can cope with several images).
The first line contains the format identifier: P2.
The second and the third lines contain three integers:
number of columns, number of rows, and the maximum value (255).
Then the table is displayed row-wise.
Each pixel in the table is represented as an ASCII decimal number (<= 255).
Each pixel in the table has at least one white space before and after it.
No line should exceed 72 characters.
For more information about such a format, please consult
(e.g.): http://netpbm.sourceforge.net/doc/pgm.html
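A plain P2 file of this kind can be produced and read back with a few lines of code (our own sketch; for brevity it ignores the 72-character line limit and the optional comment lines of the real format):

```python
def to_plain_pgm(rows, maxval=255):
    """Serialize a 2-D list of grey levels into plain PGM (P2) text:
    format identifier, then columns/rows/maxval, then the table row-wise."""
    ncols, nrows = len(rows[0]), len(rows)
    lines = ["P2", f"{ncols} {nrows}", str(maxval)]
    for row in rows:
        lines.append(" ".join(str(v) for v in row))
    return "\n".join(lines) + "\n"

def from_plain_pgm(text):
    """Parse plain PGM text back into (rows, maxval).  Whitespace of any
    kind separates the tokens, as in the real format."""
    tokens = text.split()
    assert tokens[0] == "P2", "not a plain PGM file"
    ncols, nrows, maxval = int(tokens[1]), int(tokens[2]), int(tokens[3])
    vals = [int(t) for t in tokens[4:4 + ncols * nrows]]
    rows = [vals[i * ncols:(i + 1) * ncols] for i in range(nrows)]
    return rows, maxval

img = [[0, 128, 255], [64, 32, 16]]
round_trip, mv = from_plain_pgm(to_plain_pgm(img))
```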
1.5.3 The "ppm" format for colour images:
For (small) colour images, the input format is the ppm text format
(acronym for: portable pixel map). Look at the example
"3_Cardinal.ppm" , via a text editor or a notepad.
The three integers (levels of Red, Green, Blue) describing each pixel are
located consecutively in the same row.
Both pgm and ppm files can be obtained through an exportation from the
free software "Open Office", using a jpeg file as
input.
2. Running a first example (simple greyscale format)
In the Main Menu, Click on the button
""SVD and CA of images"
(in the section:
"DtmVic-Images").
The first thing to do is to select an image.
One of the three buttons on the left-hand side of the window
has to be selected to open the image, according to its format.
2.1 Click on the first button
"Read (formatted txt file)"
in the section "Open greyscale image".
2.2 In the directory "EX_C04.Images",
open the sub-directory "1_Cheetah_txt".
Within "1_Cheetah_txt",
open the file "Cheetah.txt".
A message-box recalls the size of the image file.
If you wish to visualize the original image,
in the section "Visualization", click on:
"Image (greyscale)".
2.3 Then, in the lower left part of the window,
in the section "Compression techniques",
click the button:
"Correspondence Analysis" (to begin with).
2.4 If you wish to obtain an overview of the data reconstitution,
from 1 to 100 axes, Click directly on the button:
"Series from first term to total",
in the right hand side panel.
You can then observe the progressive reconstitution of the original
data table (i.e.: the image).
2.5 If you are interested in focusing on a specific number of axes,
then select the required number of axes in the vertical corresponding
list, and visualize each image.
Note that all the created images are saved in bitmap format
(extension: ".bmp") in the directory of the analysed image file.
2.6 Instead of Correspondence Analysis, you can choose
"Singular Value Decomposition",
and redo all the operations 2.4 and 2.5.
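What the "Singular Value Decomposition" button computes can be sketched as a truncated SVD reconstitution (a minimal numpy illustration of the principle, not DtmVic's actual code):

```python
import numpy as np

def svd_compress(img, k):
    """Reconstitute a 2-D grey-level table from its first k singular
    triplets: keeping few triplets gives a heavily compressed image,
    keeping all of them gives the original back."""
    U, s, Vt = np.linalg.svd(np.asarray(img, dtype=float),
                             full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

rng = np.random.default_rng(0)
img = rng.random((20, 30))
full = svd_compress(img, min(img.shape))  # keeping all axes: exact
rank1 = svd_compress(img, 1)              # strongest compression: rank 1
```

The sequence of images produced by "Series from first term to total" corresponds to calling such a reconstitution for k = 1, 2, 3, ...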
2.7 If you select the lower button:
"Discrete Fourier Transform",
a new window is displayed.
2.8 You have then to select the mode of computation of the
Fourier series ("Row-wise" or "Column-wise").
Select "Row-wise", for example.
2.9 Then, as previously, you can go directly to the right-hand
side panel, and press the button:
"Series from the first term to total (greyscale)"
.
The comparison of the obtained reconstitution
(according to the number of kept terms in the Fourier decomposition)
with the preceding reconstitution (using CA or SVD) is quite interesting.
2.10 If you are interested in focusing on a specific number of terms,
then select the required number of terms in the vertical
corresponding list, and visualize each image.
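The row-wise Fourier reconstitution can likewise be sketched with numpy, keeping only the first k terms of each row's spectrum (again our own illustration of the principle):

```python
import numpy as np

def fourier_compress_rows(img, k):
    """Keep the first k terms of each row's (real) Fourier expansion
    and invert.  Unlike SVD or CA, the result depends on the order of
    the pixels within each row."""
    img = np.asarray(img, dtype=float)
    spectrum = np.fft.rfft(img, axis=1)
    spectrum[:, k:] = 0                       # drop higher-frequency terms
    return np.fft.irfft(spectrum, n=img.shape[1], axis=1)

rng = np.random.default_rng(1)
img = rng.random((8, 16))
exact = fourier_compress_rows(img, 16 // 2 + 1)  # all 9 terms kept: exact
smooth = fourier_compress_rows(img, 3)           # only 3 terms per row
```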
Note 1: Incidentally, the graphical display of the levels of grey for
each row can be obtained from the button
"Curves of grey levels"
(press it several times to scan the whole image).
Note 2: All created images are saved in bitmap format
(extension: ".bmp") in the directory of the analysed image file.
Note 3: The compression through SVD or CA does not depend on the order
of the rows and columns of the table (unlike the Fourier compression).
Nevertheless, this "structural compression" (i.e., one ignoring the relative locations of the pixels)
gives worthwhile results.
3. Running other examples:
3.1 Baalbeck Temple example.
Click on the second button "Read (pgm format)",
always in the section "Open greyscale image".
In the directory "EX_CO4_Image",
open the sub-directory "2_Baalbeck_pgm".
Within "2_Baalbeck_pgm",
open the file "Baalbeck.pgm".
A message-box recalls the size of the image file.
If you wish to visualize the original image,
in the section "
Visualization", click on:
"Image (greyscale)".
Then redo all the operations 2.3 to 2.10.
This example is interesting since it emphasizes the fact that a strong pattern
(here: the columns of the temple) can contaminate the reconstitution
(not in the case of the row-wise Fourier reconstitution, as expected...).
3.2 Cardinal (of Mauritius) example.
Click on the third button "Read (ppm format)",
in the section "Open colour image".
In the directory "EX_CO4_Image",
open the sub-directory "3_cardinal_ppm_colour".
Within "3_cardinal_ppm_colour",
open the file "cardinal.ppm".
A message-box recalls the size of the image file.
If you wish to visualize the original image, in the
section "Visualization",
click on: "Image (colour)".
Then redo all the operations 2.3 to 2.10.
3.3 Extra_pgm_ppm example.
The folder "4_Extra_pgm_ppm" contains two
versions (colour and grey) of an image of a young boy using a broom.
Proceed as in section 3.1 for the pgm image, and as shown in section 3.2 for the ppm image.
Note: Recall that in the ppm format, the three basic colours (RGB = Red, Green, Blue)
corresponding to each pixel have consecutive locations in the same row
(the length of which is consequently three times the number of pixels).
The compression through SVD or CA does not depend on the order of the
columns, which means that we do not use the fact that the three colours
relate to the same pixel! Nevertheless, the "structural compression" works.
In this case, the row-wise Fourier series is not well suited (unless we choose to
juxtapose column-wise the three tables Red, Green, Blue, and, in so doing, abandon
the ppm format). The reader can compare the results from the two Fourier
compressions, row-wise and column-wise.
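The layout described in this note can be sketched as follows: flattening an (H, W, 3) image in row-major order automatically places the R, G, B values of each pixel in consecutive columns of the same row (our own illustration):

```python
import numpy as np

def rgb_to_table(img):
    """Flatten an (H, W, 3) colour image into the (H, 3W) table that
    SVD/CA analyse: row-major order puts the R, G, B values of each
    pixel side by side."""
    h, w, _ = img.shape
    return img.reshape(h, 3 * w)

def table_to_rgb(table):
    """Inverse operation: fold the (H, 3W) table back into (H, W, 3)."""
    h, w3 = table.shape
    return table.reshape(h, w3 // 3, 3)

rng = np.random.default_rng(2)
img = rng.integers(0, 256, size=(4, 5, 3))   # a small random colour image
table = rgb_to_table(img)
```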
End of example C.4
DtmVic - Tutorial D
Data importation
Three examples of data importation for DtmVic
This tutorial contains a series of examples of data importation that aim at
capturing or transforming data to comply with the DtmVic
format files. Each example corresponds to a directory included in the
sub-directory “DtmVic_Examples_D_Import”
included in the directory “DtmVic_Examples”
that has been downloaded with DtmVic.
Importation examples D.1—D.3
Introduction Dtm-Vic Format
(Internal DtmVic format for input data and texts)
Preliminary Example D.0.
Capture of dictionary and data
(Dictionary and Data from the keyboard)
Example D.1.
EX_D01.Importation.XL.
Importation from an Excel File
(Importation of dictionary, numerical and textual data from an Excel ® file)
Example D2. EX_D02.Importation.Text.Free
(Importation of Textual Data from a specific free format file)
Example D3. EX_D03.Importation.Text.Num.XML.
(Importation of both numerical and Textual Data from a XML format file)
-Introduction -
Internal DtmVic format for input data and texts
The aim of the importation procedures is to transform a
pre-existing text file into the “Internal DtmVic format”.
The knowledge of the internal DtmVic format could be useful to some
advanced users; it is not indispensable for the beginners.
Recall that DtmVic is a piece of software devoted to the
exploratory analysis of multivariate numerical and textual data.
The leading case that exemplifies all the possibilities
of the software is a sample survey data set, comprising both
responses to closed questions and responses to open-ended questions
(the closed questions may lead to numerical [quantitative] or
categorical [qualitative] data).
In the most general configuration, three files
constitute the internal DtmVic input data set:
1) The dictionary file, which provides the names (or
identifiers) of the numerical and categorical variables. It includes
the names of the categories corresponding to each categorical
variable. This latter feature is rather uncommon in statistical
software, but seems indispensable for exploring high-dimensional
categorical data sets.
2) The data file, which contains the values of these
variables for a set of individuals (or: observations), together with
the identifiers of the individuals.
3) The text file, made (e.g.) of the responses to open-ended
questions. The text file (known as text file type 2) concerns the
same respondents as those of the data file, in the same order. A simplified
“text file format” (text file type 1) can be used when
dealing only with a series of texts, without an associated data file and
dictionary file.
Some applications may involve only the text file (see
for instance the example A4 of Tutorial A), whereas others may need
only the dictionary and the data files (application examples A1, A2,
A3, C1, C2, of Tutorials A and C).
Internal “DtmVic format”
The format is specific, but not proprietary: The three types of files
are in simple text format (extension “.txt”, readable
through a “notepad” or a text editor, or also with a word
processor, provided that they are saved as simple text files).
As an introductory exercise, they can be recorded directly from the
keyboard, or with the help of the menu “DataCapture”
(see preliminary example D.0 below).
In most cases, however, they have to be imported from (often large)
pre-existing files. The transformation into DtmVic format is then
transparent to the user.
Table 1 shows an example of a small DtmVic dictionary, involving four
variables. Table 2 displays an example of a DtmVic data file (same four
variables, three individuals or respondents).
Table 3 presents a text file relating to three open-ended questions
and three respondents.
Table 1: Example of an internal DtmVic dictionary for 4 variables
Gender (2 categories); Age (0 categories = numerical variable); Age broken
down into 4 categories;
Educational level (3 categories). [fixed format, comments in italic, blue].
   2 GENDER (number of categories [2] in columns 1-4; blank; title of the variable)
MALE MALE (short identifier [columns 1-4]; blank; identifier [< 20 characters])
FEMA FEMALE (short identifier [columns 1-4]; blank; identifier [< 20 characters])
   0 AGE (number of categories [0] in columns 1-4; blank; numerical variable)
   4 AGE_CODE (number of categories [4] in columns 1-4; blank; title of the variable)
AGE1 18_24 (short identifier [columns 1-4]; blank; identifier [< 20 characters])
AGE2 25_39 (short identifier [columns 1-4]; blank; identifier [< 20 characters])
AGE3 40_59 (short identifier [columns 1-4]; blank; identifier [< 20 characters])
AGE4 >60 (short identifier [columns 1-4]; blank; identifier [< 20 characters])
   3 EDUCATION (number of categories [3] in columns 1-4; blank; title of the variable)
EDUL LOW (short identifier [columns 1-4]; blank; identifier [< 20 characters])
EDUM MEDIUM (short identifier [columns 1-4]; blank; identifier [< 20 characters])
EDUH HIGH (short identifier [columns 1-4]; blank; identifier [< 20 characters])
Table 2: Example of an internal DtmVic data file for the previous 4 variables:
Gender, Age broken down into 4 categories, Educational level. 3 respondents
(individuals, observations)
'1006' 1 76 12 1 (Identifiers of the respondents : between quotes,
'1007' 2 20 2 2 without blank, less than 20 characters. Separators
'1008' 2 29 3 2 between values: at least one blank space)
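To make the fixed format of Table 1 concrete, here is a sketch that renders such a dictionary (our own helper; it assumes the category count is right-aligned in columns 1-4 and the short identifier left-aligned in columns 1-4, as Table 1 suggests):

```python
def dictionary_lines(variables):
    """Render a DtmVic-style dictionary.

    variables: list of (name, categories) pairs, where categories is a
    list of (short_id, label) pairs -- empty for a numerical variable.
    """
    lines = []
    for name, categories in variables:
        # Number of categories in columns 1-4, then a blank, then the title.
        lines.append(f"{len(categories):>4} {name}")
        for short_id, label in categories:
            # Short identifier in columns 1-4, then a blank, then the label.
            lines.append(f"{short_id:<4} {label}")
    return lines

dic = dictionary_lines([
    ("GENDER", [("MALE", "MALE"), ("FEMA", "FEMALE")]),
    ("AGE", []),  # 0 categories: numerical variable
])
```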
Table 3: Example of an internal DtmVic text file (type 1) for three texts
(see: application example EX_A04.Text.Poems of Tutorial A for texts in English).
Free text format on less than 200 columns (80 columns in the previous versions of DtmVic).
Separator of texts : “****“ followed, after four blank spaces,
by the identifier (<= 20 characters); End of file: “====”.
All separators are in columns 1, 2, 3, 4.
Such a format does not imply a specific importation procedure.
The original text, in MsWord for instance, has to be saved in .txt format,
with the option of inserting or saving the ends of lines and carriage returns
(to obtain lines of less than 200 characters).
Check afterwards that separators and identifiers comply with the previous constraints.
**** LAMARTINE
Voilà les feuilles sans sève,
Qui tombent sur le gazon
Voilà le vent qui s'élève,
Et gémit dans le vallon
Voilà l'errante hirondelle,
Qui rase du bout de l'aile,
L'eau dormante des marais...
**** GAUTIER
L'automne va finir, au milieu du ciel terne,
Dans un cercle blafard et livide que cerne
Un nuage plombe, le soleil dort. Du fond
Des étangs remplis d'eau monte un brouillard qui fond
Collines, champs, hameaux dans une même teinte.
**** VERLAINE
Les sanglots longs
Des violons
De l'automne
Blessent mon coeur
D’une langueur
Monotone.
====
Table 4: Example of an internal DtmVic text file (type 2) for three
responses to three open-ended questions and for three
respondents (see: application examples A5, A6 from Tutorial A and B1,
B2, of Tutorial B).
Free text format on less than 200 columns. Separator of respondents: “----“
followed by the identifier (<= 20 characters); Separator of
question: “++++”; End of file: “====”. All
separators are in columns 1, 2, 3, 4.
Note the blank lines for empty responses (last respondent, second and
third questions).
---- 1006
my sons, my kids are very important to me,
being on my own I am responsible for their education
and moral standard
++++
education and moral standard of the youngsters, law and order
++++
basically, British culture is traditional,
people tend to keep themselves to themselves
---- 1007
job, being a teacher I love my job, for the well being
of the children
++++
law and order, drug abuse, child abuse
++++
accommodating, of course people from different races
and culture have settled in here, (i.e., Irish, Jewish,
Asians) and the British culture is working alright
---- 1008
job, sometimes it is very hard to find a job
++++
++++
====
Preliminary Example D.0
Capture of numerical data and dictionary
Recording the dictionary presented above in the introduction and recording some data.
Click on the button “Data
Importation, Preprocessing, Data Capture, Exportation”
(Basic Steps from the main menu of DtmVic).
A new window appears.
-
Choose the item:
“Building the Dictionary (manually)”.
In the new green window, three yellow boxes are ready to receive the information
relating to the first variable:
“Variable number”: “1” (default value);
“Variable identifier”: type Gender;
“Variable type”: type “2”
[the type of a variable is the number of its categories] (special value: 0 for a
numerical variable).
Since the number of categories is greater than 1, a second green window
appears, inviting you to record the names of the categories:
male,
female .
A second variable is then proposed. It will be the age, the type of
which is “0”:
it is a numerical variable. No window appears, since no categories
are involved.
A third variable is proposed, you may record a categorical variable “age” in 4
categories… etc.
A report of the recorded data is printed in the lower window, while the right-hand side window
displays the dictionary in DtmVic internal format. It is that
dictionary that will be recorded at the end of the capture process.
When all variables are recorded, one must click on the button
“Save dictionary”.
We suggest building a new directory (or folder) in a workspace that
is convenient for you, opening that directory, and saving the
dictionary as “dic.txt”.
Then:
“Return” .
We are back in the Data Capture window.
Choose now the item “Creating the data
file”.
The window “Creating data source file” appears.
-
Click on
“LOAD DICTIONARY”
- Ignore, at this stage, the button
“Update an existing data file” .
The previous chosen directory appears. Select the dictionary:
“dic.txt”.
That dictionary is displayed in the upper right window.
Simultaneously, the yellow upper left window is ready to receive the data relating to
the first individual: its identifier (type, for example,
“Rita”).
Type then the value of the first variable for that individual:
GENDER. A click on the right border of the
caption window displays the two possible values.
Let us choose “ female ”,
to be consistent with the identifier.
The second variable, AGE, is then proposed to the user. A numerical
value must be inserted in the window, etc.
At the end of the record, the second individual or observation is
proposed…
We suggest recording 3 or 4 individuals in this exercise, more if
you wish...
Then press the button “Save Data”.
The same directory is again proposed. A name should be given to
the data file. The extension “.txt” is recommended, to
facilitate a quick access to the content of the file. Let us select
for example the name “dat.txt”.
Press then the button: “Create a first
parameter file”.
The window “Creating a starting parameter file” appears.
Click on
“Create a parameter file”.
A DtmVic parameter file is displayed in the lower window.
That parameter file is automatically saved under the name:
“param_start.txt”.
The parameter file does not include any statistical analysis command,
except basic counts of categories, together with a computation of
extreme and average values for the purely numerical variables.
It is only meant here as a check of the capture of the data.
Comments about the “first parameter file”
After an identification of the two input files, three “steps”
of DtmVic are involved:
The step “ARDAT” archives data and dictionary. The
step “SELEC” selects the variables for the
subsequent processing (in this case, all the available variables are
selected). The step “STATS” computes the basic
statistics mentioned above.
Click on “Execute”.
Read the results by clicking on the “Basic
numerical results” item of the menu. These results
are saved under the names “imp.html” and “imp.txt”
in the same directory.
End of example D.0
Example D.1: EX_D01.Importation.XL
Importation of numerical and textual data
in “Excel ® format”.
(updated: January 4th, 2011)
Transforming a specific XL (csv) format file into DtmVic
dictionary file, text file and data file.
This importation procedure can be applied to any text
file (.txt) having the following features, for n individuals and p
variables:
The first row (p + 1 elements) contains the generic name of the
identifiers (for example: ident ) and the p names of the variables
(no blank space allowed within the name, less than 20 characters, preferably
less than 10) separated with a semicolon (or a comma, or a tab). Blank spaces
are allowed between names (free format).
The n remaining rows contain p + 1 elements: the identifier of the individual
(less than 20 characters) and the values of the p variables (for categorical
variables, no blank spaces are allowed within the alphanumeric values; preferably
less than 10 characters) separated with a semicolon (or a tab). Blank spaces
are allowed between values (free format) and, evidently, within textual variables
(responses to open-ended questions, for example).
Only one type of separator (semicolon or tab) can be used in a file.
Such a file can be obtained by saving an Excel file as either a "CSV file" or
a "tab-separated text file".
1- Looking at the data, preliminary steps
The folder “EX_D01.Importation.XL”
contains the file “datbase_global.xls”.
The file datbase_global.xls corresponds to a frequent situation: the
first row of the table contains the variable identifiers, the first
column comprises the observations identifiers.
To begin with, we will have a look (outside
DtmVic) at the original file to be imported.
This file is under Microsoft Excel ® format. The
reader who is not provided with that software should skip the next
instructions… or use the free software “Open Office”
instead.
1.1 Search for the examples directory
DtmVic_Examples
1.2 In that directory, open the directory of example D.01, named
EX_D01.Importation.XL
1.3 Click on the file:
“datbase_classical.xls”
(basic dictionary and data and texts) to obtain a view of the
data through an Excel spreadsheet.
- The first row contains the names of the 17 variables (there are 18 columns,
but the first one relates to the identifier of individuals).
Note again two important constraints:
a) the names of variables must have less than 20 characters,
b) these names should not contain blank spaces (if any, replace them with underscores).
Note that these names will be truncated down to 10 characters to build the
identifiers of the categories. It is then important that these first
10 characters allow for identifying the variable.
The remaining rows consist of 1043 lines (it is the same sample of
individuals from the socio-economic sample surveys serving as example
in the applications A.5, B.2).
The sequence of characters in the first cell of each line is the
identifier of the individual, the following sequences being the values of
the 17 variables. Blank cells mean “no answer” or
“missing value”.
1.4 We must save this file as a text file in “.csv” format
(command: File, then “Save as”). We obtain a free-format
file with semicolons as separators. The file in “csv”
format is provided in the example directory.
Important:
1.4.1 If there are some semicolons in the data file, they
should be replaced by another symbol before saving the “Excel
file” as a “csv file”.
1.4.2 Note also that before saving the file, the format of the
cells containing numerical values must be “standard”, to avoid
additional small blank spaces in numbers of more than 3 digits, which
are misinterpreted in the csv file. In the French version and in some
European versions of Excel, the “decimal commas” should be
replaced by the usual decimal dots.
1.4.3 If your version of Excel does not allow for “saving
as a csv file”, you can save the file using “tabs”
as separators, and then, change the “tabs” into
“semicolons”, alteration allowed by the button:
“Change tabs into semicolons”
(see below).
This supposes that the initial data set does not already contain semicolons:
if semicolons are present, you should replace them with another
symbol before the importation process.
1.4.4 In many versions of Excel, the csv format uses commas as
separators, instead of semicolons. You can then transform these
commas into semicolons (provided that the initial data set does not
already contain semicolons: if it does, you should first replace these
semicolons with another symbol before the importation process).
2) Sequence of operations
2.1 Click on the button:“Data Importation,
Preprocessing, Data Capture, Exportation”,
(Basic Steps from the main menu of DtmVic). A new window appears.
2.2 Choose the item:
“Importing Dictionary, Data and Texts” .
The new window “Data Importation”
is displayed.
2.3 Press the button entitled: “Excel®
type files (saved as csv files)” .
A new window entitled “Data Importation from an
Excel (r) file” appears.
If the Excel file has been saved using “tabs”
or “commas” as separators, click on one of the optional
buttons:
“0. Change tabs into semi-colons”.
“0. Change commas into semi-colons”.
Select the file saved with tabs or commas, and convert
it. Note that a new name is given to the created file. The
importation process will continue using this new file.
2.4 Then, click on the button: “Start
the Importation Process”
In the new window, click on:
“1.Select input data file.”
(widen the window if necessary).
Select the previously saved file:
“datbase_global.csv” (or the file produced by one of the
previous “0” buttons).
The left hand side memo contains, for each variable, all its
observed values. In the case of continuous numerical variables, the
number of values could be the same as the number of observations. In
the case of textual data, the number of values is the number of
“words” (separators: blanks, periods, commas).
- The central memo is a summary of the
previous one. For each variable, we can read within the brackets the
number of distinct values observed in the file.
- The letter (A) in parentheses means that some letters or
non-numerical values have been observed.
- The letter (N) indicates that only numerical values have been
obtained.
It is then easier to choose the types of the variables:
- categorical ( CHAR ),
- numerical ( NUM ),
- textual ( TEXT ),
- variables to be abandoned ( DISCARD ).
To choose these types, select one or several consecutive
variables in the list and choose, for each variable, one keyword among the
four keywords {CHAR, NUM, TEXT, DISCARD}.
- “CHAR”
means that we are dealing with a nominal (categorical) variable. Such
a variable must be coded with at most 6 characters. For instance,
‘male’ and ‘female’ for coding the gender
(or “0” and “1”, or “10” and “20”
…). Conventionally, the first item (identifier) should be a
“CHAR”.
- “ NUM ”
means that we are dealing with a purely numerical variable.
- “ TEXT ”
means that the records (up to 8000 characters, another constraint)
will feed the textual data file.
- “ DISCARD ”
means that the records (whatever the prior status) will be suppressed
in the imported file.
Clearly, a variable with a few distinct values containing letters (A) should
be a categorical variable “CHAR”.
Similarly, a variable with hundreds of purely numerical values (N) will probably
deserve the type: “NUM”.
If expected numerical values contain letters (A), it could be that in
the original Excel file, the missing values or “Do not apply
(DNA)” entries are represented by alphanumeric symbols. These symbols
should be replaced with blank spaces in the original file, or
directly in the “csv file”, before the importation. If you
give the status “NUM” to a variable whose values contain
letters, the importation process will stop before being
completed, entailing a waste of time.
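The guidance above can be summarized as a rough heuristic. The sketch below is purely illustrative (the distinct-value threshold of 20 is an arbitrary assumption, not a DtmVic parameter); in DtmVic the final choice of types remains yours:

```python
# Illustrative sketch of the type-choice guidance above.
# The distinct-value threshold (20) is an arbitrary assumption,
# not a DtmVic parameter.

def suggest_type(values, char_max_distinct=20):
    distinct = {v for v in values if v.strip()}   # ignore blanks (missing values)
    all_numeric = all(
        v.replace(".", "", 1).lstrip("-").isdigit() for v in distinct
    )
    if all_numeric and len(distinct) > char_max_distinct:
        return "NUM"     # many purely numerical values (N)
    if len(distinct) <= char_max_distinct:
        return "CHAR"    # few distinct values, possibly with letters (A)
    return "TEXT"        # many distinct alphanumeric records may be free text
```

For example, a column of 200 distinct numbers would be suggested as “NUM”, while a column containing only ‘male’ and ‘female’ would be suggested as “CHAR”.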
2.5 Once the attribution of types is completed, click on the button
“3. Updating and continue” .
2.6 In the new window, click on “Values and counts”.
This provides a further check of the consistency of the selected types
of the variables. A list of all the categories found in the data
file, with the corresponding frequencies, is displayed. Basic
parameters are also provided for the numerical variables. We will not
dwell on this output, which serves mainly as a technical check.
2.7 Click then on “Create
dictionary and data” .
A new window entitled “Creating a dictionary and a
data file“ appears on the screen.
2.8 Click on “Name
for the new dictionary” .
You have to choose a name for the
forthcoming DtmVic dictionary, always in the same directory (the
extension “.txt” is recommended). Select for example: “dtm_dic.txt”.
2.9 Click on
“Name for the new data file”
You have to choose a name for the forthcoming DtmVic data file, always in the same directory (the
extension “.txt ” is recommended). Select for example: “dtm_dat.txt”
2.10 [if textual data have been selected] Click on
“Name for the new text file”
You have to choose a name for the forthcoming DtmVic text file, always in the
same directory (the extension “.txt” is recommended).
Select for example: “dtm_text.txt”
2.11 Click on “Create new dictionary”
A DtmVic dictionary is created (number of lines = total number of
variables + number of found categories). The DtmVic dictionary is displayed
in the right hand side memo.
2.12 Click on
“Create new data file” .
2.13 [if textual data have been selected] Click on
“Create new text file” .
A message box giving the numbers of the different types of variables is displayed.
2.14 Click on “Create a first parameter file”.
The window “Creating a starting parameter file” appears.
[Reminder: In DtmVic, the phrases “Parameter
file” and “Command file” are equivalent].
A DtmVic parameter file (or: command file) is displayed in the lower window.
The command file is automatically saved under the name:
“param_start.txt”.
The command file does not include any statistical analysis command, except
basic counts of categories, together with a computation of extreme and average
values for the purely numerical variables.
It is only meant here as a check of the importation of the data.
Comments about the “first command file”
After an identification of the two input files, three
“steps” of DtmVic are involved:
The step “ARDAT”
that archives data and dictionary.
The step “SELEC”
that selects the variables for the subsequent processing. In this
case, all the available variables are selected.
The step “STATS”
that computes the basic statistics mentioned above.
2.15 Click on “Execute”. Back in the
main menu window, the sequence of steps is displayed.
2.16 Click on the button: “Basic
numerical results” .
The button opens a created (and saved) html file named
“imp.html”
which contains the main results of the previous basic computation
steps. After perusing these numerical results, return
to the main menu. Note that this file is also saved under another
name: The name “imp.html”
is concatenated with the date and time of the analysis (continental
notation). That file keeps as an archive the main numerical results
whereas the file “imp.html”
is replaced for each new analysis performed in the same
directory. Likewise, a simple text format file
“imp.txt” is created and saved.
If the original Excel file contains textual variables (generally: responses
to open-ended questions), a DtmVic textual file is created (the name of which
was given during step 2.10). The step "VISURESP"
(in the panel opened by the button:
"Create a command file" of the main menu) allows you to check the
consistency of that textual file.
End of example D.1
Example D.2: EX_D02.Importation.Text.Free
Importation of textual data in “free format”.
Transforming a specific free format text file into DtmVic text files (type2) .
The DtmVic format for textual data (type 2) is described in Table 4 of the introduction
of this Tutorial D.
It contains two types of separators: separators of individuals (“----”)
and separators of questions (“++++”), located in columns
[1,2,3,4]. There is a constraint on the length of a line (200
characters) but, in principle, no constraint on the number of
lines for one question or for one individual. However, the number of
open questions should not exceed 12, the number of closed questions
should not exceed 1000, and the number of individuals is limited
to 22,500 in the present version of DtmVic.
Remark about DtmVic text file type 1:
Another separator (separator of texts : “****”) could
be used in the case of DtmVic text file type 1, exemplified by Table
3 of the introduction.
This kind of internal format can be easily built directly from the
original corpus of texts without using the importation procedure (see
the example: EX_A04.Text-Poems of Tutorial A).
No importation procedure is needed in that case.
To begin with, we will have a look at the original textual data to be
imported.
1- Looking at the data, preliminary steps
We will use the editor of the button
“Open an existing command file” of the main menu of
DtmVic as a simple text editor.
Click on the button
“Open an existing command file”
Search for the examples directory
DtmVic_Examples .
In that directory, open the directory of
example D.02 , named
EX_D02.Importation.Text.Free
Select the basic text file:
“TDA1_text_free.txt” .
(the responses are
those involved in application examples A.5, A.6, B.1, B.2).
The free format of that file is the following:
- Each line corresponds to an individual (a respondent)
(up to 100,000 characters, no linebreak or "end of line" allowed).
- The separator is the character #, which
serves to separate the identifier of a respondent from the first
response, and also to separate two consecutive responses.
We deal here with three open-ended questions, since we have three #
per line (a character # at the end of a line means an empty response
to the last open question).
Nature of the importation process
The importation process consists in building a DtmVic text file from the
original text file, by inserting the appropriate separators.
The DtmVic format is closer to the usual format of texts in everyday
life, and easier to consult and peruse. However, matching the
textual and numerical files is more easily carried out with the basic
textual format (one individual = one row).
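The insertion of separators can be sketched as follows. This is an illustrative reconstruction, not DtmVic's own conversion code; it assumes that each respondent occupies one line, that # separates the identifier from the successive responses, and that in the DtmVic type 2 file each individual begins with a “----” line carrying the identifier while “++++” separates the open questions (see Table 4 of the introduction):

```python
# Illustrative free-format -> DtmVic type 2 conversion.
# Assumptions (see Table 4): each individual begins with a "----" line
# carrying its identifier, and "++++" separates the open questions.
# This is not DtmVic's own conversion code.

def free_to_dtmvic(lines):
    out = []
    for line in lines:
        ident, *responses = line.rstrip("\n").split("#")
        out.append("---- " + ident.strip())   # separator of individuals
        for i, resp in enumerate(responses):
            if i > 0:
                out.append("++++")            # separator between open questions
            out.append(resp.strip())
    return out
```

A trailing # (empty response to the last open question) simply yields an empty line after the last “++++” separator.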
2 - Sequence of operations:
2.1 Click on the button
“Data Importation, Preprocessing, Data Capture, Exportation”
(Basic Steps from the main menu of DtmVic). A new window appears.
2.2 Choose the item : “Importing
Dictionary, Data and Texts”.
The window “Data Importation” is displayed.
2.3 Press the button: “Textual
data (free format)”.
The window “ Importation of a text file“ is displayed.
2.4 Click on : “Open text file”
You have to select the file “TDA1_text_free.txt”
in the directory: EX_D02.Importation.Text.Free.
2.5 Click on “Convert into DtmVic file”.
The first 100 lines of the new DtmVic text file are
displayed in the right hand side memo.
A cautious message is displayed: “Conversion apparently
completed”.
Two message boxes give successively the number of
individuals (1043) and the maximum length of the identifiers (10).
- The DtmVic text file is automatically saved under the name:
“DtmTextFile.txt”
2.6 Click then on the button :
“Create a first parameter file”.
The window “Creating a starting parameter file” appears.
[Reminder: In DtmVic, the phrases “Parameter
file” and “Command file” are equivalent].
2.7 Click on “Create a first parameter file”.
A DtmVic parameter file (or: command file) is displayed in the lower window.
The command file is automatically saved under the name:
“param_tex_start.txt”.
The command file does not include any statistical
analysis command, except basic counts of words for the first open
question (parameter NUMQ = 1 in the step SELOX).
It is only meant here as a check of the conversion of the data.
Optional comments about the “first parameter file”
After an identification of the two input files, three “steps” of DtmVic
are involved:
The step “ARTEX” that archives the three sets of responses
to the three open questions. The step “SELOX” that selects the open
questions for the subsequent processing. In this case, by default, the first
question is selected. The step “NUMER” that performs the numerical
coding of the selected text.
The right hand side memo indicates how to run that parameter file.
2.8 Click on “Execute”.
2.9 Click on the button: “Basic
numerical results” .
The button opens a created (and saved) html file named
“imp.html”
which contains the main results of the previous basic computation
steps. After perusing these numerical results, return
to the main menu. Note that this file is also saved under another
name: The name “imp.html”
is concatenated with the date and time of the analysis (continental
notation). That file keeps as an archive the main numerical results
whereas the file “imp.html”
is replaced for each new analysis performed in the same
directory. Likewise, a simple text format file
“imp.txt” is created and saved.
End of Example D.2
Example D.3: EX_D03.Importation.Text.num.XML
Importation of numerical and Textual data in “XML format”.
(updated: January 4th, 2011)
A specific XML format allows for dealing with both numerical data
and textual data in a unique file. Such a format can be generated by some online
questionnaires in the framework of MySQL databases.
The DtmVic format for textual data is described in Table 4 of the introduction
of this Tutorial D.
It contains two types of separators: separators of individuals (“----”)
and separators of questions (“++++”), located in columns
[1,2,3,4]. There is a constraint on the length of a line (200
characters) but, in principle, no constraint on the number of
lines for one question or for one individual. However, the number of
open questions should not exceed 12, the number of closed questions
should not exceed 1000, and the number of individuals is limited
to 22,500 in this version of DtmVic.
To begin with, we will have a look at the original XML data file to be
imported.
1- Looking at the data, preliminary steps
We will use the editor of the button
“Open an existing command file” from the main menu of
DtmVic as a simple text editor.
Click on the button
“Open an existing command file”
Search for the examples directory
DtmVic_Examples
In it, open the directory of Example D.03, named:
EX_D03.Importation.Text.num.XML
Select the unique file: “TDA2__dtm.xml”.
(The data are the same as those of example A.5 of Tutorial A :
instead of three files – dictionary, numerical data, textual data - we have now only one
(rather large) file).
The structure is schematised below:
<FileName.xml>
<individual>
<id> identifier1 </id>
<question1> response1 </question1>
<question2> response2 </question2>
....................................................
<open>
<Open_quest_1> free response 1 </Open_quest_1>
<Open_quest_2> free response 2 </Open_quest_2>
....................................................
</open>
</individual>
<individual>
<id> identifier2 </id>
<question1> response2.1 </question1>
<question2> response2.2 </question2>
....................................................
<open>
<Open_quest_1> free response 2.1 </Open_quest_1>
<Open_quest_2> free response 2.2 </Open_quest_2>
....................................................
</open>
</individual>
....................................................
....................................................
<individual>
......................................etc.
</individual>
</FileName.xml>
All the tags can be chosen by the user, except the tag <individual>,
which indicates an end of record. However, that keyword “individual”
is only a default value; it can be changed during the importation process.
For an individual, a missing tag means “no response”. It is advisable, but not
necessary, to put the tags in the same order.
The first tag after <individual>
must be the identifier of the individual (tag <id> in the
example, but any other name is acceptable).
Comments complying with XML syntax are possible anywhere in the file.
Note that this simple format is directly obtained by saving the MySQL databases derived
from “on line surveys” as a simple XML file (without
attributes, the tags being nested as shown above).
The drawbacks are the following:
- An obvious drawback of the XML structure is the size of the file, owing to
the presence of opening and closing tags for each variable and individual.
- Another problem is the presence of XML-dedicated symbols such
as: &, <, >, ‘, “. The dictionary must not contain such
characters.
The advantages are the following:
- A unique file replaces three files (dictionary, data and text).
- The order of variables can change from one individual to another.
- The order of individuals no longer matters, since it is not necessary to match
the data file and the text file.
- The length of individual records can vary, since the absence of a tag means “no
response” to the corresponding variable.
The importation process consists in building the three internal DtmVic files (dictionary
file, data file and, possibly, text file) from the original XML file.
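With the standard library, the splitting of such an XML file can be sketched as follows. The tag names mirror the schema above; this is only an illustration of the principle, not DtmVic's internal code:

```python
# Illustrative splitting of the XML file (standard-library ElementTree).
# Tag names mirror the schema above; this is not DtmVic's internal code.
import xml.etree.ElementTree as ET

def split_xml(xml_text, open_tag="open"):
    root = ET.fromstring(xml_text)
    rows, texts = [], []
    for indiv in root.findall("individual"):
        row, open_resp = {}, {}
        for child in indiv:
            if child.tag == open_tag:   # block of open (textual) responses
                open_resp = {q.tag: (q.text or "").strip() for q in child}
            else:                       # closed (numerical/categorical) responses
                row[child.tag] = (child.text or "").strip()
        rows.append(row)
        texts.append(open_resp)
    return rows, texts
```

Because each record is keyed by its tags, the order of the variables and of the individuals does not matter, and a missing tag simply yields a missing key (“no response”), as noted among the advantages above.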
2 - Sequence of operations:
2.1 Choose the item:
“Data Importation, Preprocessing, Data Capture, Exportation”
(Basic Steps from the main menu of DtmVic). A new window appears.
2.2 Select the item:
“Importing Dictionary, Data and Texts”.
The window “Data Importation” is displayed.
2.3 Press the button: “XML
specific file”.
The window “Find and select the tags, Import XML data file“ is
displayed.
2.4 If the tag separating the individuals in your XML file is not the
keyword “individual”,
type your own tag in the first small white window.
Press “enter”
to register the new tag.
2.5 As explained in the pop-up purple window, two thresholds are necessary.
Threshold1 is the minimal number of respondents to an
open question (default value: 40).
That default value means that if the sample size is
1000, we tolerate 960 non-responses for some open questions. The
question is discarded if the number of responses is less than
Threshold1.
Threshold2 is the minimal length of the lengthiest response to an open
question (default value: 60).
That default value means that if all the responses to a question
have fewer than 60 characters, the question will not be selected as
an open question and will be discarded.
Remember that in this version of DtmVic, the number of open questions should
not exceed 12, whereas the number of closed questions should not exceed 1000.
If you wish to change the previous default values Threshold1 and Threshold2,
enter the new values in the two windows below (don’t forget to press the
“enter”
button afterwards).
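The two-threshold selection rule described in step 2.5 can be sketched as follows (the default values are taken from the text; the function name is illustrative):

```python
# Sketch of the two-threshold rule for keeping an open question.
# Defaults taken from the text; the function name is illustrative.

def keep_open_question(responses, threshold1=40, threshold2=60):
    """responses: one answer string per individual ("" = non-response)."""
    answered = [r for r in responses if r.strip()]
    if len(answered) < threshold1:
        return False                      # too few respondents (Threshold1)
    longest = max((len(r) for r in answered), default=0)
    return longest >= threshold2          # lengthiest answer long enough (Threshold2)
```

For instance, a question answered by only 10 individuals, or one whose longest answer is shorter than 60 characters, would be discarded under the default thresholds.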
2.6 Click on: “List
of tags and content”
You have to select the file
“TDA2__dtm.xml”
in the directory: EX_D03.Importation.Text.num.XML.
Some messages are produced, describing the different steps of the process.
2.7 If the XML file contains responses to open-ended questions:
Click on
“Create the textual data file to be imported” .
2.8 Still in the case where the XML file contains responses to
open-ended questions:
Click on: “import as an internal DTM file”
The first 100 lines of the new DtmVic text file are
displayed in the right hand side memo.
A message box gives the number of individuals.
- The DtmVic text file is automatically saved under the name:
”Dtm_final_text_TDA2__dtm.xml.txt”
- The DtmVic data file and the DtmVic dictionary file remain to be
imported from the created csv file:
“ Dtm_import_num_TDA2__dtm.xml.txt”
(importation as an Excel file).
- An intermediate file,
“Dtm_import_text_TDA2__dtm.xml.txt”
is also created, as a mere check. It is the text file in importation
format. In fact, the importation of text has been completed and the final text has already
been provided ( ”Dtm_final_text_TDA2__dtm.xml.txt” )
2.9 About the control files
Six control files are created to check the different steps of the process.
- The (huge) file:
“Check1_data_TDA2__dtm.xml.txt”
contains a list of all the tags encountered for all the individuals.
- The file:
“Check2_Tags_TDA2__dtm.xml.txt”
contains a list of all the encountered tags, with the parameters
characterizing these tags (frequency, mean rank, average length of
the content, minimum length, maximum length). In this case, all the
tags are present and have the same position.
- The file “Check3_Dict_TDA2__dtm.xml.txt”
contains all the tags sorted according to their average rank (in this
case the same rank for each individual), the tags corresponding to textual responses,
the tags corresponding to numerical responses.
- The file
“Check4_Textual_TDA2__dtm.xml.txt”
contains all the encountered responses to open questions.
- The file
“Check5_final_text_TDA2__dtm.xml.txt”
complements the previous one.
- The file
“Check6_import_text_TDA2__dtm.xml.txt”
contains the open questions in the free text format.
These six control files can be deleted after the whole
process has been checked.
As a conclusion:
The DtmVic text file has been created
(“Dtm_final_text_TDA2__dtm.xml.txt”).
The dictionary file and the data file have been
converted from the XML file into a unique csv file.
They remain to be imported through a standard Excel
importation process, using the created file:
“Dtm_import_num_TDA2__dtm.xml.txt”
as an input (see example D.1). You may replace the extension ".txt" with the
extension ".csv" to obtain an Excel file for the numerical data.
End of Example D.3
End of tutorial D