Example A.4: EX_A04.Text-poems (VISUTEX)

(Textual Data Analysis: simple series of texts)



This elementary example deals with the simplest form of text analysis. The data set comprises a series of texts separated by the separator **** (columns 1 to 4). The example data set, “Sonnet_LowerCase.txt”, contains the first 20 of Shakespeare's sonnets. For a larger set of sonnets and for comments, see, among many other websites, www.shakespeare-online.com/sonnets/ .

In this simple format, DtmVic can process up to 1000 texts, with no limit on the size of each text. Our example corpus is thus a “small scale model”, emphasizing only the functionalities (but not the power) of DtmVic. The conversion to lower case characters is meant to avoid treating the capitalised first word of each verse or sentence as a distinct word.

The general methodology underlying the processing is presented in the book: “Exploring Textual Data” (L. Lebart, A. Salem, L. Berry; Kluwer Academic Publishers, 1998). That textbook is an upgraded translation of the book: “Statistique Textuelle” (Ludovic Lebart and André Salem, Dunod, Paris, 1994). The latter book (in French) can be freely downloaded from the site: www.dtmvic.com (section “publication”).



1) Looking at the text file

Search for the examples directory: “DtmVic_Examples_A_Start”.

In that directory, open the directory of example A.4, named: “EX_A04.Text-poems”.

As mentioned in the previous examples, it is recommended to use one directory for each application, since DtmVic produces a lot of intermediate “txt” files related to the application. At the outset, this directory must contain at least one text file:

“Sonnet_LowerCase.txt”. Look at this text file using a text editor such as Notepad, Notepad++, TotalEdit, Ultraedit, TextEdit, or simply the text editor within DtmVic: button "Open an existing command file" of the main menu.

– The format is specific (DtmVic internal text format type 1. See the tutorial D: “Importation”).
– Since the texts may have very different lengths, separators **** (at the beginning of a line) are used to distinguish between texts.
– The length of the lines is limited to 200 characters (instead of 80, in the previous version of DtmVic).
– The identifiers of texts must follow the separator “****” after 4 blank spaces.
– The symbol “====” indicates the end of the file.
– Like all the data files used as input by DtmVic, that file is a raw text file (.txt). If the text comes from a word processor, it must first be saved as a “.txt” file.
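
The format rules listed above can be illustrated with a short Python sketch. It is not part of DtmVic: the function name `read_dtmvic_texts`, the sample identifiers and the choice to skip blank lines are assumptions made for illustration only.

```python
def read_dtmvic_texts(lines, max_len=200):
    """Split lines in the '****' format into {identifier: [text lines]}."""
    texts = {}
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("===="):        # end-of-file marker
            break
        if line.startswith("****"):        # separator opening a new text
            current = line[4:].strip()     # identifier follows the blanks
            texts[current] = []
            continue
        if len(line) > max_len:
            raise ValueError(f"line longer than {max_len} characters")
        if current is not None and line.strip():   # assumption: skip blank lines
            texts[current].append(line)
    return texts


# Hypothetical miniature file, not the content of Sonnet_LowerCase.txt.
sample = [
    "****    text_01",
    "first line of the first text",
    "second line of the first text",
    "****    text_02",
    "only line of the second text",
    "====",
    "anything after the end marker is ignored",
]
texts = read_dtmvic_texts(sample)
print({name: len(body) for name, body in texts.items()})
```

To try it on a real file, the same function can be given an open file object, for example `read_dtmvic_texts(open("Sonnet_LowerCase.txt", encoding="utf-8"))`; the encoding is also an assumption.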



2) Generation of a command file (or: “parameter file”)


Open DtmVic.


2.1) Click the button: “Create a command file” of the main menu: Basic Steps, line “Command File”

A window “Choosing among some basic analyses” appears.


2.2) Click then the button: “VISUTEX – Visualization of Texts”

This button is located in the paragraph “Textual and numerical data”.


2.3) Press the button: “Open the text file”, then search for the directory: “DtmVic_Examples_A_Start”. In that directory, open the directory of example A.4, named: “EX_A04.Text-poems”.

Open the file: “Sonnet_LowerCase.txt” .

A message box then indicates that the corpus contains 20 texts totalling 321 lines.


2.4) Click: “Select open questions and separators”

The next window allows for the selection of open questions (not relevant here) and the selection of separators of words (the produced default separators suffice in this example).


2.5) Click directly “Vocabulary”.

The next window presents the vocabulary (in alphabetic and frequency orders). We must select a frequency threshold by selecting a line in the right-hand memo (frequency order). Line number 113 corresponds to the frequency 4. (This is a very small frequency, suited to a very small corpus. This example is just an opportunity to explore the sequence of commands, without meaningful linguistic interpretation.)

After selecting that line, click on: “Confirm”.
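
To make the idea of a vocabulary in frequency order and of a frequency threshold concrete, here is a minimal Python sketch. It does not reproduce DtmVic's own tokenisation: the separator list, the function names and the three sample lines are hypothetical.

```python
from collections import Counter


def count_words(lines, separators=".,;:!?\"()"):
    """Count word occurrences after replacing separators by blanks."""
    table = str.maketrans({ch: " " for ch in separators})
    counts = Counter()
    for line in lines:
        counts.update(line.translate(table).split())
    return counts


def frequency_order(counts):
    """Vocabulary sorted by decreasing frequency, then alphabetically."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def keep_frequent(counts, min_freq):
    """Words whose frequency is at least min_freq, in frequency order."""
    return [word for word, freq in frequency_order(counts) if freq >= min_freq]


# Hypothetical lines, used only to exercise the functions.
lines = ["the rose and the thorn", "the thorn, and the rose;", "a rose is a rose"]
counts = count_words(lines)
for word, freq in frequency_order(counts):
    print(f"{freq:3d}  {word}")
print(keep_frequent(counts, min_freq=4))
```

Selecting a line in the frequency-ordered list amounts to choosing `min_freq`: every word at or above that line is kept.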


2.6) Then click on: “Continue. (Create the parameter file)”

Continuing our visit, we have to “select some options” . Click “yes” for the bootstrap validation, and “enter” to confirm the default number of replicates (25). Click then on "Continue".


2.7) Then click “Create a first parameter file”

A parameter file is displayed in the memo. [Advanced users can edit it. It also allows the same analysis to be performed again later, if needed.]

Important : The parameter file is saved as “Param_VISUTEX.txt” in the current directory.

If you wish, you can now exit DtmVic and, later on, use the button “Open an existing command file” of the main menu (line “Command file”) to open “Param_VISUTEX.txt” directly, which brings you back to this point of the process. You can then use the “Execute” command of the main menu.


2.8) Click on: “Execute”.

This step will run the basic computation steps present in the command file: archiving data and text, characteristic words and responses, correspondence analysis of the lexical table.




3) Basic numerical results

Click on the “Basic numerical results” button.

The button opens an html file named “imp.html”, created and saved by DtmVic, which contains the main results of the previous basic computation steps. After perusing these numerical results, return to the main menu. Note that this file is also saved under another name: “imp” followed by the date and time of the analysis (continental notation). For example, “imp_08.07.09_14.45.html” means July 8th, 2009, at 2:45 p.m. That file keeps the main numerical results as an archive, whereas the file “imp.html” is replaced at each new analysis performed in the same directory.

This file is also saved in a simple text format, under the name “imp.txt”, and likewise under a name including the date and time of execution.
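
The naming convention of the archive files can be sketched in Python as follows. The pattern day.month.year_hour.minute is inferred from the single example given above, and the helper names `archive_name` and `parse_archive_name` are hypothetical.

```python
from datetime import datetime

# Assumed pattern: imp_DD.MM.YY_HH.MM, as in "imp_08.07.09_14.45.html".
STAMP = "imp_%d.%m.%y_%H.%M"


def archive_name(moment, ext="html"):
    """Build the archive file name for a given date and time."""
    return f"{moment.strftime(STAMP)}.{ext}"


def parse_archive_name(name):
    """Recover the date and time encoded in an archive file name."""
    stem = name.rsplit(".", 1)[0]          # drop the extension
    return datetime.strptime(stem, STAMP)


print(archive_name(datetime(2009, 7, 8, 14, 45)))
print(parse_archive_name("imp_08.07.09_14.45.txt"))
```

Such a parser is handy for sorting the archived listings of one directory by execution time.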


From the step NUMER, we learn for instance that we have 280 responses (lines), with a total number of words (occurrences, or tokens) of 2321, involving 830 distinct words (or types). Using a frequency threshold of 3 (here this means keeping the words with a frequency greater than three), the total number of kept words reduces to 1384, whereas the number of distinct kept words reduces to 114. (Note a provisional notational difference: the minimal selected frequency 4 corresponds to the frequency 3 in the listing, meaning equivalently that all the words appearing more than three times are kept.)
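
The counts reported by the step NUMER, and the equivalence between “minimal frequency 4” and “threshold 3”, can be illustrated by the following sketch. The function name `summarize` and the word counts are hypothetical; they are not the figures of the sonnet corpus.

```python
from collections import Counter


def summarize(counts, threshold):
    """Tokens and types before and after keeping words with frequency > threshold."""
    kept = {word: freq for word, freq in counts.items() if freq > threshold}
    return {
        "tokens": sum(counts.values()),      # total number of occurrences
        "types": len(counts),                # number of distinct words
        "kept_tokens": sum(kept.values()),
        "kept_types": len(kept),
    }


# Hypothetical frequencies, for illustration only.
counts = Counter({"the": 9, "and": 6, "thy": 4, "beauty": 3, "rose": 1})
summary = summarize(counts, threshold=3)
print(summary)

# "Frequency greater than 3" and "frequency at least 4" select the same words.
same = {w for w, f in counts.items() if f > 3} == {w for w, f in counts.items() if f >= 4}
print(same)
```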
Return.




4) Steps VIC (Visualization, Inference, Classification)


4.1) Click the button: “AxeView”

... and follow the sub-menus. In fact, only two tabs are relevant for this example: “Active variables” [ = poems] and “observations” [words]. After clicking on “View” , the user obtains the set of principal coordinates along each axis.

Clicking on a column header ranks all the rows according to the values of that column.

As mentioned in the previous examples, the use of the AxeView menu is justified when the data set is large, which is not the case here.
Return.


4.2) Click the button: “PlaneView Research” , and follow the sub-menus...

In this example, only one item of the menu is relevant: “Active columns + Rows”. This item concerns both rows and columns of the contingency table (lexical table). The graphical displays of selected pairs of axes are then produced. Normally, the active categories (columns of the lexical table) are printed in red, while the active words (rows) are printed in blue.

The roles of the different buttons are straightforward, except perhaps the button “Rank”, which is useful only for very intricate displays (not the case here; see the comments in the previous examples).
Return.


4.3) Click the button: “BootstrapView”

This button opens the “DtmVic: Bootstrap - Validation - Stability – Inference” window.


4.3.1 Click on: “LoadData” . In this case (partial bootstrap), the replicated coordinates file to be opened is named “ngus_par_boot1.txt” . (The set of possible files is given by a background panel).


4.3.2 Click on: “Confidence Areas” submenu, and choose the pair of axes to be displayed (select axes 1 and 2 to begin with).


4.3.3 The window that appears (enlarge it if necessary) contains the list of identifiers of active rows and columns (identifiers of columns [Sonnets in this case] are at the end of the list). Tick some white boxes to select some poems, select also some words, and press the button “Select” .


4.3.4 Click on: “Confidence Ellipses” to obtain the graphical display of the chosen column points in red, and of the row points (here: words) in blue. We can see that many sonnets occupy significant locations (several confidence ellipses do not overlap), whereas the locations of the words are far less accurate.


4.3.5 Close the display window, and, again in the blue window, press “Convex hulls” . The ellipses are now replaced by the convex hulls of the replicates for each point. The convex hulls take into account the peripheral points, whereas the ellipses are drawn using the density of the clouds of replicates. The two pieces of information are complementary.
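
The difference between the two displays can be sketched in Python for one point and its bootstrap replicates. This is only an illustration under stated assumptions: the hull uses the standard monotone chain algorithm, the ellipse is the usual covariance ellipse scaled by the 95% chi-square quantile with 2 degrees of freedom (5.991), and nothing here is claimed to match DtmVic's internal computations. The replicate coordinates are hypothetical.

```python
import math


def convex_hull(points):
    """Convex hull (monotone chain), vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def confidence_ellipse(points, chi2=5.991):
    """Centre, semi-axes and orientation of the covariance ellipse."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    half_trace = (sxx + syy) / 2
    root = math.hypot((sxx - syy) / 2, sxy)
    lam1, lam2 = half_trace + root, max(half_trace - root, 0.0)  # eigenvalues
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)                 # major axis
    return (mx, my), math.sqrt(chi2 * lam1), math.sqrt(chi2 * lam2), angle


# Hypothetical replicates of a single point on axes 1 and 2.
replicates = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1), (1, 0.5)]
hull = convex_hull(replicates)
centre, major, minor, angle = confidence_ellipse(replicates)
print(hull)
print(centre, round(major, 3), round(minor, 3))
```

The hull is determined only by the outermost replicates, while the ellipse depends on the mean and covariance of all of them, which is why the two displays complement each other.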



4.4) Click “ClusterView” (in this case, the clusters are the texts themselves)


4.4.1 Choose the axes (1 and 2 to begin with), and “Continue” .


4.4.2 Click on “View”. The locations of the 20 categories (texts) appear on the first principal plane. Up to a possible change of sign of the axes, the display is the same as that provided by the “PlaneView Research” procedure.


4.4.3 Activate the button “Words” , and, pointing with the mouse on a specific category, press the right button of the mouse. A description of the category involving the most characteristic words of the category appears. This description is again redundant with that of the Step MOCAR (see files “imp.txt” or “imp.html” using the button “Basic numerical results” ). But we can appreciate here the pattern of categories and their relative locations.


4.4.4 Activate the button “Texts”. Pointing with the mouse on a specific category, and pressing the right button of the mouse, we can read the most characteristic lines (verses) of the selected category. The concept of characteristic line is not obviously relevant in the case of poetry. It is in fact a particular case of the concept of “characteristic responses”, extremely useful in the case of open questions.


More explanation about the corresponding methodology can be found in the already quoted book: “Exploring Textual Data” (L. Lebart, A. Salem, L. Berry; Kluwer Academic Publishers, 1998).
Return.



4.5) Click “Kohonen map”


4.5.1 Select: “variables + observations (rows + columns)” : these active variables are the words and the texts (poems) in this example.


4.5.2 Select a (5 x 5) map, and “continue” .


4.5.3 Press “draw” on the menu of the large green window entitled “Kohonen map”.


4.5.4 You can change the font size (“Font”) and dilate the obtained Kohonen map (“Dilat.”) to make it more legible. The words appearing in the same cell are often associated in the same responses. This property holds, to a lesser degree, for contiguous cells.


4.5.5 Note that we have obtained a simultaneous Kohonen representation of rows and columns, owing to the use, as an input file, of the coordinates from the correspondence analysis of the lexical table.
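
As a rough illustration of how such a map can be built from coordinates, here is a minimal self-organizing map in Python. It is a generic textbook variant (online updates, Gaussian neighbourhood, linearly decaying learning rate and radius), not DtmVic's implementation; the function names, the training settings and the coordinates of the six items are all hypothetical.

```python
import math
import random


def best_cell(weights, x):
    """Grid cell whose prototype vector is closest to x."""
    return min(weights,
               key=lambda cell: sum((w - v) ** 2 for w, v in zip(weights[cell], x)))


def train_som(data, rows=5, cols=5, epochs=200, seed=0):
    """Train a rows x cols map; returns {(row, col): prototype vector}."""
    rng = random.Random(seed)
    weights = {(r, c): list(rng.choice(data))
               for r in range(rows) for c in range(cols)}
    steps = epochs * len(data)
    for t in range(steps):
        remaining = 1.0 - t / steps
        rate = 0.5 * remaining                              # learning rate decays
        radius = 0.5 + 0.5 * max(rows, cols) * remaining    # neighbourhood shrinks
        x = rng.choice(data)
        br, bc = best_cell(weights, x)
        for (r, c), w in weights.items():
            h = math.exp(-((r - br) ** 2 + (c - bc) ** 2) / (2 * radius ** 2))
            for k in range(len(w)):
                w[k] += rate * h * (x[k] - w[k])
    return weights


# Hypothetical coordinates of texts and words on two axes.
coords = {
    "poem_01": (-1.0, 0.2), "poem_02": (-0.9, 0.1), "poem_03": (1.1, -0.3),
    "word_a": (-1.0, 0.15), "word_b": (1.0, -0.25), "word_c": (0.1, 0.9),
}
weights = train_som(list(coords.values()))
cells = {}
for name, xy in coords.items():
    cells.setdefault(best_cell(weights, xy), []).append(name)
for cell in sorted(cells):
    print(cell, cells[cell])
```

Because rows and columns share the same coordinate space, texts and words can be assigned to cells of the same map, which is what makes the simultaneous representation possible.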


4.6) Click “Seriation”

Seriation techniques, as well as Block Seriation techniques, are widely used by practitioners. Seriation is based on simple row and column permutations of the table under study. These permutations have the great practical and cognitive advantage of showing the raw data to the user, who can therefore forego intricate interpretation rules. They can display homogeneous blocks of high values or, on the contrary, of small or null values. They can also pinpoint a continuous and progressive evolution of profiles.
An optimality property of correspondence analysis is the following: the first axis of a correspondence analysis provides a ranking of the row-points and of the column-points. That ranking can be used to sort the rows and columns of the analysed data table. The resulting table has then undergone an optimal seriation. Seriation will be applied here to the lexical table cross-tabulating the 20 sonnets and the selected words (words appearing at least 4 times in the corpus).


A new window named “Reordering” appears.


Click on the button: “Reordering the rows and the column of a word-text table”.

The reordered table cross-tabulating the 20 sonnets and the selected words is then displayed. It can be seen that the first words of the reordered list of words characterize (sometimes exclusively) the first sonnets in the reordered list of sonnets. The last words of the same list are either absent or rarely observed among these sonnets. However, they are frequent among the last sonnets (right hand side of the table). That reordered printing of the raw data is a useful tool of communication with the practitioners, since it can be interpreted without prior knowledge of data analysis techniques.
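
The principle of reordering a table by the first axis of a correspondence analysis can be sketched in plain Python. The first axis is obtained here by power iteration on the matrix of standardized residuals, which is one standard way to compute it but not necessarily the one used by DtmVic. The sketch assumes that no row or column is empty and that the table is not exactly independent; the function names and the small word-text table are hypothetical.

```python
import math


def first_ca_axis(table, iterations=200):
    """Row and column scores on the first correspondence analysis axis."""
    n = sum(map(sum, table))
    r = [sum(row) / n for row in table]              # row masses
    c = [sum(col) / n for col in zip(*table)]        # column masses
    rows, cols = range(len(r)), range(len(c))
    # Standardized residuals from the independence model.
    s = [[(table[i][j] / n - r[i] * c[j]) / math.sqrt(r[i] * c[j]) for j in cols]
         for i in rows]
    v = [float(j + 1) for j in cols]                 # deterministic start
    for _ in range(iterations):
        u = [sum(s[i][j] * v[j] for j in cols) for i in rows]
        v = [sum(s[i][j] * u[i] for i in rows) for j in cols]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    u = [sum(s[i][j] * v[j] for j in cols) for i in rows]
    return ([u[i] / math.sqrt(r[i]) for i in rows],
            [v[j] / math.sqrt(c[j]) for j in cols])


def seriate(table, row_labels, col_labels):
    """Sort rows and columns by their scores on the first axis."""
    row_scores, col_scores = first_ca_axis(table)
    row_order = sorted(range(len(row_labels)), key=row_scores.__getitem__)
    col_order = sorted(range(len(col_labels)), key=col_scores.__getitem__)
    reordered = [[table[i][j] for j in col_order] for i in row_order]
    return ([row_labels[i] for i in row_order],
            [col_labels[j] for j in col_order],
            reordered)


# Hypothetical word-text counts with shuffled rows and columns.
words = ["w2", "w0", "w3", "w1"]
texts = ["t1", "t3", "t0", "t2"]
table = [[2, 2, 0, 4],
         [2, 0, 4, 0],
         [0, 4, 0, 2],
         [4, 0, 2, 1]]
new_words, new_texts, reordered = seriate(table, words, texts)
print(new_texts)
for word, row in zip(new_words, reordered):
    print(word, row)
```

After reordering, the high counts gather along the diagonal of the printed table. The direction of the axis is arbitrary, so the whole ordering may come out reversed.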


General remark.

As you can observe when looking at the content of the example’s directory, several files have been created and saved [these files are briefly described in the memo “Help about files” in the toolbar of the main menu]. If you need to use the buttons of the paragraph VIC of the main menu again after having closed DtmVic, just click on the button “Open an existing command file” from the line “command file”, select and open the saved command file “Param_VISUTEX.txt”, and close it. It is not necessary to click on “execute” again. You can then continue your investigation (axes views, graphs, maps, etc.).

Advanced users can also edit the parameter file “Param_VISUTEX.txt” (using the memo “Help about parameters” in the toolbar of the main menu) to perform a new analysis in which the parameters are given new values. It is advised to give the edited file a new name (such as “Param_VISUTEX2.txt”, for example). All the intermediate files will be replaced (except the files “imp_date_time.txt” and “imp_date_time.html”, which are the only saved archives).




End of example A.4 (VISUTEX)