Internal DtmVic format for input data and texts



The aim of the importation procedures (see: section "DataFile" of the main menu) is to transform a pre-existing data file into the “Internal DtmVic format”.
The internal DtmVic format is a transparent text (.txt) format, readable with any text editor. Knowing this format can be useful, but it is not indispensable for beginners: the tutorial is based on examples whose data are already in DtmVic format.


Recall that DtmVic is software devoted to the exploratory analysis of multivariate numerical and textual data. The leading case, which exemplifies all the possibilities of the software, is a sample survey data set comprising responses to both closed and open-ended questions (closed questions may lead to numerical [quantitative] or categorical [qualitative] data).


In the most general configuration, three files constitute the internal DtmVic input data set:


1) The dictionary file, which provides the names (or identifiers) of the numerical and categorical variables (fewer than 1,200 variables). It includes the names of the categories of each categorical variable. This feature is rather uncommon in statistical software, but seems indispensable for exploring high-dimensional categorical data sets.


2) The data file, which contains the values of these variables for a set of individuals (or observations), together with the identifiers of the individuals (fewer than 45,000 individuals).


3) The text file, made up (for example) of the responses to open-ended questions. This text file (known as text file type 2) concerns the same respondents as the data file, in the same order. A simplified text file format (text file type 1) can be used when dealing only with a series of texts, without an associated data file or dictionary file.


Some applications may involve only the text file (see for instance the example A4 of Tutorial A), whereas others may need only the dictionary and the data files (application examples A1, A2, A3, C1, C2, of Tutorials A and C).



Internal “DtmVic format”


The format is specific but not proprietary: the three types of files are in plain text format (extension “.txt”), readable with a notepad or a text editor, or with a word processor provided that they are saved as plain text files. As an introductory exercise, they can be typed directly from the keyboard, or entered with the help of the menu “DataCapture” (see preliminary example D.0 below).


In most cases, however, they have to be imported from (often large) pre-existing files. The transformation into DtmVic format is then transparent to the user.


Table 1 shows an example of a small DtmVic dictionary involving four variables. Table 2 displays an example of a DtmVic data file (same four variables, three individuals or respondents). Table 3 presents a text file relating to three poems (DtmVic text file type 1), and Table 4 a text file relating to three open-ended questions and three respondents (DtmVic text file type 2).



Table 1: Example of an internal DtmVic dictionary for 4 variables

Gender (2 categories); Age (0 categories = numerical variable); Age broken down into 4 categories; Educational level (3 categories). [Fixed format; explanatory comments in parentheses.]



   2 GENDER       (number of categories [2] in columns 1-4; blank; title of the variable)
MALE MALE         (short identifier [columns 1-4]; blank; identifier [< 20 characters])
FEMA FEMALE       (short identifier [columns 1-4]; blank; identifier [< 20 characters])
   0 AGE          (number of categories [0] in columns 1-4; blank; numerical variable)
   4 AGE_CODE     (number of categories [4] in columns 1-4; blank; title of the variable)
AGE1 18_24        (short identifier [columns 1-4]; blank; identifier [< 20 characters])
AGE2 25_39        (short identifier [columns 1-4]; blank; identifier [< 20 characters])
AGE3 40_59        (short identifier [columns 1-4]; blank; identifier [< 20 characters])
AGE4 >60          (short identifier [columns 1-4]; blank; identifier [< 20 characters])
   3 EDUCATION    (number of categories [3] in columns 1-4; blank; title of the variable)
EDUL LOW          (short identifier [columns 1-4]; blank; identifier [< 20 characters])
EDUM MEDIUM       (short identifier [columns 1-4]; blank; identifier [< 20 characters])
EDUH HIGH         (short identifier [columns 1-4]; blank; identifier [< 20 characters])
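
The fixed layout of Table 1 can be read with a few lines of code. The following Python sketch is only an illustration, not part of DtmVic: the names Variable and parse_dictionary are hypothetical, and it assumes that the actual file contains no explanatory comments (the parenthesized remarks in Table 1 are annotations for the reader).

```python
from dataclasses import dataclass, field


@dataclass
class Variable:
    name: str
    # (short identifier, identifier) pairs; empty for a numerical variable
    categories: list = field(default_factory=list)

    @property
    def is_numerical(self):
        return not self.categories


def parse_dictionary(text):
    """Parse dictionary text laid out as in Table 1, without the comments."""
    lines = [line for line in text.splitlines() if line.strip()]
    variables = []
    i = 0
    while i < len(lines):
        header = lines[i]
        n_categories = int(header[:4])   # columns 1-4: number of categories
        name = header[5:].strip()        # after the blank: title of the variable
        i += 1
        categories = []
        for _ in range(n_categories):
            line = lines[i]
            categories.append((line[:4].strip(), line[5:].strip()))
            i += 1
        variables.append(Variable(name, categories))
    return variables


SAMPLE = """\
   2 GENDER
MALE MALE
FEMA FEMALE
   0 AGE
   4 AGE_CODE
AGE1 18_24
AGE2 25_39
AGE3 40_59
AGE4 >60
   3 EDUCATION
EDUL LOW
EDUM MEDIUM
EDUH HIGH
"""

variables = parse_dictionary(SAMPLE)
for v in variables:
    kind = "numerical" if v.is_numerical else f"{len(v.categories)} categories"
    print(v.name, "->", kind)
```

A variable declared with 0 categories ends up with an empty category list, which is how the sketch distinguishes numerical from categorical variables.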


Table 2: Example of an internal DtmVic data file for the previous 4 variables:


Gender, Age, Age broken down into 4 categories, Educational level. 3 respondents (individuals, observations).


  '1006'   1   76  12   1     (Identifiers of the respondents : between quotes,  
  '1007'   2  20   2   2     without blank, less than 20 characters. Separators   
  '1008'   2  29   3   2      between values: at least one blank space)  
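
As an illustration of the layout of Table 2, the Python sketch below splits each line into a quoted identifier followed by blank-separated values. The function name parse_data_line is hypothetical, and the sketch assumes that the file contains only the data rows (the parenthesized remarks in Table 2 are annotations for the reader).

```python
import shlex


def parse_data_line(line):
    """Split one data line into (identifier, list of numerical values)."""
    # shlex.split removes the quotes around the identifier and treats
    # any run of blanks as a single separator.
    fields = shlex.split(line)
    identifier, raw_values = fields[0], fields[1:]
    return identifier, [float(value) for value in raw_values]


SAMPLE = """\
  '1006'   1   76  12   1
  '1007'   2  20   2   2
  '1008'   2  29   3   2
"""

rows = [parse_data_line(line) for line in SAMPLE.splitlines() if line.strip()]
for identifier, values in rows:
    print(identifier, values)
```

All values are read as floating-point numbers here; which columns are categorical codes and which are numerical measurements is given by the dictionary file, not by the data file itself.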


Important: categorical variables are coded as consecutive integers starting at 1. For example, the variable Gender with 2 categories (male, female) is coded as (1, 2). This recoding is carried out during importation from an Excel file.
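
The coding rule can be sketched as follows in Python. This is only an illustration of the principle, not the procedure actually used by the DtmVic importation; the function name encode_categories is hypothetical.

```python
def encode_categories(observed, categories):
    """Replace each observed category by its 1-based rank in the dictionary order."""
    code = {label: rank for rank, label in enumerate(categories, start=1)}
    return [code[label] for label in observed]


# Categories in the order in which the dictionary lists them.
gender_categories = ["MALE", "FEMALE"]
codes = encode_categories(["MALE", "FEMALE", "FEMALE"], gender_categories)
print(codes)
```

The order of the categories in the dictionary determines the integer codes, so the dictionary and the data file must always be produced together.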

Table 3: Example of an internal DtmVic text file (type 1) for three texts (see: application example EX_A04.Text.Poems of Tutorial A for texts in English).


Free text format on fewer than 200 columns (80 columns in previous versions of DtmVic). Text separator: “****” followed, after four blank spaces, by the identifier (<= 20 characters). End of file: “====”. All separators occupy columns 1 to 4.

Such a format does not require a specific importation procedure. The original text (in MS Word, for instance) has to be saved in .txt format with the option that inserts line breaks (ends of lines and carriage returns), so as to obtain lines of fewer than 200 characters. Check afterwards that the separators and identifiers comply with the constraints above.
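
The constraints above can be checked automatically before running an analysis. The Python sketch below is one possible way to do so; the function name check_text_type1 is hypothetical, and "less than 200 columns" is interpreted here as at most 199 characters per line, which is an assumption.

```python
MAX_COLUMNS = 200  # assumption: lines must have at most 199 characters
MAX_IDENT = 20     # maximum length of a text identifier


def check_text_type1(text):
    """Return a list of problems found in a type 1 text file (empty if none)."""
    problems = []
    lines = text.splitlines()
    for number, line in enumerate(lines, start=1):
        if len(line) >= MAX_COLUMNS:
            problems.append(f"line {number}: {len(line)} characters, too long")
        if line.startswith("****"):
            if not line.startswith("****    "):
                problems.append(f"line {number}: separator not followed by four blanks")
            identifier = line[4:].strip()
            if not 1 <= len(identifier) <= MAX_IDENT:
                problems.append(f"line {number}: identifier must have 1 to {MAX_IDENT} characters")
    non_blank = [line.rstrip() for line in lines if line.strip()]
    if not non_blank or non_blank[-1] != "====":
        problems.append("end-of-file marker ==== is missing")
    return problems


good = "****    VERLAINE\nLes sanglots longs\nDes violons\n====\n"
bad = "****  TOO_SHORT_GAP\n" + "x" * 250 + "\n"

print(check_text_type1(good))
for problem in check_text_type1(bad):
    print(problem)
```

The second sample deliberately breaks three rules (separator spacing, line length, missing end marker), so three problems are reported for it.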


****    LAMARTINE   
Voilà les feuilles sans sève,
Qui tombent sur le gazon  
Voilà le vent qui s'élève,
Et gémit dans le vallon  
Voilà l'errante hirondelle,
Qui rase du bout de l'aile,  
L'eau dormante des marais...   
****    GAUTIER 
L'automne va finir, au milieu du ciel terne, 
Dans un cercle blafard et livide que cerne  
Un nuage plombé, le soleil dort. Du fond
Des étangs remplis d'eau monte un brouillard qui fond   
Collines, champs, hameaux dans une même teinte.   
****    VERLAINE    
Les sanglots longs  
Des violons 
De l'automne    
Blessent mon coeur  
D’une langueur  
Monotone.   
====
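
A type 1 text file such as the one in Table 3 can be split into its texts with a short Python function. This is a minimal sketch under the conventions described above; the name parse_text_type1 is hypothetical and the function is not part of DtmVic.

```python
def parse_text_type1(text):
    """Return a dict mapping each text identifier to its text."""
    texts = {}
    current = None
    for line in text.splitlines():
        if line.startswith("===="):        # end-of-file marker
            break
        if line.startswith("****"):        # separator: a new text starts
            current = line[4:].strip()
            texts[current] = []
        elif current is not None:
            texts[current].append(line.rstrip())
    return {identifier: "\n".join(lines) for identifier, lines in texts.items()}


SAMPLE = """\
****    LAMARTINE
Voilà les feuilles sans sève,
Qui tombent sur le gazon
****    VERLAINE
Les sanglots longs
Des violons
De l'automne
====
"""

texts = parse_text_type1(SAMPLE)
for identifier, body in texts.items():
    print(identifier, "->", len(body.splitlines()), "lines")
```

Line breaks inside each text are kept, since they may be meaningful (verses of a poem, for instance).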

Table 4: Example of an internal DtmVic text file (type 2) containing the responses of three respondents to three open-ended questions (see: application examples A5, A6 of Tutorial A and B1, B2 of Tutorial B).


Free text format on fewer than 200 columns. Respondent separator: “----” followed by the identifier (<= 20 characters). Question separator: “++++”. End of file: “====”. All separators occupy columns 1 to 4.

Note the blank lines for empty responses (last respondent, second and third questions).



---- 1006
 my sons, my kids are very important to me, 
being on my own I am responsible for their education 
and moral standard 
++++
 education and moral standard of the youngsters, law and order 
++++
 basically, British culture is traditional, 
people tend to keep themselves to themselves 
---- 1007
 job, being a teacher I love my job, for the well being 
of the children 
++++
 law and order, drug abuse, child abuse 
++++
 accommodating, of course people from different races 
and  culture have settled in here, (i.e., Irish, Jewish, 
Asians) and  the British culture is working alright  
---- 1008
 job, sometimes it is very hard to find a job  
++++
 
++++

====
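
The structure of a type 2 text file can be made explicit with a small Python reader. The sketch below is an illustration only (the name parse_text_type2 is hypothetical); it joins the lines of a response with single spaces and returns an empty string for an empty response.

```python
def parse_text_type2(text):
    """Return a dict mapping each respondent identifier to a list of responses."""
    respondents = {}
    current = None
    for line in text.splitlines():
        if line.startswith("===="):        # end-of-file marker
            break
        if line.startswith("----"):        # new respondent, first question
            current = [[]]
            respondents[line[4:].strip()] = current
        elif line.startswith("++++"):      # next question, same respondent
            if current is not None:
                current.append([])
        elif current is not None:
            current[-1].append(line.strip())
    return {
        identifier: [" ".join(part for part in answer if part) for answer in answers]
        for identifier, answers in respondents.items()
    }


SAMPLE = """\
---- 1007
 job, being a teacher I love my job, for the well being
of the children
++++
 law and order, drug abuse, child abuse
++++
 accommodating, of course people from different races
and  culture have settled in here, (i.e., Irish, Jewish,
Asians) and  the British culture is working alright
---- 1008
 job, sometimes it is very hard to find a job
++++

++++

====
"""

responses = parse_text_type2(SAMPLE)
for identifier, answers in responses.items():
    print(identifier, [len(answer) for answer in answers])
```

Every respondent yields the same number of responses, including empty ones, which keeps the responses aligned with the questions.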


End of the DtmVic data format memo.