Internal DtmVic format for input data and texts
The aim of the importation procedures (see the "DataFile" section of the main menu) is to transform a
pre-existing data file into the “Internal DtmVic format”.
The internal DtmVic format is a transparent text (.txt) format, readable with any text editor or notepad.
Knowing the internal DtmVic format can be useful, but it is not indispensable for beginners: the tutorial is based on examples whose data
are already in DtmVic format.
Recall that DtmVic is software devoted to the
exploratory analysis of multivariate numerical and textual data.
The leading case that exemplifies all the possibilities
of the software is a sample survey data set, comprising both
responses to closed questions and responses to open-ended questions
(the closed questions may lead to numerical [quantitative] or
categorical [qualitative] data).
In the most general configuration, three files
constitute the internal DtmVic input data set:
1) The dictionary file, which provides the names (or
identifiers) of the numerical and categorical variables (fewer than 1,200 variables).
It includes the names of the categories corresponding to each categorical
variable. This feature is rather uncommon in statistical
software, but seems indispensable for exploring high-dimensional
categorical data sets.
2) The data file, which contains the values of these
variables for a set of individuals (or observations), together with
the identifiers of the individuals (fewer than 45,000 individuals).
3) The text file, made up (for example) of the responses to open-ended
questions. This text file (known as text file type 2) concerns the
same respondents as the data file, in the same order. A simplified
“text file format” (text file type 1) can be used when
dealing only with a series of texts, without an associated data file or
dictionary file.
Some applications may involve only the text file (see,
for instance, example A4 of Tutorial A), whereas others may need
only the dictionary and data files (application examples A1, A2,
A3, C1 and C2 of Tutorials A and C).
Internal “DtmVic format”
The format is specific, but not proprietary: the three types of files
are in simple text format (extension “.txt”, readable
with a “notepad” or a text editor, or with a word
processor, provided that they are saved as simple text files).
As an introductory exercise, they can be recorded directly from the
keyboard, or with the help of the menu “DataCapture”
(see preliminary example D.0 below).
In most cases, however, they have to be imported from (often large)
pre-existing files. The transformation into DtmVic format is then
transparent to the user.
Table 1 shows an example of a small DtmVic dictionary, involving four
variables. Table 2 displays an example of a DtmVic data file (same four
variables, three individuals or respondents).
Table 3 presents a text file containing three poems (DtmVic text file type 1),
and Table 4 a text file containing the responses of three respondents
to three open-ended questions (DtmVic text file type 2).
Table 1: Example of an internal DtmVic dictionary for 4 variables:
Gender (2 categories); Age (0 categories = numerical variable); Age broken
down into 4 categories;
Educational level (3 categories). [Fixed format; comments in parentheses.]
2 GENDER (number of categories [2] in columns 1-4; blank; title of the variable)
MALE MALE (short identifier [columns 1-4]; blank; identifier [< 20 characters])
FEMA FEMALE (short identifier [columns 1-4]; blank; identifier [< 20 characters])
0 AGE (number of categories [0] in columns 1-4; blank; numerical variable)
4 AGE_CODE (number of categories [4] in columns 1-4; blank; title of the variable)
AGE1 18_24 (short identifier [columns 1-4]; blank; identifier [< 20 characters])
AGE2 25_39 (short identifier [columns 1-4]; blank; identifier [< 20 characters])
AGE3 40_59 (short identifier [columns 1-4]; blank; identifier [< 20 characters])
AGE4 >60 (short identifier [columns 1-4]; blank; identifier [< 20 characters])
3 EDUCATION (number of categories [3] in columns 1-4; blank; title of the variable)
EDUL LOW (short identifier [columns 1-4]; blank; identifier [< 20 characters])
EDUM MEDIUM (short identifier [columns 1-4]; blank; identifier [< 20 characters])
EDUH HIGH (short identifier [columns 1-4]; blank; identifier [< 20 characters])
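The fixed layout of the dictionary can be read with a few lines of Python. The following is only a minimal sketch, not DtmVic's own reader: the names Variable, parse_dictionary and SAMPLE_DICTIONARY are hypothetical, and the sample assumes that the number of categories is padded with blanks so that it occupies columns 1-4.

```python
from dataclasses import dataclass, field


@dataclass
class Variable:
    title: str
    # (short identifier, identifier) pairs; an empty list means "numerical"
    categories: list = field(default_factory=list)


def parse_dictionary(lines):
    """Parse fixed-format dictionary lines into a list of Variable objects."""
    variables = []
    i = 0
    while i < len(lines):
        header = lines[i]
        i += 1
        if not header.strip():
            continue  # tolerate blank lines
        n_categories = int(header[:4])            # columns 1-4
        var = Variable(title=header[5:].strip())  # column 5 is the blank
        for cat_line in lines[i:i + n_categories]:
            var.categories.append((cat_line[:4].strip(), cat_line[5:].strip()))
        i += n_categories
        variables.append(var)
    return variables


# Table 1, with the counts padded to four columns (an assumption).
SAMPLE_DICTIONARY = [
    "   2 GENDER",
    "MALE MALE",
    "FEMA FEMALE",
    "   0 AGE",
    "   4 AGE_CODE",
    "AGE1 18_24",
    "AGE2 25_39",
    "AGE3 40_59",
    "AGE4 >60",
    "   3 EDUCATION",
    "EDUL LOW",
    "EDUM MEDIUM",
    "EDUH HIGH",
]

for var in parse_dictionary(SAMPLE_DICTIONARY):
    kind = "numerical" if not var.categories else f"{len(var.categories)} categories"
    print(var.title, "->", kind)
```

A variable with 0 categories ends up with an empty category list, which mirrors the convention "0 categories = numerical variable".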
Table 2: Example of an internal DtmVic data file for the previous 4 variables:
Gender, Age, Age broken down into 4 categories, Educational level. 3 respondents
(individuals, observations).
'1006' 1 76 12 1
'1007' 2 20 2 2
'1008' 2 29 3 2
(Identifiers of the respondents: between quotes, without blanks, less than 20 characters.
Separators between values: at least one blank space.)
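Because identifiers contain no blanks and values are separated by at least one blank space, a data line can be split on whitespace. The sketch below illustrates this under those stated constraints; parse_data_row is a hypothetical helper name, and reading every value as a float is a simplifying assumption.

```python
def parse_data_row(line):
    """Split one data-file line into (identifier, list of numeric values)."""
    fields = line.split()               # any run of blanks is a separator
    identifier = fields[0].strip("'")   # identifiers are quoted and contain no blank
    values = [float(v) for v in fields[1:]]
    return identifier, values


# The three data lines of Table 2, without the explanatory comments.
rows = [
    "'1006' 1 76 12 1",
    "'1007' 2 20 2 2",
    "'1008' 2 29 3 2",
]
data = dict(parse_data_row(r) for r in rows)
print(data["1007"])
```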
Important: the categorical variables are coded as consecutive integers.
For example, the variable Gender with 2 categories (male, female) will be coded as (1, 2).
This recoding is carried out during the importation from an "Excel (c)" file.
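The coding rule can be illustrated as follows. This is a sketch of the idea only and makes no claim about how the DtmVic importation procedure is implemented; recode_categories is a hypothetical name, and the codes are assumed to follow the order in which the categories are listed in the dictionary.

```python
def recode_categories(labels, category_order):
    """Replace each label by its 1-based rank in category_order."""
    code_of = {label: rank for rank, label in enumerate(category_order, start=1)}
    return [code_of[label] for label in labels]


gender = ["male", "female", "female"]
print(recode_categories(gender, ["male", "female"]))
```

A label that is absent from category_order raises a KeyError, which makes coding mistakes visible early.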
Table 3: Example of an internal DtmVic text file (type 1) for three texts
(see: application example EX_A04.Text.Poems of Tutorial A for texts in English).
Free text format, on lines of less than 200 columns (80 columns in previous versions of DtmVic).
Separator of texts: “****” followed, after four blank spaces,
by the identifier (<= 20 characters); end of file: “====”.
All separators are in columns 1, 2, 3, 4.
Such a format does not require a specific importation procedure.
The original text (in MsWord, for instance) has to be saved in .txt format,
with the option that inserts ends of lines and carriage returns
(to obtain lines of less than 200 characters).
Then check that the separators and identifiers comply with the constraints above.
**** LAMARTINE
Voilà les feuilles sans sève,
Qui tombent sur le gazon
Voilà le vent qui s'élève,
Et gémit dans le vallon
Voilà l'errante hirondelle,
Qui rase du bout de l'aile,
L'eau dormante des marais...
**** GAUTIER
L'automne va finir, au milieu du ciel terne,
Dans un cercle blafard et livide que cerne
Un nuage plombe, le soleil dort. Du fond
Des étangs remplis d'eau monte un brouillard qui fond
Collines, champs, hameaux dans une même teinte.
**** VERLAINE
Les sanglots longs
Des violons
De l'automne
Blessent mon coeur
D’une langueur
Monotone.
====
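The type 1 layout can be read by watching for the separators in columns 1-4. The following is a minimal sketch under the constraints stated for Table 3; parse_text_type1 is a hypothetical name and the sample is a shortened excerpt of the table.

```python
def parse_text_type1(lines):
    """Return {identifier: list of text lines} for a type 1 text file."""
    texts = {}
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("===="):      # end of file
            break
        if line.startswith("****"):      # separator of texts, then identifier
            current = line[4:].strip()
            texts[current] = []
        elif current is not None:
            texts[current].append(line)
    return texts


sample = [
    "****    LAMARTINE",
    "Qui tombent sur le gazon",
    "****    VERLAINE",
    "Les sanglots longs",
    "Des violons",
    "====",
]
for identifier, body in parse_text_type1(sample).items():
    print(identifier, len(body), "lines")
```

Stripping the identifier means the sketch accepts any number of blanks between the separator and the identifier.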
Table 4: Example of an internal DtmVic text file (type 2) containing the
responses of three respondents to three open-ended questions
(see application examples A5 and A6 of Tutorial A, and B1 and
B2 of Tutorial B).
Free text format, on lines of less than 200 columns. Separator of respondents: “----”
followed by the identifier (<= 20 characters); separator of
questions: “++++”; end of file: “====”. All
separators are in columns 1, 2, 3, 4.
Note the blank lines for empty responses (last respondent, second and
third questions).
---- 1006
my sons, my kids are very important to me,
being on my own I am responsible for their education
and moral standard
++++
education and moral standard of the youngsters, law and order
++++
basically, British culture is traditional,
people tend to keep themselves to themselves
---- 1007
job, being a teacher I love my job, for the well being
of the children
++++
law and order, drug abuse, child abuse
++++
accommodating, of course people from different races
and culture have settled in here, (i.e., Irish, Jewish,
Asians) and the British culture is working alright
---- 1008
job, sometimes it is very hard to find a job
++++
++++
====
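The type 2 layout adds a second separator for questions. The sketch below groups the lines by respondent and by question; parse_text_type2 is a hypothetical name, the sample is a shortened excerpt of Table 4, and joining the lines of a response with a single blank is an assumption made for readability.

```python
def parse_text_type2(lines):
    """Return {respondent identifier: [response to question 1, 2, ...]}."""
    responses = {}
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("===="):                  # end of file
            break
        if line.startswith("----"):                  # new respondent
            current = line[4:].strip()
            responses[current] = [[]]
        elif current is None:
            continue                                 # ignore text before the first respondent
        elif line.startswith("++++"):                # next question
            responses[current].append([])
        elif line.strip():                           # blank lines add no text
            responses[current][-1].append(line.strip())
    return {rid: [" ".join(parts) for parts in answers]
            for rid, answers in responses.items()}


sample_type2 = [
    "---- 1006",
    "my sons, my kids are very important to me,",
    "++++",
    "law and order",
    "++++",
    "people tend to keep themselves to themselves",
    "---- 1008",
    "job, sometimes it is very hard to find a job",
    "++++",
    "",
    "++++",
    "",
    "====",
]
print(parse_text_type2(sample_type2)["1008"])
```

Empty responses come back as empty strings, so every respondent keeps one entry per question.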
End of the DtmVic data format memo.