Dtm-Vic - Tutorials
Example D.1: EX_D01.Importation.XL
Importation of numerical and textual data in “Excel ® format”.
Transforming a specific XL (csv) format file into DtmVic
dictionary file, text file and data file.
This importation procedure can be applied to any text
file (.txt) having the following features, for n individuals and p
variables:
The first row (p + 1 elements) contains the generic name of the
identifiers (for example: ident ) and the p names of the variables
(no blank space allowed within the name, less than 20 characters, preferably
less than 10) separated with a semicolon (or a comma, or a tab). Blank spaces
are allowed between names (free format).
The n remaining rows contain p + 1 elements: the identifier of the individual
(less than 20 characters) and the values of the p variables (for categorical
variables, no blank space are allowed within the alphanumeric values; preferably
less than 10 characters) separated with a semicolon (or a tab). Blank spaces
are allowed between values (free format) and, evidently, within textual variables
(responses to open-ended questions, for example).
Only one type of separator (semicolon or tab) can be used in a file.
Such file can be obtain by saving an Excel file as either a "CSV file", or
a "tab-separated text file".
1- Looking at the data, preliminary steps
The folder “EX_D01.Importation.XL”
contains the file “datbase_global.xls”.
The file datbase_global.xls corresponds to a frequent situation: the
first row of the table contains the variable identifiers, the first
column comprises the observations identifiers.
To begin with, we will have a look (outside
DtmVic) at the original file to be imported.
This file is under Microsoft Excel ® format. The
reader who is not provided with that software should skip the next
instructions… or use the free software “Open Office”
instead.
1.1 Search for the examples directory
DtmVic_Examples
1.2 In that directory, open the directory of example D.01, named
EX_D01.Importation.XL
1.3 Click on the file:
“datbase_classical.xls”
(basic dictionary and data and texts) to obtain a view of the
data through an Excel spreadsheet.
- The first row contains the names of the 17 variables (there are 18 columns,
but the first one relates to the identifier of individuals).
Note again two important constraints:
a) the names of variables must have less than 20 characters,
b) these names should not contain blank spaces (replace them by underscores, if any).
Note that these names will be truncated down to 10 characters to build the
identifiers of the categories. It is then important that these first
10 characters allow for identifying the variable.
The remaining rows consist of 1043 lines (it is the same sample of
individuals from the socio-economic sample surveys serving as example
in the applications A.5, B.2).
The sequence of characters in the first cell of each line is the
identifier of individual, the following sequences being the values of
the 17 variables. Blank cells means “no-answer” or
“missing value”.
1.4 We must save this file as a text file in “.csv” format.
(command: File, then “Save as” )We obtain a free format
file with semicolons as separators. The file in “csv”
format is provided in the example directory.
Important:
1.4.1 If there are some semicolons in the data file, they
should be replaced by another symbol before saving the “Excel
file” as a “csv file”.
1.4.2 Note also that before saving the file, the format of the
cells containing numerical values must be “standard”, to avoid
some additional small blank spaces in numbers of more than 3 digits that
are misinterpreted by the csv file. In the French version and in some
European versions of Excel, the “decimal commas” should be
replaced by the usual decimal dots.
1.4.3 If your version of Excel does not allow for “saving
as a csv file”, you can save the file using “tabs”
as separators, and then, change the “tabs” into
“semicolons”, alteration allowed by the button:
“Change tabs into semicolons”
(see below).
This supposes that the initial data set does not already contain semicolons:
if semicolons are present, you should replace them with another
symbol before the importation process).
1.4.4 In many versions of Excel, the csv format uses commas as
separators, instead of semicolons. You can then transform these
commas into semicolons (provided that the initial data set does not
already contain semicolons: you should replace then these semicolons
with another symbol before the importation process).
2) Sequence of operations
2.1 Click on the button:“Data Importation,
Preprocessing, Data Capture, Exportation”,
(Basic Steps from the main menu of DtmVic). A new window appears.
2.2 Choose the item:
“Importing Dictionary, Data and Texts” .
The new window “Data Importation”
is displayed.
2.3 Press the button entitled: “Excel®
type files (saved as csv files)” .
A new window entitled “Data Importation from an
Excel (r) file” appears.
If the Excel file has been saved using “tabs”
or “commas” as separators, click on one of the optional
buttons:
“0. Change tabs into semi-colons”.
“0. Change commas into semi-colons”.
Select the file saved with tabs or commas, and convert
it. Note that a new name is given to the created file. The
importation process will continue using this new file.
2.4 Then, click on the button: “Start
the Importation Process”
In the new window, click on:
“1.Select input data file.”
(widen the window if necessary).
Select the previously saved file:
“datbase_global.csv”( or the file produced by one of the
previous buttons “0”)
The left hand side memo contains, for each variable, all its
observed values. In the case of continuous numerical variables, the
number of values could be the same as the number of observations. In
the case of textual data, the number of values is the number of
“words” (separators : blank, periods, commas)
- The central memo is a summary of the
previous one. For each variable, we can read within the brackets the
number of distinct values observed in the file.
- The letter (A) in parenthesis means that some letters or
non-numerical values have been observed.
- The letter (N) indicates that only numerical values have been
obtained.
It is then easier to choose the types of the variables:
- categorical ( CHAR ),
- numerical ( NUM ),
- textual ( TEXT ),
- variables to be abandoned ( DISCARD ).
To choose these types, you have then to select one or several consecutive
variables in the list, and choose, for each variable, one keyword among the
four keywords {CHAR, NUM, TEXT, DISCARD}.
- “ CHAR ”
means that we are dealing with a category of a nominal variable. Such
variable could be coded with at most 6 characters. For instance,
‘male’ and ‘female’ for coding the gender
(or “0” and “1”, or “10” and “20”
…). Conventionally, the first item (identifier) should be a
“CHAR”.
- “ NUM ”
means that we are dealing with a purely numerical variable.
- “ TEXT ”
means that the records (up to 8000 characters, another constraint)
will feed the textual data file.
- “ DISCARD ”
means that the records (whatever the prior status) will be suppressed
in the imported file.
Clearly, a variable with a few distinct values containing letters (A) should
be a categorical variable “CHAR”.
Similarly, a variable with hundreds of purely numerical values (N) will probably
deserve the type: “NUM”.
If expected numerical values contain letters (A), it could be than in
the original Excel file, the missing values or ”Do not apply
(DNA)” are represented by alphanumeric symbols. These symbols
should be replaced with blank spaces in the original file, or
directly in the “csv file” before the importation. If you
give the status “NUM” to a variable whose values contain
letters, the importation process will be stopped before being
completed, entailing a waste of time.
2.5 Once the attribution of types is completed, click on the button
“3. Updating and continue” .
2.6 In the new window, Click on “Values
and counts” .
A further check of the consistency of the selected types
or the variables. A list of all the categories found in the data
file, with the corresponding frequencies is displayed. Basic
parameters are also provided for numerical variables. We will not
dwell on this output serving mainly as a technical check.
2.7 Click then on “Create
dictionary and data” .
A new window entitled “Creating a dictionary and a
data file“ appears on the screen.
2.8 Click on “Name
for the new dictionary” .
You have to choose a name for the
forthcoming DtmVic dictionary, always in the same directory (the
extension “.txt ” is recommended) select for example: “dtm_dic.txt.
2.9 Click on
“Name for the new data file”
You have to choose a name for the forthcoming DtmVic data file, always in the same directory (the
extension “.txt ” is recommended). Select for example: “dtm_dat.txt”
2.10 [if textual data have been selected] Click on
“Name for the new text file”
You have to choose a name for the forthcoming DtmVic text file, always in the
same directory (the extension “.txt” is recommended).
Select for example: “dtm_text.txt”
2.11 Click on “Create new dictionary”
A DtmVic dictionary is created (number of lines = total number of
variables + number of found categories). The DtmVic dictionary is displayed
in the right hand side memo.
2.12 Click on
“Create new data file” .
2.13 [if textual data have been selected] Click on
“Create new text file” .
A message box producing the numbers of different types of variables is displayed.
2.14 Click on “Create a first parameter file”.
The window “Creating a starting parameter file” appears.
[Reminder: In DtmVic, the phrases “Parameter
file” and “Command file” are equivalent].
A DtmVic parameter file (or: command file) is displayed in the lower window.
The command file is automatically saved under the name:
“param_start.txt”.
The command file does not include any statistical analysis command, except
basic counts of categories, together with a computation of extreme and average
values for the purely numerical variables.
It is only meant here as a check of the importation of the data.
Comments about the “first command file”
After an identification of the two input files, three
“steps” of DtmVic are involved:
The step “ARDAT”
that archives data and dictionary.
The step “>SELEC”
that selects the variables for the subsequent processing. In this
case, all the available variables are selected.
The step “STATS”
that computes the basic statistics mentioned above.
2.15 Click on “Execute”.Back in the
main menu window, the sequence of steps is displayed.
2.16 Click on the button: “Basic
numerical results” .
The button opens a created (and saved) html file named
“imp.html”
which contains the main results of the previous basic computation
steps. After perusing these numerical results, return
to the main menu. Note that this file is also saved under another
name: The name “imp.html”
is concatenated with the date and time of the analysis (continental
notation). That file keeps as an archive the main numerical results
whereas the file “imp.html”
is replaced for each new analysis performed in the same
directory. Likewise, a simple text format file
“imp.txt” is created and saved.
If the original Excel file contains textual variables (generally: responses
to open-ended questions) a DtmVic textual file is created (the name of which
has been given during the step 2.10). The step "VISURESP"
(in the panel open by the button :
"Create a command file" of the main menu) allows you to check the
consistency of that textual file.
End of example D.1 (Importation of an Excel File or a csv file)