Dtm-Vic - Tutorials

Example D.1: EX_D01.Importation.XL


Importation of numerical and textual data in “Excel ® format”.

Transforming a specific XL (csv) format file into DtmVic dictionary file, text file and data file.




This importation procedure can be applied to any text file (.txt) having the following features, for n individuals and p variables:

The first row (p + 1 elements) contains the generic name of the identifiers (for example: ident ) and the p names of the variables (no blank space allowed within the name, less than 20 characters, preferably less than 10) separated with a semicolon (or a comma, or a tab). Blank spaces are allowed between names (free format).

The n remaining rows contain p + 1 elements: the identifier of the individual (less than 20 characters) and the values of the p variables (for categorical variables, no blank space are allowed within the alphanumeric values; preferably less than 10 characters) separated with a semicolon (or a tab). Blank spaces are allowed between values (free format) and, evidently, within textual variables (responses to open-ended questions, for example).

Only one type of separator (semicolon or tab) can be used in a file. Such file can be obtain by saving an Excel file as either a "CSV file", or a "tab-separated text file".


1- Looking at the data, preliminary steps


The folder “EX_D01.Importation.XL” contains the file “datbase_global.xls”.


The file datbase_global.xls corresponds to a frequent situation: the first row of the table contains the variable identifiers, the first column comprises the observations identifiers.


To begin with, we will have a look (outside DtmVic) at the original file to be imported.

This file is under Microsoft Excel ® format. The reader who is not provided with that software should skip the next instructions… or use the free software “Open Office” instead.


1.1 Search for the examples directory DtmVic_Examples


1.2 In that directory, open the directory of example D.01, named EX_D01.Importation.XL


1.3 Click on the file: “datbase_classical.xls” (basic dictionary and data and texts) to obtain a view of the data through an Excel spreadsheet.

- The first row contains the names of the 17 variables (there are 18 columns, but the first one relates to the identifier of individuals).

Note again two important constraints:

a) the names of variables must have less than 20 characters,

b) these names should not contain blank spaces (replace them by underscores, if any).

Note that these names will be truncated down to 10 characters to build the identifiers of the categories. It is then important that these first 10 characters allow for identifying the variable.

The remaining rows consist of 1043 lines (it is the same sample of individuals from the socio-economic sample surveys serving as example in the applications A.5, B.2).

The sequence of characters in the first cell of each line is the identifier of individual, the following sequences being the values of the 17 variables. Blank cells means “no-answer” or “missing value”.


1.4 We must save this file as a text file in “.csv” format. (command: File, then “Save as” )We obtain a free format file with semicolons as separators. The file in “csv” format is provided in the example directory.


Important:


1.4.1 If there are some semicolons in the data file, they should be replaced by another symbol before saving the “Excel file” as a “csv file”.


1.4.2 Note also that before saving the file, the format of the cells containing numerical values must be “standard”, to avoid some additional small blank spaces in numbers of more than 3 digits that are misinterpreted by the csv file. In the French version and in some European versions of Excel, the “decimal commas” should be replaced by the usual decimal dots.


1.4.3 If your version of Excel does not allow for “saving as a csv file”, you can save the file using “tabs” as separators, and then, change the “tabs” into “semicolons”, alteration allowed by the button: “Change tabs into semicolons” (see below). This supposes that the initial data set does not already contain semicolons: if semicolons are present, you should replace them with another symbol before the importation process).


1.4.4 In many versions of Excel, the csv format uses commas as separators, instead of semicolons. You can then transform these commas into semicolons (provided that the initial data set does not already contain semicolons: you should replace then these semicolons with another symbol before the importation process).


2) Sequence of operations


2.1 Click on the button:“Data Importation, Preprocessing, Data Capture, Exportation”, (Basic Steps from the main menu of DtmVic). A new window appears.


2.2 Choose the item: “Importing Dictionary, Data and Texts” . The new window “Data Importation” is displayed.


2.3 Press the button entitled: “Excel® type files (saved as csv files)” .


A new window entitled “Data Importation from an Excel (r) file” appears.


If the Excel file has been saved using “tabs” or “commas” as separators, click on one of the optional buttons:


0. Change tabs into semi-colons”.

0. Change commas into semi-colons”.


Select the file saved with tabs or commas, and convert it. Note that a new name is given to the created file. The importation process will continue using this new file.


2.4 Then, click on the button: “Start the Importation Process”


In the new window, click on: “1.Select input data file.” (widen the window if necessary).


Select the previously saved file: “datbase_global.csv”( or the file produced by one of the previous buttons “0”)

The left hand side memo contains, for each variable, all its observed values. In the case of continuous numerical variables, the number of values could be the same as the number of observations. In the case of textual data, the number of values is the number of “words” (separators : blank, periods, commas)

- The central memo is a summary of the previous one. For each variable, we can read within the brackets the number of distinct values observed in the file.

- The letter (A) in parenthesis means that some letters or non-numerical values have been observed.

- The letter (N) indicates that only numerical values have been obtained.

It is then easier to choose the types of the variables:


- categorical ( CHAR ),
- numerical ( NUM ),
- textual ( TEXT ),
- variables to be abandoned ( DISCARD ).


To choose these types, you have then to select one or several consecutive variables in the list, and choose, for each variable, one keyword among the four keywords {CHAR, NUM, TEXT, DISCARD}.


- “ CHAR ” means that we are dealing with a category of a nominal variable. Such variable could be coded with at most 6 characters. For instance, ‘male’ and ‘female’ for coding the gender (or “0” and “1”, or “10” and “20” …). Conventionally, the first item (identifier) should be a “CHAR”.


- “ NUM ” means that we are dealing with a purely numerical variable.


- “ TEXT ” means that the records (up to 8000 characters, another constraint) will feed the textual data file.


- “ DISCARD ” means that the records (whatever the prior status) will be suppressed in the imported file.


Clearly, a variable with a few distinct values containing letters (A) should be a categorical variable “CHAR”.


Similarly, a variable with hundreds of purely numerical values (N) will probably deserve the type: “NUM”.


If expected numerical values contain letters (A), it could be than in the original Excel file, the missing values or ”Do not apply (DNA)” are represented by alphanumeric symbols. These symbols should be replaced with blank spaces in the original file, or directly in the “csv file” before the importation. If you give the status “NUM” to a variable whose values contain letters, the importation process will be stopped before being completed, entailing a waste of time.


2.5 Once the attribution of types is completed, click on the button “3. Updating and continue” .


2.6 In the new window, Click on “Values and counts” .

A further check of the consistency of the selected types or the variables. A list of all the categories found in the data file, with the corresponding frequencies is displayed. Basic parameters are also provided for numerical variables. We will not dwell on this output serving mainly as a technical check.


2.7 Click then on “Create dictionary and data” .


A new window entitled “Creating a dictionary and a data file“ appears on the screen.


2.8 Click on “Name for the new dictionary” .

You have to choose a name for the forthcoming DtmVic dictionary, always in the same directory (the extension “.txt ” is recommended) select for example: “dtm_dic.txt.



2.9 Click on “Name for the new data file”

You have to choose a name for the forthcoming DtmVic data file, always in the same directory (the extension “.txt ” is recommended). Select for example: “dtm_dat.txt”


2.10 [if textual data have been selected] Click on “Name for the new text file”

You have to choose a name for the forthcoming DtmVic text file, always in the same directory (the extension “.txt” is recommended). Select for example: “dtm_text.txt”


2.11 Click on “Create new dictionary”

A DtmVic dictionary is created (number of lines = total number of variables + number of found categories). The DtmVic dictionary is displayed in the right hand side memo.


2.12 Click on “Create new data file” .


2.13 [if textual data have been selected] Click on “Create new text file” .


A message box producing the numbers of different types of variables is displayed.


2.14 Click on “Create a first parameter file”.


The window “Creating a starting parameter file” appears.


[Reminder: In DtmVic, the phrases “Parameter file” and “Command file” are equivalent].


A DtmVic parameter file (or: command file) is displayed in the lower window.


The command file is automatically saved under the name: “param_start.txt”.

The command file does not include any statistical analysis command, except basic counts of categories, together with a computation of extreme and average values for the purely numerical variables.

It is only meant here as a check of the importation of the data.


Comments about the “first command file”

After an identification of the two input files, three “steps” of DtmVic are involved:

The step “ARDAT” that archives data and dictionary. The step “>SELEC” that selects the variables for the subsequent processing. In this case, all the available variables are selected. The step “STATS” that computes the basic statistics mentioned above.


2.15 Click on “Execute”.Back in the main menu window, the sequence of steps is displayed.


2.16 Click on the button: “Basic numerical results” .

The button opens a created (and saved) html file named “imp.html” which contains the main results of the previous basic computation steps. After perusing these numerical results, return to the main menu. Note that this file is also saved under another name: The name “imp.html” is concatenated with the date and time of the analysis (continental notation). That file keeps as an archive the main numerical results whereas the file “imp.html” is replaced for each new analysis performed in the same directory. Likewise, a simple text format file “imp.txt” is created and saved.


If the original Excel file contains textual variables (generally: responses to open-ended questions) a DtmVic textual file is created (the name of which has been given during the step 2.10). The step "VISURESP" (in the panel open by the button : "Create a command file" of the main menu) allows you to check the consistency of that textual file.


End of example D.1 (Importation of an Excel File or a csv file)