Dtm-Vic - Tutorials

Example C.2: EX_C02.PCA_Contiguity

(PCA and Contiguity Analysis on Fisher’s Iris Data)


Example C2 aims at analysing a classical set of numerical variables (The Iris data set of Anderson and Fisher) through Principal Components Analysis, Classification, Contiguity Analysis, Discriminant Analysis. As in previous examples, the principal axes visualisation is complemented with a clustering, together with an automatic description of the clusters.

The first phases of Example C.2 are very similar to Example C.1: Principal components analysis and classification (clustering) of a set of numerical data, with various tools of visualisation, involving also a specific categorical data.

Subsection 12 and the following subsections present the improvements provided by Contiguity Analysis, with a comparison with Linear Discriminant Analysis.


Reminder about Contiguity Analysis

In Contiguity analysis, we consider the case of a set of multivariate observations, (n objects described by p variables, leading to a ( n, p ) matrix X ), having an a priori graph structure. The n observations are the vertices of a symmetric graph G, whose associated ( n, n ) matrix is M (mii' = 1 if vertices i and i' are joined by an edge, mii' = 0 otherwise).

Such situation occurs when vertices represent time-points, geographic areas. Contiguity Analysis, confronting local and global variances, provides a straightforward generalization of Linear Discriminant Analysis. It enables to point out the levels responsible of the observed patterns (local  versus global level).

In this example, we will deal with the situation in which M and the graph structure are not external, but derived from the data matrix X itself, M being for example the k- nearest neighbours graph   derived from a distance between observations.

Some interesting possibilities of exploration of data are sketched. Note that the idea of deriving a metric likely to highlight the existence of clusters dates back to the works of Art et al. (1982) and Gnanadesikan et al. (1982).


Some references

Art D., Gnanadesikan R., Kettenring J.R. (1982) Data Based Metrics for Cluster Analysis, Utilitas Mathematica, 21 A, 75-99.

Burtschy B., Lebart L. (1991) Contiguity analysis and projection pursuit. In : Applied Stochastic Models and Data Analysis, R. Gutierrez and M.J.M. Valderrama, Eds, World Scientific, Singapore, 117-128.

Gnanadesikan R., Kettenring J.R., Landwehr J.M. (1982) Projection Plots for Displaying Clusters, in Statistics and Probability, Essays in Honor of C.R. Rao, G. Kallianpur, P.R. Krishnaiah, J.K.Ghosh, eds, North-Holland.

Lebart L. (1969) Analyse statistique de la contiguité. Publications de l’ISUP. XVIII, 81-112.

Lebart , L. (2000): Contiguity Analysis and Classification, In: W. Gaul, O. Opitz and M. Schader (Eds): Data Analysis. Springer,Berlin, 233--244.

Lebart L. (2006): Assessing Self Organizing Maps via Contiguity Analysis. Neural Nerworks , 19, 847-854.



Looking at the data


To have a look at the data, search for the directory DtmVic_Examples.

In this directory, open the sub-directory DtmVic_Examples_C_NumData.

In that directory, open the directory of Example C.2, named “EX_C02.PCA_Contiguity”

It is recommended to use one directory for each application, since DtmVic produces a lot of intermediate txt-files related to the application. At the outset, such directory must contain 3 files:

- a) the data file,

- b) the dictionary file,

- c) the command file.


a) Data file: “iris_dat.txt”

Our example comprises 150 observations and 4 variables: 4 measurements (these numerical variables are the lengths of various constituents of the flowers: Sepal Length, Sepal Width, Petal Length, Petal Width) and one categorical variable describing the characteristics of the observations (three species of plants : setosa, versicolor, virginica).

The data file "iris_dat.txt" comprises 150 rows and 6 columns (the identifier of rows [between quotes] + 5 values [corresponding to 4 numerical variables and one categorical variable] separated by at least one blank space).

[Reference: Anderson, E. (1935). The irises of the Gaspe Peninsula, Bulletin of the American Iris Society 59, 2–5.]


b) Dictionary file: “iris_dic.txt”

The dictionary file "iris_dic.txt" contains the identifiers of these 5 variables. In this version of DtmVic dictionary, the identifiers of categories must begin at: "column 6" [a fixed interval font - also known as teletype font - such as "courier" can be used to facilitate this kind of format].


c) Command file: “EX_C02_Param.txt”

The computational phase of the analysis is decomposed into "steps". Each step requires some parameters briefly described in the main menu of DtmVic (button: "Help about parameters" ) and below.


Note that another “command file” similar (but not identical) to the “command file” “iris_par.txt” can be also generated by clicking on the button “Create a command file” , line "Command file" of the main menu (DTM: Basic Steps). Proceed than as shown in the first example “EX_A01.PrinCompAnalysis” of Tutorial A.




Running the example C.2 and reading the results


1) Click on the button : “Open an existing command file” (panel DTM: Basic Steps of the main menu)


2) Search for the sub-directory “DtmVic_Examples_C_NumData” in DtmVic_Examples”.


3) In that directory, open the directory of Example C.2 named “EX_C02.PCA_Contiguity” .


4) Open the command file: “EX_C02_Param.txt”


In that command file, we can read that after identifying the two files (data and dictionary) , 9 "steps" are performed:

ARDAT (Archiving data), SELEC (selecting active and supplementary elements), PRICO (Principal components analysis), DEFAC (Brief description of factorial axes), RECIP (Clustering using a hierarchical classification of the clusters - reciprocal neighbours method), PARTI (Cut of the dendrogram produced by the previous step, and optimisation of the obtained partition), DECLA (Automatic description of the classes of the partition), SELEC (selecting one categorical variable, in this case), EXCAT (extracting one categorical variable: the species of iris - selected by step SELEC - to be used in some graphical displays).

We will comment later on this command file (Appendix of the section) which commands the basic computation steps. Instead of editing this file, we will go back to the main menu and execute the basic computation steps.


5) Return to the main menu (return to execute)


6) Click on the button: “Execute”

This step will run the basic computation steps present in the command file.


7) Click the button: “Basic numerical results”

The button opens a created (and saved) html file named “imp.html” which contains the main results of the previous basic computation steps. After perusing these numerical results, return to the main menu. Note that this file is also saved under another name. The name “imp.html” is concatenated with the date and time of the analysis (continental notation). That file keeps as an archive the main numerical results whereas the file “imp.html” is replaced for each new analysis performed in the same directory.

As usual, this file in also saved under a simple text format , under the name “imp.txt” , and likewise with a name including the date and time of execution.
Return.


8) At this stage, we click on one of the buttons of the lower panel of the main menu (Steps: “VIC”)


9) Click directly on the button: “BootstrapView”

This button open the DtmVic-Bootstrap-Stability windows.


9.1 Click “LoadData”. In this case (partial bootstrap), the two replicated coordinates file to be opened are named “ngus_var_boot.txt” and “ngus_sup_cat_boot.txt” (see the small panel reminding the names of the relevant files below the menu bar). In fact, “ngus_var_boot.txt” contains only active variables. The file “ngus_sup_cat_boot.txt” contains only supplementary categories, for which the bootstrap procedure is also meaningful.


9.2 Click on “Confidence Areas” submenu, and choose the pair of axes to be displayed (choose axes 1 and 2, to begin with).


9.3 Tick the chosen white boxes to select the elements the location of which should be assessed, and press the button “Select” . Select all the four variables when you open the file “ngus_var_boot.txt” , and, later on, the three species when you open the file “ngus_sup_cat_boot.txt” .


9.4 Click on “Confidence Ellipses” to obtain the graphical display of the active variable points (if the file ngus_var_boot.txt has been loaded), or of the supplementary category points (if the file ngus_sup_cat_boot.txt has been loaded). We can observe, for the variables, that for example, the "petal lengths" seem to be somewhat redundant with "petal widths", since their ellipses markedly overlap. We can see also that the three categories are significantly distinct (that does not mean that they can be linearly separated...).


9.5 Close the display window, and, again in the blue window, press “Convex hulls” . The ellipses are now replaced with the convex hulls of the replicates for each point. The convex hulls take into account the peripheral points, whereas the ellipses are drawn using the density of the clouds of replicates. The two pieces of information are complementary.

Go back to the “VIC” menu.


10. Click on “Visualization”  [Visualization of the three species]

A new window entitled “Visualization, Loading files, Selecting axes” appears.


We are going to visualise the different species of flowers (categorical variable n° 5) in the plane spanned by the first principal components.


10.1 Click on “Load coordinate”


10.2 In the corresponding sub-menu, choose the file: “ngus_ind.txt” . The principal coordinates of the individuals (rows) are selected.


10.3 Click then on “Load or create Partition”


10.4 In the corresponding sub-menu, choose “Load a partition file” . The partition obtained previously from the computation step must then be loaded (its name: part_cat.txt ). The partition induced by the 4 categories of variable number 5 (species of irises) is loaded. This partition has been selected and extracted through the steps SELEC and EXCAT (at the end of the command file, see below).


10.5 Click on “MST” (Minimum Spanning Tree). Choose then the number of axes that will serve to compute the Minimum Spanning Tree: 5 (for example).


10.6 Choose then the number of axes that will serve to compute the Minimum Spanning Tree: 5 (for example).


10.7 Click on “N.N.” (search for nearest neighbours – limited to 20 NN).


10.8 Click on “Graphics” .


10.9 Choose the axes 1 and 2 (default) in the small window “Selection of axes” and click on “Display” .

In the new window entitled “Visualisation” are displayed the individuals in the plane spanned by the selected axes. A random colour is attributed to each species. The button “Change colour” allows you to try a new set of colour.

On the vertical tool bar, you can press each button to activate it (red colour), and press it again to cancel the activation (original colour)

-- The button “Density” , for sake of clarity, replaces the identifiers of individuals with a single character reminding the cluster (the identifier of individuals and the cluster number can be obtained by clicking on the left button of the mouse in the vicinity of each point).

-- The button “C.Hull” (Convex hull) draws the convex hull for each cluster.

-- The button “MST” (Minimum Spanning Tree) draws the minimum spanning tree.

-- The button “Ellipse” perform a Principal Components Analysis of each cluster within the two-dimensional sub-space of visualisation and draws the corresponding ellipses (containing roughly 95% of the points).

-- The button “N.N.” (Nearest neighbours) joins each point to its nearest neighbours. Pressing afterwards the button “N.N.up” allows you to increment the number of neighbours up to the 20 nearest neighbours.

At this step, we have obtained a display of the 150 individuals, with either the convex hulls (or the ellipses) corresponding to the three species. This is a classical display of the Iris data in the principal plane of PCA, showing that the first species (number < 51) are well separated from the species 2 and 3.


Go back to the “VIC” menu.


11. Click again on “Visualization”  [Visualization of the clusters]

We are going to redo the operation of paragraph 10, but instead of loading a partition induced by the 4 categories of variable number 5 (species of irises), we will load a partition produced by a clustering algorithm ignoring the species. Such partition correspond to the steps RECIP and PARTI (see the command file, below).


11.1 Click on “Load coordinate”


11.2 In the corresponding sub-menu, choose the file: “ngus_ind.txt” . The principal coordinates of the individuals (rows) are selected.


11.3 Click then on “Load or create Partition”


11.4 In the corresponding sub-menu, choose “Load a partition file” . The partition obtained previously from the computation step must then be loaded (its name:part_cla_ind.txt ). This partition is derived from the steps RECIP and PARTI .

After loading that partition, all the operations from 10.5 to 10.9 can be carried out again.

It is interesting to visualise the individuals in the plane spanned by the axes 1 and 2.

As suspected, the partition obtained directly from the numerical measurements, ignoring the species, is unable to separate the three species. Only the species “setosa”, well separated from the two other species, coincides with a cluster of the partition.

Go back to the “VIC” menu.


12. Click now on the button: “Contiguity”   [Contiguity Analysis of Iris data]

We are now going to perform a “Contiguity analysis” using a “nearest neighbours graph” derived from the data. The partition into species is no more taken into account.


12.1 Click on “Parameters/Edit”

Choose the item “Create”

We are going to enter the parameters needed by a contiguity analysis:

- In the first block entitled “ncoord = input coordinate file” , tick “1” (File ngus_ind.txt : coordinates of individuals). The contiguity analysis will use the coordinates of individuals as input data.

- In the second block entitled “npart = partition file” , tick: “0” (no partition)

- In the third block entitled “meth = method” , tick “2” (Contiguity graph defined by nearest neighbours).

- Then we will have to enter the following numerical values :

Three contiguity analysis will be performed for the three (symmetrised) graphs corresponding respectively to 4, 6, 8 neighbours (from Min = 4 up to Max = 8, with an increment of npas = 2).

Then: Click on: “Validate” . A summary of the parameters appears.


12.2 In the upper bar of the window, Click on “Execute”. The computations are carried out.

The item “Results” of that bar contains technical details about the computations involved in Contiguity analysis.


12.3 Click on “Contiguity View”. We are led to the same window of visualisation than previously.

In the menu “Load coordinates” , of the new window, choose the file: ngus_contig.txt. Instead of using the principal coordinates of PCA ( ngus_ind.txt as done previously), we use now the result of the Contiguity Analysis ngus_contig.txt .

From the menu “Load or Create a Partition” , choose the file: part_cat.txt . (this file identifies the species)

We cannot compute the Minimum Spanning Tree nor the Nearest Neighbours from the “ngus_contig.txt” coordinates.


12.4 Click on “Graphics”.

Then choose the axes 1 and 2 (default values)

Choose (tick) the contiguity level number 2, that correspond to 6 nearest neighbours. (level 1 corresponds to 4 nearest neighbours, and level 3 to 8 nearest neighbours).

Click on “Display”

Change the colours to obtain a good contrast between species.

Click on “Convex Hull” (vertical bar)

The three species are now better separated.


That means that the (“symmetrised”) graph of 6 nearest neighbours allows for computing a “local covariance matrix” that can act as a “within covariance matrix”.
In this example, the principal plane of a contiguity analysis is similar to the principal plane of a Fisher Linear Discriminant Analysis.


We must keep in mind that the contiguity analysis did not use the a priori knowledge about the species. It is an unsupervised method.


Go back to the “VIC” menu.



13. Click again on “Contiguity”


We are now going to perform a “Contiguity analysis” that coincide exactly with a classical Linear Discriminant Analysis.


(Linear Discriminant Analysis is a particular case of Contiguity Analysis. In such a case, the graph involved in this Contiguity Analysis is made of k cliques (complete graphs) corresponding to the k classes of the Discriminant Analysis).


13.1 Click on “Parameters/Edit”

Choose the item “Create”

We are going to enter the parameters needed by a contiguity analysis:

Then: Click on: “Validate” . A summary of the parameters appears.


13.2 In the upper bar of the window, Click on “Execute” . The computations are carried out.

The item “Results” of that upper bar contains technical details about the computations involved in Contiguity analysis. The matrix associated with the graph with its three diagonal blocks of “1” and with the value “0” elsewhere is visible in this listing of results.


13.3 Click on “Contiguity View”.

In the menu “Load coordinates” , of the new window, choose the file: ngus_contig.txt .

In the menu “Load or Create a partition; , choose the file: part_cart.txt (we will identify the “species of iris”)

We cannot compute the Minimum Spanning Tree nor the Nearest Neighbours from the “ngus_contig.txt” coordinates.


13.4 Click on “Graphics”.

Then choose the axes 1 and 2 (default values)

Click on “Display”.

Change the colours of the display to obtain a good contrast between classes, then lock the colours.

Click on “Convex Hull” (vertical bar)


The three species of iris are well separated, too. But this is less of a surprise, since the Linear (Fisher) Discriminant Analysis aims precisely at separating the classes. We are here in a supervised case. The method uses the a priori knowledge of the species of iris to exhibit the coordinates (discriminant functions) that induce the best separation of the classes.




Appendix C.2 (for advanced users)


A similar command file can be generated using the menu “Create a command file ” (see Tutorial A, exemple A.1) Therefore, beginners could skip this appendix.

Since the steps are the same as those of Example C.1, the listing of parameters will not be commented here.



Command file: EX_C02_Param.txt


#-------------------------------------------------
# Example C.2 of principal component/ Contiguity analysis 
#-------------------------------------------------
# continuation symbol = ">",  Comments symbol = "#"
# title mandatory immediately after each line "STEP"
#-------------------------------------------------
#

 LISTF = NO, LERFA = yes        # Global Parameters

NDICZ = 'iris_dic.txt'          # dictionary file
NDONZ = 'iris_dat.txt'          # data file

#------------------------------------------------------------
# Description of IRIS DATA through PCA, then, classification (unsupervised) 
# into 3 groups (evidently, these groups will not coincide with the 3 real 
# groups [species], as described by the categorical variable number 5).
#------------------------------------------------------------
#  SEE COMMENTS IN APPENDIX C.1   
#------------------------------------------------------------

STEP ARDAT             #reading dictionary and data 
========== builds the Archive Dictionary
 NQEXA = 5, NIDI = 1,  NIEXA = 150  


STEP SELEC             # Selection for PCA
========== Selects active, supplementary variables and observations
LSELI = 0, IMASS = UNIF, LZERO = REC, LEDIT = short
NOMI ILL  5
CONT ACT 1--4
end

STEP PRICO    # Principal Components Analysis
========== 
LCORR = 1, NAXE =4, LEDIN = 1,  NSIMU = 10 nboot = 1

STEP DEFAC             # Description of factorial axes
========== Principal Component Analysis
SEUIL = 40., LCRIM = VTEST, VTMIN = 2.
VEC = 1--2 / CONT / MOD
end

STEP RECIP            # hierarchical classification
========== Clustering using reciprocal neighbours algorithm
NAXU = 4, NTERM = TOT, LDESC = NO, LDEND = DENSE

STEP PARTI            # partition
========== Cut of the dendrogram and optimization
NITER = 4, LEDIN = 3
3 # list of the numbers of clusters required (here one cut, 3 clusters)

STEP DECLA            # Description of partitions
========== Systematic description of clusters
CMODA = 5.0, PCMIN = 2.0, LSUPR = yes, CCONT = 5.0 >
LPNOM = NO, EDNOM = NO, EDCON = NO  
 3 # list of partitions (characterised by they numbers of clusters)

STEP SELEC             # Selection of one nominal variable
========== Selects active, supplementary variables and observations
LSELI = 0, IMASS = UNIF, LZERO = REC, LEDIT = short
NOMI act  5
end

STEP EXCAT # Selection of the previous nominal variable
=========
# no parameter       

STOP                   # End of command file.




End of example C.2 (Contiguity Analysis)