Updated version of "Multivariate Descriptive Statistical Analysis", by L. Lebart, A. Morineau, K. Warwick, John Wiley, New York, 1984.
Clustering Techniques
This section deals with three families of classification algorithms:
1) methods leading directly to a single partition of the data set (clustering around moving centers), including the k-means method and the Kohonen Self-Organizing Maps (SOM),
2) methods of hierarchical clustering,
3) a mixed method, adapted to large data sets, involving both approaches 1) and 2).
1. Partitioning methods
These methods are particularly well adapted to large numerical data sets, since the data matrix can be saved on auxiliary memory (such as a disk). It is then possible to read this matrix several times in sequential fashion, without requiring large amounts of computer memory. Direct reading can also take advantage of the specific features of the coding, leading to appreciable savings in computation time in the case of sparse matrices.
1.1 Clustering around Moving Centers or k-means method
1.1.1 Theoretical basis of the algorithm
The algorithm underlying the programs presented here can be attributed mainly to Forgy (1965), although numerous developments (sometimes earlier (Thorndike, 1953), most often later: Ball and Hall, 1967; MacQueen, 1967; Diday, 1971) arose independently.
Let us suppose we wish to partition a set I of n individuals characterised by p variables. Let us also suppose that the space R^p containing the n individual-points is equipped with an appropriate Euclidean distance denoted by d. The maximum number of groups desired is q.
STEP 0. q provisional group centers are determined (for example by a pseudo-random selection of q individuals in the population to be clustered, as proposed by MacQueen).
The q centers,
C_1^0, C_2^0, ..., C_k^0, ..., C_q^0,
immediately create a partition P^0 of I into q clusters,
I_1^0, I_2^0, ..., I_k^0, ..., I_q^0.
The assignment rule is: an individual i belongs to I_k^0 if point i is nearer to C_k^0 than to all other centers. (The clusters are delimited in space by convex polyhedral regions bounded by the median hyperplanes of the segments joining all the pairs of centers.)
STEP 1. q new cluster centers are determined:
C_1^1, C_2^1, ..., C_k^1, ..., C_q^1,
by taking the centers of gravity of the clusters
I_1^0, I_2^0, ..., I_k^0, ..., I_q^0.
These new centers generate a new partition P^1, built according to the same rule as for P^0. P^1 is made of the clusters
I_1^1, I_2^1, ..., I_k^1, ..., I_q^1.
STEP m. q new cluster centers are determined:
C_1^m, C_2^m, ..., C_k^m, ..., C_q^m,
by taking the centers of gravity of the clusters
I_1^(m-1), I_2^(m-1), ..., I_k^(m-1), ..., I_q^(m-1).
The algorithm stops either when two successive iterations lead to the same partition, or when a conveniently chosen criterion (for example, the measure of the within-groups variance) stops decreasing significantly, or when a previously established maximum number of iterations is reached. In every case, the resulting partition depends upon the initial choice of the centers at step 0.
[Figure 1 is made of four panels: drawing 2 centers at random; building 2 clusters; new centers and new clusters; new centers and new clusters again.]
Figure 1. Steps of the k-means algorithm in the case of two clusters
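To make the above steps concrete, here is a minimal sketch in Python with NumPy (the language, the library and the function name moving_centers are our own choices for illustration, not part of the original software): pseudo-random initial centers, assignment of each individual to its nearest center, recomputation of the centers of gravity, and a stop when two successive iterations give the same partition or when a maximum number of iterations is reached.

```python
import numpy as np

def moving_centers(X, q, max_iter=50, seed=None):
    """Partition the rows of X into at most q clusters (moving centers / k-means)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # STEP 0: q provisional centers drawn pseudo-randomly among the individuals
    centers = X[rng.choice(n, size=q, replace=False)].astype(float)
    labels = np.full(n, -1)
    for _ in range(max_iter):
        # assignment rule: individual i belongs to cluster k if it is
        # nearer to center k than to all the other centers
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # same partition twice: stop
            break
        labels = new_labels
        # new centers: centers of gravity of the clusters just formed
        for k in range(q):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels, centers
```

On an (n, p) data matrix X, a call such as moving_centers(X, q=5, seed=0) returns the cluster of each row and the final centers; empty clusters are simply left where they are, which is only one of several possible conventions.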
1.1.2 Rationale for the Algorithm
We show that the within-group variance can only decrease (or remain constant) between step m and step m + 1. Programming conventions make it possible to force this decrease to be a strict one, and therefore to conclude that the algorithm does converge since the initial set I is finite (obviously it is not the convergence itself, but the rate of convergence that justifies using this method in practice).
Let us suppose that the n individuals of the set I to be clustered have relative weights p_i such that
Σ_i p_i = 1,
and let d^2(i, C_k^m) be the squared distance between individual i and the center of cluster k at step m.
We are interested in the criterion quantity
v(m) = Σ_k Σ_{i ∈ I_k^m} p_i d^2(i, C_k^m).
Recall that at step m, group I_k^m is formed from the individuals who are closer to C_k^m than to all the other centers (these centers are the centers of gravity of the groups I_k^(m-1) of the preceding step). The within-group variance at step m is therefore the quantity
V(m) = Σ_k Σ_{i ∈ I_k^m} p_i d^2(i, C_k^(m+1)),
where C_k^(m+1) is the center of gravity of group I_k^m.
At step m + 1, the criterion quantity v(m+1) is written
v(m+1) = Σ_k Σ_{i ∈ I_k^(m+1)} p_i d^2(i, C_k^(m+1)).
We show that
v(m) >= V(m) >= v(m+1)
which establishes the simultaneous decrease of the criterion and of the within-group variance.
Let p_k denote the weight of cluster I_k^m. First of all, let us note that, for each cluster k,
Σ_{i ∈ I_k^m} p_i d^2(i, C_k^m) = Σ_{i ∈ I_k^m} p_i d^2(i, C_k^(m+1)) + p_k d^2(C_k^m, C_k^(m+1)),
according to Huyghens' theorem (C_k^(m+1) being the center of gravity of I_k^m); summing over k establishes the first part of the inequality.
The second part follows from the fact that, between the sums defining V(m) and v(m + 1), only the assignments of the points to the centers change. Since I_k^(m+1) is the set of points that are closer to C_k^(m+1) than to all the other centers, the distances can only have decreased (or stayed the same) in the course of this reassignment.
1.1.3 Related Techniques and Pioneering Approaches
There exist many algorithms that have general principles similar to the algorithm of clustering around moving centers, but that differ in several respects.
(a) Thus in the technique of "dynamic clusters," the clusters are not characterized by a center of gravity, but by a certain number of the individuals to be clustered, called "standards," which constitute a "nucleus" that has, for some applications, a better descriptive power than point-centers do. This formulation has led to several generalizations of the method (Diday, 1972).
(b) The k-means method introduced by MacQueen (1967) also starts with a pseudo-random selection of point-centers. However, the rule for calculating the new centers is not the same. The position of the centers is modified before all of the individuals have been reassigned: each reassignment of an individual leads to a modification of the position of the corresponding center. This procedure therefore gives a partition of good quality in a single iteration, but it depends on the order of the individuals in the data file, which was not the case for the technique discussed above.
(c) Other methods differ in the initial choice of the centers (equidistant individuals for Thorndike, 1953), in the introduction of thresholds, or in the introduction of rules whose purpose is to allow the number of groups to be modified. Thus the technique proposed under the name of ISODATA by Ball and Hall (1965) uses several parameters to guide the building of the partition.
1.1.4 Stable Clusters
Algorithms for clustering around moving centers converge toward local optima. The problem of finding an optimal partition into q clusters (using within-group variance as a criterion, which must then be minimized over the set of possible partitions into q clusters) has not yielded, to date, a satisfactory algorithm. The partitions obtained generally depend on the initial centers chosen.
The procedure for finding stable clusters, proposed by Diday (1972), is at least a partial remedy for this situation. Its main advantage is that it elaborates the results obtained in the rigid framework of a single partition by highlighting the high-density regions of the cloud of individual-points.
The technique consists in performing several partitions starting with several different sets of centers, and keeping as stable clusters the sets of individuals that are always assigned to the same cluster in each of the partitions.
Let us designate by P1, P2, ..., Ps the s partitions into q clusters (each partition may require a different number of iterations).
The cluster indexed by (k1, k2, ..., ks) contains the individuals that belonged first to cluster k1 of P1, then to cluster k2 of P2, ..., and finally to cluster ks of Ps. This product partition thus contains q^s clusters.
The non-empty classes of this partition constitute the stable clusters.
In practice the number of stable clusters of substantial size is much smaller than q^s. For instance, it is not uncommon, after 3 partitions into 9 clusters have been performed on 1000 individuals, to obtain only 20 stable clusters with more than 10 elements (even though the product partition contains 9^3 = 729 clusters).
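As an illustration of the product partition, the following hedged sketch uses scikit-learn's KMeans (the library choice and the function name stable_clusters are assumptions made for the example): s base partitions into q clusters are computed from different random initial centers, and the non-empty cells of their product partition are counted.

```python
from collections import Counter
from sklearn.cluster import KMeans

def stable_clusters(X, q=9, s=3, seed=0):
    # s base partitions into q clusters, each built from different random centers
    partitions = [
        KMeans(n_clusters=q, init="random", n_init=1, random_state=seed + r).fit_predict(X)
        for r in range(s)
    ]
    # each individual is identified by the tuple (k1, ..., ks) of the clusters
    # it joined in the s base partitions (product partition)
    cells = list(zip(*partitions))
    # the non-empty cells of the product partition are the stable clusters
    return Counter(cells)
```

With q = 9 and s = 3, the product partition has at most 9^3 = 729 cells, but only a small number of them are usually populated with more than a few individuals.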
1.2 Self-Organizing Maps (SOM) or Kohonen Maps
The Self-Organizing Map (SOM) proposed by Kohonen (1989) aims at clustering a set of multivariate observations. The clusters obtained are often displayed as the vertices of a rectangular (chessboard-like) or hexagonal grid. The distances between vertices on the grid are supposed to reflect, as much as possible, the distances between clusters in the initial space.
The algorithm is similar to the MacQueen algorithm (1967) in its on-line version (the centers are updated after reading each individual, i.e. each row of the data matrix), and to the k-means algorithm in its batch version (the centers are updated after reading the whole data set).
1.2.1 Principles of the algorithm
The size of the grid, and consequently the number of clusters, are chosen a priori (e.g. a square grid with 5 rows and 5 columns, leading to 25 clusters). The algorithm is very similar to the classical k-means algorithm. Let us consider n points in a p-dimensional space (the rows of the (n, p) matrix X). At the outset, each cluster k is assigned a provisional center C_k with p components (e.g. chosen at random, or among the first elements).
At each step m, the element i(m) is assigned to its nearest center C_k(m).
The difference with the basic moving centers method (k-means) lies in the updating of the centers: the winning center, together with its neighbours on the grid, is then moved according to the formula
C_k(m+1) = C_k(m) + e(m) [i(m) - C_k(m)]
In this formula, e(m) is an adaptation parameter (0 < e(m) < 1) which is a (slowly) decreasing function of m. This process is reiterated, and eventually stabilizes, but the partition obtained generally depends on the initial choice of the centers (as in k-means).
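A minimal sketch of this on-line updating rule is given below (Python with NumPy assumed; the function name som and the fixed radius-1 neighbourhood are simplifying choices made for the illustration, not Kohonen's full specification, which usually also shrinks the neighbourhood over time).

```python
import numpy as np

def som(X, rows=5, cols=5, n_steps=10000, e0=0.5, seed=0):
    """On-line SOM on the rows of X, with a (rows x cols) square grid."""
    rng = np.random.default_rng(seed)
    q = rows * cols
    centers = X[rng.choice(X.shape[0], size=q, replace=False)].astype(float)
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)])
    for m in range(n_steps):
        e = e0 * (1.0 - m / n_steps)                   # slowly decreasing e(m)
        x = X[rng.integers(X.shape[0])]                # element i(m)
        k = ((centers - x) ** 2).sum(axis=1).argmin()  # nearest center C_k(m)
        # move the winner and its neighbours on the grid (Chebyshev radius 1)
        neigh = np.abs(grid - grid[k]).max(axis=1) <= 1
        centers[neigh] += e * (x - centers[neigh])
    # final assignment of each row to its nearest center
    labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    return labels, centers, grid
```

As noted above, the partition obtained depends on the random initial centers and on the order in which the individuals are presented.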
The convergence of the algorithm is supported by a large number of experiments, but it has not yet been proved mathematically, except in the case where the grid has the elementary shape of a single string (see Cottrell and Fort, 1987).
SOM techniques have the advantage of providing the reader with a visualisation of the distances between clusters. They constitute a compromise between the purely geometrical representations produced by principal axes techniques (PCA, CA, MCA) and the clustering techniques.
In fact, to build a visual representation of a partition of n objects described by p variables, one can choose between several options:
(A) to build the partition by optimizing a criterion (as in k-means techniques) and then, afterwards, to position the clusters in the principal plane derived from, e.g., a CA of the same data set;
(B) to build the partition by optimizing a criterion and then, in a second phase, to build a representation that takes the partition into account (example: LDA, Linear Discriminant Analysis, which purports to find the plane best able to exhibit the partition);
(C) to build the partition and the representation simultaneously: the partition is then constrained, but the quality of the representation is improved. That is the case of the Self-Organizing Maps (SOM).
1.2.2 An example of application
The selected data set deals with survey data.
[The data sets are in the DtmVic example folder named “EX_C01.PCA_Semio”, in the subfolder DtmVic_Examples_C_NumData of the directory DtmVic_Examples.]
Following an idea of J.-F. Steiner, the “semiometric” questionnaire (see Lebart, Piron and Steiner, 2003) consists in a simple list of 210 words. The 3360 respondents must rate each word (on a 7-point scale) according to the pleasure or displeasure they experience at its mention. We are interested here in the pattern of words induced by their correlations.
The pattern obtained in the space spanned by the six main principal axes of a Principal Component Analysis of the (3360 x 210) data table appears to be stable over time and to be similar in several European countries. In this text, we have run the example on a subset of 70 words and on a sub-sample of 300 respondents. Since we are interested in the clustering of words, the data matrix X will be a (70, 300) matrix.
Figure 2. A (3 x 3) SOM displaying the 70 words, rated by the 300 respondents
Figure 2 represents a (3 x 3) Self-Organizing Map intended to visualise the distances between variables (words) in terms of correlations (the distances between words are the Euclidean distances computed on the standardised variables).
A comparison with classical Principal Component Analysis:
Figure 3 deals with a classical principal component analysis of X’. It represents the projections of the 70 variables (words) onto the plane spanned by the first two principal components.
Figure 3. Principal plane of a PCA. The points C1, C2, ..., C9 represent the centroids of the 9 clusters derived from the SOM map.
On the same display, one can find the nine centroids (C1, ..., C9) of the nine clusters of words previously produced by the SOM algorithm (square 3 x 3 grid). Those centroids that correspond to adjacent clusters in figure 2 are joined by a straight line. The initial SOM grid then appears to be strongly distorted by the projection.
Note about the implementations of k-means and SOM in the software DtmVic:
Both these families of methods are considered as complementary to principal axes techniques: they are carried out after computation of the principal axes, and the points to be clustered (individuals, observations, or variables as in the previous example) are characterised by their principal coordinates.
A partitioning of the individuals or rows of the data matrix (using the k-means and hierarchical methods, as described in section 3 of this chapter) is proposed for each principal axes method (CA, PCA, MCA) in the basic menu. In the “Visualization, Inference Classification” submenu, it is possible to visualize the series of k-means steps (button “Load partition”, after “Visualization”) and to perform SOM clustering on both variables and individuals.
2. Hierarchical Classification
2.1 Introduction
2.1.1 Distances Between Elements and Between Groups.
The principles common to various hierarchical classification techniques are simple.
1. At the outset we suppose that some measure of distance can be defined between pairs of the objects that are to be classified. Sometimes this distance is simply a measure of dissimilarity. In this case, the triangular inequality, d(x, y) <= d(x, z) + d(z, y), is not required.
The distances are not necessarily calculated at the outset : in this situation it must be possible to compute or recompute them from the coordinates of the object-points, and these coordinates must be easily accessible.
2. Then we suppose that there exist rules for calculating the distances between disjoint clusters of objects. This between-cluster distance can generally be calculated directly from the distances of the various elements involved in the clustering.
For example, if x, y, and z are three objects, and if x and y are clustered into a cluster h, we can define the distance from this cluster h to z as the smallest distance from the various elements of h to z :
d(h, z) = min{d(x, z), d(y, z)}
This distance is called the single linkage (Sneath, 1957; Johnson, 1967), and it is discussed later in greater detail.
Another rule that is simple and frequently used is that of the group average: for two objects x and y clustered in h,
d(h, z) = [d(x, z) + d(y, z)] / 2.
More generally, if x and y designate disjoint subsets of the set of objects, containing respectively n_x and n_y elements, and h is the subset obtained by merging them, containing (n_x + n_y) elements, we define
d(h, z) = [n_x d(x, z) + n_y d(y, z)] / (n_x + n_y).
2.1.2 Classification Algorithm.
The general concepts we describe here are difficult to trace historically. They are based on common sense rather than on formalized theory. The most systematic and earliest published explanation is perhaps that of Lance and Williams (1967). The basic algorithm for ascending hierarchical classification is as follows (see figure 4).
We call points either the objects to be classified or the clusters of objects generated by the algorithm.
1. At step 1 there are n points to classify (which are the n objects).
2. We find the two points that are closest to one another and aggregate them into a new point.
3. We calculate the distances between the new point and the remaining points. We return to step 1 with only n -1 points to classify.
4. We again find the two closest points and aggregate them; we calculate the new distances, and repeat the process until there is only one point remaining.
Figure 4. Hierarchical clustering of 5 points in a 2-dimensional space
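A hedged sketch of this ascending algorithm, with the single linkage rule of section 2.1.1, is given below (Python with NumPy; the function name single_linkage is ours). It works on a precomputed dissimilarity matrix and records the successive aggregations together with their levels.

```python
import numpy as np

def single_linkage(D):
    """D: (n, n) symmetric dissimilarity matrix. Returns the list of aggregations."""
    D = D.astype(float).copy()
    np.fill_diagonal(D, np.inf)
    active = list(range(D.shape[0]))      # current "points" (objects or clusters)
    history = []
    while len(active) > 1:
        # find the two closest active points
        sub = D[np.ix_(active, active)]
        a, b = np.unravel_index(sub.argmin(), sub.shape)
        i, j = active[a], active[b]
        history.append((i, j, D[i, j]))   # aggregation and its level
        # single linkage rule: d(h, z) = min{d(x, z), d(y, z)}, stored in row/column i
        D[i, :] = np.minimum(D[i, :], D[j, :])
        D[:, i] = D[i, :]
        D[i, i] = np.inf
        active.remove(j)                  # point j is absorbed into point i
    return history
```

Replacing np.minimum by np.maximum, or by the group average formula of section 2.1.1, yields the complete linkage and group average variants.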
2.1.3 Terminology. We make a few remarks to introduce the terminology associated with classification.
Figure 5. Description of the hierarchy
1. In the case of single linkage, the algorithm uses distances in terms of the inequalities between them. The same tree could be derived (with a different scale) by simply using an ordering of pairs of objects. In this case the tree is drawn with equidistant levels.
2. The family of clusters built by such an ascending algorithm forms a hierarchy. This family has the property of containing the entire set as well as every one of the objects taken separately. The other parts of this family are either disjoint or included in one another. Every time a new cluster is formed out of disjoint points, the new cluster itself becomes a new point, and is therefore necessarily included in a later cluster.
The hierarchy is an indexed hierarchy if a numerical value v(h) > 0 is associated with every part h of the hierarchy, in a way that is compatible with the inclusion relationship: if h is included in h', then v(h) <= v(h').
3. By "cutting" a tree or dendrogram, we obtain a partition in which the number of classes increases as the cut-point approaches the initial points.
A hierarchy allows us to obtain a set of n nested partitions containing 1 to n groups.
2.2 Single Linkage Classification and Minimum Spanning Tree
2.2.1 Definition of an Ultrametric Distance.
The method of classification presented above is very simple to perform, and has interesting properties that we now formulate and discuss. We show that the concept of a hierarchy is closely linked to a class of distances between objects called ultrametric distances. For the hierarchy produced by the single linkage algorithm we show that the corresponding ultrametric distance is, in a sense, the closest to the original distance. It is called the maximum lower ultrametric distance, or the subdominant ultrametric distance. Then we show that applying this method is practically equivalent to solving a classical problem in operations research: finding the minimum spanning tree of a graph.
Recall that a set E is provided with a distance d if d is a positive mapping that satisfies the following conditions:
1. d(x, y) = 0 if and only if x = y.
2. d(x, y) = d(y, x) (symmetry).
3. d(x, y) <= d(x, z) + d(y, z) (triangular inequality).
A distance is called an ultrametric distance if it satisfies the following condition, which is stronger than the triangular inequality:
4. d(x, y) <= max{d(x, z), d(y, z)}.
2.2.2 Equivalence between an Ultrametric Distance and an Indexed Hierarchy.
It is equivalent to provide a finite set E with an ultrametric distance or to define an indexed hierarchy of parts of this set.
Let us show first that any indexed hierarchy makes it possible to define a distance between elements having the required properties.
For the distance d(x, y), we take the value of the index corresponding to the smallest part that contains both x and y.
Let us show in a general fashion that we always have, with such a distance:
d(x, y) <= max{d(x, z), d(y, z)}
Recall that two parts of hierarchy H are either disjoint or linked by an inclusion relationship. Let us call h(x, z) the smallest part of H that contains both x and z (whose index is therefore d(x, z)).
Since h(x, z) and h(y, z) are not disjoint, we have, for example, h(x, z) ⊆ h(y, z). And since x, y, and z are all contained in h(y, z), we necessarily have h(x, y) ⊆ h(y, z); consequently
d(x, y) <= d(y, z), which establishes the inequality.
Conversely, to every ultrametric distance d there can correspond an indexed hierarchy of which d is the associated index.
We just have to apply the single linkage algorithm to the corresponding distance matrix.
It is then seen that recomputing the distances at each step is unnecessary: it suffices to delete one of the two aggregated points.
For example, if x and y are aggregated in t, the distances to the new point t should be calculated. But, for any z, we necessarily have d(z, x) > d(x, y) and d(z, y) > d(x, y); otherwise (z, x) or (z, y) would have been aggregated instead of (x, y).
For an ultrametric distance, this implies that d(z, x) = d(z, y), which is expressed by saying that for an ultrametric distance all triangles are isosceles, the base being the smallest side.
Let us prove this last result. For the triangle (x, y, z), the ultrametric inequality gives
d(z, x) <= max{d(z, y), d(x, y)} = d(z, y),
since d(x, y) <= d(z, y); therefore
d(z, x) <= d(z, y).
Similarly,
d(z, y) <= max{d(z, x), d(x, y)} = d(z, x),
therefore
d(z, y) <= d(z, x).
It follows that d(z, y) = d(z, x).
The computation of the distances from z to t is unnecessary because the two distances under consideration are equal.
2.2.3 Subdominant Ultrametric Distance.
We have shifted from a metric distance to an ultrametric distance (that is, equivalently, to a hierarchy), by decreasing the values of certain distances. We can ask the following question:
Does there exist an ultrametric distance that is nearer (in a sense yet to be made precise) to the original metric distance?
We can give the following partial answer. We say that a metric distance d1 is less than a metric distance d2 if, for every x and y,
d1(x, y) <= d2(x, y).
(This definition makes it possible to assign a partial order relationship to the set of metric distances over a set E.)
The greatest ultrametric distance that is less than a metric distance d, in the preceding sense, is called the maximum lower ultrametric distance or subdominant ultrametric distance. This is precisely the one given by the single linkage algorithm.
2.2.4 Minimum Spanning Tree.
Introduction and Definition.
The set of n points to be classified may be considered as a set of points in a space. This representation is classical if the objects are described by a series of p parameters: we have n points in the space R^p. Then we can calculate a distance for every pair of points.
More generally, if only the values of an index of dissimilarity are available (these do not necessarily have the properties of a distance), then we can represent the objects by points (of a plane for example), where every pair of objects is joined by a continuous line, to which is assigned the value of the index of dissimilarity. Thus the set of objects and the values of the index are represented by a complete valued graph. The nodes (or vertices) are the objects, and the values attached to the edges are the values of the dissimilarity index. But, if there are more than a few objects, this type of representation becomes messy. Then we attempt to extract a partial graph from this graph (with the same nodes and fewer edges); this partial graph is easier to represent, and permits us to summarize the values of the index.
Among partial graphs, those having a tree structure are particularly interesting, because they can be shown in two dimensions. (This type of tree is a tree in the sense of the theory of graphs; its nodes are the objects to classify. It is not to be confused with a tree made up of parts of a set, resulting from a classification technique.)
A tree is connected (there is a path linking every pair of nodes) and without a cycle (a cycle is a path leaving from and arriving at the same node without passing through the same edge twice). We can define, in an equivalent manner, a tree with n nodes either as a graph without a cycle having (n - 1) edges, or as a connected graph having (n - 1) edges. The length of a tree is the sum of the "lengths" (values of the index) of its edges.
Among all the partial graphs that are trees, the minimum spanning tree has long held the attention of statisticians because of its excellent descriptive qualities, stemming from its relationship to hierarchical classification. If, for example, we wish to rapidly detect, without a computer, the structural features of a correlation matrix of 30 variables, it is probably the easiest procedure to use.
First, we present the algorithms for finding a minimum spanning tree, then we show the equivalence with a single linkage classification. We suppose that all the edges of the graphs have different lengths (values of the index or of the distance), because under these circumstances the tree is unique and this simplifies the presentation of the algorithms.
Minimum Spanning Tree. Kruskal's Algorithm (1956).
The n(n - 1)/2 edges are arranged in order of increasing values of the index. Starting with the first two edges, the edges that do not form a cycle with the edges already chosen are selected one by one. The procedure stops when n - 1 edges have been selected. In this way we are certain to have obtained a tree (a graph without a cycle, having n - 1 edges).
Minimum Spanning Tree. Prim's Algorithm (1957).
We begin with any node of the graph. Step 1 consists of finding the nearest node, that is, the shortest edge. Step k consists of joining to the existing set of edges V_(k-1) the shortest edge u_k that touches one of the vertices of V_(k-1) and that does not form a cycle with the edges of V_(k-1). The tree obtained is of minimal length because V_k is at all times a tree of minimal length over the nodes it spans.
Minimum Spanning Tree. Florek's Algorithm (1951).
At the first step, each node is joined to its nearest neighbour. This is equivalent to taking the smallest distance in each row of the distance matrix. This rapid operation produces more than n/2 edges (or more than (n - 1)/2 if n is odd). It gives a forest F_1 (a family of trees, or simply a graph without a cycle). At step k, each tree of forest F_(k-1) (each connected component of the graph without a cycle) is joined to its nearest neighbour, taking as the distance between two trees the shortest edge between any node of one and any node of the other. The process stops as soon as the graph F_k is connected. This algorithm is the fastest one to compute manually on rather large distance matrices; generally only two or three steps are needed.
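The following sketch follows Prim's scheme on a complete graph given by its distance matrix (Python with NumPy; the function name prim_mst is an illustrative choice): at each step, the shortest edge touching the tree built so far is added.

```python
import numpy as np

def prim_mst(D):
    """D: (n, n) matrix of pairwise distances. Returns the n-1 edges of the MST."""
    n = D.shape[0]
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True                      # begin with any node of the graph
    best_dist = D[0].copy()                # shortest known distance to the tree
    best_from = np.zeros(n, dtype=int)     # tree node realizing that distance
    edges = []
    for _ in range(n - 1):
        best_dist[in_tree] = np.inf        # nodes already in the tree are ignored
        k = int(best_dist.argmin())        # nearest node outside the tree
        edges.append((int(best_from[k]), k, float(D[best_from[k], k])))
        in_tree[k] = True
        closer = D[k] < best_dist          # update distances via the new node
        best_from[closer] = k
        best_dist[closer] = D[k][closer]
    return edges
```

The connection with single linkage, discussed in the next subsection, means that the sorted edge lengths of this tree are exactly the aggregation levels of the single linkage hierarchy.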
Relationship between Minimum Spanning Tree and Single Linkage (Gower and Ross, 1969).
Let V be a minimum spanning tree built from a distance matrix between n objects. Since V does not have a cycle, and is connected, there exists one and only one path joining two vertices x and y.
Let us call d_V(x, y) the length of the largest edge encountered on this path. We show that d_V(x, y) equals d*(x, y), the ultrametric distance of the "smallest maximal jump" between x and y (see the appendix below for the properties of the smallest maximal jump).
Let v be the largest edge between x and y. Suppressing v leads to dividing V into two separate connected components. If there exists a path from x to y (that does not necessarily go through the edges of V) whose largest edge is shorter than v, then there exists an edge u that is distinct from v, and shorter than v, joining the two connected components. Replacing v by u would give a tree of shorter length than that of V, which contradicts the definition of V.
Thus d_V(x, y), the length of v, is indeed the smallest maximal jump. This argument provides a way of building the hierarchy associated with the single linkage starting from the minimum spanning tree V. This descending construction works in the following manner. We break the largest edge of V; we thus obtain the two groups that are farthest apart, and the index corresponding to their fusion is the length of this edge. Then we break the edges in succession, in order of decreasing magnitude, descending along the hierarchy until the terminal elements, the objects themselves, are reached. The last broken edge corresponds to the two objects that were aggregated first in the ascending algorithm.
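A minimal sketch of this correspondence is given below (Python; the function name hierarchy_from_mst is ours). Processing the edges of the minimum spanning tree in increasing order of length, and merging the connected components they join, reproduces the single linkage aggregations; it is the ascending counterpart of the descending construction just described.

```python
def hierarchy_from_mst(n, mst_edges):
    """mst_edges: list of (i, j, length). Returns the single linkage aggregations."""
    parent = list(range(n))                    # union-find over the n objects

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]      # path compression
            i = parent[i]
        return i

    merges = []
    for i, j, length in sorted(mst_edges, key=lambda e: e[2]):
        ri, rj = find(i), find(j)
        merges.append((ri, rj, length))        # fusion of two connected components
        parent[rj] = ri
    return merges
```

Combined with the prim_mst sketch above, this gives a compact way of obtaining the single linkage hierarchy of a moderately sized distance matrix.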
We can simultaneously represent the hierarchy and the minimum spanning tree in perspective: some supplementary information is then added to the classical dendrogram. Specifically, the relative positions of the points are better represented. For the practitioner of principal axes methods, it is often interesting to place the minimal length tree on the factorial planes, in order to remedy the deformations associated with projections.
Example of application:
We use the same semiometric data as in section 1.2.2 of this chapter.
Figure 6 shows the principal plane from the PCA of the 70 variables (words), with the minimum spanning tree computed in the 2-dimensional space. Using Florek's algorithm, such a tree could be drawn (or at least sketched) manually, since the distances used are the usual geometric distances in the plane.
Figure 7 shows, in the same plane, the minimum spanning tree computed with the distances in the space of the first six principal axes.
Figure 6. Minimum Spanning Tree computed in the plane (axis 1, axis 2)
Figure 7. Drawing in the plane (Axis 1, Axis 2) of the Minimum Spanning Tree computed in the 6-dimensional space (axis 1 --> axis 6)
The distortion of distances after projection onto the plane is revealed by the MST: the two distinct branches (danger, gun, wall, armour) and (rigid, space, to_break), for instance, are far apart, and the apparent closeness of “gun” and “space” is an unwanted “shrinking” effect of the PCA.
2.3 Minimum Variance Algorithms and Related Techniques (see: Ward, 1963)
2.3.1 Introduction and Notation.
The classification techniques discussed in the previous section have the advantage of being simple to compute, and they also have interesting mathematical properties.
For some applications the results can, however, be criticized. Specifically, single linkage has the defect of producing "chain effects."
Other techniques for calculating distances may give more reliable results. We have mentioned the group average distance. The techniques of minimum variance clustering seek to optimize, at every step, the partition obtained by aggregating two elements, using criteria that are linked to variance. These techniques are particularly easy to compute when the clustering is performed after a principal axes analysis, and the objects to classify are located by their coordinates on the first axes of the analysis.
Here we consider the n objects to classify as n points in a Euclidean space with p dimensions. Each point x_i (a vector with p components) has a mass m_i. The total mass of the set of points is noted
m = Σ_i m_i.
The square of the distance between the points x_i and x_j is noted d^2(x_i, x_j).
The total variance of the set of points is the quantity
I = Σ_i m_i d^2(x_i, G),
where G is the center of gravity of the points.
If there exists a partition of the set of objects into Q clusters, the qth cluster having a center of gravity g_q and a mass m_q, Huyghens' equation provides a decomposition of the quantity I into within-cluster and between-cluster variances according to the formula
I = Σ_q m_q d^2(g_q, G) + Σ_q Σ_{i ∈ q} m_i d^2(x_i, g_q).    [1]
2.3.2 Reduction in Variance by Aggregation of Two Elements (Ward's Method)
Let x_i and x_i' be two elements of masses m_i and m_i' that are aggregated into one element x of mass m_i + m_i', with
x = (m_i x_i + m_i' x_i') / (m_i + m_i').
We can decompose the variance I_ii' of x_i and x_i' with respect to G according to Huyghens' equation:
I_ii' = m_i d^2(x_i, G) + m_i' d^2(x_i', G) = m_i d^2(x_i, x) + m_i' d^2(x_i', x) + (m_i + m_i') d^2(x, G).
Only the last term remains if x_i and x_i' are replaced by their center of gravity x. The reduction in variance ΔI_ii' is therefore
ΔI_ii' = m_i d^2(x_i, x) + m_i' d^2(x_i', x).
By replacing x with its value as a function of x_i and x_i', we obtain
ΔI_ii' = [m_i m_i' / (m_i + m_i')] d^2(x_i, x_i').
Thus the strategy of clustering based on the minimum variance criterion is as follows: instead of finding the two closest elements, we find the elements x_i and x_i' corresponding to a minimal ΔI_ii', which is the same as taking the ΔI_ii' as new dissimilarity indices. (By means of this transformation, points with small weights are more readily clustered.)
If we work with the coordinates of the points, we calculate the centers of gravity (x for x_i and x_i'). On the other hand, if we work with the distances, it is convenient to be able to calculate the new distances from the old ones (as was the case in the preceding techniques). The squared distance between a point z and the cluster center x is written, as a function of the distances to x_i and x_i',
d^2(z, x) = [m_i d^2(z, x_i) + m_i' d^2(z, x_i') - (m_i m_i' / (m_i + m_i')) d^2(x_i, x_i')] / (m_i + m_i').
There are variants of this method (minimal variance loss) that use slightly different computational formulae. For example, we can look for clusters having a minimal internal variance. These techniques are discussed in Benzecri (1973).
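The two formulas above can be checked with the following small sketch (plain Python; the function names delta_I and d2_to_merged are ours): the first gives the loss of variance when two weighted points are merged, the second the squared distance from a third point to their center of gravity, both computed only from masses and pairwise squared distances.

```python
def delta_I(m_i, m_ip, d2_ii):
    """Loss of variance when x_i (mass m_i) and x_i' (mass m_ip) are merged."""
    return (m_i * m_ip) / (m_i + m_ip) * d2_ii

def d2_to_merged(m_i, m_ip, d2_zi, d2_zip, d2_ii):
    """Squared distance from z to the center of gravity of {x_i, x_i'}."""
    m = m_i + m_ip
    return (m_i * d2_zi + m_ip * d2_zip) / m - (m_i * m_ip) / m**2 * d2_ii

# sanity check: two unit masses at x_i = 0 and x_i' = 2, and z = 1 (the midpoint):
# delta_I(1, 1, 4) = 2.0, and d2_to_merged(1, 1, 1, 1, 4) = 0.0, as expected.
```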
2.4 Reciprocal Neighbours (Chain Search Algorithm)
The main difficulty in building a hierarchical tree is the volume of computation. The preceding basic algorithms, in which the nodes are constructed one by one at each step, require a number of calculations and of distance comparisons on the order of n^3 if there are n points to classify.
Newer algorithms greatly reduce the number of operations. The volume of calculations can decrease from n^3 to n^2, allowing the classification of several thousand points in a reasonable amount of time. They utilize the concept of reciprocal neighbours (reciprocal pairs) introduced by McQuitty (1966). Two points (or groups) a and b are reciprocal neighbours if a is the nearest neighbour of b and b is the nearest neighbour of a. At each step of the basic algorithm, instead of aggregating only the two nearest neighbours, as many new nodes are created as there are pairs of reciprocal neighbours. At the final step, all of the points are combined into one group, and the tree is completed. The problem is then reduced to an efficient search for reciprocal neighbours. We describe the "chain search" algorithm (see Benzecri, 1982).
Algorithm
STEP 1. We begin with an element called x_1. We create a chain of successive elements
x_1, x_2, ..., x_(i-1), x_i, ...
such that for all i, x_i is a nearest neighbour of x_(i-1). Such a chain necessarily stops at an element x_k when x_(k-1) is also the nearest neighbour of x_k. Then x_(k-1) and x_k are reciprocal neighbours.
They are aggregated to form a node.
STEP 2. If k = 2, that is if the chain started with an element that has a reciprocal neighbour, we choose a new element from which a new chain is constructed that stops on new reciprocal neighbours, whose aggregation results in a new node.
STEP 3. If k > 2, the search for reciprocal neighbours is continued by extending the chain starting from x_(k-2). The algorithm stops when n - 1 nodes have been created.
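A hedged sketch of the chain search is given below (Python with NumPy; the function name nn_chain is ours). For simplicity, the distances are updated with the single linkage rule, which satisfies the no-inversion condition discussed in the remarks that follow; Ward's rule of section 2.3 could be substituted.

```python
import numpy as np

def nn_chain(D):
    """D: (n, n) dissimilarity matrix. Returns the n-1 aggregations."""
    D = D.astype(float).copy()
    np.fill_diagonal(D, np.inf)
    alive = set(range(D.shape[0]))
    nodes, chain = [], []
    while len(alive) > 1:
        if not chain:
            chain = [next(iter(alive))]            # start a chain anywhere
        x = chain[-1]
        # nearest alive neighbour of the last element of the chain
        y = min((j for j in alive if j != x), key=lambda j: D[x, j])
        if len(chain) > 1 and y == chain[-2]:
            # x and y are reciprocal neighbours: aggregate them into a node
            nodes.append((x, y, float(D[x, y])))
            D[x, :] = np.minimum(D[x, :], D[y, :]) # single linkage update
            D[:, x] = D[x, :]
            D[x, x] = np.inf
            alive.remove(y)
            chain = chain[:-2]                     # resume the chain from x_(k-2)
        else:
            chain.append(y)                        # extend the chain
    return nodes
```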
Remarks
(a) It has been established that the maximum cost of the chain search algorithm is a·n^2 (where a is a coefficient independent of n), regardless of the configuration of the n points.
(b) In order to be able to use the chain search algorithm, the chain must be extensible beyond x_(k-2) when the reciprocal neighbours x_(k-1) and x_k have been aggregated. It is thus necessary that this aggregation should not destroy the nearest neighbour relationships that previously existed between x_(i-1) and x_i (i = 2, 3, ..., k-2). This property is assured if the aggregation rule used to construct the tree does not create an inversion.
There is no inversion if the node g[a; b] created by aggregating a and b cannot be nearer to some element c than a and b themselves are.
This condition is written:
if d(a, b) <= inf{d(a, c), d(b, c)}, then inf{d(a, c), d(b, c)} <= d(g[a; b], c)
This property is verified by several aggregation criteria:
1. Single linkage:
d(a, b) = inf{d(u, v) : u ∈ a, v ∈ b}.
2. Complete linkage:
d(a, b) = sup{d(u, v) : u ∈ a, v ∈ b}.
3. Group average:
d(a, b) = (1 / (n_a n_b)) Σ_{u ∈ a} Σ_{v ∈ b} d(u, v),
where n_a and n_b are the numbers of elements of a and b.
4. Ward's method:
d(a, b) = [m_a m_b / (m_a + m_b)] d^2(g_a, g_b),
where g_a and g_b are the centers of gravity, and m_a and m_b the masses, of the groups a and b.
3. Hybrid Clustering for Large Data Sets
The idea of combining the approaches of section 1 (finding partitions) and section 2 (hierarchical clustering) is not a new one: it is found, for example, under the name of "hybrid clustering" in Wong (1982). When a large number of objects are to be classified, the following classification strategy is suggested. The first step is to obtain, at a low cost, a partition of the n objects into k homogeneous groups, where k is far greater than the "real" or desired number of groups in the population. The second step is an ascending hierarchical classification, where the terminal elements of the tree are the k groups of the partition. The third step is to visually inspect the tree and cut it at the most appropriate level. The final step consists in a "consolidation" of the partition resulting from the previous step, using a reassignment rule similar to that used in the moving centers algorithm. The "definitive" partition of the n individuals into k0 groups (k0 < k < n) is thus obtained.
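As an illustration only, this strategy can be sketched with standard libraries (scikit-learn and SciPy are assumed here; the parameter values and the function name hybrid_clustering are arbitrary choices): a preliminary k-means partition into many small groups, a Ward hierarchy built on the group centers, a cut of the tree, and a k-means consolidation.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

def hybrid_clustering(X, k=50, k0=4, seed=0):
    # 1. preliminary partition into k small homogeneous groups
    pre = KMeans(n_clusters=k, n_init=5, random_state=seed).fit(X)
    # 2. Ward hierarchy built on the k group centers (note: SciPy's 'ward' treats
    #    the centers as unweighted points, a simplification of the weighted
    #    aggregation described in section 3.2)
    tree = linkage(pre.cluster_centers_, method="ward")
    # 3. cut of the dendrogram into k0 classes, mapped back to the individuals
    group_class = fcluster(tree, t=k0, criterion="maxclust")
    labels = group_class[pre.labels_] - 1
    # 4. consolidation: k-means iterations started from the class centroids
    start = np.vstack([X[labels == c].mean(axis=0) for c in range(k0)])
    return KMeans(n_clusters=k0, init=start, n_init=1).fit(X).labels_
```

In practice, steps 1 and 2 are performed on the principal coordinates of the individuals, as explained in section 3.1 below.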
3.1 Preliminary Partition
The purpose is to obtain rapidly a large number of small groups that are very homogeneous. We use the partition defined by the stable groups obtained from crosstabulating a number of base partitions (see above). Each base partition is calculated by using the algorithm of moving centers after reading the data directly, so as to minimize the use of central memory. The calculations are generally performed on the coordinates of the individuals on the first significant principal axes of a principal axes analysis (PCA, CA, MCA).
3.2 Hierarchical Aggregation of the Stable Groups
Some of the stable groups can be very close to one another, corresponding to a group that is artificially cut by the preceding step. On the other hand, the procedure generally creates several small groups, sometimes containing only one element. The goal of the hierarchical aggregation phase is to reconstitute the groups that have been fragmented, and to aggregate the apparently dispersed elements around their original centers. The tree is built according to Ward's aggregation criterion, which has the advantage of accounting for the size of the elements to classify (as a weight in the calculations of the loss of variance through aggregation).
3.3 Choosing the number of classes
The number of classes in the final partition of the population is defined by cutting the dendrogram. Choosing the level of the cut, and thus the number of classes, can be done by looking at the dendrogram : the cut has to be made above the low aggregations, which bring together the elements that are very close to one another, and under the high aggregations, which lump together all of the various groups in the population. In practice, the situation is often less clearly defined (except when "real," well defined groups do exist in the population). The user may vacillate between two or three possible cutting levels.
3.4 Final partition, and description of classes
A reallocation (using an iterative procedure similar to the k-means method) of the individuals or objects is then necessary in order to improve the quality of the partition. This is due to the fact that a cut of a dendrogram cannot provide a locally optimal partition (building a series of nested partitions introduces some undesirable constraints).
Each group is then described with the help of the continuous and nominal variables of the original data set. The continuous variables can be ranked by comparing their means within the group to their overall means. For example, to rank the importance of a variable x in class j, one can compare its actual mean in the class with the expected value of that mean, assuming that the individuals were allocated to class j at random (without replacement).
The categories of the nominal variables can be ordered in an analogous way. If n_j(m) individuals have category m among the n_j individuals of class j, we may calculate a criterion analogous to a t-value from the probability (under the hypergeometric distribution of N) of having N >= n_j(m) when the n_j individuals of the group are drawn at random (without replacement) from the population of n individuals.
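A hedged sketch of this computation, using SciPy's hypergeometric distribution (the function name category_p_value is ours):

```python
from scipy.stats import hypergeom

def category_p_value(n, n_m, n_j, n_jm):
    """n: population size, n_m: individuals with category m in the population,
    n_j: size of class j, n_jm: individuals of class j having category m."""
    # P(N >= n_jm) when the n_j members of the class are drawn at random,
    # without replacement, from the n individuals
    return hypergeom.sf(n_jm - 1, n, n_m, n_j)
```

A small probability indicates that category m is over-represented in class j, and its transformation into a normal quantile provides the t-like criterion mentioned above.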
3.5 Comments on Classification Strategy
Classifying a large data set is a complex task. It is difficult to find a method (or an algorithm) that will alone surely lead to an optimal result. The proposed strategy, which is not entirely automatic and which requires several control parameters, allows us to retain control over the classification process. Accelerated nearest-neighbour algorithms for hierarchical classification allow us to build a tree directly on the entire population; however, these algorithms cannot read the data matrix sequentially: the data or the principal coordinates must be stored in central memory, which can be a huge matrix even for as few as 15 or 20 principal axes. Besides working with direct reading, the partitioning algorithm has another advantage: the criterion of homogeneity of the groups is better satisfied by finding an optimal partition than in the more constrained case of finding an optimal family of nested partitions (hierarchical tree).
3.6 An example of clustering 27 countries described by 12 socio-economic indicators
The data set is an excerpt of the data of the example relating to 100 countries (name of the folder: “Data_100_Countries”). A description of the data can be found in that folder.
Table 1 describes the agglomerative process (the first and the second elements are agglomerated into the node “num” – first column).
Table 2 is a sketch of the dendrogram.
Table 3 shows the process of optimizing the cut of the dendrogram into 4 clusters.
Table 4 gives the content of each cluster.
The visualization menu of DtmVic allows for further descriptions of the clusters in the principal planes of a PCA of the data.
Table 1. Agglomeration of 27 countries: the nodes are numbered from 28 to 53.
Hierarchical classification : describing the nodes
----------------------------------------------------------------------------------
num. first seco size weight indices histogram of indices
28 22 17 2 2.00 .02576 *
29 11 1 2 2.00 .03222 *
30 7 12 2 2.00 .03650 *
31 21 18 2 2.00 .03859 *
32 16 24 2 2.00 .03871 *
33 13 2 2 2.00 .05512 **
34 32 15 3 3.00 .05978 **
35 10 30 3 3.00 .06569 **
36 31 28 4 4.00 .06883 **
37 23 34 4 4.00 .10433 **
38 25 9 2 2.00 .12001 ***
39 27 29 3 3.00 .13312 ***
40 6 4 2 2.00 .14976 ***
41 36 20 5 5.00 .15064 ***
42 35 33 5 5.00 .18241 ****
43 38 19 3 3.00 .19943 ****
44 3 42 6 6.00 .20974 ****
45 37 41 9 9.00 .31404 ******
46 40 39 5 5.00 .40121 ********
47 8 26 2 2.00 .60615 ***********
48 45 5 10 10.00 .65997 ************
49 44 46 11 11.00 .77374 ***************
50 43 48 13 13.00 .94946 ******************
51 47 49 13 13.00 1.04155 *******************
52 14 51 14 14.00 1.17440 **********************
53 50 52 27 27.00 4.40884 **************//*******************
sum of indices = 12.00000
Table 2. Agglomeration of 27 countries: sketch of the dendrogram.
rank ind. iden dendrogram (indices as percentages of sum of indices :
12.00000 min = .21% / max = 36.74%)
1 .27 Albania --+
2 1.11 China --*-+
3 3.34 Viet_Nam ----*-----+
4 1.25 Bangladesh ----+ !
5 6.45 Bolivia ----*-----*--------+
6 .46 Algeria --+ !
7 1.52 Costa_Rica --*--+ !
8 .30 Colombia --+ ! !
9 .55 Brazil --* ! !
10 1.75 Chile --*--*+ !
11 8.68 Argentina ------*------------*-----+
12 5.05 Venezuela ---------------+ !
13 9.79 Bulgaria ---------------*---------*--+
14 36.74 Cote_d_Ivoir ----------------------------*--------//---------------+
15 5.50 Belgium ----------------+ !
16 1.26 Poland ----+ ! !
17 .21 Greece --+ ! ! !
18 .57 Spain --*+! ! !
19 .32 Italy --+!! ! !
20 2.62 Portugal --***---+ ! !
21 .50 France --+ ! ! !
22 .32 UK --* ! ! !
23 .87 Germany --*+ ! ! !
24 7.91 Switzerland ---*----*-------*------+ !
25 1.66 Norway ------+ ! !
26 1.00 Canada ----+ ! ! !
27 ----- USA ----*-*----------------*--------------//--------------*
Table 3. Cut of the dendrogram. Improving the partition.
cut a of the tree into 4 classes
===================================================================================
Brief description
--------------------
+--------+----------+-----------+-----------+
! class ! base ! weight ! content !
+--------+----------+-----------+-----------+
! aa1a ! 11 ! 11.00 ! 1 a 11 !
! aa2a ! 2 ! 2.00 ! 12 a 13 !
! aa3a ! 1 ! 1.00 ! 14 a 14 !
! aa4a ! 13 ! 13.00 ! 15 a 27 !
+--------+----------+-----------+-----------+
consolidating the partition around the 4 clusters centroids
through 10 k-means-like iterations
----------------------------------------------------------------------------------
Increase of inter-classes variance
--------------------------------------
+-----------+----------+----------+----------+
! iteration ! v.total ! v.inter ! ratio !
+-----------+----------+----------+----------+
! 0 !12.000002 ! 6.624787 ! .5521 !
! 1 !12.000002 ! 6.702548 ! .5585 !
! 2 !12.000002 ! 6.702548 ! .5585 !
! 3 !12.000002 ! 6.702548 ! .5585 !
+-----------+----------+----------+----------+
stop after iteration 3 : Increase of inter-classes variance
during the previous iteraion is only .000 %.
Table 4. Content of the 4 final clusters.
content of : cut a of the tree into 4 classes
----- class 1 / 4 ------------------------------------------------------
Albania Algeria Argentina Bangladesh Bolivia Brazil Chile
China Colombia Costa_Rica Poland Viet_Nam
----- class 2 / 4 ------------------------------------------------------
Bulgaria Venezuela
----- class 3 / 4 ------------------------------------------------------
Cote_d_Ivoire
----- class 4 / 4 ------------------------------------------------------
Belgium Canada France Germany Greece Italy Norway
Portugal Spain Switzerland UK USA
APPENDIX (chapter 6):
EQUIVALENCE BETWEEN SINGLE
LINKAGE AND SUBDOMINANT ULTRAMETRIC
To demonstrate this equivalence
1. We define, starting with a distance d, a new distance called the distance of the smallest maximum jump.
2. We then show that this distance is an ultrametric distance.
3. Next we show that this ultrametric distance is the subdominant one.
4. Finally, we show that this distance corresponds to the ultrametric distance given by the single linkage algorithm.
Definition of the Distance of the Smallest Maximum Jump
Let there be a set E having a distance d.
Let x and y be two points contained in E.
The pair (x, y) is called an edge, of length d(x, y), of the complete graph whose nodes are the elements of E (the name complete graph comes from the fact that every pair of nodes is joined by an edge).
Still using the terminology of graph theory, we call a path from x to y a series of edges of the type (x, t_1), (t_1, t_2), ..., (t_(k-1), t_k), (t_k, y), where t_1, ..., t_k are elements of E.
Given a path from x to y, the maximum jump is the length of the longest edge of the path from x to y.
To every path joining x and y, there corresponds a maximum jump. Since the set of nodes is finite, there exists a smallest maximum jump over the set of paths from x to y; we denote it by d*(x, y).
Demonstration that d * is an Ultrametric Distance
The smallest maximum jump (SMJ) between x and y is an ultrametric distance.
It is clear that the first two axioms of a distance are verified by d*.
To verify that this distance is ultrametric, let us consider three points x, y, and z. The smallest maximum jump from x to y, passing through z, is max{d*(x, z), d*(z, y)}.
The smallest maximum jump from x to y without the constraint of passing through z can only be less than or equal to this quantity; thus
d*(x, y) <= max{d*(x, z), d*(y, z)}
and d* is then shown to be an ultrametric distance.
Demonstration that d* is the Subdominant
To show that d* is the subdominant, we show that d* is smaller than d and greater than every ultrametric distance that is less than d.
First of all, it is clear that edge (x, y) is a particular path going from x to y;
therefore d*(x, y) <= d(x, y), and d* is less than d.
Let d_1 be an ultrametric distance that is smaller than d. For every triplet x_1, x_2, x_3 we obviously have
d_1(x_1, x_3) <= max{d_1(x_1, x_2), d_1(x_2, x_3)}.
By applying this inequality successively along a path (x_1, x_2, x_3, ..., x_p),
we obtain
d_1(x_1, x_p) <= max{d_1(x_j, x_(j+1)) : j < p}.
Since d_1 <= d, we have
d_1(x_1, x_p) <= max{d(x_j, x_(j+1)) : j < p}.
This inequality is valid for every path joining x_1 to x_p. For at least one of them, we have, by definition of d*,
max{d(x_j, x_(j+1)) : j < p} = d*(x_1, x_p).
This last equation demonstrates the required inequality.
Equivalence of d* and d_u
It remains to demonstrate that the ultrametric distance d_u given by the single linkage algorithm is the distance d* of the smallest maximal jump.
Let d_u(x, y) be the value of the distance at the step where points x and y are joined for the first time. Previously these two points were in different clusters (or constituted clusters by themselves). The method of calculating distances at every aggregation ensures that d_u(x, y) is the smallest distance between two points belonging to different clusters.
After the aggregation, the distances within a cluster are smaller than d_u(x, y), and the distances involving points that do not belong to the two clusters are greater, because they will be aggregated at a later step. The paths joining x and y therefore have edges inside the two clusters, whose lengths are less than d_u(x, y), and edges outside the clusters, whose lengths are necessarily greater than or equal to d_u(x, y). Thus d_u(x, y) is the smallest maximum jump d*(x, y).
End of Chapter 6. Clustering Techniques.