Three contributions exemplifying the use of Bootstrap in the field of textual data

 

 

List of contributions:

  1) Validation Techniques in Text Mining (with application to the processing of open-ended questions)

  2) Which bootstrap for principal axes methods?

  3) Visualization, validation and seriation. (Application to a corpus of medieval texts)



 

 

 

 

 

1. Validation Techniques in Text Mining    

       (with application to the processing of open-ended questions)


 

Draft of the original article:

Lebart L. (2003) Validation Techniques in Text Mining, in: Text Mining and its Applications, Spiros Sirmakessis, Editor, Springer. 169-178.

 

 

 

Abstract. Clustering methods and principal axes techniques play a major role in the computerized exploration of textual corpora. However, most of the outputs of these unsupervised procedures are difficult to assess. We will focus on the two following issues: external validation, involving external data and allowing for classical statistical tests; internal validation, based on re-sampling techniques such as the bootstrap and other Monte Carlo methods. In the domain of textual data, these techniques can efficiently tackle the difficult problem of the plurality of statistical units (words, lemmas, segments, sentences, respondents).

 

1   Introduction

 

The amount of information available only in unstructured textual form is rapidly increasing. Clustering methods and principal axes techniques play a major role in the computerized exploration of such corpora. Clustering methods comprise unsupervised classification techniques such as hierarchical clustering techniques (cf., e.g., Gordon, 1987), techniques of partitioning such as k-means or k-medoids (MacQueen, 1967; Kaufman and Rousseeuw, 1990), and self-organizing maps (Kohonen, 1989). Principal axes techniques designate various methods comprising at the outset a Singular Value Decomposition (Eckart and Young, 1936), such as principal components analysis (Hotelling, 1933), two-way and multiple correspondence analysis (Hayashi, 1956; Benzécri, 1973), etc.

These methods provide visualizations and/or groupings of items (responses in marketing and socio-economic surveys, web pages/frames/sites, scientific abstracts, patents, broadcast news, financial reports, literary texts, etc.) highlighting associations and patterns.

They help to devise decision aids for attributing a text to an author or a period; choosing a document within a database; coding information expressed in natural language (techniques of Information Retrieval and Latent Semantic Indexing, cf., e.g.: Berry et al., 1999).

They also help to achieve some technical contributions, such as disambiguation, parsing, visualization of semantic graphs, and optical character and speech recognition.

However, the outputs of these procedures are often difficult to assess. We will focus here on the two following issues, mainly in the context of principal axes techniques.

- External validation, involving external data or meta-data, and allowing for some classical statistical tests, including cross-validation procedures in the scope of supervised learning.

- Internal validation, based on re-sampling techniques such as bootstrap and other Monte Carlo methods.

Other issues related more specifically to clustering techniques, such as the determination of the number of clusters (see, e.g.: Milligan and Cooper, 1985; Hardy, 1994) are not addressed here.

 

2   External validation

 

External validation is the standard procedure in the case of supervised learning models. The data serve both to estimate the parameters of the model (learning phase) and to assess the model (generalisation), using, in most cases, cross-validation methods. But external validation can also be used in the unsupervised context in the two following practical circumstances:

 

a)  when the data set comprises at least two comparable parts (or can be artificially split into two parts), one part being used to estimate the model (generally a finite mixture probabilistic model, in the context of clustering), the other part(s) serving to check the adequacy of that model. 

 

b) when some meta-data or external information are available to complement the description of the elements to be clustered.

 

2.1 External validation in the context of clustering

 

The situation (a) above corresponds to external validation in the sense of Gordon (1997, 1998) or Bock (1985, 1996). For these authors, the null hypothesis (a model assuming an a priori absence of clusters) is immaterial in most practical situations. External validation then has a more restrictive meaning in the statistical literature than in Data Mining (see, e.g., Halkidi et al., 2001).

In fact, in the context of text mining, clustering techniques are often used to dissect data rather than to cluster data (on the distinction between clustering and dissection, see, e.g., the classical paper of Cormack, 1971). The term dissection designates the result of a clustering algorithm in a situation in which no clear-cut partition exists beforehand: the aim is to produce homogeneous areas in the high-dimensional space of the data, instead of discovering existing clusters. A dissection can be regarded as a possible multivariate generalization of a histogram, in which a continuous univariate distribution is segmented (sometimes arbitrarily) only for the sake of description.

The external information can be used, in a second phase, to systematically describe these areas, i.e. the classes of the obtained partition (situation (b) above). We will assume that such external information has the form of supplementary (or illustrative) elements: extra rows or columns of the data table that are utilised afterwards (Benzécri, 1973; Lebart et al., 1984). Describing (or characterizing) a partition by a set of variables that did not participate in building the partition involves several classical statistical tests: in the case of a continuous supplementary variable, one can use simple comparisons of means between classes and/or a global analysis of variance; in the case of a categorical supplementary variable, one can use frequency comparisons (between the global frequency of a category and its frequency within a class) and/or a more global chi-square test. A stepwise discriminant analysis can also be performed if the user needs a simultaneous use of a set of supplementary continuous variables.
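As an illustration of the frequency comparison just described, the following Python sketch computes a normal-approximation test statistic for the over-representation of a category within one class of a partition, under the hypergeometric null model (class members drawn at random without replacement). The function name and the numeric figures are hypothetical, chosen only for the example.

```python
import math

def class_test_value(n_total, n_cat, n_class, n_cat_in_class):
    """Normal-approximation test statistic comparing the frequency of a
    category inside one class with its global frequency, under the null
    that the class members are drawn at random without replacement
    (hypergeometric model)."""
    p = n_cat / n_total
    # hypergeometric expectation and variance of the count in the class
    expected = n_class * p
    var = n_class * p * (1 - p) * (n_total - n_class) / (n_total - 1)
    return (n_cat_in_class - expected) / math.sqrt(var)

# hypothetical figures: 1043 respondents, 200 in the class under study,
# a category held by 300 respondents overall, 90 of them in the class
t = class_test_value(1043, 300, 200, 90)
print(round(t, 2))  # a value above 2 flags a significant over-representation
```

As with the test-values discussed below, the statistic can be converted into a p-value, and the categories sorted by the strength of their link with the class.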

 

As a frequent software output, all these significant supplementary variables can be sorted afterwards according to the strength of their links with each class and/or with the partition as a whole. Such strengths are aptly described by the p-values provided by the statistical tests: the smaller the p-value, the more significant the link between the corresponding variable and the class. The unavoidable problem of multiple comparisons will be dealt with briefly in section 2.3 below.

 

 2.2 External validation in the context of principal axes methods

 

In data analysis practice, the so-called supplementary (or illustrative) elements are projected afterwards onto the principal visualization planes. For each projection and each principal axis, a “test-value” is computed that converts the coordinate on the axis into a standardized normal variable (under the hypothesis of independence between the supplementary variable and the axis).

The principle of this assessment procedure is the following. Suppose that a supplementary category j contains kj individuals (e.g. respondents or documents). The null hypothesis is that these kj individuals are chosen at random (without replacement) among the k individuals in the analysis. Under these circumstances, on a specific principal axis a, the abscissa xa(j) of the supplementary category (the mean of the standardized coordinates of its kj individuals) is a random variable with mean 0 and standard deviation v(j, kj). This standard deviation is independent of the axis under consideration (Lebart et al., 1984).

The value of v(j, kj) is given by:

v(j, kj) = sqrt[ (k - kj) / ( kj (k - 1) ) ]
Then, ta(j) = xa(j) / v(j, kj) is a standardized random variable (having mean 0 and variance 1). Moreover, ta(j) is asymptotically normally distributed. Thus, using a common approximation, a test-value greater than 2 or less than -2 indicates a significant position of the corresponding category j on axis a (at level 0.05). In fact, test-values and p-values provide equivalent pieces of information.

In an exploratory approach, numerous supplementary elements could be projected, leading to as many test-values.

In some other contexts, external information takes the form of instrumental variables whose effects on the data must be eliminated beforehand (leading, for instance, to analyse partial correlations instead of correlations).

 

2.3 Multiple Comparisons

 

The simultaneous computation of numerous test-values runs into the obstacle of multiple comparisons, a permanent problem in text mining applications. Suppose that the corpus under study is perfectly homogeneous and thus that the hypothesis of independence holds. Under these conditions, on average, out of 100 calculated test-values, 5 are significant at the 5% probability threshold. In fact, such a 5% threshold only makes sense for a single test, not for multiple tests. In other words, the unsuspecting user will always find "something significant" at the 5% level. A practical way to overcome this difficulty is to choose a stricter threshold. In the context of analysis of variance, several procedures have been devised for this purpose. As an example, the Bonferroni method recommends dividing the probability threshold by the number of tests (the number of comparisons in the case of designed experiments). This reduction of the probability threshold is generally considered too severe (Hochberg, 1988; Perneger, 1998). Classical overviews and discussions of multiple comparisons are found in Hsu (1996), Saville (1990), and Westfall and Young (1993).
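A small simulation makes the point concrete: with 100 tests of a true null hypothesis, roughly five p-values fall below 0.05 by chance alone, while the Bonferroni cut-off retains far fewer. The seed and the numbers of tests below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)
n_tests, alpha = 100, 0.05

# p-values of 100 independent tests under a true null: uniform on [0, 1]
p_values = rng.uniform(size=n_tests)

naive = np.sum(p_values < alpha)                 # about 5 expected by chance
bonferroni = np.sum(p_values < alpha / n_tests)  # stricter Bonferroni cut-off
print(int(naive), int(bonferroni))
```

Over many repetitions, the naive rejection rate converges to the nominal 5% even though nothing is "really" significant.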

 

3   Resampling techniques

 

In the context of principal axes techniques (such as Singular Value Decomposition (SVD), principal component analysis (PCA), and two-way (CA) or multi-way (MCA) correspondence analysis), bootstrap resampling techniques (Efron, 1979; Diaconis and Efron, 1983) are used to produce confidence areas on two-dimensional displays. The bootstrap replication scheme allows one to draw confidence ellipses or convex hulls for active elements (i.e. elements participating in building the principal axes) as well as for supplementary categories and supplementary continuous variables.

Several reasons lead to the choice of the bootstrap method for computing the precision of estimates:

 

– the classical analytical approach is both unrealistic and complex,

 

– the bootstrap makes (almost) no assumption about the underlying distributions,

 

– it makes it possible to master every statistical computation for each sample replication, and therefore to deal with parameters computed through the most complex algorithms.

 

3.1 Basic principles of bootstrap

 

The first phase consists of drawing with replacement n times among the n statistical units (respondents, documents…), and of computing the parameters of interest (means, variances, eigenvectors…) on this new sample. This phase is repeated q times. A common value of q is 200 (Efron and Tibshirani, 1993), but it can vary from 30 to several thousand according to the type of application. We have at this stage q samples (the replicates) drawn from a new “theoretical population” defined by the empirical distribution of the original data set, and q replicates of the parameters of interest.

Under rather general assumptions, it has been proved that we can estimate the variance (and other statistical parameters) of these parameters empirically directly from the set of their q replicates.
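The two phases above can be sketched in a few lines of Python; the helper `bootstrap_se`, the toy data, and the seeds are hypothetical illustrations:

```python
import numpy as np

def bootstrap_se(data, statistic, q=200, seed=0):
    """Bootstrap standard error of a statistic: draw n units with
    replacement, recompute the statistic, repeat q times, and take the
    spread of the q replicates."""
    rng = np.random.default_rng(seed)
    n = len(data)
    replicates = np.array([
        statistic(data[rng.integers(0, n, size=n)]) for _ in range(q)
    ])
    return float(replicates.std(ddof=1))

# toy data (hypothetical): 400 draws from a normal with standard deviation 2,
# so the analytic standard error of the mean is 2 / sqrt(400) = 0.1
data = np.random.default_rng(1).normal(loc=10.0, scale=2.0, size=400)
print(round(bootstrap_se(data, np.mean), 3))  # close to the analytic 0.1
```

The same mechanism applies unchanged to parameters with no tractable analytic variance, such as eigenvectors.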

 

3.2 Context of SVD and PCA

 

In the PCA case, variants of the bootstrap (partial and total bootstrap) are presented for active variables, supplementary continuous variables, and supplementary nominal variables as well. In the case of numerous homogeneous variables, a bootstrap on variables is also proposed, with examples of application to the case of semiometric data (Lebart et al., 2003).

In the case of principal component analysis, numerous papers have contributed to selecting the relevant number of axes, and have proposed confidence intervals for points in the subspace spanned by the principal axes. These parameters are computed after the realization of each replicated sample, and involve constraints that depend on these samples.

Several procedures have been proposed to overcome these difficulties (Chateau and Lebart, 1996): partial replications using supplementary elements (partial bootstrap), use of a three-way analysis to process simultaneously the whole set of replications (Holmes, 1989), and filtering techniques involving reordering of axes and Procrustean rotations (Markus, 1994; Milan and Whittaker, 1995).

 

The partial bootstrap, which makes use of projections of replicated elements onto the reference subspace provided by the Singular Value Decomposition of the observed covariance matrix, has several advantages for data analysts. From a descriptive standpoint, this initial subspace is better than any subspace undergoing a perturbation by random noise. In fact, this subspace is the expectation of all the perturbed subspaces (replicates). The plane spanned by the first two axes, for instance, provides nothing but a point of view on the data set. In this context, to apply the non-parametric bootstrap to PCA, one may project the q replicates of variable-points into the common reference subspace, and compute confidence areas (ellipses or convex hulls) for the locations of these replicates.
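A minimal sketch of this partial bootstrap, on a small synthetic data set (all names and figures are hypothetical assumptions): the reference axes come from one SVD of the observed data, and each replicate only contributes correlations of the variables with those fixed axes, without any new diagonalization.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data: 300 observations, 5 correlated variables (hypothetical)
n, p = 300, 5
base = rng.standard_normal((n, 1))
X = base + 0.8 * rng.standard_normal((n, p))
X = (X - X.mean(0)) / X.std(0)

# reference axes from a single SVD of the standardized data
_, _, Vt = np.linalg.svd(X, full_matrices=False)
axes = Vt[:2].T                      # p x 2: first two principal axes

def replicate_coords(idx):
    """Partial bootstrap: correlations between each variable and the
    scores on the FIXED reference axes, within one bootstrap sample."""
    Xb = X[idx]
    scores = Xb @ axes               # coordinates of resampled rows
    out = np.empty((p, 2))
    for j in range(p):
        for a in range(2):
            out[j, a] = np.corrcoef(Xb[:, j], scores[:, a])[0, 1]
    return out

reps = np.stack([replicate_coords(rng.integers(0, n, n)) for _ in range(30)])
# spread of the 30 replicated variable-points around their mean positions
print(np.round(reps.std(axis=0).max(), 2))
```

The cloud of 30 replicates of each variable-point is then summarized by a confidence ellipse or a convex hull in the principal plane.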

 

3.3 Context of CA and MCA

 

Gifi (1981), Meulman (1982), and Greenacre (1984) did pioneering work in addressing the problem in the context of two-way and multiple correspondence analysis.

It appears easier to assess eigenvectors than eigenvalues, whose replicates are biased estimates of the theoretical values (see, e.g., Alvarez et al., 2002).

 

3.3.1 An application example

 

An open-ended question was included in a multinational survey conducted in seven countries (Japan, France, Germany, Italy, the Netherlands, United Kingdom, USA) in the late nineteen-eighties (Hayashi et al., 1992).

The respondents were asked: "What is the single most important thing in life for you?". This open question was followed by the probe: "What other things are very important to you?".

The illustrative example is limited to the British sample (sample size: 1043). The counts for the first phase of numeric coding are as follows: Out of 1043 responses, there are 13 669 occurrences (tokens), with 1 413 distinct words (types). When the words appearing at least 16 times are selected, there remain 10 357 occurrences of these words (tokens), with 135 distinct words (types).

The same questionnaire also had a number of closed-end questions (among them, the socio-demographic characteristics of the respondents, which play a major role).

In this example we focus on a partitioning of the sample into 9 categories, obtained by cross-tabulating age (3 categories) with educational level (3 categories).

 

[Figure: uk_ellipses_mono]

 

Fig. 1. Bootstrap confidence ellipses for 9 category-points in CA principal plane

 

The 9 identifiers combine age categories (-30, 30-55, +55) with educational levels (low, medium, high).

Figure 1 is a rough sketch of the principal plane obtained from a CA of the contingency table cross-tabulating the 135 words (rows) appearing at least 16 times and the 9 categories of respondents (columns). The entry (i, j) of this table is the number of occurrences of word i in the responses of individuals belonging to category j. Over this simultaneous representation of rows and columns are drawn the 9 confidence ellipses of the column-points (categories of respondents): the smaller the confidence area, the more typical the vocabulary used by the corresponding category (a situation exemplified by the category “+55 / low” on the left-hand part of the display).

The vocabularies of overlapping categories cannot be deemed significantly distinct (case of the two categories “-30 / low” and “-30 / medium”, near the centre of the display).

 

[Figure: uk_ellipses_mots_mono]

 

Fig. 2. Bootstrap confidence ellipses for 6 word-points in CA principal plane

 

Figure 2 is another sketch of the same principal plane, with the bootstrap confidence ellipses relating to 6 row-points (words): friends (116), job (142), money (171), health (612), peace (77), mind (47) (the numbers of occurrences in the corpus are in parentheses). The sizes of the ellipses (as measured by the radius of a circle having the same area) are approximately proportional to the square root of the frequencies. We can note, for instance, that the overlap of the areas corresponding to peace and mind (lower part of the display) is due to the fact that “peace of mind” is a frequent phrase in the corpus of responses. As a matter of fact, these two words are also characteristic words (see, e.g., Lebart et al., 1998) of the category (+55 / high); their test-values are respectively 2.9 and 2.2. The detection of these characteristic elements is subject to multiple comparisons effects. This is not the case for the bootstrap validation, which provides simultaneous confidence intervals for all the words and takes into account the structure of correlations between words.

 

In practice, figures 1 and 2 are superimposed, with different colours attached to the ellipses of the row-points and of the column-points.

 

3.3.2 Different statistical levels: versatility of bootstrap

 

When dealing with textual data, these resampling techniques can help to solve the problem of the plurality of statistical units (see, in the case of responses to open questions: Tuzzi and Tweedie, 2000). In fact, two (or more) levels of statistical units coexist in textual data analysis. On the one hand, observations or individuals (with their usual meaning in statistics) could be respondents (sample surveys), documents or abstracts (information retrieval), or web users (web mining). On the other hand, within the same textual corpus, the observations could be occurrences (tokens), words, lemmas, or phrases. At an intermediate level, for some other applications, they could also be pages, sentences, or frames.

The replication scheme can be customized to any of these levels, leading to conclusions adapted to the intended inference. Due to the discrepancies in text sizes, a structural pattern could be significant when the statistical unit is the word, yet irrelevant when the statistical unit is the document, the respondent, or the web user.
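The following sketch contrasts two replication schemes on a hypothetical toy corpus: resampling whole respondents versus resampling individual tokens, for the relative frequency of one word. All names, the vocabulary, and the figures are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

# hypothetical toy corpus: 200 respondents, each a short list of word tokens
vocab = np.array(["health", "family", "money", "peace", "work"])
corpus = [rng.choice(vocab, size=rng.integers(3, 15)) for _ in range(200)]
all_tokens = np.concatenate(corpus)

def bootstrap_freq_se(word, level, q=200):
    """Bootstrap standard error of the relative frequency of one word,
    resampling either whole respondents (survey-level inference) or
    individual tokens (lexical-level inference)."""
    reps = []
    for _ in range(q):
        if level == "respondent":
            idx = rng.integers(0, len(corpus), size=len(corpus))
            tokens = np.concatenate([corpus[i] for i in idx])
        else:  # "token": resample occurrences directly
            tokens = rng.choice(all_tokens, size=len(all_tokens))
        reps.append(np.mean(tokens == word))
    return float(np.std(reps))

# the two inference levels generally yield different standard errors
print(round(bootstrap_freq_se("health", "respondent"), 3),
      round(bootstrap_freq_se("health", "token"), 3))
```

With real responses of very unequal lengths, the two standard errors can differ markedly, which is the point made above.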

Note incidentally that bootstrap techniques also play a major role when the analyzed corpus is extracted through a sampling procedure from a huge database.

 

4. Conclusion

 

From a statistical perspective, textual data have in common the following characteristics: they are large, high-dimensional, qualitative (categorical), sparse, and affected by a high level of noise. Validation procedures are then difficult to carry out, but all the more necessary.

 

Re-sampling techniques (mainly the bootstrap in the case of unsupervised approaches) possess all the required properties to provide the user with versatile tools that transform appealing visualizations into scientific documents.

 

 

References

 

1. Alvarez R., Bécue M., Lanero J. J., Valencia O.: Results stability in Textual Analysis: its Application to the Study of the Spanish Investiture Speeches (1979-2000). In: JADT-2002, 6-th International Conference on Textual Data Analysis, Morin A., Sébillot P., (eds),  INRIA-IRISA, Rennes (2002)  1-12.

2. Benzécri, J.-P.:  Analyse des Données. Tome II: Analyse des Correspondances.  Dunod, Paris (1973).

3. Berry M. W., Drmac Z., Jessup E. R.: Matrices, Vector Spaces and Information Retrieval. SIAM Review, 41, 2, (1999), 335 – 362.

4. Bock H.-H. : On some significance tests in Cluster Analysis. Journal of Classification, 2, (1985), 77-108.

5. Bock, H.-H.: Probability models and hypothesis testing in partitioning cluster analysis. In: Clustering and Classification, P. Arabie, L. J. Hubert, G. De Soete (eds), World Scientific, Singapore (1996), 377-453.

6. Chateau F. , Lebart L.: Assessing sample variability and stability in the visualization techniques related to principal component analysis; bootstrap and alternative simulation methods. COMPSTAT 1996,  Prat A. (ed), Physica Verlag, Heidelberg (1996),   205-210.

7. Cormack R. M.: A review of classification. J. of the Royal Statist. Society, Series A, 134, Part 3, (1971), 321-367.

8. Deerwester S., Dumais S. T., Furnas G. W., Landauer T. K., Harshman R.: Indexing by latent semantic analysis. J. of the Amer. Soc. for Information Science, 41 (6), (1990), 391-407.

9. Diaconis P., Efron B.: Computer intensive methods in statistics. Scientific American, 248 (May), (1983), 116-130.

10. Eckart C., Young G.: The approximation of one matrix by another of lower rank. Psychometrika, 1, (1936), 211-218.

11. Efron B.: Bootstrap methods: another look at the jackknife. Ann. Statist., 7, (1979), 1-26.

12. Efron B., Tibshirani R. J.: An Introduction to the Bootstrap. Chapman and Hall, New York, (1993).

13. Gifi A.: Non Linear Multivariate Analysis. Department of Data Theory, University of Leiden (1981). (Updated version: same title, J. Wiley, Chichester, 1990.)

14. Gordon A.D.: A review of hierarchical classification, J.R.Statist.Soc., A, 150, Part2, (1987), 119-137.

15. Gordon A. D.: External validation in cluster analysis. Bulletin of the International Statistical Institute 51(2), 353-356 (1997). Response to comments. Bulletin of the International Statistical Institute 51(3), (1998), 414-415.

16. Gordon A.: Cluster validation. In Data Science, Classification, and Related Methods (C Hayashi, N Ohsumi, K Yajima, Y Tanaka, H-H Bock and Y Baba, eds.), Springer, Tokyo,  (1998), 22-39.

17. Greenacre, M.:  Theory and Applications of Correspondence Analysis. Academic Press, London (1984).

18. Halkidi M., Batistakis Y., Vazirgiannis M.: On clustering validation techniques. Journal of Intelligent Information Systems, 17:2/3, (2001), 107-145.

19. Hardy A.: An examination of procedures for determining the number of clusters in a data set. In: New Approaches in Classification and Data Analysis, Diday et al. (eds) Springer Verlag, Berlin, (1994)  178-195.

20. Hayashi C.: Theory and examples of quantification. (II) Proc. of the Institute of Statist. Math. 4 (2), (1956), 19-30.

21. Hochberg, Y.: A sharper Bonferroni procedure for multiple tests of significance, Biometrika, 75, (1988), 800-803.

22. Holmes S.: Using the bootstrap and the RV coefficient in the multivariate context. in: Data Analysis, Learning Symbolic and Numeric Knowledge, E. Diday (ed.), Nova Science, New York, (1989)  119-132.

23. Hotelling H.: Analysis of a complex of statistical variables into principal components. J. Educ. Psy., 24, (1933), 417-441 and 498-520.

24. Hsu, J. C.:  Multiple Comparisons: Theory and Methods, Chapman & Hall, London, (1996).

25. Kaufman L., Rousseeuw P. J.: Finding Groups in Data. J. Wiley, New York, (1990).

26. Kohonen T.: Self-Organization and Associative Memory. Springer Verlag, Berlin, (1989).

27. Lebart L., Piron M., Steiner J.-F. : La Sémiométrie. Dunod, Paris (2003).

28. Lebart L., Salem A., Berry L.: Exploring Textual Data, Kluwer, Dordrecht, Boston (1998).

29. Lebart L., Morineau A., Warwick K.: Multivariate Descriptive Statistical Analysis.  J. Wiley, New York, (1984).

30. Markus M. Th.: Bootstrap Confidence Regions for Homogeneity Analysis: the Influence of Rotation on Coverage Percentages. COMPSTAT 1994, Dutter R. and Grossmann W. (eds), Physica Verlag, Heidelberg, (1994), 337-342.

31. Milan L., Whittaker J.: Application of the parametric bootstrap to models that incorporate a singular value decomposition. Appl. Statist. 44, 1 (1995)  31-49.

32. MacQueen J. B.: Some methods for classification and analysis of multivariate observations. Proc. Symp. Math. Statist. and Probability (5th), Berkeley, 1, (1967), 281-297, Univ. of Calif. Press, Berkeley.

33. Milligan G. W., Cooper M. C.: An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50, 159-179  (1985).

34. Perneger T. V.: What's wrong with Bonferroni adjustments. British Medical Journal, 316, (1998), 1236-1238.

35. Saville D. J.: Multiple comparison procedures: the practical solution. American Statistician, 44, (1990), 174-180.

36. Tuzzi A., Tweedie F. J.: The best of both worlds: Comparing Mocar and Mcdisp. In: JADT2000 (Cinquièmes Journées Internationales sur l’Analyse des Données Textuelles), Rajman M., Chappelier J-C. (eds), EPFL, Lausanne (2000), 271-276.

37. Westfall P. H., Young S. S.: Resampling Based Multiple Testing: Examples and Methods for p-values Adjustment. Wiley, New York (1993).

 

 

 

 

 

 

 

 

 

 

2. Which bootstrap for principal axes methods?

 

Adapted and updated from a draft version of the paper:

L. Lebart (2007) Which bootstrap for principal axes methods?   In: Selected Contributions in Data Analysis and Classification, P. Brito et al. Editors, Springer, 581 – 588.


 

Abstract:

 

This paper deals with validation techniques in the context of multivariate descriptive techniques involving singular value decomposition (SVD), namely Principal Components Analysis (PCA) and Simple and Multiple Correspondence Analysis (CA and MCA). We briefly show that, according to the concerns of the users, at least five types of resampling techniques could be carried out to assess the quality of the obtained visualisations: a) partial bootstrap, which considers the replications as supplementary variables, without diagonalization of the replicated moment-product matrices; b) total bootstrap type 1, which implies a new diagonalization for each replicate, with corrections limited to possible changes of sign of the axes; c) total bootstrap type 2, which adds to the preceding one a correction for possible permutations of axes; d) total bootstrap type 3, which implies Procrustean transformations of all the replicates, striving to take into account both rotations and permutations of axes; e) specific bootstrap, implying resampling at a different level (case of a hierarchy of statistical units). An example is presented for each type of resampling.

 

Key words: Principal components analysis, correspondence analysis, bootstrap, assessment.

 

1. Introduction

 

Our aim is to assess the results of principal axes methods (PAM), i.e. multivariate descriptive techniques involving singular value decomposition (SVD), such as principal components analysis (PCA) and simple and multiple correspondence analysis (CA and MCA). These methods provide useful data visualisations, but their outputs (parameter estimates, graphical displays) are difficult to assess. Computer-intensive techniques allow us to go far beyond the criterion of “interpretability” of the results that was frequently used during the first phases of the upsurge of data analytic methods thirty years ago (see, e.g., Diday and Lebart, 1976).

To compute the precision of estimates, the classical analytical approach is both unrealistic and complex. The bootstrap (Efron, 1979), on the contrary, makes almost no assumption about the underlying distributions, and makes it possible to master every statistical computation for each sample replication, and therefore to deal with parameters computed through the most complex algorithms.

 

2  Basic principles of the bootstrap, a reminder

 

The nonparametric bootstrap consists in drawing with replacement a sample of size n out of the n statistical units. Then, the parameter estimates (means, variances, eigenvectors…) are computed on the newly obtained “sample”. This two-fold phase is repeated K times. A common value of K is 200 (Efron and Tibshirani, 1993), but it can vary from 10 to several thousand according to the type of application. We have at this stage K samples (the replicates) drawn from a new “theoretical population” defined by the empirical distribution of the original data set, and, as a consequence, K estimates of the parameters of interest. Briefly, and under rather general assumptions, it has been proved that we can estimate the variance (and other statistical parameters) of these parameters directly from the set of their K replicates.

 

In the PCA case, variants of bootstrap do exist for active variables and supplementary variables, both continuous and nominal. In the case of numerous homogeneous variables, a bootstrap on variables is also proposed, with examples of application to the case of semiometric data (Lebart et al., 2003).

 

Numerous papers have contributed to selecting the relevant number of axes, and have proposed confidence intervals for points in the subspace spanned by the principal axes. The s-th eigenvector of a replicated correlation matrix is not necessarily homologous to the s-th eigenvector of the original matrix, because of possible rotations, permutations or changes of sign of the axes. In addition, the expectations of the eigenvalues of the replicated matrices are not the original eigenvalues (see, e.g., Alvarez et al., 2004; Lebart, 2006). Several procedures have been proposed to overcome these difficulties (Chateau and Lebart, 1996): partial replications using supplementary elements (partial bootstrap), use of a three-way analysis to process simultaneously the whole set of replicates (Holmes, 1989), and filtering techniques involving reordering of axes and Procrustean rotations (Markus, 1994; Milan and Whittaker, 1995; Gower and Dijksterhuis, 2004).

 

  3. The illustrative example

 

An open-ended question was included in a multinational survey conducted in seven countries (Japan, France, Germany, Italy, the Netherlands, United Kingdom, USA) in the late nineteen-eighties (Hayashi et al., 1992). The respondents were asked: "What is the single most important thing in life for you?".

The illustrative example is limited to the British sample (sample size: 1043). The counts for the first phase of numeric coding are as follows: out of 1043 responses, there are 13 669 occurrences (tokens), with 1 413 distinct words (types). When the words appearing at least 16 times are selected, there remain 10 357 occurrences of these words (tokens), with 135 distinct words (types). The same questionnaire also had a number of closed-end questions (among them, the socio-demographic characteristics of the respondents). In this example we focus on a partitioning of the sample into 9 categories, obtained by cross-tabulating age (3 categories) with educational level (3 categories). All the following figures contain an excerpt (only four words) of the principal plane obtained from a CA of the contingency table cross-tabulating the 135 words (rows) appearing at least 16 times and the 9 categories of respondents (columns). The entry (i, j) of this table is the number of occurrences of word i in the responses of individuals belonging to category j (see Lebart et al., 1998).

 

4. Partial bootstrap

 

The partial bootstrap makes use of projections of replicated elements onto the original reference subspace provided by the eigen-decomposition of the observed covariance matrix. It has several advantages. From a descriptive standpoint, this initial subspace is better than any subspace undergoing a perturbation by random noise. In fact, unlike the eigenvalues, this subspace is the expectation of all the replicated subspaces having undergone perturbations. The plane spanned by the first two axes, for instance, provides an optimal point of view on the data set. In this context, to apply the partial bootstrap to PCA, one may project the K replicates of variable-points into the common reference subspace, and compute confidence regions (ellipses or convex hulls) for the locations of these replicates. Gifi (1981) and Greenacre (1984) addressed the problem in the context of CA and MCA.

 

Then, for each variable-point and each pair of principal axes, a confidence ellipse is derived from a PCA of the two-dimensional cloud of the K replicates. The lengths of the two principal diameters of these ellipses are fixed at four standard deviations; the corresponding ellipses then contain approximately 90% of the replicates. Empirical evidence suggests that 30 is an acceptable value for K, the number of replicates. Confidence ellipses may also be replaced by convex hulls. The two ways of visualizing the uncertainty around each variable-point are complementary: ellipses take into account the density of the cloud of replicated points, whereas convex hulls pinpoint peripheral points and possible outliers.
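For a given word and a given pair of axes, the ellipse computation reduces to a PCA of the K replicated points. A minimal sketch, assuming the replicate coordinates have already been obtained by projection onto the reference plane (names are illustrative):

```python
import numpy as np

def confidence_ellipse(points):
    """Ellipse for the 2-D cloud of K bootstrap replicates of one point:
    centre, axis directions (PCA of the cloud), and principal diameters
    fixed at four standard deviations, as in the partial-bootstrap display."""
    pts = np.asarray(points, float)
    centre = pts.mean(axis=0)
    # PCA of the centred cloud: eigen-decomposition of its covariance.
    cov = np.cov(pts - centre, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    eigval = np.clip(eigval, 0.0, None)       # guard against tiny negatives
    order = np.argsort(eigval)[::-1]          # largest axis first
    diameters = 4.0 * np.sqrt(eigval[order])  # four standard deviations
    return centre, eigvec[:, order], diameters
```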

 

Figure 1 shows confidence ellipses for four row-points (words). Words whose ellipses overlap strongly cannot be deemed significantly distinct as regards their distributions among the nine categories (the columns of the lexical contingency table). Thus, the words church and mind, despite their distinct locations, correspond to the same profile of respondents (profile described by the nine categories). That profile is significantly distinct from those of the words nothing and things.

 

Figure 1. Partial bootstrap: Four confidence ellipses for word-points in the principal plane of a CA [contingency table crossing 135 words and 9 categories of respondents]

 

5. The total bootstrap and its three options

 

The total bootstrap consists in performing a new PAM for each replicate. Evidently, the absence of a common subspace of reference may induce a pessimistic view of the variances of the coordinates of the replicates on the principal axes. The most obvious change concerns the signs of the coordinates on the axes, a mere by-product of the diagonalization algorithm. We can also observe interchanges of axes from one replicate to another, as well as rotations of the axes (cf. Milan and Whittaker, 1995).

We must then perform a series of transformations to identify the homologous axes across the successive diagonalizations of the K replicated covariance matrices Ck (Ck corresponding to the k-th replicate).

Three types of transformations lead to three tests of the stability of the observed structure:

1.      Total Bootstrap type 1 (very conservative): simple change (when necessary) of the signs of the axes found to be homologous, merely to remedy the arbitrariness of the signs of the axes. A simple scalar product between homologous original and replicated axes suffices for this elementary transformation.

2.      Total Bootstrap type 2 (rather conservative): correction for possible interchanges of axes. Each replicated axis is sequentially assigned to the original axis with which its correlation (in absolute value) is maximal. The signs of the axes are then altered, if needed, as previously.

 

Figure 2.  Total bootstrap type 1: Confidence ellipses for the same word-points in the same original principal plane.

 

 

 

 

Figure 3.  Total bootstrap type 2: Confidence ellipses for the same word-points in the same original principal plane, with correction of possible interchanges of axes.

 

3.      Total Bootstrap type 3 (possibly lenient if the procrustean rotation is performed in a space spanned by many axes): a procrustean rotation (see Gower and Dijksterhuis, 2004) aims at superimposing the original and replicated axes as closely as possible.

 

Total bootstrap type 1 ignores the possible interchanges and rotations of axes. It allows for the validation of stable and robust structures. Each replication is expected to reproduce the original axes with the same ranks (order of the eigenvalues).

 

Total bootstrap type 2 is best suited to the validation of axes considered as latent variables, without paying attention to the order of the eigenvalues.
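The sign correction (type 1) and the axis reassignment (type 2) can be sketched as follows. This is an illustrative implementation (names are ours), assuming original and replicated axes are stored as columns of centred, unit-norm matrices, so that scalar products act as correlations; the greedy sequential matching shown is one simple way to realize the assignment described above.

```python
import numpy as np

def align_axes(ref, rep, fix_order=True):
    """Match replicated axes (columns of rep) to the original axes
    (columns of ref).  With fix_order=True, each reference axis is
    assigned the free replicated axis of maximal absolute correlation
    (type 2); signs are then flipped so each matched pair correlates
    positively (type 1)."""
    corr = ref.T @ rep                    # correlations between axes
    k = ref.shape[1]
    if fix_order:
        assign, free = [], set(range(k))
        for i in range(k):                # greedy sequential assignment
            j = max(free, key=lambda j: abs(corr[i, j]))
            assign.append(j)
            free.remove(j)
    else:
        assign = list(range(k))
    aligned = rep[:, assign].copy()
    for i, j in enumerate(assign):
        if corr[i, j] < 0:
            aligned[:, i] *= -1.0         # remedy the arbitrary sign
    return aligned
```

With `fix_order=False` only the signs are corrected (type 1); with `fix_order=True` interchanges are corrected as well (type 2).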

 

Total bootstrap type 3 allows for the validation of a whole subspace. If, for instance, the subspace spanned by the first four replicated axes coincides with the original four-dimensional subspace, one can find a rotation that puts the homologous axes into coincidence. The situation is then very similar to that of the partial bootstrap.
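The procrustean rotation of type 3 has a closed-form solution through an SVD (Gower and Dijksterhuis, 2004). A minimal sketch, with illustrative names, assuming the columns of ref and rep span the subspaces to be superimposed:

```python
import numpy as np

def procrustes_rotation(ref, rep):
    """Orthogonal Procrustes: find the rotation Q minimising
    ||ref - rep @ Q|| and return the rotated replicated axes,
    as in total bootstrap type 3.  Solved through an SVD."""
    u, _, vt = np.linalg.svd(rep.T @ ref)
    q = u @ vt                 # optimal orthogonal transformation
    return rep @ q
```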

 

Figure 2 shows the case of total bootstrap type 1: evidently, the ellipses are much larger. Figure 3 introduces the correction for possible interchanges of axes. The pattern observed in figure 1 reappears, albeit less clearly. This improvement means that axis interchanges were responsible for part of the perturbations of figure 2: some stable dimensions may exist, but their order of appearance (the order of the corresponding eigenvalues) can vary from one replicate to another.

 

Figure 4 is similar to figure 1 as far as the sizes of the ellipses are concerned. In fact, the procrustean transformations depend on the number of axes taken into consideration. They have been performed here in a five-dimensional space, and the original space can be retrieved without difficulty, leading to a procedure similar to the partial bootstrap. Lack of space does not allow displaying all the other possibilities.

 

 

 

 

 

Figure 4.  Total bootstrap type 3: Confidence ellipses for the same word-points in the same original principal plane, with correction of possible interchanges of axes and of possible rotations (procrustean transformations).

 

 

6. Specific (or hierarchical) bootstrap

 

When dealing with textual data, these resampling techniques can help to solve the problem of the plurality of statistical units (see, in the case of responses to open questions: Tuzzi and Tweedie, 2000). In fact, two (or more) levels of statistical units coexist in textual data analysis. On the one hand, observations or individuals (with their usual meaning in statistics) could be respondents (in sample surveys) or, e.g., web users (in Web mining). On the other hand, within the same textual corpus, other types of observations or individuals could be occurrences (tokens), words, lemmas, phrases.

The principle of replication can be applied at either of these levels, leading to conclusions adapted to the intended inference. Owing to discrepancies in text sizes, a pattern may be significant when the statistical unit is the word, yet not relevant when the statistical unit is the respondent or the web user.
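A single replicate of the respondent-level (hierarchical) resampling can be sketched as follows. This is an illustrative implementation (names are ours), in which all the occurrences produced by one respondent enter or leave the replicate together:

```python
import numpy as np

def respondent_bootstrap(responses, categories, words, rng):
    """One replicate of the specific (hierarchical) bootstrap: resample
    the respondents with replacement, then rebuild the word x category
    table from the resampled responses."""
    n = len(responses)
    idx = rng.integers(0, n, size=n)      # draw n respondents with replacement
    cats = sorted(set(categories))
    wpos = {w: i for i, w in enumerate(words)}
    cpos = {c: j for j, c in enumerate(cats)}
    table = np.zeros((len(words), len(cats)), dtype=int)
    for i in idx:
        j = cpos[categories[i]]
        for w in responses[i].lower().split():
            if w in wpos:                 # keep only the retained words
                table[wpos[w], j] += 1
    return table
```

Each replicated table can then be projected as supplementary elements onto the reference CA plane, exactly as in the partial bootstrap, before drawing the confidence ellipses.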

 

Figure 5 shows again the same set of four word-points after a partial specific bootstrap consisting in drawing the 1043 respondents with replacement and projecting the replicates as supplementary variables.

If we compare the ellipses with those of figure 1, we observe for example that the location of the word “things” is now more ambiguous: this is due to the fact that some respondents use the word several times. Consequently, a drawing of respondents induces a larger perturbation of the data. The specific bootstrap is, however, the right procedure to carry out if we want to extend the inference to the parent population of respondents.

 

Figure 5. Specific two-level partial bootstrap: Bootstrapping the observations (i.e. the respondents) instead of the words. This figure should be compared only with figure 1

(both of them use partial bootstrap)

 

 

7. Conclusion

 

The bootstrap rests on the assumption that the observed sample can serve as an approximation of the population. It takes into account the multivariate nature of the observations and involves all the axes simultaneously. We can then compare and analyze the proximities between pairs of categories without reference to a specific axis.

Bootstrapping can also be used to process weighted data (circumstances occurring in most sample surveys) and to draw confidence intervals around the locations of supplementary continuous or numerical variables in PAM. In the case of multilevel samples (for example: a sample of respondents, and samples of words within the responses), the replications can involve the different levels separately, allowing one to study the different components of the observed variance.

From a practitioner’s standpoint, principal axes techniques are particularly profitable when the principal space spanned by the first dimensions is considered as a predictive map meant to receive all the remaining information contained in the data file (the set of supplementary questions). In fact, that approach, closely related to regression, is widely used in practice.

In all these cases, assessment procedures are difficult to carry out in a classical statistical framework. Bootstrap techniques can now confer a scientific status on the visualizations obtained.

 

 References

 

Alvarez R., Bécue M., Valencia O. (2004). Etude de la stabilité des valeurs propres de l’AFC d’un tableau lexical au moyen de procédures de rééchantillonnage, in: « Le poids des mots »,  Purnelle G., Fairon C., Dister A. (eds),  Louvain : PUL, 42-51.

Chateau F., Lebart L. (1996). Assessing sample variability in visualization techniques related to principal component analysis: bootstrap and alternative simulation methods. In : COMPSTAT96, A. Prats (ed), Physica Verlag, Heidelberg,  205-210.

Diday E., Lebart L. (1977). L’analyse des données. La Recherche, 74, 15-25.

Efron B., Tibshirani R. J. (1993).  An Introduction to the Bootstrap. New York: Chapman and Hall.

Efron B. (1979). Bootstrap methods: another look at the jackknife. Annals of Statistics, 7, 1-26.

Gifi A. (1990).  Non Linear Multivariate Analysis, Chichester:  J. Wiley  [updated version of: Gifi A. (1980) (same title), Department of Data theory, University of Leiden].

Gower J. C., Dijksterhuis G. B. (2004)  Procrustes Problems, Oxford Univ. Press, Oxford.

Greenacre M. (1984). Theory and Applications of Correspondence Analysis. London: Academic Press.

Hastie T., Tibshirani R., Friedman J. (2001). The Elements of Statistical Learning, New York: Springer.

Hayashi C., Suzuki T., Sasaki M. (1992). Data Analysis for Social Comparative Research: International Perspective. North-Holland, Amsterdam.

Holmes S. (1989). Using the bootstrap and the RV coefficient in the multivariate context. In: Data Analysis, Learning Symbolic and Numeric Knowledge, E. Diday (ed.), Nova Science, New York, 119-132.

Lebart L., Piron M., Morineau A. (2006)  Statistique exploratoire multidimensionnelle, Validation et Inférence en fouilles de données. Dunod, Paris.

Lebart L., Salem A., Berry L. (1998). Exploring Textual Data. Kluwer, Dordrecht, Boston.

Lebart L. (2006). Validation techniques in multiple correspondence analysis. In: Multiple Correspondence Analysis and Related Methods, Greenacre M. and Blasius J. (eds), Chapman and Hall, 179-196.

Markus M.Th. (1994). Bootstrap Confidence Regions for Homogeneity Analysis.; the Influence of Rotation on Coverage Percentages. COMPSTAT 1994, (Dutter R. and Grossmann W. (eds))  Physica Verlag, Heidelberg,  337-342.

Milan L., Whittaker J. (1995). Application of the parametric bootstrap to models that incorporate a singular value decomposition. Applied Statistics, 44, 1, 31-49.

Tuzzi A., Tweedie F. J. (2000). The best of both worlds: comparing Mocar and Mcdisp. In: JADT2000 (Cinquièmes Journées Internationales sur l’Analyse des Données Textuelles), Rajman M., Chappelier J-C. (eds), EPFL, Lausanne, 271-276.

 

3.  Visualization, validation and seriation.

Application to a corpus of medieval texts.

 

 

Draft of the  paper:

F. Dupuis, L. Lebart (2008) Visualization, validation and seriation, Application to a corpus of medieval texts. In: Historical Linguistics 2007, M. Dufresne, F. Dupuis, E. Vocaj, (eds), John Benjamins Publishing Company, 269 – 284.   [Fernande Dupuis: UQAM, Montreal]

[Return to Table of Content]

 

 

Abstract

Principal axes methods (such as correspondence analysis [CA]) provide useful visualizations of high-dimensional data sets. In the context of historical textual data, these techniques produce planar maps highlighting the associations between graphemes and texts (paragraphs, chapters, full texts, authors). First, we recall that a simple technique of seriation (re-ordering the rows and columns of a table) is readily derived from the first CA axis. Second, we stress the important role played by bootstrap techniques in allowing valid statistical inferences in a context in which the classical analytical approach is both unrealistic and complex. A series of medieval French texts (12th-13th centuries), rich in spelling variants, exemplifies the proposed approaches. Free software is available.

 

Résumé

Les méthodes d’analyse en axes principaux telles que l’analyse des correspondances fournissent des outils de visualisation précieux. Dans le cadre des études de textes anciens, ces méthodes permettent, par exemple, de représenter les associations entre graphèmes et textes (paragraphes, chapitres, textes complets, auteurs) sur des cartes planes. Dans un premier temps, on rappelle qu’une méthode élémentaire de sériation (ré-ordonnancement des lignes et des colonnes d’une table lexicale) est un simple sous-produit de l’analyse des correspondances. Puis on insiste sur le rôle important joué par les techniques de validation issues du bootstrap dans un contexte où les inférences statistiques classiques sont impossibles. L’exposé se fait à partir d’exemples qui concernent un corpus de textes médiévaux (12e-13e siècles). Le logiciel utilisé est librement accessible.

 

 

 

1. Introduction

 

Studies of early texts are faced with difficulties well known to specialists, because of several closely linked factors: variation in graphical forms (words or types) [This set of factors is discussed in Dees (1987). Variation in graphical forms is a well-known phenomenon of medieval literature. Dees (1987, p. 535), in his inventory of forms, counts for map number 4 thirty-six forms in opposition for the morpheme ce], the inherent interference of copyists, marked regional and time effects, corpus and standards, and finally, stemming from the preceding factors, problems in the systematic use of automatic language processing tools [See Dupuis and Lemieux (2006) on this issue].

We will show, using a corpus of medieval texts (section 2), that the application to a basic lexical table crossing texts and graphical forms of three mutually dependent techniques, namely correspondence analysis (section 3), seriation (section 4) and confidence areas (section 5), gives relatively fine-grained observations and allows testing sophisticated hypotheses while staying close to the basic texts.

 

2. Corpus

The corpus consists of 15 texts in verse written during the 12th and 13th centuries  [The texts in the corpus are drawn from the Base de français médiéval created by Christiane Marchello-Nizia at ENS-LSH in Lyon]. It contains 383193 occurrences of 27459 graphical forms. We give below the list of texts and their characteristics:

 

- Ident (Base de français médiéval): [stbrend]; Author: Benedeit; Title: Voyage de saint Brendan; Date: early 12th c.; Ed. Sc.: I. Short, B. Merrilees; Manchester University Press; 1979; Domain: religion; Genre: hagiography; Dialect: Anglo-Norman; 10829 words.

- Ident: [roland]; Anonymous; Title: Chanson de Roland; circa 1100; Ed. Sc.: G. Moignet; Bordas; Collection: n/a; 1969; Domain: literary; Genre: epic; Dialect: Anglo-Norman; 29338 words.

- Ident: [gormont]; Anonymous; Title: Gormont et Isembart; circa 1130; Ed. Sc.: A. Bayot; Champion; 1931; Domain: literary; Genre: epic; Dialect: unknown (addition = center or south-west of Paris); 3815 words.

- Ident: [louis]; Anonymous; Title: Couronnement de Louis; circa 1130; Ed. Sc.: E. Langlois; Champion; 1925; Domain: literary; Genre: epic; Dialect: unknown; 19786 words.

- Ident: [thebe]; Anonymous; Title: Roman de Thèbes; circa 1150; Ed. Sc.: G. Raynaud de Lage; Champion; 1968; Domain: literary; Genre: novel; Dialect: unknown; 62698 words.

- Ident: [thomas]; Author: Guernes de Pont-Sainte-Maxence; Title: Vie de saint Thomas; 1172-1174; Ed. Sc.: E. Walberg; Champion; 1936; Domain: religion; Genre: hagiography; Dialect: unknown; 53947 words.

- Ident: [eracle]; Author: Gautier d’Arras; Title: Eracle; circa 1176-1184; Ed. Sc.: G. Raynaud de Lage; Champion; 1976; Domain: literary; Genre: novel; Dialect: unknown; 40839 words.

- Ident: [beroul]; Author: Béroul; Title: Tristan; between 1165 and 1200; Ed. Sc.: L. M. Defourques, E. Muret; Champion; 1947; Domain: literary; Genre: novel; Dialect: Franco-Picard; 27257 words.

- Ident: [amiamil]; Anonymous; Title: Ami et Amile; circa 1200; Ed. Sc.: P.F. Dembowski; Champion; 1969; Domain: literary; Genre: epic; Dialect: unknown; 25283 words.

- Ident: [belinc]; Author: Renaut de Beaujeu; Title: Bel Inconnu; before 1214; Ed. Sc.: P. Williams; Champion; 1929; Domain: literary; Genre: novel; Dialect: unknown; 36692 words.

- Ident: [renart10]; Anonymous; Title: Roman de Renart (branch X); early 13th c.; Ed. Sc.: M. Roques; Champion; 1948-1963; Domain: literary; Genre: short novels; Dialect: unknown; 13472 words.

- Ident: [renart11]; Anonymous; Title: Roman de Renart (branch XI); early 13th c.; Ed. Sc.: M. Roques; Champion; 1948-1963; Domain: literary; Genre: short novels; Dialect: unknown; 8563 words.

- Ident: [escoufle]; Author: Jean Renart; Title: Escoufle; between 1200 and 1202; Ed. Sc.: F. P. Sweester; Droz; Collection: TLF; 1974; Domain: literary; Genre: novel; Dialect: Picard; 57967 words.

- Ident: [dole]; Author: Jean Renart; Title: Roman de la Rose ou de Guillaume de Dole; 1210 or 1228; Ed. Sc.: F. Lecoy; Champion; 1962; Domain: literary; Genre: novel; Dialect: not verified; 34555 words.

- Ident: [vergy]; Anonymous; Title: Châtelaine de Vergy; mid 13th c., before 1288; Ed. Sc.: G. Raynaud, L. Foulet; Champion; 1921; Domain: literary; Genre: novel; Dialect: unknown; 6117 words.

 

This corpus, albeit small, demonstrates the diversity of genres in medieval times: hagiography, epic, novel and narrative. We note furthermore that the texts vary in length from 62698 words down to 6117 words, which often creates a problem for those interested in studying low-frequency phenomena.

 

3. Visualization through Correspondence Analysis

 

The first step is to conduct a correspondence analysis of the lexical tables. First, the minimal frequency threshold for forms was set at 40; as a result, 941 words totalling 290769 occurrences were kept. Table 1 in section 4 below lists the first rows of this lexical table. In fact this is the analysis that will be used for the seriation step (section 4). However, the corresponding graphs cannot be published in the format of this article; thus, the graph in figure 1 below, which is already very cluttered (many overlapping points in the circular area), corresponds to a minimal frequency threshold of 200, leaving 227 distinct words (types) for a total of 233697 occurrences.
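The CA computations used throughout can be sketched with a minimal SVD-based implementation. This is a generic sketch of the standard algorithm, not the software actually used by the authors:

```python
import numpy as np

def correspondence_analysis(table):
    """Minimal correspondence analysis of a contingency table: SVD of
    the matrix of standardised residuals, returning row and column
    principal coordinates and the eigenvalues (principal inertias)."""
    n = np.asarray(table, float)
    p = n / n.sum()
    r = p.sum(axis=1)                     # row masses
    c = p.sum(axis=0)                     # column masses
    # Standardised residuals: (p_ij - r_i c_j) / sqrt(r_i c_j)
    s = (p - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    u, sv, vt = np.linalg.svd(s, full_matrices=False)
    rows = (u * sv) / np.sqrt(r)[:, None]     # row principal coordinates
    cols = (vt.T * sv) / np.sqrt(c)[:, None]  # column principal coordinates
    return rows, cols, sv ** 2
```

Applied to the (227 x 15) table, the first two columns of `rows` and `cols` would give the planar display of figure 1.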

 

 

 

 


 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 


Figure 1. Correspondence analysis of (227 x 15) table corresponding to a frequency threshold of 200.

[Planar display corresponding to the first principal axes]

 

 

It is worth noting that the pattern observed for the texts is extremely close to the one for the (941 x 15) table corresponding to a threshold of 40. The pattern also holds for the positions of the forms shared by both analyses. The opposition between the Anglo-Norman graphical forms (on the left) and the remaining forms allows us to characterize the copyists of the four authors situated in the left area of the first factorial plane. [This procedure enables us to classify “gormont” as Anglo-Norman, in accordance with Dees's (1987) localization, and to specify the dialectal origin of “thomas”, for which this information was missing in the Lyon BFM description].

 

4. Seriation

 

4.1 General principle

 

Seriation techniques, as well as block seriation techniques, are widely used by practitioners. The oldest reference is likely the Egyptologist Petrie (1899). Seriation is based on simple row and column permutations of the table under study; it has the great practical and cognitive advantage of showing the raw data to the user, thereby allowing the user to forego subtle interpretation rules. These permutations can display homogeneous blocks of high values or, on the contrary, of small or null values. They can also pinpoint a continuous and progressive evolution of profiles. We should also cite Bertin (1973) among the pioneers of this type of approach, before the advent of current electronic computing tools. Among the standard references in automatic classification and in statistics are the works of Hartigan (1972), Arabie et al. (1975) and Lerman (1972, 1981). Hill (1974) presents a result that is relevant for the present article: the first axis of a correspondence analysis induces an order on the row-points and on the column-points. That order can be used to sort the rows and columns of the analysed data table. The re-ordered table has then undergone an optimal seriation.

 


Figure 2. A particular structure of the data table

(area in grey: positive elements; area in white: null elements)

 

If the initial data table T, after re-ordering its rows and columns, can take the shape of the table drawn in figure 2 (as a matter of fact, the limits of the grey area are not necessarily straight lines), then this re-ordering is given by the order of the coordinates of the rows and columns on the first factor (axis) of the correspondence analysis of T.
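This re-ordering rule can be sketched directly: compute the coordinates on the first CA axis and sort rows and columns accordingly. An illustrative implementation (not the authors' software); the seriation is recovered up to a simultaneous reversal of both orders, which reflects the arbitrary sign of the axis:

```python
import numpy as np

def seriate(table):
    """Re-order the rows and columns of a contingency table according to
    their coordinates on the first correspondence-analysis axis, the
    seriation by-product recalled in the text (Hill, 1974)."""
    n = np.asarray(table, float)
    p = n / n.sum()
    r, c = p.sum(axis=1), p.sum(axis=0)
    s = (p - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    u, sv, vt = np.linalg.svd(s, full_matrices=False)
    row_coord = u[:, 0] / np.sqrt(r)      # first-axis standard coordinates
    col_coord = vt[0] / np.sqrt(c)
    ri, ci = np.argsort(row_coord), np.argsort(col_coord)
    return n[np.ix_(ri, ci)], ri, ci
```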

 

 

 

4.2 Application to the global lexical table

 

Table 1.  First 25 rows (out of 941) from the original lexical table (15 texts in columns)

     

      amia beli bero dole erac esco gorm loui rena renx rola stbr theb thom verg

             

a     719  964  727  953 1127 1452   69  543  514  698  392  140  830  353  180 

abat    2    4    1    0    1    4    4    6    1    1   25    0    5    0    0 

  abes    0    0    0    0    0    0    0   10    0    0    0   48    0    0    0 

  ad      0    0    0    0    0    0   26    0    0    0  442   65    0  100    0 

  afaire  0    6    2    1   11   47    0    0    8    4    0    0    5    0    0 

  ahi     5    1    5    5    7   14    3    5    2    5    1    0    0    0    0 

  ai     44   60   70   47   71   73    6   22   39   67   47   22   26   71   23 

  aidier  4   17    1    3    8    7    0   31    2    4    0    0    2    0    0 

  aim     0    6    4    7   13    9    0    0    1    4    3    0    0    5    2 

  aime    0    1    5   17   39   17    0    2    5    2    0    0    2    9    3 

  ainc    0   18    0    4   35   62    0    0    0    0    0    0    0    0    3 

  ains    1   24    0    2   79  106    0    0    0    0    0    0    0    0    0 

  ainz   38    2   44   55    0    0    0   30   11   32    1   20   47    2    8 

  ainçois 0    1    0   18    0    0    0    0    8    5    0    0   25    0    1 

  aise    0    0    2    3    9   17    0    1    4    7    0    1    2    4    3 

  ait    14   20   36   23   64   51    1   21   10    9   25    4   19   30   10 

  al      0   22    0    0   63   43   29   93    0    0   92   64    6   64    0 

  ala     6   14    6    6    3    9    1    2    3    4    0    0    2    1    2 

  aler   22   69    7   27    7   39    0   13   10   16   15    6   24   12    4 

  alez    0    0   13   10    0    0    0    4    8   16   10    2   11    2    2 

  altre   0    0    0    0    0    0    0   22    0    0   62   21    0   50    0 

  ame     1    1    1    7   25   27    0    0    8    3    0    0    6    0    5 

  amer   14   16    2   13   17   14    0    2    1    1    6    0    5   27    6 

  ami    72   10    8    6    7   27    1    3    1    5   10    0    4   16    3 

  amie    4   46   20   17   20   72    0    0    1    3    1    0    9   24   14 

 

 

Table 2. Excerpt from the same table where rows and columns

have been re-ordered according to CA  first principal axis

( first rows of the re-ordered table)

 

            rola stbr thom gorm loui bero theb beli amia rena erac rena dole esco verg

 

 respunt     49    8    5    0    0    0    0    0    0    0    0    0    0    0    0

    mult    186   88   58    2    0    0    0    0    0    0    0    0    0    0    0

    unt     104   52   21    8    0    0    0    0    0    0    0    0    0    0    0

    lur      93  144   38   10    0    0    0    0    0    0    0    0    0    0    0

     ad     442   65  100   26    0    0    0    0    0    0    0    0    0    0    0

    tuz      42   38   22    2    0    0    0    0    0    0    0    0    0    0    0

    tute     34    8   14    1    0    0    0    0    0    0    0    0    0    0    0

    vunt     19   27   18    0    0    0    0    0    0    0    0    0    0    0    0

    ben      97    1   40    3    0    0    0    0    0    0    0    0    0    0    0

    fud       0   51   23    4    0    0    0    0    0    0    0    0    0    0    0

    sun     231   38  130   40    0    0    0    0    0    0    0    0    0    0    0

    tut      58   50   28   20    0    0    0    0    0    0    0    0    0    1    1

    cum      46   83   63   10    0    0    0    2    0    0    1    0    0    0    0

     od      48   44   30    5    0    0    0    0    0    0    4    0    0    0    0

    mun      36   12   38    5    0    0    0    0    0    0    0    0    0    0    0

    dunc     17   38   41    7    0    0    0    0    0    0    0    0    0    0    0

    dunt     17    9   27    1    0    0    0    0    0    0    0    0    0    0    0

    sur      77   31   29   24    0    0    0    1    0    0    2    0    0    0    0

      e    1040  344  635  107    0    2    0    5    0    0    4    0    8   41    0

    sei      21   12   20    1    1    1    0    0    0    0    0    0    0    0    0

    nef       0   42   17    1    0    0    0    0    2    0    1    0    0    0    0

    pur      94   83  238   20    0    0    0    0    0    0    0    1    5    3    0

    vus      35   35  192   21    0    0    0    0    0    0    0    0    0    0    0

 

We clearly observe the exclusive vocabulary of the first four texts (roland, stbrend, thomas, gormont), but we also note some interesting exceptions in the right area of the table (notably the escoufle column). According to Lejeune-Dehousse (1935), several copyists were involved in copying the Escoufle manuscript.

 

Table 3. Last 25 rows of the same table where rows and columns

have been re-ordered according to CA first principal axis

 

             rola stbr thom gorm loui bero theb beli amia rena erac rena dole esco verg

 

   cis      0    0    0    0    0    1    0    7    1    0   30    0    1    30    0

comment      0    0    0    0    5    0    0    0    0    6    8    8    0    75    8

    tex      0    0    0    0    0    6    0    0    7    3    0    2   17    30    0

    çou      0    0    0    0    0    0    0    2    0    0   71    0    0    22    0

   velt      0    0    0    0    0    0    1   10    0    0   44    0    0    43    0

   tans      0    0    0    0    0    0    0   14    6    3   11    5    0    67    0

 maison      0    0    0    0    0    2    0    0    7    5    9   12    1    29    0

   anui      0    0    0    0    0    0    2    9    0    7   17    9   11    27    3

 afaire      0    0    0    0    0    2    5    6    0    8   11    4    1    47    0

   ainc      0    0    0    0    0    0    0   18    0    0   35    0    4    62    3

    ame      0    0    0    0    0    1    6    1    1    8   25    3    7    27    5

   biax      0    0    0    0    0    0    5    0   18    5    0    2    8    71    0

   ains      0    0    0    0    0    0    0   24    1    0   79    0    2   106    0

maniere      0    0    0    0    0    2    3    5    1    4   14    7   17    24    7

  moult      0    0    0    0    0    0    0    0  160    0    0    0    0   512    0

damoisele    0    0    0    0    0    1    6    7    0    0    0    1   17    40    1

 vallet      0    0    0    0    0    0    1    7    0    0    3    0   18    24    0

   tous      0    0    0    0    0    0    0    0    0    0   81    0    0    62    0

 ensamble    0    0    0    0    0    0    0    0    9    0    0    0   10    37    1

  assés      0    0    0    0    0    0    0    0    0    0   32    0    0    35    0

   ausi      0    0    0    0    0    0    2    5    0    3    1    2   12    31    5

   lués      0    0    0    0    0    0    0    2    0    0   10    0   57    48    0

samblant     0    0    0    0    0    0    0    0    6    0    4    0    5    25   15

    jou      0    0    0    0    0    0    0    0    0    0   32    1    0   134    0

  comme      0    0    0    0    1    0    0    1    0    1    1    0    0    71   18

 

We note at the bottom of the re-ordered table words that are absent from the first texts; however, we also see intermediate situations creating a continuum along the first axis. The presence of the raw data enables a more thorough interpretation than the graphical visualization of the principal plane.

 

4.3 Application to the lexical table without the first four texts/authors

 

Once the main heterogeneity factor has been found and analyzed (the presence of four authors or copyists making use of specific graphical forms), it is important to go further. The simplest way to continue the investigation is to eliminate the four authors that contributed most to the first principal axis and to carry out a new analysis on the remaining (941 x 11) table.

 

The new first axis found on this reduced table is, as expected, very close to the second factor of the global analysis. However, the situation is not always that simple, and eliminating texts does not in general amount to recovering a known axis. We see in tables 4 and 5 that the forms occupying the extreme ranks are not the same as those in tables 2 and 3.

 

Thus, this progressive “peeling” of the lexical table reveals, this time, new oppositions between dialects or regions (of authors or copyists).

 

 Table 4.  First rows (out of 941) of the lexical table (11 texts in columns) where rows and columns

have been re-ordered.

 

            esco  erac  beli  verg  dole  loui  amia  rena  rena  bero  theb

 

      k     266     6     8     0     0     0     0     0     1     0     0

    jou     134    32     0     0     0     0     0     0     1     0     0

     ki     184     2    19     0     0     0     0     0     0     0     0

contesse     47     1     0     0     2     0     0     0     0     0     0

  assés      35    32     0     0     0     0     0     0     0     0     0

   tous      62    81     0     0     0     0     0     0     0     0     0

   ains     106    79    24     0     2     0     1     0     0     0     0

   velt      43    44    10     0     0     0     0     0     0     0     1

cascuns      32    24    12     0     0     0     0     0     0     0     0

    cis      30    30     7     0     1     0     1     0     0     1     0

   ainc      62    35    18     3     4     0     0     0     0     0     0

   cose      25    76    23     0     0     0     0     0     0     0     0

 jamais      53    11     4     0     0     0     0     0     0     9     0

   gens      99    25    17     0     1     0    15     0     0     0     0

  quens     167     3     4     0    16     0     0     0     8     0     7

   voel      24    37     0     0     8     0     0     0     0     0     0

 

  Table 5.  Last rows (out of 941) for the lexical table (11 texts in columns) where rows and columns have been re-ordered according to CA first principal axis applied on 11 texts.

 

             esco  erac  beli  verg  dole  loui  amia  rena  rena  bero  theb

 

    val       3     5     0     0     2     3     1     2     5     0    33

   piez       0     0     0     3     7    28     0    22    15    16    26

   foiz       0     0     0     3     9     0     0    20    17    16    14

    pou       0     0     0     0     1    16     0     8    17     0    15

   mout       0     3     0    26   297     0     0   129   104     0   299

  granz       0     0     1     1    52    26     0    12     7     8    66

chascun       1     0     0     0    12     0    10     9     6    23    33

ainçois       0     0     1     1    18     0     0     8     5     0    25

  conme       0     8     0     0     1     0    10     9    23    15    62

   unne      11     0     0     0     0     0    11     0     0     0    70

 dedenz       0     0     0     2     8    10     1     0     0    16    33

    vet       0     0     0     0    36     0     0     0     1    31    61

   filz       0     0     0     0    12    21     0     5     0     6    50

    onc       0     0     0     0    10    21     0     1     0     2    42

   touz       0     0     0     0     3     0    37     8     6     0    83

   leur       4     0     3     0     0     0     0     0     1     0   146

      y       0     0     0     0     0     0     0     0     0     0    62

 

 

 

Once again, the observed exceptions may prove interesting whether it is a question of interpretation or simply a matter of assessing the basic documents.

 

5. Local statistical inference

 

While factorial maps are acknowledged to be invaluable tools for describing in broad terms the main association structures in lexical tables, their role in carrying out more detailed statistical inference is less well known. Their suggestive character is sometimes criticized for encouraging complacent or lax interpretations. Above all, the precise positions of the data points are rarely taken into account.

 

The confidence areas introduced and used in this section address those criticisms and give these visualizations a more scientific status.

 

 

5.1 Principle for bootstrap confidence areas

The bootstrap technique (cf. Efron and Tibshirani, 1993) allows confidence areas (in general, ellipses) to be drawn around the points represented on the principal maps, whether those points represent words or texts. The method consists in building n “replicates” of the sample by drawing the statistical units, that is, the occurrences of graphical forms, with replacement. In a replicate, some units will thus appear twice or more, while others will not appear at all.

These drawings create variability around the original data table. It has been shown, under weak hypotheses, that the variability observed in the n replicates has the same order of magnitude as the variability that would be observed in the parent population. In other words, we have n replicates of complex parameters (such as the eigenvectors and hence the factorial coordinates) and can derive from those replicates confidence areas for the parameters.
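The replicate-generation step can be sketched directly. The example below (numpy only; the occurrence counts are hypothetical, not taken from the corpus) draws 1000 replicates of a vector of word counts by resampling occurrences with replacement, and reads off a percentile interval for one relative frequency from the spread of the replicates.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_replicates(counts, n_rep=1000):
    """Resample the occurrences with replacement: each replicate is a
    multinomial draw with the observed total and observed frequencies."""
    counts = np.asarray(counts)
    total = counts.sum()
    return rng.multinomial(total, counts / total, size=n_rep)

# hypothetical occurrence counts of four graphical forms
obs = np.array([266, 134, 184, 47])
reps = bootstrap_replicates(obs)

# spread of the replicated relative frequency of the first form
freq = reps[:, 0] / obs.sum()
low, high = np.percentile(freq, [2.5, 97.5])
```

The interval `[low, high]` is a simple percentile confidence interval; the confidence ellipses of section 5.3 apply the same idea to two-dimensional factorial coordinates.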

 

5.2 The case of principal components and Singular Value Decomposition

 

5.2.1 Total bootstrap

There exist several variants of this method: total bootstrap consists of redoing a complete analysis for each replicate. However, the replicated axes are not necessarily equivalent from one replicate to the next; axis inversions or even rotations can occur. One must then make the equivalent axes coincide using so-called Procrustean analysis techniques.
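One common form of this re-alignment is the orthogonal Procrustes fit: find the rotation (possibly with a reflection, which handles axis inversions) that best superimposes a replicate's coordinates on the reference configuration. A minimal sketch with numpy, under the usual least-squares formulation:

```python
import numpy as np

def procrustes_align(ref, rep):
    """Rotate/reflect the replicated coordinates `rep` so that they best
    match the reference configuration `ref` in the least-squares sense
    (orthogonal Procrustes: R = U V' from the SVD of rep' ref)."""
    U, _, Vt = np.linalg.svd(rep.T @ ref)
    return rep @ (U @ Vt)
```

In particular, a replicate whose first axis has simply been inverted (one coordinate reflected) is mapped back onto the reference configuration exactly.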

 

5.2.2 Partial bootstrap

Partial bootstrap remedies this problem. It stems from the observation that the initial table is closer to the observed reality than any of the replicated tables, which are perturbations of the initial table. The analysis and principal planes of the initial table serve as a reference onto which all the replicated tables (rows and columns) are projected as supplementary elements. Intensive experiments (cf. Lebart et al., 2006) have shown the effectiveness of this method.
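In the case of correspondence analysis, projecting a replicated row as a supplementary element uses the standard CA transition formula: the scalar products of the row profile with the principal column coordinates, divided by the singular values. A minimal numpy sketch, assuming a table with no empty rows or columns:

```python
import numpy as np

def ca_reference(N, n_axes=2):
    """CA of the original table: principal row/column coordinates and
    singular values, i.e. the reference frame for partial bootstrap."""
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, d, Vt = np.linalg.svd(S, full_matrices=False)
    G = Vt[:n_axes].T * d[:n_axes] / np.sqrt(c)[:, None]  # principal col. coords
    F = U[:, :n_axes] * d[:n_axes] / np.sqrt(r)[:, None]  # principal row coords
    return F, G, d[:n_axes]

def project_rows(N_rep, G, d):
    """Project the rows of a replicated table as supplementary elements
    onto the reference axes (CA transition formula)."""
    profiles = N_rep / N_rep.sum(axis=1, keepdims=True)
    return profiles @ G / d
```

A convenient check: projecting the original table's own rows through `project_rows` recovers their active coordinates exactly, since the transition formula holds for active points.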

 

 

5.3 The case of correspondence analysis of lexical tables

 

Let us recall that, in the case of a lexical table, the technique consists of drawing with replacement the occurrences of the retained forms. This drawing follows a multinomial distribution having as many categories as the table has cells, with theoretical frequencies given by the cell frequencies. The rows and columns of the replicated tables are then projected as supplementary elements onto the principal axes of the analysis of the actual lexical table (partial bootstrap). The principal components analysis of the “replicate clouds” corresponding to each element (row or column) gives the confidence ellipses we are looking for.
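The final step, the PCA of each replicate cloud, amounts to an eigen-decomposition of the cloud's 2x2 covariance matrix; the ellipse's half-axes are the square roots of the eigenvalues scaled by a chi-square quantile. A minimal sketch (numpy only; the simulated Gaussian cloud below merely stands in for a replicate cloud):

```python
import numpy as np

def confidence_ellipse(points, chi2_95=5.991):
    """Centre, half-axis lengths and orientation of the ~95% confidence
    ellipse of a 2-D cloud (5.991 = 95% quantile of chi-square, 2 d.f.)."""
    centre = points.mean(axis=0)
    # PCA of the cloud = eigen-decomposition of its covariance matrix
    evals, evecs = np.linalg.eigh(np.cov(points, rowvar=False))
    half_axes = np.sqrt(chi2_95 * evals)          # minor axis, then major axis
    angle = np.arctan2(evecs[1, 1], evecs[0, 1])  # direction of major axis
    return centre, half_axes, angle
```

Drawing, for each word or text, the ellipse returned by this function around its cloud of replicated coordinates produces displays such as figures 3 and 4.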

 

5.4 Two examples of confidence areas

We want to show, with the following two examples, that exploratory analyses of lexical tables can be used not only to draw out broad structural features but also to focus on precise points and to test specific hypotheses. The basic factorial analysis is the one carried out on the 941 words and the 15 texts, with a frequency threshold of 40.

 

Figure 3 illustrates, for example, the positions of different demonstratives and their written forms in the first principal plane. The confidence areas make it possible to assess the interpretation of these positions. While the graphical form “cels” is quite characteristic of the “anglo-norman” group, other positions lead to overlapping areas (group: cest, cel, ceste; group: cele, celui, cestui).

 

 


Figure 3. Confidence areas of different demonstratives in the principal plane (orientation reversed compared to the plane in figure 1).

 

 

On the other hand, the opposition between “cel” and “ces” along the vertical axis can be interpreted in terms of preferential distribution among some authors.

 

We observe, with respect to the opposition “cels/cel, ces”, that “cels” is always a plural object pronoun in anglo-norman (where it is opposed to “celi”, the singular object pronoun).

 

“Cel” is overwhelmingly a singular object determiner, opposed to “ces”, the plural object determiner.

 

Figure 4 shows (on the basis of a smaller number of texts, after grouping graphical forms, elisions…) that the distributions of the words “et” and “que” at the beginning of a verse are significantly distinct from the distributions of the same words located elsewhere in a verse (the ellipses being clearly separated). It is thus possible to interpret distances that would a priori have been deemed unimportant.

 

This significant difference in distribution could allow us to measure the structuring effect of these lexical elements and help strengthen the hypothesis that the evolution goes from paratactic subordination (by simple juxtaposition) to hypotactic subordination (with the conjunction “que”, for example) [see Moignet (1973), p. 367].

 



Figure 4. Comparison between the words “que” and “et” appearing anywhere in a verse (bottom of arrows) and the same words appearing at the beginning of a verse (tip of arrows).

 

An exhaustive study of the forms of “et” in the corpus demonstrates indeed that the position of this element corresponds to distinct syntactic behaviours. This conjunction is known to have at least two roles in medieval French grammar. “Et” coordinates, by addition, phrases (nominal, verbal, etc.) or sentences. Moreover, “et” can appear at the beginning of a sentence without creating a syntactic coordination with the preceding sentence (Moignet, 1973, pp. 330-331). Some grammarians therefore speak of an element having a strictly discursive value. Analysis of the corpus reveals a syntactic difference between “et” at the beginning of a verse and “et” appearing elsewhere. The corpus contains over 13,000 occurrences of “et”, of which 5,300 are found at the beginning of a verse. Most of the “et” appearing in the middle of the verse are of the first type:

 

       Vos li durrez urs e leons e chens,

       Set cenz camelz e mil hosturs muers,

       D'or e d'argent .IIII.C. muls cargez                                La chanson de Roland

 

On the other hand, we found about 400 constructions similar to the following examples where “et” at the beginning of the line acts as a discourse marker:

 

       Tristran l' entent, fist un sospir

       Et dist: "Roïne de parage,

       Tornon ariere a l' ermitage;/                                       Tristan

 

       Et dist li rois: "De gréz et volentiers,/                                          Ami et Amile

 

       Avrum nos la victorie del champ ?"

       E cil respunt : "Morz estes, Baligant !                       La chanson de Roland

 

This particular syntax, where “et” as a discourse marker precedes declarative verbs such as dire (say) or répondre (answer) to introduce direct speech, constitutes one of the characteristics of “et” at the beginning of a line [there are about 8 exceptions]. These examples highlight paratactic subordination, where the implicit argument of the verb of the first sentence, the citing sentence, appears in the following sentence, the cited sentence. The two sentences are syntactically independent.

 

There is a gradual transition, during the evolution of French in the medieval period, from paratactic subordination in direct speech in La Chanson de Roland (early 12th century) to explicit subordination with “que” in indirect speech.

 

       Et qui li dist: "Fole, demeure.

       Vels tu hounir tot ton lignage?                                       Escoufle

 

       Et li cuens dist qu' a tous donroit/

       Reubes, chevax, cels qui n' en orent.                                           Escoufle

 

 

          Et dist li quens qu' il se departent                                            Escoufle

      

       Se li a le castel mostré.

       Por l'esgarder sont aresté

       Et dient que bials est et gens,

       Millor n'en ot ne rois ne quens                                      Bel Inconnu

 

       Se li demande qu'el fera.                                 

       Et dist que ele s'en ira                                            Bel Inconnu

 

As the examples above illustrate, direct speech alternates with explicit subordination in Escoufle and Bel Inconnu, both 13th-century novels.

 

6. Conclusion

 

This research, carried out on a homogeneous corpus (octosyllabic or decasyllabic verses), demonstrates through selected examples that it is possible to characterize syntactic features without prior categorization, and therefore to explore weakly enriched texts to substantial advantage. This exploration is neither intuitive nor impressionistic. It rests on two useful adjuncts of the correspondence analysis of lexical tables: seriation, which displays the original raw data in a context where they become more meaningful; and confidence areas, which allow valid patterns to be extracted and illusory proximities to be rejected.

 

We were able to note that there was little intra-textual variation among the high-frequency forms. This type of analysis highlights the need to document the texts’ external characteristics (manuscripts, copyists, location…).

 

Finally, we hope that our approach will make it possible to combine sound statistical methods with the variationist analyses that have been used in language change theory over the last decades.[1]

 

This method should allow us in a subsequent stage to highlight typological differences, for instance, phenomena whose evolution differs depending on the genre.

 

The constructions found in the texts were verified with the software SATO (site www.ling.uqam.ca/ato).

 

 

7. Bibliography

 

Benzécri J.-P. & collaborateurs (1981). Pratique de l'analyse des données, Linguistique & Lexicologie, vol. 3. Dunod, Paris.

Bertin J. (1973). La graphique et le traitement graphique de l’information. Flammarion, Paris.

Dees, A. (1987). Atlas des formes linguistiques des textes littéraires de l’ancien français. Max Niemeyer Verlag, Tübingen.

Dupuis F., Lemieux M. (2006). Vérification d'hypothèse(s) et choix de corpus. À la quête du sens. ENS Éditions, Paris.

Dupuis F., Lemieux M., Gosselin D. (1993). Conséquences de la sous-spécification des traits de Agr dans l'identification de pro. Language Change and Variation, vol. 3, no. 3, pp. 275-299.

Efron B., Tibshirani R. J. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York.

Hartigan J. A. (1972). Direct clustering of a data matrix. Journal of the American Statistical Association, 67, 123-129.

Hill M. O. (1974). Correspondence analysis: a neglected multivariate method. Applied Statistics, 23, 340-354.

Lebart L., Salem A., Berry E. (1998). Exploring Textual Data, Kluwer Ac. Publisher, Dordrecht.

Lebart L., Morineau A., Warwick K. (1984). Multivariate Descriptive Statistical Analysis: Correspondence Analysis for Large Matrices. John Wiley, New York.

Lejeune-Dehousse, R. (1935). L’oeuvre de Jean Renard : Contribution à l’étude du genre romanesque au Moyen Age. Genève, Slatkine Reprint.

Lerman I. C. (1972). Analyse de phénomène de la sériation. Mathématique et Sciences Humaines, 38, 39-57.

Lerman I. C. (1981). Classification et Analyse Ordinale des Données. Dunod, Paris.

Marchello-Nizia C. (2006). From personal to spatial deixis: The semantic evolution of demonstratives from Latin to French. In: M. Hickman and S. Robert (eds), Space in Languages: Linguistic Systems and Cognitive Categories. Benjamins Publishing Company, Amsterdam, chapter 5.

Moignet, G. (1973). Grammaire de l’ancien français. Klincksieck, Paris.

Morin Y.-C. (to appear). Histoire du corpus d'Amsterdam : le traitement des données dialectales. Le Nouveau Corpus d'Amsterdam, Actes de l'atelier de Lauterbad. Steiner, Stuttgart.

Petrie W. M. F. (1899). Sequence in prehistoric remains. Journal of the Anthropological Institute of Great Britain and Ireland. 29, 295-301.



[1] For an example of such an approach, see Dupuis et al. (1993).