Edited by: Taha Yasseri, University of Oxford, UK

Reviewed by: Martin Gerlach, Northwestern University, USA; Tom Nicholls, University of Oxford, UK

Specialty section: This article was submitted to Big Data, a section of the journal Frontiers in Digital Humanities

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

This paper presents a methodology for analyzing linguistic changes in a given textual corpus that overcomes two common problems in corpus linguistics studies: the monotonic increase of the corpus size over time and the presence of noise in the textual data. In addition, our method makes it possible to target the linguistic evolution of the corpus more precisely, rather than other aspects such as noise fluctuation or topic evolution. A corpus formed by two newspapers, “La Gazette de Lausanne” and “Le Journal de Genève,” is used, providing 4 million articles from 200 years of archives. We first perform some classical measurements on this corpus in order to provide indicators and visualizations of linguistic evolution. We then define the concepts of lexical kernel and word resilience to face the two challenges of noise and corpus size fluctuations. This paper ends with a discussion based on the comparison of results from linguistic change analysis and concludes with possible future work continuing in that direction.

This research investigates methods to study linguistic evolution using a corpus of scanned newspapers, continuing the work presented in a previous conference paper (Buntinx et al.).

Considering the lack of data for Le Journal de Genève for the years 1837, 1917, 1918, and 1919, we left these years out in all further graphs and analyses. In addition, some years had to be removed because the scanning quality was too poor (1834, 1835, 1859, and 1860 for JDG and 1808 for GDL).

A straightforward approach to the problem consists in computing a textual distance between subsets of the corpora. One could, for instance, easily compute the so-called Jaccard distance (Jaccard). Given two corpus subsets S_{1} and S_{2}, and their lexica, i.e., the lists of unique (non-lemmatized) words, L(S_{1}) ≡ L_{1} and L(S_{2}) ≡ L_{2}, the Jaccard distance D(S_{1}, S_{2}) is defined as follows:

D(S_{1}, S_{2}) = 1 − |L_{1} ∩ L_{2}| / |L_{1} ∪ L_{2}|
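As a minimal sketch (assuming lexica are represented as Python sets of unique words; the toy lexica below are invented for illustration, not taken from the corpora), the Jaccard distance can be computed as follows:

```python
def jaccard_distance(lexicon1, lexicon2):
    """Jaccard distance: 1 minus the ratio of shared words to total distinct words."""
    union = lexicon1 | lexicon2
    if not union:
        return 0.0  # two empty lexica are identical by convention
    return 1 - len(lexicon1 & lexicon2) / len(union)

# Toy lexica: one shared word ("de") out of seven distinct words overall.
l1 = {"le", "journal", "de", "geneve"}
l2 = {"la", "gazette", "de", "lausanne"}
print(jaccard_distance(l1, l2))  # 1 - 1/7 ≈ 0.857
```

Identical lexica yield a distance of 0 and fully disjoint lexica a distance of 1, which matches the intuition of the measure.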

In the same way, other distances could also be explored, such as the divergence of Kullback and Leibler (Kullback and Leibler).
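Unlike the Jaccard distance, the Kullback–Leibler divergence compares word frequency distributions rather than mere word presence. A minimal sketch (the two distributions are invented for illustration; in practice frequencies must be smoothed so that no word has zero probability in the reference distribution):

```python
import math

def kl_divergence(p, q):
    """D(p || q) for two word-frequency distributions over the same vocabulary.
    Note: asymmetric and unbounded, hence not a metric like the Jaccard distance."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)

p = {"le": 0.5, "journal": 0.3, "de": 0.2}
q = {"le": 0.4, "journal": 0.4, "de": 0.2}
print(kl_divergence(p, q))  # > 0; equals 0 only when p == q
```

Because the divergence is asymmetric, a symmetrized variant (e.g., the average of D(p‖q) and D(q‖p)) would be needed to obtain distance-like behavior.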

The Jaccard distance is an intuitive measure that determines the similarity of two texts using the relative size of their common lexicon. This distance is complementary to the notion of lexical connexion (Muller).

The Jaccard distance is a metric (Levandowsky and Winter), satisfying the following three properties:

Separation: D(L_{1}, L_{2}) = 0 ⟺ L_{1} = L_{2};

Symmetry: D(L_{1}, L_{2}) = D(L_{2}, L_{1});

Triangular inequality: D(L_{1}, L_{3}) ≤ D(L_{1}, L_{2}) + D(L_{2}, L_{3}).
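These three axioms can be checked numerically on small toy lexica (the sets below are illustrative):

```python
def jaccard(a, b):
    """Jaccard distance between two lexica given as sets."""
    union = a | b
    return 1 - len(a & b) / len(union) if union else 0.0

lexica = [{"un", "deux"}, {"deux", "trois"}, {"un", "trois", "quatre"}]

for a in lexica:
    assert jaccard(a, a) == 0.0                 # separation: d(x, x) = 0
    for b in lexica:
        if a != b:
            assert jaccard(a, b) > 0.0          # separation: distinct sets differ
        assert jaccard(a, b) == jaccard(b, a)   # symmetry
        for c in lexica:
            # triangular inequality (small tolerance for floating point)
            assert jaccard(a, c) <= jaccard(a, b) + jaccard(b, c) + 1e-12
print("all three metric axioms hold on this sample")
```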

Since the Jaccard distance is based only on the presence or absence of words in the corpus subsets, noise can affect the measure of linguistic evolution. In order to reduce this effect, L(S_{1}) and L(S_{2}) are filtered to keep only the words whose frequency is greater than 1/100,000. However, this frequency threshold is quite arbitrary, and the filtered data still present OCR errors and noise. The computation of the Jaccard distance between all subsets yields a symmetric matrix whose entry (i, j) is the distance between the lexica of years y_{i} and y_{j}.
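A sketch of the filtering and matrix computation (the 1/100,000 threshold follows the text; the tiny per-year token streams are invented stand-ins for the OCRed articles):

```python
from collections import Counter

def filtered_lexicon(tokens, threshold=1 / 100_000):
    """Lexicon keeping only words whose relative frequency exceeds the threshold,
    which removes part of the OCR noise (rare garbled tokens)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w for w, c in counts.items() if c / total > threshold}

def jaccard(a, b):
    union = a | b
    return 1 - len(a & b) / len(union) if union else 0.0

# Hypothetical per-year token streams standing in for the scanned archives.
years = {1900: ["le", "journal", "le", "de"],
         1901: ["la", "gazette", "de", "la"]}
lexica = {y: filtered_lexicon(t) for y, t in years.items()}

# Symmetric distance matrix with a zero diagonal, as described in the text.
matrix = {(a, b): jaccard(lexica[a], lexica[b]) for a in years for b in years}
```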

The values on the matrix’s diagonal are equal to zero by definition (separation property). We observe the expected behavior of the values outside the diagonal, which are highly correlated with the difference between the compared years. In addition, the level lines of the heatmap suggest the hypothesis that linguistic evolution is not linear but proceeds period by period. Indeed, in the case of a linear evolution, the level lines would be parallel to the diagonal of the matrix. The same data are presented in a more convenient form in Figure

If we restrict the Jaccard distance matrix to the data from the most stable years in terms of size, we can recompute the distances D(y_{i}, y_{i+n}).


On the distance D(y_{i}, y_{i+1}) shown in Figure , the same analysis can be extended to the distances D(y_{i}, y_{i+n})

The uneven distribution of the sizes of the corpus subsets (Figure ) makes the interpretation of these measures difficult.

These difficulties of interpretation motivate the exploration of another, possibly sounder approach to the same problem. We define the notion of the lexical kernel.

Definition: the kernel K_{x,y,C} is the subset of unique words common to every year of the period starting in year x and finishing in year y of a corpus C.
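Assuming per-year lexica stored as Python sets keyed by year (the toy data below are illustrative), the kernel is simply the intersection of the lexica of all years in the period:

```python
def kernel(yearly_lexica, x, y):
    """K_{x,y}: the words present in the lexicon of every year from x to y.
    Missing years (such as the gaps in the archives) would have to be
    skipped explicitly in practice."""
    words = set(yearly_lexica[x])
    for year in range(x + 1, y + 1):
        words &= yearly_lexica[year]
    return words

lexica = {1804: {"le", "de", "guerre"},
          1805: {"le", "de", "paix"},
          1806: {"le", "de", "guerre", "paix"}}
print(sorted(kernel(lexica, 1804, 1806)))  # ['de', 'le']
```

Note that extending the period can only shrink the kernel: K_{x,y+1,C} is always a subset of K_{x,y,C}.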

The two kernels considered are K_{1804,1997,GDL} and K_{1826,1997,JDG}.

It is interesting to note that 4,464 words are common to the two kernels. Figure

Figure: Comparison of the kernels K_{1804,1997,GDL} and K_{1826,1997,JDG}.

Extending the notion of a kernel, it is rather easy to study the resilience of a given word.

Definition: the resilience set R_{d,C} is the union of all kernels K_{x,y,C} corresponding to a duration y − x = d.

The definition of word resilience is naturally derived from the resilience set notion.

Definition: the resilience of a word w in a corpus C is the maximal duration d such that w belongs to R_{d,C}.

For instance, R_{100,GDL} contains all the words belonging to at least one kernel of duration 100 in GDL. By construction, R_{1,C} ⊇ R_{2,C} and, more generally, R_{i,C} ⊇ R_{i+1,C}.
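A sketch of the resilience set and word resilience under the same set-based representation (the toy data are invented; real computations would need to handle the missing years in the archives):

```python
def kernel(lexica, x, y):
    """Words common to every yearly lexicon from year x to year y."""
    words = set(lexica[x])
    for year in range(x + 1, y + 1):
        words &= lexica[year]
    return words

def resilience_set(lexica, d):
    """R_d: union of all kernels K_{x, x+d} spanning a duration of d years."""
    r = set()
    for x in sorted(lexica):
        if all(x + k in lexica for k in range(d + 1)):
            r |= kernel(lexica, x, x + d)
    return r

def word_resilience(lexica, word):
    """Largest duration d such that the word belongs to R_d (-1 if absent)."""
    best = -1
    for d in range(max(lexica) - min(lexica) + 1):
        if word in resilience_set(lexica, d):
            best = d
    return best

lexica = {1804: {"le", "guerre"}, 1805: {"le", "paix"}, 1806: {"le", "guerre"}}
print(word_resilience(lexica, "le"))      # 2: "le" appears in every year
print(word_resilience(lexica, "guerre"))  # 0: never in two consecutive years
```

Since R_{i} ⊇ R_{i+1}, membership is monotone in d, so the loop could stop at the first duration where the word disappears.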

Figure: Size of the resilience set R_{d} as a function of the duration d.

The GDL resilience curve in Figure shows the evolution of the size of R_{d} with the duration d.

These definitions pave the way for a formulation of the study of linguistic change in terms of the algebra of sets. Instead of analyzing what changes rapidly in the language, we study its most stable elements through the notions of kernel and word resilience. We can then apply a new definition of distance to the set of the most resilient words, which is the maximum-duration kernel. Indeed, reducing the analyzed set of words to the most resilient ones allows us to exclude noise efficiently. In addition, the sensitivity of the distance to the corpus size is reduced, and the method targets linguistic evolution more precisely, because a decline in the use of resilient words can be the result of semantic evolution, punctual journalistic events, or linguistic diversity induced by the evolution of the newspaper layout. The number of words is the same for each year, but the corpus size influences the frequency of kernel words when the size is small: the smaller the corpus, the higher the frequency fluctuations. In order to reduce these effects, we defined a distance based on word ranks ordered by their frequencies.

In order to compare the same kernel across two different years, let us consider its words ordered according to their frequency in each year. We may then define the distance between the two orderings as the computational cost of reordering one into the other. Again, we require a measure that satisfies the mathematical properties of a distance. One way to do so is to define the distance as the sum, over all words, of the differences between their positions in the two lists.

Definition: D(L_{i}, L_{j}) = Σ_{w} |pos_{i}(w) − pos_{j}(w)|, where pos_{j}(w) denotes the position of the word w in the list L_{j}.
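A sketch of this rank-based distance (the frequencies are invented for illustration; ties between equal frequencies would need a deterministic tie-break in practice):

```python
def rank_distance(freq_a, freq_b, kernel_words):
    """Sum over kernel words of the absolute difference between their
    frequency ranks in two years -- a reordering cost between the two lists."""
    rank_a = {w: i for i, w in enumerate(sorted(kernel_words, key=lambda w: -freq_a[w]))}
    rank_b = {w: i for i, w in enumerate(sorted(kernel_words, key=lambda w: -freq_b[w]))}
    return sum(abs(rank_a[w] - rank_b[w]) for w in kernel_words)

kernel_words = {"le", "de", "la"}
f1900 = {"le": 30, "de": 20, "la": 10}   # order: le, de, la
f1901 = {"le": 10, "de": 25, "la": 30}   # order: la, de, le
print(rank_distance(f1900, f1901, kernel_words))  # |0-2| + |1-1| + |2-0| = 4
```

The distance is zero for identical orderings and symmetric by construction, since it depends only on the two permutations of the same word set.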

We have applied this new distance definition to the list of words from the kernel set, ordered by frequency for each year, and we have plotted the same analysis as for the Jaccard distance in Figure , i.e., the distances D(y_{i}, y_{i+n})


The Jaccard distance represented in Figure

When comparing the Jaccard distance and the kernel distance in Figures

Several distance definitions have been applied to the corpora of GDL and JDG in order to quantify linguistic changes. We first used the Jaccard distance on the whole corpus with a frequency filter. Our observations from the Figures

We defined a kernel distance based on the comparison of the frequency ranks of kernel words between two years. Surprisingly, Figures

From our experiments on the two corpora of GDL and JDG, we have made a series of observations that support the existence of a continuous and relatively constant linguistic drift. We tried several methods to quantify this linguistic change, successfully overcoming the problems of noise, corpus size fluctuation, and precise targeting of linguistic change instead of other cumulative effects on the corpora’s textual data, such as topics, OCR quality, or noise evolution. While these measures provide a way to quantify the linguistic drift, we do not have any serious indicator or proof of a potential acceleration or deceleration of language change over the periods 1804–1997 (GDL) and 1826–1997 (JDG). However, these methods should be applied to a corpus where data are available after 1997 in order to verify whether this observed stability is maintained during the period 1998–2016, when many technologies mediating our language have potentially accelerated linguistic evolution (Kaplan).

Large databases of scanned newspapers open new avenues for studying linguistic evolution (Westin and Geisler).

In this paper, we introduced the notion of a kernel as a possible approach to studying linguistic changes under the lens of linguistic stability. Focusing on stable words and their relative distribution is likely to make interpretations more robust. Results were computed from two independent corpora, and it is striking to see that most of the results obtained from each of them are extremely similar. In particular, the kernels’ compositions in terms of grammatical word typologies are very similar.

The kernel distance, applied to the kernel words in order to measure linguistic changes, has proven robust to OCR errors and noise. In addition, we observed that the study of kernel words allows the extraction of the same linguistic distance information as the Jaccard distance applied to the whole corpus. This suggests that our methods are indeed measuring general linguistic phenomena beyond the specificity of the corpora chosen for this study. Future work should address the case where the corpus kernel size is too small, and implement a distance measuring linguistic change between subsets of resilient words that are not necessarily part of the kernel. In addition, our results still need to be confirmed by subsequent studies involving other corpora, such as non-journalistic texts and texts written in other languages.

The three authors have contributed equally to the conception and design of this work through the discussion of ideas and results. VB has performed data acquisition, computation and analysis, and visualizations of computed results, and has written the article. CB has provided visualizations of computed results and suggested reducing the analyzed set of words to those shared by all subcorpora. FK has provided a deeper formalization of the developed concepts of kernels and word resilience. The three authors have participated in the process of reviewing the article’s final version, ensuring its accuracy and integrity.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The reviewer TN and handling Editor declared their shared affiliation, and the handling Editor states that the process nevertheless met the standards of a fair and objective review.

We thank the team of Le Temps newspaper and the BNS (Bibliothèque Nationale Suisse) for giving us the opportunity to work on those 200 years of archives.

This study is funded by FNS and is part of the project “How algorithms shape language,” number 149758.