Papers by Christian Bentz
The choice associated with words is a fundamental property of natural languages. It lies at the heart of quantitative linguistics, computational linguistics, and language sciences more generally. Information theory gives us tools to measure precisely the average amount of choice associated with words, the word entropy. Here we use three parallel corpora, encompassing ca. 450 million words in 1916 texts and 1259 languages, to tackle some of the major conceptual and practical problems of word entropy estimation: dependence on text size, register, style and estimation method, as well as non-independence of words in co-text. We present three main results: 1) a text size of 50K tokens is sufficient for word entropies to stabilize throughout the text; 2) across languages of the world, word entropies display a unimodal distribution that is skewed to the right, which suggests a trade-off between the learnability and expressivity of words across languages; 3) there is a strong linear relationship between unigram entropies and entropy rates, suggesting that they are inherently linked. We discuss the implications of these results for studying the diversity and evolution of languages from an information-theoretic point of view.
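As an illustration of the basic quantity involved, here is a minimal sketch of the plug-in (maximum likelihood) unigram word entropy estimate; the tokenized toy text is hypothetical, and the paper's actual estimators are more sophisticated precisely because this naive estimate depends on sample size.

```python
import math
from collections import Counter

def word_entropy(tokens):
    """Plug-in (maximum likelihood) estimate of unigram word entropy in bits.

    Note: this naive estimator is biased for small samples, which is why
    text size matters for stable entropy estimates."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A uniform distribution over 4 word types yields log2(4) = 2 bits.
tokens = ["a", "b", "c", "d"] * 25  # 100 tokens, 4 equiprobable types
print(word_entropy(tokens))  # 2.0
```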
The morphological complexity of languages differs widely and changes over time. Pathways of change are often driven by the interplay of multiple competing factors, and are hard to disentangle. We here focus on a paradigmatic scenario of language change: the reduction of morphological complexity from Latin towards the Romance languages. To establish a causal explanation for this phenomenon, we employ three lines of evidence: 1) analyses of parallel corpora to measure the complexity of words in actual language production, 2) applications of NLP tools to further tease apart the contribution of inflectional morphology to word complexity, and 3) experimental data from artificial language learning, which illustrate the learning pressures at play when morphology simplifies. These three lines of evidence converge to show that pressures associated with imperfect language learning are good candidates to causally explain the reduction in morphological complexity in the Latin-to-Romance scenario. More generally, we argue that combining corpus, computational and experimental evidence is the way forward in historical linguistics and linguistic typology.
Language complexity is an intriguing phenomenon argued to play an important role in both language learning and processing. The need to compare languages with regard to their complexity resulted in a multitude of approaches and methods, ranging from accounts targeting specific structural features to global quantification of variation more generally. In this paper, we investigate the degree to which morphological complexity measures are mutually correlated in a sample of more than 500 languages of 101 language families. We use human expert judgements from the World Atlas of Language Structures (WALS), and compare them to four quantitative measures automatically calculated from language corpora. These consist of three previously defined corpus-derived measures, which are all monolingual, and one new measure based on automatic word-alignment across pairs of languages. We find strong correlations between all the measures, illustrating that both expert judgements and automated approaches converge to similar complexity ratings, and can be used interchangeably.
The average uncertainty associated with words is an information-theoretic concept at the heart of quantitative and computational linguistics. The entropy has been established as a measure of this average uncertainty, also called average information content. We here use parallel texts of 21 languages to establish the number of tokens at which word entropies converge to stable values. These convergence points are then used to select texts from a massively parallel corpus, and to estimate word entropies across more than 1000 languages. Our results help to establish quantitative language comparisons, to understand the performance of multilingual translation systems, and to normalize semantic similarity measures.
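One simple way to operationalize such a convergence point is sketched below, under the assumption that "stable" means the running entropy estimate changes by less than a small tolerance between consecutive sample sizes; this is a heuristic for illustration, not the paper's exact criterion, and the step size and tolerance are arbitrary choices.

```python
import math
from collections import Counter

def entropy_bits(tokens):
    """Plug-in unigram entropy estimate in bits."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def convergence_point(tokens, step=1000, tol=0.01):
    """Smallest prefix size at which the running entropy estimate changes
    by less than `tol` bits between consecutive steps of `step` tokens.
    Returns None if the estimate never stabilizes within the text."""
    prev = None
    for n in range(step, len(tokens) + 1, step):
        h = entropy_bits(tokens[:n])
        if prev is not None and abs(h - prev) < tol:
            return n
        prev = h
    return None

# A maximally regular toy text stabilizes at the second checkpoint.
tokens = ["a", "b", "c", "d"] * 2500  # 10,000 tokens
print(convergence_point(tokens))  # 2000
```

Real texts stabilize much later, which is why the paper needs tens of thousands of tokens per language.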
We present THE GLOTTOLOG DATA EXPLORER, an interactive web application in which the world's languages are mapped using a JavaScript library in the 'Shiny' framework for R (Chang et al., 2016). The world's languages and major dialects are mapped using coordinates from the Glottolog database (Hammarström et al., 2016). The application is primarily intended to portray the endangerment status of the world's languages, and hence the default map shows the languages colour-coded for this factor. Subsequently, the user may opt to hide (or reintroduce) data subsets by endangerment status, and to resize the datapoints by speaker counts. Tooltips allow the user to view language family classification and link to the relevant Glottolog webpage for each entry. We provide a data table for exploration of the languages by various factors. The web application is freely available at http://cainesap.shinyapps.io/langmap
We announce the release of the CROWDED CORPUS: a pair of speech corpora collected via crowdsourcing, containing a native speaker corpus of English (CROWDED_ENGLISH), and a corpus of German/English bilinguals (CROWDED_BILINGUAL). Release 1 of the CROWDED CORPUS contains 1000 recordings amounting to 33,400 tokens collected from 80 speakers and is freely available to other researchers. We recruited participants via the Crowdee application for Android. Recruits were prompted to respond to business-topic questions of the type found in language learning oral tests. We then used the CrowdFlower web application to pass these recordings to crowdworkers for transcription and annotation of errors and sentence boundaries. Finally, the sentences were tagged and parsed using standard natural language processing tools. We propose that crowdsourcing is a valid and economical method for corpus collection, and discuss the advantages and disadvantages of this approach.
Words that are used more frequently tend to be shorter. This statement is known as Zipf's law of abbreviation. Here we perform the widest investigation of the presence of the law to date. In a sample of 1262 texts and 986 different languages - about 13% of the world's language diversity - a negative correlation between word frequency and word length is found in all cases. In line with Zipf's original proposal, we argue that this universal trend is likely to derive from fundamental principles of information processing and transfer.
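The core measurement behind the law can be sketched as a correlation between type frequency and type length. The toy corpus and the use of Pearson's r below are illustrative assumptions on my part; the study itself works with large parallel texts and its own choice of correlation statistic.

```python
import math
from collections import Counter

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from scratch."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def abbreviation_correlation(tokens):
    """Correlate type frequency with type length (in characters).
    A negative value is the signature of Zipf's law of abbreviation."""
    counts = Counter(tokens)
    types = list(counts)
    return pearson([counts[w] for w in types], [len(w) for w in types])

# Toy corpus: short words frequent, long words rare.
toy = ["a"] * 50 + ["the"] * 30 + ["house"] * 10 + ["nevertheless"] * 2
print(abbreviation_correlation(toy))  # negative, as the law predicts
```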
The quantitative measurement of language complexity has witnessed a recent rise of interest, not least because language complexities reflect the learning constraints and pressures that shape languages over historical and evolutionary time. Here, an information-theoretic account of measuring language complexity is presented. Based on the entropy of word frequency distributions in parallel text samples, the complexities of overall 646 languages are estimated. A large-scale finding of this analysis is that languages just above the equator exhibit lower complexity than languages further away from the equator. This geo-spatial pattern is here referred to as the Low-Complexity-Belt (LCB). The statistical significance of the positive latitude/complexity relationship is assessed in a linear regression and a linear mixed-effects regression, suggesting that the pattern holds between different families and areas, but not within different families and areas. The lack of systematic within-family effects is taken as potential evidence for a phylogenetically “deep” explanation. The pressures shaping language complexities probably pre-date the expansion of language families from their proto-languages. Large-scale pre-historic contact around the equator is tentatively given as a possible factor involved in the evolution of the LCB.
Cognitive Science, 2014
This study presents original evidence that abstract and concrete concepts are organized and represented differently in the mind, based on analyses of thousands of concepts in publicly available data sets and computational resources. First, we show that abstract and concrete concepts have differing patterns of association with other concepts. Second, we test recent hypotheses that abstract concepts are organized according to association, whereas concrete concepts are organized according to (semantic) similarity. Third, we present evidence suggesting that concrete representations are more strongly feature-based than abstract concepts. We argue that degree of feature-based structure may fundamentally determine concreteness, and we discuss implications for cognitive and computational models of meaning.
Languages across the world exhibit Zipf's law of abbreviation, namely more frequent words tend to be shorter. The generalized version of the law - an inverse relationship between the frequency of a unit and its magnitude - holds also for the behaviours of other species and the genetic code. The apparent universality of this pattern in human language and its ubiquity in other domains calls for a theoretical understanding of its origins. To this end, we generalize the information-theoretic concept of mean code length as a mean energetic cost function over the probability and the magnitude of the types of the repertoire. We show that the minimization of that cost function and a negative correlation between probability and the magnitude of types are intimately related.
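The link between cost minimization and negative correlation can be illustrated with a toy exhaustive search: taking mean cost as the probability-weighted sum of magnitudes, the cost-minimizing assignment necessarily pairs high-probability types with small magnitudes (by the rearrangement inequality). The probabilities and lengths below are hypothetical values chosen only for illustration.

```python
from itertools import permutations

# Hypothetical probabilities of four types, sorted in descending order,
# and four available magnitudes (e.g. code lengths) to assign to them.
probs = [0.5, 0.25, 0.15, 0.10]
lengths = [1, 2, 3, 4]

def mean_cost(p, m):
    """Mean energetic cost: E[L] = sum_i p_i * m_i."""
    return sum(pi * mi for pi, mi in zip(p, m))

# Exhaustively search all assignments of magnitudes to types.
best = min(permutations(lengths), key=lambda m: mean_cost(probs, m))
print(best)  # (1, 2, 3, 4): the shortest magnitude goes to the most probable type
```

Since `probs` is sorted in descending order, the optimal tuple is ascending, i.e. probability and magnitude end up negatively correlated, which is the relationship the abstract describes.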
In this paper, we provide quantitative evidence showing that languages spoken by many second language speakers tend to have relatively small nominal case systems or no nominal case at all. In our sample, all languages with more than 50% second language speakers had no nominal case. The negative association between the number of second language speakers and nominal case complexity generalizes to different language areas and families. As there are many studies attesting to the difficulty of acquiring morphological case in second language acquisition, this result supports the idea that languages adapt to the cognitive constraints of their speakers, as well as to the sociolinguistic niches of their speaking communities. We discuss our results with respect to sociolinguistic typology and the Linguistic Niche Hypothesis, as well as with respect to qualitative data from historical linguistics. All in all, multiple lines of evidence converge on the idea that morphosyntactic complexity is reduced by a high degree of language contact involving adult learners.
Word frequencies are central to linguistic studies investigating processing difficulty, learnability, age of acquisition, diachronic transmission and the relative weight given to a concept in society. However, there are few studies on entire distributions of word frequencies, and even fewer on systematic changes within them. Here, we first define and test an exact measure for the relative difference between distributions, the normalized frequency difference (NFD). We then apply this measure to parallel corpora, explaining systematic variation in the frequency distributions within the same language and across different languages. We further establish the NFD between lemmatized and unlemmatized corpora as a frequency-based measure of inflectional productivity. Finally, we argue that quantitative measures like the NFD can advance language typology beyond abstract, theory-driven expert judgments, towards more corpus-based, empirical and reproducible analyses.
Explaining the diversity of languages across the world is one of the central aims of typological, historical and evolutionary linguistics. We consider the effect of language contact - the number of non-native speakers a language has - on the way languages change and evolve. By analysing hundreds of languages within and across language families, regions and text types, we show that languages with greater levels of contact typically employ fewer word forms to encode the same information content (a property we refer to as lexical diversity). Based on three types of statistical analyses, we demonstrate that this variance can in part be explained by the impact of non-native speakers on information encoding strategies. Finally, we argue that languages are information encoding systems shaped by the varying needs of their speakers. Language evolution and change should be modeled as the co-evolution of multiple intertwined adaptive systems: on one hand, the structure of human societies and human learning capabilities, and on the other, the structure of language. We refer to this property of languages - the number of word forms or word types they use to encode essentially the same information - as their lexical diversity (LDT). This difference is a central part of the variation in encoding strategies we find across languages of the world. This paper centers on the question of where variation in lexical diversity stems from. Why do some languages employ a wide range of opaque lexical items while others are more economical? Variation between languages has often been seen as driven by language acquisition of native speakers (L1) [2-8]. However, some sociolinguistic and historical studies have raised the question of whether large numbers of non-native (L2) language speakers in a society can also lead to systematic changes in the use of the language.
Languages use different lexical inventories to encode information, ranging from small sets of simplex words to large sets of morphologically complex words. Grammaticalization theories argue that this variation arises as the outcome of diachronic processes whereby co-occurring words merge into one word and build up complex morphology.
This paper reports a quantitative analysis of the relationship between word frequency distributions and morphological features in languages. We analyze a commonly-observed process of historical language change: the loss of inflected forms in favour of 'analytic' periphrastic constructions. These tendencies are observed in parallel translations of the Book of Genesis in Old English and Modern English. We show that there are significant differences in the frequency distributions of the two texts, and that parts of these differences are independent of the total number of words, style of translation, orthography or contents. We argue that they derive instead from the trade-off between synthetic inflectional marking in Old English and analytic constructions in Modern English. By exploiting the earliest ideas of Zipf, we show that the syntheticity of the language in these texts can be captured mathematically, a property we tentatively call their grammatical fingerprint. Our findings suggest implications for both the specific historical process of inflection loss and more generally for the characterization of languages based on statistical properties.
The rule versus rote distinction is one of the most debated issues in recent psycholinguistics. Dual route accounts hold that words can either be stored whole in the mental lexicon or computationally derived by simple combinatorial rules such as stem+affix. Within this framework, response latencies in lexical decision tasks have been applied to point out the difference between rote memorization, on the one hand, and combinatorial rule manipulation, on the other. However, this paper argues that there may be alternatives to this distinction. It will be shown that German nouns, which can be distinctively marked for number, case or both number and case, do elicit differing reaction times. Crucially, this effect can neither be explained by surface frequency effects nor by internal morphological structure. Rather, it seems to be triggered by the degree of embedding into usage-based units.
Every two years researchers at Evolang bring together an impressive amount of evidence to unravel the puzzling phenomenon of language evolution. However, an overarching framework of how to integrate all these bits of data into a coherent theory, into a full picture, is missing. This paper proposes that the complex adaptive system framework can serve this purpose. The central idea is that language structures adapt to the niche of learner populations and are therefore 'shaped' to fit the social and cognitive needs of human beings. That is, the structural features of today's languages should be viewed as the product of the co-evolution of domain-general preadaptations for language and language as a cultural tool, rather than growing 'naturally' based on genetic encoding.
The Leiden Workshop on Capturing Phylogenetic Algorithms for Linguistics covered the following topics:
- Modeling of Linguistic Data
- Algorithms for Inferences on Linguistic Data
- Case Studies of Specific Families/Regions
- Synergies between Linguistic and Non-Linguistic Data
It was held at Lorentz Center Leiden from 26-30 October 2015.
Scientific organizers:
Devdatt Dubhashi (Gothenburg, Sweden)
Russell Gray (Auckland, New Zealand)
Harald Hammarström (Nijmegen, The Netherlands)
Gerhard Jäger (Tübingen, Germany)
Marian Klamer (Leiden, The Netherlands)
Andrew Meade (Reading, United Kingdom)
Proceedings Editors: Christian Bentz, Gerhard Jäger, Igor Yanovich
Cite as: Bentz, Christian, Jäger, Gerhard & Yanovich, Igor (eds.) (2016). Proceedings of the Leiden Workshop on Capturing Phylogenetic Algorithms for Linguistics. University of Tübingen, online publication system, https://publikationen.uni-tuebingen.de/xmlui/handle/10900/68558.