Papers by Christian Bentz
The choice associated with words is a fundamental property of natural languages. It lies at the heart of quantitative linguistics, computational linguistics, and language sciences more generally. Information theory gives us tools to measure precisely the average amount of choice associated with words, the word entropy. Here we use three parallel corpora, encompassing ca. 450 million words in 1916 texts and 1259 languages, to tackle some of the major conceptual and practical problems of word entropy estimation: dependence on text size, register, style and estimation method, as well as non-independence of words in co-text. We present three main results: 1) a text size of 50K tokens is sufficient for word entropies to stabilize throughout the text; 2) across languages of the world, word entropies display a unimodal distribution that is skewed to the right, which suggests a trade-off between the learnability and expressivity of words across languages; 3) there is a strong linear relationship between unigram entropies and entropy rates, suggesting that they are inherently linked. We discuss the implications of these results for studying the diversity and evolution of languages from an information-theoretic point of view.
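As an illustration of the basic quantity involved, here is a minimal sketch of the plug-in (maximum likelihood) unigram word entropy estimate; the tokenized toy text is hypothetical, and the paper's actual estimators are more sophisticated precisely because this naive estimate depends on sample size.

```python
import math
from collections import Counter

def word_entropy(tokens):
    """Plug-in (maximum likelihood) estimate of unigram word entropy in bits.

    Note: this naive estimator is biased for small samples, which is why
    text size matters for stable entropy estimates."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A uniform distribution over 4 word types yields log2(4) = 2 bits.
tokens = ["a", "b", "c", "d"] * 25  # 100 tokens, 4 equiprobable types
print(word_entropy(tokens))  # 2.0
```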
The morphological complexity of languages differs widely and changes over time. Pathways of change are often driven by the interplay of multiple competing factors, and are hard to disentangle. We here focus on a paradigmatic scenario of language change: the reduction of morphological complexity from Latin towards the Romance languages. To establish a causal explanation for this phenomenon, we employ three lines of evidence: 1) analyses of parallel corpora to measure the complexity of words in actual language production, 2) applications of NLP tools to further tease apart the contribution of inflectional morphology to word complexity, and 3) experimental data from artificial language learning, which illustrate the learning pressures at play when morphology simplifies. These three lines of evidence converge to show that pressures associated with imperfect language learning are good candidates to causally explain the reduction in morphological complexity in the Latin-to-Romance scenario. More generally, we argue that combining corpus, computational and experimental evidence is the way forward in historical linguistics and linguistic typology.
Language complexity is an intriguing phenomenon argued to play an important role in both language learning and processing. The need to compare languages with regard to their complexity resulted in a multitude of approaches and methods, ranging from accounts targeting specific structural features to global quantification of variation more generally. In this paper, we investigate the degree to which morphological complexity measures are mutually correlated in a sample of more than 500 languages of 101 language families. We use human expert judgements from the World Atlas of Language Structures (WALS), and compare them to four quantitative measures automatically calculated from language corpora. These consist of three previously defined corpus-derived measures, which are all monolingual, and one new measure based on automatic word-alignment across pairs of languages. We find strong correlations between all the measures, illustrating that both expert judgements and automated approaches converge to similar complexity ratings, and can be used interchangeably.
The average uncertainty associated with words is an information-theoretic concept at the heart of quantitative and computational linguistics. The entropy has been established as a measure of this average uncertainty, also called average information content. We here use parallel texts of 21 languages to establish the number of tokens at which word entropies converge to stable values. These convergence points are then used to select texts from a massively parallel corpus, and to estimate word entropies across more than 1000 languages. Our results help to establish quantitative language comparisons, to understand the performance of multilingual translation systems, and to normalize semantic similarity measures.
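One simple way to operationalize such a convergence point is sketched below, under the assumption that "stable" means the running entropy estimate changes by less than a small tolerance between consecutive sample sizes; this is a heuristic for illustration, not the paper's exact criterion, and the step size and tolerance are arbitrary choices.

```python
import math
from collections import Counter

def entropy_bits(tokens):
    """Plug-in unigram entropy estimate in bits."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def convergence_point(tokens, step=1000, tol=0.01):
    """Smallest prefix size at which the running entropy estimate changes
    by less than `tol` bits between consecutive steps of `step` tokens.
    Returns None if the estimate never stabilizes within the text."""
    prev = None
    for n in range(step, len(tokens) + 1, step):
        h = entropy_bits(tokens[:n])
        if prev is not None and abs(h - prev) < tol:
            return n
        prev = h
    return None

# A maximally regular toy text stabilizes at the second checkpoint.
tokens = ["a", "b", "c", "d"] * 2500  # 10,000 tokens
print(convergence_point(tokens))  # 2000
```

Real texts stabilize much later, which is why the paper needs tens of thousands of tokens per language.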
We present THE GLOTTOLOG DATA EXPLORER, an interactive web application in which the world's languages are mapped using a JavaScript library in the 'Shiny' framework for R (Chang et al., 2016). The world's languages and major dialects are mapped using coordinates from the Glottolog database (Hammarström et al., 2016). The application is primarily intended to portray the endangerment status of the world's languages, and hence the default map shows the languages colour-coded for this factor. Subsequently, the user may opt to hide (or reintroduce) data subsets by endangerment status, and to resize the datapoints by speaker counts. Tooltips allow the user to view language family classification and link to the relevant Glottolog webpage for each entry. We provide a data table for exploration of the languages by various factors. The web application is freely available at http://cainesap.shinyapps.io/langmap
We announce the release of the CROWDED CORPUS: a pair of speech corpora collected via crowdsourcing, containing a native speaker corpus of English (CROWDED_ENGLISH), and a corpus of German/English bilinguals (CROWDED_BILINGUAL). Release 1 of the CROWDED CORPUS contains 1000 recordings amounting to 33,400 tokens collected from 80 speakers and is freely available to other researchers. We recruited participants via the Crowdee application for Android. Recruits were prompted to respond to business-topic questions of the type found in language learning oral tests. We then used the CrowdFlower web application to pass these recordings to crowdworkers for transcription and annotation of errors and sentence boundaries. Finally, the sentences were tagged and parsed using standard natural language processing tools. We propose that crowdsourcing is a valid and economical method for corpus collection, and discuss the advantages and disadvantages of this approach.
Words that are used more frequently tend to be shorter. This statement is known as Zipf's law of abbreviation. Here we perform the widest investigation of the presence of the law to date. In a sample of 1262 texts and 986 different languages - about 13% of the world's language diversity - a negative correlation between word frequency and word length is found in all cases. In line with Zipf's original proposal, we argue that this universal trend is likely to derive from fundamental principles of information processing and transfer.
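The core measurement behind the law can be sketched as a correlation between type frequency and type length. The toy corpus and the use of Pearson's r below are illustrative assumptions on my part; the study itself works with large parallel texts and its own choice of correlation statistic.

```python
import math
from collections import Counter

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from scratch."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def abbreviation_correlation(tokens):
    """Correlate type frequency with type length (in characters).
    A negative value is the signature of Zipf's law of abbreviation."""
    counts = Counter(tokens)
    types = list(counts)
    return pearson([counts[w] for w in types], [len(w) for w in types])

# Toy corpus: short words frequent, long words rare.
toy = ["a"] * 50 + ["the"] * 30 + ["house"] * 10 + ["nevertheless"] * 2
print(abbreviation_correlation(toy))  # negative, as the law predicts
```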
The quantitative measurement of language complexity has witnessed a recent rise of interest, not least because language complexities reflect the learning constraints and pressures that shape languages over historical and evolutionary time. Here, an information-theoretic account of measuring language complexity is presented. Based on the entropy of word frequency distributions in parallel text samples, the complexities of overall 646 languages are estimated. A large-scale finding of this analysis is that languages just above the equator exhibit lower complexity than languages further away from the equator. This geo-spatial pattern is here referred to as the Low-Complexity-Belt (LCB). The statistical significance of the positive latitude/complexity relationship is assessed in a linear regression and a linear mixed-effects regression, suggesting that the pattern holds between different families and areas, but not within different families and areas. The lack of systematic within-family effects is taken as potential evidence for a phylogenetically “deep” explanation. The pressures shaping language complexities probably pre-date the expansion of language families from their proto-languages. Large-scale pre-historic contact around the equator is tentatively given as a possible factor involved in the evolution of the LCB.
Cognitive Science, 2014
This study presents original evidence that abstract and concrete concepts are organized and represented differently in the mind, based on analyses of thousands of concepts in publicly available data sets and computational resources. First, we show that abstract and concrete concepts have differing patterns of association with other concepts. Second, we test recent hypotheses that abstract concepts are organized according to association, whereas concrete concepts are organized according to (semantic) similarity. Third, we present evidence suggesting that concrete representations are more strongly feature-based than abstract concepts. We argue that degree of feature-based structure may fundamentally determine concreteness, and we discuss implications for cognitive and computational models of meaning.
Languages across the world exhibit Zipf's law of abbreviation, namely more frequent words tend to be shorter. The generalized version of the law - an inverse relationship between the frequency of a unit and its magnitude - holds also for the behaviours of other species and the genetic code. The apparent universality of this pattern in human language and its ubiquity in other domains calls for a theoretical understanding of its origins. To this end, we generalize the information-theoretic concept of mean code length as a mean energetic cost function over the probability and the magnitude of the types of the repertoire. We show that the minimization of that cost function and a negative correlation between probability and the magnitude of types are intimately related.
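The link between cost minimization and negative correlation can be illustrated with a toy exhaustive search: taking mean cost as the probability-weighted sum of magnitudes, the cost-minimizing assignment necessarily pairs high-probability types with small magnitudes (by the rearrangement inequality). The probabilities and lengths below are hypothetical values chosen only for illustration.

```python
from itertools import permutations

# Hypothetical probabilities of four types, sorted in descending order,
# and four available magnitudes (e.g. code lengths) to assign to them.
probs = [0.5, 0.25, 0.15, 0.10]
lengths = [1, 2, 3, 4]

def mean_cost(p, m):
    """Mean energetic cost: E[L] = sum_i p_i * m_i."""
    return sum(pi * mi for pi, mi in zip(p, m))

# Exhaustively search all assignments of magnitudes to types.
best = min(permutations(lengths), key=lambda m: mean_cost(probs, m))
print(best)  # (1, 2, 3, 4): the shortest magnitude goes to the most probable type
```

Since `probs` is sorted in descending order, the optimal tuple is ascending, i.e. probability and magnitude end up negatively correlated, which is the relationship the abstract describes.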
In this paper, we provide quantitative evidence showing that languages spoken by many second language speakers tend to have relatively small nominal case systems or no nominal case at all. In our sample, all languages with more than 50% second language speakers had no nominal case. The negative association between the number of second language speakers and nominal case complexity generalizes to different language areas and families. As there are many studies attesting to the difficulty of acquiring morphological case in second language acquisition, this result supports the idea that languages adapt to the cognitive constraints of their speakers, as well as to the sociolinguistic niches of their speaking communities. We discuss our results with respect to sociolinguistic typology and the Linguistic Niche Hypothesis, as well as with respect to qualitative data from historical linguistics. All in all, multiple lines of evidence converge on the idea that morphosyntactic complexity is reduced by a high degree of language contact involving adult learners.
Word frequencies are central to linguistic studies investigating processing difficulty, learnability, age of acquisition, diachronic transmission and the relative weight given to a concept in society. However, there are few studies on entire distributions of word frequencies, and even fewer on systematic changes within them. Here, we first define and test an exact measure for the relative difference between distributions, the normalized frequency difference (NFD). We then apply this measure to parallel corpora, explaining systematic variation in the frequency distributions within the same language and across different languages. We further establish the NFD between lemmatized and unlemmatized corpora as a frequency-based measure of inflectional productivity. Finally, we argue that quantitative measures like the NFD can advance language typology beyond abstract, theory-driven expert judgments, towards more corpus-based, empirical and reproducible analyses.
Explaining the diversity of languages across the world is one of the central aims of typological, historical and evolutionary linguistics. We consider the effect of language contact - the number of non-native speakers a language has - on the way languages change and evolve. By analysing hundreds of languages within and across language families, regions and text types, we show that languages with greater levels of contact typically employ fewer word forms to encode the same information content (a property we refer to as lexical diversity). Based on three types of statistical analyses, we demonstrate that this variance can in part be explained by the impact of non-native speakers on information encoding strategies. Finally, we argue that languages are information encoding systems shaped by the varying needs of their speakers. Language evolution and change should be modeled as the co-evolution of multiple intertwined adaptive systems: on one hand, the structure of human societies and human learning capabilities, and on the other, the structure of language. We refer to this property of languages - the number of word forms or word types they use to encode essentially the same information - as their lexical diversity (LDT). This difference is a central part of the variation in encoding strategies we find across languages of the world. This paper centers on the question of where variation in lexical diversity stems from. Why do some languages employ a wide range of opaque lexical items while others are more economical? Variation between languages has often been seen as driven by language acquisition of native speakers (L1) [2-8]. However, some sociolinguistic and historical studies have raised the question of whether large numbers of non-native (L2) language speakers in a society can also lead to systematic changes in the use of the language.
Languages use different lexical inventories to encode information, ranging from small sets of simplex words to large sets of morphologically complex words. Grammaticalization theories argue that this variation arises as the outcome of diachronic processes whereby co-occurring words merge into one word and build up complex morphology.
This paper reports a quantitative analysis of the relationship between word frequency distributions and morphological features in languages. We analyze a commonly-observed process of historical language change: the loss of inflected forms in favour of 'analytic' periphrastic constructions. These tendencies are observed in parallel translations of the Book of Genesis in Old English and Modern English. We show that there are significant differences in the frequency distributions of the two texts, and that parts of these differences are independent of the total number of words, style of translation, orthography or contents. We argue that they derive instead from the trade-off between synthetic inflectional marking in Old English and analytic constructions in Modern English. By exploiting the earliest ideas of Zipf, we show that the syntheticity of the language in these texts can be captured mathematically, a property we tentatively call their grammatical fingerprint. Our findings suggest implications for both the specific historical process of inflection loss and more generally for the characterization of languages based on statistical properties.
The rule versus rote distinction is one of the most debated issues in recent psycholinguistics. Dual route accounts hold that words can either be stored whole in the mental lexicon or computationally derived by simple combinatorial rules such as stem+affix. Within this framework, response latencies in lexical decision tasks have been applied to point out the difference between rote memorization, on the one hand, and combinatorial rule manipulation, on the other. However, this paper argues that there may be alternatives to this distinction. It will be shown that German nouns, which can be distinctively marked for number, case or both number and case, do elicit differing reaction times. Crucially, this effect can neither be explained by surface frequency effects nor by internal morphological structure. Rather, it seems to be triggered by the degree of embedding into usage-based units.
Every two years researchers at Evolang bring together an impressive amount of evidence to unravel the puzzling phenomenon of language evolution. However, an overarching framework of how to integrate all these bits of data into a coherent theory, into a full picture, is missing. This paper proposes that the complex adaptive system framework can serve this purpose. The central idea is that language structures adapt to the niche of learner populations and are therefore 'shaped' to fit the social and cognitive needs of human beings. That is, the structural features of today's languages should be viewed as the product of the co-evolution of domain-general preadaptations for language and language as a cultural tool, rather than growing 'naturally' based on genetic encoding.
The Leiden Workshop on Capturing Phylogenetic Algorithms for Linguistics covered the following topics:
- Modeling of Linguistic Data
- Algorithms for Inferences on Linguistic Data
- Case Studies of Specific Families/Regions
- Synergies between Linguistic and Non-Linguistic Data
It was held at Lorentz Center Leiden from 26-30 October 2015.
Scientific organizers:
Devdatt Dubhashi (Gothenburg, Sweden)
Russell Gray (Auckland, New Zealand)
Harald Hammarström (Nijmegen, The Netherlands)
Gerhard Jäger (Tübingen, Germany)
Marian Klamer (Leiden, The Netherlands)
Andrew Meade (Reading, United Kingdom)
Proceedings Editors: Christian Bentz, Gerhard Jäger, Igor Yanovich
Cite as: Bentz, Christian, Jäger, Gerhard & Yanovich, Igor (eds.) (2016). Proceedings of the Leiden Workshop on Capturing Phylogenetic Algorithms for Linguistics. University of Tübingen, online publication system, https://publikationen.uni-tuebingen.de/xmlui/handle/10900/68558.