Søren Wichmann

Christian-Albrechts-Universität zu Kiel, Excellence Cluster ROOTS, Faculty Member

Leiden University, Leiden University Centre for Linguistics, Faculty Member

Papers by Søren Wichmann

Identifying the optimal datasize for lexically-based Bayesian inference of linguistic phylogenies

International Conference on Computational Linguistics, Aug 1, 2018

Bayesian linguistic phylogenies are standardly based on cognate matrices for words referring to a fixed set of meanings, typically around 100-200. To this day there has not been any empirical investigation into which datasize is optimal. Here we determine, across a set of language families, the optimal number of meanings required for the best performance in Bayesian phylogenetic inference. We rank meanings by stability, infer phylogenetic trees using first the most stable meaning, then the two most stable meanings, and so on, computing the quartet distance of the resulting tree to the tree proposed by language family experts at each step of datasize increase. When a gold standard tree is not available we propose to instead compute the quartet distance between the tree based on the n most stable meanings and the one based on the n + 1 most stable meanings, increasing n from 1 to N - 1, where N is the total number of meanings. The assumption here is that the value of n for which the quartet distance begins to stabilize is also the value at which the quality of the tree ceases to improve. We show that this assumption is borne out. The results of the two methods vary across families, and the optimal number of meanings appears to correlate with the number of languages under consideration.
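
The stabilization step described in this abstract can be illustrated with a minimal sketch. It assumes the quartet distances between consecutive trees (n versus n + 1 most stable meanings) have already been computed elsewhere. The function name `stabilization_point`, the tolerance `tol`, and the "all later steps stay below the tolerance" criterion are illustrative assumptions, not the procedure used in the paper.

```python
from typing import Optional, Sequence


def stabilization_point(distances: Sequence[float], tol: float = 0.01) -> Optional[int]:
    """Return the smallest n (1-based) from which every consecutive-tree
    quartet distance is at most `tol`, or None if the series never settles.

    distances[i] is assumed to hold the quartet distance between the tree
    built from the i + 1 most stable meanings and the tree built from the
    i + 2 most stable meanings. Both the tolerance and this stopping rule
    are hypothetical choices made for illustration.
    """
    n = None
    # Walk backwards from the largest datasize while the distances stay small.
    for i in range(len(distances) - 1, -1, -1):
        if distances[i] <= tol:
            n = i + 1
        else:
            break
    return n


# Made-up distances: the series settles from n = 4 onwards.
example = [0.4, 0.2, 0.05, 0.008, 0.009, 0.004]
print(stabilization_point(example, tol=0.01))
```

A stricter or looser tolerance shifts the reported n, so in practice the threshold would need to be justified per family.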

Testing methods of linguistic homeland detection using synthetic data

Philosophical Transactions of the Royal Society B, Mar 22, 2021

Two families of quantitative methods have been used to infer geographical homelands of language families: Bayesian phylogeography and the ‘diversity method’. Bayesian methods model how populations may have moved using a phylogenetic tree as a backbone, while the diversity method assumes that the geographical area where linguistic diversity is highest likely corresponds to the homeland. No systematic tests of the performances of the different methods in a linguistic context have so far been published. Here, we carry out performance testing by simulating language families, including branching structures and word lists, along with speaker populations moving in space. We test six different methods: two versions of BayesTraits; the relaxed random walk model of BEAST 2; our own RevBayes implementations of a fixed rate and a variable rates random walk model; and the diversity method. As a result of the tests, we propose a hierarchy of performance of the different methods. Factors such as geographical idiosyncrasies, incomplete sampling, tree imbalance and small family sizes all have a negative impact on performance, but mostly across the board, the performance hierarchy generally being impervious to such factors. This article is part of the theme issue ‘Reconstructing prehistoric languages’.

A test of Generalized Bayesian dating: A new linguistic dating method

PLOS ONE, Aug 12, 2020

On the relation between structural diversity and geographical distance among languages: observations and computer simulations

arXiv (Cornell University), Jul 4, 2006

Zavala, Roberto and Søren Wichmann. 2025. A typological overview of Mixe-Zoquean languages. In: Wichmann, Søren (ed.), The Languages and Linguistics of Middle America: A Comprehensive Guide, 627-685. Berlin: de Gruyter Mouton.

Temperature shapes language sonority: Revalidation from a large dataset

PNAS Nexus, Nov 30, 2023

Languages in China link climate, voice quality, and tone in a causal chain

Humanities and Social Sciences Communications

Are the sound systems of languages ecologically adaptive like other aspects of human behavior? In previous substantive explorations of the climate–language nexus, the hypothesis that desiccation affects the tone systems of languages was not well supported. The lack of analysis of voice quality data from natural speech undermines the credibility of the following two key premises: the compromised voice quality caused by desiccated ambient air and constrained use of phonemic tone due to a desiccated larynx. Here, the full chain of causation, humidity → voice quality → number of tones, is for the first time strongly supported by direct experimental tests based on a large speech database (China’s Language Resources Protection Project). Voice quality data is sampled from a recording set that includes 997 language varieties in China. Each language is represented by about 1200 sound files, amounting to a total of 1,174,686 recordings. Tonally rich languages are distributed throughout China ...

Linguistic Clues to Kiowa-Tanoan Prehistory

Journal of the Southwest, 2021

A practical introduction to the ‘Toponym’ R package

by Lennart Chevallier and Søren Wichmann

Names, 2024

In this note, we describe how to install and use the ‘toponym’ R package, which is designed for m...

Fingerprinting conflict: A comparative model with applications to archaeological and historical data

This paper is envisioned as a primarily methodological contribution towards a more sophisticated and systematic approach to conflict research in archaeology and history. Studies of conflicts in these fields have tended towards a rather unnuanced focus on violence and war. Instead, we offer a more holistic approach to conflict research, taking into account different levels of both escalation and de-escalation that embrace all the possible aspects of a conflict from a mere undeveloped potential over complete annihilation to various countermeasures and stages of resolution. A model taking into account different levels of escalation and de-escalation is presented which embodies our multi-faceted view of conflicts and which also allows for a systematic, comparative analysis of conflict situations anywhere and any time in (pre)history. Through ten relatively detailed European case studies spanning the Bronze Age to the 20th century we demonstrate the comparative potential of our model and suggest ways in which it may help to identify typical patterns in conflict situations.

Editorial Statement: Language Dynamics and Change

From Linguistic Descriptions to Language Profiles

The DReaM corpus: A multilingual annotated corpus of grammars for the world’s languages

Language Resources and Evaluation, May 1, 2020

A global analysis of matches and mismatches between human genetic and linguistic histories

Proceedings of the National Academy of Sciences of the United States of America, Nov 21, 2022

Human history is written in both our genes and our languages. The extent to which our biological and linguistic histories are congruent has been the subject of considerable debate, with clear examples of both matches and mismatches. To disentangle the patterns of demographic and cultural transmission, we need a global systematic assessment of matches and mismatches. Here, we assemble a genomic database (GeLaTo, or Genes and Languages Together) specifically curated to investigate genetic and linguistic diversity worldwide. We find that most populations in GeLaTo that speak languages of the same language family (i.e., that descend from the same ancestor language) are also genetically highly similar. However, we also identify nearly 20% mismatches in populations genetically close to linguistically unrelated groups. These mismatches, which occur within the time depth of known linguistic relatedness up to about 10,000 y, are scattered around the world, suggesting that they are a regular outcome in human history. Most mismatches result from populations shifting to the language of a neighboring population that is genetically different because of independent demographic histories. In line with the regularity of such shifts, we find that only half of the language families in GeLaTo are genetically more cohesive than expected under spatial autocorrelations. Moreover, the genetic and linguistic divergence times of population pairs match only rarely, with Indo-European standing out as the family with most matches in our sample. Together, our database and findings pave the way for systematically disentangling demographic and cultural history and for quantifying processes of shifts in language and social identities on a global scale.

The halfway similarity avoidance rule replicated using phonetic data from European language varieties

Language Dynamics and Change, 2023

Previous work using lexical data from around the world has suggested that distances between language varieties are distributed such that varieties are typically either rather similar, qualifying as dialects of the same language, or rather dissimilar, qualifying as different languages, with a scarcity of varieties that are around halfway similar. Using a potentially biased sample, Wichmann (2019) observed that there is a bimodal distribution of distances with two roughly normal distributions separated by a valley. Here we test whether a similar distribution is found when using another source of data and an unbiased sample drawn from the cells of a geographical grid (of central Europe). The data consists of 18 lexemes from 274 doculects. Using Bayesian beta regression and leave-one-out cross-validation, we show that the data follows a bimodal distribution which is robust to sampling, and also to at least some aspects of the data (coarse- vs. fine-grained phonetic transcriptions).

Temperature shapes language sonority: Revalidation from a large dataset

PNAS Nexus, 2023

Multiple factors of the natural environment have been found to impact and mold the phonetic patterns of human speech, among which the potential correlation between sonority and temperature has garnered significant attention. We leverage a large database containing basic vocabularies of 5,293 languages and calculate the average sonority for each language by adopting a universal sonority scale. Our findings confirm a positive correlation between sonority and temperature across macroareas and language families, whereas this relationship cannot be discerned within language families. We suggest that the adaptation of the distribution of speech sounds within languages is a slow process which is moreover insensitive to minor differences in temperature experienced by speakers as they carry their languages to new regions. Nevertheless, at the global level a solid relationship emerges. Furthermore, we delve deeper into the nature of the relationship and contend that it is mainly due to cold temperatures having a weakening effect on sonority. This research provides compelling additional evidence that climatic factors contribute to shaping language and its evolution.
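
As a rough illustration of what "average sonority for each language" could mean computationally, the sketch below averages per-segment sonority values over a word list. The scale values in `SONORITY`, the one-character-per-segment transcription, and the choice to pool all segments rather than average per word are all assumptions made for this example. The abstract does not specify the scale or the aggregation used in the paper.

```python
from typing import Dict, Iterable, Optional

# Hypothetical sonority scale (higher = more sonorous); NOT the paper's scale.
SONORITY: Dict[str, int] = {"a": 5, "i": 4, "l": 3, "n": 2, "s": 1, "t": 0}


def mean_sonority(words: Iterable[str],
                  scale: Dict[str, int] = SONORITY) -> Optional[float]:
    """Average sonority over all segments found in the scale.

    Each character is treated as one segment (an assumption); symbols not
    present in the scale are skipped. Returns None if nothing is scorable.
    """
    values = [scale[seg] for word in words for seg in word if seg in scale]
    if not values:
        return None
    return sum(values) / len(values)


# Toy word list for a single hypothetical language.
print(mean_sonority(["tan", "lisa"]))
```

Per-language values of this kind could then be correlated with temperature, which is the comparison the abstract reports.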

From Indonesia to Madagascar: in search of the origins of the Malagasy language

Madagascar exhibits a strong linguistic uniformity since all dialects are regional variants of the same language, which belongs to the Greater Barito East group of the Austronesian family [Houtman, 1603]. This was firmly established a long time ago in [Tuuk, 1864], while, more recently, [Dahl, 1951] pointed out a particularly close relationship between Malagasy and Maanyan of south-eastern Kalimantan [Dyen, 1953]. Nevertheless, Malagasy also bears similarities to languages in Sulawesi, Malaysia, Sumatra and the Philippines, including loanwords from Malay, Javanese, and one (or more) language(s) of south Sulawesi [Adelaar, 2009] and the Philippines. On the contrary, the east-African contribution to the vocabulary seems to be limited to a few faunal names [Blench and Walsh, 2009]. Because of these linguistic relationships, it is widely accepted that the island was settled by Indonesian sailors after a maritime trek, but dates and place of landing are still debated and it is also not clear whether there were multiple settlements or just a single one. The linguistic composition of the Austronesian settlers is also debated, as well as its consequences on the vocabulary of Malagasy dialects. In this paper we review our research [Serva et al, 2012, Serva, 2012], which tries to shed new light on these problems. The key point is the application of a new quantitative methodology [Serva and Petroni, 2008, Petroni and Serva, 2008, Bakker et al, 2009] which is able to find out the kinship relations among languages (or dialects). New techniques are also introduced in order to extract the maximum information from these relations concerning time and space patterns [Blanchard et al, 2010a, Wichmann et al, 2010a]. We consider 23 Malagasy dialects plus Malay and Maanyan. The data concerning Madagascar were collected by one of the authors (M.S.) at the beginning of 2010 and consist of Swadesh lists of 200 items for each of the 23 dialects covering all areas of the island.

Words for ‘dog’ as a diagnostic of language contact in the Americas

On the relation between structural diversity and geographical distance among languages: Observations and computer simulations

Linguistic Typology, Jan 19, 2007

Since the groundbreaking work of Nichols (1992) it has been clear that the use of typological databases for making inferences regarding linguistic prehistory could potentially have much to offer. The recent availability of larger typological databases such as Haspelmath et al. (2005) has brought the linguistics community closer to having a solid, empirical foundation for making actual claims about prehistoric migrations, deep genealogical relationship, and patterns of areal linguistic interaction. Nevertheless, still more data are needed, and more than anything a number of methodological problems need to be addressed. These problems, which are the focus of this paper, include the following. How might we go about distinguishing diffusion from genealogical inheritance when looking at structural similarities among languages? For how long may we expect to continue to see a difference between traces of relatedness and traces of diffusion? What effects do factors such as speed of migration, the time depth of interaction among a certain set of languages or the rate of diffusion have on the similarities among initially related languages on the one hand and initially unrelated languages on the other? These questions will be addressed from two perspectives. The first perspective is an empirical one, where observations primarily derive from analyses of the data of Haspelmath et al. (2005). The second perspective is a computational one, where simulations are drawn upon to test the effects of different parameters on the development of structural linguistic diversity. The results suggest that there is indeed some hope that we may derive new empirical insights regarding linguistic prehistory by drawing upon typological data. We do not, however, make any specific empirical claims in this paper, but instead concentrate our efforts on methodology.

The extent and degree of utterance-final word lengthening in spontaneous speech from 10 languages

Linguistics Vanguard, 2021

Words in utterance-final positions are often pronounced more slowly than utterance-medial words, as previous studies on individual languages have shown. This paper provides a systematic cross-linguistic comparison of relative durations of final and penultimate words in utterances in terms of the degree to which such words are lengthened. The study uses time-aligned corpora from 10 genealogically, areally, and culturally diverse languages, including eight small, under-resourced, and mostly endangered languages, as well as English and Dutch. Clear effects of lengthening words at the end of utterances are found in all 10 languages, but the degrees of lengthening vary. Languages also differ in the relative durations of words that precede utterance-final words. In languages with on average short words in terms of number of segments, these penultimate words are also lengthened. This suggests that lengthening extends backwards beyond the final word in these languages, but not in languages with on average longer words. Such typological patterns highlight the importance of examining prosodic phenomena in diverse language samples beyond the small set of majority languages most commonly investigated so far.

Identifying the optimal datasize for lexically-based Bayesian inference of linguistic phylogenies

International Conference on Computational Linguistics, Aug 1, 2018

Bayesian linguistic phylogenies are standardly based on cognate matrices for words referring to a... more Bayesian linguistic phylogenies are standardly based on cognate matrices for words referring to a fix set of meanings-typically around 100-200. To this day there has not been any empirical investigation into which datasize is optimal. Here we determine, across a set of language families, the optimal number of meanings required for the best performance in Bayesian phylogenetic inference. We rank meanings by stability, infer phylogenetic trees using first the most stable meaning, then the two most stable meanings, and so on, computing the quartet distance of the resulting tree to the tree proposed by language family experts at each step of datasize increase. When a gold standard tree is not available we propose to instead compute the quartet distance between the tree based on the n-most stable meaning and the one based on the n + 1-most stable meanings, increasing n from 1 to N -1, where N is the total number of meanings. The assumption here is that the value of n for which the quartet distance begins to stabilize is also the value at which the quality of the tree ceases to improve. We show that this assumption is borne out. The results of the two methods vary across families, and the optimal number of meanings appears to correlate with the number of languages under consideration.

Testing methods of linguistic homeland detection using synthetic data

Philosophical Transactions of the Royal Society B, Mar 22, 2021

Two families of quantitative methods have been used to infer geographical homelands of language f... more Two families of quantitative methods have been used to infer geographical homelands of language families: Bayesian phylogeography and the ‘diversity method'. Bayesian methods model how populations may have moved using a phylogenetic tree as a backbone, while the diversity method assumes that the geographical area where linguistic diversity is highest likely corresponds to the homeland. No systematic tests of the performances of the different methods in a linguistic context have so far been published. Here, we carry out performance testing by simulating language families, including branching structures and word lists, along with speaker populations moving in space. We test six different methods: two versions of BayesTraits; the relaxed random walk model of BEAST 2; our own RevBayes implementations of a fixed rate and a variable rates random walk model; and the diversity method. As a result of the tests, we propose a hierarchy of performance of the different methods. Factors such as geographical idiosyncrasies, incomplete sampling, tree imbalance and small family sizes all have a negative impact on performance, but mostly across the board, the performance hierarchy generally being impervious to such factors. This article is part of the theme issue ‘Reconstructing prehistoric languages'.

A test of Generalized Bayesian dating: A new linguistic dating method

PLOS ONE, Aug 12, 2020

On the relation between structural diversity and geographical distance among languages: observations and computer simulations

arXiv (Cornell University), Jul 4, 2006

Zavala, Roberto and Søren Wichmann. 2025. A typological overview of Mixe-Zoquean languages. In: Wichmann, Søren (ed.), The Languages and Linguistics of Middle America: A Comprehensive Guide, 627-685. Berlin: de Gruyter Mouton.

Temperature shapes language sonority: Revalidation from a large dataset

PNAS nexus, Nov 30, 2023

Languages in China link climate, voice quality, and tone in a causal chain

Humanities and Social Sciences Communications

Are the sound systems of languages ecologically adaptive like other aspects of human behavior? In... more Are the sound systems of languages ecologically adaptive like other aspects of human behavior? In previous substantive explorations of the climate–language nexus, the hypothesis that desiccation affects the tone systems of languages was not well supported. The lack of analysis of voice quality data from natural speech undermines the credibility of the following two key premises: the compromised voice quality caused by desiccated ambient air and constrained use of phonemic tone due to a desiccated larynx. Here, the full chain of causation, humidity → voice quality → number of tones, is for the first time strongly supported by direct experimental tests based on a large speech database (China’s Language Resources Protection Project). Voice quality data is sampled from a recording set that includes 997 language varieties in China. Each language is represented by about 1200 sound files, amounting to a total of 1,174,686 recordings. Tonally rich languages are distributed throughout China ...

Linguistic Clues to Kiowa-Tanoan Prehistory

Journal of the Southwest, 2021

A practical introduction to the ‘Toponym’ R package

by Lennart Chevallier and Søren Wichmann

Names, 2024

In this note, we describe how to install and use the ‘toponym’ R package, which is designed for m... more

Fingerprinting conflict: A comparative model with applications to archaeological and historical data

This paper is envisioned as a primarily methodological contribution towards a more sophisticated ... more This paper is envisioned as a primarily methodological contribution towards a more sophisticated and systematic approach to conflict research in archaeology and history. Studies of conflicts in these fields have tended towards a rather unnuanced focus on violence and war. Instead, we offer a more holistic approach to conflict research, taking into account different levels of both escalation and de-escalation that embrace all the possible aspects of a conflict from a mere undeveloped potential over complete annihilation to various countermeasures and stages of resolution. A model taking into account different levels of escalation and de-escalation is presented which embodies our multi-faceted view of conflicts and which also allows for a systematic, comparative analysis of conflict situations anywhere and any time in (pre)history. Through ten relatively detailed European case studies spanning the Bronze Age to the 20th century we demonstrate the comparative potential of our model and suggest ways in which it may help to identify typical patterns in conflict situations.

Editorial Statement: Language Dynamics and Change

From Linguistic Descriptions to Language Profiles

The DReaM corpus: A multilingual annotated corpus of grammars for the world’s languages

Language Resources and Evaluation, May 1, 2020

A global analysis of matches and mismatches between human genetic and linguistic histories

Proceedings of the National Academy of Sciences of the United States of America, Nov 21, 2022

Human history is written in both our genes and our languages. The extent to which our biological ... more Human history is written in both our genes and our languages. The extent to which our biological and linguistic histories are congruent has been the subject of considerable debate, with clear examples of both matches and mismatches. To disentangle the patterns of demographic and cultural transmission, we need a global systematic assessment of matches and mismatches. Here, we assemble a genomic database (GeLaTo, or Genes and Languages Together) specifically curated to investigate genetic and linguistic diversity worldwide. We find that most populations in GeLaTo that speak languages of the same language family (i.e., that descend from the same ancestor language) are also genetically highly similar. However, we also identify nearly 20% mismatches in populations genetically close to linguistically unrelated groups. These mismatches, which occur within the time depth of known linguistic relatedness up to about 10,000 y, are scattered around the world, suggesting that they are a regular outcome in human history. Most mismatches result from populations shifting to the language of a neighboring population that is genetically different because of independent demographic histories. In line with the regularity of such shifts, we find that only half of the language families in GeLaTo are genetically more cohesive than expected under spatial autocorrelations. Moreover, the genetic and linguistic divergence times of population pairs match only rarely, with Indo-European standing out as the family with most matches in our sample. Together, our database and findings pave the way for systematically disentangling demographic and cultural history and for quantifying processes of shifts in language and social identities on a global scale.

The halfway similarity avoidance rule replicated using phonetic data from European language varieties

Language Dynamics and Change, 2023

Previous work using lexical data from around the world has suggested that distances between langu... more Previous work using lexical data from around the world has suggested that distances between language varieties are distributed such that varieties are typically either rather similar, qualifying as dialects of the same language, or rather dissimilar, qualifying as different languages, with a scarcity of varieties that are around halfway similar. Using a potentially biased sample, Wichmann (2019) observed that there is a bimodal distribution of distances with two roughly normal distributions separated by a valley. Here we test whether a similar distribution is found when using another source of data and an unbiased sample drawn from the cells of a geographical grid (of central Europe). The data consists of 18 lexemes from 274 doculects. Using Bayesian beta regression and leave-one-out cross-validation, we show that the data follows a bimodal distribution which is robust to sampling, and also to at least some aspects of the data (coarse-vs. fine-grained phonetic transcriptions).

Temperature shapes language sonority: Revalidation from a large dataset

PNAS Nexus, 2023

Multiple factors of the natural environment have been found to impact and mold the phonetic patte... more Multiple factors of the natural environment have been found to impact and mold the phonetic patterns of human speech, among which the potential correlation between sonority and temperature has garnered significant attention. We leverage a large database containing basic vocabularies of 5,293 languages and calculate the average sonority for each language by adopting a universal sonority scale. Our findings confirm a positive correlation between sonority and temperature across macroareas and language families, whereas this relationship cannot be discerned within language families. We suggest that the adaptation of the distribution of speech sounds within languages is a slow process which is moreover insensitive to minor differences in temperature experienced by speakers as they carry their languages to new regions. Nevertheless, at the global level a solid relationship emerges. Furthermore, we delve deeper into the nature of the relationship and contend that it is mainly due to cold temperatures having a weakening effect on sonority. This research provides compelling additional evidence that climatic factors contribute to shaping language and its evolution.

From Indonesia to Madagascar: in search of the origins of the Malagasy language

ABSTRACT t Madagascar exhibits a strong linguistic uniformity since all dialects are regional var... more ABSTRACT t Madagascar exhibits a strong linguistic uniformity since all dialects are regional variants of the same language, which belongs to the Greater Barito East group of the Austronesian family [Houtman, 1603]. This was firmly established a long time ago in [Tuuk, 1864], while, more recently, [Dahl, 1951] pointed out a particularly close relationship between Malagasy and Maanyan of south-eastern Kalimantan [Dyen, 1953]. Nevertheless, Malagasy also bears similarities to languages in Sulawesi, Malaysia, Sumatra and Philippines, including loanwords from Malay, Javanese, and one (or more) language(s) of south Sulawesi [Adelaar, 2009] and Philippines. On the contrary, the east-African contribution to the vocabulary seems to be limited to few faunal names [Blench and Walsh, 2009]. Because of these linguistic relationships, it is widely accepted that the island was settled by Indonesian sailors after a maritime trek but dates and place of landing are still debated and it is also not clear whether there were multiple settlements or just a single one. The linguistic composition of the Austronesian settlers is also debated as well its consequences on the vocabulary of Malagasy dialects. In this paper we review our research [Serva et al, 2012, Serva, 2012], which tries to shed new light on these problems. The key point is the application of a new quantitative methodology [Serva and Petroni, 2008, Petroni and Serva, 2008, Bakker et al, 2009] which is able to find out the kinship relations among languages (or dialects). New techniques are also introduced in order to extract the maximum information from these relations concerning time and space patterns [Blanchard et al, 2010a, Wichmann et al, 2010a]. We consider 23 Malagasy dialects plus Malay and Maanyan. The data concerning Madagascar were collected by one of the authors (M.S.) 
at the beginning of 2010 and consist of Swadesh lists of 200 items for each of the 23 dialects covering all areas of the island.

Words for ‘dog’ as a diagnostic of language contact in the Americas

On the relation between structural diversity and geographical distance among languages: Observations and computer simulations

Linguistic Typology, Jan 19, 2007

Since the groundbreaking work of Nichols (1992) it has been clear that the use of typological dat... more Since the groundbreaking work of Nichols (1992) it has been clear that the use of typological databases for making inferences regarding linguistic prehistory could potentially have much to offer. The recent availability of larger typological databases such as Haspelmath et al. (2005) has brought the linguistics community closer to having a solid, empirical foundation for making actual claims about prehistoric migrations, deep genealogical relationship, and patterns of areal linguistic interaction. Nevertheless, still more data are needed, and more than anything a number of methodological problems need to be adressed. These problems, which are the focus of this paper, include the following. How might we go about distinguishing diffusion from genealogical inheritance when looking at structural similarities among languages? For how long may we expect to continue to see a difference between traces of relatedness and traces of diffusion? What effects do factors such as speed of migration, the time depth of interaction among a certain set of languages or the rate of diffusion have on the similarities among initially related languages on the one hand and initially unrelated languages on the other? These questions will be addressed from two perspectives. The first perspective is an empirical one, where observations primarily derive from analyses of the data of Haspelmath et al. (2005). The second perspective is a computational one, where simulations are drawn upon to test the effects of different parameters on the development of structural linguistic diversity. The results suggest that there is indeed some hope that we may derive new empirical insights regarding linguistic prehistory by drawing upon typological data. We do not, however, make any specific empirical claims in this paper, but instead concentrate our efforts on methodology.

The extent and degree of utterance-final word lengthening in spontaneous speech from 10 languages

Linguistics vanguard, 2021

Words in utterance-final positions are often pronounced more slowly than utterance-medial words, as previous studies on individual languages have shown. This paper provides a systematic cross-linguistic comparison of relative durations of final and penultimate words in utterances in terms of the degree to which such words are lengthened. The study uses time-aligned corpora from 10 genealogically, areally, and culturally diverse languages, including eight small, under-resourced, and mostly endangered languages, as well as English and Dutch. Clear effects of lengthening words at the end of utterances are found in all 10 languages, but the degrees of lengthening vary. Languages also differ in the relative durations of words that precede utterance-final words. In languages with on average short words in terms of number of segments, these penultimate words are also lengthened. This suggests that lengthening extends backwards beyond the final word in these languages, but not in languages with on average longer words. Such typological patterns highlight the importance of examining prosodic phenomena in diverse language samples beyond the small set of majority languages most commonly investigated so far.

The paleobiolinguistics of domesticated manioc (Manihot esculenta)

by Charles R. Clement, Søren Wichmann, and Patience Epps

Ethnobiology Letters, 4: 61-70, 2013

Paleobiolinguistics is used to identify on maps where and when manioc (Manihot esculenta) developed importance for different prehistoric groups of Native Americans. This information indicates, among other things, that significant interest in manioc developed at least a millennium before a village-farming way of life became widespread in the New World.

Wichmann Forthc Linguistic phylogeography

Robbeets, Martine and Mark Hudson (eds.), The Oxford Handbook of Archaeology and Language. Oxford: Oxford University Press.

Linguistic phylogeography is concerned with the reconstruction of the geographical distribution of lineages pertaining to a specific language family (phylogeny). This paper reviews both traditional methods and modern computational ones. The computational methods draw upon information from the range of languages in a phylogeny, inferring origins and patterns of dispersal from current patterns. Some draw upon the structure of phylogenetic trees, some do not. This paper outlines the different methods, recapitulates the results of general evaluations, and identifies the challenges that they face, some of which are shared across the board. As a contribution unique to this paper, hypothetical homelands for all the world's language families as well as their subgroups generated by the new 'minimal distance' method are published here for the first time.

A world language family simulation

Physica A, 2021

Following a methodological approach developed in collaboration with Dietrich Stauffer, some empirical observations on the dynamics of the world's languages, in this case rates of language diffusion, are graphed and a computational model that might shed light on the underlying dynamics is developed. It is verified that it is possible to capture a similar kind of curve by simulating a spread of languages on a world landscape without any other assumptions than some simple parameters defining a semi-random walk and considering the world's languages as belonging to a single family. It is found that the simple computational model can, indeed, go far in achieving realism with respect to the linguistic population of the globe throughout prehistory. Nevertheless, it is concluded that a model assuming multiple origins of the world's languages is more probable.

Determinants of phonetic word duration in ten language documentation corpora: Word frequency, complexity, position, and part of speech

by Frank Seifart, Jan Strunk, and Søren Wichmann

Language Documentation & Conservation, 2020

This paper explores the application of quantitative methods to study the effect of various factors on phonetic word duration in ten languages. Data on most of these languages were collected in fieldwork aiming at documenting spontaneous speech in mostly endangered languages, to be used for multiple purposes, including the preservation of cultural heritage and community work. Here we show the feasibility of studying processes of online acceleration and deceleration of speech across languages using such data, which have not been considered for this purpose before. Our results show that it is possible to detect a consistent effect of higher frequency of words leading to faster articulation even in the relatively small language documentation corpora used here. We also show that nouns tend to be pronounced more slowly than verbs when other factors are controlled for. Comparison of the effects of these and other factors shows that some of them are difficult to capture with the current data and methods, including potential effects of cross-linguistic differences in morphological complexity. In general, this paper argues for widening the cross-linguistic scope of phonetic and psycholinguistic research by including the wealth of language documentation data that has recently become available.

Towards identifying the optimal datasize for lexically-based Bayesian inference of linguistic phylogenies

Bayesian linguistic phylogenies are standardly based on cognate matrices for words referring to a fixed set of meanings, typically around 100-200. To this day there has not been any empirical investigation into which datasize is optimal. Here we determine, across a set of language families, the optimal number of meanings required for the best performance in Bayesian phylogenetic inference. We rank meanings by stability, infer phylogenetic trees using first the most stable meaning, then the two most stable meanings, and so on, computing the quartet distance of the resulting tree to the tree proposed by language family experts at each step of datasize increase. When a gold standard tree is not available we propose to instead compute the quartet distance between the tree based on the n most stable meanings and the one based on the n + 1 most stable meanings, increasing n from 1 to N − 1, where N is the total number of meanings. The assumption here is that the value of n for which the quartet distance begins to stabilize is also the value at which the quality of the tree ceases to improve. We show that this assumption is borne out. The results of the two methods vary across families, and the optimal number of meanings appears to correlate with the number of languages under consideration.

Nouns slow down speech across structurally and culturally diverse languages

by Balthasar Bickel, Frank Seifart, Søren Wichmann, and Alena Witzlack-Makarevich

Proceedings of the National Academy of Sciences, 2018

By force of nature, every bit of spoken language is produced at a particular speed. However, this speed is not constant: speakers regularly speed up and slow down. Variation in speech rate is influenced by a complex combination of factors, including the frequency and predictability of words, their information status, and their position within an utterance. Here, we use speech rate as an index of word-planning effort and focus on the time window during which speakers prepare the production of words from the two major lexical classes, nouns and verbs. We show that, when naturalistic speech is sampled from languages all over the world, there is a robust cross-linguistic tendency for slower speech before nouns compared with verbs, both in terms of slower articulation and more pauses. We attribute this slowdown effect to the increased amount of planning that nouns require compared with verbs. Unlike verbs, nouns can typically only be used when they represent new or unexpected information; otherwise, they have to be replaced by pronouns or be omitted. These conditions on noun use appear to outweigh potential advantages stemming from differences in internal complexity between nouns and verbs. Our findings suggest that, beneath the staggering diversity of grammatical structures and cultural settings, there are robust universals of language processing that are intimately tied to how speakers manage referential information when they communicate with one another.

Keywords: speech rate | nouns | language universals | word planning | language processing

Human language in its most widespread form (i.e., in spontaneously spoken interactions) is locked in one-dimensional time. This was recognized by the founding father of modern linguistics, Ferdinand de Saussure, as one of the two fundamental principles of the linguistic sign, the other one being its arbitrary nature (1, 2).

An unresolved question is which aspects of local variation in speech rate are universal (3, 4), which vary across languages and cultures (5), and which vary across individuals (6). For example, marking the end of utterances by slowing down speech is cross-linguistically common, but its implementation is language-specific (7). Good candidates for truly universal temporal features are the relatively fast pronunciations of frequent, and thus predictable, words (8) and second mentions of words (9). This speedup is argued to result from automated articulation (4) and has been suggested to contribute to efficient communication by spreading information more evenly across the speech signal (10, 11). Frequency effects also explain why function words, such as articles, prepositions, and pronouns, are pronounced faster than the less frequently occurring content words, such as nouns and verbs (12).

Chitimacha: A Mesoamerican Language in the Lower Mississippi Valley

by Søren Wichmann and David Beck

The comparative method of historical linguistics is carefully applied to the hypothesis that Chitimacha, a language of southern Louisiana now without fully fluent speakers, and languages of the Totozoquean family of Mesoamerica are genealogically related. 91 lexical sets comparing Chitimacha words collected by Swadesh (1939, 1946a, 1950) with words reconstructed for Proto-Totozoquean (Brown et al. 2011) show regular sound correspondences. Along with certain structural similarities, this evidence attests to the descent of these languages from a common ancestor, Proto-Chitimacha-Totozoquean. By identifying regular sound correspondences, the phonological inventory and some of the vocabulary of the proto-language are reconstructed. Reconstructed words relating to maize agriculture and the fabrication of paper indicate that prehistoric Chitimacha speakers migrated to the Lower Mississippi Valley from Mesoamerica. Some speculations on how and when Chitimacha speakers migrated are offered.

by David Beck and Søren Wichmann


Towards unsupervised extraction of linguistic typological features from language descriptions

Presented at the First Workshop on Typology for Polyglot NLP, Florence, Aug. 1, 2019 (co-located with ACL, July 28-Aug. 2, 2019)

Manual encoding of typological databases is a tiresome procedure that takes large amounts of time. Bender (2016) reviews recent efforts in extracting typological features from interlinear glossed text (Lewis and Xia, 2010), Bible corpora (Östling, 2015; Malaviya et al., 2017), and sources such as morphologically annotated resources and treebanks (Bjerva and Augenstein, 2018). However, there is a lack of publications describing the application of NLP techniques to extract typological features directly from language descriptions contained in grammar books, dissertations, and linguistics articles. Collections of such descriptive sources are accumulating as PDFs (including many from scans) that have subsequently been OCR’ed. In this paper, we describe our first attempt at building an NLP pipeline that extracts typological features from OCR’ed linguistic descriptions.
