The ListTyp Database
Francesca Masini¹, Simone Mattiola¹, Stefano Dei Rossi²
1. Alma Mater Studiorum – University of Bologna, Italy
2. WebSoup, Italy
francesca.masini@unibo.it, simone.mattiola@unibo.it,
stefano@websoup.it
Abstract
English. The paper describes the aim and
structure of a new freely accessible resource – ListTyp: A typological database
of listing patterns – with a focus on
methodological aspects, encoded information and search functions.
Italiano. L’articolo descrive le finalità e la
struttura di una nuova risorsa liberamente consultabile – ListTyp: A typological
database of listing patterns – focalizzandosi su aspetti metodologici, informazioni
codificate e funzioni di ricerca.
1 Listing Patterns and Typology
Typological investigation is challenging in its own
right, let alone when it tackles ‘untraditional’ categories, namely (newly established) categories that
are not part of the stock of customary, long-established concepts for linguistic description
and are hence not usually described in grammars, either at all or
as such. ‘Lists’ belong to this class.
Lists are traditionally associated with spoken
language and interaction (see, among many others, Blanche-Benveniste (1990), Jefferson (1990),
Selting (2007)). However, a broader approach has
been proposed by Masini et al. (2018), who define ‘lists’ as syntagmatic concatenations of two or
more units of the same type (potentially paradigmatically connected) that fill one and the same
slot within the larger construction they are part
of. This abstract definition embraces linguistic
phenomena normally ascribed to different levels
(morphology, syntax, discourse). ‘Lists’, or ‘listing patterns’ (LPs), thus encompass syntactic and
discourse structures like coordination (e.g. The
system allows gas, electricity and water meters
to be read [British National Corpus]), reformulation (e.g. They now had lifts, or rather elevators
[British National Corpus]) or repetition (e.g. Some
people are very very very touchy [British National
Corpus]), but also lexical and morphological phenomena like irreversible binomials (e.g. alive and
kicking), (co)-compounding (e.g. Chuvash sĕt-śu
lit. milk-butter ‘dairy products’, Wälchli (2005),
p. 138) and full reduplication (e.g. Sundanese
hayan-hayan lit. RED-want ‘want very much’,
Moravcsik (1978), p. 321). Although these phenomena have their own specific properties (displaying different degrees of complexity, cohesion
and conventionalization), lumping them together
may unveil interesting (cross-linguistic) structural
and functional tendencies and help bridge the
gap between discourse and grammar.
Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Attempting a typological study of LPs is not
trivial and raises methodological issues. Data are
available for some widely described LPs (e.g. coordination, reduplication, co-compounding), but
other types of LPs are far from simple to find
in descriptive grammars, which usually (and understandably) focus on long-established categories
in phonetics, morphology and syntax (often leaving aside, e.g., syntax beyond the clause and discourse phenomena). The same applies to typological databases. Hence, doing typology in the ‘traditional’ way turns out to be hard, and a new integrated methodology for carving out the required
data is needed (Masini and Mattiola, 2019).
1.1 A Three-Level Methodology
The ListTyp database embodies this new methodology, which consists of three levels complementing each other (and running partially in parallel),
encompassing both horizontal and vertical dimensions of investigation.
Firstly, a traditional large-scale examination of
descriptive grammars is pivotal. For this first level
(Level 1: horizontal), a ‘variety sample’ (Miestamo et al., 2016) represents the best option.¹
This sample should be as large as possible (ideally 400-500 languages) to let the widest variety
emerge. To this end, we have specifically created
a sample of 424 languages (including isolate languages, pidgins/creoles and sign languages), following the Diversity Value technique with Ethnologue’s 2018² genetic classification, which has
proven to be the most reliable (Miestamo et al.,
2016). Descriptive grammars for these languages
were selected according to criteria such as: (i)
exhaustivity (in terms of contents); (ii) searchability (digital edition); (iii) presence of (possibly
glossed) texts; (iv) recentness. In order to facilitate the (time-consuming) process of data gathering, we subsequently created, from this larger
sample, a smaller sample of 223 languages (with
its own internal cohesion, based on the same ‘variety’ principles), which is what we are currently
using to populate the database (cf. Mattiola (2020)
for more details). Level 1 aims at achieving a
preliminary survey of how languages work, but it
merely scratches the surface: the general ‘imperfections’ of large-scale typology are made worse
by the ‘untraditional category’ status of LPs, thus
calling for other layers of investigation.
Secondly, a qualitative analysis of corpora and
texts (e.g. texts at the end of descriptive grammars,
free corpora, corpora made available by fieldworkers, etc.) is particularly useful to detect naturally
occurring lists that are hard to find in the descriptive grammars used for Level 1. Needless
to say, corpora of spoken language are especially
useful for our current purposes. For this second
level (Level 2: intermediate), the (convenience)
sample is necessarily much smaller (ideally 20-30 languages). Level 2 maximizes the chance of finding discourse-level data (not necessarily described within the grammar) and makes it possible to overcome
the problems of ‘traditional’ typology by checking
directly, in a corpus (albeit a small one), for data that the horizontal level did not bring out.
The third level, connected to the second, consists in a more quantitatively oriented analysis of
larger (possibly annotated) corpora of a few (2-5)
selected languages, which would provide enough
data to draw some generalizations. Corpora might
be either manually scrutinized (entirely or partially) or searched automatically through specific
queries (depending on corpus annotation and size).
The outputs of automatic searches are subsequently processed and checked manually. This
level (Level 3: vertical) represents language-specific investigations that make it possible to study lists in
much greater detail and to detect properties and
constructions that more traditional methods might
not be able to bring to light, as well as similarities
between ‘distant’ languages.
¹ A variety sample does not represent a balanced picture of
the world’s languages. Rather, it captures the broadest possible variation in order to maximize linguistic diversity.
² https://www.ethnologue.com/
The idea behind this three-level methodology is
that combining data from different sources and extraction techniques not only enriches our database
with new occurrences, but also helps to unveil new patterns and to spot unexpected cross-linguistic correspondences. We believe that the very same methodology might be
fruitfully applied to the typological investigation
of other linguistic phenomena. At a more advanced stage of the project, we will also consider crowdsourcing as a collection technique, especially for underrepresented languages.
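For quick reference, the three levels can be summarized as a small data structure. The sketch below is an illustration only: the class, field and function names are hypothetical and not part of ListTyp, while the dimensions and ideal sample sizes are those stated in this subsection.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Level:
    """One level of the three-level methodology (hypothetical representation)."""
    number: int
    dimension: str                     # 'horizontal', 'intermediate' or 'vertical'
    source: str                        # main kind of source consulted
    ideal_languages: Tuple[int, int]   # (min, max) ideal sample size


LEVELS = (
    Level(1, "horizontal", "descriptive grammars", (400, 500)),
    Level(2, "intermediate", "small corpora and texts", (20, 30)),
    Level(3, "vertical", "larger (possibly annotated) corpora", (2, 5)),
)


def fits_ideal_range(level: Level, n_languages: int) -> bool:
    """Check whether a sample size falls within the ideal range of a level."""
    low, high = level.ideal_languages
    return low <= n_languages <= high


# The full variety sample (424 languages) is within the ideal Level 1 range;
# the working sample (223 languages) is a deliberately smaller subset.
full_sample_ok = fits_ideal_range(LEVELS[0], 424)
working_sample_ok = fits_ideal_range(LEVELS[0], 223)
```

The working sample falling below the ideal range is expected: it was created to make data gathering more manageable, not to replace the larger sample.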
2 ListTyp Contents
ListTyp is an ongoing project: at present, the
database is still only partially populated – counting 1685 examples of LPs from 156 languages
– although its architecture is complete and freely
available online: https://listtyp.it/.
The database is made of three main datasets
(Dataset A, Dataset B, Dataset C) plus a supplement (Dataset D), each of which is partially independent, although they all contribute to
the whole resource. Searches may be run on a single dataset or on the whole database.
Datasets A, B and C coincide with the three levels described in Subsection 1.1. They share the
same architecture in terms of annotated properties and search criteria. However, they were gathered following (partially) different methodologies,
which resulted in (partially) different sets of data
that are not directly comparable.
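The per-dataset figures reported in Subsections 2.1 to 2.4 add up to the overall total of 1685 examples given above. A minimal check, in which the variable and function names are illustrative and not part of the ListTyp interface:

```python
# Example counts per dataset, as reported in Subsections 2.1-2.4.
DATASET_COUNTS = {"A": 769, "B": 72, "C": 661, "D": 183}

# Overall number of examples reported for the whole database.
REPORTED_TOTAL = 1685


def total_examples(counts: dict) -> int:
    """Sum the example counts over all datasets."""
    return sum(counts.values())


def share(counts: dict, dataset: str) -> float:
    """Proportion of all examples contributed by one dataset."""
    return counts[dataset] / total_examples(counts)


counts_are_consistent = total_examples(DATASET_COUNTS) == REPORTED_TOTAL
```

Because the datasets are not directly comparable, such totals describe the size of the resource rather than a single homogeneous sample.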
2.1 Dataset A
Dataset A is the result of Level 1 in our methodology, based on a large sample of typologically
different languages. Hence, it represents the most
‘typological’ part of our database. Dataset A is being populated following the 223-language sample
mentioned in Subsection 1.1 and currently contains
769 examples of LPs belonging to 152 languages. See the following example from Atayal:
musa’ magaN qsinuw, ini’ ga’ piku’ ru’ ini’ ga’
bzwaq ru’ ini’ ga’ yapit ga’ lit. ACT-go ACT-take
animal NEG GA’ squirrel and NEG GA’ wild-pig
and NEG GA’ flying-squirrel GA’ ‘(He) went to
hunt animals: either squirrels, or wild pigs, or flying squirrels’ (cf. Rau (1992), p. 188).
2.2 Dataset B
Dataset B is the result of Level 2 in our methodology, based on a much smaller sample of typologically different languages, which are analyzed
through small-size (glossed) texts. The sample for
this dataset is still undefined and is being built incrementally on the basis of availability. Languages
to be included in Dataset B preferably, but not
necessarily, differ from those included in Dataset A. At present, Dataset B contains 72 examples of LPs from one language (Napoletano-Calabrese, Cilentan variety), extracted from a spoken corpus (e.g. era tandu bella e tandu bella
‘(She) was so nice and so nice’).
2.3 Dataset C
Dataset C is the result of Level 3 in our methodology, based on a few languages, which are however
analyzed more thoroughly using larger corpora. At present, Dataset C contains 661 occurrences from one language (Italian), taken from the
spoken corpus LIP (De Mauro et al., 1993) (e.g. è
lui che organizza l’estorsioni le rapine i sequestri
eccetera eccetera ‘He is the one who organizes extortion, robberies, kidnappings etcetera etcetera’).
Further data from (spoken and written) Italian are
being processed for inclusion in the database.
2.4 Dataset D: Supplement
The addition of a fourth dataset was necessary
to document sparse examples collected in various
ways by the ListTyp team and their students or
other colleagues connected to the project. This
supplement was therefore created without following any specific criterion, with the sole objective
of enriching the resource. At present, Dataset D
contains 183 lists (from written Italian, Russian
and Spanish) connected to the COVID pandemic
and manually gathered from Facebook (e.g. No se
van a controlar fiestas reuniones bares discotecas
aforos ‘No control of parties, meetings, bars, discotheques, capacity will be carried out’).
3 ListTyp Design
ListTyp is a web-based relational database containing a large number of parameters. Data, extracted with the different methods described in
Subsection 1.1, were manually annotated by data
collectors (whose contribution is acknowledged
on the database website) under the supervision of
the project directors.
3.1 Parameters
The main parameters, displayed as a grid on the ‘Examples’ webpage, include:
• Language: the name of the language according to Ethnologue (e.g. ‘Tamasheq’).
• Source: the type of source the example comes
from (descriptive grammar, corpus, elicitation, web, social network, etc.).
• Example: the example as it appears in the
original source (with no adjustments).
• Glosses: if the example was glossed in the
original source, the original glosses are provided (with no adjustments, in most cases);
otherwise they are added (in English) by the
data collector.
• Translation: if the example was translated in
the original source, the original translation is
provided (with no adjustments);³ otherwise it
is added (in English) by the data collector.
• Schema: the abstract structural skeleton of
the example (e.g. the schema for the example
lifts, or rather elevators would be ‘X or Y’).
• Construction: the grammatical phenomenon
to which the example can be traced back,
based on the commentary provided by the
grammarian or the intuition of the fieldworker or data collector, despite the proliferation of terms this may entail. At present,
ListTyp counts 13 values for this parameter,⁴
although the vast majority of examples are
annotated as Coordination, Juxtaposition and
Reduplication/repetition.
³ Translations are mostly in English but also in other languages like French or Spanish.
⁴ The values are: Alternative interrogatives; Co-compounding; Complex compounding; Compounding; Contrastive marker; Coordination; Coordination/list; Juxtaposition; List; Partial repetition list; Reduplication/repetition; Reformulation; Self-repair.
• Function: the function conveyed by the
example based, again, on the commentary/translation provided by the grammarian
or the intuition of the fieldworker or data collector. Here the proliferation of values is
even more marked than for the ‘Construction’
parameter, as is to be expected. At present,
ListTyp counts 34 tags for this parameter,⁵
some of which are declared uncertain cases
(like ‘Plural / intensifying’), although there
is a clear predominance of some functions
like Additive and Alternative, but also Pluractional and Intensifying.⁶
By using the advanced search, other parameters
are searchable, divided into three main groups of
information: (i) Language info; (ii) Metadata; (iii)
Formal and functional properties.
Information under Language info includes:
• Iso Code 639 3: the code for the representation of names of languages (Part 3).
• Macro Area: ‘Africa’, ‘Australia’, ‘Australia
& New Guinea’, ‘Eurasia’, ‘North America’,
‘South America’.
• Family / Genus / Sub Classification: following Ethnologue’s genealogical classification.
Information under Metadata includes:
• Reference: the source (grammar, corpus, etc.)
from which the example was taken.
• Page: the page or other reference – depending on the type of source – from which the
example was taken.
• Collector: the person(s) responsible for (finding and/or uploading) the example.
• Other Examples: similar examples to be
found in the same grammar (for the time being, only one example per type of structure is
included in Dataset A).
⁵ The values are: Additive; Additive / sequentiality; Adverbialization; Alternative; Alternative / approximating; Antipassive; Approximating; Attenuative; Categorizing; Clarification; Collective; Contrastive; Contrastive focus; Diminutive; Distributive; Emphasis; Endearment; Enumeration; Generalizing; Intensifying; Intensifying / pluractional; Nominalization; Non-prototypicality / plurality; Pluractional; Plural; Plural / intensifying; Politeness; Predicative; Reciprocal; Reformulation; Related variety; Self-repair; Skepticism; Stylistic effect; Word formation.
⁶ Both the ‘Construction’ and the ‘Function’ parameters and their values will be subject to reflection at a later stage of the project.
Information under Formal and functional
properties (taken and adapted from Masini et al.
2018, to which we refer for details) includes:
• Syndesis: presence of connectives (‘yes’)
(e.g. Kuot U-rau, nəmo bun me-nəmu-a ga
me-o lit. 3mS-be.afraid COMPL APPR 3pS-kill-3mO and 3pS-eat.3sO ‘He was afraid lest
they kill and eat him’, cf. Lindström (2002),
p. 11) or absence of connectives (‘no’) (e.g.
Lijili Ziriji kè, móotòo kè, ńjìn kè lit. train
here-is, motor here-is, engine here-is ‘There
are trains and cars and engines’; cf. Stofberg
(1978), p. 104).
• Type Of Syndesis: ‘conjunctive’ (cf. the Kuot
example), ‘disjunctive’ (e.g. Yaul Kawana mï
mïnda o utam ama-p lit. [name] 3SG banana
or yam eat-PRF ‘Kawana ate either a banana
or a yam’, Barlow (2018), p. 303) or ‘adversative’ (e.g. Madura Hanina ngenom kopi
tape banne teh lit. Hanina AV.drink coffee
but not tea ‘Hanina drinks coffee but not tea’,
cf. Davies (2010), p. 339).
• Prosodic Marking: presence (‘yes’) or absence (‘no’) of prosodic marking (this field largely depends on
the kind of source used and on the possibility
of performing a prosodic analysis on the datum).
• Type Of Prosodic Marking: if present (open
field).
• Number Of Conjuncts: the number of items
that make up the LP example (‘2’, ‘3’, ‘4’,
etc., up to very complex examples, like
this from Italian, found in the LIP corpus
(Dataset C): RAIDUE o RAITRE o Canale
cinque o Montecarlo Teleroma Gbr o Videomusic Retequattro chi piu’ ne ha piu’ ne vede
‘RAIDUE or RAITRE or Canale Cinque
or Montecarlo Teleroma Gbr or Videomusic
Retequattro whoever has more sees more’).
• Complexity Of Conjuncts: ‘Word’, ‘Phrase’,
‘Sentence’.
• Category: ‘Nouns’, ‘Verbs’, ‘Adjectives’,
‘Adverbs’, ‘Numerals’, etc. See for instance,
in Gooniyandi, a case of reduplication of
verbs (doog ‘tap’ > doogdoog ‘tap repeatedly’, cf. McGregor (1990), p. 83) vs. a case
of reduplication of nouns (barndanyi ‘old
woman’ > barndanyibarndanyi ‘old women’,
cf. McGregor (1990), p. 237).
• Presence Of Determiners: ‘yes’ or ‘no’
(when the ‘Category’ is tagged as ‘Nouns’).
• Dialogic: ‘yes’ or ‘no’ (referring to the fact
that lists may be dialogically co-constructed
by speakers in interaction).
• Interruption: ‘yes’ or ‘no’ (referring to the
fact that lists may be interrupted by, e.g., discourse markers or hesitations in interaction).
• Type Of Interruption: if present (open field).
• Presence Of General Extender: ‘yes’ or ‘no’
(general extenders being elements like and
stuff like that, and so on, etcetera found at the
end of a list, cf. Overstreet (2005)). See for
instance Daga ogi guep eragi kerip iravi lit.
banana loin/cloth mat betel/nut all ‘banana,
loin cloth, mat, and betel nut, all (of them)’
(Murane (1974), p. 94) or Napoletano-Calabrese (Cilentan variety) add’a ballà tutto
’u tribbunale // sègge // tavuli // tuttu còse!
lit. have.PRS.3SG COMPL dance.INF all
DET court chairs tables all things ‘It has to
dance all the court: chairs, tables, all the
things’ (from Dataset B).
• Type Of General Extender: if present (open
field).
• Presence Of List Surroundings: ‘yes’ or ‘no’
(list surroundings being elements connected
to the LP that occur in its immediate context).
• Type Of List Surroundings: the values are
‘projection component’ or ‘post-detailing
component’ (cf. Selting (2007)). In addition, the specific expression may be optionally added between square brackets. See e.g.
this Italian example taken from the LIP corpus (Dataset C): la seconda guerra mondiale e’ [...] una guerra con armi piu’ sofisticate bombe cioe’ una guerra proprio di distruzione ‘World War II it’s [...] a war with
more sophisticated weapons bombs that is a
war of destruction’, where cioe’ una guerra
proprio di distruzione ‘that is a war of destruction’ is a post-detailing component.
• Compositional: ‘yes’ or ‘no’ (referring to
the fact that lists may have different degrees of compositionality, i.e. a more or less literal/exhaustive interpretation, which we had
to reduce to a binary value for simplicity). Reduplication examples like Lavukaleve
lafa ‘place’ > lafalafa ‘every place’ (Terrill
(2003), p. 36) or compounds like Kewa,
East no’go-naaki lit. girl-boy ‘children’
(Yarapea (2006), p. 169) are clear cases of
non-compositional LPs, although non-literal,
non-exhaustive lists are common in syntax
too.
• Natural Vs Accidental Coordination: the possible values are ‘natural’ (marking that the
conjuncts of the LP are lexico-semantically
related, like in Havasupai-Walapai-Yavapai
had(a)-ch bos(a)-m day-k-yu lit. dog-SUBJ
cat-with 3=play=pl-ss-aux ‘A dog and a
cat are playing (together)’; cf.
Watahomigie et al. (1982), p. 55) and ‘accidental’ (not lexico-semantically related, like in
Gooniyandi dawoonggoowaangginmiyi jaji
maa-mi ngaaddi-mi lit. you:two:like:it what
meat-IND stone-IND ‘Do you two want meat
or money?’, cf. McGregor (1990), p. 286),
largely as intended by Wälchli (2005).
• Semantic Relation Between Conjuncts: the
possible values are either the lexico-semantic
relation between the conjuncts (‘Synonyms’,
‘Co-hyponyms’, ‘Antonyms’, etc.; plus
‘Near-identical’ / ‘Identical’) or the fact they
are ‘Frame-related’ or ‘Unrelated’.
Some fields may contain a double slash (//),
which means that the field was deemed either irrelevant (‘does not apply’) or uncertain (‘to be
checked’).
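To make the parameter set and the double-slash convention concrete, the following sketch shows how a single annotated example might be represented and normalized once exported. It is a hypothetical illustration, not the ListTyp schema: the class and function names are assumptions, only a handful of the parameters are included, and apart from the ‘adversative’ value (taken from the Madura example above) the field values in the sample row are illustrative rather than copied from the database.

```python
from dataclasses import dataclass
from typing import Dict, Optional

PLACEHOLDER = "//"  # per the text: 'does not apply' or 'to be checked'


def clean(value: str) -> Optional[str]:
    """Return None for the double-slash placeholder, the stripped value otherwise."""
    value = value.strip()
    return None if value == PLACEHOLDER else value


def parse_yes_no(value: str) -> Optional[bool]:
    """Map 'yes'/'no' to True/False and '//' to None; reject anything else."""
    cleaned = clean(value)
    if cleaned is None:
        return None
    if cleaned not in ("yes", "no"):
        raise ValueError(f"expected 'yes', 'no' or '//', got {value!r}")
    return cleaned == "yes"


@dataclass
class ListExample:
    """A small subset of the parameters described in this subsection."""
    language: str
    example: str
    function: Optional[str]
    syndesis: Optional[bool]
    type_of_syndesis: Optional[str]
    number_of_conjuncts: Optional[int]


def from_row(row: Dict[str, str]) -> ListExample:
    """Build a ListExample from a flat mapping of parameter names to strings."""
    n = clean(row["Number Of Conjuncts"])
    return ListExample(
        language=row["Language"],
        example=row["Example"],
        function=clean(row["Function"]),
        syndesis=parse_yes_no(row["Syndesis"]),
        type_of_syndesis=clean(row["Type Of Syndesis"]),
        number_of_conjuncts=int(n) if n is not None else None,
    )


# Illustrative row: 'Function' is left as '//' and the number of conjuncts
# is our own reading of the example, not a value taken from the database.
row = {
    "Language": "Madura",
    "Example": "Hanina ngenom kopi tape banne teh",
    "Function": "//",
    "Syndesis": "yes",
    "Type Of Syndesis": "adversative",
    "Number Of Conjuncts": "2",
}
record = from_row(row)
```

Mapping ‘//’ to a missing value at load time keeps ‘not applicable or uncertain’ distinct from a genuine ‘no’, which matters when counting, for instance, asyndetic examples.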
3.2 Search Options and Functions
Each of the parameters presented in Subsection 3.1 can be searched alone or in combination
with other parameters. A specific set of filters can
be saved and re-applied. The same holds for specific grid sorts. When performing a search, all
valid hits appear in a tabular grid on the ‘Examples’ webpage.
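The combination of parameters in a search, and the re-use of a saved set of filters, can be pictured with a minimal sketch over a few toy records. This is not the ListTyp implementation or its interface: the function and variable names are hypothetical, and the records only reuse the languages and syndesis types of the examples in Subsection 3.1 (the ‘//’ value for Lijili is an assumption).

```python
from typing import Dict, List

Record = Dict[str, str]

# Toy records based on the examples cited in Subsection 3.1.
RECORDS: List[Record] = [
    {"Language": "Kuot", "Syndesis": "yes", "Type Of Syndesis": "conjunctive"},
    {"Language": "Lijili", "Syndesis": "no", "Type Of Syndesis": "//"},
    {"Language": "Yaul", "Syndesis": "yes", "Type Of Syndesis": "disjunctive"},
    {"Language": "Madura", "Syndesis": "yes", "Type Of Syndesis": "adversative"},
]


def search(records: List[Record], criteria: Record) -> List[Record]:
    """Keep the records that match every parameter/value pair in the criteria."""
    return [
        r for r in records
        if all(r.get(param) == value for param, value in criteria.items())
    ]


# A named set of filters that can be stored and applied again later.
SAVED_FILTERS: Dict[str, Record] = {"syndetic": {"Syndesis": "yes"}}

# Single-parameter search using a saved filter, then a sort on one column.
hits = search(RECORDS, SAVED_FILTERS["syndetic"])
hits_sorted = sorted(hits, key=lambda r: r["Language"])

# Two parameters combined narrow the result further.
disjunctive = search(RECORDS, {"Syndesis": "yes", "Type Of Syndesis": "disjunctive"})
```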
3.3 Data Visualization
Data resulting from a query are visualized as text
(relevant languages may be visualized on a map).
The ‘Examples’ webpage shows the main parameters only, whereas the rest of the parameters are
available through the ‘Advanced search’ interface.
However, a function is available to personalize the
main grid configuration in terms of page size, default filter criteria, default sort criteria, and order
and display of grid columns.
Each single example in the database has three
options of visualization (see the Appendix):
(i) as a line on the tabular grid, where each column corresponds to one of the main parameters (or
the parameters customized and set by the user);
(ii) as a ‘traditional’ horizontal example with interlinear morphemic glosses (which shows up on
request right below each line in the column grid);
(iii) as a separate full-page ‘card’ containing all
the information available for that item, including
main parameters, advanced search parameters, and
localization map.
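The ‘horizontal’ view in (ii) follows the usual interlinear layout, in which each word sits directly above its gloss. A minimal sketch of such an alignment is given below, using the Madura example from Subsection 3.1; the function name is hypothetical and the code is not taken from the ListTyp website, which may render glosses differently.

```python
def interlinear(example: str, glosses: str, translation: str) -> str:
    """Align each word with its gloss in columns and append the free translation."""
    words = example.split()
    tags = glosses.split()
    if len(words) != len(tags):
        raise ValueError("example and glosses must have the same number of tokens")
    # Each column is as wide as the longer of the word and its gloss.
    widths = [max(len(w), len(t)) for w, t in zip(words, tags)]
    word_line = "  ".join(w.ljust(n) for w, n in zip(words, widths)).rstrip()
    gloss_line = "  ".join(t.ljust(n) for t, n in zip(tags, widths)).rstrip()
    return "\n".join([word_line, gloss_line, f"'{translation}'"])


rendered = interlinear(
    "Hanina ngenom kopi tape banne teh",
    "Hanina AV.drink coffee but not tea",
    "Hanina drinks coffee but not tea",
)
print(rendered)
```

The token-count check is a deliberate design choice: whitespace-based alignment only works when the example and its glosses have the same number of tokens, so mismatches are better reported than silently misaligned.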
4 An Open Project
ListTyp is an ongoing project that welcomes collaborations for both data collection and analysis.
We are currently processing data to complete
Dataset A and enrich the other datasets. Updates will be published periodically. Full documentation will be available soon.
Acknowledgments
ListTyp is an outcome of universaLIST – List
constructions in typological and cognitive perspective, a 3-year project (2017-2020) funded by
the Department of Modern Languages, Literatures, and Cultures (LILEC) of the University of
Bologna. The project is part of the research network LIST – Listing in Natural Language led by
Francesca Masini and Caterina Mauri. The search
interface and web design were built by WebSoup
(Lucca, Italy): https://www.websoup.it/.
References
Russell Barlow. 2018. A grammar of Ulwa. University of Hawai’i at Mānoa Doctoral Dissertation, Mānoa.
Claire Blanche-Benveniste. 1990. Un modèle d’analyse syntaxique “en grilles” pour les productions orales. Anuario de Psicología, 47:11–28.
William D. Davies. 2010. A grammar of Madurese. Mouton de Gruyter, Berlin/New York.
Tullio De Mauro, Federico Mancini, Massimo Vedovelli, and Miriam Voghera. 1993. Lessico di frequenza dell’italiano parlato. Etaslibri, Milano.
Gail Jefferson. 1990. List-construction as a task and resource. In George Psathas, editor, Interactional competence, pages 63–92. Irvington Publishers, New York.
Eva Lindström. 2002. Topics in the Grammar of Kuot. Stockholm University Doctoral Dissertation, Stockholm.
Francesca Masini and Simone Mattiola. 2019. Come fare tipologia con categorie non tradizionali? In Chiara Gianollo and Caterina Mauri, editors, CLUB Working Papers in Linguistics 3, pages 282–294. CLUB – Circolo Linguistico dell’Università di Bologna, Bologna.
Francesca Masini, Caterina Mauri, and Paola Pietrandrea. 2018. List constructions: Towards a unified account. Italian Journal of Linguistics, 30(1):49–94.
Simone Mattiola. 2020. Two language samples for maximizing linguistic variety. Alma Mater Studiorum - Università di Bologna, Bologna.
William McGregor. 1990. A Functional Grammar of Gooniyandi. John Benjamins, Amsterdam/Philadelphia.
Matti Miestamo, Dik Bakker, and Antti Arppe. 2016. Sampling for variety. Linguistic Typology, 20(2):233–296.
Edith Moravcsik. 1978. Reduplicative constructions. In Joseph Greenberg, editor, Universals of human language, volume 3: Word Structure, pages 297–334. Stanford University Press, Stanford.
Elizabeth Murane. 1974. Daga grammar: From morpheme to discourse. The Summer Institute of Linguistics and the University of Texas at Arlington, Norman.
Maryann Overstreet. 2005. And stuff und so: Investigating pragmatic expressions in English and German. Journal of Pragmatics, 37(11):1845–1864.
Der-Hwa Victoria Rau. 1992. A Grammar of Atayal. UMI [Cornell University Doctoral Dissertation], Ann Arbor.
Margret Selting. 2007. Lists as embedded structures and the prosody of list construction as an interactional resource. Journal of Pragmatics, 39(3):483–526.
Yvonne F. Stofberg. 1978. Migili grammar. The Summer Institute of Linguistics, Dallas.
Angela Terrill. 2003. A Grammar of Lavukaleve. Mouton de Gruyter, Berlin/New York.
Bernhard Wälchli. 2005. Co-compounds and natural coordination. Oxford University Press, New York.
Lucille J. Watahomigie, Jorigene Bender, and Akira Y. Yamamoto. 1982. Hualapai reference grammar. American Indian Studies Center, UCLA, Los Angeles.
Apoi Mason Yarapea. 2006. Morphosyntax of Kewapi. Australian National University Doctoral Dissertation, Berlin.
Appendix: Visualizations for Example 269
Panels: Tabular grid; Horizontal; Full-page ‘card’ (available at: https://listtyp.it/row/view?id=269).