Presented at the First Workshop on Typology for Polyglot NLP, Florence, Aug. 1, 2019 (co-located with ACL, July 28-Aug. 2, 2019)
Manual encoding of typological databases is a tedious and time-consuming procedure. Bender (2016) reviews recent efforts to extract typological features from interlinear glossed text (Lewis and Xia, 2010), Bible corpora (Östling, 2015; Malaviya et al., 2017), and sources such as morphologically annotated resources and treebanks (Bjerva and Augenstein, 2018). However, few publications describe the application of NLP techniques to extract typological features directly from the language descriptions contained in grammar books, dissertations, and linguistics articles. Collections of such descriptive sources are accumulating as PDFs, many of them scans that have subsequently been OCR’ed. In this paper, we describe our first attempt at building an NLP pipeline that extracts typological features from OCR’ed linguistic descriptions.
Papers by Søren Wichmann