The availability of lexical resources is a cornerstone for the preservation and documentation of endangered languages; these resources are also a primary source for language teaching and revitalization. For instance, Mexico has around 70 indigenous languages and XX variations spoken by about 7 million people, which, despite their cultural importance, lack digital presence, have poor data quality, and face extinction. To confront these circumstances, we used text mining approaches to collect and transform existing lexical resources into language-learning resources for four endangered languages of Mexico. Finally, we present an application for such learning resources using Anki, an open-source and multi-platform
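As a rough illustration of the last step described above, the following minimal sketch turns a list of (headword, gloss) pairs into tab-separated text, a plain-text format that Anki can import as notes. It is not the authors' pipeline: the function name `entries_to_anki_tsv`, the two-field note layout, and the sample entries are all assumptions made for illustration, and the entries are placeholders rather than real lexical data.

```python
import csv
import io


def entries_to_anki_tsv(entries):
    """Serialize (headword, gloss) pairs as tab-separated text, one note per line.

    Hypothetical helper: the two-field layout (front = headword, back = gloss)
    is an assumption, not something specified in the abstract.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for headword, gloss in entries:
        headword, gloss = headword.strip(), gloss.strip()
        if headword and gloss:  # skip entries missing either side
            writer.writerow([headword, gloss])
    return buffer.getvalue()


# Placeholder entries (NOT real lexical data); the last one is incomplete
# and is therefore dropped.
sample_entries = [
    ("headword_1", "gloss_1"),
    ("headword_2 ", "gloss_2"),
    ("", "orphan gloss"),
]

tsv_text = entries_to_anki_tsv(sample_entries)
print(tsv_text, end="")
```

The resulting text can be saved to a file and loaded through Anki's text-file import, mapping the first column to the front of the card and the second to the back.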
Papers by Jason Angel
Introduction
For a long time, indigenous languages in general, and Mexican indigenous languages in particular, have been a niche research area. Several disciplines, from linguistics, anthropology, and sociology to (to a lesser extent) Computational Linguistics (CL) and Natural Language Processing (NLP), are traditional stakeholders in the topic. Still, the cost-benefit ratio is not favorable in computer science because of the scarcity of sources, divergence among variants of the same language, the lack of standard orthographies, and low awareness of the topic, as well as the perceived narrowness of applications and their lack of impact. Thus, NLP/CL researchers have remained on the fringes.
Development
We believe that shining a different light on these apparent obstacles and deterrents can improve the perception of these languages' relevance and interest as research objects. On the one hand, speakers of these languages number approximately seven million in Mexico, and self-identified descendants of Mexican indigenous cultures almost four times as many, which is not a negligible number. On the other hand, low-resource NLP is interesting in itself as a research problem, and it can also help develop tools for the social sciences and receive valuable insights from them, fostering an environment of multidisciplinary studies that, in turn, leads to the preservation, inclusion, exaltation of cultural identity, and enhanced visibility of Mexican indigenous cultures. As we perceive sparks of renewed interest in low-resource languages among NLP/CL researchers, we believe that an assessment of the present challenges for the area of study is pertinent. With this in mind, we develop a systemic analysis of the field and outline a technical roadmap proposal towards the inclusion of Mexican indigenous languages in mainstream NLP/CL research. Briefly, this roadmap has two main areas, technical and linguistic, which eventually converge through the multidisciplinary paradigm. Within these areas, several tasks are proposed both in parallel and in succession. Among the most notable is the formation of teams of linguists, social anthropologists, and computer scientists to create language-collection tools that are socially acceptable within the culture and standards of the communities. On the technical side, we propose the development of algorithms for small corpora; on the linguistic side, the development of dictionaries of equivalent orthographies for distinct variants of these languages.
Conclusion
Even though the current panorama for indigenous languages in NLP/CL research is bleak, the re-emergence of a focus on the exaltation and preservation of native cultures and the push for multidisciplinary studies create a fertile environment for new approaches towards the inclusion of these languages in mainstream NLP/CL research. Applied research in this area can build bridges for the inclusion of indigenous communities in nationwide social and political conversations from which they are now excluded by the digital divide. It can also have a tremendous impact on preserving past and present versions of native languages and on increasing the visibility of these cultures.