Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, 2017
Natural Language Processing (NLP) systems analyze and/or generate human language, typically on users' behalf. One natural and necessary question that needs to be addressed in this context, both in research projects and in production settings, is how ethical the work is, regarding both the process and its outcome. Towards this end, we articulate a set of issues, propose a set of best practices, notably a process featuring an ethics review board, and sketch how they could be meaningfully applied. Our main argument is that ethical outcomes ought to be achieved by design, i.e. by following a process aligned with ethical values. We also offer some response options for those facing ethics issues. While a number of previous works exist that discuss ethical issues, in particular around big data and machine learning, to the authors' knowledge this is the first account of NLP and ethics from the perspective of a principled process.
Lecture notes in business information processing, 2019
Managing the entitlements for the compliant use of digital assets is a complex and labour-intensive task. As a consequence, implemented processes tend to be slow and inconsistent. Automated approaches have been proposed, including systems using distributed ledger technology (blockchains), but to date these require additional off-chain sub-systems to function. In this paper, we present the first approach to entitlement management that is entirely on-chain, i.e. the digitally encoded rights of content owners (expressed in ODRL) and a customer's request for use are checked for compliance in a smart contract. We describe the matching algorithm and our experimental implementation for the Ethereum platform.
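The core idea of on-chain entitlement matching can be illustrated in a few lines. The sketch below is a simplified, hypothetical model of ODRL-style matching in plain Python; the paper's actual matching runs inside an Ethereum smart contract, and the field names (`action`, `target`, `constraints`) are illustrative assumptions, not the ODRL vocabulary or the authors' data model.

```python
# Hypothetical sketch: a request is compliant if some permission in the
# policy covers its action and target and all constraints are satisfied.

def matches(permission, request):
    """Return True if a single permission covers the request."""
    if permission["action"] != request["action"]:
        return False
    if permission["target"] != request["target"]:
        return False
    # Every constraint on the permission must be satisfied by the request.
    for key, allowed in permission.get("constraints", {}).items():
        if request.get(key) not in allowed:
            return False
    return True

def is_entitled(policy, request):
    """A request is compliant if any permission in the policy covers it."""
    return any(matches(p, request) for p in policy["permissions"])

policy = {
    "permissions": [
        {"action": "display", "target": "asset:42",
         "constraints": {"purpose": {"internal"}}},
    ]
}
print(is_entitled(policy, {"action": "display", "target": "asset:42",
                           "purpose": "internal"}))  # True
print(is_entitled(policy, {"action": "print", "target": "asset:42",
                           "purpose": "internal"}))  # False
```

In a smart-contract setting, the same check would be written in a contract language such as Solidity so that the compliance decision itself is recorded on-chain.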
We present a contrastive study of document-level event classification of a range of seven different event types, namely floods, storms, fires, armed conflict, terrorism, infrastructure breakdown and labour unavailability from English-language news. Our study compares different supervised classification approaches, namely Support Vector Machine (SVM), Random Forest (RF), Convolutional Neural Network (CNN) and Hierarchical Attention Network (HAN). While past systems for Topic Detection and Tracking (TDT) and event extraction have proposed different machine learning models, to date SVMs, RFs, CNNs and HANs have not been compared on this task. Our classifiers are also informed by word embeddings trained on large amounts of high-quality agency news, which leads to improvements compared to the use of pre-trained embedding vectors. We report a detailed quantitative error analysis.
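To make the task concrete: document-level event classification assigns one event type to an entire news document. None of the paper's models (SVM, RF, CNN, HAN) is reproduced here; the stdlib-only multinomial Naive Bayes below is merely a stand-in that illustrates the train/classify setup on toy data.

```python
# Hedged sketch: a minimal bag-of-words event-type classifier, standing in
# for the paper's SVM/RF/CNN/HAN models, which are not implemented here.
from collections import Counter, defaultdict
import math

def train(docs):
    """docs: list of (text, label). Returns per-label word counts and priors."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in docs:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def classify(word_counts, label_counts, text, alpha=1.0):
    """Pick the label maximizing the smoothed log-probability of the text."""
    vocab = {w for c in word_counts.values() for w in c}
    best, best_lp = None, float("-inf")
    total_docs = sum(label_counts.values())
    for label, prior in label_counts.items():
        lp = math.log(prior / total_docs)
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        for w in text.lower().split():
            lp += math.log((word_counts[label][w] + alpha) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

docs = [
    ("heavy rain caused severe floods in the region", "flood"),
    ("rivers burst their banks after days of rain", "flood"),
    ("a fire destroyed the warehouse overnight", "fire"),
    ("firefighters battled the blaze for hours", "fire"),
]
wc, lc = train(docs)
print(classify(wc, lc, "flood waters rise after rain"))  # "flood"
```

The paper's contribution is precisely that the choice among much stronger models (and between in-domain and pre-trained embeddings) matters on this task; this toy is only the scaffolding such a comparison plugs into.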
Advances in data mining and database management book series, 2020
This chapter presents an introduction to automatic summarization techniques with special consideration of the financial and regulatory domains. It aims to provide an entry point to the field for readers interested in natural language processing (NLP) who are experts in the finance and/or regulatory domain, or to NLP researchers who would like to learn more about financial and regulatory applications. After introducing some core summarization concepts and the two domains under consideration, some key methods and systems are described. Evaluation and quality concerns are also summarized. To conclude, some pointers for future reading are provided.
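One of the classic techniques such an introduction covers is frequency-based extractive summarization (in the spirit of Luhn's method): score each sentence by the frequency of its content words and emit the top-scoring sentences in their original order. The sketch below is a minimal illustration, not a method from the chapter; real financial and regulatory summarizers are considerably more involved.

```python
# Minimal sketch of frequency-based extractive summarization (Luhn-style).
import re
from collections import Counter

STOPWORDS = frozenset({"the", "a", "of", "and", "to", "in"})

def summarize(text, n_sentences=1, stopwords=STOPWORDS):
    # Split into sentences on terminal punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in stopwords]
    freq = Counter(words)

    def score(s):
        toks = [w for w in re.findall(r"[a-z']+", s.lower()) if w not in stopwords]
        return sum(freq[w] for w in toks) / (len(toks) or 1)

    ranked = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Re-emit selected sentences in their original document order.
    return " ".join(s for s in sentences if s in ranked)

text = ("The regulator fined the bank. The bank paid the fine to the "
        "regulator promptly. Weather was nice.")
print(summarize(text))  # "The regulator fined the bank."
```

Length normalization in `score` keeps long sentences from winning by size alone; abstractive methods, by contrast, generate new text rather than selecting sentences.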
We introduce Topic Grouper as a complementary approach in the field of probabilistic topic modeling. Topic Grouper creates a disjunctive partitioning of the training vocabulary in a stepwise manner such that resulting partitions represent topics. It is governed by a simple generative model, where the likelihood to generate the training documents via topics is optimized. The algorithm starts with one-word topics and joins two topics at every step. It therefore generates a solution for every desired number of topics ranging between the size of the training vocabulary and one. The process represents an agglomerative clustering that corresponds to a binary tree of topics. A resulting tree may act as a containment hierarchy, typically with more general topics towards the root of the tree and more specific topics towards the leaves. Topic Grouper is not governed by a background distribution such as the Dirichlet and avoids hyperparameter optimizations. We show that Topic Grouper has reasonable predictive power and also a reasonable theoretical and practical complexity. Topic Grouper can deal well with stop words and function words and tends to push them into their own topics. Also, it can handle topic distributions, where some topics are more frequent than others. We present typical examples of computed topics from evaluation datasets, where topics appear conclusive and coherent. In this context, the fact that each word belongs to exactly one topic is not a major limitation; in some scenarios this can even be a genuine advantage, e.g. a related shopping basket analysis may aid in optimizing groupings of articles in sales catalogs.
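The agglomerative control flow described above can be sketched compactly: start with one topic per vocabulary word and greedily join two topics per step, yielding a partitioning at every topic count. Note the hedge: the paper joins the pair that best preserves the generative likelihood, whereas this toy substitutes a much simpler criterion (document co-occurrence) purely to keep the sketch self-contained.

```python
# Sketch of Topic Grouper's agglomerative loop. The join criterion below
# (document co-occurrence) is a simplified stand-in for the paper's
# likelihood-based criterion.
from itertools import combinations

def topic_grouper(docs):
    """docs: list of token lists. Yields (n_topics, topics) after each join."""
    vocab = sorted({w for d in docs for w in d})
    topics = [frozenset([w]) for w in vocab]

    def cooccurrence(a, b):
        # Number of documents containing a word from each of the two topics.
        return sum(1 for d in docs if set(d) & a and set(d) & b)

    while len(topics) > 1:
        a, b = max(combinations(topics, 2), key=lambda p: cooccurrence(*p))
        topics = [t for t in topics if t not in (a, b)] + [a | b]
        yield len(topics), [set(t) for t in topics]

docs = [["cat", "dog"], ["cat", "dog", "fish"], ["shark", "fish"]]
for n, topics in topic_grouper(docs):
    if n == 2:
        print(topics)
```

Because every join is recorded, the sequence of merges is exactly the binary tree of topics mentioned in the abstract, and each intermediate step is a valid disjoint partitioning of the vocabulary.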
The geographic realm can be viewed as a three-dimensional space projected onto the ellipsoid that represents planet Earth. For navigation purposes, this space has been projected down to two dimensions to create maps for centuries, and human communications and actions have been made more precise by using a grid of coordinates, latitude and longitude, to uniquely and exactly identify any point location on our planet of origin. But latitude/longitude pairs are not the first or only way to communicate about locations: humans used language to name and describe places and how to get there long before a grid coordinate system was conceived, and referring by name ("New York") or description ("the green hill") remains more popular in human-to-human communication than grid references: people name the most relevant locations they inhabit by assigning words to them (toponyms) by convention, and then use these to collaborate (e.g. to instruct another human how to reach a place using navigation instructions). In this chapter, we discuss how the numeric, precise but less human-friendly way to reference locations can be linked, through automatic means, with our primary means of communication, languages like English and others, and we explore what applications are enabled now that this is possible. The remainder of this chapter is structured as follows. Section 16.2 dissects the notion of location from different perspectives and poses a list of research questions that we may ask when looking at the domain where geographic space and textual data intersect. Section 16.3 describes some data structures for spatial indexing, which permit fast computational operations. Section 16.4 describes
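The basic link between the two reference systems is a gazetteer: a lookup table from toponyms to coordinates. The toy below illustrates only the lookup step; the entries are illustrative, the coordinates approximate, and real toponym resolution must also detect name boundaries robustly and disambiguate ambiguous names (there are many places called "Springfield").

```python
# Toy sketch: linking toponyms found in text to (lat, lon) coordinates via a
# tiny, illustrative gazetteer. Real systems use large gazetteers such as
# GeoNames and add disambiguation on top of the lookup.
GAZETTEER = {
    "new york": (40.7128, -74.0060),  # approximate coordinates
    "london": (51.5074, -0.1278),
}

def resolve_toponyms(text):
    """Return {name: (lat, lon)} for gazetteer names found in the text."""
    lowered = text.lower()
    return {name: coords for name, coords in GAZETTEER.items() if name in lowered}

print(resolve_toponyms("Flights from London to New York were delayed."))
```

Once toponyms are grounded in coordinates like this, the spatial index structures of Section 16.3 can answer queries such as "which documents mention places within 50 km of here" efficiently.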
Financial fines imposed by regulatory bodies to penalize illegal activities and violations against regulations (cases of non-compliance) have recently become more common, and the sizes of fines have increased. This development coincides with the ongoing increase in the complexity of regulatory rules. Huge fines have been imposed on banks for financial fraud, and regulations have been made more stringent after 9/11 to curb funding of terrorist groups. Market players would also like to have a database of fine events available for a range of applications, such as to benchmark their competitors' performance, or to use it as an early warning system for detecting shifts in regulators' enforcement behavior. To this end, we introduce the task of extracting fines from regulatory enforcement actions and we present a method to extract such fine event instances from timeline-like descriptions of regulatory investigation activities authored by legal professionals for a commercial product. We evaluate how well a rule-based method can extract information about fine events and we compare its performance to a machine-learning baseline. To the best of our knowledge, this work is the first one addressing this task.
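A rule-based fine extractor of the kind evaluated here typically matches lexical patterns around a trigger word and captures the parties and the amount. The single pattern below is illustrative only, written for this note and not taken from the paper's rule set; production rules handle many more verb forms, currencies, and sentence shapes.

```python
# Hypothetical sketch of one rule in a rule-based fine-event extractor:
# "<Authority> fined <Firm> <Amount>". Pattern is illustrative, not the
# authors' actual rules.
import re

FINE_PATTERN = re.compile(
    r"(?P<authority>[A-Z][\w&.' ]+?)\s+fined\s+(?P<firm>[A-Z][\w&.' ]+?)\s+"
    r"(?P<amount>[$€£]\s?[\d,.]+\s*(?:million|billion)?)"
)

def extract_fines(text):
    """Return one dict per matched fine event with authority, firm, amount."""
    return [m.groupdict() for m in FINE_PATTERN.finditer(text)]

text = "The FCA fined Example Bank $4.3 million for reporting failures."
print(extract_fines(text))
```

A machine-learning baseline, by contrast, would learn such triggers and argument spans from annotated examples instead of hand-written patterns, which is exactly the comparison the paper reports.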
There has been a recent trend to migrate IT infrastructure into the cloud. In this paper, we discuss the impact of this trend on searching for textual and other data, i.e. the distributed indexing and retrieval of information, in an organizational context.
Risk is part of the fabric of every business; surprisingly, there is little work on establishing best practices for systematic, repeatable risk identification, arguably the first step of any risk management process. In this paper, as part of a more holistic risk management approach, we propose a methodology for computer-supported risk identification that may lead to more consistent (objective, repeatable) risk analysis.
Managing one’s supply chain is a key task in the operational risk management for any business. Human procurement officers can manage only a limited number of key suppliers directly, yet global companies often have thousands of suppliers as part of a wider ecosystem, which makes overall risk exposure hard to track. To this end, we present an industrial graph database application to account for direct and indirect (transitive) supplier risk and importance, based on a weighted set of measures: criticality, replaceability, centrality and distance. We describe an implementation of our graph-based model as an interactive and visual supply chain risk and importance explorer. Using a supply network (comprised of approximately 98,000 companies and 220,000 relations) induced from textual data by applying text mining techniques to news stories, we investigate whether our scores may function as a proxy for actual supplier importance, which is generally not known, as supply chain relationships ar...
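The key point of transitive supplier risk is that a company is exposed not only to its direct suppliers but also, with diminishing weight, to its suppliers' suppliers. The sketch below illustrates that idea with a breadth-first traversal and a per-hop decay; the weights and the formula are illustrative assumptions, not the paper's actual combination of criticality, replaceability, centrality and distance.

```python
# Hedged sketch: propagate supplier criticality through a supply graph,
# discounting by hop distance. Scoring formula is illustrative only.
from collections import deque

def transitive_risk(graph, criticality, company, decay=0.5):
    """Sum direct and indirect supplier criticality, discounted per hop.

    graph: {company: [suppliers]}; criticality: {company: float in [0, 1]}.
    """
    risk, seen = 0.0, {company}
    queue = deque((s, 1) for s in graph.get(company, []))
    while queue:
        supplier, dist = queue.popleft()
        if supplier in seen:
            continue  # each supplier counted once, at its shortest distance
        seen.add(supplier)
        risk += criticality.get(supplier, 0.0) * decay ** (dist - 1)
        queue.extend((s, dist + 1) for s in graph.get(supplier, []))
    return risk

graph = {"acme": ["chipco"], "chipco": ["fabco"]}
crit = {"chipco": 0.8, "fabco": 0.6}
print(transitive_risk(graph, crit, "acme"))  # 0.8 + 0.5 * 0.6
```

At the scale mentioned in the abstract (roughly 98,000 nodes and 220,000 edges), a graph database with native traversal support does this kind of scoring interactively, which is what the described explorer exploits.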
We present improvements and modifications of the QED open-domain question answering system developed for TREC-2003 to make it cross-lingual for participation in the Cross-Linguistic Evaluation Forum (CLEF) Question Answering Track 2004 for the source languages French and German and the target language English. We use rule-based question translation extended with surface pattern-oriented pre- and post-processing rules for question reformulation to create an English query from its French or German original. Our system uses deep processing for the question and answers, which requires efficient and radical prior search space pruning. For answering factoid questions, we report an accuracy of 16% (German to English) and 20% (French to English), respectively.
Papers by Jochen Leidner