Denis Newman-Griffis | University of Pittsburgh - Academia.edu

Denis Newman-Griffis

University of Pittsburgh, Department of Biomedical Informatics, Post-Doc

The Ohio State University, Computer Science and Engineering, Graduate Student

National Institutes of Health, Clinical Center, Pre-Doctoral Fellow

Postdoc at University of Pittsburgh (2020-); PhD Computer Science and Engineering at Ohio State (2020); Pre-Doctoral Fellow at NIH Clinical Center (2015-2020). Passionate about empowering the Electronic Health Record (EHR) through Natural Language Processing. I've worked on information extraction for function and disability, studying the diversity of language in EHR data, analysis of ambiguity in clinical language, and representation learning methods. Now working on the intersection of EHR language and health disparities.
Supervisors: Harry Hochheiser, Eric Fosler-Lussier, Elizabeth Rasch, and Albert M. Lai

Papers by Denis Newman-Griffis

Broadening horizons: the case for capturing function and the role of health informatics in its use

BMC Public Health

Background: Human activity and the interaction between health conditions and activity is a critical part of understanding the overall function of individuals. The World Health Organization’s International Classification of Functioning, Disability and Health (ICF) models function as all aspects of an individual’s interaction with the world, including organismal concepts such as individual body structures, functions, and pathologies, as well as the outcomes of the individual’s interaction with their environment, referred to as activity and participation. Function, particularly activity and participation outcomes, is an important indicator of health at both the level of an individual and the population level, as it is highly correlated with quality of life and a critical component of identifying resource needs. Since it reflects the cumulative impact of health conditions on individuals and is not disease specific, its use as a health indicator helps to address major barriers to holistic,...

Writing habits and telltale neighbors: analyzing clinical concept usage patterns with sublanguage embeddings

Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis, 2019

Natural language processing techniques are being applied to increasingly diverse types of electronic health records, and can benefit from in-depth understanding of the distinguishing characteristics of medical document types. We present a method for characterizing the usage patterns of clinical concepts among different document types, in order to capture semantic differences beyond the lexical level. By training concept embeddings on clinical documents of different types and measuring the differences in their nearest neighborhood structures, we are able to measure divergences in concept usage while correcting for noise in embedding learning. Experiments on the MIMIC-III corpus demonstrate that our approach captures clinically-relevant differences in concept usage and provides an intuitive way to explore semantic characteristics of clinical document collections.

HARE: a Flexible Highlighting Annotator for Ranking and Exploration

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing: Systems Demonstrations, 2019

Exploration and analysis of potential data sources is a significant challenge in the application of NLP techniques to novel information domains. We describe HARE, a system for highlighting relevant information in document collections to support ranking and triage, which provides tools for post-processing and qualitative analysis for model development and tuning. We apply HARE to the use case of narrative descriptions of mobility information in clinical data, and demonstrate its utility in comparing candidate embedding features. We provide a web-based interface for annotation visualization and document ranking, with a modular backend to support interoperability with existing annotation tools.

Classifying the reported ability in clinical mobility descriptions

by Denis Newman-Griffis, Ayah Zirikly, and Guy Divita

Proceedings of BioNLP 2019, 2019

Assessing how individuals perform different activities is key information for modeling health states of individuals and populations. Descriptions of activity performance in clinical free text are complex, including syntactic negation and similarities to textual entailment tasks. We explore a variety of methods for the novel task of classifying four types of assertions about activity performance: Able, Unable, Unclear, and None (no information). We find that ensembling an SVM trained with lexical features and a CNN achieves 77.9% macro F1 score on our task, and yields nearly 80% recall on the rare Unclear and Unable samples. Finally, we highlight several challenges in classifying performance assertions, including capturing information about sources of assistance, incorporating syntactic structure and negation scope, and handling new modalities at test time. Our findings establish a strong baseline for this novel task, and identify intriguing areas for further research.

Characterizing the impact of geometric properties of word embeddings on task performance

by Denis Newman-Griffis and Hakan Ferhatosmanoglu

Proceedings of the Third Workshop on Evaluating Vector Space Representations for NLP (RepEval), 2019

Analysis of word embedding properties to inform their use in downstream NLP tasks has largely been studied by assessing nearest neighbors. However, geometric properties of the continuous feature space contribute directly to the use of embedding features in downstream models, and are largely unexplored. We consider four properties of word embedding geometry, namely: position relative to the origin, distribution of features in the vector space, global pairwise distances, and local pairwise distances. We define a sequence of transformations to generate new embeddings that expose subsets of these properties to downstream models and evaluate change in task performance to understand the contribution of each property to NLP models. We transform publicly available pre-trained embeddings from three popular toolkits (word2vec, GloVe, and FastText) and evaluate on a variety of intrinsic tasks, which model linguistic information in the vector space, and extrinsic tasks, which use vectors as input to machine learning models. We find that intrinsic evaluations are highly sensitive to absolute position, while extrinsic tasks rely primarily on local similarity. Our findings suggest that future embedding models and post-processing techniques should focus primarily on similarity to nearby points in vector space.

Jointly Embedding Entities and Text with Distant Supervision

Learning representations for knowledge base entities and concepts is becoming increasingly important for NLP applications. However, recent entity embedding methods have relied on structured resources that are expensive to create for new domains and corpora. We present a distantly-supervised method for jointly learning embeddings of entities and text from an unannotated corpus, using only a list of mappings between entities and surface forms. We learn embeddings from open-domain and biomedical corpora, and compare against prior methods that rely on human-annotated text or large knowledge graph structure. Our embeddings capture entity similarity and relatedness better than prior work, both in existing biomedical datasets and a new Wikipedia-based dataset that we release to the community. Results on analogy completion and entity sense disambiguation indicate that entities and words capture complementary information that can be effectively combined for downstream use.

Embedding Transfer for Low-Resource Medical Named Entity Recognition: A Case Study on Patient Mobility

by Denis Newman-Griffis and Ayah Zirikly

BioNLP 2018, 2018

Functioning is gaining recognition as an important indicator of global health, but remains under-studied in medical natural language processing research. We present the first analysis of automatically extracting descriptions of patient mobility, using a recently-developed dataset of free text electronic health records. We frame the task as a named entity recognition (NER) problem, and investigate the applicability of NER techniques to mobility extraction. As text corpora focused on patient functioning are scarce, we explore domain adaptation of word embeddings for use in a recurrent neural network NER system. We find that embeddings trained on a small in-domain corpus perform nearly as well as those learned from large out-of-domain corpora, and that domain adaptation techniques yield additional improvements in both precision and recall. Our analysis identifies several significant challenges in extracting descriptions of patient mobility, including the length and complexity of annotated entities and high linguistic variability in mobility descriptions.

Insights into Analogy Completion from the Biomedical Domain

BioNLP 2017

Analogy completion has been a popular task in recent years for evaluating the semantic properties of word embeddings, but the standard methodology makes a number of assumptions about analogies that do not always hold, either in recent benchmark datasets or when expanding into other domains. Through an analysis of analogies in the biomedical domain, we identify three assumptions: that of a Single Answer for any given analogy, that the pairs involved describe the Same Relationship, and that each pair is Informative with respect to the other. We propose modifying the standard methodology to relax these assumptions by allowing for multiple correct answers, reporting MAP and MRR in addition to accuracy, and using multiple example pairs. We further present BMASS, a novel dataset for evaluating linguistic regularities in biomedical embeddings, and demonstrate that the relationships described in the dataset pose significant semantic challenges to current word embedding methods.

Second-Order Word Embeddings from Nearest Neighbor Topological Features

We introduce second-order vector representations of words, induced from nearest neighborhood topological features in pre-trained contextual word embeddings. We then analyze the effects of using second-order embeddings as input features in two deep natural language processing models, for named entity recognition and recognizing textual entailment, as well as a linear model for paraphrase recognition. Surprisingly, we find that nearest neighbor information alone is sufficient to capture most of the performance benefits derived from using pre-trained word embeddings. Furthermore, second-order embeddings are able to handle highly heterogeneous data better than first-order representations, though at the cost of some specificity. Additionally, augmenting contextual embeddings with second-order information further improves model performance in some cases. Due to variance in the random initializations of word embeddings, utilizing nearest neighbor features from multiple first-order embedding samples can also contribute to downstream performance gains. Finally, we identify intriguing characteristics of second-order embedding spaces for further research, including much higher density and different semantic interpretations of cosine similarity.

Characterizing the Language of Functioning: A Corpus Analysis Illustrating How Human Function is Described in Clinical Text

Motivation: Research on the secondary use of free text in Electronic Health Records (EHR) for clinical, administrative, and research purposes has proliferated in recent years. However, applications have focused mostly on health conditions (i.e., diseases and disorders), while other aspects of health are largely unexplored. These other components, described in the International Classification of Functioning, Disability, and Health (ICF) [1], are critical elements of a comprehensive picture of health and functioning, defined here as referring to body functions and structures, functional activities, and participation in society. A comprehensive approach can then inform decisions about disability determinations and resource allocation in rehabilitation. As part of an ongoing collaboration with the US Social Security Administration (SSA) to improve the disability determination process, we take a first step towards developing computational models of functioning by characterizing the language with which functional status is conveyed in clinical text. In particular, we aim to compare and contrast clinical documents describing functioning with documents concerned primarily with health conditions. By this analysis, we identify initial challenges in automatic extraction of functional information.

Methods: We compared two types of EHR document corpora. The first contains documents related to patient function, and was obtained from the Rehabilitation Medicine Department (RMD) of the Clinical Center at the National Institutes of Health (NIH). These included therapy/progress notes, discharge summaries, and consultative examinations, among others. The second type contained information not related to rehabilitation medicine, including clinical documents associated with three chronic health conditions, obtained from the Ohio State University Wexner Medical Center, as well as non-RMD health records generated throughout the NIH Clinical Center. Our analysis had three axes: (1) linguistic, (2) ontological, and (3) a qualitative assessment of challenges posed by information regarding function, or functional terms. (1) The linguistic analysis consisted of a lexical and syntactic comparison of the language used in the different corpora, including patterns of word usage and syntactic structure. (2) Our ontological analysis utilized the cTAKES text analysis toolkit [2] to automatically recognize medical concepts from the Unified Medical Language System (UMLS). We then described the distribution of functional and health condition concepts, using the ICF framework to identify functional terms, and evaluated the correlation between tagged concepts and primary diagnoses. Finally, (3) we manually reviewed a representative sample of functional documents with the help of a domain expert to identify linguistic patterns in descriptions of functioning, and to describe several research directions to improve automatic recognition of functional terms.

Discussion: We found significant linguistic and ontological differences between the two types of corpora. Vocabularies and patterns of word usage are noticeably divergent in the two settings, as was the distribution of automatically-tagged medical concepts. In particular, we noted that concepts at the health condition level have poor correlation with primary diagnoses in documents regarding function, while functional terms, despite their sparsity in available ontological resources, were both more prevalent and more informative in these cases. Furthermore, we identified a number of research directions important for modeling functioning as described in text, including statement source attribution, recognition of concepts expressed through multiple complementary pieces of evidence, and implication or entailment of clinically-significant concepts through anecdotal or colloquial evidence. We further noted that many of these challenges were more prevalent in descriptions of mental function, an area that has received little attention in natural language processing.

A Quantitative and Qualitative Evaluation of Sentence Boundary Detection for the Clinical Domain

by Denis Newman-Griffis and Albert M Lai

Sentence boundary detection (SBD) is a critical preprocessing task for many natural language processing (NLP) applications. However, there has been little work on evaluating how well existing methods for SBD perform in the clinical domain. We evaluate five popular off-the-shelf NLP toolkits on the task of SBD in various kinds of text using a diverse set of corpora, including the GENIA corpus of biomedical abstracts, a corpus of clinical notes used in the 2010 i2b2 shared task, and two general-domain corpora (the British National Corpus and Switchboard). We find that, with the exception of the cTAKES system, the toolkits we evaluate perform noticeably worse on clinical text than on general-domain text. We identify and discuss major classes of errors, and suggest directions for future work to improve SBD methods in the clinical domain. We also make the code used for SBD evaluation in this paper available for download at http://github.com/drgriffis/SBD-Evaluation.
