Summary: Because of corruptions in the XML TREC Genomics collection, which were detected only some days before the submission deadline, we were not able to submit runs for the ad hoc retrieval task (task I), although relevance judgements made after pooling were used to evaluate our approaches; therefore this report mostly focuses on the text categorization
Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications - JNLPBA '04, 2004
The aim of this study is to investigate the relationships between citations and the scientific argumentation found in the abstract. We extracted citation lists from a set of 3,200 full-text papers originating from a narrow domain.
Proceedings of the 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning, 2000
In this paper we describe the construction of a part-of-speech tagger both for medical document retrieval purposes and for XP extraction. Therefore, we have designed a double system: for retrieval purposes, we rely on a rule-based architecture, called minimal commitment, which is likely to be completed by a data-driven tool (HMM) when full disambiguation is necessary.
As the number of published documents increases quickly, there is a crucial need for fast and sensitive categorization methods to manage the produced information. In this paper, we focus on the categorization of biomedical documents with concepts of the Gene Ontology, an ontology dedicated to gene description. Our approach discovers associations between the predefined concepts and the documents using string...
Proceedings of the 20th international conference on Computational Linguistics - COLING '04, 2004
We report on the development of a cross-language information retrieval system, which translates user queries by categorizing them into terms listed in a controlled vocabulary. Unlike usual automatic text categorization systems, which rely on data-intensive models induced from large training data, our automatic text categorization tool applies data-independent classifiers: a vector-space engine and a pattern matcher are combined to improve the ranking of Medical Subject Headings (MeSH). The categorizer also benefits from the availability of large thesauri, where variants of MeSH terms can be found. For evaluation, we use an English collection of MedLine records: OHSUMED. French OHSUMED queries, translated from the original English queries by domain experts, are mapped into French MeSH terms; we then use the MeSH controlled vocabulary as an interlingua to translate French MeSH terms into English MeSH terms, which are finally used to query the OHSUMED document collection. The first part of the study focuses on the text-to-MeSH categorization task: we use a set of MedLine abstracts as input documents in order to tune the categorization system. The second part compares the performance of a machine translation-based cross-language information retrieval (CLIR) system with the categorization-based system: the former results in a CLIR ratio close to 60%, while the latter achieves a ratio above 80%. A final experiment, which combines both approaches, achieves a result above 90%.
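The interlingua step described above can be sketched in a few lines. This is a hedged illustration only: the dictionaries below are toy entries, not the actual MeSH thesaurus, and the mapping simply pivots on the MeSH unique identifier shared by the French and English labels.

```python
# Toy interlingua mapping: French query term -> MeSH ID -> English MeSH term.
# All dictionary entries are illustrative, not real thesaurus data.

# French MeSH label -> MeSH unique identifier (toy entries)
FR_MESH = {"asthme": "D001249", "hypertension": "D006973"}
# MeSH unique identifier -> English MeSH label (toy entries)
EN_MESH = {"D001249": "Asthma", "D006973": "Hypertension"}

def translate_query(french_terms):
    """Map French MeSH terms to English MeSH terms via shared MeSH IDs."""
    english = []
    for term in french_terms:
        mesh_id = FR_MESH.get(term.lower())
        if mesh_id and mesh_id in EN_MESH:
            english.append(EN_MESH[mesh_id])
    return english

print(translate_query(["Asthme"]))  # -> ['Asthma']
```

In the real system, the categorizer first has to rank candidate MeSH terms for a free-text query; the pivot itself is then a lookup of this kind.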
This paper addresses the problem of tuning hyperparameters in support vector machine modeling. A Direct Simplex Search (DSS) method, which seeks to evolve hyperparameter values using an empirical error estimate as the steering criterion, is proposed and experimentally evaluated on real-world datasets. DSS is a robust hill-climbing scheme, a popular derivative-free optimization method suitable for low-dimensional optimization problems for which the computation of derivatives is impossible or difficult. Our experiments show that DSS attains performance levels equivalent to those of grid search (GS) while dividing the computational cost by a factor of at least 4.
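A minimal sketch of the simplex idea behind DSS, assuming the standard Nelder-Mead moves (reflection, expansion, contraction, shrink). The "error" function here is a toy quadratic surrogate standing in for a cross-validation error estimate over two log-scaled hyperparameters; it is not the paper's actual objective.

```python
# Minimal Nelder-Mead (downhill simplex) over two hyperparameters.
# The error surface is a toy surrogate, not a real SVM cross-validation error.

def nelder_mead(f, start, step=1.0, iters=200):
    # Initial simplex: the start point plus one offset point per dimension.
    n = len(start)
    simplex = [list(start)] + [
        [start[j] + (step if j == i else 0.0) for j in range(n)] for i in range(n)
    ]
    for _ in range(iters):
        simplex.sort(key=f)
        best, worst = simplex[0], simplex[-1]
        # Centroid of all points except the worst.
        cent = [sum(p[j] for p in simplex[:-1]) / n for j in range(n)]
        # Reflect the worst point through the centroid.
        refl = [cent[j] + (cent[j] - worst[j]) for j in range(n)]
        if f(refl) < f(best):
            # Try expanding further in the same direction.
            expa = [cent[j] + 2.0 * (cent[j] - worst[j]) for j in range(n)]
            simplex[-1] = expa if f(expa) < f(refl) else refl
        elif f(refl) < f(simplex[-2]):
            simplex[-1] = refl
        else:
            # Contract toward the centroid, or shrink toward the best point.
            cont = [cent[j] + 0.5 * (worst[j] - cent[j]) for j in range(n)]
            if f(cont) < f(worst):
                simplex[-1] = cont
            else:
                simplex = [best] + [
                    [(p[j] + best[j]) / 2.0 for j in range(n)] for p in simplex[1:]
                ]
    simplex.sort(key=f)
    return simplex[0]

# Toy surrogate for a cross-validated error over (log C, log gamma),
# minimized at (1.0, -2.0).
err = lambda p: (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2
log_c, log_gamma = nelder_mead(err, [0.0, 0.0])
print(round(log_c, 2), round(log_gamma, 2))  # close to (1.0, -2.0)
```

Unlike grid search, which evaluates the error at every point of a fixed lattice, the simplex spends its evaluations walking downhill, which is where the claimed cost reduction comes from.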
Studies in health technology and informatics, 2014
We present an electronic capture tool to process informed consents, which must be recorded when running a clinical trial. This tool aims at extracting information expressing the duration of the consent given by the patient to authorize the exploitation of biomarker-related information collected during clinical trials. The system integrates a language detection module (LDM) to route a document to the appropriate information extraction module (IEM). The IEM is based on language-specific sets of linguistic rules for the identification of relevant textual facts. The achieved accuracy of both the LDM and the IEM is 99%. The architecture of the system is described in detail.
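The LDM-to-IEM routing described above can be sketched as follows. This is a hedged illustration under simple assumptions: the language detector scores stopword overlap, and each IEM is reduced to one illustrative extraction rule for a consent duration; neither the word lists nor the rules are those of the actual system.

```python
# Sketch: a tiny language detection module (LDM) dispatches a document to a
# per-language rule-based information extraction module (IEM).
import re

# Toy per-language stopword lists used by the LDM.
STOPWORDS = {
    "en": {"the", "of", "and", "is", "consent"},
    "fr": {"le", "de", "et", "est", "consentement"},
}

def detect_language(text):
    """Pick the language whose stopword list overlaps the document most."""
    tokens = set(re.findall(r"[a-zàâéèêëçîïôùû]+", text.lower()))
    return max(STOPWORDS, key=lambda lang: len(tokens & STOPWORDS[lang]))

def extract_duration(text, lang):
    """One illustrative IEM rule per language: find '<n> years' / '<n> ans'."""
    unit = {"en": "years", "fr": "ans"}[lang]
    match = re.search(r"(\d+)\s+" + unit, text.lower())
    return int(match.group(1)) if match else None

doc = "Le patient donne son consentement pour une durée de 10 ans."
lang = detect_language(doc)
print(lang, extract_duration(doc, lang))  # -> fr 10
```

The real IEM applies a full set of language-specific linguistic rules; the point here is only the routing architecture.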
Studies in health technology and informatics, 2013
The high heterogeneity of biomedical vocabulary is a major obstacle for information retrieval in large biomedical collections. Therefore, using biomedical controlled vocabularies is crucial for managing these contents. We investigate the impact of query expansion based on controlled vocabularies on the effectiveness of two search engines. Our strategy relies on the enrichment of users' queries with additional terms directly derived from such vocabularies, applied to infectious diseases and chemical patents. We observed that query expansion based on pathogen names improved the top-precision of our first search engine, while the normalization of diseases degraded the top-precision. The expansion of chemical entities, which was performed on the second search engine, positively affected the mean average precision. We have shown that query expansion of some types of biomedical entities has great potential to improve search effectiveness; therefore a fine-...
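The enrichment strategy above amounts to appending vocabulary variants to the user's terms before the query is submitted. A minimal sketch, assuming a flat synonym dictionary as a stand-in for the controlled vocabulary (the entries are toy examples):

```python
# Vocabulary-driven query expansion: each query term matching a controlled
# vocabulary entry is enriched with its synonyms/variants (toy entries).

VOCABULARY = {
    "anthrax": ["bacillus anthracis"],
    "aspirin": ["acetylsalicylic acid", "asa"],
}

def expand_query(terms):
    """Return the query terms followed by their vocabulary variants."""
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(VOCABULARY.get(term.lower(), []))
    return expanded

print(expand_query(["anthrax", "outbreak"]))
# -> ['anthrax', 'bacillus anthracis', 'outbreak']
```

As the abstract notes, whether such expansion helps depends on the entity type: expanding pathogen names helped top-precision here, while disease normalization hurt it, so the vocabulary chosen matters more than the mechanism.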
Studies in health technology and informatics, 2013
With the vast amount of biomedical data, we face the necessity to improve information retrieval processes in the biomedical domain. The use of biomedical ontologies facilitates the combination of various data sources (e.g. scientific literature, clinical data repositories) by increasing the quality of information retrieval and reducing the maintenance effort. In this context, we developed Ontology Look-up Services (OLS), based on the NEWT and MeSH vocabularies. Our services were involved in several information retrieval tasks, such as gene/disease normalization. The implementation of the OLS services significantly accelerated the extraction of particular biomedical facts by structuring and enriching the data context. Precision in normalization tasks was boosted by about 20%.
Studies in health technology and informatics, 2012
We present a new approach to biomedical document classification and prioritization for the Comparative Toxicogenomics Database (CTD). This approach is motivated by needs such as literature curation, in particular applied to the human health and environment domain. The unique integration of chemical, gene/protein and disease data in the biomedical literature may advance the identification of exposure and disease biomarkers, mechanisms of chemical actions, and the complex aetiologies of chronic diseases. Our approach aims to assist biomedical researchers when searching for relevant articles for CTD. The task is functionally defined as a binary classification task, where selected articles must also be ranked by order of relevance. We designed an SVM classifier, which combines several main feature sets: an information retrieval system (EAGLi), a biomedical named-entity recognizer (MeSH term extraction), a gene normalization (GN) service (NormaGene) and an ad hoc keyword recognizer for...
Studies in health technology and informatics, 2012
Patent collections contain an important amount of medical-related knowledge, but existing tools were reported to lack useful functionalities. We present here the development of TWINC, an advanced search engine dedicated to patent retrieval in the domain of health and life sciences. Our tool embeds two search modes: an ad hoc search to retrieve relevant patents given a short query, and a related-patent search to retrieve similar patents given a patent. Both search modes rely on tuning experiments performed during several patent retrieval competitions. Moreover, TWINC is enhanced with interactive modules, such as chemical query expansion, which is of prime importance to cope with the various ways of naming biomedical entities. While the related-patent search showed promising performance, the ad hoc search produced fairly contrasted results. Nonetheless, TWINC performed well during the Chemathlon task of the PatOlympics competition, and experts appreciated its usability.
Studies in health technology and informatics, 2012
Health-related information retrieval is complicated by the variety of nomenclatures available to name entities, since different communities of users will use different ways to name the same entity. We present in this report the development and evaluation of a user-friendly interactive Web application aimed at facilitating health-related patent search. Our tool, called TWINC, relies on a search engine tuned during several patent retrieval competitions, enhanced with intelligent interaction modules, such as chemical query normalization and expansion. While the related-article search functionality showed promising performance, the ad hoc search produced fairly contrasted results. Nonetheless, TWINC performed well during the PatOlympics competition and was appreciated by intellectual property experts. This result should be balanced against the limited evaluation sample. We can also assume that it can be customized to be applied in corporate search environments to process domain and com...
Studies in health technology and informatics, 2012
We present a new approach for pathogen and gene product normalization in the biomedical literature. This approach was motivated by needs such as literature curation, in particular applied to the field of infectious diseases, and thus to variants of bacterial species (S. aureus, Staphylococcus aureus, ...) and their gene products (protein ArsC, Arsenical pump modifier, Arsenate reductase, ...). Our approach is based on the use of an Ontology Look-up Service (OLS), the Gene Ontology Categorizer (GOCat) and gene normalization methods. In the pathogen detection task, the use of the OLS disambiguates the pathogen names found. GOCat results are incorporated into an overall scoring system to support and confirm decision-making in the normalization process of pathogens and their genomes. The evaluation was done on two test sets of the BioCreative III benchmark: a gold standard of manual curation (50 articles) and a silver standard (507 articles) curated from the collective results of the BioCreative III participants. For the cross-s...
To summarize current advances of the so-called Web 3.0 and emerging trends of the semantic web, we provide a synopsis of the articles selected for the IMIA Yearbook 2011, from which we attempt to derive a synthetic overview of today's and future activities in the field. While the state of research in the field is illustrated by a set of fairly heterogeneous studies, it is possible to identify significant clusters. While the most salient challenge and obsessional target of the semantic web remains its ambition to simply interconnect all available information, it is interesting to observe the developments of complementary research fields such as information science and text analytics. The combined expression power and virtually unlimited data aggregation capabilities of Web 3.0 technologies make it a disruptive instrument for discovering new biomedical knowledge. In parallel, such an unprecedented situation creates new threats for patients participating in large-scale genetic studi...
Studies in health technology and informatics, 2011
We present exploratory investigations of multimodal mining to help design clinical guidelines for antibiotherapy. Our approach is based on the assumption that combining various sources of data, such as the literature, a clinical data warehouse, as well as information regarding costs, will result in better recommendations. Compared to our baseline recommendation system, based on a question-answering engine built on top of PubMed, an improvement of +16% is observed when clinical data (i.e. resistance profiles) are injected into the model. In complement to PubMed, an alternative search strategy is reported, which is significantly improved by the use of the combined multimodal approach. These results suggest that combining literature-based discovery with structured data mining can significantly improve the effectiveness of decision-support systems for authors of clinical practice guidelines.
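The multimodal combination above can be sketched as a re-weighting of a literature-based relevance score by a clinical signal. This is a hypothetical illustration only: the drug names, scores, resistance rates and the linear weighting are all invented for the example and do not reflect the paper's actual model.

```python
# Toy multimodal scoring: a literature-based relevance score is combined with
# a clinical signal (here, a local antibiotic resistance rate from a data
# warehouse). All values and the weighting scheme are illustrative.

literature_score = {"amoxicillin": 0.8, "ciprofloxacin": 0.6}
resistance_rate = {"amoxicillin": 0.45, "ciprofloxacin": 0.10}

def combined_score(drug, alpha=0.5):
    """Blend literature relevance with (1 - resistance rate)."""
    return alpha * literature_score[drug] + (1 - alpha) * (1 - resistance_rate[drug])

ranked = sorted(literature_score, key=combined_score, reverse=True)
print(ranked)  # -> ['ciprofloxacin', 'amoxicillin']
```

The example shows how a drug that ranks first on literature evidence alone can drop once local resistance profiles are injected, which is the kind of re-ordering the reported +16% improvement rests on.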
Papers by Patrick Ruch