
Lab Report Special Section:

Natural Language Processing and Information Retrieval Group Information


Access and User Interfaces Division National Institute of Standards and
Technology

Donna Harman, Manager


NIST
Bldg. 225, Room A216
Gaithersburg, MD 20899

Overview

The Natural Language Processing and Information Retrieval Group was formed in 1994 at the
National Institute of Standards and Technology (NIST) in recognition of the importance of
managing the ever-increasing amount of electronically available text. The formal objective of the
group is "to work with industry, academia and other government agencies to promote the use of
more effective and efficient techniques for manipulating (largely) unstructured textual
information, especially the browsing, searching, and presentation of that information". The group
carries on the work in text retrieval started in 1988.

Because of the initial work in text retrieval, the majority of the current projects involve improving
the transfer of better text retrieval technology into commercial systems. Two approaches are
being followed. The first approach (started in 1988) is to build a large-scale prototype retrieval
system (the PRISE system) capable of handling over three gigabytes of data. This system uses
natural language input and state-of-the-art statistical ranking mechanisms. The prototype has
become the focus for continued research by the group into more effective and efficient retrieval
systems using natural language queries. Additionally it serves as a vehicle for work in the NISO
Z39.50 Information Retrieval standard and for in-house usability testing.
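The exact ranking formula used by PRISE is not given in this note; as a rough illustration of the general idea behind statistical ranking with natural language input, here is a minimal tf-idf sketch (a generic textbook scheme, not PRISE's actual mechanism):

```python
import math
from collections import Counter

def rank(query, docs):
    """Rank documents against a natural-language query using a simple
    tf-idf score (a generic sketch, not PRISE's actual formula)."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # document frequency: how many documents contain each term
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    scores = []
    for i, toks in enumerate(tokenized):
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term in tf:
                # rare terms get higher weight via inverse document frequency
                idf = math.log(n / df[term])
                score += tf[term] * idf
        scores.append((score, i))
    scores.sort(reverse=True)
    return [i for score, i in scores]
```

The essential property, shared by systems of this family, is that plain-English queries need no Boolean operators: every document gets a score, and the user sees a ranked list rather than an unordered Boolean result set.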

The second approach (started in 1992) is to conduct a conference attracting international
participation from researchers in information retrieval. The Text REtrieval Conference (TREC) is
now starting its fifth year. Participating groups work with a large test collection built at NIST,
submit their results for a common evaluation, and meet for a three-day workshop to compare
techniques and results. The conference is starting to serve as a major technology-transfer
mechanism in the area of information retrieval. New methods of evaluating retrieval technology
are also being developed based on work for this conference.

Several other projects represent a broadening of the group's interests to include additional areas of
natural language processing and evaluation. Projects are being pursued in the area of improved
handling of large text files, including how to automatically create hypertext links and how to
facilitate machine-aided editing of massive text files. Also planned are projects in the area of
human computer interaction techniques for information access, including advanced interfaces, i.e.,
mixed mode and multimedia, for information retrieval.

Z39.50/PRISE

In 1988, NIST undertook a major study for another government agency examining text retrieval
across distributed systems (Harman et al., 1991). Part of this study required the building of a
large-scale prototype search engine capable of handling a gigabyte of text, and user testing this
engine with a sample of about 40 potential users. These users were either completely new to the
area of text searching or had used commercial Boolean retrieval systems in the past. The study
seemed an ideal opportunity to demonstrate that statistical ranking techniques for text retrieval
could be made to run in real time on large databases, and equally important, would be accepted by
end-users in place of Boolean retrieval.

This project needed research in two areas. The first area involved producing very efficient
implementations of indexing and searching algorithms, efficient both in retrieval and indexing
time, and in the amount of storage needed for indices. The second area of research involved the
design and implementation of interfaces for user testing the system against databases such as
manuals, legal code books with chapters and sections, and a gigabyte of data containing about
40,000 separate court cases. Here the goal was to examine how well the system served the needs
of various user groups. Results from both areas of research were very positive. During testing it
was shown that the new methods are indeed feasible for efficient indexing and searching in large
text collections, and that they are well-liked by users. Results of the experiments for this
prototype, including the user testing, are given in Harman & Candela 1989.

The initial search prototype was ported to SPARCstations in 1992, but the distributed version of
the code remained very fragile and was not given to other research groups. In 1994, based on
funding from the High Performance Computing and Communications Initiative, the group
extended the system to include a Z39.50-1994 UNIX client and server. The Z39.50 standard
defines a protocol within the application layer of the OSI reference model for the search and
retrieval of information. NIST's Z39.50/PRISE server was especially designed to isolate the
search engine from the details of the Z39.50 protocol and to minimize the effort needed to
interface the server to natural language search engines other than PRISE.
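The design goal of isolating the engine from the protocol can be sketched as a narrow adapter interface: the protocol front end depends only on a couple of methods, so a different search engine can be plugged in without touching protocol code. All names below are illustrative, not the actual Z39.50/PRISE API:

```python
class SearchEngine:
    """Minimal interface the protocol layer depends on (hypothetical)."""
    def search(self, query):
        """Return a ranked list of document ids."""
        raise NotImplementedError
    def fetch(self, doc_id):
        """Return the document text."""
        raise NotImplementedError

class ToyEngine(SearchEngine):
    """Trivial stand-in engine: rank by count of shared query terms."""
    def __init__(self, docs):
        self.docs = docs
    def search(self, query):
        terms = set(query.lower().split())
        hits = [(len(terms & set(d.lower().split())), i)
                for i, d in enumerate(self.docs)]
        return [i for score, i in sorted(hits, reverse=True) if score > 0]
    def fetch(self, doc_id):
        return self.docs[doc_id]

class ProtocolServer:
    """Stands in for the Z39.50 front end; knows nothing about ranking."""
    def __init__(self, engine):
        self.engine = engine
    def handle_search_request(self, query):
        # real code would parse/encode protocol messages here
        return self.engine.search(query)
```

Swapping `ToyEngine` for any other object with the same two methods leaves `ProtocolServer` untouched, which is the point of the isolation.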

The goals of this work include the following:

1) to increase NIST's understanding of the Z39.50 standard and to encourage its
development and use;

2) to allow easier use of this standard by publishing source code for a working Z39.50
client/server for groups already having an advanced search engine;

3) to provide an advanced search engine embedded in Z39.50 for use by groups mainly
working in related areas such as hypertext and interfaces;

4) to create a stable, user-friendly version of the PRISE system for in-house research
purposes.

In July of 1995, the code was formally released and is available via ftp.
Documentation is available at:

<http://potomac.ncsl.nist.gov/~over/zpl/main.html>.

Contact Paul Over (over@nist.gov) with copy to Willie Rogers (wrogers@nist.gov) for
information on obtaining Z39.50/PRISE 1.0 by ftp.

PRISE Search:

Supports 3 varieties of search against whole documents and fields:

- Keywords (stemmed, order-independent)
- Phrase (not stemmed, order matters, whitespace/SGML-tags ignored)
- Proximity (not stemmed, order-independent, whitespace/SGML-tags ignored)

PRISE interprets free-form natural language input as a series of keywords and builds a list of
documents ranked in order of their relevance to the query. All field-level queries and
document-level phrase and proximity queries are implemented by scanning these documents.
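The phrase and proximity matching done by scanning can be sketched as follows; this is only an illustration of the semantics listed above (tags and whitespace ignored, order mattering only for phrases), not PRISE's implementation:

```python
import re

def tokenize(text):
    # Strip SGML tags, then split on whitespace; phrase and proximity
    # matching ignore both, so comparison happens over bare word tokens.
    return re.sub(r"<[^>]*>", " ", text).lower().split()

def phrase_match(doc, phrase):
    """True if the words of `phrase` occur consecutively, in order."""
    toks, words = tokenize(doc), phrase.lower().split()
    n = len(words)
    return any(toks[i:i + n] == words for i in range(len(toks) - n + 1))

def proximity_match(doc, words, window):
    """True if all `words` occur within a span of `window` tokens,
    in any order (order-independent proximity)."""
    toks, wanted = tokenize(doc), set(w.lower() for w in words)
    for i in range(len(toks)):
        if set(toks[i:i + window]) >= wanted:
            return True
    return False
```

In the two-stage scheme described above, keyword ranking narrows the candidate set first, and checks like these then run only over the retrieved documents.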

PRISE Index:

- Takes SGML-tagged text as input
- Supports indexing of TIPSTER-sized document collections (about a million documents)

Dependencies:

- BSD 4.3 aka SunOS 4.1.x
- Tk 3.6 / Tcl 7.3 (available from NIST ftp site)
- sgmls parser (available from NIST ftp site)
- ANSI-C
- X11R3 or later

In addition to distributing a public domain "starter" system, the group has become more involved
in the Z39.50 standard. One result of this is the NIST publication this year of a collection of
papers by members of the Z39.50 Implementor's Group. These papers discuss the practical
problems of implementing the standard, and offer guidance based on the experience of veteran
implementors (Over et al. 1995).

The system itself is being used as a research vehicle at NIST. Four research projects are currently
underway. The first project is continued work in efficiency, both in indexing time, and in the
amount of storage needed for indices. Techniques were investigated in 1994 that reduced the
indexing time for large collections by a factor of 10 (from days to hours), without using any type
of index compression. A new index storage scheme was also devised that allowed term positional
information to be kept in a more compact form than traditional schemes (Rogers et al. 1995).
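The specific storage scheme is described in Rogers et al. (1995); as a generic illustration of how positional information can be kept compact, here is the standard delta-plus-variable-byte idea (an assumption-laden sketch, not the scheme actually devised):

```python
def varbyte_encode(numbers):
    """Variable-byte encode non-negative integers: 7 payload bits per
    byte, with the high bit set on the final byte of each number."""
    out = bytearray()
    for n in numbers:
        chunk = []
        while True:
            chunk.append(n & 0x7F)
            n >>= 7
            if n == 0:
                break
        chunk[0] |= 0x80              # mark the last (lowest-order) byte
        out.extend(reversed(chunk))
    return bytes(out)

def varbyte_decode(data):
    out, n = [], 0
    for b in data:
        if b & 0x80:
            out.append((n << 7) | (b & 0x7F))
            n = 0
        else:
            n = (n << 7) | b
    return out

def compress_positions(positions):
    """Store sorted term positions as gaps (deltas), then varbyte-encode;
    small gaps, the common case, take one byte each."""
    deltas = [positions[0]] + [b - a for a, b in zip(positions, positions[1:])]
    return varbyte_encode(deltas)

def decompress_positions(data):
    out = []
    for d in varbyte_decode(data):
        out.append(out[-1] + d if out else d)
    return out
```

Because consecutive positions of a term within a document tend to be close together, the gaps are small, so most entries fit in a single byte instead of a full machine word.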

The second research project involves building a relevance feedback operation into the existing
prototype, and experimenting to determine the best feedback algorithms for Z39.50/PRISE. In
1996 user experiments will be conducted in cooperation with Rutgers University. Additional
feedback experiments involving different types of data and different users are anticipated in the
future.
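Which feedback algorithm suits Z39.50/PRISE best is exactly what those experiments are meant to determine; as one standard candidate, the classic Rocchio query modification looks like this (an illustration only, not the group's chosen method):

```python
from collections import Counter

def rocchio(query_vec, relevant, nonrelevant,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio query modification over term-weight dicts:
        q' = alpha*q + beta*mean(relevant) - gamma*mean(nonrelevant)
    The alpha/beta/gamma values here are common defaults, not tuned."""
    new_q = Counter()
    for term, w in query_vec.items():
        new_q[term] += alpha * w
    for vecs, weight in ((relevant, beta), (nonrelevant, -gamma)):
        if not vecs:
            continue
        for vec in vecs:
            for term, w in vec.items():
                new_q[term] += weight * w / len(vecs)
    # terms driven to non-positive weight are usually dropped
    return {t: w for t, w in new_q.items() if w > 0}
```

The effect is to pull the query toward the documents the user marked relevant and (more gently) away from those marked nonrelevant, which is why user experiments are needed to tune the weights for a particular collection and population.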

The third project with the prototype involves experiments in the automatic selection of indexing
features and term weighting for different types of text (and users), and extensions of the system
(particularly the interface) to handle different types of search needs and access styles. This is the
beginning of a long-term project to insert Z39.50/PRISE into many NIST information access
systems as a testbed for various retrieval and interface issues.

The fourth research project examines the linking of this text searching prototype to an in-house
OCR package and a prototype speech recognition system. This project, which is scheduled to
start in 1996, has the goal of a detailed understanding of the problems in this linkage, including
development of multiple test corpora and evaluation techniques.

The Text REtrieval Conferences (TRECs)

In November of 1992, the first Text REtrieval Conference (TREC-1) was held at NIST. The
conference, co-sponsored by ARPA and NIST, brought together information retrieval researchers
to discuss their system results on a new large test collection (the TIPSTER collection). This
conference became the first in a series of ongoing conferences dedicated to encouraging research
in retrieval from large-scale test collections, and to encouraging increased interaction between
research groups in industry and academia. From the beginning there has been an almost equal
number of universities and companies participating, with an emphasis on exploring many different
types of approaches to the text retrieval problem.

The conference has grown from 28 participating groups in 1992, to 36 groups in 1995. These
groups work with the test collection built at NIST, use the same evaluation procedures, and meet
for a three-day workshop to compare techniques and results.

The test collection used in the first four TRECs contains over 1 million documents, 250 test
"questions" or topics, and the "right answers" or relevance judgments for these topics. The
documents cover many different writing styles and different information domains. They include
information from the Wall Street Journal, the San Jose Mercury News, the AP Newswire, and
articles from the Computer Select disks. The topics and relevance judgments were created by real
users, with the topic format ranging from long structured topics in TREC-1 and 2 to very short
1-sentence topics in TREC-4. The size of the collection and the types of topics used therefore
closely reflect the real-world environment faced by most retrieval systems.
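Given ranked results and relevance judgments, the common evaluation rests on measures like non-interpolated average precision, the usual per-topic TREC measure (shown here as a generic sketch of the computation):

```python
def average_precision(ranked_ids, relevant_ids):
    """Non-interpolated average precision for one topic: the mean of
    precision at each rank where a relevant document is retrieved,
    averaged over all relevant documents. Averaging this over topics
    gives mean average precision."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)
```

A relevant document never retrieved contributes zero, so the measure rewards both finding the right answers and ranking them early.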

Research done by the various groups in TREC has been highly varied, but has followed a general
pattern. TREC-1 required significant system rebuilding by most participating groups due to the
huge increase in the size of the document collection (from a traditional test collection of several
megabytes in size to the 2 gigabyte TIPSTER collection). The second TREC conference (TREC-2) occurred
in August of 1993, less than 10 months after the first conference. Many of the original TREC-1
groups were able to "complete" their system rebuilding and tuning, and in general the TREC-2
results show significant improvements over the TREC-1 results. In some senses, however, the
TREC-2 results should best be viewed as an excellent baseline for more complex experimentation.

The TREC-3 results reflect some of that more complex experimentation. For some groups that
meant more extensive experiments using their basic system techniques. For example, many
groups tried different automatic methods of adding additional terms to the queries, such as
automatic thesaurus generation or relevance feedback techniques applied to the top set of
retrieved documents. Other groups tried various types of manual query expansion. For the
routing task, where training data was available and the topics remain constant (with new data to
"route"), some groups used massive relevance feedback, and others used machine learning
techniques. Many experiments involved trying techniques "borrowed" from other groups in
TREC-2 to explore hybrid approaches.

The latest TREC conference (TREC-4) was held in November 1995. Many of the participating
groups tried very different approaches to the task, such as major new term weighting functions or
unusual approaches to manual query construction using sophisticated tools. In addition to the
"mainline" tasks done in earlier TRECs, five focussed "tracks" were organized, including merging
results across heterogeneous text collections, examining new evaluation techniques for interactive
retrieval and working in non-English retrieval environments. The sharp increase in experiments
resulting from these tracks not only added excitement to the conference itself but will lead to
further participation in these focussed tracks by other groups in TREC-5.

The proceedings from the TREC-3 conference are available on the group homepage:

<http://potomac.ncsl.nist.gov/trec>

and are also published as a NIST Special Publication (NIST SP500-225) from NTIS. Earlier
proceedings from TREC-1 and TREC-2 are also available from NTIS. The TREC-4 proceedings
will be available in the early spring, both in printed form and on the group homepage.

Available from NTIS (phone 703-487-4650)

- The First Text REtrieval Conference
  PB93-191-641, $61.00
- The Second Text REtrieval Conference
  PB94-178407, $52.00
- Overview of the Third Text REtrieval Conference
  PB95-216883/AS, $61.00

In addition to the proceedings, the TREC conferences have resulted in a large test collection. The
collection consists of three CD-ROM disks containing the documents, plus the topics and the
relevance judgments. The document disks are available from the Linguistic Data Consortium
(ldc@unagi.cis.upenn.edu), and the topics and other information are available from NIST via ftp.

There will be a fifth TREC conference in 1996, and groups wishing to join should respond to the
call for participation that will be going out in December of this year (in IR-Digest and other
places). To be put on the email list for this and other TREC announcements, send email to Donna
Harman (harman@potomac.ncsl.nist.gov).

Other Projects

The group is working on several other projects in cooperation with the Social Security
Administration (SSA). The first project is the development of a system (EAMATE) which
automates the retrieval of employee wage records for SSA. This system, built by NIST in
conjunction with SSA as a prototype to quickly search for names which are a "best-match" to a
user query, received the "Federal Applications Medal of Excellence" for saving the government
money in March 1994. The EAMATE system attacks the automation of a manual search task
from two standpoints: 1) the development of a new indexing and searching methodology which
has the power and flexibility to overcome inconsistencies in the data, and 2) the development of a
highly functional user interface which allows maximum exploitation of the system internals while
offering an intuitive environment. Results from the initial prototype indicate that productivity
may increase by as much as a factor of 8 and that the system is easy to use as evidenced by an
average training time of 1 hour (Willman 1994).
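The actual EAMATE indexing and matching methodology is not detailed in this note; purely as an illustration of what "best-match" name search means, here is a sketch based on edit distance, which tolerates the kind of spelling inconsistencies such data contains:

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def best_match(query_name, names, k=3):
    """Return the k names closest to the query by edit distance
    (an illustrative stand-in for EAMATE's actual methodology)."""
    return sorted(names,
                  key=lambda n: edit_distance(query_name.lower(),
                                              n.lower()))[:k]
```

Unlike an exact lookup, a best-match query still surfaces "Smyth" or "Smithe" for a query on "Smith", which is the property that lets such a system replace a manual search through inconsistent records.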

Work on this successful re-engineering effort will continue with SSA. As part of its Earnings
Modernization project, the prototype EAMATE system is being installed at SSA for use with one
year's worth of data (approximately 50 gigabytes). The expanded prototype will run for six
months and will allow NIST to conduct major usability studies and performance evaluations.
More than 70 measurements are built into the client portion of the system to provide data for user
interaction analysis and system performance from a usability perspective. User satisfaction data
will also be collected and analyzed. Additionally, key search result data is also being maintained
to provide data for an in-depth analysis of the search engine from an information retrieval
perspective.

Another project related to the EAMATE system is the design and development of the DataACE
system, a tool for analyzing and correcting multigigabyte legacy data and integrating it into the
modernization project. The development of DataACE in 1996 will explore issues encountered
when attempting to utilize extremely large data sets in different ways than was originally intended.

The final project examines the automatic linking of various types and sources of text. One goal of
this project is to provide a method to automatically link parts of the SSA procedural manual;
another goal is to identify programmatic help screens which require changes as a result of updates
and/or changes to the manual. The initial work to generate automatic links within the manual
itself has shown that these methods are viable, and further research will be done to improve the
link generation process. An implementation of the hypertext version of the manual (using these
automatic linking techniques) will be put online for use by SSA to investigate issues concerning
the manual. This tool will include integration of a version of the PRISE system for direct
searching.
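The link-generation method itself is not described here; one common vector-space approach, offered only as an assumed illustration, proposes a link wherever two sections share enough vocabulary:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency dicts."""
    common = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def auto_links(sections, threshold=0.3):
    """Propose hypertext links between sections whose word overlap
    (cosine over term frequencies) exceeds a threshold. A generic
    vector-space sketch, not the group's actual method; the threshold
    is an arbitrary illustrative value."""
    vecs = [Counter(s.lower().split()) for s in sections]
    links = []
    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            if cosine(vecs[i], vecs[j]) >= threshold:
                links.append((i, j))
    return links
```

The same similarity machinery could also flag which help screens drift out of step with a revised manual section, by comparing each screen against the section it documents.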

References

Harman D. and Candela G. (1989). "Retrieving Records from a Gigabyte of Text on a
Minicomputer Using Statistical Ranking." Journal of the American Society for Information
Science, 41(8), 581-589.

Harman D., McCoy W., Toense R. and Candela G. (1991). "Prototyping a Distributed
Information Retrieval System that uses Statistical Ranking." Information Processing and
Management, 27(5), 449-460.

Over P., Denenberg R., Moen W., and Stovel L. (Eds). (1995). Z39.50 Implementor's
Experiences. National Institute of Standards and Technology Special Publication 500-229,
Gaithersburg, MD. 20899.

Rogers W, Candela G. and Harman D. (1995). "Space and Time Improvements for Indexing in
Information Retrieval." Proceedings of the Fourth Annual Symposium on Document Analysis
and Information Retrieval, Las Vegas.

Willman (1994). "A Prototype Information Retrieval System to Perform a Best-Match Search for
Names." Proceedings of RIAO '94, New York, New York.

