COLING 82, J. Horecký (ed.)
North-Holland Publishing Company
© Academia, 1982

RECOGNITION OF ABSTRACT OBJECTS
A DECISION THEORY APPROACH WITHIN NATURAL LANGUAGE PROCESSING


Gerhard Knorz
Fachbereich Informatik, FG Datenverwaltungssysteme II
Technische Hochschule Darmstadt
Karolinenplatz 5
D-6100 Darmstadt
W-Germany

The DAISY/ALIBABA system developed within the WAI project represents both a specific solution to the automatic indexing problem and a general framework for problems in the field of natural language processing that are characterized by fuzziness and uncertainty. The WAI approach to the indexing problem has already been published [3], [5]. This paper, however, presents the underlying paradigm of recognizing abstract objects. The basic concepts are described, including the decision theory approach used for recognition.

1 THE "WAI" AND THE "AIR" PROJECT ¹


The DAISY/ALIBABA system [1], [2], [3], developed at the Technical University Darmstadt, analyses abstracts and describes them according to the coordinate indexing philosophy, using a prescribed set of descriptors. To perform this task, a domain-dependent dictionary is needed. Judging the lack of suitably sized dictionaries to be one of the main problems for research and development in automatic indexing [4], the WAI project started in 1978 with dictionary construction. The two completed dictionaries are
FST, covering the scope of food science and technology ³, and
PHYS, covering the scope of physics, a part of INIS (International Nuclear Information System) ⁴.
Different procedures for generating dictionary data were developed and applied. Classifying them and unifying the data they produce is one of the main tasks of dictionary construction (described in detail in [3], [4]). This cannot be done without examining their influence on the quality of the resulting indexing. To perform indexing tests, the development of DAISY and ALIBABA was another important objective of WAI.
Indexing results are reported in [4], [5], [6]; they are based on consistency tests only, using manual indexing as the standard. To confirm or to modify these results, the AIR project is now preparing a retrieval test on the physics data base INKA-PHYS of the Fachinformationszentrum FIZ 4 (Energie, Physik, Mathematik; Karlsruhe) (order of magnitude: 10,000 documents, 200 search requests). The indexing will be based upon the new dictionary PHYS-2, which is to be constructed using about 80,000 documents of the INKA-PHYS data base.

2 THE BASIC PRINCIPLES UNDERLYING THE "WAI"/"AIR" APPROACH


The WAI/AIR approach represents both a specific solution of the indexing problem and a general framework for a wide class of problems within natural language processing and other fields.

This paper will only give references to details of the particular solution, which are published elsewhere. The objective of this work is to present the general framework derivable from the basic principles underlying the WAI and the AIR project:
(1) Knowledge bases are very important for problem solving. But presupposing knowledge for an automatic system must not call its applicability into question because procedures for constructing knowledge bases of the indispensable size do not exist. A realistic, appropriate solution is the main aim, rather than a perfect one.
(2) Controlling the quality and the expenditure of effort of a system must not wait until it is put into practice. System development has to be guided by a control derivable from the task to be performed.
(3) The algorithms that form the basis of the procedure should not be assumed to be perfect. Applied to complex tasks, it is a fundamental fact that they are based on simplified models.
The principles can be considered a guideline for designing application-oriented systems. With good reason it is claimed that the quality of such a system can be determined only by evaluation in application environments (see for example [7], [8]). This cannot be done without empirical studies of the user-system interaction.
The paradigm of recognizing abstract objects presented here is an approach to integrating the evaluation aspect into system development. It is also an approach to problems for which no perfect solutions exist or seem to be applicable.

3 RECOGNITION OF ABSTRACT OBJECTS


3.1 THE DEFINITION OF THE RECOGNITION TASK
The basic idea is to use the application environment itself to get an implicit description of the problem. Whenever talking about a particular application environment, there is no other way than to take a conceptual model $M_E$ as a basis, which determines the adequate concepts (see [9], or see also [10] ⁵).
Here, a conceptual model has to be formulated in such a way that it defines (abstract) objects $(\tilde{x}, k)$, $\tilde{x} \in X$, $k \in K$. $\tilde{x}$ denotes those aspects of an object which can be observed directly with regard to the problem; $K$ denotes a set of object classes. A model $m_E$ of the application environment gives an implicit definition of the (recognition) problem by forming a continuous stream of abstract objects.
To develop a recognition system (RS) is nothing more than finding a suitable mapping $e: \tilde{x} \mapsto e(\tilde{x})$ that recognizes an actual $\tilde{x}$ to be $(\tilde{x}, e(\tilde{x}))$.
If the RS-$m_E$ interface is identical to the system-user interface, then $m_E$ may refer to the user's judgement directly to define the co-occurrence of $\tilde{x}$ and $k$. This is also adequate whenever human cognitive capabilities are to be simulated.


We give some examples:
- Information retrieval can be based upon recognition of document-query relationships (described in [6]). $\tilde{x}$ can be represented by $(d, f)$, where $d$ denotes the document and $f$ denotes the query; in the simplest case $k$ may be a member of the set {is relevant, is not relevant}, referring to the user's judgement.
- Expressions that are possibly within the scope of a quantifier, as well as hypotheses for inferences, can both be regarded as abstract objects. Determining the scope of a quantifier or drawing inferences can be based on the recognition of those objects by simulating human decisions.
Two other examples are given that avoid the simulation approach:
- Complex tasks often require the testing of many hypotheses, which can be regarded as abstract objects. $m_E$ may refer to the final results of the processing.
- In [6] a decision theory approach to optimal retrieval forms a basis for $m_E$, defining the task of indexing as recognition of document-descriptor relationships.
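The examples above share one shape: an observable part, a class drawn from a fixed set K, and an environment that supplies a stream of such pairs. A minimal Python sketch of that shape for the document-query example follows. The names `AbstractObject`, `environment_stream`, `recognize` and `fault_rate`, the toy word-overlap rule, and the sample judgements are illustrative assumptions and are not taken from the DAISY/ALIBABA system.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Tuple

# K: the set of object classes of the retrieval example in the text.
K = ("is relevant", "is not relevant")

Observable = Tuple[str, str]  # (d, f): document text and query text


@dataclass(frozen=True)
class AbstractObject:
    observable: Observable  # the directly observable aspects of the object
    k: str                  # the class assigned by the environment m_E


def environment_stream(
    judged: Iterable[Tuple[str, str, bool]]
) -> Iterator[AbstractObject]:
    """Hypothetical stand-in for m_E: turns user judgements into abstract objects."""
    for d, f, relevant in judged:
        yield AbstractObject((d, f), K[0] if relevant else K[1])


def recognize(observable: Observable) -> str:
    """Toy mapping e (an assumption): relevant if document and query share a word."""
    d, f = observable
    shared = set(d.lower().split()) & set(f.lower().split())
    return K[0] if shared else K[1]


def fault_rate(e: Callable[[Observable], str],
               stream: Iterable[AbstractObject]) -> float:
    """Fraction of objects for which e disagrees with the class given by m_E."""
    objects = list(stream)
    faults = sum(1 for o in objects if e(o.observable) != o.k)
    return faults / len(objects)


# Invented judgements, used only to exercise the sketch.
judged = [
    ("dictionary construction for automatic indexing", "automatic indexing", True),
    ("polynomial classifiers for character recognition", "food science", False),
    ("retrieval test on a physics data base", "physics of food", False),
]
rate = fault_rate(recognize, environment_stream(judged))
```

Comparing the mapping with the environment's own classes is the only evaluation used here, which mirrors the idea that the environment defines the problem implicitly; the word-overlap rule itself is deliberately crude and misjudges the third sample.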

3.2 STRUCTURE OF THE RECOGNITION SYSTEM


The structure of the recognition system as presented here makes evident that the recognition problem arises essentially at the interface of two models:
- The (external) conceptual model $M_E$ defining the recognition problem.
- The (internal) conceptual model $M_I$ used to describe the object with respect to the recognition task.
$M_I$ is part of the recognition system (Figure 1). It structures the object using the knowledge base, so that all available aspects that may influence the decision of the RS are included. In many cases it also initiates the recognition process, i.e. it constructs the hypothesis represented by the object.
According to $M_I$, a formal description $x$ of $\tilde{x}$ is produced. We do not consider here the nature of $M_I$, which can be a sophisticated one with a strong theoretical foundation as well as a rather simple and heuristic one. Different models $M_I$ might lead to quite different recognition systems for the same task. The main point is that $M_I$ leads to an object description instead of a decision. Another point is that both models $M_E$ and $M_I$ are essentially independent. This fact causes every system $RS^{M_I}$, provided it is a deterministic one, to make incorrect decisions in some cases.

Figure 1: The recognition system and its environment (objects $\tilde{x}$; describing; $x$; deciding; set of object classes $K$)

That means that an 'optimal recognition system' cannot be defined without taking into consideration the number of cases causing faults or, more precisely, the statistical properties of the application environment represented by $m_E$. The decision theory approach appropriate to the given situation is described in [5] and [6] with respect to the indexing problem. The approach requires that every single decision of RS is classified. This task is for the most part anticipated by $M_E$, which defines the set of object classes $K$. $K$ determines the scope of possible faults. Those can be weighted independently by a loss function $c: (e(x), k) \mapsto w$. With the model $m_E$ given, a particular recognition system will cause an expected value $E(w)$. The optimal system $RS^{M_I}_{\mathrm{opt}}$ is the result of searching for the $RS^{M_I}$ which minimizes $E(w)$. It can be shown that the optimal decision $RS^{M_I}_{\mathrm{opt}}(x)$ can be based on the conditional probabilities $p(k|x)$. The mappings $e_k(x) = p(k|x)$ can be approximated by polynomial functions, which are constructed automatically using a sample of objects $(\tilde{x}, k)$. This way has been chosen by the ALIBABA system, which uses polynomial classifiers adapted in the mean square sense [11]. The indexing results in [5] and [6] demonstrate that, applied to the indexing problem, the recognition approach and in particular the method of approximation are adequate for the problem.
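The chain described above (estimate $p(k|x)$ by polynomials fitted in the mean square sense, then choose the decision with minimal expected loss) can be sketched in a few lines of Python. This is an illustration under assumptions, not the ALIBABA implementation: the function names, the one-dimensional description x, the sample, and both loss tables are invented for the example, and the normal equations are solved by a small Gauss-Jordan routine that assumes a non-singular system.

```python
from itertools import combinations_with_replacement
from typing import Dict, List, Sequence, Tuple

Sample = List[Tuple[Sequence[float], str]]


def poly_features(x: Sequence[float], degree: int) -> List[float]:
    """All monomials of the components of x up to `degree`, constant term first."""
    feats = [1.0]
    for d in range(1, degree + 1):
        for idx in combinations_with_replacement(range(len(x)), d):
            term = 1.0
            for i in idx:
                term *= x[i]
            feats.append(term)
    return feats


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit(sample: Sample, classes: Sequence[str], degree: int) -> Dict[str, List[float]]:
    """One least-squares polynomial per class; target is 1 for the true class, else 0."""
    phi = [poly_features(x, degree) for x, _ in sample]
    n = len(phi[0])
    gram = [[sum(p[i] * p[j] for p in phi) for j in range(n)] for i in range(n)]
    coeffs = {}
    for k in classes:
        rhs = [sum(p[i] * (1.0 if label == k else 0.0)
                   for p, (_, label) in zip(phi, sample)) for i in range(n)]
        coeffs[k] = solve(gram, rhs)
    return coeffs


def estimate(coeffs: Dict[str, List[float]], x: Sequence[float],
             degree: int) -> Dict[str, float]:
    """Polynomial estimates of p(k|x) for every class k."""
    feats = poly_features(x, degree)
    return {k: sum(c * f for c, f in zip(cs, feats)) for k, cs in coeffs.items()}


def decide(p: Dict[str, float], loss: Dict[Tuple[str, str], float]) -> str:
    """Decision with minimal expected loss; loss is keyed by (decision, true class)."""
    return min(p, key=lambda dec: sum(loss[(dec, k)] * p[k] for k in p))


# Invented sample: description x = 0 is mostly "no", x = 1 is mostly "yes".
sample: Sample = ([((0.0,), "no")] * 3 + [((0.0,), "yes")]
                  + [((1.0,), "yes")] * 3 + [((1.0,), "no")])
coeffs = fit(sample, ["yes", "no"], degree=1)
p0 = estimate(coeffs, (0.0,), degree=1)

zero_one = {("yes", "yes"): 0, ("no", "no"): 0, ("yes", "no"): 1, ("no", "yes"): 1}
miss_costly = {("yes", "yes"): 0, ("no", "no"): 0, ("yes", "no"): 1, ("no", "yes"): 4}
d_plain = decide(p0, zero_one)
d_weighted = decide(p0, miss_costly)
```

With the 0/1 loss the decision at x = 0 follows the larger estimate. Once overlooking a "yes" is weighted four times as heavily as a false "yes", the same estimates lead to the opposite decision, which is the sense in which the loss function, and not the classifier alone, fixes what counts as optimal.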

4 DISCUSSION
The approach of recognizing abstract objects is evaluated using the paradigm of automatic indexing. The model $M_E$ refers, for practical reasons, not to the retrieval process but to the decisions of human indexers. If a consistency factor (comparing manual and automatic indexing) measures the quality of automatic indexing, the set $K$ requires two elements only. If a more sophisticated evaluation is intended, the set $K$ can be enlarged according to the kinds of faults that should be considered. The classification of faults can, for example, depend on the descriptor under consideration.
For the model $M_I$ used, see for example [5] and [12].
We summarize the essentials of the suggested approach (the first point refers in particular to the indexing paradigm).
- The recognition problem causes one to regard two independent models: one with respect to retrieval and one with respect to the analysis of abstracts. This point of view is important for an approach to optimal indexing [6], but it is not self-evident. In [14] the retrieval-oriented approach of Robertson and the indexing-oriented approach of Harter [13] are brought together. The result is a one-model approach, like other approaches in this field (for example [15]).
- The internal model $M_I$ is restricted to providing the basis for the decision to be made. This fact makes it very easy to additionally include a lot of knowledge and heuristic procedures that might play a role only for decision making. There is no risk of causing faults by determining how to compute the decision using this knowledge. Artificial intelligence approaches use a corresponding model $M_I$ to determine the decision [16].

- The need for a model $m_E$ implies an educational aspect with respect to evaluation. $m_E$ ensures that the gap between the optimal system $RS^{M_I}_{\mathrm{opt}}$ and the ideal system (equivalent to $m_E$) is under control.

FOOTNOTES
1 WAI means Wörterbuchentwicklung für automatisches Indexing (dictionary construction for automatic indexing) [3]. The research was supported by the BMFT contract PT 131.05 to Technische Hochschule Darmstadt (March 1, 1978 - December 12, 1981).
AIR means Weiterentwicklung der automatischen Indexierung und des Information Retrieval (further development of automatic indexing and information retrieval). Supported by the BMFT contract PT 131.10 to Technische Hochschule Darmstadt (March 1, 1981 - December 31, 1983).
2 The order of magnitude of the two dictionaries may be characterized as follows: about 13,000 single words, 20,000 phrases and 100,000 term-descriptor relations each.


3 The two volumes 3 and 4 of the abstract journal Food Science and Technology Abstracts (FSTA 71/72), containing about 33,000 documents, were used as a basis for dictionary construction.
4 The scope of Physics (INIS) is represented by about 40,000 documents.
5 In [10] the term 'paradigm' is used instead of 'conceptual model', which is taken here from [9].
REFERENCES
[1] Putze-Meier, G., DAISY - Darmstädter Indexierungssystem, to appear as a report, Technische Hochschule Darmstadt, Fachbereich Informatik, DVS II (1982).
[2] Knorz, G., Softwaresystem ALIBABA, Adaptives lernstichprobenorientiertes Indexierungssystem, basierend auf Beschreibungen abstrakter Objekte, Bericht DV II 82-1, Technische Hochschule Darmstadt, FB Informatik, FG DVS II (1982).
[3] Lustig, G., Das Projekt WAI: Wörterbuchentwicklung für automatisches Indexing, to appear in the proceedings of the Deutscher Dokumentartag 1981 (Saur KG, München, 1982).
[4] Lustig, G., Über die Entwicklung eines automatischen Indexierungssystems, in: Krallmann, D. (ed.), Dialogsysteme und Textverarbeitung (LDV-Fittings, Essen, 1980).
[5] Knorz, G., Automatic Indexing as an Application of Pattern Recognition Methods to Document-Descriptor Relationship, applied informatics 1 (1982) 1-10.
[6] Knorz, G., A Decision Theory Approach to Optimal Automatic Indexing, to appear in the proceedings of the GI/ACM/BCS Conference (Berlin, May 1982).
[7] Krause, J., Lehmann, H., User Speciality Languages. A natural language based information system and its evaluation, in: Krallmann, D. (ed.), Dialogsysteme und Textverarbeitung (LDV-Fittings, Essen, 1980).
[8] Ackermann, Ammon, Ebert, Krause, Krug, Marschke, Sauerer, Zimmermann (eds.), Cobis. Computergestütztes Büro-Informationssystem als Pilotanwendung von CONDOR, BMFT-report (Karlsruhe, 1982).
[9] Schmitt, B., Computer Science and the General Theory of Models - An Introduction, applied informatics 1 (1982) 35-42.
[10] Kuhn, T.S., The Structure of Scientific Revolutions (Chicago, 1970).
[11] Schürmann, J., Polynomklassifikatoren für die Zeichenerkennung - Ansatz, Adaption, Anwendung (Oldenbourg Verlag, München, 1977).
[12] Knorz, G., Mustererkennung im Bereich der inhaltlichen Erschließung von Texten, in: Radig, B. (ed.), Modelle und Strukturen (Springer Verlag, Berlin Heidelberg New York, 1981).
[13] Harter, S.P., A probabilistic approach to automatic keyword indexing. Part I: On the distribution of speciality words in a technical literature, Journal of the ASIS 26 (1975) 197-206; Part II: An algorithm for probabilistic indexing, Journal of the ASIS 26 (1975) 280-289.
[14] Robertson, S.E., van Rijsbergen, C.J., Porter, M.F., Probabilistic models of indexing and searching, in: Oddy, R.N., Robertson, S.E., van Rijsbergen, C.J., Williams, P.W. (eds.), Information Retrieval Research (Butterworth, London, 1981).
[15] Cooper, W.S., Maron, M.E., Foundations of Probabilistic and Utility-Theoretic Indexing, JACM 25/1 (1978) 67-80.
[16] Wahlster, W., Implementing Fuzziness in Dialogue Systems, in: Rieger, B. (ed.), Empirical Semantics (Brockmeyer, Bochum, 1981).
