2. The general linguistic approach
The approach to automatic indexing taken here may be described broadly as linguistic: it has a strong morpho-syntactic component based on large dictionaries and a small semantic component (in contrast, e.g., to mathematical-statistical approaches). The underlying linguistic theory is dependency grammar. Common to all applications is the construction of a dependency structure for every sentence parsed. The results may then be processed further for translation, tagging, or indexing. At present only a German parser and an English generator are implemented. The German-English translation may be tested on the internet (see SUSY system).
The approach rests on the assumption that disambiguation of natural language expressions (texts, sentences, phrases, words) is necessary for MT and multilingual indexing. Whereas disambiguation has been a focus of research in MT, its usefulness for indexing has been questioned. The various facets of the ambiguity phenomenon are discussed in different scientific communities, as summarized e.g. for the CORPORA mailing list (see Kilgarriff 1995) and discussed in detail in Krovetz/Croft 1992: disambiguation is especially important for MT and, to a lesser extent, for IR. For multilingual automatic indexing it is assumed to be about as important as for MT, and this is the focus of the present paper.
For a start, I shall work with the following types of ambiguity:
homographs
  'distributed': finite verb, participle
  'während': preposition ('during'), conjunction ('while')
polysemes
  'während'/'while': temporal, contrastive
  'Anzeige': Strafanzeige ('criminal charge'), Anzeige eines Geräts ('display')
  'einschalten': switch on (device), call in (a moderator)
homonyms
  'ear': Ohr (from 'auris'), Ähre (from 'acus')
According to Heger (1963, 484), polysemy is present when ‚one and the same word body has two or more different meanings, but only one syntactic function‘. Homonymy means that the word in question derives from different sources. Syntactic homography is defined as the case in which a word has two or more different syntactic functions (Welte 1974, 506). In the following, I shall give a short account of homography; the larger part of this chapter deals with polysemy.
Homographs are mostly resolved morpho-syntactically. In NLP, it is one of the tasks of structural analysis to determine the correct syntactic function of a word in a given context:
a) Das Terminal quittierte den Auftrag. ('quittierte' = finite verb)
b) Hier ist der quittierte Auftrag. ('quittierte' = past participle)
c) Während des Betriebs nicht rauchen. ('während' = preposition)
d) Nicht rauchen, während das Gerät läuft. ('während' = conjunction)
A special homograph problem concerns compound analysis: the compound ‚Lesebefehl‘ (‚command to read‘) is split into ‚Befehl‘ (= ‚command‘) and the pseudo-homograph ‚Lese‘ (from ‚LESEN‘ = ‚read‘ or ‚LESE‘ = ‚vintage‘), where the latter reading is, of course, not correct.
In MT, polysemy shows up in parsing and in transfer. To be sure, some say that polysemy plays a role only in transfer, i.e. when translating words. But there are arguments for maintaining that certain kinds of polysemy are best resolved during the parsing process. This will be shown a little later.
First, I want to admit that there are many cases where polysemy is better treated in transfer (see Luckhardt 1987). These are the cases where a language community has, for socio-cultural or etymological reasons, developed several words that are equivalent to just one word in another language. It may be questioned whether it can be the task of a parser of German to distinguish between three readings of ‚Uhr‘ (Armbanduhr, Standuhr, Turmuhr) only because French has three different words for it: ‚montre‘, ‚pendule‘, ‚horloge‘. If this disambiguation is possible at all, it ought to take place in bilingual transfer. For English, the distinction would have to be between ‚watch‘ and ‚clock‘. For automatic indexing, however, it may be useful to have this differentiation made by the German parser, so that the descriptors ‚Armbanduhr‘, ‚Standuhr‘, ‚Turmuhr‘ can be assigned if necessary.
In what follows I shall give some examples of how polysemes were disambiguated in the SUSY system during parsing, i.e. by morpho-syntactic means.
First of all, there is the use of morphological criteria. The word „Betrieb“ has two meanings:
Betrieb = Firma (company)
Betrieb = (das) Betreiben (von)... (the operating of...)
If this word occurs in the plural, the meaning must be ‚companies‘, as the other reading cannot be used in the plural. So a simple lexical rule can lead to disambiguation.
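A lexical rule of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lexicon layout, the `numbers` field, and the function name `readings_for` are hypothetical and are not taken from the SUSY system.

```python
# Hypothetical mini-lexicon (not the SUSY format): each reading records
# the grammatical numbers in which it may occur.
READINGS = {
    "Betrieb": [
        {"gloss": "company", "numbers": {"sg", "pl"}},
        {"gloss": "operating", "numbers": {"sg"}},  # no plural use
    ],
}


def readings_for(lemma, number):
    """Return the glosses of all readings compatible with the observed number."""
    return [
        reading["gloss"]
        for reading in READINGS.get(lemma, [])
        if number in reading["numbers"]
    ]


# The plural form leaves a single reading; the singular stays ambiguous.
print(readings_for("Betrieb", "pl"))
print(readings_for("Betrieb", "sg"))
```

The rule only removes readings that are morphologically impossible; where several readings survive, as in the singular, other criteria have to decide.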
A syntactic criterion may be the occurrence of specific prepositional or direct objects or attributes (cf. Luckhardt 1987):
Der Monteur hängt das Gerät ab. (= take down)
Die Firma hängt vom Monteur ab. (= depend on)
Wir nehmen an, daß ein Irrtum vorliegt. (= assume)
Wir nehmen uns der Sache an. (= take care of)
Wir nehmen das Paket an. (= accept)
Die Sache nimmt Gestalt an. (= take shape)
Wir sehen dies als einen Fehler an. (= regard)
Wir sehen uns das Gerät an. (= inspect)
Wir sehen dem Gerät an, woher es kommt. (= We can tell by its look ...)
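How such complement patterns could select a reading may be sketched as follows. The frame inventory, the complement labels (`acc_obj`, `prep_obj:von`, etc.) and the function `select_reading` are assumptions made for this illustration and do not reproduce the SUSY lexicon. The sketch assumes that the parser has already identified the complements of the verb, and it picks the most specific frame whose requirements are all met.

```python
# Hypothetical valency frames: (required complements, English reading).
FRAMES = {
    "abhängen": [
        ({"acc_obj"}, "take down"),
        ({"prep_obj:von"}, "depend on"),
    ],
    "annehmen": [
        ({"dass_clause"}, "assume"),
        ({"reflexive", "gen_obj"}, "take care of"),
        ({"acc_obj"}, "accept"),
    ],
    "ansehen": [
        ({"acc_obj", "als_phrase"}, "regard"),
        ({"reflexive", "acc_obj"}, "inspect"),
        ({"dat_obj", "clause"}, "tell by its look"),
    ],
}


def select_reading(verb, complements):
    """Pick the reading whose frame is fully present among the complements.

    If several frames match, the one with the most required complements
    wins; returns None if no frame matches.
    """
    matches = [
        (required, gloss)
        for required, gloss in FRAMES.get(verb, [])
        if required <= complements  # subset test: all requirements observed
    ]
    if not matches:
        return None
    return max(matches, key=lambda match: len(match[0]))[1]


print(select_reading("abhängen", {"subject", "prep_obj:von"}))
print(select_reading("ansehen", {"subject", "reflexive", "acc_obj"}))
```

Fixed expressions such as ‚Gestalt annehmen‘ are not covered by this sketch; they would need an entry keyed to the specific object noun rather than to a complement type.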
Nouns may be disambiguated by the existence of attributes or appositions:
the term 'deposits' => der Begriff "Einlagen"
term => Frist
The passive voice may be a criterion:
These are called credits. => ... heißen ...
He called it a mistake. => ... nannte ...
These are criteria that may be used for disambiguation when parsing source language text. This is an advantage if a text is to be translated into more than one language, or if descriptors are to be produced for several languages, because disambiguation has to be done only once. Otherwise disambiguation would have to take place in transfer, i.e. n times for n language pairs.
The above-mentioned criteria can only solve a limited number of problematic cases, as may be seen from the following list:
More than 1 target language equivalent:
einschalten => switch on (device, light), call in (moderator), tune in (radio)
Ausfall => failure, loss, deficit, stoppage
Karte => map, card, chart, ticket
Gerät => equipment, device
Vorgang => process, procedure
Rücklauf => return, recoil, reflux, flyback
ausführen => export, carry out, take out
einführen => import, introduce
Eingriff => operation, manipulation
lösen => solve, dissolve, loosen, unscrew, detach, remove
Anschluß => connection, supply
Feld => field, area
Anschlußfeld => connector panel
aufklappen => flap open
aufdrücken => press open
aufschieben => push open
aufmachen => open
Abschnitt => paragraph, chapter, section, portion, clipping, cut, sector
Altersheim => old people's home, old age pensioners' home
Entwurf => design, draft
Haushalt => household, budget
Stand => position, stall, state of the art, stand, profession, state, situation, level, height, score
auf den neuesten Stand bringen => to bring up to date
jmdn. in den Stand setzen => to enable
semantically equivalent target language expressions:
use => benutzen, verwenden, gebrauchen
Herstellung von Erzeugnissen => * production of products
financial officials => * finanzielle Offizielle
fully => voll, vollkommen, genau
fully known => * voll bekannt
fully automatic => * genau automatisch
(* = stylistic problem)
With these examples, we can sum up some problems for MT and automatic multilingual indexing that all have to do with ambiguity:
Different usages in different domains
Karte => map, card, chart, ticket
Anzeige => criminal charge, display
Different usages in different (text) contexts
different variants of „öffnen“ = ‚open‘:
aufklappen => flap open
aufdrücken => press open
aufschieben => push open
aufmachen => open
In every case the translation could be ‚open‘, i.e. the hyperonym of this set. For certain translation purposes (‚quick and dirty‘) this may be sufficient, but some of the meaning is certainly lost.
Stand => position, stall, state of the art, stand, state, situation, level, height, score, profession
ausführen => export, carry out, take out
einführen => import, introduce
Different inhouse usages / individual preferences
Altersheim => old people's home, old age pensioners' home
For IR this would be unproblematic, as these are synonyms of each other. For MT it becomes relevant where an organization or company requires the use of one of them.
Stand => position, state of the art, stand, state, situation, level
fully (known, automatic) => (voll, vollkommen, genau) (bekannt, automatisch)
For human translators such equivalent variations are part of their daily work; for MT they cannot be formalized at all, except by statistical variation. They may be undesirable where consistency in terminology is demanded. Here MT and IR come together, as both fields need means to enforce a specific (inhouse) terminology, be it by means of a thesaurus or by giving preference to certain target language equivalents in the lexicon.
Besides inhouse usage, another criterion for the selection of equivalents may be the special field of the text being translated or indexed. The problem here is setting up an appropriate classification for MT, which will be discussed in more detail in a later chapter. A few examples may be useful here:
The term ‚Eingriff‘ (= operation, interference) in medical texts probably means ‚operation‘ though there can be no certainty that ‚interference‘ is not meant.
einführen => import (economy), introduce (general)
ausführen => export (economy), carry out (general)
lösen => dissolve (chemistry), solve (general)
Anlage => unit (computer technology), investment (banking)
This problem may be approached by ranking equivalents rather than eliminating them: the subject field codes are compared, and the lower-ranking equivalents are kept for a post-editor. This strategy was followed in various projects working with the SUSY system. It is described in the chapter on sublanguage.
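The ranking idea can be illustrated with a minimal Python sketch. The scoring scheme (an exact match on the subject field ranks first, the ‚general‘ reading second, all other fields last) and the names `EQUIVALENTS` and `rank_equivalents` are assumptions made for this example; the actual ranking procedure used with SUSY is not specified here.

```python
# Hypothetical transfer lexicon: target equivalents tagged with subject field codes.
EQUIVALENTS = {
    "einführen": [("import", "economy"), ("introduce", "general")],
    "ausführen": [("export", "economy"), ("carry out", "general")],
    "lösen": [("dissolve", "chemistry"), ("solve", "general")],
    "Anlage": [("unit", "computer technology"), ("investment", "banking")],
}


def rank_equivalents(word, text_field):
    """Order the equivalents of `word` for a text of the given subject field.

    Nothing is discarded: lower-ranking equivalents stay in the list so
    that a post-editor can still choose them.
    """
    def score(entry):
        _, field = entry
        if field == text_field:
            return 2  # subject field of the text matches the entry
        if field == "general":
            return 1  # general-language reading as fallback
        return 0      # reading from an unrelated field

    ranked = sorted(EQUIVALENTS.get(word, []), key=score, reverse=True)
    return [target for target, _ in ranked]


print(rank_equivalents("einführen", "economy"))
print(rank_equivalents("einführen", "medicine"))
```

Because the list is reordered and not filtered, a wrong subject field code costs the post-editor a selection step instead of producing an unrecoverable error.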