8. Informationswissenschaft als Brückenwissenschaft

Approaches to sense disambiguation with respect to automatic indexing and machine translation

2. The general linguistic approach

Heinz-Dirk Luckhardt

The general linguistic approach

The approach to automatic indexing taken here may be described in general as linguistic, i.e. as having a strong morpho-syntactic component based on large dictionaries and a small semantic component (in contrast, e.g., to mathematical-statistical approaches). The underlying linguistic theory is dependency grammar. Common to all applications is the construction of a dependency structure for every sentence parsed. The results may then be further processed for translation, tagging, or indexing. At present only a German parser and an English generator are implemented. The German-English translation may be tested on the internet (see SUSY system).

The approach is based on the assumption that disambiguation of natural language expressions (texts, sentences, phrases, words) is necessary for MT and multilingual indexing. Whereas disambiguation has been the focus of research in MT, its usefulness for indexing has been questioned. The various facets of the ambiguity phenomenon are discussed in different scientific communities, as summarized e.g. for the CORPORA mailing list (see Kilgarriff 1995) and discussed in detail in Krovetz/Croft 1992: disambiguation is especially important for MT and, to a lesser extent, for IR. For multilingual automatic indexing it is assumed to be as important as for MT, and this is the focus of the present paper.

For a start, I shall work with the following types of ambiguity:

homographs  'distributed': finite verb, participle

            'während': preposition: 'during',
                       conjunction: 'while'

polysemes   'während'/'while': temporal, contrastive

            'Anzeige': Strafanzeige: 'criminal charge',
                       Anzeige eines Geräts: 'display'

            'einschalten': switch on (device),
                           call in (a moderator)

homonyms    'ear': Ohr (from 'auris')
                   Ähre (from 'acus')


According to Heger (1963, 484), polysemy is the case if ‚one and the same word body has two or more different meanings, but only one syntactic function‘. Homonymy means that the word in question has different sources. Syntactic homography is defined as the case in which the word has two or more different syntactic functions (Welte 1974, 506). In the following, I shall give a short account of homography; the larger part of this chapter is devoted to polysemy.

Homographs are mostly resolved morpho-syntactically. In NLP, it is one of the tasks of structural analysis to determine the correct syntactic function of a word in a given context:

a) Das Terminal quittierte den Auftrag.
                ('quittierte' = finite verb)

b) Hier ist der quittierte Auftrag.
                ('quittierte' = past participle)

c) Während des Betriebs nicht rauchen.
   ('während' = preposition)

d) Nicht rauchen, während das Gerät läuft.
   ('während' = conjunction)

A special homograph problem concerns compound analysis: the compound ‚Lesebefehl‘ (‚command to read‘) will be split up into ‚Befehl‘ (= ‚command‘) and the pseudo-homograph ‚Lese‘ (from ‚LESEN‘ = ‚read‘ and ‚LESE‘ = ‚vintage‘), where the latter reading is, of course, not correct.
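The pseudo-homograph effect can be reproduced with a toy compound splitter. The following is a minimal sketch, not the SUSY implementation: the two-entry `LEXICON` and the function `split_compound` are hypothetical and only illustrate how a purely lexical split yields both the intended and the spurious analysis.

```python
# Hypothetical mini-lexicon: surface form -> possible lemmas (an assumption for illustration).
LEXICON = {
    "befehl": ["BEFEHL"],
    "lese": ["LESEN", "LESE"],  # verb stem 'read' vs. noun 'vintage'
}

def split_compound(word):
    """Return all (modifier lemma, head lemma) pairs licensed by the lexicon."""
    w = word.lower()
    analyses = []
    for i in range(1, len(w)):
        modifier, head = w[:i], w[i:]
        if modifier in LEXICON and head in LEXICON:
            for m in LEXICON[modifier]:
                for h in LEXICON[head]:
                    analyses.append((m, h))
    return analyses

print(split_compound("Lesebefehl"))
```

Both analyses are returned; choosing between them requires information beyond the lexicon lookup shown here.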

In MT, polysemy is recognizable in parsing and in transfer. To be sure, some say that polysemy plays a role only in transfer, i.e. when translating words. But there are arguments for maintaining that certain kinds of polysemy are best resolved already during the parsing process. This will be shown a little later.

First, I want to admit that there are many cases where polysemy is better treated in transfer (see Luckhardt 1987). These are the cases where a language community has, for socio-cultural or etymological reasons, developed several words that are equivalent to just one word in another language. It may be questioned whether it can be the task of a parser of German to distinguish between three readings of ‚Uhr‘ (Armbanduhr, Standuhr, Turmuhr) only because French has three different words for this: ‚montre‘, ‚pendule‘, ‚horloge‘. If this disambiguation is possible at all, it ought to take place in bilingual transfer. For English, there would have to be a differentiation into ‚watch‘ and ‚clock‘. For automatic indexing, it may be a good thing to have this differentiation done already by the German parser, so that the descriptors ‚Armbanduhr‘, ‚Standuhr‘, ‚Turmuhr‘ might be assigned if necessary.

In what follows I shall give some examples of how disambiguation of polysemes was achieved in the SUSY system during parsing, i.e. by morpho-syntactic means.

First of all, there is the use of morphological criteria. The word „Betrieb“ has two meanings:

     Betrieb = Firma (company)

     Betrieb = (das) Betreiben (von)... (the operating of...)

If this word occurs in the plural, the meaning must be ‚companies‘, as the other reading cannot be used in the plural. So, a simple lexical rule can lead to disambiguation.
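Such a lexical rule might be sketched as follows. This is only an illustration under stated assumptions: the `READINGS` table, its field names, and the function `readings_for` are hypothetical and do not reproduce the SUSY dictionary format.

```python
# Hypothetical lexicon entry: each reading lists the grammatical numbers it allows.
READINGS = {
    "Betrieb": [
        {"gloss": "company", "numbers": {"sg", "pl"}},
        {"gloss": "operation", "numbers": {"sg"}},  # 'das Betreiben von ...' has no plural
    ],
}

def readings_for(lemma, number):
    """Return the glosses of all readings compatible with the observed number."""
    return [r["gloss"] for r in READINGS.get(lemma, []) if number in r["numbers"]]

print(readings_for("Betrieb", "pl"))
print(readings_for("Betrieb", "sg"))
```

The plural form leaves only one reading, whereas the singular remains ambiguous and has to be resolved by other criteria.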

A syntactic criterion may be the occurrence of specific prepositional or direct objects or attributes (cf. Luckhardt 1987):

     Der Monteur hängt das Gerät ab. (= take down)

     Die Firma hängt vom Monteur ab. (= depend on)

     Wir nehmen an, daß ein Irrtum vorliegt. (= assume)

     Wir nehmen uns der Sache an. (= take care of)

     Wir nehmen das Paket an. (= accept)

     Die Sache nimmt Gestalt an. (= take shape)

     Wir sehen dies als einen Fehler an. (= regard)

     Wir sehen uns das Gerät an. (= inspect)

     Wir sehen dem Gerät an, woher es kommt.
     (= We can tell by its look ...)
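The selection of a verb reading by its complements can be sketched with a small frame lexicon. The sketch assumes that the parser has already identified the complements of the verb; the labels (`accusative_object`, `prep:von`, `als_phrase`, etc.), the `FRAMES` table, and the function `select_reading` are hypothetical and chosen only for illustration.

```python
# Hypothetical frame lexicon: each reading names the complements it requires.
FRAMES = {
    "abhängen": [
        ("take down", {"accusative_object"}),
        ("depend on", {"prep:von"}),
    ],
    "annehmen": [
        ("assume", {"dass_clause"}),
        ("take care of", {"reflexive", "genitive_object"}),
        ("accept", {"accusative_object"}),
        ("take shape", {"accusative_object", "acc:Gestalt"}),
    ],
    "ansehen": [
        ("regard", {"accusative_object", "als_phrase"}),
        ("inspect", {"reflexive", "accusative_object"}),
        ("tell by its look", {"dative_object", "subordinate_clause"}),
    ],
}

def select_reading(verb, observed):
    """Pick the reading whose required complements are all observed,
    preferring the most specific frame (largest set of requirements)."""
    matches = [(gloss, req) for gloss, req in FRAMES.get(verb, []) if req <= observed]
    if not matches:
        return None
    return max(matches, key=lambda m: len(m[1]))[0]

print(select_reading("abhängen", {"prep:von"}))
print(select_reading("annehmen", {"accusative_object", "acc:Gestalt"}))
```

Preferring the most specific frame is one possible design choice: it lets ‚Gestalt annehmen‘ win over the more general ‚accept‘ reading, which also takes an accusative object.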


Nouns may be disambiguated by the existence of attributes or appositions:

     the term 'deposits' => der Begriff "Einlagen"

else:

     term => Frist

The passive voice may be a criterion:

     These are called credits. => ... heißen ...

     He called it a mistake. => ... nannte ...

These are criteria that may be used for disambiguation when parsing source language text. This is an advantage if a text is to be translated into more than one language or descriptors are to be produced for several languages, for disambiguation has to be done only once. Otherwise disambiguation would have to take place in transfer, i.e. n times for n language pairs.

The above-mentioned criteria can only solve a limited number of problematic cases, as may be seen from the following list:

More than 1 target language equivalent:

einschalten => switch on (device, light), call in (moderator), tune in (radio)

Ausfall => failure, loss, deficit, stoppage

Karte => map, card, chart, ticket

Gerät => equipment, device

Vorgang => process, procedure

Rücklauf => return, recoil, reflux, flyback

ausführen => export, carry out, take out

einführen => import, introduce

Eingriff => operation, manipulation

lösen => solve, dissolve, loosen, unscrew, detach, remove

Anschluß => connection, supply

Feld => field, area

Anschlußfeld => connector panel

aufklappen => flap open

aufdrücken => press open

aufschieben => push open

aufmachen => open

Abschnitt => paragraph, chapter, section, portion, clipping, cut, sector

Altersheim => old people's home, old age pensioners' home

Entwurf => design, draft

Haushalt => household, budget

Stand => position, stall, state of the art, stand,
          profession, state, situation, level, height, score


Phrasal verbs:

     auf den neuesten Stand bringen => to bring up to date

     jmdn. in den Stand setzen      => to enable

Semantically equivalent target language expressions:

     use => benutzen, verwenden, gebrauchen

Stylistic problems:

     Herstellung von Erzeugnissen => * production of products

     financial officials => * finanzielle Offizielle

     fully           => voll, vollkommen, genau
     fully known     => * voll bekannt
     fully automatic => * genau automatisch

(* = stylistic problem)

With these examples, we can sum up some problems for MT and automatic multilingual indexing that all have to do with ambiguity:

Different usages in different domains

     Karte => map, card, chart, ticket

     Anzeige => criminal charge, display

Different usages in different (text) contexts

different variants of „öffnen“ = ‚open‘:

     aufklappen => flap open

     aufdrücken => press open

     aufschieben => push open

     aufmachen => open

In every case the translation could be ‚open‘, i.e. the hypernym of this set. For certain translation purposes (‚quick and dirty‘) this may be sufficient, but some of the meaning is certainly lost.

     Stand => position, stall, state of the art, stand, state,
              situation, level, height, score, profession

     ausführen => export, carry out, take out

     einführen => import, introduce

Different inhouse usages / individual preferences

     Altersheim => old people's home, old age pensioners' home

For IR this would be unproblematic as these are synonyms of each other. For MT it is relevant where an organization or company requests the usage of one of them.

Stylistic variation

     Stand => position, state of the art, stand, state, situation, level

     fully (known,automatic) => (voll, vollkommen, genau) (bekannt, automatisch)

For human translators such equivalent variations are part of their daily work; for MT they cannot be formalized at all, except by statistical variation. They may be undesirable where consistency in terminology is demanded. Here MT and IR come together, as both fields need means to enforce a specific (inhouse) terminology, be it by means of a thesaurus or by giving preference to certain target language equivalents in the lexicon.
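Giving preference to certain equivalents can be sketched as a simple filter over the candidate list. The function `choose_equivalent` and the set of house terms below are hypothetical; which variant an organization actually prefers is an assumption made for the example.

```python
def choose_equivalent(candidates, house_terms):
    """Return the first candidate found in the house terminology;
    fall back to the first candidate if none is prescribed."""
    for candidate in candidates:
        if candidate in house_terms:
            return candidate
    return candidates[0]

altersheim = ["old people's home", "old age pensioners' home"]

# Assumed house style that prescribes the second variant.
print(choose_equivalent(altersheim, {"old age pensioners' home"}))

# Without a prescribed term, the lexicon order decides.
print(choose_equivalent(altersheim, set()))
```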

Besides inhouse usage, another criterion for the selection of equivalents may be the special field of the text translated or indexed. The problem here is setting up an appropriate classification for MT, which will be discussed in more detail in a later chapter. A few examples may be useful here:

The term ‚Eingriff‘ (= operation, interference) in medical texts probably means ‚operation‘, though there can be no certainty that ‚interference‘ is not meant.

               import (economy)
einführen <
               introduce (general)

               export (economy)
ausführen <
               carry out (general)

               dissolve (chemistry)
lösen     <
               solve (general)

               unit (computer technology)
Anlage    <
               investment (banking)

This problem may be approached by ranking equivalents rather than eliminating them: the equivalents are ranked by comparing their subject field codes, and the lower-ranking ones are kept for a post-editor. This strategy was followed in various projects working with the SUSY system. It is described in the chapter on sublanguage.
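The ranking strategy might look roughly as follows. This is a minimal sketch, not the SUSY procedure: the `EQUIVALENTS` table reuses the examples above, while the scoring scheme (matching field first, then ‚general‘, then all other fields) and the function `rank_equivalents` are assumptions.

```python
# Equivalents with subject field codes, taken from the examples above.
EQUIVALENTS = {
    "einführen": [("import", "economy"), ("introduce", "general")],
    "ausführen": [("export", "economy"), ("carry out", "general")],
    "lösen": [("dissolve", "chemistry"), ("solve", "general")],
    "Anlage": [("unit", "computer technology"), ("investment", "banking")],
}

def rank_equivalents(word, text_field):
    """Order equivalents by how well their field code fits the text.
    Nothing is discarded; lower-ranked items stay available for post-editing."""
    def score(entry):
        _, field = entry
        if field == text_field:
            return 0  # field of the text: best
        if field == "general":
            return 1  # general vocabulary: second best (assumed)
        return 2      # other special fields: last
    return [target for target, _ in sorted(EQUIVALENTS.get(word, []), key=score)]

print(rank_equivalents("lösen", "chemistry"))
print(rank_equivalents("lösen", "banking"))
print(rank_equivalents("Anlage", "banking"))
```

Because the sort is stable, equivalents with equal scores keep their lexicon order.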


Universität des Saarlandes - Fachrichtung Informationswissenschaft
