8. Information Science as a Bridging Discipline
Approaches to sense disambiguation with respect to automatic indexing and machine translation
3. The morpho-syntactic approach to automatic tagging
Heinz-Dirk Luckhardt
When we ‘tag’ a text, we give every word in the text its grammatical description: by determining the word's function in the sentence at hand, we select the correct reading from the several readings a word form may have. In other words, we disambiguate it. This was formerly done by hand; today there are many tools that do it automatically.
One such tool could be the parsing component of a machine translation system, because disambiguation is the foremost task of any such parser. It may be achieved by morphological, syntactic, or semantic means. The approach to disambiguation shown here is based on morphology and syntax (partial parsing) and has been employed in the SUSY MT system. SUSY has, in fact, never been used for tagging texts, but it can be shown that its parsing results could, among other things, serve that purpose. The figure below shows all homographs in the German sentence
Trotz der schwierigen konjunkturellen Rahmenbedingungen in wichtigen Marktsegmenten unseres Unternehmens erhoeht sich das Ergebnis der gewoehnlichen Geschaeftstaetigkeit um 58 Prozent gegenueber dem Vorjahr.
that result from morphological analysis and dictionary look-up (all the output given is real output of the SUSY parser and MT system):
TEXT WORD             WKL   LEMMA NAME              STW
-------------------------------------------------------
Trotz                 FIV   TROTZEN                 VRB
                      SUB   TROTZ                   SUB
                      PRP   TROTZ                   FWK
der                   REL   D- (REL)                FWK
                      ARTB  D- (ARTB)               FWK
                      PER   D- (PER)                FWK
schwierigen           ADJ   SCHWIERIG               ADJ
konjunkturellen       ADJ   KONJUNKTURELL           ADJ
Rahmenbedingungen     SUB   /RAHMEN/BEDINGUNG       SUB
in                    PRP   IN (DATIV)              FWK
                      PRP   IN (AKKUSATIV)          FWK
wichtigen             ADJ   WICHTIG                 ADJ
Marktsegmenten        SUB   /MARKT/SEGMENT          SUB
unseres               POSS  UNSR-                   FWK
Unternehmens          SUB   UNTERNEHMEN (SUB)       SUB
                      SBI   UNTERNEHMEN             VRB
                      SBI   UNTERNEHMEN             VRB
erhoeht               ADP   ERHOEHEN                VRB
                      PTZ2  ERHOEHEN                VRB
                      FIV   ERHOEHEN                VRB
sich                  REF   ER/SIE/ES/SIE (REF)     FWK
das                   REL   D- (REL)                FWK
                      ARTB  D- (ARTB)               FWK
                      PER   D-                      FWK
Ergebnis              SUB   ERGEBNIS                SUB
der                   REL   D- (REL)                FWK
                      ARTB  D- (ARTB)               FWK
                      PER   D- (PER)                FWK
gewoehnlichen         ADJ   GEWOEHNLICH             ADJ
Geschaeftstaetigkeit  SUB   GESCHAEFTSTAETIGKEIT    SUB
um                    UOA   UM                      FWK
                      PRP   UM (HERUM)              FWK
                      VZS   UM                      FWK
                      PRP   UM (WILLEN)             FWK
                      PRP   UM (PRP)                FWK
58                    NUM   58                      FWK
Prozent               SUB   PROZENT                 SUB
gegenueber            PRP   GEGENUEBER (PRP)        FWK
                      ADV   GEGENUEBER (ADV)        FWK
                      VZS   GEGENUEBER (VZS)        FWK
dem                   REL   D- (REL)                FWK
                      ARTB  D- (ARTB)               FWK
                      PER   D-                      FWK
Vorjahr               SUB   VORJAHR                 SUB
*                     *
In SUSY, the strategy has been to resolve as many ambiguities as possible at an early stage of the parsing process, which makes parsing faster, but without taking premature decisions. This was achieved by a hybrid system of routines that assigns weights to pairs of parts of speech and employs partial parses. For example, in the sample sentence (Trotz der …) it would give the pair ‘preposition + article’ a higher weight than ‘noun + relative pronoun’. It would also eliminate readings on syntactic grounds: the ‘conjunction’ reading of ‘um’, for instance, is dropped because ‘um’ is not followed by an infinitive. These criteria suffice to disambiguate all the homographs in this sentence and produce an unambiguous basis for tagging the sentence:
TEXT WORD             WKL   LEMMA NAME              STW
-------------------------------------------------------
Trotz                 PRP   TROTZ                   FWK
der                   ARTB  D- (ARTB)               FWK
schwierigen           ADJ   SCHWIERIG               ADJ
konjunkturellen       ADJ   KONJUNKTURELL           ADJ
Rahmenbedingungen     SUB   /RAHMEN/BEDINGUNG       SUB
in                    PRP   IN (DAT)                FWK
wichtigen             ADJ   WICHTIG                 ADJ
Marktsegmenten        SUB   /MARKT/SEGMENT          SUB
unseres               POSS  UNSR-                   FWK
Unternehmens          SUB   UNTERNEHMEN (SUB)       SUB
erhoeht               FIV   ERHOEHEN                VRB
sich                  PER   ER/SIE/ES/SIE (REF)     FWK
das                   ARTB  D- (ARTB)               FWK
Ergebnis              SUB   ERGEBNIS                SUB
der                   ARTB  D- (ARTB)               FWK
gewoehnlichen         ADJ   GEWOEHNLICH             ADJ
Geschaeftstaetigkeit  SUB   GESCHAEFTSTAETIGKEIT    SUB
um                    PRP   UM (PRP)                FWK
58                    NUM   58                      FWK
Prozent               SUB   PROZENT                 SUB
gegenueber            PRP   GEGENUEBER (PRP)        FWK
dem                   ARTB  D- (ARTB)               FWK
Vorjahr               SUB   VORJAHR                 SUB
*                     *
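The two mechanisms described above, weighting pairs of parts of speech and eliminating readings on syntactic grounds, can be illustrated with a small Python sketch. It is not SUSY's implementation, whose routines are not reproduced in this text. The numeric weights are invented for illustration, the labels `CONJ` and `INF` are placeholders (the text does not say which SUSY codes mark the conjunction reading or an infinitive), and the dynamic-programming search is simply one plausible way to pick the highest-weighted sequence of readings.

```python
# Sketch of pair weighting plus one syntactic elimination rule.
# All weights and the labels "CONJ" and "INF" are illustrative assumptions,
# not values or codes taken from SUSY.

# Hypothetical weights for adjacent pairs of parts of speech.
PAIR_WEIGHTS = {
    ("PRP", "ARTB"): 3.0,  # preposition + article: preferred
    ("SUB", "REL"): 1.0,   # noun + relative pronoun: possible, but weaker
    ("ARTB", "ADJ"): 2.0,
    ("ADJ", "ADJ"): 1.0,
    ("ADJ", "SUB"): 2.0,
}

# Candidate readings for the opening words, copied from the first table.
FRAGMENT = [
    ("Trotz", ["FIV", "SUB", "PRP"]),
    ("der", ["REL", "ARTB", "PER"]),
    ("schwierigen", ["ADJ"]),
    ("konjunkturellen", ["ADJ"]),
    ("Rahmenbedingungen", ["SUB"]),
]


def drop_conjunction_without_infinitive(readings, conj="CONJ", inf="INF"):
    """Drop the conjunction reading of a word if no later word can be an infinitive."""
    filtered = []
    for i, (word, tags) in enumerate(readings):
        later_has_inf = any(inf in later for _, later in readings[i + 1:])
        if conj in tags and len(tags) > 1 and not later_has_inf:
            tags = [t for t in tags if t != conj]
        filtered.append((word, tags))
    return filtered


def best_sequence(readings, weights, default=0.0):
    """Return the sequence of readings with the highest sum of pair weights."""
    tag_lists = [tags for _, tags in readings]
    # scores[tag] = (best total weight of a path ending in tag, that path)
    scores = {tag: (0.0, [tag]) for tag in tag_lists[0]}
    for tags in tag_lists[1:]:
        new_scores = {}
        for tag in tags:
            prev = max(
                scores,
                key=lambda p: scores[p][0] + weights.get((p, tag), default),
            )
            total = scores[prev][0] + weights.get((prev, tag), default)
            new_scores[tag] = (total, scores[prev][1] + [tag])
        scores = new_scores
    return max(scores.values(), key=lambda item: item[0])[1]


print(best_sequence(FRAGMENT, PAIR_WEIGHTS))
# prints ['PRP', 'ARTB', 'ADJ', 'ADJ', 'SUB']
```

With these assumed weights, the preposition reading of ‘Trotz’ and the article reading of ‘der’ win, which matches the disambiguated table. Applied to a list such as `[("um", ["CONJ", "PRP"]), ("58", ["NUM"])]`, the elimination function leaves only the `PRP` reading of ‘um’.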
As mentioned above, there is no actual SUSY tagger, but a good programmer could extract everything that is necessary from the parser’s output tables. Of course, only a small portion of the output is shown here. It may be safely assumed that much more detailed information could be assigned to the text words than a tagger would normally produce. But as tagging is not the focus here, I shall leave it at that.
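As a rough illustration of how little is needed once the readings are unambiguous, the following Python sketch turns rows of the disambiguated table into tagged text. It assumes the rows are already available as (text word, WKL, lemma name, STW) tuples; the real format of SUSY's output tables is not described here, and the function name `tag_text` and the word/TAG notation are choices made for this example only.

```python
# Sketch: derive tagged text from disambiguated parser rows.
# The tuple layout and the "word/TAG" output notation are assumptions
# made for this example, not a description of SUSY's internal tables.

ROWS = [
    ("Trotz", "PRP", "TROTZ", "FWK"),
    ("der", "ARTB", "D- (ARTB)", "FWK"),
    ("schwierigen", "ADJ", "SCHWIERIG", "ADJ"),
    ("konjunkturellen", "ADJ", "KONJUNKTURELL", "ADJ"),
    ("Rahmenbedingungen", "SUB", "/RAHMEN/BEDINGUNG", "SUB"),
]


def tag_text(rows):
    """Join each text word with its WKL code, one reading per word."""
    return " ".join(f"{word}/{wkl}" for word, wkl, _lemma, _stw in rows)


print(tag_text(ROWS))
# prints Trotz/PRP der/ARTB schwierigen/ADJ konjunkturellen/ADJ Rahmenbedingungen/SUB
```

The lemma and STW columns are ignored in this sketch, but they could be appended in the same way if a richer tag were wanted.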
TEXT WORD             WKL   LEMMA NAME              STW
-------------------------------------------------------
Kostensenkungen       SUB   /KOSTEN/SENKUNG         SUB
und                   NKO   UND                     FWK
Produktivitaetsfort   SUB   /PRODUKTIVITAET*S/      SUB
schritte                    FORTSCHRITT
gehen                 FIV   GEHEN                   VRB
nicht                 ADV   NICHT                   FWK
zu Lasten             PRP   ZU LASTEN               FWK
der                   ARTB  D- (ARTB)               FWK
Qualitaet             SUB   QUALITAET               SUB
*                     *