8. Information science as a bridging discipline
Approaches to sense disambiguation with respect to automatic indexing and machine translation
3. The morpho-syntactic approach to automatic tagging
Heinz-Dirk Luckhardt
When we ‘tag’ a text, we give every word in it a grammatical description: by determining the word's function in the sentence at hand, we select the correct reading from the several readings a word form may have. In other words, we disambiguate it. This used to be done by hand (‘intellectually’); today many tools do it automatically.
One such tool could be the parsing component of a machine translation system, since disambiguation is the foremost task of any such parser. Disambiguation may be achieved by morphological, syntactic, or semantic means. The approach shown here is based on morphology and syntax (partial parsing) and has been employed in the SUSY MT system. SUSY has, in fact, never been used for tagging texts, but its parsing results could, among other things, serve that purpose. The figure below shows all homographs in the German sentence
Trotz der schwierigen konjunkturellen Rahmenbedingungen in wichtigen Marktsegmenten unseres Unternehmens erhoeht sich das Ergebnis der gewoehnlichen Geschaeftstaetigkeit um 58 Prozent gegenueber dem Vorjahr.
that result from morphological analysis and dictionary look-up (all the output given is real output of the SUSY parser and MT system):
TEXT WORD             WKL   LEMMA NAME              STW
-------------------------------------------------------------
Trotz                 FIV   TROTZEN                 VRB
                      SUB   TROTZ                   SUB
                      PRP   TROTZ                   FWK
der                   REL   D- (REL)                FWK
                      ARTB  D- (ARTB)               FWK
                      PER   D- (PER)                FWK
schwierigen           ADJ   SCHWIERIG               ADJ
konjunkturellen       ADJ   KONJUNKTURELL           ADJ
Rahmenbedingungen     SUB   /RAHMEN/BEDINGUNG       SUB
in                    PRP   IN (DATIV)              FWK
                      PRP   IN (AKKUSATIV)          FWK
wichtigen             ADJ   WICHTIG                 ADJ
Marktsegmenten        SUB   /MARKT/SEGMENT          SUB
unseres               POSS  UNSR-                   FWK
Unternehmens          SUB   UNTERNEHMEN (SUB)       SUB
                      SBI   UNTERNEHMEN             VRB
                      SBI   UNTERNEHMEN             VRB
erhoeht               ADP   ERHOEHEN                VRB
                      PTZ2  ERHOEHEN                VRB
                      FIV   ERHOEHEN                VRB
sich                  REF   ER/SIE/ES/SIE (REF)     FWK
das                   REL   D- (REL)                FWK
                      ARTB  D- (ARTB)               FWK
                      PER   D-                      FWK
Ergebnis              SUB   ERGEBNIS                SUB
der                   REL   D- (REL)                FWK
                      ARTB  D- (ARTB)               FWK
                      PER   D- (PER)                FWK
gewoehnlichen         ADJ   GEWOEHNLICH             ADJ
Geschaeftstaetigkeit  SUB   GESCHAEFTSTAETIGKEIT    SUB
um                    UOA   UM                      FWK
                      PRP   UM (HERUM)              FWK
                      VZS   UM                      FWK
                      PRP   UM (WILLEN)             FWK
                      PRP   UM (PRP)                FWK
58                    NUM   58                      FWK
Prozent               SUB   PROZENT                 SUB
gegenueber            PRP   GEGENUEBER (PRP)        FWK
                      ADV   GEGENUEBER (ADV)        FWK
                      VZS   GEGENUEBER (VZS)        FWK
dem                   REL   D- (REL)                FWK
                      ARTB  D- (ARTB)               FWK
                      PER   D-                      FWK
Vorjahr               SUB   VORJAHR                 SUB
*                     *
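To give a sense of the size of the problem, the readings in the listing above can be counted per text word. The Python sketch below uses counts copied by hand from the listing. The product it computes is only a naive upper bound: it assumes that every reading of one word may combine freely with every reading of every other word, which is an illustrative assumption and not a statement about how SUSY enumerates analyses.

```python
from math import prod

# (text word, number of readings), counted by hand from the listing above.
READINGS = [
    ("Trotz", 3), ("der", 3), ("schwierigen", 1), ("konjunkturellen", 1),
    ("Rahmenbedingungen", 1), ("in", 2), ("wichtigen", 1),
    ("Marktsegmenten", 1), ("unseres", 1), ("Unternehmens", 3),
    ("erhoeht", 3), ("sich", 1), ("das", 3), ("Ergebnis", 1), ("der", 3),
    ("gewoehnlichen", 1), ("Geschaeftstaetigkeit", 1), ("um", 5), ("58", 1),
    ("Prozent", 1), ("gegenueber", 3), ("dem", 3), ("Vorjahr", 1),
]

# Words with more than one reading are the ones that need disambiguation.
ambiguous = [word for word, n in READINGS if n > 1]

# Naive upper bound: every reading combined freely with every other.
combinations = prod(n for _, n in READINGS)

print(len(READINGS), "text words,", len(ambiguous), "of them ambiguous")
print("naive number of reading combinations:", combinations)
```

Under this free-combination assumption, the ten ambiguous words among the 23 text words account for 65,610 combinations, which suggests why it pays to discard readings before full parsing begins.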
In SUSY, the strategy was to resolve as many ambiguities as possible at an early stage of the parsing process, in order to make parsing faster without taking premature decisions. This was achieved by a hybrid system of routines that assigns weights to pairs of parts of speech and employs partial parses. For example, it would give the pair ‘preposition + article’ a higher weight than ‘noun + relative pronoun’, as at the beginning of the sample sentence (Trotz der …). It would also eliminate readings on syntactic grounds: the ‘conjunction’ reading of ‘um’, for instance, is discarded because ‘um’ is not followed by an infinitive. These criteria suffice to disambiguate all the homographs in this sentence and produce an unambiguous basis for tagging the sentence:
TEXT WORD             WKL   LEMMA NAME              STW
-------------------------------------------------------------
Trotz                 PRP   TROTZ                   FWK
der                   ARTB  D- (ARTB)               FWK
schwierigen           ADJ   SCHWIERIG               ADJ
konjunkturellen       ADJ   KONJUNKTURELL           ADJ
Rahmenbedingungen     SUB   /RAHMEN/BEDINGUNG       SUB
in                    PRP   IN (DAT)                FWK
wichtigen             ADJ   WICHTIG                 ADJ
Marktsegmenten        SUB   /MARKT/SEGMENT          SUB
unseres               POSS  UNSR-                   FWK
Unternehmens          SUB   UNTERNEHMEN (SUB)       SUB
erhoeht               FIV   ERHOEHEN                VRB
sich                  PER   ER/SIE/ES/SIE (REF)     FWK
das                   ARTB  D- (ARTB)               FWK
Ergebnis              SUB   ERGEBNIS                SUB
der                   ARTB  D- (ARTB)               FWK
gewoehnlichen         ADJ   GEWOEHNLICH             ADJ
Geschaeftstaetigkeit  SUB   GESCHAEFTSTAETIGKEIT    SUB
um                    PRP   UM (PRP)                FWK
58                    NUM   58                      FWK
Prozent               SUB   PROZENT                 SUB
gegenueber            PRP   GEGENUEBER (PRP)        FWK
dem                   ARTB  D- (ARTB)               FWK
Vorjahr               SUB   VORJAHR                 SUB
*                     *
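The two mechanisms described above, weighting pairs of adjacent parts of speech and eliminating readings on syntactic grounds, can be sketched in a few lines of Python. This is an illustration only, not SUSY's implementation. The numeric weights, the function names, the exhaustive search over reading sequences, the hypothetical `INF` code for an infinitive reading, and the interpretation of `UOA` as the conjunction reading of ‘um’ are all assumptions made here; the source states only that ‘preposition + article’ outranks ‘noun + relative pronoun’ and that the conjunction reading requires a following infinitive.

```python
from itertools import product

# Candidate WKL codes per text word, taken from the ambiguous listing
# (only the first five words, to keep the search small).
OPENING = [
    ("Trotz", ["FIV", "SUB", "PRP"]),
    ("der", ["REL", "ARTB", "PER"]),
    ("schwierigen", ["ADJ"]),
    ("konjunkturellen", ["ADJ"]),
    ("Rahmenbedingungen", ["SUB"]),
]

# Hypothetical weights for adjacent pairs of parts of speech. Only the
# ordering "PRP + ARTB outranks SUB + REL" comes from the text.
PAIR_WEIGHTS = {
    ("PRP", "ARTB"): 3.0,
    ("SUB", "REL"): 1.0,
    ("ARTB", "ADJ"): 3.0,
    ("ADJ", "ADJ"): 2.0,
    ("ADJ", "SUB"): 3.0,
}

INF_CONJ = "UOA"    # assumption: the conjunction reading of 'um'
INFINITIVE = "INF"  # hypothetical code for an infinitive reading


def score(tags):
    """Sum the weights of all adjacent tag pairs; unlisted pairs count 0."""
    return sum(PAIR_WEIGHTS.get(pair, 0.0) for pair in zip(tags, tags[1:]))


def best_reading(candidates):
    """Pick the highest-scoring tag sequence by exhaustive search."""
    words = [word for word, _ in candidates]
    best = max(product(*(tags for _, tags in candidates)), key=score)
    return list(zip(words, best))


def drop_infinitive_conjunction(candidates):
    """Remove the conjunction reading when no infinitive reading follows.

    A word's last remaining reading is never removed.
    """
    pruned = []
    for i, (word, tags) in enumerate(candidates):
        followed_by_inf = any(
            INFINITIVE in later for _, later in candidates[i + 1:]
        )
        if INF_CONJ in tags and len(tags) > 1 and not followed_by_inf:
            tags = [t for t in tags if t != INF_CONJ]
        pruned.append((word, tags))
    return pruned


if __name__ == "__main__":
    print(best_reading(OPENING))
    tail = [("um", ["UOA", "PRP", "VZS"]), ("58", ["NUM"]), ("Prozent", ["SUB"])]
    print(drop_infinitive_conjunction(tail))
```

With these assumed weights, the opening of the sample sentence comes out as preposition plus article, and the assumed conjunction reading of ‘um’ is removed because no infinitive follows. Exhaustive search is chosen only for brevity; it would not scale to the full sentence.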
As mentioned above, there is no actual SUSY tagger, but a competent programmer could derive everything a tagger needs from the parser’s output tables. Only a small portion of the output is shown here, of course; it may safely be assumed that the text words could be given far more detailed information than a tagger would normally produce. But as tagging is not the focus here, I shall leave it at that.
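As an illustration of how such output could be turned into tags, the Python sketch below reads rows of a disambiguated listing and returns word/tag pairs. It assumes the four-column layout used in this section (text word, WKL code, lemma name, STW code, separated by whitespace) and a set of WKL codes collected from the listings shown here. The function names are hypothetical, and the real output tables hold more than these four columns.

```python
# WKL codes that occur in the listings of this section (not a complete list).
WKL_CODES = {
    "ADJ", "ADP", "ADV", "ARTB", "FIV", "NKO", "NUM", "PER", "POSS",
    "PRP", "PTZ2", "REF", "REL", "SBI", "SUB", "UOA", "VZS",
}


def parse_row(row):
    """Split one listing row into text word, WKL, lemma name and STW."""
    tokens = row.split()
    # The first token that is a WKL code ends the (possibly multi-word)
    # text word; the last token is the STW code.
    idx = next((i for i, tok in enumerate(tokens) if tok in WKL_CODES), None)
    if idx is None or idx == 0 or len(tokens) < idx + 3:
        raise ValueError(f"not a complete listing row: {row!r}")
    return {
        "word": " ".join(tokens[:idx]),
        "wkl": tokens[idx],
        "lemma": " ".join(tokens[idx + 1:-1]),
        "stw": tokens[-1],
    }


def tag(rows):
    """Return (text word, WKL code) pairs, skipping the '* *' end marker."""
    pairs = []
    for row in rows:
        if row.strip().startswith("*"):
            continue
        parsed = parse_row(row)
        pairs.append((parsed["word"], parsed["wkl"]))
    return pairs


# Rows copied from the disambiguated listings in this section.
ROWS = [
    "Trotz PRP TROTZ FWK",
    "der ARTB D- (ARTB) FWK",
    "sich PER ER/SIE/ES/SIE (REF) FWK",
    "zu Lasten PRP ZU LASTEN FWK",
    "* *",
]

if __name__ == "__main__":
    for word, wkl in tag(ROWS):
        print(f"{word}\t{wkl}")
```

Matching on a known set of codes, rather than on column positions, lets the sketch cope with multi-word entries such as ‘zu Lasten’. It does not handle rows that wrap over two lines or listings that still contain several readings per word.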
TEXT WORD             WKL   LEMMA NAME              STW
-------------------------------------------------------------
Kostensenkungen       SUB   /KOSTEN/SENKUNG         SUB
und                   NKO   UND                     FWK
Produktivitaetsfort   SUB   /PRODUKTIVITAET*S/      SUB
schritte                    FORTSCHRITT
gehen                 FIV   GEHEN                   VRB
nicht                 ADV   NICHT                   FWK
zu Lasten             PRP   ZU LASTEN               FWK
der                   ARTB  D- (ARTB)               FWK
Qualitaet             SUB   QUALITAET               SUB
*                     *
