This website has been archived since the Information Science degree programme ended
in June 2014 and is no longer updated.
For technical questions: Sascha Beck - s AT saschabeck PUNKT ch

8. Information science as a bridging discipline

Approaches to sense disambiguation with respect to automatic indexing and machine translation

3. The morpho-syntactic approach to automatic tagging

Heinz-Dirk Luckhardt

When we ‘tag’ a text, we give every word in the text its grammatical description: by determining its function in the present sentence, we select the correct reading from the several readings a word form may have; that is, we disambiguate it. This was formerly done by hand (‘intellectually’); now there are many tools that do it automatically.

One of these tools might be the parsing component of any machine translation system, because disambiguation is the foremost task of such a parser. It may be achieved by morphological, syntactic, or semantic means. The approach to disambiguation shown here is based on morphology and syntax (partial parsing) and has been employed in the SUSY MT system. The SUSY system has, in fact, never been used for tagging texts, but it can be shown that its parsing results could, among other things, serve that purpose. The figure below shows all homographs in the German sentence

Trotz der schwierigen konjunkturellen Rahmenbedingungen in wichtigen Marktsegmenten unseres Unternehmens erhoeht sich das Ergebnis der gewoehnlichen Geschaeftstaetigkeit um 58 Prozent gegenueber dem Vorjahr.

that result from morphological analysis and dictionary look-up (all the output given is real output of the SUSY parser and MT system):

glossary of acronyms used

TEXT WORD             WKL   LEMMA NAME          STW
-------------------------------------------------------------
Trotz                 FIV   TROTZEN             VRB 
                      SUB   TROTZ               SUB 
                      PRP   TROTZ               FWK 
der                   REL   D- (REL)            FWK 
                      ARTB  D- (ARTB)           FWK 
                      PER   D- (PER)            FWK 
schwierigen           ADJ   SCHWIERIG           ADJ 
konjunkturellen       ADJ   KONJUNKTURELL       ADJ 
Rahmenbedingungen     SUB   /RAHMEN/BEDINGUNG   SUB 
in                    PRP   IN (DATIV)          FWK 
                      PRP   IN (AKKUSATIV)      FWK 
wichtigen             ADJ   WICHTIG             ADJ 
Marktsegmenten        SUB   /MARKT/SEGMENT      SUB 
unseres               POSS  UNSR-               FWK 
Unternehmens          SUB   UNTERNEHMEN (SUB)   SUB 
                      SBI   UNTERNEHMEN         VRB 
                      SBI   UNTERNEHMEN         VRB 
erhoeht               ADP   ERHOEHEN            VRB 
                      PTZ2  ERHOEHEN            VRB 
                      FIV   ERHOEHEN            VRB 
sich                  REF   ER/SIE/ES/SIE (REF) FWK 
das                   REL   D- (REL)            FWK 
                      ARTB  D- (ARTB)           FWK 
                      PER   D-                  FWK 
Ergebnis              SUB   ERGEBNIS            SUB 
der                   REL   D- (REL)            FWK 
                      ARTB  D- (ARTB)           FWK 
                      PER   D- (PER)            FWK 
gewoehnlichen         ADJ   GEWOEHNLICH         ADJ 
Geschaeftstaetigkeit  SUB   GESCHAEFTSTAETIGKEIT SUB 
um                    UOA   UM                  FWK 
                      PRP   UM (HERUM)          FWK 
                      VZS   UM                  FWK 
                      PRP   UM (WILLEN)         FWK 
                      PRP   UM (PRP)            FWK 
58                    NUM   58                  FWK 
Prozent               SUB   PROZENT             SUB 
gegenueber            PRP   GEGENUEBER (PRP)    FWK 
                      ADV   GEGENUEBER (ADV)    FWK 
                      VZS   GEGENUEBER (VZS)    FWK 
dem                   REL   D- (REL)            FWK 
                      ARTB  D- (ARTB)           FWK 
                      PER   D-                  FWK 
Vorjahr               SUB   VORJAHR             SUB 
 
*                           * 
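To give a sense of how much ambiguity the parser faces before disambiguation, the rows of the table above can simply be counted. The following is a minimal sketch, not part of SUSY; the name `reading_combinations` is made up for illustration, and every listed row is counted as one reading, including the two identical SBI rows for ‘Unternehmens’.

```python
from math import prod

# Number of readings per ambiguous text word, counted row by row
# in the table above (words with a single reading contribute a factor of 1).
READING_COUNTS = [
    ("Trotz", 3),
    ("der", 3),
    ("in", 2),
    ("Unternehmens", 3),
    ("erhoeht", 3),
    ("das", 3),
    ("der", 3),
    ("um", 5),
    ("gegenueber", 3),
    ("dem", 3),
]


def reading_combinations(counts):
    """Multiply the per-word reading counts to get the number of
    tag sequences a parser would face without early disambiguation."""
    return prod(n for _, n in counts)


print(reading_combinations(READING_COUNTS))  # 65610
```

Ten ambiguous words in a 23-word sentence already yield 65,610 possible combinations of readings, which illustrates why reducing the readings early speeds up parsing.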
 
 



In SUSY, the strategy is to resolve as many ambiguities as possible at an early stage of the parsing process in order to make parsing faster, but without taking premature decisions. This was achieved by a hybrid system of routines that assigns weights to pairs of parts of speech and employs partial parses. For example, it gives the pair ‘preposition + article’ a higher weight than ‘noun + relative pronoun’, as in the sample sentence (Trotz der …). It also eliminates readings on syntactic grounds: for example, it eliminates the ‘conjunction’ reading of ‘um’ because ‘um’ is not followed by an infinitive verb. These criteria suffice to disambiguate all the homographs in this sentence and produce an unambiguous basis for tagging the sentence:


TEXT WORD             WKL   LEMMA NAME           STW
-------------------------------------------------------------
Trotz                 PRP   TROTZ                FWK 
der                   ARTB  D- (ARTB)            FWK 
schwierigen           ADJ   SCHWIERIG            ADJ 
konjunkturellen       ADJ   KONJUNKTURELL        ADJ 
Rahmenbedingungen     SUB   /RAHMEN/BEDINGUNG    SUB 
in                    PRP   IN (DAT)             FWK 
wichtigen             ADJ   WICHTIG              ADJ 
Marktsegmenten        SUB   /MARKT/SEGMENT       SUB 
unseres               POSS  UNSR-                FWK 
Unternehmens          SUB   UNTERNEHMEN (SUB)    SUB 
erhoeht               FIV   ERHOEHEN             VRB 
sich                  PER   ER/SIE/ES/SIE (REF)  FWK 
das                   ARTB  D- (ARTB)            FWK 
Ergebnis              SUB   ERGEBNIS             SUB 
der                   ARTB  D- (ARTB)            FWK 
gewoehnlichen         ADJ   GEWOEHNLICH          ADJ 
Geschaeftstaetigkeit  SUB   GESCHAEFTSTAETIGKEIT SUB 
um                    PRP   UM (PRP)             FWK 
58                    NUM   58                   FWK 
Prozent               SUB   PROZENT              SUB 
gegenueber            PRP   GEGENUEBER (PRP)     FWK 
dem                   ARTB  D- (ARTB)            FWK 
Vorjahr               SUB   VORJAHR              SUB 
 
*                           * 
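The two kinds of criteria described before this table, weights for pairs of parts of speech and elimination on syntactic grounds, can be illustrated with a small sketch. This is not the SUSY implementation: the weight values, the function names, and the labels `CONJ` and `INF` are assumptions made for the example (the text does not say which acronym marks the conjunction reading of ‘um’), and the exhaustive search over combinations is used only because the fragment is tiny.

```python
from itertools import product

# Hypothetical weights for adjacent part-of-speech pairs (higher = more plausible).
PAIR_WEIGHTS = {
    ("PRP", "ARTB"): 3,  # preposition + article
    ("SUB", "REL"): 1,   # noun + relative pronoun
    ("ARTB", "ADJ"): 3,  # article + adjective
}


def drop_conjunction(readings, following, conj_tag="CONJ", inf_tag="INF"):
    """Remove the conjunction reading unless an infinitive is among the
    tags that follow; never delete the last remaining reading."""
    if inf_tag in following:
        return list(readings)
    kept = [tag for tag in readings if tag != conj_tag]
    return kept or list(readings)


def best_sequence(candidates):
    """Return the tag sequence with the highest summed pair weight."""
    def score(tags):
        return sum(PAIR_WEIGHTS.get(pair, 0) for pair in zip(tags, tags[1:]))
    return max(product(*candidates), key=score)


# Readings of 'Trotz der schwierigen', taken from the first table.
fragment = [["FIV", "SUB", "PRP"], ["REL", "ARTB", "PER"], ["ADJ"]]
print(best_sequence(fragment))  # ('PRP', 'ARTB', 'ADJ')

# 'um' followed by a numeral, not an infinitive: the conjunction reading goes.
print(drop_conjunction(["PRP", "CONJ"], following=["NUM"]))  # ['PRP']
```

With these assumed weights the fragment resolves to preposition, article, adjective, which matches the readings chosen in the table above. A real system would combine such weights with partial parses rather than enumerate every combination.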
 
 

As mentioned above, there is no actual SUSY tagger, but a good programmer could extract everything necessary from the parser’s output tables. Of course, only a small portion of the output is shown here. It may safely be assumed that much more detailed information could be assigned to the text words than a tagger would normally produce. But as tagging is not the focus of interest here, I shall leave it at that.

TEXT WORD                    WKL   LEMMA NAME                     STW
---------------------------------------------------------------------
Kostensenkungen              SUB   /KOSTEN/SENKUNG                SUB
und                          NKO   UND                            FWK
Produktivitaetsfortschritte  SUB   /PRODUKTIVITAET*S/FORTSCHRITT  SUB
gehen                        FIV   GEHEN                          VRB
nicht                        ADV   NICHT                          FWK
zu Lasten                    PRP   ZU LASTEN                      FWK
der                          ARTB  D- (ARTB)                      FWK
Qualitaet                    SUB   QUALITAET                      SUB
*                                  *
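How tags might be read off such an output table can be sketched in a few lines. This is only an illustration under stated assumptions: columns are taken to be separated by at least two spaces, the function name `tag_rows` and the `word/TAG` output notation are invented for the example, and rows that do not have exactly four columns (such as the `*` end marker or a wrapped continuation line) are simply skipped. The sample leaves out the row for ‘Produktivitaetsfortschritte’ to keep it short.

```python
import re

# A few rows copied from the output table above.
SAMPLE = """\
Kostensenkungen      SUB   /KOSTEN/SENKUNG    SUB
und                  NKO   UND                FWK
gehen                FIV   GEHEN              VRB
nicht                ADV   NICHT              FWK
zu Lasten            PRP   ZU LASTEN          FWK
der                  ARTB  D- (ARTB)          FWK
Qualitaet            SUB   QUALITAET          SUB
*                          *
"""


def tag_rows(table_text):
    """Return (text word, WKL tag) pairs for every complete table row."""
    tagged = []
    for line in table_text.splitlines():
        # Split on runs of two or more spaces so that multi-word
        # entries such as 'zu Lasten' stay in one field.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) != 4:
            continue  # end marker, blank line, or wrapped continuation
        word, wkl, _lemma, _stw = fields
        tagged.append((word, wkl))
    return tagged


print(" ".join(f"{word}/{tag}" for word, tag in tag_rows(SAMPLE)))
```

Only the WKL column is used here; the lemma and STW columns are parsed but ignored, and could be carried along if a richer tag set were wanted.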