8. Informationswissenschaft als Brückenwissenschaft

Approaches to sense disambiguation with respect to automatic indexing and machine translation

4. The Sublanguage Approach

Heinz-Dirk Luckhardt

The Sublanguage Approach:
how can different special domains be dealt with

There have been various attempts at using the sublanguage notion for disambiguation and the selection of target language equivalents in machine translation. In this chapter, a theoretical concept and its implementation in a real MT or automatic indexing application are presented. In addition, techniques of linguistic engineering such as weighting mechanisms are discussed.

A number of authors (cf. Kittredge 1987, Kittredge/Lehrberger 1982, Luckhardt 1984) have proposed using the sublanguage notion to solve some of the notorious problems in machine translation (MT) and natural language processing (NLP) in general, such as disambiguation and the selection of target language equivalents. A 1989 account of the state of the art (cf. Quinlan 1989) makes it obvious that not much progress has been made since the first publications drew attention to the possibility of tailoring MT systems to the needs of specific sublanguages.

Admittedly, neither of the approaches described in Kittredge/Lehrberger 1982 (TAUM-AVIATION system) and Luckhardt 1984 (SUSY system) led to any kind of breakthrough. The TAUM-AVIATION project was not allowed to run long enough to prove the usefulness of its sublanguage concept, and the SUSY sublanguage concept was never implemented completely. In all, there was not much response to the proposals, although they seem to have been accepted as ‚potentially useful‘. Still, it seems that the commercial systems, which do not disclose much of their strategies in the openly available literature, use much of what is addressed here.

So, sublanguage does play a part in concrete MT developments and systems, even if this is not explicitly mentioned. Projects and systems like METAL, SYSTRAN, ARIANE etc. use part of what constitutes the sublanguage notion as described below, especially when it comes to the selection of target language equivalents.

In the following, I shall present a reflection on what sublanguages can contribute to the solution of problems in natural language processing.


A sublanguage concept for use in MT systems

It was Z. Harris who introduced the term ‚sublanguage‘ for a portion of natural language differing from other portions of the same language syntactically and/or lexically (cf. Harris 1968, 152):

‚Certain proper subsets of the sentences of a language may be closed under some or all of the operations defined in the language, and thus constitute a sublanguage of it‘.

This definition arose from Harris’ work on transformational grammar and discourse analysis. According to Hirschman/Sager (1982), the closure property alone is not sufficient to define the sublanguage notion; they characterize a sublanguage as follows:

(A sublanguage is) ‚the particular language used in a body of texts dealing with a circumscribed subject area (often reports or articles on a technical speciality or science subfield), in which the authors of the documents share a common vocabulary and common habits of word usage.‘ (Hirschman/Sager 1982, 28)

Lehrberger (1982) gives some properties ‚which help to characterize sublanguage:

limited subject matter
lexical, semantic, and syntactic restrictions
deviant rules of grammar
high frequency of certain constructions
text structure
use of special symbols‘.


In order to be able to use such characterizations in NLP, they have to be formalized in a way adequate to the MT system in question. Such formalizable properties were combined in my 1984 definition of what sublanguage can mean for NLP.

                            sublanguage 
     ______________________________|________________________ 
    |                   |                  |                | 
 
STRUCTURE        SUBJECT FIELD         PURPOSE         INHOUSE USAGE 
(syntactic/      (lexical level)      (lexical/        (application 
syntagmatic                        pragmatic level)    level) 
level) 
 
running text     biology              abstract           Siemens 
word list        mathematics          job offer          Daimler Benz 
nominal struc.   computer science     minutes            IBM 
elliptic sent.   etc.                 etc.               etc. 
etc.

STRUCTURE represents the syntactic-syntagmatic level of a sublanguage, for which only a rather weak differentiation can be proposed (e.g. running text, word list, nominal structures etc.). A sublanguage has, e.g., the STRUCTURE type ‚nominal structures‘ if nominal structures prevail in it, i.e. if verbs are hardly ever used, as in the sublanguage ‚titles of scientific papers from the field of EDP‘, e.g.:

Rechnerunterstützte   => Computer-aided geometry 
Geometrieverarbeitung    processing 
 
Anforderungsprofil für         => Job profiles for activities 
Tätigkeiten in der DV und IV      in DP and IP

Unfortunately, such a restriction does not completely preclude the use of argument/verb constructions like the following:

Der TV Bildschirm wird erwachsen =>
The TV screen is growing up

which in fact is a sample from a collection of titles from the field of data and information processing machine-translated by the SUSY system.

SUBJECT FIELD represents the lexical level of a sublanguage: for every sublanguage, one subject field is determined as characteristic, so that the MT system can select from the lexicon those translation equivalents that carry the same subject field code as the text being translated.

Such a restriction cannot be absolute, in the sense that only terms carrying the same subject field code as the text are acceptable. Legal terms may be used in chemical texts, or mathematical ones in texts about physics. Lexical restrictions should only determine the order of preference in cases of ambiguity or one-to-many relations in lexical transfer.
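The idea that the subject field code should only order the candidates, and never exclude any of them, can be sketched as follows. This is a minimal illustration, not the SUSY implementation: the class `Equivalent`, the function `rank_equivalents` and the field codes are assumptions made for the example; the two German equivalents of ‚conversion‘ are those discussed later in this chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Equivalent:
    target: str          # target language expression
    subject_field: str   # subject field code attached to the lexicon entry


def rank_equivalents(candidates, text_field):
    """Order candidates so that those matching the text's subject field come first.

    Nothing is discarded: non-matching equivalents remain available as fallbacks.
    sorted() is stable, so lexicon order is preserved within each group.
    """
    return sorted(candidates, key=lambda eq: eq.subject_field != text_field)


# Hypothetical lexicon entries for English 'conversion'.
conversion = [
    Equivalent("Konvertierung", "economy"),
    Equivalent("Umsetzung", "chemistry"),
]

ranked = rank_equivalents(conversion, "chemistry")
print([eq.target for eq in ranked])  # ['Umsetzung', 'Konvertierung']
```

For a text with a subject field that matches neither entry, the lexicon order is simply returned unchanged, which corresponds to a preference rather than a hard restriction.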

A further problem is the lack of a commonly accepted subject field classification for MT (see below). Such a classification has been proposed in Luckhardt/Zimmermann (1991), but its usefulness will have to be tested on more applications.

PURPOSE represents the lexical-pragmatic level. The purpose of a text (or its target group) may determine the choice of TL equivalents and of syntactic structure or style, e.g. for the text function ‚job offer‘:

Ingenieur, Richtung Elektronik
=> Engineer, special field electronics

The INHOUSE USAGE criterion covers a number of aspects determined by special requests of the MT user or the firm ordering the translation. This is first of all a question of inhouse terminology and of stylistic standards (corporate identity). A very important prerequisite for consistent translation is the strict observance of priorities with regard to the use of terms in (the translation department of) a specific firm. Companies very often want terms to be translated their way, no matter what terminology standardization committees may say. Such requests are understandable in particular in the case of data base producers, as it is indispensable for information retrieval that terms be used consistently.

Another aspect of inhouse usage is the preference for particular orthographic variants (e.g. standardisation vs. standardization) or for certain syntactic structures.


Sublanguages for MT: some examples

1. PATENT DESCRIPTIONS

Many informative texts are complex in that they consist of different sublanguages, so that these sublanguages together constitute a definite ‚document type‘, e.g. ‚patent descriptions‘ or ‚maintenance requirements‘.

Patent descriptions are a rather inhomogeneous sort of document (cf. Lawson 1980) that may be regarded as consisting of different sublanguages (with one shared subject field, e.g. ‚chemistry‘, as in the example below):

1. title:

STRUCTURE ‚nominal structure‘
PURPOSE ‚title‘

e.g.: ‚Free-flowing aluminium powder not tending towards cold sintering and treated with alkali oleate‘

2. abstract:

STRUCTURE ‚running text‘
PURPOSE ‚abstract‘

e.g.: ‚A process and a reactor for performing a catalytic conversion of H₂S and SO₂ to elementary sulfur are described. It is provided that catalyst beds are operated during the conversion below the sulfur fixed point and regenerated afterwards by warming-up‘.

3. claim:

STRUCTURE ‚running text‘
PURPOSE ‚patent claim‘

syntactic pattern:

(complex) noun phrase
+
‚thus characterized that‘
+
complex predicate

e.g.: ‚Magnesia granulate thus characterized that the surface of the granulate particles is covered with a continuous and homogeneous coating of a material selected from silicon oxide and forsterite‘.

4. descriptors:

STRUCTURE ‚word list‘
PURPOSE ‚thesaurus terms‘

e.g.: ‚granulate, coating, silicon oxide, forsterite‘.

5. actual patent text:

STRUCTURE ‚running text‘
PURPOSE ‚description‘

A patent document consists of many more information arrays (‚fields‘), but only those five given above are worth translating by machine.
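The fixed syntactic pattern of patent claims (part 3 above) is one of the few properties that can be formalized directly. A minimal sketch, assuming English claims that contain the literal phrase ‚thus characterized that‘ exactly once as a separator; the names `CLAIM_MARKER` and `split_claim` are illustrative only:

```python
import re

# The fixed phrase separating the (complex) noun phrase from the predicate.
CLAIM_MARKER = re.compile(r"\s+thus characterized that\s+")


def split_claim(claim):
    """Return (noun_phrase, predicate), or None if the marker is absent."""
    parts = CLAIM_MARKER.split(claim.strip(), maxsplit=1)
    if len(parts) != 2:
        return None
    noun_phrase, predicate = parts
    return noun_phrase, predicate.rstrip(".")


claim = ("Magnesia granulate thus characterized that the surface of the "
         "granulate particles is covered with a continuous and homogeneous coating.")
print(split_claim(claim)[0])  # Magnesia granulate
```

Splitting at the marker would allow the noun phrase and the long predicate to be analysed separately; whether this helps in practice depends on how reliably the marker occurs in the corpus.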

Although these five ‚sublanguages‘ are quite distinct from each other as far as their syntax is concerned, the differences are not easy to pin down. Parts 2 and 3, e.g., both contain ‚running text‘ and differ, if we disregard the fixed structure at the beginning of 3 for a moment, only in the length of the individual sentences, with 3 normally consisting of just one (usually very long) sentence. But what does ‚long sentence‘ mean when we are trying to formalize criteria for MT?

Differences that can be used for disambiguation are to be found, e.g., between 1. titles (nominal structures) and 2. abstracts (running text). ‚Finite verb‘ readings of words such as ‚powder‘ and ‚treated‘ have a very low weight in the sublanguage ‚titles in patent documents‘. They may even be completely neglected there (but not, e.g., in the sublanguage ‚titles of articles in journals‘).
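Such sublanguage-dependent weights for word-class readings might be represented as a simple table. The sketch below is only an illustration under stated assumptions: the numeric weights, the sublanguage labels and the default weight 0.5 for unlisted readings are invented for the example, and a weight of 0.0 stands for a reading that is neglected altogether.

```python
# Hypothetical weights per sublanguage and word-class reading; 0.0 means "neglect".
READING_WEIGHTS = {
    "patent_title": {"noun": 1.0, "past_participle": 0.8, "finite_verb": 0.0},
    "journal_title": {"noun": 1.0, "past_participle": 0.8, "finite_verb": 0.2},
    "abstract": {"noun": 1.0, "past_participle": 1.0, "finite_verb": 1.0},
}


def weighted_readings(readings, sublanguage):
    """Return the readings with non-zero weight, highest weight first."""
    weights = READING_WEIGHTS[sublanguage]
    scored = [(weights.get(r, 0.5), r) for r in readings]  # 0.5: assumed default
    kept = [(w, r) for w, r in scored if w > 0.0]
    return [r for w, r in sorted(kept, key=lambda pair: -pair[0])]


# 'treated' is ambiguous between past tense (finite verb) and past participle.
print(weighted_readings(["finite_verb", "past_participle"], "patent_title"))
print(weighted_readings(["finite_verb", "past_participle"], "journal_title"))
```

In the patent title the finite verb reading is dropped, whereas in the journal title it is kept as a low-priority alternative.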

For the above example the SUBJECT FIELD criterion can be shown to be useful, e.g., when translating ‚conversion‘ into ‚Umsetzung‘ (in texts about economy this should be ‚Konvertierung‘) and ‚warming up‘ into ‚Erhitzen‘ (in texts about sports this should be ‚Aufwärmen‘).

The SUBJECT FIELD criterion (here ‚mechanical engineering‘) can likewise help in disambiguating the maintenance requirements example given in the next section, e.g. when translating ‚housing‘ into ‚Gehäuse‘ (instead of ‚Wohnung‘ in the SUBJECT FIELDs ‚architecture‘ or ‚construction‘) or ‚tear‘ into ‚Riß‘ (instead of ‚Träne‘ in everyday usage).


2. MAINTENANCE REQUIREMENTS

A typical maintenance requirement card of the Bundessprachenamt (Federal Translations Agency) contains, among others, the following parts (sublanguages):

1. designation of equipment

STRUCTURE ‚nominal structure‘
PURPOSE ‚title‘

e.g.: ‚Portable gasoline driven pump‘

2. tools, parts, materials

STRUCTURE ‚word list‘
PURPOSE ‚accessories‘; e.g.:

key set, head screw, L-type hex
wrench, adjustable, open end 6′
solvent, type II
screwdriver, flat tip, medium duty
rags, wiping

3. procedure

STRUCTURE ‚instructions‘ (imperative style)
PURPOSE ‚maintenance instructions‘, e.g.:

‚Accomplish annually or when directed as a result of operational test. Clean and inspect fuel filter and float valve;

remove pump housing covers, if applicable
observe no smoking regulation
remove choke knob and fuel connection
remove float chamber and gasket
clean all parts in solvent, allow to air dry
inspect filter for clogging, tears, and deterioration‘

(cf. Wilms 1983)

The example indicates how clearly the different sublanguages of this type of document can be differentiated, and it ought to be possible in all MT systems to capture these differences, especially the typical ‚imperative style‘ of the STRUCTURE type ‚instructions‘. In order to achieve this, it must be possible to weight rules or resulting structures, as in the SUSY system (cf. Thiel 1987). This is important because there is no absolute certainty that all predicate structures appear as imperatives in English or as infinitives in German.


The use of sublanguages in the STS project and system

After the end of its research phase in 1986, the SUSY system was used as the core MT system within the computer-aided Saarbrücken Translation Service (STS), i.e. in human-aided MT and in machine-aided human translation (HT). Titles of scientific papers from German databases were machine-translated and postedited by humans; abstracts were translated by translators (5 million words in all), with the MT system automatically supplying the correct terminology (from a terminology pool of more than 350,000 German-English entries). This section gives an overview of the facets of the sublanguage concept employed in STS (see also Luckhardt/Zimmermann 1991).


1. Homograph resolution in titles

In titles, verb readings are generally given a low weight, e.g.:

‚Verfahren der Waldschadensinventur‘
(‚Methods for wood damage inventories‘)

VERFAHREN = (finite verb,infinitive,past participle,noun)

In a title the noun reading will be preferred. In an instruction there would be two possibilities:

‚Verfahren Sie wie oben angegeben‘
(‚Proceed as stated above‘)
‚Dabei wie oben angegeben verfahren‘
(‚[Then] proceed as stated above‘)

At the beginning of the clause, the homograph is disambiguated as ‚finite verb‘ (= imperative), at the end as ‚infinitive‘.
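The interplay of sublanguage and clause position in this example can be sketched as a small decision function. It is a simplified illustration only: the function name `disambiguate`, the sublanguage labels and the ‚undecided‘ fallback are assumptions, and a real parser would work with weights rather than hard decisions.

```python
def disambiguate(tokens, word, sublanguage):
    """Pick a reading for a noun/verb homograph in a tokenized clause."""
    lowered = [t.lower() for t in tokens]
    position = lowered.index(word.lower())  # raises ValueError if absent
    if sublanguage == "title":
        return "noun"
    if sublanguage == "instruction":
        if position == 0:
            return "finite verb"   # clause-initial: imperative
        if position == len(tokens) - 1:
            return "infinitive"    # clause-final
    return "undecided"             # left to other criteria


print(disambiguate("Verfahren der Waldschadensinventur".split(), "verfahren", "title"))
print(disambiguate("Verfahren Sie wie oben angegeben".split(), "verfahren", "instruction"))
print(disambiguate("Dabei wie oben angegeben verfahren".split(), "verfahren", "instruction"))
```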


2. Nominal structures in titles

As there are no predicates, NPs and PPs are much more likely to be adjuncts than in running text, where they may also be analysed as complements. This may be used for optimizing parsing, e.g. by applying deterministic methods, cf.:

‚Der Beitrag der kommunalen Beschaffung zum Umweltschutz‘
(‚The contribution of communal purchasing to pollution control‘)

‚Erfahrung mit der Kompostierung in Berlin‘
(‚Experience with composting in Berlin‘)

In titles it is rather safe to let the PP-attachment rule operate deterministically if the governing noun has the appropriate valency (here: Erfahrung mit …, Beitrag zu …). It is safest at the beginning of the title. The argument, of course, is theoretically not very interesting, but criteria like this one help improve parsing speed.
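A deterministic attachment check of this kind needs little more than a valency lexicon for nouns. The following sketch is an assumption-laden illustration: `NOUN_VALENCY`, `CONTRACTIONS` and `attaches_deterministically` are hypothetical names, and the lexicon contains only the two nouns from the examples above.

```python
# Hypothetical valency lexicon: noun -> prepositions it governs.
NOUN_VALENCY = {"Beitrag": {"zu"}, "Erfahrung": {"mit"}}

# Contracted forms mapped to their base preposition.
CONTRACTIONS = {"zum": "zu", "zur": "zu"}


def attaches_deterministically(head_noun, preposition):
    """True if the PP can be attached to the head noun without search."""
    base = CONTRACTIONS.get(preposition.lower(), preposition.lower())
    return base in NOUN_VALENCY.get(head_noun, set())


print(attaches_deterministically("Beitrag", "zum"))    # True
print(attaches_deterministically("Erfahrung", "mit"))  # True
print(attaches_deterministically("Erfahrung", "in"))   # False
```

Where the check fails, as for ‚in Berlin‘, the PP is simply left to the ordinary, non-deterministic attachment rules.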


3. Semantics of prepositions in titles

Highly ambiguous prepositions like ‚zu‘, ‚über‘ etc. can be rather safely disambiguated on the basis of word order:

'Zur Optimierung von Waldschadenserhebungen' 
 
=> 'The optimization of wood damage surveys' 
 
'Zur Rückgewinnung von Wärme verpflichtet' 
 
=> 'Obliged to recover heat' 
 
'Technologien zur Verminderung von Abfällen' 
 
=> 'Technologies for the reduction of waste' 
 
'Über Arbeit und Umwelt' 
 
=> 'Labour and environment'

A ‚zu‘-phrase at the beginning of a title (the top node of the nominal structure) always denotes a TOPIC (1st example), otherwise (3rd example) a purpose. ‚Über‘ at the beginning also denotes a TOPIC. These rules only apply if the PP is not embedded in a predicate structure, as in the 2nd example, where it fills the zu-valency of ‚verpflichtet‘. So, if the parser produces a structure like the following:

        verpflichten 
            /     | 
SUBJECT:none  GOAL:rückgewinnen 
                  | 
              OBJECT: Wärme

where we have a predicate ‚verpflichten‘ with an unspecified SUBJECT and a GOAL which is another predicate structure (‚rückgewinnen‘ with the OBJECT ‚Wärme‘), only lexical transfer is needed =>

           oblige
            /     |
SUBJECT:none  GOAL:recover
                  |
              OBJECT: heat

to present a structure to generation that carries enough information to produce the English translation given above (‚Obliged to recover heat‘).

Similarly, examples 1. and 3. can be represented by the parser in a way which allows the generation of the correct target language equivalent, e.g.:

‚Zur Optimierung von Waldschadenserhebungen‘

TOPIC: Optimierung 
       | 
OBJECT: Waldschadenserhebung

transfer =>

TOPIC: optimization 
       | 
OBJECT: wood damage survey

generation =>

‚The optimization of wood damage surveys‘

The surface realization of the semantic roles TOPIC and OBJECT is a task for generation, i.e. transfer can be completely relieved of rules treating such semantic roles (see Luckhardt 1987 for a discussion of the role of transfer in MT).
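The word-order rules for ‚zu‘ and ‚über‘ stated at the beginning of this section can be summarized in a few lines. This is a sketch under assumptions: the function `pp_role`, its two boolean parameters and the role label PURPOSE are illustrative, and the parser is assumed to have already determined whether the PP fills a valency slot of a predicate.

```python
def pp_role(preposition, title_initial, fills_valency):
    """Assign a semantic role to a 'zu' or 'über' phrase in a title.

    title_initial: the PP is the top node of the nominal structure.
    fills_valency: the PP is embedded in a predicate structure and fills
                   a valency slot there (e.g. the zu-valency of 'verpflichtet').
    """
    prep = {"zur": "zu", "zum": "zu"}.get(preposition.lower(), preposition.lower())
    if fills_valency:
        return None  # role comes from the predicate's valency frame instead
    if prep == "zu":
        return "TOPIC" if title_initial else "PURPOSE"
    if prep == "über" and title_initial:
        return "TOPIC"
    return None      # no rule applies


print(pp_role("Zur", True, False))   # 'Zur Optimierung von ...'        -> TOPIC
print(pp_role("Zur", True, True))    # 'Zur Rückgewinnung ... verpflichtet' -> None
print(pp_role("zur", False, False))  # 'Technologien zur Verminderung ...' -> PURPOSE
print(pp_role("Über", True, False))  # 'Über Arbeit und Umwelt'          -> TOPIC
```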


4. Inhouse usage and subject field

The preference for specific terms on the part of users of machine translations has proved a very important criterion in the STS project. A very straightforward rule for selecting a target language (TL) equivalent is therefore the following: for a text marked ‚user X‘, prefer the translational equivalent also marked ‚user X‘ in the lexicon. Of course, this causes problems where the user prefers two or three readings (over perhaps 10 others) or where a specific context or the SUBJECT FIELD criterion precludes the use of the preferred reading.

At the moment, much work remains to be done by the posteditor, especially in cases where there are not enough criteria for automatically choosing the correct TL equivalent. It is very difficult to formalize for MT those cases where the INHOUSE USAGE criterion is not applicable. Nevertheless, the criterion has been very successful in the STS project, where the above-mentioned German-English terminology pool has been built up for 15 different users and a 3-level hierarchy of subject fields. The following configurations of readings can occur:

a. a single TL equivalent

e.g. Achsstummel => axle end

In such cases it is irrelevant which codes are given to an entry, as these are only used for differentiation, and here there is nothing to be differentiated. The codes may even be left at zero. This touches on a rather important point: is it, generally speaking, necessary to encode with every entry every user and every subject field for which a specific TL equivalent may be used? This could become an unmanageable task, for with every new user or subject field every entry would have to be checked to see whether it applies to this user / subject field. In STS these codes only come into play where a new reading is added to the pool (see below).

b. Different readings for different users

e.g. 
Achsschenkel => (user X) knuckle 
             => (user Y) axle spindle 
             => (user Z) steering knuckle 
 
Buntlack => (user X) multicolored lacquer 
         => (user Y) multicoloured varnish

Such cases are rather straightforward, as here only terminological variants are concerned without formalizable differences in meaning. Nevertheless, such entries are important to guarantee terminological consistency for every user.

c. Different readings for different subject fields

e.g. 
Dichtung => sealing 
         => (everyday usage) poetry 
Fuge => joint 
     => (music) fugue

If there is more than one reading, one of them is made the principal one (usually the one used most frequently). Deviant readings are marked with the code for the user / subject field to which they apply. In the STS German-English terminology pool every entry corresponds to a pair

source language expression + target language expression

with information about the translation project (user) for which the entry was created and a subject field code (see above). Further information may be input, e.g., for the translation of verbs (e.g. translation of valencies, cf. Luckhardt 1987). The following subject fields are well-covered:

construction
social sciences
environmental protection
energy (cross-sectional coverage)
patents (cross-sectional coverage)
technical rules (cross-sectional coverage)
mechanical engineering
industrial products

As we have seen above, project/user code and subject field code are only relevant if there is more than one target language (TL) equivalent. It is to a certain degree possible to compute a priority for the selection of a TL equivalent with the parameters ‚user code‘ and ‚subject field‘:

The equivalent that carries the same user code as the translated text is selected.
If there is more than one equivalent with the correct user code, that equivalent with the same subject code as the text is chosen. If there is more than one, the first one is selected.
As a rule, in cases of undecidability the first equivalent offered by the pool is selected. This guarantees a consistent use of the pool, as always the same equivalent is chosen, unless the user wants something changed.
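The three selection rules can be sketched as follows. This is a minimal illustration and not the STS implementation: the class `PoolEntry`, the function `select_equivalent` and the field names are assumptions. One case is not settled by the rules as stated, namely several entries with the right user code but none with the right subject field; the sketch then takes the first entry with the right user code, which is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PoolEntry:
    source: str
    target: str
    user: Optional[str] = None           # project/user code, None if unmarked
    subject_field: Optional[str] = None  # subject field code, None if unmarked


def select_equivalent(entries: List[PoolEntry], text_user: str, text_field: str) -> PoolEntry:
    """Select one TL equivalent; `entries` must be non-empty and in pool order."""
    by_user = [e for e in entries if e.user == text_user]
    if len(by_user) == 1:
        return by_user[0]                 # rule 1: unique user match
    if by_user:
        by_field = [e for e in by_user if e.subject_field == text_field]
        if by_field:
            return by_field[0]            # rule 2: user and subject field match
        return by_user[0]                 # assumption, see lead-in
    return entries[0]                     # rule 3: first equivalent in the pool


achsschenkel = [
    PoolEntry("Achsschenkel", "knuckle", user="X"),
    PoolEntry("Achsschenkel", "axle spindle", user="Y"),
    PoolEntry("Achsschenkel", "steering knuckle", user="Z"),
]

print(select_equivalent(achsschenkel, "Y", "mechanical engineering").target)  # axle spindle
print(select_equivalent(achsschenkel, "W", "mechanical engineering").target)  # knuckle
```

Because the function always falls back on pool order, repeated runs over the same pool yield the same equivalent, which is the consistency property mentioned above. A subject field fallback for texts whose user code matches no entry is not part of the stated rules and is therefore left out.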


There are some problems, however, that still have to be solved:

Texts cannot always be assigned an unambiguous subject field code, so that there may be conflicts. If, e.g., a text is given the codes ‚chemistry‘ and ‚biology‘ and there are two different equivalents marked ‚chemistry‘ and ‚biology‘ respectively, no decision can be made.
In many cases there is more than one equivalent per subject field or user; which one may be used depends on more detailed technical information, on style requirements or on other criteria like thesauri that are sketched in the following chapter.
Building up termbases as large as the STS term pool, where it is impossible to control the impact of the entry of every single SL/TL pair, leads to inconsistencies that spoil some of the effort.

In all, terminology in MT and HT may be viewed quite differently. There are a number of aspects of general terminology (and terminography) that are only of indirect relevance to MT, e.g. the way in which a (complex) word such as ‚Achsschenkelbolzen‘ becomes a ‚term‘ and how it is linked to an equivalent term in another language, e.g. ‚kingpin‘, as these are intellectual processes that have to take place before a term is incorporated into an MT system. For unambiguous pairs like

Achsschenkelbolzen => kingpin

it is quite irrelevant whether these are called ‚terms‘ at all. The interesting point about terminology in MT is how this notion can be employed for lexical disambiguation, e.g. by calling different TL readings of the same source language word in different subject fields different terms.


Conclusion

Sublanguage is a notion MT developers ought to turn their attention to

when their system has reached a stable and robust state offering the necessary tools and methods of language engineering like weighting mechanisms
when their system is about to be applied to large volumes of text with distinct sublanguage characteristics
if a terminological data base system has been established which makes it possible to cover the lexical and INHOUSE USAGE levels of sublanguages and which can be accessed by the MT system
if the necessary machine-readable terminology is at hand.

A sublanguage is not as easy to implement as it may appear at first glance from the texts of a specific corpus, however distinct that type of text may look. Very often the apparently formalizable criteria turn out to be useless for MT, although any human reader could easily formulate them. The METEO ideal of a sublanguage (French and English weather reports in Canada) surely cannot be reproduced easily.


Universität des Saarlandes - Fachrichtung Informationswissenschaft
