
A Model for Contextual Constraints and its Implementation
on the CAT2 Formalism

Chumphol Krootkaew, Hidetoshi Nagai
Teigo Nakamura and Hirosato Nomura

Dept. of Artificial Intelligence, Kyushu Institute of Technology, Iizuka, 820 JAPAN
email: chumphol@dumbo.ai.kyutech.ac.jp

Abstract

This paper proposes a model for Thai contextual constraints and describes its implementation as a grammar in the CAT2 formalism, which was originally designed for the EUROTRA machine translation system. The constraint devices are defined using cohesive relationships between sentences, which apply to pairs of initial and non-initial sentences [Vichin,91] in Thai. We model the structure of a text as a sequence of binding relationships among sentences, which are defined from the cohesive relationships. These binding relationships supply omitted information to the non-initial sentence so that its ambiguity can be resolved.

1 Introduction

Thai is said to be a paragraph-efficient language [Thomas,88]. Unlike clause-efficient languages such as the European languages, a single Thai sentence often appears inefficient: it gives insufficient information for reconstructing the action being communicated. This problem also arises when a machine translates a Thai text into another language, as shown in the example below:

Without the first sentence, the second one does not make sense; it conveys only "Someone plays something very well". Nothing in the sentence indicates its tense, either. Machine translation systems that take each sentence independently as the basic unit of translation may treat this sentence as ill-formed input and fail to translate it, despite the fact that the second sentence can retrieve contextual information from the first, as shown above by the boldface words.

To address this problem, we make use of contextual constraints between sentences and implement them in the CAT2 machine translation system, giving it the ability to use context when translating Thai text. As contextual constraints we consider some cohesive relationships [Halliday,76] between two sentences, applied to Thai sentences. The Thai sentences are then divided into pairs or groups of initial and non-initial sentences [Vichin,91]. These two types of sentences are combined and processed together in the CAT2 formalism. The non-initial sentences, which carry insufficient information, are supplied with the omitted information from the initial sentence at the time they are processed.

2 CAT2 formalism system

The CAT2 machine translation system is a natural language processing system designed especially for full-scale automatic translation [Sharp,88]. The CAT2 framework is a direct descendant of the <C,A>,T framework formerly used in the Eurotra MT project [Sharp,91]. It offers a formalism for describing levels of internal representation and a means of relating adjacent levels. That is, CAT2 is simply a formalism for describing grammars and translators.

fig.1

2.1 Structure

CAT2, like many linguistic theories and approaches, divides the linguistic world into levels of representation [Sharp,94] (see figure 1). The first level is devoted to word structure and is called morphological structure (MS). The next level is devoted to constituent structure (CS), i.e. the structure of the phrase and sentence. A last level of representation is postulated, which acts as the interface between two languages; we call this level interface structure (IS). Whereas the CS level represents the sentence's syntactic structure, the IS level is related more to the sentence's semantic structure.

2.2 Formalism

CAT2 uses an LFG-like grammatical framework. An object, which represents a word, phrase or sentence, takes the form of a tree. Each node in the tree contains a set of features called a feature bundle. The rules in CAT2 describe fragments of the tree. As shown in figure 1, each level has two rule types: b-rules for building the object structures and f-rules for controlling the feature content of the structures. Similarly, in the transformation step, CAT2 makes use of two rule types: t-rules for transforming structures and tf-rules for transforming features.

During the process of rule application, CAT2 uses the operation of unification, both for constructing trees and for combining features. Although unification is the sole operation, a more accurate term is constraint satisfaction.[Sharp,94] The important aspect of CAT2 constraints is that if the satisfiability of a constraint cannot be determined at the time when the constraint is encountered, its evaluation is postponed until sufficient information becomes available.
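The behaviour described above can be illustrated with a minimal Python sketch. It is not CAT2 code: the names `unify`, `Node`, `constrain` and `add` are hypothetical, feature bundles are simplified to flat dictionaries, and a constraint is simplified to "feature must take one of these values". The sketch only shows how a constraint that cannot yet be decided is postponed and re-checked once the feature becomes known.

```python
def unify(a, b):
    """Merge two flat feature bundles; return None if a shared feature conflicts."""
    result = dict(a)
    for key, value in b.items():
        if key in result and result[key] != value:
            return None
        result[key] = value
    return result


class Node:
    """A tree node holding a feature bundle and its postponed constraints (hypothetical)."""

    def __init__(self, features):
        self.features = dict(features)
        self.pending = []  # constraints whose feature is not known yet

    def constrain(self, name, allowed):
        """Require feature `name` to be in `allowed`; postpone if it is unknown."""
        if name not in self.features:
            self.pending.append((name, allowed))
            return True  # undecided for now, so not a failure
        return self.features[name] in allowed

    def add(self, features):
        """Unify new features into the node, then re-check postponed constraints."""
        merged = unify(self.features, features)
        if merged is None:
            return False
        still_pending = []
        for name, allowed in self.pending:
            if name in merged:
                if merged[name] not in allowed:
                    return False  # postponed constraint is now violated
            else:
                still_pending.append((name, allowed))
        self.features = merged
        self.pending = still_pending
        return True
```

In this sketch a failed `add` leaves the node unchanged, which mirrors the idea that a rule application either succeeds as a whole or has no effect.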

2.3 Extension of the formalism for contextual processing

As shown in figure 2 below, we define the structure of a text as a simple sequence of sentences with or without semantic relations between them. The type of relation, x, is found and marked at the CS level (see figure 3) by using b-rules and constraint satisfaction in f-rules. In the transformation step, the t-rules and tf-rules that transform the CS-level structure into the IS level also copy the essential information from one sentence to another, so that the structure of every sentence can be built at the IS level.

3 Cohesive relationships

The word text is used in linguistics to refer to any passage, of whatever length, that forms a unified whole [Halliday,76]. In this paper, however, we focus only on texts that contain more than one sentence, and especially on texts of two sentences. The relation among sentences is not a grammatical one but a semantic one, which has no syntax.

As described in section 2.3, we define the structure of a text as a simple sequence of sentences with or without semantic relationships. Such a relationship is in fact a link between two items, one in each of the two sentences. We call this link a cohesion. The concept of cohesion is semantic; it refers to relations among meanings that exist within the text. Cohesive relations are subdivided into different types. In this paper, we make use of three types of grammatical cohesion [Halliday,76]: reference, substitution and ellipsis.

3.1 Reference

Reference, as its name suggests, is a relation between two words (or phrases) in which one retrieves information from the other. There are three types of reference: personal, demonstrative and comparative, all of which appear in [2] below:


	[2] Dang gave up his job and went to Japan.  He hopes to find a better job there.

The words "he", "better (job)" and "there" refer back to words in the first sentence; they are personal, comparative and demonstrative references respectively. We use such key words as devices for determining that the second sentence is tied to the first by one of these three types of reference, and thus for tying two sentences together for the sake of semantic analysis.

3.2 Substitution

The distinction between substitution and reference is that substitution is a relation in the wording rather than in the meaning. Substitution is divided into three types, nominal, verbal and clausal, illustrated by the next three examples respectively:

	[3] My axe is too blunt. I must get a sharper one.

	[4] You think John already knows? - I think everybody does.

	[5] Has Barbara left? - I think so.

The "one" substitutes for the noun "axe", "does" for the verb "knows", and "so" for the clause "that Barbara has left". The substitute item has the same structural function as the item for which it substitutes. In the above examples, "one" and "axe" are both heads of a nominal group, "does" and "knows" are both heads of a verbal group, and, although it is less obvious, "so" has the same function as the clause it replaces. Unlike reference, substitution can be said to be a grammatical type of cohesion.

Again, we simply use these key words as devices for determining that a sentence is tied by substitution. Then, if necessary, the contextual information is copied from the item being substituted, which is the head of the corresponding structure in the preceding sentence.

3.3 Ellipsis

Compared with substitution, which is the replacement of one item by another, ellipsis can be interpreted as a form of substitution in which the item is replaced by nothing. Like substitution, ellipsis is divided into three types: nominal, verbal and clausal. The following is an example of verbal ellipsis, in which the verb "brought" is omitted.

     [6] Joan brought some carnations. Catherine Ø some sweet peas.

We can say that ellipsis occurs when something that is structurally necessary is left unsaid. We then define the sentence as being tied by an ellipsis relation, and copy the omitted item from the preceding sentence.

4 Implementation

4.1 Thai language

Vichin [Vichin,91] divides Thai sentences into initial and non-initial sentences. An initial sentence is one that can be used to begin a conversation, although it can also be used in the middle of one. A non-initial sentence, on the other hand, cannot begin a conversation, because it carries insufficient information or is ill-formed as a complete sentence. A non-initial sentence must come after another sentence.

	[7]	A: เบอเคยไปเชียงใหม่ไหม	Have you been to Chiangmai?	initial

	[8]	B: เคยไ	Yes, I have.	non-initial

	[9]	B: ฉันชอบไปดอยสุเทพจั	I like to visit Doi Suthep.	initial

	[10]	A: เธอไปถึงภูพิงค์ไหม	Have you been to Phoophing? 	non-initial

	[11]	B: ไปถึงแต่ไม่ได้เข้าไป	Yes, but I did not get in. 	non-initial

	[12]	B: วันหลังจะไปอีก	I'll go again.	non-initial

The general criteria for determining that a sentence is non-initial are omission and/or substitution of an item that was stated before or is understood from the situation, or other semantic relationships. In this paper, however, we are concerned only with the semantic relationships described in section 3 above.

In this paper, we define a sentence as non-initial if it exhibits any of the following constraints:


	Existence of Reference key words
		personal: 	 3rd person pronoun  ( เข มั...)
		demonstrative:	 place/time conjunction+distant demonstrative word
				(ตอนนั้ เวลานัั้ ท่ีนั่ ที่โน่...)
		comparative:	 adjective+comparative word,กว่า (มากกว่ ดีกว่า,...)

	Existence of Substitution key words
		nominal:	classifier+adj/demonstrative word (ตัวใหม เล่มนี อันนั้...)
			
	Ellipsis sentence structure patterns: [Vichin,91]
		1)  Vt		2)  subj  Vt
		3)  Vd		4)  subj  Vd
		5)  Vd  obj2		6)  subj  Vd  obj2
		7)  Vd  obj1		8)  subj  Vd  obj1
		* Vt = transitive verb, Vd = transitive verb which requires two complements

Although there are other constraints that indicate a non-initial sentence, we treat here any sentence that is not non-initial as an initial sentence. We can therefore bind the two together and transfer contextual information from the initial sentence to the non-initial one; examples are described in the next section.
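The ellipsis part of this definition can be sketched in Python as a simple lookup from a sequence of constituent roles to the pattern number in the list above. This is only an illustration under assumptions: the function names are hypothetical, the role labels are taken from the pattern list, the label "obj" in the usage note is invented for a complete sentence, and the detection of reference and substitution key words is assumed to have been done elsewhere and passed in as flags.

```python
# Ellipsis sentence structure patterns 1-8, keyed by the sequence of roles.
ELLIPSIS_PATTERNS = {
    ("Vt",): 1,
    ("subj", "Vt"): 2,
    ("Vd",): 3,
    ("subj", "Vd"): 4,
    ("Vd", "obj2"): 5,
    ("subj", "Vd", "obj2"): 6,
    ("Vd", "obj1"): 7,
    ("subj", "Vd", "obj1"): 8,
}


def ellipsis_pattern(roles):
    """Return the ellipsis pattern number for a role sequence, or None."""
    return ELLIPSIS_PATTERNS.get(tuple(roles))


def is_non_initial(roles, has_reference_key=False, has_substitution_key=False):
    """A sentence is non-initial if any one of the three constraints holds."""
    return (
        has_reference_key
        or has_substitution_key
        or ellipsis_pattern(roles) is not None
    )
```

For instance, `ellipsis_pattern(["subj", "Vt"])` yields 2, while a complete sequence such as `["subj", "Vt", "obj"]` matches no pattern and is treated as initial unless a key word is present.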

4.2 CAT2 formalism

To implement the above contextual constraints, we have to develop at least a Thai grammar for the constituent structure level (CS), the interface structure level (IS) and the transformation between them. We then extend these by adding the text structure and the contextual devices. The Thai grammar at the CS level is based on Vichin's grammatical system [Vichin,91]. The development is not yet complete, but some simple examples are shown below.

4.2.1 The existence of the Reference/Substitution key words

In this example, we show how to find the sentence objects that contain a personal reference key word, i.e. a third person pronoun. This part of the grammar uses an f-rule to find the sentence objects that contain key words and then adds a feature marking them as non-initial sentences. In general, an f-rule is used to check and/or add features in the objects that match the f-rule structure. The f-rule is shown below:

	@level(cs/syntactic/thai).
	@rule(f).
	personal_ref = {cat=sentence, personal_ref=yes}
	               .[ *, {cat=np}
	                     .[ *, {cat=pronoun, subcat=third}, * ], * ].
	default_rule = {cat=sentence, personal_ref=no}.[*].

Objects that have {cat=sentence} as the root node, {cat=np} as a first-generation daughter node and {cat=pronoun,subcat=third} as a second-generation daughter node unify with the rule "personal_ref". The feature {personal_ref=yes} is then added to the root node.

Sentence objects that cannot unify with this pattern unify with "default_rule" instead, and the default value is added to those objects.
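The effect of these two rules can be imitated in a few lines of Python. The sketch below is an approximation rather than the CAT2 mechanism itself: trees are plain dictionaries, the helper names `node`, `matches` and `apply_personal_ref` are hypothetical, and true unification is replaced by a simple equality check on the required features.

```python
def node(features, children=()):
    """Build a tree node with a feature bundle and a list of daughters."""
    return {"features": dict(features), "children": list(children)}


def matches(n, required):
    """True if node `n` carries every feature/value pair in `required`."""
    return all(n["features"].get(k) == v for k, v in required.items())


def apply_personal_ref(sentence):
    """Mark a sentence node with personal_ref=yes/no (modifies it in place).

    'yes' corresponds to the personal_ref rule: some np daughter has a
    third person pronoun daughter. 'no' corresponds to default_rule.
    """
    if not matches(sentence, {"cat": "sentence"}):
        return sentence  # the rules only apply to sentence objects
    found = any(
        matches(np, {"cat": "np"})
        and any(
            matches(p, {"cat": "pronoun", "subcat": "third"})
            for p in np["children"]
        )
        for np in sentence["children"]
    )
    sentence["features"]["personal_ref"] = "yes" if found else "no"
    return sentence
```

A sentence whose noun phrase contains a third person pronoun is thus marked {personal_ref=yes}, and every other sentence receives the default value.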

4.2.2 Ellipsis sentence structure patterns

Ellipsis sentence objects are marked at the time they are constructed by a b-rule. A b-rule combines components in a sequence of objects and constructs a tree structure if those components satisfy the structure described in the rule. If the pattern of the components is the same as an ellipsis structure pattern, the root node of the tree structure is marked with that ellipsis pattern.

The following example shows the b-rule for ellipsis sentence pattern 2 of section 4.1, which is "subj Vt".

	@level(cs/syntactic/thai).
	@rule(b).
	sentence_core = {cat=sentence_core, ellipsis=2}
	                .[ {cat=np, role=subj}, {cat=vp, role=vt} ].

	sentence      = {cat=sentence, ellipsis=X}
	                .[ *{...}, {cat=sentence_core, ellipsis=X}, *{...} ].
	@rule(f).
	default_rule  = {cat=sentence_core, ellipsis=no}.[*].

As shown in figure 5, sentence_core is first built with the feature {ellipsis=2} in its root node. Then, while the sentence structure is being built, the feature {ellipsis=2} is copied to the root node of the sentence.

4.2.3 Sentences combining

We define two new categories: the cohesion (cohe), which consists of an initial sentence (I) with or without a non-initial sentence (N), and the text, which is a sequence of cohesions, as shown in figure 6. We do not, however, consider the case in which a cohesion contains more than one non-initial sentence, as shown in figures 7 and 8.

By using the contextual constraints mentioned above, we can define the subcategory of a sentence to be initial or non-initial. We can then construct the cohesion and text structures with the part of the grammar below:

	@level(cs/syntactic/thai).
	@rule(b).
	cohesion = {cat=cohesion}
	           .[ {cat=sentence,subcat=initial}, ^{cat=sentence,subcat=non_initial} ].
	text     = {cat=text}
	           .[ +{cat=cohesion} ].
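The grouping performed by these two b-rules can be sketched in Python as follows. The sketch assumes that `^` marks an optional element and `+` one or more repetitions, which is our reading of the rules rather than something stated here; the function name `build_text` is hypothetical, and the input is simply the list of sentence subcategories in text order.

```python
def build_text(subcats):
    """Group sentences into cohesions; return a list of index tuples.

    Each cohesion is an initial sentence optionally followed by exactly
    one non-initial sentence. A text is one or more cohesions.
    """
    cohesions = []
    i = 0
    while i < len(subcats):
        if subcats[i] != "initial":
            # A cohesion must start with an initial sentence. A second
            # non-initial sentence in a row is not covered by the rule.
            raise ValueError(f"sentence {i} cannot start a cohesion: {subcats[i]!r}")
        if i + 1 < len(subcats) and subcats[i + 1] == "non_initial":
            cohesions.append((i, i + 1))
            i += 2
        else:
            cohesions.append((i,))
            i += 1
    if not cohesions:
        raise ValueError("a text needs at least one cohesion")
    return cohesions
```

A sequence with two consecutive non-initial sentences is rejected, in line with the restriction to at most one non-initial sentence per cohesion.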

4.2.4 Information transforming

This step supplies the omitted information to the non-initial sentence by duplicating it from the initial sentence. This part of the grammar uses a t-rule, the rule type for transforming an object structure from one level to another, to duplicate the omitted node from the initial sentence into the non-initial sentence. A t-rule has the form below:

	RULENAME  =  ROOT .[ BODY ]  =>  ROOT .[ BODY ].

For ellipsis pattern 2, the rule is as follows:

	@level(cs=>is).
	ellipsis_pattern_2 = {cat=cohesion}
	    .[ {cat=sentence}
	           .[ *, {cat=sentence_core}
	                     .[ 1:{cat=np}, 2:{cat=vp}, 3:{cat=np} ], * ],
	       {cat=sentence}
	           .[ *, {cat=sentence_core}
	                     .[ 4:{cat=np}, 5:{cat=vp} ], * ] ]
	    => {}.[ 1, 2, 3, 4, 5, 3 ].
 

     [13]        เมื่อวาน         แด    ไป      ตลาด                  วันนี้     ดำ     ไป
               (yesterday   Dang    go    market             today   Dam   go)

fig.9

The node labeled 3 (the object node) is copied to the end of the sequence of nodes on the right-hand side of the t-rule. This sequence is then used as the set of objects from which the structures at the IS level are built; the IS level is to be written in the framework of case grammar.
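The node copying done by ellipsis_pattern_2 can be imitated with a short Python sketch applied to the English glosses of example [13]. It is an illustration only: the function name and the dictionary representation of nodes are hypothetical, and, as in the right-hand side of the rule, elements matched by `*` (here "yesterday" and "today") are not carried over.

```python
def ellipsis_pattern_2(initial_core, non_initial_core):
    """Return the node sequence [1, 2, 3, 4, 5, 3], or None if no match.

    `initial_core` must be np vp np and `non_initial_core` np vp, as in
    the left-hand side of the t-rule.
    """
    if [n["cat"] for n in initial_core] != ["np", "vp", "np"]:
        return None
    if [n["cat"] for n in non_initial_core] != ["np", "vp"]:
        return None
    n1, n2, n3 = initial_core
    n4, n5 = non_initial_core
    # Node 3 (the object) is duplicated at the end, so the non-initial
    # sentence receives its own copy of the omitted object.
    return [n1, n2, n3, n4, n5, dict(n3)]


# Sentence cores of example [13], using the English glosses as stand-ins.
first = [
    {"cat": "np", "gloss": "Dang"},
    {"cat": "vp", "gloss": "go"},
    {"cat": "np", "gloss": "market"},
]
second = [
    {"cat": "np", "gloss": "Dam"},
    {"cat": "vp", "gloss": "go"},
]
result = ellipsis_pattern_2(first, second)
```

The resulting sequence reads Dang, go, market, Dam, go, market, so that the second sentence now has an object from which its IS-level structure can be built.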

5 Conclusion and Remarks

This paper has proposed a method for processing contextual information in a text by applying some cohesive relationships between sentences. These cohesive relationships are analyzed as contextual constraints that bind sentences together. Examples for Thai were demonstrated on pairs of initial and non-initial sentences. The paper also investigated the implementation of such contextual constraints for Thai in the CAT2 formalism, which is a grammatical framework for the Eurotra machine translation system. Many other semantic phenomena are clearly needed for more detailed contextual processing and remain to be elaborated; this paper suggests one approach to the problem.

References

[Halliday,76]	Halliday, M. A. K. "Cohesion in English". 1976.
[Sharp,88]	Sharp, Randall. "CAT2 - Implementing a Formalism for Multi-Lingual MT". Proc. of the 2nd International Conference on Theoretical & Methodological Issues in Machine Translation of Natural Language, Pittsburgh, PA, 1988.
[Sharp,91]	Sharp, Randall. "CAT2: An Experimental Eurotra Alternative". Machine Translation 6, 1991.
[Sharp,94]	Sharp, Randall. "CAT2 Reference Manual Version 3.5". (Unfinished Draft) May 1994.
[Thomas,88]	Thomas, David. "Clause-efficient vs. paragraph-efficient language". The International Symposium on Language and Linguistics, 9-11 August 1988, Thailand.
[Vichin,91]	Vichin Panupong. "The structure of Thai: Grammatical System". (โครงสร้างของภาษาไท ระบบไวยกรณ) 1991.
