This paper proposes a model of Thai contextual constraints and describes its implementation as a grammar in the CAT2 formalism, which was originally designed for the EUROTRA machine translation system. The constraint devices are defined using cohesive relationships between sentences, which apply to pairs of initial and non-initial sentences [Vichin,91] in Thai. We model the structure of a text as a sequence of binding relationships among sentences, derived from these cohesive relationships. The binding relationships supply omitted information to the non-initial sentence so that its ambiguity can be resolved.
Thai is said to be a paragraph-efficient language [Thomas,88]. Unlike clause-efficient languages such as the European languages, a single Thai sentence often appears inefficient: it gives insufficient information for reconstructing the action being communicated. This problem also arises when a machine translates a Thai text into another language, as shown in the example below:
Without the first sentence, the second sentence does not make sense; it yields only "Someone plays something very well". Nothing in the sentence indicates tense, either. Machine translation systems that take each sentence independently as the basic unit of translation may treat this sentence as ill-formed input and fail to translate it, despite the fact that the second sentence could retrieve contextual information from the first, as shown above by the boldface words.
To address this problem, we use contextual constraints between sentences and implement them in the machine translation system CAT2, so that context can be taken into account when translating Thai text. As contextual constraints we consider some of the cohesive relationships [Halliday,76] between two sentences, applied to Thai sentences. The Thai sentences are then divided into pairs or groups of initial and non-initial sentences [Vichin,91]. These two types of sentences are combined and processed together in the CAT2 formalism. The non-initial sentences, which carry insufficient information, are supplied with the omitted information from the initial sentence at the time they are processed.
The CAT2 machine translation system is a natural language processing system designed especially for full-scale automatic translation [Sharp,88]. The CAT2 framework is a direct descendant of the
CAT2, like many linguistic theories and approaches, divides the linguistic world into levels of representation [Sharp,94] (see figure 1). The first level is devoted to word structure and is called morphological structure (MS). The next level is devoted to constituent structure (CS), i.e. the structure of the phrase and sentence. A last level of representation is postulated that acts as the interface between two languages; we call this level interface structure (IS). Whereas the CS level represents the sentence's syntactic structure, the IS level is related more to the sentence's semantic structure.
CAT2 uses an LFG-like grammatical framework. An object, which represents a word, phrase or sentence, takes the form of a tree. Each node in the tree contains a set of features called a feature bundle. The rules in CAT2 describe fragments of the tree. As shown in figure 1, each level has two rule types: b-rules for building the object structures and f-rules for controlling the feature content of the structures. Similarly, the transformation step uses two rule types: t-rules for transforming structures and tf-rules for transforming features.
During rule application, CAT2 uses the operation of unification, both for constructing trees and for combining features. Although unification is the sole operation, a more accurate term is constraint satisfaction [Sharp,94]. An important aspect of CAT2 constraints is that if the satisfiability of a constraint cannot be determined at the time the constraint is encountered, its evaluation is postponed until sufficient information becomes available.
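The idea of unification with postponed constraints can be illustrated with a minimal Python sketch. It is not CAT2 code and does not reproduce CAT2's actual algorithm: the flat dictionaries standing in for feature bundles and the names `unify`, `Constraint` and `apply_constraints` are assumptions made only for illustration.

```python
from typing import Callable, Optional

# Assumption: a feature bundle is modelled as a flat dict of feature -> value.
FeatureBundle = dict[str, str]


def unify(a: FeatureBundle, b: FeatureBundle) -> Optional[FeatureBundle]:
    """Merge two feature bundles; return None if any feature clashes."""
    merged = dict(a)
    for name, value in b.items():
        if name in merged and merged[name] != value:
            return None
        merged[name] = value
    return merged


class Constraint:
    """A check over named features that can run only once all are known."""

    def __init__(self, needs: list[str], test: Callable[..., bool]):
        self.needs = needs
        self.test = test

    def evaluate(self, bundle: FeatureBundle) -> Optional[bool]:
        if not all(name in bundle for name in self.needs):
            return None  # not enough information yet: postpone
        return self.test(*(bundle[name] for name in self.needs))


def apply_constraints(bundle: FeatureBundle, pending: list[Constraint]):
    """Try every pending constraint; return (ok, constraints still postponed)."""
    still_pending = []
    for constraint in pending:
        result = constraint.evaluate(bundle)
        if result is None:
            still_pending.append(constraint)
        elif result is False:
            return False, []
    return True, still_pending
```

A constraint that mentions a feature not yet present in the bundle is simply carried forward; it is evaluated on a later call, once unification has added the missing feature.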
As shown in figure 2 below, we define the structure of a text as a simple sequence of sentences with or without semantic relations between them. The type of relation, x, is found and marked at the CS level (see figure 3) using b-rules and constraint satisfaction in f-rules. In the transformation step, the t-rules and tf-rules that transform the structure from the CS level to the IS level also copy the essential information from one sentence to the others, so that the structures of all sentences can be built at the IS level.
The word text is used in linguistics to refer to any passage, of whatever length, that forms a unified whole [Halliday,76]. In this paper, however, we focus only on texts that contain more than one sentence, and especially on texts of two sentences. The relation among sentences is not a grammatical one but a semantic relation, which has no syntax.
As we have seen in section 2.3, however, we define the structure of a text as a simple sequence of sentences with or without a semantic relationship. Such a relationship is in fact a link between two items, one in each of the two sentences. We call that link a cohesion. The concept of cohesion is semantic; it refers to relations among meanings that exist within the text. Cohesive relations are subdivided into different types. In this paper, we use some of the types of grammatical cohesion [Halliday,76], namely reference, substitution and ellipsis.
Reference, as its name suggests, is a relation between two words (or phrases) in which one retrieves information from the other. There are three types of reference: personal, demonstrative, and comparative, as shown in [2] below:
The words "he", "better (job)" and "there" refer back to words in the first sentence; they are personal, comparative and demonstrative references respectively. We use these key words as devices for detecting that the second sentence is tied to the first by one of these three types of reference, and so tie the two sentences together for the sake of semantic analysis.
The distinction between substitution and reference is that substitution is a relation in the wording rather than in the meaning. Substitution is divided into three types, nominal, verbal and clausal substitution, illustrated by the next examples respectively:
The word "one" substitutes for the noun "axe", "does" for the verb "knows", and "so" for the clause "that Barbara has left". The substitute item has the same structural function as the item for which it substitutes. In the above examples, "one" and "axe" are both heads of a nominal group, "does" and "knows" are both heads of a verbal group, and, less obviously, "so" has the same function as the clause it replaces. Unlike reference, substitution can be said to be a grammatical type of cohesion.
Again, we simply use these key words as a device for detecting that a sentence is tied by substitution. Then, if necessary, the contextual information is copied from the item being substituted, which is the head of the corresponding structure in the preceding sentence.
Compared with substitution, which is the replacement of one item by another, ellipsis can be interpreted as a form of substitution in which the item is replaced by nothing. Like substitution, ellipsis is divided into three types: nominal, verbal and clausal ellipsis. The following is an example of verbal ellipsis, in which the verb "brought" is omitted.
We can say that ellipsis occurs when something that is structurally necessary is left unsaid. We then define the sentence as being tied by an ellipsis relation and copy the omitted item from the preceding sentence.
Vichin [Vichin,91] divides Thai sentences into initial and non-initial sentences. An initial sentence is one that can be used to begin a conversation, although it can also occur in the middle of one. A non-initial sentence, on the other hand, cannot begin a conversation, because it carries insufficient information or is ill-formed as a complete sentence. A non-initial sentence must follow another sentence.
The general criteria for deciding that a sentence is non-initial are omission and/or substitution of an item that has been stated before or is understood from the situation, or other semantic relationships. In this paper, however, we are concerned only with the semantic relationships stated in section 3.1 above.
In this paper, we define a sentence as non-initial if it satisfies one of the following constraints:
Although there are other constraints that indicate a non-initial sentence, we define an initial sentence here as any sentence that is not non-initial. We can then bind the two together and transfer contextual information from the initial sentence to the non-initial one; examples are described in the next section.
To implement the above contextual constraints, we have to develop at least a Thai grammar for the constituent structure (CS) level, the interface structure (IS) level, and the transformation between them. We then extend these by adding text structure and contextual devices. The Thai grammar at the CS level is based on Vichin's grammatical system [Vichin,91]. The development is not yet complete; however, we can show some simple examples below:
In this example, we show how to find the sentence objects that contain a personal reference key word, i.e. a third person pronoun. This part of the grammar uses an f-rule to find the sentence objects that contain key words and then adds a feature marking them as non-initial sentences. In general, an f-rule is used to check and/or add features in the objects that match the f-rule structure. The f-rule pattern is as follows:
Objects that have {cat=sentence} as the root node, {cat=np} as a first-generation daughter node and {cat=pronoun,subcat=third} as a second-generation daughter node unify with the rule "personal_ref". The feature {personal_ref=yes} is then added to the root node.
The other sentence objects, which cannot unify with this pattern, unify with "default_rule". The default value is then added to those objects.
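The marking performed by "personal_ref" and "default_rule" can be mimicked outside CAT2 with a small Python sketch. The `Node` class and the helper names are hypothetical, and ordinary tree traversal stands in for CAT2 unification, so this only approximates the behaviour described above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node: a feature bundle plus daughter nodes."""
    features: dict
    children: list = field(default_factory=list)


def has(node: Node, **wanted) -> bool:
    """True if the node carries every requested feature value."""
    return all(node.features.get(k) == v for k, v in wanted.items())


def mark_personal_ref(sentence: Node) -> Node:
    """Add personal_ref=yes/no to a sentence root, as the two f-rules do."""
    if not has(sentence, cat="sentence"):
        return sentence  # the rules only apply to sentence objects
    found = any(
        has(np, cat="np")
        and any(has(word, cat="pronoun", subcat="third") for word in np.children)
        for np in sentence.children
    )
    # "personal_ref" rule when the pattern matches, "default_rule" otherwise
    sentence.features["personal_ref"] = "yes" if found else "no"
    return sentence
```

A sentence whose noun phrase contains a third person pronoun receives `personal_ref="yes"`; every other sentence receives the default `"no"`.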
Elliptical sentence objects are marked at the time they are constructed by a b-rule. A b-rule combines components in a sequence of objects and constructs a tree structure if those components satisfy the structure described in the rule. If the pattern of the components is the same as an ellipsis structure pattern, the root node of the tree is marked with that ellipsis pattern.
The following example gives the b-rule for ellipsis sentence pattern 2 in section 4.1, which is "subj Vt".
As shown in figure 5, sentence_core is first built with the feature {ellipsis=2} in its root node. Next, while the sentence structure is being built, the feature {ellipsis=2} is copied to the root node of the sentence.
We define two new categories: the cohesion (cohe), which consists of an initial sentence (I) with or without a non-initial sentence (N), and the text, which is a sequence of cohesions, as shown in figure 6. However, we are not concerned here with cases in which a cohesion contains more than one non-initial sentence, as shown in figure 7 and figure 8.
Using the contextual constraints mentioned above, we can set the subcategory of a sentence to initial or non-initial. We can then construct the cohesion and text structures with the following part of the grammar:
@level(cs/syntactic/thai).
@rule(b).
cohesion = {cat=cohesion}
.[ {cat=sentence,subcat=initial}, ^{cat=sentence,subcat=non_initial} ].
text = {cat=text}
.[ +{cat=cohesion} ].
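The effect of the two b-rules above can be sketched in Python as a grouping pass over a sequence of sentence subcategories. The sketch assumes that `^` marks an optional constituent and `+` marks one or more repetitions, which matches the prose description but is not confirmed here against the CAT2 manual; the function name is hypothetical.

```python
def group_into_cohesions(subcats: list[str]) -> list[list[str]]:
    """Group sentence subcategories into cohesions: initial (+ optional non_initial)."""
    cohesions: list[list[str]] = []
    for subcat in subcats:
        if subcat == "initial":
            cohesions.append([subcat])  # every initial sentence opens a cohesion
        elif subcat == "non_initial" and cohesions and len(cohesions[-1]) == 1:
            cohesions[-1].append(subcat)  # fills the single optional slot
        else:
            raise ValueError(f"sequence not covered by the cohesion rule: {subcat}")
    if not cohesions:
        raise ValueError("a text needs at least one cohesion")
    return cohesions
```

Under these assumptions a non-initial sentence with no preceding initial sentence, or a second consecutive non-initial sentence, is rejected, mirroring the restriction of the rule to at most one non-initial sentence per cohesion.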
This step supplies the omitted information to the non-initial sentence by duplicating it from the initial sentence. This part of the grammar uses a t-rule, a rule that transforms an object structure from one level to another, to duplicate the omitted node from the initial sentence into the non-initial sentence. A t-rule has the form below:
The node labeled 3 (the object node) is copied to the end of the sequence of nodes on the right-hand side of the t-rule. This sequence supplies the objects used to build the structures at the IS level, which is to be written in the framework of case grammar.
This paper proposed a method for processing contextual information in a text by applying cohesive relationships between sentences. These cohesive relationships are analyzed as contextual constraints that bind sentences together. Examples for Thai are demonstrated on pairs of initial and non-initial sentences. The paper also investigated the implementation of such contextual constraints for Thai in the CAT2 formalism, the grammatical framework for the Eurotra machine translation system. Many other semantic phenomena are clearly needed for more detailed contextual processing and remain to be elaborated; this paper suggests one approach to the problem.
fig.1
2.1 Structure
2.2 Formalism
2.3 Extension of the formalism for contextual processing
3 Cohesive relationships
3.1 Reference
[2] Dang gave up his job and went to Japan. He hopes to find a better job there.
3.2 Substitution
[3] My axe is too blunt. I must get a sharper one.
[4] You think John already knows ? - I think everybody does.
[5] Has Barbara left? - I think so.
3.3 Ellipsis
[6] Joan brought some carnations. Catherine some sweet peas.
4 Implementation
4.1 Thai language
[7] A: เบอเคยไปเชียงใหม่ไหม Have you been to Chiangmai? initial
[8] B: เคยไ Yes, I have. non-initial
[9] B: ฉันชอบไปดอยสุเทพจั I like to visit Doi Suthep. initial
[10] A: เธอไปถึงภูพิงค์ไหม Have you been to Phoophing? non-initial
[11] B: ไปถึงแต่ไม่ได้เข้าไป Yes, but I did not get in. non-initial
[12] B: วันหลังจะไปอีก I'll go again. non-initial
Existence of Reference key words
personal: 3rd person pronoun ( เข มั...)
demonstrative: place/time conjunction+distant demonstrative word
(ตอนนั้ เวลานัั้ ท่ีนั่ ที่โน่...)
comparative: adjective+comparative word,กว่า (มากกว่ ดีกว่า,...)
Existence of Substitution key words
nominal: classifier+adj/demonstrative word (ตัวใหม เล่มนี อันนั้...)
Ellipsis sentence structure patterns: [Vichin,91]
1) Vt 2) subj Vt
3) Vd 4) subj Vd
5) Vd obj2 6) subj Vd obj2
7) Vd obj1 8) subj Vd obj1
* Vt = transitive verb, Vd = transitive verb which requires two complements
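The constraints listed above can be combined into a simple decision procedure, sketched here in Python. The pattern table follows the eight ellipsis patterns exactly; the input representation (a list of constituent labels plus two boolean flags for key words) and the function names are assumptions made for illustration, since key-word detection itself depends on the Thai lexicon.

```python
# Ellipsis sentence structure patterns 1-8, keyed by constituent sequence.
ELLIPSIS_PATTERNS = {
    ("Vt",): 1,
    ("subj", "Vt"): 2,
    ("Vd",): 3,
    ("subj", "Vd"): 4,
    ("Vd", "obj2"): 5,
    ("subj", "Vd", "obj2"): 6,
    ("Vd", "obj1"): 7,
    ("subj", "Vd", "obj1"): 8,
}


def ellipsis_pattern(constituents):
    """Return the pattern number matched by the constituents, or None."""
    return ELLIPSIS_PATTERNS.get(tuple(constituents))


def classify(constituents, has_reference_key=False, has_substitution_key=False):
    """Label a sentence 'non_initial' if any constraint holds, else 'initial'."""
    if has_reference_key or has_substitution_key:
        return "non_initial"
    if ellipsis_pattern(constituents) is not None:
        return "non_initial"
    return "initial"
```

For instance, a sentence consisting only of a subject and a transitive verb matches pattern 2 and is classified as non-initial, whereas the same sentence with its object present is classified as initial unless it contains a key word.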
4.2 CAT2 formalism
4.2.1 The existence of the Reference/Substitution key words
@level(cs/syntactic/thai).
@rule(f).
personal_ref = {cat=sentence,personal_ref=yes}
.[ *,{cat=np}
.[ *,{cat=pronoun,subcat=third}, *], *].
default_rule = {cat=sentence, personal_ref=no}.[*].
4.2.2 Ellipsis sentence structure patterns
@level(cs/syntactic/thai).
@rule(b).
sentence_core = {cat=sentence_core, ellipsis=2}
.[ {cat=np, role=subj}, {cat=vp, role=vt} ].
sentence = {cat=sentence,ellipsis=X}
.[ *{...},{cat=sentence_core, ellipsis=X}, *{...} ].
@rule(f).
default_rule = {cat=sentence_core, ellipsis=no}.[*].
4.2.3 Sentences combining
4.2.4 Information transforming
RULENAME = ROOT .[ BODY ] => ROOT .[ BODY ].
@level(cs=>is).
ellipsis_pattern_2 = {cat = cohesion}
.[ {cat=sentence}
.[ *, {cat=sentence_core}
.[ 1:{cat=np}, 2:{cat=vp}, 3:{cat=np}],*],
{cat=sentence}
.[ *, {cat=sentence_core}
.[ 4:{cat=np}, 5:{cat=vp}],*]]
=> {}.[ 1, 2, 3, 4, 5, 3 ].
[13] เมื่อวาน แด ไป ตลาด วันนี้ ดำ ไป
(yesterday Dang go market today Dam go)
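The node copying done by "ellipsis_pattern_2" can be traced on example [13] with a short Python sketch. Nodes are represented as plain dictionaries and the function name is hypothetical; the sketch only reproduces the reordering `1, 2, 3, 4, 5, 3` of the sentence_core daughters and ignores the material matched by `*` (here "yesterday" and "today").

```python
def apply_ellipsis_pattern_2(initial_core, non_initial_core):
    """Return the node sequence 1, 2, 3, 4, 5, 3, or None if the pattern fails."""
    if len(initial_core) != 3 or len(non_initial_core) != 2:
        return None
    cats = [node["cat"] for node in initial_core + non_initial_core]
    if cats != ["np", "vp", "np", "np", "vp"]:
        return None
    subj1, verb1, obj1 = initial_core
    subj2, verb2 = non_initial_core
    # The object of the initial sentence is appended again for the non-initial one.
    return [subj1, verb1, obj1, subj2, verb2, obj1]


# English glosses of example [13]
initial = [
    {"cat": "np", "word": "Dang"},
    {"cat": "vp", "word": "go"},
    {"cat": "np", "word": "market"},
]
non_initial = [
    {"cat": "np", "word": "Dam"},
    {"cat": "vp", "word": "go"},
]
result = apply_ellipsis_pattern_2(initial, non_initial)
```

In `result`, the object node "market" appears a second time after "Dam go", so the second sentence can be built at the IS level with its omitted object restored.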
fig.9
5 Conclusion and Remarks
References
[Halliday,76] Halliday M. A. K. "Cohesion in English". 1976.
[Sharp,88] Sharp Randall. "CAT2 - Implementing a Formalism for Multi-Lingual MT". Proc. of the
2nd International Conference on Theoretical & Methodological Issues in Machine Translation
of Natural Language, Pittsburgh, PA.
[Sharp,91] Sharp Randall. "CAT2: An Experimental Eurotra Alternative". Machine Translation 6,1991.
[Sharp,94] Sharp Randall. "CAT2 Reference Manual Version 3.5". (Unfinished Draft) May,1994.
[Thomas,88] Thomas David. "Clause-efficient VS. paragraph-efficient language". The international
symposium on language and linguistics,9-11 August 1998, Thailand.
[Vichin,91] Vichin Panupong. "The structure of Thai: Grammatical System". (โครงสร้างของภาษาไท
ระบบไวยกรณ) 1991.