Relaxation Labelling constraint grammar file

The syntax of the file is based on that of Constraint Grammars [KVHA95], but simplified in many respects and modified to include weighted constraints.

An initial file based on statistical constraints may be generated from a tagged corpus using the src/utilities/train-relax.perl script provided with FreeLing. Hand-written constraints can later be added to the file to improve the tagger's behaviour.

The file consists of two sections: SETS and CONSTRAINTS.

The SETS section consists of a list of set definitions, each of the form Set-name = element1 element2 ... elementN ;

Where Set-name is any alphanumeric string starting with a capital letter, and the elements are either forms, lemmas, plain PoS tags, or senses. Forms are enclosed in parentheses, e.g. (comimos); lemmas in angle brackets, e.g. <comer>; PoS tags are alphanumeric strings starting with a capital letter, e.g. NCMS000; and senses are enclosed in square brackets, e.g. [00794578]. Sets must be homogeneous: that is, all the elements of a set have to be of the same kind.

Examples of set definitions:

   DetMasc = DA0MS0 DA0MP0 DD0MS0 DD0MP0 DI0MS0 DI0MP0 DP1MSP DP1MPP
             DP2MSP DP2MPP DT0MS0 DT0MP0 DE0MS0 DE0MP0 AQ0MS0 AQ0MP0;
   VerbPron = <dar_cuenta> <atrever> <arrepentir> <equivocar> <inmutar>
              <morir> <ir> <manifestar> <precipitar> <referir> <reír> <venir>;
   Animal = [00008019] [00862484] [00862617] [00862750] [00862871] [00863425]
            [00863992] [00864099] [00864394] [00865075] [00865379] [00865569]
            [00865638] [00867302] [00867448] [00867773] [00867864] [00868028]
            [00868297] [00868486] [00868585] [00868729] [00911889] [00985200]
            [00990770] [01420347] [01586897] [01661105] [01661246] [01664986] 
            [01813568] [01883430] [01947400] [07400072] [07501137];

The CONSTRAINTS section consists of a series of context constraints, each of the form: weight core context;

Where:

Note that using sense information in the rules of the constraint grammar (either in the core or in the context) is only useful when this information distinguishes one analysis from another. If sense tagging has been performed with the option DuplicateAnalysis=no, each PoS tag will have a single analysis holding the list of all its senses, so the sense information will not distinguish one analysis from another (there will be only one analysis with that sense, and it will have all the other senses as well). If the option DuplicateAnalysis is active, the sense tagger duplicates the analysis, creating a new entry for each sense. So, when a rule selects an analysis having a certain sense, it is deselecting the other copies of the same analysis with different senses.

Examples:
The next constraint states a high incompatibility for a word being a definite determiner (DA*) if the next word is a personal form of a verb (VMI*):
-8.143 DA* (1 VMI*);
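
A constraint line like the one above can be split into its weight, core and context conditions. The following Python sketch is an illustration only and does not reproduce FreeLing's own parser: the function name split_constraint is hypothetical, and the sketch assumes that a parenthesised group is a context condition when it starts with an optional "not" followed by a signed position, and belongs to the core otherwise.

```python
import re

# Assumed shape of a context condition: optional "not", a signed
# position with an optional "*", then at least one term.
COND_RE = re.compile(r"^(not\s+)?[+-]?\d+\*?\s+\S")

def split_constraint(line):
    """Split 'weight core context;' into (weight, core, conditions)."""
    text = line.strip()
    if not text.endswith(";"):
        raise ValueError("a constraint must end with ';'")
    weight_str, rest = text[:-1].split(None, 1)
    weight = float(weight_str)
    if rest.count("(") != rest.count(")"):
        raise ValueError("unbalanced parentheses")
    # Tokens are either parenthesised groups or bare words.
    tokens = re.findall(r"\([^()]*\)|[^\s()]+", rest)
    core, conditions = [], []
    for tok in tokens:
        inner = tok[1:-1].strip() if tok.startswith("(") else None
        if inner is not None and COND_RE.match(inner):
            conditions.append(inner)
        elif conditions:
            raise ValueError("core element found after a context condition")
        else:
            core.append(tok)
    return weight, " ".join(core), conditions
```

Under these assumptions, a core such as DI* (mucho) keeps its form restriction, because (mucho) does not start with a position.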

The next constraint states a very high compatibility for the word mucho (much) being an indefinite determiner (DI*), and thus not a pronoun, an adverb, or any other analysis it may have, if the following word is a noun (NC*):
60.0 DI* (mucho) (1 NC*);

The next constraint states a positive compatibility value for a word being a noun (NC*) if somewhere to its left there is a determiner or an adjective (DA* or AQ*), and between them there is not any other noun:
5.0 NC* (-1* DA* or AQ* barrier NC*);
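
The effect of the barrier in the condition above can be illustrated with a small Python sketch. It rests on simplifying assumptions that are not taken from FreeLing: each word carries a single tag (in the real tagger each word has several candidate analyses), a trailing * in a pattern matches any tag suffix, and the scan moves leftwards one word at a time. The function names are hypothetical.

```python
def tag_matches(tag, pattern):
    """'DA*' matches any tag starting with 'DA'; otherwise require equality."""
    if pattern.endswith("*"):
        return tag.startswith(pattern[:-1])
    return tag == pattern

def left_context_holds(tags, i, targets, barrier):
    """Check a '-1* T1 or T2 barrier B' condition for the word at index i.

    Scans leftwards from i-1. Returns True if a word matching any target
    pattern is found before a word matching the barrier pattern.
    """
    for j in range(i - 1, -1, -1):
        if any(tag_matches(tags[j], p) for p in targets):
            return True
        if tag_matches(tags[j], barrier):
            return False  # a barrier word blocks the search
    return False  # reached the start of the sentence without a match
```

With the tag sequence DA0MS0 AQ0MS0 NCMS000, the condition holds for the final noun. With DA0MS0 NCMS000 VMIP3S0 NCMS000, it does not hold for the final noun, since another noun lies between it and the determiner.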

The next constraint states a positive compatibility value for a word being a masculine noun (NCM*) if the word to its left is a masculine determiner. It refers to a previously defined set, which should contain the list of all tags that are masculine determiners. This rule could be useful to correctly tag Spanish words that have two different NC analyses differing in gender, e.g. el cura (the priest) vs. la cura (the cure):
5.0 NCM* (-1* DetMasc);

The next constraint adds some positive compatibility to a 3rd person personal pronoun being of undefined gender and number (PP3CNA00) if it has the possibility of being masculine singular (PP3MSA00), the next word may have lemma estar (to be), and the second word to the right is not a gerund (VMG). This rule is intended to handle the different behaviour of the Spanish word lo in sentences such as sí, lo estoy or lo estoy viendo.
0.5 PP3CNA00 (0 PP3MSA00) (1 <estar>) (not 2 VMG*);

2008-01-24