The syntax of the file is based on that of Constraint Grammars [KVHA95], but simplified in many respects, and modified to include weighted constraints.
An initial file based on statistical constraints may be generated from a tagged corpus using the src/utilities/train-relax.perl script provided with FreeLing. Later, hand-written constraints can be added to the file to improve the tagger behaviour.
The file consists of two sections: SETS and CONSTRAINTS.
The SETS section consists of a list of set definitions, each of the form Set-name = element1 element2 ... elementN ;
Where the Set-name is any alphanumeric string starting with a capital letter, and the elements are either forms, lemmas, plain PoS tags, or senses. Forms are enclosed in parentheses, e.g. (comimos); lemmas in angle brackets, e.g. <comer>; PoS tags are alphanumeric strings starting with a capital letter, e.g. NCMS000; and senses are enclosed in square brackets, e.g. [00794578].
The sets must be homogeneous: That is, all the elements of a set
have to be of the same kind.
Examples of set definitions:
DetMasc = DA0MS0 DA0MP0 DD0MS0 DD0MP0 DI0MS0 DI0MP0 DP1MSP DP1MPP DP2MSP DP2MPP DT0MS0 DT0MP0 DE0MS0 DE0MP0 AQ0MS0 AQ0MP0;
VerbPron = <dar_cuenta> <atrever> <arrepentir> <equivocar> <inmutar> <morir> <ir> <manifestar> <precipitar> <referir> <reír> <venir>;
Animal = [00008019] [00862484] [00862617] [00862750] [00862871] [00863425] [00863992] [00864099] [00864394] [00865075] [00865379] [00865569] [00865638] [00867302] [00867448] [00867773] [00867864] [00868028] [00868297] [00868486] [00868585] [00868729] [00911889] [00985200] [00990770] [01420347] [01586897] [01661105] [01661246] [01664986] [01813568] [01883430] [01947400] [07400072] [07501137];
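The set syntax and the homogeneity rule can be illustrated with a small Python sketch. It is not part of FreeLing: the function names classify and parse_set are hypothetical, and the sketch assumes that elements are separated by whitespace and contain no spaces themselves.

```python
import re

NAME = re.compile(r"[A-Z][A-Za-z0-9]*")

def classify(element):
    """Return the kind of a set element, based on its delimiters."""
    if element.startswith("(") and element.endswith(")"):
        return "form"
    if element.startswith("<") and element.endswith(">"):
        return "lemma"
    if element.startswith("[") and element.endswith("]"):
        return "sense"
    if NAME.fullmatch(element):
        return "tag"
    raise ValueError(f"unrecognized element: {element!r}")

def parse_set(definition):
    """Parse 'Set-name = e1 e2 ... eN ;' into (name, kind, elements)."""
    body = definition.strip()
    if not body.endswith(";"):
        raise ValueError("a set definition must end with ';'")
    name, _, rest = body[:-1].partition("=")
    name = name.strip()
    if not NAME.fullmatch(name):
        raise ValueError(f"invalid set name: {name!r}")
    elements = rest.split()
    kinds = {classify(e) for e in elements}
    # Homogeneity: every element must be of the same kind.
    if len(kinds) != 1:
        raise ValueError(f"set {name} is empty or not homogeneous")
    return name, kinds.pop(), elements
```

For instance, parse_set("VerbPron = <atrever> <ir> <venir>;") yields a set of kind "lemma", while a definition mixing PoS tags and lemmas is rejected.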
The CONSTRAINTS section consists of a series of context constraints, each of the form: weight core context;
Where:
<comer>, VMIP3S0<comer>, VMI*<comer> will match any word analysis with those tag/prefix and lemma.
[00862617], NCMS000[00862617], NC*[00862617] will match any word analysis with those tag/prefix and sense.
Conditions may be negated using the token not, i.e. (not pos terms)
Where:
<comer>, VMIP3S0<comer>, VMI*<comer> will match any word analysis with those tag/prefix and lemma.
[00862617], NCMS000[00862617], NC*[00862617] will match any word analysis with those tag/prefix and sense.
{DetMasc}, {VerbPron} will match any word analysis with a tag, lemma or sense in the specified set.
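The matching behaviour of these terms can be sketched in Python as follows. This is an illustration under stated assumptions, not FreeLing's implementation: the Analysis class and the functions tag_matches and term_matches are hypothetical, a word analysis is reduced to a tag, a lemma and a set of sense codes, and form terms such as (comimos) are left out.

```python
import re
from dataclasses import dataclass

@dataclass
class Analysis:
    """Hypothetical, simplified word analysis."""
    tag: str
    lemma: str
    senses: frozenset = frozenset()

# Optional tag or tag prefix (ending in '*'), then optional <lemma> or [sense].
TERM = re.compile(
    r"^(?P<tag>[A-Z][A-Za-z0-9]*\*?)?"
    r"(?:<(?P<lemma>[^>]+)>|\[(?P<sense>[^\]]+)\])?$"
)

def tag_matches(pattern, tag):
    """A pattern ending in '*' is a prefix; otherwise the tag must be equal."""
    if pattern.endswith("*"):
        return tag.startswith(pattern[:-1])
    return tag == pattern

def term_matches(term, analysis, sets=None):
    """Check one term against one analysis; sets maps set names to elements."""
    if term.startswith("{") and term.endswith("}"):
        members = (sets or {})[term[1:-1]]
        return any(term_matches(m, analysis) for m in members)
    m = TERM.match(term)
    if not term or m is None:
        raise ValueError(f"unsupported term: {term!r}")
    tag, lemma, sense = m.group("tag", "lemma", "sense")
    if tag is not None and not tag_matches(tag, analysis.tag):
        return False
    if lemma is not None and lemma != analysis.lemma:
        return False
    if sense is not None and sense not in analysis.senses:
        return False
    return True
```

Under these assumptions, term_matches("VMI*<comer>", Analysis("VMIP3S0", "comer")) holds, while the same analysis does not match VMG*<comer>.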
Note that the use of sense information in the rules of
the constraint grammar (either in the core or in the context)
only makes sense when this information distinguishes one analysis
from another. If the sense tagging has been performed with the
option DuplicateAnalysis=no, each PoS tag will have a single list
with all its senses, so the sense information will not distinguish
one analysis from another (there will be only one analysis with
that sense, which will have at the same time all the other senses
as well).
If the option DuplicateAnalysis was active, the sense
tagger duplicates the analysis, creating a new entry for each
sense. So, when a rule selects an analysis having a certain sense,
it is unselecting the other copies of the same analysis with
different senses.
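The effect of the two settings can be pictured with a short Python sketch. It only models the behaviour described above: analyses are represented as hypothetical (tag, lemma, senses) tuples, the lemma is a placeholder, and duplicate_by_sense and select_by_sense are illustrative names, not FreeLing functions.

```python
def duplicate_by_sense(analyses):
    """Split each (tag, lemma, senses) analysis into one copy per sense."""
    duplicated = []
    for tag, lemma, senses in analyses:
        if not senses:
            # Nothing to split: keep the analysis as it is.
            duplicated.append((tag, lemma, []))
            continue
        for sense in senses:
            duplicated.append((tag, lemma, [sense]))
    return duplicated

def select_by_sense(analyses, sense):
    """Keep the analyses that carry the given sense."""
    return [a for a in analyses if sense in a[2]]

# Without duplication: one analysis holding every sense.
merged = [("NCMS000", "lemma_x", ["00862617", "00862750"])]
# With duplication: one copy of the analysis per sense.
split = duplicate_by_sense(merged)
```

Selecting sense 00862617 in merged returns the single analysis with both senses, so nothing is discarded; in split it returns only the copy with that sense and drops the other copy.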
Examples:
The next constraint states a high incompatibility for a word
being a definite determiner (DA*) if the next word is a personal form
of a verb (VMI*):
-8.143 DA* (1 VMI*);
The next constraint states a very high compatibility for the
word mucho (much) being an indefinite determiner (DI*),
and thus not being a pronoun, an adverb, or any
other analysis it may have, if the following word is a noun (NC*):
60.0 DI* (mucho) (1 NC*);
The next constraint states a positive compatibility value for
a word being a noun (NC*) if somewhere to its left
there is a determiner or an adjective (DA* or AQ*), and
between them there is not any other noun:
5.0 NC* (-1* DA* or AQ* barrier NC*);
The next constraint states a positive compatibility value for
a word being a masculine noun (NCM*) if the word to its
left is a masculine determiner. It refers to a previously
defined SET which should contain the list of all tags
that are masculine determiners. This rule could be useful to
correctly tag Spanish words which have two different NC
analyses differing in gender: e.g. el cura (the priest)
vs. la cura (the cure):
5.0 NCM* (-1 {DetMasc});
The next constraint adds some positive compatibility to a
3rd person personal pronoun being of undefined gender and
number (PP3CNA00) if it has the possibility of being
masculine singular (PP3MSA00), the next word may have
lemma estar (to be), and the second word to the right
is not a gerund (VMG*). This rule is intended to solve the
different behaviour of the Spanish word lo in sentences
such as sí, lo estoy or lo estoy viendo.
0.5 PP3CNA00 (0 PP3MSA00) (1 <estar>) (not 2 VMG*);
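To make the structure of the examples above explicit, the following Python sketch splits a constraint line into its weight, core and context conditions. It is a simplified reading of the syntax inferred from the examples, not FreeLing's own parser: parse_constraint and the dictionary keys are hypothetical names, a position followed by * is only recorded as "starred", and parentheses nested inside a condition are not handled.

```python
import re

# A condition: optional 'not', a position such as 1, -1 or -1*, then terms.
COND = re.compile(r"\(\s*(not\s+)?(-?\d+\*?)\s+(.*?)\s*\)")

def parse_constraint(line):
    """Split 'weight core context;' into (weight, core, conditions)."""
    text = line.strip()
    if not text.endswith(";"):
        raise ValueError("a constraint must end with ';'")
    weight_str, rest = text[:-1].split(None, 1)
    weight = float(weight_str)
    # The core is everything before the first condition.
    first = COND.search(rest)
    core = rest[:first.start()].strip() if first else rest.strip()
    conditions = []
    for m in COND.finditer(rest):
        position = m.group(2)
        parts = re.split(r"\s+barrier\s+", m.group(3), maxsplit=1)
        conditions.append({
            "negated": m.group(1) is not None,
            "offset": int(position.rstrip("*")),
            "starred": position.endswith("*"),
            "terms": re.split(r"\s+or\s+", parts[0]),
            "barrier": re.split(r"\s+or\s+", parts[1]) if len(parts) > 1 else [],
        })
    return weight, core, conditions
```

For the third example, parse_constraint("5.0 NC* (-1* DA* or AQ* barrier NC*);") gives weight 5.0, core NC*, and a single starred condition at offset -1 with terms DA* and AQ* and barrier NC*.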
2008-01-24