Suffixation rules file

One rule per line; each rule has nine fields:

  1. Suffix to erase from the word form (e.g. crucecita - cecita = cru)
  2. Suffix (* for empty string) to add to the resulting root to rebuild the lemma that must be searched for in the dictionary (e.g. cru + z = cruz). Alternatives are separated by | (e.g. z|za)
  3. Condition on the parole tag of the dictionary entry found (e.g. cruz is NCFS). The condition is a Perl regular expression
  4. Parole tag for suffixed word (* = keep tag in dictionary entry)
  5. Check lemma adding accents
  6. Enclitic suffix (special accent behaviour in Spanish)
  7. Use original form as lemma instead of the lemma in dictionary entry
  8. Consider the suffix always, not only for unknown words.
  9. Retokenization info, explained below (or "-" if the suffix doesn't cause retokenization).

E.g.

 cecita  z|za  ^NCFS  NCFS00A  0  0  0  0  -
 les     *     ^V      *       0  1  0  1  $$+les:$$+PP
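
As an illustration of the nine-field layout, the following Python sketch splits a rule line on whitespace and gives each field a name. The class and field names are hypothetical (they are not taken from any existing library), and the sketch assumes that fields 5 to 8 are 0/1 flags, as in the two example rules. The tag condition is compiled with Python's re module, which is assumed to be close enough to Perl syntax for simple patterns such as ^NCFS.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SuffixRule:
    """One line of the rules file. All field names here are illustrative."""
    suffix: str                 # field 1: suffix to erase
    replacements: List[str]     # field 2: alternatives to append ("" for *)
    tag_condition: re.Pattern   # field 3: condition on the dictionary tag
    output_tag: str             # field 4: tag for the suffixed word, or "*"
    check_accents: bool         # field 5
    enclitic: bool              # field 6
    use_form_as_lemma: bool     # field 7
    always_apply: bool          # field 8
    retok: Optional[str]        # field 9: None when the file has "-"


def parse_rule(line: str) -> SuffixRule:
    """Parse one whitespace-separated rule line into a SuffixRule."""
    fields = line.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}")
    suffix, repl, cond, tag, acc, enc, form, always, retok = fields
    return SuffixRule(
        suffix=suffix,
        replacements=["" if r == "*" else r for r in repl.split("|")],
        tag_condition=re.compile(cond),
        output_tag=tag,
        check_accents=(acc == "1"),
        enclitic=(enc == "1"),
        use_form_as_lemma=(form == "1"),
        always_apply=(always == "1"),
        retok=None if retok == "-" else retok,
    )


rule = parse_rule(" les     *     ^V      *       0  1  0  1  $$+les:$$+PP")
print(rule.suffix, rule.replacements, rule.always_apply, rule.retok)
```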

The first line (cecita) states a suffix rule that will be applied to unknown words, to see whether a valid feminine singular noun is obtained when substituting the suffix cecita with z or za. This is the case of crucecita (diminutive of cruz). If such a base form is found, the original word is analyzed as a diminutive suffixed form. No retokenization is performed.
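
The lookup described for the cecita rule can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: TOY_DICT is a hypothetical stand-in for the real dictionary, the tag NCFS000 given to cruz is an assumed full tag that matches the NCFS prefix mentioned above, and the accent, enclitic and form-as-lemma flags (fields 5 to 7) are ignored.

```python
import re

# Hypothetical toy dictionary: form -> list of (lemma, tag) entries.
TOY_DICT = {
    "cruz": [("cruz", "NCFS000")],
}


def apply_suffix_rule(word, suffix, replacements, tag_cond, out_tag, dictionary):
    """Return the (lemma, tag) analyses that one suffix rule yields for word."""
    if not suffix or not word.endswith(suffix):
        return []
    root = word[:len(word) - len(suffix)]      # crucecita - cecita = cru
    analyses = []
    for repl in replacements:                  # try each alternative: z, za
        candidate = root + ("" if repl == "*" else repl)
        for lemma, tag in dictionary.get(candidate, []):
            if re.search(tag_cond, tag):       # field 3: condition on the tag
                # field 4: "*" keeps the dictionary tag
                analyses.append((lemma, tag if out_tag == "*" else out_tag))
    return analyses


result = apply_suffix_rule("crucecita", "cecita", ["z", "za"],
                           "^NCFS", "NCFS00A", TOY_DICT)
print(result)  # [('cruz', 'NCFS00A')]
```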

The second rule (les) applies to all words and tries to check whether a valid verb form is obtained when removing the suffix les. This is the case of words such as viles (which may mean I saw them, but is also the plural of the adjective vil). In this case, the retokenization info states that if the verb tag is eventually selected for this word, it may be retokenized into two words: the base verb form (referred to as $$, vi in the example) plus the word les. The tags for these new words are expressed after the colon: the base form must keep its PoS tag (this is what the second $$ means) and the second word may take any tag starting with PP that it has in the dictionary.

So, the word viles would obtain its adjective analysis from the dictionary, plus its verb + clitic pronoun analysis from the suffix rule:

    viles vil AQ0CP0 ver VMIS1S0

The second analysis will carry the retokenization information, so if eventually the PoS tagger selects the VMI analysis (and the TaggerRetokenize option is set), the word will be retokenized into:

   vi ver VMIS1S0
   les ellos PP3CPD00
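
The way the retokenization info is read can be sketched in Python as follows. This is a hedged illustration of the format as described in this file, not the actual implementation: the function name and the tags_for lookup are hypothetical, and lemmas are left out for brevity. The string is split at the colon into a list of forms and a list of tags, both separated by +, and $$ stands for the base form or the base tag respectively.

```python
def retokenize(base_form, base_tag, retok_info, tags_for):
    """Expand a retokenization string into a list of (form, candidate_tags).

    tags_for is a hypothetical callback returning the dictionary tags of a form.
    """
    forms_part, tags_part = retok_info.split(":")
    forms = forms_part.split("+")
    tags = tags_part.split("+")
    if len(forms) != len(tags):
        raise ValueError("forms and tags must have the same number of parts")
    tokens = []
    for form, tag in zip(forms, tags):
        if form == "$$":                 # $$ on the left: the base form
            form = base_form
        if tag == "$$":                  # $$ on the right: keep the base tag
            candidates = [base_tag]
        else:                            # otherwise: a required tag prefix
            candidates = [t for t in tags_for(form) if t.startswith(tag)]
        tokens.append((form, candidates))
    return tokens


# Hypothetical dictionary lookup holding only the entry used in the example.
toy_tags = {"les": ["PP3CPD00"]}
tokens = retokenize("vi", "VMIS1S0", "$$+les:$$+PP",
                    lambda form: toy_tags.get(form, []))
print(tokens)  # [('vi', ['VMIS1S0']), ('les', ['PP3CPD00'])]
```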

2008-01-24