Define lexical probabilities for each tag of each word.
This file can be generated from a tagged corpus using the script src/utilities/make-probs-file.perl provided in the FreeLing package. See the comments in the script file to find out the format the file must have.
The probabilities file has six sections: <UnknownTags>, <Theeta>, <Suffixes>, <SingleTagFreq>, <ClassTagFreq>, <FormTagFreq>. Each section is closed by its corresponding tag: </UnknownTags>, </Theeta>, </Suffixes>, </SingleTagFreq>, </ClassTagFreq>, </FormTagFreq>.
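As an illustration of the section layout, the following minimal sketch splits the text of a probabilities file into its sections. The function name read_sections is hypothetical and not part of FreeLing, and the sketch assumes that each opening and closing tag sits alone on its own line.

```python
import re

def read_sections(text):
    """Split a probabilities file into {section name: list of non-empty lines}.

    Hypothetical helper, not part of FreeLing. Assumes every opening and
    closing tag is alone on its line.
    """
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        closing = re.fullmatch(r"</(\w+)>", line)
        opening = re.fullmatch(r"<(\w+)>", line)
        if closing:
            # A closing tag must match the section that is currently open.
            if closing.group(1) != current:
                raise ValueError(f"unexpected closing tag: {line}")
            current = None
        elif opening:
            current = opening.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections

sample = """\
<Theeta>
0.00834
</Theeta>
<SingleTagFreq>
AQ 7462
</SingleTagFreq>
"""

sections = read_sections(sample)
theeta = float(sections["Theeta"][0])
```

Lines outside any section are ignored by this sketch; a stricter reader could reject them instead.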
<FormTagFreq>. Probability data of some high-frequency forms.
If the word is found in this list, lexical probabilities are computed using the data in the <FormTagFreq> section.
The list consists of one form per line, each line with format:
form ambiguity-class tag1 #observ1 tag2 #observ2 ...
E.g. japonesas AQ-NC AQ 1 NC 0
Form probabilities are smoothed to avoid zero probabilities.
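The line format above can be read as sketched below. The helper names parse_form_line and smooth are hypothetical. The smoothing shown is plain add-alpha smoothing, chosen only to illustrate why zero counts need not yield zero probabilities; it is an assumption and not necessarily the method FreeLing applies.

```python
def parse_form_line(line):
    """Parse 'form ambiguity-class tag1 n1 tag2 n2 ...' into its parts."""
    fields = line.split()
    form, amb_class = fields[0], fields[1]
    rest = fields[2:]
    if len(rest) % 2 != 0:
        raise ValueError(f"tags and counts do not pair up: {line!r}")
    counts = {tag: int(n) for tag, n in zip(rest[0::2], rest[1::2])}
    return form, amb_class, counts

def smooth(counts, alpha=1.0):
    """Add-alpha smoothing (an assumption, not FreeLing's documented method)."""
    total = sum(counts.values()) + alpha * len(counts)
    return {tag: (n + alpha) / total for tag, n in counts.items()}

form, amb_class, counts = parse_form_line("japonesas AQ-NC AQ 1 NC 0")
probs = smooth(counts)  # NC gets a non-zero probability despite 0 observations
```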
<ClassTagFreq>. Probability data of ambiguity classes.
If the word is not found in the <FormTagFreq> section, frequencies for its ambiguity class are used.
The list consists of one class per line, each line with format:
class tag1 #observ1 tag2 #observ2 ...
E.g. AQ-NC AQ 2361 NC 2077
Class probabilities are smoothed to avoid zero probabilities.
<SingleTagFreq>. Unigram probabilities.
If the ambiguity class is not found in the <ClassTagFreq> section, individual frequencies for its possible tags are used.
One tag per line, each line with format: tag #observ
E.g. AQ 7462
Tag probabilities are smoothed to avoid zero probabilities.
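The order in which the <FormTagFreq>, <ClassTagFreq>, and <SingleTagFreq> data are consulted can be sketched as follows. The function select_counts is hypothetical, and building the class name by joining the sorted tags with "-" is an assumption inferred from the AQ-NC example. The forms "otra-forma" and "forma-rara" and the tag RG are made up for the illustration; the counts are the ones from the examples above.

```python
def select_counts(form, possible_tags, form_freq, class_freq, tag_freq):
    """Return (level, counts) following the form -> class -> tag back-off."""
    if form in form_freq:
        return "form", form_freq[form]
    # Assumption: the class name is the sorted tags joined by '-'.
    amb_class = "-".join(sorted(possible_tags))
    if amb_class in class_freq:
        return "class", class_freq[amb_class]
    # Last resort: the individual frequency of each possible tag.
    return "tag", {t: tag_freq.get(t, 0) for t in possible_tags}

form_freq = {"japonesas": {"AQ": 1, "NC": 0}}
class_freq = {"AQ-NC": {"AQ": 2361, "NC": 2077}}
tag_freq = {"AQ": 7462}

by_form = select_counts("japonesas", ["AQ", "NC"], form_freq, class_freq, tag_freq)
by_class = select_counts("otra-forma", ["NC", "AQ"], form_freq, class_freq, tag_freq)
by_tag = select_counts("forma-rara", ["AQ", "RG"], form_freq, class_freq, tag_freq)
```

The last call returns a zero count for RG, which is the kind of case the smoothing step is meant to handle.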
<Theeta>. Value for parameter theeta used in smoothing of tag probabilities based on word suffixes.
If the word is not found in the dictionary (and so the list of its possible tags is unknown), the distribution is computed using the data in the <Theeta>, <Suffixes>, and <UnknownTags> sections.
The section has exactly one line, with one real number.
E.g.
<Theeta>
0.00834
</Theeta>
<Suffixes>. List of suffixes obtained from a training corpus, with information about which tags were assigned to the words with that suffix.
The list has one suffix per line, each line with format: suffix #observ tag1 #observ1 tag2 #observ2 ...
E.g.
orada 133 AQ0FSP 17 VMP00SF 8 NCFS000 108
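A suffix line such as the one above could be read as in this minimal sketch. The helpers parse_suffix_line and known_suffixes are hypothetical, and "decorada" is a made-up unknown word. The sketch only computes relative tag frequencies for a suffix; how FreeLing combines them with the theeta parameter is not described here and is not attempted.

```python
def parse_suffix_line(line):
    """Parse 'suffix total tag1 n1 tag2 n2 ...' into (suffix, total, counts)."""
    fields = line.split()
    suffix, total = fields[0], int(fields[1])
    rest = fields[2:]
    if len(rest) % 2 != 0:
        raise ValueError(f"tags and counts do not pair up: {line!r}")
    counts = {tag: int(n) for tag, n in zip(rest[0::2], rest[1::2])}
    return suffix, total, counts

def known_suffixes(word, table):
    """Endings of `word` that appear in `table`, longest first."""
    return [word[i:] for i in range(len(word)) if word[i:] in table]

suffix, total, counts = parse_suffix_line(
    "orada 133 AQ0FSP 17 VMP00SF 8 NCFS000 108"
)
table = {suffix: counts}

# Relative frequency of each tag among words ending in this suffix.
rel_freq = {tag: n / total for tag, n in counts.items()}

matches = known_suffixes("decorada", table)
```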
<UnknownTags>. List of open-category tags to consider as possible candidates for any unknown word.
One tag per line, each line with format: tag #observ. The tag is the complete Parole label. The count is the number of occurrences in a training corpus.
E.g. NCMS000 33438
2008-01-24