Initial probabilities, transition probabilities, lexical probabilities, etc. The file has six sections: <Tag>, <Bigram>, <Trigram>, <Initial>, <Word>, and <Smoothing>. Each section is closed by its corresponding tag: </Tag>, </Bigram>, </Trigram>, etc.
The tag (unigram), bigram, and trigram probabilities are used by the tagger for linear interpolation smoothing. The package includes a Perl script that can generate an appropriate config file from a tagged corpus. See the file src/utilities/hmm_smooth.perl for details.
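The section layout described above can be read with a few lines of code. The following is a minimal sketch, not the tagger's own loader: it assumes each opening and closing section tag sits alone on its own line, and the function name `split_sections` is hypothetical.

```python
import re

def split_sections(text):
    """Group the non-empty lines of a config file by enclosing section.

    Assumption: every <Name> and </Name> marker is alone on its line.
    """
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        closing = re.fullmatch(r"</(\w+)>", line)
        opening = re.fullmatch(r"<(\w+)>", line)
        if closing:
            if closing.group(1) != current:
                raise ValueError(f"unexpected closing tag: {line}")
            current = None
        elif opening and current is None:
            current = opening.group(1)
            sections[current] = []
        elif current is not None:
            # Data line, e.g. "NC 0.18894" or "<UNOBSERVED_WORD> -13.82853".
            sections[current].append(line)
    return sections

sample = """\
<Tag>
0 0.03747
NC 0.18894
x 1.07312e-06
</Tag>
<Word>
sutil -13.57721
<UNOBSERVED_WORD> -13.82853
</Word>
"""
sections = split_sections(sample)
```

Because a data line such as `<UNOBSERVED_WORD> -13.82853` carries a value after the bracketed token, it does not match the section-marker pattern and is kept as ordinary content.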
<Tag>. List of unigram tag probabilities (estimated via your preferred method). Each line is a tag probability P(t) with the format Tag Probability. Lines for the zero tag (for initial states) and for x (unobserved tags) must be included.
E.g.
0 0.03747
AQ 0.00227
NC 0.18894
x 1.07312e-06
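As an illustration, the lines of this section can be loaded into a dictionary while checking that the two mandatory entries are present. This is a sketch under the assumption that each line holds a tag and a probability separated by whitespace, as in the example lines above; `parse_tag_section` is a hypothetical helper, not part of the package.

```python
def parse_tag_section(lines):
    """Parse 'Tag Probability' lines and check the mandatory entries."""
    probs = {}
    for line in lines:
        tag, value = line.split()
        probs[tag] = float(value)
    # The zero tag and the x tag are required by the file format.
    missing = [t for t in ("0", "x") if t not in probs]
    if missing:
        raise ValueError(f"<Tag> section lacks required entries: {missing}")
    return probs

tag_probs = parse_tag_section(
    ["0 0.03747", "AQ 0.00227", "NC 0.18894", "x 1.07312e-06"]
)
```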
<Bigram>. List of bigram transition probabilities (estimated via your preferred method). Each line is a transition probability, with the format Tag1.Tag2 Probability. Tag zero indicates sentence beginning.
E.g. the following line indicates the transition probability between a
sentence start and the tag of the first word being AQ.
0.AQ 0.01403
E.g. the following line indicates the transition probability between two
consecutive tags.
AQ.NC 0.16963
<Trigram>. List of trigram transition probabilities (estimated via your preferred method). Each line is a transition probability, with the format Tag1.Tag2.Tag3 Probability. Tag zero indicates sentence beginning.
E.g. the following line indicates the transition probability that, after a 0.AQ sequence, the next word has the NC tag.
0.AQ.NC 0.204081
E.g. the following line indicates the probability of a tag SP appearing after two words tagged DA and NC.
DA.NC.SP 0.33312
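To show how the unigram, bigram, and trigram tables could be combined, here is a sketch of textbook linear interpolation. It is not taken from the tagger's source: the weights in `lambdas` are placeholders, the function name is hypothetical, and falling back to the x entry for an unseen tag and to 0.0 for an unseen bigram or trigram is an assumption made for the example.

```python
def interpolated_transition(t1, t2, t3, tag_p, bigram_p, trigram_p,
                            lambdas=(0.1, 0.3, 0.6)):
    """Smoothed P(t3 | t1, t2) as a weighted sum of the three estimates.

    lambdas = (unigram, bigram, trigram) weights; placeholder values
    that sum to 1, not the tagger's real coefficients.
    """
    l1, l2, l3 = lambdas
    uni = tag_p.get(t3, tag_p["x"])              # assumed fallback: x entry
    bi = bigram_p.get(f"{t2}.{t3}", 0.0)         # assumed fallback: 0.0
    tri = trigram_p.get(f"{t1}.{t2}.{t3}", 0.0)  # assumed fallback: 0.0
    return l1 * uni + l2 * bi + l3 * tri

# Values copied from the example lines of the three sections.
tag_p = {"0": 0.03747, "AQ": 0.00227, "NC": 0.18894, "x": 1.07312e-06}
bigram_p = {"0.AQ": 0.01403, "AQ.NC": 0.16963}
trigram_p = {"0.AQ.NC": 0.204081}

p = interpolated_transition("0", "AQ", "NC", tag_p, bigram_p, trigram_p)
```

With these placeholder weights, a trigram missing from the file still receives some probability mass from the bigram and unigram terms, which is the purpose of the smoothing.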
<Initial>. List of initial state probabilities (estimated via your preferred method), i.e. the "pi" parameters of the HMM. Each line is an initial probability, with the format InitialState LogProbability. Each state is a PoS-bigram code of the form 0.tag. Probabilities are given in logarithmic form to avoid underflows.
E.g. the following line indicates the probability that the
sequence starts with a determiner.
0.DA -1.744857
E.g. the following line indicates the probability that the
sequence starts with an unknown tag.
0.x -10.462703
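The underflow argument can be checked numerically. The sketch below assumes the values are natural logarithms, which this section does not state explicitly. It converts the two example entries back to plain probabilities, then compares a long product of small probabilities with the equivalent sum of logs.

```python
import math

# Example entries from the <Initial> section (assumed natural logs).
log_pi = {"0.DA": -1.744857, "0.x": -10.462703}

p_da = math.exp(log_pi["0.DA"])  # plain probability of starting with DA
p_x = math.exp(log_pi["0.x"])    # plain probability of an unknown first tag

# Multiplying many small probabilities underflows to zero in floating
# point, while adding their logarithms stays representable.
steps = 200
product = 1.0
log_sum = 0.0
for _ in range(steps):
    product *= p_x
    log_sum += log_pi["0.x"]
```

After the loop, `product` has collapsed to 0.0 while `log_sum` still holds a usable score, which is why a decoder working over long sentences would add log probabilities instead of multiplying raw ones.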
<Word>. Contains a list of word probabilities P(w) (estimated via your preferred method). It is used, together with the tag probabilities above, to compute observation probabilities. Each line is a word probability P(w) with the format word LogProbability. A special line for <UNOBSERVED_WORD> must be included.
E.g.
afortunado -13.69500
sutil -13.57721
<UNOBSERVED_WORD> -13.82853
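A lookup over this section might look like the sketch below. It assumes that the <UNOBSERVED_WORD> entry serves as the fallback for any word missing from the list, which is the natural reading of its name but is not spelled out here. The helper names are hypothetical.

```python
UNOBSERVED = "<UNOBSERVED_WORD>"

def parse_word_section(lines):
    """Parse 'word LogProbability' lines into a dictionary."""
    table = {}
    for line in lines:
        # Split on the last run of whitespace: the value is the final field.
        word, value = line.rsplit(None, 1)
        table[word] = float(value)
    if UNOBSERVED not in table:
        raise ValueError("<Word> section must include " + UNOBSERVED)
    return table

def word_logprob(word, table):
    """Log P(w), using the unobserved-word entry for unlisted words (assumed)."""
    return table.get(word, table[UNOBSERVED])

table = parse_word_section([
    "afortunado -13.69500",
    "sutil -13.57721",
    "<UNOBSERVED_WORD> -13.82853",
])
known = word_logprob("sutil", table)
unknown = word_logprob("palabra_no_vista", table)
```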
2008-01-24