The distributed version includes morphological dictionaries for
the covered languages (English, Spanish, Catalan, Galician, and Italian):
- The Spanish dictionary was obtained from the
Spanish Resource Grammar project developed at the
Universitat Pompeu Fabra, and contains over 550,000
forms corresponding to more than 76,000 lemma-PoS combinations.
These data are distributed under their original Lesser General
Public License For Linguistic Resources (LGPL-LR).
See the THANKS and COPYING files for further information.
- The Catalan dictionary was built by hand and contains
nearly 67,000 forms corresponding to more than
7,400 lemma-PoS combinations.
- The Galician dictionary was obtained from the OpenTrad project
(an open-source machine translation project, www.opentrad.org), and contains over 90,000
forms corresponding to nearly 7,500 lemma-PoS combinations.
These data are distributed under their original Creative Commons
license; see the THANKS and COPYING files for further information.
- The English dictionary was automatically extracted from the WSJ
corpus, then carefully post-edited and completed by hand.
It contains over 65,000 forms corresponding to some 40,000
lemma-PoS combinations.
- The Italian dictionary was extracted from the Morph-it! lexicon
developed at the University of Bologna, and contains over 360,000
forms corresponding to more than 40,000 lemma-PoS combinations.
These data are distributed under their original Creative Commons
license; see the THANKS and COPYING files for further information.

Smaller dictionaries (Catalan and Galician) are expected to cover
over 80% of open-category tokens in a text. Larger dictionaries
are expected to cover between 90% and 95% of open-category tokens
in a text. For words not found in the dictionary, all open
categories are assumed, with a probability distribution based on
word suffixes. This distribution includes the right tag for 99% of
such words, and allows the tagger to make the most suitable choice
based on tag sequence probability.
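
The suffix-based handling of unknown words described above can be
illustrated with a minimal sketch. It is not the code shipped in this
distribution: the tag names in OPEN_TAGS, the function names, the
maximum suffix length and the add-one smoothing are all assumptions
made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical open-category tags; the actual tagset differs per language.
OPEN_TAGS = ("NOUN", "VERB", "ADJ", "ADV")


def train_suffix_model(tagged_words, max_len=3):
    """Count how often each word-final suffix occurs with each open tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        if tag not in OPEN_TAGS:
            continue  # closed-category words do not inform the guesser
        w = word.lower()
        for n in range(1, min(max_len, len(w)) + 1):
            counts[w[-n:]][tag] += 1
    return counts


def guess_tags(word, counts, max_len=3, smoothing=1.0):
    """Return a probability distribution over all open tags for a word."""
    w = word.lower()
    observed = Counter()
    # Back off from the longest suffix seen in training to shorter ones.
    for n in range(min(max_len, len(w)), 0, -1):
        if w[-n:] in counts:
            observed = counts[w[-n:]]
            break
    # Smoothing (an assumption of this sketch) keeps every open tag
    # possible, so the tagger can still override the suffix evidence.
    total = sum(observed.values()) + smoothing * len(OPEN_TAGS)
    return {t: (observed[t] + smoothing) / total for t in OPEN_TAGS}


# Toy training data, invented for the example.
training = [("running", "VERB"), ("jumping", "VERB"),
            ("building", "NOUN"), ("quickly", "ADV"), ("happy", "ADJ")]
model = train_suffix_model(training)
distribution = guess_tags("blorping", model)  # a word in no dictionary
```

Every open tag keeps a non-zero probability, so a tagger working on
tag sequence probabilities remains free to pick a tag that the suffix
alone would not favour.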

This version also includes WordNet-based sense dictionaries for the covered languages,
as well as some knowledge extracted from WordNet, such as semantic file codes and
hypernymy relationships.
2008-01-24