Command line | Configuration file |
-h , --help |
N/A |
Prints to stdout a help screen with valid options and exits.
Command line | Configuration file |
-l <int> , --tlevel <int> |
TraceLevel=<int> |
Set the trace level (0: no trace, up to 3: maximum trace), for debugging purposes. Only valid if the program was compiled with the -DVERBOSE flag.
Command line | Configuration file |
-m <mask> , --tmod <mask> |
TraceModule=<mask> |
Specify modules to trace. Each module is identified by a hexadecimal flag. Flags may be OR-ed together to specify the set of modules to be traced.
Valid masks are:
Module | Mask |
Splitter | 0x00000001 |
Tokenizer | 0x00000002 |
Morphological analyzer | 0x00000004 |
Options management | 0x00000008 |
Number detection | 0x00000010 |
Date identification | 0x00000020 |
Punctuation detection | 0x00000040 |
Dictionary search | 0x00000080 |
Suffixation rules | 0x00000100 |
Multiword detection | 0x00000200 |
Named entity detection | 0x00000400 |
Probability assignment | 0x00000800 |
Quantities detection | 0x00001000 |
Named entity classification | 0x00002000 |
Automata (abstract) | 0x00004000 |
PoS Tagger (abstract) | 0x00008000 |
HMM tagger | 0x00010000 |
Relaxation labelling | 0x00020000 |
RL tagger | 0x00040000 |
RL tagger constr. grammar | 0x00080000 |
Sense annotation | 0x00100000 |
Chart parser | 0x00200000 |
Parser grammar | 0x00400000 |
Dependency parser | 0x00800000 |
Utilities | 0x01000000 |
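As a minimal sketch of how a combined mask can be computed, the snippet below ORs together a few of the flags listed in the table. The dictionary and the helper name `combine_masks` are purely illustrative and are not part of the analyzer; only the mask values come from the table.

```python
# Mask values copied from the table above (a subset, for illustration only).
MODULE_MASKS = {
    "Splitter": 0x00000001,
    "Tokenizer": 0x00000002,
    "Dictionary search": 0x00000080,
    "HMM tagger": 0x00010000,
}


def combine_masks(*modules):
    """OR together the flags of the named modules (hypothetical helper)."""
    mask = 0
    for name in modules:
        mask |= MODULE_MASKS[name]
    return mask


mask = combine_masks("Tokenizer", "Dictionary search", "HMM tagger")
# Print the value in the form used by the TraceModule configuration option.
print(f"TraceModule=0x{mask:08x}")
```

The resulting value can be passed to -m / --tmod or assigned to TraceModule.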
Command line | Configuration file |
-f <filename> |
N/A |
Specify configuration file to use (default: analyzer.cfg).
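The configuration-file column of this section writes every option as Key=value. The following sketch reads such lines into a dictionary; it is an illustration only, not the analyzer's own parser. The function name `parse_options`, the handling of blank or malformed lines, and the conversion of switch values to booleans are all assumptions.

```python
# Values accepted by on/off options in this section: (yes|y|on|no|n|off).
SWITCH_ON = {"yes", "y", "on"}
SWITCH_OFF = {"no", "n", "off"}


def parse_options(lines):
    """Turn Key=value lines into a dict (hypothetical helper).

    Assumption: lines that are empty or contain no '=' are skipped.
    Assumption: any value spelled like a switch is converted to a bool,
    because this sketch does not know which keys are switches.
    """
    options = {}
    for raw in lines:
        line = raw.strip()
        if not line or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if value in SWITCH_ON:
            options[key] = True
        elif value in SWITCH_OFF:
            options[key] = False
        else:
            options[key] = value
    return options


example = ["Lang=es", "AlwaysFlush=no", "SuffixAnalysis=yes", "DecimalPoint=,"]
print(parse_options(example))
```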
Command line | Configuration file |
--lang <language> |
Lang=<language> |
Language of input text (es: Spanish, ca: Catalan, en: English). Other languages may be added to the library. See chapter 4 for details.
Command line | Configuration file |
--flush , --noflush |
AlwaysFlush=(yes|y|on|no|n|off) |
When inactive (the usual choice), the sentence splitter buffers lines until a sentence marker is found, and then outputs a complete sentence. When active, the splitter never buffers any token and treats each newline as a sentence end, so each line is processed as an independent sentence.
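The difference between the two modes can be illustrated with a toy splitter. This is a sketch under simplifying assumptions, not the library's splitter: tokens are whitespace-separated, a sentence marker is any token ending in ".", "?" or "!", and leftover tokens are emitted at end of input. The function name `split_sentences` is hypothetical.

```python
def split_sentences(lines, always_flush, markers=(".", "?", "!")):
    """Toy splitter showing buffering versus flushing (illustrative only)."""
    sentences = []
    buffer = []
    for line in lines:
        tokens = line.split()
        if always_flush:
            # Flush mode: every input line is a sentence of its own.
            if tokens:
                sentences.append(tokens)
            continue
        for token in tokens:
            buffer.append(token)
            # Buffering mode: emit only when a sentence marker is seen.
            if token.endswith(markers):
                sentences.append(buffer)
                buffer = []
    if buffer:
        # Assumption: pending tokens are emitted when the input ends.
        sentences.append(buffer)
    return sentences


text = ["The cat eats", "fish. It sleeps."]
print(split_sentences(text, always_flush=False))
print(split_sentences(text, always_flush=True))
```

With buffering, the first sentence spans both input lines; with flushing, the line break splits it.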
Command line | Configuration file |
--inpf <string> |
InputFormat=<string> |
Format of input data (plain, token, splitted, morfo, tagged, sense, parsed, dep).
Command line | Configuration file |
--outf <string> |
OutputFormat=<string> |
Format of output data (plain, token, splitted, morfo, tagged, parsed, dep).
Command line | Configuration file |
--abrev <filename> |
TokenizerFile=<filename> |
File of tokenization rules. See section 2.3 for details.
Command line | Configuration file |
--fsplit <filename> |
SplitterFile=<filename> |
Splitter options file. See section 2.4 for details.
Command line | Configuration file |
--sufx , --nosufx |
SuffixAnalysis=(yes|y|on|no|n|off) |
Whether to perform suffix analysis on unknown words. Suffix analysis applies known suffixation rules to the word to check whether it is a derived form of a known word (see option Suffix Rules File, below).
Command line | Configuration file |
--loc , --noloc |
MultiwordsDetection=(yes|y|on|no|n|off) |
Whether to perform multiword detection. Multiwords may be detected if a multiword file is provided (see the Multiword File option, below).
Command line | Configuration file |
--numb , --nonumb |
NumbersDetection=(yes|y|on|no|n|off) |
Whether to perform numerical expression detection. Deactivating this feature will affect the behaviour of the date/time and ratio/currency detection modules.
Command line | Configuration file |
--punt , --nopunt |
PunctuationDetection=(yes|y|on|no|n|off) |
Whether to assign a PoS tag to punctuation signs.
Command line | Configuration file |
--date , --nodate |
DatesDetection=(yes|y|on|no|n|off) |
Whether to perform date and time expression detection.
Command line | Configuration file |
--quant , --noquant |
QuantitiesDetection=(yes|y|on|no|n|off) |
Whether to perform currency amounts, physical magnitudes, and ratio detection.
Command line | Configuration file |
--dict , --nodict |
DictionarySearch=(yes|y|on|no|n|off) |
Whether to search word forms in the dictionary. Deactivating this feature also deactivates the SuffixAnalysis option.
Command line | Configuration file |
--prob , --noprob |
ProbabilityAssignment=(yes|y|on|no|n|off) |
Whether to compute a lexical probability for each tag of each word. Deactivating this feature will affect the behaviour of the PoS tagger.
Command line | Configuration file |
--dec <string> |
DecimalPoint=<string> |
Specify the decimal point character (for instance, a dot in English, but a comma in Spanish).
Command line | Configuration file |
--thou <string> |
ThousandPoint=<string> |
Specify the thousand point character (for instance, a comma in English, but a dot in Spanish).
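To make the roles of the two characters concrete, the sketch below converts a written number into a float given a decimal point and a thousand point. It only illustrates what the two settings mean; it is not how the number detection module is implemented, and `parse_number` is a hypothetical helper.

```python
def parse_number(text, decimal_point=".", thousand_point=","):
    """Interpret a written number using the given separator characters."""
    # Drop thousand separators first, then normalise the decimal separator.
    cleaned = text.replace(thousand_point, "").replace(decimal_point, ".")
    return float(cleaned)


# English convention: comma for thousands, dot for decimals.
print(parse_number("1,234.56"))
# Spanish convention: dot for thousands, comma for decimals.
print(parse_number("1.234,56", decimal_point=",", thousand_point="."))
```

Both calls yield the same value, which shows why the settings must match the language of the input text.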
Command line | Configuration file |
-L <filename> , --floc <filename> |
LocutionsFile=<filename> |
Multiword definition file. See section 2.5 for details.
Command line | Configuration file |
-Q <filename> , --fqty <filename> |
QuantitiesFile=<filename> |
Quantity recognition configuration file. See section 2.6 for details.
Command line | Configuration file |
-S <filename> , --fsuf <filename> |
SuffixFile=<filename> |
Suffix rules file. See section 2.7 for details.
Command line | Configuration file |
--thres <float> |
ProbabilityThreshold=<float> |
Threshold that must be reached by the probability of a tag given the suffix of an unknown word in order to be included in the list of possible tags for that word. Default is zero (all tags are included in the list). A non-zero value (e.g. 0.0001, 0.001) is recommended.
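The effect of the threshold can be sketched as a simple filter over candidate tags. The probabilities below are invented for illustration, the function name `filter_tags` is hypothetical, and whether the surviving probabilities are renormalized afterwards is not stated here, so this sketch leaves them unchanged.

```python
def filter_tags(tag_probs, threshold=0.0):
    """Keep only the tags whose probability reaches the threshold."""
    return {tag: p for tag, p in tag_probs.items() if p >= threshold}


# Made-up suffix-based probabilities for an unknown word.
candidates = {"NCMS000": 0.62, "VMIP3S0": 0.37995, "AQ0MS0": 0.00005}

# Default threshold of zero: every tag is kept.
print(filter_tags(candidates))
# A small non-zero threshold discards the very unlikely tag.
print(filter_tags(candidates, threshold=0.001))
```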
Command line | Configuration file |
-P <filename> , --fprob <filename> |
ProbabilityFile=<filename> |
Lexical probabilities file. The probabilities in this file are used to compute the most likely tag for a word, as well as to estimate the likely tags for unknown words. See section 2.8 for details.
Command line | Configuration file |
-D <filename> , --fdict <filename> |
DictionaryFile=<filename> |
Dictionary database. Must be a Berkeley DB indexed file. See section 2.9 and chapter 4 for details.
Command line | Configuration file |
--ner , --noner |
NERecognition=(yes|y|on|no|n|off) |
Whether to perform NE recognition. Deactivating this feature will affect the behaviour of the NE Classification module.
Command line | Configuration file |
-N <filename> , --fnp <filename> |
NPDataFile=<filename> |
Configuration data file for simple heuristic Proper Noun recognizer. See section 2.10 for details.
Command line | Configuration file |
--nec , --nonec |
NEClassification=(yes|y|on|no|n|off) |
Whether to perform NE classification.
Command line | Configuration file |
--fnec <filename> |
NECFilePrefix=<filename> |
Prefix to find files for Named Entity Classifier configuration.
The searched files will be the given prefix with the following extensions:
See section 2.11 for details.
Command line | Configuration file |
--sense <string> |
SenseAnnotation=<string> |
Kind of sense annotation to perform, if any. If active, the PoS tag selected by the tagger for each word is enriched with a list of all its possible WN1.6 synsets.
Command line | Configuration file |
--fsense <filename> |
SenseFile=<filename> |
Word sense data file. It is a Berkeley DB indexed file. See section 2.12 for details.
Command line | Configuration file |
--dup , --nodup |
DuplicateAnalysis=(yes|y|on|no|n|off) |
When this option is set, the senses annotator will duplicate the analysis once for each of its possible senses.
For instance, analyzing the sentence el gato come pescado with the --sense all and --nodup options would enrich each analysis of each word with a list of all possible senses for that lemma and part of speech.
Form | Lemma | Tag | Prob | Senses |
el | el | DA0MS0 | 1.0 | - |
gato | gato | NCMS000 | 1.0 | 01630731:07221232:01631653 |
come | comer | VMIP3S0 | 0.75 | 00794578:00793267 |
come | comer | VMM02S0 | 0.25 | 00794578:00793267 |
pescado | pescado | NCMS000 | 0.84 | 05810856:02006311 |
pescado | pescar | VMP00SM | 0.16 | 00491793:00775186 |
Alternatively, if we use the --dup option, each analysis is duplicated as many times as it has possible senses, so that each analysis has only one sense:
Form | Lemma | Tag | Prob | Senses |
el | el | DA0MS0 | 1.0 | - |
gato | gato | NCMS000 | 0.33 | 01630731 |
gato | gato | NCMS000 | 0.33 | 07221232 |
gato | gato | NCMS000 | 0.33 | 01631653 |
come | comer | VMIP3S0 | 0.375 | 00794578 |
come | comer | VMIP3S0 | 0.375 | 00793267 |
come | comer | VMM02S0 | 0.125 | 00794578 |
come | comer | VMM02S0 | 0.125 | 00793267 |
pescado | pescado | NCMS000 | 0.42 | 05810856 |
pescado | pescado | NCMS000 | 0.42 | 02006311 |
pescado | pescar | VMP00SM | 0.08 | 00491793 |
pescado | pescar | VMP00SM | 0.08 | 00775186 |
This may be useful if one wants to perform WSD, or to use the sense field in the analysis in the constraint grammar (see section 2.15).
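The duplication shown in the tables above can be sketched as follows. The even split of each analysis probability among its senses is inferred from the figures in the tables (for example 0.75 becoming 0.375 twice), not from the library source, and the tuple layout and function name are hypothetical.

```python
def duplicate_by_sense(analyses):
    """Expand each (lemma, tag, prob, senses) analysis into one per sense."""
    expanded = []
    for lemma, tag, prob, senses in analyses:
        if not senses:
            # No senses to duplicate over: keep the analysis unchanged.
            expanded.append((lemma, tag, prob, senses))
            continue
        # Assumption taken from the tables: probability is shared evenly.
        share = prob / len(senses)
        for sense in senses:
            expanded.append((lemma, tag, share, [sense]))
    return expanded


# Analyses of the form "come", as in the first table.
come = [
    ("comer", "VMIP3S0", 0.75, ["00794578", "00793267"]),
    ("comer", "VMM02S0", 0.25, ["00794578", "00793267"]),
]
for analysis in duplicate_by_sense(come):
    print(analysis)
```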
Command line | Configuration file |
-M <filename> , --fpunct <filename> |
PunctuationFile=<filename> |
Punctuation symbols file. See section 2.13 for details.
Command line | Configuration file |
-T <string> , --tag <string> |
Tagger=<string> |
Algorithm to use for PoS tagging.
Command line | Configuration file |
-H <filename> , --hmm <filename> |
TaggerHMMFile=<filename> |
Parameters file for HMM tagger. See section 2.14 for details.
Command line | Configuration file |
--iter <int> |
TaggerRelaxMaxIter=<int> |
Maximum number of iterations to perform in case relaxation does not converge.
Command line | Configuration file |
--sf <float> |
TaggerRelaxScaleFactor=<float> |
Scale factor to normalize supports inside the RL algorithm. It is comparable to the step length in a hill-climbing algorithm: the larger the scale factor, the smaller the step.
Command line | Configuration file |
--eps <float> |
TaggerRelaxEpsilon=<float> |
Real value used to determine when a relaxation labelling iteration has produced no significant changes. The algorithm stops when no weight has changed above the specified epsilon.
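The interplay between TaggerRelaxMaxIter and TaggerRelaxEpsilon can be sketched as a generic iteration loop. This is not the relaxation labelling update itself: the `update` callback, the function name `relax`, and the toy update used in the example are all stand-ins chosen for illustration.

```python
def relax(weights, update, max_iter, epsilon):
    """Iterate `update` until no weight changes by more than epsilon.

    Returns (weights, iterations_done, converged).
    """
    for iteration in range(1, max_iter + 1):
        new_weights = update(weights)
        biggest_change = max(abs(n - o) for n, o in zip(new_weights, weights))
        weights = new_weights
        if biggest_change <= epsilon:
            # No weight changed above epsilon: stop early.
            return weights, iteration, True
    # Iteration limit reached without convergence.
    return weights, max_iter, False


def toy_update(weights, target=(1.0, 0.0)):
    """Stand-in update: move each weight halfway towards a fixed target."""
    return [w + (t - w) / 2 for w, t in zip(weights, target)]


print(relax([0.5, 0.5], toy_update, max_iter=500, epsilon=0.01))
print(relax([0.5, 0.5], toy_update, max_iter=3, epsilon=0.01))
```

In the second call the iteration limit is reached first, which is the situation the maximum-iterations option guards against.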
Command line | Configuration file |
-R <filename> |
TaggerRelaxFile=<filename> |
File containing the constraints to apply to solve the PoS tagging. See section 2.15 for details.
Command line | Configuration file |
--retk , --noretk |
TaggerRetokenize=(yes|y|on|no|n|off) |
Determine whether the tagger must perform retokenization after the appropriate analysis has been selected for each word. This is closely related to suffix analysis, see section 2.7 for details.
Command line | Configuration file |
--force <string> |
TaggerForceSelect=(none,tagger,retok) |
Determine whether, and at which point, the tagger must be forced to make a unique choice (possibly at random).
Command line | Configuration file |
-G <filename> , --grammar <filename> |
GrammarFile=<filename> |
This file contains a CFG grammar for the chart parser, and some directives to control which chart edges are selected to build the final tree. See section 2.16 for details.
Command line | Configuration file |
-J <filename> , --dep <filename> |
HeuristicsFile=<filename> |
Heuristic rules used to perform dependency analysis. See section 2.17 for details.
2008-01-24