Tokenizer rules file

The file is divided into three sections: <Macros>, <RegExps>, and <Abbreviations>. Each section is closed by the corresponding tag: </Macros>, </RegExps>, or </Abbreviations>.
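As an illustration of this layout, here is a minimal Python sketch that splits such a file into its sections. It is not the tokenizer's own loader. The function name read_sections and the sample contents are hypothetical, and the sketch assumes that each tag sits alone on its line and that abbreviation entries include their dot.

```python
import re

def read_sections(text):
    """Split a rules file into {section name: list of non-empty lines}."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        opening = re.fullmatch(r"<(\w+)>", line)
        closing = re.fullmatch(r"</(\w+)>", line)
        if opening:
            # Start collecting lines for a new section.
            current = opening.group(1)
            sections[current] = []
        elif closing:
            # A closing tag must match the section that is currently open.
            if closing.group(1) != current:
                raise ValueError(f"unexpected closing tag: {line}")
            current = None
        elif current is not None:
            sections[current].append(line)
    return sections

# Hypothetical sample file built from the examples in this document.
SAMPLE = r"""<Macros>
ALPHA [A-Za-z]
</Macros>
<RegExps>
*ABREVIATIONS1 0 (({ALPHA}+\.)+)(?!\.\.)
</RegExps>
<Abbreviations>
etc.
mrs.
</Abbreviations>
"""

sections = read_sections(SAMPLE)
print(list(sections))
```

Lines that appear outside any section are silently dropped in this sketch; a stricter reader could reject them instead.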

The <Macros> section allows the user to define regexp macros that will be used later in the rules. Macros are defined with a name and a Perl regexp.
E.g. ALPHA [A-Za-z]

The <RegExps> section defines the tokenization rules. A previously defined macro may be referenced by writing its name in curly brackets.
E.g. *ABREVIATIONS1 0 (({ALPHA}+\.)+)(?!\.\.)
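The macro reference can be pictured as a plain textual substitution performed before the rule is compiled. The Python sketch below shows one way to do that, not necessarily how the tokenizer does it. The helper expand_macros is hypothetical, and the sketch assumes that macro names consist of word characters and that the braces are written unescaped.

```python
import re

def expand_macros(pattern, macros):
    """Replace each {NAME} that names a known macro with its regexp."""
    def substitute(match):
        name = match.group(1)
        # Unknown names are left alone, so a quantifier such as {2} survives.
        return macros.get(name, match.group(0))
    return re.sub(r"\{(\w+)\}", substitute, pattern)

macros = {"ALPHA": "[A-Za-z]"}
rule = r"(({ALPHA}+\.)+)(?!\.\.)"
expanded = expand_macros(rule, macros)
print(expanded)
```

With the ALPHA macro above, the expanded rule matches a run of dotted letter groups such as "U.S.A." as long as it is not followed by two more dots.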

Rules are regular expressions and are tried in the order in which they are defined. The first rule that matches at the beginning of the line is applied: a token is built from the match, and the remaining rules are ignored. The process is then repeated on the rest of the line until the whole line has been processed.
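The first-match loop described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the tokenizer's implementation: the rule names ACRONYM, WORD and OTHER are hypothetical, the whole match is taken as the token, whitespace between tokens is skipped, and a character that no rule matches is emitted on its own so that the loop always advances.

```python
import re

def tokenize(line, rules):
    """Tokenize `line` with an ordered list of (name, compiled regexp) rules."""
    tokens = []
    pos = 0
    while pos < len(line):
        # Assumption: whitespace separates tokens and is not itself a token.
        if line[pos].isspace():
            pos += 1
            continue
        for name, regexp in rules:
            match = regexp.match(line, pos)
            if match and match.end() > pos:
                tokens.append((name, match.group(0)))
                pos = match.end()
                break  # first matching rule wins; later rules are ignored
        else:
            # Assumption: no rule matched, so emit one character and move on.
            tokens.append(("UNMATCHED", line[pos]))
            pos += 1
    return tokens

# Hypothetical rule set, listed from most specific to most general.
RULES = [
    ("ACRONYM", re.compile(r"([A-Za-z]+\.)+(?!\.\.)")),
    ("WORD", re.compile(r"[A-Za-z]+")),
    ("OTHER", re.compile(r"\S")),
]

result = tokenize("U.S.A. rules, ok", RULES)
print(result)
```

Because the order of definition decides which rule applies, the more specific ACRONYM rule is listed before the general WORD rule; swapping them would split "U.S.A." into separate letters and dots.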

The <Abbreviations> section lists common abbreviations, one per line, that must not be separated from the dot that follows them (e.g. etc., mrs.). They must be written in lowercase.
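A possible reading of the lowercase requirement is that tokens are lowercased before being looked up in the list. The Python sketch below illustrates that idea only. The function split_trailing_dot is hypothetical, and the sketch assumes that list entries are stored with their final dot.

```python
def split_trailing_dot(token, abbreviations):
    """Keep a listed abbreviation whole; otherwise split off a final dot."""
    if token.endswith(".") and len(token) > 1:
        # Entries are lowercase, so the token is lowercased for the lookup.
        if token.lower() in abbreviations:
            return [token]
        return [token[:-1], "."]
    return [token]

# Assumption: entries include the dot, as in the examples "etc." and "mrs.".
ABBREVIATIONS = {"etc.", "mrs."}

print(split_trailing_dot("Mrs.", ABBREVIATIONS))
print(split_trailing_dot("house.", ABBREVIATIONS))
```

In this sketch "Mrs." stays a single token because its lowercase form is listed, while "house." is split into the word and a separate dot.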

2008-01-24