The file is divided into three sections: <Macros>, <RegExps>, and <Abbreviations>.
Each section is closed by the corresponding </Macros>, </RegExps>, or </Abbreviations> tag.
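As a rough illustration of this layout, the following Python sketch splits a file into its sections. The sample contents are assembled from the examples in this section, and the function name read_sections is hypothetical; it is not the tokenizer's actual loader.

```python
import re

# Hypothetical sample file, assembled from the examples in this section.
SAMPLE = r"""
<Macros>
ALPHA [A-Za-z]
</Macros>
<RegExps>
*ABREVIATIONS1 0 (({ALPHA}+\.)+)(?!\.\.)
</RegExps>
<Abbreviations>
etc.
mrs.
</Abbreviations>
"""

def read_sections(text):
    """Return a dict mapping each section name to its non-empty lines."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        closing = re.fullmatch(r"</(\w+)>", line)
        opening = re.fullmatch(r"<(\w+)>", line)
        if closing:
            current = None                 # section ends
        elif opening:
            current = opening.group(1)     # section begins
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections

sections = read_sections(SAMPLE)
print(sorted(sections))  # ['Abbreviations', 'Macros', 'RegExps']
```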
The <Macros> section allows the user to define regexp macros that will be used later in the rules. Each macro is defined by a name and a Perl regexp.
E.g. ALPHA [A-Za-z]
The <RegExps> section defines the tokenization rules. Previously defined macros may be referred to by their name in curly brackets.
E.g. *ABREVIATIONS1 0 (({ALPHA}+\.)+)(?!\.\.)
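Macro references can be pictured as plain textual substitution performed before the regexp is compiled. The Python sketch below assumes exactly that; the helper expand_macros is hypothetical, and Python's re module stands in for Perl regexps, which behave the same way for this pattern.

```python
import re

def expand_macros(pattern, macros):
    """Replace each {NAME} that names a known macro with its definition."""
    def substitute(match):
        name = match.group(1)
        # Unknown names (e.g. a quantifier such as {2}) are left untouched.
        return macros.get(name, match.group(0))
    return re.sub(r"\{(\w+)\}", substitute, pattern)

macros = {"ALPHA": "[A-Za-z]"}
rule = r"(({ALPHA}+\.)+)(?!\.\.)"
expanded = expand_macros(rule, macros)
print(expanded)  # (([A-Za-z]+\.)+)(?!\.\.)
```

With the macro expanded, the pattern matches a sequence such as "U.S.A." at the start of a line, as long as it is not followed by two more dots.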
Rules are regular expressions, and are applied in the order of definition. The first rule matching the beginning of the line is applied, a token is built, and the rest of the rules are ignored. The process is repeated until the line has been completely processed.
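If a rule name starts with *, the rule only produces a token if the match is found in the abbreviation list (<Abbreviations> section).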
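The first-match procedure described above can be sketched in Python as follows. The rule names and patterns are hypothetical, and two details are assumptions of this sketch rather than documented behavior: whitespace between tokens is skipped, and a character that no rule matches becomes a token of its own so that the loop always advances.

```python
import re

# Hypothetical rules (name, regexp), listed in order of definition.
RULES = [
    ("NUMBER", re.compile(r"\d+(\.\d+)?")),
    ("WORD", re.compile(r"[A-Za-z]+")),
    ("OTHER", re.compile(r"\S")),
]

def tokenize(line):
    tokens = []
    pos = 0
    while pos < len(line):
        # Assumption: whitespace between tokens is skipped.
        if line[pos].isspace():
            pos += 1
            continue
        for name, regexp in RULES:
            # match() anchors the regexp at the current position,
            # i.e. at the beginning of the unprocessed part of the line.
            match = regexp.match(line, pos)
            if match and match.end() > pos:
                tokens.append((name, match.group(0)))
                pos = match.end()
                break  # the rest of the rules are ignored
        else:
            # Assumption: no rule matched, so emit one character to make progress.
            tokens.append(("UNMATCHED", line[pos]))
            pos += 1
    return tokens

print(tokenize("Pi is 3.14!"))
# [('WORD', 'Pi'), ('WORD', 'is'), ('NUMBER', '3.14'), ('OTHER', '!')]
```

Because rules are tried in order, placing NUMBER before OTHER is what keeps "3.14" together as a single token.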
The <Abbreviations> section defines common abbreviations (one per line) that must not be separated from the dot that follows them (e.g. etc., mrs.). Entries must be written in lowercase.
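The effect of the abbreviation list can be illustrated with a small Python sketch. It assumes that the candidate text is lowercased before being looked up, which would explain why entries must be lowercase; the function split_trailing_dot is hypothetical.

```python
# Entries as they would appear in the <Abbreviations> section.
ABBREVIATIONS = {"etc.", "mrs."}

def split_trailing_dot(word):
    """Keep the final dot attached only for listed abbreviations."""
    # Assumption: the lookup is done on the lowercased form of the word.
    if word.lower() in ABBREVIATIONS:
        return [word]
    if word.endswith(".") and len(word) > 1:
        return [word[:-1], "."]
    return [word]

print(split_trailing_dot("Mrs."))    # ['Mrs.']
print(split_trailing_dot("house."))  # ['house', '.']
```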
2008-01-24